MXPA05012638A - Virtual representations of nucleotide sequences
- Publication number
- MXPA05012638A
- Authority
- MX
- Mexico
- Prior art keywords
- genome
- nucleic acid
- word
- acid molecules
- character
Abstract
The invention provides oligonucleotide probes that can be used to hybridize to a representation of nucleic acid sequences. Compositions containing the probes such as microarrays are also provided. The invention also provides methods of using these probes and compositions in therapeutic, diagnostic, and research applications. Systems and methods for using a word counting algorithm that can quickly and accurately count the number of times a particular string of characters (i.e., nucleotides) appears in a nucleotide sequence (e.g., a genome) are provided. This algorithm can be used to identify the oligonucleotide probes of the invention. The algorithm uses a transform of a genome and an auxiliary data structure to count the number of times a particular word occurs in the genome.
Description
VIRTUAL REPRESENTATIONS OF NUCLEOTIDE SEQUENCES
FIELD OF THE INVENTION
This invention relates generally to molecular biology. More specifically, this invention relates to materials and methods for generating nucleotide sequences that are representative of a given DNA source (e.g., a genome).
P05/086/CSHL
BACKGROUND OF THE INVENTION
Global methods for genetic analysis have provided useful insights into the pathophysiology of cancer and other diseases or conditions with a genetic component. Such methods include karyotyping, determination of ploidy, comparative genomic hybridization (CGH), representational difference analysis (RDA) (see, for example, US Pat. No. 5,436,142) and analysis of genomic representations (WO 99/23256, published May 14, 1999). Generally, these methods involve either using probes to interrogate the expression of particular genes or examining changes in the genome itself. With oligonucleotide arrays, these methods can be used to obtain a high-resolution global picture of the genetic changes in cells. However, these methods require knowledge of the sequences of the particular probes. This is particularly limiting for cDNA arrays, because such arrays interrogate only a limited set of genes. It is also limiting for genome-wide screening, because many oligonucleotides designed for an array may not be represented in the interrogated population, resulting in inefficient or ineffective analysis.
SUMMARY OF THE INVENTION
This invention provides compositions and methods useful for interrogating populations of nucleic acid molecules. These compositions and methods can be used to analyze complex genomes (e.g., mammalian genomes), optionally in conjunction with microarray technology.

This invention features a plurality of at least 100 nucleic acid molecules, (A) wherein (a) each of the nucleic acid molecules hybridizes specifically to a sequence in a genome of at least Z base pairs, and (b) at least P% of the plurality of nucleic acid molecules have (i) a length of at least K nucleotides; (ii) specific hybridization to at least one nucleic acid molecule present in, or predicted to be present in, a representation derived from the genome, the representation having no more than R% of the complexity of the genome; and (iii) no more than X exact matches of L1 nucleotides to the genome (or the representation) and no fewer than Y exact matches of L1 nucleotides to the genome (or the representation); and (B) wherein (a) $Z \geq 1 \times 10^{8}$; (b) $300 \geq K \geq 30$; (c) $70 \geq R \geq 0.001$; (d) $P \geq 90 - R$; (e) the integer closest to $\log_4(Z) + 2$ is $\geq L_1 \geq$ the integer closest to $\log_4(Z)$; (f) $X$ is the integer closest to $D_1 \times (K - L_1 + 1)$; (g) $Y$ is the integer closest to $D_2 \times (K - L_1 + 1)$; (h) $1.5 \geq D_1 \geq 1$; and (i) $1 > D_2 \geq 0.5$.

In some additional embodiments, (1) the plurality of nucleic acid molecules comprises at least 500; 1,000; 2,500; 5,000; 10,000; 25,000; 50,000; 85,000; 190,000; 350,000 or 550,000 nucleic acid molecules; (2) Z is at least $3 \times 10^{8}$, $1 \times 10^{9}$, $1 \times 10^{10}$ or $1 \times 10^{11}$; (3) R is 0.001, 1, 2, 4, 10, 15, 20, 30, 40, 50 or 70; (4) P is independent of R and is at least 70, 80, 90, 95, 97 or 99; (5) D1 is 1; (6) L1 is 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24; (7) P is 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100; and/or (8) K is 40, 50, 60, 70, 80, 90, 100, 110, 120, 140, 160, 180, 200 or 250.

In some embodiments, a nucleic acid molecule that hybridizes specifically to another nucleic acid molecule has at least 90% sequence identity with a sequence of the same length in the other nucleic acid molecule. In further embodiments, the molecules have at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99% or 100% sequence identity.

In some additional embodiments, each of the P% of the plurality of nucleic acid molecules further has no more than A exact matches of L2 nucleotides to the genome and no fewer than B exact matches of L2 nucleotides to the genome, wherein (a) $L_1 > L_2 \geq$ the integer closest to $\log_4(Z) - 3$; (b) $A$ is the integer closest to $D_3 \times ((K - L_2 + 1) \times (Z / 4^{L_2}))$; (c) $B$ is the integer closest to $D_4 \times ((K - L_2 + 1) \times (Z / 4^{L_2}))$; (d) $4 \geq D_3 \geq 1$; and (e) $1 > D_4 \geq 0.5$.

A representation of a DNA population can be produced by sequence-specific cleavage of the genome, for example, with a restriction endonuclease. A representation can also be derived from another representation; that is, the resulting representation is a compound representation.

The nucleic acid molecules of this invention can be identified by a method comprising: (a) cleaving the genome in silico with a restriction enzyme to generate a plurality of predicted nucleic acid molecules; (b) generating a virtual representation of the genome by identifying the predicted nucleic acid molecules each having a length of 200-1,200 base pairs, inclusive, the virtual representation having a complexity of 0.001%-70%, inclusive, of the genome; (c) selecting an oligonucleotide having a length of 30-300 nucleotides, inclusive, and at least 90% sequence identity with a predicted nucleic acid molecule of (b); (d) calculating the complexity of the virtual representation relative to the genome; (e) identifying all the stretches of L1 nucleotides that occur in the oligonucleotide; and (f) confirming that the number of times each stretch occurs in the genome satisfies a set of predetermined requirements.

The nucleic acid molecules of this invention can be used as probes to analyze a DNA sample. These probes can be immobilized on the surface of a solid phase, including a semi-solid surface. Solid phases include, but are not limited to, nylon membranes, nitrocellulose membranes, glass slides and microspheres (e.g., paramagnetic microbeads). In some embodiments, the positions of the nucleic acid molecules on the solid phase are known, for example, as in a microarray format.

The invention also features a method for analyzing a nucleic acid sample (e.g., a genomic representation), the method comprising (a) hybridizing the sample to the nucleic acid probes of this invention and (b) determining to which of the plurality of nucleic acid molecules the sample hybridizes.

This invention also features a method for analyzing copy number variation of a genomic sequence between two genomes, the method comprising: (a) providing two detectably labeled representations, each prepared from the respective genome with at least one identical restriction enzyme; (b) contacting the two representations with the nucleic acid probes of this invention to allow hybridization between the representations and the probes; and (c) analyzing the levels of hybridization of the two representations to the probes, wherein a difference in the levels with respect to a member of the probe set indicates a copy number variation between the two genomes with respect to the genomic sequence targeted by that member. In some embodiments, the representations are distinguishably labeled and/or the two representations are contacted with the probes simultaneously.

This invention also features a method for comparing the methylation status of a genomic sequence between two genomes. The method involves providing two detectably labeled representations of the respective genomes, each representation being prepared by a methylation-sensitive method. For example, a first representation of a first genome is prepared using a first restriction enzyme and a second representation of a second genome is prepared using a second restriction enzyme, wherein the first and second restriction enzymes recognize the same restriction site, but one is methylation-sensitive and the other is not. Sequences containing methyl-C can also be chemically cleaved after a representation is made with a methylation-insensitive restriction enzyme, so that a representation derived from a methylated genome is distinguishable from a representation derived from an unmethylated genome. The two representations are then contacted with the probes of this invention to allow hybridization between the representations and the probes. The hybridization of the two representations to the probes is then analyzed, wherein a difference in the levels of hybridization between the representations with respect to a particular probe indicates a difference in methylation status between the two genomes with respect to the genomic sequence targeted by that probe. Similar methods can also be used to analyze polymorphisms in a complex genome, as illustrated further below.

According to certain embodiments of the invention, an algorithm is provided to accurately and efficiently detect and count the number of times a word appears in a genome. This algorithm, sometimes referred to herein as a search engine or meric engine, uses a transform of a genome (e.g., a Burrows-Wheeler Transform) and an auxiliary data structure to count the number of times a particular word appears in the genome. A "word" refers to a nucleotide sequence of a defined length. In general, the engine searches for a particular word by first finding the last character of the word. It then looks for the character immediately preceding the last character. If that preceding character is found, it looks for the second character preceding the last character of the word, and so on until the whole word is found. If a preceding character is not found, it is concluded that the word does not exist in the genome. If the first character of the word is found, then the number of times it is found is the word count of that particular word. This algorithm is advantageous because it can be used to implement several practical applications involving genomic studies, as discussed below.

Other features and advantages of the invention will be apparent from the following drawings, the detailed description and the claims.
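The last-character-first search described in the summary corresponds to what is commonly called backward search over a Burrows-Wheeler Transform. The following is only a minimal sketch of that general technique, not the implementation of the invention: the function names (`build_index`, `count_word`, `constituent_word_counts`), the `$` sentinel, the naive suffix sort and the full occurrence table used as the auxiliary data structure are all assumptions chosen for brevity, and would not scale to a real genome.

```python
def build_index(genome):
    """Build a toy BWT index (assumed layout, for illustration only)."""
    text = genome + "$"  # '$' is a sentinel that sorts before A, C, G, T
    # Naive suffix sort: acceptable for a demo, far too slow for a genome.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that sort strictly before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i]; this is the auxiliary structure.
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(bwt)


def count_word(index, word):
    """Count (possibly overlapping) occurrences of a non-empty word."""
    if not word:
        raise ValueError("word must be non-empty")
    first, occ, n = index
    lo, hi = 0, n  # half-open range of sorted suffixes still matching
    for ch in reversed(word):  # last character first, then each predecessor
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:  # a preceding character was not found
            return 0
    return hi - lo


def constituent_word_counts(index, probe, word_len):
    """Word count of every stretch of word_len nucleotides in a probe."""
    return [count_word(index, probe[i:i + word_len])
            for i in range(len(probe) - word_len + 1)]


index = build_index("GATTACAGATTACA")
print(count_word(index, "GATTACA"))                  # 2
print(constituent_word_counts(index, "GATTACA", 3))  # [2, 2, 2, 2, 2]
```

A probe of length K has K - L1 + 1 constituent words of length L1, which is why the toy 7-nucleotide probe above yields five counts for a word length of 3. How those counts are compared against the limits X and Y is left out here, since that depends on the requirements chosen for a given embodiment.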
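Steps (a), (b) and (d) of the identification method summarized above (in silico cleavage, size selection into a virtual representation, and calculation of complexity) can be illustrated with a short sketch. Everything named here is hypothetical: the helper functions, the toy sequence and the small size window in the usage lines are for illustration only, and complexity is assumed to mean the fraction of genome length covered by the selected fragments. The default recognition site is that of BglII (AGATCT, cut after the first A); because the site is palindromic, scanning one strand is sufficient.

```python
def cut_positions(genome, site, cut_offset):
    """Positions at which the enzyme cuts, scanning one strand."""
    positions = []
    pos = genome.find(site)
    while pos != -1:
        positions.append(pos + cut_offset)
        pos = genome.find(site, pos + 1)
    return positions


def digest(genome, site="AGATCT", cut_offset=1):
    """Predicted fragments bounded by a cut on both ends.

    The two terminal pieces of a linear sequence are dropped because
    they carry only one restriction end (an assumption of this sketch).
    """
    cuts = cut_positions(genome, site, cut_offset)
    return [genome[a:b] for a, b in zip(cuts, cuts[1:])]


def virtual_representation(fragments, min_len=200, max_len=1200):
    """Keep fragments whose length lies in the inclusive size window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def complexity_percent(representation, genome_length):
    """Assumed definition: percent of genome length in the representation."""
    return 100.0 * sum(len(f) for f in representation) / genome_length


toy_genome = "CCAGATCTGGGGAGATCTTTAGATCTAA"
fragments = digest(toy_genome)
representation = virtual_representation(fragments, min_len=9, max_len=20)
print(fragments)       # ['GATCTGGGGA', 'GATCTTTA']
print(representation)  # ['GATCTGGGGA']
```

With a real genome the default window of 200-1,200 base pairs would apply, and candidate oligonucleotides for step (c) would then be drawn from the fragments that remain in the representation.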
BRIEF DESCRIPTION OF THE DRAWINGS Figures 1A-1D demonstrate the predictability of computing and the accuracy of array measurements, using microarrays comprising 10,000 oligonucleotides. Figure 1A shows the results where the hybridized samples are a representation of BgJZ.II and a representation of BgJZ.II depleted of fragments with a HindlII cleavage site. The Y axis (Mean Ratio) is the average ratio measured of two hybridizations of the exhausted representation to the normal representation, plotted on a logarithmic scale. The X-axis (index) is a false index constructed so that the probes that are derived from the fragments defined as having an internal HindlII site are on the right side. Figure IB shows the reproducibility of the experiments
P05 / 086 / CSHL duplicates used to generate the average relationship in Figure IA. The Y axis (Experiment 1 relationship) is the measured relationship of experiment 1 and the X axis (Experiment 2 relationship) is the measured relationship of experiment 2. Both axes are plotted on a logarithmic scale. Figure 1C graphs the normalized relationship on the Y axis as a function of the intensity of the sample that was not depleted on the X axis. Both the relationship and the intensity were plotted on a logarithmic scale. Figure ID represents the data generated by the simulation. The X axis
(index) is a false index. The probes, in groups of 600, detect the number of copies that increase, from left to right. 600 flanking probes detect the number of normal copies. The Y axis (Mean Ratio) is the average ratio plotted on a logarithmic scale. Figures 2A1-2A3, 2B1-2B3 and 2C1-2C3 show the genomic profiles for a sample of a primary breast cancer (CHTN159), with aneuploid nuclei, compared to the diploid nuclei of the same patient (Figures 2A1-2A3), a breast cancer cell line compared to a normal male reference (Figures 2B1-2B3) and an abnormal male reference with a normal male reference (Figures 2C1-2C3), using the printed 10K array (Figure 2A1, Figure 2B1, Figure 2C1 ) and the 85K photoprinted array (Figure 2A2, Figure 2B2, Figure 2C2).
In each case (Figure 2A1, Figure 2B1, Figure 2C1 and Figure 2A2, Figure 2B2, Figure 2C2), the Y axis is the mean ratio and the X axis (Genomic Index) is an index that plots the probes in genomic order, concatenating the chromosomes and allowing visualization of the entire genome from chromosome 1 to Y. Figure 2A3, Figure 2B3 and Figure 2C3 show the correspondence of the ratios measured for "sister" probes present in both the 10K and 85K microarrays. The Y axis is the measured ratio from the 10K microarray and the X axis is the measured ratio from the 85K microarray. Figures 3A-3D show several chromosomes with varying copy number fluctuations from the analysis of the tumor cell line SK-BR-3 compared with the normal reference. The Y axis (Mean Ratio) represents the mean ratio of two hybridizations on a logarithmic scale. The X axis (Genomic Index) is an index of the genomic coordinates. Figure 3A shows the copy number fluctuations identified for chromosome 5, Figure 3B for chromosome 8, Figure 3C for chromosome 17 and Figure 3D for chromosome X. Figures 4A-4D show the mean segmentation calculated from the analysis of SK-BR-3 compared with the normal reference (Figure 4A and Figure 4B) and of CHTN159 (Figure 4C and Figure 4D). In Figures 4A-4D, the Y axis is the
mean segment value for each probe on a logarithmic scale. In Figure 4A and Figure 4C, the X axis (Mean Segment Index) lists each probe in ascending order of its assigned mean segment value. In Figure 4B and Figure 4D, the X axis (Genomic Index) is a genomic index which, as described above, places the entire genome end to end. A copy number lattice, extrapolated from the array data using the formulas within the text, is plotted on top of the mean segment data (horizontal lines). The copy number calculated for each horizontal line is shown to the right of the lattice. Figures 5A-5D plot on the Y axis (Mean Ratio of SK-BR-3) the mean ratio of two SK-BR-3 hybridizations compared to a normal reference, on a logarithmic scale. The X axis (Genomic Index) is a genomic index. Figure 5A shows a region of the X chromosome containing a region of loss. The calculated segmentation value is plotted over the measured array ratio. Figure 5B shows a region of chromosome 8 (c-myc is located to the right of the center of the graph) from the SK-BR-3 results compared to a normal reference. Segmentation values are plotted on top of the data, for SK-BR-3 compared to the normal reference in diagonal stripes
and the segmentation values for the primary tumor CHTN159 in vertical lines. Figure 5C shows a lesion on chromosome 5, demonstrating the resolving power of the 85K array compared to that of the 10K array. The results are from SK-BR-3 compared to a normal reference. The open circles are from the 10K printed microarray and the filled circles are from the 85K photoprinted array. The horizontal lines are copy number estimates based on modeling of the mean segment values. Figure 5D shows the comparison of SK-BR-3 with the normal reference, showing a region of homozygous deletion on chromosome 19. The mean segment value is plotted as a white line, and the lattice is the estimated copy number, as described above. Figure 6A shows the results of a normal-to-normal comparison, identical to those shown in Figure 2C2, except that the singlet probes have been filtered out, as described in the text. Figure 6B illustrates the serial comparison of experiments for a small region of chromosome 4. The Y axis is the mean ratio on a logarithmic scale. The X axis is a genomic index. The filled (85K) and open (10K) circles are from the comparison of SK-BR-3 with the normal reference. The open triangles are from a comparison of a
pygmy with the normal reference. Figure 6C illustrates a lesion found in the normal population on chromosome 6. The filled circles plot the mean ratio for the analysis of the pygmy compared with the normal reference. The line with vertical stripes is the mean segment value for the comparison of the pygmy with the normal reference. The line with diagonal stripes is the mean segment value for the comparison of SK-BR-3 with the normal reference. The line with crossed stripes is the mean segment value for the comparison of the primary tumor (CHTN159 aneuploid to diploid). Figure 6D shows a region of chromosome 2. The data shown as circles are from the comparison of SK-BR-3 with the normal reference. The mean segment line for this comparison is shown in vertical stripes. The mean segment line for the comparison of a pygmy with the normal reference is shown in diagonal stripes and that for the primary tumor CHTN159 in crossed stripes. For Figure 6C and Figure 6D, the copy number calculated for the horizontal lines is shown to the right of the panel. Figure 7 shows a block diagram of an illustrative system according to certain embodiments of the invention. Figure 8 shows a flowchart of an illustrative preprocessing step to perform the
exact word count, according to certain embodiments of the invention. Figures 9A and 9B show a flowchart of an illustrative word-counting algorithm, according to certain embodiments of the invention. Figures 10A and 10B show an illustrative example of the word-counting algorithm of Figures 9A and 9B, according to certain embodiments of the invention. Figure 11 shows an illustrative suffix array having coordinate positions that correspond to the coordinates of the genome, according to certain embodiments of the invention. Figure 12A shows a graphic representation of the variables and data structures used in connection with the algorithm, according to certain embodiments of the invention. Figure 12B shows a pseudocode representation of the algorithm, according to certain embodiments of the invention.
DETAILED DESCRIPTION OF THE INVENTION This invention features oligonucleotide probes for analyzing representations of a population of DNA (e.g., a genome, a chromosome or
a mixture of DNA). The oligonucleotide probes can be used in solution or can be immobilized on a solid (including semisolid) surface such as an array or a microbead (for example, Lechner et al., Curr. Opin. Chem. Biol. 6: 31-38 (2001); Kwok, Annu. Rev. Genomics Human Genet. 2: 235-58 (2001); Aebersold et al., Nature 422: 198-207 (2003) and US Patents 6,355,431 and 6,429,027). A representation is a reproducible sampling of a population of DNA in which the resulting DNA typically has a new or reduced complexity or both (Lisitsyn et al., Science 258: 946-51 (1993); Lucito et al., Proc. Natl. Acad. Sci. USA 92: 151-5 (1998)). For example, a representation of a genome may consist of DNA sequences that come from only a small portion of the genome and that are largely free of repetitive sequences. The analysis of genomic representations can reveal changes in the genome, including mutations such as deletions, amplifications, chromosomal rearrangements and polymorphisms. When done in a clinical setting, the analysis can provide an understanding of the molecular basis of a disease, as well as useful guidance for its diagnosis and treatment. The oligonucleotide compositions of this invention can be used to hybridize to the
representations of a primary DNA, wherein the hybridization data are processed to provide genetic profiles of the primary DNA (e.g., disease-related genetic lesions and polymorphisms). It may be preferred that the representations (or "test representations" hereafter) and at least a fraction of the oligonucleotide probes in the compositions are derived from the same species. DNA of any species can be used, including mammalian species (e.g., pig, mouse, rat, primate (e.g., human), dog and cat), fish species, reptile species, plant species and microorganism species.
I. OLIGONUCLEOTIDE PROBES The oligonucleotide probes of this invention are preferably designed by virtual representation of a primary DNA, such as the genomic DNA of a reference individual. Representation of a genome generally, but not invariably, results in a simplification of its complexity. The complexity of a representation corresponds to the fraction of the genome that is represented in it. One way to calculate complexity is to divide the number of nucleotides in the representation by the number of nucleotides in the genome.
The genomic complexity of a representation can vary from below 1% to as high as 95% of the total genome. Where the DNA comes from an organism with a relatively simple genome, the representation can have a complexity of 100% of the total genome; for example, the representation can be generated by restriction digestion of the total DNA without amplification. The representations associated with the invention typically have a complexity of between 0.001% and 70%. The reduction in complexity allows for desirable hybridization kinetics. A "real" representation of DNA involves laboratory procedures ("wet work") by which the DNA of the representation is selected. Virtual representations, on the other hand, take advantage of the fact that complete genomes, for example, the human genome, have been sequenced. Through computational analysis of the available genomic sequences, one can easily design a large number of oligonucleotide probes that hybridize to mapped regions of the genome and have a minimum degree of sequence overlap with the rest of the genome. As an example, to design a set of oligonucleotide probes for human genetic analysis,
one can perform an in silico (i.e., virtual) digestion of the human genome by locating all the cleavage sites of a selected restriction endonuclease in the sequenced genome. One can then analyze the resulting fragments to identify those that fall within a desired size range (e.g., 200-1,200 bp, 100-400 bp or 400-600 bp) and can be amplified by, e.g., PCR. Such fragments are defined herein as "predicted to be present" in a representation. A restriction endonuclease may be selected based on the desired complexity of the representation. For example, restriction endonucleases that cut infrequently, such as those that recognize 6 bp or 8 bp target sequences, will produce representations of lower complexity, whereas restriction endonucleases that cut frequently, such as those that recognize 4 bp target sequences, will produce representations of higher complexity. In addition, factors such as the G/C content of the genome analyzed will affect the cleavage frequency of particular restriction endonucleases and, consequently, will influence the selection of restriction endonucleases. Generally, robust restriction endonucleases that do not exhibit star activity are used. Alternatively, cleavage based on the methylation state of a target site can also be employed, for example, through the use of a methylation-sensitive restriction enzyme or another enzyme such as McrBC, which recognizes methylated cytosines in DNA. The sequences of all digested fragments in a desired size range (e.g., 200-1,200 bp, 100-400 bp or 400-600 bp) are analyzed by computer, and regions of those fragments that are at least about 30 bp in length and have minimal homology to the rest of the genome can be selected as oligonucleotide probes representative of the human genome. Example 1 and Section VI below further illustrate methods for identifying the oligonucleotides of this invention. The oligonucleotides of the invention may vary in length from about 30 nucleotides to about 1,200 nucleotides. The exact length of the oligonucleotides chosen will depend on the intended use, for example, the size of the primary DNA for which the representation is prepared and whether they are used as components in an array. Oligonucleotides typically have a length of at least 35 nucleotides, for example, at least 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or 100 nucleotides, but may also be shorter, having
a length of, for example, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 or 30 nucleotides. The oligonucleotides typically have a length of no more than 600 nucleotides, for example, no more than 550, 500, 450, 400, 350, 300, 250, 200 or 150 nucleotides. As will be recognized by one skilled in the art, the length of the oligonucleotides will depend on the characteristics of the genome analyzed, for example, the complexity and number of the repetitive sequences.
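As a rough illustration of the in silico digestion described in this section, the following Python sketch locates every occurrence of a recognition site in a sequence, keeps the fragments bounded by two cleavage sites whose length falls within a chosen range, and computes the complexity of the resulting virtual representation as the fraction of genome nucleotides it covers. The function names, the cut-offset convention and the size limits are hypothetical and are not taken from the invention; the only enzyme detail assumed is that BglII recognizes AGATCT and cleaves after the first A.

```python
import re


def virtual_digest(genome: str, site: str, cut_offset: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) coordinates of fragments bounded by two cut sites.

    cut_offset is the position of the cut within the recognition site
    (hypothetical convention: 1 for BglII, A^GATCT).
    """
    # A lookahead pattern reports every occurrence, including overlapping ones.
    pattern = re.compile(f"(?={re.escape(site)})")
    cuts = [m.start() + cut_offset for m in pattern.finditer(genome)]
    # Only internal fragments have a cleaved end on both sides.
    return list(zip(cuts, cuts[1:]))


def predicted_fragments(fragments, min_len: int, max_len: int):
    """Keep fragments whose length falls in the amplifiable size range."""
    return [(a, b) for a, b in fragments if min_len <= b - a <= max_len]


def complexity(fragments, genome_length: int) -> float:
    """Nucleotides in the representation divided by nucleotides in the genome."""
    return sum(b - a for a, b in fragments) / genome_length
```

For a real genome, each chromosome would be digested separately and the size window (for example, 200-1,200 bp) would be chosen to match the amplification conditions; probe selection within the retained fragments is a separate step.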
II. OLIGONUCLEOTIDE ARRAYS The oligonucleotide probes of this invention can be used in an array format. An array comprises a solid support with nucleic acid probes attached to it at defined coordinates or addresses. Each address contains either many copies of a single DNA probe or a mixture of different DNA probes. Nucleic acid arrays, also referred to as "microarrays" or "microplates", have been generally described in the art. See, for example, U.S. Patent 6,361,947 and references cited therein. We have named genomic analyses using the new arrays "representational oligonucleotide microarray analysis" ("ROMA") or, where cleavage depends on methylation at the target site, "methylation detection oligonucleotide microarray analysis" ("MOMA"). To make a microarray of this invention, previously synthesized oligonucleotides are attached to a solid support, which can be made of glass, plastic (e.g., polypropylene or nylon), polyacrylamide, nitrocellulose or other materials, and can be porous or non-porous. One method for attaching the nucleic acids to a surface is printing on glass plates, as generally described by Schena et al., Science 270: 467-70 (1995); DeRisi et al., Nature Gen. 14: 457-60 (1996); Shalon et al., Genome Res. 6: 639-45 (1996) and Schena et al., Proc. Natl. Acad. Sci. USA 93: 10539-1286 (1995). For low density arrays, one can also use dot blots on a nylon hybridization membrane. See, for example, Sambrook et al., Molecular Cloning - A Laboratory Manual (2nd Ed.), Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 1989. Another method for making microarrays is to use photolithographic techniques (or "photoprinting") to synthesize the oligonucleotides directly on the array substrate, i.e., in situ. See, for example, Fodor et al., Science 251: 767-73 (1991); Pease et
al., Proc. Natl. Acad. Sci. USA 91: 5022-6 (1994); Lipschutz et al., Nat. Genet. 21 (1 Suppl): 20-46 (1999); Nuwaysir et al., Genome Res. 12 (11): 1749-55 (2002); Albert et al., Nucl. Acids Res. 31 (7): e35 (2003) and United States Patents 5,578,832, 5,556,752 and 5,510,270. Other methods can also be used for the rapid synthesis and deposition of defined oligonucleotides. See, for example, Blanchard et al., Biosensors & Bioelectronics 11: 687-90 (1996) and Maskos and Southern, Nucl. Acids Res. 20: 1679-1684 (1992). The arrays of the invention typically comprise at least 100 (e.g., at least 500, 1,000, 5,000 or 10,000) oligonucleotide probes, and may comprise many more probes, e.g., up to 25,000, 50,000, 75,000, 85,000, 100,000, 200,000, 250,000, 500,000 or 700,000 probes. The arrays of the invention typically do not comprise more than 700,000 probes. However, they can comprise more, for example, up to 800,000, 900,000 or 1,000,000 probes. In some embodiments, the arrays are high density arrays, with densities greater than approximately 60 different probes per 1 cm2. The oligonucleotides in the arrays can be single-stranded or double-stranded. To facilitate the manufacture and use of arrays, the oligonucleotide probes of this invention can be modified by, for example, the
incorporation of peptidyl structures and nucleotide analogs into the probes.
III. TEST REPRESENTATIONS The oligonucleotide arrays of this invention can be used to probe any nucleic acid sample of choice. For example, the sample can be a cDNA library, a genomic DNA library or an RNA preparation. In other embodiments, the arrays of this invention are used to probe DNA samples that are representations (or "test representations") of a complex population of DNA, such as the genome of a higher organism. Representations and methods for their preparation are described in, for example, Lisitsyn et al., Proc. Natl. Acad. Sci. USA 92: 151 (1995); Lucito et al., Proc. Natl. Acad. Sci. USA 95: 4487-4492 (1998) and WO 99/23256. One method for making a representation involves reproducibly cleaving a population of DNA into fragments. Reproducible cleavage is generally achieved by digestion with one or more restriction endonucleases (e.g., DpnII or BglII) or enzymes that cleave at particular methylated sites (e.g., McrBC), but any method that reproducibly cleaves the DNA can be used. The DNA fragments
are ligated to adapter oligonucleotides. These fragments are then amplified by, for example, polymerase chain reaction ("PCR") or ligase chain reaction, using primers complementary to the adapters. The amplified fragments represent a subset of the starting DNA population. Because of the amplification step, representations can be made from very small amounts of starting material (for example, 5 ng of DNA). Representational difference analysis ("RDA"), as described in Lisitsyn et al., Science 258: 946-51 (1993) and United States Patents 5,436,142 and 5,501,964, can be used to eliminate any known, unwanted sequences from the representation, including repetitive sequences. The starting DNA population can be large DNA molecules, such as the genome of an organism or a part thereof (e.g., a chromosome or a region thereof). We refer to the representations of such DNA populations as genomic or chromosomal representations, respectively. Starting DNA populations can be obtained from, for example, diseased tissue samples such as tumor biopsy samples, normal tissue samples, tumor cell lines,
normal cell lines, cells stored as fixed specimens, autopsy samples, forensic samples, paleo-DNA samples, microdissected tissue samples, isolated nuclei, isolated chromosomes or regions of chromosomes and samples of fractionated cells or tissues. One can also make a representation of a representation (or a "composite representation"). Composite representations are useful for screening for polymorphisms. See, for example, WO 99/23256. For comparative analyses of representations of two primary DNAs, such as a comparison of a genomic representation of a normal cell with a genomic representation of a cancerous or otherwise diseased cell, it may be preferable to prepare the two representations in parallel, for example, to isolate the starting DNA from the two cells at the same time and in the same way, prepare the representations from the same amount of starting DNA and amplify the DNA fragments at the same time under the same conditions in the same thermal cycler. It may also be preferred that the normal cell and the diseased cell be taken from the same individual, although it is possible to obtain a "normal" genomic DNA by combining, for example, the DNA of both parents of the individual. The complexity of a representation is
generally lower than that of the starting DNA population, because there are sequences present in the starting population that are not present in the representation. The complexity of a representation is related to the cutting frequency of the restriction endonuclease in a particular starting population. A more frequent cutter gives rise to a more complex representation. Because fragments of between 200 and 1,200 base pairs are preferentially amplified by PCR under typical conditions, one can obtain high complexity representations by cleaving the starting DNA so that most fragments are between 200 and 1,200 base pairs. Conversely, low complexity representations can be obtained by cleaving the DNA so that fewer of the fragments are between 200 and 1,200 base pairs. For example, digestion of human genomic DNA with DpnII can result in a representation that has approximately 70% of the complexity of the entire human genome. Digestion with a less frequent cutter such as BamHI or BglII, on the other hand, can result in a representation that has only about 2% of the complexity of the human genome. High complexity representations are useful, for example, for determining gene copy number, mapping deletions, determining loss
of heterozygosity, comparative genomic hybridization and archiving DNA. Generally, low complexity representations are useful for the same purposes, but have better hybridization kinetics than high complexity representations. The complexity of a representation can be further refined by using more than one restriction enzyme to generate the fragments before ligation of the adapters and/or by using one or more additional restriction enzymes to cleave a subset of the fragments after ligation of the adapters, decreasing the abundance of those fragments in the resulting representation. Any restriction enzyme, including methylation-sensitive restriction enzymes, can be used to produce a representation for the assays described herein. The complexity of the representation can also be adjusted through the choice of the adapters used for the amplification. For example, the adapters used can influence the sizes of the members of a representation. Where identical adapters are ligated to both ends of the cleaved fragments, panhandle formation between the adapters within single strands competes with primer annealing, thus inhibiting amplification by PCR. See
Lukyanov et al., Anal. Biochem. 229: 198-202 (1995). Amplification of shorter fragments is more likely to be inhibited, because the adapters are closer to each other in shorter fragments, resulting in a higher effective local concentration of the ligated adapters and, therefore, more interaction. Adapters that form panhandles of approximately 29 base pairs allow the amplification of fragments in the size range of 200-1,200 base pairs. Adapters that form shorter panhandles, for example, of 24 base pairs, release some of the inhibition of the smaller fragments, favoring smaller PCR amplification products and, therefore, yielding a representation of altered complexity.
IV. HYBRIDIZATION OF NUCLEIC ACIDS TO THE ARRAYS The microarrays of this invention are typically hybridized to single-stranded nucleic acid samples in solution. Because the potential hybridization signal can vary from one address to another in the hybridization chamber, the probe array is preferably used as a comparator, measuring the ratio of hybridization between two differently labeled specimens (the sample) that are mixed thoroughly and
therefore share the same hybridization conditions. Typically, the two specimens will be from test (for example, diseased) and control (for example, disease-free) cells, respectively. The samples to be hybridized to the microarrays, for example, the test representations described above, can be detectably labeled by any means known to one of skill in the art. In some embodiments, the sample is labeled with a fluorescent moiety by, for example, random primer labeling or nick translation. When the sample is a representation, it can be labeled during the amplification step by including labeled nucleotides in the reaction. The fluorescent label can be, for example, a lissamine-conjugated nucleotide or a fluorescein-conjugated nucleotide analog. In some embodiments, two differently labeled samples are used (for example, one labeled with lissamine and the other with fluorescein). In some embodiments, the samples are unlabeled. The hybridization and washing conditions are chosen so that the nucleic acid molecules in the sample bind specifically to the complementary oligonucleotides on the array. Arrays containing double-stranded oligonucleotides are generally subjected to
denaturing conditions to render the oligonucleotides single-stranded before they are contacted with the sample. Optimal hybridization conditions will depend on the length and type (e.g., RNA or DNA) of the oligonucleotide probe and of the nucleic acids in the sample. Hybridization to an array of the invention can be detected by any method known to those skilled in the art. In some embodiments, the hybridization of fluorescently labeled sample nucleotides is detected by a laser scanner. In some embodiments, hybridization of labeled or unlabeled sample nucleotides is detected by measuring their mass. When two different fluorescent labels are used, the scanner can be one that is capable of detecting fluorescence at more than one wavelength, each wavelength corresponding to that of one fluorescent label, typically simultaneously or almost simultaneously.
V. EXEMPLARY USES FOR OLIGONUCLEOTIDE PROBES The oligonucleotide probes of the invention can be used to detect and quantify changes in the copy number or the methylation status of specific sequences in a genome. Where the
representations derived from a plurality of DNA samples are hybridized to the same oligonucleotide probes, the relative intensity of hybridization of two samples to a particular probe is indicative of the relative copy number or the methylation status of the sequence corresponding to that probe in the two samples. Genomes, for example, typically contain additional copies of certain sequences due to amplification, or fewer copies or no copies of certain sequences due to the deletion of specific regions. These methods can be used, for example, to analyze changes in the copy number or the methylation status of sequences between a reference sample and samples from a patient, where the amplification, deletion or methylation status of specific sequences is involved in, for example, the predisposition, progression or staging of specific diseases, including, for example, cancer, neurological diseases (e.g., autism), diabetes, heart disease and inflammatory diseases (e.g., autoimmune diseases). In addition, positional information on the alteration of the copy number or the methylation state in a genome can be obtained, because the sequences in the genome to which the oligonucleotide probes of the
invention are complementary are known. Where the oligonucleotide probes are designed to hybridize at frequent intervals along the genome sequence and the sample is a highly complex representation, it is possible to accurately map the regions of amplification, deletion or altered methylation status in the genome. Thus, the invention can be used to identify individual genes that may be involved in the predisposition, progression or staging of specific diseases. These genes can be oncogenes and tumor suppressor genes, depending on whether the sequence is amplified, deleted or methylated/unmethylated in a cancer genome, relative to a reference genome, respectively. The oligonucleotide probes of the invention can also be used to identify polymorphic sites, including single nucleotide polymorphisms (SNPs), both within an individual and between individuals. These polymorphisms are common, and as many as 2-3% of oligonucleotide probes show polymorphic behavior even among "normal" individuals. Detectable polymorphisms can result from the loss or gain of restriction endonuclease fragments, for example, due to point mutations, deletions, genomic rearrangements or gene conversions that extend over heterozygous polymorphisms, and are reflected in the presence or absence of those fragments in a representation. For example, digestion of a nucleotide sequence with a restriction enzyme can result in one large (i.e., uncleaved) fragment or two small fragments, depending on whether a restriction site is present. The polymorphic restriction site is known to exist in a test genome if the oligonucleotide probes detect one or both of the small fragments in the test representation. Similarly, genomic rearrangements, including translocations, insertions, inversions and deletions, may result in the creation of new restriction endonuclease fragments that span at least part of the rearrangement. Some of these new fragments may be amplifiable and, therefore, present in a representation of a rearranged genome but absent from a reference representation. Conversely, genomic rearrangements can result in the loss of a fragment from a representation. In either case, a difference between the test and reference representations at certain probes suggests that genomic rearrangements may have occurred in the test genome, relative to the reference genome.
By analyzing the sequences of these probes and their locations in the reference genome, one can obtain information on genetic rearrangements, including the type and location of the rearrangements. The ability to analyze the copy number and other polymorphisms of specific sequences within and between individuals has many uses that will be apparent to someone skilled in the art. These include, without limitation, identification of individuals, for example, for forensic tests and paternity tests; breeding of plants or animals; discovery of polymorphisms that are genetically linked to an inherited trait, including the analysis of quantitative traits; determination of the response to drugs in a patient, including the prediction of a beneficial or adverse response to a drug; diagnosis; and patient identification and stratification in clinical trials.
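The ratio-based comparison described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: each probe has one test intensity and one reference intensity per hybridization, replicate hybridizations are averaged on a log2 scale (a geometric mean of the ratios), and the threshold used to label a probe is a hypothetical value chosen for the example, not one given in the text.

```python
import math


def mean_log_ratio(test_intensities, ref_intensities):
    """Average log2(test / reference) over replicate hybridizations of one probe."""
    logs = [math.log2(t / r) for t, r in zip(test_intensities, ref_intensities)]
    return sum(logs) / len(logs)


def call_probe(log_ratio, threshold=0.5):
    """Label a probe from its mean log2 ratio (threshold is hypothetical)."""
    if log_ratio >= threshold:
        return "gain"
    if log_ratio <= -threshold:
        return "loss"
    return "no change"
```

Because the two samples share the same hybridization conditions, address-to-address variation in signal largely cancels in the ratio, which is why the ratio rather than the raw intensity is compared between probes.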
VI. AN EXEMPLARY SEARCH ENGINE The following describes an algorithm that can be used to obtain the oligonucleotide probes mentioned above. It will be understood that the following description is not intended to suggest that this algorithm is the only means of obtaining such probes. It will also be understood that this algorithm has applications other than the generation of the oligonucleotide probes of this invention. Some of those other applications are described here. This algorithm, sometimes referred to herein as a search engine or an engine, uses a transform of a genome (for example, a Burrows-Wheeler transform) and an auxiliary data structure to count the number of times a particular word appears in the genome. A "word" refers to a nucleotide sequence of any length. In general, the engine searches for a particular word by first finding the last character of the word. It then proceeds to look for the character immediately preceding the last character. If that immediately preceding character is found, it then looks for the second character preceding the last character in the word, and so on until the word is found. If the preceding characters are not found, it is concluded that the word does not exist in the genome. This particular algorithm is advantageous because it can be used to implement several practical applications involving genomic studies, as discussed below. One application of the search engine is the analysis of a nucleotide sequence such as a genome. In particular, the genome can be analyzed using substrings of a particular length that exist within the genome. The search engine can then count the number of times a substring of a particular length occurs in the genome. These counts provide an indication of the uniqueness of a particular substring, where lower counts represent a higher degree of uniqueness than higher counts. The design of probes is another practical application that benefits from the search engine. The ability of the engine to quickly count the number of times a particular word appears in a genome is particularly useful for designing probes that are unique and that hybridize to a specific region of the DNA with minimal cross-hybridization. By using the search engine, potential cross-hybridization can be minimized by requiring a probe to be
P05 / 086 / CSHL comprised of constituent segments that are unique and that meet certain stringent conditions, such as having low word counts or not having word counts within the entire genome. Yet another application of the search engine is to detect the differences between two genomes. For example, as the human genome project progresses, the map of new segments of the genome is drawn and made available to the public. Using the search engine and the probes that were designed in another version of the same genome, it can be determined how many of these probes can be applied to the new version of the genome. Even another application in which the search engine can be used is to verify if a particular word exists in the genome. It may be desirable to find words that do not appear in the genome, so there is little chance that the word will hybridize to a section of the genome. These words can be generated randomly according to a predefined set of criteria. When a word is found, its complement is also presented to the search engine to determine if it appears in the genome. If both the word and its complement do not appear in the genome, it is known that both of these words will hybridize with one another and not with the genome.
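The last application above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the names `reverse_complement` and `find_absent_words` are hypothetical, a plain substring test stands in for the search engine, and "complement" is read here as the reverse complement of the word.

```python
import random

# Watson-Crick pairing used to build the reverse complement (assumed reading of "complement").
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a nucleotide word."""
    return word.translate(COMPLEMENT)[::-1]


def find_absent_words(genome, length, wanted, seed=0, max_tries=10000):
    """Draw random words and keep those absent from the genome together with
    their reverse complement. A substring test stands in for the engine."""
    rng = random.Random(seed)
    found = []
    for _ in range(max_tries):
        if len(found) == wanted:
            break
        word = "".join(rng.choice("ACGT") for _ in range(length))
        if word in found:
            continue
        if word not in genome and reverse_complement(word) not in genome:
            found.append(word)
    return found


words = find_absent_words("AGACAGTCAT", length=4, wanted=3)
print(words)
```

In practice the substring test would be replaced by a word count of zero returned by the engine, and the random generator would be constrained by whatever predefined criteria apply.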
A. DESCRIPTION OF THE SYSTEM

The search engine and its applications can be implemented in accordance with the present invention using the illustrative system 700 shown in Figure 7. The system 700 may include a computer 710, user interface equipment 730, the Internet 740 and optional laboratory equipment (not shown). The system 700 may include multiple computers 710 and multiple units of user interface equipment 730, but only one of each is illustrated in Figure 7 to avoid complicating the drawing. The computer 710 is shown connected to the user interface equipment 730 and the Internet 740 via communication paths 790.

The computer 710 may include circuitry such as a processor 712, a database 714 (e.g., a hard disk drive), a memory 716 (e.g., random access memory) and a removable media drive 718 (e.g., a floppy disk drive, a CD-ROM drive or a DVD drive). This circuitry can be used to transmit data to, from and/or between the user interface equipment 730 and the Internet 740. The computer 710 can initiate the techniques of the invention in response to user input from the user interface equipment 730.
The computer 710 may also provide information to the user at the user interface equipment 730 regarding the results obtained from operating the search engine.

The database 714 stores information that provides the search engine with data. More particularly, the database 714 may include the sequence of a genome or of a particular portion of a genome. The invention can use the genome information stored in the database 714 to build a suffix array, which can also be stored in the database 714. The suffix array is a data structure that is generated in preparation for building a transform of a genome or of a portion thereof. The data representing a genome can be obtained, for example, from a readable medium (for example, a floppy disk, a CD-ROM or a DVD), which can be accessed through the removable media drive 718. Alternatively, genome data can be obtained through the Internet 740, where the data is transmitted from a server located, for example, at a research facility (for example, the National Institutes of Health or a university). If desired, the database 714 can be updated with new genome data as it becomes available.
Generally, the amount of data representing the suffix array is much greater than the amount of data representing the genome. Therefore, the database 714 may be more suitable for storing the suffix array than the memory 716, because databases readily store more data than memory.

The user interface equipment 730 allows a user to enter commands to the computer 710 via an input device 732. The input device 732 can be any suitable device such as a conventional keyboard, a wireless keyboard, a mouse, a touch pad, a trackball, a voice-activated console or any combination of such devices. The input device 732 can, for example, allow a user to enter commands to perform a word count or to perform a statistical analysis of potential probes. A user can monitor the processes operating in the system 700 on a display device 734. The display device 734 can be a computer monitor, a television, a flat panel display, a liquid crystal display, a cathode-ray tube (CRT) or any other suitable display device.
The communication paths 790 can be any suitable communication paths, such as a cable link, a hard-wired link, an optical fiber link, an infrared link, a ribbon wire link, a Bluetooth link, an analog communications link, a digital communications link or any combination of such links. The communication paths 790 are configured to allow the transfer of data between the computer 710, the user interface equipment 730 and the Internet 740.

Laboratory equipment can be provided in the system 700, so that the results obtained with the search engine can be applied directly to experiments and vice versa.

One advantage of the search engine is that the techniques for counting exact word matches can take place entirely within the memory (e.g., memory 716) of the computer. This provides extremely fast and efficient interrogation of the genome for exact word matches. There is no need to access the database (for example, a hard drive); such a need could substantially hinder the performance of the search engine. The techniques used to count exact word matches are 100% accurate.
B. SUFFIX ARRAY, BURROWS-WHEELER TRANSFORM AND ALPHABETIC LIMITS

Referring to Figure 8, an illustrative flow chart 800 shows the steps for preparing a genome to be used in the search engine in accordance with the principles of the present invention. The flow chart 800 uses techniques for building a suffix array data structure, which provides the basis for generating a transform of a particular genome. The transform in turn provides a basis for the search engine of this invention, which can quickly count the number of occurrences of a particular word (for example, a word having a length of 15, 21, 70 or 80 characters).

In step 810, a nucleotide sequence such as a genome or a portion of a genome is provided. The genome can be arranged as a string of characters having a length of N nucleotides, where N represents the total number of nucleotides in the string of characters that represents the genome. The genome provided in step 810 can be derived from any organism or can be generated randomly. For example, the entire known human genome can be provided, or a portion of the human genome can be provided (for example, a portion of the genome that represents a chromosome or a region of a chromosome). If desired, data from a non-human genome can be provided, such as the genomes of viruses, bacteria, and single-cell or multi-cell organisms, including yeasts, plants and animals such as lizards, fish and mammals (e.g., mice, rats and non-human primates).

In step 820, the genome is subjected to a transformation process that rearranges the order of the nucleotides of the genome according to a predetermined lexicographic order. The transform keeps the same constituent letters (for example, A, C, G and T) that appear in the genome, but these letters are arranged in a different order. In one embodiment of the invention, the genome is subjected to a known transform, called the Burrows-Wheeler transform. The Burrows-Wheeler transform can be obtained from a suffix array. According to this invention, a suffix array is an N x N matrix that represents all the cyclic permutations of the genome, wherein the permutations are arranged according to a predetermined criterion (for example, alphabetical, numerical, etc.). Advantageously, the Burrows-Wheeler transform represents the sorted N x N matrix of the cyclic permutations. Thus, when the search engine of the present invention searches through the Burrows-Wheeler transform, it is, by extension, searching through the suffix array and, by further extension, through the original string that represents the genome.

Genome sequence assemblies may include an ambiguous character in addition to A, C, G and T, thus extending the genome alphabet to five characters. This ambiguous character, commonly referred to as N, is typically used when the nucleotide at a particular position of a nucleic acid sequence is unknown.

Because the Burrows-Wheeler transform represents the sorted suffix array, there is no need to access the suffix array when searching for a particular string of characters. Preferably, the transform is stored in memory, where the search functions can be executed much faster than when the transform is stored on a hard disk. In addition, because the amount of data contained in a suffix array can be substantial, the suffix array may have to be stored on a hard drive, as opposed to a faster memory (for example, the random access memory of a computer). For example, the size of a suffix array for the human genome is on the order of twelve gigabytes. If such an array were to be stored in memory, a machine having twelve gigabytes of memory would be much more expensive than a machine having, for example, three gigabytes of memory. Therefore, one advantage of the search engine is that it does not require expensive, memory-intensive machines, because the transform represents a condensed version of the sorted suffix array.

Although the suffix array is not necessary to perform word searches in accordance with this invention, it is useful to describe how such arrays are obtained, in order to show the relationship between the transform and the array. The suffix array can be constructed by first obtaining the cyclic permutations of a nucleotide sequence. For example, Table 1 illustrates the cyclic permutations of the genome "AGACAGTCAT$", where "$" is provided to mark the end of the genome string.
AGACAGTCAT$
GACAGTCAT$A
ACAGTCAT$AG
CAGTCAT$AGA
AGTCAT$AGAC
GTCAT$AGACA
TCAT$AGACAG
CAT$AGACAGT
AT$AGACAGTC
T$AGACAGTCA
$AGACAGTCAT
TABLE 1

After the cyclic permutations are obtained, the rows are sorted according to a predetermined criterion, to obtain a particular lexicographic order (for example, alphabetical order). For example, Table 2 shows the permutations of Table 1 in alphabetical order, under the heading for the sorted array.
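The rotation and sorting steps just described can be sketched in a few lines of Python. This is a minimal illustration, not the construction used for a full genome: the function names are hypothetical, and the sketch relies on "$" sorting before A, C, G and T in ASCII. The last character of each sorted row is also printed, since that column is what the text goes on to call the transform.

```python
def cyclic_permutations(genome):
    """All rotations of the string, in the order listed in Table 1."""
    return [genome[i:] + genome[:i] for i in range(len(genome))]


def sorted_permutations(genome):
    """Rotations in alphabetical order; '$' sorts before the nucleotide letters."""
    return sorted(cyclic_permutations(genome))


def last_column(rows):
    """Concatenate the last character of every row."""
    return "".join(row[-1] for row in rows)


genome = "AGACAGTCAT$"
rows = sorted_permutations(genome)
for index, row in enumerate(rows):
    print(index, row, "->", row[-1])
print(last_column(rows))  # TG$CCATAAAG
```

Materializing every rotation takes space quadratic in the genome length, so this only works for toy inputs such as the eleven-character example.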
Row  Sorted array      Transform
0    $AGACAGTCAT   ->  T
1    ACAGTCAT$AG   ->  G
2    AGACAGTCAT$   ->  $
3    AGTCAT$AGAC   ->  C
4    AT$AGACAGTC   ->  C
5    CAGTCAT$AGA   ->  A
6    CAT$AGACAGT   ->  T
7    GACAGTCAT$A   ->  A
8    GTCAT$AGACA   ->  A
9    T$AGACAGTCA   ->  A
10   TCAT$AGACAG   ->  G
TABLE 2

Once the cyclic permutations are sorted, the transform of the genome can be obtained by taking the last letter of each row of the sorted array. These letters are reproduced under the "Transform" column heading, indicating that the transform of the genome "AGACAGTCAT$" is "TG$CCATAAAG".

In one embodiment, the suffix array of a genome such as the human genome can be constructed using a parallel radix sort on a 16-node cluster. Using this procedure, the genome is divided into X substrings of equal size (for example, 100), each overlapping by seven nucleotides, X being a predetermined number. Offsets into the genome (i.e., genome coordinates) within each substring are assigned to one of 5^7 "prefix" bins according to the 7-mer (seven nucleotides) at each offset. The offsets within each bin are then sorted based on the sequence following the 7-mer prefix, thus creating the suffix array.

In step 830, several statistics are calculated to generate an auxiliary data structure, which may include an alphabetic limits data structure, a K-interval data structure and a dictionary counts data structure.

The alphabetic limits indicate how many adenine, cytosine, guanine and thymine nucleotides are in the transform. For example, using the genome of Tables 1 and 2, the alphabetic limits for A, C, G and T are 4, 2, 2 and 2, respectively. The alphabetic limits can be used to delimit the intervals in the transform that correspond to the particular characters at the front of each row of the sorted suffix array. For example, the delimited range for nucleotide A includes each row of the suffix array beginning with A. Table 2 shows that rows 1-4 of the sorted array start with A. Thus, these four rows correspond to the alphabetic limits calculated for A. Table 2 also shows that rows 5-6 start with C, which corresponds to the alphabetic limits calculated for C. Likewise, block G corresponds to rows 7 and 8, and block T to rows 9 and 10 of the transform.

Step 830 may also generate K intervals, one for every K characters of the transform, where K is a predetermined number. The K intervals can be used to maintain a running total of each nucleotide as it appears in the transform. These K intervals can be used by the search engine of the present invention to accelerate the counting process, which is discussed below with reference to Figures 3 and 4. Specifically, the use of the K intervals allows the search engine to run faster and use less space than conventional word counting techniques, especially when applied to nucleotide sequences greater than four million characters in length.

The following example further explains how a transform is tabulated using the K intervals. Assume that the transform has the ten characters ACGTCAGTCA, and that the K intervals are stored every five characters. At the first interval, the K interval includes one A, two C, one G and one T. At the second interval (i.e., at the tenth character), the K interval includes a tabulation of all the nucleotides that have appeared in the transform up to that point. The second K interval therefore includes three A, three C, two G and two T.

In step 840, the Burrows-Wheeler string is compressed according to a predetermined compression ratio. Preferably, the string is compressed using a compression ratio of 3 to 1. That is, every three characters are compressed to one character (for example, 3000 characters are condensed to 1000 characters). Those skilled in the art will appreciate that other compression ratios can be used. For example, a four-to-one or five-to-one compression may be employed.

The string can be compressed using a dictionary-based compression scheme, where one of 125 single-byte codes represents each of the 5^3 possible three-letter substrings (for example, AAA, AAC, ..., TTT). More specifically, the transform is divided into three-character substrings and each substring is compressed according to the dictionary-based compression scheme. For example, if a three-character substring is AAA, it may be equivalent to byte 0 of the dictionary compression scheme. Similarly, if the substring is TTT, it may be equivalent to byte 124 of the dictionary compression scheme.

The dictionary counts data structure can be generated to assist the search engine in the counting process by providing a fast lookup table that quickly identifies the number of times a particular letter appears in a compressed byte. This is advantageous because it allows the search engine to perform counting operations on the transform while it is in its compressed state. Note, however, that a byte may have to be decompressed in order for the search engine to finish counting the number of times a particular letter appears within a search region. On average, it has been found that one byte of the compressed transform is decompressed two thirds of the time during the character counting step performed by the search engine.

Once the transform is compressed, it is ready to be used in the search engine of the present invention. In particular, the compressed Burrows-Wheeler transform can be interrogated to locate and count each occurrence of a particular word contained within the genome.
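To make the alphabetic limits and K intervals of step 830 concrete, the following Python sketch recomputes the two worked examples from the text. It is a sketch under stated assumptions: the function names are hypothetical, limits are reported as inclusive (first row, last row) pairs, and `count_before` scans forward from the nearest earlier K interval only, whereas the engine described here may count toward whichever multiple of K is nearer.

```python
from collections import Counter


def alphabetic_limits(transform, alphabet="ACGT", end_marker="$"):
    """Inclusive (first_row, last_row) of the block of rows starting with each letter."""
    counts = Counter(transform)
    start = counts[end_marker]  # rows starting with the end marker sort first
    limits = {}
    for letter in alphabet:
        limits[letter] = (start, start + counts[letter] - 1)
        start += counts[letter]
    return limits


def k_interval_counts(transform, k, alphabet="ACGT"):
    """Running totals of each letter, recorded after every k characters."""
    running = dict.fromkeys(alphabet, 0)
    checkpoints = []
    for position, letter in enumerate(transform, start=1):
        if letter in running:
            running[letter] += 1
        if position % k == 0:
            checkpoints.append(dict(running))
    return checkpoints


def count_before(transform, checkpoints, k, letter, position):
    """Occurrences of letter in transform[:position], starting from a stored total."""
    block = position // k
    base = checkpoints[block - 1][letter] if block else 0
    return base + transform.count(letter, block * k, position)


print(alphabetic_limits("TG$CCATAAAG"))
checkpoints = k_interval_counts("ACGTCAGTCA", 5)
print(checkpoints)
print(count_before("ACGTCAGTCA", checkpoints, 5, "C", 8))  # 2
```

Because each query starts from a stored running total, at most K characters are scanned regardless of the length of the transform.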
C. WORD COUNTING ALGORITHM

Figure 9 shows a simplified flow chart of illustrative steps for counting the number of times a particular word exists in a given genome, according to the principles of the engine. Starting at step 910, a compressed transform of the genome and an auxiliary data structure are provided. The compressed transform and the auxiliary data structure can be obtained, for example, from the flow chart illustrated in Figure 8.
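As one way to picture the compressed transform and the dictionary counts that step 910 takes as input, the Python sketch below packs three characters into one byte and counts letters without decompressing. Several details are assumptions made for illustration: the alphabet order "ACGNT" (chosen because it maps AAA to byte 0 and TTT to byte 124, as in the text), the hypothetical names, the requirement that the input length be a multiple of three, and the omission of the "$" end marker.

```python
from itertools import product

ALPHABET = "ACGNT"  # assumed order; gives AAA -> 0 and TTT -> 124

# The 5^3 = 125 three-letter substrings; a substring's list index is its byte code.
TRIPLETS = ["".join(letters) for letters in product(ALPHABET, repeat=3)]
CODE_OF = {triplet: code for code, triplet in enumerate(TRIPLETS)}

# Dictionary counts: 125 rows (one per byte code) by 5 columns (one per letter).
DICTIONARY_COUNTS = [[triplet.count(letter) for letter in ALPHABET]
                     for triplet in TRIPLETS]


def compress(text):
    """Pack every three characters into one byte (3:1 compression)."""
    if len(text) % 3:
        raise ValueError("this sketch expects a length that is a multiple of 3")
    return bytes(CODE_OF[text[i:i + 3]] for i in range(0, len(text), 3))


def decompress(data):
    """Expand each byte back into its three-letter substring."""
    return "".join(TRIPLETS[byte] for byte in data)


def count_letter(data, letter):
    """Count a letter in compressed data using only table lookups."""
    column = ALPHABET.index(letter)
    return sum(DICTIONARY_COUNTS[byte][column] for byte in data)


packed = compress("ACGTCAGTC")
print(len(packed), decompress(packed), count_letter(packed, "C"))
```

A search region rarely starts and ends on a byte boundary, which is why the text notes that a boundary byte may still have to be decompressed.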
In step 914, a query pattern of a particular length is provided (e.g., ACG...G). The pattern is preferably a string of nucleotides that the search engine looks for in the transformed genome. After the query pattern is provided, the search engine begins an iterative search process to determine whether the pattern exists. If the pattern exists, the engine quickly and accurately outputs the number of times it appears.

In step 918, the iterative process begins by defining (or redefining) a search region, which delimits a range of character positions within the transform. The search region delineates a block of characters that starts at position X and ends at position Y of the compressed transform. This search region (or block) potentially contains all occurrences of the query pattern. The search region is defined using predefined criteria such as a particular character of the query pattern, the alphabetic limits and other data. How the search region is defined is explained in more detail in the description that accompanies Figure 10.

In step 920, the process determines how many times the next preceding character of the query pattern appears in the search region. In step 922, if the count of the preceding character is zero, the query pattern does not exist and the process ends (step 924). If there is at least one such character within the delimited range, the process proceeds to step 926.

In step 926, it is determined whether the preceding character is the first character of the query pattern. If so, the process proceeds to step 928, where the count obtained in step 920 is output and the process ends. If the preceding character is not the first character of the query pattern, the process returns to step 918, because it has not yet been determined whether the query pattern exists in the genome.

In step 918, the search region is redefined using a predetermined criterion. More particularly, the search region is redefined using the following equations 1 and 2:

Start position = A + Z (1)
End position = Start position + M - 1 (2)

where A is the start position of the preceding character according to the alphabetic limits, Z represents the number of times the preceding character appears in the transform before the currently defined search region, and M represents the number of times the preceding character appears in the currently defined search region. The redefined search region also potentially contains all occurrences of the query pattern, but the newly defined search region limits the character positions that need to be searched in step 920.

After the new search region is defined, the process continues to step 920, wherein the next preceding character of the query pattern (i.e., the character preceding the character used in the previous step 920) is counted within the newly defined search region. This cycle can be repeated as many times as necessary until the first character of the query pattern is found and, consequently, the word count is obtained. If one of the preceding characters is not found in a search region, it is concluded that the pattern does not exist in the genome.

Figures 10A-B illustrate an example of the above word counting algorithm. This example uses the illustrative genome (AGACAGTCAT$), the suffix array, the Burrows-Wheeler transform (TG$CCATAAAG) and the alphabetic limits previously described in relation to Tables 1 and 2. In this example, assume that a user wants to determine how many times the word "CAG" appears in the genome.

In Figure 10A, the process begins by delimiting block G, because G is the last letter of the word "CAG". As illustrated, block G starts at position 7 and ends at position 8 of the Burrows-Wheeler transform. These positions are obtained from the alphabetic limits. Once block G is delimited, the engine searches for and counts the number of A, the next preceding character of "CAG", that exist in block G. Figure 10A shows that two A's appear in block G, indicating that the genome contains two occurrences of "AG".

If desired, the K intervals can be used to facilitate the step of counting the number of times a particular letter appears in a search region (for example, counting the number of A in block G), and can also be used to count the number of times a particular letter appears before a search region. To carry out such counting steps, the particular character is counted starting from a predetermined position (e.g., the start position) and proceeding to the nearest position that is a multiple of K.

An advantage of using the K intervals with the search engine is that the time it takes to determine how many times a particular word appears in a genome is linear with respect to the size of the K interval, the length of the word being searched and the time required to access various memory addresses. Thus, the size of the genome is not a factor in determining the word count, unless the compressed transform and the K-interval data structure are too large to fit in memory (for example, the random access memory).

In one embodiment, K can be set to 300 characters or, equivalently, to 100 compressed bytes. With such an arrangement, the maximum number of counts that needs to be performed does not exceed K/2. If desired, sub-intervals of size K' can be used within each K interval to maintain a running total of each character appearing within a particular K interval. If the size of K is limited to be less than 2^8, for example, then the counts for each letter within each K interval can be recorded using a single byte. This increases the density of the count index by a factor of K/K', while the space requirements for the K-interval counts are increased by a factor of only (K/K')/4. Such sub-interval and size restrictions have been used for the auxiliary data structure employed with this algorithm. Depending on the choices of K and K', a three- to five-fold increase in query execution speed has been achieved, while maintaining a memory requirement of less than two gigabytes for the human genome.

To further accelerate the counting process, the dictionary counts data structure can be used. Note that the compression scheme used is a 3:1 compression scheme, in which bytes 0 to 124 decompress to "AAA" through "TTT", respectively. The dictionary counts structure is a two-dimensional array that can be regarded as a matrix with 125 rows and five columns. Each row corresponds to one of the entries in the compression dictionary, and each column corresponds to one letter of the genome alphabet, A to T.

The following explains by way of example how the dictionary counts structure and the K intervals can be used to perform the counting operations. Assume, for example, that the search engine is in the process of determining the number of A that appear before the search region. Using the K-interval counts structure described above, the engine can "jump" to within 50 bytes of the actual start position of the search region in a single lookup. Suppose further that the start position points to the third character, "T", of a compressed "ATT" (one byte) that is the 49th byte of the interval. For each of the preceding 48 bytes, the byte itself can be used as the row number in the dictionary counts data structure, and the letter of interest, "A", gives the column number. Using this information as the coordinates for accessing the dictionary counts array, the dictionary counts data structure provides the number of times "A" appears in that compressed byte. Therefore, to determine how many A appear before the start of the search region, the dictionary counts structure needs to be accessed 48 times. In addition, the 49th byte may need to be decompressed in order to examine the first two letters, "AT", of the "ATT" byte. Thus, when the dictionary counts data structure is combined with the K-interval data structure, the step of counting any number of characters requires at most K/6 + 1 table lookups, plus two character comparisons in the worst case.

Referring again to Figure 10, the search engine then delimits the AG block within the transform, so that it knows where to look for the next preceding character. The limits of the AG block are found by adding the number of times that A appears before block G in the transform to the first position at which block A starts in the transform. In this example, only one A appears before block G. Therefore, using equation 1 above, where A is 1 and Z is 1, a start position of 2 is obtained for the AG block. The end position of the AG block is obtained using equation 2 above, where M is 2 (the number of A found in block G). Equation 2 provides an end position of 3 for the AG block, as shown in Figure 10B. Once the AG block is found, the search engine counts the number of times C appears in it. This count provides the number of occurrences of CAG in the genome, because C is the first character of the word "CAG". Thus, the search engine provides a word count of one.

Figure 11 shows an illustrative genome having coordinate positions, and a sorted suffix array having coordinate positions that correspond to the coordinate positions of the genome. That is, the first character of each row of the suffix array corresponds to one of the characters in the genome. For example, the second row of the array has a coordinate position of 2, which corresponds to position two of the genome. Thus, the coordinate positions of the suffix array are correlated with the coordinate positions of the genome.

If desired, the suffix array can be used to locate the coordinate position of a particular word. For example, to find the coordinate position of "CAG", the suffix array of Figure 11 can be accessed, and it indicates that CAG starts at position 3. However, as mentioned above, accessing the suffix array is a time-consuming process because it requires access to the hard drive. Therefore, it is desirable to obtain the coordinates of the word by accessing memory only. This can be achieved by assigning preselected coordinates of the suffix array to the transform, thus allowing a coordinate location algorithm to use the transform to locate the start coordinate of a particular word.

Such a coordinate location algorithm is explained by way of example. Suppose that the circled portion of the suffix array is the transform of the genome, and that only coordinates 3 and 7 have been carried over to the transform from the suffix array. Assume further that the coordinate of TC is to be found. (Note that if the transform had a coordinate associated with the G that is affiliated with TC, the coordinate of TC would be known without having to resort to the coordinate location algorithm.) It is known that TC is associated with the last G in the transform. Starting with this G, the algorithm determines how many preceding G there are. In this case, there is one preceding G. The alphabetic limits data structure and the number of preceding G are used to determine which letter precedes this particular G. Using the alphabetic limits, it is known that block G starts at position 7. Since there is one preceding G, the algorithm adds this number to 7 to obtain 8. Thus, the A that corresponds to the row of the suffix array that starts with GT is the letter that precedes the G mentioned above. This completes one iteration of the coordinate location algorithm.

Generally speaking, this iteration is repeated until a coordinate (for example, 3 or 7) is reached in the transform. Once a coordinate is reached, the number of iterations is added to the coordinate, and the resulting sum is the actual start coordinate of the desired word (for example, TC).

Continuing with the iterative process, it is known that two A's precede the A associated with the row of the suffix array that starts with GT. Using the alphabetic limits and the number of preceding A, the algorithm arrives at the C associated with the row of the suffix array that begins with AGT. Since there is no C preceding this particular C, the algorithm arrives at the A associated with the row of the suffix array that starts with CAG. Because this A has a coordinate position (namely, 3), the actual position of the word TC can be determined by adding to 3 (the coordinate position of this A) the number of iterations, which in this example is 3, resulting in a coordinate position of 6. Thus, TC starts at coordinate position 6 in the original genome.
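The word counting and coordinate location procedures of this section can be reproduced on the example genome with the Python sketch below. It is a minimal, uncompressed illustration rather than the engine itself: all function names are hypothetical, `str.count` stands in for the K-interval and dictionary-count lookups, and rows and coordinates are numbered from zero as in Table 2. The `search` function applies equations 1 and 2 directly.

```python
def build_index(genome):
    """Return the transform, the block start of each letter, and row coordinates."""
    text = genome + "$"
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    transform = "".join(text[i - 1] for i in order)  # last letter of each sorted row
    starts, total = {}, 0
    for letter in sorted(set(text)):
        starts[letter] = total
        total += text.count(letter)
    return transform, starts, order


def search(word, transform, starts):
    """Backward search; returns the inclusive (start, end) rows, or None if absent."""
    last = word[-1]
    if last not in starts:
        return None
    start = starts[last]
    end = start + transform.count(last) - 1
    for letter in reversed(word[:-1]):
        m = transform.count(letter, start, end + 1)  # M: occurrences inside the region
        if m == 0:
            return None
        z = transform.count(letter, 0, start)        # Z: occurrences before the region
        start = starts[letter] + z                   # equation 1
        end = start + m - 1                          # equation 2
    return start, end


def count_word(word, transform, starts):
    rows = search(word, transform, starts)
    return 0 if rows is None else rows[1] - rows[0] + 1


def locate_row(row, transform, starts, sampled):
    """Step backwards until a row with a stored coordinate is reached.
    Assumes `sampled` is non-empty; the modulo handles wrap-around at '$'."""
    steps = 0
    while row not in sampled:
        letter = transform[row]
        row = starts[letter] + transform.count(letter, 0, row)
        steps += 1
    return (sampled[row] + steps) % len(transform)


transform, starts, order = build_index("AGACAGTCAT")
# Keep only coordinates 3 and 7, as in the Figure 11 example.
sampled = {row: coord for row, coord in enumerate(order) if coord in (3, 7)}
print(transform)                                    # TG$CCATAAAG
print(count_word("CAG", transform, starts))         # 1
row, _ = search("TC", transform, starts)
print(locate_row(row, transform, starts, sampled))  # 6
```

Storing more coordinates shortens the backward walk at the cost of memory, which is the trade-off behind carrying only selected coordinates over to the transform.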
D. SEARCH ENGINE APPLICATIONS
Now that the operational features of the search engine have been described, its practical applications can be discussed. One application of the search engine is the analysis of a genome (or any other type of nucleotide sequence). In particular, the genome can be analyzed using substrings of a particular length that exist within the genome. The search engine can count the number of times a substring of a particular length appears in the genome. These counts indicate the uniqueness of a particular substring: lower counts represent a higher degree of uniqueness than higher counts.
If desired, any region of the genome, or the entire genome, can be analyzed based on its constituent "mer" frequencies. A "mer" is another term for a word or substring of a particular length. Thus, when a genome or a portion of it is analyzed, it is analyzed based on mers of a particular length (for example, mer lengths of 15, 18, 21 and 24). Regardless of the mer length being analyzed, each mer of that length that exists in the genome is counted. For example, if the mer length is 15, the search engine determines the word count for the first 15-mer and for every 15-mer that follows. Each subsequent 15-mer is offset from the previous 15-mer by one character. That is, characters 1 through 15 constitute one 15-mer, characters 2 through 16 constitute another 15-mer, characters 3 through 17 constitute yet another 15-mer, and so on. This ensures that each analyzed 15-mer is assigned a word count, where the word count represents the number of times that particular 15-mer appears in the entire genome.
Probe design is facilitated by the search engine. The ability of the engine to quickly count the number of times a particular word appears in a genome is useful for designing probes that are unique and that hybridize to a specific region of DNA with minimal cross-hybridization. Using the search engine, potential cross-hybridization can be minimized by selecting a candidate probe that is composed of smaller mers that are unique and that meet certain stringent conditions, such as having low word counts or no word counts throughout the genome. A unique word can be a particular series of nucleotides that has fewer than a predetermined number of word counts (for example, fewer than 2, 5, 10, 25, 50 or 100 word counts) or no word counts (i.e., zero word counts) within a genome or a portion thereof.
More particularly, candidate probes are obtained based on a set of predetermined criteria, such as requiring candidates to have a length, L1, and also requiring candidates to have a predetermined word count (for example, a word count of one). In addition, the predetermined criteria may also require that the reverse complement of a candidate have a predetermined word count (for example, one). Once the candidates are obtained, they are subjected to additional predetermined criteria to determine which candidates are suitable for use as probes. These additional criteria filter the candidates based on their constituent sub-regions (i.e., mers of a given length contained within the candidate probe). For example, the filtering criterion may require that the mers of a length L2, where L2 is less than L1, have word counts that are minimal in relation to other candidate probes.
Thus, there is a relationship between the criteria used when finding probes: "hard" constraints (for example, that each candidate is unique with respect to the genome) and "soft" constraints (for example, that the counts of the constituent mers are minimized). One way to satisfy the "hard" constraints is to obtain the candidates based on the results of a previously conducted analysis. Using the word count information, candidates can be selected from regions of the genome that have low concentrations of word counts (for example, it is preferable to obtain candidates that have a minimized arithmetic mean of word counts of a predetermined length, a minimized geometric mean of word counts of a predetermined length, a minimized mode of word counts of a predetermined length, a minimized maximum of word counts of a predetermined length, a minimized total sum of word counts of a predetermined length, a minimized product of word counts of a predetermined length, a minimized maximum-length run of a particular nucleotide, or a combination thereof).
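The overlapping word counts and per-candidate statistics described above can be sketched as follows. This is a minimal illustration on a toy sequence, not the search engine of the invention: it assumes the whole sequence fits in memory, and the function names `count_mers`, `constituent_counts` and `summarize` are hypothetical.

```python
from collections import Counter
from statistics import mean


def count_mers(genome: str, k: int) -> Counter:
    """Count every overlapping k-mer; consecutive k-mers are offset by one character."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def constituent_counts(candidate: str, k: int, genome_counts: Counter) -> list:
    """Genome word count of each overlapping k-mer contained in a candidate probe."""
    return [genome_counts[candidate[i:i + k]]
            for i in range(len(candidate) - k + 1)]


def summarize(counts: list) -> dict:
    """A few of the summary statistics by which candidates can be compared."""
    return {"mean": mean(counts), "max": max(counts), "sum": sum(counts)}


# Toy data only; a real analysis would use mer lengths such as 15, 18, 21 or 24.
genome = "ACGTACGTTT"
counts_3 = count_mers(genome, 3)
stats = summarize(constituent_counts("GTACG", 3, counts_3))
```

Candidates with lower summary values are composed of rarer words and are therefore the more attractive starting points for probe selection.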
To satisfy the "soft" constraints, candidates can be analyzed according to predefined criteria, such as 15-mer counts, 17-mer counts, etc. The data obtained from the analysis are examined to determine whether a candidate is sufficiently unique to be used as a probe. A candidate can be selected as a probe if, for example, it has the lowest sum of 15-mer counts of all candidates. Other criteria, such as minimal occurrences of compositional bias (e.g., long runs of a particular nucleotide), can be applied to determine which probe is best. After the criteria are applied to each candidate, one or more candidates are selected as suitable probes.
Yet another application of the search engine is detecting changes from one version of a genome to another. For example, as the human genome project progresses, new segments of the genome are mapped and made available to the public. Using the search engine and the probes that were designed for another version of the same genome, it can be determined how many of these probes can be applied to the new version of the genome.
Still another application of the search engine is verifying whether a particular word exists in the genome. It may be desirable
to find words that do not appear in the genome, so that there is little chance that the word will hybridize to a section of the genome. These words can be generated randomly, according to a predefined set of criteria. When a word is found, its complement is also presented to the search engine to determine whether it appears in the genome. If neither the word nor its complement appears in the genome, there is minimal chance that this word and its complement will hybridize to the genome. Such non-hybridizing probes can be used as barcodes readable by hybridization and as hybridization array controls, and can be added to nucleic acid probes to improve hybridization signals through network formation.
One way to minimize the chance of hybridization is to minimize the frequency of the constituent mers of a particular word. That is, it is preferable to obtain probes for which as many constituent mer lengths as possible have word counts of zero. For example, suppose that several 20-mer oligonucleotides are generated with the aim that they not hybridize to the human genome. Next, assume that every 20-mer is analyzed for each of its overlapping 19-mer, 18-mer, 17-mer, 16-mer, down to, for example, 6-mer constituents.
Theoretically, the most desirable 20-mer would have zero word counts at every mer length. In practice, a probe having the minimum chance of hybridization preferably has zero counts at as many mer lengths as possible (for example, a desirable probe may have zero word counts for mer lengths of 19, 18, 17, 16, 15, 14 and 13). Thus, a probe that has zero counts for its 15-mer and 14-mer constituents is less likely to hybridize to the genome than a probe that has zero counts for its constituent 15-mers but one or more counts for its constituent 14-mers. The first probe has less chance of hybridizing than the second because it has no 14-mer that corresponds to a section of the genome.
Non-hybridizing oligonucleotides can be constructed from the constituent mers of a particular word that have a zero or low word count. For example, if a particular 20-mer contains a 13-mer that has a word count of zero, this 13-mer can be used to construct oligonucleotides that probably do not exist in the genome (for example, two such 13-mers may be joined to one another to create a unique 26-mer).
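The screen for non-hybridizing words described above can be sketched as follows: for each mer length, check whether every constituent mer of the word is absent from the genome. This is a toy illustration under the assumption that the genome fits in memory; `mers` and `zero_count_lengths` are hypothetical names, and a fuller screen would also test the complement of the word as the text describes.

```python
def mers(sequence: str, k: int) -> set:
    """All distinct overlapping k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def zero_count_lengths(word: str, genome: str, lengths) -> list:
    """Mer lengths at which no constituent mer of `word` occurs in `genome`."""
    return [k for k in lengths if not (mers(word, k) & mers(genome, k))]


# Toy data: the 2-mer "GT" of the word occurs in the genome,
# but none of its constituent mers of length 3 to 6 do.
genome = "ACGTACGTAC"
clean = zero_count_lengths("GGTTCC", genome, range(2, 7))
```

The shorter the shortest clean mer length, the lower the chance that the word hybridizes to the genome.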
In a laboratory setting, for example, a zero-count word and its zero-count complement (non-hybridizing oligonucleotides) can be attached to a probe or target word. In an abstract sense, the words are the "arms" that are joined to the "body" (that is, the probe). When hybridization begins, the words ("arms") hybridize only to one another, while the probe hybridizes to the genome. Because the words ("arms") typically carry a detectable material (e.g., a fluorescent tag), this self-hybridization helps distinguish the location of the probe within the genome from background hybridization. Thus, the self-hybridization of the arms serves to amplify the visibility of the probe that hybridizes to the genome.
Non-hybridizing oligonucleotides can also be used as tags to uniquely identify a particular sequence among a vast population of other sequences. The non-hybridizing oligonucleotides can be linked to the known sequence, thus tagging or labeling that particular sequence.
In yet another example, several different DNA sequences can be concatenated to form a single genome (provided, for example, in step 810 of Figure 8). Such a concatenated genome is useful,
for example, if it is desired to design a probe that detects the presence of a particular pathogen (e.g., a virus) within a human blood sample. A concatenated genome is needed because DNA extracted from human blood contains not only human DNA but also DNA from other sources, such as the pathogen. Therefore, for the probe to detect the pathogen effectively in human blood, it must not cross-hybridize with the human genome.
If the pathogen probe is not completely unique with respect to the other genomes in a tissue sample (e.g., the patient's genome and the genomes of other microorganisms found in the patient), it may be necessary to compare the word counts for the probe in the genome of the pathogen with the word counts for the probe in the other genomes. This procedure may require two search engines, one for the pathogen of interest and the other for a combination of the other genomes. Note that when applying this two-engine procedure, it may be advantageous to design probes that have higher mer counts within the genome of the pathogen, as long as the probe's counts in the other genomes in the tissue sample are disproportionately low.
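The two-engine comparison can be sketched as follows, with one count table standing in for each search engine. This is an assumption-laden toy: `build_engine` and `probe_counts` are hypothetical names, and the sequences are far shorter than any real genome.

```python
from collections import Counter


def build_engine(genome: str, k: int) -> Counter:
    """Stand-in for a search engine: overlapping k-mer counts of one genome."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def probe_counts(probe: str, k: int, target: Counter, background: Counter) -> tuple:
    """Summed constituent k-mer counts of a probe in the target and background genomes."""
    words = [probe[i:i + k] for i in range(len(probe) - k + 1)]
    return sum(target[w] for w in words), sum(background[w] for w in words)


# Toy sequences: one engine for the pathogen, one for everything else in the sample.
pathogen_engine = build_engine("AAACCCGGGTTT", 3)
background_engine = build_engine("ACGTACGTACGT", 3)
in_pathogen, in_background = probe_counts("CCCGGG", 3, pathogen_engine, background_engine)
```

A probe is attractive when the first number is high and the second is zero or disproportionately low.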
VII. EXAMPLES
The following examples are provided by way of illustration only. They are not intended to limit the scope of the invention described herein.
Example 1 - Selection of Complementary Oligonucleotides for a Representation
This example demonstrates the identification of oligonucleotide probes that are complementary to the BglII-derived representation of a human genome. Similar procedures can be used to design oligonucleotides complementary to any population of nucleic acids whose sequences are known or predicted.
Using the published draft assembly of the human genome sequence, we performed an in silico BglII digestion of the human genome by locating all the BglII restriction sites within the draft assembly. We then selected all BglII fragment sequences that were between 200 and 1,200 base pairs in length. Next, we analyzed the sequences of these fragments using an algorithm described herein. This algorithm (also called a "mer-engine") can be used to determine the number of copies of any given oligonucleotide sequence in any sequenced genome. This copy number is
also called the "word count" of the oligonucleotide sequence in the genome.
We annotated each BglII fragment with the word counts of its constituent overlapping 15-mers and 21-mers (i.e., oligonucleotides having 15 or 21 nucleotides) using the mer-engine constructed from the same draft assembly of the human genome. To do this, we generated in silico, for each fragment, every constituent overlapping 70-mer oligonucleotide (for example, a fragment of 100 base pairs would have 31 such 70-mers). The following attributes were determined for each such 70-mer of a fragment, as described below: maximum 21-mer count (or maximum 18-mer count), arithmetic mean of the 15-mer counts, percent G/C content and quantity of each base, and the longest run of any single base.
To determine the maximum 21-mer count, we decomposed each 70-mer into overlapping 21-mers and compared each of these 21-mers to all the 21-mer sequences in the genome. We discarded all 70-mers whose maximum 21-mer count was greater than 1, that is, those containing a 21-mer sequence that was 100% complementary to more than one 21-mer sequence in the genome. The remaining 70-mers formed our initial set of 70-mer probes.
We further refined the set of 70-mer probes by eliminating those with a GC content of less than 30% or greater than 70%, a run of A/T longer than 6 bases, or a run of G/C longer than 4 bases. Of the remaining 70-mers, we chose for each BglII fragment the one (or more) 70-mer whose GC/AT ratio was closest to that of the genome as a whole. We then analyzed each of the 70-mers thus chosen by determining the genome word counts of each of its constituent overlapping 15-mers. We chose the 70-mers that had the lowest mean 15-mer count.
As a final check of overall uniqueness, the optimal 70-mer probes for each BglII fragment were compared to the entire genome using the BLAST software. The default parameters were used, except that filtering of low-complexity sequence was not performed. Any 70-mer probe with any degree of homology along 50% or more of its length to any sequence other than itself was eliminated.
The mer-engine algorithm brings rigor, flexibility and simplicity to the probe design process. The ability to quickly determine the
word counts for words of all sizes allows design criteria to be framed quantitatively, in a way that is analogous to actual hybridization events. Word counts can be considered a quantitative measure of the degree to which sequences belong to two or more sets of polynucleotides. For example, the small probe "AGT" can be considered a set containing six different words, namely "A", "G", "T", "AG", "GT" and "AGT". If this probe were annotated with word counts for all words of all sizes, the number of times each word appears in the first set, which is the "AGT" probe itself, would be dwarfed by the number of times it appears in the second set, namely the genome of three billion nucleotides. This relationship can be expressed as a ratio X/Y, where X is the sum of the counts for all the constituent words of the probe within the probe and Y is the sum of the counts for all the same words within the genome. When selecting a 70-mer probe that hybridizes to a target sequence with minimal cross-hybridization, one can maximize the X/Y ratio, where the maximum value of X/Y for probes derived from the genome sequence is 1. The technique of
selecting only two word lengths with which to analyze is essentially one of many possible shortcuts toward this goal.
If unique probes cannot be found within a genomic region of interest, it is possible to use non-unique probes to provide clear measurements of relative differences in copy number or simply in the amount of material. The problem then extends to a comparison between three sets of words: the probe, the covered region of interest and the genome. Let Z represent the sum of all the word counts of the probe within the covered region. Let X and Y still represent the sums of all the word counts of the probe within the probe and within the genome, respectively. The goal is then to maximize the value of the expression (X/Y)/(X/Z), or simply Z/Y. In other words, one can find probes that are specific to the region, regardless of total copy number. This special case can be generalized to any circumstance in which one is selecting probes to recognize one particular entity among many through hybridization. A further example is the recognition of the DNA of one organism when it is mixed with the DNA of many other organisms.
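The sums X, Y and Z can be illustrated with the "AGT" probe from the text. The sketch below enumerates every word of every size in the probe and sums its occurrences in a given sequence; the helper names and the toy genome and region are assumptions made purely for illustration.

```python
def occurrences(text: str, word: str) -> int:
    """Number of (possibly overlapping) occurrences of `word` in `text`."""
    return sum(1 for i in range(len(text) - len(word) + 1)
               if text.startswith(word, i))


def words_of(probe: str) -> set:
    """Every distinct word of every size contained in the probe."""
    return {probe[i:j] for i in range(len(probe))
            for j in range(i + 1, len(probe) + 1)}


def membership_sum(probe: str, text: str) -> int:
    """Sum of the counts in `text` of all constituent words of the probe."""
    return sum(occurrences(text, w) for w in words_of(probe))


probe = "AGT"
genome = "AGTAGTCC"  # toy genome (hypothetical)
region = "AGTCC"     # toy region of interest within that genome (hypothetical)

x = membership_sum(probe, probe)   # counts within the probe itself
y = membership_sum(probe, genome)  # counts within the genome
z = membership_sum(probe, region)  # counts within the covered region
uniqueness = x / y                 # X/Y, at most 1 for a genome-derived probe
region_specificity = z / y         # Z/Y, equal to (X/Y)/(X/Z)
```

Enumerating all word sizes is quadratic in probe length, which is why restricting the analysis to two word lengths, as in this example, is a practical shortcut.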
Yet another application of this paradigm is the minimization of set membership. We have designed probes that acted as hybridization controls in microarray experiments. These probes were controls in the sense that they were intended to hybridize only to those DNA fragments that any other probe had an equal chance of recognizing. The objective in this case was simply to design a probe for which Y is as close to zero as possible. Such probes would also be useful, for example, as unique identifiers readable by hybridization or as additions to other nucleic acid sequences to improve the hybridization signal through network formation.
In addition to sums and arithmetic means of word counts, many other statistics can be used, including, for example, the variance of the probe's word counts for words of a particular size. This variance can act as a quick pre-screen when selecting probes that must exist in a particular number of copies. The maximum word count for a particular word size can be taken as an indication of the possible hybridization outcome for an otherwise unique probe. These quantitative measures are ideal for quickly determining the suitability of a hybridization probe in
relation to other candidates. The core mer-engine algorithm can in essence reduce the probe selection process to a single-pass scan over the sequence of interest.
One of the probe sets we designed consisted of 85,000 70-mers, which had a mean 18-mer count relative to the human genome of 1.2, with a standard deviation of 0.8. The mean was calculated over the set of all the 18-mers of all the probes combined. By comparison, for the prior art, in particular a published set of approximately 23,000 70-mer expression array probes, the mean of the 18-mer counts for all the probes combined was 1.9, with a standard deviation of 14.8. Our probe set was therefore the larger of the two by a factor of about 4, and more consistent in its uniqueness by a factor of about 18. The set of 85,000 probes in this example was selected by us based on the combination of a 21-mer uniqueness constraint and a minimized aggregate 15-mer count constraint, as previously described. The advantages included a large increase in confidence that probes that performed well empirically did not do so simply by hybridizing to a large heterogeneous population of DNA fragments and thereby increasing their signal.
This further illustrates the precision with which probe sets can be designed to meet rigorously defined criteria, such as an extremely small standard deviation around a target mean word count.
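The in silico digestion, 70-mer enumeration and composition filters of Example 1 can be sketched as follows. This is a minimal sketch, not the procedure actually used: it relies on the known BglII recognition site AGATCT (cut after the first A), treats "a run of A/T" as a run of a single base A or T (an interpretation), keeps the end fragments of the input sequence, and uses hypothetical function names throughout.

```python
import re

BGLII_SITE = "AGATCT"  # BglII cuts after the first A: A^GATCT


def digest(sequence: str, site: str = BGLII_SITE, cut_offset: int = 1) -> list:
    """Cut a sequence at every occurrence of the site (lookahead finds overlaps)."""
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


def size_selected(fragments: list, low: int = 200, high: int = 1200) -> list:
    """Keep fragments between 200 and 1,200 base pairs in length."""
    return [f for f in fragments if low <= len(f) <= high]


def windows(fragment: str, length: int = 70) -> list:
    """Every overlapping constituent oligonucleotide of the given length."""
    return [fragment[i:i + length] for i in range(len(fragment) - length + 1)]


def longest_run(sequence: str, bases: str) -> int:
    """Longest run of any single base among `bases` (e.g. 'AT' means A+ or T+)."""
    runs = re.findall("|".join(f"{b}+" for b in bases), sequence)
    return max(map(len, runs), default=0)


def passes_composition(probe: str) -> bool:
    """GC content within 30-70%, A or T runs of at most 6, G or C runs of at most 4."""
    gc_percent = 100 * (probe.count("G") + probe.count("C")) / len(probe)
    return (30 <= gc_percent <= 70
            and longest_run(probe, "AT") <= 6
            and longest_run(probe, "GC") <= 4)


fragments = digest("CCAGATCTGGGAGATCTTT")
```

The word-count constraints (maximum 21-mer count of 1, lowest mean 15-mer count) would be applied to the surviving 70-mers using the count tables of the mer-engine.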
Example 2 - Preparation of the Arrays
We used two formats to construct microarrays containing the oligonucleotide probes designed according to Example 1. In the first, the "printed" format, we purchased approximately 10,000 oligonucleotides made with solid-phase chemistry and printed them with quills on a glass surface. Specifically, we used the Cartesian PixSys 5500 (Genetic Microsystems) to array our probe collection onto the slides, using a pin configuration. The dimensions of each printed array were approximately 2 cm². Our arrays were printed on commercially available silanated slides (Corning® ultraGAPS™ #40015). The pins used for arraying were from Majer Precision.
In the second format, the "photoprinted" format, the oligonucleotides were synthesized by NimbleGen™ Systems, Inc. directly on a silica surface, using laser-directed photochemistry. Approximately
700,000 unique 70-mer oligonucleotides were first screened for "performance" by arraying them on eight chips and hybridizing them with BglII representations and EcoRI-depleted BglII representations of genomic DNA from a normal male, J. Doe. We selected the 85,000 oligonucleotides that generated the strongest signal and arrayed them on a single chip.
In both formats, we arranged the oligonucleotides in random order to minimize the possibility that a geometric artifact during array hybridization would be misinterpreted as a genomic lesion. In the following examples, we describe results obtained with the 10K printed arrays and the 85K photoprinted arrays.
Example 3 - Preparation and Labeling of the Test Representations
For some of the experiments described here, we chose BglII to make the representations. BglII has useful features for these particular experiments: it is a robust enzyme; its cleavage site is not affected by CpG methylation; it leaves a four-base overhang; and its cleavage sites have a reasonably uniform distribution in the human genome. BglII representations are made up of
short fragments, usually smaller than 1,200 bp. We estimate that there are approximately 200,000 of them, comprising approximately 2.5% of the human genome, with an average spacing of 17 kb.
In all the experiments described here, we used comparative hybridization of representations prepared in parallel. The DNA of the two samples being compared was prepared at the same time, and the representations were prepared from the same concentration of template, using the same protocols, reagents and thermal cycler. This reduces the possible "noise" created by variable yield after PCR amplification.
We prepared BglII representations of human genomic DNA as previously described by Lucito et al., 1998, supra. Briefly, we digested 3-10 ng of human genomic DNA with BglII under the conditions suggested by the supplier. We purified the digest by phenol extraction and ethanol precipitation in the presence of 10 μg of tRNA. We resuspended the pellet in 30 μl of 1X T4 DNA ligase buffer with 444 pmol of each adapter (RBgl24 and RBgl12; Lucito, R. and M. Wigler. 2003. "Preparation of Target DNA". In Microarray-based Representational Analysis of DNA Copy Number (eds. D. Bowtell & J. Sambrook), pp. 386-393. Cold Spring Harbor
Press, Cold Spring Harbor, NY). We placed the reaction mixture in a heating block preheated to 55 °C and placed the heating block on ice for about 1 hour, until the temperature dropped to 15 °C. Next, we added 400 units of T4 DNA ligase and incubated the reaction mixture at 15 °C for 12-18 hours.
We added 1/40th of the ligated material, 20 μl of 5X PCR buffer [335 mM Tris-HCl, pH 8.8; 20 mM MgCl2; 80 mM (NH4)2SO4; 50 mM β-mercaptoethanol and 0.5 mg/ml BSA], 2'-deoxynucleoside 5'-triphosphates to a final concentration of 0.32 mM, adapter RBgl24 to a final concentration of 0.6 μM, 1.25 U of Taq polymerase and water to 250 μl tubes to bring the volume to 100 μl. The tubes were placed in an MJ Research TETRAD™ thermal cycler preheated to 72 °C. We then performed the amplification as follows: one cycle at 72 °C for 5 minutes, then 20 cycles of 1 minute at 95 °C and 3 minutes at 72 °C, followed by an extension time of 10 minutes at 72 °C. We cleaned the representations (i.e., the PCR products) by phenol:chloroform extraction and ethanol precipitation before resuspending them in TE (pH 8) and determining the DNA concentration.
For certain experiments, we prepared
depleted representations, using an additional restriction endonuclease to cleave those fragments containing its restriction site. In these cases, we digested the ligation mixture with the second restriction endonuclease just before the amplification step. In the experiments described below, the depleted BglII representation was produced using HindIII.
We labeled the fragments in the representations as follows. We placed the DNA in a 0.2 ml PCR tube, added 10 μl of primers from the Amersham-Pharmacia Megaprime™ labeling kit and mixed them well with the DNA. We brought the volume up to 100 μl with water. We placed the tubes in an MJ Research TETRAD™ machine at 100 °C for 5 minutes, placed them on ice for 5 minutes and added 20 μl of labeling buffer from the Amersham-Pharmacia Megaprime™ labeling kit, 10 μl of label (either Cy3™-dCTP or Cy5™-dCTP) and 1 μl of Klenow fragment from New England BioLabs®. We incubated the tubes at 37 °C for two hours, combined the labeled samples (Cy3™ and Cy5™) in one Eppendorf® tube and then added 50 μl of 1 μg/μl human Cot-1 DNA, 10 μl of 10 mg/ml stock yeast tRNA and 80 μl of Low TE (3 mM Tris pH 7.4, 0.2 mM EDTA). We loaded the sample onto a Centricon® filter and centrifuged for 10 minutes at 12,600 rcf. We discarded
the flow-through and washed the filter with 450 μl of Low TE. We repeated the centrifugation and the TE wash twice. We collected the labeled sample by inverting the Centricon® column into a new tube and centrifuging for 2 minutes at 12,600 rcf. We transferred the labeled sample to a 200 μl PCR tube and adjusted the volume to 10 μl with Low TE.
In addition, for some experiments, we digested DNA isolated from primary ovarian cancer cells and from a normal reference with McrBC, ligated the adapters and amplified as described above.
Example 4 - Hybridization of the Test Representations to the Arrays
We UV-crosslinked the oligonucleotide probes to the slide using a Stratagene® Stratalinker® set at 300 mJ, rotated the slide 180 degrees, keeping the slide at the same point in the crosslinker, and repeated the treatment. We washed the slides for 2 minutes in 0.1% SDS, 2 minutes in Milli-Q® water, 5 minutes in boiling Milli-Q® water and finally in ice-cold 95% benzene-free ethanol. We dried the slides by placing them in a metal rack and spinning them for 5 minutes at 75 rcf. We prehybridized the printed microarrays by placing them in
a Coplin jar or other slide-processing chamber, adding prehybridization buffer (25% deionized formamide, 5X SSC and 0.1% SDS) and heating the chamber at 61 °C for two hours, and then washing them in Milli-Q® water for 10 seconds. We again dried the slides by placing them in a metal slide rack and spinning them for 5 minutes at 75 rcf. The NimbleGen™ photoprinted arrays did not require UV crosslinking or prehybridization.
We added 25 μl of hybridization solution to 10 μl of the sample prepared as in Example 3 and mixed. For the printed slides, the hybridization solution was 25% formamide, 5X SSC and 0.1% SDS. For the NimbleGen™ photoprinted arrays, it was 50% formamide, 5X SSC and 0.1% SDS. We denatured the samples in an MJ Research TETRAD™ at 95 °C for 5 minutes and then incubated them at 37 °C for 30 minutes. We spun the samples down and pipetted them onto slides prepared with a lifter coverslip, then incubated the slides in a hybridization oven (such as Boekel's InSlide Out™ oven) set at 58 °C for the printed arrays or at 42 °C for the NimbleGen™ photoprinted arrays for 14 to 16 hours.
After hybridization, we washed the
slides as follows: briefly in 0.2% SDS/0.2X SSC to remove the coverslip; 1 minute in 0.2% SDS/0.2X SSC; 30 seconds in 0.2X SSC; and 30 seconds in 0.05X SSC. We dried the slides as before by placing them in a rack and spinning them at 75 rcf for 5 minutes, and then scanned them immediately.
We scanned the slides using an Axon GenePix® 4000B scanner set to a pixel size of 10 microns for the printed arrays and 5 microns for the photoprinted arrays. We quantified the intensity of the array features using GenePix™ Pro 4.0 software and imported the data into S-PLUS® for further analysis. We calculated the ratios between the two signals in an experiment using the measured intensities without background subtraction. We normalized the data using an intensity-based lowess curve-fitting algorithm similar to that described in Yang et al., Nucl. Acids Res. 30:e15 (2002). We averaged the data obtained from the color-reversal experiments and display them as presented in the Figures.
Example 5 - Performance and Validation of the Arrays
As discussed above in Example 1, we should be able to predict, based on the published sequence of the human genome, which oligonucleotide probes will hybridize with which representations. To confirm this, we tested our 10K printed arrays by hybridizing them to BglII representations of normal human genomic DNA labeled with one fluorescent dye and to HindIII-depleted BglII representations of the same DNA labeled with another fluorescent dye.
Figure 1 illustrates the results obtained with the HindIII-depleted BglII representations. In Figure 1A, we plot the ratio of hybridization intensities for each probe along the Y axis. Each experiment was performed with color reversal, and the geometric mean of the ratios from the separate experiments was plotted. The probes predicted to detect fragments in both the full and the depleted representations hybridized to both (Figure 1A, left). There are approximately 8,000 of these probes. The probes predicted not to detect fragments in the depleted representation did not hybridize to it (Figure 1A, right). There are approximately 1,800 of these probes. These results validate that: (1) the restriction profile of the fragments of the
representation was correctly predicted, (2) the oligonucleotides were arrayed correctly and (3) the oligonucleotides detected the predicted fragments with acceptable signal intensity.
In Figure 1B, the concordance between the ratios from the color-reversal experiments is plotted. These data confirm the reproducibility of our arrays.
A very small number of oligonucleotide probes failed to hybridize to the target fragments in the representations as predicted. For example, of the 8,000 probes predicted to hybridize to fragments not cleaved by HindIII, approximately 16 appear to hybridize to BglII fragments that were in fact cleaved. This could be due to a divergence between our sample and the published human sequence, which could result from a polymorphism or from sequencing errors. Nevertheless, the data herein show that the public human sequence is sufficiently reliable for the design of probes for representational oligonucleotide microarrays.
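The geometric mean of the ratios from a color-reversal pair can be sketched as follows. The argument layout is an assumption: in the first hybridization the test sample is taken to carry Cy5 and the reference Cy3, and in the second the dyes are swapped. Intensities are used without background subtraction, as in Example 4, and the lowess normalization step is omitted.

```python
import math


def color_reversal_ratio(test_cy5: float, ref_cy3: float,
                         test_cy3: float, ref_cy5: float) -> float:
    """Geometric mean of the test/reference ratios from two dye-swapped experiments."""
    forward = test_cy5 / ref_cy3   # experiment 1: test labeled with Cy5
    reverse = test_cy3 / ref_cy5   # experiment 2: test labeled with Cy3
    return math.sqrt(forward * reverse)


# Hypothetical intensities for a single probe.
ratio = color_reversal_ratio(200.0, 100.0, 800.0, 100.0)
```

Averaging on the geometric scale cancels a dye bias that multiplies one channel by a constant factor, because that factor appears in the numerator of one experiment and the denominator of the other.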
Example 6 - Global Analysis of Tumor Genomes
The oligonucleotide arrays of the invention readily detect large-scale genomic lesions, whether deletions or amplifications. Figures
2A1-A3, 2B1-B3 and 2C1-C3 show the array hybridization data for three genomic comparisons: Figures 2A1-A3 compare aneuploid breast cancer cells with normal diploid cells from the same biopsy (CHTN159) (the two sample representations were each prepared from approximately 100 ng of DNA isolated from nuclei sorted into aneuploid and diploid fractions by flow cytometry); Figures 2B1-B3 compare a breast cancer cell line (SK-BR-3), derived from a patient of unknown ethnicity, with an unrelated normal male, J. Doe (of European and African origin, see Example 2); and Figures 2C1-C3 compare cells of another normal male (an African pygmy) with the same J. Doe. In each case, the samples were hybridized twice, with color reversal, and the geometric mean ratio (on a logarithmic scale) was plotted against the genome order of the oligonucleotide probes. An increased copy number (amplification) is indicated by a ratio above 1, and a decreased copy number (deletion) by a ratio below 1.
The data shown in Figures 2A1, 2B1 and 2C1 were obtained with the 10K printed arrays. The data shown in Figures 2A2, 2B2 and 2C2 were obtained with the 85K photoprinted arrays. Clear profiles were obtained for the cancer genomes.
The profiles of the two breast cancer genomes are different, but each showed large regions of amplification and deletion in the genome (Figures 2A1-A2 and 2B1-B2). In contrast, the normal-normal profile was essentially flat, indicating no large-scale amplification or deletion between these genomes (Figures 2C1-C2). These data confirm that the oligonucleotide arrays of the invention can detect large-scale genomic changes.
The results also indicate that there are many oligonucleotide probes that detect minor gains and losses in the three genomes (the two cancer genomes and the genome of the African male). These gains and losses appear as isolated points in Figures 2A1-A2, 2B1-B2 and 2C1-C2, and appear in Figure 2C2 (the normal-normal comparison) as a "shell" or band of probes with ratios near 0.5 and 2.0 across the genome. These losses and gains probably result from heterozygous BglII polymorphisms among the individuals sampled.
Furthermore, the comparison between the 10K printed format and the 85K photoprinted format clearly demonstrates that, although they had different resolutions, both captured a similar view of large-scale genomic features. We call probes "sisters" if
they share complementarity with the same BglII fragment. Sisters do not necessarily have overlapping sequences, although they can overlap by up to half their length, or they can be complementary across their entire length. In Figures 2A3, 2B3 and 2C3, we plot the ratios of the oligonucleotides of the 10K format (Y axis) against the ratios of their sister oligonucleotides of the 85K format (X axis). There were more than 7,000 sister probes. There was remarkable concordance between the ratios of the sister probes in the two formats for all three experiments, despite the fact that the probe sequences differed between the formats, that their array patterns were different, that the hybridization conditions differed and that the array surfaces were different. These data confirm the reproducibility of the results obtained using arrays comprising the oligonucleotides of the invention.
In addition, analyses of the MOMA representations produced by cleavage with McrBC showed regions of the genome with an altered methylation status between the genomes of the cancer cell and the normal cell. Normalization for copy number differences in these regions using the BglII representation confirmed that the difference observed at many of these
sites was due to a difference in methylation status and not in copy number.
Example 7 - Automated Segmentation and Complete Genome Analysis

We also analyzed data from smaller regions of the genome to map the variation observed in Example 6. For example, we analyzed the data one chromosome at a time, using a statistical segmentation algorithm that parses the probe ratio data into segments of similar mean, after taking the variance into account (termed circular binary segmentation (CBS); see Olshen and Venkatraman, Change-Point Analysis of Array-Based Comparative Genomic Hybridization Data, Alexandria, VA, American Statistical Association, 2002). The algorithm recursively identifies the best possible segmentation of each chromosome, accepting or rejecting each proposed split based on the probability that the difference in means could arise by chance. This probability is determined by a randomization method. Because of its non-parametric nature, the algorithm does not allow us to identify aberrations covered by fewer than three probes.
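The accept-or-reject step described above can be illustrated with a much simplified sketch. It is not the CBS algorithm itself: it tests a single linear split rather than a circular arc, and it does not recurse. The function names, the scaling of the statistic, the number of permutations and the fixed seed are all assumptions made for the illustration; the minimum segment size of three probes follows the limitation noted above.

```python
import random


def best_split(ratios, min_size=3):
    """Return (index, statistic) for the split that best separates two means.

    Segments shorter than min_size probes are never proposed.
    Returns (None, 0.0) when no admissible split exists.
    """
    n = len(ratios)
    best_index, best_stat = None, 0.0
    for i in range(min_size, n - min_size + 1):
        left, right = ratios[:i], ratios[i:]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        # Scale the difference so that very unbalanced splits are not favoured.
        stat = diff * (len(left) * len(right) / n) ** 0.5
        if stat > best_stat:
            best_index, best_stat = i, stat
    return best_index, best_stat


def split_p_value(ratios, n_perm=200, seed=0, min_size=3):
    """Estimate by randomization how often chance alone gives as good a split."""
    rng = random.Random(seed)
    index, observed = best_split(ratios, min_size)
    shuffled = list(ratios)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # destroys any positional structure
        if best_split(shuffled, min_size)[1] >= observed:
            hits += 1
    return index, (hits + 1) / (n_perm + 1)
```

A proposed split would be accepted when the estimated probability falls below a chosen threshold, and the same test would then be applied to each resulting segment.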
Figures 3A-D illustrate the results of these analyses for four chromosomes (chromosomes 5, 8, 17 and X in Figures 3A-D, respectively) of the cancer cell line SK-BR-3 using the 85K array. We observed similar segmentation profiles and segment means when we used the 10K array data.

Further analysis of the data allowed us to determine the ploidy of the cells. Once the data were segmented, we assigned to each oligonucleotide the mean ratio of the segment to which it belonged and plotted the mean ratios in rank order. These data are plotted for the cancer genomes of CHTN159 (Figure 4A) and SK-BR-3 (Figure 4C). The figures show that the segment mean ratios within each genome were quantized, with major and minor plateaus of similar value. We deduced the copy number of these regions based on the knowledge, from flow analysis of DNA content, that CHTN159 was subtriploid and SK-BR-3 was tetraploid. If each sample was approximately monoclonal, then the two major plateaus in CHTN159 would be two and three copies per cell, and the major plateaus of SK-BR-3 would be three and four copies per cell.

We used the copy numbers calculated for the major plateaus to solve for the ploidy and SN of each experiment. We used the equation:

RM = (Rt x SN + 1) / (SN + 1)

where RM was the measured mean ratio, Rt was the true ratio and SN was an experimentally derived parameter that measures "specific to non-specific" noise. We took RM as the mean ratio of the probes of the segments in a plateau and set Rt to CN/P, where CN was the true, known copy number of the plateau and P was the ploidy of the tumor genome. The combination provided two equations in two unknowns, P and SN. For the experiment with CHTN159 (Figure 4A), we calculated that the ploidy P was 2.60 and SN was 1.13. For the experiment with SK-BR-3 (Figure 4C), we calculated that P was 3.93 and SN was 1.21.

We also used the equation to calculate the mean ratios predicted for higher and lower copy numbers. We marked these predicted values on the respective graphs, from a copy number of zero to a copy number of 12, with horizontal lines that form a "copy number lattice". The segment mean values assigned to the probes are shown in genome order, embedded within the expected copy number lattice, in Figures 4B and 4D. The copy number lattice fits the minor plateaus of the data remarkably well, particularly for the higher copy numbers.
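The two-plateau calculation can be written out explicitly. Writing s = SN + 1, the equation gives RM x s - 1 = CN x (SN / P) for each plateau, so two plateaus determine s and then P. The sketch below is an illustration under that algebra only; the function names are assumptions, and the usage values are synthesized from the reported P and SN rather than taken from measured plateau ratios, which are not given in the text.

```python
def measured_ratio(true_ratio, sn):
    """RM = (Rt x SN + 1) / (SN + 1)."""
    return (true_ratio * sn + 1.0) / (sn + 1.0)


def solve_ploidy_and_sn(rm_low, cn_low, rm_high, cn_high):
    """Solve the two plateau equations for ploidy P and noise parameter SN.

    rm_low and rm_high are the measured mean ratios of two plateaus whose
    copy numbers cn_low and cn_high (non-zero, distinct) are taken as known.
    """
    # From RM_i * s - 1 = CN_i * (SN / P) for i = low, high, with s = SN + 1.
    s = (cn_high - cn_low) / (cn_high * rm_low - cn_low * rm_high)
    sn = s - 1.0
    sn_over_p = (rm_low * s - 1.0) / cn_low
    return sn / sn_over_p, sn


def copy_number_lattice(ploidy, sn, max_copies=12):
    """Predicted measured ratios for copy numbers 0 through max_copies."""
    return [measured_ratio(cn / ploidy, sn) for cn in range(max_copies + 1)]


# Round trip: synthesize plateau ratios for two and three copies from the
# reported CHTN159 values, then recover P and SN from them.
rm2 = measured_ratio(2 / 2.60, 1.13)
rm3 = measured_ratio(3 / 2.60, 1.13)
ploidy, sn = solve_ploidy_and_sn(rm2, 2, rm3, 3)
```

The list returned by `copy_number_lattice` corresponds to the horizontal lines of the copy number lattice, one per copy number from zero to 12.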
Example 8 - Analysis of Fine-Scale Genomic Lesions

We also analyzed the data to determine the precise breakpoints in individual chromosomes that had amplifications or deletions. Our analysis demonstrated that the arrays of the invention can be used to identify genomic lesions at the resolution of individual genes. Consequently, the data obtained from the arrays can be used to predict the impact of aberrations in particular genes on the conversion of a normal cell into a cancerous one.

We first analyzed a region containing a break in the X chromosome, observed in Figure 3D. The SK-BR-3 cells, which are derived from a woman, were compared with cells from an unrelated man. We expected the probes on the X chromosome to have elevated ratios. This was the case for much of the long arm of the X chromosome. But in the middle of Xq13.3, there was a sharp break in copy number over a region spanning 27 kb, and ratios close to one were observed for the rest of the chromosome (Figure 5A). Thus, it was possible to trace the boundaries of genetic lesions from the array data by segmentation. We have observed many other cases of abrupt copy number transitions that must disrupt genes. There are three or four narrow amplifications in the genome of SK-BR-3, each containing two or fewer genes, among which were transmembrane receptors.

We then analyzed the data from chromosome 8 (Figure 3B), which had an abundance of aberrations, including broad, distinct regions of amplification (Figure 5B). The rightmost peak was a segment of approximately one megabase, comprising thirty-seven probes (probe coordinates 45099-45138, June genomic coordinates 126815070-128207342). It contained a single well-characterized gene, c-myc. There was a second broad peak in SK-BR-3, rising to the left of the c-myc peak and off the graph (Figure 5B). This broad peak had a broad shoulder on its right (probe coordinates 44994-45051, June genomic coordinates 123976563-125564705), with a very narrow peak in its middle. We superimposed on this the segmentation data of the tumor genome CHTN159, which had an even broader peak spanning c-myc (probe coordinates 44996-45131, June genomic coordinates 124073565-127828283). The peak in CHTN159 also encompassed the shoulder of the second peak of SK-BR-3 (Figure 5B). Thus, the shoulder may contain candidate oncogenes that deserve attention. Within that region, in the narrow peak, we found TRC8, the target of a translocation found in hereditary renal carcinoma (Gemmill et al., Proc. Natl. Acad. Sci. USA 95: 9572-7 (1998)). These results illustrate the value of coordinating data from multiple genomes, and the need for automated methods to analyze multiple data sets.

We also analyzed a narrow deletion on chromosome 5. Figure 5C shows the results of a combined 10K (empty circles) and 85K (full circles) analysis, superimposed on a copy number lattice. A deletion was evident at both the 10K and 85K resolutions (probe coordinates 29496-29540, June genomic coordinates 14231414-15591226), but the boundaries were resolved much more clearly at 85K. This region contained TRIO, a protein that has a GEF domain, an SH3 domain and a serine threonine kinase domain (Lin and Greenberg, Cell 101: 230-42 (2000)); ANKH, a transmembrane protein (Nurnberg et al., Nat. Genet. 28: 37-41 (2001)); and FBXL, a component of the ubiquitin ligase-mediated protein degradation pathway (Ilyin et al., Genomics 67: 40-47 (2000)).

Finally, we analyzed a region of homozygous loss on chromosome 19 that affects a cluster of zinc finger proteins (Figure 5D, probe coordinates 77142-77198, June genomic coordinates 21893948-24955961). Some of these genes may encode transcription factors, whose deletion may have a role in tumorigenesis. We observed an abundance of narrow hemizygous and homozygous lesions, some of which can be attributed to normal variation. See Example 9.
Example 9 - Examination of "Normal" Genomic Variation

We also used the oligonucleotide arrays and methods of this invention to analyze copy number variation between two normal genomes, and we observed differences resulting from polymorphic variation. This analysis is important, for example, in situations where a tumor DNA sample cannot be matched to normal DNA and an unrelated normal DNA is used as a reference, because the differences observed may then result from polymorphic variation. This variation can be of two kinds: point sequence variation of the kind that creates or destroys a BglII fragment (for example, SNPs), or actual fluctuation in copy number present in the human gene pool. The former has a limited impact on analysis using the arrays of the invention, since it produces scattered "noise" that can be filtered to a large extent by statistical means.

In Figure 6A (combined data from the 10K and 85K data sets), we show that a light filtering algorithm (if a ratio was the most deviant of the surrounding four, we replaced it with the closest ratio of its two neighbors) can minimize the impact of point sequence variation and reveal cases where there is actual variation in copy number. The cloud of scattered polymorphisms present in an unfiltered sample (for example, Figure 2C2) lifts in this presentation of the data, revealing non-random clusters of deviant probe ratios, which indicate large-scale genomic differences between normal individuals.

Polymorphic variation of the scattered kind can also be filtered by serial comparison of experiments. For example, Figure 6B shows the data for SK-BR-3 compared with the normal donor, J. Doe, with the 85K ratios shown as full circles and the 10K ratios as empty circles. In the same graph, we show the ratios of J. Doe compared with another normal DNA, that of an African pygmy, as green triangles. We observed three probes with extreme ratios in the SK-BR-3-normal hybridization that can be identified as polymorphisms by comparison with the hybridization between the two normal individuals. The simplest interpretation of these data is that J. Doe is +/+, the pygmy is +/- and SK-BR-3 is -/-, where + designates the presence of a small BglII fragment (most likely a SNP in a BglII site). In general, pairwise comparisons of three genomes allow interpretable calls of allele status. Thus, these kinds of data are especially useful when a malignant genome cannot be paired with a matched normal genome.

Copy number polymorphism, however, presents a different kind of problem. Figure 6A shows large regional differences in copy number in the normal-normal comparison. We applied segmentation analysis to these data and identified multiple regions that show altered copy number between the two normal individuals. We observe approximately a dozen variable regions in any normal-normal comparison. They range from one hundred kilobases to more than one megabase in length, can occur anywhere but are more frequently observed near telomeres and centromeres, and often span known genes. Close inspections of two such regions are shown in Figure 6C and Figure 6D, with ratios as connected circles, and segmentation values as networks. In Figure 6C, the abnormal region is 135 kb on chromosome 6p21 (probe coordinates 32518-32524, June genomic coordinates 35669083-35804705), and encompasses three known genes. In Figure 6D, the region is a 620 kb region of chromosome 2p11 (probe coordinates 9927-9952, June genomic coordinates 88787694-89385815), which contains a number of variable regions of the heavy chain.

We analyzed the impact of normal-normal variation on the interpretation of cancer-normal data. In Figure 6C and Figure 6D, we overlay the segmentation values of the SK-BR-3 analyses in diagonal and vertical lines, respectively. The copy number lattice for SK-BR-3 is plotted as a lattice. Figure 6C illustrates a region in SK-BR-3 that would be called a deletion in comparison with normal. In SK-BR-3, the flanking region occurs at a copy number that we judge to be two copies per cell, and within that region the copy number is reduced to one. But the same region appears in the comparison of the pygmy DNA with normal. In Figure 6D, we observe an analogous condition on chromosome 2p11. In Figure 6D, we also plot the segmentation data for the tumor. This region is obviously abnormal there too.
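The light filtering rule described above for Figure 6A can be sketched as follows. The text does not define "most deviant" precisely; the sketch assumes it means that a ratio lies strictly above or strictly below all four surrounding ratios (two on each side in genome order). The function name and the handling of the first and last two probes, which are left unchanged, are likewise assumptions.

```python
def light_filter(ratios):
    """Replace isolated outlier ratios with the nearer immediate neighbor.

    A ratio is treated as an outlier when it is strictly the highest or
    strictly the lowest of the five-probe window centred on it (assumption).
    Decisions are made on the unfiltered values, so runs of deviant probes
    are preserved.
    """
    values = list(ratios)
    filtered = list(values)
    for i in range(2, len(values) - 2):
        centre = values[i]
        surrounding = values[i - 2:i] + values[i + 1:i + 3]
        if centre > max(surrounding) or centre < min(surrounding):
            left, right = values[i - 1], values[i + 1]
            # Keep whichever immediate neighbor is closer in value.
            if abs(left - centre) <= abs(right - centre):
                filtered[i] = left
            else:
                filtered[i] = right
    return filtered
```

With this rule an isolated deviant probe, such as one produced by a SNP in a BglII site, is smoothed away, whereas three adjacent elevated probes pass through unchanged, which is the behavior needed to reveal clusters of deviant ratios.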
Example 10 - Analysis of a Genome or a Portion Thereof

The following examples are intended to illustrate uses of the search engine. Suitable modifications and adaptations of the described conditions and parameters normally encountered in the art, which are obvious to those skilled in the art, are within the spirit and scope of the present invention.

The search engine of the present invention can be used to perform calculations on a genome or on subsets of a genome (e.g., a chromosome). When such calculations are performed, several regions are found that have high word counts but are not detected by tools such as RepeatMasker. It has now been shown that the database of repeats used by RepeatMasker does not include region-specific or chromosome-specific repeats. Using the search engine described above in section VII, such repeats are easily found, because exact match counting can form the basis of a set algebra of the genome. In particular, subsets of the genome can be made into transform strings, which are examined to find chromosome-specific repeats.
A transform string of chromosome 1 was made, and chromosome 1 was annotated with its word counts within itself and within the whole genome. A search was made for contiguous regions of chromosome 1, at least 100 bp in length, with high 18-mer counts, in which the exact matches were found to derive mainly from chromosome 1. Such regions were readily found, varying in length from 100 bp to 35 kb. Focusing on one such region, it was observed that its mer terrain was almost a step function, composed of shorter sequences, each with a modal frequency and a characteristic length. Chromosome-specific regions containing one of these sequences were collected, and a family of chromosome 1-specific sequences was quickly identified.

The chromosome 1-specific region was selected by identifying 18-mers whose chromosome 1 counts exceeded 90% of their whole-genome counts; these 18-mers were joined to create the chromosome-specific repeat. In addition, the space between joined 18-mers was not allowed to exceed 100 base pairs. At least one instance of this repeat was found to be annotated as overlapping a RefSeq gene (accession number NM_015383), with many exons that together encode a large predicted protein sequence having low homology to myosin.

The same process by which chromosome-specific repeats were identified can be applied to find repetitive DNA throughout the genome, including repeats that are not recognized by RepeatMasker or other programs.
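The selection rule just described (18-mers whose chromosome counts exceed 90% of their genome counts, joined when separated by no more than 100 base pairs, keeping regions of at least 100 bp) can be sketched as below. The word counts are supplied here as plain dictionaries, which stand in for queries to the search engine; that interface, the function name and the omission of a separate "high count" threshold are assumptions of the sketch.

```python
def chromosome_specific_regions(seq, chrom_counts, genome_counts, k=18,
                                min_fraction=0.9, max_gap=100, min_length=100):
    """Return [start, end) regions of seq built from chromosome-specific k-mers.

    chrom_counts and genome_counts map a k-mer to its exact match count in
    the chromosome and in the whole genome (hypothetical stand-ins for the
    search engine's word counts).
    """
    regions = []
    start = end = None  # region currently being extended
    for pos in range(len(seq) - k + 1):
        word = seq[pos:pos + k]
        total = genome_counts.get(word, 0)
        if total == 0 or chrom_counts.get(word, 0) <= min_fraction * total:
            continue  # not chromosome-specific
        if start is not None and pos - end <= max_gap:
            end = pos + k  # close enough to the previous k-mer: join
        else:
            if start is not None and end - start >= min_length:
                regions.append((start, end))
            start, end = pos, pos + k
    if start is not None and end - start >= min_length:
        regions.append((start, end))
    return regions
```

A toy call with 3-mers in place of 18-mers, for example `chromosome_specific_regions("AAACCCGGGTTT", {"AAA": 9, "CCC": 10, "TTT": 1}, {"AAA": 9, "CCC": 10, "TTT": 10}, k=3, max_gap=2, min_length=4)`, joins the two qualifying words into a single region covering positions 0 to 6.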
Example 11 - Probe Design Using the Search Engine

The search engine mentioned above can be used in the design of probes. Probes are generally useful for their ability to hybridize specifically to complementary DNA, and therefore one of the main objectives in probe design is to minimize cross-hybridization. Previous applications for designing probes have used repeat masking to exclude repeat regions from consideration. This type of solution is problematic, in that it does not protect against repetitive regions that masking does not recognize, such as chromosome-specific repeats, and it excludes "repetitive" regions that are in fact unique. Although the rules for hybridization between imperfectly matched sequences are not well understood, it is known in the art that probes that have exact "small" matches with multiple regions of the genome should preferably be avoided.
Previous probe design applications have chosen probes that minimize aggregate exact 12-mer match counts, but for genomic probes these methods are inadequate. First, it is not clear that exact 12-mer matches have any effect on hybridization under normally stringent annealing conditions. Nor do 12-mer counts predict homology, let alone uniqueness in the genome. In fact, a comparison of 15-mer counts with the geometric mean of the counts of their constituent 12-mers showed a poor correlation between the two in sequences that are essentially unique.

A general protocol for probe design using the search engine is as follows. First, the genome is annotated using mers of a particular length, in order to find sufficiently long stretches of uniqueness (i.e., candidate probes). Second, these candidate probes are analyzed using mers of at least one predefined length, preferably shorter than the mer length used to find the candidate probes. One of the candidate probes is then selected as the probe on the basis of having the minimum aggregate mer counts at the predefined shorter lengths.
Following the protocol mentioned above, candidate 70-mer probes were selected from small BglII fragments, using the uniqueness data obtained from 21-mer counts. Among these candidate probes, the 70-mer with the lowest sum of 15-mer counts was selected, with a cutoff value of approximately 900. Additional criteria that eliminate single-nucleotide runs and severe base composition bias were also applied to help determine which candidate probe to choose.

The selected probes were synthesized and printed on glass to test their performance under microarray hybridization conditions. It was found that substantially all the probes performed at or above the specified performance criteria. More particularly, a success rate of about 70% to about 98% was achieved with probes designed using the protocol mentioned above, where success is defined as having a substantial (e.g., large) signal-to-noise ratio.

BLAST was used to test whether the selected probes were unique within a particular published genomic sequence. 30,000 such probes were tested using the default parameters for MegaBLAST (with simple sequence filtering turned off). It was found that more than 99% of the selected probes were unique within the genome.
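The two-stage protocol of this example can be sketched as follows. The counting functions are passed in as arguments and stand in for the search engine; the naive `occurrences` helper exists only so that the sketch runs on a toy sequence. All names are assumptions, and the additional criteria mentioned above (single-nucleotide runs and base composition) are not implemented.

```python
def occurrences(text, word):
    """Naive count of (possibly overlapping) exact matches of word in text."""
    return sum(1 for i in range(len(text) - len(word) + 1)
               if text.startswith(word, i))


def pick_probe(fragment, count_long, count_short, probe_len=70,
               long_k=21, short_k=15, cutoff=900):
    """Choose the candidate probe with the lowest aggregate short-mer count.

    A window is a candidate only if every constituent long mer is unique
    (count of exactly 1). Returns (score, start, probe), or None when no
    candidate scores at or below the cutoff.
    """
    best = None
    for start in range(len(fragment) - probe_len + 1):
        probe = fragment[start:start + probe_len]
        longs = (probe[i:i + long_k] for i in range(probe_len - long_k + 1))
        if any(count_long(word) != 1 for word in longs):
            continue  # not a stretch of uniqueness
        score = sum(count_short(probe[i:i + short_k])
                    for i in range(probe_len - short_k + 1))
        if score <= cutoff and (best is None or score < best[0]):
            best = (score, start, probe)
    return best


# Toy usage with shortened lengths (6-mer probe, 4-mer uniqueness, 2-mer sum).
toy_genome = "ACACACACACGATTCAGCTA"
toy_count = lambda word: occurrences(toy_genome, word)
toy_best = pick_probe(toy_genome, toy_count, toy_count,
                      probe_len=6, long_k=4, short_k=2, cutoff=100)
```

Ties between candidates with equal scores are resolved here in favor of the leftmost window; the text does not say how ties were handled.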
Example 12 - Pseudocode Representation of the Algorithm

To further illustrate how the algorithm can be implemented to perform a word count function, refer to Figures 12A and 12B. Figure 12A graphically defines the variables and data structures used by the algorithm, and Figure 12B shows a pseudocode representation of the algorithm.

As indicated above in section VII, the transform can be used as a navigation tool for a "virtual" Genomic Dictionary or suffix array. In the simplest case, suppose one wants to determine whether a substring appears in the genome and, if so, in how many copies. In this case, assume that the substring is the single character "X". All occurrences of X appear in the dictionary as a block (for example, a search region), where Fx and Lx are the indices of the first and last occurrence of X. Fx and Lx can be derived from the alphabet bounds data structure. The size of this block (for example, search region), kx = Lx - Fx + 1, is also the number of occurrences of X. Note that this number can also be determined by counting the number of occurrences of X in the transform.
In a more difficult case, such as when words of two or more characters must be counted, Fx, Lx and kx need to be determined for each character X in the genome. To this end, Fx and Lx for each character X are stored in a data structure called the alphabet bounds. Once the alphabet bounds data structure is constructed, the algorithm can proceed to count the number of times that a particular word Z appears in the genome.

Assume that W is a suffix of Z, that W exists in the genome, and that the bounds of W are known (e.g., Fw and Lw as shown in Figure 12A). Next, it must be determined whether XW exists as a substring, where X is the character that precedes W in Z. In addition, the start and end indices (for example, Fxw and Lxw) of the XW block need to be determined. XW exists as a substring in the genome if and only if X appears in the transform between Fw and Lw. In addition, the number of occurrences of X in the "W block" of the transform, denoted kxw, is the word count of the substring XW in the genome. The start and end indices of XW can be computed using: 1) Fxw = Fx + bxw; and 2) Lxw = Fxw + kxw - 1, where bxw is the number of words beginning with X in the Genomic Dictionary that occur before XW. bxw can be determined by counting the number of occurrences of X that appear before the W block of the transform.
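The extension step just described, applied from the last character of Z back to the first, can be sketched as follows. The sketch builds an uncompressed transform for a toy sequence by sorting all suffixes and counts characters by linear scans, so it illustrates only the index arithmetic (Fxw = Fx + bxw, Lxw = Fxw + kxw - 1) and not the compressed transform or auxiliary data structure of the invention. The function names and the "$" terminator are assumptions; Figure 12B is not reproduced here.

```python
def build_transform(genome):
    """Return (transform, firsts, lasts) for a toy genome string.

    firsts[x] and lasts[x] play the role of the alphabet bounds Fx and Lx:
    the first and last dictionary index of words beginning with x.
    """
    text = genome + "$"  # terminator; sorts before A, C, G and T
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # The transform holds, for each sorted suffix, the preceding character.
    transform = "".join(text[i - 1] for i in order)
    firsts, lasts = {}, {}
    for rank, i in enumerate(order):
        firsts.setdefault(text[i], rank)
        lasts[text[i]] = rank
    return transform, firsts, lasts


def word_count(word, transform, firsts, lasts):
    """Count exact occurrences of word by extending a suffix one character at a time."""
    if not word or word[-1] not in firsts:
        return 0
    f, l = firsts[word[-1]], lasts[word[-1]]       # block of the one-character suffix
    for x in reversed(word[:-1]):
        if x not in firsts:
            return 0
        b_xw = transform[:f].count(x)              # X's before the W block
        k_xw = transform[f:l + 1].count(x)         # X's inside the W block
        if k_xw == 0:
            return 0                               # XW does not occur
        f = firsts[x] + b_xw                       # Fxw = Fx + bxw
        l = f + k_xw - 1                           # Lxw = Fxw + kxw - 1
    return l - f + 1


transform, firsts, lasts = build_transform("ACACGT")
ac_count = word_count("AC", transform, firsts, lasts)
```

For the toy sequence "ACACGT" the word "AC" is found twice and "GG" not at all. A practical implementation would replace the two `count` scans with precomputed running counts so that each extension step takes constant time.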
This procedure is repeated, lengthening the suffix one character at a time, and stopping if the suffix does not exist in the Genomic Dictionary. If the suffix W spans the entire word Z, then kw is the number of occurrences of Z in the genomic string. This procedure is outlined in the pseudocode shown in Figure 12B. With respect to Figure 12B, Z is a string of length N composed of characters of the genome alphabet, and the alphabet bounds data structure contains the indices of the first and last occurrences in the Genomic Dictionary for each character in the genome alphabet.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. All publications and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including the definitions, will control. The materials, methods and examples are illustrative only and not limiting. Throughout this specification, the word "comprise", or variations such as "comprises" or "comprising", shall be understood to imply the inclusion of a stated integer or group of integers, but not the exclusion of any other integer or group of integers.
Claims (100)
CLAIMS

1. A plurality of nucleic acid molecules, wherein: (a) the plurality consists of N nucleic acid molecules; (b) each of the plurality of nucleic acid molecules has a nucleotide sequence that hybridizes specifically to a sequence in a genome of Z base pairs; and (c) at least P% of the plurality of nucleic acid molecules (i) have a length of K nucleotides; (ii) hybridize specifically to at least one nucleic acid molecule present in, or predicted to be present in, a representation derived from the genome, the representation having no more than R% of the complexity of the genome; and (iii) have no more than X exact matches of L1 nucleotides with the genome and no fewer than Y exact matches of L1 nucleotides with the genome; and wherein: (A) N ≥ 500; (B) Z ≥ 1 x 10^8; (C) 300 ≥ K ≥ 30; (D) 70 ≥ R ≥ 0.001; (E) P = (N x R + (3 x sigma)) / N; (F) sigma is the square root of (N x R x (1 - R)); (G) the integer closest to (log4(Z) + 2) ≥ L1 ≥ the integer closest to log4(Z); (H) X is the integer closest to D1 x (K - L1 + 1); (I) Y is the integer closest to D2 x (K - L1 + 1); (J) 1.5 ≥ D1 ≥ 1; and (K) 1 ≥ D2 ≥ 0.5.

2. The plurality of nucleic acid molecules according to claim 1, wherein N is selected from the group consisting of at least 500; at least 1,000; at least 2,500; at least 5,000; at least 10,000; at least 25,000; at least 50,000; at least 85,000; at least 190,000; at least 350,000; and at least 550,000 nucleic acid molecules.

3. The plurality of nucleic acid molecules according to claim 1, wherein Z is selected from the group consisting of at least 3 x 10^8, at least 1 x 10^9, at least 1 x 10^10 and at least 1 x 10^11.

4. The plurality of nucleic acid molecules according to claim 1, wherein the genome is a mammalian genome.

5. The plurality of nucleic acid molecules according to claim 4, wherein the genome is a human genome.

6. The plurality of nucleic acid molecules according to claim 1, wherein R is selected from the group consisting of 0.001, 1, 2, 4, 10, 15, 20, 30, 40, 50 and 70.

7. The plurality of nucleic acid molecules according to claim 1, wherein P is selected from the group consisting of at least 70, at least 80, at least 90, at least 95, at least 97 and at least 99.

8. The plurality of nucleic acid molecules according to claim 1, wherein D1 is 1.

9. The plurality of nucleic acid molecules according to claim 1, wherein D2 is 1.

10. The plurality of nucleic acid molecules according to claim 1, wherein L1 is selected from the group consisting of 15, 16, 17, 18, 19, 20, 21, 22, 23 and 24.

11. The plurality of nucleic acid molecules according to claim 1, wherein each of the P% of the plurality of nucleic acid molecules additionally has no more than A exact matches of L2 nucleotides with the genome and no fewer than B exact matches of L2 nucleotides with the genome; and wherein: (a) L1 > L2 ≥ the integer closest to log4(Z) - 3; (b) A is the integer closest to D3 x ((K - L2 + 1) x (Z / 4^L2)); (c) B is the integer closest to D4 x ((K - L2 + 1) x (Z / 4^L2)); (d) 4 ≥ D3 ≥ 1; and (e) 1 ≥ D4 ≥ 0.5.

12. The plurality of nucleic acid molecules according to claim 11, wherein D3 < 3, 2 or 1.5.

13. The plurality of nucleic acid molecules according to claim 1, wherein the P% of the plurality of nucleic acid molecules have at least 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% sequence identity with at least one nucleic acid molecule present in, or predicted to be present in, the representation.

14. The plurality of nucleic acid molecules according to claim 1, wherein K is selected from the group consisting of 40, 50, 60, 70, 80, 90, 100, 110, 120, 140, 160, 180, 200 and 250.

15. A plurality of nucleic acid molecules, wherein: (a) the plurality consists of at least 100 nucleic acid molecules; (b) each of the plurality of nucleic acid molecules has a nucleotide sequence that is at least 90% identical to a sequence in a genome of at least Z base pairs; and (c) at least P% of the plurality of nucleic acid molecules have (i) a length of K nucleotides; (ii) at least 90% sequence identity with at least one nucleic acid molecule present in, or predicted to be present in, a representation derived from the genome, the representation having no more than R% of the complexity of the genome; and (iii) no more than X exact matches of L1 nucleotides with the representation and no fewer than Y exact matches of L1 nucleotides with the representation; and wherein: (A) Z ≥ 1 x 10^8; (B) 300 ≥ K ≥ 30; (C) 70 ≥ R ≥ 0.001; (D) P ≥ 90 - R; (E) the integer closest to (log4((Z x R) / 100) + 2) ≥ L1 ≥ the integer closest to log4((Z x R) / 100); (F) X is the integer closest to D1 x (K - L1 + 1); (G) Y is the integer closest to D2 x (K - L1 + 1); (H) 1.5 ≥ D1 ≥ 1; and (I) 1 ≥ D2 ≥ 0.5.

16. The plurality of nucleic acid molecules according to claim 15, comprising at least 500; at least 1,000; at least 2,500; at least 5,000; at least 10,000; at least 25,000; at least 50,000; at least 85,000; at least 190,000; at least 350,000; or at least 550,000 nucleic acid molecules.

17. The plurality of nucleic acid molecules according to claim 15, wherein Z is selected from the group consisting of at least 3 x 10^8, at least 1 x 10^9, at least 1 x 10^10 and at least 1 x 10^11.

18. The plurality of nucleic acid molecules according to claim 15, wherein the genome is a mammalian genome.

19. The plurality of nucleic acid molecules according to claim 18, wherein the genome is a human genome.

20. The plurality of nucleic acid molecules according to claim 15, wherein R is selected from the group consisting of 0.001, 1, 2, 4, 10, 15, 20, 30, 40, 50 and 70.
21. The plurality of nucleic acid molecules according to claim 15, wherein P is selected from the group consisting of at least 70, at least 80, at least 90, at least 95, at least 97 and at least 99. 22. The plurality of molecules of nucleic acid according to claim 15, wherein Di is 1. 23. The plurality of nucleic acid molecules P05 / 086 / CSH according to claim 15, wherein D2 is 1. 24. The plurality of nucleic acid molecules according to claim 15, wherein Li is selected from the group consisting of 15, 16, 17, 18, 19, 20, 21, 22, 23 and 24. 25. The plurality of nucleic acid molecules according to claim 15, wherein each P% of the plurality of nucleic acid molecules in addition, has no more than A exact correspondences of L2 nucleotides with the genome and not less than B exact correspondences of L2 nucleotides with the genome; and where (a) Li > L2 > the integer closest to log4 (Z) -3; (b) A is the nearest integer to D3 x ((K-L2 + l) x (Z / 4L2)) (c) B is the nearest integer to D4 x ((K-L2 + l) x (Z / 4L2)) (d) 4 > D3 > 1; and (e) 1 > D4 > 0 5 . 26. The plurality of nucleic acid molecules according to claim 15, wherein the P% of the plurality of nucleic acid molecules have at least 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% sequence identity with at least one nucleic acid molecule present or predicted as being present in the representation. P05 / 086 / CSH 27. The plurality of nucleic acid molecules according to claim 15, wherein K is selected from the group consisting of 40, 50, 60, 70, 80, 90, 100, 110, 120, 140, 160, 180, 200 and 250. 28. The plurality of nucleic acid molecules according to claim 1, wherein the representation is produced by the specific cleavage of the sequence to the genome. 29. The plurality of nucleic acid molecules according to claim 28, wherein the specific cleavage of the sequence is achieved with a restriction endonuclease. 30. 
The plurality of nucleic acid molecules according to claim 1, wherein the representation is a composite representation. 31. The plurality of nucleic acid molecules according to claim 1, wherein the plurality of nucleic acid molecules are immobilized on the surface of a solid phase. 32. The plurality of nucleic acid molecules according to claim 31, wherein the solid phase is selected from the group consisting of a nylon membrane, a nitrocellulose membrane, a glass slide and a microsphere. 33. The plurality of nucleic acid molecules P05 / 086 / CSHL according to claim 31, wherein the positions of the plurality of nucleic acid molecules in the solid phase are known. 34. The plurality of nucleic acid molecules according to claim 33, wherein the plurality of nucleic acid molecules is in a microarray. 35. The plurality of nucleic acid molecules according to claim 33, wherein the plurality of nucleic acid molecules are immobilized in microspheres. 36. A method for analyzing a nucleic acid sample, the method comprising: (a) hybridizing the sample to a plurality of nucleic acid molecules according to claim 1; and (b) determining which of the plurality of nucleic acid molecules the sample hybridizes to. 37. The method according to claim 36, wherein the sample is a representation. 38. The method according to claim 36, wherein the plurality of nucleic acid molecules is immobilized on the surface of a solid phase. 39. The method according to claim 38, wherein the solid phase is selected from the group consisting of a nylon membrane, a nitrocellulose membrane, a glass slide and a microsphere. P05 / 086 / CSHL 40. The method according to claim 38, wherein the positions of the plurality of nucleic acid molecules in the solid phase are known. 41. The method according to claim 40, wherein the plurality of nucleic acid molecules is in a microarray. 42. The method according to claim 38, wherein the plurality of nucleic acid molecules is immobilized in microspheres. 43. 
A method for analyzing the copy number variation of a genomic sequence between two genomes, the method comprising: (a) providing a first genome and a second genome; (b) preparing detectably labeled representations of each genome, using at least one identical restriction enzyme; (c) contacting the representations with the plurality of nucleic acid molecules according to claim 1 or 31, to allow hybridization between the representations and the plurality of nucleic acid molecules; and (d) comparing the levels of the hybridization of the representations, where a difference in the levels indicates a variation of the number of copies between the P05 / 086 / CSH two genomes, with respect to a genomic sequence identified by the member. 44. The method according to claim 43, wherein the two representations are distinguishably marked. 45. The method according to claim 44, wherein the representations are contacted simultaneously with the plurality of nucleic acid molecules. 46. A method for comparing the methylation status of a genomic sequence between two genomes, the method comprising: (a) providing a first genome and a second genome; (b) preparing detectably labeled representations of each genome, using at least one identical enzyme, wherein the representations are prepared by a method sensitive to methylation; (c) contacting the representations with the plurality of nucleic acid molecules according to claim 1 or 31, to allow hybridization between the representations and the plurality of nucleic acid molecules; and (d) compare the levels of hybridization of the representations, where a difference in the P05 / 086 / CSHL levels indicates a difference in the methylation status between the two genomes, with respect to a genomic sequence identified by the member. 47. 
The method according to claim 46, wherein the methylation-sensitive method involves preparing a first representation using a first restriction enzyme and a second representation using a second restriction enzyme, wherein the first and second restriction enzymes recognize the same restriction site, but one is sensitive to methylation and the other is not. 48. The method according to claim 46, wherein the methylation-sensitive method involves the chemical cleavage of methyl-C sequences after making a representation with a restriction enzyme not sensitive to methylation, so that a representation derived from a methylated genome is distinguishable from a representation derived from a non-methylated genome. 49. A method for identifying an oligonucleotide which has: (a) a length of K nucleotides; (b) at least 90% sequence identity with at least one nucleic acid molecule present in, or predicted to be present in, a representation derived from a genome of at least Z base pairs; and (c) no more than X exact matches of L1 nucleotides to the genome and no fewer than Y exact matches of L1 nucleotides to the genome; where: (i) Z > 1 × 10^8; (ii) 300 > K > 30; (iii) the nearest integer to (log(Z) + 2) > L1 > the nearest integer to log(Z); (iv) X is the nearest integer to D1 × (K − L1 + 1); (v) Y is the nearest integer to D2 × (K − L1 + 1); (vi) 1.5 > D1 > 1; and (vii) 1 > D2 > 0.5; the method comprising: (A) cleaving the genome in silico with a restriction enzyme to generate a plurality of predicted nucleic acid molecules; (B) generating a virtual representation of the genome by identifying the predicted nucleic acid molecules each having a length of 200-1,200 base pairs, inclusive; (C) selecting an oligonucleotide having a length of 30-300 nucleotides, inclusive, and at least 90% sequence identity to a nucleic acid molecule predicted in (B); (D) identifying all the stretches of L1 nucleotides that appear in the oligonucleotide; and (E) confirming that the number of times each of the stretches appears in the genome satisfies the requirements of (c). 50. The method according to claim 49, wherein step (E) comprises: providing a compressed transform of the genome; providing an auxiliary data structure that includes information related to the genome; and determining a word count for the stretches of L1 nucleotides using the compressed transform and the auxiliary data structure. 51.
The method according to claim 49, wherein step (E) comprises: providing a compressed transform of the genome; iterating through each nucleotide of the stretch of L1 nucleotides, starting with the last nucleotide and advancing toward the first nucleotide one character per iteration, wherein the nucleotide corresponding to a particular iteration is stored as an index nucleotide, the iteration further comprising: defining a search region that delineates a contiguous range of nucleotides within the transform; counting the number of times that the nucleotide preceding the index nucleotide appears in the search region; and wherein the iteration ceases if the nucleotide preceding the index nucleotide does not occur in the search region; and extracting the number of times that the first nucleotide of the stretch of L1 nucleotides is counted, this number being equal to the number of times that the stretch of L1 nucleotides appears in the genome. 52. The method according to claim 51, further comprising: providing an auxiliary data structure, the auxiliary data structure comprising: a K-interval data structure that maintains a running total of each nucleotide that has appeared in the transform up to and including a particular predetermined location in the compressed transform; and a dictionary count data structure that provides fast lookup access to the compressed transform; and wherein the counting and defining are performed using the auxiliary data structure and the compressed transform. 53. The method according to claim 52, wherein the transform remains compressed while the counting is performed. 54. The method according to claim 52, wherein the compressed transform is compressed such that every three characters in the uncompressed transform are compressed to form a byte, and wherein the counting decompresses at least one byte during one of the iterations. 55.
The method according to claim 51, wherein the genome comprises at least three billion characters. 56. The method according to claim 51, wherein the compressed transform is a Burrows-Wheeler transform of the genome. 57. The method according to claim 51, further comprising providing data that is based on the transform, wherein the defining comprises using the data and the index nucleotide to define the search region. 58. The method according to claim 51, further comprising: providing data that is based on the transform; and determining a previous nucleotide count, the previous nucleotide count being the number of times that the nucleotide preceding the index nucleotide appears in the transform before the start of the search region; wherein the defining comprises using the data, the index nucleotide, and the previous nucleotide count to define the search region. 59. The method according to claim 58, wherein the previous nucleotide count is obtained using K intervals, the K intervals being stored at predetermined locations along the transform and maintaining a running total of each nucleotide that has appeared in the transform up to and including a particular predetermined location. 60. A plurality of oligonucleotides, each of which is produced by the method according to claim 49, the plurality comprising at least 500 oligonucleotides. 61. A plurality of oligonucleotides, each of which is produced by the method according to claim 49, the plurality comprising at least 1,000; at least 2,500; at least 5,000; at least 10,000; at least 25,000; at least 50,000; at least 85,000; at least 190,000; at least 350,000; or at least 550,000 oligonucleotides. 62.
A method for analyzing a nucleotide sequence, the nucleotide sequence comprising a series of characters, the method comprising: dividing the nucleotide sequence into a plurality of words of a predetermined length, each word being a subregion of the nucleotide sequence having the predetermined length; and determining a word count for each word by counting the number of times each word appears in the nucleotide sequence. 63. The method according to claim 62, wherein the words overlap. 64. The method according to claim 62, wherein the determining comprises using a word counting algorithm that uses a compressed transform of the nucleotide sequence to count how many times each word appears in the nucleotide sequence. 65. The method according to claim 64, wherein the word counting algorithm comprises: iterating through each character of one of the words, starting with the last character and advancing toward the first character one character per iteration, wherein the character corresponding to a particular iteration is stored as an index character, the iteration further comprising: defining a search region that delineates a contiguous range of characters within the transform; counting the number of times the character preceding the index character occurs in the search region; and wherein the iteration ceases if the character preceding the index character does not occur in the search region; and extracting the number of times the first character is counted, this number being equal to the number of times a particular word appears in the nucleotide sequence. 66. The method according to claim 62, further comprising performing a statistical analysis of the word counts obtained for each word. 67.
The method according to claim 62, further comprising: dividing the nucleotide sequence into a second plurality of words of a second predetermined length, each of the second plurality of words being a subregion of the nucleotide sequence having the second predetermined length; and determining a word count for each of the second plurality of words by counting the number of times each of the second plurality of words appears in the nucleotide sequence. 68. The method according to claim 62, wherein the nucleotide sequence is a genome. 69. A system for analyzing a nucleotide sequence, the nucleotide sequence comprising a series of characters, the system comprising user equipment configured to: divide the nucleotide sequence into a plurality of words of a predetermined length, each word being a subregion of the nucleotide sequence having the predetermined length; and determine a word count for each word by counting the number of times the word appears in the nucleotide sequence. 70. The system according to claim 69, wherein the words overlap. 71. The system according to claim 69, wherein the user equipment is configured to use a word counting algorithm that uses a compressed transform of the nucleotide sequence to count how many times each word appears in the nucleotide sequence. 72. The system according to claim 71, wherein the user equipment is further configured to: iterate through each character of one of the words, starting with the last character and advancing toward the first character one character per iteration, wherein the character that corresponds to a particular iteration is stored as an index character, the user equipment being further configured to iterate by repeating steps that: define a search region that delineates a contiguous range of characters within the transform; count the number of times the character preceding the index character occurs in the search region; and wherein the iteration ceases if the character preceding the index character does not occur in the search region; and extract the number of times the first character is counted, this number being equal to the number of times a particular word appears in the nucleotide sequence. 73. The system according to claim 69, wherein the user equipment is configured to perform a statistical analysis of the word counts obtained for each word. 74. The system according to claim 69, wherein the user equipment is configured to: divide the nucleotide sequence into a second plurality of words of a second predetermined length, each of the second plurality of words being a subregion of the nucleotide sequence that has the second predetermined length; and determine a word count for each of the second plurality of words by counting the number of times each of the second plurality of words appears in the nucleotide sequence. 75. The system according to claim 69, wherein the nucleotide sequence is a genome. 76. A method for selecting a polynucleotide having minimal potential for cross-hybridization to unwanted regions of a nucleotide sequence, the method comprising: selecting a plurality of polynucleotides of a predetermined length that exist within the nucleotide sequence; generating statistical data for each polynucleotide; and determining which of the polynucleotides has statistical data that best satisfies predetermined criteria. 77.
The method according to claim 76, wherein the generating comprises: dividing each polynucleotide into a plurality of words of a predetermined length, each word being a subregion of the polynucleotide having the predetermined length; and determining a word count for each word by counting the number of times each word appears in the nucleotide sequence. 78. The method according to claim 76, wherein the statistical data represents the number of times that the constituent words of each polynucleotide appear in the nucleotide sequence. 79. The method according to claim 76, wherein the predetermined criteria comprise a minimum average value of the word counts of a predetermined length, a geometric mean value of the word counts of a predetermined length, a mode value of the word counts of a predetermined length, a minimized maximum value of the word counts of a predetermined length, a total sum value of the word counts of a predetermined length, a product value of the word counts of a predetermined length, a maximum-length run of a particular nucleotide, or a combination thereof. 80. The method according to claim 76, wherein the selecting comprises: generating word counts of a particular word, having a particular length, that appears in the nucleotide sequence; and obtaining the polynucleotides from regions of the nucleotide sequence such that the word counts for substrings within the regions do not exceed a predetermined word count. 81. A system for selecting a polynucleotide having minimal potential for cross-hybridization to unwanted regions of a nucleotide sequence, the system comprising user equipment configured to: select a plurality of polynucleotides of a predetermined length that exist within the nucleotide sequence; generate statistical data for each polynucleotide; and determine which of the polynucleotides has statistical data that best satisfies predetermined criteria. 82.
The system according to claim 81, wherein the user equipment is configured to: divide each polynucleotide into a plurality of words of a predetermined length, each word being a subregion of the polynucleotide having the predetermined length; and determine a word count for each word by counting the number of times each word appears in the nucleotide sequence. 83. The system according to claim 81, wherein the statistical data represents the number of times that the constituent words of each polynucleotide appear in the nucleotide sequence. 84. The system according to claim 81, wherein the predetermined criteria comprise a minimum average value of the word counts of a predetermined length, a geometric mean value of the word counts of a predetermined length, a mode value of the word counts of a predetermined length, a minimized maximum value of the word counts of a predetermined length, a total sum value of the word counts of a predetermined length, a product value of the word counts of a predetermined length, a maximum-length run of a particular nucleotide, or a combination thereof. 85. The system according to claim 81, wherein the user equipment is configured to: generate word counts of a particular word, having a particular length, that appears in the nucleotide sequence; and obtain the polynucleotides from regions of the nucleotide sequence such that the word counts for substrings within the regions do not exceed a predetermined word count. 86.
A method for counting the number of times a word appears in a genome, wherein the word comprises a series of characters, the method comprising: providing a compressed transform of the genome; iterating through each character of the word, starting with the last character and advancing toward the first character one character per iteration, wherein the character that corresponds to a particular iteration is stored as an index character, the iteration further comprising: defining a search region that delineates a contiguous range of characters within the transform; counting the number of times the character preceding the index character appears in the search region; and wherein the iteration ceases if the character preceding the index character does not occur in the search region; and extracting the number of times the first character of the word is counted, this number being equal to the number of times the word appears in the genome. 87. The method according to claim 86, further comprising: providing an auxiliary data structure, the auxiliary data structure comprising: a K-interval data structure that maintains a running total of each character that has appeared in the transform up to and including a particular predetermined location in the compressed transform; and a dictionary count data structure that provides fast lookup access to the compressed transform; and wherein the counting is performed using at least the K-interval data structure and the dictionary count data structure. 88. The method according to claim 87, wherein the transform remains compressed while the counting is performed. 89. The method according to claim 87, wherein the compressed transform is compressed such that every three characters in the uncompressed transform are compressed to form a byte, and wherein the counting decompresses at least one such byte during one of the iterations. 90.
The method according to claim 86, wherein the compressed transform of the genome is derived using a compression ratio of 3 to 1. 91. The method according to claim 86, wherein the genome comprises at least one million characters. 92. The method according to claim 86, wherein the genome comprises at least four million characters. 93. The method according to claim 86, wherein the genome comprises at least one hundred million characters. 94. The method according to claim 86, wherein the genome comprises at least three billion characters. 95. The method according to claim 86, wherein the word comprises at least 15 characters. 96. The method according to claim 86, wherein the compressed transform is a Burrows-Wheeler transform of the genome. 97. The method according to claim 86, further comprising providing data that is based on the transform, wherein the defining comprises using the data and the index character to define the search region. 98. The method according to claim 86, further comprising: providing data that is based on the transform; and determining a previous character count, the previous character count being the number of times that the character preceding the index character appears in the transform before the beginning of the search region; wherein the defining comprises using the data, the index character and the previous character count to define the search region. 99. The method according to claim 98, wherein the previous character count is obtained using K intervals, the K intervals being stored at predetermined locations along the transform and maintaining a running total of each character that has appeared in the transform up to and including a particular predetermined location. 100. A system comprising user equipment that is configured to perform a method according to any one of claims 86-99.
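The word-counting method recited in claims 86-99 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation described in the patent: the Burrows-Wheeler transform is built naively from sorted rotations and kept uncompressed, the three-characters-per-byte packing and the dictionary count structure are omitted, and the names bwt, WordCounter, occ, count and the checkpoint spacing k are hypothetical. The checkpoints play the role of the K intervals (running totals stored at fixed locations along the transform), and occ(c, lo) corresponds to the previous character count before the start of the search region.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    s = text + "$"  # sentinel; "$" sorts before A, C, G and T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


class WordCounter:
    """Counts exact occurrences of a word by backward search over the transform."""

    def __init__(self, genome, k=4):
        self.k = k  # assumed spacing between stored running totals
        self.transform = bwt(genome)
        totals = Counter(self.transform)
        # first[c]: how many characters in the transform sort before c
        self.first = {}
        running = 0
        for c in sorted(totals):
            self.first[c] = running
            running += totals[c]
        # checkpoints[j][c]: occurrences of c in transform[:j * k]
        counts = {c: 0 for c in totals}
        self.checkpoints = [dict(counts)]
        for i, c in enumerate(self.transform, start=1):
            counts[c] += 1
            if i % k == 0:
                self.checkpoints.append(dict(counts))

    def occ(self, c, i):
        """Occurrences of c in transform[:i], using the nearest checkpoint."""
        j = i // self.k
        return self.checkpoints[j][c] + self.transform[j * self.k:i].count(c)

    def count(self, word):
        """Number of (possibly overlapping) occurrences of word in the genome."""
        if not word:
            raise ValueError("word must be non-empty")
        lo, hi = 0, len(self.transform)  # half-open search region
        for c in reversed(word):  # last character first
            if c not in self.first:
                return 0
            lo = self.first[c] + self.occ(c, lo)
            hi = self.first[c] + self.occ(c, hi)
            if lo >= hi:  # the character does not occur in the region
                return 0
        return hi - lo


counter = WordCounter("ACGTACGTTACG")
print(counter.count("ACG"))  # 3
print(counter.count("TT"))   # 1
print(counter.count("GGG"))  # 0
```

Each iteration narrows the search region using only the transform and the stored running totals, so the cost of a query grows with the word length rather than with the genome length. A production version would replace the rotation sort with a suffix-array construction and would operate on the packed transform.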
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US60/472,845 | 2003-05-23 | ||
US60/472,843 | 2003-05-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
MXPA05012638A (en) | 2007-04-20 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
RU2390561C2 (en) | Virtual sets of fragments of nucleotide sequences | |
US8685642B2 (en) | Allele-specific copy number measurement using single nucleotide polymorphism and DNA arrays | |
Grün et al. | Design and analysis of single-cell sequencing experiments | |
Lucito et al. | Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation | |
JP5171037B2 (en) | Expression profiling using microarrays | |
AU745862C (en) | Contiguous genomic sequence scanning | |
US20040077003A1 (en) | Composition for the detection of blood cell and immunological response gene expression | |
JP2009232865A (en) | Probe array for distinguishing dna, and method of using probe array | |
JP2002525127A (en) | Methods and products for genotyping and DNA analysis | |
JP2004504059A (en) | Method for analyzing and identifying transcribed gene, and finger print method | |
US20030049663A1 (en) | Use of reflections of DNA for genetic analysis | |
CN113227393A (en) | Methods, compositions, and systems for calibrating epigenetic zoning assays | |
US20070148636A1 (en) | Method, compositions and kits for preparation of nucleic acids | |
MXPA05012638A (en) | Virtual representations of nucleotide sequences | |
US10927405B2 (en) | Molecular tag attachment and transfer | |
US20080102452A1 (en) | Control nucleic acid constructs for use in analysis of methylation status | |
US20070134678A1 (en) | Comparative genome hybridization of organelle genomes | |
EP1207209A2 (en) | Methods using arrays for detection of single nucleotide polymorphisms | |
Ganova-Raeva | Artificial Nucleic Acids and Genome Profiling | |
Hindmarch | Transcriptome Analysis: Microarrays | |
Edwards et al. | Mutation and Polymorphism Detection |