WO2006109535A1 - Dna sequence analyzer and method and program for analyzing dna sequence - Google Patents

DNA sequence analyzer and method and program for analyzing DNA sequence

Info

Publication number
WO2006109535A1
Authority
WO
WIPO (PCT)
Prior art keywords
tag
dna sequence
control
tags
genomic dna
Prior art date
Application number
PCT/JP2006/306012
Other languages
French (fr)
Japanese (ja)
Inventor
Hiroaki Mita
Takashi Tokino
Kohzoh Imai
Original Assignee
Sapporo Medical University
Priority date
Filing date
Publication date
Application filed by Sapporo Medical University
Publication of WO2006109535A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813: Hybridisation assays
    • C12Q1/6841: In situ hybridisation

Definitions

  • DNA sequence analyzer, DNA sequence analysis method, and program
  • the present invention relates to a DNA sequence analyzing apparatus, a DNA sequence analyzing method, and a program.
  • the prior art described in the above literature has room for improvement in the following points.
  • the above-mentioned CGH method, RDA method, and classical cytogenetic methods using metaphase chromosomes have a resolution limit of about 20 Mb, and are therefore difficult to use for analyzing copy number changes confined to smaller regions.
  • the resolution of the CGH method has recently been increasing with the shift to microarray-based methods. However, the number of sequences that can be analyzed is limited, and special equipment is required. To overcome these problems, a method that can identify genomic DNA sequence copy number abnormalities with high resolution is desired.
  • Non-Patent Document 1 also excludes tags derived from repetitive sequences from the analysis. This means that information about regions containing repetitive sequences, which occupy about 45% of the genome, is discarded. Therefore, these analysis methods have room for further improvement in terms of the reliability of the analysis results.
  • the present invention has been made in view of the above circumstances, and an object of the present invention is to provide a technique capable of reliably identifying an abnormal copy number of a genomic DNA sequence with high resolution.
  • the DNA sequence analyzing apparatus of the present invention includes a control tag data acquisition unit for acquiring control tag data in which a plurality of control tags, obtained by cleaving a control genomic DNA sequence with a restriction enzyme, are associated with their corresponding positions in the control genomic DNA sequence. Each control tag occurs in the control genomic DNA sequence a predetermined number of times or fewer and consists of a DNA sequence having a number of bases in a predetermined range.
  • the apparatus includes an analysis target tag data acquisition unit for acquiring analysis target tag data, which is a set of a plurality of analysis target tags obtained by cleaving an analysis target genomic DNA sequence with the same restriction enzyme, each consisting of a DNA sequence having a number of bases in a predetermined range.
  • the apparatus includes a corresponding tag data generation unit for comparing the control tag data with the analysis target tag data to generate corresponding tag data in which corresponding tags among the control tags and the analysis target tags are associated with each other.
  • the apparatus includes a copy number determination unit for analyzing the corresponding tag data, determining the number of analysis target tags corresponding to each control tag, and, based on this number, determining the copy number difference relative to the control genomic DNA sequence in the region of the analysis target genomic DNA sequence that includes the portion corresponding to the control tag.
  • the apparatus includes an output unit for outputting the data processed by the copy number determination unit.
  • according to this configuration, analysis target tags consisting of short fragments obtained by treating the analysis target genomic DNA sequence with a restriction enzyme are counted as representatives of the genome, with virtual tags derived from the control genomic DNA sequence serving as the control.
  • in addition, since a plurality of control tags each occurring in the control genomic DNA sequence a predetermined number of times or fewer are used, any tag that is highly unique within the control genomic DNA sequence, that is, whose number of occurrences is less than or equal to the predetermined number, can be suitably used even if it includes a repetitive sequence. Therefore, according to this configuration, information on regions containing repetitive sequences, which occupy a certain proportion of the control genomic DNA sequence, can be used, improving the reliability of the analysis result.
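  • The selection of control tags by occurrence count can be illustrated with a minimal Python sketch. The function name `select_control_tags`, the parameter `max_occurrences` (standing in for the "predetermined number"), and the example sequences are all hypothetical and are not taken from the specification.

```python
from collections import Counter


def select_control_tags(tags, max_occurrences=1):
    """Keep tags whose sequence occurs at most `max_occurrences` times.

    tags: list of (sequence, position) pairs from the control genome.
    max_occurrences: hypothetical stand-in for the "predetermined number".
    """
    counts = Counter(seq for seq, _ in tags)
    return [(seq, pos) for seq, pos in tags if counts[seq] <= max_occurrences]


# Illustrative data only: the first and third tags share a sequence.
tags = [("GATCAAAT", 10), ("GATCCCGT", 55), ("GATCAAAT", 300)]
print(select_control_tags(tags, max_occurrences=1))
```

  • Note that the filter looks only at how often a sequence occurs, not at whether it overlaps an annotated repeat, which is what allows sufficiently unique repeat-derived tags to be kept.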
  • an abnormal copy number of a genomic DNA sequence can be reliably identified with high resolution.
  • the DNA sequence analyzing apparatus is one aspect of the present invention; the DNA sequence analyzing method, the DNA sequence analyzing system, and the DNA sequence analyzing program of the present invention, a recording medium containing the program, and the like have the same configuration.
  • FIG. 1 is a conceptual diagram for explaining the principle of digital genome scanning and the outline of chromosome abnormality analysis using virtual tags.
  • FIG. 2 is a functional block diagram showing the overall configuration of the DNA sequence analysis system 1000.
  • FIG. 3 is a functional block diagram showing the overall configuration of a DNA sequence analysis system 2000, which is a modification of the DNA sequence analysis system 1000.
  • FIG. 4 is a functional block diagram showing the internal configuration of the DNA sequence analyzer 100.
  • FIG. 5 is a functional block diagram showing an internal configuration of a DNA sequence analyzer 400 which is a modification of the DNA sequence analyzer 100.
  • FIG. 6 is a flowchart for explaining the operation of the DNA sequence analysis system 1000.
  • FIG. 7 is a functional block diagram showing an internal configuration of a corresponding tag data generation unit 210.
  • FIG. 8 is a flowchart for explaining the operation of the corresponding tag data generation unit 210.
  • FIG. 9 is a functional block diagram showing the internal configuration of the copy number determination unit 214.
  • FIG. 10 is a flowchart for explaining the operation of the copy number determination unit 214.
  • FIG. 11 is a conceptual diagram for explaining a data visualization image in units of virtual tags.
  • FIG. 12 is a conceptual diagram for explaining a data visualization image in units of virtual tags.
  • FIG. 12 is a functional block diagram for explaining an internal configuration of a control tag data generation device 200.
  • FIG. 15 is a graph showing the number of tags by size generated by MboI.
  • FIG. 16 is a diagram showing the number of tags when a width is given to the tag size.
  • FIG. 17 is a graph showing the MboI (^GATC) tag distribution.
  • FIG. 18 is a diagram showing the number of effective virtual tags generated by MboI.
  • FIG. 19 is a conceptual diagram showing an image of DGS Monte Carlo simulation.
  • FIG. 20 is a diagram for explaining the details of the DGS Monte Carlo simulation.
  • FIG. 21 is a screen display diagram showing a user interface for DGS Monte Carlo simulation.
  • FIG. 22 is a summary of DGS simulation results, showing the anomaly detection resolution for 165 and 845 MboI non-repeat virtual tags and the number of analysis tags required.
  • FIG. 25 is a diagram for explaining the verification of the uniqueness in the repeat tag derived from the repeat sequence.
  • FIG. 26 is a diagram for examining whether or not the tag cutout size should be shifted longer.
  • FIG. 27 is a diagram for explaining a case where an MboI virtual tag (one-end repeat) spanning a repeat region and a non-repeat region is regarded as an effective tag.
  • FIG. 28 is a flowchart for explaining the operation of the control tag data generating apparatus 200.
  • FIG. 29 is a functional block diagram for explaining the internal configuration of the analysis target tag data generation device 300.
  • FIG. 30 is a diagram for explaining extraction of tag DNA and production of concatamers.
  • FIG. 32 is a conceptual diagram for explaining an operation flow of the analysis target tag data generation device 300.
  • FIG. 33 is a conceptual diagram for explaining re-extension of concatamers.
  • FIG. 35 is a sequence diagram of a DNA sequence for explaining a method of grasping a concatamer structure.
  • FIG. 36 is the sequence map obtained when the vector sequence is removed from the sequence map of FIG. 35.
  • FIG. 37 is a sequence map for explaining the state of tag extraction from the sequence map of FIG.
  • FIG. 38 is a flowchart for explaining the operation of the analysis target tag data generating apparatus 300.
  • FIG. 39 is a conceptual diagram for explaining the flow of automatic tag analysis.
  • FIG. 40 is a conceptual diagram for explaining a flow of classifying tags.
  • FIG. 41 is an electropherogram for explaining the purification of tags from the HSC45 genome and the production of concatemers.
  • FIG. 42 is a graph showing the tag size distribution and the repeat/unique classification.
  • FIG. 44 is a diagram showing a breakdown of raw tags acquired from HSC45.
  • FIG. 45 is a graph showing the tag density calculated by setting the window size.
  • FIG. 46 is a graph and a physical map showing a region showing an abnormal tag density.
  • FIG. 47 is a diagram showing a breakdown of the size and number of MboI raw tags obtained when DGS analysis was performed using genomic DNA of gastric cancer cell lines.
  • FIG. 48 is a genome-wide tag density graph obtained when DGS analysis was performed using genomic DNA of a gastric cancer cell line.
  • FIG. 49 is a diagram showing genome amplification at 8q24.21 on the long arm of chromosome 8.
  • FIG. 50 shows the relationship between c-myc genomic amplification and mRNA overexpression.
  • FIG. 51 is a diagram showing genome amplification at 12p12.1 on the short arm of chromosome 12.
  • FIG. 52 is a tag map showing the distribution of raw tags in a 3 Mb region centered on the K-ras gene.
  • FIG. 53 is a diagram showing genome amplification of a K-ras region.
  • FIG. 54 shows the relationship between K-ras genomic amplification and mRNA and protein overexpression.
  • FIG. 55 is a diagram showing an outline of a DGS analysis system.
  • Genome scanning is a method for comprehensive analysis of genetic information on the genome.
  • Arbitrarily primed PCR (AP-PCR)
  • Restriction AP-PCR
  • LGS Landmark Genome Scanning
  • 1% The limit is to analyze about 1%.
  • Comparative genomic hybridization is mainly used to detect amplified or deleted regions in the chromosome of tumor cells. This is a competitive fluorescence in situ hybridization method.
  • SAGE (serial analysis of gene expression) is a method for simultaneously detecting the expression of a large number of mRNA transcripts.
  • Concatemer: a concatemer is a group of DNA fragments linked in series by DNA ligase (a ligation enzyme) or the like.
  • FIG. 1 is a conceptual diagram for explaining the principle of digital genome scanning and the outline of chromosome aberration analysis using virtual tags. Note that this is only an overview.
  • control tag data serving as genome information of a predetermined species such as a human is prepared.
  • an algorithm for tag extraction is created.
  • the control genomic DNA sequence data is cleaved at the cleavage sites of a predetermined restriction enzyme, and control tags, which are virtual tags, are created and stored in a database.
  • control tag, which is the virtual tag
  • sequence text of the genome of a predetermined biological species such as a human
  • control tag data is obtained by linking the database of control tags, which are virtual tags, to location information on the genome of a given species such as a human (location information in the control genomic DNA sequence) via the sequence text.
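  • As an illustration of how virtual control tags could be generated together with their positions, the following Python sketch cuts a sequence string before every GATC site (the MboI recognition sequence) and keeps fragments within a size range. The function name `virtual_digest`, the size limits, and the toy genome are assumptions made for this example, not values from the specification.

```python
import re


def virtual_digest(genome, site="GATC", min_len=5, max_len=12):
    """Cut `genome` before every occurrence of `site`; return (tag, start) pairs.

    Only fragments bounded by a site on both sides are considered, and only
    those whose length lies in [min_len, max_len] are kept. The size limits
    are hypothetical placeholders for the "predetermined range".
    """
    # A zero-width lookahead reports the start index of every site.
    starts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", genome)]
    tags = []
    for a, b in zip(starts, starts[1:]):
        fragment = genome[a:b]
        if min_len <= len(fragment) <= max_len:
            tags.append((fragment, a))
    return tags


# Toy genome with GATC sites at indices 2, 9, 27 and 33.
genome = "TT" + "GATCAAA" + "GATC" + "C" * 14 + "GATCTT" + "GATCA"
print(virtual_digest(genome))
```

  • Each returned pair links a tag sequence to its position in the control sequence, which is the association that the control tag data is described as holding.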
  • genomic DNA molecules to be analyzed are obtained by DNA extraction.
  • DNA molecule is prepared.
  • the plurality of genomic DNA molecules derived from chromosomes are cleaved by a predetermined restriction enzyme, and the resulting plurality of tags to be analyzed are connected to create a plurality of concatemers.
  • the DNA sequences of the plurality of concatemers are determined by a sequencing reaction. Furthermore, the determined DNA sequences of the plurality of concatemers are each converted into the DNA sequences of a plurality of tags to obtain the tag data to be analyzed.
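  • The conversion of a concatemer read into individual tags can be sketched as below, assuming that ligated MboI fragments are separated by regenerated GATC sites and that each tag is reported with its leading GATC. Both assumptions, the function name `split_concatemer`, and the example read are hypothetical.

```python
def split_concatemer(read, site="GATC"):
    """Split a concatemer read into tags at each restriction site.

    The piece before the first site and the piece after the last site may be
    vector-flanked partial fragments, so this sketch discards them.
    """
    pieces = read.split(site)
    return [site + piece for piece in pieces[1:-1] if piece]


# Illustrative read: flank + two complete tags + flank.
read = "ACGT" + "GATCAAATT" + "GATCCCGGA" + "GATC" + "TTT"
print(split_concatemer(read))
```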
  • correspondence data between the control tag data and the analysis target tag data is generated. Based on this correspondence data, the number of tags to be analyzed corresponding to each region on the genome is determined. As a result, the presence of chromosomal abnormalities such as amplification and deletion in each region on the genome is detected. Information such as amplification and deletion in each region on the genome is used for identification of disease genes.
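  • A minimal Python sketch of this counting step follows. It assumes the control tag data is available as a mapping from tag sequence to genome position and that matching is by exact sequence identity. The name `count_tags_per_position` and all data values are illustrative only.

```python
from collections import Counter


def count_tags_per_position(control_tags, analysis_tags):
    """Count how many analysis target tags match each control tag.

    control_tags: dict mapping tag sequence -> position in the control genome.
    analysis_tags: list of tag sequences read from the sample.
    Returns a dict mapping position -> number of matching analysis tags.
    """
    observed = Counter(analysis_tags)
    return {pos: observed.get(seq, 0) for seq, pos in control_tags.items()}


control_tags = {"GATCAAAT": 1200, "GATCCCGT": 5400, "GATCTTGA": 9100}
analysis_tags = ["GATCAAAT", "GATCCCGT", "GATCCCGT", "GATCCCGT", "GATCNNNN"]
print(count_tags_per_position(control_tags, analysis_tags))
```

  • In this toy data, the position with three hits would be a candidate for amplification and the position with none a candidate for deletion. Real data would need enough tags per region for such differences to be statistically meaningful.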
  • FIG. 2 is a functional block diagram showing the overall configuration of the DNA sequence analysis system 1000.
  • the DNA sequence analysis system 1000 includes a DNA sequence analyzer 100 that acquires control tag data and analysis target tag data and analyzes changes in the genomic DNA sequence.
  • the DNA sequence analysis system 1000 includes a control tag data generation device 200 that generates control tag data.
  • the DNA sequence analysis system 1000 includes an analysis target tag data generation device 300 that generates analysis target tag data.
  • the DNA sequence analysis system 1000 includes another genome DNA sequence database 120 that stores genome DNA sequence data of a species different from the control tag data and the tag data to be analyzed.
  • the DNA sequence analysis system 1000 includes an operation unit 102 for operating the DNA sequence analysis apparatus 100.
  • the DNA sequence analysis system 1000 includes an image display device 104 that displays data output from the DNA sequence analysis device 100 as an image.
  • the DNA sequence analysis system 1000 includes a printer 106 that prints data output from the DNA sequence analysis apparatus 100.
  • the DNA sequence analysis system 1000 includes a PC (personal computer) 108 that receives data output from the DNA sequence analyzer 100.
  • FIG. 3 is a functional block diagram showing the overall configuration of a DNA sequence analysis system 2000 that is a modification of the DNA sequence analysis system 1000.
  • the DNA sequence analysis system 2000 has basically the same configuration as the DNA sequence analysis system 1000 in FIG. 2.
  • however, it differs in that the control genomic DNA sequence data acquisition unit 202 (FIG. 4) is included in the DNA sequence analyzer 400. Another difference is that the DNA sequence analyzer 400 is connected to the control genomic DNA sequence database 500.
  • “3.” is an explanation of the analysis target tag data generation device 300 of FIG. 1 that generates data (data to be input to the DNA sequence analysis device 100) that is the basis of the above “1.”.
  • FIG. 4 is a functional block diagram showing the internal configuration of the DNA sequence analyzer 100.
  • the DNA sequence analyzer 100 includes a control tag data acquisition unit 202 that acquires control tag data input from the control tag data generator 200.
  • the control tag data is data obtained by associating a plurality of control tags obtained by cleaving the control genomic DNA sequence with a restriction enzyme, with corresponding positions in the control genomic DNA sequence.
  • each of the plurality of control tags consists of a DNA sequence that occurs in the control genomic DNA sequence a predetermined number of times or fewer and has a number of bases in a predetermined range.
  • the DNA sequence analyzer 100 further includes a control tag data storage unit 206 that stores the control tag data acquired by the control tag data acquisition unit 202.
  • the DNA sequence analysis apparatus 100 includes an analysis target tag data acquisition unit 204 that acquires the analysis target tag data input from the analysis target tag data generation apparatus 300.
  • the analysis target tag data is data of a set of a plurality of analysis target tags obtained by cleaving the analysis target genomic DNA sequence with a restriction enzyme.
  • each of the plurality of tags to be analyzed consists of a DNA sequence having a number of bases in a predetermined range.
  • the DNA sequence analyzer 100 further includes an analysis target tag data storage unit 208 that stores the analysis target tag data acquired by the analysis target tag data acquisition unit 204.
  • the DNA sequence analyzer 100 includes a corresponding tag data generation unit 210 that generates corresponding tag data in which the control tag data and the analysis target tag data are associated with each other.
  • Corresponding tag data generation unit 210 acquires the control tag data from the control tag data storage unit 206 and the analysis target tag data from the analysis target tag data storage unit 208, compares the control tag data with the analysis target tag data, and generates corresponding tag data in which corresponding tags among the control tags and the analysis target tags are associated with each other.
  • the DNA sequence analyzing apparatus 100 further includes a corresponding tag data storage unit 212 that stores the corresponding tag data generated by the corresponding tag data generation unit 210.
  • the DNA sequence analyzer 100 includes a copy number determination unit 214 that analyzes the corresponding tag data and determines a copy number difference between the analysis target genomic DNA sequence and the control genomic DNA sequence.
  • the copy number determination unit 214 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212 to determine the number of tags to be analyzed corresponding to each control tag, and, based on this number, determines the copy number difference relative to the control genomic DNA sequence in the region of the genomic DNA sequence to be analyzed that contains the portion corresponding to the control tag.
  • the DNA sequence analyzing apparatus 100 further includes a copy number determination result storage unit 216 that stores the copy number determination result by the copy number determination unit 214.
  • the DNA sequence analyzer 100 includes a separate genomic DNA data search unit 224 that searches for genomic DNA sequence data of a biological species different from that of the control tag data and the analysis target tag data. That is, the separate genomic DNA data search unit 224 connects to another genomic DNA sequence database 120 (FIG. 2), which is derived from a source different from the control genomic DNA sequence, and searches for other genomic DNA sequence data.
  • the DNA sequence analyzing apparatus 100 includes an origin determining unit 226 that determines the origin of each tag to be analyzed that does not correspond to any control tag. That is, the origin determining unit 226 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212, and determines whether there is a control tag corresponding to the analysis target tag. When there is no control tag corresponding to the analysis target tag, the origin determination unit 226 compares the analysis target tag with the other genomic DNA sequence data acquired from the separate genomic DNA data search unit 224 to determine the origin of the tag to be analyzed. In addition, the DNA sequence analyzer 100 includes an origin determination result storage unit 228 that stores the determination result by the origin determination unit 226.
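  • The origin determination can be pictured with the following Python sketch, which checks the control tags first and then searches other genomes by plain substring matching. The function name `determine_origin`, the dictionary layout of the other-genome data, and the label "virus_X" are hypothetical. A real system would likely use an indexed sequence search rather than substring scans.

```python
def determine_origin(tag, control_tags, other_genomes):
    """Return a label for where an analysis target tag appears to come from.

    control_tags: set of control tag sequences.
    other_genomes: dict mapping a source name -> its genomic sequence string.
    """
    if tag in control_tags:
        return "control"
    # No corresponding control tag: compare against other genomes.
    for name, sequence in other_genomes.items():
        if tag in sequence:
            return name
    return "unknown"


control_tags = {"GATCAAAT"}
other_genomes = {"virus_X": "TTTGATCGGGCTTT"}
for tag in ("GATCAAAT", "GATCGGGC", "GATCTTTT"):
    print(tag, determine_origin(tag, control_tags, other_genomes))
```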
  • the DNA sequence analyzing apparatus 100 includes an image data generation unit 220 that generates image data based on the copy number determination result or the origin determination result. That is, the image data generation unit 220 acquires the copy number determination result from the copy number determination result storage unit 216, acquires the origin determination result from the origin determination result storage unit 228, and, based on these results, generates image data that displays, for each region of the genomic DNA sequence to be analyzed, the copy number difference from the control genomic DNA sequence and the presence of heterologous DNA sequences in a form that is easy for the user to understand.
  • the DNA sequence analyzing apparatus 100 further includes an image data storage unit 222 that stores the image data generated by the image data generation unit 220.
  • FIG. 5 is a functional block diagram showing the internal configuration of a DNA sequence analyzer 400 that is a modification of the DNA sequence analyzer 100.
  • the configuration of the DNA sequence analyzer 400 is basically the same as that of the DNA sequence analyzer 100 of FIG. 4, except that a control tag data generation unit 402 is provided inside.
  • the control tag data generation unit 402 acquires the control genomic DNA sequence data input from the control genomic DNA sequence database 500 (Fig. 3), and generates control tag data. The detailed mechanism by which the control tag data generation unit 402 generates control tag data will be described later.
  • the configuration of the DNA sequence analyzer 400 also differs in that it includes a control tag data storage unit 404 that stores the control tag data generated by the control tag data generation unit 402. Therefore, in the DNA sequence analyzer 400, the control tag data acquisition unit 202 acquires control tag data from the control tag data storage unit 404 inside the device, not from outside the device.
  • FIG. 6 is a flowchart for explaining the operation of the DNA sequence analysis system 1000.
  • the control tag data generation device 200 shown in FIG. 2 cuts the control genomic DNA sequence data of a predetermined species such as a human at the restriction enzyme cleavage sites, thereby generating control tag data (S102).
  • the control tag data acquisition unit 202 acquires the control tag data from the control tag data generation device 200 (S106). Further, the control tag data acquisition unit 202 stores the acquired control tag data in the control tag data storage unit 206.
  • the analysis target tag data generation device 300 shown in FIG. 2 connects a plurality of DNA fragments obtained by treating a genomic DNA molecule to be analyzed of a predetermined biological species such as a human with a restriction enzyme, and generates analysis target tag data including a plurality of analysis target tags (S104).
  • a plurality of concatemers may be generated by connecting a plurality of DNA fragments, and a plurality of secondary concatemers may be generated by connecting a plurality of concatemers. This is because generating secondary concatemers can improve the efficiency of sequencing.
  • the analysis target tag data acquisition unit 204 acquires the analysis target tag data from the analysis target tag data generation device 300 (S108). Further, the analysis target tag data acquisition unit 204 stores the acquired analysis target tag data in the analysis target tag data storage unit 208.
  • the corresponding tag data generation unit 210 acquires the control tag data from the control tag data storage unit 206, acquires the analysis target tag data from the analysis target tag data storage unit 208, compares the control tag data with the analysis target tag data, and generates corresponding tag data by associating corresponding tags among the control tags and the analysis target tags (S110). Further, the corresponding tag data generation unit 210 stores the generated corresponding tag data in the corresponding tag data storage unit 212.
  • the copy number determination unit 214 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212, determines the number of analysis target tags corresponding to each control tag, and, based on this number, determines the copy number difference relative to the control genomic DNA sequence in the region of the genomic DNA sequence to be analyzed that includes the portion corresponding to the control tag (S114). Further, the copy number determination unit 214 stores the copy number determination result in the copy number determination result storage unit 216.
  • the origin determination unit 226 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212 and determines whether there is a control tag corresponding to the analysis target tag. If there is no control tag corresponding to the analysis target tag, the origin determination unit 226 compares the analysis target tag with the other genomic DNA sequence data obtained from the separate genomic DNA data search unit 224, and determines the origin of the tag (S112). In addition, the origin determination unit 226 stores the origin determination result in the origin determination result storage unit 228.
  • the image data generation unit 220 acquires the copy number determination result from the copy number determination result storage unit 216, acquires the origin determination result from the origin determination result storage unit 228, and determines the copy number determination result and the origin. Image data is generated based on the determination result (S116). In addition, the image data generation unit 220 stores the generated image data in the image data storage unit 222.
  • the output unit 218 acquires the copy number determination result from the copy number determination result storage unit 216, the origin determination result from the origin determination result storage unit 228, and the image data from the image data storage unit 222. It then outputs these to an image display device 104 (FIG. 2) outside the device (S118), and the series of flows is completed.
  • analysis target tags consisting of short fragments, obtained by treating the genomic DNA sequence to be analyzed with a restriction enzyme, are counted as representatives of the genome, and the corresponding tag data generation unit 210 associates them with the control tags derived from the control genomic DNA sequence.
  • the copy number determination unit 214 can therefore comprehensively quantify copy number across the human genome. Based on this, a region exhibiting a copy number abnormality in the genomic DNA sequence to be analyzed can be identified with high resolution. As a result, genomic regions showing gene copy number abnormalities can be searched for and identified with high resolution, new causative genes of diseases existing in those regions can be found, and the mechanism of onset can be clarified, which can be applied to treatment.
  • FIG. 7 is a functional block diagram showing the internal configuration of the corresponding tag data generation unit 210.
  • Corresponding tag data generation unit 210 includes a reception unit 502 that receives the control tag data and the analysis target tag data from the control tag data storage unit 206 (FIG. 4) and the analysis target tag data storage unit 208 (FIG. 4).
  • Corresponding tag data generation unit 210 includes a correspondence determination unit 504 that determines the correspondence between the control tag data and the analysis target tag data.
  • Correspondence determination unit 504 obtains the control tag data and the analysis target tag data from the reception unit 502. When an analysis target tag corresponds to only one of the control tags, these tags are associated with each other with a predetermined contribution. When the analysis target tag corresponds to two or more of the control tags, the tags are associated with each other with a contribution (e.g., 0) different from the predetermined contribution.
  • the setting of the contribution in the correspondence determination unit 504 is performed by the contribution setting unit 508 provided in the correspondence tag data generation unit 210.
  • the corresponding tag data generation unit 210 includes a matching degree determination unit 506 that determines the matching degree between the control tag data and the analysis target tag data.
  • the coincidence determination unit 506 acquires, from the correspondence determination unit 504, the control tag data and the analysis target tag data for which the contribution related to the correspondence has been set. Among the control tags and the analysis target tags, completely matching tags are associated with each other with a predetermined contribution (for example, 1), and partially different tags are associated with each other with a contribution (for example, 0) different from the predetermined contribution.
  • the contribution degree setting in the coincidence degree determination unit 506 is performed by the contribution degree setting unit 508 provided in the corresponding tag data generation unit 210.
  • partially different tags may include, in addition to analysis tags that have the same length but contain a mismatch, analysis tags with an insertion or deletion of one or two bases.
  • the corresponding tag data generation unit 210 includes a retry determination unit 510 that determines, for each of the plurality of analysis target tags included in the analysis target tag data, whether to retry its generation. For example, when the determination result of the coincidence determination unit 506 indicates that a control tag and an analysis target tag are partially different, and the number of differing bases is less than or equal to a predetermined number, the retry determination unit 510 can be configured to determine that the sequencing for generating the analysis target tag should be retried. In this way, the loss of valuable analysis target tags due to slight sequence errors at the level of a few bases can be suppressed, so the reliability of the obtained analysis results can be improved.
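  • One way to picture the retry decision is the Python sketch below. It measures the difference between a control tag and an analysis target tag as a Levenshtein edit distance, so that mismatches, insertions, and deletions are all counted. The use of edit distance, the names `edit_distance` and `should_retry`, and the threshold `max_diff=2` are assumptions for illustration. The specification only speaks of a predetermined number of differing bases.

```python
def edit_distance(a, b):
    """Levenshtein distance: each substitution, insertion or deletion costs 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def should_retry(control_tag, analysis_tag, max_diff=2):
    """Retry sequencing when the tags differ, but only by a few bases."""
    distance = edit_distance(control_tag, analysis_tag)
    return 0 < distance <= max_diff


print(should_retry("GATCAAAT", "GATCAGAT"))   # one mismatch
print(should_retry("GATCAAAT", "GATCAAAT"))   # identical
print(should_retry("GATCAAAT", "GATCTTTTT"))  # too different
```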
  • Corresponding tag data generation unit 210 includes an output unit 512 that outputs the data that has undergone the processing of the correspondence determination unit 504, the coincidence determination unit 506, and the retry determination unit 510 to the corresponding tag data storage unit 212 (FIG. 4).
  • FIG. 8 is a flowchart for explaining the operation of the corresponding tag data generation unit 210.
  • the correspondence determination unit 504 determines the correspondence between the control tag data and the analysis target tag data (S202). For example, the number of control tags corresponding to each analysis target tag is determined. If the number of control tags is 1, the contribution setting unit 508 sets the contribution to a (S206). If the number of control tags is 2 or more, the contribution setting unit 508 sets the contribution to b (S208). If the number of control tags is 0, the process proceeds to step S112 to determine the origin (FIG. 6).
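  • The branching on the number of corresponding control tags can be sketched in Python as follows. The index layout (tag sequence mapped to a list of genome positions), the function name `correspondence_contribution`, and the default values a = 1.0 and b = 0.0 are assumptions chosen for this example.

```python
def correspondence_contribution(analysis_tag, control_index, a=1.0, b=0.0):
    """Return (contribution, positions) for one analysis target tag.

    control_index: dict mapping tag sequence -> list of control genome positions.
    a, b: hypothetical contribution values for the one-match and
          multiple-match cases (S206 and S208).
    """
    positions = control_index.get(analysis_tag, [])
    if len(positions) == 1:
        return a, positions       # exactly one control tag (S206)
    if len(positions) >= 2:
        return b, positions       # two or more control tags (S208)
    return None, positions        # no control tag: go to origin determination


control_index = {"GATCAAAT": [1200], "GATCCCGT": [300, 8800]}
for tag in ("GATCAAAT", "GATCCCGT", "GATCTTTT"):
    print(tag, correspondence_contribution(tag, control_index))
```

  • Returning `None` for the no-match case keeps it distinct from a zero contribution, so a caller can route such tags to origin determination instead of counting them.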
  • the coincidence determination unit 506 determines the coincidence between the control tag data and the analysis target tag data (S210). For example, the degree of coincidence between each analysis target tag and the corresponding control tag is determined, and if it is an exact match, the contribution setting unit 508 sets the contribution to c (S212). The contribution is set to d by the setting unit 508 (S214).
  • the retry determination unit 510 determines the necessity of retry (S214), and ends the series of flows.
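  • The flow of FIG. 8 can be sketched in code as follows. This is only an illustrative sketch: the text names the contributions a, b, c, and d without giving values or saying how they are combined, so the numeric values, the product used to combine them, the mismatch limit for a retry, and all function names below are assumptions. Insertions and deletions are not handled; tags are compared base by base at equal length.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical values: the text names a, b, c, d but gives no numbers.
A_SINGLE = 1.0    # "a": exactly one corresponding control tag
B_MULTI = 0.5     # "b": two or more corresponding control tags
C_EXACT = 1.0     # "c": exact match
D_PARTIAL = 0.5   # "d": partial match
MAX_RETRY_MISMATCHES = 2  # assumed "predetermined number" of differing bases


def mismatches(tag: str, control: str) -> int:
    """Count differing bases between two tags of equal length."""
    if len(tag) != len(control):
        raise ValueError("tags must have the same length")
    return sum(1 for x, y in zip(tag, control) if x != y)


@dataclass
class Assessment:
    contribution: Optional[float]  # None when no control tag corresponds
    retry: bool                    # True when re-sequencing is suggested


def assess(tag: str, controls: List[str]) -> Assessment:
    """Set a contribution for one analysis target tag and decide on a retry."""
    if not controls:
        # No corresponding control tag: handled elsewhere (origin determination).
        return Assessment(None, False)
    # Correspondence: one control tag versus two or more.
    correspondence = A_SINGLE if len(controls) == 1 else B_MULTI
    # Degree of coincidence with the closest control tag.
    best = min(mismatches(tag, c) for c in controls)
    if best == 0:
        return Assessment(correspondence * C_EXACT, False)
    # Partial match: suggest a retry only when few bases differ.
    return Assessment(correspondence * D_PARTIAL, best <= MAX_RETRY_MISMATCHES)
```

  • For example, under these assumed values `assess("GATTACA", ["GATTACC"])` yields a contribution of 0.5 and suggests a retry, because only one base differs.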
  • the advantages of the corresponding tag data generation unit 210 in the present embodiment will be described.
  • With the correspondence determination unit 504, the coincidence determination unit 506, and the contribution setting unit 508, an appropriate contribution can be set according to the correspondence and the degree of coincidence between the control tag and the analysis target tag.
  • In the copy number determination unit 214 described later, the contributions set according to the correspondence and the degree of coincidence between the control tags and the analysis target tags are totaled for each region in the genomic DNA sequence. This makes it possible to reliably detect a region of the genomic DNA sequence to be analyzed whose copy number differs from that of the control genomic DNA sequence.
  • In addition, identifying analysis target tags that are suspected of containing a sequencing read error or an SNP is useful for improving the reliability of the results obtained by the DNA sequence analysis system 1000.
  • FIG. 9 is a functional block diagram showing the internal configuration of the copy number determination unit 214.
  • The copy number determination unit 214 includes a reception unit 602 that receives the corresponding tag data input from the corresponding tag data storage unit 212 (FIG. 4).
  • the copy number determination unit 214 includes a contribution totaling unit 604 that totals the contribution set in the corresponding tag data received by the reception unit 602.
  • The copy number determination unit 214 includes a duplication determination unit 606 that determines whether or not duplication has occurred in the genomic DNA sequence to be analyzed based on the contributions totaled by the contribution totaling unit 604.
  • the duplication determination unit 606 determines the number of analysis target tags corresponding to the control tag based on the contributions totaled by the contribution totalization unit 604, and the number of analysis target tags corresponding to the control tag is predetermined. If the number is greater than or equal to (for example, 3 or more), it is determined that duplication has occurred in the region including the portion corresponding to the control tag in the genomic DNA sequence to be analyzed.
  • The copy number determination unit 214 includes a deletion determination unit 608 that determines whether or not a deletion has occurred in the genomic DNA sequence to be analyzed based on the contributions totaled by the contribution totaling unit 604.
  • The deletion determination unit 608 determines the number of analysis target tags corresponding to the control tag based on the contributions totaled by the contribution totaling unit 604. If that number is less than or equal to a predetermined number (for example, 0.5 or less), it determines that a deletion has occurred in the region of the genomic DNA sequence to be analyzed that contains the portion corresponding to the control tag.
  • The copy number determination unit 214 includes an output unit 610 that outputs the data obtained from the contribution totaling unit 604, the duplication determination unit 606, and the deletion determination unit 608 to the copy number determination result storage unit 216 (FIG. 4).
  • FIG. 10 is a flowchart for explaining the operation of the copy number determination unit 214.
  • First, the contribution totaling unit 604 analyzes the corresponding tag data received by the reception unit 602 and totals the contributions set for each region of the control genomic DNA sequence (that is, for each control tag in the control tag data) (S302).
  • Next, the duplication determination unit 606 determines, for each region of the control genomic DNA sequence, whether the totaled contribution is equal to or greater than a threshold (for example, 3 or more) (S304). If it is equal to or greater than the threshold, duplication is determined (S306). If it is less than the threshold, the process proceeds to step S308.
  • In the next step, the deletion determination unit 608 determines, for each region of the control genomic DNA sequence, whether the totaled contribution is less than or equal to a threshold (for example, 0.5 or less) (S308). If it is less than or equal to the threshold, a deletion is determined (S310). If it is larger than the threshold, no determination is made.
  • the output unit 610 outputs the determination result to the copy number determination result storage unit 216 (S312), and ends a series of flows.
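  • The determination of FIG. 10 can be sketched as follows. The thresholds 3 and 0.5 are the examples given in the text; the record layout (pairs of region and contribution) and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

DUPLICATION_THRESHOLD = 3.0  # example value from the text (S304)
DELETION_THRESHOLD = 0.5     # example value from the text (S308)


def total_contributions(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Total the contributions per region (S302)."""
    totals: Dict[str, float] = defaultdict(float)
    for region, contribution in records:
        totals[region] += contribution
    return dict(totals)


def classify(total: float) -> str:
    """Apply the upper threshold first, then the lower one."""
    if total >= DUPLICATION_THRESHOLD:
        return "duplication"
    if total <= DELETION_THRESHOLD:
        return "deletion"
    return "none"


def determine_copy_number(regions: List[str],
                          records: Iterable[Tuple[str, float]]) -> Dict[str, str]:
    """Classify every control region; a region with no tags totals 0."""
    totals = total_contributions(records)
    return {region: classify(totals.get(region, 0.0)) for region in regions}
```

  • Passing the full list of control regions matters in this sketch: a region for which no analysis target tag was observed has a total of 0 and is therefore reported as a deletion.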
  • As described above, the corresponding tag data generation unit 210 provides data in which an appropriate contribution is set according to the correspondence and the degree of coincidence, and these contributions are totaled for each region in the genomic DNA sequence. This makes it possible to reliably detect a region of the genomic DNA sequence to be analyzed whose copy number differs from that of the control genomic DNA sequence. In addition, because the duplication determination unit 606 and the deletion determination unit 608 compare the totaled contributions with the upper and lower thresholds, the occurrence of duplication and deletion in the genomic DNA sequence can be reliably detected.
  • FIG. 11 is a conceptual diagram for explaining a data visualization image for each control tag (unit: virtual tag).
  • the image data generation unit 220 acquires data related to the copy number determination result from the copy number determination result storage unit 216, and generates such image data.
  • The tag density (corresponding to the total value of the contributions) is calculated and displayed in a color that depends on the tag density, a form that is easy for the user to understand.
  • the display window can also be enlarged or reduced as necessary, taking into account the convenience of the user.
  • Whether duplication or deletion has occurred in the genomic DNA region corresponding to each square can be easily determined visually from the color of the square.
  • FIG. 12 is a conceptual diagram for explaining a data visualization image for each control tag (unit: virtual tag).
  • The image data generation unit 220 may obtain data related to the copy number determination result from the copy number determination result storage unit 216 and generate such image data.
  • The tag density (corresponding to the total value of the contributions) is calculated and displayed as the height of a filled block that depends on the tag density, a form that is easy for the user to understand.
  • the display window can be enlarged or reduced as necessary, taking into account user convenience.
  • FIG. 13 is a functional block diagram for explaining the internal configuration of the control tag data generation apparatus 200.
  • the control tag data generation device 200 (the control tag data generation unit 402 (FIG. 5) has the same configuration) includes a control genomic DNA sequence data acquisition unit 706 that acquires control genomic DNA sequence data.
  • the control tag data generation device 200 also includes a control genome DNA sequence data storage unit 708 that stores the control genomic DNA sequence data acquired by the control genomic DNA sequence data acquisition unit 706.
  • The control tag data generation device 200 includes a cleavage site search unit 710 that acquires the control genomic DNA sequence data from the control genomic DNA sequence data storage unit 708, searches for the cleavage sites of a predetermined restriction enzyme, and cuts the control genomic DNA sequence data at the cleavage sites found.
  • The control tag data generation device 200 includes a cleaved DNA sequence storage unit 712 that stores the plurality of DNA sequences (control tags) obtained by the cutting performed by the cleavage site search unit 710.
  • The control tag data generation device 200 includes a control tag selection unit 714 that obtains, from the cleaved DNA sequence storage unit 712, the plurality of control tags obtained by cleaving the control genomic DNA sequence at the cleavage sites and selects, from among them, the control tags whose number of bases and uniqueness fall within predetermined ranges. In addition, the control tag data generation device 200 includes a selection tag storage unit 716 that stores the control tags selected by the control tag selection unit 714.
  • The control tag data generation device 200 includes an association unit 718 that generates control tag data by associating each selected control tag with the corresponding portion of the control genomic DNA sequence. The control tag data generation device 200 also includes a control tag data storage unit 720 that stores the control tag data generated by the association unit 718.
  • the control tag data generation device 200 includes an output unit 722 that acquires control tag data from the control tag data storage unit 720 and outputs the control tag data to the DNA sequence analyzer 100.
  • Next, the generation of virtual tags (control tags) from human genome information by the above-described control tag data generation device 200 will be described in detail.
  • The principle of digital genome scanning (DGS) devised by the present inventors is to comprehensively quantify the copy number of the human genome by counting short fragments, obtained by restriction enzyme treatment of genomic DNA, as representatives of the genome. Regions showing copy number abnormalities are identified on this basis.
  • the present inventors conducted the following simulations in silico for the purpose of studying resolution and effectiveness in order to establish the foundation of the DGS method.
  • The analysis program was written in C, and the software development environment was a system built around a Red Hat server.
  • The genome data was searched using the recognition base sequence of each restriction enzyme, data outside the specified region was excluded, and the remaining tag data was accumulated and analyzed by size. The position information of each tag was stored, and its repeat class was determined by comparison with a repeat sequence database and recorded.
  • FIG. 14 is a diagram showing the number of tags generated from the genome by each restriction enzyme.
  • the first question is which restriction enzyme should be used to fragment genomic DNA. Therefore, for in silico analysis of virtual tags, we first examined the number of virtual tags by restriction enzyme.
  • Because a 6-base recognition restriction enzyme has a more stringent recognition condition, it clearly produces fewer virtual tags than a 4-base recognition restriction enzyme.
  • MboI recognizes and cleaves the DNA sequence GATC, so a BamHI site can be used for cloning the concatemers (tags ligated together in a daisy chain).
  • When the tags were limited to 20 to 40 bases in length, the number of tags generated by MboI was intermediate compared with the other enzymes. The following simulations therefore focused on virtual tags generated by MboI.
  • FIG. 15 is a graph showing the number of tags generated by MboI by size.
  • FIG. 15 shows a histogram in which the number of tags is tabulated per base for fragment sizes of 20 to 80 bases, excluding the GATC at both ends of each MboI virtual tag (hereinafter referred to as gap length).
  • FIG. 16 is a diagram showing the number of tags when a width is given to the tag size.
  • Since a large number of tags must be analyzed when performing DGS, introducing as many tags as possible into one vector (hereinafter referred to as a "concatemer") is important for improving work efficiency and for reducing work costs.
  • repeat sequences are scattered in the genome. According to the 2001 Genome Project report (Nature 2001, 409, 871—), about 45% of the human genome is occupied by repeat sequences.
  • FIG. 17 is a graph showing the distribution of MboI (^GATC) virtual tags.
  • FIG. 18 is a diagram illustrating the number of effective virtual tags generated by MboI.
  • FIG. 18 shows the result of recalculating the tag counts after giving the tag size a width. For example, for a 30-base width covering gap lengths of 30 to 59 bp, the total number of tags is about 420,000, of which about 250,000 are derived from repeat sequences; the number of non-repeat tags is 165,845, a ratio of 39.8%.
  • FIG. 19 is a conceptual diagram showing an image of the DGS Monte Carlo simulation. As shown, the Monte Carlo method was used as the principle of the simulation. This is a technique that uses pseudo-random numbers to solve a problem.
  • FIG. 20 is a diagram for explaining the details of the DGS Monte Carlo simulation.
  • As shown in FIG. 21A, an original algorithm for generating pseudo-random numbers was developed and used to simulate gene amplification, gene deletion, and loss of heterozygosity.
  • As shown in FIG. 21B, the number of virtual tags, the number of tags actually analyzed, the number of tags showing the abnormal copy number (corresponding to the length of the region showing the abnormal copy number), their relative appearance frequency, and the number of trials were set as variables, and the number of occurrences of each virtual tag was simulated and recorded.
  • FIG. 21 is a screen display showing the user interface for the DGS Monte Carlo simulation.
  • a web tool with a user interface capable of the above operations was developed and used for the simulation (Fig. 4).
  • detection sensitivity of gene amplification (amplification), gene deletion (homozygous deletion), and loss of heterozygosity (LOH) to be analyzed by DGS will be described.
  • A certain number of virtual tags was set, and simulations were used to predict how many tags would actually need to be analyzed to detect these genome copy number abnormalities.
  • FIG. 22 is a table summarizing the DGS simulation results, for the case of 165,845 MboI non-repeat virtual tags, in terms of abnormality detection resolution and the number of analysis tags required.
  • The DGS Monte Carlo simulation assumed that only the 165,845 non-repeat tags found at gap lengths of 30 to 59 bases in the MboI virtual tag analysis described above were adopted as effective tags. Random numbers were generated as many times as the set number of analysis tags, the tags that appeared were recorded as one trial, and the average of 100 trials was used.
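  • A simulation of this kind can be sketched as follows. The inventors' own pseudo-random number algorithm is not disclosed in the text, so this sketch substitutes Python's standard `random` module, and the function names and the way the abnormal region is modeled (the first `n_abnormal` virtual tags are drawn with a relative weight) are assumptions. A relative frequency of 0 models a homozygous deletion, 0.5 models loss of heterozygosity, and a value above 1 models amplification.

```python
import random
from typing import List


def run_trial(n_virtual: int, n_analyzed: int, n_abnormal: int,
              rel_freq: float, rng: random.Random) -> List[int]:
    """One trial: draw n_analyzed tags and count how often each one appears.

    The first n_abnormal virtual tags are drawn with weight rel_freq,
    all other virtual tags with weight 1.
    """
    weights = [rel_freq] * n_abnormal + [1.0] * (n_virtual - n_abnormal)
    counts = [0] * n_virtual
    for idx in rng.choices(range(n_virtual), weights=weights, k=n_analyzed):
        counts[idx] += 1
    return counts


def mean_counts(n_virtual: int, n_analyzed: int, n_abnormal: int,
                rel_freq: float, trials: int = 100, seed: int = 0) -> List[float]:
    """Average the per-tag counts over several trials (100 in the text)."""
    rng = random.Random(seed)
    sums = [0] * n_virtual
    for _ in range(trials):
        trial = run_trial(n_virtual, n_analyzed, n_abnormal, rel_freq, rng)
        for i, c in enumerate(trial):
            sums[i] += c
    return [s / trials for s in sums]
```

  • With the values in the text, `n_virtual` would be 165,845; smaller values are convenient for trying the sketch, for example `mean_counts(100, 1000, 10, 0.5)`.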
  • FIG. 23 is a conceptual diagram for explaining the difference between a tag derived from a both-end repeat and a tag derived from a single-end repeat. Even for a tag derived from a repeat sequence, there are cases where the fragment is completely buried in the repeat sequence and cases where only one end of the tag lies in the repeat sequence. The probability that a tag maps to a single location on the genome is expected to increase as the repeat-derived portion of the tag decreases.
  • MboI virtual tags derived from repeat sequences were therefore divided into tags buried in the repeat sequence at both ends (hereinafter referred to as both-end repeat) and tags with only one end in the repeat sequence (hereinafter referred to as single-end repeat), and their distribution by number of tags and size was analyzed in silico.
  • FIG. 24 is a graph for re-examining the MboI virtual tags, showing the size distribution of tags buried in repeat regions (repeat both) and of tags spanning a repeat region and a non-repeat region (repeat either).
  • FIG. 24 shows a graph in which the classification information of both end / one end repeats is added to the drawing shown in FIG.
  • FIG. 25 is a diagram for explaining a case where MboI virtual tags extending over a repeat region and a non-repeat region (single-end repeat) are regarded as effective tags.
  • The range of preparative sizes was set and the number of tags was tabulated (FIG. 25). As shown in FIG. 25, the number of single-end repeat virtual tags increases with increasing tag size.
  • Assuming that single-end repeat virtual tags are, like tags from non-repeat sequences, valid tags in DGS, the results in column B + C of FIG. 25 tabulate the "effective tag" candidates. Even when the tags are cut out of the gel with the same 30-base width, setting the preparative size longer, at 50 to 79 bases, raises the "effective tag" ratio by the increase in the number of single-end repeat tags.
  • FIG. 26 is a diagram for examining whether the tag cut-out size should be shifted longer. If so, does the work efficiency of DGS increase when the tag size to be cut out is set longer? Based on the above calculations, FIG. 26 shows the results of examining the work efficiency when tags are purified for DGS with an emphasis on single-end repeat tags.
  • FIG. 27 is a diagram for explaining a case where MboI virtual tags extending over a repeat region and a non-repeat region (single-end repeat) are regarded as effective tags. In evaluating the results, a tag was judged "unique" when a candidate site on the genome existed at only one location on chromosome 22 and the full length of the tag matched the genome sequence of that candidate site 100%.
  • The purpose of the digital genome scanning (DGS) method of the present embodiment is to quantify the copy number of the genome against the background of human genome information and to identify, with high resolution, regions exhibiting an abnormal copy number.
  • FIG. 28 is a flowchart for explaining the operation of the control tag data generation device 200.
  • the control genomic DNA sequence data acquisition unit 706 acquires control genomic DNA sequence data (S402) and stores it in the control genomic DNA sequence data storage unit 708.
  • Next, the cleavage site search unit 710 selects a restriction enzyme (for example, MboI) (S404). The cleavage site search unit 710 then obtains the control genomic DNA sequence data from the control genomic DNA sequence data storage unit 708, searches for the cleavage sites of the restriction enzyme (S406), and cleaves the control genomic DNA sequence at those sites. The cleavage site search unit 710 stores the plurality of DNA sequences generated by the cleavage in the cleaved DNA sequence storage unit 712.
  • The control tag selection unit 714 acquires the plurality of DNA sequences from the cleaved DNA sequence storage unit 712 and determines whether the number of bases and the uniqueness of each DNA sequence fall within predetermined ranges (S408). From these DNA sequences, the control tag selection unit 714 selects as control tags the DNA sequences whose number of bases and uniqueness are within the predetermined ranges (S410). It does not select DNA sequences whose number of bases or uniqueness is outside the predetermined ranges (S412).
  • control tag selection unit 714 stores the selected control tags in the selection tag storage unit 716.
  • the associating unit 718 acquires a plurality of control tags composed of DNA sequences having the number of bases within a predetermined range from the selection tag storage unit 716, and associates them with the corresponding positions of the control genomic DNA sequence data. (S414), control tag data is generated. Further, the associating unit 718 stores the generated control tag data in the control tag data storage unit 720.
  • the output unit 722 acquires the control tag data from the control tag data storage unit 720, outputs it to the DNA sequence analyzer 100 (S416), and ends the series of flows.
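  • The flow of FIG. 28 (S402 to S416) can be sketched as follows for MboI. This is a simplified illustration rather than the original C program: the function names are assumptions, the size range of 30 to 59 bases is taken from the simulation described earlier, and uniqueness is approximated by counting identical gap sequences among the fragments of one strand, whereas the text judges uniqueness against the control genomic DNA sequence.

```python
from collections import Counter
from typing import Dict, List, Tuple

SITE = "GATC"  # MboI recognition sequence


def find_sites(genome: str, site: str = SITE) -> List[int]:
    """Return the start position of every recognition site (S406)."""
    positions = []
    start = genome.find(site)
    while start != -1:
        positions.append(start)
        start = genome.find(site, start + 1)
    return positions


def virtual_tags(genome: str, min_gap: int = 30,
                 max_gap: int = 59) -> List[Tuple[int, str]]:
    """Cut at adjacent sites and keep fragments whose gap length is in range (S408).

    Each tag is returned as (position of the gap in the genome, gap sequence),
    where the gap excludes the GATC at both ends.
    """
    sites = find_sites(genome)
    tags = []
    for left, right in zip(sites, sites[1:]):
        gap_start = left + len(SITE)
        gap = genome[gap_start:right]
        if min_gap <= len(gap) <= max_gap:
            tags.append((gap_start, gap))
    return tags


def select_unique(tags: List[Tuple[int, str]],
                  max_occurrences: int = 1) -> Dict[str, int]:
    """Keep tags occurring at most max_occurrences times and map each
    selected sequence to its genome position (S410, S414)."""
    counts = Counter(seq for _, seq in tags)
    return {seq: pos for pos, seq in tags if counts[seq] <= max_occurrences}
```

  • The resulting mapping from tag sequence to genome position plays the role of the control tag data in this sketch.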
  • According to this configuration, control tags are not selected according to whether they derive from a repeat sequence or a non-repeat sequence. Instead, a control tag can be selected when the number of times it occurs in the control genomic DNA sequence is a predetermined number or less (for example, 1 or less). The control tags obtained from the control genomic DNA sequence can therefore be used effectively, and the reliability of the obtained data can be improved.
  • a suitable restriction enzyme can be used by changing the combination of the restriction enzyme used for cleaving human genomic DNA and the size of the tag sequence to be extracted.
  • In order to achieve the goal of detecting changes in the amount of DNA at the whole genome level with high resolution, several parameters must be optimized to increase the sensitivity and specificity of the "digital genome scanning method".
  • The present inventors have already confirmed by computer simulation that the minimum detectable region of alteration is determined by the combination of the restriction enzyme used for cleaving human genomic DNA and the size of the extracted tag sequence.
  • For one combination, the tag sequences are present at intervals ranging from 200 kb to 20 Mb, with an average interval of 2 Mb.
  • By contrast, with the restriction enzyme MboI (a 4-base recognition enzyme) and a tag sequence size of 20 to 30 base pairs, the tag sequences are present at high density, at intervals ranging from 10 bp to 460 kb with an average of 20 kb. Therefore, according to the control tag data generation device 200, it is possible to select an optimal restriction enzyme from a wide variety of restriction enzymes according to the target resolution.
  • this tag sequence has positional information on the human genome from which it is derived, and can be mapped onto the chromosome immediately after being databased. Therefore, it can be used for highly accurate quantification of DNA quantity by integrating the number of tag sequences of each chromosome.
  • Because the control tag data generation device 200 uses the restriction enzyme MboI, which is suitable for obtaining control tags from the human genomic DNA sequence, it can generate control tag data that is well suited to being matched against the analysis target tag data. This improves the reliability and efficiency of the DNA sequence analysis.
  • When the control tag data generation device 200 generates the control tag data, each control tag is associated with the corresponding position in the control genomic DNA sequence, so the control tag data only needs to contain the position information in the control genomic DNA sequence and the sequence data of the selected control tags. The processing load of the DNA sequence analyzer 100 can therefore be reduced compared with the case where the entire human genomic DNA sequence is directly matched against the analysis target tag data.
  • FIG. 29 is a functional block diagram for explaining the internal configuration of the analysis target tag data generating apparatus 300.
  • the analysis target tag data generation device 300 includes an analysis target DNA molecule application unit 802 for applying an analysis target DNA molecule that is a genomic DNA molecule of a predetermined biological species such as a human.
  • The analysis target tag data generation device 300 includes a restriction enzyme application unit 804 for applying a restriction enzyme (such as MboI) for cleaving the analysis target DNA molecule.
  • The analysis target tag data generation device 300 includes a restriction enzyme processing unit 806 for cleaving a DNA molecule containing the analysis target DNA sequence with the restriction enzyme (MboI or the like). The analysis target tag data generation device 300 also includes an electrophoresis unit 808 for separating the plurality of cleaved DNA fragments.
  • The analysis target tag data generation device 300 includes a DNA fragment extraction unit 810 for extracting DNA fragments having a predetermined number of bases from the plurality of DNA fragments obtained by cleaving the DNA molecule with the restriction enzyme.
  • The analysis target tag data generation device 300 includes a concatemer generation unit 812 that generates a concatemer by linking a plurality of the DNA fragments extracted by the DNA fragment extraction unit 810. The analysis target tag data generation device 300 also includes a secondary concatemer generation unit 814 that generates a secondary concatemer by connecting a plurality of the concatemers generated by the concatemer generation unit 812.
  • The analysis target tag data generation device 300 includes a sequence unit 816 for sequencing the DNA sequence of the secondary concatemer. The analysis target tag data generation device 300 also includes a sequence result storage unit 818 that stores the sequence result obtained by the sequence unit 816.
  • The analysis target tag data generation device 300 includes an analysis target tag data generation unit 820 that generates analysis target tag data, which is a set of a plurality of analysis target tags, based on the sequence result acquired from the sequence result storage unit 818. The analysis target tag data generation device 300 also includes an analysis target tag data storage unit 822 that stores the analysis target tag data generated by the analysis target tag data generation unit 820. Finally, the analysis target tag data generation device 300 includes an output unit 824 that acquires the analysis target tag data from the analysis target tag data storage unit 822 and outputs it to the DNA sequence analyzer 100.
  • genomic DNA was extracted from gastric cancer cell line HSC45 as a human genomic DNA molecule.
  • 20 to 40 µg of genomic DNA was treated with the restriction enzyme MboI at 37 °C for 16 hours and subjected to electrophoresis on 3% NuSieve agarose.
  • The range of about 30 to 60 bases was cut out of the gel, the gel was dissolved with GELase (EPICENTRE), and the tag DNA was then purified by ethanol precipitation.
  • pBluescript II KS (+) (Stratagene) was used as the cloning vector for concatemers; it was used for cloning after BamHI restriction enzyme treatment followed by alkaline phosphatase treatment.
  • A concatemer was prepared from the tags and cloned into the vector.
  • The vector was introduced into E. coli DH10B by electroporation, and positive colonies were selected by a color selection method using X-gal. Each colony was cultured in ampicillin-containing LB medium, and the vector DNA was purified with an automatic nucleic acid extractor (KURABO and QIAGEN) and analyzed after RNase treatment.
  • FIG. 30 is a diagram for explaining the extraction of tag DNA and the production of concatemers. More specifically, a preliminary DGS experiment was performed using actual human genomic DNA extracted from the gastric cancer cell line HSC45. FIG. 30A shows the result of treating genomic DNA extracted from HSC45 with the restriction enzyme MboI and running it on a 3% NuSieve gel.
  • The obtained tags were joined by ligation to produce concatemers, which were introduced into a pBluescript vector for cloning. Initially, only one tag was introduced into the vector, but increasing the tag concentration improved the concatemer extension efficiency, so that concatemers containing 3 to 5 tags could be obtained (FIG. 30B).
  • FIG. 30C shows a restriction enzyme map prepared based on the base sequence of a typical concatemer.
  • This concatemer consists of 5 tags.
  • The concatemer was produced from tags derived from fraction #3, and the actual size of each tag was 43 to 52 bp.
  • When each tag was mapped onto a chromosome by BLAT search, it was confirmed that the tags were derived from different chromosomes, such as chromosomes 1, 6, and 11 and the X chromosome, as shown in FIG. 30C.
  • The concatemer also contained one SINE-derived repeat tag.
  • The concatemer sequences obtained in the above experiment were cut into tags at the MboI sites, the MboI sequence GATC was attached to both ends of each cut tag, and a BLAT search was performed to map the tags onto the genome. At the same time, the uniqueness of each sequence was analyzed and each tag was classified by repeat sequence.
  • FIG. 31 is a graph showing the results of analyzing and counting the base sequences of the tags used in the preliminary experiment.
  • A total of 81 tag base sequences were analyzed and aggregated (FIG. 31). The distribution of all tags by size and the ratio of repeat to non-repeat tags are shown in FIG. 31A.
  • The tag gap lengths were between 25 and 58 bp, and 38 of the 81 tags (46.9%) were derived from non-repeat sequences.
  • MacVector 7.2.2 and AssemblyLIGN (Accelrys) and Clone Manager 7 Professional Suite (Sci Ed Central) were used for the alignment of concatemer base sequences and the analysis of restriction enzyme sites.
  • The base sequence of each concatemer was read in both directions using T3 and T7 primers; after the two reads were aligned, the portion that matched in both was extracted as the concatemer sequence.
  • The base sequence of each concatemer was cut into tags at the MboI sites, the MboI sequence GATC was attached to both ends of each cut tag, and a BLAT search was then performed to map the tags onto the genome. At the same time, the uniqueness of each sequence was examined and each tag was classified by repeat sequence.
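  • The step of cutting a concatemer sequence into tags at the MboI sites and re-attaching GATC to both ends can be sketched as follows. The function name is an assumption, and the sketch keeps only pieces bounded by a GATC site on both sides, so any flanking sequence before the first site or after the last site is discarded.

```python
from typing import List

SITE = "GATC"  # MboI recognition sequence


def split_concatemer(sequence: str, site: str = SITE) -> List[str]:
    """Cut a concatemer at every MboI site and return the tags,
    each with the site sequence re-attached to both ends."""
    pieces = sequence.upper().split(site)
    # Only pieces with a site on both sides are complete tags;
    # the first and last pieces are flanking sequence (or empty).
    inner = pieces[1:-1]
    # Empty pieces arise from adjacent sites and carry no tag.
    return [site + gap + site for gap in inner if gap]
```

  • Each returned string can then be submitted to a sequence search such as BLAT for mapping onto the genome.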
  • Human BLAT Search http://genome.ucsc.edu/cgi-bin/hgBlat
  • blastn http://www.ncbi.nlm.nih.gov/blast/
  • The analysis target tag data consisting of non-repeat tags was associated with the control tag data, the copy number was determined, and the analysis target tags were mapped onto the genome and tabulated for each region in the chromosome.
  • Since the number of tags obtained is considered to be proportional to the length of each region in the chromosome, in other words to the number of virtual tags present in each region, it was expressed as a tag density (the number of tags divided by the number of virtual tags in each region or by the length of each region in the chromosome). Variation in tag density was then observed for each region in the chromosome (not shown).
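  • The tag density described above can be computed as in the following sketch, which uses the variant that divides by the number of virtual tags in each region. The function name and the region labels in the example are illustrative assumptions.

```python
from typing import Dict


def tag_density(observed: Dict[str, int], virtual: Dict[str, int]) -> Dict[str, float]:
    """Divide the number of observed tags in each region by the number of
    virtual tags in that region. Regions without virtual tags are skipped."""
    return {
        region: observed.get(region, 0) / n_virtual
        for region, n_virtual in virtual.items()
        if n_virtual > 0
    }
```

  • For example, `tag_density({"1p": 30}, {"1p": 100, "2p": 50})` gives a density of 0.3 for region 1p and 0.0 for region 2p, where no tag was observed.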
  • FIG. 32 is a conceptual diagram for explaining an operation flow of the analysis target tag data generation device 300. To summarize the above description, the analysis target tag data generation device 300 performs the following steps in order.
  • the concatemer is introduced into a BamHI-treated pBluescript II KS + vector (2nd ligation).
  • the primary library vector is treated with SpeI and PstI restriction enzymes to extract the concatemer sequence.
  • the re-extended concatemer is introduced into a PstI- and/or SpeI-treated pBluescript II KS+ vector (4th ligation).
  • the vector is introduced into E. coli, the clones are individually collected, and the base sequence of the concatemer is analyzed.
  • tag data is obtained, mapped onto the genome, and the number of tags is tabulated.
  • FIG. 33 is a conceptual diagram for explaining the re-extension of the concatemer.
  • The analysis target tag data generation device 300 creates a concatemer by connecting tags and uses it for base sequence analysis.
  • With an ordinary ligation, it is difficult to produce a long concatemer.
  • For DGS, a two-step protocol that re-extends concatemers was developed, and it succeeded in producing long concatemers. This increases the efficiency of base sequence analysis.
  • Conventional genome quantification methods involve PCR amplification in the tag production process.
  • DGS using the analysis target tag data generation device 300 uses no PCR at all, so more accurate quantification is possible. The reliability of the obtained data is therefore high.
  • FIG. 34 is a restriction enzyme map for explaining a method of grasping a concatamer structure.
  • To grasp the concatamer structure in a base sequence analyzed by the above-mentioned method, a judgment is made from the arrangement of restriction enzyme sites alone, from which the sequence structure shown in the figure can be estimated.
  • Fig. 35 is a sequence map of the DNA sequence for explaining the method of grasping the concatamer structure.
  • FIG. 36 is a sequence map obtained by removing the vector sequence from the sequence map of FIG. 35.
  • FIG. 37 is a sequence map for explaining how tag sequence information is cut out of the sequence map that remains after vector removal. In this way, the entire region including the concatemer is sequenced, the vector sequence is removed from the sequence map, and the tag sequence information is cut out of the remaining sequence map. The DNA sequences of many analysis target tags can therefore be analyzed in a single sequencing run, which improves the sequencing efficiency for analysis target tags.
  • FIG. 38 is a flowchart for explaining the operation of the analysis target tag data generation device 300.
  • genomic DNA molecules of a predetermined species such as a human are applied to the analysis target DNA molecule application unit 802 such as a tube (S502).
  • An appropriate restriction enzyme such as MboI is applied to the restriction enzyme application unit 804, such as a tube (S504).
  • In the restriction enzyme treatment unit 806, such as a restriction enzyme kit, the DNA molecule to be analyzed and the restriction enzyme are brought into contact and incubated in an appropriate environment, whereby restriction enzyme treatment is performed (S506).
  • The genomic DNA molecule is cleaved at the restriction enzyme cleavage sites by this treatment and separates into multiple DNA fragments.
  • The plurality of DNA fragments are separated according to their number of bases by electrophoresis in an electrophoresis unit 808, such as an electrophoresis tank (S508).
  • DNA fragments whose number of bases falls within a predetermined range are cut out of the electrophoresis agarose gel and extracted by the DNA fragment extraction unit 810, such as a DNA extraction kit, using a prep method or the like (S510).
  • the concatamer generation unit 812 such as a ligation kit generates concatamers by linking the DNA fragments having the base numbers within the predetermined range thus obtained (S512). Further, a concatamer formed by linking a plurality of DNA fragments is ligated to a multicloning site of a vector such as a plasmid to generate a concatamer-containing vector.
  • This concatamer-containing vector is introduced into E. coli and transformed, and this E. coli is cultured to amplify the concatamer-containing vector.
  • The concatamer-containing vector is extracted from the cultured E. coli by a miniprep method or the like.
  • The secondary concatamer generation unit 814, such as a ligation kit, further ligates a plurality of the concatamers that were amplified by ligation into the vector and culture of the vector host, thereby generating a secondary concatamer (S514).
  • The DNA sequence of this secondary concatamer is sequenced using a sequencing unit 816, such as a DNA sequencer (S516).
  • the sequence unit 816 stores the generated sequence result in the sequence result storage unit 820.
  • The analysis target tag data generation unit 820 acquires the sequence result from the sequence result storage unit 820 and, based on the sequence results of those DNA fragments whose number of bases falls within a predetermined range, generates analysis target tag data (S518). Further, the analysis target tag data generation unit 820 stores the generated analysis target tag data in the analysis target tag data storage unit 822.
  • the output unit 824 acquires the analysis target tag data from the analysis target tag data storage unit 822, outputs it to the DNA sequence analyzer 100 (S520), and the series of flows ends.
  • In DGS, the genome is cut with a restriction enzyme suited to the purpose, such as the 4-base-recognizing restriction enzyme MboI, and fragments of about 30-80 bp are counted as tags, so that the copy number of the genome can be analyzed. As a result, the reliability of the data obtained and the data acquisition efficiency are improved.
  • The working speed of DGS depends on how quickly the base sequence data of the analysis target tags can be acquired.
  • With ordinary analytical equipment, sequence analysis of 196 samples takes approximately 24 hours. On this assumption, the simulations of the present inventors have shown that 10,000 tags must be analyzed to identify a 1.3 Mbp amplification region. With a goal of 10,000 tags, the analysis takes 51 days if each sample contains only 1 tag, but the goal can be reached in about 5 days if each sample contains 10 tags. The method is therefore expected to be usable as a practical system.
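  • The time estimate above follows from simple arithmetic, sketched below. The function name `days_to_goal` is hypothetical; the default of 196 samples per day and the goal of 10,000 tags are the figures quoted in the text, and the sketch assumes every sample yields the same number of usable tags.

```python
import math


def days_to_goal(target_tags: int, tags_per_sample: float,
                 samples_per_day: int = 196) -> float:
    """Days of sequencing needed to collect `target_tags` tags."""
    # Number of samples (clones) that must be sequenced to reach the goal.
    samples_needed = math.ceil(target_tags / tags_per_sample)
    return samples_needed / samples_per_day


for tags_per_sample in (1, 10):
    print(tags_per_sample, round(days_to_goal(10_000, tags_per_sample), 1))
# 1 51.0
# 10 5.1
```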
  • The present inventors predicted that an even longer concatemer could be obtained by ligating concatemers once more (re-extension), and experiments using the steps described above confirmed that this is possible. In vitro experiments showed that a single step forms concatemers combining 3 to 5 tags derived from different chromosomes. Preliminary experiments so far show that repeating this concatemer generation step up to a second stage yields concatemers of about 7 tags on average, which overcomes the problem of the time needed to generate analysis target tag data in DGS.
  • With the analysis target tag data generation device 300, a secondary concatemer can be generated without PCR, as described above, so the experimental process is simple. Because bias due to PCR is unlikely to occur, highly accurate quantitative results can be obtained even for a genome as complex as the human genome.
  • FIG. 39 is a conceptual diagram for explaining the flow of tag automatic analysis.
  • The tag data extraction step will be described.
  • When a concatemer DNA sequence is sequenced in both directions, an alignment of the reads from the two directions is created. The restriction enzyme sites in the aligned DNA sequences are then identified, and the concatemer structure in these sequences is determined. The vector sequences are then removed from the DNA sequence, and each tag sequence is cut out.
  • The data for each vector are aggregated based on the results of the tag mapping and determination steps described above, and the concatemer status, number of tags, and error reasons are analyzed to obtain data that can be used for reanalysis and for reviewing experimental conditions. Aggregating these data also makes it possible to check for duplicate concatemers.
  • Data for each tag are also aggregated; that is, the number of votes for each virtual tag is counted.
  • the data for each tag is analyzed and visualized.
  • dynamic analysis visualization can be performed by changing the window size and threshold (for example, tag density histogram and grid display).
  • When the tag density histogram is used, the tag density can be obtained by dividing the total number of raw tags (analysis target tags) in a predetermined region of the genomic DNA sequence to be analyzed by the total number of virtual tags (control tags) in the corresponding region of the control genomic DNA sequence.
  • The window size is the number of virtual tags in the denominator when calculating the tag density (it corresponds to the size of the genome region used for the density calculation).
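  • The tag density calculation can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the function name `tag_density_profile` is hypothetical, `votes` is assumed to hold the number of raw tags that matched each virtual tag in chromosomal order, and windows are taken as consecutive non-overlapping blocks of virtual tags (the text does not say whether the windows overlap).

```python
def tag_density_profile(votes: list[int], window: int) -> list[float]:
    """Tag density for consecutive windows of `window` virtual tags.

    Density = raw tags counted in the window / virtual tags in the window.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of virtual tags")
    profile = []
    for start in range(0, len(votes), window):
        chunk = votes[start:start + window]  # last chunk may be shorter
        profile.append(sum(chunk) / len(chunk))
    return profile


votes = [1, 0, 1, 0, 3, 3, 3, 3]
print(tag_density_profile(votes, window=4))  # [0.5, 3.0]
```

  • In this toy input the second window has six times the density of the first, which is the kind of local increase that would be flagged as a candidate amplification.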
  • FIG. 40 is a conceptual diagram for explaining the classification of tags.
  • Tag classification corresponds to an operation for setting a predetermined contribution for each tag.
  • The tag classification method shown in this figure is one example; various other tag classification methods are possible.
  • First, it is determined whether the full length of the raw tag (analysis target tag) matches 100% of the full length of exactly one type of V-tag (virtual tag: control tag). If so, decision code 0 (for example, contribution 1) is assigned to the raw tag, and a vote is cast for that V-tag. Otherwise, the process proceeds to the next step.
  • In the next step, it is determined whether the length of the raw tag falls outside the range of tag lengths included in the VT-DB (virtual tag database). If it does, it is judged that the raw tag was not cut out correctly or that there is a problem with the tag size setting of the VT-DB; decision code 1 (for example, contribution 0) is assigned to the raw tag, and no vote is cast. If the range is not exceeded, the process proceeds to the next step.
  • In the next step, it is determined whether the raw tag matches 100% of the full length of two or more types of V-tags. If so, the raw tag is judged to be derived from a repeat, decision code 2 (for example, contribution 0) is assigned to the raw tag, and no vote is cast. Otherwise, the process proceeds to the next step.
  • In the next step, it is determined whether exactly one type of V-tag exists that matches the raw tag with a mismatch of 1 to 3 bases (or less than 10%). If so, it is judged highly probable that the raw tag contains a sequencing error or is a SNP tag; the raw tag is sent to re-voting and assigned decision code 3. Otherwise, the process proceeds to the next step.
  • In the next step, it is determined whether two or more types of V-tags exist that match the raw tag with a mismatch of 1 to 3 bases (or less than 10%). If so, the raw tag is judged to be derived from a repeat, decision code 4 (for example, contribution 0) is assigned to the raw tag, and no vote is cast. Otherwise, the process proceeds to the next step.
  • In the next step, it is determined whether one or both ends of the raw tag is a SpeI (ACTAGT) or PstI (CTGCAG) site.
  • If so, the raw tag is assigned decision code 10 (for example, contribution 0), and no vote is cast.
  • Using the algorithm named BLAST, it is determined whether the raw tag has high homology with a known sequence. In this case, if there is a mismatch of 4 bases (10%) or more, it is determined that the sequence is not derived from the human genome, and homology with the genomes of E. coli, mitochondria, the vector (which may not have been removed in advance), and various other species is searched to determine the origin of the DNA sequence.
  • In the next step, it is determined whether the full length of the raw tag matches 100% of a part of two or more types of V-tags.
  • If the full length of the raw tag matches 100% of a part of two or more types of V-tags, the raw tag is judged to be a repeat sequence, decision code 6 (for example, contribution 0) is assigned to the raw tag, and no vote is cast.
  • If the full length of the raw tag does not match 100% of a part of two or more types of V-tags, the process proceeds to the next step.
  • In the next step, it is determined whether the raw tag matches a part of exactly one type of V-tag with a mismatch of 1 to 3 bases. If so, it is judged that a sequencing error occurred when the raw tag was sequenced or that it is a SNP tag, and decision code 7 (for example, contribution 0) is assigned to the raw tag. Otherwise, the process proceeds to the next step.
  • In the next step, it is determined whether the raw tag matches a part of two or more types of V-tags with a mismatch of 1 to 3 bases. If so, the raw tag is judged to be a repeat sequence, decision code 8 (for example, contribution 0) is assigned to the raw tag, and no vote is cast. Otherwise, the process proceeds to the next step.
  • Raw tags that reach the next step do not belong to any of the above categories.
  • Decision code 9 (for example, contribution 0) is assigned to these raw tags, and no vote is cast.
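  • The full-length part of the classification above (decision codes 0 to 4) can be sketched as follows. This is a simplified illustration, not the patent's program: the function names, the dictionary layout of the V-tags, and the length limits are hypothetical. Mismatches are counted position by position between sequences of equal length, which is an assumption; the "less than 10%" alternative, the end-site check, the BLAST step, and the partial-match steps (codes 6 to 10) are not implemented, and `None` is returned for tags that would go on to those steps.

```python
from typing import Optional


def hamming(a: str, b: str) -> Optional[int]:
    """Number of mismatched bases, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def classify_full_length(raw: str, v_tags: dict[str, str],
                         min_len: int, max_len: int,
                         max_mismatch: int = 3) -> Optional[int]:
    """Return decision code 0-4, or None if later steps are needed."""
    exact = [tag_id for tag_id, seq in v_tags.items() if seq == raw]
    if len(exact) == 1:
        return 0  # perfect match to one V-tag: vote for it
    if not min_len <= len(raw) <= max_len:
        return 1  # length outside the VT-DB range: no vote
    if len(exact) >= 2:
        return 2  # perfect match to several V-tags: repeat, no vote
    near = []
    for tag_id, seq in v_tags.items():
        distance = hamming(raw, seq)
        if distance is not None and 1 <= distance <= max_mismatch:
            near.append(tag_id)
    if len(near) == 1:
        return 3  # likely sequencing error or SNP: send to re-voting
    if len(near) >= 2:
        return 4  # near match to several V-tags: repeat, no vote
    return None


def make_tag(insert: str) -> str:
    """Toy helper: wrap an insert in GATC ends."""
    return "GATC" + insert + "GATC"


v_tags = {
    "v1": make_tag("A" * 10),
    "v2": make_tag("C" * 10),
    "v3": make_tag("C" * 10),
    "v4": make_tag("T" * 10),
    "v5": make_tag("T" * 9 + "A"),
}
print(classify_full_length(make_tag("A" * 10), v_tags, 10, 30))  # 0
```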
  • Although the above embodiment detects differences in the copy number of a human genomic DNA sequence, differences in the copy number of genomic DNA sequences of various organisms other than humans may also be detected. This would allow application to a wide range of industries beyond medicine, including food, chemistry, agriculture, forestry, and fisheries.
  • Although the above embodiment analyzes the entire human genomic DNA sequence, the analysis target may instead be a chromosomal DNA sequence that is a part of the human genomic DNA sequence, or a further partial DNA sequence of a chromosome. This has the advantage that research can be conducted efficiently by narrowing down the region of the human genome.
  • MboI is used as a restriction enzyme, but other restriction enzymes may be used.
  • a restriction enzyme that recognizes and cleaves a 4-base sequence is suitably used in the present embodiment because of the large number of virtual tags obtained.
  • Digital genome scanning (DGS) is a technology that enables comprehensive quantitative analysis of human genomic DNA with high accuracy and high resolution.
  • Virtual tags are generated by restriction enzyme processing on a computer using human whole-genome DNA information.
  • MboI recognizes and cleaves the DNA sequence GATC, so the BamHI site can be used for cloning concatamers (ligated tags).
  • The following simulations proceeded with an analysis centered on the virtual tags generated by MboI.
  • a tag with a sequence that matches multiple locations on the genome cannot be used as a tag in DGS because it cannot identify the site on the genome. Therefore, the present inventors predicted that a tag derived from a repeat sequence is likely to be such an invalid tag. Therefore, among virtual tags, tags derived from repeat sequences were tabulated and their ratios were analyzed.
  • Figure 20 shows a random number generation algorithm for simulating gene amplification (amplification), gene deletion (homozygous deletion), and loss of heterozygosity (LOH), which are subject to analysis by DGS.
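  • A random-number simulation of this kind can be sketched as follows. The algorithm of Figure 20 is not reproduced here; this is an assumed, simplified model in which each raw tag is drawn at random from the virtual tags with probability proportional to the copy number of the region the tag comes from. The function name, the tag labels, and the copy-number values are hypothetical (2 for a normal region, 0 for a homozygous deletion, a larger value for an amplification; a one-copy loss could be given the value 1).

```python
import random
from collections import Counter


def simulate_tag_counts(copy_numbers: dict[str, int], n_tags: int,
                        seed: int = 0) -> Counter:
    """Draw `n_tags` raw tags, weighting each virtual tag by copy number."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    tag_ids = list(copy_numbers)
    weights = [copy_numbers[tag_id] for tag_id in tag_ids]
    return Counter(rng.choices(tag_ids, weights=weights, k=n_tags))


copy_numbers = {"normal": 2, "amplified": 10, "deleted": 0}
counts = simulate_tag_counts(copy_numbers, n_tags=10_000)
print(counts["deleted"])  # 0, because a weight of zero is never drawn
```

  • Repeating such a simulation with different numbers of tags gives a rough idea of how many tags are needed before an amplified or deleted region stands out from sampling noise.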
  • FIG. 41 is an electropherogram for explaining the purification of tags from the HSC45 genome and the production of concatemers.
  • A preliminary DGS experiment was performed using human genomic DNA extracted from gastric cancer cell line HSC45.
  • HSC45 genomic DNA was treated with MboI restriction enzyme (Fig. 41A), the resulting short tags were ligated to produce concatemers, and cloning was attempted.
  • FIG. 42 is a graph showing tag size distribution and repeat/unique classification. As a result, 5593 raw tags were obtained from 823 clones. The size distribution of raw tags is shown in Fig. 42C. The longest tag length was 118 bp, the shortest tag length was 0 bp, and the average was 23.8 bp.
  • VT-DB: virtual tag database
  • The VT-DB includes, for each V-tag, its ID, chromosome number, position on the chromosome, sequence, whether it is derived from a repeat, and whether it is unique.
  • A V-tag is defined as unique if it can be located at only one place on the genome, i.e., there is no other V-tag with the same sequence.
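  • The uniqueness definition above can be sketched as follows. The function name `flag_unique`, the ID format, and the toy sequences are hypothetical, and the sketch compares sequences exactly as stored; whether the actual VT-DB also treats reverse-complement matches as duplicates is not stated in the text and is not handled here.

```python
from collections import Counter


def flag_unique(v_tags: dict[str, str]) -> dict[str, bool]:
    """Mark each V-tag as unique if no other V-tag shares its sequence."""
    occurrences = Counter(v_tags.values())
    return {tag_id: occurrences[seq] == 1 for tag_id, seq in v_tags.items()}


v_tags = {
    "chr1:100": "GATCAAAAGATC",
    "chr2:500": "GATCCCCCGATC",
    "chr7:900": "GATCCCCCGATC",
}
print(flag_unique(v_tags))
# {'chr1:100': True, 'chr2:500': False, 'chr7:900': False}
```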
  • Fig. 43 is a diagram showing the correspondence between repeat tags and unique tags in the virtual tag database.
  • Among the virtual tags in the VT-DB, 63.41% were derived from repeat sequences, whereas unique sequences accounted for an unexpectedly high 89.37% (Fig. 43A). Even among tags classified as repeat sequences in the genome information, 83.71% were unique, which suggests that most tags are not wasted.
  • FIG. 44 is a diagram showing a breakdown of raw tags acquired from HSC45.
  • The completed VT-DB was checked against the raw tag sequences, and perfect matches (the full length of the raw tag sequence matching a V-tag 100%) were extracted.
  • As a result, of all 5593 raw tags, 3133 (56.02%) matched unique V-tags, and 1540 (27.53%) matched non-unique V-tags.
  • 920 tags were classified as stray tag #1 (Fig. 44).
  • FIG. 45 is a graph showing the tag density calculated by setting the window size.
  • the obtained 3133 perfect match raw tags were checked against VT-DB, and the number of votes for each V-tag was calculated. Thereafter, the tag density of the region was calculated.
  • Tag density = (number of unique raw tag votes in the area) / (number of unique V-tags in the area).
  • The size of the area for calculating the density (hereinafter referred to as the window) was determined by the number of unique V-tags. Roughly 554 V-tags are equivalent to 1 Mbp of genome. On this basis, the tag density was calculated with the window size set from 2 Mbp to 10 Mbp (Fig. 6).
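  • Because the window is defined by a number of V-tags rather than by base pairs, a window size given in Mbp has to be converted first. The sketch below does this with the approximate figure of 554 unique V-tags per Mbp quoted above; the function name `window_in_v_tags` is hypothetical, and the sketch assumes that the same ratio holds everywhere in the genome, which is only a rough approximation.

```python
V_TAGS_PER_MBP = 554  # approximate figure quoted in the text


def window_in_v_tags(window_mbp: float,
                     v_tags_per_mbp: int = V_TAGS_PER_MBP) -> int:
    """Convert a window size in Mbp to a number of unique V-tags."""
    return round(window_mbp * v_tags_per_mbp)


for mbp in (2, 5, 10):
    print(mbp, window_in_v_tags(mbp))
# 2 1108
# 5 2770
# 10 5540
```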
  • FIG. 46 is a graph and physical map showing areas showing abnormal tag density.
  • Graphs of Ch8 and Ch18, which showed clearly higher tag density than their surroundings, are shown (Fig. 46A). Regions that appear to be amplified were observed at the end of Ch8 and at the beginning of Ch18.
  • DGS was used for genome analysis of gastric cancer cell lines. 4) Approximately 3000 valid raw tags were obtained. 5) In the tag density analysis, regions that appear to be abnormally amplified were identified.
  • FIG. 47 is a diagram showing a breakdown of the size and number of raw MboI tags obtained when DGS analysis was performed using genomic DNA of gastric cancer cell lines.
  • DGS analysis was performed using genomic DNA of a gastric cancer cell line as a sample.
  • 9866 raw tags were recovered by MboI restriction enzyme treatment, of which 5515 raw tags were classified as unique tags and mapped to the genome.
  • FIG. 48 is a genome-wide tag density graph obtained when DGS analysis was performed using genomic DNA of a gastric cancer cell line.
  • the tag density of the entire genome is shown as an overhead view for each chromosome.
  • abnormal amplification of tag density was observed at two positions, chromosome 8 and chromosome 12.
  • FIG. 49 is a diagram showing genome amplification of 8q24.21 on the long arm of chromosome 8. The left is the tag density of chromosome 8 and the right is the tag map displayed by the DGS server. The top row of each screen shows the sites of unique virtual tags, the second row shows the sites of non-unique virtual tags, the third row shows the sites of the obtained raw tags, and the lower part shows the gene sites.
  • Figure 49 shows that the myc oncogene is present in the amplification region (circled region).
  • FIG. 50 shows the relationship between c-myc genomic amplification and mRNA overexpression.
  • Genomic amplification of the c-myc region was confirmed by the Southern blot method.
  • the real-time PCR method for genome quantification confirmed the genomic amplification (10 to 15-fold amplification) of the c-myc region of the gastric cancer cell line targeted for analysis.
  • another type of gastric cancer cell line which showed amplification of the same site, was found.
  • The expression of c-myc mRNA was also increased in correlation with the degree of genomic amplification (a 9-fold increase compared with the control cell line), as confirmed by PCR.
  • FIG. 51 is a diagram showing genome amplification of 12p12.1 on the short arm of chromosome 12.
  • FIG. 51 shows the tag density of the chromosome 12 (upper and middle) and the gene existing in the same site (lower).
  • Figure 51 shows that the K-ras oncogene (circled region) exists in the amplified region.
  • FIG. 52 is a tag map showing the distribution of raw tags in a 3 Mbp region centered on the K-ras gene. From this figure, it can be seen that the raw tags are concentrated only in the genomic region where the K-ras gene (circled region) exists.
  • FIG. 53 is a diagram showing genome amplification of the K-ras region.
  • the region where abnormal amplification occurred was determined by real-time PCR for the purpose of genome quantification.
  • Amplification (seven-fold) occurred in the 0.5 Mbp region including the K-ras region in the gastric cancer cell line subjected to DGS analysis.
  • Genomic amplification in the region containing K-ras was also observed in three other gastric cancer cell lines.
  • Figure 54 shows the relationship between K-ras genomic amplification and mRNA and protein overexpression.
  • genomic amplification of the K-ras region was confirmed by Southern blotting.
  • When K-ras mRNA expression was analyzed by the real-time RT-PCR method, an increase of about 10-fold was observed.
  • the other 2 genes (LRMP, LOC144363)
  • FIG. 55 is a diagram showing an outline of the DGS analysis system.
  • The DGS analysis system used in Example 2 described above was constructed using Ensembl as a DGS server that stores all genome information and all tag information as a database.
  • Tag density information is included in the client, and for areas where density anomalies are recognized, the DGS server can be accessed to extract tag and gene position information and visualize it as a map.
  • the tag density is used as an index of the genome copy number, but other indexes such as the number of raw tags corresponding to a predetermined virtual tag may be used.
  • The DNA sequence analyzer according to the present invention can reliably identify copy number abnormalities of a genomic DNA sequence with high resolution, and is useful as a DNA sequence analysis method and program.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Wood Science & Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Zoology (AREA)
  • Engineering & Computer Science (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Analytical Chemistry (AREA)
  • Biotechnology (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The aim is to provide a technique by which copy number abnormalities in a genomic DNA sequence can be identified with high resolution and high reliability. Correspondence data linking control tag data and analyte tag data are compiled from the two data sets. Using these correspondence data, the analyte tags corresponding to each region of the genome are counted to detect chromosomal abnormalities (amplification, deletion, etc.) in each region. The data on amplification, deletion, and the like in each region of the genome can be highly useful in identifying, for example, a disease gene.

Description

Specification
DNA sequence analyzer, DNA sequence analysis method, and program
Technical field
[0001] The present invention relates to a DNA sequence analyzer, a DNA sequence analysis method, and a program.
Background art
[0002] Abnormalities in gene copy number in somatic cells and germ cells cause serious abnormalities at the cellular and individual levels. In humans, chromosomal changes accompanied by deletion of tumor suppressor genes or amplification of oncogenes are characteristic of cancer cells. In addition, copy number changes of a specific chromosome or of a limited region of a chromosome cause many diseases related to development and differentiation, such as Down syndrome.
[0003] Meanwhile, in early 2001, the draft sequence of the human genome, comprising about 3 billion base pairs, was published, and bioinformatics researchers became able to use this "treasure trove" freely. The decoded human genome is a storehouse of extremely valuable information on human medicine, development, physiology, and evolution, and the technical field that uses this human genome information is strongly expected to create new industries that develop rapidly from the beginning through the first half of the 21st century.
[0004] Recent techniques for analyzing quantitative changes in intracellular genetic information, that is, copy number abnormalities of chromosomes and their specific regions, include comparative genomic hybridization (CGH), representational difference analysis (RDA), and classical cytogenetic techniques.
[0005] A conventional technique for analyzing gene copy number abnormalities is described, for example, in Tian-Li Wang et al., "Digital karyotyping", Proceedings of the National Academy of Sciences of the United States of America, December 10, 2002, vol. 99, no. 25, pages 16156-16161. In the analysis method described in that document, virtual tags obtained by cleaving a control genomic DNA sequence with a restriction enzyme are compared with raw tags obtained by cleaving a genomic DNA sequence to be analyzed with the restriction enzyme, in order to analyze changes in the genomic DNA sequence to be analyzed relative to the control genomic DNA sequence.
[0006] However, the conventional techniques described above leave room for improvement in the following respects. First, the CGH, RDA, and classical cytogenetic techniques mentioned above, which use metaphase chromosomes, have a resolution limit of about 20 Mb and are difficult to use for analyzing copy number changes confined to smaller regions. Although the recent shift from the CGH method to microarray methods has increased resolution, the number of sequences that can be analyzed is limited and special equipment is required. To overcome these problems, a method that can identify copy number abnormalities of genomic DNA sequences at high resolution is desired.
[0007] Second, in analysis methods based on DNA hybridization, typified by the CGH method, repetitive sequences are removed from the sample in advance by molecular biological techniques using hCot-1 DNA. The analysis method described in Non-Patent Document 1 also removes repetitive sequences from the tags to be analyzed. This means discarding information on regions containing repetitive sequences, which occupy about 45% of the genome. These analysis methods therefore leave room for further improvement in the reliability of their results.
発明の開示  Disclosure of the invention
[0008] 本発明は上記事情に鑑みてなされたものであり、高解像度でゲノム DNA配列のコ ピー数異常を信頼性よく同定できる技術を提供することを目的とする。  [0008] The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a technique capable of reliably identifying an abnormal copy number of a genomic DNA sequence with high resolution.
[0009] According to the present invention, there is provided a DNA sequence analyzer comprising: a control tag data acquisition unit that acquires control tag data in which a plurality of control tags, each obtained by cleaving a control genomic DNA sequence with a restriction enzyme, each occurring no more than a predetermined number of times in the control genomic DNA sequence, and each consisting of a DNA sequence whose number of bases falls within a predetermined range, are associated with their corresponding positions in the control genomic DNA sequence; an analysis target tag data acquisition unit that acquires analysis target tag data, which is a set of a plurality of analysis target tags each obtained by cleaving an analysis target genomic DNA sequence with the same restriction enzyme and each consisting of a DNA sequence whose number of bases falls within a predetermined range; a corresponding tag data generation unit that compares the control tag data with the analysis target tag data and generates corresponding tag data in which mutually corresponding control tags and analysis target tags are associated with each other; a copy number determination unit that analyzes the corresponding tag data, determines the number of analysis target tags corresponding to each control tag, and, based on this number, determines the difference in copy number, relative to the control genomic DNA sequence, of the region of the analysis target genomic DNA sequence that includes the position corresponding to that control tag; and an output unit that outputs the data processed by the copy number determination unit.
[0010] According to this configuration, analysis target tags, which are short fragments obtained by treating the analysis target genomic DNA sequence with a restriction enzyme, are counted as representatives of the genome and compared with control tags, which are virtual tags derived from the control genomic DNA sequence. This allows copy number to be quantified comprehensively across the human genome, and regions of the analysis target genomic DNA sequence that exhibit copy number abnormalities can be identified at high resolution on that basis.
[0011] Because this configuration uses a plurality of control tags that each occur no more than a predetermined number of times in the control genomic DNA sequence, even a tag that contains a repetitive sequence can be used, provided that it is sufficiently unique in this sense. The configuration can therefore make use of information from regions containing repetitive sequences, which account for a certain proportion of the control genomic DNA sequence, and this improves the reliability of the analysis results.
[0012] That is, according to the present invention, copy number abnormalities in a genomic DNA sequence can be reliably identified at high resolution.
[0013] The DNA sequence analyzer described above is one aspect of the present invention. The DNA sequence analysis method, DNA sequence analysis system, DNA sequence analysis program, recording medium containing the program, and the like of the present invention have the same configuration.
Brief Description of the Drawings
[0014] [FIG. 1] A conceptual diagram explaining the principle of digital genome scanning and outlining chromosome abnormality analysis using virtual tags.
[FIG. 2] A functional block diagram showing the overall configuration of the DNA sequence analysis system 1000.
[FIG. 3] A functional block diagram showing the overall configuration of the DNA sequence analysis system 2000, a modification of the DNA sequence analysis system 1000.
[FIG. 4] A functional block diagram showing the internal configuration of the DNA sequence analyzer 100.
[FIG. 5] A functional block diagram showing the internal configuration of the DNA sequence analyzer 400, a modification of the DNA sequence analyzer 100.
[FIG. 6] A flowchart explaining the operation of the DNA sequence analysis system 1000.
[FIG. 7] A functional block diagram showing the internal configuration of the corresponding tag data generation unit 210.
[FIG. 8] A flowchart explaining the operation of the corresponding tag data generation unit 210.
[FIG. 9] A functional block diagram showing the internal configuration of the copy number determination unit 214.
[FIG. 10] A flowchart explaining the operation of the copy number determination unit 214.
[FIG. 11] A conceptual diagram explaining how data are visualized on a per-virtual-tag basis.
[FIG. 12] A conceptual diagram explaining how data are visualized on a per-virtual-tag basis.
[FIG. 13] A functional block diagram explaining the internal configuration of the control tag data generation device 200.
[FIG. 14] A diagram showing the number of tags generated from the genome by each restriction enzyme.
[FIG. 15] A graph showing the number of tags of each size generated by MboI.
[FIG. 16] A diagram showing the number of tags when a range of tag sizes is allowed.
[FIG. 17] A graph showing the distribution of MboI (^GATC) tags.
[FIG. 18] A diagram showing the number of valid virtual tags generated by MboI.
[FIG. 19] A conceptual diagram illustrating the DGS Monte Carlo simulation.
[FIG. 20] A diagram explaining the details of the DGS Monte Carlo simulation.
[FIG. 21] A screen display showing the user interface used for the DGS Monte Carlo simulation.
[FIG. 22] A diagram summarizing the DGS simulation results, for the case of 165,845 MboI non-repeat virtual tags, in terms of abnormality detection resolution and the number of analysis tags required.
[FIG. 23] A conceptual diagram explaining the difference between tags derived from repeats at both ends and tags derived from a repeat at one end.
[FIG. 24] A graph showing, for the purpose of explaining the review of the MboI virtual tags, the size distributions of tags embedded in repeat regions (X both) and of tags spanning a repeat region and a non-repeat region (X either).
[FIG. 25] A diagram explaining the verification of the uniqueness of virtual tags derived from repeat sequences.
[FIG. 26] A diagram for examining whether the tag excision size should be shifted toward longer lengths.
[FIG. 27] A diagram explaining the case in which MboI virtual tags spanning a repeat region and a non-repeat region (one-end repeat) are regarded as valid tags.
[FIG. 28] A flowchart explaining the operation of the control tag data generation device 200.
[FIG. 29] A functional block diagram explaining the internal configuration of the analysis target tag data generation device 300.
[FIG. 30] A diagram explaining the extraction of tag DNA and the preparation of concatemers.
[FIG. 31] A graph showing the results of analyzing and tallying the base sequences of the tags used in a preliminary experiment.
[FIG. 32] A conceptual diagram explaining the operation flow of the analysis target tag data generation device 300.
[FIG. 33] A conceptual diagram explaining the re-extension of concatemers.
[FIG. 34] A restriction map for explaining how the structure of a concatemer is determined.
[FIG. 35] A sequence map of a DNA sequence for explaining how the structure of a concatemer is determined.
[FIG. 36] The sequence map obtained by removing the vector sequence from the sequence map of FIG. 35.
[FIG. 37] A sequence map explaining how tags are excised from the sequence map of FIG. 36.
[FIG. 38] A flowchart explaining the operation of the analysis target tag data generation device 300.
[FIG. 39] A conceptual diagram explaining the flow of automated tag analysis.
[FIG. 40] A conceptual diagram explaining the flow of tag classification.
[FIG. 41] An electrophoresis image explaining the purification of tags from the HSC45 genome and the preparation of concatemers.
[FIG. 42] A graph showing the tag size distribution and the repeat/unique classification.
[FIG. 43] A diagram showing the correspondence between repeat tags and unique tags in the virtual tag database.
[FIG. 44] A diagram showing the breakdown of the raw tags obtained from HSC45.
[FIG. 45] A graph showing the tag density calculated for a set window size.
[FIG. 46] A graph and a physical map showing regions with abnormal tag density.
[FIG. 47] A diagram showing the breakdown, by size and number, of the MboI raw tags obtained when DGS analysis was performed on genomic DNA of a gastric cancer cell line.
[FIG. 48] A genome-wide tag density graph obtained when DGS analysis was performed on genomic DNA of a gastric cancer cell line.
[FIG. 49] A diagram showing genomic amplification at 8q24.21 on the long arm of chromosome 8.
[FIG. 50] A diagram showing the relationship between genomic amplification of c-myc and overexpression of its mRNA.
[FIG. 51] A diagram showing genomic amplification at 12p12.1 on the short arm of chromosome 12.
[FIG. 52] A tag map showing the distribution of raw tags in a 3 Mbp region centered on the K-ras gene.
[FIG. 53] A diagram showing genomic amplification of the K-ras region.
[FIG. 54] A diagram showing the relationship between genomic amplification of K-ras and overexpression of its mRNA and protein.
[FIG. 55] A diagram showing an overview of the DGS analysis system.
Best Mode for Carrying Out the Invention
[0015] Embodiments of the present invention are described below with reference to the drawings. In all the drawings, like components are denoted by like reference numerals, and their description is omitted where appropriate.
[0016] <Explanation of Terms>
Genome scanning: a method for comprehensively analyzing genetic information on the genome. Techniques include arbitrarily primed PCR (AP-PCR) and restriction landmark genome scanning (RLGS), which are effective for identifying gene amplifications and deletions in cancer. With existing techniques, however, only about 0.1% to 1% of the whole genome can be analyzed.
[0017] Comparative genomic hybridization (CGH): a competitive fluorescence in situ hybridization method used mainly to detect regions that are amplified or deleted in the chromosomes of tumor cells.
[0018] SAGE (serial analysis of gene expression): a method for simultaneously detecting the expression of a large number of mRNA transcripts.
[0019] Concatemer: a group of DNA fragments joined in series by, for example, DNA ligase.
[0020] <Outline of the Invention>
FIG. 1 is a conceptual diagram explaining the principle of digital genome scanning and outlining chromosome abnormality analysis using virtual tags. Only an outline is given here; details are described later.
[0021] First, as shown at the upper right of the figure, control genomic DNA sequence data consisting of the genome information of a given species, such as human, are prepared in order to create the control tag data, that is, the virtual tags. Next, an algorithm for tag extraction is created. With this algorithm, the control genomic DNA sequence data are cut at the cleavage sites of a given restriction enzyme, and the resulting control tags (virtual tags) are compiled into a database.
[0022] This database of control tags is then linked to the sequence text of the genome of the given species (the sequence text of the control genomic DNA sequence). By further linking the database, via the sequence text, to positional information on the genome of the given species (positional information of the control genomic DNA sequence), the control tag data are obtained.
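The in-silico digestion and position linking described in [0021] and [0022] can be illustrated with a minimal sketch. The function name `extract_virtual_tags`, the tag record layout, the choice to include the recognition site at both ends of a tag, and the default length bounds are all assumptions made for illustration; the GATC site is the MboI recognition sequence named in the figure list.

```python
import re

def extract_virtual_tags(chrom, sequence, site="GATC", min_len=20, max_len=130):
    """Cut `sequence` in silico at every `site` and return position-linked tags.

    Each tag is a dict holding the chromosome name, a 0-based start, an
    exclusive end, and the fragment sequence. The fragment is taken to include
    the recognition site at both ends (an illustrative assumption).
    """
    sequence = sequence.upper()
    # A zero-width lookahead reports the start of every occurrence of the site.
    starts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", sequence)]
    tags = []
    for left, right in zip(starts, starts[1:]):
        end = right + len(site)
        fragment = sequence[left:end]
        # Keep only fragments whose length lies in the chosen (hypothetical) range.
        if min_len <= len(fragment) <= max_len:
            tags.append({"chrom": chrom, "start": left, "end": end, "seq": fragment})
    return tags
```

Because every record carries its chromosome and coordinates, a tag sequence observed later in an experiment can be mapped straight back to a genomic position.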
[0023] Meanwhile, as shown at the upper left of the figure, genomic DNA molecules consisting of multiple chromosomes (analysis target genomic DNA molecules) are prepared by DNA extraction from cells of a given species, such as human. These genomic DNA molecules are then cleaved with the given restriction enzyme, and the resulting analysis target tags are ligated to form multiple concatemers.
[0024] After these concatemers are ligated into a vector, their DNA sequences are read by sequencing reactions. Each concatemer sequence thus read is converted into the DNA sequences of its constituent tags, yielding the analysis target tag data.
[0025] Next, as shown at the lower center of the figure, correspondence data between the control tag data and the analysis target tag data are generated from the data obtained as described above. From these correspondence data, the number of analysis target tags corresponding to each region of the genome is determined, and chromosomal abnormalities such as amplifications and deletions in each region are thereby detected. Information on such amplifications and deletions is used, for example, to identify disease genes.
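The conversion of a concatemer read into individual tags mentioned in [0024] might look like the following sketch. It assumes that tags are joined at GATC sites, that the pieces before the first site and after the last site are flanking vector sequence, and that each tag is reduced to a strand-independent form because a fragment can be ligated in either orientation. The helper names and length bounds are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def split_concatemer(read, site="GATC", min_len=20, max_len=130):
    """Split a sequenced concatemer into tags bounded by `site` on both sides."""
    pieces = read.upper().split(site)
    tags = []
    # pieces[0] and pieces[-1] are treated as flanking vector sequence (assumption).
    for inner in pieces[1:-1]:
        tag = site + inner + site
        if min_len <= len(tag) <= max_len:
            # Store the lexicographically smaller strand as the canonical form.
            tags.append(min(tag, reverse_complement(tag)))
    return tags
```

If tags are canonicalized this way, the control tags would need the same treatment before the two sets are compared.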
[0026] FIG. 2 is a functional block diagram showing the overall configuration of the DNA sequence analysis system 1000.
The DNA sequence analysis system 1000 includes a DNA sequence analyzer 100 that acquires control tag data and analysis target tag data and analyzes changes in the genomic DNA sequence. The system also includes a control tag data generation device 200 that generates the control tag data, and an analysis target tag data generation device 300 that generates the analysis target tag data.
[0027] The DNA sequence analysis system 1000 includes an other-genome DNA sequence database 120 that stores genomic DNA sequence data of species different from that of the control tag data and the analysis target tag data. The system also includes an operation unit 102 for operating the DNA sequence analyzer 100.
[0028] The DNA sequence analysis system 1000 includes an image display device 104 that displays, as images, the data output from the DNA sequence analyzer 100. The system also includes a printer 106 that prints the data output from the DNA sequence analyzer 100, and a PC (personal computer) 108 that receives the data output from the DNA sequence analyzer 100.
[0029] FIG. 3 is a functional block diagram showing the overall configuration of the DNA sequence analysis system 2000, a modification of the DNA sequence analysis system 1000.
[0030] The DNA sequence analysis system 2000 has basically the same configuration as the DNA sequence analysis system 1000 of FIG. 2, but differs in that the control genomic DNA sequence data acquisition unit 202 (FIG. 4) is provided inside the DNA sequence analyzer 400. It also differs in that the DNA sequence analyzer 400 is connected to the control genomic DNA sequence database 500.
[0031] The present embodiment is described below in the following order.
1. DNA sequence analysis using control tag data and analysis target tag data
2. Generation of control tag data
3. Generation of analysis target tag data
Here, "1." describes the DNA sequence analyzer 100 of FIG. 2.
"2." describes the control tag data generation device 200 of FIG. 2, which generates the data underlying "1." (data to be input to the DNA sequence analyzer 100).
"3." describes the analysis target tag data generation device 300 of FIG. 2, which likewise generates data underlying "1." (data to be input to the DNA sequence analyzer 100).
[0032] <1. DNA Sequence Analysis Using Control Tag Data and Analysis Target Tag Data>
FIG. 4 is a functional block diagram showing the internal configuration of the DNA sequence analyzer 100.
The DNA sequence analyzer 100 includes a control tag data acquisition unit 202 that acquires the control tag data input from the control tag data generation device 200. The control tag data are data in which a plurality of control tags, obtained by cleaving the control genomic DNA sequence with a restriction enzyme, are each associated with their corresponding position in the control genomic DNA sequence. Each of these control tags occurs no more than a predetermined number of times in the control genomic DNA sequence and consists of a DNA sequence whose number of bases falls within a predetermined range. The DNA sequence analyzer 100 also includes a control tag data storage unit 206 that stores the control tag data acquired by the control tag data acquisition unit 202.
[0033] The DNA sequence analyzer 100 also includes an analysis target tag data acquisition unit 204 that acquires the analysis target tag data input from the analysis target tag data generation device 300. The analysis target tag data are data on a set of analysis target tags obtained by cleaving the analysis target genomic DNA sequence with the restriction enzyme. Each of these analysis target tags consists of a DNA sequence whose number of bases falls within a predetermined range. The DNA sequence analyzer 100 further includes an analysis target tag data storage unit 208 that stores the analysis target tag data acquired by the analysis target tag data acquisition unit 204.
[0034] The DNA sequence analyzer 100 includes a corresponding tag data generation unit 210 that generates corresponding tag data in which the control tag data and the analysis target tag data are associated with each other. The corresponding tag data generation unit 210 acquires the control tag data from the control tag data storage unit 206 and the analysis target tag data from the analysis target tag data storage unit 208, compares the two, and generates the corresponding tag data by associating mutually corresponding control tags and analysis target tags. The DNA sequence analyzer 100 also includes a corresponding tag data storage unit 212 that stores the corresponding tag data generated by the corresponding tag data generation unit 210.
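As a rough illustration of what the corresponding tag data generation unit 210 produces, the sketch below associates analysis target tags with control tags by exact sequence identity. Exact matching, the `(chrom, start, seq)` tuple layout, and the function name are assumptions for illustration; the sketch also assumes each control tag sequence occurs only once.

```python
from collections import defaultdict

def build_corresponding_tag_data(control_tags, analysis_tags):
    """Associate analysis target tags with control tags.

    control_tags: iterable of (chrom, start, seq) tuples with unique sequences.
    analysis_tags: iterable of tag sequences read from the sample.
    Returns (hits, unmatched): `hits` maps (chrom, start) to the number of
    matching analysis tags; `unmatched` lists tags with no control counterpart.
    """
    position_by_seq = {seq: (chrom, start) for chrom, start, seq in control_tags}
    hits = defaultdict(int)
    unmatched = []
    for seq in analysis_tags:
        position = position_by_seq.get(seq)
        if position is None:
            unmatched.append(seq)   # candidate for origin determination
        else:
            hits[position] += 1     # one more observation of this control tag
    return dict(hits), unmatched
```

The per-position counts feed the copy number determination, while the unmatched list is the natural input for the origin determination described later.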
[0035] The DNA sequence analyzer 100 includes a copy number determination unit 214 that analyzes the corresponding tag data and determines how the copy number of the analysis target genomic DNA sequence differs from that of the control genomic DNA sequence. The copy number determination unit 214 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212 to determine the number of analysis target tags corresponding to each control tag and, based on this number, determines, for the region of the control genomic DNA sequence that includes the control tag, the difference in copy number between the analysis target genomic DNA sequence and the control genomic DNA sequence. The DNA sequence analyzer 100 also includes a copy number determination result storage unit 216 that stores the copy number determination results produced by the copy number determination unit 214.
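One simple way to turn per-tag counts into a copy number call is to compare the tag count in a sliding window with the count expected from the overall average. The sketch below is only an illustration under that assumption; the normalization, the window definition (a fixed number of consecutive control tags), and the gain/loss thresholds are hypothetical and not taken from the description.

```python
def window_copy_ratios(counts, window):
    """Observed/expected tag count for each sliding window.

    counts: analysis-tag counts per control tag, ordered by genomic position.
    window: number of consecutive control tags per window.
    """
    if window <= 0 or window > len(counts):
        raise ValueError("window must be between 1 and len(counts)")
    total = sum(counts)
    if total == 0:
        raise ValueError("no analysis tags were counted")
    # Expected count per window if tags were spread evenly over all control tags.
    expected = total / len(counts) * window
    return [sum(counts[i:i + window]) / expected
            for i in range(len(counts) - window + 1)]

def classify_window(ratio, gain=2.0, loss=0.5):
    """Label a window using hypothetical thresholds on the ratio."""
    if ratio >= gain:
        return "gain"
    if ratio <= loss:
        return "loss"
    return "normal"
```

Larger windows smooth out sampling noise at the cost of resolution, which is the trade-off a window size setting has to balance.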
[0036] The DNA sequence analyzer 100 includes an other-genome DNA data search unit 224 that searches for genomic DNA sequence data of species different from that of the control tag data and the analysis target tag data. That is, the other-genome DNA data search unit 224 connects to the other-genome DNA sequence database 120 (FIG. 2), which holds sequences of an origin different from that of the control genomic DNA sequence, and searches for other-genome DNA sequence data.
[0037] The DNA sequence analyzer 100 includes an origin determination unit 226 that determines the origin of analysis target tags that do not correspond to any control tag. That is, the origin determination unit 226 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212 and determines whether a control tag corresponding to each analysis target tag exists. If no corresponding control tag exists, the origin determination unit 226 compares the analysis target tag with the other-genome DNA sequence data acquired from the other-genome DNA data search unit 224 and determines the origin of the tag. The DNA sequence analyzer 100 also includes an origin determination result storage unit 228 that stores the determination results produced by the origin determination unit 226.
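The origin determination for unmatched tags can be pictured as a lookup against sequences from other sources. In the sketch below, the in-memory dictionary standing in for the other-genome DNA sequence database 120, the exact substring search on both strands, and the function names are all assumptions for illustration; a real database of whole genomes would need an indexed search.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def determine_origin(tag, other_genomes):
    """Return the name of the first genome containing `tag` on either strand.

    other_genomes: dict mapping a source name to its sequence (hypothetical
    stand-in for a database search). Returns None if no source matches.
    """
    tag = tag.upper()
    for name, genome in other_genomes.items():
        genome = genome.upper()
        if tag in genome or reverse_complement(tag) in genome:
            return name
    return None
```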
[0038] The DNA sequence analyzer 100 includes an image data generation unit 220 that generates image data based on the copy number determination results or the origin determination results. That is, the image data generation unit 220 acquires the copy number determination results from the copy number determination result storage unit 216 and the origin determination results from the origin determination result storage unit 228 and, based on these results, generates image data for displaying, in a form that is easy for the user to understand, the copy number differences of each region of the genomic DNA sequence relative to the control genomic DNA sequence and the presence of DNA sequences of heterologous origin. The DNA sequence analyzer 100 also includes an image data storage unit 222 that stores the image data generated by the image data generation unit 220.
[0039] FIG. 5 is a functional block diagram showing the internal configuration of the DNA sequence analyzer 400, a modification of the DNA sequence analyzer 100. The configuration of the DNA sequence analyzer 400 is basically the same as that of the DNA sequence analyzer 100 of FIG. 4, but differs in that it includes a control tag data generation unit 402 internally.
[0040] The control tag data generation unit 402 acquires the control genomic DNA sequence data input from the control genomic DNA sequence database 500 (FIG. 3) and generates the control tag data. The detailed mechanism by which the control tag data generation unit 402 generates the control tag data is described later.
[0041] The DNA sequence analyzer 400 also differs in that it includes a control tag data storage unit 404 that stores the control tag data generated by the control tag data generation unit 402. In the DNA sequence analyzer 400, therefore, the control tag data acquisition unit 202 acquires the control tag data from the control tag data storage unit 404 inside the analyzer rather than from outside it.
[0042] FIG. 6 is a flowchart explaining the operation of the DNA sequence analysis system 1000. When the flow starts, the control tag data generation device 200 shown in FIG. 2 first generates control tag data by cutting the control genomic DNA sequence data of a given species, such as human, at restriction enzyme cleavage sites (S102).
[0043] At this point, as described later, highly unique control tags, that is, tags that occur no more than a predetermined number of times (for example, once) in the control genomic DNA sequence, can be extracted so that the control tag data contain only highly unique control tags. It is also possible to extract only control tags consisting of DNA sequences whose number of bases falls within a predetermined range, so that the control tag data contain only control tags of lengths within that range. In addition, the resulting control tag data can be associated with the corresponding positions in the control genomic DNA sequence.
[0044] Next, in the DNA sequence analyzer 100, the control tag data acquisition unit 202 acquires the control tag data from the control tag data generation device 200 (S106). The control tag data acquisition unit 202 stores the acquired control tag data in the control tag data storage unit 206.
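The uniqueness and length filters described in [0043] can be sketched as follows. The sketch counts copies only among the extracted tag sequences themselves, which is a simplification; the `(chrom, start, seq)` layout, the function name, and the default bounds are assumptions for illustration.

```python
from collections import Counter

def filter_control_tags(tags, max_copies=1, min_len=20, max_len=130):
    """Keep tags that are sufficiently unique and within the length range.

    tags: list of (chrom, start, seq) tuples from in-silico digestion.
    max_copies: largest number of occurrences a tag sequence may have.
    """
    # Count how many times each tag sequence was produced across the genome.
    copies = Counter(seq for _, _, seq in tags)
    return [(chrom, start, seq)
            for chrom, start, seq in tags
            if copies[seq] <= max_copies and min_len <= len(seq) <= max_len]
```

Raising `max_copies` above 1 admits tags from low-copy repeats, which increases coverage at the cost of an ambiguous genomic position for those tags.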
[0045] 一方、図 2に示した解析対象タグデータ生成装置 300は、ヒトなどの所定の生物種 の解析対象ゲノム DNA分子を制限酵素で処理して得られる複数の DNA断片を連 結して複数のコンカテマ一を生成し、この複数のコンカテマ一のシークェンスを行うこ とにより、複数の解析対象タグを含む解析対象タグデータを生成する(S104)。  On the other hand, the tag data generation device 300 to be analyzed shown in FIG. 2 connects a plurality of DNA fragments obtained by treating a genomic DNA molecule to be analyzed of a predetermined biological species such as a human with a restriction enzyme. By generating a plurality of concatamers and performing a sequence of the plurality of concatamers, analysis target tag data including a plurality of analysis target tags is generated (S104).
[0046] At this stage, as described later, the DNA fragments may be ligated into a plurality of concatemers, and those concatemers may in turn be ligated into a plurality of secondary concatemers. Generating secondary concatemers in this way improves sequencing efficiency.
[0047] Next, in the DNA sequence analyzer 100, the analysis target tag data acquisition unit 204 acquires the analysis target tag data from the analysis target tag data generation unit 300 (S108) and stores it in the analysis target tag data storage unit 208.
[0048] The corresponding tag data generation unit 210 then acquires the control tag data from the control tag data storage unit 206 and the analysis target tag data from the analysis target tag data storage unit 208, compares the two, and generates corresponding tag data in which control tags and analysis target tags that correspond to each other are associated (S110). The corresponding tag data generation unit 210 stores the generated corresponding tag data in the corresponding tag data storage unit 212.
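The association in step S110 can be pictured as a lookup of each analysis target tag in an index of control tags. The following is only a minimal sketch under stated assumptions: the function name `build_corresponding_tags`, the tag identifiers, and the exact-match lookup are hypothetical and are not taken from the specification.

```python
from collections import defaultdict

def build_corresponding_tags(control_tags, analysis_tags):
    """Associate each analysis target tag with the control tags that share its sequence.

    control_tags: list of (tag_id, sequence) pairs (hypothetical layout).
    analysis_tags: list of sequences read from the concatemers.
    Returns a list of (analysis_sequence, [matching control tag ids]).
    """
    index = defaultdict(list)
    for tag_id, seq in control_tags:
        index[seq].append(tag_id)
    # .get() avoids creating empty entries for unmatched sequences
    return [(seq, list(index.get(seq, []))) for seq in analysis_tags]

# Toy data, invented for illustration only
control = [("chr1:100", "ACGTACGT"), ("chr2:500", "TTGACCAA"), ("chr3:900", "ACGTACGT")]
observed = ["TTGACCAA", "ACGTACGT", "GGGGCCCC"]
pairs = build_corresponding_tags(control, observed)
```

In this toy data the first observed tag maps to one control tag, the second to two, and the third to none, which are the three cases the later steps treat differently.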
[0049] Subsequently, the copy number determination unit 214 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212 and determines the number of analysis target tags corresponding to each control tag. Based on this number, it determines whether the region of the genomic DNA sequence to be analyzed that contains the location corresponding to the control tag differs in copy number from the control genomic DNA sequence (S114). The copy number determination unit 214 stores the copy number determination result in the copy number determination result storage unit 216.
[0050] Meanwhile, the origin determination unit 226 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212 and determines whether a control tag corresponding to each analysis target tag exists. If no corresponding control tag exists, the origin determination unit 226 compares the analysis target tag with other-genome DNA sequence data acquired from the other-genome DNA data search unit 224 and determines the origin of the analysis target tag (S112). The origin determination unit 226 stores the origin determination result in the origin determination result storage unit 228.
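One way to picture the origin determination of step S112 is a substring search of the unmatched tag against sequences from other genomes. This is a hedged sketch only: the specification does not say how the comparison is performed, so the both-strand exact search, the function names, and the toy sequences below are all assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def determine_origin(tag, other_genomes):
    """Return the names of the other genomes that contain the tag on either strand.

    other_genomes: dict mapping a genome name to its sequence (hypothetical layout).
    """
    rc = reverse_complement(tag)
    return [name for name, seq in other_genomes.items() if tag in seq or rc in seq]

# Toy sequences, invented for illustration only
other_genomes = {"virus_A": "TTTACGGATTACAGG", "bacterium_B": "CCCCCTGTAATCCGGG"}
hits = determine_origin("GGATTACA", other_genomes)
```

A realistic search over whole genomes would need an indexed aligner rather than Python's substring operator; the sketch only shows the decision being made.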
[0051] The image data generation unit 220 then acquires the copy number determination result from the copy number determination result storage unit 216 and the origin determination result from the origin determination result storage unit 228, and generates image data based on both results (S116). The image data generation unit 220 stores the generated image data in the image data storage unit 222.
[0052] Finally, the output unit 218 acquires the copy number determination result from the copy number determination result storage unit 216, the origin determination result from the origin determination result storage unit 228, and the image data from the image data storage unit 222, and outputs them to a destination outside the device, such as the image display device 104 (FIG. 2) (S118). The flow then ends.
[0053] The advantages of the DNA sequence analysis system 1000 according to the present embodiment are described below.
According to the DNA sequence analysis system 1000, analysis target tags, which are short fragments obtained by restriction enzyme treatment of the genomic DNA to be analyzed, are counted as representatives of the genome. The corresponding tag data generation unit 210 compares them with control tags, which are virtual tags derived from the control genomic DNA sequence, so that the copy number determination unit 214 can quantify copy number comprehensively across the human genome. On that basis, regions of the analyzed genomic DNA sequence that exhibit copy number abnormalities can be identified at high resolution. As a result, genomic regions with abnormal gene copy number can be searched for and identified at high resolution, new causative genes of diseases located in those regions can be revealed, and the mechanisms of disease onset can be elucidated, while also enabling applications to molecular-level diagnosis and treatment.
[0054] FIG. 7 is a functional block diagram showing the internal configuration of the corresponding tag data generation unit 210. The corresponding tag data generation unit 210 includes a reception unit 502 that receives control tag data and analysis target tag data from the control tag data storage unit 206 (FIG. 4) and the analysis target tag data storage unit 208 (FIG. 4), respectively.
[0055] The corresponding tag data generation unit 210 includes a correspondence determination unit 504 that determines the correspondence between the control tag data and the analysis target tag data. The correspondence determination unit 504 acquires the control tag data and the analysis target tag data from the reception unit 502. When an analysis target tag corresponds to exactly one control tag, the two tags are associated with a predetermined contribution (for example, 1). When an analysis target tag corresponds to two or more control tags, the tags are associated with a contribution different from the predetermined contribution (for example, 0). The contribution used by the correspondence determination unit 504 is set by the contribution setting unit 508 provided in the corresponding tag data generation unit 210.
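The rule applied by the correspondence determination unit 504 can be written as a small function. This is an illustrative sketch: the function name and the `None` return for unmatched tags are assumptions, while the default values 1 and 0 follow the examples given in the text.

```python
def correspondence_contribution(n_matching_control_tags, unique=1.0, ambiguous=0.0):
    """Return the contribution for an analysis target tag.

    n_matching_control_tags: how many control tags the analysis target tag corresponds to.
    unique / ambiguous: contributions for one match and for two or more matches
    (1 and 0 are the examples given in the text).
    """
    if n_matching_control_tags == 1:
        return unique
    if n_matching_control_tags >= 2:
        return ambiguous
    # No matching control tag: no contribution is set here (assumption);
    # such tags go to origin determination instead.
    return None
```

Setting the ambiguous contribution to 0 means that tags mapping to several genomic locations do not inflate the count of any single location.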
[0056] The corresponding tag data generation unit 210 includes a match degree determination unit 506 that determines the degree of match between the control tag data and the analysis target tag data. The match degree determination unit 506 acquires, from the correspondence determination unit 504, the control tag data and analysis target tag data for which the correspondence-related contribution has been set. Among the control tags and analysis target tags, it associates tags that match completely with a predetermined contribution (for example, 1) and tags that partly differ with a contribution different from the predetermined contribution (for example, 0). The contribution used by the match degree determination unit 506 is set by the contribution setting unit 508 provided in the corresponding tag data generation unit 210. The partly differing tags may include not only analysis tags of the same length that contain mismatches, but also analysis tags with an insertion or deletion of one or two bases.
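The distinction between complete and partial matches can be sketched with an edit distance, which covers both mismatches and short insertions or deletions. The specification does not name a comparison algorithm, so the use of Levenshtein distance, the `max_edits` limit, and the function names here are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]

def match_contribution(analysis_tag, control_tag, exact=1.0, partial=0.0, max_edits=2):
    """Classify a tag pair and return (label, contribution).

    exact / partial default to 1 and 0, the examples given in the text;
    max_edits is a hypothetical limit on what still counts as "partly different".
    """
    if analysis_tag == control_tag:
        return "exact", exact
    if levenshtein(analysis_tag, control_tag) <= max_edits:
        return "partial", partial
    return "none", None
```

For example, `match_contribution("ACGT", "ACCT")` is classified as partial (one mismatch), and so is `match_contribution("ACGT", "ACGGT")` (one inserted base).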
[0057] The corresponding tag data generation unit 210 includes a retry determination unit 510 that determines, for each of the analysis target tags contained in the analysis target tag data, whether its generation should be retried. For example, when the determination by the match degree determination unit 506 shows that the sequences of a control tag and an analysis target tag partly differ, the retry determination unit 510 may be configured to decide that the sequencing used to generate that analysis target tag should be retried, provided that the number of differing bases is no more than a predetermined number. This prevents slight sequencing errors at the level of a few bases from reducing the number of valuable analysis target tags, and thus improves the reliability of the analysis results.
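The retry decision can be illustrated as a threshold on the number of differing bases. In this sketch the threshold value of 2, the restriction to tags of equal length, and the function names are assumptions; the text only states that the number of differing bases must not exceed a predetermined number.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def should_retry(analysis_tag, control_tag, max_differences=2):
    """Decide whether sequencing of this analysis target tag should be retried.

    Retry only when the tags differ, but by no more than max_differences bases.
    Tags of different length are not retried in this simplified sketch.
    """
    if len(analysis_tag) != len(control_tag):
        return False
    differences = count_mismatches(analysis_tag, control_tag)
    return 0 < differences <= max_differences
```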
[0058] The corresponding tag data generation unit 210 includes an output unit 512 that outputs the data processed by the correspondence determination unit 504, the match degree determination unit 506, and the retry determination unit 510 to the corresponding tag data storage unit 212 (FIG. 4).
[0059] FIG. 8 is a flowchart illustrating the operation of the corresponding tag data generation unit 210. When the flow starts, the correspondence determination unit 504 first determines the correspondence between the control tag data and the analysis target tag data (S202). For example, it determines the number of control tags corresponding to each analysis target tag. If that number is 1, the contribution setting unit 508 sets the contribution to a (S206). If it is 2 or more, the contribution setting unit 508 sets the contribution to b (S208). If it is 0, the process proceeds to step S112, where origin determination is performed (FIG. 6).
[0060] Next, the match degree determination unit 506 determines the degree of match between the control tag data and the analysis target tag data (S210). For example, it determines the degree of match between each analysis target tag and its corresponding control tag. For a complete match, the contribution setting unit 508 sets the contribution to c (S212); for an incomplete match, the contribution setting unit 508 sets the contribution to d (S214).
[0061] Then, based on the results of the correspondence and match degree determinations above, the retry determination unit 510 determines whether a retry is necessary (S214), and the flow ends.
[0062] The advantages of the corresponding tag data generation unit 210 in the present embodiment are described below.
According to the corresponding tag data generation unit 210, the correspondence determination unit 504, the match degree determination unit 506, and the contribution setting unit 508 make it possible to set an appropriate contribution according to the correspondence and the degree of match between control tags and analysis target tags. As a result, the copy number determination unit 214 described later can sum, for each region of the genomic DNA sequence, the contributions that reflect this correspondence and degree of match, and can thereby reliably detect regions of the analyzed genomic DNA sequence whose copy number differs from that of the control genomic DNA sequence.
[0063] In addition, because the corresponding tag data generation unit 210 includes the retry determination unit 510, sequencing and related steps can be retried for analysis target tags suspected of containing sequencing read errors or SNPs. This helps improve the reliability of the results obtained by the DNA sequence analysis system 1000.
[0064] FIG. 9 is a functional block diagram showing the internal configuration of the copy number determination unit 214. The copy number determination unit 214 includes a reception unit 602 that receives corresponding tag data from the corresponding tag data storage unit 212 (FIG. 4). It also includes a contribution aggregation unit 604 that totals the contributions set in the corresponding tag data received by the reception unit 602.
[0065] The copy number determination unit 214 further includes a duplication determination unit 606 that determines, based on the contributions totaled by the contribution aggregation unit 604, whether a duplication has occurred in the genomic DNA sequence to be analyzed. The duplication determination unit 606 determines the number of analysis target tags corresponding to each control tag from the totaled contributions. When that number is equal to or greater than a predetermined number (for example, 3 or more), it determines that a duplication has occurred in the region of the analyzed genomic DNA sequence that contains the location corresponding to the control tag.
[0066] The copy number determination unit 214 also includes a deletion determination unit 608 that determines, based on the contributions totaled by the contribution aggregation unit 604, whether a deletion has occurred in the genomic DNA sequence to be analyzed. The deletion determination unit 608 determines the number of analysis target tags corresponding to each control tag from the totaled contributions. When that number is equal to or less than a predetermined number (for example, 0.5 or less), it determines that a deletion has occurred in the region of the analyzed genomic DNA sequence that contains the location corresponding to the control tag.
[0067] The copy number determination unit 214 further includes an output unit 610 that outputs the data obtained from the contribution aggregation unit 604, the duplication determination unit 606, and the deletion determination unit 608 to the copy number determination result storage unit 216 (FIG. 4).
[0068] FIG. 10 is a flowchart illustrating the operation of the copy number determination unit 214. When the flow starts, the contribution aggregation unit 604 first analyzes the corresponding tag data received by the reception unit 602 and totals the contributions set for each region of the control genomic DNA sequence (that is, for each control tag in the control tag data) (S302).
[0069] Next, for each region of the control genomic DNA sequence, the duplication determination unit 606 determines whether the totaled contribution is equal to or greater than a threshold (for example, 3 or more) (S304). If it is, the region is determined to be duplicated (S306). If it is below the threshold, the process proceeds to step S308.
[0070] In the next step, for each region of the control genomic DNA sequence, the deletion determination unit 608 determines whether the totaled contribution is equal to or less than a threshold (for example, 0.5 or less) (S308). If it is, the region is determined to be deleted (S310). If it is greater than the threshold, no particular determination is made.
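Steps S302 to S310 can be sketched as a sum of contributions per control tag followed by two threshold tests. This is a minimal illustration: the function name, the data layout, and the "no call" label are assumptions, while the thresholds 3 and 0.5 are the examples given in the text.

```python
from collections import defaultdict

def call_copy_number(contributions, all_control_ids, dup_threshold=3.0, del_threshold=0.5):
    """Total the contributions per control tag and classify each region.

    contributions: iterable of (control_tag_id, contribution) pairs.
    all_control_ids: every control tag id, so that tags with no hits total 0.
    Returns (totals, calls), both dicts keyed by control tag id.
    """
    totals = defaultdict(float)
    for tag_id, value in contributions:
        totals[tag_id] += value            # S302: aggregate per control tag
    calls = {}
    for tag_id in all_control_ids:
        total = totals[tag_id]             # missing ids default to 0.0
        if total >= dup_threshold:         # S304 -> S306
            calls[tag_id] = "duplication"
        elif total <= del_threshold:       # S308 -> S310
            calls[tag_id] = "deletion"
        else:
            calls[tag_id] = "no call"
    return dict(totals), calls

# Toy data, invented for illustration only
contribs = [("t1", 1.0), ("t1", 1.0), ("t1", 1.0), ("t2", 1.0), ("t3", 0.0)]
totals, calls = call_copy_number(contribs, ["t1", "t2", "t3", "t4"])
```

Passing the full list of control tag identifiers matters because a control tag that was never observed in the analyzed sample would otherwise be absent from the totals and could not be called as a deletion.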
[0071] The output unit 610 then acquires the above determination results and outputs them to the copy number determination result storage unit 216 (S312), and the flow ends.
[0072] The advantages of the copy number determination unit 214 in the present embodiment are described below.
The copy number determination unit 214 acquires data for which the corresponding tag data generation unit 210 has set appropriate contributions according to the correspondence and the degree of match, and sums these contributions for each region of the genomic DNA sequence. This allows regions of the analyzed genomic DNA sequence whose copy number differs from that of the control genomic DNA sequence to be detected reliably. In addition, because the duplication determination unit 606 and the deletion determination unit 608 compare the summed contributions against upper and lower thresholds, the locations in the genomic DNA sequence where duplications and deletions have occurred can be detected reliably.
[0073] FIG. 11 is a conceptual diagram illustrating a data visualization image per control tag (per virtual tag). The image data generation unit 220 acquires data on the copy number determination result from the copy number determination result storage unit 216 and generates image data of this kind. In this figure, a tag density (corresponding to the aggregated contribution) is calculated for each cell, where each cell represents one control tag, and the cell is displayed in a color that depends on the tag density so that the result is easy for the user to understand. The display window can also be enlarged or reduced as needed for the convenience of the user.
[0074] With this image, the color of each cell makes it easy to judge visually whether a duplication or deletion has occurred in the region of genomic DNA corresponding to that cell.
[0075] FIG. 12 is a conceptual diagram illustrating another data visualization image per control tag (per virtual tag). The image data generation unit 220 may acquire data on the copy number determination result from the copy number determination result storage unit 216 and generate image data of this kind. In this figure, a tag density (corresponding to the aggregated contribution) is calculated for each chromosomal position corresponding to a control tag and is displayed as the height of a filled cell, so that the result is easy for the user to understand. The display window can also be enlarged or reduced as needed for the convenience of the user. Buttons are also provided for switching between the individual chromosomes of the human genome.
[0076] With this image as well, the height of the filled cells makes it easy to judge visually whether a duplication or deletion has occurred at each chromosomal position. This image generation function provides a user interface that makes it easier to analyze large amounts of data and to grasp the results.
[0077] <2. Generation of control tag data>
FIG. 13 is a functional block diagram illustrating the internal configuration of the control tag data generation device 200. The control tag data generation device 200 (the control tag data generation unit 402 (FIG. 5) has the same configuration) includes a control genomic DNA sequence data acquisition unit 706 that acquires control genomic DNA sequence data. It also includes a control genomic DNA sequence data storage unit 708 that stores the control genomic DNA sequence data acquired by the control genomic DNA sequence data acquisition unit 706.
[0078] The control tag data generation device 200 includes a cleavage site search unit 710 that acquires the control genomic DNA sequence data from the control genomic DNA sequence data storage unit 708, searches for sites cleaved by a predetermined restriction enzyme, and cuts the control genomic DNA sequence data at the cleavage sites found. It also includes a cut DNA sequence storage unit 712 that stores the plurality of DNA sequences (control tags) obtained by the cutting performed by the cleavage site search unit 710.
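The in silico cutting performed by the cleavage site search unit 710 can be illustrated by locating every occurrence of a recognition sequence and taking the stretch between consecutive sites as a virtual tag. This is a sketch under assumptions: the function name `digest` and the toy sequence are hypothetical, GATC is used as the recognition sequence because it is the one discussed later in the text, and each tag is represented here by the sequence between two sites without the flanking recognition sequences.

```python
def digest(sequence, site="GATC"):
    """Return (position, gap_sequence) for each stretch between consecutive sites.

    position is the 0-based index of the first base after the left-hand site.
    """
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    tags = []
    for left, right in zip(positions, positions[1:]):
        gap_start = left + len(site)
        if right >= gap_start:  # skip overlapping sites (cannot occur for GATC)
            tags.append((gap_start, sequence[gap_start:right]))
    return tags

# Toy sequence, invented for illustration only
genome = "AAGATCTTTTGGGGATCCCGATCAA"
fragments = digest(genome)
```

Keeping the position alongside each sequence is what later allows a control tag to be tied back to its location in the control genome.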
[0079] The control tag data generation device 200 includes a control tag selection unit 714 that acquires, from the cut DNA sequence storage unit 712, the plurality of control tags obtained by cutting the control genomic DNA sequence at the cleavage sites, and selects from them the control tags whose number of bases and degree of uniqueness fall within predetermined ranges. It also includes a selected tag storage unit 716 that stores the control tags selected by the control tag selection unit 714.
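The selection by length and uniqueness can be sketched as a filter over the digested tags. The following is illustrative only: the function name, the data layout, and the default limits are assumptions, and uniqueness is approximated here by counting identical sequences among the digested tags rather than across the whole genome sequence.

```python
from collections import Counter

def select_control_tags(tags, min_len=20, max_len=40, max_occurrences=1):
    """Keep tags whose length is in [min_len, max_len] and whose sequence is rare enough.

    tags: list of (position, sequence) pairs produced by an in silico digest.
    max_occurrences: a sequence may appear at most this many times among all tags.
    """
    counts = Counter(seq for _, seq in tags)
    return [(pos, seq) for pos, seq in tags
            if min_len <= len(seq) <= max_len and counts[seq] <= max_occurrences]

# Toy data with short lengths, invented for illustration only
toy_tags = [(0, "AAAA"), (10, "ACGTAC"), (30, "ACGTAC"),
            (50, "GGCCTT"), (70, "ACGTACGTACGT")]
selected = select_control_tags(toy_tags, min_len=5, max_len=8, max_occurrences=1)
```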
[0080] The control tag data generation device 200 includes an association unit 718 that generates control tag data by associating each selected control tag with its corresponding location in the control genomic DNA sequence. It also includes a control tag data storage unit 720 that stores the control tag data generated by the association unit 718.
[0081] The control tag data generation device 200 includes an output unit 722 that acquires the control tag data from the control tag data storage unit 720 and outputs it to the DNA sequence analyzer 100.
[0082] The generation of virtual tags (control tags) from human genome information by the control tag data generation device 200 described above is explained in detail below.
[0083] 1. Human genome base sequence information and repeat sequence data
The principle of digital genome scanning (DGS), devised by the present inventors, is to count short fragments obtained by restriction enzyme treatment of genomic DNA as representatives of the genome and, on that basis, to identify regions exhibiting copy number abnormalities, so that copy number can be quantified comprehensively across the human genome. To establish the foundation of the DGS method, the inventors ran the following simulations in silico in order to examine its resolution and effectiveness.
[0084] The human genome base sequence information and the repeat sequence data were obtained as the Jul. 2003 hg16 version from http://genome-archive.cse.ucsc.edu/downloads.html, published by the UCSC Genome Bioinformatics Group at the University of California.
[0085] The analysis program was written in C, and the software development environment was a system built around a Red Hat server. For the virtual tag analysis, the genome data was searched using the restriction enzyme recognition sequence, data outside the specified range was excluded, and the remaining tag data was accumulated and analyzed by size. The positional information of each tag was saved, and each tag was checked against the repeat sequence database to determine its repeat class, which was then recorded.
[0086] 2. Number of virtual tags by restriction enzyme
FIG. 14 shows the number of tags generated from the genome by each restriction enzyme. The first question in starting DGS is which restriction enzyme to use to fragment the genomic DNA. For the in silico analysis of virtual tags, the number of virtual tags produced by each restriction enzyme was therefore examined first.
[0087] More specifically, restriction enzyme treatment was performed on a computer using human whole-genome DNA information, and the sizes and numbers of the tags generated (hereinafter called virtual tags) were tallied. FIG. 14 shows the resulting numbers of virtual tags produced by representative 4-base and 6-base recognition restriction enzymes.
[0088] Because 6-base recognition restriction enzymes have stricter recognition conditions, the number of virtual tags they generate was clearly smaller than for 4-base recognition restriction enzymes and was considered insufficient. Among the 4-base recognition enzymes, MboI recognizes and cleaves the DNA sequence GATC, so a BamHI site can be used for cloning concatemers (tags ligated in series). In addition, when limited to tags 20 to 40 bases long, the number of tags generated by MboI was intermediate compared with the other enzymes. The following simulations therefore focused on virtual tags generated by MboI.
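The gap between 4-base and 6-base recognition enzymes can be illustrated with a back-of-the-envelope estimate. The sketch assumes a sequence of independent, equally frequent bases, which a real genome is not, so the numbers indicate only the order of magnitude; the function names are hypothetical and the genome length of 3 × 10^9 bases is an approximate figure used for illustration.

```python
def expected_site_spacing(site_length):
    """Average distance between occurrences of a specific site of the given
    length, assuming independent and equally frequent bases."""
    return 4 ** site_length

def expected_fragments(genome_length, site_length):
    """Rough expected number of fragments produced by cutting at every site."""
    return genome_length / expected_site_spacing(site_length)

approx_genome_length = 3_000_000_000  # assumption: order of magnitude only
four_cutter = expected_fragments(approx_genome_length, 4)  # one site per 256 bases
six_cutter = expected_fragments(approx_genome_length, 6)   # one site per 4096 bases
```

Under this simple model a 4-base cutter produces 16 times as many fragments as a 6-base cutter, which agrees in direction with the observation that 6-base enzymes yield too few virtual tags.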
[0089] 3. Size distribution of MboI virtual tags
Next, the size distribution of MboI virtual tags was examined. More specifically, the breakdown of the DNA fragments generated by MboI treatment of the whole genome was analyzed in silico. For the human whole-genome sequence known at present, a total of 7,056,567 MboI fragments are generated, and 95% of all MboI fragments were found to be 1377 bases or shorter.
[0090] FIG. 15 is a graph showing the number of tags generated by MboI for each size. It shows a histogram of the number of tags, tallied base by base, for fragment sizes of 20 to 80 bases, where the size excludes the GATC at both ends of each MboI virtual tag (hereinafter referred to as the gap length). This showed that roughly 10,000 to 15,000 MboI virtual tags exist at each gap length from 20 to 80 bases, regardless of tag size.
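The in silico digestion and gap-length histogram described above can be sketched as follows. This is a minimal illustration on a plain string, not the procedure actually used; the function names are hypothetical, and only one strand of an uppercase-normalized sequence is considered.

```python
import re
from collections import Counter

SITE = "GATC"  # recognition sequence of the enzyme, as stated in the text


def gap_lengths(sequence: str, site: str = SITE) -> list[int]:
    """Return the lengths between consecutive sites, excluding the sites themselves."""
    # A zero-width lookahead reports the start of every site occurrence.
    starts = [m.start() for m in re.finditer(f"(?={site})", sequence.upper())]
    return [b - (a + len(site)) for a, b in zip(starts, starts[1:])]


def size_histogram(gaps: list[int], low: int = 20, high: int = 80) -> Counter:
    """Count tags per gap length, keeping only low <= gap <= high."""
    return Counter(g for g in gaps if low <= g <= high)


toy = "AAGATC" + "A" * 25 + "GATC" + "C" * 3 + "GATCTT"
print(size_histogram(gap_lengths(toy)))
```

The toy sequence contains one 25-base gap and one 3-base gap, so only the former falls inside the 20 to 80 base range.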
[0091] On the other hand, some sizes, such as the 36-base and 37-base gap lengths, had conspicuously large numbers of tags. When the number of virtual tags was tallied for each chromosome, tags were found to be generated from every chromosome without bias, roughly in proportion to chromosome length, as seen in FIG. 15. From these results, it was considered reasonable to collect short MboI restriction fragments as representatives of the genome.
[0092] FIG. 16 shows the number of tags when a range of tag sizes is collected. Because DGS requires the analysis of a large number of tags, introducing into a single vector as many linked tags as possible (hereinafter referred to as a concatemer) is important both for raising work efficiency and for reducing costs.
[0093] However, the length of concatemer that a vector can carry and the number of bases that can be read in a single sequencing run are both limited, so each tag forming the concatemer should be as short as possible. The number of virtual tags obtained when the collected tag size is restricted was therefore analyzed, and the results are shown in FIG. 16. A total of 1,078,762 MboI virtual tags have gap lengths of 20 to 99 bases; collecting a 40-base-wide range yields about 500,000 virtual tags, and collecting a 30-base-wide range yields about 370,000.
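Counting the tags that fall inside a collection range of a given width amounts to summing a per-size histogram over that range. The sketch below uses a hypothetical flat histogram of 12,000 tags per gap length, a value chosen only because it lies within the per-size range reported above; the real distribution is the one in FIG. 16.

```python
def tags_in_window(histogram: dict[int, int], start: int, width: int) -> int:
    """Sum the tag counts for gap lengths start .. start + width - 1."""
    return sum(histogram.get(size, 0) for size in range(start, start + width))


# Hypothetical flat histogram over gap lengths 20..99 (illustration only).
flat = {size: 12_000 for size in range(20, 100)}
print(tags_in_window(flat, start=30, width=30))
print(tags_in_window(flat, start=30, width=40))
```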
[0094] 4. Analysis of repeat-derived virtual tags
Next, repeat-derived virtual tags were analyzed. The genome is known to contain many interspersed, closely similar base sequences called repeat sequences. According to the 2001 genome project report (Nature 2001, 409, 871-), repeat sequences account for about 45% of the human genome.
[0095] In DGS, an accurate copy number cannot be estimated unless each tag maps uniquely to a single location on the genome. A tag whose sequence matches several locations on the genome cannot be used as a DGS tag, and tags derived from repeat sequences are considered likely to be such invalid tags.
[0096] Therefore, the virtual tags considered to be derived from repeat sequences were tallied and their proportion was analyzed. The repeat sequences considered were interspersed repeats (LINEs, LTR elements, SINEs, DNA transposons), tandem repeats (microsatellites, simple repeats, etc.), and non-coding RNAs (tRNA, scRNA, snRNA, etc.).
[0097] FIG. 17 is a graph showing the distribution of MboI (^GATC) tags. About 40% of the MboI virtual tags were found to be derived from interspersed repeats (FIG. 17). In addition, sizes with conspicuously many tags, such as the 36-base and 37-base gap lengths, had a high proportion of repeat-derived tags; once these were excluded, the number of non-repeat tags was comparable to that at other sizes.
[0098] FIG. 18 shows the number of valid virtual tags generated by MboI; it gives the result of re-tallying these data over ranges of tag size. For example, for the 30-base-wide range of 30 to 59 bp gap length, the total number of tags is 420,000, of which 250,000 are derived from repeat sequences; the number of non-repeat tags is 165,845, or 39.8%.
[0099] Even with the same 30-base width, avoiding the 36-base and 37-base gap lengths, which have high repeat rates, raises the proportion of non-repeat tags; for gap lengths of 40 to 69 bp, for example, it rises to 43%. This shows that the collected tag size must be chosen with care in order to exclude repeat-derived tags in DGS.
[0100] 5. Predicting DGS resolution by Monte Carlo simulation
Next, the prediction of DGS resolution by Monte Carlo simulation is described. The analysis above gave an overview of the virtual tags based on human genome information. However, it remained unclear how many tags would have to be analyzed to obtain sufficient resolution and sensitivity when DGS is performed with those tags. An attempt was made to estimate this by in silico simulation.
[0101] 6. DGS Monte Carlo simulation
FIG. 19 is a conceptual diagram illustrating the DGS Monte Carlo simulation. The simulation is based on the Monte Carlo method, a technique that uses pseudo-random numbers to solve a problem.
[0102] FIG. 20 illustrates the details of the DGS Monte Carlo simulation. An original algorithm that generates pseudo-random numbers on the principle shown in FIG. 21A was developed and used to simulate gene amplification, gene deletion, and loss of heterozygosity. As shown in FIG. 21B, the number of virtual tags, the number of tags actually analyzed, the number of tags showing an abnormal copy number (which corresponds to the length of the region showing the abnormal copy number) and their relative frequency of appearance, and the number of trials were set as variables, and the number of times each virtual tag appeared was simulated and recorded.
[0103] For the results obtained, copy number abnormalities were called positive or negative according to the configured tag window size and abnormality detection threshold, and the positive predictive value, sensitivity, and specificity were analyzed. With the variables set to various values, the resolution at which abnormalities can be detected, and the number of analyzed tags needed to achieve it, were predicted by simulation.
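A simulation of this kind can be sketched as below. This is not the original algorithm described in the text: it substitutes Python's standard `random` module for the pseudo-random generator, models an abnormal region as a run of tags with a relative sampling weight (for example 5.0 for a 5-fold amplification), and calls gains only. All function names, parameters, and the toy values are assumptions for illustration.

```python
import random


def simulate_counts(n_virtual, n_analyzed, abnormal, relative_freq, rng):
    """Draw n_analyzed tags; tags whose index is in `abnormal` are weighted by relative_freq."""
    weights = [relative_freq if i in abnormal else 1.0 for i in range(n_virtual)]
    counts = [0] * n_virtual
    for idx in rng.choices(range(n_virtual), weights=weights, k=n_analyzed):
        counts[idx] += 1
    return counts


def call_gains(counts, window, threshold):
    """Call a window positive when its count is at least threshold x the mean window count."""
    expected = sum(counts) / len(counts) * window
    return [sum(counts[i:i + window]) >= threshold * expected
            for i in range(0, len(counts) - window + 1, window)]


def evaluate(calls, truth):
    """Return (sensitivity, specificity, positive predictive value)."""
    tp = sum(c and t for c, t in zip(calls, truth))
    fp = sum(c and not t for c, t in zip(calls, truth))
    fn = sum(t and not c for c, t in zip(calls, truth))
    tn = sum(not c and not t for c, t in zip(calls, truth))
    ratio = lambda a, b: a / b if b else float("nan")
    return ratio(tp, tp + fn), ratio(tn, tn + fp), ratio(tp, tp + fp)


rng = random.Random(0)  # fixed seed so the toy run is repeatable
counts = simulate_counts(n_virtual=2000, n_analyzed=20000,
                         abnormal=range(500, 600), relative_freq=5.0, rng=rng)
calls = call_gains(counts, window=100, threshold=3.0)
truth = [start == 500 for start in range(0, 2000, 100)]
print(evaluate(calls, truth))
```

Repeating the draw many times and averaging, as the text describes for its trials, would give smoother estimates than this single run.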
[0104] FIG. 21 is a screen view showing the user interface used for the DGS Monte Carlo simulation. To make the simulation work more efficient, a web tool with a user interface supporting the operations above was developed and used for the simulation (FIG. 4).
[0105] In the present embodiment, the detection sensitivity and resolution were examined for the targets of DGS analysis, namely gene amplification, gene deletion (homozygous deletion), and loss of heterozygosity (LOH). For a given number of virtual tags, the simulation predicted how many tags must actually be analyzed to detect these genomic copy number abnormalities.
[0106] FIG. 22 summarizes the DGS simulation results as the abnormality detection resolution and the required number of analyzed tags for the case of 165,845 MboI non-repeat virtual tags. The DGS Monte Carlo simulation assumed that only the 165,845 non-repeat tags considered to exist at tag gap lengths of 30 to 59 bases, as found in the MboI virtual tag analysis above, are used as valid tags. Random numbers were generated as many times as the configured number of analyzed tags, and which tags appeared was recorded; this constituted one trial, and the data averaged over 100 trials were used.
[0107] In actual DGS, some variables, such as the positive-call threshold and the window size, are varied during analysis after the tag data have been acquired. For the purpose of verifying the effectiveness of DGS, settings giving a positive predictive value and a sensitivity of 90% or more were therefore adopted, and the abnormality detection resolution predicted under those settings was summarized as the result (FIG. 22).
[0108] The results showed that detecting a 5-fold gene amplification at 1 Mbp resolution requires the analysis of 13,800 tags, detecting a gene deletion at 1 Mbp resolution requires 44,000 tags, and detecting LOH at 1 Mbp resolution requires 495,000 tags. Conversely, when the number of analyzed tags was set to 10,000, a 5-fold amplification was detectable at a resolution of 1.34 Mbp and a gene deletion at a resolution of 3.79 Mbp.
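To give a feel for the scale of these numbers, the following back-of-envelope sketch estimates how many analyzed tags would be expected to land in a window of a given size. It assumes that the virtual tags are spread evenly over a genome of 3,000 Mbp; both the even spacing and the genome size are simplifying assumptions that are not stated in the text, and the function name is hypothetical.

```python
def expected_tags_in_window(n_analyzed, n_virtual, genome_mbp, window_mbp, fold=1.0):
    """Expected number of analyzed tags in a window whose copy number is `fold` x normal."""
    tags_in_window = n_virtual * window_mbp / genome_mbp
    # Total sampling weight: normal tags count once, tags in the window count `fold` times.
    total_weight = (n_virtual - tags_in_window) + fold * tags_in_window
    return n_analyzed * fold * tags_in_window / total_weight


normal = expected_tags_in_window(13_800, 165_845, 3_000, 1.0)
amplified = expected_tags_in_window(13_800, 165_845, 3_000, 1.0, fold=5.0)
print(normal, amplified)
```

Under these assumptions a normal 1 Mbp window receives only a handful of tags at 13,800 analyzed tags, while a 5-fold amplified window receives roughly five times as many, which is the kind of contrast a threshold on window counts has to detect.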
[0109] 7. Virtual tags derived from repeat sequences
Next, the in silico analysis of virtual tags (control tags) derived from repeat sequences and the prediction of DGS work efficiency are described. Interspersed repeats range in size from about 100 bp for short ones to 10 kbp or more for long ones.
[0110] FIG. 23 is a conceptual diagram explaining the difference between tags derived from both-end repeats and tags derived from one-end repeats. A tag described as repeat-derived may therefore lie entirely within a repeat sequence, or may have only one end within a repeat sequence. The probability that a tag maps to a single location on the genome is expected to be higher the smaller the repeat-derived portion of the tag.
[0111] The MboI virtual tags derived from repeat sequences were therefore classified according to whether both ends lie within a repeat sequence (hereinafter, both-end repeat) or only one end lies within a repeat sequence (hereinafter, one-end repeat), and the number of tags and their size distribution were analyzed in silico.
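The both-end versus one-end distinction can be expressed as a simple interval test on the two tag ends. The sketch below is an assumed formulation, not the classification code used for the analysis; coordinates are half-open, the function name is hypothetical, and a tag whose ends are both outside any repeat is treated as non-repeat even if a repeat lies in its interior.

```python
def classify_tag(tag_start, tag_end, repeats):
    """Classify a tag occupying [tag_start, tag_end) against repeat intervals [start, end)."""
    def in_repeat(position):
        return any(start <= position < end for start, end in repeats)

    left, right = in_repeat(tag_start), in_repeat(tag_end - 1)
    if left and right:
        return "both-end repeat"
    if left or right:
        return "one-end repeat"
    return "non-repeat"


repeats = [(100, 400)]  # one hypothetical repeat element
for tag in [(150, 200), (380, 430), (500, 550)]:
    print(tag, classify_tag(*tag, repeats))
```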
[0112] FIG. 24 is a graph showing, for the re-examination of MboI virtual tags, the size distributions of tags buried in a repeat region (repeat both) and of tags spanning a repeat region and a non-repeat region (repeat either). FIG. 24 is the graph of FIG. 17 with the both-end/one-end repeat classification added.
[0113] Within the same repeat type, both-end (both) and one-end (either) tags are distinguished by color. The one-end repeat tags of interspersed repeats (LINE, SINE, LTR, DNA transposon) are deliberately plotted next to the non-repeat tags on the graph. The graph shows that, over tag gap lengths of 20 to 80 bases, the proportion of one-end repeat tags derived from interspersed repeats, mainly SINEs, rises as the tag size increases.
[0114] FIG. 25 illustrates the case in which MboI virtual tags spanning a repeat region and a non-repeat region (one-end repeats) are regarded as valid tags. Based on the histogram of FIG. 24, the number of tags was tallied for chosen collection size ranges (FIG. 25). As FIG. 25 shows, the number of one-end repeat virtual tags increases with tag size.
[0115] Assuming that one-end repeat virtual tags are as valid in DGS as non-repeat tags, they were tallied as "valid tag" candidates; the result is column B + C of FIG. 25. Even when tags are excised from the gel with the same 30-base width, setting the collection size longer, at 50 to 79 bases, raises the proportion of "valid tags" by the increase in the number of one-end repeat tags.
[0116] FIG. 26 is used to examine whether the tag excision size should be shifted toward longer sizes. The question is whether DGS work efficiency improves when the size of the tags excised from the gel is set longer. Based on the estimates above, FIG. 26 shows the results of examining the work efficiency of DGS when tag purification gives weight to one-end repeat tags.
[0117] At present it is not clear whether the bottleneck in concatemer preparation is the size of the concatemer or the number of tags it contains, so estimates were made for both cases. In FIG. 26, the upper table shows the results assuming that concatemers of 500 bp or more are always obtained, and the lower table shows the results assuming that concatemers of 5 tags are always obtained.
[0118] According to these estimates, when concatemers of 500 bp or more can always be made, collecting shorter tags reaches the target number of tags sooner, whether valid tags or only non-repeat tags are used. When concatemers of 5 tags can always be made, on the other hand, collecting the longer fraction reaches the target number sooner if valid tags are emphasized. This analysis suggests that, where concatemer formation is limited by the number of tags linked, setting the collected tag size longer obtains valid tags more efficiently.
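The trade-off between the two bottleneck scenarios can be illustrated with a small calculation. The values in FIG. 26 are not reproduced here, so the mean gap lengths (45 and 65 bases) and the valid-tag fractions (0.45 and 0.55) below are hypothetical, as are the function names and the assumption that each tag in a concatemer adds one 4-base site to its gap length.

```python
import math


def tags_per_clone_size_limited(concatemer_bp, mean_gap_bp, site_bp=4):
    """Number of tags that fit in a concatemer of fixed length."""
    return concatemer_bp // (mean_gap_bp + site_bp)


def clones_needed(target_valid_tags, tags_per_clone, valid_fraction):
    """Clones to sequence before the target number of valid tags is reached."""
    return math.ceil(target_valid_tags / (tags_per_clone * valid_fraction))


TARGET = 10_000
short = {"mean_gap": 45, "valid": 0.45}  # hypothetical short fraction
long_ = {"mean_gap": 65, "valid": 0.55}  # hypothetical long fraction

# Scenario 1: concatemer length is the limit (500 bp per clone).
size_short = clones_needed(TARGET, tags_per_clone_size_limited(500, short["mean_gap"]), short["valid"])
size_long = clones_needed(TARGET, tags_per_clone_size_limited(500, long_["mean_gap"]), long_["valid"])

# Scenario 2: tag count is the limit (5 tags per clone).
count_short = clones_needed(TARGET, 5, short["valid"])
count_long = clones_needed(TARGET, 5, long_["valid"])
print(size_short, size_long, count_short, count_long)
```

With these illustrative inputs the short fraction needs fewer clones when length is the limit, and the long fraction needs fewer clones when tag count is the limit, which matches the direction of the conclusion above.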
[0119] 8. Verifying the uniqueness of virtual tags
Next, whether virtual tags derived from repeat sequences are likely to be invalid tags that cannot be mapped uniquely on the genome was examined in silico. For the analysis, 80 MboI virtual tags classified as repeat-derived were picked at random from chromosome 22.
[0120] These comprised 40 both-end repeat tags and 40 one-end repeat tags (in each group, 10 each derived from SINEs, LINEs, LTRs, and DNA transposons). The average gap length of the 80 MboI virtual tags was 60.0 bp, with a maximum of 99 bp and a minimum of 20 bp. The tags were mapped onto the genome by BLAT search, and the chromosomal sites listed as candidates, their number, the number of mismatched bases, and so on were recorded.
[0121] FIG. 27 illustrates the case in which MboI virtual tags spanning a repeat region and a non-repeat region (one-end repeats) are regarded as valid tags. In judging the results, a tag was defined as 'unique' when its candidate site on the genome exists at only one location, on chromosome 22, and the full length of the tag matches the genomic sequence of the candidate site 100%.
[0122] As a result, as shown in FIG. 27A, 82.5% of the 80 repeat-derived MboI virtual tags were found to be unique. Both-end repeats accounted for 38.8% and one-end repeats for 43.8%, so one-end repeat tags were slightly more unique.
[0123] In the BLAT search, however, it was often observed that even a tag meeting the above definition of unique gave many candidate sites once mismatches of 1 to 2 bases were allowed (such tags are defined as 'fine unique'). The unique tags were therefore classified further, with a tag defined as 'super unique' when the only candidate site detected on the genome is the site in question.
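The three categories can be illustrated with a naive mismatch scan. This sketch is a stand-in for the BLAT search used in the text, not a reimplementation of it: it slides the tag along one strand of a toy sequence and counts positions within a given number of mismatches. The function names and toy sequences are hypothetical.

```python
def hamming_hits(tag, genome, max_mismatch):
    """Start positions where the tag matches the genome with at most max_mismatch substitutions."""
    n = len(tag)
    return [i for i in range(len(genome) - n + 1)
            if sum(a != b for a, b in zip(tag, genome[i:i + n])) <= max_mismatch]


def classify_uniqueness(tag, genome, max_mismatch=2):
    """Return 'not unique', 'fine unique', or 'super unique' following the definitions above."""
    if len(hamming_hits(tag, genome, 0)) != 1:
        return "not unique"
    near = hamming_hits(tag, genome, max_mismatch)
    return "super unique" if len(near) == 1 else "fine unique"


tag = "ACGTTGCA"
single = "TTTT" + tag + "TTTT"                # one exact copy only
with_near_copy = single + "ACGTTGGA" + "TTTT"  # plus a copy with one substitution
duplicated = tag + "TT" + tag                 # two exact copies
for genome in (single, with_near_copy, duplicated):
    print(classify_uniqueness(tag, genome))
```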
[0124] As a result, as shown in FIG. 27B, 14 of the 80 tags were super unique: 10 one-end repeat tags (15.2%) and 4 both-end repeat tags (6.1%), suggesting that one-end repeat tags are more unique.
[0125] Although these results come from an in silico analysis using sequence information already mapped on the genome, they show that even a repeat-derived tag can be mapped to a single location on the genome if the sequencing is sufficiently accurate. At the same time, one-end repeat tags appear to be the more unique tags and can be analyzed more safely in DGS.
[0126] 9. Summary
The digital genome scanning (DGS) method of the present embodiment aims to quantify genomic copy number against the background of human genome information and to identify, at high resolution, regions showing abnormal copy numbers. In the description above, preliminary experiments toward establishing the DGS system were carried out both in silico and in vitro.
[0127] In the in silico analysis, Monte Carlo simulation starting from the analysis of virtual tags produced with the restriction enzyme MboI made it possible to predict the number of analyzed tags needed to detect copy number abnormalities such as amplification and deletion.
[0128] More specifically, besides confirming that increasing the number of analyzed tags raises the resolution of DGS, the simulation provided an endpoint for the question of how large an actual tag analysis should be for a given number of virtual tags.
[0129] Attention was also given to the tags derived from repetitive sequences, which account for about 50% of the tags, and whether they can be excluded at the experimental level was examined. The in silico analysis showed that a considerable number of repeat-derived sequences can nonetheless be mapped to the genome, and suggested the usefulness of one-end repeats. Reflecting this in the DGS operation steps and acquiring valid tag data efficiently improves the reliability of the data obtained.
[0130] The preliminary studies toward establishing the basis of the digital genome scanning method therefore show that: 1) by in silico analysis of human genome information, collecting a 30-base-wide range of tags produced with the restriction enzyme MboI yields 165,000 virtual tags derived from non-repeat sequences; 2) by Monte Carlo simulation, analysis of 10,000 tags can detect a 5-fold gene amplification at a resolution of 1.34 Mb; and 3) some tags derived from repeat sequences can still be mapped uniquely on the genome.
[0131] FIG. 28 is a flowchart explaining the operation of the control tag data generation device 200. When the flow starts, the control genomic DNA sequence data acquisition unit 706 first acquires control genomic DNA sequence data (S402) and stores it in the control genomic DNA sequence data storage unit 708.
[0132] Next, based on the results of the preliminary studies described above, the cleavage site search unit 710 selects a restriction enzyme (for example, MboI) (S404). The cleavage site search unit 710 then acquires the control genomic DNA sequence data from the control genomic DNA sequence data storage unit 708, searches for the cleavage sites of the restriction enzyme (S406), and cuts the control genomic DNA sequence at those sites. The cleavage site search unit 710 stores the DNA sequences produced by the cutting in the cleaved DNA sequence storage unit 712.
[0133] Subsequently, the control tag selection unit 714 acquires the DNA sequences from the cleaved DNA sequence storage unit 712 and determines whether each consists of a number of bases within a predetermined range (S408). From these DNA sequences, the control tag selection unit 714 selects as control tags those whose number of bases and uniqueness are within the predetermined ranges (S410), and does not select those whose number of bases or uniqueness falls outside the predetermined ranges (S412).
[0134] At this point, the unit can also be configured to treat as having uniqueness within the predetermined range, and to select, control tags consisting of highly unique DNA sequences that each occur in the control genomic DNA sequence no more than a predetermined number of times (for example, 1 or fewer). The control tag selection unit 714 stores the selected control tags in the selected tag storage unit 716.
[0135] The association unit 718 then acquires from the selected tag storage unit 716 the control tags consisting of DNA sequences with a number of bases within the predetermined range, associates each with its corresponding position in the control genomic DNA sequence data (S414), and thereby generates control tag data. The association unit 718 stores the generated control tag data in the control tag data storage unit 720.
[0136] Finally, the output unit 722 acquires the control tag data from the control tag data storage unit 720 and outputs it to the DNA sequence analyzer 100 (S416), which ends the flow.
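The flow from S402 to S416 can be summarized in a single function. The sketch below is an assumed, simplified rendering on an in-memory string, not the device's implementation: the function name, the default size range of 20 to 80 bases, the dictionary layout of the output, and the single-strand copy count are all illustrative choices.

```python
import re


def generate_control_tags(genome, site="GATC", min_len=20, max_len=80, max_copies=1):
    """Return [{'position': int, 'sequence': str}, ...] for size- and copy-filtered tags."""
    # S406: locate every cleavage site, then cut between consecutive sites.
    starts = [m.start() for m in re.finditer(f"(?={site})", genome)]
    fragments = [(a + len(site), genome[a + len(site):b])
                 for a, b in zip(starts, starts[1:])]
    tags = []
    for position, sequence in fragments:
        # S408 / S412: skip fragments outside the predetermined size range.
        if not min_len <= len(sequence) <= max_len:
            continue
        # S410: keep only sequences occurring at most max_copies times in the genome.
        copies = len(re.findall(f"(?={re.escape(sequence)})", genome))
        if copies <= max_copies:
            # S414: associate the tag with its position in the control sequence.
            tags.append({"position": position, "sequence": sequence})
    return tags


toy_genome = "GATC" + "A" * 25 + "GATC" + "C" * 5 + "GATC" + "T" * 30 + "GATC"
print(generate_control_tags(toy_genome))
```

In the toy sequence the 5-base fragment is rejected on size, leaving two tags with their start positions.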
[0137] The advantages of the control tag data generation device 200 of the present embodiment are described below.
With the control tag data generation device 200, control tags are not selected according to whether they are repeat or non-repeat sequences; instead, highly unique control tags that occur in the control genomic DNA sequence no more than a predetermined number of times (for example, 1 or fewer) can be selected. The control tags obtainable from the control genomic DNA sequence can therefore be used effectively, and the reliability of the resulting data can be improved.
[0138] The control tag data generation device 200 also allows a suitable restriction enzyme to be chosen according to the combination of the restriction enzyme used to cut human genomic DNA and the size of the tag sequences extracted. To achieve the goal of detecting changes in DNA quantity across the whole genome at high resolution, several parameters must be optimized to raise the sensitivity and specificity of the digital genome scanning method. The present inventors have already shown, by computer simulation using whole-genome information, that the combination of the restriction enzyme used to cut human genomic DNA and the size of the extracted tag sequences determines the minimum region of alteration that can be detected.
[0139] For example, with the restriction enzyme EcoRI (a 6-base recognition enzyme) and a tag sequence size of 20 to 25 base pairs, tag sequences are spaced 200 kb to 20 Mb apart, 2 Mb on average. With the restriction enzyme MboI (a 4-base recognition enzyme) and a tag sequence size of 20 to 30 base pairs, tag sequences are densely spaced, 10 bp to 460 kb apart and 20 kb on average. The control tag data generation device 200 therefore makes it possible to select the optimal restriction enzyme from a wide variety of restriction enzymes according to the target resolution.
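Spacing figures of this kind (minimum, maximum, and mean distance between neighboring tags) follow directly from the tag positions for a given enzyme and size range. The helper below is a hypothetical illustration with made-up positions; it is not derived from the data behind the figures quoted above.

```python
def spacing_stats(positions):
    """Return (min, max, mean) distance between neighboring tag positions."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return min(gaps), max(gaps), sum(gaps) / len(gaps)


# Hypothetical tag start positions on one chromosome, in base pairs.
print(spacing_stats([0, 10, 40, 100]))
```

Comparing these statistics across candidate enzymes and size ranges is one way to judge which combination is dense enough for a target resolution.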
[0140] Each tag sequence carries positional information for the human genome from which it derives and, once in a database, can be mapped onto a chromosome immediately. Summing the numbers of tag sequences on each chromosome can therefore be used for highly accurate quantification of DNA amount. Specifically, because the control tag data generation device 200 uses, as described above, the MboI restriction enzyme, which is suited to obtaining control tags from the human genomic DNA sequence, it can generate control tag data suited to matching against the tag data under analysis, improving the reliability and efficiency of DNA sequence analysis.
[0141] さらに、対照タグデータ生成装置 200によれば、対照タグデータを生成する際に、 選択された複数の対照タグをそれぞれ対照ゲノム DNA配列中の対応する位置に関 連付けるため、対照タグデータ中には、対照ゲノム DNA配列の位置情報と選択され た対照タグデータのシークェンスデータとを含めておけば足りる。そのため、ヒトゲノム DNA配列全体のシークェンスを直接に解析対象タグデータと対応付けする場合に 比べて、 DNA配列解析装置 100の処理負荷を軽減できる。  [0141] Furthermore, according to the control tag data generation device 200, when generating the control tag data, the control tag data is associated with the corresponding position in the control genomic DNA sequence. It is sufficient to include the positional information of the control genomic DNA sequence and the sequence data of the selected control tag data in the data. Therefore, the processing load of the DNA sequence analyzer 100 can be reduced as compared with the case where the sequence of the entire human genome DNA sequence is directly associated with the tag data to be analyzed.
[0142] ここで、タグ生成に実際に用いる"最適"な制限酵素を選択するためには、全ゲノム データと多数の制限酵素を網羅した詳細なシミュレーションが必要あるため、大きな コンピュータ処理能力が要求される。これに対しては、対照タグデータ生成装置 200 によれば、本発明者らが以前から取り組んできたパラレルコンピューティング技術を 用いることで処理を高速ィ匕することが可能である。また、アルゴリズムには-ユーラル ネットワークを用いることで従来のコンピュータ処理や統計的手法より的確な結果を 得うるシステムを構築し得る。  [0142] Here, in order to select the "optimal" restriction enzyme that is actually used for tag generation, detailed simulations that cover the entire genome data and a large number of restriction enzymes are required, so a large computer processing capacity is required. Is done. On the other hand, according to the control tag data generation device 200, it is possible to speed up the processing by using the parallel computing technology that the present inventors have been working on. In addition, a system that can obtain more accurate results than conventional computer processing and statistical methods can be constructed by using a Yural network for the algorithm.
[0143] また、解析した ヽ目的の細胞のゲノム力も抽出するタグ配列の数である力 例えば 、 100、 000タグ配列のシークェン解析から、 lOOkb以上の増幅領域(10倍程度)、 600kb以上のホモ欠失領域、 4Mb以上の染色体コピー数変化(n= lor3)を検出で きることが、モンテカルロシミレーシヨンによって裏づけられて!/、る。 [0143] Also, the force that is the number of tag sequences to extract the genomic power of the analyzed target cells. For example, from a sequence analysis of 100,000 tag sequences, an amplification region of lOOkb or more (about 10 times), a homology of 600kb or more Detected deletion region, chromosome copy number change of 4Mb or more (n = lor3) Being supported by Monte Carlo simulation!
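To give a feel for why these region sizes are plausible at 100,000 tags, the following Python sketch computes the expected number of tags falling in a region. It is a back-of-the-envelope calculation, not the Monte Carlo simulation mentioned above: the genome size of 3,000,000,000 bp, the uniform distribution of tags, and the function name `expected_tags` are all assumptions introduced here.

```python
def expected_tags(total_tags, region_bp, copy_ratio=1.0, genome_bp=3_000_000_000):
    """Expected number of sequenced tags that fall inside one region.

    Assumptions (not from the source): tags are spread uniformly over the
    genome, and the region is small enough that its copy number change does
    not noticeably shift the genome-wide total.
    """
    return total_tags * (region_bp / genome_bp) * copy_ratio


normal = expected_tags(100_000, 100_000)                       # 100 kb, normal copy number
amplified = expected_tags(100_000, 100_000, copy_ratio=10.0)   # 100 kb, about 10-fold amplified
print(round(normal, 1), round(amplified, 1))  # 3.3 33.3
```

Under these assumptions a normal 100 kb region would be expected to yield only about 3 tags, so only a strong (roughly 10-fold) amplification stands out at that size, whereas smaller changes such as a copy ratio of 0.5 or 1.5 would need a much larger region to accumulate enough tags.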
[0144] <3. Generation of analysis target tag data>
FIG. 29 is a functional block diagram for explaining the internal configuration of the analysis target tag data generation device 300. The analysis target tag data generation device 300 includes an analysis target DNA molecule application unit 802 to which an analysis target DNA molecule, which is a genomic DNA molecule of a predetermined species such as human, is applied. The analysis target tag data generation device 300 also includes a restriction enzyme application unit 804 to which a restriction enzyme (such as MboI) for cleaving the analysis target DNA molecule is applied.
[0145] The analysis target tag data generation device 300 also includes a restriction enzyme treatment unit 806 for cleaving the DNA molecule containing the analysis target DNA sequence with the restriction enzyme (such as MboI). It further includes an electrophoresis unit 808 for separating the plurality of cleaved DNA fragments, and a DNA fragment extraction unit 810 for extracting, from the plurality of DNA fragments obtained by cleaving the DNA molecule with the restriction enzyme, the DNA fragments whose number of bases falls within a predetermined range.
[0146] The analysis target tag data generation device 300 also includes a concatemer generation unit 812 that generates concatemers by linking a plurality of the DNA fragments extracted by the DNA fragment extraction unit 810. It further includes a secondary concatemer generation unit 814 that generates secondary concatemers by linking a plurality of the concatemers generated by the concatemer generation unit 812.
[0147] The analysis target tag data generation device 300 also includes a sequencing unit 816 that sequences the DNA sequence of the secondary concatemers, and a sequencing result storage unit 818 that stores the sequencing results produced by the sequencing unit 816.
[0148] The analysis target tag data generation device 300 also includes an analysis target tag data generation unit 820 that generates analysis target tag data, which is a set of a plurality of analysis target tags, based on the sequencing results acquired from the sequencing result storage unit 818. It further includes an analysis target tag data storage unit 822 that stores the analysis target tag data generated by the analysis target tag data generation unit 820.
[0149] Finally, the analysis target tag data generation device 300 includes an output unit 824 that acquires the analysis target tag data from the analysis target tag data storage unit 822 and outputs it to the DNA sequence analyzer 100.
[0150] The generation of analysis target tag data from an analysis target genomic DNA molecule by the analysis target tag data generation device 300 is described in detail below.
[0151] 1. Preparation of DGS tags and acquisition of tag data
First, the preparation of DGS tags and the acquisition of tag data are described. As the human genomic DNA molecule, genomic DNA was extracted from the gastric cancer cell line HSC45. Next, 20 to 40 µg of genomic DNA was treated with the restriction enzyme MboI at 37°C for 16 hours and subjected to 3% NuSieve agarose electrophoresis. The range of about 30 to 60 bases was then excised from the gel, the gel was dissolved with GELase (Epicentre), and the tag DNA was purified by ethanol precipitation.
[0152] pBluescript II KS(+) (Stratagene) was used as the cloning vector for the concatemers; it was treated with the restriction enzyme BamHI and then with alkaline phosphatase before being used for cloning. Concatemers were prepared from the tags and cloned into the vector using the Takara Ligation Kit Ver. 2.1.
[0153] The vector was then introduced into E. coli DH10B by electroporation, and positive colonies were selected by color selection using X-gal. Each colony was cultured in LB medium containing ampicillin, the vector DNA was purified with an automated nucleic acid extractor (KURABO and QIAGEN), and the DNA was analyzed after RNase treatment.
[0154] Inserts were confirmed by double digestion with XhoI and SacI. Sequencing reactions on vectors containing concatemers were carried out with T3 and T7 primers, using a BigDye Terminator Cycle Sequencing Kit and a GeneAmp PCR System 9700 (Applied Biosystems). An ABI PRISM 3100 Genetic Analyzer (Applied Biosystems) was used for base sequence analysis of the products.
[0155] 2. Preliminary in vitro experiment using genomic DNA of a gastric cancer cell line
FIG. 30 is a diagram for explaining the extraction of tag DNA and the preparation of concatemers. More specifically, with regard to tag DNA extraction and concatemer preparation, a preliminary DGS experiment was performed using actual human genomic DNA extracted from the gastric cancer cell line HSC45. FIG. 30A shows the result of digesting genomic DNA extracted from HSC45 with MboI and electrophoresing it on a 3% NuSieve gel.
[0156] A smear was observed even at 100 bp and below, where tag DNA useful for DGS is thought to be present, and a distinct band was found near 40 bp. This band was presumed to reflect the tag population with gap lengths of 36 and 37 bases, which the MboI virtual tag analysis described above showed to be conspicuously numerous (and to contain a high proportion of repeat-derived tags). Because it is difficult to determine fragment sizes accurately at 100 bp and below, the band near 40 bp was used as a marker, and the gel was excised as three fractions: a fraction longer than the band (fraction #3), a fraction containing the band (fraction #4), and a fraction shorter than the band (fraction #5). Tag DNA was purified from each.
[0157] Next, the tags obtained were linked by ligation to prepare concatemers, which were introduced into the pBluescript vector for cloning. Initially at most one tag was introduced into the vector, but raising the tag concentration improved the efficiency of concatemer extension, and concatemers of 3 to 5 linked tags could be obtained (FIG. 30B).
[0158] FIG. 30C shows a restriction enzyme map prepared from the base sequence of a representative concatemer. This concatemer consists of 5 tags. It was prepared from tags derived from fraction #3, and the actual size of each tag was 43 to 52 bp. When each tag was mapped onto the chromosomes by BLAT search, the tags were confirmed to derive from different chromosomes, such as chromosomes 1, 6, 11, and X, as shown in FIG. 30C. The concatemer also contained one SINE-derived repeat tag.
[0159] 3. Analysis of tags obtained in the preliminary in vitro experiment
The concatemer sequences obtained in the experiment above were split into tags at the MboI sites, the MboI recognition sequence GATC was added to both ends of each tag, and a BLAT search was performed to map the tags onto the genome. At this point, the uniqueness of each sequence and whether it was a repeat sequence were also analyzed.
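The splitting step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function name `split_concatemer` and the example sequence are hypothetical, and the sketch assumes that only stretches bounded by a GATC site on both sides are tags, so any leading or trailing vector-derived sequence is discarded.

```python
def split_concatemer(concatemer, site="GATC"):
    """Split a concatemer read into tags and restore the site on both ends.

    Assumption: only pieces flanked by `site` on both sides are tags, so the
    first and last pieces (e.g. vector-derived flanks) are dropped.
    """
    parts = concatemer.upper().split(site)
    inner = parts[1:-1]  # pieces that have a site on each side
    return [site + gap + site for gap in inner if gap]


# Hypothetical read: lowercase flanks stand in for vector sequence.
read = "ccGATCAAAAGATCTTTTGATCgg"
print(split_concatemer(read))  # ['GATCAAAAGATC', 'GATCTTTTGATC']
```

Each returned string could then be submitted as a query to a genome search tool for mapping; that step is outside this sketch.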
[0160] FIG. 31 is a graph showing the results of analyzing and tallying the base sequences of the tags used in the preliminary experiment. As a preliminary experiment, the base sequences of a total of 81 tags were analyzed and tallied (FIG. 31). The size distribution of all tags and the ratio of repeat-derived to non-repeat-derived tags are shown in FIG. 31A: the tag gap lengths ranged from 25 to 58 bp, and 38 of the 81 tags (46.9%) were derived from non-repeat sequences.
[0161] FIG. 31B shows the size breakdown of the tag fractions collected by size after electrophoresis, that is, fractions #3, #4, and #5. The peak tag size of each fraction roughly matched the size position on the electrophoresis gel, but the fractions were not necessarily separated exclusively by size.
[0162] The non-repeat rate in each fraction was also about 45%, which roughly matched the result predicted in advance by the virtual tag analysis (FIG. 31C). Fraction #4 had been predicted to have a high repeat rate, but, perhaps because the size fractionation was imperfect, no conspicuously high repeat rate was observed.
[0163] 4. Tag base sequence analysis and genome mapping
MacVector 7.2.2 with AssemblyLIGN (Accelrys) and Clone Manager 7 Professional Suite (Sci Ed Central) were used for alignment of the concatemer base sequences and analysis of restriction enzyme sites.
[0164] Each concatemer was sequenced from two directions using the T3 and T7 primers, an alignment of the two data sets was created, and the portion that matched in both was extracted as the concatemer sequence. The concatemer base sequence was split into tags at the MboI sites, the MboI recognition sequence GATC was added to both ends of each tag, and a BLAT search was performed to map the tags onto the genome while classifying each sequence by uniqueness and by whether it was a repeat sequence. Human BLAT Search (http://genome.ucsc.edu/cgi-bin/hgBlat) and blastn (http://www.ncbi.nlm.nih.gov/blast/) were used for genome mapping of each tag and for repeat class classification.
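The idea of keeping only the portion confirmed by both sequencing directions can be sketched in Python as below. This is a deliberate simplification and not the method of the alignment software named above: it uses exact matching via the standard library's `difflib.SequenceMatcher` rather than a gap-tolerant alignment, and the function names and toy reads are hypothetical. Which primer reads which strand is also an assumption.

```python
from difflib import SequenceMatcher

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def shared_region(forward_read, reverse_read):
    """Longest stretch found identically in both reads.

    Assumption: `reverse_read` comes from the opposite strand, so it is
    reverse-complemented first. Exact matching only; sequencing errors
    would shorten the result.
    """
    rev = reverse_complement(reverse_read)
    matcher = SequenceMatcher(None, forward_read, rev, autojunk=False)
    m = matcher.find_longest_match(0, len(forward_read), 0, len(rev))
    return forward_read[m.a:m.a + m.size]


# Toy example (hypothetical sequences): the same insert with different flanks.
insert = "GATCAAAACCCCGATC"
forward = "ACGT" + insert + "TT"
reverse = reverse_complement("GG" + insert + "AC")
print(shared_region(forward, reverse))  # GATCAAAACCCCGATC
```

In practice a proper pairwise alignment that tolerates mismatches would be the safer choice; the sketch only shows the orientation handling and the "keep what both reads agree on" idea.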
[0165] Next, as already described above, the analysis target tag data consisting of the non-repeat tags obtained was matched against the control tag data, the copy number was determined, and the analysis target tag data was mapped onto the genome and tallied for each region of the chromosomes.
[0166] Because the number of tags obtained is considered to be proportional to the length of each chromosomal region, in other words to the number of virtual tags present in each region, the actual tag count was divided by the number of virtual tags in each region, or by the length of each region, and expressed as a tag density. Variation in tag density was then observed among the chromosomal regions (not shown).
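The tag density calculation described above can be written as a short Python sketch. The function name, the region labels, and the counts below are hypothetical and chosen only to illustrate the normalization by virtual tag count; they are not data from the experiment.

```python
def tag_density(observed_counts, virtual_counts):
    """Observed tag count divided by virtual tag count, per region.

    Regions with no virtual tags are skipped to avoid division by zero;
    regions with virtual tags but no observed tags get a density of 0.
    """
    density = {}
    for region, n_virtual in virtual_counts.items():
        if n_virtual > 0:
            density[region] = observed_counts.get(region, 0) / n_virtual
    return density


# Hypothetical counts for three regions of one chromosome.
observed = {"chr1:0-10Mb": 12, "chr1:10-20Mb": 30}
virtual = {"chr1:0-10Mb": 400, "chr1:10-20Mb": 500, "chr1:20-30Mb": 450}
for region, value in tag_density(observed, virtual).items():
    print(region, value)
```

A region whose density is clearly above or below that of its neighbours would be a candidate for a copy number gain or loss; how large a deviation counts as significant is not specified here.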
[0167] 5. Summary
FIG. 32 is a conceptual diagram for explaining the flow of operation of the analysis target tag data generation device 300. To summarize the above description, the analysis target tag data generation device 300 performs the following steps in order.
[0168] First, genomic DNA is extracted, and 40 to 80 µg is cleaved with the restriction enzyme MboI.
Next, DNA fragments of 30 to 60 bp are recovered as tags by 3% NuSieve agarose electrophoresis.
Next, concatemers of linked tags are prepared (1st ligation).
Next, the concatemers are introduced into a BamHI-treated pBluescript II KS+ vector (2nd ligation).
Next, the vector is introduced into E. coli, the clones are collected in bulk, and the vector DNA is purified (primary library).
Next, the primary library vector is treated with the restriction enzymes SpeI and PstI to excise the concatemer sequences.
Next, the concatemers are re-extended: the excised concatemers are ligated to one another (3rd ligation).
Next, the re-extended concatemers are introduced into a PstI- and/or SpeI-treated pBluescript II KS+ vector (4th ligation).
Next, the vector is introduced into E. coli, the clones are collected individually, and the base sequences of the concatemers are analyzed.
Next, tag data is obtained and mapped onto the genome, and the number of tags is tallied.
Next, the tag density is analyzed from the number of tags obtained, and the genome is searched for copy number gains and losses.
[0169] FIG. 33 is a conceptual diagram for explaining the re-extension of concatemers. In actual analysis, the analysis target tag data generation device 300 links tags to prepare concatemers and uses them for base sequence analysis. However, long concatemers are difficult to produce by ordinary ligation. For DGS, a two-step protocol involving re-extension of concatemers was developed, and long concatemers were successfully produced. This raises the efficiency of base sequence analysis.
[0170] In addition, similar genome quantification methods to date perform PCR amplification during tag preparation, whereas DGS using the analysis target tag data generation device 300 uses no PCR at all, which allows more accurate quantification. This has the advantage that the data obtained is highly reliable.
[0171] FIG. 34 is a restriction enzyme map for explaining a method of determining concatemer structure. Based on the concatemer base sequence analyzed by the method described above, the concatemer structure can be inferred, from the arrangement of restriction enzyme sites alone, to be the sequence structure shown in FIG. 34.
[0172] FIG. 35 is a sequence map of a DNA sequence for explaining the method of determining concatemer structure. FIG. 36 is the sequence map obtained by removing the vector sequence from the sequence map of FIG. 35. FIG. 37 is a sequence map for explaining how tags are cut out from the sequence map of FIG. 36. In this way, the entire region containing the concatemer is sequenced, the vector sequence is removed from the sequence map, and the sequence information of the tags is cut out from the remaining sequence map. The DNA sequences of many analysis target tags can therefore be analyzed in a single sequencing run, which improves the efficiency of sequencing the analysis target tags.
[0173] FIG. 38 is a flowchart for explaining the operation of the analysis target tag data generation device 300. When the series of steps starts, a genomic DNA molecule of a predetermined species such as human is applied to the analysis target DNA molecule application unit 802, such as a tube (S502). An appropriate restriction enzyme such as MboI is applied to the restriction enzyme application unit 804, such as a tube (S504). Then, in the restriction enzyme treatment unit 806, such as a restriction enzyme kit, the analysis target DNA molecule and the restriction enzyme are brought into contact and incubated under appropriate conditions, whereby the restriction enzyme treatment is performed (S506).
[0174] The genomic DNA molecule cleaved at the restriction enzyme cleavage sites by the restriction enzyme treatment separates into a plurality of DNA fragments. These DNA fragments are electrophoresed by the electrophoresis unit 808, such as an electrophoresis tank, and are thereby separated by length in bases (S508). Of the DNA fragments separated by size, those whose number of bases falls within a predetermined range are excised from the electrophoresis agarose gel and extracted by the DNA fragment extraction unit 810, such as a DNA extraction kit, using a miniprep method or the like (S510).
[0175] Next, in the concatemer generation unit 812, such as a ligation kit, the DNA fragments thus obtained, whose number of bases falls within the predetermined range, are ligated to one another to generate concatemers (S512). Each concatemer formed by linking a plurality of DNA fragments is then ligated into the multicloning site or the like of a vector such as a plasmid to generate a concatemer-containing vector. The concatemer-containing vector is introduced into E. coli by transformation, and the E. coli is cultured to amplify the concatemer-containing vector. The concatemer-containing vector is then extracted from the cultured E. coli by a miniprep method or the like.
[0176] The secondary concatemer generation unit 814, such as a ligation kit, then further links a plurality of the concatemers that were amplified by being ligated into the vector and culturing the vector's host, and thereby generates secondary concatemers (S514). The DNA sequence of each secondary concatemer is sequenced using the sequencing unit 816, such as a DNA sequencer (S516). The sequencing unit 816 stores the sequencing results in the sequencing result storage unit 818.
[0177] The analysis target tag data generation unit 820 then acquires the sequencing results from the sequencing result storage unit 818 and generates analysis target tag data based on the sequencing results of those DNA fragments whose number of bases falls within the predetermined range (S518). The analysis target tag data generation unit 820 stores the generated analysis target tag data in the analysis target tag data storage unit 822.
[0178] Finally, the output unit 824 acquires the analysis target tag data from the analysis target tag data storage unit 822 and outputs it to the DNA sequence analyzer 100 (S520), and the series of steps ends.
[0179] The advantages of the analysis target tag data generation device 300 are described below.
By using the analysis target tag data generation device 300, DGS can cleave the genome with a restriction enzyme suited to the purpose, such as the 4-base recognition restriction enzyme MboI, then collect the fragments (about 30 to 80 bp in the case of MboI), count them as tags, and analyze the genomic copy number. This improves both the reliability of the data obtained and the efficiency of data acquisition.
[0180] In analysis methods based on DNA hybridization, typified by CGH, hCot-1 DNA is used in advance and repetitive sequences are removed from the sample by molecular biological techniques. This means discarding the information on regions containing repetitive sequences, which account for about 45% of the genome. In DGS using the analysis target tag data generation device 300, by contrast, sequence data is acquired for every analysis target tag whose number of bases falls within the predetermined range, so copy number abnormalities can be analyzed over a wide range of the whole genome, including sites containing repetitive sequences. Therefore, when the digital genome scanning method (Digital Genome Scanning, hereinafter DGS) is performed as a method of comprehensively examining the amount of human genomic DNA with high resolution and high accuracy based on quantitative analysis of short fragments of human genomic DNA, the accuracy of copy number abnormality analysis by DGS improves.
[0181] The working speed of DGS depends on how quickly the base sequence data of the analysis target tags can be acquired. With ordinary analytical equipment, sequence analysis of 196 samples takes approximately 24 hours. On this premise, the present inventors' simulation indicates that analysis of 10,000 tags is needed to identify an amplified region of 1.3 Mbp. With 10,000 tags as the goal, 51 days are required if each sample contains only 1 tag, whereas the goal can be reached in 5 days if each sample contains 10 tags, so the method can be expected to work as a practical system.
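The 51-day and 5-day figures follow directly from the stated throughput, as the short Python check below shows. The function name `days_to_goal` is introduced here for illustration; the only inputs are the numbers given in the paragraph above (196 samples per 24 hours and a goal of 10,000 tags).

```python
def days_to_goal(target_tags, tags_per_sample, samples_per_day=196):
    """Days of sequencing needed to read `target_tags` tags.

    Assumes a constant throughput of `samples_per_day` samples per 24 hours
    and the same number of tags in every sample.
    """
    return target_tags / (tags_per_sample * samples_per_day)


print(round(days_to_goal(10_000, 1), 1))   # 51.0
print(round(days_to_goal(10_000, 10), 1))  # 5.1
```

The time scales inversely with the number of tags per sample, which is why concatemer length is the main lever on overall throughput.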
[0182] このとき、解析対象タグを連結したコンカテマ一の作製については、 DNA濃度、温 度設定、反応時間など様々な条件検討を行ったが、一度のライゲーシヨンでつながる タグ数は平均すると 2〜3タグであり、容易には長いコンカテマ一を調整できな力つた 。コンカテマ一の長さの分布を電気泳動で確認すると、 1タグのみのコンカテマ一とが 最も多ぐバンドとして目視で確認された。その一方、電気泳動上は確認できない量 で長 、コン力テマ一も存在して 、るだろうと考えられる。  [0182] At this time, regarding the production of concatemers with linked tags to be analyzed, various conditions such as DNA concentration, temperature setting, and reaction time were examined, but the average number of tags connected in one ligation was 2 to It has 3 tags, and it was not easy to adjust a long concatema. When the length distribution of the concatemer was confirmed by electrophoresis, it was visually confirmed as the band with the largest number of concatemers with only one tag. On the other hand, it is thought that there will also be a long and strong force in an amount that cannot be confirmed by electrophoresis.
[0183] そこで、本発明者らは、長いコンカテマ一^^め、さらにコンカテマ一同士を再度つ なぎ合わせればより長いコンカテマ一を得ることができると予測した (コン力テマ一の 再延長)。そして、これは上述したようなステップによる実験手法により可能であること が実験で確認できた。すなわち、一回のステップのコンカテマ一生成ステップで、 in vitro実験によって、 3〜5個の異なる染色体由来のタグがつらなったコンカテマ一を 形成できるということが明らかになった。そして、このコンカテマ一生成ステップを 2階 繰り返せば、これまでの予備実験で平均約 7タグのコンカテマ一を得ることができるこ とがわかり、 DGSにおける解析対象タグデータの生成時間に関する課題を克服でき た。 [0183] Therefore, the present inventors predicted that a longer concatamer could be obtained by reconnecting the long concatamers and then reconnecting them. Re-extension). It was confirmed by experiments that this was possible by the experimental method using the steps described above. In other words, it was revealed that in one step, a concatema can be formed by combining tags from 3 to 5 different chromosomes by in vitro experiments. If this concatema generation step is repeated up to the second floor, it can be seen that an average of about 7 tags of concatemer can be obtained in the preliminary experiments so far, and the problem related to the generation time of tag data to be analyzed in DGS can be overcome. It was.
[0184] すなわち、 in vitro解析では、 DGSの出発点となるタグの精製とコンカテマ一の形 成が最も重要でかつ困難なステップであることが明ら力となった。一つのコンカテマ 一に何個のタグをつなげるかは DGS終了までの作業時間と経費を大きく左右するた め、今後最も力を注いで効率の改善を目指す必要がある課題と考えられる。また、得 られたタグ配列の解析は単純な作業ではある力 大量のデータを処理するには自動 化が必須である。この点に関しては、タグデータベースの構築と並行して、タグデータ の解析の自動化を行うことにより解決される。  [0184] That is, in vitro analysis, it became clear that purification of tags and formation of concatemers as the starting point of DGS were the most important and difficult steps. The number of tags to be connected to one concatema greatly affects the work time and cost until the end of DGS. Therefore, it is considered that this is an issue that needs to be focused on to improve efficiency. In addition, analysis of the obtained tag sequence is a simple task. Automation is essential to process a large amount of data. This issue can be resolved by automating the analysis of tag data in parallel with the construction of the tag database.
[0185] Meanwhile, the GST (genome signature tags) method, which applies the principle of SAGE (serial analysis of gene expression) to microbial genomic DNA, has been reported. However, its experimental process is cumbersome and introduces PCR-derived bias, so it has been difficult to obtain accurate quantitative results for the highly complex human genome.
[0186] In contrast, with the analysis-target tag data generation device 300, secondary concatemers can be generated without PCR, as described above. The experimental process is therefore simple and PCR-derived bias is unlikely to arise, so accurate quantitative results can be obtained even for the highly complex human genome.
[0187] <Other variations>
A variation of the above embodiment is described below from a viewpoint different from the description above. FIG. 39 is a conceptual diagram for explaining the flow of automatic tag analysis. A variation of the data flow when automatic tag analysis is performed in the digital genome scanning shown in FIG. 1 is described in detail.
[0188] First, the raw data processing step, in which tag data are cut out, is described. When a concatemer's DNA sequence has been sequenced in both directions, an alignment of the sequences read from the two directions is created. Next, the restriction enzyme sites in the aligned DNA sequences are identified, and the concatemer structure within these sequences is determined. The vector sequence is then removed from these DNA sequences, and each tag sequence is cut out.
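The tag cut-out step can be illustrated with a minimal Python sketch. It assumes that the vector sequence has already been removed, that the two-direction reads have already been reconciled into one insert sequence, and that tags are delimited by the MboI site GATC. The function names `reverse_complement` and `extract_tags` are hypothetical and do not come from the specification.

```python
SITE = "GATC"  # MboI recognition sequence (assumed tag delimiter)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, e.g. to orient a read taken from the other direction."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def extract_tags(insert_seq: str, site: str = SITE) -> list[str]:
    """Cut a concatemer insert into tags, each keeping the site at both ends.

    Adjacent tags in a concatemer share one site, so the gaps between
    consecutive sites are recovered with split() and the site is re-attached.
    Sequence before the first site and after the last site is discarded.
    """
    pieces = insert_seq.upper().split(site)
    inner_gaps = pieces[1:-1]  # only pieces bounded by a site on both sides
    return [site + gap + site for gap in inner_gaps]


if __name__ == "__main__":
    for tag in extract_tags("ttGATCaaaaGATCccGATCgg"):
        print(tag)
```

Keeping the GATC at both ends follows the convention used later in the text, where the gap length of a tag excludes the 8 bases of the two terminal sites.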
[0189] Next, the steps of tag mapping and determination and of result management are described. In the tag mapping and determination step, the tag sequences derived from the raw data as described above are matched against the virtual tag database.
[0190] In the result management step, data are then aggregated per vector on the basis of the results of the tag mapping and determination step. The concatemer state, the number of tags, and the reasons for errors are analyzed to obtain data that can be used for reanalysis and reused when reviewing experimental conditions. Aggregating such data also makes it possible to check for duplicated concatemers.
[0191] The result management step also aggregates data per tag; that is, the number of votes received by each virtual tag is totaled. The per-tag data are then analyzed and visualized on the basis of these totals. Dynamic analysis and visualization in which the window size and threshold are varied can be performed (for example, a tag density histogram or a grid display). By referring to the visualized per-tag data, chromosomal abnormalities such as duplications and deletions in each region of the genomic DNA sequence can be found easily.
[0192] For example, when a tag density histogram is used, the tag density can be determined by dividing the total number of raw tags (analysis-target tags) in a given region of the analysis-target genomic DNA sequence by the total number of virtual tags (control tags) in the corresponding region of the control genomic DNA sequence. In this case, the window size means the number of virtual tags in the denominator of the tag density calculation (corresponding to the width of the genomic region used for the density calculation).
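As a worked illustration of this ratio, the following Python sketch computes the tag density of one region and, as an added convenience that is not part of the specification, the density relative to the genome-wide density. The function names and the numbers in the usage lines are hypothetical.

```python
def tag_density(raw_tag_votes: int, virtual_tags_in_region: int) -> float:
    """Raw tags counted in a region divided by virtual tags in the same region."""
    if virtual_tags_in_region <= 0:
        raise ValueError("the window must contain at least one virtual tag")
    return raw_tag_votes / virtual_tags_in_region


def relative_density(region_votes: int, region_vtags: int,
                     total_votes: int, total_vtags: int) -> float:
    """Region density divided by genome-wide density (an illustrative normalization)."""
    return tag_density(region_votes, region_vtags) / tag_density(total_votes, total_vtags)


if __name__ == "__main__":
    # Hypothetical counts: 8 votes over a 32-tag window, 64 votes over 1024 tags overall.
    print(tag_density(8, 32))
    print(relative_density(8, 32, 64, 1024))
```

A relative density well above 1 would point to a candidate gain and a value near 0 to a candidate loss, although the decision threshold itself is left open here.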
[0193] FIG. 40 is a conceptual diagram for explaining the flow of classifying tags. Classifying tags corresponds to setting a predetermined contribution for each tag. The classification method shown in this figure is one variation, and various other classification methods are possible.
[0194] In the example of this figure, the first step determines whether the full length of the raw tag (analysis-target tag) matches 100% of the full length of exactly one V-tag (virtual tag: control tag). If it does, determination code 0 (for example, contribution 1) is assigned to the raw tag and one vote is cast for that V-tag. If it does not, the flow proceeds to the next step.
[0195] The next step determines whether the length of the raw tag is outside the range of tag lengths listed in the VT-DB (virtual tag database). If it is, either the raw tag was cut out incorrectly or the tag size setting of the VT-DB is at fault; determination code 1 (for example, contribution 0) is assigned to the raw tag and no vote is cast. If the length is within the range, the flow proceeds to the next step.
[0196] The next step determines whether the raw tag matches 100% of the full length of two or more V-tags. If it does, the raw tag is judged to be repeat-derived; determination code 2 (for example, contribution 0) is assigned to the raw tag and no vote is cast. If it does not, the flow proceeds to the next step.
[0197] The next step determines whether exactly one V-tag exists that differs from the raw tag by 1 to 3 mismatched bases (or less than 10%). If so, the raw tag is judged likely to contain a sequencing error or to be an SNP tag; the raw tag is sent to re-voting and determination code 3 is assigned to it. If not, the flow proceeds to the next step.
[0198] The next step determines whether two or more V-tags exist that differ from the raw tag by 1 to 3 mismatched bases (or less than 10%). If so, the raw tag is judged to be repeat-derived; determination code 4 (for example, contribution 0) is assigned to the raw tag and no vote is cast. If not, the flow proceeds to the next step.
[0199] The next step determines whether one or both ends of the raw tag are a SpeI site (ACTAGT) or a PstI site (CTGCAG). If so, the flow proceeds to the step that determines whether the full length of the raw tag matches 100% of a part of exactly one V-tag. If not, the flow proceeds to the step that determines the origin of the DNA sequence.
[0200] In the step that determines the origin of the DNA sequence, determination code 10 (for example, contribution 0) is assigned to the raw tag, no vote is cast, and the BLAST algorithm is used to determine whether the tag has high homology to a sequence derived from the human genome. If there are mismatches of 4 bases (10%) or more, the sequence is judged not to be derived from the human genome, and homology to E. coli, mitochondria, the vector (which should not occur if the vector was removed beforehand), and the genomes of various other species is searched to determine the origin of the DNA sequence.
[0201] In the step that determines whether the full length of the raw tag matches 100% of a part of exactly one V-tag, a raw tag that does match is judged to have been cut during excision of the concatemer; it is sent to re-voting and determination code 5 (for example, contribution 0) is assigned to it. If it does not match, the flow proceeds to the next step.
[0202] The next step determines whether the full length of the raw tag matches 100% of a part of two or more V-tags. If it does, the raw tag is judged to be a repeat sequence; determination code 6 (for example, contribution 0) is assigned to the raw tag and no vote is cast. If it does not, the flow proceeds to the next step.
[0203] The next step determines whether the raw tag matches a part of exactly one V-tag with 1 to 3 mismatched bases. If it does, the raw tag is judged to contain a sequencing error or to be an SNP tag; it is sent to re-voting and determination code 7 (for example, contribution 0) is assigned to it. If it does not, the flow proceeds to the next step.
[0204] The next step determines whether the raw tag matches a part of two or more V-tags with 1 to 3 mismatched bases. If it does, the raw tag is judged to be a repeat sequence; determination code 8 (for example, contribution 0) is assigned to the raw tag and no vote is cast. If it does not, the flow proceeds to the next step.
[0205] Raw tags that reach the next step belong to none of the above categories. Determination code 9 (for example, contribution 0) is assigned to these raw tags and no vote is cast.
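The decision flow of FIG. 40 as described in paragraphs [0194] to [0205] can be sketched in Python as follows. This is an illustrative reading, not the implementation of the specification. It assumes that the VT-DB is a dictionary mapping V-tag IDs to sequences, that lengths are measured on the whole sequence, that mismatches are counted as Hamming distance, and that "part of a V-tag" means a same-length window of a longer V-tag. The alternative "less than 10%" criterion, re-voting, and the BLAST search are not implemented.

```python
SPEI, PSTI = "ACTAGT", "CTGCAG"  # recognition sequences named in the text


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def partial_mismatches(raw: str, v: str):
    """Fewest mismatches of raw against any same-length window of a longer V-tag."""
    if len(raw) >= len(v):
        return None
    return min(hamming(raw, v[k:k + len(raw)])
               for k in range(len(v) - len(raw) + 1))


def classify(raw: str, vtags: dict[str, str], min_len: int, max_len: int,
             max_mm: int = 3):
    """Return (determination code, matched V-tag ID or None)."""
    exact = [i for i, v in vtags.items() if v == raw]
    if len(exact) == 1:
        return 0, exact[0]              # full-length match to one V-tag: one vote
    if not min_len <= len(raw) <= max_len:
        return 1, None                  # length outside the VT-DB range
    if len(exact) >= 2:
        return 2, None                  # full-length match to several V-tags
    near = [i for i, v in vtags.items()
            if len(v) == len(raw) and 1 <= hamming(raw, v) <= max_mm]
    if len(near) == 1:
        return 3, near[0]               # candidate for re-voting
    if len(near) >= 2:
        return 4, None
    if not any(raw.startswith(s) or raw.endswith(s) for s in (SPEI, PSTI)):
        return 10, None                 # origin to be checked separately
    part = {i: partial_mismatches(raw, v) for i, v in vtags.items()}
    exact_part = [i for i, d in part.items() if d == 0]
    if len(exact_part) == 1:
        return 5, exact_part[0]         # tag cut during concatemer excision
    if len(exact_part) >= 2:
        return 6, None
    near_part = [i for i, d in part.items() if d is not None and 1 <= d <= max_mm]
    if len(near_part) == 1:
        return 7, near_part[0]
    if len(near_part) >= 2:
        return 8, None
    return 9, None                      # belongs to no category
```

Returning the matched V-tag ID together with the code keeps the vote (code 0) and the re-voting candidates (codes 3, 5, and 7) traceable to a genomic position, while the other codes carry no ID because they cast no vote.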
[0206] Embodiments of the present invention have been described above with reference to the drawings, but these are illustrations of the invention, and various configurations other than the above can also be adopted.
[0207] For example, the above embodiment is configured to detect differences in copy number in human genomic DNA sequences, but it may also be used to detect copy number differences in the genomic DNA sequences of various organisms other than humans. This offers the advantage of enabling application to a wide range of industries beyond medicine, including food, chemistry, agriculture, forestry, and fisheries.
[0208] In the above embodiment, the entire human genomic DNA sequence is analyzed, but the analysis target may instead be a chromosomal DNA sequence that is part of the human genomic DNA sequence, or a DNA sequence that is a further part of a chromosome. This has the advantage of enabling efficient, pinpoint research focused on a restricted region of the human genome.
[0209] In the above embodiment, MboI is used as the restriction enzyme, but other restriction enzymes may be used. In particular, restriction enzymes that recognize and cleave a 4-base sequence yield a large number of virtual tags and are therefore well suited to this embodiment.
[0210] In the above embodiment, a partially different tag is assumed to be one whose length matches between the control tag and the analysis tag but which contains mismatches; however, this is not a limitation. For example, when raw tags are actually analyzed, a sequencing error may cause one base to be missing or, conversely, an extra base to be present. In such cases, an analysis tag whose base sequence contains an insertion or deletion of one or several bases can also be handled as a partially different tag by taking these gaps into account. Because gaps caused by sequencing errors are taken into account, analysis accuracy can be improved.
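One common way to take insertions and deletions into account is the edit (Levenshtein) distance, which counts substitutions, insertions, and deletions together. The following Python sketch is offered only as an illustration; the specification does not name this algorithm, and the function names and the default limit of 3 edits are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # delete ca
                           cur[j - 1] + 1,          # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def is_partially_different(raw_tag: str, v_tag: str, max_edits: int = 3) -> bool:
    """True if the tags differ, but by no more than max_edits edits (gaps included)."""
    return 1 <= edit_distance(raw_tag, v_tag) <= max_edits


if __name__ == "__main__":
    # One base missing from the raw tag relative to the V-tag.
    print(edit_distance("GATCAAAGATC", "GATCAAAAGATC"))
```

Unlike a position-by-position mismatch count, this measure does not penalize every base downstream of a single missing or extra base.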
[0211] The present invention is further described below by way of examples, but is not limited to them.
[0212] <Example 1>
This example describes the development of the digital genome scanning method and its application to disease gene research. Digital genome scanning is a technology that enables accurate, high-resolution, comprehensive quantitative analysis of human genomic DNA.
[0213] The principle of digital genome scanning (DGS), devised by the inventors, is to quantify copy number across the whole human genome by counting short fragments obtained by restriction enzyme digestion of genomic DNA (hereinafter "tags") as representatives of the genome, and to identify regions with abnormal copy number on that basis. The simulations and preliminary experiments for establishing the basis of the DGS method, and the application of DGS to a gastric cancer cell line, are described below.
[0214] 1. In silico analysis of virtual tags
1.1 Number of virtual tags by restriction enzyme
When starting DGS, the first question is which restriction enzyme to use to fragment genomic DNA. Human whole-genome DNA information was therefore subjected to restriction enzyme digestion on a computer, and the sizes and numbers of the resulting tags (hereinafter "virtual tags" or "V-tags") were tallied (FIG. 14). As a result, restriction enzymes that recognize 6 bases were judged to produce clearly fewer virtual tags than enzymes that recognize 4 bases, and too few to be sufficient.
[0215] Among the 4-base recognizing enzymes, MboI recognizes and cleaves the DNA sequence GATC, so a BamHI site can be used for cloning concatemers (tags ligated in a chain). The number of tags it generates was also intermediate compared with other enzymes. The following simulations therefore focused on virtual tags generated by MboI.
[0216] 1.2 Size distribution of MboI virtual tags
First, the breakdown of DNA fragments generated by MboI digestion of the whole genome was analyzed in silico. For the currently available human genome sequence (Build 35, hg17), a total of 7,056,567 MboI fragments are generated, and 95% of all MboI fragments were found to be 1377 bases or shorter.
[0217] Of these, for tag gap lengths (the length of the tag excluding the GATC at both ends of the MboI tag, 8 bases in total) of 20 to 80 bases, roughly 10,000 to 15,000 tags were found to exist at each length (FIG. 15). Tags were also found to be generated evenly, in proportion to the length of each chromosome. From these results, collecting short MboI restriction fragments as representatives of the genome was considered reasonable.
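The in silico digestion and the gap length tally can be sketched in Python as follows. The sketch assumes a single chromosome sequence held in memory as a string and treats every pair of consecutive GATC sites as one virtual tag; the function names are hypothetical, and reading a real genome build is outside its scope.

```python
from collections import Counter

SITE = "GATC"  # MboI recognition sequence given in the text


def virtual_tags(chrom_seq: str, site: str = SITE):
    """Return (start, gap_length, sequence) for every fragment bounded by two sites."""
    seq = chrom_seq.upper()
    starts = []
    pos = seq.find(site)
    while pos != -1:
        starts.append(pos)
        pos = seq.find(site, pos + 1)
    tags = []
    for a, b in zip(starts, starts[1:]):
        fragment = seq[a:b + len(site)]      # keeps the site at both ends
        tags.append((a, len(fragment) - 2 * len(site), fragment))
    return tags


def gap_length_counts(tags, lo: int, hi: int) -> Counter:
    """Count tags per gap length, restricted to lo <= gap length <= hi."""
    return Counter(gap for _, gap, _ in tags if lo <= gap <= hi)


if __name__ == "__main__":
    toy = "AAGATCACGTGATCGATCTTTTTTGATCAA"
    for start, gap, fragment in virtual_tags(toy):
        print(start, gap, fragment)
```

Applied per chromosome with `lo=20` and `hi=80`, the counter would give the kind of per-length distribution described for FIG. 15.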
[0218] Analysis of the number of virtual tags obtained when the collected tag size is restricted showed that about 540,000 virtual tags are obtained with a 40-base collection width and about 420,000 with a 30-base collection width (FIG. 18).
[0219] 1.3 Analysis of virtual tags derived from repeat sequences
A tag whose sequence matches multiple locations in the genome cannot be used as a tag in DGS, because its location in the genome cannot be determined. The inventors expected that tags derived from repeat sequences would be likely to be such invalid tags. The repeat-derived virtual tags were therefore tallied and their proportion was analyzed.
[0220] Matching against a repeat sequence database showed that about 60% of MboI virtual tags are derived from interspersed repeats (SINE, LINE, LTR, DNA element) (FIG. 18). When the tally was restricted by tag size, for gap lengths of 30 to 59 bp the total number of tags was about 420,000, of which 250,000 were repeat-derived and 165,845 (39.8%) were non-repeat tags.
[0221] 2. Prediction of DGS resolution by Monte Carlo simulation
Monte Carlo simulation, a technique based on random number generation, was used to predict in silico how many tags must actually be analyzed by DGS in order to detect a genomic copy number abnormality of a given size. FIG. 20 shows the random number generation algorithm used to simulate the targets of DGS analysis: gene amplification, gene deletion (homozygous deletion), and loss of heterozygosity (LOH).
[0222] With the number of virtual tags set to the 165,845 mentioned above, the simulation showed that 13,800 tags must be analyzed to detect a 5-fold gene amplification at 1 Mbp resolution, 44,000 tags to detect a gene deletion at 1 Mbp resolution, and 495,000 tags to detect LOH at 1 Mbp resolution (FIG. 22). Conversely, with the number of analyzed tags set to 10,000, a 5-fold amplification was shown to be detectable at 1.34 Mbp resolution and a gene deletion at 3.79 Mbp resolution.
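The algorithm of FIG. 20 is not reproduced in the text, so the following Python sketch shows only a generic Monte Carlo model of the same question, under assumptions that are not from the specification: every virtual tag is equally likely to be sampled per genome copy, one window has its copy number multiplied by `fold` (5 for a 5-fold amplification, 0 for a homozygous deletion), and each analyzed tag is an independent draw. The function names are hypothetical, and the detection criterion is deliberately left to the reader.

```python
import random


def window_probability(n_vtags: int, window: int, fold: float) -> float:
    """Chance that one sampled tag falls in the altered window.

    The window holds `window` virtual tags whose copy number is scaled by
    `fold`; the remaining n_vtags - window tags keep a relative weight of 1.
    """
    weighted = fold * window
    return weighted / (n_vtags - window + weighted)


def simulate_window_counts(n_vtags: int, window: int, fold: float,
                           n_tags: int, trials: int, seed: int = 0) -> list[int]:
    """Simulate `trials` experiments; each returns the tag count in the window."""
    rng = random.Random(seed)
    p = window_probability(n_vtags, window, fold)
    return [sum(rng.random() < p for _ in range(n_tags)) for _ in range(trials)]


if __name__ == "__main__":
    # 165,845 virtual tags and 13,800 analyzed tags are taken from the text;
    # the window of 55 virtual tags is an assumed stand-in for about 1 Mbp.
    normal = simulate_window_counts(165_845, 55, 1, 13_800, trials=20)
    amplified = simulate_window_counts(165_845, 55, 5, 13_800, trials=20, seed=1)
    print("normal   :", normal)
    print("amplified:", amplified)
```

Comparing the two count distributions over many trials gives an empirical estimate of how often an altered window can be told apart from a normal one for a given number of analyzed tags.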
[0223] 3. In vitro DGS experiments using genomic DNA from a gastric cancer cell line
3.1 Extraction of tag DNA and preparation of concatemers
FIG. 41 is an electropherogram for explaining the purification of tags from the HSC45 genome and the preparation of concatemers. A preliminary DGS experiment was performed using human genomic DNA extracted from the gastric cancer cell line HSC45. HSC45 genomic DNA was digested with the restriction enzyme MboI (FIG. 41A), and the resulting short tags were ligated to produce concatemers, which were then cloned.
[0224] Initially, at most one tag was introduced into each vector. Dividing the ligation into two stages and raising the tag concentration improved the tag extension efficiency, yielding concatemers of about 3 tags on average (FIG. 41B).
[0225] Because still longer concatemers were considered necessary to realize DGS, a technique of re-extending concatemers was devised (FIG. 33). The procedure is as follows.
[0226] 1. Vectors containing the once-cloned concatemers are purified as a primary library.
2. Concatemers are excised from the primary library vectors with PstI/SpeI (FIG. 41C), and the long concatemers are ligated to one another again (re-extension of concatemers).
3. Clones are obtained from the resulting secondary library.
[0227] This dramatically increased the average concatemer length, making it possible to obtain about 7 tags per vector on average (FIG. 41D).
[0228] 3.2 Large-scale tag analysis
One of the secondary libraries described above (FIG. 41D, #5A) was selected, and 823 clones were recovered and sequenced. The resulting sequence data were run through an automatic analysis program, and the tag sequences flanked by MboI sites were extracted (tag sequences obtained in this way from HSC45 genomic DNA are hereinafter called raw tags).
[0229] FIG. 42 is a graph showing the tag size distribution and the repeat/unique classification. As a result, 5593 raw tags were obtained from the 823 clones. The size distribution of the raw tags is shown in FIG. 42C; the longest gap length was 118 bp, the shortest gap length was 0 bp, and the mean was 23.8 bp.
[0230] 3.3 Creation of the MboI virtual tag database
To determine the genomic position of each raw tag obtained and to accumulate tag counts, a virtual tag database (hereinafter VT-DB) was created. The size of the MboI virtual tags listed in the VT-DB was set to gap lengths of 12 to 122 bp. As a result, 1,859,942 MboI virtual tags (Ch1 to Ch22, X, Y) were listed in the VT-DB.
[0231] In addition to the ID, chromosome number, position on the chromosome, and sequence of each V-tag, the VT-DB records whether the tag is repeat-derived and whether it is unique. A tag is defined as unique if it can be located at only one place in the genome, that is, if no other V-tag has the same sequence.
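A VT-DB record and the uniqueness definition above can be sketched in Python as follows. The field names and the in-memory list are assumptions made for illustration; the specification does not describe the storage format. Matching against the opposite strand is not handled in this sketch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class VTag:
    """One virtual tag record with the fields listed in the text (names assumed)."""
    tag_id: str
    chromosome: str
    position: int
    sequence: str
    is_repeat: bool


def unique_flags(vtags: list[VTag]) -> dict[str, bool]:
    """Mark a V-tag as unique when no other V-tag shares its sequence."""
    counts = Counter(t.sequence for t in vtags)
    return {t.tag_id: counts[t.sequence] == 1 for t in vtags}


if __name__ == "__main__":
    db = [
        VTag("v1", "Ch1", 100, "GATCACGGTGATC", False),
        VTag("v2", "Ch2", 500, "GATCTTTTGATC", True),
        VTag("v3", "Ch5", 900, "GATCTTTTGATC", True),
    ]
    print(unique_flags(db))
```

Keeping the repeat flag and the unique flag as separate fields matches the text, which treats the two properties independently.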
[0232] In DGS, a sequence that matches multiple locations in the genome, that is, a non-unique raw tag, cannot be located and must be discarded as an invalid vote. All V-tags in the VT-DB were therefore also classified as unique or non-unique (FIG. 42B). In view of the sizes of the raw tags obtained, the repeat/non-repeat analysis of V-tags was also extended to gap lengths below 20 bp (FIG. 42A).
[0233] FIG. 43 shows the correspondence between repeat tags and unique tags in the virtual tag database. Of the virtual tags in the VT-DB, 63.41% were repeat sequences, whereas unique sequences accounted for an unexpectedly high 89.37% (FIG. 43A). This shows that 83.71% of the tags classified as repeat sequences in the genome annotation are nevertheless unique, suggesting that most tags are not wasted.
[0234] On the other hand, virtual tags with a gap length of 12 bp were abnormally numerous at 120,000; 90% of them were repeats (FIG. 42A) and 80% were non-unique (FIG. 42B), so most of them proved to be invalid tags (FIG. 43B).
[0235] 3.4 Classification of raw tags
FIG. 44 shows the breakdown of the raw tags obtained from HSC45. The raw tag sequences were matched against the completed VT-DB, and perfect matches (the full length of the raw tag sequence matches a V-tag 100%) were extracted. Of all 5593 raw tags, 3133 (56.02%) matched unique V-tags and 1540 (27.53%) matched non-unique V-tags; the remaining 920 were classified as stray tags #1 (FIG. 44).
[0236] FIG. 42C shows the result of classifying the raw tags by size into unique, non-unique, and stray. To allow for sequencing errors, mismatches of 1 bp or 2 bp were permitted when matching against the VT-DB, and 319 of the 920 stray tags #1 could then be matched to V-tags. The remaining 601 tags were named stray tags #3.
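The counts reported in the two paragraphs above can be cross-checked with a few lines of Python. All numbers are taken from the text; only the variable names are introduced here.

```python
# Raw tag breakdown reported for HSC45.
counts = {"unique": 3133, "non_unique": 1540, "stray_1": 920}

total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

# Stray tags #1 rescued by allowing 1 or 2 mismatches, and the remainder.
rescued = 319
stray_3 = counts["stray_1"] - rescued

if __name__ == "__main__":
    print(total, shares, stray_3)
```

The three classes add up to the 5593 raw tags obtained, and the rounded shares of the unique and non-unique classes agree with the percentages quoted in the text.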
[0237] 3. 5 タグ密度の解析  [0237] 3.5 Analysis of tag density
図 45は、ウィンドウサイズを設定して算出したタグ密度を示すグラフである。得られ た完全マッチ生タグ 3133個を VT—DBに照合し、各 V— tagへの投票数の計算を 行った。その後、当該領域のタグ密度 (tag density)を算出した。なお、タグ密度 = 当該領域のユニーク生タグの投票数 Z当該領域のユニーク V— tag数である。  FIG. 45 is a graph showing the tag density calculated by setting the window size. The obtained 3133 perfect match raw tags were checked against VT-DB, and the number of votes for each V-tag was calculated. Thereafter, the tag density of the region was calculated. Tag density = number of unique raw tag votes in the area Z number of unique V—tags in the area.
[0238] 密度を計算する領域 (以下、ウィンドウ)のサイズは、ユニーク V— tag数によって規 定した。概算で 554 V— tagsが IMbp genome相当となる。これに従い、 2Mbp〜 lOMbpにウィンドウサイズを設定してタグ密度を算出した(図 6)。  [0238] The size of the area for calculating the density (hereinafter referred to as the window) was determined by the number of unique V—tags. Roughly 554 V—tags are equivalent to IMbp genome. According to this, the tag density was calculated by setting the window size from 2Mbp to lOMbp (Fig. 6).
[0239] For the 5 Mbp window size, results are shown both for the usual calculation, in which adjacent windows do not overlap, and for a calculation in which adjacent windows overlap by half a window.
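The windowing described above can be sketched as follows, assuming that the votes are held as a list ordered by the chromosomal position of the unique V-tags. The conversion factor of 554 unique V-tags per Mbp comes from the text; the function name, its parameters, and the handling of the final partial window are assumptions of this sketch.

```python
def window_densities(votes, window_mbp, overlap_half=False, vtags_per_mbp=554):
    """Tag density per window along one chromosome.

    votes: vote counts of the unique V-tags, ordered by position (assumed).
    window_mbp: window size in Mbp, converted to a number of unique V-tags.
    overlap_half: if True, consecutive windows are shifted by half a window.
    """
    size = max(1, int(window_mbp * vtags_per_mbp))  # window size in V-tags
    step = max(1, size // 2) if overlap_half else size
    densities = []
    for start in range(0, len(votes), step):
        chunk = votes[start:start + size]
        # The last window may be shorter than `size`; it is kept as is.
        densities.append(sum(chunk) / len(chunk))
    return densities
```

With the default factor, a 5 Mbp window spans 2,770 unique V-tags, and the half-overlap variant advances by 1,385 V-tags at a time.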
[0240] 3.6 Analysis of amplified regions
FIG. 46 shows graphs and a physical map of regions with abnormal tag density. Graphs are shown for Ch8 and Ch18, which contain sites whose tag density is clearly higher than in the surrounding regions (FIG. 46A). Regions that appear to be amplified were observed at the end of Ch8 and at the start of Ch18.
[0241] For Ch8, the tag density of the region in question was calculated with a window size of 10 kbp (FIG. 46B). When the genes located in the corresponding region were displayed as RefSeq genes, the oncogene myc mapped nearby.
[0242] 4. Summary of Example 1
In this example, toward establishing a basis for the digital genome scanning (DGS) method: 1) a database of MboI virtual tags was created by in silico analysis of human genome information, and the properties of the tags were fully characterized; 2) a target for the number of tags that must be analyzed for DGS was set by Monte Carlo simulation; and 3) the efficiency of raw tag acquisition was greatly improved by re-extending the concatemers.
[0243] In addition, DGS was applied to genome analysis of a gastric cancer cell line: 4) approximately 3,000 valid raw tags were obtained; and 5) tag density analysis identified regions that appear to be abnormally amplified.
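The in silico construction of a virtual tag database mentioned in item 1) can be pictured with a short Python sketch. It searches one strand of a sequence for the MboI recognition site GATC, takes the fragments between consecutive sites, keeps those within a length range, and separates sequences that occur once from those that occur more than once. The length bounds, the inclusion of both flanking GATC sites in each fragment, and the single-strand treatment are assumptions of this sketch, not details taken from the specification.

```python
from collections import Counter

SITE = "GATC"  # recognition sequence of MboI


def virtual_tags(genome, min_len, max_len):
    """Return (unique, nonunique) virtual tags of a sequence.

    unique:    dict mapping a tag sequence that occurs once to its start position.
    nonunique: set of tag sequences that occur more than once.
    """
    # Locate every GATC site on the given strand.
    sites = []
    pos = genome.find(SITE)
    while pos != -1:
        sites.append(pos)
        pos = genome.find(SITE, pos + 1)

    # Fragments between consecutive sites, filtered by length (assumed bounds).
    fragments = []
    for a, b in zip(sites, sites[1:]):
        seq = genome[a:b + len(SITE)]  # includes both flanking sites (assumption)
        if min_len <= len(seq) <= max_len:
            fragments.append((a, seq))

    counts = Counter(seq for _, seq in fragments)
    unique = {seq: start for start, seq in fragments if counts[seq] == 1}
    nonunique = {seq for seq, n in counts.items() if n > 1}
    return unique, nonunique
```

Keeping the start position only for unique tags reflects the fact that a non-unique tag cannot be assigned to a single genomic location.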
[0244] <Example 2>
FIG. 47 shows a breakdown, by size, of the numbers of MboI raw tags obtained when DGS analysis was performed on genomic DNA from a gastric cancer cell line. In this example, DGS analysis was performed using genomic DNA of a gastric cancer cell line as the sample. As a result, 9,866 raw tags were recovered by MboI restriction enzyme treatment, of which 5,515 were classified as unique tags and mapped to the genome.
[0245] Next, the raw tags of FIG. 47 were matched against the virtual tags, and the tag density was calculated.
FIG. 48 is a genome-wide tag density graph obtained when DGS analysis was performed on genomic DNA from a gastric cancer cell line. FIG. 48 gives an overview of the tag density of the whole genome, chromosome by chromosome. As FIG. 48 shows, abnormal increases in tag density were observed at two locations, on chromosome 8 and on chromosome 12.
[0246] Closer examination of the amplified region on chromosome 8 showed that the amplification occurs within a 1 Mbp range at 8q24.21. FIG. 49 shows the genomic amplification at 8q24.21 on the long arm of chromosome 8. The left panel shows the tag density of chromosome 8, and the right panel shows the tag map as displayed by the DGS server. In each screen, the top row shows the positions of unique virtual tags, the second row the positions of non-unique virtual tags, the third row the positions of the raw tags obtained, and the bottom row the positions of genes. FIG. 49 shows that the c-myc oncogene lies within the amplified region (circled region).
[0247] Next, the genomic amplification of c-myc was verified by molecular biological methods. FIG. 50 shows the relationship between genomic amplification of c-myc and mRNA overexpression. As shown on the left of FIG. 50, Southern blotting confirmed genomic amplification of the c-myc region. Furthermore, as shown at the upper right of FIG. 50, real-time PCR for genome quantification confirmed genomic amplification (10- to 15-fold) of the c-myc region in the gastric cancer cell line analyzed, and one other gastric cancer cell line was found to show amplification at the same site. In addition, as shown at the lower right of FIG. 50, real-time RT-PCR confirmed that c-myc mRNA expression was also elevated (9-fold relative to the control cell line), in correlation with the degree of genomic amplification.
[0248] Meanwhile, closer examination of the amplified regions on chromosome 12 showed that, of the two genomic amplification regions on the short arm of chromosome 12, one lies at 12p12.1. FIG. 51 shows the genomic amplification at 12p12.1 on the short arm of chromosome 12. FIG. 51 shows the tag density of chromosome 12 (top and middle) and the genes located at that site (bottom). FIG. 51 shows that the K-ras oncogene (circled region) lies within the amplified region.
[0249] FIG. 52 is a tag map showing the distribution of raw tags in a 3 Mbp region centered on the K-ras gene. The figure shows that raw tags were obtained in a concentrated manner only in the genomic region containing the K-ras gene (circled region).
[0250] Next, the size of the amplified genomic region at 12p12.1 was determined. FIG. 53 shows the genomic amplification of the K-ras region. Here, the region in which abnormal amplification occurs was determined by real-time PCR for genome quantification. As shown in FIG. 53, in the gastric cancer cell line subjected to DGS analysis, amplification (7-fold) was confirmed in a 0.5 Mbp region containing the K-ras region. Genomic amplification of the region containing K-ras was also observed in three other gastric cancer cell lines.
[0251] Next, the genomic amplification of K-ras was verified by molecular biological methods. FIG. 54 shows the relationship between genomic amplification of K-ras and overexpression of its mRNA and protein. As shown on the left of FIG. 54, Southern blotting confirmed genomic amplification of the K-ras region. As shown at the upper right of FIG. 54, analysis of K-ras mRNA expression by real-time RT-PCR showed an approximately 10-fold increase. Two other genes located near K-ras and contained in the amplified region (LRMP and LOC144363) also showed elevated mRNA expression, as did K-ras. In contrast, expression of a nearby gene located outside the amplified region (BCAT1) was unchanged. Furthermore, as shown at the lower right of FIG. 54, Western blot analysis of K-ras protein expression confirmed overexpression of K-ras protein in all four cell lines in which genomic amplification was detected.
[0252] From the above, DGS detected genomic amplification in the regions of two oncogenes, c-myc and K-ras, in the genome of a gastric cancer cell line. The smallest of these regions was 0.5 Mbp, which indicates that DGS was able to detect abnormalities at high resolution. Genomic amplification of these oncogenes is also thought to induce increased mRNA expression, and in turn increased protein expression, of the genes located at the same sites, which suggests the importance of genomic amplification in cancer cells.
[0253] FIG. 55 shows an overview of the DGS analysis system. The DGS analysis system used in Example 2 above was built using Ensembl as the DGS server, which holds all genome information and all tag information as a database. Tag density information is held on the client; for regions where a density abnormality is found, the client accesses the DGS server to retrieve the positions of tags and genes and to visualize them as a map.
[0254] The present invention has been described above on the basis of examples. Those skilled in the art will understand that these examples are merely illustrative, that various modifications are possible, and that such modifications also fall within the scope of the present invention.
[0255] For example, although the above examples use tag density as an indicator of genomic copy number, other indicators may be used, such as the number of raw tags corresponding to a given virtual tag. In that case, in a region where virtual tags with many corresponding raw tags occur consecutively, it can be judged that duplication of the genomic region is likely. Conversely, in a region where virtual tags with few corresponding raw tags occur consecutively, it can be judged that deletion of the genomic region is likely.
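One possible reading of this alternative indicator is sketched below in Python: consecutive virtual tags whose raw tag counts are all high, or all low, are grouped into runs, and sufficiently long runs are flagged. The thresholds `high` and `low`, the minimum run length `min_run`, and the labels "gain" and "loss" are hypothetical parameters introduced for this sketch; the specification does not give concrete values.

```python
def flag_runs(counts, high, low, min_run):
    """Flag runs of consecutive virtual tags with unusually many or few raw tags.

    counts: raw tag counts of the virtual tags, ordered by genomic position.
    Returns a list of (first_index, last_index, label) with label "gain"
    (possible duplication) or "loss" (possible deletion).
    """
    def label(c):
        if c >= high:
            return "gain"
        if c <= low:
            return "loss"
        return None

    runs = []
    start = 0
    for i in range(1, len(counts) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(counts) or label(counts[i]) != label(counts[start]):
            lab = label(counts[start])
            if lab is not None and i - start >= min_run:
                runs.append((start, i - 1, lab))
            start = i
    return runs
```

Requiring a minimum run length is a simple way to express the condition that the tags "occur consecutively", so that an isolated high or low count is not reported on its own.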
Industrial Applicability
[0256] As described above, the DNA sequence analyzer according to the present invention can reliably identify copy number abnormalities in genomic DNA sequences at high resolution, and is useful as a DNA sequence analyzer, a DNA sequence analysis method, a program, and the like.

Claims

[1] A DNA sequence analyzer comprising:
a control tag data acquisition unit that acquires control tag data in which a plurality of control tags are each associated with a corresponding location in a control genomic DNA sequence, the control tags being obtained by cleaving the control genomic DNA sequence with a restriction enzyme, each occurring in the control genomic DNA sequence no more than a predetermined number of times, and each consisting of a DNA sequence having a number of bases within a predetermined range;
an analysis target tag data acquisition unit that acquires analysis target tag data, which is a set of a plurality of analysis target tags obtained by cleaving an analysis target genomic DNA sequence with the restriction enzyme and each consisting of a DNA sequence having a number of bases within a predetermined range;
a corresponding tag data generation unit that compares the control tag data with the analysis target tag data and generates corresponding tag data in which corresponding tags among the control tags and the analysis target tags are associated with each other;
a copy number determination unit that analyzes the corresponding tag data, determines the number of analysis target tags corresponding to a control tag, and, based on that number, determines a difference in copy number, relative to the control genomic DNA sequence, of a region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag; and
an output unit that outputs data processed by the copy number determination unit.
[2] The DNA sequence analyzer according to claim 1, wherein
the copy number determination unit includes a tag density determination unit that determines a tag density obtained by dividing the total number of analysis target tags in a predetermined region of the analysis target genomic DNA sequence by the total number of control tags in the region of the control genomic DNA sequence corresponding to the predetermined region.
[3] The DNA sequence analyzer according to claim 1, wherein
the corresponding tag data generation unit is configured to associate the tags with each other with a predetermined contribution when an analysis target tag corresponds to only one of the control tags, and to associate the tags with each other with a contribution different from the predetermined contribution when the analysis target tag corresponds to two or more of the control tags.
[4] The DNA sequence analyzer according to claim 1, wherein
the corresponding tag data generation unit is configured to associate, among the control tags and the analysis target tags, tags that match completely with a predetermined contribution, and to associate tags that differ in part with a contribution different from the predetermined contribution.
[5] The DNA sequence analyzer according to claim 1, wherein
the copy number determination unit includes a duplication determination unit that analyzes the corresponding tag data and, when the number of analysis target tags corresponding to a control tag is equal to or greater than a predetermined number, determines that duplication has occurred in a region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.
[6] The DNA sequence analyzer according to claim 1, wherein
the copy number determination unit includes a deletion determination unit that analyzes the corresponding tag data and, when the number of analysis target tags corresponding to a control tag is equal to or less than a predetermined number, determines that deletion has occurred in a region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.
[7] The DNA sequence analyzer according to claim 1, further comprising:
a separate genomic DNA sequence data search unit that connects to, and searches, separate genomic DNA sequence data derived from an origin different from that of the control genomic DNA sequence; and
an origin determination unit that analyzes the corresponding tag data and, when no control tag corresponding to an analysis target tag exists, compares the analysis target tag with the separate genomic DNA sequence data to determine the origin of the analysis target tag.
[8] The DNA sequence analyzer according to claim 1, further comprising
a control tag data generation unit that generates the control tag data from the control genomic DNA sequence, wherein
the control tag data generation unit includes:
a control genomic DNA sequence acquisition unit that acquires the control genomic DNA sequence;
a cleavage site search unit that searches the control genomic DNA sequence for sites cleaved by the restriction enzyme;
a control tag selection unit that selects, from among a plurality of control tags formed by cleaving the control genomic DNA sequence at the cleavage sites, control tags that include a DNA sequence having a number of bases within a predetermined range and occurring in the control genomic DNA sequence no more than a predetermined number of times; and
an association unit that generates the control tag data by associating the selected control tags with the corresponding locations in the control genomic DNA sequence, and
the control tag data acquisition unit is configured to acquire the control tag data from the control tag data generation unit.
[9] The DNA sequence analyzer according to claim 8, wherein
the restriction enzyme is a restriction enzyme that recognizes and cleaves the four-base sequence GATC.
[10] A DNA sequence analysis method comprising the steps of:
acquiring control tag data in which a plurality of control tags are each associated with a corresponding location in a control genomic DNA sequence, the control tags being obtained by cleaving the control genomic DNA sequence with a restriction enzyme, each occurring in the control genomic DNA sequence no more than a predetermined number of times, and each consisting of a DNA sequence having a number of bases within a predetermined range;
acquiring analysis target tag data, which is a set of a plurality of analysis target tags obtained by cleaving an analysis target genomic DNA sequence with the restriction enzyme and each consisting of a DNA sequence having a number of bases within a predetermined range;
comparing the control tag data with the analysis target tag data and generating corresponding tag data in which corresponding tags among the control tags and the analysis target tags are associated with each other;
analyzing the corresponding tag data, determining the number of analysis target tags corresponding to a control tag, and, based on that number, determining a difference in copy number, relative to the control genomic DNA sequence, of a region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag; and
outputting data processed by the copy number determination unit.
[11] The DNA sequence analysis method according to claim 10, wherein
the step of determining the difference in copy number includes a step of determining a tag density obtained by dividing the total number of analysis target tags in a predetermined region of the analysis target genomic DNA sequence by the total number of control tags in the region of the control genomic DNA sequence corresponding to the predetermined region.
[12] The DNA sequence analysis method according to claim 10, wherein
the step of generating the corresponding tag data includes a step of associating the tags with each other with a predetermined contribution when an analysis target tag corresponds to only one of the control tags, and associating the tags with each other with a contribution different from the predetermined contribution when the analysis target tag corresponds to two or more of the control tags.
[13] The DNA sequence analysis method according to claim 10, wherein
the step of generating the corresponding tag data includes a step of associating, among the control tags and the analysis target tags, tags that match completely with a predetermined contribution, and associating tags that differ in part with a contribution different from the predetermined contribution.
[14] The DNA sequence analysis method according to claim 10, wherein
the step of determining the copy number includes a step of analyzing the corresponding tag data and, when the number of analysis target tags corresponding to a control tag is equal to or greater than a predetermined number, determining that duplication has occurred in a region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.
[15] The DNA sequence analysis method according to claim 10, wherein
the step of determining the copy number includes a step of analyzing the corresponding tag data and, when the number of analysis target tags corresponding to a control tag is equal to or less than a predetermined number, determining that deletion has occurred in a region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.
[16] The DNA sequence analysis method according to claim 10, further comprising
a step of analyzing the corresponding tag data and, when no control tag corresponding to an analysis target tag exists, comparing the analysis target tag with a separate genomic DNA sequence derived from an origin different from that of the control genomic DNA sequence to determine the origin of the analysis target tag.
[17] The DNA sequence analysis method according to claim 10, further comprising
a step of generating the control tag data from the control genomic DNA sequence, wherein
the step of generating the control tag data includes the steps of:
acquiring the control genomic DNA sequence;
searching the control genomic DNA sequence for sites cleaved by the restriction enzyme;
selecting, from among a plurality of control tags formed by cleaving the control genomic DNA sequence at the cleavage sites, control tags that include a DNA sequence having a number of bases within a predetermined range and occurring in the control genomic DNA sequence no more than a predetermined number of times; and
generating the control tag data by associating the selected control tags with the corresponding locations in the control genomic DNA sequence.
[18] The DNA sequence analysis method according to claim 17, wherein
the restriction enzyme is a restriction enzyme that recognizes and cleaves the four-base sequence GATC.
[19] The DNA sequence analysis method according to claim 10, further comprising
a step of generating the analysis target tag data from the analysis target DNA sequence, wherein
the step of generating the analysis target tag data includes the steps of:
cleaving a DNA molecule containing the analysis target DNA sequence with the restriction enzyme;
extracting, from among a plurality of DNA fragments formed by cleaving the DNA molecule with the restriction enzyme, DNA fragments having a number of bases within a predetermined range; and
sequencing the DNA sequences of the extracted DNA fragments to generate the analysis target tag data.
[20] The DNA sequence analysis method according to claim 19, wherein
the step of generating the analysis target tag data further includes a step of generating a concatemer by linking a plurality of the DNA fragments that have undergone the extracting step, and
the sequencing step includes a step of sequencing the DNA sequence of the concatemer.
[21] The DNA sequence analysis method according to claim 20, wherein
the step of generating the concatemer includes the steps of:
ligating the concatemer formed by linking the plurality of DNA fragments into a vector to produce a concatemer-containing vector;
introducing the concatemer-containing vector into E. coli to transform the E. coli, and amplifying the concatemer-containing vector by culturing the E. coli; and
extracting the concatemer-containing vector from the cultured E. coli.
[22] The DNA sequence analysis method according to claim 20, wherein
the step of generating the analysis target tag data further includes a step of generating a secondary concatemer by linking a plurality of the concatemers, and
the sequencing step includes a step of sequencing the DNA sequence of the secondary concatemer.
[23] The DNA sequence analysis method according to claim 22, wherein
the step of generating the secondary concatemer includes the steps of:
excising and extracting the concatemers from the concatemer-containing vectors;
ligating a secondary concatemer formed by linking a plurality of kinds of the concatemers into a vector to produce a secondary-concatemer-containing vector;
introducing the secondary-concatemer-containing vector into E. coli to transform the E. coli, and amplifying the secondary-concatemer-containing vector by culturing the E. coli; and
extracting the secondary-concatemer-containing vector from the cultured E. coli.
[24] A program for causing a computer to execute:
a step of acquiring control tag data in which a plurality of control tags are each associated with the corresponding location in a control genomic DNA sequence, the control tags being obtained by cleaving the control genomic DNA sequence with a restriction enzyme, each occurring in the control genomic DNA sequence no more than a predetermined number of times, and each consisting of a DNA sequence whose number of bases lies within a predetermined range;
a step of acquiring analysis target tag data, which is a set of a plurality of analysis target tags obtained by cleaving an analysis target genomic DNA sequence with the restriction enzyme, each consisting of a DNA sequence whose number of bases lies within a predetermined range;
a step of comparing the control tag data with the analysis target tag data and generating corresponding tag data in which mutually corresponding tags among the control tags and the analysis target tags are associated with each other;
a step of analyzing the corresponding tag data, determining the number of analysis target tags that correspond to each control tag, and, based on that number, determining the difference in copy number, relative to the control genomic DNA sequence, of the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag; and
a step of outputting the data processed by the copy number determination unit.
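The flow of claim 24 can be illustrated with a minimal Python sketch. It is not the applicants' implementation: the function names, the dictionary layout of the control tag data, the example sequences, and the assumption that one observed tag per control tag is the expected baseline are all hypothetical.

```python
from collections import Counter


def build_correspondence(control_tags, target_tags):
    """Count, for each control tag, how many analysis target tags match it exactly.

    control_tags: dict mapping tag sequence -> genomic location (hypothetical layout)
    target_tags:  iterable of tag sequences obtained from the analysis target genome
    """
    counts = Counter(tag for tag in target_tags if tag in control_tags)
    # Control tags that were never observed get an explicit count of zero.
    return {tag: counts.get(tag, 0) for tag in control_tags}


def copy_number_differences(correspondence, expected=1):
    """Observed count minus the expected count (assumed baseline) per control tag."""
    return {tag: n - expected for tag, n in correspondence.items()}


# Hypothetical example data.
control = {"GATCAAAT": ("chr1", 100), "GATCCCGT": ("chr1", 560)}
target = ["GATCAAAT", "GATCAAAT", "GATCAAAT", "GATCTTTT"]

corr = build_correspondence(control, target)
diff = copy_number_differences(corr)
print(diff)
```

A positive value suggests a gain relative to the control and a negative value a loss. In practice the expected count would depend on how many tags were sequenced, which this sketch does not model.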
[25] The program according to claim 24, wherein
the step of determining the difference in copy number includes a step of determining a tag density obtained by dividing the total number of analysis target tags in a predetermined region of the analysis target genomic DNA sequence by the total number of control tags in the region of the control genomic DNA sequence that corresponds to the predetermined region.
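The tag density of claim 25 is a simple ratio, sketched below in Python. The position lists, the half-open region convention, and the choice to return `None` when the control has no tags in the region are assumptions made for illustration.

```python
def tag_density(target_positions, control_positions, start, end):
    """Tag density for the half-open region [start, end).

    target_positions:  one entry per observed analysis target tag, giving the
                       control-genome position it was matched to (assumed input)
    control_positions: one entry per control tag position
    """
    n_target = sum(start <= p < end for p in target_positions)
    n_control = sum(start <= p < end for p in control_positions)
    if n_control == 0:
        return None  # density is undefined where the control has no tags
    return n_target / n_control


# Hypothetical positions: three target tags and three control tags fall in [0, 100).
density = tag_density([10, 10, 20, 150], [10, 20, 30, 150], 0, 100)
print(density)
```

Comparing the density of a region against the genome-wide density is one way to read it as a relative copy number, although the claim itself only defines the ratio.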
[26] The program according to claim 24, wherein
the step of generating the corresponding tag data includes a step of associating the tags with each other with a predetermined contribution when an analysis target tag corresponds to only one of the control tags, and associating the tags with each other with a contribution different from the predetermined contribution when the analysis target tag corresponds to two or more of the control tags.
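One way to realize the differing contributions of claim 26 is sketched below. The claim does not state how the contribution changes for multiply matching tags; splitting the contribution equally (1/k for k matching control locations) is an assumption of this example, as are the function name and the data layout.

```python
from collections import defaultdict


def weighted_correspondence(control_index, target_tags, unique_weight=1.0):
    """Accumulate a contribution score per control location.

    control_index: dict mapping tag sequence -> list of control locations
                   (a sequence may occur at more than one location)
    target_tags:   iterable of tag sequences from the analysis target genome
    """
    scores = defaultdict(float)
    for tag in target_tags:
        positions = control_index.get(tag, [])
        if not positions:
            continue  # no corresponding control tag
        # One match: full contribution. k matches: split equally (assumed rule).
        weight = unique_weight / len(positions)
        for pos in positions:
            scores[pos] += weight
    return dict(scores)


# Hypothetical data: "CCCC" occurs at two control locations.
index = {"AAAA": [100], "CCCC": [200, 300]}
scores = weighted_correspondence(index, ["AAAA", "CCCC", "CCCC", "GGGG"])
print(scores)
```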
[27] The program according to claim 24, wherein
the step of generating the corresponding tag data includes a step of associating, among the control tags and the analysis target tags, tags that match exactly with a predetermined contribution, and tags that differ in part with a contribution different from the predetermined contribution.
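The distinction in claim 27 between exact and partial matches could be scored as follows. The use of Hamming distance, the tolerance of one mismatch, and the weights 1.0 and 0.5 are assumptions chosen for illustration. The claim only requires that the two contributions differ.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))


def match_contribution(control_tag, target_tag,
                       exact=1.0, partial=0.5, max_mismatches=1):
    """Contribution of a target tag to a control tag (assumed scoring rule)."""
    if len(control_tag) != len(target_tag):
        return 0.0  # tags of different length are treated as unrelated here
    distance = hamming(control_tag, target_tag)
    if distance == 0:
        return exact
    if distance <= max_mismatches:
        return partial
    return 0.0


print(match_contribution("GATCAA", "GATCAA"))  # exact match
print(match_contribution("GATCAA", "GATCAT"))  # one mismatch
```

Allowing partial matches keeps tags that carry a sequencing error or a single-base variant from being discarded, at the cost of some ambiguity.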
[28] The program according to claim 24, wherein
the step of determining the copy number includes a step of analyzing the corresponding tag data and, when the number of analysis target tags corresponding to a control tag is equal to or greater than a predetermined number, determining that a duplication has occurred in the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.
[29] The program according to claim 24, wherein
the step of determining the copy number includes a step of analyzing the corresponding tag data and, when the number of analysis target tags corresponding to a control tag is equal to or less than a predetermined number, determining that a deletion has occurred in the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.
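The determinations in claims 28 and 29 amount to two threshold comparisons, sketched below. The claims leave the "predetermined numbers" unspecified, so the threshold values, the function name, and the returned labels are hypothetical.

```python
def classify_copy_number(count, gain_threshold, loss_threshold):
    """Classify a control tag's region from its matched tag count.

    count >= gain_threshold -> "duplication" (claim 28)
    count <= loss_threshold -> "deletion"    (claim 29)
    """
    if loss_threshold >= gain_threshold:
        raise ValueError("loss_threshold must be below gain_threshold")
    if count >= gain_threshold:
        return "duplication"
    if count <= loss_threshold:
        return "deletion"
    return "no change"


# Hypothetical thresholds: 4 or more tags is a gain, 0 tags is a loss.
for n in (5, 0, 2):
    print(n, classify_copy_number(n, gain_threshold=4, loss_threshold=0))
```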
[30] The program according to claim 24, further comprising
a step of analyzing the corresponding tag data and, when no control tag corresponding to an analysis target tag exists, comparing the analysis target tag with another genomic DNA sequence derived from an origin different from that of the control genomic DNA sequence, thereby determining the origin of the analysis target tag.
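The origin search of claim 30 can be sketched as a lookup of unmatched tags in a set of other genomes. Exact substring search on both strands is an assumption of this example, and the genome name `virus_X` and all sequences are invented.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_origin(tag, other_genomes):
    """Names of the genomes that contain the tag on either strand."""
    rc = reverse_complement(tag)
    return [name for name, seq in other_genomes.items()
            if tag in seq or rc in seq]


def assign_origins(target_tags, control_tags, other_genomes):
    """Look up only those target tags that have no corresponding control tag."""
    return {tag: find_origin(tag, other_genomes)
            for tag in target_tags if tag not in control_tags}


# Hypothetical data.
control = {"GATCAAAT"}
others = {"virus_X": "TTTGATCGGGCATT"}
origins = assign_origins(["GATCAAAT", "GATCGGGC", "GCCCGATC", "GATCTTTT"],
                         control, others)
print(origins)
```

An empty list means the tag was found in none of the supplied genomes, so its origin stays undetermined.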
[31] The program according to claim 24, further comprising
a step of generating the control tag data from the control genomic DNA sequence, wherein
the step of generating the control tag data includes:
a step of acquiring the control genomic DNA sequence;
a step of searching the control genomic DNA sequence for cleavage sites of the restriction enzyme;
a step of selecting, from among a plurality of control tags formed by cleaving the control genomic DNA sequence at the cleavage sites, control tags that consist of a number of bases within a predetermined range and contain a DNA sequence occurring in the control genomic DNA sequence no more than a predetermined number of times; and
a step of associating the selected control tags with the corresponding locations in the control genomic DNA sequence to generate the control tag data.
[32] The program according to claim 31, wherein
the restriction enzyme is a restriction enzyme that recognizes and cleaves the four-base sequence GATC.
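The control tag generation of claims 31 and 32 can be sketched as an in-silico digest. This is an illustration under stated assumptions, not the claimed implementation: the cut is assumed to fall immediately before the G of each GATC site, only one strand is scanned, a tag's copy number is counted among the resulting fragments rather than over the whole genome, and the length limits, names, and example genome are hypothetical.

```python
from collections import defaultdict


def find_sites(genome, site="GATC"):
    """Return every 0-based start index of the recognition site."""
    hits = []
    i = genome.find(site)
    while i != -1:
        hits.append(i)
        i = genome.find(site, i + 1)
    return hits


def build_control_tags(genome, site="GATC", min_len=6, max_len=12, max_copies=1):
    """Map each selected tag sequence to the list of positions where it starts.

    A tag is the fragment running from one site to the next. It is kept only
    if its length lies in [min_len, max_len] and it occurs at most max_copies
    times among the fragments.
    """
    cuts = find_sites(genome, site)
    by_seq = defaultdict(list)
    for start, end in zip(cuts, cuts[1:]):
        fragment = genome[start:end]
        if min_len <= len(fragment) <= max_len:
            by_seq[fragment].append(start)
    return {seq: pos for seq, pos in by_seq.items() if len(pos) <= max_copies}


# Hypothetical genome with four GATC sites.
genome = "AAGATCTTTTTTGATCCCGATCTTTTTTGATCAA"
sites = find_sites(genome)
unique_tags = build_control_tags(genome)
print(sites)
print(unique_tags)
```

With `max_copies=1` the fragment that occurs twice is discarded, which mirrors the requirement that a control tag occur no more than a predetermined number of times.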
PCT/JP2006/306012 2005-04-08 2006-03-24 Dna sequence analyzer and method and program for analyzing dna sequence WO2006109535A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2005112428A JP2008161056A (en) 2005-04-08 2005-04-08 Dna sequence analyzer and method and program for analyzing dna sequence
JP2005-112428 2005-04-08

Publications (1)

Publication Number Publication Date
WO2006109535A1 true WO2006109535A1 (en) 2006-10-19

Family

ID=37086820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/306012 WO2006109535A1 (en) 2005-04-08 2006-03-24 Dna sequence analyzer and method and program for analyzing dna sequence

Country Status (2)

Country Link
JP (1) JP2008161056A (en)
WO (1) WO2006109535A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120191356A1 (en) * 2011-01-21 2012-07-26 International Business Machines Corporation Assembly Error Detection
US10453557B2 (en) * 2011-09-30 2019-10-22 Life Technologies Corporation Methods and systems for visualizing and evaluating data
EP2602733A3 (en) * 2011-12-08 2013-08-14 Koninklijke Philips Electronics N.V. Biological cell assessment using whole genome sequence and oncological therapy planning using same
TWI793586B (en) 2015-08-12 2023-02-21 香港中文大學 Single-molecule sequencing of plasma dna

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1995009929A1 (en) * 1993-10-06 1995-04-13 The Regents Of The University Of California Detection of amplified or deleted chromosomal regions
WO2004058945A2 (en) * 2002-12-23 2004-07-15 Agilent Technologies, Inc. Comparative genomic hybridization assays using immobilized oligonucleotide features and compositions for practicing the same
EP1647911A2 (en) * 2004-10-12 2006-04-19 Agilent Technologies, Inc. Systems and methods for statistically analyzing apparent CGH Data Anomalies

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
PINKEL D. ET AL.: "Array comparative genomic hybridization and its applications in cancer", NAT. GENET. SUPPL., vol. 37, June 2005 (2005-06-01), pages S11-17, XP003006461 *
SPEICHER M.R. ET AL.: "The new cytogenetics: blurring the boundaries with molecular biology", NATURE REVIEWS GENETICS, vol. 6, no. 10, October 2005 (2005-10-01), pages 782 - 792, XP003006460 *
YLSTRA B. ET AL.: "BAC to the future! or oligonucleotides: a perspective for micro array comparative genomic hybridization (array CGH)", NUCLEIC ACIDS RESEARCH, vol. 34, no. 2, 26 January 2006 (2006-01-26), pages 445 - 450, XP003006462 *

Also Published As

Publication number Publication date
JP2008161056A (en) 2008-07-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 06729960

Country of ref document: EP

Kind code of ref document: A1