WO2006109535A1

WO2006109535A1 - DNA sequence analyzer and method and program for analyzing DNA sequence

Info

Publication number: WO2006109535A1
Application number: PCT/JP2006/306012
Authority: WO
Inventors: Hiroaki Mita; Takashi Tokino; Kohzoh Imai
Original assignee: Sapporo Medical University
Priority date: 2005-04-08
Filing date: 2006-03-24
Publication date: 2006-10-19
Also published as: JP2008161056A

Abstract

It is intended to provide a technique whereby an abnormality in the copy number of a genomic DNA sequence can be identified with high resolution and high reliability. Based on control tag data and analyte tag data, data associating the control tag data with the analyte tag data is compiled. Using this correspondence data, the analyte tags corresponding to individual regions on the genome are counted to detect chromosomal abnormalities (amplification, deletion, etc.) in the individual regions on the genome. The data relating to amplification, deletion, and so on in the individual regions on the genome can be highly useful in identifying, for example, a disease gene.

Description

Specification

DNA sequence analyzer, DNA sequence analysis method and program

Technical field

[0001] The present invention relates to a DNA sequence analyzing apparatus, a DNA sequence analyzing method, and a program.

Background art

[0002] Gene copy number abnormalities in somatic cells and germ cells lead to serious abnormalities at the cellular and individual level. In humans, chromosomal changes accompanied by deletion of tumor suppressor genes or amplification of oncogenes are characteristic of cancer cells. In addition, changes in the copy number of a whole chromosome or of a specific chromosomal region cause many developmental disorders such as Down's syndrome.

[0003] Meanwhile, in early 2001 the draft sequence of the human genome, about 3 billion base pairs, was released, and researchers in bioinformatics became able to use this “treasure mountain” freely. The human genome sequence is a treasure trove of extremely valuable information on human medicine, development, physiology, and evolution, and technical fields that use this human genome information are greatly expected to create new industries that develop rapidly in the first half of the 21st century.

[0004] Methods for analyzing quantitative changes in intracellular genetic information, that is, abnormalities in the copy number of chromosomes and specific regions, include comparative genomic hybridization (CGH), representational difference analysis (RDA), and classical cytogenetic techniques.

[0005] In addition, conventional techniques for analyzing gene copy number abnormalities include, for example, Tian-Li Wang et al., Digital karyotyping, Proceedings of the National Academy of Sciences of the United States of America, December 10, 2002, vol. 99, no. 25, pages 16156-16161 (Non-Patent Document 1). In the analysis method described in that document, virtual tags obtained by cleaving a control genomic DNA sequence with a restriction enzyme are compared with raw tags obtained by cleaving the genomic DNA sequence to be analyzed with the restriction enzyme, and changes in the genomic DNA sequence to be analyzed relative to the control genomic DNA sequence are thereby analyzed.

[0006] However, the prior art described in the above literature has room for improvement in the following points. First, the above-mentioned CGH method, RDA method, and classical cytogenetic methods using metaphase chromosomes have a resolution limit of about 20 Mb, and are difficult to use for analyzing copy number changes confined to smaller regions. Although the resolution of the CGH method has recently been improving with the shift to microarray-based methods, the number of sequences that can be analyzed is limited and special equipment is required. To overcome these problems, a method that can identify genomic DNA sequence copy number abnormalities with high resolution is desired.

[0007] Secondly, in analysis methods based on DNA hybridization, represented by the CGH method, human Cot-1 DNA is used beforehand so that repetitive sequences are removed from the sample by molecular biological techniques. The analysis method described in Non-Patent Document 1 likewise excludes tags derived from repetitive sequences from the analysis. This means that information about regions containing repetitive sequences, which occupy about 45% of the genome, is discarded. Therefore, these analysis methods have room for further improvement in terms of the reliability of the analysis results.

Disclosure of the invention

[0008] The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a technique capable of reliably identifying an abnormal copy number of a genomic DNA sequence with high resolution.

[0009] According to the present invention, there is provided a DNA sequence analyzing apparatus including: a control tag data acquisition unit that acquires control tag data in which a plurality of control tags, obtained by cleaving a control genomic DNA sequence with a restriction enzyme, each occurring in the control genomic DNA sequence no more than a predetermined number of times and each consisting of a DNA sequence having a number of bases in a predetermined range, are associated with corresponding positions in the control genomic DNA sequence; an analysis target tag data acquisition unit that acquires analysis target tag data, which is a set of a plurality of analysis target tags obtained by cleaving a genomic DNA sequence to be analyzed with this restriction enzyme, each consisting of a DNA sequence having a number of bases in a predetermined range; a corresponding tag data generation unit that compares the control tag data with the analysis target tag data and generates corresponding tag data in which corresponding tags among the control tags and the analysis target tags are associated with each other; a copy number determination unit that analyzes the corresponding tag data, determines the number of analysis target tags corresponding to a control tag, and, based on this number, determines the copy number difference relative to the control genomic DNA sequence in the region of the genomic DNA sequence to be analyzed that includes the portion corresponding to the control tag; and an output unit that outputs the data processed by the copy number determination unit.

[0010] According to this configuration, analysis target tags, which are short fragments obtained by treating the genomic DNA sequence to be analyzed with a restriction enzyme, are counted as representatives of the genome and compared with control tags, which are virtual tags derived from the control genomic DNA sequence. Copy numbers can thereby be quantified comprehensively across the human genome, and on this basis regions exhibiting copy number abnormalities in the genomic DNA sequence to be analyzed can be identified with high resolution.

[0011] In addition, because this configuration uses control tags that each occur in the control genomic DNA sequence no more than a predetermined number of times, even a tag containing a repetitive sequence can be suitably used, provided that it is highly unique within the control genomic DNA sequence and its number of occurrences is equal to or less than the predetermined number. Therefore, according to this configuration, information on regions containing repetitive sequences, which occupy a certain proportion of the control genomic DNA sequence, can be used, and the reliability of the analysis results is improved.

[0012] That is, according to the present invention, an abnormal copy number of a genomic DNA sequence can be reliably identified with high resolution.

[0013] It should be noted that the above-described DNA sequence analyzing apparatus is one aspect of the present invention, and that the DNA sequence analyzing method, the DNA sequence analyzing system, the DNA sequence analyzing program, a recording medium containing the program, and the like of the present invention have the same configuration and are likewise aspects of the present invention.

Brief Description of Drawings

[0014] FIG. 1 is a conceptual diagram for explaining the principle of digital genome scanning and the outline of chromosome abnormality analysis using virtual tags.

FIG. 2 is a functional block diagram showing the overall configuration of the DNA sequence analysis system 1000.

FIG. 3 is a functional block diagram showing the overall configuration of a DNA sequence analysis system 2000, which is a modification of the DNA sequence analysis system 1000.

FIG. 4 is a functional block diagram showing the internal configuration of the DNA sequence analyzer 100.

FIG. 5 is a functional block diagram showing an internal configuration of a DNA sequence analyzer 400 which is a modification of the DNA sequence analyzer 100.

FIG. 6 is a flowchart for explaining the operation of the DNA sequence analysis system 1000.

FIG. 7 is a functional block diagram showing an internal configuration of a corresponding tag data generation unit 210.

FIG. 8 is a flowchart for explaining the operation of the corresponding tag data generation unit 210.

FIG. 9 is a functional block diagram showing the internal configuration of the copy number determination unit 214.

FIG. 10 is a flowchart for explaining the operation of the copy number determination unit 214.

FIG. 11 is a conceptual diagram for explaining a data visualization image in units of virtual tags.

FIG. 12 is a conceptual diagram for explaining a data visualization image in units of virtual tags.

FIG. 13 is a functional block diagram for explaining an internal configuration of a control tag data generation device 200.

FIG. 14 is a diagram showing the number of tags generated from the genome by each restriction enzyme.

FIG. 15 is a graph showing the number of tags by size generated by MboI.

FIG. 16 is a diagram showing the number of tags when a width is given to the tag size.

FIG. 17 is a graph showing the MboI (^GATC) tag distribution.

FIG. 18 is a diagram showing the number of effective virtual tags generated by MboI.

FIG. 19 is a conceptual diagram showing an image of DGS Monte Carlo simulation.

FIG. 20 is a diagram for explaining the details of the DGS Monte Carlo simulation.

FIG. 21 is a screen display diagram showing a user interface for DGS Monte Carlo simulation.

FIG. 22 is a diagram summarizing the DGS simulation results in the form of anomaly detection resolution for 165 and 845 MboI non-repeat virtual tags and the number of required analysis tags.

FIG. 23 is a conceptual diagram for explaining the difference between a tag derived from a both-ends repeat and a tag derived from a one-end repeat.

FIG. 24 is a graph showing, for the purpose of explaining the review of the MboI virtual tags, the size distribution of tags embedded in a repeat region (X both) and tags spanning a repeat region and a non-repeat region (X either).

FIG. 25 is a diagram for explaining the verification of uniqueness in repeat tags derived from repeat sequences.

FIG. 26 is a diagram for examining whether or not the tag cutout size should be shifted longer.

FIG. 27 is a diagram for explaining a case where an MboI virtual tag (one-end repeat) spanning a repeat region and a non-repeat region is regarded as an effective tag.

FIG. 28 is a flowchart for explaining the operation of the control tag data generating apparatus 200.

FIG. 29 is a functional block diagram for explaining the internal configuration of the analysis target tag data generation device 300.

FIG. 30 is a diagram for explaining extraction of tag DNA and production of concatamers.

FIG. 31 is a graph showing the results of analyzing and counting the base sequences of the tags used in the preliminary experiment.

FIG. 32 is a conceptual diagram for explaining an operation flow of the analysis target tag data generation device 300.

FIG. 33 is a conceptual diagram for explaining re-extension of concatamers.

FIG. 34 is a restriction enzyme map for explaining a method of grasping the structure of concatemers.

FIG. 35 is a sequence map of a DNA sequence for explaining a method of grasping the structure of a concatemer.

FIG. 36 is a sequence map obtained by removing the vector sequence from the sequence map of FIG. 35.

FIG. 37 is a sequence map for explaining the state of tag extraction from the sequence map of FIG.

FIG. 38 is a flowchart for explaining the operation of the analysis target tag data generating apparatus 300.

FIG. 39 is a conceptual diagram for explaining the flow of automatic tag analysis.

FIG. 40 is a conceptual diagram for explaining a flow of classifying tags.

FIG. 41 is an electropherogram for explaining the purification of tags from the HSC45 genome and the production of concatemers.

FIG. 42 is a graph showing the tag size distribution and the repeat/unique classification.

FIG. 43 is a diagram showing the correspondence between repeat tags and unique tags in the virtual tag database.

FIG. 44 is a diagram showing a breakdown of raw tags acquired from HSC45.

FIG. 45 is a graph showing the tag density calculated by setting the window size.

FIG. 46 is a graph and a physical map showing a region showing an abnormal tag density.

FIG. 47 is a diagram showing a breakdown of the sizes and numbers of MboI raw tags obtained when DGS analysis was performed using genomic DNA of gastric cancer cell lines.

FIG. 48 is a genome-wide tag density graph obtained when DGS analysis was performed using genomic DNA of a gastric cancer cell line.

FIG. 49 is a diagram showing genome amplification of 8q24.21 on the long arm of chromosome 8.

FIG. 50 shows the relationship between c-myc genomic amplification and mRNA overexpression.

FIG. 51 is a diagram showing genome amplification of 12p12.1 on the short arm of chromosome 12.

FIG. 52 is a tag map showing the distribution of raw tags in a 3 Mbp region centered on the K-ras gene.

FIG. 53 is a diagram showing genome amplification of a K-ras region.

FIG. 54 shows the relationship between K-ras genomic amplification and mRNA and protein overexpression.

FIG. 55 is a diagram showing an outline of a DGS analysis system.

BEST MODE FOR CARRYING OUT THE INVENTION

[0015] Hereinafter, embodiments of the present invention will be described with reference to the drawings. In all the drawings, the same components are denoted by the same reference numerals, and the description thereof is omitted as appropriate.

[0016] <Glossary>

Genome scanning: Genome scanning is a method for comprehensive analysis of genetic information on the genome. Methods include arbitrarily primed PCR (AP-PCR) and restriction landmark genomic scanning (RLGS), which are effective for identifying gene amplifications and deletions in cancer. However, existing methods are limited to analyzing about 1% of the entire genome.

[0017] Comparative genomic hybridization: CGH (comparative genomic hybridization) is mainly used to detect amplified or deleted regions in the chromosome of tumor cells. This is a competitive fluorescence in situ hybridization method.

[0018] SAGE (serial analysis of gene expression): SAGE is a method for simultaneously detecting the expression of a large number of transcripts (mRNAs).

[0019] Concatemer: A concatemer is a group of DNA fragments linked in series by DNA ligase (ligation enzyme) or the like.

[0020] <Outline of the Invention>

FIG. 1 is a conceptual diagram for explaining the principle of digital genome scanning and the outline of chromosome abnormality analysis using virtual tags. Note that this is only an overview; details will be described later.

[0021] First, as shown in the upper right of the figure, in order to create control tag data consisting of virtual tags, control genomic DNA sequence data serving as the genome information of a predetermined species such as a human is prepared. Next, an algorithm for tag extraction is created. With this algorithm, the control genomic DNA sequence data is cut at the cleavage sites of a predetermined restriction enzyme, and control tags, which are virtual tags, are created and stored in a database.
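The tag extraction step described above can be sketched in Python as follows. This is a minimal illustration rather than the algorithm of the invention: the function name `virtual_digest`, the `VirtualTag` record, and the default size range of 20 to 130 bases are assumptions introduced here, and the cut is modeled as occurring immediately 5' of each GATC site, as for MboI.

```python
import re
from typing import List, NamedTuple


class VirtualTag(NamedTuple):
    chrom: str   # name of the control sequence the tag comes from
    start: int   # 0-based start of the fragment in the control sequence
    end: int     # exclusive end of the fragment
    seq: str     # fragment sequence (begins with the recognition site)


def virtual_digest(chrom: str, genome: str, site: str = "GATC",
                   min_len: int = 20, max_len: int = 130) -> List[VirtualTag]:
    """Cut `genome` in silico immediately 5' of every `site` occurrence.

    Only fragments bounded by two sites and whose length lies within
    [min_len, max_len] are kept. The size range is a hypothetical default.
    """
    # A zero-width lookahead finds the start index of every site occurrence.
    cuts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", genome)]
    tags = []
    for a, b in zip(cuts, cuts[1:]):
        if min_len <= b - a <= max_len:
            tags.append(VirtualTag(chrom, a, b, genome[a:b]))
    return tags


if __name__ == "__main__":
    toy = "AAGATCTTTTGATCCCGATCAA"
    for tag in virtual_digest("chrToy", toy, min_len=6, max_len=10):
        print(tag)
```

Each record keeps the fragment's coordinates, so the resulting table already links every virtual tag to its position in the control sequence.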

[0022] Then, the database of control tags, which are virtual tags, is linked to the sequence text of the genome of the predetermined species such as a human (the sequence text of the control genomic DNA sequence). Furthermore, the control tag data is obtained by linking the database of control tags, via the sequence text, to the position information on the genome of the predetermined species such as a human (the position information of the control genomic DNA sequence).

[0023] On the other hand, as shown in the upper left of the figure, genomic DNA molecules (genomic DNA molecules to be analyzed) are prepared by DNA extraction from cells containing genomic DNA molecules of a predetermined species such as a human. Next, the plurality of genomic DNA molecules constituting the chromosomes are cleaved with a predetermined restriction enzyme, and the resulting plurality of tags to be analyzed are ligated to create a plurality of concatemers.

[0024] After these concatemers are ligated into a vector, the DNA sequences of the plurality of concatemers are determined by sequencing reactions. Furthermore, the determined DNA sequences of the concatemers are converted into the DNA sequences of the individual tags to obtain the analysis target tag data.

[0025] Next, as shown in the lower center of the figure, correspondence data between the control tag data and the analysis target tag data is generated based on the control tag data and the analysis target tag data obtained as described above. Based on this correspondence data, the number of tags to be analyzed corresponding to each region on the genome is determined. As a result, the presence of chromosomal abnormalities such as amplification and deletion in each region on the genome is detected. Information such as amplification and deletion in each region on the genome is used for identification of disease genes.

[0026] The DNA sequence analysis system 1000 includes a DNA sequence analyzer 100 that acquires control tag data and analysis target tag data and analyzes changes in the genomic DNA sequence. The DNA sequence analysis system 1000 includes a control tag data generation device 200 that generates control tag data. Furthermore, the DNA sequence analysis system 1000 includes an analysis target tag data generation device 300 that generates analysis target tag data.

[0027] The DNA sequence analysis system 1000 includes a separate genomic DNA sequence database 120 that stores genomic DNA sequence data of a biological species different from that of the control tag data and the analysis target tag data. The DNA sequence analysis system 1000 includes an operation unit 102 for operating the DNA sequence analysis apparatus 100.

[0028] The DNA sequence analysis system 1000 includes an image display device 104 that displays data output from the DNA sequence analysis device 100 as an image. The DNA sequence analysis system 1000 includes a printer 106 that prints data output from the DNA sequence analysis apparatus 100. Further, the DNA sequence analysis system 1000 includes a PC (personal computer) 108 that receives data output from the DNA sequence analyzer 100.

FIG. 3 is a functional block diagram showing the overall configuration of a DNA sequence analysis system 2000 that is a modification of the DNA sequence analysis system 1000.

[0030] The DNA sequence analysis system 2000 has basically the same configuration as the DNA sequence analysis system 1000 in FIG. 2, but differs in that the control genome DNA sequence data acquisition unit 202 (FIG. 4) is included in the DNA sequence analysis device 400. It also differs in that the DNA sequence analyzer 400 is connected to the control genomic DNA sequence database 500.

[0031] Hereinafter, the present embodiment will be described in the following order.

1. DNA sequence analysis using control tag data and target tag data

2. Generation of control tag data

3. Generation of analysis target tag data

Here, “1.” is an explanation of the DNA sequence analyzer 100 of FIG.

“2.” is an explanation of the control tag data generation device 200 of FIG. 2, which generates the data (data to be input to the DNA sequence analyzer 100) on which “1.” above is based.

“3.” is an explanation of the analysis target tag data generation device 300 of FIG. 2, which generates the data (data to be input to the DNA sequence analyzer 100) on which “1.” above is based.

[0032] <1. DNA sequence analysis using control tag data and target tag data>

The DNA sequence analyzer 100 includes a control tag data acquisition unit 202 that acquires control tag data input from the control tag data generator 200. The control tag data is data in which a plurality of control tags, obtained by cleaving the control genomic DNA sequence with a restriction enzyme, are associated with corresponding positions in the control genomic DNA sequence. Each of the plurality of control tags consists of a DNA sequence that occurs in the control genomic DNA sequence no more than a predetermined number of times and that has a number of bases in a predetermined range. The DNA sequence analyzer 100 further includes a control tag data storage unit 206 that stores the control tag data acquired by the control tag data acquisition unit 202.

[0033] On the other hand, the DNA sequence analysis apparatus 100 includes an analysis target tag data acquisition unit 204 that acquires analysis target tag data input from the analysis target tag data generation apparatus 300. The analysis target tag data is data on a set of a plurality of analysis target tags obtained by cleaving the analysis target genomic DNA sequence with a restriction enzyme. Each of the plurality of analysis target tags consists of a DNA sequence having a number of bases in a predetermined range. The DNA sequence analyzer 100 further includes an analysis target tag data storage unit 208 that stores the analysis target tag data acquired by the analysis target tag data acquisition unit 204.

[0034] The DNA sequence analyzer 100 includes a corresponding tag data generation unit 210 that generates corresponding tag data in which the control tag data and the analysis target tag data are associated with each other. The corresponding tag data generation unit 210 acquires the control tag data from the control tag data storage unit 206, acquires the analysis target tag data from the analysis target tag data storage unit 208, compares the control tag data with the analysis target tag data, and generates corresponding tag data in which corresponding tags among the control tags and the analysis target tags are associated with each other. The DNA sequence analyzing apparatus 100 further includes a corresponding tag data storage unit 212 that stores the corresponding tag data generated by the corresponding tag data generation unit 210.

[0035] The DNA sequence analyzer 100 includes a copy number determination unit 214 that analyzes the corresponding tag data and determines a copy number difference between the analysis target genomic DNA sequence and the control genomic DNA sequence. The copy number determination unit 214 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212 to determine the number of tags to be analyzed corresponding to a control tag, and, based on this number, determines the copy number difference of the genomic DNA sequence to be analyzed relative to the control genomic DNA sequence in the region of the control genomic DNA sequence that contains the control tag. The DNA sequence analyzing apparatus 100 further includes a copy number determination result storage unit 216 that stores the copy number determination result obtained by the copy number determination unit 214.

[0036] The DNA sequence analyzer 100 includes a separate genomic DNA data search unit 224 that searches for genomic DNA sequence data of a biological species different from that of the control tag data and the analysis target tag data. That is, the separate genomic DNA data search unit 224 connects to the separate genomic DNA sequence database 120 (FIG. 2), which holds sequences derived from a source different from the control genomic DNA sequence, and searches for separate genomic DNA sequence data.

[0037] The DNA sequence analyzing apparatus 100 includes an origin determination unit 226 that determines the origin of each tag to be analyzed that does not correspond to any control tag. That is, the origin determination unit 226 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212 and determines whether there is a control tag corresponding to the analysis target tag. When there is no control tag corresponding to the analysis target tag, the origin determination unit 226 compares the analysis target tag with the separate genomic DNA sequence data acquired from the separate genomic DNA data search unit 224 to determine the origin of the tag to be analyzed. In addition, the DNA sequence analyzer 100 includes an origin determination result storage unit 228 that stores the determination result obtained by the origin determination unit 226.

[0038] The DNA sequence analyzing apparatus 100 includes an image data generation unit 220 that generates image data based on the copy number determination result or the origin determination result. That is, the image data generation unit 220 acquires the copy number determination result from the copy number determination result storage unit 216, acquires the origin determination result from the origin determination result storage unit 228, and, based on these results, generates image data for displaying, for each region of the genomic DNA sequence, the copy number difference relative to the control genomic DNA sequence and the presence of heterologous DNA sequences as images that are easy for the user to understand. The DNA sequence analyzing apparatus 100 further includes an image data storage unit 222 that stores the image data generated by the image data generation unit 220.

[0039] FIG. 5 is a functional block diagram showing the internal configuration of a DNA sequence analyzer 400 that is a modification of the DNA sequence analyzer 100. The configuration of the DNA sequence analyzer 400 is basically the same as the configuration of the DNA sequence analyzer 100 of FIG. 4, except that a control tag data generation unit 402 is provided inside.

[0040] The control tag data generation unit 402 acquires the control genomic DNA sequence data input from the control genomic DNA sequence database 500 (Fig. 3), and generates control tag data. The detailed mechanism by which the control tag data generation unit 402 generates control tag data will be described later.

[0041] The configuration of the DNA sequence analyzer 400 also differs in that it includes a control tag data storage unit 404 that stores the control tag data generated by the control tag data generation unit 402. Therefore, in the DNA sequence analyzer 400, the control tag data acquisition unit 202 acquires control tag data from the control tag data storage unit 404 inside the device, not from outside the device.

[0042] FIG. 6 is a flowchart for explaining the operation of the DNA sequence analysis system 1000. First, when the series of steps is started, the control tag data generation device 200 shown in FIG. 2 cuts the control genomic DNA sequence data of a predetermined species such as a human at the restriction enzyme cleavage sites, thereby generating control tag data (S102).

[0043] At this time, as will be described later, control tags with a high degree of uniqueness, that is, tags that occur in the control genomic DNA sequence no more than a predetermined number of times (for example, once), can be extracted so that control tag data consisting only of highly unique control tags is generated. It is also possible to extract only control tags consisting of a DNA sequence with a number of bases in a predetermined range, and to generate control tag data consisting only of control tags whose lengths fall within the predetermined range. Furthermore, the control tag data obtained can be configured so that each tag is associated with its corresponding location in the control genomic DNA sequence.

[0044] Next, in the DNA sequence analyzer 100, the control tag data acquisition unit 202 acquires the control tag data from the control tag data generation device 200 (S106). Further, the control tag data acquisition unit 202 stores the acquired control tag data in the control tag data storage unit 206.
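The uniqueness and length filtering described in [0043] can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `build_control_tag_index`, the tuple layout `(chromosome, position, sequence)`, and the default limits are hypothetical, and for simplicity occurrences are counted on the forward strand only.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Position = Tuple[str, int]  # (chromosome name, start coordinate)


def build_control_tag_index(
    tags: Iterable[Tuple[str, int, str]],
    max_occurrences: int = 1,   # "predetermined number" (assumed default)
    min_len: int = 20,          # assumed lower bound of the base range
    max_len: int = 130,         # assumed upper bound of the base range
) -> Dict[str, List[Position]]:
    """Keep tags of acceptable length that occur at most `max_occurrences`
    times, and map each kept sequence to its genomic position(s)."""
    in_range = [(c, p, s) for c, p, s in tags if min_len <= len(s) <= max_len]
    # Count how many times each tag sequence appears in the control genome.
    counts = Counter(seq for _, _, seq in in_range)
    index: Dict[str, List[Position]] = {}
    for chrom, pos, seq in in_range:
        if counts[seq] <= max_occurrences:
            index.setdefault(seq, []).append((chrom, pos))
    return index


if __name__ == "__main__":
    toy_tags = [
        ("chr1", 0, "GATCAAAA"),
        ("chr1", 50, "GATCTTTT"),
        ("chr2", 7, "GATCAAAA"),   # second copy of the same sequence
        ("chr2", 90, "GATC"),      # too short
    ]
    print(build_control_tag_index(toy_tags, min_len=6, max_len=10))
```

Raising `max_occurrences` above 1 retains tags that occur a small number of times, which is how tags from repeat-containing regions could still be admitted as long as they remain sufficiently unique.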

[0045] On the other hand, the analysis target tag data generation device 300 shown in FIG. 2 generates a plurality of concatemers by ligating a plurality of DNA fragments obtained by treating genomic DNA molecules to be analyzed, from a predetermined biological species such as a human, with a restriction enzyme, and sequences the plurality of concatemers, thereby generating analysis target tag data including a plurality of analysis target tags (S104).

[0046] At this time, as described later, a plurality of concatemers may be generated by ligating a plurality of DNA fragments, and a plurality of secondary concatemers may then be generated by ligating a plurality of concatemers. This is because generating secondary concatemers can improve the efficiency of sequencing.

[0047] Next, in the DNA sequence analyzer 100, the analysis target tag data acquisition unit 204 acquires the analysis target tag data from the analysis target tag data generation device 300 (S108). Further, the analysis target tag data acquisition unit 204 stores the acquired analysis target tag data in the analysis target tag data storage unit 208.

[0048] Then, the corresponding tag data generation unit 210 acquires the control tag data from the control tag data storage unit 206, acquires the analysis target tag data from the analysis target tag data storage unit 208, compares the control tag data with the analysis target tag data, and generates corresponding tag data by associating corresponding tags among the control tags and the analysis target tags (S110). Further, the corresponding tag data generation unit 210 stores the generated corresponding tag data in the corresponding tag data storage unit 212.

[0049] Subsequently, the copy number determination unit 214 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212, determines the number of analysis target tags corresponding to each control tag, and, based on this number, determines the difference in copy number relative to the control genomic DNA sequence in the region of the genomic DNA sequence to be analyzed that includes the portion corresponding to the control tag (S114). Further, the copy number determination unit 214 stores the copy number determination result in the copy number determination result storage unit 216.
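One way to turn the per-tag counts of step S114 into a regional copy number estimate is to compare the tag density of each genomic window with the genome-wide average. The sketch below illustrates that idea only; this passage does not specify the decision rule, so the windowing, the normalization by the genome-wide mean, and the names `window_copy_ratio`, `control_pos`, and `matched_counts` are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple

Window = Tuple[str, int]  # (chromosome name, window index)


def window_copy_ratio(
    control_pos: Dict[str, Tuple[str, int]],  # control tag -> (chrom, position)
    matched_counts: Dict[str, int],           # control tag -> matched analysis tags
    window: int = 1_000_000,                  # window size in bases (assumed)
) -> Dict[Window, float]:
    """Return observed / expected tag density for every window.

    A ratio near 1 suggests no change relative to the control; values well
    above or below 1 suggest amplification or deletion, respectively.
    """
    n_virtual: Dict[Window, int] = defaultdict(int)
    n_observed: Dict[Window, int] = defaultdict(int)
    for tag, (chrom, pos) in control_pos.items():
        key = (chrom, pos // window)
        n_virtual[key] += 1
        n_observed[key] += matched_counts.get(tag, 0)

    total_observed = sum(n_observed.values())
    if total_observed == 0:
        raise ValueError("no analysis target tags matched any control tag")
    # Average number of observed tags per control tag across the genome.
    mean_per_tag = total_observed / sum(n_virtual.values())
    return {k: n_observed[k] / (n_virtual[k] * mean_per_tag) for k in n_virtual}


if __name__ == "__main__":
    positions = {"t1": ("chr1", 100), "t2": ("chr1", 200),
                 "t3": ("chr1", 1500), "t4": ("chr1", 1700)}
    counts = {"t1": 1, "t2": 1, "t3": 3, "t4": 3}
    print(window_copy_ratio(positions, counts, window=1000))
```

Dividing by the number of control tags in each window keeps windows with few usable virtual tags from being mistaken for deletions.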

[0050] On the other hand, the origin determination unit 226 analyzes the corresponding tag data acquired from the corresponding tag data storage unit 212 and determines whether there is a control tag corresponding to each analysis target tag. If there is no control tag corresponding to the analysis target tag, the origin determination unit 226 compares the analysis target tag with the separate genomic DNA sequence data obtained from the separate genomic DNA data search unit 224 and determines the origin of the tag (S112). In addition, the origin determination unit 226 stores the origin determination result in the origin determination result storage unit 228.
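The origin determination of step S112 can be illustrated with a simple exact-match search. In this sketch the separate genomic DNA sequence data is assumed to be available as an in-memory dictionary of sequences; the names `determine_origin` and `other_genomes` are hypothetical, and a practical system would more likely use an indexed or alignment-based search.

```python
from typing import Dict, Iterable, Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def determine_origin(
    unmatched_tags: Iterable[str],
    other_genomes: Dict[str, str],  # source name -> sequence (assumed layout)
) -> Dict[str, Optional[str]]:
    """For each tag without a control tag, report the first source whose
    sequence contains the tag on either strand, or None if none does."""
    origins: Dict[str, Optional[str]] = {}
    for tag in unmatched_tags:
        rc = reverse_complement(tag)
        origins[tag] = next(
            (name for name, seq in other_genomes.items()
             if tag in seq or rc in seq),
            None,
        )
    return origins


if __name__ == "__main__":
    genomes = {"virusX": "TTTGATCAAACCCGGG"}  # toy sequence, not real data
    print(determine_origin(["GATCAAAC", "GTTTGATC", "GATCGGGG"], genomes))
```

Tags that map to `None` remain unexplained and could be reported separately from those attributed to a heterologous sequence.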

[0051] Then, the image data generation unit 220 acquires the copy number determination result from the copy number determination result storage unit 216, acquires the origin determination result from the origin determination result storage unit 228, and generates image data based on the copy number determination result and the origin determination result (S116). In addition, the image data generation unit 220 stores the generated image data in the image data storage unit 222.

[0052] Further, the output unit 218 acquires the copy number determination result from the copy number determination result storage unit 216, acquires the origin determination result from the origin determination result storage unit 228, and acquires the image data from the image data storage unit 222. It then outputs these to an image display device 104 (FIG. 2) outside the device (S118), which completes the series of flows.

[0053] Hereinafter, advantages of the DNA sequence analysis system 1000 according to the present embodiment will be described.

According to the DNA sequence analysis system 1000, short fragments (analysis target tags) obtained by treating the genomic DNA sequence to be analyzed with a restriction enzyme are counted as representatives of the genome, and the corresponding tag data generation unit 210 compares them with control tags, which are virtual tags derived from the control genomic DNA sequence, so that the copy number determination unit 214 can comprehensively quantify the copy number across the human genome. Based on this, a region exhibiting a copy number abnormality in the genomic DNA sequence to be analyzed can be identified with high resolution. As a result, genomic regions showing gene copy number abnormalities can be searched for and identified with high resolution, new causative genes of diseases located in those regions and their mechanisms of onset can be clarified, and the findings can be applied to treatment.

FIG. 7 is a functional block diagram showing the internal configuration of the corresponding tag data generation unit 210. The corresponding tag data generation unit 210 includes a reception unit 502 that receives input of the control tag data and the analysis target tag data from the control tag data storage unit 206 (FIG. 4) and the analysis target tag data storage unit 208 (FIG. 4).

[0055] The corresponding tag data generation unit 210 also includes a correspondence determination unit 504 that determines the correspondence between the control tag data and the analysis target tag data. The correspondence determination unit 504 obtains the control tag data and the analysis target tag data from the reception unit 502. When an analysis target tag corresponds to only one of the control tags, the tags are associated with each other with a predetermined contribution; when an analysis target tag corresponds to two or more of the control tags, the tags are associated with each other with a contribution (e.g., 0) different from the predetermined contribution. The contribution used by the correspondence determination unit 504 is set by the contribution setting unit 508 provided in the corresponding tag data generation unit 210.

The corresponding tag data generation unit 210 includes a coincidence degree determination unit 506 that determines the degree of coincidence between the control tag data and the analysis target tag data. The coincidence degree determination unit 506 acquires, from the correspondence determination unit 504, the control tag data and the analysis target tag data for which the contribution related to the correspondence has been set. Among the control tags and the analysis target tags, completely matching tags are associated with each other with a predetermined contribution (for example, 1), and partially different tags are associated with each other with a contribution (for example, 0) different from the predetermined contribution. The contribution used by the coincidence degree determination unit 506 is set by the contribution setting unit 508 provided in the corresponding tag data generation unit 210. Partially different tags may include, besides analysis target tags of the same length having a mismatch, analysis target tags with an insertion or deletion of one or two bases.

[0057] The corresponding tag data generation unit 210 includes a retry determination unit 510 that determines, for each of the plurality of analysis target tags included in the analysis target tag data, whether to retry its generation. For example, when the determination result of the coincidence degree determination unit 506 indicates that a control tag and an analysis target tag are partially different, and the number of differing bases is less than or equal to a predetermined number, the retry determination unit 510 can be configured to determine that the sequencing for generating the analysis target tag should be retried. In this way, the loss of valuable analysis target tags caused by slight sequencing errors at the level of a few bases can be suppressed, so the reliability of the obtained analysis results can be improved.
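
The retry rule described above can be sketched as follows. This is only an illustration under stated assumptions: tags are compared position by position (so the insertion and deletion cases are not handled), and the function names and the default threshold `max_diff=2` are hypothetical and do not appear in the specification.

```python
def count_mismatches(control_tag: str, analysis_tag: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    if len(control_tag) != len(analysis_tag):
        raise ValueError("this sketch only compares tags of equal length")
    return sum(a != b for a, b in zip(control_tag, analysis_tag))


def should_retry(control_tag: str, analysis_tag: str, max_diff: int = 2) -> bool:
    """Retry sequencing when the tags differ partially, by at most max_diff bases.

    max_diff stands in for the 'predetermined number' (hypothetical value).
    """
    diff = count_mismatches(control_tag, analysis_tag)
    return 0 < diff <= max_diff
```

An exact match needs no retry, and a tag that differs at many positions is treated as a different tag rather than as a read error.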

[0058] The corresponding tag data generation unit 210 includes an output unit 512 that outputs the data processed by the correspondence determination unit 504, the coincidence degree determination unit 506, and the retry determination unit 510 to the corresponding tag data storage unit 212 (FIG. 4).

FIG. 8 is a flowchart for explaining the operation of the corresponding tag data generation unit 210. When a series of flows starts, first, the correspondence determination unit 504 determines the correspondence between the control tag data and the analysis target tag data (S202). For example, the number of control tags corresponding to each analysis target tag is determined. If the number of control tags is 1, the contribution setting unit 508 sets the contribution to a (S206); if the number of control tags is 2 or more, the contribution setting unit 508 sets the contribution to b (S208); and if the number of control tags is 0, the process proceeds to step S112 to determine the origin (FIG. 6).

Next, the coincidence degree determination unit 506 determines the degree of coincidence between the control tag data and the analysis target tag data (S210). For example, the degree of coincidence between each analysis target tag and the corresponding control tag is determined. If they match exactly, the contribution setting unit 508 sets the contribution to c (S212); if they are partially different, the contribution setting unit 508 sets the contribution to d (S214).

[0061] Then, based on the determination result of the correspondence and the degree of coincidence, the retry determination unit 510 determines the necessity of retry (S214), and ends the series of flows.
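
The flow of FIG. 8 can be sketched as a small function. The specification does not state how the correspondence contribution (a or b) and the coincidence contribution (c or d) are combined, so multiplying them is an assumption made here for illustration. The default values a=1, b=0, c=1, d=0 follow the examples in the text where given and are otherwise hypothetical, as is the function name.

```python
from typing import Optional


def assign_contribution(n_control_matches: int, exact_match: bool,
                        a: float = 1.0, b: float = 0.0,
                        c: float = 1.0, d: float = 0.0) -> Optional[float]:
    """Return the contribution of one analysis target tag.

    None means no control tag corresponds, so the tag is passed on
    to origin determination (S112) instead of being counted.
    """
    if n_control_matches == 0:
        return None
    correspondence = a if n_control_matches == 1 else b  # S206 / S208
    coincidence = c if exact_match else d                # S212 / S214
    # Assumption: the two contributions are combined by multiplication.
    return correspondence * coincidence
```

With these defaults only a tag that matches exactly one control tag, and matches it exactly, contributes to the later totals.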

Hereinafter, the advantages of the corresponding tag data generation unit 210 in the present embodiment will be described. According to the corresponding tag data generation unit 210, which includes the correspondence determination unit 504, the coincidence degree determination unit 506, and the contribution setting unit 508, an appropriate contribution can be set according to the correspondence and the degree of coincidence between the control tag and the analysis target tag. As a result, in the copy number determination unit 214 described later, the contributions according to the correspondence and the degree of coincidence between the control tags and the analysis target tags are integrated for each region in the genomic DNA sequence, which makes it possible to reliably detect a region of the genomic DNA sequence to be analyzed whose copy number differs from that of the control genomic DNA sequence.

[0063] Further, according to the corresponding tag data generation unit 210, since the retry determination unit 510 is provided, sequencing can be retried for analysis target tags suspected of containing a read error at the time of sequencing or an SNP. This is useful for improving the reliability of the results obtained by the DNA sequence analysis system 1000.

FIG. 9 is a functional block diagram showing the internal configuration of the copy number determination unit 214. The copy number determination unit 214 includes a reception unit 602 that receives input of the corresponding tag data from the corresponding tag data storage unit 212 (Fig. 4). In addition, the copy number determination unit 214 includes a contribution totaling unit 604 that totals the contributions set in the corresponding tag data received by the reception unit 602.

In addition, the copy number determination unit 214 includes a duplication determination unit 606 that determines whether or not duplication has occurred in the genomic DNA sequence to be analyzed based on the contributions totaled by the contribution totaling unit 604. The duplication determination unit 606 determines the number of analysis target tags corresponding to the control tag based on the contributions totaled by the contribution totaling unit 604. If the number of analysis target tags corresponding to the control tag is greater than or equal to a predetermined number (for example, 3 or more), it determines that duplication has occurred in the region of the genomic DNA sequence to be analyzed that includes the portion corresponding to the control tag.

[0066] In addition, the copy number determination unit 214 includes a deletion determination unit 608 that determines whether or not a deletion has occurred in the genomic DNA sequence to be analyzed based on the contributions totaled by the contribution totaling unit 604. The deletion determination unit 608 determines the number of analysis target tags corresponding to the control tag based on the contributions totaled by the contribution totaling unit 604. If the number of analysis target tags corresponding to the control tag is less than or equal to a predetermined number (for example, 0.5 or less), it determines that a deletion has occurred in the region of the genomic DNA sequence to be analyzed that includes the portion corresponding to the control tag.

[0067] Further, the copy number determination unit 214 includes an output unit 610 that outputs the data obtained from the contribution totaling unit 604, the duplication determination unit 606, and the deletion determination unit 608 to the copy number determination result storage unit 216 (FIG. 4).

FIG. 10 is a flowchart for explaining the operation of the copy number determination unit 214. In the copy number determination unit 214, when a series of flows starts, first, the contribution totaling unit 604 analyzes the corresponding tag data received by the reception unit 602 and totals the contributions for each region of the control genomic DNA sequence (that is, the contributions set for each control tag in the control tag data) (S302).

[0069] Next, the duplication determination unit 606 determines, for each region of the control genomic DNA sequence, whether the total contribution is equal to or greater than a threshold (for example, 3 or more) (S304). If it is equal to or greater than the threshold, it is determined that there is duplication (S306). On the other hand, if it is less than the threshold value, the process proceeds to the next step 308.

[0070] In the next step, the deletion determination unit 608 determines, for each region of the control genomic DNA sequence, whether the total contribution is less than or equal to a threshold (e.g., 0.5 or less) (S308). If it is less than or equal to the threshold value, it is determined that there is a deletion (S310). On the other hand, if it is larger than the threshold, no determination is made.

Then, acquiring the above determination result, the output unit 610 outputs the determination result to the copy number determination result storage unit 216 (S312), and ends a series of flows.
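
The flow of FIG. 10 can be sketched as follows. The thresholds 3 and 0.5 are the example values given in the text; the data layout (a list of `(region, contribution)` pairs), the function names, and the result labels are assumptions made for illustration.

```python
from collections import defaultdict


def total_contributions(corresponding_tags):
    """Sum the contributions for each control-tag region (S302)."""
    totals = defaultdict(float)
    for region, contribution in corresponding_tags:
        totals[region] += contribution
    return dict(totals)


def classify_regions(totals, regions, dup_threshold=3.0, del_threshold=0.5):
    """Apply the duplication test (S304) and then the deletion test (S308)."""
    result = {}
    for region in regions:
        total = totals.get(region, 0.0)  # a region with no tags totals 0
        if total >= dup_threshold:
            result[region] = "duplication"
        elif total <= del_threshold:
            result[region] = "deletion"
        else:
            result[region] = "none"
    return result
```

Passing the full list of control-tag regions to `classify_regions` matters, because a region for which no analysis target tag was observed never appears in the totals and would otherwise be missed as a deletion candidate.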

Hereinafter, advantages of the copy number determination unit 214 in the present embodiment will be described.

According to the copy number determination unit 214, data in which an appropriate contribution has been set according to the correspondence and the degree of coincidence is obtained from the corresponding tag data generation unit 210, and the contributions according to the correspondence and the degree of coincidence between the control tags and the analysis target tags are integrated for each region in the genomic DNA sequence. This makes it possible to reliably detect a region of the genomic DNA sequence to be analyzed whose copy number differs from that of the control genomic DNA sequence. In addition, because the duplication determination unit 606 and the deletion determination unit 608 compare the totaled contributions with the upper and lower thresholds, the occurrence of duplication and deletion in the genomic DNA sequence can be reliably detected.

[0073] FIG. 11 is a conceptual diagram for explaining a data visualization image for each control tag (unit: virtual tag). The image data generation unit 220 acquires data related to the copy number determination result from the copy number determination result storage unit 216, and generates such image data. In this figure, for each square corresponding to a control tag, the tag density (corresponding to the aggregate value of the contributions) is calculated and displayed as a color according to the tag density, in a form that is easy for the user to understand. The display window can also be enlarged or reduced as necessary for the convenience of the user.

[0074] According to this image, whether duplication or deletion has occurred in the region of the genomic DNA corresponding to each square can be easily determined visually from the color of the square.

[0075] FIG. 12 is a conceptual diagram for explaining a data visualization image for each control tag (unit: virtual tag). The image data generation unit 220 may obtain data related to the copy number determination result from the copy number determination result storage unit 216 and generate such image data. In this figure, for each chromosome position corresponding to a control tag, the tag density (corresponding to the aggregate value of the contributions) is calculated and displayed as the height of filled cells according to the tag density, in a form that is easy for the user to understand. In addition, the display window can be enlarged or reduced as necessary for the convenience of the user, and there are buttons for switching between the individual chromosomes of the human genome.

Also according to this image, it is possible to easily visually determine whether or not there is duplication or deletion at each chromosome position due to the height of the filled cells. Such an excellent image generation function realizes a user interface that makes it easy to analyze a large amount of data and grasp the results.

[0077] <2. Generation of control tag data>

FIG. 13 is a functional block diagram for explaining the internal configuration of the control tag data generation apparatus 200. The control tag data generation device 200 (the control tag data generation unit 402 (FIG. 5) has the same configuration) includes a control genomic DNA sequence data acquisition unit 706 that acquires control genomic DNA sequence data. The control tag data generation device 200 also includes a control genome DNA sequence data storage unit 708 that stores the control genomic DNA sequence data acquired by the control genomic DNA sequence data acquisition unit 706.

[0078] The control tag data generation device 200 includes a cleavage site search unit 710 that acquires the control genomic DNA sequence data from the control genomic DNA sequence data storage unit 708, searches for cleavage sites of a predetermined restriction enzyme, and cuts the control genomic DNA sequence data at the cleavage sites found. In addition, the control tag data generation device 200 includes a cleaved DNA sequence storage unit 712 that stores the plurality of DNA sequences (control tags) obtained by the cutting performed by the cleavage site search unit 710.
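
The cleavage site search can be illustrated with a short sketch of in silico digestion. It assumes the recognition sequence GATC and returns, for each pair of adjacent sites, the sequence lying between them (excluding the recognition sequences themselves). The function names are hypothetical, and a real genome would be processed chromosome by chromosome rather than as one in-memory string.

```python
from typing import List, Tuple


def find_sites(sequence: str, site: str = "GATC") -> List[int]:
    """Return the start index of every occurrence of the recognition sequence."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def virtual_tags(sequence: str, site: str = "GATC") -> List[Tuple[int, str]]:
    """Return (start position, sequence) for the stretch between adjacent sites."""
    sites = find_sites(sequence, site)
    tags = []
    for left, right in zip(sites, sites[1:]):
        gap_start = left + len(site)
        if right >= gap_start:  # skip overlapping sites (possible for other enzymes)
            tags.append((gap_start, sequence[gap_start:right]))
    return tags
```

For example, `virtual_tags("AAGATCTTTTGATCCCGATCAA")` yields two tags, one of four bases and one of two bases, together with their start positions.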

[0079] The control tag data generation device 200 includes a control tag selection unit 714 that obtains, from the cleaved DNA sequence storage unit 712, the plurality of control tags obtained by cleaving the control genomic DNA sequence at the cleavage sites, and selects from among them control tags having a number of bases within a predetermined range and a uniqueness within a predetermined range. In addition, the control tag data generation device 200 includes a selection tag storage unit 716 that stores the control tags selected by the control tag selection unit 714.

[0080] The control tag data generation device 200 includes an association unit 718 that generates control tag data by associating each selected control tag with the corresponding portion in the control genomic DNA sequence. The control tag data generation device 200 also includes a control tag data storage unit 720 that stores the control tag data generated by the association unit 718.
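
Selection and association can be sketched together as below. This is a simplified illustration: uniqueness is approximated as "the sequence occurs exactly once among the size-selected tags", reverse complements are ignored, and the length limits of 20 and 80 bases are hypothetical defaults rather than values prescribed for the selection unit.

```python
from collections import Counter


def select_control_tags(tags, min_len=20, max_len=80):
    """Select tags by length and uniqueness and map each to its genome position.

    tags: iterable of (chromosome, position, sequence).
    Returns {sequence: (chromosome, position)} for the selected tags.
    """
    in_range = [t for t in tags if min_len <= len(t[2]) <= max_len]
    counts = Counter(seq for _, _, seq in in_range)
    # Keep only sequences that occur once, so each tag maps to one location.
    return {seq: (chrom, pos) for chrom, pos, seq in in_range if counts[seq] == 1}
```

The returned dictionary plays the role of the control tag data: looking up an analysis target tag gives the region of the control genome it represents.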

The control tag data generation device 200 includes an output unit 722 that acquires control tag data from the control tag data storage unit 720 and outputs the control tag data to the DNA sequence analyzer 100.

Hereinafter, generation of a virtual tag (control tag) using human genome information by the above-described control tag data generation device 200 will be described in detail.

[0083] 1. Human genome sequence information and repeat sequence data

The principle of digital genome scanning (DGS) devised by the present inventors is to count short fragments obtained by restriction enzyme treatment of genomic DNA as representatives of the genome in order to comprehensively quantify the copy number of the human genome, and to identify regions showing copy number abnormality on that basis. To establish the foundation of the DGS method, the present inventors conducted the following in silico simulations to study its resolution and effectiveness.

[0084] The human genome base sequence information and repeat sequence data were obtained as the July 2003 (hg16) version published by the UCSC Genome Bioinformatics Group at the University of California (http://genome-archive.cse.ucsc.edu/downloads.html).

[0085] The analysis program was written in C, and the software development environment was a system built around a Red Hat server. In the virtual tag analysis, the genome data was searched using the recognition base sequence of each restriction enzyme, data outside the specified range was excluded, and the remaining tag data was accumulated and analyzed by size. The position information of each tag was stored, and the repeat class of each tag was determined by comparison with the repeat sequence database and recorded.

[0086] 2. Number of virtual tags by restriction enzyme

FIG. 14 is a diagram showing the number of tags generated from the genome by each restriction enzyme. When starting DGS, the first question is which restriction enzyme should be used to fragment genomic DNA. Therefore, for the in silico analysis of virtual tags, we first examined the number of virtual tags by restriction enzyme.

[0087] In more detail, restriction enzyme digestion was performed on a computer using the DNA information of the whole human genome, and the size and number of the tags generated (hereinafter referred to as virtual tags) were counted. Figure 14 shows the number of virtual tags generated by typical 4-base recognition and 6-base recognition restriction enzymes.

[0088] As a result, the 6-base recognition restriction enzymes, whose recognition conditions are more stringent, were found to produce clearly fewer virtual tags than the 4-base recognition restriction enzymes. Among the 4-base recognition enzymes, MboI recognizes and cleaves the DNA sequence GATC, so a BamHI site can be used for cloning of concatemers (tags ligated together in a daisy chain). In addition, when limited to tags 20 to 40 bases in length, the number of tags generated by MboI was intermediate compared with the other enzymes. The following simulations therefore focused on virtual tags generated by MboI.

[0089] 3. Distribution of MboI virtual tags by size

Next, we examined the distribution of MboI virtual tags by size. More specifically, the DNA fragments generated by MboI treatment of the whole genome were analyzed in silico. When the sequence of the entire human genome known at present is targeted, a total of 7,056,567 MboI fragments are generated, and 95% of all MboI fragments were found to be shorter than 1377 bases.

FIG. 15 is a graph showing the number of tags generated by MboI by size. FIG. 15 shows a histogram in which the number of tags is tabulated for each fragment size from 20 to 80 bases, where the fragment size excludes the GATC at both ends of each MboI virtual tag (hereinafter referred to as gap length). As a result, it became clear that roughly 10,000 to 150,000 MboI virtual tags exist at each gap length from 20 to 80 bases, regardless of the tag size.

[0091] On the other hand, as seen with the tags of 36-base and 37-base gap length, some sizes stood out with a markedly large number of tags. In addition, when the number of virtual tags was counted for each chromosome, it was found that tags were generated from each chromosome almost in proportion to the length of the chromosome, as shown in Fig. 15. From these results, it was considered appropriate to collect short MboI restriction fragments as representatives of the genome.

FIG. 16 is a diagram showing the number of tags when a width is given to the tag size. Since a large number of tags must be analyzed when performing DGS, introducing as many tags as possible, joined together, into a single vector (hereinafter referred to as a “concatemer”) is important for improving work efficiency and reducing work costs.

[0093] However, since there is a limit to the length of concatemer that can be inserted into a vector and to the number of bases that can be read in a single sequencing run, the individual tags forming a concatemer should be as short as possible. We therefore analyzed the number of virtual tags when the size of the tags to be sorted is restricted; the results are shown in Fig. 16. There are a total of 1,078,762 MboI virtual tags with a gap length of 20 to 99 bases, and it became clear that about 500,000 tags can be obtained when sorting with a width of 40 bases and about 370,000 when sorting with a width of 30 bases.
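
The tabulation behind FIG. 15 and FIG. 16 amounts to building a histogram of gap lengths and summing it over a sorting window. A minimal sketch, with hypothetical function names and made-up input data:

```python
from collections import Counter


def gap_length_histogram(gap_lengths):
    """Count the number of tags at each gap length."""
    return Counter(gap_lengths)


def tags_in_window(histogram, start, width):
    """Number of tags with gap length in [start, start + width - 1]."""
    return sum(histogram.get(n, 0) for n in range(start, start + width))


# Illustrative data only; these are not values from the figures.
hist = gap_length_histogram([20, 20, 25, 49, 50, 60])
count_20_49 = tags_in_window(hist, 20, 30)
count_30_59 = tags_in_window(hist, 30, 30)
```

Sliding `start` over the histogram while keeping `width` fixed at 30 or 40 reproduces the kind of comparison described above for different sorting ranges.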

[0094] 4. Analysis of repeat-derived virtual tags

Next, analysis was performed on repeat-derived virtual tags. More specifically, it is known that a large number of very similar base sequences called repeat sequences are scattered in the genome. According to the 2001 Genome Project report (Nature 2001, 409, 871—), about 45% of the human genome is occupied by repeat sequences.

[0095] When performing DGS, an accurate copy number cannot be estimated unless each tag is uniquely mapped to one location on the genome. In other words, tags whose sequences match multiple places on the genome cannot be used as tags in DGS, and tags derived from repeat sequences are likely to be such invalid tags.

[0096] Therefore, among the virtual tags, those considered to be derived from repeat sequences were counted and their proportion was analyzed. The repeat sequences included scattered repeats (LINE, LTR element, SINE, DNA transposon), as well as tandem repeats (microsatellite, simple repeat, etc.) and non-coding RNA (tRNA, scRNA, snRNA, etc.).

FIG. 17 is a graph showing the MboI (^GATC) tag distribution. As a result, it became clear that about 40% of MboI virtual tags were derived from scattered repeats (Figure 17). In addition, the sizes that stood out with a large number of tags, such as the 36-base and 37-base gap lengths, had a high proportion of repeat-derived tags, while their number of non-repeat tags was found to be comparable to that of other sizes.

FIG. 18 is a diagram illustrating the number of effective virtual tags generated by MboI. Figure 18 shows the result of recalculating these counts for ranges of tag size. For example, for a 30-base width covering gap lengths of 30 to 59 bp, the total number of tags is 420,000, of which 250,000 are derived from repeat sequences; the number of non-repeat tags is 165,845, a proportion of 39.8%.

[0099] If the gap lengths of 36 bases and 37 bases with a high repeat rate are removed even with the same 30 base width, the ratio of non-repeat tags increases to 43% at a gap length of 40 to 69 bp, for example. The above shows that in order to exclude repeat-derived tags in DGS, it is necessary to consider the tag size to be sorted.

[0100] 5. Prediction of DGS resolution by Monte Carlo simulation

Next, the prediction of DGS resolution by Monte Carlo simulation is explained. From the above analysis, an overview of virtual tags based on human genome information was obtained. However, it was unclear how many tags need to be analyzed to obtain sufficient resolution and sensitivity when performing DGS using these tags. We therefore estimated this by in silico simulation.

[0101] 6. DGS Monte Carlo simulation

FIG. 19 is a conceptual diagram showing an image of the DGS Monte Carlo simulation. The simulation is based on the so-called Monte Carlo method, a technique that uses pseudo-random numbers to solve a problem.

FIG. 20 is a diagram for explaining the details of the DGS Monte Carlo simulation. Based on the principle shown in Fig. 21A, an original algorithm for generating pseudo-random numbers was developed and used to simulate gene amplification, gene deficiency, and loss of heterozygosity. As shown in Fig. 21B, the number of virtual tags, the number of tags actually analyzed, the number of tags indicating the abnormal copy number (corresponding to the length of the region showing the abnormal copy number), their relative appearance frequency, and the number of trials were set as variables, and the number of occurrences of each virtual tag was simulated and recorded.

[0103] Based on the obtained tag window size and abnormality detection threshold, positive / negative of copy number abnormality was determined, and positive predictive value, sensitivity, and specificity were analyzed. Various variable values were set, and the resolution capable of detecting anomalies and the number of analysis tags required to achieve them were predicted by simulation.
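
A simplified sketch of such a simulation is given below. It is not the inventors' original algorithm: it uses Python's standard `random` module, models the abnormal region as a block of consecutive virtual tags drawn `rel_freq` times as often as the others, and calls each non-overlapping window positive when its observed count reaches `ratio_threshold` times the count expected under uniform sampling. All names and parameter values are hypothetical, and the abnormal block is assumed to be aligned to window boundaries.

```python
import random
from collections import Counter


def simulate_counts(n_virtual, n_analyzed, abn_start, abn_len, rel_freq, rng):
    """Draw n_analyzed tags and return the occurrence count of each virtual tag."""
    weights = [1.0] * n_virtual
    for i in range(abn_start, abn_start + abn_len):
        weights[i] = rel_freq  # e.g. 5.0 for five-fold amplification
    drawn = rng.choices(range(n_virtual), weights=weights, k=n_analyzed)
    counts = Counter(drawn)
    return [counts[i] for i in range(n_virtual)]


def evaluate_amplification(counts, abn_start, abn_len, window, ratio_threshold):
    """Score window-level calls against the known abnormal block."""
    n_virtual = len(counts)
    expected = sum(counts) * window / n_virtual  # per window, uniform case
    tp = fp = tn = fn = 0
    for start in range(0, n_virtual - window + 1, window):
        observed = sum(counts[start:start + window])
        called = observed >= ratio_threshold * expected
        truth = abn_start <= start < abn_start + abn_len
        if called and truth:
            tp += 1
        elif called:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


rng = random.Random(0)  # fixed seed so a run can be repeated
counts = simulate_counts(1000, 20000, 100, 100, 5.0, rng)
summary = evaluate_amplification(counts, 100, 100, window=50, ratio_threshold=2.5)
```

Repeating the two calls over many trials and averaging the three measures, while varying the number of analyzed tags, the window size, and the threshold, gives the kind of parameter sweep described in the text.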

[0104] Figure 21 is a screen display showing the user interface for the DGS Monte Carlo simulation. In order to increase the work efficiency of this simulation, a web tool with a user interface capable of the above operations was developed and used for the simulation (Fig. 4).

[0105] In the present embodiment, the detection sensitivity and resolution of DGS are described for gene amplification (amplification), gene deletion (homozygous deletion), and loss of heterozygosity (LOH). A fixed number of virtual tags was set, and simulations were used to predict how many tags would actually need to be analyzed to detect these genome copy number abnormalities.

FIG. 22 is a table summarizing the DGS simulation results as the anomaly detection resolution and the number of required analysis tags for the case of 165,845 MboI non-repeat virtual tags. The DGS Monte Carlo simulation assumed that only the 165,845 non-repeat tags with gap lengths of 30 to 59 bases, identified in the MboI virtual tag analysis above, are adopted as effective tags. Random numbers were generated as many times as the set number of analysis tags and the tags that appeared were recorded; this constituted one trial, and the average of 100 trials was used.

[0107] On the other hand, in actual DGS, there are variables that are changed and analyzed after tag data acquisition, such as the threshold value for positive determination and the window size. Therefore, for the purpose of verifying the effectiveness of DGS, we adopted the settings when the positive predictive value and the sensitivity were 90% or higher, and summarized the predicted anomaly detection resolution as a result (Figure 22).

[0108] The results showed that analysis of 13,800 tags is needed to detect 5-fold gene amplification at 1 Mbp resolution, 44,000 tags to detect gene deletion at 1 Mbp resolution, and 495,000 tags to detect LOH at 1 Mbp resolution. On the other hand, when the number of analysis tags was set to 10,000, it was shown that 5-fold amplification can be detected with a resolution of 1.34 Mbp, and a gene deletion can be detected with a resolution of 3.79 Mbp.

[0109] 7. Virtual tags derived from repeat sequences

Next, in silico analysis of virtual tags (control tags) derived from repeat sequences and prediction of DGS work efficiency are described. Scattered repeats range in size from about 100 bp for short ones to over 10 kbp for long ones. FIG. 23 is a conceptual diagram for explaining the difference between a tag derived from a both-end repeat and a tag derived from a one-end repeat. For this reason, even among tags derived from repeat sequences, there can be cases where the fragment is completely buried in the repeat sequence and cases where only one end of the tag lies in the repeat sequence. The probability that a tag maps to a single location on the genome is expected to increase as the repeat-derived portion of the tag decreases.

[0111] Therefore, MboI virtual tags derived from repeat sequences were divided into those with both ends buried in a repeat sequence (hereinafter referred to as both-end repeat) and those with only one end in a repeat sequence (hereinafter referred to as one-end repeat), and their distribution by number of tags and by size was analyzed in silico.
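
The both-end / one-end distinction can be sketched as a test on the two ends of a tag against a list of annotated repeat intervals. The half-open interval convention, the function names, and the category labels are assumptions made for illustration; classifying by the ends alone follows the definition given above, so a repeat lying wholly inside a tag is not detected by this sketch.

```python
def in_repeat(pos, repeats):
    """True if pos falls inside any repeat interval (start inclusive, end exclusive)."""
    return any(start <= pos < end for start, end in repeats)


def classify_tag(tag_start, tag_end, repeats):
    """Classify a tag covering [tag_start, tag_end) by how its ends sit in repeats."""
    left = in_repeat(tag_start, repeats)
    right = in_repeat(tag_end - 1, repeats)
    if left and right:
        return "both-end repeat"
    if left or right:
        return "one-end repeat"
    return "non-repeat"
```

A linear scan over the repeat list is enough for illustration; for genome-scale annotation a sorted interval list with binary search would be the more practical choice.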

[0112] Figure 24 is a graph for re-examining the MboI virtual tags, showing the size distribution of tags buried in repeat regions (repeat both) and tags spanning a repeat region and a non-repeat region (repeat either). FIG. 24 shows a graph in which the classification information of both-end / one-end repeats is added to the drawing shown in FIG.

[0113] Even within the same type of repeat sequence, both-end and one-end tags are distinguished by color coding. In addition, one-end repeat tags of scattered repeats (LINE, SINE, LTR, DNA transposon) are intentionally displayed next to the non-repeat tags on the graph. As can be seen, in the range of tag gap lengths of 20 to 80 bases, the proportion of one-end repeat tags derived from scattered repeats, mainly SINE, increases as the tag size increases.

[0114] FIG. 25 is a diagram for explaining a case where MboI virtual tags extending over a repeat region and a non-repeat region (one-end repeat) are regarded as effective tags. Next, based on the histogram of Fig. 24, ranges of sorting size were defined and the number of tags was tabulated (Fig. 25). As shown in Fig. 25, the number of one-end repeat virtual tags increases with increasing tag size.

[0115] Here, assuming that one-end repeat virtual tags are valid tags in DGS in the same way as non-repeat tags, column B + C in Fig. 25 tabulates them as "valid tag" candidates. Even when tags are cut out from the gel with the same 30-base width, setting the preparative size longer, at 50 to 79 bases, increases the "effective tag" ratio by the increase in the number of one-end repeat tags.

FIG. 26 is a diagram for examining whether or not the tag cutout size should be shifted longer. If so, would the work efficiency of DGS increase if the tag size to be cut out were set longer? Based on the above calculations, Fig. 26 shows the results of examining the work efficiency when tags are collected in DGS with an emphasis on one-end repeat tags.

[0117] At present, it is not clear whether the size of the concatema or the number of tags contained in the concatema is the bottleneck for the production of the concatema. went. In Fig. 26, the upper table shows the results assuming that a concatemer of 500 bp or more is always obtained, while the lower table shows the results assuming that a concatemer of 5 tags is always obtained.

[0118] According to this, in a situation where a concatemer of 500 bp or more can always be created, it is better to sort out a short tag regardless of whether a valid tag is used or only a non-repeat tag is used. It can be seen that the target tag number is reached quickly. On the other hand, in the situation where a concatemer with 5 tags can always be created, it is clear that if the effective tag is emphasized, the target tag number will be reached sooner if the long fraction is sorted. From the above analysis, in a situation where the formation of concatema is limited by the number of connected tags, it is considered that effective tags can be obtained more efficiently by setting a longer tag size to be sorted.

[0119] 8. Verification of the uniqueness of virtual tags

Next, we verified in silico whether a virtual tag derived from a repeat sequence has a high probability of being an invalid tag that cannot be uniquely mapped on the genome. As analysis targets, 80 MboI virtual tags classified as repeats were picked at random from chromosome 22.

[0120] The breakdown was 40 both-end repeat tags and 40 one-end repeat tags (10 each from SINE, LINE, LTR, and DNA transposon). The average gap length of the 80 MboI virtual tags was 60.0 bp; the longest was 99 bp and the shortest was 20 bp. These tags were subjected to a Blat search and mapped onto the genome, and the candidate chromosome sites, their number, and the number of mismatched bases were recorded.

[0121] FIG. 27 is a diagram for explaining a case where MboI virtual tags extending over a repeat region and a non-repeat region (one-end repeats) are regarded as effective tags. In judging the results, a tag was defined as 'unique' when a candidate site on the genome exists at only one location, on chromosome 22, and the full length of the tag matches the genome sequence of that candidate site 100%.

As a result, as shown in FIG. 27A, 82.5% of the 80 repeat-derived MboI virtual tags were found to be unique. The breakdown is 38.8% both-end repeats and 43.8% one-end repeats, so one-end repeat tags are slightly more often unique.

[0123] However, in the Blat search, even for tags meeting the above definition of unique, a large number of candidate sites were often detected when a mismatch of 1 to 2 bases was allowed (such tags are defined as 'fine unique'). Therefore, among the unique tags, those for which no candidate site other than the single matching site was detected were defined as 'super unique' and classified further.

[0124] As a result, as shown in FIG. 27B, there were 14 super unique tags among the 80 tags, consisting of 10 one-end repeat tags (15.2%) and 4 both-end repeat tags (6.1%), suggesting that one-end repeat tags are highly unique.
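The three-way classification described above can be sketched in Python as follows. This is only an illustration: the `BlatSummary` record and its field names are hypothetical, the actual Blat output format is not reproduced, and the additional condition that the single site lies on chromosome 22 is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class BlatSummary:
    """Hypothetical per-tag summary of a Blat search (illustrative fields)."""
    tag_id: str
    exact_full_length_hits: int  # sites matching 100% over the full tag length
    near_hits: int               # extra candidate sites seen with 1-2 mismatches


def classify_uniqueness(summary: BlatSummary) -> str:
    """Classify a tag as 'not unique', 'fine unique' or 'super unique'."""
    if summary.exact_full_length_hits != 1:
        return "not unique"
    # Unique tags are split by whether other candidate sites appear
    # once a mismatch of 1 to 2 bases is allowed.
    return "super unique" if summary.near_hits == 0 else "fine unique"
```

For example, a tag with one exact site and no near-matching sites would be reported as 'super unique', while a tag with one exact site and several near-matching sites would be 'fine unique'.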

[0125] The above results come from in silico analysis using sequence information already mapped on the genome. Nevertheless, they show that even a tag derived from a repeat sequence can be mapped to a single location on the genome if the accuracy of the base sequence analysis is sufficiently high. In particular, one-end repeat tags are considered to be highly unique tags that can be analyzed more safely in DGS.

[0126] 9. Summary

The purpose of the digital genome scanning (DGS) method of the present embodiment is to quantify the copy number of the genome using human genome information as a background, and to identify a region exhibiting an abnormal copy number with high resolution. In the above explanation, preliminary experiments for establishing the DGS system were conducted both in silico and in vitro.

[0127] In the in silico analysis, starting from the analysis of the virtual tags obtained with the restriction enzyme MboI, Monte Carlo simulation was performed, and the number of analysis tags required to detect copy number abnormalities such as amplification and deletion could be predicted.

[0128] In more detail, in addition to showing that increasing the number of analysis tags increases the resolution of DGS, the simulation provided an endpoint for the question of how large a real tag analysis should be for a given number of virtual tags.

[0129] On the other hand, noting that tags derived from repetitive sequences account for about 50%, we examined whether they can be eliminated at the experimental level. The results of the in silico analysis suggest that a considerable number of tags can be mapped to the genome even when they are derived from repeats, and that one-end repeat tags in particular are effective. If this information is reflected in the DGS operation steps so that effective tag data is obtained efficiently, the reliability of the obtained data will be improved.

[0130] Thus, the preliminary study to establish the basis of the digital genome scanning method described above shows that: 1) in silico analysis of human genome information indicates that about 165,000 virtual tags derived from non-repeat sequences are obtained when tags of a 30-base width are fractionated after digestion with the restriction enzyme MboI; 2) Monte Carlo simulation indicates that a 5-fold gene amplification can be detected at a resolution of 34 Mb by analyzing 10,000 tags; and 3) even tags derived from repeat sequences can be uniquely mapped on the genome.
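The kind of Monte Carlo estimate referred to above can be illustrated with a minimal Python sketch. It is not the simulation actually used by the inventors: the uniform tag sampling model, the detection threshold (expected count plus `z` standard deviations under a Poisson-like approximation), and all function and parameter names are assumptions made for illustration.

```python
import math
import random


def detection_rate(n_tags, region_fraction, fold, trials=200, z=3.0, seed=0):
    """Estimate how often an amplified region is flagged.

    n_tags          -- number of sequenced tags per simulated experiment
    region_fraction -- share of virtual tags lying in the region (assumed)
    fold            -- copy number of the region relative to normal
    """
    rng = random.Random(seed)
    expected = n_tags * region_fraction               # count with no abnormality
    threshold = expected + z * math.sqrt(expected)    # assumed detection cut-off
    weight = fold * region_fraction
    p_region = weight / (weight + (1.0 - region_fraction))
    detected = 0
    for _ in range(trials):
        # Draw n_tags tags; each falls in the region with probability p_region.
        count = sum(rng.random() < p_region for _ in range(n_tags))
        if count > threshold:
            detected += 1
    return detected / trials
```

Running the function with `fold=1` gives an estimate of the false-positive rate of the chosen threshold, and running it over a range of `n_tags` values indicates how many tags would be needed under these assumptions before a given amplification is reliably flagged.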

FIG. 28 is a flowchart for explaining the operation of the control tag data generation device 200. In the control tag data generation device 200, when a series of flows starts, first, the control genomic DNA sequence data acquisition unit 706 acquires control genomic DNA sequence data (S402) and stores it in the control genomic DNA sequence data storage unit 708.

[0132] Next, based on the result of the preliminary examination described above, the cleavage site search unit 710 selects a restriction enzyme (e.g., MboI) (S404). The cleavage site search unit 710 then obtains the control genomic DNA sequence data from the control genomic DNA sequence data storage unit 708, searches for the cleavage sites of the restriction enzyme (S406), and cleaves the control genomic DNA sequence at those sites. Further, the cleavage site search unit 710 stores the plurality of DNA sequences generated by the cleavage in the cleaved DNA sequence storage unit 712.

Subsequently, the control tag selection unit 714 acquires the plurality of DNA sequences from the cleaved DNA sequence storage unit 712 and determines whether each DNA sequence has a base number within a predetermined range (S408). The control tag selection unit 714 then selects, from these DNA sequences, control tags consisting of DNA sequences whose base number and uniqueness are within the predetermined range (S410). On the other hand, the control tag selection unit 714 does not select as control tags those DNA sequences whose base number or uniqueness is outside the predetermined range (S412).

[0134] At this time, the unit can also be configured to select, as highly unique control tags, only those DNA sequences each of which is contained in the control genomic DNA sequence no more than a predetermined number of times (for example, once or less). Further, the control tag selection unit 714 stores the selected control tags in the selection tag storage unit 716.

[0135] Then, the associating unit 718 acquires a plurality of control tags composed of DNA sequences having the number of bases within a predetermined range from the selection tag storage unit 716, and associates them with the corresponding positions of the control genomic DNA sequence data. (S414), control tag data is generated. Further, the associating unit 718 stores the generated control tag data in the control tag data storage unit 720.

Furthermore, the output unit 722 acquires the control tag data from the control tag data storage unit 720, outputs it to the DNA sequence analyzer 100 (S416), and ends the series of flows.
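The flow of steps S406 to S414 can be sketched in Python as below. This is a simplified illustration under stated assumptions: the function name, the default size range, and the copy-count rule are hypothetical; uniqueness is checked only among the candidate tags rather than against the whole genome; and the reverse strand is ignored.

```python
import re
from collections import Counter

SITE = "GATC"  # MboI recognition sequence


def virtual_tags(genome, min_gap=20, max_gap=49, max_copies=1):
    """Return (position, tag) pairs for fragments between adjacent GATC sites."""
    starts = [m.start() for m in re.finditer(SITE, genome)]  # S406: find sites
    candidates = []
    for a, b in zip(starts, starts[1:]):
        gap = b - (a + len(SITE))            # bases between the two sites
        if min_gap <= gap <= max_gap:        # S408: size within the range?
            candidates.append((a, genome[a:b + len(SITE)]))
    copies = Counter(tag for _, tag in candidates)
    # S410/S412: keep only tags seen no more than max_copies times.
    # Each kept tag stays paired with its genomic position (S414).
    return [(pos, tag) for pos, tag in candidates if copies[tag] <= max_copies]
```

Keeping each tag paired with its start position corresponds to the association step, so the resulting list plays the role of the control tag data in this sketch.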

[0137] Hereinafter, advantages of the control tag data generation device 200 of the present embodiment will be described.

According to the control tag data generation device 200, control tags are selected not according to whether they are repeat or non-repeat sequences, but according to whether the number of copies contained in the control genomic DNA sequence is a predetermined number or less (for example, 1 or less). Therefore, the control tags obtained from the control genomic DNA sequence can be used effectively, and the reliability of the obtained data can be improved.

[0138] Furthermore, according to the control tag data generation device 200, a suitable restriction enzyme can be used by changing the combination of the restriction enzyme used for cleaving human genomic DNA and the size of the tag sequence to be extracted. In other words, to achieve the goal of detecting changes in the amount of DNA at the whole-genome level with high resolution, several parameters must be optimized to increase the sensitivity and specificity of the “digital genome scanning method”. The present inventors have already confirmed by data simulation that the minimum detectable region of a mutation is determined by the combination of the restriction enzyme used for cleaving human genomic DNA and the size of the extracted tag sequence.

[0139] For example, with the restriction enzyme EcoRI (a 6-base recognition enzyme) and a tag sequence size of 20-25 base pairs, the tag sequences are spaced 200 kb to 20 Mb apart, with an average interval of 2 Mb. With the combination of the restriction enzyme MboI (a 4-base recognition enzyme) and a tag sequence size of 20-30 base pairs, the tag sequences are spaced 10 bp to 460 kb apart, a high density with an average interval of 20 kb. Therefore, according to the control tag data generation device 200, it is possible to select an optimal restriction enzyme from a wide variety of restriction enzymes according to the target resolution.

[0140] Further, each tag sequence carries positional information on the human genome from which it is derived, and can be mapped onto the chromosome immediately once it has been entered in the database. Therefore, the amount of DNA can be quantified with high accuracy by summing the number of tag sequences for each chromosome. Specifically, because the control tag data generation apparatus 200 uses the MboI restriction enzyme, which is suitable for obtaining control tags from the human genomic DNA sequence, it can generate control tag data suited to being associated with the analysis target tag data, which improves the reliability and efficiency of DNA sequence analysis.

[0141] Furthermore, according to the control tag data generation device 200, when generating the control tag data, the control tag data is associated with the corresponding position in the control genomic DNA sequence. It is sufficient to include the positional information of the control genomic DNA sequence and the sequence data of the selected control tag data in the data. Therefore, the processing load of the DNA sequence analyzer 100 can be reduced as compared with the case where the sequence of the entire human genome DNA sequence is directly associated with the tag data to be analyzed.

[0142] Here, selecting the "optimal" restriction enzyme actually used for tag generation requires detailed simulations covering the entire genome data and a large number of restriction enzymes, and therefore a large computer processing capacity. The control tag data generation device 200 can speed up this processing by using the parallel computing technology that the present inventors have been working on. In addition, by using a neural network for the algorithm, a system can be constructed that obtains more accurate results than conventional computer processing and statistical methods.

[0143] As for the number of tag sequences to be extracted from the genome of the target cells to be analyzed, Monte Carlo simulation supports that, for example, sequence analysis of 100,000 tag sequences can detect an amplification region of 100 kb or more (about 10-fold), a homozygous deletion region of 600 kb or more, and a chromosome copy number change of 4 Mb or more (n = 1 or 3).

[0144] <3. Generation of tag data for analysis>

FIG. 29 is a functional block diagram for explaining the internal configuration of the analysis target tag data generation apparatus 300. The analysis target tag data generation device 300 includes an analysis target DNA molecule application unit 802 for applying an analysis target DNA molecule, which is a genomic DNA molecule of a predetermined biological species such as a human. The analysis target tag data generation device 300 also includes a restriction enzyme application unit 804 for applying a restriction enzyme (such as MboI) for cleaving the analysis target DNA molecule.

[0145] In addition, the analysis target tag data generation apparatus 300 includes a restriction enzyme processing unit 806 for cleaving a DNA molecule containing the analysis target DNA sequence with the restriction enzyme (MboI or the like). Furthermore, the analysis target tag data generation device 300 includes an electrophoresis unit 808 for separating the plurality of cleaved DNA fragments. The analysis target tag data generation apparatus 300 also includes a DNA fragment extraction unit 810 for extracting DNA fragments having a predetermined number of bases from the plurality of DNA fragments obtained by cleaving the DNA molecule with the restriction enzyme.

[0146] The analysis target tag data generation device 300 also includes a concatemer generation unit 812 that generates a concatemer formed by linking a plurality of DNA fragments extracted by the DNA fragment extraction unit 810. Furthermore, the analysis target tag data generation apparatus 300 includes a secondary concatemer generation unit 814 that generates a secondary concatemer formed by connecting a plurality of concatemers generated by the concatemer generation unit 812.

[0147] In addition, the analysis target tag data generation device 300 includes a sequence unit 816 for sequencing the DNA sequence of the secondary concatemer. Furthermore, the analysis target tag data generation device 300 includes a sequence result storage unit 818 that stores the sequence result produced by the sequence unit 816.

Further, the analysis target tag data generation device 300 includes an analysis target tag data generation unit 820 that generates analysis target tag data, which is a set of a plurality of analysis target tags, based on the sequence result acquired from the sequence result storage unit 818. Furthermore, the analysis target tag data generation device 300 includes an analysis target tag data storage unit 822 that stores the analysis target tag data generated by the analysis target tag data generation unit 820. Finally, the analysis target tag data generation device 300 includes an output unit 824 that acquires the analysis target tag data from the analysis target tag data storage unit 822 and outputs it to the DNA sequence analyzer 100.

[0150] Hereinafter, generation of analysis target tag data of the analysis target genomic DNA molecule using the analysis target tag data generation apparatus 300 will be described in detail.

[0151] 1. Production of DGS tag and acquisition of tag data

First, the creation of DGS tags and the acquisition of tag data are described. Genomic DNA was extracted from the gastric cancer cell line HSC45 as the human genomic DNA molecule. Next, 20 to 40 µg of genomic DNA was treated with the restriction enzyme MboI at 37 °C for 16 hours and subjected to 3% NuSieve agarose electrophoresis. The gel was then cut out in the range of about 30 to 60 bases and dissolved with GELase (EPICENTRE), and the tag DNA was purified by ethanol precipitation.

[0152] pBluescript II KS (+) (Stratagene) was used as the cloning vector for concatemers; it was treated with the restriction enzyme BamHI and then with alkaline phosphatase before being used for cloning. Using the Takara ligation kit Ver. 2.1, concatemers were prepared from the tags and cloned into the vector.

[0153] The vector was then introduced into E. coli DH10B by electroporation, and positive colonies were selected by color selection using X-gal. Each colony was cultured in ampicillin-containing LB medium, and the vector DNA was purified with an automatic nucleic acid extractor (KURABO and QIAGEN), treated with RNase, and analyzed.

[0154] The insert was confirmed by double digestion with XhoI and SacI. For the sequencing reaction on the concatemer-containing vector, T3 and T7 were used as primers together with the BigDye Terminator \ d. 1 Cycle Sequencing Kit and the GeneAmp PCR System 9700 (Applied Biosystems). The ABI Prism 3100 Genetic Analyzer (Applied Biosystems) was used for base sequence analysis of the products.

[0155] 2. Preliminary in vitro experiments using gastric cancer cell line genomic DNA

FIG. 30 is a diagram for explaining extraction of tag DNA and production of concatemers. More specifically, a preliminary experiment on DGS was performed using actual human genomic DNA extracted from the gastric cancer cell line HSC45. FIG. 30A shows the result of treating genomic DNA extracted from HSC45 with the restriction enzyme MboI and electrophoresing it on a 3% NuSieve gel.

[0156] A smear was observed even below 100 bp, which is considered to contain tag DNA useful for DGS, and a clear band was found around 40 bp. This band was estimated to reflect the tag population with gap lengths of 36 and 37 bases, which the MboI virtual tag analysis described above showed to be prominently large (and to have a high proportion of repeat-derived tags). Since it is difficult to determine fragment size accurately at 100 bp or less, the band around 40 bp was used as an index, and the gel was cut into a long fraction (fraction #3), a fraction including the band (fraction #4), and a short fraction (fraction #5), from which the tag DNA was purified.

[0157] Next, the obtained tags were joined by ligation to produce concatemers, which were introduced into a pBluescript vector for cloning. Initially, only one tag was introduced into the vector, but increasing the tag concentration improved the concatemer extension efficiency, and concatemers of 3 to 5 tags were obtained (FIG. 30B).

[0158] FIG. 30C shows a restriction enzyme map prepared from the base sequence of a typical concatemer. This concatemer consists of 5 tags. It was produced from tags derived from fraction #3, and the actual size of each tag was 43-52 bp. When each tag was mapped onto a chromosome by Blat search, the tags were confirmed to be derived from different chromosomes, such as chromosomes 1, 6, and 11 and the X chromosome, as shown in FIG. 30C. The concatemer also contained one SINE-derived repeat tag.

[0159] 3. Analysis of tags obtained in preliminary experiments in vitro

The concatemer sequences obtained in the above experiment were cut into tags at the MboI sites, the MboI recognition sequence GATC was attached to both ends of each cut tag, and a Blat search was performed to map the tags onto the genome. At the same time, the uniqueness of each sequence was analyzed and repeat sequences were classified.
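The cutting step described above can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that the input is a single-strand concatemer read in which any sequence before the first GATC and after the last GATC is flanking vector sequence to be discarded.

```python
from typing import List


def split_concatemer(concatemer: str, site: str = "GATC") -> List[str]:
    """Cut a concatemer read at MboI sites and restore GATC on both tag ends."""
    parts = concatemer.upper().split(site)
    inner = parts[1:-1]  # drop the flanks outside the first and last site
    # Empty pieces arise from directly adjacent sites and carry no tag.
    return [site + piece + site for piece in inner if piece]
```

Each returned string has GATC at both ends and can then be submitted to a Blat search or similar alignment tool.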

[0160] FIG. 31 is a graph showing the results of analyzing and counting the base sequences of the tags used in the preliminary experiment. As a preliminary experiment, the base sequences of a total of 81 tags were analyzed and aggregated (FIG. 31). The overall size distribution and the repeat/non-repeat ratio are shown in FIG. 31A. The tag gap lengths were between 25 and 58 bp, and 38 of the 81 tags (46.9%) were derived from non-repeat sequences.

[0161] The size breakdown of the tag fractions sorted by size after electrophoresis, that is, fractions #3, #4, and #5, is shown in FIG. 31B. The tag size peak of each fraction almost coincided with its size position on the electrophoresis gel, but the fractions were not strictly separated by size.

[0162] The non-repeat rate in each fraction was about 45%, which was almost the same as the result predicted in advance by the virtual tag analysis (FIG. 31C). Fraction #4 had been predicted to show a high repeat rate, but this was not observed, presumably because the size fractionation was not perfect.

[0163] 4. Tag sequence analysis and genome mapping

MacVector 7.2.2 and AssemblyLIGN (Accelrys) and Clone Manager 7 Professional Suite (Sci Ed Central) were used for alignment of the concatemer base sequences and analysis of restriction enzyme sites.

[0164] Each concatemer was sequenced from both directions using the T3 and T7 primers; after the two reads were aligned, the portion that matched in both was extracted as the concatemer sequence. The concatemer base sequence was cut into tags at the MboI sites, the MboI recognition sequence GATC was attached to both ends of each cut tag, and a Blat search was performed to map the tags onto the genome; at the same time, the uniqueness of each sequence was assessed and repeat sequences were classified. Human BLAT Search (http://genome.ucsc.edu/cgi-bin/hgBlat) and blastn (http://www.ncbi.nlm.nih.gov/blast/) were used for genome mapping and repeat class classification of each tag.

[0165] Next, as already described, the analysis target tag data consisting of non-repeat tags is associated with the control tag data, the number of copies is determined, and the analysis target tags are mapped on the genome and tabulated for each region in the chromosome.

[0166] Since the number of tags obtained is considered to be proportional to the length of each region in the chromosome, in other words to the number of virtual tags present in each region, the result was expressed as a tag density (the number of tags divided by the number of virtual tags in each region, or by the length of each region in the chromosome). Variation in tag density was then observed for each region in the chromosome (not shown).
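The tag density calculation can be illustrated with a short Python sketch. The function name and the region labels are hypothetical, and the ratio to the genome-wide density is an added convenience for reading gains and losses, not a step stated in the text.

```python
def tag_density(observed, virtual):
    """Compute per-region tag density and its ratio to the overall density.

    observed -- {region: number of sequenced tags mapped to the region}
    virtual  -- {region: number of virtual tags in the region}
    """
    total_virtual = sum(virtual.values())
    total_observed = sum(observed.get(region, 0) for region in virtual)
    overall = total_observed / total_virtual
    result = {}
    for region, n_virtual in virtual.items():
        if n_virtual == 0:
            continue  # density is undefined where no virtual tag exists
        density = observed.get(region, 0) / n_virtual
        ratio = density / overall if overall > 0 else float("nan")
        result[region] = (density, ratio)
    return result
```

Under this sketch, a region with a ratio well above 1 would be a candidate for amplification and a region with a ratio well below 1 a candidate for deletion, subject to the statistical considerations discussed earlier.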

[0167] 5. Summary

FIG. 32 is a conceptual diagram for explaining an operation flow of the analysis target tag data generation device 300. To summarize the above description, the analysis target tag data generation device 300 performs the following steps in order.

[0168] First, extract genomic DNA and cleave 40-80 µg with the restriction enzyme MboI.

Next, collect DNA fragments of 30-60 bp as tags by 3% NuSieve agarose electrophoresis.

Next, a concatemer with linked tags is prepared (1st ligation).

Next, the concatemer is introduced into a BamHI-treated pBluescript II KS + vector (2nd ligation).

Next, introduce the vector into E. coli, collect the clones together, and purify the vector DNA (primary library).

Next, the primary library vector is treated with the restriction enzymes SpeI and PstI to extract the concatemer sequences.

Next, the concatemers are re-extended by ligating the extracted concatemers to each other (3rd ligation).

Next, the re-extended concatemer is introduced into a pBluescript II KS+ vector treated with PstI and/or SpeI (4th ligation).

Next, the vector is introduced into E. coli, the clones are individually collected, and the base sequence of the concatemer is analyzed.

Next, tag data is obtained, mapped onto the genome, and the number of tags is tabulated. Then the tag density is analyzed from the number of tags obtained, and increases or decreases in copy number on the genome are searched for.

FIG. 33 is a conceptual diagram for explaining the re-extension of the concatemer. In actual analysis, the analysis target tag data generation apparatus 300 creates a concatemer by linking tags and uses it for base sequence analysis. However, it is difficult to make a long concatemer with ordinary ligation. For DGS, a protocol was therefore developed that adds a second step, re-extension of the concatemers, and long concatemers were successfully produced. This increases the efficiency of base sequence analysis.

[0170] In addition, conventional genome quantification methods involve amplification by PCR in the tag production process, whereas DGS using the analysis target tag data generation apparatus 300 does not use PCR at all, so more accurate quantification is possible. Therefore, there is an advantage that the reliability of the obtained data is high.

[0171] FIG. 34 is a restriction enzyme map for explaining a method of grasping the concatemer structure. To grasp the concatemer structure from the concatemer base sequence analyzed by the method described above, the structure is judged only from the arrangement of restriction enzyme sites, and the sequence structure shown in the figure can be estimated.

[0172] FIG. 35 is a sequence map of the DNA sequence for explaining the method of grasping the concatemer structure. FIG. 36 is a sequence map obtained by removing the vector sequence from the sequence map of FIG. 35. FIG. 37 is a sequence map for explaining how the tags are cut out of the sequence map. In this way, the entire region including the concatemer is sequenced, the vector sequence is removed from the sequence map, and the sequence information of the tags is cut out from the remaining sequence map. The DNA sequences of a large number of analysis target tags can therefore be analyzed in one sequencing run, which improves the sequencing efficiency for the tags to be analyzed.

FIG. 38 is a flowchart for explaining the operation of the analysis target tag data generation device 300. In the analysis target tag data generation device 300, when a series of flows starts, genomic DNA molecules of a predetermined species such as a human are applied to the analysis target DNA molecule application unit 802, such as a tube (S502). Likewise, an appropriate restriction enzyme such as MboI is applied to the restriction enzyme application unit 804, such as a tube (S504). Then, in the restriction enzyme processing unit 806, such as a restriction enzyme kit, the DNA molecule to be analyzed and the restriction enzyme are brought into contact and incubated in an appropriate environment, whereby restriction enzyme treatment is performed (S506).

[0174] By the restriction enzyme treatment, the genomic DNA molecule is cleaved at the restriction enzyme cleavage sites and separated into a plurality of DNA fragments. The plurality of DNA fragments are separated according to their number of bases by electrophoresis in the electrophoresis unit 808, such as an electrophoresis tank (S508). Among the DNA fragments separated by size, those with base numbers within a predetermined range are cut out of the electrophoresis agarose gel and extracted by a prep method or the like in the DNA fragment extraction unit 810, such as a DNA extraction kit (S510).

Next, the concatamer generation unit 812 such as a ligation kit generates concatamers by linking the DNA fragments having the base numbers within the predetermined range thus obtained (S512). Further, a concatamer formed by linking a plurality of DNA fragments is ligated to a multicloning site of a vector such as a plasmid to generate a concatamer-containing vector. This concatamer-containing vector is introduced into E. coli and transformed, and this E. coli is cultured to amplify the concatamer-containing vector. The cultured E. coli concatamer-containing vector is extracted by a miniprep method or the like.

[0176] Next, the secondary concatamer generation unit 814, such as a ligation kit, further ligates together a plurality of the concatamers that were ligated into the vector and amplified by culturing the vector host, generating a secondary concatamer (S514). The DNA sequence of this secondary concatamer is then sequenced using the sequence unit 816, such as a DNA sequencer (S516). In addition, the sequence unit 816 stores the generated sequence result in the sequence result storage unit 818.

[0177] Then, the analysis target tag data generation unit 820 acquires the sequence result from the sequence result storage unit 818 and generates the analysis target tag data based on the sequence results of those DNA fragments whose number of bases is within a predetermined range (S518). Further, the analysis target tag data generation unit 820 stores the generated analysis target tag data in the analysis target tag data storage unit 822.

Then, the output unit 824 acquires the analysis target tag data from the analysis target tag data storage unit 822, outputs it to the DNA sequence analyzer 100 (S520), and the series of flows ends.

Hereinafter, advantages of the analysis target tag data generation device 300 will be described.

By using the analysis target tag data generation device 300, DGS cleaves the genome with a restriction enzyme suited to the purpose, such as the 4-base recognition restriction enzyme MboI, collects the resulting fragments (about 30-80 bp in the case of MboI) as tags, and counts them, so that the copy number of the genome can be analyzed. As a result, the reliability of the obtained data and the data acquisition efficiency are improved.

[0180] In addition, in analysis methods based on DNA hybridization, represented by CGH, hCot-1 DNA is used in advance to remove repetitive sequences from the sample by molecular biological techniques. This means that information on regions containing repetitive sequences, which occupy about 45% of the genome, is discarded. In contrast, in DGS using the analysis target tag data generation apparatus 300, the sequence data of all analysis target tags is acquired as long as the number of bases is within a predetermined range, so copy number abnormalities can be analyzed at those locations as well. Therefore, using the analysis target tag data generation apparatus 300 when performing digital genome scanning (DGS), a method for comprehensively examining the amount of human genomic DNA with high resolution and high accuracy based on quantitative analysis of short fragments of human genomic DNA, improves the accuracy of analysis of copy number abnormalities.

[0181] In addition, the working speed of DGS depends on how quickly the base sequence data of the tags to be analyzed can be acquired. With ordinary analytical equipment, sequence analysis of 196 samples takes approximately 24 hours. The present inventors' simulations have shown that analysis of 10,000 tags is necessary to identify a 1.3 Mbp amplification region. On this basis, with a goal of 10,000 tags, it takes 51 days if each sample contains only 1 tag, but the goal can be reached in 5 days if each sample contains 10 tags, so the method is expected to be usable as a practical system.
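The throughput estimate above follows from simple arithmetic, sketched here in Python. The function and parameter names are illustrative; the only figures taken from the text are 196 samples per 24 hours and the target of 10,000 tags.

```python
import math


def days_to_target(target_tags, tags_per_sample, samples_per_day=196):
    """Days of sequencing needed to reach a target number of tags."""
    samples_needed = math.ceil(target_tags / tags_per_sample)
    return samples_needed / samples_per_day


# One tag per sample: 10,000 samples / 196 per day, about 51 days.
single = days_to_target(10_000, 1)
# Ten tags per sample: 1,000 samples / 196 per day, about 5 days.
ten = days_to_target(10_000, 10)
```

The same function can be used to see how the required time changes with the average number of tags per concatemer.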

[0182] At this time, regarding the production of concatemers with linked tags to be analyzed, various conditions such as DNA concentration, temperature setting, and reaction time were examined, but the average number of tags connected in one ligation was 2 to It has 3 tags, and it was not easy to adjust a long concatema. When the length distribution of the concatemer was confirmed by electrophoresis, it was visually confirmed as the band with the largest number of concatemers with only one tag. On the other hand, it is thought that there will also be a long and strong force in an amount that cannot be confirmed by electrophoresis.

[0183] Therefore, the present inventors predicted that longer concatemers could be obtained by recovering the concatemers once formed and ligating them together again (re-extension). Experiments using the steps described above confirmed that this is possible. In other words, the in vitro experiments revealed that a single step can form a concatemer combining 3 to 5 tags from different chromosomes. In the preliminary experiments so far, repeating this concatemer generation step up to a second stage yielded concatemers of about 7 tags on average, so the problem of the time needed to generate analysis target tag data in DGS can be overcome.

[0184] That is, the in vitro analysis made clear that purification of tags and formation of concatemers, which are the starting point of DGS, are the most important and most difficult steps. The number of tags connected in one concatemer greatly affects the working time and cost required to complete DGS. Improving this efficiency is therefore considered an issue that deserves focused effort. In addition, analysis of the obtained tag sequences is simple work, but automation is essential to process a large amount of data. This issue can be resolved by automating the analysis of tag data in parallel with the construction of the tag database.

[0185] On the other hand, GST (genome signature tags), which applies the principle of SAGE (serial analysis of gene expression) to microbial genomic DNA, has been reported. However, with that approach it was difficult to obtain highly accurate and quantitative results with the human genome.

[0186] In contrast, if the analysis target tag data generation device 300 is used, a secondary concatemer can be generated without performing PCR, as described above, so the experimental process is easy. Since bias due to PCR is unlikely to occur, highly accurate quantitative results can be obtained even for a genome as complex as the human genome.

[0187] <Other variations>

Hereinafter, a modified example of the above embodiment will be described from viewpoints different from the above description. FIG. 39 is a conceptual diagram for explaining the flow of automatic tag analysis. A modified example of the data flow when automatic tag analysis is performed in the digital genome scanning shown in Fig. 1 above will be described in detail.

[0188] First, the raw data processing and tag data extraction step will be described. Here, when a concatemer DNA sequence is sequenced in two directions, an alignment of the sequences read from both directions is created. The restriction enzyme sites in the aligned DNA sequences are then identified, and the structure of the concatemers in these DNA sequences is determined. Then, the vector sequences are removed from the DNA sequences, and each tag sequence is cut out.
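The tag cut-out step can be illustrated with a minimal sketch. It assumes that the vector sequence has already been trimmed from the read, that reads from both directions have already been merged, and that every tag in the insert is bounded by the MboI recognition sequence GATC. The function names are hypothetical and are not taken from the automatic analysis program itself.

```python
import re

MBOI_SITE = "GATC"  # recognition sequence of the restriction enzyme used here


def extract_tags(insert: str, site: str = MBOI_SITE) -> list[str]:
    """Cut a concatemer insert into tags of the form site + gap + site.

    Adjacent tags share the restriction site that separates them.
    """
    seq = insert.upper()
    # GATC cannot overlap itself, so non-overlapping matching finds every site.
    positions = [m.start() for m in re.finditer(site, seq)]
    return [seq[left:right + len(site)]
            for left, right in zip(positions, positions[1:])]


def gap_length(tag: str, site: str = MBOI_SITE) -> int:
    """Tag length excluding the two flanking restriction sites."""
    return len(tag) - 2 * len(site)


if __name__ == "__main__":
    read = "GATC" + "AAACCC" + "GATC" + "TTTGG" + "GATC"
    for tag in extract_tags(read):
        print(tag, gap_length(tag))
```

Two directly adjacent sites (GATCGATC) yield a tag with a gap length of 0 bp, which is how very short raw tags can arise.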

[0189] Next, the steps of tag mapping and determination and result management will be described. First, in the tag mapping and determination step, the tag sequence derived from the raw data obtained as described above is collated with the virtual tag database.

[0190] Next, in the result management step, the data for each vector is aggregated based on the results obtained in the tag mapping and determination step described above, and the concatemer status, the number of tags, and the reasons for errors are analyzed. This yields data that can be used for reanalysis and reused when examining experimental conditions. Aggregating such data also makes it possible to check for duplicated concatemers.

[0191] In addition, in the result management step, data for each tag is also aggregated. That is, the number of votes for each virtual tag is counted. Based on the results of this tabulation, the data for each tag is analyzed and visualized (for example, as a tag density histogram or a grid display). At this time, the analysis and visualization can be performed dynamically by changing the window size and threshold. By referring to such visualized tag-level data, chromosomal abnormalities such as duplication and deletion in each region of the genomic DNA sequence can be found easily.

[0192] For example, when the tag density histogram is used, the tag density can be determined by dividing the total number of raw tags (analysis target tags) in a predetermined region of the genomic DNA sequence to be analyzed by the total number of virtual tags (control tags) in the corresponding region of the control genomic DNA sequence. In this case, the window size means the number of virtual tags in the denominator when calculating the tag density (corresponding to the size of the genome region used for the density calculation).

[0193] FIG. 40 is a conceptual diagram for explaining the classification of tags. Tag classification corresponds to an operation for setting a predetermined contribution for each tag. Note that the tag classification method shown in this figure is one variation, and various other tag classification methods are possible.

[0194] In the example shown in this figure, in the first step, it is determined whether the full length of the raw tag (analysis target tag) matches 100% of the full length of one type of V tag (virtual tag: control tag). If it does, decision code 0 (for example, contribution 1) is assigned to the raw tag and a vote is cast for that V tag. On the other hand, if it does not match 100% of the full length of one type of V tag, go to the next step.

[0195] In the next step, it is determined whether the length of the raw tag exceeds the range of tag lengths included in the VT-DB (Virtual Tag Database). If the tag length is outside the range of the tags included in the VT-DB, it is determined that the raw tag was not cut out correctly or that there is a problem with the setting of the tag size included in the VT-DB; decision code 1 (for example, contribution 0) is assigned to the raw tag, and no vote is cast. On the other hand, if the tag length is within the range included in the VT-DB, proceed to the next step.

[0196] In the next step, it is determined whether the full length of the raw tag matches 100% of two or more types of V tags. If it does, the raw tag is determined to be derived from a repeat, decision code 2 (for example, contribution 0) is assigned to the raw tag, and no vote is cast. On the other hand, if the full length does not match 100% of two or more types of V tags, proceed to the next step.

[0197] In the next step, it is determined whether exactly one type of V tag exists that matches the raw tag with a mismatch of 1 to 3 bases (or less than 10%). If exactly one such V tag exists, it is judged that the raw tag very probably contains a sequence error or is a SNP tag; the raw tag is sent for re-voting, and decision code 3 is assigned to the raw tag. On the other hand, if exactly one such V tag does not exist, proceed to the next step.

[0198] In the next step, it is determined whether two or more types of V tags exist that match the raw tag with a mismatch of 1 to 3 bases (or less than 10%). If two or more such V tags exist, the raw tag is determined to be a repeat-derived tag, decision code 4 (for example, contribution 0) is assigned to the raw tag, and no vote is cast. On the other hand, if two or more such V tags do not exist, proceed to the next step.

[0199] In the next step, it is determined whether one or both ends of the raw tag is SpeI (ACTAGT) or PstI (CTGCAG). If one or both ends are SpeI or PstI, proceed to the step of determining whether the full length of the raw tag matches 100% of a part of one type of V tag. On the other hand, if neither end is SpeI or PstI, proceed to the step of determining which DNA sequence the tag is derived from.

[0200] In the step of determining which DNA sequence the tag is derived from, decision code 10 (for example, contribution 0) is assigned to the raw tag, and no vote is cast. An algorithm such as BLAST is used to determine which sequence the tag has high homology with. In this case, if there is a mismatch of 4 bases (10%) or more, it is determined that the sequence is not derived from the human genome, and homology is searched against E. coli, mitochondrial, and vector sequences (which may not have been removed in advance) and against the genomes of various other species to determine the origin of the DNA sequence.

[0201] In the step of determining whether the full length of the raw tag matches 100% of a part of one type of V tag, if it matches 100%, it is determined that the raw tag was cut in the process of cutting out the concatemer; the tag is sent for re-voting, and decision code 5 (for example, contribution 0) is assigned to the raw tag. On the other hand, if it does not match 100% of a part of one type of V tag, go to the next step.

[0202] In the next step, it is determined whether the full length of the raw tag matches 100% of a part of two or more types of V tags. If it does, it is determined that the raw tag is a repeat sequence, decision code 6 (for example, contribution 0) is assigned to the raw tag, and no vote is cast. On the other hand, if the full length of the raw tag does not match 100% of a part of two or more types of V tags, proceed to the next step.

[0203] In the next step, it is determined whether the raw tag matches a part of one type of V tag with a mismatch of 1 to 3 bases. If it does, it is determined that either a sequence error occurred during sequencing of the raw tag or the tag is a SNP tag, and decision code 7 (for example, contribution 0) is assigned to the raw tag. On the other hand, if the raw tag does not match a part of one type of V tag with a mismatch of 1 to 3 bases, go to the next step.

[0204] In the next step, it is determined whether the raw tag matches a part of two or more types of V tags with a mismatch of 1 to 3 bases. If it does, it is determined that the raw tag is a repeat sequence, decision code 8 (for example, contribution 0) is assigned to the raw tag, and no vote is cast. On the other hand, if the raw tag does not match a part of two or more types of V tags with a mismatch of 1 to 3 bases, proceed to the next step.

[0205] The last step handles raw tags that do not belong to any of the above categories. Decision code 9 (for example, contribution 0) is assigned to these raw tags, and no vote is cast.
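The decision sequence of [0194] to [0205] can be summarized as a small classifier. The following is a minimal sketch under stated assumptions, not the program used by the inventors. It assumes the VT-DB is a plain mapping from a V-tag ID to its sequence, counts a mismatch as a substitution between equal-length sequences, uses only the "1 to 3 bases" criterion (not the 10% alternative), and scans the database linearly. The example sequences are made up for illustration.

```python
SPEI, PSTI = "ACTAGT", "CTGCAG"


def mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_window_mismatch(raw: str, vtag: str) -> int:
    """Fewest mismatches between raw and any same-length window of vtag."""
    n = len(raw)
    if n > len(vtag):
        return n  # raw cannot be a part of a shorter V tag
    return min(mismatches(raw, vtag[i:i + n])
               for i in range(len(vtag) - n + 1))


def classify(raw: str, vt_db: dict[str, str],
             min_len: int, max_len: int, max_mm: int = 3) -> int:
    """Return the decision code (0 to 10) for one raw tag."""
    exact = [i for i, s in vt_db.items() if s == raw]
    if len(exact) == 1:
        return 0    # full-length match to one V tag: vote
    if not min_len <= len(raw) <= max_len:
        return 1    # length outside the range held in the VT-DB
    if len(exact) >= 2:
        return 2    # full-length match to several V tags: repeat
    near = [i for i, s in vt_db.items()
            if len(s) == len(raw) and 1 <= mismatches(raw, s) <= max_mm]
    if len(near) == 1:
        return 3    # probable sequence error or SNP: re-vote
    if len(near) >= 2:
        return 4    # near match to several V tags: repeat
    if not (raw.startswith((SPEI, PSTI)) or raw.endswith((SPEI, PSTI))):
        return 10   # origin to be examined by a homology search
    part = {i: best_window_mismatch(raw, s) for i, s in vt_db.items()}
    exact_part = [i for i, d in part.items() if d == 0]
    if len(exact_part) == 1:
        return 5    # tag cut while excising the concatemer: re-vote
    if len(exact_part) >= 2:
        return 6
    near_part = [i for i, d in part.items() if 1 <= d <= max_mm]
    if len(near_part) == 1:
        return 7
    if len(near_part) >= 2:
        return 8
    return 9        # none of the above


# Made-up example database; v2 and v3 share one sequence (non-unique).
EXAMPLE_DB = {
    "v1": "GATCAAAACCCCGGGGTTTTGATC",
    "v2": "GATCACGTACGTACGTACGTGATC",
    "v3": "GATCACGTACGTACGTACGTGATC",
    "v4": "GATCTTTTACTAGTCCCCAAAACCCCGATC",
}

if __name__ == "__main__":
    for tag in ("GATCAAAACCCCGGGGTTTTGATC", "GATCAAAACCCCGGGGTTTAGATC"):
        print(tag, classify(tag, EXAMPLE_DB, min_len=20, max_len=130))
```

For a database of the size described in this specification, a linear scan would be too slow; an index keyed by sequence would be the obvious replacement, but that choice is outside what the text specifies.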

[0206] Although the embodiments of the present invention have been described with reference to the drawings, these are examples of the present invention, and various configurations other than the above can also be adopted.

[0207] For example, in the above-described embodiment, a difference in the copy number of a human genomic DNA sequence is detected, but a difference in the copy number of a genomic DNA sequence in various organisms other than humans may be detected instead. In this way, the invention can be applied not only to medicine but also to a wide range of industries including food, chemistry, agriculture, forestry, and fisheries.

[0208] Further, in the above embodiment, the entire human genomic DNA sequence is analyzed, but the target of analysis may instead be a chromosomal DNA sequence that is a part of the human genomic DNA sequence, or a partial DNA sequence of a chromosome. This has the advantage that research can be carried out efficiently by narrowing down the region of the human genome.

[0209] In the above embodiment, MboI is used as a restriction enzyme, but other restriction enzymes may be used. In particular, a restriction enzyme that recognizes and cleaves a 4-base sequence is suitably used in the present embodiment because of the large number of virtual tags obtained.

[0210] Also, in the above embodiment, a partially different tag is assumed to be a tag that has the same length in the control tag and the analysis tag but contains a mismatch; however, there is no particular limitation. For example, when a raw tag is actually analyzed, one base may be missing due to a sequence error, or an extra base may be included. In such cases, even if there is an insertion or deletion of one or several bases in the base sequence of the analysis tag, the tag can be handled as a partially different tag by taking these gaps into account. In this way, gaps due to sequence errors can be considered, so the analysis accuracy can be improved.
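One common way to take such gaps into account is the edit (Levenshtein) distance, which counts substitutions, insertions, and deletions together. The sketch below is an illustrative assumption about how a gap-tolerant comparison could be made; the specification does not name a particular algorithm, and the threshold of 3 edits is borrowed from the mismatch criterion used in the classification steps.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def is_partially_different(raw: str, vtag: str, max_edits: int = 3) -> bool:
    """True if raw differs from vtag by 1 to max_edits edits, gaps included."""
    return 1 <= edit_distance(raw, vtag) <= max_edits


if __name__ == "__main__":
    # One base (G) is missing from the raw tag relative to the virtual tag.
    print(edit_distance("GATCACTGATC", "GATCACGTGATC"))
```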

[0211] Hereinafter, the present invention will be further described with reference to examples. The present invention is not limited to these examples.

[0212] <Example 1>

In this example, the development of a digital genome scanning method and its application to disease gene research are described. Digital genome scanning is a technology that enables comprehensive quantitative analysis of human genomic DNA with high accuracy and high resolution.

[0213] The principle of digital genome scanning (DGS) devised by the present inventors is that, in order to quantify the copy number across the whole human genome, short fragments obtained by restriction enzyme treatment of genomic DNA (hereinafter referred to as tags) are counted as representatives of the genome, and regions exhibiting an abnormal copy number are identified on that basis. In the following, the simulations and preliminary experiments for establishing the foundation of the DGS method, and the application of DGS to gastric cancer cell lines, are explained.

[0214] 1. In silico analysis of virtual tags

1.1 Number of virtual tags by restriction enzyme

When starting DGS, the first question is which restriction enzyme should be used to fragment genomic DNA. Therefore, the tags generated by restriction enzyme treatment on a computer using human whole-genome DNA information (hereinafter referred to as virtual tags or V-tags) were counted (Figure 14). As a result, restriction enzymes with 6-base recognition clearly yielded fewer virtual tags than restriction enzymes with 4-base recognition.

[0215] Among the 4-base recognition enzymes, MboI recognizes and cleaves the DNA sequence GATC, so the BamHI site can be used for cloning concatemers (products of ligated tags). In addition, since the number of tags it generates is intermediate compared with other enzymes, the following simulations centered on the virtual tags generated by MboI.

[0216] 1.2 Size distribution of MboI virtual tags

First, the breakdown of DNA fragments generated by MboI treatment of the entire genome was analyzed in silico. When the currently known nucleotide sequence of the human genome (Build 35, hg17) was targeted, a total of 7,056,567 MboI fragments were generated, and 95% of all MboI fragments were 1,377 bases or shorter.

[0217] Of these, for tag gap lengths (the length of the tag excluding the total of 8 bases at the GATC ends of the MboI tag) from 20 to 80 bases, it became clear that approximately 10,000 to 150,000 tags existed per base of length (Figure 15). In addition, tags were generated in proportion to the length of each chromosome without any bias. From these results, it was considered appropriate to collect short MboI restriction fragments to represent the genome.
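The in silico digestion and the gap-length definition given above can be sketched as follows. This is a minimal illustration that assumes the chromosome sequence is available as a plain string; the function names and the toy sequence are hypothetical, and a real run over Build 35 would need sequence file parsing that is not shown here.

```python
from collections import Counter
from typing import Iterator

SITE = "GATC"  # MboI recognition sequence


def virtual_tags(seq: str, min_gap: int, max_gap: int) -> Iterator[tuple[int, str]]:
    """Yield (start, tag) for fragments between adjacent GATC sites.

    The gap length is the tag length minus the 8 bases of the two sites.
    """
    seq = seq.upper()
    left = seq.find(SITE)
    while left != -1:
        right = seq.find(SITE, left + len(SITE))
        if right == -1:
            break
        gap = right - left - len(SITE)
        if min_gap <= gap <= max_gap:
            yield left, seq[left:right + len(SITE)]
        left = right


def gap_length_histogram(seq: str) -> Counter:
    """Count virtual tags by gap length, without any size filter."""
    return Counter(len(tag) - 2 * len(SITE)
                   for _, tag in virtual_tags(seq, 0, len(seq)))


if __name__ == "__main__":
    toy = "TT" + "GATC" + "A" * 5 + "GATC" + "C" * 30 + "GATC" + "GG"
    print(list(virtual_tags(toy, 20, 80)))
    print(gap_length_histogram(toy))
```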

[0218] When the number of virtual tags was analyzed with the size of the tags to be sorted restricted, it was clear that about 540,000 virtual tags could be obtained when sorting over a 40-base width, and about 420,000 when sorting over a 30-base width (Fig. 18).

[0219] 1.3 Analysis of virtual tags derived from repeat sequences

A tag with a sequence that matches multiple locations on the genome cannot be used as a tag in DGS because it cannot identify the site on the genome. Therefore, the present inventors predicted that a tag derived from a repeat sequence is likely to be such an invalid tag. Therefore, among virtual tags, tags derived from repeat sequences were tabulated and their ratios were analyzed.

[0220] As a result of checking against the repeat sequence database, it became clear that about 60% of MboI virtual tags were derived from interspersed repetitive sequences (SINE, LINE, LTR, DNA element) (Fig. 18). When the tag size is restricted to a gap length of 30 to 59 bp, the total number of tags is about 420,000, of which about 250,000 are derived from repeat sequences, and the number of non-repeat tags is 165,845 (39.8%).

[0221] 2. Prediction of DGS resolution by Monte Carlo simulation

Using a random number generation method called Monte Carlo simulation, we predicted in silico how many tags actually need to be analyzed by DGS in order to detect genome copy number abnormalities. Figure 20 shows a random number generation algorithm for simulating gene amplification, gene deletion (homozygous deletion), and loss of heterozygosity (LOH), which are the targets of analysis by DGS.

[0222] With this, when the number of virtual tags was set to 165,845 as described above, it was shown that analysis of 13,800 tags is required to detect a 5-fold gene amplification at 1 Mbp resolution, 44,000 tags to detect a gene deletion at 1 Mbp resolution, and 495,000 tags to detect LOH at 1 Mbp resolution (Figure 22). Conversely, when the number of analysis tags was set to 10,000, it was shown that a 5-fold amplification can be detected at a resolution of 1.34 Mbp and a gene deletion at a resolution of 3.79 Mbp.
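The general idea of such a simulation can be illustrated with a deliberately simplified sketch. It is not the algorithm of Figure 20, and it will not reproduce the figures quoted above. It assumes that the genome is divided into windows holding equal numbers of virtual tags, that each analyzed tag falls into a window at random, and that a window amplified k-fold is k times as likely to be hit. The detection criterion (the amplified window has the highest count) is also an assumption made only for illustration.

```python
import random


def simulate_counts(n_windows: int, n_tags: int, amplified: int,
                    fold: float, seed: int = 0) -> list[int]:
    """Draw n_tags tags at random and count how many land in each window."""
    rng = random.Random(seed)
    # The amplified window is `fold` times as likely to be sampled.
    weights = [fold if w == amplified else 1.0 for w in range(n_windows)]
    counts = [0] * n_windows
    for w in rng.choices(range(n_windows), weights=weights, k=n_tags):
        counts[w] += 1
    return counts


def detection_rate(n_windows: int, n_tags: int, fold: float,
                   trials: int, seed: int = 0) -> float:
    """Fraction of trials in which the amplified window has the top count."""
    hits = 0
    for t in range(trials):
        counts = simulate_counts(n_windows, n_tags, amplified=0,
                                 fold=fold, seed=seed + t)
        if counts[0] > max(counts[1:]):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # About 3,000 windows of 1 Mbp each for a genome of about 3 billion bp.
    for n_tags in (2_000, 10_000, 50_000):
        print(n_tags, detection_rate(3_000, n_tags, fold=5.0, trials=20))
```

Sweeping the number of tags in this way shows qualitatively why more tags are needed for finer resolution: with fewer tags per window, random fluctuation in the unamplified windows can mask the amplified one.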

[0223] 3. In vitro experiment of DGS using genomic DNA of gastric cancer cell line

3.1 Extraction of tag DNA and preparation of concatamers

FIG. 41 is an electropherogram for explaining the purification of tags from the HSC45 genome and the production of concatemers. A preliminary DGS experiment was performed using human genomic DNA extracted from the gastric cancer cell line HSC45. HSC45 genomic DNA was treated with the MboI restriction enzyme (Fig. 41A), the resulting short tags were ligated to produce concatemers, and cloning was attempted.

[0224] Initially, only one tag was introduced into each vector, but the extension efficiency of the tags was improved by dividing the ligation into two stages and increasing the tag concentration, and concatemers with an average of about 3 connected tags were obtained (Fig. 41B).

[0225] Since it was thought that a longer concatamer would be necessary to realize DGS, we devised a method of re-extending the concatamer (Fig. 33). The procedure is shown below.

[0226] 1. Purify a vector containing concatamers once cloned as a primary library.

2. Cut the concatemers out of the primary library vectors with PstI/SpeI (Fig. 41C) and ligate them together again (re-extension of the concatemers).

3. Obtain clones from the secondary library produced in this way.

[0227] As a result, the average length of concatamers increased dramatically, making it possible to obtain an average of about 7 tags per vector (Fig. 41D).

[0228] 3.2 Mass tag analysis

One of the secondary libraries described above (Fig. 41D, #5A) was selected, and 823 clones were recovered and subjected to nucleotide sequence analysis. The obtained sequence information was processed by an automatic analysis program, and the tag sequences sandwiched between MboI sites were extracted (the tag sequences thus obtained from HSC45 genomic DNA are hereinafter referred to as raw tags). FIG. 42 is a graph showing the tag size distribution and the repeat/unique classification. As a result, 5593 raw tags were obtained from 823 clones. The size distribution of the raw tags is shown in Fig. 42C. The longest gap length was 118 bp, the shortest gap length was 0 bp, and the average was 23.8 bp.

[0230] 3.3 Creation of the MboI virtual tag database

In order to identify the position of each obtained raw tag on the genome and to tally the number of tags, a virtual tag database (hereinafter referred to as VT-DB) was created. The size of the MboI virtual tags included in the VT-DB was set to a gap length of 12 to 122 bp. As a result, 1,859,942 MboI virtual tags (Ch1 to Ch22, X, Y) were listed in the VT-DB.

[0231] For each V-tag, the VT-DB includes its ID, chromosome number, position on the chromosome, and sequence, as well as whether it is derived from a repeat and whether it is unique. Unique is defined as being locatable at only one place on the genome, that is, there is no other V-tag with the same sequence.
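A record with the fields listed above, together with the uniqueness rule, might be represented as in the following sketch. The class and field names are hypothetical, the example records are made up, and the sketch compares sequences exactly as stored; whether reverse-complement matches should also count against uniqueness is not stated in the text and is not handled here.

```python
from collections import Counter
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VTag:
    tag_id: str
    chromosome: str
    position: int        # start coordinate on the chromosome
    sequence: str
    is_repeat: bool      # derived from a repeat sequence
    is_unique: bool = False


def mark_unique(records: list[VTag]) -> list[VTag]:
    """Flag a V-tag as unique when no other V-tag has the same sequence."""
    seen = Counter(r.sequence for r in records)
    return [replace(r, is_unique=(seen[r.sequence] == 1)) for r in records]


if __name__ == "__main__":
    db = mark_unique([
        VTag("v1", "Ch8", 1000, "GATCAAAACCCCGATC", is_repeat=False),
        VTag("v2", "Ch1", 5000, "GATCACGTACGTGATC", is_repeat=True),
        VTag("v3", "Ch12", 7000, "GATCACGTACGTGATC", is_repeat=True),
    ])
    for record in db:
        print(record.tag_id, record.is_unique)
```

Keeping the repeat flag and the unique flag as separate fields matters because, as the following paragraphs report, a tag classified as repeat-derived can still be unique.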

[0232] In DGS, raw tags whose sequences match multiple locations on the genome, that is, non-unique raw tags, cannot be assigned a position and must be discarded as invalid votes. For this reason, every V-tag in the VT-DB was also identified as unique or non-unique (Figure 42B). Considering the size of the obtained raw tags, the repeat/non-repeat status of V-tags was also analyzed with the range expanded to gap lengths of less than 20 bp (Fig. 42A).

[0233] Fig. 43 is a diagram showing the correspondence between repeat tags and unique tags in the virtual tag database. As a result, repeat sequences accounted for 63.41% of the virtual tags in the VT-DB, whereas unique sequences accounted for an unexpectedly high 89.37% (Fig. 43A). This suggests that 83.71% of the tags classified as repeat sequences in the genome information are nevertheless unique, so most tags are not wasted.

[0234] On the other hand, there is an unusually large number (120,000) of virtual tags with a gap length of 12 bp; 90% of them are repeats (Fig. 42A) and 80% are non-unique (Fig. 42B), so most of them are invalid tags (Figure 43B).

[0235] 3.4 Classification of raw tags

FIG. 44 is a diagram showing a breakdown of the raw tags acquired from HSC45. The raw tag sequences were checked against the completed VT-DB, and perfect matches (the full length of the raw tag sequence matches a V-tag 100%) were extracted. As a result, of all 5593 raw tags, 3133 (56.02%) matched unique V-tags and 1540 (27.53%) matched non-unique V-tags. The remaining 920 were classified as lost tag #1 (Fig. 44).

[0236] The results of classifying raw tags by size into unique, non-unique, and lost tags are shown in Figure 42C. Considering sequence errors, when matching against the VT-DB was allowed with mismatches of 1 bp or 2 bp, 319 of the 920 lost tags #1 matched V-tags. The remaining 601 tags were named lost tag #3.

[0237] 3.5 Analysis of tag density

FIG. 45 is a graph showing the tag density calculated for various window sizes. The 3133 perfectly matching raw tags obtained were checked against the VT-DB, and the number of votes for each V-tag was calculated. Thereafter, the tag density of each region was calculated, where tag density = (number of unique raw tag votes in the region) / (number of unique V-tags in the region).

[0238] The size of the region used for calculating the density (hereinafter referred to as the window) was determined by the number of unique V-tags. Roughly 554 V-tags are equivalent to 1 Mbp of genome. On this basis, the tag density was calculated with window sizes from 2 Mbp to 10 Mbp (Fig. 6).

[0239] For a window size of 5 Mbp, in addition to the normal calculation method in which adjacent windows do not overlap, the results of a calculation method in which windows overlap by 1/2 of the window size are shown.
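The windowed density calculation, with and without half-window overlap, can be sketched as follows. The sketch assumes that the votes are held as a list with one entry per unique V-tag, ordered along the chromosome, and that the window is expressed as a number of V-tags (about 554 per Mbp according to the text). The function name and the toy data are illustrative.

```python
V_TAGS_PER_MBP = 554  # from the text: roughly 554 unique V-tags per 1 Mbp


def tag_density(votes: list[int], window: int,
                overlap: bool = False) -> list[float]:
    """Tag density per window: votes in the window / V-tags in the window.

    With overlap=True, consecutive windows are shifted by half a window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    step = max(1, window // 2) if overlap else window
    densities = []
    for start in range(0, len(votes), step):
        chunk = votes[start:start + window]
        densities.append(sum(chunk) / len(chunk))
        if start + window >= len(votes):
            break  # the last window reached the end of the chromosome
    return densities


if __name__ == "__main__":
    toy_votes = [0, 1, 0, 0, 3, 4, 5, 4, 0, 0, 1, 0]
    print(tag_density(toy_votes, window=4))
    print(tag_density(toy_votes, window=4, overlap=True))
    print("V-tags in a 5 Mbp window:", 5 * V_TAGS_PER_MBP)
```

Overlapping windows help when an amplified region straddles a window boundary, since a region split evenly between two non-overlapping windows is diluted in both.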

[0240] 3.6 Analysis of amplification region

FIG. 46 is a graph and physical map showing regions with abnormal tag density. Graphs are shown for Ch8 and Ch18, where the tag density is clearly higher than in the surrounding regions (Fig. 46A). Regions that appear to be amplified were observed at the end of Ch8 and at the beginning of Ch18.

[0241] For Ch8, the tag density of the corresponding region was calculated with a window size of 10 kbp (Fig. 7B). When the genes in the corresponding region were displayed as RefSeq genes, the oncogene myc was mapped in the vicinity.

[0242] 4. Summary of Example 1

In this example, in order to establish a foundation for digital genome scanning: 1) a database of MboI virtual tags was created by in silico analysis of human genome information, and its characteristics were analyzed; 2) the target number of analysis tags required for DGS was set by Monte Carlo simulation; and 3) the raw tag acquisition efficiency was greatly improved by re-extending the concatemers.

[0243] DGS was then used for genome analysis of a gastric cancer cell line. 4) Approximately 3000 valid raw tags were obtained. 5) In the tag density analysis, regions that appear to be abnormally amplified were identified.

FIG. 47 is a diagram showing a breakdown of the size and number of the raw MboI tags obtained when DGS analysis was performed using genomic DNA of gastric cancer cell lines. In this example, DGS analysis was performed using genomic DNA of a gastric cancer cell line as a sample. As a result, 9866 raw tags were recovered by MboI restriction enzyme treatment, and 5515 raw tags were classified as unique tags and mapped to the genome.

[0245] Next, the raw tags obtained in Fig. 47 were collated with the virtual tags, and the tag density was calculated.

FIG. 48 is a genome-wide tag density graph obtained when DGS analysis was performed using genomic DNA of a gastric cancer cell line. In Fig. 48, the tag density of the entire genome is shown as an overhead view for each chromosome. As can be seen from FIG. 48, abnormal amplification of tag density was observed at two positions, chromosome 8 and chromosome 12.

[0246] Subsequently, when the amplification region of chromosome 8 was examined in more detail, it was found to be located within a 1 Mbp range of 8q24.21. FIG. 49 is a diagram showing genome amplification at 8q24.21 on the long arm of chromosome 8. The left is the tag density of chromosome 8 and the right is the tag map displayed by the DGS server. The top row of each screen shows the sites of unique virtual tags, the second row shows the sites of non-unique virtual tags, the third row shows the sites of the obtained raw tags, and the lower part shows the gene sites. Figure 49 shows that the c-myc oncogene is present in the amplification region (circled region).

[0247] Next, molecular biological verification of the genomic amplification of c-myc was performed. FIG. 50 shows the relationship between c-myc genomic amplification and mRNA overexpression. As shown on the left side of FIG. 50, genomic amplification of the c-myc region was confirmed by Southern blotting. Furthermore, as shown in the upper right of Fig. 50, real-time PCR for genome quantification confirmed genomic amplification (10 to 15-fold) of the c-myc region in the gastric cancer cell line targeted for analysis. In addition, another gastric cancer cell line showing amplification of the same site was found. As shown in the lower right of Fig. 50, PCR also confirmed that the expression of c-myc mRNA increases in correlation with the degree of genomic amplification (a 9-fold increase compared to the control cell line).

[0248] On the other hand, when the amplification region of chromosome 12 was examined in more detail, it was found that one of the two genomic amplification regions on the short arm of chromosome 12 was located at 12p12.1. FIG. 51 is a diagram showing genome amplification at 12p12.1 on the short arm of chromosome 12. FIG. 51 shows the tag density of chromosome 12 (upper and middle) and the genes present at the same site (lower). Figure 51 shows that the K-ras oncogene (circled region) exists in the amplified region.

FIG. 52 is a tag map showing the distribution of raw tags in a 3 Mbp region centered on the K-ras gene. From this figure, it can be seen that the raw tags are concentrated only in the genomic region where the K-ras gene (circled region) exists.

[0250] Next, the size of the region of genomic amplification at 12p12.1 was determined. FIG. 53 is a diagram showing genome amplification of the K-ras region. Here, the region where abnormal amplification occurred was determined by real-time PCR for genome quantification. As a result, as shown in FIG. 53, it was confirmed that amplification (7-fold) occurred in a 0.5 Mbp region including the K-ras region in the gastric cancer cell line subjected to DGS analysis. Genomic amplification in the region containing K-ras was also observed in three other gastric cancer cell lines.

[0251] Next, molecular biological verification of K-ras genomic amplification was performed. Figure 54 shows the relationship between K-ras genomic amplification and mRNA and protein overexpression. As shown on the left side of FIG. 54, genomic amplification of the K-ras region was confirmed by Southern blotting. Further, as shown in the upper right of FIG. 54, when the increase in K-ras mRNA expression was analyzed by real-time RT-PCR, an increase of about 10-fold was observed. Furthermore, the other 2 genes (LRMP, LOC144363) that are located near K-ras and contained in the amplified region showed increased mRNA expression, as did K-ras. On the other hand, there was no change in the expression of a nearby gene (BCAT1) located outside the amplification region. Furthermore, as shown in the lower right of FIG. 54, when K-ras protein expression was analyzed by Western blotting, overexpression of K-ras protein was confirmed in all four cell lines in which amplification was detected.

[0252] From the above, by performing DGS it was possible to detect that genomic amplification had occurred in the two oncogene regions of c-myc and K-ras on the gastric cancer cell line genome. These regions are at least 0.5 Mbp in size, and DGS appears to have detected the abnormalities with high resolution. In addition, the genomic amplification of these oncogenes is thought to induce an increase in mRNA expression, and thus protein expression, of the genes present at the same site, suggesting the importance of genomic amplification in cancer cells.

FIG. 55 is a diagram showing an outline of the DGS analysis system. The DGS analysis system used in Example 2 described above was constructed using Ensembl as a DGS server that stores all genome information and all tag information as a database. Tag density information is held on the client, and for regions where density abnormalities are recognized, the DGS server can be accessed to extract tag and gene position information and visualize it as a map.

[0254] The present invention has been described based on the embodiments. It is to be understood by those skilled in the art that this embodiment is merely an example, and various modifications are possible, and such modifications are within the scope of the present invention.

[0255] For example, in the above embodiment, the tag density is used as an index of the genome copy number, but other indexes, such as the number of raw tags corresponding to a predetermined virtual tag, may be used. In this case, in a region where virtual tags with a large number of corresponding raw tags occur consecutively, it can be determined that the genomic region is very likely duplicated. Conversely, in a region where virtual tags with a small number of corresponding raw tags occur consecutively, it can be determined that the genomic region is very likely deleted.

Industrial applicability

[0256] As described above, the DNA sequence analyzer according to the present invention has the effect of being able to reliably identify an abnormal copy number of a genomic DNA sequence with high resolution. It is useful as a sequence analysis method and program.

Claims

The scope of the claims

[1] A control tag data acquisition unit that acquires control tag data in which a plurality of control tags are each associated with a corresponding location in a control genomic DNA sequence, the control tags being obtained by cleaving the control genomic DNA sequence with a restriction enzyme, each occurring in the control genomic DNA sequence no more than a predetermined number of times, and each consisting of a DNA sequence of a predetermined number of bases;

An analysis target tag data acquisition unit that acquires analysis target tag data, which is a set of a plurality of analysis target tags that are obtained by cleaving an analysis target genomic DNA sequence with the restriction enzyme and that each consist of a DNA sequence of a predetermined number of bases;

A corresponding tag data generation unit that compares the control tag data with the analysis target tag data and generates corresponding tag data in which mutually corresponding tags among the control tags and the analysis target tags are associated with each other;

A copy number determination unit that analyzes the corresponding tag data, determines the number of the analysis target tags corresponding to the control tag, and, based on that number, determines the difference in copy number, relative to the control genomic DNA sequence, of the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag;

An output unit that outputs data that has undergone processing by the copy number determination unit;

A DNA sequence analyzer comprising:

[2] In the DNA sequence analyzer according to claim 1,

The copy number determination unit includes a tag density determination unit that determines a tag density by dividing the total number of the analysis target tags in a predetermined region of the analysis target genomic DNA sequence by the total number of the control tags in the region of the control genomic DNA sequence corresponding to the predetermined region.
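By way of a non-limiting illustration, the tag density described above can be sketched in Python as follows. The function name `tag_density`, the half-open region convention, and the sample positions are assumptions made for this example. Each analysis target tag is assumed to have already been assigned the genomic position of its corresponding control tag, so a repeated position means that the same tag was observed more than once.

```python
def tag_density(analysis_positions, control_positions, region_start, region_end):
    """Return (analysis target tags in region) / (control tags in region).

    The region is the half-open interval [region_start, region_end).
    Returns None when the region contains no control tag, because the
    ratio is undefined in that case.
    """
    def count_in_region(positions):
        return sum(1 for p in positions if region_start <= p < region_end)

    n_control = count_in_region(control_positions)
    if n_control == 0:
        return None
    return count_in_region(analysis_positions) / n_control


# Hypothetical data: four control tags, eight observed analysis target tags.
control_positions = [100, 200, 300, 400]
analysis_positions = [100, 100, 200, 300, 300, 400, 400, 400]

density = tag_density(analysis_positions, control_positions, 0, 500)
```

In this sketch a region with no control tags yields `None` rather than a density, since such a region carries no information about copy number.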

[3] In the DNA sequence analyzer according to claim 1,

The corresponding tag data generation unit is configured to associate the tags with a predetermined degree of contribution when the analysis target tag corresponds to only one of the control tags, and to associate the tags with a degree of contribution different from the predetermined degree of contribution when the analysis target tag corresponds to two or more of the control tags.

[4] The DNA sequence analyzing apparatus according to claim 1, wherein

The corresponding tag data generation unit is configured to associate completely matching tags among the control tags and the analysis target tags with a predetermined degree of contribution, and to associate partially differing tags with a degree of contribution different from the predetermined degree of contribution.
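As a non-limiting illustration of such degrees of contribution, the following Python sketch gives a complete match one weight, gives a partial match a different weight, and divides the weights when one analysis target tag corresponds to two or more control tags. The specific values (1.0 for a complete match, 0.5 for a partial match, equal division among multiple matches), the one-base mismatch limit, and the names `hamming` and `assign_contributions` are assumptions chosen for this example; the claims require only that the degrees of contribution differ.

```python
def hamming(a, b):
    """Number of differing bases between two sequences of equal length."""
    return sum(x != y for x, y in zip(a, b))


def assign_contributions(analysis_tag, control_tags,
                         exact=1.0, partial=0.5, max_mismatch=1):
    """Map each corresponding control tag to a degree of contribution.

    Complete matches start from `exact`, matches with up to `max_mismatch`
    differing bases start from `partial`, and every weight is then divided
    by the number of corresponding control tags (assumed weighting scheme).
    """
    hits = {}
    for control in control_tags:
        if len(control) != len(analysis_tag):
            continue  # only tags of equal length are compared in this sketch
        distance = hamming(analysis_tag, control)
        if distance == 0:
            hits[control] = exact
        elif distance <= max_mismatch:
            hits[control] = partial
    n = len(hits)
    return {control: weight / n for control, weight in hits.items()}


# Hypothetical control tags.
control_tags = ["ACGT", "ACGA", "TTTT"]

unique_hit = assign_contributions("TTTT", control_tags)
shared_hit = assign_contributions("ACGT", control_tags)
no_hit = assign_contributions("GGGG", control_tags)
```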

[5] The DNA sequence analyzing apparatus according to claim 1, wherein

The copy number determination unit includes an overlap determination unit that analyzes the corresponding tag data and, when the number of the analysis target tags corresponding to the control tag is equal to or greater than a predetermined number, determines that an overlap has occurred in the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.

[6] In the DNA sequence analyzer according to claim 1,

The copy number determination unit includes a deletion determination unit that analyzes the corresponding tag data and, when the number of the analysis target tags corresponding to the control tag is equal to or less than a predetermined number, determines that a deletion has occurred in the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.

[7] The DNA sequence analyzer according to claim 1, wherein

Another genomic DNA sequence data search unit that connects to other genomic DNA sequence data derived from a source different from that of the control genomic DNA sequence and searches the other genomic DNA sequence data;

An origin determination unit that analyzes the corresponding tag data and, when no control tag corresponding to the analysis target tag exists, compares the analysis target tag with the other genomic DNA sequence data to determine the origin of the analysis target tag;

A DNA sequence analyzer further comprising:
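As a non-limiting illustration of the origin determination, the following Python sketch checks an analysis target tag that has no corresponding control tag against sequences from other sources, searching both strands. The function names, the exact substring search, the label `"control"`, and the sample sequences (including the name `phage_like`) are assumptions made for this example and do not represent real genomes or a particular database interface.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def determine_origin(tag, control_tags, other_genomes):
    """Return "control" if the tag has a corresponding control tag; otherwise
    return the name of the first other genome containing the tag on either
    strand, or None if no source is found.
    """
    if tag in control_tags:
        return "control"
    for name, sequence in other_genomes.items():
        if tag in sequence or reverse_complement(tag) in sequence:
            return name
    return None


# Hypothetical data for illustration only.
control_tags = {"AAAACCCC"}
other_genomes = {"phage_like": "TTTACGTACGGG"}

origin_forward = determine_origin("ACGTACG", control_tags, other_genomes)
origin_reverse = determine_origin("CCCGTACG", control_tags, other_genomes)
origin_unknown = determine_origin("GGGGGGGG", control_tags, other_genomes)
```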

[8] In the DNA sequence analyzer according to claim 1,

Further comprising a control tag data generation unit that generates the control tag data from the control genomic DNA sequence,

The control tag data generation unit

A control genomic DNA sequence acquisition unit that acquires the control genomic DNA sequence;

A cleavage site search unit that searches the control genomic DNA sequence for cleavage sites of the restriction enzyme;

A control tag selection unit that selects, from among a plurality of control tags formed by cleaving the control genomic DNA sequence at the cleavage sites, control tags that consist of a DNA sequence having a number of bases within a predetermined range and that occur in the control genomic DNA sequence no more than a predetermined number of times;

Associating the selected control tag with a corresponding location in the control genomic DNA sequence to generate the control tag data;

Including

The DNA sequence analyzing apparatus, wherein the control tag data acquisition unit is configured to acquire the control tag data from the control tag data generation unit.

[9] In the DNA sequence analyzer according to claim 8,

The DNA sequence analysis apparatus, wherein the restriction enzyme is a restriction enzyme that recognizes and cleaves the 4-base sequence GATC.
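As a non-limiting illustration of control tag generation with a GATC-recognizing enzyme, the following Python sketch searches a sequence for GATC sites, takes the fragments lying between adjacent sites, and keeps those whose length falls within a given range and that occur no more than a given number of times. The function name `virtual_tags`, the length limits, and the toy sequence are assumptions made for this example. As a simplification, the recognition sequence itself is excluded from each fragment and only one strand is examined.

```python
def virtual_tags(genome, site="GATC", min_len=4, max_len=12, max_copies=1):
    """Return {tag sequence: [start positions]} for selected control tags."""
    # 1. Search for every cleavage site.
    cuts = []
    pos = genome.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = genome.find(site, pos + 1)

    # 2. Collect fragments between adjacent sites within the length range.
    candidates = {}
    for left, right in zip(cuts, cuts[1:]):
        start = left + len(site)
        fragment = genome[start:right]
        if min_len <= len(fragment) <= max_len:
            candidates.setdefault(fragment, []).append(start)

    # 3. Keep tags that occur no more than `max_copies` times.
    return {tag: starts for tag, starts in candidates.items()
            if len(starts) <= max_copies}


# Toy sequence for illustration only: "ACGTAC" occurs twice, "TTTT" once.
genome = "AAGATCACGTACGATCTTTTGATCACGTACGATCGG"

unique_tags = virtual_tags(genome)
up_to_two = virtual_tags(genome, max_copies=2)
```

Recording the start position of each tag is what allows a selected control tag to be associated with its location in the control genomic DNA sequence.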

[10] Obtaining control tag data in which a plurality of control tags are each associated with a corresponding location in a control genomic DNA sequence, the control tags being obtained by cleaving the control genomic DNA sequence with a restriction enzyme, each occurring in the control genomic DNA sequence no more than a predetermined number of times, and each consisting of a DNA sequence of a predetermined number of bases;

Obtaining analysis target tag data, which is a set of a plurality of analysis target tags each obtained by cleaving the analysis target genomic DNA sequence with the restriction enzyme, each consisting of a DNA sequence of a predetermined number of bases;

Comparing the control tag data with the tag data to be analyzed, and generating corresponding tag data in which corresponding tags among the control tag and the tag to be analyzed are associated with each other;

Analyzing the corresponding tag data, determining the number of the analysis target tags corresponding to the control tag, and, based on that number, determining the difference in copy number, relative to the control genomic DNA sequence, of the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag;

Outputting data that has undergone the step of determining the difference in copy number;

A DNA sequence analysis method comprising:

[11] The DNA sequence analysis method according to claim 10,

The step of determining the difference in the copy number includes a step of determining a tag density by dividing the total number of the analysis target tags in a predetermined region of the analysis target genomic DNA sequence by the total number of the control tags in the region of the control genomic DNA sequence corresponding to the predetermined region.

[12] The DNA sequence analysis method according to claim 10, wherein

The step of generating the corresponding tag data includes a step of associating the tags with a predetermined degree of contribution when the analysis target tag corresponds to only one of the control tags, and associating the tags with a degree of contribution different from the predetermined degree of contribution when the analysis target tag corresponds to two or more of the control tags.

[13] The DNA sequence analysis method according to claim 10, wherein

The step of generating the corresponding tag data includes a step of associating completely matching tags among the control tags and the analysis target tags with a predetermined degree of contribution, and associating partially differing tags with a degree of contribution different from the predetermined degree of contribution.

[14] The DNA sequence analysis method according to claim 10, wherein

The step of determining the difference in the copy number includes a step of analyzing the corresponding tag data and, when the number of the analysis target tags corresponding to the control tag is equal to or greater than a predetermined number, determining that an overlap has occurred in the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.

[15] The DNA sequence analysis method according to claim 10, wherein

The step of determining the difference in the copy number includes a step of analyzing the corresponding tag data and, when the number of the analysis target tags corresponding to the control tag is equal to or less than a predetermined number, determining that a deletion has occurred in the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.

[16] The DNA sequence analysis method according to claim 10, wherein

Further comprising a step of analyzing the corresponding tag data and, when no control tag corresponding to the analysis target tag exists, comparing the analysis target tag with another genomic DNA sequence derived from a source different from that of the control genomic DNA sequence to determine the origin of the analysis target tag.

[17] The DNA sequence analysis method according to claim 10, wherein

Further comprising the step of generating the control tag data from the control genomic DNA sequence;

Generating the control tag data comprises:

Obtaining the control genomic DNA sequence;

Searching the control genomic DNA sequence for cleavage sites of the restriction enzyme;

Selecting, from among a plurality of control tags formed by cleaving the control genomic DNA sequence at the cleavage sites, control tags that consist of a DNA sequence having a number of bases within a predetermined range and that occur in the control genomic DNA sequence no more than a predetermined number of times;

A DNA sequence analysis method comprising:

[18] The DNA sequence analysis method according to claim 17, wherein

The DNA sequence analysis method, wherein the restriction enzyme is a restriction enzyme that recognizes and cleaves the 4-base sequence GATC.

[19] The DNA sequence analysis method according to claim 10, wherein

Further comprising the step of generating the analysis target tag data from the analysis target DNA sequence,

The step of generating the analysis target tag data includes:

Cleaving a DNA molecule containing the DNA sequence to be analyzed with the restriction enzyme;

Extracting a DNA fragment having a predetermined number of bases out of a plurality of DNA fragments obtained by cleaving the DNA molecule with the restriction enzyme;

Sequencing the DNA sequence of the extracted DNA fragments to generate the analysis target tag data;

A DNA sequence analysis method comprising:

[20] The DNA sequence analysis method according to claim 19, wherein

The step of generating the tag data to be analyzed further includes a step of generating a concatemer formed by linking a plurality of DNA fragments that have undergone the extracting step,

The sequencing step includes the step of sequencing the DNA sequence of the concatemer.

[21] The DNA sequence analysis method according to claim 20, wherein

The step of generating the concatemer is

Linking a concatemer formed by linking the plurality of DNA fragments to a vector to generate a concatemer-containing vector;

Amplifying the concatemer-containing vector by introducing the concatemer-containing vector into E. coli, transforming, and culturing the E. coli;

Extracting the concatemer-containing vector from the cultured E. coli;

A DNA sequence analysis method comprising:

[22] The DNA sequence analysis method according to claim 20, wherein

The step of generating the analysis target tag data further includes a step of generating a secondary concatemer that is formed by connecting a plurality of the concatemers.

The sequencing step includes the step of sequencing the DNA sequence of the secondary concatemer.

[23] The DNA sequence analysis method according to claim 22, wherein

The step of generating the secondary concatemer is:

Extracting the concatemer from the concatemer-containing vector;

Generating a secondary concatemer-containing vector by linking a secondary concatemer formed by linking a plurality of types of concatemers to a vector;

Amplifying the secondary concatemer-containing vector by introducing the secondary concatemer-containing vector into E. coli, transforming, and culturing the E. coli;

Extracting the secondary concatemer-containing vector from the cultured E. coli;

A DNA sequence analysis method comprising:

[24] Obtaining control tag data in which a plurality of control tags are each associated with a corresponding location in a control genomic DNA sequence, the control tags being obtained by cleaving the control genomic DNA sequence with a restriction enzyme, each occurring in the control genomic DNA sequence no more than a predetermined number of times, and each consisting of a DNA sequence of a predetermined number of bases;

Analyzing the corresponding tag data, determining the number of the analysis target tags corresponding to the control tag, and, based on that number, determining the difference in copy number, relative to the control genomic DNA sequence, of the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag;

A program that causes a computer to execute.

[25] In the program of claim 24,

The step of determining the difference in the copy number includes a step of determining a tag density by dividing the total number of the analysis target tags in a predetermined region of the analysis target genomic DNA sequence by the total number of the control tags in the region of the control genomic DNA sequence corresponding to the predetermined region.

[26] In the program according to claim 24,

The step of generating the corresponding tag data includes a step of associating the tags with a predetermined degree of contribution when the analysis target tag corresponds to only one of the control tags, and associating the tags with a degree of contribution different from the predetermined degree of contribution when the analysis target tag corresponds to two or more of the control tags.

[27] In the program according to claim 24,

The step of generating the corresponding tag data includes a step of associating completely matching tags among the control tags and the analysis target tags with a predetermined degree of contribution, and associating partially differing tags with a degree of contribution different from the predetermined degree of contribution.

[28] In the program according to claim 24,

The step of determining the difference in the copy number includes a step of analyzing the corresponding tag data and, when the number of the analysis target tags corresponding to the control tag is equal to or greater than a predetermined number, determining that an overlap has occurred in the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.

[29] In the program of claim 24,

The step of determining the difference in the copy number includes a step of analyzing the corresponding tag data and, when the number of the analysis target tags corresponding to the control tag is equal to or less than a predetermined number, determining that a deletion has occurred in the region of the analysis target genomic DNA sequence that includes the location corresponding to the control tag.

[30] In the program of claim 24,

Further comprising a step of analyzing the corresponding tag data and, when no control tag corresponding to the analysis target tag exists, comparing the analysis target tag with another genomic DNA sequence derived from a source different from that of the control genomic DNA sequence to determine the origin of the analysis target tag.

[31] In the program of claim 24,

Generating the control tag data comprises:

Obtaining the control genomic DNA sequence;

Searching the control genomic DNA sequence for cleavage sites of the restriction enzyme;

Selecting, from among a plurality of control tags formed by cleaving the control genomic DNA sequence at the cleavage sites, control tags that consist of a DNA sequence having a number of bases within a predetermined range and that occur in the control genomic DNA sequence no more than a predetermined number of times;

A program comprising:

[32] In the program of claim 31,

The restriction enzyme is a restriction enzyme that recognizes and cleaves the 4-base sequence GATC.