WO2012157778A1

WO2012157778A1 - Gene identification method in fragmentome analysis and expression analysis method

Info

Publication number: WO2012157778A1
Application number: PCT/JP2012/062963
Authority: WO
Inventors: 真澄安倍; 春信湯野川; 伸司佐藤; 一弘近藤; 隆志日永田
Original assignee: 独立行政法人放射線医学総合研究所; 株式会社メイズ
Priority date: 2011-05-19
Filing date: 2012-05-21
Publication date: 2012-11-22
Also published as: JP2012239430A; JP5403563B2

Abstract

A database construction method comprising: a stage of fragmenting genomic DNA contained in a sample, or cDNA obtained from a transcription product, and attaching an identifiable index sequence to obtain a fragment DNA mixture; a stage of performing high-speed DNA sequencing on a portion of the fragment DNA mixture to acquire read sequence data for all fragment DNA contained therein; a stage of checking all read sequence data for the presence of the index sequence portion and extracting the read sequence data that have the index sequence; and a stage of clustering and assembling the extracted sequences using sequence similarity and sequence length as parameters to form a plurality of clusters and, for each cluster, acquiring the number of constituent sequences, the consensus sequence, the consensus sequence length, and alignment information.

Description

Gene identification method and expression analysis method in comprehensive fragment analysis

The present invention relates to a fragment sequence database construction method in comprehensive fragment analysis, and a gene identification method and expression analysis method using the fragment sequence database construction method.

Various expression analysis methods have been developed. For example, microarray techniques are widely used. A microarray detects hybridization between a probe, which contains the sequence to be detected and is immobilized on a substrate, and a nucleic acid contained in a sample. Preparing a probe requires information on the target nucleic acid. In addition, with microarray technology it is difficult to obtain quantitative differences in expression level even for a biological species whose sequence information is available. Moreover, the rate of change in expression level detected for a low-expression gene has low reliability.

In recent years, a technique (RNA-Seq) has been reported in which transcription products are sequenced using high-speed DNA sequencers, represented by the Roche 454FLX, Illumina GAII series / HiSEQ series, LifeTechnology SOLiD series / Ion Torrent PGM series, Helicos, Pacific Bio, and the like, and expression analysis is performed by counting the number of sequences for each gene. This method requires the sequenced reads to be aligned against a reference sequence based on the sequence information of a biological species whose sequence information is known. Furthermore, even when sequence information for the target species is known, only the amounts (that is, the numbers of read sequences) of sequence populations derived from major transcripts can be compared; for a sequence population derived from a minor, low-abundance transcript, the amount has low reproducibility and the comparison result has low reliability.

On the other hand, methods that detect differences in genomic DNA or in the expression level of a transcript and that can be applied to a species having no sequence information have also been proposed. Such methods are referred to as exhaustive fragment analysis methods and include, for example, HiCEP, AFLP, T-RFLP, SAGE, CAGE, Differential Display, and the like. In these methods, either a DNA sequence is cleaved with a restriction enzyme, fragment sequences with a specific sequence at the ends are prepared and amplified by PCR using the specific sequence, and the products are electrophoresed, or the DNA sequence is amplified by PCR using a specific sequence and then electrophoresed. The electrophoresis results (that is, band groups or peak groups) of the obtained fragment DNA sequences are compared between different samples, and band groups or peak groups having different intensities are detected. To perform expression analysis with such an exhaustive fragment analysis technique, each band group or peak group must be fractionated and sequenced one by one to determine its base sequence. Gene identification and expression analysis by such a method therefore require enormous time and cost.

In view of the above situation, an object of the present invention is to provide a gene identification method and an expression analysis method that are simple and retain high reliability, and a fragment sequence database construction method for exhaustive fragment analysis used therein.

According to one aspect of the invention, there is provided a database construction method comprising:
fragmenting the transcript contained in a sample, adding an index sequence, and obtaining a fragment DNA mixture;
obtaining read sequence data for all fragment DNAs contained in a first portion of the fragment DNA mixture by performing high-speed DNA sequencing on that portion;
inspecting all of the read sequence data for the presence or absence of the index sequence portion and extracting the read sequence data having the index sequence; and
performing a clustering process and an assembling process on all of the extracted read sequence data using predetermined parameters to form a plurality of clusters and, for each of the clusters, obtaining the number of constituent sequences of the cluster, the consensus sequence, the consensus sequence length, and alignment information;
wherein the parameters are parameters relating to sequence similarity and sequence length.

The present invention provides a gene identification method and expression analysis method that are simple and highly reliable, and a fragment sequence database construction method in exhaustive fragment analysis used therein.

A flowchart showing an example of the database construction method.
A figure showing an example of the structure of a database.
A flowchart showing an example of the gene identification method.
A figure showing an example of the structure of a database.
Fig. 4 is a flowchart showing a further example that can be used in the gene identification method of Fig. 3.
A flowchart showing an example of the analysis method using a high-speed DNA sequencer.
A flowchart showing an example of the analysis method using a high-speed DNA sequencer.
A flowchart showing an example of the analysis method using a microarray.
A flowchart showing an example of a clustering process.
A flowchart showing an example of a clustering process.
A scheme showing an example of the preparation of a DNA fragment mixture.
A scheme showing an example of the preparation of a DNA fragment mixture using selective PCR.
A schematic diagram showing the relationship between separation of a DNA fragment mixture by fragment length and detection.
A figure showing an example of the evaluation of an adapter sequence.
A conceptual diagram showing an example of the correction of an error cluster.
A conceptual diagram of cluster division by a hetero SNP.
A figure showing an example of the method for correcting the shift between electrophoresis length and sequence length.
A figure showing an example of the method for matching read sequences and peaks.
A figure showing an example of the method for matching read sequences and peaks.
A flowchart showing an example of inspection and classification by index sequence.
A flowchart showing an example of a clustering and assembly process.
A schematic diagram showing an example of matching.
A schematic diagram showing an example of a high-quality sequence.
A figure showing an example of alignment output.
A schematic diagram showing an example of an index sequence.
A schematic diagram showing an example of an index sequence.
A figure showing an example of the input screen for a search by fragment length.
A figure showing an example of the input screen for a search by gene name.
A figure showing an example of the input screen for a BLAST search.
A figure showing an example before and after calibration.
A figure showing an example of a peak.
A figure showing an example of a peak.
A figure showing an example of scoring.
A figure showing correct and incorrect alignment.
A conceptual diagram of frame and order number.
A figure showing an image of the conversion from height to height sequence number.
A figure showing an image of a height sequence number.
A figure showing an example of a profile peak.
A figure showing an example of a profile peak.
A figure showing an example of matching before and after correction.
A figure showing an example of matching.
A graph showing the relationship between sequence length and shift.
A graph showing the relationship between molecular weight and shift.
A graph showing the relationship between amino acid content and shift.
A figure showing the method of calculating the correction.
A block diagram showing an example of the configuration of a computer.

(1) Construction of an exhaustive fragment sequence database utilizing a high-speed DNA sequencer

Hereinafter, an example of construction of an exhaustive fragment sequence database utilizing a high-speed DNA sequencer will be described with reference to FIG.

First, a fragment DNA mixture is prepared for creating the database. The fragment DNA mixture may be prepared by fragmenting the genome or transcript contained in the sample for which the database is to be created and attaching an index sequence to the fragments. This is the mixed solution used in the comprehensive fragment analysis method.

The sample may be prepared from a cell, tissue, organ, or the like as a mixed solution containing a genome or a transcription product by any means known per se. Any treatment prior to fragmentation of the genome or transcript may likewise be performed by any means known per se. Preferably, cDNA is prepared from the transcript, fragmented, and given an index sequence.

Fragmentation of DNA obtained from a genome or a transcript may be performed using a restriction enzyme known per se. An identifiable index sequence may be added to the fragmented DNA, for example, by attaching an adapter sequence to the fragment. The adapter may be attached at the 5' end and/or the 3' end of each fragment. Alternatively, as practiced in the mate pair method, adapters may be attached to the 5' end and/or the 3' end of each fragment, the 5' end and 3' end of the adapter-bearing fragment may then be joined to form a circular nucleic acid, and a linear nucleic acid may be prepared by cleaving the circle at a site other than the sequence corresponding to the adapter.

The base sequence of the adapter and its length may be chosen arbitrarily as long as the adapter can be identified. Here, “index sequence” means a sequence serving as an index that contains enough bases to be identifiable.

For example, an identifiable index sequence may be given to a cDNA fragment by the method of assigning an index sequence used in a fragment analysis method such as the HiCEP method, AFLP method, T-RFLP method, CAGE method, or Differential-display method. More preferably the AFLP method, T-RFLP method, CAGE method, or Differential-display method, and most preferably the HiCEP method, may be used. In these fragment analysis methods, an identifiable index sequence is attached to the cDNA fragments, and the mixture of fragments is subjected to electrophoresis such as gel electrophoresis and/or capillary electrophoresis. Fragments are then generally analyzed comprehensively on the basis of the electrophoretic sequence length obtained (herein also referred to as “molecular weight”, “sequence length”, or “fragment length”).

Read sequences are obtained by applying the fragment mixture thus prepared to a high-speed DNA sequencer.

Here, “high-speed DNA sequencer” refers to a sequencer that can sequence a mixture of multiple types of base sequences of different lengths without separating them. For example, sequencers such as the Roche 454FLX, Illumina GAII series / HiSEQ series, LifeTechnology SOLiD series / Ion Torrent PGM series, Helicos, Pacific Bio, and the like can be used, but the present invention is not limited to these. The high-speed DNA sequencer need not require cloning.

Next, using the two elements of read sequence length and similarity as parameters, the read sequences are clustered and assembled by computer processing. As a result, highly accurate sequence clusters and consensus sequences are created, and the number of read sequences constituting each sequence cluster is counted.

The clustering processing and assembling processing of the read sequences by computer processing will be described in more detail with reference to FIGS. 9A and 9B. FIGS. 9A and 9B show the same series of steps. For convenience, FIG. 9A describes steps 1 to 3 in detail, and FIG. 9B describes steps 4 to 6 in detail. Here, when both the clustering process and the assembling process are performed, the combined process is also referred to as “clustering and assembling” or the “clustering and assembling process”.

Here, “sequence clustering” is used interchangeably with “clustering” and means grouping sequences based on predetermined parameters, preferably base sequence similarity and/or sequence length. Groups resulting from clustering are referred to as “clusters” or “sequence clusters”. A cluster composed of a plurality of sequences having the same length is called an “aligned cluster”, and a cluster composed of a plurality of sequences having different lengths is called an “unaligned cluster”. A cluster consisting of only one sequence is also referred to as a “singleton”, and a singleton may also be treated as a cluster.

Here, “assembling” is used interchangeably with “assembly” and means obtaining a consensus sequence, that is, one representative sequence, from a plurality of nucleic acid sequences that share at least a partially common sequence. It also means obtaining alignment information of the assembled sequences relative to the consensus sequence.

Here, “read sequence” refers to a sequence output from a sequencer.

Here, “consensus sequence” means an artificial sequence obtained by the assembly process.

Step 1: Sequence classification

When a specific sequence appears at both ends of the fragment DNA sequences to be detected by comprehensive fragment analysis, one or both of the end sequences are evaluated and the sequences to be used for clustering and assembling are selected. In other words, it is determined whether or not each read sequence includes the index sequence. A read sequence that includes the index sequence at one or both ends is extracted as a read sequence for database creation and used in the following steps.

Whether or not the index sequence is included may be determined by confirming the presence of the index sequence in the read sequence under examination. The index sequence used for this confirmation may be a base sequence corresponding to the sequence attached to the DNA fragment as an adapter sequence. It may correspond to the entire adapter sequence or to only a part of it, and it may contain additional bases beyond the sequence corresponding to the adapter sequence. Such additional bases may be, for example, any number of arbitrary bases N (a base selected from adenine, thymine, guanine, and cytosine). When arbitrary bases N are included, they preferably extend from the 5' end side or the 3' end side of the sequence corresponding to the adapter sequence. The number of arbitrary bases N may be, for example, 1 or more, 2 or more, 3 or more, 4 or more, or 5 or more; it is preferably 2 per adapter, that is, 2 at each of the 5' and 3' ends of one sequence. When adapters are attached to both ends of the fragment, the 5' end side and the 3' end side may contain arbitrary bases of different kinds.

The index sequence may be present inside the read sequence, but is preferably present at both ends.
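As a minimal sketch of this inspection, the check for an index sequence with two arbitrary bases N next to each adapter can be written as a pattern match. The adapter sequences `ADAPTER_5` and `ADAPTER_3`, the function names, and the requirement that the index sequence be present at both ends in the forward orientation are assumptions made for illustration; they are not taken from the embodiments.

```python
import re

# Hypothetical adapter sequences, chosen only for illustration.
ADAPTER_5 = "ACGTTG"
ADAPTER_3 = "CCGGAA"


def inspect_read(read, adapter_5=ADAPTER_5, adapter_3=ADAPTER_3, n_any=2):
    """Return the arbitrary bases N found next to each adapter, or None.

    The read must start with adapter_5 followed by n_any arbitrary bases
    and end with n_any arbitrary bases followed by adapter_3.
    """
    pattern = (
        rf"{re.escape(adapter_5)}([ACGT]{{{n_any}}})"
        rf"[ACGT]*"
        rf"([ACGT]{{{n_any}}}){re.escape(adapter_3)}"
    )
    match = re.fullmatch(pattern, read)
    return (match.group(1), match.group(2)) if match else None


def extract_indexed_reads(reads):
    """Keep only the reads that carry the index sequence at both ends."""
    return [read for read in reads if inspect_read(read) is not None]
```

A read that lacks either adapter yields `None` and is excluded from the set passed on to clustering. Accepting reads with the index sequence at only one end would need a second, looser pattern.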

Step 2: Clustering and assembling

Clustering and assembling are performed to obtain sequence clusters and their consensus sequences. All of the extracted read sequence data are subjected to sequence clustering processing and assembly processing using predetermined parameters. Thereby, a plurality of sequence clusters are formed, and the number of constituent sequences of the sequence cluster, the consensus sequence, the consensus sequence length, and alignment information are obtained for each of the sequence clusters. As the predetermined parameters, for example, parameters related to sequence similarity, sequence length, and/or index sequence may be used, and preferably parameters related to sequence similarity, sequence length, and index sequence are used.
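The per-cluster outputs named in this step can be illustrated with a small sketch. It assumes the simplest case, a cluster whose reads all have the same length, and derives the consensus by a column-wise majority vote; this stands in for a real assembly step, which would also align reads of different lengths. The function name `summarize_cluster` and the dictionary keys are hypothetical.

```python
from collections import Counter


def summarize_cluster(reads):
    """Summarize one cluster of equal-length reads.

    Returns the number of constituent sequences, a majority-vote
    consensus, its length, and the per-read mismatch count against
    the consensus as a very simple form of alignment information.
    """
    if not reads:
        raise ValueError("a cluster needs at least one read")
    if len({len(read) for read in reads}) != 1:
        raise ValueError("this sketch handles equal-length reads only")

    # Most frequent base in each column; ties fall to the base seen first.
    consensus = "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )
    mismatches = [
        sum(a != b for a, b in zip(read, consensus)) for read in reads
    ]
    return {
        "n_sequences": len(reads),
        "consensus": consensus,
        "consensus_length": len(consensus),
        "mismatches_per_read": mismatches,
    }
```

For a singleton, the consensus is simply the read itself and the mismatch list is `[0]`.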

Step 3: Correction of clustering errors

For each obtained sequence cluster, the alignment information of the consensus sequence and the sequences constituting the cluster is used to evaluate the sequence similarity and the identity of sequence length among the sequences constituting the cluster. The sequence similarity and sequence length between consensus sequences are also evaluated. Errors and contradictions in clustering and assembling are thereby detected, and the created clusters are corrected.

Step 4: Conversion of cluster reliability into data

For each sequence cluster obtained in step 3, the reliability of the consensus sequence as the representative sequence of the cluster is converted into data from the alignment information of the consensus sequence and the read sequences constituting the cluster.

To obtain the reliability of a cluster, for example, the bases adjacent to the index sequence may be evaluated for the consensus sequence of the cluster and for the read sequences constituting the cluster. In that case, the number of adjacent bases evaluated may be 2 or more, and is preferably 2.
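One way to turn the adjacent-base evaluation into a number is sketched below: the fraction of reads in a cluster whose two bases next to the 5' index sequence agree with those of the consensus. This particular score, the function name `adjacent_base_agreement`, and the restriction to the 5' end are assumptions for illustration; the text does not specify how the reliability value is computed.

```python
def adjacent_base_agreement(consensus, reads, index_seq, n_adjacent=2):
    """Fraction of reads whose bases adjacent to the 5' index sequence
    match the corresponding bases of the consensus sequence."""

    def adjacent(seq):
        # Bases immediately following the index sequence, or None if the
        # sequence does not start with it or is too short.
        if not seq.startswith(index_seq):
            return None
        bases = seq[len(index_seq):len(index_seq) + n_adjacent]
        return bases if len(bases) == n_adjacent else None

    reference = adjacent(consensus)
    if reference is None:
        raise ValueError("consensus does not carry the index sequence")
    if not reads:
        raise ValueError("a cluster needs at least one read")

    agreeing = sum(adjacent(read) == reference for read in reads)
    return agreeing / len(reads)
```

A value close to 1.0 suggests that the reads grouped into the cluster share the same bases next to the adapter, while a low value hints that reads from different fragments were merged.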

Step 5: Conversion of consensus sequence reliability into data using known gene information

For the consensus sequences of the sequence clusters obtained in step 4, in a biological species for which known gene information (transcript, genome, EST information, etc.) exists, the known sequence information is searched and consensus sequence reliability data are created.

Step 6: Giving gene information to a consensus sequence

For the consensus sequences of the sequence clusters obtained in step 4, known sequence information is searched and gene information is given to each sequence.

An exhaustive fragment analysis database (hereinafter also referred to as “DB”) can be constructed by the above steps. Steps 4 to 6 are optional steps, and may be performed according to the purpose, for example, when it is desired to ensure higher reliability or when specific gene information is to be added to the database.

In addition, FIG. 2 shows an example of the components included in the database obtained by steps 1 to 6. Through steps 1 to 6 described above, the database comes to include the “consensus sequence”, the “number of constituent sequences of the sequence cluster”, the “consensus sequence length of the sequence cluster”, the “alignment information”, the “reliability data of sequence clustering”, and the “gene information of the sequence cluster”, and these pieces of information may be stored in the storage unit in association with each other.
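The association of these items can be pictured as one record per sequence cluster. The sketch below is only an illustration of that idea: the class name `ClusterRecord`, the field names, and the use of an in-memory dictionary keyed by cluster number are assumptions, not the storage layout of the embodiments.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClusterRecord:
    """One database entry, holding the items listed for a sequence cluster."""
    consensus: str                              # consensus sequence
    n_sequences: int                            # number of constituent sequences
    alignment: List[str] = field(default_factory=list)  # alignment information
    reliability: Optional[float] = None         # optional, from steps 4 and 5
    gene_info: Optional[str] = None             # optional, from step 6

    @property
    def consensus_length(self) -> int:
        # Derived from the consensus so the two can never disagree.
        return len(self.consensus)


def add_cluster(database: Dict[int, ClusterRecord], record: ClusterRecord) -> int:
    """Store a record under the next free cluster number and return it."""
    cluster_id = len(database)
    database[cluster_id] = record
    return cluster_id
```

Leaving `reliability` and `gene_info` optional mirrors the fact that steps 4 to 6 are optional.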

(2) Association of bands or peaks obtained by electrophoresis with sequences

One example of a procedure for associating bands or peaks obtained by electrophoresis with sequences will be described below with reference to FIG.

Using the sequence information of the high-precision consensus sequences obtained in (1) above, their numbers of bases, and the numbers of sequences constituting the sequence clusters, the consensus sequences are associated with the electrophoresis band groups or peak groups of the comprehensive fragment analysis obtained from the DNA mixture that was sequenced (these data are collectively referred to as “reference profiling”).

Specifically, consensus sequences are associated with the reference profiling by comparing the number of bases of each consensus sequence with the molecular weight (or electrophoretic sequence length) of each band or peak obtained by electrophoresis, and the number of sequences constituting the cluster of each consensus sequence with the intensity of each band or peak.

For this association, the number of bases may be calibrated on the basis of the molecular weight and base composition of the sequence, using calibration information derived from experiments performed in advance in which the sequence lengths obtained by electrophoresis were compared with the numbers of bases for a large number of sequences, and the calibrated value may then be used.

Thus, the consensus sequence included in the database (1) is associated with reference profiling.
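The length-based part of this association can be sketched as a nearest-neighbour match after calibration. The sketch assumes a caller-supplied `calibrate` function that converts a number of bases into an expected electrophoretic length, and a hypothetical `tolerance`; the comparison of cluster size with peak intensity, which the procedure also uses, is left out for brevity.

```python
def match_clusters_to_peaks(clusters, peaks, calibrate=lambda n: n, tolerance=1.0):
    """Associate each consensus sequence with the nearest peak.

    clusters: dict of cluster id -> consensus length in bases
    peaks:    dict of peak id -> electrophoretic length
    Returns a dict of cluster id -> peak id, or None when no peak lies
    within `tolerance` of the calibrated length.
    """
    matches = {}
    for cluster_id, n_bases in clusters.items():
        expected = calibrate(n_bases)
        best = min(
            peaks.items(),
            key=lambda item: abs(item[1] - expected),
            default=None,
        )
        if best is not None and abs(best[1] - expected) <= tolerance:
            matches[cluster_id] = best[0]
        else:
            matches[cluster_id] = None
    return matches
```

For example, with a hypothetical constant correction `calibrate=lambda n: n - 1.5`, a 100-base consensus is expected at 98.5 and is matched to a peak observed at 98.6, while a consensus with no peak within the tolerance stays unmatched.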

(3) Gene identification method for bands or peaks obtained by comprehensive fragment analysis, using the database of (1) and the correspondence information of (2)

An example of the gene identification method will be described with reference to FIG.

[Method 1]
Step 1: Data are created that associate the profiling results obtained from the gene identification target sample with the reference profiling used in (2).

Step 2: For each band group or peak group of interest obtained from the gene identification target sample, the corresponding reference profiling band or peak is found using the association data created in Step 1. The cluster created in (1) above is then found from the association information of (2), and its consensus sequence and gene information are obtained. This produces a correspondence list between the band groups or peak groups of interest and the gene information.

Step 3: In addition, a consensus sequence of interest may be selected on the basis of the gene information created in Step 6 of (1) above, and the profiling band or peak obtained from the gene identification target sample may be determined via the reference profiling band or peak associated with it in (2) above.

[Method 2]
Using the number of bases obtained by electrophoresis for one or more bands or peaks of the electrophoresis result from the gene identification target sample, the reference profiling bands of (2) above are arranged and displayed together with pseudo profiling created from the sequence clusters created in (1) above, their numbers of sequences, and their gene information. The gene information of the band or peak of interest in the electrophoresis result from the gene identification target sample is thereby obtained.

(4) Detection by a high-speed DNA sequencer for comprehensive fragment analysis

Fragments can also be analyzed comprehensively by using a high-speed DNA sequencer, for example to detect a target gene. An example of such a method will be described with reference to FIG.

For each of multiple samples to be measured, the mixed solution prepared by the comprehensive fragment analysis method is applied to the same kind of high-speed DNA sequencer to obtain sequences.

For the read sequences obtained from each sample to be measured, the database of (1) is created, and the sequence clusters of different samples are associated with each other based on the similarity of their consensus sequences. The numbers of constituent sequences of the associated sequence clusters are then compared, sequence groups whose quantity has changed are detected, and expression analysis is performed between the first target sample and the second target sample. The counts may be standardized using the number of sequences compared.
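The comparison of cluster sizes between two samples can be sketched as follows. The sketch assumes that clusters have already been associated across samples and are keyed by a shared identifier (here the consensus sequence itself), that counts are standardized to a per-million scale, and that a change is reported when the ratio exceeds a hypothetical `fold` threshold; the pseudocount used to avoid division by zero is likewise an assumption.

```python
def normalized(counts):
    """Scale cluster sizes to counts per million sequences."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no sequences")
    return {key: value * 1_000_000 / total for key, value in counts.items()}


def changed_clusters(counts_a, counts_b, fold=2.0, pseudo=1.0):
    """Return clusters whose standardized size differs between samples.

    counts_a, counts_b: dict of cluster key -> number of constituent
    sequences in the first and second target sample.
    The result maps each changed cluster to the ratio (sample B / sample A).
    """
    norm_a, norm_b = normalized(counts_a), normalized(counts_b)
    changed = {}
    for key in set(norm_a) | set(norm_b):
        ratio = (norm_b.get(key, 0.0) + pseudo) / (norm_a.get(key, 0.0) + pseudo)
        if ratio >= fold or ratio <= 1 / fold:
            changed[key] = ratio
    return changed
```

A cluster present in only one sample is treated as having size zero in the other, so it is reported as changed whenever its standardized size is large enough.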

(5) Detection in exhaustive fragment analysis by a high-speed sequencer using the database as a reference

A further example of exhaustive fragment analysis using a high-speed sequencer will be described with reference to FIG.

A database is created in advance by executing procedure (1) above on the sample to be measured.

For each of multiple samples to be measured, the mixed solution prepared by the comprehensive fragment analysis method is applied to the same type of high-speed DNA sequencer to obtain sequences. This sequencer does not have to be the same as the high-speed DNA sequencer used to create the database in advance.

The read sequences to be measured are clustered by using the consensus sequences of the previously created database as references and performing alignment processing or the like.

The numbers of sequences clustered to the same consensus sequence are compared between measurement samples, sequence groups whose quantity has changed are detected, and expression analysis is performed between the first target sample and the second target sample. The comparison may be made after standardization using the total number of read sequences or the number of sequences used for clustering.
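Reference-based clustering of this kind can be sketched by assigning each read to its most similar consensus sequence and counting the assignments. Python's `difflib.SequenceMatcher` is used here as a stand-in for a real alignment tool, and the function names and the `min_similarity` threshold are assumptions for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def count_against_reference(reads, reference, min_similarity=0.8):
    """Count reads per reference consensus sequence.

    reference: dict of cluster id -> consensus sequence (must be non-empty)
    Reads whose best similarity is below min_similarity are left unassigned.
    """
    if not reference:
        raise ValueError("reference database is empty")
    counts = Counter({cluster_id: 0 for cluster_id in reference})
    for read in reads:
        best_id, best_score = max(
            ((cid, similarity(read, cons)) for cid, cons in reference.items()),
            key=lambda item: item[1],
        )
        if best_score >= min_similarity:
            counts[best_id] += 1
    return counts


def standardize(counts, total_reads):
    """Express counts as a fraction of the total number of read sequences."""
    return {cid: n / total_reads for cid, n in counts.items()}
```

Standardized counts from two samples measured against the same reference can then be compared cluster by cluster, without clustering each sample from scratch.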

(6) Comprehensive fragment analysis method using a microarray created by designing probes from the database

Further, an example of an exhaustive fragment analysis method using a microarray will be described with reference to FIG.

A database is created in advance by executing procedure (1) on the sample to be measured. Probes are designed based on the obtained consensus sequences, and a microarray is created.

For the samples to be measured, the mixed solution prepared by the exhaustive fragment analysis method is applied to the microarray created above to detect sequence groups whose quantity has changed, and expression analysis is performed between a first target sample and a second target sample.

(7) Inspection and classification process using the index sequence

A further example of the inspection and classification process using the index sequence will be described below with reference to FIG.

All sequenced read sequences are read in, and similarity data are calculated against the known index sequences that must exist in those read sequences. Thereafter, each read sequence is checked in turn, by referring to the similarity data, for the presence of a known index sequence. A read sequence in which a known index sequence has been confirmed is classified as a sequence to be used for clustering.

(8) Clustering and assembly process

Hereinafter, a further example of the clustering and assembly process will be described with reference to FIG.

In FIG. 19, each symbol means the following:
M: Number of the read sequence that serves as the seed of a cluster
N: Number of the read sequence being compared, running from the read sequence after the M-th (seed) read sequence to the last read sequence
I: Number of the generated cluster.

All the read sequences to be used for clustering are read in. First, the read sequence that becomes the seed of a cluster, the M-th sequence, is determined. Starting from the read sequence following the seed, all remaining read sequences are examined in turn: the similarity and sequence length of the N-th read sequence are compared with those of the seed, and if the two are determined to be the same, the N-th read sequence is stored in the I-th cluster of the cluster storage area. The clusters are established when the search has been completed for all seed sequences. Thereafter, assembly is performed for each cluster to obtain its consensus sequence.
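The seed-based loop described above can be sketched as follows, with `m`, `n`, and the index into `clusters` playing the roles of M, N, and I. The sketch assumes that a read already stored in a cluster is not used again as a seed, that "the same" means equal length and a similarity ratio above a hypothetical threshold, and that `difflib.SequenceMatcher` stands in for the similarity determination.

```python
from difflib import SequenceMatcher


def same_fragment(a, b, min_similarity=0.8, max_length_diff=0):
    """Judge two reads to be the same using sequence length and similarity."""
    if abs(len(a) - len(b)) > max_length_diff:
        return False
    return SequenceMatcher(None, a, b, autojunk=False).ratio() >= min_similarity


def seed_clustering(reads, **criteria):
    """Greedy clustering: each unassigned read in turn becomes a seed."""
    assigned = [False] * len(reads)
    clusters = []  # clusters[i] holds the reads of the i-th cluster
    for m, seed in enumerate(reads):          # M: seed read
        if assigned[m]:
            continue
        cluster = [seed]
        assigned[m] = True
        for n in range(m + 1, len(reads)):    # N: reads after the seed
            if not assigned[n] and same_fragment(seed, reads[n], **criteria):
                cluster.append(reads[n])
                assigned[n] = True
        clusters.append(cluster)
    return clusters
```

Because every read is compared with every later seed candidate, the number of comparisons grows quadratically with the number of reads; a consensus sequence for each resulting cluster would then be obtained in a separate assembly step.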

(9) Program

In order to perform the method according to any aspect of the present invention, a program may be provided that causes a computer to execute, as procedures, the steps (herein also referred to as “stages”) included in each method. For example, a program is provided for implementing, as procedures, the steps included in (1), (2), (3), (4), (5), (6), (7), and/or (8) above.

For example, a program for causing a computer to execute the method (1) may be stored in any medium and provided.

Such a program is, for example, a program for constructing a database for comprehensive fragment analysis of transcripts, which causes a computer to execute a process including:
a procedure for examining, for all of the read sequence data obtained by high-speed DNA sequencing of a fragment DNA mixed solution that is derived from a transcript contained in a sample and that has been fragmented and given an identifiable index sequence, the presence or absence of the index sequence portion, and extracting the read sequence data having the index sequence;
a procedure for performing a sequence clustering process and an assembling process on all of the extracted read sequence data using predetermined parameters relating to sequence similarity and sequence length, thereby forming a plurality of sequence clusters, and obtaining, for each sequence cluster, the number of constituent sequences of the sequence cluster, the consensus sequence, the consensus sequence length, and alignment information; and
a procedure for constructing a database including the number of constituent sequences, the consensus sequence, the consensus sequence length, and the alignment information associated with each of the sequence clusters.

The computer used in the embodiment of the present invention may be any computer known per se. FIG. 44 schematically shows an example of the configuration of a computer for performing a procedure according to an embodiment of the present invention. The computer includes a process management unit, a storage unit, a temporary recording unit, a program storage unit, a clustering and assembling processing unit, an index sequence inspection unit, a correction data storage unit, a similarity determination unit, and a sequence length determination unit. All of the other components, that is, the storage unit, temporary recording unit, program storage unit, clustering and assembling processing unit, index sequence inspection unit, correction data storage unit, similarity determination unit, and sequence length determination unit, are connected at least to the process management unit so as to be able to exchange signals with it. Further components for performing other desired processing may be included, and such components are likewise connected to the process management unit so as to be able to exchange signals.

All programs are stored in the program storage unit. The process management unit manages and executes all processes in accordance with the programs stored in the program storage unit. The database constructed according to the aspect of the present invention is stored in the storage unit. The read sequences are stored in the storage unit or the temporary recording unit. The index sequence inspection unit inspects whether or not the index sequence is included in a read sequence output from the unit in which it is stored, in accordance with an instruction from the process management unit according to the program stored in the program storage unit. The clustering and assembling processing unit performs clustering and assembling processing on the read sequences. The correction data storage unit stores data used for correcting the obtained data. The data stored in the correction data storage unit and a program for correction from the program storage unit are output, and the process management unit corrects the obtained data on that basis. The similarity determination unit determines the similarity of the objects to be compared in accordance with an instruction from the process management unit according to the program stored in the program storage unit. The sequence length determination unit determines the sequence length of the objects to be compared in accordance with an instruction from the process management unit according to the program stored in the program storage unit.

Furthermore, the computer may have an input unit, such as a keyboard and/or a scanner, for inputting data from an operator or from a high-speed DNA sequencer. The computer may also have an output unit, such as a monitor and/or a printer, for outputting the obtained results. In the above description the process management unit performs the correction, but the computer may instead include a separate correction unit that corrects the data as described above.

The present inventors have found that the conventional techniques have the following problems. These problems are also solved by the present invention.

There are methods, typified by the HiCEP method, in which DNA is cleaved with a restriction enzyme, fragments with a specific sequence added at the ends are prepared, amplified by PCR using the specific sequence, and then electrophoresed, or in which DNA is amplified by PCR using a specific sequence or the like. In these methods, the electrophoresis results (band groups or peak groups) of the resulting fragment DNA are compared between different samples, and band groups or peak groups with different intensities are detected (hereinafter referred to as exhaustive fragment analysis methods). Typical exhaustive fragment analysis methods include those called HiCEP, AFLP, T-RFLP, SAGE, CAGE, and Differential Display.

These exhaustive fragment analysis methods enable exhaustive fragment analysis without existing sequence data. However, methods other than the HiCEP method have a problem that the comprehensiveness is low, or the fragment is short and the gene cannot be specified, and moreover, a plurality of bands or peaks appear from one sequence and the analysis is difficult.

Unlike other exhaustive fragment analysis methods, the HiCEP method is designed so that digestion with restriction enzymes generates only one type of fragment from each target mRNA sequence or genomic sequence (starting sequence). In addition, in the PCR step for detection, primers that are 2 bases (the selection sequence) longer than the adapter sequence are used, and 256 PCRs and electrophoresis runs are performed. In this way more than 20,000 types of fragment sequences can be obtained simultaneously as independent waveform peaks, and each peak corresponds to one type of original sequence. Therefore, by comparing the electrophoresis results obtained with HiCEP, a peak whose intensity has changed between samples indicates a quantitative difference in the original sequence of that fragment. Furthermore, because this technique uses PCR, transcripts with low expression levels can also be detected, and because reproducibility is very good, differences in expression level of 1.2-fold or more can be detected.

However, even in the HiCEP method, as in other comprehensive fragment analysis methods, although bands and peaks that differ in amount can be found, determining their sequences requires a complicated fractionation process.

As a method for solving this problem, for species with known sequence information, determining the sequences by simulating the comprehensive fragment analysis on a computer was considered. However, because of problems such as difficulty in discrimination owing to the excess of known sequence information, even when the sequences of bands and peaks can be predicted, the reliability is in practice low.

In HiCEP, ES cells were used as a sample, and about 14,000 peaks detected by HiCEP were sequenced using a Sanger sequencer to create a database. Creating this database took about three years and high cost, so it is impractical to apply this approach to every sample subject to HiCEP measurement.

Recently, high-speed DNA sequencers have appeared, and there are many studies that use them to sequence genomic DNA and mRNA. However, because the read length is limited, these studies use procedures that are not suitable for sequencing samples for exhaustive fragment analysis, such as making the sequence lengths uniform when the sequencing library is prepared. In addition, reads are mapped to genome sequences and transcripts using sequence similarity alone and clustered per gene, so the approach cannot be applied to species without sequence information, and the expected reproducibility is not obtained except for major sequence groups.

The present invention can also provide the following effects.

(1) Simplification of the method for determining the sequence of the target band or peak

In order to know the sequence of a candidate band or peak obtained by the comprehensive fragment analysis method, the band or peak must be fractionated and sequenced. The fact that the sequence cannot be known at the time the band or peak of interest is detected is a major obstacle to subsequent analysis. The most important drawback is that, when many candidates are obtained from exhaustive fragment analysis, no information is attached to those bands or peaks, so they cannot be narrowed down using existing knowledge. The fractionation targets must then be chosen on grounds such as the ease of subsequent experiments rather than on a scientific basis, and important genes may be dropped from the candidates. Of course, all bands or peaks of interest could be fractionated and sequenced, but if there are many candidates this takes a great deal of cost and time. Another problem is that the fractionation procedure itself is complicated; in particular, when bands and peaks are densely packed at 1-base intervals, cloning and sequencing are required, which makes the process expensive.

As one method for omitting this fractionation process, a method was also constructed in which the HiCEP method is simulated on a computer for organisms with genome information and transcript information, and the virtual fragment sequences obtained from known sequences are matched against the electrophoresis length and number of molecules of the bands or peaks, thereby predicting the sequences of the bands or peaks. However, the sequence length obtained by electrophoresis (referred to as the electrophoresis length) does not always match the actual sequence length of the target fragment, and the number of known sequences is large compared with the number of bands and peaks, so multiple candidates arise. From these facts, it was found that sequences cannot be accurately associated with bands or peaks by sequence length alone.

As another method, we considered that the fractionation process could be omitted by determining the sequences of all bands or peaks of the target sample in advance and creating a database. Therefore, HiCEP was performed on ES cells, and about 14,000 peaks were sequenced with a Sanger sequencer to create a database. The database was useful for the analysis of ES cells, but creating it required an enormous amount of time and effort, and creating such a database by this method for each species and sample to which HiCEP is applied proved difficult.

By creating a sequence database using the comprehensive fragment analysis method together with a high-speed DNA sequencer, the sequences of band or peak groups can be identified comprehensively, in a shorter period and at lower cost than before. Furthermore, since this method does not require known sequences, it can be applied to biological species that have no known sequence.

That is, applying this method provides not only the effect of identifying the sequences of the band groups and peak groups of the comprehensive fragment analysis method, but also the effect of obtaining comprehensive fragment sequences of the genome and transcripts.

(2) New detection method for exhaustive fragment analysis

Determining the sequences of bands and peaks obtained by the exhaustive fragment analysis method, the subject of (1) above, is an important task. It is also conceivable, however, to use a high-speed DNA sequencer itself as the means of detecting fragments that differ between samples in exhaustive fragment analysis.

However, when a high-speed DNA sequencer is used, it is usually necessary to randomly fragment the sequences to be sequenced and make the sequence lengths uniform.

Also, in order to cluster sequenced sequences, a known sequence as a reference is required.

Therefore, it is considered difficult to sequence and analyze the cDNA preparation obtained by the comprehensive fragment analysis method using a high-speed DNA sequencer.

By sequencing the cDNA preparation of the comprehensive fragment analysis method with a high-speed DNA sequencer and constructing a database by this method, the consensus sequence and the number of constituent sequences of each sequence cluster can be obtained. Using the consensus sequences and the numbers of constituent sequences, sequence clusters that differ in quantity between samples can be found. Although high-speed DNA sequencers have their own problems, the number of read sequences can be compared directly between sequence clusters instead of obtaining band groups and peak groups by PCR and electrophoresis, which has the advantage that peaks need not be associated with sequence clusters.

Furthermore, on the premise that the consensus sequences of the sequence clusters created by this method are used as reference sequences, existing analysis methods (RNA-seq on a high-speed DNA sequencer, and microarrays) can be applied to the DNA mixture prepared from the sample by the comprehensive fragment analysis method. This enables analysis that exploits the features of the comprehensive fragment analysis method while avoiding the disadvantages of the existing analysis methods, and the availability of high-precision sequence cluster information with stored reliability data enables analysis with higher accuracy than applying the existing analysis methods alone.

The HiCEP method (High Coverage Expression Profiling) is one of the methods for comprehensive fragment analysis, and enables comprehensive, high-precision gene expression analysis from a very small amount of sample. The greatest feature of the HiCEP method is that low-expression transcripts can be analyzed with high reproducibility and high accuracy. Furthermore, since this method does not require gene sequence information in advance, it can also be applied to biological species whose genomic information is not available. However, this also means that it is difficult to predict the base sequence of the transcript obtained as a profiling peak. Therefore, this method was used to identify the base sequences of the electrophoresis peaks in the expression profile obtained by the HiCEP method.

As shown in FIG. 20, this method is used as follows. In advance, a sample to be measured by the HiCEP method is prepared, sequenced by this method, and clustered and assembled to create a database of sequence clusters; the correspondence between the HiCEP reference profiling peaks and the clusters is also stored. Thereafter, HiCEP is performed on a sample of the same species and tissue as that used to create the database to obtain the profiling to be analyzed, the electrophoresis peaks of interest are listed, and the sequences of those peaks are determined using the previously created sequence cluster database and the association data between clusters and reference profiling peaks.

As shown in FIG. 10, in the HiCEP method, double-stranded cDNA is first synthesized from RNA (total RNA) extracted from a biological sample or from a purified mRNA sample. The cDNA is then cleaved with two appropriate restriction enzymes, and a preparation solution is obtained that contains only cDNA fragments carrying a characteristic adapter at each end. A feature of HiCEP is that only one kind of cDNA fragment having different adapter sequences at its two ends is generated from one kind of starting mRNA.

Furthermore, in the HiCEP method, as shown in FIG. 11, the preparation solution of the cDNA fragment group is divided into 256 portions, 16 types of primers that are 2 bases longer than the adapter sequence (known sequence) are prepared for each end, and PCR is performed with each of the 256 primer combinations. As shown in FIG. 12, the PCR products are applied to a capillary electrophoresis apparatus together with a size marker, and the electrophoresis waveform pattern, the electrophoresis length of each peak, and fluorescence intensity data are obtained as profiling data.
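The figure of 256 reactions follows from the two selection bases at each end: 4 × 4 = 16 dinucleotides per adapter, and 16 × 16 = 256 primer pairs. The following minimal Python sketch enumerates them; the function names are illustrative and are not part of the HiCEP protocol.

```python
from itertools import product

BASES = "ACGT"


def selection_dinucleotides():
    """Return the 16 possible two-base selection sequences (NN)."""
    return ["".join(p) for p in product(BASES, repeat=2)]


def primer_pairs():
    """Return every (5'-side NN, 3'-side NN) combination, one per PCR."""
    nn = selection_dinucleotides()
    return [(left, right) for left in nn for right in nn]


if __name__ == "__main__":
    print(len(selection_dinucleotides()), len(primer_pairs()))
```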

The sequence identification of the profiling peak obtained with this HiCEP was carried out by this method using mouse ES cells (E14) as a sample.

(1) Construction of comprehensive fragment sequence database using high-speed DNA sequencer

The HiCEP method was performed using 1 μg of total RNA from mouse ES cells (E14).

Next, among the steps of the HiCEP method, the “template cDNAs” obtained in the step shown in FIG. 10 were used (a mixture of sequences having, at both ends, the adapters used in the HiCEP method, which serve as index sequences; the length distribution is about 60-base). In order to obtain the amount of DNA necessary for sequencing, amplification was performed with primers on the adapters. Thereafter, for the purpose of removing the primer dimer and adapter dimer fractions, purification by acrylamide gel electrophoresis was performed to remove fragments of 70 to 100 bases or less. The purified product was sequenced with a GS-454 FLX System, a high-speed DNA sequencer manufactured by Roche. The DNA was not fragmented when the sequencing library was prepared. Sequencing yielded 469,318 sequences in the first run (1/2 plate) and 1,868,178 sequences in the second run (2 plates). Using the two elements of sequence length and similarity, we developed the following process, in which these sequences are clustered and assembled by computer to create highly accurate sequence clusters and consensus sequences, and the results, together with the number of read sequences composing each cluster, are compiled into a database.

Step 1: Inspection and classification by index sequence (adapter sequence used for the HiCEP method)

An adapter sequence, which is a specific index sequence, is always attached to both ends of the cDNA fragments of FIG. 10 that are detected by the HiCEP method. The index sequences are evaluated for all read sequences, and the sequences to be used for clustering and assembly are selected.

Specifically, as shown in FIG. 13, all read sequences are searched with the cross_match program (University of Washington) for similarity to 32 types of masking sequences, each consisting of the adapter sequence extended through the selection bases NN. Sequences in which the adapter sequence can be confirmed at both ends or at one end are the targets of clustering and assembly.

(A) Parameters of cross_match

The parameters of the cross_match program are as follows.

A) Minimize the mismatch and gap penalties: -penalty -1 -gap_init -1 -gap_ext -1
The penalty values are minimized to take into account the characteristics of 454 read errors (when a homopolymer, that is, a run of one type of base, is present, the number of bases in the run varies more from read to read).

B) Make the word size smaller: -minmatch 5
This is intended to detect and output as many pairwise alignments as possible.

C) Lower the minimum score: -minscore 15

(B) Index sequence

The adapter sequences on the MspI side and the MseI side are used as index sequences and given to cross_match as the input masking sequences. As shown in FIG. 13, the sequences actually used as index sequences were not the adapter sequences alone but the adapter sequences with the 2 selection bases NN added. To cover all patterns, a total of 32 types of adapter sequences were therefore used, 16 types each for MspI and MseI. An adapter sequence that does not contain the selection bases NN can also be used as an index sequence; however, the 32 types were used because they allow a slightly larger number of index sequences to be confirmed.
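The construction of the 32 masking sequences and the cross_match options listed above can be sketched as follows. This is only an illustration: the adapter strings are dummy placeholders (the actual adapter sequences are those of FIG. 13), the file names are hypothetical, and the order of the two FASTA arguments is an assumption. The sketch builds the argument list without running the program.

```python
from itertools import product

# Dummy placeholder adapters, NOT the real HiCEP adapter sequences.
DUMMY_ADAPTERS = {"MspI": "ACGTACGTAC", "MseI": "TGCATGCATG"}


def index_sequences(adapters):
    """Return adapter + NN for every enzyme and every two-base NN."""
    seqs = {}
    for enzyme, adapter in adapters.items():
        for nn in ("".join(p) for p in product("ACGT", repeat=2)):
            seqs[f"{enzyme}_{nn}"] = adapter + nn
    return seqs


def to_fasta(seqs):
    """Format a name -> sequence mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in seqs.items())


def cross_match_args(reads_fasta, index_fasta):
    """Build the argument list with the option values given in the text.

    The file names and their order are assumptions of this sketch.
    """
    return ["cross_match", reads_fasta, index_fasta,
            "-penalty", "-1", "-gap_init", "-1", "-gap_ext", "-1",
            "-minmatch", "5", "-minscore", "15"]


if __name__ == "__main__":
    print(len(index_sequences(DUMMY_ADAPTERS)))
    print(" ".join(cross_match_args("reads.fa", "index.fa")))
```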

(C) Classification of adapters detected by cross_match

In order to obtain correct HiCEP fragments from the output of cross_match, the adapters detected by cross_match are divided into the following four types, as shown in FIG. 27.

A) High-quality adapter: an adapter with a high score whose alignment includes NN.
B) Low-quality adapter that can be rescued: the alignment is short, but it has no substitutions or gaps and includes NN, so the internal sequence can be expected to be of high quality.
C) Low-quality adapter: an adapter of low quality that cannot be rescued.
D) Fake adapter: a cross_match hit that appears to be an alignment to an internal sequence resembling the adapter (a part that does not actually seem to be an adapter).

(D) Judgment condition for high-quality adapters

29 bp or more of the 33 bp consisting of the adapter sequence + NN match (see FIG. 13).

(E) Judgment condition for rescuing low-quality adapters

In order to make use of the characteristics of the HiCEP method, all 19 bp including the selection bases NN match (see FIG. 13).

(F) Confirmation of sequence classification

Sequences having a high-quality adapter or a rescuable low-quality adapter are used as sequences for clustering and assembly. In this example, of all 469,318 read sequences, 300,635 sequences (64.1%) were sequences in which adapters could be confirmed at both ends, and 112,365 sequences (23.9%) were sequences in which only one adapter could be confirmed (see FIG. 13).
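The four adapter classes and the two numeric conditions can be expressed as a small decision function. The sketch below is one possible reading, not the procedure of the embodiment: the input fields (matched bases, whether NN is covered, length of the exact gap-free match ending at NN, and whether the hit lies at a read end) are hypothetical summaries that would have to be derived from the cross_match output, and treating every internal hit as a fake adapter is an assumption.

```python
def classify_adapter(matched_bp, includes_nn, exact_bp_with_nn, at_read_end=True):
    """Classify one adapter hit as 'high', 'rescuable', 'low' or 'fake'.

    matched_bp       -- bases of the 33 bp adapter+NN that match (assumed input)
    includes_nn      -- True if the alignment covers the selection bases NN
    exact_bp_with_nn -- length of the exact, gap-free match ending at NN
    at_read_end      -- False for hits inside the read (treated as fake here)
    """
    if not at_read_end:
        return "fake"
    if includes_nn and matched_bp >= 29:        # 29 bp or more of 33 bp
        return "high"
    if includes_nn and exact_bp_with_nn >= 19:  # all 19 bp including NN match
        return "rescuable"
    return "low"


def usable_for_clustering(label):
    """Only high-quality and rescuable adapters qualify a read end."""
    return label in ("high", "rescuable")


if __name__ == "__main__":
    print(classify_adapter(31, True, 25))
    print(classify_adapter(21, True, 19))
```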

Step 2: Clustering / Assembling

Among the sequences selected for clustering and assembling, in this embodiment clustering and assembling are performed on the sequences in which adapters were confirmed at both ends, to generate consensus sequences of the HiCEP fragments. This removes the errors of individual read sequences and provides more accurate HiCEP fragment sequences. In addition, the number of read sequences constituting each consensus sequence can be used as reference data for the transcription amount of the HiCEP fragment.

(A) Pre-processing

As pre-processing, for every read sequence to be used for clustering and assembling in which the adapter sequence was confirmed, the confirmed adapter sequence and everything outside it are removed (see FIG. 21). The original adapter base sequence is then artificially attached to the trimmed ends. Likewise, in the quality value information of each read sequence output from the sequencer, the values from the confirmed adapter position outward are removed, and the highest quality value is assigned to the positions corresponding to the artificially attached adapter.
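The pre-processing described above can be sketched as follows. The coordinates of the detected adapters are assumed to be already known from Step 1, the maximum quality value of 40 is an assumption, and the function name is illustrative.

```python
MAX_QUALITY = 40  # assumed highest quality value; not specified in the text


def restore_adapters(seq, qual, left_end, right_start, left_adapter, right_adapter):
    """Trim the detected adapters and everything outside them, then attach
    the canonical adapter sequences with the highest quality value.

    left_end    -- index just past the detected left adapter (0-based)
    right_start -- index at which the detected right adapter begins
    """
    insert = seq[left_end:right_start]
    insert_qual = list(qual[left_end:right_start])
    new_seq = left_adapter + insert + right_adapter
    new_qual = ([MAX_QUALITY] * len(left_adapter)
                + insert_qual
                + [MAX_QUALITY] * len(right_adapter))
    return new_seq, new_qual


if __name__ == "__main__":
    # Toy read: 2 junk bases, a damaged left adapter "AAT", the insert "CGCG",
    # a damaged right adapter "TTA" and 1 junk base (all values are dummies).
    read = "NNAATCGCGTTAN"
    quality = list(range(len(read)))
    print(restore_adapters(read, quality, 5, 9, "AAAT", "TTTA"))
```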

Execute the clustering and assembling program with the sequence group subjected to the above pre-processing as an input, and obtain sequence cluster information, sequence alignment information of each sequence cluster, and consensus sequence of each sequence cluster.

(B) Clustering and assembling software

For clustering and assembling, the TGICL program (published on the Harvard University website, http://compbio.dfci.harvard.edu/tgi/software/) is used. For assembly of sequences, the assembly program CAP3 bundled with TGICL is used.

(C) TGICL parameters

The parameter “-v 2” is a setting that minimizes the overhang permitted during assembly (the part of a sequence that is invalidated and excluded from the assembly result). In assembly, each input sequence has the common HiCEP adapter sequences at both ends, so an alignment output with an overhang indicates that the cluster could not be assembled normally, and such error clusters can thereby be recognized.

The default values were used for the following parameters related to clustering and assembly.

・ Clustering: minimum overlap length 40 bp (-l); overlap match rate 94% (-p)
・ Assembly: minimum match rate 93% (-O -p)

(D) Collection of singleton sequences

Usually, sequences obtained by sequencing randomly cut samples are assigned, on the basis of similarity, to the genomic regions or genes to which they belong, form clusters, and are thereby linked to knowledge. In such a case, a singleton sequence, which forms no cluster, may not be treated as a signal from which knowledge can be obtained. However, a singleton sequence obtained by this method is a sequence in which the index sequence has been confirmed (in the HiCEP method, the adapter sequence), so it is highly reliable, and the fact that exactly one such sequence was sequenced can itself be used as information. Therefore, in this method, processing is performed so that singleton sequences can also be used effectively (accordingly, in this specification, a cluster consisting of only a singleton sequence may also be referred to as a cluster).

In tgicl processing, singleton sequences arise separately at the two processing stages of clustering and assembly. The singleton sequences arising at assembly are output to a singles file created for each thread that executed the assembly, but no information is output for the singleton sequences excluded at clustering. For this reason, the sequences excluded during clustering must be identified and extracted as singleton sequences. At the end of clustering, tgicl outputs to a file a list of the names of the sequences that could be clustered. This file is used to obtain the singleton sequences excluded during clustering.
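The idea in the preceding paragraph amounts to a set difference: every input sequence name that is absent from the list of clustered names is a singleton excluded at clustering. A minimal sketch, assuming the FASTA text and the clustered name list have already been read into memory (the function names are illustrative, not part of tgicl):

```python
def fasta_names(fasta_text):
    """Return the sequence names (first token of each '>' header line)."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            names.append(line[1:].split()[0])
    return names


def clustering_singletons(fasta_text, clustered_names):
    """Names present in the input but missing from the clustered list."""
    clustered = set(clustered_names)
    return [name for name in fasta_names(fasta_text) if name not in clustered]


if __name__ == "__main__":
    fasta = ">r1 len=80\nACGT\n>r2\nGGCC\n>r3\nTTAA\n"
    print(clustering_singletons(fasta, ["r1", "r3"]))
```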

The procedure for collecting singleton sequences is shown below.

A) Edit the clustered sequence name list and create a file in the format of one sequence name per line.
B) From the FASTA-format input file, create a file of all input sequence names in the format of one sequence name per line.
C) Concatenate and sort the two files above, and extract the lines whose sequence name appears only once.
D) Create a sequence database of the input sequences, and extract the sequences obtained in C) in FASTA format.
E) Concatenate the singleton sequence (singlets) file produced during assembly with the file from D).

Step 3: Correction of clustering error by sequence length

The alignment information of the tgicl assembly result is output to an ace-format file. All constituent sequences of a sequence cluster produced from cDNA obtained by the HiCEP method, and the consensus sequence obtained from that cluster, must form an aligned cluster in which the sequences agree not only in sequence but also over their entire length.

(A) Inspection of sequence clusters

In accordance with the above principle, the contents of the ace file are inspected for each contig to detect assembly errors.

A) The entire length of every read is effectively assembled. None of the read sequences that make up the contig may have a clipped part (a part at a read end that is invalidated during alignment and does not contribute to the consensus sequence).
B) The reads are aligned to the full length of the consensus (contig) sequence. Both ends of each read sequence aligned to the consensus sequence must coincide with both ends of the consensus sequence, and must not fall in the middle of the consensus sequence.

A sequence cluster that does not satisfy conditions A) and B) above is regarded as an unaligned cluster and is determined to be an error cluster (see FIG. 14).
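Conditions A) and B) can be checked mechanically once the alignment of each read has been parsed from the ace file. The sketch below assumes a hypothetical, already-parsed representation of each read (1-based start and end positions on the consensus plus clipped lengths); parsing the ace format itself is outside its scope.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    start: int          # first consensus position covered (1-based)
    end: int            # last consensus position covered
    clipped_left: int   # bases invalidated at the left end of the read
    clipped_right: int  # bases invalidated at the right end of the read


def is_error_cluster(consensus_length, reads):
    """Return True if any read violates condition A) or B)."""
    for read in reads:
        if read.clipped_left > 0 or read.clipped_right > 0:
            return True  # A) violated: part of the read was clipped
        if read.start != 1 or read.end != consensus_length:
            return True  # B) violated: read does not span the consensus
    return False


if __name__ == "__main__":
    good = [AlignedRead(1, 120, 0, 0), AlignedRead(1, 120, 0, 0)]
    bad = [AlignedRead(1, 120, 0, 0), AlignedRead(15, 120, 0, 0)]
    print(is_error_cluster(120, good), is_error_cluster(120, bad))
```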

(B) Error cluster repair

Each cluster (contig) determined to be an error cluster in (A) above is taken out and reassembled individually to obtain corrected consensus sequences (see FIG. 14).

A) Repetitive assembly

CAP3 is used for the individual assembly. Assembly is first attempted with a match rate of 93%; the match rate parameter is then increased by 1% at a time and assembly is repeated until the result is no longer judged to be an error cluster.

In the startup parameters, “-k 0” is set to specify an overhang length of 0.
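The repetitive assembly is a simple loop over the match rate. In the sketch below the assembler and the error check are passed in as callables so that the loop itself can be shown without invoking CAP3; both callables, the function name, and the upper limit of 100% are assumptions of this illustration.

```python
def reassemble_until_clean(reads, assemble, is_error, start=93, stop=100):
    """Raise the match rate by 1% per attempt until no error cluster remains.

    assemble(reads, percent) -- hypothetical wrapper that runs the assembler
                                and returns a list of contigs
    is_error(contig)         -- hypothetical error-cluster check
    Returns (percent, contigs), or None if every attempt still has errors.
    """
    for percent in range(start, stop + 1):
        contigs = assemble(reads, percent)
        if not any(is_error(contig) for contig in contigs):
            return percent, contigs
    return None


if __name__ == "__main__":
    # Fake assembler for demonstration: results become clean from 96%.
    def fake_assemble(reads, percent):
        return ["bad"] if percent < 96 else ["ok1", "ok2"]

    print(reassemble_until_clean(["r1", "r2"], fake_assemble, lambda c: c == "bad"))
```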

B) Merging assembly results (modifying ace file)
The ace file output by tgicl before error cluster repair is corrected with the alignment information generated after repair.

i. Create a list of the individually assembled clusters.
ii. Read the ace file output by tgicl in order from the top, and delete the information of any cluster that has been individually assembled.
iii. Insert the individually assembled cluster information at the deleted position.

At this time, the consensus (contig) sequence name is given the name of the original file with a branch number added. The branch number is a 4-digit number padded with zeros following '_' and is assigned in order from "_0001".
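The merge and the branch numbering can be sketched on a simplified model in which the ace file is an ordered list of (contig name, contig data) pairs. That model and the function name are assumptions; real ace records carry more structure than shown here.

```python
def merge_reassembled(original, reassembled):
    """Replace each repaired contig, in place, by its reassembled contigs.

    original    -- ordered list of (contig_name, data) from the tgicl ace file
    reassembled -- dict mapping a repaired contig name to the list of contig
                   data produced by its individual assembly
    """
    merged = []
    for name, data in original:
        if name in reassembled:
            # Branch numbers: '_' followed by a zero-padded 4-digit number.
            for number, new_data in enumerate(reassembled[name], start=1):
                merged.append((f"{name}_{number:04d}", new_data))
        else:
            merged.append((name, data))
    return merged


if __name__ == "__main__":
    ace = [("C1", "a"), ("C2", "b"), ("C3", "c")]
    print(merge_reassembled(ace, {"C2": ["b1", "b2"]}))
```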

C) Collection of singleton sequences

Singleton sequences generated by individual assembly are output to a singles file for each input cluster. These files are concatenated and added to the singleton sequence file created from the tgicl results.

In this example, as shown in Table 1, of all 469,318 read sequences, this processing was performed on the 300,635 read sequences in which adapters could be confirmed at both ends, yielding 15,326 clusters (groups of two or more sequences determined to be the same) and 284,554 singletons (sequences determined to have no other matching sequence). See Table 1 for the results for all 1,868,178 read sequences.

Step 4: Data generation of cluster reliability

For the cluster sequences obtained in Step 3 above, the degree to which the HiCEP selection sequence portion is reliable is evaluated. Specifically, the sequence similarity between the consensus sequence of each sequence cluster and its constituent read sequences is scored. As a result, when the created cluster information is used, for example in the profiling association process of (2), processing can use the sequence cluster reliability calculated here as a threshold.

(A) Pre-processing

The following pre-processing is performed in order to evaluate the selection bases from the clustering and assembly results.

- Check the orientation of each sequence from the adapter sequences and convert it to the forward strand direction (CGG-TTA).
- Replace the adapter sequences with CCGG / TTAA.

(B) Evaluation based on consensus constituent sequences

In this evaluation, the following two types of scores are calculated for the 5′ end and the 3′ end, respectively.

・ Selection base composition ratio (up to the third candidate)
The selection base match rate of the consensus constituent sequences is evaluated; lower-ranked candidates serve as correction candidates when peak matching is not successful. The ideal score is 100% for the first candidate and 0% for the second and third candidates. However, when a heterozygous SNP lies in the selection part, the first and second candidates are 50% each (see FIG. 23).

・ Average edit distance between the consensus sequence and each constituent sequence over the 5 bases from the restriction enzyme site (see FIG. 24)
The ideal score is 0. However, when a heterozygous SNP lies in the selection part, the score is 0.5. FIG. 23 shows the bases recognized as selection bases when calculating the constituent sequence ratio of the selection bases.
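The two scores above can be computed as in the following sketch. It assumes that the selection bases and the 5-base windows have already been extracted from each constituent read, and it uses the standard Levenshtein distance as the edit distance, which the text does not specify; the function names are illustrative.

```python
from collections import Counter


def selection_ratios(selection_bases, top=3):
    """Return up to `top` (NN, fraction) pairs, most frequent first."""
    counts = Counter(selection_bases)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(nn, n / total) for nn, n in counts.most_common(top)]


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def mean_edit_distance(consensus_window, read_windows):
    """Average edit distance between the consensus window and each read."""
    distances = [edit_distance(consensus_window, w) for w in read_windows]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    print(selection_ratios(["GA", "GA", "GT", "GA"]))
    print(mean_edit_distance("ACGTA", ["ACGTA", "ACGTT"]))
```

With a heterozygous SNP in the selection part, half of the reads differ from the consensus by one base, which is how the score of 0.5 mentioned above arises.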

(C) Cluster division

In the HiCEP method, which performs 256 kinds of PCR and electrophoresis, the selection bases (the two bases immediately inside the adapter sequence) are important data for the profiling association process in (2). Therefore, in this example, for all sequence clusters, the bases of the consensus sequence and of each constituent read sequence at the two base positions inside the adapter sequence were recorded as data. Thus, in the HiCEP method of this embodiment, when the bases at the selection base positions of the constituent read sequences of one sequence cluster fall into two types, it can be determined in the profiling association process in (2) that a heterozygous SNP exists in the selection part, and the cluster can be divided into two (FIG. 15).
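A cluster division of this kind might look as follows. The minimum fraction used to decide that a selection-base type is a genuine second allele rather than a read error is a hypothetical parameter that does not appear in the text, and reads carrying any other selection bases are simply left out of the divided clusters in this sketch.

```python
from collections import defaultdict


def split_by_selection(reads, min_fraction=0.2):
    """Divide a cluster in two when its reads show two selection-base types.

    reads        -- list of (read_id, selection_bases) pairs
    min_fraction -- hypothetical threshold below which a type is ignored
    Returns a list of read-id lists: two lists if divided, otherwise one.
    """
    groups = defaultdict(list)
    for read_id, selection in reads:
        groups[selection].append(read_id)
    major = [ids for ids in groups.values()
             if len(ids) >= min_fraction * len(reads)]
    if len(major) == 2:
        return major
    return [[read_id for read_id, _ in reads]]


if __name__ == "__main__":
    cluster = [("r1", "GA"), ("r2", "GA"), ("r3", "GT"), ("r4", "GT")]
    print(split_by_selection(cluster))
```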

Step 5: Data generation of consensus sequence reliability using known gene information

For the consensus sequences of the sequence clusters obtained in Step 4 above, when the species has known gene information (transcript, genome, EST information, etc.), the known sequence information is searched and reliability data for the consensus sequences are created.

(A) Execution of similarity search

For all sequence clusters and singletons, the consensus sequence or singleton read sequence is subjected to a similarity search against known public databases, and the output is stored as data.

・ mRNA: blastn -dust no -task megablast
・ Genome: blat with no parameters (default).

(B) Category classification

The results of the similarity search against public databases are divided into the following four categories. Here, 95-95 denotes a base match rate of 95% or more with an alignment length of 95% or more of the query length, and 95-20base denotes a base match rate of 95% or more with an alignment length of 20 bases or more.

1. 95-95, and CCGG / TTAA are present.
2. 95-95, and CCGG / TTAA, or one of them, differs by one base.
3. 95-20base with a hit at the terminal part (the start or end position on the query sequence is within 4 bases of the end), and CCGG / TTAA is present.
4. 95-20base with a hit at the terminal part, and CCGG / TTAA is present with a one-base difference.
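The four categories can be written as a decision function over a single hit. The sketch below is one interpretation of the list above: the argument names are hypothetical, `site_mismatches` is assumed to be the largest number of differing bases among the restriction sites covered by the hit, and "within 4 bases of the end" is read as a 1-based start of at most 4 or an end within the last 4 query bases.

```python
def classify_hit(identity, aln_len, query_len, q_start, q_end, site_mismatches):
    """Return category 1-4 for a similarity hit, or None if none applies.

    identity        -- base match rate in percent
    aln_len         -- alignment length in bases
    q_start, q_end  -- 1-based alignment coordinates on the query
    site_mismatches -- differing bases in CCGG / TTAA (assumed summary value)
    """
    if identity < 95 or site_mismatches > 1:
        return None
    exact_site = site_mismatches == 0
    if aln_len >= 0.95 * query_len:          # "95-95"
        return 1 if exact_site else 2
    terminal = q_start <= 4 or q_end > query_len - 4
    if aln_len >= 20 and terminal:           # "95-20base" at the terminal part
        return 3 if exact_site else 4
    return None


if __name__ == "__main__":
    print(classify_hit(98.0, 100, 100, 1, 100, 0))
    print(classify_hit(97.0, 30, 100, 2, 31, 1))
```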

In addition, when searching for restriction enzyme sites, the search covers two bases on either side of the alignment position, as follows.

- Query sequence: ccGG XX XXXX ... (SEQ ID NO: 13)
- Subject sequence: ... yy zzZZ YY XXXX ... (SEQ ID NO: 14)

† Uppercase letters indicate aligned sequence. In this example, the alignment starts from the third base of the query sequence. CCGG is searched for in yyzzZZYY, which is the region of the subject sequence at the same coordinates as the restriction enzyme site of the query sequence (zzZZ) plus two bases before and after.

When a restriction enzyme site is searched for with a one-base difference allowed, there may be multiple candidates in yyzzZZYY. For example, for the following sequence:
... CCNGGXX ...
there are two interpretations: CCNG as the site with GX as the selection bases, and CNGG as the site with XX as the selection bases. Since these cannot be distinguished from the sequence alone, when there are multiple candidates, all of them are kept as data.
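The enumeration of candidate sites can be sketched as a scan of the search window for 4-mers within one mismatch of the recognition site, each paired with the two bases that follow it as selection bases. The function name and the tuple layout are illustrative; only the 5′-side orientation (site followed by selection bases) is shown.

```python
def site_candidates(window, site="CCGG", max_mismatch=1, selection_len=2):
    """Return (offset, matched 4-mer, selection bases) for every candidate."""
    candidates = []
    k = len(site)
    for i in range(len(window) - k - selection_len + 1):
        kmer = window[i:i + k]
        mismatches = sum(a != b for a, b in zip(kmer, site))
        if mismatches <= max_mismatch:
            selection = window[i + k:i + k + selection_len]
            candidates.append((i, kmer, selection))
    return candidates


if __name__ == "__main__":
    # Concrete instance of "CCNGGXX" with N = A and XX = TT.
    print(site_candidates("CCAGGTT"))
```

For the window CCAGGTT, both CCAG (selection GT) and CAGG (selection TT) are reported, matching the two interpretations described above.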

For all sequence clusters and singletons, a region and score similar to the ID of a known transcript that was similar to the consensus sequence or singleton lead sequence, and similar to the chromosome number of the similar genomic sequence Sexual areas and scores were stored with the data. As a result, when the created cluster information is used, such as the profiling association process of (2), it is possible to perform a process using the reliability of the array cluster calculated here as a threshold value.

In addition, for the consensus sequences of all sequence clusters, the bases of known sequences that were similar to the bases of the consensus sequence were converted into data for each position of the two bases inside the adapter sequence. Thus, in the case of the HiCEP method of the present embodiment, in the sequence identification process of (3), in the case of a sample of a different individual, the search process is performed on the assumption that the SNP exists even when there is no corresponding sequence cluster. Can do.

Step 6: Giving gene information to the consensus sequence For the consensus sequence of the sequence cluster obtained in the above step 4, the known sequence information is searched and the gene information is given to the sequence.

If the target organism is a species having known gene information (transcript, genome, EST information, etc.), the information is given in step 5. However, exhaustive fragment analysis may detect many unknown transcripts. Therefore, in the step 6, the similarity search is performed on the known sequence information of all the species or a plurality of specific species for the consensus sequence of the sequence cluster, and a known sequence having a high similarity to each consensus sequence is obtained. Associate.

(2) Association of bands or peaks obtained by electrophoresis with sequences

We developed a method for associating each sequence cluster with the group of electrophoresis peaks in the profile (ES cell reference profiling) obtained by performing PCR and electrophoresis on the same “template cDNA solution”, prepared by the HiCEP method, that was used for sequencing. The association uses the consensus sequence and sequence length of each sequence cluster obtained in (1), and the number of read sequences constituting the sequence cluster.

Step 1: Correction of sequence length

It is known that the electrophoresis length and the sequence length of the electrophoresed sequence do not always match (see FIG. 39). One problem in matching peaks to sequences is therefore the deviation between the electrophoresis length and the sequence length. To address this, a database obtained by applying the HiCEP method to existing mouse ES cells was used: for its 37,675 associated peaks and their sequence data, the relationship between the base composition and molecular weight on the one hand, and the deviation between electrophoretic length and sequence length on the other, was examined. As a result, it was found that correction based on the base composition and molecular weight is possible and improves the peak matching accuracy. First, the deviation was found to correlate with the TG (or AC) content of the sequence (see FIG. 42). The deviation also tended to vary with molecular weight (see FIG. 41).

As shown in FIG. 16, for the existing mouse ES cell data, 89% of the sequences had a deviation within ±2 bp without correction, and this rose to 96% after correction by base composition and molecular weight. In addition, the correct-answer rate of the association in Step 2 was 66% without correction and increased to 77% with correction.

(A) Calibration using the relationship between deviation, sequence length, and molecular weight

After apparently inappropriate data were removed from the known sequence data, the relationship between the sequence length and the deviation between the electrophoresis length and the corresponding sequence length was examined. The deviation was found to be biased depending on the sequence length, as shown in the corresponding figure. The relationship between the deviation and the molecular weight, plotted as a scatter diagram, is likewise shown in the corresponding figure. Because the molecular weight takes finer-grained values than the sequence length and is therefore better suited to fine calibration, a calibration table based on the molecular weight was adopted. The loess function, a local regression smoothing function, was used to create the calibration table.

(B) Calibration using the relationship between the deviation and the internal base composition of the sequence

To investigate the relationship between the deviation and the internal base composition of the sequence, the deviation remaining after the “calibration using the relationship between the deviation and the molecular weight” of (A) (hereinafter, the deviation after calibration in (A)) was compared with the internal base composition of each sequence. A negative correlation was found between the A and C content ratios and the deviation after calibration in (A), and a positive correlation between the T and G content ratios and that deviation. To improve the calibration, the relationships between the deviation after calibration in (A) and the combined AC content ratio, and between that deviation and the combined TG content ratio, were plotted as scatter diagrams (see the corresponding figure). These showed a clearer correlation than A, C, T, or G alone. A calibration table is therefore prepared from the relationship between the deviation after calibration in (A) and the AC content ratio, and from its relationship with the TG content ratio, and calibration is performed with it. The loess function, a local regression smoothing function, is used to create the calibration table.

(C) Linear interpolation was used to predict the electrophoretic length from the calibration table. When points A (x_A, y_A) and B (x_B, y_B) listed in the calibration table lie immediately before and after the point X (x_X, y_X) to be obtained, the calibration value y_X is obtained by linear interpolation between A and B (see FIG. 43).
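The interpolation in (C) can be sketched as follows. This is a minimal illustration, assuming the calibration table is held as a list of (x, y) points sorted by x; the function name and the table layout are hypothetical and are not taken from the specification. It uses the standard linear interpolation formula y_X = y_A + (y_B - y_A) × (x_X - x_A) / (x_B - x_A).

```python
from bisect import bisect_right


def interpolate_calibration(table, x):
    """Linearly interpolate a calibration value at x.

    table: list of (x, y) points sorted by x (hypothetical layout).
    """
    xs = [point[0] for point in table]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the calibration table")
    i = bisect_right(xs, x)
    if i == len(xs):
        # x coincides with the last table point.
        return table[-1][1]
    (xa, ya), (xb, yb) = table[i - 1], table[i]
    # Point A lies at or before x, point B lies after it.
    return ya + (yb - ya) * (x - xa) / (xb - xa)
```

For example, with table points (100, 0.0) and (200, 1.0), a query at x = 150 falls halfway between A and B and returns 0.5.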

Step 2: Sequence cluster and peak association processing

Sequence clusters were associated with the peaks of the HiCEP reference profiling using two pairs of values: the sequence length of the consensus sequence of each sequence cluster, as corrected in Step 1, against the electrophoresis length of each peak in the profile obtained by the HiCEP method; and the number of read sequences constituting the cluster against the intensity of the peak obtained by electrophoresis.

Specifically, the procedure is as follows.

(A) Generate “pseudo peaks” from the clustering and assembly results
(B) Apply peak length calibration to the pseudo peaks
(C) Combine pseudo peaks with the same peak length into one pseudo peak
(D) Perform matching with the peak matching algorithm.

As a result, of the 21,778 profiling peaks obtained by the HiCEP method for ES cells, 12,551 peaks (57.6%) were associated with sequence clusters, and 77% of these could be identified by computer processing.

(A) Generate “pseudo peaks” from the clustering and assembly results

A pseudo peak is generated by assigning a peak length and a peak height to the consensus sequence of each sequence cluster, or to each singleton sequence, as follows.

- Peak length: number of bases of the consensus sequence, including the selection bases, + 34
- Peak height: number of reads in the sequence cluster.

The correction value +34 from the sequence length to the electrophoresis length is determined as follows.

The PCR product consists of the fragment DNA sequence with 41 bases added: the 40 bases of primer sequence used in PCR, which include the HiCEP selection bases, plus one thymine that is artificially attached to the end during PCR. The sequence length is the number of bases after the adapter sequence has been removed, and it already includes the 2 selection bases at each end. The theoretical correction value for converting the sequence length to the electrophoresis length is therefore 37 bases, obtained by subtracting those 4 bases (2 selection bases at both ends) from 41. However, it is known that on capillary electrophoresis instruments manufactured by Applied Biosystems (the 3100 in particular), fragments appear at an electrophoresis position 3 bases shorter than this theoretical correction value. Subtracting 3 bases from the theoretical value gives a correction value of 34 bases.
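The arithmetic behind the +34 correction, and its use in generating a pseudo peak, can be written out as a short sketch. The constant and function names are hypothetical; the numbers are those stated in the text.

```python
PRIMER_BASES = 40            # primer sequence added by PCR (includes selection bases)
ADDED_TERMINAL_BASE = 1      # single base attached to the end during PCR
SELECTION_BASES_PER_END = 2  # already counted in the sequence length
INSTRUMENT_SHIFT = 3         # observed shift on the capillary instrument


def length_correction():
    """Return the correction added to the sequence length (34 in the text)."""
    theoretical = (PRIMER_BASES + ADDED_TERMINAL_BASE
                   - 2 * SELECTION_BASES_PER_END)   # 41 - 4 = 37
    return theoretical - INSTRUMENT_SHIFT           # 37 - 3 = 34


def make_pseudo_peak(consensus_length, read_count):
    """Build a pseudo peak as (peak length, peak height).

    consensus_length counts the selection bases; the height is the
    number of reads in the sequence cluster.
    """
    return (consensus_length + length_correction(), read_count)
```

A cluster with a 120-base consensus sequence and 15 reads would give a pseudo peak of length 154 and height 15 before calibration.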

As for the peak height, if the number of reads is used directly as the peak height, the pseudo peaks are much lower than the profile peaks. This is not a problem for the peak matching algorithm of this system, which is not affected by absolute height. When pseudo peaks are visualized, however, the height relationships are hard to see if height = number of reads. In the pseudo peak drawings in this specification, the number of reads is therefore multiplied by a constant coefficient to bring the heights to the same level as the profile peaks (see FIG. 30).

(B) Apply peak length calibration to the pseudo peaks

The peak length of each pseudo peak is calibrated with the calibration table generated in Step 1 above (see FIG. 28).

(C) Combine pseudo peaks with the same peak length into one pseudo peak

When fragments with different internal sequences but the same sequence length exist, they appear as one peak in the HiCEP electrophoresis results, whereas among the sequence clusters created in (1) above they form different sequence clusters. Therefore, when associating peaks with sequence clusters, pseudo peaks whose sequence lengths after calibration are the same are combined into one pseudo peak, and the sum of their heights is used as the height of that pseudo peak (see FIG. 29).

Note that even when the consensus sequences of two sequence clusters have the same sequence length, the electrophoretic lengths of their pseudo peaks may differ considerably once the sequence length calibration is applied. Therefore, peaks whose sequence lengths after calibration are within ±0.25 bases of each other are combined into one peak.
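The merging rule in (C) can be illustrated with a short sketch. The text does not say how groups are formed or which length the merged peak keeps, so two assumptions are made here and should be read as such: peaks are grouped greedily against the first peak of each group, and the merged peak keeps the length of its tallest member. The function names are hypothetical.

```python
def merge_pseudo_peaks(peaks, window=0.25):
    """Merge pseudo peaks whose calibrated lengths lie within `window` bases.

    peaks: list of (calibrated_length, height) tuples.
    Returns a list of merged (length, summed_height) tuples.
    """
    merged, group = [], []
    for length, height in sorted(peaks):
        # Start a new group once the peak is too far from the group's first peak.
        if group and length - group[0][0] > window:
            merged.append(_collapse(group))
            group = []
        group.append((length, height))
    if group:
        merged.append(_collapse(group))
    return merged


def _collapse(group):
    # Assumption: keep the length of the tallest member; heights are summed.
    length = max(group, key=lambda peak: peak[1])[0]
    return (length, sum(height for _, height in group))
```

With peaks at 100.0 (height 5) and 100.2 (height 20), the result is a single pseudo peak of height 25, while a peak at 100.6 stays separate.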

(D) Association by the peak association algorithm

The sequence-to-peak association algorithm uses DP matching as its basic framework and scores each edge independently.

A) Characteristics of pseudo peaks

The characteristics of pseudo peaks are summarized below:
(I) The number of reads is thought to reflect the expression level, but it varies widely, and it is difficult to align the overall heights by applying a constant coefficient, as is done when comparing electrophoresis results with each other;
(II) A consensus sequence with a large number of reads is considered more reliable. Consensus sequences with fewer reads are considered more likely to contain errors in the selection bases and the sequence length;
(III) The deviation between the calibrated sequence length and the electrophoresis length differs from peak to peak, rather than being a roughly constant deviation within each region.

The peak matching algorithm is based on these characteristics.

B) Peak alignment by DP matching

The range over which pair scores are calculated is called a frame. A frame region is set, every combination of a reference profiling peak and a pseudo peak within the frame is scored as a pair candidate, and the combination of pairs with the highest total score is found by the DP matching method (see FIG. 17A, FIG. 31 and FIG. 32). The maximum score of a pair candidate is 1.0. A candidate with a score of 0 or more may become a final pair, whereas a candidate with a negative score is unlikely to become a final pair.

As the specific DP matching method, a modified version of the method of Needleman & Wunsch (1970) was used.
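A minimal sketch of order-preserving DP matching in the spirit of Needleman and Wunsch is given here for illustration. The specification does not state how the method was modified, so this is not the inventors' algorithm: it assumes that leaving a peak unpaired costs 0, and it takes the pair scoring as a caller-supplied function. The names `dp_match` and `score` are hypothetical.

```python
def dp_match(profile, pseudo, score):
    """Return index pairs (i, j) that maximize the total pair score.

    profile, pseudo: peak lists ordered by size.
    score: function(profile_peak, pseudo_peak) -> float, at most 1.0.
    Pairs preserve order; an unpaired peak costs 0 (assumption).
    """
    n, m = len(profile), len(pseudo)
    s = [[score(p, r) for r in pseudo] for p in profile]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j],                      # skip profile peak
                             best[i][j - 1],                      # skip pseudo peak
                             best[i - 1][j - 1] + s[i - 1][j - 1])  # pair them
    # Trace back, preferring to skip on ties so that only pairs which
    # strictly improve the total are reported.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    return pairs[::-1]
```

Because skipping costs nothing, a pair candidate with a negative score never enters the result, which matches the statement that negative-scoring candidates are unlikely to become final pairs.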

C) Pair candidate score

The pair candidate score is the weighted sum of the peak height score and the size score. The maximum of each of the height score and the size score is 1.0, and the weighting factors are chosen so that the maximum pair candidate score is also 1.0:
Pair candidate score = (height score x height weight) + (size score x size weight)
In the example, height weight = 0.5 and size weight = 0.5 were used (see FIG. 17B).
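The weighted sum can be expressed directly. This is a sketch with a hypothetical function name; the default weights are the 0.5 and 0.5 used in the example.

```python
def pair_candidate_score(height_score, size_score,
                         height_weight=0.5, size_weight=0.5):
    """Weighted sum of the height score and the size score.

    With weights that sum to 1.0, the maximum stays at 1.0.
    """
    return height_score * height_weight + size_score * size_weight
```

A candidate with a perfect size score but a height score of -1.0 ends up at 0.0, so a strong mismatch in one component can cancel a perfect match in the other.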

I) Height score

The height score is calculated for each pair candidate as follows. The calculation uses the height order number within the frame instead of the height value itself (see FIG. 17B).

Height score = (error - abs(p.order - r.order)) / error
error = tolerance of the height order number (currently 10)
p.order = height order number of the profile peak
r.order = height order number of the pseudo peak
abs(n) = absolute value of n.
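The height score formula translates into a one-line function. The function and parameter names are hypothetical; the default tolerance of 10 is the value stated above.

```python
def height_score(profile_order, pseudo_order, tolerance=10):
    """Height score from the two height order numbers.

    1.0 when the order numbers agree, 0.0 when they differ by exactly
    the tolerance, and negative beyond it.
    """
    return (tolerance - abs(profile_order - pseudo_order)) / tolerance
```

For instance, order numbers 1 and 16 differ by 15, which exceeds the tolerance of 10 and gives a score of -0.5.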

II) Height order number and frame

Height order numbers 1, 2, 3 ... n are assigned in descending order of height, starting from the highest peak in the frame (see FIG. 33). Profile peaks and pseudo peaks are numbered separately. The frame and the height order numbers are calculated for each profile peak of interest (for each primer set, as many frames are generated as there are profile peaks).

Replacing the height with an order number (see FIG. 34) takes the height relationships into account while avoiding the influence of data characteristic (I), the large variation in pseudo peak height. In addition, the lower a peak is, the less accurately its height order numbers coincide, which corresponds to data characteristic (II).

III) Handling of peaks with the same height

Because the height of a pseudo peak is a number of sequences, peaks with the same height can occur frequently. Peaks with the same height in the frame are given the same order number (see FIG. 35). This order number is obtained by adding the number of peaks having that height. A group of peaks with the same height is likely to consist of peaks with few reads or singletons. Assigning such peaks order numbers that are further away improves the matching accuracy (the influence of noise is reduced).
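One possible reading of this tie rule is sketched here: tied peaks share the order number equal to the count of peaks that are at least as tall, so a large tied group is pushed further down the ranking. This interpretation is an assumption, since the figure that defines the rule is not reproduced in the text, and the function name is hypothetical.

```python
from collections import Counter


def height_order_numbers(heights):
    """Assign height order numbers within one frame.

    Tied peaks share one number: the count of peaks at least as tall
    (an assumed reading of the tie rule).
    """
    counts = Counter(heights)
    order_of = {}
    at_least_as_tall = 0
    for height in sorted(counts, reverse=True):
        at_least_as_tall += counts[height]
        order_of[height] = at_least_as_tall
    return [order_of[height] for height in heights]
```

Under this reading, heights 50, 7, 7, 1, 1, 1 receive order numbers 1, 3, 3, 6, 6, 6, so the three single-read peaks sit well away from the top of the ranking.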

IV) Method for determining the frame width

In the example, the frame covers the range within 27 profile peaks and within 80 bases before and after the profile peak of interest (pseudo peaks are not counted) (FIG. 36).

V) Tolerance of the height order number

There is no penalty (the height score is 0 or more) if the difference in height order number of a pair candidate is within the “tolerance of the height order number”.

VI) Size score

The size score is calculated for each pair candidate as follows (see FIG. 17B).

Size score = (error - abs(p.size - r.size)) / error
error = size tolerance (currently 2 to 4: see below)
p.size = size of the profile peak
r.size = size of the pseudo peak
abs(n) = absolute value of n.

VII) Size tolerance

There is no penalty if the size difference of a pair candidate is within the “size tolerance”. The size tolerance varies between 2 bases and 4 bases.

The procedure for obtaining the size tolerance is shown below.

- Find the distance to the next peak in each direction, before and after the profile peak of interest;
- Take one half (1/2) of the shorter of the two distances as the candidate value;
- If the candidate value is between 2 and 4 bases, use it directly as the size tolerance;
- If the candidate value is smaller than 2 bases, set the size tolerance to 2; if it is larger than 4 bases, set it to 4 (FIG. 37).

Increasing the tolerance as the size increases is also conceivable, but the procedure above was adopted here.
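The size tolerance procedure and the size score can be sketched together. The function names are hypothetical; the limits of 2 and 4 bases are those given in the text.

```python
def size_tolerance(gap_before, gap_after, lower=2.0, upper=4.0):
    """Size tolerance for one profile peak.

    gap_before, gap_after: distances in bases to the neighbouring
    profile peaks. Half of the shorter gap, clamped to [lower, upper].
    """
    candidate = min(gap_before, gap_after) / 2
    return min(max(candidate, lower), upper)


def size_score(profile_size, pseudo_size, tolerance):
    """Size score: 1.0 for identical sizes, 0.0 at the tolerance limit."""
    return (tolerance - abs(profile_size - pseudo_size)) / tolerance
```

A profile peak whose nearest neighbours are 6 and 10 bases away gets a tolerance of 3 bases, so a pseudo peak 1.5 bases away from it scores 0.5.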

VIII) Correction of nearby peak judgment

When a corresponding profiling peak and pseudo peak lie close together, a pair candidate formed by a high-intensity pseudo peak and a nearby profiling peak of similar size but low intensity has a large difference in height order number. Its score is therefore usually low, and it does not become a final pair. However, if no pseudo peak corresponds to the low profiling peak, and the pseudo peak and the profiling peak that should correspond to it are separated by a certain size, the high pseudo peak may be paired with the low profiling peak instead.

As a countermeasure, pairing is corrected for a strong peak and its nearby peaks (FIG. 38). The correction method is as follows.

- If a peak whose intensity is 30% or less of that of the peak to be matched lies within 0.75 bases before or after it, that peak is designated a “low peak in the vicinity” (the left and right sides are distinguished);
- In the score calculation, when one peak of the pair candidate has a low peak in the vicinity, the score is calculated only if the other peak has a low peak in the same position; otherwise, the score is -1 (penalty).

This correction is effective only when the sequence length calibration is applied (without the calibration, the pseudo peaks are spaced at uniform 1-base intervals).

(3) Gene identification of peaks obtained by HiCEP using the database of (1) and the association information of (2)

By (2) above, the sequence clusters and the reference profiling peaks, both created from the same template cDNA prepared by the HiCEP method, are associated with each other, so that a sequence can be identified for each reference profiling peak. The following methods identify the sequences of profiling peaks obtained by applying the HiCEP method to another sample, using the database of (1) and the association information of (2) (see FIGS. 25, 26, and 27).

Method 1
Step 1: Create data in which the HiCEP profiling result obtained from the gene identification target sample is associated with the reference profiling used in (2), based on the electrophoresis length and the peak intensity.

Step 2: For the group of peaks of interest obtained from the gene identification target sample, the corresponding reference profiling peaks are found using the association data created in Step 1 above, the clusters created in (1) are found from the association information of (2), and their consensus sequences and gene information are obtained. As a result, a correspondence list between the peaks of interest and gene information is created.

Step 3: A consensus sequence of interest is selected based on the gene information assigned in Step 6 of (1), and the corresponding electrophoresis peak obtained from the gene identification target sample is found via the reference profiling peak associated in (2).

Method 2
Using the number of bases obtained by electrophoresis for one or more peaks in the electrophoresis result of the gene identification target sample, the reference profiling bands of (2) and the pseudo profiling created from the number of bases and the number of sequences of the sequence clusters created in (1) are displayed side by side together with the gene information. The gene information of the peak of interest in the electrophoresis result of the gene identification target sample is thereby obtained.

Claims

1. A database construction method for comprehensive fragment analysis of genomes or transcripts, comprising:
fragmenting genomic DNA contained in a sample, or cDNA obtained from a transcript contained in the sample, and adding an index sequence to obtain a fragment DNA mixture;
obtaining read sequence data for all fragment DNAs contained in a first portion of the fragment DNA mixture by performing high-speed DNA sequencing on that portion;
examining all of the read sequence data for the presence or absence of the index sequence portion, and extracting the read sequence data having the index sequence;
subjecting all of the extracted read sequence data to sequence clustering processing and assembling processing using predetermined parameters, thereby forming a plurality of sequence clusters, and obtaining, for each of the sequence clusters, the number of constituent sequences, the consensus sequence and consensus sequence length, and alignment information; and
constructing a database including, in association with each of the sequence clusters, the number of constituent sequences of the sequence cluster, the consensus sequence and consensus sequence length, and the alignment information,
wherein the parameters are parameters relating to sequence similarity, sequence length, and the index sequence.
2. The database construction method according to claim 1, further comprising:
electrophoresing a second portion of the fragment DNA mixture, and obtaining from the electrophoresis result the intensity and the electrophoretic sequence length of the band group or peak group derived from each fragment as reference profiling data; and
associating the sequence information and number of bases of the consensus sequence of each sequence cluster, and the number of sequences constituting the sequence cluster, with the band group or peak group of the reference profiling.
3. The database construction method according to claim 2, wherein the association associates the consensus sequence with the reference profiling using, as a first parameter, the relationship between the sequence information and number of bases of the consensus sequence and the molecular weight obtained by electrophoresis of the band or peak of the reference profiling, and, as a second parameter, the relationship between the number of sequences constituting the sequence cluster of the consensus sequence and the intensity of the band or peak of the reference profiling.
4. The database construction method according to claim 2, wherein the association scores the degree of coincidence between the number of sequences and number of bases of each sequence cluster and the intensity and electrophoretic sequence length of each band or peak obtained by the electrophoresis, and associates the sequence clusters with the bands or peaks by selecting the combination that maximizes the total score.
5. The database construction method according to any one of claims 1 to 4, further comprising calibrating the association based on the relationship between the deviation in the association and the sequence length and molecular weight, and/or the relationship between the deviation and the internal base composition.
6. The database construction method according to any one of claims 1 to 5, further comprising, following the step of constructing the database, searching known gene sequence information of the same animal species as that from which the sample is derived, comparing the consensus sequences with the known genes of that species, and recording as data the reliability of the consensus sequences obtained in the method.
7. The database construction method according to any one of claims 1 to 6, further comprising, following the step of constructing the database, recording as data the reliability of the consensus sequence of each sequence cluster based on the alignment information between the consensus sequence of the sequence cluster and the read sequences constituting the cluster.
8. The database construction method according to any one of claims 1 to 7, further comprising, following the step of constructing the database, searching known gene sequence information, comparing the consensus sequences with the known gene sequence information, and assigning the known gene sequence information to the consensus sequences obtained in the method.
9. A method for identifying genes of genomes or transcripts contained in a target sample, comprising:
fragmenting DNA obtained from the genome or transcript contained in the target sample, and adding a distinguishable index sequence, to obtain a target fragment DNA mixture;
electrophoresing the target fragment DNA mixture, and obtaining from the electrophoresis result the intensity and electrophoretic sequence length of each band or peak as gene identification target profiling data; and
associating the gene identification target profiling data with the consensus sequences and the reference profiling data of a database constructed in advance, according to the type of the target sample, by the method according to any one of claims 2 to 8.
10. The method according to claim 9, wherein the association associates the band or peak intensity and electrophoretic sequence length data included in the gene identification target profiling with the band or peak intensity and electrophoretic sequence length of the reference profiling, thereby obtaining gene information of the genome or transcript contained in the target sample.
11. The method according to claim 9, wherein the association uses the electrophoretic sequence length data included in the gene identification target profiling, the electrophoretic sequence length of the bands or peaks of the reference profiling, and pseudo profiling created from the number of bases and the number of sequences of the sequence clusters, thereby obtaining gene information of the transcript contained in the target sample.
12. A comprehensive fragment analysis method comprising:
fragmenting DNA obtained from the genomes or transcripts contained in each of first to nth target samples, and adding an index sequence, to obtain first to nth fragment DNA mixtures, respectively (where “n” represents an integer of 2 or more);
performing high-speed DNA sequencing on each of the first to nth fragment DNA mixtures to obtain first to nth read sequence data for all fragment DNAs contained in each mixture;
examining all of the sequence data in each of the first to nth read sequence data for the presence or absence of the index sequence portion, and extracting the first to nth read sequence data having the index sequence, respectively;
subjecting each of the extracted first to nth read sequence data to sequence clustering processing and assembling processing using predetermined parameters, thereby forming first to nth sequence cluster groups, and obtaining, for each of the first to nth sequence cluster groups, the number of constituent sequences of each sequence cluster, the consensus sequence and consensus sequence length, and alignment information;
constructing databases containing first to nth sequence cluster group information, each including, in association with each sequence cluster of the first to nth sequence cluster groups, the number of constituent sequences, the consensus sequence and consensus sequence length, and the alignment information; and
associating the first to nth sequence cluster groups with one another by the similarity of their consensus sequences, comparing the numbers of sequences constituting the corresponding sequence clusters, and detecting sequence clusters whose quantity has changed.
13. A comprehensive fragment analysis method comprising:
fragmenting DNA obtained from the genome or transcript contained in a target sample, and adding an index sequence, to obtain a fragment DNA mixture;
performing high-speed DNA sequencing on the fragment DNA mixture to obtain read sequence data for all fragment DNAs contained therein;
examining all of the read sequence data for the presence or absence of the index sequence portion, and extracting the read sequence data having the index sequence;
performing sequence clustering processing on the read sequence data, using as a parameter the sequence similarity with the consensus sequences of a database constructed in advance, according to the type of the target sample, by the method according to any one of claims 2 to 6, to obtain a sequence cluster group of the target sample; and
comparing the sequence cluster group of the target sample with the sequence cluster group included in the database, and detecting clusters that exist only in the sequence cluster group of the target sample and/or clusters that exist only in the sequence cluster group of the database.
14. A comprehensive fragment analysis method comprising:
fragmenting DNA obtained from the genome or transcript contained in each of first to nth target samples, and adding an index sequence, to obtain first to nth fragment DNA mixtures (where “n” represents an integer of 2 or more);
performing high-speed DNA sequencing on each of the first to nth fragment DNA mixtures to obtain first to nth read sequence data for all fragment DNAs contained therein;
examining all of the sequence data in each of the first to nth read sequence data for the presence or absence of the index sequence portion, and extracting the first to nth read sequence data having the index sequence, respectively;
performing clustering processing on each of the first to nth read sequence data, using as a parameter the sequence similarity with the consensus sequences of a database constructed in advance, according to the type of each of the first to nth target samples, by the method according to any one of claims 2 to 6, to obtain first to nth sequence cluster groups; and
comparing, for each identical consensus sequence included in the first to nth sequence cluster groups, the numbers of sequences constituting the respective clusters, and identifying the sequence clusters that show a difference in the number of sequences among the first to nth sequence cluster groups.
15. A comprehensive fragment analysis method comprising:
constructing a database in advance, in accordance with the purpose, by the method according to any one of claims 2 to 6;
creating a microarray by immobilizing on a substrate a probe group designed and prepared based on the consensus sequences included in the constructed database;
fragmenting DNA obtained from the genome or transcript contained in a target sample, and adding an index sequence, to obtain a fragment DNA mixture;
contacting the fragment DNA mixture with the probe group to obtain a hybridization signal; and
detecting the presence of a transcript contained in the target sample based on the obtained hybridization signal.
16. A comprehensive fragment analysis method comprising:
constructing a database in advance, in accordance with the purpose, by the method according to any one of claims 2 to 6;
taking as one set a group of probes designed and prepared based on the consensus sequences included in the constructed database, and immobilizing one set on each of n substrates to create n microarrays (where “n” represents an integer of 2 or more);
fragmenting DNA obtained from the genome or transcript contained in each of first to nth target samples, and adding an index sequence to each, to obtain first to nth fragment DNA mixtures;
contacting each of the first to nth fragment DNA mixtures with the probe group immobilized on a respective one of the n microarrays to obtain respective hybridization signals;
comparing the abundance of the transcripts contained in the first to nth target samples based on the respective hybridization signals obtained; and
detecting, by the comparison, a transcript whose abundance differs among the first to nth target samples.
17. A program for constructing a database for comprehensive fragment analysis of transcripts, the program causing a computer to execute processing including:
a procedure of examining, for all read sequence data obtained by high-speed DNA sequencing of a fragment DNA mixture prepared by fragmenting DNA from the genome or transcript contained in a sample and adding an index sequence, the presence or absence of the index sequence portion, and extracting the read sequence data having the index sequence;
a procedure of subjecting all of the extracted read sequence data to sequence clustering processing and assembling processing using predetermined parameters of sequence similarity, sequence length, and index sequence, thereby forming a plurality of sequence clusters, and obtaining, for each of the sequence clusters, the number of constituent sequences, the consensus sequence and consensus sequence length, and alignment information; and
a procedure of constructing a database including, in association with each of the sequence clusters, the number of constituent sequences of the sequence cluster, the consensus sequence and consensus sequence length, and the alignment information.