WO2012157778A1 - Gene identification method in fragmentome analysis and expression analysis method - Google Patents

Info

Publication number
WO2012157778A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
cluster
fragment
data
sequences
Prior art date
Application number
PCT/JP2012/062963
Other languages
French (fr)
Japanese (ja)
Inventor
真澄 安倍
春信 湯野川
伸司 佐藤
一弘 近藤
隆志 日永田
Original Assignee
独立行政法人放射線医学総合研究所
株式会社メイズ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 独立行政法人放射線医学総合研究所, 株式会社メイズ
Publication of WO2012157778A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6809: Methods for determination or identification of nucleic acids involving differential detection

Definitions

  • the present invention relates to a fragment sequence database construction method in comprehensive fragment analysis, and a gene identification method and expression analysis method using the fragment sequence database construction method.
  • the microarray is a method for detecting hybridization between a probe containing a sequence to be detected immobilized on a substrate and a nucleic acid contained in a sample.
  • For a microarray, information on the target nucleic acid is necessary in order to prepare a probe.
  • The reliability of the rate of change in expression level detected for a low-expression gene is low.
  • As methods for detecting a difference in genomic DNA or a difference in the expression level of a transcript, methods that can be applied to a species having no sequence information have also been proposed.
  • Such a method is also referred to as an exhaustive fragment analysis method and includes, for example, HiCEP, AFLP, T-RFLP, SAGE, CAGE, Differential Display, and the like.
  • In these methods, a DNA sequence is cleaved with a restriction enzyme, fragments with a specific sequence at the end are prepared and amplified by PCR using the specific sequence, and the product is subjected to electrophoresis; alternatively, the DNA sequence is amplified by PCR using a specific sequence and then subjected to electrophoresis.
  • In the electrophoresis results, that is, the band groups or peak groups, band groups or peak groups having different intensities are detected.
  • In order to perform expression analysis, it is necessary to classify each band group or peak group and sequence them one by one to determine the base sequence.
  • This requires enormous time and cost.
  • An object of the present invention is to provide a gene identification method and an expression analysis method that are simple and retain high reliability, and a fragment sequence database construction method in exhaustive fragment analysis used therein.
  • A database construction method is provided, comprising: fragmenting the transcript contained in the sample and adding an indicator sequence to obtain a fragment DNA mixture; obtaining read sequence data for all fragment DNAs contained therein by performing high-speed DNA sequencing on a first portion of the fragment DNA mixture; inspecting all of the read sequence data for the presence or absence of the indicator sequence portion and extracting the read sequence data having the indicator sequence; and performing, on all of the extracted read sequence data, a clustering process and an assembling process using predetermined parameters to form a plurality of clusters, and obtaining, for each of the clusters, the number of sequences constituting the cluster, a consensus sequence and consensus sequence length, and alignment information; wherein the parameters are parameters relating to sequence similarity and sequence length.
  • the present invention provides a gene identification method and expression analysis method that are simple and highly reliable, and a fragment sequence database construction method in exhaustive fragment analysis used therein.
  • A flowchart showing an example of the database construction method. A diagram showing an example of the structure of the database.
  • A flowchart showing an example of the gene identification method. A diagram showing an example of the structure of a database.
  • FIG. 4 is a flowchart showing a further example that can be used in the gene identification method of FIG. 3.
  • A flowchart showing an example of the analysis method using a high-speed DNA sequencer.
  • A flowchart showing an example of the clustering process.
  • A scheme showing a preparation example of a DNA fragment mixture. A scheme showing a preparation example of a DNA fragment mixture using selective PCR.
  • A flowchart showing an example of the clustering and assembly process.
  • A schematic diagram showing an example of the parameters.
  • A figure showing an example of a peak. A figure showing an example of scoring.
  • A graph showing the relationship between molecular weight and shift.
  • A block diagram showing an example of the configuration of a computer.
  • the fragment DNA mixed solution may be prepared by fragmenting a genome or transcript contained in a sample for which a database is to be created, giving an index sequence. This is a mixed solution for the comprehensive fragment analysis method.
  • the sample may be prepared from a cell, tissue, organ, or the like into a mixed solution containing a genome or a transcription product by any means known per se. Prior to fragmentation of the genome or transcript, it may be performed by any means known per se. Preferably, cDNA is prepared from the transcript, and this is fragmented to give a labeling sequence.
  • Fragmentation of DNA obtained from a genome or a transcript may be performed using a restriction enzyme known per se. An identifiable indicator sequence may be added to the fragmented DNA, for example, by adding an adapter sequence to the fragment. The adapter may be attached at the 5' end and/or the 3' end of each fragment. Alternatively, as practiced in the mate pair method, an adapter may be attached to the 5' end and/or the 3' end of each fragment, the 5' end and the 3' end of the adapter-bearing fragment may then be joined to form a circular nucleic acid, and a linear nucleic acid may be prepared by cleaving the circle at a site other than the sequence corresponding to the adapter.
  • the base sequence of the adapter and its length may be arbitrarily determined as long as they can be identified.
  • The “index sequence” means a sequence serving as an index that contains enough bases to be identifiable.
  • a method for assigning an index sequence in a fragment analysis method such as HiCEP method, AFLP method, T-RFLP method, CAGE method, and Differential-display method is used to give an identifiable indicator sequence to a cDNA fragment. More preferably, AFLP method, T-RFLP method, CAGE method and Differential-display method, most preferably HiCEP method may be used.
  • a discriminating indicator sequence is imparted to the cDNA fragment, and the mixture of the fragments is subjected to electrophoresis such as gel electrophoresis and / or capillary electrophoresis.
  • fragments may be comprehensively analyzed by performing an analysis method by obtaining an electrophoretic sequence length (herein also referred to as “molecular weight” or “sequence length” or “fragment length”).
  • Read sequences are obtained by applying the fragment mixture thus prepared to a high-speed DNA sequencer.
  • A high-speed DNA sequencer refers to a sequencer that can sequence multiple types of base sequences having different lengths without separating them.
  • Examples include the Roche 454 FLX, the Illumina GAII and HiSeq series, the Life Technologies SOLiD and Ion Torrent PGM series, and sequencers provided by Helicos, Pacific Biosciences, and others.
  • The present invention is not limited to these.
  • the high-speed DNA sequencer may not require cloning.
  • The read sequences are clustered and assembled by computer processing.
  • As a result, highly accurate sequence clusters and consensus sequences are created, and the number of read sequences constituting each sequence cluster is counted.
  • FIGS. 9A and 9B show the same series of steps.
  • FIG. 9A describes in detail steps 1 to 3
  • FIG. 9B describes in detail steps 4 to 6.
  • this process is also referred to as “clustering and assembling” or “clustering and assembling process”.
  • “Sequence clustering” is used interchangeably with “clustering” and means grouping sequences based on predetermined parameters, preferably base sequence similarity and/or sequence length. Groups resulting from clustering are referred to as “clusters” or “sequence clusters”. A cluster composed of a plurality of sequences having the same length is called an “aligned cluster”, and a cluster composed of a plurality of sequences having different lengths is called an “unaligned cluster”. A cluster consisting of only one sequence is referred to as a “singleton”, and a singleton may also be treated as a cluster.
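The three cluster types defined above can be told apart from the member sequences alone. The following is a minimal sketch under that reading; the function name `cluster_kind` is illustrative and does not come from the source.

```python
def cluster_kind(sequences):
    """Classify a cluster by the lengths of its member sequences."""
    if not sequences:
        raise ValueError("a cluster must contain at least one sequence")
    if len(sequences) == 1:
        return "singleton"
    # All members share one length -> aligned; otherwise unaligned.
    lengths = {len(s) for s in sequences}
    return "aligned cluster" if len(lengths) == 1 else "unaligned cluster"
```

For example, `cluster_kind(["ACGT", "ACGA"])` gives an aligned cluster, while adding a member of a different length makes it an unaligned cluster.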
  • “Assembly” is used interchangeably with “assembling” and means obtaining a consensus sequence, that is, one representative sequence, from a plurality of nucleic acid sequences having an at least partially common sequence. It also means obtaining alignment information of the assembled sequences relative to the consensus sequence.
  • Consensus sequence means an artificial sequence obtained by the assembly process.
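To illustrate what a consensus sequence is, the sketch below takes a per-position majority vote over the members of an aligned cluster. This is a deliberate simplification: it assumes the reads already have equal length and are in register, whereas a real assembler aligns reads of differing lengths first. The function name is hypothetical.

```python
from collections import Counter


def majority_consensus(sequences):
    """Per-column majority vote over equal-length, pre-aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    # zip(*sequences) yields one tuple of bases per alignment column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )
```

A single read error in one member is outvoted by the other members, which is why the consensus can be more accurate than any individual read.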
  • Step 1: Sequence classification. When a specific sequence appears at both ends of a fragment DNA sequence to be detected by comprehensive fragment analysis, both or one of the end sequences is evaluated, and the sequences to be used for clustering and assembling are selected. Specifically, it is determined whether or not each read sequence includes an indicator sequence. A read sequence that includes the indicator sequence at both ends or at one end is extracted as a read sequence for database creation and used in the following steps.
  • Whether or not an index sequence is included may be determined by confirming the presence of the index sequence in the read sequence being examined.
  • The indicator sequence used for confirmation may be a base sequence corresponding to the sequence given to the DNA fragment as an adapter sequence. It may correspond to the entire adapter sequence or to a part of it, and it may contain additional bases in addition to the sequence corresponding to the adapter sequence.
  • Any number of arbitrary bases N (a base selected from adenine, thymine, guanine and cytosine) may be included.
  • The base N is included so as to extend to the 5' end side or 3' end side of the sequence corresponding to the adapter sequence.
  • The number of arbitrary bases N may be, for example, 1 or more, 2 or more, 3 or more, 4 or more, or 5 or more, and is preferably 2 per adapter.
  • The 5' end side and the 3' end side may contain different types of arbitrary bases.
  • The index sequence may be present inside the read sequence, but is preferably present at both ends.
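The Step 1 check can be sketched as follows. The sketch assumes exact string matching at the read ends, which is stricter than an alignment-based search, and the function names and adapter arguments are illustrative rather than taken from the source.

```python
def indicator_ends(read, left_adapter, right_adapter):
    """Report whether the indicator sequence is at both ends, one end, or none."""
    has_left = read.startswith(left_adapter)
    has_right = read.endswith(right_adapter)
    if has_left and has_right:
        return "both"
    if has_left or has_right:
        return "one"
    return "none"


def extract_for_database(reads, left_adapter, right_adapter):
    """Keep reads that carry the indicator sequence at both ends or at one end."""
    return [
        read for read in reads
        if indicator_ends(read, left_adapter, right_adapter) != "none"
    ]
```

Reads for which neither end matches are dropped before clustering, so only fragments produced by the fragment analysis protocol enter the database.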
  • Step 2: Clustering and assembling. Clustering and assembling are performed to obtain sequence clusters and their consensus sequences. All of the extracted read sequence data are subjected to sequence clustering processing and assembly processing using predetermined parameters. Thereby, a plurality of sequence clusters are formed, and the number of constituent sequences, the consensus sequence and consensus sequence length, and alignment information are obtained for each of the sequence clusters.
  • As the predetermined parameters, for example, parameters related to sequence similarity, sequence length and/or index sequence, preferably parameters related to sequence similarity, sequence length and index sequence, may be used.
  • Step 3: Correction of clustering errors. The sequence similarity and the sequence length identity between the sequences constituting each sequence cluster are evaluated. Furthermore, the sequence similarity and sequence length between consensus sequences are evaluated, errors and contradictions in clustering and assembling are detected, and the created clusters are corrected.
  • Step 4: Conversion of cluster reliability into data. For the sequence clusters obtained in step 3, the reliability of the consensus sequence as a representative sequence of each sequence cluster is converted into data from the alignment information between the consensus sequence and the read sequences constituting the cluster.
  • The bases adjacent to the index sequence may be evaluated for the consensus sequence of the cluster and the read sequences constituting the cluster.
  • The number of bases adjacent to the indicator sequence that are evaluated may be 2 or more, and is preferably 2.
  • Step 5: Conversion of consensus sequence reliability into data using known gene information. For the consensus sequences of the sequence clusters obtained in step 4, in a biological species for which known gene information (transcript, genome, EST information, etc.) exists, the known sequence information is searched and consensus sequence reliability data is created.
  • Step 6: Assigning gene information to consensus sequences. For the consensus sequences of the sequence clusters obtained in step 4, known sequence information is searched and gene information is assigned to the sequences.
  • An exhaustive fragment analysis database (hereinafter also referred to as “DB”) can be constructed by the above steps.
  • Steps 4 to 6 are optional steps, and may be performed according to the purpose, for example, when it is desired to ensure higher reliability or when specific gene information is to be added to the database.
  • FIG. 2 shows an example of components included in the database obtained by the steps 1 to 6.
  • The database includes “consensus sequence”, “number of constituent sequences of the sequence cluster”, “consensus sequence length of the sequence cluster”, “alignment information”, “reliability data of sequence clustering” and “gene information of the sequence cluster”, and these pieces of information may be stored in the storage unit in association with each other.
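One way to picture a database entry holding these associated items is a small record type. The sketch below is only an illustration: the class name, field names and field types are assumptions, and the consensus length is derived from the sequence instead of being stored separately.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClusterRecord:
    """One database entry; field names are illustrative, not from the source."""
    consensus_sequence: str
    member_count: int                        # number of reads in the cluster
    alignment_info: List[str] = field(default_factory=list)
    reliability: Dict[str, float] = field(default_factory=dict)
    gene_info: Optional[str] = None          # filled in by the optional Step 6

    @property
    def consensus_length(self) -> int:
        # Derived rather than stored so it cannot drift from the sequence.
        return len(self.consensus_sequence)
```

Because Steps 4 to 6 are optional, the reliability and gene information fields default to empty values.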
  • The consensus sequences are associated with reference profiling by using the sequence information of the consensus sequence, the number of bases and the molecular weight (or electrophoresis sequence length) obtained by electrophoresis of the band or peak, and the number of sequences constituting the cluster of the consensus sequence and the intensity of the band or peak.
  • the consensus sequence included in the database (1) is associated with reference profiling.
  • Step 1: Produce data that associates the profiling results obtained from the gene identification target sample with the reference profiling used in (2).
  • Step 2: For the band group or peak group of interest obtained from the gene identification target sample, obtain the corresponding reference profiling bands or peaks using the association data created in Step 1, then obtain the clusters created in (1) from the association information of (2), and obtain their consensus sequences and gene information. This creates a correspondence list between the band group or peak group of interest and the gene information.
  • Step 3: Determine the consensus sequence of interest based on the gene information created in Step 6 of (1), and determine the profiling band or peak obtained from the gene identification target sample via the reference profiling band or peak associated in (2).
  • The gene information of a band or peak of interest in the electrophoresis result obtained from the gene identification target sample is obtained by displaying, side by side, pseudo profiling created from the sequence clusters and their sequence counts together with their gene information.
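The lookup chain described in Step 2 (target peak, then reference peak, then cluster, then consensus sequence and gene information) can be sketched with plain dictionaries. All names and the dictionary layout are hypothetical; the source does not specify how the association data is stored.

```python
def identify_genes(target_peaks, target_to_reference, reference_to_cluster, clusters):
    """Follow target peak -> reference peak -> cluster -> (consensus, gene info)."""
    result = []
    for peak in target_peaks:
        reference_peak = target_to_reference.get(peak)
        cluster_id = reference_to_cluster.get(reference_peak)
        # None means that some link in the chain was missing for this peak.
        result.append((peak, clusters.get(cluster_id)))
    return result
```

A peak that cannot be linked to a reference peak or cluster is reported with `None`, so unresolved peaks stay visible in the correspondence list.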
  • For the read sequences obtained from each sample to be measured, the database of (1) is created; the sequence clusters are associated with each other based on the similarity of their consensus sequences; the numbers of constituent sequences are compared between the associated sequence clusters to detect sequence groups whose quantity has changed; and expression analysis is thereby performed between the first target sample and the second target sample. Standardization may be performed using the numbers of sequences being compared.
  • For each sample, the mixed solution prepared by the comprehensive fragment analysis method is applied to the same type of high-speed DNA sequencer to obtain sequences. The sequencer does not have to be the same as the high-speed DNA sequencer used for the database created in advance.
  • The read sequences to be measured are clustered by using the consensus sequences of the previously created database as a reference and performing alignment processing or the like.
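The count comparison between two samples can be sketched as below. Several points are assumptions made for illustration only: clusters are matched by identical consensus sequence (the text matches by similarity), counts are normalized by each sample's total read count, every count is positive, and the fold-change threshold is a free parameter.

```python
def compare_expression(counts_a, counts_b, min_fold=2.0):
    """Return fold changes (sample B over A) for clusters whose share changed.

    counts_a and counts_b map a consensus sequence to its read count.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    changed = {}
    for consensus in counts_a.keys() & counts_b.keys():
        share_a = counts_a[consensus] / total_a   # normalize by sample size
        share_b = counts_b[consensus] / total_b
        fold = share_b / share_a
        if fold >= min_fold or fold <= 1.0 / min_fold:
            changed[consensus] = fold
    return changed
```

Normalizing by total reads keeps a difference in sequencing depth between the two runs from being mistaken for a change in expression.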
  • a probe is designed based on the obtained consensus sequence to create a microarray.
  • Using the microarray created above, sequence groups whose amounts have changed are detected in mixed solutions prepared by the exhaustive fragment analysis method, and expression analysis is performed between a first target sample and a second target sample.
  • Each symbol means the following. M: index of the read sequence that is the seed of the cluster. N: index used to read from the read sequence following the M-th read sequence through to the last sequence. I: number of the cluster being generated.
  • All the read sequences to be used for clustering are read. First, the read sequence that becomes the seed of a cluster, the M-th sequence, is determined. Starting from the read sequence following the seed, all the remaining read sequences are examined in order, and the N-th read sequence is compared with the seed in terms of similarity and sequence length; if they are determined to be the same, the read sequence is stored in the I-th cluster of the cluster storage area. The clusters are established when the search has been completed for all seed sequences. Thereafter, in order to obtain a consensus sequence for each cluster, assembly is performed for each cluster.
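The seed-based loop described above can be sketched as follows, keeping the roles of M, N and I. The similarity test used here (equal length plus a minimum fraction of identical positions) and the `min_identity` parameter are assumptions; the source only states that similarity and sequence length are compared.

```python
def seed_clustering(reads, min_identity=0.9):
    """Greedy seed clustering; returns clusters as lists of reads."""
    assigned = [False] * len(reads)
    clusters = []                        # clusters[I] is the I-th cluster
    for m, seed in enumerate(reads):     # M: index of the seed read
        if assigned[m]:
            continue                     # already placed in an earlier cluster
        assigned[m] = True
        cluster = [seed]
        for n in range(m + 1, len(reads)):   # N: scan the reads after the seed
            candidate = reads[n]
            if assigned[n] or len(candidate) != len(seed):
                continue
            matches = sum(a == b for a, b in zip(seed, candidate))
            if matches >= min_identity * len(seed):
                cluster.append(candidate)
                assigned[n] = True
        clusters.append(cluster)
    return clusters
```

Each read joins the first seed it matches, so the result depends on input order; a read left alone with its seed role forms a singleton.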
  • a program for causing a computer to execute the steps included in each method (herein also referred to as “stage”) as each procedure may be provided.
  • A program for implementing the steps included in the above (1), (2), (3), (4), (5), (6), (7) and/or (8) as procedures is provided.
  • a program for causing a computer to execute the method (1) may be stored in any medium and provided.
  • Such a program is, for example, a program for constructing a database for comprehensive fragment analysis of transcripts, which causes a computer to execute a process including: a procedure for examining, for all of the read sequence data obtained by high-speed DNA sequencing of a fragment DNA mixed solution derived from transcripts contained in a sample that have been fragmented and given an identifiable indicator sequence, the presence or absence of the indicator sequence portion, and extracting the read sequence data having the indicator sequence; and a procedure for performing, on all of the extracted read sequence data, a sequence clustering process and an assembling process using predetermined parameters relating to sequence similarity and sequence length, thereby forming a plurality of sequence clusters.
  • FIG. 44 schematically shows an example of the configuration of a computer for performing a procedure according to an embodiment of the present invention.
  • The computer includes a processing management unit, a storage unit, a temporary recording unit, a program storage unit, a clustering and assembly processing unit, an index sequence inspection unit, a correction data storage unit, a similarity determination unit, and a sequence length determination unit.
  • All other components, that is, the storage unit, the temporary recording unit, the program storage unit, the clustering and assembly processing unit, the index sequence inspection unit, the correction data storage unit, the similarity determination unit, and the sequence length determination unit, are connected to the processing management unit so as to be able to exchange signals.
  • a further configuration unit for performing processing as desired may be included, and such a configuration unit is connected to the process management unit so as to be able to exchange signals.
  • All programs are stored in the program storage unit.
  • the process management unit manages and executes all processes in accordance with the program stored in the program storage unit.
  • the database configured according to the aspect of the present invention is stored in the storage unit.
  • The read sequences are stored in the storage unit or the temporary recording unit.
  • The index sequence inspection unit inspects whether or not the index sequence is included in a read sequence output from the component in which it is stored, in accordance with an instruction from the process management unit according to the program stored in the program storage unit.
  • The clustering and assembling processing unit performs clustering and assembling processing on the read sequences.
  • the correction data storage unit stores data used for correcting the obtained data. Data stored in the correction data storage unit is output, a program for correction is output from the program storage unit, and the process management unit corrects data obtained based on the data.
  • the similarity determination unit performs a determination regarding similarity with respect to an object to be compared in accordance with an instruction from the process management unit according to the program stored in the program storage unit.
  • the sequence length determination unit makes a determination regarding the sequence length of the objects to be compared in accordance with an instruction from the process management unit according to the program stored in the program storage unit.
  • The computer may have an input unit such as a keyboard and/or a scanner for inputting data from an operator or a high-speed DNA sequencer. Furthermore, it may have an output unit, such as a monitor and/or a printer, for outputting the obtained result.
  • the processing management unit performs the correction.
  • the computer may further include a correction unit, and the correction unit may correct the data as described above.
  • a typical method for comprehensive fragment analysis is as follows. For example, such methods include methods called HiCEP, AFLP, T-RFLP, SAGE, CAGE, and Differential Display.
  • The HiCEP method is designed so that digestion with restriction enzymes generates only one type of fragment from each target mRNA sequence or genomic sequence fragment (start sequence).
  • By electrophoresis, more than 20,000 types of fragment sequences can be obtained simultaneously as independent waveform peaks, and each peak corresponds to one type of original sequence. Therefore, when electrophoresis results obtained with HiCEP are compared, a peak whose intensity has changed between samples indicates that there is also a quantitative difference in the original sequence of the fragment.
  • transcripts with low expression levels can also be detected, and reproducibility is very good, so that differences in expression levels of 1.2 times or more can be detected.
  • Using ES cells as samples, about 14,000 peaks detected by HiCEP were sequenced using a Sanger sequencer to create a database. Creating this database required about 3 years and high costs, so it is impractical to perform this method on all samples that are subject to HiCEP measurement.
  • the present invention can also provide the following effects.
  • For organisms with genome information and transcript information, a method was also constructed in which the HiCEP method is simulated by a computer and the sequences of bands or peaks are predicted by matching virtual fragment sequences obtained from known sequences with the molecular weights of the bands or peaks obtained by electrophoresis.
  • However, the sequence length obtained by electrophoresis (referred to as the electrophoresis length) does not always match the actual sequence length of the target fragment, and as the number of known sequences grows relative to the number of bands and peaks, the number of candidates increases. From these facts, it was found that sequences cannot be accurately associated with bands or peaks by sequence length alone.
  • Applying existing analysis methods (RNA-seq using a high-speed DNA sequencer, and microarrays) to the DNA mixture prepared from the sample to be measured by the comprehensive fragment analysis method enables analysis that exploits the features of the comprehensive fragment analysis method while avoiding the disadvantages of the existing analysis methods. In addition, the availability of high-precision sequence cluster information that stores reliability data enables analysis with higher accuracy than applying the existing analysis methods alone.
  • the HiCEP method (High-coverage-expression-profiling-method) is one of the methods for comprehensive fragment analysis, and is a method for comprehensive and high-precision gene expression analysis from a very small amount of sample.
  • the greatest feature of the HiCEP method is that low expression transcripts can be analyzed with high reproducibility and high accuracy. Furthermore, since this method does not require gene sequence information in advance, it can also be applied to biological species for which genomic information is not clear. However, it also means that it is difficult to predict the base sequence of the transcript obtained as a profiling peak. Therefore, the base sequence identification of the electrophoresis peak in the expression profile of the comprehensive fragment analysis obtained by HiCEP method was carried out by this method.
  • In using this method, a sample to be measured by the HiCEP method is prepared in advance, sequenced by this method, and clustered and assembled to create a database of sequence clusters.
  • The HiCEP reference profiling peaks and the clusters are stored in association with each other. After that, HiCEP is performed on a sample of the same species and tissue as the sample from which the database was created to obtain the profiling to be analyzed, the electrophoresis peaks of interest are listed, and the sequences of the peaks of interest are determined using the previously created sequence cluster database and the association data between the clusters and the reference profiling peaks.
  • Specifically, in HiCEP, double-stranded cDNAs are first generated from RNA (total RNA) extracted from a biological sample or from a purified mRNA sample. The cDNAs are cleaved with two appropriate restriction enzymes, and a preparation containing only cDNA fragments provided with a characteristic adapter at each end is prepared. A feature of HiCEP is that only one kind of cDNA fragment, having different adapter sequences at its two ends, is generated from one kind of starting mRNA.
  • The preparation solution of the cDNA fragment group is divided into 256 aliquots, and 16 types of primers, each 2 bases longer than the adapter sequence (known sequence), are prepared for the adapters at both ends.
  • The PCR product is applied to a capillary electrophoresis apparatus together with a size marker as shown in FIG. 12, and the electrophoresis waveform pattern, the peak electrophoresis sequence length and the fluorescence intensity data are obtained as profiling data.
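The figure of 256 is consistent with pairing 16 two-base extensions on one adapter with 16 on the other (16 × 16 = 256); that pairing is an interpretation, since the text does not spell it out. The sketch below enumerates the combinations. The adapter sequences are hypothetical placeholders, not the actual HiCEP adapters.

```python
from itertools import product

BASES = "ACGT"


def selective_primers(adapter):
    """Return the 16 primers formed by extending `adapter` with two bases NN."""
    return [adapter + "".join(nn) for nn in product(BASES, repeat=2)]


# Hypothetical adapter sequences, used only for illustration.
LEFT_ADAPTER = "GATCCA"
RIGHT_ADAPTER = "TTAAGC"

left_primers = selective_primers(LEFT_ADAPTER)
right_primers = selective_primers(RIGHT_ADAPTER)
primer_pairs = list(product(left_primers, right_primers))
print(len(left_primers), len(right_primers), len(primer_pairs))  # 16 16 256
```

Each primer pair amplifies only the fragments whose two bases next to each adapter match, which spreads the fragments over 256 separate reactions.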
  • the sequence identification of the profiling peak obtained with this HiCEP was carried out by this method using mouse ES cells (E14) as a sample.
  • “template cDNAs” obtained in the step shown in FIG. 10 (a mixture of sequences having adapters used in the HiCEP method, which is an indicator sequence at both ends. The length distribution is about 60-base.
  • amplification was performed with primers on the adapter.
  • purification by acrylamide gel electrophoresis was performed to remove fragments from 70base to 100base or less.
  • the purified product was sequenced with a GS-454 FLX System, which is a high-speed DNA sequencer manufactured by Roche. DNA was not fragmented when the sequencing library was prepared.
  • Step 1: Inspection and classification by index sequence (the adapter sequence used for the HiCEP method). An adapter sequence that is a specific index sequence is always given to both ends of the cDNA fragment of FIG. 10 to be detected by the HiCEP method. The index sequences are evaluated for all the read sequences, and the sequences to be used for clustering and assembly are selected.
  • All read sequences are searched with the cross_match program (University of Washington) for similarity to 32 types of masking sequences, each obtained by extending an adapter sequence up to and including the selection bases NN.
  • A sequence in which the adapter sequence can be confirmed at both ends or at one end is a target of clustering and assembly.
  • A) High-quality adapters: adapters with a high alignment score whose alignment includes NN.
  • B) Rescuable low-quality adapters: the alignment is short but has no substitutions or gaps, NN is included, and the internal sequence is expected to be of high quality.
  • C) Low-quality adapters: low-quality adapters that cannot be rescued.
  • D) Fake adapters: cases where cross_match appears to have aligned an internal sequence similar to the adapter (a part that does not appear to actually be an adapter).
  • Sequences whose adapters are classified as high-quality adapters or rescuable low-quality adapters are used as sequences for clustering and assembly.
  • 300,635 sequences (64.1%) were sequences in which adapters could be confirmed at both ends.
  • 112,365 sequences (23.9%) were sequences in which only one adapter was confirmed (see FIG. 13).
  • Step 2: Clustering and assembling. Among the sequences used for clustering and assembling, those in which adapters were confirmed at both ends are clustered and assembled in this embodiment to generate consensus sequences of the HiCEP fragments. This eliminates errors in individual read sequences and yields more accurate HiCEP fragment sequences. In addition, the number of read sequences constituting a consensus sequence can be used as reference data for the transcription level of the HiCEP fragment.
  • (A) Pre-processing. As pre-processing, for every read sequence in which an adapter sequence was confirmed and which is used for clustering and assembling, the confirmed adapter sequence and everything outside it are removed (see FIG. 21). The base sequence of the original adapter is then artificially attached to the end of the trimmed sequence. The quality values output by the sequencer for each read sequence are trimmed in the same way, and the highest quality value is assigned to the positions corresponding to the artificially attached adapter.
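A minimal sketch of this pre-processing step, assuming the insert boundaries have already been located by the adapter search. The function name, the coordinate convention and the top quality value of 40 are assumptions made for illustration, not values given in the text.

```python
MAX_QUALITY = 40  # assumed highest quality value; the actual value depends on the sequencer


def trim_and_restore(seq, qual, insert_start, insert_end, adapter_5, adapter_3):
    """Drop the observed adapters and everything outside them, then attach clean adapters.

    insert_start / insert_end are 0-based, end-exclusive coordinates of the part
    of the read that lies between the two confirmed adapters (hypothetical inputs).
    """
    insert_seq = seq[insert_start:insert_end]
    insert_qual = list(qual[insert_start:insert_end])
    new_seq = adapter_5 + insert_seq + adapter_3
    # The artificially attached adapter bases receive the highest quality value.
    new_qual = [MAX_QUALITY] * len(adapter_5) + insert_qual + [MAX_QUALITY] * len(adapter_3)
    return new_seq, new_qual
```

After this step every read carries identical adapter bases at both ends, which is what lets overhangs be interpreted as assembly errors later on.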
  • TGICL parameters: the parameter “-v 2” minimizes the overhang (the part of a sequence that is invalidated and excluded from the assembly result) allowed during assembly.
  • Because every input sequence has the common HiCEP adapter sequences attached to both ends, an alignment output with an overhang indicates that the cluster could not be assembled normally, so error clusters can be recognized.
  • Singleton sequences arise separately at the two processing stages, clustering and assembly.
  • Singleton sequences generated during assembly are written to a singles file created for each thread that ran the assembly, but no information is output for singleton sequences excluded during clustering. The sequences excluded during clustering must therefore be identified and extracted as singleton sequences.
  • At the end of clustering, tgicl writes a list of the names of the sequences that were clustered to a file. This file is used to obtain the singleton sequences excluded during clustering.
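Recovering the singletons excluded during clustering amounts to a set difference between all input read names and the clustered names. The sketch below assumes both lists have already been read into memory; it does not parse any tgicl file format, and the function name is hypothetical.

```python
def singletons_from_clustering(all_read_names, clustered_names):
    """Return the reads that were given to clustering but appear in no cluster.

    The input order of all_read_names is preserved in the result.
    """
    clustered = set(clustered_names)
    return [name for name in all_read_names if name not in clustered]
```

The result would be combined with the per-thread singles files to obtain the complete singleton set.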
  • A) The full length of every read is effectively assembled. None of the read sequences that make up the contig may have a clipped part (a read end that is invalidated during alignment and does not contribute to the consensus sequence).
  • B) Reads are aligned to the full length of the consensus (contig) sequence. Both ends of each read sequence must align with both ends of the consensus sequence and must not fall in the middle of the consensus sequence.
  • A sequence cluster that does not satisfy both conditions A) and B) is judged to be an unaligned cluster and is treated as an error cluster (see FIG. 14).
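Conditions A) and B) can be checked per cluster once the alignment coordinates are available. The record layout below is an assumption made for illustration; the real assembler output would need to be parsed into it.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    # Hypothetical per-read alignment record.
    clip_left: int    # bases clipped at the start of the read
    clip_right: int   # bases clipped at the end of the read
    cons_start: int   # 1-based start position on the consensus
    cons_end: int     # 1-based inclusive end position on the consensus


def is_aligned_cluster(reads, consensus_len):
    """True only if every read satisfies both condition A and condition B."""
    for r in reads:
        if r.clip_left > 0 or r.clip_right > 0:
            return False  # condition A violated: part of the read was clipped
        if r.cons_start != 1 or r.cons_end != consensus_len:
            return False  # condition B violated: read does not span the consensus
    return True
```

A cluster for which this returns False corresponds to the error cluster described above.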
  • The name of a consensus (contig) sequence is the name of the original file with a branch number appended.
  • The branch number is a zero-padded 4-digit number following '_' and is assigned in order starting from "_0001".
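The naming rule can be written in one line. The function name and the example file name are hypothetical.

```python
def contig_names(original_name, count):
    """Generate consensus names such as 'name_0001', 'name_0002', ... in order."""
    return [f"{original_name}_{i:04d}" for i in range(1, count + 1)]
```

For example, `contig_names("cluster12", 3)` yields `cluster12_0001`, `cluster12_0002` and `cluster12_0003`.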
  • Step 4: Generation of cluster reliability data. The degree to which the HiCEP selection sequence portion is plausible is evaluated for the cluster sequences obtained in step 3 above. Specifically, the sequence similarity between the consensus sequence of each sequence cluster and its constituent read sequences was scored. As a result, when the cluster information is used, for example in the profiling association process of (2), the reliability of the sequence cluster calculated here can be used as a threshold.
  • Selection base composition ratio (up to the third candidate): the selection base match rate of the sequences constituting the consensus is evaluated; the candidates serve as correction candidates when peak matching is unsuccessful.
  • The ideal score is 100% for the first candidate and 0% for the second and third candidates. However, when a heterozygous SNP lies in the selection part, the first and second candidates are 50% each (see FIG. 23).
  • FIG. 23 shows the bases recognized as selection bases when calculating the constituent sequence ratio of selection bases.
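The composition ratio of selection bases can be sketched as a simple frequency count over the constituent reads. The input, one selection-base string per read, is an assumed representation; how the selection bases are extracted from each read is not shown here.

```python
from collections import Counter


def selection_base_ratios(selection_bases, top=3):
    """Return up to `top` (bases, percentage) candidates, most frequent first."""
    counts = Counter(selection_bases)
    total = sum(counts.values())
    return [(bases, 100.0 * n / total) for bases, n in counts.most_common(top)]
```

A clean cluster gives a single candidate at 100%, while a heterozygous SNP in the selection part gives two candidates near 50% each, matching the cases described above.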
  • Step 5: Reliability data for the consensus sequence using known gene information. For species with known gene information (transcript, genome, EST information, etc.), the known sequence information is searched with the consensus sequences of the sequence clusters obtained in step 4 above, and reliability data for the consensus sequences are created.
  • (B) Category classification. The results of the similarity search against public databases are divided into the following four categories. Here, "95-95" denotes a base match rate of 95% or more with an alignment length of 95% or more of the query length, and "95-20base" denotes a base match rate of 95% or more with an alignment length of 20 bases or more.
  • 1. 95-95, and CCGG/TTAA are present.
  • 2. 95-95, and CCGG/TTAA, or one of them, differs by one base.
  • 3. 95-20base hit at an end (the start/end position in the query sequence is within 4 bases of the end), and CCGG/TTAA is present.
  • 4.
  • Uppercase: aligned sequence. In this example, the alignment starts from the third base of the query sequence. CCGG is searched for in yyzzZZYY, the part of the subject sequence at the same coordinates as the restriction enzyme site of the query sequence (zzZZ) plus two bases before and after it.
  • For each of the two base positions immediately inside the adapter sequence, the bases of the known sequences that were similar to the consensus sequence were recorded as data.
  • The search process can be performed on the assumption that an SNP exists even when there is no corresponding sequence cluster.
  • Step 6: Assigning gene information to the consensus sequence. For the consensus sequences of the sequence clusters obtained in step 4 above, the known sequence information is searched and gene information is assigned to each sequence.
  • The target organism is a species having known gene information (transcript, genome, EST information, etc.).
  • The information is given in step 5.
  • Exhaustive fragment analysis may detect many unknown transcripts. Therefore, in step 6, a similarity search is performed for the consensus sequences of the sequence clusters against the known sequence information of all species or of several specific species, and known sequences with high similarity are associated with each consensus sequence.
  • Step 1: Correction of sequence length. It is known that the electrophoresis length and the sequence length of an electrophoresed sequence do not always match (see FIG. 39). In this method, the shift between electrophoresis length and sequence length is one obstacle to matching peaks with sequences. To address it, the relationship of base composition, molecular weight, electrophoresis length and sequence length to this shift was examined using 37,675 associated peaks and their sequence data in an existing database in which the HiCEP method was applied to mouse ES cells. As a result, it was found that correction based on base composition and molecular weight is possible and improves peak matching accuracy.
  • A calibration table is prepared from the relationship between the deviation remaining after the calibration in (1) above and the AC content ratio, and from its relationship with the TG content ratio, and calibration is performed.
  • The loess function, a local regression smoothing function, is used to create the calibration table.
  • Step 2: Association of sequence clusters with peaks. Sequence clusters were associated with the peaks of the HiCEP reference profiling using two pairs of values: the sequence length of the consensus sequence of each sequence cluster, corrected in step 1, against the electrophoresis length of the electrophoresis peaks in the profile obtained by the HiCEP method; and the number of read sequences constituting the sequence cluster against the peak intensity obtained by electrophoresis.
  • Of the 21,778 profiling peaks obtained by the HiCEP method for ES cells, 12,551 peaks (57.6%) were associated, of which 77% could be identified by computer processing.
  • Peak length: number of bases of the consensus sequence, including the selection bases, + 34. Peak height: number of reads in the sequence cluster.
  • The correction value of +34 from sequence length to electrophoresis length is determined as follows.
  • The PCR product consists of the fragment DNA sequence, including the HiCEP selection bases, plus 41 bases: the 40 bases of primer sequence used in PCR and one thymine artificially attached to the end during PCR. Because the sequence length is counted with the adapter sequence removed but already includes the 2 selection bases at each end (4 bases in total), the theoretical correction value from sequence length to electrophoresis length is 41 - 4 = 37 bases. However, on capillary electrophoresis apparatus manufactured by Applied Biosystems (in particular the 3100), fragments are known to appear at an electrophoresis position 3 bases shorter than this theoretical value, so 3 bases were subtracted and the correction value was set to 34 bases.
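The arithmetic behind the +34 correction, and its use to turn a sequence cluster into a pseudo peak, can be written out as below. The constants come from the text; the function and variable names are illustrative.

```python
PRIMER_PLUS_T = 41           # 40 primer bases + 1 artificially attached thymine
SELECTION_BASES_COUNTED = 4  # 2 selection bases at each end, already in the sequence length
INSTRUMENT_OFFSET = 3        # observed shift on the capillary instrument

CORRECTION = PRIMER_PLUS_T - SELECTION_BASES_COUNTED - INSTRUMENT_OFFSET  # 41 - 4 - 3


def pseudo_peak(consensus_length, read_count):
    """Return (peak length, peak height) for a sequence cluster."""
    return consensus_length + CORRECTION, read_count
```

With these constants a 100-base consensus supported by 7 reads maps to a pseudo peak at length 134 with height 7.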
  • The peak matching algorithm is based on these characteristics.
  • A frame region is set, all combinations of reference profiling peaks and pseudo peaks within the frame are scored as pair candidates, and the combination of pairs with the highest total score is obtained by the DP matching method (FIG. 17A, FIG. 31 and FIG. 32).
  • The highest score of a pair candidate is 1.0. A candidate with a score of 0 or more may become a final pair, whereas a candidate with a negative score is unlikely to become a final pair.
  • The pair candidate score is the weighted sum of the peak height score and the size score.
  • The maximum of both the height score and the size score is 1.0, and each is multiplied by a weighting factor so that the maximum pair candidate score is also 1.0.
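The DP matching step can be illustrated with a generic order-preserving dynamic program over a precomputed score matrix. This is a sketch under assumptions, not the patented procedure: it takes the pair candidate scores as given, ignores the frame construction, and keeps only pairs that raise the total, so negative-score candidates are never selected.

```python
def dp_match(score):
    """Find the order-preserving pairing with the highest total score.

    score[i][j] is the pair candidate score of profile peak i and pseudo peak j,
    with both peak lists sorted by size. Returns (total, [(i, j), ...]).
    """
    n = len(score)
    m = len(score[0]) if n else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            paired = best[i - 1][j - 1] + score[i - 1][j - 1]
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    # Trace back through the table to recover the chosen pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

Because the pairing must preserve size order, two high-scoring candidates that would cross each other cannot both be chosen; the program keeps whichever combination gives the larger total.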
  • The height score is calculated for each pair candidate as follows. The height order number within the frame is used instead of the height value itself (see FIG. 17B).
  • p.order: height order number of the profile peak
  • r.order: height order number of the pseudo peak
  • abs(n): absolute value of n.
  • Height order number and frame: height order numbers 1, 2, 3 ... n are assigned in order from the highest peak in the frame (see FIG. 33). Profile peaks and pseudo peaks are numbered separately. The frame and the height order numbers are calculated for each profile peak of interest (as many frames as there are profile peaks are generated per primer set).
  • Because the height of a pseudo peak is a number of sequences, peaks with the same height can occur frequently.
  • The same order number is assigned to peaks having the same height in the frame (see FIG. 35).
  • The order number in this case is obtained by adding the number of peaks having the same height.
  • A group of peaks at the same height is likely to consist of clusters with few reads or of singletons. Assigning such peaks order numbers that are further apart improves matching accuracy (the influence of noise is reduced).
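One possible reading of the tie rule is that each peak receives the number of peaks in the frame that are at least as high as it, so a tied group shares the order number of its last member. The sketch below implements that reading only; the exact rule used in the method may differ (see FIG. 35).

```python
def height_order_numbers(heights):
    """Assign height order numbers within one frame, highest peak first.

    Tied peaks share one number, taken here as the count of peaks whose
    height is greater than or equal to theirs (an assumed interpretation).
    """
    return [sum(1 for other in heights if other >= h) for h in heights]
```

Under this reading the heights 10, 8, 5, 5, 5, 3 receive the numbers 1, 2, 5, 5, 5, 6, which pushes the tied low peaks further from the strong ones.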
  • The frame range is within 27 profile peaks and within 80 bases before and after the profile peak of interest (pseudo peaks are not counted) (FIG. 36).
  • Size score: the size score is calculated for each pair candidate as follows (see FIG. 17B).
  • p.size: size of the profile peak
  • Size tolerance: there is no penalty if the size difference within a pair candidate is within the “size tolerance”.
  • The size tolerance varies between 2 and 4 bases.
  • The candidate value is half (1/2) of the shorter of the two distances, front and rear. If the candidate value is within 2 to 4 bases, it is used directly as the size tolerance; if it is smaller than 2 bases, the size tolerance is set to 2, and if it is larger than 4 bases, to 4 (FIG. 37).
  • A method that increases the tolerance as the size increases is also conceivable, but the method above was adopted here.
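The size tolerance rule is a clamp of half the shorter distance into the range of 2 to 4 bases. In this sketch the two distances are passed in directly, and the function and parameter names are illustrative.

```python
def size_tolerance(dist_front, dist_rear, lower=2.0, upper=4.0):
    """Half of the shorter distance, limited to the range [lower, upper] bases."""
    candidate = min(dist_front, dist_rear) / 2
    return max(lower, min(upper, candidate))
```

For distances of 6 and 10 bases the tolerance is 3, while crowded regions fall back to 2 and sparse regions are capped at 4.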
  • Pairs are corrected for a strong peak and its nearby peaks (FIG. 38).
  • The correction method is as follows.
  • Step 1: Create data in which the HiCEP profiling results obtained from the gene identification target sample are associated with the reference profiling used in (2) by electrophoresis length and peak intensity.
  • Step 2: For the peak group of interest obtained from the gene identification target sample, the reference profiling peaks are obtained using the association data created in step 1 above, the clusters created in (1) are obtained from the association information of (2), and the consensus sequences and gene information are obtained. As a result, a correspondence list between the peak group of interest and gene information is created.
  • Step 3: The consensus sequence of interest is determined from the gene information created in step 6 of (1), and the electrophoresis peak obtained from the gene identification target sample is found via the reference profiling peak associated in (2).
  • Method 2: Using the number of bases obtained by electrophoresis for one or more peaks of the electrophoresis result obtained from the gene identification target sample, the reference profiling bands of (2), the pseudo-profiling created from the number of sequences of the sequence clusters created in (1), and the gene information are displayed side by side, whereby the gene information of the peak of interest in that electrophoresis result is obtained.

Abstract

A database construction method comprising: a stage of fragmenting genomic DNA contained in a sample, or cDNA obtained from a transcription product, and attaching an identifiable index sequence to obtain a fragment DNA mixture; a stage of performing high-speed DNA sequencing on a portion of the fragment DNA mixture to acquire read sequence data for all fragment DNA contained therein; a stage of checking all read sequence data for the presence of the index sequence portion and extracting the read sequence data having the index sequence; and a stage of clustering and assembling the sequences using sequence similarity and sequence length as parameters to form a plurality of clusters and acquiring, for each cluster, the number of constituent sequences, the consensus sequence and consensus sequence length, and alignment information.

Description

Gene identification method and expression analysis method in comprehensive fragment analysis
 The present invention relates to a fragment sequence database construction method in comprehensive fragment analysis, and to a gene identification method and an expression analysis method that use it.
 Various expression analysis methods currently exist and continue to be developed. For example, methods using microarrays are widely used. A microarray detects hybridization between probes immobilized on a substrate, each containing a sequence to be detected, and nucleic acids contained in a sample. This method requires information on the target nucleic acid in order to prepare the probes. In addition, with microarray technology it is difficult to determine quantitative differences in expression level, even for species with sequence information, and the rate of change in expression detected for low-expression genes has low reliability.
 In recent years, a technique (RNA-Seq) has been reported in which transcripts are sequenced on a high-speed DNA sequencer, as typified by the Roche 454FLX, the Illumina GAII series and HiSEQ series, the LifeTechnology SOLiD series and Ion Torrent PGM series, and instruments from Helicos and Pacific Bio, and the number of sequences per gene is counted to analyze expression. In this method, for a species whose sequence information is known, the sequenced reads must be aligned to a reference sequence based on that sequence information. Moreover, even when sequence information for the target species is known, the amounts (i.e., the numbers of read sequences) can be compared for sequence populations derived from major transcripts, but for populations derived from minor, low-abundance transcripts the reproducibility of the amounts is low and so is the reliability of the comparison.
 On the other hand, methods applicable even to species without sequence information have been proposed for detecting differences in genomic DNA or in transcript expression levels. Such methods are also called exhaustive fragment analysis methods and include, for example, HiCEP, AFLP, T-RFLP, SAGE, CAGE and Differential Display. In these methods, either a DNA sequence is cleaved with a restriction enzyme, fragment sequences with a specific sequence attached to their ends are prepared, and the fragments are amplified by PCR using the specific sequence and then electrophoresed; or the DNA sequence is amplified by PCR using a specific sequence and then electrophoresed. The electrophoresis results (i.e., groups of bands or peaks) for the resulting fragment DNA sequences are then compared between different samples to detect bands or peaks that differ in intensity. With such exhaustive fragment analysis methods, expression analysis requires each band or peak to be collected individually and sequenced one by one to determine its base sequence. Gene identification and expression analysis by such an approach therefore require an enormous amount of time and cost.
 In view of the above, an object of the present invention is to provide a simple and highly reliable gene identification method and expression analysis method, and a method for constructing the fragment sequence database for exhaustive fragment analysis used therein.
 According to one aspect of the present invention, there is provided a database construction method comprising:
 fragmenting a transcript contained in a sample and attaching an index sequence to obtain a fragment DNA mixture;
 performing high-speed DNA sequencing on a first portion of the fragment DNA mixture to acquire read sequence data for all fragment DNAs contained therein;
 checking all of the read sequence data for the presence of the index sequence portion and extracting the read sequence data having the index sequence; and
 subjecting all of the extracted read sequence data to sequence clustering and assembling using predetermined parameters to form a plurality of clusters and acquiring, for each of the clusters, the number of constituent sequences, the consensus sequence and consensus sequence length, and alignment information,
wherein the parameters relate to sequence similarity and sequence length.
 The present invention provides a simple and highly reliable gene identification method and expression analysis method, and a method for constructing the fragment sequence database for exhaustive fragment analysis used therein.
Flowchart showing an example of the database construction method.
Diagram showing an example of the database structure.
Flowchart showing an example of the gene identification method.
Diagram showing an example of the database structure.
Flowchart showing a further example usable in the gene identification method of FIG. 3.
Flowchart showing an example of an analysis method using a high-speed DNA sequencer.
Flowchart showing an example of an analysis method using a high-speed DNA sequencer.
Flowchart showing an example of an analysis method using a microarray.
Flowchart showing an example of the clustering process.
Flowchart showing an example of the clustering process.
Scheme showing an example of preparing a DNA fragment mixture.
Scheme showing an example of preparing a DNA fragment mixture using selective PCR.
Schematic diagram showing the relationship between separation by fragment length and detection of a DNA fragment mixture.
Diagram showing an example of adapter sequence evaluation.
Conceptual diagram of an example of error cluster correction.
Conceptual diagram of cluster division by a heterozygous SNP.
Diagram showing an example of a method for correcting the shift between electrophoresis length and sequence length.
Diagram showing an example of a method for associating read sequences with peaks.
Diagram showing an example of a method for associating read sequences with peaks.
Flowchart showing an example of inspection and classification by index sequence.
Flowchart showing an example of the clustering and assembling process.
Schematic diagram showing an example of association.
Schematic diagram of an example of a high-quality sequence.
Diagram showing an example of alignment output.
Schematic diagram showing an example of index sequences.
Schematic diagram showing an example of index sequences.
Diagram showing an example of the input screen for a search by fragment length.
Diagram showing an example of the input screen for a search by gene name.
Diagram showing an example of the input screen for a BLAST search.
Diagram showing examples before and after calibration.
Diagram showing an example of peaks.
Diagram showing an example of peaks.
Diagram showing an example of scoring.
Diagram showing correct and incorrect alignments.
Conceptual diagram of the frame and order numbers.
Diagram illustrating the conversion from height to height order number.
Diagram illustrating height order numbers.
Diagram showing an example of profile peaks.
Diagram showing an example of profile peaks.
Diagram showing an example of association before and after correction.
Diagram showing an example of association.
Graph showing the relationship between sequence length and shift.
Graph showing the relationship between molecular weight and shift.
Graph showing the relationship between contained amino acids and shift.
Diagram showing the correction calculation method.
Block diagram showing an example of a computer configuration.
 (1) Construction of an exhaustive fragment sequence database using a high-speed DNA sequencer
 An example of constructing an exhaustive fragment sequence database using a high-speed DNA sequencer is described below with reference to FIG. 1.
 First, a fragment DNA mixture for creating the database is prepared. The fragment DNA mixture may be prepared by fragmenting the genome or transcripts contained in the sample for which the database is to be created and attaching an index sequence. This serves as the mixture for the exhaustive fragment analysis method.
 The sample may be prepared from cells, tissues, organs or the like into a mixture containing the genome or transcripts by any means known per se. This preparation may be carried out by any means known per se before the genome or transcripts are fragmented. Preferably, cDNA is prepared from the transcripts, fragmented, and given a labeling sequence.
 DNA obtained from the genome or transcripts may be fragmented using a restriction enzyme known per se. An identifiable index sequence may be added to the fragmented DNA by, for example, attaching an adapter sequence to each fragment. The adapter may be attached to the 5' end and/or the 3' end of each fragment. Alternatively, as is done in the mate pair method, for example, an adapter may be attached to the 5' end and/or the 3' end of each fragment, the 5' end and 3' end of an adapter-bearing fragment may be joined to form a circular nucleic acid, and the circle may then be cleaved at a site other than the sequence corresponding to the adapter to prepare a linear nucleic acid.
 The base sequence and length of the adapter may be chosen freely as long as the adapter remains identifiable. Here, an “index sequence” is a sequence containing enough bases for the sequence serving as the index to be identified.
 An identifiable index sequence may be attached to such cDNA fragments using, for example, the index-attachment procedure of a fragment analysis method such as the HiCEP, AFLP, T-RFLP, CAGE or Differential Display method, more preferably the AFLP, T-RFLP, CAGE or Differential Display method, and most preferably the HiCEP method. In general, fragments may be analyzed exhaustively by using one of these fragment analysis methods to attach an identifiable index sequence to the cDNA fragments and then subjecting the mixture of fragments to electrophoresis, such as gel electrophoresis and/or capillary electrophoresis, to obtain bands or peaks and the electrophoresis sequence length (also referred to herein as “molecular weight”, “sequence length” or “fragment length”).
 The fragment mixture thus prepared is run on a high-speed DNA sequencer to obtain read sequences.
 Here, a “high-speed DNA sequencer” is a sequencer that can sequence multiple kinds of base sequences of different lengths without separating them. For example, sequencers such as the Roche 454FLX, the Illumina GAII series and HiSEQ series, the LifeTechnology SOLiD series and Ion Torrent PGM series, and those from Helicos and Pacific Bio can be used, but the sequencer is not limited to these. The high-speed DNA sequencer may also be one that does not require cloning.
 Next, using the two elements of read sequence length and similarity as parameters, the read sequences are clustered and assembled by computer processing. This produces highly accurate sequence clusters and consensus sequences, and the number of read sequences constituting each sequence cluster is counted.
 The clustering and assembling of read sequences by computer processing are described in more detail with reference to FIGS. 9A and 9B. FIGS. 9A and 9B show the same series of steps; for convenience, FIG. 9A details steps 1 to 3 and FIG. 9B details steps 4 to 6. When both sequence clustering and assembling are performed, the combined processing is also referred to herein as “clustering and assembling” or “clustering and assembling processing”.
 Here, “sequence clustering” is used interchangeably with “clustering” and “clusterization” and means grouping based on predetermined parameters, preferably base sequence similarity and/or sequence length. A group produced by clustering is called a “cluster” or “sequence cluster”. A cluster consisting of multiple sequences of the same length is called an “aligned cluster”, and a cluster consisting of multiple sequences of different lengths is called an “unaligned cluster”. A cluster consisting of only one sequence is also called a “singleton”; a singleton may also be used as a cluster.
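The three kinds of cluster defined above can be told apart from the member lengths alone. The following sketch assumes a cluster is represented as a plain list of sequence strings; the function name is illustrative.

```python
def cluster_kind(sequences):
    """Classify a cluster as 'singleton', 'aligned' or 'unaligned' by member lengths."""
    if len(sequences) == 1:
        return "singleton"
    lengths = {len(s) for s in sequences}
    return "aligned" if len(lengths) == 1 else "unaligned"
```

For example, two 4-base members form an aligned cluster, while a 4-base and a 5-base member together form an unaligned cluster.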
Here, "assembling" is used interchangeably with "assembly" and "assemble". It means obtaining a consensus sequence, that is, a single representative sequence, from multiple nucleic acid sequences that share at least a partially common sequence, and also obtaining alignment information that maps the assembled sequences onto the consensus sequence.
Here, a "read sequence" is a sequence output by the sequencer.
Here, a "consensus sequence" is an artificial sequence obtained by the assembling process.
Step 1 Sequence classification
When specific sequences appear at both ends of the fragment DNA sequences to be detected by comprehensive fragment analysis, both or one of the end sequences is evaluated, and the sequences to be used for clustering/assembling are sorted out. Specifically, each read sequence is checked for the index sequence; a read that contains the index sequence at both ends or at one end is extracted as a read sequence for database construction and used in the following steps.
Whether a read contains the index sequence is determined by confirming the presence of the index sequence in that read. The index sequence used for this check may be a base sequence corresponding to the adapter sequence attached to the DNA fragment: it may correspond to the entire adapter sequence, to part of the adapter sequence, or to the adapter sequence plus additional bases. When additional bases are included, any number of arbitrary bases N (a base selected from adenine, thymine, guanine and cytosine) may be used, and they are preferably placed so as to extend the sequence corresponding to the adapter toward its 5' end or 3' end. The number of arbitrary bases N may be, for example, 1 or more, 2 or more, 3 or more, 4 or more, or 5 or more; preferably it is two per adapter, that is, two at each of the 5' and 3' ends of one sequence. However, when index sequences are attached to both ends of a fragment, the 5' side and the 3' side may carry different numbers of arbitrary bases.
The index sequence may lie in the interior of a read sequence, but it is preferably located at both ends.
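The index-sequence check of Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the index sequences `CCGGAT` and `GATTAA`, the helper names, the use of exact matching, and the placement of the two arbitrary bases N on the inner side of each index are all choices made only for this example.

```python
import re

# Hypothetical adapter-derived index sequences (not taken from the source).
FIVE_PRIME_INDEX = "CCGGAT"
THREE_PRIME_INDEX = "GATTAA"


def build_patterns(five_index, three_index, n_extra=2):
    """Compile one pattern per read end.

    Each pattern is the index sequence extended by n_extra arbitrary
    bases N (A, C, G or T) on its inner side (an assumption of this sketch).
    """
    any_bases = "[ACGT]{%d}" % n_extra
    start = re.compile("^" + re.escape(five_index) + any_bases)
    end = re.compile(any_bases + re.escape(three_index) + "$")
    return start, end


def has_index(read, patterns, require_both=True):
    """Return True if the read carries the index at both ends (or at either end)."""
    start, end = patterns
    at_start = start.search(read) is not None
    at_end = end.search(read) is not None
    return (at_start and at_end) if require_both else (at_start or at_end)


def extract_reads(reads, patterns, require_both=True):
    """Keep only the reads that will be used for clustering/assembling."""
    return [r for r in reads if has_index(r, patterns, require_both)]


patterns = build_patterns(FIVE_PRIME_INDEX, THREE_PRIME_INDEX)
reads = [
    "CCGGATACTTTTGGGATTAA",  # index at both ends
    "CCGGATACTTTTGG",        # index at the 5' end only
    "ACTTTTGG",              # no index
]
both_ends = extract_reads(reads, patterns)
either_end = extract_reads(reads, patterns, require_both=False)
```

Setting `require_both=False` corresponds to accepting reads that carry the index sequence at only one end, which the text also allows.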
Step 2 Clustering/assembling
Clustering/assembling is performed to obtain sequence clusters and their consensus sequences. All of the extracted read sequence data are subjected to sequence clustering and assembling using predetermined parameters. This forms multiple sequence clusters, and for each sequence cluster the number of constituent sequences, the consensus sequence, the consensus sequence length, and alignment information are obtained. The predetermined parameters may be, for example, parameters relating to sequence similarity, sequence length and/or the index sequence, preferably parameters relating to all three.
Step 3 Correction of clustering errors
For the sequence clusters obtained, the alignment information between each consensus sequence and the sequences making up its cluster is used to evaluate the sequence similarity and the identity of sequence length among the sequences within each cluster. The sequence similarity and sequence length are also evaluated between consensus sequences. Errors and inconsistencies in the clustering/assembling are thereby detected, and the clusters are corrected.
Step 4 Recording cluster reliability as data
For the sequence clusters obtained in Step 3, the reliability of each consensus sequence as the representative sequence of its cluster is expressed as data, based on the alignment information between the consensus sequence and the read sequences making up the cluster.
Cluster reliability can be obtained, for example, by evaluating the bases adjacent to the index sequence in the cluster's consensus sequence and in its constituent read sequences. In that case, the number of bases adjacent to the index sequence that are evaluated may be 2 or more, and is preferably 2.
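One way to turn the evaluation of index-adjacent bases into a number is sketched below. The source does not specify a scoring formula; the agreement fraction used here, the function names, and the assumption that the index sequence occupies the first `index_length` bases of every read are illustrative only.

```python
def adjacent_bases(sequence, index_length, n_adjacent=2):
    """Return the n_adjacent bases that directly follow the 5' index sequence."""
    return sequence[index_length:index_length + n_adjacent]


def adjacent_base_agreement(consensus, reads, index_length, n_adjacent=2):
    """Fraction of reads whose index-adjacent bases equal those of the consensus."""
    if not reads:
        raise ValueError("a cluster must contain at least one read")
    expected = adjacent_bases(consensus, index_length, n_adjacent)
    agreeing = sum(
        1 for read in reads
        if adjacent_bases(read, index_length, n_adjacent) == expected
    )
    return agreeing / len(reads)


# Hypothetical cluster: a 6-base index followed by the fragment sequence.
consensus = "CCGGATACTTTT"
cluster_reads = ["CCGGATACTTTT", "CCGGATACTTTA", "CCGGATACTTTT", "CCGGATAGTTTT"]
score = adjacent_base_agreement(consensus, cluster_reads, index_length=6)
```

Here three of the four reads share the consensus bases `AC` next to the index, so a low score would flag a cluster whose members may have been grouped incorrectly.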
Step 5 Recording consensus sequence reliability as data using known gene information
For the consensus sequences of the sequence clusters obtained in Step 4, in species for which known gene information (transcript, genome, EST information and the like) exists, the known sequence information is searched and reliability data for the consensus sequences are created.
Step 6 Assigning gene information to consensus sequences
For the consensus sequences of the sequence clusters obtained in Step 4, known sequence information is searched and gene information is assigned to the sequences.
The above steps make it possible to construct a comprehensive fragment analysis database (hereinafter also "DB"). Steps 4 to 6 are optional and may be performed as needed, for example when higher reliability is to be ensured or when specific gene information is to be added to the database.
FIG. 2 shows an example of the items contained in the database obtained by Steps 1 to 6. Through Steps 1 to 6, the database comes to contain "consensus sequence", "number of constituent sequences of the sequence cluster", "consensus sequence length of the sequence cluster", "alignment information", "reliability data of the sequence clustering" and "gene information of the sequence cluster"; these items need only be stored in the storage unit in association with one another.
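The database items listed above could be held in a record such as the following. This is only a sketch: the class name, field names and field types (for example, alignment information stored as a list of aligned strings) are assumptions, and the optional fields reflect the fact that Steps 4 to 6 are optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClusterRecord:
    """One sequence cluster as stored in the database (hypothetical layout)."""
    consensus: str                      # consensus sequence
    member_count: int                   # number of constituent sequences
    alignment: List[str] = field(default_factory=list)  # alignment information
    reliability: Optional[float] = None  # from Steps 4/5, if performed
    gene_info: Optional[str] = None      # from Step 6, if performed

    @property
    def consensus_length(self):
        """Consensus sequence length, derived from the consensus itself."""
        return len(self.consensus)


# Records keyed by cluster number so the items stay associated with each other.
database = {
    1: ClusterRecord("ACGTACGT", 3, ["ACGTACGT", "ACGTACGT", "ACGTACGA"],
                     reliability=0.9, gene_info="hypothetical gene X"),
    2: ClusterRecord(consensus="TTGACC", member_count=1, alignment=["TTGACC"]),
}
```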
(2) Associating bands or peaks obtained by electrophoresis with sequences
An example of a procedure for associating bands or peaks obtained by electrophoresis with sequences is described below with reference to FIG. 1.
Using the sequence information and base count of the highly accurate consensus sequences obtained in (1) above, together with the number of sequences making up each sequence cluster, the consensus sequences are associated with the electrophoresis band group or peak group of the comprehensive fragment analysis obtained from the DNA mixture that was sequenced (these data are collectively called the "reference profiling").
The consensus sequences are matched to the reference profiling using two factors: (i) the sequence information and base count of the consensus sequence versus the molecular weight (or electrophoretic sequence length) of the band or peak obtained by electrophoresis, and (ii) the number of sequences making up the cluster of the consensus sequence versus the intensity of the band or peak.
For this matching, calibrated values may be used. The calibration information, which corrects the base count according to the molecular weight and base composition of a sequence, is derived in advance from experiments that relate a large number of sequence lengths to the base counts observed for them by electrophoresis.
In this way, the consensus sequences contained in the database of (1) above are associated with the reference profiling.
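The two-factor matching described above (length against electrophoretic length, read count against peak intensity) can be sketched as follows. The tolerance value, the scoring rule and the placeholder calibration function are assumptions for illustration; the source does not give concrete formulas, and this sketch does not enforce a one-to-one assignment.

```python
def identity_calibration(length, sequence):
    """Placeholder calibration: a real one would correct for base composition."""
    return float(length)


def match_peaks(clusters, peaks, calibrate=identity_calibration, tolerance=1.0):
    """Associate each peak with the best-fitting cluster, or None.

    clusters: list of (cluster_id, consensus, read_count)
    peaks:    list of (peak_id, electrophoretic_length, intensity)
    """
    total_reads = sum(count for _, _, count in clusters)
    total_intensity = sum(intensity for _, _, intensity in peaks)
    matches = {}
    for peak_id, peak_length, intensity in peaks:
        # Factor 1: calibrated consensus length must be close to the peak length.
        candidates = [
            c for c in clusters
            if abs(calibrate(len(c[1]), c[1]) - peak_length) <= tolerance
        ]
        if not candidates:
            matches[peak_id] = None
            continue
        # Factor 2: relative read count should resemble relative peak intensity.
        peak_share = intensity / total_intensity
        best = min(candidates, key=lambda c: abs(c[2] / total_reads - peak_share))
        matches[peak_id] = best[0]
    return matches


clusters = [("c1", "A" * 50, 900), ("c2", "C" * 51, 100), ("c3", "G" * 80, 500)]
peaks = [("p1", 50.4, 8.0), ("p2", 80.2, 5.0), ("p3", 120.0, 1.0)]
matches = match_peaks(clusters, peaks)
```

In this toy data, peak `p1` has two length-compatible clusters, and the read count decides between them; `p3` has no compatible cluster and stays unassigned.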
(3) Method for identifying the genes of bands or peaks obtained by comprehensive fragment analysis, using the database of (1) and the association information of (2)
An example of a procedure for identifying the genes of bands or peaks obtained by comprehensive fragment analysis is described below with reference to FIG. 3.
[Method 1]
Step 1 Create data that associates the profiling result obtained from the sample subject to gene identification with the reference profiling used in (2).
Step 2 For the band group or peak group of interest obtained from the sample subject to gene identification, the corresponding bands or peaks of the reference profiling are found using the association data created in Step 1. From the association information of (2), the clusters created in (1) above are then found, and their consensus sequences and gene information are retrieved. A correspondence list between the bands or peaks of interest and gene information is thereby created.
Step 3 In addition, a consensus sequence of interest is selected on the basis of the gene information created in Step 6 of (1) above, and the corresponding band or peak in the profiling of the sample subject to gene identification is found via the reference profiling band or peak associated with it in (2) above.
[Method 2]
For one or more bands or peaks in the electrophoresis result obtained from the sample subject to gene identification, the base count obtained by electrophoresis is presented side by side with the bands of the reference profiling used in (2) above, and with the pseudo-profiling created from the sequence clusters of (1) above and their sequence counts, together with the associated gene information. Gene information for the band or peak of interest in the electrophoresis result is thereby obtained.
(4) Detection in comprehensive fragment analysis using a high-speed DNA sequencer
Fragments can also be analyzed comprehensively with a high-speed DNA sequencer, for example to detect a gene of interest. An example of such a method is described with reference to FIG. 6.
For each of the multiple samples to be measured, the mixture prepared by the comprehensive fragment analysis method is run on the same type of high-speed DNA sequencer to obtain sequences.
For the read sequences obtained from each sample to be measured, a database as in (1) is created. Sequence clusters are matched to one another across samples on the basis of consensus sequence similarity, the numbers of constituent sequences are compared between matched clusters, and groups of sequences whose abundance changes are detected. Expression analysis is thus performed between a first target sample and a second target sample.
When sequence counts are compared, they may first be normalized using the total number of read sequences or the number of sequences used for clustering.
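The normalization and comparison of cluster sizes between two samples might look like the sketch below. The matched cluster pairs are assumed to have been found already by consensus similarity; the function names and the two-fold threshold are illustrative assumptions, not values from the source.

```python
def normalize(counts, total_reads):
    """Convert raw cluster counts to fractions of the sample's total reads."""
    return {cluster_id: n / total_reads for cluster_id, n in counts.items()}


def changed_pairs(counts_a, total_a, counts_b, total_b, matched, min_fold=2.0):
    """Return matched cluster pairs whose normalized abundance changed.

    matched: list of (cluster_id_in_a, cluster_id_in_b) pairs.
    The returned dict maps each changed pair to its fold change (B over A).
    """
    norm_a = normalize(counts_a, total_a)
    norm_b = normalize(counts_b, total_b)
    changed = {}
    for id_a, id_b in matched:
        fold = norm_b[id_b] / norm_a[id_a]
        if fold >= min_fold or fold <= 1.0 / min_fold:
            changed[(id_a, id_b)] = fold
    return changed


# Hypothetical cluster sizes; sample B was sequenced twice as deeply as sample A.
sample_a = {"a1": 100, "a2": 50}
sample_b = {"b1": 500, "b2": 105}
matched = [("a1", "b1"), ("a2", "b2")]
result = changed_pairs(sample_a, 1000, sample_b, 2000, matched)
```

Without normalization, cluster `a2`/`b2` would look doubled simply because sample B has more reads; after dividing by the totals, only `a1`/`b1` is reported as changed.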
(5) Detection in comprehensive fragment analysis using a high-speed sequencer with the database as reference
A further example of comprehensive fragment analysis using a high-speed sequencer is described with reference to FIG. 7.
The procedure of (1) above is carried out in advance on the samples to be measured, and a database is created.
For each of the multiple samples to be measured, the mixture prepared by the comprehensive fragment analysis method is run on the same type of high-speed DNA sequencer to obtain sequences. This sequencer need not be the same as the one used for the database created in advance.
The read sequences to be measured are clustered by aligning them, or applying similar processing, against the consensus sequences of the database created in advance, which serve as the reference.
The number of sequences clustered to the same consensus sequence is compared between measured samples, groups of sequences whose abundance changes are detected, and expression analysis is performed between a first target sample and a second target sample.
When sequence counts are compared, they may first be normalized using the total number of read sequences or the number of sequences used for clustering.
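Reference-based clustering against the database consensus sequences can be sketched as below. A real pipeline would use a proper alignment tool; here a position-by-position identity over equal-length sequences stands in for alignment, and the function names and the identity threshold are assumptions for illustration.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_reads_per_reference(reads, references, min_identity=0.9):
    """Assign each read to its most similar consensus and count the assignments.

    references: dict mapping reference id to consensus sequence.
    Reads with no equal-length reference at or above min_identity are ignored.
    """
    counts = Counter()
    for read in reads:
        if not read:
            continue
        best_id, best_score = None, min_identity
        for ref_id, consensus in references.items():
            if len(consensus) != len(read):
                continue  # sequence length is also a matching criterion
            score = identity(read, consensus)
            if score >= best_score:
                best_id, best_score = ref_id, score
        if best_id is not None:
            counts[best_id] += 1
    return counts


references = {"r1": "ACGTACGTACACGTACGTAC", "r2": "TTTTGGGGCCCCAAAATTTT"}
reads = [
    "ACGTACGTACACGTACGTAC",  # exact match to r1
    "ACGTACGTACACGTACGTAA",  # one mismatch to r1
    "TTTTGGGGCCCCAAAATTTT",  # exact match to r2
    "GGGGGGGGGGGGGGGGGGGG",  # similar to neither
    "ACGTACGTAC",            # wrong length
]
counts = count_reads_per_reference(reads, references)
```

The resulting per-reference counts for each sample are what would then be normalized and compared between samples.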
(6) Comprehensive fragment analysis method using a microarray made with probes designed from the database
An example of a comprehensive fragment analysis method that uses a microarray is described with reference to FIG. 8.
The procedure of (1) is carried out in advance on the samples to be measured, and a database is created. Probes are designed on the basis of the consensus sequences obtained, and a microarray is prepared.
For the multiple samples to be measured, the mixtures prepared by the comprehensive fragment analysis method are applied to the microarray prepared above, groups of sequences whose abundance changes are detected, and expression analysis is performed between a first target sample and a second target sample.
(7) Inspection and classification step using the index sequence
A further example of the inspection and classification step using the index sequence is described below with reference to FIG. 18.
All sequenced read sequences are loaded, and similarity data are calculated between them and the known index sequences that must be present in those reads. Each read sequence is then examined one at a time against the similarity data to confirm whether a known index sequence is present. Read sequences in which a known index sequence is confirmed are classified as sequences to be used for clustering.
(8) Clustering/assembling process
A further example of the clustering/assembling process is described below with reference to FIG. 19.
In FIG. 19, the symbols have the following meanings:
M: index of the read sequence serving as the seed of a cluster
N: index used to scan from the read sequence following the M-th (seed) read sequence to the last read sequence
I: index of the cluster generated.
All read sequences to be used for clustering are loaded. First, the read sequence that will seed a cluster, the M-th sequence, is chosen. Every remaining read sequence, starting from the one after the seed, is then compared in turn with the seed: the seed and the N-th target read are compared for similarity and sequence length, and if they are judged to be the same, the read is stored in the I-th cluster in cluster storage. The clusters are established once the search has been completed for all seeds. Assembling is then performed on each cluster to obtain its consensus sequence.
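The seed-based loop over M, N and I can be sketched as follows. The similarity test (equal length plus a position-wise identity threshold) and the majority-vote consensus are simplifying assumptions chosen for this example; the source leaves the concrete similarity measure and assembler open.

```python
from collections import Counter


def is_same(seed, read, min_identity=0.8):
    """Judge two reads the same: equal length and sufficient identity."""
    if len(seed) != len(read) or not seed:
        return False
    matches = sum(x == y for x, y in zip(seed, read))
    return matches / len(seed) >= min_identity


def cluster_reads(reads, min_identity=0.8):
    """Greedy seed clustering; clusters[I] holds the reads of the I-th cluster."""
    assigned = [False] * len(reads)
    clusters = []
    for m in range(len(reads)):              # M: seed index
        if assigned[m]:
            continue                         # already a member of an earlier cluster
        assigned[m] = True
        members = [reads[m]]
        for n in range(m + 1, len(reads)):   # N: scan the reads after the seed
            if not assigned[n] and is_same(reads[m], reads[n], min_identity):
                assigned[n] = True
                members.append(reads[n])
        clusters.append(members)             # this becomes cluster I
    return clusters


def consensus(members):
    """Majority base per column; valid because members share one length."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*members))


reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT", "ACGTA"]
clusters = cluster_reads(reads)
consensus_sequences = [consensus(members) for members in clusters]
```

The last two reads end up as singletons, and the count of members in each cluster is the "number of constituent sequences" recorded in the database.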
(9) Program
To carry out the method according to any aspect of the present invention, a program may be provided that causes a computer to execute, as procedures, the steps (here also called "stages") included in each method. For example, a program is provided for carrying out, as procedures, the stages included in (1), (2), (3), (4), (5), (6), (7) and/or (8) above.
For example, a program for causing a computer to execute the method of (1) above may be provided stored on any medium.
Such a program is, for example, a database construction program for comprehensive fragment analysis of transcripts that causes a computer to execute processing including:
a procedure of checking, for all read sequence data obtained by high-speed DNA sequencing of a fragment DNA mixture that is derived from the transcripts contained in a sample and has been fragmented and given an identifiable index sequence, whether the index sequence portion is present, and extracting the read sequence data that have the index sequence;
a procedure of performing sequence clustering and assembling on all of the extracted read sequence data using predetermined parameters relating to sequence similarity and sequence length, thereby forming multiple sequence clusters, and obtaining, for each sequence cluster, the number of constituent sequences, the consensus sequence, the consensus sequence length, and alignment information; and
a procedure of constructing a database containing the number of constituent sequences, the consensus sequence, the consensus sequence length, and the alignment information associated with each sequence cluster.
The computer used in aspects of the present invention may be any computer known per se. FIG. 44 schematically shows an example of the configuration of a computer for carrying out procedures according to aspects of the present invention. The computer includes a processing management unit, a storage unit, a temporary recording unit, a program storage unit, a clustering/assembling processing unit, an index sequence inspection unit, a correction data storage unit, a similarity determination unit and a sequence length determination unit. All of the other components, namely the storage unit, temporary recording unit, program storage unit, clustering/assembling processing unit, index sequence inspection unit, correction data storage unit, similarity determination unit and sequence length determination unit, are connected at least to the processing management unit so that signals can be exchanged. Further components for performing processing as desired may also be included; such components are connected to the processing management unit so that signals can be exchanged.
All programs are stored in the program storage unit. The processing management unit manages all processing and causes it to be executed in accordance with the programs stored in the program storage unit. The database constructed according to aspects of the present invention is stored in the storage unit. Read sequences are stored in the storage unit or the temporary recording unit. The index sequence inspection unit, on instruction from the processing management unit acting in accordance with a program stored in the program storage unit, inspects whether a read sequence, output from the component in which it is stored and input to the inspection unit, contains the index sequence. The clustering/assembling processing unit clusters and assembles the read sequences. The correction data storage unit stores data used to correct the data obtained. The processing management unit causes the data stored in the correction data storage unit to be output, causes the correction program to be output from the program storage unit, and corrects the obtained data on that basis. The similarity determination unit, on instruction from the processing management unit acting in accordance with a program stored in the program storage unit, determines the similarity of the items to be compared. The sequence length determination unit, likewise on instruction from the processing management unit, determines the sequence length of the items to be compared.
The computer may further have an input unit, such as a keyboard and/or a scanner, for entering data from an operator, a high-speed DNA sequencer or the like. It may also have an output unit, such as a monitor and/or a printer, for outputting the results obtained. Although the above describes an example in which the processing management unit performs the correction, the computer may instead have a separate correction unit that corrects the data as described above.
The present inventors have found that the conventional techniques have the following problems. These problems are also solved by the present invention.
One class of methods compares, between different samples, the electrophoresis results (band groups or peak groups) of fragment DNA sequences and detects band groups or peak groups of differing intensity (hereinafter, "comprehensive fragment analysis techniques"). The fragments are obtained, as typified by the HiCEP method, by cleaving DNA sequences with restriction enzymes, preparing fragment sequences with specific sequences attached to their ends, amplifying them by PCR using those specific sequences, and electrophoresing them; or by amplifying DNA sequences by PCR using specific sequences and then electrophoresing them. Representative comprehensive fragment analysis techniques include the methods called HiCEP, AFLP, T-RFLP, SAGE, CAGE and Differential Display.
These comprehensive fragment analysis methods can be carried out even without existing sequence data. However, methods other than HiCEP suffer from low coverage, from fragments too short to identify the gene, or from multiple bands or peaks arising from a single sequence, which makes analysis difficult.
Unlike other comprehensive fragment analysis methods, the HiCEP method is characterized by conditions adjusted so that restriction enzyme cleavage generates only one kind of fragment from each kind of mRNA sequence or genomic sequence fragment (starting sequence) under analysis. In addition, in the PCR step used for detection, primers that extend two bases (the selection sequence) inward beyond the adapter sequence are used, and 256 separate PCRs and electrophoresis runs are performed. This allows roughly 20,000 or more kinds of fragment sequences to be obtained simultaneously as independent waveform peaks, with each peak corresponding to one kind of original sequence. Consequently, when electrophoresis results obtained by HiCEP are compared, a peak whose intensity differs between samples indicates a corresponding quantitative difference in the sequence from which that fragment derives. Moreover, because the technique is PCR-based, transcripts expressed at low levels can be detected, and because reproducibility is very good, expression differences of 1.2-fold or more can be detected.
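The figure of 256 reactions is consistent with one two-base selection sequence on each of the two primers, since 4^2 x 4^2 = 256. The sketch below enumerates the pairs under that assumption; the function name is hypothetical, and for simplicity the fragment is binned by the first two and last two bases of its insert on one strand, ignoring strand and reverse-complement details.

```python
from itertools import product

BASES = "ACGT"

# All 16 possible two-base selection sequences.
SELECTION_SEQUENCES = ["".join(p) for p in product(BASES, repeat=2)]

# One reaction per (5'-side, 3'-side) selection pair: 16 x 16 = 256.
REACTIONS = list(product(SELECTION_SEQUENCES, repeat=2))


def reaction_for(insert):
    """Return the selection pair that would amplify a fragment.

    `insert` is the fragment sequence between the two adapters, read on one
    strand; the pair is simply its first two and last two bases.
    """
    if len(insert) < 4:
        raise ValueError("insert too short for two 2-base selection sequences")
    return insert[:2], insert[-2:]


pair = reaction_for("ACGTTTGA")
reaction_number = REACTIONS.index(pair)
```

Splitting the fragments over the reactions in this way is what keeps the number of peaks per electrophoresis run small enough for each peak to be resolved independently.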
However, with HiCEP as with the other comprehensive fragment analysis methods, even when bands or peaks that differ in amount can be identified, determining their sequences requires the laborious step of isolating them.
One proposed solution, for species with publicly known sequence information, is to predict the comprehensive fragment analysis by computer and thereby determine the sequences. In practice, however, even when the sequence of a band or peak can be predicted, the reliability of the prediction is low, owing to problems such as discrepancies between electrophoretic length and sequence length and the difficulty of discrimination caused by the overabundance of known sequence information.
For HiCEP, a database was created using ES cells as the sample by determining, with a Sanger sequencer, the sequences of about 14,000 of the peaks detected by HiCEP. Creating this database, however, took about three years and a great deal of money, and applying this approach to every sample to be measured by HiCEP is not realistic.
Recently, high-speed DNA sequencers have appeared and are widely used in research to sequence genomic DNA and mRNA. However, they involve practices unsuited to sequencing samples for comprehensive fragment analysis, such as limits on read length and making sequences a uniform length when preparing the sequences to be sequenced. Moreover, because reads are mapped to genomic sequences or transcripts on the basis of sequence similarity alone and clustered gene by gene, these approaches cannot be applied to species lacking sequence information, and they are subject to biases such as mapping errors, so the expected reproducibility is obtained only for the major sequence groups.
The present invention can also provide the following effects.
(1) Simplified sequence determination for bands or peaks of interest
To learn the sequences of candidate bands or peaks obtained by a comprehensive fragment analysis method, those bands or peaks must be isolated and sequenced. Not knowing the sequence at the moment a band or peak of interest is detected is a major obstacle to subsequent analysis. The most serious drawback is that, when comprehensive fragment analysis yields many candidates, no information is attached to the bands or peaks, so existing knowledge cannot be used to narrow them down. The targets for isolation must then be chosen not on scientific grounds but on factors such as the ease of subsequent experiments, and important genes may be dropped from the candidates. All bands or peaks of interest could of course be isolated and sequenced, but this takes a great deal of money and time when there are many candidates. A further problem is that the isolation procedure itself is laborious; in particular, where bands or peaks are densely packed at one-base intervals, cloning must be performed before sequencing, making the process costly and time-consuming.
As one way to omit this isolation step, we also built a method that, for organisms with genome or transcript information, simulates the HiCEP method by computer, matches the virtual fragment sequences derived from known sequences against the electrophoresis bands or peaks, and predicts the sequence of each band or peak. However, the length measured by electrophoresis (referred to as the electrophoretic length) does not always agree with the actual sequence length of the fragment, and the known sequences outnumber the bands or peaks, so that many candidate sequences arise. For these reasons, it was found that sequences cannot be accurately associated with bands or peaks by sequence length alone.
As another approach, we considered that the isolation step could be omitted by determining the sequences of all bands or peaks of the target sample in advance and building a database. We therefore performed HiCEP on ES cells, determined the sequences of about 14,000 of the resulting peaks using a Sanger sequencer, and built a database. The database proved useful for analyzing ES cells, but building it required an enormous amount of time, and it became clear that building such a database in this way for every species and sample to which HiCEP is applied would be difficult.
By combining a comprehensive fragment analysis method with a high-speed DNA sequencer and compiling the sequences into a database using the present method, the sequences of the band groups or peak groups can be identified comprehensively in a far shorter time and at far lower cost than before. Furthermore, because the present method does not require known sequences, it can be applied even to species for which no known sequences exist.
Consequently, applying the present method not only identifies the sequences of the band groups or peak groups of a comprehensive fragment analysis method but also yields a comprehensive set of fragment sequences of the genome or transcripts.
(2) A new detection method for comprehensive fragment analysis
Determining the sequences of the bands or peaks obtained by a comprehensive fragment analysis method, the issue addressed in (1) above, is important. A high-speed DNA sequencer could, however, also be used as the means of detecting fragments that differ in abundance in comprehensive fragment analysis.
When a high-speed DNA sequencer is used, however, the sequences to be analyzed normally have to be fragmented at random and brought to a uniform length before sequencing.
In addition, clustering the resulting reads requires known sequences to serve as a reference.
For these reasons, sequencing and analyzing the cDNA preparation obtained by a comprehensive fragment analysis method with a high-speed DNA sequencer has been considered difficult.
By sequencing the cDNA preparation of a comprehensive fragment analysis method with a high-speed DNA sequencer and building a database by the present method, a consensus sequence and the number of constituent sequences can be obtained for each sequence cluster. Using the consensus sequences and the constituent sequence counts, sequence clusters that differ in abundance can be identified. Although the limitations of high-speed DNA sequencers remain, the band groups or peak groups need not be obtained by PCR and electrophoresis: the read counts of sequence clusters can be compared directly, with the advantage that no association between band or peak groups and sequence clusters is needed.
Furthermore, on the premise that the consensus sequences of the sequence clusters created by the present method are used as reference sequences, existing analysis methods (RNA-seq on a high-speed DNA sequencer, or microarrays) can be applied to the DNA mixture prepared from the sample of interest by a comprehensive fragment analysis method. This makes possible an analysis that exploits the features of comprehensive fragment analysis while compensating for the shortcomings of the existing methods. Moreover, because highly accurate sequence cluster information, including stored reliability data, is available, the analysis is more accurate than applying the existing methods alone.
The HiCEP method (high coverage expression profiling method) is one method of comprehensive fragment analysis; it analyzes gene expression comprehensively and with high accuracy from a very small amount of sample. Its greatest feature is that even low-abundance transcripts can be analyzed reproducibly and accurately. In addition, because the method does not require gene sequence information in advance, it can be applied to species whose genome information is unknown. This also means, however, that predicting the base sequence of the transcript behind each profiling peak is difficult. We therefore used the present method to identify the base sequences of the electrophoresis peaks in an expression profile obtained by HiCEP.
As shown in FIG. 20, the present method is intended to be used as follows. A sample of the kind to be measured is first prepared by the HiCEP method, sequenced by the present method, and clustered and assembled to create a database of sequence clusters. Data associating the clusters with the peaks of a HiCEP reference profile obtained from the same prepared sample are then stored. Thereafter, HiCEP is performed on a different sample of the same species and tissue as the sample used to build the database, a profile for analysis is obtained, and the electrophoresis peaks of interest are listed. The sequences of those peaks are then determined using the previously built sequence cluster database and the stored associations between clusters and reference profile peaks.
Specifically, as shown in FIG. 10, HiCEP starts from RNA extracted from a biological sample (total RNA) or from purified mRNA. Double-stranded cDNAs are first synthesized and cleaved with two suitable restriction enzymes, and a preparation is made that contains only the cDNA fragments carrying a distinct adapter at each end. A characteristic of HiCEP is that each starting mRNA species gives rise to only one species of cDNA fragment carrying different adapter sequences at its two ends.
In the HiCEP method, as shown in FIG. 11, the cDNA fragment preparation is further divided into 256 aliquots. Sixteen primers that extend two bases beyond the known adapter sequence are prepared for each of the two adapters, and PCR is performed with the 256 different primer combinations. As shown in FIG. 12, each PCR product is run on a capillary electrophoresis instrument together with size markers, and the electrophoresis waveform pattern, the electrophoretic length of each peak, and the fluorescence intensity are obtained as profiling data.
The sequences of the profiling peaks obtained by HiCEP were identified by the present method using mouse ES cells (E14) as the sample.
(1) Construction of a comprehensive fragment sequence database using a high-speed DNA sequencer
The HiCEP method was performed using 1 μg of total RNA from mouse ES cells (E14).
Next, the "template cDNAs" obtained in the HiCEP step shown in FIG. 10 (a mixture of sequences carrying at both ends the adapters used in the HiCEP method, which serve as index sequences; lengths range from about 60 bases to about 800 bases) were amplified with primers on the adapters to obtain the amount of DNA needed for sequencing. To remove the primer-dimer and adapter-dimer fractions, the product was then purified by acrylamide gel electrophoresis, removing fragments of about 70 to 100 bases or shorter. The purified product was sequenced on a GS 454 FLX System, a high-speed DNA sequencer manufactured by Roche. The DNA was not fragmented when the sequencing library was prepared. Sequencing yielded 469,318 reads in the first run (half a plate) and 1,868,178 reads in the second run (two plates). For these reads we developed the following steps, which use two properties, sequence length and sequence similarity, to cluster and assemble the reads by computer, create highly accurate sequence clusters and consensus sequences, count the reads that make up each cluster, and store the results in a database.
Step 1: Inspection and classification by index sequence (the adapter sequences used in the HiCEP method)
The cDNA fragments of FIG. 10 that HiCEP detects always carry adapter sequences, which are specific index sequences, at both ends. The index sequences are evaluated for every read, and the reads to be used for clustering and assembly are selected.
Specifically, as shown in FIG. 13, all reads are searched for similarity with the cross_match program (University of Washington) against 32 masking sequences, each consisting of an adapter sequence extended through the selection bases NN. Reads in which an adapter sequence is confirmed at both ends or at one end with at least a set level of similarity are made the targets of clustering and assembly.
(A) cross_match parameters
The parameters of the cross_match program are as follows.
A) Minimize the mismatch and gap penalties
-penalty -1 -gap_init -1 -gap_ext -1
The gap penalties are minimized to allow for the characteristic 454 read error, in which the number of bases called in a homopolymer run (a stretch of one repeated base) varies widely from read to read.
B) Use a small word size
-minmatch 5
This makes cross_match detect and output as many pairwise alignments as possible.
C) Use a low minimum score
-minscore 15
(B) Index sequences
The adapter sequences on the MspI side and the MseI side are used as index sequences and supplied to cross_match as the input mask sequences. As shown in FIG. 13, the index sequences actually used comprised not only the adapter sequence but also the two selection bases NN. Covering all patterns therefore required 32 adapter sequences in total, 16 each for MspI and MseI. Adapter sequences without the NN selection bases could also serve as index sequences, but the 32 sequences were adopted because they allowed slightly more index sequences to be confirmed.
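The construction of the 32 index sequences and the cross_match options listed above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the adapter strings are hypothetical 31-base placeholders (the real sequences appear only in FIG. 13), appending NN to the end of each adapter is a simplification of the orientation shown in the figure, and the file names and the argument order of the command line are assumptions.

```python
from itertools import product

# Hypothetical 31-base placeholders; the real adapters are given in FIG. 13.
ADAPTERS = {
    "MspI": ("ACGT" * 8)[:31],
    "MseI": ("TGCA" * 8)[:31],
}


def index_sequences(adapters):
    """Extend each adapter by every two-base selection sequence NN."""
    out = {}
    for side, adapter in adapters.items():
        for n1, n2 in product("ACGT", repeat=2):
            out[f"{side}_{n1}{n2}"] = adapter + n1 + n2
    return out


def cross_match_command(reads_fasta, index_fasta):
    """Build an argument list carrying the options quoted in the text.

    The positional order (reads file, then index file) is an assumption.
    """
    return [
        "cross_match", reads_fasta, index_fasta,
        "-penalty", "-1", "-gap_init", "-1", "-gap_ext", "-1",
        "-minmatch", "5",
        "-minscore", "15",
    ]


seqs = index_sequences(ADAPTERS)
cmd = cross_match_command("reads.fasta", "index.fasta")
```

Two adapters times 16 NN combinations give the 32 sequences of 33 bases each that the text describes.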
(C) Classification of adapters detected by cross_match
To obtain correct HiCEP fragments from cross_match output such as that shown in FIG. 27, the adapters detected by cross_match are divided into the following four types.
A) High-quality adapter: an adapter with a high score whose alignment includes NN
B) Rescuable low-quality adapter: an adapter whose alignment is short but has no substitutions or gaps and includes NN, so that the internal sequence can be expected to be of high quality
C) Low-quality adapter: an adapter that is of low quality and cannot be rescued
D) False adapter: an internal sequence resembling an adapter that cross_match appears to have aligned (a region that is probably not a real adapter).
(D) Criterion for a high-quality adapter
At least 29 bp of the 33 bp of adapter sequence plus NN match (see FIG. 13).
(E) Criterion for a rescuable low-quality adapter
To preserve the characteristics of the HiCEP method, all 19 bp including the selection bases NN match (see FIG. 13).
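The two criteria in (D) and (E) can be expressed as a small decision function. The sketch below assumes that each cross_match hit has already been reduced to counts of matching bases, mismatches and gaps plus a flag saying whether the alignment covers NN; these field names are hypothetical. Detecting false adapters (type D in the list of four types) needs positional information and is not covered here.

```python
HIGH_MIN_MATCH = 29    # of the 33 bp of adapter + NN
RESCUE_MIN_EXACT = 19  # exact bases required, NN included


def classify_adapter(matches, mismatches, gaps, covers_nn):
    """Return 'high_quality', 'rescuable' or 'low_quality' for one hit."""
    if not covers_nn:
        # Both usable classes require the selection bases NN in the alignment.
        return "low_quality"
    if matches >= HIGH_MIN_MATCH:
        return "high_quality"
    if mismatches == 0 and gaps == 0 and matches >= RESCUE_MIN_EXACT:
        return "rescuable"
    return "low_quality"
```

For instance, a hit with 31 matching bases and 2 mismatches is classified as high quality, while a short but exact 19-base hit that reaches NN is classified as rescuable.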
(F) Classification of reads
Reads whose confirmed adapters are high-quality adapters or rescuable low-quality adapters are used for clustering and assembly. In this example, of all 469,318 reads, 300,635 (64.1%) had adapters confirmed at both ends and 112,365 (23.9%) had an adapter confirmed at only one end (see FIG. 13).
Figure JPOXMLDOC01-appb-T000001
Step 2: Clustering and assembly
Of the reads selected for clustering and assembly, those with adapters confirmed at both ends are, in this example, clustered and assembled to generate consensus sequences of the HiCEP fragments. This removes the errors of individual reads and gives more accurate HiCEP fragment sequences. In addition, the number of reads making up a consensus sequence can serve as reference data for the transcript abundance of the HiCEP fragment.
(A) Preprocessing
As preprocessing, for every read in which adapter sequences to be used for clustering and assembly were confirmed, everything from the position of the confirmed adapter outward, including the confirmed adapter itself, is removed (see FIG. 21), and the canonical adapter base sequence is then artificially attached to the trimmed end. The per-base quality values output by the sequencer for each read are trimmed in the same way, and the maximum quality value is assigned to the positions corresponding to the artificially attached adapter.
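The trimming and re-attachment described in this preprocessing step can be sketched as below for a read with adapters confirmed at both ends. The coordinates of the adapter hits are assumed to be known already (for example from the cross_match output), the adapter strings in the usage line are hypothetical, and the maximum quality value of 40 is an assumption rather than a figure from the text.

```python
def preprocess(read, quals, left_end, right_start,
               left_adapter, right_adapter, max_q=40):
    """Replace the observed adapters and anything outside them.

    left_end    : index just past the 5' adapter hit (0-based)
    right_start : index where the 3' adapter hit begins (0-based)
    max_q       : quality given to the re-attached adapter bases (assumed)
    """
    insert = read[left_end:right_start]
    insert_q = list(quals[left_end:right_start])
    new_read = left_adapter + insert + right_adapter
    new_quals = ([max_q] * len(left_adapter)
                 + insert_q
                 + [max_q] * len(right_adapter))
    return new_read, new_quals


# Toy read: 2 stray bases, a 3-base adapter hit, 4 insert bases,
# a 3-base adapter hit, 1 stray base.
new_read, new_quals = preprocess(
    "GGACATTTTCACG", list(range(13)), 5, 9, "ACG", "CGT")
```

After this step every read starts and ends with identical, error-free adapter bases, so differences between reads reflect only the fragment itself.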
The clustering and assembly program is run on the preprocessed reads to obtain the sequence cluster information, the sequence alignment of each cluster, and the consensus sequence of each cluster.
(B) Clustering and assembly software
Clustering and assembly use the TGICL program (published on the Harvard University website http://compbio.dfci.harvard.edu/tgi/software/), which clusters and assembles on sequence similarity alone. Assembly is performed with CAP3, the assembly program bundled with TGICL.
(C) TGICL parameters
The parameter "-v 2" minimizes the number of overhang bases (read ends that are invalidated and excluded from the assembly result) permitted during assembly. Because every input sequence carries the common HiCEP adapter sequences at both ends, an alignment output with an overhang indicates that the cluster was not assembled correctly, which makes error clusters recognizable.
Default values were used for the following clustering and assembly parameters.
・Clustering
Minimum overlap length: 40 bp (-l)
Overlap identity: 94% (-p)
・Assembly
Minimum identity: 93% (-O -p)
(D) Collection of singleton sequences
When randomly fragmented samples are sequenced, the reads are normally assigned to genomic regions or genes by similarity, grouped into clusters, and interpreted from there. In that setting a singleton sequence, having formed no cluster, is sometimes not treated as a signal from which findings can be drawn. The singleton sequences obtained by the present method, however, are sequences in which the index sequence has been confirmed (in HiCEP, the adapter sequence), so they are highly reliable, and the fact that exactly one such read was sequenced is itself usable information. The present method therefore processes singleton sequences as well so that they can be used effectively (accordingly, in this specification the term cluster may also refer to a cluster consisting of a single singleton sequence).
In tgicl processing, singleton sequences arise separately at the two stages, clustering and assembly. Singletons produced during assembly are written to the singlets file created for each thread that ran the assembly, but no information is output for singletons excluded during clustering. The sequences excluded during clustering therefore have to be identified and extracted as singletons. Because tgicl writes a list of the names of the sequences it was able to cluster to a file when clustering finishes, this file is used to obtain the singletons excluded during clustering.
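Recovering the singletons excluded at the clustering stage amounts to a set difference between all input sequence names and the clustered names. The sketch below illustrates only that idea; it assumes the list of clustered names has already been parsed into one name per entry, since the exact layout of the tgicl list file is not described here.

```python
def fasta_names(lines):
    """Return sequence names from FASTA lines (text after '>' up to a space)."""
    return [line[1:].split()[0] for line in lines if line.startswith(">")]


def clustering_singletons(input_names, clustered_names):
    """Names present in the input but absent from the clustered list."""
    clustered = set(clustered_names)
    return [name for name in input_names if name not in clustered]


fasta = [">r1 first read", "ACGT", ">r2", "GGCC", ">r3", "TTAA"]
names = fasta_names(fasta)
excluded = clustering_singletons(names, ["r1", "r3"])
```

The sequences for the excluded names can then be pulled from the input file and combined with the singlets files written at the assembly stage.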
The procedure for collecting singleton sequences is as follows.
A) Edit the list of clustered sequence names into a file with one sequence name per line
B) From the FASTA input file, create a file listing all input sequence names, one per line
C) Concatenate and sort the two files above and extract the names that occur on only one line
D) Build a sequence database of the input sequences and extract the sequences obtained in C) in FASTA format
E) Concatenate the singleton (singlets) files from assembly with the file from D)
Step 3: Correction of clustering errors using sequence length
The alignment information from the tgicl assembly is output to a file in ace format. All constituent sequences of a sequence cluster derived from cDNA obtained by the HiCEP method, and the consensus sequence derived from them, must not only be similar in sequence but must also form an aligned cluster, in which the sequences are aligned over their full length.
(A) Inspection of sequence clusters
Following this principle, the contents of the ace file are checked contig by contig on the following points to detect assembly errors.
A) The full length of every read is validly assembled
No read in the contig has a clipped region (a read end that is invalidated in the alignment and does not contribute to the consensus sequence).
B) Every read is aligned over the full length of the consensus (contig) sequence
Both ends of each read aligned to the consensus sequence coincide with both ends of the consensus sequence, and no read is aligned starting from the middle of the consensus sequence.
A sequence cluster that does not satisfy conditions A) and B) above is regarded as a non-aligned cluster and judged to be an error cluster (see FIG. 14).
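Conditions A) and B) can be checked mechanically once the placement of each read on the consensus is known. The following sketch assumes those placements have already been extracted from the ace file into a simple record; the record layout and field names are hypothetical, and parsing the ace format itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class ReadPlacement:
    name: str
    start: int          # first consensus position covered (1-based)
    end: int            # last consensus position covered (inclusive)
    clipped_left: int   # bases invalidated at the 5' end of the read
    clipped_right: int  # bases invalidated at the 3' end of the read


def is_aligned_cluster(consensus_length, reads):
    """True only if every read satisfies conditions A) and B)."""
    for r in reads:
        if r.clipped_left or r.clipped_right:            # condition A)
            return False
        if r.start != 1 or r.end != consensus_length:    # condition B)
            return False
    return True


good = [ReadPlacement("r1", 1, 120, 0, 0), ReadPlacement("r2", 1, 120, 0, 0)]
bad = [ReadPlacement("r1", 1, 120, 0, 0), ReadPlacement("r3", 35, 120, 0, 0)]
```

Here `is_aligned_cluster(120, good)` holds, whereas the cluster in `bad` would be flagged as an error cluster because one read starts in the middle of the consensus.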
(B) Repair of error clusters
The constituent sequences of each cluster (contig) judged in (A) above to be an error cluster are extracted and reassembled separately, cluster by cluster, to obtain consensus sequences, thereby correcting the error cluster (see FIG. 14).
A) Iterative assembly
CAP3 is used for the individual assembly, which is first attempted at an identity of 93%. The identity parameter is raised by 1% at a time and the assembly repeated until the result is no longer judged to be an error cluster.
As a start-up parameter, "-k 0" is set, which specifies an overhang length of 0.
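The iterative reassembly loop can be outlined as follows. This is only a control-flow sketch: `assemble` and `is_error` are hypothetical callables standing in for a CAP3 run and for the error-cluster check, and the upper bound of 100% is an assumption, since the text does not say what happens if no identity setting yields an aligned cluster.

```python
def reassemble_until_aligned(reads, assemble, is_error,
                             start_identity=93, max_identity=100):
    """Raise the identity threshold by 1% until the result passes the check.

    Returns (identity, result), or (None, None) if no setting succeeds.
    """
    identity = start_identity
    while identity <= max_identity:
        result = assemble(reads, identity)
        if not is_error(result):
            return identity, result
        identity += 1
    return None, None


# Stand-ins used only to exercise the loop.
def fake_assemble(reads, identity):
    return {"identity": identity, "reads": reads}


def fake_is_error(result):
    return result["identity"] < 96


identity, result = reassemble_until_aligned(
    ["r1", "r2"], fake_assemble, fake_is_error)
```

With these stand-ins the loop tries 93, 94 and 95 before accepting the result at 96.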
B) Merging the assembly results (correcting the ace file)
The ace file output by tgicl before error-cluster repair is corrected with the alignment information generated by the repair.
i. Create a list of the individually assembled clusters
ii. Read the ace file output by tgicl from the beginning and, for each cluster that was individually assembled, delete that cluster's information
iii. Insert the individually assembled cluster information at the deleted position.
Each consensus (contig) sequence is then given the name used in the original file with a branch number appended. The branch number is an underscore followed by a zero-padded four-digit number, assigned in order starting from "_0001".
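The naming rule is simple enough to show in one function. The contig name used in the example is hypothetical.

```python
def branch_names(original_name, count):
    """Names for `count` contigs produced by reassembling one error cluster."""
    return [f"{original_name}_{i:04d}" for i in range(1, count + 1)]


names = branch_names("CL1Contig1", 3)
```

This keeps every repaired contig traceable to the cluster it came from.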
D) Collection of singleton sequences
Singleton sequences produced by the individual assemblies are written to a singlets file for each input cluster. These files are concatenated and appended to the singleton sequence file created from the tgicl results.
In this example, as shown in Table 1, the procedure was applied to the 300,635 reads, out of 469,318 reads in total, in which adapters were confirmed at both ends, yielding 15,326 clusters (groups of two or more reads judged to be the same sequence) and 284,554 singletons (reads judged to have no identical counterpart). For the run of 1,868,178 reads in total, see Table 1.
Step 4: Recording cluster reliability as data
For the cluster sequences obtained in step 3 above, the reliability of the HiCEP selection sequence portion is evaluated. Specifically, sequence similarity between the consensus sequence of each sequence cluster and its constituent reads was scored. As a result, when the cluster information is used, for example in the profile association processing of (2), processing can apply the sequence cluster reliability calculated here as a threshold.
(A) Preprocessing
The following preprocessing is applied to the clustering and assembly results before the selection bases are evaluated.
・Determine the orientation of each sequence from its adapter sequences and convert it to the forward-strand orientation (CGG-TTA)
・Replace the adapter sequences with CCGG/TTAA.
(B) Evaluation based on the constituent sequences of the consensus
In this evaluation, the following two scores are each calculated for the 5' end and the 3' end.
・Proportion of constituent sequences for each selection base candidate (up to the third candidate)
This score measures how consistently the constituent sequences of the consensus agree at the selection bases and supplies correction candidates when peak association fails. In the ideal case the first candidate scores 100% and the second and third candidates 0%. If a heterozygous SNP falls within the selection bases, however, the first and second candidates each score 50% (see FIG. 23).
・Mean edit distance between the consensus sequence and the constituent sequences over the five bases inward from the restriction enzyme site (see FIG. 24)
In the ideal case the score is 0. If a heterozygous SNP falls within the selection bases, however, it is 0.5. FIG. 23 shows the bases recognized as selection bases when the proportion of constituent sequences is calculated.
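The two scores can be illustrated with a short sketch. It assumes the selection bases and the five inner bases have already been extracted from each constituent read at one end of the cluster; the edit distance is computed here as a plain Levenshtein distance, which is an assumption about the metric, and the input strings are made up.

```python
from collections import Counter


def selection_ratios(selection_bases, top=3):
    """Percentage of reads for the top candidates, padded with (None, 0.0)."""
    total = len(selection_bases)
    ranked = Counter(selection_bases).most_common(top)
    ratios = [(bases, 100.0 * n / total) for bases, n in ranked]
    ratios += [(None, 0.0)] * (top - len(ratios))
    return ratios


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def mean_edit_distance(consensus5, read5s):
    """Average distance between the consensus and each read's inner bases."""
    return sum(edit_distance(consensus5, r) for r in read5s) / len(read5s)


ratios = selection_ratios(["AC", "AC", "AG", "AG"])
mean_d = mean_edit_distance("ACGTA", ["ACGTA", "ACGTT"])
```

With half of the reads differing at one position, the first two candidates each take 50% and the mean edit distance is 0.5, the pattern the text associates with a heterozygous SNP.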
(C) Splitting clusters
In the HiCEP method, which performs 256 PCR and electrophoresis runs, the selection bases (the two bases immediately inside the adapter sequence) are important data for the profile association processing of (2). In this example, therefore, for every sequence cluster the base in the consensus sequence and the base in each constituent read were recorded for each of the two positions inside the adapter sequence. With the HiCEP method of this example, when the bases at a selection position among the reads of one sequence cluster fall into two types during the association processing of (2), a heterozygous SNP can be judged to be present in the selection bases and the cluster can be split into two (FIG. 15).
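A minimal sketch of the split is given below. It assumes each read is represented by its name and the selection bases observed at one end, and it leaves the cluster unsplit when the reads fall into anything other than exactly two types; that fallback is an assumption, since the text describes only the two-type case.

```python
from collections import defaultdict


def split_by_selection(reads):
    """reads: list of (name, selection_bases). Returns a list of read groups."""
    groups = defaultdict(list)
    for name, selection in reads:
        groups[selection].append(name)
    if len(groups) == 2:
        # Two types of selection bases: treat as a heterozygous SNP and split.
        return list(groups.values())
    return [[name for name, _ in reads]]


parts = split_by_selection([("r1", "AC"), ("r2", "AG"), ("r3", "AC")])
whole = split_by_selection([("r1", "AC"), ("r2", "AC")])
```

In practice a minimum proportion for the minor type would probably be applied before splitting, so that a single read error does not divide a cluster.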
Step 5: Recording consensus sequence reliability using known gene information
For species with known gene information (transcripts, genome, ESTs, and so on), the consensus sequences of the sequence clusters obtained in step 4 above are searched against the known sequence information to create reliability data for the consensus sequences.
  (A)類似性検索の実行
  すべての配列クラスタとシングルトンについて、コンセンサス配列またはシングルトンのリード配列を既知公共データベースに類似性検索をかけ、その出力をデータ化する。
(A) Execution of similarity search For all sequence clusters and singletons, a consensus sequence or singleton read sequence is subjected to a similarity search to a known public database, and the output is converted into data.
  ・mRNA:blastn  -dust no -task megablast
  ・ゲノム:blat  パラメータなし (デフォルトのまま)。
・ MRNA: blastn -dust no -task megablast
・ Genome: no blat parameter (default).
(B) Category classification
The similarity search results against the public databases are divided into the following four categories. Here, "95-95" means a base identity of 95% or more with an alignment length of 95% or more of the query length, and "95-20base" means a base identity of 95% or more with an alignment length of 20 bases or more.
   1. 95-95, and CCGG-TTAA are present
   2. 95-95, and one or both of CCGG/TTAA are present with a one-base difference
   3. 95-20base with a hit at the terminal portion (the start/end position on the query sequence is within 4 bases of the end), and CCGG/TTAA are present
   4. 95-20base with a hit at the terminal portion, and CCGG/TTAA are present with a one-base difference.
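The four categories can be expressed as a small decision function. The sketch below is an interpretation, not the patent's implementation: the argument names are hypothetical, "within 4 bases of the end" is read as `q_start <= 4` or `q_end >= query_len - 3` with 1-based coordinates, and the caller is assumed to supply the mismatch count for each restriction site it was able to check.

```python
def classify_hit(identity, aln_len, query_len, q_start, q_end, site_mismatches):
    """Return category 1-4 for one similarity-search hit, or None if it fits none.

    identity: percent identity of the alignment.
    q_start, q_end: 1-based alignment start/end on the query.
    site_mismatches: mismatch counts for the restriction sites that were checked
                     (e.g. one value for CCGG and one for TTAA).
    """
    if identity < 95:
        return None
    worst = max(site_mismatches)
    if worst > 1:
        return None  # a site differs by more than one base
    is_95_95 = aln_len >= 0.95 * query_len
    # Terminal hit: the alignment starts or ends within 4 bases of a query end.
    is_terminal = aln_len >= 20 and (q_start <= 4 or q_end > query_len - 4)
    if is_95_95:
        return 1 if worst == 0 else 2
    if is_terminal:
        return 3 if worst == 0 else 4
    return None
```

Checking the 95-95 condition first means a full-length hit is never downgraded to a terminal-hit category.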
When searching for a restriction enzyme site, the search is carried out within the alignment position extended by two bases before and after, as shown below.
   ・Query sequence: ccGG XX XXXX... (SEQ ID NO: 13)
   ・Subject sequence: ...yy zzZZ YY XXXX... (SEQ ID NO: 14).
Uppercase letters denote the aligned sequence; in this example the alignment starts at the third base of the query sequence. CCGG is searched for within yyzzZZYY, that is, the subject sequence at the same coordinates as the restriction enzyme site of the query sequence (zzZZ) plus two bases on each side.
When the restriction enzyme site is searched for with a one-base difference allowed, there may be more than one candidate within yyzzZZYY. For example, in the following sequence:
     ...CCNGGXX...
there are two readings: CCNG GX, taking GX as the selection bases, or CNGG XX, taking XX as the selection bases. These cannot be distinguished from the sequence alone, so when there are multiple candidates, all of them are kept as data.
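The ambiguity in the CCNGGXX example can be reproduced with a short scan. This is a minimal sketch: the function name and its parameters are hypothetical, and the window passed in is assumed to already include the two bases that follow each candidate site.

```python
def site_candidates(window, site="CCGG", max_mismatch=1, selection_len=2):
    """List (offset, matched_site, selection_bases) for every near-match of `site`."""
    hits = []
    # Only offsets that leave room for the site plus its selection bases are tried.
    for i in range(len(window) - len(site) - selection_len + 1):
        cand = window[i:i + len(site)]
        mismatches = sum(a != b for a, b in zip(cand, site))
        if mismatches <= max_mismatch:
            sel = window[i + len(site):i + len(site) + selection_len]
            hits.append((i, cand, sel))
    return hits
```

For the window `"CCNGGXX"` the function returns both readings, CCNG with selection GX and CNGG with selection XX, which matches the rule that all candidates are kept.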
For every sequence cluster and singleton, the following were stored as data: the ID of each known transcript similar to the consensus sequence or singleton read sequence, with the similar region and score; and the chromosome number of each similar genomic sequence, with the similar region and score. Consequently, when the created cluster information is used, for example in the profiling association process of (2), processing can use the sequence cluster reliability calculated here as a threshold.
In addition, for the consensus sequences of all sequence clusters, the base of the consensus sequence and the base of the similar known sequence were recorded at each of the two base positions inside the adapter sequence. As a result, in the HiCEP method of this example, in the sequence identification process of (3), even when no corresponding sequence cluster exists for a sample from a different individual, the search can be carried out on the assumption that an SNP is present.
Step 6: Assigning gene information to the consensus sequences
For the consensus sequences of the sequence clusters obtained in step 4 above, known sequence information is searched and gene information is assigned to each sequence.
If the target organism is a species with known gene information (transcripts, genome, EST information, etc.), that information has already been assigned in step 5. However, comprehensive fragment analysis often detects many unknown transcripts. In step 6, therefore, the consensus sequences of the sequence clusters are searched for similarity against the known sequence information of all species, or of a specified set of species, and each consensus sequence is associated with the known sequences most similar to it.
(2) Association of bands or peaks obtained by electrophoresis with sequences
A method was developed for associating the individual sequence clusters with the group of electrophoresis peaks (the ES cell reference profiling) obtained by PCR and electrophoresis of the "template cDNA solution" that was prepared by the HiCEP method and was also the material sequenced. The association uses the sequence and sequence length of the consensus sequence of each sequence cluster obtained in (1), and the number of read sequences constituting the cluster.
Step 1: Correction of sequence length
It is known that the electrophoretic length does not always match the length of the sequence subjected to electrophoresis (see FIG. 39). In this method, the deviation between electrophoretic length and sequence length is an obstacle to matching peaks with sequences. To address it, data for 37,675 already-associated peaks and their sequences, taken from an existing database in which the HiCEP method was applied to mouse ES cells, were used to examine how base composition and molecular weight relate to the deviation between electrophoretic length and sequence length. The results showed that correction based on base composition and molecular weight is possible and improves the accuracy of peak association. First, the deviation was found to correlate with the TG (or AC) content of the sequence (see FIG. 42). The deviation also showed a trend with molecular weight (see FIG. 41).
As for the effect, as shown in FIG. 16, without correction 89% of the sequences in the existing mouse ES cell data had a deviation within ±2 bp, whereas correction by base composition and molecular weight raised this to 96%. Likewise, the accuracy of step 2 was 66% when it was performed without correction and rose to 77% with correction.
(A) Calibration using the relationship between the deviation and sequence length or molecular weight
After clearly inappropriate data were removed from the known sequence data, the relationship between sequence length and the deviation of the electrophoretic length from the corresponding sequence length was examined. As shown in FIG. 40, the deviation was found to be biased depending on sequence length. The relationship between the deviation and molecular weight is shown as a scatter plot in FIG. 41. Because molecular weight has a finer granularity than sequence length, it is better suited to fine calibration, so a calibration table based on molecular weight was adopted. The calibration table was created with the loess function, a local regression smoothing function.
(B) Calibration using the relationship between the deviation and the internal base composition of the sequence
To investigate the relationship between the deviation and the internal base composition of the sequence, the deviation remaining after the "calibration using the relationship between the deviation and molecular weight" of (A) (hereinafter, the deviation after calibration in (A)) was compared with the internal base composition. This showed a negative correlation between the A and C content ratios and the deviation after calibration in (A), and a positive correlation between the T and G content ratios and that deviation. To make the effect of calibration larger, the relationships of the deviation after calibration in (A) with the AC content ratio and with the TG content ratio were plotted as scatter diagrams (FIG. 42). The correlation appeared clearer than for A, C, T, or G alone. A calibration table is created from the relationships of the deviation after calibration in (A) with the AC content ratio and with the TG content ratio, and calibration is performed. The loess function, a local regression smoothing function, is used to create the calibration table.
(C) Prediction of electrophoretic length using the calibration table
Linear interpolation was used. When points A(xA, yA) and B(xB, yB) listed in the calibration table lie on either side of the desired point X(xx, yx), the calibration value yx of X is given by the following equation (see FIG. 43).
Figure JPOXMLDOC01-appb-M000002
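Since the equation itself survives only as an image reference, the following sketch uses the standard linear interpolation formula, yx = yA + (yB - yA) × (xx - xA) / (xB - xA), which is what the surrounding text describes. The function name and the table layout (a list of (x, y) pairs sorted by distinct x values) are assumptions made for illustration.

```python
from bisect import bisect_right

def interpolate_calibration(table, x):
    """Linearly interpolate a calibration value at x from a sorted (x, y) table."""
    xs = [p[0] for p in table]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the calibration table")
    i = bisect_right(xs, x)      # index of the first table point with x-value > x
    if i == len(xs):             # x equals the last table point
        return table[-1][1]
    (xa, ya), (xb, yb) = table[i - 1], table[i]
    return ya + (yb - ya) * (x - xa) / (xb - xa)
```

Values outside the table raise an error rather than being extrapolated, because the source only describes the case in which table points exist on both sides of X.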
Step 2: Association of sequence clusters with peaks
Sequence clusters were associated with the peaks of the HiCEP reference profiling using two comparisons: the sequence length of each cluster's consensus sequence, as corrected in step 1, against the electrophoretic length of each electrophoresis peak in the profiling obtained by the HiCEP method; and the number of read sequences constituting the cluster against the intensity of the peak obtained by electrophoresis.
Specifically, the procedure is as follows.
   (A) Generate "pseudo peaks" from the clustering and assembling results
   (B) Apply peak length calibration to the pseudo peaks
   (C) Merge pseudo peaks with the same peak length into one pseudo peak
   (D) Perform association with the peak association algorithm.
As a result, of the 21,778 profiling peaks obtained from ES cells by the HiCEP method, 12,551 peaks (57.6%) were associated, and 77% of these could be identified by computer processing.
(A) Generating "pseudo peaks" from the clustering and assembling results
A peak length and height are assigned as follows to the consensus sequence of each sequence cluster, or to each singleton sequence, to generate a pseudo peak.
   ・Peak length: number of bases of the consensus sequence including the selection bases + 34
   ・Peak height: number of reads in the sequence cluster.
The correction value of +34 from sequence length to electrophoretic length is determined as follows.
The 40-base primer sequence used in PCR, plus the one thymine artificially added to the end during PCR, gives 41 bases. The PCR product is the fragment DNA sequence, including the HiCEP selection bases, with these bases added outside it. Because the sequence length is the number of bases after removal of the adapter sequence, subtracting from 41 the 4 bases that correspond to the two selection bases at each end, which are already included in the sequence length, gives 37 bases as the correction from sequence length to electrophoretic length. However, with capillary electrophoresis instruments manufactured by Applied Biosystems (the 3100 in particular), peaks are known to appear at a position 3 bases shorter than this theoretical correction, so 34 bases, the theoretical value minus 3, was adopted as the correction value.
As for peak height, using the read count directly gives values far lower than the profile peaks. This is not a problem for the peak association algorithm of this system, which is unaffected by absolute height. When pseudo peaks are visualized, however, height relationships are hard to see if height equals read count. In the pseudo peak drawings in this specification, the read count is therefore multiplied by a fixed coefficient to raise the peaks to the same level as the profile peaks (see FIG. 30).
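The arithmetic behind the +34 correction, and the pseudo peak built from it, can be written out as a brief sketch. The constant names and the dictionary layout are hypothetical; the numbers are the ones given in the text.

```python
PRIMER_LEN = 40        # primer length used in PCR (from the text)
ADDED_T = 1            # thymine added to the end during PCR
SELECTION_BASES = 4    # 2 selection bases at each end, already counted in the sequence
INSTRUMENT_SHIFT = 3   # shortfall observed on the capillary instrument

# 40 + 1 = 41, minus 4 = 37 (theoretical), minus 3 = 34 (value adopted)
LENGTH_OFFSET = PRIMER_LEN + ADDED_T - SELECTION_BASES - INSTRUMENT_SHIFT

def make_pseudo_peak(consensus, read_count):
    """Build a pseudo peak: length from the consensus sequence, height from its reads."""
    return {"length": len(consensus) + LENGTH_OFFSET, "height": read_count}
```

A 100-base consensus sequence supported by 12 reads therefore yields a pseudo peak of length 134 and height 12, before any calibration is applied.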
(B) Applying peak length calibration to the pseudo peaks
The peak length of each pseudo peak is calibrated with the calibration table generated in step 1 of (2) above (see FIG. 28).
(C) Merging pseudo peaks with the same peak length into one pseudo peak
When fragments with different internal sequences but the same sequence length exist, they appear as a single peak in the HiCEP electrophoresis result, whereas the sequence clusters created in (1) above are separate clusters. Therefore, when peaks are associated with sequence clusters, pseudo peaks with the same calibrated sequence length are merged into one pseudo peak whose height is the sum of their heights (see FIG. 29).
Note that even when the consensus sequences of sequence clusters have the same sequence length, the electrophoretic lengths of their pseudo peaks may differ considerably once sequence length calibration is applied. Peaks whose calibrated sequence lengths are within ±0.25 base of each other are therefore merged into one peak.
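One way to realize the ±0.25 base merge is sketched below. The source states only the tolerance and that heights are summed. Grouping relative to the first peak of each group and reporting the mean length of the merged peaks are assumptions of this sketch.

```python
def merge_pseudo_peaks(peaks, tolerance=0.25):
    """Merge (length, height) pseudo peaks whose calibrated lengths nearly coincide."""
    merged = []
    group = []
    for length, height in sorted(peaks):
        # Close the group once a peak lies farther than `tolerance` from its first peak.
        if group and length - group[0][0] > tolerance:
            merged.append(_collapse(group))
            group = []
        group.append((length, height))
    if group:
        merged.append(_collapse(group))
    return merged

def _collapse(group):
    """Sum the heights; report the mean length (the mean is an assumption)."""
    total = sum(h for _, h in group)
    mean_len = sum(l for l, _ in group) / len(group)
    return (mean_len, total)
```

Anchoring each group to its first peak keeps a chain of closely spaced peaks from growing without bound, which a pairwise "within 0.25 of a neighbour" rule would allow.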
(D) Association by the peak association algorithm
The sequence/peak association algorithm uses DP matching as its basic framework, with its own scoring of each edge.
A) Characteristics of pseudo peaks
The characteristics of pseudo peaks are summarized below:
    (I) The read count is thought to reflect the expression level, but its variability is large, and it is difficult to bring the overall heights into line by applying a constant coefficient, as can be done when two electrophoresis results are compared;
    (II) A consensus sequence with more reads is considered more reliable. The fewer the reads, the more likely errors are in the selection bases and the sequence length;
    (III) The deviation between sequence length and electrophoretic length that is to be calibrated varies from peak to peak, rather than being constant over a region.
The peak association algorithm is designed on the premise of these characteristics.
B) Peak alignment by DP matching
The range over which pair scores are obtained is called a frame. A fixed frame region is set, every combination of a reference profiling peak and a pseudo peak within the frame is scored as a pair candidate, and the set of pairs with the highest total score is found by DP matching (see FIGS. 17A, 31, and 32). The maximum score of a pair candidate is 1.0, and a candidate can become a final pair only when its score is 0 or more. A candidate with a negative score cannot become a final pair.
As the concrete DP matching method, a modified version of the method of Needleman & Wunsch (1970) was used.
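The source does not say how the Needleman and Wunsch method was modified, so the following is only a generic sketch of order-preserving DP matching that maximizes the summed pair score. The zero gap penalty, the function name, and the `score` callback are assumptions introduced for illustration.

```python
def align_peaks(profile, pseudo, score):
    """Pair two size-ordered peak lists so that the summed pair score is maximal.

    score(p, r) returns a value <= 1.0; a negative value forbids the pair.
    Unpaired peaks cost nothing (a gap penalty of 0 is an assumption here).
    Returns (total_score, [(profile_index, pseudo_index), ...]).
    """
    n, m = len(profile), len(pseudo)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(profile[i - 1], pseudo[j - 1])
            skip = max(best[i - 1][j], best[i][j - 1])
            pair = best[i - 1][j - 1] + s if s >= 0 else float("-inf")
            best[i][j] = max(skip, pair)
    # Trace back from the corner to recover which pairs were chosen.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = score(profile[i - 1], pseudo[j - 1])
        if s >= 0 and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return best[n][m], pairs[::-1]
```

Because the recursion only moves forward through both lists, the chosen pairs never cross, which is the property that makes dynamic programming applicable to peaks ordered by size.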
C) Pair candidate score
The score of a pair candidate is the sum of the peak height score and the size score, each multiplied by a weight. The maximum of the height score and of the size score is 1.0, and the weights are chosen so that the maximum pair candidate score is also 1.0:
   pair candidate score = (height score × height weight) + (size score × size weight)
In the example,
   height weight = 0.5 and size weight = 0.5 were used (see FIG. 17B).
I) Height score
The height score is calculated for each pair candidate as follows. In calculating the peak height score, the height order number within the frame is used instead of the height value (see FIG. 17B).
    height score = (error - abs(p.order - r.order)) / error
    error = tolerance of the height order number (currently 10)
    p.order = height order number of the profile peak
    r.order = height order number of the pseudo peak
    abs(n) = absolute value of n.
II) Height order number and frame
Height order numbers are assigned as 1, 2, 3 ... n, starting from the highest peak in the frame (see FIG. 33). They are assigned separately for profile peaks and for pseudo peaks. The frame and the height order numbers are calculated for each profile peak of interest (as many frames as there are profile peaks are generated per primer set).
Replacing heights with order numbers (see FIG. 34) takes the height relationships into account while avoiding the influence of data characteristic (I), the variability of pseudo peak heights. The lower the peak, the poorer the agreement of the height order numbers, which follows from data characteristic (II).
III) Handling of peaks with the same height
Because the height of a pseudo peak is a number of sequences, peaks of the same height can occur frequently. Peaks of the same height within a frame are given the same order number (see FIG. 35). That order number is the one obtained after adding the number of peaks of that height. Groups of peaks of the same height are likely to have few reads or to be singletons. Giving such peaks more distant order numbers improves the matching accuracy (it reduces the influence of noise).
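The order numbering and the height score can be sketched together. The handling of ties is an interpretation of the text: tied peaks are all given the rank of the last peak in the tied block, so three peaks tied for second place all receive 4. The function names are hypothetical.

```python
def height_order_numbers(heights):
    """Order numbers within a frame; tied peaks share the count of peaks at least as tall."""
    return [sum(1 for other in heights if other >= h) for h in heights]

def height_score(p_order, r_order, error=10):
    """Height score from the text: 1.0 for equal order numbers, negative beyond `error`."""
    return (error - abs(p_order - r_order)) / error
```

With heights 100, 50, 50, 50, 10 the order numbers are 1, 4, 4, 4, 5. The tied low peaks are pushed away from the top rank, which is the effect described in III).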
IV) Method for determining the frame width
In the example, the frame is the range before and after the profile peak of interest that lies within 27 profile peaks and within 80 bases (pseudo peaks are not considered) (FIG. 36).
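A minimal sketch of the frame rule follows. It assumes that the limits of 27 peaks and 80 bases apply on each side of the peak of interest, which the text does not state explicitly, and that the profile peak sizes are sorted in ascending order. The function name is hypothetical.

```python
def frame_indices(sizes, center, max_peaks=27, max_bases=80):
    """Indices of profile peaks inside the frame around sizes[center]."""
    return [
        j for j in range(len(sizes))
        if abs(j - center) <= max_peaks and abs(sizes[j] - sizes[center]) <= max_bases
    ]
```

In sparse regions the 80-base limit is the binding one, while in crowded regions the 27-peak limit is. Applying both keeps the frame from becoming either too wide or too populated.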
V) Tolerance of the height order number
If the difference in height order between the members of a pair candidate is within the "tolerance of the height order number", there is no penalty.
VI) Size score
The size score is calculated for each pair candidate as follows (see FIG. 17B).
    size score = (error - abs(p.size - r.size)) / error
    error = size tolerance (currently 2 to 4; see below)
    p.size = size of the profile peak
    r.size = size of the pseudo peak
    abs(n) = absolute value of n.
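The size score and its combination with the height score into the pair candidate score can be sketched as below, using the weights of 0.5 given in C). The function and parameter names are hypothetical.

```python
def size_score(p_size, r_size, error):
    """Size score from the text: 1.0 for identical sizes, negative beyond `error`."""
    return (error - abs(p_size - r_size)) / error

def pair_score(height_score_value, size_score_value, height_weight=0.5, size_weight=0.5):
    """Weighted sum of the two scores; the weights sum to 1 so the maximum is 1.0."""
    return height_score_value * height_weight + size_score_value * size_weight
```

A pair whose sizes differ by 1 base under a tolerance of 2 has a size score of 0.5, and with a perfect height score the pair candidate score is 0.75.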
VII) Size tolerance
If the size difference between the members of a pair candidate is within the "size tolerance", there is no penalty. The size tolerance varies between 2 bases and 4 bases.
The procedure for obtaining the size tolerance is as follows.
    ・Find the distance to the adjacent peak in each direction, before and after the profile peak of interest;
    ・Take half of the shorter of the two distances as the candidate value;
    ・If the candidate value is between 2 and 4 bases, use it as the size tolerance;
    ・If the candidate value is smaller than 2 bases, use 2 as the size tolerance; if it is larger than 4 bases, use 4 (FIG. 37).
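The procedure above amounts to halving the distance to the nearer neighbour and clamping the result to the range 2 to 4. A minimal sketch follows. The function name is hypothetical, and peaks at the ends of a profile, which have only one neighbour, are not handled because the text does not cover them.

```python
def size_tolerance(prev_size, size, next_size, low=2.0, high=4.0):
    """Half the distance to the nearer neighbouring profile peak, clamped to [low, high]."""
    candidate = min(size - prev_size, next_size - size) / 2
    return max(low, min(high, candidate))
```

In crowded regions the tolerance stays at the 2-base floor, and isolated peaks are capped at 4 bases.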
Increasing the tolerance as the size increases is another possible approach, but the method described above was adopted here.
VIII) Correction of judgments for neighboring peaks
When the profiling peak and the pseudo peak that should correspond lie close together, a pair made of a high-intensity pseudo peak and a profiling peak that is close in size but low in intensity scores low because the difference in height order numbers is large, and it usually does not become a final pair. However, if no pseudo peak corresponds to the low profiling peak, and the pseudo peak and profiling peak that should correspond are separated by more than a certain size, such a pair may nevertheless be formed.
To deal with such cases, pairs are corrected for a high-intensity peak and the peaks near it (FIG. 38). The correction method is as follows.
    ・If, within 0.75 base before or after a peak subject to association, there is a peak whose intensity is 30% or less of that peak, it is treated as a "low neighboring peak" (left and right are distinguished);
    ・In score calculation, when one member of the pair is such a low neighboring (foot) peak, the score is calculated only when both members are low neighboring peaks of the same kind. Otherwise, the score is set to -1 (penalty).
This correction is effective only when sequence length calibration is applied (without sequence length calibration, the pseudo peaks fall evenly at 1-base intervals).
(3) Gene identification method for peaks obtained by HiCEP, using the database of (1) and the association information of (2)
Through (2) above, the sequence clusters created from the same template cDNA prepared by the HiCEP method were associated with the peaks of the reference profiling, so that sequences can be identified for the reference profiling peaks. The next method uses the database of (1) and the association information of (2) to identify the sequences of the profiling peaks obtained by applying the HiCEP method to a different sample of the same cells (see FIGS. 25, 26, and 27).
Method 1
Step 1: Create data that associate the HiCEP profiling result obtained from the sample subject to gene identification with the reference profiling used in (2), by electrophoretic length and peak intensity.
Step 2: For the group of peaks to be identified in that sample, use the association data created in step 1 to find the reference profiling peaks, then use the association information of (2) to find the clusters created in (1), and obtain their consensus sequences and gene information. This yields a correspondence list between the peaks of interest and gene information.
Step 3: Determine the consensus sequence of interest from the gene information created in step 6 of (1), and, via the reference profiling peak associated in (2), find the electrophoresis peak obtained from the sample subject to gene identification.
Method 2
Gene information for a peak of interest in the electrophoresis result of the sample subject to gene identification is obtained by displaying side by side: the number of bases obtained by electrophoresis for one or more peaks of that result, the bands of the reference profiling used in (2), and the pseudo profiling created from the sequence clusters of (1) and their sequence counts, together with the associated gene information.

Claims (17)

  1. A database construction method for comprehensive fragment analysis of a genome or transcripts, comprising:
     a step of fragmenting genomic DNA contained in a sample, or cDNA obtained from transcripts contained in the sample, and attaching an indicator sequence, thereby obtaining a fragment DNA mixture;
     a step of performing high-speed DNA sequencing on a first portion of the fragment DNA mixture, thereby acquiring read sequence data for all fragment DNAs contained therein;
     a step of checking all of the read sequence data for the presence of the indicator sequence portion and extracting the read sequence data having the indicator sequence;
     a step of performing sequence clustering and assembling on all of the extracted read sequence data using predetermined parameters, thereby forming a plurality of sequence clusters, and acquiring, for each of the sequence clusters, the number of constituent sequences, the consensus sequence, the consensus sequence length, and alignment information; and
     a step of constructing a database containing, in association with each of the sequence clusters, the number of constituent sequences, the consensus sequence, the consensus sequence length, and the alignment information of that cluster,
     wherein the parameters relate to sequence similarity, sequence length, and the indicator sequence.
  2. The database construction method according to claim 1, further comprising:
     a step of electrophoresing a second portion of the fragment DNA mixture and acquiring, from the electrophoresis result, the intensity and the electrophoretic sequence length of the band group or peak group derived from each fragment as the data of a reference profiling; and
     a step of associating the sequence information and number of bases of the consensus sequence of each sequence cluster, and the number of sequences constituting that cluster, with the band group or peak group of the reference profiling.
  3.  The database construction method according to claim 2, wherein the associating uses, as a first parameter, the relationship between the sequence information and base count of the consensus sequence and the molecular weight obtained by electrophoresis for a band or peak of the reference profiling, and uses, as a second parameter, the relationship between the number of sequences constituting the sequence cluster of the consensus sequence and the intensity of the band or peak of the reference profiling, thereby associating the consensus sequence with the reference profiling.
  4.  The database construction method according to claim 2 or 3, wherein the associating scores the degree of agreement between the number of sequences and base count of each sequence cluster and the intensity and electrophoretic sequence length of each band or peak obtained by the electrophoresis, and selects the combination that maximizes the total score, thereby associating the sequence clusters with the bands or peaks.
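The score-maximizing association described in this claim could be sketched as follows. The scoring function `pair_score` is hypothetical (the claim does not give a formula): it rewards agreement between base count and electrophoretic length, and between a cluster's share of reads and a peak's share of total intensity. The exhaustive search over pairings is only practical for small inputs and assumes there are at least as many peaks as clusters.

```python
from itertools import permutations


def pair_score(cluster, peak, total_reads, total_intensity):
    """Hypothetical agreement score: higher means a better cluster/peak match."""
    length_term = 1.0 / (1.0 + abs(cluster["length"] - peak["length"]))
    share_diff = abs(cluster["count"] / total_reads
                     - peak["intensity"] / total_intensity)
    return length_term + (1.0 - share_diff)


def best_assignment(clusters, peaks):
    """Try every one-to-one pairing of clusters to peaks and keep the pairing
    with the largest total score. Returns (peak index per cluster, total)."""
    total_reads = sum(c["count"] for c in clusters)
    total_intensity = sum(p["intensity"] for p in peaks)
    best, best_total = None, float("-inf")
    for perm in permutations(range(len(peaks)), len(clusters)):
        total = sum(pair_score(c, peaks[j], total_reads, total_intensity)
                    for c, j in zip(clusters, perm))
        if total > best_total:
            best, best_total = perm, total
    return list(best), best_total
```

For realistic numbers of peaks, the same objective is usually solved with an assignment algorithm rather than brute force; the brute-force form is used here only because it makes the "maximum total score" criterion explicit.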
  5.  The database construction method according to any one of claims 1 to 4, further comprising calibrating the association based on the relationship between deviations in the association and the sequence length and molecular weight, and/or the relationship between the deviations and the internal base composition.
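One way to picture the calibration in this claim is a straight-line fit of the deviation (electrophoretic length minus consensus length) against sequence length. The sketch below assumes a linear relationship and ignores base composition; both are simplifying assumptions, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate(pairs):
    """pairs: (consensus_length, electrophoretic_length) for clusters whose
    peak assignment is already trusted. Returns a function that predicts the
    electrophoretic length expected for a given consensus length."""
    lengths = [c for c, _ in pairs]
    deviations = [e - c for c, e in pairs]
    slope, intercept = fit_line(lengths, deviations)
    return lambda consensus_length: (consensus_length
                                     + slope * consensus_length + intercept)
```

The returned predictor can then be applied to the remaining clusters before they are matched to peaks.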
  6.  The database construction method according to any one of claims 1 to 5, further comprising, following the step of constructing the database, searching known gene sequence information of the same animal species as that from which the sample is derived, and comparing each consensus sequence with the known genes of that species, thereby recording as data the reliability of the consensus sequences obtained by the method.
  7.  The database construction method according to any one of claims 1 to 6, further comprising, following the step of constructing the database, recording as data the reliability of the consensus sequence of each sequence cluster based on alignment information between the consensus sequence of the sequence cluster and the read sequences constituting the cluster.
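As an illustration of turning alignment information into a reliability value, the sketch below computes, for every consensus position, the fraction of reads that agree with the consensus base, and averages those fractions. The metric itself is an assumption (the claim does not define one), and the reads are assumed to be already aligned and padded with `-` to the consensus length.

```python
def consensus_reliability(consensus, aligned_reads):
    """Return (mean agreement, per-column agreement) between a consensus
    sequence and its aligned reads. Gaps ('-') count as disagreement."""
    per_column = []
    for i, base in enumerate(consensus):
        column = [read[i] for read in aligned_reads]
        per_column.append(sum(b == base for b in column) / len(column))
    return sum(per_column) / len(per_column), per_column
```

A low per-column value flags positions where the consensus call rests on few agreeing reads.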
  8.  The database construction method according to any one of claims 1 to 7, further comprising, following the step of constructing the database, searching known gene sequence information and comparing each consensus sequence with the known gene sequence information, thereby annotating the consensus sequences obtained by the method with known gene sequence information.
  9.  A method for identifying genes in a genome or transcripts contained in a target sample, comprising:
    fragmenting DNA obtained from the genome or transcripts contained in the target sample, and adding a distinguishable indicator sequence, to obtain a target fragment DNA mixture;
    electrophoresing the target fragment DNA mixture and obtaining, from the electrophoresis result, band or peak intensities and electrophoretic sequence lengths as gene-identification target profiling data; and
    associating the gene-identification target profiling data with the consensus sequences and the reference profiling data of a database constructed in advance, according to the type of the target sample, by the method according to any one of claims 2 to 8.
  10.  The method according to claim 9, wherein the associating matches the band or peak intensity and electrophoretic sequence length data in the gene-identification target profiling against the band or peak intensities and electrophoretic sequence lengths of the reference profiling, thereby obtaining genetic information on the genome or transcripts contained in the target sample.
  11.  The method according to claim 9, wherein the associating matches the electrophoretic sequence length data in the gene-identification target profiling against the electrophoretic sequence lengths of the bands or peaks of the reference profiling and against a pseudo-profile created from the base counts of the sequence clusters and their sequence counts, thereby obtaining genetic information on the transcripts contained in the target sample.
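The pseudo-profile mentioned in this claim can be pictured as a synthetic electropherogram built from the database alone: cluster base counts give peak positions and cluster sequence counts give peak heights. The sketch below is one hedged reading of that idea; the record fields, the one-base tolerance, and the nearest-length matching rule are assumptions.

```python
def pseudo_profile(clusters):
    """Sum read counts per consensus length: {length_in_bases: pseudo_intensity}."""
    profile = {}
    for c in clusters:
        profile[c["length"]] = profile.get(c["length"], 0) + c["count"]
    return profile


def match_lengths(target_lengths, profile, tolerance=1.0):
    """Map each observed electrophoretic length to the nearest pseudo-peak
    length, or to None when no pseudo-peak lies within the tolerance."""
    matches = {}
    for observed in target_lengths:
        nearest = min(profile, key=lambda length: abs(length - observed))
        matches[observed] = nearest if abs(nearest - observed) <= tolerance else None
    return matches
```

Once an observed peak is tied to a pseudo-peak length, the clusters of that length supply the candidate consensus sequences for gene identification.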
  12.  A comprehensive fragment analysis method comprising:
    fragmenting DNA obtained from the genome or transcripts contained in each of first to nth target samples, and adding an indicator sequence to each, to obtain first to nth fragment DNA mixtures, respectively (where "n" is an integer of 2 or more);
    performing high-speed DNA sequencing on each of the first to nth fragment DNA mixtures to obtain first to nth read sequence data covering all fragment DNAs contained in the respective mixtures;
    inspecting all sequence data in each of the first to nth read sequence data for the presence or absence of the indicator sequence portion, and extracting the first to nth read sequence data having the indicator sequence;
    subjecting all of each of the extracted first to nth read sequence data to sequence clustering and assembly using predetermined parameters, thereby forming first to nth sequence cluster groups, and obtaining, for each sequence cluster in the first to nth sequence cluster groups, the number of constituent sequences, the consensus sequence, the consensus sequence length, and alignment information;
    constructing databases containing first to nth sequence cluster group information, each including the number of constituent sequences, the consensus sequence, the consensus sequence length, and the alignment information associated with the respective sequence clusters; and
    associating sequence clusters across the first to nth sequence cluster groups according to the similarity of their consensus sequences, and comparing the numbers of constituent sequences between the associated clusters, thereby detecting sequence clusters whose quantity has changed.
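The final comparison step of this claim, restricted to two samples for brevity, might look like the sketch below. Matching clusters by `difflib` similarity, normalizing counts to each sample's total reads, and the two-fold threshold are all illustrative assumptions rather than anything the claim specifies.

```python
from difflib import SequenceMatcher


def changed_clusters(sample_a, sample_b, min_similarity=0.95, min_fold=2.0):
    """Pair each cluster of sample A with the most similar consensus in
    sample B, then report pairs whose share of reads differs by min_fold."""
    total_a = sum(c["count"] for c in sample_a)
    total_b = sum(c["count"] for c in sample_b)
    hits = []
    for a in sample_a:
        def similarity(c):
            return SequenceMatcher(None, a["consensus"], c["consensus"]).ratio()
        b = max(sample_b, key=similarity)
        if similarity(b) < min_similarity:
            continue  # no counterpart cluster in sample B
        share_a = a["count"] / total_a
        share_b = b["count"] / total_b
        fold = max(share_a, share_b) / min(share_a, share_b)
        if fold >= min_fold:
            hits.append((a["consensus"], a["count"], b["count"], fold))
    return hits
```

Extending this to n samples would mean repeating the pairing against a common reference set of consensus sequences and comparing all n normalized counts per cluster.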
  13.  A comprehensive fragment analysis method comprising:
    fragmenting DNA obtained from the genome or transcripts contained in a target sample, and adding an indicator sequence, to obtain a fragment DNA mixture;
    performing high-speed DNA sequencing on the fragment DNA mixture to obtain read sequence data for all fragment DNAs contained in it;
    inspecting all of the read sequence data for the presence or absence of the indicator sequence portion, and extracting the read sequence data having the indicator sequence;
    clustering the read sequence data, using as a parameter the sequence similarity to the consensus sequences of a database constructed in advance, according to the type of the target sample, by the method according to any one of claims 2 to 6, to obtain a sequence cluster group of the target sample; and
    comparing the sequence cluster group of the target sample with the sequence cluster group contained in the database, and detecting clusters present only in the sequence cluster group of the target sample and/or clusters present only in the sequence cluster group of the database.
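A minimal sketch of the database-guided clustering and presence/absence comparison in this claim is given below. It assumes reads have already been filtered for the indicator sequence, uses `difflib` similarity with a hypothetical 0.9 threshold, and treats any read that matches no database consensus as the seed of a sample-only cluster.

```python
from difflib import SequenceMatcher


def assign_reads(reads, db_consensus, min_similarity=0.9):
    """Assign each read to the most similar database consensus, or to a
    novel cluster. Returns (counts per database consensus,
    sample-only cluster seeds, database-only consensus sequences)."""
    counts = {}  # database consensus -> number of sample reads assigned
    novel = []   # [seed read, read count] for reads matching nothing in the database
    for read in reads:
        def similarity(seq):
            return SequenceMatcher(None, read, seq).ratio()
        best = max(db_consensus, key=similarity)
        if similarity(best) >= min_similarity:
            counts[best] = counts.get(best, 0) + 1
            continue
        for entry in novel:
            if similarity(entry[0]) >= min_similarity:
                entry[1] += 1
                break
        else:
            novel.append([read, 1])
    sample_only = [seed for seed, _ in novel]
    database_only = [c for c in db_consensus if c not in counts]
    return counts, sample_only, database_only
```

The `sample_only` list corresponds to clusters present only in the target sample, and `database_only` to clusters present only in the database.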
  14.  A comprehensive fragment analysis method comprising:
    fragmenting DNA obtained from the genome or transcripts contained in each of first to nth target samples, and adding an indicator sequence, to obtain first to nth fragment DNA mixtures, respectively (where "n" is an integer of 2 or more);
    performing high-speed DNA sequencing on each of the first to nth fragment DNA mixtures to obtain first to nth read sequence data for all fragment DNAs contained in the respective mixtures;
    inspecting all sequence data in each of the first to nth read sequence data for the presence or absence of the indicator sequence portion, and extracting the first to nth read sequence data having the indicator sequence;
    clustering each of the first to nth read sequence data, using as a parameter the sequence similarity to the consensus sequences of a database constructed in advance, according to the type of each of the first to nth target samples, by the method according to any one of claims 2 to 6, to obtain first to nth sequence cluster groups; and
    comparing, for each consensus sequence shared among the first to nth sequence cluster groups, the numbers of sequences constituting the corresponding clusters, and identifying sequence clusters whose sequence counts differ among the first to nth sequence cluster groups.
  15.  A comprehensive fragment analysis method comprising:
    constructing a database in advance, according to the purpose, by the method according to any one of claims 2 to 6;
    preparing a microarray by immobilizing on a substrate a group of probes designed and prepared based on the consensus sequences contained in the constructed database;
    fragmenting DNA obtained from the genome or transcripts contained in a target sample, and adding an indicator sequence, to obtain a fragment DNA mixture;
    bringing the fragment DNA mixture into contact with the probe group to obtain hybridization signals; and
    detecting the presence of transcripts contained in the target sample based on the obtained hybridization signals.
  16.  A comprehensive fragment analysis method comprising:
    constructing a database in advance, according to the purpose, by the method according to any one of claims 2 to 6;
    preparing n microarrays by taking a group of probes designed and prepared based on the consensus sequences contained in the constructed database as one set, and immobilizing one set on each of n substrates (where "n" is an integer of 2 or more);
    fragmenting DNA obtained from the genome or transcripts contained in each of first to nth target samples, and adding an indicator sequence to each, to obtain first to nth fragment DNA mixtures, respectively;
    bringing the first to nth fragment DNA mixtures into contact with the probe groups immobilized on the respective n microarrays to obtain respective hybridization signals;
    comparing the abundances of transcripts contained in the first to nth target samples based on the respective hybridization signals; and
    detecting, by the comparison, transcripts whose abundance differs among the first to nth target samples.
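The comparison of hybridization signals across the n microarrays could be sketched as below. Normalizing each array to its total signal and flagging probes whose largest and smallest normalized values differ by a fold threshold are assumptions made for illustration; the claim does not prescribe a normalization or a threshold.

```python
def differential_probes(arrays, min_fold=2.0, floor=1e-9):
    """arrays: one {probe_id: signal} dict per microarray, all sharing the
    same probe set. Returns {probe_id: fold} for probes whose normalized
    signal varies by at least min_fold across the arrays."""
    normalized = []
    for signals in arrays:
        total = sum(signals.values())
        normalized.append({probe: s / total for probe, s in signals.items()})
    hits = {}
    for probe in normalized[0]:
        # floor avoids division by zero when a probe gives no signal
        values = [max(arr[probe], floor) for arr in normalized]
        fold = max(values) / min(values)
        if fold >= min_fold:
            hits[probe] = fold
    return hits
```

Because every array carries the same probe set derived from the database consensus sequences, a probe identifier maps directly back to a sequence cluster.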
  17.  A database construction program for comprehensive fragment analysis of transcripts, the program causing a computer to execute processing comprising:
    a procedure of inspecting, for the presence or absence of an indicator sequence portion, all of the read sequence data obtained by high-speed DNA sequencing of a fragment DNA mixture derived from the genome or transcripts contained in a sample, the mixture having been fragmented and given the indicator sequence, and extracting the read sequence data having the indicator sequence;
    a procedure of subjecting all of the extracted read sequence data to sequence clustering and assembly using predetermined parameters relating to sequence similarity, sequence length, and the indicator sequence, thereby forming a plurality of sequence clusters, and obtaining, for each sequence cluster, the number of constituent sequences, the consensus sequence, the consensus sequence length, and alignment information; and
    means for constructing a database that includes, in association with each sequence cluster, the number of constituent sequences, the consensus sequence, the consensus sequence length, and the alignment information.
PCT/JP2012/062963 2011-05-19 2012-05-21 Gene identification method in fragmentome analysis and expression analysis method WO2012157778A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2011112887A JP5403563B2 (en) 2011-05-19 2011-05-19 Gene identification method and expression analysis method in comprehensive fragment analysis
JP2011-112887 2011-05-19

Publications (1)

Publication Number Publication Date
WO2012157778A1 (en) 2012-11-22

Family

ID=47177092

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/062963 WO2012157778A1 (en) 2011-05-19 2012-05-21 Gene identification method in fragmentome analysis and expression analysis method

Country Status (2)

Country Link
JP (1) JP5403563B2 (en)
WO (1) WO2012157778A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6327473B2 (en) * 2013-01-25 2018-05-23 国立研究開発法人国立国際医療研究センター Method for identifying Mycobacterium tuberculosis strain and method for detecting gene mutation
CA2994406A1 (en) * 2015-08-06 2017-02-09 Arc Bio, Llc Systems and methods for genomic analysis
CN112735516A (en) * 2020-12-29 2021-04-30 上海派森诺生物科技股份有限公司 Group variation detection analysis method without reference genome

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005250615A (en) * 2004-03-02 2005-09-15 Natl Inst Of Radiological Sciences Gene analysis support system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ARAKI R. ET AL.: "More than 40,000 transcripts, including novel and noncoding transcripts, in mouse embryonic stem cells", STEM CELLS, vol. 24, no. 11, 2006, pages 2522 - 2528 *
BRAUTIGAM A. ET AL.: "Critical assessment of assembly strategies for non-model species mRNA-Seq data and application of next-generation sequencing to the comparison of C(3) and C(4) species", J. EXP. BOT., vol. 62, no. 9, March 2011 (2011-03-01), pages 3093 - 3102 *
FRANSSEN SU. ET AL.: "Comprehensive transcriptome analysis of the highly complex Pisum sativum genome using next generation sequencing", BMC GENOMICS, vol. 12, 11 May 2011 (2011-05-11), pages 227 *
FUKUMURA R. ET AL.: "A sensitive transcriptome analysis method that can detect unknown transcripts", NUCLEIC ACIDS RES., vol. 31, no. 16, 2003, pages E94 *
HARUNOBU YUNOKAWA ET AL.: "An effective way to build a high quality HiCEP peak database (peak:gene) using GS454FLX", DAI 33 KAI THE MOLECULAR BIOLOGY SOCIETY OF JAPAN, DAI 83 KAI ANNUAL MEETING OF THE JAPANESE BIOCHEMICAL SOCIETY GODO TAIKAI YOSHI (4P-1181), 19 November 2010 (2010-11-19), Retrieved from the Internet <URL:http://www.aeplan.co.jp/bmb2010> *

Also Published As

Publication number Publication date
JP2012239430A (en) 2012-12-10
JP5403563B2 (en) 2014-01-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 12786389; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 12786389; Country of ref document: EP; Kind code of ref document: A1)