WO2019132010A1 - Method, apparatus, and program for estimating base type in a base sequence

Publication number
WO2019132010A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
lead
base
nucleic acid
consensus
Prior art date
Application number
PCT/JP2018/048502
Other languages
English (en)
Japanese (ja)
Inventor
公一郎 安益
奈央 泰山
Original Assignee
タカラバイオ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by タカラバイオ株式会社
Priority to JP2019562510A
Publication of WO2019132010A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20: Sequence assembly
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806: Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay

Definitions

  • the present invention relates to a technique for estimating a base type in a base sequence.
  • Cancer cells are said to be generated by mutation of base sequences in oncogenes and tumor suppressor genes (cancer related genes) in normal cells.
  • development of a method to distinguish normal cells from cancer cells has been promoted by applying genetic abnormalities of these cancer-related genes.
  • Non-Patent Document 1 and Non-Patent Document 2 disclose that a trace amount of abnormal DNA derived from cancer cells present in blood or stool could be detected.
  • For molecular targeted drugs that target abnormal proteins produced by mutations in cancer-related genes, the base sequence must be determined with high accuracy.
  • Non-Patent Document 3 discloses smCounter as a technique for enhancing detection accuracy by statistical processing of base sequence information obtained from a sequence analysis device. However, its precision is insufficient in cases that require higher detection sensitivity and accuracy, such as liquid biopsy.
  • An object of the present invention is to provide a method, an apparatus, and a computer program for carrying out the method for estimating a base type in a base sequence.
  • The present inventors found that mutations can be detected with higher sensitivity than with conventional methods by determining a reliability score for each molecular barcode from the large amount of base sequence data (read information) that a next-generation sequencer produces from DNA fragments prepared by adding molecular barcodes to template DNA, and by applying parametric analysis.
  • The present inventors calculate an index for efficiently removing errors generated by nucleic acid amplification or sequencing, and have developed a statistical method that determines the true base type with high accuracy.
  • The detection accuracy enhancement technique of the present invention can be expected, for example, to detect variant nucleic acids at an earlier disease stage.
  • A first estimation method is provided that uses a computer to estimate the presence of a specific base type at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture.
  • The first estimation method comprises:
      a) obtaining a read sequence derived from each nucleic acid molecule;
      b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
      c) obtaining a clustering result of the read sequences based on the mapping result obtained in step b;
      d) obtaining, based on the clustering result, consensus read information consisting of a consensus read and a reliability score corresponding to each base in the consensus read;
      e) selecting, from the sequence information of each consensus read in the set of consensus reads obtained through steps a to d from the nucleic acid molecules constituting the clonal nucleic acid mixture, the base type present at the position corresponding to the specific position on the reference sequence, together with its reliability score;
      f) obtaining the ratio of the number of consensus reads containing each specific base species at the specific position selected in step e to the total number of consensus reads covering the specific position used in step e;
      g) estimating the rate at which a specific base appears due to an analysis error
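Step f can be illustrated with a minimal Python sketch. The input layout (a list of `(start, sequence)` pairs in reference coordinates) and the function name `base_fractions` are assumptions made for illustration only; the source does not specify a data structure, and reliability scores are ignored here.

```python
from collections import Counter

def base_fractions(consensus_reads, position):
    """Fraction of consensus reads carrying each base at a reference position.

    consensus_reads: list of (start, sequence) pairs, where start is the
    reference coordinate of the first base (hypothetical layout).
    Only reads that cover the position enter the denominator.
    """
    counts = Counter()
    for start, seq in consensus_reads:
        offset = position - start
        if 0 <= offset < len(seq):
            counts[seq[offset]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}

reads = [(100, "ACGT"), (100, "ACTT"), (101, "CGT"), (200, "AAAA")]
fractions = base_fractions(reads, 102)
```

In this example three reads cover position 102, so the fourth read does not affect the ratios.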
  • A second estimation method is provided that uses a computer to estimate the proportion, within a clonal nucleic acid mixture, of a specific base species present at a specific position in the base sequence of a nucleic acid molecule constituting the mixture.
  • The second estimation method comprises:
      a) obtaining a read sequence derived from each nucleic acid molecule;
      b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
      c) obtaining a clustering result of the read sequences based on the mapping result obtained in step b;
      d) obtaining, based on the clustering result, consensus read information comprising a consensus read and a reliability score corresponding to each base in the consensus read;
      e) selecting, from the sequence information of each consensus read in the set of consensus reads obtained through steps a to d from the nucleic acid molecules constituting the clonal nucleic acid mixture, the base type present at the position corresponding to the specific position on the reference sequence, together with its reliability score;
      f) obtaining the ratio of the number of consensus reads containing each specific base species at the specific position selected in step e to the total number of consensus reads covering the specific position used in step e;
      g) estimating the occurrence rate of a specific base due to an analysis error at a
  • h) estimating, at a significance level, whether the base species is present in the clonal nucleic acid mixture; and i) when step h judges that the base species selected in step e is present, calculating, from the number of consensus reads containing that base species at the specific position obtained in step f, the ratio of the specific base species at the specific position at the threshold or significance level set in step g, and adopting it as the ratio present in the clonal nucleic acid mixture.
  • A third estimation method is provided that uses a computer to estimate the base sequence of a specific region of each nucleic acid molecule constituting a clonal nucleic acid mixture.
  • The third estimation method comprises:
      a) obtaining a read sequence derived from each nucleic acid molecule;
      b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
      c) obtaining a clustering result of the read sequences based on the mapping result obtained in step b;
      d) obtaining, based on the clustering result, consensus read information consisting of a consensus read and a reliability score corresponding to each base in the consensus read;
      e) selecting, from the sequence information of each consensus read in the set of consensus reads obtained through steps a to d from the nucleic acid molecules constituting the clonal nucleic acid mixture, the base type present at the position corresponding to the specific position on the reference sequence, together with its reliability score;
      f) obtaining a ratio of the number of consensus
  • a computer readable program for causing a computer to execute any of the above first to third estimation methods.
  • an estimation device for estimating the presence of a specific base species at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture.
  • The estimation device comprises: a read sequence acquisition unit for acquiring a read sequence derived from each nucleic acid molecule; a mapping information acquisition unit that supplies the acquired read sequence to a mapping device and acquires the result of mapping onto the reference sequence by the mapping device; a clustering result acquisition unit for acquiring a clustering result of the mapped read sequences; a consensus read information acquisition unit that acquires, based on the clustering result, consensus read information comprising a consensus read and a reliability score corresponding to each base in the consensus read.
  • an estimation device for estimating the presence of a specific base species at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture.
  • The estimation device comprises: a read sequence acquisition unit for acquiring a read sequence derived from each nucleic acid molecule; a mapping information acquisition unit that supplies the acquired read sequence to a mapping device and acquires the result of mapping onto the reference sequence by the mapping device; a first clustering result acquisition unit for acquiring a clustering result of the mapped read sequences; a second clustering result acquisition unit that executes a local assembly using the read sequences included in each cluster and obtains a clustering result of the assembly sequences; a consensus read information acquisition unit that acquires, based on the clustering result of the second clustering result acquisition unit, consensus read information comprising a consensus read and a reliability score corresponding to each base in the consensus read.
  • According to the present invention, it is possible to detect infrequent mutations occurring in a part of the nucleic acids derived from a body sample of a subject, and to estimate the base type in the base sequence.
  • FIG. 6 shows the evaluation results of sensitivity for detecting eight types of mutations in a nucleic acid mixture in Example 1.
  • FIG. 8 shows the abundance ratio comparison of mutant molecules estimated from the read ratio in the nucleic acid mixture in Example 2.
  • the "clonal nucleic acid mixture” in the present disclosure refers to a mixture of multiple nucleic acids generated by repeating biological replication from one nucleic acid molecule.
  • The sample from which such clonal nucleic acid is obtained is not particularly limited by species. Examples include human biological samples such as tissue, blood, plasma, serum, urine, saliva, mucus drainage, sputum, stool, and tears.
  • FFPE stands for formalin-fixed paraffin-embedded.
  • DNA derived from FFPE samples is often damaged, for example by short fragmentation, nicks and gaps, and deamination of cytosine (conversion to uracil). Because such damage directly contributes to poor analysis, it is desirable to repair the DNA using a commercially available reagent such as NEBNext FFPE DNA Repair Mix (manufactured by NEB).
  • the "read sequence” in the present disclosure is base sequence data output from a sequencer.
  • the “infrequently occurring base species at a specific position” in the present disclosure is not particularly limited, and includes, for example, “infrequent mutations”.
  • The term "low frequency mutation" means a mutation of a base in a nucleic acid that arises in vivo and satisfies the following two conditions a and b. a) In a nucleic acid molecule, it appears at a frequency of 1 × 10^-3 per base or less (i.e., 1 or fewer per 1,000 bases). b) In a sample containing nucleic acid molecules, the proportion of nucleic acid molecules having a different base at the specific position is 10% or less of the total number of nucleic acid molecules in the sample.
  • In the following, low frequency mutation is described as an example of a "base type at a specific position which exists at low frequency", but the present disclosure is not limited to the detection of low frequency mutations.
  • The mutation in the present disclosure may be any of substitution, insertion, and deletion. The mutation may also be known or novel.
  • FIG. 1 is a diagram showing a configuration of a mutation detection system (an example of an estimation system) that implements a method of estimating a base sequence according to an embodiment of the present invention.
  • The mutation detection system 100 detects mutations in DNA sequences.
  • The mutation detection system 100 analyzes the fragmented DNA with the sequencer 50, and inputs the read sequence data obtained as a result of the analysis into the mutation detection device 10 (an example of an estimation device).
  • The mutation detection device 10 acquires read sequence data from the sequencer 50 and analyzes it to detect mutations in the DNA sequence.
  • FIG. 2 is a diagram showing the procedure of a novel mutation detection method in the DNA sequence of the present embodiment. The procedure of the method for detecting mutations in DNA sequences will be described with reference to FIG.
  • Genomic DNA and RNA are extracted from a biological sample and fragmented to a length suitable for a sequencer (for example, 1000 bases or less), generating fragmented clonal nucleic acid (S1). If the biological sample already consists of fragmented DNA (cell-free DNA, FFPE sample, etc.) or short RNA (non-coding RNA, etc.), the fragmentation step may be omitted.
  • the present invention is particularly suitably used for analysis of nucleic acids which may contain low frequency mutations.
  • the clonal nucleic acid to be analyzed is DNA or RNA, preferably DNA, more preferably genomic DNA.
  • the sequence length that can be analyzed may be as short as several hundred bp. Therefore, for example, when the clonal nucleic acid is genomic DNA, it is necessary to fragment it into a length that can be analyzed by a sequencer.
  • There is no limitation on the method of fragmentation, and fragmentation may be performed by a known method.
  • Although the present invention is not limited in any way, commercially available instruments and reagents can be used, for example the DNA Shearing System M220 / ME220 (manufactured by Covaris), which physically cleaves using ultrasound, and the QIAseq FX Library Kit (manufactured by Qiagen), which cleaves using an enzyme. Preferably, physical cutting using ultrasound is used.
  • a nucleic acid called "molecular barcode” is added to each fragmented clonal nucleic acid (S2).
  • The template nucleic acid thus obtained is copied later, and the molecular barcode is used to identify which clonal nucleic acid each copy is derived from.
  • One or more unique molecular barcodes may be given to each clonal nucleic acid.
  • the molecular barcode has various names, and may be called “unique molecular identifier (UMI)", “unique molecular tag (UMT)", or simply "molecular tag”. There is no particular limitation on the method of adding the molecular barcode to the clonal nucleic acid, but it can be prepared by binding with an enzyme such as ligase.
  • a clonal nucleic acid to which a molecular barcode is added may be referred to as a template nucleic acid or a nucleic acid to be analyzed.
  • By adding to the template nucleic acid, in addition to the molecular barcode, a sequence to which a primer for PCR (polymerase chain reaction) amplification can anneal, the nucleic acid amplification procedure described below becomes easy to perform.
  • ThruPLEX (registered trademark) Tag-Seq Kit (manufactured by Takara Bio Inc.)
  • When the nucleic acid region to be analyzed is determined in advance, addition of the molecular barcode to the clonal nucleic acid and amplification can be performed simultaneously by using a nucleic acid in which a PCR primer capable of amplifying the region is linked to the molecular barcode.
  • The template nucleic acid including the target region may be concentrated by the so-called probe capture method or the like. The concentration operation may be performed after obtaining a copy.
  • The molecular barcode may further contain a sequence called a "stem" or the like.
  • the stem sequence is a sequence that marks the start of the molecular barcode. Although the sequence length and sequence pattern are not particularly limited, for example, a sequence of 8 to 10 bases in length is used.
  • the stem sequence is placed at a position between the clonal nucleic acid and the molecular barcode.
  • the resulting template nucleic acid is amplified to create a library (S3).
  • Because sequence analysis requires a certain amount of nucleic acid, the amount of nucleic acid must be increased, that is, copies must be obtained, for nucleic acid molecules containing a low frequency mutation as well.
  • the method for producing the copy of the template nucleic acid is not particularly limited, a nucleic acid amplification method may be mentioned, and it is preferable to carry out by a method based on PCR from the viewpoint of ease of operation and the like.
  • the sample to be decoded by the sequencer obtained in this manner is called a library.
  • Mutations in the DNA sequence are analyzed using the nucleotide sequence data obtained by sequencing this library.
  • amplification of the template nucleic acid is performed by a PCR-based method, but is not limited thereto.
  • the method for producing the library is not particularly limited.
  • the region to be analyzed may be concentrated by PCR method or probe capture method, but there is no limitation on the concentration method in practice.
  • The library is input to the sequencer 50, and the base sequence is analyzed (S4). That is, the read sequence can be obtained by determining the nucleotide sequence with a known sequencing method.
  • the platform for the sequencer is not particularly limited, but next generation sequencers are desirable when the amount of information to be analyzed is enormous, such as genomic DNA.
  • HiSeq and MiSeq ® (manufactured by Illumina Inc.)
  • Ion Proton ™ and Ion PGM ™ (manufactured by Thermo Fisher Scientific Inc.)
  • PacBio ® RS II and Sequel ™ (manufactured by Pacific Biosciences, Inc.)
  • MinION, GridION X5, PromethION, and SmidgION
  • the file format of the base sequence data (read sequence) obtained from the sequencer 50 is FASTQ format, but may be other format.
  • the FASTQ format is a format known in the art and is as shown in FIG. 3 (A).
  • the meaning of each column of data is as shown in FIG.
  • In the sequence data, a Phred quality score is output as an index of the accuracy of the base call (base designation) for each base.
  • The per-base sequencing error rate is expressed as the error frequency (P_error).
  • The Phred quality score is calculated automatically by the next-generation sequencer. The formula for converting it to an error frequency differs between platforms, but this embodiment is not limited in that respect.
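As an illustration of the relationship between quality score and error frequency, the sketch below assumes the standard Phred definition Q = -10 log10(P_error) and the Phred+33 ASCII encoding common in FASTQ files. Both are assumptions for this example; as noted above, the conversion can differ between platforms.

```python
import math

def phred_to_error(q):
    """Error probability for a Phred score, assuming Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)

def error_to_phred(p_error):
    """Inverse conversion under the same assumption."""
    return -10 * math.log10(p_error)

def decode_quality(quality_string, offset=33):
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - offset for ch in quality_string]
```

Under this definition a score of 30 corresponds to an expected error of 1 in 1,000 base calls.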
  • Read filtering (also referred to as read cleaning) is performed to remove low-quality read sequences, that is, sequence data containing bases with low base call accuracy (Phred quality score).
  • trimming processing may be performed to remove only the base sequence of the corresponding part.
  • the trimming process is sometimes called a masking process.
  • a known technique can be used for the treatment, and examples thereof include known software such as Trimmomatic and sickle (https://github.com/najoshi/sickle).
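The filtering and masking operations described above can be sketched in a few lines of Python. The thresholds (`min_mean_q`, `min_q`) and function names are hypothetical choices for illustration; tools such as Trimmomatic and sickle implement their own, more elaborate criteria.

```python
def passes_filter(quals, min_mean_q=20.0):
    """Read filtering: keep a read only if its mean Phred score is high enough."""
    return bool(quals) and sum(quals) / len(quals) >= min_mean_q

def mask_low_quality(seq, quals, min_q=20):
    """Masking: replace individual low-quality bases with N, keep the rest."""
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, quals))
```

Filtering discards the whole read, whereas masking keeps the read and hides only the unreliable positions.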
  • FIGS. 4 (a) and 4 (b) are diagrams showing a more specific procedure of the mutation analysis process (S5) of FIG.
  • FIG. 4 (a) shows the procedure for data analysis that takes PCR errors and sequencing errors into account, as described in detail later, and FIG. 4 (b) shows the procedure for data analysis that also takes read alignment into account.
  • FIG. 4 (b) differs from FIG. 4 (a) only in that it includes the steps of "local assembly” (S17).
  • each step of the mutation analysis process (S5) will be specifically described with reference to both FIG. 4 (a) and FIG. 4 (b).
  • Mapping processing is performed to classify the read sequences by comparison with a reference sequence.
  • The reference sequence serves as the standard for comparison, and any sequence can be used.
  • sequences registered in NCBI GenBank, DDBJ, UCSC Genome Browser, etc. can be used.
  • When the analysis target is human genomic DNA, Genome Reference Consortium human build 38 (GRCh38) or UCSC human genome 19 (hg19) can be used as the reference sequence (these versions are updated as needed, so the appropriate version should be chosen as necessary).
  • the region to be the reference sequence can be appropriately selected according to the purpose. It may be the entire length of the sequence included in the record, or only the desired region.
  • The reference sequence is compared with a consensus read described later, and if the consensus read contains a base that differs from the reference sequence, that base is a candidate for mutation.
  • the genome sequence can be generally obtained in FASTA format as a reference sequence.
  • the read sequence is mapped to the reference sequence acquired above.
  • BWA (Wellcome Trust Sanger Institute)
  • Bowtie (Johns Hopkins University)
  • DNASTAR (registered trademark) SeqMan NGen (manufactured by DNASTAR)
  • HISAT2 (Johns Hopkins University)
  • Novoalign (Novocraft)
  • the mapping result is output in a known file format called a SAM / BAM format file as shown in FIG.
  • BAM format files are binary files of SAM format files.
  • the file format for outputting the mapping result is not limited to the BAM format.
  • FIG. 5A shows a SAM / BAM format file to which the result of mapping to the reference sequence (S12) is output, and FIG. 5B shows a SAM / BAM format file to which the result of molecular barcode clustering (S13) has been added.
  • The meaning of each column of the SAM / BAM format data is as shown in FIG.
  • A base having a low Phred quality score is treated as "N" (a base whose type cannot be determined). When compared with the reference sequence, "N" is regarded as a completely matched base, so that the read can still be used for the subsequent analysis.
  • In the mapping results, read sequences with different molecular barcodes may be mixed within one group. In that case, grouping based on molecular barcode sequences makes it possible to reorganize them into the correct groups.
  • the unaligned read sequence contributes to analysis failure and is not used for the subsequent analysis. Therefore, there has been a problem that while obtaining a large amount of sequence data, the number of read sequences available for analysis decreases.
  • In this embodiment, unmapped read sequences are retained for later judgment: in the grouping step based on molecular barcode sequences, they are compared with the generated families to determine whether they can be used for analysis. In this way, the number of read sequences that can be used for analysis is increased (the number of non-analyzable read sequences is reduced).
  • 3) Molecular barcode (UMI) clustering (S13)
  • Mapping information (e.g., read sequence, UMI sequence, mapping position relative to the reference sequence) of each read sequence is obtained.
  • UMI sequencing step clustering of molecular barcodes
  • FIG. 6A shows an example of the state after the read sequences are grouped by their mapping start point and end point on the reference sequence.
  • FIG. 6A exemplifies three groups (Position family) classified based on the mapping position on the reference sequence.
  • FIG. 6 (B) shows an example of the state where it is further divided by the sequence of UMI (molecular barcode).
  • FIG. 6 (B) illustrates two groups (UMI family) classified according to the sequence of UMI (molecular barcode).
  • Step a: Classify the read sequences, independently of the UMI sequence, based on the mapping position of each read sequence on the reference sequence.
  • The classified groups are managed as Position families. That is, read sequences having the same mapping start point and end point on the reference sequence, as described in the SAM / BAM format file, are managed as one Position family.
  • Step b: Count the number of read sequences in each Position family. A flag is set for each Position family whose count is less than the threshold.
  • Step c: Classify the read sequences based on the similarity of UMI sequences within each Position family. A group classified based on the similarity of UMI sequences is called a UMI family. When two or more types of UMI exist on the same library, all UMI sequences are combined.
  • L-UMI and R-UMI are combined.
  • The read sequences in the Position family are further classified based on the similarity of UMI sequences. By setting a similarity threshold and allowing a certain degree of mismatch, mismatched UMIs within the threshold are treated as having the same sequence.
  • A read sequence whose UMI sequence occurs only a small number of times relative to the total number of read sequences is flagged in the SAM / BAM file.
  • Step d: Based on the number of read sequences in the Position family, the final UMI families are reclassified, limited to Position families located in the vicinity.
  • It is determined whether a flagged Position family with a small number of reads can be integrated into a nearby Position family. If integration is judged possible, the flagged Position family is integrated into the neighboring Position family.
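Steps a to c can be sketched as follows. This is a minimal illustration, not the patented procedure: the read records are hypothetical dictionaries with `chrom`, `start`, `end`, and `umi` keys, the thresholds are arbitrary, and the greedy "most frequent UMI first" grouping is one assumed way to merge similar UMIs. Step d (merging flagged families into neighbors) is omitted.

```python
from collections import Counter, defaultdict

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def position_families(reads):
    """Step a: group reads by mapping start and end point, ignoring the UMI."""
    families = defaultdict(list)
    for read in reads:
        families[(read["chrom"], read["start"], read["end"])].append(read)
    return dict(families)

def flag_small_families(families, min_reads=2):
    """Step b: flag Position families with fewer reads than the threshold."""
    return {key for key, members in families.items() if len(members) < min_reads}

def umi_families(family, max_mismatch=1):
    """Step c: split one Position family into UMI families (greedy sketch)."""
    counts = Counter(read["umi"] for read in family)
    groups = {}  # representative UMI -> list of reads
    for umi, _ in counts.most_common():
        rep = next((g for g in groups
                    if len(g) == len(umi) and hamming(g, umi) <= max_mismatch),
                   umi)
        groups.setdefault(rep, []).extend(r for r in family if r["umi"] == umi)
    return groups

def _read(start, end, umi):
    return {"chrom": "chr1", "start": start, "end": end, "umi": umi}

reads = ([_read(100, 200, "AAAAAA")] * 3 + [_read(100, 200, "AAAAAT"),
         _read(100, 200, "CCCCCC"), _read(300, 400, "AAAAAA")])
families = position_families(reads)
flagged = flag_small_families(families)
groups = umi_families(families[("chr1", 100, 200)])
```

Here the single-mismatch UMI `AAAAAT` is absorbed into the `AAAAAA` family, while `CCCCCC` forms its own family.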
  • Whether UMI sequences are identical is determined from sequence similarity using an edit distance such as the Hamming distance.
  • Determination of sequence similarity is not limited to this. For example, indicators such as the phylogenetic distance and evolutionary distance used in molecular phylogenetic analysis, or the similarity of Z-score-based vectors based on motif appearance frequency as assumed in (similarity) network analysis, are also selectable.
  • the criteria for determining the degree of similarity may be changed depending on the length of the UMI sequence and the copy number of the template nucleic acid.
  • Step a: Bases with low Phred quality scores in the sequence data are masked with N (a symbol indicating an arbitrary base) and excluded from the comparison of sequence similarity.
  • Step b: Adjust the edit distance threshold to allow for mismatches between UMI sequences. That is, the read sequences in the Position family are further classified based on the degree of similarity (such as the Hamming distance) of their UMI sequences. By setting a similarity threshold and allowing a certain degree of mismatch, mismatched UMIs satisfying the threshold condition are treated as having the same sequence.
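The two steps above amount to a Hamming distance that skips masked positions, followed by a threshold test. The sketch below assumes equal-length UMIs and a hypothetical default threshold of 2 mismatches.

```python
def masked_hamming(a, b):
    """Hamming distance that ignores positions where either UMI has an N."""
    if len(a) != len(b):
        raise ValueError("UMI sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != "N" and y != "N" and x != y)

def same_umi(a, b, max_mismatch=2):
    """Treat two UMIs as identical when their masked distance is within the threshold."""
    return masked_hamming(a, b) <= max_mismatch
```

Because N positions never count as mismatches, heavily masked UMIs match more easily, so a practical implementation would likely also cap the number of masked bases.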
  • WO 2016/176091 suggests a UMI substitution phenomenon in which the UMI sequence initially given is replaced by a different sequence during PCR amplification and sequencing (e.g., UMI Jumping, PCR Jumping [UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy, Tom Smith et al. http://dx.doi.org/10.1101/gr.209601.116], Index swapping [Characterization and remediation of sample index swaps by non-redundant dual indexing on massively parallel sequencing platforms, Maura Costello et al.
  • Although the present disclosure is not limited in any way, for example, when the molecular barcode consists of 6 bases, a read sequence is regarded as subject to clustering if one or both molecular barcodes have a mismatch of 2 bases or less, or if the 6 bases of the molecular barcode region at either end match. This enables clustering that takes into account the possibility of errors present in the UMI sequences themselves.
  • The error present in the UMI sequence itself refers to a mismatched base, a base with a poor base call, a mapping error, and the like.
  • In existing analysis methods, only mismatched bases were considered.
  • The analysis accuracy can be improved by also considering bases with poor base calls, mapping errors, and the like.
  • When the rate of errors present in the molecular barcode sequence itself is 50% or less, preferably 40% or less, more preferably 35% or less, and particularly preferably 30% or less of the molecular barcode base length, clustering is possible in consideration of the possibility of the error.
  • According to the invention, the error present in the UMI sequence itself may be present in only one molecular barcode or in both molecular barcodes.
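The percentage limits on barcode errors translate into a maximum number of tolerated error bases for a given barcode length. The helper below is a hypothetical illustration using integer arithmetic; rounding down is an assumption, since the source states only the percentages.

```python
def max_umi_errors(umi_length, max_error_percent=30):
    """Largest number of error bases not exceeding the given share of the UMI length."""
    return umi_length * max_error_percent // 100
```

For a 6-base barcode this gives 1 tolerated error at 30%, 2 at 35% or 40%, and 3 at 50%.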
  • The classification result of UMI is added to the data of the SAM / BAM format file as described above. That is, as shown in the SAM / BAM format (output format) of FIG. 5 (B), the name of the read sequence in the first column is written as [chromosome number]-[mapping start point]-[mapping end point]-[alignment pattern]-[UMI sequence after classification]-[UMI sequence before classification]. In addition, BC:Z:sequence:sequence, XD:i:count, and XP:i:status, with the necessary information filled in, are output as optional fields of the SAM / BAM format file (see Table 1 below). The form of appending is not limited to this form.
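A short sketch of how such a read name and the optional fields might be assembled. The function names, the example alignment pattern value, and the exact content of the BC tag are assumptions; only the field order and the tag prefixes (BC:Z, XD:i, XP:i) come from the text above.

```python
def tagged_read_name(chrom, start, end, pattern, umi_after, umi_before):
    """Join the six name components with hyphens, in the order given in the text."""
    parts = (chrom, start, end, pattern, umi_after, umi_before)
    return "-".join(str(part) for part in parts)

def optional_fields(barcode, count, status):
    """Build SAM-style optional fields; the tag contents are illustrative."""
    return [f"BC:Z:{barcode}", f"XD:i:{count}", f"XP:i:{status}"]

name = tagged_read_name("chr1", 100, 200, "F", "AAAAAA", "AAAAAT")
fields = optional_fields("AAAAAA", 4, 0)
```

Keeping both the classified and the original UMI in the name makes the clustering decision traceable from the output file alone.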
  • mutation detection is performed based on a SAM / BAM format file to which UMI classification results are added.
  • the step of this local assembly is a step included in the mutation analysis process shown in FIG. 4 (b).
  • In the local assembly, processing is performed to determine a sequence region that is shorter than the reference sequence but longer than the read sequence length specific to the sample. By comparing this sequence with the reference sequence, it is possible to confirm base substitutions, insertions, or deletions whose size is equal to or greater than the length of the read sequence.
  • the local assembly of this embodiment is described in Rimmer et al. It is coded with reference to Platypus's algorithm published in (Nature Genetics, 2014), and a two-colored de Bruijn Graph is constructed from a reference sequence and a lead sequence to construct an assembly sequence. The options available for local assembly and their examples are shown in Table 2 below.
  • de Bruijn graph the base sequence is first cut into base sequences of length k (k-mer) to make it a Node (apex). Furthermore, each Node (vertex) is connected by Edge (edge). A k-mer Node-Edge is generated with a plurality of read sequences or reference sequences, and a graph in which nodes having the same k-mer and the same Node are combined becomes a de Bruijn graph (see FIG. 7).
  • the origin of the base sequence (derived from the reference sequence or from a read sequence), the presence or absence of the molecular barcode, and its sequence information
  • the reliability score (the Phred quality score of the sequence)
  • the de Bruijn graph targets a partial length of the analyzed gene, and this sequence length can be changed (by the window size option in Table 2). The graph is created using the read sequences mapped to the same region as the reference sequence corresponding to the specified target region. The read sequences used include the read sequences mapped to that region and their paired reads (that is, unmapped reads not mapped to the reference sequence or reads mapped outside the target region). Also, in this embodiment, the color of an edge of the graph is set to 0 on the reference sequence side, 1 on the read sequence side, and 0.5 where both exist, so that the origin of each edge can be distinguished.
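  • The two-colored graph described above can be sketched as follows. This is a minimal illustration, not the embodiment's code: the function names are hypothetical, nodes are k-mers, an edge joins two consecutive k-mers of the same sequence, and the colors 0, 1, and 0.5 follow the convention stated above.

```python
from collections import defaultdict


def kmers(seq, k):
    """Cut a sequence into its overlapping k-mers, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_colored_graph(reference, reads, k):
    """Return {(kmer_from, kmer_to): color} for reference and read edges.

    color 0   = edge seen only in the reference sequence
    color 1   = edge seen only in read sequences
    color 0.5 = edge seen in both
    """
    sources = defaultdict(set)
    for label, seqs in (("ref", [reference]), ("read", reads)):
        for seq in seqs:
            nodes = kmers(seq, k)
            for edge in zip(nodes, nodes[1:]):
                sources[edge].add(label)
    colors = {}
    for edge, labels in sources.items():
        if labels == {"ref"}:
            colors[edge] = 0
        elif labels == {"read"}:
            colors[edge] = 1
        else:
            colors[edge] = 0.5
    return colors


colors = build_colored_graph("ACGTA", ["ACGTT"], k=3)
```

  • In this toy input the read differs from the reference at its last base, so the shared edge ACG→CGT is colored 0.5, the reference-only edge CGT→GTA is colored 0, and the read-only edge CGT→GTT is colored 1.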
  • path search on the graph uses Dijkstra's algorithm, a best-first search algorithm that solves the single-source shortest path problem when edge weights are non-negative.
  • a priority queue is adopted.
  • the individual search paths become assembly sequences, and edges specific to the reference sequence or specific to the read sequences can be discriminated by the difference in color of the edges on the paths, so that mutation candidates can be identified.
  • the route search method and algorithm are not limited to the above.
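  • For reference, Dijkstra's algorithm with a priority queue can be sketched as below using Python's standard `heapq` module. The graph layout (`{node: [(neighbor, weight), ...]}`), the function names, and the example weights are assumptions for illustration; the description does not state how edge weights are assigned.

```python
import heapq


def dijkstra(graph, source):
    """Single-source shortest paths for non-negative edge weights.

    graph: {node: [(neighbor, weight), ...]}
    Returns (dist, prev): shortest distances and predecessor links.
    """
    dist = {source: 0}
    prev = {}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    return dist, prev


def path_to(prev, source, target):
    """Rebuild the node path from source to target using predecessor links."""
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


graph = {
    "ACG": [("CGT", 1), ("CGG", 3)],
    "CGT": [("GTA", 1)],
    "CGG": [("GTA", 0.5)],
}
dist, prev = dijkstra(graph, "ACG")
shortest = path_to(prev, "ACG", "GTA")
```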
  • Consensus read (S14): as described above, especially in “3) Molecular Barcode (UMI) clustering (S13)”, a common read sequence, i.e., a consensus read, is output from the read sequences having the same UMI sequence (the same UMI family), based on the UMI sequences and taking PCR errors and sequence errors into account.
  • the consensus read may be simply referred to as a consensus, a consensus sequence, or a consensus read sequence.
  • FIG. 8A is a view for explaining the processing steps of the consensus read process for obtaining a consensus read (see “(a) Data Analysis Taking Account of PCR Error and Sequence Error” below).
  • FIG. 9A is a view for explaining the processing steps for obtaining a consensus read (see “(b) Data analysis considering read alignment (Large Indel detection)” below). Furthermore, there is another method for “(a) Data analysis in consideration of PCR error and sequence error” (see “(a′) Another method of data analysis in consideration of PCR error and sequence error”).
  • a consensus read and a reliability score (UQ) for each base in the sequence are used.
  • the reliability score (UQ) is calculated for each base in each UMI family, and the reliability score (UQ) is expressed by the following equation.
  • the method of obtaining the reliability score (UQ) for each base in each UMI family is not particularly limited; for example, in the Bayesian approach, it can be estimated as a posterior distribution. Specifically, in each UMI family, the likelihood for each target base (each base) at the corresponding place is estimated.
  • the target bases are elements of the set ⁇, which is composed of all four base types (A, T, G, C) and all insertions or deletions detected in all UMI families mapped onto the same reference sequence (other obs. alleles). In the example of FIG.
  • the statistical model of the well-known software SmCounter is specified in a way that gives low detection sensitivity for inserted or deleted bases.
  • the detection sensitivity also increases for base insertion and base deletion in addition to base substitution.
  • each base is a likelihood function
  • P (each base) is a prior distribution showing the appearance rate of the target base.
  • the prior distribution can adopt a uniform distribution such as 1 / ⁇ . However, if there is any information about ⁇ or P (each base) from prior data etc., it is possible to change.
  • the likelihood function expresses a sequence error (S_error) and a PCR amplification error.
  • the first term indicates the frequency of sequence errors (S_error) and the second term indicates the frequency of occurrence of PCR amplification errors.
  • for the three read sequences whose base is C, a sequence error (an error from A to C) occurs with frequency S_error. On the other hand, for the four read sequences whose base is A, sequencing is performed correctly with frequency (1 − S_error), and the likelihood under this condition is determined.
  • n_y indicates the number (redundancy) of read sequences of each constituent base (Y) of the set ⁇.
  • when Y = A, n_y is four.
  • for T and G, which were not actually detected among the bases of the set ⁇, n_y is zero and the first term is not calculated. That is, only the likelihood due to PCR amplification error (the second term) is calculated.
  • the sequence error (S_error) is calculated based on P_error, which is calculated from the above-mentioned Phred quality score.
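  • To make the Bayesian reading above concrete, the sketch below converts a Phred quality score to an error probability using the standard relation P_error = 10^(−Q/10) and computes a posterior over candidate bases with a uniform prior. It is a deliberately simplified stand-in, not the likelihood of this embodiment: the PCR amplification term is omitted, a single S_error is used for every read, and an undetected base is scored as if every read were a sequence error. All function and parameter names are hypothetical.

```python
def phred_to_error(q):
    """Standard Phred relation: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


def posterior(counts, s_error, alleles=("A", "C", "G", "T")):
    """Posterior probability of each candidate base in one UMI family.

    counts: {base: number of reads showing that base}
    Simplification: sequencing-error term only, uniform prior 1/len(alleles).
    """
    total = sum(counts.get(a, 0) for a in alleles)
    prior = 1 / len(alleles)
    weighted = {}
    for allele in alleles:
        n_match = counts.get(allele, 0)
        # Reads agreeing with the candidate are correct; the rest are errors.
        likelihood = (1 - s_error) ** n_match * s_error ** (total - n_match)
        weighted[allele] = likelihood * prior
    norm = sum(weighted.values())
    return {a: w / norm for a, w in weighted.items()}


# Four reads show A and three show C, as in the example above; Q30 is assumed.
post = posterior({"A": 4, "C": 3}, phred_to_error(30))
```

  • With only the sequencing-error term, a 4-to-3 split already favors A very strongly, which is one way to see why a separate PCR amplification term matters: an early PCR error can produce many reads carrying the same wrong base.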
  • SNV single nucleotide substitution
  • the second term of the likelihood function is a term based on the frequency of PCR amplification errors (misC_p). misC_p is calculated from PCR_error(x).
  • PCR_error(x) is determined as the product of the probability that one mutation occurs in PCR cycle x and the probability that the PCR product of cycle x is included in the PCR product after the final cycle (n).
  • the former probability, that of a mutation appearing, follows the Poisson distribution.
  • the parameters of Poisson distribution can be obtained as expected values from the error appearance rate ( ⁇ ) of the PCR enzyme and the amplification efficiency ( ⁇ ) of the PCR enzyme.
  • the latter probability is calculated as the expected value (maximum likelihood) of the binomial distribution. Therefore, PCR error (x) can be obtained by the following equation.
  • to obtain PCR_error(x), it is important to estimate the PCR cycle in which the mutation appeared; in this embodiment, this estimation is performed based on the number of read sequences of each base in the UMI family.
  • the ratio is 3/7 (42.9%).
  • this proportion of read sequences is approximated by the following formula for the proportion of PCR products derived from a PCR amplification error among all PCR products.
  • S_m (1 + ⁇_m)^(n − x) / (S_o (1 + ⁇_w)^n)
  • S_m is the initial nucleic acid copy number of the PCR amplification error product and is 1.
  • S_o is the initial nucleic acid copy number at the start, which is 1.
  • ⁇_m is the PCR amplification efficiency for the nucleic acid product of the PCR error.
  • ⁇_w is the PCR amplification efficiency for nucleic acid products without PCR errors.
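  • The ratio formula can be inverted to estimate the cycle x in which a PCR error appeared. The sketch below assumes the exponent in the numerator reads n − x (an error arising in cycle x is amplified for the remaining n − x cycles); the function names, the parameter names `eff_mut` and `eff_wt` for the two amplification efficiencies, and the example values (n = 10, perfect doubling) are hypothetical.

```python
import math


def mutant_fraction(x, n, eff_mut, eff_wt, s_m=1, s_o=1):
    """Fraction of final PCR product that descends from an error at cycle x."""
    return s_m * (1 + eff_mut) ** (n - x) / (s_o * (1 + eff_wt) ** n)


def estimate_cycle(ratio, n, eff_mut, eff_wt, s_m=1, s_o=1):
    """Solve mutant_fraction(x, ...) == ratio for x."""
    log_mut = math.log(ratio * s_o / s_m) + n * math.log(1 + eff_wt)
    return n - log_mut / math.log(1 + eff_mut)


# Three of seven reads carry the minority base, as in the example above.
x = estimate_cycle(3 / 7, n=10, eff_mut=1.0, eff_wt=1.0)
```

  • With equal efficiencies and unit copy numbers the result no longer depends on n and reduces to x = log(7/3) / log(2), roughly 1.2, which would place the error in the first or second cycle under these assumptions.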
  • PCR amplification error and sequence error are treated as mutually exclusive events, but the effect of their interaction (cases in which both a PCR amplification error and a sequence error occur) and other factors may also be added to the equation.
  • Other factors include mapping errors (cases in which a read sequence is mapped to a place different from its true origin) and base substitution by oxidation reaction.
  • the reliability score (UQ) is calculated for each base in each UMI family.
  • FIG. 9A is a diagram showing processing steps for obtaining a consensus read.
  • from the shortest paths (assembly sequences, haplotype candidates, H) of each local assembly and the mutation lists derived from them, the maximum posterior probability P is determined for each path (haplotype) having an insertion and/or deletion of 10 bases or more.
  • the maximum posterior probability P in the present embodiment corresponds to the reliability score (UQ) described in (a) above.
  • the first half of the expression accounts for the mapping (alignment) distance of the reads, and the second half accounts for sequence errors.
  • the first-half term, “the probability of the distance between the reads with respect to the insert size (normal distribution μ, σ)”, represents the probability of the distance between the reads (dist_k) when all read sequences (k ∈ ReadPair) having the same UMI sequence (same UMI family, UMI_j) are realigned to the sequence of the haplotype.
  • the population in this embodiment is assumed to be the insert size distribution, a normal distribution defined by the mean value (μ) and the variance (σ²) of the distances between the reads on the reference sequence over all the reads.
  • the read sequences used in the present embodiment can be selected by their mapping information to the reference sequence. For example, in this embodiment, reads having I/D/S/H (unmapped reads not mapped to the reference sequence) in the CIGAR string (see FIG. 5C) of the read sequence, which can be obtained from the mapping result in SAM/BAM format, are extracted. Furthermore, reads having the same UMI are also extracted (even when the above conditions are not met).
  • reads having mutations at the corresponding positions are selected based on the CIGAR string and the NM tag. Furthermore, reads with the same UMI are also extracted regardless of mutations. All mutations in the corresponding part are represented by a set ⁇, to which belongs each element A_i composed of A, T, G, C and all mutations (obs Allele) in the corresponding part.
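  • The CIGAR-based selection described above can be sketched as follows. The operation letters (M, I, D, N, S, H, P, =, X) are those of the SAM format; the record layout (dicts with the keys `name`, `cigar`, and `umi`) and the function names are hypothetical. An unmapped read, whose CIGAR is `*`, yields no operations in this sketch and is therefore only kept through its UMI.

```python
import re

# One CIGAR element: a length followed by a SAM operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_ops(cigar):
    """Parse a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def select_reads(reads):
    """Keep reads whose CIGAR has I/D/S/H, plus all reads sharing their UMI."""
    flagged = [
        r for r in reads
        if any(op in "IDSH" for _, op in cigar_ops(r["cigar"]))
    ]
    umis = {r["umi"] for r in flagged}
    return [r for r in reads if r["umi"] in umis]


reads = [
    {"name": "r1", "cigar": "100M", "umi": "AAA"},
    {"name": "r2", "cigar": "40M12D60M", "umi": "AAA"},
    {"name": "r3", "cigar": "100M", "umi": "CCC"},
]
selected = select_reads(reads)
```

  • Here `r2` is selected because of its 12-base deletion, `r1` is pulled in because it shares the UMI of `r2`, and `r3` is left out.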
  • a specific base type present at a specific position is arbitrarily selected from a set of consensus leads.
  • "arbitrary” means “all cases” and “all cases”.
  • the “specific position” means a position at which the base sequence of the reference sequence and that of the consensus read differ.
  • the “specific base type” means a base of a consensus read different from the base of the reference sequence.
  • the position and base type to be confirmed in the clonal nucleic acid may be set (selected) arbitrarily.
  • the site and base type for the mutation may be selected in advance.
  • a position where a base type different from the reference sequence is present may be extracted by comparing the reference sequence with the consensus read.
  • the base type may also be a base type corresponding to an insertion site or a base type corresponding to a deletion (in the latter case, it is treated as having no corresponding base type).
  • the subsequent estimation step may not be performed.
  • a parametric error model using the reliability score estimates, at a set threshold (or significance level), the rate at which a specific base appears at a specific position due to analysis error.
  • the basic model of base type estimation builds a parametric error model by Poisson distribution, and performs base type estimation at arbitrary places. Depending on the embodiment, it is possible to change the error model. Besides Poisson distribution, for example, tests with multi-valued variables (prior distribution is Dirichlet distribution), tests with beta binomial distribution with binary variables, supervised learning models, etc. can be used as parametric error models of the present invention.
  • Base type estimation includes the following steps a to e.
  • Step a) Select a target location (chr3: 38, 930 in FIG. 8 (B)).
  • Step b) Extract all UMI families mapped to the corresponding part (in FIG. 8 (B), extract j UMI families).
  • Step c) Acquire the consensus and the UQ score in the corresponding part of each UMI family.
  • Step d) Select only the consensus reads, and their UQ scores, that have the base type to be confirmed, which differs from the reference sequence (A).
  • Step e) An error model (Poisson distribution) is constructed from the information obtained through steps a to d (the pattern of consensus reads having the selected base species and their UQ scores).
  • the parameter λ of the Poisson distribution can preferably be expressed as follows.
  • the parameter λ of the Poisson distribution can be calculated according to the following equation.
  • λ = 1000 * (1 − argmin P{C | UMI})
  • argmin P{C | UMI} indicates the minimum value among the five UQ scores UQ(c) (P{C
  • Mutation detection (S16): mutations are detected using the error model (Poisson distribution) constructed as described above. That is, from the error model it is possible to estimate the probability that, when m UMIs are obtained, n consensus reads are obtained as a result of errors.
  • the above-mentioned threshold is also called a significance level, and is generally set to 0.05 (5%), but if necessary it is set to 0.01 (1%) or 0.001 (0.1%). Then, the probability estimated from the error model is compared with the predetermined threshold (such as the 1% level or 5% level). If the estimated probability is lower than the predetermined threshold, the hypothesis that the observation is due to an error is rejected, and the selected base type is determined to be present. This makes it possible to estimate the presence of a base type that is present in only one of 1000 molecules. Thus, the present disclosure can be suitably used for detection of low frequency mutations.
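  • The comparison against the significance level can be sketched as a Poisson tail test. This is an illustration under stated assumptions: the Poisson parameter is passed in as `lam`, the test statistic is taken to be the upper tail P(X ≥ n) for n observed consensus reads (the description does not say whether a tail or a point probability is used), and the function names and example values are hypothetical.

```python
import math


def poisson_tail(n, lam):
    """P(X >= n) for X ~ Poisson(lam)."""
    cdf_below_n = sum(
        math.exp(-lam) * lam ** k / math.factorial(k) for k in range(n)
    )
    return max(0.0, 1.0 - cdf_below_n)


def is_present(n_consensus, lam, alpha=0.01):
    """Reject the 'errors only' hypothesis when the tail falls below alpha."""
    return poisson_tail(n_consensus, lam) < alpha


# Five supporting consensus reads when errors alone would yield about 0.01.
call = is_present(5, lam=0.01, alpha=0.01)
```

  • A single supporting consensus read against an expected error count of 0.5 would not pass the same test, because P(X ≥ 1) is then about 0.39.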
  • the error model can be changed as appropriate.
  • the ratio that the specific base species at the specific base position estimated as described above is present in the clonal nucleic acid mixture can be calculated by the following formula.
  • the ratio is also called mutation frequency.
  • Ratio (mutation frequency) = (number of UMIs of consensus reads with the mutation mapped to the mutation site) / (total number of UMIs of consensus reads mapped to the mutation site)
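  • As a worked example of this ratio, the sketch below counts UMI families at one site, taking one consensus base per UMI family as input. The function name and the input layout are hypothetical.

```python
def mutation_frequency(consensus_bases, mutant_base):
    """Share of UMI families whose consensus base equals the mutant base.

    consensus_bases: one consensus base per UMI family mapped to the site.
    """
    total = len(consensus_bases)
    if total == 0:
        raise ValueError("no UMI families mapped to this site")
    with_mutation = sum(1 for base in consensus_bases if base == mutant_base)
    return with_mutation / total


# 999 families support the reference base A and one family supports C.
frequency = mutation_frequency(["A"] * 999 + ["C"], "C")
```

  • Counting UMI families rather than raw reads keeps a heavily amplified molecule from being counted more than once.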
  • the calculated ratio can be regarded as the percentage (low abundance) of low frequency mutations present in the clonal nucleic acid mixture.
  • the specific base type at the specific base position can be present in the clonal nucleic acid mixture for all bases in the specific region to be analyzed, or the base type at the specific base position is unknown
  • the aggregated sequence information can be the sequence of low frequency variants present in the clonal nucleic acid mixture.
  • when the specific base type at the specific base position can be estimated to be present in the clonal nucleic acid mixture, or when the base type at the specific base position is estimated to be unknown, the verified result is applied to the read to create secondary sequence information; all secondary sequence information of a specific region, obtained by repeating the above-mentioned process while changing the base position, is then integrated and can be adopted as the base sequence of the specific region of each nucleic acid molecule constituting the clonal nucleic acid mixture.
  • mutation filtering can be further performed.
  • P.I. Flaherty et al., Nucleic Acids Res. Vol. 40, No. 1, page e2, 2012 (DOI: 10.1093/nar/gkr861)
  • the mutations detected in normal samples may be excluded from the mutations detected in tumor samples. Thereby, mutations specific to the tumor sample can be extracted (filtered).
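  • The tumor-versus-normal exclusion can be sketched as a set difference over mutation calls. The representation of a call as a `(chromosome, position, reference base, alternate base)` tuple, the function name, and the example coordinates are hypothetical.

```python
def tumor_specific(tumor_calls, normal_calls):
    """Drop every tumor call that was also detected in the normal sample."""
    normal = set(normal_calls)
    return [call for call in tumor_calls if call not in normal]


tumor = [("chr3", 1000, "G", "A"), ("chr7", 2000, "T", "G")]
normal = [("chr7", 2000, "T", "G")]
specific = tumor_specific(tumor, normal)
```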
  • Example 1 Detection of Infrequent Mutations in Fragmented Nucleic Acid Mixture
  • 4 types of EGFR, KRAS, NRAS and PIK3CA are used.
  • the Multiplex I cfDNA Reference Standard Set (manufactured by Horizon) was used as a human-derived clonal nucleic acid having, at different mixing ratios, copies of wild-type genes and copies of eight mutant genes in which rare mutations occur in the same genes in a human gene group.
  • the Multiplex I cfDNA Reference Standard Set is a standard product of clonal nucleic acid artificially fragmented to about 160 bp to mimic cell-free nucleic acid (cell-free DNA) in plasma.
  • ThruPLEX Tag-seq 6S (12) Kit manufactured by Takara Bio Inc.
  • a Stem-Loop adapter was added to both ends of 50 ng of clonal nucleic acid by a ligation method to obtain a template nucleic acid.
  • PCR amplification was performed with a universal primer having an adapter sequence to prepare an amplified nucleic acid product (adapter + template nucleic acid) for Illumina's sequencer.
  • because a UMI of random sequence consisting of 6 bases is contained in each Stem-Loop adapter, a UMI is added to each template nucleic acid before amplification (two UMIs in total are included).
  • paired-end sequencing of 100 bp from each end was performed for each sample to obtain base sequence data (read sequence data).
  • the read sequence data was mapped to the human reference sequence (hg19) using BWA-MEM (version: 0.7.15). From the resulting SAM / BAM format file, mapping information for the reference sequence and UMI sequence information were obtained for each read sequence.
  • Molecular barcode clustering was performed on a specific target area based on the mapping position information and UMI sequence information according to the method described above.
  • the conditions for classification are as follows. Condition 1: UMIs with two or fewer mismatched bases are treated as the same UMI. Condition 2: even if condition 1 is not satisfied, if at least one UMI sequence is identical, the UMIs are treated as the same UMI.
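  • The two classification conditions can be sketched as a predicate over pairs of UMIs. This sketch assumes each molecule carries two 6-base UMIs (one per adapter) and that the two-mismatch limit of condition 1 applies to the total across both UMIs; the description does not state whether the limit is per UMI or in total. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def same_umi(pair_a, pair_b, max_mismatch=2):
    """Decide whether two (umi_1, umi_2) pairs belong to the same UMI family."""
    mismatches = sum(hamming(a, b) for a, b in zip(pair_a, pair_b))
    if mismatches <= max_mismatch:
        return True  # condition 1: at most two mismatched bases
    # condition 2: at least one of the two UMI sequences is identical
    return any(a == b for a, b in zip(pair_a, pair_b))


close = same_umi(("ACGTAC", "TTGACA"), ("ACGTAA", "TTGACT"))
half = same_umi(("ACGTAC", "TTGACA"), ("ACGTAC", "GGCTGT"))
far = same_umi(("ACGTAC", "TTGACA"), ("ACCCAC", "TTGGGA"))
```

  • In the example, `close` passes condition 1 with two mismatches in total, `half` passes condition 2 because the first UMI is identical, and `far` fails both conditions.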
  • Consensus read generation was performed according to the method described above. Thereafter, for the bases at 178,935,997 to 178,936,121 bp on chromosome 3, corresponding to exon 10 of the PIK3CA gene, two types of error rates were calculated and compared: 1) the error rate P_error calculated from the Phred quality score given to the read sequence by the Illumina sequencer, and 2) the error rate 1 − UQ(base) calculated from UQ. The results are shown in FIG. 10. In FIG. 10, using the SAM/BAM file after molecular barcode clustering, the two types of error rates were calculated for the base with the largest number of UMIs at each position and plotted on the Phred scale (−10 log10(error rate)).
  • the error rate calculated from the Phred quality score (⁇ in FIG. 10) is between 30 and 40 on the Phred scale, whereas the error rate based on UQ (⁇ in FIG. 10) shows a value over 50; the error rate was thus reduced to about 1/10 over the entire region.
  • LoFreq is mutation detection software reported in a paper (Wilm A et al., Nucleic Acids Res. 2012 40 (22): 11189-11201); it is implemented to detect mutations by performing Bernoulli trials based on the sequence error rate calculated from the Phred quality score.
  • in FIG. 11, mutation detection was performed after acquiring consensus reads using Connor (version: 0.5.1, https://github.com/umich-brcf-bioinf/Connor) from the SAM/BAM files mapped by BWA.
  • SmCounter is software that performs error correction based on molecular barcodes by the Bayesian approach to detect mutations (Non-Patent Document 3).
  • FIG. 11 shows the result of mutation detection using the SmCounter, with the SAM / BAM format file subjected to molecular barcode clustering (the disclosed method) as an input file.
  • the analysis code of the present invention detected all eight mutant genes at a mixing ratio of 1.0% (indicated by hatching), and the detection sensitivity was 100% (8 types / 8 types). Furthermore, at a mixing ratio of 0.1%, five mutations were detected (indicated by hatching): the SNV L858R and the deletion Δ746-750 of the EGFR gene, the SNVs G12D and A59T of the NRAS gene, and the SNV E545K of PIK3CA. The detection sensitivity was 62.5% (5 types / 8 types). The p-values of the detected mutations were all on the order of E-06 (10^-6), which is lower than the threshold (1% level).
  • the detection method of the present embodiment showed a detection sensitivity superior to that of the conventional method, and it could detect even very low-frequency mutations that could not be detected by the conventional method.
  • Example 2 Quantification of mutation frequency in a nucleic acid mixture
  • the Tru-Q7 (1.3% Tier) Reference Standard (Part No.: HD734) (manufactured by Horizon), which has copies of wild-type genes and copies of 35 mutant genes in which rare mutations occur in the same genes at various mixing ratios of 1.0% to 30%, was used as a template nucleic acid.
  • as in Example 1, a library for the Illumina sequencer was prepared using the ThruPLEX Tag-seq 6S (12) Kit, and sequence reads were obtained by HiSeq.
  • UMI classification can be realized with high accuracy by clustering of molecular barcodes according to the method of the present disclosure.
  • low-frequency mutations could be detected with high accuracy even with degraded (fragmented) low-quality genomic DNA.
  • FIG. 13 is a block diagram showing a specific internal configuration of the mutation detection device 10 for performing the mutation detection method described above.
  • the mutation detection device 10 is configured by an information processing device such as a personal computer.
  • the mutation detection device 10 includes a control unit 11 that controls the overall operation, a display unit 13 that performs screen display, an operation unit 15 with which a user performs operations, a data storage unit 17 that stores data and programs, an interface unit 18 that communicates with external devices and networks, and a RAM (Random Access Memory) 16 that is the main memory used by the control unit 11 for processing.
  • the display unit 13 is configured of, for example, a liquid crystal display or an organic EL display.
  • the operation unit 15 includes a keyboard, a mouse, a touch panel, and the like.
  • the interface unit 18 can connect various devices (printer, communication device, input device, etc.) conforming to an interface such as USB, HDMI (registered trademark), or SCSI, and enables communication of data and control commands between the connected device and the mutation detection device 10.
  • the mutation detection apparatus 10 acquires the read sequence data from the next-generation sequencer 50 via the interface unit 18.
  • the control unit 11 controls the entire operation of the mutation detection device 10, and is configured by a CPU or an MPU that realizes a predetermined function by executing a program.
  • the program executed by the control unit 11 can be provided via a communication line, or a recording medium such as a CD, a DVD, a memory card or the like.
  • the control unit 11 may be configured by a dedicated hardware circuit (FPGA, ASIC, etc.) designed to realize a predetermined function.
  • the program uses, for example, a computer language such as JAVA (registered trademark), C, C ++, C #, OBJECTIVE-C, SWIFT, or a script language such as PERL, PYTHON, BIO-RUBY, or any suitable language.
  • the software code may be stored as a sequence of instructions on a computer readable medium for storage and / or transmission, such as RAM 16. Alternatively, it can be transmitted using a carrier signal that is encoded and adapted for transmission over wired, optical, and / or wireless communication lines that conform to various protocols, including the Internet.
  • the data storage unit 17 is a device that stores data and programs, and can be configured by, for example, a hard disk (HDD), an SSD (SOLID STATE DRIVE), a semiconductor memory device, or an optical disk.
  • the data storage unit 17 stores a control program executed by the control unit 11, read sequence data, and the like.
  • FIGS. 14A and 14B are flowcharts showing processing executed by the control unit 11 of the mutation detection device 10 of FIG. 13. The flowchart of FIG. 14A shows a process related to data analysis in consideration of PCR errors and sequence errors. The flowchart of FIG. 14B shows a process related to data analysis in consideration of read alignment.
  • the mutation detection device 10 acquires the read sequence data from the sequencer 50 via the interface unit 18 (S51).
  • the control unit 11 of the mutation detection device 10 performs mapping processing of each read sequence data to a reference sequence (S52).
  • the control unit 11 classifies the read sequence data based on the position on the reference sequence and the UMI according to the method described above (S53). Thereafter, the control unit 11 performs a consensus read process to create a consensus read for each UMI family (S54).
  • the control unit 11 constructs an error model (S55). Specifically, the control unit 11 calculates the reliability score UQ, obtains the parameter λ of the error model (Poisson distribution) from the pattern of consensus reads and the reliability score UQ, and constructs the error model.
  • the control unit 11 detects a mutation with reference to the error model (S56). Specifically, the control unit 11 selects the position and base type for which the mutation is to be confirmed, and estimates the probability that the observation arises from error with reference to the error model. Then, when the estimated value is smaller than the threshold value, the control unit 11 determines that no error has occurred and that the selected base type is correct. The flow shown in FIG. 14(a) can detect small-sized mutations.
  • control unit 11 performs mutation filtering (S57) and excludes a mutation that satisfies a predetermined condition from the mutation detection result.
  • the mutation detection device 10 acquires the read sequence data from the sequencer 50 via the interface unit 18 (S51).
  • the control unit 11 of the mutation detection device 10 performs mapping processing of each read sequence data to a reference sequence (S52).
  • the control unit 11 classifies the read sequence data based on the position on the reference sequence and the UMI according to the method described above (S53).
  • between “clustering of molecular barcodes (S53)” and the subsequent “consensus read processing (S54)”, the control unit 11 performs a local assembly process (S58) with respect to the reference sequence to determine a sequence region longer than the sample-specific read sequence length.
  • the control unit 11 then performs a consensus read process to create a consensus read for each UMI family (S54).
  • the control unit 11 constructs an error model (S55). Specifically, the control unit 11 calculates the reliability score UQ, obtains the parameter λ of the error model (Poisson distribution) from the pattern of consensus reads and the reliability score UQ, and constructs the error model.
  • the control unit 11 detects a mutation with reference to the error model (S56). Specifically, the control unit 11 selects the position and base type for which the mutation is to be confirmed, and estimates the probability that the observation arises from error with reference to the error model.
  • when the estimated value is smaller than the threshold value, the control unit 11 determines that no error has occurred and that the selected base type is correct. Furthermore, if necessary, the control unit 11 performs mutation filtering (S57) and excludes mutations that satisfy a predetermined condition from the mutation detection result.
  • the mutation detection apparatus 10 can detect a mutation in the DNA sequence based on the data of the read sequence from the sequencer 50.
  • a method of using a computer to estimate the presence of a specific base type present at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture: a) obtaining a read sequence derived from each nucleic acid molecule; b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result; c) obtaining a clustering result of the read sequences based on the mapping result obtained in step b; d) obtaining consensus read information consisting of a consensus read and a reliability score corresponding to each base in the consensus read, based on the clustering result; e) selecting, from the sequence information of each consensus read contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence and its reliability score; f) obtaining a ratio of the number of consensus reads containing each specific base species at the
  • h) estimating presence in the clonal nucleic acid mixture at the significance level; and i) when it is judged in step h that the base species selected in step e is present, calculating, from the number of consensus reads containing the base species at the specific position obtained in step f, the ratio of the specific base species at the specific position selected in step e at the threshold or significance level set in step g, and adopting it as the ratio present in the clonal nucleic acid mixture; the method including these steps.
  • when the reference sequence adopted in step b and the base sequence on the consensus read are the same, that base may be adopted as is as the base type of the secondary information prepared in step i, without going through steps e to h.
  • the specific base position selected in step e may be a position at which the base type on the consensus read differs from that of the reference sequence adopted in step b.
  • the nucleic acid molecule may have a molecular barcode at at least one end.
  • each nucleic acid molecule may have a different molecular barcode.
  • each nucleic acid molecule may have one or more different molecular barcodes.
  • the clustering in step c may be performed using the sequence of the molecular barcode region in the read sequence as an index.
  • the molecular barcode may be a random sequence of base sequences.
  • the mismatch in the read sequence of the molecular barcode region may be within the allowable range of the sequence similarity between the read sequences of the molecular barcode region.
  • the reliability score may be obtained by the Bayesian approach from the information on the clustered read sequences.
  • the read sequence may be a read sequence of the amplification product of the nucleic acid molecule.
  • an amplification product may be obtained by polymerase chain reaction.
  • the likelihood function of the Bayesian approach may be a function taking into account sequencer errors and amplification errors.
  • the parametric error model of step g may be based on a Poisson distribution.
  • the specific base type at the specific position may be a target base of low frequency mutation.
  • step c may proceed directly to step d.
  • (21) There may be provided a method comprising the step of determining a specific base species present at a specific position in the base sequence of the nucleic acid molecule constituting the clonal nucleic acid mixture, based on the results obtained by the methods of (19) and (20).
  • An estimation device for estimating the presence of a specific base type at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture, the device comprising: a read sequence acquisition unit that acquires a read sequence derived from each nucleic acid molecule; a mapping information acquisition unit that supplies the acquired read sequences to a mapping device and acquires the result of mapping them to a reference sequence by the mapping device; a clustering result acquisition unit that acquires a clustering result of the mapped read sequences; and a consensus read information acquisition unit that acquires, based on the clustering result, a consensus read and a reliability score corresponding to each base in the consensus read.
  • An estimation device for estimating the presence of a specific base type at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture, the device comprising: a read sequence acquisition unit that acquires a read sequence derived from each nucleic acid molecule; a mapping information acquisition unit that supplies the acquired read sequences to a mapping device and acquires the result of mapping them to a reference sequence by the mapping device; a first clustering result acquisition unit that acquires a clustering result of the mapped read sequences; a second clustering result acquisition unit that executes a local assembly using the read sequences included in each cluster and acquires a clustering result of the assembled sequences; and a consensus read information acquisition unit that acquires, based on the clustering result of the second clustering result acquisition unit, a consensus read and a reliability score corresponding to each base in the consensus read.
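
The barcode-based clustering and the Bayesian reliability score described in the items above can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The read tuple layout `(barcode, position, sequence)`, the greedy seed-based clustering, the uniform prior, the function names, and the error rates `e_seq` and `e_amp` are all assumptions made for illustration. In particular, adding the sequencer and amplification error rates into one per-base rate treats amplification errors as independent between reads, which is a simplification.

```python
from collections import Counter
from math import prod

BASES = "ACGT"

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def cluster_by_barcode(reads, max_mismatch=1):
    """Greedily cluster reads given as (barcode, position, sequence) tuples.

    A read joins an existing cluster when it maps to the same position and its
    barcode differs from the cluster's seed barcode by at most `max_mismatch`
    bases (the assumed "allowable range" of barcode similarity).
    """
    counts = Counter((pos, bc) for bc, pos, _ in reads)
    clusters = {}  # (position, seed barcode) -> list of read sequences
    # Visit reads with the most frequent barcodes first so they become seeds.
    for bc, pos, seq in sorted(reads, key=lambda r: -counts[(r[1], r[0])]):
        for (p, seed) in clusters:
            if p == pos and len(seed) == len(bc) and hamming(seed, bc) <= max_mismatch:
                clusters[(p, seed)].append(seq)
                break
        else:
            clusters[(pos, bc)] = [seq]
    return clusters

def consensus_with_scores(seqs, e_seq=0.01, e_amp=0.001):
    """Return (consensus read, per-base posterior probabilities).

    For each column, the likelihood of a candidate true base is the product
    over reads of (1 - e) for a matching observation and e / 3 otherwise,
    with e = e_seq + e_amp (hypothetical rates). With a uniform prior, the
    posterior of the best base serves as the reliability score.
    """
    e = e_seq + e_amp
    consensus, scores = [], []
    for column in zip(*seqs):
        like = {b: prod((1 - e) if obs == b else e / 3 for obs in column)
                for b in BASES}
        best = max(BASES, key=like.get)
        consensus.append(best)
        scores.append(like[best] / sum(like.values()))
    return "".join(consensus), scores
```

Called on a cluster of equal-length reads, `consensus_with_scores` returns a lower score at columns where the reads disagree, which is the property the later error-model step relies on.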

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The present invention relates to a method, an apparatus and a program for estimating a base type in a base sequence. The method comprises the following steps: acquiring a read sequence derived from each nucleic acid molecule (S11); mapping the acquired read sequences to a reference sequence (S12); clustering the read sequences based on the mapping results (S13); determining, based on the clustering results, a consensus read and a reliability score corresponding to each base in the consensus read (S14); arbitrarily selecting a base type at the specific position from the consensus reads, acquiring the number of consensus reads containing each base at the selected position relative to the total number of consensus reads, and constructing an error model according to a parametric error model using the reliability score corresponding to the selected base type at the specific position (S15); and estimating, at a set confidence level, the incidence attributable to analysis error for the base type at the specific position and, based on that incidence, estimating at the set confidence level whether the specific base type at the specific position is present in the clonal nucleic acid mixture (S16).
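
Steps S15 and S16 can be sketched in Python as follows. This is a hedged illustration, not the method as claimed: it assumes that the expected number of consensus reads showing a given wrong base is the sum of `(1 - score) / 3` over all consensus reads covering the position, that this count follows a Poisson distribution, and that a base is reported as present when its observed count exceeds the Poisson quantile at the chosen confidence level. The function names and the default confidence of 0.99 are hypothetical.

```python
from math import exp

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term.

    Assumes lam is small enough that exp(-lam) does not underflow to zero.
    """
    term = exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def error_count_threshold(lam, confidence=0.99):
    """Smallest k such that P(X <= k) >= confidence under the error model."""
    k = 0
    while poisson_cdf(k, lam) < confidence:
        k += 1
    return k

def call_base(scores, observed, confidence=0.99):
    """Decide whether a candidate base is present at a position.

    scores:   reliability scores of all consensus reads covering the position
    observed: number of consensus reads showing the candidate base
    Returns (present, expected_error_count, threshold).
    """
    # Expected count of consensus reads showing this particular wrong base.
    lam = sum((1.0 - s) / 3.0 for s in scores)
    k = error_count_threshold(lam, confidence)
    return observed > k, lam, k
```

With 1000 consensus reads each scored 0.997, the expected error count for one wrong base is about 1, and the 99% threshold under this model is 4 reads, so a candidate base seen in 5 consensus reads would be reported as present and one seen in 4 would not.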
PCT/JP2018/048502 2017-12-28 2018-12-28 Method, apparatus and program for estimating base type in a base sequence WO2019132010A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2019562510A JPWO2019132010A1 (ja) 2017-12-28 2018-12-28 Method, apparatus and program for estimating base type in a base sequence

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-253510 2017-12-28
JP2017253510 2017-12-28

Publications (1)

Publication Number Publication Date
WO2019132010A1 (fr) 2019-07-04

Family

ID=67063856

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/048502 WO2019132010A1 (fr) 2017-12-28 2018-12-28 Method, apparatus and program for estimating base type in a base sequence

Country Status (2)

Country Link
JP (1) JPWO2019132010A1 (fr)
WO (1) WO2019132010A1 (fr)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120245039A1 (en) * 2009-09-22 2012-09-27 President And Fellows Of Harvard College Entangled Mate Sequencing
US20130217006A1 (en) * 2008-11-20 2013-08-22 Pacific Biosciences Of California, Inc. Algorithms for sequence determination
US20160273049A1 (en) * 2015-03-16 2016-09-22 Personal Genome Diagnostics, Inc. Systems and methods for analyzing nucleic acid
JP2017506875A (ja) * 2013-12-28 2017-03-16 ガーダント ヘルス, インコーポレイテッド Methods and systems for detecting genetic variants
JP2017099400A (ja) * 2014-07-02 2017-06-08 株式会社Dnaチップ研究所 Method for counting nucleic acid molecules
JP2017517282A (ja) * 2014-05-23 2017-06-29 ユニバーシティ オブ テクノロジー,シドニー Sequencing process
WO2017191073A1 (fr) * 2016-05-01 2017-11-09 Genome Research Limited Mutational signatures in cancer

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
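CN114150047A (zh) * 2020-12-29 2022-03-08 阅尔基因技术(苏州)有限公司 Method for evaluating base damage, mismatches and variation in sample DNA using first-generation sequencing
CN115831233A (zh) * 2023-02-07 2023-03-21 杭州联川基因诊断技术有限公司 mTag-based method, device and medium for preprocessing targeted sequencing data
CN117116348A (zh) * 2023-02-07 2023-11-24 杭州联川基因诊断技术有限公司 Method, device and medium for correcting mTag sequences of targeted sequencing data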

Also Published As

Publication number Publication date
JPWO2019132010A1 (ja) 2021-01-21

Similar Documents

Publication Publication Date Title
Smith et al. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy
EP3143537B1 (fr) Rare variant calls in ultra-deep sequencing
JP6725481B2 (ja) Noninvasive prenatal molecular karyotyping from maternal plasma
AU2022203114A1 (en) Detecting mutations for cancer screening and fetal analysis
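KR102638152B1 (ko) Validation methods and systems for sequence variant calls
CN106715711B (zh) Method for determining probe sequences and method for detecting genomic structural variation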
US20220130488A1 (en) Methods for detecting copy-number variations in next-generation sequencing
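CN113196404A (zh) Cancer tissue-of-origin prediction using multi-tier analysis of small variants in cell-free DNA samples
WO2019132010A1 (fr) Method, apparatus and program for estimating base type in a base sequence
CN109461473B (zh) Method and device for obtaining fetal cell-free DNA concentration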
Roy et al. NGS-μsat: bioinformatics framework supporting high throughput microsatellite genotyping from next generation sequencing platforms
US20190108311A1 (en) Site-specific noise model for targeted sequencing
WO2019213810A1 (fr) Method, apparatus and system for detecting chromosomal aneuploidy
Niehus et al. PopDel identifies medium-size deletions jointly in tens of thousands of genomes
US11001880B2 (en) Development of SNP islands and application of SNP islands in genomic analysis
Hu et al. Processing UMI Datasets at High Accuracy and Efficiency with the Sentieon ctDNA Analysis Pipeline
Prjibelski et al. IsoQuant: a tool for accurate novel isoform discovery with long reads
US20220399079A1 (en) Method and system for combined dna-rna sequencing analysis to enhance variant-calling performance and characterize variant expression status
Prodanov Read Mapping, Variant Calling, and Copy Number Variation Detection in Segmental Duplications
Lim Copy number estimation for high-throughput short read shotgun sequencing de novo whole-genome assembly contigs
WO2023031641A1 (fr) Methods and devices for non-invasive prenatal screening

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 18895469

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019562510

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 18895469

Country of ref document: EP

Kind code of ref document: A1