CN114496077B - Methods, devices, and media for detecting single nucleotide variations and indels - Google Patents

Methods, devices, and media for detecting single nucleotide variations and indels Download PDF

Info

Publication number
CN114496077B
CN114496077B CN202210392841.6A CN202210392841A CN114496077B CN 114496077 B CN114496077 B CN 114496077B CN 202210392841 A CN202210392841 A CN 202210392841A CN 114496077 B CN114496077 B CN 114496077B
Authority
CN
China
Prior art keywords
sequence
read
haplotype
single nucleotide
genotype
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210392841.6A
Other languages
Chinese (zh)
Other versions
CN114496077A (en
Inventor
杨旗
张钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Berry Genomics Co Ltd
Original Assignee
Berry Genomics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Berry Genomics Co Ltd filed Critical Berry Genomics Co Ltd
Priority to CN202210392841.6A priority Critical patent/CN114496077B/en
Publication of CN114496077A publication Critical patent/CN114496077A/en
Application granted granted Critical
Publication of CN114496077B publication Critical patent/CN114496077B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to a method, apparatus and medium for detecting single nucleotide variations and indels. The method comprises the following steps: calculating activation probabilities based on the aligned sequence stacks of genomic loci so as to merge into activation regions genomic loci having consecutive activation probabilities greater than a predetermined probability threshold; obtaining read-length segments within the extended activation region that meet a predetermined quality condition for assembly against the local reference genomic sequence and the read-length segments to generate a haplotype sequence, the haplotype sequence comprising a haplotype sequence repaired via a semi-assembled haplotype sequence; aligning the haplotype sequences to a local reference genomic sequence to identify the SNP and INDEL based on the alignment results; determining genotype types to generate variant feature values; and generating a prediction result about the SNP and the INDEL through a prediction model constructed by a random forest model based on the variation characteristic value. Therefore, the overall accuracy and sensitivity of the detection result can be obviously improved.

Description

Methods, devices, and media for detecting single nucleotide variations and indels
Technical Field
The present invention relates generally to biological information processing, and in particular, to methods, computing devices, and computer storage media for detecting single nucleotide variations and indels.
Background
The conventional methods for detecting single nucleotide variations (SNPs) and INDELs (INDELs) mainly include two methods.
The first method for detecting SNPs and INDELs is based on pileup detection algorithms, such as samtools, GATK UnifiedGenotyper (UG) detection software. Pileup-based detection algorithms detect single nucleotide variations (SNPs) and small insertion deletions (INDELs) by scanning short sequences into a reference genome alignment. Specifically, it calculates the probability value of occurrence of variation for the sequence stacking (pileup) at one site of the reference genome, taking the mismatched bases, base quality, and alignment quality into consideration, for detecting SNPs and INDELs. Including samtools, GATK UnifiedGenotyper (UG), etc., which can detect SNPs relatively accurately, but detect INDEL with low accuracy.
The second method for detecting SNPs and INDEL is to detect SNPs and INDEL by a local assembly method, for example, Platypus and GATK HaplotpypeCaller (HC) detection software. In order to improve the accuracy of mutation detection, the GATK HaplotpypeCaller (HC) obtains a haplotype sequence by a de Bruijn-like graph (DBG for short) local assembly method. Since the end-to-end haplotype sequence cannot be generated by local assembly due to the uneven sequencing coverage and sequencing errors, the second method for detecting SNP and INDEL is liable to cause poor accuracy of the detection result when the sequencing coverage is uneven and the sequencing errors occur.
In summary, the conventional scheme for detecting single nucleotide variation and indels has the following disadvantages: it is difficult to obtain high accuracy detection results when sequencing coverage is not uniform and sequencing errors occur.
Disclosure of Invention
The present invention provides a method, computing device and computer storage medium for detecting single nucleotide variations and indels that can significantly improve the overall accuracy and sensitivity of the detection results even when sequencing coverage is not uniform and sequencing errors occur.
According to a first aspect of the invention, a method for detecting single nucleotide variations and indels is provided. The method comprises the following steps: calculating activation probabilities for the reference genomic loci based on the aligned sequence stacks for the reference genomic loci, such that genomic loci with consecutive activation probabilities greater than a predetermined probability threshold are merged into an activation region; obtaining read-length fragments of the sample to be tested, which meet a predetermined quality condition, within the extended activation region for assembly against the local reference genomic sequence and the read-length fragments for generating a haplotype sequence, the generated haplotype sequence comprising a haplotype sequence repaired by a semi-assembled haplotype sequence; aligning the haplotype sequences to a local reference genomic sequence to identify single nucleotide variations and indels based on the alignment results; calculating genotype probabilities for single nucleotide variations and insertion deletions in order to determine genotype types for generating variation characteristic values; and generating a prediction result about the single nucleotide variation and the insertion deletion based on the variation characteristic value through a prediction model constructed based on a random forest model.
According to a second aspect of the present invention, there is also provided a computing device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more programs to cause the apparatus to perform the method of the first aspect of the invention.
According to a third aspect of the invention, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first aspect of the invention.
In some embodiments, assembling based on the mapping to the local reference genomic sequence and the read fragments to generate the haplotype sequence comprises: extending each activation region to two sides by a preset number of basic groups respectively so as to obtain corresponding extension intervals; preprocessing the read length in the extension interval so as to obtain read length fragments meeting the preset quality condition; assembling the local reference genome sequence and the read length segment, wherein the assembled sequences comprise a fully assembled haplotype sequence and a semi-assembled haplotype sequence; and identifying the semi-assembled haplotype sequences for repair against the semi-assembled haplotype sequences for generating fully-assembled haplotype sequences.
In some embodiments, preprocessing the read lengths within the extended interval to obtain read length segments meeting a predetermined quality condition comprises: calculating the root mean square of the base masses of the read-long soft-sheared sequences within the extended interval; determining whether the calculated root mean square is less than a predetermined threshold; removing the soft-clip sequence of read lengths in response to determining that the calculated root mean square is less than a predetermined threshold; scanning the read length within the extension interval to determine whether the base of the read length does not match the reference genome and whether the base quality is less than a predetermined base quality threshold; responsive to determining that the base of the read length does not match the reference genome and that the base quality is less than a predetermined base quality threshold, performing segmentation for the read length; and filtering the read lengths subjected to segmentation so as to determine the read length fragments with the length larger than or equal to a preset length threshold value as the read length fragments meeting a preset quality condition.
In some embodiments, assembling for the local reference genomic sequence and the read length segments comprises: according to the sequence of the lengths from small to large, carrying out iteration of the current round aiming at the segment unit of the local reference sequence with the length being a preset value by a preset step length; inserting fragment units of the local reference sequence of the current iteration into the de Bruijn graph; determining whether a loop exists in the de Bruijn graph; in response to determining that no loops exist in the de Bruijn graph, inserting fragment units that read long sequences into the de Bruijn graph to determine whether loops exist in the inserted de Bruijn graph; in response to determining that a loop exists in the de Bruijn graph, the segment units of the local reference sequence for the current iteration are increased by a predetermined step size for the next iteration against the segment units of the local reference sequence.
In some embodiments, assembling for the local reference genomic sequence and the read length segment further comprises: in response to determining that a loop exists in the inserted de Bruijn graph, increasing the fragment unit of the read-long sequence of the current iteration by a predetermined step size so as to perform the next iteration for the fragment unit of the read-long sequence; in response to determining that no loops exist in the de Bruijn graph after the insertion, removing edges in the de Bruijn graph having a depth below a predetermined depth threshold and paths in the de Bruijn graph that do not start with the reference node and do not end with the reference node; looking up all paths in the de Bruijn graph to determine whether the number of paths is greater than a predetermined path number threshold; in response to determining that the number of paths is greater than the predetermined path number threshold, increasing the fragment unit of the read-long sequence of the current iteration by a predetermined step size for a next iteration for the fragment unit of the read-long sequence; and responsive to determining that the number of paths is less than or equal to the predetermined number of paths threshold, connecting nodes of all paths to output a haplotype sequence.
In some embodiments, the output haplotype sequence comprises: connecting nodes of all paths, and outputting a semi-assembly haplotype sequence corresponding to a semi-assembly path, wherein the semi-assembly path is an assembly path which is only started or stopped as a reference node; and aligning the semi-assembled haplotype sequence to a local reference sequence so as to add the sequence of the local reference sequence at the end of the semi-assembled haplotype sequence for outputting the haplotype sequence.
In some embodiments, calculating the genotype probabilities for single nucleotide variations and insertion deletions to determine the genotype type comprises: calculating genotype probabilities for single nucleotide variations and insertion deletions based on the probabilities of read length mapping to alleles using a bayesian algorithm; calculating a posterior probability of the genotype based on the calculated genotype probabilities for the single nucleotide variations and the insertion deletions; and comparing the calculated posterior probabilities of the genotypes to determine the genotype based on the genotype corresponding to the maximum value of the posterior probabilities of the genotypes.
In some embodiments, generating the variant feature value comprises: calculating mutation type data, genotype type data, depth data, alignment quality data, base quality data, the genotype type data being determined based on the determined genotype type; and generating a variation characteristic value based on the calculated variation type data, genotype type data, depth data, alignment quality data, and base quality data.
In some embodiments, generating the variant feature value comprises: generating, for the single nucleotide variation and the indel, a "true" or "false" flag regarding the single nucleotide variation and the indel based on the variation feature value via a single nucleotide variation-based prediction model and an indel-based prediction model, respectively, the single nucleotide variation prediction model and the indel prediction model being constructed based on a random forest model; and filtering out single nucleotide variations and indels marked as "false" in order to generate predictive results for single nucleotide variations and indels.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the invention, nor is it intended to be used to limit the scope of the invention.
Drawings
FIG. 1 shows a schematic diagram of a system for performing a method for detecting single nucleotide variations and indels, according to an embodiment of the invention.
FIG. 2 shows a flow diagram of a method for detecting single nucleotide variations and indels according to an embodiment of the invention.
Fig. 3 shows a flow chart of a method for obtaining read-long fragments meeting a predetermined quality condition according to an embodiment of the invention.
Fig. 4 shows a flow diagram of a method for assembling for local reference genomic sequences and read length segments, according to an embodiment of the invention.
FIG. 5 shows a flow diagram of a method for determining genotype type according to an embodiment of the invention.
FIG. 6 shows a schematic diagram comparing the method according to an embodiment of the invention with a conventional detection method in terms of accuracy.
FIG. 7 schematically shows a block diagram of an electronic device suitable for use to implement an embodiment of the invention.
Like or corresponding reference characters designate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.
As described above, the conventional scheme for detecting single nucleotide variations and indels has disadvantages in that: it is difficult to obtain high accuracy detection results when sequencing coverage is not uniform and sequencing errors occur.
To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present invention propose a scheme for detecting single nucleotide variations and indels. In the scheme, a haplotype sequence is generated by combining genome loci with continuous activation probability values larger than a predetermined probability threshold into activation regions and obtaining read-length fragments meeting a predetermined quality condition in the prolonged activation regions so as to assemble the local reference genome sequence and the read-length fragments; the invention can process low-quality reads, further reduce assembly failure caused by excessive path number in DBG (direct bus group) caused by sequencing error, and also reduce the problem that reads nodes cannot be merged into reference nodes caused by sequencing error. In addition, by repairing the half-assembled haplotype sequence in the assembled sequence to obtain the full-assembled haplotype sequence, the invention reduces haplotype assembly failure caused by sequencing errors. Furthermore, single nucleotide variations and indels are identified by alignment to local reference genomic sequences based on fully assembled haplotype sequences; and generating a variation characteristic value, and generating a prediction result about the single nucleotide variation and the insertion deletion through a prediction model, wherein the overall accuracy (F1 value) and sensitivity of the detection result can be obviously improved. Therefore, the invention can remarkably improve the overall accuracy and sensitivity of the detection result even when the sequencing coverage is not uniform and the sequencing is wrong.
FIG. 1 shows a schematic diagram of a system 100 for a method of detecting single nucleotide variations and indels, according to an embodiment of the invention. As shown in fig. 1, the system 100 includes: computing equipment 110, sequencing equipment 130, network 140. In some embodiments, the system 100 further comprises a server 150, and the computing device 110, the sequencing device 130, and the server 150 interact data via the network 140.
A sequencing apparatus 130, for example, for sequencing a sample to be tested of a subject to be tested, so as to generate sequencing data on the sample to be tested; and sending the generated sequencing sequence data of the sample to be tested and the reference genomic sequence data to the computing device 110. In some embodiments, the server 150 sends the sequencing sequence data and the reference genomic sequence data of the test sample to the computing device 110.
With respect to the computing device 110, for example, for calculating activation probabilities for reference genomic sites based on aligned sequence stacks for the reference genomic sites, so as to merge genomic sites having consecutive activation probabilities greater than a predetermined probability threshold into activation regions; and obtaining read-length fragments of the test sample within the extended activation region that meet a predetermined quality condition for assembly against the local reference genomic sequence and the read-length fragments to generate fully-assembled haplotype sequences and semi-assembled haplotype sequences. The computing device 110 may also be used to align the semi-assembled haplotype sequences to a local reference genomic sequence for repair of the semi-assembled haplotype sequences to obtain fully-assembled haplotype sequences; aligning the haplotype sequences to a local reference genomic sequence to identify single nucleotide variations and indels based on the alignment results; calculating genotype probabilities for single nucleotide variations and insertion deletions in order to determine genotype types for generating variation characteristic values; and generating a prediction result about the single nucleotide variation and the insertion deletion based on the variation characteristic value through a prediction model constructed based on a random forest model.
In some embodiments, computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, and ASICs, as well as general purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device. Computing device 110 includes, for example: activation region merging section 112, haplotype sequence generation section 114, single nucleotide mutation/indel recognition section 116, mutation characteristic value generation section 118, and prediction result generation section 120. The activation region merging unit 112, the haplotype sequence generating unit 114, the single nucleotide mutation and indel identifying unit 116, the mutation characteristic value generating unit 118, and the prediction result generating unit 120 may be disposed on one or more computing devices 110.
With respect to the activation region merging unit 112, which is used for stacking aligned sequences of the reference genomic sites, the activation probabilities of the reference genomic sites are calculated so as to merge genomic sites having consecutive activation probabilities greater than a predetermined probability threshold as activation regions.
Regarding the haplotype sequence generation unit 114, it is used to obtain the read length fragments of the sample to be tested which meet the predetermined quality condition in the prolonged activation region, so as to assemble for the local reference genome sequence and the read length fragments, so as to generate the haplotype sequence, and the generated haplotype sequence includes the haplotype sequence repaired by the half-assembled haplotype sequence.
Regarding the single nucleotide variation and indel recognition unit 116, it is used to align the haplotype sequences to local reference genomic sequences in order to recognize single nucleotide variations and indels based on the alignment results.
And a variation characteristic value generation unit 118 for calculating genotype probabilities for single nucleotide variations and insertion deletions to determine genotype types for generating variation characteristic values.
And a prediction result generation unit 120 for generating a prediction result regarding the single nucleotide mutation and the indels based on the mutation feature value via a prediction model constructed based on a random forest model.
Methods for detecting single nucleotide variations and indels according to embodiments of the present invention will be described below with reference to fig. 2 and 6. FIG. 2 shows a flow diagram of a method 200 for detecting single nucleotide variations and indels, according to an embodiment of the invention. It should be understood that the method 200 may be performed, for example, at the electronic device 700 depicted in fig. 7. May also be executed at the computing device 110 depicted in fig. 1. It should be understood that method 200 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.
At step 202, the computing device 110 calculates activation probabilities for the reference genomic sites based on the aligned sequence stacks of reference genomic sites so as to merge into activation regions genomic sites having consecutive activation probabilities greater than a predetermined probability threshold.
For example, the computing device 110 calculates an activation probability of a reference genomic site using a bayesian approach based on a stack of aligned sequences (pileup) of the reference genomic site; activation probabilities are then convolved using a gaussian kernel to merge genomic loci with consecutive activation probabilities greater than a predetermined probability threshold into activation regions.
At step 204, the computing device 110 obtains read-length segments of the sample under test within the extended activation region that meet a predetermined quality condition for assembly against the local reference genomic sequence and the read-length segments for generating a haplotype sequence, the generated haplotype sequence comprising a haplotype sequence repaired via the semi-assembled haplotype sequence.
Regarding the method for generating a haplotype sequence, it includes, for example: the computing device 110 extends each activation region to both sides by a predetermined number of bases to obtain a corresponding extension interval; preprocessing the read length in the extension interval so as to obtain read length fragments meeting the preset quality condition; performing local assembly on the local reference genome sequence and the read length segment, wherein the assembled sequence comprises a completely assembled haplotype sequence and a half-assembled haplotype sequence; and identifying the semi-assembled haplotype sequences for repair against the semi-assembled haplotype sequences for generating fully-assembled haplotype sequences. For example, the computing device 110 pre-processes reads for head-to-tail pruning and slicing of reads for obtaining read segments that meet a predetermined quality condition (e.g., high quality); then, using De Bruijn map (or "De Bruijn Graph", i.e., DBG) algorithm, local reference genomes and read fragments that meet predetermined quality conditions (e.g., high quality) are assembled, the assembled sequences including fully assembled haplotype sequences and semi-assembled haplotype sequences. Thereafter, the computing device 110 identifies the semi-assembled haplotype sequences for repair against the semi-assembled haplotype sequences for generating fully-assembled haplotype sequences.
It should be understood that a path starting with a reference node and ending with a reference node is referred to as a fully assembled path and the corresponding haplotype sequence is referred to as a fully assembled haplotype sequence. The start-only or stop-only is referred to as the semi-assembled path for the reference node and the corresponding haplotype sequence is referred to as the semi-assembled haplotype sequence.
Due to sequencing errors and low local coverage, it is difficult to obtain good results by full assembly. By identifying and repairing the semi-assembled haplotypes in the sequence assembled via within the activation region, the present invention can reduce the problem of low sensitivity due to unsuccessful assembly caused by sequencing errors.
With regard to the method for obtaining read-length fragments meeting the predetermined quality condition, it includes, for example: the computing device 110 calculates a root mean square of the base masses of the read-long soft-clip sequences over the extended interval; determining whether the calculated root mean square is less than a predetermined threshold; removing the soft-clip sequence of read lengths in response to determining that the calculated root mean square is less than a predetermined threshold; scanning the read length within the extension interval to determine whether the base of the read length does not match the reference genome and whether the base quality is less than a predetermined base quality threshold; responsive to determining that the base of the read length does not match the reference genome and that the base quality is less than a predetermined base quality threshold, performing segmentation for the read length; and filtering the read lengths subjected to segmentation so as to determine the read length fragments with the length larger than or equal to a preset length threshold value as the read length fragments meeting a preset quality condition.
By adopting the method, assembly failure caused by low-quality reads can be reduced.
The method 300 for obtaining the read-long fragments meeting the predetermined quality condition will be described in detail below with reference to fig. 3, and will not be described herein again.
At step 206, the computing device 110 aligns the haplotype sequences to local reference genomic sequences in order to identify single nucleotide variations and indels based on the alignment results. For example, computing device 110 aligns the fully assembled haplotype sequences to a local reference sequence using the Smith-waterman algorithm to identify SNP and INDEL variations from the alignment results.
At step 208, the computing device 110 calculates genotype probabilities for single nucleotide variations and insertion deletions in order to determine genotype types for generating variation feature values.
As for the method of determining the genotype type, it includes, for example: the computing device 110 calculates genotype probabilities for single nucleotide variations and insertion deletions based on the probabilities of read length mapping to alleles using a bayesian algorithm; calculating a posterior probability of the genotype based on the calculated genotype probabilities for the single nucleotide variations and the insertion deletions; and comparing the calculated posterior probabilities of the genotypes to determine the genotype based on the genotype corresponding to the maximum value of the posterior probabilities of the genotypes. The method 500 for determining the genotype type of a mutation will be described in detail below with reference to fig. 5, and will not be described herein again.
Regarding the method of generating the variation feature value, it includes, for example: calculating mutation type data, genotype type data, depth data, alignment quality data, base quality data, the genotype type data being determined based on the determined genotype type; and generating a variation characteristic value based on the calculated variation type data, genotype type data, depth data, alignment quality data, and base quality data. In some embodiments, the variant feature values further comprise: variation quality data, allele read length position data.
Regarding the variant type data, the data format is, for example, an integer. Regarding a method of generating mutation-type data, it includes, for example: if the variation is determined to be SNP, the variation type data is determined to be 0; if the mutation is determined to be INDEL, the mutation type data is determined to be "1". If the mutation is determined to be otherwise, the mutation type data is determined to be "2".
As for genotype type data, the data format thereof is, for example, an integer. With regard to the method of generating genotype type data, it includes, for example: if the genotype type is determined to be hom-ref (0/0), the genotype type data is determined to be "0"; if the genotype type is determined to be het (0/1, 0/2 …), then the genotype type data is determined to be "1"; if the genotype type is determined to be hom-alt (1/1,2/2 …); genotype type data was determined to be "2"; if the genotype type is determined to be het-alt (1/2, 2/3 …), the genotype type data is determined to be "3".
Regarding the comparison quality data, the data format is, for example, a two-dimensional array. With respect to methods of generating alignment quality data, these include, for example: the computing device 110 calculates a read length alignment quality root Mean square (rms) for allele 1 and allele 2 of the SNP/INDEL to obtain alignment quality data. The method of calculating the root mean square is described below with reference to formula (1).
Figure DEST_PATH_IMAGE001
In the above-mentioned formula (1),
Figure DEST_PATH_IMAGE002
represents the comparison quality root mean square.
Figure DEST_PATH_IMAGE003
Represents the allele read length alignment quality.
In some embodiments, the alignment quality data further comprises: the Mantoux (Mann-Whitney) U of alternative and reference allele reads aligned to quality checks the Z value, both in data format as floating point numbers, for example. The mann-whitney U test, also known as the mann-whitney rank sum test, can be regarded as a T test which is a parametric test for the difference between two means. The method of calculating Mann-Whitney U test Z values for the alignment quality of candidate and reference allele reads for SNP/INDEL is described below in connection with equations (2) through (6).
Figure DEST_PATH_IMAGE004
Figure DEST_PATH_IMAGE005
Figure DEST_PATH_IMAGE006
Figure DEST_PATH_IMAGE007
Figure DEST_PATH_IMAGE008
In equations (2) to (6) above, Z represents the Uest Z value of the alignment quality of the candidate allele and the reference allele reads.
Figure DEST_PATH_IMAGE009
Represents the mean of the Mann-Whitney U test.
Figure DEST_PATH_IMAGE010
Represents the variance of the U-test.
Figure DEST_PATH_IMAGE011
And
Figure DEST_PATH_IMAGE012
representing the respective Rank sums (Rank Sum) of sample 1 and sample 2 data, respectively.
Figure DEST_PATH_IMAGE013
And
Figure DEST_PATH_IMAGE014
the calculation is performed by equations (5) and (6), respectively.
Figure DEST_PATH_IMAGE015
Number of terms representing sample 1 observations.
Figure DEST_PATH_IMAGE016
Number of terms representing sample 2 observations.
As for the depth data, the data format thereof is, for example, a floating point number. Depth data was calculated based on the values for the mutation sites of the SNPs/INDELs relative to the whole genome depth. For example, if we and other capture sequencing is employed, the depth data is set to "1".
Regarding the base quality data, the data format thereof is, for example, a floating point number. The base quality data includes a base quality RMS value for the variation site for the SNP, a Mann-Whitney U test Z value for the base quality at the variation site for the candidate allele and the reference allele reads for the SNP. The Mann-Whitney U test Z value for base quality is set to "0" if the reference allele is not present.
In some embodiments, the variant feature values further comprise: quality data of variation. The data format is, for example, a floating point number. The mutation type data is a mutation quality value calculated by a genotype probability calculation method.
In some embodiments, the variant feature values further comprise: sign ratio data. The format of the positive-negative ratio data is a two-dimensional matrix. For example, the sign ratio data is determined based on the sign Log10 Odds ratio for reads for SNP/INDEL allele 1 and allele 2.
In some embodiments, the variant feature values further comprise allele read length location data. The data format of the allele read length position data is, for example, a floating point number. The allele read length position data further comprises: (iii) a value for the RMS position of the reads for the candidate allele, the Mann-Whitney U-test Z for the read positions of the candidate allele and the reference allele for the SNP/INDEL. If the reference allele is not present, the Mann-Whitney U test Z value for the reference allele reads position is set to, for example, "0".
At step 210, the computing device 110 generates predictions about single nucleotide variations and indels based on the variation feature values via a prediction model constructed based on a random forest model.
As for the prediction model, it includes, for example: the method comprises an SNP prediction model constructed based on a random forest model and an INDEL prediction model constructed based on the random forest model. The predictive model is trained via multiple samples.
As for the training method, it includes, for example: first, the method 200 detects SNP and INDEL variation of the GIAB standard sample, and calculates variation characteristics with respect to SNP and INDEL. The SNP, INDEL variation detected by the method 200 is then marked as "true" or "false" using the SNP, INDEL detection results of the GIAB standard sample as a reference. Then, fitting (fit) the SNP prediction model constructed based on the random forest model and the INDEL prediction model constructed based on the random forest model for the SNP and the INDEL mutation added with the label respectively, and outputting the trained SNP prediction model
Figure DEST_PATH_IMAGE017
And INDEL predictive model
Figure DEST_PATH_IMAGE018
Methods for generating predictors for single nucleotide variations and indels include, for example: for example, the computing device 110 generates, based on the variation feature values, for single nucleotide variations and indels, tags regarding "true" or "false" of single nucleotide variations and indels via a single nucleotide variation-based prediction model and an indel prediction model, respectively, the single nucleotide variation prediction model and the indel prediction model being constructed based on a random forest model; and filtering out single nucleotide variations and indels that are marked as "false" in order to generate predictive results for the single nucleotide variations and indels.
For example, the invention uses the targeted sequencing data of 8 GIAB standard samples NA12878, and the final prediction results are over 99% in terms of accuracy, precision, sensitivity (i.e., recall), and F1 values. The following table shows the comparison of the method of the present invention with other conventional detection methods in terms of accuracy, sensitivity and F1 values. FIG. 6 shows a schematic diagram of a method according to an embodiment of the present invention compared to a conventional detection method in terms of accuracy. As can be seen from the data in table one and fig. 6, the method 200 of the present invention significantly improves the overall accuracy (F1 value) and (recall rate) of the detection results, and at the same time, has good accuracy.
Figure DEST_PATH_IMAGE019
In the foregoing solutions, to at least partially address one or more of the above problems, as well as other potential problems, exemplary embodiments of the present invention propose a solution for detecting single nucleotide variations and indels. In the scheme, a haplotype sequence is generated by combining genome loci with continuous activation probability values larger than a predetermined probability threshold into activation regions and obtaining read-length fragments meeting a predetermined quality condition in the prolonged activation regions so as to assemble the local reference genome sequence and the read-length fragments; the invention can process low-quality reads, further reduce assembly failure caused by excessive path number in DBG (direct bus group) caused by sequencing error, and also reduce the problem that reads nodes cannot be merged into reference nodes caused by sequencing error. In addition, by repairing the half-assembled haplotype sequence in the assembled sequence to obtain the full-assembled haplotype sequence, the invention reduces haplotype assembly failure caused by sequencing errors. Furthermore, single nucleotide variations and indels are identified by alignment results based on fully assembled haplotype sequence alignment to local reference genomic sequences; and generating a variation characteristic value, and generating a prediction result about the single nucleotide variation and the insertion deletion through a prediction model, wherein the overall accuracy (F1 value) and sensitivity of the detection result can be obviously improved. Thus, the present invention can significantly improve the overall accuracy and sensitivity of the assay results even when sequencing coverage is not uniform and sequencing errors occur.
A method 300 for obtaining read-long fragments meeting predetermined quality conditions according to an embodiment of the present invention will be described below in conjunction with fig. 3. Fig. 3 shows a flow diagram of a method 300 for obtaining a read long fragment meeting a predetermined quality condition according to an embodiment of the invention. It should be understood that the method 300 may be performed, for example, at the electronic device 700 depicted in fig. 7. May also be executed at the computing device 110 depicted in fig. 1. It should be understood that method 300 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.
At step 302, the computing device 110 calculates a root mean square of the base masses of the soft-clip sequences of read lengths within the extended interval. For example, the computing device 110 calculates Root Mean Square (Root Mean Square) of base quality of soft-cut (soft-clip, which is located at the head and tail ends of the read) sequences of the read in corresponding extension sections obtained after each activation region is extended to both sides by 100bp respectively
Figure DEST_PATH_IMAGE020
A method for calculating the root mean square of the base masses of the soft-cleaved sequences at the read length in the extended region is described below in conjunction with equation (7).
Figure DEST_PATH_IMAGE021
In the above-mentioned formula (7),
Figure 801220DEST_PATH_IMAGE020
the root mean square of the base masses of the soft-sheared sequences representing read lengths within the extended interval.
Figure DEST_PATH_IMAGE022
Represents the first
Figure DEST_PATH_IMAGE023
Base quality of individual reads of long soft-sheared sequences.
Figure DEST_PATH_IMAGE024
Representing the read length number.
At step 304, the computing device 110 determines whether the calculated root mean square is less than a predetermined threshold. As to the predetermined threshold, it is, for example, without limitation, 29.3. If it is determined that the calculated root mean square is greater than or equal to the predetermined threshold, then a jump is made to step 314 to retain the read-long soft-clip sequence.
At step 306, if the computing device 110 determines that the calculated root mean square is less than the predetermined threshold, the read-long soft-clip sequence is removed. For example, if the computing device 110 determines the root mean square of the base masses of soft-sheared sequences of reads within an extended interval
Figure 498787DEST_PATH_IMAGE020
Less than the predetermined threshold 29.3, the soft clip sequence of the read is removed.
At step 308, the computing device 110 scans the read length within the extended interval to determine whether the bases of the read length do not match the reference genome and whether the base quality is less than a predetermined base quality threshold. With respect to the predetermined base quality threshold, it is, for example and without limitation, 15.
At step 310, if the computing device 110 determines that the base of the read length does not match the reference genome and that the base quality is less than a predetermined base quality threshold, a cut is made for the read length. For example, if the computing device 110 determines that the bases of a read length do not match the reference genome and the base quality is less than 15, then a cut is made for that read length.
At step 312, the computing device 110 filters the read lengths via the slicing to determine read length segments having a length greater than or equal to a predetermined length threshold as read length segments meeting a predetermined quality condition. As for the predetermined length threshold, it is, for example, without limitation, 10 bp. For example, computing device 110 filters the sliced read and only outputs fragments greater than 10bp in length. This is mainly because: if the read length segment is too short, it is not suitable for assembly.
By adopting the means, the invention can remarkably reduce the assembly failure caused by excessive path number in the DBG due to sequencing error by performing head and tail trimming and cutting on low-quality reads, and also can reduce the problem that reads nodes cannot be merged into the reference node due to the sequencing error.
A method for local assembly for local reference genomic sequences and read length segments according to an embodiment of the invention will be described below in connection with fig. 4. FIG. 4 shows a flow diagram of a method 400 for assembling for local reference genomic sequences and read-length fragments, according to an embodiment of the invention. It should be understood that method 400 may be performed, for example, at electronic device 700 depicted in fig. 7. May also be executed at the computing device 110 depicted in fig. 1. It is to be understood that method 400 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.
At step 402, computing device 110 performs the current round of iterations for segment units (i.e., kmer) of the local reference sequence having a predetermined value (predetermined value k) in order of smaller to larger length, at predetermined steps. For example, computing device 110 iterates at a predetermined step size s (predetermined step size of, for example, 1) in order from a minimum length kmer, i.e., kmin (kmin of, for example, 10), to a maximum length kmer, i.e., kmax (kmax of, for example, 101), e.g., such that kmer = kmin for the current round of iteration.
At step 404, computing device 110 inserts kmer of the local reference sequence for the current round of iteration into de Bruijn diagram (de Bruijn Graph, DBG). It should be understood that the De Bruijn Graph (DBG) assembly method has relatively low memory consumption, fast operation speed and high accuracy. The node in the DBG derived from the reference gene sequence is called a reference node.
At step 406, the computing device 110 determines whether a loop exists in the DBG. For example, the computing device 110 determines whether there is a loop in the DBG using a depth first search algorithm. In the assembly process, a loop is not allowed to exist in the DBG, and if the loop exists, the number of times of repetition on the loop cannot be analyzed.
Regarding a Depth-First Search (DFS), which is a graph Search algorithm, it should be understood that the DFS constructs a tree structure for a connected graph, and if a Back Edge occurs during the process of constructing the tree structure, a loop exists in the graph. For the non-connected graph, tree structures can be constructed based on DFS for different parts in the graph, and whether reverse edges exist or not can be detected for each tree structure. The method for detecting the directional edge is, for example: during traversal of the graph based on DFS, the traversed vertex is put into the recursion stack, and if the newly traversed vertex already exists in the recursion stack, the existence of a reverse edge is indicated, namely a loop exists.
At step 408, if computing device 110 determines that a loop does not exist in the DBG, kmer reading a long sequence is inserted into the DBG. The read-length sequence herein is, for example, the read-length fragment mentioned above that meets the predetermined quality condition. If it is determined that a loop exists in the DBG, the method jumps to step 412, and the kmer of the local reference sequence of the current iteration is increased by a predetermined step size, so as to perform the next iteration for the kmer of the local reference sequence.
At step 410, the computing device 110 determines whether a loop exists for the inserted DBG. Among them, a node derived from a read-long sequence in the DBG is referred to as a read-long node.
At step 414, if computing device 110 determines that a loop exists in the inserted DBG, the kmer of the read-long sequence of the current round of iteration is increased by a predetermined step size (e.g., s, such that kmer = kmer + s) for the next round of iteration for the kmer of the read-long sequence.
At step 416, if the computing device 110 determines that no loops exist in the inserted DBG, edges in the DBG having a depth below a predetermined depth threshold and paths in the DBG that do not start with a reference node and do not end with a reference node are removed. The reference node is a node in the DBG derived from the reference sequence.
At step 418, the computing device 110 looks up all paths in the DBG to determine if the number of paths is greater than a predetermined number of paths threshold. The number of paths being, for example
Figure DEST_PATH_IMAGE025
. The predetermined threshold number of paths is, for example, the maximum number of paths
Figure DEST_PATH_IMAGE026
. For example, the computing device 110 determines
Figure 153890DEST_PATH_IMAGE025
Whether or not greater than
Figure 80258DEST_PATH_IMAGE026
At step 420, if the computing device 110 determines that the number of paths is less than or equal to the predetermined number of paths threshold, the nodes of all paths are connected to output a haplotype sequence. If computing device 110 determines that the number of paths is greater than the predetermined number of paths threshold, it jumps to step 414 to increase the kmer of the read-long sequence for the current iteration by a predetermined step size for the next iteration for the kmer of the read-long sequence.
A method of outputting a fully-assembled haplotype sequence for a semi-assembled haplotype sequence corresponding to a semi-assembled path, comprising, for example: connecting nodes of all paths, and outputting a semi-assembly haplotype sequence corresponding to a semi-assembly path, wherein the semi-assembly path is an assembly path which is only started or stopped as a reference node; and aligning the semi-assembled haplotype sequence to a local reference sequence so as to add the sequence of the local reference sequence at the end of the semi-assembled haplotype sequence for outputting the haplotype sequence. By aligning the half-assembled haplotype sequence to a local reference sequence, it can be inferred what sequence the end of the half-assembled haplotype sequence that does not start or stop at the reference node may end with, and therefore, the predicted sequence can be added to repair the half-assembled haplotype sequence to a fully-assembled haplotype sequence.
For example, computing device 110 aligns the semi-assembled haplotype sequences to a local reference sequence using the Smith-Waterman algorithm (Smith-Waterman algorithm) algorithm, and then assembles the sequences of the local reference sequence at the ends of the semi-assembled haplotype sequences, the assembled sequences being haplotype sequences. It will be appreciated that by using the smith-waterman algorithm for local sequence alignment, fragments of a haplotype sequence that have high similarity to a local reference sequence can be found.
In some embodiments, a method for performing a next round of iteration with respect to a kmer of a local reference sequence (or a kmer of a read-long sequence), further comprises: determining whether the kmer of the next iteration obtained after the kmer of the current iteration is increased by a preset step length is smaller than or equal to a preset kmer threshold value; if the kmer of the next iteration is determined to be less than or equal to the preset kmer threshold value, inserting the kmer of the local reference sequence of the next iteration into the DBG so as to continuously determine whether a loop exists in the DBG; if it is determined that the kmer for the next iteration is greater than the predetermined kmer threshold, the method 400 ends.
By adopting the above measures, the computing device 110 of the present invention can identify and repair the half-assembled haplotype in the assembled sequence, and further obtain the fully-assembled haplotype sequence, thereby reducing haplotype assembly failure caused by sequencing errors.
A method for determining a genotype type according to an embodiment of the present invention will be described below with reference to fig. 5. FIG. 5 shows a flow diagram of a method 500 for determining genotype types, according to an embodiment of the invention. It should be understood that method 500 may be performed, for example, at electronic device 700 depicted in fig. 7. May also be executed at the computing device 110 depicted in fig. 1. It is to be understood that method 500 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.
At step 502, the computing device 110 calculates genotype probabilities for single nucleotide variations and insertion deletions based on the probabilities of read length mapping to alleles using a bayesian algorithm. The probability of read mapping to alleles is based on read by a double hidden Markov model
Figure DEST_PATH_IMAGE027
Comparing to haplotype sequence
Figure DEST_PATH_IMAGE028
Is calculated from the probability of (a). The computing device 110 calculates the genotype probabilities using a bayesian algorithm. Will be described below in connection with formula (8)Ming calculation of genotype
Figure DEST_PATH_IMAGE029
Probability of (2)
Figure DEST_PATH_IMAGE030
The method of (1).
Figure DEST_PATH_IMAGE031
In the above-mentioned formula (8),
Figure DEST_PATH_IMAGE032
represents the genotype.
Figure DEST_PATH_IMAGE033
And
Figure DEST_PATH_IMAGE034
represents an allele.
Figure DEST_PATH_IMAGE035
Represents the ith read length.
Figure DEST_PATH_IMAGE036
And
Figure DEST_PATH_IMAGE037
representative allele
Figure 852821DEST_PATH_IMAGE033
And
Figure 81808DEST_PATH_IMAGE034
the probability of (c).
Figure 880000DEST_PATH_IMAGE036
And
Figure 751004DEST_PATH_IMAGE037
can be calculated according to equation (9) below.
Figure DEST_PATH_IMAGE038
Representative genotype
Figure 357566DEST_PATH_IMAGE032
The probability of (c).
The method of calculating the probability of mapping read lengths to alleles is described below in connection with equation (9).
Figure DEST_PATH_IMAGE039
In the above-mentioned formula (9),
Figure DEST_PATH_IMAGE040
representing a haplotype sequence. Hypothetical haplotypes
Figure 724831DEST_PATH_IMAGE040
All support alleles
Figure DEST_PATH_IMAGE041
Representing the probability of mapping read length to allele.
Figure DEST_PATH_IMAGE042
Representative read
Figure 354526DEST_PATH_IMAGE035
Comparing to haplotype sequence
Figure DEST_PATH_IMAGE043
The probability of (c).
Figure 58915DEST_PATH_IMAGE035
Represents the ith read length.
Relating to reads
Figure 988825DEST_PATH_IMAGE035
Comparing to haplotype sequence
Figure 949828DEST_PATH_IMAGE043
Example of the method of calculating the probability of (1)Such as including: computing device 110 calculates the probability of reads aligning to haplotype sequences using the forward algorithm via PairHMM
Figure DEST_PATH_IMAGE044
. Wherein the input of PairHMM is the read length
Figure 535661DEST_PATH_IMAGE035
Haplotype sequence
Figure 341943DEST_PATH_IMAGE043
Read length of each base
Figure DEST_PATH_IMAGE045
Probability of vacancy opening per base on read length (i.e., Gap open probability)
Figure DEST_PATH_IMAGE046
Probability of extension of Gap for each base on read length (i.e., Gap extension probability)
Figure DEST_PATH_IMAGE047
At step 504, the computing device 110 calculates a posterior probability of the genotype based on the calculated genotype probabilities for the single nucleotide variations and insertion deletions.
The following describes the genotype with reference to the formula (10)
Figure DEST_PATH_IMAGE048
A posteriori probability of
Figure DEST_PATH_IMAGE049
The method of (3).
Figure DEST_PATH_IMAGE050
In the above-mentioned formula (10),
Figure DEST_PATH_IMAGE051
representative genotype
Figure 77424DEST_PATH_IMAGE032
A priori probability of. For example, the prior probability of a heterozygous SNP is
Figure DEST_PATH_IMAGE052
The prior probability of heterozygous INDEL is
Figure DEST_PATH_IMAGE053
The prior probability of a homozygous Reference is 1.
Figure DEST_PATH_IMAGE054
Representative genotype
Figure 458596DEST_PATH_IMAGE032
The posterior probability of (d). The prior probability of a homozygous SNP or INDEL can be calculated based on hardweinberg equilibrium or heterozygous homozygous ratio (default of 2).
At step 506, the computing device 110 compares the posterior probabilities of the calculated genotypes to determine the genotype type based on the genotype to which the maximum of the posterior probabilities of the genotypes corresponds.
The calculation method for determining the genotype type of the mutation is described below with reference to formula (11).
Figure DEST_PATH_IMAGE055
In the above-mentioned formula (11),
Figure DEST_PATH_IMAGE056
representing the genotype type of the variation.
Figure 921938DEST_PATH_IMAGE054
Representative genotype
Figure 672857DEST_PATH_IMAGE032
The posterior probability of (a).
By adopting the above means, the present invention can accurately determine the genotype type of the mutation.
Fig. 7 schematically shows a block diagram of an electronic device 700 suitable for use in implementing an embodiment of the invention. The electronic device 700 may be for implementing the method 200 to 500 shown in fig. 2 to 5. As shown in fig. 7, electronic device 700 includes a central processing unit (i.e., CPU 701) that can perform various appropriate actions and processes in accordance with computer program instructions stored in a read-only memory (i.e., ROM 702) or loaded from storage unit 708 into a random access memory (i.e., RAM 703). In the RAM 703, various programs and data required for the operation of the electronic device 700 can also be stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output interface (i.e., I/O interface 705) is also connected to bus 704.
A plurality of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and the CPU 701 executes the respective methods and processes described above, for example, executes the methods 200 to 500. For example, in some embodiments, methods 200-500 may be implemented as a computer software program stored on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into RAM 703 and executed by CPU 701, one or more operations of methods 200 through 500 described above may be performed. Alternatively, in other embodiments, CPU 701 may be configured in any other suitable manner (e.g., by way of firmware) to perform one or more acts of methods 200-500.
It should be further appreciated that the present invention may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therein for carrying out aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above description is only an alternative embodiment of the present invention and is not intended to limit the present invention, and various modifications and variations of the present invention are possible to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method for detecting single nucleotide variations and indels, comprising:
calculating activation probabilities for the reference genomic loci based on the aligned sequence stacks for the reference genomic loci, such that genomic loci with consecutive activation probabilities greater than a predetermined probability threshold are merged into an activation region; and
obtaining read-length fragments of the test sample within the extended activation region that meet a predetermined quality condition for assembly against the local reference genomic sequence and the read-length fragments for generating haplotype sequences, the generated haplotype sequences comprising haplotype sequences repaired via semi-assembled haplotype sequences;
aligning the haplotype sequences to a local reference genomic sequence to identify single nucleotide variations and indels based on the alignment results;
calculating genotype probabilities for single nucleotide variations and insertion deletions in order to determine genotype types for generating variation characteristic values; and
based on the variation characteristic values, prediction results about single nucleotide variation and insertion deletion are generated through a prediction model constructed based on a random forest model.
2. The method of claim 1, wherein assembling based on the assembly for the local reference genomic sequence and the read length segments to generate a haplotype sequence comprises:
extending each activation region to two sides by a preset number of basic groups respectively so as to obtain corresponding extension intervals;
preprocessing the read length in the extension interval so as to obtain read length fragments meeting the preset quality condition;
assembling the local reference genome sequence and the read length segment, wherein the assembled sequences comprise a fully assembled haplotype sequence and a semi-assembled haplotype sequence; and
identifying the semi-assembled haplotype sequences for repair against the semi-assembled haplotype sequences for generating fully-assembled haplotype sequences.
3. The method of claim 2, wherein preprocessing the read lengths within the extended interval to obtain read length fragments meeting a predetermined quality condition comprises:
calculating the root mean square of the base masses of the read-long soft-sheared sequences within the extended interval;
determining whether the calculated root mean square is less than a predetermined threshold;
removing the soft-clip sequence of read lengths in response to determining that the calculated root mean square is less than a predetermined threshold;
scanning the read length within the extension interval to determine whether the base of the read length does not match the reference genome and whether the base quality is less than a predetermined base quality threshold;
responsive to determining that the base of the read length does not match the reference genome and that the base quality is less than a predetermined base quality threshold, performing segmentation for the read length; and
the read lengths subjected to the slicing are filtered, so that the read length fragments with the length larger than or equal to a preset length threshold value are determined as the read length fragments meeting a preset quality condition.
4. The method of claim 2, wherein assembling for the local reference genomic sequence and the read length fragment comprises:
according to the sequence of the lengths from small to large, carrying out iteration of the current round aiming at the segment unit of the local reference sequence with the length being a preset value by a preset step length;
inserting fragment units of the local reference sequence of the current iteration into the de Bruijn graph;
determining whether a loop exists in the de Bruijn graph;
in response to determining that no loops exist in the de Bruijn graph, inserting fragment units that read long sequences into the de Bruijn graph to determine whether loops exist in the inserted de Bruijn graph;
in response to determining that a loop exists in the de Bruijn graph, the segment units of the local reference sequence for the current iteration are increased by a predetermined step size for the next iteration against the segment units of the local reference sequence.
5. The method of claim 4, wherein assembling for the local reference genomic sequence and the read fragment further comprises:
in response to determining that a loop exists in the inserted de Bruijn graph, increasing the fragment unit of the read-length sequence of the current iteration by a predetermined step size so as to perform the next iteration for the fragment unit of the read-length sequence;
in response to determining that no loops exist in the de Bruijn graph after the insertion, removing edges in the de Bruijn graph having a depth below a predetermined depth threshold and paths in the de Bruijn graph that do not start with the reference node and do not end with the reference node;
searching all paths in the de Bruijn graph to determine whether the number of paths is greater than a predetermined number of paths threshold;
in response to determining that the number of paths is greater than the predetermined path number threshold, increasing the fragment unit of the read-long sequence of the current iteration by a predetermined step size for a next iteration for the fragment unit of the read-long sequence; and
in response to determining that the number of paths is less than or equal to the predetermined number of paths threshold, nodes of all paths are connected to output a haplotype sequence.
6. The method of claim 5, wherein outputting the haplotype sequence comprises:
connecting nodes of all paths, and outputting a semi-assembly haplotype sequence corresponding to a semi-assembly path, wherein the semi-assembly path is an assembly path which is only started or stopped as a reference node; and
the semi-assembled haplotype sequence is aligned to a local reference sequence such that the sequence of the local reference sequence is added at the end of the semi-assembled haplotype sequence to output the haplotype sequence.
7. The method of claim 1, wherein calculating genotype probabilities for single nucleotide variations and insertion deletions to determine genotype type comprises:
calculating genotype probabilities for single nucleotide variations and insertion deletions based on the probabilities of read length mapping to alleles using a bayesian algorithm;
calculating a posterior probability of the genotype based on the calculated genotype probabilities for the single nucleotide variations and the insertion deletions; and
comparing the calculated posterior probabilities of the genotypes to determine the genotype based on the genotype corresponding to the maximum value of the posterior probabilities of the genotypes.
8. The method of claim 1, wherein generating the variant feature value comprises:
calculating mutation type data, genotype type data, depth data, alignment quality data, base quality data, the genotype type data being determined based on the determined genotype type; and
generating a variation characteristic value based on the calculated variation type data, genotype type data, depth data, alignment quality data, and base quality data.
9. The method of claim 1, wherein generating the variant feature value comprises:
generating a 'true' or 'false' mark about the single nucleotide variation and the insertion deletion through a single nucleotide variation prediction model and an insertion deletion prediction model respectively aiming at the single nucleotide variation and the insertion deletion based on the variation characteristic value, wherein the single nucleotide variation prediction model and the insertion deletion prediction model are constructed based on a random forest model; and
single nucleotide variations and indels marked as "false" are filtered out in order to generate predicted results for the single nucleotide variations and indels.
10. A computing device, comprising:
at least one processing unit;
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions when executed by the at least one processing unit, cause the apparatus to perform the steps of the method of any of claims 1 to 9.
11. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when executed by a machine, implements the method according to any one of claims 1 to 9.
CN202210392841.6A 2022-04-15 2022-04-15 Methods, devices, and media for detecting single nucleotide variations and indels Active CN114496077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210392841.6A CN114496077B (en) 2022-04-15 2022-04-15 Methods, devices, and media for detecting single nucleotide variations and indels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210392841.6A CN114496077B (en) 2022-04-15 2022-04-15 Methods, devices, and media for detecting single nucleotide variations and indels

Publications (2)

Publication Number Publication Date
CN114496077A CN114496077A (en) 2022-05-13
CN114496077B true CN114496077B (en) 2022-06-21

Family

ID=81488730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210392841.6A Active CN114496077B (en) 2022-04-15 2022-04-15 Methods, devices, and media for detecting single nucleotide variations and indels

Country Status (1)

Country Link
CN (1) CN114496077B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115602244B (en) * 2022-10-24 2023-04-28 哈尔滨工业大学 Genome variation detection method based on sequence alignment skeleton
CN115631789B (en) * 2022-10-25 2023-08-15 哈尔滨工业大学 Group joint variation detection method based on pan genome
CN117711488A (en) * 2023-11-29 2024-03-15 东莞博奥木华基因科技有限公司 Gene haplotype detection method based on long-reading long-sequencing and application thereof
CN117935921A (en) * 2024-03-21 2024-04-26 北京贝瑞和康生物技术有限公司 Method, apparatus, medium and program product for determining deletion/repetition type

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017035392A1 (en) * 2015-08-25 2017-03-02 Nantomics, Llc Systems and methods for high-accuracy variant calling
CN110021342A (en) * 2017-08-21 2019-07-16 北京哲源科技有限责任公司 For accelerating the method and system of the identification of variant sites
CN110299185A (en) * 2019-05-08 2019-10-01 西安电子科技大学 A kind of insertion mutation detection method and system based on new-generation sequencing data
CN110797081A (en) * 2019-10-17 2020-02-14 南京医基云医疗数据研究院有限公司 Activation area identification method and device, storage medium and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017035392A1 (en) * 2015-08-25 2017-03-02 Nantomics, Llc Systems and methods for high-accuracy variant calling
CN108351917A (en) * 2015-08-25 2018-07-31 南托米克斯有限责任公司 System and method for identifying variant in high precision
CN110021342A (en) * 2017-08-21 2019-07-16 北京哲源科技有限责任公司 For accelerating the method and system of the identification of variant sites
CN110299185A (en) * 2019-05-08 2019-10-01 西安电子科技大学 A kind of insertion mutation detection method and system based on new-generation sequencing data
CN110797081A (en) * 2019-10-17 2020-02-14 南京医基云医疗数据研究院有限公司 Activation area identification method and device, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《Performance evaluation of indel calling tools using real short-read data》;Mohammad Shabbir Hasan 等;《Hum Genomics》;20150819;全文 *
《基于二三代测序数据联合拼接的拷贝数变异检测方法》;高峰;《中国优秀博硕士学位论文全文数据库(硕士)基础科学辑》;20200615(第06期);全文 *

Also Published As

Publication number Publication date
CN114496077A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN114496077B (en) Methods, devices, and media for detecting single nucleotide variations and indels
US20200303035A1 (en) Haplotype phasing models
Modolo et al. UrQt: an efficient software for the Unsupervised Quality trimming of NGS data
Browning et al. Improving the accuracy and efficiency of identity-by-descent detection in population data
JP7064654B2 (en) Gene mutation recognition method, device and storage medium
Homer et al. Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA
US20160140289A1 (en) Variant caller
CN114649055B (en) Methods, devices and media for detecting single nucleotide variations and indels
US20190325990A1 (en) Process for aligning targeted nucleic acid sequencing data
CN109979530B (en) Gene variation identification method, device and storage medium
CN116386718B (en) Method, apparatus and medium for detecting copy number variation
CN111933214B (en) Method and computing device for detecting RNA level somatic gene variation
CN112687332A (en) Method, apparatus and storage medium for determining sites of variation at risk of disease
US20190259468A1 (en) System and Method for Correlated Error Event Mitigation for Variant Calling
Huang et al. Performing parentage analysis in the presence of inbreeding and null alleles
CN113205857B (en) Method and device for identifying non-homologous regions of genomic chromosomes
US20190267110A1 (en) System and method for sequence identification in reassembly variant calling
EP3938932B1 (en) Method and system for mapping read sequences using a pangenome reference
CN114758720B (en) Method, apparatus and medium for detecting copy number variation
CN110570908B (en) Sequencing sequence polymorphic identification method and device, storage medium and electronic equipment
CN112669902B (en) Method, computing device and storage medium for detecting genomic structural variation
CN112908412A (en) Methods, devices and media for compounding the applicability of heterozygous variant pathogenic evidence
US20210065845A1 (en) Scoring variants in an exome to predict an effect of the variants on gene function
CN110797081B (en) Activation area identification method and device, storage medium and electronic equipment
Bohutínská et al. Population Genomic Analysis of Diploid-Autopolyploid Species

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40076881

Country of ref document: HK