CN104408284A - Integration algorithm of sequencing data analysis workflow of cancer somatic mutation gene

Info

Publication number
CN104408284A
Authority
CN
China
Legal status
Pending
Application number
CN201410571652.0A
Other languages
Chinese (zh)
Inventor
吴翀
王瑜
闫威
Current Assignee
BEIJING MICRO-HELIX GENE TECHNOLOGY Co Ltd
Original Assignee
BEIJING MICRO-HELIX GENE TECHNOLOGY Co Ltd

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an integration algorithm for a cancer somatic mutation gene sequencing data analysis workflow. The integration algorithm comprises the following steps: (1) sequencing data are aligned using the cushaw algorithm; (2) SNPs (single nucleotide polymorphisms) are identified using the samtools algorithm; and (3) cancer somatic mutations are identified using the VarScan algorithm.

Description

Integration algorithm for a cancer somatic mutation gene sequencing data analysis workflow
Technical field
The present invention relates to the field of biomedical data analysis, and in particular to an integration algorithm for a cancer somatic mutation gene sequencing data analysis workflow.
Background
Genes are the physical basis of heredity. All life phenomena of an organism, such as birth, aging, illness, and death, are related to genes. Gene sequencing is one way to understand life. With the development of second- and third-generation high-throughput sequencing technologies, sequencing output is often at the terabyte scale or larger. Analyzing and interpreting such large, high-dimensional data has become a greater difficulty than acquiring it; it is a key step in current biological research and has great practical significance.
Storing, processing, and analyzing massive high-throughput sequencing data poses great challenges to current computer systems and computing models. Existing systems suffer from insufficient computing capacity, low reliability of manual intervention, limited control over the underlying hardware in cloud architectures, and user concerns about privacy.
The challenge that large-scale sequencing data poses to data analysis tools calls for coordinated optimization of storage, management, transmission, scheduling, and computational analysis, and for close cooperation among biology, computer science, and statistical data analysis. This is especially true of tool integration: existing data analysis software has a low degree of integration, matches data from different sources poorly, has limited accuracy and reproducibility, and is inefficient.
In tumor detection and early diagnosis, cancer somatic mutations are a particular focus of sequencing-based testing, which requires raw sequencing data to be analyzed as efficiently and accurately as possible. Existing algorithms, however, often address only a single stage of sequencing data analysis, and the overly broad choice of analysis software at each step is a further obstacle to deriving a diagnostic result from raw sequencing data.
For example, in the read alignment stage, commonly used algorithms include bwa, bowtie, cushaw, and barracuda. They differ in computing speed and in the underlying hardware they suit. cushaw is designed specifically for high-performance compute cards and can be accelerated through parallel computation, whereas bwa, bowtie, and barracuda, although they lack parallel computation, suit a wider range of underlying data and computing hardware.
In the SNP identification stage (which mainly aligns the sequencing results to the genome and identifies the relevant mutations), popular software includes samtools, GATK, and Qcall. Some of these tools emphasize accuracy and others emphasize efficiency.
In the cancer somatic mutation identification stage (which mainly compares normal and cancer tissue from the same patient and thereby identifies the somatic mutations that occurred in the cancer), the main software includes VarScan and GATK UnifiedGenotyper, among other algorithms. Some emphasize high detection accuracy, some offer broad data compatibility, and some have relatively simple input and output conventions.
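The tumor-versus-normal comparison described in this paragraph can be sketched as a set difference over variant sites. This is a minimal illustration only, not how VarScan or any other named tool works internally: the `Site` tuple, the function name `tumor_specific_sites`, and the toy data are all hypothetical, and real callers also weigh read depth, allele frequency, and statistical significance.

```python
from typing import NamedTuple


class Site(NamedTuple):
    """A variant site: chromosome, 1-based position, reference base, alternate base."""
    chrom: str
    pos: int
    ref: str
    alt: str


def tumor_specific_sites(normal: set[Site], tumor: set[Site]) -> list[Site]:
    """Return variant sites present in the tumor sample but absent from the matched normal."""
    return sorted(tumor - normal)


# Toy data, invented purely for illustration.
normal_sites = {Site("chr1", 100, "A", "G"), Site("chr2", 250, "C", "T")}
tumor_sites = {
    Site("chr1", 100, "A", "G"),   # present in both samples, so treated as germline
    Site("chr2", 250, "C", "T"),   # present in both samples, so treated as germline
    Site("chr7", 5500, "G", "A"),  # tumor only: candidate somatic mutation
}

somatic = tumor_specific_sites(normal_sites, tumor_sites)
```

Sites shared by both samples drop out, leaving only the candidates that are specific to the tumor sample.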
There is therefore a need for an integration algorithm for a cancer somatic mutation gene sequencing data analysis workflow that optimizes and integrates the above algorithms to some extent, so that cancer somatic mutation genes can be detected efficiently and accurately.
Summary of the invention
The object of the invention is to provide an integration algorithm for a cancer somatic mutation gene sequencing data analysis workflow.
The cancer somatic mutation gene sequencing data come from Illumina HiSeq series sequencers, or from Thermo Fisher PGM and Proton series sequencers. The data size is on the order of Mb to Gb, the read length is 10 to 1000 bp, and the data format is FastQ or SFF.
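As an illustration of the input constraints above, the following sketch reads four-line FastQ records and keeps only reads whose length falls in the stated 10 to 1000 bp range. It assumes the simple, unwrapped four-line FastQ layout; the function names `read_fastq` and `length_in_range` are hypothetical and are not part of the patent or of any named tool.

```python
import io
from typing import Iterator, TextIO


def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from unwrapped four-line FastQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input (a blank line also stops the sketch)
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FastQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def length_in_range(seq: str, lo: int = 10, hi: int = 1000) -> bool:
    """Check the read length against the 10 to 1000 bp range given in the text."""
    return lo <= len(seq) <= hi


# Two toy reads: one of 12 bases (kept) and one of 4 bases (too short).
example = io.StringIO(
    "@read1\n" + "ACGT" * 3 + "\n+\n" + "I" * 12 + "\n"
    "@read2\nACGT\n+\nIIII\n"
)
records = list(read_fastq(example))
kept = [read_id for read_id, seq, _ in records if length_in_range(seq)]
```

SFF input is not covered here; it is a binary format and would need a dedicated reader or conversion step.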
An exemplary flow of the cancer somatic mutation gene sequencing data analysis workflow is shown in Fig. 1.
The integration algorithm for the cancer somatic mutation gene sequencing data analysis workflow comprises the following steps:
(1) Align the sequencing data to the reference genome sequence using the cushaw algorithm; with acceleration on a scientific-computing stream processor, alignment speed reaches 10 to 100 times that of other software;
(2) Identify SNPs using the samtools algorithm, which is compatible with multiple data formats, highly accurate, and fast at locating sites;
(3) Identify cancer somatic mutations using the VarScan algorithm, which is compatible with many kinds of data, highly accurate, and uses input and output that conform to open standards.
With the above workflow integration method, the sequencing results of normal and cancer tissue from the same patient can be compared quickly, thereby identifying the somatic mutations that occurred in the cancer.
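The three steps can be sketched as an ordered list of command lines. This is one possible arrangement under stated assumptions, not the patent's actual implementation: the aligner invocation is supplied by the caller because the source does not give cushaw's command-line syntax (the `<aligner>` entry below is a placeholder, not a real program), and all file names are hypothetical. The `samtools sort -o`, `samtools mpileup -f ... -o`, and VarScan `somatic` invocations follow the commonly documented usage of those tools, but should be checked against the installed versions.

```python
from typing import Callable

Command = list[str]


def build_plan(
    align_cmd: Callable[[str, str, str], Command],
    reference: str,
    normal_fastq: str,
    tumor_fastq: str,
    varscan_jar: str = "VarScan.jar",  # hypothetical path to the VarScan jar
) -> list[Command]:
    """Build, without running, the commands for alignment, pileup, and somatic calling."""
    plan: list[Command] = []
    for name, fastq in (("normal", normal_fastq), ("tumor", tumor_fastq)):
        # Step 1: alignment, delegated to the caller-supplied aligner command.
        plan.append(align_cmd(reference, fastq, f"{name}.sam"))
        # Step 2: sort the alignments and summarize per-site evidence with samtools.
        plan.append(["samtools", "sort", "-o", f"{name}.bam", f"{name}.sam"])
        plan.append(
            ["samtools", "mpileup", "-f", reference, "-o", f"{name}.pileup", f"{name}.bam"]
        )
    # Step 3: compare the normal and tumor pileups with VarScan.
    plan.append(
        ["java", "-jar", varscan_jar, "somatic", "normal.pileup", "tumor.pileup", "somatic_calls"]
    )
    return plan


def placeholder_aligner(reference: str, fastq: str, out_sam: str) -> Command:
    """Stand-in for the real aligner call; replace with the actual cushaw invocation."""
    return ["<aligner>", reference, fastq, out_sam]


plan = build_plan(placeholder_aligner, "ref.fa", "normal.fastq", "tumor.fastq")
```

Keeping the plan as data makes the order of steps easy to inspect before anything runs; each entry could then be executed with `subprocess.run(cmd, check=True)` once the placeholder is replaced.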
The workflow integration algorithm is a GPU algorithm: it uses the GPU's high-speed floating-point computation and parallelism to significantly increase the algorithm's computing speed, achieving hardware acceleration.
The workflow integration algorithm is further designed to incorporate matching GPU-accelerated software, so that hardware acceleration improves workflow efficiency.
The advantage of the invention is that the algorithm integrates a complete workflow for sequencing data analysis, sparing data analysts the step of screening and optimizing specific programs or program sets themselves. The optimized, combined algorithms are also much more computationally efficient, so sequencing analysis results can be returned more quickly.
Brief description of the drawings
Fig. 1. Schematic flow diagram of the cancer somatic mutation gene sequencing data analysis workflow.
Embodiment
The invention is further described below with reference to a specific embodiment, which should not be taken to limit the scope of the invention.
The raw data come from an Illumina HiSeq 2000; the data format is FastQ and the read length is 100 bp.
Analysis with the workflow gave the following results:
(1) Alignment output: the normal tissue data comprised 233988 records, of which 222290 (95.0%) aligned to the genome; the cancer tissue data comprised 200549 records, of which 188516 (94.0%) aligned to the genome;
(2) SNPs were identified using samtools;
(3) By comparing the differing SNP sites between normal tissue and tumor tissue, 12 tumor-specific sites were identified.
The total running time of the workflow was 95 s.
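The alignment rates in step (1) can be recomputed directly from the reported record counts. The short sketch below does so; the function name `alignment_rate` is hypothetical and the counts are taken from the embodiment above.

```python
def alignment_rate(aligned: int, total: int) -> float:
    """Return the percentage of records that aligned to the genome."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * aligned / total


# Record counts reported in the embodiment.
normal_rate = alignment_rate(222290, 233988)
tumor_rate = alignment_rate(188516, 200549)

print(f"normal: {normal_rate:.1f}%  tumor: {tumor_rate:.1f}%")
```

Recomputing the rates from the counts is a cheap consistency check on any reported alignment summary.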

Claims (3)

1. An integration algorithm for a cancer somatic mutation gene sequencing data analysis workflow, characterized in that it comprises the following steps:
(1) aligning the sequencing data to the reference genome sequence using the cushaw algorithm, wherein, with acceleration on a scientific-computing stream processor, alignment speed reaches 10 to 100 times that of other software;
(2) identifying SNPs using the samtools algorithm, which is compatible with multiple data formats, highly accurate, and fast at locating sites;
(3) identifying cancer somatic mutations using the VarScan algorithm, which is compatible with many kinds of data, highly accurate, and uses input and output that conform to open standards.
2. The integration algorithm for a cancer somatic mutation gene sequencing data analysis workflow according to claim 1, characterized in that the workflow integration algorithm is a GPU algorithm that uses the GPU's high-speed floating-point computation and parallelism to achieve hardware acceleration.
3. The integration algorithm for a cancer somatic mutation gene sequencing data analysis workflow according to claim 1, characterized in that the workflow integration algorithm further comprises software that uses GPU acceleration.
CN201410571652.0A 2014-10-24 2014-10-24 Integration algorithm of sequencing data analysis workflow of cancer somatic mutation gene Pending CN104408284A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410571652.0A CN104408284A (en) 2014-10-24 2014-10-24 Integration algorithm of sequencing data analysis workflow of cancer somatic mutation gene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410571652.0A CN104408284A (en) 2014-10-24 2014-10-24 Integration algorithm of sequencing data analysis workflow of cancer somatic mutation gene

Publications (1)

Publication Number Publication Date
CN104408284A true CN104408284A (en) 2015-03-11

Family

ID=52645915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410571652.0A Pending CN104408284A (en) 2014-10-24 2014-10-24 Integration algorithm of sequencing data analysis workflow of cancer somatic mutation gene

Country Status (1)

Country Link
CN (1) CN104408284A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391965A (en) * 2017-08-15 2017-11-24 上海派森诺生物科技股份有限公司 A kind of lung cancer somatic mutation determination method based on high throughput sequencing technologies

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998053300A2 (en) * 1997-05-23 1998-11-26 Lynx Therapeutics, Inc. System and apparaus for sequential processing of analytes
TW201101081A (en) * 2009-06-30 2011-01-01 Univ Ishou Analysis comparison system for SNP mutation information sequence
CN104059966A (en) * 2014-05-20 2014-09-24 吴松 STAG2 gene mutant sequence and detection method thereof as well as use of STAG2 gene mutation in detecting bladder cancer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
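陈辰: "Indel detection and cell function study of Wnt pathway-related genes in colorectal cancer based on next-generation sequencing technology", China Doctoral Dissertations Full-text Database *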

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150311
