CN105095686A - High-flux transcriptome sequencing data quality control method based on multi-core CPU (Central Processing Unit) hardware - Google Patents


Info

Publication number
CN105095686A
Authority
CN
China
Prior art keywords
sequence
core cpu
transcript profile
quality control
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410205571.9A
Other languages
Chinese (zh)
Other versions
CN105095686B (en)
Inventor
周茜
宁康
苏晓泉
徐健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Institute of Bioenergy and Bioprocess Technology of CAS
Original Assignee
Qingdao Institute of Bioenergy and Bioprocess Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Institute of Bioenergy and Bioprocess Technology of CAS
Priority to CN201410205571.9A
Publication of CN105095686A
Application granted
Publication of CN105095686B
Legal status: Active
Anticipated expiration

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides a high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware. The method comprises: processing high-throughput transcriptome sequencing data in parallel on a multi-core CPU to obtain data from which low-sequencing-quality sequences have been removed; using the multi-core CPU to predict and remove rRNA sequences from that data and to qualitatively identify contaminating sequences; and compiling statistics on and evaluating the sequence alignment results. Because the method runs on a multi-core CPU computer, it overcomes the computing efficiency bottleneck of single-core CPU computers and increases the efficiency of high-throughput transcriptome data quality control by more than 7 times. Applying the method significantly improves the accuracy and speed of high-throughput transcriptome data quality control and broadly facilitates the rapid development of research related to transcriptome sequencing.

Description

High-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware
Technical field
The present invention relates to bioinformatics, and specifically to a high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware that can perform quality control on high-throughput transcriptome sequencing data quickly.
Background
High-throughput sequencing, also known as "next-generation" sequencing, was a revolutionary change from traditional sequencing: a single run can sequence hundreds of thousands to millions of DNA/RNA molecules, and the technology is used ever more widely in biological research. Compared with traditional Sanger sequencing, the throughput of next-generation sequencing is one to two orders of magnitude higher, and the data volume is larger (from 100 MB to several GB). Transcriptome sequencing is an in-depth application of high-throughput sequencing that allows detailed, deep and comprehensive analysis of the transcription profile of a species. However, because of the limitations of high-throughput sequencing itself and operating errors in manual experimental steps such as transcriptome extraction, raw transcriptome data often contain some low-quality sequences, including low-quality bases, contaminating sequences and ribosomal RNA (rRNA) sequences. These low-quality sequences greatly affect the accuracy of downstream transcriptome analysis and can even lead to wrong conclusions. In addition, because downstream transcriptome analysis depends on the result of aligning sequences to a reference genome, the alignment quality of the transcriptome sequences is also a key factor in measuring the overall quality of the sequencing data. In summary, quality control is a necessary and critical step in high-throughput transcriptome sequencing data analysis. Existing transcriptome quality control methods concentrate mainly on quality assessment at the sequence alignment level and cannot perform comprehensive quality control of base quality, sequence quality, contamination and alignment quality at the same time.
Because high-throughput transcriptome sequencing generally measures multiple samples collected under different conditions or at different time points, and each sample generally needs three or more biological and technical replicates, the number of sequenced samples is large: a single sequencing project often yields more than 20 samples and tens of GB of data. Quality control of such data therefore requires a high-performance computer with sufficient computing power and corresponding analysis software. With current general-purpose analysis methods, a single-CPU computer that scans hundreds of millions of sequences one by one and processes each separately may need several days or even a month, so the efficiency of data analysis has also become a major bottleneck in related research.
Summary of the invention
To solve the problem that traditional analysis methods and computing systems cannot meet the requirements of comprehensive, accurate and efficient quality control of high-throughput transcriptome sequencing data, the present invention exploits the fact that such data can be processed in parallel and proposes a high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware.
To achieve the above object, the technical solution adopted by the present invention is a high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware, comprising the following steps:
Using a multi-core CPU to process the high-throughput transcriptome sequencing data in parallel, obtaining data from which low-sequencing-quality sequences have been removed;
Using the multi-core CPU to predict and remove rRNA sequences from the data from which low-sequencing-quality sequences have been removed, and qualitatively identifying contaminating sequences;
Compiling statistics on and evaluating the sequence alignment results.
Using the multi-core CPU to remove low-sequencing-quality sequences from the high-throughput transcriptome sequencing data comprises the following steps:
Using the Parallel-QC tool to split the input file into several small sub-datasets;
Assigning each sub-dataset to a different CPU core;
On the multiple CPU cores simultaneously, detecting the base quality and adapter sequences of every sequence in each sub-dataset, trimming the low-quality bases at both ends of each sequence in turn according to the user-specified length, filtering out sequences that contain the user-specified proportion of low-quality bases, and deleting the adapter sequences;
Merging the processed sequences, thereby obtaining the data from which low-sequencing-quality sequences have been removed.
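The split, distribute, clean and merge steps above can be illustrated with a minimal Python sketch. It is not the Parallel-QC implementation: the function names, the exact-match adapter search, and the parameters `adapter`, `min_q`, `max_low_frac` and `min_len` are assumptions made for illustration, and the "user-specified length" is interpreted here as a minimum read length after trimming.

```python
from functools import partial
from multiprocessing import Pool


def split_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous pieces, preserving order."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def clean_read(read, adapter, min_q, max_low_frac, min_len):
    """Return the cleaned (name, seq, quals) tuple, or None if the read is rejected."""
    name, seq, quals = read
    # Delete the adapter and everything after it (exact match only, for brevity).
    idx = seq.find(adapter)
    if idx != -1:
        seq, quals = seq[:idx], quals[:idx]
    # Trim low-quality bases from both ends.
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]
    # Reject reads that are too short or still hold too many low-quality bases.
    if len(seq) < max(min_len, 1):
        return None
    low = sum(q < min_q for q in quals)
    if low / len(seq) > max_low_frac:
        return None
    return (name, seq, quals)


def process_chunk(chunk, **params):
    """Clean every read of one sub-dataset; this runs inside one worker process."""
    cleaned = (clean_read(read, **params) for read in chunk)
    return [read for read in cleaned if read is not None]


def parallel_qc(reads, n_workers=4, **params):
    """Split reads into sub-datasets, clean them on n_workers processes, merge."""
    worker = partial(process_chunk, **params)
    with Pool(n_workers) as pool:
        parts = pool.map(worker, split_chunks(reads, n_workers))
    return [read for part in parts for read in part]
```

A call such as `parallel_qc(reads, n_workers=16, adapter="AGATCGGAAGAGC", min_q=20, max_low_frac=0.2, min_len=30)` would be placed under an `if __name__ == "__main__":` guard; the adapter string and thresholds there are example values only. Contiguous chunks are used so that the merged output keeps the input order.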
Using the multi-core CPU to predict and remove rRNA sequences from the data from which low-sequencing-quality sequences have been removed, and qualitatively identifying contaminating sequences, comprises the following steps:
Building hidden Markov models (HMMs) from all rRNA sequences in the SILVA database; predicting and extracting rRNA from the transcriptome sequences by HMM search, and removing the predicted rRNA sequences from the transcriptome data;
Mapping the predicted and extracted 16S or 18S rRNA onto the known rRNA sequence database SILVA to obtain the species origin of all sequences, summarizing the annotation results of the 16S and 18S rRNA feature sequences separately, and generating a species composition result, thereby obtaining the species and contamination information that may be present in the transcriptome sequencing data;
Predicting and extracting rRNA from the transcriptome sequences by HMM search, and removing the predicted rRNA sequences from the transcriptome data, comprises the following steps:
Splitting the data file produced by Parallel-QC, from which low-quality sequences have been removed, into small sub-datasets;
Assigning the sub-datasets to different CPU cores;
Predicting the 16S, 18S, 23S or 28S rRNA feature sequences of the sub-sequences simultaneously on the multiple CPU cores;
Merging the prediction results for each type of feature sequence;
According to the feature sequence prediction results, repeatedly loading portions of the large-scale input data from external storage into memory, searching and extracting, and finally merging the search results.
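The two phases above (chunk-parallel prediction, then batched extraction from the large input) can be sketched as follows. This is only an outline of the data flow, not the rRNA-filter implementation: a real HMM search is replaced by a toy substring match against made-up marker strings (`TOY_MARKERS`), and all function names and the `batch_size` parameter are hypothetical.

```python
from itertools import islice
from multiprocessing import Pool

# Made-up marker strings standing in for HMM profiles; NOT real rRNA signatures.
TOY_MARKERS = {"16S": "ACGTTGCA", "23S": "GGCATCCA"}


def predict_chunk(chunk):
    """Return {read_id: rRNA_type} for reads of one sub-dataset that match a marker."""
    hits = {}
    for read_id, seq in chunk:
        for rrna_type, marker in TOY_MARKERS.items():
            if marker in seq:
                hits[read_id] = rrna_type
                break
    return hits


def predict_parallel(reads, n_workers=4):
    """Phase 1: split reads into sub-datasets, predict on each core, merge the hits."""
    size = max(1, -(-len(reads) // n_workers))  # ceiling division
    chunks = [reads[i:i + size] for i in range(0, len(reads), size)]
    with Pool(n_workers) as pool:
        parts = pool.map(predict_chunk, chunks)
    merged = {}
    for part in parts:
        merged.update(part)
    return merged


def batches(iterable, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def split_by_hits(read_iter, hits, batch_size=100_000):
    """Phase 2: stream the input in batches and separate rRNA reads from the rest."""
    rrna, non_rrna = [], []
    for batch in batches(read_iter, batch_size):
        for read_id, seq in batch:
            (rrna if read_id in hits else non_rrna).append((read_id, seq))
    return rrna, non_rrna
```

Passing a lazy reader (for example a generator over a file on disk) as `read_iter` means only `batch_size` reads are parsed at a time; the result lists are kept in memory here for brevity, and a real tool would write each batch to the output files as it goes.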
Compiling statistics on and evaluating the results of aligning sequences to the reference genome comprises counting sequences, calculating sequence coverage, and summarizing paired-end alignment information.
Counting sequences covers all sequences, successfully aligned sequences, sequences aligned to certain specific genomic regions, and the proportion of all sequences that each of these represents.
Calculating sequence coverage covers the number of genes to which sequences were successfully aligned, the base coverage of each gene, and the distribution of successfully aligned sequences across genomic structures.
Summarizing paired-end alignment information covers the number of sequences for which both ends aligned successfully, the number of sequences for which only one end aligned successfully, and the insert size of paired-end aligned sequences.
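As an illustration of the coverage statistics, the sketch below computes the base coverage of each gene as the fraction of its positions covered by at least one alignment, and counts the genes hit. It assumes a single reference sequence, half-open `(start, end)` coordinates, and hypothetical function names; the patent does not specify how coverage is computed.

```python
def covered_fraction(gene_start, gene_end, alignments):
    """Fraction of the half-open gene interval covered by at least one alignment."""
    # Keep only overlapping alignments, clipped to the gene boundaries.
    clipped = sorted(
        (max(s, gene_start), min(e, gene_end))
        for s, e in alignments
        if s < gene_end and e > gene_start
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Close the previous merged block and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (gene_end - gene_start)


def coverage_summary(genes, alignments):
    """genes: {name: (start, end)}. Returns (number of genes hit, {name: fraction})."""
    per_gene = {
        name: covered_fraction(start, end, alignments)
        for name, (start, end) in genes.items()
    }
    genes_hit = sum(1 for fraction in per_gene.values() if fraction > 0)
    return genes_hit, per_gene
```

Merging overlapping alignments before summing avoids counting a base twice when several reads cover it.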
The present invention has the following advantages and beneficial effects:
1. It achieves comprehensive and efficient transcriptome data quality control, including integrated analysis and quality control of sequencing quality, rRNA sequences, contaminating sequences, alignment results and other aspects;
2. Paired with a multi-core CPU computer, it overcomes the computing efficiency bottleneck of single-core CPU computers and can improve the efficiency of high-throughput transcriptome data quality control by more than 7 times;
3. Applying the invention significantly improves the accuracy and speed of high-throughput transcriptome data quality control and broadly helps the rapid development of research related to transcriptome sequencing.
Brief description of the drawings
Fig. 1 is the hardware structure diagram of the present invention, in which 1 is the DMI and PCIe 2.0 bus; 2 is the triple-channel DDR3 memory bus; 3 is the SATA bus;
Fig. 2 is the software flow chart of the present invention, in which (1) is low-sequencing-quality data processing; (2) is the qualitative identification of rRNA sequences and contaminating sequences; (3) is the evaluation and quality control of sequence alignment;
Fig. 3 shows the test results for the same transcriptome sequencing data when the present invention is applied on a 16-core CPU and on a single-core CPU.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments.
The technical solution adopted by the present invention is a multi-core CPU computer and an efficient, unified software platform built on it. It is characterized by (1) a high-performance parallel computing and storage hardware system; and (2) a fully functional, high-performance, unified and configurable parallel software platform.
(1) High-performance parallel computing and storage hardware
The hardware system uses multi-socket, multi-core CPUs for large-scale parallel computing. Fig. 1 is the system structure diagram of the compute server:
First, multi-socket multi-core CPU parallel computing: four processors are used, connected to one another by the QPI bus. Each processor has 8 independent compute cores and is equipped with triple-channel DDR3 RDIMM memory, which also meets the computing requirements of a cloud computing server.
Second, high-speed cache and high-speed bus: these meet the need to schedule concurrent sequencing data analysis tasks and to support a cooperative working environment for large-scale task distribution.
Finally, a RAID disk array: storing data on a RAID disk array not only improves the response speed and stability of the central server but also facilitates irregular updates of the central server. It also meets the backup and upgrade needs of a cloud computing server.
(2) Fully functional, high-performance, unified and configurable software platform
The high-performance software platform comprises low-sequencing-quality data processing, qualitative identification of contaminating sequences, qualitative and quantitative identification of rRNA contaminating sequences, and assessment of sequence alignment quality (Fig. 2). The system is named the RNA-QC-Chain software system (http://www.computationalbioenergy.org/rna-qc-chain.html, independent intellectual property rights), and its data quality control steps are:
First, low-sequencing-quality data processing based on multi-core CPU parallel computing. The Parallel-QC tool (http://www.computationalbioenergy.org/parallel-qc.html, independent intellectual property rights) splits the input file into small sub-datasets and assigns them to different CPU cores. The cores then simultaneously assess the base quality and adapter sequences of each sequence, trim the low-quality bases at both ends of each sequence in turn according to the user-specified length, filter out sequences containing a certain proportion of low-quality bases, and delete the adapter sequences. Finally, the filtered sequences are merged to give data from which low-sequencing-quality sequences have been removed.
Second, qualitative identification of contaminating sequences based on multi-core CPU parallel computing. The rRNA-filter tool is first used to remove rRNA sequences from the data. Hidden Markov models (HMMs) are built from all rRNA sequences (including 16S, 18S, 23S and 28S rRNA sequences) in the public rRNA database SILVA; rRNA is predicted in the transcriptome sequences by HMM search, and the predicted rRNA sequences are then removed from the transcriptome data. SILVA is currently one of the most comprehensive ribosomal RNA databases in the world, covering rRNA sequences from the three domains Bacteria, Archaea and Eukarya. The method can therefore remove as many of the rRNA sequences contained in the transcriptome data as possible. rRNA-filter splits the input file into small sub-datasets, assigns them to different CPU cores, predicts the 16S, 18S, 23S or 28S rRNA feature sequences of the sub-sequences simultaneously on the multiple cores, and merges the prediction results for each type of feature sequence. Then, according to the feature sequence prediction results, it repeatedly loads portions of the large-scale input data from external storage into memory, searches and extracts, and finally merges the search results.
Next, 16S and 18S rRNA sequences are relatively short biomarker feature sequences that are widely used to identify prokaryotic and eukaryotic species. Based on the annotation of the predicted and extracted 16S or 18S rRNA, rRNA-filter qualitatively obtains the species origin of all sequences in the high-throughput sequencing data, summarizes the search results for the 16S and 18S rRNA feature sequences separately, and generates a graphical species composition result, thereby obtaining the species and contamination information that may be present in the transcriptome sequencing data.
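The summarizing step can be pictured as a simple tally of taxon annotations per marker. The sketch below assumes the database mapping has already produced `(marker, taxon)` pairs; the function names and the idea of comparing against a set of expected taxa are illustrative assumptions, not part of rRNA-filter.

```python
from collections import Counter, defaultdict


def species_composition(annotations):
    """annotations: iterable of (marker, taxon), e.g. ("16S", "Escherichia").

    Returns {marker: {taxon: proportion}}, tallied separately for each marker.
    """
    counts = defaultdict(Counter)
    for marker, taxon in annotations:
        counts[marker][taxon] += 1
    return {
        marker: {taxon: n / sum(c.values()) for taxon, n in c.most_common()}
        for marker, c in counts.items()
    }


def possible_contaminants(composition, expected_taxa):
    """List taxa seen in any marker's composition that are not in expected_taxa."""
    return sorted(
        {
            taxon
            for taxa in composition.values()
            for taxon in taxa
            if taxon not in expected_taxa
        }
    )
```

For a human sample, for example, `possible_contaminants(composition, {"Homo"})` would list every other taxon found among the 16S and 18S hits as a candidate contamination source to inspect.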
Third, comprehensive and accurate evaluation and quality control of sequence alignment results. The independently developed SAM-stats tool takes a sequence alignment result file in SAM format and compiles accurate, comprehensive statistics on and evaluates the alignment of the transcriptome sequences to the (known) genome data. Its functions include:
Counting sequences, including all sequences, successfully aligned sequences, sequences aligned to certain specific genomic regions, and the proportion of all sequences that each of these represents;
Calculating sequence coverage, including the number of genes to which sequences were successfully aligned, the base coverage of each gene, and the distribution of successfully aligned sequences across genomic structures;
Summarizing paired-end alignment information, including the number of sequences for which both ends aligned successfully, the number of sequences for which only one end aligned successfully, and the insert size of paired-end aligned sequences.
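The counting and paired-end statistics listed above can be derived from the FLAG and TLEN columns of a SAM file. The following is a minimal sketch under that assumption, not the SAM-stats tool itself; the function name and the keys of the returned dictionary are hypothetical. The flag bits used (0x1 paired, 0x4 unmapped, 0x8 mate unmapped, 0x40 first in pair, 0x100 secondary, 0x800 supplementary) are those defined by the SAM format.

```python
def sam_stats(lines):
    """Summarize alignment records from an iterable of SAM text lines."""
    stats = {
        "total": 0,
        "mapped": 0,
        "pairs_both_mapped": 0,
        "pairs_one_end_mapped": 0,
        "insert_sizes": [],
    }
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        stats["total"] += 1
        unmapped = bool(flag & 0x4)
        if not unmapped:
            stats["mapped"] += 1
        # Count each pair once, using the record flagged as first in pair.
        if flag & 0x1 and flag & 0x40:
            mate_unmapped = bool(flag & 0x8)
            if not unmapped and not mate_unmapped:
                stats["pairs_both_mapped"] += 1
                if tlen != 0:
                    stats["insert_sizes"].append(abs(tlen))
            elif unmapped != mate_unmapped:
                stats["pairs_one_end_mapped"] += 1
    total = stats["total"]
    stats["mapped_fraction"] = stats["mapped"] / total if total else 0.0
    return stats
```

Because the function accepts any iterable of lines, it can be called as `sam_stats(open(path))` to stream a large file without loading it whole; `path` is whatever SAM file the aligner produced.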
In summary, the software platform depends on the multi-core CPU hardware platform; only when the two work together can efficient transcriptome sequencing data quality control be achieved.
As shown in Fig. 1, the main components of the high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware are as follows. First, the multi-scale parallel computing capability of four multi-core CPUs, each with 8 independent compute cores and triple-channel memory. Second, high-speed cache and high-speed bus. Third, a RAID disk array, which not only improves the response speed and stability of the central server but also facilitates irregular updates of the central server. The basic computing and storage hardware configuration is: each CPU has at least 4 independent physical compute cores, dual-channel memory of 2 GB or more, a hard disk of 50 GB or more, and a high-speed interconnect between CPU and storage.
As shown in Fig. 2, the main steps of the workflow are as follows. First, the Parallel-QC software tool uses the multi-core CPU to process the transcriptome sequences: it trims the low-quality bases at both ends of the input sequences in turn, filters out sequences containing a certain proportion of low-quality bases, deletes adapter sequences, and merges the results as high-sequencing-quality sequence data. Next, the rRNA-filter tool performs rRNA sequence prediction and qualitative detection of contaminating sequences on the data from the previous step; using parallel multi-threaded computing, it extracts and removes the predicted rRNA sequences (16S/18S or 23S/28S) and maps the 16S or 18S sequences among them onto the known rRNA sequence database SILVA to obtain the species origin (including possible contaminating species) of all sequences. Finally, for the result of aligning sequences to the reference genome (a file in SAM format), the SAM-stats software tool compiles statistics on and evaluates the quality of the transcriptome data from the perspective of sequence alignment, including the number of successfully aligned sequences, gene coverage and paired-end alignment results. The above results are combined to generate graphical analysis results and an analysis report. The basic software platform configuration is: Linux operating system with the GCC runtime environment and the CUDA runtime environment (version 3.0 or later) pre-installed, RNA-QC-Chain software system version 1.0 or later, and Parallel-META software version 2.0 or later. The RNA-QC-Chain and Parallel-META software systems run from a command-line interface and are supplied with electronic operating instructions. The official website (http://www.computationalbioenergy.org/software.html) also provides long-term software update services.
The method of the present invention overcomes the computing efficiency bottleneck of single-core CPU computers and improves the efficiency of high-throughput transcriptome data quality control by more than 7 times. As shown in Fig. 3, a test on the same transcriptome sequencing data shows that the whole quality control process completes within 23 minutes on a 16-core CPU, whereas it takes 180 minutes on a single-core CPU.
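The "more than 7 times" figure follows directly from the two reported run times. The short calculation below reproduces it and also derives the parallel efficiency (speedup divided by core count), a quantity the patent does not state and which is added here only for context.

```python
single_core_minutes = 180  # reported run time on a single-core CPU
multi_core_minutes = 23    # reported run time on a 16-core CPU
cores = 16

speedup = single_core_minutes / multi_core_minutes
efficiency = speedup / cores  # fraction of ideal linear scaling

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
# prints "speedup: 7.83x, parallel efficiency: 49%"
```

A speedup of about 7.8 on 16 cores means roughly half of ideal linear scaling; this is consistent with a pipeline that includes serial stages such as splitting the input, merging results and disk I/O, although the patent does not break the run time down.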

Claims (8)

1. A high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware, characterized in that it comprises the following steps:
Using a multi-core CPU to process the high-throughput transcriptome sequencing data in parallel, obtaining data from which low-sequencing-quality sequences have been removed;
Using the multi-core CPU to predict and remove rRNA sequences from the data from which low-sequencing-quality sequences have been removed, and qualitatively identifying contaminating sequences;
Compiling statistics on and evaluating the sequence alignment results.
2. The high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware according to claim 1, characterized in that using the multi-core CPU to remove low-sequencing-quality sequences from the high-throughput transcriptome sequencing data comprises the following steps:
Using the Parallel-QC tool to split the input file into several small sub-datasets;
Assigning each sub-dataset to a different CPU core;
On the multiple CPU cores simultaneously, detecting the base quality and adapter sequences of every sequence in each sub-dataset, trimming the low-quality bases at both ends of each sequence in turn according to the user-specified length, filtering out sequences that contain the user-specified proportion of low-quality bases, and deleting the adapter sequences;
Merging the processed sequences, thereby obtaining the data from which low-sequencing-quality sequences have been removed.
3. The high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware according to claim 1, characterized in that using the multi-core CPU to predict and remove rRNA sequences from the data from which low-sequencing-quality sequences have been removed, and qualitatively identifying contaminating sequences, comprises the following steps:
Building hidden Markov models (HMMs) from all rRNA sequences in the SILVA database; predicting and extracting rRNA from the transcriptome sequences by HMM search, and removing the predicted rRNA sequences from the transcriptome data;
Mapping the predicted and extracted 16S or 18S rRNA onto the known rRNA sequence database SILVA to obtain the species origin of all sequences, summarizing the annotation results of the 16S and 18S rRNA feature sequences separately, and generating a species composition result, thereby obtaining the species and contamination information that may be present in the transcriptome sequencing data.
4. The high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware according to claim 3, characterized in that predicting and extracting rRNA from the transcriptome sequences by HMM search, and removing the predicted rRNA sequences from the transcriptome data, comprises the following steps:
Splitting the data file produced by Parallel-QC, from which low-quality sequences have been removed, into small sub-datasets;
Assigning the sub-datasets to different CPU cores;
Predicting the 16S, 18S, 23S or 28S rRNA feature sequences of the sub-sequences simultaneously on the multiple CPU cores;
Merging the prediction results for each type of feature sequence;
According to the feature sequence prediction results, repeatedly loading portions of the large-scale input data from external storage into memory, searching and extracting, and finally merging the search results.
5. The high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware according to claim 1, characterized in that compiling statistics on and evaluating the results of aligning sequences to the reference genome comprises counting sequences, calculating sequence coverage, and summarizing paired-end alignment information.
6. The high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware according to claim 5, characterized in that counting sequences covers all sequences, successfully aligned sequences, sequences aligned to certain specific genomic regions, and the proportion of all sequences that each of these represents.
7. The high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware according to claim 5, characterized in that calculating sequence coverage covers the number of genes to which sequences were successfully aligned, the base coverage of each gene, and the distribution of successfully aligned sequences across genomic structures.
8. The high-throughput transcriptome sequencing data quality control method based on multi-core CPU hardware according to claim 5, characterized in that summarizing paired-end alignment information covers the number of sequences for which both ends aligned successfully, the number of sequences for which only one end aligned successfully, and the insert size of paired-end aligned sequences.
CN201410205571.9A 2014-05-15 2014-05-15 High-throughput transcript profile sequencing data method of quality control based on multi-core CPU hardware Active CN105095686B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410205571.9A CN105095686B (en) 2014-05-15 2014-05-15 High-throughput transcript profile sequencing data method of quality control based on multi-core CPU hardware

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410205571.9A CN105095686B (en) 2014-05-15 2014-05-15 High-throughput transcript profile sequencing data method of quality control based on multi-core CPU hardware

Publications (2)

Publication Number Publication Date
CN105095686A (en) 2015-11-25
CN105095686B (en) 2018-08-14

Family

ID=54576104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410205571.9A Active CN105095686B (en) 2014-05-15 2014-05-15 High-throughput transcript profile sequencing data method of quality control based on multi-core CPU hardware

Country Status (1)

Country Link
CN (1) CN105095686B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740650A (en) * 2016-03-02 2016-07-06 广西作物遗传改良生物技术重点开放实验室 Method for rapidly and accurately identifying high-throughput genome data pollution sources
CN106407743A (en) * 2016-08-31 2017-02-15 上海美吉生物医药科技有限公司 Cluster-based high-throughput data analyzing method
CN106701995A (en) * 2017-02-20 2017-05-24 元码基因科技(北京)有限公司 Method for cell quality control through unicellular transcriptome sequencing
CN106777262A (en) * 2016-12-28 2017-05-31 上海华点云生物科技有限公司 High-flux sequence quality of data filter method and filter
CN107194204A (en) * 2017-05-22 2017-09-22 人和未来生物科技(长沙)有限公司 A kind of sequencing data of whole genome calculates deciphering method
CN107203703A (en) * 2017-05-22 2017-09-26 人和未来生物科技(长沙)有限公司 A kind of transcript profile sequencing data calculates deciphering method
CN107451424A (en) * 2017-07-31 2017-12-08 浙江绍兴千寻生物科技有限公司 In high volume unicellular RNA seq data quality controls and analysis method
CN109559780A (en) * 2018-09-27 2019-04-02 华中科技大学鄂州工业技术研究院 A kind of RNA data processing method of high-flux sequence
CN111326216A (en) * 2020-02-27 2020-06-23 中国科学院计算技术研究所 Rapid partitioning method for big data gene sequencing file
CN112927756A (en) * 2019-12-06 2021-06-08 深圳华大基因科技服务有限公司 Method and device for identifying transcriptome rRNA pollution source and method for improving rRNA pollution
CN115495299A (en) * 2022-11-15 2022-12-20 深圳市江元科技(集团)有限公司 Method, system and medium for intelligent QC software detection and identifier uploading

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101914619A (en) * 2010-07-22 2010-12-15 深圳华大基因科技有限公司 RNA (Ribonucleic Acid) sequencing quality control method and device relating to gene expression
WO2012125848A2 (en) * 2011-03-16 2012-09-20 Baylor College Of Medicine A method for comprehensive sequence analysis using deep sequencing technology

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
QIAN ZHOU 等: "Meta-QC-Chain: Comprehensive and Fast Quality Control Method for Metagenomic Data", 《GENOMICS PROTEOMICS BIOINFORMATICS》 *
QIAN ZHOU 等: "QC-Chain: Fast and Holistic Quality Control Method for Next-Generation Sequencing Data", 《PLOS ONE》 *
RAVI K. PATEL 等: "NGS QC Toolkit: A Toolkit for Quality Control of Next Generation Sequencing Data", 《PLOS ONE》 *
宋琳琳 等: "Illumina-Solexa测序数据质量评估系统的构建", 《现代生物医学进展》 *
苏晓泉 等: "Meta-Mesh——元基因组数据分析系统", 《生物工程学报》 *
苏晓泉 等: "服务于微生物群落研究的高性能元基组数据分析平台", 《E-SCIENCE应用》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740650B (en) * 2016-03-02 2019-04-05 广西作物遗传改良生物技术重点开放实验室 A method of quick and precisely identifying high-throughput genomic data pollution sources
CN105740650A (en) * 2016-03-02 2016-07-06 广西作物遗传改良生物技术重点开放实验室 Method for rapidly and accurately identifying high-throughput genome data pollution sources
CN106407743B (en) * 2016-08-31 2019-03-05 上海美吉生物医药科技有限公司 A kind of high-throughput data analysing method based on cluster
CN106407743A (en) * 2016-08-31 2017-02-15 上海美吉生物医药科技有限公司 Cluster-based high-throughput data analyzing method
CN106777262A (en) * 2016-12-28 2017-05-31 上海华点云生物科技有限公司 High-flux sequence quality of data filter method and filter
CN106777262B (en) * 2016-12-28 2020-07-03 上海华点云生物科技有限公司 High-throughput sequencing data quality filtering method and filtering device
CN106701995A (en) * 2017-02-20 2017-05-24 元码基因科技(北京)有限公司 Method for cell quality control through unicellular transcriptome sequencing
CN107203703A (en) * 2017-05-22 2017-09-26 人和未来生物科技(长沙)有限公司 A kind of transcript profile sequencing data calculates deciphering method
CN107194204A (en) * 2017-05-22 2017-09-22 人和未来生物科技(长沙)有限公司 A kind of sequencing data of whole genome calculates deciphering method
CN107451424A (en) * 2017-07-31 2017-12-08 浙江绍兴千寻生物科技有限公司 In high volume unicellular RNA seq data quality controls and analysis method
CN109559780A (en) * 2018-09-27 2019-04-02 华中科技大学鄂州工业技术研究院 A kind of RNA data processing method of high-flux sequence
CN112927756A (en) * 2019-12-06 2021-06-08 深圳华大基因科技服务有限公司 Method and device for identifying transcriptome rRNA pollution source and method for improving rRNA pollution
CN111326216A (en) * 2020-02-27 2020-06-23 中国科学院计算技术研究所 Rapid partitioning method for big data gene sequencing file
CN111326216B (en) * 2020-02-27 2023-07-21 中国科学院计算技术研究所 Rapid partitioning method for big data gene sequencing file
CN115495299A (en) * 2022-11-15 2022-12-20 深圳市江元科技(集团)有限公司 Method, system and medium for intelligent QC software detection and identifier uploading

Also Published As

Publication number Publication date
CN105095686B (en) 2018-08-14

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant