CN113223619A - Method for comparing sequencing result coverage rates of different whole genome sequencing methods
- Publication number
- CN113223619A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
Abstract
The invention discloses a method for comparing the coverage rates of sequencing results of different whole genome sequencing methods, comprising the following steps: reading the sequencing results of different whole genomes of the same species, creating a new fasta folder, and copying the fasta files into it; building a database from the fasta files in the fasta folder; segmenting the sequences of the fasta files in the fasta folder; aligning the segmented fragments against the database built in the second step using a blast tool; filtering out hits in which a sequence is aligned to itself; and calculating scores for the filtered results, wherein the highest score indicates the sequencing result obtained by the whole genome sequencing method with the highest coverage rate. With the invention, the comparison results make it easy to identify which genome has the higher coverage rate.
Description
Technical Field
The invention relates to a method for comparing the coverage rate of sequencing results of different whole genome sequencing methods, belonging to the field of genome analysis.
Background
The concept of genomics was first introduced in 1986 by the American geneticist Thomas H. It is an interdisciplinary field of biology concerned with the collective characterization and quantitative study of all the genes of an organism and with the comparison of different genomes. Genomics mainly studies the structure, function, evolution, mapping, and editing of genomes and their influence on organisms. Genome research has produced results in many fields; in the medical field in particular, genomic technology enables gene diagnosis and gene therapy, allowing patients to be treated effectively.
At present, sequencing technology is used to sequence all the genes in the genome of an organism. Results obtained on different machines by different organizations differ to some degree, and it cannot be readily determined which sequencing result has the higher coverage rate. In-depth genome research therefore depends on selecting a more comprehensive and accurate reference genome.
The prior art mainly has the following defects:
1. existing approaches basically compare two complete genome sequences at a time, so comparing multiple sequences is cumbersome and time-consuming;
2. the comparison results are not collated or combined, and no intuitive comparison result is provided.
Disclosure of Invention
The invention aims to provide a method for comparing the coverage rates of sequencing results of different whole genome sequencing methods, such that the genome with the higher coverage rate can be easily identified from the comparison results.
The technical scheme adopted by the invention is as follows: a method for comparing the coverage rates of sequencing results of different whole genome sequencing methods, characterized by comprising the following steps:
(1) reading sequencing results of different whole genomes of the same species, determining whether each input file is a fasta file or a fasta.gz file, creating a new fasta folder, copying fasta files into the fasta folder, and decompressing fasta.gz files into the fasta folder;
(2) merging the fasta files in the fasta folder, removing duplicates from the merged file, and building a database from the de-duplicated file for later alignment;
(3) segmenting the sequences of the fasta files in the fasta folder into fragments of suitable size to facilitate later alignment;
(4) aligning the segmented fragments against the database built in step (2) using a blast tool;
(5) processing the blast alignment file obtained in step (4) to remove hits in which a sequence is aligned to itself;
(6) calculating bit-score scores for the filtered results and sorting the results by score from high to low or from low to high, wherein the top-ranked sequencing result is the one obtained by the whole genome sequencing method with the highest coverage rate. The bit-score calculation is specifically as follows:
A. the query sequences retained after the filtering in step (5) are compared with a series of random sequences of uniform length; the resulting scores follow a Gumbel extreme value distribution with location parameter $\mu = [\log(Kmn)]/\lambda$;
B. under this distribution, the probability of observing an alignment score greater than or equal to $x$ is $P(S \ge x) = 1 - \exp\left(-e^{-\lambda(x-\mu)}\right)$, where $P$ denotes probability and $S$ is the alignment score (the event being $S \ge x$);
C. this gives the expected number of alignments that, in the random case, score equal to or higher than the current alignment score $S$, obtained from the formula $E = Kmn\,e^{-\lambda S}$;
λ: the Gumbel distribution constant;
K: a constant associated with the scoring matrix used, which can be determined with reference to https://www.sciencedirect.com/science/article/pii/S0022283605803602;
m: the length of the query sequence;
n: the size of the database.
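Steps (1) and (2) can be illustrated with a minimal Python sketch. It assumes that inputs are recognized by the `.fasta` and `.fasta.gz` file name suffixes and that de-duplication means keeping the first record for each sequence identifier; the text defines neither criterion, and the function names `stage_inputs` and `merge_dedup` are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def stage_inputs(inputs, fasta_dir):
    """Copy .fasta files and decompress .fasta.gz files into fasta_dir."""
    fasta_dir = Path(fasta_dir)
    fasta_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in map(Path, inputs):
        if src.name.endswith(".fasta.gz"):
            dst = fasta_dir / src.name[:-3]  # drop the ".gz" suffix
            with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
                shutil.copyfileobj(fin, fout)
        elif src.name.endswith(".fasta"):
            dst = fasta_dir / src.name
            shutil.copyfile(src, dst)
        else:
            raise ValueError(f"not a fasta or fasta.gz file: {src}")
        staged.append(dst)
    return staged


def merge_dedup(fasta_paths, merged_path):
    """Concatenate FASTA files, keeping the first record per sequence ID.

    Assumption: two records are duplicates when their IDs (the first token
    of the header line) are equal. Returns the number of records kept.
    """
    seen = set()
    keep = False
    with open(merged_path, "w") as out:
        for path in fasta_paths:
            with open(path) as fin:
                for line in fin:
                    if line.startswith(">"):
                        parts = line[1:].split()
                        seq_id = parts[0] if parts else ""
                        keep = seq_id not in seen
                        seen.add(seq_id)
                    if keep:
                        out.write(line if line.endswith("\n") else line + "\n")
    return len(seen)
```

The merged file written by `merge_dedup` is what a database-building step would take as input. De-duplicating on sequence content instead of identifiers would be an equally reasonable reading of the text.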
Preferably, the fragment size used for segmenting the fasta file sequences in step (3) is 200 bp to 500 bp.
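The segmentation in step (3) might look like the following Python sketch, which cuts each sequence into consecutive, non-overlapping fragments of a fixed size (the last fragment may be shorter). The text does not say whether fragments overlap or how they are named, so both choices here are assumptions: each fragment keeps its parent sequence identifier as the first header token, with the 1-based start position in the description, so that a later comparison of query and subject identifiers can still recognize self-hits. The function names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (sequence_id, sequence) pairs."""
    records, seq_id, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            parts = line[1:].split()
            seq_id, chunks = (parts[0] if parts else ""), []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def split_sequence(seq_id, seq, size=240):
    """Yield (header, fragment) pairs with fragments of at most `size` bases."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(seq), size):
        # Header keeps the parent ID first; the start position is descriptive.
        yield f"{seq_id} start={start + 1}", seq[start:start + size]


def fragments_to_fasta(records, size=240):
    """Return FASTA text containing every fragment of every record."""
    lines = []
    for seq_id, seq in records:
        for header, fragment in split_sequence(seq_id, seq, size):
            lines.append(f">{header}\n{fragment}\n")
    return "".join(lines)
```

For example, `fragments_to_fasta(read_fasta(open("genome.fasta").read()), size=240)` would produce the fragment file for one genome, where `genome.fasta` is a placeholder name.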
The invention has the following beneficial effects:
1. the comparison results make it easy to identify which genome has the higher coverage rate;
2. the invention integrates and links the individual processes and provides foolproof operation, so that technical personnel can use the system simply and quickly;
3. the invention provides a concise and clear multi-sequence comparison result in which the scores of the compared sequences are arranged in descending or ascending order, reducing the workload of technicians;
4. by removing hits in which a sequence is aligned to its own file, the invention reduces the influence of file size on the final scoring result;
5. the scoring method of the invention uses the bit-score, a statistically grounded measure, which makes the scoring more rigorous.
Drawings
FIG. 1: gz files of Cryptococcus neoformans sequencing data provided by the refseq or genbank database.
FIG. 2: decompressed fa files of the Cryptococcus neoformans sequencing data provided by the refseq or genbank database.
FIG. 3: reference genome files after database building.
FIG. 4: fa files after genome sequence segmentation.
FIG. 5: result file of the blast alignment.
FIG. 6: result file sorted by bit-score.
FIG. 7: flow chart of the present invention.
Detailed Description
To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description of the embodiments, structures, features and effects according to the present invention will be made with reference to the accompanying drawings and preferred embodiments.
Example 1
The method for comparing the coverage rates of sequencing results of different whole genome sequencing methods comprises the following steps:
(1) downloading the sequencing results of different whole genomes of Cryptococcus neoformans provided by the refseq or genbank database of NCBI, determining whether each input file is a fasta file or a fasta.gz file, creating a new fasta folder, copying fasta files into the fasta folder, and decompressing fasta.gz files into the fasta folder so that all genome files are in a decompressed state, as shown in FIG. 1 and FIG. 2;
(2) merging the fasta files in the fasta folder and removing duplicates from the merged file to obtain a de-duplicated file of about 40 MB, then building a database from the de-duplicated file, which produces the Reference index files and alignment files, the specific command being: makeblastdb-part _ block Reference-type core-in Reference _ sequence, as shown in FIG. 3;
(3) segmenting the sequences of the fasta files in the fasta folder into fragments of 240 bp to facilitate later alignment, wherein the segmentation is implemented by a python3-based algorithm in two steps: reading the gene sequences of the different chromosomes of the Cryptococcus neoformans whole genome, then dividing them into gene fragments of 240 bp and saving the fragments, as shown in FIG. 4;
(4) aligning the segmented fragments against the database built in step (2) using a blast tool, the specific command used for the blast alignment being: blast-db reference-query-fa-out-query-xls-outfmt 6-num _ threads 10;
(5) processing the blast alignment file obtained in step (4) to remove hits in which a sequence is aligned to itself, wherein the removal is implemented by a python3-based algorithm that deletes every row whose first and second columns contain the same identifier; in FIG. 5, for example, the first and second columns of the first row are identical;
(6) calculating bit-score scores for the filtered results and ranking the results from high to low, wherein the top-ranked result is the sequencing result obtained by the whole genome sequencing method with the highest coverage rate; as shown in FIG. 6, GCA_000195955.2 is the optimal reference genome of Cryptococcus neoformans.
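The commands printed in steps (2) and (4) appear to have lost their spacing and option names in translation. The Python sketch below is one plausible reading, not the applicant's verified commands: it assumes the standard BLAST+ programs `makeblastdb` and `blastn` with their documented options `-in`, `-dbtype nucl`, `-out`, `-db`, `-query`, `-outfmt 6`, and `-num_threads`, and all file names are placeholders. It also sketches steps (5) and (6) under the assumption that bit scores (column 12 of tabular format 6) are summed per genome, using a hypothetical `genome_of` mapping from query sequence identifier to assembly; the text does not state how the scores are aggregated.

```python
import csv
import io
from collections import defaultdict


def makeblastdb_cmd(merged_fasta, db_name):
    """Argument list for building a nucleotide BLAST database (assumed options)."""
    return ["makeblastdb", "-in", merged_fasta, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(db_name, query_fasta, out_path, threads=10):
    """Argument list for a tabular (outfmt 6) alignment of the fragments."""
    return ["blastn", "-db", db_name, "-query", query_fasta, "-out", out_path,
            "-outfmt", "6", "-num_threads", str(threads)]


def read_outfmt6(text):
    """Parse tab-separated BLAST output into a list of rows (lists of strings)."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def filter_self_hits(rows):
    """Step (5): drop rows whose query ID (column 1) equals the subject ID (column 2)."""
    return [row for row in rows if row[0] != row[1]]


def rank_by_bitscore(rows, genome_of):
    """Step (6), as assumed here: sum column 12 per genome, sort high to low."""
    totals = defaultdict(float)
    for row in rows:
        totals[genome_of[row[0]]] += float(row[11])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

When BLAST+ is installed, the argument lists can be run with `subprocess.run(cmd, check=True)`. Crediting each hit to the subject genome rather than the query genome would be another defensible reading of step (6).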
The bit-score calculation is specifically as follows:
A. the query sequences retained after the filtering in step (5) are compared with a series of random sequences of uniform length; the resulting scores follow a Gumbel extreme value distribution with location parameter $\mu = [\log(Kmn)]/\lambda$;
B. under this distribution, the probability of observing an alignment score greater than or equal to $x$ is $P(S \ge x) = 1 - \exp\left(-e^{-\lambda(x-\mu)}\right)$, where $P$ denotes probability and $S$ is the alignment score (the event being $S \ge x$);
C. this gives the expected number of alignments that, in the random case, score equal to or higher than the current alignment score $S$, obtained from the formula $E = Kmn\,e^{-\lambda S}$;
where
λ: the Gumbel distribution constant;
K: a constant associated with the scoring matrix used;
m: the length of the query sequence;
n: the size of the database.
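The quantities in items A to C can be computed directly, as in the Python sketch below. It takes `log` to be the natural logarithm and uses the standard Gumbel tail probability for item B. The `bit_score` function is the usual Karlin-Altschul normalization $S' = (\lambda S - \ln K)/\ln 2$, which the text does not spell out and which is included here as an assumption about what "bit-score" refers to. Values for λ and K must come from the scoring system in use.

```python
import math


def gumbel_mu(K, lam, m, n):
    """Location parameter mu = ln(K*m*n) / lambda (item A)."""
    return math.log(K * m * n) / lam


def p_value(x, K, lam, m, n):
    """P(S >= x) = 1 - exp(-exp(-lambda * (x - mu))) (item B)."""
    mu = gumbel_mu(K, lam, m, n)
    # -expm1(-t) equals 1 - exp(-t) but stays accurate when t is tiny.
    return -math.expm1(-math.exp(-lam * (x - mu)))


def e_value(S, K, lam, m, n):
    """Expected number of chance alignments scoring at least S (item C)."""
    return K * m * n * math.exp(-lam * S)


def bit_score(S, K, lam):
    """Normalized score S' = (lambda*S - ln K) / ln 2 (assumed definition)."""
    return (lam * S - math.log(K)) / math.log(2)
```

Substituting μ into item B gives $e^{-\lambda(x-\mu)} = Kmn\,e^{-\lambda x}$, so the probability and the expectation are linked by $P = 1 - e^{-E}$, and with the normalization above $E = mn\,2^{-S'}$.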
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (3)
1. A method for comparing the coverage rates of sequencing results of different whole genome sequencing methods, characterized by comprising the following steps:
(1) reading sequencing results of different whole genomes of the same species, determining whether each input file is a fasta file or a fasta.gz file, creating a new fasta folder, copying fasta files into the fasta folder, and decompressing fasta.gz files into the fasta folder;
(2) merging the fasta files in the fasta folder, removing duplicates from the merged file, and building a database from the de-duplicated file for later alignment;
(3) segmenting the sequences of the fasta files in the fasta folder into fragments of suitable size to facilitate later alignment;
(4) aligning the segmented fragments against the database built in step (2) using a blast tool;
(5) processing the blast alignment file obtained in step (4) to remove hits in which a sequence is aligned to itself;
(6) calculating bit-score scores for the results filtered in step (5) and sorting the results from high to low, wherein the highest-scoring result is the sequencing result obtained by the whole genome sequencing method with the highest coverage rate.
2. The method for comparing the coverage rates of sequencing results of different whole genome sequencing methods according to claim 1, characterized in that the fragment size used for segmenting the fasta file sequences in step (3) is 200 bp to 500 bp.
3. The method for comparing the coverage rates of sequencing results of different whole genome sequencing methods according to any one of claims 1-2, characterized in that the bit-score calculation specifically comprises:
A. the query sequences retained after the filtering in step (5) are compared with a series of random sequences of uniform length; the resulting scores follow a Gumbel extreme value distribution with location parameter $\mu = [\log(Kmn)]/\lambda$;
B. under this distribution, the probability of observing an alignment score greater than or equal to $x$ is $P(S \ge x) = 1 - \exp\left(-e^{-\lambda(x-\mu)}\right)$, where $P$ denotes probability and $S$ is the alignment score (the event being $S \ge x$);
C. this gives the expected number of alignments that, in the random case, score equal to or higher than the current alignment score $S$, obtained from the formula $E = Kmn\,e^{-\lambda S}$;
where
λ: the Gumbel distribution constant;
K: a constant associated with the scoring matrix used;
m: the length of the query sequence;
n: the size of the database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110673259.2A CN113223619A (en) | 2021-06-17 | 2021-06-17 | Method for comparing sequencing result coverage rates of different whole genome sequencing methods |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113223619A | 2021-08-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||