CN113223619A - Method for comparing sequencing result coverage rates of different whole genome sequencing methods - Google Patents

Method for comparing sequencing result coverage rates of different whole genome sequencing methods

Info

Publication number
CN113223619A
Authority
CN
China
Prior art keywords
fasta
file
folder
sequencing
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110673259.2A
Other languages
Chinese (zh)
Inventor
易康
安泰然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Nuoyin Biotechnology Co ltd
Original Assignee
Nanjing Nuoyin Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Nuoyin Biotechnology Co ltd
Priority to CN202110673259.2A
Publication of CN113223619A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for comparing the coverage rate of sequencing results obtained by different whole genome sequencing methods, comprising the following steps: reading the whole genome sequencing results of the same species, creating a new fasta folder, and copying the fasta files into it; building a database from the fasta files in the fasta folder; splitting the sequences of the fasta files in the fasta folder into fragments; aligning the fragments against the database built in the second step with a BLAST tool; filtering out the alignments of each sequence to itself; and calculating scores for the filtered results, the highest score identifying the sequencing result obtained by the whole genome sequencing method with the highest coverage rate. The comparison results make it easy to see which genome has the higher coverage rate.

Description

Method for comparing sequencing result coverage rates of different whole genome sequencing methods
Technical Field
The invention relates to a method for comparing the coverage rate of sequencing results of different whole genome sequencing methods, belonging to the field of genome analysis.
Background
The concept of genomics was first introduced in 1986 by the American geneticist Thomas H. Roderick. It is an interdisciplinary field of biology concerned with the collective characterization and quantitative study of all the genes of an organism and with the comparison of different genomes. Genomics mainly studies the structure, function, evolution, mapping and editing of genomes and their influence on organisms, and genome research has produced results in many areas. In the medical field in particular, genomic technology enables gene diagnosis and gene therapy, and thereby the effective treatment of patients.
At present, sequencing technology is used to sequence all the genes in the genome of an organism. Results measured by different organizations on different machines differ to some extent, and it cannot be determined which sequencing result has the higher coverage rate. In-depth genome research therefore depends on selecting a more comprehensive and accurate reference genome.
The prior art mainly has the following defects:
1. it essentially compares two complete genome sequences at a time, so multi-sequence comparison is cumbersome and time-consuming;
2. the comparison results are not collated and combined, and no intuitive comparison result is provided.
Disclosure of Invention
The invention aims to provide a method for comparing the coverage rates of sequencing results obtained by different whole genome sequencing methods, whose comparison results make it easy to see which genome has the higher coverage rate.
The technical scheme adopted by the invention is as follows: a method for comparing the coverage rate of sequencing results of different whole genome sequencing methods, comprising the following steps:
(1) reading the whole genome sequencing results of the same species obtained by different methods, determining whether each input file is a fasta file or a fasta.gz file, creating a new fasta folder, copying the fasta files into the fasta folder, and decompressing the fasta.gz files and placing the decompressed files in the fasta folder;
(2) merging the fasta files in the fasta folder, de-duplicating the merged file, and building a database from the de-duplicated file for the later alignment;
(3) splitting the sequences of the fasta files in the fasta folder into fragments of suitable size to facilitate the later alignment;
(4) aligning the fragments against the database built in step (2) with a BLAST tool;
(5) processing the BLAST result file obtained in step (4) to remove the alignments of each fasta file's fragments to the file itself;
(6) calculating bit scores for the filtered results and sorting the results by score, from high to low or from low to high, the highest-scoring result being the sequencing result obtained by the whole genome sequencing method with the highest coverage rate. The bit-score calculation is specifically as follows:
A. when the query sequences retained after step (5) are aligned against a series of random sequences of uniform length, the alignment score follows a Gumbel extreme value distribution with location parameter $\mu = [\ln(Kmn)]/\lambda$;
B. under this distribution, the probability of observing an alignment score greater than or equal to x is
$$P(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$$
where P denotes probability and S is the alignment score;
C. this gives the expected number E of alignments that, by chance alone, score at least as high as the current alignment score S, obtained from the formula $E = Kmn\,e^{-\lambda S}$;
D. rearranging the formula obtained in step C gives the calculation formula for the bit score $S'$
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
Wherein
λ: the Gumbel distribution constant;
K: a constant associated with the scoring matrix used, which can be determined with reference to https://www.sciencedirect.com/science/article/pii/S0022283605803602;
m: the length of the query sequence;
n: the size of the database.
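The step from C to D can be written out explicitly. The following is a sketch of the standard Karlin-Altschul relations, assuming natural logarithms throughout; the symbol S' for the bit score is a notation chosen here for illustration.

```latex
\begin{align}
\mu &= \frac{\ln(Kmn)}{\lambda}
  &&\Rightarrow\quad e^{-\lambda (x - \mu)} = Kmn\,e^{-\lambda x} \\
P(S \ge x) &= 1 - \exp\!\left(-e^{-\lambda (x - \mu)}\right)
  = 1 - \exp\!\left(-Kmn\,e^{-\lambda x}\right) \\
E &= Kmn\,e^{-\lambda S} \\
S' &= \frac{\lambda S - \ln K}{\ln 2}
  &&\Rightarrow\quad E = mn\,2^{-S'}
\end{align}
```

Because S' absorbs both λ and K, bit scores computed under different scoring systems lie on a common scale, which is what makes them usable as a ranking quantity.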
Preferably, the fragments into which the fasta file sequences are split in step (3) are 200 bp to 500 bp long.
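Steps (1) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by the inventors: the function names, the `genome|sequence|offset` fragment naming and the use of non-overlapping windows are all hypothetical; the source only states that inputs are fasta or fasta.gz files and that fragments are 200 bp to 500 bp long.

```python
import gzip
from pathlib import Path


def open_fasta(path):
    """Open a FASTA file as text; a .gz suffix is decompressed on the fly."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from an open FASTA handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            seq_id = line[1:].split(maxsplit=1)[0] if len(line) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def split_sequence(genome_id, seq_id, sequence, size=240):
    """Cut one sequence into consecutive, non-overlapping fragments."""
    for start in range(0, len(sequence), size):
        # Hypothetical naming scheme: genome|sequence|offset.
        yield f"{genome_id}|{seq_id}|{start}", sequence[start:start + size]
```

The last fragment of each sequence may be shorter than `size`; whether such tails are kept or discarded is not stated in the source, and this sketch keeps them.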
The invention has the following beneficial effects:
1. the comparison results make it easy to see which genome has the higher coverage rate;
2. the invention integrates and links the individual processes into one foolproof operation, so that technicians can use the system simply and quickly;
3. the invention provides concise and clear multi-sequence comparison results, with the scores of the compared sequences arranged in descending or ascending order, which reduces the workload of technicians;
4. by removing the alignments of each file's sequences to the file itself, the invention reduces the influence of file size on the final scoring result;
5. the scoring method uses the statistically grounded bit score, which makes the scoring more rigorous.
Drawings
FIG. 1: a gz file of Cryptococcus neoformans sequencing data provided in the refseq or genbank databases.
FIG. 2: and (3) decompressing fa files of the cryptococcus neoformans sequencing data provided by a refseq or genbank database.
FIG. 3: and (5) establishing a reference genome file after library establishment.
FIG. 4: fa files after genome sequence segmentation.
FIG. 5: and (5) a result file of blast comparison.
FIG. 6: bit-score sorted result files.
FIG. 7 is a flow chart of the present invention.
Detailed Description
To further explain the technical means adopted by the present invention to achieve its intended objects, and their effects, the embodiments, structures, features and effects of the invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
Example 1
The method for comparing the coverage rate of sequencing results of different whole genome sequencing methods comprises the following steps:
(1) downloading the whole genome sequencing results of Cryptococcus neoformans provided by the RefSeq or GenBank database of NCBI, determining whether each input file is a fasta file or a fasta.gz file, creating a new fasta folder, copying the fasta files into the fasta folder, and decompressing the fasta.gz files into the fasta folder, so that all genome files are in a decompressed state, as shown in FIG. 1 and FIG. 2;
(2) merging the fasta files in the fasta folder and de-duplicating the merged file to obtain a de-duplicated file of about 40 MB, then building a database from the de-duplicated file, which produces the Reference index files and comparison files shown in FIG. 3; the specific command is: makeblastdb-part _ block Reference-type core-in Reference _ sequence;
(3) splitting the sequences of the fasta files in the fasta folder into fragments of 240 bp to facilitate the later alignment, the splitting being implemented by a python3-based algorithm in two steps: reading the sequences of the different chromosomes of each Cryptococcus neoformans whole genome, then dividing them into 240 bp fragments and storing the fragments, as shown in FIG. 4;
(4) aligning the fragments against the database built in step (2) with a BLAST tool, the specific command being: blast-db reference-query-fa-out-query-xls-outfmt 6-num _ threads 10;
(5) processing the BLAST result file obtained in step (4) to remove the alignments of each fasta file's fragments to the file itself, the removal being implemented by a python3-based algorithm that deletes every row whose first and second columns contain the same identifier; in FIG. 5, for example, the first and second columns of the first row are identical;
(6) calculating bit scores for the filtered results and ranking the results from high to low, the top-ranked result being the sequencing result obtained by the whole genome sequencing method with the highest coverage rate; as shown in FIG. 6, GCA_000195955.2 is the optimal reference genome of Cryptococcus neoformans.
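Steps (2) and (4) to (6) can be sketched as follows. This is a hedged illustration, not the inventors' code. The command lines use only standard BLAST+ options (`makeblastdb -in/-dbtype/-out`, `blastn -db/-query/-out/-outfmt/-num_threads`); the choice of `blastn` and all file names are assumptions. The filter assumes that identifiers start with a genome accession followed by `|`, and the ranking assumes that bit scores are summed per subject genome, since the source does not state how the scores are aggregated.

```python
import csv
from collections import defaultdict


def blast_commands(reference_fasta, db_name, query_fasta, out_table, threads=10):
    """Return argument lists for database building and alignment (not run here)."""
    build = ["makeblastdb", "-in", reference_fasta, "-dbtype", "nucl", "-out", db_name]
    align = ["blastn", "-db", db_name, "-query", query_fasta, "-out", out_table,
             "-outfmt", "6", "-num_threads", str(threads)]
    return build, align


def genome_of(identifier):
    """Hypothetical naming rule: the genome accession precedes the first '|'."""
    return identifier.split("|", 1)[0]


def rank_genomes(rows):
    """Drop self-alignments, then total the bit scores per subject genome."""
    totals = defaultdict(float)
    for row in rows:
        if not row or row[0].startswith("#"):
            continue
        # Tabular format 6: column 1 is the query id, column 2 the subject id,
        # column 12 the bit score.
        query, subject, bits = row[0], row[1], float(row[11])
        if genome_of(query) == genome_of(subject):
            continue  # fragment aligned to its own genome, removed as in step (5)
        totals[genome_of(subject)] += bits
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def rank_from_file(path):
    """Rank genomes from a tab-separated BLAST result file."""
    with open(path, newline="") as handle:
        return rank_genomes(csv.reader(handle, delimiter="\t"))
```

Each argument list returned by `blast_commands` can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed. Comparing whole genome prefixes rather than exact identifiers is a design choice of this sketch; the source describes the filter as deleting rows whose first two columns are identical.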
Wherein the bit-score calculation is specifically as follows:
A. when the query sequences retained after step (5) are aligned against a series of random sequences of uniform length, the alignment score follows a Gumbel extreme value distribution with location parameter $\mu = [\ln(Kmn)]/\lambda$;
B. under this distribution, the probability of observing an alignment score greater than or equal to x is
$$P(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$$
where P denotes probability and S is the alignment score;
C. this gives the expected number E of alignments that, by chance alone, score at least as high as the current alignment score S, obtained from the formula $E = Kmn\,e^{-\lambda S}$;
D. rearranging the formula obtained in step C gives the calculation formula for the bit score $S'$
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
Wherein
λ: the Gumbel distribution constant;
K: a constant associated with the scoring matrix used;
m: the length of the query sequence;
n: the size of the database.
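The relations in steps A to D can be checked numerically. The sketch below assumes natural logarithms; the function names are hypothetical, and the values of λ, K and n in the usage section are illustrative placeholders, not parameters taken from the source.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(raw_score, lam, k, m, n):
    """Expected number of chance alignments, E = K * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def tail_probability(raw_score, lam, k, m, n):
    """P(S >= x) = 1 - exp(-E), the Gumbel tail with mu = ln(Kmn) / lam."""
    return -math.expm1(-expected_hits(raw_score, lam, k, m, n))


if __name__ == "__main__":
    lam, k = 1.28, 0.46        # illustrative placeholders only
    m, n = 240, 40_000_000     # fragment length; assumed database size
    bits = bit_score(200, lam, k)
    # The two E-value expressions should agree: K*m*n*exp(-lam*S) and m*n*2**(-S').
    print(bits, expected_hits(200, lam, k, m, n), m * n * 2.0 ** -bits)
```

Using `math.expm1` keeps the tail probability accurate when E is very small, where `1 - math.exp(-E)` would lose precision.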
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A method for comparing the coverage rate of sequencing results of different whole genome sequencing methods, characterized by comprising the following steps:
(1) reading the whole genome sequencing results of the same species obtained by different methods, determining whether each input file is a fasta file or a fasta.gz file, creating a new fasta folder, copying the fasta files into the fasta folder, and decompressing the fasta.gz files and placing the decompressed files in the fasta folder;
(2) merging the fasta files in the fasta folder, de-duplicating the merged file, and building a database from the de-duplicated file for the later alignment;
(3) splitting the sequences of the fasta files in the fasta folder into fragments of suitable size to facilitate the later alignment;
(4) aligning the fragments against the database built in step (2) with a BLAST tool;
(5) processing the BLAST result file obtained in step (4) to remove the alignments of each fasta file's fragments to the file itself;
(6) calculating bit scores for the results filtered in step (5) and sorting the results from high to low, the highest-scoring result being the sequencing result obtained by the whole genome sequencing method with the highest coverage rate.
2. The method for comparing the coverage rate of sequencing results of different whole genome sequencing methods according to claim 1, characterized in that: the fragments into which the fasta file sequences are split in step (3) are 200 bp to 500 bp long.
3. The method for comparing the coverage rate of sequencing results of different whole genome sequencing methods according to any one of claims 1-2, wherein the bit-score calculation specifically comprises:
A. when the query sequences retained after step (5) are aligned against a series of random sequences of uniform length, the alignment score follows a Gumbel extreme value distribution with location parameter $\mu = [\ln(Kmn)]/\lambda$;
B. under this distribution, the probability of observing an alignment score greater than or equal to x is
$$P(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$$
where P denotes probability and S is the alignment score;
C. this gives the expected number E of alignments that, by chance alone, score at least as high as the current alignment score S, obtained from the formula $E = Kmn\,e^{-\lambda S}$;
D. rearranging the formula obtained in step C gives the calculation formula for the bit score $S'$
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
Wherein
λ: the Gumbel distribution constant;
K: a constant associated with the scoring matrix used;
m: the length of the query sequence;
n: the size of the database.
CN202110673259.2A 2021-06-17 2021-06-17 Method for comparing sequencing result coverage rates of different whole genome sequencing methods Pending CN113223619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110673259.2A CN113223619A (en) 2021-06-17 2021-06-17 Method for comparing sequencing result coverage rates of different whole genome sequencing methods

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110673259.2A CN113223619A (en) 2021-06-17 2021-06-17 Method for comparing sequencing result coverage rates of different whole genome sequencing methods

Publications (1)

Publication Number Publication Date
CN113223619A true CN113223619A (en) 2021-08-06

Family

ID=77080454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110673259.2A Pending CN113223619A (en) 2021-06-17 2021-06-17 Method for comparing sequencing result coverage rates of different whole genome sequencing methods

Country Status (1)

Country Link
CN (1) CN113223619A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180327830A1 (en) * 2015-12-03 2018-11-15 Ares Trading S.A. Method for determining cell clonality
CN107784199A (en) * 2017-10-18 2018-03-09 中国科学院昆明植物研究所 A kind of organelle gene group screening technique based on STb gene sequencing result
CN112011595A (en) * 2020-06-01 2020-12-01 广东美格基因科技有限公司 Whole genome amplification method for SARS-CoV-2 virus, application and sequencing method and kit
CN111926094A (en) * 2020-07-17 2020-11-13 电子科技大学中山学院 Bar code identification primer, identification method and kit for different species in aeromonas
CN111916149A (en) * 2020-08-19 2020-11-10 江南大学 Hierarchical clustering-based protein interaction network global comparison method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JOHN A. ELIX ET AL.: "PacBio amplicon sequencing for metabarcoding of mixed DNA samples from lichen herbarium specimens", MycoKeys, pages 73 - 91 *
亢雨笺: "Introduction to the BLAST algorithm" (BLAST算法介绍), Retrieved from the Internet <URL:https://ngdc.cncb.ac.cn/education/ABC/talk/> *
王欣 et al.: "Detection and analysis of EBV DNA integration in the NK/T-cell lymphoma genome", Chinese Journal of Clinical Oncology (《中国肿瘤临床》), vol. 45, no. 23, pages 1194 - 1200 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328399A (en) * 2022-03-15 2022-04-12 四川大学华西医院 Method and system for automatically pairing gene sequencing multi-sample data files
CN114328399B (en) * 2022-03-15 2022-05-24 四川大学华西医院 Method and system for automatically pairing gene sequencing multi-sample data files
CN115346606A (en) * 2022-10-17 2022-11-15 南京诺因生物科技有限公司 Method and system for designing targeting probe based on species sequence

Similar Documents

Publication Publication Date Title
CN108573125B (en) Method for detecting genome copy number variation and device comprising same
Adie et al. Speeding disease gene discovery by sequence based candidate prioritization
CN109767810B (en) High-throughput sequencing data analysis method and device
CN113223619A (en) Method for comparing sequencing result coverage rates of different whole genome sequencing methods
US11339426B2 (en) Method capable of differentiating fetal sex and fetal sex chromosome abnormality on various platforms
CN107480470B (en) Known variation detection method and device based on Bayesian and Poisson distribution test
CN110846411B (en) Method for distinguishing gene mutation types of single tumor sample based on next generation sequencing
JP6066924B2 (en) DNA sequence data analysis method
CN107944228B (en) Visualization method for gene sequencing variation site
CN111081315B (en) Homologous pseudogene mutation detection method
KR101686146B1 (en) Copy Number Variation Determination Method Using Sample comprising Nucleic Acid Mixture
Sun et al. A comprehensive comparison of supervised and unsupervised methods for cell type identification in single-cell RNA-seq
CN110016497B (en) Method for detecting copy number variation of tumor single cell genome
CN111755068A (en) Method and device for identifying tumor purity and absolute copy number based on sequencing data
CN114420212A (en) Escherichia coli strain identification method and system
Lawrence et al. Assignment of position-specific error probability to primary DNA sequence data
CN114446389B (en) Tumor neoantigen feature analysis and immunogenicity prediction tool and application thereof
CN116741268A (en) Method, device and computer readable storage medium for screening key mutation of pathogen
CN110970091A (en) Label quality control method and device
KR20210110241A (en) Prediction system and method of cancer immunotherapy drug Sensitivity using multiclass classification A.I based on HLA Haplotype
CN115862740B (en) Rapid distributed multi-sequence comparison method for large-scale virus genome data
CN110970093B (en) Method and device for screening primer design template and application
CN114566214B (en) Method for detecting genome deletion insertion variation, detection device, computer readable storage medium and application
CN108595914B (en) High-precision prediction method for tobacco mitochondrial RNA editing sites
TW202300656A (en) Machine detection of a candidate break-point of a copy number variant on a genomic sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination