CN113223619A

CN113223619A - Method for comparing sequencing result coverage rates of different whole genome sequencing methods

Info

Publication number: CN113223619A
Application number: CN202110673259.2A
Authority: CN
Inventors: 易康; 安泰然
Original assignee: Nanjing Nuoyin Biotechnology Co ltd
Current assignee: Nanjing Nuoyin Biotechnology Co ltd
Priority date: 2021-06-17
Filing date: 2021-06-17
Publication date: 2021-08-06

Abstract

The invention discloses a method for comparing the coverage rates of the sequencing results produced by different whole genome sequencing methods, comprising the following steps: reading the whole genome sequencing results of the same species, creating a new fasta folder, and copying the fasta files into it; building a database from the fasta files in the fasta folder; segmenting the fasta file sequences in the fasta folder into fragments; aligning the fragments against the database built in the second step using a blast tool; filtering out the alignments of fragments to their own source sequence; and calculating scores for the filtered results, the highest score identifying the sequencing result of the whole genome sequencing method with the highest coverage rate. The comparison results make it easy to determine which genome has the higher coverage rate.

Description

Method for comparing sequencing result coverage rates of different whole genome sequencing methods

Technical Field

The invention relates to a method for comparing the coverage rate of sequencing results of different whole genome sequencing methods, belonging to the field of genome analysis.

Background

The concept of genomics was first introduced in 1986 by the American geneticist Thomas H. It is an interdisciplinary field of biology that collectively characterizes and quantifies all the genes of an organism and compares different genomes. Genomics mainly studies the structure, function, evolution, mapping, and editing of genomes and their influence on organisms, and genome research has produced substantial results in many areas. In the medical field in particular, genomic technology enables gene diagnosis and gene therapy, and thereby more effective treatment of patients.

At present, sequencing technology is used to sequence all the genes in an organism's genome. Results obtained by different organizations on different machines differ to some extent, and it cannot be determined directly which sequencing result has the higher coverage rate. Because in-depth genome research depends on the reference used, selecting a more comprehensive and accurate reference genome is very important.

The prior art mainly has the following defects:

1. existing approaches essentially compare two complete genome sequences at a time, so comparing multiple sequences is cumbersome and time-consuming;

2. the comparison results are not collated or merged, and no intuitive comparison result is provided.

Disclosure of Invention

The invention aims to provide a method for comparing the coverage rates of sequencing results of different whole genome sequencing methods, so that the genome with the higher coverage rate can be easily identified from the comparison results.

The technical scheme adopted by the invention is as follows: a method for comparing the coverage rate of sequencing results of different whole genome sequencing methods is characterized by comprising the following steps:

(1) reading the whole genome sequencing results of the same species, determining whether each input file is a fasta file or a fasta.gz file, creating a new fasta folder, copying each fasta file into the fasta folder, and decompressing each fasta.gz file and placing the decompressed file in the fasta folder;

(2) merging the fasta files in the fasta folder, removing duplicates from the merged file, and building a database from the de-duplicated file for later alignment;

(3) segmenting the fasta file sequences in the fasta folder into fragments of suitable size to facilitate later alignment;

(4) aligning the fragments against the database built in step (2) using a blast tool;

(5) filtering the blast alignment file obtained in step (4) to remove the alignments of each fragment to the fasta file it came from;

(6) calculating bit scores for the filtered results and sorting the results by score, from high to low or from low to high, the result with the highest score being the sequencing result obtained by the whole genome sequencing method with the highest coverage rate. Wherein the bit-score calculation is specifically as follows:

A. when the query sequence filtered in step (5) is compared with a series of random sequences of uniform length, the alignment scores follow a Gumbel extreme value distribution with location parameter $\mu = \ln(Kmn)/\lambda$;

B. under this distribution, the probability of observing an alignment score of $x$ or more is

$$P(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$$

where P denotes probability and S is the alignment score, so that $S \ge x$ is the event considered;

C. it follows that, in the random case, the expected number of alignments with a score equal to or higher than the current alignment score $S$ is given by the formula $E = Kmn\,e^{-\lambda S}$;

D. from the formula obtained in step C, the bit score calculation formula is derived:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

Wherein

λ: a Gumbel distribution constant;

K: a constant associated with the scoring matrix used, which can be determined with reference to https://www.sciencedirect.com/science/article/pii/S0022283605803602;

m: the length of the query sequence;

n: the size of the database.
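The relations in steps A to D can be checked numerically with a short python3 sketch. It is only an illustration of the standard formulas, not the scoring code of the invention: the function names are hypothetical, and the values of λ and K in the demo are placeholder numbers chosen for illustration, since the real constants depend on the scoring system used.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score S to a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_raw(raw_score: float, lam: float, k: float, m: int, n: int) -> float:
    """Expected number of chance alignments scoring >= S: E = K*m*n*exp(-lam*S)."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """The same expectation written with the bit score: E = m*n*2**(-S')."""
    return m * n * 2.0 ** (-bits)


def p_value(e_value: float) -> float:
    """Probability of at least one chance alignment scoring >= S: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Placeholder constants for illustration only (assumed, not taken from the text).
    lam, k = 1.374, 0.711
    m, n = 240, 40_000_000  # query length and database size, also illustrative
    raw = 50.0
    bits = bit_score(raw, lam, k)
    print("bit score:", bits)
    print("E from raw score:", e_value_from_raw(raw, lam, k, m, n))
    print("E from bit score:", e_value_from_bits(bits, m, n))
```

Because the bit score absorbs λ and K, two alignments scored under different scoring systems can be compared directly, which is why the ranking in step (6) uses it rather than the raw score.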

Preferably, the fragment size of the fasta file sequence in step (3) is 200 bp to 500 bp.
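The segmentation of step (3) can be sketched in python3 as follows. This is a minimal sketch under stated assumptions rather than the algorithm of the invention: the text does not say whether fragments overlap or how they are named, so the sketch assumes consecutive, non-overlapping windows, keeps a shorter final fragment, and names each fragment with the hypothetical scheme `<sequence ID>_<1-based start>`. The default size of 300 is an arbitrary value inside the preferred range.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the ID.
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def split_sequence(seq_id: str, sequence: str, size: int = 300) -> Iterator[Tuple[str, str]]:
    """Cut one sequence into consecutive, non-overlapping fragments of `size` bases."""
    if size <= 0:
        raise ValueError("fragment size must be positive")
    for start in range(0, len(sequence), size):
        # Assumed naming scheme: original ID plus 1-based start position.
        yield f"{seq_id}_{start + 1}", sequence[start:start + size]


def split_fasta(lines: Iterable[str], size: int = 300) -> Iterator[Tuple[str, str]]:
    """Segment every record of a FASTA input into fragments."""
    for seq_id, sequence in read_fasta(lines):
        yield from split_sequence(seq_id, sequence, size)


if __name__ == "__main__":
    demo = [">chr1 demo record", "ACGTACGTAC", "GT"]
    for frag_id, frag in split_fasta(demo, size=5):
        print(f">{frag_id}\n{frag}")
```

Passing an open file handle instead of the `demo` list processes a real fasta file line by line, so a whole chromosome is held in memory only one record at a time.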

The invention has the following beneficial effects:

1. the comparison results make it easy to determine which genome has the higher coverage rate;

2. the invention integrates and links the individual processes into a simple, largely automatic workflow that technical personnel can use quickly;

3. the invention provides a concise and clear multi-sequence comparison result in which the scores of the compared sequences are arranged in descending or ascending order, reducing the workload of technicians;

4. by removing the alignments of each sequence to its own file, the invention reduces the influence of file size on the final score;

5. the scoring method of the invention adopts the rigorous, statistically grounded bit score, making the scoring more reliable.

Drawings

FIG. 1: gz files of the Cryptococcus neoformans sequencing data provided in the RefSeq or GenBank database.

FIG. 2: Decompressed fa files of the Cryptococcus neoformans sequencing data provided in the RefSeq or GenBank database.

FIG. 3: Reference genome files after database construction.

FIG. 4: fa files after genome sequence segmentation.

FIG. 5: Result file of the blast comparison.

FIG. 6: Result file sorted by bit score.

FIG. 7: Flow chart of the invention.

Detailed Description

To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description of the embodiments, structures, features and effects according to the present invention will be made with reference to the accompanying drawings and preferred embodiments.

Example 1

The method for comparing the coverage rate of the sequencing results of different whole genome sequencing methods is characterized by comprising the following steps of:

(1) downloading the whole genome sequencing results of Cryptococcus neoformans provided by the RefSeq or GenBank database of NCBI, determining whether each input file is a fasta file or a fasta.gz file, creating a new fasta folder, copying each fasta file into the fasta folder, and decompressing each fasta.gz file into the fasta folder, so that all genome files are in a decompressed state, as shown in FIG. 1 and FIG. 2;

(2) merging the fasta files in the fasta folder and removing duplicates from the merged file to obtain a de-duplicated file of about 40 MB, then building a database from the de-duplicated file, that is, the reference index files used for alignment, with the command: makeblastdb -parse_seqids -out reference -dbtype nucl -in reference_sequence, as shown in FIG. 3;

(3) segmenting the fasta file sequences in the fasta folder into fragments of 240 bp to facilitate later alignment, the segmentation being implemented by a python3-based algorithm in two steps: reading the sequences of the different chromosomes of the Cryptococcus neoformans whole genome, then dividing them into fragments of 240 bp and saving the fragments, as shown in FIG. 4;

(4) aligning the fragments against the database built in step (2) using a blast tool, the specific command used for the blast alignment being: blastn -db reference -query query.fa -out query.xls -outfmt 6 -num_threads 10;

(5) filtering the blast alignment file obtained in step (4) to remove the alignments of each fragment to its own fasta file, the removal being implemented by a python3-based algorithm that deletes every row whose first and second columns contain the same identifier; for example, in FIG. 5 the first row has the same identifier in its first and second columns;

(6) calculating bit scores for the filtered results and ranking the results from high to low, the top-ranked result being the sequencing result obtained by the whole genome sequencing method with the highest coverage rate; as shown in FIG. 6, GCA_000195955.2 is the optimal reference genome of Cryptococcus neoformans.
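The file handling in steps (1) and (2) can be sketched in python3 as follows. This is a minimal sketch under stated assumptions, not the implementation used in the example: the function names `collect_fasta` and `merge_unique` are hypothetical, and because the text does not define what counts as a duplicate, the sketch assumes that two records are duplicates when they share the same sequence ID (the first word of the header line).

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable, List


def collect_fasta(inputs: Iterable[str], fasta_dir: str = "fasta") -> List[Path]:
    """Copy plain FASTA files and decompress .gz files into one folder."""
    out_dir = Path(fasta_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    collected = []
    for name in inputs:
        src = Path(name)
        if src.suffix == ".gz":
            dest = out_dir / src.stem  # "x.fasta.gz" becomes "x.fasta"
            with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
                shutil.copyfileobj(fin, fout)
        else:
            dest = out_dir / src.name
            shutil.copyfile(src, dest)
        collected.append(dest)
    return collected


def merge_unique(fasta_files: Iterable[Path], merged_path: str) -> int:
    """Concatenate FASTA files, keeping the first record seen for each sequence ID.

    Returns the number of records written. Treating "same ID" as "duplicate"
    is an assumption of this sketch.
    """
    seen, written = set(), 0
    with open(merged_path, "w") as out:
        for path in fasta_files:
            keep = False  # ignore any stray lines before the first header
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        header = line[1:].split()
                        seq_id = header[0] if header else ""
                        keep = seq_id not in seen
                        if keep:
                            seen.add(seq_id)
                            written += 1
                    if keep:
                        out.write(line if line.endswith("\n") else line + "\n")
    return written
```

The merged file written by `merge_unique` plays the role of the de-duplicated file from which the database in step (2) is built.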

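Wherein the bit-score calculation is specifically as follows: the probability of observing an alignment score of $x$ or more is $P(S \ge x) = 1 - \exp\left(-Kmn\,e^{-\lambda x}\right)$, the expected number of chance alignments scoring at least $S$ is $E = Kmn\,e^{-\lambda S}$, and the bit score is $S' = (\lambda S - \ln K)/\ln 2$;

P denotes probability, and S is the alignment score, so that $S \ge x$ is the event considered;

Wherein

λ: a Gumbel distribution constant;

K: a constant associated with the scoring matrix used;

m: the length of the query sequence;

n: the size of the database.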
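Steps (5) and (6) of this example can be sketched in python3 as follows. The sketch is an illustration under stated assumptions, not the filtering and ranking code of the invention. It reads the 12-column tabular output produced with -outfmt 6, in which the first column is the query ID, the second the subject ID, and the twelfth the bit score. Two points are assumptions of the sketch: sequence IDs are taken to follow a hypothetical `<genome>|<rest>` naming scheme so that the source genome can be recovered from an ID, and bit scores are summed per subject genome, because the text does not state how the scores of individual alignments are aggregated.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def genome_of(identifier: str, sep: str = "|") -> str:
    """Return the genome label of a sequence ID (assumed '<genome>|<rest>' scheme)."""
    return identifier.split(sep, 1)[0]


def rank_genomes(rows: Iterable[List[str]]) -> List[Tuple[str, float]]:
    """Drop self-hits, sum bit scores per subject genome, and sort from high to low."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query_genome, subject_genome = genome_of(row[0]), genome_of(row[1])
        if query_genome == subject_genome:
            continue  # fragment aligned to the genome it was cut from
        # Column 12 of the default outfmt 6 layout is the bit score.
        totals[subject_genome] += float(row[11])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def rank_from_file(path: str) -> List[Tuple[str, float]]:
    """Rank genomes from a tab-separated blast result file."""
    with open(path, newline="") as handle:
        return rank_genomes(csv.reader(handle, delimiter="\t"))
```

Summing per query genome instead (using `query_genome` as the key) is an equally plausible reading of the text; the choice changes which genome is credited for an alignment, so it should be fixed before results from different runs are compared.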

Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for comparing the coverage rate of sequencing results of different whole genome sequencing methods is characterized by comprising the following steps:

(3) segmenting the sequence of the fasta file under the fasta folder, and segmenting the sequence of the fasta file into fragments with proper sizes so as to facilitate later comparison;

(6) calculating bit scores for the results filtered in step (5) and sorting the results from high to low, the highest score identifying the sequencing result obtained by the whole genome sequencing method with the highest coverage rate.

2. The method for comparing the coverage rates of sequencing results of different whole genome sequencing methods according to claim 1, characterized in that: the fragment size of the fasta file sequence in step (3) is 200 bp to 500 bp.

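3. The method for comparing the coverage rates of sequencing results of different whole genome sequencing methods according to any one of claims 1-2, wherein: the bit-score calculation specifically comprises: the probability of observing an alignment score of $x$ or more is $P(S \ge x) = 1 - \exp\left(-Kmn\,e^{-\lambda x}\right)$, the expected number of chance alignments scoring at least $S$ is $E = Kmn\,e^{-\lambda S}$, and the bit score is $S' = (\lambda S - \ln K)/\ln 2$;

P denotes probability, and S is the alignment score, so that $S \ge x$ is the event considered;

Wherein

λ: a Gumbel distribution constant;

K: a constant associated with the scoring matrix used;

m: the length of the query sequence;

n: the size of the database.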