CN111199772A

CN111199772A - PEDV (porcine epidemic diarrhea Virus) genome analysis method based on next generation sequencing

Info

Publication number: CN111199772A
Application number: CN201911381880.0A
Authority: CN
Inventors: 崔天一; 孙子奎
Original assignee: Shanghai Personal Biotechnology Co ltd
Current assignee: Shanghai Personal Biotechnology Co ltd
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2020-05-26
Anticipated expiration: 2039-12-27
Also published as: CN111199772B

Abstract

The invention discloses a PEDV genome analysis method based on second-generation sequencing, comprising the following steps: downloading the genome sequence and transcript sequences of the host pig as a host reference sequence; obtaining sequencing reads free of host contamination; assembling the genome; performing homologous alignment on the assembled genome; selecting an alignment result; and taking the start site of the alignment as the start site of the gene. Compared with the prior art, the method can accurately predict the gene structure and avoid omissions.

Description

PEDV (porcine epidemic diarrhea virus) genome analysis method based on next generation sequencing

Technical Field

The invention relates to the field of gene detection, in particular to a PEDV virus genome analysis method based on second-generation sequencing.

Background

Firstly, second-generation sequencing data of PEDV contain some, or even a large amount, of host pig genome sequence, and this host genome contamination can interfere with assembly of the PEDV genome. Secondly, viral gene structure is mainly predicted with software such as GeneMarkS, but because genes in the PEDV genome overlap one another and ribosomal frameshifting also occurs, conventional gene prediction software cannot accurately identify the correct gene structure.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide a PEDV virus genome analysis method based on second-generation sequencing.

Compared with existing methods, the analysis method provided by the invention removes host genome contamination with an alignment strategy before genome assembly, and then assembles the genome with the SPAdes assembler. For gene structure identification, gene information from similar viral genomes is collected and curated into a database of PEDV gene sequences, and genes are predicted by homology.

In order to realize the purpose of the invention, the adopted specific technical scheme is as follows:

a PEDV virus genome analysis method based on next generation sequencing comprises the following steps:

1) downloading a genome sequence and a transcript sequence of a host pig as a host reference sequence;

2) running Bowtie2 software, aligning the sequencing data to the host reference sequence of step 1, and removing the sequencing data that align to the host sequence, thereby obtaining sequencing reads free of host contamination;

3) taking, from the host-decontaminated sequencing reads obtained in step 2, a subset for assembly whose data amount equals 150X the PEDV genome size;

4) running SPAdes software with the k-mer sizes set to 55, 77 and 121 and the cov-cutoff value set to auto, and assembling the genome with the reads obtained in step 3 as input;

5) batch-downloading the gene sequences of all PEDV viruses in the NCBI database using the EDirect software provided by NCBI;

6) running makeblastdb software, and building a database with the gene sequences downloaded in step 5 as input;

7) running BLAST software, and performing homologous alignment of the genome assembled in step 4 against the database built in step 6;

8) selecting, from the alignment results obtained, the result with the largest Total score value as the final result;

9) according to the alignment result selected in step 8, taking the start site of the alignment as the start site of the gene.

In a preferred embodiment of the invention, the Bowtie2 software is run with default parameters.
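As a rough illustration of how the tool invocations in steps 2, 4, 6 and 7 might be assembled, the following Python sketch builds the command lines without running them. It is not the applicant's implementation. The file names, the working directory, the assumption of paired-end reads, and the use of Bowtie2's `--un-conc` option to collect pairs that fail to align concordantly to the host are all illustrative choices that the text above does not specify. The EDirect download of step 5 is left out because the text does not give the query that was used.

```python
from typing import Dict, List


def build_pipeline_commands(
    host_fasta: str,
    reads_1: str,
    reads_2: str,
    subsampled_1: str,
    subsampled_2: str,
    workdir: str = "pedv_out",  # hypothetical output directory
) -> Dict[str, List[str]]:
    """Return one argv list per external tool, keyed by a step label."""
    index = f"{workdir}/host_index"
    genes_db = f"{workdir}/pedv_genes"
    return {
        # Index the pig genome + transcript reference.
        "index_host": ["bowtie2-build", host_fasta, index],
        # Default Bowtie2 parameters; --un-conc writes pairs that did not
        # align concordantly to the host ("%" becomes 1 or 2).
        "remove_host": [
            "bowtie2", "-x", index, "-1", reads_1, "-2", reads_2,
            "--un-conc", f"{workdir}/nohost_%.fastq",
            "-S", f"{workdir}/host.sam",
        ],
        # k-mer sizes and coverage cutoff as stated in step 4.
        "assemble": [
            "spades.py", "-1", subsampled_1, "-2", subsampled_2,
            "-k", "55,77,121", "--cov-cutoff", "auto",
            "-o", f"{workdir}/spades",
        ],
        # Nucleotide database from the downloaded PEDV gene sequences.
        "make_db": [
            "makeblastdb", "-in", f"{workdir}/pedv_genes.fasta",
            "-dbtype", "nucl", "-out", genes_db,
        ],
        # Tabular output (format 6) is easy to post-process.
        "align_genes": [
            "blastn", "-query", f"{workdir}/spades/contigs.fasta",
            "-db", genes_db, "-outfmt", "6",
            "-out", f"{workdir}/hits.tsv",
        ],
    }
```

Each list can be passed to `subprocess.run(..., check=True)` in order. Keeping the commands as data makes the parameters easy to inspect before anything is executed.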

The invention has the beneficial effects that:

Compared with the prior art, the method can accurately predict the gene structure and avoid omissions.

Drawings

FIG. 1 is a schematic diagram showing the results of gene prediction in the embodiment of the present invention.

FIG. 2 is a schematic view showing the gene prediction result of the comparative example of the present invention.

FIG. 3 is a flow chart of the present invention.

Detailed Description

The invention is further illustrated by the following specific examples.

Example 1:

1. PEDV sample PEDV1 was sequenced on an Illumina MiSeq sequencer, generating 2.4 G of raw data; after removal of sequencing adapters and quality control, 2.3 G of high-quality sequence in fastq format was obtained.

2. Since the host of PEDV is the pig, the pig reference genome and transcript sequences are downloaded from the Ensembl website.

3. bowtie2-build is run to build an index, taking the reference sequence downloaded in step (2) as input.

4. The Bowtie2 software is run (default parameters), and the sequencing data are aligned to the host reference sequence from step (2), using the index built in step (3).

5. According to the alignment result, the data aligned to the pig reference sequence are removed, giving sequencing reads free of host contamination.
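One way to carry out the removal in step (5) is to read the SAM output of the aligner and keep only reads that have no alignment to the host. The sketch below is an assumption about how this could be done, not the procedure used in the example. It relies only on the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) and keeps a pair only when neither mate aligned.

```python
from typing import Iterable, Set

PAIRED = 0x1         # SAM FLAG bit: read is part of a pair
UNMAPPED = 0x4       # SAM FLAG bit: this read did not align
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the mate did not align


def host_free_read_names(sam_lines: Iterable[str]) -> Set[str]:
    """Return names of reads (or read pairs) with no alignment to the host."""
    keep: Set[str] = set()
    drop: Set[str] = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        unaligned = bool(flag & UNMAPPED)
        if flag & PAIRED:
            # For pairs, require that the mate is unaligned as well.
            unaligned = unaligned and bool(flag & MATE_UNMAPPED)
        (keep if unaligned else drop).add(name)
    # A name seen with any aligned record is treated as host-derived.
    return keep - drop
```

The returned names can then be used to select the matching records from the original fastq files.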

6. Since an excessive amount of data is unfavorable for assembly, a subset of the host-decontaminated reads obtained in step (5) is taken, amounting to about 150 times the PEDV genome size of about 28 kb, i.e. about 4.2 Mb.
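The data amount in step (6) is a simple product: assuming a PEDV genome of about 28,000 bp, a depth of 150X corresponds to 150 × 28,000 = 4,200,000 bases. The sketch below shows one hypothetical way to take reads until that budget is reached. Taking reads in file order rather than sampling at random is an assumption, since the text does not say how the subset was chosen.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str]  # (read name, base sequence)


def target_bases(genome_size_bp: int, depth: int) -> int:
    """Total number of bases needed to cover the genome `depth` times."""
    return genome_size_bp * depth


def take_until_depth(
    reads: Iterable[Read], genome_size_bp: int, depth: int
) -> Iterator[Read]:
    """Yield reads in input order until the base budget is reached."""
    budget = target_bases(genome_size_bp, depth)
    total = 0
    for name, seq in reads:
        if total >= budget:
            break
        total += len(seq)
        yield name, seq
```

With paired-end data the same budget would have to be applied to both mates together so that pairs stay intact.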

7. The SPAdes software is run with the k-mer sizes set to 55, 77 and 121 and the cov-cutoff value set to auto, and the genome is assembled with the reads obtained in step (6) as input. A contig, contig1, with a length of 26785 bp and a coverage depth of 137.58X was obtained.

8. The gene sequences of all PEDV viruses in the NCBI database are batch-downloaded using the EDirect software provided by NCBI.

9. The makeblastdb software is run, and a database is built with the gene sequences downloaded in step (8) as input.

10. The BLAST software is run, and the genome assembled in step (7) is homologously aligned against the database built in step (9).

11. From the alignment results obtained, the result with the largest Total score value is selected as the final result.

12. According to the alignment result selected in step (11), the start site of the alignment is taken as the start site of the gene. The results of gene prediction are shown in FIG. 1.
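Steps (11) and (12) can be sketched as a small post-processing routine over BLAST tabular output (`-outfmt 6`, default columns). The following is only an illustration under stated assumptions: "Total score" is approximated as the sum of the bit scores of all alignments to the same database sequence, and the gene start is taken as the smallest query start coordinate among those alignments. The text does not define either quantity this precisely, and a full workflow would presumably repeat the selection for each gene.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def best_hit_start(blast_tab_lines: Iterable[str]) -> Tuple[str, float, int]:
    """Return (subject id, summed bit score, start on the assembled contig)."""
    totals: Dict[str, float] = defaultdict(float)
    starts: Dict[str, List[int]] = defaultdict(list)
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        subject = cols[1]                      # sseqid
        qstart, qend = int(cols[6]), int(cols[7])
        totals[subject] += float(cols[11])     # bitscore
        starts[subject].append(min(qstart, qend))
    if not totals:
        raise ValueError("no BLAST hits found")
    best = max(totals, key=totals.get)
    return best, totals[best], min(starts[best])
```

Summing per database sequence means that a gene hit split into several alignments is compared as a whole rather than fragment by fragment.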

Comparative example 1:

The genome sequence contig1 assembled in step (7) of Example 1 was subjected to gene prediction by a conventional method.

GeneMarkS was run with its parameters set for a linear viral genome. The prediction results are shown in FIG. 2. Compared with the prediction results of Example 1, gene4 of FIG. 1 is missing, and gene1 of FIG. 1 is predicted as two separate genes in FIG. 2, which does not match the actual situation.

Claims

1. A PEDV virus genome analysis method based on next generation sequencing is characterized by comprising the following steps:

2) running Bowtie2 software, aligning the sequencing data to the host reference sequence of step 1, and removing the sequencing data that align to the host sequence to obtain sequencing reads free of host contamination;

2. The PEDV virus genome analysis method based on next generation sequencing of claim 1, wherein the Bowtie2 software is run with default parameters.