CN111199772A - PEDV (porcine epidemic diarrhea Virus) genome analysis method based on next generation sequencing - Google Patents
- Publication number
- CN111199772A
- Authority
- CN
- China
- Prior art keywords
- sequence
- host
- genome
- pedv
- software
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Biotechnology (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioinformatics & Computational Biology (AREA)
- Chemical & Material Sciences (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Analytical Chemistry (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a PEDV genome analysis method based on second-generation sequencing, comprising the following steps: downloading the genome sequence and transcript sequences of the host pig as a host reference sequence; obtaining sequencing reads free of host contamination; assembling the genome; performing homologous alignment on the assembled genome; selecting an alignment result; and taking the start site of the alignment as the start site of the gene. Compared with the prior art, the method predicts the gene structure accurately and avoids missing genes.
Description
Technical Field
The invention relates to the field of gene detection, in particular to a PEDV genome analysis method based on second-generation sequencing.
Background
First, PEDV second-generation sequencing data contain some, or even a large number of, host pig genome sequences, and this host genome contamination can interfere with assembly of the PEDV genome. Second, after assembly, viral gene structure is mainly predicted with software such as GeneMarkS; however, because genes in the PEDV genome overlap one another and ribosomal frameshifting also occurs within them, conventional gene prediction software cannot accurately identify the correct gene structure.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a PEDV virus genome analysis method based on second-generation sequencing.
Compared with existing methods, the analysis method provided by the invention removes host genome contamination with an alignment strategy before genome assembly and then assembles the reads with the SPAdes assembler. For gene structure identification, gene information from similar virus genomes is collected and organized into a database of PEDV gene sequences, and genes are predicted by homology.
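As a rough illustration of the tool chain named above, the following Python sketch only builds the command lines and does not run them. It assumes paired-end reads, nucleotide BLAST (`blastn`), a hypothetical `work` directory and hypothetical file names, none of which are specified by the invention. Using the Bowtie2 `--un-conc` option to write out read pairs that did not align concordantly to the host is one possible way to obtain host-free reads, not necessarily the one used here; the extraction of a fixed data amount before assembly is left out of this sketch.

```python
from typing import List


def build_commands(host_fasta: str, reads_1: str, reads_2: str,
                   pedv_genes_fasta: str, workdir: str = "work") -> List[List[str]]:
    """Return argument lists for the main tools (nothing is executed).

    All paths are hypothetical placeholders chosen for this illustration.
    """
    index = f"{workdir}/host_index"
    # Bowtie2 replaces '%' with 1 or 2 to name the per-mate output files.
    clean_pattern = f"{workdir}/clean_R%.fastq"
    assembly_dir = f"{workdir}/spades"
    blast_db = f"{workdir}/pedv_genes"
    return [
        # Index the host (pig) genome and transcript sequences.
        ["bowtie2-build", host_fasta, index],
        # Align reads to the host; pairs that do not align concordantly
        # are written to clean_R1.fastq / clean_R2.fastq.
        ["bowtie2", "-x", index, "-1", reads_1, "-2", reads_2,
         "--un-conc", clean_pattern, "-S", f"{workdir}/host.sam"],
        # Assemble with the k-mer sizes and coverage cutoff named in the text.
        ["spades.py", "-1", f"{workdir}/clean_R1.fastq",
         "-2", f"{workdir}/clean_R2.fastq",
         "-k", "55,77,121", "--cov-cutoff", "auto", "-o", assembly_dir],
        # Build a nucleotide BLAST database from the collected PEDV genes.
        ["makeblastdb", "-in", pedv_genes_fasta, "-dbtype", "nucl",
         "-out", blast_db],
        # Align the assembled contigs against the gene database (tabular output).
        ["blastn", "-query", f"{assembly_dir}/contigs.fasta", "-db", blast_db,
         "-outfmt", "6", "-out", f"{workdir}/hits.tsv"],
    ]
```

Each inner list can be passed to `subprocess.run(..., check=True)` once the paths point at real files; keeping the commands as data makes them easy to log and inspect before anything is executed.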
In order to realize the purpose of the invention, the adopted specific technical scheme is as follows:
A PEDV genome analysis method based on second-generation sequencing comprises the following steps:
1) downloading the genome sequence and transcript sequences of the host pig as a host reference sequence;
2) running Bowtie2, aligning the sequencing data to the host reference sequence from step 1, and removing the sequencing data that align to the host sequence, to obtain sequencing reads free of host contamination;
3) extracting, from the host-decontaminated reads obtained in step 2, an amount of data equal to 150X the PEDV genome size for use in assembly;
4) running SPAdes with the k-mer sizes set to 55, 77 and 121 and the cov-cutoff value set to auto, and assembling the genome with the reads obtained in step 3 as input;
5) batch-downloading the gene sequences of all PEDV strains in the NCBI database with the EDirect software provided by NCBI;
6) running makeblastdb to build a database from the gene sequences downloaded in step 5;
7) running BLAST to perform homologous alignment of the genome assembled in step 4 against the database built in step 6;
8) selecting, from the alignment results obtained, the result with the largest Total score as the final result;
9) according to the alignment result selected in step 8, taking the start site of the alignment as the start site of the gene.
In a preferred embodiment of the invention, Bowtie2 is run with default parameters.
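Steps 2 and 3 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by the invention: reads are represented as `(name, sequence, quality)` tuples, the set of host-aligned read names is assumed to have been extracted from the Bowtie2 output beforehand, and the genome size of about 28 kb is an assumed round figure consistent with the 26785 bp contig reported in Example 1. The function names are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality); hypothetical layout


def target_bases(genome_size_bp: int, depth: int = 150) -> int:
    """Amount of data (in bases) that corresponds to `depth`X coverage."""
    return genome_size_bp * depth


def decontaminate_and_truncate(reads: Iterable[Read], host_names: Set[str],
                               max_bases: int) -> Iterator[Read]:
    """Drop reads that aligned to the host, then keep reads until the
    requested number of bases has been collected."""
    kept = 0
    for name, seq, qual in reads:
        if name in host_names:
            continue  # read aligned to the pig reference: discard it
        if kept >= max_bases:
            break  # enough data for assembly has been collected
        kept += len(seq)
        yield name, seq, qual
```

With an assumed genome size of 28,000 bp, `target_bases(28_000)` gives 4,200,000 bases. Taking the first reads in file order is the simplest reading of "intercepting" a data amount; random subsampling would be an alternative design choice.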
The invention has the beneficial effects that:
Compared with the prior art, the method predicts the gene structure accurately and avoids missing genes.
Drawings
FIG. 1 is a schematic diagram showing the results of gene prediction in the embodiment of the present invention.
FIG. 2 is a schematic view showing the gene prediction result of the comparative example of the present invention.
FIG. 3 is a flow chart of the present invention.
Detailed Description
The invention is further illustrated by the following specific examples.
Example 1:
1. PEDV sample PEDV1 was sequenced on an Illumina MiSeq sequencer, generating 2.4 G of raw data; after adapter removal and quality control, 2.3 G of high-quality sequence in FASTQ format was obtained.
2. Because the host of PEDV is the pig, the pig reference genome and transcript sequences were downloaded from the Ensembl website.
3. bowtie2-build was run to build an index from the reference sequences downloaded in step (2).
4. Bowtie2 was run (default parameters) to align the sequencing data to the host reference sequence from step (2).
5. Reads that aligned to the pig reference sequence were removed according to the alignment result, giving sequencing reads free of host contamination.
6. Because an excessive amount of data is unfavorable for assembly, about 4.2 Mb of data, roughly 150 times the PEDV genome size of about 28 kb, was extracted from the host-decontaminated reads obtained in step (5).
7. SPAdes was run with the k-mer sizes set to 55, 77 and 121 and the cov-cutoff value set to auto, using the reads obtained in step (6) as input, to assemble the genome. One contig, contig1, was obtained, with a length of 26785 bp and a coverage depth of 137.58X.
8. The gene sequences of all PEDV strains in the NCBI database were batch-downloaded with the EDirect software provided by NCBI.
9. makeblastdb was run to build a database from the gene sequences downloaded in step (8).
10. BLAST was run to perform homologous alignment of the genome assembled in step (7) against the database built in step (9).
11. From the alignment results obtained, the result with the largest Total score was selected as the final result.
12. According to the alignment result selected in step (11), the start site of the alignment was taken as the start site of the gene. The gene prediction results are shown in FIG. 1.
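The selection in the last two steps can be sketched in Python as below. This is an illustration under several assumptions rather than the procedure actually used: the BLAST hits are in the 12-column tabular format (`-outfmt 6`), there is a single query contig, the Total score of a database sequence is approximated by summing the bit scores of all its alignments, and database sequence identifiers end in a gene label after a `|` character (a hypothetical naming scheme, for example `strainA|S`). The function name is also hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pick_gene_starts(lines: Iterable[str]) -> Dict[str, Tuple[str, float, int]]:
    """For each gene label, return (best subject id, total score, start site).

    The start site is the smallest query coordinate among the alignments of
    the subject sequence with the largest summed bit score.
    """
    totals: Dict[str, float] = defaultdict(float)  # subject id -> summed bit score
    starts: Dict[str, int] = {}                    # subject id -> smallest query position
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        sseqid = fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        totals[sseqid] += float(fields[11])  # column 12 is the bit score
        low = min(qstart, qend)
        starts[sseqid] = min(starts.get(sseqid, low), low)

    best: Dict[str, Tuple[str, float, int]] = {}
    for sseqid, total in totals.items():
        gene = sseqid.split("|")[-1]  # hypothetical "strain|gene" identifiers
        if gene not in best or total > best[gene][1]:
            best[gene] = (sseqid, total, starts[sseqid])
    return best
```

Grouping by gene label is a design choice made for this sketch so that every gene in the database can receive a start site; the text itself only states that the result with the largest Total score is selected.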
Comparative example 1:
Next, gene prediction was performed by a conventional method on the genome sequence contig1 assembled in step (7) of Example 1.
GeneMarkS was run with the parameters set to the linear virus mode; the prediction results are shown in FIG. 2. Compared with the prediction results of Example 1, gene4 of FIG. 1 is missing, and gene1 of FIG. 1 is predicted as two genes in FIG. 2, which does not match the actual gene structure.
Claims (2)
1. A PEDV genome analysis method based on second-generation sequencing, characterized by comprising the following steps:
1) downloading the genome sequence and transcript sequences of the host pig as a host reference sequence;
2) running Bowtie2, aligning the sequencing data to the host reference sequence from step 1, and removing the sequencing data that align to the host sequence, to obtain sequencing reads free of host contamination;
3) extracting, from the host-decontaminated reads obtained in step 2, an amount of data equal to 150X the PEDV genome size for use in assembly;
4) running SPAdes with the k-mer sizes set to 55, 77 and 121 and the cov-cutoff value set to auto, and assembling the genome with the reads obtained in step 3 as input;
5) batch-downloading the gene sequences of all PEDV strains in the NCBI database with the EDirect software provided by NCBI;
6) running makeblastdb to build a database from the gene sequences downloaded in step 5;
7) running BLAST to perform homologous alignment of the genome assembled in step 4 against the database built in step 6;
8) selecting, from the alignment results obtained, the result with the largest Total score as the final result;
9) according to the alignment result selected in step 8, taking the start site of the alignment as the start site of the gene.
2. The PEDV genome analysis method based on second-generation sequencing of claim 1, wherein Bowtie2 is run with default parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911381880.0A CN111199772B (en) | 2019-12-27 | 2019-12-27 | PEDV (porcine epidemic diarrhea virus) genome analysis method based on second-generation sequencing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111199772A (en) | 2020-05-26 |
CN111199772B (en) | 2023-05-23 |
Family
ID=70747664
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911381880.0A Active CN111199772B (en) | PEDV (porcine epidemic diarrhea virus) genome analysis method based on second-generation sequencing | 2019-12-27 | 2019-12-27 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111199772B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008000186A1 (en) * | 2006-06-21 | 2008-01-03 | Beijing Bioway-Fortune Research Center For Gene Drugs Ltd. | A method for identifying novel gene and the resulting novel genes |
WO2014005329A1 (en) * | 2012-07-06 | 2014-01-09 | 深圳华大基因科技有限公司 | Method and system for determining integration manner of foreign gene in human genome |
CN108197434A (en) * | 2018-01-16 | 2018-06-22 | 深圳市泰康吉音生物科技研发服务有限公司 | The method for removing human source gene sequence in macro gene order-checking data |
Non-Patent Citations (1)
Title |
---|
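Li Wenli; Lin Lan; Wang Naihong; Luan Yushi: "Prediction and functional analysis of VmiRNAs encoded by the Antheraea pernyi nucleopolyhedrovirus genome" * |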
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111816249A (en) * | 2020-06-01 | 2020-10-23 | 上海派森诺生物科技股份有限公司 | Genome cyclization analysis method |
CN111816249B (en) * | 2020-06-01 | 2023-12-08 | 上海派森诺生物科技股份有限公司 | Cyclization analysis method of genome |
CN116426696A (en) * | 2023-06-14 | 2023-07-14 | 北京大学人民医院 | Plasma virus detection and analysis method based on sequencing technology |
CN116426696B (en) * | 2023-06-14 | 2024-01-26 | 北京大学人民医院 | Plasma virus detection and analysis method based on sequencing technology |
Also Published As
Publication number | Publication date |
---|---|
CN111199772B (en) | 2023-05-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||