CN111199772A - PEDV (porcine epidemic diarrhea Virus) genome analysis method based on next generation sequencing - Google Patents

Publication number
CN111199772A
Authority
CN
China
Prior art keywords
sequence
host
genome
pedv
software
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911381880.0A
Other languages
Chinese (zh)
Other versions
CN111199772B (en)
Inventor
崔天一
孙子奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Personal Biotechnology Co ltd
Original Assignee
Shanghai Personal Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Personal Biotechnology Co ltd
Priority to CN201911381880.0A
Publication of CN111199772A
Application granted
Publication of CN111199772B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a PEDV genome analysis method based on second-generation sequencing, comprising the following steps: downloading the genome sequence and transcript sequences of the host pig as the host reference sequence; removing host-derived reads to obtain sequencing reads free of host contamination; assembling the genome; performing a homology alignment of the assembled genome; selecting an alignment result; and taking the start site of the alignment as the start site of the gene. Compared with the prior art, the method can accurately predict the gene structure and avoids missing genes.

Description

PEDV (porcine epidemic diarrhea Virus) genome analysis method based on next generation sequencing
Technical Field
The invention relates to the field of gene detection, in particular to a PEDV virus genome analysis method based on second-generation sequencing.
Background
First, second-generation sequencing data from PEDV samples contain some, and sometimes a large number of, host pig genome sequences, and this host genome contamination can interfere with the assembly (splicing) of the PEDV genome. Second, the gene structure of an assembled viral genome is mainly predicted with software such as GeneMarkS, but because the genes in the PEDV genome overlap one another and ribosomal frameshifting also occurs within them, conventional gene prediction software cannot accurately identify the correct gene structure.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a PEDV virus genome analysis method based on second-generation sequencing.
Compared with existing methods, the analysis method provided by the invention removes host genome contamination with an alignment strategy before genome assembly, and then assembles the genome with the SPAdes software. For gene structure identification, gene information from similar virus genomes is collected and curated into a database of PEDV gene sequences, and genes are predicted by homology.
In order to realize the purpose of the invention, the adopted specific technical scheme is as follows:
a PEDV virus genome analysis method based on next generation sequencing comprises the following steps:
1) downloading the genome sequence and transcript sequences of the host pig as the host reference sequence;
2) running the Bowtie2 software to align the sequencing data against the host reference sequence from step 1, and removing the reads that align to the host sequence, to obtain sequencing reads free of host contamination;
3) from the host-free reads obtained in step 2, taking a subset for assembly whose data volume is about 150X the PEDV genome size;
4) running the SPAdes software with the k-mer sizes set to 55, 77 and 121 and the cov-cutoff value set to auto, and assembling the genome with the reads obtained in step 3 as input;
5) batch-downloading the gene sequences of all PEDV viruses in the NCBI database with the EDirect software provided by NCBI;
6) running the makeblastdb software to build a database from the gene sequences downloaded in step 5;
7) running the BLAST software to perform a homology alignment of the genome assembled in step 4 against the database built in step 6;
8) selecting, from the alignment results, the result with the largest Total score value as the final result;
9) taking the start site of the alignment selected in step 8 as the start site of the gene.
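The steps above map onto a handful of command-line calls. The following Python sketch only builds the argument lists and does not run anything. It assumes paired-end reads, a nucleotide search with `blastn` and tabular output, and hypothetical file names under a hypothetical `work` directory, none of which are specified in the text. The `--un-conc` option is one possible way to keep read pairs that do not align to the host. The 28,000 bp genome size used for the 150X target is also an assumption, chosen to be consistent with the 26,785 bp contig reported in Example 1. The EDirect download of step 5 is left out.

```python
def subsample_target_bases(genome_size_bp=28_000, depth=150):
    """Number of bases to keep for assembly in step 3.

    The default genome size is an assumed round figure, not a value
    taken from the method steps.
    """
    return genome_size_bp * depth


def build_commands(reads_1, reads_2, host_fasta, sub_1, sub_2, gene_fasta,
                   workdir="work"):
    """Return argument lists for steps 2, 4, 6 and 7 (all paths hypothetical)."""
    host_index = f"{workdir}/host_index"
    blast_db = f"{workdir}/pedv_genes"
    assembly_dir = f"{workdir}/spades"
    return {
        # Step 2, part 1: index the host genome and transcript sequences.
        "index_host": ["bowtie2-build", host_fasta, host_index],
        # Step 2, part 2: align with default parameters. '%' is replaced by
        # 1 or 2 in the files holding pairs that did not align concordantly.
        "remove_host": ["bowtie2", "-x", host_index,
                        "-1", reads_1, "-2", reads_2,
                        "--un-conc", f"{workdir}/nonhost_R%.fastq",
                        "-S", f"{workdir}/host.sam"],
        # Step 4: assemble the subsampled reads with the stated parameters.
        "assemble": ["spades.py", "-k", "55,77,121", "--cov-cutoff", "auto",
                     "-1", sub_1, "-2", sub_2, "-o", assembly_dir],
        # Step 6: build a nucleotide BLAST database from the PEDV genes.
        "make_db": ["makeblastdb", "-in", gene_fasta,
                    "-dbtype", "nucl", "-out", blast_db],
        # Step 7: align the assembled contigs against that database.
        "align_genes": ["blastn", "-query", f"{assembly_dir}/contigs.fasta",
                        "-db", blast_db, "-outfmt", "6",
                        "-out", f"{workdir}/hits.tsv"],
    }


if __name__ == "__main__":
    plan = build_commands("R1.fastq", "R2.fastq", "pig_reference.fa",
                          "sub_R1.fastq", "sub_R2.fastq", "pedv_genes.fa")
    for step, argv in plan.items():
        print(step, " ".join(argv))
    print("bases to keep:", subsample_target_bases())
```

Keeping the commands as argument lists makes it straightforward to hand each one to `subprocess.run` once the tools are installed and the placeholder paths are replaced.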
In a preferred embodiment of the invention, the Bowtie2 software is run with default parameters.
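Bowtie2 reports its alignments in SAM format, so host removal can be expressed as a filter on the SAM FLAG field. The sketch below is one possible reading of "removing the reads that align to the host", not a procedure stated in the text: a read name is kept only if none of its records is mapped, that is, bit 0x4 is set on every record carrying that name. The function name and the toy records are hypothetical.

```python
def host_free_read_names(sam_lines):
    """Return names of reads for which no record aligned to the host.

    A record is unmapped when bit 0x4 of its FLAG field is set. For paired
    reads both mates share one name, so a pair is dropped as soon as either
    mate aligns.
    """
    seen = set()
    aligned_to_host = set()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seen.add(name)
        if not flag & 0x4:
            aligned_to_host.add(name)
    return seen - aligned_to_host


if __name__ == "__main__":
    toy_sam = [
        "@HD\tVN:1.0\tSO:unsorted",
        "virus1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "virus1\t141\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
        "pig1\t99\tchr1\t100\t42\t4M\t=\t200\t104\tACGT\tIIII",
        "pig1\t147\tchr1\t200\t42\t4M\t=\t100\t-104\tCCGT\tIIII",
    ]
    print(sorted(host_free_read_names(toy_sam)))
```

Dropping a pair when only one mate aligns is the stricter choice; a more permissive filter would keep such pairs and is equally compatible with the description.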
The beneficial effects of the invention are as follows:
Compared with the prior art, the method can accurately predict the gene structure and avoids missing genes.
Drawings
FIG. 1 is a schematic diagram showing the results of gene prediction in the embodiment of the present invention.
FIG. 2 is a schematic view showing the gene prediction result of the comparative example of the present invention.
FIG. 3 is a flow chart of the present invention.
Detailed Description
The invention is further illustrated by the following specific examples.
Example 1:
1. PEDV sample PEDV1 was sequenced on an Illumina MiSeq sequencer, generating 2.4G of raw data; after removing sequencing adapters and performing quality control, 2.3G of high-quality sequence in fastq format was obtained.
2. Since the host of PEDV is the pig, the pig reference genome and transcript sequences were downloaded from the Ensembl website.
3. bowtie2-build was run to build an index, taking the reference sequences downloaded in step (2) as input.
4. The Bowtie2 software was run (default parameters) to align the sequencing data against the host reference sequence indexed in step (3).
5. Reads that aligned to the pig reference sequence were removed according to the alignment result, giving sequencing reads free of host contamination.
6. Since an excessive amount of data is unfavorable for assembly, about 4.2 Mb of data, roughly 150 times the PEDV genome size of about 28 kb, was taken from the host-free reads obtained in step (5).
7. The SPAdes software was run with the k-mer sizes set to 55, 77 and 121 and the cov-cutoff value set to auto, and the genome was assembled with the reads obtained in step (6) as input. A contig, contig1, with a length of 26,785 bp and a coverage depth of 137.58X was obtained.
8. The gene sequences of all PEDV viruses in the NCBI database were batch-downloaded with the EDirect software provided by NCBI.
9. The makeblastdb software was run to build a database from the gene sequences downloaded in step (8).
10. The BLAST software was run to perform a homology alignment of the genome assembled in step (7) against the database built in step (9).
11. From the alignment results, the result with the largest Total score value was selected as the final result.
12. According to the alignment result selected in step (11), the start site of the alignment was taken as the start site of the gene. The gene prediction results are shown in FIG. 1.
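The last two steps, choosing the result with the largest Total score and reading the gene start from the alignment, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation. It assumes BLAST tabular output with the default 12 columns (`-outfmt 6`), and it approximates Total score as the sum of the bit scores of all aligned segments between one query and one database sequence, since the tabular format has no such column. How hits are grouped per gene is not stated in the text, so the sketch simply returns the single best query and subject pair. The function name and the toy hits are hypothetical.

```python
from collections import defaultdict


def best_hit_by_total_score(blast_tab_lines):
    """Pick the (query, subject) pair with the largest summed bit score.

    Expects the default -outfmt 6 columns: qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore.
    Returns None when there are no hits.
    """
    total_score = defaultdict(float)
    first_start = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1])
        qstart, qend = int(fields[6]), int(fields[7])
        total_score[key] += float(fields[11])
        # The leftmost aligned query position is taken as the gene start.
        start = min(qstart, qend)
        first_start[key] = min(first_start.get(key, start), start)
    if not total_score:
        return None
    best = max(total_score, key=total_score.get)
    return {
        "query": best[0],
        "subject": best[1],
        "total_score": total_score[best],
        "gene_start": first_start[best],
    }


if __name__ == "__main__":
    toy_hits = [
        "contig1\tgeneA\t99.0\t100\t1\t0\t293\t392\t1\t100\t1e-50\t180.0",
        "contig1\tgeneA\t98.0\t50\t1\t0\t500\t549\t101\t150\t1e-20\t90.0",
        "contig1\tgeneB\t99.5\t120\t0\t0\t1000\t1119\t1\t120\t1e-60\t200.0",
    ]
    print(best_hit_by_total_score(toy_hits))
```

In the toy data geneA wins because its two segments together outscore the single geneB segment, which is the behaviour that distinguishes a total score from a maximum score.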
Comparative example 1:
Next, gene prediction was performed by a conventional method on the genome sequence contig1 assembled in step (7) of Example 1.
GeneMarkS was run with the parameters set to the linear virus mode, and the prediction results are shown in FIG. 2. Compared with the prediction results of Example 1, gene4 of FIG. 1 is missing from FIG. 2, and gene1 of FIG. 1 is predicted as two genes in FIG. 2; neither result matches the actual situation.

Claims (2)

1. A PEDV genome analysis method based on next generation sequencing, characterized by comprising the following steps:
1) downloading the genome sequence and transcript sequences of the host pig as the host reference sequence;
2) running the Bowtie2 software to align the sequencing data against the host reference sequence from step 1, and removing the reads that align to the host sequence, to obtain sequencing reads free of host contamination;
3) from the host-free reads obtained in step 2, taking a subset for assembly whose data volume is about 150X the PEDV genome size;
4) running the SPAdes software with the k-mer sizes set to 55, 77 and 121 and the cov-cutoff value set to auto, and assembling the genome with the reads obtained in step 3 as input;
5) batch-downloading the gene sequences of all PEDV viruses in the NCBI database with the EDirect software provided by NCBI;
6) running the makeblastdb software to build a database from the gene sequences downloaded in step 5;
7) running the BLAST software to perform a homology alignment of the genome assembled in step 4 against the database built in step 6;
8) selecting, from the alignment results, the result with the largest Total score value as the final result;
9) taking the start site of the alignment selected in step 8 as the start site of the gene.
2. The PEDV genome analysis method based on next generation sequencing of claim 1, wherein the Bowtie2 software is run with default parameters.
CN201911381880.0A 2019-12-27 2019-12-27 PEDV (porcine epidemic diarrhea virus) genome analysis method based on second-generation sequencing Active CN111199772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911381880.0A CN111199772B (en) 2019-12-27 2019-12-27 PEDV (porcine epidemic diarrhea virus) genome analysis method based on second-generation sequencing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911381880.0A CN111199772B (en) 2019-12-27 2019-12-27 PEDV (porcine epidemic diarrhea virus) genome analysis method based on second-generation sequencing

Publications (2)

Publication Number Publication Date
CN111199772A (en) 2020-05-26
CN111199772B (en) 2023-05-23

Family

ID=70747664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911381880.0A Active CN111199772B (en) 2019-12-27 2019-12-27 PEDV (porcine epidemic diarrhea virus) genome analysis method based on second-generation sequencing

Country Status (1)

Country Link
CN (1) CN111199772B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008000186A1 (en) * 2006-06-21 2008-01-03 Beijing Bioway-Fortune Research Center For Gene Drugs Ltd. A method for identifying novel gene and the resulting novel genes
WO2014005329A1 (en) * 2012-07-06 2014-01-09 深圳华大基因科技有限公司 Method and system for determining integration manner of foreign gene in human genome
CN108197434A (en) * 2018-01-16 2018-06-22 深圳市泰康吉音生物科技研发服务有限公司 The method for removing human source gene sequence in macro gene order-checking data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李文利;林澜;王乃红;栾雨时: "Prediction and functional analysis of VmiRNAs encoded by the Antheraea pernyi nucleopolyhedrovirus genome" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816249A (en) * 2020-06-01 2020-10-23 上海派森诺生物科技股份有限公司 Genome cyclization analysis method
CN111816249B (en) * 2020-06-01 2023-12-08 上海派森诺生物科技股份有限公司 Cyclization analysis method of genome
CN116426696A (en) * 2023-06-14 2023-07-14 北京大学人民医院 Plasma virus detection and analysis method based on sequencing technology
CN116426696B (en) * 2023-06-14 2024-01-26 北京大学人民医院 Plasma virus detection and analysis method based on sequencing technology

Also Published As

Publication number Publication date
CN111199772B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN105886616B (en) Efficient specific sgRNA recognition site guide sequence for pig gene editing and screening method thereof
Auer et al. Analysis of large 16S rRNA Illumina data sets: Impact of singleton read filtering on microbial community description
CN111199772A (en) PEDV (porcine epidemic diarrhea Virus) genome analysis method based on next generation sequencing
CN108197434B (en) Method for removing human gene sequence in metagenome sequencing data
US20130211729A1 (en) Data analysis of dna sequences
Wang et al. Computational resources for ribosome profiling: from database to Web server and software
CN107480470B (en) Known variation detection method and device based on Bayesian and Poisson distribution test
CN108103164B (en) Method for detecting copy number variation by using multiple fluorescent competitive PCR
CN113066532B (en) Method for analyzing virus source sRNA data in host based on high-throughput sequencing technology
Wu et al. MEC: Misassembly error correction in contigs based on distribution of paired-end reads and statistics of GC-contents
CN111676276A (en) Method for rapidly and accurately determining gene editing mutation condition and application thereof
CN110527714B (en) Method for detecting integration site of HPV in host genome
CN110970091A (en) Label quality control method and device
US20160103955A1 (en) Biological sequence tandem repeat characterization
CN113571132B (en) Method for judging sample degradation based on CNV result
CN110993022B (en) Method and device for detecting copy number amplification and method and device for establishing dynamic base line for detecting copy number amplification
CN114664379A (en) Third generation sequencing data self-correction error correction method based on deep learning
CN114822697A (en) Method for analyzing drug-resistant gene pollution of traced soil by using metagenome
CN103177198B (en) A kind of protein identification method
US20140379271A1 (en) System and method for aligning genome sequence
JP2021051597A (en) Image processing apparatus, image processing method, and computer program
EP1202211A2 (en) Genomic DNA analysis computer program
JP2009031128A (en) Device, method, and program for analyzing base sequence and base modification of nucleic acid
Novikov et al. A noise-resistant algorithm for grid finding in microarray image analysis
CN112927756B (en) Method and device for identifying rRNA pollution source of transcriptome and method for improving rRNA pollution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant