CN113035269B - Genome metabolism model construction, optimization and visualization method based on high-throughput sequencing technology - Google Patents
Genome metabolism model construction, optimization and visualization method based on high-throughput sequencing technology Download PDFInfo
- Publication number
- CN113035269B CN113035269B CN202110422896.2A CN202110422896A CN113035269B CN 113035269 B CN113035269 B CN 113035269B CN 202110422896 A CN202110422896 A CN 202110422896A CN 113035269 B CN113035269 B CN 113035269B
- Authority
- CN
- China
- Prior art keywords
- genome
- splicing
- sequence
- quality
- reads
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/10—Ontologies; Annotations
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Molecular Biology (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Genetics & Genomics (AREA)
- Physiology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method for constructing, optimizing and visualizing a genome metabolism model based on a high-throughput sequencing technology, which combines the I l umina second-generation sequencing technology and the PacB I o third-generation sequencing technology in the high-throughput sequencing field, carries out genome splicing and annotation with complementary advantages, compares a biological metabolism pathway database KEGG and MetaCyc through annotated protein information on the basis, and further obtains a whole genome scale metabolism network model according to the correlation of gene-enzyme-reaction according to genes corresponding to the proteins; and further searching a metabolic network model of the existing genetic species in the KEGG database, constructing relatively accurate genome-scale metabolic network models (GEMs) by an integration strategy of gene homology comparison genetic models, realizing flow automation and generating related result report files. The construction of the genome-scale metabolic network models GEMs based on genome sequence annotation and metabolic biochemical information integration is helpful for accelerating the reconstruction of microorganisms.
Description
Technical Field
The invention belongs to the field of metabolic genomics and computational simulation biology in bioinformatics, relates to a method for constructing, optimizing and visualizing a genome metabolic model, and particularly relates to a method for constructing, optimizing and visualizing a genome metabolic model based on a high-throughput sequencing technology.
Background
In order to promote the development of industrial biotechnology, further improvement in the metabolic capacity of designing and modifying microbial cells is urgently required. Along with the rapid development of whole genome high-throughput sequencing and the continuous emergence of bioinformatics analysis strategies, the global and systematic design and regulation of microbial metabolic functions become possible. The construction of genome-scale metabolic network models GEMs, where genome sequence annotation and integration of metabolic biochemical information are based, will help to better understand and expedite the engineering of microorganisms. Therefore, how to rapidly construct a high-precision GEM by using a high-throughput sequencing technology and large-scale computational simulation and effectively perform visual analysis makes the simulation transformation process of organisms rapidly controlled and micro-controlled become a problem which needs to be solved in the field at present.
Disclosure of Invention
The invention aims to integrate respective technical advantages of two-third-generation sequencing, improve genome assembly quality and genome annotation accuracy, calculate through a gene model expression to obtain a comprehensive score value of a gene prediction result, and determine a gene prediction model with the highest score as a primary gene function annotation result; matching the preliminary genome splicing annotation result by combining transcriptome data of similar species to complete the optimization of the genome sequence annotation result; the genome-scale metabolic network models GEMs are quickly constructed, the metabolic network is visualized, and a user can conveniently and quickly master the simulation transformation process, scientific research, analysis and search of microorganisms.
The invention relates to a method for constructing, optimizing and visualizing a genome metabolism model based on a high-throughput sequencing technology, which comprises the following steps of:
step1, performing sequence determination on species by respectively utilizing second-generation Illumina and third-generation PacBio high-throughput sequencing technologies to obtain an original sequence file with quality control;
step2, respectively carrying out quality statistics and evaluation on the original sequence file, if the evaluation is qualified, executing step3, otherwise, returning to execute step 1;
step3, removing primers, joint sequences and reading sequences with quality/length lower than a set requirement from the original sequence file respectively to obtain screened sequence data;
step4, splicing the second generation Illumina high-throughput screening data;
splicing the screened second-generation Illumina sequencing data according to different k-mer size parameters, and selecting a splicing result with the maximum parameter N50 as a contig file spliced by the second-generation Illumina sequencing data;
step 5, carrying out first splicing on the three generations of PacBio high-throughput screening data;
carrying out first splicing on the screened third-generation PacBio sequencing data through splicing software to obtain a contig file subjected to first splicing, wherein the splicing step comprises the following steps:
step 51, correction: stacking the screened reads together for base correction, and improving the accuracy of the bases in the reads;
step 52, trimming: determining a high-quality area and a low-quality area of reads, trimming the low-quality area of the reads, reserving a single highest-quality sequence block, and deleting the low-quality sequence block;
step 53, assembling: sequencing reads sequences into contigs files, generating corresponding consensus sequence consensus sequences, and creating paths in which consensus sequences are connected with each other;
step 6, fusing and embedding the contig file of the second generation Illumina into the third generation PacBio high throughput screening data for the second splicing to obtain a high-quality genome splicing result, and the specific steps are as follows:
step 61, carrying out sequence error correction on the first-time splicing data of the PacBio to obtain a reads sequence corrected based on the PacBio;
step 62, using the contig file spliced by the second generation Illumina as a supplement of a reads sequence corrected by the PacBio, filling gap in the third generation PacBio sequencing, and carrying out second splicing on the genome to obtain the contig file of the PacBio;
step 63, comparing the contig files of PacBio further by utilizing Illumina original sequencing data, filtering high-quality compared reads, and correcting the base errors in the second splicing result by using a bam file generated by an align part as a basis through base error revision software Pilot;
step 64, further extending the corrected splicing result of the Illumina original reads data PacBio through extension software to finally obtain a high-quality genome splicing result;
step 7, predicting and matching the genome splicing result to finish genome sequence annotation;
according to the splicing result, gene prediction software Augustus, maker and Braker are respectively used for obtaining all gene prediction results, the predicted results are respectively compared with an nt database, a BUSCO database and a Pfam database of NCBI to obtain corresponding scores, a gene model expression is used for calculating to obtain a comprehensive score value of the gene prediction model, the gene prediction model with the highest score is determined as a primary genome splicing annotation result, the primary genome splicing annotation result is matched by combining the transcriptome of similar species, if the matching degree of sequencing read reads and genome is greater than 80%, the primary genome splicing annotation result is determined as a final result, genome sequence annotation is completed, otherwise, the gene prediction software is changed, and prediction and selection are carried out again;
step 8, constructing genome metabolic network models GEMs;
step 9, building a visual network platform of the metabolic network models GEMs through MetExploreViz, and visually displaying the metabolic model GEMs;
and step 10, carrying out interface visualization FBA flow balance analysis on the metabolism network models GEMs by utilizing optFlux software.
Further, the quality below the set requirement in step3 refers to bases with quality below 20, and the length below the set requirement refers to sequences with length less than 100 bp.
Further, the different k-mer size parameters in step4 are 21,33,55,77 and 99 respectively.
Further, the expression of the gene model in the step 7 is as follows:
Evidence score(gene model)=BLASTp_score*cov(query)*cov(target)+BUSCO_score+Pfam_scores+BLASTn_score*cov(query)*cov(target)
wherein BLASTp score is the score derived from the BLASTp protein sequence alignment program;
BLASTp score is the score derived from the BLASTn sequence base alignment program; BUSCO _ score is a score obtained by comparing a Benchmarking Universal Single-Copy orthopology database derived from BUSCO; pfam _ scores is a score obtained by comparison in a Pfam database; cov (query) is the full-length alignment coverage of the query sequence; cov (target) is the full-length alignment coverage of the target sequence; eidferencescore is the integrated score value of the final gene model.
Further, the specific steps in step 8 include:
step 81, respectively comparing protein sequences formed by genes injected and released by a RAVEN to biological pathway databases KEGG and MetaCyc to obtain reactions Rxns, metabolites Mets and gene numbers related to corresponding metabolic network models, merging the same reactions, metabolites and related genes, performing de-duplication fusion, and finally obtaining metabolic network models GEMs based on KEGG and MetaCyc;
step 82, searching a metabolic network model GEMs of the species closest to the species in a KEGG database, acquiring involved reactions Rxns, metabolites Mets and genes through a gene homology comparison strategy, and supplementing the involved reactions Rxns, metabolites Mets and genes into the acquired metabolic network model GEMs based on KEGG and MetaCyc;
and step 83, perfecting the whole metabolic network model GEMs through GapFind and GapFll to enable the model to obtain more genes and reactions.
Further, the second generation Illumina high-throughput sequencing technology in the step2 determines the sequence quality of the species with the evaluation standard that the clear Data Q20 is more than 85%; the evaluation standard of the third generation PacBio high-throughput sequencing technology for sequencing quality determination of species is that Mean Concordance is more than 85%;
further, the high quality region in step 5 is a region where the base mass of reads is 20 or more, and the low quality region is a region where the base mass of reads is 20 or less.
Further, the species is yeast.
Further, the yeast is yarrowia lipolytica.
The invention has the following effects:
the method effectively integrates Illumina second-generation sequencing and PacBio third-generation sequencing data to obtain a high-quality genome chromosome horizontal assembly result; integrating a plurality of eukaryotic gene prediction software results and combining the real transcriptome data of the species to obtain the most reliable and effective genome sequence annotation result; rapidly constructing genome-scale metabolic network models (GEMs) based on the above genome sequence annotation result; the metabolic network is visualized, so that scientific research personnel or users can conveniently analyze and search; carrying out rapid and effective flow balance analysis on the existing GEMs to complete the simulation, reconstruction and prediction of microorganisms; the method obviously improves the working efficiency of microorganism modification engineering, reduces the scientific research cost, can be widely applied to various microorganism research works in high-throughput sequencing, and is particularly suitable for the analysis and improvement research of various engineering bacteria in the field of practical microorganism application.
Drawings
FIG. 1 is a flow chart of genome splicing assembly in the method for constructing, optimizing and visualizing the genome metabolism model based on the high throughput sequencing technology according to the present invention;
FIG. 2 is a flow chart of the genomic sequence annotation of the present invention;
FIG. 3 is a schematic diagram of the annotation result of the genome according to the present invention;
FIG. 4 is a flowchart of the metabolic network model construction of the present invention;
FIG. 5 is a diagram showing the distribution of 459 unique genes on a chromosome according to the present invention;
FIG. 6 is a graph illustrating the evaluation of three assembly results according to the present invention;
FIG. 7 is a diagrammatic illustration of the sequence splicing process of the present invention;
FIG. 8 is a diagram of a sequence alignment of a reference chromosomal genome with a presumed corresponding chromosomal genome;
FIG. 9 is a graph of the results of a highly matched Scaffold contig-Scaffold86_1 according to the present invention;
FIG. 10 is a diagram illustrating data result storage according to the present invention.
Detailed Description
Embodiments of the present invention will be described below with reference to fig. 1 to 10.
To specifically illustrate the technical solution, the Chinese and English comparison table related in the solution is provided as follows:
the invention discloses a genome splicing and annotation method which is based on the combination of second-generation sequencing represented by Illumina and third-generation sequencing technology of PacBio and has complementary advantages and is widely applied in the field of high-throughput sequencing, and accurate genome-scale metabolic network models (GEMs) are constructed on the basis. The method comprises the steps that firstly, a user-defined parameter file is generated by a system, after the parameter file is modified and set by a user, a batch processing executable file is generated by a high-throughput data processing flow module, the batch processing executable file is executed by the system, the automation of the whole process is realized, and a related result report file is generated. Therefore, biological information researchers, even researchers who do not know the programming script for data analysis, can efficiently complete high-throughput sequencing data analysis, and the construction of genome scale metabolic network models (GEMs) based on genome sequence annotation and metabolic biochemical information integration can be helpful for the researchers to understand and accelerate the microbial transformation globally. The process of the invention can not only be used for the research of the eukaryotic microorganism such as yeast, but also be popularized and applied to the GEMs reconstruction of other eukaryotic organisms, even prokaryotic organisms such as bacteria and the like, such as escherichia coli, and the application range is wider.
In order to solve the above problems, the present invention provides a method for constructing, optimizing and visualizing a genome metabolic model based on a high-throughput sequencing technology, comprising the following steps:
1. the steps for obtaining the assembly result of the high-quality chromosome level are shown in the attached figure 1: and inputting a set splicing parameter configuration file by a user according to the analyzed species, respectively importing high-throughput sequencing original sequence data of second-generation sequencing and third-generation sequencing, and screening and splicing by respective special processing flow software to obtain an effective genome splicing result, namely a corresponding contig file. And further analyzing the integration result, splicing to obtain a genome full-length sequence on the chromosome level, and performing subsequent bioinformatics analysis and metabolic network model construction on the basis, wherein the method specifically comprises the following steps:
(a) Respectively importing two high-throughput sequencing original sequence files of second-generation sequencing and third-generation sequencing of species;
(b) Respectively performing quality control and statistics on the original sequence files by using respective basic processing flows, and eliminating low-quality sequence data to obtain screened sequence data;
(c) Respectively splicing the screened data and assembling the spliced data into contig files of the genome;
according to the initial sequencing data of Illumina, after a linker sequence and a tail low-quality (q < 20) base sequence are removed, different k-mer size parameter results are calculated through SPAdes splicing software, the length of an effective contig sequence is selected to be larger than 500bp, and the splicing result with the maximum N50 is selected as a final Illumina assembly result. The PacBio original sequencing data is subjected to an open-source software canu standard flow to obtain a splicing assembly contigs result of the PacBio data, then the two contigs files are combined, software SSPACE is further extended through base error revision software Pilot and contig, illumina original reads data and the assembly contig result are utilized again, the splicing result of the PacBio is used as a framework, and the quality of the assembly result is further improved. In the step (C), SPAdes (v3.14.1) splicing software is used for splicing Illumina high-throughput sequencing data with different k-mer lengths (21,33,55,77,99), a single-cell mode (set-sc parameter) can be further selected for some data sets with high GC content, low coverage rate or uneven coverage, and finally the splicing result with the maximum N50 is selected as a contig file for splicing Illumina sequencing data for next analysis. The high-throughput sequencing data of PacBio are spliced by Canu (v2.1.1) splicing software, the splicing process is divided into three steps, the first step is correction (Correct), namely, read 'stacking' (Pileup) is corrected together, the accuracy of bases in reads is improved, generally, errors in third-generation sequencing are mainly caused by one more base or one less base, and because single-molecule sequencing sometimes can lead to signal connection when two bases are detected or sometimes leads to the errors when the same base is detected for multiple times; the second step is pruning (Trim), which uses overlap to determine which regions of the read are high quality regions and which regions are of lower quality and require trimming. Finally, reserving a single sequence block with the highest quality, and deleting suspicious areas; the third step is assembly, i.e. sorting reads into contigs, generating corresponding consensus sequences and creating paths where possible consensus sequences are connected to each other.
(d) Then integrating two sequencing splicing results, taking the third-generation result with longer sequencing as a skeleton, correcting and prolonging by using the second-generation sequencing reads with lower sequencing error rate, and finally obtaining the optimal genome assembly result; in the step (d), the preliminary contig file is taken through respective standard splicing flows, then the long contig file of PacBio is used as a framework, the contig file of Illumina and high-quality reads sequencing data are used as supplementary aids, the final version of genome splicing assembly is taken under the assistance of the correction of the base error correction software Pilot (v 1.23) and the extension of the sequence further extension software SSPACE (v 2.1.1), and the next step of genome annotation and metabolic model construction is carried out.
(1) The steps for obtaining reliable and effective genome sequence annotation are shown in the attached figure 2:
according to the splicing result of the last step, a user selects species transcriptome reference data according to the analyzed species, verifies all predicted results according to various gene prediction software results in eukaryotes, removes low-score gene prediction models according to comprehensive scoring results, and finally determines the most reliable gene prediction result to finish annotation;
(a) Respectively introducing a transcriptome high-throughput sequencing original sequence file of a reference species and a final genome splicing result file in the step (1);
(b) A repeat modeler is used for carrying out modeling annotation on a repeat sequence family of a genome from scratch, and a RepeatMasker is used for detecting and shielding repeat sequences in the genome, because a repeat sequence structural region exists in the genome, particularly in higher eukaryotes, the repeat sequences account for a large proportion, the quality of gene prediction is influenced, and unnecessary resource consumption is brought.
Gene annotation results after genome splicing were then done by three common gene prediction software, augustus, marker and Braker. For reference transcriptome data, the splicing assembly result of RNA-seq is obtained through a Hisat/Trinity flow, and the gene prediction of the result is completed by using marker and Braker. And all the prediction results are evaluated and scored, corresponding scores are obtained through BLASTp, BUSCO, interProScan and BLASTn, and finally the prediction result with the highest Score of the evaluation Score is obtained as a final gene prediction model through the following expression of the evaluation Score of the gene model.
Evidence score(gene model)=BLASTp_score*cov(query)*cov(target)+BUSCO_score+Pfam_scores+BLASTn_score*cov(query)*cov(target)
Wherein BLASTp score is the score derived from the BLASTp protein sequence alignment program; BLASTp score is the score derived from the BLASTn sequence base alignment program; BUSCO _ score is a score derived from alignment of the benchmark Universal Single-Copy ordology (BUSCO) database; pfam _ scores is a score obtained by comparison in a Pfam database; cov (query) is the full-length alignment coverage of the query sequence; cov (target) is the full-length alignment coverage of the target sequence; eidferencescore is the integrated score value of the final gene model.
(2) The steps for constructing the metabolic network models GEMs are shown in the attached figure 4: selecting an open-source RAVEN tool set according to the parameter configuration file and the genome sequence annotation result, and generating a corresponding automatic GEMs (genetic engineering models) construction process, wherein the method specifically comprises the following steps:
(a) By utilizing a RAVEN2.0 tool set platform, protein sequences injected and released from a genome are compared with two major biological access databases of KEGG and MetaCyc to respectively obtain reaction Rxns, metabolite Mets and involved gene numbers related to a corresponding metabolic network model. Combining the same reaction, metabolite and related gene to finally obtain a metabolic network model GEMs based on the two main flow platforms.
(b) Searching the closest species by using the existing metabolic network models GEMs in the KEGG database, and acquiring the involved reactions Rxns, metabolites Mets and genes through a homologous comparison strategy of the genes, wherein the involved reactions Rxns, metabolites Mets and genes are used as further supplements and are integrated into the obtained metabolic network models GEMs.
(c) Finally, the integrated metabolic network model GEMs are perfected through GapFind and GapFlil in the platform flow,
(3) Executing and outputting: executing the described automated GEMs construction process, obtaining a genome-scale metabolic network model GEMs report based on the sequence annotation result and finishing visual display, wherein the method specifically comprises the following steps:
the GEM network visualization evaluation analysis is completed by using, but not limited to, model explore2.0, and the quality of the metabolic network model is determined by evaluating the number of genes involved, the connectivity of the network and the accuracy of the predicted biological functions after gene knockout. And further building a visual network platform of the metabolism model GEMs through MetExploreViz, so that a user can conveniently and specifically check certain specific metabolic pathways and reactions.
(4) The flow balance analysis of the interface visualization of the metabolic network models GEMs is mainly but not limited to the related FBA flow balance analysis by using open-source optFlux software.
According to the invention, in the process of a splicing module, the high-throughput sequencing data mainly based on second generation Illumina and the high-throughput sequencing data of third generation PacBio are effectively combined, advantages are complemented, so that the species genome splicing can approach or reach the chromosome assembly level, in the process of a genome annotation module, real similar species transcriptome data are combined, a plurality of gene prediction software is used to obtain all possible prediction results, the predicted results are respectively compared with an nt database of NCBI, a BUSCO database and a Pfam database to obtain corresponding scores, the comprehensive score value of a gene prediction model is obtained through calculation of an Even score formula, and the gene prediction model with the highest comprehensive score value is finally determined as the most reliable genome splicing annotation result. Finally, on the process of constructing the genome metabolic network model, the protein sequences annotated by the genome are utilized, a relatively complete genome scale metabolic network model GEMs is obtained finally by comparing the metabolic networks related to homologous genes in KEGG and MetaCyc databases and extracting the metabolic networks related to homologous proteins in the existing similar species GEMs. The analysis process of the invention has clear thought, focuses on integrating the respective technical advantages of the second-third-generation sequencing, improves the genome assembly quality and the accuracy of genome annotation, finally obviously improves the working efficiency of microorganism modification engineering, reduces the scientific research cost, can be widely applied to various microorganism research works in high-throughput sequencing, and is particularly suitable for analysis and improvement research of various engineering bacteria in the field of practical microorganism application.
The method comprises the steps of firstly providing a user-defined parameter configuration file by a system process, generating a batch processing executable file corresponding to a data process according to the user-defined parameter file and a high-throughput data processing process module after a user modifies set parameters, executing the batch processing executable file by a process system to realize data process analysis and generate a result report file, thereby efficiently helping a biological information analyzer to complete a set of standardized high-throughput data integration analysis process, and even allowing scientific researchers who do not know the high-throughput data analysis and metabolic network modeling to complete splicing, assembling, annotation and analysis of sequencing data to establish a genome-scale metabolic network model. The splicing skill and the annotation method can be further popularized and applied to more subsequent high-throughput sequencing fields.
The above-mentioned embodiments are further described with reference to specific embodiments, which are intended to illustrate the present invention but not to limit the scope of the present invention, and the relevant conditions and parameter settings used in the embodiments can be further adjusted according to the requirements of specific applications, and the un-noted implementation conditions are generally the conditions in routine experiments.
Firstly, filtering the original data, namely removing a linker sequence and a low-quality reads sequence, and finishing the processing and splicing of Illumina and PacBio early-stage reads by respectively utilizing respective standard processes.
Statistics of Illumina sequencing data quality are given in the following table:
in the above table: sample ID-name of the sequencing Sample; insert Size (bp), the length of the inserted DNA fragment when the library is built; clean Reads Length (bp): length of reads sequencing; raw Data (Mb): sequencing the total quantity of bases by using original data; clean Q20 (%)% of the number of bases with sequencing error rate less than 1%; clear Q30 percent (percentage of the number of bases with a sequencing error rate of less than 0.1 percent); GC (%). C + G bases are the number percentage of all bases.
Statistics for the quality of PacBio sequencing data are shown in the following table:
in the above table: sample ID-name of the sequencing Sample; mean Concordance: averaging the consistency scores; number of Reads: the number of reads; number of Bases: the amount of data; mean Read Length: average sequencing read length: n50 ReadLength (bp): sequencing reads were N50 long, i.e. sequencing data were ranked from large to small, with sequencing reads at 50% of the total.
Optimizing sequencing data quality: sequencing errors such as point mutations and the like usually occur in high-throughput sequencing, the quality of the sequence end is generally low, and the sequence end contains a linker sequence, so that in order to improve the splicing quality and obtain a more accurate biological information analysis result, the original data needs to be optimized.
The method comprises the following specific steps and parameter setting: removing primer and adaptor sequences from Illumina sequencing raw data by using Trimmomatic (v 0.32), removing bases with the mass of lower than 20 at two ends, and removing sequences with the length of less than 100 bp; and (3) splicing Illumina sequencing data by using SPAdes (v 3.14.1), selecting a paired reads input form and defaulting parameters, and finishing contig assembly splicing. The PacBio sequencing data was stitched using Canu (v2.1.1) to set genoMeSize to 20m (yeast reference genome size), correct ErrorRate default settings (0.045 for PacBio reads, and 0.144for Nanopore reads) would increase corrected ErrorRate by 1% for coverage below 30X and decrease corrected ErrorRate by 1% for sequencing coverage above 60X depending on sequencing depth. The other parameters are set based on default parameters, such as minoverlapLength < integer =500>, and minReadLength < integer =1000>. Using Pilon (v 1.23), FASTA and BAM files as input, and performing enhancement on the input reference genome according to the alignment result, including finding single base differences, small indels (indels), large indels or blocks, filling N in the reference sequence, and finding local misassembly. The fasta files of contig are from the respective standard process assembly results of Illumina and PacBio, redundancy is removed by combination, and finally the genome splicing assembly sequence with further improved splicing quality after Pilot treatment is obtained. And (3) further performing possible extension on the spliced contig by using SSPACE (v2.1.1) and utilizing original reads sequence paired-end information, and finally obtaining an extended genome splicing version.
The extended genome spliced versions were quality assessed using QUAST (v5.0.2), BUSCO (v4.1.4), respectively. Wherein the accuracy of the bases is assessed by QUAST calculation of the ratio of mismatches to InDel in the assembly results. And performing integrity evaluation on the assembly result by inquiring a BUSCO database and comparing the result. Where C represents the number of complete single copy orthologs in the database that can be matched, F represents matched but incomplete, and M represents a single copy ortholog gene that cannot be matched. n represents the total number of single copy orthologous genes contained in the corresponding database.
Considering that the actually spliced genome is from yeast, a special annotation process Fungap (v1.1.1) suitable for fungi is selected to complete, and an RNAseq sequence of a transcriptome of a reference adjacent yeast genome Y.lipolytica W29 of the existing transcriptome is selected as an assistant, the results of the following three different gene prediction software Augustus (v3.3.3), braker (v2.1.5) and Maker (v2.31.10) are evaluated in the process, and the optimal gene model prediction result is determined by constructing an evidencscorore score formula.
The method comprises the steps of using RAVEN Toolbox (v 2.0), obtaining homologous genes corresponding to proteins by comparing and searching in a KEGG database and a MetaCyc database based on predicted genome annotated protein sequence files, and respectively and repeatedly constructing a genome metabolic scale network model according to gene-enzyme-reaction association, wherein the genome metabolic scale network model comprises the reaction Rxns related to each, and the metabolite Mets and the gene genes.
Evaluation of GMEs was done by model explore (v 2.0) and further building a visual network platform presentation of metabolic model GEMs by MetExploreViz. And (3) introducing an OptFlux (v3.4.0) into a metabolic model to realize the analysis of FBA flow balance and help scientific researchers to perform metabolic simulation and modification of microorganisms.
The analysis software involved in the whole process is as follows: trimmatic (v 0.32), SPAdes (v 3.14.1), canu (v 2.1.1), pilot (v 1.23), SSPACE (v 2.1.1), QUAST (v 5.0.2), BUSCO (v 4.1.4), fungap (v 1.1.1), augustus (v 3.3.3), braker (v 2.1.5), maker (v 2.31.10), samtools (v 1.10), bamtools (v 2.5.1), pfam _ scan (v 1.6), geneMark-ES/ET (v 4.59), repeat modeler (v 2.0.1), hisat2 (v 2.2.0), RAVEN Toolbox (v 2.0), model 2.0, explore vitoviz (v 3.4.0).
The technical scheme of the invention is illustrated by taking yarrowia lipolytica as an example below:
a first part: yarrowia lipolytica genome splicing
The sequence splicing process is shown in FIG. 7, and the sequence assembly splicing and evaluation flow chart is shown in FIG. 1.
Genome sequence splicing is to finish genome assembly results of corresponding platforms by Illumina second-generation sequencing and PacBio third-generation sequencing respectively, and further to use high-quality reads of Illumina second-generation sequencing as a framework to carry out sequence error correction and derivation through a pilot program and an SSPACE program, so that a first genome analysis assembly version Poly _ version 1.0 is obtained. And finally, finishing the whole assembly evaluation work by respectively using N50 parameters, QUAST and BUSCO databases.
(1) Library construction and sequencing results of the Illumina platform:
illumina sequencing depth assessment
7,567,4601502/20,000,000=113X (yeast genome estimated from 20 MB)
The results of Illumina NovaSeq PE150 sequencing are given in the following table:
sample ID-name of the sequencing Sample; insert Size (bp) length of the inserted DNA fragment at the time of library construction; clean Reads Length (bp): length of reads sequencing; raw Data (Mb): sequencing the total quantity of bases by using original data; clear Q20% percentage of the number of bases with sequencing error rate less than 1%; clear Q30 percent (percentage of the number of bases with a sequencing error rate of less than 0.1 percent); GC (%) the percentage of bases of C + G to the number of all bases
The PacBio platform library construction sequencing results are shown in the following table:
the decompressed corresponding file of the sequencing result file of the PacBio platform library building is data/data _ jie/jiequ05. Subsidiary. BamPamBio sequencing depth evaluation
3,000,081,562/20,000,000=150X (yeast genome estimated from 20 MB)
(2) And (3) genome splicing results:
the primary splice results of Illumina sequencing are shown in the following table:
serial number | Item | Scaffold | Contig fragment | |
1 | Total number of (>500bp) | 117 | 147 | |
2 | Total length (bp) | 20,295,846 | 20,295,546 | |
3 | N50 Length (bp) | 372,953 | 316,305 | |
4 | N90 Length (bp) | 139,075 | 87,734 | |
5 | Maximum Length (bp) | 949,861 | 949,861 | |
6 | Minimum length (bp) | 574 | 547 | |
7 | Sequence GC content (%) | 49 | 49 |
The process is as follows: splicing and assembling are carried out by second generation sequencing common software such as SPAdes.
The PacBio sequencing alignment results are shown in the following table:
the process is as follows: reads are assembled through SMRT Link v5.0.1 software provided by PacBio authorities, correction is completed through mutation detection software policy, then unused reads and assembly result policy _ assembly.fasta are mapped by using a response program, and finally PacBio assembly results of 6 contigs are obtained.
The Illumina and PacBio data combined with the splice results are shown in the following table:
the process is as follows: the pilot and bowtie programs revise the assembly result error program, and the detailed result interpretation is exemplified by: contig1:1146572Contig1 _pilot
The initial Contig1 sequence file spliced out by PacBio, checked by the pilot and bowtie programs, found that two nucleotide ATs, namely 1146573-1146574, were missing AT 1146572 nucleotide positions.
The corrected result is further derived
(3) And (3) evaluating a genome splicing result:
the results of the Quast evaluation are shown in the following table:
evaluating the results according to a BUSCO database: integrity assessment of genome splicing and annotation was by using a BUSCO (universal single copy ortholog gene as reference) database. The genes that make up the BUSCO set of each major lineage are selected from the orthologous group, the genes of which are present as single-copy orthologs in at least 90% of the species.
The process is as follows: evaluation of three assembly results was performed by selecting fungi _ odb9 and ascomycota _ odb9 for ascomycota phyla belonging to our spliced genome Yarrowia lipolytica. The results are shown in FIG. 6. Three assembly results are slightly different in the evaluation of a BUSCO-dependent database, 97.5% of BUSCO sets can be found in ascomycota _ odb9, and the assembly result Poly _ version 1.0 combining second-generation sequencing and third-generation sequencing is selected for subsequent analysis.
(4) Determination of chromosome number: the total number of spliced yeasts is 6 chromosomes according to experimental information, the contig number corresponding to the assembly splicing result is exactly 6, the splicing result can be determined to reach the chromosome level, information of yli strain of Yarrowia lipolytica in an online KEGG database is known, and the information is as follows, and the serial number of the chromosomes is preliminarily determined according to the size corresponding relation.
Serial number | Ref Chromosome ID | Size(bp) | Scaffold ID | Scaffold size(bp) |
1 | Chromosome E | 4,224,103 | scaffold1 | 4,202,664 |
2 | Chromosome F | 4,003,362 | scaffold2 | 4,015,094 |
3 | Chromosome D | 3,633,272 | scaffold3 | 3,641,580 |
4 | Chromosome C | 3,272,609 | scaffold4 | 3,323,792 |
5 | Chromosome B | 3,066,374 | scaffold5 | 3,073,838 |
6 | Chromosome A | 2,303,261 | scaffold6 | 2,275,489 |
And the reference chromosome genome and the corresponding chromosome genome presumed by us were subjected to sequence alignment by using the flak-1.0 software, and the results are shown in the attached FIG. 8: from the above alignment results, we assembled chromosomal genomic sequences that were highly matched to the corresponding reference chromosomal genomic sequences, except that scfold 5 and ChrB gave a match of an inverted reverse around 150 kb. We further looked at mapping of Illumina reads and assembled scaffold5 containment for this region, and for the resulting bam file looking at alignment using the visualization software IGV, we were able to find that most of the highly matched gray reads covered this region, where the reads would be represented in different colors (insert size less than expected; insert size greater than expected; inversion, repeat, translocation events), thus presuming that the region was indeed a different place than the reference yli strain. Alignment mapping was also performed on the PacBio data and the scaffold5 contig, resulting in visualization of the same segment, around 160kb, and also finding a long PacBio reads that smoothly crossed the match. If there are problems with assembly, pacBio is unable to cross smoothly over all.
(5) Finding the line three-dimensional chromosome sequence: since yli also has a mitochondrial chromosomal sequence in the reference genome, we have reason to believe if the spliced genome misses a mitochondrial sequence because of the high match of their 6 corresponding chromosomal sequences, first map the reads and the reference mitochondrial sequence as follows:
we found that 7.2% reads could match, and we then picked these matched reads out and performed a rejoin of denovo with SPAdes software. Mitochondrial chromosomal genome file mt _ chr.fasta with size 47,912bp to verify if it is a circular structure sequence, we cut the first and last DNA sequence and create fasta file of test sequence. Performing blat query by using a test file spliced with the previous Scaffold contigs file obtained by Illumina to find a highly matched Scaffold contig-Scaffold86_1, as shown in figure 9, a test sequence and a reverse sequence of the Scaffold86_1 can find good span matching, thereby proving to be a complete circular mt _ chr.fasta, the sequence is compared with a reference genome mitochondrial chromosome sequence, is highly similar, proves that the mitochondrial chromosome genome sequence is highly conserved and basically has no change, and the final genome assembly sequence result is (we remove 8 bases in the mitochondrial circular comparison)
Serial number | Ref Chromosome ID | Size(bp) | Scaffold ID | Scaffold size(bp) |
1 | Chromosone MT | 47,916 | scaffold_mt | 47,904 |
2 | Chromosome E | 4,224,103 | scaffold1 | 4,202,664 |
3 | Chromosome F | 4,003,362 | scaffold2 | 4,015,094 |
4 | Chromosome D | 3,633,272 | scaffold3 | 3,641,580 |
5 | Chromosome C | 3,272,609 | scaffold4 | 3,323,792 |
6 | Chromosome B | 3,066,374 | scaffold5 | 3,073,838 |
7 | Chromosome A | 2,303,261 | scaffold6 | 2,275,489 |
The whole first part of the process parameter selection is described as follows:
for secondary Illumina sequencing data, firstly removing a linker sequence and a tail low-quality (q < 20) base sequence from Illumina original sequencing data through Trimmomatic software, removing a sequence with the length less than 36pb, calculating different k-mer size parameter results through SPAdes splicing software, then selecting an effective contig sequence with the length more than 500bp by default, and selecting a splicing result with the maximum N50 as a final Illumina assembly contig result. After taking corrected PacBio data, we found that integrating Illumina spliced contig files as a beneficial complement using Circlator software, based on PacBio corrected reads sequences for assembly, would be better than canu default self-contained splicing assembly procedure results, especially for circular prokaryotes, and even for entire complete circular genome sequences without the need to compensate for gap on the genome with PCR experiments. And a correction part, wherein the obtained contig assembly result of PacBio is compared by using Illumina original sequencing data through bowtie and samtools software, high-quality comparison read is filtered, possible base errors in the splicing result are corrected by using a bam file generated by an align part as a basis through base error revision software Pilot, and the splicing result is further extended after correction of the Illumina original reads data PacBio through further extension software SSPACE, so that a high-quality genome assembly result is finally obtained. And in the genome assembly evaluation part, the evaluation work of the assembly result is completed through the database information of the best and the BUSCO. In the genome annotation part, a repeatModler is called by a funcap to perform modeling annotation on a repeat sequence family of a genome from head, and a repeatMasker is used for detecting and shielding repeat sequences in the genome, because a repeat sequence structural region exists in the genome, particularly in higher eukaryotes, the repeat sequences account for a large proportion, the quality of gene prediction is influenced, and unnecessary resource consumption is caused. Gene annotation results after genome splicing were then done by three common gene prediction software, augustus, marker and Braker. For reference transcriptome data, the splicing assembly result of RNA-seq is obtained through a Hisat/Trinity flow, and the gene prediction of the result is completed by using marker and Braker. And all the prediction results are evaluated and scored, corresponding scores are obtained through BLASTp, BUSCO, interProScan and BLASTn, and finally the prediction result with the highest Score of the everscience Score is obtained as a final gene prediction model through the following equation of the everscience Score of the gene model. In this step of parameter selection, we also add the siter _ organisms, that is, download all 9 yarrowia lipolytica genomes in NCBI database as annotation reference to improve the accuracy of annotation. Taking the splicing of the yeast genome as an example, the obtained splicing, evaluation and annotation comparison results are detailed in a research report document. Here we can see that our final splice assembly results also resulted in 6 chromosomes and a plasmid sequence compared to Yarrowia lipolytica CLIB122, which is a reference for the Yarrowia lipolytica genome at the complete chromosome level already in the NCBI database.
A second part: yarrowia lipolytica genome annotation
(1) Genome annotation process
The whole process utilizes transcriptome data as assistance, and selects a set of transcriptome sequences W29 of the same yeast Yarrowia lipolytica Strain of the same kind with high similarity as assistance aiming at the situation that the transcriptome sequences of the same kind are not available at present.
According to the information provided by the article, when the same species or genus level is selected, corresponding to an average DNA similarity of 96% or 92.7%, the predicted result is very close to the auxiliary result of the real transcriptome data.
(2) Genome annotation the specific process steps are as follows
Step1: and (4) preprocessing input data.
Repeating the shielding; genomic regions such as transposon repeats, often misalign and interfere with gene prediction.
The genome-specific repeat library constructed using the RepeatModeller.
Assembly of mRNA reads;
the mRNA readings provided by the user were assembled by Trinity program.
Step2: gene prediction
Manufacturer and default parameters used by FunGAP; training with the iterative SNAP gene model was run four times. Due to the higher gene density of the fungal genome, fusion of adjacent transcripts is corrected in mRNA assembly. Default parameters used by Augustus and FunGAP; with the user specified augustus _ tasks parameter. Overlapping CDS predictions are allowed. Brake and default parameters used by FunGAP; genome annotation based on unsupervised RNA sequencing was performed using GeneMark-ET and Augustus.
Step3: gene model evaluation
BLASTp; BUSCO; interProScan (Pfam Domain prediction); BLASTn; evidence score (gene model) = BLASTp _ score: (query) × (target) + BUSCO _ score + Pfam _ scores + BLASTn _ score: (query) × (target)
Step4: gene model filtering
The provided information, in the annotation process, the reference model selected by the gene prediction program augustus is used for predicting the same yarrowia _ lipolytica, and 9 yarrowia protein data are downloaded for homologous search.
Downloading sister proteome (Download sister proteome)
(3) Genome annotation results the FunGAP report: created in 2019-09-17, 6970 genes were predicted in this genome.
1. The Gene structure Gene structure is shown in the following table:
serial number | Attributes | |
1 | Total protein-coding genes | 6,970 |
2 | Transcript length(avg/med) | 1,463.9/1,201.0 |
3 | CDS length(avg/med) | 1,362.7/1,131.0 |
4 | Protein length(avg/med) | 454.2/377.0 |
5 | Exon length(avg/med) | 1,035.0/810.0 |
6 | Intron length(avg/med) | 319.5/168.0 |
7 | Spliced genes | 1,833(26.3%) |
8 | Gene density(genes/Mb) | 338.67 |
9 | Number of introns | 2,207 |
10 | Number of introns per gene(med) | 1.0 |
11 | Number of exons | 9,177 |
12 | Number of exons per gene(med) | 1.0 |
2. Transcriptome read sequences assembly (Transcriptome reads assembly) is shown in the following table:
serial number | Attributes | |
1 | Number of mapped reads | 20,207,847 |
2 | Number of assembled contigs | 10,448 |
3 | Number of contigs>1kbp | 5,295 |
4 | Total transcript size(Mbp) | 13,320,738 |
3. The Transcript Length distribution (Transcript length distribution) is shown in FIG. 3
4. Protein length distribution (Protein length distribution) see figure 3 for total reads:22,385,814 times; mapping reading: 20,207,847 times; alignment ratio: 90.27%. Total reads 22,385,814; mapped reads 20,207,847; alignment rate 90.27%
According to the prior art, the gene prediction effect is reliable when the selected transcription reads alignment rate is not less than 80%, and the matching degree of reads and genes is as high as 90.27% because the yeast transcriptome data with the speces level is selected, which indicates that the prediction result is right.
And a third part: yarrowia lipolytica genome annotation compared to reference genome
(1) The annotated completed genomes were evaluated by the BUSCO database for the assembly results of 5 Missing BUSCOs (M), EOG092D12CU, EOG092D2L06, EOG092D3PHH, EOG092D4JRM, EOG092D4SCY for the same BUSCO database check of the completed reference genome Yarrowia lipolytica Strain CLIB122 present in the KEGG database, resulting in the following 9 Missing BUSCOs (M) (EOG 092D1EU1, EOG092D1Q04, EOG092D1RDH, EOG092D2L06, EOG092D2RP6, EOG092D37V8, EOG092D3C2F, EOG D4H30, EOG092D4 SCY) and 5 Missing BUSCOs, all of which are found to be 2 Missing the first 2 Missing BUSCOs.
1 st: EOG092D2L06"Zinc finger, RING-like"156 1 329 1.3703 NA IPR011513; IPR014857
The 2 nd: EOG092D4SCY "EF-hand domain"164 2 205.0865NA IPR002048; IPR011992; IPR018247
(2) The selected reference genome is Yarrowia lipolytica Strain W29, and a corresponding protein sequence annotation file is downloaded for online alignment, so that the common 6511pair genes are found.
Next, we performed VENN map analysis based on the annotated 6970 gene file fungap _ out _ prot.faa, with the reference genome YarliW29 sharing 6511 of the number of gene maps, in addition we compared 1315 BUSCO group matches in the earlier BUSCO ascomycota _ odb9 database, which in total matched 1310 BUSCO groups (including both full and fragment matches), corresponding to 1318 genes in the 6970 genes. 2 of these values are 0, because reference bias we only selected 6970 genes we spliced and annotated as a common reference comparison, there is a possibility of missing a complete comparison of YarliW29 to BUSCO (1315 groups). No obvious clustering condition of the genes on the chromosomes is seen, and a considerable number of unique gene distributions exist in comparison with the reference genome on each chromosome, and FIG. 5 is a distribution chart of 459 unique genes on the chromosomes.
The fourth part: yarrowia lipolytica genome metabolic model reconstruction
(1) Genome-scale metabolic models (GEMs)
Genome-scale metabolic models (GEMs) can mimic and predict the metabolic flux of various system-level metabolic studies by calculating gene-protein-response associations that describe the entire metabolic gene in an organism. The latest GEM, iMK1208, is a high-quality model that contains 1208 genes and 1643 responses and is successfully used to predict metabolic engineering goals for increased actin production. Streptomyces coelicolor genome scale models (GEMs) are used for system biology research, and automatic draft GEM reconstruction of creation board
Reconstruction based on two databases
(a) MetaCyc-based reconstruction Module: metaCyc is a well-planned database that contains experimentally elucidated metabolic pathways from all areas of life. MetaCyc contains 2722 pathways from 3009 different organisms.
(b) KEGG-based reconstruction module: (1) GEM draft of genome information based on KEGG annotation. (2) reconstructing GEM draft from homology with HMMs trained with KEGG;
ScoKEGGAnnotation=getKEGGModelForOrganism('yli',”,”,”,0,0);
ScoKEGGAnnotation.id='ScoKEGGAnnotation';
(c) Merging of the MetaCyc-based model and the KEGG-based model. KEGG relies on sequence-based annotation, whereas MetaCyc collects only experimentally validated pathways. .
(2) Reference to Yarrowia lipolytica CLIB122 genomic information
(3) The JGI database currently has all 8 Yarrowia lipolytica information
(4) Semi-automatic metabolic model creation iterative process
(5) Specific reconstruction process of yarrowia lipolytica genome metabolism model
Based on the above process, the first version of our current metabolic model is as follows
For later further iterative optimization, gap compensation can be carried out in a deep optimization mode only by further checking related documents and later cooperative discussion.
The fifth part is that: the data result storage description is shown in figure 10
The foregoing description has described the general principles, features and advantages of the invention. The present invention is not limited by the above examples, which are described in the specification and are illustrative only of the principles of the invention, and various changes and modifications can be made without departing from the spirit and scope of the invention, which will fall within the scope of the invention as claimed.
Claims (9)
1. A method for constructing, optimizing and visualizing a genome metabolic model based on a high-throughput sequencing technology is characterized by comprising the following steps of:
step1, performing sequence determination on species by respectively utilizing second-generation Illumina and third-generation PacBio high-throughput sequencing technologies to obtain an original sequence file with quality control;
step2, respectively carrying out quality statistics and evaluation on the original sequence file, if the evaluation is qualified, executing step3, otherwise, returning to execute step 1;
step3, removing primers, joint sequences and reading sequences with quality/length lower than a set requirement from the original sequence file respectively to obtain screened sequence data;
step4, splicing the second generation Illumina high-throughput screening data;
splicing the screened second-generation Illumina sequencing data according to different k-mer size parameters, and selecting a splicing result with the maximum parameter N50 as a contig file spliced by the second-generation Illumina sequencing data;
step 5, carrying out first splicing on the three generations of PacBio high-throughput screening data;
carrying out first splicing on the screened third generation PacBio sequencing data through splicing software to obtain a contig file subjected to first splicing, wherein the splicing step comprises the following steps:
step 51, correction: stacking the screened reads together for base correction, and improving the accuracy of the bases in the reads;
step 52, trimming: determining a high-quality area and a low-quality area of reads, trimming the low-quality area of the reads, reserving a single highest-quality sequence block, and deleting the low-quality sequence block;
step 53, assembling: sequencing reads sequences into contigs files, generating corresponding consensus sequence consensus sequences, and creating paths in which consensus sequences are connected with each other;
step 6, fusing and embedding the contig file of the second generation Illumina into the third generation PacBio high throughput screening data for the second splicing to obtain a high-quality genome splicing result, and the specific steps are as follows:
step 61, carrying out sequence error correction on the first-time splicing data of the PacBio to obtain a reads sequence corrected based on the PacBio;
step 62, using the contig file spliced by the second generation Illumina as a supplement of a reads sequence corrected by the PacBio, filling gap in the third generation PacBio sequencing, and carrying out second splicing on the genome to obtain the contig file of the PacBio;
step 63, comparing the contig file of PacBio by utilizing Illumina original sequencing data, filtering high-quality compared reads, and correcting the base error in the second splicing result by using a bam file generated by an align part as a basis through base error revision software Pilot;
step 64, further extending the corrected splicing result of the Illumina original reads data PacBio through extension software to finally obtain a high-quality genome splicing result;
step 7, predicting and matching the genome splicing result to finish genome sequence annotation;
according to the splicing result, gene prediction software Augustus, maker and Braker are respectively used for obtaining all gene prediction results, the predicted results are respectively compared with an nt database, a BUSCO database and a Pfam database of NCBI to obtain corresponding scores, a gene model expression is used for calculating to obtain a comprehensive score value of the gene prediction model, the gene prediction model with the highest score is determined as a primary genome splicing annotation result, the primary genome splicing annotation result is matched by combining the transcriptome of similar species, if the matching degree of sequencing read reads and genome is greater than 80%, the primary genome splicing annotation result is determined as a final result, genome sequence annotation is completed, otherwise, the gene prediction software is changed, and prediction and selection are carried out again;
step 8, constructing genome metabolic network models GEMs;
step 9, building a visual network platform of the metabolic network models GEMs through MetExploreViz, and visually displaying the metabolic model GEMs;
and step 10, carrying out interface visualization FBA flow balance analysis on the metabolism network models GEMs by utilizing optFlux software.
2. The method for constructing, optimizing and visualizing the genome metabolism model based on the high-throughput sequencing technology of claim 1, wherein in the step3, the quality is lower than the set requirement and refers to bases with the quality lower than 20, and the length is lower than the set requirement and refers to sequences with the length less than 100 bp.
3. The method for constructing, optimizing and visualizing the genome metabolism model based on the high-throughput sequencing technology of claim 1, wherein the different k-mer size parameters in the step4 are 21,33,55,77 and 99 respectively.
4. The method for constructing, optimizing and visualizing the genome metabolism model based on the high-throughput sequencing technology according to claim 1, wherein the expression of the gene model in the step 7 is as follows:
Evidence score(gene model)=BLASTp_score*cov(query)*cov(target)+BUSCO_score+Pfam_scores+BLASTn_score*cov(query)*cov(target)
wherein BLASTp score is the score derived from the BLASTp protein sequence alignment program;
BLASTp score is the score derived from the BLASTn sequence base alignment program; BUSCO _ score is a score obtained by comparing a Benchmarking Universal Single-Copy orthopology database derived from BUSCO; pfam _ scores is a score obtained by comparison in a Pfam database;
cov (query) is the full-length alignment coverage of the query sequence; cov (target) is the full-length alignment coverage of the target sequence; eidferencescore is the integrated score value of the final gene model.
5. The method for constructing, optimizing and visualizing the genome metabolism model based on the high-throughput sequencing technology according to claim 1, wherein the specific steps in the step 8 comprise:
step 81, respectively comparing protein sequences formed by genes injected and released by a genome by using RAVEN with biological pathway databases KEGG and MetaCyc to obtain the related reactions Rxns, metabolites Mets and the number of genes of the corresponding metabolic network models, merging the same reactions, metabolites and related genes, performing de-duplication fusion, and finally obtaining metabolic network models GEMs based on KEGG and MetaCyc;
step 82, searching a metabolic network model GEMs of the species closest to the species in a KEGG database, acquiring involved reactions Rxns, metabolites Mets and genes through a gene homology comparison strategy, and supplementing the involved reactions Rxns, metabolites Mets and genes into the acquired metabolic network model GEMs based on KEGG and MetaCyc;
and step 83, perfecting the whole metabolic network model GEMs through GapFind and GapFll to enable the model to obtain more genes and reactions.
6. The method for constructing, optimizing and visualizing the genome metabolic model based on the high-throughput sequencing technology of claim 1, wherein the evaluation standard of the second generation Illumina high-throughput sequencing technology for sequencing quality determination of species in the step2 is that clear Data Q20 is more than 85%; the third generation PacBio high throughput sequencing technology performed on species for sequence quality determination evaluation criteria was Mean Concordance >85%.
7. The method for constructing, optimizing and visualizing the genome metabolism model based on the high-throughput sequencing technology of claim 1, wherein the high-quality region in the step 5 is a region where the base quality of reads is greater than or equal to 20, and the low-quality region is a region where the base quality of reads is less than 20.
8. The method for modeling, optimizing and visualizing genome metabolism based on high-throughput sequencing technology of claim 1, wherein the species is yeast.
9. The method for high throughput sequencing technology-based genome metabolism model construction, optimization and visualization of claim 8, wherein the yeast is yarrowia lipolytica.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110422896.2A CN113035269B (en) | 2021-04-16 | 2021-04-16 | Genome metabolism model construction, optimization and visualization method based on high-throughput sequencing technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110422896.2A CN113035269B (en) | 2021-04-16 | 2021-04-16 | Genome metabolism model construction, optimization and visualization method based on high-throughput sequencing technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113035269A CN113035269A (en) | 2021-06-25 |
CN113035269B true CN113035269B (en) | 2022-11-01 |
Family
ID=76456928
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110422896.2A Active CN113035269B (en) | 2021-04-16 | 2021-04-16 | Genome metabolism model construction, optimization and visualization method based on high-throughput sequencing technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113035269B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111161797B (en) * | 2019-12-31 | 2023-06-06 | 北京百迈客生物科技有限公司 | Transcription analysis method based on three-generation sequencing detection multi-sample comparison |
CN113903411B (en) * | 2021-08-11 | 2024-06-28 | 东北林业大学 | Genome assembly pretreatment method based on suffix array and monotone stack |
CN114168708B (en) * | 2021-11-15 | 2022-06-14 | 哈尔滨工业大学 | Personalized biological channel retrieval method based on multi-domain characteristics |
CN114496091A (en) * | 2021-12-30 | 2022-05-13 | 浙江安诺优达生物科技有限公司 | Method for optimizing assembled genomes |
CN116631500A (en) * | 2021-12-30 | 2023-08-22 | 天津金匙医学科技有限公司 | Non-core drug-resistant gene |
CN114822700B (en) * | 2022-04-25 | 2023-02-17 | 至本医疗科技(上海)有限公司 | Methods, devices and media for presenting rearranged or fused structural subtypes |
CN114875118B (en) * | 2022-06-30 | 2022-10-11 | 北京百图智检科技服务有限公司 | Methods, kits and devices for determining cell lineage |
CN115691673B (en) * | 2022-10-25 | 2023-08-15 | 广东省农业科学院蔬菜研究所 | Genome assembly method from telomere to telomere |
CN116072222B (en) * | 2023-02-16 | 2024-02-06 | 湖南大学 | Method for identifying and splicing viral genome and application thereof |
CN117275590B (en) * | 2023-11-10 | 2024-03-26 | 华东师范大学 | Degradation function gene database and analysis platform for macromolecules in organic solid waste system |
CN118398090B (en) * | 2024-06-26 | 2024-09-06 | 安诺优达基因科技(北京)有限公司 | Genome annotation method and electronic device |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106778076A (en) * | 2016-11-15 | 2017-05-31 | 上海派森诺生物科技股份有限公司 | A kind of efficient method for being directed to the splicing of actinomyces genome |
KR101832834B1 (en) * | 2017-03-09 | 2018-04-13 | 주식회사 샤인바이오 | Method and system for multiple dot plot analysis |
CN107609347A (en) * | 2017-08-21 | 2018-01-19 | 上海派森诺生物科技股份有限公司 | A kind of grand transcript profile data analysing method based on high throughput sequencing technologies |
US20190144852A1 (en) * | 2017-11-13 | 2019-05-16 | The Board Of Trustees Of The University Of Illinois | Combinatorial Metabolic Engineering Using a CRISPR System |
CN108804875B (en) * | 2018-06-21 | 2020-11-17 | 中国科学院北京基因组研究所 | Method for analyzing microbial population function by using metagenome data |
CN111192630B (en) * | 2019-12-24 | 2023-10-13 | 中国科学院生态环境研究中心 | Metagenomic data mining method |
CN112133368B (en) * | 2020-10-13 | 2024-02-23 | 南开大学 | Automatic analysis method of metagenome sequencing data based on three-generation sequencing technology |
-
2021
- 2021-04-16 CN CN202110422896.2A patent/CN113035269B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN113035269A (en) | 2021-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113035269B (en) | Genome metabolism model construction, optimization and visualization method based on high-throughput sequencing technology | |
Shumate et al. | Improved transcriptome assembly using a hybrid of long and short reads with StringTie | |
Scalzitti et al. | A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms | |
Lischer et al. | Reference-guided de novo assembly approach improves genome reconstruction for related species | |
Venturini et al. | Leveraging multiple transcriptome assembly methods for improved gene structure annotation | |
Zimin et al. | Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm | |
Wang et al. | The draft nuclear genome assembly of Eucalyptus pauciflora: a pipeline for comparing de novo assemblies | |
Hara et al. | Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation | |
Rudd | Expressed sequence tags: alternative or complement to whole genome sequences? | |
Brent | Genome annotation past, present, and future: how to define an ORF at each locus | |
Romiguier et al. | Analytical biases associated with GC-content in molecular evolution | |
Trivedi et al. | Quality control of next-generation sequencing data without a reference | |
Vendrell-Mir et al. | A benchmark of transposon insertion detection tools using real data | |
Rupp et al. | Construction of a public CHO cell line transcript database using versatile bioinformatics analysis pipelines | |
Hoffmann et al. | Accurate mapping of tRNA reads | |
Rödelsperger et al. | CYNTENATOR: progressive gene order alignment of 17 vertebrate genomes | |
Lozano-Fernandez | A practical guide to design and assess a phylogenomic study | |
Singh | Fundamentals of bioinformatics and computational biology | |
Minei et al. | De novo assembly of middle-sized genome using MinION and Illumina sequencers | |
Bowman et al. | A modified GC-specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa | |
Beier et al. | Multiplex sequencing of bacterial artificial chromosomes for assembling complex plant genomes | |
Sandhya et al. | Methods and tools for plant organelle genome sequencing, assembly, and downstream analysis | |
Wang et al. | BAUM: improving genome assembly by adaptive unique mapping and local overlap-layout-consensus approach | |
Hoang et al. | De novo assembly and characterizing of the culm-derived meta-transcriptome from the polyploid sugarcane genome based on coding transcripts | |
Haiminen et al. | Assessing pooled BAC and whole genome shotgun strategies for assembly of complex genomes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |