CN113035269B

CN113035269B - Genome metabolism model construction, optimization and visualization method based on high-throughput sequencing technology

Info

Publication number: CN113035269B
Application number: CN202110422896.2A
Authority: CN
Inventors: 汤雷翰; 申志勇; 周豪魁; 叶羽; 罗明晶; 谢秀秀
Original assignee: Beijing Computational Science Research Center; Shenzhen Institute of Advanced Technology of CAS
Current assignee: Beijing Computational Science Research Center; Shenzhen Institute of Advanced Technology of CAS
Priority date: 2021-04-16
Filing date: 2021-04-16
Publication date: 2022-11-01
Anticipated expiration: 2041-04-16
Also published as: CN113035269A

Abstract

The invention discloses a method for constructing, optimizing and visualizing a genome metabolism model based on a high-throughput sequencing technology, which combines the I l umina second-generation sequencing technology and the PacB I o third-generation sequencing technology in the high-throughput sequencing field, carries out genome splicing and annotation with complementary advantages, compares a biological metabolism pathway database KEGG and MetaCyc through annotated protein information on the basis, and further obtains a whole genome scale metabolism network model according to the correlation of gene-enzyme-reaction according to genes corresponding to the proteins; and further searching a metabolic network model of the existing genetic species in the KEGG database, constructing relatively accurate genome-scale metabolic network models (GEMs) by an integration strategy of gene homology comparison genetic models, realizing flow automation and generating related result report files. The construction of the genome-scale metabolic network models GEMs based on genome sequence annotation and metabolic biochemical information integration is helpful for accelerating the reconstruction of microorganisms.

Description

Genome metabolic model construction, optimization and visualization based on high-throughput sequencing technology Methods

技术领域technical field

本发明属于生物信息学中代谢基因组学和计算模拟生物学领域，涉及一种基因组代谢模型构建、优化及可视化的方法，特别涉及一种基于高通量测序技术的基因组代谢模型构建、优化及可视化的方法。The invention belongs to the fields of metabolic genomics and computational simulation biology in bioinformatics, and relates to a method for constructing, optimizing and visualizing a genome metabolic model, in particular to a method for constructing, optimizing and visualizing a genome metabolic model based on high-throughput sequencing technology Methods.

背景技术Background technique

为推动工业生物技术的发展，迫切需要进一步提高设计和改造微生物细胞的代谢能力。伴随着全基因组高通量测序的快速发展和生物信息学分析策略的不断涌现，使全局性、系统化设计和调控微生物代谢功能成为可能。其中基于基因组序列注释和代谢生化信息整合的基因组规模代谢网络模型GEMs的构建将有助于更好理解和加快改造微生物。因此如何利用高通量测序技术和大型计算模拟快速构建一个高精确度的GEM并能有效可视化分析，让生物的模拟改造过程能被迅速掌握微成为该领域目前急需解决的问题。In order to promote the development of industrial biotechnology, it is urgent to further improve the metabolic ability of designing and modifying microbial cells. With the rapid development of whole-genome high-throughput sequencing and the continuous emergence of bioinformatics analysis strategies, it is possible to design and regulate microbial metabolic functions in a global and systematic manner. Among them, the construction of genome-scale metabolic network model GEMs based on the integration of genome sequence annotation and metabolic and biochemical information will help to better understand and accelerate the transformation of microorganisms. Therefore, how to use high-throughput sequencing technology and large-scale computing simulation to quickly construct a high-precision GEM and effectively visualize and analyze it, so that the biological simulation transformation process can be quickly grasped has become an urgent problem in this field.

发明内容Contents of the invention

本发明的目的在于整合二三代测序的各自技术优势，提高基因组组装质量，以及基因组注释的准确性，并通过基因模型表达式进行计算，得到基因预测结果的综合打分数值，确定分值最高的基因预测模型为初步基因功能注释结果；结合相近物种的转录组数据,对初步基因组拼接注释结果进行匹配，完成基因组序列注释结果的优化；快速构建基因组规模的代谢网络模型GEMs，对代谢网络进行可视化，方便用户迅速掌握微生物的模拟改造过程及科研、分析查找。The purpose of the present invention is to integrate the respective technical advantages of the second- and third-generation sequencing, improve the quality of genome assembly, and the accuracy of genome annotation, and calculate the gene model expression to obtain the comprehensive scoring value of the gene prediction results, and determine the highest score. The gene prediction model is the preliminary gene function annotation result; combined with the transcriptome data of similar species, the preliminary genome splicing annotation result is matched, and the genome sequence annotation result is optimized; the genome-scale metabolic network model GEMs is quickly constructed to visualize the metabolic network , which is convenient for users to quickly grasp the simulation transformation process of microorganisms and scientific research, analysis and search.

本发明为一种基于高通量测序技术的基因组代谢模型的构建、优化及可视化的方法，包括以下步骤：The present invention is a method for constructing, optimizing and visualizing a genome metabolic model based on high-throughput sequencing technology, comprising the following steps:

步骤1、分别利用二代Illumina和三代PacBio高通量测序技术对物种进行序列测定，得到带有质量控制的原始序列文件；Step 1. Use the second-generation Illumina and third-generation PacBio high-throughput sequencing technologies to sequence the species, and obtain the original sequence files with quality control;

步骤2、分别对原始序列文件进行质量统计及评价，评价合格则执行步骤3，否则返回执行步骤1；Step 2. Perform quality statistics and evaluation on the original sequence files respectively. If the evaluation is qualified, execute step 3; otherwise, return to execute step 1;

步骤3、分别对原始序列文件去除引物、接头序列和质量/长度低于设定要求的读序reads序列，获得经过筛选的序列数据；Step 3. Remove primers, linker sequences, and read sequence reads whose quality/length is lower than the set requirement for the original sequence file respectively, and obtain the sequence data after screening;

步骤4、对二代Illumina高通量筛选数据进行拼接；Step 4. Splicing the second-generation Illumina high-throughput screening data;

以不同的k-mer size参数对经过筛选的二代Illumina测序数据进行拼接，选择参数N50最大的拼接结果作为二代Illumia测序数据拼接的contig文件；Splice the screened second-generation Illumina sequencing data with different k-mer size parameters, and select the splicing result with the largest parameter N50 as the contig file of the second-generation Illumina sequencing data;

步骤5、对三代PacBio高通量筛选数据进行第一次拼接；Step 5. Perform the first splicing of the third-generation PacBio high-throughput screening data;

通过拼接软件对经过筛选的三代PacBio测序数据进行第一次拼接，获取到第一次拼接的contig文件，其拼接步骤包括：Use the splicing software to splice the screened third-generation PacBio sequencing data for the first time, and obtain the contig file for the first splicing. The splicing steps include:

步骤51、修正：将经过筛选的reads堆垛到一起进行碱基修正，提高reads中碱基的准确性；Step 51, correction: stacking the screened reads together for base correction to improve the accuracy of the bases in the reads;

步骤52、修剪：确定reads的高质量区域和低质量区域，并对reads的低质量区域进行修整，保留单个最高质量的序列块，删除低质量序列块；Step 52, pruning: determine the high-quality region and the low-quality region of the reads, and trim the low-quality region of the reads, retain a single highest-quality sequence block, and delete the low-quality sequence block;

步骤53、组装：将reads序列排序为contigs文件，生成对应的共有的一致序列consensus sequences，并创建共有序列互相相连的路径；Step 53, assembly: sort the reads sequence into a contigs file, generate corresponding consensus sequences consensus sequences, and create a path connecting the consensus sequences;

步骤6、将所述二代Illumina的contig文件融合嵌入三代PacBio高通量筛选数据进行第二次拼接，获取高质量的基因组拼接结果，具体步骤如下：Step 6. Fusion and embed the contig file of the second-generation Illumina into the third-generation PacBio high-throughput screening data for the second splicing to obtain high-quality genome splicing results. The specific steps are as follows:

步骤61、对PacBio第一次拼接数据进行序列纠错，得到基于PacBio纠正后的reads序列；Step 61, performing sequence error correction on the first splicing data of PacBio, and obtaining the corrected reads sequence based on PacBio;

步骤62、将二代Illumina拼接的contig文件作为PacBio纠正后的reads序列的补充，填补三代PacBio测序中gap，进行基因组第二次拼接，得到PacBio的contig文件；Step 62, using the contig file spliced by the second-generation Illumina as a supplement to the reads sequence corrected by PacBio, filling the gap in the third-generation PacBio sequencing, performing second splicing of the genome, and obtaining the contig file of PacBio;

步骤63、对PacBio的contig文件，利用Illumina原始测序数据进一步比对，过滤高质量比对的reads，通过碱基错误修订软件Pilon以align部分产生的bam文件为依据，对第二次拼接结果中的碱基错误进行更正；Step 63: For the contig file of PacBio, use Illumina original sequencing data to further compare, filter the reads of high-quality comparison, use the bam file generated by the align part through the base error correction software Pilon, and make the second splicing result base errors were corrected;

步骤64、通过延伸软件利用Illumina原始reads数据PacBio的纠正后拼接结果进一步延伸，最终获取到高质量的基因组拼接结果；Step 64, using the extension software to further extend the corrected splicing result of Illumina original reads data PacBio, and finally obtain a high-quality genome splicing result;

步骤7、对基因组拼接结果进行预测、匹配，完成基因组序列注释；Step 7. Predict and match the genome assembly results, and complete the genome sequence annotation;

依据所述拼接结果，分别使用基因预测软件Augustus,Maker和Braker，获取所有基因预测结果，将预测出的结果分别比对NCBI的nt数据库，BUSCO数据库以及Pfam数据库获取对应分值，并通过基因模型表达式进行计算，得到基因预测模型的综合打分数值，确定分值最高的基因预测模型为初步基因组拼接注释结果，结合相近物种的转录组数据,对初步基因组拼接注释结果进行匹配，如果测序读序reads与基因组genome的匹配度>80％，则确定初步基因组拼接注释结果为最终结果，完成基因组序列注释，否则改变基因预测软件，重新进行预测和选取；According to the splicing results, use the gene prediction software Augustus, Maker and Braker respectively to obtain all the gene prediction results, compare the predicted results with the NCBI nt database, BUSCO database and Pfam database to obtain the corresponding scores, and pass the gene model The expression is calculated to obtain the comprehensive scoring value of the gene prediction model, and the gene prediction model with the highest score is determined to be the preliminary genome assembly annotation result. Combined with the transcriptome data of similar species, the preliminary genome assembly annotation result is matched. If the sequencing read sequence If the matching degree between reads and the genome is >80%, the preliminary genome assembly annotation result is determined as the final result, and the genome sequence annotation is completed; otherwise, change the gene prediction software and re-predict and select;

步骤8、构建基因组代谢网络模型GEMs；Step 8, constructing the genome metabolic network model GEMs;

步骤9、通过MetExploreViz搭建代谢网络模型GEMs的可视化网络平台，进行代谢模型GEMs的可视化展示；Step 9, build a visual network platform of metabolic network model GEMs through MetExploreViz, and perform visual display of metabolic model GEMs;

步骤10、利用optFlux软件对代谢网络模型GEMs进行界面可视化的FBA流平衡分析。Step 10, using optFlux software to perform interface visualization FBA flow balance analysis on the metabolic network model GEMs.

进一步地，所述步骤3中质量低于设定要求指质量低于20的碱基，长度低于设定要求指长度小于100bp的序列。Further, in the step 3, the quality lower than the set requirement refers to the bases whose quality is lower than 20, and the length lower than the set requirement refers to the sequence whose length is less than 100 bp.

进一步地，所述步骤4中不同的k-mer size参数分别为21、33、55、77、99。Further, the different k-mer size parameters in step 4 are 21, 33, 55, 77, and 99, respectively.

进一步地，所述步骤7中基因模型表达式为：Further, the gene model expression in the step 7 is:

Evidence score(gene model)＝BLASTp_score*cov(query)*cov(target)+BUSCO_score+Pfam_scores+BLASTn_score*cov(query)*cov(target)Evidence score(gene model)＝BLASTp_score*cov(query)*cov(target)+BUSCO_score+Pfam_scores+BLASTn_score*cov(query)*cov(target)

其中，BLASTp_score为来源于BLASTp蛋白序列比对程序的分值；Among them, BLASTp_score is the score derived from the BLASTp protein sequence alignment program;

BLASTp_score为来源于BLASTn序列碱基比对程序的分值；BUSCO_score为来源于BUSCO的Benchmarking Universal Single-Copy Orthologs数据库比对得到的分值；Pfam_scores为来源于Pfam数据库中比对得到的分值；cov(query)为查询序列的全长比对覆盖度；cov(target)为目标序列的全长比对覆盖度；Evidence score为最后基因模型的综合打分数值。BLASTp_score is the score derived from the BLASTn sequence base alignment program; BUSCO_score is the score derived from the comparison of the Benchmarking Universal Single-Copy Orthologs database of BUSCO; Pfam_scores is the score derived from the comparison of the Pfam database; cov (query) is the full-length alignment coverage of the query sequence; cov(target) is the full-length alignment coverage of the target sequence; Evidence score is the comprehensive scoring value of the final gene model.

进一步地，所述步骤8中具体步骤包括：Further, the specific steps in the step 8 include:

步骤81、利用RAVEN对基因组注释出的基因组成的蛋白序列分别比对生物通路数据库KEGG和MetaCyc,获取到对应的代谢网络模型涉及的反应Rxns，代谢物Mets以及基因数目，将相同的反应，代谢物和涉及基因合并，进行去重融合，最终获取到基于KEGG和MetaCyc的代谢网络模型GEMs；Step 81. Use RAVEN to compare the protein sequences of the genes annotated by the genome to the biological pathway databases KEGG and MetaCyc, and obtain the reactions Rxns, metabolite Mets and gene numbers involved in the corresponding metabolic network model, and convert the same reactions, metabolic Combining substances and related genes, performing deduplication fusion, and finally obtaining metabolic network model GEMs based on KEGG and MetaCyc;

步骤82、在KEGG数据库中寻找与所述物种最接近的物种的代谢网络模型GEMs，通过基因同源比对策略，获取到牵扯的反应Rxns和代谢物Mets及基因，补充到所述获取的基于KEGG和MetaCyc代谢网络模型GEMs中；Step 82, search for the metabolic network model GEMs of the species closest to the species in the KEGG database, and obtain the involved reaction Rxns, metabolite Mets and genes through the gene homology comparison strategy, and supplement the obtained based on KEGG and MetaCyc metabolic network model GEMs;

步骤83、通过GapFind和GapFill,完善整个代谢网络模型GEMs使模型获得更多的基因和反应。Step 83, through GapFind and GapFill, improve the entire metabolic network model GEMs so that the model can obtain more genes and reactions.

进一步地，所述步骤2中二代Illumina高通量测序技术对物种进行序列质量测定评价标准为Clean Data Q20>85％；三代PacBio高通量测序技术对物种进行序列质量测定评价标准为Mean Concordance>85％；Further, in the step 2, the second-generation Illumina high-throughput sequencing technology evaluates the sequence quality of the species as Clean Data Q20>85%; the third-generation PacBio high-throughput sequencing technology evaluates the sequence quality of the species as Mean Concordance >85%;

进一步地，所述步骤5中的高质量区域是指reads的碱基质量大于等于20的区域，低质量区域是指reads的碱基质量小于20的区域。Further, the high-quality region in the step 5 refers to the region where the base quality of the reads is greater than or equal to 20, and the low-quality region refers to the region where the base quality of the reads is less than 20.

进一步地，所述物种为酵母。Further, the species is yeast.

进一步地，所述酵母为解脂耶氏酵母。Further, the yeast is Yarrowia lipolytica.

本发明的效果如下：Effect of the present invention is as follows:

本发明取长补短将Illumina二代测序和PacBio三代测序数据有效整合，获取到高质量的基因组染色体水平组装结果；整合多个真核基因预测软件结果并结合物种真实的转录组数据得到最可靠有效的基因组序列注释结果；基于前面的基因组序列注释结果快速构建基因组规模的代谢网络模型GEMs；对代谢网络进行可视化，方便科研人员或者用户分析查找；对已有GEMs模型进行快速有效的流平衡分析，完成微生物的模拟改造预测；显著提高微生物改造工程工作效率，降低科研成本，可广泛应用于高通量测序中的各种微生物研究工作，特别适用于在实际微生物应用领域中各种工程菌的分析和改良研究上。The present invention learns from each other and effectively integrates Illumina second-generation sequencing and PacBio third-generation sequencing data to obtain high-quality genome chromosome-level assembly results; integrates the results of multiple eukaryotic gene prediction software and combines the real transcriptome data of species to obtain the most reliable and effective genome Sequence annotation results; quickly construct genome-scale metabolic network model GEMs based on the previous genome sequence annotation results; visualize the metabolic network, which is convenient for researchers or users to analyze and search; perform rapid and effective flow balance analysis on existing GEMs models, and complete microbial It can significantly improve the efficiency of microbial transformation engineering and reduce the cost of scientific research. It can be widely used in various microbial research work in high-throughput sequencing, especially suitable for the analysis and improvement of various engineering bacteria in the field of actual microbial applications. research.

附图说明Description of drawings

图1为本发明基于高通量测序技术的基因组代谢模型的构建、优化及可视化的方法的基因组拼接组装流程图；Fig. 1 is the genome splicing and assembling flowchart of the method for constructing, optimizing and visualizing the genome metabolic model based on high-throughput sequencing technology of the present invention;

图2为本发明基因组序列注释流程图；Fig. 2 is the flowchart of genome sequence annotation of the present invention;

图3为本发明基因组注释结果示意图；Figure 3 is a schematic diagram of the genome annotation results of the present invention;

图4为本发明代谢网络模型构建流程图；Fig. 4 is the construction flow chart of metabolic network model of the present invention;

图5为本发明459个独特基因在染色体上的分布情况图；Fig. 5 is the distribution situation map on chromosome of 459 unique genes of the present invention;

图6为本发明分别对三个组装结果的评估图；Fig. 6 is the evaluation figure of the present invention respectively to three assembly results;

图7为本发明序列拼接过程图解；Figure 7 is a schematic illustration of the sequence splicing process of the present invention;

图8为参考染色体基因组与推测的对应染色体基因组的序列比对图；Figure 8 is a sequence comparison diagram of the reference chromosomal genome and the inferred corresponding chromosomal genome;

图9为本发明一个高度匹配的scaffold contig-Scaffold86_1结果图；Fig. 9 is a highly matched scaffold contig-Scaffold86_1 result diagram of the present invention;

图10为本发明数据结果存放说明图。Fig. 10 is an explanatory diagram for storing data results in the present invention.

具体实施方式Detailed ways

以下，参照附图1-10对本发明的实施方式进行说明。Hereinafter, embodiments of the present invention will be described with reference to FIGS. 1-10 .

为具体说明本技术方案，提供方案中涉及的中英文对照表如下：In order to specifically illustrate this technical solution, the Chinese-English comparison table involved in the solution is provided as follows:

本发明公开了一种基于高通量测序领域中，广泛应用的以Illumina为代表的二代测序和PacBio的三代测序技术相结合，优势互补的基因组拼接和注释方法，并在此基础上构建精准的基因组规模代谢网络模型GEMs。本方法流程首先由系统生成自定义的参数文件，在依据用户修改设定参数文件后和高通量数据处理流程模块生成批处理可执行文件，由系统执行批处理可执行文件，实现整个流程自动化并生成相关结果报告文件。从而能够高效的让生物信息科研人员，甚至是不太懂使用编程脚本进行数据分析的科研人员，完成高通量测序数据分析，其中基于基因组序列注释和代谢生化信息整合的基因组规模代谢网络模型(GEMs)的构建将有助于科研人员全局理解和加快改造微生物。本发明流程不仅仅可以用于酵母这类真核微生物的研究，并可推广应用到其他真核生物，甚至细菌等原核生物如大肠杆菌的GEMs重构，应用范围较为广泛。The invention discloses a genome splicing and annotation method based on the combination of the second-generation sequencing represented by Illumina and the third-generation sequencing technology of PacBio, which are widely used in the field of high-throughput sequencing, with complementary advantages. Genome-scale metabolic network models of GEMs. The process of this method first generates a custom parameter file by the system, and generates a batch executable file with the high-throughput data processing process module after modifying the parameter file according to the user, and executes the batch executable file by the system to realize the automation of the entire process And generate relevant result report files. In this way, bioinformatics researchers, even those who do not know how to use programming scripts for data analysis, can efficiently complete high-throughput sequencing data analysis. The genome-scale metabolic network model based on genome sequence annotation and metabolic and biochemical information integration ( The construction of GEMs) will help researchers understand and accelerate the transformation of microorganisms globally. The process of the present invention can not only be used in the research of eukaryotic microorganisms such as yeast, but also can be extended and applied to other eukaryotic organisms, even bacteria and other prokaryotic organisms such as Escherichia coli for reconstruction of GEMs, and the application range is relatively wide.

为分别解决上述问题，本发明提供了一种基于高通量测序技术的基因组代谢模型的构建、优化及可视化的方法，包括如下步骤：In order to solve the above problems respectively, the present invention provides a method for constructing, optimizing and visualizing a genome metabolic model based on high-throughput sequencing technology, including the following steps:

1、获取高质量染色体水平组装结果步骤如附图1：用户根据所分析的物种，输入设定的拼接参数配置文件，分别导入二代测序和三代测序的高通量测序原始序列数据，经过各自专有处理流程软件，进行筛选和拼接得到有效的基因组拼接结果，即对应的contig文件。并进一步分析整合结果，拼接得到染色体水平上的基因组全长序列，在此基础上进行后续的生物信息学分析和代谢网络模型的构建，具体包括如下步骤：1. The steps to obtain high-quality chromosome level assembly results are shown in Figure 1: According to the analyzed species, the user enters the set splicing parameter configuration file, and imports the high-throughput sequencing raw sequence data of the second-generation sequencing and the third-generation sequencing respectively. Proprietary processing software for screening and splicing to obtain effective genome splicing results, that is, the corresponding contig files. The integration results were further analyzed, and the full-length genome sequence at the chromosome level was obtained by splicing. On this basis, the subsequent bioinformatics analysis and the construction of the metabolic network model were carried out, which specifically included the following steps:

(a)分别导入物种二代和三代测序的两种高通量测序原始序列文件；(a) Import two high-throughput sequencing original sequence files of the second-generation and third-generation sequencing of the species;

(b)分别用各自基本处理流程对原始序列文件进行质量控制与统计，并剔除低质量的序列数据，获得经过筛选的序列数据；(b) Carry out quality control and statistics on the original sequence files with their respective basic processing procedures, and eliminate low-quality sequence data to obtain screened sequence data;

(c)分别将经过筛选的数据进行各自拼接、组装成基因组的contig文件；(c) respectively splicing and assembling the screened data into contig files of the genome;

Illumina原始测序数据在去掉接头序列，以及尾部低质量(q<20)碱基序列后，通过SPAdes拼接软件，计算不同k-mer size参数结果，选择有效contig序列长度大于500bp,并选取N50最大的拼接结果作为最终Illumina组装结果。PacBio原始测序数据通过开源的软件canu标准流程，获取到PacBio数据的拼接组装contigs结果，接下来将二者contigs文件合并，再次通过碱基错误修订软件Pilon以及contig进一步延伸软件SSPACE,再次利用Illumina原始reads数据和组装contig结果，以PacBio的拼接结果为骨架，进一步提高组装结果的质量。步骤(C)中，使用SPAdes(v3.14.1)拼接软件，以不同的k-mer长度(21,33,55,77,99)对Illumina的高通量测序数据进行拼接，对于一些GC含量高，覆盖率低或参差不齐的数据集合可进一步选择单细胞single-cell模式(设定--sc参数),最后选择N50最大的拼接结果作为Illumia测序数据拼接的contig文件进行下一步分析。使用Canu(v2.1.1)拼接软件对PacBio的高通量测序数据进行拼接，其拼接过程中分为三步，第一步是修正(Correct),即将read"堆垛"(Pileup)到一起进行修正，提高reads中碱基的准确性，一般三代测序的错误以多一个碱基或者少一个碱基为主，因为单分子测序有时候可能测两个碱基的时候信号连到一起了或者有时候对同一个碱基测了多次导致这种错误；第二步是修剪(Trim),即使用overlap确定read哪些区域是高质量区域，哪些区域质量较低需要修整。最后保留单个最高质量的序列块，并删除可疑的区域；第三步是组装(assemble),即将reads排序为contigs，生成对应的一致序列(consensus sequences)并创建可能的共有序列互相相连的路径。After the Illumina original sequencing data removes the linker sequence and the low-quality (q<20) base sequence at the tail, use the SPAdes splicing software to calculate the results of different k-mer size parameters, select the effective contig sequence length greater than 500bp, and select the largest N50 The splicing results were used as the final Illumina assembly results. The PacBio original sequencing data obtained the splicing and assembly contigs results of the PacBio data through the open source software canu standard process, and then merged the two contigs files, and then further extended the software SSPACE through the base error revision software Pilon and contig, and again used the original Illumina The reads data and assembly contig results use PacBio's splicing results as the backbone to further improve the quality of the assembly results. In step (C), use the SPAdes (v3.14.1) assembly software to assemble the Illumina high-throughput sequencing data with different k-mer lengths (21, 33, 55, 77, 99), for some GC content For data sets with low coverage or uneven coverage, you can further select the single-cell mode (set --sc parameter), and finally select the splicing result with the largest N50 as the contig file of the Illumia sequencing data splicing for the next step of analysis. Use Canu (v2.1.1) splicing software to splice PacBio's high-throughput sequencing data. The splicing process is divided into three steps. The first step is to correct (Correct), which is to read "stack" (Pileup) together. Correction to improve the accuracy of bases in reads. Generally, the error of third-generation sequencing is mainly one base more or one less, because single-molecule sequencing may sometimes measure two bases when the signals are connected together or have When the same base is measured multiple times, this error is caused; the second step is trimming (Trim), that is, using overlap to determine which regions of the read are high-quality regions and which regions are of low quality and need to be trimmed. Finally, a single sequence block with the highest quality is retained, and suspicious regions are deleted; the third step is assemble, which is to sort reads into contigs, generate corresponding consensus sequences, and create paths where possible consensus sequences are connected to each other.

(d)然后整合两种测序拼接结果，以测序较长的三代结果为骨架，利用测序错误率更低的二代测序reads进行矫正和延长，最终拿到最佳的基因组组装结果；步骤(d)中，先通过各自标准拼接流程拿到初步contig文件，然后在以PacBio的长contig文件为骨架，以Illumina的contig文件和高质量的reads测序数据为补充辅助，在碱基错误纠正软件Pilon(v1.23)进行校正和序列进一步延伸软件SSPACE(v2.1.1)的延伸协助下，拿到基因组拼接组装的最终版本，进行下一步基因组注释和代谢模型构建。(d) Then integrate the two sequencing results, take the longer third-generation sequencing results as the skeleton, use the second-generation sequencing reads with lower sequencing error rates to correct and extend, and finally get the best genome assembly results; step (d) ), the preliminary contig files are first obtained through their respective standard splicing processes, and then the base error correction software Pilon ( v1.23) for correction and sequence extension with the assistance of the extension software SSPACE (v2.1.1), the final version of genome assembly was obtained, and the next step of genome annotation and metabolic model construction was carried out.

(1)获取可靠有效的基因组序列注释步骤如附图2：(1) The steps to obtain reliable and effective genome sequence annotation are shown in Figure 2:

依据上一步拼接结果，用户根据所分析的物种，选取物种转录组参考数据，依据真核生物中多种基因预测软件结果，对所有预测出的结果进行验证，并依据综合打分结果，去除低分数的基因预测模型，最终确定最可靠的基因预测结果完成注释；According to the splicing results of the previous step, the user selects the species transcriptome reference data according to the analyzed species, verifies all the predicted results according to the results of multiple gene prediction software in eukaryotes, and removes low scores based on the comprehensive scoring results Gene prediction model, and finally determine the most reliable gene prediction results to complete the annotation;

(a)分别导入参考物种的转录组高通量测序原始序列文件和步骤(1)中最终的基因组拼接结果文件；(a) respectively importing the original sequence file of the high-throughput sequencing of the transcriptome of the reference species and the final genome assembly result file in step (1);

(b)使用RepeatModeler从头对基因组的重复序列家族进行建模注释，并且使用RepeatMasker来检测和屏蔽基因组中的重复序列，因为基因组中存在重复序列结构区，特别是高等真核生物，重复序列占了相当大的比例，会影响基因预测的质量，也会带来不必要的资源消耗。(b) Use RepeatModeler to model and annotate the repeat sequence family of the genome from scratch, and use RepeatMasker to detect and mask the repeat sequence in the genome, because there are repeat sequence structural regions in the genome, especially in higher eukaryotes, the repeat sequence accounts for A considerable proportion will affect the quality of gene prediction and cause unnecessary resource consumption.

然后通过三种常见的基因预测软件Augustus,Maker和Braker来完成基因组拼接后的基因注释结果。对于参考的转录组数据，通过Hisat/Trinity流程，拿到RNA-seq的拼接组装结果，在利用Maker和Braker完成对该结果的基因预测。所有的预测结果，将全部进行评估打分，通过BLASTp,BUSCO,InterProScan和BLASTn获取到对应的分值，最后通过基因模型的Evidence Score表达式如下，获取到Evidence Score分值最高的预测结果作为最终的基因预测模型。Then three common gene prediction software Augustus, Maker and Braker are used to complete the gene annotation results after genome splicing. For the reference transcriptome data, through the Hisat/Trinity process, the RNA-seq assembly results are obtained, and Maker and Braker are used to complete the gene prediction of the results. All prediction results will be evaluated and scored, and the corresponding scores are obtained through BLASTp, BUSCO, InterProScan and BLASTn. Finally, the Evidence Score expression of the gene model is as follows, and the prediction result with the highest Evidence Score is obtained as the final result Gene prediction models.

其中，BLASTp_score为来源于BLASTp蛋白序列比对程序的分值；BLASTp_score为来源于BLASTn序列碱基比对程序的分值；BUSCO_score为来源于Benchmarking UniversalSingle-Copy Orthologs(BUSCO)数据库比对得到的分值；Pfam_scores为来源于Pfam数据库中比对得到的分值；cov(query)为查询序列的全长比对覆盖度；cov(target)为目标序列的全长比对覆盖度；Evidence score为最后基因模型的综合打分数值。Among them, BLASTp_score is the score derived from the BLASTp protein sequence alignment program; BLASTp_score is the score derived from the BLASTn sequence base alignment program; BUSCO_score is the score derived from the Benchmarking Universal Single-Copy Orthologs (BUSCO) database comparison ; Pfam_scores is the score derived from the comparison in the Pfam database; cov (query) is the full-length alignment coverage of the query sequence; cov (target) is the full-length alignment coverage of the target sequence; Evidence score is the last gene The comprehensive scoring value of the model.

(2)构建代谢网络模型GEMs步骤如附图4：根据参数配置文件和基因组序列注释结果，选取开源的RAVEN工具集，生成对应的自动化GEMs构建流程，具体包括如下步骤：(2) The steps for constructing metabolic network model GEMs are shown in Figure 4: According to the parameter configuration file and genome sequence annotation results, the open source RAVEN tool set is selected to generate the corresponding automated GEMs construction process, which specifically includes the following steps:

(a)利用RAVEN2.0工具集平台，首先对基因组注释出的蛋白序列分别通过比对KEGG和MetaCyc两大主要的生物通路数据库,分别获取到对应的代谢网络模型所涉及的反应Rxns，代谢物Mets以及牵扯到的基因数目。将相同的反应，代谢物和涉及基因合并，最终获取到一个基于以上两大主流平台的代谢网络模型GEMs。(a) Using the RAVEN2.0 tool set platform, firstly, the protein sequences annotated by the genome are compared with the two main biological pathway databases KEGG and MetaCyc, and the corresponding metabolic network models involved in the reaction Rxns and metabolites are respectively obtained Mets and the number of genes involved. Merge the same reactions, metabolites and genes involved, and finally obtain a metabolic network model GEMs based on the above two mainstream platforms.

(b)利用KEGG数据库中已经存在的代谢网络模型GEMs，寻找最接近的物种，通过基因的同源比对策略，获取到牵扯的反应Rxns和代谢物Mets及基因，作为进一步补充，整合到前面获取到的代谢网络模型GEMs中。(b) Use the existing metabolic network model GEMs in the KEGG database to find the closest species, and obtain the involved reaction Rxns, metabolite Mets and genes through the homology comparison strategy of the genes, and integrate them into the front as a further supplement The obtained metabolic network model GEMs.

(c)最后对整合得到的代谢网络模型GEMs，通过平台流程中的GapFind和GapFill,完善整个代谢网络模型GEMs，(c) Finally, for the integrated metabolic network model GEMs, through GapFind and GapFill in the platform process, improve the entire metabolic network model GEMs,

(3)执行及输出步骤：执行所描述的自动化GEMs构建流程，获得基于序列注释结果的基因组规模的代谢网络模型GEMs报告并完成可视化展示，具体包括如下步骤:(3) Execution and output steps: Execute the described automated GEMs construction process, obtain the genome-scale metabolic network model GEMs report based on the sequence annotation results and complete the visual display, specifically including the following steps:

使用但不限于利用ModelExplore2.0完成GEM网络可视化评估分析，通过评估所涉及的基因数目，网络的连通性以及基因敲除后所预测的生物学功能的准确性来确定代谢网络模型的好坏。并通过MetExploreViz进一步搭建代谢模型GEMs的可视化网络平台，方便用户具体查看某些特定代谢通路和反应。Use but not limited to ModelExplore2.0 to complete the GEM network visualization evaluation analysis, and determine the quality of the metabolic network model by evaluating the number of genes involved, the connectivity of the network, and the accuracy of the predicted biological function after gene knockout. And through MetExploreViz to further build a visualization network platform of metabolic model GEMs, it is convenient for users to view certain specific metabolic pathways and reactions.

(4)代谢网络模型GEMs的界面可视化的流平衡分析主要但不限于利用开源的optFlux软件进行相关FBA流平衡分析。(4) The flow balance analysis of the interface visualization of the metabolic network model GEMs is mainly but not limited to using the open source optFlux software for related FBA flow balance analysis.

利用本发明，在拼接模块流程上，将二代Illumina为主高通量测序数据和三代PacBio的高通量测序数据进行有效结合，优势互补，使得物种基因组拼接能够接近或达到染色体组装水平，在基因组注释模块流程上，结合了真实的相近物种转录组数据，并使用多个基因预测软件，获取到所有可能预测结果，将预测出的结果分别去比对NCBI的nt数据库，BUSCO数据库以及Pfam数据库获取对应分值，通过Evidence score公式计算，得到基因预测模型的综合打分数值，并最终确定综合打分值最高的基因预测模型为最可靠的基因组拼接注释结果。最后在构建基因组代谢网络模型流程上，利用基因组注释的蛋白序列，通过比对KEGG和MetaCyc数据库内同源基因所涉及的代谢网络，以及提取已有相近物种GEMs中同源蛋白所涉及的代谢网络，最终获取到一个较完整的基因组规模代谢网络模型GEMs.整个模块流程能够帮助科研人员和检测人员迅速完成一整套从基因组测序到重建基因组规模代谢网络模型GEMs的分析，并模拟物种代谢途径从而为物种的优化改造提供方向。本发明分析流程思路清晰，着重于整合二三代测序的各自技术优势，提高基因组组装质量，以及基因组注释的准确性，最终显著提高微生物改造工程工作效率，降低科研成本，可广泛应用于高通量测序中的各种微生物研究工作，特别适用于在实际微生物应用领域中各种工程菌的分析和改良研究上。Utilizing the present invention, in the splicing module process, the second-generation Illumina-based high-throughput sequencing data and the third-generation PacBio high-throughput sequencing data are effectively combined, and the advantages are complementary, so that the genome splicing of species can approach or reach the level of chromosome assembly. In the process of the genome annotation module, the real transcriptome data of similar species are combined, and multiple gene prediction software are used to obtain all possible prediction results, and the predicted results are compared with NCBI's nt database, BUSCO database and Pfam database The corresponding score was obtained, calculated by the Evidence score formula, and the comprehensive score value of the gene prediction model was obtained, and finally the gene prediction model with the highest comprehensive score value was determined as the most reliable genome splicing annotation result. Finally, in the process of constructing the genome metabolic network model, using the annotated protein sequence of the genome, comparing the metabolic network involved in the homologous genes in the KEGG and MetaCyc databases, and extracting the metabolic network involved in the homologous proteins in GEMs of similar species , and finally obtain a relatively complete genome-scale metabolic network model GEMs. The entire module process can help researchers and testing personnel quickly complete a set of analysis from genome sequencing to reconstruction of genome-scale metabolic network model GEMs, and simulate species metabolic pathways to provide The optimal transformation of species provides direction. The analysis process of the present invention has a clear idea, focuses on integrating the respective technical advantages of the second and third generation sequencing, improves the quality of genome assembly, and the accuracy of genome annotation, and finally significantly improves the efficiency of microbial transformation projects and reduces scientific research costs. It can be widely used in Qualcomm Various microbial research work in quantitative sequencing, especially suitable for the analysis and improvement research of various engineering bacteria in the field of practical microbial applications.

本发明的方法，首先由系统流程提供自定义参数配置文件，在根据用户修改设定参数后的自定义参数文件和高通量数据处理流程模块生成与数据流程对应的批处理可执行文件，由流程系统执行批处理可执行文件，实现数据流程分析，生成结果报告文件，从而能够高效的帮助生物信息分析人员完成一套标准化的高通量数据整合分析流程，甚至可以让不太懂高通量数据分析和代谢网络建模的科研人员自己完成从测序数据拼接组装，注释分析，到建立基因组规模代谢网络模型。本发明的拼接技巧和注释方法可以进一步推广和应用到后续更多的高通量测序领域中。In the method of the present invention, firstly, the system flow provides a custom parameter configuration file, and the user-defined parameter file and the high-throughput data processing flow module generate a batch executable file corresponding to the data flow according to the user-defined parameter file after modifying the set parameters. The process system executes batch executable files, realizes data process analysis, and generates result report files, so that it can efficiently help bioinformatics analysts complete a set of standardized high-throughput data integration analysis processes, and even help people who do not understand high-throughput Data analysis and metabolic network modeling researchers complete the sequence data splicing and assembly, annotation analysis, and the establishment of genome-scale metabolic network models by themselves. The splicing technique and annotation method of the present invention can be further extended and applied to more subsequent fields of high-throughput sequencing.

以下结合具体实施例子对上述方案做进一步说明，该实施例子将被用于说明本发明而不是限制本发明的应用范围，实施例子中采用的相关条件和参数设定可以根据具体应用要求的条件做进一步调整，未注明的实施条件通常为常规实验中的条件。Below in conjunction with specific implementation example, above-mentioned scheme is described further, and this implementation example will be used for illustrating the present invention rather than limiting the scope of application of the present invention, and the relevant conditions and parameter setting that adopt in the implementation example can be done according to the condition of concrete application requirement For further adjustments, the unspecified implementation conditions are usually the conditions in routine experiments.

第一，对原始数据进行过滤处理，即去除接头序列和低质量的reads序列，先分别利用各自标准流程完成对Illumina和PacBio前期reads的处理和拼接。First, filter the raw data, that is, remove adapter sequences and low-quality reads sequences, and use their respective standard procedures to complete the processing and splicing of Illumina and PacBio's previous reads.

Illumina测序数据质量的统计见下表：The statistics of Illumina sequencing data quality are shown in the table below:

上表中：Sample ID:测序样品的名称；Insert Size(bp):建库时候插入的DNA片段长度；Clean Reads Length(bp)：reads测序的长度；Raw Data(Mb)：原始数据测序碱基总量；Clean Q20(％):测序错误率小于1％的碱基数量百分比；Clean Q30(％):测序错误率小于0.1％的碱基数量百分比；GC(％):C+G的碱基占所有碱基的数量百分比。In the above table: Sample ID: the name of the sequencing sample; Insert Size (bp): the length of the DNA fragment inserted when building the library; Clean Reads Length (bp): the length of the reads sequencing; Raw Data (Mb): the original data sequencing base Total; Clean Q20(%): the percentage of bases whose sequencing error rate is less than 1%; Clean Q30(%): the percentage of bases whose sequencing error rate is less than 0.1%; GC(%): the bases of C+G % of all bases.

PacBio测序数据质量的统计见下表：The statistics of PacBio sequencing data quality are shown in the table below:

上表中：Sample ID:测序样品的名称；Mean Concordance：平均一致度得分；Number of Reads：reads个数；Number of Bases：数据量；Mean Read Length：平均测序读长：N50 ReadLength(bp)：测序读长N50长度，即将测序数据从大到小依次排列，在总体50％处的测序读长。In the above table: Sample ID: the name of the sequencing sample; Mean Concordance: the average consistency score; Number of Reads: the number of reads; Number of Bases: the amount of data; Mean Read Length: the average sequencing read length: N50 ReadLength (bp): Sequencing read length N50 length, that is, the sequencing data are arranged in descending order, and the sequencing read length at 50% of the total.

测序数据质量优化：高通量测序中通常会出现一些点突变等测序错误，而且序列末端的质量一般比较低，而且含有接头序列，因此为了提高拼接质量，拿到更准确的生物信息分析结果，需要对原始数据进行优化处理。Sequencing data quality optimization: Sequencing errors such as point mutations usually occur in high-throughput sequencing, and the quality of the end of the sequence is generally low and contains adapter sequences. Therefore, in order to improve the quality of splicing and obtain more accurate biological information analysis results, Raw data needs to be optimized.

具体步骤及参数设定：使用Trimmomatic(v0.32)，对Illumina测序原始数据去除引物和接头序列，去除两端质量低于20的碱基，去除长度小于100bp的序列；使用SPAdes(v3.14.1)对Illumina测序数据进行拼接，选择paired reads输入形式，默认参数,完成contig的组装拼接。使用Canu(v2.1.1)对PacBio测序数据进行拼接，设定genomeSize为20m(酵母参考基因组大小),correctedErrorRate默认设定(0.045for PacBio reads,and0.144for Nanopore reads)，将依据测序深度，覆盖度低于30X时候，可以增加1％的corrected ErrorRate，当测序覆盖度高于60X时候，减少1％的corrected ErrorRate。其他参数设定，均以默认参数为准，如minOverlapLength<integer＝500>，minReadLength<integer＝1000>。使用Pilon(v1.23)，以FASTA和BAM文件作为输入，根据比对结果对输入的参考基因组进行提高，包括查找单碱基差异，小的插入缺失(indels)，较大的插入缺失或者block，填充参考序列中的N以及找到局部的错误组装。其contig的fasta文件，来自Illumina和PacBio各自标准流程组装结果，合并去除冗余，最终获取到经过Pilon处理后，拼接质量进一步提高的基因组拼接组装序列。使用SSPACE(v2.1.1)，进一步对拼接后的contig利用原始reads序列paired-end信息，进行可能的延伸，最终获取到延伸后的基因组拼接版本。Specific steps and parameter settings: Use Trimmomatic (v0.32) to remove primers and adapter sequences from Illumina sequencing raw data, remove bases with a quality lower than 20 at both ends, and remove sequences with a length less than 100 bp; use SPAdes (v3.14.1 ) to splice the Illumina sequencing data, select the paired reads input form, default parameters, and complete the assembly and splicing of the contig. Use Canu (v2.1.1) to splice PacBio sequencing data, set genomeSize to 20m (yeast reference genome size), correctedErrorRate default setting (0.045for PacBio reads, and 0.144for Nanopore reads), will be based on sequencing depth, coverage When the sequencing coverage is lower than 30X, the corrected ErrorRate can be increased by 1%, and when the sequencing coverage is higher than 60X, the corrected ErrorRate can be reduced by 1%. Other parameter settings are subject to the default parameters, such as minOverlapLength<integer=500>, minReadLength<integer=1000>. Use Pilon (v1.23), take FASTA and BAM files as input, improve the input reference genome according to the comparison results, including finding single base differences, small indels, large indels or blocks , filling N in the reference sequence and finding local misassemblies. The fasta file of its contig comes from the assembly results of Illumina and PacBio's respective standard processes, which are merged to remove redundancy, and finally obtain the genome assembly sequence with further improved splicing quality after Pilon processing. Using SSPACE (v2.1.1), further use the paired-end information of the original reads sequence for the spliced contig to perform possible extensions, and finally obtain the spliced version of the extended genome.

分别使用QUAST(v5.0.2)，BUSCO(v4.1.4)对延伸后的基因组拼接版本进行质量评估。其中用QUAST计算组装结果里的mismatch和InDel比例来评估碱基的准确度。通过查询BUSCO数据库比对结果来对组装结果进行完整性评估。其中C代表能匹配上的数据库中完整的单拷贝直系同源数目，F代表匹配上但不完整，M代表没能匹配上的单拷贝直系同源基因。n代表对应数据库中包含的单拷贝直系同源基因总数。QUAST (v5.0.2) and BUSCO (v4.1.4) were used to evaluate the quality of the extended genome assembly version respectively. Among them, QUAST is used to calculate the ratio of mismatch and InDel in the assembly result to evaluate the accuracy of the base. The integrity of the assembly results was evaluated by querying the comparison results of the BUSCO database. Among them, C represents the number of complete single-copy orthologous genes in the database that can be matched, F represents the number of single-copy orthologous genes that are matched but incomplete, and M represents the single-copy orthologous genes that cannot be matched. n represents the total number of single-copy orthologous genes contained in the corresponding database.

考虑到实际拼接的基因组来自酵母，这里选择比较适合真菌的专属注释流程Fungap(v1.1.1)来完成，并选取了已有转录组数据的参考邻近酵母基因组Y.lipolyticaW29的转录组RNAseq序列作为辅助，流程中对以下三个不同基因预测软件Augustus(v3.3.3),Braker(v2.1.5),Maker(v2.31.10)的结果进行评估,通过构建evidence score分值公式，确定最佳基因模型预测结果。Considering that the actual spliced genome comes from yeast, Fungap (v1.1.1), a dedicated annotation process suitable for fungi, is selected here to complete it, and the transcriptome RNAseq sequence of the reference adjacent yeast genome Y. lipolyticaW29 with existing transcriptome data is selected as an auxiliary In the process, the results of the following three different gene prediction software Augustus (v3.3.3), Braker (v2.1.5), Maker (v2.31.10) are evaluated, and the best gene model prediction is determined by constructing the evidence score formula result.

使用RAVEN Toolbox(v2.0)，基于预测到的基因组注释的蛋白序列文件，通过对KEGG数据库和MetaCyc数据库中比对查找，获取到蛋白对应的同源基因，依据基因-酶-反应的关联，分别重头构建基因组代谢规模网络模型，包含各自所涉及的反应Rxns，代谢物Mets和基因Genes.通过对相同反应，代谢物和基因的整合，获取到整合的基因组代谢规模网络模型。Using RAVEN Toolbox (v2.0), based on the predicted protein sequence files of genome annotations, by comparing and searching the KEGG database and the MetaCyc database, the homologous genes corresponding to the proteins were obtained, and based on the gene-enzyme-reaction association, Re-construct the genome metabolic scale network model from scratch, including the respective reaction Rxns, metabolite Mets and gene Genes. By integrating the same reactions, metabolites and genes, an integrated genome metabolic scale network model is obtained.

通过ModelExplore(v2.0)完成GMEs的评估,并通过MetExploreViz进一步搭建代谢模型GEMs的可视化网络平台展示。使用OptFlux(v3.4.0)导入代谢模型，实现FBA流平衡的分析,帮助科研人员进行微生物的代谢模拟和改造。The evaluation of GMEs was completed through ModelExplore (v2.0), and the visual network platform display of metabolic model GEMs was further built through MetExploreViz. Use OptFlux (v3.4.0) to import the metabolic model to realize the analysis of FBA flow balance and help researchers to simulate and transform the metabolism of microorganisms.

整套流程所涉及使用到的分析软件：Trimmomatic(v0.32),SPAdes(v3.14.1),Canu(v2.1.1),Pilon(v1.23),SSPACE(v2.1.1)，QUAST(v5.0.2)，BUSCO(v4.1.4)，Fungap(v1.1.1),Augustus(v3.3.3),Braker(v2.1.5),Maker(v2.31.10),Samtools(v1.10),Bamtools(v2.5.1),Pfam_scan(v1.6),GeneMark-ES/ET(v4.59),RepeatModeler(v2.0.1),Hisat2(v2.2.0)，RAVEN Toolbox(v2.0),ModelExplore(v2.0),MetExploreViz,OptFlux(v3.4.0)。The analysis software involved in the whole process: Trimmomatic (v0.32), SPAdes (v3.14.1), Canu (v2.1.1), Pilon (v1.23), SSPACE (v2.1.1), QUAST (v5.0.2 ), BUSCO(v4.1.4), Fungap(v1.1.1), Augustus(v3.3.3), Braker(v2.1.5), Maker(v2.31.10), Samtools(v1.10), Bamtools(v2.5.1) ,Pfam_scan(v1.6),GeneMark-ES/ET(v4.59),RepeatModeler(v2.0.1),Hisat2(v2.2.0),RAVEN Toolbox(v2.0),ModelExplore(v2.0),MetExploreViz, OptFlux (v3.4.0).

下面以解脂耶氏酵母为例说明本发明技术方案：Below take Yarrowia lipolytica as an example to illustrate the technical scheme of the present invention:

第一部分：解脂耶氏酵母基因组拼接Part I: Yarrowia lipolytica genome assembly

序列拼接过程如附图7所示，序列组装拼接、评估流程图如附图1所示。The process of sequence assembly is shown in Figure 7, and the flowchart of sequence assembly and evaluation is shown in Figure 1.

基因组序列拼接将分别以Illumina二代测序和PacBio三代测序完成相应平台的基因组组装结果，并将以PacBio三代测序结果为骨架，进一步使用Illumina二代测序的高质量reads通过pilon和SSPACE程序进行序列纠错和衍生，从而得到我们的第一个基因组分析组装版本Poly_version 1.0。最后分别利用N50参数、QUAST和BUSCO数据库完成整个组装评估工作。Genome sequence assembly will use Illumina next-generation sequencing and PacBio third-generation sequencing to complete the genome assembly results of the corresponding platforms, and will use the PacBio third-generation sequencing results as the backbone to further use the high-quality reads of Illumina next-generation sequencing to perform sequence correction through pilon and SSPACE programs Wrong and derived, resulting in our first genome analysis assembly version Poly_version 1.0. Finally, the whole assembly evaluation work is completed by using N50 parameters, QUAST and BUSCO databases respectively.

(1)Illumina平台建库测序结果：(1) Illumina platform library construction and sequencing results:

Illumina测序深度评估Illumina Sequencing Depth Assessment

7,567,4601502/20,000,000＝113X(酵母基因组按照20MB预估)7,567,4601502/20,000,000=113X (the yeast genome is estimated at 20MB)

Illumina NovaSeq PE150测序结果见下表：Illumina NovaSeq PE150 sequencing results are shown in the table below:

Sample ID:测序样品的名称；Insert Size(bp):建库时候插入的DNA片段长度；Clean Reads Length(bp)：reads测序的长度；Raw Data(Mb)：原始数据测序碱基总量；Clean Q20(％):测序错误率小于1％的碱基数量百分比；Clean Q30(％):测序错误率小于0.1％的碱基数量百分比；GC(％):C+G的碱基占所有碱基的数量百分比Sample ID: the name of the sequencing sample; Insert Size (bp): the length of the DNA fragment inserted when building the library; Clean Reads Length (bp): the length of the reads sequencing; Raw Data (Mb): the total number of bases sequenced in the raw data; Clean Q20(%): The percentage of bases whose sequencing error rate is less than 1%; Clean Q30(%): The percentage of bases whose sequencing error rate is less than 0.1%; GC(%): C+G bases account for all bases % of quantity

PacBio平台建库测序结果见下表：The results of library construction and sequencing on the PacBio platform are shown in the table below:

PacBio平台建库测序结果文件解压缩后对应文件为data/data_jie/jiequ05.subreads.bamPacBio测序深度评估The PacBio platform library construction sequencing result file is decompressed and the corresponding file is data/data_jie/jiequ05.subreads.bamPacBio sequencing depth evaluation

3,000,081,562/20,000,000＝150X(酵母基因组按照20MB预估)3,000,081,562/20,000,000=150X (the yeast genome is estimated at 20MB)

(2)基因组拼接结果：(2) Genome splicing results:

Illumina测序初步拼接结果见下表：The preliminary splicing results of Illumina sequencing are shown in the table below:

序号serial number 项目project Scaffold片段Scaffold Fragment Contig片段Contig Fragment 11 总数目(>500bp)Total number (>500bp) 117117 147147 22 总长度(bp)total length (bp) 20,295,84620,295,846 20,295,54620,295,546 33 N50长度(bp)N50 length (bp) 372,953372,953 316,305316,305 44 N90长度(bp)N90 length (bp) 139,075139,075 87,73487,734 55 最大长度(bp)Maximum length (bp) 949,861949,861 949,861949,861 66 最小长度(bp)Minimum length (bp) 574574 547547 77 序列GC含量(％)Sequence GC content (%) 4949 4949

过程：通过SPAdes等二代测序常用软件进行拼接组装。Process: Splicing and assembly is performed by commonly used next-generation sequencing software such as SPAdes.

PacBio测序拼接结果见下表：The splicing results of PacBio sequencing are shown in the table below:

过程：通过PacBio官方提供的SMRT Link v5.0.1软件对reads进行组装，利用变异检测软件polish完成矫正后，进一步使用resequencing程序将没用到的reads与组装结果polished_assembly.fasta进行mapping,最后得到6个contigs的PacBio拼装结果。Process: The reads were assembled through the SMRT Link v5.0.1 software officially provided by PacBio. After the correction was completed using the variation detection software polish, the resequencing program was further used to map the unused reads and the assembly result polished_assembly.fasta, and finally 6 were obtained. PacBio assembly results of contigs.

Illumina和PacBio数据结合拼接结果见下表：The combined splicing results of Illumina and PacBio data are shown in the table below:

过程：Pilon和bowtie程序对组装结果错误程序进行修订，详细结果解释举例：Contig1:1146572Contig1_pilon:1146573-1146574.ATProcess: Pilon and bowtie programs revised the assembly result error program, detailed result explanation example: Contig1:1146572Contig1_pilon:1146573-1146574.AT

在PacBio拼接出的最初Contig1序列文件，经过pilon和bowtie程序的核查，发现在其1146572个核甘酸的位置，遗漏了两个核甘酸AT，即1146573-1146574位置。In the original Contig1 sequence file spliced by PacBio, after verification by the pilon and bowtie programs, it was found that two nucleotide ATs were missing at the position of 1146572 nucleotides, that is, positions 1146573-1146574.

纠正后结果进一步衍生The corrected result is further derived

(3)基因组拼接结果评估：(3) Evaluation of genome splicing results:

Quast评估结果见下表：The Quast evaluation results are shown in the table below:

依据BUSCO数据库评估结果：基因组拼接和注释的完整性评估通过使用BUSCO(通用单拷贝直系同源基因为基准)数据库。构成每个主要谱系的BUSCO集的基因是选自直系同源组，其基因在至少90％的物种中以单拷贝直系同源物形式存在。Results were assessed against the BUSCO database: The completeness of genome assembly and annotation was assessed by using the BUSCO (universal single-copy orthologous genes as benchmark) database. The genes comprising the BUSCO set of each major lineage were selected from ortholog groups whose genes were present as single-copy orthologs in at least 90% of the species.

过程：针对我们拼接的基因组Yarrowia lipolytica属于ascomycota phyla.选取fungi_odb9和ascomycota_odb9分别对三个组装结果进行评估。结果如(附图6)。三个组装结果，在依赖BUSCO数据库评估上，相差不大，在ascomycota_odb9中都能找到97.5％的BUSCO集合.我们选取二代和三代测序相结合的组装结果Poly_version 1.0进行后续分析。Process: For our spliced genome Yarrowia lipolytica belongs to ascomycota phyla. Select fungi_odb9 and ascomycota_odb9 to evaluate the three assembly results respectively. The result is as (accompanying drawing 6). The three assembly results are not much different in terms of relying on the evaluation of the BUSCO database, and 97.5% of the BUSCO collection can be found in ascomycota_odb9. We selected the assembly result Poly_version 1.0 of the combination of second-generation and third-generation sequencing for subsequent analysis.

(4)染色体编号的确定：由实验信息所知拼接的酵母总共有6个染色体，我们的组装拼接结果对应的contig数目也正好是6，可以确定拼接结果达到了染色体水平，利用网上KEGG数据库中的一个Yarrowia lipolytica的yli菌株信息所知，信息如下，我们首先依据大小对应关系，初步确定了我们染色体的编号。(4) Determination of chromosome number: According to the experimental information, the spliced yeast has a total of 6 chromosomes, and the number of contigs corresponding to our assembly and splicing results is exactly 6. It can be determined that the splicing results have reached the chromosome level. Using the online KEGG database The information of a yli strain of Yarrowia lipolytica is known, and the information is as follows. We first determined the number of our chromosomes based on the size correspondence.

序号serial number Ref Chromosome IDRef Chromosome ID Size(bp)Size(bp) Scaffold IDScaffold ID Scaffold size(bp)Scaffold size(bp) 11 Chromosome EChromosome E 4,224,1034,224,103 scaffold1scaffold1 4,202,6644,202,664 22 Chromosome FChromosome F 4,003,3624,003,362 scaffold2scaffold2 4,015,0944,015,094 33 Chromosome DChromosome D 3,633,2723,633,272 scaffold3scaffold3 3,641,5803,641,580 44 Chromosome CChromosome C 3,272,6093,272,609 scaffold4scaffold4 3,323,7923,323,792 55 Chromosome BChromosome B 3,066,3743,066,374 scaffold5scaffold5 3,073,8383,073,838 66 Chromosome AChromosome A 2,303,2612,303,261 scaffold6scaffold6 2,275,4892,275,489

并通过flak-1.0软件对参考染色体基因组和我们推测的对应染色体基因组进行了序列比对，结果如附图8：由以上比对结果可知，我们组装的染色体基因组序列和对应的参考染色体基因组序列高度匹配，除了scaffold5和ChrB在150kb附近出现了个倒转reverse的匹配。我们针对该区段进一步查看Illumina reads和组装的scaffold5 contig的mapping情况，对所得到的bam文件利用可视化软件IGV查看alignment的情况，我们能够找到大部分高度匹配的灰色reads覆盖了该区域，这里reads会用不同的颜色表示(插入大小小于期望值；插入大小大于期望值；倒置、重复、易位事件)，由此推测该区段确实是个与参考yli菌株不同的地方。对PacBio的数据也和scaffold5 contig进行了alignmentmapping,结果可视化相同区段around 160kb,同样能找到长的PacBio reads顺利跨过匹配。如果组装有问题，PacBio全部不能顺利跨过去。And through the flak-1.0 software, the reference chromosomal genome and our speculated corresponding chromosomal genome were compared, and the results are shown in Figure 8: From the above comparison results, we can see that the chromosomal genome sequence we assembled and the corresponding reference chromosomal genome sequence are highly Matching, except for scaffold5 and ChrB, there is a reverse match near 150kb. We further checked the mapping situation of Illumina reads and assembled scaffold5 contig for this section, and used the visualization software IGV to check the alignment of the obtained bam file. We were able to find that most of the highly matched gray reads covered this area, here reads It will be indicated by different colors (insertion size smaller than expected; insertion size larger than expected; inversion, duplication, translocation event), so it is speculated that this segment is indeed different from the reference yli strain. Alignment mapping was also performed on the PacBio data and the scaffold5 contig, and the results visualized the same segment around 160kb, and the long PacBio reads could also be found to cross the match smoothly. If there is a problem with the assembly, PacBio will not be able to pass through smoothly.

(5)线立体染色体序列的查找：由于参考基因组有yli还存在一个线粒体染色体序列，鉴于他们6条对应染色体序列的高度匹配，我们有理由相信拼接的基因组中是否漏掉了线粒体序列，首先通过reads和参考线粒体序列进行mapping查找，结果如下：(5) Search for linear three-dimensional chromosome sequences: Since the reference genome has yli and there is also a mitochondrial chromosome sequence, in view of the high matching of their six corresponding chromosome sequences, we have reason to believe whether the mitochondrial sequence is missing in the spliced genome. First, through reads and reference mitochondrial sequence for mapping search, the results are as follows:

我们找到有7.2％的reads可以匹配，接下来，我们将这些匹配的reads抓取出来，通过SPAdes软件进行denovo的再次拼接。得到size大小是47,912bp的线粒体染色体基因组文件mt_chr.fasta.为了验证是否为环状结构序列，我们截取首尾一段DNA序列，创建test序列的fasta文件。利用test文件与前面Illumina拼接得到的scaffold contigs文件进行blat查询，找到了一个高度匹配的scaffold contig-Scaffold86_1，见附图9，Test测试序列与scaffolds86_1的reverse序列可以找到很好的跨度匹配,从而证明了为一个完整的环状mt_chr.fasta，该序列与参考基因组线粒体染色体序列比对，高度相似，证明线粒体染色体基因组序列高度保守，基本没有发生变化，最后的基因组组装序列结果为(我们将线粒体环状比对中多出的8个碱基去掉了)We found that 7.2% of the reads could be matched. Next, we grabbed these matched reads and used the SPAdes software to perform denovo splicing again. The mitochondrial chromosome genome file mt_chr.fasta with a size of 47,912bp was obtained. In order to verify whether it is a circular structure sequence, we intercepted a DNA sequence from the beginning and the end to create a fasta file of the test sequence. Using the blat query between the test file and the scaffold contigs file spliced by Illumina, a highly matching scaffold contig-Scaffold86_1 was found, as shown in Figure 9. A good span match can be found between the Test test sequence and the reverse sequence of scaffolds86_1, thus proving In order to create a complete circular mt_chr.fasta, the sequence is highly similar to the mitochondrial chromosome sequence of the reference genome, which proves that the mitochondrial chromosome genome sequence is highly conserved and basically unchanged. The final genome assembly sequence result is (we put the mitochondrial ring The extra 8 bases in the comparison were removed)

序号serial number Ref Chromosome IDRef Chromosome ID Size(bp)Size(bp) Scaffold IDScaffold ID Scaffold size(bp)Scaffold size(bp) 11 Chromosone MTChromosone MT 47,91647,916 scaffold_mtscaffold_mt 47,90447,904 22 Chromosome EChromosome E 4,224,1034,224,103 scaffold1scaffold1 4,202,6644,202,664 33 Chromosome FChromosome F 4,003,3624,003,362 scaffold2scaffold2 4,015,0944,015,094 44 Chromosome DChromosome D 3,633,2723,633,272 scaffold3scaffold3 3,641,5803,641,580 55 Chromosome CChromosome C 3,272,6093,272,609 scaffold4scaffold4 3,323,7923,323,792 66 Chromosome BChromosome B 3,066,3743,066,374 scaffold5scaffold5 3,073,8383,073,838 77 Chromosome AChromosome A 2,303,2612,303,261 scaffold6scaffold6 2,275,4892,275,489

以上整个第一部分流程参数选择说明如下：The selection of process parameters in the first part of the above is explained as follows:

对Illumina二代测序数据，首先通过Trimmomatic软件对Illumina原始测序数据在去掉接头序列，以及尾部低质量(q<20)碱基序列，去除长度小于36pb序列，通过SPAdes拼接软件，计算不同k-mer size参数结果后默认选择有效contig序列长度大于500bp,并选取N50最大的拼接结果作为最终Illumina组装contig结果.对于PacBio三代测序数据，则是通过开源的canu标准流程中先进行序列纠错，对我们的酵母基因组数据进行处理，并依据网上已有参考酵母基因组大小设定参考基因组大小为20M。在拿到纠正后的PacBio数据后，我们发现利用Circlator软件，整合Illumina拼接的contig文件作为有益补充，基于PacBio纠正后的reads序列来组装，其效果会好于canu默认自带拼接组装步骤结果，特别是对于环状的原核生物基因组，甚至可以拿到整个完整的环状基因组序列，而无需要在用PCR实验去补基因组上的gap。纠正部分，这里我们对拿到的PacBio的contig组装结果，利用Illumina原始测序数据，经过bowtie和samtools软件进一步完成比对，过滤高质量比对的read，通过碱基错误修订软件Pilon以align部分产生的bam文件为依据，对拼接结果中一些可能的碱基错误进行更正，并通过进一步的延伸软件SSPACE,利用Illumina原始reads数据PacBio的纠正后拼接结果进一延伸，最终获取到了一个高质量的基因组组装结果。基因组组装评估部分，我们通过quast和BUSCO数据库信息完成了对组装结果的评估工作。基因组注释部分，我们通过fungap调用RepeatModeler从头对基因组的重复序列家族进行建模注释，并且使用RepeatMasker来检测和屏蔽基因组中的重复序列，因为基因组中存在重复序列结构区，特别是高等真核生物，重复序列占了相当大的比例，会影响基因预测的质量，也会带来不必要的资源消耗。然后，通过三种常见的基因预测软件Augustus,Maker和Braker来完成基因组拼接后的基因注释结果。对于参考的转录组数据，通过Hisat/Trinity流程，拿到RNA-seq的拼接组装结果，在利用Maker和Braker完成对该结果的基因预测。所有的预测结果，将全部进行评估打分，通过BLASTp,BUSCO,InterProScan和BLASTn获取到对应的分值，最后通过基因模型的Evidence Score公式如下，获取到Evidence Score分值最高的预测结果作为最终的基因预测模型。在这一步参数选择上，我们还加入了sister_organisms,即下载了NCBI数据库中所有9个解脂耶氏酵母基因组做注释参考，以提高注释的准确性。以我们的酵母基因组拼接为例子，获取到的拼接，评估和注释比对结果详见研究报告文档。这里我们可以看到与NCBI数据库中已有的完整染色体水平的解脂耶氏酵母基因组参考的Yarrowialipolytica CLIB122比较我们最后的拼接组装结果也是得到了6个染色体和一个质粒序列。For Illumina next-generation sequencing data, first use Trimmomatic software to remove adapter sequences and tail low-quality (q<20) base sequences from Illumina original sequencing data, remove sequences less than 36pb in length, and use SPAdes splicing software to calculate different k-mers After the size parameter results, the effective contig sequence length is selected by default to be greater than 500bp, and the splicing result with the largest N50 is selected as the final Illumina assembly contig result. For the PacBio third-generation sequencing data, the sequence error correction is performed first through the open source canu standard process. The yeast genome data were processed, and the reference genome size was set to 20M based on the existing reference yeast genome size on the Internet. After getting the corrected PacBio data, we found that using the Circlator software to integrate the contig files spliced by Illumina as a useful supplement, based on the corrected reads sequence of PacBio to assemble, the effect will be better than the default splicing and assembly steps of canu. Especially for circular prokaryotic genomes, even the entire complete circular genome sequence can be obtained without the need to use PCR experiments to fill in the gaps on the genome. For the correction part, here we use the Illumina original sequencing data to further complete the alignment through the bowtie and samtools software for the obtained PacBio contig assembly results, filter the high-quality aligned reads, and use the base error correction software Pilon to generate the align part Based on the bam file of Illumina, some possible base errors in the splicing results were corrected, and through the further extension software SSPACE, the corrected splicing results were further extended using Illumina original reads data PacBio, and finally a high-quality genome was obtained Assembly result. In the genome assembly evaluation part, we completed the evaluation of the assembly results through the quast and BUSCO database information. In the genome annotation part, we use Fungap to call RepeatModeler to model and annotate the repetitive sequence family of the genome from scratch, and use RepeatMasker to detect and mask the repetitive sequences in the genome, because there are repetitive sequence structural regions in the genome, especially in higher eukaryotes, Repeated sequences account for a considerable proportion, which will affect the quality of gene prediction and cause unnecessary resource consumption. Then, three common gene prediction software Augustus, Maker and Braker are used to complete the gene annotation results after genome splicing. For the reference transcriptome data, through the Hisat/Trinity process, the RNA-seq splicing and assembly results are obtained, and Maker and Braker are used to complete the gene prediction of the results. All prediction results will be evaluated and scored, and the corresponding scores will be obtained through BLASTp, BUSCO, InterProScan and BLASTn. Finally, the Evidence Score formula of the gene model is as follows, and the prediction result with the highest Evidence Score is obtained as the final gene predictive model. In this step of parameter selection, we also added sister_organisms, that is, downloaded all 9 Yarrowia lipolytica genomes in the NCBI database for annotation reference to improve the accuracy of annotation. Taking our yeast genome assembly as an example, the obtained assembly, evaluation and annotation comparison results are detailed in the research report document. Here we can see that compared with Yarrowia lipolytica CLIB122, which is a complete chromosome-level Yarrowia lipolytica genome reference in the NCBI database, our final splicing and assembly results also obtained 6 chromosomes and a plasmid sequence.

第二部分：解脂耶氏酵母基因组注释Part II: Yarrowia lipolytica genome annotation

(1)基因组注释过程(1) Genome annotation process

整套流程利用了转录组数据做辅助，针对目前我们没有自己的转录组序列，选取了高度相似的同种酵母Yarrowia lipolytica Strain W29一套转录组序列为辅助。The entire process uses transcriptome data as an aid. Since we do not have our own transcriptome sequence at present, we selected a set of transcriptome sequence from the highly similar yeast Yarrowia lipolytica Strain W29 as an assistant.

依据文章提供的信息，当选取相同species或genus水平，对应平均DNA相似度达到96％或92.7％，则预测结果与真实转录组数据辅助结果非常接近。According to the information provided in the article, when the same species or genus level is selected, and the corresponding average DNA similarity reaches 96% or 92.7%, the predicted result is very close to the auxiliary result of the real transcriptome data.

(2)基因组注释具体过程步骤如下(2) The specific process steps of genome annotation are as follows

Step1：输入数据的预处理。Step1: Preprocessing of input data.

重复遮蔽；基因组区域如转座子重复，经常会使错误的排列，干扰基因预测。Repeat cloaking; genomic regions such as transposon repeats, which often result in erroneous alignments, interfere with gene predictions.

使用RepeatModeler构建的基因组特异性重复库。Genome-specific repeat library built using RepeatModeler.

mRNA读数的组装；Assembly of mRNA reads;

用户提供的mRNA读数由Trinity程序组装。User-supplied mRNA reads were assembled by the Trinity program.

Step2：基因预测Step2: Gene prediction

FunGAP使用的制造者和默认参数；用迭代SNAP基因模型训练运行四次。由于真菌基因组的基因密度较高，在mRNA组装中纠正相邻转录本的融合。Augustus和FunGAP使用的默认参数；与用户指定的augustus_species参数。允许重叠的CDS预测。FunGAP使用的刹车器和默认参数；使用GeneMark-ET和Augustus进行基于无监督RNA测序的基因组注释。Maker and default parameters used by FunGAP; four runs with iterative SNAP gene model training. Correction of fusions of adjacent transcripts in mRNA assemblies due to the high gene density of fungal genomes The default parameters used by Augustus and FunGAP; with the user-specified augustus_species parameter. Overlapping CDS predictions are allowed. Brakes and default parameters used by FunGAP; Unsupervised RNA-Seq-based genome annotation using GeneMark-ET and Augustus.

Step3：基因模型评估Step3: Gene model evaluation

BLASTp；BUSCO；InterProScan(Pfam域预测)；BLASTn；证据得分(基因模型)＝BLASTp_score*cov(query)*cov(target)+BUSCO_score+Pfam_scores+BLASTn_score*cov(query)*cov(target)BLASTp; BUSCO; InterProScan(Pfam domain prediction); BLASTn; Evidence score(gene model) = BLASTp_score*cov(query)*cov(target)+BUSCO_score+Pfam_scores+BLASTn_score*cov(query)*cov(target)

Step4：基因模型过滤Step4: Gene model filtering

提供的信息，注释流程中,基因预测程序augustus选取的参考模型为相同的种yarrowia_lipolytica进行预测，并下载9个yarrowia proteome数据进行同源查找。Provided information, in the annotation process, the reference model selected by the gene prediction program augustus is the same species yarrowia_lipolytica for prediction, and 9 yarrowia proteome data are downloaded for homology search.

下载姐妹蛋白质组(Download sister proteome)Download sister proteome

(3)基因组注释结果:FunGAP报告：创建于2019-09-17，该基因组中预测了6970个基因。(3) Genome annotation results: FunGAP report: Created on 2019-09-17, 6970 genes were predicted in this genome.

1.基因结构Gene structure见下表：1. Gene structure See the table below:

序号serial number AttributesAttributes ValuesValues 11 Total protein-coding genesTotal protein-coding genes 6,9706,970 22 Transcript length(avg/med)Transcript length(avg/med) 1,463.9/1,201.01,463.9/1,201.0 33 CDS length(avg/med)CDS length(avg/med) 1,362.7/1,131.01,362.7/1,131.0 44 Protein length(avg/med)Protein length (avg/med) 454.2/377.0454.2/377.0 55 Exon length(avg/med)Exon length(avg/med) 1,035.0/810.01,035.0/810.0 66 Intron length(avg/med)Intron length(avg/med) 319.5/168.0319.5/168.0 77 Spliced genesSpliced genes 1,833(26.3％)1,833 (26.3%) 88 Gene density(genes/Mb)Gene density (genes/Mb) 338.67338.67 99 Number of intronsNumber of introns 2,2072,207 1010 Number of introns per gene(med)Number of introns per gene(med) 1.01.0 1111 Number of exonsNumber of exons 9,1779,177 1212 Number of exons per gene(med)Number of exons per gene(med) 1.01.0

2.转录组读长序列组装(Transcriptome reads assembly)见下表：2. Transcriptome reads assembly is shown in the table below:

序号serial number AttributesAttributes ValuesValues 11 Number of mapped readsNumber of mapped reads 20,207,84720,207,847 22 Number of assembled contigsNumber of assembled contigs 10,44810,448 33 Number of contigs>1kbpNumber of contigs>1kbp 5,2955,295 44 Total transcript size(Mbp)Total transcript size(Mbp) 13,320,73813,320,738

3.转录本长度分布(Transcript length distribution)见附图33. Transcript length distribution (Transcript length distribution) see Figure 3

4.蛋白质长度分布(Protein length distribution)见附图3总阅读量：22,385,814次；映射读数：20,207,847次；对齐率：90.27％。Total reads:22,385,814；Mappedreads:20,207,847；Alignment rate:90.27％4. Protein length distribution (Protein length distribution) is shown in Figure 3. Total reads: 22,385,814; mapped reads: 20,207,847; alignment rate: 90.27%. Total reads: 22,385,814; Mapped reads: 20,207,847; Alignment rate: 90.27%

依据现有技术可知，当选取的transcriptome reads alignment rate不低于80％时候，基因预测效果就比较可靠，这里由于我们选择了species水平的酵母转录组数据，reads与genome的匹配度高达90.27％.说明预测的结果是对的。According to the existing technology, when the selected transcriptome reads alignment rate is not less than 80%, the gene prediction effect is more reliable. Here, because we have selected the yeast transcriptome data at the species level, the matching degree between reads and genome is as high as 90.27%. It shows that the predicted result is correct.

第三部分：解脂耶氏酵母基因组注释与参考基因组比较Part III: Yarrowia lipolytica genome annotation and reference genome comparison

(1)将注释完成的基因组通过BUSCO数据库评估组装结果5个Missing BUSCOs(M):EOG092D12CU、EOG092D2L06、EOG092D3PHH、EOG092D4JRM、EOG092D4SCY对于KEGG数据库中存在的完成参考基因组Yarrowia lipolytica Strain CLIB122的进行同样的BUSCO数据库核查，结果如下以下9个Missing BUSCOs(M)(EOG092D1EU1、EOG092D1Q04、EOG092D1RDH、EOG092D2L06、EOG092D2RP6、EOG092D37V8、EOG092D3C2F、EOG092D4H30、EOG092D4SCY)与前面的5个Missing BUSCOs存在2个重叠BUSCOs，可见他们都缺乏以下2个BUSCOs.(1) Evaluate the assembly results of the annotated genome through the BUSCO database 5 Missing BUSCOs (M): EOG092D12CU, EOG092D2L06, EOG092D3PHH, EOG092D4JRM, EOG092D4SCY For the completed reference genome Yarrowia lipolytica Strain CLIB122 existing in the KEGG database, perform the same BUSCO database核查，结果如下以下9个Missing BUSCOs(M)(EOG092D1EU1、EOG092D1Q04、EOG092D1RDH、EOG092D2L06、EOG092D2RP6、EOG092D37V8、EOG092D3C2F、EOG092D4H30、EOG092D4SCY)与前面的5个Missing BUSCOs存在2个重叠BUSCOs，可见他们都缺乏以下2 BUSCOs.

第1个：EOG092D2L06"Zinc finger,RING-like"156 155 1 329 1.3703 NAIPR011513；IPR0148571st: EOG092D2L06 "Zinc finger, RING-like" 156 155 1 329 1.3703 NAIPR011513; IPR014857

第2个：EOG092D4SCY"EF-hand domain"164 162 2 205 1.0865NA IPR002048；IPR011992；IPR0182472nd: EOG092D4SCY "EF-hand domain" 164 162 2 205 1.0865NA IPR002048; IPR011992; IPR018247

(2)选取的参考基因组为Yarrowia lipolytica Strain W29，下载其对应的蛋白序列注释文件进行在线比对，找到的共有的6511pair genes。(2) The reference genome selected was Yarrowia lipolytica Strain W29, and its corresponding protein sequence annotation file was downloaded for online comparison, and a total of 6511 pair genes were found.

接下来，我们以注释的6970基因文件fungap_out_prot.faa为基准，进行VENN图分析，其中参考基因组YarliW29与其共有的基因map数目为6511.另外我们也比较了前面BUSCO中ascomycota_odb9数据库中的1315个BUSCO group匹配情况，总共是匹配到了1310个BUSCO group(包含完整和片段匹配),对应6970个基因中的1318个基因。其中2个为0的数值，是由于reference bias我们只选择我们拼接注释的6970个基因为共同参考比较，因此YarliW29与BUSCO(1315groups)的完全比较存在遗漏可能。并没有见到比较明显的基因在染色体上的聚类情况，每条染色体上与参考基因组的比较，存在相当数目的独特基因分布，附图5为459个独特基因在染色体上的分布情况图。Next, we use the annotated 6970 gene file fungap_out_prot.faa as a benchmark to perform VENN map analysis, in which the reference genome YarliW29 shares 6511 gene maps with it. In addition, we also compared the 1315 BUSCO groups in the ascomycota_odb9 database in the previous BUSCO In terms of matching, a total of 1310 BUSCO groups (including complete and fragment matches) were matched, corresponding to 1318 genes in 6970 genes. Two of the values are 0, because we only select the 6970 genes that we spliced and annotated as the common reference comparison for reference bias, so there may be omissions in the complete comparison between YarliW29 and BUSCO (1315groups). There is no obvious clustering of genes on chromosomes. Compared with the reference genome, there are a considerable number of unique gene distributions on each chromosome. Attached Figure 5 is a map of the distribution of 459 unique genes on chromosomes.

第四部分：解脂耶氏酵母基因组代谢模型重构Part IV: Reconstruction of Yarrowia lipolytica genome metabolic model

(1)基因组规模的代谢模型(GEMs)(1) Genome-scale metabolic models (GEMs)

基因组尺度代谢模型(GEMs)通过计算描述生物体内整个代谢基因的基因-蛋白-反应关联，可以模拟预测各种系统级代谢研究的代谢通量。最新的GEM，iMK1208，是一个高质量的模型，包括1208个基因和1643个反应，并成功地用于预测增加肌动蛋白生产的代谢工程目标。天蓝链霉菌(Streptomyces coelicolor)基因组尺度模型(GEMs)用于系统生物学研究，创业板自动重建草案Automated draft GEM reconstructionGenome-scale metabolic models (GEMs) can simulate and predict the metabolic fluxes of various system-level metabolic studies by computationally describing the gene-protein-response associations of the entire metabolic gene in an organism. The latest GEM, iMK1208, is a high-quality model comprising 1208 genes and 1643 reactions and was successfully used to predict metabolic engineering targets that increase actin production. Streptomyces coelicolor Genome Scale Models (GEMs) for Systems Biology Research, Automated draft GEM reconstruction

基于两个数据库的重构Refactoring based on two databases

(a)基于MetaCyc的重建模块：MetaCyc是一个精心策划的数据库，收录了来自所有生命领域的实验阐明的代谢途径。MetaCyc包含来自3009种不同生物体的2722条途径。(a) MetaCyc-based reconstruction module: MetaCyc is a curated database of experimentally elucidated metabolic pathways from all domains of life. MetaCyc contains 2722 pathways from 3009 different organisms.

(b)基于KEGG的重建模块：(1)基于KEGG注释的基因组信息的GEM草案。(2)用KEGG训练的HMMs从同源性中重建GEM草案；(b) KEGG-based reconstruction module: (1) GEM draft based on KEGG-annotated genome information. (2) Reconstruct the GEM draft from the homology with KEGG-trained HMMs;

ScoKEGGAnnotation＝getKEGGModelForOrganism('yli',”,”,”,0,0)；ScoKEGGAnnotation = getKEGGModelForOrganism('yli',",",",",0,0);

ScoKEGGAnnotation.id＝'ScoKEGGAnnotation'；ScoKEGGAnnotation.id = 'ScoKEGGAnnotation';

(c)基于MetaCyc的模型和基于KEGG的模型的合并。KEGG依赖于基于序列的注释，而MetaCyc只收集经过实验验证的途径。.(c) Merge of MetaCyc-based and KEGG-based models. KEGG relies on sequence-based annotations, while MetaCyc only collects experimentally validated pathways. .

(2)参考Yarrowia lipolytica CLIB122基因组信息(2) Refer to the genome information of Yarrowia lipolytica CLIB122

(3)JGI数据库目前已有的全部8个Yarrowia lipolytica信息(3) All 8 Yarrowia lipolytica information currently available in the JGI database

(4)半自动创建代谢模型迭代过程(4) Iterative process of creating metabolic model semi-automatically

(5)解脂耶氏酵母基因组代谢模型具体重构流程(5) Specific reconstruction process of Yarrowia lipolytica genome metabolic model

基于以上流程，我们目前的代谢模型第一个版本信息如下Based on the above process, the information of the first version of our current metabolic model is as follows

对于后期进一步的迭代优化，弥补gap需要进一步查看相关文献以及后期合作探讨才能深入优化进行。For further iterative optimization in the later stage, making up for the gap requires further review of relevant literature and later cooperation and discussion before in-depth optimization can be carried out.

第五部分：数据结果存放说明见附图10Part V: For the storage instructions of data results, please refer to Attachment 10

以上显示描述了本发明的基本原理，特征和主要优点。本发明不受上述实例的限制，上述实例和说明书中描述的只是说明本发明的原理，在不脱离本发明精神和范围的前提下，本发明还会有各种变化和改进，这些变化和改进都将落入要求保护的本发明范围内。The above presentation describes the basic principles, features and main advantages of the invention. The present invention is not limited by the above-mentioned examples, and what described in the above-mentioned examples and the description only illustrates the principle of the present invention, without departing from the spirit and scope of the present invention, the present invention also has various changes and improvements, these changes and improvements All will fall within the scope of the claimed invention.

Claims

1. A method for constructing, optimizing and visualizing a genome metabolic model based on high-throughput sequencing technology, characterized in that it comprises the following steps:

Step 1. Use the second-generation Illumina and third-generation PacBio high-throughput sequencing technologies to sequence the species, and obtain the original sequence files with quality control;

Step 2. Perform quality statistics and evaluation on the original sequence files respectively. If the evaluation is qualified, execute step 3; otherwise, return to execute step 1;

Step 3. Remove primers, linker sequences, and read sequence reads whose quality/length is lower than the set requirement for the original sequence file respectively, and obtain the sequence data after screening;

Step 4. Splicing the second-generation Illumina high-throughput screening data;

Splice the screened second-generation Illumina sequencing data with different k-mer size parameters, and select the splicing result with the largest parameter N50 as the contig file of the second-generation Illumina sequencing data;

Step 5. Perform the first splicing of the third-generation PacBio high-throughput screening data;

Use the splicing software to splice the screened third-generation PacBio sequencing data for the first time, and obtain the contig file for the first splicing. The splicing steps include:

Step 51, correction: stacking the screened reads together for base correction to improve the accuracy of the bases in the reads;

Step 52, pruning: determine the high-quality region and the low-quality region of the reads, and trim the low-quality region of the reads, retain a single highest-quality sequence block, and delete the low-quality sequence block;

Step 53, assembly: sort the reads sequence into a contigs file, generate corresponding consensus sequences consensus sequences, and create a path connecting the consensus sequences;

Step 6. Fusion and embed the contig file of the second-generation Illumina into the third-generation PacBio high-throughput screening data for the second splicing to obtain high-quality genome splicing results. The specific steps are as follows:

Step 61, performing sequence error correction on the first splicing data of PacBio, and obtaining the corrected reads sequence based on PacBio;

Step 62, using the contig file spliced by the second-generation Illumina as a supplement to the reads sequence corrected by PacBio, filling the gap in the third-generation PacBio sequencing, performing second splicing of the genome, and obtaining the contig file of PacBio;

Step 63: For the contig file of PacBio, use Illumina original sequencing data to further compare, filter the reads of high-quality comparison, use the bam file generated by the align part through the base error correction software Pilon, and make the second splicing result base errors were corrected;

Step 64, using the extension software to further extend the corrected splicing result of Illumina original reads data PacBio, and finally obtain a high-quality genome splicing result;

Step 7. Predict and match the genome assembly results, and complete the genome sequence annotation;

According to the splicing results, use the gene prediction software Augustus, Maker and Braker respectively to obtain all the gene prediction results, compare the predicted results with the NCBI nt database, BUSCO database and Pfam database to obtain the corresponding scores, and pass the gene model The expression is calculated to obtain the comprehensive scoring value of the gene prediction model, and the gene prediction model with the highest score is determined to be the preliminary genome assembly annotation result. Combined with the transcriptome data of similar species, the preliminary genome assembly annotation result is matched. If the sequencing read sequence If the matching degree between reads and the genome is >80%, the preliminary genome assembly annotation result is determined as the final result, and the genome sequence annotation is completed; otherwise, change the gene prediction software and re-predict and select;

Step 8, constructing the genome metabolic network model GEMs;

Step 9, build a visual network platform of metabolic network model GEMs through MetExploreViz, and perform visual display of metabolic model GEMs;

Step 10, using optFlux software to perform interface visualization FBA flow balance analysis on the metabolic network model GEMs.

2. The method for constructing, optimizing and visualizing a genome metabolism model based on high-throughput sequencing technology according to claim 1, characterized in that, in said step 3, the quality is lower than the set requirement refers to bases whose quality is lower than 20 , the length is lower than the set requirement refers to the sequence whose length is less than 100bp.

3. The method for constructing, optimizing and visualizing a genome metabolic model based on high-throughput sequencing technology according to claim 1, wherein the different k-mer size parameters in step 4 are 21, 33, 55 respectively. , 77, 99.

4. The method for constructing, optimizing and visualizing a genome metabolic model based on high-throughput sequencing technology according to claim 1, wherein the expression of the gene model in the step 7 is:

Evidence score(gene model)＝BLASTp_score*cov(query)*cov(target)+BUSCO_score+Pfam_scores+BLASTn_score*cov(query)*cov(target)

Among them, BLASTp_score is the score derived from the BLASTp protein sequence alignment program;

BLASTp_score is the score derived from the BLASTn sequence base alignment program; BUSCO_score is the score derived from the comparison of the Benchmarking Universal Single-Copy Orthologs database of BUSCO; Pfam_scores is the score derived from the comparison of the Pfam database;

cov(query) is the full-length alignment coverage of the query sequence; cov(target) is the full-length alignment coverage of the target sequence; Evidence score is the comprehensive scoring value of the final gene model.

5. The method for constructing, optimizing and visualizing a genome metabolism model based on high-throughput sequencing technology according to claim 1, wherein the specific steps in the step 8 include:

Step 81. Use RAVEN to compare the protein sequences of the genes annotated by the genome to the biological pathway databases KEGG and MetaCyc, and obtain the reactions Rxns, metabolite Mets and gene numbers involved in the corresponding metabolic network model, and convert the same reactions, metabolic Combining substances and related genes, performing deduplication fusion, and finally obtaining metabolic network model GEMs based on KEGG and MetaCyc;

Step 82, search for the metabolic network model GEMs of the species closest to the species in the KEGG database, and obtain the involved reaction Rxns, metabolite Mets and genes through the gene homology comparison strategy, and supplement the obtained based on KEGG and MetaCyc metabolic network model GEMs;

Step 83, through GapFind and GapFill, improve the entire metabolic network model GEMs so that the model can obtain more genes and reactions.

6. The method for constructing, optimizing and visualizing the genome metabolism model based on high-throughput sequencing technology according to claim 1, characterized in that, in the step 2, the second-generation Illumina high-throughput sequencing technology measures the sequence quality of the species The evaluation standard is Clean Data Q20>85%; the third-generation PacBio high-throughput sequencing technology determines the sequence quality of the species and the evaluation standard is Mean Concordance>85%.

7. The method for constructing, optimizing and visualizing a genome metabolism model based on high-throughput sequencing technology according to claim 1, wherein the high-quality region in the step 5 means that the base quality of the reads is greater than or equal to 20 The low-quality region refers to the region where the base quality of the reads is less than 20.

8. The method for constructing, optimizing and visualizing a genome metabolic model based on high-throughput sequencing technology according to claim 1, wherein the species is yeast.

9. The method for constructing, optimizing and visualizing a genome metabolic model based on high-throughput sequencing technology according to claim 8, wherein the yeast is Yarrowia lipolytica.