CN112599192A

CN112599192A - New coronavirus whole genome analysis system based on nanopore sequencing

Info

Publication number: CN112599192A
Application number: CN202011641513.2A
Authority: CN
Inventors: 毛凌峰; 徐兴宇; 沈航杰; 倪莉丽
Original assignee: Hangzhou Boyi Technology Co ltd
Current assignee: Hangzhou Boyi Technology Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-04-02

Abstract

The invention provides a nanopore sequencing-based whole genome analysis system for the new coronavirus. The system establishes a complete analysis workflow for new coronavirus sequencing data: for both second-generation and third-generation sequencing data it performs quality control, genome coverage analysis, variation detection, genome assembly and genome integrity analysis, and it presents the variation detection results as a tree. The variation detection results are associated with the sample, which makes it easier to trace and control the epidemic history of the new coronavirus. In addition, the whole sequencing analysis workflow is displayed visually, so an operator can carry out the analysis simply by following the instructions on the operation interface, and the analysis results are presented comprehensively as charts.

Description

New coronavirus whole genome analysis system based on nanopore sequencing

Technical Field

The invention relates to a gene analysis system, in particular to a new coronavirus whole genome analysis system based on nanopore sequencing.

Background

The new coronavirus (SARS-CoV-2), an infectious virus that causes pneumonia, can be diagnosed by real-time PCR, viral gene sequencing or virus-specific antibody detection, and whole genome sequencing is an effective way to detect and control it. Nanopore sequencing, which is based on electrical signal detection, is currently the high-throughput sequencing platform with the simplest experimental operation and the fastest sequencing speed. However, the large amount of raw data it generates must be analyzed by professional experimenters with long training, who have to call various programs from the Linux shell command line to filter the raw sequences, align them, classify microbial species, count microbial reads, detect pathogenic microorganisms, extract the data of the target species, calculate genome integrity, and so on. The drawback is that these experimenters need strong bioinformatics and Linux skills. Moreover, each analysis program offers different options and parameters, so experimenters spend a great deal of time repeatedly searching for and tuning programs and parameters. As a result, efficiency is very low, data visualization is poor, and the degree of automation is very low.

In summary, there is currently no simple, easy-to-operate sequencing analysis system for whole genome analysis of the new coronavirus, and variation detection of the new coronavirus cannot be performed rapidly.

Disclosure of Invention

The invention aims to provide a nanopore sequencing-based whole genome analysis system for the new coronavirus that integrates various analysis programs, has a simple and clear operation interface that experimenters can learn to operate in a short time, can detect variant genes, and associates the analysis results with sample information to facilitate management and control of the new coronavirus.

In order to achieve the above object, the present technical solution provides a new coronavirus whole genome analysis system based on nanopore sequencing, comprising:

the system comprises a data analysis system and a sample management system which are mutually associated, wherein the data analysis system is used for acquiring sequencing data of a pathogen to be detected so as to identify the type of the pathogen to be detected, the sample management system is used for acquiring sample information corresponding to the pathogen to be detected, and the sequencing data is associated with the sample information;

the data analysis system includes:

the task establishing unit is used for establishing an analysis task corresponding to the sequencing data of the pathogen to be detected, wherein the analysis task stores the sequencing data and analysis parameters of the pathogen to be detected;

a reference genome storing a reference gene sequence of the new coronavirus;

the sequence comparison unit is used for obtaining a comparison instruction, comparing the sequencing data with the reference genome and obtaining a detection sequence of the pathogen to be detected;

the sequence analysis unit is used for acquiring a sequence analysis instruction and performing at least one sequence analysis task of genome coverage rate, variation detection, genome assembly and genome integrity on the basis of the detection sequence;

and the analysis report generating unit is used for acquiring the report instruction, extracting the analysis result data of the sequence analyzing unit and the sequence comparing unit, and associating the analysis result data with the sample management system to generate an analysis report.

Compared with the prior art, the technical solution has the following features and beneficial effects: it displays the analysis process visually, streamlines the input of parameter adjustments, and analyzes sequencing results with one click; it integrates the analysis process and provides negative/positive detection and genome variation detection for the new coronavirus; and its graphical interface and one-click generation of detection reports in PDF format make the sequencing data easier to interpret.

Drawings

FIG. 1 is a schematic diagram of the framework of the nanopore sequencing-based whole genome analysis system for the new coronavirus according to the present invention.

FIG. 2 is a schematic diagram of input data and analysis parameters.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.

It is understood that the terms "a" and "an" should be interpreted as "at least one" or "one or more"; that is, the number of an element may be one in one embodiment and more than one in another embodiment, and the terms "a" and "an" should not be interpreted as limiting the number.

The scheme constructs a nanopore sequencing-based whole genome analysis system for the new coronavirus. The system establishes a complete analysis workflow for new coronavirus sequencing data: for both second-generation and third-generation sequencing data it performs quality control, genome coverage analysis, variation detection, genome assembly and genome integrity analysis, and it presents the variation detection results as a tree. The variation detection results are associated with the sample, which makes it easier to trace and control the epidemic history of the new coronavirus. In addition, the whole sequencing analysis workflow is displayed visually, so an operator can carry out the analysis simply by following the instructions on the operation interface, and the analysis results are presented comprehensively as charts.

The scheme of the nanopore sequencing-based new coronavirus whole genome analysis system is as follows. A comprehensive analysis workflow is built on the nanopore sequencing data of a suspected pathogen, and the suspected pathogen is compared against the reference genome and checked for variation. This solves the problem that conventional fluorescence quantitative PCR detection of the new coronavirus cannot be used to detect viral variation or viral evolution. In addition, the analysis system supports a full range of data types: nanopore sequencing data in fastq, fast5 and barcoded fastq formats, as well as single-end and paired-end fastq second-generation data. The analysis process is visualized, and the analysis results can be generated with one click and displayed graphically.

Fig. 1 shows a schematic diagram of the framework of the nanopore sequencing-based whole genome analysis system for the new coronavirus, which can perform gene sequence analysis on a pathogen to be detected, identify whether the pathogen is the new coronavirus, and identify variation of the new coronavirus. The system comprises:

the data analysis system includes:

a reference genome storing a reference gene sequence of the new coronavirus;

and the analysis report generating unit is used for acquiring the report instruction, extracting the analysis result data of the sequence analyzing unit and the sequence comparing unit, and associating the analysis result data with the sample management system to generate a visual analysis report.

In this scheme, the nanopore sequencing-based whole genome analysis system for the new coronavirus can be used to identify whether an unknown pathogen is the new coronavirus and to detect the variation and evolution of the new coronavirus. That is, when the coincidence rate between the gene sequence of the pathogen to be detected and the gene sequence in the reference genome is greater than a set threshold, the pathogen is determined to be the new coronavirus; subsequent variation detection is then performed on it to obtain the variation and the evolution process of the new coronavirus. After the sequence comparison unit obtains the comparison instruction, it checks whether the sequencing data of the pathogen to be detected contain a gene sequence corresponding to the reference genome, and if so, the pathogen is judged to be the new coronavirus.
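
The threshold decision described above can be sketched as follows. This is only an illustration: the text does not say how the coincidence rate is computed or what the threshold is, so the sketch assumes the rate is the fraction of reads that match the reference, and the function names and threshold values are hypothetical.

```python
from typing import Sequence


def coincidence_rate(read_matches: Sequence[bool]) -> float:
    """Fraction of reads that matched the reference genome.

    Assumption: the rate is computed per read; the source text does not
    specify the exact definition. Returns 0.0 for an empty input.
    """
    if not read_matches:
        return 0.0
    return sum(read_matches) / len(read_matches)


def is_new_coronavirus(rate: float, threshold: float) -> bool:
    # The text requires the rate to be strictly greater than the set threshold.
    return rate > threshold


# Hypothetical run: three of four reads matched the reference.
rate = coincidence_rate([True, True, False, True])
positive = is_new_coronavirus(rate, threshold=0.5)
```

Only a sample judged positive in this way would proceed to the subsequent variation detection.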

The task establishing unit is provided with a plurality of interfaces for different types of data, a different sequencing analysis channel is provided for each interface, and the corresponding sequencing analysis channel is selected according to the type of the sequencing data. Specifically, the multiple interfaces of the task establishing unit allow the analysis system to analyze not only nanopore sequencing data of the fastq and fast5 types but also single-end and paired-end fastq second-generation sequencing data, and to process several types of sequencing data at the same time. The analysis system is therefore applicable to sequencing data from almost all high-throughput sequencing platforms, such as Illumina, BGI (Huada), Ion Torrent and PacBio. This is possible because the scheme classifies the tasks created by the task establishing unit and sets up a sequencing analysis channel for each one individually.

Specifically, the task establishing unit comprises a second-generation sequencing task module and a third-generation sequencing task module. The second-generation sequencing task module stores a second-generation sequencing task and its analysis parameters, and the third-generation sequencing task module stores a third-generation sequencing task and its analysis parameters. The task establishing unit is also provided with a parameter setting module, which is used to adjust the parameters of the sequencing data manually and is displayed on the system interface as part of the visualized workflow. It is also worth mentioning that an independent sequencing analysis channel is established for each sequencing task, and a storage folder is created for the corresponding sequencing task.

Specifically, the analysis parameters for the third-generation sequencing task include a task name, a mode, a sequence path, a sample mixing reagent, a thread number, a length limit value, an accuracy limit value, a consistency depth and an SNP accuracy Q value. The task name names each sequencing analysis channel so that the user can quickly locate and manage the sequencing data that have been set up. The mode is one of fast5, fastq and barcoded fastq, and a different subsequent analysis channel is selected for each mode. The sequence path is the file path of the folder holding the sequencing data. The sample mixing reagent parameter selects, according to the analysis type, between single-sample sequencing and the multi-sample sequencing reagent scheme of the corresponding Nanopore sequencing.
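
One way to picture the selection of a sequencing analysis channel by data type is a lookup table. The following is a minimal sketch, not the patented implementation: the channel names and the `select_channel` function are hypothetical, and "paired-end" is used for what the text calls double-end data.

```python
# Hypothetical mapping from (sequencing generation, data type) to a channel name.
CHANNELS = {
    ("third", "fast5"): "nanopore_fast5_channel",
    ("third", "fastq"): "nanopore_fastq_channel",
    ("third", "barcoded fastq"): "nanopore_barcoded_channel",
    ("second", "single-end fastq"): "short_read_single_channel",
    ("second", "paired-end fastq"): "short_read_paired_channel",
}


def select_channel(generation: str, data_type: str) -> str:
    """Return the analysis channel for the given kind of sequencing data."""
    try:
        return CHANNELS[(generation, data_type)]
    except KeyError:
        raise ValueError(
            f"unsupported data: generation={generation!r}, type={data_type!r}"
        ) from None


channel = select_channel("third", "barcoded fastq")
```

Keeping the mapping in one table means a new data type only needs a new entry, which matches the idea of one interface and one channel per data type.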

The analysis parameters aiming at the second-generation sequencing task comprise a task name, sequence selection, thread number, consistency depth and SNP accuracy Q value, and the mode corresponding to the second-generation sequencing task is a fastq mode.

In this scheme, the analysis parameters of a sequencing task can be set manually. In particular, the mode of the sequencing data determines the subsequent sequencing analysis work. By default, the number of threads is 10, the length limit value is 500, the accuracy limit value is 80, the consistency depth is 20, and the SNP accuracy Q value is 20; each parameter can be adjusted in the parameter setting module. Because the scheme provides a separate sequencing analysis channel for each task, it can handle different types of data in a targeted way.

The user enters sequencing data on the interface of the analysis system and fills in or modifies the corresponding analysis parameters according to the instructions. The task establishing unit creates the corresponding storage folder based on the sequencing data and analysis parameters obtained, and if the input is a third-generation sequencing task, the options for the different data modes are displayed.
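
The parameter sets and defaults listed above could be held in simple records such as the following. This is a sketch under assumptions: the class and field names are hypothetical, the default for the sample mixing reagent is invented for illustration, and only the numeric defaults (10, 500, 80, 20, 20) come from the text.

```python
from dataclasses import dataclass

THIRD_GEN_MODES = ("fast5", "fastq", "barcoded fastq")


@dataclass
class ThirdGenTaskParams:
    task_name: str
    mode: str                      # one of THIRD_GEN_MODES
    sequence_path: str             # folder holding the sequencing data
    sample_mixing_reagent: str = "single sample"  # hypothetical default
    threads: int = 10
    length_limit: int = 500
    accuracy_limit: int = 80
    consistency_depth: int = 20
    snp_q_value: int = 20

    def __post_init__(self) -> None:
        if self.mode not in THIRD_GEN_MODES:
            raise ValueError(f"mode must be one of {THIRD_GEN_MODES}")


@dataclass
class SecondGenTaskParams:
    task_name: str
    sequence_selection: str
    threads: int = 10
    consistency_depth: int = 20
    snp_q_value: int = 20
    mode: str = "fastq"            # second-generation tasks use the fastq mode


# Hypothetical task: defaults apply unless the user overrides them.
params = ThirdGenTaskParams("run1", "fastq", "/data/run1", threads=4)
```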

Sample information corresponding to the sequencing data is entered into the sample management system. The sample information includes but is not limited to: the task name of the sequencing data, the sampling information of the sequencing data, and the personal information of the sampled person corresponding to the sequencing data. The sampling information includes the sample type, sampling date and sequencing date. The personal information includes the name, gender and age of the sampled person. The sample information is associated with the sequencing data and stored in the folder corresponding to the sequencing data, or the sequencing data are associated with the sample information and stored in the sample management system.

The user fills in the sample information on the interface of the analysis system according to the instructions. The task names held by the task establishing unit are offered as options for the "task name of the sequencing data" field, so that the user can select one directly. Alternatively, the user may type in the task name of the sequencing data to match the corresponding sequencing data from the task establishing unit.
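
The association between sample information and sequencing data through the task name can be illustrated with a small in-memory registry. This is a sketch only: `SampleRecord`, `SampleManager` and their fields are hypothetical names that mirror the items listed in the text, and a real system would persist the records.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SampleRecord:
    task_name: str        # links the sample to its sequencing task
    sample_type: str
    sampling_date: str
    sequencing_date: str
    person_name: str
    person_gender: str
    person_age: int


class SampleManager:
    """Keeps sample records keyed by the task name of the sequencing data."""

    def __init__(self) -> None:
        self._by_task: Dict[str, SampleRecord] = {}

    def register(self, record: SampleRecord) -> None:
        self._by_task[record.task_name] = record

    def task_options(self) -> List[str]:
        # Task names offered to the user for selection.
        return sorted(self._by_task)

    def match(self, task_name: str) -> Optional[SampleRecord]:
        # Look up the sample that belongs to a typed-in task name.
        return self._by_task.get(task_name)


manager = SampleManager()
manager.register(
    SampleRecord("run1", "swab", "2020-12-01", "2020-12-02", "Zhang", "F", 35)
)
record = manager.match("run1")
```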

Once the analysis system has stored the sequencing data in the corresponding storage folder, the subsequent sequencing analysis is carried out according to the user's operation instructions. The scheme configures the trigger interfaces and the cascade relation of the reference genome, the sequence comparison unit, the sequence analysis unit and the analysis report generating unit according to the sequence analysis process. The cascade relation is as follows: the sequence comparison unit is a lower-level task node of the reference genome, the sequence analysis unit is a lower-level associated task node of the sequence comparison unit, and the analysis report generating unit is a lower-level associated task node of the sequence comparison unit and the sequence analysis unit. The trigger interface of the sequence comparison unit corresponds to the comparison instruction, and the sequence comparison unit is triggered only after the comparison instruction is obtained; the trigger interface of the sequence analysis unit corresponds to the analysis instruction, and the sequence analysis unit is triggered only after the analysis instruction is obtained.

This arrangement reduces both the operating load on the analysis system and the difficulty of operation. The operator selects the required content on the application interface of the analysis system, which generates the corresponding instructions. Because cascade relations are set among the units of the analysis system, a lower-level task node cannot be triggered until the conditions of its higher-level nodes are met, and the data to be analyzed flow through the analysis system in the set direction.

In this scheme, the sequence comparison unit and the sequence analysis unit run as independent but linked processes, so the results produced by the comparison can be extracted directly by the analysis report generating unit. In addition, because the task establishing unit classifies the data by mode, the sequence comparison unit can run the comparison appropriate to each mode.
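
The cascade rule, under which a lower-level node can be triggered only after its higher-level nodes have run, can be sketched as a prerequisite table. The node names and the `Pipeline` class are hypothetical; the dependency structure follows the cascade relation described above, with the report depending on the comparison and analysis units.

```python
from typing import Dict, List, Set


class Pipeline:
    """Refuses to trigger a node until all of its upstream nodes are done."""

    PREREQUISITES: Dict[str, List[str]] = {
        "reference_genome": [],
        "sequence_comparison": ["reference_genome"],
        "sequence_analysis": ["sequence_comparison"],
        "report_generation": ["sequence_comparison", "sequence_analysis"],
    }

    def __init__(self) -> None:
        self.completed: Set[str] = set()

    def trigger(self, node: str) -> bool:
        """Run the node if its prerequisites are met; return whether it ran."""
        if node not in self.PREREQUISITES:
            raise KeyError(f"unknown node: {node}")
        if all(p in self.completed for p in self.PREREQUISITES[node]):
            self.completed.add(node)  # stand-in for the unit's real work
            return True
        return False


pipeline = Pipeline()
blocked = pipeline.trigger("sequence_analysis")  # upstream not done yet
pipeline.trigger("reference_genome")
pipeline.trigger("sequence_comparison")
allowed = pipeline.trigger("sequence_analysis")
```

An instruction issued out of order is simply rejected, which is how the design keeps the data moving in the set direction.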

In some embodiments, the system comprises a data quality control unit, and the data quality control unit performs quality control on sequencing data of a pathogen to be detected according to set quality control conditions. At this time, the sequence comparison unit is a lower-level associated task node of the data quality control unit, and the sequencing data triggers the sequence comparison unit to perform comparison only after the quality control is completed.
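
A quality control condition of this kind might look like the filter below. This is a sketch under stated assumptions: the text gives default limits of 500 for length and 80 for accuracy but does not define how they are applied, so the sketch assumes the length limit is a minimum read length in bases and the accuracy limit is a minimum mean per-base accuracy in percent, derived from Phred quality scores. The function names are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, List[int]]  # (base sequence, per-base Phred quality scores)


def mean_accuracy_percent(quals: List[int]) -> float:
    """Mean per-base accuracy in percent, from Phred scores Q = -10*log10(p)."""
    if not quals:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return 100.0 * (1.0 - mean_error)


def passes_qc(read: Read, length_limit: int = 500,
              accuracy_limit: float = 80.0) -> bool:
    # Assumption: a read passes when it meets both limits (inclusive).
    seq, quals = read
    return (len(seq) >= length_limit
            and mean_accuracy_percent(quals) >= accuracy_limit)


reads = [
    ("A" * 600, [10] * 600),  # long enough, about 90% accurate
    ("A" * 100, [30] * 100),  # too short
    ("A" * 600, [5] * 600),   # about 68% accurate, too low
]
kept = [r for r in reads if passes_qc(r)]
```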

In this scheme, the reference genome comprises the gene sequence of the new coronavirus. It is worth mentioning that the scheme detects variation of the new coronavirus, and if a variant gene sequence is obtained, the reference genome can be updated.

According to the analysis content, the sequence analysis unit can be divided into an independent variation detection unit, genome assembly unit, genome coverage rate unit and genome integrity calculation unit. One or more of these units are triggered according to the analysis instruction, which keeps the operation and analysis of the whole process simple. That is, the scheme gathers the tasks of new coronavirus gene sequence analysis in one place, and the operator selects the tasks required and triggers the corresponding sequence analysis units.

Correspondingly, on the relevant page of the analysis system the user selects among the variation detection, genome assembly, genome coverage rate and genome integrity tasks, and a different sequence analysis result is generated for each.

The sequence analysis unit is triggered after the sequence comparison unit, and its analysis tasks are carried out only when a new coronavirus gene sequence is detected in the pathogen to be detected. The sequence analysis unit analyzes the new coronavirus gene sequence and the whole gene sequence of the pathogen to be detected. The genome coverage rate is the coverage rate of the new coronavirus gene sequence within the whole gene sequence, and the genome integrity is the integrity of the new coronavirus gene sequence.
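
The two metrics can be illustrated with a common way of computing them. The text does not give formulas, so this sketch assumes that coverage is the fraction of reference positions reaching a minimum read depth and that integrity is the fraction of unambiguous bases (A, C, G, T) in the assembled consensus sequence. Both definitions and the function names are assumptions.

```python
from typing import Sequence


def genome_coverage(depths: Sequence[int], min_depth: int = 1) -> float:
    """Fraction of reference positions whose read depth is >= min_depth."""
    if not depths:
        return 0.0
    return sum(1 for d in depths if d >= min_depth) / len(depths)


def genome_integrity(consensus: str) -> float:
    """Fraction of consensus bases that are called as A, C, G or T."""
    if not consensus:
        return 0.0
    called = sum(1 for base in consensus.upper() if base in "ACGT")
    return called / len(consensus)


# Hypothetical toy data: four reference positions and an 8-base consensus.
coverage = genome_coverage([0, 5, 20, 30], min_depth=20)
integrity = genome_integrity("ACGTNNNN")
```

With the default consistency depth of 20 used as `min_depth`, two of the four toy positions count as covered.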

It is worth mentioning that the variation detection unit of the sequence analysis unit is designed specifically for the new coronavirus, and the results it detects are displayed as a tree so that experimenters can analyze them visually. The variation detection unit includes at least one of the following variation detection tasks: variation type effect annotation, evolution analysis and sample grouping. Different variation detection tasks are carried out according to different instructions.

Variation type effect annotation: a gene annotation file is built in, and variations are annotated based on it. Evolution analysis: the gene sequence is analyzed against the GISAID reference strain to determine whether it has evolved or changed. Sample grouping: gene sequences are grouped according to the GISAID grouping standard, with sequences of the same type placed in one group. The results of the evolution analysis and the sample grouping are displayed as a tree.

In this scheme, the variation detection result obtained by the variation detection unit is associated with the sample management system. Specifically, the variation detection result is associated with the corresponding sample and the corresponding personnel, which facilitates tracking of the new coronavirus.
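
The sample grouping and its tree display can be sketched in a few lines. This illustration assumes that each sample has already been assigned a group label by some upstream step; the labels, sample names and function names below are hypothetical, and the plain-text tree is only a stand-in for a graphical display.

```python
from collections import defaultdict
from typing import Dict, List


def group_samples(sample_to_group: Dict[str, str]) -> Dict[str, List[str]]:
    """Collect samples that carry the same group label into one group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sample, group in sample_to_group.items():
        groups[group].append(sample)
    # Sort groups and members so the output is deterministic.
    return {g: sorted(members) for g, members in sorted(groups.items())}


def render_tree(groups: Dict[str, List[str]], root: str = "reference") -> str:
    """Render the groups as an indented text tree under a root node."""
    lines = [root]
    for group, samples in groups.items():
        lines.append(f"  +-- {group}")
        for sample in samples:
            lines.append(f"      +-- {sample}")
    return "\n".join(lines)


groups = group_samples({"s1": "group A", "s2": "group B", "s3": "group A"})
tree = render_tree(groups)
```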

In addition, the analysis report generating unit has a built-in analysis report template. It extracts the corresponding data content according to the gene analysis instruction and fills it into the template. The extracted data content comprises one or more of: sample information, sequencing data, sequence comparison results and sequence analysis results. Because the processing steps of this scheme are independent but linked, the analysis report generating unit can easily extract each piece of content on its own. The analysis report generating unit can also generate a chart-type analysis report from the extracted data content.

The workflow interface of the analysis system is simple and easy to operate. After the operator enters the sequencing data and sample information, the corresponding analysis content is displayed according to the operator's instructions, and finally the analysis content is summarized into an analysis report. This achieves one-click output from the raw sequencer data, through bioinformatics analysis, to the result report.
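
Filling a built-in report template with extracted data can be illustrated with the standard library's `string.Template`. The template text, field names and values are hypothetical, and a real report would be rendered to a chart-based document rather than plain text.

```python
from string import Template
from typing import Mapping

# Hypothetical built-in template; $name placeholders are filled from the data.
REPORT_TEMPLATE = Template(
    "Task: $task_name\n"
    "Sample type: $sample_type\n"
    "Detection result: $result\n"
    "Genome coverage: $coverage\n"
)


def fill_report(data: Mapping[str, str]) -> str:
    """Fill the template; a missing field raises KeyError instead of
    silently producing an incomplete report."""
    return REPORT_TEMPLATE.substitute(data)


report = fill_report({
    "task_name": "run1",
    "sample_type": "swab",
    "result": "positive",
    "coverage": "98.5%",
})
```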

The nanopore sequencing-based new coronavirus whole genome analysis system provided by this scheme can be deployed and run on a computer system. The computer system of the server comprises a Central Processing Unit (CPU), which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) or a program loaded from a storage part into a Random Access Memory (RAM). The RAM also stores the various programs and data necessary for system operation. The CPU, ROM and RAM are connected to each other via a bus. An input/output (I/O) interface is also connected to the bus. The modules described in the embodiments of the present invention may be implemented in software or in hardware, and the described modules may also be disposed in a processor.

The present invention is not limited to the preferred embodiments described above. Anyone may derive other products in various forms in light of the present invention, but regardless of any change in shape or structure, any technical solution that is the same as or similar to that of the present application falls within the protection scope of the present invention.

Claims

1. A nanopore sequencing-based neocoronavirus whole genome analysis system, comprising:

the data analysis system includes:

a reference genome storing a reference gene sequence of the new coronavirus;

and the analysis report generating unit is used for acquiring the report command, extracting the analysis result data of the sequence analyzing unit, and associating the analysis result data with the sample management system to generate an analysis report.

2. The nanopore sequencing-based neocoronavirus whole genome analysis system of claim 1, wherein the task establishment unit comprises a second-generation sequencing task module and a third-generation sequencing task module, the second-generation sequencing task module stores a second-generation sequencing task and corresponding analysis parameters, the third-generation sequencing task module stores a third-generation sequencing task and corresponding analysis parameters, and the task establishment unit is provided with a parameter setting module.

3. The nanopore sequencing-based neocoronavirus whole genome analysis system of claim 2, wherein the analysis parameters for the third-generation sequencing task comprise task name, mode, sequence path, sample mixing reagent, number of threads, length limit value, accuracy limit value, consistency depth and SNP accuracy Q value, and the analysis parameters for the second-generation sequencing task comprise task name, sequence selection, number of threads, consistency depth and SNP accuracy Q value.

4. The nanopore sequencing-based neocoronavirus whole genome analysis system according to claim 3, wherein the mode of the third-generation sequencing task is one of fast5, fastq and barcoded fastq, and the mode corresponding to the second-generation sequencing task is the fastq mode.

5. The nanopore sequencing-based neocoronavirus whole genome analysis system of claim 1, wherein the trigger interfaces and the cascade relationship of the reference genome, the sequence comparison unit, the sequence analysis unit and the analysis report generating unit are configured according to a sequence analysis process.

6. The nanopore sequencing-based neocoronavirus whole genome analysis system of claim 1, wherein the sequence analysis unit is further divided into an independent variation detection unit, a genome assembly unit, a genome coverage unit and a genome integrity calculation unit according to the analysis content.

7. The nanopore sequencing-based neocoronavirus whole genome analysis system of claim 6, wherein the variation detection unit comprises at least one variation detection task selected from variation type effect annotation, evolution analysis and sample grouping.

8. The nanopore sequencing-based neocoronavirus whole genome analysis system according to claim 7, wherein the variation detection result obtained by the variation detection unit is displayed in a tree form.

9. The nanopore sequencing-based neocoronavirus whole genome analysis system of claim 6, wherein the variation detection result obtained by the variation detection unit is associated with the sample management system.

10. The nanopore sequencing-based neocoronavirus whole genome analysis system of claim 1, comprising a data quality control unit, wherein the sequence comparison unit is a subordinate associated task node of the data quality control unit, and sequencing data triggers the sequence comparison unit to perform comparison only after quality control is completed.