CN114333994B

CN114333994B - Method and system for determining differential gene pathways based on reference-free transcriptome sequencing

Info

Publication number: CN114333994B
Application number: CN202011069458.4A
Authority: CN
Inventors: 田振阳; 王苹
Original assignee: Tianjin Modern Innovation Traditional Chinese Medicine Technology Co ltd
Current assignee: Tianjin Modern Innovation Traditional Chinese Medicine Technology Co ltd
Priority date: 2020-09-30
Filing date: 2020-09-30
Publication date: 2023-07-07
Anticipated expiration: 2040-09-30
Also published as: CN114333994A

Abstract

The invention discloses a method and a system for determining differential gene pathways based on reference-free transcriptome sequencing. The method comprises the following steps: acquiring a differential gene list including a plurality of gene differences, and acquiring a previously generated gene sequence file from which repeated gene sequences have been removed; parsing the gene sequence file to determine a plurality of gene sequences, and performing pathway annotation on each gene in the plurality of gene sequences to generate an annotated gene sequence file; adding annotation information to each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file; and performing enrichment processing on the annotation information of each gene data item in the differential gene list to obtain a plurality of enriched gene pathways, selecting a plurality of target gene pathways from the enriched gene pathways, and taking each target gene pathway as a differential gene pathway to obtain a plurality of differential gene pathways.

Description

Method and system for determining differential gene pathways based on reference-free transcriptome sequencing

Technical Field

The present invention relates to the field of bioinformatic analysis technology, and more particularly, to methods and systems for determining differential gene pathways based on reference-free transcriptome sequencing.

Background

Transcriptome sequencing refers to sequencing the mRNA transcribed by the cells of a particular biological sample under known conditions, and comparing differential gene expression between samples. Reference-free transcriptome analysis refers to transcriptome analysis performed on a species for which no reliable reference genome is available. In such analysis, the most important part is the pathway analysis of differential genes, including annotation, enrichment, and mapping of pathways.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) database (https://www.kegg.jp/) is a database that integrates genomic, chemical and systemic functional information. The KEGG database is typically used for pathway annotation of genes. After the differentially expressed genes are obtained, it is often necessary to enrich them in different pathways by aligning against the KEGG database. The results are displayed by marking genes that change in different ways with different colors in the KEGG pathway map, which makes the results easier to interpret.

In existing methods, KEGG annotation and enrichment are standard procedures for transcriptome analysis. However, KEGG enrichment for a reference-free transcriptome often still requires reference species information to be selected, and is therefore not universally applicable to species without a reference genome. In the step of drawing the KEGG enrichment pathway map, drawing is mainly done with online tools such as the KEGG transformer mapper. However, these tools have the disadvantages of requiring manual labeling of differential genes, having low throughput, and requiring the gene information to be queried again when the result pictures are examined.

Disclosure of Invention

To overcome the shortcomings of prior-art methods for differential gene pathway enrichment analysis of reference-free transcriptomes, the invention provides a high-throughput method for obtaining high-quality KEGG pathway enrichment maps for a reference-free transcriptome. The invention aims to solve the problems that differential gene pathway enrichment of a reference-free transcriptome depends on a reference genome and that KEGG enrichment maps are drawn manually and at low throughput, and to achieve high-throughput, automated and efficient KEGG enrichment and pathway map drawing for reference-free transcriptomes.

To solve the problems in the prior art, the present invention provides a method for determining differential gene pathways based on reference-free transcriptome sequencing, the method comprising:

performing transcriptome analysis on differential genes from a species sample for which no reference genome has been determined, to obtain a differential gene list including a plurality of gene differences, and obtaining a pre-generated gene sequence file from which repeated gene sequences have been removed;

analyzing the gene sequence file to determine a plurality of gene sequences, and carrying out path annotation on each gene in the plurality of gene sequences so as to generate an annotated gene sequence file;

adding annotation information to each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file; and

enrichment processing is carried out on annotation information of each gene data item in the differential gene list so as to obtain a plurality of enriched gene paths, a plurality of target gene paths are selected from the plurality of enriched gene paths, and each target gene path is used as a differential gene path so as to obtain a plurality of differential gene paths.

The format of the gene sequence file is text-based and is used to represent a nucleotide sequence or an amino acid sequence.

Wherein pathway annotation for each gene in the plurality of gene sequences comprises:

A gene identification number and a gene pathway number are determined for each gene in the plurality of gene sequences to enable pathway annotation of each gene.

Also included is removing duplicate gene data items from all gene data items in the differential gene list.

Also included is removal of duplicate gene sequences from all gene sequences in the annotated gene sequence file.

The selecting a plurality of gene pathways of interest from the enriched plurality of gene pathways comprises:

sorting the enriched plurality of gene pathways in descending order of confidence to generate a sorted sequence, and selecting a predetermined number of gene pathways with the highest confidence from the sorted sequence as target gene pathways.

The method further comprises generating a dot plot of gene pathways from the differential gene list and the annotated gene sequence file;

in the dot plot, the abscissa represents the degree of enrichment, reflecting the number of enriched genes and the reliability of the enrichment, and the ordinate represents the description of the enriched gene pathways; and

in the dot plot, the number of genes is indicated by the size of the dots, and the reliability of enrichment is indicated by the color of the dots.

The method further comprises performing enrichment processing on the annotation information of each gene data item in the differential gene list to generate an enrichment file;

extracting a gene identification number and a gene pathway number from the pathway annotation of the enrichment file to generate a pathway list file;

determining the differential genes according to the gene identification numbers in the pathway list file;

determining the fold change and the confidence of each differential gene in the differential gene list according to the identifier of the differential gene; and

determining an identification symbol for each differential gene according to its fold change and confidence, so as to generate an identified differential gene list.

Wherein determining the identification symbol for the differential gene based on the fold change and the confidence level of the differential gene comprises:

assigning a first type of identifier to differential genes with a fold change greater than 1 and a confidence level less than 0.05; and

assigning a second type of identifier to differential genes with a fold change less than -1 and a confidence level less than 0.05.

analyzing the enrichment file to determine a gene pathway map and determining a webpage configuration file corresponding to the gene pathway map;

establishing an association between the gene identification number of each gene in the gene pathway map and the identification number of the corresponding differential gene in the differential gene list;

determining, according to the association, the coordinates in the webpage configuration file of the differential genes in the gene pathway map;

marking each differential gene at its coordinate position according to the identification symbol of the differential gene in the identified differential gene list, so as to generate a pathway enrichment map;

wherein in the pathway enrichment map, the coordinate position of each differential gene is a pixel icon; alternatively, in the pathway enrichment map, the coordinate position of each differential gene is a web page link.

According to another aspect of the present invention, there is provided a system for determining differential gene pathways based on reference-free transcriptome sequencing, the system comprising:

an analysis device for performing transcriptome analysis on differential genes from a species sample from which a reference genome is not determined to obtain a differential gene list including a plurality of gene differences, and obtaining a previously generated gene sequence file from which a repeated gene sequence is removed;

annotating means for parsing the gene sequence file to determine a plurality of gene sequences, path annotating each gene in the plurality of gene sequences, thereby generating an annotated gene sequence file;

an adding device for adding annotation information for each gene data item in the difference gene list according to the path annotation of each gene in the annotated gene sequence file; and

a processing device for enriching the annotation information of each gene data item in the differential gene list to obtain a plurality of enriched gene pathways, selecting a plurality of target gene pathways from the plurality of enriched gene pathways, and obtaining a plurality of differential gene pathways by using each target gene pathway as a differential gene pathway.

Wherein the annotating means for pathway annotating each gene of the plurality of gene sequences comprises:

the annotating means determines a gene identification number and a gene pathway number for each gene in the plurality of gene sequences to enable pathway annotation of each gene.

The system further comprises a deduplication device for removing duplicate gene data items from all the gene data items in the differential gene list.

The deduplication device is also used to remove duplicate gene sequences from all gene sequences in the annotated gene sequence file.

The processing device selecting a plurality of gene pathways of interest from the enriched plurality of gene pathways comprises:

the processing device performs descending order sequencing on each gene pathway in the enriched plurality of gene pathways according to the credibility so as to generate a descending order sequence, and selects a preset number of gene pathways with the largest credibility from the descending order sequence as target gene pathways.

The system further comprises a generating device for generating a dot plot of gene pathways from the differential gene list and the annotated gene sequence file;

The system further comprises a determining device for performing enrichment processing on the annotation information of each gene data item in the differential gene list, so as to generate an enrichment file;

Wherein the determining means determines an identification symbol for the differential gene based on the fold change and the confidence level of the differential gene comprises:

The system further comprises an establishing device for performing enrichment processing on the annotation information of each gene data item in the differential gene list, so as to generate an enrichment file;

The difficulties addressed by the present invention are as follows. First, conventional transcriptome KEGG enrichment analysis methods require reference species information to be specified, whereas many reference-free transcriptomes have no reference species genome; the present invention provides a mature and practical solution for enrichment analysis of reference-free transcriptomes. Second, when a conventional KEGG enrichment pathway map of differential genes is drawn, the gene information has to be organized manually, the corresponding KO numbers entered manually, and the coloring of each gene decided manually. The script of the invention organizes the differential genes and their annotation information, thereby automating these steps. Finally, conventional tools for drawing KEGG pathway enrichment maps limit the number of genes processed at one time, so high throughput cannot be achieved; moreover, the results contain no gene link information, so the gene information has to be queried again during interpretation, which is inconvenient. The method of the invention draws the KEGG pathway enrichment results with the whole workflow running automatically and at high throughput, and outputs the results in two formats, of which the svg picture contains the gene information in the pathway map, facilitating later interpretation of the data. The innovation of the invention lies in applying bioinformatics techniques to the automatic processing of large batches of differential gene expression data and obtaining high-quality KEGG enrichment pathway maps at high throughput.

Drawings

Exemplary embodiments of the present invention may be more completely understood in consideration of the following drawings:

FIG. 1 is a flow chart of a method for determining differential gene pathways based on reference-free transcriptome sequencing according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method of generating a KEGG pathway plot according to an embodiment of the invention;

FIG. 3 is a scatter plot of KEGG pathway enrichment according to an embodiment of the invention;

FIG. 4 is a schematic diagram of information of a kegg_viewer according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the png format of a KEGG pathway enrichment map according to an embodiment of the invention;

FIG. 6 is a schematic diagram of the svg format of a KEGG pathway enrichment map according to an embodiment of the invention; and

FIG. 7 is a schematic diagram of the structure of a system for determining differential gene pathways based on reference-free transcriptome sequencing according to an embodiment of the present invention.

Detailed Description

The exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, however, the present invention may be embodied in many different forms and is not limited to the examples described herein, which are provided to fully and completely disclose the present invention and fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, like elements/components are referred to by like reference numerals.

Unless otherwise indicated, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, it will be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.

FIG. 1 is a flow chart of a method 100 for determining differential gene pathways based on reference-free transcriptome sequencing according to an embodiment of the present invention. The method 100 begins at step 101.

In step 101, transcriptome analysis is performed on differential genes from species samples for which reference genomes are not determined to obtain a differential gene list including a plurality of gene differences, and a pre-generated gene sequence file from which repetitive gene sequences are removed is obtained. Wherein the format of the gene sequence file is text-based and is used to represent the nucleotide sequence or the amino acid sequence.

The method 100 may use two input files. A list of differentially expressed genes is typically obtained after transcriptome sequencing. In addition, analysis of a reference-free transcriptome produces an assembled gene sequence (unigene) file, typically in fasta format. The present application uses these two files, the differentially expressed gene list file and the unigene file, as the input files for method 100.

In step 102, the gene sequence file is parsed to determine a plurality of gene sequences, and pathway annotation is performed for each gene in the plurality of gene sequences, thereby generating an annotated gene sequence file. Wherein pathway annotation for each gene in the plurality of gene sequences comprises: a gene identification number and a gene pathway number are determined for each gene in the plurality of gene sequences to enable pathway annotation of each gene.

In step 103, annotation information is added for each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file. Also included is removing duplicate gene data items from all gene data items in the differential gene list. Also included is removal of duplicate gene sequences from all gene sequences in the annotated gene sequence file.
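Step 103 can be illustrated with a minimal Python sketch, assuming the differential gene list and the pathway annotation have already been loaded into memory. The field names (`gene_id`, `logfoldchange`, `qvalue`, `kegg_ko`, `kegg_pathway`), the helper names and the numeric values are hypothetical and chosen only for illustration; the gene IDs are taken from Table 1.

```python
# Sketch of step 103: remove duplicate data items, then attach the pathway
# annotation of each gene to its row in the differential gene list.
# All field names and numeric values are illustrative assumptions.

def dedupe(rows, key):
    """Keep the first row seen for each key value, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def add_annotation(diff_genes, annotation):
    """Return copies of the differential gene rows with KO and pathway fields."""
    annotated = []
    for row in dedupe(diff_genes, "gene_id"):
        # Genes without any annotation get empty lists.
        kos, pathways = annotation.get(row["gene_id"], ([], []))
        annotated.append({**row, "kegg_ko": kos, "kegg_pathway": pathways})
    return annotated


annotation = {
    "sc_8/104/598": (["K02716"], ["ko00195", "ko01100"]),
    "sc_27/112/611": (["K02950"], ["ko03010"]),
}
diff_genes = [
    {"gene_id": "sc_8/104/598", "logfoldchange": 2.3, "qvalue": 0.001},
    {"gene_id": "sc_8/104/598", "logfoldchange": 2.3, "qvalue": 0.001},  # duplicate
    {"gene_id": "sc_61/117/536", "logfoldchange": -1.8, "qvalue": 0.02},
]
annotated = add_annotation(diff_genes, annotation)
```

Keeping unannotated genes in the output with empty lists, rather than dropping them, is a design choice of this sketch; the source does not state how such genes are handled.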

The annotation of the gene sequences (unigenes) proceeds as follows. The unigenes are first annotated with KEGG pathways in preparation for the subsequent KEGG pathway enrichment analysis. In practice, many tools can perform KEGG pathway annotation; eggnog-mapper v2.0.1 is used here as an example. The input file is a unigene text file. Annotating the gene sequences yields an annotation file, as shown in Table 1. Whichever software is used, the annotation information obtained should include at least the KO number corresponding to each gene and the pathway number (ko number) of the pathway in which it resides.

Table 1 KEGG annotation file content

GeneID	KEGG_ko	KEGG_Pathway
sc_6/125/1253	NA	NA
sc_8/104/598	ko:K02716	ko00195,ko01100,map00195,map01100
sc_9/116/406	ko:K13094	NA
sc_27/112/611	ko:K02950	ko03010,map03010
sc_41/105/390	NA	NA
sc_51/118/501	ko:K00706,ko:K11000	ko00500,ko04011,map00500,map04011
sc_58/103/969	NA	NA
sc_61/117/536	NA	NA
sc_64/113/508	NA	NA
sc_65/109/447	NA	NA
sc_67/131/501	ko:K02953	ko03010,map03010
sc_69/130/944	NA	NA
sc_73/113/525	ko:K09955	NA
sc_79/122/628	NA	NA
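An annotation file laid out like Table 1 could be read as sketched below. This is a minimal sketch that assumes a tab-separated file with the three column names shown in Table 1; the function name `parse_kegg_annotation` is hypothetical. Keeping only the `ko…` pathway numbers and dropping the parallel `map…` entries is a simplification made here, not something the source prescribes.

```python
import csv
import io


def parse_kegg_annotation(handle):
    """Map each GeneID to (list of KO numbers, list of ko pathway numbers)."""
    annotation = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # "NA" marks a missing annotation and yields an empty list.
        kos = [k.replace("ko:", "") for k in row["KEGG_ko"].split(",") if k != "NA"]
        # Keep the ko-prefixed pathway numbers; "NA" and map entries are skipped.
        pathways = [p for p in row["KEGG_Pathway"].split(",") if p.startswith("ko")]
        annotation[row["GeneID"].strip()] = (kos, pathways)
    return annotation


# Three rows copied from Table 1.
SAMPLE = (
    "GeneID\tKEGG_ko\tKEGG_Pathway\n"
    "sc_6/125/1253\tNA\tNA\n"
    "sc_8/104/598\tko:K02716\tko00195,ko01100,map00195,map01100\n"
    "sc_51/118/501\tko:K00706,ko:K11000\tko00500,ko04011,map00500,map04011\n"
)
annotation = parse_kegg_annotation(io.StringIO(SAMPLE))
```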

In step 104, enrichment processing is performed on annotation information of each gene data item in the differential gene list to obtain an enriched plurality of gene pathways, a plurality of target gene pathways are selected from the enriched plurality of gene pathways, and each target gene pathway is used as a differential gene pathway to obtain a plurality of differential gene pathways.

Wherein selecting a plurality of target gene pathways from the enriched plurality of gene pathways comprises: sorting the enriched gene pathways in descending order of confidence to generate a sorted sequence, and selecting a predetermined number of gene pathways with the highest confidence from the sorted sequence as target gene pathways.

Pathway enrichment is then performed on the differential genes. Online tools for differential gene pathway enrichment tend to have low throughput and limit the number of genes. The present application therefore employs an analysis method that runs automatically at high throughput and is suitable for both KEGG and GO enrichment analysis. The analysis method places no limit on the number of input genes; once the differential gene list and the unigene annotation file are input, an annotated scatter plot of the KEGG pathways is obtained, as shown in FIG. 3.

FIG. 3 is a scatter plot of KEGG pathway enrichment according to an embodiment of the invention, shown as a screenshot of the generated plot. The abscissa in FIG. 3, the gene ratio (GeneRatio), represents the degree of enrichment; a larger GeneRatio indicates that more genes are enriched in the pathway and that the enrichment is more reliable. The ordinate gives the description of each enriched pathway. The size of each point represents the number of genes, and the color (or gray level) of each point represents the enrichment pvalue.

The specific operation flow of the analysis method is as follows:

a) The differential gene list and the unigene annotation file are read in, and duplicate data items in both are removed.

b) KEGG annotation information is added according to the KEGG enrichment analysis set in the parameters; that is, each row of the differential gene list receives the KEGG annotation information that corresponds to it in the unigene annotation file.

c) The KEGG annotation information in the differential gene table is enriched using an enrichment function. The algorithm uses the Benjamini-Hochberg (BH) procedure by default, and the results are stored in a data frame and simultaneously saved to a KEGG result file.

d) The results are sorted by the pvalue column, and a scatter plot is drawn for the top 10 significantly enriched pathways. The gene ratio GeneRatio, the gene enrichment index, is calculated using a function in the tool ggplot. The dotplot function is used to set the x-axis, y-axis, point fill, point size, annotation color, font color and other properties of the plot, thereby drawing the scatter plot.
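Steps c) and d) can be sketched in Python as follows. The source does not name the enrichment function or the test statistic it uses, so the one-sided hypergeometric test below is an assumption (it is a common choice for over-representation analysis), and the function names and toy gene IDs are hypothetical. Only the Benjamini-Hochberg adjustment, the sorting by pvalue and the top-10 cutoff come from the text.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K belong to the pathway."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrich(diff_gene_ids, gene_to_pathways, top_n=10):
    """Assumed over-representation test; returns the top_n rows by p-value."""
    background = {g for g, p in gene_to_pathways.items() if p}
    hits = [g for g in dict.fromkeys(diff_gene_ids) if g in background]
    sizes, members = {}, {}
    for g in background:
        for p in gene_to_pathways[g]:
            sizes[p] = sizes.get(p, 0) + 1
    for g in hits:
        for p in gene_to_pathways[g]:
            members.setdefault(p, []).append(g)
    rows = [
        {
            "pathway": p,
            "count": len(genes),
            "gene_ratio": len(genes) / len(hits),
            "pvalue": hypergeom_sf(len(genes), len(background), sizes[p], len(hits)),
        }
        for p, genes in members.items()
    ]
    for row, q in zip(rows, benjamini_hochberg([r["pvalue"] for r in rows])):
        row["p_adjust"] = q
    rows.sort(key=lambda r: r["pvalue"])  # smallest p-value first
    return rows[:top_n]


# Toy background of eight genes; g8 has no pathway annotation.
gene_to_pathways = {
    "g1": ["ko00195"], "g2": ["ko00195"], "g3": ["ko00195", "ko03010"],
    "g4": ["ko03010"], "g5": ["ko03010"], "g6": ["ko03010"],
    "g7": ["ko03010"], "g8": [],
}
result = enrich(["g1", "g2", "g3", "g8"], gene_to_pathways)
```

In this sketch the background is the set of all annotated unigenes and GeneRatio is the number of differential genes in a pathway divided by the number of annotated differential genes; both definitions are assumptions. Sorting by ascending pvalue puts the most reliable pathways first.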

A drawing file for the differential gene KEGG pathway enrichment map is prepared. The method 100 extracts the differential genes and other useful information from the enrichment file kegg.result.xls obtained in the previous step and adds drawing factors, so as to produce the input files kegg_path.list and kegg_gene.list for KEGG pathway drawing, as shown in Table 2 and Table 3.

The kegg_path.list table consists of the gene KO number column of the kegg.result.xls table and the corresponding pathway column. The kegg_gene.list table is automatically organized and output by the script get_kegg_gene.py provided by the present application. The script first reads the differential genes and the corresponding KO number column from the enrichment file kegg.result.xls, then looks up the fold change (logfoldchange) and the confidence (qvalue) of each differential gene in the original differential gene table by differential gene ID, and finally filters the genes and determines their colors. The criteria are: genes with logfoldchange > 1 and qvalue < 0.05 are marked red; genes with logfoldchange < -1 and qvalue < 0.05 are marked blue.
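The color rule described for get_kegg_gene.py can be sketched as below. This is not the script itself, whose code the source does not reproduce: the function names, the in-memory input layout and the sample fold changes and qvalues are hypothetical. Only the two thresholds and the red/blue assignment come from the text.

```python
def gene_color(logfoldchange, qvalue, fc_cutoff=1.0, q_cutoff=0.05):
    """Return 'red', 'blue', or None when the gene passes neither criterion."""
    if qvalue < q_cutoff and logfoldchange > fc_cutoff:
        return "red"
    if qvalue < q_cutoff and logfoldchange < -fc_cutoff:
        return "blue"
    return None


def build_kegg_gene_list(enriched, diff_table):
    """Build 'KO<TAB>color' lines.

    enriched: (gene_id, ko) pairs taken from the enrichment file.
    diff_table: gene_id -> (logfoldchange, qvalue) from the original list.
    """
    lines = []
    for gene_id, ko in enriched:
        if gene_id not in diff_table:
            continue  # no fold change or qvalue available for this gene
        color = gene_color(*diff_table[gene_id])
        if color is not None:
            lines.append(f"{ko}\t{color}")
    return lines


# Gene IDs and KO numbers from Table 1; the numeric values are made up.
enriched = [
    ("sc_8/104/598", "K02716"),
    ("sc_27/112/611", "K02950"),
    ("sc_67/131/501", "K02953"),
]
diff_table = {
    "sc_8/104/598": (2.3, 0.001),
    "sc_27/112/611": (-1.6, 0.01),
    "sc_67/131/501": (0.4, 0.2),
}
lines = build_kegg_gene_list(enriched, diff_table)
```

Both comparisons are strict, matching the wording of the criteria, so a gene with a logfoldchange of exactly 1 or a qvalue of exactly 0.05 is left uncolored.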

Table 2 kegg_path.list

Gene KO number	Pathway corresponding to gene
K02092	ko00196
K02289	ko00196

Table 3 kegg_gene.list

Gene KO number	Gene color
K00001	red
K00059	blue
K00083	blue
K00083	red
K00128	red

The method 100 further comprises generating a dot plot of gene pathways from the differential gene list and the annotated gene sequence file. In the dot plot, the abscissa represents the degree of enrichment, reflecting the number of enriched genes and the reliability of the enrichment, and the ordinate represents the description of the enriched gene pathways; the number of genes is indicated by the size of the dots, and the reliability of enrichment is indicated by the color of the dots.

The method 100 further includes performing enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file; extracting a gene identification number and a gene pathway number from the pathway annotation of the enrichment file to generate a pathway list file; determining the differential genes according to the gene identification numbers in the pathway list file; determining the fold change and the confidence of each differential gene in the differential gene list according to the identifier of the differential gene; and determining an identification symbol for each differential gene according to its fold change and confidence, so as to generate an identified differential gene list.

Wherein determining the identification symbol for the differential gene based on the fold change and the confidence level of the differential gene comprises: assigning a first type of identifier to differential genes with a fold change greater than 1 and a confidence level less than 0.05; and assigning a second type of identifier to differential genes with a fold change less than -1 and a confidence level less than 0.05.

The method 100 further includes performing enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file; analyzing the enrichment file to determine a gene pathway map and determining a webpage configuration file corresponding to the gene pathway map; establishing an association between the gene identification number of each gene in the gene pathway map and the identification number of the corresponding differential gene in the differential gene list; determining, according to the association, the coordinates in the webpage configuration file of the differential genes in the gene pathway map; and marking each differential gene at its coordinate position according to the identification symbol of the differential gene in the identified differential gene list, so as to generate a pathway enrichment map; wherein in the pathway enrichment map, the coordinate position of each differential gene is a pixel icon; alternatively, in the pathway enrichment map, the coordinate position of each differential gene is a web page link.

Since KEGG pathway enrichment results are not intuitive, the results are usually displayed on the KEGG pathway map, which conventional methods can only do manually. KEGG pathway enrichment maps produced in this way have low throughput, and the generated images contain no gene links for the pathway, which makes them inconvenient to view. For this purpose, the present application passes the two files obtained above, kegg_path.list and kegg_gene.list, to the software kegg_viewer, which outputs pathway enrichment maps in two formats, png and svg. FIG. 4 is a schematic information diagram of kegg_viewer according to an embodiment of the present invention; it is a screenshot showing the description and parameters of the kegg_viewer program.

FIG. 5 is a schematic diagram (screenshot) of the png format of a KEGG pathway enrichment map according to an embodiment of the present invention, and FIG. 6 is a schematic diagram (screenshot) of the svg format of a KEGG pathway enrichment map according to an embodiment of the present invention.

The specific process flow of the software kegg_viewer comprises the following steps:

a) First, the KEGG pathway maps listed in kegg_path.list are downloaded and loaded, and the corresponding webpage configuration files are obtained.

b) The KO numbers of the genes on each pathway map are matched one-to-one with the KO numbers of the differential genes in the kegg_gene.list differential gene list.

c) The coordinates of the matched genes in the KEGG pathway map are then extracted from the webpage configuration file.

d) Finally, the genes are colored at the coordinates found in the previous step, using the color assigned to each differential gene in the kegg_gene.list table, so as to obtain a KEGG pathway enrichment map with color marks. The png format is a pixel image; in the svg format, clicking a gene name in the pathway jumps directly to the corresponding gene annotation webpage, which makes the results convenient to check. The software kegg_viewer can process the enrichment analysis results of all differential genes automatically and in a single run.
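Steps b) to d) can be illustrated with a small Python sketch that produces an svg overlay. It is not the kegg_viewer program, whose internals the source does not disclose. It assumes that the coordinates have already been extracted from the webpage configuration file into `(ko, x, y, width, height)` tuples, a hypothetical layout, and that gene links are formed by appending the KO number to a URL prefix, which is likewise an assumption.

```python
from xml.sax.saxutils import escape, quoteattr


def read_gene_colors(lines):
    """Parse 'KO<TAB>color' lines; a KO listed twice keeps its last color."""
    colors = {}
    for line in lines:
        ko, color = line.split("\t")
        colors[ko] = color
    return colors


def svg_overlay(areas, ko_colors, width, height,
                link_prefix="https://www.kegg.jp/entry/"):
    """Draw one linked, colored rectangle per differential gene on the map.

    areas: (ko, x, y, w, h) tuples, an assumed form of the extracted coordinates.
    link_prefix: assumed URL pattern for the gene annotation page.
    """
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{width}" height="{height}">']
    for ko, x, y, w, h in areas:
        color = ko_colors.get(ko)
        if color is None:
            continue  # gene on the map is not in the differential gene list
        href = quoteattr(link_prefix + ko)
        parts.append(
            f'<a href={href}><rect x="{x}" y="{y}" width="{w}" height="{h}" '
            f'fill="{color}" fill-opacity="0.5">'
            f'<title>{escape(ko)}</title></rect></a>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


colors = read_gene_colors(["K00001\tred", "K00083\tblue", "K00083\tred"])
areas = [("K00001", 10, 20, 46, 17), ("K00128", 80, 20, 46, 17)]
svg = svg_overlay(areas, colors, 200, 100)
```

Table 3 lists K00083 with both blue and red. The source does not say how such conflicts are resolved, so letting the last entry win is only a placeholder rule.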

FIG. 2 is a flow chart of a method of generating a KEGG pathway plot according to an embodiment of the invention.

Step 1, input files: there are two input files. A list of differentially expressed genes is typically obtained after transcriptome sequencing. In addition, analysis of a reference-free transcriptome produces an assembled gene sequence (unigene) file, typically in fasta format. The present application uses these two files, the differentially expressed gene list file and the unigene file, as the input files for method 100.

Step 2, annotation of the gene sequences (unigenes): the unigenes are first annotated with KEGG pathways in preparation for the subsequent KEGG pathway enrichment analysis. In practice, many tools can perform KEGG pathway annotation; the software tool eggnog-mapper v2.0.1 is used here as an example. The input file is a unigene text file. Annotating the gene sequences yields an annotation file, as shown in Table 1. Whichever software is used, the annotation information obtained should include at least the KO number corresponding to each gene and the pathway number (ko number) of the pathway in which it resides.

Step 3, pathway enrichment of differential genes: online tools for differential gene pathway enrichment tend to have low throughput and limit the number of genes. The present application therefore employs an analysis method that runs automatically at high throughput and is suitable for both KEGG and GO enrichment analysis. The analysis method places no limit on the number of input genes; once the differential gene list and the unigene annotation file are input, an annotated scatter plot of the KEGG pathways is obtained, as shown in FIG. 3.

FIG. 3 is a scatter plot of KEGG pathway enrichment according to an embodiment of the invention. The abscissa in FIG. 3, the gene ratio (GeneRatio), represents the degree of enrichment; a larger GeneRatio indicates that more genes are enriched in the pathway and that the enrichment is more reliable. The ordinate gives the description of each enriched pathway. The size of each point represents the number of genes, and the color (or gray level) of each point represents the enrichment pvalue.

The specific operation flow of the analysis method is as follows:

d) The results are sorted by the pvalue column, and a scatter plot is drawn for the top 10 significantly enriched pathways. The gene ratio GeneRatio, the gene enrichment index, is calculated using a function in the tool ggplot. The dotplot function is used to set the x-axis, y-axis, point fill, point size, annotation color, font color and other properties of the plot, thereby drawing the scatter plot.

Step 4, preparing the drawing file for the differential gene KEGG pathway enrichment map: the differential genes and other useful information in the enrichment file kegg.result.xls obtained in the previous step are extracted and drawing factors are added, so as to produce the input files kegg_path.list and kegg_gene.list for KEGG pathway drawing, as shown in Tables 2 and 3.

Step 5: because the KEGG pathway enrichment result is not intuitive, it is usually displayed on the KEGG pathway map, and the conventional approach can only do this manually. Generating KEGG pathway enrichment maps in this way is low-throughput, and the generated image does not contain links for the genes in the pathway, which makes it inconvenient to view. For this purpose, the present application takes the two files obtained above, kegg_path.list and kegg_gene.list, and uses the software kegg_viewer to output pathway enrichment maps in two formats, png and svg. FIG. 4 is a schematic diagram of kegg_viewer information according to an embodiment of the present invention; it shows the description and parameters of the kegg_viewer program.

FIG. 5 is a schematic diagram of a KEGG pathway enrichment map in png format according to an embodiment of the present invention, and FIG. 6 is a schematic diagram of a KEGG pathway enrichment map in svg format according to an embodiment of the present invention.

FIG. 7 is a schematic diagram of a system 700 for determining differential gene pathways based on reference-free transcriptome sequencing according to an embodiment of the present invention. The system 700 includes: analysis means 701, annotation means 702, adding means 703, processing means 704, deduplication means 705, generation means 706, determination means 707, and creation means 708.

The analysis device 701 performs transcriptome analysis on differential genes from a species sample for which a reference genome is not determined to acquire a differential gene list including a plurality of gene differences, and acquires a previously generated gene sequence file from which a repeated gene sequence is removed. The format of the gene sequence file is text-based and is used to represent a nucleotide sequence or an amino acid sequence.

Annotating device 702 parses the gene sequence file to determine a plurality of gene sequences and performs pathway annotation on each gene in the plurality of gene sequences, thereby generating an annotated gene sequence file. The pathway annotation of each gene in the plurality of gene sequences by annotating device 702 comprises: determining a gene identification number and a gene pathway number for each gene in the plurality of gene sequences, so as to achieve pathway annotation of each gene.

The adding means 703 adds annotation information for each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file.
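As a non-limiting illustration, adding annotation information amounts to a lookup from gene identification number to gene pathway number. The tab-separated two-column layout, the field name gene_id, and the example identifiers below are assumptions; the description does not specify the file layout.

```python
def parse_annotation(lines):
    """Build gene ID -> list of pathway IDs.
    Assumed layout per line: gene_id <TAB> pathway_id (hypothetical)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        gene_id, pathway_id = line.split("\t")[:2]
        mapping.setdefault(gene_id, []).append(pathway_id)
    return mapping

def annotate(diff_rows, mapping):
    """Return a copy of each differential gene row with a 'pathways' field added.
    Genes without any pathway annotation receive an empty list."""
    return [{**row, "pathways": mapping.get(row["gene_id"], [])} for row in diff_rows]

# Hypothetical example data.
annotation_lines = [
    "unigene1\tko00010",
    "unigene1\tko00020",
    "unigene2\tko00030",
]
diff_rows = [
    {"gene_id": "unigene1", "fold_change": 2.0},
    {"gene_id": "unigene9", "fold_change": -1.5},
]
annotated = annotate(diff_rows, parse_annotation(annotation_lines))
```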

The processing means 704 performs enrichment processing on the annotation information of each gene data item in the differential gene list to obtain a plurality of enriched gene pathways, selects a plurality of target gene pathways from the enriched gene pathways, and takes each target gene pathway as a differential gene pathway to obtain a plurality of differential gene pathways. The selection of the plurality of target gene pathways by the processing means 704 comprises: sorting the enriched gene pathways in descending order of confidence to generate a descending sequence, and selecting from that sequence a predetermined number of gene pathways with the greatest confidence as the target gene pathways.

The deduplication device 705 removes duplicate gene data items among all the gene data items in the differential gene list. The deduplication device 705 removes duplicate gene sequences from all gene sequences in the annotated gene sequence file.
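The removal of repeated gene sequences can be sketched as follows, for illustration only. The sketch assumes a FASTA-like text file (header lines starting with ">") and treats two records as duplicates when their sequences match exactly, ignoring case; neither the file layout nor the duplicate criterion is stated in the description.

```python
def read_fasta(text):
    """Parse FASTA-like text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def dedupe_sequences(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen, unique = set(), []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique

# Hypothetical example: u2 repeats the sequence of u1.
fasta_text = ">u1\nACGT\nAC\n>u2\nacgtac\n>u3\nTTTT\n"
unique_records = dedupe_sequences(read_fasta(fasta_text))
```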

The generating means 706 generates a dot plot of gene pathways from the differential gene list and the annotated gene sequence file. In the dot plot, the abscissa represents the degree of enrichment in terms of gene number and reliability, and the ordinate represents the description of the enriched gene pathways; the number of genes is indicated by the size of the dots, and the reliability of enrichment is indicated by the color of the dots.

The determining means 707 performs enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file; extracts a gene identification number and a gene pathway number from the pathway annotation of the enrichment file to generate a pathway list file; determines a differential gene according to the gene identification number in the pathway list file; determines the fold change and the confidence level of the differential gene in the differential gene list according to the identifier of the differential gene; and determines an identification symbol for the differential gene according to its fold change and confidence level, so as to generate an identified differential gene list. Determining an identification symbol for the differential gene according to its fold change and confidence level comprises: assigning a first type of identifier to a differential gene with a fold change greater than 1 and a confidence level less than 0.05; and assigning a second type of identifier to a differential gene with a fold change less than -1 and a confidence level less than 0.05.
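The identifier rule stated above can be written as a short function, shown here as a minimal sketch. The thresholds (1, -1, 0.05) are taken from the description; the labels "up" and "down" are hypothetical stand-ins for the first and second type of identifier, and returning None for genes that meet neither condition is an assumption.

```python
def identifier_for(fold_change, confidence):
    """Return the identifier type for one differential gene.

    "up"   : first type,  fold change > 1  and confidence < 0.05
    "down" : second type, fold change < -1 and confidence < 0.05
    None   : neither condition met (assumed behaviour)
    """
    if confidence < 0.05:
        if fold_change > 1:
            return "up"
        if fold_change < -1:
            return "down"
    return None

# Hypothetical differential gene rows.
genes = [
    {"gene_id": "unigene1", "fold_change": 2.3, "confidence": 0.01},
    {"gene_id": "unigene2", "fold_change": -1.8, "confidence": 0.03},
    {"gene_id": "unigene3", "fold_change": 2.3, "confidence": 0.20},
]
identified = [
    {**g, "identifier": identifier_for(g["fold_change"], g["confidence"])} for g in genes
]
```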

Establishing means 708 performs enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file; parses the enrichment file to determine a gene pathway map and determines a web page configuration file corresponding to the gene pathway map; establishes an association between the gene identification number of each gene pathway in the gene pathway map and the identification number of the corresponding differential gene in the differential gene list; determines, according to the association, coordinates in the web page configuration file for the differential genes in the gene pathway map; and marks each differential gene at its coordinate position according to its identification symbol in the identified differential gene list, so as to generate a pathway enrichment map. In the pathway enrichment map, the coordinate position of each differential gene is a pixel icon; alternatively, in the pathway enrichment map, the coordinate position of each differential gene is a web page link.
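The web page link variant can be illustrated with a sketch that emits an SVG overlay. This is not the kegg_viewer implementation: the function name, the box layout (x, y, width, height), the color mapping, and the example.org link pattern are all hypothetical, and the coordinates are assumed to have been read already from the web page configuration file.

```python
from xml.sax.saxutils import quoteattr

def build_overlay_svg(width, height, boxes, labels, colors, link_for):
    """Emit an SVG overlay with one linked rectangle per identified gene.

    boxes    : gene ID -> (x, y, w, h), assumed taken from the configuration file
    labels   : gene ID -> identifier (e.g. "up" / "down", hypothetical names)
    colors   : identifier -> fill color
    link_for : function mapping a gene ID to a URL (hypothetical)
    """
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for gene_id, (x, y, w, h) in boxes.items():
        label = labels.get(gene_id)
        if label is None:
            continue  # gene is on the map but is not an identified differential gene
        fill = quoteattr(colors[label])
        rect = f'<rect x="{x}" y="{y}" width="{w}" height="{h}" fill={fill} fill-opacity="0.5"/>'
        # Plain href on <a> follows SVG 2; SVG 1.1 viewers expect xlink:href instead.
        parts.append(f"<a href={quoteattr(link_for(gene_id))}>{rect}</a>")
    parts.append("</svg>")
    return "\n".join(parts)

# Hypothetical example: two boxes on the map, one of them a differential gene.
svg = build_overlay_svg(
    100, 50,
    {"K00001": (10, 10, 46, 17), "K00002": (60, 10, 46, 17)},
    {"K00001": "up"},
    {"up": "red", "down": "blue"},
    lambda gene_id: "https://example.org/gene/" + gene_id,
)
```

In practice such an overlay would be layered on the pathway image so that each marked gene is both colored by its identifier and clickable.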

The invention has been described with reference to a few embodiments. However, as is well known to those skilled in the art, other embodiments than the above disclosed invention are equally possible within the scope of the invention, as defined by the appended patent claims.

Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise therein. All references to "a/an/the [ means, component, etc. ]" are to be interpreted openly as referring to at least one instance of said means, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

Claims

1. A method of determining differential gene pathways based on reference-free transcriptome sequencing, the method comprising:

performing enrichment processing on the annotation information of each gene data item in the differential gene list to obtain a plurality of enriched gene pathways, selecting a plurality of target gene pathways from the plurality of enriched gene pathways, and taking each target gene pathway as a differential gene pathway to obtain a plurality of differential gene pathways;

2. The method of claim 1, wherein the format of the gene sequence file is text-based and is used to represent a nucleotide sequence or an amino acid sequence.

3. The method of claim 1, wherein pathway annotation for each gene in the plurality of gene sequences comprises:

4. The method of claim 1, further comprising removing duplicate gene data items from all gene data items in the differential gene list.

5. The method of claim 1, further comprising removing duplicate gene sequences from all gene sequences in the annotated gene sequence file.

6. The method of claim 1, further comprising,

generating a dot plot of gene pathways according to the differential gene list and the annotated gene sequence file;

7. The method of claim 1, further comprising,

performing enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file;

8. The method of claim 7, wherein determining an identification symbol for a differential gene based on the fold change and the confidence level of the differential gene comprises:

9. The method of claim 1 or 7, further comprising,

10. A system for determining differential gene pathways based on reference-free transcriptome sequencing, the system comprising:

a processing device for performing enrichment processing on annotation information of each gene data item in the differential gene list to obtain a plurality of enriched gene pathways, selecting a plurality of target gene pathways from the plurality of enriched gene pathways, and obtaining a plurality of differential gene pathways by taking each target gene pathway as a differential gene pathway;

11. The system of claim 10, wherein the format of the gene sequence file is text-based and is used to represent a nucleotide sequence or an amino acid sequence.

12. The system of claim 10, wherein annotating each gene in the plurality of gene sequences with an annotation device comprises:

13. The system of claim 10, further comprising a deduplication device that removes duplicate gene data items from all gene data items in the differential gene list.

14. The system of claim 10, further comprising a deduplication device that removes duplicate gene sequences from all gene sequences in the annotated gene sequence file.

15. The system of claim 10, further comprising generating means for generating a dot plot of gene pathways from the differential gene list and the annotated gene sequence file;

16. The system of claim 10, further comprising a determining device,

17. The system of claim 16, wherein the determining means determining the identification symbol for the differential gene based on the fold change and the confidence level of the differential gene comprises:

18. The system according to claim 10 or 16, further comprising establishing means,