CN114333994B - Method and system for determining differential gene pathways based on reference-free transcriptome sequencing

Info

Publication number
CN114333994B
Authority
CN
China
Prior art keywords
gene
differential
list
enrichment
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011069458.4A
Other languages
Chinese (zh)
Other versions
CN114333994A (en)
Inventor
田振阳
王苹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Modern Innovation Traditional Chinese Medicine Technology Co ltd
Original Assignee
Tianjin Modern Innovation Traditional Chinese Medicine Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Modern Innovation Traditional Chinese Medicine Technology Co ltd
Priority to CN202011069458.4A
Publication of CN114333994A
Application granted
Publication of CN114333994B
Legal status: Active
Anticipated expiration

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method and a system for determining differential gene pathways based on reference-free transcriptome sequencing. The method comprises: obtaining a differential gene list comprising a plurality of differential genes, and obtaining a previously generated gene sequence file from which duplicate gene sequences have been removed; parsing the gene sequence file to determine a plurality of gene sequences, and performing pathway annotation on each gene in the plurality of gene sequences to generate an annotated gene sequence file; adding annotation information to each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file; and performing enrichment processing on the annotation information of each gene data item in the differential gene list to obtain a plurality of enriched gene pathways, selecting a plurality of target gene pathways from the enriched gene pathways, and taking each target gene pathway as a differential gene pathway, thereby obtaining a plurality of differential gene pathways.

Description

Method and system for determining differential gene pathways based on reference-free transcriptome sequencing
Technical Field
The present invention relates to the field of bioinformatic analysis technology, and more particularly, to methods and systems for determining differential gene pathways based on reference-free transcriptome sequencing.
Background
Transcriptome sequencing refers to sequencing the mRNA transcribed by the cells of a particular biological sample under given conditions, and comparing gene expression differences between samples. Reference-free transcriptome analysis refers to transcriptome analysis performed on a species for which no reliable reference genome is available. The most important part of such an analysis is the pathway analysis of the differential genes, which includes pathway annotation, enrichment, and mapping.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) database (https://www.kegg.jp/) is a database that integrates genomic, chemical and systemic functional information. The KEGG database is commonly used for pathway annotation of genes. After the differentially expressed genes have been obtained, they usually need to be compared against the KEGG database so that they can be enriched into different pathways. The results are displayed by marking genes with different changes in different colors on the KEGG pathway map, which makes the results easier to interpret.
In existing methods, KEGG annotation and enrichment are standard procedures in transcriptome analysis. However, KEGG enrichment for a reference-free transcriptome often still requires reference species information to be selected, and is therefore not generally applicable to species without a reference genome. In the step of drawing KEGG enrichment pathway maps, drawing is mainly done with online tools such as the KEGG mapper. However, these tools require the differential genes to be labeled manually, have low throughput, and require the gene information to be queried again when the resulting pictures are viewed.
Disclosure of Invention
To overcome the shortcomings of prior-art methods for differential gene pathway enrichment analysis of reference-free transcriptomes, the invention provides a high-throughput method for obtaining high-quality KEGG pathway enrichment maps for a reference-free transcriptome. The invention aims to solve the problems that differential gene pathway enrichment of a reference-free transcriptome depends on a reference genome, and that drawing KEGG enrichment maps is low-throughput and manual, and thereby to achieve high-throughput, automated and efficient KEGG enrichment and pathway map drawing for reference-free transcriptomes.
To solve the problems in the prior art, the present invention provides a method for determining differential gene pathways based on reference-free transcriptome sequencing, the method comprising:
performing transcriptome analysis on a sample of a species for which no reference genome has been determined, to obtain a differential gene list comprising a plurality of differential genes, and obtaining a previously generated gene sequence file from which duplicate gene sequences have been removed;
parsing the gene sequence file to determine a plurality of gene sequences, and performing pathway annotation on each gene in the plurality of gene sequences to generate an annotated gene sequence file;
adding annotation information to each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file; and
performing enrichment processing on the annotation information of each gene data item in the differential gene list to obtain a plurality of enriched gene pathways, selecting a plurality of target gene pathways from the enriched gene pathways, and taking each target gene pathway as a differential gene pathway, thereby obtaining a plurality of differential gene pathways.
The format of the gene sequence file is text-based and is used to represent a nucleotide sequence or an amino acid sequence.
Performing pathway annotation on each gene in the plurality of gene sequences comprises:
determining a gene identification number and a gene pathway number for each gene in the plurality of gene sequences, thereby annotating each gene with its pathway.
The method further comprises removing duplicate gene data items from the gene data items in the differential gene list.
The method further comprises removing duplicate gene sequences from the gene sequences in the annotated gene sequence file.
Selecting a plurality of target gene pathways from the enriched gene pathways comprises:
sorting the enriched gene pathways in descending order of confidence to generate a descending sequence, and selecting from that sequence a predetermined number of gene pathways with the highest confidence as the target gene pathways.
The method further comprises generating a dot plot of gene pathways from the differential gene list and the annotated gene sequence file;
in the dot plot, the abscissa represents the degree of enrichment, which reflects the number and reliability of the enriched genes, and the ordinate represents the description of each enriched gene pathway; and
in the dot plot, the size of each dot indicates the number of genes, and the color of each dot indicates the reliability of the enrichment.
The method further comprises performing enrichment processing on the annotation information of each gene data item in the differential gene list to generate an enrichment file;
extracting gene identification numbers and gene pathway numbers from the pathway annotation in the enrichment file to generate a pathway list file;
determining the differential genes according to the gene identification numbers in the pathway list file;
determining the fold change and the confidence value of each differential gene in the differential gene list according to the identifier of the differential gene; and
determining an identification mark for each differential gene according to its fold change and confidence value, to generate a marked differential gene list.
Determining the identification mark for a differential gene according to its fold change and confidence value comprises:
assigning a first type of identifier to differential genes with a fold change greater than 1 and a confidence value less than 0.05; and
assigning a second type of identifier to differential genes with a fold change less than -1 and a confidence value less than 0.05.
The method further comprises performing enrichment processing on the annotation information of each gene data item in the differential gene list to generate an enrichment file;
parsing the enrichment file to determine a gene pathway map, and determining the web page configuration file corresponding to the gene pathway map;
establishing an association between the gene identification number of each gene in the gene pathway map and the identification number of the corresponding differential gene in the differential gene list;
determining, according to the association, the coordinates in the web page configuration file of each differential gene in the gene pathway map; and
marking each differential gene at its coordinate position according to its identification mark in the marked differential gene list, to generate a pathway enrichment map;
wherein, in the pathway enrichment map, the coordinate position of each differential gene is either a pixel icon or a web page link.
According to another aspect of the present invention, there is provided a system for determining differential gene pathways based on reference-free transcriptome sequencing, the system comprising:
an analysis device for performing transcriptome analysis on a sample of a species for which no reference genome has been determined, to obtain a differential gene list comprising a plurality of differential genes, and for obtaining a previously generated gene sequence file from which duplicate gene sequences have been removed;
an annotating device for parsing the gene sequence file to determine a plurality of gene sequences, and performing pathway annotation on each gene in the plurality of gene sequences to generate an annotated gene sequence file;
an adding device for adding annotation information to each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file; and
a processing device for performing enrichment processing on the annotation information of each gene data item in the differential gene list to obtain a plurality of enriched gene pathways, selecting a plurality of target gene pathways from the enriched gene pathways, and taking each target gene pathway as a differential gene pathway, thereby obtaining a plurality of differential gene pathways.
The format of the gene sequence file is text-based and is used to represent a nucleotide sequence or an amino acid sequence.
The annotating device performing pathway annotation on each gene in the plurality of gene sequences comprises:
the annotating device determining a gene identification number and a gene pathway number for each gene in the plurality of gene sequences, thereby annotating each gene with its pathway.
The system further comprises a deduplication device for removing duplicate gene data items from the gene data items in the differential gene list.
The deduplication device is also used for removing duplicate gene sequences from the gene sequences in the annotated gene sequence file.
The processing device selecting a plurality of target gene pathways from the enriched gene pathways comprises:
the processing device sorting the enriched gene pathways in descending order of confidence to generate a descending sequence, and selecting from that sequence a predetermined number of gene pathways with the highest confidence as the target gene pathways.
The system further comprises a generating device for generating a dot plot of gene pathways from the differential gene list and the annotated gene sequence file;
in the dot plot, the abscissa represents the degree of enrichment, which reflects the number and reliability of the enriched genes, and the ordinate represents the description of each enriched gene pathway; and
in the dot plot, the size of each dot indicates the number of genes, and the color of each dot indicates the reliability of the enrichment.
The system further comprises a determining device for performing enrichment processing on the annotation information of each gene data item in the differential gene list to generate an enrichment file;
extracting gene identification numbers and gene pathway numbers from the pathway annotation in the enrichment file to generate a pathway list file;
determining the differential genes according to the gene identification numbers in the pathway list file;
determining the fold change and the confidence value of each differential gene in the differential gene list according to the identifier of the differential gene; and
determining an identification mark for each differential gene according to its fold change and confidence value, to generate a marked differential gene list.
The determining device determining the identification mark for a differential gene according to its fold change and confidence value comprises:
assigning a first type of identifier to differential genes with a fold change greater than 1 and a confidence value less than 0.05; and
assigning a second type of identifier to differential genes with a fold change less than -1 and a confidence value less than 0.05.
The system further comprises an establishing device for performing enrichment processing on the annotation information of each gene data item in the differential gene list to generate an enrichment file;
parsing the enrichment file to determine a gene pathway map, and determining the web page configuration file corresponding to the gene pathway map;
establishing an association between the gene identification number of each gene in the gene pathway map and the identification number of the corresponding differential gene in the differential gene list;
determining, according to the association, the coordinates in the web page configuration file of each differential gene in the gene pathway map; and
marking each differential gene at its coordinate position according to its identification mark in the marked differential gene list, to generate a pathway enrichment map;
wherein, in the pathway enrichment map, the coordinate position of each differential gene is either a pixel icon or a web page link.
The difficulties addressed by the present invention are as follows. First, conventional transcriptome KEGG enrichment analysis methods require reference species information to be specified, whereas many reference-free transcriptomes have no reference species genome; the present invention provides a mature and practical solution for enrichment analysis of reference-free transcriptomes. Second, when a conventional differential gene KEGG enrichment pathway map is drawn, the gene information has to be compiled manually, the KO number corresponding to each gene has to be entered manually, and the coloring of each gene has to be decided manually. The script of the invention compiles the differential genes and their annotation information, thereby automating this step. Finally, conventional KEGG pathway enrichment map drawing limits the number of genes processed at one time, so high throughput cannot be achieved; moreover, the result contains no gene link information, so the gene information has to be queried again during interpretation, which is inconvenient. The method of the invention draws the KEGG pathway enrichment results with the whole workflow being high-throughput and automated, and outputs the results in two formats, of which the svg format picture contains the gene information within the pathway map, facilitating later interpretation of the data. The innovation of the invention lies in applying bioinformatics techniques to the automated processing of large batches of differential gene expression data and obtaining high-quality KEGG enrichment pathway maps in a high-throughput manner.
Drawings
Exemplary embodiments of the present invention may be more completely understood in consideration of the following drawings:
FIG. 1 is a flow chart of a method for determining differential gene pathways based on reference-free transcriptome sequencing according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of generating a KEGG pathway plot according to an embodiment of the invention;
FIG. 3 is a scatter plot of KEGG pathway enrichment according to an embodiment of the invention;
FIG. 4 is a schematic diagram of information of a kegg_viewer according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the png format of a KEGG pathway enrichment map according to an embodiment of the invention;
FIG. 6 is a schematic diagram of the svg format of a KEGG pathway enrichment map according to an embodiment of the invention; and
FIG. 7 is a schematic diagram of the structure of a system for determining differential gene pathways based on reference-free transcriptome sequencing according to an embodiment of the present invention.
Detailed Description
The exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, however, the present invention may be embodied in many different forms and is not limited to the examples described herein, which are provided to fully and completely disclose the present invention and fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, like elements/components are referred to by like reference numerals.
Unless otherwise indicated, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, it will be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
FIG. 1 is a flow chart of a method 100 for determining differential gene pathways based on reference-free transcriptome sequencing according to an embodiment of the present invention. The method 100 begins at step 101.
In step 101, transcriptome analysis is performed on a sample of a species for which no reference genome has been determined, to obtain a differential gene list comprising a plurality of differential genes, and a previously generated gene sequence file from which duplicate gene sequences have been removed is obtained. The format of the gene sequence file is text-based and is used to represent nucleotide sequences or amino acid sequences.
The method 100 may take two input files. A list of differentially expressed genes is typically obtained after transcriptome sequencing. In addition, reference-free transcriptome analysis produces a file of assembled gene sequences, the unigene file, which is typically in FASTA format. The present application uses these two files, the differentially expressed gene list file and the unigene file, as the input files of method 100.
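As an illustration only, a unigene input of the kind described above could be read and de-duplicated along the following lines. This is a minimal Python sketch under stated assumptions, not the implementation of the present application: the function names `read_fasta` and `drop_duplicate_sequences` are hypothetical, the short sequences are made up, and only the sequence IDs are borrowed from Table 1.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text lines into {sequence_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            # A later record with the same ID replaces the earlier one.
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def drop_duplicate_sequences(records: Dict[str, str]) -> Dict[str, str]:
    """Keep only the first ID seen for each distinct sequence."""
    seen = set()
    unique: Dict[str, str] = {}
    for seq_id, seq in records.items():
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique[seq_id] = seq
    return unique


# Made-up example: the second record repeats the sequence of the first.
example = [
    ">sc_8/104/598 demo",
    "ATGGCA",
    "TTGA",
    ">sc_9/116/406",
    "ATGGCATTGA",
    ">sc_27/112/611",
    "ATGCCC",
]
unigenes = drop_duplicate_sequences(read_fasta(example))
```

Comparing sequences case-insensitively is a design choice of this sketch; an assembler-specific clustering step would normally be used to build the unigene set itself.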
In step 102, the gene sequence file is parsed to determine a plurality of gene sequences, and pathway annotation is performed on each gene in the plurality of gene sequences, thereby generating an annotated gene sequence file. Performing pathway annotation on each gene in the plurality of gene sequences comprises: determining a gene identification number and a gene pathway number for each gene in the plurality of gene sequences, thereby annotating each gene with its pathway.
In step 103, annotation information is added to each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file. The step further comprises removing duplicate gene data items from the gene data items in the differential gene list, and removing duplicate gene sequences from the gene sequences in the annotated gene sequence file.
The unigene sequences are annotated as follows. The unigenes are first annotated with KEGG pathways in preparation for the subsequent KEGG pathway enrichment analysis. In practice, many tools can perform KEGG pathway annotation; eggNOG-mapper v2.0.1 is used here as an example. The input file is the unigene text file. Annotating the gene sequences yields an annotation file as shown in Table 1. Whichever software is used, the annotation information obtained should include at least the KO number corresponding to each gene and the ko numbers of the pathways in which it is located.
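The joining step described above can be sketched as follows. This is a hedged Python illustration, not the code of the present application: the function name `annotate_differential_genes`, the dictionary layout, and the numeric values are assumptions made for the example; the column names `logfoldchange` and `qvalue` follow the names used later in this description.

```python
def annotate_differential_genes(diff_rows, annotation):
    """Attach KEGG annotation to each differential gene row.

    Rows whose gene ID has already been seen are dropped, which mirrors the
    removal of duplicate gene data items described in step 103.
    """
    seen = set()
    annotated = []
    for row in diff_rows:
        gene_id = row["gene_id"]
        if gene_id in seen:
            continue
        seen.add(gene_id)
        # Genes without an annotation entry are kept and marked "NA".
        ko, pathways = annotation.get(gene_id, ("NA", "NA"))
        annotated.append({**row, "KEGG_ko": ko, "KEGG_Pathway": pathways})
    return annotated


# Made-up differential gene rows; the first gene appears twice.
diff_rows = [
    {"gene_id": "sc_8/104/598", "logfoldchange": 2.3, "qvalue": 0.01},
    {"gene_id": "sc_8/104/598", "logfoldchange": 2.3, "qvalue": 0.01},
    {"gene_id": "sc_6/125/1253", "logfoldchange": -1.8, "qvalue": 0.03},
]
annotation = {
    "sc_8/104/598": ("ko:K02716", "ko00195,ko01100,map00195,map01100"),
}
annotated_rows = annotate_differential_genes(diff_rows, annotation)
```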
Table 1. Content of the KEGG annotation file
GeneID          KEGG_ko               KEGG_Pathway
sc_6/125/1253   NA                    NA
sc_8/104/598    ko:K02716             ko00195,ko01100,map00195,map01100
sc_9/116/406    ko:K13094             NA
sc_27/112/611   ko:K02950             ko03010,map03010
sc_41/105/390   NA                    NA
sc_51/118/501   ko:K00706,ko:K11000   ko00500,ko04011,map00500,map04011
sc_58/103/969   NA                    NA
sc_61/117/536   NA                    NA
sc_64/113/508   NA                    NA
sc_65/109/447   NA                    NA
sc_67/131/501   ko:K02953             ko03010,map03010
sc_69/130/944   NA                    NA
sc_73/113/525   ko:K09955             NA
sc_79/122/628   NA                    NA
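A file laid out like Table 1 could be loaded into a lookup structure roughly as shown below. This Python sketch assumes a whitespace-delimited three-column layout exactly as in Table 1; the function name `parse_kegg_annotation` is hypothetical, and dropping the `map` entries (which repeat the `ko` pathway numbers) is a simplification chosen for the example.

```python
def parse_kegg_annotation(lines):
    """Return {gene_id: (list of K numbers, list of ko pathway numbers)}."""
    annotation = {}
    for line in lines:
        fields = line.split()
        # Skip the header row and anything that is not a three-column row.
        if len(fields) != 3 or fields[0] == "GeneID":
            continue
        gene_id, ko_field, pathway_field = fields
        if ko_field == "NA":
            kos = []
        else:
            # "ko:K02716" -> "K02716"
            kos = [k[3:] if k.startswith("ko:") else k for k in ko_field.split(",")]
        if pathway_field == "NA":
            pathways = []
        else:
            # Keep the "koNNNNN" form only; "mapNNNNN" repeats the same pathways.
            pathways = [p for p in pathway_field.split(",") if p.startswith("ko")]
        annotation[gene_id] = (kos, pathways)
    return annotation


# Rows copied from Table 1.
table_1 = [
    "GeneID KEGG_ko KEGG_Pathway",
    "sc_6/125/1253 NA NA",
    "sc_9/116/406 ko:K13094 NA",
    "sc_51/118/501 ko:K00706,ko:K11000 ko00500,ko04011,map00500,map04011",
]
gene_annotation = parse_kegg_annotation(table_1)
```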
In step 104, enrichment processing is performed on annotation information of each gene data item in the differential gene list to obtain an enriched plurality of gene pathways, a plurality of target gene pathways are selected from the enriched plurality of gene pathways, and each target gene pathway is used as a differential gene pathway to obtain a plurality of differential gene pathways.
Selecting a plurality of target gene pathways from the enriched gene pathways comprises: sorting the enriched gene pathways in descending order of confidence to generate a descending sequence, and selecting from that sequence a predetermined number of gene pathways with the highest confidence as the target gene pathways.
Pathway enrichment is then performed on the differential genes. Online tools for differential gene pathway enrichment tend to have low throughput and limit the number of genes. The present application therefore employs an analysis method that can run in a high-throughput, automated manner and is suitable for both KEGG and GO enrichment analysis. The analysis method places no limit on the number of input genes; after the differential gene list and the unigene annotation file are input, a scatter plot of the KEGG pathway enrichment is obtained, as shown in FIG. 3.
FIG. 3 is a scatter plot of KEGG pathway enrichment according to an embodiment of the invention, shown as a screenshot of a generated scatter plot. The abscissa in FIG. 3 is the gene ratio (GeneRatio), which represents the degree of enrichment: a larger GeneRatio indicates that more genes are enriched in the pathway and that the enrichment is more reliable. The ordinate gives the description of each enriched pathway. The size of each dot represents the number of genes. The color or gray level of each dot represents the pvalue of the enrichment.
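The selection of target pathways can be sketched as below. The sketch assumes that "highest confidence" corresponds to the smallest enrichment p-value, so that sorting by ascending p-value is the same as sorting by descending confidence; that reading, the function name `select_target_pathways`, and the example values are assumptions for illustration.

```python
def select_target_pathways(enriched, top_n=10):
    """Return the top_n pathways with the smallest p-value.

    Ascending p-value is treated here as descending confidence (assumption).
    """
    ranked = sorted(enriched, key=lambda row: row["pvalue"])
    return ranked[:top_n]


# Made-up enrichment results.
enriched = [
    {"pathway": "ko03010", "pvalue": 0.020},
    {"pathway": "ko00195", "pvalue": 0.001},
    {"pathway": "ko00500", "pvalue": 0.300},
]
targets = select_target_pathways(enriched, top_n=2)
```

The default of 10 matches the number of pathways plotted in the operation flow of this description; any other predetermined number can be passed as `top_n`.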
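To make the mapping from enrichment results to plot attributes concrete, the following Python sketch derives the four visual properties of each dot. It assumes that GeneRatio is the number of differential genes in a pathway divided by the total number of annotated differential genes, which is one common convention and is not stated explicitly in this description; the function name `dot_plot_points` and the example values are hypothetical.

```python
def dot_plot_points(pathways, annotated_gene_total):
    """Map enrichment rows to dot attributes: x, y, size and color value."""
    points = []
    for row in pathways:
        points.append(
            {
                "y": row["description"],                    # ordinate: pathway description
                "x": row["count"] / annotated_gene_total,   # abscissa: GeneRatio (assumed definition)
                "size": row["count"],                       # dot size: number of genes
                "color": row["pvalue"],                     # dot color: enrichment p-value
            }
        )
    return points


# Made-up enrichment rows.
rows = [
    {"description": "Photosynthesis", "count": 5, "pvalue": 0.001},
    {"description": "Ribosome", "count": 12, "pvalue": 0.020},
]
points = dot_plot_points(rows, annotated_gene_total=50)
```

The resulting list can be handed to any plotting library; the description itself uses the dotplot function for this purpose.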
The specific operation flow of the analysis method is as follows:
a) The differential gene list and the unigene annotation file are read in, and duplicate data items in both are removed.
b) KEGG annotation information is added according to the KEGG enrichment analysis selected in the parameters, that is, each row of the differential gene list is given the KEGG annotation information that corresponds to it in the unigene annotation file.
c) The KEGG annotation information in the differential gene table is enriched using an enrichment function. The algorithm uses the Benjamini-Hochberg (BH) procedure by default; the results are stored in a data frame and are also written to a KEGG result file.
d) The results are sorted by the pvalue column, and the 10 most significantly enriched pathways are drawn as a scatter plot. The gene ratio (GeneRatio), an index of gene enrichment, is calculated with a function in the ggplot tool. The dotplot function is used to set the x-axis, y-axis, dot fill, dot size, annotation color, font color and other properties of the plot, thereby drawing the scatter plot.
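The enrichment and correction in step c) can be illustrated with a small self-contained sketch. The description names only an "enrichment function" with BH as the default correction; the hypergeometric over-representation test used below is an assumption (it is a common choice for this kind of analysis), and the function names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k pathway genes among n
    differential genes, when K of the N background genes are in the pathway."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up counts: 2 of 4 differential genes fall in a pathway that holds
# 3 of the 10 annotated background genes.
p_example = hypergeom_pvalue(k=2, n=4, K=3, N=10)
adjusted_example = benjamini_hochberg([0.01, 0.04, 0.03])
```

Using exact integer binomials keeps the sketch dependency-free; for large gene sets a statistics library would be the more practical choice.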
The files for drawing the differential gene KEGG pathway enrichment map are then prepared. The method 100 extracts the differential genes and other useful information from the enrichment file kegg.result.xls obtained in the previous step and adds drawing factors, so as to compile the input files kegg_path.list and kegg_gene.list for KEGG pathway drawing, as shown in Table 2 and Table 3.
The kegg_path.list table consists of the column of gene KO numbers from the kegg.result.xls table and the column of their corresponding pathways. The kegg_gene.list table is compiled and output automatically by the script get_kegg_gene.py provided by the present application. The script first reads the differential gene column and the corresponding KO number column from the enrichment file kegg.result.xls, then looks up the fold change (logfoldchange) and the confidence value (qvalue) of each differential gene in the original differential gene table by differential gene ID, and finally filters the genes and assigns each a color. The criteria are: genes with logfoldchange > 1 and qvalue < 0.05 are marked red; genes with logfoldchange < -1 and qvalue < 0.05 are marked blue.
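The color rule stated above can be written down directly. The following Python sketch is an illustration of that rule only and is not the get_kegg_gene.py script of the present application; the function names, the dictionary layout and the gene IDs `g1` to `g4` are hypothetical.

```python
def gene_color(logfoldchange, qvalue):
    """Apply the stated thresholds: red for up, blue for down, else None."""
    if qvalue >= 0.05:
        return None
    if logfoldchange > 1:
        return "red"
    if logfoldchange < -1:
        return "blue"
    return None


def kegg_gene_rows(gene_kos, diff_table):
    """Build (KO number, color) rows for genes that pass the thresholds."""
    rows = []
    for gene_id, kos in gene_kos.items():
        stats = diff_table.get(gene_id)
        if stats is None:
            continue
        color = gene_color(stats["logfoldchange"], stats["qvalue"])
        if color is None:
            continue
        for ko in kos:
            row = (ko, color)
            if row not in rows:  # avoid repeating an identical row
                rows.append(row)
    return rows


# Made-up genes; two different genes share KO number K00083.
gene_kos = {"g1": ["K00001"], "g2": ["K00083"], "g3": ["K00083"], "g4": ["K00128"]}
diff_table = {
    "g1": {"logfoldchange": 2.0, "qvalue": 0.01},
    "g2": {"logfoldchange": -1.5, "qvalue": 0.02},
    "g3": {"logfoldchange": 3.1, "qvalue": 0.001},
    "g4": {"logfoldchange": 0.4, "qvalue": 0.2},
}
gene_rows = kegg_gene_rows(gene_kos, diff_table)
```

Because several genes can map to one KO number, the same KO number may appear with both colors, as K00083 does in Table 3.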
Table 2. kegg_path.list
Gene KO number   Pathway of the gene
K02092           ko00196
K02289           ko00196
Table 3. kegg_gene.list
Gene KO number   Gene color
K00001           red
K00059           blue
K00083           blue
K00083           red
K00128           red
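A two-column list of the kind shown in Table 2 could be produced as in the sketch below. The tab delimiter, the function name `path_list_lines` and the restriction to a set of target pathways are assumptions made for the example; the KO and pathway numbers are taken from Table 1.

```python
def path_list_lines(ko_to_pathways, target_pathways):
    """Return 'KO<TAB>pathway' lines for KO numbers in the target pathways."""
    lines = []
    for ko in sorted(ko_to_pathways):
        for pathway in ko_to_pathways[ko]:
            if pathway in target_pathways:
                lines.append(f"{ko}\t{pathway}")
    return lines


# KO and pathway numbers as they appear in Table 1.
ko_to_pathways = {
    "K02716": ["ko00195", "ko01100"],
    "K02950": ["ko03010"],
    "K02953": ["ko03010"],
}
path_lines = path_list_lines(ko_to_pathways, target_pathways={"ko00195", "ko03010"})
```

Joining `path_lines` with newline characters gives text that could be written to a file; the actual delimiter expected by the drawing software is not specified in this description.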
The method 100 further comprises generating a dot plot of gene pathways from the differential gene list and the annotated gene sequence file. In the dot plot, the abscissa represents the degree of enrichment, which reflects the number and reliability of the enriched genes, and the ordinate represents the description of each enriched gene pathway; the size of each dot indicates the number of genes, and the color of each dot indicates the reliability of the enrichment.
The method 100 further comprises performing enrichment processing on the annotation information of each gene data item in the differential gene list to generate an enrichment file; extracting gene identification numbers and gene pathway numbers from the pathway annotation in the enrichment file to generate a pathway list file; determining the differential genes according to the gene identification numbers in the pathway list file; determining the fold change and the confidence value of each differential gene in the differential gene list according to the identifier of the differential gene; and determining an identification mark for each differential gene according to its fold change and confidence value, to generate a marked differential gene list.
Determining the identification mark for a differential gene according to its fold change and confidence value comprises: assigning a first type of identifier to differential genes with a fold change greater than 1 and a confidence value less than 0.05; and assigning a second type of identifier to differential genes with a fold change less than -1 and a confidence value less than 0.05.
The method 100 further comprises performing enrichment processing on the annotation information of each gene data item in the differential gene list to generate an enrichment file; parsing the enrichment file to determine a gene pathway map, and determining the web page configuration file corresponding to the gene pathway map; establishing an association between the gene identification number of each gene in the gene pathway map and the identification number of the corresponding differential gene in the differential gene list; determining, according to the association, the coordinates in the web page configuration file of each differential gene in the gene pathway map; and marking each differential gene at its coordinate position according to its identification mark in the marked differential gene list, to generate a pathway enrichment map. In the pathway enrichment map, the coordinate position of each differential gene is either a pixel icon or a web page link.
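The association between KO numbers and map coordinates can be sketched as follows. The layout of the web page configuration file is not given in this description, so the sketch assumes a simplified, hypothetical line format, `rect (x1,y1) (x2,y2)<TAB>KO+KO...`; the function name `differential_gene_boxes` and all values are likewise illustrative.

```python
import re

# Hypothetical line format: "rect (x1,y1) (x2,y2)<TAB>K00001+K00002"
RECT_LINE = re.compile(r"rect \((\d+),(\d+)\) \((\d+),(\d+)\)\t(\S+)")


def differential_gene_boxes(config_lines, gene_colors):
    """Return (KO, (x1, y1, x2, y2), color) for boxes holding a differential gene."""
    boxes = []
    for line in config_lines:
        match = RECT_LINE.match(line)
        if not match:
            continue
        x1, y1, x2, y2 = (int(v) for v in match.groups()[:4])
        for ko in match.group(5).split("+"):
            if ko in gene_colors:
                boxes.append((ko, (x1, y1, x2, y2), gene_colors[ko]))
                break  # this sketch colors each box once, by its first matching KO
    return boxes


# Made-up configuration lines and colors.
config_lines = [
    "rect (100,50) (146,67)\tK00001+K13951",
    "rect (200,50) (246,67)\tK00128",
    "rect (300,50) (346,67)\tK99999",
]
gene_colors = {"K00001": "red", "K00128": "blue"}
boxes = differential_gene_boxes(config_lines, gene_colors)
```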
Because KEGG pathway enrichment results are not intuitive on their own, they are usually displayed on KEGG pathway maps, which with conventional methods can only be done manually. KEGG pathway enrichment maps generated in this way are low-throughput, and the resulting images contain no links for the genes in the pathway, which makes them inconvenient to inspect. The present application therefore feeds the two files obtained above, kegg_path.list and kegg_gene.list, to the software kegg_viewer, which outputs pathway enrichment maps in two formats, png and svg. FIG. 4 is a schematic information diagram of kegg_viewer according to an embodiment of the present invention; it is a screenshot showing the description and parameters of the kegg_viewer program.
FIG. 5 is a schematic diagram of a KEGG pathway enrichment map in png format according to an embodiment of the present invention, and FIG. 6 is a schematic diagram of a KEGG pathway enrichment map in svg format according to an embodiment of the present invention. Both figures are screenshots of the respective output.
The specific processing flow of the software kegg_viewer comprises the following steps:
a) First, the KEGG pathway maps listed in kegg_path.list are downloaded and loaded, and the corresponding web page configuration files are obtained.
b) The KO numbers of the genes on the pathway map are matched one-to-one with the KO numbers of the differential genes in the kegg_gene.list differential gene list.
c) The coordinates of these matched genes on the KEGG pathway map are then extracted from the web page configuration file.
d) Finally, the genes are colored according to the color assigned to each differential gene in the kegg_gene.list table and the coordinates found in the previous step, so as to obtain a KEGG pathway enrichment map with color markings. The png format is a pixel image; in the svg format, clicking a gene name in the pathway jumps directly to the corresponding gene annotation web page, which makes the results convenient to inspect. The software kegg_viewer can process the enrichment analysis results of all differential genes automatically in a single run.
FIG. 2 is a flow chart of a method of generating a KEGG pathway plot according to an embodiment of the invention.
Step 1, description of the input files: there are two input files. A list of differentially expressed genes is typically obtained after transcriptome sequencing. In addition, reference-free transcriptome analysis produces a file of assembled gene sequences, the unigene file, which is typically in fasta format. The present application uses these two files, the differentially expressed gene list file and the unigene file, as the input files of method 100.
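As an illustration of the unigene input, the following is a minimal sketch of reading a fasta file and removing repeated gene sequences. The function names and the example records are hypothetical and are not part of the application; the sketch only assumes the standard fasta layout of a ">" header line followed by sequence lines.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse fasta text into {gene_id: sequence}; the first record per ID wins."""
    records: Dict[str, str] = {}
    gene_id = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if gene_id is not None and gene_id not in records:
                records[gene_id] = "".join(chunks)
            parts = line[1:].split()
            gene_id = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if gene_id is not None and gene_id not in records:
        records[gene_id] = "".join(chunks)
    return records


def drop_duplicate_sequences(records: Dict[str, str]) -> Dict[str, str]:
    """Keep only the first gene ID seen for each distinct sequence."""
    seen = set()
    unique: Dict[str, str] = {}
    for gene_id, seq in records.items():
        if seq not in seen:
            seen.add(seq)
            unique[gene_id] = seq
    return unique


# Made-up example: unigene2 repeats the sequence of unigene1 across two lines.
example = """>unigene1 len=8
ACGTACGT
>unigene2
ACGT
ACGT
>unigene3
TTTT
""".splitlines()

unigenes = drop_duplicate_sequences(read_fasta(example))
print(unigenes)
```

In this sketch a duplicate is an exact sequence match; the application does not state how repeated sequences are detected, so a different criterion may apply in practice.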
Step 2, annotation of the gene sequences (unigenes): the unigenes are first annotated with KEGG pathways in preparation for the subsequent KEGG pathway enrichment analysis. In practice, many tools can perform KEGG pathway annotation; the software tool eggnog-mapper v2.0.1 is used here as an example. The input file is the unigene text file. The gene sequences are annotated to obtain an annotation file, as shown in Table 1. Whichever software is used, the annotation information obtained should include at least the KO number corresponding to each gene and the KO number of the pathway in which it resides.
Step 3, pathway enrichment of the differential genes: online tools for differential gene pathway enrichment tend to be low-throughput and limit the number of input genes. To this end, the present application adopts an analysis method that runs automatically and at high throughput, and that is suitable for both KEGG and GO enrichment analysis. The analysis method places no limit on the number of input genes; once the differential gene list and the unigene annotation file are input, an enrichment scatter plot of the KEGG pathways is obtained, as shown in FIG. 3.
FIG. 3 is a scatter plot of KEGG pathway enrichment according to an embodiment of the invention. The abscissa in FIG. 3 is the gene ratio (GeneRatio), which represents the degree of enrichment; a larger GeneRatio indicates that more genes are enriched in the pathway and that the enrichment is more reliable. The ordinate gives the description of each enriched pathway. The size of each point represents the number of genes. The color or gray level of each point represents the pvalue of the enrichment.
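The mapping from enrichment results to the four visual channels of the scatter plot can be sketched as follows. This is only an illustration: the definition of GeneRatio used here (number of differential genes in a pathway divided by the number of annotated differential genes) is a common convention and an assumption, since the text does not give the formula, and the field names and example values are made up.

```python
from typing import Dict, List, NamedTuple


class DotRow(NamedTuple):
    description: str   # ordinate: description of the enriched pathway
    gene_ratio: float  # abscissa: GeneRatio
    count: int         # point size: number of genes
    pvalue: float      # point colour: enrichment pvalue


def dotplot_rows(results: List[Dict], n_annotated: int) -> List[DotRow]:
    """Turn enrichment results into plot rows, largest GeneRatio first."""
    rows = [
        DotRow(r["description"], r["count"] / n_annotated, r["count"], r["pvalue"])
        for r in results
    ]
    return sorted(rows, key=lambda row: row.gene_ratio, reverse=True)


# Made-up enrichment results for two pathways.
results = [
    {"description": "Glycolysis / Gluconeogenesis", "count": 5, "pvalue": 0.03},
    {"description": "Citrate cycle (TCA cycle)", "count": 12, "pvalue": 0.001},
]
rows = dotplot_rows(results, n_annotated=40)
for row in rows:
    print(row)
```

Each row can then be handed to any plotting library, with one point per pathway.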
The specific operation flow of the analysis method is as follows:
a) The differential gene list and the unigene annotation file are read in, and repeated data items in both are removed.
b) KEGG annotation information is added, according to the KEGG enrichment analysis selected in the parameters; that is, each row of the differential gene list is given the KEGG annotation information that corresponds to it in the unigene annotation file.
c) The KEGG annotation information in the differential gene table is enriched using an enrichment function. The algorithm uses the Benjamini-Hochberg (BH) procedure by default, stores the results in a data frame, and at the same time saves them to the result file kegg.result.xls.
d) The results are sorted by the pvalue column, and a scatter plot is drawn for the top 10 significantly enriched pathways. The gene ratio (GeneRatio), the gene enrichment index, is calculated using a function in the tool ggplot. The dotplot function is used to set the x-axis, y-axis, point fill, point size, annotation color, font color, and other properties of the plot, and the scatter plot is then drawn.
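The flow above can be sketched in a few dozen lines. The text does not say which statistical test the enrichment function uses, so the hypergeometric over-representation test below is an assumption, chosen because it is the usual choice for pathway enrichment; only the BH correction and the sorting by pvalue come from the description. All names and the toy data are hypothetical.

```python
from math import comb
from typing import Dict, List, Set, Tuple


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K lie in the pathway."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def bh_adjust(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrich(diff_genes: List[str], annotation: Dict[str, Set[str]],
           top_n: int = 10) -> List[Tuple[str, int, float, float]]:
    """Return (pathway, gene count, pvalue, BH-adjusted value), sorted by pvalue."""
    background = {g for g, paths in annotation.items() if paths}
    diff = set(diff_genes) & background  # set() also drops repeated entries
    pathways = sorted({p for g in background for p in annotation[g]})
    rows = []
    for p in pathways:
        members = {g for g in background if p in annotation[g]}
        k = len(diff & members)
        if k:
            pv = hypergeom_pvalue(k, len(diff), len(members), len(background))
            rows.append((p, k, pv))
    qvalues = bh_adjust([r[2] for r in rows])
    table = [(p, k, pv, q) for (p, k, pv), q in zip(rows, qvalues)]
    table.sort(key=lambda r: r[2])
    return table[:top_n]


# Toy annotation: genes g1..g4 in ko00010, genes g5..g10 in ko00020.
annotation = {f"g{i}": {"ko00010"} for i in range(1, 5)}
annotation.update({f"g{i}": {"ko00020"} for i in range(5, 11)})
table = enrich(["g1", "g2", "g3", "g1", "g5"], annotation)
for row in table:
    print(row)
```

The background here is the set of annotated unigenes; a real analysis may choose a different background, which changes the p-values.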
Step 4, preparing the drawing files for the differential gene KEGG pathway enrichment map: the differential genes and the useful information in the enrichment file kegg.result.xls obtained in the previous step are extracted, and drawing factors are added, so that they are arranged into the input files for KEGG pathway drawing, kegg_path.list and kegg_gene.list, as shown in Tables 2 and 3.
The kegg_path.list table consists of the column of gene KO numbers from the kegg.result.xls table and the corresponding column of pathways. The kegg_gene.list table is collated and output automatically by the script get_kegg_gene.py provided by the present application. The principle is as follows: the differential genes and the corresponding KO number column of the enrichment file kegg.result.xls are input first; the fold change (logfoldchange) and the confidence value (qvalue) of each differential gene are then looked up in the original differential gene table by differential gene ID; and screening and gene color assignment are performed to obtain the table. The marking standard is: logfoldchange > 1 and qvalue < 0.05 is marked red; logfoldchange < -1 and qvalue < 0.05 is marked blue.
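The marking standard can be expressed as a short function. The sketch below is not the get_kegg_gene.py script itself, whose source is not given; the function names, the in-memory data layout, and the (KO number, color) output rows are assumptions made for illustration. Only the thresholds and colors come from the text.

```python
from typing import Dict, List, Optional, Tuple


def gene_color(logfoldchange: float, qvalue: float) -> Optional[str]:
    """Apply the marking standard: red for up, blue for down, None otherwise."""
    if qvalue < 0.05:
        if logfoldchange > 1:
            return "red"
        if logfoldchange < -1:
            return "blue"
    return None


def build_kegg_gene_rows(
    enriched: List[Tuple[str, str]],
    diff_table: Dict[str, Tuple[float, float]],
) -> List[Tuple[str, str]]:
    """Match each enriched gene to its logfoldchange and qvalue by gene ID.

    enriched   : (gene_id, KO number) pairs taken from the enrichment file
    diff_table : gene_id -> (logfoldchange, qvalue) from the original gene table
    Genes that fail the screening receive no color and are left out.
    """
    rows = []
    for gene_id, ko in enriched:
        if gene_id not in diff_table:
            continue
        lfc, q = diff_table[gene_id]
        color = gene_color(lfc, q)
        if color is not None:
            rows.append((ko, color))
    return rows


# Made-up example values.
enriched = [("unigene1", "K00844"), ("unigene2", "K01810"), ("unigene3", "K00850")]
diff_table = {
    "unigene1": (2.3, 0.001),   # up-regulated, significant
    "unigene2": (-1.8, 0.01),   # down-regulated, significant
    "unigene3": (3.0, 0.20),    # large change but not significant
}
gene_rows = build_kegg_gene_rows(enriched, diff_table)
print(gene_rows)
```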
In step 5, because KEGG pathway enrichment results are not intuitive, they are usually displayed on a KEGG pathway map, and the conventional approach can only do this manually. Generating KEGG pathway enrichment maps in this way is low-throughput, and the resulting images contain no links for the genes in the pathway, which makes them inconvenient to inspect. For this purpose, the present application takes the two obtained files kegg_path.list and kegg_gene.list as input, and the software kegg_viewer outputs pathway enrichment maps in two formats, png and svg. FIG. 4 is a schematic information diagram of kegg_viewer according to an embodiment of the present invention, showing the description and parameters of the kegg_viewer program.
FIG. 5 is a png-format schematic of a KEGG pathway enrichment map according to an embodiment of the present invention, and FIG. 6 is an svg-format schematic of a KEGG pathway enrichment map according to an embodiment of the present invention.
The specific processing flow of the software kegg_viewer comprises the following steps:
a) First, the KEGG pathway maps listed in kegg_path.list are downloaded and loaded, and the corresponding web page configuration files are obtained.
b) The KO numbers of the genes on the pathway map are matched one-to-one with the KO numbers of the differential genes in the kegg_gene.list differential gene list.
c) The coordinates of these matched genes on the KEGG pathway map are then extracted from the web page configuration file.
d) Finally, the genes are colored according to the color assigned to each differential gene in the kegg_gene.list table and the coordinates found in the previous step, so as to obtain a KEGG pathway enrichment map with color markings. The png format is a pixel image; in the svg format, clicking a gene name in the pathway jumps directly to the corresponding gene annotation web page, which makes the results convenient to inspect. The software kegg_viewer can process the enrichment analysis results of all differential genes automatically in a single run.
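Steps b) to d) can be sketched as follows. This is not the kegg_viewer program, whose source is not given. The sketch assumes that the web page configuration file is an HTML image map whose rectangular area elements carry pixel coordinates and an href containing KO numbers; the sample markup, the function names, and the link target https://www.kegg.jp/entry/ are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]


class AreaParser(HTMLParser):
    """Collect (KO number, rectangle) pairs from <area shape="rect"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.boxes: List[Tuple[str, Box]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "area":
            return
        a = dict(attrs)
        if a.get("shape") != "rect":
            return
        parts = (a.get("coords") or "").split(",")
        if len(parts) != 4:
            return
        x1, y1, x2, y2 = (int(p.strip()) for p in parts)
        # One area may list several KO numbers in its link.
        for ko in re.findall(r"K\d{5}", a.get("href") or ""):
            self.boxes.append((ko, (x1, y1, x2, y2)))


def svg_overlay(html: str, gene_colors: Dict[str, str]) -> List[str]:
    """Return one linked, colored SVG rectangle per matched differential gene."""
    parser = AreaParser()
    parser.feed(html)
    elements = []
    for ko, (x1, y1, x2, y2) in parser.boxes:
        color = gene_colors.get(ko)
        if color is None:
            continue  # gene is on the map but not in the differential list
        elements.append(
            f'<a href="https://www.kegg.jp/entry/{ko}">'
            f'<rect x="{x1}" y="{y1}" width="{x2 - x1}" height="{y2 - y1}" '
            f'fill="{color}" fill-opacity="0.5"/></a>'
        )
    return elements


# Made-up image map and color table.
sample_html = """
<map name="mapdata">
<area shape="rect" coords="100,50,146,67" href="/entry/K00844" />
<area shape="rect" coords="200,50,246,67" href="/entry/K01810+K06859" />
<area shape="circle" coords="10,10,4" href="/entry/C00031" />
</map>
"""
overlay = svg_overlay(sample_html, {"K00844": "red", "K06859": "blue"})
for element in overlay:
    print(element)
```

Placing these elements over the pathway image inside an svg document gives clickable colored boxes; rasterizing the same drawing would give the png variant, in which the links are lost.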
FIG. 7 is a schematic diagram of a system 700 for determining differential gene pathways based on reference-free transcriptome sequencing according to an embodiment of the present invention. The system 700 includes: analysis means 701, annotation means 702, adding means 703, processing means 704, deduplication means 705, generation means 706, determination means 707, and creation means 708.
The analysis device 701 performs transcriptome analysis on differential genes from a species sample for which a reference genome is not determined to acquire a differential gene list including a plurality of gene differences, and acquires a previously generated gene sequence file from which a repeated gene sequence is removed. The format of the gene sequence file is text-based and is used to represent a nucleotide sequence or an amino acid sequence.
Annotating device 702 parses the gene sequence file to determine a plurality of gene sequences, pathway annotates each gene in the plurality of gene sequences, and thereby generates an annotated gene sequence file. Annotating device 702 performs pathway annotation for each gene in the plurality of gene sequences comprising: annotating device 702 determines a gene identification number and a gene pathway number for each gene in the plurality of gene sequences to enable pathway annotation of each gene.
The adding means 703 adds annotation information for each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file.
The processing means 704 performs enrichment processing on the annotation information of each gene data item in the differential gene list to obtain an enriched plurality of gene pathways, selects a plurality of target gene pathways from the enriched plurality of gene pathways, and uses each target gene pathway as a differential gene pathway to obtain a plurality of differential gene pathways. Processing device 704 selects a plurality of gene pathways of interest from the enriched plurality of gene pathways comprising: the processing device 704 sorts each of the enriched plurality of gene pathways in descending order according to the degree of confidence to generate a descending order sequence from which a predetermined number of gene pathways with the greatest degree of confidence are selected as target gene pathways.
The deduplication device 705 removes duplicate gene data items among all the gene data items in the differential gene list. The deduplication device 705 removes duplicate gene sequences from all gene sequences in the annotated gene sequence file.
The generating means 706 generates a dot plot of gene pathways from the differential gene list and the annotated gene sequence file. In the dot plot, the abscissa represents the degree of enrichment, reflecting the number of genes and the reliability, and the ordinate represents the description of the enriched gene pathways; the number of genes is indicated by the size of the dots, and the reliability of the enrichment is represented by the color of the dots.
The determining means 707 performs enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file; extracts a gene identification number and a gene pathway number from the pathway annotation of the enrichment file to generate a pathway list file; determines the differential genes according to the gene identification numbers in the pathway list file; determines the fold change and the confidence level of each differential gene in the differential gene list according to the identifier of the differential gene; and determines an identification symbol for each differential gene according to its fold change and confidence level, so as to generate an identified differential gene list. Determining the identification symbol comprises: assigning a first type of identifier to a differential gene with a fold change greater than 1 and a confidence level less than 0.05; and assigning a second type of identifier to a differential gene with a fold change less than -1 and a confidence level less than 0.05.
The establishing means 708 performs enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file; parses the enrichment file to determine a gene pathway map, and determines the web page configuration file corresponding to the gene pathway map; establishes an association between the gene identification number of each gene on the gene pathway map and the identification number of the corresponding differential gene in the differential gene list; determines, according to the association, the coordinates in the web page configuration file of the differential genes on the gene pathway map; and marks each differential gene at its coordinate position according to its identification symbol in the identified differential gene list, so as to generate a pathway enrichment map. In the pathway enrichment map, the coordinate position of each differential gene is either a pixel icon or a web page link.
The invention has been described with reference to a few embodiments. However, as is well known to those skilled in the art, other embodiments than the above disclosed invention are equally possible within the scope of the invention, as defined by the appended patent claims.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise therein. All references to "a/an/the [ means, component, etc. ]" are to be interpreted openly as referring to at least one instance of said means, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

Claims (18)

1. A method of determining differential gene pathways based on reference-free transcriptome sequencing, the method comprising:
performing transcriptome analysis on differential genes from a species sample for which no reference genome has been determined, to obtain a differential gene list including a plurality of gene differences, and obtaining a pre-generated gene sequence file from which repeated gene sequences have been removed;
parsing the gene sequence file to determine a plurality of gene sequences, and performing pathway annotation on each gene in the plurality of gene sequences, thereby generating an annotated gene sequence file;
adding annotation information for each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file; and
performing enrichment processing on the annotation information of each gene data item in the differential gene list to obtain a plurality of enriched gene pathways, selecting a plurality of target gene pathways from the plurality of enriched gene pathways, and taking each target gene pathway as a differential gene pathway to obtain a plurality of differential gene pathways;
wherein selecting a plurality of target gene pathways from the plurality of enriched gene pathways comprises:
sorting the enriched gene pathways in descending order of confidence level to generate a descending sequence, and selecting from the descending sequence a predetermined number of gene pathways with the highest confidence level as the target gene pathways.
2. The method of claim 1, wherein the format of the gene sequence file is text-based and is used to represent a nucleotide sequence or an amino acid sequence.
3. The method of claim 1, wherein pathway annotation for each gene in the plurality of gene sequences comprises:
a gene identification number and a gene pathway number are determined for each gene in the plurality of gene sequences to enable pathway annotation of each gene.
4. The method of claim 1, further comprising removing duplicate gene data items from all gene data items in the differential gene list.
5. The method of claim 1, further comprising removing duplicate gene sequences from all gene sequences in the annotated gene sequence file.
6. The method of claim 1, further comprising:
generating a dot plot of gene pathways according to the differential gene list and the annotated gene sequence file;
wherein in the dot plot, the abscissa represents the degree of enrichment, reflecting the number of genes and the reliability, and the ordinate represents the description of the enriched gene pathways; and
in the dot plot, the number of genes is indicated by the size of the dots, and the reliability of the enrichment is represented by the color of the dots.
7. The method of claim 1, further comprising:
performing enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file;
extracting a gene identification number and a gene pathway number from the pathway annotation of the enrichment file to generate a pathway list file;
determining the differential genes according to the gene identification numbers in the pathway list file;
determining the fold change and the confidence level of each differential gene in the differential gene list according to the identifier of the differential gene; and
determining an identification symbol for each differential gene according to its fold change and confidence level, so as to generate an identified differential gene list.
8. The method of claim 7, wherein determining an identification symbol for a differential gene according to its fold change and confidence level comprises:
assigning a first type of identifier to a differential gene with a fold change greater than 1 and a confidence level less than 0.05; and
assigning a second type of identifier to a differential gene with a fold change less than -1 and a confidence level less than 0.05.
9. The method of claim 1 or 7, further comprising:
performing enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file;
parsing the enrichment file to determine a gene pathway map, and determining the web page configuration file corresponding to the gene pathway map;
establishing an association between the gene identification number of each gene on the gene pathway map and the identification number of the corresponding differential gene in the differential gene list;
determining, according to the association, the coordinates in the web page configuration file of the differential genes on the gene pathway map; and
marking each differential gene at its coordinate position according to its identification symbol in the identified differential gene list, so as to generate a pathway enrichment map;
wherein in the pathway enrichment map, the coordinate position of each differential gene is a pixel icon; alternatively, in the pathway enrichment map, the coordinate position of each differential gene is a web page link.
10. A system for determining differential gene pathways based on reference-free transcriptome sequencing, the system comprising:
an analysis device for performing transcriptome analysis on differential genes from a species sample for which no reference genome has been determined, to obtain a differential gene list including a plurality of gene differences, and for obtaining a pre-generated gene sequence file from which repeated gene sequences have been removed;
an annotating device for parsing the gene sequence file to determine a plurality of gene sequences, and performing pathway annotation on each gene in the plurality of gene sequences, thereby generating an annotated gene sequence file;
an adding device for adding annotation information for each gene data item in the differential gene list according to the pathway annotation of each gene in the annotated gene sequence file; and
a processing device for performing enrichment processing on the annotation information of each gene data item in the differential gene list to obtain a plurality of enriched gene pathways, selecting a plurality of target gene pathways from the plurality of enriched gene pathways, and taking each target gene pathway as a differential gene pathway to obtain a plurality of differential gene pathways;
wherein the processing device selecting a plurality of target gene pathways from the plurality of enriched gene pathways comprises:
the processing device sorting the enriched gene pathways in descending order of confidence level to generate a descending sequence, and selecting from the descending sequence a predetermined number of gene pathways with the highest confidence level as the target gene pathways.
11. The system of claim 10, wherein the format of the gene sequence file is text-based and is used to represent a nucleotide sequence or an amino acid sequence.
12. The system of claim 10, wherein annotating each gene in the plurality of gene sequences with an annotation device comprises:
the annotating means determines a gene identification number and a gene pathway number for each gene in the plurality of gene sequences to enable pathway annotation of each gene.
13. The system of claim 10, further comprising a deduplication device that removes duplicate gene data items from all gene data items in the differential gene list.
14. The system of claim 10, further comprising a deduplication device that removes duplicate gene sequences from all gene sequences in the annotated gene sequence file.
15. The system of claim 10, further comprising a generating device for generating a dot plot of gene pathways from the differential gene list and the annotated gene sequence file;
wherein in the dot plot, the abscissa represents the degree of enrichment, reflecting the number of genes and the reliability, and the ordinate represents the description of the enriched gene pathways; and
in the dot plot, the number of genes is indicated by the size of the dots, and the reliability of the enrichment is represented by the color of the dots.
16. The system of claim 10, further comprising a determining device for:
performing enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file;
extracting a gene identification number and a gene pathway number from the pathway annotation of the enrichment file to generate a pathway list file;
determining the differential genes according to the gene identification numbers in the pathway list file;
determining the fold change and the confidence level of each differential gene in the differential gene list according to the identifier of the differential gene; and
determining an identification symbol for each differential gene according to its fold change and confidence level, so as to generate an identified differential gene list.
17. The system of claim 16, wherein the determining device determining an identification symbol for a differential gene according to its fold change and confidence level comprises:
assigning a first type of identifier to a differential gene with a fold change greater than 1 and a confidence level less than 0.05; and
assigning a second type of identifier to a differential gene with a fold change less than -1 and a confidence level less than 0.05.
18. The system according to claim 10 or 16, further comprising an establishing device for:
performing enrichment processing on the annotation information of each gene data item in the differential gene list, thereby generating an enrichment file;
parsing the enrichment file to determine a gene pathway map, and determining the web page configuration file corresponding to the gene pathway map;
establishing an association between the gene identification number of each gene on the gene pathway map and the identification number of the corresponding differential gene in the differential gene list;
determining, according to the association, the coordinates in the web page configuration file of the differential genes on the gene pathway map; and
marking each differential gene at its coordinate position according to its identification symbol in the identified differential gene list, so as to generate a pathway enrichment map;
wherein in the pathway enrichment map, the coordinate position of each differential gene is a pixel icon; alternatively, in the pathway enrichment map, the coordinate position of each differential gene is a web page link.
CN202011069458.4A 2020-09-30 2020-09-30 Method and system for determining differential gene pathways based on reference-free transcriptome sequencing Active CN114333994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011069458.4A CN114333994B (en) 2020-09-30 2020-09-30 Method and system for determining differential gene pathways based on reference-free transcriptome sequencing

Publications (2)

Publication Number Publication Date
CN114333994A (en) 2022-04-12
CN114333994B (en) 2023-07-07

Family

ID=81032355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011069458.4A Active CN114333994B (en) 2020-09-30 2020-09-30 Method and system for determining differential gene pathways based on reference-free transcriptome sequencing

Country Status (1)

Country Link
CN (1) CN114333994B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant