CN112466394A - Proteomics IPR annotation and enrichment analysis method and device - Google Patents
- Publication number
- CN112466394A (CN202011386481.6A)
- Authority
- CN
- China
- Prior art keywords
- ipr
- analysis
- protein sequence
- annotation
- protein
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
Abstract
The invention relates to a method and a device for proteomics IPR annotation and enrichment analysis. The method comprises the following steps: obtaining high-throughput proteome data; determining a plurality of corresponding protein sequence FASTA files according to the high-throughput proteome data; carrying out IPR annotation analysis on the plurality of protein sequence FASTA files; carrying out statistical analysis on the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files; carrying out IPR enrichment analysis on the plurality of protein sequence FASTA files; and carrying out statistical analysis on the IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files. The method is based on the InterPro resource database and performs IPR analysis of protein function using the strengths of the member databases either individually or all together. The analyses are combined into a one-key operation: the user only needs to press the Enter key, and all related analyses are completed automatically and written to a specified folder, which saves time and labor and avoids the human errors that arise when the individual steps are run one by one.
Description
Technical Field
The invention relates to the technical field of data processing of high-throughput proteomics, in particular to a method and a device for proteomic IPR annotation and enrichment analysis.
Background
Large-scale proteomics analysis almost always involves predicting protein families, domains and important sites. InterPro is a powerful classification resource that reduces redundancy and simplifies protein sequence analysis; it allows all member databases to be queried simultaneously and draws on the particular strengths of each. However, existing InterPro data processing has several disadvantages. On the one hand, online analysis on the InterPro website handles only one protein sequence at a time, so using it for IPR annotation of high-throughput proteomic data is time-consuming, and the manual operation is error-prone. A locally installed InterPro can analyze many protein sequences in one run, but only one group of protein sequences (each group containing a plurality of protein sequences) per run; high-throughput proteomic data generally comprise several groups of protein sequences, so in this case too IPR annotation analysis needs simpler operation and higher efficiency. On the other hand, IPR annotation of high-throughput proteomic data is usually followed by IPR enrichment analysis and by statistical plotting of the annotation and enrichment results (so that the results are displayed visually), and carrying out these steps separately is error-prone and inefficient.
In summary, how to process high-throughput proteomics data efficiently and rapidly is an urgent problem to be solved.
Disclosure of Invention
In view of the above, there is a need to provide a method and a device for proteomics IPR annotation and enrichment analysis, so as to solve the problem in the prior art that efficient and fast high-throughput proteomics data processing cannot be performed.
The invention provides a proteomics IPR annotation and enrichment analysis method, which comprises the following steps:
obtaining high-throughput proteome data;
determining a plurality of corresponding protein sequence FASTA files according to the high-throughput proteome data;
performing IPR annotation analysis on the plurality of protein sequence FASTA files;
carrying out statistical analysis on IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files;
performing IPR enrichment analysis on the plurality of protein sequence FASTA files according to the IPR annotation analysis result;
and carrying out statistical analysis on IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files.
Further, the protein sequence FASTA files include a qualitative protein sequence FASTA file and a quantitative protein sequence FASTA file, and determining the plurality of corresponding protein sequence FASTA files according to the high-throughput proteome data includes:
acquiring the qualitative protein sequence FASTA file for the qualitative analysis data in the high-throughput proteome data;
and acquiring the quantitative protein sequence FASTA file for the quantitative analysis data in the high-throughput proteome data, wherein the quantitative protein sequence FASTA file comprises the total identified protein sequences, the differentially expressed protein sequences of each comparison group, and a list of the log2FC values of the differentially expressed proteins.
Further, the performing IPR annotation analysis on the plurality of protein sequence FASTA files comprises:
setting a parameter configuration of InterProScan, wherein the parameter configuration comprises the InterPro database and the IPR analysis output format;
calling InterProScan to perform IPR analysis of the corresponding qualitative protein sequences for the qualitative protein sequence FASTA file;
and calling InterProScan to perform IPR analysis of the corresponding quantitative protein sequences for the quantitative protein sequence FASTA file.
Further, the performing statistical analysis on IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files includes:
determining an IPR-to-protein list according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files, wherein the list is suitable for viewing the annotated IPR entries and the number and names of the proteins matched to each entry;
determining a protein-to-IPR list according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files, wherein the list is suitable for viewing the IPR entries annotated for each protein;
and sorting the annotated IPR entries by the number of matched proteins from large to small according to the IPR-to-protein list, and generating a corresponding histogram according to the sorting result.
Further, the performing IPR enrichment analysis on the plurality of protein sequence FASTA files comprises:
for the quantitative analysis data in the high-throughput proteome data, determining the IPR entries significantly enriched in the differentially expressed protein sequences of each comparison group, with the IPR ranking result of the total identified protein sequences as background.
Further, the determining, with the IPR ranking result of the total identified protein sequences as background, of the IPR entries significantly enriched in the differentially expressed protein sequences of each comparison group comprises:
determining, according to the IPR ranking result of the total identified protein sequences, the number of proteins with IPR annotation information among the total identified protein sequences and the number of differentially expressed proteins among the total identified protein sequences;
determining, for each IPR entry, the corresponding number of total identified proteins and the corresponding number of differentially expressed proteins;
and performing the enrichment analysis calculation with the statistical significance test P-value formula from four quantities: the number of proteins with IPR annotation information among the total identified protein sequences, the number of differentially expressed proteins among the total identified protein sequences, and, for each IPR entry, the corresponding number of total identified proteins and the corresponding number of differentially expressed proteins.
Further, the statistical significance test P-value formula is expressed as: [formula image not reproduced in this text]
wherein N is the number of proteins with IPR annotation information among the total identified protein sequences, n is the number of differentially expressed proteins among the total identified protein sequences, M is the number of total identified proteins corresponding to each IPR entry, m is the number of differentially expressed proteins corresponding to each IPR entry, P is the P-value of the enrichment analysis, and i is the iteration index.
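The symbol definitions above match the standard hypergeometric (one-sided Fisher) enrichment test. Assuming that this is the intended formula, and assuming that lowercase n and m denote the differentially expressed counts while uppercase N and M denote the background counts, it would read:

```latex
P = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\,\binom{N-M}{n-i}}{\binom{N}{n}}
```

Under this reading, P is the probability of finding at least m differentially expressed proteins annotated with a given IPR entry when n proteins are drawn at random from the N annotated background proteins, M of which carry that entry.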
Further, the performing a statistical analysis on the IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files comprises:
and determining an IPR enrichment analysis result list for the differentially expressed protein sequences of each comparison group according to the P-value of the enrichment analysis, wherein the list is suitable for viewing the enriched IPR entries, the enrichment P-value of each enriched IPR entry, and the number of up-regulated and down-regulated proteins matched to it.
The invention also provides a proteomics IPR annotation and enrichment analysis device, which comprises:
an acquisition unit for acquiring high-throughput proteome data;
a processing unit for determining a plurality of corresponding protein sequence FASTA files according to the high-throughput proteome data;
an analysis unit for carrying out IPR annotation analysis on the plurality of protein sequence FASTA files; further for carrying out statistical analysis on the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files; further for carrying out IPR enrichment analysis on the plurality of protein sequence FASTA files according to the IPR annotation analysis result; and further for carrying out statistical analysis on the IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method of proteomic IPR annotation and enrichment analysis as described above.
Compared with the prior art, the invention has the following beneficial effects. First, a plurality of corresponding protein sequence FASTA files are determined from the high-throughput proteome data, so that the input files for analysis against the InterPro resource database are generated and data processing is fast. IPR annotation analysis is then carried out and its results are statistically analyzed, so that the IPR analysis of protein function uses the strengths of the member databases either individually or all together. Next, IPR enrichment analysis is carried out on the plurality of protein sequence FASTA files according to the IPR annotation analysis result, and the IPR enrichment analysis results are statistically analyzed; because the enrichment analysis uses the annotation result directly, the steps need not be run as separate workflows, which simplifies processing and improves its accuracy. In summary, the method is based on the InterPro resource database and performs IPR analysis of protein function using the strengths of the member databases either individually or all together; the analyses are combined into a one-key operation in which the user only presses the Enter key, all related analyses are completed automatically and written to a designated folder, time and labor are saved, and the human errors that arise when the individual steps are run one by one are avoided.
Drawings
FIG. 1 is a schematic flow diagram of a method for proteomic IPR annotation and enrichment analysis provided by the present invention;
FIG. 2 is a schematic flow chart of obtaining a FASTA file according to the present invention;
FIG. 3 is a schematic flow diagram of an IPR analysis provided by the present invention;
FIG. 4 is a schematic flow chart of the statistical analysis of the IPR annotation analysis result provided by the present invention;
FIG. 5 is a schematic flow diagram of an enrichment assay provided by the present invention;
FIG. 6 is a schematic structural diagram of the proteomic IPR annotation and enrichment analysis device provided by the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
Example 1
This embodiment of the present invention provides a method for proteomic IPR annotation and enrichment analysis. Referring to FIG. 1, which is a schematic flow chart of the method, the method includes steps S1 to S6, wherein:
in step S1, high-throughput proteome data is acquired;
in step S2, determining a plurality of corresponding protein sequence FASTA files according to the high-throughput proteome data;
in step S3, performing IPR annotation analysis on a plurality of protein sequence FASTA files;
in step S4, performing statistical analysis on IPR annotation analysis results corresponding to a plurality of protein sequence FASTA files;
in step S5, performing IPR enrichment analysis on the plurality of protein sequence FASTA files according to the IPR annotation analysis result;
in step S6, the IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files are statistically analyzed.
In this embodiment of the invention, a plurality of corresponding protein sequence FASTA files are first determined from the high-throughput proteome data, so that the input files for analysis against the InterPro resource database are generated and data processing is fast. IPR annotation analysis is then carried out and its results are statistically analyzed, so that the IPR analysis of protein function uses the strengths of the member databases either individually or all together. Next, IPR enrichment analysis is carried out on the plurality of protein sequence FASTA files directly from the IPR annotation analysis result, and the IPR enrichment analysis results are statistically analyzed; because the enrichment analysis uses the annotation result directly, the steps need not be run as separate workflows, which simplifies processing and improves its accuracy.
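The chaining of steps S2 to S6 can be sketched as a small driver. This is only an illustration under stated assumptions: `run_pipeline`, its parameters and the callables passed to it are hypothetical names, not part of the patent, and the actual annotation and enrichment logic is supplied by the caller.

```python
from typing import Callable, Dict, Optional


def run_pipeline(
    fasta_files: Dict[str, str],            # label -> FASTA path (output of S2)
    annotate: Callable[[str], dict],        # S3: FASTA path -> annotation result
    summarize: Callable[[dict], dict],      # S4: annotation result -> statistics
    enrich: Callable[[dict, dict], dict],   # S5/S6: (group, background) -> table
    background_label: Optional[str] = None, # label of the total identified proteins
) -> dict:
    """Run annotation for every FASTA file, then enrichment per comparison group."""
    # S3: one annotation run per FASTA file.
    annotations = {label: annotate(path) for label, path in fasta_files.items()}
    # S4: one statistical summary per annotation result.
    summaries = {label: summarize(result) for label, result in annotations.items()}

    # S5/S6: only quantitative data have a background; qualitative data stop at S4.
    enrichments = {}
    if background_label is not None:
        background = annotations[background_label]
        enrichments = {
            label: enrich(result, background)
            for label, result in annotations.items()
            if label != background_label
        }
    return {
        "annotations": annotations,
        "summaries": summaries,
        "enrichments": enrichments,
    }
```

Passing the steps as callables keeps the driver independent of how each step is implemented, which is one way to obtain the single-command behaviour the text describes.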
It should be noted that IPR is an abbreviation for InterPro. InterPro is an integrated bioinformatics database that combines the predictions of protein function from many member databases, thereby outlining the family to which a protein belongs and the domains and important sites it contains. It is also a bioinformatic diagnostic tool: it incorporates the protein signatures of the member databases into a single searchable resource and uses their individual strengths to provide functional analysis of proteins. InterProScan is a software package that runs the scanning algorithms of the InterPro member databases in an integrated manner, making it convenient for a user to analyze the functional characteristics of nucleotide or protein sequences: sequences are submitted in FASTA format and matched against the predictive models (signatures) of the InterPro database, and the output is a prediction of the family membership, domains and important sites of each protein. IPR annotation means matching a target sequence against the InterPro database to obtain these predictions for the protein corresponding to the target sequence. A protein sequence FASTA file uses FASTA, a text-based format for representing nucleic acid or polypeptide sequences. Enrichment analysis refers to classifying genes according to prior knowledge, namely genome annotation information. log2FC is the base-2 logarithm of the fold change, that is, of the ratio of the expression levels of a protein in the two samples of a comparison group.
Preferably, referring to FIG. 2, which is a schematic flow chart of acquiring the FASTA files, step S2 includes steps S21 to S22, wherein:
in step S21, acquiring a qualitative protein sequence FASTA file for the qualitative analysis data in the high-throughput proteome data;
in step S22, acquiring a quantitative protein sequence FASTA file for the quantitative analysis data in the high-throughput proteome data, wherein the quantitative protein sequence FASTA file includes the total identified protein sequences, the differentially expressed protein sequences of each comparison group, and a list of the log2FC values of the differentially expressed proteins.
Therefore, qualitative analysis data and quantitative analysis data are each processed in a targeted manner, so that the subsequent IPR annotation and enrichment analysis can be carried out effectively.
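A minimal sketch of preparing the quantitative inputs of step S22 follows. The file names, the `write_fasta` and `build_quantitative_inputs` helpers and the in-memory input layout are assumptions made for illustration; the patent does not specify them.

```python
from pathlib import Path
from typing import Dict, Iterable, Tuple, Union


def write_fasta(records: Iterable[Tuple[str, str]], path: Path, width: int = 60) -> None:
    """Write (protein_id, sequence) pairs as FASTA, wrapping sequence lines."""
    with open(path, "w") as handle:
        for protein_id, sequence in records:
            handle.write(f">{protein_id}\n")
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")


def build_quantitative_inputs(
    sequences: Dict[str, str],                 # all identified proteins: id -> sequence
    log2fc_by_group: Dict[str, Dict[str, float]],  # comparison group -> {id: log2FC}
    out_dir: Union[str, Path],
) -> Dict[str, Path]:
    """Write the total FASTA plus one FASTA and one log2FC list per comparison group."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = {"total": out_dir / "total_identified.fasta"}
    write_fasta(sequences.items(), paths["total"])

    for group, log2fc in log2fc_by_group.items():
        fasta_path = out_dir / f"{group}_diff.fasta"
        write_fasta(((pid, sequences[pid]) for pid in log2fc), fasta_path)
        # Tab-separated list of protein id and log2FC for this comparison group.
        with open(out_dir / f"{group}_log2fc.tsv", "w") as handle:
            for pid, value in log2fc.items():
                handle.write(f"{pid}\t{value}\n")
        paths[group] = fasta_path
    return paths
```

Each returned path is one FASTA file, so the dictionary can be handed directly to whatever performs the per-file IPR annotation.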
Preferably, referring to FIG. 3, which is a schematic flow chart of the IPR analysis, step S3 includes steps S31 to S33, wherein:
in step S31, setting a parameter configuration of InterProScan, where the parameter configuration includes the InterPro database and the IPR analysis output format;
in step S32, calling InterProScan to perform IPR analysis of the corresponding qualitative protein sequences for the qualitative protein sequence FASTA file;
in step S33, calling InterProScan to perform IPR analysis of the corresponding quantitative protein sequences for the quantitative protein sequence FASTA file.
Therefore, qualitative analysis data and quantitative analysis data are each processed in a targeted manner under the configured parameters. InterProScan is called to perform IPR analysis of the qualitative protein sequences once per FASTA file, so that there are as many runs as there are FASTA files; InterProScan is likewise called once per FASTA file for the quantitative protein sequences. The results are used for the subsequent IPR annotation statistics and enrichment analysis.
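The per-file InterProScan calls of steps S31 to S33 might be scripted as below. This is a sketch, not the patent's implementation: the helper names, the default member database (`Pfam`) and the output layout are assumptions, and the command-line options (`-i`, `-f`, `-appl`, `-iprlookup`, `-o`) should be checked against the installed InterProScan version.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List, Sequence, Union


def interproscan_command(
    fasta: Union[str, Path],
    out_dir: Union[str, Path],
    applications: Sequence[str] = ("Pfam",),   # member databases to run (assumed default)
    output_format: str = "tsv",                # IPR analysis output format
    executable: str = "interproscan.sh",       # assumed to be on PATH
) -> List[str]:
    """Build the argument list for one InterProScan run on one FASTA file."""
    fasta = Path(fasta)
    out_file = Path(out_dir) / f"{fasta.stem}.{output_format}"
    return [
        executable,
        "-i", str(fasta),
        "-f", output_format,
        "-appl", ",".join(applications),
        "-iprlookup",          # include InterPro (IPR) accessions in the output
        "-o", str(out_file),
    ]


def annotate_all(fasta_files: Iterable[Union[str, Path]], out_dir: Union[str, Path], **config) -> None:
    """Run InterProScan once per FASTA file with a shared configuration."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for fasta in fasta_files:
        subprocess.run(interproscan_command(fasta, out_dir, **config), check=True)
```

Building the command separately from running it makes the configuration easy to inspect before a long batch is started.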
Preferably, referring to FIG. 4, which is a schematic flow chart of the statistical analysis of the IPR annotation analysis results, step S4 includes steps S41 to S43, wherein:
in step S41, determining an IPR-to-protein list, suitable for viewing the annotated IPR entries and the number and names of the proteins matched to each entry, according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files;
in step S42, determining a protein-to-IPR list, suitable for viewing the IPR entries annotated for each protein, according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files;
in step S43, sorting the annotated IPR entries by the number of matched proteins from large to small according to the IPR-to-protein list, and generating a corresponding histogram according to the sorting result.
Therefore, an IPR-to-protein list is obtained for each FASTA file, so that a researcher can conveniently view the annotated IPR entries together with the number and names of the proteins matched to each. A protein-to-IPR list is likewise obtained for each FASTA file, so that the researcher can view the IPR entries annotated for each protein. The IPR-to-protein list is also statistically analyzed and plotted, so that important IPR entries can be found intuitively: the annotated IPR entries are sorted by the number of matched proteins from large to small, and the top 20 entries are drawn as a histogram (the higher an entry ranks, the more important that IPR is for the data set and the more direction it can give to further research). In addition, the IPR result list for differentially expressed proteins shows the names and numbers of the matched up-regulated and down-regulated proteins, and up-regulated and down-regulated proteins are distinguished in the corresponding statistical plot.
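The two lists of steps S41 and S42 and the ranking of step S43 can be sketched as follows, assuming the annotation output is InterProScan's tab-separated format, in which column 1 holds the protein accession and columns 12 and 13 hold the InterPro accession and description. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, TextIO, Tuple


def read_ipr_hits(handle: TextIO) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]], Dict[str, str]]:
    """Build IPR-to-protein and protein-to-IPR maps from InterProScan TSV rows."""
    ipr_to_proteins: Dict[str, Set[str]] = defaultdict(set)
    protein_to_iprs: Dict[str, Set[str]] = defaultdict(set)
    ipr_names: Dict[str, str] = {}
    for line in handle:
        row = line.rstrip("\n").split("\t")
        # Skip signature matches that are not integrated into an InterPro entry.
        if len(row) < 13 or not row[11].startswith("IPR"):
            continue
        protein, ipr, name = row[0], row[11], row[12]
        ipr_to_proteins[ipr].add(protein)
        protein_to_iprs[protein].add(ipr)
        ipr_names[ipr] = name
    return ipr_to_proteins, protein_to_iprs, ipr_names


def top_iprs(ipr_to_proteins: Dict[str, Set[str]], limit: int = 20) -> List[Tuple[str, int]]:
    """Rank IPR entries by matched protein count, largest first (ties by accession)."""
    ranked = sorted(ipr_to_proteins.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(ipr, len(proteins)) for ipr, proteins in ranked[:limit]]
```

The ranked list returned by `top_iprs` is what would be passed to a plotting library to draw the top-20 chart.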
Preferably, step S5 specifically includes:
for the quantitative analysis data in the high-throughput proteome data, determining the IPR entries significantly enriched in the differentially expressed protein sequences of each comparison group, with the IPR ranking result of the total identified protein sequences as background.
Therefore, qualitative proteomic data generally require no enrichment analysis, and for such data the workflow ends after step S4. For quantitative proteomic data the workflow continues: on the basis of the output of step S4, with the IPR result of the total identified proteins as background, the IPR entries significantly enriched in the differentially expressed proteins of each comparison group are found. One enrichment analysis is run for the protein sequence FASTA file of each comparison group, so that there are as many enrichment analyses as there are comparison groups.
Preferably, referring to FIG. 5, which is a schematic flow chart of the enrichment analysis, step S5 includes steps S51 to S53, wherein:
in step S51, determining, according to the IPR ranking result of the total identified protein sequences, the number of proteins with IPR annotation information among the total identified protein sequences and the number of differentially expressed proteins among the total identified protein sequences;
in step S52, determining, for each IPR entry, the corresponding number of total identified proteins and the corresponding number of differentially expressed proteins;
in step S53, performing the enrichment analysis calculation with the statistical significance test P-value formula from the number of proteins with IPR annotation information among the total identified protein sequences, the number of differentially expressed proteins among the total identified protein sequences, and, for each IPR entry, the corresponding number of total identified proteins and the corresponding number of differentially expressed proteins.
Therefore, the enrichment analysis is calculated from the IPR annotation analysis results using the statistical significance test P-value formula. With the total identified protein sequences as the background, the number of proteins with IPR annotation information and the number of differentially expressed proteins are determined; then, for each IPR entry, the number of total identified proteins and the number of differentially expressed proteins matched to that entry are determined (both from the IPR annotation analysis). The enrichment analysis is performed on these four counts so as to characterize the protein composition.
Preferably, the statistical significance test P-value formula is expressed as: [formula image not reproduced in this text]
wherein N is the number of proteins with IPR annotation information among the total identified protein sequences, n is the number of differentially expressed proteins among the total identified protein sequences, M is the number of total identified proteins corresponding to each IPR entry, m is the number of differentially expressed proteins corresponding to each IPR entry, P is the P-value of the enrichment analysis, and i is the iteration index.
Therefore, the enrichment analysis is calculated with the statistical significance test P-value formula, which ensures the accuracy of the enrichment analysis.
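The four counts defined above are the inputs of a hypergeometric test, which is the usual choice for this kind of enrichment analysis. The sketch below assumes that test is what the formula expresses; the function name and the use of lowercase `n` and `m` for the differentially expressed counts are assumptions. It sums the upper tail directly, which equals one minus the lower-tail sum.

```python
from math import comb


def enrichment_p_value(N: int, n: int, M: int, m: int) -> float:
    """Return P(X >= m) for a hypergeometric draw.

    N: background proteins with IPR annotation
    n: differentially expressed proteins among them
    M: background proteins annotated with this IPR entry
    m: differentially expressed proteins annotated with this IPR entry
    """
    if not (0 <= M <= N and 0 <= n <= N and 0 <= m <= min(M, n)):
        raise ValueError("counts are inconsistent")
    # Exact integer arithmetic; math.comb returns 0 when k > n, so
    # impossible terms drop out of the sum automatically.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, min(M, n) + 1))
    return tail / comb(N, n)
```

For example, with N = 10, n = 3, M = 4 and m = 2 the tail is (6·6 + 4·1) / 120 = 1/3, so an entry with those counts would not be considered enriched.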
Preferably, step S6 specifically includes:
determining an IPR enrichment analysis result list for the differentially expressed protein sequences of each comparison group according to the P-value of the enrichment analysis, wherein the list is suitable for viewing the enriched IPR entries, the enrichment P-value of each enriched IPR entry, and the number of up-regulated and down-regulated proteins matched to it.
Therefore, an IPR enrichment analysis result list is determined for the differentially expressed protein sequences of each comparison group according to the enrichment P-value, which reflects the protein composition, so that a researcher can conveniently view the corresponding statistical results.
Specifically, for the differentially expressed protein sequences of each comparison group, the list is built from the enrichment P-values, the enriched IPR entries and the number of up-regulated and down-regulated proteins matched to each entry, so that a researcher can quickly check the statistical results and understand the outcome of the enrichment analysis.
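A result list of this kind could be assembled as sketched below. The function name, the dictionary layout, the use of the sign of log2FC to separate up-regulated from down-regulated proteins, and the 0.05 significance cutoff are all assumptions for illustration; the text does not state a threshold. The P-value is again computed as a hypergeometric upper tail, restricted to proteins that carry IPR annotation.

```python
from math import comb
from typing import Dict, List, Set


def enrichment_table(
    background: Dict[str, Set[str]],   # IPR entry -> annotated background proteins
    log2fc: Dict[str, float],          # differentially expressed protein -> log2FC
    alpha: float = 0.05,               # assumed significance cutoff
) -> List[dict]:
    """List enriched IPR entries with P-value and up/down protein counts."""
    annotated = set().union(*background.values())
    diff = annotated & set(log2fc)     # differential proteins that have IPR annotation
    N, n = len(annotated), len(diff)

    rows = []
    for ipr, proteins in background.items():
        hits = proteins & diff
        M, m = len(proteins), len(hits)
        if m == 0:
            continue
        tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, min(M, n) + 1))
        rows.append({
            "ipr": ipr,
            "p_value": tail / comb(N, n),
            "up": sum(1 for pid in hits if log2fc[pid] > 0),
            "down": sum(1 for pid in hits if log2fc[pid] < 0),
        })
    # Keep significant entries only, most significant first.
    return sorted((r for r in rows if r["p_value"] <= alpha), key=lambda r: r["p_value"])
```

Calling the function once per comparison group, with that group's log2FC list and the shared background, yields one result list per group.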
Example 2
This embodiment of the present invention provides a proteomics IPR annotation and enrichment analysis device. Referring to FIG. 6, which is a schematic structural diagram of the device, the proteomics IPR annotation and enrichment analysis device 600 includes:
an acquisition unit 601 configured to acquire high-throughput proteome data;
a processing unit 602, configured to determine a corresponding protein sequence FASTA file according to the high-throughput proteome data;
an analysis unit 603, configured to perform IPR annotation analysis on a plurality of protein sequence FASTA files; the method is also used for carrying out statistical analysis on IPR annotation analysis results corresponding to a plurality of protein sequence FASTA files; also used for IPR enrichment analysis of multiple protein sequence FASTA files; the method is also used for carrying out statistical analysis on IPR enrichment analysis results corresponding to a plurality of protein sequence FASTA files.
Example 3
The embodiment of the invention provides a proteomics IPR annotation and enrichment analysis device which comprises a processor and a memory, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the proteomics IPR annotation and enrichment analysis method is realized.
Example 4
Embodiments of the present invention provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for proteomic IPR annotation and enrichment analysis as described above.
The invention discloses a method and a device for proteomics IPR annotation and enrichment analysis, wherein in the method, firstly, a corresponding protein sequence FASTA file is determined according to high-throughput proteomic data, so that a corresponding FASTA file is generated based on an Interpro resource database, and the rapidity of data processing is ensured; further carrying out IPR annotation analysis, and carrying out statistical analysis on the result of the IPR annotation analysis so as to independently or completely utilize the unique advantages of each database to carry out the IPR analysis of the protein function; and then carrying out IPR enrichment analysis on the plurality of protein sequence FASTA files according to the IPR annotation analysis result, carrying out statistical analysis on the IPR enrichment analysis result, and directly carrying out further enrichment analysis by using the IPR enrichment analysis result, thereby avoiding a single flow, simplifying the processing process and improving the accuracy of data processing.
According to this technical scheme, IPR analysis of protein function is performed on the basis of the InterPro resource database, using the unique advantages of each database individually or all together. A plurality of intelligent analyses are then combined into a one-key operation: the user only needs to press the 'Enter' key, and all related analyses are completed automatically and output to the designated folder. The one-key operation saves time and labor and avoids the human errors that arise when single processes are analyzed one by one.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.
Claims (10)
1. A method for proteomic IPR annotation and enrichment analysis, comprising:
obtaining high-throughput proteome data;
determining a corresponding plurality of protein sequence FASTA files according to the high-throughput proteome data;
performing IPR annotation analysis on the plurality of protein sequence FASTA files;
carrying out statistical analysis on IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files;
performing IPR enrichment analysis on the plurality of protein sequence FASTA files according to the IPR annotation analysis result;
and carrying out statistical analysis on IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files.
2. The method for proteomic IPR annotation and enrichment analysis of claim 1, wherein the protein sequence FASTA files comprise a qualitative protein sequence FASTA file and a quantitative protein sequence FASTA file, and wherein determining the corresponding plurality of protein sequence FASTA files from the high-throughput proteomic data comprises:
acquiring the qualitative protein sequence FASTA file for qualitative analysis data in the high-throughput proteome data;
and acquiring the quantitative protein sequence FASTA file for quantitative analysis data in the high-throughput proteome data, wherein the quantitative protein sequence FASTA file comprises the total identified protein sequences, the differentially expressed protein sequences of each comparison group, and a list of log2FC values of the differentially expressed proteins.
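The split between total identified proteins and per-comparison-group differentially expressed proteins can be illustrated with a minimal sketch. The function names `read_fasta` and `write_de_subset`, and the idea of keying the log2FC list by protein accession, are assumptions made for illustration; the patent does not specify how the FASTA files are produced.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (accession, sequence) pairs; the accession is the first token after '>'."""
    acc, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if acc is not None:
                yield acc, "".join(chunks)
            parts = line[1:].split()
            acc, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if acc is not None:
        yield acc, "".join(chunks)


def write_de_subset(total_fasta: TextIO, log2fc: Dict[str, float], out: TextIO) -> int:
    """Copy only the proteins listed in `log2fc` (hypothetical: accession -> log2FC)
    from the total identified FASTA to `out`; return how many were written."""
    written = 0
    for acc, seq in read_fasta(total_fasta):
        if acc in log2fc:
            out.write(f">{acc}\n{seq}\n")
            written += 1
    return written


if __name__ == "__main__":
    total = io.StringIO(">P1 example\nMKT\nAAA\n>P2\nGGG\n")
    subset = io.StringIO()
    print(write_de_subset(total, {"P1": 1.5}, subset), subset.getvalue())
```

Running the same function once per comparison group would give one differentially expressed FASTA file per group, alongside the unchanged total identified FASTA file.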
3. The method for proteomic IPR annotation and enrichment analysis of claim 2, wherein the IPR annotation analysis of the plurality of protein sequence FASTA files comprises:
setting the parameter configuration of InterProScan, wherein the parameter configuration comprises an InterPro database and an IPR analysis output format;
calling InterProScan on the qualitative protein sequence FASTA file to perform IPR analysis of the corresponding qualitative protein sequences;
and calling InterProScan on the quantitative protein sequence FASTA file to perform IPR analysis of the corresponding quantitative protein sequences.
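As a rough illustration of the parameter configuration and the two calls, the sketch below builds an InterProScan command line for each FASTA file. It assumes a local InterProScan 5 installation whose launcher is `interproscan.sh` and uses its `-i`, `-f`, `-d`, `-appl` and `-iprlookup` options; the helper name, the file names and the chosen databases are hypothetical and not taken from the patent.

```python
from typing import List, Sequence


def build_interproscan_cmd(
    fasta_path: str,
    out_dir: str,
    applications: Sequence[str] = ("Pfam", "SMART"),  # example databases, an assumption
    fmt: str = "tsv",
    executable: str = "interproscan.sh",  # assumed launcher name on the local system
) -> List[str]:
    """Assemble the argument list for one InterProScan run."""
    cmd = [executable, "-i", fasta_path, "-f", fmt, "-d", out_dir, "-iprlookup"]
    if applications:
        # Restrict the run to the selected databases; omit to run all of them.
        cmd += ["-appl", ",".join(applications)]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names for the qualitative and quantitative inputs.
    for fasta in ("qualitative.fasta", "quantitative_total.fasta"):
        print(" ".join(build_interproscan_cmd(fasta, "ipr_out")))
```

Each argument list can be handed to `subprocess.run(cmd, check=True)` so that a failed run stops the pipeline instead of silently producing an empty annotation.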
4. The method for proteomic IPR annotation and enrichment analysis of claim 3, wherein the statistical analysis of IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files comprises:
determining an IPR-to-protein list according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files, so as to view the annotated IPR entries, the number of proteins matched to each entry, and the corresponding names;
determining a protein-to-IPR list according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files, so as to view the IPR entries annotated to each protein;
and sorting the annotated IPR entries in descending order of the number of matched proteins according to the IPR-to-protein list, and generating a corresponding histogram from the sorting result.
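The two lists and the descending sort can be sketched as follows, assuming the annotation result is InterProScan TSV output produced with the InterPro lookup enabled, so that column 1 holds the protein accession and columns 12 and 13 hold the InterPro accession and description. The function name `summarize_ipr` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def summarize_ipr(tsv_lines: Iterable[str]) -> Tuple[List[Tuple[str, str, int]], Dict[str, Set[str]]]:
    """Return (ranked IPR-to-protein summary, protein-to-IPR mapping)."""
    ipr_to_proteins: Dict[str, Set[str]] = defaultdict(set)
    protein_to_iprs: Dict[str, Set[str]] = defaultdict(set)
    ipr_names: Dict[str, str] = {}
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue  # row without InterPro columns
        protein, ipr, name = cols[0], cols[11], cols[12]
        if not ipr.startswith("IPR"):
            continue  # signature not integrated into an InterPro entry
        ipr_to_proteins[ipr].add(protein)
        protein_to_iprs[protein].add(ipr)
        ipr_names[ipr] = name
    # Sort by number of matched proteins, largest first; ties broken by accession.
    ranked = sorted(
        ((ipr, ipr_names[ipr], len(prots)) for ipr, prots in ipr_to_proteins.items()),
        key=lambda row: (-row[2], row[0]),
    )
    return ranked, dict(protein_to_iprs)
```

The `ranked` list already has the order needed for the histogram, so its counts can be passed directly to any plotting library.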
5. The method for proteomic IPR annotation and enrichment analysis of claim 2, wherein the IPR enrichment analysis of the plurality of protein sequence FASTA files comprises:
for the quantitative analysis data in the high-throughput proteome data, determining the IPR entries significantly enriched in the differentially expressed protein sequences of each comparison group, with the IPR ranking result of the total identified protein sequences as background.
6. The method for proteomic IPR annotation and enrichment analysis of claim 5, wherein determining, with the IPR ranking result of the total identified protein sequences as background, the IPR entries significantly enriched in the differentially expressed protein sequences of each comparison group comprises:
determining, according to the IPR ranking result of the total identified protein sequences, the number of proteins with IPR annotation information among the total identified protein sequences and the number of differentially expressed proteins among the total identified protein sequences;
determining, for each IPR entry, a corresponding total number of identified proteins and a corresponding number of differentially expressed proteins;
and performing the enrichment analysis calculation with a statistical significance test Pvalue formula, according to the number of proteins with IPR annotation information among the total identified protein sequences, the number of differentially expressed proteins among the total identified protein sequences, the total number of identified proteins corresponding to each IPR entry, and the corresponding number of differentially expressed proteins.
7. The method for proteomic IPR annotation and enrichment analysis of claim 6, wherein the statistical significance test Pvalue formula is represented as:
wherein N is the number of proteins with IPR annotation information among the total identified protein sequences, n is the number of differentially expressed proteins among the total identified protein sequences, M is the total number of identified proteins corresponding to each IPR entry, m is the number of differentially expressed proteins corresponding to each IPR entry, P is the Pvalue of the enrichment analysis, and i is the iteration number.
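The formula itself is not reproduced in this text. Taking the two overall counts as N (annotated total) and n (differentially expressed), and the two per-entry counts as M (total) and m (differentially expressed), one formula that fits these definitions is the hypergeometric upper-tail test widely used for enrichment analysis. The form below is an assumption based on those definitions, not a reproduction of the original formula:

```latex
\[
P \;=\; 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\,\binom{N-M}{\,n-i\,}}{\binom{N}{n}}
\]
```

Under this reading, P is the probability of observing at least m differentially expressed proteins in an IPR entry of size M when n proteins are drawn at random from the N annotated proteins, so a small P indicates enrichment.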
8. The method for proteomic IPR annotation and enrichment analysis of claim 7, wherein said statistically analyzing IPR enrichment analysis results corresponding to said plurality of protein sequence FASTA files comprises:
determining, according to the Pvalue of the enrichment analysis, an IPR enrichment analysis result list for the differentially expressed protein sequences of each comparison group, so as to view the enriched IPR entries, the enrichment analysis Pvalue corresponding to each enriched IPR entry, and the numbers of up-regulated and down-regulated proteins matched to the corresponding IPR entry.
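A minimal sketch of such a result list is given below. It assumes a hypergeometric upper-tail test for the Pvalue, a background mapping from each IPR entry to its annotated total identified proteins, and a log2FC mapping for the differentially expressed proteins of one comparison group. The function names and data layout are hypothetical, and the text here does not confirm which test the patent actually uses.

```python
from math import comb
from typing import Dict, List, Set, Tuple


def hypergeom_pvalue(N: int, n: int, M: int, m: int) -> float:
    """P(X >= m) when n proteins are drawn from N, of which M carry the IPR entry."""
    if m <= 0:
        return 1.0
    # Summing the upper tail directly equals 1 minus the lower tail,
    # without the floating-point cancellation of the subtraction.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, min(M, n) + 1))
    return tail / comb(N, n)


def enrichment_table(
    background: Dict[str, Set[str]],  # IPR entry -> annotated total identified proteins
    log2fc: Dict[str, float],         # differentially expressed protein -> log2FC
) -> List[Tuple[str, float, int, int]]:
    """Return (IPR, Pvalue, up-regulated count, down-regulated count), smallest Pvalue first."""
    annotated = set().union(*background.values())
    de = annotated & set(log2fc)
    N, n = len(annotated), len(de)
    rows = []
    for ipr, proteins in background.items():
        hits = proteins & de
        if not hits:
            continue
        up = sum(1 for p in hits if log2fc[p] > 0)  # log2FC > 0 counted as up-regulated
        rows.append((ipr, hypergeom_pvalue(N, n, len(proteins), len(hits)), up, len(hits) - up))
    rows.sort(key=lambda row: (row[1], row[0]))
    return rows
```

Calling `enrichment_table` once per comparison group yields one list per group. A multiple-testing correction could be applied to the Pvalue column afterwards, although the claims do not mention one.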
9. A proteomics IPR annotation and enrichment analysis device, comprising:
an acquisition unit for acquiring high-throughput proteome data;
a processing unit for determining a corresponding plurality of protein sequence FASTA files according to the high-throughput proteome data;
an analysis unit for performing IPR annotation analysis on the plurality of protein sequence FASTA files; further for performing statistical analysis on the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files; further for performing IPR enrichment analysis on the plurality of protein sequence FASTA files according to the IPR annotation analysis results; and further for performing statistical analysis on the IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files.
10. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs a method for proteomic IPR annotation and enrichment analysis according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011386481.6A CN112466394A (en) | 2020-12-01 | 2020-12-01 | Proteomics IPR annotation and enrichment analysis method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112466394A (en) | 2021-03-09 |
Family
ID=74805235
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011386481.6A Pending CN112466394A (en) | 2020-12-01 | 2020-12-01 | Proteomics IPR annotation and enrichment analysis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112466394A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160232293A1 (en) * | 2013-10-17 | 2016-08-11 | Sanford-Burnham Medical Research Institute | Drug sensitivity biomarkers and methods of identifying and using drug sensitivity biomarkers |
CN109712669A (en) * | 2018-12-05 | 2019-05-03 | 上海美吉生物医药科技有限公司 | A kind of protein function annotation method and system |
CN111650327A (en) * | 2020-07-16 | 2020-09-11 | 扬州大学 | Method for searching different cell or tissue differential expression proteins of Rugao yellow chicken based on non-standard quantitative proteomics technology |
Non-Patent Citations (2)
Title |
---|
柴立辉,刘峰涛: "《医学免疫学实验技术》", 河南大学出版社, pages: 21 * |
马启财: "基于 TMT 技术筛选牦牛和犏牛皮下脂肪及背最长肌组织间差异蛋白", 《中国优秀硕士学位论文全文数据库 农业科技辑》, 15 August 2020 (2020-08-15), pages 2 - 3 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210309 |