CN112466394A

CN112466394A - Proteomics IPR annotation and enrichment analysis method and device

Info

Publication number: CN112466394A
Application number: CN202011386481.6A
Authority: CN
Inventors: 顾洪涛; 赵海义
Original assignee: Wuhan Genecreate Biological Engineering Co ltd
Current assignee: Wuhan Genecreate Biological Engineering Co ltd
Priority date: 2020-12-01
Filing date: 2020-12-01
Publication date: 2021-03-09

Abstract

The invention relates to a method and a device for proteomics IPR annotation and enrichment analysis, wherein the method comprises the following steps: obtaining high-throughput proteome data; determining a plurality of corresponding protein sequence FASTA files according to the high-throughput proteome data; carrying out IPR annotation analysis on the plurality of protein sequence FASTA files; carrying out statistical analysis on the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files; carrying out IPR enrichment analysis on the plurality of protein sequence FASTA files; and carrying out statistical analysis on the IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files. The method is based on the Interpro resource database and performs IPR analysis of protein function using the unique advantages of the member databases, either individually or all together. The analyses are chained into a one-key operation: the user only needs to press the Enter key, and all related analyses are completed automatically and output to a specified folder. This saves time and labor and avoids the human errors that arise when the individual processes are run one by one.

Description

Proteomics IPR annotation and enrichment analysis method and device

Technical Field

The invention relates to the technical field of data processing of high-throughput proteomics, in particular to a method and a device for proteomic IPR annotation and enrichment analysis.

Background

Large-scale proteomics analysis almost always involves predicting protein families, domains and important sites. Interpro is a powerful classification tool that reduces redundancy and simplifies protein sequence analysis; it allows all member databases to be queried simultaneously and exploits the particular strengths of each. However, the existing Interpro data processing workflow has several disadvantages. On the one hand, online analysis on the Interpro website can only analyze one protein sequence at a time, so using it for IPR annotation of high-throughput proteomic data is time-consuming, and manual operation easily introduces errors. A local Interpro installation can analyze multiple protein sequences at a time, but only one group of protein sequences (each group comprising a plurality of protein sequences) per run, whereas high-throughput proteomic data generally involve several groups; here too, IPR annotation analysis needs simpler operation and higher efficiency. On the other hand, IPR annotation of high-throughput proteomic data is usually followed by IPR enrichment analysis and by statistical plotting of the annotation and enrichment results (so that they are displayed visually), and performing these steps separately is error-prone and inefficient.

In summary, how to process high-throughput proteomics data efficiently and quickly is an urgent problem to be solved.

Disclosure of Invention

In view of the above, there is a need to provide a method and a device for proteomics IPR annotation and enrichment analysis, so as to solve the problem in the prior art that efficient and fast high-throughput proteomics data processing cannot be performed.

The invention provides a proteomics IPR annotation and enrichment analysis method, which comprises the following steps:

obtaining high-throughput proteome data;

determining a corresponding protein sequence FASTA file according to the high-throughput proteome data;

performing IPR annotation analysis on the plurality of protein sequence FASTA files;

carrying out statistical analysis on IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files;

performing IPR enrichment analysis on the plurality of protein sequence FASTA files according to the IPR annotation analysis result;

and carrying out statistical analysis on IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files.

Further, the protein sequence FASTA files include a qualitative protein sequence FASTA file and a quantitative protein sequence FASTA file, and determining the corresponding plurality of protein sequence FASTA files according to the high-throughput proteome data includes:

acquiring the qualitative protein sequence FASTA file for qualitative analysis data in the high-throughput proteome data;

and acquiring the quantitative protein sequence FASTA file for quantitative analysis data in the high-throughput proteome data, wherein the quantitative protein sequence FASTA file comprises the total identified protein sequences, the differentially expressed protein sequences of each comparison group and a list of log2FC values of the differentially expressed proteins.

Further, the performing IPR annotation analysis on the plurality of protein sequence FASTA files comprises:

setting parameter configuration of Interproscan, wherein the parameter configuration comprises an Interpro database and an IPR analysis output format;

calling Interproscan to perform IPR analysis of the corresponding qualitative protein sequences on the qualitative protein sequence FASTA file;

and calling Interproscan to perform IPR analysis of the corresponding quantitative protein sequences on the quantitative protein sequence FASTA file.

Further, the performing statistical analysis on IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files includes:

determining a list of IPR-directed proteins (each IPR entry pointing to the proteins it matches) according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files, wherein the list is suitable for viewing the annotated IPR entries and the number and names of the matched proteins;

determining a list of protein-directed IPRs (each protein pointing to its IPR entries) according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files, wherein the list is suitable for viewing the IPR entries annotated for each protein;

and sorting the annotated IPR entries in descending order of the number of matched proteins according to the list of IPR-directed proteins, and generating a corresponding histogram according to the sorting result.

Further, the performing IPR enrichment analysis on the plurality of protein sequence FASTA files comprises:

determining, for the quantitative analysis data in the high-throughput proteome data and with the IPR ranking results of the total identified protein sequences as background, the IPR entries significantly enriched in the differentially expressed protein sequences of each comparison group.

Further, the determining, with the IPR ranking results of the total identified protein sequences as background, IPR entries significantly enriched in differentially expressed protein sequences of each of the comparison groups comprises:

determining the number of proteins with IPR annotation information in the total identified protein sequences and the number of differentially expressed proteins in the total identified protein sequences according to the IPR ranking result of the total identified protein sequences;

determining, for each IPR entry, a corresponding total number of identified proteins and a corresponding number of differentially expressed proteins;

and performing enrichment analysis calculation according to the number of proteins with IPR annotation information in the total identified protein sequence, the number of differentially expressed proteins in the total identified protein sequence, the total identified protein number corresponding to each IPR item and the corresponding differentially expressed protein number by a statistical significance test Pvalue formula.

Further, the statistical significance test Pvalue formula is expressed as:

wherein N is the number of proteins with IPR annotation information in the total identified protein sequences, n is the number of differentially expressed proteins in the total identified protein sequences, M is the total number of identified proteins corresponding to each IPR entry, m is the number of differentially expressed proteins corresponding to each IPR entry, P is the Pvalue of the enrichment analysis, and i is the summation index.

Further, the performing a statistical analysis on the IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files comprises:

and determining an IPR enrichment analysis result list for the differentially expressed protein sequences of each comparison group according to the Pvalue of the enrichment analysis, wherein the list is suitable for viewing the enriched IPR entries, the enrichment Pvalue of each enriched IPR entry, and the number of up-regulated and down-regulated proteins matched to it.

The invention also provides a proteomics IPR annotation and enrichment analysis device, which comprises:

an acquisition unit for acquiring high-throughput proteome data;

the processing unit is used for determining a corresponding protein sequence FASTA file according to the high-throughput proteome data;

an analysis unit for carrying out IPR annotation analysis on the plurality of protein sequence FASTA files; for carrying out statistical analysis on the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files; for carrying out IPR enrichment analysis on the plurality of protein sequence FASTA files according to the IPR annotation analysis results; and for carrying out statistical analysis on the IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files.

The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method of proteomic IPR annotation and enrichment analysis as described above.

Compared with the prior art, the invention has the following beneficial effects. First, the corresponding protein sequence FASTA files are determined according to the high-throughput proteome data, so that FASTA files are available for analysis based on the Interpro resource database, which ensures rapid data processing. Next, IPR annotation analysis is carried out and its results are statistically analyzed, so that the unique advantages of the member databases, used individually or all together, are applied to IPR analysis of protein function. Then IPR enrichment analysis is carried out on the plurality of protein sequence FASTA files directly from the IPR annotation analysis results, and the IPR enrichment analysis results are statistically analyzed; this avoids running each process separately, simplifies the processing and improves the accuracy of data processing. In summary, the method is based on the Interpro resource database and chains the analyses into a one-key operation: only the 'Enter' key needs to be pressed, and all related analyses are completed automatically and output to a designated folder, which saves time and labor and avoids the human errors that arise when the individual processes are run one by one.

Drawings

FIG. 1 is a schematic flow diagram of the method for proteomic IPR annotation and enrichment analysis provided by the present invention;

FIG. 2 is a schematic flow chart of obtaining a FASTA file according to the present invention;

FIG. 3 is a schematic flow diagram of an IPR analysis provided by the present invention;

FIG. 4 is a schematic flow chart of the statistical analysis of the IPR annotation analysis result provided by the present invention;

FIG. 5 is a schematic flow diagram of the enrichment analysis provided by the present invention;

FIG. 6 is a schematic structural diagram of the proteomic IPR annotation and enrichment analysis device provided by the present invention.

Detailed Description

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.

Example 1

An embodiment of the present invention provides a method for proteomic IPR annotation and enrichment analysis. Referring to FIG. 1, which is a schematic flow chart of the method, the method includes steps S1 to S6, wherein:

in step S1, high-throughput proteome data is acquired;

in step S2, determining a corresponding protein sequence FASTA file according to the high-throughput proteome data;

in step S3, performing IPR annotation analysis on a plurality of protein sequence FASTA files;

in step S4, performing statistical analysis on IPR annotation analysis results corresponding to a plurality of protein sequence FASTA files;

in step S5, performing IPR enrichment analysis on the plurality of protein sequence FASTA files according to the IPR annotation analysis result;

in step S6, the IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files are statistically analyzed.

In the embodiment of the invention, the corresponding protein sequence FASTA files are first determined according to the high-throughput proteome data, so that FASTA files are available for analysis based on the Interpro resource database, which ensures rapid data processing. IPR annotation analysis is then carried out and its results are statistically analyzed, so that the unique advantages of the member databases, used individually or all together, are applied to IPR analysis of protein function. Finally, IPR enrichment analysis is carried out on the plurality of protein sequence FASTA files directly from the IPR annotation analysis results, and the IPR enrichment analysis results are statistically analyzed; this avoids running each process separately, simplifies the processing and improves the accuracy of data processing.

It should be noted that IPR is an abbreviation for InterPro. InterPro is an integrated bioinformatics database that combines protein function predictions from many member databases, thereby outlining the family to which a protein belongs and the domains and important sites it contains. It is also a bioinformatic diagnostic tool that integrates the protein signatures of these member databases into a searchable resource and uses their unique advantages to provide functional analysis of proteins. Interproscan is a software package that runs the scanning algorithms of the InterPro database in an integrated way, so that a user can analyze the functional characteristics of nucleotide or protein sequences: the sequences are submitted in FASTA format and matched against the predictive models (signatures) of the InterPro database, yielding predictions of protein family membership, domains and important sites. IPR annotation means matching a target sequence against the Interpro database to obtain these predictions for the corresponding protein. A protein sequence FASTA file is a file in FASTA format, a text-based format for representing nucleic acid or polypeptide sequences. Enrichment analysis refers to classifying genes according to prior knowledge, namely genome annotation information. log2FC is the base-2 logarithm of the fold change, that is, of the ratio of a protein's expression levels in the two samples of a comparison group.
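As a small worked illustration of the log2FC definition, the following Python sketch computes the value for one protein and labels it by sign. The function names and the rule "positive means up-regulated" are assumptions made for illustration; the source does not state a threshold or a labelling rule.

```python
import math


def log2_fold_change(treated: float, control: float) -> float:
    """Return log2(treated / control) for two positive expression levels."""
    if treated <= 0 or control <= 0:
        raise ValueError("expression levels must be positive")
    return math.log2(treated / control)


def regulation(log2fc: float) -> str:
    """Label a protein by the sign of its log2FC (assumed rule, no cutoff)."""
    if log2fc > 0:
        return "up"
    if log2fc < 0:
        return "down"
    return "unchanged"
```

For example, an expression level of 8.0 against 2.0 is a four-fold increase, so the log2FC is 2.0 and the protein would be labelled "up" under this assumed rule.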

Preferably, referring to FIG. 2, which is a schematic flowchart of acquiring the FASTA files provided by the present invention, step S2 includes steps S21 to S22, wherein:

in step S21, acquiring a qualitative protein sequence FASTA file for qualitative analysis data in the high-throughput proteome data;

in step S22, a quantitative protein sequence FASTA file is obtained for the quantitative analysis data in the high-throughput proteome data, wherein the quantitative protein sequence FASTA file includes the total identified protein sequences, the differentially expressed protein sequences of each comparison group, and a list of log2FC of the differentially expressed proteins.

Therefore, qualitative analysis data and quantitative analysis data are processed separately, so that IPR annotation and enrichment analysis can be performed effectively in the subsequent steps.
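Steps S21 and S22 can be pictured with a minimal Python sketch that reads FASTA records and selects the sequences of one comparison group from the total identified set. The function names and the idea of selecting by a list of protein identifiers are assumptions for illustration; the source does not describe how the FASTA files are produced.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier (first word after '>') to its sequence."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped, so append to the open record.
            records[current] += line
    return records


def subset_fasta(records: Dict[str, str], wanted_ids: Iterable[str]) -> Dict[str, str]:
    """Keep only the records whose identifier is in wanted_ids."""
    wanted = set(wanted_ids)
    return {pid: seq for pid, seq in records.items() if pid in wanted}
```

With this sketch, the total identified proteins would be one parsed record set, and each comparison group's differentially expressed proteins would be a subset of it.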

Preferably, referring to FIG. 3, which is a schematic flow chart of the IPR analysis provided by the present invention, step S3 includes steps S31 to S33, wherein:

in step S31, setting the parameter configuration of Interproscan, where the parameter configuration includes the Interpro database and the IPR analysis output format;

in step S32, calling Interproscan to perform IPR analysis of the corresponding qualitative protein sequences on the qualitative protein sequence FASTA file;

in step S33, calling Interproscan to perform IPR analysis of the corresponding quantitative protein sequences on the quantitative protein sequence FASTA file.

Therefore, through the parameter configuration, qualitative analysis data and quantitative analysis data are each processed in a targeted manner. Interproscan is called to perform IPR analysis of the qualitative protein sequences, once for each FASTA file, so that as many analyses are run as there are FASTA files. Interproscan is likewise called to perform IPR analysis of the quantitative protein sequences, once for each FASTA file, in preparation for the subsequent IPR annotation statistics and enrichment analysis.
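The "one run per FASTA file" pattern of steps S31 to S33 might look like the following Python sketch, which builds one InterProScan command line per input file. It assumes a local InterProScan 5 installation reachable as `interproscan.sh` and uses its `-i`, `-f`, `-appl`, `-o` and `-iprlookup` options; the choice of Pfam as member database, the TSV output format and the output file naming are illustrative assumptions, not settings stated in the source.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_interproscan_command(
    fasta: str,
    out_dir: str,
    applications: Sequence[str] = ("Pfam",),  # assumed member database choice
    fmt: str = "tsv",                          # assumed output format
    executable: str = "interproscan.sh",       # assumed to be on PATH
) -> List[str]:
    """Build the command for one FASTA file (step S31 configuration)."""
    fasta_path = Path(fasta)
    out_file = Path(out_dir) / f"{fasta_path.stem}.{fmt}"
    return [
        executable,
        "-i", fasta_path.as_posix(),
        "-f", fmt,
        "-appl", ",".join(applications),
        "-o", out_file.as_posix(),
        "-iprlookup",
    ]


def run_all(fasta_files: Sequence[str], out_dir: str) -> None:
    """Run InterProScan once per FASTA file (steps S32 and S33)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for fasta in fasta_files:
        subprocess.run(build_interproscan_command(fasta, out_dir), check=True)
```

Separating command construction from execution keeps the configuration inspectable before anything is run; `run_all` is only a thin loop over the configured command.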

Preferably, referring to FIG. 4, which is a schematic flowchart of the statistical analysis of the IPR annotation analysis results provided by the present invention, step S4 includes steps S41 to S43, wherein:

in step S41, determining a list of IPR-directed proteins, suitable for viewing the annotated IPR entries and the number and names of the matched proteins, according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files;

in step S42, determining a list of protein-directed IPRs, suitable for viewing the IPR entries annotated for each protein, according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files;

in step S43, sorting the annotated IPR entries in descending order of the number of matched proteins according to the list of IPR-directed proteins, and generating a corresponding histogram according to the sorting result.

Therefore, a list of IPR-directed proteins is obtained, one for each FASTA file, so that a researcher can conveniently view the annotated IPR entries together with the number and names of the proteins matched to each. A list of protein-directed IPRs is likewise obtained, one for each FASTA file, so that a researcher can conveniently view the IPR entries annotated for each protein. The list of IPR-directed proteins is also statistically analyzed and plotted, so that a researcher can find important IPR entries intuitively: the annotated IPR entries are sorted in descending order of the number of matched proteins, and the top 20 IPR entries are drawn as a histogram (the higher an IPR entry ranks, the more important it is to that group of data and the more guidance it offers for further research). In addition, the IPR result list for differentially expressed proteins shows the names and numbers of the matched up-regulated and down-regulated proteins, and the two are distinguished in the corresponding statistical plot.
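The two lists of steps S41 and S42 and the top-20 ranking of step S43 can be sketched in Python as below. The sketch assumes InterProScan's tab-separated output, in which the first column holds the protein accession and the twelfth column holds the InterPro accession (or "-" when there is none); the function names and the alphabetical tie-break are assumptions for illustration. Drawing the histogram itself is left out, since the source does not name a plotting tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

PROTEIN_COL = 0   # protein accession column in InterProScan TSV output
IPR_COL = 11      # InterPro accession column (present with -iprlookup)


def build_ipr_maps(tsv_lines: Iterable[str]) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Return (IPR-directed proteins, protein-directed IPRs)."""
    ipr_to_proteins: Dict[str, Set[str]] = defaultdict(set)
    protein_to_iprs: Dict[str, Set[str]] = defaultdict(set)
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= IPR_COL:
            continue  # match without InterPro columns
        protein, ipr = fields[PROTEIN_COL], fields[IPR_COL]
        if not ipr.startswith("IPR"):
            continue  # "-" marks a signature not integrated into InterPro
        ipr_to_proteins[ipr].add(protein)
        protein_to_iprs[protein].add(ipr)
    return dict(ipr_to_proteins), dict(protein_to_iprs)


def top_iprs(ipr_to_proteins: Dict[str, Set[str]], limit: int = 20) -> List[Tuple[str, int]]:
    """Sort IPR entries by matched-protein count, largest first (step S43)."""
    counts = [(ipr, len(proteins)) for ipr, proteins in ipr_to_proteins.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))[:limit]
```

Using sets means a protein with several matches to the same IPR entry is counted once for that entry.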

Preferably, the step S5 specifically includes:

for the quantitative analysis data in the high-throughput proteome data, taking the IPR ranking results of the total identified protein sequences as background, and determining the IPR entries significantly enriched in the differentially expressed protein sequences of each comparison group.

Therefore, enrichment analysis is generally not needed for qualitative proteomic data, for which the process ends here. For quantitative proteomic data the process continues: based on the output of step S4 and with the IPR results of the total identified proteins as background, the IPR entries significantly enriched in the differentially expressed proteins of each comparison group are found. One enrichment analysis is performed for each comparison group protein sequence FASTA file, so that enrichment analysis is carried out effectively.

Preferably, referring to FIG. 5, which is a schematic flow chart of the enrichment analysis provided by the present invention, step S5 includes steps S51 to S53, wherein:

in step S51, determining the number of proteins with IPR annotation information in the total identified protein sequences and the number of differentially expressed proteins in the total identified protein sequences according to the IPR ranking result of the total identified protein sequences;

in step S52, for each IPR entry, determining the corresponding number of total identified proteins and the corresponding number of differentially expressed proteins;

in step S53, performing the enrichment analysis calculation by the statistical significance test Pvalue formula according to the number of proteins with IPR annotation information in the total identified protein sequences, the number of differentially expressed proteins in the total identified protein sequences, and, for each IPR entry, the corresponding number of total identified proteins and the corresponding number of differentially expressed proteins.

Therefore, based on the IPR annotation analysis results, the enrichment analysis calculation is carried out using the statistical significance test Pvalue formula. With the total identified protein sequences as the background, the number of proteins with IPR annotation information and the number of differentially expressed proteins are determined; then, for each IPR entry, the number of total identified proteins and the number of differentially expressed proteins matched to it (both obtained from the IPR annotation analysis) are determined, so that the enrichment analysis can be performed and the protein composition associated with each IPR entry determined.
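The four counts of steps S51 and S52 can be derived from a protein-to-IPR mapping, as in the Python sketch below. It assumes that only proteins carrying at least one IPR annotation count towards the background, and that the differentially expressed count is restricted to those annotated proteins; the source does not state the second point explicitly, so it is an assumption, as are the function and variable names.

```python
from typing import Dict, Iterable, Set, Tuple


def enrichment_counts(
    protein_to_iprs: Dict[str, Set[str]],
    differential_ids: Iterable[str],
) -> Tuple[int, int, Dict[str, Tuple[int, int]]]:
    """Return (background size, differential size, per-entry counts).

    The per-entry dictionary maps each IPR entry to a pair: the number of
    total identified proteins and of differentially expressed proteins
    annotated with that entry.
    """
    background = set(protein_to_iprs)            # annotated proteins only
    diff = background & set(differential_ids)    # assumed: annotated subset
    per_entry: Dict[str, Tuple[int, int]] = {}
    for protein, iprs in protein_to_iprs.items():
        for ipr in iprs:
            total, differential = per_entry.get(ipr, (0, 0))
            per_entry[ipr] = (total + 1, differential + (1 if protein in diff else 0))
    return len(background), len(diff), per_entry
```

A differentially expressed protein without any IPR annotation is ignored here, which keeps the differential count a subset of the background.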

Preferably, the statistical significance test Pvalue formula is expressed as:

wherein N is the number of proteins with IPR annotation information in the total identified protein sequences, n is the number of differentially expressed proteins in the total identified protein sequences, M is the total number of identified proteins corresponding to each IPR entry, m is the number of differentially expressed proteins corresponding to each IPR entry, P is the Pvalue of the enrichment analysis, and i is the summation index.
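The variable definitions above (a background count, a differential count, the two per-entry counts and an index that is summed over) match the hypergeometric test commonly used for enrichment analysis. Assuming that is the intended test, with lowercase n and m denoting the differentially expressed counts and uppercase N and M the background counts, the formula would read:

```latex
P = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\,\binom{N-M}{n-i}}{\binom{N}{n}}
```

Under this reading, P is the probability of observing at least m differentially expressed proteins in an IPR entry when n proteins are drawn at random from the N annotated background proteins, M of which belong to that entry.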

Therefore, the statistical significance test Pvalue formula is used to perform the enrichment analysis calculation and ensure the accuracy of the enrichment analysis.
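A minimal Python sketch of the enrichment calculation follows, assuming the test is the hypergeometric upper tail, that is, the probability of seeing at least m differentially expressed proteins in an IPR entry by chance. The function and parameter names are illustrative, and this is one plausible reading of the formula rather than a confirmed implementation of it.

```python
from math import comb


def hypergeom_pvalue(big_n: int, small_n: int, big_m: int, small_m: int) -> float:
    """Upper-tail hypergeometric probability P(X >= small_m).

    big_n:   annotated proteins in the background (N)
    small_n: differentially expressed proteins in the background (n)
    big_m:   background proteins annotated with the IPR entry (M)
    small_m: differentially expressed proteins annotated with the entry (m)
    """
    if not (0 <= big_m <= big_n and 0 <= small_n <= big_n):
        raise ValueError("require 0 <= M <= N and 0 <= n <= N")
    upper = min(big_m, small_n)
    if not 0 <= small_m <= upper:
        raise ValueError("require 0 <= m <= min(M, n)")
    total = comb(big_n, small_n)
    # Summing the tail directly equals 1 minus the sum over i = 0 .. m-1.
    tail = sum(
        comb(big_m, i) * comb(big_n - big_m, small_n - i)
        for i in range(small_m, upper + 1)
    )
    return tail / total
```

Summing the tail with exact integers and dividing once avoids the rounding loss that subtracting a value close to 1 would introduce for very small P values.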

Preferably, step S6 specifically includes:

determining an IPR enrichment analysis result list for the differentially expressed protein sequences of each comparison group according to the Pvalue of the enrichment analysis, wherein the list is suitable for viewing the enriched IPR entries, the enrichment Pvalue of each enriched IPR entry, and the number of up-regulated and down-regulated proteins matched to it.

Therefore, an IPR enrichment analysis result list is determined for the differentially expressed protein sequences of each comparison group according to the Pvalue of the enrichment analysis, which reflects the protein composition of each entry, so that a researcher can conveniently check the corresponding statistical results.

Specifically, for the differentially expressed protein sequences of each comparison group, the result list is built from the enrichment Pvalue, the enriched IPR entries and the number of up-regulated and down-regulated proteins matched to each entry, so that a researcher can quickly check the statistical results and efficiently understand the outcome of the enrichment analysis.
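The result list of step S6 could be assembled as in the following Python sketch, which pairs each IPR entry's Pvalue with the counts of up-regulated and down-regulated proteins and sorts by Pvalue. The row type, the function name and the rule that the sign of log2FC decides up or down are assumptions for illustration; the source does not give a significance cutoff, so none is applied.

```python
from typing import Dict, Iterable, List, NamedTuple


class EnrichmentRow(NamedTuple):
    ipr: str
    pvalue: float
    up: int
    down: int


def build_result_list(
    pvalues: Dict[str, float],
    ipr_to_diff_proteins: Dict[str, Iterable[str]],
    log2fc: Dict[str, float],
) -> List[EnrichmentRow]:
    """Build one row per IPR entry, most significant first."""
    rows = []
    for ipr, pvalue in pvalues.items():
        members = list(ipr_to_diff_proteins.get(ipr, ()))
        # Assumed rule: positive log2FC is up-regulated, negative is down-regulated.
        up = sum(1 for pid in members if log2fc.get(pid, 0.0) > 0)
        down = sum(1 for pid in members if log2fc.get(pid, 0.0) < 0)
        rows.append(EnrichmentRow(ipr, pvalue, up, down))
    return sorted(rows, key=lambda row: (row.pvalue, row.ipr))
```

One such list would be produced per comparison group, matching the one-list-per-group output described above.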

Example 2

An embodiment of the present invention provides a proteomics IPR annotation and enrichment analysis device. Referring to FIG. 6, which is a schematic structural diagram of the device, the proteomics IPR annotation and enrichment analysis device 600 includes:

an acquisition unit 601 configured to acquire high-throughput proteome data;

a processing unit 602, configured to determine a corresponding protein sequence FASTA file according to the high-throughput proteome data;

an analysis unit 603, configured to perform IPR annotation analysis on the plurality of protein sequence FASTA files; further configured to perform statistical analysis on the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files; further configured to perform IPR enrichment analysis on the plurality of protein sequence FASTA files; and further configured to perform statistical analysis on the IPR enrichment analysis results corresponding to the plurality of protein sequence FASTA files.

Example 3

The embodiment of the invention provides a proteomics IPR annotation and enrichment analysis device which comprises a processor and a memory, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the proteomics IPR annotation and enrichment analysis method is realized.

Example 4

Embodiments of the present invention provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for proteomic IPR annotation and enrichment analysis as described above.

The invention discloses a method and a device for proteomics IPR annotation and enrichment analysis. In the method, the corresponding protein sequence FASTA files are first determined according to the high-throughput proteomic data, so that FASTA files are available for analysis based on the Interpro resource database, which ensures rapid data processing. IPR annotation analysis is then carried out and its results are statistically analyzed, so that the unique advantages of the member databases, used individually or all together, are applied to IPR analysis of protein function. Finally, IPR enrichment analysis is carried out on the plurality of protein sequence FASTA files directly from the IPR annotation analysis results, and the IPR enrichment analysis results are statistically analyzed; this avoids running each process separately, simplifies the processing and improves the accuracy of data processing.

According to the technical scheme, IPR analysis of protein function is based on the Interpro resource database and uses the unique advantages of the member databases, individually or all together. The analyses are chained into a one-key operation: only the 'Enter' key needs to be pressed, and all related analyses are completed automatically and output to the designated folder, which saves time and labor and avoids the human errors that arise when the individual processes are run one by one.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. A method for proteomic IPR annotation and enrichment analysis, comprising:

obtaining high-throughput proteome data;

2. The method for proteomic IPR annotation and enrichment analysis of claim 1, wherein the protein sequence FASTA files comprise a qualitative protein sequence FASTA file and a quantitative protein sequence FASTA file, and wherein determining the corresponding plurality of protein sequence FASTA files from the high-throughput proteomic data comprises:

3. The method for proteomic IPR annotation and enrichment analysis of claim 2, wherein the IPR annotation analysis of the plurality of protein sequence FASTA files comprises:

4. The method for proteomic IPR annotation and enrichment analysis of claim 3, wherein the statistical analysis of IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files comprises:

determining a list of IPR-directed proteins according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files, so as to view the annotated IPR entries and the number and names of the matched proteins;

determining a list of protein-directed IPRs according to the IPR annotation analysis results corresponding to the plurality of protein sequence FASTA files, so as to view the IPR entries annotated for each protein;

5. The method for proteomic IPR annotation and enrichment analysis of claim 2, wherein the IPR enrichment analysis of the plurality of protein sequence FASTA files comprises:

6. The method for proteomic IPR annotation and enrichment analysis of claim 5, wherein based on the IPR ranking of the total identified protein sequences, determining IPR entries significantly enriched for each of the comparison sets of differentially expressed protein sequences comprises:

and performing enrichment analysis calculation by a statistical significance test Pvalue formula according to the number of proteins with IPR annotation information in the total identified protein sequences, the number of differentially expressed proteins in the total identified protein sequences, and, for each IPR entry, the corresponding number of total identified proteins and the corresponding number of differentially expressed proteins.

7. The method for proteomic IPR annotation and enrichment analysis of claim 6, wherein the statistical significance test Pvalue formula is represented as:

8. The method for proteomic IPR annotation and enrichment analysis of claim 7, wherein said statistically analyzing IPR enrichment analysis results corresponding to said plurality of protein sequence FASTA files comprises:

and determining an IPR enrichment analysis result list for the differentially expressed protein sequences of each comparison group according to the Pvalue of the enrichment analysis, so as to view the enriched IPR entries, the enrichment Pvalue of each enriched IPR entry, and the number of up-regulated and down-regulated proteins matched to it.

9. A proteomics IPR annotation and enrichment analysis device, comprising:

an acquisition unit for acquiring high-throughput proteome data;

10. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs a method for proteomic IPR annotation and enrichment analysis according to any one of claims 1-8.