CN114496095A - Modification site recognition method, system, device and storage medium

Info

Publication number: CN114496095A
Application number: CN202210066721.7A
Authority: CN
Inventors: 李占潮; 杨楠翔; 王梦茹; 彭冬冬; 刘洁; 胡党中; 邓倩; 刘雨琦
Original assignee: Guangdong Pharmaceutical University
Current assignee: Guangdong Pharmaceutical University
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2022-05-13

Abstract

The invention discloses a modification site recognition method, system, device and storage medium. The method comprises the following steps: obtaining protein acetylation modification sites and the corresponding protein acetylation amino acid sequence fragments; performing two-dimensional matrix conversion on the amino acids in each protein acetylation amino acid sequence fragment to obtain protein acetylation two-dimensional matrix characteristics; performing a continuous wavelet transform operation on the protein acetylation two-dimensional matrix characteristics to obtain a scale matrix; and inputting the scale matrix into a deep convolutional neural network model, which outputs the recognition result for the protein acetylation site. The system comprises a sequence acquisition module, a matrix characteristic acquisition module, a scale matrix acquisition module and a recognition module. With this method, protein acetylation modification sites can be identified on the proteome scale accurately, rapidly and at low cost. The invention can be widely applied in the field of protein post-translational modification site recognition.

Description

Modification site recognition method, system, device and storage medium

Technical Field

The invention relates to the field of protein post-translational modification site recognition, and in particular to a modification site recognition method, system, device and storage medium.

Background

Protein acetylation is one of the post-translational modifications of proteins. It refers to the dynamic process in which acetyl groups are added to or removed from lysine residues in proteins by lysine acetyltransferases and lysine deacetylases, respectively. Studies show that protein acetylation participates in various biological processes such as metabolism, transcriptional activation, subcellular localization, protein stabilization and stress response, and is closely related to the occurrence and development of complex and serious diseases such as cancer, diabetes, cardiovascular diseases and neurodegenerative diseases. Therefore, identifying protein acetylation modification sites not only helps to further elucidate the mechanism of protein acetylation and the relationship between protein post-translational modification and function, but also has important research significance and application value for identifying disease-related proteins and developing related drugs. With the advent of the post-genomic era and the rapid development of sequencing technologies, a great deal of protein sequence data has been generated. Although high-resolution mass spectrometry can identify potential protein acetylation modification sites, it is time-consuming, labor-intensive and costly, and cannot meet the needs of large-scale proteomics research.

Disclosure of Invention

In order to solve the above technical problems, the present invention provides a method, a system, a device and a storage medium for identifying a modification site, which can identify a protein acetylation modification site on a proteome scale, and has the advantages of accuracy, rapidness, low cost, etc.

The first technical scheme adopted by the invention is as follows: a modification site recognition method comprising the following steps:

obtaining protein acetylation modification sites and corresponding protein acetylation amino acid sequence fragments;

performing two-dimensional matrix conversion on amino acids in the protein acetylation amino acid sequence fragment to obtain protein acetylation two-dimensional matrix characteristics;

performing continuous wavelet transform operation on the protein acetylation two-dimensional matrix characteristics based on a wavelet transform function to obtain a scale matrix;

and inputting the scale matrix into a deep convolutional neural network model, and outputting the recognition result of the protein acetylation site.

Further, the step of obtaining the protein acetylation modification site and the corresponding protein acetylation amino acid sequence fragment specifically includes:

obtaining protein acetylation modification sites;

taking each protein acetylation modification site as a center, extracting a sequence segment comprising 10 amino acid residues upstream and 10 amino acid residues downstream (21 residues in total) to obtain the corresponding protein acetylation amino acid sequence fragment.

Further, the step of performing two-dimensional matrix transformation on amino acids in the protein acetylated amino acid sequence fragment to obtain a protein acetylated two-dimensional matrix characteristic specifically includes:

respectively defining each of the 20 natural amino acid residues as a two-dimensional characteristic matrix of size 7 × 5 to obtain the definition matrices;

and converting the corresponding protein acetylated amino acid sequence fragments into protein acetylated two-dimensional matrix characteristics according to the definition matrix.

Further, the step of performing continuous wavelet transform operation on the protein acetylation two-dimensional matrix characteristics based on the wavelet transform function to obtain a scale matrix specifically includes:

presetting a wavelet transformation function and a scale;

sequentially changing wavelet transformation scales and performing continuous wavelet transformation according to the wavelet transformation function and the row characteristics of the protein acetylation two-dimensional matrix characteristics to obtain a corresponding transformation coefficient under each scale;

and constructing a scale matrix by taking the corresponding transformation coefficient under each scale as a row vector.

Further, the building step of the deep convolutional neural network model specifically comprises:

collecting protein acetylation site data, and processing it with the predefined amino acid two-dimensional characteristic matrices, the wavelet transform function and the preset scales to obtain a first training scale matrix;

obtaining a non-protein acetylation modification site sequence according to the collected protein acetylation site data;

processing the non-protein acetylation modification site sequence with the predefined amino acid two-dimensional characteristic matrices, the wavelet transform function and the preset scales to obtain a second training scale matrix;

taking the first training scale matrix as an input data positive sample and the second training scale matrix as an input data negative sample;

extracting, in a preset proportion, input data positive samples with their corresponding output data positive sample labels and input data negative samples with their corresponding output data negative sample labels as a training set to train the deep convolutional neural network model;

and extracting, in a preset proportion, input data positive samples with their corresponding output data positive sample labels and input data negative samples with their corresponding output data negative sample labels as a test set to test the deep convolutional neural network model.

Further, the step of obtaining a non-protein acetylation modification site sequence according to the collected data information of the protein acetylation site specifically includes:

searching the proteins that contain protein acetylation modification sites, according to the collected protein acetylation site data, to obtain the lysine residues that are not annotated as acetylation sites;

taking each lysine residue that is not annotated as an acetylation site as a center, extracting a sequence segment comprising 10 amino acid residues upstream and 10 amino acid residues downstream to obtain a non-protein acetylation modification site sequence.
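The negative-site search described above can be sketched as follows. This is a minimal illustration that assumes 1-based residue positions; the function name `candidate_negative_sites` and the toy sequence are hypothetical and not part of the original description.

```python
def candidate_negative_sites(sequence, acetylated_positions):
    """Return 1-based positions of lysines (K) not annotated as acetylated.

    `acetylated_positions` holds the 1-based positions annotated as
    N6-acetyllysine for this protein (the positive sites).
    """
    annotated = set(acetylated_positions)
    return [
        i + 1
        for i, residue in enumerate(sequence)
        if residue == "K" and (i + 1) not in annotated
    ]


# Toy example (hypothetical sequence): K at positions 2, 4 and 5, with
# position 2 annotated as acetylated.
negatives = candidate_negative_sites("MKAKKR", [2])
```

Each returned position would then serve as the center of a 21-residue window, exactly as for the positive sites.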

The second technical scheme adopted by the invention is as follows: a modification site recognition system comprising:

the sequence acquisition module is used for acquiring protein acetylation modification sites and corresponding protein acetylation amino acid sequence fragments;

the matrix characteristic acquisition module is used for carrying out two-dimensional matrix conversion on amino acids in the protein acetylation amino acid sequence fragments to obtain protein acetylation two-dimensional matrix characteristics;

the scale matrix acquisition module is used for carrying out continuous wavelet transform operation on the protein acetylation two-dimensional matrix characteristics based on a wavelet transform function to obtain a scale matrix;

and the recognition module is used for inputting the scale matrix into the deep convolutional neural network model and outputting the recognition result of the protein acetylation site.

The third technical scheme adopted by the invention is as follows: a modification site recognition device comprising:

at least one processor;

at least one memory for storing at least one program;

the at least one program, when executed by the at least one processor, causes the at least one processor to implement the modification site recognition method described above.

The fourth technical scheme adopted by the invention is as follows: a storage medium storing processor-executable instructions which, when executed by a processor, implement the modification site recognition method described above.

The method, system, device and storage medium have the following advantages: based on deep learning theory, a theoretical model is constructed by data processing, so that protein acetylation modification sites can be identified quickly, accurately and efficiently on the proteome scale, which benefits research on the mechanism of protein acetylation modification and the identification of post-translational modification sites of disease-related proteins.

Drawings

FIG. 1 is a flow chart of the steps of a method of identifying a modification site according to the present invention;

FIG. 2 is a block diagram showing the structure of a modification site recognition system according to the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.

As shown in FIG. 1, the present invention provides a modification site recognition method comprising the steps of:

S1, acquiring protein acetylation modification sites and corresponding protein acetylation amino acid sequence fragments;

S1.1, obtaining protein acetylation modification sites;

S1.2, taking each protein acetylation modification site as a center, extracting a sequence segment comprising 10 amino acid residues upstream and 10 amino acid residues downstream (21 residues in total) to obtain the corresponding protein acetylation amino acid sequence fragment.
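Step S1.2 can be sketched as a simple window extraction. The sketch below is illustrative only: the function name `extract_fragment` is hypothetical, and padding with "X" near the protein termini is an assumption, since the text does not say how sites closer than 10 residues to either end are handled.

```python
FLANK = 10  # residues taken on each side of the central lysine (from the text)


def extract_fragment(sequence, site, flank=FLANK, pad="X"):
    """Return the (2 * flank + 1)-residue window centred on 1-based `site`.

    Padding with `pad` is an assumption for sites near the termini; the
    source text does not specify this case.
    """
    idx = site - 1  # convert to a 0-based index
    if not 0 <= idx < len(sequence):
        raise ValueError("site lies outside the sequence")
    if sequence[idx] != "K":
        raise ValueError("the central residue must be a lysine (K)")
    left = sequence[max(0, idx - flank):idx]
    right = sequence[idx + 1:idx + 1 + flank]
    return left.rjust(flank, pad) + "K" + right.ljust(flank, pad)


# Toy sequences (hypothetical) showing an interior site and a terminal site.
interior = extract_fragment("A" * 10 + "K" + "C" * 10, 11)
terminal = extract_fragment("MKAAA", 2)
```

With 10 residues on each side, every fragment has 21 residues, which gives 21 × 5 = 105 columns once each residue is replaced by a 7 × 5 matrix.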

S2, performing two-dimensional matrix transformation on amino acids in the protein acetylation amino acid sequence fragment to obtain protein acetylation two-dimensional matrix characteristics;

S2.1, respectively defining each of the 20 natural amino acid residues as a two-dimensional characteristic matrix of size 7 × 5 to obtain the definition matrices;

specifically, the definition matrices include the following:

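cysteine C is represented as:

[0 1 1 1 0
1 0 0 0 1
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 1
0 1 1 1 0]

aspartic acid D is represented as:

[1 1 1 1 0
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 1 1 1 0]

lysine K is represented as:

[1 0 0 0 1
1 0 0 1 0
1 0 1 0 0
1 1 0 0 0
1 0 1 0 0
1 0 0 1 0
1 0 0 0 1]

tryptophan W is represented as:

[1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 0 1 0 1
1 1 0 1 1
1 0 0 0 1]

tyrosine Y is represented as:

[1 0 0 0 1
1 0 0 0 1
0 1 0 1 0
0 0 1 0 0
0 0 1 0 0
0 0 1 0 0
0 0 1 0 0]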

S2.2, converting the corresponding protein acetylation amino acid sequence fragment into protein acetylation two-dimensional matrix characteristics according to the definition matrices.

The following amino acid sequence fragment is taken as an example: tryptophan W, cysteine C, lysine K, aspartic acid D and tyrosine Y (abbreviated WCKDY). The conversion proceeds as follows:

the two-dimensional characteristic matrices of W, C, K, D and Y are spliced from left to right in the order in which the amino acids appear in the sequence fragment, yielding the protein acetylation two-dimensional matrix characteristic (7 rows and 25 columns):

[1 0 0 0 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0 1 0 0 0 1
1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1
1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 0
1 0 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 0
1 0 1 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 0
1 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0
1 0 0 0 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 0 1 0 0]
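The splicing in step S2.2 can be sketched in a few lines. The five 7 × 5 matrices below were read off, five columns at a time, from the 7 × 25 WCKDY matrix above; the names `GLYPHS` and `encode_fragment` are hypothetical, and the matrices for the other 15 amino acids are not given here and would have to be supplied separately.

```python
# 7 x 5 definition matrices for the five residues of the worked example,
# one string per matrix row. Derived from the 7 x 25 WCKDY matrix in the text.
GLYPHS = {
    "W": ["10001", "10001", "10001", "10001", "10101", "11011", "10001"],
    "C": ["01110", "10001", "10000", "10000", "10000", "10001", "01110"],
    "K": ["10001", "10010", "10100", "11000", "10100", "10010", "10001"],
    "D": ["11110", "10001", "10001", "10001", "10001", "10001", "11110"],
    "Y": ["10001", "10001", "01010", "00100", "00100", "00100", "00100"],
}


def encode_fragment(fragment):
    """Splice the per-residue 7 x 5 matrices left to right.

    Returns a list of 7 rows, each a list of 5 * len(fragment) integers (0/1).
    """
    rows = []
    for r in range(7):
        bits = "".join(GLYPHS[residue][r] for residue in fragment)
        rows.append([int(bit) for bit in bits])
    return rows


matrix = encode_fragment("WCKDY")
first_row = "".join(str(value) for value in matrix[0])
last_row = "".join(str(value) for value in matrix[-1])
```

For the fragment WCKDY this reproduces the 7 × 25 matrix above row by row; a full 21-residue fragment would give a 7 × 105 matrix.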

S3, performing continuous wavelet transform operation on the protein acetylation two-dimensional matrix characteristics based on a wavelet transform function to obtain a scale matrix;

S3.1, presetting a wavelet transformation function and a scale;

specifically, the wavelet transform function formula is expressed as follows:

in the above formula, CWT_f(a, b) are the continuous wavelet transform coefficients, ψ is the wavelet function, f(t) is a row of the protein acetylation two-dimensional matrix characteristic, a is the scale parameter, and b is the shift factor. The wavelet function is defined as follows:

wherein ω₀ is the center frequency.
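For orientation, the textbook form of the continuous wavelet transform that matches the symbols defined above is written out below. The Morlet form of ψ is an assumption suggested by the center-frequency parameter ω₀, and the normalization constants are the common textbook ones; neither is confirmed by this text.

```latex
CWT_f(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt,
\qquad
\psi(t) = \pi^{-1/4}\, e^{i \omega_0 t}\, e^{-t^{2}/2}
```

Here ψ* denotes the complex conjugate of ψ, larger values of a stretch the wavelet, and b moves it along the row.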

Taking f(t) as the first row of the protein acetylation two-dimensional matrix characteristic, i.e., f(t) = 1000101110100011111010001;

preferably, f(t) is subjected to a continuous wavelet transform using the following formula, where a = 1, 2, ..., 48 and b = 1.

Preferably, in the continuous wavelet transform formula, the wavelet function is defined as follows:

wherein ω₀ = 3.

S3.2, sequentially changing wavelet transformation scales and carrying out continuous wavelet transformation according to the wavelet transformation function and the row characteristics of the protein acetylation two-dimensional matrix characteristics to obtain a corresponding transformation coefficient under each scale;

S3.3, constructing a scale matrix by taking the corresponding transformation coefficient under each scale as a row vector.

S4, inputting the scale matrix into the deep convolutional neural network model, and outputting the recognition result of the protein acetylation site.
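Steps S3.1 to S3.3 can be sketched with the standard library alone. Several points are assumptions rather than statements from the text: the Morlet form of the wavelet, the approximation of the integral by a sum over column positions, the reading of "b is 1" as a shift that advances one position at a time, and the use of the modulus of the complex coefficient. The names `morlet` and `cwt_row` are hypothetical.

```python
import cmath
import math

OMEGA0 = 3.0            # center frequency given in the text
SCALES = range(1, 49)   # a = 1, 2, ..., 48 as given in the text


def morlet(t, omega0=OMEGA0):
    """Complex Morlet wavelet (assumed form; the text only names omega0)."""
    return math.pi ** -0.25 * cmath.exp(1j * omega0 * t) * math.exp(-t * t / 2.0)


def cwt_row(signal, scales=SCALES):
    """Return a len(scales) x len(signal) matrix of |CWT| coefficients.

    Each scale contributes one row vector, as in step S3.3. The integral is
    approximated by a discrete sum, and b visits every position in steps of 1.
    """
    n = len(signal)
    matrix = []
    for a in scales:
        row = []
        for b in range(n):
            acc = 0j
            for t, value in enumerate(signal):
                if value:  # zero entries contribute nothing to the sum
                    acc += value * morlet((t - b) / a).conjugate()
            row.append(abs(acc) / math.sqrt(a))
        matrix.append(row)
    return matrix


# First row of the WCKDY example from the text.
first_row = [int(c) for c in "1000101110100011111010001"]
scale_matrix = cwt_row(first_row)
```

For the 25-column example row this yields a 48 × 25 scale matrix; a 105-column row from a full 21-residue fragment would yield 48 × 105, which matches the input size stated for the network.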
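To give a feel for how a scale matrix passes through the network, the sketch below traces the feature-map sizes for the layer parameters stated in this embodiment: a 48 × 105 input and three blocks of 5 × 5 convolution with stride 2 × 2 followed by 2 × 2 max pooling with stride 2 × 2. The padding scheme is not stated in the text; "same"-style padding for the convolutions is assumed here, and the function names are hypothetical.

```python
import math


def conv_same(size, stride):
    """Output length of a convolution with 'same'-style padding (assumed)."""
    return math.ceil(size / stride)


def pool(size, window, stride):
    """Output length of max pooling without padding."""
    return (size - window) // stride + 1


def trace_shapes(height=48, width=105, blocks=(32, 64, 128)):
    """Follow (layer, channels, height, width) through conv -> ReLU -> pool blocks.

    ReLU does not change the shape, so it is not listed separately.
    """
    shapes = []
    for channels in blocks:
        height, width = conv_same(height, 2), conv_same(width, 2)
        shapes.append(("conv", channels, height, width))
        height, width = pool(height, 2, 2), pool(width, 2, 2)
        shapes.append(("pool", channels, height, width))
    return shapes


shapes = trace_shapes()
```

Under this assumption the last pooling layer leaves 128 feature maps of size 1 × 1, which feed the 200-neuron fully connected layer. Without padding, the height would shrink from 48 to 22, 11, 4 and then 2, so the third 5 × 5 convolution would receive a map smaller than its filter; this is why some padding has to be assumed.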

Further as a preferred embodiment of the method, the step of constructing the deep convolutional neural network model specifically includes:

Specifically, the data on protein acetylation modification sites are obtained by screening the UniProtKB (Universal Protein Resource Knowledgebase) database, with the following steps: collecting the protein entries annotated as human (i.e., Homo sapiens) in the database; collecting the human proteins annotated as acetylated at lysine residues (i.e., N6-acetyllysine); treating every lysine in the amino acid sequence of a human protein that is annotated as "N6-acetyllysine" as an acetylation modification site; and collecting the amino acid sequence information of the protein acetylation modification sites.

Extracting 80% of the input data positive samples with their corresponding output data positive sample labels, and 80% of the input data negative samples with their corresponding output data negative sample labels, as a training set to train the deep convolutional neural network model;

and extracting the remaining 20% of the input data positive samples with their corresponding output data positive sample labels, and of the input data negative samples with their corresponding output data negative sample labels, as a test set to test the deep convolutional neural network model.
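The 80%/20% division can be sketched as below. Splitting each class separately and using a fixed random seed are assumptions made for reproducibility of the illustration; the text only gives the proportions. The function name `split_samples` and the placeholder sample names are hypothetical.

```python
import random


def split_samples(positives, negatives, train_fraction=0.8, seed=0):
    """Shuffle each class separately and split it into training and test parts.

    Labels are 1 for acetylated fragments and 0 for non-acetylated ones.
    Per-class splitting and the fixed seed are assumptions of this sketch.
    """
    rng = random.Random(seed)
    train, test = [], []
    for samples, label in ((positives, 1), (negatives, 0)):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train += [(sample, label) for sample in shuffled[:cut]]
        test += [(sample, label) for sample in shuffled[cut:]]
    rng.shuffle(train)  # mix the two classes before training
    return train, test


# Placeholder samples standing in for scale matrices.
pos = [f"p{i}" for i in range(10)]
neg = [f"n{i}" for i in range(20)]
train_set, test_set = split_samples(pos, neg)
```

Each element of the returned lists pairs one input sample with its output label, so positives and negatives keep the same 80%/20% ratio in both sets.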

The deep convolutional neural network model adopts the following architecture and parameters: input layer (48 × 105), convolutional layer (32 filters of size 5 × 5, stride 2 × 2), activation layer (ReLU), pooling layer (max pooling, size 2 × 2, stride 2 × 2), convolutional layer (64 filters of size 5 × 5, stride 2 × 2), activation layer (ReLU), pooling layer (max pooling, size 2 × 2, stride 2 × 2), convolutional layer (128 filters of size 5 × 5, stride 2 × 2), activation layer (ReLU), pooling layer (max pooling, size 2 × 2, stride 2 × 2), fully connected layer (200 neurons, dropout 0.5), fully connected layer (2 neurons), softmax layer, output layer (2 neurons). The model is trained with a stochastic gradient descent optimizer; the momentum, learning rate and maximum number of training epochs are set to 0.9, 0.01 and 50, respectively.

As shown in Table 1, the training set achieved an overall accuracy of 60.20%, a sensitivity of 58.37%, a specificity of 62.03%, a Matthews correlation coefficient of 0.2041 and an area under the receiver operating characteristic curve of 0.3528. The test set yielded an overall accuracy of 61.25%, a sensitivity of 59.76%, a specificity of 62.74%, a Matthews correlation coefficient of 0.2250 and an area under the receiver operating characteristic curve of 0.3388. These results demonstrate that the present embodiment is able to identify potential protein acetylation modification sites.
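The reported accuracy, sensitivity, specificity and correlation coefficient are related through the confusion matrix, which the sketch below makes explicit. It uses the standard definitions of these metrics; the function name `binary_metrics` is hypothetical, and the assumption of equally many positive and negative samples is made only for this illustration and is not stated in the text.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, specificity, mcc


# Training-set sensitivity and specificity from the text, expressed as
# per-class fractions under the ASSUMPTION of balanced classes.
sens, spec = 0.5837, 0.6203
acc, _, _, mcc = binary_metrics(tp=sens, fn=1 - sens, tn=spec, fp=1 - spec)
```

Under the balanced-class assumption, the training-set sensitivity and specificity give an accuracy of 0.6020 and a Matthews correlation coefficient of about 0.204, in line with the reported 60.20% and 0.2041.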

Further as a preferred embodiment of the method, the step of obtaining a sequence of a non-protein acetylation modification site according to the collected data information of the protein acetylation site specifically includes:

As shown in FIG. 2, a modification site recognition system comprises:

the matrix characteristic acquisition module is used for performing two-dimensional matrix conversion on amino acids in the protein acetylation amino acid sequence fragment to obtain protein acetylation two-dimensional matrix characteristics;

The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.

A modification site recognition device:

at least one processor;

at least one memory for storing at least one program;

The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.

A storage medium storing processor-executable instructions which, when executed by a processor, implement the modification site recognition method described above.

The contents in the above method embodiments are all applicable to the present storage medium embodiment, the functions specifically implemented by the present storage medium embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the above method embodiments.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for identifying a modification site, comprising the steps of:

2. The method for identifying a modification site according to claim 1, wherein the step of obtaining the acetylated modification site of the protein and the corresponding acetylated amino acid sequence fragment of the protein comprises:

obtaining protein acetylation modification sites;

3. The method for identifying modification sites according to claim 2, wherein the step of performing two-dimensional matrix transformation on amino acids in the protein acetylated amino acid sequence fragment to obtain the protein acetylated two-dimensional matrix features specifically comprises:

4. The method for identifying acetylated protein modification sites of claim 3, wherein the step of performing continuous wavelet transform on the acetylated protein two-dimensional matrix features based on a wavelet transform function to obtain a scale matrix specifically comprises:

presetting a wavelet transformation function and a scale;

5. The method for identifying modification sites according to claim 4, wherein the step of constructing the deep convolutional neural network model specifically comprises:

and extracting a positive sample of input data, a corresponding positive sample mark of output data, a negative sample of input data and a corresponding negative sample mark of output data in a preset proportion, and using the positive sample, the corresponding positive sample mark, the corresponding negative sample and the corresponding negative sample mark of the output data as a test set to test the deep convolutional neural network model.

6. The method for identifying a modification site according to claim 5, wherein the step of obtaining the sequence of the non-protein acetylation modification site according to the collected data information of the protein acetylation site comprises:

7. A modification site recognition system comprising:

8. A modification site recognition device, comprising:

at least one processor;

at least one memory for storing at least one program;

the at least one program, when executed by the at least one processor, causes the at least one processor to implement the modification site recognition method according to any one of claims 1 to 6.

9. A storage medium having stored therein instructions executable by a processor, the storage medium comprising: the processor-executable instructions, when executed by a processor, are for implementing a modification site recognition method according to any one of claims 1 to 6.