CN113012752A - Alpha transmembrane protein secondary and topological structure prediction method and system - Google Patents


Info

Publication number
CN113012752A
Authority
CN
China
Prior art keywords
transmembrane protein
learning model
deep learning
alpha
topological structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110332960.8A
Other languages
Chinese (zh)
Inventor
Guanning Lin (林关宁)
Zhe Liu (刘喆)
Han Wang (王晗)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202110332960.8A
Publication of CN113012752A
Current legal status: Pending

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 - Sequence alignment; Homology search
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B50/00 - ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of computer technology and provides a method and a system for predicting the secondary structure and topological structure of alpha transmembrane proteins, wherein the method comprises the following steps: constructing a deep learning model: acquiring a large number of alpha transmembrane protein sequences from a transmembrane protein database as the data set for building the deep learning model, and generating secondary structure labels and topological structure labels; performing feature encoding using one-hot encoding and HHblits attributes as the input features of the model, with padding columns appended at the end of the one-hot encoding and HHblits attributes so that sequences fit a fixed-length sliding window; obtaining one feature map per residue of the alpha transmembrane protein sequence after the sliding window, wherein each feature map corresponds to two labels: a secondary structure label and a topological structure label; inputting a new alpha transmembrane protein sequence into the deep learning model, and performing data preprocessing, feature encoding, and prediction output. The invention has the advantage of predicting the secondary structure and the topological structure of alpha transmembrane proteins simultaneously.

Description

Alpha transmembrane protein secondary and topological structure prediction method and system
Technical Field
The invention relates to the field of computer technology, in particular to the application of computational biology to protein structure prediction, and specifically to a method and a system for predicting the secondary structure and topological structure of alpha transmembrane proteins, which predict secondary-structure-level indices of a protein from its sequence using deep learning.
Background
Membrane proteins participate in a variety of important biological mechanisms and processes and carry out the major functions of biological membranes. Transmembrane proteins, a highly representative group of membrane proteins, extend across both sides of a biological membrane and act as portals or receptors throughout the membrane's molecular life cycle. Transmembrane proteins are involved in many life processes, such as the regulation of cell mechanics, signal transduction, and molecular transport. They are also associated with many types of disease, such as autism, dyslipidemia, and various cancers. Because transmembrane proteins play multiple roles in basic physiology and pathophysiology, accurate knowledge of transmembrane protein structures is of great significance for drug development and the interpretation of life activities; however, at the present stage, protein crystal structure determination remains costly and time-consuming, so computational methods for assisting protein structure prediction are very worthwhile work.
Predicting the tertiary structure of a protein from its amino acid sequence using computational methods has long been very challenging. Until the release of DeepMind's AlphaFold2, no computational tool could predict protein tertiary structure with high accuracy, and AlphaFold2 itself is not open and has certain drawbacks. Against this background, indirect indices at the secondary-structure level, such as a protein's secondary structure and topological structure, can be predicted; provided these indices are predicted accurately, they indirectly reflect the protein's tertiary structure information. Since protein secondary structure prediction was first proposed in 1951, many similar efforts have achieved significant results in protein structure prediction. It is worth noting that, owing to the chemical differences and the large structural differences between transmembrane proteins and water-soluble proteins, tools for the two classes are not always interchangeable.
Protein secondary structure prediction generally classifies protein sequences, in units of fragments, into 3 major classes: helix (Helix), strand (Strand), and coil (Coil); finer-grained schemes subdivide these three classes further. Transmembrane protein topology prediction divides protein sequences, in fragment units, into "intramembrane", "extramembrane", and "transmembrane region", or simply into "transmembrane region" and "non-transmembrane region". In the present invention, the three-way classification into helix, strand, and coil is referred to as secondary structure prediction, and the two-way classification into transmembrane and non-transmembrane regions as topology prediction.
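For concreteness, the sketch below shows one conventional way such labels can be derived. The eight-to-three reduction of DSSP states is standard practice, but the helper names and the single-letter topology codes are illustrative assumptions, not taken from this disclosure.

```python
# Minimal sketch (assumptions noted above): mapping per-residue DSSP
# states to the three secondary structure classes, and per-residue
# region labels to the two topology classes.

# H/G/I -> Helix, E/B -> Strand, everything else -> Coil.
DSSP_TO_3CLASS = {
    "H": "H", "G": "H", "I": "H",   # alpha-, 3-10, and pi-helix
    "E": "E", "B": "E",             # extended strand, beta-bridge
    "T": "C", "S": "C", "-": "C",   # turn, bend, loop
}

def three_class(dssp_states: str) -> str:
    """Map a DSSP state string to Helix (H) / Strand (E) / Coil (C)."""
    return "".join(DSSP_TO_3CLASS.get(s, "C") for s in dssp_states)

def two_class_topology(regions: str) -> str:
    """Collapse per-residue topology to transmembrane (M) or not (N)."""
    return "".join("M" if r == "M" else "N" for r in regions)

print(three_class("HHHH--EE-TTGG"))  # -> HHHHCCEECCCHH
```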
In recent years, a number of tools for secondary-structure-level prediction of transmembrane proteins have emerged, such as the transmembrane protein solvent-accessible surface area predictor TMP-SSurface, the transmembrane protein topology predictor DMCTOP, and TM-ZC, which predicts the vertical distance of transmembrane protein residues from the membrane center. Notably, many of the recently introduced prediction tools with better performance are based on deep learning rather than the earlier machine learning or statistical methods, indicating that deep learning has advantages over traditional machine learning in large-scale sequence learning tasks.
At present there are many protein secondary structure prediction tools; representative examples include SSpro5, PSIPRED 4, RaptorX-Property, Porter 5, Spider3, SPOT-1D, MUFOLD-SSW, and JPred4. There are likewise very many tools for predicting transmembrane protein topology, such as HMMTOP 2, OCTOPUS, TOPCONS, Philius, PolyPhobius, SCAMPI, and SPOCTOPUS.
As the above prior art shows, no existing tool can output the secondary structure and the topological structure of a transmembrane protein simultaneously with a single prediction model; more specifically, existing tools cannot distinguish a transmembrane-region helix from a non-transmembrane-region helix, which inevitably hinders further research on transmembrane protein structure. In addition, testing shows that the secondary structure prediction tools SSpro5, PSIPRED 4, SPOT-1D, MUFOLD-SSW, and JPred4 limit the input sequence length to 1500, 1500, 750, 700, and 800 residues respectively, which means that proteins with long sequences cannot be input into these models in one pass. Meanwhile, on our independently organized test set of 50 transmembrane protein sequences, every tool without exception required more than 100 minutes of processing time to produce results, and several tools, represented by DeepCNF, required 1000 to 3000 minutes, which is a very high computational cost for users. SSpro5, the most accurate secondary structure prediction tool, requires protein templates as auxiliary input, so it cannot accept protein sequences of unknown structure, which greatly limits its usage scenarios. As for topology prediction tools, both prediction accuracy and output speed still need further improvement.
Disclosure of Invention
In view of the above problems, the present invention aims to provide a method and a system for predicting the secondary structure and the topological structure of alpha transmembrane proteins that can predict both simultaneously; more specifically, it remedies the inability of existing tools to distinguish a transmembrane-region helix from a non-transmembrane-region helix, removes the limitation on input sequence length, and makes accurate and reliable predictions with as little computational overhead as possible.
The above object of the present invention is achieved by the following technical solutions:
an alpha transmembrane protein secondary and topological structure prediction method comprises the following steps:
S1, constructing a deep learning model for predicting the secondary structure and the topological structure, which specifically comprises the following steps:
S11: acquiring a large number of alpha transmembrane protein sequences from a transmembrane protein database as the data set for building a deep learning model for predicting the secondary structure and the topological structure, and generating secondary structure labels and topological structure labels;
S12: performing feature encoding using one-hot encoding and HHblits attributes as the input features of the model, and appending padding columns at the end of the one-hot encoding and HHblits attributes so that sequences fit a fixed-length sliding window;
S13: obtaining, for each residue in the alpha transmembrane protein sequence, a feature map after the sliding window, wherein each feature map corresponds to two labels: the secondary structure label and the topological structure label;
S2: building and training the deep learning model, whose framework, from input to output, comprises in order: a preprocessing layer, a grouped convolutional layer, a bidirectional long short-term memory network layer, an attention layer, and a normalization output layer;
S3: inputting a new alpha transmembrane protein sequence into the deep learning model, performing data preprocessing, feature encoding, and prediction output, and saving the prediction output to a corresponding file.
Further, in step S11, a large number of alpha transmembrane protein sequences are obtained from the transmembrane protein database as a data set for constructing a deep learning model for predicting secondary structure and topological structure, which specifically includes:
removing, from the alpha transmembrane protein sequences obtained from the transmembrane protein database, sequences containing unknown amino acids and sequences less than 30 residues in length;
performing redundancy removal on the alpha transmembrane protein sequences using CD-HIT software with a fixed threshold.
Further, the data set is divided into a training set, a validation set, and an independent test set.
Further, in step S11, generating the secondary structure label and the topological structure label specifically comprises:
inputting the PDB files storing the alpha transmembrane protein sequences of the data set into DSSP software to obtain DSSP files, and extracting the secondary structure labels from the DSSP files;
extracting the topological structure tag directly from an XML file in the transmembrane protein database.
Further, in step S12, the method further includes:
The one-hot encoding is a process that converts categorical variables into a form that can be provided to a machine learning algorithm for prediction; specifically, it is a sparse vector with one element set to 1 and all other elements set to 0. Here the one-hot code has length 20, and the position representing the particular amino acid is marked as 1.
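A minimal sketch of such an encoding follows, assuming the canonical 20-letter amino acid alphabet; the alphabet ordering and the function name are illustrative.

```python
# Minimal sketch: one-hot encode a protein sequence into an L x 20 matrix.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"               # canonical 20 residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (len(sequence), 20) sparse 0/1 matrix: for each residue,
    the column of its amino acid is 1 and all other columns are 0."""
    encoded = np.zeros((len(sequence), 20), dtype=np.float32)
    for row, aa in enumerate(sequence):
        encoded[row, AA_INDEX[aa]] = 1.0
    return encoded

print(one_hot_encode("MKT").shape)  # (3, 20)
```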
Further, in step S12, the method further includes:
The HHblits attribute is a 30-dimensional vector output by the HHblits tool after alignment against an alignment library; it represents the degree of similarity and conservation between the current sequence and the sequences in the library.
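As an illustration, such a profile is typically produced by running the HHblits tool externally; the sketch below uses HH-suite's standard -i/-d/-ohhm options, and the database path is an assumption based on the embodiment described later.

```python
# Illustrative sketch only: call HHblits to write the profile (.hhm) file
# from which the 30-dimensional per-residue vectors are read.
import subprocess

def run_hhblits(fasta_path: str, hhm_path: str,
                db: str = "uniprot20_2016_02/uniprot20_2016_02") -> None:
    # -i: input sequence, -d: alignment database, -ohhm: output profile.
    subprocess.run(["hhblits", "-i", fasta_path, "-d", db, "-ohhm", hhm_path],
                   check=True)
```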
Further, the deep learning model comprises a grouped convolutional layer, a bidirectional long short-term memory network layer, an attention-and-dropout layer, and a normalization output layer.
A system for performing the above alpha transmembrane protein secondary and topological structure prediction method, comprising:
the learning model establishing module is used for establishing a deep learning model for predicting the secondary structure and the topological structure, and specifically comprises:
the data set acquisition unit is used for acquiring a large number of alpha transmembrane protein sequences from the transmembrane protein database as the data set for building the deep learning model for predicting the secondary structure and the topological structure, and for generating secondary structure labels and topological structure labels;
the feature encoding unit is used for performing feature encoding using one-hot encoding and HHblits attributes as the input features of the model, and for appending padding columns at the end of the one-hot encoding and HHblits attributes so that sequences fit a fixed-length sliding window;
a label establishing unit, configured to obtain a feature map after a sliding window is performed on each residue in the alpha transmembrane protein sequence, where each feature map corresponds to two labels including the secondary structure label and the topological structure label;
the learning model training module is used for building and training the deep learning model, whose framework, from input to output, comprises in order: a preprocessing layer, a grouped convolutional layer, a bidirectional long short-term memory network layer, an attention layer, and a normalization output layer;
and the prediction output module is used for inputting the new alpha transmembrane protein sequence into the deep learning model, performing data preprocessing, feature coding and prediction output, and storing the prediction output in a corresponding file.
An electronic device comprising a processor and a memory, wherein at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and wherein the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the method as described above.
A computer readable storage medium storing computer code which, when executed, performs a method as described above.
Compared with the prior art, the invention has the following beneficial effects:
the invention can output the secondary structure and topological structure of the protein simultaneously when the alpha transmembrane protein sequence is taken as input under the conditions of minimum calculation overhead (minimum using characteristic number) of a similar tool and no limitation on the length of the input sequence. Through comparison with the predicted accuracy performance of the secondary structure tool on TEST50, the output result of the method takes the shortest time, the accuracy exceeds the average level, and no input limit on the length of any sequence exists; in comparison with the predicted accuracy performance of the topology tool on TEST50, our invention is the most accurate of the same type of tool. It is worth mentioning that only tools currently available on the market, which are unique to the present invention, are capable of distinguishing between transmembrane and non-transmembrane helices.
Drawings
FIG. 1 is an overall flow chart of an alpha transmembrane protein secondary and topological structure prediction method of the present invention;
FIG. 2 is a design objective diagram of the present invention;
FIG. 3 is a diagram of the internal framework from input data to prediction output for the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
First embodiment
As shown in FIGS. 1 and 2, this example provides a method for predicting secondary and topological structures of alpha transmembrane proteins, comprising the following steps:
S1, constructing a deep learning model for predicting the secondary structure and the topological structure.
Specifically, before the secondary structure and the topological structure can be predicted, a deep learning model for prediction must first be established; when predicting on actual data, a new alpha transmembrane protein sequence is then input directly into the deep learning model, which outputs the corresponding secondary structure labels and topological structure labels.
Establishing a deep learning model, specifically comprising the following steps:
s11: and acquiring a large number of alpha transmembrane protein sequences from a transmembrane protein database to serve as a data set for building a deep learning model for predicting a secondary structure and a topological structure, and manufacturing a secondary structure label and a topological structure label.
Specifically, this embodiment requires a set of non-redundant standard databases for training the deep learning model. We downloaded 4336 alpha transmembrane protein sequences from the transmembrane protein database (version: 2020-2-7), removing sequences containing unknown amino acids (denoted by the letter "X") and sequences shorter than 30 residues. Next, we removed redundancy among these sequences using CD-HIT software at a threshold of 30%, leaving a total of 911 alpha transmembrane proteins. After this, we divided the data set at a ratio of 811:50:50 into a training set, a validation set, and an independent TEST set (named "TEST50"). Then we input the PDB (Protein Data Bank) files into DSSP software to obtain DSSP files and extracted the secondary structure labels from them; the topology labels were extracted directly from the downloadable XML files in PDBTM (the transmembrane protein database).
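A minimal sketch of the filtering and splitting rules just described follows (helper names and the random seed are illustrative; the CD-HIT redundancy-removal step runs as an external program and is not reproduced here):

```python
# Minimal sketch: drop sequences with unknown residues ("X") or fewer
# than 30 residues, then split the 911 retained proteins 811:50:50.
import random

def filter_sequences(seqs: dict) -> dict:
    return {name: s for name, s in seqs.items()
            if "X" not in s and len(s) >= 30}

def split_dataset(names: list, seed: int = 0):
    names = list(names)
    random.Random(seed).shuffle(names)
    train, val = names[:811], names[811:861]
    test50 = names[861:911]   # the 50-protein independent test set
    return train, val, test50
```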
S12: performing feature encoding using one-hot encoding and HHblits attributes as the input features of the model, and appending padding columns at the end of the one-hot encoding and HHblits attributes so that sequences fit a fixed-length sliding window.
Specifically, after dividing the data set and generating the labels, the feature encoding stage begins. In the present invention, we use one-hot encoding and HHblits attributes as the features of the input model. One-hot encoding is the process of converting categorical variables into a form that can be provided to a machine learning algorithm for better prediction: a sparse vector in which one element is set to 1 and all other elements are set to 0. In this problem, the one-hot code has length 20, and the position representing the particular amino acid is labeled 1. The HHblits attribute is a 30-dimensional vector output by the HHblits tool; in the present invention we use "uniprot20_2016_02" as the alignment library, and the HHblits vector indicates the degree of similarity and conservation between the current sequence and the sequences in the alignment library. After that, we append a column called "NoSeq" at the end of the 20-dimensional one-hot encoding and the 30-dimensional HHblits attribute respectively; because each sequence must fit a sliding window of length 19, we pad the front and back of the feature map, and the NoSeq column indicates whether a row was produced by padding.
S13: for each residue in the alpha transmembrane protein sequence, obtaining a feature map after a sliding window, wherein each feature map corresponds to two tags including the secondary structure tag and the topological structure tag.
Specifically, after the above steps, a 19 × 52 feature map is obtained for each residue (in a protein sequence, the amino and carboxyl groups between amino acids are joined by dehydration condensation, and the remaining non-dehydrated groups are called residues). A data set with a given number of residues thus yields the same number of feature maps, each corresponding to two labels: the secondary structure label and the topological structure label of that residue.
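The sketch below illustrates this windowing, assuming a window length of 19 and a 52-column row layout of 20 one-hot columns + NoSeq + 30 HHblits columns + NoSeq; the column ordering and helper names are our assumptions.

```python
# Illustrative sketch: pad each end with 9 rows flagged as NoSeq, then
# cut one 19 x 52 feature map per residue with a sliding window.
import numpy as np

WINDOW = 19
HALF = WINDOW // 2   # 9 padding rows at each end

def build_feature_maps(one_hot: np.ndarray, hhblits: np.ndarray) -> np.ndarray:
    """one_hot: (L, 20), hhblits: (L, 30) -> feature maps of shape (L, 19, 52)."""
    length = one_hot.shape[0]
    # Append a NoSeq column (0 for real rows) to each feature block.
    feats = np.concatenate(
        [one_hot, np.zeros((length, 1)), hhblits, np.zeros((length, 1))],
        axis=1)                                    # (L, 52)
    pad = np.zeros((HALF, 52))
    pad[:, 20] = 1.0    # NoSeq flag of the one-hot block
    pad[:, 51] = 1.0    # NoSeq flag of the HHblits block
    padded = np.vstack([pad, feats, pad])          # (L + 18, 52)
    # Window i is centered on residue i of the original sequence.
    return np.stack([padded[i:i + WINDOW] for i in range(length)])

maps = build_feature_maps(np.zeros((100, 20)), np.zeros((100, 30)))
print(maps.shape)  # (100, 19, 52)
```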
S2: building and training the deep learning model, whose framework, from input to output, comprises in order: a preprocessing layer, a grouped convolutional layer, a bidirectional long short-term memory network layer, an attention layer, and a normalization output layer.
S3: inputting a new alpha transmembrane protein sequence into the deep learning model, performing data preprocessing, feature encoding, and prediction output, and saving the prediction output to a corresponding file.
Second embodiment
This embodiment is basically the same as the first embodiment. The deep learning model framework is divided into 4 parts, namely a grouped convolutional layer, a bidirectional long short-term memory network layer, an attention and dropout (random deactivation) layer, and a normalization output layer, as shown in FIG. 3. The code implementation, training, and testing of the whole model framework use the Keras and TensorFlow deep learning frameworks, and the whole experiment was performed on a 1080Ti GPU. During training, "early stopping" and "best-model saving" learning strategies are used, and the main hyperparameters (including the sliding window length, the dropout rate, and the number of LSTM units) are optimized to the values that give the best results on the validation set. The neural network is initialized with Keras's random initialization, and the parameters are trained with the Adam optimizer.
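For illustration, a minimal Keras/TensorFlow sketch of a model with this shape follows. The layer widths, group count, dropout rate, and the use of Keras's built-in Attention layer are placeholder assumptions (the embodiment tunes such hyperparameters on the validation set); this is a sketch of the described architecture, not the patent's exact implementation.

```python
# Minimal sketch (assumptions noted above): grouped convolution ->
# bidirectional LSTM -> attention + dropout -> two softmax outputs.
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(19, 52))              # one 19 x 52 feature map
# Grouped convolution over the window (Conv1D's `groups` argument,
# available in recent TensorFlow releases; 52 channels / 4 groups).
x = layers.Conv1D(64, kernel_size=3, padding="same",
                  groups=4, activation="relu")(inputs)
# Bidirectional long short-term memory layer over the window positions.
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# Self-attention followed by random deactivation (dropout).
x = layers.Attention()([x, x])
x = layers.Dropout(0.3)(x)
x = layers.GlobalAveragePooling1D()(x)
# Two normalized (softmax) outputs for the window's center residue:
# 3-class secondary structure and 2-class topology.
ss_out = layers.Dense(3, activation="softmax", name="secondary")(x)
topo_out = layers.Dense(2, activation="softmax", name="topology")(x)

model = tf.keras.Model(inputs, [ss_out, topo_out])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The "early stopping" and "best-model saving" strategies mentioned above correspond to passing tf.keras.callbacks.EarlyStopping and tf.keras.callbacks.ModelCheckpoint to model.fit.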
Third embodiment
This embodiment provides a system for performing the alpha transmembrane protein secondary and topological structure prediction method of the first embodiment, comprising:
the learning model building module 1 is configured to build a deep learning model for predicting the secondary structure and the topological structure, and specifically comprises:
the data set acquisition unit 11 is used for acquiring a large number of alpha transmembrane protein sequences from the transmembrane protein database as the data set for building the deep learning model for predicting the secondary structure and the topological structure, and for generating secondary structure labels and topological structure labels;
a feature encoding unit 12, configured to perform feature encoding using one-hot encoding and HHblits attributes as the input features of the model, and to append padding columns at the end of the one-hot encoding and HHblits attributes so that sequences fit a fixed-length sliding window;
a tag establishing unit 13, configured to obtain a feature map after a sliding window is performed on each residue in the alpha transmembrane protein sequence, where each feature map corresponds to two tags including the secondary structure tag and the topological structure tag;
the learning model training module 2 is used for building and training the deep learning model, whose framework, from input to output, comprises in order: a preprocessing layer, a grouped convolutional layer, a bidirectional long short-term memory network layer, an attention layer, and a normalization output layer;
and the prediction output module 3 is used for inputting the new alpha transmembrane protein sequence into the deep learning model, performing data preprocessing, feature coding and prediction output, and storing the prediction output in a corresponding file.
A computer readable storage medium storing computer code which, when executed, performs the method as described above. Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions within the idea of the present invention fall within its protection scope.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any such combination should be considered within the scope of this specification as long as it contains no contradiction.
It should also be noted that the above embodiments can be freely combined as necessary, and that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention; such modifications and refinements should also be regarded as falling within the protection scope of the present invention.
The software program of the present invention can be executed by a processor to implement the steps or functions described above. Likewise, the software programs of the present invention (including associated data structures) can be stored in a computer-readable recording medium, such as RAM, a magnetic or optical drive, a floppy disk, or the like. In addition, some steps or functions of the present invention may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform the various functions or steps. The methods disclosed in the embodiments of this specification can be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of this specification may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in connection with the embodiments of this specification may be embodied directly in a hardware decoding processor, or in a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The embodiments also provide a computer-readable storage medium storing one or more programs that, when executed by an electronic system including a plurality of application programs, cause the electronic system to perform the method of the first embodiment, which will not be described again here.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The systems, devices, modules, or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices. It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In addition, part of the present invention may be implemented as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of the computer. Program instructions that invoke the methods of the present invention may be stored on a fixed or removable recording medium, and/or transmitted via a data stream on a broadcast or other signal-bearing medium, and/or stored within a working memory of a computer device operating according to the program instructions. An embodiment according to the present invention comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform the methods and/or technical solutions according to the embodiments of the present invention described above.

Claims (10)

1. An alpha transmembrane protein secondary and topological structure prediction method is characterized by comprising the following steps:
S1, constructing a deep learning model for predicting the secondary structure and the topological structure, which specifically comprises the following steps:
S11: acquiring a large number of alpha transmembrane protein sequences from a transmembrane protein database as the data set for building a deep learning model for predicting the secondary structure and the topological structure, and generating secondary structure labels and topological structure labels;
S12: performing feature encoding using one-hot encoding and HHblits attributes as the input features of the model, and appending padding columns at the end of the one-hot encoding and HHblits attributes so that sequences fit a fixed-length sliding window;
S13: obtaining, for each residue in the alpha transmembrane protein sequence, a feature map after the sliding window, wherein each feature map corresponds to two labels: the secondary structure label and the topological structure label;
S2: building and training the deep learning model, whose framework, from input to output, comprises in order: a preprocessing layer, a grouped convolutional layer, a bidirectional long short-term memory network layer, an attention layer, and a normalization output layer;
S3: inputting a new alpha transmembrane protein sequence into the deep learning model, performing data preprocessing, feature encoding, and prediction output, and saving the prediction output to a corresponding file.
2. The alpha transmembrane protein secondary and topological structure prediction method according to claim 1, wherein in step S11, a large number of alpha transmembrane protein sequences are obtained from a transmembrane protein database as a data set for constructing a deep learning model for predicting secondary structures and topological structures, and the method specifically comprises:
removing, from the alpha transmembrane protein sequences obtained from the transmembrane protein database, sequences containing unknown amino acids and sequences less than 30 residues in length;
performing redundancy removal on the alpha transmembrane protein sequences using CD-HIT software with a fixed threshold.
3. The method for predicting secondary and topological structures of alpha transmembrane proteins according to claim 1, further comprising: dividing the data set into a training set, a validation set, and an independent test set.
4. The alpha transmembrane protein secondary and topological structure prediction method according to claim 1, wherein in step S11, generating the secondary structure label and the topological structure label specifically comprises:
inputting the PDB files storing the alpha transmembrane protein sequences of the data set into DSSP software to obtain DSSP files, and extracting the secondary structure labels from the DSSP files;
extracting the topological structure tag directly from an XML file in the transmembrane protein database.
5. The method for predicting secondary and topological structures of alpha transmembrane proteins according to claim 1, further comprising, in step S12:
the one-hot encoding is a process that converts categorical variables into a form that can be provided to a machine learning algorithm for prediction; specifically, it is a sparse vector with one element set to 1 and all other elements set to 0; the one-hot code has length 20, and the position representing the particular amino acid is marked as 1.
6. The method for predicting secondary and topological structures of alpha transmembrane proteins according to claim 1, further comprising, in step S12:
the HHblits attribute is a 30-dimensional vector output by the HHblits tool after alignment against an alignment library; it represents the degree of similarity and conservation between the current sequence and the sequences in the library.
7. The method for predicting secondary and topological structures of alpha transmembrane proteins according to claim 1, wherein the deep learning model comprises a grouped convolutional layer, a bidirectional long short-term memory network layer, an attention-and-dropout layer, and a normalization output layer.
8. A system for performing the alpha transmembrane protein secondary and topological structure prediction method of any one of claims 1 to 7, comprising:
the learning model establishing module is used for establishing a deep learning model for predicting the secondary structure and the topological structure, and specifically comprises:
the data set acquisition unit is used for acquiring a large number of alpha transmembrane protein sequences from the transmembrane protein database as the data set for building the deep learning model for predicting the secondary structure and the topological structure, and for generating secondary structure labels and topological structure labels;
the feature encoding unit is used for performing feature encoding using one-hot encoding and HHblits attributes as the input features of the model, and for appending padding columns at the end of the one-hot encoding and HHblits attributes so that sequences fit a fixed-length sliding window;
a label establishing unit, configured to obtain a feature map after a sliding window is performed on each residue in the alpha transmembrane protein sequence, where each feature map corresponds to two labels including the secondary structure label and the topological structure label;
the learning model training module is used for building and training the deep learning model, whose framework, from input to output, comprises in order: a preprocessing layer, a grouped convolutional layer, a bidirectional long short-term memory network layer, an attention layer, and a normalization output layer;
and the prediction output module is used for inputting the new alpha transmembrane protein sequence into the deep learning model, performing data preprocessing, feature coding and prediction output, and storing the prediction output in a corresponding file.
9. An electronic device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of any one of claims 1 to 7.
10. A computer readable storage medium storing computer code which, when executed, performs the method of any of claims 1 to 7.
CN202110332960.8A 2021-03-29 2021-03-29 Alpha transmembrane protein secondary and topological structure prediction method and system Pending CN113012752A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110332960.8A CN113012752A (en) 2021-03-29 2021-03-29 Alpha transmembrane protein secondary and topological structure prediction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110332960.8A CN113012752A (en) 2021-03-29 2021-03-29 Alpha transmembrane protein secondary and topological structure prediction method and system

Publications (1)

Publication Number Publication Date
CN113012752A 2021-06-22

Family

ID=76408548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110332960.8A Pending CN113012752A (en) 2021-03-29 2021-03-29 Alpha transmembrane protein secondary and topological structure prediction method and system

Country Status (1)

Country Link
CN (1) CN113012752A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658641A (en) * 2021-07-20 2021-11-16 北京大学 Phage classification method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHE LIU ET AL.: "TMPSS: A Deep Learning-Based Predictor for Secondary Structure and Topology Structure Prediction of Alpha-Helical Transmembrane Proteins", Frontiers in Bioengineering and Biotechnology *
HAN WANG: "Research on Transmembrane Protein Fold Recognition Methods", China Excellent Master's and Doctoral Dissertations Full-text Database (Doctoral), Basic Sciences *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210622)