CN112530517A - Protein structure prediction method, device, platform and storage medium - Google Patents

Publication number
CN112530517A
Authority
CN
China
Prior art keywords: sequence, matching, protein, matched, fragment
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910880279.XA
Other languages
Chinese (zh)
Inventor
郭敏
余晴
林介一
于雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kangmaxin Shanghai Intelligent Technology Co ltd
Original Assignee
Kangmaxin Shanghai Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kangmaxin Shanghai Intelligent Technology Co ltd
Priority to CN201910880279.XA
Publication of CN112530517A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G16B45/00: ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Abstract

The invention discloses a protein structure prediction method, which comprises the following steps: extracting a target sequence from a protein file to be detected; matching the target sequence in a protein database with a known structure, and finding out a matched sequence; acquiring a matching structure of the matching sequence according to the matching sequence; constructing an initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof; combining the unmatched sequence segments of the target sequence with a portion of the adjacent matched sequence segments into a sub-target sequence; searching a matching subsequence of the sub-target sequence and the structure thereof in a protein database with a known structure; and filling up the missing part in the initial three-dimensional structure model according to the searched matching subsequence and the structure thereof, and obtaining the three-dimensional structure of the protein file to be detected. By adopting the protein structure prediction method, the structure of the protein can be predicted more accurately.

Description

Protein structure prediction method, device, platform and storage medium
Technical Field
The invention relates to the field of protein structure prediction, in particular to a protein structure prediction method, a protein structure prediction device, a protein structure prediction platform and a storage medium.
Background
Studies of the three-dimensional structure of proteins can explain protein function at the molecular level. For example, protein complexes are often the core machinery of many cellular metabolic processes. Describing, at the molecular level, the contacts between the components of a protein complex and its quaternary structure helps researchers understand how the complex operates in its metabolic environment and suggests ways to regulate that operation. In recent years, the number of protein complexes in the Protein Data Bank (PDB) has also grown rapidly, owing to the continued development of structure determination methods based on electron microscopy. However, the speed at which the three-dimensional structures of protein complexes can be determined experimentally still cannot keep up with the demand for high-throughput screening of protein-protein complexes. Computational prediction of protein structure is therefore an effective way to address this problem.
At present, protein structure prediction methods include ab initio prediction, threading, and homology modeling, chosen according to the situation. Compared with the first two, homology modeling is the most mature and most widely applied method. During evolution, homologous proteins retain similar sequences and similar structures. Based on this principle, homology modeling searches a database for an amino acid sequence with a known structure (the template sequence), compares the similarity between the target protein and candidate template sequences to select an optimal modeling template, and then maps the residues of the target protein sequence onto the protein structure of the template to complete the construction of a three-dimensional structure. In homology modeling, when no template sequence can be found that covers the whole target sequence, that is, when the similarity between the target sequence and the template sequence is low, the completeness and accuracy of the predicted target protein structure suffer.
Disclosure of Invention
In order to solve the technical problems, the invention provides a protein structure prediction method, a device, a platform and a storage medium; specifically, the technical scheme of the invention is as follows:
in a first aspect, the present invention provides a method for predicting protein structure, comprising: extracting a target sequence from a protein file to be detected; matching the target sequence in a protein database with a known structure, and finding out a matched sequence; acquiring a matching structure of the matching sequence according to the matching sequence; constructing an initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof; combining an unmatched sequence segment of the target sequence with a portion of an adjacent matched sequence segment into a sub-target sequence; searching the matching subsequence and the structure of the sub-target sequence in the protein database with the known structure; and filling up the missing part in the initial three-dimensional structure model according to the searched matching subsequence and the structure thereof to obtain the three-dimensional structure of the protein file to be detected.
Preferably, combining the unmatched sequence segments of the target sequence with a part of the adjacent matched sequence segments to form a sub-target sequence, and searching for the matching subsequence of the sub-target sequence and its structure in the protein database with a known structure, specifically comprises: taking the sequence segments of the target sequence that match the matching sequence as matched sequence segments; taking the sequence segments of the target sequence that do not match the matching sequence as unmatched sequence segments; when the length of an unmatched sequence segment is greater than a preset length value, determining a cutting site in the adjacent matched sequence segment, and combining the unmatched sequence segment with the adjacent cut matched sequence segment to form a sub-target sequence; when the length of the unmatched sequence segment is less than or equal to the preset length value, intercepting the unmatched sequence segment itself as the sub-target sequence; and matching the sub-target sequence in the protein database with a known structure to find the matching subsequence and its structure.
Preferably, when the fragment length of the unmatched sequence fragment is greater than a preset length value, determining the cleavage sites in the matched sequence fragments and combining the unmatched sequence fragment and the adjacent cut matched sequence fragments to form a sub-target sequence specifically includes: the starting site of the unmatched sequence fragment is set to be $M_{e1}$ and its termination site is $M_{s2}$; the matched sequence fragments adjacent to the unmatched sequence fragment are set to be a first matched sequence fragment and/or a second matched sequence fragment; the first matched sequence fragment has $M_1$ amino acids, its starting site is $M_{s1}$ and its termination site is $M_{e1}$; the second matched sequence fragment has $M_2$ amino acids, its starting site is $M_{s2}$ and its termination site is $M_{e2}$.
When $M_{s2} - M_{e1} > 15$: if the only matched sequence fragment adjacent to the unmatched sequence fragment is the first matched sequence fragment, the sequence fragment from the cleavage site T1 of the first matched sequence to the termination site $M_{s2}$ of the unmatched sequence fragment is intercepted as the sub-target sequence; if the only matched sequence fragment adjacent to the unmatched sequence fragment is the second matched sequence fragment, the sequence fragment from the starting site $M_{e1}$ of the unmatched sequence fragment to the cleavage site T2 of the second matched sequence is intercepted as the sub-target sequence; if the unmatched sequence fragment is adjacent to both a first matched sequence fragment and a second matched sequence fragment, the sequence fragment between the cleavage site T1 of the first matched sequence and the cleavage site T2 of the second matched sequence is intercepted as the sub-target sequence; wherein:
if $M_1/2 > 10$, the cleavage site T1 of the first matched sequence is selected as $M_{e1} - 9$;
if $M_1/2 \le 10$, the cleavage site T1 of the first matched sequence is selected as $M_{e1} - M_1/2 + 1$, wherein $M_1/2$ is rounded up or rounded down;
if $M_2/2 > 10$, the cleavage site T2 of the second matched sequence is selected as $M_{s2} + 9$;
if $M_2/2 \le 10$, the cleavage site T2 of the second matched sequence is selected as $M_{s2} + M_2/2 - 1$, wherein $M_2/2$ is rounded up or rounded down.
When the fragment length of the unmatched sequence fragment is less than or equal to the preset length value, intercepting the unmatched sequence fragment as a sub-target sequence specifically comprises: when $M_{s2} - M_{e1} \le 15$, intercepting the unmatched sequence fragment from $M_{e1}$ to $M_{s2}$ as the sub-target sequence.
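The cleavage-site rules above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function names, the inclusive 1-based site numbering and the `round_up` switch are assumptions made for illustration, where `m_e1` and `m_s2` stand for the starting and termination sites of the unmatched fragment and `m1`, `m2` for the lengths of the first and second matched fragments (passed as `None` when that neighbour is absent).

```python
from typing import Optional, Tuple

GAP_THRESHOLD = 15  # preset length value stated in the text


def _half(n: int, round_up: bool) -> int:
    """n/2 rounded up or down (the text allows either)."""
    return -(-n // 2) if round_up else n // 2


def cut_site_t1(m_e1: int, m1: int, round_up: bool = False) -> int:
    """Cleavage site T1 inside the first (left-hand) matched fragment."""
    if m1 / 2 > 10:
        return m_e1 - 9
    return m_e1 - _half(m1, round_up) + 1


def cut_site_t2(m_s2: int, m2: int, round_up: bool = False) -> int:
    """Cleavage site T2 inside the second (right-hand) matched fragment."""
    if m2 / 2 > 10:
        return m_s2 + 9
    return m_s2 + _half(m2, round_up) - 1


def sub_target_bounds(
    m_e1: int,
    m_s2: int,
    m1: Optional[int] = None,
    m2: Optional[int] = None,
    round_up: bool = False,
) -> Tuple[int, int]:
    """Return the (start, end) sites of the sub-target sequence."""
    if m_s2 - m_e1 <= GAP_THRESHOLD:
        # Short gap: the unmatched fragment itself is the sub-target.
        return m_e1, m_s2
    # Long gap: extend into whichever matched neighbours exist.
    start = cut_site_t1(m_e1, m1, round_up) if m1 is not None else m_e1
    end = cut_site_t2(m_s2, m2, round_up) if m2 is not None else m_s2
    return start, end
```

For example, with an unmatched fragment running from site 50 to site 80, a first matched fragment of 30 residues and a second of 12 residues, `sub_target_bounds(50, 80, m1=30, m2=12)` gives `(41, 85)`: ten residues are borrowed from the long left neighbour and six (half of twelve) from the short right neighbour.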
Preferably, the constructing an initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof specifically includes: acquiring the target sequence and the matched sequence segments in the matched sequence; constructing an initial three-dimensional structure model of the target sequence by taking the matching structure of the matching sequence as a template; in the initial three-dimensional structure model, the structure of the matched sequence segment of the target sequence adopts the structure of the corresponding matched sequence segment in the matched sequence; and processing the structure of the unmatched sequence segment in the target sequence by adopting a deletion structure.
Preferably, when no matching sequence or matching structure is found in the protein database of known structure, a matching sequence segment of the target sequence is found; selecting the best matching sequence segment as a target sequence segment from the searched matching sequence segments; determining a cutting site on the target sequence according to the position relation of the target sequence fragment on the target sequence, and acquiring a sub-target sequence; and searching a template in a protein database with a known structure according to the sub-target sequences to construct an initial three-dimensional structure of the target sequence.
Preferably, determining a cleavage site on the target sequence according to the position relationship of the target sequence fragment on the target sequence, and acquiring the sub-target sequence specifically includes: if the target sequence is set to have N amino acids, the initial site is Ns, and the termination site is Ne; the target sequence fragment has Q amino acids, and the starting site is Qs, and the termination site is Qe;
when Qs-Ns > 15: if Q/2 is more than 10, selecting the cutting site of the target sequence as Qs +9, and intercepting a segment from Ns to (Qs +9) as a sub-target sequence; if Q/2 is less than or equal to 10, selecting the cutting site of the target sequence as (Qs + Q/2-1), and intercepting a segment from Ns to (Qs + Q/2-1) as a sub-target sequence; wherein Q/2 is rounded up or rounded down;
when Qs-Ns is less than or equal to 15, selecting a cutting site of the target sequence as Qs, and intercepting a segment from Ns to Qs as a sub-target sequence;
when Ne-Qe > 15: if Q/2 is more than 10, selecting the cutting site of the target sequence as Qe-9, and intercepting a segment from (Qe-9) to Ne as a sub-target sequence; if Q/2 is less than or equal to 10, selecting the cutting site of the target sequence as (Qe-Q/2+1), and intercepting a segment from (Qe-Q/2+1) to Ne as a sub-target sequence; wherein Q/2 is rounded up or rounded down;
and when Ne-Qe ≦ 15, intercepting the unmatched sequence fragments Qe to Ne as the sub-target sequences.
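As a hedged illustration of the four cases above, the following Python sketch computes the two sub-target sequences on either side of the best-matching target sequence fragment. The function name, the inclusive site numbering, the derivation of Q as Qe - Qs + 1 and the `round_up` switch are assumptions for illustration only. The sketch also relies on the observation that both branches borrow the same number of residues from the matched fragment: 10 when Q/2 > 10, otherwise Q/2.

```python
from typing import Tuple

FLANK_THRESHOLD = 15  # threshold used in the text for Qs-Ns and Ne-Qe


def flank_sub_targets(
    ns: int, ne: int, qs: int, qe: int, round_up: bool = False
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((start, end) of the Ns-side sub-target,
               (start, end) of the Ne-side sub-target)."""
    q = qe - qs + 1  # assumed: Q is the length of the fragment Qs..Qe
    half = -(-q // 2) if round_up else q // 2
    # Residues borrowed from the matched fragment: Qs+9 and Qe-9 both
    # correspond to 10 residues; otherwise Q/2 residues are borrowed.
    borrowed = 10 if q / 2 > 10 else half

    if qs - ns > FLANK_THRESHOLD:
        left = (ns, qs + borrowed - 1)
    else:
        left = (ns, qs)

    if ne - qe > FLANK_THRESHOLD:
        right = (qe - borrowed + 1, ne)
    else:
        right = (qe, ne)

    return left, right
```

With a 200-residue target (Ns = 1, Ne = 200) whose best-matching fragment spans sites 40 to 100, `flank_sub_targets(1, 200, 40, 100)` returns `((1, 49), (91, 200))`.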
Preferably, the protein structure prediction method of the present invention further comprises: reconstructing a side chain of a three-dimensional structure of the protein to be detected; and minimizing the energy of the three-dimensional structure model of the protein to be detected by adopting molecular mechanics.
Preferably, the protein structure prediction method of the present invention further comprises: and predicting the secondary structure of the protein to be detected according to the three-dimensional structure of the protein to be detected and the information of the three-dimensional space structure and the secondary structure of each protein sequence in the protein database with the known structure.
In a second aspect, the present invention discloses a protein structure prediction device, comprising: the sequence extraction module is used for extracting a target sequence from a protein file to be detected; the matching search module is used for matching the target sequence in a protein database with a known structure to search a matching sequence; acquiring a matching structure of the matching sequence according to the matching sequence; the model construction module is used for constructing an initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof; a combining module for combining an unmatched sequence segment of the target sequence with a portion of an adjacent matched sequence segment into a sub-target sequence; searching the matching subsequence and the structure of the sub-target sequence in the protein database with the known structure through the matching searching module; and the filling module is used for filling the missing part in the initial three-dimensional structure model according to the searched matching subsequence and the structure thereof to obtain the three-dimensional structure of the protein file to be detected.
Preferably, the combination module comprises: an acquisition submodule, used for taking the sequence segments of the target sequence that match the matching sequence as matched sequence segments, and taking the sequence segments of the target sequence that do not match the matching sequence as unmatched sequence segments; and a determining submodule, used for determining a cutting site in the matched sequence fragment when the fragment length of the unmatched sequence fragment is greater than a preset length value, and combining the unmatched sequence fragment and the adjacent cut matched sequence fragment to form a sub-target sequence; the determining submodule is further used for intercepting the unmatched sequence fragment as the sub-target sequence when the fragment length of the unmatched sequence fragment is less than or equal to the preset length value; and the matching search module matches the sub-target sequence in the protein database with a known structure to find the matching subsequence and its structure.
Preferably, the determining submodule comprises a selecting unit and an intercepting unit; wherein: the starting site of the unmatched sequence fragment is set to be $M_{e1}$ and its termination site is $M_{s2}$; the matched sequence fragments adjacent to the unmatched sequence fragment are set to be a first matched sequence fragment and/or a second matched sequence fragment; the first matched sequence fragment has $M_1$ amino acids, its starting site is $M_{s1}$ and its termination site is $M_{e1}$; the second matched sequence fragment has $M_2$ amino acids, its starting site is $M_{s2}$ and its termination site is $M_{e2}$.
When $M_{s2} - M_{e1} > 15$: if the only matched sequence fragment adjacent to the unmatched sequence fragment is the first matched sequence fragment, the intercepting unit intercepts the sequence fragment from the cleavage site T1 of the first matched sequence to the termination site $M_{s2}$ of the unmatched sequence fragment as the sub-target sequence; if the only matched sequence fragment adjacent to the unmatched sequence fragment is the second matched sequence fragment, the intercepting unit intercepts the sequence fragment from the starting site $M_{e1}$ of the unmatched sequence fragment to the cleavage site T2 of the second matched sequence as the sub-target sequence; if the unmatched sequence fragment is adjacent to both a first matched sequence fragment and a second matched sequence fragment, the intercepting unit intercepts the sequence fragment between the cleavage site T1 of the first matched sequence and the cleavage site T2 of the second matched sequence as the sub-target sequence; wherein:
if $M_1/2 > 10$, the selecting unit selects the cleavage site T1 of the first matched sequence as $M_{e1} - 9$;
if $M_1/2 \le 10$, the selecting unit selects the cleavage site T1 of the first matched sequence as $M_{e1} - M_1/2 + 1$, wherein $M_1/2$ is rounded up or rounded down;
if $M_2/2 > 10$, the selecting unit selects the cleavage site T2 of the second matched sequence as $M_{s2} + 9$;
if $M_2/2 \le 10$, the selecting unit selects the cleavage site T2 of the second matched sequence as $M_{s2} + M_2/2 - 1$, wherein $M_2/2$ is rounded up or rounded down;
when $M_{s2} - M_{e1} \le 15$: the intercepting unit intercepts the unmatched sequence fragment from $M_{e1}$ to $M_{s2}$ as the sub-target sequence.
Preferably, the protein structure prediction apparatus further comprises: the side chain construction module is used for reconstructing a side chain of a three-dimensional structure of the protein to be detected; and the structure optimization module is used for minimizing the energy of the three-dimensional structure model of the protein to be detected by adopting molecular mechanics.
Preferably, the protein structure prediction apparatus further comprises: and the secondary structure prediction module is used for predicting the secondary structure of the protein to be detected according to the three-dimensional structure of the protein to be detected and by combining the information of the three-dimensional space structure and the secondary structure of each protein sequence in the protein database with the known structure.
In a third aspect, the present invention discloses a storage medium storing a plurality of instructions for execution by one or more processors to perform the steps of the protein structure prediction method of any one of the present invention.
In a fourth aspect, the invention discloses a protein structure prediction platform, which comprises the protein structure prediction device of any one of the invention, wherein the protein structure prediction platform is built on a server and is provided with an online visualization program for the three-dimensional structure and secondary structure of proteins, so as to display the structure of the protein visually.
The invention at least comprises the following technical effects:
(1) According to the technical scheme of the invention, a target sequence of the protein to be detected is extracted, a matching sequence of the target sequence is then searched for, a structure template of the matching sequence is selected, and an initial three-dimensional structure is established. During modeling, the missing parts of the model are handled as follows: the unmatched sequence segments of the target sequence are combined with a part of the adjacent matched sequence segments to form a sub-target sequence; the matching subsequence of the sub-target sequence and its structure are then searched for in the protein database with a known structure to fill the missing part of the initial three-dimensional structure of the protein to be detected, so that the prediction result is more accurate.
(2) According to the technical scheme of the invention, the method for determining the cutting sites of the matched sequences is provided, different determination modes are adopted according to different lengths of unmatched sequences and different lengths of matched sequences, so that new sequence segments (sub-target sequences) obtained by interception are more reasonable, similar sequence structures are more likely to be matched, the corresponding missing parts are filled, and the prediction accuracy is improved.
(3) In the technical scheme of the invention, after the missing parts have been filled in to complete the three-dimensional structure of the protein to be detected, the side chains of the structural model are reconstructed and the energy of the model is minimized by molecular mechanics, so that the predicted structural model is optimized and more stable.
(4) The invention can also use the information in existing databases, including secondary-structure and spatial-structure information, to predict the secondary structure of the target protein sequence, so that the structure of the protein can be displayed more fully in the visualization.
(5) The protein structure prediction platform comprises the protein structure prediction device. The platform is deployed on a server and is provided with visualization programs for protein secondary structure and three-dimensional structure, and a user can log in to the platform from any smart device to run a protein structure prediction. After the platform obtains the protein to be detected that the user has input, the protein structure prediction device in the platform predicts its three-dimensional structure using the protein structure prediction method, and the final prediction result is displayed visually, which makes it convenient and quick for the user to view.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of one embodiment of a protein structure prediction method of the present invention;
FIG. 2 is a schematic diagram of the process of extracting a target sequence of a protein to be detected according to the present invention;
FIG. 3 is a flowchart of another embodiment of a protein structure prediction method according to the present invention;
FIG. 4a is a diagram illustrating a determination of a sequence of sub-targets;
FIG. 4b is a diagram illustrating another determination of the sequence of sub-targets;
FIG. 4c is a diagram illustrating another example of determining the sequence of sub-targets;
FIG. 4d is a schematic diagram of determining a sequence of sub-targets for searching a template structure;
FIG. 5 is a schematic flow chart of another embodiment of the protein structure prediction method of the present invention;
FIG. 6 is a schematic diagram showing the three-dimensional structure prediction result of the protein to be detected;
FIG. 7 is a diagram showing the secondary structure prediction results of the protein to be detected;
FIG. 8 is a block diagram of a protein structure prediction apparatus according to an embodiment of the present invention;
FIG. 9 is a block diagram of another embodiment of the protein structure prediction apparatus according to the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically depicted, or only one of them is labeled. In this document, "one" means not only "only one" but also a case of "more than one".
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
FIG. 1 shows a flow chart of an embodiment of the protein structure prediction method provided by the present invention, which comprises:
S101, extracting a target sequence from a protein file to be detected;
specifically, proteins are built from twenty standard amino acids, and the protein file records each amino acid by its three-letter abbreviation, which is converted to the corresponding one-letter abbreviation. The records of standard residues, which begin with ATOM in the protein file (PDB format), are read; the one-letter name of each amino acid is then obtained in order of residue sequence number, and the chain identifier is read so that the amino acid sequence contained in each chain is obtained separately.
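The extraction step can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the invention: the function name `chain_sequences` is hypothetical, and the column positions are those of the fixed-width PDB ATOM record (residue name in columns 18-20, chain identifier in column 22, residue sequence number in columns 23-26, insertion code in column 27). Residues whose names are not among the twenty standard amino acids are skipped.

```python
from typing import Dict

# Three-letter to one-letter codes for the twenty standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_text: str) -> Dict[str, str]:
    """Return {chain identifier: one-letter sequence} from ATOM records."""
    sequences: Dict[str, str] = {}
    seen = set()  # (chain, residue number + insertion code) already counted
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 27:
            continue  # skips HETATM and all other record types
        res_name = line[17:20].strip()
        if res_name not in THREE_TO_ONE:
            continue  # non-standard residue mixed into the ATOM records
        chain = line[21]
        # Keep the insertion code with the number so that residues such
        # as 52 and 52A are treated as two different residues.
        residue_key = (chain, line[22:27])
        if residue_key in seen:
            continue  # another atom of a residue that was already added
        seen.add(residue_key)
        sequences[chain] = sequences.get(chain, "") + THREE_TO_ONE[res_name]
    return sequences
```

The sketch appends residues in file order, which in a well-formed PDB file follows the residue sequence numbers within each chain.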
It should be noted that repeated sequences within a single protein file must be deduplicated. Atomic coordinates of non-standard residues mixed in among the standard residues must be filtered out, and where a residue sequence number carries an insertion code, the insertion code must be included with the sequence number. A sequence is then obtained for each chain and deduplicated: sequences that are identical but come from different chains or files are merged into one entry. The NR library (a protein library that combines several protein sequence libraries) is obtained, and BLAST (Basic Local Alignment Search Tool) is installed so that the extracted sequence library can be converted into a formatted sequence library. The scheme for extracting the protein sequence (target sequence) is shown in FIG. 2.
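The merging of identical sequences can be sketched as below. The function names, the `file_chain` identifier format and the `|` separator in the FASTA header are assumptions chosen for illustration; the text only requires that identical sequences from different chains or files end up as a single entry.

```python
from typing import Dict, Iterable, List, Tuple


def deduplicate(records: Iterable[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Group (file_name, chain_id, sequence) records by sequence.

    Returns {sequence: [identifiers of every chain with that sequence]}.
    """
    merged: Dict[str, List[str]] = {}
    for file_name, chain_id, sequence in records:
        merged.setdefault(sequence, []).append(f"{file_name}_{chain_id}")
    return merged


def to_fasta(merged: Dict[str, List[str]]) -> str:
    """Write one FASTA entry per unique sequence, listing all its sources."""
    lines: List[str] = []
    for sequence, identifiers in merged.items():
        lines.append(">" + "|".join(identifiers))
        lines.append(sequence)
    return "\n".join(lines) + "\n"
```

A FASTA file produced this way could then be formatted into a searchable library with the BLAST+ `makeblastdb` program (for example with `-in`, `-dbtype prot` and `-out`); the exact invocation used by the inventors is not given in the text.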
Here, NR (Non-Redundant Protein Sequence Database) is a non-redundant protein database containing all non-redundant protein sequences in GenBank + EMBL + DDBJ + PDB. For every known or possible coding sequence, the NR record gives the corresponding amino acid sequence (deduced from the known or possible reading frame) and its sequence number in the specialized protein databases. The NR library thus serves as a nucleic-acid-based cross-reference that links nucleic acid data to protein data.
S102, matching the target sequence in a protein database with a known structure, and finding out a matched sequence;
specifically, after the target sequence is obtained, blastp is called to search for similar sequences that match the target sequence, so that a corresponding structural template can be selected. In this step, PDB (Protein Data Bank) can be selected as the protein database with known structure. The target sequence extracted in the method of the present invention may be an amino acid sequence or a nucleotide sequence. If the target sequence is an amino acid sequence, blastp performs the search alignment directly and returns the search result. If the target sequence is a nucleotide sequence, blastx is called, which translates the nucleotide sequence into protein sequences, performs the search and matching, and returns the search result.
BLAST, short for Basic Local Alignment Search Tool, is a search tool based on a local alignment algorithm. BLAST first builds a database from the sequences to be searched against (this is called the database, and each sequence in it is called a subject), and then searches the database with the sequence to be searched (called the query); each query is aligned pairwise against each subject in the database to obtain the overall alignment result.
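The choice between blastp and blastx can be sketched as follows. The text does not say how the sequence type is recognised, so the alphabet-based heuristic, its 0.9 threshold, the function names and the file and database names in the usage note are all assumptions for illustration. The options `-query`, `-db`, `-outfmt` and `-out` are standard BLAST+ command-line options; the sketch only builds the argument list and does not run BLAST.

```python
from typing import List


def looks_like_nucleotide(sequence: str, threshold: float = 0.9) -> bool:
    """Heuristic (an assumption): mostly A/C/G/T/U/N means nucleotide."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    nucleotide_letters = sum(letter in "ACGTUN" for letter in sequence)
    return nucleotide_letters / len(sequence) >= threshold


def blast_command(sequence: str, query_path: str, database: str,
                  output_path: str) -> List[str]:
    """Build a BLAST+ argument list: blastx for nucleotide queries,
    blastp for amino acid queries, tabular output (format 6)."""
    program = "blastx" if looks_like_nucleotide(sequence) else "blastp"
    return [program, "-query", query_path, "-db", database,
            "-outfmt", "6", "-out", output_path]
```

For instance, `blast_command("MKVLAAGIVGLLLAQ", "query.fasta", "pdb_seqs", "hits.tsv")` yields a list starting with `blastp`, which could be passed to `subprocess.run` on a machine where BLAST+ and a formatted library named `pdb_seqs` are available.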
Blast is an integrated package, and by calling different alignment modules, Blast implements five possible sequence alignment modes:
blastp: protein sequences are compared with a protein library, and homology of the protein sequences is directly compared.
blastx: the alignment of nucleic acid sequences to protein libraries involves first translating nucleic acid sequences into protein sequences (which may be translated into 6 possible protein sequences depending on the phase) and then aligning the protein libraries.
blastn: alignment of nucleic acid sequences against a library of nucleic acids directly compares the homology of nucleic acid sequences.
tblastn: alignment of protein sequences to nucleic acid libraries, translation of nucleic acids in a library into protein sequences, and then alignment.
tblastx: nucleic acid sequence alignment to nucleic acid library at the protein level, library and the sequence to be checked are translated into protein sequences, and then the protein sequences are aligned.
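The six possible translations mentioned for blastx (three reading frames on each strand) can be illustrated with the short sketch below. It uses the standard genetic code and is not BLAST's own implementation; incomplete trailing codons are dropped, and codons containing unexpected letters are shown as `X`, which is an assumption of this sketch.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return translations of frames +1, +2, +3 and then -1, -2, -3."""
    dna = dna.upper().replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]
```

Each of the six translated sequences is then a candidate protein query, which is why a nucleotide target sequence can still be matched against a protein structure database.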
S103, acquiring a matching structure of the matching sequence according to the matching sequence;
specifically, after the final matching result is obtained, the best matching result is selected as the matching sequence, and then the structure of the matching sequence is selected as the model template.
S104, constructing an initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof;
specifically, the matching sequence and the corresponding matching structure obtained in the previous step are used as the structure template of the target sequence to construct an initial three-dimensional structure model of the target sequence: the target sequence is aligned one-to-one with the initial three-dimensional structure model, the matching sequence is aligned one-to-one with the matching structure, and a modeling engine (such as ProMod3) is then used to generate the protein model.
S105, combining the unmatched sequence segments of the target sequence with part of the adjacent matched sequence segments to form a sub-target sequence; searching a matching subsequence of the sub-target sequence and the structure thereof in a protein database with a known structure;
specifically, in the modeling process, the missing part in the structural model needs to be filled up to complete the three-dimensional structure. For unmatched sequence segments, a new sequence segment (sub-target sequence) can be formed by combining part of matched sequence segments, and blastp is called to search for a structural template corresponding to the new sequence segment, so that the missing part of the initial three-dimensional structural model of the target sequence is filled.
S106, filling the missing part in the initial three-dimensional structure model according to the searched matching subsequence and the structure thereof, and obtaining the three-dimensional structure of the protein file to be detected.
Specifically, the three-dimensional structure of the protein file to be detected is obtained after the missing parts in the initial three-dimensional structure model are completely filled by adopting the method.
As shown in FIG. 3, this embodiment adds specific descriptions to step S105 in the previous embodiment (S205-S209) on the basis of the previous embodiment. The structure prediction method of the embodiment specifically includes:
S201, extracting a target sequence from a protein file to be detected;
S202, matching the target sequence in a protein database with a known structure, and finding out a matched sequence;
S203, acquiring a matching structure of the matching sequence according to the matching sequence;
S204, constructing an initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof;
S205, acquiring the sequence segment matched between the target sequence and the matching sequence as a matched sequence segment; taking the sequence segment on the target sequence which is not matched with the matching sequence as an unmatched sequence segment;
S206, judging whether the length of the unmatched sequence fragment is larger than a preset length value; if so, entering step S207, otherwise entering step S208;
S207, determining the cutting sites in the matched sequence fragments, and combining the unmatched sequence fragment and the adjacent cut matched sequence fragments to form a sub-target sequence; the flow advances to step S209;
S208, intercepting the unmatched sequence fragment as the sub-target sequence;
S209, matching the sub-target sequence in the protein database with a known structure, and searching for the matching subsequence and the structure thereof;
S210, filling the missing part in the initial three-dimensional structure model according to the searched matching subsequence and the structure thereof, and obtaining the three-dimensional structure of the protein file to be detected.
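Steps S205 and S206 can be sketched as follows, assuming that the matched segments are given as 1-based inclusive residue ranges on the target sequence. The function names are hypothetical. The default preset length of 15 is the value used in the cut-site rules of this description, and measuring fragment length as a residue count is an assumption of the sketch.

```python
def unmatched_segments(target_length, matched):
    """Return the ranges of the target that no matched segment covers (S205).

    `matched` is a list of (start, end) pairs, 1-based and inclusive.
    """
    gaps = []
    position = 1  # first residue not yet known to be matched
    for start, end in sorted(matched):
        if start > position:
            gaps.append((position, start - 1))
        position = max(position, end + 1)
    if position <= target_length:
        gaps.append((position, target_length))
    return gaps


def next_step(gap, preset_length=15):
    """Decide between S207 and S208 for one unmatched segment (S206)."""
    start, end = gap
    length = end - start + 1
    return "S207" if length > preset_length else "S208"
```

For a 100-residue target with matched segments 11-40 and 61-90, the unmatched segments are 1-10, 41-60 and 91-100; only the 20-residue middle segment exceeds the preset length and is routed to S207.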
For an unmatched sequence with a sequence fragment length smaller than or equal to a preset length, the unmatched sequence fragment can be directly intercepted as a sub-target sequence due to the short length of the unmatched sequence fragment, and then the sub-target sequence is searched in a protein database with a known structure to obtain a corresponding template structure so as to fill up a missing part corresponding to the unmatched sequence fragment on the initial three-dimensional structure of the target sequence.
For unmatched sequence segments with the sequence segment length larger than the preset length value, combining a part of adjacent matched sequence segments to form a new sequence segment (sub-target sequence), and then matching and searching the new sequence segment in a protein database with a known structure to see whether a matched sequence can be found or not, so that a structure template of the new sequence segment is obtained to fill up the deletion in the original three-dimensional structure model of the target sequence.
The selection of a part of the adjacent matched sequence segment will affect the composition of the new sequence segment (sub-target sequence), and also affect the subsequent matching search result. Therefore, it is important to select and determine the optimal cleavage site in the matched sequence fragment. Preferably, in step S207, when the length of the fragment of the unmatched sequence fragment is greater than a preset length value, determining a cutting site in the matched sequence fragment, and combining the unmatched sequence fragment and the adjacent cut matched sequence fragment to form a sub-target sequence specifically includes:
let the start site of the unmatched sequence fragment be $M_{e1}$ and its termination site be $M_{s2}$; let the matched sequence fragments adjacent to the unmatched sequence fragment be a first matching sequence fragment and/or a second matching sequence fragment; the first matching sequence fragment has $M_1$ amino acids, its start site is $M_{s1}$ and its termination site is $M_{e1}$; the second matching sequence fragment has $M_2$ amino acids, its start site is $M_{s2}$ and its termination site is $M_{e2}$.
1. When $M_{s2} - M_{e1} > 15$:
(1) as shown in FIG. 4a, if the matching sequence fragment adjacent to the unmatched sequence fragment is only the first matching sequence fragment, the sequence fragment from the cleavage site T1 of the first matching sequence to the termination site $M_{s2}$ of the unmatched sequence fragment is intercepted as the sub-target sequence;
(2) if the matching sequence fragment adjacent to the unmatched sequence fragment is only the second matching sequence fragment, the sequence fragment from the start site $M_{e1}$ of the unmatched sequence fragment to the cleavage site T2 of the second matching sequence is intercepted as the sub-target sequence;
(3) as shown in FIG. 4c, if the matching sequence fragments adjacent to the unmatched sequence fragment are the first matching sequence fragment and the second matching sequence fragment, the sequence fragment between the cleavage site T1 of the first matching sequence and the cleavage site T2 of the second matching sequence is intercepted as the sub-target sequence;
wherein: if $M_1/2 > 10$, the cleavage site T1 of the first matching sequence is selected as $M_{e1} - 9$;
if $M_1/2 \le 10$, the cleavage site T1 of the first matching sequence is selected as $(M_{e1} - M_1/2 + 1)$, wherein $M_1/2$ is rounded up or rounded down;
if $M_2/2 > 10$, the cleavage site T2 of the second matching sequence is selected as $M_{s2} + 9$;
if $M_2/2 \le 10$, the cleavage site T2 of the second matching sequence is selected as $(M_{s2} + M_2/2 - 1)$, wherein $M_2/2$ is rounded up or rounded down.
2. The step S208 of intercepting the unmatched sequence fragment as a sub-target sequence specifically includes:
when $M_{s2} - M_{e1} \le 15$, the unmatched sequence fragment between $M_{e1}$ and $M_{s2}$ is intercepted as the target sequence fragment.
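The cut-site rules above can be written out as a short sketch. The function and parameter names are hypothetical; positions are residue numbers on the target sequence, and passing `None` for a fragment length means that no matched fragment exists on that side. Rounding of the half-length defaults to rounding down, since the text allows either direction.

```python
def half_length(m, round_up=False):
    """M/2 rounded down (default) or up; the text allows either."""
    return -(-m // 2) if round_up else m // 2


def cleavage_t1(m_e1, m1, round_up=False):
    """Cut site T1 inside the first (preceding) matched fragment."""
    if m1 / 2 > 10:
        return m_e1 - 9
    return m_e1 - half_length(m1, round_up) + 1


def cleavage_t2(m_s2, m2, round_up=False):
    """Cut site T2 inside the second (following) matched fragment."""
    if m2 / 2 > 10:
        return m_s2 + 9
    return m_s2 + half_length(m2, round_up) - 1


def sub_target_range(m_e1, m_s2, m1=None, m2=None, threshold=15):
    """Return (start, end) of the sub-target sequence for one unmatched fragment."""
    if m_s2 - m_e1 <= threshold:
        return (m_e1, m_s2)  # short gap: take the unmatched fragment itself
    start = cleavage_t1(m_e1, m1) if m1 is not None else m_e1
    end = cleavage_t2(m_s2, m2) if m2 is not None else m_s2
    return (start, end)
```

For example, with a first matched fragment covering residues 1-30 and a second covering residues 61-100, the gap satisfies 61 - 30 > 15, both half-lengths exceed 10, and the sub-target sequence runs from residue 21 to residue 70, that is, the unmatched fragment extended into each neighbouring matched fragment up to its cut site.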
In any of the above method embodiments, matching the target sequence in a pre-stored protein structure database, and finding the matching sequence specifically includes: matching and searching the target sequence in a pre-stored protein structure database, and selecting a sequence with the highest similarity from the searched similar sequences as a matching sequence; preferably, the best matching sequence is selected from the searched similar sequences by combining factors such as similarity and length of the matching sequence fragment.
Specifically, if a protein of unknown structure has sufficient sequence similarity to a protein of known structure, an approximate three-dimensional model can be constructed for the protein of unknown structure based on the similarity principle. If a portion of the target protein sequence is similar to a domain region of a protein of known structure, the target protein can be considered to have the same domain or functional region. In the aspect of protein structure prediction, the method with the most reliable prediction result is a homologous modeling method.
The main idea of the homology modeling method is as follows: for a protein of unknown structure, a homologous protein of known structure is found, and its structure is used as a template to build a structure model for the protein of unknown structure. Generally, if a protein of known structure can be found whose sequence identity with the target sequence is 25% or more (preferably more than 30%), the protein structure can be predicted by homology modeling. Homology between proteins is usually compared by means of sequence alignment, through which evolutionary relationships between proteins can be found. In protein structure analysis, sequence alignment can reveal sequence conservation patterns or mutation patterns, and these sequence patterns contain very useful three-dimensional structure information. The structures of 10-30% of proteins can be predicted by homology modeling.
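As an illustration of the identity threshold, the sketch below computes percent identity from a pairwise alignment and checks it against the 25% value given above. Counting identity over aligned (non-gap) columns is one common convention and is an assumption here; the function names are hypothetical.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # gap column: not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def homology_modeling_applicable(identity, minimum=25.0):
    """True when the identity reaches the minimum stated in the text."""
    return identity >= minimum
```

Passing `minimum=30.0` applies the stricter, preferred threshold.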
Generally, after the target sequence is extracted, blastp (alignment of a protein sequence against a protein database) is called, and a result file containing the similarity data is obtained. In the scoring, the Score value is the alignment score: the longer the matched fragment and the higher the similarity, the larger the Score value. The sequence with the highest similarity, that is, the highest Score, can then be selected as the best matching sequence. Preferably, factors other than similarity are also taken into account when selecting the best matching sequence: for example, by jointly considering the similarity to the target sequence and the number and length of the alignment gaps, a sequence with high similarity to the target sequence, few alignment gaps and short gap length is selected as the template sequence. Of course, other reference factors, such as template resolution, may also be used.
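A possible way to pick the best matching sequence from a blastp result file is sketched below. It assumes the default 12-column BLAST+ tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The ranking key (bit score first, then fewer gap openings, then longer alignment) is only one way of combining the factors named above, not the ranking prescribed by the method, and all names are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str
    identity: float
    length: int
    gap_opens: int
    q_start: int
    q_end: int
    bit_score: float


def parse_tabular(text):
    """Parse default outfmt 6 lines into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[1], float(f[2]), int(f[3]), int(f[5]),
                        int(f[6]), int(f[7]), float(f[11])))
    return hits


def best_hit(hits, min_identity=25.0):
    """Return the preferred template hit, or None if no hit is similar enough."""
    candidates = [h for h in hits if h.identity >= min_identity]
    if not candidates:
        return None
    return max(candidates, key=lambda h: (h.bit_score, -h.gap_opens, h.length))
```

The query range (`q_start`, `q_end`) of the chosen hit gives the matched segment of the target sequence, from which the unmatched segments follow.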
After the matching sequence and the matching structure are obtained, an initial three-dimensional model of the target sequence is constructed. For sequence segments that are not matched, the structure of the corresponding part is treated as missing data when the initial three-dimensional model is constructed. Where missing data (or data with no alignment result) is present, the protein model is generated with ProMod3 (modeling engine) from the target sequence, the target structure, and the template sequence and template structure corresponding to the missing data. The complete flow is shown in FIG. 5.
Specifically, to generate a protein model with ProMod3 (modeling engine), all dependencies required by ProMod3 are first installed (see the ProMod3 documentation). The target sequence, the matching sequence and matching structure, and their alignment are supplied as input, and modeling then proceeds in the following steps:
1. An initial three-dimensional structure model is constructed from the template structure.
2. Loop modeling is performed to fill the missing gaps. Specifically, there are two approaches to loop modeling. The first is de novo loop prediction, based on a conformational search or conformational enumeration in a given environment, guided by a scoring or energy function; methods of this kind differ in their protein representation, energy function terms, and optimization or enumeration algorithms. The second is a database approach, which looks for main-chain segments that fit the two stem regions of the loop by searching a database of many known protein structures, not only homologues of the modeled protein. In this embodiment, for each missing gap in the initial three-dimensional structure model, the unmatched sequence segment is combined with part of the adjacent matched sequence segments to form a new sequence segment, which is searched for in the database of proteins with known structure; the matching sequence segment found provides the corresponding template, and the template structure obtained fills the corresponding missing gap in the initial three-dimensional structure model. Searching and filling are repeated in this way until a complete three-dimensional structure model is obtained.
3. The side chains are reconstructed. Specifically, the side chain of an amino acid in a protein is connected to the alpha-carbon atom through a sigma bond, and this bond is flexible, so the conformation of the side chain is not easy to determine. The SCWRL side-chain reconstruction method is used to find the conformation of each amino acid side chain in the specific protein. Reconstructing the side chains of the three-dimensional structure of the protein to be detected makes its three-dimensional structure model more stable.
4. Molecular mechanics is used to minimize the energy of the final model. Specifically, the energy of the three-dimensional structure model of the protein to be detected is minimized by adopting molecular mechanics, the distance between atoms in the structure can be adjusted, the model structure is stabilized, and the purpose of optimizing the three-dimensional structure of the protein to be detected is achieved.
A further case is that blastp finds a similar sequence for only a small segment of the target, and nothing for the rest. Specifically, as shown in FIG. 4d, assume that the protein (or protein fragment) whose structure is to be predicted has N amino acids, with start site Ns and termination site Ne, and that the best aligned fragment found by blastp has Q amino acids, with start site Qs and termination site Qe.
If Qs - Ns > 15:
if Q/2 > 10, the fragment from Ns to (Qs + 9) is intercepted (cleavage site T = Qs + 9), and blastp is called to search for a template;
if Q/2 ≤ 10, the fragment from Ns to (Qs + Q/2 - 1) is intercepted (cleavage site T = Qs + Q/2 - 1), and blastp is called to search for a template.
If Qs - Ns ≤ 15:
the fragment from Ns to Qs is intercepted, and the corresponding structure is retrieved from the protein database of known structure.
If Ne - Qe > 15:
if Q/2 > 10, the fragment from (Qe - 9) to Ne is intercepted, and blastp is called to search for a template;
if Q/2 ≤ 10, the fragment from (Qe - Q/2 + 1) to Ne is intercepted, and blastp is called to search for a template.
If Ne - Qe ≤ 15:
the fragment from Qe to Ne is intercepted, and the corresponding structure is retrieved from the protein database of known structure.
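The rules for the two ends of the target can be sketched as follows. The function name is hypothetical. Two points are assumptions of the sketch rather than statements of the text: Q is computed as Qe - Qs + 1, and Q/2 is rounded down by default.

```python
def terminal_sub_targets(ns, ne, qs, qe, threshold=15, round_up=False):
    """Return the (start, end) ranges to search for the N- and C-terminal sides.

    ns, ne: start and termination sites of the target (or fragment);
    qs, qe: start and termination sites of the best aligned fragment.
    """
    q = qe - qs + 1                          # assumed length of the aligned fragment
    half = -(-q // 2) if round_up else q // 2
    long_match = q / 2 > 10

    # Side before the aligned fragment.
    if qs - ns > threshold:
        n_side = (ns, qs + 9) if long_match else (ns, qs + half - 1)
    else:
        n_side = (ns, qs)

    # Side after the aligned fragment.
    if ne - qe > threshold:
        c_side = (qe - 9, ne) if long_match else (qe - half + 1, ne)
    else:
        c_side = (qe, ne)
    return n_side, c_side
```

For a 200-residue target whose best aligned fragment covers residues 50-100, both ends are longer than 15 residues and Q/2 exceeds 10, so the two sub-target ranges are 1-59 and 91-200.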
That is, when a matching sequence of the target sequence is searched for and no complete matching sequence with a corresponding structure template is found (for example, only some matching sequence fragments are found), the fragment with the best alignment result is selected from the matching sequence fragments found as the target sequence fragment (for example, the longest matching fragment is taken as the fragment with the best alignment result). According to the position of the target sequence fragment on the target sequence, a cutting site is determined on the target sequence to obtain a sub-target sequence, and a template is then searched for in the protein database with a known structure according to the sub-target sequence, so as to construct the initial three-dimensional structure of the target sequence.
In addition, ProMod3 modeling mostly fills missing parts in the middle of the sequence. If a missing (unmatched) part at one of the two ends of the target sequence is left unfilled, the adjacent sequence that already has a three-dimensional structure can first be removed, and the two missing parts are then filled.
When modeling with ProMod3, ProMod3 cannot predict non-standard amino acids such as U (selenocysteine). The invention handles such cases separately: U is converted into C by a sequence recognition method, and structure prediction is then performed.
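The conversion of U to C can be sketched as a simple pre-processing step on the one-letter sequence. The function name is hypothetical, and recording the converted positions is an addition of this sketch so that the substitution can be reported alongside the predicted model.

```python
def replace_selenocysteine(sequence):
    """Replace U with C and return the new sequence and the 1-based positions changed."""
    positions = [i + 1 for i, aa in enumerate(sequence) if aa in "Uu"]
    converted = sequence.replace("U", "C").replace("u", "c")
    return converted, positions
```

For instance, `replace_selenocysteine("MKUAC")` returns `("MKCAC", [3])`.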
Because the ProMod3 modeling engine predicts the three-dimensional structure by homology modeling, this embodiment makes full use of the information in the existing database, including secondary-structure and spatial-structure information, to predict the secondary structure of the protein sequence, so that the structure of the protein can be displayed more clearly by visualization.
When the sequence has been modeled through to a complete three-dimensional structure model, the final result shown in FIG. 6 is obtained; the secondary structure is then predicted, giving the final result shown in FIG. 7. The prediction result is returned as a protein (PDB) file.
In addition, if a similar sequence of the target sequence is not found in the protein database with a known structure (no comparison result, or no known sequence containing equivalent sequence fragments is found), the information in the existing database, including the information of the secondary structure and the spatial structure, can be fully utilized to predict the secondary structure of the protein sequence from the protein sequence, and then predict the spatial structure of the protein from the secondary structure, or predict the protein by using a de novo prediction method.
Based on the same technical concept, the present invention discloses a protein structure prediction device, which can predict the three-dimensional structure of a protein by using any of the above protein structure prediction methods, and specifically, an embodiment of the protein structure prediction device of the present invention, as shown in fig. 8, includes:
a sequence extraction module 100, configured to extract a target sequence from a protein file to be detected; specifically, an amino acid sequence is first extracted from the protein file: atomic coordinates in which standard residues are mixed with non-standard residues are filtered, and where a residue sequence number carries a residue insertion code, the insertion code is appended to the sequence number. A sequence is then obtained for each chain and de-duplicated, so that repeated sequences (identical sequences that differ only in chain or file name) are removed. An NR library (protein library), a database combining several protein sequence libraries, is then obtained, and BLAST (Basic Local Alignment Search Tool) is installed to convert the extracted sequence library into a formatted sequence library.
The matching search module 200 is used for matching the target sequence in a protein database with a known structure to search a matching sequence; acquiring a matching structure of the matching sequence according to the matching sequence; specifically, after the target sequence is obtained, blastp is called to search and match a similar sequence matched with the target sequence so as to select a corresponding structural template. For example, if a matched sequence with a similarity of more than 25% (or more preferably more than 30%) is obtained, an initial three-dimensional structure model of the target sequence may be constructed by using a homology modeling method.
A model construction module 300, configured to construct an initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof;
a combining module 400 for combining unmatched sequence segments of the target sequence with a portion of adjacent matched sequence segments into sub-target sequences; searching the matching subsequence and the structure of the sub-target sequence in the protein database with the known structure through a matching searching module; specifically, in the modeling process, the missing part in the structural model needs to be filled up to complete the three-dimensional structure. For unmatched sequence segments, a new sequence segment (sub-target sequence) can be formed by combining part of matched sequence segments, and blastp is called to search for a structural template corresponding to the new sequence segment, so that the missing part of the initial three-dimensional structural model of the target sequence is filled.
And a filling module 500, configured to fill up a missing part in the initial three-dimensional structure model according to the found matching subsequence and the structure thereof, and obtain a three-dimensional structure of the protein file to be detected. Specifically, the three-dimensional structure of the protein file to be detected is obtained after the missing parts in the initial three-dimensional structure model are completely filled in by adopting the mode.
In another embodiment of the prediction apparatus of the present invention, as shown in fig. 9, on the basis of the above embodiment of the prediction apparatus, the combination module 400 includes:
an obtaining submodule 410, configured to obtain the sequence segment matched between the target sequence and the matching sequence as a matched sequence segment, and to take the sequence segment on the target sequence that is not matched with the matching sequence as an unmatched sequence segment;
a determining submodule 420, configured to determine a cutting site in the matched sequence fragment when the length of the unmatched sequence fragment is greater than a preset length value, and to combine the unmatched sequence fragment and the adjacent cut matched sequence fragment to form a sub-target sequence; and further configured to determine a cutting site in the unmatched sequence fragment when the length of the unmatched sequence fragment is smaller than or equal to the preset length value, and to intercept a part of the unmatched sequence fragment as the sub-target sequence. The matching search module then matches the sub-target sequence in the protein database with a known structure to find the matching subsequence and the structure thereof.
Because unmatched sequence segments are treated as missing structure in the initial three-dimensional structure, the missing parts in the initial three-dimensional structure also need to be filled. Unmatched sequence segments can be handled differently according to their length. Specifically, for an unmatched sequence segment whose length is smaller than or equal to the preset length, a cutting point can be determined on the unmatched sequence segment and a part of it cut out as the sub-target sequence; the sub-target sequence is searched for in a protein database with a known structure to obtain a corresponding template structure, which fills the missing part corresponding to the unmatched sequence segment in the initial three-dimensional structure of the target sequence. For an unmatched sequence segment whose length is larger than the preset length value, part of the adjacent matched sequence segments is combined with it to form a new sequence segment (sub-target sequence), which is then searched for in a protein database with a known structure to see whether a matching sequence can be found, so that a structure template of the new sequence segment is obtained to fill the gap in the initial three-dimensional structure model of the target sequence.
The cleavage site is then determined as follows. Specifically, the determining submodule 420 includes a selecting unit 420 and an intercepting unit 422; wherein: let the start site of the unmatched sequence fragment be $M_{e1}$ and its termination site be $M_{s2}$; let the matched sequence fragments adjacent to the unmatched sequence fragment be a first matching sequence fragment and/or a second matching sequence fragment; the first matching sequence fragment has $M_1$ amino acids, its start site is $M_{s1}$ and its termination site is $M_{e1}$; the second matching sequence fragment has $M_2$ amino acids, its start site is $M_{s2}$ and its termination site is $M_{e2}$.
1. When $M_{s2} - M_{e1} > 15$:
(1) if the matching sequence fragment adjacent to the unmatched sequence fragment is only the first matching sequence fragment, the intercepting unit intercepts the sequence fragment from the cleavage site T1 of the first matching sequence to the termination site $M_{s2}$ of the unmatched sequence fragment as the sub-target sequence;
(2) if the matching sequence fragment adjacent to the unmatched sequence fragment is only the second matching sequence fragment, the intercepting unit intercepts the sequence fragment from the start site $M_{e1}$ of the unmatched sequence fragment to the cleavage site T2 of the second matching sequence as the sub-target sequence;
(3) if the matching sequence fragments adjacent to the unmatched sequence fragment are the first matching sequence fragment and the second matching sequence fragment, the intercepting unit intercepts the sequence fragment between the cleavage site T1 of the first matching sequence and the cleavage site T2 of the second matching sequence as the sub-target sequence;
wherein: if $M_1/2 > 10$, the selecting unit selects the cleavage site T1 of the first matching sequence as $M_{e1} - 9$;
if $M_1/2 \le 10$, the selecting unit selects the cleavage site T1 of the first matching sequence as $(M_{e1} - M_1/2 + 1)$, wherein $M_1/2$ is rounded up or rounded down;
if $M_2/2 > 10$, the selecting unit selects the cleavage site T2 of the second matching sequence as $M_{s2} + 9$;
if $M_2/2 \le 10$, the selecting unit selects the cleavage site T2 of the second matching sequence as $(M_{s2} + M_2/2 - 1)$, wherein $M_2/2$ is rounded up or rounded down;
2. When $M_{s2} - M_{e1} \le 15$:
the intercepting unit intercepts the unmatched sequence fragment between $M_{e1}$ and $M_{s2}$ as the target sequence fragment.
In another embodiment of the prediction apparatus of the present invention, in addition to any of the above embodiments of the apparatus, the protein structure prediction apparatus further includes:
the side chain construction module 600 is used for reconstructing a side chain of a three-dimensional structure of the protein to be detected;
a structure optimization module 700 for minimizing the energy of the three-dimensional structure model of the protein to be tested using molecular mechanics.
In any of the above embodiments of the prediction apparatus, the protein structure prediction apparatus further comprises:
and the secondary structure prediction module is used for predicting the secondary structure of the protein to be detected according to the three-dimensional structure of the protein to be detected and by combining the information of the three-dimensional space structure and the secondary structure of each protein sequence in the protein database with the known structure.
Specifically, for example, in this embodiment the ProMod3 modeling engine can be used to predict the three-dimensional structure by homology modeling (including remote homology). On this basis, the information in the database of proteins with known structure, including secondary-structure and spatial-structure information, is fully used to predict the secondary structure of the protein from the protein sequence, so that the structure of the protein can be displayed more clearly by visualization.
In addition, ProMod3 modeling mostly fills missing parts in the middle of the sequence. If a missing (unmatched) part at one of the two ends of the target sequence is left unfilled, the adjacent sequence that already has a three-dimensional structure can first be removed, and the two missing parts are then filled.
ProMod3 cannot predict non-standard amino acids such as U (selenocysteine) during modeling. The invention handles such cases separately: U is converted into C by a sequence recognition method, and structure prediction is then performed.
Similarly, the present invention also discloses a storage medium, wherein the storage medium stores a plurality of instructions, and the plurality of instructions are executed by one or more processors to implement the steps of the protein structure prediction method according to any one of the above prediction method embodiments of the present invention.
For example, the storage medium of this embodiment stores a plurality of instructions for implementing the following steps of the protein structure prediction method:
extracting a target sequence from a protein file to be detected;
matching the target sequence in a protein database with a known structure, and finding out a matched sequence;
acquiring a matching structure of the matching sequence according to the matching sequence;
constructing an initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof;
combining an unmatched sequence segment of the target sequence with a portion of an adjacent matched sequence segment into a sub-target sequence; searching the matching subsequence and the structure of the sub-target sequence in the protein database with the known structure;
and filling up the missing part in the initial three-dimensional structure model according to the searched matching subsequence and the structure thereof to obtain the three-dimensional structure of the protein file to be detected.
Of course, the storage medium storing the implementation instructions of the other embodiments of the protein structure prediction method also belongs to the storage medium of the present invention, and is not described herein again.
Finally, the invention discloses a protein structure prediction platform, which comprises the protein structure prediction device in any embodiment of the prediction device, wherein the protein structure prediction platform is constructed on a server and is provided with a protein three-dimensional structure and/or secondary structure online visualization program for visually displaying the structure of the protein.
The protein structure prediction platform comprises the protein structure prediction device, whose modules correspond to the steps of the protein structure prediction method. The platform is established on a server, preferably a cloud server, and is provided with an online visualization program for the three-dimensional structure or secondary structure of a protein. A user can predict the structure of a protein through the platform, and the platform can display the secondary structure or the three-dimensional structure of the protein to the user.
The protein structure prediction method of the present invention corresponds to the protein structure prediction device, so the technical details of the method embodiments also apply to the device and are not repeated here.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (15)

1. A protein structure prediction method comprising:
extracting a target sequence from a protein file to be detected;
matching the target sequence against a protein database of known structures to find a matching sequence;
acquiring a matching structure of the matching sequence according to the matching sequence;
constructing an initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof;
combining an unmatched sequence segment of the target sequence with a portion of an adjacent matched sequence segment into a sub-target sequence; searching the matching subsequence and the structure of the sub-target sequence in the protein database with the known structure;
and filling up the missing part in the initial three-dimensional structure model according to the searched matching subsequence and the structure thereof to obtain the three-dimensional structure of the protein file to be detected.
2. The method of claim 1, wherein combining an unmatched sequence segment of the target sequence with a portion of an adjacent matched sequence segment into a sub-target sequence, and searching the protein database with the known structure for the matching subsequence of the sub-target sequence and the structure thereof, specifically comprises:
taking a sequence segment of the target sequence that matches the matching sequence as a matched sequence segment; taking a sequence segment of the target sequence that does not match the matching sequence as an unmatched sequence segment;
when the fragment length of the unmatched sequence fragment is larger than a preset length value, determining a cutting site in the matched sequence fragment, and combining the unmatched sequence fragment and the adjacent cut matched sequence fragment to form a sub-target sequence;
when the fragment length of the unmatched sequence fragment is smaller than or equal to the preset length value, intercepting the unmatched sequence fragment as a sub-target sequence;
and matching the sub-target sequences in the protein database with the known structure to find matching subsequences and structures thereof.
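As an illustration of the first step of this claim, the following Python sketch derives the unmatched segments of a target sequence from its matched segments and tests each against the preset length value. It is a sketch under assumptions: positions are 1-based and inclusive, segment length is counted in residues, the default preset length of 15 is borrowed from claim 3, and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) residue positions, 1-based and inclusive


def unmatched_segments(target_len: int, matched: List[Interval]) -> List[Interval]:
    """Return the stretches of the target that no matched segment covers."""
    gaps: List[Interval] = []
    pos = 1  # first position not yet accounted for
    for start, end in sorted(matched):
        if start > pos:
            gaps.append((pos, start - 1))
        pos = max(pos, end + 1)
    if pos <= target_len:
        gaps.append((pos, target_len))
    return gaps


def needs_flanking(gap: Interval, preset_len: int = 15) -> bool:
    """True when the unmatched segment is longer than the preset length value,
    i.e. when part of an adjacent matched segment should be attached to it."""
    return gap[1] - gap[0] + 1 > preset_len


gaps = unmatched_segments(100, [(11, 40), (61, 90)])
long_gaps = [g for g in gaps if needs_flanking(g)]
```

Short unmatched segments are searched as they are, while long ones are first extended with part of a neighbouring matched segment.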
3. The method of claim 2, wherein when the length of the fragment of the unmatched sequence fragment is greater than a predetermined length value, determining the cleavage site in the matched sequence fragment, and combining the unmatched sequence fragment with the adjacent cleaved matched sequence fragment to form a sub-target sequence specifically comprises:
letting the starting site of the unmatched sequence fragment be M_e1 and its termination site be M_s2; letting the matched sequence fragments adjacent to the unmatched sequence fragment be a first matched sequence fragment and/or a second matched sequence fragment; the first matched sequence fragment has M_1 amino acids, its starting site is M_s1, and its termination site is M_e1; the second matched sequence fragment has M_2 amino acids, its starting site is M_s2, and its termination site is M_e2;
when M_s2 - M_e1 > 15:
if the only matched sequence fragment adjacent to the unmatched sequence fragment is the first matched sequence fragment, intercepting the sequence fragment from the cleavage site T1 of the first matched sequence fragment to the termination site M_s2 of the unmatched sequence fragment as the sub-target sequence;
if the only matched sequence fragment adjacent to the unmatched sequence fragment is the second matched sequence fragment, intercepting the sequence fragment from the starting site M_e1 of the unmatched sequence fragment to the cleavage site T2 of the second matched sequence fragment as the sub-target sequence;
if the matched sequence fragments adjacent to the unmatched sequence fragment are the first matched sequence fragment and the second matched sequence fragment, intercepting the sequence fragment between the cleavage site T1 of the first matched sequence fragment and the cleavage site T2 of the second matched sequence fragment as the sub-target sequence;
wherein:
if M_1/2 > 10, the cleavage site T1 of the first matched sequence fragment is selected as M_e1 - 9;
if M_1/2 ≤ 10, the cleavage site T1 of the first matched sequence fragment is selected as (M_e1 - M_1/2 + 1), wherein M_1/2 is rounded up or rounded down;
if M_2/2 > 10, the cleavage site T2 of the second matched sequence fragment is selected as M_s2 + 9;
if M_2/2 ≤ 10, the cleavage site T2 of the second matched sequence fragment is selected as (M_s2 + M_2/2 - 1), wherein M_2/2 is rounded up or rounded down;
when the fragment length of the unmatched sequence fragment is less than or equal to the preset length value, intercepting the unmatched sequence fragment as a sub-target sequence specifically comprises:
when M_s2 - M_e1 ≤ 15, intercepting the unmatched sequence fragment from M_e1 to M_s2 as the sub-target sequence.
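The cleavage-site rules of this claim reduce to a few lines of integer arithmetic. The Python sketch below is illustrative only. It writes the claim's sites M_e1 and M_s2 and fragment lengths M_1 and M_2 as `m_e1`, `m_s2`, `m1` and `m2`, passes `None` for a neighbouring matched fragment that does not exist, and assumes that M_1/2 and M_2/2 are rounded down (the claim permits either rounding). The function names are hypothetical.

```python
from typing import Optional, Tuple

PRESET_LENGTH = 15  # threshold on M_s2 - M_e1 stated in the claim
LONG_OFFSET = 9     # offset used when half of the matched fragment exceeds 10


def cut_t1(m_e1: int, m1: int) -> int:
    """Cleavage site T1 in the first matched fragment (m1 residues, ending at m_e1)."""
    if m1 / 2 > 10:
        return m_e1 - LONG_OFFSET
    return m_e1 - m1 // 2 + 1  # assumption: M_1/2 rounded down


def cut_t2(m_s2: int, m2: int) -> int:
    """Cleavage site T2 in the second matched fragment (m2 residues, starting at m_s2)."""
    if m2 / 2 > 10:
        return m_s2 + LONG_OFFSET
    return m_s2 + m2 // 2 - 1  # assumption: M_2/2 rounded down


def sub_target(m_e1: int, m_s2: int,
               m1: Optional[int] = None,
               m2: Optional[int] = None) -> Tuple[int, int]:
    """Return (start, end) of the sub-target sequence for one unmatched fragment."""
    if m_s2 - m_e1 <= PRESET_LENGTH:
        return (m_e1, m_s2)  # short gap: search the unmatched fragment itself
    start = cut_t1(m_e1, m1) if m1 is not None else m_e1
    end = cut_t2(m_s2, m2) if m2 is not None else m_s2
    return (start, end)


both_sides = sub_target(50, 80, m1=30, m2=12)
first_only = sub_target(50, 80, m1=30)
short_gap = sub_target(50, 60, m1=30, m2=30)
```

With these rules a long unmatched fragment is extended by at most 9 residues into each neighbouring matched fragment, and by roughly half of the neighbour when the neighbour has 20 residues or fewer.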
4. The method of claim 1, wherein the constructing the initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof specifically comprises:
acquiring the sequence segments of the target sequence that match the matching sequence as matched sequence segments;
constructing an initial three-dimensional structure model of the target sequence by taking the matching structure of the matching sequence as a template; in the initial three-dimensional structure model, the structure of each matched sequence segment of the target sequence adopts the structure of the corresponding matched sequence segment in the matching sequence, and the structure of each unmatched sequence segment of the target sequence is treated as a missing structure.
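A minimal Python sketch of this construction is given below, under simplifying assumptions that are not part of the claim: the template structure is reduced to one coordinate triple per residue, the alignment is supplied as a mapping from target position to template position, and `None` stands for a missing structure. The names `initial_model` and `missing_positions` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[float, float, float]  # one (x, y, z) per residue; a simplifying assumption


def initial_model(target_len: int,
                  aligned: Dict[int, int],
                  template: Dict[int, Coord]) -> List[Optional[Coord]]:
    """Build the initial model: matched residues copy the template coordinates,
    unmatched residues stay None (missing structure).

    `aligned` maps a 1-based target position to a template residue number.
    """
    model: List[Optional[Coord]] = [None] * target_len
    for target_pos, template_pos in aligned.items():
        model[target_pos - 1] = template.get(template_pos)
    return model


def missing_positions(model: List[Optional[Coord]]) -> List[int]:
    """1-based target positions that still lack a structure."""
    return [i + 1 for i, coord in enumerate(model) if coord is None]


template = {10: (0.0, 0.0, 0.0), 11: (3.8, 0.0, 0.0), 14: (1.0, 2.0, 3.0)}
model = initial_model(5, {1: 10, 2: 11, 5: 14}, template)
gaps = missing_positions(model)
```

The positions reported by `missing_positions` are the ones that the later sub-target search is expected to fill.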
5. The method for predicting protein structure according to claim 1, further comprising:
when a matching sequence or a matching structure is not found in the protein database with the known structure, searching a matching sequence segment of the target sequence;
selecting the best matching sequence segment as a target sequence segment from the searched matching sequence segments;
determining a cutting site on the target sequence according to the position relation of the target sequence fragment on the target sequence, and acquiring a sub-target sequence;
and searching a template in a protein database with a known structure according to the sub-target sequences to construct an initial three-dimensional structure of the target sequence.
6. The method of claim 5, wherein the determining the cleavage site on the target sequence according to the positional relationship of the target sequence fragment on the target sequence, and the obtaining of the sub-target sequence specifically comprises:
if the target sequence is set to have N amino acids, the initial site is Ns, and the termination site is Ne; the target sequence fragment has Q amino acids, and the starting site is Qs, and the termination site is Qe;
when Qs-Ns > 15:
if Q/2 is more than 10, selecting the cutting site of the target sequence as Qs +9, and intercepting a segment from Ns to (Qs +9) as a sub-target sequence;
if Q/2 is less than or equal to 10, selecting the cutting site of the target sequence as (Qs + Q/2-1), and intercepting a segment from Ns to (Qs + Q/2-1) as a sub-target sequence; wherein Q/2 is rounded up or rounded down;
when Qs-Ns is less than or equal to 15, selecting a cutting site of the target sequence as Qs, and intercepting a segment from Ns to Qs as a sub-target sequence;
when Ne-Qe > 15:
if Q/2 is more than 10, selecting the cutting site of the target sequence as Qe-9, and intercepting a segment from (Qe-9) to Ne as a sub-target sequence;
if Q/2 is less than or equal to 10, selecting the cutting site of the target sequence as (Qe-Q/2+1), and intercepting a segment from (Qe-Q/2+1) to Ne as a sub-target sequence; wherein Q/2 is rounded up or rounded down;
and when Ne-Qe is less than or equal to 15, selecting the cleavage site of the target sequence as Qe, and intercepting a segment from Qe to Ne as a sub-target sequence.
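The two sub-target sequences defined in this claim, one toward each terminus of the target sequence, can be computed as in the following Python sketch. It is illustrative only and rests on two assumptions: Q is taken to be Qe - Qs + 1, and Q/2 is rounded down (the claim permits either rounding). The function name `terminal_sub_targets` is hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end) positions on the target sequence


def terminal_sub_targets(ns: int, ne: int, qs: int, qe: int) -> Tuple[Interval, Interval]:
    """Return the sub-target sequences on the Ns side and on the Ne side of the
    best matching fragment [qs, qe] within the target [ns, ne]."""
    q = qe - qs + 1   # assumption: Q counts the residues from Qs to Qe inclusive
    half = q // 2     # assumption: Q/2 rounded down

    # Side between the start of the target and the matching fragment.
    if qs - ns > 15:
        cut = qs + 9 if q / 2 > 10 else qs + half - 1
        left = (ns, cut)
    else:
        left = (ns, qs)

    # Side between the matching fragment and the end of the target.
    if ne - qe > 15:
        cut = qe - 9 if q / 2 > 10 else qe - half + 1
        right = (cut, ne)
    else:
        right = (qe, ne)

    return left, right


long_fragment = terminal_sub_targets(1, 200, 50, 90)
short_fragment = terminal_sub_targets(1, 100, 30, 41)
near_termini = terminal_sub_targets(1, 100, 10, 95)
```

As in claim 3, each sub-target sequence overlaps the matching fragment by at most 10 residues when the remaining stretch is longer than 15 positions, and by a single boundary residue otherwise.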
7. The method for predicting protein structure according to any one of claims 1 to 6, further comprising:
reconstructing a side chain of a three-dimensional structure of the protein to be detected;
and minimizing the energy of the three-dimensional structure model of the protein to be detected by adopting molecular mechanics.
8. The method for predicting protein structure according to claim 1, further comprising:
and predicting the secondary structure of the protein to be detected according to the three-dimensional structure of the protein to be detected and the information of the three-dimensional space structure and the secondary structure of each protein sequence in the protein database with the known structure.
9. A protein structure prediction device, comprising:
the sequence extraction module is used for extracting a target sequence from a protein file to be detected;
the matching search module is used for matching the target sequence in a protein database with a known structure to search a matching sequence; acquiring a matching structure of the matching sequence according to the matching sequence;
the model construction module is used for constructing an initial three-dimensional structure model of the target sequence based on the matching sequence and the matching structure thereof;
a combining module for combining an unmatched sequence segment of the target sequence with a portion of an adjacent matched sequence segment into a sub-target sequence; searching the matching subsequence and the structure of the sub-target sequence in the protein database with the known structure through the matching searching module;
and the filling module is used for filling the missing part in the initial three-dimensional structure model according to the searched matching subsequence and the structure thereof to obtain the three-dimensional structure of the protein file to be detected.
10. The protein structure prediction device of claim 9, wherein the combination module comprises:
the acquisition submodule is used for taking a sequence segment of the target sequence that matches the matching sequence as a matched sequence segment, and taking a sequence segment of the target sequence that does not match the matching sequence as an unmatched sequence segment;
the determining submodule is used for determining a cutting site in the matched sequence fragment when the fragment length of the unmatched sequence fragment is larger than a preset length value, and combining the unmatched sequence fragment and the adjacent cut matched sequence fragment to form a sub-target sequence; the determining submodule is further used for intercepting the unmatched sequence fragment as a sub-target sequence when the fragment length of the unmatched sequence fragment is smaller than or equal to the preset length value;
and the matching search module is used for matching the sub-target sequences in the protein database with the known structure to find the matching subsequences and the structures thereof.
11. The device of claim 10, wherein the determining submodule comprises a selecting unit and a clipping unit; wherein:
letting the starting site of the unmatched sequence fragment be M_e1 and its termination site be M_s2; letting the matched sequence fragments adjacent to the unmatched sequence fragment be a first matched sequence fragment and/or a second matched sequence fragment; the first matched sequence fragment has M_1 amino acids, its starting site is M_s1, and its termination site is M_e1; the second matched sequence fragment has M_2 amino acids, its starting site is M_s2, and its termination site is M_e2;
when M_s2 - M_e1 > 15:
if the only matched sequence fragment adjacent to the unmatched sequence fragment is the first matched sequence fragment, the interception unit intercepts the sequence fragment from the cleavage site T1 of the first matched sequence fragment to the termination site M_s2 of the unmatched sequence fragment as the sub-target sequence;
if the only matched sequence fragment adjacent to the unmatched sequence fragment is the second matched sequence fragment, the interception unit intercepts the sequence fragment from the starting site M_e1 of the unmatched sequence fragment to the cleavage site T2 of the second matched sequence fragment as the sub-target sequence;
if the matched sequence fragments adjacent to the unmatched sequence fragment are the first matched sequence fragment and the second matched sequence fragment, the interception unit intercepts the sequence fragment between the cleavage site T1 of the first matched sequence fragment and the cleavage site T2 of the second matched sequence fragment as the sub-target sequence;
wherein:
if M_1/2 > 10, the selection unit selects the cleavage site T1 of the first matched sequence fragment as M_e1 - 9;
if M_1/2 ≤ 10, the selection unit selects the cleavage site T1 of the first matched sequence fragment as (M_e1 - M_1/2 + 1), wherein M_1/2 is rounded up or rounded down;
if M_2/2 > 10, the selection unit selects the cleavage site T2 of the second matched sequence fragment as M_s2 + 9;
if M_2/2 ≤ 10, the selection unit selects the cleavage site T2 of the second matched sequence fragment as (M_s2 + M_2/2 - 1), wherein M_2/2 is rounded up or rounded down;
when M_s2 - M_e1 ≤ 15, the interception unit intercepts the unmatched sequence fragment from M_e1 to M_s2 as the sub-target sequence.
12. The protein structure prediction device of claim 9, further comprising:
the side chain construction module is used for reconstructing a side chain of a three-dimensional structure of the protein to be detected;
and the structure optimization module is used for minimizing the energy of the three-dimensional structure model of the protein to be detected by adopting molecular mechanics.
13. The protein structure prediction device according to any one of claims 9-12, further comprising:
and the secondary structure prediction module is used for predicting the secondary structure of the protein to be detected according to the three-dimensional structure of the protein to be detected and by combining the information of the three-dimensional space structure and the secondary structure of each protein sequence in the protein database with the known structure.
14. A storage medium storing a plurality of instructions for execution by one or more processors to perform the steps of the protein structure prediction method of any one of claims 1-8.
15. A protein structure prediction platform comprising the protein structure prediction device according to any one of claims 9 to 13, wherein the protein structure prediction platform is constructed on a server and is provided with an online visualization program for the three-dimensional structure and/or secondary structure of a protein, so as to visually display the structure of the protein.
CN201910880279.XA 2019-09-18 2019-09-18 Protein structure prediction method, device, platform and storage medium Pending CN112530517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910880279.XA CN112530517A (en) 2019-09-18 2019-09-18 Protein structure prediction method, device, platform and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910880279.XA CN112530517A (en) 2019-09-18 2019-09-18 Protein structure prediction method, device, platform and storage medium

Publications (1)

Publication Number Publication Date
CN112530517A (en) 2021-03-19

Family

ID=74974934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910880279.XA Pending CN112530517A (en) 2019-09-18 2019-09-18 Protein structure prediction method, device, platform and storage medium

Country Status (1)

Country Link
CN (1) CN112530517A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113921083A (en) * 2021-10-27 2022-01-11 云舟生物科技(广州)有限公司 Custom sequence analysis method, computer storage medium and electronic device
CN115881211A (en) * 2021-12-23 2023-03-31 上海智峪生物科技有限公司 Protein sequence alignment method, device, computer equipment and storage medium
WO2023116816A1 (en) * 2021-12-23 2023-06-29 上海智峪生物科技有限公司 Protein sequence alignment method and apparatus, and server and storage medium
CN115881211B (en) * 2021-12-23 2024-02-20 上海智峪生物科技有限公司 Protein sequence alignment method, protein sequence alignment device, computer equipment and storage medium
CN115035947A (en) * 2022-06-10 2022-09-09 水木未来(北京)科技有限公司 Protein structure modeling method and device, electronic device and storage medium
CN115035947B (en) * 2022-06-10 2023-03-10 水木未来(北京)科技有限公司 Protein structure modeling method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination