CN116955713A - Method for generating protein index, method and device for querying protein fragment


Info

Publication number: CN116955713A
Application number: CN202310146693.4A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 吴黎明, 赵康菲
Applicant and current assignee: Tencent Technology (Shenzhen) Co., Ltd.
Legal status: Pending

Classifications

    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/903 Querying
    • G06N3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N3/08 Neural networks; Learning methods
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics


Abstract

The application discloses a method for generating a protein index, and a method and device for querying protein fragments, belonging to the field of computer technology. An atomic topology graph is constructed for each protein fragment from the key atoms of its amino acid residues, so that the extracted atomic topological features reflect the spatial arrangement of the amino acid residues at atomic granularity. When the offline index is used to provide an online query service, only the subset of protein fragments located via the index needs fine screening; no linear scan of the whole library of protein fragments is required. This greatly reduces the computational cost of the query process, improves query efficiency based on the constructed protein index, and enables fast responses to high-concurrency online query tasks.

Description

Method for generating protein index, method and device for querying protein fragment
Technical Field
The application relates to the field of computer technology, and in particular to a method for generating protein indexes and a method and device for querying protein fragments.
Background
With the development of computational biology, when designing the protein structure of an antibody, an expert usually designs the structure against a given antigen and then searches a protein database for protein fragments similar to the designed structure.
Typically, for a given protein structure to be queried, a structural similarity score (e.g., the TM-score, an indicator of protein structural similarity) between each protein fragment in the database and the query structure is computed one by one using a structure-matching algorithm such as TM-align, and the most similar fragments are then selected.
Since protein databases typically contain millions or even billions of entries, this query method must linearly scan every protein fragment in the database; the computational cost is therefore extremely high, query efficiency is extremely low, and responding to high-concurrency online query tasks is difficult.
Disclosure of Invention
The embodiment of the application provides a protein index generation method, a protein fragment query method and a protein fragment query device, which can reduce the calculation cost for querying the protein fragment, improve the query efficiency of the protein fragment and quickly respond to high-concurrency online query tasks. The technical scheme is as follows:
in one aspect, a method for generating a protein index is provided, the method comprising:
constructing an atomic topology graph of a protein fragment based on key atoms in the amino acid residues of the protein fragment, each node in the atomic topology graph representing one key atom in one amino acid residue;
inputting the position information and category features of each node in the atomic topology graph into an equivariant graph neural network, where the position information represents the position coordinates of the key atom indicated by the node, the category feature characterizes the atom category to which that key atom belongs, and the equivariant graph neural network is used to extract atomic topological features of the input topology graph;
processing the position information and category features of each node through a plurality of attention weighting layers in the equivariant graph neural network, the last attention weighting layer outputting the atomic topological features of the protein fragment;
generating an index of a plurality of protein fragments based on the atomic topological features of the plurality of protein fragments.
In one aspect, a method for querying a protein fragment is provided, the method comprising:
constructing an atomic topology graph of a protein to be queried based on key atoms in the amino acid residues of the protein to be queried, each node in the atomic topology graph representing one key atom in one amino acid residue;
inputting the position information and category features of each node in the atomic topology graph into an equivariant graph neural network, where the position information represents the position coordinates of the key atom indicated by the node, the category feature characterizes the atom category to which that key atom belongs, and the equivariant graph neural network is used to extract atomic topological features of the input topology graph;
processing the position information and category features of each node through a plurality of attention weighting layers in the equivariant graph neural network, the last attention weighting layer outputting the atomic topological features of the protein to be queried;
determining a query string for the protein to be queried based on the atomic topological features;
returning, based on the query string, a plurality of target protein fragments that satisfy a similarity condition with the protein to be queried.
In one aspect, a protein index generating apparatus is provided, the apparatus comprising:
a construction module, configured to construct an atomic topology graph of a protein fragment based on key atoms in the amino acid residues of the protein fragment, each node in the atomic topology graph representing one key atom in one amino acid residue;
an input module, configured to input the position information and category features of each node in the atomic topology graph into an equivariant graph neural network, where the position information represents the position coordinates of the key atom indicated by the node, the category feature characterizes the atom category to which that key atom belongs, and the equivariant graph neural network is used to extract atomic topological features of the input topology graph;
a processing module, configured to process the position information and category features of each node through a plurality of attention weighting layers in the equivariant graph neural network, the last attention weighting layer outputting the atomic topological features of the protein fragment;
a generation module, configured to generate an index of a plurality of protein fragments based on the atomic topological features of the plurality of protein fragments.
In some embodiments, the construction module is configured to:
for each amino acid residue in the protein fragment, determine a plurality of key atoms from the backbone of the amino acid residue;
construct the nodes of the atomic topology graph based on the key atoms of each amino acid residue;
for any pair of nodes in the atomic topology graph, construct an edge connecting the pair of nodes.
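The construction steps above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the choice of the backbone atoms N, CA, C and O as the "key atoms", and the dictionary-based input format, are assumptions made for the example; the fully connected edge set follows the "edge for any pair of nodes" step.

```python
# Illustrative sketch: build a fully connected atomic topology graph from
# assumed backbone key atoms (N, CA, C, O) of each amino acid residue.
from itertools import combinations

KEY_ATOMS = ("N", "CA", "C", "O")  # assumed backbone key atoms

def build_atomic_topology_graph(residues):
    """residues: list of dicts mapping atom name -> (x, y, z) coordinates.
    Returns (nodes, edges): each node carries the atom's coordinates and a
    category id; every unordered pair of nodes is connected by an edge."""
    nodes = []
    for res in residues:
        for atom in KEY_ATOMS:
            if atom in res:  # skip atoms missing from the structure data
                nodes.append({"coords": res[atom],
                              "category": KEY_ATOMS.index(atom)})
    edges = list(combinations(range(len(nodes)), 2))  # fully connected
    return nodes, edges

# Two residues with backbone atoms only -> 8 nodes, C(8, 2) = 28 edges.
residues = [
    {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0),
     "C": (2.2, 1.2, 0.0), "O": (3.4, 1.3, 0.0)},
    {"N": (2.0, 2.3, 0.0), "CA": (2.7, 3.5, 0.0),
     "C": (4.2, 3.4, 0.0), "O": (4.9, 4.4, 0.0)},
]
nodes, edges = build_atomic_topology_graph(residues)
```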
In some embodiments, each attention weighting layer in the equivariant graph neural network is further configured to predict an output feature and output coordinates for each node in the atomic topology graph;
the processing module comprises:
a weighted mapping unit, configured to, at any attention weighting layer, apply the layer's query matrix, key matrix and value matrix to the output features of each node from the previous attention weighting layer, obtaining a query vector, key vector and value vector for each node;
a score acquisition unit, configured to obtain the attention score of a node pair formed by any node and another node based on the query vector of the node and the key vector of the other node;
a feature acquisition unit, configured to obtain the attention weighting layer's output feature for the node based on the attention scores of the node pairs containing the node and the value vectors of the other nodes;
a coordinate acquisition unit, configured to obtain the attention weighting layer's output coordinates for the node based on the attention scores of the node pairs containing the node and the previous attention weighting layer's output coordinates for the node and the other nodes.
In some embodiments, the feature acquisition unit is configured to:
for each node pair containing the node, weight the value vector of the other node in the pair by the pair's attention score to obtain a weighted value vector of the node pair;
fuse the weighted value vectors of all node pairs containing the node to obtain the attention weighting layer's output feature for the node.
In some embodiments, the coordinate acquisition unit is configured to:
for each node pair containing the node, obtain the coordinate difference between the previous attention weighting layer's output coordinates for the node and for the other node in the pair;
weight the coordinate difference of the node pair by the pair's attention score to obtain a weighted coordinate difference;
fuse the weighted coordinate differences of all node pairs containing the node to obtain a coordinate offset;
determine the attention weighting layer's output coordinates for the node based on the previous attention weighting layer's output coordinates for the node, the coordinate offset, and a normalization factor.
In some embodiments, the score acquisition unit is configured to:
multiply the query vector of the node with the key vector of the other node to obtain an initial attention score for the node pair;
exponentially normalize the initial attention scores over all node pairs containing the node to obtain the attention score of each node pair.
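The three units above together describe one attention weighting layer. The following numpy sketch is a minimal, non-authoritative rendering of that description: per-node Q/K/V projections, exponentially normalized pairwise attention scores, attention-weighted value fusion for the output features, and attention-weighted coordinate differences divided by a normalization factor for the output coordinates. Matrix shapes and the normalization constant are illustrative assumptions. The final assertion demonstrates the equivariance property the document relies on: rotating or translating the input coordinates transforms the output coordinates identically while leaving the features unchanged.

```python
# Minimal sketch of one attention weighting layer of the equivariant GNN.
import numpy as np

def attention_weighting_layer(h, x, Wq, Wk, Wv, norm_factor=1.0):
    """h: (n, d) node features from the previous layer; x: (n, 3) coordinates."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv            # query/key/value vectors
    scores = q @ k.T                            # initial attention scores
    scores -= scores.max(axis=1, keepdims=True) # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)     # exponential normalization
    h_out = attn @ v                            # fuse weighted value vectors
    diff = x[:, None, :] - x[None, :, :]        # pairwise coordinate differences
    offset = (attn[:, :, None] * diff).sum(axis=1)
    x_out = x + offset / norm_factor            # coordinate update
    return h_out, x_out

rng = np.random.default_rng(0)
n, d = 8, 16
h, x = rng.normal(size=(n, d)), rng.normal(size=(n, 3))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
h1, x1 = attention_weighting_layer(h, x, Wq, Wk, Wv, norm_factor=n)

# Rotation equivariance: rotated input coordinates give rotated output
# coordinates, with identical output features.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
h2, x2 = attention_weighting_layer(h, x @ R.T, Wq, Wk, Wv, norm_factor=n)
assert np.allclose(x2, x1 @ R.T) and np.allclose(h2, h1)
```

Because the attention scores depend only on the features, and the coordinate update is built from coordinate differences, the fused fragment feature is invariant to translating or rotating the fragment, which is the property the beneficial-effects section invokes.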
In some embodiments, the processing module is further configured to:
fuse the output features of the last attention weighting layer for each node to obtain the atomic topological features of the protein fragment.
In some embodiments, the equivariant graph neural network is trained based on the position information and category features of each node in a sample topology graph. The loss function value of the equivariant graph neural network in the training stage comprises a coordinate loss term and a distance loss term, where the coordinate loss term represents the error between the atomic coordinates and the predicted coordinates of each node in the sample topology graph, and the distance loss term represents the error between the atomic distance and the predicted distance of each pair of nodes in the sample topology graph.
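The two-term loss above can be sketched as follows. The use of mean-squared errors and equal term weights are assumptions for illustration; the patent only states that each term measures an error.

```python
# Hedged sketch of the training loss: a coordinate term (per-node error
# between true and predicted coordinates) and a distance term (per-pair
# error between true and predicted inter-atomic distances).
import numpy as np

def topology_loss(x_true, x_pred, w_coord=1.0, w_dist=1.0):
    """x_true, x_pred: (n, 3) true and predicted node coordinates."""
    coord_term = np.mean(np.sum((x_pred - x_true) ** 2, axis=-1))
    d_true = np.linalg.norm(x_true[:, None] - x_true[None, :], axis=-1)
    d_pred = np.linalg.norm(x_pred[:, None] - x_pred[None, :], axis=-1)
    dist_term = np.mean((d_pred - d_true) ** 2)
    return w_coord * coord_term + w_dist * dist_term

x_true = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
assert topology_loss(x_true, x_true) == 0.0  # perfect prediction, zero loss
```

Note that a pure translation of the prediction leaves the distance term at zero but is penalized by the coordinate term, so the two terms are complementary.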
In some embodiments, the generation module comprises:
a clustering unit, configured to cluster the atomic topological features of the plurality of protein fragments to obtain a plurality of mutually disjoint protein fragment sets;
a segmentation unit, configured to segment, per protein fragment set, the atomic topological features of the protein fragments in the set to obtain a plurality of sub-topological features for each protein fragment;
a generation unit, configured to generate indexes of the protein fragments in the protein fragment set based on the cluster centers of each group of same-numbered sub-topological features across the different protein fragments.
In some embodiments, the generation unit comprises:
a clustering subunit, configured to, after the different protein fragments in the protein fragment set are segmented, cluster the sub-topological features within each group of same-numbered sub-topological features to obtain a plurality of cluster subsets for each group;
a codebook generation subunit, configured to generate a codebook for each group of sub-topological features based on the cluster centers of that group's cluster subsets, where the codebook characterizes the sub-topological features located at the cluster centers of the plurality of cluster subsets;
an index generation subunit, configured to generate indexes of the protein fragments in the protein fragment set based on the codebook of each group of sub-topological features.
In some embodiments, the index generation subunit is configured to:
determine the plurality of sub-topological features obtained by segmenting any protein fragment in the protein fragment set;
determine, based on the codebook of each group of sub-topological features, the cluster index value of the cluster subset that each sub-topological feature falls into within its group;
concatenate the cluster index values of the sub-topological features of the protein fragment to obtain the index of the protein fragment.
In one aspect, a protein fragment querying apparatus is provided, the apparatus comprising:
a construction module, configured to construct an atomic topology graph of a protein to be queried based on key atoms in the amino acid residues of the protein to be queried, each node in the atomic topology graph representing one key atom in one amino acid residue;
an input module, configured to input the position information and category features of each node in the atomic topology graph into an equivariant graph neural network, where the position information represents the position coordinates of the key atom indicated by the node, the category feature characterizes the atom category to which that key atom belongs, and the equivariant graph neural network is used to extract atomic topological features of the input topology graph;
a processing module, configured to process the position information and category features of each node through a plurality of attention weighting layers in the equivariant graph neural network, the last attention weighting layer outputting the atomic topological features of the protein to be queried;
a determination module, configured to determine a query string for the protein to be queried based on the atomic topological features;
a return module, configured to return, based on the query string, a plurality of target protein fragments that satisfy a similarity condition with the protein to be queried.
In some embodiments, the determination module comprises:
a set determination unit, configured to determine, based on the atomic topological features, a target fragment set with the highest matching degree to the protein to be queried from among the plurality of protein fragment sets;
an index determination unit, configured to determine the query string of the protein to be queried based on the cluster centers of the cluster subsets of each group of sub-topological features in the target fragment set.
In some embodiments, the set determination unit is configured to:
obtain the distance between the atomic topological feature of the protein to be queried and the atomic topological feature located at the cluster center of each protein fragment set;
determine the closest of the plurality of protein fragment sets as the target fragment set.
In some embodiments, the index determination unit is configured to:
segment the atomic topological feature of the protein to be queried to obtain a plurality of sub-topological features of the protein to be queried;
for each sub-topological feature, determine, from the cluster subsets of the corresponding group of sub-topological features in the target fragment set, the target cluster subset into which the sub-topological feature falls;
query the codebook of that group of sub-topological features to obtain the cluster index value of the target cluster subset, where the codebook characterizes the sub-topological features located at the cluster centers of the plurality of cluster subsets;
concatenate the cluster index values of the target cluster subsets into which the sub-topological features respectively fall, to obtain the query string.
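The query-side steps above can be sketched in the same spirit: pick the fragment set whose cluster center is nearest the query's atomic topological feature, then build the query string from that set's codebooks exactly as fragments were encoded offline. Function names, codebook sizes, and the "-"-joined string format are illustrative assumptions.

```python
# Sketch of query-string construction: coarse set selection by nearest
# cluster center, then per-sub-feature codebook lookup and concatenation.
import numpy as np

def nearest_set(query_feat, set_centers):
    """Index of the fragment set whose cluster center is closest to the query."""
    d2 = ((set_centers - query_feat) ** 2).sum(axis=1)
    return int(d2.argmin())

def query_string(query_feat, codebooks):
    """Concatenate the cluster index value of each sub-feature of the query."""
    parts = []
    for sub, cb in zip(np.split(query_feat, len(codebooks)), codebooks):
        parts.append(str(int(((cb - sub) ** 2).sum(axis=1).argmin())))
    return "-".join(parts)

# Two fragment sets; the query lands in the second one.
set_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
assert nearest_set(np.array([9.0, 11.0]), set_centers) == 1

# Two toy codebooks of two one-dimensional centers each.
codebooks = [np.array([[0.0], [5.0]]), np.array([[1.0], [9.0]])]
assert query_string(np.array([4.6, 1.2]), codebooks) == "1-0"
```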
In some embodiments, the return module is configured to:
determine, from the plurality of protein fragments in the target fragment set, a plurality of candidate protein fragments that satisfy a matching condition with the query string;
screen the plurality of candidate protein fragments based on the distance between the atomic topological feature of the protein to be queried and the atomic topological features of the candidate protein fragments, to obtain the plurality of target protein fragments;
return the plurality of target protein fragments.
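The coarse-match-then-fine-screen step can be sketched as below. The equality match on stored index strings is an assumption for the example; a real system could admit near matches, and the feature-distance re-ranking stands in for whatever fine screening the patent applies.

```python
# Sketch of final screening: keep fragments whose stored index matches the
# query string, then rank those candidates by full-feature distance.
import numpy as np

def screen_and_rank(query_feat, query_str, frag_codes, frag_feats, top_k=2):
    """frag_codes: stored index strings; frag_feats: stored full features."""
    cand = [i for i, c in enumerate(frag_codes) if c == query_str]
    cand.sort(key=lambda i: float(((frag_feats[i] - query_feat) ** 2).sum()))
    return cand[:top_k]

frag_codes = ["1-0", "1-0", "0-1", "1-0"]
frag_feats = np.array([[4.0, 1.0], [4.5, 1.2], [0.0, 9.0], [6.0, 2.0]])
result = screen_and_rank(np.array([4.6, 1.2]), "1-0", frag_codes, frag_feats)
assert result == [1, 0]  # candidates 0, 1, 3 match; 1 is closest, then 0
```

Only the handful of fragments sharing the query string are ever compared in full, which is where the claimed saving over a linear scan of the library comes from.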
In one aspect, a computer device is provided that includes one or more processors and one or more memories having at least one computer program stored therein, the at least one computer program loaded and executed by the one or more processors to implement a method of generating a protein index or a method of querying protein fragments as in any of the possible implementations described above.
In one aspect, a computer readable storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement a method for generating a protein index or a method for querying protein fragments according to any of the possible implementations described above.
In one aspect, a computer program product is provided that includes one or more computer programs stored in a computer-readable storage medium. The one or more processors of the computer device are capable of reading the one or more computer programs from the computer-readable storage medium, the one or more processors executing the one or more computer programs so that the computer device is capable of performing the method of generating a protein index or the method of querying a protein fragment of any of the possible embodiments described above.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
By constructing an atomic topology graph for each protein fragment from the key atoms of its amino acid residues, the atomic topological features extracted from the graph reflect, at atomic granularity, the spatial arrangement of the fragment's amino acid residues. The equivariance of the graph neural network ensures that even if a protein fragment undergoes a translation, rotation or similar transformation, the extracted atomic topological features are unchanged, because the internal structure of the fragment is unchanged; this greatly strengthens the expressive power and accuracy of the atomic topological features and thereby improves the accuracy of the protein indexes generated from them. When the offline index is used to provide an online query service, only the subset of protein fragments located via the index needs fine screening, with no linear scan of the whole library, which greatly reduces the computational cost of the query process, improves query efficiency based on the constructed protein index, and enables fast responses to high-concurrency online query tasks.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an embodiment of a method for generating a protein index according to the present application;
FIG. 2 is a schematic diagram of a protein design process according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for generating a protein index according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for generating a protein index according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an offline index construction and online query of a protein fragment according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the construction of a product quantization-based inverted file index according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a discretized index of protein fragments provided by an embodiment of the present application;
FIG. 8 is a flowchart of a method for querying a protein fragment according to an embodiment of the present application;
FIG. 9 is a flowchart of a method for querying a protein fragment according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a protein index generating apparatus according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a protein fragment query device according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used to distinguish identical or similar items having substantially the same function; it should be understood that "first," "second," and "nth" imply no logical or chronological dependency, nor any limitation on number or order of execution.
The term "at least one" in the present application means one or more, and "a plurality" means two or more, for example, a plurality of protein fragments means two or more protein fragments.
The phrase "comprising at least one of A or B" in this application covers the following cases: only A, only B, and both A and B.
The user-related information (including but not limited to user device information, personal information, behavior information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals referred to in this application are, when the method of this application is applied to a specific product or technology, obtained with the user's permission, consent and full authorization, and the collection, use and processing of such information, data and signals comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the protein fragments referred to in this application are all obtained with full authorization.
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes audio processing, computer vision, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, intelligent wearable devices, virtual assistants, smart speakers, intelligent marketing, autonomous driving, unmanned aerial vehicles, robots, intelligent healthcare, intelligent customer service, the Internet of Vehicles, and intelligent transportation; it is believed that, as technology develops, artificial intelligence will be applied in more fields and become increasingly important. The scheme provided by the embodiments of this application relates to machine learning: by combining machine learning with graph theory, automatic extraction of the atomic topological features of protein fragments can be realized.
Hereinafter, terms related to the embodiments of the present application will be explained.
Computational biology (Computational Biology): a branch of biology that develops and applies methods of data analysis and theory, mathematical modeling, and computer simulation to study biological, behavioral, and social group systems. The ultimate goal of computational biology is not limited to sequencing alone; rather, it uses computational thinking to solve biological problems, and builds, describes, and simulates the biological world with computer languages and mathematical logic.
Protein design: aims to rationally design new protein molecules, by improving and generating protein molecules, to meet specific activities, behaviors, or purposes. During design, the protein structure can be designed from scratch, or new protein molecules can be designed by computing variants of known protein structures and their amino acid sequences. In the embodiments of the present application, the goal of protein design is to predict the amino acid sequence from a given protein structure.
Antibody (Antibody): refers to a protein with protective effects produced by the body as a result of stimulation by an Antigen (Antigen). It is a large Y-shaped protein secreted by plasma cells and used by the immune system to identify and neutralize antigens; it is found only in body fluids such as the blood of vertebrates and on the cell membrane surface of B cells. The whole antibody molecule can be divided into constant regions and variable regions.
Complementarity determining regions (Complementarity Determining Region, CDR): a special protein fragment in an antibody that can bind to an antigen and form spatial-structural complementarity with it, thereby neutralizing the antigen. Stated another way, CDRs refer to the hypervariable regions, i.e., the small proportion of amino acid residues within the variable regions of antibodies and T cell receptors that vary particularly strongly. Since the CDR is typically where an antigen contacts an antibody, the structure of a CDR is also known as an antigen binding site and directly determines the antigen binding specificity of the antibody.
K-nearest neighbor query (K-Nearest Neighbor Search, KNN Search): for a given distance metric function d on a high-dimensional space χ and a known database D containing n high-dimensional vectors in χ, given an arbitrary query vector q, a KNN query returns the K high-dimensional vectors in D whose distances to q, computed with the distance metric function d, are the smallest. Here, n is an integer greater than or equal to 1, and K is an integer greater than or equal to 1.
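The KNN definition above can be made concrete with a short brute-force sketch (illustrative only; the function name, toy data, and default Euclidean metric are assumptions, not part of the embodiment):

```python
import math

def knn_query(database, q, k, dist=None):
    """Brute-force K-nearest-neighbor query: linearly scan all n vectors in
    `database` and return the k vectors closest to q under `dist`."""
    if dist is None:
        # Euclidean distance as the metric d on the high-dimensional space
        dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Rank the whole database by distance to q and keep the k smallest.
    return sorted(database, key=lambda v: dist(v, q))[:k]

db = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.2)]
print(knn_query(db, (0.0, 0.0), 2))  # the two vectors nearest the origin
```

This linear scan costs O(n) distance evaluations per query, which is precisely the cost the index construction of the embodiments is designed to avoid for million-scale databases.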
K Means Clustering (K-Means Clustering): an iterative cluster analysis algorithm that is popular in the field of data mining. Its goal is to divide all data into K cluster subsets such that each data item belongs to the cluster subset whose mean (referred to as the cluster center) is nearest to it. The procedure is as follows: the data are divided into K groups by randomly selecting K objects as initial cluster centers; the distance between each object and each cluster center is then computed, and each object is assigned to the cluster center nearest to it, where a cluster center together with the objects assigned to it represents one cluster subset. Each time a new object is assigned to a cluster subset, the cluster center of that subset is recalculated based on all objects currently in it. The above process repeats until a termination condition is met, including but not limited to: no (or a minimal number of) objects are reassigned to different clusters, no (or a minimal number of) cluster centers change again, or the sum of squared errors of the K-means clustering reaches a local minimum.
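As an illustration of the iterative procedure just described, a minimal K-means loop might look as follows (a sketch with hypothetical names and toy data, not the embodiment's implementation):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: returns the k cluster centers and, for each point,
    the index of the cluster subset it was assigned to."""
    centers = random.Random(seed).sample(points, k)  # random initial centers
    assign = []
    for _ in range(iters):
        # assign every object to its nearest cluster center
        assign = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        # recompute each center as the mean of the objects assigned to it
        new = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            new.append(mean(members) if members else centers[c])
        if new == centers:  # terminate once no cluster center changes
            break
        centers = new
    return centers, assign

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assign = kmeans(pts, 2)
print(assign)  # first three points share one label, last three the other
```

On this toy data the loop converges in a few iterations regardless of which initial centers are sampled.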
Graph Theory (Graph Theory): a branch of mathematics that takes graphs as its subject of study. A Graph in graph theory describes specific relationships between entities; it consists of a number of given nodes (Vertex) and edges (Edge) connecting pairs of nodes, where a node represents an entity and an edge connecting two nodes represents a specific relationship between the two entities. In the embodiments of the present application, an atomic topological graph is constructed for each protein fragment; the nodes in the atomic topological graph represent key atoms of the amino acids in the amino acid sequence of the protein fragment, and an edge connecting two nodes in the atomic topological graph represents the spatial positional relationship between two key atoms.
Inverted file index (Inverted File Index): also known as an inverted index, reverse index, postings file, or inverted file; it is an indexing method that stores, for full-text search, a mapping from a word to its storage locations in a document or a group of documents. It is the most commonly used data structure in document retrieval systems. In an embodiment of the application, an inverted index is constructed for protein fragments, i.e., a mapping that stores, for full-library search, the storage locations of the segmented sub-topological features of a protein fragment within the set of sub-topological features sharing the same sequence number.
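A toy sketch of such an inverted index over quantized sub-feature codes could look as follows (the dictionary layout, fragment identifiers, and code values are assumptions for illustration, not the embodiment's storage format):

```python
from collections import defaultdict

def build_inverted_index(fragment_codes):
    """fragment_codes: {fragment_id: [code_0, code_1, ...]}, where code_i is
    the quantized id of the i-th sub-feature of that fragment.

    Returns posting lists keyed by (position, code): for each sub-feature
    position and codeword, the list of fragments whose sub-feature at that
    position was quantized to that codeword.
    """
    index = defaultdict(list)
    for frag_id, codes in fragment_codes.items():
        for pos, code in enumerate(codes):
            index[(pos, code)].append(frag_id)
    return index

idx = build_inverted_index({"p1": [3, 7], "p2": [3, 2], "p3": [5, 7]})
print(idx[(0, 3)])  # fragments whose first sub-feature fell in codeword 3
```

At query time, looking up the posting lists for the query's own codes narrows the search to a small candidate set instead of the whole library.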
Hereinafter, technical ideas related to the embodiments of the present application will be described.
In computational biology, tasks of designing protein structures are often involved. For example, when designing the protein structure of an antibody, an expert designs the protein structure of the antibody against an antigen and, using the artificially designed protein structure as a query term (Query), searches a protein database for protein fragments with similar protein structures.
In general, structural similarity between different proteins is determined mainly by a structure matching algorithm. For example, for a given protein structure to be queried, a TM-score between each protein fragment in a protein database and that protein structure is calculated one by one using the TM-align algorithm, where the TM-score is an index output by the TM-align algorithm that measures protein structural similarity; then, based on the TM-score of each protein fragment, the protein fragment most similar to the given protein structure is selected from the protein database.
As protein structure prediction tools mature, protein databases are growing ever larger, reaching millions or even billions of entries. In a query mode based on a structure matching algorithm, every protein fragment in the database must be linearly scanned, so the computational cost is extremely high, the query efficiency is extremely low, and it is difficult to respond to highly concurrent online query tasks. In addition, because the similarity measurement of protein structures involves matching protein structures of indefinite length, the matching is determined by the rules of different structure matching algorithms and by heuristics about proteins; as a result, a particular structure matching algorithm may perform well only on a certain type of protein design task and cannot be applied to general protein structure analysis.
In view of this, the embodiments of the present application provide a method for generating protein indexes and a method for querying protein fragments, which construct an index for each protein fragment in a protein database on the basis of graph theory. After the index for the whole protein database is built, KNN queries for a given protein structure can be provided externally, so that the demand for rapid online queries can be met, and the method is applicable to general protein structure analysis.
The system architecture according to the embodiment of the present application is described below.
FIG. 1 is a schematic diagram of an implementation environment of a method for generating a protein index according to an embodiment of the present application. Referring to FIG. 1, the implementation environment involves a terminal 101 and a server 102, which are described in detail below:
Terminal 101 is any computer device capable of supporting a protein fragment query service. The terminal 101 installs and runs an application supporting a protein fragment query system; optionally, the application may be a protein prediction application, a protein query application, a protein analysis application, or the like. The type of the application program is not specifically limited in the embodiments of the present application.
In some embodiments, a biological expert formulates a protein to be queried, for example, a protein structure fragment obtained by designing an antibody-protein CDR for an antigen protein, and then inputs the protein to be queried into the application as a query term. In response to a query operation on the protein to be queried, the terminal 101 generates a query request carrying the protein to be queried, and the terminal 101 sends the query request to the server 102.
The terminal 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
Server 102 is a server, a plurality of servers, a cloud computing platform, a virtualization center, or the like capable of providing protein fragment query services. Server 102 is used to provide background services for applications on terminal 101 that support the protein fragment query service. Optionally, during the query of the protein fragments, the server 102 performs a primary query, and the terminal 101 performs a secondary query; alternatively, the server 102 undertakes the secondary query, and the terminal 101 undertakes the primary query; alternatively, a distributed computing architecture is used for collaborative querying between the terminal 101 and the server 102.
In some embodiments, server 102 is a stand-alone physical server, or a server cluster or distributed system of multiple physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), and basic cloud computing services such as big data and artificial intelligence platforms.
In some embodiments, the server 102 maintains a protein database storing a huge number of natural protein fragments; the size of the protein database typically reaches millions or billions of entries. According to the method for generating a protein index provided by the embodiments of the present application, the server 102 generates, for each protein fragment in the protein database, an index of that protein fragment based on its atomic topological features. Because an atomic topological feature is typically a continuous vector representation in a high-dimensional space while an index is typically a discrete vector representation in a low-dimensional space, the continuous high-dimensional features can be compressed into discrete indexes, which facilitates responding rapidly to protein fragment queries.
In some embodiments, the server 102 receives the query request sent by the terminal 101, parses it to obtain the protein to be queried, extracts the atomic topological features of the protein to be queried, routes the protein to be queried, based on those features, to the protein fragment set to which it belongs, and performs a fine-grained KNN query inside that protein fragment set to return the K target protein fragments most similar to the protein to be queried. Here, K is a preset value representing the number of protein fragments returned by each query; it may be configured individually by a technician according to query requirements or specified by the user who initiates the query.
In some embodiments, the device types of the terminal 101 include: at least one of a smart phone, tablet, notebook, desktop, smart speaker, smart watch, MP3 (Moving Picture Experts Group Audio Layer III, moving picture experts compression standard audio layer 3) player, MP4 (Moving Picture Experts Group Audio Layer IV, moving picture experts compression standard audio layer 4) player, or e-book reader, but not limited thereto.
Those skilled in the art will appreciate that the terminal 101 may refer broadly to one of a plurality of terminals, and that the number of terminals 101 may be greater or lesser. For example, the number of the terminals 101 may be only one, or the number of the terminals 101 may be several tens or hundreds, or more. The number and device type of the terminals 101 are not limited in the embodiment of the present application.
Hereinafter, a protein design flow will be described by taking a protein design scenario as an example.
FIG. 2 is a schematic diagram of a protein design process according to an embodiment of the present application. As shown in FIG. 2, the protein design process includes: 1) a structure query stage, in which a biological expert prepares the protein to be queried and initiates a query request for it; 2) an in-library indexing stage, in which the protein to be queried is obtained by parsing the query request and a KNN query is performed on it in a protein database based on the protein fragment query method, where the protein database has an index constructed for each of its protein fragments based on the protein index generation method of the embodiments of the present application; 3) a return stage, in which the K most similar target protein fragments in the protein database are queried and returned; 4) a design and embedding stage, in which the biological expert further designs, embeds, or adjusts the amino acid sequence of the protein to be queried based on the amino acid sequences of the K returned target protein fragments, so as to realize the structural design of the protein to be queried.
Taking the sequence design process of the CDR fragment of an antibody as an example: a biological expert manually designs the protein structure of a CDR fragment according to the epitope; then, using the method for generating a protein index of the embodiments of the present application, CDR indexes are created for the CDR fragments of a known massive set of antibodies. For any given protein structure, the K most similar CDR fragments can then be quickly queried from the protein database of CDR fragments, so as to respond rapidly to the biological expert's structure query demand and provide the amino acid sequences of the K CDR fragments in the protein database. This makes it convenient for the biological expert to further design and embed the amino acid sequence of the CDR fragment to be designed based on the amino acid sequences of natural CDR fragments.
For protein databases at the million or even billion scale, a traditional structure matching algorithm is limited to a linear-scan query mode, and the query time is usually several hours or more. The CDR index constructed by the embodiments of the present application can compress the query time to the second level, which greatly improves query efficiency and in turn improves protein design efficiency.
It should be noted that the sequence design process of the CDR fragments of an antibody is described only as an example; for the remaining protein fragments of an antibody, or for protein design processes other than antibodies, protein indexes can be constructed and protein fragments in a protein database can be quickly retrieved in a similar manner.
The basic flow of the protein index generation method according to the embodiment of the present application will be described below.
FIG. 3 is a flowchart of a method for generating a protein index according to an embodiment of the present application. Referring to FIG. 3, this embodiment is performed by a computer device, which may be implemented as the terminal 101 or the server 102 in the above implementation environment; the description here takes the server as the executing device only as an example. The embodiment includes the following steps:
301. The server constructs an atomic topological graph of the protein fragment based on the key atoms in the amino acid residues of the protein fragment, each node in the atomic topological graph characterizing a key atom in one of the amino acid residues.
The protein fragment refers to any protein fragment cut or segmented from any natural protein molecule stored in the protein database on the server side; for example, in the design of antibody CDR fragments, indexes need to be built for the CDR fragments of all natural protein molecules in the protein database.
Wherein each protein fragment has a unique amino acid sequence, which refers to a sequence formed by a plurality of amino acid residues constituting the protein fragment. For each amino acid residue, the atomic structure generally includes a main chain and one or more side chains, and the individual atoms on the main chain and side chains are arranged according to the spatial structure of the amino acid residue.
Schematically, for each amino acid residue on a protein fragment, only the 4 key atoms (C, N, O, Cα) on the main chain may be considered, where C, N, and O respectively represent the carbon, nitrogen, and oxygen atoms constituting the peptide bond in the main chain, and Cα represents the alpha carbon atom in the main chain.
Illustratively, for each amino acid residue on a protein fragment, in addition to the key atoms on the backbone, key atoms on the side chains may also be included in the process of establishing an atomic topology map, which is not specifically limited in the embodiments of the present application.
In some embodiments, taking the selection of the 4 key atoms (C, N, O, Cα) in the main chain as an example, for each protein fragment in the protein database, the amino acid sequence of that protein fragment is obtained, and for each amino acid residue in the amino acid sequence, the three-dimensional coordinates of the 4 key atoms (C, N, O, Cα) of that amino acid residue in the main chain are obtained. Next, based on the 4 key atoms (C, N, O, Cα) in the main chain of each amino acid residue, an atomic topological graph can be constructed for the protein fragment, such that each node in the atomic topological graph represents a key atom on an amino acid residue in the amino acid sequence, and an edge connecting a pair of nodes represents the positional relationship of a pair of key atoms in three-dimensional space. Optionally, all nodes in the atomic topological graph are constructed from each key atom on the main chain of each amino acid residue in the amino acid sequence, and then a connecting edge is created for every pair of nodes in the atomic topological graph, so that a connecting edge exists between any two different nodes in the atomic topological graph; this facilitates transferring attention information between all key atoms using the equivariant graph neural network.
Schematically, for a protein database P = {p_1, ..., p_n} comprising n (n ≥ 1) protein fragments, take the i-th (1 ≤ i ≤ n) protein fragment p_i as an example: the protein fragment p_i has an amino acid sequence A = [a_1, ..., a_m] consisting of m (m ≥ 1) amino acid residues. Taking the j-th (1 ≤ j ≤ m) amino acid residue a_j in the amino acid sequence as an example, the three-dimensional coordinates x_{j,1...4} of the 4 key atoms (C, N, O, Cα) of the amino acid residue a_j in the main chain are obtained. Thus, the three-dimensional coordinates of the 4 key atoms in the main chain of all m amino acid residues form a three-dimensional coordinate sequence, which can be regarded as the coordinate information X = [x_{1,1...4}, ..., x_{m,1...4}] of the protein fragment p_i; the coordinate information X characterizes the spatial arrangement of the protein fragment p_i at atomic granularity.
For each protein fragment in the protein database, an atomic topological graph can be constructed using the amino acid sequence A and the coordinate information X, such that each node in the atomic topological graph represents a certain key atom (C, N, O, or Cα). Optionally, after all nodes of the atomic topological graph are constructed, a connecting edge is created for every two different nodes, to facilitate the subsequent use of the equivariant graph neural network to transfer attention information between all key atoms.
Schematically, taking the protein fragment p_i as an example, an atomic topological graph G_i = (V, E, X) can be constructed by the above mapping method, where V = {v_1, v_2, ..., v_4m} represents the 4m nodes (4 key atoms on the main chain are selected for each of the m amino acid residues, giving 4m key atoms, i.e., 4m nodes); E = {(v_i, v_j) | i = 1, ..., 4m, j = 1, ..., 4m, i ≠ j} represents the set of edges in the atomic topological graph, i.e., a connecting edge exists between any pair of mutually different nodes v_i and v_j on the atomic topological graph; and X = {x_1, x_2, ..., x_4m} represents the three-dimensional coordinates of the key atom indicated by each of the 4m nodes. In this way, a graph set G = {G_1, ..., G_n} of the atomic topological graphs of the individual protein fragments can finally be obtained. From the point of view of graph theory, the atomic topological graph of each protein fragment is a complete graph, i.e., a simple undirected graph in which every pair of different nodes is connected by one edge.
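The construction of the complete atomic topological graph G_i = (V, E, X) can be sketched as follows (the backbone atom names and placeholder coordinates are illustrative assumptions, not real residue geometry):

```python
BACKBONE_ATOMS = ("N", "CA", "C", "O")  # the 4 key atoms per residue

def build_atomic_topology(coords_per_residue):
    """Build the complete graph G = (V, E, X) described above.

    coords_per_residue: list of m residues, each a dict mapping the 4
    backbone atom names to an (x, y, z) coordinate.  Returns node labels V,
    the edge set E (one edge per unordered node pair), and coordinates X.
    """
    V, X = [], []
    for j, residue in enumerate(coords_per_residue):
        for atom in BACKBONE_ATOMS:
            V.append((j, atom))   # node = one key atom of residue j
            X.append(residue[atom])
    n = len(V)                    # n = 4m nodes in total
    # complete graph: one edge between every pair of distinct nodes
    E = [(i, k) for i in range(n) for k in range(i + 1, n)]
    return V, E, X

# Two-residue fragment with placeholder coordinates (illustrative values)
res = {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0),
       "C": (2.0, 1.4, 0.0), "O": (2.0, 2.6, 0.3)}
V, E, X = build_atomic_topology([res, res])
print(len(V), len(E))  # 8 nodes, 8*7/2 = 28 edges
```

For m residues the graph has 4m nodes and 4m(4m − 1)/2 undirected edges, matching the complete-graph property stated above.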
302. The server inputs the position information and category feature of each node in the atomic topological graph into an equivariant graph neural network, where the position information represents the position coordinates of the key atom indicated by the node, the category feature represents the feature of the atomic category to which the key atom indicated by the node belongs, and the equivariant graph neural network is used to extract the atomic topological features of the input topological graph.
Wherein the atomic topology features are used to characterize the spatial arrangement of amino acid residues in the protein fragment at atomic granularity.
The equivariant graph neural network is a graph neural network that performs message passing using an attention mechanism, and it must satisfy the following conditions: if the output signal is a scalar (i.e., a quantity of the queried protein fragment that has no direction), the output scalar does not change with a transformation of the input signal, i.e., f(Px) = f(x); if the output signal is a vector (e.g., a quantity having a direction, such as a coordinate or a force), the output vector changes equivariantly with the transformation of the input signal, i.e., f(Px) = P·f(x), where x represents the input signal, P represents a transformation matrix applied to the input signal in Euclidean space, and f(·) represents the action of the graph neural network.
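The invariance condition f(Px) = f(x) and the equivariance condition f(Px) = P·f(x) can be checked numerically on two toy "features" (this is a didactic sketch, not the network itself; the function names are illustrative):

```python
import math

def rotate_z(points, theta):
    """Apply a rotation P about the z-axis to every 3-D point."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def pairwise_dist_sum(points):
    """A scalar output: sum of all pairwise distances.  Invariant: f(Px) = f(x)."""
    return sum(math.dist(p, q)
               for i, p in enumerate(points) for q in points[i + 1:])

def centroid(points):
    """A vector output: the centroid.  Equivariant: f(Px) = P f(x)."""
    return tuple(sum(c) / len(points) for c in zip(*points))

pts = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
rot = rotate_z(pts, 0.7)
# scalar output unchanged by the rotation; vector output rotated with it
print(abs(pairwise_dist_sum(rot) - pairwise_dist_sum(pts)) < 1e-9)
print(all(abs(a - b) < 1e-9
          for a, b in zip(centroid(rot), rotate_z([centroid(pts)], 0.7)[0])))
```

Both checks print True, mirroring why rotating or translating a protein fragment leaves its extracted scalar features unchanged.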
In some embodiments, based on the atomic topological graph constructed in step 301, each node on the atomic topological graph has unique position information (i.e., the three-dimensional coordinates of the key atom indicated by the node) and a unique category feature (i.e., the feature of the atomic category to which the key atom indicated by the node belongs). When only the key atoms on the main chain are considered, the atomic categories comprise 4 categories in total, namely C, N, O, and Cα, and each atomic category corresponds to a unique category feature.
In some embodiments, after the position information and the category feature of each node in the atomic topological graph are acquired, they are input into the equivariant graph neural network. The equivariant graph neural network processes the position information and the category feature of each node, so that attention information of different key atoms can be transferred between different nodes, and finally the atomic topological feature of the protein fragment is output.
In this process, after the atomic topological graph is constructed for the protein fragment, the atomic topological feature of the protein fragment is extracted using the equivariant graph neural network. By virtue of the equivariance property, if a transformation such as rotation or translation is performed on the protein fragment without changing its internal structure (i.e., a transformation matrix P is applied to the protein to be queried), the query result returned from the protein database remains unchanged.
303. The server processes the position information and the category feature of each node through a plurality of attention weighting layers in the equivariant graph neural network, and the last attention weighting layer outputs the atomic topological feature of the protein fragment.
The equivariant graph neural network is a multi-layer graph neural network comprising a plurality of attention weighting layers connected in series; the series connection means that each attention weighting layer takes the output signal of the previous attention weighting layer as input and feeds the output signal obtained after its own processing into the next attention weighting layer.
Each attention weighting layer involves an attention layer and a softmax (exponential normalization) layer. The attention layer performs weighted mappings under the attention mechanism with the weight matrices W_Q, W_K, and W_V, and the softmax layer exponentially normalizes the attention scores. The detailed attention mechanism will be described in a later embodiment and is not expanded here.
In some embodiments, after the server obtains the position information and the category feature of each node in the atomic topological graph, they are input into the first attention weighting layer of the equivariant graph neural network. The first attention weighting layer processes the position information and the category feature of each node and predicts an output coordinate and an output feature for each node based on the attention mechanism; it then feeds the output coordinates and output features predicted for each node into the second attention weighting layer, and so on, until the last attention weighting layer predicts a final output coordinate and output feature for each node. The output features of all nodes from the last attention weighting layer are then fused to obtain the atomic topological feature of the protein fragment over the whole atomic topological graph. Optionally, the fusion manner of the output features of the nodes includes but is not limited to: concatenation (Concat), element-wise addition, element-wise multiplication, bilinear fusion, and the like, which is not specifically limited in the embodiments of the present application.
The output coordinate that each attention weighting layer predicts for a node represents the coordinate information in three-dimensional space that the current attention weighting layer estimates for that node. Similarly, the output feature that each attention weighting layer predicts for a node represents the spatial distribution feature in three-dimensional space that the current attention weighting layer estimates for the key atom of that node. Because the attention mechanism computes attention scores by taking into account the features of all key atoms over the three-dimensional spatial distribution, the mutual influence of different key atoms on the three-dimensional spatial distribution can be transferred, through the attention scores, between the nodes corresponding to different key atoms, thereby controlling the degree to which different key atoms contribute to each other's predicted output features.
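A heavily simplified, hypothetical sketch of one such attention-weighted update is given below: feature dot products serve as attention scores, and the coordinate update is built purely from relative displacements (which is what keeps coordinate outputs rotation-equivariant). The embodiment's actual layer with W_Q/W_K/W_V projections is deliberately not reproduced here:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention_layer(feats, coords):
    """One simplified attention-weighted update over a complete graph.

    For each node i, attention scores over all nodes j are computed from
    feature similarity and softmax-normalised; the new feature is the
    attention-weighted sum of node features, and the new coordinate adds an
    attention-weighted sum of relative displacements (coords[j] - coords[i]).
    """
    n, fd = len(feats), len(feats[0])
    new_feats, new_coords = [], []
    for i in range(n):
        att = softmax([dot(feats[i], feats[j]) for j in range(n)])
        new_feats.append(tuple(sum(att[j] * feats[j][d] for j in range(n))
                               for d in range(fd)))
        new_coords.append(tuple(coords[i][d]
                                + sum(att[j] * (coords[j][d] - coords[i][d])
                                      for j in range(n))
                                for d in range(3)))
    return new_feats, new_coords
```

With identical features on every node the attention is uniform, so every node's updated coordinate moves to the centroid; stacking several such layers and fusing the final per-node features corresponds to the multi-layer pipeline described above.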
In this process, on the basis of constructing an atomic topological graph for each protein fragment and extracting an atomic topological feature from each atomic topological graph with the equivariant graph neural network, a graph database can be constructed from the protein database, which facilitates quantizing the atomic topological features of the protein fragments based on the properties of the graph data. Illustratively, an atomic topological graph is constructed for each protein fragment in the protein database P, and the atomic topological graphs of all protein fragments form a graph database G = {G_1, ..., G_n}; the equivariant graph neural network is then used to extract the atomic topological feature of each atomic topological graph, and the atomic topological features of all protein fragments form a feature database F = {f_1, ..., f_n}, where n represents the number of protein fragments and also the size of the database.
Through steps 301-303, the protein structure of each protein fragment in the protein database, abstracted in three-dimensional space, is represented as a feature vector in a high-dimensional space (i.e., an atomic topological feature). Because each protein fragment is treated as graph data in the data structure, a self-supervised graph representation learning method can conveniently be used during the training of the equivariant graph neural network to generate the vector representation of each atomic topological graph; that is, the atomic topological feature of each protein fragment is extracted in a self-supervised training manner, which reduces training cost and accelerates training.
304. The server generates an index of the plurality of protein fragments based on the atomic topology features of the plurality of protein fragments.
In some embodiments, for each protein fragment in the protein database, the atomic topological feature of that protein fragment is extracted through the above steps 301-303. Then, an offline index of each protein fragment can be constructed from its atomic topological feature by means of product quantization; a detailed description of product quantization is deferred to the next embodiment. This index construction manner further converts the representation of the atomic topological feature of each protein fragment from a feature vector in a high-dimensional space into a series of equal-length compressed vectors, so that an inverted file index can be built for the protein database from the compressed vectors, facilitating rapid protein fragment queries using the inverted file index in the online query stage.
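The encoding side of product quantization, which the next embodiment details, can be illustrated with a minimal sketch (the codebooks and vector values are toy assumptions; real codebooks would be learned, e.g., by K-means over sub-vectors):

```python
def pq_encode(vector, codebooks):
    """Product-quantize one high-dimensional vector into a short code.

    The vector is split into len(codebooks) equal-length sub-vectors; each
    sub-vector is replaced by the index of its nearest centroid in the
    corresponding codebook.  A d-dimensional float vector thus compresses
    to a few small integers, which is what the offline index stores.
    """
    m = len(codebooks)
    d = len(vector) // m
    code = []
    for i, book in enumerate(codebooks):
        sub = vector[i * d:(i + 1) * d]
        # nearest centroid id under squared Euclidean distance
        code.append(min(range(len(book)),
                        key=lambda c: sum((x - y) ** 2
                                          for x, y in zip(sub, book[c]))))
    return code

# Two codebooks of 2 centroids each, for a 4-dim vector split into 2 halves
books = [[(0.0, 0.0), (5.0, 5.0)], [(0.0, 0.0), (9.0, 9.0)]]
print(pq_encode((4.8, 5.1, 0.2, -0.1), books))  # -> [1, 0]
```

Each resulting code position can then serve as a key into the per-position posting lists of an inverted file index, so that a query only visits fragments sharing its codes.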
Schematically, in the online query stage, the atomic topological feature of the input protein to be queried is first extracted in a manner similar to steps 301-303, a query string of the protein to be queried is then obtained in a manner similar to step 304, and finally a KNN query is performed in the inverted file index to return the K target protein fragments with the most similar structures. The specific online query manner will be described in detail in a later embodiment.
According to the method provided by the embodiments of the present application, an atomic topological graph is constructed for each protein fragment from the key atoms of its amino acid residues, so that the atomic topological features extracted from the atomic topological graph reflect, at atomic granularity, the spatial arrangement of the amino acid residues of the protein fragment. The equivariance property of the graph neural network guarantees that, even if the protein fragment undergoes transformations such as translation and rotation, the extracted atomic topological feature remains unchanged because the internal structure of the protein fragment is unchanged. This greatly guarantees the expressive power and accuracy of the atomic topological features and thereby improves the accuracy of the protein fragment indexes generated from them. When an online query service is provided using the offline indexes, only the indexes are needed to locate a subset of protein fragments for subsequent fine screening, without linearly scanning the whole database; this greatly reduces the computational cost of the query process, improves the query efficiency based on the constructed protein indexes, and enables rapid responses to highly concurrent online query tasks.
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
In the above embodiment, the construction flow of the protein index was briefly introduced: an atomic topology map is constructed for each protein fragment in the protein database, an equivariant graph neural network is then used to extract the atomic topology feature of each atomic topology map, and an offline index is generated from the atomic topology features.
In the embodiment of the present application, the index construction process of each protein fragment in the protein database will be described in detail, and in particular, the extraction process of the atomic topology feature of each protein fragment and the product quantization process of constructing the offline index will be described below.
Fig. 4 is a flowchart of a method for generating a protein index according to an embodiment of the present application. Referring to fig. 4, this embodiment is performed by a computer device, which may be implemented as the terminal 101 or the server 102 in the above-described implementation environment; here, the computer device being a server is taken only as an example. The method includes the following steps:
401. the server constructs an atomic topology map of the protein fragment based on the key atoms in the amino acid residues of the protein fragment, each node in the atomic topology map characterizing a key atom in one of the amino acid residues.
In some embodiments, for each protein fragment in the protein database, the server determines, for each amino acid residue in the protein fragment, a plurality of key atoms from the backbone of that amino acid residue. Illustratively, for each protein fragment, the server obtains the amino acid sequence of the protein fragment and then, for each amino acid residue in the amino acid sequence, determines a plurality of key atoms from the backbone of that residue, e.g., selecting the 4 backbone key atoms (C, N, O, Cα). Alternatively, key atoms on the side chain may also be included in the construction of the atomic topology map, so as to improve the fineness of the atomic topology map.
In some embodiments, the server builds each node in the atomic topology map based on each key atom of each amino acid residue; e.g., for each amino acid residue in the amino acid sequence of a protein fragment, and for each backbone key atom of that residue, a node that uniquely indicates that key atom is created in the atomic topology map. For example, for a protein database containing n (n ≥ 1) protein fragments, take the i-th (1 ≤ i ≤ n) protein fragment p_i as an example, and suppose the amino acid sequence of p_i consists of m (m ≥ 1) amino acid residues, A = [a_1, …, a_m]. For each amino acid residue in amino acid sequence A, a node is created in the atomic topology map G_i of p_i for each of its 4 backbone key atoms (C, N, O, Cα), such that each node corresponds uniquely to one backbone key atom of one amino acid residue in the amino acid sequence of the protein fragment. Obviously, when the amino acid sequence contains m amino acid residues, a total of 4m backbone key atoms of the m residues must be considered, so the resulting atomic topology map contains 4m nodes. That is, each amino acid residue creates 4 nodes in the atomic topology map, and the node set V of the atomic topology map G_i constructed from amino acid sequence A is V = {υ_1, υ_2, …, υ_{4m}}, where each node in the node set V uniquely indicates one backbone key atom of one amino acid residue in the amino acid sequence of the protein fragment.
It should be noted that the above description selects only the 4 backbone key atoms for each amino acid residue; if more or fewer key atoms are considered for each amino acid residue, the number of nodes in the atomic topology map changes accordingly, which is not repeated here.
In some embodiments, the server constructs, for any pair of nodes in the atomic topology map, an edge connecting the pair of nodes. After the server constructs all nodes of the atomic topology map from the backbone key atoms of each amino acid residue, a complete graph can be formed directly over these nodes, where a complete graph is a simple undirected graph in which every pair of distinct nodes is connected by exactly one edge; the resulting complete graph is the atomic topology map of the protein fragment. This guarantees that any pair of nodes in the atomic topology map is connected by exactly one edge, which facilitates attention-based message passing between any pair of nodes in the equivariant graph neural network.
Schematically, taking protein fragment p_i as an example, the atomic topology map G_i = (V, E, X) is built, where the node set V = {υ_1, υ_2, …, υ_{4m}}, the edge set E = {(υ_i, υ_j) | i = 1, …, 4m, j = 1, …, 4m, i ≠ j}, i.e., any pair of mutually distinct nodes υ_i and υ_j on the atomic topology map is connected by an edge, and the coordinate set X = {x_1, x_2, …, x_{4m}} represents the three-dimensional coordinates of the key atoms indicated by each of the 4m nodes.
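The graph construction in this step can be sketched as follows; the function `build_atomic_topology` and the dict-based residue representation are hypothetical conveniences, assuming 4 backbone key atoms per residue and a complete graph over the resulting 4m nodes.

```python
from itertools import combinations
import numpy as np

def build_atomic_topology(residues):
    """residues: list of m dicts mapping a backbone atom class
    ('C', 'N', 'O', 'CA') to its 3-D coordinate.
    Returns (X, classes, E): node coordinates of shape (4m, 3), the per-node
    atom classes, and the edge set of the complete graph over the 4m nodes."""
    classes, coords = [], []
    for residue in residues:
        for atom in ("C", "N", "O", "CA"):  # 4 backbone key atoms per residue
            classes.append(atom)
            coords.append(residue[atom])
    x = np.asarray(coords, dtype=float)                  # coordinate set X
    edges = list(combinations(range(len(classes)), 2))   # one edge per node pair
    return x, classes, edges
```

For m residues this yields 4m nodes and 4m(4m − 1)/2 undirected edges, matching the complete-graph definition above.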
402. The server inputs the position information and category feature of each node in the atomic topology map into an equivariant graph neural network, where the position information represents the position coordinates of the key atom indicated by the node, the category feature represents the feature of the atom class to which the key atom indicated by the node belongs, and the equivariant graph neural network is used for extracting the atomic topology feature of the input topology map.
Wherein the atomic topology features are used to characterize the spatial arrangement of amino acid residues in the protein fragment at atomic granularity.
In some embodiments, based on the atomic topology map constructed in step 401, for each node on the atomic topology map, the three-dimensional coordinates of the key atom indicated by the node are obtained as the position information of the node, and a category feature of the node is initialized according to the atom class to which the key atom indicated by the node belongs. Optionally, when only backbone key atoms are considered, the atom classes may include the 4 classes C, N, O, and Cα, where each atom class corresponds to a unique category feature: key atoms of the same atom class share the same category feature, and key atoms of different atom classes have different category features.
In some embodiments, after the position information and category feature of each node in the atomic topology map are acquired, they are input into the equivariant graph neural network, which processes them as described in step 403. The equivariant graph neural network is a multi-layer graph neural network comprising a plurality of serially connected attention weighting layers, where the serial relation means that each attention weighting layer takes the output signal of the previous attention weighting layer as input and feeds its own processed output signal into the next attention weighting layer.
403. The server processes the position information and category feature of each node through the plurality of attention weighting layers in the equivariant graph neural network, and the last attention weighting layer outputs the atomic topology feature of the protein fragment.
Wherein each attention weighting layer in the equivariant graph neural network is further configured to predict an output feature and output coordinates for each node in the atomic topology map.
In some embodiments, after the server obtains the position information and category feature of each node in the atomic topology map, they are input together into the first attention weighting layer of the equivariant graph neural network. The first attention weighting layer processes the position information and category feature of each node and predicts output coordinates and an output feature for each node based on the attention mechanism; the first attention weighting layer then inputs the output coordinates and output features predicted for each node into the second attention weighting layer, and so on, until the last attention weighting layer predicts final output coordinates and an output feature for each node.
In some embodiments, the server fuses the output features predicted by the last attention weighting layer for each node, obtaining the atomic topology feature of the protein fragment over the entire atomic topology map. Optionally, the fusion manner of the output features of each node includes but is not limited to: concatenation (Concat), element-wise addition, element-wise multiplication, bilinear fusion, etc., which is not specifically limited in the embodiments of the present application.
Here, the output coordinates predicted by each attention weighting layer for a node represent the three-dimensional coordinate information estimated by the current attention weighting layer for that node; likewise, the output feature predicted by each attention weighting layer for a node represents the spatial-distribution feature, in three-dimensional space, estimated by the current attention weighting layer for the key atom indicated by that node. Since the attention mechanism computes attention scores by taking the three-dimensional spatial distribution of all key atoms into account, the mutual influence of different key atoms' spatial distributions can be propagated, via the attention scores, between the nodes corresponding to different key atoms, thereby controlling how much different key atoms contribute to each other's predicted output features.
Illustratively, for an equivariant graph neural network with a serial multi-layer attention-weighting-layer architecture, the input of each attention weighting layer is the output coordinates and output features of each node produced by the previous attention weighting layer, and the output of each attention weighting layer is the new output coordinates and output features predicted for each node after message passing, i.e., weighted mapping, based on the attention mechanism. Let l denote the layer index of an attention weighting layer in the equivariant graph neural network, where l is an integer greater than or equal to 0. The l-th attention weighting layer predicts output coordinates x^l and output features h^l for each node; after x^l and h^l are input into the (l+1)-th attention weighting layer, new output coordinates x^{l+1} and output features h^{l+1} are predicted for each node, i.e., the relation h^{l+1}, x^{l+1} = f(h^l, x^l) holds, where f(·) denotes the weighted mapping operation in the current attention weighting layer.
It should be noted that when l = 0, the initial feature h^0 of each node in the 0th attention weighting layer may be initialized to the category feature of the node, and the initial coordinates x^0 of each node in the 0th attention weighting layer may be initialized to the position information of the node. In other words, the position information of each node is input into the first attention weighting layer as its initial coordinates x^0, and the category feature of each node is input into the first attention weighting layer as its initial feature h^0.
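The serial relation h^{l+1}, x^{l+1} = f(h^l, x^l) can be sketched as a simple loop over layers; the function name `egnn_forward` and the representation of each attention weighting layer as a callable are assumptions for illustration.

```python
def egnn_forward(h0, x0, layers):
    """Sketch of the serial stack: each attention weighting layer consumes
    the previous layer's output features and coordinates and produces new
    ones, (h^{l+1}, x^{l+1}) = f(h^l, x^l). h0 is the category-feature
    initialization and x0 the position-information initialization."""
    h, x = h0, x0
    for layer in layers:
        h, x = layer(h, x)  # one attention weighting layer per iteration
    return h, x
```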
In some embodiments, since different attention weighting layers in the equivariant graph neural network process their inputs in the same manner, only a single attention weighting layer is taken as an example for explanation. When a single-head attention mechanism is adopted, any attention weighting layer in the equivariant graph neural network can process its input signal through the following steps A1 to A4:
and A1, the server respectively carries out weighted mapping on the output characteristics of each node in the previous attention weighting layer based on the query matrix, the key matrix and the value matrix of the attention weighting layer to obtain the query vector, the key vector and the value vector of each node.
In some embodiments, when a single-head attention mechanism is adopted, each attention weighting layer in the equivariant graph neural network involves one attention layer and one softmax layer. The attention layer performs weighted mapping based on the weight matrices W_Q, W_K, and W_V under the attention mechanism, and the softmax layer exponentially normalizes the attention scores; the weight matrix W_Q is also known as the query matrix, the weight matrix W_K as the key matrix, and the weight matrix W_V as the value matrix.
It should be noted that when a multi-head attention mechanism is adopted, it suffices to configure multiple sets of different weight matrices W_Q, W_K, W_V in each attention layer; each set of weight matrices W_Q, W_K, W_V performs the same attention-based weighted mapping, and the multiple output features extracted by the multiple sets of weight matrices W_Q, W_K, W_V of the multi-head attention mechanism in the last attention weighting layer only need to be fused, in a manner including but not limited to: concatenation (Concat), element-wise addition, element-wise multiplication, bilinear fusion, etc., which is not specifically limited in the embodiments of the present application.
For simplicity, the single-head attention mechanism is taken as an example, in which each attention layer has a uniquely determined query matrix W_Q, key matrix W_K, and value matrix W_V, whose values are determined after training is finished.
Take the i-th node υ_i in the atomic topology map as an example, and let the current attention weighting layer be the (l+1)-th layer and the previous attention weighting layer be the l-th layer, where the l-th layer provides the (l+1)-th layer with the output coordinates x_i^l and output features h_i^l of node υ_i. For the query matrix W_Q, key matrix W_K, and value matrix W_V in the (l+1)-th layer, the query matrix W_Q, key matrix W_K, and value matrix W_V are used to perform weighted mapping on the l-th layer's output features h_i^l of node υ_i, respectively, obtaining the query vector q_i^{l+1}, key vector k_i^{l+1}, and value vector v_i^{l+1} of node υ_i in the (l+1)-th layer. That is, these variables satisfy the weighted mapping relations given by the following formulas:
q_i^{l+1} = W_Q h_i^l
k_i^{l+1} = W_K h_i^l
v_i^{l+1} = W_V h_i^l
where l is an integer greater than or equal to 0.
The above query vector q_i^{l+1}, key vector k_i^{l+1}, and value vector v_i^{l+1} are all obtained by weighted mapping from the l-th layer's output features h_i^l of node υ_i; these three vectors are used to compute the attention score of each node pair in the (l+1)-th layer.
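A minimal numpy sketch of the weighted mappings above, applied to all nodes at once; the function name `qkv_project` and the row-per-node layout are assumptions.

```python
import numpy as np

def qkv_project(h, w_q, w_k, w_v):
    """h: (n, d) matrix whose row i is the previous layer's output feature
    h_i^l of node i. Returns the per-node query, key and value vectors of the
    next layer as rows: q_i = W_Q h_i, k_i = W_K h_i, v_i = W_V h_i."""
    return h @ w_q.T, h @ w_k.T, h @ w_v.T
```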
A2, for the node pair formed by any node and each other node, the server obtains the attention score of the node pair based on the query vector of the node and the key vector of the other node.
In some embodiments, there are a total of 4m nodes in the atomic topology map. For any node υ_i, node υ_i and every other node υ_j in the atomic topology map can form a node pair; considering also the node pair formed by node υ_i with itself, a total of 4m node pairs can be determined for each node.
Take the node pair {υ_i, υ_j} formed by any node υ_i and another node υ_j as an example. For node υ_i, the server obtains its query vector q_i^{l+1}, key vector k_i^{l+1}, and value vector v_i^{l+1} through step A1; similarly, for the other node υ_j, the server obtains its query vector q_j^{l+1}, key vector k_j^{l+1}, and value vector v_j^{l+1} through step A1. Then, in computing the attention score of the node pair {υ_i, υ_j}, at least the query vector q_i^{l+1} of node υ_i and the key vector k_j^{l+1} of the other node υ_j are used, where the other node υ_j refers to any node in the atomic topology map other than node υ_i; of course, υ_j may also be made equal to node υ_i, in which case the attention score that node υ_i passes to itself is computed.
In some embodiments, the initial attention score of the node pair {υ_i, υ_j} is computed first, and then, using the 4m node pairs containing node υ_i, the initial attention score of {υ_i, υ_j} is exponentially normalized to obtain the final attention score of {υ_i, υ_j}. Taking this as an example, the above manner of obtaining the normalized attention score of a node pair involves the following steps A21 to A22, where the computation of the initial attention score corresponds to the processing of the attention layer in the current attention weighting layer, and the exponential normalization corresponds to the processing of the softmax layer in the current attention weighting layer.
A21, for the node pair formed by any node and another node, the server multiplies the query vector of the node with the key vector of the other node to obtain the initial attention score of the node pair.
In some embodiments, for the node pair {υ_i, υ_j} formed by any node υ_i and another node υ_j, after the attention layer of the current (l+1)-th layer computes the query vector q_i^{l+1} of node υ_i and the key vector k_j^{l+1} of the other node υ_j, the query vector q_i^{l+1} of node υ_i is multiplied with the key vector k_j^{l+1} of the other node υ_j, obtaining the initial attention score of the node pair {υ_i, υ_j}: α̃_ij^{l+1} = q_i^{l+1} · k_j^{l+1}, where "·" denotes the vector dot product.
The initial attention score α̃_ij^{l+1} of the node pair {υ_i, υ_j} characterizes the attention weight passed from the other node υ_j to node υ_i. This attention weight reflects which feature values of the key vector k_j^{l+1} of node υ_j are enhanced after being filtered by the query vector q_i^{l+1} of node υ_i, and which feature values are suppressed after that filtering. Therefore, the initial attention score of the node pair {υ_i, υ_j} embodies attention-based message passing between a pair of nodes; the message passed is, in fact, the mutual-influence information between the node features of two key atoms on the same protein fragment. In this way, the node features of different key atoms are coupled to each other: the output feature of each key atom takes the node features of the other key atoms in the same protein fragment into account, with appropriate trade-offs, so that feature values to which other key atoms contribute strongly are emphasized in the output feature of the key atom, while feature values to which other key atoms contribute weakly are suppressed.
A22, based on the initial attention scores of all node pairs containing the node, the server performs exponential normalization on the initial attention score of the node pair to obtain the attention score of the node pair.
In some embodiments, for the 4m node pairs containing node υ_i, the server can compute each pair's initial attention score in the manner provided in step A21, and then input the initial attention scores of the 4m node pairs containing node υ_i into the softmax layer of the current (l+1)-th layer. The softmax layer normalizes them with the softmax activation function, obtaining the normalized attention score α_ij^{l+1} of the node pair {υ_i, υ_j}:
α_ij^{l+1} = exp(α̃_ij^{l+1}) / Σ_k exp(α̃_ik^{l+1})
For the node pair {υ_i, υ_j} formed by any node υ_i and another node υ_j, the attention score α_ij^{l+1} of {υ_i, υ_j} can be computed in the manner provided in steps A21 to A22; by repeating steps A21 to A22, the attention score of any node pair in the atomic topology map can be obtained.
In the steps a21 to a22, a possible implementation manner of calculating the initial attention score and then performing exponential normalization to obtain the final attention score is provided, so that the attention score of each node pair can be fully regularized and normalized, and normalization of the output features is facilitated.
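Steps A21 to A22 can be sketched for all node pairs at once as follows; the function name `attention_scores` and the max-subtraction for numerical stability are assumptions beyond the text.

```python
import numpy as np

def attention_scores(q, k):
    """q, k: (n, d) per-node query and key vectors. The initial score of the
    pair (i, j) is the dot product q_i . k_j (step A21); each row i is then
    exponentially normalized over all j, including j = i (step A22)."""
    raw = q @ k.T                          # (n, n) initial attention scores
    raw = raw - raw.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(raw)
    return e / e.sum(axis=1, keepdims=True)     # rows sum to 1
```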
In other embodiments, the initial attention score of each node pair may also be left without exponential normalization, so that no softmax layer needs to be set in each attention weighting layer and the initial attention scores directly participate in the subsequent weighted mapping process, which simplifies the architecture of the equivariant graph neural network and the computation flow of the current attention weighting layer for the output feature of each node.
A3, the server obtains the output feature of the attention weighting layer for the node based on the attention score of each node pair containing the node and the value vector of each other node.
In some embodiments, for any node υ_i, the current (l+1)-th layer's output feature h_i^{l+1} of node υ_i is obtained based on the attention scores α_ij^{l+1} of the 4m node pairs containing node υ_i and the value vectors v_j^{l+1} of the 4m other nodes respectively involved in those 4m node pairs.
In some embodiments, the attention scores α_ij^{l+1} of the node pairs containing node υ_i may be used to perform a weighted summation over the value vectors v_j^{l+1} of the other nodes υ_j in those node pairs, obtaining the (l+1)-th layer's output feature h_i^{l+1} of node υ_i, which is described in detail through the following steps A31 to A32.
And A31, the server weights the value vectors of other nodes in the node pair based on the attention score of the node pair to obtain a weighted value vector of the node pair.
In some embodiments, taking the node pair {υ_i, υ_j} containing node υ_i as an example, based on the attention score α_ij^{l+1} of the node pair {υ_i, υ_j}, the value vector v_j^{l+1} of the other node υ_j involved in {υ_i, υ_j} is weighted, obtaining the weighted value vector α_ij^{l+1} v_j^{l+1} of the node pair {υ_i, υ_j}.
A32, the server fuses the weighted value vectors of each node pair comprising the node to obtain the output characteristics of the attention weighted layer to the node.
In some embodiments, step A31 is performed on all 4m node pairs containing node υ_i to compute the weighted value vectors α_ij^{l+1} v_j^{l+1} of the 4m node pairs, and the weighted value vectors of the 4m node pairs are then fused, obtaining the (l+1)-th layer's output feature of node υ_i:
h_i^{l+1} = Σ_{j∈N(i)} α_ij^{l+1} v_j^{l+1}
where j ∈ N(i) means that the other node υ_j belongs to the 1-step neighbor subgraph of node υ_i, i.e., υ_j is any node connected to node υ_i (including node υ_i itself); since the atomic topology map is a complete graph, the other nodes υ_j are in fact all 4m nodes in the atomic topology map.
In the above process, although the fusion of the weighted value vectors of the 4m node pairs is described by taking element-wise addition as an example, other fusion manners such as concatenation, element-wise multiplication, and bilinear fusion may also be adopted, which is not specifically limited in the embodiments of the present application.
The above steps A31 to A32 provide a possible implementation of computing the new output feature h_i^{l+1} in the (l+1)-th layer: starting from each node, the value vectors of the other nodes in its 1-step neighbor subgraph are weighted and summed using the attention scores passed by those nodes, so that the output feature of each node fully fuses, with appropriate trade-offs, the interaction information of the other nodes' value vectors. This improves the expressive power of each node's output feature, which fully contains the interaction information of different nodes in three-dimensional space.
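A compact sketch of the aggregation in steps A31 to A32, under the element-wise-addition fusion described above; the function name `aggregate_features` is an assumption.

```python
import numpy as np

def aggregate_features(alpha, v):
    """alpha: (n, n) normalized attention scores; v: (n, d) value vectors.
    The new output feature of node i is the attention-weighted sum of all
    value vectors: h_i^{l+1} = sum_j alpha_ij * v_j^{l+1}."""
    return alpha @ v
```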
A4, the server obtains the output coordinates of the attention weighting layer for the node based on the attention score of each node pair containing the node and the output coordinates of the previous attention weighting layer for the node and for each other node.
In some embodiments, for any node υ_i, the current (l+1)-th layer's output coordinates x_i^{l+1} of node υ_i are obtained based on the attention scores α_ij^{l+1} of the 4m node pairs containing node υ_i, the previous layer's output coordinates x_i^l of node υ_i, and the previous layer's output coordinates x_j^l of the other nodes υ_j.
In some embodiments, a possible implementation of obtaining the (l+1)-th layer's output coordinates x_i^{l+1} of node υ_i is described below through steps A41 to A44.
A41, for each node pair containing the node, the server obtains the coordinate difference between the output coordinates of the previous attention weighting layer for the node and for the other node in the node pair.
In some embodiments, taking the node pair {υ_i, υ_j} containing node υ_i as an example, the coordinate difference between the l-th layer's output coordinates of node υ_i and of the other node υ_j in the node pair is obtained, i.e., the difference x_i^l − x_j^l between the l-th layer's output coordinates x_i^l of node υ_i and the l-th layer's output coordinates x_j^l of the other node υ_j.
And A42, the server weights the coordinate differences of the node pairs based on the attention scores of the node pairs to obtain weighted coordinate differences.
In some embodiments, taking the node pair {υ_i, υ_j} containing node υ_i as an example, based on the attention score α_ij^{l+1} of the node pair {υ_i, υ_j}, the coordinate difference x_i^l − x_j^l computed in step A41 is weighted, obtaining the weighted coordinate difference α_ij^{l+1}(x_i^l − x_j^l); that is, the attention score α_ij^{l+1} of the node pair {υ_i, υ_j} is multiplied with the coordinate difference x_i^l − x_j^l to obtain the weighted coordinate difference.
And A43, the server fuses the weighted coordinate differences of each node pair comprising the node to obtain the coordinate offset.
In some embodiments, steps A41 to A42 are performed on all 4m node pairs containing node υ_i to compute the weighted coordinate differences α_ij^{l+1}(x_i^l − x_j^l) of the 4m node pairs, and the weighted coordinate differences of the 4m node pairs are then fused, obtaining the coordinate offset of the (l+1)-th layer's output coordinates of node υ_i:
Σ_{j∈N(i)} α_ij^{l+1} (x_i^l − x_j^l)
where j ∈ N(i) means that the other node υ_j belongs to the 1-step neighbor subgraph of node υ_i, i.e., υ_j is any node connected to node υ_i; since the atomic topology map is a complete graph, the other nodes υ_j are in fact all 4m nodes in the atomic topology map.
And A44, the server determines the output coordinates of the attention weighting layer to the node based on the output coordinates of the previous attention weighting layer to the node, the coordinate offset and the normalization factor.
In some embodiments, the server determines the (l+1)-th layer's output coordinates x_i^{l+1} of node υ_i based on the l-th layer's output coordinates x_i^l of node υ_i, the coordinate offset, and a preset normalization factor C.
Schematically, the normalization factor C is multiplied with the coordinate offset, and the product is added to the l-th layer's output coordinates x_i^l of node υ_i, obtaining the (l+1)-th layer's output coordinates x_i^{l+1} of node υ_i; that is, x_i^{l+1} and x_i^l satisfy the following relation:
x_i^{l+1} = x_i^l + C Σ_{j∈N(i)} α_ij^{l+1} (x_i^l − x_j^l)
where C = 1/(4m − 1) represents the normalization factor, and m is the number of amino acid residues contained in the protein fragment.
When a multi-head attention mechanism is adopted, it suffices to aggregate the multiple sets of coordinate offsets computed under the multi-head attention mechanism; the multi-head attention mechanism does not affect the value of the normalization factor C or of the l-th layer's output coordinates x_i^l of node υ_i.
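The coordinate update of steps A41 to A44 can be sketched as follows, assuming a single attention head; the function name `update_coordinates` is illustrative. Note the update is translation-equivariant, since it depends on coordinates only through the differences x_i − x_j.

```python
import numpy as np

def update_coordinates(x, alpha):
    """x: (n, 3) layer-l output coordinates of the n = 4m nodes;
    alpha: (n, n) attention scores. Implements
    x_i^{l+1} = x_i^l + C * sum_j alpha_ij (x_i^l - x_j^l), C = 1/(n - 1)."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]             # (n, n, 3) coordinate differences
    offset = (alpha[:, :, None] * diff).sum(axis=1)  # fused weighted differences
    return x + offset / (n - 1)                      # normalization factor C
```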
The above steps A41 to A44 provide the processing logic for weighted mapping inside any attention weighting layer of an equivariant graph neural network under a single-head attention mechanism. It should be understood that the single-head attention mechanism is described here only for simplicity; a multi-head attention mechanism may also be adopted in the attention weighting layer, requiring only that multiple sets of weight matrices W_Q, W_K, W_V be trained. That is, whether the attention weighting layer adopts a single-head or a multi-head attention mechanism is not specifically limited.
In the above process, after the processing of each attention weighting layer, not only a predicted output feature but also predicted output coordinates are obtained for each node. The role of the output coordinates is mainly to facilitate self-supervised training of the equivariant graph neural network: since the true coordinates of the key atom indicated by each node are available, no additional supervision signal needs to be applied, which can greatly reduce the training cost of the equivariant graph neural network. The role of the output features is to generate the atomic topology feature of the protein fragment: since a protein fragment is abstractly represented by the key atoms on each of its amino acid residues, the server executes steps A1 to A4 in each attention weighting layer of the equivariant graph neural network, repeating them layer by layer until the last attention weighting layer predicts a final output feature for each node, at which point the output features of all nodes can be fused to obtain the atomic topology feature of the whole protein fragment.
Taking as an example an equivariant graph neural network comprising L attention weighting layers in total, where the last attention weighting layer is the L-th layer, the L-th layer fuses the output features of each node, obtaining the atomic topology feature h_G of the protein fragment over the atomic topology map G:
h_G = Σ_{i=1}^{4m} h_i^L
where L is an integer greater than or equal to 1, i.e., the equivariant graph neural network comprises at least 1 attention weighting layer.
The atomic topology feature of the protein fragment retains the arrangement characteristics of each node, i.e., each key atom, in three-dimensional space, and the node arrangement characteristics of each key atom fully reflect attention-based information interaction with the other key atoms. The atomic topology feature therefore has high expressive power and accuracy, and it is guaranteed to remain invariant when the protein fragment is translated or rotated.
In some embodiments, the equivariant graph neural network is trained based on the position information and category features of each node in sample topology maps. Since the structural information of sample proteins is readily available in the protein database, and this structural information provides the three-dimensional coordinates and atom class of each key atom in the sample protein, the sample topology map of a sample protein is easily constructed, and an initial graph neural network is trained in a self-supervised manner according to the position information (i.e., the three-dimensional coordinates of the key atoms) and category features (i.e., the features of the atom classes to which the key atoms belong) of each node in the sample topology map, obtaining the equivariant graph neural network.
In some embodiments, the loss function value of the equivariant graph neural network in the training phase includes a coordinate loss term and a distance loss term. The coordinate loss term characterizes the error between the atomic coordinates and the predicted coordinates of each node in the sample topology graph; the distance loss term characterizes the error between the atomic distance and the predicted distance of each pair of nodes in the sample topology graph.
Specifically, each attention-weighting layer introduced in step A4 predicts an output coordinate for every node. By comparing, for each node in the sample topology graph, the predicted coordinate (i.e. the output coordinate of the last attention-weighting layer for that node) against the true atomic coordinate, a coordinate loss term L_coor for self-supervised training can be constructed. The node coordinates thus serve as a self-supervision signal that helps the equivariant graph neural network better learn the structural information of protein fragments.
Furthermore, in the training phase, random Gaussian noise can be added to the coordinates X of some nodes in any sample topology graph G = (V, E, X) to obtain a noisy sample topology graph G̃ = (V, E, X̃). The noisy sample topology graph G̃ is then used to extract the corresponding atomic topology feature, and an output coordinate X′ is obtained for the nodes. From the true atomic coordinates X of the nodes, the true distance between any two nodes in three-dimensional space can be computed, i.e. a distance matrix D is computed over the whole sample topology graph, where D_ij = ||x_i − x_j||_2. Similarly, from the output coordinates X′ predicted by the equivariant graph neural network for each node, a distance matrix D′ can be computed in the same way, so that a distance loss term L_dist for self-supervised training can be constructed from the error between the distance matrices D and D′.
In other embodiments, the distance matrix D is computed in the same way, but each element D′_ij in the i-th row and j-th column of the distance matrix D′ is obtained from the output feature h_i that the equivariant graph neural network predicts for node v_i and the output feature h_j it predicts for node v_j: the output features h_i and h_j are concatenated into a spliced feature, the spliced feature is input into a neural network, and that neural network computes the element D′_ij of the output distance matrix D′. For example, the neural network may be implemented as an MLP (Multilayer Perceptron), in which case D′_ij = MLP(h_i, h_j). In this way the distance matrix D′ is parameterized by a neural network, so that each element D′_ij is determined by the output features of a pair of nodes; this can improve the accuracy of D′_ij, and optimizing the distance loss term L_dist then has a better training effect.
Further, the sum of the coordinate loss term and the distance loss term is used as the loss function value of the equivariant graph neural network in the training phase, L_final = L_coor + L_dist. In the iterative training that minimizes this loss function value, the node coordinates predicted by the equivariant graph neural network are driven as close as possible to the true coordinates, so that the network learns the structural information of protein fragments.
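The self-supervised loss L_final = L_coor + L_dist can be sketched as follows; the use of mean squared error for both terms is an assumption, since the text does not fix the exact error metric:

```python
import numpy as np

def pairwise_distances(X):
    """Distance matrix D with D_ij = ||x_i - x_j||_2 over node coordinates."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def self_supervised_loss(true_coords, pred_coords):
    """L_final = L_coor + L_dist (MSE assumed for both terms)."""
    L_coor = ((true_coords - pred_coords) ** 2).mean()
    D = pairwise_distances(true_coords)
    D_pred = pairwise_distances(pred_coords)
    L_dist = ((D - D_pred) ** 2).mean()
    return L_coor + L_dist
```

When the predicted coordinates coincide with the true coordinates, both terms vanish, which is the fixed point the iterative training drives toward.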
In other embodiments, besides using the coordinate loss term and the distance loss term as supervision signals, the dihedral angles of each protein fragment may be introduced as an additional supervision signal; that is, a dihedral-angle loss term may be added to the loss function value. The true dihedral angles can be computed directly from the structural information of the protein fragment, and the predicted dihedral angles can be computed from the predicted atomic coordinates and the atomic topology feature, so the error between the true and predicted dihedral angles can be used as the dihedral-angle loss term.
In other embodiments, a similarity supervision signal for protein fragments may be introduced during training. For example, if a specific application needs to query protein fragments whose similarity satisfies a certain protein similarity measure, a similarity supervision signal based on that measure can be added during the training of the equivariant graph neural network; alternatively, a trained equivariant graph neural network can be fine-tuned with the similarity supervision signal of that protein similarity measure.
Optionally, when the similarity supervision signal is introduced, a contrastive-learning approach may be used and a contrastive loss term added to the loss function value. The contrastive loss term is smaller when protein fragments judged highly similar under the protein similarity measure have closer continuous and discrete representations, and when protein fragments judged dissimilar under the measure have more distant continuous and discrete representations. For example, suppose the protein similarity measure required by a certain protein design task is the TM-Align algorithm. Multiple pairs of positive samples with high TM-scores (i.e. highly similar pairs with close TM-scores) and multiple pairs of negative samples with low TM-scores (i.e. dissimilar pairs with large TM-score gaps) are sampled and used as the similarity supervision signal. By adding a contrastive loss term to the loss function, the feature extraction network is driven, under this supervision signal, to place the atomic topology features of positive pairs as close as possible in both the high-dimensional vector space and the discrete index space, while keeping the atomic topology features of negative pairs as far apart as possible in both spaces.
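As an illustrative sketch of such a contrastive loss term (the exact form is not given in the text; a standard margin-based pairwise loss is assumed, with the TM-score used only to label pairs as positive or negative):

```python
import numpy as np

def contrastive_loss(anchor, other, is_positive, margin=1.0):
    """Pull positive pairs (high TM-score) together; push negative pairs
    (low TM-score) at least `margin` apart in feature space.
    Margin-based form is an assumption, not the patent's stated formula."""
    d = np.linalg.norm(anchor - other)
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Summing this term over the sampled positive and negative pairs and adding it to L_final would realize the similarity supervision signal described above.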
404. The server clusters the atomic topology features of the protein fragments to obtain a plurality of mutually disjoint protein fragment sets.
In some embodiments, the server extracts the atomic topology feature of each protein fragment in the protein database through steps 401-403 above, which amounts to inputting the amino acid sequence A and the coordinate information X of each protein fragment into the equivariant graph neural network f; under the action of f, each protein fragment is abstracted into a d-dimensional atomic topology feature. The atomic topology features of the individual protein fragments in the protein database are then clustered to form a plurality of mutually disjoint protein fragment sets (i.e. coarse-grained cluster sets of protein fragments).
In some embodiments, the atomic topology features of the individual protein fragments are clustered by K-means: the entire protein database is pre-partitioned into K_1 protein fragment sets, where K_1 is a positive integer preset by a technician. K_1 protein fragments are randomly selected as the initial cluster centers of the K_1 protein fragment sets, the distance between the atomic topology feature of each protein fragment and the atomic topology features of the K_1 initial cluster centers is computed, and the current protein fragment is assigned to the protein fragment set of the nearest initial cluster center. For each protein fragment set, every time a new protein fragment is assigned to it, an updated cluster center is recomputed from all protein fragments currently in the set, i.e. the mean of the atomic topology features of all existing protein fragments is taken as the updated cluster center. The K-means clustering process above is performed iteratively until a coarse-clustering termination condition is satisfied, which optionally includes, but is not limited to: no (or a minimum number of) protein fragments are reassigned to different protein fragment sets; or no (or a minimum number of) cluster centers of the protein fragment sets change again; or the sum of squared errors of the K-means clustering reaches a local minimum; or the number of iteration steps reaches a coarse-clustering iteration threshold, a positive integer preset by a technician. The embodiments of the present application do not specifically limit the coarse-clustering termination condition.
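The coarse-grained K-means procedure above can be sketched as a minimal Lloyd's iteration (illustrative only; the several termination conditions listed are reduced here to a fixed iteration count):

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Minimal Lloyd's K-means over atomic topology features.
    Returns the cluster centers and each feature's set assignment."""
    rng = np.random.default_rng(seed)
    # random features serve as the initial cluster centers
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # assign each feature to its nearest center (L2 distance)
        d = np.linalg.norm(features[:, None] - centers[None, :], axis=-1)
        assign = d.argmin(axis=1)
        # recompute each center as the mean of its assigned features
        for j in range(k):
            if (assign == j).any():
                centers[j] = features[assign == j].mean(axis=0)
    return centers, assign
```

The same routine also covers the fine-grained clustering of step B1 below, applied to sub-topology features instead of whole atomic topology features.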
Illustratively, an atomic topology graph is constructed for each protein fragment in the protein database, and the atomic topology graphs of all protein fragments form a graph database. The atomic topology feature of each protein fragment is extracted on the basis of its atomic topology graph, and the atomic topology features of all protein fragments form a feature database; that is, the graph database corresponding to the protein fragments is encoded into the feature database, in which each atomic topology feature is a d-dimensional vector. Then, the feature database is divided by the K-means clustering algorithm into K_1 mutually disjoint protein fragment sets; denoting the cluster center of a protein fragment set by the symbol c, the cluster centers of the K_1 protein fragment sets form the sequence {c_1, ..., c_{K_1}}.
FIG. 5 is a schematic diagram of offline index construction and online query of protein fragments provided by an embodiment of the present application. As shown in FIG. 5, in the offline library-construction phase, an atomic topology graph is constructed for each protein fragment in the protein database, and the equivariant graph neural network is used to extract the atomic topology feature of each protein fragment, converting the protein database into a feature database and completing the vectorization of the protein structures. Then, the atomic topology features of the protein fragments in the feature database are coarsely clustered and divided into K_1 mutually disjoint protein fragment sets.
Taking K_1 = 4 as an example, coarse-grained K-means clustering is performed on all atomic topology features in the whole feature database to obtain 4 mutually disjoint protein fragment sets, which amount to a coarse-grained partition of the whole protein database; the cluster centers of the 4 protein fragment sets form the sequence {c_1, ..., c_4}. Combined with the product quantization of steps 405-406 below, this coarse-grained partition allows an inverted file index to be built over the protein fragment sets formed by coarse clustering, which greatly improves query efficiency for feature databases at the million scale or above.
In some embodiments, besides K-means clustering, the protein fragments in the protein database can also be clustered by KNN clustering or by other clustering algorithms or clustering models to divide them into a plurality of mutually disjoint protein fragment sets. Mutually disjoint means that each protein fragment is uniquely assigned to one protein fragment set, i.e. different protein fragment sets cannot contain the same protein fragment. The embodiments of the present application do not specifically limit the clustering manner.
405. Taking a protein fragment set as the unit, the server segments the atomic topology features of the protein fragments in the protein fragment set to obtain a plurality of sub-topology features of each protein fragment.
Here, the sub-topology features with the same sequence number after segmentation of the different protein fragments in the protein fragment set form a group of sub-topology features.
In some embodiments, each protein fragment set obtained by the coarse clustering of step 404 above is further subjected to fine clustering: for each protein fragment in the protein fragment set, the atomic topology feature of the protein fragment is truncated into a plurality of sub-topology features.
FIG. 6 is a schematic diagram of constructing an inverted file index based on product quantization according to an embodiment of the present application. As shown in FIG. 6, coarse-grained K-means clustering of the atomic topology feature of each protein fragment in the feature database yields K_1 mutually disjoint protein fragment sets. Taking any one protein fragment set as an example, assume it contains y coarsely clustered protein fragments. By the nature of the equivariant graph neural network, the atomic topology features of all protein fragments are d-dimensional vector representations (d ≥ 1), i.e. every atomic topology feature in the feature database has the same dimension d.
For the y d-dimensional atomic topology features f_i of the y protein fragments in the current protein fragment set, each d-dimensional atomic topology feature f_i is truncated into M sub-topology features (f_i1, ..., f_iM). With equidistant truncation, the lengths of the M sub-topology features are all equal to d/M, where M is a hyper-parameter preset by a technician and is an integer greater than or equal to 1, for example M = 3. In one example, d/M is set to an integer power of 2.
Optionally, each protein fragment in the protein fragment set is truncated equidistantly into M sub-topology features, so that the sub-topology features with the same sequence number after segmentation of the different protein fragments form a group of sub-topology features. For example, the first-segment sub-topology features of the y segmented protein fragments form the first group of sub-topology features {f_11, ..., f_y1}, the second-segment sub-topology features form the second group {f_12, ..., f_y2}, and so on, until the M-th-segment sub-topology features form the M-th group {f_1M, ..., f_yM}; in the end M groups of sub-topology features are formed.
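The equidistant truncation can be sketched in a few lines (a reshape, assuming d is divisible by M as the text requires):

```python
import numpy as np

def split_feature(f, M):
    """Cut a d-dimensional atomic topology feature into M equal-length
    sub-topology features of dimension d // M."""
    d = f.shape[0]
    assert d % M == 0, "feature dimension must be divisible by M"
    return f.reshape(M, d // M)

f = np.arange(6.0)          # a toy feature with d = 6
subs = split_feature(f, 3)  # 3 sub-topology features of length 2
```

Stacking the j-th rows over the y fragments of a set then yields exactly the j-th group of sub-topology features {f_1j, ..., f_yj}.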
Still referring to FIG. 6, for each of the 4 mutually disjoint protein fragment sets obtained by coarse clustering, the atomic topology feature of each protein fragment in the set is truncated into M equal parts to obtain the M sub-topology features of each protein fragment, finally forming M groups of sub-topology features.
406. The server generates the indexes of the protein fragments in the protein fragment set based on the cluster centers of each group of sub-topology features with the same sequence number across the different protein fragments.
In some embodiments, the server performs group-wise clustering on each group of sub-topology features (each group comprising the sub-topology features with the same sequence number after segmentation of the different protein fragments in the protein fragment set) to generate K_2 cluster subsets for each group, and then generates the indexes of the y protein fragments in the protein fragment set based on the K_2 cluster centers of the K_2 cluster subsets of each group of sub-topology features.
One possible implementation of generating the offline index of protein fragments from the cluster centers of each group of sub-topology features is described below through steps B1 to B3:
B1. The server clusters the plurality of sub-topology features in each group of sub-topology features (those with the same sequence number after segmentation of the different protein fragments in the protein fragment set) to obtain a plurality of cluster subsets of each group.
In some embodiments, for each group of sub-topology features formed by the sub-topology features with the same sequence number after segmentation of the different protein fragments in the protein fragment set, the server may cluster the sub-topology features within the group to obtain a plurality of cluster subsets of the group. It should be noted that, for each group, the sub-topology features it contains belong to different protein fragments respectively.
In some embodiments, each protein fragment set is truncated through step 405 to obtain M groups of sub-topology features, and the sub-topology features within each group of each protein fragment set are clustered to form a plurality of mutually disjoint cluster subsets (i.e. fine-grained clusters of sub-topology features).
In some embodiments, the sub-topology features in each group are clustered by K-means: the group of sub-topology features is pre-partitioned into K_2 cluster subsets, where K_2 is a positive integer preset by a technician. K_2 sub-topology features are randomly selected as the initial cluster centers of the K_2 cluster subsets, the distance between each sub-topology feature and the K_2 initial cluster centers is computed, and the current sub-topology feature is assigned to the cluster subset of the nearest initial cluster center. For each cluster subset, every time a new sub-topology feature is assigned to it, an updated cluster center is recomputed from all sub-topology features currently in the subset, i.e. the mean of all existing sub-topology features is taken as the updated cluster center. The K-means clustering process above is performed iteratively until a fine-clustering termination condition is satisfied, which optionally includes, but is not limited to: no (or a minimum number of) sub-topology features are reassigned to different cluster subsets; or no (or a minimum number of) cluster centers of the cluster subsets change again; or the sum of squared errors of the K-means clustering reaches a local minimum; or the number of iteration steps reaches a fine-clustering iteration threshold, a positive integer preset by a technician. The embodiments of the present application do not specifically limit the fine-clustering termination condition.
Illustratively, for the current protein fragment set, K-means clustering is performed on the sub-topology features of the i-th group {f_1i, ..., f_yi} (1 ≤ i ≤ M), dividing them into K_2 mutually disjoint cluster subsets; denoting the cluster center of a cluster subset by the symbol c, the cluster centers of the K_2 cluster subsets form the sequence {c_1, ..., c_{K_2}}.
Still taking FIG. 6 as an example, for each of the M groups of sub-topology features, fine-grained K-means clustering is performed on all sub-topology features within the group to obtain K_2 mutually disjoint cluster subsets; the K_2 cluster subsets amount to a fine-grained partition of the sub-topology features of the protein fragments in the protein fragment set. The cluster centers of the K_2 cluster subsets form the codebook of the current group of sub-topology features, finally yielding M codebooks {C_1, ..., C_M} for the M groups of sub-topology features. The M codebooks may be integrated into a global codebook C = C_1 × ... × C_M, which corresponds to the Cartesian product of the sets formed by the cluster centers of the cluster subsets of the M groups of sub-topology features.
The fine-grained partitioning above classifies each protein fragment in the protein fragment set at fine granularity, so that the protein index can conveniently be constructed based on the cluster subsets into which the sub-topology features of each protein fragment fall (an index constructed this way is an inverted index).
It should be noted that step 404 involves one coarse-grained clustering and step B1 involves one fine-grained clustering. In either clustering process, since the distance between each object and all cluster centers must be computed (an object in the coarse-grained clustering is an atomic topology feature, and an object in the fine-grained clustering is a sub-topology feature), a distance metric function is needed; the metric may be the L2 distance, a vector inner product, or another distance metric function, and the embodiments of the present application do not specifically limit it.
In some embodiments, besides K-means clustering, the sub-topology features in each group can also be clustered by KNN clustering or by other clustering algorithms or clustering models to divide them into a plurality of mutually disjoint cluster subsets; mutually disjoint means that each sub-topology feature is uniquely assigned to one cluster subset, i.e. different cluster subsets cannot contain the same sub-topology feature.
B2. The server generates a codebook for each group of sub-topology features based on the cluster centers of the plurality of cluster subsets of the group, the codebook characterizing the sub-topology features located at those cluster centers.
In some embodiments, each group of sub-topology features in the protein fragment set is divided into K_2 cluster subsets through step B1. Based on the sub-topology feature at the cluster center of each of the K_2 cluster subsets, a codebook of the current group of sub-topology features is generated; the codebook characterizes the sub-topology features at the cluster centers of the K_2 cluster subsets of the current group.
For example, the codebook records the cluster index values of the sub-topology features at the cluster centers of the K_2 cluster subsets of the current group. For each protein fragment set, the M codebooks {C_1, ..., C_M} of its M groups of sub-topology features can thus be obtained, and these M codebooks {C_1, ..., C_M} can provide an index for any protein fragment located in the protein fragment set.
B3. The server generates the indexes of the protein fragments in the protein fragment set based on the codebooks of the groups of sub-topology features.
In some embodiments, for each of the K_1 protein fragment sets formed by the clustering of step 404, the set is divided into M groups of sub-topology features through step 405, each of the M groups is divided into K_2 cluster subsets through step B1, and the M codebooks of the M groups are generated through step B2; an index is then constructed for each protein fragment in each protein fragment set through the following sub-steps B31-B33.
It should be noted that the embodiments of the present application describe the index generation flow using a single protein fragment set as an example; it should be understood that for each of the K_1 protein fragment sets formed by the coarse clustering of step 404, an index can be constructed for every protein fragment within the set through the following sub-steps B31-B33:
B31. The server determines the plurality of sub-topology features obtained by segmenting any protein fragment in the protein fragment set.
In some embodiments, for each of the K_1 protein fragment sets obtained by the coarse clustering of step 404, the correspondence between the atomic topology feature of a protein fragment and its segmented sub-topology features is readily available from the truncation performed in step 405, so the sub-topology features of each segmented protein fragment can be determined. For example, if the d-dimensional atomic topology feature f_i of the i-th protein fragment is truncated into M sub-topology features {f_i1, ..., f_iM}, then the M segmented sub-topology features of the i-th protein fragment are {f_i1, ..., f_iM}.
B32. The server determines, based on the codebook of each group of sub-topology features, the cluster index value of the cluster subset into which each sub-topology feature of the group falls.
In some embodiments, for each protein fragment in the current protein fragment set, since the M segmented sub-topology features of the protein fragment also participate in the fine-grained clustering of the M groups of sub-topology features, it is necessary to determine, for each sub-topology feature of the protein fragment, the cluster subset into which it falls within its corresponding group.
In some embodiments, for each segment of sub-topology feature of the segmented protein fragment, after determining the cluster subset into which the sub-topology feature falls within its group, the codebook generated for that group through step B2 is queried to obtain the cluster index value of that cluster subset.
For example, for a protein fragment in the current protein fragment set, taking M = 3 as an example, the protein fragment is truncated into 3 equal parts to form 3 segments of sub-topology features, and the cluster subset into which each segment falls during the K-means clustering of its group of sub-topology features is queried. In one example, the 1st-segment sub-topology feature is determined to fall into the i_1-th cluster subset of the 1st group of sub-topology features, the 2nd-segment sub-topology feature into the i_2-th cluster subset of the 2nd group, and the 3rd-segment sub-topology feature into the i_3-th cluster subset of the 3rd group, where i_1, i_2, and i_3 are all integers greater than or equal to 1 and less than or equal to K_2.
For the 1st-segment sub-topology feature, the codebook C_1 of the 1st group is queried to obtain the cluster index value "3" of the i_1-th cluster subset; for the 2nd-segment sub-topology feature, the codebook C_2 of the 2nd group is queried to obtain the cluster index value "24" of the i_2-th cluster subset; and for the 3rd-segment sub-topology feature, the codebook C_3 of the 3rd group is queried to obtain the cluster index value "10" of the i_3-th cluster subset.
B33. The server splices the cluster index values of the segmented sub-topology features of the protein fragment to obtain the index of the protein fragment.
In some embodiments, the cluster index value of the cluster subset into which each sub-topology feature of the protein fragment falls is obtained through sub-step B32, and the cluster index values of the cluster subsets into which all sub-topology features of the protein fragment fall are spliced to obtain the index of the protein fragment. For example, splicing the cluster index values that the M sub-topology features of the protein fragment map to in their respective codebooks yields an M-dimensional vector, which is the index of the protein fragment; in other words, each protein fragment can use an M-dimensional vector to construct its own index in the protein database.
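Sub-steps B31-B33 amount to standard product-quantization encoding; a minimal sketch, assuming the codebooks are given as arrays of cluster centers:

```python
import numpy as np

def pq_encode(f, codebooks):
    """Product-quantize one atomic topology feature: split f into M
    sub-topology features (B31) and, for each, look up the index of the
    nearest cluster center in the corresponding codebook (B32), then
    splice the M indices into the fragment's M-dimensional index (B33)."""
    M = len(codebooks)
    subs = f.reshape(M, -1)
    return np.array([np.linalg.norm(C - s, axis=1).argmin()  # nearest center
                     for s, C in zip(subs, codebooks)])
```

The codebook arrays here are toy stand-ins for the {C_1, ..., C_M} produced by the fine-grained K-means of step B1.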
Continuing the example of sub-step B32, the 3 cluster index values "3", "24" and "10" of the 3 sub-topology features of the protein fragment are spliced to form the 3-dimensional vector [3, 24, 10], which can be used as the index of the protein fragment. This example only takes M = 3 for illustration; it should be understood that M can take different values.
As shown in FIG. 6, after an index in the form of an M-dimensional vector is constructed for each protein fragment in each protein fragment set through sub-steps B31-B33, the indexes of each protein fragment set can also be cached in the server in an inverted-index manner. In this way, the y d-dimensional atomic topology features in the original protein fragment set are converted, through product quantization, into indexes of y M-dimensional vectors, realizing a discretized, compressed representation of the atomic topology features.
Optionally, for each protein fragment set formed by coarse clustering, a key-value pair (a Key-Value data structure) is constructed with the cluster center of the protein fragment set as the Key and the M-dimensional vector indexes of the protein fragments in the set as the Value. The atomic topology feature of each protein fragment in d-dimensional space is thereby compressed into an M-dimensional vector index, where the value of each dimension is the cluster index value, in the corresponding codebook, of the cluster subset to which one segment of sub-topology feature belongs. In other words, for the j-th segment sub-topology feature f_ij of the atomic topology feature f_i of protein fragment p_i, the stored value is the ID, in the j-th codebook C_j of the current protein fragment set, of the cluster subset whose cluster center is nearest to f_ij.
In one example, as shown in FIG. 7, for the 2nd protein fragment set, the Key is the cluster center c_2 of the 2nd protein fragment set, and the Value is the M-dimensional vector index of each protein fragment in the set. Assuming protein fragments p_1 and p_5 are assigned to the 2nd protein fragment set, the Value contains at least the M-dimensional vector index [24, 4, ..., 142, 16] of M IDs for the atomic topology feature f_1 of p_1 and the M-dimensional vector index [45, 6, ..., 127, 36] of M IDs for the atomic topology feature f_5 of p_5, and so on.
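The Key-Value structure can be sketched as a plain dictionary; the fragment names and code values below are hypothetical, echoing FIG. 7 (the elided middle entries of the M-dimensional indexes are not reproduced, so M = 4 is assumed here):

```python
# Key: coarse cluster center id of a protein fragment set.
# Value: fragment id -> its M-dimensional product-quantization index.
inverted_index = {
    "c2": {
        "p1": [24, 4, 142, 16],   # hypothetical PQ codes of fragment p1
        "p5": [45, 6, 127, 36],   # hypothetical PQ codes of fragment p5
    },
}

def fragments_in_set(index, center_key):
    """Return the fragment ids stored under one coarse cluster center."""
    return sorted(index.get(center_key, {}))
```

At query time, the coarse cluster nearest to a query feature selects the Key, and only the PQ codes under that Key need to be scanned.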
Sub-steps B31-B33 above provide one possible implementation of constructing the offline index of a protein fragment from the cluster centers of each group of sub-topology features: the codebook of each group of sub-topology features is generated by product quantization, the cluster centers of all cluster subsets in each group are represented by the codebook, and the offline index is then constructed from the codebooks, so that a discretized representation of the atomic topology features can be realized very efficiently.
In particular, the atomic topology features of the protein fragments in the feature database can be represented using the offline index constructed from the codebooks. For example, for the atomic topology feature f_i, each sub-topological feature f_ij is expressed by the cluster index value of the cluster subset whose cluster center is closest to it in the codebook, so that the d-dimensional floating-point vector f_i is converted and compressed into an M-dimensional integer vector. Replacing 32-bit floating-point numbers with 8-bit integers greatly saves the storage space of the protein index; when K_2 is set to 256, each cluster index value fits in 8 bits, and the compression ratio can reach 32d/(8M) = 4d/M.
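The encoding step described above can be sketched as follows: split a d-dimensional vector into M sub-vectors and replace each sub-vector with the id of its nearest codebook centroid. This is a minimal sketch with toy codebook values, not trained cluster centers; all names are illustrative.

```python
# Sketch of the product-quantization encoding step: a d-dimensional float
# vector becomes M small integers, one per codebook.
def pq_encode(vec, codebooks):
    """vec: length-d list; codebooks: M lists of centroids, each centroid a
    length d/M list. Returns the M-dimensional integer index."""
    M = len(codebooks)
    sub_len = len(vec) // M
    code = []
    for j in range(M):
        sub = vec[j * sub_len:(j + 1) * sub_len]
        # squared L2 distance of the sub-vector to every centroid of codebook j
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in codebooks[j]]
        code.append(dists.index(min(dists)))  # id of nearest cluster center
    return code

# Toy setting: d = 4, M = 2, K2 = 2 centroids per codebook.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
code = pq_encode([0.9, 1.1, 0.1, 0.9], codebooks)
```

With K_2 = 256 centroids per codebook, each entry of `code` fits in one byte, which is the source of the compression discussed above.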
In steps 404-406 described above, the server generates an index of the plurality of protein fragments based on the atomic topology features of the plurality of protein fragments. The offline index is constructed by product quantization, and in some embodiments, coarse clustering may not be performed, which is not particularly limited in the embodiments of the present application.
Furthermore, in the embodiment of the present application, the two processes of vectorizing the structures of the protein fragments and establishing the feature database based on product quantization are performed independently. In other embodiments, to improve the effect of product quantization, a codebook parameterization network may be designed so that it learns the codebooks in product quantization, i.e., the codebook contents are parameterized. This is equivalent to obtaining the M codebooks of each protein fragment set without performing K-means clustering twice on the atomic topology features extracted by the equivariant graph neural network: an end-to-end network, formed by splicing the equivariant graph neural network and the codebook parameterization network into one, directly outputs at one time the atomic topology features of all protein fragments and the vector representations of the cluster centers of all codebooks (i.e., the vector representations of all cluster subsets in each codebook). Alternatively, codebook parameterization is used as another self-supervision task of the equivariant graph neural network, so that the equivariant graph neural network itself is reformed into an end-to-end network; in this way, the continuous vector representations (the atomic topology features) and the discrete vector representations (the cluster centers of the codebooks) can be learned jointly, achieving a better training effect.
For example, the topological structure features (continuous vector representations) obtained by encoding with the equivariant graph neural network and the parameterized codebooks (discrete vector representations) are fed into a decoder to realize a downstream self-supervision task. For example, by adding to the loss function a loss term representing the distance between the atomic topology features and the parameterized codebooks, the distance between the continuous vector representation and the discrete vector representation is minimized along with the loss function, so that the end-to-end network learns the parameterized codebooks through the self-supervision task.
According to the method provided by the embodiment of the application, an atomic topology map is constructed for each protein fragment according to the key atoms on its amino acid residues, so that the atomic topology features extracted from the atomic topology map reflect, at atomic granularity, the spatial arrangement of the amino acid residues of the protein fragment. The properties of the equivariant graph neural network ensure that even if a protein fragment undergoes translation, rotation or other transformations, the extracted atomic topology features remain unchanged because the internal structure of the protein fragment is unchanged. This greatly guarantees the expressive capacity and accuracy of the atomic topology features, and improves the accuracy of the protein fragment indexes generated from them. When the online query service is provided using the offline index, only the index needs to be used to locate a subset of the protein fragments for subsequent fine screening, rather than comparing against the entire database, which greatly reduces the calculation cost of the query process, improves the query efficiency based on the constructed protein index, and enables quick response to highly concurrent online query tasks based on the index.
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
In the above embodiment, the offline protein index construction method of the embodiments of the present application was described in detail. In the offline index construction stage, each protein fragment of the protein database in three-dimensional space can be represented as a topological structure feature in a high-dimensional space by a graph representation learning method, and product quantization is performed using two rounds of K-means clustering, so that an inverted index of the protein fragments is constructed for the protein database and each protein fragment is represented as an M-dimensional vector index formed by M integers.
The following briefly describes how to implement online query of protein fragments based on the constructed inverted index. FIG. 8 is a flowchart of a method for querying protein fragments according to an embodiment of the present application. As shown in FIG. 8, the embodiment is executed by a computer device, which may be implemented as the terminal 101 in the above implementation environment or as the server 102 in the above implementation environment; the computer device is taken as a server only for illustration. The embodiment includes the following steps:
801. the server constructs an atomic topology map of the protein to be queried based on key atoms in amino acid residues of the protein to be queried, wherein each node in the atomic topology map represents one key atom in one amino acid residue.
The above step 801 is the same as step 301 in the previous embodiment, and will not be described in detail.
In some embodiments, the amino acid sequence of the protein to be queried is obtained, and for each amino acid residue in the amino acid sequence, the three-dimensional coordinates of the 4 key atoms (C, N, O, Cα) of that amino acid residue on the backbone are obtained. Next, based on the 4 key atoms (C, N, O, Cα) on the backbone of each amino acid residue, an atomic topology map is constructed for the protein to be queried, such that each node in the atomic topology map represents one key atom of one amino acid residue in the amino acid sequence, and an edge connecting a pair of nodes in the atomic topology map represents the positional relationship of a pair of key atoms in three-dimensional space.
Optionally, all nodes in the atomic topology map are constructed according to each key atom on the backbone of each amino acid residue in the amino acid sequence, and a connecting edge is then created for every pair of nodes in the atomic topology map, so that an edge exists between any two different nodes. This makes it convenient for the equivariant graph neural network to transmit attention information among all key atoms.
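The construction above — one node per backbone key atom, one edge per pair of distinct nodes — can be sketched as follows. The data and names are illustrative toy values, not part of the application.

```python
# Sketch of the atomic topology map construction: nodes carry the atom type
# and 3-D coordinates; edges form a complete graph over all nodes.
def build_atomic_graph(residues):
    """residues: list of dicts mapping backbone atom type (C, N, O, CA)
    to its (x, y, z) coordinates."""
    nodes = []  # (atom_type, coordinates)
    for res in residues:
        for atom_type, xyz in res.items():
            nodes.append((atom_type, xyz))
    # complete graph: one edge for every unordered pair of distinct nodes
    edges = [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))]
    return nodes, edges

# One toy residue with its 4 backbone key atoms.
residues = [
    {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0),
     "C": (2.2, 1.2, 0.0), "O": (3.4, 1.2, 0.3)},
]
nodes, edges = build_atomic_graph(residues)
```

For n residues this yields 4n nodes and 4n(4n-1)/2 edges, so every key atom can exchange attention information with every other key atom in a single layer.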
802. The server inputs the position information and category features of each node in the atomic topology map into an equivariant graph neural network, where the position information represents the position coordinates of the key atom indicated by the node, the category features represent the atomic category to which the key atom indicated by the node belongs, and the equivariant graph neural network is used to extract the atomic topology features of the input topology map.
Step 802 is the same as step 302 in the previous embodiment, and will not be described in detail.
In some embodiments, based on the atomic topology map constructed in step 801, each node in the atomic topology map has unique position information (i.e., the three-dimensional coordinates of the key atom indicated by the node) and a unique category feature (i.e., the feature of the atomic category to which the key atom indicated by the node belongs). When only the key atoms on the backbone are considered, the atomic categories include C, N, O and Cα, 4 categories in total, each atomic category corresponding to a unique category feature.
In some embodiments, after the position information and category features of each node in the atomic topology map are acquired, they are input into the equivariant graph neural network and processed by it, so that the attention information of different key atoms is transmitted between different nodes, and the atomic topology features of the protein to be queried are finally output.
803. The server processes the position information and category features of each node through the plurality of attention weighting layers in the equivariant graph neural network, and the last attention weighting layer outputs the atomic topology features of the protein to be queried.
Step 803 is the same as step 303 in the previous embodiment, and will not be described in detail.
In some embodiments, after the server obtains the position information and category features of each node in the atomic topology map, they are input into the first attention weighting layer of the equivariant graph neural network, which processes them and predicts an output coordinate and an output feature for each node based on an attention mechanism. The first attention weighting layer then inputs the output coordinates and output features predicted for each node into the second attention weighting layer, and so on, until the last attention weighting layer predicts the final output coordinates and output features for each node. The output features of all nodes in the last attention weighting layer are then fused to obtain the atomic topology features of the protein to be queried over the whole atomic topology map. Optionally, the fusion manner of the output features of the nodes includes, but is not limited to: splicing, element-wise addition, element-wise multiplication, bilinear fusion, etc., which is not specifically limited in the embodiments of the present application.
The output coordinates that each attention weighting layer predicts for a node represent the coordinate information in three-dimensional space estimated by the current attention weighting layer for that node; similarly, the output features that each attention weighting layer predicts for a node represent the spatial distribution characteristics, in three-dimensional space, of the key atom estimated by the current attention weighting layer for that node. Since the attention mechanism computes attention scores by taking into account the three-dimensional spatial distribution characteristics of all key atoms, the mutual influence of different key atoms on the three-dimensional spatial distribution can be transmitted, via the attention scores, between the nodes corresponding to those key atoms, thereby controlling the degree to which different key atoms contribute to each other's predicted output features.
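A heavily simplified sketch of one such attention-weighted update is shown below. It is not the patented layer: attention weights here come only from feature dot products, and the coordinate update uses only relative positions, which is what makes the update translation- and rotation-equivariant. All names and values are illustrative.

```python
import math

def attention_layer(coords, feats):
    """One simplified attention-weighted update over a complete graph:
    each node aggregates all node features, weighted by a softmax over
    feature dot products, and moves its coordinates by the same weighted
    sum of relative positions (so translating or rotating the input
    transforms the output coordinates in the same way)."""
    n = len(coords)
    new_coords, new_feats = [], []
    for i in range(n):
        # attention scores of node i against every node (including itself)
        scores = [sum(a * b for a, b in zip(feats[i], feats[j])) for j in range(n)]
        mx = max(scores)
        w = [math.exp(s - mx) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        # feature update: attention-weighted sum of node features
        new_feats.append([sum(w[j] * feats[j][k] for j in range(n))
                          for k in range(len(feats[i]))])
        # coordinate update: attention-weighted sum of relative positions
        new_coords.append([coords[i][k] + sum(w[j] * (coords[j][k] - coords[i][k])
                                              for j in range(n))
                           for k in range(3)])
    return new_coords, new_feats

# Two toy nodes with identical 1-dimensional features: the attention weights
# become uniform and both nodes move toward their midpoint.
coords = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
feats = [[1.0], [1.0]]
new_coords, new_feats = attention_layer(coords, feats)
```

Stacking several such layers and fusing the final per-node features corresponds to the layer-by-layer prediction and fusion described above.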
804. The server determines a query string for the protein to be queried based on the atomic topology features.
Step 804 is the same as step 304 in the previous embodiment, and will not be described in detail.
In some embodiments, the atomic topology features of the protein to be queried extracted in step 803 are also used to construct a query string of the protein to be queried by using a product quantization method, and a specific implementation of product quantization will be described in steps 904-905 in the next embodiment, which will not be described herein.
By the index construction mode, the atomic topological characteristics of the protein to be inquired can be further converted and expressed into a compressed vector (namely an inquiry string) with a fixed length from the characteristic vector in a high-dimensional space, so that the inquiry string can be utilized to perform quick online inquiry according to the inverted file index constructed by the protein database, and the inquiry efficiency is improved.
805. The server returns a plurality of target protein fragments which meet similar conditions with the protein to be queried based on the query string.
In some embodiments, since the protein database is divided into a plurality of mutually disjoint protein fragment sets, the server, according to the query string of the protein to be queried obtained in step 804, first determines, from the plurality of protein fragment sets, the target fragment set with the highest matching degree to the protein to be queried; then determines, by KNN query among the protein fragments in the target fragment set, a plurality of candidate protein fragments matching the query string; and finally finely screens the plurality of target protein fragments from the candidate protein fragments and returns them to the terminal that initiated the query request. For the determination of the target fragment set, the screening of the candidate protein fragments, and the screening of the target protein fragments, refer to the next embodiment; details are not repeated here.
According to the method provided by the embodiment of the application, a KNN query is performed in the protein database to find the topK target protein fragments that are most similar in three-dimensional structure to the protein to be queried, and the amino acid sequences of the topK target protein fragments and the position information of each key atom on the amino acid residue backbone are returned.
In the above embodiment, the online query flow for protein fragments was briefly introduced. The following embodiment describes in detail how to implement online query of protein fragments based on the constructed inverted index.
Fig. 9 is a flowchart of a method for querying a protein fragment according to an embodiment of the present application, as shown in fig. 9, where the embodiment is executed by a computer device, and the computer device may be implemented as a terminal 101 in the above-mentioned implementation environment or as a server 102 in the above-mentioned implementation environment, and the embodiment is described only by taking the computer device as a server as an example, and includes the following steps:
901. The server responds to the query request, and constructs an atomic topological diagram of the protein to be queried based on key atoms in amino acid residues of the protein to be queried in the query request, wherein each node in the atomic topological diagram represents one key atom in one amino acid residue.
The above step 901 is the same as the step 401 in the previous embodiment, and will not be described in detail.
In some embodiments, the server receives a query request sent by any terminal and parses the query request to obtain the protein to be queried. Optionally, for a protein to be queried q = [c_1, ..., c_m] designed by a biological expert, an atomic topology map G_q of the protein q is constructed from the key atoms of the amino acid residues in its amino acid sequence, in the same manner as in step 401 of the previous embodiment; the construction process is not repeated here.
902. The server inputs the position information and category features of each node in the atomic topology map into an equivariant graph neural network, where the position information represents the position coordinates of the key atom indicated by the node, the category features represent the atomic category to which the key atom indicated by the node belongs, and the equivariant graph neural network is used to extract the atomic topology features of the input topology map.
Wherein the atomic topology features characterize at least the spatial arrangement of amino acid residues in the protein to be queried in terms of atomic granularity.
Step 902 is the same as step 402 in the previous embodiment, and will not be described in detail.
In some embodiments, after the atomic topology map G_q of the protein q to be queried is determined, the position information and category features of each node in the map are input into the plurality of attention weighting layers of the equivariant graph neural network.
903. The server processes the position information and category features of each node through the plurality of attention weighting layers in the equivariant graph neural network, and the last attention weighting layer outputs the atomic topology features of the protein to be queried.
The above step 903 is similar to step 403 in the previous embodiment, and will not be described in detail.
In some embodiments, the position information and category features of each node are processed through the plurality of attention weighting layers of the equivariant graph neural network, the output features predicted for each node by the last attention weighting layer are obtained, and the output features of the nodes are fused to obtain the atomic topology feature f_q of the protein q to be queried.
904. Based on the atomic topological feature, the server determines a target fragment set with the highest matching degree with the protein to be queried from a plurality of protein fragment sets.
In some embodiments, the server obtains distances between the atomic topology features of the protein to be queried and the atomic topology features located at the cluster center of each set of protein fragments, and then determines the closest set of protein fragments among the plurality of sets of protein fragments as the set of target fragments.
In some embodiments, as described in the previous embodiment, the entire protein database has been coarsely clustered into K_1 mutually disjoint protein fragment sets. In this case, the distances between the atomic topology feature f_q of the protein q to be queried and the K_1 cluster centers {c_1, ..., c_K1} of the K_1 protein fragment sets are calculated to obtain K_1 distances, and the protein fragment set corresponding to the minimum of the K_1 distances is determined as the target fragment set. The distance between the atomic topology feature and a cluster center here uses the L2 distance or the vector inner product between the vectors, which is not specifically limited in the embodiments of the present application.
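The coarse-cluster lookup above reduces to a nearest-center search over K_1 vectors, which can be sketched as follows. The centers and query feature are toy values; the names are illustrative.

```python
# Sketch of the coarse-cluster lookup: compare the query feature with each of
# the K1 cluster centers and return the position of the closest one.
def nearest_cluster(query_feature, cluster_centers):
    """Squared L2 distance to each center; returns index of the minimum."""
    dists = [sum((a - b) ** 2 for a, b in zip(query_feature, c))
             for c in cluster_centers]
    return dists.index(min(dists))

# K1 = 3 toy coarse cluster centers in a 2-dimensional feature space.
centers = [[0.0, 0.0], [10.0, 10.0], [5.0, -5.0]]
target_set = nearest_cluster([9.0, 9.5], centers)
```

Because only K_1 comparisons are made here (not one per database fragment), this step's cost is independent of the database size.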
905. The server determines a query string for the protein to be queried based on cluster centers of a plurality of cluster subsets of each set of sub-topological features in the target segment set.
Step 905 is the same as step 406 in the previous embodiment, and will not be described again.
In some embodiments, the atomic topology feature f_q of the protein q to be queried is divided to obtain a plurality of sub-topological features {f_q1, ..., f_qM} of the protein q. Then, for each sub-topological feature, the target cluster subset into which the sub-topological feature falls is determined from the plurality of cluster subsets of the group of sub-topological features to which it belongs in the target fragment set; for example, for the j-th (1 ≤ j ≤ M) sub-topological feature f_qj of the protein q, the target cluster subset whose cluster center is closest to f_qj is determined among the K_2 cluster subsets of the j-th group of sub-topological features in the target fragment set. Next, the codebook of that group of sub-topological features, which records the cluster centers of its cluster subsets, is queried to obtain the cluster index value of the target cluster subset; for example, the j-th codebook C_j is queried to determine the cluster index value (i.e., index ID) of the target cluster subset. Then, the cluster index values of the target cluster subsets into which the sub-topological features fall are spliced to obtain the query string; for example, the M index IDs (abbreviated as ind) of the M sub-topological features are spliced into an M-dimensional vector f_q = (ind_q1, ind_q2, ..., ind_qM) as the query string of the protein q, where ind_qr represents the cluster index value of the target cluster subset into which the r-th sub-topological feature of f_q falls within the r-th group of sub-topological features.
In the above steps 904-905, one possible implementation of determining the query string of the protein to be queried based on the atomic topology features is provided: after the atomic topology feature of the protein to be queried is extracted using the equivariant graph neural network, the query string is obtained in a product-quantization-like manner.
906. The server determines a plurality of candidate protein fragments which meet the matching condition with the query string from a plurality of protein fragments in the target fragment set.
In some embodiments, the matching condition includes at least one of: the index of the candidate protein fragment is the same as the query string; or, the index distance between the index of the candidate protein fragment and the query string meets a recall condition, wherein the index distance characterizes the total distance between the cluster centers indicated by each of one index and the other index.
In some embodiments, during the KNN query within the target fragment set, all candidate protein fragments whose M-dimensional vector index is identical to the query string of the protein q to be queried are first fetched from the target fragment set. If the number of fetched candidate protein fragments is greater than or equal to the number K of protein fragments to be returned by each KNN query, there is no need to continue selecting candidate protein fragments that meet the recall condition; otherwise, if the number of candidate protein fragments is less than K, protein fragments meeting the recall condition are further queried as candidate protein fragments. For this case, the server may precompute offline, for each of the M codebooks of the current target fragment set, the pairwise distances between its K_2 cluster centers, and cache them as a K_2 × K_2 table, which improves the efficiency of querying protein fragments that meet the recall condition.
Schematically, when querying protein fragments meeting the recall condition, the atomic topology feature f_q of the protein q to be queried is divided to obtain M sub-topological features {f_q1, ..., f_qM}. For any sub-topological feature f_qj (1 ≤ j ≤ M), the j-th codebook C_j of the target fragment set is queried to find the topN_1 candidate cluster subsets closest to f_qj, and the protein fragments whose indexes hit the returned candidate cluster subsets are determined. The above operation is repeated until all protein fragments whose indexes hit candidate cluster subsets have been traversed; these fragments are then sorted in ascending order of index distance to the query string, and the top N_2 (topN_2) fragments are used as the recalled candidate protein fragments, where N_1 and N_2 are both preset integers greater than or equal to 1. Optionally, the number of determined candidate protein fragments may further be constrained to be greater than or equal to the K value of the KNN query, so as to avoid an insufficient number of queried protein fragments.
The index distance is computed by accessing the cached K_2 × K_2 tables: for each ID recorded in the index of a protein fragment, the distance between its cluster center and the cluster center of the ID at the corresponding position in the query string is looked up, and the sum of the distances over all IDs is taken as the index distance. For example, for a protein fragment p_i whose index hits a candidate cluster subset, with index f_i = (ind_i1, ind_i2, ..., ind_iM), and the query string f_q = (ind_q1, ind_q2, ..., ind_qM) of the protein q to be queried, the cached K_2 × K_2 tables are accessed to look up the distance diff(ind_i1, ind_q1) between the cluster centers of the subsets indicated by ind_i1 and ind_q1, and so on, until the distance diff(ind_iM, ind_qM) is looked up; the M distances are added to obtain the total index distance diff(ind_i1, ind_q1) + diff(ind_i2, ind_q2) + ... + diff(ind_iM, ind_qM).
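With the distance tables cached, computing an index distance reduces to M table lookups and a sum, as the sketch below shows. The table values are toy numbers, and all names are illustrative.

```python
# Sketch of the cached-distance lookup: tables[j][x][y] holds the precomputed
# distance between cluster centers x and y of the j-th codebook.
def index_distance(code_a, code_b, tables):
    """Sum of M per-codebook lookups between two M-dimensional codes."""
    return sum(tables[j][code_a[j]][code_b[j]] for j in range(len(code_a)))

# Toy setting: M = 2 codebooks with K2 = 2 cluster centers each.
tables = [
    [[0.0, 3.0], [3.0, 0.0]],
    [[0.0, 1.5], [1.5, 0.0]],
]
d = index_distance([0, 1], [1, 1], tables)
```

Each comparison between a fragment index and the query string therefore costs O(M), regardless of the dimension d of the original feature vectors.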
907. The server screens the candidate protein fragments based on the distance between the atomic topological feature of the protein to be queried and the atomic topological feature of the candidate protein fragment to obtain the target protein fragments.
In some embodiments, during the KNN query within the target fragment set, if the number of candidate protein fragments determined in step 906 equals the number K of protein fragments to be returned by each KNN query, the K candidate protein fragments are returned directly as target protein fragments, that is, step 907 is skipped and step 908 is performed directly. Otherwise, if the number of candidate protein fragments determined in step 906 exceeds K, the distance between the atomic topology feature of the protein q to be queried and the atomic topology feature of each fetched candidate protein fragment is calculated, the candidate protein fragments are sorted in ascending order of that distance, and the top K (topK) candidate protein fragments are used as the K target protein fragments.
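The fine-screening step above — exact distances on the full features, then topK — can be sketched as follows. The candidate data are toy values and the names are illustrative.

```python
# Sketch of the fine-screening step: rank recalled candidates by the exact
# distance between their full atomic topology features and the query feature.
def rerank_topk(query_feature, candidates, k):
    """candidates: (fragment_id, atomic_topology_feature) pairs; returns the
    ids of the k candidates closest to the query feature (squared L2)."""
    def dist(feature):
        return sum((a - b) ** 2 for a, b in zip(query_feature, feature))
    ranked = sorted(candidates, key=lambda c: dist(c[1]))
    return [frag_id for frag_id, _ in ranked[:k]]

candidates = [("p1", [0.0, 0.0]), ("p2", [1.0, 1.0]), ("p3", [0.2, 0.1])]
top = rerank_topk([0.1, 0.1], candidates, 2)
```

Because the exact distance is computed only over the few recalled candidates rather than the whole database, this step stays cheap even for large protein databases.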
It should be noted that, the number K of protein fragments returned by each query in the KNN query is a preset value, which may be configured individually by a technician according to the query requirement, or may be specified by the user who initiates the query, and is not specifically limited herein.
In other embodiments, in addition to the KNN query logic that returns the topK, all candidate protein fragments whose distance is less than a target distance may be used as target protein fragments, i.e., returned without limiting the number of returned protein fragments; the query logic is not specifically limited in the embodiments of the present application.
908. The server returns the plurality of target protein fragments.
In some embodiments, the server returns the plurality of target protein fragments determined in step 907 above to the terminal. For example, the plurality of target protein fragments are packaged into a query response, which is sent to the terminal.
In the foregoing steps 906-908, a possible implementation of returning, based on the query string, a plurality of target protein fragments meeting the similarity condition with the protein to be queried is provided. Optionally, when the equivariant graph neural network is trained, a query network may additionally be trained to automatically learn to perform the KNN query in the protein database; the query manner is not specifically limited in the embodiments of the present application. It should be noted that, in the embodiments of the present application, the protein q to be queried and the returned target protein fragments can be sequences of different lengths.
In the embodiment of the application, through KNN inquiry in a protein database, the topK target protein fragments which are the most similar to the protein to be inquired in the three-dimensional structure are inquired, and the amino acid sequences of the topK target protein fragments and the position information of each key atom on the amino acid residue main chain are returned.
Furthermore, the online query scheme provided by the embodiments of the application greatly reduces the search range of the target protein fragments while maintaining high search accuracy. In the online query stage, only the distances between the protein to be queried and the cluster centers of the K_1 coarsely clustered protein fragment sets, and the distances between the M sub-topological features and the cluster centers of the cluster subsets in their respective codebooks, need to be compared, so the complexity of the online query stage is not limited by the data volume of the protein database and has strong scalability. Because the feature database and the indexes of the protein fragments are all built offline, parallel queries are supported, which matches well the multi-core parallel architectures of modern processors and related GPU (Graphics Processing Unit) acceleration scenarios; even real-time similarity retrieval tasks over millions to billions of protein fragments can be realized.
Furthermore, to improve the retrieval accuracy of the target protein fragments, a two-stage filter-and-verify strategy can be adopted: the protein fragment query method provided by the embodiments of the application is used as a filtering tool to first return a candidate structure list of a given length, and a protein structure matching algorithm is then invoked to accurately compute the similarity measure between each target protein fragment in the candidate structure list and the protein to be queried, so as to verify whether the target protein fragment is structurally similar to the protein to be queried, thereby further improving the retrieval accuracy of the protein query stage.
Fig. 10 is a schematic structural diagram of a protein index generating apparatus according to an embodiment of the present application, please refer to fig. 10, which includes:
a construction module 1001 for constructing an atomic topology map of a protein fragment based on key atoms in amino acid residues of the protein fragment, each node in the atomic topology map representing a key atom in an amino acid residue;
an input module 1002, configured to input the position information and category features of each node in the atomic topology map into an equivariant graph neural network, where the position information represents the position coordinates of the key atom indicated by the node, the category features represent the atomic category to which the key atom indicated by the node belongs, and the equivariant graph neural network is configured to extract the atomic topology features of the input topology map;
The processing module 1003 is configured to process the location information and the category characteristics of each node through a plurality of attention weighting layers in the equivariant graph neural network, and output the atomic topology features of the protein fragment through the last attention weighting layer;
a generating module 1004 is configured to generate an index of the plurality of protein fragments based on atomic topology features of the plurality of protein fragments.
According to the device provided by the embodiment of the application, an atomic topology map is constructed for each protein fragment from the key atoms on its amino acid residues, so that the atomic topology features extracted from the map reflect, at atomic granularity, the spatial arrangement of the amino acid residues of the protein fragment. The equivariance of the graph neural network ensures that even if a protein fragment undergoes translation, rotation, or other rigid transformations, the extracted atomic topology features remain unchanged because the internal structure of the fragment is unchanged, which greatly strengthens the expressive power and accuracy of the atomic topology features and, in turn, the accuracy of the protein fragment indexes generated from them. When the offline-built indexes are used to provide an online query service, only a subset of the protein fragments needs to be located via the indexes and then finely screened, which greatly reduces the computational cost of the query process, improves query efficiency based on the constructed protein indexes, and allows high-concurrency online query tasks to be answered quickly.
In some embodiments, the building block 1001 is configured to:
for each amino acid residue in the protein fragment, determining a plurality of key atoms from the backbone of the amino acid residue;
constructing each node in the atomic topology map based on each key atom of each amino acid residue;
for any pair of nodes in the atomic topology, an edge is constructed for connecting the pair of nodes.
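The three construction steps above (select key atoms per residue, make one node per key atom, fully connect the node pairs) can be sketched as follows. The specific backbone atoms chosen (N, CA, C) and the dictionary-based residue representation are illustrative assumptions, not mandated by the embodiment:

```python
import itertools

# Assumed key-atom choice: three backbone atoms per residue (the embodiment
# only says "a plurality of key atoms from the backbone").
KEY_ATOMS = ("N", "CA", "C")

def build_atomic_topology_graph(residues):
    """residues: list of dicts mapping backbone-atom name -> (x, y, z)."""
    nodes = []
    for res in residues:
        for name in KEY_ATOMS:
            # Each node carries position information and an atomic category.
            nodes.append({"pos": res[name], "category": name})
    # An edge is constructed for every pair of nodes (fully connected graph).
    edges = list(itertools.combinations(range(len(nodes)), 2))
    return nodes, edges

residues = [
    {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "C": (2.1, 1.3, 0.0)},
    {"N": (3.4, 1.5, 0.2), "CA": (4.8, 1.9, 0.5), "C": (5.5, 3.2, 0.4)},
]
nodes, edges = build_atomic_topology_graph(residues)
print(len(nodes), len(edges))  # 6 nodes, 15 edges
```

Two residues with three key atoms each yield 6 nodes, and full pairwise connection yields C(6,2) = 15 edges.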
In some embodiments, each attention weighting layer in the equivariant graph neural network is further configured to predict output features and output coordinates for each node in the atomic topology map;
based on the apparatus composition of fig. 10, the processing module 1003 includes:
a weighted mapping unit, configured to, for any attention weighting layer, perform weighted mapping on the output features of each node from the previous attention weighting layer based on the query matrix, key matrix, and value matrix of the attention weighting layer, to obtain a query vector, a key vector, and a value vector for each node;
a score acquisition unit, configured to, for a node pair formed by any node and another node, acquire the attention score of the node pair based on the query vector of the node and the key vector of the other node;
a feature acquisition unit, configured to acquire the output feature of the attention weighting layer for the node based on the attention score of each node pair containing the node and the value vectors of the other nodes;
and a coordinate acquisition unit, configured to acquire the output coordinates of the attention weighting layer for the node based on the attention score of each node pair containing the node and the output coordinates of the previous attention weighting layer for the node and each other node.
In some embodiments, the feature acquisition unit is configured to:
for each node pair containing the node, weighting the value vector of the other node in the node pair based on the attention score of the node pair, to obtain a weighted value vector for the node pair;
and fusing the weighted value vectors of each node pair comprising the node to obtain the output characteristics of the attention weighted layer to the node.
In some embodiments, the coordinate acquisition unit is configured to:
for each node pair containing the node, acquiring the coordinate difference between the output coordinates of the previous attention weighting layer for the node and for the other node in the node pair;
weighting the coordinate difference of the node pair based on the attention score of the node pair to obtain a weighted coordinate difference;
fusing the weighted coordinate differences of each node pair comprising the node to obtain a coordinate offset;
and determining the output coordinates of the attention weighting layer to the node based on the output coordinates of the previous attention weighting layer to the node, the coordinate offset and the normalization factor.
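The four units above together implement one attention weighting layer: query/key/value mapping, attention scoring with exponential normalization, weighted-value fusion for the output features, and a normalized weighted sum of coordinate differences for the output coordinates. A minimal NumPy sketch, assuming dot-product attention with softmax normalization, summation as the fusion operation, and a normalization factor C = n − 1; the projection matrices stand in for learned parameters:

```python
import numpy as np

def attention_weighting_layer(h, x, Wq, Wk, Wv):
    """One layer: h is (n, d) node features, x is (n, 3) node coordinates;
    Wq, Wk, Wv are (d, d) query/key/value matrices (assumed learned)."""
    n = h.shape[0]
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T                              # initial attention scores q_i . k_j
    np.fill_diagonal(scores, -np.inf)             # a node attends to the *other* nodes
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # exponential (softmax) normalization

    h_out = attn @ v                              # fuse weighted value vectors
    diff = x[:, None, :] - x[None, :, :]          # coordinate differences x_i - x_j
    offset = (attn[:, :, None] * diff).sum(axis=1)
    C = n - 1                                     # assumed normalization factor
    x_out = x + offset / C                        # output coordinates
    return h_out, x_out
```

Because the coordinate update depends only on coordinate differences, translating all input coordinates by a constant translates the output coordinates by the same constant, which is the equivariance property the embodiment relies on.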
In some embodiments, the score acquisition unit is configured to:
multiplying the query vector of any node by the key vector of other nodes to obtain the initial attention score of the node pair;
exponentially normalizing the initial attention score of the node pair over the initial attention scores of all node pairs containing the node, to obtain the attention score of the node pair.
In some embodiments, the processing module 1003 is further to:
and fusing the output characteristics of the last attention weighting layer to each node to obtain the atomic topology characteristics of the protein fragment.
In some embodiments, the equivariant graph neural network is trained based on the position information and category characteristics of each node in a sample topology map, and the loss function value of the equivariant graph neural network in the training phase comprises a coordinate loss term and a distance loss term, wherein the coordinate loss term represents the error between the atomic coordinates and the predicted coordinates of each node in the sample topology map, and the distance loss term represents the error between the atomic distance and the predicted distance of each pair of nodes in the sample topology map.
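A minimal sketch of a training loss combining the coordinate term and the distance term described above; the squared-error form and the equal default weights are assumptions, since the embodiment only specifies that both error terms are included:

```python
import numpy as np

def training_loss(pred_coords, true_coords, w_coord=1.0, w_dist=1.0):
    """pred_coords, true_coords: (n, 3) predicted vs. ground-truth coordinates.
    Weights w_coord / w_dist are illustrative assumptions."""
    # Coordinate loss term: per-node error between atomic and predicted coordinates.
    coord_loss = np.mean((pred_coords - true_coords) ** 2)
    # Distance loss term: per-pair error between atomic and predicted distances.
    pd = np.linalg.norm(pred_coords[:, None] - pred_coords[None, :], axis=-1)
    td = np.linalg.norm(true_coords[:, None] - true_coords[None, :], axis=-1)
    dist_loss = np.mean((pd - td) ** 2)
    return w_coord * coord_loss + w_dist * dist_loss
```

The pairwise-distance term is invariant to rigid motions of the prediction, so it penalizes shape errors even when the coordinate frame differs.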
In some embodiments, based on the apparatus composition of fig. 10, the generating module 1004 includes:
The clustering unit is used for clustering the atomic topological features of the protein fragments to obtain a plurality of mutually disjoint protein fragment sets;
a segmentation unit, configured to segment atomic topology features of protein fragments in the protein fragment set by using the protein fragment set as a unit, to obtain a plurality of sub-topology features of the protein fragments;
and the generation unit is used for generating indexes of the protein fragments in the protein fragment set based on the cluster centers of each group of sub-topological features with the same sequence numbers in different protein fragments.
In some embodiments, based on the apparatus composition of fig. 10, the generating unit includes:
a clustering subunit, configured to, for each group of sub-topological features with the same sequence number obtained after segmenting the different protein fragments in the protein fragment set, cluster the plurality of sub-topological features in the group to obtain a plurality of cluster subsets of each group of sub-topological features;
a codebook generating subunit, configured to generate a codebook of each group of sub-topological features based on cluster centers of a plurality of cluster subsets of each group of sub-topological features, where the codebook characterizes sub-topological features located at the cluster centers of the plurality of cluster subsets;
An index generation subunit for generating an index of protein fragments in the set of protein fragments based on the codebook of each set of sub-topological features.
In some embodiments, the index generation subunit is configured to:
determining a plurality of sub-topological features of any protein fragment in the protein fragment set after the segmentation of the protein fragment;
determining a cluster index value of a cluster subset of each sub-topological feature in a group of sub-topological features to which the sub-topological feature belongs based on a codebook of each group of sub-topological features;
and splicing the cluster index values of the sub-topological features after the protein fragments are segmented to obtain the indexes of the protein fragments.
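The segmentation, per-group clustering, codebook, and concatenated-index steps above amount to a product-quantization-style index. A minimal sketch, where the tiny k-means routine stands in for whatever clustering algorithm is used and the digit-string code format is an illustrative assumption:

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Minimal k-means; an illustrative stand-in for any clustering routine."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((data[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers

def build_index(features, M=4, K=8):
    """features: (N, D) atomic topology features of one fragment set, D % M == 0.
    Returns one codebook per group of sub-topological features, plus each
    fragment's index as concatenated cluster index values."""
    groups = np.split(features, M, axis=1)      # M groups of sub-topological features
    codebooks = [kmeans(g, K) for g in groups]  # cluster centers act as codewords
    codes = [
        "".join(str(np.argmin(((f - cb) ** 2).sum(-1)))
                for f, cb in zip(np.split(vec, M), codebooks))
        for vec in features
    ]
    return codebooks, codes
```

Each fragment is thus stored as M small integers rather than a full D-dimensional feature, which is what keeps the online lookup cheap.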
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
It should be noted that: the protein index generating device provided in the above embodiment is illustrated only by the division of the above functional modules when generating the index of each protein fragment; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above. In addition, the protein index generating device provided in the above embodiment belongs to the same concept as the embodiment of the method for generating a protein index; its specific implementation process is detailed in the method embodiment and is not repeated here.
Fig. 11 is a schematic structural diagram of a protein fragment query device according to an embodiment of the present application, please refer to fig. 11, which includes:
a construction module 1101, configured to construct an atomic topology map of a protein to be queried based on key atoms in amino acid residues of the protein to be queried, where each node in the atomic topology map represents a key atom in an amino acid residue;
the input module 1102 is configured to input location information and category characteristics of each node in the atomic topology map into an equivariant graph neural network, where the location information represents location coordinates of the key atom indicated by the node, the category characteristics represent characteristics of the atomic category to which the key atom indicated by the node belongs, and the equivariant graph neural network is configured to extract atomic topology features of the input topology map;
the processing module 1103 is configured to process the location information and the category characteristics of each node through a plurality of attention weighting layers in the equivariant graph neural network, and output the atomic topology features of the protein to be queried through the last attention weighting layer;
a determining module 1104 for determining a query string of the protein to be queried based on the atomic topology features;
A returning module 1105, configured to return, based on the query string, a plurality of target protein fragments that conform to a similar condition to the protein to be queried.
According to the device provided by the embodiment of the application, a KNN query is performed in the protein database to find the top-K target protein fragments whose three-dimensional structures are most similar to the protein to be queried, and the amino acid sequences of the top-K target protein fragments and the position information of each key atom on the amino acid residue backbone are returned. Because the protein to be queried is routed to the target fragment set, no computing resources are spent performing KNN queries on the remaining protein fragment sets, which greatly improves query efficiency. Furthermore, by looking up the query string of the protein to be queried in the cached codebook, the most similar top-K target protein fragments can be found conveniently from the query string; since retrieving target protein fragments through the query string involves little computation and low computational complexity, query efficiency is improved further.
In some embodiments, based on the apparatus composition of fig. 11, the determining module 1104 includes:
the set determining unit is used for determining a target fragment set with the highest matching degree with the protein to be queried from a plurality of protein fragment sets based on the atomic topological feature;
And the index determining unit is used for determining the query string of the protein to be queried based on the cluster centers of a plurality of cluster subsets of each group of sub-topological features in the target fragment set.
In some embodiments, the set determination unit is to:
acquiring the distance between the atomic topological feature and the atomic topological feature positioned in the clustering center of each protein fragment set;
the closest protein fragment set among the plurality of protein fragment sets is determined as the target fragment set.
In some embodiments, the index determination unit is configured to:
dividing the atomic topological feature of the protein to be queried to obtain a plurality of sub-topological features of the protein to be queried;
for each sub-topological feature, determining a target cluster subset in which the sub-topological feature falls from a plurality of cluster subsets of a group of sub-topological features to which the sub-topological feature belongs in the target fragment set;
inquiring a codebook of the group of sub-topological features to obtain a cluster index value of the target cluster subset, wherein the codebook represents the sub-topological features positioned in the cluster centers of the plurality of cluster subsets;
and splicing the cluster index values of the target cluster subsets which are respectively fallen into the sub-topological features to obtain the query string.
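The query-string steps above (segment the query feature, locate the nearest cluster subset per group, look up its cluster index value, concatenate) can be sketched as follows; the digit-string code format is an illustrative assumption:

```python
import numpy as np

def make_query_string(query_feat, codebooks):
    """query_feat: (D,) atomic topology feature of the protein to be queried.
    codebooks: list of M arrays, each (K, D // M), holding the cluster-center
    sub-features of one group of sub-topological features."""
    M = len(codebooks)
    parts = np.split(query_feat, M)  # M sub-topological features of the query
    idxs = [
        int(np.argmin(((p - cb) ** 2).sum(-1)))  # nearest cluster subset per group
        for p, cb in zip(parts, codebooks)
    ]
    return "".join(str(i) for i in idxs)         # concatenated cluster index values
```

Only M small nearest-center searches are needed per query, independent of how many fragments the database holds.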
In some embodiments, the return module 1105 is to:
determining a plurality of candidate protein fragments which meet the matching condition with the query string from a plurality of protein fragments in the target fragment set;
screening the plurality of candidate protein fragments based on the distance between the atomic topological feature of the protein to be queried and the atomic topological feature of the candidate protein fragment to obtain a plurality of target protein fragments;
returning the plurality of target protein fragments.
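The filter-then-verify steps above can be sketched as follows. Treating "meeting the matching condition" as exact query-string equality is an assumption (a real system might accept near matches, e.g. strings differing in one position), and the fallback to all fragments when no string matches is illustrative:

```python
import numpy as np

def filter_and_verify(query_feat, query_code, fragment_codes, fragment_feats, top_k=5):
    """fragment_codes: per-fragment index strings in the target fragment set.
    fragment_feats: (N, D) atomic topology features of those fragments.
    Returns the indices of the top_k target protein fragments."""
    # Filter: candidates whose index string matches the query string.
    cand = [i for i, c in enumerate(fragment_codes) if c == query_code]
    if not cand:
        cand = list(range(len(fragment_codes)))   # illustrative fallback
    # Verify: screen candidates by feature distance to the query.
    d = np.linalg.norm(fragment_feats[cand] - query_feat, axis=1)
    order = np.argsort(d)[:top_k]
    return [cand[i] for i in order]
```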
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
It should be noted that: the protein fragment query device provided in the above embodiment only illustrates the division of the functional modules when querying each protein fragment, and in practical application, the functional modules can be allocated to different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above. In addition, the protein fragment query device and the protein fragment query method provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the protein fragment query device and the protein fragment query method are detailed in the protein fragment query method, which are not described herein.
Fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 1200 may vary considerably in configuration and performance. The computer device 1200 includes one or more processors (Central Processing Unit, CPU) 1201 and one or more memories 1202, where at least one computer program is stored in the memories 1202, and the at least one computer program is loaded and executed by the one or more processors 1201 to implement the method for generating a protein index or the method for querying a protein fragment provided by the foregoing embodiments. Optionally, the computer device 1200 further includes components such as a wired or wireless network interface, a keyboard, and an input/output interface for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer readable storage medium is also provided, for example a memory comprising at least one computer program executable by a processor in a computer device to perform the method of generating a protein index or the method of querying a protein fragment in the respective embodiments described above. For example, the computer readable storage medium includes ROM (Read-Only Memory), RAM (Random-Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising one or more computer programs, the one or more computer programs stored in a computer readable storage medium. The one or more processors of the computer device are capable of reading the one or more computer programs from the computer-readable storage medium, and executing the one or more computer programs, so that the computer device is capable of executing to complete the method of generating a protein index or the method of querying a protein fragment in the above embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps implementing the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The foregoing description covers only preferred embodiments of the present application and is not intended to limit the application; any modification, equivalent substitution, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (20)

1. A method of generating a protein index, the method comprising:
constructing an atomic topology map of a protein fragment based on key atoms in amino acid residues of the protein fragment, each node in the atomic topology map representing one key atom in one amino acid residue;
inputting the position information and the category characteristics of each node in the atomic topological graph into an equivariant graph neural network, wherein the position information represents the position coordinates of the key atoms indicated by the nodes, the category characteristics represent the characteristics of the atomic categories to which the key atoms indicated by the nodes belong, and the equivariant graph neural network is used for extracting the atomic topological characteristics of the input topological graph;
the position information and the category characteristics of each node are respectively processed through a plurality of attention weighting layers in the equivariant graph neural network, and the atomic topology characteristics of the protein fragments are output by the last attention weighting layer;
an index of the plurality of protein fragments is generated based on atomic topological features of the plurality of protein fragments.
2. The method of claim 1, wherein constructing an atomic topology map of a protein fragment based on key atoms in amino acid residues of the protein fragment comprises:
For each amino acid residue in the protein fragment, determining a plurality of key atoms from the backbone of the amino acid residue;
constructing each node in the atomic topology map based on each key atom of each amino acid residue;
for any pair of nodes in the atomic topology graph, an edge is constructed for connecting the pair of nodes.
3. The method of claim 1, wherein each attention weighting layer in the equivariant graph neural network is further configured to predict output features and output coordinates for each node in the atomic topology map;
the processing of the position information and the category characteristics of each node through a plurality of attention weighting layers in the equivariant graph neural network comprises the following steps:
respectively carrying out weighted mapping on the output characteristics of each node in the previous attention weighting layer on any attention weighting layer based on the query matrix, the key matrix and the value matrix of the attention weighting layer to obtain the query vector, the key vector and the value vector of each node;
for node pairs formed by any node and other nodes, acquiring attention scores of the node pairs based on query vectors of the nodes and key vectors of the other nodes;
Based on the attention score of each node pair containing the node and the value vector of each other node, obtaining the output characteristic of the attention weighting layer to the node;
and obtaining the output coordinates of the attention weighting layer to the nodes based on the attention score of each node pair containing the nodes and the output coordinates of the previous attention weighting layer to the nodes and each other node.
4. A method according to claim 3, wherein said obtaining the output characteristics of the attention weighting layer for the node based on the attention score of each node pair containing the node and the value vector of each other node comprises:
weighting each node pair containing the nodes, and based on the attention score of the node pair, weighting the value vectors of other nodes in the node pair to obtain a weighted value vector of the node pair;
and fusing the weighted value vectors of each node pair comprising the nodes to obtain the output characteristics of the attention weighted layer to the nodes.
5. A method according to claim 3, wherein the obtaining the output coordinates of the attention weighting layer for the node based on the attention score of each node pair containing the node and the output coordinates of the previous attention weighting layer for the node and each other node comprises:
for each node pair comprising the node, acquiring the coordinate difference between the output coordinates of the previous attention weighting layer for the node and for the other node in the node pair;
weighting the coordinate differences of the node pairs based on the attention scores of the node pairs to obtain weighted coordinate differences;
fusing the weighted coordinate differences of each node pair comprising the nodes to obtain a coordinate offset;
and determining the output coordinates of the attention weighting layer to the node based on the output coordinates of the previous attention weighting layer to the node, the coordinate offset and a normalization factor.
6. The method of claim 3, wherein the obtaining the attention score of the node pair based on the query vector of the node and the key vector of the other node for the node pair comprising any node pair with other nodes comprises:
multiplying the query vector of any node and the key vector of other nodes to obtain the initial attention score of the node pair;
and exponentially normalizing the initial attention score of the node pair over the initial attention scores of all node pairs containing the node, to obtain the attention score of the node pair.
7. A method according to claim 3, wherein said outputting, by the last attention weighting layer, the atomic topology characteristics of the protein fragments comprises:
and fusing the output characteristics of the last attention weighting layer to each node to obtain the atomic topology characteristics of the protein fragments.
8. The method according to claim 1, wherein the equivariant graph neural network is trained based on position information and category characteristics of each node in a sample topology, and the loss function value of the equivariant graph neural network in the training phase comprises a coordinate loss term and a distance loss term, wherein the coordinate loss term represents an error between an atomic coordinate and a predicted coordinate of each node in the sample topology, and the distance loss term represents an error between an atomic distance and a predicted distance of each pair of nodes in the sample topology.
9. The method of claim 1, wherein generating the index of the plurality of protein fragments based on the atomic topology characteristics of the plurality of protein fragments comprises:
clustering the atomic topological characteristics of the protein fragments to obtain a plurality of protein fragment sets which are not intersected with each other;
Dividing the atomic topological characteristics of the protein fragments in the protein fragment set by taking the protein fragment set as a unit to obtain a plurality of sub-topological characteristics of the protein fragments;
and generating indexes of the protein fragments in the protein fragment set based on the cluster centers of each group of sub-topological features with the same serial numbers in different protein fragments.
10. The method of claim 9, wherein generating the index of protein fragments in the collection of protein fragments based on the cluster centers of each group of sub-topological features with the same sequence number in the different protein fragments comprises:
clustering a plurality of sub-topological features in each group of sub-topological features to obtain a plurality of clustering subsets of the sub-topological features of each group;
generating a codebook of each group of sub-topological features based on cluster centers of a plurality of cluster subsets of each group of sub-topological features, the codebook characterizing sub-topological features located at the cluster centers of the plurality of cluster subsets;
an index of protein fragments in the set of protein fragments is generated based on the codebook of each set of sub-topological features.
11. The method of claim 10, wherein generating an index of protein fragments in the set of protein fragments based on the codebook of each set of sub-topological features comprises:
determining a plurality of sub-topological features of any protein fragment in the protein fragment set after the protein fragment is segmented;
determining a cluster index value of a cluster subset of each sub-topological feature in a group of sub-topological features to which the sub-topological feature belongs based on a codebook of each group of sub-topological features;
and splicing the cluster index values of the sub-topological features after the protein fragments are segmented to obtain indexes of the protein fragments.
12. A method for querying a protein fragment, the method comprising:
constructing an atomic topology diagram of the protein to be queried based on key atoms in amino acid residues of the protein to be queried, wherein each node in the atomic topology diagram represents one key atom in one amino acid residue;
inputting the position information and the category characteristics of each node in the atomic topological graph into an equivariant graph neural network, wherein the position information represents the position coordinates of the key atoms indicated by the nodes, the category characteristics represent the characteristics of the atomic categories to which the key atoms indicated by the nodes belong, and the equivariant graph neural network is used for extracting the atomic topological characteristics of the input topological graph;
the position information and the category characteristics of each node are respectively processed through a plurality of attention weighting layers in the equivariant graph neural network, and the last attention weighting layer outputs the atomic topology characteristics of the protein to be queried;
determining a query string of the protein to be queried based on the atomic topological feature;
and returning a plurality of target protein fragments meeting similar conditions with the protein to be queried based on the query string.
13. The method of claim 12, wherein the determining the query string for the protein to be queried based on the atomic topology features comprises:
determining a target fragment set with highest matching degree with the protein to be queried from a plurality of protein fragment sets based on the atomic topological characteristics;
and determining the query string of the protein to be queried based on the cluster centers of a plurality of cluster subsets of each group of sub-topological features in the target fragment set.
14. The method of claim 13, wherein the determining a set of target fragments that most closely match the protein to be queried from a plurality of sets of protein fragments based on the atomic topology signature comprises:
Acquiring the distance between the atomic topological feature and the atomic topological feature positioned in the clustering center of each protein fragment set;
and determining a protein fragment set closest to the plurality of protein fragment sets as the target fragment set.
15. The method of claim 13, wherein the determining the query string for the protein to be queried based on the cluster centers of the plurality of cluster subsets for each set of sub-topological features in the set of target segments comprises:
dividing the atomic topological characteristics of the protein to be queried to obtain a plurality of sub-topological characteristics of the protein to be queried;
for each sub-topological feature, determining a target cluster subset in which the sub-topological feature falls from a plurality of cluster subsets of a group of sub-topological features to which the sub-topological feature belongs in the target fragment set;
inquiring a codebook of the group of sub-topological features to obtain a cluster index value of the target cluster subset, wherein the codebook represents the sub-topological features positioned in the cluster centers of the plurality of cluster subsets;
and splicing the cluster index values of the target cluster subsets which are respectively fallen into the sub-topological features to obtain the query string.
16. The method of claim 13, wherein the returning, based on the query string, a plurality of target protein fragments that satisfy a similarity condition with the protein to be queried comprises:
determining a plurality of candidate protein fragments which meet the matching condition with the query string from a plurality of protein fragments in the target fragment set;
screening the plurality of candidate protein fragments based on the distance between the atomic topological feature of the protein to be queried and the atomic topological feature of the candidate protein fragment to obtain the plurality of target protein fragments;
returning the plurality of target protein fragments.
17. A protein index generating apparatus, comprising:
a building module for building an atomic topology map of a protein fragment based on key atoms in amino acid residues of the protein fragment, each node in the atomic topology map representing a key atom in an amino acid residue;
the input module is used for inputting the position information and the category characteristics of each node in the atomic topological graph into an equivariant graph neural network, wherein the position information represents the position coordinates of the key atoms indicated by the nodes, the category characteristics represent the characteristics of the atomic categories to which the key atoms indicated by the nodes belong, and the equivariant graph neural network is used for extracting the atomic topological characteristics of the input topological graph;
the processing module is used for respectively processing the position information and the category characteristics of each node through a plurality of attention weighting layers in the equivariant graph neural network, and outputting the atomic topology characteristics of the protein fragments through the last attention weighting layer;
a generation module for generating an index of the plurality of protein fragments based on atomic topology features of the plurality of protein fragments.
18. A device for querying a protein fragment, the device comprising:
a construction module, configured to construct an atomic topology graph of a protein to be queried based on key atoms in amino acid residues of the protein to be queried, each node in the atomic topology graph representing one key atom in one amino acid residue;
an input module, configured to input position information and category features of each node in the atomic topology graph into an isomorphic graph neural network, the position information representing position coordinates of the key atom indicated by the node, the category features representing features of the atomic category to which the key atom indicated by the node belongs, and the isomorphic graph neural network being configured to extract atomic topology features of an input topology graph;
a processing module, configured to respectively process the position information and the category features of each node through a plurality of attention weighting layers in the isomorphic graph neural network, and to output the atomic topology features of the protein to be queried through the last attention weighting layer;
a determining module, configured to determine a query string of the protein to be queried based on the atomic topology features; and
a return module, configured to return, based on the query string, a plurality of target protein fragments that satisfy a similarity condition with the protein to be queried.
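One plausible way the determining module of claim 18 could map an atomic topology feature vector to a query string is a random-hyperplane locality-sensitive hash, so that nearby feature vectors collide on the same key. This is purely an illustrative assumption; the patent does not disclose the string construction:

```python
import numpy as np

def feature_to_query_string(feature, planes):
    """Sign of the projection onto each random hyperplane gives one bit;
    the bits are packed into a hex string usable as an index key."""
    bits = (feature @ planes.T) > 0
    value = 0
    for b in bits:
        value = (value << 1) | int(b)
    width = (len(bits) + 3) // 4  # hex digits needed for all bits
    return format(value, "0{}x".format(width))
```

With such a scheme, the coarse stage of claim 16 reduces to an exact (or prefix) lookup on the hash string before the distance-based screening.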
19. A computer device, comprising one or more processors and one or more memories, the one or more memories storing at least one computer program that is loaded and executed by the one or more processors to implement the method of generating a protein index according to any one of claims 1 to 11, or the method of querying a protein fragment according to any one of claims 12 to 16.
20. A computer-readable storage medium, storing at least one computer program that is loaded and executed by a processor to implement the method of generating a protein index according to any one of claims 1 to 11, or the method of querying a protein fragment according to any one of claims 12 to 16.
CN202310146693.4A 2023-02-08 2023-02-08 Method for generating protein index, method and device for querying protein fragment Pending CN116955713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310146693.4A CN116955713A (en) 2023-02-08 2023-02-08 Method for generating protein index, method and device for querying protein fragment


Publications (1)

Publication Number Publication Date
CN116955713A true CN116955713A (en) 2023-10-27

Family

ID=88453652


Country Status (1)

Country Link
CN (1) CN116955713A (en)

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40098974
Country of ref document: HK