CN112201314A - Method and device for extracting molecular fingerprints and calculating correlation degree based on molecular fingerprints - Google Patents

Info

Publication number
CN112201314A
Authority
CN
China
Prior art keywords
molecule
character
sample
molecular
molecules
Prior art date
Legal status
Granted
Application number
CN202010988652.6A
Other languages
Chinese (zh)
Other versions
CN112201314B (en)
Inventor
李相彬
周杰龙
Current Assignee
Beijing Wangshi Intelligent Technology Co ltd
Original Assignee
Beijing Wangshi Intelligent Technology Co ltd
Application filed by Beijing Wangshi Intelligent Technology Co ltd filed Critical Beijing Wangshi Intelligent Technology Co ltd
Priority to CN202010988652.6A priority Critical patent/CN112201314B/en
Publication of CN112201314A publication Critical patent/CN112201314A/en
Application granted granted Critical
Publication of CN112201314B publication Critical patent/CN112201314B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs

Landscapes

  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a method and a device for extracting molecular fingerprints and for calculating a correlation degree based on the molecular fingerprints. The method for extracting a molecular fingerprint comprises: acquiring a plurality of characters of a molecule to be detected; determining, according to the plurality of characters and a preset character dictionary, a feature vector corresponding to each character; and extracting the molecular fingerprint of the molecule to be detected according to the feature vectors and a molecular fingerprint extraction model. The method addresses a problem in the related art: molecular fingerprints determined from manually designed molecular features cannot describe the overall structure of a molecule, so molecules that are structurally similar may still be unrelated in potential activity. By capturing the key feature information of the molecule, the method obtains more accurate information about molecular activity correlation and allows molecular similarity to be evaluated accurately, so that ligand-based virtual screening becomes more accurate and efficient and the time required for virtual screening is shortened.

Description

Method and device for extracting molecular fingerprints and calculating correlation degree based on molecular fingerprints
Technical Field
The invention relates to the field of data processing and analysis, in particular to a method and a device for extracting molecular fingerprints and calculating correlation based on the molecular fingerprints.
Background
Searching for potentially active molecules, which may be called hit molecules, is a key step in drug design and discovery. Pharmaceutical experts generally use computers and related technologies to help find hit molecules, and virtual screening is one of the important technologies. In virtual screening, molecular fingerprints are generally used to determine the similarity between a reference ligand and a candidate ligand. A molecular fingerprint is an abstract representation of a molecule: the molecule is converted into a bit string, and molecules are then compared using one of various vector similarity measures.
Molecular fingerprints in the prior art include the following: (1) substructure-based molecular fingerprints, in which the bits of a bit string are set according to the presence or absence of certain substructures or features from a given structural list; (2) topological or path-based molecular fingerprints (Topological or Path-Based Fingerprint), which are generated by analyzing all molecular fragments on the paths starting from an atom, up to a specified number of bonds, and hashing the fragments on each path; (3) circular molecular fingerprints (Circular Fingerprint), which take a heavy atom as the center, find the molecular fragments with a fixed radius and length around it, and hash the structural features of those fragments; (4) pharmacophore fingerprints (Pharmacophore Fingerprint), which encode the structural features of a molecule in a manner similar to substructure-based fingerprints, together with the distances between the features, which are classified by distance range to generate a bit string.
Different molecular fingerprints therefore differ in implementation and emphasis, but in virtual screening the purpose of using them is the same: to find molecules with similar activity. Existing molecular fingerprints are determined from manually designed molecular features and do not describe the overall structure of the molecule completely, so molecules whose fingerprints are structurally similar may still not be close in potential activity.
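To make the idea of comparing bit strings with a vector similarity measure concrete, the following is a minimal sketch using the Tanimoto coefficient, one commonly used measure. The two 8-bit fingerprints and the function name `tanimoto` are illustrative assumptions and are not taken from the patent.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient of two equal-length bit strings such as '1010'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints, and bits set in at least one of them.
    both = sum(1 for a, b in zip(fp_a, fp_b) if a == "1" and b == "1")
    either = sum(1 for a, b in zip(fp_a, fp_b) if a == "1" or b == "1")
    # Two all-zero fingerprints share no features; treat them as dissimilar.
    return both / either if either else 0.0


reference = "11010010"  # hypothetical reference-ligand fingerprint
candidate = "11000011"  # hypothetical candidate-ligand fingerprint
score = tanimoto(reference, candidate)
print(score)
```

In a screening loop, candidates would be ranked by this score against the reference ligand.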
Disclosure of Invention
Therefore, the technical problem to be solved by the present invention is to overcome the defect in the prior art that the description of the whole structure of molecules is not complete enough, resulting in the selected molecules being not close in terms of potential activity of molecules even though the molecules are structurally similar, thereby providing a method and a device for extracting molecular fingerprints and calculating the correlation based on the molecular fingerprints.
According to a first aspect, an embodiment of the present invention provides a method for extracting a molecular fingerprint, including: acquiring a plurality of characters of a molecule to be detected; determining, according to the plurality of characters and a preset character dictionary, a feature vector corresponding to each character; and extracting the molecular fingerprint of the molecule to be detected according to the feature vectors and a molecular fingerprint extraction model.
With reference to the first aspect, in a first embodiment of the first aspect, extracting the molecular fingerprint of the molecule to be detected according to the feature vectors and the molecular fingerprint extraction model specifically includes: generating a hidden state of the initial character and an output state of the initial encoding long short-term memory (LSTM) chain unit corresponding to the initial character according to the feature vector of the initial character and a preset input state; generating a hidden state of the (n-1)th character and an output state of the (n-1)th encoding LSTM chain unit corresponding to the (n-1)th character according to the feature vector corresponding to the (n-1)th character and the output state of the encoding LSTM chain unit corresponding to the (n-2)th character, where n is greater than or equal to 3; and generating a hidden state of the nth character and the molecular fingerprint of the molecule to be detected according to the feature vector corresponding to the nth character and the output state of the encoding LSTM chain unit corresponding to the (n-1)th character.
With reference to the first aspect, in a second embodiment of the first aspect, the step of constructing the molecular fingerprint extraction model includes: acquiring a target molecule set, and dividing the target molecule set into a training set and a test set, wherein the training set comprises a plurality of training subsets; obtaining a plurality of sample characters of a plurality of sample molecules in a training subset; determining, according to the sample characters and a preset character dictionary, a sample feature vector corresponding to each sample character; generating a hidden state of the initial sample character and an output state of the initial encoding long short-term memory (LSTM) chain unit corresponding to the initial sample character according to the sample feature vector of the initial sample character and a preset input state; generating a hidden state of the (n-1)th sample character and an output state of the (n-1)th encoding LSTM chain unit corresponding to the (n-1)th sample character according to the sample feature vector corresponding to the (n-1)th sample character and the output state of the encoding LSTM chain unit corresponding to the (n-2)th sample character, where n is greater than or equal to 3; generating a hidden state of the nth sample character and a molecular fingerprint of the sample molecule according to the sample feature vector corresponding to the nth sample character and the output state of the encoding LSTM chain unit corresponding to the (n-1)th sample character;
obtaining an output state of the initial decoding LSTM chain unit and an initial hidden state according to the molecular fingerprint of the sample molecule and a preset start identifier; generating an initial sampling character probability matrix according to the initial hidden state and the encoding hidden state set; selecting an initial sampling character according to the initial sampling character probability matrix, wherein the encoding hidden state set is the set of hidden states from that of the initial sample character to that of the nth sample character; obtaining an output state of the (n-1)th decoding LSTM chain unit and an (n-1)th hidden state according to the sampling feature vector corresponding to the (n-2)th sampling character and the output state of the (n-2)th decoding LSTM chain unit; generating an (n-1)th sampling character probability matrix according to the (n-1)th hidden state and the encoding hidden state set; selecting an (n-1)th sampling character according to the (n-1)th sampling character probability matrix, where n is greater than or equal to 3; generating a hidden state of the nth sample character according to the sample feature vector corresponding to the (n-1)th sample character and the output state of the decoding LSTM chain unit corresponding to the (n-1)th sample character, and generating an nth sampling character probability matrix according to the nth hidden state and the encoding hidden state set; selecting an nth sampling character according to the nth sampling character probability matrix; generating a recovered sample molecule according to the plurality of sampling characters; and constructing the molecular fingerprint extraction model according to the sample molecules and the recovered sample molecules.
With reference to the second embodiment of the first aspect, in the third embodiment of the first aspect, before the step of obtaining the set of target molecules, the method further comprises: acquiring a molecular set in a preset database; cleaning the molecular set according to preset conditions to generate a cleaned molecular set; and converting the cleaned molecular set into a preset character format to generate a target molecular set.
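A minimal sketch of the cleaning step, assuming the molecules are already available as SMILES strings and that the preset conditions are a maximum length, an allowed alphabet, and removal of duplicates. These conditions, the function name `clean_molecules`, and the sample data are illustrative assumptions; the text does not specify them.

```python
def clean_molecules(smiles_list, max_length=100, allowed_chars=None):
    """Drop empty, over-long, or out-of-alphabet strings and remove duplicates."""
    cleaned, seen = [], set()
    for smiles in smiles_list:
        smiles = smiles.strip()
        if not smiles or len(smiles) > max_length:
            continue  # empty or longer than the assumed length limit
        if allowed_chars is not None and not set(smiles) <= allowed_chars:
            continue  # contains a character outside the assumed alphabet
        if smiles in seen:
            continue  # duplicate of a molecule already kept
        seen.add(smiles)
        cleaned.append(smiles)
    return cleaned


raw = ["CCO", " CCO ", "", "c1ccccc1", "C" * 200, "CC?O"]  # made-up input
alphabet = set("CNOcno()=#123456789")                       # assumed alphabet
target_set = clean_molecules(raw, max_length=100, allowed_chars=alphabet)
print(target_set)
```

Converting other molecular formats into the preset character format would need a cheminformatics toolkit and is left out of this sketch.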
With reference to the third embodiment of the first aspect, in a fourth embodiment of the first aspect, the nth sampling character probability matrix is calculated by the following formulas:
[Three formula images (BDA0002690074630000031, BDA0002690074630000032, BDA0002690074630000041) are not reproduced in this text.]
wherein weight represents the weight of the encoding hidden state set; the first symbol (image BDA0002690074630000042, not reproduced) denotes the t-th hidden state; the second symbol (image BDA0002690074630000043, not reproduced) denotes the hidden state of the i-th sample character; linear denotes a linear function; and concat denotes a concatenation function.
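Because the formula images are not available, the exact computation cannot be recovered from this text. The following is a hedged sketch of one common way to combine the named ingredients (a weight over the encoding hidden states, a concatenation, and a linear function): dot-product attention followed by a softmax over the character dictionary. The choice of dot-product scoring, the softmax, the argmax selection, and all names and numbers below are assumptions, not the patent's formulas.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def char_probabilities(dec_hidden, enc_hiddens, W, b):
    """Return (attention weights, character probabilities) for one decoding step."""
    # Assumed scoring: dot product of the decoder state with each encoder state.
    weights = softmax([dot(dec_hidden, h) for h in enc_hiddens])
    # Weighted sum of the encoder hidden states (context vector).
    dim = len(dec_hidden)
    context = [sum(w * h[k] for w, h in zip(weights, enc_hiddens)) for k in range(dim)]
    joined = context + dec_hidden  # concat
    logits = [dot(row, joined) + bias for row, bias in zip(W, b)]  # linear
    return weights, softmax(logits)


# Toy numbers: three encoder states of size 2 and a dictionary of three characters.
enc_hiddens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_hidden = [0.5, -0.5]
W = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1], [0.3, 0.1, -0.2, 0.0]]
b = [0.0, 0.0, 0.0]
weights, probs = char_probabilities(dec_hidden, enc_hiddens, W, b)
# One possible selection rule: pick the most probable character index.
sampled_index = max(range(len(probs)), key=probs.__getitem__)
```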
With reference to the fourth embodiment of the first aspect, in a fifth embodiment of the first aspect, the step of constructing the molecular fingerprint extraction model according to the sample molecules and the recovered sample molecules includes: calculating a reconstruction loss between the sample molecules and the recovered sample molecules according to the number of sample molecules in the training subset, the length of each sample molecule, the length of the preset character dictionary, and feature data of each sample molecule, wherein the feature data represents the preset label of the sampling character at any position of a recovered sample molecule and the probability of that sampling character occurring at that position; determining a target number of training iterations for the training set according to the reconstruction loss; and, when the number of training iterations of the training set reaches the target number, determining that the molecular fingerprint extraction model has been generated.
With reference to the fifth embodiment of the first aspect, in a sixth embodiment of the first aspect, the reconstruction loss between the sample molecules and the recovered sample molecules is calculated by the following formula:
[Formula image BDA0002690074630000044 is not reproduced in this text.]
wherein N represents the number of sample molecules in the training subset, L represents the length of a sample molecule, and D represents the length of the preset character dictionary; the first symbol (image BDA0002690074630000045, not reproduced) is the preset label indicating that the i-th position of the recovered sample molecule corresponds to sampling character j; and the second symbol (image BDA0002690074630000046, not reproduced) is the probability that sampling character j occurs at the i-th position of the n-th recovered sample molecule.
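The quantities defined above (N molecules, L positions, D dictionary entries, a label and a probability per position and character) match the shape of a categorical cross-entropy. The sketch below assumes that form, averaged over the N molecules; the normalization, the function name, and the toy data are assumptions, since the formula image itself is not available.

```python
import math


def reconstruction_loss(labels, probs):
    """Cross-entropy summed over positions and characters, averaged over molecules.

    labels[n][i][j] is 1 if character j is the target at position i of molecule n,
    else 0; probs[n][i][j] is the predicted probability of that character.
    """
    n_molecules = len(labels)
    total = 0.0
    for y_mol, p_mol in zip(labels, probs):        # over the N molecules
        for y_pos, p_pos in zip(y_mol, p_mol):     # over the L positions
            for y, p in zip(y_pos, p_pos):         # over the D characters
                if y:
                    total -= y * math.log(p)
    return total / n_molecules


# Toy case: N = 2 molecules, L = 2 positions, D = 3 dictionary characters.
labels = [[[1, 0, 0], [0, 1, 0]], [[0, 0, 1], [1, 0, 0]]]
probs = [[[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]],
         [[0.25, 0.25, 0.5], [0.5, 0.25, 0.25]]]
loss = reconstruction_loss(labels, probs)
print(loss)
```

Every target character here has probability 0.5, so the loss equals 2 ln 2; a perfect reconstruction would give a loss of 0.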
With reference to the sixth embodiment of the first aspect, in a seventh embodiment of the first aspect, the method further includes: obtaining a benchmark molecule and its activity index value; obtaining test molecules in the test set and their activity index values; generating a benchmark molecular fingerprint and test molecular fingerprints according to the molecular fingerprint extraction model, the benchmark molecule and the test molecules; calculating the similarity between the benchmark molecule and each test molecule according to the benchmark molecular fingerprint, the activity index value of the benchmark molecule, the test molecular fingerprint and the activity index value of the test molecule; calculating, according to the similarity and a preset Spearman correlation coefficient function, the correlation degree between the similarity of the test molecules to the benchmark molecule and the activity index difference; and, when the correlation degree is greater than a preset correlation degree threshold, determining that the molecular fingerprint extraction model is effective.
With reference to the seventh embodiment of the first aspect, in an eighth embodiment of the first aspect, the similarity between the benchmark molecule and the test molecule is calculated by the following formula:
[Formula image BDA0002690074630000051 is not reproduced in this text.]
wherein similarity represents the similarity between the benchmark molecule and the test molecule, fps_1 represents the benchmark molecular fingerprint, and fps_2 represents the test molecular fingerprint;
the correlation degree is calculated by the following formula:
corr = spearman(similarity, |IC50_1 - IC50_2|),
wherein corr represents the correlation degree between the similarity of the test molecule to the benchmark molecule and the activity index difference, and spearman represents a preset Spearman correlation coefficient function.
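The correlation step can be sketched as follows. The similarity formula image is not available, so cosine similarity between fingerprint vectors is used here purely as an assumption; the Spearman coefficient is computed as the Pearson correlation of average ranks. The fingerprints and IC50 values are made-up toy data.

```python
import math


def cosine_similarity(u, v):
    """Assumed similarity measure between two fingerprint vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


benchmark_fp, benchmark_ic50 = [1.0, 0.0, 1.0], 10.0       # hypothetical benchmark
test_fps = [[1.0, 0.0, 0.9], [0.5, 0.5, 0.5], [0.0, 1.0, 0.0]]
test_ic50 = [12.0, 40.0, 90.0]

similarity = [cosine_similarity(benchmark_fp, fp) for fp in test_fps]
ic50_diff = [abs(benchmark_ic50 - v) for v in test_ic50]
corr = spearman(similarity, ic50_diff)
print(corr)
```

In this toy data the most similar molecule has the smallest activity difference, so the coefficient is close to -1. How the sign or magnitude is compared with the preset threshold is not stated in this text and is left open here.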
According to a second aspect, an embodiment of the present invention provides a method for calculating a correlation degree based on molecular fingerprints, including: acquiring a benchmark molecule and a molecule to be detected, and extracting a benchmark molecular fingerprint and a molecular fingerprint to be detected from them, wherein both fingerprints are obtained by the method for extracting a molecular fingerprint according to any one of claims 1 to 9; acquiring a first activity index value of the molecule to be detected and a second activity index value of the benchmark molecule; calculating an activity index difference, and calculating the similarity between the benchmark molecule and the molecule to be detected according to the benchmark molecular fingerprint and the molecular fingerprint to be detected; and calculating a target correlation degree according to the activity index difference, the similarity, and a preset Spearman correlation coefficient function, wherein the target correlation degree represents the degree of correlation between the similarity of the molecule to be detected to the benchmark molecule and the activity index difference.
With reference to the second aspect, in a first embodiment of the second aspect, the similarity between the benchmark molecule and the molecule to be detected is calculated by the following formula:
[Formula image BDA0002690074630000052 is not reproduced in this text.]
wherein similarity represents the similarity between the benchmark molecule and the molecule to be detected, fps_1 represents the benchmark molecular fingerprint, and fps_3 represents the molecular fingerprint of the molecule to be detected;
the correlation degree is calculated by the following formula:
corr = spearman(similarity, |IC50_1 - IC50_2|),
wherein corr represents the correlation degree between the similarity of the molecule to be detected to the benchmark molecule and the activity index difference, and spearman represents a preset Spearman correlation coefficient function.
According to a third aspect, an embodiment of the present invention provides an apparatus for extracting a molecular fingerprint, including: a to-be-detected molecule character acquisition module, configured to acquire a plurality of characters of a molecule to be detected; a feature vector determination module, configured to determine, according to the plurality of characters and a preset character dictionary, a feature vector corresponding to each character; and a first molecular fingerprint extraction module, configured to extract the molecular fingerprint of the molecule to be detected according to the feature vectors and the molecular fingerprint extraction model.
According to a fourth aspect, an embodiment of the present invention provides a device for calculating correlation based on molecular fingerprints, including: a second molecular fingerprint extraction module, configured to obtain a marker post molecule and a molecule to be detected, and extract a marker post molecular fingerprint and a molecular fingerprint to be detected according to the marker post molecule and the molecule to be detected, where the marker post molecular fingerprint and the molecular fingerprint to be detected are obtained according to the first aspect or the method for extracting a molecular fingerprint according to any one of the embodiments of the first aspect; an activity index value acquisition module for acquiring a first activity index value of the molecule to be detected and a second activity index value of the benchmarking molecule; the similarity calculation module is used for calculating to obtain an activity index difference value and the similarity between the benchmarking molecules and the molecules to be detected according to the benchmarking molecule fingerprints and the molecular fingerprints to be detected; and the target correlation degree calculation module is used for calculating to obtain target correlation degree according to the activity index difference value, the similarity of the molecules to be detected and a preset spearman correlation coefficient function, and the target correlation degree is used for representing the correlation degree between the similarity of the molecules to be detected and the benchmarking molecules and the activity index difference value.
According to a fifth aspect, an embodiment of the present invention provides a computer device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the method for extracting molecular fingerprints according to the first aspect or any of the embodiments of the first aspect, and the steps of the method for calculating correlations based on molecular fingerprints according to the second aspect or the second embodiment of the first aspect.
According to a sixth aspect, embodiments of the present invention provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the method for extracting a molecular fingerprint according to the first aspect or any of the embodiments of the first aspect, and the steps of the method for calculating a correlation based on molecular fingerprints according to the second aspect or the second embodiment of the first aspect.
The technical scheme of the invention has the following advantages:
The invention provides a method and a device for extracting molecular fingerprints and for calculating a correlation degree based on the molecular fingerprints. The method for extracting a molecular fingerprint comprises: acquiring a plurality of characters of a molecule to be detected; determining, according to the plurality of characters and a preset character dictionary, a feature vector corresponding to each character; and extracting the molecular fingerprint of the molecule to be detected according to the feature vectors and the molecular fingerprint extraction model. The invention addresses the problem in the related art that molecular fingerprints determined from manually designed molecular features cannot describe the overall structure of a molecule, so that molecules that are structurally similar may still be unrelated in potential activity. With the invention, a higher similarity between molecular fingerprints indicates a higher similarity in potential activity; that is, the key feature information of the molecule is learned and more accurate information about molecular activity correlation is obtained, so that molecular similarity can be evaluated accurately, ligand-based virtual screening becomes more accurate and efficient, and the time required for virtual screening is shortened.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart illustrating a specific example of a molecular fingerprint extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the structure of a molecule in the method for extracting a molecular fingerprint according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the positions of character feature vectors in a molecule after conversion into SMILES format according to the method for extracting a molecular fingerprint of the present invention;
FIG. 4 is a schematic structural diagram of an encoder in a molecular fingerprint extraction model constructed in the method for extracting a molecular fingerprint according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an encoder and a decoder in a molecular fingerprint extraction model in the method for extracting a molecular fingerprint according to an embodiment of the present invention;
FIG. 6 is a graph showing a comparison of the application effects of molecular fingerprints in the method for extracting molecular fingerprints according to the embodiment of the present invention;
FIG. 7 is a flowchart illustrating a specific example of a method for calculating correlation based on molecular fingerprints according to an embodiment of the present invention;
FIG. 8 is a schematic block diagram of a specific example of a molecular fingerprint extraction apparatus according to an embodiment of the present invention;
FIG. 9 is a functional block diagram of a specific example of a device for calculating a correlation degree based on molecular fingerprints according to an embodiment of the present invention;
FIG. 10 is a diagram showing a specific example of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; the two elements may be directly connected or indirectly connected through an intermediate medium, or may be communicated with each other inside the two elements, or may be wirelessly connected or wired connected. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
One of the most important issues in comparing similarity between two molecules is the complexity of molecular characterization. To make comparison of molecules computationally easier, some degree of simplification or abstraction of the molecules is required; the embodiment of the invention provides a method and a device for extracting molecular fingerprints and calculating the correlation degree based on the molecular fingerprints, aiming at obtaining more accurate molecular key information, further determining the similarity of the molecular potential activity by comparing the similarity of the molecular fingerprints, shortening the virtual screening time and improving the virtual screening efficiency.
The embodiment of the invention provides a molecular fingerprint extraction method, as shown in fig. 1, comprising the following steps:
Step S11: acquiring a plurality of characters of a molecule to be detected. In this embodiment, the molecule to be detected may be any molecule in a molecule database that is to be evaluated. The plurality of characters is generated by converting the molecule to be detected into a preset character format. Specifically, the molecule may be converted into the Simplified Molecular-Input Line-Entry System (SMILES) format, a character-based representation of molecular structure that can comprehensively represent the overall structural feature information of the molecule. For example, the molecule shown in FIG. 2 can be represented in SMILES format as CN(C)CCC(c1ccccc1)c2ccccn2.
Step S12: respectively determining a feature vector corresponding to each character according to the characters and a preset character dictionary; in this embodiment, the preset character dictionary may be a pre-stored database for storing corresponding molecular characters and feature vectors of the characters; as shown in fig. 3, the feature vector of a character may be specific position information representing each character in a molecule; specifically, the molecules to be tested are converted into the SMILES format, the characters in the molecules can be arranged according to the conversion sequence, and the feature vectors corresponding to the characters are determined according to the preset character dictionary.
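The lookup in step S12 can be sketched as follows, assuming a character-level dictionary that maps each SMILES character to an index and a one-hot feature vector per character. The one-hot encoding and the helper names `build_dictionary` and `to_feature_vectors` are illustrative assumptions; the text only states that the dictionary stores characters and their feature vectors.

```python
def build_dictionary(smiles_list):
    """Map every distinct character to an index (sorted for reproducibility)."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {ch: idx for idx, ch in enumerate(chars)}


def to_feature_vectors(smiles, dictionary):
    """One one-hot vector per character, in the order the characters appear."""
    size = len(dictionary)
    vectors = []
    for ch in smiles:
        vec = [0] * size
        vec[dictionary[ch]] = 1
        vectors.append(vec)
    return vectors


smiles = "CN(C)CCC(c1ccccc1)c2ccccn2"  # the example molecule of FIG. 2
dictionary = build_dictionary([smiles])
vectors = to_feature_vectors(smiles, dictionary)
print(len(dictionary), len(vectors))
```

A practical dictionary would be built from the whole target molecule set, and multi-character atom symbols such as Cl or Br would need their own handling.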
Step S13: and extracting the molecular fingerprint of the molecule to be detected according to the characteristic vector and the molecular fingerprint extraction model. In this embodiment, the molecular fingerprint extraction model may be a model for extracting molecular fingerprints of various molecules, and may be generated by presetting a test subset in a database and training the test subset; acquiring each character of a molecule to be detected, determining a characteristic vector corresponding to each character, sequentially inputting the characteristic vector of each character into a molecule fingerprint extraction model, and extracting the molecule fingerprint of the molecule to be detected.
The invention provides a molecular fingerprint extraction method comprising the following steps: acquiring a plurality of characters of a molecule to be detected; determining a feature vector corresponding to each character according to the characters and a preset character dictionary; and extracting the molecular fingerprint of the molecule to be detected according to the feature vectors and a molecular fingerprint extraction model. This method addresses the problem in the related art that molecular fingerprints determined from manually designed molecular features cannot describe the whole structure of a molecule. With the extracted fingerprints, a higher fingerprint similarity can indicate a higher similarity in potential activity; the key information of a molecule is captured more accurately and its structural information is described completely and comprehensively, so that ligand-based virtual screening can be more accurate and efficient and the time it requires can be effectively shortened.
As an optional implementation manner of the present invention, in the step S13, the step of extracting the molecular fingerprint of the molecule to be detected according to the feature vector and the molecular fingerprint extraction model specifically includes:
firstly, generating a hidden state of the initial character and an output state of the initial encoding long short-term memory (LSTM) chain unit corresponding to the initial character, according to the feature vector of the initial character and a preset input state. In this embodiment, the process by which the molecular fingerprint extraction model extracts the molecular fingerprint of the molecule to be detected may be implemented by an encoder, a schematic diagram of which may be as shown in FIG. 4. Specifically, the encoder may be an LSTM chain comprising a plurality of LSTM units, the number of which may be determined according to the character length of the molecule to be detected. The feature vectors corresponding to the characters of the molecule to be detected are input in sequence into the corresponding LSTM units for encoding; that is, the input of the encoder can be the SMILES of the whole molecule to be detected together with a preset initial input state. The feature vector of the initial character can be the feature vector corresponding to the first character of the molecule to be detected. The preset input state can be a preset initial state of the encoding LSTM units in the encoder. The hidden state of the initial character can be the hidden state of the first character of the molecule to be detected; a hidden state is an output of an LSTM unit and can carry information about the character and the characters before it. The output state of the initial encoding LSTM chain unit corresponding to the initial character may be the output state of the first LSTM unit in the encoder.
That is, the input of the first LSTM unit is the feature vector corresponding to the first character of the molecule to be detected and the preset input state S0, and the output of the first LSTM unit is the output state of the first LSTM unit and the hidden state of the first character of the molecule to be detected.
Then, generating a hidden state of the (n-1)th character and an output state of the (n-1)th encoding long short-term memory (LSTM) chain unit corresponding to the (n-1)th character, according to the feature vector corresponding to the (n-1)th character and the output state of the encoding LSTM chain unit corresponding to the (n-2)th character, wherein n is greater than or equal to 3. In this embodiment, the encoder includes a plurality of LSTM units, and each LSTM unit from the 2nd to the (n-1)th operates as follows: the hidden state of the character at the corresponding position and the output state of the LSTM unit are generated according to the feature vector of that character and the output state of the previous LSTM unit.
And then, generating a hidden state of the nth character and the molecular fingerprint of the molecule to be detected, according to the feature vector corresponding to the nth character and the output state of the encoding LSTM chain unit corresponding to the (n-1)th character. In this embodiment, the last LSTM unit in the encoder performs encoding according to the feature vector corresponding to the last character of the molecule to be detected and the output state of the previous LSTM unit, and finally generates the hidden state of the last character and the molecular fingerprint of the molecule to be detected.
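The chained encoding described above can be sketched with a small pure-Python LSTM. This is an illustration under stated assumptions, not the patented model: the "output state" passed between units is taken to be the usual LSTM pair (h, c), the fingerprint is taken to be the final state, the weights are random and untrained, and the names `lstm_step` and `encode` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W, b):
    """One LSTM unit: consumes a feature vector and the previous state (h, c)."""
    n = len(h)
    z = x + h  # concatenate input vector and previous hidden state
    pre = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    i_gate = [sigmoid(p) for p in pre[:n]]
    f_gate = [sigmoid(p) for p in pre[n:2 * n]]
    g_cand = [math.tanh(p) for p in pre[2 * n:3 * n]]
    o_gate = [sigmoid(p) for p in pre[3 * n:]]
    c_new = [f * cj + i * g for f, cj, i, g in zip(f_gate, c, i_gate, g_cand)]
    h_new = [o * math.tanh(cj) for o, cj in zip(o_gate, c_new)]
    return h_new, c_new


def encode(vectors, hidden_size=4, seed=0):
    """Run the LSTM chain over all character vectors, one unit per character."""
    rng = random.Random(seed)
    input_size = len(vectors[0])
    # Random, untrained weights shared by every unit in the chain.
    W = [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
         for _ in range(4 * hidden_size)]
    b = [0.0] * (4 * hidden_size)
    h, c = [0.0] * hidden_size, [0.0] * hidden_size  # preset initial state S0
    hidden_states = []
    for x in vectors:
        h, c = lstm_step(x, h, c, W, b)
        hidden_states.append(h)  # hidden state of the character at this position
    fingerprint = h + c  # assumption: final state serves as the fingerprint
    return hidden_states, fingerprint


toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
hidden_states, fingerprint = encode(toy_vectors)
```

The list `hidden_states` plays the role of the per-character hidden states, and the loop makes explicit that each unit only sees its own character vector and the state handed over by the previous unit.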
The embodiment of the invention provides a method for extracting a molecular fingerprint, which combines the plurality of LSTM units in the encoder with the molecule to be detected in SMILES format to obtain the hidden state of each character of the molecule and the molecular fingerprint of the molecule. A molecular fingerprint that accurately represents the overall structural feature information of the molecule can thus be extracted, and the correlation between potential molecular activity and molecular structure similarity can be improved. That is, when the molecular fingerprint of the molecule to be detected is similar to that of the benchmarking molecule, its other key index information is also similar to that of the benchmarking molecule. The time required for virtual screening based on ligand similarity can therefore be shortened and the efficiency of virtual screening improved.
As an optional embodiment of the present invention, the step of constructing the molecular fingerprint extraction model in step S13 specifically includes:
firstly, acquiring a target molecule set, and dividing the target molecule set into a training set and a test set, wherein the training set comprises a plurality of training subsets. In this embodiment, the target molecule set may be a set of molecules from a preset database that have been cleaned and screened by preset steps, converted into the SMILES format, and meet the requirements for constructing the molecular fingerprint extraction model. The training set may be the part of the target molecule set used for training the molecular fingerprint extraction model; the test set can be the set of molecules used for testing the model trained on the training set. The training set may include a plurality of training subsets, that is, a plurality of training batches, and the number of molecules contained in each batch may be preset.
Specifically, the target molecule set is randomly divided into a training set and a test set, and training can then be performed on the molecules in the training set, starting from a preset initial model that integrates a language translation model (Sequence to Sequence, Seq2Seq) with an attention mechanism model (Attention), to generate the molecular fingerprint extraction model.
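The random division into a training set, a test set and training batches can be sketched as follows. The split ratio is not given in the source, so `test_fraction`, the default `batch_size` and the function name `split_and_batch` are hypothetical values chosen only for illustration.

```python
import random


def split_and_batch(molecules, test_fraction=0.1, batch_size=4, seed=0):
    """Randomly split molecules into training batches and a test set."""
    rng = random.Random(seed)
    shuffled = list(molecules)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    training_set = shuffled[n_test:]
    # Each training subset (batch) holds at most batch_size molecules.
    batches = [training_set[i:i + batch_size]
               for i in range(0, len(training_set), batch_size)]
    return batches, test_set


molecules = [f"mol_{i}" for i in range(20)]  # placeholder identifiers
batches, test_set = split_and_batch(molecules)
```

Fixing the seed keeps the split reproducible, which matters when the same test set is later reused to compare fingerprints.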
Then, obtaining a plurality of sample characters of a plurality of sample molecules in the training subset. In this embodiment, the molecules in the training subset may be referred to as sample molecules. For each sample molecule, the SMILES of the sample molecule is taken first, i.e. all sample characters of the sample molecule are taken in sequence; for example, the characters of a sample molecule may be CN(C)CCC(c1ccccc1)c2ccccn2.
Then, respectively determining sample characteristic vectors corresponding to the sample characters according to the sample characters and a preset character dictionary; in this embodiment, sample feature vectors corresponding to sample characters are determined one by one according to a pre-stored character dictionary.
Then, generating a hidden state of the initial sample character and an output state of an initial coding long-short term memory chain unit corresponding to the initial sample character according to the sample feature vector of the initial sample character and a preset input state; in this embodiment, as shown in fig. 5, the training process of the molecular fingerprint extraction model can be divided into an encoder and a decoder; the encoder stores a plurality of encoding LSTM units, the number of the encoding LSTM units can be determined according to the number of characters of sample molecules, when the number of characters of the sample molecules is 15, the number of LSTM units in the encoder is 15, correspondingly, the number of decoding LSTM units in the decoder is also 15, and the decoding LSTM units correspond to the sample characters of the sample molecules.
Specifically, in the encoder, the input of the first bit encoding LSTM unit may be a preset initial input state and a sample feature vector corresponding to the first bit sample character of the sample molecule, and in the first bit encoding LSTM unit, the output state of the first bit encoding LSTM unit and the hidden state of the first bit sample character of the sample molecule are generated.
Then, generating a hidden state of the (n-1)th sample character and an output state of the (n-1)th encoding LSTM chain unit corresponding to the (n-1)th sample character, according to the sample feature vector corresponding to the (n-1)th sample character and the output state of the encoding LSTM chain unit corresponding to the (n-2)th sample character, wherein n is greater than or equal to 3. In this embodiment, when a sample molecule has 15 sample characters and the encoder has 15 LSTM units, then for each encoding LSTM unit from the 2nd to the 14th, the input is the output state of the previous LSTM unit and the sample feature vector corresponding to the sample character at the corresponding position of the sample molecule, and the output is the output state of the LSTM unit and the hidden state of the sample character at the corresponding position.
Then, generating a hidden state of the nth sample character and a molecular fingerprint of the sample molecule, according to the sample feature vector corresponding to the nth sample character and the output state of the encoding LSTM chain unit corresponding to the (n-1)th sample character. In this embodiment, the input of the last LSTM unit in the encoder is the output state of the previous LSTM unit and the feature vector corresponding to the last sample character of the sample molecule, and its output is the molecular fingerprint of the sample molecule and the hidden state of the last character of the sample molecule.
Then, obtaining the output state and the initial hidden state of the initial decoding LSTM chain unit according to the molecular fingerprint of the sample molecule and a preset start identifier; generating an initial sampling character probability matrix according to the initial hidden state and the encoding hidden state set; and screening to generate an initial sampling character according to the initial sampling character probability matrix. The encoding hidden state set is the set of hidden states from the hidden state of the initial sample character through the hidden state of the nth sample character. In this embodiment, passing a sample molecule through the encoder generates the molecular fingerprint of the sample molecule and the encoder hidden state set of the sample characters of the sample molecule.
Illustratively, according to the molecular fingerprint of the sample molecule generated by the encoder and the start identifier, generating the output state and the first hidden state of the first bit-decoding LSTM unit in the first bit-decoding LSTM unit; the decoder comprises a plurality of decoding LSTM units, an attention layer and a linear layer, wherein specifically, in the attention layer, according to a first hidden state output by a first bit decoding LSTM unit and an encoder hidden state set output by an encoder, hidden state selection and random combination are carried out to generate a linear matrix, namely an initial sampling character probability matrix, which is used for representing the output probability of each sampling character; and sampling according to the initial sampling character probability matrix to obtain a first bit sampling character.
Specifically, after the training of the molecular fingerprint extraction model is completed, the initial sampling character output by the initial decoding LSTM unit is consistent with the initial sample character of the sample molecule.
Then, obtaining the output state of the (n-1)th decoding LSTM chain unit and the (n-1)th hidden state according to the sampling feature vector corresponding to the (n-2)th sampling character and the output state of the (n-2)th decoding LSTM chain unit; generating an (n-1)th sampling character probability matrix according to the (n-1)th hidden state and the encoding hidden state set; and screening to generate an (n-1)th sampling character according to the (n-1)th sampling character probability matrix, wherein n is greater than or equal to 3. In this embodiment, for each decoding LSTM unit from the 2nd to the (n-1)th, the output state and the hidden state of the unit (that is, the 2nd through (n-1)th hidden states) are generated according to the output state of the previous decoding LSTM unit and the sampling feature vector corresponding to the sampling character output by the previous decoding LSTM unit. The process of generating the corresponding sampling characters from the 2nd through (n-1)th hidden states is similar to the process of generating the initial sampling character and is not repeated here.
Then, generating an nth hidden state according to the sampling feature vector corresponding to the (n-1)th sampling character and the output state of the decoding LSTM chain unit corresponding to the (n-1)th sampling character; generating an nth sampling character probability matrix according to the nth hidden state and the encoding hidden state set; screening to generate an nth sampling character according to the nth sampling character probability matrix; and generating a sample recovery molecule according to the plurality of sampling characters. In this embodiment, the last decoding LSTM unit takes the output state of the previous decoding LSTM unit and the sampling feature vector corresponding to the (n-1)th sampling character, which was sampled from the attention-selected sampling character probability matrix, and generates its own output state and the nth hidden state. The final sampling character is then generated through the attention selection of the attention layer and the sampling character probability matrix of the linear layer. The sampling characters generated by the decoding LSTM units are arranged in sequence to generate the sample recovery molecule in SMILES format.
Specifically, after the initial model training of the integrated language translation model and the attention mechanism model is completed, the molecular fingerprint extraction model is generated, and at this time, the sample molecules in the SMILES format input to the initial model are consistent with the sample recovery molecules in the SMILES format generated by the initial model.
And then, constructing a molecular fingerprint extraction model according to the sample molecules and the sample recovery molecules. In this embodiment, a molecular fingerprint extraction model is generated based on the input sample molecules and the sample restoration molecules generated by the initial model.
The embodiment of the invention provides a molecular fingerprint extraction method, which combines a preset initial model integrating a language translation model and an attention mechanism model, restores SMILES according to input sample molecules in the SMILES format, realizes the analysis and reconstruction of the molecules, and can enable the initial model to learn key feature information behind the molecules.
As an optional embodiment of the present invention, before the step of acquiring the target molecule set, the method for extracting a molecule fingerprint further includes:
acquiring a molecular set from a preset database; cleaning the molecular set according to preset conditions to generate a cleaned molecular set; and converting the cleaned molecular set into a preset character format to generate the target molecule set. In this embodiment, the preset database may be a library of molecules with a high probability of being druggable, such as the ChEMBL-25 database. Cleaning the molecular set according to preset conditions may include: removing molecules that cannot be recognized by RDKit; removing molecules with fewer than 10 or more than 50 heavy atoms; removing molecules with more than 65 bonds; removing molecules containing uncommon element types (for example, common element types may include P, S, N, O, C, Cl, Br, F, I and H); and removing molecules containing uncommon bond types (common bond types are single, double, triple and aromatic bonds). Converting to the preset character format may mean converting the cleaned molecular set into the SMILES format.
Specifically, when the molecular set is converted into the SMILES format, the same molecule may yield different SMILES character strings because of randomness in the conversion, which may affect the training of the molecular fingerprint extraction model; the SMILES of each molecule may therefore be normalized, and the normalized SMILES are then de-duplicated. For example, the normalization process may assign a number to each atom in the molecule and then iteratively update the numbers according to the characteristics of each atom and of its surrounding environment until the numbers of all atoms no longer change. For example, for the ChEMBL-25 database mentioned above, 1,607,036 molecules remained after the above processing.
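The cleaning and de-duplication rules above can be sketched as a filter over precomputed molecule descriptors. The `MolRecord` structure and the function names are hypothetical; in practice the heavy atom count, bond count, element types, bond types and normalized SMILES would be computed by a cheminformatics toolkit such as RDKit, which this stdlib-only sketch does not call.

```python
from dataclasses import dataclass

ALLOWED_ELEMENTS = {"P", "S", "N", "O", "C", "Cl", "Br", "F", "I", "H"}
ALLOWED_BONDS = {"single", "double", "triple", "aromatic"}


@dataclass(frozen=True)
class MolRecord:
    """Hypothetical record of descriptors computed upstream for one molecule."""
    smiles: str            # normalized SMILES
    heavy_atoms: int
    bonds: int
    elements: frozenset
    bond_types: frozenset


def passes_cleaning(rec):
    """Apply the heavy-atom, bond-count, element and bond-type conditions."""
    return (10 <= rec.heavy_atoms <= 50
            and rec.bonds <= 65
            and rec.elements <= ALLOWED_ELEMENTS
            and rec.bond_types <= ALLOWED_BONDS)


def clean_and_dedupe(records):
    """Keep each normalized SMILES once, in first-seen order."""
    seen, kept = set(), []
    for rec in records:
        if passes_cleaning(rec) and rec.smiles not in seen:
            seen.add(rec.smiles)
            kept.append(rec.smiles)
    return kept


example = MolRecord("CN(C)CCC(c1ccccc1)c2ccccn2", 18, 19,
                    frozenset({"C", "N"}), frozenset({"single", "aromatic"}))
too_small = MolRecord("CCO", 3, 2, frozenset({"C", "O"}), frozenset({"single"}))
odd_element = MolRecord("C[Si](C)(C)CCCCCCCCC", 13, 12,
                        frozenset({"C", "Si"}), frozenset({"single"}))
target_set = clean_and_dedupe([example, too_small, odd_element, example])
```

Only the first record survives here: the second is too small, the third contains an element outside the allowed list, and the fourth is a duplicate of the first.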
The embodiment of the invention provides a molecular fingerprint extraction method, which combines preset cleaning conditions to perform cleaning, format conversion, normalization and de-duplication processing on molecules in a preset database to generate a target molecule set for training a molecular fingerprint extraction model. Clean and normative molecular data can be provided for model training.
Specifically, the nth sample character probability matrix of the attention layer and the linear layer may be calculated by the following formula:
Figure BDA0002690074630000161
Figure BDA0002690074630000162
Figure BDA0002690074630000163
wherein weight represents the weight of each sample character in the encoding hidden state set,
Figure BDA0002690074630000164
represents the t-th hidden state output by the decoder,
Figure BDA0002690074630000165
represents the hidden state of the i-th sample character output by the LSTM unit of the encoder, linear represents a linear function, and concat represents a concatenation function.
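Because the formulas themselves are only available as images, the following is a hedged sketch of one common attention layout that is consistent with the symbols defined above, not a reconstruction of the source formulas. It assumes dot-product scores between the decoder hidden state and each encoder hidden state, a softmax to obtain `weight`, a weighted sum as context, and a softmax over `linear(concat(context, h_dec))`; the function name `attention_step` and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(h_dec, enc_states, W, b):
    """Return attention weights and a character probability vector."""
    # Assumed scoring: dot product of decoder state with each encoder state.
    scores = [sum(d * e for d, e in zip(h_dec, h_enc)) for h_enc in enc_states]
    weight = softmax(scores)
    # Context vector: encoder hidden states combined by their weights.
    context = [sum(w * h_enc[k] for w, h_enc in zip(weight, enc_states))
               for k in range(len(h_dec))]
    features = context + h_dec  # concat
    logits = [sum(w * f for w, f in zip(row, features)) + bias
              for row, bias in zip(W, b)]  # linear layer
    return weight, softmax(logits)


h_dec = [1.0, 0.0]
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Linear layer for a dictionary of 3 characters (illustrative weights).
W = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.1, 0.0, 0.2], [0.1, 0.1, 0.1, 0.1]]
b = [0.0, 0.0, 0.0]
weight, probs = attention_step(h_dec, enc_states, W, b)
```

The returned `probs` has one entry per dictionary character, which is the role the sampling character probability matrix plays for a single decoding position.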
As an optional embodiment of the present invention, the step of constructing a molecular fingerprint extraction model according to the sample molecules and the sample recovery molecules may specifically include:
firstly, calculating to obtain the reconstruction losses of the sample molecules and the sample recovery molecules according to the number of the sample molecules in the training subset, the length of each sample molecule, the length of a preset character dictionary and the feature data of each sample recovery molecule, wherein the feature data is used for representing the preset label of the sampling character at any position of the sample recovery molecule and the occurrence probability of the sampling character at any position of the sample recovery molecule; in this embodiment, in the training process of the molecular fingerprint extraction model, since the input sample molecules and the output sample recovery molecules are not always identical, it is necessary to adjust or optimize the training process of the molecular fingerprint extraction model for the errors between the sample molecules and the sample recovery molecules.
Specifically, according to the number of sample molecules in each training batch, the character length of each SMILES format sample molecule, the length of a preset character dictionary, the character label corresponding to the ith character in the nth sample recovery molecule, and the output probability corresponding to the ith character in the nth sample recovery molecule, the training loss of the recovered sample molecules is calculated, that is, the training error value occurring in the process of analyzing and reconstructing the sample molecules, and the training parameters in the encoder and the decoder can be optimized and adjusted according to the training error value.
Then, determining the target training times of a training set according to the change of the reconstruction loss in the training process; and when the training times of the training set reach the target training times, determining to generate a molecular fingerprint extraction model. Specifically, the target training times of the whole training set can be calculated according to the calculated reconstruction loss value of the sample molecules, and when the training times of the training set reach the target training times, the training of the molecular fingerprint extraction model can be considered to be completed.
Specifically, the training parameters in the encoder and the decoder may be adjusted during training according to the calculated reconstruction loss value as follows: using the Adaptive Moment Estimation method (Adam), an algorithm for first-order gradient-based optimization of stochastic objective functions, the training parameters are optimized according to the first-order gradient of the reconstruction loss function, and the step length and weight applied to the initial learning rate lr can be determined from first-order and second-order moment estimates of that gradient.
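A single Adam update can be sketched as follows. The update rule is the standard published one; the default values of `beta1`, `beta2` and `eps` are the commonly used defaults and are assumptions here, since the values actually used by the described method are not given in the text.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step counter."""
    new_params, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first-moment estimate
        vi = beta2 * vi + (1 - beta2) * g * g    # second-moment estimate
        m_hat = mi / (1 - beta1 ** t)            # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_params, new_m, new_v


params, m, v = [0.5, -0.5], [0.0, 0.0], [0.0, 0.0]
params, m, v = adam_step(params, [2.0, -4.0], m, v, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `lr` regardless of the gradient's magnitude.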
Specifically, when the model is trained on the sample molecules of one training batch, the average reconstruction loss of all sample molecules of the batch is calculated according to the method described in the above embodiment; the number of sample molecules in a training batch is determined by batch_size. Then, the first derivative of the average reconstruction loss with respect to the training parameters is calculated, and all training parameters are updated according to the step length and weight of the optimized initial learning rate (lr). When the training set has been trained for the target number of training times (num_epochs), the molecular fingerprint extraction model can be considered trained.
Here, 1 epoch means that the molecular fingerprint extraction model has been trained once on all sample molecules in the training set; 5 epochs means that this has been done 5 times. To ensure the stability of the training process, the learning rate is decayed once every preset number of steps (decay_step). The degree of decay is determined by the learning rate decay coefficient (decay), but the decay is bounded: the learning rate does not fall below the allowed minimum learning rate (min_lr). In addition, also for stability, the first derivatives of all training parameters are limited to the range [-clip_grad, clip_grad], where clip_grad is the gradient threshold used during training.
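The learning rate decay and gradient limits can be sketched as below. The staircase exponential form of the decay is an assumption (the text names the parameters but not the exact schedule), and all numeric values are illustrative rather than taken from Table 1.

```python
def decayed_lr(lr, step, decay_step, decay, min_lr):
    """Multiply lr by `decay` once every `decay_step` steps, floored at min_lr."""
    return max(min_lr, lr * decay ** (step // decay_step))


def clip_gradients(grads, clip_grad):
    """Limit each first derivative to the range [-clip_grad, clip_grad]."""
    return [max(-clip_grad, min(clip_grad, g)) for g in grads]


lr_now = decayed_lr(lr=1e-3, step=2000, decay_step=1000, decay=0.5, min_lr=1e-5)
lr_floor = decayed_lr(lr=1e-3, step=100000, decay_step=1000, decay=0.5, min_lr=1e-5)
clipped = clip_gradients([-3.0, 0.2, 7.5], clip_grad=1.0)
```

After 2000 steps the rate has been halved twice, and after enough steps it settles at `min_lr`; clipping leaves small gradients untouched and caps large ones at the threshold.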
The embodiment of the invention provides a molecular fingerprint extraction method, which can calculate the target training times of the whole training set by combining the change of the reconstruction loss value of the sample molecules calculated in the training process, and can consider that the training of a molecular fingerprint extraction model is finished when the training times of the training set reach the target training times. And training parameters in an encoder and a decoder in the training process can be adjusted according to the calculated reconstruction loss value, so that the accuracy and stability of the training of the molecular fingerprint extraction model are ensured, and the error of the model in the training process is reduced.
In particular, the representation of the training parameters and their meaning may be as shown in table 1 below:
TABLE 1
Figure BDA0002690074630000181
Specifically, the reconstruction loss values of the sample molecules and the sample recovery molecules can be calculated by the following formula:
Figure BDA0002690074630000182
wherein N represents the number of sample molecules in the training subset, L represents the length of the sample molecules, D represents the length of the preset character dictionary,
Figure BDA0002690074630000183
represents the preset label indicating whether the i-th position of the n-th sample recovery molecule corresponds to the sample character j, and
Figure BDA0002690074630000184
represents the probability that the sample character j occurs at the i-th position of the n-th sample recovery molecule.
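The quantities defined above (labels and probabilities indexed by molecule n, position i and dictionary entry j) match the shape of a categorical cross-entropy loss. The sketch below assumes that form and assumes averaging over the N molecules of the batch; the exact normalization in the source formula is contained in the image and is not recoverable here, and the function name `reconstruction_loss` is hypothetical.

```python
import math


def reconstruction_loss(labels, probs):
    """Cross-entropy over labels[n][i][j] and probs[n][i][j], averaged over n."""
    n_molecules = len(labels)
    total = 0.0
    for y_mol, p_mol in zip(labels, probs):          # molecules n = 1..N
        for y_pos, p_pos in zip(y_mol, p_mol):       # positions i = 1..L
            for y, p in zip(y_pos, p_pos):           # dictionary entries j = 1..D
                if y:
                    total -= y * math.log(p)
    return total / n_molecules


# One molecule, two positions, dictionary of two characters (toy values).
labels = [[[1, 0], [0, 1]]]
probs = [[[0.5, 0.5], [0.25, 0.75]]]
loss = reconstruction_loss(labels, probs)
```

The loss approaches zero as the probability assigned to each labelled character approaches 1, which corresponds to the sample recovery molecule matching the input sample molecule.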
As an optional embodiment of the present invention, the method for extracting a molecular fingerprint further includes:
firstly, acquiring a benchmarking molecule and its activity index value. In this embodiment, the benchmarking molecule may be a preset molecule with a desired potential activity, and the activity index value may be a potential activity index value of the benchmarking molecule; specifically, a benchmarking molecule is preset and its corresponding potential activity index value is determined.
Then, obtaining test molecules in the test set and their activity index values. In this embodiment, the test set may be the molecular set generated by randomly dividing the target molecule set and used to test the molecular fingerprint extraction model; each test molecule in the test set and its corresponding potential activity index value, that is, its activity index value, are obtained.
Then, generating a benchmarking molecular fingerprint and test molecular fingerprints according to the molecular fingerprint extraction model, the benchmarking molecule and the test molecules. In this embodiment, the molecular fingerprint extraction model is constructed by the method described in the above embodiment; the SMILES of the benchmarking molecule is input to the model, and the molecular fingerprint of the benchmarking molecule, that is, the benchmarking molecular fingerprint, is generated by the encoder in the model. The test molecular fingerprints are generated by a similar process, which is not described in detail here.
Then, calculating the similarity between the benchmarking molecule and each test molecule according to the benchmarking molecular fingerprint and the test molecular fingerprint. In this embodiment, the molecular structure similarity between the benchmarking molecule and the test molecule is calculated from the extracted benchmarking molecular fingerprint and test molecular fingerprint.
Then, calculating the correlation between the similarity of the test molecules to the benchmarking molecule and the activity index difference, according to the similarity and a preset Spearman correlation coefficient function. In this embodiment, the correlation between the structural similarity of the benchmarking molecule and the test molecules and the difference between their activity indexes can be determined from the calculated similarity and the difference between the activity index values of the benchmarking molecule and the test molecules.
And then, when the correlation degree is greater than a preset correlation degree threshold value, determining that the molecular fingerprint extraction model is effective. In this embodiment, when the correlation between the calculated molecular structure similarity and the difference between the molecular potential activity indexes is greater than a preset correlation threshold according to the molecular fingerprint extracted by the molecular fingerprint extraction model, it may be determined that the molecular fingerprint extraction model may be actually applied. Specifically, the preset correlation threshold may be a correlation average value calculated by using other conventional molecular fingerprints, and when the calculated correlation of the extracted molecular fingerprint is greater than the correlation calculated by using other conventional molecular fingerprints based on the method described in the above embodiment, it may be considered that the molecular fingerprint extraction model may be applied to practical applications, for example, to similarity screening of ligands.
The molecular fingerprint extraction model generated by the method described in the above embodiment is described in detail below with reference to a specific embodiment, and the performance comparison between the extracted molecular fingerprint and other conventional molecular fingerprints in activity correlation is performed.
The molecular fingerprint extracted by the method of the embodiment of the invention is referred to as DeepFP. The correlation between molecular structure similarity and activity is calculated from DeepFP and, for comparison, from the ECFP, ErG and MACCSKeys fingerprints; the calculation results are shown in Table 2:
TABLE 2
Figure BDA0002690074630000201
As shown in Table 2 above, the Spearman correlation coefficient between similarity and activity was calculated for DeepFP and three common fingerprints. As can be seen from the data in Table 2, the average Spearman correlation coefficient of DeepFP over the 301 targets in the test set was 0.43, which is higher than that of the other three fingerprints.
Specifically, 39 targets were randomly selected from the 301 targets in the test set and their Spearman correlation coefficients were visualized, as shown in FIG. 6. It can be seen from the figure that the DeepFP curve encloses the curves of the other three fingerprints for most targets, which also shows that DeepFP has better activity correlation than the other common molecular fingerprints.
Specifically, the similarity of the benchmarking molecule and the test molecule can be calculated by the following formula:
Figure BDA0002690074630000202
wherein similarity represents the similarity of the benchmarking molecule to the test molecule, $\mathrm{fps}_1$ represents the benchmarking molecular fingerprint, and $\mathrm{fps}_2$ represents the test molecular fingerprint;
the correlation is calculated by the following formula:
corr = spearman(similarity, |IC50_1 - IC50_2|),
wherein corr represents the correlation between the similarity of the test molecule to the benchmark molecule and the activity index difference, and spearman represents a preset Spearman correlation coefficient function.
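By way of non-limiting illustration, the correlation calculation can be sketched in Python as follows. The similarity formula itself is available here only as an image, so the sketch assumes cosine similarity between fingerprint vectors; this choice, the function names and the toy values are assumptions made for illustration and are not taken from the embodiment. The Spearman coefficient is computed in the standard way, as the Pearson correlation of the rank vectors, with tied values sharing their mean rank.

```python
import math


def cosine_similarity(fps1, fps2):
    """Cosine similarity of two fingerprint vectors (ASSUMED measure)."""
    dot = sum(a * b for a, b in zip(fps1, fps2))
    norm1 = math.sqrt(sum(a * a for a in fps1))
    norm2 = math.sqrt(sum(b * b for b in fps2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one input is constant
    return cov / math.sqrt(var_x * var_y)


def similarity_activity_correlation(ref_fp, ref_ic50, other_fps, other_ic50s):
    """corr = spearman(similarity, |IC50_1 - IC50_2|) against one reference."""
    similarities = [cosine_similarity(ref_fp, fp) for fp in other_fps]
    differences = [abs(ref_ic50 - value) for value in other_ic50s]
    return spearman(similarities, differences)


# Toy inputs for illustration only.
corr = similarity_activity_correlation(
    [1.0, 0.0], 10.0,
    [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]], [11.0, 15.0, 30.0],
)
```

In this toy input the more similar molecules have the smaller activity differences, so the coefficient is negative under the assumed similarity measure. The sign convention behind the reported values is not stated in the text.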
The embodiment of the invention provides a method for calculating correlation based on molecular fingerprints. As shown in fig. 7, the method comprises the following steps:
step S21: acquiring a benchmark molecule and a molecule to be detected, and extracting a benchmark molecular fingerprint and a molecular fingerprint to be detected from the benchmark molecule and the molecule to be detected, wherein both fingerprints are obtained by the molecular fingerprint extraction method of any of the above embodiments;
step S22: acquiring a first activity index value of the molecule to be detected and a second activity index value of the benchmark molecule;
step S23: calculating an activity index difference and the similarity between the benchmark molecule and the molecule to be detected according to the benchmark molecular fingerprint and the molecular fingerprint to be detected;
step S24: calculating a target correlation according to the activity index difference, the similarity, and a preset Spearman correlation coefficient function, wherein the target correlation represents the degree of correlation between the similarity of the molecule to be detected to the benchmark molecule and the activity index difference.
In the method for calculating correlation based on molecular fingerprints provided by the embodiment of the invention, a benchmark molecule and a molecule to be detected are acquired, and a benchmark molecular fingerprint and a molecular fingerprint to be detected are extracted from them; a first activity index value of the molecule to be detected and a second activity index value of the benchmark molecule are acquired; an activity index difference and the similarity between the benchmark molecule and the molecule to be detected are calculated; and a target correlation is calculated according to the activity index difference, the similarity, and a preset Spearman correlation coefficient function, wherein the target correlation represents the degree of correlation between the similarity of the molecule to be detected to the benchmark molecule and the activity index difference. Because a deep learning method is used to extract feature vectors from a large number of molecules to form the molecular fingerprints, the correlation between molecular similarity and activity difference can be improved.
Specifically, the similarity between the benchmark molecule and the molecule to be detected is calculated by the following formula:
Figure BDA0002690074630000211
wherein similarity represents the similarity between the benchmark molecule and the molecule to be detected, fps_1 represents the benchmark molecular fingerprint, and fps_3 represents the molecular fingerprint to be detected;
the correlation is calculated by the following formula:
corr = spearman(similarity, |IC50_1 - IC50_2|),
wherein corr represents the degree of correlation between the similarity of the molecule to be detected to the benchmark molecule and the activity index difference, and spearman represents a preset Spearman correlation coefficient function.
An embodiment of the present invention provides an apparatus for extracting a molecular fingerprint. As shown in fig. 8, the apparatus includes:
a to-be-detected molecule character acquisition module 31, configured to acquire a plurality of characters of a molecule to be detected; for details, refer to the description of step S11 in the above method embodiment;
a feature vector determination module 32, configured to respectively determine, according to the plurality of characters and a preset character dictionary, a feature vector corresponding to each character; for details, refer to the description of step S12 in the above method embodiment;
a first molecular fingerprint extraction module 33, configured to extract the molecular fingerprint of the molecule to be detected according to the feature vectors and the molecular fingerprint extraction model; for details, refer to the description of step S13 in the above method embodiment.
In the molecular fingerprint extraction apparatus provided by the invention, the to-be-detected molecule character acquisition module 31 acquires a plurality of characters of the molecule to be detected; the feature vector determination module 32 respectively determines the feature vector corresponding to each character according to the plurality of characters and a preset character dictionary; and the first molecular fingerprint extraction module 33 extracts the molecular fingerprint of the molecule to be detected according to the feature vectors and the molecular fingerprint extraction model. This solves the problem in the related art that molecular fingerprints determined from manually designed molecular features cannot describe the overall structure of a molecule. The key information of the molecule is captured more accurately and its structural information is described completely and comprehensively, so that ligand-based virtual screening becomes more accurate and efficient and the time required for virtual screening is effectively shortened.
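By way of non-limiting illustration, the character acquisition and feature vector determination performed by modules 31 and 32 can be sketched in Python as follows. The sketch assumes that the molecule is given as a SMILES-like string, that each single character is one token, and that the feature vector is a one-hot vector over the dictionary; these choices and all function names are assumptions made for illustration, and the embodiment may use a different character format or a learned embedding. The dictionary is built here from example strings, whereas the embodiment uses a preset one.

```python
def build_character_dictionary(molecule_strings):
    """Map every character seen in the strings to an integer index."""
    characters = sorted({ch for s in molecule_strings for ch in s})
    return {ch: idx for idx, ch in enumerate(characters)}


def characters_to_feature_vectors(molecule_string, char_dict):
    """Return one one-hot vector per character of the molecule string."""
    vectors = []
    for ch in molecule_string:
        if ch not in char_dict:
            raise KeyError(f"character {ch!r} is not in the dictionary")
        vec = [0] * len(char_dict)
        vec[char_dict[ch]] = 1
        vectors.append(vec)
    return vectors


# Stand-in for the preset character dictionary (built from two toy strings).
char_dict = build_character_dictionary(["CCO", "C=O"])
vectors = characters_to_feature_vectors("CCO", char_dict)
```

Splitting on single characters is a simplification: a two-letter element symbol such as "Cl" would be split into two tokens, so a practical tokenizer would need to treat such symbols as single entries.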
An embodiment of the present invention provides a device for calculating correlation based on molecular fingerprints. As shown in fig. 9, the device includes:
a second molecular fingerprint extraction module 41, configured to acquire a benchmark molecule and a molecule to be detected, and to extract a benchmark molecular fingerprint and a molecular fingerprint to be detected from the benchmark molecule and the molecule to be detected, wherein both fingerprints are obtained by the molecular fingerprint extraction method described in the foregoing embodiments; for details, refer to the description of step S21 in the above method embodiment;
an activity index value acquisition module 42, configured to acquire a first activity index value of the molecule to be detected and a second activity index value of the benchmark molecule; for details, refer to the description of step S22 in the above method embodiment;
a similarity calculation module 43, configured to calculate an activity index difference and the similarity between the benchmark molecule and the molecule to be detected according to the benchmark molecular fingerprint, the molecular fingerprint to be detected, the first activity index value, and the second activity index value; for details, refer to the description of step S23 in the above method embodiment;
a target correlation calculation module 44, configured to calculate a target correlation according to the activity index difference, the similarity, and a preset Spearman correlation coefficient function, wherein the target correlation represents the degree of correlation between the similarity of the molecule to be detected to the benchmark molecule and the activity index difference; for details, refer to the description of step S24 in the above method embodiment.
In the device for calculating correlation based on molecular fingerprints provided by the embodiment of the invention, the second molecular fingerprint extraction module 41 acquires a benchmark molecule and a molecule to be detected and extracts a benchmark molecular fingerprint and a molecular fingerprint to be detected from them; the activity index value acquisition module 42 acquires a first activity index value of the molecule to be detected and a second activity index value of the benchmark molecule; the similarity calculation module 43 calculates an activity index difference and the similarity between the benchmark molecule and the molecule to be detected; and the target correlation calculation module 44 calculates a target correlation according to the activity index difference, the similarity, and a preset Spearman correlation coefficient function, wherein the target correlation represents the degree of correlation between the similarity of the molecule to be detected to the benchmark molecule and the activity index difference. Because a deep learning method is used to extract feature vectors from a large number of molecules to form the molecular fingerprints, the correlation between molecular similarity and activity difference can be improved.
An embodiment of the present invention further provides a computer device. As shown in fig. 10, the computer device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or in another manner; fig. 10 takes connection by a bus as an example.
The processor 51 may be a Central Processing Unit (CPU). The processor 51 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof.
The memory 52 is a non-transitory computer-readable storage medium and can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the molecular fingerprint extraction method in the embodiment of the present invention (for example, the to-be-detected molecule character acquisition module 31, the feature vector determination module 32, and the first molecular fingerprint extraction module 33 shown in fig. 8, and the second molecular fingerprint extraction module 41, the activity index value acquisition module 42, the similarity calculation module 43, and the target correlation calculation module 44 shown in fig. 9). By running the non-transitory software programs, instructions and modules stored in the memory 52, the processor 51 executes its various functional applications and data processing, that is, implements the molecular fingerprint extraction method in the above method embodiment.
The memory 52 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 51, and the like. Further, the memory 52 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 52 may optionally include memory located remotely from the processor 51, and these remote memories may be connected to the processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 52 and, when executed by the processor 51, perform a method of extracting a molecular fingerprint as in the embodiment shown in fig. 1 or a method of calculating a correlation based on molecular fingerprints as in the embodiment shown in fig. 7.
The details of the computer device can be understood by referring to the corresponding related descriptions and effects in the embodiments shown in fig. 1 and fig. 7, and are not described herein again.
Optionally, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, where the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the method for extracting a molecular fingerprint or the method for calculating a correlation based on a molecular fingerprint described in any of the above embodiments, where the storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (15)

1. A method for extracting molecular fingerprints is characterized by comprising the following steps:
acquiring a plurality of characters of molecules to be detected;
respectively determining a feature vector corresponding to each character according to the characters and a preset character dictionary;
and extracting the molecular fingerprint of the molecule to be detected according to the characteristic vector and the molecular fingerprint extraction model.
2. The method for extracting molecular fingerprints according to claim 1, wherein extracting the molecular fingerprint of the molecule to be detected according to the feature vector and a molecular fingerprint extraction model comprises:
generating a hidden state of an initial character and an output state of an initial coding long-short term memory chain unit corresponding to the initial character according to a feature vector of the initial character and a preset input state;
generating a hidden state of the (n-1)th character and an output state of the (n-1)th coding long-short term memory chain unit corresponding to the (n-1)th character according to the feature vector corresponding to the (n-1)th character and the output state of the coding long-short term memory chain unit corresponding to the (n-2)th character, wherein n is greater than or equal to 3;
and generating the hidden state of the nth character and the molecular fingerprint of the molecule to be detected according to the feature vector corresponding to the nth character and the output state of the coding long-short term memory chain unit corresponding to the (n-1)th character.
3. The method for extracting molecular fingerprints according to claim 1, wherein the step of constructing the molecular fingerprint extraction model comprises:
acquiring a target molecule set, and dividing the target molecule set into a training set and a test set, wherein the training set comprises a plurality of training subsets;
obtaining a plurality of sample characters of a plurality of sample molecules in the training subset;
respectively determining sample feature vectors corresponding to the sample characters according to the sample characters and a preset character dictionary;
generating a hidden state of the initial sample character and an output state of an initial coding long-short term memory chain unit corresponding to the initial sample character according to a sample feature vector of the initial sample character and a preset input state;
generating a hidden state of the (n-1) th sample character and an output state of the (n-1) th coding long-short term memory chain unit corresponding to the (n-1) th sample character according to the sample feature vector corresponding to the (n-1) th sample character and the output state of the coding long-short term memory chain unit corresponding to the (n-2) th sample character, wherein n is more than or equal to 3;
generating a hidden state of the nth sample character and a molecular fingerprint of the sample molecule according to a sample feature vector corresponding to the nth sample character and an output state of a coding long-short term memory chain unit corresponding to the (n-1) th sample character;
obtaining an output state and an initial hidden state of an initial decoding long-short term memory chain unit according to the molecular fingerprint of the sample molecule and a preset starting identifier; generating an initial sampling character probability matrix according to the initial hidden state and the coding hidden state set; screening and generating initial sampling characters according to the initial sampling character probability matrix; the encoding hidden state set is used for representing the hidden states of the initial sample character until the set of the hidden states of the nth sample character;
obtaining the output state of the (n-1) th decoding long-short term memory chain unit and the (n-1) th hidden state according to the sampling feature vector corresponding to the (n-2) th sampling character and the output state of the (n-2) th decoding long-short term memory chain unit; generating an (n-1) th sampling character probability matrix according to the (n-1) th hidden state and the coding hidden state set; according to the n-1 sampling character probability matrix, screening to generate an n-1 sampling character, wherein n is more than or equal to 3;
generating a hidden state of the nth sample character according to a sample feature vector corresponding to the (n-1) th sample character and an output state of a decoding long-short term memory chain unit corresponding to the (n-1) th sample character, and generating an nth sampling character probability matrix according to the nth hidden state and a coding hidden state set; screening and generating an nth sampling character according to the nth sampling character probability matrix; generating sample recovery molecules according to the plurality of sampling characters;
and constructing the molecular fingerprint extraction model according to the sample molecules and the sample recovery molecules.
4. The method of claim 3, further comprising, prior to the step of obtaining the set of target molecules:
acquiring a molecular set in a preset database;
cleaning the molecular set according to preset conditions to generate a cleaned molecular set;
and converting the cleaned molecular set into a preset character format to generate a target molecular set.
5. The method of claim 4, wherein the nth sampling character probability matrix is calculated by the following equations:
Figure FDA0002690074620000031
Figure FDA0002690074620000032
Figure FDA0002690074620000033
wherein weight represents the weight of the set of encoded hidden states,
Figure FDA0002690074620000034
represents the t-th hidden state,
Figure FDA0002690074620000035
represents the hidden state of the i-th sample character, linear represents a linear function, and concat represents a concatenation function.
6. The method of claim 5, wherein the step of constructing the molecular fingerprint extraction model from the sample molecules and sample recovery molecules comprises:
calculating to obtain the reconstruction losses of the sample molecules and the sample recovery molecules according to the number of the sample molecules in the training subset, the length of each sample molecule, the length of a preset character dictionary and the feature data of each sample molecule, wherein the feature data is used for representing a preset label of a sampling character at any position of the sample recovery molecules and the occurrence probability of the sampling character at any position of the sample recovery molecules;
determining the target training times of a training set according to the reconstruction loss;
and when the training times of the training set reach the target training times, determining to generate a molecular fingerprint extraction model.
7. The method of claim 6, wherein the reconstruction loss of the sample molecules and the sample recovery molecules is calculated by the following formula:
Figure FDA0002690074620000036
wherein N represents the number of sample molecules in the training subset, L represents the length of a sample molecule, D represents the length of the preset character dictionary,
Figure FDA0002690074620000037
represents the preset label indicating that the i-th position of the sample recovery molecule corresponds to the sampling character j, and
Figure FDA0002690074620000038
represents the probability that the sampling character j occurs at the i-th position of the n-th sample recovery molecule.
8. The method of claim 7, further comprising:
obtaining a benchmark molecule and an activity index value thereof;
obtaining test molecules in the test set and activity index values thereof;
generating a benchmark molecular fingerprint and a test molecular fingerprint according to the molecular fingerprint extraction model, the benchmark molecule and the test molecules;
calculating the similarity between the benchmark molecule and the test molecule according to the benchmark molecular fingerprint, the activity index value of the benchmark molecule, the test molecular fingerprint and the activity index value of the test molecule;
calculating, according to the similarity and a preset Spearman correlation coefficient function, the degree of correlation between the similarity of the test molecule to the benchmark molecule and the activity index difference;
and when the correlation degree is greater than a preset correlation degree threshold value, determining that the molecular fingerprint extraction model is effective.
9. The method of claim 8, wherein the similarity between the benchmark molecule and the test molecule is calculated by the following formula:
Figure FDA0002690074620000041
wherein similarity represents the similarity between the benchmark molecule and the test molecule, fps_1 represents the benchmark molecular fingerprint, and fps_2 represents the test molecular fingerprint;
calculating the correlation by the following formula:
corr = spearman(similarity, |IC50_1 - IC50_2|),
wherein corr represents the correlation between the similarity of the test molecule to the benchmark molecule and the activity index difference, and spearman represents a preset Spearman correlation coefficient function.
10. A method for calculating correlation based on molecular fingerprints, comprising:
acquiring a benchmark molecule and a molecule to be detected, and extracting a benchmark molecular fingerprint and a molecular fingerprint to be detected according to the benchmark molecule and the molecule to be detected, wherein the benchmark molecular fingerprint and the molecular fingerprint to be detected are obtained according to the method for extracting molecular fingerprints of any one of claims 1 to 9;
acquiring a first activity index value of the molecule to be detected and a second activity index value of the benchmark molecule;
calculating an activity index difference and the similarity between the benchmark molecule and the molecule to be detected according to the benchmark molecular fingerprint and the molecular fingerprint to be detected;
and calculating a target correlation according to the activity index difference, the similarity and a preset Spearman correlation coefficient function, wherein the target correlation represents the degree of correlation between the similarity of the molecule to be detected to the benchmark molecule and the activity index difference.
11. The method of claim 10, wherein the similarity between the benchmark molecule and the molecule to be detected is calculated by the following formula:
Figure FDA0002690074620000051
wherein similarity represents the similarity between the benchmark molecule and the molecule to be detected, fps_1 represents the benchmark molecular fingerprint, and fps_3 represents the molecular fingerprint to be detected;
calculating the correlation by the following formula:
corr = spearman(similarity, |IC50_1 - IC50_2|),
wherein corr represents the degree of correlation between the similarity of the molecule to be detected to the benchmark molecule and the activity index difference, and spearman represents a preset Spearman correlation coefficient function.
12. An apparatus for extracting a molecular fingerprint, comprising:
a to-be-detected molecule character acquisition module, configured to acquire a plurality of characters of a molecule to be detected;
a feature vector determination module, configured to respectively determine a feature vector corresponding to each character according to the plurality of characters and a preset character dictionary;
and a first molecular fingerprint extraction module, configured to extract the molecular fingerprint of the molecule to be detected according to the feature vector and the molecular fingerprint extraction model.
13. A device for calculating correlation based on molecular fingerprints, comprising:
a second molecular fingerprint extraction module, configured to acquire a benchmark molecule and a molecule to be detected, and to extract a benchmark molecular fingerprint and a molecular fingerprint to be detected according to the benchmark molecule and the molecule to be detected, wherein the benchmark molecular fingerprint and the molecular fingerprint to be detected are obtained according to the method for extracting molecular fingerprints of any one of claims 1 to 9;
an activity index value acquisition module, configured to acquire a first activity index value of the molecule to be detected and a second activity index value of the benchmark molecule;
a similarity calculation module, configured to calculate an activity index difference and the similarity between the benchmark molecule and the molecule to be detected according to the benchmark molecular fingerprint and the molecular fingerprint to be detected;
and a target correlation calculation module, configured to calculate a target correlation according to the activity index difference, the similarity and a preset Spearman correlation coefficient function, wherein the target correlation represents the degree of correlation between the similarity of the molecule to be detected to the benchmark molecule and the activity index difference.
14. A computer device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the method for extracting molecular fingerprints as claimed in any one of claims 1 to 9 and the steps of the method for calculating correlations based on molecular fingerprints as claimed in claim 10 or 11.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for extracting a molecular fingerprint according to any one of claims 1 to 9 and the steps of the method for calculating a correlation based on molecular fingerprints according to claim 10 or 11.
CN202010988652.6A 2020-09-18 2020-09-18 Method and device for extracting molecular fingerprint and calculating correlation based on molecular fingerprint Active CN112201314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010988652.6A CN112201314B (en) 2020-09-18 2020-09-18 Method and device for extracting molecular fingerprint and calculating correlation based on molecular fingerprint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010988652.6A CN112201314B (en) 2020-09-18 2020-09-18 Method and device for extracting molecular fingerprint and calculating correlation based on molecular fingerprint

Publications (2)

Publication Number Publication Date
CN112201314A true CN112201314A (en) 2021-01-08
CN112201314B CN112201314B (en) 2024-05-03

Family

ID=74015642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010988652.6A Active CN112201314B (en) 2020-09-18 2020-09-18 Method and device for extracting molecular fingerprint and calculating correlation based on molecular fingerprint

Country Status (1)

Country Link
CN (1) CN112201314B (en)

Also Published As

Publication number Publication date
CN112201314B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN111967502B (en) Network intrusion detection method based on conditional variational autoencoder
JP6793774B2 (en) Systems and methods for classifying multidimensional time series of parameters
CN112639831A (en) Mutual information adversarial autoencoder
CN107609185B (en) Method, device, equipment and computer-readable storage medium for similarity calculation of POI
CN111526119B (en) Abnormal flow detection method and device, electronic equipment and computer readable medium
CN112148955A (en) Method and system for detecting abnormal time sequence data of Internet of things
US11537950B2 (en) Utilizing a joint-learning self-distillation framework for improving text sequential labeling machine-learning models
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
JP7257585B2 (en) Methods for Multimodal Search and Clustering Using Deep CCA and Active Pairwise Queries
CN112580346B (en) Event extraction method and device, computer equipment and storage medium
CN114386421A (en) Similar news detection method and device, computer equipment and storage medium
US20230154573A1 (en) Method and system for structure-based drug design using a multi-modal deep learning model
CN113140254A (en) Meta-learning drug-target interaction prediction system and prediction method
CN117115581A (en) Intelligent misoperation early warning method and system based on multi-mode deep learning
CN115983087A (en) Method for detecting time sequence data abnormity by combining attention mechanism and LSTM and terminal
CN112201314A (en) Method and device for extracting molecular fingerprints and calculating correlation degree based on molecular fingerprints
CN113779190B (en) Event causal relationship identification method, device, electronic equipment and storage medium
CN113569061A (en) Method and system for improving completion precision of knowledge graph
CN111161238A (en) Image quality evaluation method and device, electronic device, and storage medium
CN111161884A (en) Disease prediction method, device, equipment and medium for unbalanced data
CN113705092B (en) Disease prediction method and device based on machine learning
CN114862372A (en) Intelligent education data tamper-proof processing method and system based on block chain
CN115114345B (en) Feature representation extraction method, device, equipment, storage medium and program product
Stefanoiu et al. FORWAVER–A Wavelet-Based Predictor for Non Stationary Signals
WO2022214409A1 (en) System and method for searching time series data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant