CN112201314B

CN112201314B - Method and device for extracting molecular fingerprint and calculating correlation based on molecular fingerprint

Info

Publication number: CN112201314B
Application number: CN202010988652.6A
Authority: CN
Inventors: 李相彬; 周杰龙
Original assignee: Beijing Wangshi Intelligent Technology Co ltd
Current assignee: Beijing Wangshi Intelligent Technology Co ltd
Priority date: 2020-09-18
Filing date: 2020-09-18
Publication date: 2024-05-03
Anticipated expiration: 2040-09-18
Also published as: CN112201314A

Abstract

The invention discloses a method and a device for extracting molecular fingerprints and for calculating correlation based on the molecular fingerprints. The method for extracting a molecular fingerprint comprises: acquiring a plurality of characters of a molecule to be detected; determining a feature vector for each character according to the characters and a preset character dictionary; and extracting the molecular fingerprint of the molecule to be detected according to the feature vectors and a molecular fingerprint extraction model. The invention addresses the problem that molecular fingerprints determined from manually designed molecular features cannot describe the overall structure of a molecule, so that molecules with similar structures may still be unrelated in potential activity. Key feature information of the molecule is captured, accurate information on molecular activity correlation is obtained, and molecular similarity can be evaluated accurately, so that ligand-based virtual screening becomes more accurate and efficient and the time required for virtual screening is shortened.

Description

Method and device for extracting molecular fingerprint and calculating correlation based on molecular fingerprint

Technical Field

The invention relates to the field of data processing and analysis, in particular to a method and a device for extracting molecular fingerprints and calculating correlation based on the molecular fingerprints.

Background

Searching for molecules with potential activity, which may be termed HIT molecules, is a critical part of the drug design and discovery process. In general, pharmaceutical experts use computer-based techniques to assist the search for HIT molecules, and virtual screening is one of the most important of these techniques. Molecular fingerprints are typically used to determine the similarity between a reference ligand and a candidate ligand, that is, to carry out the virtual screening of molecules. A molecular fingerprint is an abstract representation of a molecule: the molecule is converted into a bit string, and bit strings are compared using various vector similarity measures.

The molecular fingerprints in the prior art are as follows: (1) Substructure-based molecular fingerprints, which set bits in a bit string according to the presence or absence of certain substructures or features from a given structure list; (2) Topological or path-based molecular fingerprints (Topological or Path Based Fingerprint), which analyze all molecular fragments on the paths starting from an atom until a specified number of bonds is reached, and hash the fragments on each path; (3) Circular molecular fingerprints (Circular Fingerprint), which take a heavy atom as the center, search for molecular fragments within a fixed radius, and hash the structural features of those fragments; (4) Pharmacophore fingerprints (Pharmacophore Fingerprint), which encode the structural features of a molecule in a manner similar to substructure-based fingerprints, together with the distances between the features, the distances being binned by range to generate a bit string.
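As an illustration of how a substructure-based bit string of type (1) can be compared, the following minimal sketch builds two bit lists from a hypothetical key list and scores them with the Tanimoto coefficient. The Tanimoto coefficient is a commonly used vector similarity measure and is assumed here; the text does not name a specific measure, and the key list, function names and feature sets are invented for illustration.

```python
def substructure_fingerprint(present, key_list):
    """Return a bit list: 1 where the key substructure is present, else 0."""
    return [1 if key in present else 0 for key in key_list]


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by all distinct on-bits."""
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    return both / either if either else 0.0


# Hypothetical structure list; real key sets are far larger.
KEYS = ["amine", "phenyl", "pyridine", "carboxyl", "hydroxyl"]

fp_ref = substructure_fingerprint({"amine", "phenyl", "pyridine"}, KEYS)
fp_cand = substructure_fingerprint({"amine", "phenyl", "hydroxyl"}, KEYS)
score = tanimoto(fp_ref, fp_cand)  # 2 shared bits out of 4 distinct bits
```

Such a score reflects only the hand-picked keys, which is the limitation of manually designed fingerprints discussed in this section.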

It follows that different molecular fingerprints have different implementations and emphasize different aspects, but in virtual screening the purpose of using molecular fingerprints is always to find molecules with relatively close activity. Existing molecular fingerprints are determined from manually designed molecular features and do not describe the overall structure of the molecule completely, so molecules with similar fingerprints and similar structures may still differ in potential activity.

Disclosure of Invention

Therefore, the technical problem to be solved by the invention is to overcome the defects of the prior art, in which the description of the overall structure of the molecule is incomplete and the selected molecules, although structurally similar, are not close in potential activity, and accordingly to provide a method and a device for extracting molecular fingerprints and calculating correlation based on the molecular fingerprints.

According to a first aspect, an embodiment of the present invention provides a method for extracting a molecular fingerprint, including: acquiring a plurality of characters of a molecule to be detected; respectively determining feature vectors corresponding to the characters according to the characters and a preset character dictionary; and extracting the molecular fingerprint of the molecule to be detected according to the feature vector and the molecular fingerprint extraction model.

With reference to the first aspect, in a first implementation manner of the first aspect, extracting the molecular fingerprint of the molecule to be detected according to the feature vector and the molecular fingerprint extraction model specifically includes: generating a hidden state of the initial character and an output state of the initial encoding long short-term memory (LSTM) chain unit corresponding to the initial character according to the feature vector of the initial character and a preset input state; generating a hidden state of the (n-1)-th character and an output state of the (n-1)-th encoding LSTM chain unit corresponding to the (n-1)-th character according to the feature vector corresponding to the (n-1)-th character and the output state of the encoding LSTM chain unit corresponding to the (n-2)-th character, wherein n is greater than or equal to 3; and generating the hidden state of the n-th character and the molecular fingerprint of the molecule to be detected according to the feature vector corresponding to the n-th character and the output state of the encoding LSTM chain unit corresponding to the (n-1)-th character.

With reference to the first aspect, in a second implementation manner of the first aspect, the step of constructing the molecular fingerprint extraction model includes: obtaining a target molecule set, and dividing the target molecule set into a training set and a test set, wherein the training set comprises a plurality of training subsets; acquiring a plurality of sample characters of a plurality of sample molecules in the training subset; determining a sample feature vector for each sample character according to the plurality of sample characters and the preset character dictionary; generating a hidden state of the initial sample character and an output state of the initial encoding long short-term memory (LSTM) chain unit corresponding to the initial sample character according to the sample feature vector of the initial sample character and a preset input state; generating a hidden state of the (n-1)-th sample character and an output state of the (n-1)-th encoding LSTM chain unit corresponding to the (n-1)-th sample character according to the sample feature vector corresponding to the (n-1)-th sample character and the output state of the encoding LSTM chain unit corresponding to the (n-2)-th sample character, wherein n is greater than or equal to 3; generating a hidden state of the n-th sample character and a molecular fingerprint of the sample molecule according to the sample feature vector corresponding to the n-th sample character and the output state of the encoding LSTM chain unit corresponding to the (n-1)-th sample character;

obtaining an output state and an initial hidden state of the initial decoding LSTM chain unit according to the molecular fingerprint of the sample molecule and a preset start identifier; generating an initial sampling character probability matrix according to the initial hidden state and the encoded hidden state set, wherein the encoded hidden state set is the set of hidden states from the initial sample character up to the n-th sample character; screening and generating an initial sampling character according to the initial sampling character probability matrix; obtaining the output state of the (n-1)-th decoding LSTM chain unit and the (n-1)-th hidden state according to the sampling feature vector corresponding to the (n-2)-th sampling character and the output state of the (n-2)-th decoding LSTM chain unit; generating an (n-1)-th sampling character probability matrix according to the (n-1)-th hidden state and the encoded hidden state set; screening and generating an (n-1)-th sampling character according to the (n-1)-th sampling character probability matrix, wherein n is greater than or equal to 3; generating the n-th hidden state according to the sampling feature vector corresponding to the (n-1)-th sampling character and the output state of the decoding LSTM chain unit corresponding to the (n-1)-th sampling character, and generating an n-th sampling character probability matrix according to the n-th hidden state and the encoded hidden state set; screening and generating an n-th sampling character according to the n-th sampling character probability matrix; generating a sample restoring molecule according to the plurality of sampling characters; and constructing the molecular fingerprint extraction model according to the sample molecules and the sample restoring molecules.

With reference to the second embodiment of the first aspect, in a third embodiment of the first aspect, before the step of obtaining the target molecule set, the method further includes: acquiring a molecular set in a preset database; cleaning the molecular set according to preset conditions to generate a cleaned molecular set; and converting the cleaned molecule set into a preset character format to generate a target molecule set.

With reference to the third implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the nth sampling character probability matrix is calculated by the following formula:

wherein weight represents the weight of the set of encoded hidden states, [symbol] represents the t-th hidden state, [symbol] represents the hidden state of the i-th sample character, linear represents a linear function, and concat represents a concatenation function.
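Because the formula itself is not reproduced in this text, the following is only a sketch of one attention-style computation that is consistent with the quantities named above (weight, the t-th hidden state, the hidden states of the sample characters, linear and concat). Dot-product scoring, the softmax normalisation, the parameters `W` and `b`, and the greedy choice of the most probable character are all assumptions made for illustration, not the patented formula.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_character_probabilities(h_t, encoder_states, W, b):
    """One decoding step: return (weight, probs).

    h_t            : decoder hidden state at step t
    encoder_states : hidden states of all sample characters (encoded set)
    W, b           : parameters of the linear layer (hypothetical)
    """
    # weight over the encoded hidden state set (dot-product scoring assumed)
    scores = [sum(a * e for a, e in zip(h_t, h_i)) for h_i in encoder_states]
    weight = softmax(scores)
    # weighted sum of the encoder hidden states
    context = [sum(w * h_i[k] for w, h_i in zip(weight, encoder_states))
               for k in range(len(h_t))]
    joined = h_t + context  # concat
    logits = [sum(w * v for w, v in zip(row, joined)) + bias
              for row, bias in zip(W, b)]  # linear
    return weight, softmax(logits)


h_t = [0.2, -0.1]
encoder_states = [[0.5, 0.1], [-0.3, 0.4], [0.0, 0.2]]
# Three rows: a toy dictionary of three characters.
W = [[0.1, 0.0, 0.2, -0.1], [0.0, 0.3, -0.2, 0.1], [0.2, 0.1, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]

weight, probs = attention_character_probabilities(h_t, encoder_states, W, b)
# "Screening" is assumed here to mean taking the most probable character.
next_char_index = max(range(len(probs)), key=probs.__getitem__)
```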

With reference to the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the step of constructing the molecular fingerprint extraction model according to the sample molecules and the sample restoring molecules includes: calculating the reconstruction loss between the sample molecules and the sample restoring molecules according to the number of sample molecules in the training subset, the length of each sample molecule, the length of the preset character dictionary and the feature data of each sample molecule, wherein the feature data represent the preset label of the sampling character at any position of the sample restoring molecule and the probability of occurrence of the sampling character at that position; determining a target number of training rounds for the training set according to the reconstruction loss; and when the number of training rounds on the training set reaches the target number, determining that the molecular fingerprint extraction model has been generated.

With reference to the fifth implementation manner of the first aspect, in a sixth implementation manner of the first aspect, the reconstruction loss between the sample molecules and the sample restoring molecules is calculated by the following formula:

wherein N represents the number of sample molecules in the training subset, L represents the length of the sample molecules, D represents the length of the preset character dictionary, [symbol] represents the preset label of the sampling character j at the i-th position of the n-th sample restoring molecule, and [symbol] represents the probability of occurrence of the sampling character j at the i-th position of the n-th sample restoring molecule.
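The quantities listed above (N molecules, L positions, D dictionary entries, a label and a probability for each position and character) match the ingredients of a categorical cross-entropy. The sketch below assumes that form and assumes averaging over N; the exact formula and its normalisation are not reproduced in this text, and the function name and the `eps` guard are illustrative.

```python
import math


def reconstruction_loss(labels, probs, eps=1e-12):
    """Mean cross-entropy over N molecules (assumed form).

    labels[n][i][j] : 1 if dictionary character j is the true character at
                      position i of molecule n, else 0
    probs[n][i][j]  : predicted probability of character j at that position
    """
    total = 0.0
    for y_n, p_n in zip(labels, probs):        # molecules 1..N
        for y_i, p_i in zip(y_n, p_n):         # positions 1..L
            for y, p in zip(y_i, p_i):         # dictionary entries 1..D
                total -= y * math.log(p + eps)  # eps guards against log(0)
    return total / len(labels)


labels = [[[1, 0], [0, 1]]]                # N=1, L=2, D=2
perfect = [[[1.0, 0.0], [0.0, 1.0]]]       # restored molecule equals the sample
uniform = [[[0.5, 0.5], [0.5, 0.5]]]       # decoder has learned nothing

loss_perfect = reconstruction_loss(labels, perfect)
loss_uniform = reconstruction_loss(labels, uniform)
```

A loss near zero means the sample restoring molecule reproduces the sample molecule character by character; the uniform case gives a loss of about 2 ln 2 for this two-position example.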

With reference to the sixth implementation manner of the first aspect, in a seventh implementation manner of the first aspect, the method further includes: obtaining a marker post molecule and its activity index value; obtaining test molecules in the test set and the activity index values of the test molecules; generating a marker post molecule fingerprint and test molecule fingerprints according to the molecular fingerprint extraction model, the marker post molecule and the test molecules; calculating the similarity between the marker post molecule and each test molecule according to the marker post molecule fingerprint and the test molecule fingerprint, and calculating the activity index difference according to the activity index value of the marker post molecule and the activity index value of the test molecule; calculating the correlation between the similarity and the activity index difference according to a preset Spearman correlation coefficient function; and when the correlation is greater than a preset correlation threshold, determining that the molecular fingerprint extraction model is valid.

With reference to the seventh implementation manner of the first aspect, in an eighth implementation manner of the first aspect, the similarity between the marker post molecule and the test molecule is calculated by the following formula:

wherein similarity represents the similarity between the marker post molecule and the test molecule, fps₁ represents the marker post molecule fingerprint, and fps₂ represents the test molecule fingerprint;

the correlation is calculated by the following formula:

corr = spearman(similarity, |IC50₁ - IC50₂|),

wherein corr represents the correlation between the similarity of the test molecule to the marker post molecule and the activity index difference, |IC50₁ - IC50₂| is the activity index difference, and spearman represents a preset Spearman correlation coefficient function.
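The validation step can be sketched as follows. The similarity formula is not reproduced in this text, so cosine similarity between fingerprint vectors is assumed purely for illustration. The Spearman coefficient is computed as the Pearson correlation of average ranks. All fingerprints and IC50 values below are made-up numbers, and the function names are hypothetical.

```python
import math


def cosine_similarity(fp_a, fp_b):
    """Cosine of the angle between two fingerprint vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(fp_a, fp_b))
    norm = math.sqrt(sum(a * a for a in fp_a)) * math.sqrt(sum(b * b for b in fp_b))
    return dot / norm if norm else 0.0


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            result[k] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


# Made-up marker post molecule and test molecules: (fingerprint, IC50).
marker_fp, marker_ic50 = [1.0, 0.0, 1.0], 10.0
test_set = [([1.0, 0.1, 0.9], 12.0), ([0.5, 0.5, 0.5], 40.0), ([0.0, 1.0, 0.0], 90.0)]

similarity = [cosine_similarity(marker_fp, fp) for fp, _ in test_set]
ic50_diff = [abs(marker_ic50 - ic50) for _, ic50 in test_set]
corr = spearman(similarity, ic50_diff)
```

In this toy data the most similar fingerprints have the smallest activity differences, so the two rankings are exactly opposite and the coefficient is negative. How the sign is treated when comparing against the preset threshold is not specified in this text.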

According to a second aspect, an embodiment of the present invention provides a method for calculating a correlation based on a molecular fingerprint, including: obtaining a marker post molecule and a molecule to be detected, and extracting a marker post molecule fingerprint and a fingerprint of the molecule to be detected, wherein both fingerprints are obtained by the method for extracting a molecular fingerprint according to any one of claims 1-9; acquiring a first activity index value of the molecule to be detected and a second activity index value of the marker post molecule; calculating an activity index difference from the first and second activity index values, and calculating the similarity between the marker post molecule and the molecule to be detected from the marker post molecule fingerprint and the fingerprint of the molecule to be detected; and calculating a target correlation according to the activity index difference, the similarity and a preset Spearman correlation coefficient function, wherein the target correlation represents the degree of correlation between the similarity of the molecule to be detected to the marker post molecule and the activity index difference.

With reference to the second aspect, in a first embodiment of the second aspect, the similarity between the marker post molecule and the molecule to be detected is calculated by the following formula:

wherein similarity represents the similarity between the marker post molecule and the molecule to be detected, fps₁ represents the marker post molecule fingerprint, and fps₃ represents the fingerprint of the molecule to be detected;

the correlation is calculated by the following formula:

corr = spearman(similarity, |IC50₁ - IC50₂|),

wherein corr represents the correlation between the similarity of the molecule to be detected to the marker post molecule and the activity index difference, |IC50₁ - IC50₂| is the activity index difference, and spearman represents a preset Spearman correlation coefficient function.

According to a third aspect, an embodiment of the present invention provides an extraction device for molecular fingerprints, including: the character acquisition module of the molecule to be detected is used for acquiring a plurality of characters of the molecule to be detected; the characteristic vector determining module is used for respectively determining characteristic vectors corresponding to the characters according to the characters and the preset character dictionary; and the first molecular fingerprint extraction module is used for extracting the molecular fingerprint of the molecule to be detected according to the feature vector and the molecular fingerprint extraction model.

According to a fourth aspect, an embodiment of the present invention provides a device for calculating a correlation based on a molecular fingerprint, including: a second molecular fingerprint extraction module, used for acquiring a marker post molecule and a molecule to be detected and extracting a marker post molecule fingerprint and a fingerprint of the molecule to be detected, wherein both fingerprints are obtained by the method for extracting a molecular fingerprint according to the first aspect or any implementation manner of the first aspect; an activity index value acquisition module, used for acquiring a first activity index value of the molecule to be detected and a second activity index value of the marker post molecule; a similarity calculation module, used for calculating an activity index difference from the first and second activity index values and calculating the similarity between the marker post molecule and the molecule to be detected from the marker post molecule fingerprint and the fingerprint of the molecule to be detected; and a target correlation calculation module, used for calculating a target correlation according to the activity index difference, the similarity and a preset Spearman correlation coefficient function, wherein the target correlation represents the degree of correlation between the similarity of the molecule to be detected to the marker post molecule and the activity index difference.

According to a fifth aspect, an embodiment of the present invention provides a computer device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the method for extracting a molecular fingerprint according to the first aspect or any implementation manner of the first aspect, and the steps of the method for calculating a correlation based on a molecular fingerprint according to the second aspect or the first implementation manner of the second aspect.

According to a sixth aspect, an embodiment of the present invention provides a computer readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method for extracting a molecular fingerprint according to the first aspect or any implementation manner of the first aspect, and the steps of the method for calculating a correlation based on a molecular fingerprint according to the second aspect or the first implementation manner of the second aspect.

The technical scheme of the invention has the following advantages:

The invention provides a method and a device for extracting molecular fingerprints and calculating correlation based on the molecular fingerprints. The method for extracting a molecular fingerprint comprises: acquiring a plurality of characters of a molecule to be detected; determining a feature vector for each character according to the characters and the preset character dictionary; and extracting the molecular fingerprint of the molecule to be detected according to the feature vectors and the molecular fingerprint extraction model. By implementing the invention, the problem that molecular fingerprints determined from manually designed molecular features cannot describe the overall structure of a molecule, so that structurally similar molecules may still be unrelated in potential activity, is solved: the higher the similarity of the molecular fingerprints, the higher the similarity of the potential activities of the molecules. In other words, key feature information of the molecule is learned, more accurate molecular activity correlation information is obtained, and molecular similarity can be evaluated accurately, so that ligand-based virtual screening becomes more accurate and efficient and the time required for virtual screening is shortened.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a specific example of a method for extracting molecular fingerprints according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a molecular structure in a method for extracting molecular fingerprints according to an embodiment of the present invention;

FIG. 3 is a schematic diagram showing the positions of the character feature vectors in a molecule after the molecule is converted into SMILES format in the method for extracting molecular fingerprints according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the structure of an encoder in constructing a molecular fingerprint extraction model in the molecular fingerprint extraction method according to the embodiment of the present invention;

FIG. 5 is a schematic diagram of the structure of an encoder and a decoder in a molecular fingerprint extraction model in the molecular fingerprint extraction method according to the embodiment of the present invention;

FIG. 6 is a graph showing the comparison of the application effects of molecular fingerprints in the method for extracting molecular fingerprints according to an embodiment of the present invention;

FIG. 7 is a flowchart of a specific example of a method for calculating correlation based on molecular fingerprints according to an embodiment of the present invention;

FIG. 8 is a schematic block diagram of a specific example of an extraction device for molecular fingerprints in an embodiment of the present invention;

FIG. 9 is a schematic block diagram of a specific example of a molecular fingerprint-based correlation computing device in an embodiment of the present invention;

FIG. 10 is a diagram showing a specific example of a computer device according to an embodiment of the present invention.

Detailed Description

The technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.

In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; the two components can be directly connected or indirectly connected through an intermediate medium, or can be communicated inside the two components, or can be connected wirelessly or in a wired way. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.

One of the most important problems in comparing the similarity between two molecules is the complexity of molecular characterization. To make molecules easier to compare computationally, some degree of simplification or abstraction is required. The embodiments of the invention provide a method and a device for extracting molecular fingerprints and calculating correlation based on the molecular fingerprints, which aim to capture the key information of a molecule more accurately, so that the similarity of the potential activities of molecules can be determined by comparing their molecular fingerprints, shortening virtual screening time and improving virtual screening efficiency.

The embodiment of the invention provides a method for extracting molecular fingerprints, which is shown in figure 1 and comprises the following steps:

Step S11: acquiring a plurality of characters of a molecule to be detected. In this embodiment, the molecule to be detected may be any molecule to be evaluated in a molecular database. The plurality of characters is generated by converting the molecule to be detected into a preset character format. Specifically, the molecule may be converted into the SMILES (Simplified Molecular-Input Line-Entry System) format, a character-based representation of molecular structure that can comprehensively represent the overall structural feature information of the molecule. For example, the molecule shown in fig. 2 may be written as CN(C)CCC(C1cccc1)C2ccccn2 after conversion to the SMILES format.
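Step S11 can be sketched as a plain character split of the SMILES string. This is a minimal illustration that assumes one character per token; the function name is hypothetical, and a practical tokenizer may need to keep two-letter element symbols such as Cl or Br together, which this sketch does not do.

```python
def smiles_to_characters(smiles):
    """Split a SMILES string into single characters, dropping any whitespace."""
    return [ch for ch in smiles if not ch.isspace()]


# Example string taken from the text of step S11.
smiles = "CN(C)CCC(C1cccc1)C2ccccn2"
characters = smiles_to_characters(smiles)
```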

Step S12: determining a feature vector for each character according to the characters and the preset character dictionary. In this embodiment, the preset character dictionary may be a database stored in advance that holds molecular characters and their corresponding feature vectors. As shown in fig. 3, the feature vector of a character may represent the specific position information of that character in the molecule. Specifically, after the molecule to be detected is converted into the SMILES format, its characters are arranged in conversion order and the feature vector corresponding to each character is looked up in the preset character dictionary.
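The dictionary lookup of step S12 can be sketched as below. The text does not specify how the feature vectors are encoded, so one-hot vectors indexed by a character dictionary are assumed here; the function names and the two example strings used to build the dictionary are illustrative.

```python
def build_character_dictionary(smiles_list):
    """Map every distinct character to an integer index (sorted for stability)."""
    alphabet = sorted({ch for s in smiles_list for ch in s})
    return {ch: idx for idx, ch in enumerate(alphabet)}


def to_feature_vectors(characters, dictionary):
    """Return one one-hot vector per character, in sequence order."""
    size = len(dictionary)
    vectors = []
    for ch in characters:
        vec = [0.0] * size
        vec[dictionary[ch]] = 1.0  # a KeyError means the character is not in the dictionary
        vectors.append(vec)
    return vectors


dictionary = build_character_dictionary(["CN(C)CCC", "c1ccccn1"])
vectors = to_feature_vectors(list("CCN"), dictionary)
```

The order of the vectors follows the order of the characters in the SMILES string, which is what allows a sequence model to consume them one at a time.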

Step S13: extracting the molecular fingerprint of the molecule to be detected according to the feature vectors and the molecular fingerprint extraction model. In this embodiment, the molecular fingerprint extraction model may be a model for extracting the molecular fingerprints of various molecules, and may be generated by training on training subsets drawn from a preset database. Each character of the molecule to be detected is acquired, the feature vector corresponding to each character is determined, and the feature vectors are input in sequence into the molecular fingerprint extraction model to extract the molecular fingerprint of the molecule to be detected.

The invention provides a method for extracting molecular fingerprints, which comprises the following steps: acquiring a plurality of characters of a molecule to be detected; determining a feature vector for each character according to the characters and the preset character dictionary; and extracting the molecular fingerprint of the molecule to be detected according to the feature vectors and the molecular fingerprint extraction model. By implementing the invention, the problem in the related art that molecular fingerprints determined from manually designed molecular features cannot describe the overall structure of a molecule is solved: the higher the similarity of the molecular fingerprints, the higher the similarity of the potential activities of the molecules. The key information of the molecule is captured more accurately and its structural information is described completely and comprehensively, so that ligand-based virtual screening becomes more accurate and efficient and the time required for virtual screening is shortened.

As an optional embodiment of the present invention, the step S13, extracting a molecular fingerprint of a molecule to be detected according to the feature vector and the molecular fingerprint extraction model, specifically includes:

Firstly, a hidden state of the initial character and an output state of the initial encoding long short-term memory (LSTM) chain unit corresponding to the initial character are generated according to the feature vector of the initial character and a preset input state. In this embodiment, the process by which the molecular fingerprint extraction model extracts the molecular fingerprint of the molecule to be detected may be implemented by an encoder, a schematic diagram of which is shown in fig. 4. Specifically, the encoder may be an LSTM chain containing a plurality of LSTM units, and the number of LSTM units may be determined according to the character length of the molecule to be detected. The feature vectors corresponding to the characters of the molecule to be detected are input in sequence into the corresponding LSTM units for encoding; that is, the input of the encoder may be the SMILES of the whole molecule to be detected and the preset initial input state. The feature vector of the initial character may be the feature vector corresponding to the first character of the molecule to be detected. The preset input state may be a preset initial state of the encoding LSTM units in the encoder. The hidden state of the initial character may be the hidden state of the first character of the molecule to be detected; a hidden state is one output of an LSTM unit and may carry information about the character and the characters preceding it. The output state of the initial encoding LSTM chain unit corresponding to the initial character may be the output state of the first LSTM unit in the encoder. That is, the inputs of the first LSTM unit are the feature vector corresponding to the first character of the molecule to be detected and the preset input state S0, and its outputs are the output state of the first LSTM unit and the hidden state of the first character of the molecule to be detected.

Then, according to the feature vector corresponding to the (n-1)-th character and the output state of the encoding LSTM chain unit corresponding to the (n-2)-th character, generating the hidden state of the (n-1)-th character and the output state of the (n-1)-th encoding LSTM chain unit corresponding to the (n-1)-th character, wherein n is greater than or equal to 3. In this embodiment, the encoder includes a plurality of LSTM units, and the 2nd to (n-1)-th LSTM units each operate as follows: the hidden state of the character at the position corresponding to the LSTM unit and the output state of that LSTM unit are generated according to the feature vector of that character and the output state of the previous LSTM unit.

And then, generating the hidden state of the n-th character and the molecular fingerprint of the molecule to be detected according to the feature vector corresponding to the n-th character and the output state of the encoding LSTM chain unit corresponding to the (n-1)-th character. In this embodiment, the last LSTM unit in the encoder performs encoding according to the feature vector corresponding to the last character of the molecule to be detected and the output state of the previous LSTM unit, and finally generates the hidden state of the last character of the molecule to be detected and the molecular fingerprint of the molecule to be detected.
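The encoder chain described in the three steps above can be illustrated with a minimal sketch. It assumes a single set of LSTM weights shared by every unit in the chain, randomly initialised placeholder weights rather than trained ones, a zero vector as the preset input state S0, and that the fingerprint is the final state of the chain (hidden part followed by cell part). None of these choices is specified in the text, and the names TinyLSTMCell and encode are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """One LSTM unit. Weights are random placeholders (assumption), not trained values."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix and bias per gate: input (i), forget (f), candidate (g), output (o).
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                      for _ in range(hidden_size)] for g in "ifgo"}
        self.b = {g: [0.0] * hidden_size for g in "ifgo"}

    def _gate(self, name, z):
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(self.w[name], self.b[name])]

    def step(self, x, state):
        """Consume one feature vector and the previous state; return (hidden, new_state)."""
        h_prev, c_prev = state
        z = list(x) + list(h_prev)
        i = [sigmoid(v) for v in self._gate("i", z)]
        f = [sigmoid(v) for v in self._gate("f", z)]
        g = [math.tanh(v) for v in self._gate("g", z)]
        o = [sigmoid(v) for v in self._gate("o", z)]
        c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h, (h, c)


def encode(feature_vectors, cell):
    """Run the chain over all characters; return per-character hidden states and the fingerprint."""
    # Preset input state S0, assumed here to be all zeros.
    state = ([0.0] * cell.hidden_size, [0.0] * cell.hidden_size)
    hidden_states = []
    for x in feature_vectors:
        h, state = cell.step(x, state)
        hidden_states.append(h)
    # Assumption: the fingerprint is the final state, hidden part followed by cell part.
    fingerprint = state[0] + state[1]
    return hidden_states, fingerprint
```

With a hidden size of 4, for instance, this sketch yields one 4-dimensional hidden state per character and an 8-dimensional fingerprint, whatever the length of the input SMILES.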

The embodiment of the invention provides a method for extracting a molecular fingerprint, which combines the plurality of LSTM units in the encoder with the molecule to be detected in SMILES format to obtain the hidden state of each character of the molecule to be detected and the molecular fingerprint of the molecule to be detected. When the molecular fingerprint of the molecule to be detected extracted in this embodiment is similar to the molecular fingerprint of the marker post molecule, other key index information of the molecule to be detected can also be expected to be similar to that of the marker post molecule, so that the time required for virtual screening based on ligand similarity can be shortened and the efficiency of virtual screening improved.

As an alternative embodiment of the present invention, the step of constructing the molecular fingerprint extraction model in the step S13 specifically includes:

Firstly, acquiring a target molecule set and dividing it into a training set and a test set, wherein the training set comprises a plurality of training subsets. In this embodiment, the target molecule set may be a set of molecules obtained by cleaning and screening the molecules in a preset database through preset steps and converting them into SMILES format, so that they meet the requirements for building the molecular fingerprint extraction model; the training set may be the subset of target molecules used to train the molecular fingerprint extraction model; the test set may be the set of molecules used to test the molecular fingerprint extraction model trained on the training set; the training set may include a plurality of training subsets, that is, a plurality of training batches, each batch containing a predetermined number of molecules.

Specifically, the target molecule set is randomly divided into a training set and a test set, and training can then be performed on the molecules in the training set using an initial model that integrates a preset language translation model (Sequence to Sequence, Seq2Seq) with an attention mechanism model (Attention), so as to generate the molecular fingerprint extraction model.
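The random division into a training set and a test set, and the grouping of the training set into batches, can be sketched as follows. The function name split_and_batch, the default test fraction and the default batch size are illustrative assumptions; the text does not state the split ratio.

```python
import random


def split_and_batch(smiles_list, test_fraction=0.1, batch_size=4, seed=0):
    """Randomly split molecules into training batches and a test set."""
    rng = random.Random(seed)
    shuffled = list(smiles_list)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    training_set = shuffled[n_test:]
    # Each training subset (batch) holds batch_size molecules;
    # the final batch may be smaller when the counts do not divide evenly.
    batches = [training_set[i:i + batch_size]
               for i in range(0, len(training_set), batch_size)]
    return batches, test_set
```

Fixing the seed makes the split reproducible, which is convenient when the same test set is reused to compare different fingerprints.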

Then, obtaining a plurality of sample characters of a plurality of sample molecules in the training subset. In this embodiment, the molecules in the training subset may be referred to as sample molecules. For each sample molecule, its SMILES is first acquired, that is, all sample characters of the sample molecule are acquired in sequence; for example, the characters of a sample molecule may be CN(C)CCC(C1cccc1)C2ccccn2.

Then, according to a plurality of sample characters and a preset character dictionary, respectively determining sample feature vectors corresponding to the sample characters; in this embodiment, sample feature vectors corresponding to sample characters are determined one by one according to a character dictionary stored in advance.
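The dictionary lookup can be illustrated with a short sketch. It assumes that each single SMILES character is one dictionary entry and that the feature vector is a one-hot vector over the dictionary; the text does not state the encoding, and the names build_dictionary and to_feature_vectors are hypothetical.

```python
def build_dictionary(smiles_list):
    """Map every distinct character in the SMILES strings to an index."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: idx for idx, ch in enumerate(chars)}


def to_feature_vectors(smiles, dictionary):
    """Return one one-hot feature vector per character (one-hot is an assumption)."""
    vectors = []
    for ch in smiles:
        vec = [0.0] * len(dictionary)
        vec[dictionary[ch]] = 1.0
        vectors.append(vec)
    return vectors
```

In this sketch the length of each feature vector equals the length of the character dictionary, and the number of vectors equals the number of characters in the molecule.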

Then, according to the sample feature vector of the initial sample character and a preset input state, generating a hidden state of the initial sample character and an output state of the initial encoding LSTM chain unit corresponding to the initial sample character. In this embodiment, as shown in fig. 5, the model used in training the molecular fingerprint extraction model may be divided into an encoder and a decoder. The encoder contains a plurality of encoding LSTM units, the number of which can be determined according to the number of characters of the sample molecule; for example, when the sample molecule has 15 characters, the encoder has 15 encoding LSTM units and, correspondingly, the decoder has 15 decoding LSTM units, one for each sample character of the sample molecule.

Specifically, in the encoder, the input of the first encoding LSTM unit may be the preset initial input state and the sample feature vector corresponding to the first sample character of the sample molecule; the first encoding LSTM unit generates its output state and the hidden state of the first sample character of the sample molecule.

Then, according to the sample feature vector corresponding to the (n-1)-th sample character and the output state of the encoding LSTM chain unit corresponding to the (n-2)-th sample character, generating the hidden state of the (n-1)-th sample character and the output state of the (n-1)-th encoding LSTM chain unit corresponding to the (n-1)-th sample character, wherein n is greater than or equal to 3. In this embodiment, when the sample molecule has 15 sample characters, the encoder has 15 LSTM units; for each of the 2nd to 14th encoding LSTM units, the input is the output state of the previous LSTM unit and the sample feature vector corresponding to the sample character at the corresponding position of the sample molecule, and the output is the output state of that LSTM unit and the hidden state of the sample character at the corresponding position of the sample molecule.

Then, according to the sample feature vector corresponding to the n-th sample character and the output state of the encoding LSTM chain unit corresponding to the (n-1)-th sample character, generating the hidden state of the n-th sample character and the molecular fingerprint of the sample molecule. In this embodiment, the input of the last LSTM unit in the encoder is the output state of the previous LSTM unit and the sample feature vector corresponding to the last sample character of the sample molecule, and its output is the molecular fingerprint of the sample molecule and the hidden state of the last sample character of the sample molecule.

Then, according to the molecular fingerprint of the sample molecule and a preset start identifier, obtaining the output state of the initial decoding LSTM chain unit and the initial hidden state; generating an initial sampling character probability matrix according to the initial hidden state and the encoding hidden state set; and selecting the initial sampling character according to the initial sampling character probability matrix. The encoding hidden state set is the set of hidden states from that of the initial sample character through that of the n-th sample character. In this embodiment, passing the sample molecule through the encoder generates the molecular fingerprint of the sample molecule and the encoder hidden state set covering each sample character of the sample molecule.

Illustratively, the first decoding LSTM unit generates its output state and the first hidden state according to the molecular fingerprint of the sample molecule generated by the encoder and the start identifier. The decoder comprises a plurality of decoding LSTM units, an attention layer and a linear layer. Specifically, in the attention layer, hidden states are selected and randomly combined according to the first hidden state output by the first decoding LSTM unit and the encoder hidden state set output by the encoder, to generate a linear matrix, namely the initial sampling character probability matrix, which represents the output probability of each sampling character; the first sampling character is obtained by sampling according to the initial sampling character probability matrix.

Specifically, once training of the molecular fingerprint extraction model is completed, the initial sampling character output by the initial decoding LSTM unit is consistent with the initial sample character of the sample molecule.

Then, according to the sampling feature vector corresponding to the (n-2)-th sampling character and the output state of the (n-2)-th decoding LSTM chain unit, obtaining the output state of the (n-1)-th decoding LSTM chain unit and the (n-1)-th hidden state; generating the (n-1)-th sampling character probability matrix according to the (n-1)-th hidden state and the encoding hidden state set; and selecting the (n-1)-th sampling character according to the (n-1)-th sampling character probability matrix, wherein n is greater than or equal to 3. In this embodiment, each of the 2nd to (n-1)-th decoding LSTM units generates its output state and hidden state, that is, the 2nd to (n-1)-th hidden states, according to the output state of the previous decoding LSTM unit and the sampling feature vector corresponding to the sampling character produced from the previous decoding LSTM unit. Generating the corresponding sampling characters from the 2nd to (n-1)-th hidden states is similar to generating the initial sampling character and is not described again.

Then, generating the n-th hidden state according to the sampling feature vector corresponding to the (n-1)-th sampling character and the output state of the decoding LSTM chain unit corresponding to the (n-1)-th sampling character; generating the n-th sampling character probability matrix according to the n-th hidden state and the encoding hidden state set; selecting the n-th sampling character according to the n-th sampling character probability matrix; and generating the sample restoration molecule from the plurality of sampling characters. In this embodiment, the input of the last decoding LSTM unit is the output state of the previous decoding LSTM unit and the sampling feature vector corresponding to the (n-1)-th sampling character, which was obtained through attention selection and the sampling character probability matrix; the last decoding LSTM unit generates its output state and the n-th hidden state. The last sampling character is then generated through the attention selection of the attention layer and the sampling character probability matrix of the linear layer. The sampling characters generated from the decoding LSTM units are arranged in sequence to form the sample restoration molecule in SMILES format.

Specifically, once training of the initial model integrating the language translation model and the attention mechanism model is completed, the molecular fingerprint extraction model is obtained; at this point, the sample molecule in SMILES format input to the initial model is consistent with the sample restoration molecule in SMILES format generated by the initial model.

Then, a molecular fingerprint extraction model is constructed according to the sample molecules and the sample restoring molecules. In this embodiment, a molecular fingerprint extraction model is generated from the input sample molecules and the sample restoration molecules generated by the initial model.

The embodiment of the invention provides a molecular fingerprint extraction method in which an initial model integrating a preset language translation model and an attention mechanism model recovers the SMILES from an input sample molecule in SMILES format, thereby analyzing and reconstructing the molecule. This enables the initial model to learn the key feature information underlying the molecule, so that a molecular fingerprint obtained by this method captures the key information of the molecule more accurately and describes its structural information completely and comprehensively; ligand-based virtual screening thus becomes more accurate and efficient, and the time required for virtual screening is effectively shortened.

As an optional embodiment of the present invention, before the step of obtaining the target molecule set, the method for extracting a molecular fingerprint further includes:

acquiring a molecule set from a preset database; cleaning the molecule set according to preset conditions to generate a cleaned molecule set; and converting the cleaned molecule set into a preset character format to generate the target molecule set. In this embodiment, the preset database may be a molecular library, for example the ChEMBL-25 database, which stores molecules that exist and can be prepared with high probability; cleaning the molecule set according to the preset conditions may comprise removing molecules that RDKit cannot parse, removing molecules with fewer than 10 or more than 50 heavy atoms, removing molecules with more than 65 bonds, removing molecules containing uncommon element types (the common element types being, for example, P, S, N, O, C, Cl, Br, F, I and H), and removing molecules containing uncommon bond types (the common bond types being single, double, triple and aromatic bonds); converting to the preset character format may mean converting all molecules of the cleaned molecule set into SMILES format.

Specifically, when the above molecule set is converted into SMILES format, the conversion is not unique, so the same molecule may be written as different SMILES strings, which would affect the training of the molecular fingerprint extraction model; the SMILES of each molecule is therefore normalized, and the normalized SMILES are de-duplicated. For example, the normalization procedure may assign a number to each atom in the molecule and then iteratively update the numbers according to the characteristics of each atom and its surrounding environment until no atom number changes. For the ChEMBL-25 database mentioned above, for instance, 1,607,036 molecules remain after this treatment.
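The cleaning and de-duplication rules can be sketched as a simple filter. The sketch assumes that each molecule has already been parsed by a cheminformatics toolkit into a record holding its normalized SMILES, heavy-atom count, bond count, element set and bond-type set; the MoleculeRecord type and both function names are hypothetical. It also reads the listed element and bond types as the permitted ones, which is an interpretation of the text.

```python
from dataclasses import dataclass

# Interpretation: these are the element and bond types a molecule may contain.
ALLOWED_ELEMENTS = {"P", "S", "N", "O", "C", "Cl", "Br", "F", "I", "H"}
ALLOWED_BONDS = {"single", "double", "triple", "aromatic"}


@dataclass(frozen=True)
class MoleculeRecord:
    """Hypothetical pre-parsed molecule; the fields are assumed to come from a toolkit."""
    canonical_smiles: str
    heavy_atoms: int
    bonds: int
    elements: frozenset
    bond_types: frozenset


def passes_cleaning(rec):
    """Apply the thresholds from the text: 10 to 50 heavy atoms, at most 65 bonds."""
    return (10 <= rec.heavy_atoms <= 50
            and rec.bonds <= 65
            and rec.elements <= ALLOWED_ELEMENTS
            and rec.bond_types <= ALLOWED_BONDS)


def build_target_set(records):
    """Keep cleaned molecules and drop duplicates of the same normalized SMILES."""
    seen = set()
    target = []
    for rec in records:
        if passes_cleaning(rec) and rec.canonical_smiles not in seen:
            seen.add(rec.canonical_smiles)
            target.append(rec.canonical_smiles)
    return target
```

De-duplicating on the normalized string only works because normalization maps every spelling of a molecule to the same SMILES, which is why the text performs normalization before de-duplication.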

The embodiment of the invention provides a method for extracting molecular fingerprints, which combines preset cleaning conditions to clean, convert formats, normalize and de-duplicate molecules in a preset database, so as to generate a target molecular set which can be used for training a molecular fingerprint extraction model. Clean and canonical molecular data can be provided for model training.

Specifically, the n-th sampling character probability matrix produced by the attention layer and the linear layer can be calculated by the following formula:

where weight represents the weight of each sample character in the encoding hidden state set; the two hidden-state symbols represent, respectively, the t-th hidden state output by the decoder and the hidden state of the i-th sample character output by the encoder LSTM unit; linear represents a linear function; and concat represents a concatenation function.
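One common way to combine the quantities named above (weight, decoder hidden state, encoder hidden states, concat and linear) is dot-product attention followed by a linear layer and a softmax. The sketch below shows that form as an assumption and is not a reproduction of the formula used in the invention; the function names and the parameters w_linear and b_linear are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(h_dec, enc_states, w_linear, b_linear):
    """Return (weight, char_probs) for one decoder position.

    h_dec      : t-th decoder hidden state
    enc_states : encoder hidden states, one per sample character
    w_linear   : matrix of shape [dictionary_size][len(context) + len(h_dec)]
    b_linear   : bias vector of length dictionary_size
    """
    # Assumption: the score of each encoder state is its dot product with h_dec.
    scores = [sum(a * b for a, b in zip(h_dec, h_enc)) for h_enc in enc_states]
    weight = softmax(scores)
    # Weighted combination of the encoder hidden states.
    dim = len(enc_states[0])
    context = [sum(w * h[k] for w, h in zip(weight, enc_states)) for k in range(dim)]
    # concat, then linear, then softmax to obtain character probabilities.
    joined = context + list(h_dec)
    logits = [sum(w * v for w, v in zip(row, joined)) + b
              for row, b in zip(w_linear, b_linear)]
    return weight, softmax(logits)
```

The sampling character at that position can then be chosen from char_probs, for example by taking the most probable dictionary entry.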

As an alternative embodiment of the present invention, the constructing a molecular fingerprint extraction model according to the sample molecule and the sample restoring molecule may specifically include:

Firstly, calculating the reconstruction loss between the sample molecules and the sample restoration molecules according to the number of sample molecules in a training subset, the length of each sample molecule, the length of the preset character dictionary and the feature data of each sample restoration molecule, wherein the feature data represent the preset label of the sampling character at each position of the sample restoration molecule and the occurrence probability of the sampling character at that position. In this embodiment, during training of the molecular fingerprint extraction model, the input sample molecule and the output sample restoration molecule are not always completely consistent, so the training process needs to be adjusted or optimized according to the error between the sample molecule and the sample restoration molecule.

Specifically, the training loss of restoring the sample molecules is calculated according to the number of sample molecules in each training batch, the character length of each sample molecule in SMILES format, the length of the character dictionary, the character label corresponding to the i-th character of the n-th sample restoration molecule, and the output probability corresponding to the i-th character of the n-th sample restoration molecule; that is, the training error arising when the sample molecules are analyzed and reconstructed can be calculated, and the training parameters in the encoder and the decoder can be optimized and adjusted according to this training error.

Then, determining target training times of the training set according to the change of the reconstruction loss in the training process; and when the training times of the training set reach the target training times, determining to generate a molecular fingerprint extraction model. Specifically, according to the calculated reconstruction loss value of the sample molecule, the target training times of the whole training set can be calculated, and when the training times of the training set reach the target training times, the molecular fingerprint extraction model can be considered to be trained.

Specifically, the training parameters in the encoder and decoder may be adjusted during training according to the calculated reconstruction loss as follows: using the adaptive moment estimation method (Adaptive Moment Estimation, Adam), an algorithm for first-order gradient-based optimization of stochastic objective functions, the training parameters are optimized according to the first-order gradient of the reconstruction loss function, and the update step size applied on top of the initial learning rate lr can be determined from the first-order and second-order moment estimates of that gradient.
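For reference, a single Adam update for one scalar parameter can be sketched as below. The default values of lr, beta1, beta2 and eps are the commonly used defaults of the Adam algorithm and are assumptions here; the values used in the invention are not given in the text, and the function name adam_step is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count. Returns (param, m, v)."""
    # First-order and second-order moment estimates of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero initialisation of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

In a full training loop the same update would be applied to every training parameter of the encoder and decoder, each with its own m and v.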

Specifically, when the model is trained on one training batch of sample molecules, the average reconstruction loss over all sample molecules of the batch is calculated according to the method described in the above embodiments; the number of sample molecules in a training batch is determined by batch_size. The first derivative of the average reconstruction loss with respect to the training parameters is then calculated, and all training parameters are updated using the step size derived from the initial learning rate (lr). Once the training set has been trained for the target number of training epochs (num_epochs), the molecular fingerprint extraction model can be considered trained.

Here, 1 epoch means that the molecular fingerprint extraction model has been trained once on all sample molecules in the training set, and 5 epochs means that this has been done 5 times. To keep the training process of the molecular fingerprint extraction model stable, the learning rate is decayed at an interval given by the preset learning rate decay step (decay_step); the degree of decay is determined by the learning rate decay coefficient (decay), but the decay is bounded below by the minimum allowed learning rate (min_lr). In addition, also for stability, the first derivatives of all training parameters are restricted to the interval [-clip_grad, clip_grad], where clip_grad is the gradient threshold used during training.
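The roles of decay_step, decay, min_lr and clip_grad can be illustrated as follows. The sketch assumes a staircase exponential schedule, in which the learning rate is multiplied by decay once every decay_step steps, and element-wise clipping of gradients; the text names the parameters but does not give the exact schedule, and both function names are hypothetical.

```python
def decayed_lr(lr, step, decay_step, decay, min_lr):
    """Learning rate after `step` updates, never below min_lr (staircase decay assumed)."""
    return max(min_lr, lr * decay ** (step // decay_step))


def clip_gradients(grads, clip_grad):
    """Restrict every gradient component to the interval [-clip_grad, clip_grad]."""
    return [max(-clip_grad, min(clip_grad, g)) for g in grads]
```

With lr = 0.001, decay = 0.5 and decay_step = 100, for example, this schedule halves the learning rate every 100 steps until it reaches min_lr.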

The embodiment of the invention provides a molecular fingerprint extraction method, which combines the change of a reconstruction loss value of sample molecules calculated according to a training process, can calculate the target training times of the whole training set, and can consider that the molecular fingerprint extraction model training is completed when the training times of the training set reach the target training times. The training parameters of the encoder and the decoder in the training process can be adjusted according to the calculated reconstruction loss value, so that the accuracy and the stability of the molecular fingerprint extraction model training are ensured, and the error generated in the training process of the model is reduced.

Specifically, the representation of each training parameter and its meaning may be as shown in table 1 below:

TABLE 1

Specifically, the reconstruction loss between the sample molecules and the sample restoration molecules can be calculated by the following formula:

where N represents the number of sample molecules in the training subset, L represents the length of a sample molecule, D represents the length of the preset character dictionary, and the two remaining symbols represent, respectively, the preset label of sampling character j at the i-th position of the n-th sample restoration molecule and the occurrence probability of sampling character j at the i-th position of the n-th sample restoration molecule.
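The quantities defined above (N molecules, L positions, D dictionary entries, a label and a probability for every position and character) match the usual form of a cross-entropy loss. The sketch below computes such a loss, assuming one-hot labels and averaging over the N molecules of the batch; the exact normalisation used in the invention is not recoverable from the text, and the function name and the eps guard are assumptions.

```python
import math


def reconstruction_loss(labels, probs, eps=1e-12):
    """Cross-entropy between labels[n][i][j] (one-hot) and probs[n][i][j].

    n indexes molecules, i positions in the molecule, j dictionary entries.
    Assumption: the sum is averaged over the number of molecules N.
    """
    total = 0.0
    for y_mol, p_mol in zip(labels, probs):
        for y_pos, p_pos in zip(y_mol, p_mol):
            for y, p in zip(y_pos, p_pos):
                # eps avoids log(0) when a probability is exactly zero.
                total -= y * math.log(p + eps)
    return total / len(labels)
```

The loss is close to zero when every position assigns probability 1 to the labelled character, which corresponds to the sample restoration molecule matching the sample molecule.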

As an optional embodiment of the present invention, the method for extracting a molecular fingerprint further includes:

Firstly, obtaining a marker post molecule and an activity index value thereof; in this embodiment, the marker post molecule may be a preset molecule with ideal potential activity; the activity index value may be a potential activity index value of a marker post molecule; specifically, a marker molecule is preset and the corresponding potential activity index value is determined.

Then, obtaining test molecules in the test set and activity index values of the test molecules; in this embodiment, the test set may be a molecular set generated by randomly dividing a target molecular set and used for testing the molecular fingerprint extraction model, and each test molecule in the test set and its corresponding potential activity index value, that is, an activity index value, are obtained.

Then, according to the molecular fingerprint extraction model, the marker post molecules and the test molecules, generating marker post molecular fingerprints and test molecular fingerprints; in this embodiment, by the method described in the foregoing embodiment, a molecular fingerprint extraction model is constructed, SMILES of the marker post molecule is input into the molecular fingerprint extraction model, and a molecular fingerprint of the marker post molecule, that is, a marker post molecular fingerprint, is generated through an encoder model in the molecular fingerprint extraction model; through similar processes, test molecular fingerprints are generated, which are not described in detail herein.

Then, according to the molecular fingerprints of the marker post and the molecular fingerprints of the test, calculating to obtain the similarity between the marker post and the test molecule; in this embodiment, the molecular structure similarity between the generated marker post molecule and the test molecule is calculated according to the extracted marker post molecule fingerprint and the test molecule fingerprint.

Then, calculating and obtaining the correlation between the similarity of the test molecules and the marker molecules and the difference value of the activity index according to the similarity and a preset spearman correlation coefficient function; in this embodiment, the correlation between the structural similarity of the marker post molecule and the test molecule and the difference value of the molecular activity index can be determined according to the calculated similarity and the difference value of the activity index of the marker post molecule and the test molecule.

And then, when the correlation is greater than a preset correlation threshold, determining that the molecular fingerprint extraction model is valid. In this embodiment, when the correlation between the molecular structure similarity, calculated from the molecular fingerprints extracted by the molecular fingerprint extraction model, and the difference in potential activity index is greater than the preset correlation threshold, the molecular fingerprint extraction model may be considered suitable for practical application. Specifically, the preset correlation threshold may be the average correlation calculated with other conventional molecular fingerprints; when the calculated correlation is greater than that obtained with the other conventional molecular fingerprints, the molecular fingerprint extraction model may be considered suitable for practical application, for example in ligand similarity screening.

The molecular fingerprint extraction model generated by the method described in the above embodiment will be described in detail with reference to a specific embodiment, and the performance comparison between the extracted molecular fingerprint and other conventional molecular fingerprints in the activity correlation will be described.

For the molecular fingerprint extracted by the method of the embodiment of the invention (DeepFP), the correlation between molecular structure similarity and activity is calculated; for comparison, the same correlation is calculated for ECFP fingerprints, ErG fingerprints and MACCS keys fingerprints. The calculation results are shown in Table 2:

TABLE 2

As shown in Table 2 above, the Spearman correlation coefficient between similarity and activity was calculated for DeepFP and for the three common fingerprints; the data in Table 2 show that the average Spearman correlation coefficient of DeepFP over the 301 targets of the test set is 0.43, which is higher than that of the other three fingerprints.

Specifically, 39 targets were randomly selected from the 301 targets in the test set and their Spearman correlation coefficients were visualized, as shown in fig. 6. Most of the DeepFP curve encloses the curves of the other three fingerprints, which indicates that DeepFP achieves a better activity correlation and better performance than the other common molecular fingerprints.

Specifically, the similarity of the marker molecules to the test molecules can be calculated by the following formula:

where similarity represents the similarity between the marker post molecule and the test molecule, fps₁ represents the marker post molecular fingerprint, and fps₂ represents the test molecular fingerprint;

The correlation is calculated by the following formula:

corr = spearman(similarity, |IC50₁ - IC50₂|),

where corr represents the correlation between the similarity of the test molecule to the marker post molecule and the difference in activity index, IC50₁ and IC50₂ represent the activity index values of the two molecules being compared, and spearman represents the preset Spearman correlation coefficient function.
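The correlation calculation can be sketched end to end. Because the similarity formula is not reproduced in the text, the sketch assumes cosine similarity between fingerprint vectors; the Spearman coefficient is computed as the Pearson correlation of average ranks. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Assumed similarity measure between two fingerprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def activity_correlation(ref_fp, ref_ic50, test_fps, test_ic50s):
    """corr = spearman(similarity, |IC50_1 - IC50_2|) over all test molecules."""
    sims = [cosine_similarity(ref_fp, fp) for fp in test_fps]
    diffs = [abs(ref_ic50 - v) for v in test_ic50s]
    return spearman(sims, diffs)
```

Note that, computed this way, a fingerprint whose similarity tracks activity gives a negative coefficient, since more similar molecules have smaller activity differences; the text does not state how the sign is handled when the coefficient is compared with a threshold.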

The embodiment of the invention provides a method for calculating the correlation degree based on molecular fingerprints, which is shown in fig. 7 and comprises the following steps:

Step S21: obtaining a marker post molecule and a molecule to be detected, and extracting a marker post molecule fingerprint and a molecule to be detected fingerprint according to the marker post molecule and the molecule to be detected, wherein the marker post molecule fingerprint and the molecule to be detected fingerprint are obtained according to the extraction method of the molecule fingerprint in any embodiment;

step S22: acquiring a first activity index value of a molecule to be detected and a second activity index value of a marker post molecule;

Step S23: calculating the activity index difference value from the first activity index value and the second activity index value, and calculating the similarity between the marker post molecule and the molecule to be detected according to the marker post molecular fingerprint and the fingerprint of the molecule to be detected;

Step S24: according to the difference value of the activity index, the similarity of the molecules to be detected and a preset spearman correlation coefficient function, calculating to obtain target correlation, wherein the target correlation is used for representing the correlation degree between the similarity of the molecules to be detected and the marker post molecules and the difference value of the activity index.

The embodiment of the invention provides a method for calculating the correlation based on molecular fingerprints, which comprises the following steps: obtaining a marker post molecule and a molecule to be detected, and extracting a marker post molecular fingerprint and a fingerprint of the molecule to be detected from them; obtaining a first activity index value of the molecule to be detected and a second activity index value of the marker post molecule; calculating the activity index difference value from the two activity index values and the similarity between the marker post molecule and the molecule to be detected from the two fingerprints; and calculating the target correlation according to the activity index difference value, the similarity and a preset Spearman correlation coefficient function, wherein the target correlation represents the degree of correlation between the similarity of the molecule to be detected to the marker post molecule and the activity index difference value. Because a deep learning method is used to extract feature vectors from a large number of molecules to form the molecular fingerprints, the correlation between molecular similarity and activity difference can be improved.

Specifically, the similarity between the marker post molecule and the molecule to be measured is calculated by the following formula:

Wherein similarity represents similarity between the marker post molecules and the molecules to be detected, fps ₁ represents marker post molecule fingerprints, and fps ₃ represents the molecules to be detected fingerprints;

The correlation is calculated by the following formula:

corr = spearman(similarity, |IC50₁ - IC50₂|),

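wherein corr represents the correlation between, on the one hand, the similarity of the molecule to be detected and the marker post molecule and, on the other hand, the activity index difference; spearman represents the preset Spearman correlation coefficient function; and IC50₁ and IC50₂ represent the first and second activity index values, i.e. the concentrations at which the molecule to be detected and the marker post molecule, respectively, reach 50% inhibition.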
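The calculation of steps S23 and S24 can be illustrated by the following minimal sketch. It is not the implementation of the embodiment. Because the similarity formula is not reproduced in this text, cosine similarity is used here purely as an assumption. The function names (`cosine_similarity`, `average_ranks`, `spearman`, `target_correlation`) are hypothetical, and it is assumed that one marker post molecule is compared against several molecules to be detected, since a rank correlation needs more than one pair of values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length fingerprint vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def target_correlation(marker_fp, marker_ic50, test_fps, test_ic50s):
    """corr = spearman(similarity, |IC50_1 - IC50_2|) over all molecules to be detected."""
    similarities = [cosine_similarity(marker_fp, fp) for fp in test_fps]
    differences = [abs(ic50 - marker_ic50) for ic50 in test_ic50s]
    return spearman(similarities, differences)
```

In this sketch a strongly negative value means that molecules more similar to the marker post molecule tend to have a smaller activity difference. The sketch does not handle the degenerate case in which all similarities or all differences are identical, where the coefficient is undefined.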

The embodiment of the invention provides a molecular fingerprint extraction device, as shown in fig. 8, which comprises:

A to-be-detected molecule character acquisition module 31, configured to acquire a plurality of characters of a molecule to be detected; for details, see the description of step S11 in the above method embodiment.

A feature vector determining module 32, configured to respectively determine the feature vectors corresponding to the characters according to the plurality of characters and a preset character dictionary; for details, see the description of step S12 in the above method embodiment.

A first molecular fingerprint extraction module 33, configured to extract the molecular fingerprint of the molecule to be detected according to the feature vectors and a molecular fingerprint extraction model; for details, see the description of step S13 in the above method embodiment.

The invention provides a molecular fingerprint extraction device, in which: the to-be-detected molecule character acquisition module 31 acquires a plurality of characters of the molecule to be detected; the feature vector determining module 32 respectively determines the feature vectors corresponding to the characters according to the plurality of characters and the preset character dictionary; and the first molecular fingerprint extraction module 33 extracts the molecular fingerprint of the molecule to be detected according to the feature vectors and the molecular fingerprint extraction model. By implementing the invention, the problem in the related art that a molecular fingerprint determined from manually designed molecular features cannot describe the whole structure of the molecule is solved; the key information of the molecule is captured more accurately and its structural information is described completely and comprehensively, so that ligand-based virtual screening becomes more accurate and efficient and the time required for virtual screening is effectively shortened.
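The work of modules 31 and 32 can be illustrated by the following minimal sketch, which is not the implementation of the embodiment. It assumes that the molecule is given as a character string (for example a SMILES-like string), that each single character is one dictionary entry, and that the feature vector is a one-hot vector. The function names (`get_characters`, `build_dictionary`, `to_feature_vectors`) are hypothetical. Multi-character atom symbols such as "Cl" would need a dedicated tokenizer and are not handled here.

```python
def get_characters(molecule_string):
    """Module 31 (sketch): split the molecule string into single characters."""
    return list(molecule_string)


def build_dictionary(molecule_strings):
    """Hypothetical helper: map every distinct character to an integer index."""
    alphabet = sorted({ch for s in molecule_strings for ch in s})
    return {ch: idx for idx, ch in enumerate(alphabet)}


def to_feature_vectors(characters, dictionary):
    """Module 32 (sketch): look up each character and emit a one-hot vector."""
    vectors = []
    for ch in characters:
        vec = [0] * len(dictionary)
        # Raises KeyError if the character is missing from the preset dictionary.
        vec[dictionary[ch]] = 1
        vectors.append(vec)
    return vectors
```

The resulting list of vectors is what module 33 would feed, one character at a time, into the molecular fingerprint extraction model.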

The embodiment of the invention provides a device for calculating correlation based on molecular fingerprints, as shown in fig. 9, the device comprising:

A second molecular fingerprint extraction module 41, configured to obtain a marker post molecule and a molecule to be detected, and to extract a marker post molecule fingerprint and a fingerprint of the molecule to be detected, where both fingerprints are obtained by the molecular fingerprint extraction method described in the above embodiment; for details, see the description of step S21 in the above method embodiment.

An activity index value obtaining module 42, configured to obtain a first activity index value of a molecule to be detected and a second activity index value of a marker post molecule; for details, see the description of step S22 in the above method embodiment.

The similarity calculation module 43 is configured to calculate an activity index difference value and a similarity between the marker post molecule and the molecule to be detected according to the marker post molecule fingerprint, the molecule to be detected fingerprint, the first activity index value and the second activity index value; for details, see the description of step S23 in the above method embodiment.

The target correlation calculation module 44 is configured to calculate a target correlation according to the activity index difference, the similarity of the molecule to be detected, and a preset spearman correlation coefficient function, where the target correlation is used to represent a degree of correlation between the similarity of the molecule to be detected and the marker post molecule and the activity index difference. For details, see the description of step S24 in the above method embodiment.

The embodiment of the invention provides a device for calculating correlation based on molecular fingerprints, in which: the second molecular fingerprint extraction module 41 obtains a marker post molecule and a molecule to be detected and extracts a marker post molecule fingerprint and a fingerprint of the molecule to be detected; the activity index value acquisition module 42 acquires a first activity index value of the molecule to be detected and a second activity index value of the marker post molecule; the similarity calculation module 43 calculates the activity index difference and the similarity between the marker post molecule and the molecule to be detected according to the two fingerprints and the two activity index values; and the target correlation calculation module 44 calculates the target correlation according to the activity index difference, the similarity, and a preset Spearman correlation coefficient function, the target correlation representing the degree of correlation between the similarity and the activity index difference. Because a deep learning method is used to extract feature vectors from a large number of molecules to form the molecular fingerprints, the correlation between molecular similarity and activity difference can be improved.

The embodiment of the present invention further provides a computer device, as shown in fig. 10, which may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or other means, and in fig. 10, the connection is exemplified by a bus.

The processor 51 may be a central processing unit (CPU). The processor 51 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof.

The memory 52, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the molecular fingerprint extraction method and the molecular-fingerprint-based correlation calculation method in the embodiments of the present invention (e.g., the to-be-detected molecule character acquisition module 31, the feature vector determining module 32, and the first molecular fingerprint extraction module 33 shown in fig. 8, and the second molecular fingerprint extraction module 41, the activity index value acquisition module 42, the similarity calculation module 43, and the target correlation calculation module 44 shown in fig. 9). By running the non-transitory software programs, instructions, and modules stored in the memory 52, the processor 51 executes its various functional applications and data processing, i.e., implements the methods of the above method embodiments.

Memory 52 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by the processor 51, etc. In addition, memory 52 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 52 may optionally include memory located remotely from processor 51, which may be connected to processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The one or more modules are stored in the memory 52 and when executed by the processor 51 perform the method of extracting a molecular fingerprint in the embodiment shown in fig. 1 or the method of calculating a correlation based on a molecular fingerprint in the embodiment shown in fig. 7.

The details of the above-mentioned computer device may be understood correspondingly with reference to the corresponding relevant descriptions and effects in the embodiments shown in fig. 1 and fig. 7, which are not repeated here.

Optionally, an embodiment of the present invention further provides a non-transitory computer-readable medium storing computer instructions for causing a computer to perform the method for extracting a molecular fingerprint or the method for calculating a correlation based on a molecular fingerprint described in any one of the above embodiments, where the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.

It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations or modifications based on the above teachings will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of protection of the invention.

Claims

1. A method for extracting a molecular fingerprint, comprising:

acquiring a plurality of characters of a molecule to be detected;

respectively determining feature vectors corresponding to the characters according to the characters and a preset character dictionary;

extracting the molecular fingerprint of the molecule to be detected according to the feature vector and the molecular fingerprint extraction model;

wherein the step of constructing the molecular fingerprint extraction model comprises:

obtaining a target molecule set, and dividing the target molecule set into a training set and a testing set, wherein the training set comprises a plurality of training subsets;

acquiring a plurality of sample characters of a plurality of sample molecules in the training subset;

respectively determining sample feature vectors corresponding to the sample characters according to the plurality of sample characters and the preset character dictionary;

generating a hidden state of the initial sample character and an output state of an initial coding long-short-period memory chain unit corresponding to the initial sample character according to a sample feature vector of the initial sample character and a preset input state;

generating a hidden state of the (n-1)th sample character and an output state of the (n-1)th coding long-short-period memory chain unit corresponding to the (n-1)th sample character according to the sample feature vector corresponding to the (n-1)th sample character and the output state of the coding long-short-period memory chain unit corresponding to the (n-2)th sample character, wherein n is greater than or equal to 3;

generating a hidden state of the nth sample character and a molecular fingerprint of the sample molecule according to the sample feature vector corresponding to the nth sample character and the output state of the coding long-short-period memory chain unit corresponding to the (n-1)th sample character;

obtaining an output state and an initial hidden state of the initial decoding long-short-period memory chain unit according to the molecular fingerprint of the sample molecule and a preset start identifier; generating an initial sampling character probability matrix according to the initial hidden state and the coding hidden state set; screening and generating an initial sampling character according to the initial sampling character probability matrix; wherein the coding hidden state set is the set of hidden states from the hidden state of the initial sample character to the hidden state of the nth sample character;

obtaining the output state of the (n-1)th decoding long-short-period memory chain unit and the (n-1)th hidden state according to the sampling feature vector corresponding to the (n-2)th sampling character and the output state of the (n-2)th decoding long-short-period memory chain unit; generating an (n-1)th sampling character probability matrix according to the (n-1)th hidden state and the coding hidden state set; screening and generating an (n-1)th sampling character according to the (n-1)th sampling character probability matrix, wherein n is greater than or equal to 3;

generating an nth hidden state according to the sampling feature vector corresponding to the (n-1)th sampling character and the output state of the decoding long-short-period memory chain unit corresponding to the (n-1)th sampling character, and generating an nth sampling character probability matrix according to the nth hidden state and the coding hidden state set; screening and generating an nth sampling character according to the nth sampling character probability matrix; generating a sample restoring molecule according to the plurality of sampling characters;

and constructing the molecular fingerprint extraction model according to the sample molecules and the sample restoring molecules.

2. The method for extracting molecular fingerprints according to claim 1, wherein extracting the molecular fingerprints of the molecules to be detected according to the feature vector and the molecular fingerprint extraction model comprises:

generating a hidden state of the initial character and an output state of an initial coding long-short-period memory chain unit corresponding to the initial character according to the feature vector of the initial character and a preset input state;

generating a hidden state of the (n-1)th character and an output state of the (n-1)th coding long-short-period memory chain unit corresponding to the (n-1)th character according to the feature vector corresponding to the (n-1)th character and the output state of the coding long-short-period memory chain unit corresponding to the (n-2)th character, wherein n is greater than or equal to 3;

and generating the hidden state of the nth character and the molecular fingerprint of the molecule to be detected according to the feature vector corresponding to the nth character and the output state of the coding long-short-period memory chain unit corresponding to the (n-1)th character.

3. The method of claim 2, further comprising, prior to the step of obtaining the set of target molecules:

acquiring a molecule set from a preset database;

cleaning the molecule set according to preset conditions to generate a cleaned molecule set;

and converting the cleaned molecule set into a preset character format to generate the target molecule set.

4. A method according to claim 3, wherein the nth sample character probability matrix is calculated by the formula:

wherein the symbols of the formula respectively represent: the weights of the coding hidden state set; the t-th hidden state output by the decoder; the hidden state of the i-th sample character, where 1 ≤ i ≤ L; a linear function; and a concatenation (stitching) function.
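The quantities named in claim 4 can be illustrated by the following minimal sketch. The formula itself is not reproduced in this text, so the sketch assumes a dot-product attention: the weights are a softmax over the dot products of the decoder hidden state with each coding hidden state, the weighted sum of the coding hidden states is concatenated with the decoder hidden state, and a linear function followed by a softmax yields the character probabilities. This particular form, the function names (`softmax`, `attention_probabilities`), and the parameters `weight` and `bias` are assumptions, not the formula of the claim.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_probabilities(decoder_state, encoder_states, weight, bias):
    """Return (character probabilities, attention weights) for one decoding step.

    decoder_state:  t-th hidden state output by the decoder, length d
    encoder_states: hidden states of the L sample characters, each of length d
    weight, bias:   parameters of the assumed linear function, shape C x 2d and C
    """
    # Weights of the coding hidden state set (assumed dot-product scoring).
    scores = [sum(h * s for h, s in zip(decoder_state, enc)) for enc in encoder_states]
    alphas = softmax(scores)
    # Context vector: weighted sum of the coding hidden states.
    dim = len(decoder_state)
    context = [sum(a * enc[d] for a, enc in zip(alphas, encoder_states)) for d in range(dim)]
    # Concatenation function followed by the linear function.
    joined = context + list(decoder_state)
    logits = [sum(w * x for w, x in zip(row, joined)) + b for row, b in zip(weight, bias)]
    return softmax(logits), alphas
```

Under these assumptions, the sampling character at step t would be selected from the returned probabilities, for example by taking the most probable dictionary entry.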

5. The method of claim 4, wherein the step of constructing the molecular fingerprint extraction model from the sample molecules and the sample restoring molecules comprises:

calculating the reconstruction loss between the sample molecules and the sample restoring molecules according to the number of sample molecules in the training subset, the length of each sample molecule, the length of the preset character dictionary, and the feature data of each sample molecule, wherein the feature data represent the preset label of the sampling character at any position of a sample restoring molecule and the occurrence probability of the sampling character at that position;

determining the target training times of the training set according to the reconstruction loss;

and when the training times of the training set reach the target training times, determining that the molecular fingerprint extraction model is generated.

6. The method of claim 5, wherein the reconstruction loss between the sample molecules and the sample restoring molecules is calculated by the following formula:

wherein the symbols of the formula respectively represent: the number of sample molecules in the training subset; the length of a sample molecule; the length of the preset character dictionary; the preset label of the sampling character at a given position of a given sample restoring molecule; and the occurrence probability of the sampling character at that position of that sample restoring molecule.
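The quantities named in claim 6 can be illustrated by the following minimal sketch. The formula itself is not reproduced in this text, so the sketch assumes a categorical cross-entropy summed over positions and dictionary entries and averaged over the sample molecules of the training subset. This form, the function name `reconstruction_loss`, and the clipping constant `eps` are assumptions, not the formula of the claim.

```python
import math


def reconstruction_loss(labels, probabilities, eps=1e-12):
    """Assumed cross-entropy between sample molecules and sample restoring molecules.

    labels[i][j][c]        preset label (0 or 1) of dictionary entry c at
                           position j of molecule i
    probabilities[i][j][c] occurrence probability predicted for the same entry
    """
    n_molecules = len(labels)
    total = 0.0
    for mol_labels, mol_probs in zip(labels, probabilities):
        for pos_labels, pos_probs in zip(mol_labels, mol_probs):
            for y, p in zip(pos_labels, pos_probs):
                if y:
                    # Clip the probability so that log(0) cannot occur.
                    total -= y * math.log(max(p, eps))
    return total / n_molecules
```

Under this assumption the loss is zero when every sampling character is restored with probability 1, and it grows as the predicted probability of the labelled character falls.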

7. The method as recited in claim 6, further comprising:

obtaining a marker post molecule and an activity index value thereof;

obtaining test molecules in the test set and activity index values of the test molecules;

generating a marker post molecule fingerprint and a test molecule fingerprint according to the molecular fingerprint extraction model, the marker post molecule and the test molecule;

calculating the similarity of the marker post molecule and the test molecule according to the marker post molecule fingerprint, the activity index value of the marker post molecule, the test molecule fingerprint and the activity index value of the test molecule;

calculating the correlation between the similarity of the test molecule and the marker post molecule and the activity index difference according to the similarity and a preset Spearman correlation coefficient function;

and when the correlation is greater than a preset correlation threshold, determining that the molecular fingerprint extraction model is effective.

8. The method of claim 7, wherein the similarity of the marker post molecule to the test molecule is calculated by the following formula:

wherein the symbols of the formula respectively represent: the similarity of the marker post molecule to the test molecule; the marker post molecule fingerprint; and the test molecule fingerprint;

the correlation is calculated by the following formula:

wherein the symbols of the formula respectively represent: the correlation between the similarity of the test molecule and the marker post molecule and the activity index difference; the preset Spearman correlation coefficient function; the concentration at which the test molecule reaches 50% inhibition; and the concentration at which the marker post molecule reaches 50% inhibition.

9. The method for calculating the correlation based on the molecular fingerprint is characterized by comprising the following steps of:

Obtaining a marker post molecule and a molecule to be detected, and extracting a marker post molecule fingerprint and a molecule to be detected fingerprint according to the marker post molecule and the molecule to be detected, wherein the marker post molecule fingerprint and the molecule to be detected fingerprint are obtained by the extraction method of the molecule fingerprint according to any one of claims 1-8;

Acquiring a first activity index value of the molecule to be detected and a second activity index value of the marker post molecule;

Calculating to obtain an activity index difference value and the similarity of the marker post molecule and the molecule to be detected according to the marker post molecule fingerprint and the molecule to be detected fingerprint;

and calculating a target correlation according to the activity index difference, the similarity, and a preset Spearman correlation coefficient function, wherein the target correlation represents the degree of correlation between the similarity of the molecule to be detected and the marker post molecule and the activity index difference.

10. The method of claim 9, wherein the similarity of the marker post molecule to the molecule to be detected is calculated by the following formula:

wherein the symbols of the formula respectively represent: the similarity of the marker post molecule and the molecule to be detected; the marker post molecule fingerprint; and the fingerprint of the molecule to be detected;

the correlation is calculated by the following formula:

wherein the symbols of the formula respectively represent: the correlation between the similarity of the molecule to be detected and the marker post molecule and the activity index difference; the preset Spearman correlation coefficient function; the concentration at which the molecule to be detected reaches 50% inhibition; and the concentration at which the marker post molecule reaches 50% inhibition.

11. An extraction device for molecular fingerprints, comprising:

the character acquisition module of the molecule to be detected is used for acquiring a plurality of characters of the molecule to be detected;

the characteristic vector determining module is used for respectively determining characteristic vectors corresponding to the characters according to the characters and the preset character dictionary;

the first molecular fingerprint extraction module is used for extracting the molecular fingerprint of the molecule to be detected according to the feature vector and the molecular fingerprint extraction model;

12. A molecular fingerprint-based correlation computing device, comprising:

the second molecular fingerprint extraction module is used for obtaining a marker post molecule and a molecule to be detected, and extracting a marker post molecule fingerprint and a fingerprint of the molecule to be detected according to the marker post molecule and the molecule to be detected, wherein the marker post molecule fingerprint and the fingerprint of the molecule to be detected are obtained by the molecular fingerprint extraction method according to any one of claims 1-8;

The activity index value acquisition module is used for acquiring a first activity index value of the molecule to be detected and a second activity index value of the marker post molecule;

the similarity calculation module is used for calculating and obtaining an activity index difference value and the similarity of the marker post molecule and the molecule to be detected according to the marker post molecule fingerprint and the molecule to be detected fingerprint;

the target correlation calculation module is used for calculating and obtaining target correlation according to the activity index difference value, the similarity of the molecules to be detected and a preset spearman correlation coefficient function, and the target correlation is used for representing the correlation degree between the similarity of the molecules to be detected and the marker post molecules and the activity index difference value.

13. A computer device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the method of extracting molecular fingerprints according to any one of claims 1 to 8 and the steps of the method of calculating correlation based on molecular fingerprints according to claim 9 or 10.

14. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method for extracting a molecular fingerprint according to any one of claims 1 to 8, and the steps of the method for calculating a correlation based on a molecular fingerprint according to claim 9 or 10.