CN114822726A - Construction method, analysis method, device, storage medium and computer equipment - Google Patents

Construction method, analysis method, device, storage medium and computer equipment

Info

Publication number
CN114822726A
CN114822726A (application CN202210534809.7A)
Authority
CN
China
Prior art keywords
sequence
compound
smiles
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210534809.7A
Other languages
Chinese (zh)
Inventor
Wang Jun (王俊)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210534809.7A
Publication of CN114822726A
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of chemistry and discloses a construction method, an analysis method, a device, a storage medium and computer equipment. The construction method comprises the following steps: obtaining SMILES sequences of a plurality of compound samples; splicing the SMILES sequences of the plurality of compound samples to determine spliced SMILES sequences; and training a BERT model on a training sample sequence to construct the pre-training model, wherein the training sample sequence comprises the SMILES sequences of the compound samples and the spliced SMILES sequences. With this method, the corresponding machine learning model is obtained by training the BERT model with supervision signals constructed from unlabeled sample data, which both saves model training time and gives the model strong generalization capability.

Description

Construction method, analysis method, device, storage medium and computer equipment
Technical Field
The present application relates to the field of chemical technology, and in particular, to a construction method, an analysis method, an apparatus, a storage medium, and a computer device.
Background
The compound SMILES (Simplified Molecular Input Line Entry System) is a line notation for compound molecules, that is, a specification that describes a molecular structure explicitly with ASCII character strings. In essence, a SMILES sequence is a linear symbol sequence formed by symbolizing the atoms, bonds and other information in a molecule according to naming rules and then arranging the symbols in a defined order. Because SMILES naming is unique and unambiguous and requires little storage space, SMILES is an ideal way to represent chemical structures in computers.
In the related art, a SMILES sequence can be fed directly into a sequence model such as an RNN, so that machine learning assists compound development. However, the choice of molecular descriptors often has a large impact on the performance of a machine learning model, and learning a strong representation requires a large amount of manually labeled data to define and optimize the learning target. Large-scale labeled data, especially labels obtained by experimental measurement of compounds, are often difficult to obtain, and labeling different data requires corresponding domain expertise or experimental equipment; in other words, complicated and time-consuming feature-labeling engineering is needed. Moreover, for conventional neural network models trained for downstream tasks on compound SMILES, the gap between the feature-extraction pre-training target and the whole-compound classification target of the downstream task is large, so the resulting model effect is not obvious.
Disclosure of Invention
In view of this, the present application provides a construction method, an analysis method, an apparatus, a storage medium and a computer device, which can train a BERT model with supervision signals constructed from unlabeled sample data to obtain a corresponding machine learning model, saving model training time while giving the model strong generalization capability.
In a first aspect, a method for constructing a pre-trained model of a compound expression is provided, which includes:
obtaining a SMILES sequence for a plurality of compound samples;
splicing SMILES sequences of a plurality of compound samples to determine spliced SMILES sequences;
training the BERT model according to the characteristic expression vector of the training sample sequence to construct a pre-training model, wherein the training sample sequence comprises a SMILES sequence and a splicing SMILES sequence of a plurality of compound samples.
In a second aspect, there is provided a compound assay method comprising:
acquiring a SMILES sequence of a target compound;
inputting the SMILES sequence of the target compound into a pre-training model constructed by the construction method of the pre-training model of the compound expression provided by the first aspect, and determining sequence information of the target compound, wherein the sequence information comprises a sequence relation prediction result and structural feature data;
and if the sequence prediction result is that the SMILES sequence of the target compound conforms to the chemical rule, inputting the structural feature data into a preset analysis task model to obtain an analysis task result of the target compound.
In a third aspect, there is provided an apparatus for constructing a pre-trained model of a compound expression, comprising:
an acquisition module for acquiring SMILES sequences of a plurality of compound samples;
the sample splicing module is used for splicing SMILES sequences of a plurality of compound samples and determining spliced SMILES sequences;
and the training module is used for training the BERT model according to the characteristic expression vector of the training sample sequence to construct a pre-training model, wherein the training sample sequence comprises a SMILES sequence and a spliced SMILES sequence of a plurality of compound samples.
In a fourth aspect, there is provided a compound analysis device comprising:
an acquisition module for acquiring a SMILES sequence of a target compound;
the feature extraction module is used for inputting the SMILES sequence of the target compound into the pre-training model constructed by the construction method of the pre-training model of the compound expression provided by the first aspect, and determining the sequence information of the target compound, wherein the sequence information comprises a sequence relation prediction result and structural feature data;
and the analysis module is used for inputting the structural characteristic data into a preset analysis task model to obtain an analysis task result of the target compound if the sequence prediction result is that the SMILES sequence of the target compound conforms to the chemical rule.
In a fifth aspect, there is provided a computer device comprising a storage medium, a processor and a computer program stored in the storage medium and executable on the processor, the processor implementing the steps of the method for constructing a pre-trained model of the above-described compound expression and/or the steps of the method for analyzing a compound when executing the computer program.
In a sixth aspect, there is provided a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for constructing a pre-trained model of a compound expression and/or the steps of the method for analyzing a compound.
In the scheme implemented by the above construction method, analysis method, apparatus, storage medium and computer device, part of the SMILES sequence of any compound sample is randomly replaced with part of the SMILES sequence of another compound sample by a splicing process, yielding a plurality of spliced SMILES sequences. The spliced SMILES sequences serve as negative samples and the original SMILES sequences of the compound samples serve as positive samples; both are input into a BERT (Bidirectional Encoder Representations from Transformers) model for training, with the BERT model's Next Sentence Prediction (NSP) task and Masked Language Model (MLM) task as training targets. The pre-training model obtained after training can therefore accurately distinguish whether a SMILES sequence conforms to chemical rules, while also extracting the structural features of the SMILES sequence. With the technical scheme provided by the embodiments of the present application, on the one hand, the model can be trained without labeling the training sample sequences used as sample data, so the model learns the general rules in the sample data; this greatly reduces the manpower and material resources needed for sample labeling and effectively reduces the training cost of the model. On the other hand, because the BERT model is used as the training framework of the pre-training model, the resulting pre-training model can efficiently compute and learn representation information of key compounds and capture the general structural rules in different SMILES sequence data, which gives the pre-training model the ability to fit an unlimited variety of downstream tasks. The learned pre-training model therefore generalizes well: when a specific downstream task needs to be solved, the pre-training model can simply be fine-tuned, avoiding training a brand-new model from scratch for every downstream task.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a method for constructing a pre-trained model of compound expression in an embodiment of the present application;
FIG. 2 is a schematic flow diagram of a method of analyzing a compound according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an apparatus for constructing a pre-trained model of compound expression in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a compound analysis apparatus in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, fig. 1 is a schematic flow chart of a method for constructing a pre-training model of a compound expression provided in an embodiment of the present application, and includes the following steps:
step 101, acquiring SMILES sequences of a plurality of compound samples;
specifically, a preset chemical structure conversion tool is used for converting a compound chemical structure file in a non-SMILES format into a SMILES sequence. The chemical structure file may include an image or text, for example, the text "ethane" corresponds to a SMILES sequence "CC", and the acetic acid chemical structure file corresponds to a SMILES sequence "CC (═ O) O". The preset chemical structure transformation tool is a SMILES sequence transformation tool that is conventional in the art, such as OpenBabel software, and the embodiment of the present application is not particularly limited.
Step 102, splicing SMILES sequences of a plurality of compound samples to determine a spliced SMILES sequence;
in this embodiment, a concatenated SMILES sequence in the form of sentence pairs is constructed by a concatenation process based on SMILES sequences of a plurality of compound samples. To facilitate learning compact information expressions for nodes of graph data by letting the BERT model learn the SMILES sequence (positive examples) and the stitched SMILES sequence (negative examples) that discriminate the original compound samples.
Further, in step 102, that is, performing a concatenation process on the SMILES sequences of the plurality of compound samples to determine a concatenated SMILES sequence, the method includes the following steps:
step 102-1, performing segmentation processing on the SMILES sequence of each compound sample, and determining a first subsequence and a second subsequence of each compound sample;
specifically, the slicing position of the slicing process may be set as needed, for example, at the 6 th character of the SMILES sequence, or the character interval position of the SMILES sequence adjacent to the central line.
Step 102-2, carrying out random replacement processing on the first subsequence or the second subsequence of each compound sample to obtain a spliced SMILES sequence.
For steps 102-1 to 102-2, the SMILES sequence of each compound sample is first segmented into two parts, a first subsequence and a second subsequence. Through the splicing process, the first subsequence or the second subsequence in the SMILES sequence of any compound sample is randomly replaced with the first or second subsequence from the SMILES sequence of another compound sample, thereby constructing spliced SMILES sequences in the form of sentence pairs. The spliced SMILES sequence serves as a perturbed negative sample of the original SMILES sequence, so that during model training the BERT model learns to distinguish whether a sample is an original SMILES sequence (positive sample) or a randomly replaced, spliced SMILES sequence (negative sample), and thus captures the most discriminative features in the training sample sequence.
It will be appreciated that the random replacement process may be a replacement of the first subsequence of the SMILES sequence of one compound sample with the first subsequence of the SMILES sequence of another compound sample or the second subsequence of the SMILES sequence of another compound sample. Similarly, the second subsequence of the SMILES sequence of one compound sample may be replaced with the first subsequence of the SMILES sequence of another compound sample or the second subsequence of the SMILES sequence of another compound sample.
Specifically, for example, the SMILES sequence of each original compound sample is kept unchanged with a probability of 0.5 and used as a positive sample; the true relationship label of a positive sample is 1. With a probability of 0.5, the unlabeled compound sample is instead cut into two fragments (a first subsequence and a second subsequence) at a random position between 1/3 and 1/2 of the sequence length, and a random replacement is then applied to obtain a perturbed negative sample corresponding to each compound sample; the true relationship label of a negative sample is 0. Taking caffeine and aspirin as examples, the SMILES sequence of caffeine is O=C1C2=C(N=CN2C)N(C(=O)N1C)C and the SMILES sequence of aspirin is CC(=O)OC1=CC=CC=C1C(=O)O. Segmenting the two sequences gives, for caffeine, the first subsequence O=C1C2=C(N=CN2C) and the second subsequence N(C(=O)N1C)C, and for aspirin, the first subsequence CC(=O)OC1=CC=CC= and the second subsequence C1C(=O)O. Replacing the second subsequence of caffeine with the second subsequence of aspirin yields the spliced SMILES sequence O=C1C2=C(N=CN2C)C1C(=O)O. By analogy, a large number of training samples can be obtained; the other forms of spliced SMILES sequences are not listed one by one. In this way, the spliced SMILES sequences together with the SMILES sequences of the compound samples are used as the training sample sequence, forcing the BERT model to learn the grammatical rules of compound sequences so that it can judge as correctly as possible whether a target compound conforms to chemical rules; a code sketch of this construction is given below.
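The following minimal Python sketch keeps a SMILES sequence unchanged with probability 0.5 (label 1) and otherwise splices in the tail of another compound's SMILES sequence (label 0); the function name, cut positions and random seed are illustrative assumptions, not values prescribed by the patent:

```python
import random

def build_pair_samples(smiles_list, keep_prob=0.5, seed=0):
    """Return (first_part, second_part, label) triples.
    label 1: both parts come from the same original SMILES (positive sample);
    label 0: the second part was replaced with a fragment of another
             compound's SMILES (spliced negative sample)."""
    assert len(smiles_list) >= 2, "need at least two compounds to splice"
    rng = random.Random(seed)
    samples = []
    for i, smi in enumerate(smiles_list):
        # cut at a random position between 1/3 and 1/2 of the sequence length
        cut = rng.randint(len(smi) // 3, max(len(smi) // 3, len(smi) // 2))
        first, second = smi[:cut], smi[cut:]
        if rng.random() < keep_prob:
            samples.append((first, second, 1))                 # original sequence
        else:
            j = rng.choice([k for k in range(len(smiles_list)) if k != i])
            other = smiles_list[j]
            other_cut = rng.randint(len(other) // 3, max(len(other) // 3, len(other) // 2))
            samples.append((first, other[other_cut:], 0))      # spliced sequence
    return samples

# caffeine and aspirin, as in the example above:
# build_pair_samples(["O=C1C2=C(N=CN2C)N(C(=O)N1C)C", "CC(=O)OC1=CC=CC=C1C(=O)O"])
```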
Step 103, training the BERT model according to the training sample sequence to construct a pre-training model.
Wherein the training sample sequence comprises the SMILES sequences of the plurality of compound samples and the spliced SMILES sequences. The BERT model is a general semantic representation model that uses the Transformer as its basic network component.
In this embodiment, the Next Sentence Prediction (NSP) task and the Masked Language Model (MLM) task of the BERT model are used as training targets, and the BERT model is trained with the spliced SMILES sequences and the original SMILES sequences of the compound samples as the training sample sequence. The NSP task is used to judge whether a training sample sequence is a positive sample that conforms to chemical rules, and the MLM task is used to extract structural feature data of the training sample sequence. In this way, pseudo labels can be created from large-scale unlabeled data and used as supervision signals for the NSP task, and the model is trained with these constructed supervision signals, so that potential features and information in the data can be learned effectively without a large amount of labeled data; this saves model training time and is more efficient. Moreover, the general pre-training model obtained in this way can be transferred to support an unlimited number of subsequent downstream tasks, which suits a large-scale, reproducible industrial development mode and broadens the range of model applications.
Further, in step 103, that is, training the BERT model according to the training sample sequence to construct a pre-training model, the method includes the following steps:
Step 103-1, performing dimensionality reduction processing on the training sample sequence, and determining a feature expression vector of the training sample sequence;
Step 103-2, under the constraint of the cross entropy loss function, training the BERT model according to the feature expression vector of the training sample sequence.
For steps 103-1 to 103-2, in order to reduce the amount of computation in the subsequent model training process, the training sample sequence is reduced to a lower dimension to obtain the feature representation vector corresponding to the training sample sequence. Then, with the NSP task and the MLM task of the BERT model as training targets, the BERT model is trained on the feature representation vectors of the training sample sequences, and the cross entropy loss function is used to drive the training of the BERT model to convergence, which enables fast training of the model and further improves training efficiency.
Further, in step 103-2, that is, under the constraint of the cross entropy loss function, training the BERT model according to the feature expression vector of the training sample sequence includes the following steps:
step 103-2-a, inputting the feature expression vector of the training sample sequence into a BERT model to generate a sequence relation prediction label of the training sample sequence;
the sequence relation prediction label is used for indicating whether each training sample sequence is a SMILES sequence of a plurality of compound samples.
It is worth mentioning that the training sample sequence after the dimension reduction processing is a linear sequence comprising two sentences; the two sentences are divided by a separator, and identifiers are added at the beginning and the end of the training sample sequence. Each training sample sequence has three embeddings: a position information embedding, a word embedding, and a sentence embedding. The position information embedding encodes position information, expressing word order as in natural language processing (NLP); the sentence embedding is the embedding item, corresponding to each word, that marks which of the two sentences of the training sample sequence the word belongs to. The three embeddings corresponding to the training sample sequence are superposed to form the input (feature representation vector) of the BERT model, as sketched below.
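The following minimal PyTorch sketch shows how the three embeddings can be superposed to form the model input; the vocabulary size, hidden dimension and maximum length are illustrative assumptions, not values given by the patent:

```python
import torch
import torch.nn as nn

class SmilesPairEmbedding(nn.Module):
    """Sum of word, position and sentence (segment) embeddings for a
    [CLS] first_part [SEP] second_part [SEP] training sample sequence."""
    def __init__(self, vocab_size=100, max_len=512, hidden=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)  # word embedding
        self.pos = nn.Embedding(max_len, hidden)     # position information embedding
        self.seg = nn.Embedding(2, hidden)           # sentence embedding (0 = first part, 1 = second part)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        # the three embeddings are superposed (summed) to form the BERT input
        return self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)
```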
Specifically, when the NSP task of the BERT model is pre-trained as a language model, the sentence relation is predicted, that is, it is determined whether the second sentence is the genuine successor of the first sentence. Either the two sentences (the first subsequence and the second subsequence) in the training sample sequence are truly connected in order, or the second sentence in the training sample sequence is a randomly selected fragment spliced after the first sentence. If the prediction is that the sentences are connected, the sequence relation prediction label IsNext is output; otherwise, the sequence relation prediction label NotNext is output.
Step 103-2-b, calculating a cross entropy loss function according to the sequence relation prediction label;
specifically, the cross entropy loss function is also called Softmax loss function, and the specific function is as follows:
Figure BDA0003647367100000081
wherein L is cross entropy loss Representing a cross entropy loss function, N representing the number of training sample sequences, t representing the number of training sample sequences, h yt A true relationship label, h, representing the training sample sequence t And C represents the number of classification tasks.
Step 103-2-c, if the cross entropy loss function converges, determining the BERT model as the pre-training model.
For steps 103-2-a to 103-2-c, a certain number of training sample sequences, containing both positive and negative samples, are randomly extracted. The extracted training sample sequences are input into the neural network for training, so that the BERT model learns to distinguish whether a training sample sequence is an original, unspliced SMILES sequence or a randomly replaced, spliced SMILES sequence, producing a sequence relation prediction label; in this way the BERT model learns to understand and represent compound SMILES sequences. The cross entropy loss function is then computed from the sequence relation prediction labels to measure the prediction accuracy of the BERT model. When the cross entropy loss function converges, indicating that the sequence relation prediction labels produced by the trained BERT model are consistent with the true relationship labels of the training sample sequences, training of the BERT model is complete, and the BERT model is output as the pre-training model for downstream tasks. Through this sentence-pair pre-training strategy, the representation information of key compounds can be computed and learned efficiently, the general structural rules in different SMILES sequence data are captured, and the model acquires the ability to fit an unlimited variety of downstream tasks. A minimal training-step sketch is given below.
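For illustration, the sketch below shows one training step for the sequence-relation (NSP-style) objective with a cross entropy loss; the encoder, the classification head, the optimizer and all dimensions are placeholders standing in for the BERT model, not the patent's actual implementation:

```python
import torch
import torch.nn as nn

# placeholder encoder standing in for the BERT model (assumption for illustration)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2)
nsp_head = nn.Linear(256, 2)     # 2 classes: original (1) vs. spliced (0)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(nsp_head.parameters()), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()  # the cross entropy (Softmax) loss function

def training_step(batch_vectors, relation_labels):
    """One step: predict whether each training sample sequence is an original
    or a spliced SMILES pair, and update the model with the cross entropy loss."""
    cls_vectors = encoder(batch_vectors)[:, 0, :]   # output at the leading identifier ([CLS]) position
    logits = nsp_head(cls_vectors)                  # sequence relation prediction
    loss = loss_fn(logits, relation_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()                              # training ends once this converges
```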
In the method for constructing a pre-training model of a compound expression, part of the SMILES sequence of any compound sample is randomly replaced with part of the SMILES sequence of another compound sample by the splicing process, yielding a plurality of spliced SMILES sequences. The spliced SMILES sequences serve as negative samples and the original SMILES sequences of the compound samples serve as positive samples; both are input into a BERT (Bidirectional Encoder Representations from Transformers) model for training, with the BERT model's Next Sentence Prediction (NSP) task and Masked Language Model (MLM) task as training targets. The pre-training model obtained after training can therefore accurately distinguish whether a SMILES sequence conforms to chemical rules, while also extracting the structural features of the SMILES sequence. With the technical scheme provided by the embodiments of the present application, on the one hand, the model can be trained without labeling the training sample sequences used as sample data, so the model learns the general rules in the sample data; this greatly reduces the manpower and material resources needed for sample labeling and effectively reduces the training cost of the model. On the other hand, because the BERT model is used as the training framework of the pre-training model, the resulting pre-training model can efficiently compute and learn representation information of key compounds and capture the general structural rules in different SMILES sequence data, which gives the pre-training model the ability to fit an unlimited variety of downstream tasks. The learned pre-training model therefore generalizes well: when a specific downstream task needs to be solved, the pre-training model can simply be fine-tuned, avoiding training a brand-new model from scratch for every downstream task.
In one embodiment, as shown in FIG. 2, there is provided a compound analysis method comprising the steps of:
step 201, acquiring a SMILES sequence of a target compound;
step 202, inputting the SMILES sequence of the target compound into a pre-training model, and determining the sequence information of the target compound;
the sequence information comprises a sequence relation prediction result and structural feature data. The pre-trained Model is capable of performing Next Sentence Prediction (NSP) tasks and bidirectional word masking (MLM) tasks. The NSP task is used for determining a sequence relation prediction result of a target compound, and the MLM task is used for extracting structural feature data of the target compound.
In this embodiment, the SMILES sequence of the target compound to be analyzed is input into a pre-configured pre-training model obtained by training the BERT model. Through the pre-training model, the sequence relation prediction result of the SMILES sequence of the target compound is determined and the structural feature data of the target compound are extracted, so that the required downstream task can be executed on the basis of the sequence relation prediction result and the structural feature data.
It can be understood that, in order to reduce the computation amount of the pre-training model, the dimension reduction processing may be performed on the SMILES sequence of the target compound to obtain the feature representation vector of the SMILES sequence of the target compound, and the feature representation vector of the SMILES sequence of the target compound is used as the input of the pre-training model.
Step 203, if the sequence prediction result is that the SMILES sequence of the target compound conforms to the chemical rules, inputting the structural feature data into a preset analysis task model to obtain an analysis task result of the target compound.
In this embodiment, if the sequence prediction result output by the pre-training model indicates that the SMILES sequence of the target compound conforms to chemical rules, that is, the SMILES sequence of the target compound is compliant, subsequent chemical analysis may be performed. The structural feature data are then input into a preset analysis task model, which analyzes the structural feature data of the target compound and determines whether the target compound meets the requirements of the analysis task, i.e. produces the analysis task result. In this way, whether the target compound meets the requirements is analyzed and verified automatically without manual intervention, and at the same time the SMILES sequence of the target compound is pre-processed by the pre-training model with strong generalization capability, providing reliable data support for subsequent specific analysis tasks.
Further, if the sequence prediction result is that the SMILES sequence of the target compound does not accord with the chemical rule, which indicates that the SMILES sequence may have errors, a prompt message is output to remind a user to verify in time.
Specifically, the tasks executed by the preset analysis task model may be a synthetic reaction prediction task, a toxicity prediction task, a compound activity prediction task, or the like, and may be reasonably set according to an actual application scenario, and the embodiment of the present application is not specifically limited.
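As an illustration of the analysis flow in steps 201 to 203, the following sketch assumes a pre-training model that returns a sequence relation prediction and structural feature data, and a downstream task model that consumes those features; both interfaces are assumptions made for illustration, not defined by the patent:

```python
import torch

def analyze_compound(smiles_vector, pretrain_model, task_model):
    """Run the downstream analysis only if the SMILES sequence of the target
    compound is judged to conform to chemical rules (single-sample sketch)."""
    with torch.no_grad():
        relation_logits, structural_features = pretrain_model(smiles_vector)
        conforms = relation_logits.argmax(dim=-1).item() == 1   # 1: conforms to chemical rules
    if not conforms:
        # prompt the user to verify the SMILES sequence in time
        return {"status": "invalid", "message": "SMILES sequence may contain errors"}
    task_result = task_model(structural_features)                # e.g. toxicity or activity prediction
    return {"status": "ok", "result": task_result}
```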
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In an embodiment, a device for constructing a pre-training model of a compound expression is provided, where the device for constructing a pre-training model of a compound expression corresponds to the method for constructing a pre-training model of a compound expression in the above embodiment one to one. As shown in fig. 3, the apparatus for constructing the pre-training model of the compound expression includes an obtaining module 301, a sample splicing module 302, and a training module 303. The functional modules are explained in detail as follows:
the obtaining module 301 is configured to obtain a SMILES sequence of a plurality of compound samples; the sample splicing module 302 is configured to splice SMILES sequences of a plurality of compound samples, and determine to splice the SMILES sequences; the training module 303 is configured to train the BERT model according to the feature expression vector of the training sample sequence, and construct a pre-training model, where the training sample sequence includes a SMILES sequence and a spliced SMILES sequence of a plurality of compound samples.
In an embodiment, the sample stitching module 302 is specifically configured to perform segmentation processing on the SMILES sequence of each compound sample, and determine a first subsequence and a second subsequence of each compound sample; and carrying out random replacement treatment on the first subsequence or the second subsequence of each compound sample to obtain a spliced SMILES sequence.
In one embodiment, the apparatus for constructing the pre-trained model of the compound expression further comprises: a dimension reduction module (not shown in the figure) for performing dimension reduction processing on the training sample sequence to determine a feature expression vector of the training sample sequence; the training module 303 is specifically configured to train the BERT model according to the feature expression vector of the training sample sequence under the constraint of the cross entropy loss function.
In an embodiment, the training module 303 is specifically configured to input the feature representation vector of the training sample sequence into the BERT model and generate a sequence relation prediction label of the training sample sequence, where the sequence relation prediction label is used to indicate whether each training sample sequence is an original SMILES sequence of the plurality of compound samples; calculate the cross entropy loss function according to the sequence relation prediction label; and, if the cross entropy loss function converges, determine the BERT model as the pre-training model.
In one embodiment, the cross entropy loss function is as follows:
L_{cross\ entropy\ loss} = -\frac{1}{N}\sum_{t=1}^{N}\log\frac{e^{h_{y_t}}}{\sum_{c=1}^{C}e^{h_{t,c}}}

wherein L_{cross entropy loss} represents the cross entropy loss function, N represents the number of training sample sequences, t indexes the training sample sequences, h_{y_t} represents the prediction score corresponding to the true relationship label of training sample sequence t, h_{t,c} represents the sequence relation prediction score of training sample sequence t for class c, and C represents the number of classification categories.
The application provides an apparatus for constructing a pre-training model of a compound expression, in which part of the SMILES sequence of any compound sample is randomly replaced with part of the SMILES sequence of another compound sample by a splicing process, yielding a plurality of spliced SMILES sequences. The spliced SMILES sequences serve as negative samples and the original SMILES sequences of the compound samples serve as positive samples; both are input into a BERT (Bidirectional Encoder Representations from Transformers) model for training, with the BERT model's Next Sentence Prediction (NSP) task and Masked Language Model (MLM) task as training targets. The pre-training model obtained after training can therefore accurately distinguish whether a SMILES sequence conforms to chemical rules, while also extracting the structural features of the SMILES sequence. With the technical scheme provided by the embodiments of the present application, on the one hand, the model can be trained without labeling the training sample sequences used as sample data, so the model learns the general rules in the sample data; this greatly reduces the manpower and material resources needed for sample labeling and effectively reduces the training cost of the model. On the other hand, because the BERT model is used as the training framework of the pre-training model, the resulting pre-training model can efficiently compute and learn representation information of key compounds and capture the general structural rules in different SMILES sequence data, which gives the pre-training model the ability to fit an unlimited variety of downstream tasks. The learned pre-training model therefore generalizes well: when a specific downstream task needs to be solved, the pre-training model can simply be fine-tuned, avoiding training a brand-new model from scratch for every downstream task.
The various modules in the apparatus for constructing the pre-trained model of the above-described compound expression may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a compound analysis apparatus is provided, which corresponds to the compound analysis method in the above embodiments one to one. As shown in fig. 4, the compound analysis apparatus includes an acquisition module 401, a feature extraction module 402, and an analysis module 403. The functional modules are explained in detail as follows:
the acquiring module 401 is configured to acquire a SMILES sequence of a target compound; the feature extraction module 402 is configured to input the SMILES sequence of the target compound into the pre-training model constructed by the method for constructing the pre-training model of the compound expression provided in the first aspect, and determine sequence information of the target compound, where the sequence information includes a sequence relationship prediction result and structural feature data; the analysis module 403 is configured to, if the sequence prediction result is that the SMILES sequence of the target compound conforms to the chemical rule, input the structural feature data into the preset analysis task model to obtain an analysis task result of the target compound.
For the specific limitations of the compound analysis device, reference may be made to the limitations of the compound analysis method described above, which are not repeated herein. The respective modules in the above-described compound analysis apparatus may be entirely or partially implemented by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, the processor implementing the steps of the method for constructing a pre-trained model of the above-described compound expression and/or the steps of the method for analyzing a compound when executing the computer program.
The computer device includes a processor, a storage medium, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes non-volatile and/or volatile storage media, internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The network interface of the computer device is used for communicating with an external client through a network connection. The computer program is executed by a processor to implement the steps of a method of constructing a pre-trained model of a compound expression and/or the steps of a method of analyzing a compound.
It will be appreciated by those skilled in the art that the present embodiment provides a computer device architecture that is not limiting of the computer device, and that may include more or fewer components, or some components in combination, or a different arrangement of components.
In one embodiment, a readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of constructing a pre-trained model of the above-described compound expression and/or the steps of the method of analyzing a compound.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above.
Any reference to a storage medium, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile storage media. In particular, a non-volatile storage medium may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, the functions or steps that can be realized by the readable storage medium or the computer device described above may correspond to the steps of the method for constructing the pre-training model of the compound expression and/or the method for analyzing the compound in the foregoing method embodiments, and are not described in detail here to avoid repetition.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of constructing a pre-trained model of a compound expression, comprising:
obtaining a SMILES sequence for a plurality of compound samples;
performing splicing treatment on SMILES sequences of the plurality of compound samples, and determining a spliced SMILES sequence;
training a BERT model according to a training sample sequence to construct the pre-training model, wherein the training sample sequence comprises SMILES sequences of the compound samples and the spliced SMILES sequence.
2. The method for constructing a pre-training model of a compound expression according to claim 1, wherein the determining a stitched SMILES sequence by stitching SMILES sequences of the plurality of compound samples comprises:
performing segmentation processing on the SMILES sequence of each compound sample, and determining a first subsequence and a second subsequence of each compound sample;
and carrying out random replacement treatment on the first subsequence or the second subsequence of each compound sample to obtain the spliced SMILES sequence.
3. The method for constructing the pre-training model of the compound expression according to claim 1, wherein the training the BERT model according to the training sample sequence to construct the pre-training model comprises:
performing dimensionality reduction processing on the training sample sequence, and determining a feature expression vector of the training sample sequence;
and under the constraint of a cross entropy loss function, training the BERT model according to the feature expression vector of the training sample sequence.
4. The method of constructing a pre-trained model of a compound expression according to claim 3, wherein the training of the BERT model according to the feature representation vectors of the training sample sequence under the constraint of a cross-entropy loss function comprises:
inputting the feature representation vectors of the training sample sequences into the BERT model, and generating sequence relation prediction labels of the training sample sequences, wherein the sequence relation prediction labels are used for representing whether each training sample sequence is a SMILES sequence of the plurality of compound samples;
calculating the cross entropy loss function according to the sequence relation prediction label;
and if the cross entropy loss function is converged, confirming the BERT model as the pre-training model.
5. The method of constructing a pre-trained model of a compound expression according to claim 3, wherein the cross-entropy loss function is as follows:
L_{cross\ entropy\ loss} = -\frac{1}{N}\sum_{t=1}^{N}\log\frac{e^{h_{y_t}}}{\sum_{c=1}^{C}e^{h_{t,c}}}

wherein L_{cross entropy loss} represents the cross entropy loss function, N represents the number of training sample sequences, t indexes the training sample sequences, h_{y_t} represents the prediction score corresponding to the true relationship label of training sample sequence t, h_{t,c} represents the sequence relation prediction score of training sample sequence t for class c, and C represents the number of classification categories.
6. A method of analyzing a compound, comprising:
acquiring a SMILES sequence of a target compound;
inputting the SMILES sequence of the target compound into the pre-trained model constructed by the method of constructing a pre-trained model of a compound expression according to any one of claims 1 to 5, determining sequence information of the target compound, the sequence information comprising sequence relationship predictors and structural feature data;
and if the sequence prediction result is that the SMILES sequence of the target compound conforms to a chemical rule, inputting the structural feature data into a preset analysis task model to obtain an analysis task result of the target compound.
7. An apparatus for constructing a pre-trained model of a compound expression, comprising:
an acquisition module for acquiring SMILES sequences of a plurality of compound samples;
the sample splicing module is used for splicing SMILES sequences of the compound samples and determining spliced SMILES sequences;
and the training module is used for training the BERT model according to the characteristic expression vector of the training sample sequence and constructing the pre-training model, wherein the training sample sequence comprises the SMILES sequences of the plurality of compound samples and the spliced SMILES sequence.
8. A compound analysis apparatus, comprising:
an acquisition module for acquiring a SMILES sequence of a target compound;
a feature extraction module, configured to input the SMILES sequence of the target compound into the pre-training model constructed by the method for constructing a pre-training model of a compound expression according to any one of claims 1 to 5, and determine sequence information of the target compound, where the sequence information includes a sequence relation prediction result and structural feature data;
and the analysis module is used for inputting the structural feature data into a preset analysis task model to obtain an analysis task result of the target compound if the sequence prediction result indicates that the SMILES sequence of the target compound conforms to a chemical rule.
9. A readable storage medium on which a computer program is stored, which program, when executed by a processor, implements a method of constructing a pre-trained model of a compound expression according to any one of claims 1 to 5 and/or a compound analysis method according to claim 6.
10. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, wherein the processor, when executing the program, implements a method of constructing a pre-trained model of a compound expression according to any one of claims 1 to 5 and/or a method of analysing a compound according to claim 6.
CN202210534809.7A 2022-05-17 2022-05-17 Construction method, analysis method, device, storage medium and computer equipment Pending CN114822726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210534809.7A CN114822726A (en) 2022-05-17 2022-05-17 Construction method, analysis method, device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210534809.7A CN114822726A (en) 2022-05-17 2022-05-17 Construction method, analysis method, device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN114822726A (en)

Family

ID=82516156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210534809.7A Pending CN114822726A (en) 2022-05-17 2022-05-17 Construction method, analysis method, device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN114822726A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117912575A (en) * 2024-03-19 2024-04-19 苏州大学 Atomic importance analysis method based on multi-dimensional molecular pre-training model
CN117912575B (en) * 2024-03-19 2024-05-14 苏州大学 Atomic importance analysis method based on multi-dimensional molecular pre-training model

Similar Documents

Publication Publication Date Title
CN111611810B (en) Multi-tone word pronunciation disambiguation device and method
CN110727779A (en) Question-answering method and system based on multi-model fusion
CN109471793B (en) Webpage automatic test defect positioning method based on deep learning
US20230080671A1 (en) User intention recognition method and apparatus based on statement context relationship prediction
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
JP7290861B2 (en) Answer classifier and expression generator for question answering system and computer program for training the expression generator
CN113064586B (en) Code completion method based on abstract syntax tree augmented graph model
US20220129450A1 (en) System and method for transferable natural language interface
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN116719520B (en) Code generation method and device
Leopold et al. Using hidden Markov models for the accurate linguistic analysis of process model activity labels
CN114298035A (en) Text recognition desensitization method and system thereof
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN112764762A (en) Method and system for automatically converting standard text into computable logic rule
CN116861269A (en) Multi-source heterogeneous data fusion and analysis method in engineering field
CN115374786A (en) Entity and relationship combined extraction method and device, storage medium and terminal
CN114822726A (en) Construction method, analysis method, device, storage medium and computer equipment
CN113377844A (en) Dialogue type data fuzzy retrieval method and device facing large relational database
KR20220021836A (en) Context sensitive spelling error correction system or method using Autoregressive language model
CN116384379A (en) Chinese clinical term standardization method based on deep learning
CN115964497A (en) Event extraction method integrating attention mechanism and convolutional neural network
CN115221284A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN114610878A (en) Model training method, computer device and computer-readable storage medium
CN114417016A (en) Knowledge graph-based text information matching method and device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination