CN116189789B

CN116189789B - Method and apparatus for screening aggregation-induced emission molecules using machine learning

Info

Publication number: CN116189789B
Application number: CN202310488382.6A
Authority: CN
Inventors: 许改霞; 张怡斌; 江一航; 徐周睿; 范妙壮
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2023-05-04
Filing date: 2023-05-04
Publication date: 2023-11-21
Anticipated expiration: 2043-05-04
Also published as: CN116189789A

Abstract

The application discloses a method and apparatus for screening aggregation-induced emission molecules using machine learning. A virtual database to be screened is constructed; each aggregation-induced emission molecule in it is processed to obtain at least one item of molecular fingerprint information; the fingerprint information of each molecule is input in turn into a trained fluorescence quantum yield prediction model, which outputs the predicted fluorescence quantum yield of each aggregation-induced emission molecule; and the aggregation-induced emission molecules whose fluorescence quantum yield meets preset requirements are screened out according to these predictions. The disclosed method screens an unknown molecular structure space on a large scale to find high-performance fluorescent molecular materials, which accelerates the development of high-performance organic fluorescent materials, and it has the advantages of low computational cost, high accuracy and high efficiency.

Description

Method and apparatus for screening aggregation-induced emission molecules using machine learning

Technical Field

The application relates to the technical field of organic fluorescent materials, and in particular to a method and apparatus for screening aggregation-induced emission molecules using machine learning.

Background

For aggregation-induced emission molecules, brightness strongly affects performance in applications such as OLEDs, bioimaging and sensors. Although most aggregation-induced emission molecules can emit light in the solid state (powder, thin film and nano-aggregates), little attention has been paid to achieving high solid-state brightness, and the lack of aggregation-induced emission molecules with high fluorescence quantum yields keeps them from realizing their full potential for deep-penetration imaging in biomedical applications.

Aggregation-induced emission molecules with high fluorescence quantum yields would open up more possibilities for biomedical fluorescence imaging, and researchers have therefore shown great interest in designing such molecules for the solid state. However, aggregation-induced emission mechanisms are complex and span several photophysical dimensions, such as restriction of intramolecular rotation or vibration, restriction of excited-state deformation, and suppression of Kasha's rule, which makes it difficult to design or screen aggregation-induced emission molecules with the desired properties.

Accordingly, the prior art is in need of improvement.

Disclosure of Invention

In view of the above shortcomings of the prior art, the present invention aims to provide a method and apparatus for screening aggregation-induced emission molecules by machine learning, overcoming the inability of the prior art to screen aggregation-induced emission molecules rapidly.

The technical scheme adopted for solving the technical problems is as follows:

In a first aspect, the present embodiment provides a method for screening aggregation-induced emission molecules using machine learning, comprising:

constructing a virtual database to be screened; the virtual database to be screened comprises a plurality of aggregation-induced emission molecules, wherein each aggregation-induced emission molecule is formed by combining collected electron donors, electron acceptors and pi bridges through molecular docking;

carrying out data preprocessing on each aggregation-induced emission molecule to obtain at least one molecular fingerprint corresponding to each aggregation-induced emission molecule, and merging and splicing the molecular fingerprints corresponding to each aggregation-induced emission molecule to obtain a merged and spliced multi-modal molecular fingerprint;

sequentially inputting the multi-modal molecular fingerprints corresponding to the aggregation-induced emission molecules into a trained fluorescence quantum yield prediction model to obtain the predicted solid-state fluorescence quantum yield of each aggregation-induced emission molecule output by the fluorescence quantum yield prediction model; the fluorescence quantum yield prediction model is trained based on the correspondence between molecular fingerprint data of known aggregation-induced emission molecules and fluorescence quantum yield values of the known aggregation-induced emission molecules;

and screening out the aggregation-induced emission molecules whose fluorescence quantum yield meets preset requirements according to the predicted fluorescence quantum yield of each aggregation-induced emission molecule.

Optionally, the training method of the fluorescence quantum yield prediction model comprises the following steps:

constructing a training data set, wherein the training data set comprises a plurality of groups of sample aggregation-induced emission molecular data, and each group of sample aggregation-induced emission molecular data comprises molecular fingerprint data of aggregation-induced emission molecules and fluorescence quantum yield in a solid state corresponding to the aggregation-induced emission molecules;

inputting each group of sample aggregation-induced emission molecular data into a preset machine learning model to obtain a predicted value of fluorescence quantum yield of the sample aggregation-induced emission molecules in a solid state, which is output by the preset machine learning model;

correcting parameters of the preset machine learning model according to the error between the predicted value and the true value of the solid-state fluorescence quantum yield of the sample aggregation-induced emission molecules, and repeating until the training of the preset machine learning model meets preset conditions, thereby obtaining the fluorescence quantum yield prediction model.

Optionally, the step of constructing the training data set includes:

carrying out data preprocessing on the sample aggregation-induced emission molecular data to obtain molecular fingerprints of the sample aggregation-induced emission molecules;

and respectively merging and splicing the molecular fingerprints of the sample aggregation-induced emission molecules to obtain the multi-modal molecular fingerprint data of the sample aggregation-induced emission molecules.

Optionally, the preset machine learning model includes at least one XGBoost model and at least one random forest model;

the step of inputting each group of sample aggregation-induced emission molecular data into a preset machine learning model to obtain a predicted value of fluorescence quantum yield of the sample aggregation-induced emission molecules in a solid state, which is output by the preset machine learning model, comprises the following steps:

respectively inputting the multi-modal molecular fingerprint data of the sample aggregation-induced emission molecules into an XGBoost model and a random forest model to obtain a first fluorescence quantum yield predicted value output by the XGBoost model and a second fluorescence quantum yield predicted value output by the random forest model;

and weighting the first fluorescence quantum yield predicted value and the second fluorescence quantum yield predicted value according to a preset weight to obtain the fluorescence quantum yield predicted value of the sample aggregation-induced emission molecule in a solid state.
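The weighting step can be illustrated with a minimal sketch. It assumes the two model outputs are already available as plain numbers and that the preset weights sum to 1; the function name `weighted_prediction`, the weight value 0.75 and the example yields are hypothetical and are not taken from this application.

```python
def weighted_prediction(xgb_pred, rf_pred, xgb_weight=0.5):
    """Blend two predicted quantum yields with weights that sum to 1."""
    if not 0.0 <= xgb_weight <= 1.0:
        raise ValueError("xgb_weight must lie in [0, 1]")
    return xgb_weight * xgb_pred + (1.0 - xgb_weight) * rf_pred


# Hypothetical outputs of the two models for one molecule (yield as a fraction).
first_pred = 0.60   # stand-in for the XGBoost output
second_pred = 0.40  # stand-in for the random forest output

blended = weighted_prediction(first_pred, second_pred, xgb_weight=0.75)
print(round(blended, 4))
```

A single weight parameter keeps the two weights complementary; an implementation with independently chosen weights would need to normalize them.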

Optionally, the step of constructing the virtual database to be screened includes:

collecting to obtain a plurality of substructures of electron donors, electron acceptors and pi bridges;

using a molecular space generation algorithm to dock and combine the substructures of a plurality of electron donors, electron acceptors and/or pi bridges with one another to obtain a plurality of docked and combined aggregation-induced emission molecules; wherein each substructure includes one or more binding sites; the molecular space generation algorithm docks the binding sites of the substructures of the electron donors, electron acceptors and/or pi bridges through single bonds, by permutation and combination;

and constructing the virtual database to be screened from the plurality of aggregation-induced emission molecules obtained after docking and combination.

Optionally, the step of merging and splicing the molecular fingerprints corresponding to the aggregation-induced emission molecules to obtain the merged and spliced multi-modal molecular fingerprints further includes:

and respectively calculating the correlation among the multi-modal molecular fingerprints corresponding to the aggregation-induced emission molecules, and deleting the multi-modal molecular fingerprints with the correlation higher than a preset correlation threshold.

Optionally, the step of calculating the correlation between the multi-modal molecular fingerprints corresponding to the aggregation-induced emission molecules includes:

respectively acquiring the bit vectors corresponding to the multi-modal molecular fingerprints, and sequentially calculating the Pearson correlation coefficients between the bit vectors of the multi-modal molecular fingerprints, wherein the calculated Pearson correlation coefficients are used as the correlation data among the multi-modal molecular fingerprints.
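The following sketch shows one possible reading of the correlation step, under the assumption that the correlation is computed between bit positions (feature columns) across all molecules and that one bit of each highly correlated pair is discarded. The threshold of 0.9, the helper names and the toy fingerprints are hypothetical; the application does not state these details.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated_bits(fingerprints, threshold=0.9):
    """Keep a bit position only if |r| <= threshold against every kept bit."""
    columns = list(zip(*fingerprints))  # one tuple per bit position
    kept = []
    for idx, column in enumerate(columns):
        if all(abs(pearson(column, columns[k])) <= threshold for k in kept):
            kept.append(idx)
    return kept


# Four hypothetical molecules, four bit positions; bit 1 duplicates bit 0.
fps = [
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
print(drop_correlated_bits(fps))  # [0, 2, 3]
```

The greedy pass keeps the first bit of each correlated group, so the result depends on bit order; other tie-breaking rules are equally plausible.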

Optionally, the step of screening the aggregation-induced emission molecules, in which the fluorescence quantum yield meets the preset requirement, according to the predicted fluorescence quantum yield of each of the aggregation-induced emission molecules includes:

obtaining range information of a preset fluorescence quantum yield;

and screening aggregation-induced emission molecules with fluorescence quantum yield in a solid state within the fluorescence quantum yield range information from the predicted fluorescence quantum yield information.
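The range-based screening can be sketched as a simple filter. The molecule labels, predicted yields and the lower bound of 0.40 below are hypothetical values chosen only for illustration, and sorting the hits by yield is an added convenience rather than something the application requires.

```python
def screen_by_yield(predictions, lower, upper=1.0):
    """Return (molecule, yield) pairs with lower <= yield <= upper, best first."""
    hits = [(name, qy) for name, qy in predictions.items() if lower <= qy <= upper]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical predicted solid-state quantum yields (as fractions).
predicted = {
    "D1-A1": 0.12,
    "D1-P1-A2": 0.47,
    "D2-P1-A1-P1-D2": 0.63,
    "D3-A2": 0.30,
}

selected = screen_by_yield(predicted, lower=0.40)
print(selected)
```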

In a second aspect, the present embodiment further discloses an information processing terminal, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method when executing the computer program.

In a third aspect, the present embodiment also provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method.

This embodiment discloses a method and apparatus for screening aggregation-induced emission molecules using machine learning. A virtual database to be screened is constructed; the virtual database to be screened comprises a plurality of aggregation-induced emission molecules, each formed by combining collected electron donors, electron acceptors and pi bridges through molecular docking. Data preprocessing is carried out on each aggregation-induced emission molecule to obtain at least one molecular fingerprint corresponding to that molecule, and the molecular fingerprints corresponding to each molecule are merged and spliced into a multi-modal molecular fingerprint. The multi-modal molecular fingerprints are sequentially input into a trained fluorescence quantum yield prediction model to obtain the predicted fluorescence quantum yield value of each aggregation-induced emission molecule; the fluorescence quantum yield prediction model is trained based on the correspondence between molecular fingerprint data of known aggregation-induced emission molecules and their fluorescence quantum yield values. The aggregation-induced emission molecules whose fluorescence quantum yield meets preset requirements are then screened out according to the predicted fluorescence quantum yield of each molecule.
The screening method disclosed in this embodiment obtains high-performance fluorescent molecular materials by screening the unknown molecular structure space on a large scale, thereby accelerating the development of high-performance organic fluorescent materials, and has the advantages of low computational cost, high accuracy and high efficiency.

Drawings

FIG. 1 is a flow chart of a method provided in an embodiment of the present invention;

FIG. 2a is a schematic diagram showing a structure list of electron donors according to an embodiment of the present invention;

FIG. 2b is a schematic diagram of a structural list of pi bridges in an embodiment of the present invention;

FIG. 2c is a schematic diagram showing a structural list of electron acceptors in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a molecular space generating algorithm for molecular docking in an embodiment of the present invention;

FIG. 4 is a schematic diagram of the principle of molecular fingerprint generation in an embodiment of the present invention;

FIG. 5 is a schematic diagram of the generation principle of multi-modal molecular fingerprints in an embodiment of the present invention;

FIG. 6 is a schematic diagram of a preset machine learning model according to an embodiment of the present invention;

FIG. 7 is a flowchart illustrating steps of an embodiment of a method application of the present invention;

FIG. 8a is a scatter plot of fluorescence quantum yield predictions for a test set by a machine learning model based on MACCS fingerprints in an embodiment of the present invention;

FIG. 8b is a scatter plot of fluorescence quantum yield predictions for a test set by a machine learning model based on PubChem fingerprints in an embodiment of the present invention;

FIG. 8c is a scatter plot of fluorescence quantum yield predictions for a test set by a machine learning model based on multi-modal molecular fingerprints in an embodiment of the present invention.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

Fluorescent probes, in particular organic luminophores, are widely used in biological research because of their lower safety risk and their biodegradability. However, conventional organic probes still suffer from poor water solubility, severe photobleaching and low stability, and these inherent drawbacks have significantly hindered their widespread use in biomedical research.

Aggregation-induced quenching is a common feature of conventional fluorescent dyes that prevents them from achieving optimal light emission performance in the solid or aggregated state, although such behavior may be beneficial for other applications related to non-radiative decay processes, such as photoacoustic imaging and phototherapy. In a scenario where strong luminescence is required, many strategies have been developed to overcome the bottleneck of quenching, but with little success. Fortunately, luminophores with aggregation-induced emission properties bring about a perfect solution. Aggregation-induced emission molecules have a weaker or even no emission in the molecular state, but exhibit a highly enhanced fluorescence emission in the aggregated state. This feature imparts good colloidal stability, greater resistance to photobleaching and highly enhanced functional stability to the aggregation-induced emission molecules. Thus, in recent years, aggregation-induced emission molecules have attracted considerable attention among scientists and are considered to be a better choice than conventional organic probes.

For aggregation-induced emission molecules, brightness is a performance factor that strongly affects their use in different applications (e.g., OLEDs, bioimaging, and sensors). Although most aggregation-induced emission molecules can emit light in the solid state (powder, film, and nano-aggregates), few design principles have focused on achieving high solid-state brightness. The lack of aggregation-induced emission molecules with high fluorescence quantum yields keeps them from realizing their full potential for deep-penetration imaging in biomedical applications, and it results in poor temporal resolution, because longer exposure times are required to compensate for low brightness. Aggregation-induced emission molecules with high fluorescence quantum yields would open up many possibilities for biomedical fluorescence imaging, and researchers have therefore shown great interest in designing such molecules for the solid state. Before chemical synthesis, it is important to understand the relationship between molecular structure and optical properties in depth. This is challenging because aggregation-induced emission mechanisms are complex and span several photophysical dimensions, such as restriction of intramolecular rotation or vibration, restriction of excited-state deformation, suppression of Kasha's rule, and so forth.

To improve the efficiency of data processing, machine learning is increasingly applied and has been used successfully in fields including drug design, organic synthesis and materials chemistry. By learning from large amounts of information on the structure, composition and properties of materials, a machine learning model can predict material performance with high accuracy. This supports more accurate decisions in material design and screening and thereby speeds up material development. Combined with some knowledge of the underlying physical or chemical mechanism, machine learning can guide further research and help researchers develop molecules whose properties meet expectations.

To better solve the above problems in the prior art, this embodiment discloses a method and apparatus for screening aggregation-induced emission molecules using machine learning. First, a database containing a sample data set is built from aggregation-induced emission molecules known in the prior art, and a preset machine learning model is trained on this sample database to obtain a trained fluorescence quantum yield prediction model. Next, a molecular space generation algorithm docks substructures of electron donors, electron acceptors and pi bridges into a plurality of unknown aggregation-induced emission molecules. The trained fluorescence quantum yield prediction model then predicts the solid-state fluorescence quantum yield of each unknown aggregation-induced emission molecule, and finally the molecules that meet the preset fluorescence quantum yield requirement are screened out according to the predicted values. The method for predicting the fluorescence quantum yield of aggregation-induced emission molecules provided by the invention helps researchers design aggregation-induced emission molecules with a specific fluorescence quantum yield by exploiting the underlying structure-property relationship of the molecules, which saves experimental and computational time and resources, improves experimental efficiency and avoids blind trial. Compared with a chemical synthesis route that relies on traditional trial and error, the aggregation-induced emission molecule screening strategy provided by the invention has low computational cost, high accuracy and high efficiency.

The following describes in detail a method for screening aggregation-induced emission molecules using machine learning according to this embodiment, with reference to the accompanying drawings.

The present embodiment discloses a method for screening aggregation-induced emission molecules using machine learning, as shown in fig. 1, comprising:

S1, constructing a virtual database to be screened; the virtual database to be screened comprises a plurality of aggregation-induced emission molecules, wherein each aggregation-induced emission molecule is formed by combining collected electron donors, electron acceptors and pi bridges through molecular docking.

To obtain more aggregation-induced emission molecules that meet the composition requirement, this step first constructs a virtual database to be screened, which stores a large number of unknown molecules. These molecules are formed by docking known substructures: the substructures of the known molecules are arranged and docked at random, and together they span a huge molecular structure space.

Specifically, the step of constructing the virtual database to be screened includes:

Step S11, collecting a plurality of electron donor, electron acceptor and pi bridge substructures.

Since aggregation-induced emission molecules are composed of classical electron donor (Donor), electron acceptor (Acceptor) and pi-bridge substructures, a molecular space generation docking algorithm combines the different donors, acceptors and pi bridges through specific binding sites to generate a plurality of complete aggregation-induced emission molecules of the Donor-pi-Acceptor (DA) and Donor-pi-Acceptor-pi-Donor (DAD) types. Referring to fig. 2a to 2c, and fig. 3 and 4, fig. 2a shows a structure list of electron donors, fig. 2b shows a structure list of pi bridges, fig. 2c shows a structure list of electron acceptors, and fig. 3 shows how the above substructures are docked one by one, using the molecular space generation docking algorithm, to form complete aggregation-induced emission molecules.

Step S12, using a molecular space generation algorithm to dock and combine the substructures of a plurality of electron donors, electron acceptors and/or pi bridges with one another to obtain a plurality of docked and combined aggregation-induced emission molecules; wherein each substructure includes one or more binding sites; the molecular space generation algorithm docks the binding sites of the substructures of the electron donors, electron acceptors and/or pi bridges through single bonds, by permutation and combination.

Specifically, as shown in fig. 3, the database to be screened stores a plurality of donor (D), acceptor (A) and pi bridge substructures, all of which are put into a substructure pool. Because each D and each A has selectable binding sites, the molecular space generation algorithm connects D and A at these binding sites by permutation and combination. In one embodiment, D1 is connected to A1, A2 ... An respectively to form aggregation-induced emission molecules D1A1, D1A2, D1A3 ..., then D2 is connected to A1, A2 ... An respectively to obtain D2A1, D2A2, D2A3 ..., and D3, D4, D5 ... Dn are then connected to A1, A2 ... An in the same manner. That is, linking A1 with D1 forms the first aggregation-induced emission molecule, linking A2 with D1 forms the second, and so on until linking An with D1 forms the n-th. After D1 has been used, D2 is joined to A1 to form the (n+1)-th aggregation-induced emission molecule, D2 is joined to A2 to form the (n+2)-th, and so on until D2 is joined to An to form the 2n-th. In addition, connecting D to a pi bridge and then to A forms a Donor-pi-Acceptor (DA) type molecule, and connecting D through a pi bridge to both sides of A forms a Donor-pi-Acceptor-pi-Donor (DAD) type molecule. Docking in this way finally yields a plurality of complete aggregation-induced emission molecules.

Step S13, constructing the virtual database to be screened from the plurality of aggregation-induced emission molecules obtained after docking and combination.

With reference to fig. 3, the plurality of aggregation-induced emission molecules obtained after docking and combination are stored in the virtual database to be screened, which completes its construction.

Because the aggregation-induced emission molecules are obtained by random docking, their properties are unknown; and because the molecular space available for splicing is huge, the number of molecules in the virtual database to be screened is large. The number of aggregation-induced emission molecules keeps increasing as the number of electron donor, electron acceptor and/or pi bridge substructures increases. To achieve a better screening result, in one embodiment the number of molecules in the database to be screened exceeds 10,000.
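The permutation-and-combination docking of steps S11 to S13 can be sketched as follows. This is only an illustration under simplifying assumptions: substructures are represented by hypothetical text labels rather than chemical structures, binding sites are ignored, and the DAD type is assumed to carry the same donor and pi bridge on both sides of the acceptor. The function name `enumerate_candidates` is not from this application.

```python
from itertools import product

# Hypothetical substructure labels; the real pools are the structures
# listed in FIG. 2a to 2c.
donors = ["D1", "D2", "D3"]
acceptors = ["A1", "A2"]
pi_bridges = ["P1", "P2"]


def enumerate_candidates(donors, acceptors, pi_bridges):
    """Return candidate molecules as tuples of substructure labels."""
    candidates = []
    # D-A: every donor joined directly to every acceptor (D1A1, D1A2, ...).
    for d, a in product(donors, acceptors):
        candidates.append((d, a))
    # D-pi-A: donor, then pi bridge, then acceptor.
    for d, p, a in product(donors, pi_bridges, acceptors):
        candidates.append((d, p, a))
    # D-pi-A-pi-D: the same donor/bridge arm on both sides of the acceptor
    # (a simplifying assumption of this sketch).
    for d, p, a in product(donors, pi_bridges, acceptors):
        candidates.append((d, p, a, p, d))
    return candidates


library = enumerate_candidates(donors, acceptors, pi_bridges)
print(len(library))  # 30
print(["-".join(m) for m in library[:4]])
```

With 3 donors, 2 acceptors and 2 pi bridges this gives 6 + 12 + 12 = 30 candidates, which shows how quickly the library grows as substructures are added to the pool.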

S2, carrying out data preprocessing on each aggregation-induced emission molecule to obtain at least one molecular fingerprint corresponding to each aggregation-induced emission molecule, and merging and splicing the molecular fingerprints corresponding to each aggregation-induced emission molecule to obtain a merged and spliced multi-modal molecular fingerprint.

A molecular fingerprint encodes a molecule as a bit vector, and the encoded bit vectors are used to compare properties between molecules. Each bit of a molecular fingerprint corresponds to the presence or absence of a molecular fragment. For example, if a certain functional group or substructure is present in the aggregation-induced emission molecule, the corresponding bit in its molecular fingerprint is 1, and otherwise it is 0; the molecular fingerprint of the whole aggregation-induced emission molecule is thus obtained by encoding the donor (D), acceptor (A) and pi-bridge structures that are docked together. Referring to fig. 4, the aggregation-induced emission molecule shown in the figure has three substructures, namely substructure 111, substructure 112 and substructure 113; the bits corresponding to these three substructures in the resulting molecular fingerprint are each 1, and the final encoded molecular fingerprint 11 is 0010001001. Because different encodings yield different molecular fingerprints of the same aggregation-induced emission molecule, in a specific embodiment the MACCS (Molecular ACCess System) fingerprint and the PubChem fingerprint are each used to extract molecular features, giving several different molecular fingerprints of the same aggregation-induced emission molecule. The PubChem fingerprint, whose full name is PubChem Structural Fingerprint (also abbreviated PCFP), is a binary vector describing chemical molecular structure and is the molecular representation used by the PubChem database; it predefines 881 structural features, so the fingerprint is an array of 881 bits whose elements are all 0 or 1.
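The presence/absence encoding described above can be illustrated with a toy example. The 10-entry key list below is hypothetical, and the positions of substructures 111, 112 and 113 in it are assumed only so that the result reproduces the bit string quoted for fig. 4; real MACCS and PubChem keys are defined by their own specifications and are matched against the molecular graph.

```python
# Hypothetical 10-key fragment dictionary (not a real fingerprint definition).
FRAGMENT_KEYS = [
    "key_01", "key_02", "substructure_111", "key_04", "key_05",
    "key_06", "substructure_112", "key_08", "key_09", "substructure_113",
]


def encode_fingerprint(present_fragments, keys=FRAGMENT_KEYS):
    """Return a list of 0/1 bits, one per key: 1 if the fragment is present."""
    present = set(present_fragments)
    return [1 if key in present else 0 for key in keys]


bits = encode_fingerprint(
    ["substructure_111", "substructure_112", "substructure_113"]
)
print("".join(str(b) for b in bits))  # 0010001001
```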

To improve the accuracy of predicting the properties of unknown aggregation-induced emission molecules, this embodiment combines the different molecular fingerprints of the same aggregation-induced emission molecule into a new multi-modal molecular fingerprint, which improves the accuracy of model prediction. A multi-modal molecular fingerprint can integrate various types of information, including molecular structure, chemical properties and physical properties. This describes the nature of the molecule more fully and provides richer, multi-level information. A multi-modal molecular fingerprint can also cover information in multiple dimensions, such as atom type, bond length, bond angle, charge distribution, electron affinity and ionization potential, which lets it capture the diversity of molecules and describe them in more detail. Because multi-modal molecular fingerprints are highly descriptive, different molecules can be compared and classified by the similarity between their fingerprints, which speeds up molecular screening and improves the accuracy of the screening results.

As shown in fig. 5, the MACCS fingerprint and the PubChem fingerprint corresponding to the same aggregation-induced emission molecule are combined to obtain the multi-modal molecular fingerprint of that molecule. It is conceivable that the number of molecular fingerprints combined for the same molecule is not limited to two and may be three or more. Because different molecular fingerprints correspond to different molecular properties of the aggregation-induced emission molecule, combining several molecular fingerprints better reflects the property characteristics of an unknown aggregation-induced emission molecule and therefore allows its performance to be predicted more accurately.
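Merging and splicing can be read as plain concatenation of the bit vectors, which the sketch below assumes. The bit patterns are placeholders rather than real fingerprints, the function name `merge_fingerprints` is hypothetical, and the MACCS length of 166 refers to the public key set (the 881-bit PubChem length is the one stated above).

```python
MACCS_BITS = 166     # number of public MACCS keys
PUBCHEM_BITS = 881   # PubChem fingerprint length stated in the text


def merge_fingerprints(*fingerprints):
    """Concatenate several 0/1 bit lists into one multi-modal bit list."""
    merged = []
    for fp in fingerprints:
        if any(bit not in (0, 1) for bit in fp):
            raise ValueError("fingerprints must contain only 0 or 1")
        merged.extend(fp)
    return merged


# Placeholder bit patterns standing in for real fingerprints of one molecule.
maccs = [1 if i % 7 == 0 else 0 for i in range(MACCS_BITS)]
pubchem = [1 if i % 11 == 0 else 0 for i in range(PUBCHEM_BITS)]

multimodal = merge_fingerprints(maccs, pubchem)
print(len(multimodal))  # 1047
```

Because the function accepts any number of bit lists, a third or fourth fingerprint type can be appended in the same way.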

S3, sequentially inputting the multi-modal molecular fingerprints corresponding to the aggregation-induced emission molecules into a trained fluorescence quantum yield prediction model to obtain the predicted fluorescence quantum yield value of each aggregation-induced emission molecule output by the fluorescence quantum yield prediction model; the fluorescence quantum yield prediction model is trained based on the correspondence between molecular fingerprint data of a plurality of known aggregation-induced emission molecules and their fluorescence quantum yield values.

The multi-modal molecular fingerprints corresponding to the aggregation-induced emission molecules obtained in step S2 are sequentially input into the trained fluorescence quantum yield prediction model to obtain the predicted value of the fluorescence quantum yield of each aggregation-induced emission molecule.

Specifically, the training method of the fluorescence quantum yield prediction model comprises the following steps:

Constructing a training data set, wherein the training data set comprises a plurality of groups of sample aggregation-induced emission molecular data, and each group of sample aggregation-induced emission molecular data comprises the molecular fingerprint data of an aggregation-induced emission molecule and the fluorescence quantum yield corresponding to that molecule.

Inputting each group of sample aggregation-induced emission molecular data into a preset machine learning model to obtain the predicted value of the fluorescence quantum yield of the sample aggregation-induced emission molecule output by the preset machine learning model.

Correcting the parameters of the preset machine learning model according to the error between the predicted value and the true value of the fluorescence quantum yield of the sample aggregation-induced emission molecule, until the training of the preset machine learning model meets preset conditions, thereby obtaining the fluorescence quantum yield prediction model.

Specifically, inputting each group of sample aggregation-induced emission molecular data into a preset machine learning model to obtain a predicted value of the sample aggregation-induced emission molecular fluorescence quantum yield output by the preset machine learning model; correcting parameters of the preset machine learning model according to errors between the predicted value of the fluorescence quantum yield of the sample aggregation-induced emission molecules and the true value of the fluorescence quantum yield of the sample aggregation-induced emission molecules; and continuously executing the steps of inputting each group of sample aggregation-induced emission molecular data into a preset machine learning model, and correcting parameters of the preset machine learning model according to the predicted value and the true value of the fluorescence quantum yield of the sample aggregation-induced emission molecular data until the training of the preset machine learning model meets preset conditions, so as to obtain the fluorescence quantum yield predicted model.

To achieve a better prediction result, the preset machine learning model in this embodiment combines an XGBoost model and a random forest model; the sample aggregation-induced emission molecular data is the molecular fingerprint data corresponding to each aggregation-induced emission molecule, specifically multi-modal molecular fingerprint data obtained by merging and splicing the different molecular fingerprints corresponding to each sample aggregation-induced emission molecule.

The preset machine learning model includes at least one XGBoost model (extreme gradient boosting model) and at least one random forest model.

In one embodiment, as shown in fig. 6, the preset machine learning model includes an XGBoost model and a random forest model. The XGBoost model is an ensemble regressor consisting of a plurality of regressors; the multi-modal molecular fingerprints of the aggregation-induced emission molecules are sequentially input into the ensemble regressor, and the weighted average of the results output by the individual regressors is the first fluorescence quantum yield predicted value corresponding to each aggregation-induced emission molecule. The random forest model consists of a plurality of weak learners (decision trees), and the predicted value it outputs is the second fluorescence quantum yield predicted value. The first and second fluorescence quantum yield predicted values are weighted to obtain the final predicted value of the fluorescence quantum yield output by the preset machine learning model.

Specifically, the multi-modal molecular fingerprints (i.e., the original data shown in fig. 6) of the sample aggregation-induced emission molecules are sequentially input to the first regressor in the XGBoost model to obtain the predicted value output by the first regressor, and the residual of the first regressor is computed from that predicted value. The second regressor then predicts from the residual of the first regressor, yielding its own predicted value and residual. The step in which each regressor predicts from the residual of the preceding regressor is repeated until the error between the predicted value and the true value meets the preset requirement, or until a specified number of iterations (such as 5000) is reached, finally giving the trained XGBoost model.
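The residual-fitting loop described above can be illustrated with a deliberately simplified sketch. It is not the XGBoost library and omits XGBoost's regularization and second-order terms: each "regressor" here is a one-bit decision stump, and the names `fit_stump`, `fit_boosted`, `predict_boosted`, the learning rate and the tolerance are all assumptions chosen for illustration. Only the stopping rule (error small enough, or a round limit such as 5000) follows the text.

```python
def fit_stump(X, residuals):
    """Pick the single bit whose 0/1 split best reduces the squared error of the residuals."""
    best = None
    for j in range(len(X[0])):
        on = [r for x, r in zip(X, residuals) if x[j] == 1]
        off = [r for x, r in zip(X, residuals) if x[j] == 0]
        if not on or not off:
            continue  # this bit does not split the samples
        mean_on, mean_off = sum(on) / len(on), sum(off) / len(off)
        sse = sum((r - mean_on) ** 2 for r in on) + sum((r - mean_off) ** 2 for r in off)
        if best is None or sse < best[0]:
            best = (sse, j, mean_off, mean_on)
    return best[1:] if best else None


def fit_boosted(X, y, n_rounds=5000, learning_rate=0.5, tol=1e-6):
    """Each new stump is fitted to the residual left by all previous stumps."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        if max(abs(r) for r in residuals) < tol:
            break  # predictions are close enough to the true values
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, v_off, v_on = stump
        stumps.append(stump)
        preds = [p + learning_rate * (v_on if x[j] else v_off) for x, p in zip(X, preds)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, lr, stumps = model
    return base + sum(lr * (v_on if x[j] else v_off) for j, v_off, v_on in stumps)


# Toy fingerprints and quantum yields (hypothetical values).
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
y = [0.6, 0.3, 0.8, 0.1]
model = fit_boosted(X, y)
print([round(predict_boosted(model, x), 3) for x in X])
```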

Meanwhile, the multi-modal molecular fingerprints (the original data shown in fig. 6) of the sample aggregation-induced emission molecules are sequentially input into the random forest model to obtain the predicted values of the aggregation-induced emission molecules output by the random forest model. The parameters of the model are adjusted according to the predicted values, and training of the random forest is repeated until it is completed, giving the trained random forest model.

The trained XGBoost model and the trained random forest model are fused to obtain the fluorescence quantum yield prediction model provided by this embodiment. The predicted value output by the XGBoost model is taken as the first fluorescence quantum yield predicted value and the predicted value output by the random forest model as the second fluorescence quantum yield predicted value; the two are weighted according to preset weight values to obtain the final fluorescence quantum yield predicted value for each aggregation-induced emission molecule. As shown in fig. 6, the weights of the two models may be set to be the same, that is, the average of the first and second fluorescence quantum yield predicted values is calculated directly and taken as the final predicted value of the fluorescence quantum yield.
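The weighted fusion of the two predicted values can be written in a few lines. In this sketch the function name `fuse_predictions` and the requirement that the two weights sum to 1 are assumptions; the default of equal weights corresponds to the simple average described for fig. 6.

```python
def fuse_predictions(xgb_pred, rf_pred, w_xgb=0.5, w_rf=0.5):
    """Weighted combination of the first (XGBoost) and second (random forest) predictions."""
    # Assumption: the preset weights are normalized so that they sum to 1.
    if abs(w_xgb + w_rf - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return w_xgb * xgb_pred + w_rf * rf_pred


# Hypothetical predicted quantum yields for one molecule.
final_equal = fuse_predictions(0.62, 0.58)            # equal weights: plain average
final_skewed = fuse_predictions(0.62, 0.58, 0.7, 0.3)  # XGBoost weighted more heavily
print(final_equal, final_skewed)
```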

The fluorescence quantum yield prediction model disclosed in this embodiment is an ensemble model that fuses two different algorithm models. Fusing the two machine learning models allows the advantages of each to be exploited and the shortcomings of each to be offset, so that more accurate prediction is achieved. In this embodiment, the random forest model can process discrete features and the extreme gradient boosting model (XGBoost model) can process high-dimensional data, so fusing the models brings the advantages of each into play and improves the performance of the overall fluorescence quantum yield prediction model.

In one embodiment, the model is evaluated using three evaluation criteria, namely the mean absolute error (MAE), the root mean square error (RMSE) and the coefficient of determination (R²), and the fluorescence quantum yield prediction model is obtained by training accordingly.
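For reference, the three criteria can be computed as follows. This is a plain standard-library sketch using the usual textbook definitions; the function names and the toy yield values are illustrative assumptions, not data from this embodiment.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical true and predicted quantum yields.
y_true = [0.1, 0.5, 0.9]
y_pred = [0.2, 0.5, 0.8]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

Lower MAE and RMSE and an R² closer to 1 indicate better agreement between predicted and measured yields.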

And S4, screening out aggregation-induced emission molecules with fluorescence quantum yield meeting preset requirements according to the predicted fluorescence quantum yield of each aggregation-induced emission molecule.

After the fluorescence quantum yield prediction model outputs the predicted fluorescence quantum yield of each unknown aggregation-induced emission molecule, the aggregation-induced emission molecules with specific fluorescence quantum yields can be screened out as required.

Specifically, the step of screening the aggregation-induced emission molecules in which the fluorescence quantum yield meets the preset requirement according to the predicted fluorescence quantum yield of each aggregation-induced emission molecule comprises the following steps: obtaining range information of a preset fluorescence quantum yield; and screening aggregation-induced emission molecules with fluorescence quantum yield within the fluorescence quantum yield range information from the predicted fluorescence quantum yield information.
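The range-based screening amounts to a filter over the predicted values. The sketch below assumes the predictions are held in a dictionary keyed by a molecule identifier and that the range is inclusive at both ends; the identifiers, values and the function name `screen_by_yield` are hypothetical.

```python
def screen_by_yield(predictions, low, high):
    """Keep molecules whose predicted quantum yield lies in the preset range [low, high]."""
    return {mol: qy for mol, qy in predictions.items() if low <= qy <= high}


# Hypothetical predicted quantum yields for three candidate molecules.
predictions = {"mol_001": 0.12, "mol_002": 0.74, "mol_003": 0.91}
selected = screen_by_yield(predictions, low=0.7, high=1.0)
print(sorted(selected))
```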

To avoid overfitting of the model, increase the operation speed and reduce the operation cost, this embodiment further includes the following step:

Respectively calculating the correlation among the multi-modal molecular fingerprints corresponding to the aggregation-induced emission molecules, and deleting the multi-modal molecular fingerprints whose correlation is higher than a preset correlation threshold. In a specific embodiment, the preset correlation threshold may be set above 0.9, which reduces the dimension of the multi-modal molecular fingerprint data and improves the prediction precision.

Specifically, the correlation between the molecular fingerprints in the above step is calculated as follows: the bit vectors corresponding to the molecular fingerprints are acquired, the Pearson correlation coefficients among the bit vectors are calculated in turn, and the calculated Pearson correlation coefficients are taken as the correlation data among the bit vectors of the molecular fingerprints.

The bit vector data of each molecular fingerprint is shown in table 1. Each row is the bit vector data of one molecular fingerprint, and the Pearson correlation coefficients between the columns are calculated to obtain the correlation between them.

Table 1 molecular fingerprint information example of aggregation-induced emission molecules

The molecular fingerprint correlation processing in this embodiment reduces the dimension and redundancy of the data while losing as little information as possible, which prevents the model from overfitting, speeds up the operation (that is, reduces the operation cost) and improves the prediction precision.
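One possible reading of the correlation step is sketched below: rows are molecules, columns are fingerprint bits, and a column is dropped when its Pearson correlation with an already kept column exceeds the threshold. Keeping the earlier column of a correlated pair, treating constant columns as uncorrelated, and the names `pearson` and `drop_correlated_columns` are assumptions made for this illustration; the text does not specify them.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def drop_correlated_columns(rows, threshold=0.9):
    """Return kept column indices and the reduced bit-vector table."""
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    kept = []
    for j in range(n_cols):
        # Keep column j only if it is not highly correlated with any kept column.
        if all(pearson(columns[j], columns[k]) <= threshold for k in kept):
            kept.append(j)
    reduced = [[row[j] for j in kept] for row in rows]
    return kept, reduced


# Toy table: each row is the bit vector of one molecule (hypothetical values).
rows = [
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
kept, reduced = drop_correlated_columns(rows)
print(kept, reduced)
```

In the toy table the second column duplicates the first, so it is the only one removed.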

In the examples provided herein, machine learning techniques are used to screen aggregation-induced emission molecules whose fluorescence quantum yield meets specific requirements for various biomedical applications. First, a database of aggregation-induced emission molecules collected from the literature is established. Then, by extracting the features of the molecules and training several state-of-the-art machine learning models, the structure-property relationship of the aggregation-induced emission molecules is obtained and their fluorescence quantum yield is predicted.

An embodiment of the present method is described in further detail below with reference to fig. 7.

The method comprises the following steps when being implemented in specific applications:

Step H1, constructing a training database and dividing the data set in the training database: a database is first created containing experimental data for 672 aggregation-induced emission molecules from literature published over the last 20 years. Each data entry contains the molecular structure of the aggregation-induced emission molecule and its fluorescence quantum yield in the solid state, and the aggregation-induced emission characteristics of each molecule in the training database are described in the literature. These aggregation-induced emission molecules include rotor structures or derivatives thereof, such as triphenylamine (TPA), tetraphenylpyrazine (TPP), tetraphenylethylene (TPE) and hexaphenylsilole (HPS). After the data set is constructed, it is divided into a training set, a validation set and a test set in a ratio of 8:1:1.
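An 8:1:1 division of the 672 entries can be sketched as follows. Shuffling before splitting, the fixed random seed, the rounding rule (any remainder goes to the test set) and the function name `split_811` are assumptions; the text specifies only the ratio.

```python
import random


def split_811(items, seed=0):
    """Shuffle and split into training, validation and test sets in an 8:1:1 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder after rounding goes to the test set
    return train, val, test


# Indices standing in for the 672 database entries.
train_set, val_set, test_set = split_811(range(672))
print(len(train_set), len(val_set), len(test_set))
```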

Step H2, carrying out data preprocessing on each aggregation-induced emission molecule in the training database to obtain the molecular fingerprints corresponding to each aggregation-induced emission molecule and constructing the multi-modal molecular fingerprint of each aggregation-induced emission molecule: to obtain information that machine learning can recognize and process, the molecular structure is converted into molecular fingerprints in vector form, that is, qualitative molecular descriptors are acquired, which is crucial to the accuracy of the machine learning model. In this work, two forms of qualitative molecular descriptors, also known as molecular fingerprints, are selected. A molecular fingerprint is an abstract representation of a molecule that converts (encodes) the molecule into a bit string (also called a bit vector), which makes comparison between molecules easy. Each bit of a molecular fingerprint corresponds to the presence or absence of a molecular fragment.

In one embodiment, as shown in fig. 5, MACCS fingerprints and PubChem fingerprints are selected to extract molecular and solvent features, and multi-modal molecular fingerprints are created using MACCS fingerprints and PubChem fingerprints to improve the accuracy of a machine learning model for large-scale screening of molecules, i.e., stitching together different types of molecular fingerprints as a new molecular fingerprint. By using the strategy, the data of various molecular fingerprints are combined to form the characteristic with more complete information, and the accuracy of the XGBoost model for predicting the fluorescence quantum yield in the solid state is improved.

Step H3, machine learning model construction and model verification: the machine learning model is developed and verified, and the final result is obtained from 10 independent training and validation runs, thereby training the preset machine learning model and obtaining the fluorescence quantum yield prediction model. In this embodiment, an advanced and widely used algorithm model is selected to predict the fluorescence quantum yield of aggregation-induced emission molecules in the solid state, namely a composite model consisting of an extreme gradient boosting (XGBoost) model and a random forest model.

The extreme gradient boosting (XGBoost) model and the random forest model are each trained using the data in the training database to obtain the trained fluorescence quantum yield prediction model.

To evaluate the effectiveness of the algorithm, the mean absolute error (MAE), the coefficient of determination (R²) and the root mean square error (RMSE) are used as evaluation indexes, and a ten-fold cross-validation strategy is used to evaluate different methods under different molecular fingerprints. The XGBoost model is an additive, iterative gradient-boosted tree model, that is, an ensemble method based on regression trees, and therefore performs very well. The random forest model has great advantages in processing high-dimensional data and has high accuracy, so the fluorescence quantum yield prediction model formed by fusing the two models offers high prediction accuracy and good performance.
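The ten-fold cross-validation strategy can be illustrated by generating the index partitions. This sketch covers only the partitioning: the function name `k_fold_indices` is an assumption, no shuffling is performed (in practice the samples would normally be shuffled first), and the model fitting and scoring inside each fold are left out.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs; every sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(672, k=10))
print([len(val_idx) for _, val_idx in folds])
```

The evaluation indexes would then be computed on each validation fold and averaged over the ten folds.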

Fig. 8a is a scatter diagram of the fluorescence quantum yield predicted on the test set by the machine learning model based on MACCS fingerprints, and fig. 8b is the corresponding scatter diagram for the machine learning model based on PubChem fingerprints. Comparing these single-fingerprint results shows that the MACCS fingerprint performs better in the model than the PubChem fingerprint. In addition, fig. 8c, the scatter diagram of the fluorescence quantum yield predicted on the test set by the machine learning model based on multi-modal molecular fingerprints, shows that the multi-modal molecular fingerprint provided by the invention performs better than a single molecular fingerprint, with a coefficient of determination as high as 0.97. These results show that the fluorescence quantum yield prediction model disclosed in this embodiment has excellent performance and can be well applied to practical screening.

Step H4, constructing a virtual database to be screened: a database for machine learning model screening is constructed that contains over 10,000 potential aggregation-induced emission molecules (figs. 2a to 2c) built from commonly used electron donors (D), electron acceptors (A) and pi bridges. Because most aggregation-induced emission molecules are composed of different combinations of D, pi and A, the discrete molecular components in the prepared D, pi and A databases are combined by molecular docking into a plurality of complete aggregation-induced emission molecules. The molecular fingerprints corresponding to each aggregation-induced emission molecule are then calculated; at least two types of molecular fingerprint may be used for each molecule, and the different molecular fingerprints of each molecule are merged and spliced to obtain one or more multi-modal molecular fingerprints for each aggregation-induced emission molecule, after which the multi-modal molecular fingerprints whose correlation is higher than the preset correlation threshold are deleted.
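The combinatorial construction of the virtual database can be sketched with placeholder labels. This is an enumeration of fragment combinations only: the labels (`D1`, `A1`, `pi1`), the function name `enumerate_candidates` and the string notation are hypothetical, and a real implementation would join actual substructures at their binding sites by single bonds using a cheminformatics toolkit.

```python
from itertools import product


def enumerate_candidates(donors, acceptors, bridges=()):
    """List donor/acceptor combinations, optionally linked through pi bridges."""
    molecules = []
    # Direct D-A docking: D1 with every acceptor first, then D2, and so on.
    for d, a in product(donors, acceptors):
        molecules.append(f"{d}-{a}")
    for d, b, a in product(donors, bridges, acceptors):
        molecules.append(f"{d}-{b}-{a}")          # donor, pi bridge, acceptor
        molecules.append(f"{d}-{b}-{a}-{b}-{d}")  # donor arms on both sides of the acceptor
    return molecules


# Hypothetical fragment labels standing in for entries of the D, pi and A databases.
donors = ["D1", "D2"]
acceptors = ["A1", "A2", "A3"]
candidates = enumerate_candidates(donors, acceptors, bridges=["pi1"])
print(len(candidates), candidates[:4])
```

With n acceptors, the (n+1)-th entry of the direct list is D2 joined to A1, so the number of candidates grows multiplicatively with the size of each fragment database.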

Step H5, performing fluorescence quantum yield prediction on each aggregation-induced emission molecule in the constructed virtual database to be screened using the trained fluorescence quantum yield prediction model. Specifically, the multi-modal molecular fingerprints corresponding to the aggregation-induced emission molecules are input into the fluorescence quantum yield prediction model, which outputs the predicted fluorescence quantum yield of each aggregation-induced emission molecule. Based on the required fluorescence quantum yield range and the predicted values obtained for the aggregation-induced emission molecules, the molecules that fall within the range are screened from the virtual database to be screened, so that aggregation-induced emission molecules with high fluorescence quantum yield are finally obtained.

The invention discloses a method for efficiently predicting the fluorescence quantum yield of aggregation-induced emission molecules based on machine learning so as to screen aggregation-induced emission molecules. Because the efficient prediction model in this embodiment is built on data obtained from the literature, the embodiment is simple, convenient and low in cost. The machine learning model of the invention predicts the fluorescence quantum yield of aggregation-induced emission molecules and uses the underlying structure-property relationship of the molecules to help researchers design aggregation-induced emission molecules with a specific fluorescence quantum yield, which saves experimental and computational time and resources, improves experimental efficiency and avoids blind experimentation.

The embodiment also discloses an information processing terminal based on the method, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the method for screening aggregation-induced emission molecules by using machine learning when executing the computer program.

The processor generally controls the overall operation of the device, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processor may perform all or part of the steps of the methods of aggregation-induced emission molecular screening described above. Further, the processor may include one or more modules to facilitate interactions between the processor and other components. For example, the processor may include a multimedia module to facilitate interaction between the multimedia component and the processor. The processor may in some embodiments be a central processing unit, microprocessor or other data processing chip for executing program code or processing data stored in the memory, e.g. performing training steps of the fluorescence quantum yield prediction model, etc.

The memory is configured to store various types of data to support operations at the device. Examples of such data include instructions for any application or method operating on the device, contact data, phonebook data, messages, pictures, videos, and the like. The memory may be implemented by any type or combination of volatile or nonvolatile memory devices such as static random access memory, electrically erasable programmable read only memory, magnetic memory, flash memory, magnetic or optical disk.

The memory may in some embodiments be an internal storage unit of the playback device, such as a hard disk or a memory of the smart television. The memory may also be an external storage device of the playing apparatus in other embodiments, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the Smart television. Further, the memory may further include both an internal storage unit and an external storage device of the playing apparatus. The memory is used for storing application software and various data installed on the playing device, such as program codes for installing the intelligent television. The memory may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory stores thereon a control program of a screening method for aggregation-induced emission molecules, and the control program based on the method for screening aggregation-induced emission molecules using machine learning is executable by the processor, thereby implementing the method for screening aggregation-induced emission molecules using machine learning in the present embodiment.

In an exemplary embodiment, the apparatus may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.

The present embodiment also discloses a computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the method of screening aggregation-induced emission molecules using machine learning.

The embodiment discloses a method and equipment for screening aggregation-induced emission molecules by using machine learning, which are used for obtaining a high-performance fluorescent molecular material by carrying out large-scale screening on an unknown molecular structure space.

Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.

The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A method of screening for aggregation-induced emission molecules using machine learning, comprising:

Screening out aggregation-induced emission molecules with fluorescence quantum yield meeting preset requirements according to the predicted fluorescence quantum yield of each aggregation-induced emission molecule;

the step of constructing the virtual database to be screened comprises the following steps:

constructing a virtual database to be screened by using a plurality of aggregation-induced emission molecules obtained after butt joint combination;

the manner in which the substructures of the individual electron donors, electron acceptors and/or pi bridges establish the docking of the binding sites by single bonds includes:

D1 is connected with A1 to form a first aggregation-induced emission molecule, D1 is connected with A2 to form a second aggregation-induced emission molecule, and D1 is connected with An to form an n-th aggregation-induced emission molecule; after D1 is used, D2 is connected with A1 to form an (n+1)-th aggregation-induced emission molecule, D2 is connected with A2 to form an (n+2)-th aggregation-induced emission molecule, and D2 is connected with An to form a 2n-th aggregation-induced emission molecule; D is connected with a pi bridge and then connected with A to form a DA-type molecule; D is connected with a pi bridge and then connected to both sides of A to form a DAD-type molecule, where D, D1 and D2 each represent an electron donor and A, A1, A2 and An each represent an electron acceptor.

2. The method of claim 1, wherein the training method of the fluorescence quantum yield prediction model comprises:

constructing a training data set, wherein the training data set comprises a plurality of groups of sample aggregation-induced emission molecular data, and each group of sample aggregation-induced emission molecular data comprises molecular fingerprint data of an aggregation-induced emission molecule and fluorescence quantum yield corresponding to the molecular fingerprint data of the aggregation-induced emission molecule;

3. The method of claim 2, wherein the step of constructing a training data set comprises:

4. A method according to claim 3, wherein the pre-set machine learning model comprises at least one XGBoost model and at least one random forest model;

5. The method of claim 1, wherein the step of merging and stitching the molecular fingerprints corresponding to the aggregation-induced emission molecules to obtain the merged and stitched multi-modal molecular fingerprint further comprises:

6. The method of claim 5, wherein the step of separately calculating correlations between the multimodal molecular fingerprints for each aggregation-induced emission molecule comprises:

7. The method according to any one of claims 1 to 6, wherein the step of screening out the aggregation-induced emission molecules in which the fluorescence quantum yield meets a preset requirement based on the predicted fluorescence quantum yield of each of the aggregation-induced emission molecules comprises:

Obtaining range information of a preset fluorescence quantum yield;

and screening aggregation-induced emission molecules with fluorescence quantum yield within the range information of the fluorescence quantum yield from the predicted fluorescence quantum yield information.

8. An information processing terminal comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when executing the computer program.

9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.