CN110400607B - Method for expanding molecular library - Google Patents

Publication number
CN110400607B
Authority
CN
China
Prior art keywords
molecular
library
fingerprint
new
bblist
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910645529.1A
Other languages
Chinese (zh)
Other versions
CN110400607A (en)
Inventor
金霞
杨红飞
韩瑞峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huoshi Creation Technology Co ltd
Original Assignee
Hangzhou Firestone Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-07-17
Filing date
2019-07-17
Publication date
2020-06-02
Application filed by Hangzhou Firestone Technology Co ltd
Priority to CN201910645529.1A
Publication of CN110400607A
Application granted
Publication of CN110400607B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60 In silico combinatorial chemistry
    • G16C20/62 Design of libraries
    • G16C20/90 Programming languages; Computing architectures; Database systems; Data warehousing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medicinal Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for expanding a molecular library. The own molecular library is a library that is already held and stores the complete molecular formula and fingerprint of each molecule. The new molecular library is a newly uploaded library composed of a plurality of sub-libraries, and splicing those sub-libraries into a complete molecular library is time-consuming. The method screens the new library against the fingerprints of the own library before and during splicing, and adds the molecular formulas that are not already present to the own library.

Description

Method for expanding molecular library
Technical Field
The invention belongs to the technical field of molecular compound management, and particularly relates to an expansion method of a molecular library.
Background
In the management of molecular compounds, it is necessary to know which molecular formulas are in the database and whether each new chemical formula already exists in the own molecular library. In a production environment, new chemical formulas are generally supplied as a plurality of sub-libraries, each of which often contains several thousand to several tens of thousands of molecular formulas. There is currently no optimization for this scenario.
Disclosure of Invention
The invention aims to provide an online expansion method for a chemical molecular formula library that addresses the deficiencies of the prior art.
The purpose of the invention is achieved by the following technical scheme: a method for expanding a molecular library, operating on an own molecular library and a new molecular library. The own molecular library consists of mol_i {i = 1, 2, …, n}, where n is the number of chemical molecular formulas in the own molecular library. The new molecular library consists of BBlist_j {j = 1, 2, …, m}, where m is the number of sub-libraries BBlist_j in the new molecular library, and each sub-library BBlist_j consists of Mj chemical molecular formulas, denoted BBlist_j_k {k = 1, 2, …, Mj}. The expansion process comprises the following steps:
(1) The own molecular library stores the complete molecular formula mol_i {i = 1, 2, …, n} and the fingerprint fp_i {i = 1, 2, …, n} of each molecule, together with the fingerprint fp_all obtained by a bitwise OR of all the fingerprints.
(2) For the new molecular library, calculate the fingerprint fp_j_k (j = 1, 2, …, m; k = 1, 2, …, Mj) of each molecular formula BBlist_j_k in each BBlist_j.
(3) Screen the molecular formulas in BBlist: compare each fingerprint fp_j_k with the fingerprint fp_all. If any bit of fp_j_k is absent from fp_all, i.e. the bit is 1 in fp_j_k and 0 in fp_all, remove BBlist_j_k and fp_j_k so that they take no part in the subsequent calculation. If a fingerprint fp_j_k is not filtered out, compare it with each fp_i and record the set FP_K of fingerprints fp_i that contain all the bits of fp_j_k.
(4) Splice the molecular formulas in the screened BBlist one step at a time, and continue screening during splicing. Specifically: splice two BBlist_j, merge the two FP_K sets of the molecular formulas joined in each splice to obtain a new FP_K set for the spliced molecular formula, and perform the screening of step (3) within the new FP_K set.
(5) Repeat step (4) until the complete new molecular library is obtained, together with the set FP_K in the own molecular library corresponding to each molecular formula in the new molecular library.
(6) Screen out the molecular formulas that are identical to a molecular formula in the own molecular library, and add the remaining, different molecular formulas to the own molecular library, thereby expanding the molecular formula library. The screening process is as follows: for each molecular formula remaining in the new molecular library, search its corresponding FP_K for an identical molecular formula. First compare whether the fingerprint of the molecular formula is identical to the fingerprint of some molecular formula in FP_K; if the fingerprints are identical, perform exact matching on the two molecular formulas, otherwise filter the molecular formula out. During exact matching, the molecular formula is filtered out if the match fails.
Further, in step (4), only molecular formulas carrying the same site label can be spliced. A site in a molecular formula is marked [i], i = 0, 1, …, N, where [i] corresponds to a reaction type and is obtained by carrying out a chemical reaction of that type with a reagent molecule, and N is the total number of reaction types.
Further, in step (4), the screening of step (3) within the new FP_K set is specifically: calculate a new fingerprint for the spliced molecular formula and screen it against fp_all; if the new fingerprint is not filtered out, select from the new FP_K set the subset FP_K_sub of molecular formulas whose fingerprints contain all the bits of the new fingerprint.
Further, in step (6), the exact matching uses SMARTS pattern matching.
The invention has the following beneficial effects: the method screens the newly uploaded new molecular library and adds the molecular formulas that remain after screening to the own molecular library, thereby expanding the molecular library. The new molecular library generally consists of a plurality of sub-libraries that must be spliced into a complete molecular library before screening, and splicing a plurality of sub-libraries is time-consuming; the method accelerates this process.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, the method for expanding a molecular library provided by the present invention specifically includes the following steps:
Two molecular libraries, each storing a large number of chemical molecular formulas, are involved: an own molecular library and a new molecular library. The own molecular library consists of mol_i {i = 1, 2, …, n}, where n is the number of chemical molecular formulas in the own molecular library. The new molecular library consists of BBlist_j {j = 1, 2, …, m}, where m is the number of sub-libraries BBlist_j in the new molecular library. Each sub-library BBlist_j consists of Mj chemical molecular formulas, and each molecular formula is a building block, denoted BBlist_j_k {k = 1, 2, …, Mj}. In the actual online service, the own molecular library is a library that is already held and stores the complete molecular formula and fingerprint of each molecule, while the new molecular library is newly uploaded and consists of a plurality of sub-libraries BBlist_j. Splicing a plurality of BBlist_j into the complete molecular library is time-consuming. When the own molecular library and the new molecular library are compared, only the molecular formulas that need to be spliced are taken, and those that do not need to be spliced are filtered out. The molecular formulas identical to ones in the own molecular library are found, and the remaining, different molecular formulas are added to the own molecular library. The expansion process of the molecular formula library comprises the following steps:
1. The own molecular library stores the complete molecular formula mol_i {i = 1, 2, …, n} and the fingerprint fp_i {i = 1, 2, …, n} of each molecule, together with the fingerprint fp_all obtained by a bitwise OR of all the fingerprints.
2. For the new molecular library, calculate the fingerprint fp_j_k (j = 1, 2, …, m; k = 1, 2, …, Mj) of each molecular formula BBlist_j_k in each BBlist_j.
3. Screen the molecular formulas in BBlist: compare each fingerprint fp_j_k with the fingerprint fp_all. If any bit of fp_j_k is absent from fp_all, i.e. the bit is 1 in fp_j_k and 0 in fp_all, remove BBlist_j_k and fp_j_k so that they take no part in the subsequent calculation. Because a bit set to 1 in a fingerprint indicates that the molecule has a particular atomic sequence or property, a bit that is 1 in fp_j_k and 0 in fp_all means that BBlist_j_k has an atomic sequence or property that no molecule in the own molecular library has, so BBlist_j_k cannot be identical to any mol_i.
During screening, if a fingerprint fp_j_k is not filtered out, it is compared with each fp_i, and the set FP_K of fingerprints fp_i that contain all the bits of fp_j_k is recorded.
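The fingerprint pre-screen of steps 1 to 3 can be sketched as follows. This is a minimal illustration, not the patented implementation: fingerprints are modelled as plain Python integers used as bit sets, the 8-bit values are invented toy data, and the helper names `build_fp_all`, `passes_fp_all` and `candidate_set` are hypothetical.

```python
from functools import reduce
from operator import or_

def build_fp_all(library_fps):
    """Step 1: OR together every fingerprint fp_i of the own library."""
    return reduce(or_, library_fps, 0)

def passes_fp_all(fp, fp_all):
    """Step 3 filter: keep fp only if it sets no bit that fp_all lacks."""
    return (fp & ~fp_all) == 0

def candidate_set(fp, library_fps):
    """FP_K as indices i whose fp_i contains every bit that is set in fp."""
    return {i for i, fp_i in enumerate(library_fps) if (fp & fp_i) == fp}

# Toy 8-bit fingerprints (assumed values, not real chemistry).
library_fps = [0b00000111, 0b00110001, 0b01010100]
fp_all = build_fp_all(library_fps)

kept = passes_fp_all(0b00000101, fp_all)     # bits 0 and 2 both occur in fp_all
dropped = passes_fp_all(0b10000001, fp_all)  # bit 7 occurs nowhere in the library
fp_k = candidate_set(0b00000101, library_fps)
```

The single comparison against fp_all rejects a building block without touching any individual fp_i, which is why it is worth applying before the more expensive per-molecule comparison that builds FP_K.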
4. Splice the molecular formulas in the screened BBlist one step at a time, and continue screening during splicing:
The complete molecular formulas are obtained by splicing all BBlist_j {j = 1, 2, …, m}: each molecular formula in BBlist_1 is spliced with every spliceable molecular formula in BBlist_2 to obtain BBlist_1_2, then each molecular formula in BBlist_1_2 is spliced with every spliceable molecular formula in BBlist_3 to obtain BBlist_1_2_3, and so on. A site in a molecular formula is marked [i], i = 0, 1, …, N, where [i] corresponds to a reaction type and is obtained by carrying out a chemical reaction of that type with a reagent molecule, and N is the total number of reaction types. Molecular formulas carrying the same site label can be spliced. For example, the molecule [1]C=O in BBlist_1 and [2]CCO[1] in BBlist_2 share the site label [1] and can be spliced, whereas [1]C=O cannot be spliced with [2]CCO[3].
In this step, two BBlist_j are spliced, for example BBlist_1 and BBlist_2. The splicing order for forming the complete molecular library is assumed to be 1, 2, …, m: BBlist_1 and BBlist_2 are spliced first, the result is spliced with BBlist_3, and so on until the m-th BBlist has been spliced and the complete molecular library is obtained.
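The site-label rule can be illustrated with a short sketch. It assumes that a site label is written as a bare integer in square brackets, as in the example strings of the text, and it only decides which pairs may be spliced; how the spliced molecular formula itself is built is not specified here. The names `site_labels`, `spliceable` and `splice_pairs` are hypothetical.

```python
import re

# Assumed textual form of a site label: digits only inside brackets, e.g. [1].
SITE = re.compile(r"\[(\d+)\]")

def site_labels(formula):
    """Return the set of site labels carried by a site-labelled formula string."""
    return {int(label) for label in SITE.findall(formula)}

def spliceable(a, b):
    """Two formulas may be spliced when they share at least one site label."""
    return bool(site_labels(a) & site_labels(b))

def splice_pairs(bblist_a, bblist_b):
    """Enumerate every spliceable (a, b) pair between two sub-libraries."""
    return [(a, b) for a in bblist_a for b in bblist_b if spliceable(a, b)]

bblist_1 = ["[1]C=O"]
bblist_2 = ["[2]CCO[1]", "[2]CCO[3]"]
pairs = splice_pairs(bblist_1, bblist_2)
```

Only the pair sharing label [1] survives, so the number of splices actually performed can be far smaller than the full product of the two sub-library sizes.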
After each splice, the two FP_K sets of the spliced molecular formulas are merged to obtain a new FP_K set for the spliced molecular formula, and the screening of step 3 is performed within the new FP_K set: a new fingerprint is calculated for the spliced molecular formula and screened against fp_all; if the new fingerprint is not filtered out, the subset FP_K_sub of molecular formulas whose fingerprints contain all the bits of the new fingerprint is selected from the new FP_K set.
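A possible reading of this re-screening step is sketched below. It assumes that "merging" the two FP_K sets means taking their set union, that FP_K is held as a set of indices into the own library, and that fingerprints are integers used as bit sets; the function name `rescreen` and all values are hypothetical.

```python
def rescreen(new_fp, fp_all, fp_k_a, fp_k_b, library_fps):
    """Screen one spliced formula.

    Returns FP_K_sub as a set of library indices, or None when the
    spliced formula is filtered out by the fp_all test.
    """
    if new_fp & ~fp_all:
        return None  # some bit of new_fp is absent from the whole library
    merged = fp_k_a | fp_k_b  # assumption: merging is a set union
    # Only the merged candidates are examined, not the whole library.
    return {i for i in merged if (new_fp & library_fps[i]) == new_fp}

# Toy data (assumed values).
library_fps = [0b00000111, 0b00110001, 0b01010100]
fp_all = 0b01110111

sub = rescreen(0b00000110, fp_all, {0}, {0, 2}, library_fps)
gone = rescreen(0b10000000, fp_all, {0}, {0, 2}, library_fps)
```

The point of carrying FP_K along is that the containment check after each splice runs over a handful of candidates instead of all n fingerprints of the own library.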
5. Repeat step 4 until the complete new molecular library is obtained, together with the set FP_K in the own molecular library corresponding to each molecular formula in the new molecular library.
6. Screen out the molecular formulas that are identical to a molecular formula in the own molecular library, and add the remaining, different molecular formulas to the own molecular library, thereby expanding the molecular formula library. The screening process is as follows: for each molecular formula remaining in the new molecular library, search its corresponding FP_K for an identical molecular formula. First compare whether the fingerprint of the molecular formula is identical to the fingerprint of some molecular formula in FP_K; if the fingerprints are identical, perform exact matching on the two molecular formulas, otherwise filter the molecular formula out. During exact matching, the molecular formula is filtered out if the match fails. The exact matching may use SMARTS pattern matching.
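The final duplicate check of step 6 might look like the sketch below. It reflects one interpretation of the text: a new molecular formula is treated as a duplicate only if some candidate in its FP_K has an identical fingerprint and also passes exact matching, and every non-duplicate is added. The exact matcher is passed in as a callable; plain string equality is used here purely as a stand-in for the pattern matching the text mentions. The name `expand_library` and the data are hypothetical.

```python
def expand_library(library, new_entries, exact_match):
    """Return the own library extended with the non-duplicate new formulas.

    library:     list of (formula, fingerprint) pairs
    new_entries: list of (formula, fingerprint, fp_k) where fp_k is a set
                 of indices into library
    exact_match: callable(formula_a, formula_b) -> bool
    """
    added = []
    for formula, fp, fp_k in new_entries:
        duplicate = any(
            # Cheap fingerprint equality first, exact matching only on a hit.
            library[i][1] == fp and exact_match(formula, library[i][0])
            for i in fp_k
        )
        if not duplicate:
            added.append((formula, fp))
    return library + added

# Toy data with invented fingerprints.
library = [("CCO", 0b0111), ("C=O", 0b0011)]
new_entries = [
    ("CCO", 0b0111, {0}),     # same fingerprint and same formula: duplicate
    ("CCN", 0b0111, {0}),     # same fingerprint, exact match fails: added
    ("CCC", 0b0001, {0, 1}),  # no identical fingerprint in FP_K: added
]
expanded = expand_library(library, new_entries, lambda a, b: a == b)
```

Ordering the checks this way means the costly exact matcher runs only for candidates whose fingerprints already agree bit for bit.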
It should be noted that the summary and the detailed description of the invention are intended to demonstrate the practical application of the technical solutions provided by the present invention, and should not be construed as limiting the scope of the present invention. Any modification and variation of the present invention within the spirit of the present invention and the scope of the claims will fall within the scope of the present invention.

Claims (4)

1. A method for expanding a molecular library, characterized in that, for an own molecular library and a new molecular library, the own molecular library consists of mol_i {i = 1, 2, …, n}, n being the number of chemical molecular formulas in the own molecular library; the new molecular library consists of BBlist_j {j = 1, 2, …, m}, m being the number of sub-libraries BBlist_j in the new molecular library; each sub-library BBlist_j consists of Mj chemical molecular formulas, denoted BBlist_j_k {k = 1, 2, …, Mj}; the expansion process comprises the following steps:
(1) storing in the own molecular library the complete molecular formula mol_i {i = 1, 2, …, n} and the fingerprint fp_i {i = 1, 2, …, n} of each molecule, together with the fingerprint fp_all obtained by a bitwise OR of all the fingerprints;
(2) calculating the fingerprint fp_j_k (j = 1, 2, …, m; k = 1, 2, …, Mj) of each molecular formula BBlist_j_k in each BBlist_j;
(3) screening the molecular formulas in BBlist_j: comparing each fingerprint fp_j_k with the fingerprint fp_all and, if any bit of fp_j_k is absent from fp_all, i.e. the bit is 1 in fp_j_k and 0 in fp_all, removing BBlist_j_k and fp_j_k so that they take no part in the subsequent calculation; during screening, if a fingerprint fp_j_k is not filtered out, comparing it with each fp_i and recording the set FP_K of fingerprints fp_i that contain all the bits of fp_j_k;
(4) splicing the screened molecular formulas in BBlist_j one step at a time and continuing to screen during splicing, specifically: splicing two BBlist_j, merging the two FP_K sets of the molecular formulas joined in each splice to obtain a new FP_K set for the spliced molecular formula, and performing the screening of step (3) within the new FP_K set;
(5) repeating step (4) until the complete new molecular library is obtained, together with the set FP_K in the own molecular library corresponding to each molecular formula in the new molecular library;
(6) screening out the molecular formulas that are identical to a molecular formula in the own molecular library and adding the remaining, different molecular formulas to the own molecular library, thereby expanding the molecular formula library, wherein the screening process is as follows: for each molecular formula remaining in the new molecular library, searching its corresponding FP_K for an identical molecular formula; first comparing whether the fingerprint of the molecular formula is identical to the fingerprint of some molecular formula in FP_K and, if the fingerprints are identical, performing exact matching on the two molecular formulas, otherwise filtering the molecular formula out; during exact matching, the molecular formula is filtered out if the match fails.
2. The method for expanding a molecular library according to claim 1, wherein in step (4) only molecular formulas carrying the same site label can be spliced, a site in a molecular formula is marked [i], i = 0, 1, …, N, [i] corresponds to a reaction type and is obtained by carrying out a chemical reaction of that type with a reagent molecule, and N is the total number of reaction types.
3. The method for expanding a molecular library according to claim 1, wherein in step (4) the screening of step (3) within the new FP_K set is specifically: calculating a new fingerprint for the spliced molecular formula and screening it against fp_all and, if the new fingerprint is not filtered out, selecting from the new FP_K set the subset FP_K_sub of molecular formulas whose fingerprints contain all the bits of the new fingerprint.
4. The method for expanding a molecular library according to claim 1, wherein in step (6) the exact matching uses SMARTS pattern matching.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910645529.1A CN110400607B (en) 2019-07-17 2019-07-17 Method for expanding molecular library

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910645529.1A CN110400607B (en) 2019-07-17 2019-07-17 Method for expanding molecular library

Publications (2)

Publication Number Publication Date
CN110400607A (en) 2019-11-01
CN110400607B (en) 2020-06-02

Family

ID=68325775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910645529.1A Active CN110400607B (en) 2019-07-17 2019-07-17 Method for expanding molecular library

Country Status (1)

Country Link
CN (1) CN110400607B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108149325A (en) * 2016-12-02 2018-06-12 杭州阿诺生物医药科技股份有限公司 Synthesis and screening method for DNA-encoded dynamic molecular libraries

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
EP1908774A1 (en) * 2006-10-06 2008-04-09 sanofi-aventis Antibacterial and antiviral peptides from Actinomadura namibiensis
CN104750761B (en) * 2013-12-31 2018-06-22 上海致化化学科技有限公司 Method for building and searching a molecular structure database
CN109686413A (en) * 2018-12-24 2019-04-26 杭州费尔斯通科技有限公司 Chemical molecular formula search method based on ES inverted index

Also Published As

Publication number Publication date
CN110400607A (en) 2019-11-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 7 / F, building B, 482 Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province 310000

Patentee after: Huoshi Creation Technology Co.,Ltd.

Address before: 7 / F, building B, 482 Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province 310000

Patentee before: HANGZHOU FIRESTONE TECHNOLOGY Co.,Ltd.