CN110400607A - A method for expanding a molecular formula library - Google Patents

A method for expanding a molecular formula library

Info

Publication number
CN110400607A
Authority
CN
China
Prior art keywords
library
molecular formula
molecules
fingerprint
bblist
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910645529.1A
Other languages
Chinese (zh)
Other versions
CN110400607B (en)
Inventor
金霞
杨红飞
韩瑞峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huoshi Creation Technology Co ltd
Original Assignee
Hangzhou Firestone Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Firestone Technology Co Ltd
Priority to CN201910645529.1A
Publication of CN110400607A
Application granted
Publication of CN110400607B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60 In silico combinatorial chemistry
    • G16C20/62 Design of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/90 Programming languages; Computing architectures; Database systems; Data warehousing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for expanding a molecular formula library, applied to an in-house molecular library and an incoming molecular library. The in-house molecular library is the library already held, and it stores the complete molecular formula and fingerprint of each molecule. The incoming molecular library is a newly uploaded library made up of several sub-libraries. Because splicing multiple sub-libraries into a complete molecular library is time-consuming, the invention splices only the molecular formulas that must be spliced when comparing the in-house library with the incoming library: molecular formulas that can be shown not to overlap without any splicing are filtered out, the molecular formulas identical to those in the in-house library are identified, and the remaining, different molecular formulas are added to the in-house library, achieving rapid expansion of the molecular formula library.

Description

A method for expanding a molecular formula library
Technical field
The invention belongs to the technical field of molecular compound management, and in particular relates to a method for expanding a molecular formula library.
Background
In the management of molecular compounds, the molecular formulas in the database must be fully known, and for every newly added chemical formula it must be determined whether it already exists in the in-house molecular library. In a production environment, newly added chemical formulas are usually supplied as multiple sub-libraries, each typically containing thousands to tens of thousands of molecular formulas. With current molecule splicing methods, splicing sub-libraries of this scale on a single machine often takes tens of days, and running multiple machines in parallel also consumes a large amount of computing resources, so production requirements cannot be met. No optimization scheme for this scenario currently exists.
Summary of the invention
In view of the above deficiencies in the prior art, the object of the invention is to provide an online method for expanding a chemical molecular formula library.
The object of the invention is achieved by the following technical solution: a method for expanding a molecular formula library, applied to an in-house molecular library and an incoming molecular library. The in-house molecular library consists of mol_i {i = 1, 2, ..., n}, where n is the number of chemical molecular formulas in the in-house library. The incoming molecular library consists of BBlist_j {j = 1, 2, ..., m}, where m is the number of sub-libraries BBlist_j in the incoming library; each sub-library BBlist_j consists of Mj chemical molecular formulas, denoted BBlist_j_k {k = 1, 2, ..., Mj}. The expansion process comprises the following steps:
(1) The in-house molecular library stores the complete molecular formula mol_i {i = 1, 2, ..., n} and the fingerprint fp_i {i = 1, 2, ..., n} of each molecule, together with the fingerprint fp_all obtained by a bitwise OR over all of the fingerprints.
(2) For an incoming molecular library, compute the fingerprint fp_j_k of each molecular formula BBlist_j_k in each BBlist_j, j = 1, 2, ..., m, k = 1, 2, ..., Mj.
(3) Screen the molecular formulas in the BBlists: compare each fingerprint fp_j_k with the fingerprint fp_all. If fp_j_k has a bit that fp_all lacks, i.e. the bit is 1 in fp_j_k and 0 in fp_all, remove BBlist_j_k and fp_j_k so that they take no part in subsequent computation. During screening, if a fingerprint fp_j_k is not filtered out, compare it with each fp_i and record the set FP_K of fingerprints fp_i that contain all the bits of fp_j_k.
(4) Perform one splicing step on the molecular formulas in the screened BBlists and continue screening during splicing, specifically: splice two BBlist_j; after each splice, merge the sets FP_K of the two molecular formulas to obtain the new FP_K set of the spliced molecular formula, and perform the screening of step 3 within the new FP_K set.
(5) Repeat step 4 until the complete incoming molecular library is obtained, together with the set FP_K in the in-house molecular library that corresponds to each molecular formula in the incoming library.
(6) Screen out the molecular formulas in the incoming molecular library that duplicate those in the in-house molecular library, and add the remaining, different molecular formulas to the in-house library, thereby expanding the molecular formula library. The screening process is as follows: for each molecular formula remaining in the incoming library, search its corresponding FP_K for a duplicate molecular formula. First compare whether the fingerprint of the molecular formula is identical to the fingerprint of some molecular formula in FP_K; if identical, exactly match the two molecular formulas, otherwise filter out the molecular formula. During exact matching, filter out the molecular formula if the match fails.
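Steps (1) to (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: fingerprints are assumed to be held as Python integers used as bit vectors, and the helper names `build_fp_all`, `passes_prefilter` and `candidate_set` are hypothetical.

```python
from functools import reduce
from operator import or_


def build_fp_all(own_fps):
    """Step (1): bitwise OR of every in-house fingerprint fp_i."""
    return reduce(or_, own_fps, 0)


def passes_prefilter(fp, fp_all):
    """Step (3): reject fp if any bit is 1 in fp but 0 in fp_all."""
    return (fp & ~fp_all) == 0


def candidate_set(fp, own_fps):
    """Step (3): indices i whose fp_i contains every bit of fp (the set FP_K)."""
    return {i for i, fp_i in enumerate(own_fps) if (fp & fp_i) == fp}


# Toy 4-bit fingerprints standing in for real molecular fingerprints.
own_fps = [0b1010, 0b0110, 0b1000]
fp_all = build_fp_all(own_fps)

for fp in (0b0001, 0b0010, 0b1000):
    if passes_prefilter(fp, fp_all):
        print(bin(fp), "kept, FP_K =", sorted(candidate_set(fp, own_fps)))
    else:
        print(bin(fp), "removed before any splicing")
```

The prefilter costs one AND per building block, so a fragment carrying a feature that no in-house molecule has is discarded before it can multiply the number of splices.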
Further, in step (4), only molecular formulas bearing the same site label can be spliced. A site in a molecular formula is denoted [i], i = 0, 1, ..., N, where [i] corresponds to a particular reaction type and is obtained by carrying out a chemical reaction of that type with a reagent molecule, and N is the total number of reaction types.
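The site-label condition can be illustrated with a small check. This sketch assumes that site labels appear as purely numeric bracketed tokens such as `[1]` inside a formula string; the function names `site_labels` and `can_splice` and the sample strings are hypothetical, and the actual joining chemistry is not modelled.

```python
import re

# Matches bracketed integers such as "[1]"; other bracketed tokens are ignored.
SITE = re.compile(r"\[(\d+)\]")


def site_labels(formula):
    """Return the set of numeric site labels found in a formula string."""
    return {int(label) for label in SITE.findall(formula)}


def can_splice(a, b):
    """Two formulas are splice candidates only if they share a site label."""
    return bool(site_labels(a) & site_labels(b))


print(can_splice("[1]C=O", "[2]CCO[1]"))  # share label 1
print(can_splice("[1]C=O", "[2]CCO[3]"))  # no label in common
```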
Further, in step (4), the screening of step 3 within the new FP_K set is specifically: compute a new fingerprint for the spliced molecular formula and first screen the new fingerprint against fp_all; if it is not filtered out, screen the new FP_K set for the set FP_K_sub of molecular formulas that contain all the bits of the new fingerprint.
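The screening applied after each splice might look like the sketch below. It assumes integer bit-vector fingerprints, represents each FP_K as a set of indices into the in-house library, and reads "merge" as set union; these choices and the name `screen_spliced` are assumptions made for illustration only.

```python
def screen_spliced(new_fp, fp_all, fp_k_a, fp_k_b, own_fps):
    """Screen one spliced molecule.

    Returns None if the new fingerprint fails the fp_all test, otherwise the
    subset FP_K_sub of the merged candidate sets whose fingerprints contain
    every bit of new_fp.
    """
    if new_fp & ~fp_all:
        return None  # a bit is set that no in-house molecule has
    merged = fp_k_a | fp_k_b  # new FP_K set (union, by assumption)
    return {i for i in merged if (new_fp & own_fps[i]) == new_fp}


# Toy data: three in-house fingerprints and two fragments' candidate sets.
own_fps = [0b1110, 0b0111, 0b1100]
fp_all = 0b1111
fp_k_a = {0, 1, 2}  # candidates of a fragment with fingerprint 0b0100
fp_k_b = {0, 1}     # candidates of a fragment with fingerprint 0b0010

print(screen_spliced(0b0110, fp_all, fp_k_a, fp_k_b, own_fps))
print(screen_spliced(0b10000, fp_all, fp_k_a, fp_k_b, own_fps))
```

Because only the merged candidates are re-checked, the cost of each screening pass depends on the size of FP_K rather than on the size of the whole in-house library.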
Further, in step (6), the exact matching method is SMARTS pattern matching.
The beneficial effects of the invention are as follows: the method screens the newly uploaded incoming molecular library and adds the molecular formulas remaining after screening to the in-house molecular library already held, thereby expanding the library. An incoming molecular library usually consists of several sub-libraries, which must be spliced into a complete molecular library before screening can be carried out. Because splicing multiple sub-libraries is time-consuming, the invention provides a method for accelerating this process.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention.
Detailed description
To make the object, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawing and embodiments. It should be understood that the specific embodiments described here merely explain the invention and are not intended to limit it.
As shown in Fig. 1, the method for expanding a molecular formula library provided by the invention is as follows:
Two molecular libraries, each storing a large number of chemical molecular formulas, are involved: an in-house molecular library and an incoming molecular library. The in-house molecular library consists of mol_i {i = 1, 2, ..., n}, where n is the number of chemical molecular formulas in the in-house library. The incoming molecular library consists of BBlist_j {j = 1, 2, ..., m}, where m is the number of sub-libraries BBlist_j in the incoming library. Each sub-library BBlist_j consists of Mj chemical molecular formulas, each of which is a building block, denoted BBlist_j_k {k = 1, 2, ..., Mj}. In an actual online service, the in-house molecular library is the library already held, storing the complete molecular formula and fingerprint of each molecule, and the incoming molecular library is a newly uploaded library consisting of several sub-libraries BBlist_j. Because splicing multiple BBlist_j into a complete molecular library is time-consuming, only the molecular formulas that must be spliced are spliced when comparing the in-house library with the incoming library: molecular formulas that can be shown not to overlap without any splicing are filtered out, the molecular formulas identical to those in the in-house library are identified, and the remaining, different molecular formulas are added to the in-house library, thereby expanding the molecular formula library. The expansion process comprises the following steps:
1. The in-house molecular library stores the complete molecular formula mol_i {i = 1, 2, ..., n} and the fingerprint fp_i {i = 1, 2, ..., n} of each molecule, together with the fingerprint fp_all obtained by a bitwise OR over all of the fingerprints.
2. For an incoming molecular library, compute the fingerprint fp_j_k of each molecular formula BBlist_j_k in each BBlist_j, j = 1, 2, ..., m, k = 1, 2, ..., Mj.
3. Screen the molecular formulas in the BBlists: compare each fingerprint fp_j_k with the fingerprint fp_all. If fp_j_k has a bit that fp_all lacks, i.e. the bit is 1 in fp_j_k and 0 in fp_all, remove BBlist_j_k and fp_j_k so that they take no part in subsequent computation. A bit set to 1 in a fingerprint indicates that the molecule has a particular atom sequence or attribute. For a given bit, a value of 1 in fp_j_k and 0 in fp_all therefore shows that fp_j_k has an atom sequence or attribute that fp_all lacks; that is, the atom sequence or attribute does not occur anywhere in the in-house molecular library, so fp_j_k cannot be identical to any mol_i.
During screening, if a fingerprint fp_j_k is not filtered out, compare it with each fp_i and record the set FP_K of fingerprints fp_i that contain all the bits of fp_j_k.
4. Perform one splicing step on the molecular formulas in the screened BBlists, and continue screening during splicing:
A complete molecular formula is obtained by splicing all of BBlist_j {j = 1, 2, ..., m}: each molecular formula in BBlist_1 is spliced with every molecular formula in BBlist_2 with which it can be spliced, giving BBlist_1_2; each molecular formula in BBlist_1_2 is then spliced with every molecular formula in BBlist_3 with which it can be spliced, giving BBlist_1_2_3. A site in a molecular formula is denoted [i], i = 0, 1, ..., N, where [i] corresponds to a particular reaction type and is obtained by carrying out a chemical reaction of that type with a reagent molecule, and N is the total number of reaction types. Only molecular formulas bearing the same site label can be spliced. For example, the molecule [1]C=O in BBlist_1 and [2]CCO[1] in BBlist_2 share the site label [1] and can be spliced, whereas [1]C=O cannot be spliced with [2]CCO[3].
In this step, two BBlist_j are spliced, for example BBlist_1 and BBlist_2. Assume the splicing order that forms the complete molecular library is 1, 2, ..., m; that is, 1 is first spliced with 2, the result is then spliced with 3, and so on until the m-th BBlist has been spliced, giving the complete molecular library.
After each splice, the sets FP_K of the two molecular formulas are merged to obtain the new FP_K set of the spliced molecular formula, and the screening of step 3 is performed within the new FP_K set: a new fingerprint is computed for the spliced molecular formula and first screened against fp_all; if it is not filtered out, the new FP_K set is screened for the set FP_K_sub of molecular formulas that contain all the bits of the new fingerprint.
5. Repeat step 4 until the complete incoming molecular library is obtained, together with the set FP_K in the in-house molecular library that corresponds to each molecular formula in the incoming library.
6. Screen out the molecular formulas in the incoming molecular library that duplicate those in the in-house molecular library, and add the remaining, different molecular formulas to the in-house library, thereby expanding the molecular formula library. The screening process is as follows: for each molecular formula remaining in the incoming library, search its corresponding FP_K for a duplicate molecular formula. First compare whether the fingerprint of the molecular formula is identical to the fingerprint of some molecular formula in FP_K; if identical, exactly match the two molecular formulas, otherwise filter out the molecular formula. During exact matching, filter out the molecular formula if the match fails. SMARTS pattern matching may be used as the exact matching method.
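The final de-duplication step can be sketched as follows. The sketch rests on one reading of the text: a candidate in FP_K is discarded when its fingerprint differs or the exact match fails, and an incoming molecule counts as a duplicate only if some candidate passes both checks. The function `find_new_molecules` is hypothetical, and plain string equality stands in for the SMARTS pattern matching named in the text.

```python
def find_new_molecules(incoming, own, exact_match):
    """Return incoming formulas that have no duplicate in the in-house library.

    incoming: list of (formula, fingerprint, FP_K index set)
    own:      list of (formula, fingerprint)
    exact_match: callable(formula_a, formula_b) -> bool
    """
    new_formulas = []
    for formula, fp, fp_k in incoming:
        duplicate = any(
            own[i][1] == fp and exact_match(formula, own[i][0])
            for i in fp_k
        )
        if not duplicate:
            new_formulas.append(formula)
    return new_formulas


def same_string(a, b):
    """Placeholder matcher; a real system would use structure matching."""
    return a == b


own = [("CCO", 0b0110), ("CC=O", 0b1010)]
incoming = [
    ("CCO", 0b0110, {0}),     # same fingerprint, exact match succeeds
    ("CCN", 0b0110, {0}),     # same fingerprint, exact match fails
    ("CCC", 0b0010, {0, 1}),  # no candidate has an identical fingerprint
]
result = find_new_molecules(incoming, own, same_string)
print(result)
```

Checking fingerprint equality first keeps the expensive exact match for the few candidates that could actually be identical.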
It should be understood that the summary and the specific embodiments are intended to demonstrate the practical application of the technical solution provided by the invention and should not be construed as limiting its scope of protection. Any modification or change made to the invention within its spirit and the scope of protection of the claims falls within the scope of protection of the invention.

Claims (4)

1. A method for expanding a molecular formula library, characterized in that it is applied to an in-house molecular library and an incoming molecular library; the in-house molecular library consists of mol_i {i = 1, 2, ..., n}, where n is the number of chemical molecular formulas in the in-house library; the incoming molecular library consists of BBlist_j {j = 1, 2, ..., m}, where m is the number of sub-libraries BBlist_j in the incoming library; each sub-library BBlist_j consists of Mj chemical molecular formulas, denoted BBlist_j_k {k = 1, 2, ..., Mj}. The expansion process comprises the following steps:
(1) The in-house molecular library stores the complete molecular formula mol_i {i = 1, 2, ..., n} and the fingerprint fp_i {i = 1, 2, ..., n} of each molecule, together with the fingerprint fp_all obtained by a bitwise OR over all of the fingerprints.
(2) For an incoming molecular library, compute the fingerprint fp_j_k of each molecular formula BBlist_j_k in each BBlist_j, j = 1, 2, ..., m, k = 1, 2, ..., Mj.
(3) Screen the molecular formulas in the BBlists: compare each fingerprint fp_j_k with the fingerprint fp_all. If fp_j_k has a bit that fp_all lacks, i.e. the bit is 1 in fp_j_k and 0 in fp_all, remove BBlist_j_k and fp_j_k so that they take no part in subsequent computation. During screening, if a fingerprint fp_j_k is not filtered out, compare it with each fp_i and record the set FP_K of fingerprints fp_i that contain all the bits of fp_j_k.
(4) Perform one splicing step on the molecular formulas in the screened BBlists and continue screening during splicing, specifically: splice two BBlist_j; after each splice, merge the sets FP_K of the two molecular formulas to obtain the new FP_K set of the spliced molecular formula, and perform the screening of step 3 within the new FP_K set.
(5) Repeat step 4 until the complete incoming molecular library is obtained, together with the set FP_K in the in-house molecular library that corresponds to each molecular formula in the incoming library.
(6) Screen out the molecular formulas in the incoming molecular library that duplicate those in the in-house molecular library, and add the remaining, different molecular formulas to the in-house library, thereby expanding the molecular formula library, wherein the screening process is as follows: for each molecular formula remaining in the incoming library, search its corresponding FP_K for a duplicate molecular formula; first compare whether the fingerprint of the molecular formula is identical to the fingerprint of some molecular formula in FP_K; if identical, exactly match the two molecular formulas, otherwise filter out the molecular formula; during exact matching, filter out the molecular formula if the match fails.
2. The method for expanding a molecular formula library according to claim 1, characterized in that, in step (4), only molecular formulas bearing the same site label can be spliced; a site in a molecular formula is denoted [i], i = 0, 1, ..., N, where [i] corresponds to a particular reaction type and is obtained by carrying out a chemical reaction of that type with a reagent molecule, and N is the total number of reaction types.
3. The method for expanding a molecular formula library according to claim 1, characterized in that, in step (4), the screening of step 3 within the new FP_K set is specifically: compute a new fingerprint for the spliced molecular formula and first screen the new fingerprint against fp_all; if it is not filtered out, screen the new FP_K set for the set FP_K_sub of molecular formulas that contain all the bits of the new fingerprint.
4. The method for expanding a molecular formula library according to claim 1, characterized in that, in step (6), the exact matching method is SMARTS pattern matching.
CN201910645529.1A 2019-07-17 2019-07-17 Method for expanding molecular library Active CN110400607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910645529.1A CN110400607B (en) 2019-07-17 2019-07-17 Method for expanding molecular library

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910645529.1A CN110400607B (en) 2019-07-17 2019-07-17 Method for expanding molecular library

Publications (2)

Publication Number Publication Date
CN110400607A true CN110400607A (en) 2019-11-01
CN110400607B CN110400607B (en) 2020-06-02

Family

ID=68325775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910645529.1A Active CN110400607B (en) 2019-07-17 2019-07-17 Method for expanding molecular library

Country Status (1)

Country Link
CN (1) CN110400607B (en)

Also Published As

Publication number Publication date
CN110400607B (en) 2020-06-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 7 / F, building B, 482 Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province 310000

Patentee after: Huoshi Creation Technology Co.,Ltd.

Address before: 7 / F, building B, 482 Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province 310000

Patentee before: HANGZHOU FIRESTONE TECHNOLOGY Co.,Ltd.