MXPA01003448A - A system for the classification and generation of chemical compounds. - Google Patents

A system for the classification and generation of chemical compounds.

Info

Publication number
MXPA01003448A
Authority
MX
Mexico
Prior art keywords
molecules
cycles
classifying
length
categories
Application number
MXPA01003448A
Other languages
Spanish (es)
Inventor
Richard L Wife
Original Assignee
Specs And Biospecs Bv
Application filed by Specs And Biospecs Bv
Publication of MXPA01003448A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60 In silico combinatorial chemistry
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60 In silico combinatorial chemistry
    • G16C20/64 Screening of libraries
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B01 PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
    • B01J CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
    • B01J2219/00 Chemical, physical or physico-chemical processes in general; Their relevant apparatus
    • B01J2219/00274 Sequential or parallel reactions; Apparatus and devices for combinatorial chemistry or for making arrays; Chemical library technology
    • B01J2219/0068 Means for controlling the apparatus of the process
    • B01J2219/00686 Automatic
    • B01J2219/00689 Automatic using computers
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B01 PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
    • B01J CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
    • B01J2219/00 Chemical, physical or physico-chemical processes in general; Their relevant apparatus
    • B01J2219/00274 Sequential or parallel reactions; Apparatus and devices for combinatorial chemistry or for making arrays; Chemical library technology
    • B01J2219/0068 Means for controlling the apparatus of the process
    • B01J2219/00695 Synthesis control routines, e.g. using computer programs
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B01 PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
    • B01J CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
    • B01J2219/00 Chemical, physical or physico-chemical processes in general; Their relevant apparatus
    • B01J2219/00274 Sequential or parallel reactions; Apparatus and devices for combinatorial chemistry or for making arrays; Chemical library technology
    • B01J2219/0068 Means for controlling the apparatus of the process
    • B01J2219/007 Simulation or virtual synthesis
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B01 PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
    • B01J CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
    • B01J2219/00 Chemical, physical or physico-chemical processes in general; Their relevant apparatus
    • B01J2219/00274 Sequential or parallel reactions; Apparatus and devices for combinatorial chemistry or for making arrays; Chemical library technology
    • B01J2219/00718 Type of compounds synthesised
    • B01J2219/0072 Organic compounds

Landscapes

  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computing Systems (AREA)
  • Medicinal Chemistry (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biochemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Organic Low-Molecular-Weight Compounds And Preparation Thereof (AREA)

Abstract

The present invention discloses a method for classifying a library of chemical compounds and generating new chemical compounds to improve the diversity of the library. More particularly, the present invention creates an extended ring structure classification system, assigns chemical compounds from a library to categories within the classification system, recognizes empty categories, and generates new chemical structures to fill the empty categories in order to improve the range of structures represented by the library.

Description

A SYSTEM FOR THE CLASSIFICATION AND GENERATION OF CHEMICAL COMPOUNDS

FIELD OF THE INVENTION

The present invention relates generally to a method for classifying a library of chemical compounds and for generating new chemical structures. More particularly, the present invention creates a ring structure classification system, assigns chemical compounds from a library to categories within the classification system, recognizes empty categories, and generates new chemical structures to fill the empty categories in order to improve the range of structures represented by the library.
BACKGROUND

Conventional methods for creating chemical compounds with desired properties and activities identify lead compounds, create variants of the lead compounds, and evaluate the variants with respect to the desired properties and activities. These methods search libraries of existing chemical compounds and use a combination of empirical science and chemical intuition to identify lead compounds.

Conventional methods have a fundamental limitation because they depend on libraries of existing chemical compounds to identify lead compounds. First, the chemical library may not contain compounds with a wide range of different structures. Second, the chemical library may contain many compounds with similar structures, which increases the information-handling difficulty of searching the library for lead compounds. Therefore, the utility of conventional methods is limited by the amount of diversity in existing chemical libraries.

Conventional methods have an additional disadvantage because they depend on chemical intuition to identify lead compounds from libraries of existing chemical compounds. The Chemical Abstracts registry files exceed fifteen million compounds. Such a large number of compounds limits the utility of methods that rely on an individual's chemical intuition to screen libraries by manually processing a large amount of disorganized data to identify "lead" compounds. "... at least to date, has not proven to be a particularly good source of lead compounds for the drug discovery process." See Rudy M. Baum, "Combinatorial Approaches Provide Fresh Leads for Medicinal Chemistry" (C&EN, Feb. 7, 1994, pp. 20-26).
Research in combinatorial chemistry has attempted to address some of the limitations of conventional methods for creating chemical compounds. Typical approaches in combinatorial chemistry form a library of chemical compounds by combining a set of chemical building blocks in every possible way for a given compound length. However, the huge number of possible combinations limits the viability of these approaches.

U.S. Patent 5,463,564 ("the '564 patent") discloses an improved combinatorial chemistry system that selectively combines chemical building blocks in accordance with desired physical, chemical, and bioactive properties rather than generating every possible combination. Specifically, the '564 patent describes a chemical synthesis system that creates a directed diversity chemical library by selectively combining a number of building blocks, such as reagents, according to robotic synthesis instructions that are determined from the desired physical, chemical, and bioactive properties. After creating an initial chemical library, the system analyzes the compounds in the directed diversity library, constructs structure-activity and structure-property models, generates new robotic synthesis instructions, and generates a modified directed diversity chemical library. The system repeats this procedure iteratively until the properties of the chemical compounds in the library correspond to the desired properties.

Further research has described techniques for analyzing existing libraries of chemical compounds. An exemplary system uses shape description methods to analyze a database of commercially available drugs and to prepare a list of common drug shapes. See Guy W. Bemis and Mark A. Murcko, "The Properties of Known Drugs. 1. Molecular Frameworks," J. Med. Chem. 1996, 39, 2887-2893.
This system uses a hierarchical description of the molecular framework, comprising ring systems, linkers, and side chain atoms, to classify molecules. The system has a number of uses. First, drug discovery methods can use this classification scheme to identify well-represented frameworks. Second, combinatorial chemistry methods can use the linkers and ring systems identified by this classification scheme in the automated generation of chemical compound libraries. Finally, this classification scheme can evaluate particular libraries of chemical compounds by comparing the molecular frameworks of their compounds with the molecular frameworks of known drugs.

Another system describes "a system for extracting, organizing, storing, and recovering ring systems contained in large structural databases." (Nilakantan R. et al.: "A ring-based chemical structural query system: use of a novel ring-complexity heuristic", Journal of Chemical Information and Computer Sciences, Feb. 1990, USA, vol. 30, pages 65-68, XP000867655, ISSN: 0095-2338, page 65, hereinafter referred to as "Nilakantan"). But neither Nilakantan nor any other reference describes a method for classifying a library of molecules that can be used to improve the range of structures represented by the library in order to facilitate the creation of chemical compounds. In particular, no other reference describes the elements recited in claim 1. For example, the Nilakantan system is not a method for classifying a molecule by a list of its largest ring lengths. In addition, the Nilakantan system does not collect in a table the lengths of a designated number of the largest rings of a molecule, nor does it collect in the table the number of rings having each of these lengths. Instead, the Nilakantan system simply extracts the ring systems, eliminates duplicates using an arbitrary coding scheme, classifies ring systems using a simple complexity index, and stores the ring data in an easily searchable form. Nilakantan at 65, col. 1.
In contrast to the preceding references, the present invention solves the problem of defining a classification scheme that can be used to improve the range of structures represented by a library of molecules in order to facilitate the creation of chemical compounds. This scheme comprises the lengths of a designated number of the largest rings of a molecule and the number of rings having each of these lengths.
SUMMARY OF THE INVENTION

The present invention includes a method for classifying one or more molecules comprising the step of determining a graph representation, having one or more simple cycles, for the molecule, characterized in that the method also comprises the steps of: defining a plurality of categories wherein each of the categories is identified by a table comprising: the lengths of a designated number of the largest simple cycles of the graph representation; and, for each of the lengths, a corresponding number of simple cycles having that length; and assigning at least one of the molecules to one of the categories (120).

It is a further object of the present invention to define a method for classifying one or more molecules further comprising the step of: finding an empty one of the categories to identify structures absent from the library of molecules.

It is a further object of the present invention to define a method for classifying one or more molecules further comprising the step of: generating one or more chemical structures to fill the empty ones of the plurality of categories.

It is a further object of the present invention to define a method for classifying one or more molecules further comprising the steps of: identifying the isomorphic chemical structures in the plurality of categories; and eliminating the isomorphic chemical structures.

It is a further object of the invention to define a system for the classification of a library of molecules comprising a graph representation, having one or more simple cycles, for the molecule, characterized in that the system further comprises: a plurality of categories where each of the categories is identified by a table comprising: the lengths of a designated number of the largest simple cycles of the graph representation; and, for each of the lengths, a corresponding number of simple cycles having that length.
It is a further object of the invention to define a system for the classification of a library of molecules in which the lengths in the table are not duplicated.
BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 provides an overview of the classification method of the present invention.
Figure 2 provides a partial illustration of an exemplary classification system.
Figure 3 displays the number of categories in the classification system as a function of the designated length of the sequence of non-negative integers.
Figure 4 illustrates the relationship between small molecules and large molecules or receptors.
Figure 5 illustrates the context of the classification system of the present invention.
Figure 6 presents a flow chart of a method for assigning molecules to the classification system.
Figure 7 presents a flow chart of a method for creating an undirected graph G[V, E, d] for a molecule.
Figure 8a presents a flow chart of a method for finding a designated number of the largest simple cycles in the undirected graph G[V, E, d].
Figure 8b illustrates the operation of the method defined by the flow chart of Figure 8a.
Figure 9 presents a flow chart of a second method for finding a designated number of the largest simple cycles in the undirected graph G[V, E, d].
Figures 10a and 10b present a flow chart of a third method for finding a designated number of the largest simple cycles in the undirected graph G[V, E, d].
The Figure 11 series illustrates the operation of the method defined by the flow chart of Figures 10a and 10b.
Figures 12a-12e display the results that would be obtained when executing the method for assigning molecules to a classification system.
Figure 13 displays the fingerprints of four libraries of chemical compounds with a one-dimensional analysis.
Figure 14 displays the fingerprints of the same four libraries of chemical compounds with a three-dimensional analysis.
Figure 15 shows possible category populations and actual category populations as a function of ring length for a library of chemical compounds.
Figure 16a presents a flow chart of a method for finding a designated number of the largest simple cycles in the undirected graph G[V, E, d] for an extended ring descriptor classification system.
Figure 16b illustrates the operation of the method defined by the flow chart of Figure 16a.
Figure 17 presents an undirected graph G[V, E, d] to illustrate the effect of the definition of the rings on the resolution of the classification system.
Figure 18 provides an overview of the generation method of the present invention.
Figure 19 presents a flow chart of a method that eliminates isomorphic undirected graphs G'[V, E', d'] from the set of generated undirected graphs G'[V, E', d'] that represent the chemical structures.
Figure 20 presents a sample undirected graph G'[V, E', d'] to illustrate the derivation of a Roentgen descriptor 1915 for a vertex v'0.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Figure 1 provides an overview of the classification method of the present invention. In step 100, the method defines a classification system. This classification system comprises a plurality of categories. In step 110, the method inputs a library of molecules and classifies it. To classify the library, step 120 assigns each molecule in the library to a category in the classification system. Step 120 produces a completed classification system, which identifies the molecules that were assigned to each category of the system. Step 120 inputs the completed classification system and identifies the molecular structures that are missing from the library. Step 120 identifies the missing structures by locating the empty categories in the completed classification system. Step 120 produces a list of empty categories.

The classification system identifies each of its categories by a sequence of non-negative integers that has a designated length. The first integer in the sequence represents the length of the largest ring in the molecule. Each successive integer in the sequence represents the length of the next shorter constituent ring in the molecule.

Figure 2 provides a partial illustration of an exemplary classification system 200. The exemplary classification system 200 has a largest possible ring length of twenty-seven. Its sequence of non-negative integers has a designated length of two. The sequence of non-negative integers "27, 26" identifies the first category 210 of the exemplary classification system 200. Accordingly, the first category 210 will contain molecules having a largest ring length of 27 and a next largest ring length of 26.
Similarly, the second category 220 of the exemplary classification system 200 will contain molecules having a largest ring length of 27 and a next largest ring length of 25. Likewise, the final category 230 shown in the exemplary classification system 200 will contain molecules having a largest ring length of 3 and a next largest ring length of 0. In other words, the final category 230 will contain molecules that only have rings with a length of 3.

In the preferred embodiment, the designated length of the sequence of non-negative integers of the classification system varies from three to seven, and the maximum possible ring length supported by the classification system is approximately twenty-seven. In a series of alternative embodiments, the designated length of the sequence of non-negative integers varies from one to twenty-five. In addition, as indicated in the following discussion, the classification system has characteristics that allow support for larger maximum possible ring lengths.

Figure 3 displays the number of categories in the classification system as a function of the designated length of the sequence of non-negative integers. For example, the classification system will have twenty-five categories when the designated length of the sequence is one. Similarly, the classification system will have fifteen thousand two hundred and seventy-five categories when the designated length of the sequence is four. The designated length of the sequence of non-negative integers determines the resolution of the classification system. Specifically, the categories of a classification system having a smaller designated sequence length will each hold more molecules than the categories of a classification system having a larger designated sequence length.
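As a rough check on the two category counts quoted for Figure 3, the sketch below counts categories under an assumption that is inferred from the figures rather than stated in the text: ring lengths run from 3 to 27 (twenty-five possible lengths), the lengths in a category are distinct and listed in descending order, and unused positions are filled with 0. Under that assumption the count for a designated length n is the sum of the binomial coefficients C(25, k) for k from 1 to n. The function name and its parameters are illustrative only.

```python
from math import comb


def category_count(designated_length: int, min_ring: int = 3, max_ring: int = 27) -> int:
    """Count zero-padded, strictly descending ring-length sequences.

    Assumption (not stated in the text): a category holds between 1 and
    designated_length distinct ring lengths from min_ring..max_ring, and
    any remaining positions in the sequence are 0.
    """
    distinct_lengths = max_ring - min_ring + 1  # 25 for ring lengths 3..27
    return sum(comb(distinct_lengths, k) for k in range(1, designated_length + 1))


print(category_count(1))  # 25
print(category_count(4))  # 15275
```

Both values agree with the counts given above, which suggests (but does not prove) that this is the counting rule behind Figure 3.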
Therefore, the resolution of the classification system can be adjusted by changing the designated length of the sequence of non-negative integers.

Figure 4 and Figure 5 illustrate the utility and context of the classification system of the present invention. Figure 4 shows the relationship between small molecules 420-495 and large molecules 410 or receptors. Libraries of chemical compounds hold small molecules 420-425. To be useful, a small molecule (420-425) must be compatible with a large molecule 410 or receptor. As shown in Figure 4, the small molecule 490 is compatible with the large molecule 410. A library of chemical compounds should contain compounds with a wide range of structures to increase the likelihood that a member of the library will be compatible with a large molecule 410 or receptor. In other words, a library of chemical compounds should contain molecules that have a diverse range of structures. It is useful to measure diversity by molecular structure because molecules with different structures will be compatible with different large molecules and receptors. Therefore, the classification system measures the diversity among the chemical compounds in a library in a useful way, since it distinguishes molecules according to their structure.

The classification system of the present invention produces a library of chemical compounds having a diverse set of structures within the context shown in Figure 5. A large number of academic research groups, labeled as makers 510, synthesize very large numbers of molecules. For example, the Chemical Abstracts registry files currently contain more than fifteen million compounds. The pharmaceutical and agrochemical industries, labeled as users 530, search large numbers of molecules to find molecules with biological activity.
The classification system of the present invention is used by the intermediary 520 to create a library of chemical compounds having a diverse set of structures from the large number of chemical compounds supplied by the makers 510, in order to facilitate the task performed by the users 530. The intermediary 520 uses the present invention to assign the molecules in the library of chemical compounds to the categories of the classification system. The intermediary 520 identifies the molecular structures that are missing from the library by identifying the empty or sparsely populated categories in the classification system. Using this information, the intermediary 520 can add molecules with the identified structures to the library by collecting them from the makers 510. Similarly, the intermediary 520 identifies the molecular structures that are over-represented in the library of chemical compounds. Using this information, the intermediary 520 can remove compounds with the identified structures to reduce the information-handling difficulty experienced by the users 530. By limiting the number of molecules in the library that have a similar structure, the intermediary can also reduce the cost of the library to the users 530 while providing a library with a diverse range of structures.

Although the subsequent description of the invention uses examples from the pharmaceutical industry, the present invention is not limited to the pharmaceutical industry. The present invention is also suited to the classification of all types of chemical compounds having a ring structure.

Figure 6 presents a flow chart of method 600 for assigning molecules to the classification system previously described in the discussion of Figure 2 and Figure 3. In step 610, the method creates an undirected graph representation G[V, E, d] of the molecule.
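The intermediary's use of the classification system described above (populate the categories, then look for empty and over-represented ones) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's procedure: category keys are taken to be zero-padded descending tuples of distinct ring lengths between 3 and 27, the compound identifiers are hypothetical, and the `crowded_above` threshold for "over-represented" is an invented parameter.

```python
from collections import Counter
from itertools import combinations


def all_categories(designated_length, min_ring=3, max_ring=27):
    """Yield every category key as a zero-padded descending tuple (assumed layout)."""
    lengths = range(max_ring, min_ring - 1, -1)  # 27, 26, ..., 3
    for k in range(1, designated_length + 1):
        # combinations() keeps the input order, so each combo is descending
        for combo in combinations(lengths, k):
            yield combo + (0,) * (designated_length - k)


def survey(library, designated_length, crowded_above):
    """library maps a compound id to its category key."""
    population = Counter(library.values())
    empty = [c for c in all_categories(designated_length) if population[c] == 0]
    crowded = [c for c, n in population.items() if n > crowded_above]
    return population, empty, crowded


# Hypothetical four-compound library classified with a designated length of two
library = {
    "cmpd-1": (6, 5),
    "cmpd-2": (6, 5),
    "cmpd-3": (6, 0),
    "cmpd-4": (6, 5),
}
population, empty, crowded = survey(library, designated_length=2, crowded_above=2)
print(len(empty))  # 323 of the 325 categories are unpopulated
print(crowded)     # [(6, 5)]
```

The empty list points at structures that could be added to the library, and the crowded list at structures that could be thinned out.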
As is known to those skilled in the art, an undirected graph G[V, E, d] consists of a set V of vertices, a set E of edges, and a function d from the set E to the set of unordered pairs <u, v> for u ∈ V and v ∈ V. See, for example, Preparata and Yeh, Introduction to Discrete Structures (Addison-Wesley Publishing Company, Inc., 1973, pg. 67) ("Preparata"). In step 620, method 600 finds the designated number of the largest simple cycles in the undirected graph G[V, E, d].

Figure 7 presents a flow chart of a method for creating an undirected graph G[V, E, d] for a molecule 410. In step 705, the method places all the atoms of the molecule in a set of atoms A. In step 710, the method initializes a function g: A -> V, from the set of atoms A of the molecule to the set of vertices V of the corresponding undirected graph G[V, E, d], to null. In other words, no atom in the set of atoms A has an assigned vertex from the set of vertices V in G[V, E, d] after step 710. In step 715, the method determines whether there are unselected atoms in the set of atoms A. If there are unselected atoms, control proceeds to step 725. In step 725, the method selects an atom a from the set of atoms A. In step 730, the method creates a vertex v, which becomes a member of the set of vertices V in the undirected graph G[V, E, d]. In step 735, the method assigns the vertex v to the atom a (g(a) = v). After step 735, control proceeds to step 715 to process another atom of the set of atoms A. If there are no unselected atoms in the set of atoms A, as determined in step 715, control proceeds to step 720. In step 720, the method places all the bonds of the molecule in a set of bonds B. In step 740, the method determines whether there are unselected bonds in the set of bonds B. If there are unselected bonds, control proceeds to step 745. In step 745, the method selects a bond b from the set B.
In step 750, the method determines the atoms a1, a2 associated with bond b. In step 755, the method determines the vertices v1, v2 assigned to the atoms a1, a2 respectively by the function g: A -> V in step 735. In step 760, the method creates an edge connecting the vertices v1, v2 (<v1, v2>), which becomes a member of the set of edges E in the undirected graph G[V, E, d]. After step 760, control proceeds to step 745 to process another bond in the set of bonds B. If there are no unselected bonds in the set of bonds B, as determined in step 740, control proceeds to step 765, indicating that the method has completed execution.

Figure 8a presents a flow chart of the method for finding the designated number of the largest simple cycles in the undirected graph G[V, E, d]. As is known to those skilled in the art, a cycle is a closed path in which the first vertex v_i1 of the first edge <v_i1, v_i2> of the cycle is equal to the second vertex v_j2 of the last edge <v_j1, v_j2> of the cycle. See, for example, Preparata. A path is a sequence of edges such that the second vertex v_k2 of an edge <v_k1, v_k2> in the sequence is equal to the first vertex v_l1 of the next edge <v_l1, v_l2> in the sequence. A cycle is a simple cycle if it does not traverse any edge in the undirected graph G[V, E, d] more than once. In step 810, the method finds all the simple cycles (sc) in the undirected graph G[V, E, d] and stores them in a list of simple cycles. As is known to those skilled in the art, the method can find all the simple cycles in the undirected graph G[V, E, d] using a variant of the method described in U.S. Pat. No. 3,579,194, the contents of which are incorporated herein by reference in their entirety. In step 820, the method determines the length of each of the simple cycles found in step 810 by counting the number of edges in each simple cycle, and stores each length with the corresponding simple cycle in the list of simple cycles.
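The graph construction of Figure 7 and the cycle enumeration and length counting of steps 810 and 820 can be sketched as follows. This is a minimal illustration under stated assumptions. The cycle search is a plain depth-first enumeration swapped in for the variant of U.S. Pat. No. 3,579,194 that the text refers to, and it reports only cycles that visit no vertex twice, which is narrower than the edge-based definition of a simple cycle given above. The function names and the fused two-ring test molecule are hypothetical.

```python
from collections import defaultdict


def build_graph(atoms, bonds):
    """Steps 705-765: one vertex per atom, one undirected edge per bond."""
    g = {}                                    # the function g: A -> V
    for atom in atoms:                        # steps 715-735
        g[atom] = len(g)                      # create a vertex, assign it to the atom
    adjacency = defaultdict(set)
    for a1, a2 in bonds:                      # steps 740-760
        v1, v2 = g[a1], g[a2]
        adjacency[v1].add(v2)                 # unordered pair <v1, v2>
        adjacency[v2].add(v1)
    return g, dict(adjacency)


def simple_cycles(adjacency):
    """Step 810 (brute force): each vertex-disjoint cycle, as a set of edges."""
    cycles = set()

    def extend(start, path, on_path):
        for nxt in adjacency[path[-1]]:
            if nxt == start and len(path) >= 3:
                closed = path + [start]
                cycles.add(frozenset(frozenset(e) for e in zip(closed, closed[1:])))
            elif nxt > start and nxt not in on_path:
                # only grow through vertices larger than the start vertex, so
                # every cycle is discovered from its smallest vertex alone
                extend(start, path + [nxt], on_path | {nxt})

    for start in adjacency:
        extend(start, [start], {start})
    return cycles


# Hypothetical molecule: two fused six-membered rings sharing the bond C4-C5
atoms = [f"C{i}" for i in range(10)]
ring_a = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
ring_b = [(4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]
bonds = [(f"C{i}", f"C{j}") for i, j in ring_a + ring_b]

g, adjacency = build_graph(atoms, bonds)
lengths = sorted((len(c) for c in simple_cycles(adjacency)), reverse=True)  # step 820
print(lengths)  # [10, 6, 6]
```

The ten-edge cycle is the outer perimeter of the fused system, which shows why the method counts graph cycles rather than only the smallest chemical rings.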
In step 830, the method sorts the simple cycles by their corresponding lengths in descending order in a manner known to those skilled in the art. See, for example, Donald Ervin Knuth, "The Art of Computer Programming" (3rd Edition). In step 840, the method removes from the list those simple cycles whose lengths are the same as the lengths of their preceding simple cycles in the sorted list. In step 850, the method outputs the lengths of the simple cycles that appear in the designated number of entries at the beginning of the list. Accordingly, the method illustrated by the flow chart of Figure 8a yields the lengths of the designated number of the largest simple cycles of the undirected graph G[V, E, d].

Figure 8b illustrates the operation of the method defined by the flow chart of Figure 8a. Figure 8b shows an undirected graph G[V, E, d] that corresponds to a molecule from a library of chemical compounds. Step 810 will produce a list of ten simple cycles, sc1 through sc10, each identified by its component edges. [The edge lists are not recoverable from the source text.]

Step 820 will produce the lengths of the simple cycles in the list as follows:

sc1: 6, sc2: 6, sc3: 6, sc4: 5, sc5: 10, sc6: 14, sc7: 17, sc8: 10, sc9: 13, sc10: 9

Step 830 will produce the list of simple cycles sorted by their lengths in descending order as shown below:

sc7: 17, sc6: 14, sc9: 13, sc5: 10, sc8: 10, sc10: 9, sc1: 6, sc3: 6, sc2: 6, sc4: 5

Step 840 will eliminate from the list the simple cycles having a length equal to the length of a preceding simple cycle, producing the modified list shown below:

sc7: 17, sc6: 14, sc9: 13, sc5: 10, sc10: 9, sc1: 6, sc4: 5

Finally, step 850 will produce a sequence of the designated number of non-negative integers, where each integer in the sequence corresponds to the length of one ring in the molecule corresponding to the undirected graph G[V, E, d]. For example, if the designated number is five in the example of Figure 8b, step 850 will produce the following sequence: 17, 14, 13, 10, 9.

In an alternative embodiment, the classification system uses a different method to find the designated number of largest simple cycles LC in the undirected graph G[V, E, d], as shown by the flow chart of Figure 9. In step 905, the method initializes the list of largest simple cycles LC to empty. In step 910, the method sets a desired simple cycle length DL to the cardinality of the set of edges E. After step 910, DL equals the number of edges in the undirected graph G[V, E, d]. In step 915, the method determines whether the list of largest simple cycles LC contains fewer cycles than the designated number of simple cycles. If it does, control proceeds to step 920. In step 920, the method determines whether the undirected graph G[V, E, d] has a simple cycle with a length that is greater than or equal to DL. If the undirected graph G[V, E, d] contains such a simple cycle, control proceeds to step 925. In step 925, the method marks the simple cycle with a length that is greater than or equal to DL.
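Steps 830 through 850 of Figure 8a, applied to the cycle lengths of the Figure 8b example, can be sketched as follows. The function name is hypothetical, and the zero padding for molecules with fewer distinct ring lengths than the designated number is an assumption drawn from the "3, 0" category of Figure 2 rather than from the flow chart itself.

```python
def largest_ring_lengths(cycle_lengths, designated_number):
    """Sort, drop repeated lengths, and keep the first designated_number entries."""
    ordered = sorted(cycle_lengths, reverse=True)          # step 830
    distinct = []
    for length in ordered:                                  # step 840
        if not distinct or distinct[-1] != length:
            distinct.append(length)
    key = distinct[:designated_number]                      # step 850
    key += [0] * (designated_number - len(key))             # assumed zero padding
    return key


# Lengths of sc1 through sc10 from the Figure 8b example
example_lengths = [6, 6, 6, 5, 10, 14, 17, 10, 13, 9]
print(largest_ring_lengths(example_lengths, 5))  # [17, 14, 13, 10, 9]
```

With a designated number of five, the result matches the sequence given for step 850 in the Figure 8b example.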
In step 930, the method determines whether the length of the marked simple cycle is greater than DL. If the length of the marked simple cycle is not greater than DL, control proceeds to step 935. In step 935, the method stores the marked simple cycle and its corresponding length in the list of largest cycles LC. In step 940, the method decrements DL before control returns to step 915. If the length of the marked simple cycle is greater than DL, as determined in step 930, control returns to step 920 to determine whether the undirected graph G[V, E, d] has another simple cycle with a length that is greater than or equal to DL. If the undirected graph G[V, E, d] does not contain an unmarked simple cycle with a length that is greater than or equal to DL, as determined in step 920, control proceeds to step 940, where DL is decremented in preparation for a search for a successively smaller simple cycle. If the largest simple cycle list LC contains the designated number of largest cycles, as determined in step 915, the method ends.

In another alternative embodiment, the classification system uses a different method to find the designated number of largest simple cycles LC in the undirected graph G[V, E, d], using commercially available database software such as the Integrated Scientific Information System (ISIS) provided by Molecular Design Ltd. (MDL Information Systems) of San Leandro, California. In this method, the operator creates a group of molecule files that correspond to rings of different lengths. For example, the operator can create twenty-five molecule files for rings that vary in length from three to twenty-seven. Next, the operator queries the database containing the library of chemical compounds for the substructures defined in the twenty-five molecule files described above.
To query the database, the operator can run a PL program. PL is a standard query language provided by Molecular Design Limited that allows the operator to run programs when a structure database is opened with the MDL ISIS packages. The database queries yield the identification numbers of the molecules in the database that have the substructures defined in the molecule files described above. Accordingly, the PL program produces a list file for each of the substructures defined in each molecule file. Then, the operator imports the list files into a database such as DataEase. The operator moves all the records in the list files into a form within the database. After this step in the method, the form will contain records such as those displayed below:

Identification number    Ring size
AE-641/0123001           03
AE-641/0123002           03
AE-641/0123003           03
AE-641/0123001           05
AE-641/0123003           05
AE-641/0123004           05
AE-641/0123001           06

The operator sorts these records in the form within the database by the identification numbers of the molecules. Finally, the operator accumulates the ring sizes associated with each identification number in the form as shown below:

Identification number    Ring sizes
AE-641/0123001           060503
AE-641/0123002           03
AE-641/0123003           0503
AE-641/0123004           05

Accordingly, this method produces a list of constituent ring lengths for each chemical compound in the library.
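The sorting and accumulation steps described above can be sketched in a few lines. This is a minimal illustration, not the DataEase procedure itself: the record list is hypothetical (modeled on the rows shown above), and the function name `accumulate_ring_sizes` is an assumption introduced for the example.

```python
from collections import defaultdict

# Hypothetical (identification number, ring size) records, modeled on the
# list-file rows shown above: one row per substructure hit.
records = [
    ("AE-641/0123001", 3),
    ("AE-641/0123002", 3),
    ("AE-641/0123003", 3),
    ("AE-641/0123001", 5),
    ("AE-641/0123003", 5),
    ("AE-641/0123004", 5),
    ("AE-641/0123001", 6),
]

def accumulate_ring_sizes(rows):
    """Group ring sizes by identification number, largest ring first."""
    sizes = defaultdict(set)
    for ident, ring_size in rows:
        sizes[ident].add(ring_size)
    # Sort by identification number, then join the two-digit ring sizes
    # of each molecule in descending order.
    return {
        ident: "".join(f"{s:02d}" for s in sorted(sizes[ident], reverse=True))
        for ident in sorted(sizes)
    }

ring_table = accumulate_ring_sizes(records)
for ident, code in ring_table.items():
    print(ident, code)
```

Each value in `ring_table` is the concatenated ring-size string for one compound, in the same layout as the accumulated form.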
In another alternative embodiment, the classification system executes a more efficient method to find the designated number of largest simple cycles in the undirected graph G[V, E, d], as illustrated by the flow chart of Figures 10a and 10b. As is known to those skilled in the art and as indicated by "Computers and Intractability" by Michael Garey and David Johnson ("Garey"), the task of finding the largest simple cycles in an arbitrary undirected graph G[V, E, d] is NP-complete (nondeterministic polynomial time complete). However, the task of finding the largest simple cycle can be performed in polynomial time for particular types of graphs. For example, researchers have designed a polynomial-time algorithm to find the longest cycle in the particular type of directed graph called a tournament. See C. Morrow and S. Goodman, "An Efficient Algorithm for Finding a Longest Cycle in a Tournament." Similarly, the method described by the flow chart of Figures 10a and 10b finds the designated number of largest cycles in polynomial time for an undirected graph G[V, E, d] when the graph is planar and when all the finite regions of the planar graph are adjacent to its infinite region.
As is known to those skilled in the art, a finite undirected graph G[V, E, d] is planar if it can be drawn in a plane such that none of its edges intersect except, possibly, at the vertices. See, for example, Preparata. In addition, a necessary and sufficient condition for a graph G[V, E, d] to be planar is that it does not contain partial subgraphs of either the star graph or the utility graph. An undirected graph G'[V', E', d'] is called a subgraph of an undirected graph G[V, E, d] if V' is a subset of V and if E' (a subset of E) consists of all the edges in E that join the vertices in V'. If E' is only a subset of the edges that join the vertices in V', then G' is called a partial subgraph of G.

As is known to those skilled in the art, a region of a planar undirected graph G[V, E, d] is a domain of the plane surrounded by edges of the graph such that any two points in it can be joined by a line that does not cross any edge. See, for example, Preparata. The edges that touch a region contain a simple cycle called the contour of the region. Two regions are adjacent if their contours have at least one edge in common. Furthermore, a planar graph G[V, E, d] has exactly one region that is infinite, in the sense that the area determined by its contour is infinite; all other regions are finite. Accordingly, Figures 10a and 10b show the flow chart of a method for determining the designated number of largest simple cycles in polynomial time in an undirected graph G[V, E, d] that is planar and in which all the finite regions are adjacent to the infinite region.
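As an informal illustration of why this class of graphs is tractable (an observation offered here as an assumption, not the procedure of the flow chart), consider ring contours given as sets of edge labels. For a connected group of fused rings, the edges of the cycle enclosing the group are the edges that lie on exactly one contour in the group, because a shared edge lies on two contours and cancels. The sketch below uses four hypothetical rings fused in a chain; the ring names `r1` to `r4` and the edge numbers are illustrative only.

```python
from functools import reduce

# Hypothetical ring contours as sets of edge labels. The rings are fused in a
# chain: r1 and r2 share edge 17, r2 and r3 share edge 13, r3 and r4 share 20.
rings = {
    "r1": {1, 2, 3, 17, 18, 19},
    "r2": {17, 4, 13, 14, 15, 16},
    "r3": {5, 6, 7, 20, 12, 13},
    "r4": {20, 8, 9, 10, 11},
}

def enclosing_cycle(names):
    """Return the edges lying on exactly one contour of the given fused rings.

    The symmetric difference keeps edges that occur an odd number of times;
    since an edge borders at most two rings here, that means exactly once.
    """
    return reduce(lambda acc, name: acc ^ rings[name], names, set())

# Connected groups of rings, from the whole chain down to smaller pieces.
groups = [
    ["r1", "r2", "r3", "r4"],
    ["r1", "r2", "r3"],
    ["r2", "r3", "r4"],
    ["r1", "r2"],
    ["r3", "r4"],
]
lengths = [len(enclosing_cycle(group)) for group in groups]
print(lengths)  # [17, 14, 13, 10, 9]
```

Because each group is handled with a few set operations, the cost grows with the number of rings and edges rather than with the number of all possible simple cycles.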
Since this method executes in polynomial time, it allows the classification scheme to operate on libraries that contain molecules and chemical compounds with large numbers of atoms and bonds, provided that those molecules and compounds have corresponding undirected graphs G[V, E, d] that are planar and in which all the finite regions are adjacent to the infinite region. In step 1004, the method initializes the set LC of the designated number of largest simple cycles in G[V, E, d] to the null set. The set LC will hold the result of the execution of the method. In step 1005, the method determines the set SSC of shortest simple cycles through each vertex of the graph G[V, E, d] in a manner known to those skilled in the art. See, for example, Preparata. In step 1010, the method determines a set of edges BE comprising the edges of the shortest simple cycles SSC that are adjacent to the infinite region of the planar undirected graph G[V, E, d].
In step 1015, the method calculates the number of adjacent regions for the region corresponding to each shortest simple cycle ssc in the set of shortest simple cycles SSC. In step 1020, the method sorts the shortest simple cycles ssc in the set SSC by their number of adjacent regions. In step 1025, the method stores the shortest simple cycles ssc of the set SSC that have the smallest number of adjacent regions in a set SSC'. In step 1030, the method determines the length of each shortest simple cycle ssc in the set SSC' of shortest simple cycles with the smallest number of adjacent regions. In step 1035, the method sorts the shortest simple cycles ssc in the set SSC' by their lengths. In step 1036, the method defines a set of shortest simple cycle combinations CC that will hold concatenations of shortest simple cycles ssc from the set SSC. The method initializes the first member of the set CC to a null entry. Then, the method appends the members of the set SSC' to the set of combinations CC.
In step 1040, the method determines whether any shortest simple cycle combination cc remains in the set of combinations CC and whether the set of largest simple cycles LC contains fewer than the designated number of largest simple cycles. If a combination cc remains in the set CC and the set LC contains fewer than the designated number of largest simple cycles, control proceeds to step 1045. In step 1045, the method selects the first shortest simple cycle combination cc in the set CC and removes it from the set. In step 1055, the method forms a modified set of edges BE' by copying the set of edges BE and removing the edges of the shortest simple cycles ssc of the selected combination cc. In step 1060, the method calculates the longest simple cycle lc in the modified set of edges BE'. The longest simple cycle in BE' corresponds to the longest cycle around the border of the modified undirected graph G'[V', E', d'] that is formed from the initial undirected graph G[V, E, d] by removing the edges of the shortest simple cycles ssc of the selected combination cc and their vertices v. In step 1065, the method stores the longest simple cycle calculated from the modified set of edges BE' in the set of largest simple cycles LC. In step 1068, the method removes duplicate simple cycles from the set LC. In step 1070, the method finds all the shortest simple cycles ssc that are adjacent to the last shortest simple cycle in the selected combination cc and stores them in an adjacent shortest simple cycle list AL. In step 1075, the method sorts the shortest simple cycles in the adjacent list AL by their lengths in descending order. In step 1080, the method determines whether any shortest simple cycles ssc remain in the adjacent list AL.
If shortest simple cycles remain in the adjacent list AL, control proceeds to step 1085. In step 1085, the method selects a shortest simple cycle ssc and removes it from the adjacent list AL. In step 1090, the method appends the selected shortest simple cycle ssc to the selected combination cc and adds the resulting combination cc' to the set of combinations CC. After step 1090, control returns to step 1080 to process the next shortest simple cycle ssc in the adjacent list AL. If no shortest simple cycles ssc remain in the adjacent list AL in step 1080, control returns to step 1040 to process the next combination cc in the set of combinations CC. If no combinations cc remain in the set CC, as determined in step 1040, the method ends. When the method ends, the list of largest simple cycles LC contains the lengths of the designated number of largest simple cycles in the undirected graph G[V, E, d]. Accordingly, the method illustrated by the flow chart of Figures 10a and 10b yields the lengths of the designated number of largest simple cycles of the undirected graph G[V, E, d].

Figures 11a through 11e illustrate the operation of the method defined by the flow chart of Figures 10a and 10b. Figure 11a shows an undirected graph G[V, E, d] that corresponds to a molecule from a library of chemical compounds. Step 1005 will produce the following list of shortest simple cycles ssc as identified by their component edges e:

ssc1: e1, e2, e3, e17, e18, e19
ssc2: e17, e4, e13, e14, e15, e16
ssc3: e5, e6, e7, e20, e12, e13
ssc4: e20, e8, e9, e10, e11

Step 1010 will produce the following set of edges BE:

e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12, e14, e15, e16, e18, e19

Step 1015 will produce the number of adjacent regions for each of the shortest simple cycles ssc as indicated below:

ssc1: 2
ssc2: 3
ssc3: 3
ssc4: 2

Step 1020 will produce the set of shortest simple cycles SSC sorted by their number of adjacent regions as follows:

ssc1: 2
ssc4: 2
ssc2: 3
ssc3: 3

Step 1025 will collect the shortest simple cycles ssc with the smallest number of adjacent regions in the set SSC' as follows:

ssc1: 2
ssc4: 2

Step 1030 will calculate the lengths of the shortest simple cycles ssc in the set SSC', and step 1035 will sort them by length as indicated below:

ssc4: 5
ssc1: 6

Step 1036 will initialize the set of shortest simple cycle combinations CC as follows:

null
ssc4: 5
ssc1: 6

In the first iteration of the subsequent loop, step 1045 will remove the null combination from the set of combinations CC. Consequently, step 1055 will not remove any edges from the edge set BE in the first iteration of the loop. Accordingly, step 1060 will calculate the longest simple cycle lc of the edge set BE as follows:

e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12, e14, e15, e16, e18, e19

Step 1065 stores this longest simple cycle, together with its associated length of 17, in the list of largest simple cycles LC.
In the next iteration of the loop, step 1045 will remove the first shortest simple cycle combination cc from the set of combinations CC, as follows:

ssc4

Step 1055 removes the edges of the shortest simple cycles ssc of the selected combination cc from the undirected graph G[V, E, d] to form a modified set of edges BE' as follows:

e1, e2, e3, e4, e5, e6, e7, e12, e14, e15, e16, e18, e19

These edges BE' correspond to the edges in the subgraph G'[V', E', d'] shown in Figure 11b. Step 1060 calculates the longest simple cycle lc of the modified set of edges BE' as follows:

e1, e2, e3, e4, e5, e6, e7, e20, e12, e14, e15, e16, e18, e19

Step 1065 stores this longest simple cycle, together with its associated length of 14, in the list of largest simple cycles LC. Step 1070 finds all the shortest simple cycles ssc that are adjacent to the last shortest simple cycle in the selected combination cc and stores them in an adjacent list AL as follows:

ssc3

Steps 1075 to 1090 will append each shortest simple cycle ssc in the adjacent list AL to the selected combination cc and store the modified combinations cc' in the set of combinations CC, so that the set CC reads as follows:

ssc1
ssc4, ssc3

In the next iteration of the loop, step 1045 will remove the first shortest simple cycle combination cc from the set of combinations CC as indicated below:

ssc1

Step 1055 removes the edges of the shortest simple cycles ssc of the selected combination cc from the undirected graph G[V, E, d] to form a modified set of edges BE' as follows:

e17, e4, e5, e6, e7, e8, e9, e10, e11, e12, e14, e15, e16

These edges BE' correspond to the edges in the subgraph G'[V', E', d'] shown in Figure 11c. Step 1060 calculates the longest simple cycle lc of the modified set of edges BE' as follows:

e17, e4, e5, e6, e7, e8, e9, e10, e11, e12, e14, e15, e16

Step 1065 stores this longest simple cycle, together with its associated length of 13, in the list of largest simple cycles LC. Step 1070 finds all the shortest simple cycles ssc that are adjacent to the last shortest simple cycle in the selected combination cc and stores them in an adjacent list AL as follows:

ssc2

Steps 1075 to 1090 will append each shortest simple cycle ssc in the adjacent list AL to the selected combination cc and store the modified combinations cc' in the set of combinations CC, so that the set CC reads as follows:

ssc4, ssc3
ssc1, ssc2

In the next iteration of the loop, step 1045 will remove the first shortest simple cycle combination cc from the set of combinations CC as indicated below:

ssc4, ssc3

Step 1055 removes the edges of the shortest simple cycles ssc of the selected combination cc from the undirected graph G[V, E, d] to form a modified set of edges BE' as indicated below:

e1, e2, e3, e4, e13, e14, e15, e16, e18, e19

These edges BE' correspond to the edges in the subgraph G'[V', E', d'] represented in Figure 11d. Step 1060 calculates the longest simple cycle lc of the modified set of edges BE' as follows:

e1, e2, e3, e4, e13, e14, e15, e16, e18, e19

Step 1065 stores this longest simple cycle, together with its associated length of 10, in the list of largest simple cycles LC. Step 1070 finds all the shortest simple cycles ssc that are adjacent to the last shortest simple cycle in the selected combination cc and stores them in an adjacent list AL as follows:

ssc2
Steps 1075 to 1090 will append each shortest simple cycle ssc in the adjacent list AL to the selected combination cc and store the modified combinations cc' in the set of combinations CC, so that the set CC reads as follows:

ssc1, ssc2
ssc4, ssc3, ssc2

In the next iteration of the loop, step 1045 will remove the first shortest simple cycle combination cc from the set of combinations CC as indicated below:

ssc1, ssc2

Step 1055 removes the edges of the shortest simple cycles ssc of the selected combination cc from the undirected graph G[V, E, d] to form a modified set of edges BE' as follows:

e5, e6, e7, e8, e9, e10, e11, e12, e13

These edges BE' correspond to the edges in the subgraph G'[V', E', d'] represented in Figure 11e. Step 1060 calculates the longest simple cycle lc of the modified set of edges BE' as follows:

e5, e6, e7, e8, e9, e10, e11, e12, e13

Step 1065 stores this longest simple cycle lc, together with its associated length of 9, in the list of largest simple cycles LC. Step 1070 finds all the shortest simple cycles ssc that are adjacent to the last shortest simple cycle ssc in the selected combination cc and stores them in an adjacent list AL as follows:

ssc3

Steps 1075 to 1090 will append each shortest simple cycle ssc in the adjacent list AL to the selected combination cc and store the modified combinations cc' in the set of combinations CC, so that the set CC reads as follows:

ssc4, ssc3, ssc2
ssc1, ssc2, ssc3

At this point in the program run, the list of largest simple cycles LC has the following entries:

e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12, e14, e15, e16, e18, e19: 17
e1, e2, e3, e4, e5, e6, e7, e20, e12, e14, e15, e16, e18, e19: 14
e17, e4, e5, e6, e7, e8, e9, e10, e11, e12, e14, e15, e16: 13
e1, e2, e3, e4, e13, e14, e15, e16, e18, e19: 10
e5, e6, e7, e8, e9, e10, e11, e12, e13: 9

Execution will continue through subsequent iterations of the loop as discussed above until step 1040 determines that no combinations remain in the set of shortest simple cycle combinations CC or until the designated number of largest cycles appears in the list of largest simple cycles LC.

Figures 12a-12e display the results that would be obtained from executing the method of Figure 6 for assigning molecules in a classification system together with a method for finding the designated number of largest simple cycles in the undirected graph G[V, E, d]: the method shown in Figure 8a, the method of Figure 9a, or the method of Figures 10a and 10b. Each figure identifies the designated number of largest simple cycles for the graph representation of the molecule.

Figure 13 displays the fingerprints of four libraries of chemical compounds obtained from running the first two steps of the method shown in the flow chart of Figure 1 with a one-step analysis. In other words, Figure 13 displays the results obtained by the present invention when the designated number of largest ring lengths is one. The graphs display the number of compounds within each library as a function of the largest ring length. Figure 14 displays the fingerprints of the same four chemical libraries with a three-step analysis. In other words, Figure 14 displays the results obtained by the present invention when the designated number of largest ring lengths is three. The graph displays the number of compounds within each library as a function of the three largest ring lengths. The S97819 library contains 97,819 mainly synthetic compounds.
The NP2466 library contains 2,466 natural products. The F2245 library contains 2,245 different compounds that have been ordered more than five times during the past six years. The CMC library contains 6,861 structures. Figure 15 displays the results associated with a three-step analysis of the S97819 library with the method described by the flow chart of Figure 1. The graph shows the possible category populations and the actual category populations as a function of the ring length.

The classification system discussed above can be called a ring descriptor classification system because it classifies the molecules according to a designated number of the largest simple cycles that appear within the molecules. The present invention also includes a second classification system called an extended ring descriptor classification system. The extended ring descriptor classification system has more categories than the ring descriptor classification system because it includes more structural information. The extended ring descriptor classification system identifies each of its categories with a sequence of tuples that has a designated length. Each tuple contains two non-negative integers. The first non-negative integer in the tuple represents the length of a ring in the molecule. The second non-negative integer in the tuple represents the number of rings in the molecule that have that length. The first tuple in the sequence represents the rings with the largest length in the molecule. Each successive tuple in the sequence represents successively shorter rings in the molecule. In this way, the extended ring descriptor classification system identifies the number of rings having particular lengths, while the ring descriptor classification system identifies only the presence or absence of rings having particular lengths.
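The difference between the two descriptors can be sketched as follows. This is a minimal illustration that assumes the simple-cycle lengths of a molecule have already been found; the function names are hypothetical, and the input is the list of ten cycle lengths from the worked example of Figure 8b.

```python
from collections import Counter

def ring_descriptor(cycle_lengths, designated_number):
    """Distinct ring lengths, largest first (presence or absence only)."""
    distinct = sorted(set(cycle_lengths), reverse=True)
    return distinct[:designated_number]

def extended_ring_descriptor(cycle_lengths, designated_number):
    """(ring length, number of rings with that length) tuples, largest first."""
    counts = Counter(cycle_lengths)
    tuples = [(length, counts[length]) for length in sorted(counts, reverse=True)]
    return tuples[:designated_number]

# Lengths of simple cycles sc1 to sc10 in the Figure 8b example.
lengths = [6, 6, 6, 5, 10, 14, 17, 10, 13, 9]

print(ring_descriptor(lengths, 5))           # [17, 14, 13, 10, 9]
print(extended_ring_descriptor(lengths, 5))  # [(17, 1), (14, 1), (13, 1), (10, 2), (9, 1)]
```

The two rings of length 10 collapse into a single entry in the ring descriptor but remain visible as the tuple (10, 2) in the extended ring descriptor.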
Figure 16a presents a flow chart of a method for finding the designated number of largest simple cycles in the undirected graph G[V, E, d] for the extended ring descriptor classification system. In step 1610, the method finds all the simple cycles (sc) in the undirected graph G[V, E, d] and stores them in a list of simple cycles. In step 1620, the method determines the length of each of the simple cycles found in step 1610 by counting the number of edges of each simple cycle, and stores each length with the corresponding simple cycle in the simple cycle list. In step 1630, the method sorts the simple cycles by their corresponding lengths in descending order in a manner known to those skilled in the art. See, for example, Donald Erwin Knuth, "The Art of Computer Programming" (3rd Edition). In step 1640, the method gathers the simple cycles into groups according to their lengths. Specifically, simple cycles will appear in the same group if they have the same length. In step 1650, the method produces a list of tuples in which each tuple corresponds to a group of simple cycles. The first non-negative integer in the tuple represents a ring length in the molecule. The second non-negative integer in the tuple represents the number of rings in the molecule that have that length.

Figure 16b illustrates the operation of the method defined by the flow chart of Figure 16a. Figure 16b shows an undirected graph G[V, E, d] that corresponds to a molecule from a library of chemical compounds. Step 1610 will produce the following list of simple cycles as identified by their component edges e:

sc1: e1, e2, e3, e17, e18, e19
sc2: e17, e4, e13, e14, e15, e16
sc3: e5, e6, e7, e20, e12, e13
sc4: e20, e8, e9, e10, e11
sc5: e1, e2, e3, e4, e13, e14, e15, e16, e18, e19
sc6: e1, e2, e3, e4, e5, e6, e7, e20, e12, e14, e15, e16, e18, e19
sc7: e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12, e14, e15, e16, e18, e19
sc8: e17, e4, e5, e6, e7, e20, e12, e14, e15, e16
sc9: e17, e4, e5, e6, e7, e8, e9, e10, e11, e12, e14, e15, e16
sc10: e5, e6, e7, e8, e9, e10, e11, e12, e13

Step 1620 will produce the lengths of the simple cycles in the list as follows:

sc1: 6
sc2: 6
sc3: 6
sc4: 5
sc5: 10
sc6: 14
sc7: 17
sc8: 10
sc9: 13
sc10: 9

Step 1630 will produce the list of simple cycles sc sorted by their lengths in descending order as shown below:

sc7: 17
sc6: 14
sc9: 13
sc5: 10
sc8: 10
sc10: 9
sc1: 6
sc2: 6
sc3: 6
sc4: 5

Step 1640 will gather the simple cycles into groups as shown below:

Group 1: sc7
Group 2: sc6
Group 3: sc9
Group 4: sc5, sc8
Group 5: sc10
Group 6: sc1, sc2, sc3
Group 7: sc4

Finally, step 1650 will produce a sequence of the designated number of tuples. The first non-negative integer in each tuple is a ring length in the molecule that corresponds to the undirected graph G[V, E, d]. The second non-negative integer in each tuple is the number of rings in the molecule that have the length identified by the first non-negative integer in the tuple. For example, if the designated number is five in the example of Figure 16b, step 1650 will produce the following tuple sequence: 17:1, 14:1, 13:1, 10:2, 9:1.

Figure 17 represents an undirected graph G[V, E, d] that illustrates the effect of the definition of rings on the resolution of the classification system. Because the rings are defined using simple cycles over the set of edges E, the undirected graph G[V, E, d] of Figure 17 has three rings of length four that correspond to three simple cycles: sc1 = <v1, v2>, <v2, v4>, <v4, v3>, <v3, v1>; sc2 = <v1, v4>, <v4, v2>, <v2, v3>, <v3, v1>; and sc3 = <v1, v2>, <v2, v3>, <v3, v4>, <v4, v1>. Conversely, if the rings had been defined using only the set of vertices V, the undirected graph G[V, E, d] of Figure 17 would have only one ring of length four.
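The count of three edge-defined rings versus one vertex-defined ring can be checked with a small brute-force enumeration. The sketch below assumes that the graph of Figure 17 consists of exactly the four vertices and six edges that appear in the three cycles listed above; the function `simple_cycles` is a hypothetical helper written for this illustration and is only practical for small graphs.

```python
def simple_cycles(adjacency):
    """Return every simple cycle of an undirected graph as a frozenset of edges."""
    cycles = set()

    def extend(start, path, visited):
        for nxt in adjacency[path[-1]]:
            if nxt == start and len(path) >= 3:
                # Close the cycle. Frozensets make edges and cycles order-free,
                # so the two traversal directions collapse into one entry.
                closed = path + [start]
                cycles.add(frozenset(frozenset(pair) for pair in zip(closed, closed[1:])))
            elif nxt > start and nxt not in visited:
                # Only visit vertices larger than the start, so that each
                # cycle is rooted at its smallest vertex.
                extend(start, path + [nxt], visited | {nxt})

    for start in adjacency:
        extend(start, [start], {start})
    return cycles

# Assumed adjacency for Figure 17: vertices v1..v4 written as 1..4,
# with every pair of vertices joined by an edge.
graph = {1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4], 4: [1, 2, 3]}

cycles = simple_cycles(graph)
four_rings = [c for c in cycles if len(c) == 4]
vertex_sets = {frozenset(v for edge in c for v in edge) for c in four_rings}
print(len(four_rings), len(vertex_sets))  # 3 1
```

The three rings of length four differ only in which edges they use, so collapsing each ring to its vertex set leaves a single ring.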
Using this alternative definition of a ring, the ring of length four is identified only by the vertices of the ring: v1, v2, v3, v4. As shown by the example in Figure 17, the categories of a classification system that defines rings using simple cycles over the set of edges E will separate some of the molecules that share a category in a classification system that defines rings using only the vertex set V. In other words, classification systems that define rings using simple cycles over the set of edges E can distinguish complex rings, such as the rings corresponding to the simple cycles sc2 = <v1, v4>, <v4, v2>, <v2, v3>, <v3, v1> and sc3 = <v1, v2>, <v2, v3>, <v3, v4>, <v4, v1>, from simple rings, such as the ring corresponding to the simple cycle <v1, v2>, <v2, v4>, <v4, v3>, <v3, v1>. The ability of a classification system to distinguish complex rings from simple rings is important because structures that contain rings with complex interconnections can correspond to chemical compounds that are not physically possible in the real world.

In an alternative embodiment, the resolution of the extended ring descriptor classification system is further increased by also identifying the types of atoms that appear within the rings. Specifically, this modified extended ring descriptor classification system adds a type attribute to the vertices that represent the atoms in the undirected graph G[V, E, d]. The type attribute of a vertex identifies the type of atom that corresponds to the vertex, such as carbon, silicon, or nitrogen.

In an alternative embodiment, the resolution of the extended ring descriptor classification system is further increased by also identifying the types of atoms that appear adjacent to the rings. Specifically, this modified extended ring descriptor classification system adds a neighbor type attribute to the vertices that represent atoms in the undirected graph G[V, E, d].
The neighbor type attribute of a vertex identifies the types of atoms that correspond to neighboring vertices, such as carbon, silicon, and nitrogen.

The extended ring descriptor classification system and its modifications provide higher resolution than the ring descriptor classification system. For example, a large number of chemical compounds fall within the same category 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3 of the ring descriptor classification system. The extended ring descriptor classification system distinguishes between chemical compounds that have the same ring descriptor by assigning them to different categories. As previously discussed, these categories are defined by the number of rings that have a particular length, the types of atoms within these rings, and the types of atoms connected to these rings.

The present invention further comprises a generation method for creating realistic chemical structures to fill the empty categories of a classification system in order to improve the range of structures represented by the library. Figure 18 provides an overview of the generation method of the present invention. In step 1810, the method generates a set of new undirected graph representations G'[V', E', d'] of chemical structures to fill the list of empty categories. The set of undirected graphs G'[V', E', d'] produced by step 1810 may contain undirected graphs that are isomorphic to each other. Two undirected graphs G'[V', E', d'] are isomorphic to each other if one graph can be obtained from the other by relabeling the vertices V'. See, for example, Kohavi, "Switching and Finite Automata Theory" (2nd Edition). In step 1820, the method removes the isomorphic undirected graphs from the set of undirected graphs G'[V', E', d'] that represent the chemical structures generated by step 1810.
However, the set of undirected graphs G'[V', E', d'] produced by the execution of steps 1810 and 1820 can represent chemical structures that are not physically possible in the real world. For example, a chemical structure can be unrealistic because of the deformation of the angles of carbon atoms within the chemical structure. Accordingly, step 1830 filters the undirected graphs that represent unrealistic chemical structures out of the set of undirected graphs G'[V', E', d'] generated by the execution of steps 1810 and 1820.

Figure 19 represents a flow chart of a method that removes isomorphic undirected graphs G'[V', E', d'] from the set of generated undirected graphs that represent the chemical structures. In step 1910, the method derives a roentgen descriptor 1915 for each vertex v' of each generated undirected graph G'[V', E', d'] representing a chemical structure. Step 1920 computes a fingerprint 1925 for each vertex v' of each generated undirected graph from the roentgen descriptor 1915 of the vertex. Step 1930 calculates a chemical structure descriptor 1935 for each generated undirected graph G'[V', E', d'] from the fingerprints 1925 of the vertices V' of the graph.

The method derives a roentgen descriptor 1915 for a vertex v' by first recording the degree of the vertex v'. The degree of a vertex v' in an undirected graph G'[V', E', d'] is the number of edges e' that are incident to the vertex v'. See Preparata, page 74. Next, the method records the degree of all vertices of V' that can be reached from the vertex v' by a path of length one. As discussed previously, a path is a sequence of edges such that the second vertex vk2 of an edge <vk1, vk2> in the sequence is equal to the first vertex vl1 of the next edge <vl1, vl2> in the sequence. See Preparata.
Then, the method records the degree of every vertex in V' that can be reached from the vertex v' by a path of length two. This process iterates until the method has recorded the degree of every vertex in V' that can be reached from the vertex v' by a path of a predetermined maximum length.
Figure 20 presents a sample undirected graph G'[V', E', d'] to illustrate the derivation of a roentgen descriptor 1915 for a vertex v'0. First, the method records the degree of vertex v'0 as two. Then, the method records the degree of every vertex in V' that can be reached from vertex v'0 by a path of length one. In this way, the method records the degree of the vertices v'1 and v'2 as three. Then, the method records the degree of every vertex in V' that can be reached from the vertex v'0 by a path of length two. In this way, the method records the degree of the vertices v'3 and v'4 as two and three, respectively. Then, the method records the degree of every vertex in V' that can be reached from the vertex v'0 by a path of length three. In this way, the method records the degree of the vertex v'5 as one. Accordingly, the roentgen descriptor 1915 for the vertex v'0 of the undirected graph G'[V', E', d'] shown in Figure 20 reads as follows: length 0 - degree 2; length 1 - degree 3: 2; length 2 - degree 2: 1, degree 3: 1; length 3 - degree 1: 1.
In short, the vertex v'0 has degree two. Two vertices can be reached from the vertex v'0 by a path of length one. These two vertices have degree three. Two vertices can be reached from the vertex v'0 by a path of length two. One of these vertices has degree two. The other has degree three. One vertex can be reached from the vertex v'0 by a path of length three. This vertex has degree one.
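The derivation just described can be sketched as a breadth-first traversal. This is a minimal sketch under two assumptions: "reached by a path of length k" is read as shortest-path distance k, so each vertex is recorded once, and the edge list below is a hypothetical graph chosen only to reproduce the degrees stated for Figure 20, since the figure itself is not reproduced here. The function name is likewise illustrative.

```python
from collections import Counter, deque


def roentgen_descriptor(adj, start, max_length):
    """Map each path length 0..max_length to a Counter {degree: vertex count}."""
    distance = {start: 0}
    queue = deque([start])
    levels = {}
    while queue:
        v = queue.popleft()
        d = distance[v]
        # Record the degree of v under its distance from the start vertex.
        levels.setdefault(d, Counter())[len(adj[v])] += 1
        if d == max_length:
            continue  # do not explore beyond the predetermined maximum length
        for w in adj[v]:
            if w not in distance:
                distance[w] = d + 1
                queue.append(w)
    return levels


# Hypothetical edges consistent with the degrees described for Figure 20.
EDGES = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 3), (2, 4), (4, 5)]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

descriptor = roentgen_descriptor(ADJ, start=0, max_length=3)
for length in sorted(descriptor):
    print(length, dict(descriptor[length]))
```

For this graph the traversal records degree 2 at length 0, two vertices of degree 3 at length 1, one vertex each of degree 2 and degree 3 at length 2, and one vertex of degree 1 at length 3.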
In an alternative embodiment, the method also records the lengths of the simple cycles in the roentgen descriptor 1915 for a vertex v' as it traverses the undirected graph G'[V', E', d']. In another alternative embodiment, the method also records the vertex type in the roentgen descriptor 1915 for the vertices encountered as it traverses the undirected graph G'[V', E', d']. While the preceding discussion included different exemplary methods for calculating the roentgen descriptors 1915 of the vertices v' of an undirected graph G'[V', E', d'], it is apparent to one skilled in the art that other methods for calculating a roentgen descriptor 1915 can be used in the present invention. These methods need only ensure that the roentgen descriptor 1915 does not depend on the labeling of the vertices V', the labeling of the edges E' of the undirected graph G'[V', E', d'], or the manner in which the undirected graph G'[V', E', d'] is traversed.
The roentgen descriptor 1915 must also be written in a canonical form. See, for example, Kohavi, Chapter 1. In the preferred embodiment, the vertices that can be reached from vertex v'0 along a path of greater length are listed in the roentgen descriptor 1915 before the vertices that can be reached from vertex v'0 along a path of lesser length. Then, among the vertices that can be reached from the vertex v'0, those having a greater degree are listed in the roentgen descriptor 1915 before those having a lower degree. In addition, the number 0 is used as a delimiter between the elements of the roentgen descriptor 1915. For example, the canonical form of the preferred embodiment for the roentgen descriptor 1915 of the vertex v'0 of the undirected graph G'[V', E', d'] of Figure 20 is: 3, 1, 1, 0, 2, 3, 1, 0, 2, 2, 1, 0, 1, 3, 2, 0, 0, 2, 1.
The entry 3, 1, 1 means that there is 1 vertex of degree 1 that can be reached from the vertex v'0 by a path of length 3. The entry 2, 3, 1 means that there is 1 vertex of degree 3 that can be reached from v'0 by a path of length 2. The entry 2, 2, 1 means that there is 1 vertex of degree 2 that can be reached from v'0 by a path of length 2. The entry 1, 3, 2 means that there are 2 vertices of degree 3 that can be reached from v'0 by a path of length 1. The entry 0, 2, 1 means that there is 1 vertex of degree 2 that can be reached from v'0 by a path of length 0. In other words, the entry 0, 2, 1 means that the vertex v'0 has degree 2.
While the preceding discussion described the preferred canonical form for representing a roentgen descriptor 1915, it is apparent to one skilled in the art that other canonical forms for representing a roentgen descriptor 1915 can also be used in the present invention. The method need only ensure that the same canonical form for the roentgen descriptor 1915 is used for all vertices V' in all undirected graph representations G'[V', E', d'] of the chemical structures.
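The ordering rules of the preferred canonical form can be sketched as follows. The input layout (a mapping from path length to a mapping from degree to vertex count) and the function name are assumptions made for illustration; the ordering (longer paths first, then higher degrees) and the 0 delimiter follow the description above.

```python
def canonical_form(levels):
    """Flatten {length: {degree: count}} into the delimiter-separated list."""
    entries = [
        (length, degree, count)
        for length, by_degree in levels.items()
        for degree, count in by_degree.items()
    ]
    # Tuples sort by length first, then degree; reverse gives descending order.
    entries.sort(reverse=True)
    flat = []
    for i, entry in enumerate(entries):
        if i:
            flat.append(0)  # 0 separates consecutive entries
        flat.extend(entry)
    return flat


# Descriptor of vertex v'0 as described for Figure 20.
levels = {0: {2: 1}, 1: {3: 2}, 2: {2: 1, 3: 1}, 3: {1: 1}}
print(canonical_form(levels))
# [3, 1, 1, 0, 2, 3, 1, 0, 2, 2, 1, 0, 1, 3, 2, 0, 0, 2, 1]
```

Because the list is built from sorted entries, two vertices with the same neighborhood structure yield the same list regardless of how the vertices were labeled or visited.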
Step 1920 of the flowchart of Figure 19 calculates a digital print 1925 for each vertex v' of each generated undirected graph G'[V', E', d'] from the roentgen descriptor 1915 of the vertex created by the execution of step 1910. The roentgen descriptors 1915 for all vertices V' can become unmanageable because of their size. Step 1920 addresses this issue by taking digital prints 1925 of the roentgen descriptors 1915 for the vertices V'. Step 1920 can be represented as a function y = f(x), where x represents the roentgen descriptor 1915 and y represents the digital print 1925 of the roentgen descriptor 1915. The function f(x) must be a nonlinear function to ensure that linearities do not occur between the roentgen descriptor 1915 and the chemical structure descriptor 1935. The function f(x) must also depend on all elements of the roentgen descriptor 1915. In the preferred embodiment, the function f(x) is defined so that it is likely to produce different digital prints 1925 for different roentgen descriptors 1915. Alternatively, the function f(x) is defined so that it is impracticable for it to produce the same digital print 1925 for different roentgen descriptors 1915. Functions of the latter type are well known to those skilled in the art of cryptography and are used to create digital prints of documents. In the preferred embodiment, step 1920 uses the following formula:
\[ \text{digital print material} = \ln\Bigl(\sum_{i} \bigl|\ln\bigl((v_i + 1)\,(i + 30)\bigr)\bigr|\Bigr) \]
\[ \text{digital print} = \operatorname{round}\bigl(\operatorname{fraction}(\text{digital print material}) \times \text{modulo value}\bigr) \]
where i represents the index of a component of the roentgen descriptor 1915, v_i represents the value of component i of the roentgen descriptor 1915, fraction is a function that yields the fractional portion of its argument, and the vertical bars denote the absolute value function abs.
For example, if the roentgen descriptor 1915 is 54305050, the execution of step 1920 produces the digital print value 30 for a modulo value of 50, as indicated in the calculation shown below:
i    v_i    ln((v_i + 1) * (i + 30))
1    5      5.225747
2    4      5.075174
3    3      4.882802
4    0      3.526361
5    5      5.347108
6    0      3.583519
7    5      5.402677
8    0      3.637586
Sum         36.68097
Digital print material = ln(sum) = 3.602258
Digital print = round(fraction(digital print material) * modulo value) = round(0.602258 * 50) = 30
While the preceding discussion described the preferred function of step 1920 for taking digital prints 1925 of the roentgen descriptors 1915 for the vertices V' of the undirected graph representation G'[V', E', d'] of a chemical structure, it is evident to one skilled in the art that other functions can be used with the present invention to take the digital prints 1925 of the roentgen descriptors 1915 for the vertices V'. The function f(x) of step 1920 need only be a nonlinear function to ensure that linearities do not occur between the roentgen descriptor 1915 and the chemical structure descriptor 1935. The function f(x) must also depend on all the elements of the roentgen descriptor 1915 and can be defined so that it is likely to produce different digital prints 1925 for different roentgen descriptors 1915.
Alternatively, step 1920 may calculate the roentgen descriptors 1915 only for the vertices of V' having a degree greater than a predetermined threshold value instead of calculating the roentgen descriptors 1915 for each vertex in V'. In this alternative embodiment of the present invention, step 1920 calculates digital prints 1925 only for vertices in V' that have a degree greater than the predetermined threshold value. An exemplary predetermined threshold value for the degree of a vertex is two.
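The preferred digital print formula of step 1920 and its worked example can be checked with a short sketch. The function name and the representation of the roentgen descriptor as a list of component values are assumptions for illustration; the arithmetic follows the formula as described, with component indices starting at 1.

```python
import math


def digital_print(components, modulo_value):
    """Digital print of a roentgen descriptor given as a list of component values."""
    # Sum |ln((v_i + 1) * (i + 30))| over the components, with i starting at 1.
    total = sum(
        abs(math.log((v + 1) * (i + 30)))
        for i, v in enumerate(components, start=1)
    )
    material = math.log(total)
    fraction = material - math.floor(material)  # fractional portion
    return round(fraction * modulo_value)


# Worked example: descriptor 54305050 with a modulo value of 50.
print(digital_print([5, 4, 3, 0, 5, 0, 5, 0], 50))  # 30
```

Because the index i enters each term, swapping two unequal components generally changes the sum, which is one way the function depends on every element of the descriptor and on its order.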
Step 1930 calculates a chemical structure descriptor 1935 for each generated undirected graph G'[V', E', d'] from the digital prints 1925 of the vertices V' of the generated undirected graph G'[V', E', d']. In the preferred embodiment, step 1930 performs a modular addition operation on the digital prints 1925 to produce the chemical structure descriptor 1935. An exemplary modulo value is 50. For example, if the execution of step 1920 produces the digital prints 12, 15, 23, 5, 19, step 1930 calculates the chemical structure descriptor 1935 as follows: 12 + 15 + 23 + 5 + 19 = 74, and 74 modulo 50 = 24. While the preceding discussion described the preferred operation for producing the chemical structure descriptor 1935 from the digital prints 1925, it is apparent to one skilled in the art that other operations can be used with the present invention to calculate the chemical structure descriptor 1935 from the digital prints 1925. The operation of step 1930 need only combine the digital prints 1925 in such a way that duplicate chemical structures 1815 can be discovered.
As shown in Figure 18, in step 1820 the present invention removes the isomorphic undirected graphs G'[V', E', d'] from the set of undirected graphs G'[V', E', d'] that represent the chemical structures generated by step 1810. The present invention detects the isomorphic undirected graphs G'[V', E', d'] by comparing the chemical structure descriptors 1935 produced by step 1930 of Figure 19.
Next, the set of undirected graphs G'[V', E', d'] produced by the execution of steps 1810 and 1820 of Figure 18 may represent chemical structures that are not physically possible in the real world.
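The modular addition of step 1930 and the comparison of chemical structure descriptors in step 1820 can be sketched as follows. The function names, the dictionary of named structures, and the grouping strategy are illustrative assumptions; the sketch only flags structures whose descriptors coincide as candidate duplicates.

```python
from collections import defaultdict


def structure_descriptor(digital_prints, modulo_value=50):
    """Combine per-vertex digital prints by modular addition."""
    return sum(digital_prints) % modulo_value


def group_by_descriptor(structures, modulo_value=50):
    """Group structure names by descriptor; structures maps name -> digital prints."""
    groups = defaultdict(list)
    for name, prints in structures.items():
        groups[structure_descriptor(prints, modulo_value)].append(name)
    return dict(groups)


print(structure_descriptor([12, 15, 23, 5, 19]))  # 24

candidates = {
    "a": [12, 15, 23, 5, 19],
    "b": [19, 5, 23, 15, 12],  # same prints in a different vertex order
    "c": [1, 2, 3],
}
print(group_by_descriptor(candidates))  # {24: ['a', 'b'], 6: ['c']}
```

Addition is insensitive to the order of the vertices, so relabeling a graph does not change its descriptor; any group with more than one member is a set of candidate duplicates.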
For example, a chemical structure can be unrealistic because of the deformation of the angles at carbon atoms within the chemical structure. Therefore, step 1830 filters the undirected graphs G'[V', E', d'] that represent unrealistic chemical structures out of the set of undirected graphs G'[V', E', d'] generated by the execution of steps 1810 and 1820. In the preferred embodiment, step 1830 applies the following rules to chemical structures containing only carbon atoms in order to filter out the undirected graphs G'[V', E', d'] that represent unrealistic chemical structures:
If an undirected graph G'[V', E', d'] representing a chemical structure has a simple cycle of length 3 and another simple cycle of length 3 that share at least one edge, the chemical structure is unlikely and is filtered out; and
If an undirected graph G'[V', E', d'] representing a chemical structure has a simple cycle of length 3 and another simple cycle of length 4 that share at least one edge, the chemical structure is unlikely and is filtered out.
Additional rules similar to those mentioned above can easily be added to the method of the present invention to filter the undirected graphs G'[V', E', d'] representing unrealistic chemical structures out of the set of undirected graphs G'[V', E', d'].
While the invention has been described above with reference to certain preferred embodiments, the scope of the present invention is not limited to those embodiments. One skilled in the art may find variations of these preferred embodiments which nevertheless fall within the spirit of the present invention, the scope of which is defined by the claims set forth below.
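The two filtering rules of step 1830 described above can be sketched for small carbon-only graphs. This is a minimal sketch, assuming integer vertex labels and a simple graph without parallel edges; the function names are hypothetical, and the cycle enumeration is a plain depth-first search rather than anything specified in the patent.

```python
def adjacency(edges):
    """Build an adjacency map from a list of undirected edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def simple_cycles(adj, length):
    """Return each simple cycle of the given length as a frozenset of edges."""
    cycles = set()

    def extend(path):
        if len(path) == length:
            if path[0] in adj[path[-1]]:  # the path closes into a cycle
                cycles.add(frozenset(
                    frozenset((path[i], path[(i + 1) % length]))
                    for i in range(length)
                ))
            return
        for nxt in adj[path[-1]]:
            # Start every cycle at its smallest vertex to limit repeats.
            if nxt > path[0] and nxt not in path:
                extend(path + [nxt])

    for start in adj:
        extend([start])
    return cycles


def is_unrealistic(adj):
    """True if a 3-cycle shares an edge with another 3-cycle or with a 4-cycle."""
    three = simple_cycles(adj, 3)
    four = simple_cycles(adj, 4)
    return any(
        other != c3 and c3 & other
        for c3 in three
        for other in three | four
    )


fused_triangles = adjacency([(0, 1), (1, 2), (0, 2), (1, 3), (2, 3)])
six_ring = adjacency([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
print(is_unrealistic(fused_triangles), is_unrealistic(six_ring))
```

Representing each cycle by its set of edges makes the shared-edge test a set intersection, so further rules of the same kind would only need additional cycle lengths.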

Claims (16)

  1. A method for classifying one or more molecules, comprising the step of determining a graph representation, having one or more simple cycles, for the molecule, characterized in that the method further comprises the steps of: defining a plurality of categories wherein each of the categories is identified by a table comprising: a designated number of lengths of the designated number of largest simple cycles of the graph representation; and, for each of the lengths, a corresponding number of simple cycles having that length; and assigning at least one of the molecules to one of the categories.
  2. The method for classifying one or more molecules as in claim 1, further comprising the step of: sorting the lengths in the table in a designated order.
  3. The method for classifying one or more molecules as in claim 2, wherein the designated order of the sorting step is a descending order.
  4. The method for classifying one or more molecules as in claim 2, wherein the designated order of the sorting step is an ascending order.
  5. The method for classifying one or more molecules as in claim 1, wherein the designated number of lengths of the designated number of largest simple cycles is determined by: setting a desired simple cycle length to the number of the plurality of edges of the graph representation; determining whether the graph representation has a simple cycle having a length equal to the desired simple cycle length; and recording the desired simple cycle length in the table if the graph representation has a simple cycle with the desired simple cycle length.
  6. The method for classifying one or more molecules as in claim 5, further comprising the steps of: decrementing the desired simple cycle length by one; and repeating the setting step, the step of determining whether the graph representation has a simple cycle, the recording step, and the decrementing step while the number of lengths in the table is less than the designated number and the desired simple cycle length is greater than two.
  7. The method for classifying one or more molecules as in claim 1, further comprising the step of: finding which of the categories are empty to identify structures absent from the one or more molecules.
  8. The method for classifying one or more molecules as in claim 1, further comprising the step of: generating one or more chemical structures to fill the empty categories of the plurality of categories.
  9. The method for classifying one or more molecules as in claim 8, further comprising the steps of: identifying the isomorphic chemical structures in the plurality of categories; and removing the isomorphic chemical structures.
  10. The method for classifying one or more molecules as in claim 9, further comprising the steps of: identifying unrealistic chemical structures in the plurality of categories; and removing the unrealistic chemical structures.
  11. The method for classifying one or more molecules as in claim 9, wherein the step of identifying the isomorphic chemical structures comprises the steps of: calculating at least one roentgen descriptor for each of the plurality of chemical structures; calculating at least one digital print for each of the roentgen descriptors; and searching for duplicates of the digital prints to identify the isomorphic chemical structures.
  12. The method for classifying one or more molecules as in claim 1, wherein the graph representation comprises a plurality of vertices and a plurality of edges, wherein at least one atom of the molecule has a corresponding one of the plurality of vertices and at least one bond of the molecule has a corresponding one of the plurality of edges.
  13. A system for classifying a library of molecules, comprising a graph representation, having one or more simple cycles, for the molecule, characterized in that the system further comprises: a plurality of categories wherein each of the categories is identified by a table comprising: a designated number of lengths of the designated number of largest simple cycles of the graph representation; and, for each of the lengths, a corresponding number of simple cycles having that length.
  14. The system for classifying a library of molecules as in claim 13, wherein the table is sorted by the designated number of lengths in a designated order.
  15. The system for classifying a library of molecules as in claim 13, wherein the lengths in the table are not duplicated.
  16. The system for classifying a library of molecules as in claim 13, wherein the graph representation comprises a plurality of vertices and a plurality of edges, wherein at least one atom of the molecule has a corresponding one of the plurality of vertices and at least one bond of the molecule has a corresponding one of the plurality of edges.
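The table construction recited in claims 1, 5, and 6 can be sketched as a loop that starts at the number of edges and works downward. This is a minimal sketch, assuming integer vertex labels and a simple graph; the function names and the dictionary form of the table are hypothetical, and the exhaustive cycle search is only suitable for small graphs.

```python
def adjacency(edges):
    """Build an adjacency map from a list of undirected edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def simple_cycles(adj, length):
    """Return each simple cycle of the given length as a frozenset of edges."""
    cycles = set()

    def extend(path):
        if len(path) == length:
            if path[0] in adj[path[-1]]:
                cycles.add(frozenset(
                    frozenset((path[i], path[(i + 1) % length]))
                    for i in range(length)
                ))
            return
        for nxt in adj[path[-1]]:
            if nxt > path[0] and nxt not in path:
                extend(path + [nxt])

    for start in adj:
        extend([start])
    return cycles


def ring_table(adj, designated_number):
    """Map each of the largest cycle lengths to its number of simple cycles."""
    desired = sum(len(neighbors) for neighbors in adj.values()) // 2  # edge count
    table = {}
    while len(table) < designated_number and desired > 2:
        count = len(simple_cycles(adj, desired))
        if count:
            table[desired] = count  # record the length and its cycle count
        desired -= 1
    return table


# Two fused six-membered rings (a naphthalene-like carbon skeleton).
fused_rings = adjacency([
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
    (4, 6), (6, 7), (7, 8), (8, 9), (9, 5),
])
print(ring_table(fused_rings, designated_number=2))  # {10: 1, 6: 2}
```

For this skeleton the two largest cycle lengths are 10 (the outer ring) and 6 (each fused ring), so the table already lists the lengths in descending order.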
MXPA01003448A 1998-10-05 1999-10-05 A system for the classification and generation of chemical compounds. MXPA01003448A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16792598A 1998-10-05 1998-10-05
PCT/IB1999/001756 WO2000020991A1 (en) 1998-10-05 1999-10-05 A system for the classification and generation of chemical compounds

Publications (1)

Publication Number Publication Date
MXPA01003448A true MXPA01003448A (en) 2002-09-18

Family

ID=22609385

Family Applications (1)

Application Number Title Priority Date Filing Date
MXPA01003448A MXPA01003448A (en) 1998-10-05 1999-10-05 A system for the classification and generation of chemical compounds.

Country Status (6)

Country Link
EP (1) EP1119818A1 (en)
JP (1) JP2002526863A (en)
AU (1) AU6225499A (en)
CA (1) CA2346013A1 (en)
MX (1) MXPA01003448A (en)
WO (1) WO2000020991A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6757618B2 (en) 2000-05-09 2004-06-29 Pharmacia & Upjohn Company Chemical structure identification
CA2546562A1 (en) * 2003-11-21 2005-06-09 Optive Research, Inc. System and method for identifying structures for a chemical compound
JP5075362B2 (en) * 2005-07-05 2012-11-21 智久 石川 Method for quantitative prediction of physiological activity of compounds
WO2017161250A1 (en) * 2016-03-17 2017-09-21 Elsevier, Inc. Systems and methods for electronic searching of materials and material properties
JP2019185506A (en) * 2018-04-13 2019-10-24 株式会社中村超硬 Flow synthesis device and flow synthesis method

Also Published As

Publication number Publication date
JP2002526863A (en) 2002-08-20
EP1119818A1 (en) 2001-08-01
WO2000020991A1 (en) 2000-04-13
CA2346013A1 (en) 2000-04-13
AU6225499A (en) 2000-04-26
