WO1997044744A1 - Canonicalized data creation device, canonicalized data creation method, and storage medium for canonicalized data creation - Google Patents


Info

Publication number
WO1997044744A1
Authority
WO
WIPO (PCT)
Prior art keywords
atom
canonical
data
atoms
class
Prior art date
Application number
PCT/JP1997/001661
Other languages
English (en)
Japanese (ja)
Inventor
Atsushi Tomonaga
Original Assignee
Kureha Kagaku Kogyo Kabushiki Kaisha
Priority claimed from JP8125117A (JPH09305612A)
Priority claimed from JP8125123A (JPH09305628A)
Application filed by Kureha Kagaku Kogyo Kabushiki Kaisha
Publication of WO1997044744A1

Links

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/90 Programming languages; Computing architectures; Database systems; Data warehousing

Definitions

  • Canonicalized data creation device, canonicalized data creation method, and storage medium for canonicalized data creation
  • The present invention relates to a processing apparatus and a processing method for performing information processing in the fields of chemistry and biochemistry, and more particularly to a canonicalized data creation device and a canonicalized data creation method for creating canonicalized data that uniquely identifies the chemical structure of a compound from various data on each atom constituting the compound.
  • The present invention also relates to a storage medium (computer program product), such as a flexible disk or a magnetic tape, in which a processing program for performing information processing in the fields of chemistry and biochemistry is stored, and in particular to a storage medium storing a processing program for creating canonicalized data that uniquely identifies the chemical structure of a compound from various data on each atom constituting the compound.
  • Compound database systems and programs containing compound information, and reaction database systems and programs containing compound reaction information, have been developed.
  • Compound database systems and programs contain compound information such as the physical properties and actions of existing compounds, and access this information using the compound structure as a key. With such a compound database, the properties and actions of a compound can be referred to efficiently.
  • Reaction database systems and programs contain reaction information on existing compounds, and access this information using the structure of the compound as a key. With such a reaction database, researchers in synthetic chemistry can efficiently refer to similar reactions among the reaction information of existing compounds when performing a new synthesis of a compound.
  • Such database systems include, for example, “ISIS”, a comprehensive chemical information management system from MDL (USA), and “REACCS”, a reaction information management system.
  • The conventional compound/reaction database system has a function of displaying a compound structural diagram on a display together with information on the properties and actions of the compound.
  • With this function, a visually superior system can be constructed.
  • Image data (bitmap data)
  • The present invention solves such a problem. Its objective is to provide a canonicalized data creation device that, when used in a compound/reaction database system, can greatly reduce the storage area used by the system, together with a canonicalized data creation method and a storage medium for canonicalized data creation (computer program product).
  • Disclosure of the invention
  • The canonicalized data creation device of the present invention includes:
  • input means for receiving input of unique data for each atom constituting the compound and bond pair data between atoms; and
  • canonicalized data creation means for creating canonicalized data that can uniquely identify the chemical structure of the compound based on the unique data and the bond pair data received by the input means.
  • The canonicalized data creation means includes:
  • a first processing unit that classifies each atom into a different class for each group of equivalent atoms based on the unique data and the bond pair data, and assigns each atom a different class number for each class;
  • a second processing unit that gives each atom a canonical number uniquely corresponding to the structure of the compound, based on the class number given to each atom in the first processing unit; and
  • a third processing unit that creates the canonicalized data based on the canonical number given to each atom in the second processing unit.
  • The unique data for each atom and the bond pair data between atoms received by the input means are provided to the canonicalized data creation means, which creates canonicalized data based on these data.
  • First, the processing of the first processing unit is executed: based on the unique data of each atom and the bond pair data between the atoms, each atom is classified into a different class for each group of equivalent atoms, and a different class number is given to each atom for each class.
  • Next, the processing of the second processing unit is executed: based on the class number given to each atom and the bond pair data between the atoms, a canonical number uniquely corresponding to the structure of the compound is given to each atom.
  • Finally, the processing of the third processing unit is executed, and canonicalized data is created based on the canonical number given to each atom and the unique data of each atom.
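  • In the first processing unit, each atom is given three types of attributes $(a_i, b_{ij}, d_{ij})$. Using the fact that two atoms can be judged non-equivalent if even one of these attributes differs, the atoms are classified into groups of equivalent atoms and a different class number is assigned to each class, where
  • $a_i$ is the type symbol of the atom of input number i,
  • $b_{ij}$ is the number of bonds whose type symbol is j among the bonds adjacent to the atom of input number i, and
  • $d_{ij}$ is the number of paths that can be traversed through j bonds from the atom of input number i by the shortest path.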
  • The second processing unit assigns canonical numbers to the atoms in ascending order from 1. The atom whose class number has the highest priority is assigned canonical number 1. Thereafter, once canonical numbers up to n have been assigned, the atom with the smallest canonical number is selected from among the atoms that already have a canonical number and are bonded to an atom that does not yet have one. Among the atoms bonded to the selected atom that do not yet have a canonical number, the atom whose class number has the highest priority is assigned canonical number n + 1.
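  • In the third processing unit, each atom is given three types of attributes $(P_i, T_i, S_i)$, and the canonicalized data is created by arranging these attributes in a row, where
  • $P_i$ is the canonical number of the atom having the smallest canonical number among the atoms bonded to the atom of canonical number i,
  • $T_i$ is the type symbol of the bond between the atom of canonical number i and the atom of canonical number $P_i$, and
  • $S_i$ is preferably the type symbol of the atom of canonical number i.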
  • The canonicalized data creation method of the present invention is a method for creating canonicalized data that can uniquely specify the chemical structure of a compound, and includes:
  • a first step of classifying each atom into a different class for each group of equivalent atoms and giving each atom a different class number for each class;
  • a second step of giving each atom a canonical number uniquely corresponding to the structure of the compound, based on the class number given in the first step; and
  • a third step of creating the canonicalized data based on the canonical number given in the second step.
  • By the processing from the first step to the third step, canonicalized data is created based on the unique data of each atom constituting the compound and the bond pair data between the atoms.
  • The process of the first step is executed, and each atom is classified into a different class for each group of equivalent atoms based on the unique data of each atom and the bond pair data between the atoms. A different class number is assigned to each atom for each class.
  • The process of the second step is executed, and based on the class number given to each atom and the bond pair data between atoms, a canonical number uniquely corresponding to the structure of the compound is given to each atom.
  • The process of the third step is executed, and canonicalized data is created based on the canonical number given to each atom and the unique data of each atom.
  • In the first step,
  • each atom is given three types of attributes $(a_i, b_{ij}, d_{ij})$. Using the fact that two atoms can be judged non-equivalent if even one of these attributes differs, the atoms are classified into groups of equivalent atoms and given different class numbers, where
  • $a_i$ is the type symbol of the atom of input number i,
  • $b_{ij}$ is the number of bonds whose type symbol is j among the bonds adjacent to the atom of input number i, and
  • $d_{ij}$ is the number of paths that can be traversed through j bonds from the atom of input number i by the shortest path.
  • the second step is
  • $P_i$ is the canonical number of the atom having the smallest canonical number among the atoms bonded to the atom of canonical number i, $T_i$ is the type symbol of the bond between the atom of canonical number i and the atom of canonical number $P_i$, and $S_i$ is preferably the type symbol of the atom of canonical number i.
  • The computer program product for canonicalized data creation of the present invention is
  • a computer program product for use with an information processing apparatus having an input means for receiving input of unique data for each atom constituting a compound and bond pair data between atoms, and a reading means for reading information from a computer usable medium,
  • the computer program product comprising a computer usable medium having a computer readable program embodied therein,
  • the program being a computer readable canonicalized data creation program that creates canonicalized data capable of uniquely specifying the chemical structure of the compound based on the unique data and the bond pair data.
  • The canonicalized data creation program includes:
  • a computer readable first processing routine that classifies each atom into a different class for each group of equivalent atoms based on the unique data and the bond pair data, and gives each atom a different class number for each class;
  • a computer readable second processing routine that gives each atom a canonical number uniquely corresponding to the structure of the compound, based on the class number given to each atom in the first processing routine; and
  • a computer readable third processing routine that creates the canonicalized data based on the canonical number given to each atom in the second processing routine.
  • In the storage medium for canonicalized data creation (computer program product), the canonicalized data creation program is stored in the program area, and the program can be executed by the information processing apparatus.
  • The first processing routine is executed, and each atom is classified into a different class for each group of equivalent atoms based on the unique data for each atom and the bond pair data between the atoms. A different class number is assigned to each atom for each class.
  • The second processing routine is executed, and based on the class number given to each atom and the bond pair data between the atoms, a canonical number uniquely corresponding to the structure of the compound is given to each atom. Further, the third processing routine is executed, and canonicalized data is created based on the canonical number given to each atom and the unique data of each atom.
  • the first processing routine includes:
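  • $a_i$ is the type symbol of the atom of input number i,
  • $b_{ij}$ is the number of bonds whose type symbol is j among the bonds adjacent to the atom of input number i, and
  • $d_{ij}$ is the number of paths that can be traversed through j bonds from the atom of input number i by the shortest path;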
  • the second processing routine includes:
  • the third processing routine includes:
  • $P_i$ is the canonical number of the atom having the smallest canonical number among the atoms bonded to the atom of canonical number i,
  • $T_i$ is the type symbol of the bond between the atom of canonical number i and the atom of canonical number $P_i$, and
  • $S_i$ is preferably the type symbol of the atom of canonical number i.
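The definitions of $P_i$, $T_i$ and $S_i$ can be read quite literally as a small table-building step. The sketch below is only that literal reading, under stated assumptions: the function name `canonical_rows`, the three-atom example and the tuple layout are hypothetical, atom types are written as atomic numbers, and nothing here covers how the first atom or ring-closing bonds are treated in the actual canonicalized data, which this passage does not specify.

```python
def canonical_rows(atom_type, bonds):
    """Return one (P_i, T_i, S_i) row per atom, in canonical-number order.

    atom_type: {canonical number: type symbol of the atom}
    bonds:     [(canonical number, canonical number, bond type symbol), ...]
    """
    bonded = {i: {} for i in atom_type}
    for i, j, bond_type in bonds:
        bonded[i][j] = bond_type
        bonded[j][i] = bond_type
    rows = []
    for i in sorted(atom_type):
        p = min(bonded[i])                # P_i: smallest canonical number among neighbours
        rows.append((p, bonded[i][p], atom_type[i]))   # (P_i, T_i, S_i)
    return rows


# Hypothetical three-atom chain O=C-C, assumed to be numbered canonically already.
ATOM_TYPE = {1: 8, 2: 6, 3: 6}
BONDS = [(1, 2, 2), (2, 3, 1)]
ROWS = canonical_rows(ATOM_TYPE, BONDS)
```

Arranging the rows one after another gives a single sequence of symbols, which matches the description of the canonicalized data as attributes arranged in a row.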
  • FIG. 1 is a block diagram showing an example of the canonical data creation device of the present invention.
  • FIG. 2A is a diagram showing an example of the data content of the atom table of the bond table.
  • FIG. 2B is a diagram showing an example of the data content of the atom pair table of the bond table.
  • FIG. 3 is a diagram showing an example of the contents of a compound information file.
  • FIG. 4 is a schematic diagram showing an outline of the operation of the canonicalized data creation device of the present invention.
  • FIG. 5 is a flowchart showing the flow of the processing of the main routine.
  • FIG. 6 is a flowchart showing the flow of the processing of the constituent atom classification routine.
  • FIG. 7A is a diagram showing an example of the data content of the atom table of the bond table.
  • FIG. 7B is a diagram showing an example of the data content of the atom pair table of the bond table.
  • FIG. 8 is a diagram showing the relationship between the atoms constituting 3,5-dimethyl-2,3,4,5-tetrahydropyridine and the input numbers.
  • FIGS. 9A and 9B are diagrams each showing an example of the data content of the reference table.
  • FIG. 10 is a diagram showing the three types of attributes $(a_i, b_{ij}, d_{ij})$ given to each atom constituting 3,5-dimethyl-2,3,4,5-tetrahydropyridine.
  • FIG. 11A, FIG. 11B, FIG. 12, FIG. 13A, FIG. 13B, FIG. 14A and FIG. 14B are diagrams each showing an example of the data content of the reference table.
  • FIGS. 15A, 15B, and 15C are diagrams each showing the relationship between the atoms constituting 3,5-dimethyl-2,3,4,5-tetrahydropyridine and the class numbers.
  • FIG. 16 is a diagram showing the attribute $V_{ij}^{1}$ given to each atom constituting 3,5-dimethyl-2,3,4,5-tetrahydropyridine.
  • FIG. 18 is a flowchart showing the flow of the processing of the canonical numbering routine.
  • FIG. 19 is a diagram showing the relationship between the atoms constituting 3,5-dimethyl-2,3,4,5-tetrahydropyridine and the canonical numbers.
  • FIG. 20 is a flowchart showing the processing flow of the canonicalized data creation routine.
  • FIG. 21A is a diagram showing an example of the data contents of the atom table of the bonding table
  • FIG. 21B is a diagram showing an example of the data contents of the atom pair table of the bonding table.
  • FIG. 22 is a diagram showing an example of the contents of the canonical tree structure data.
  • FIG. 23 is a block diagram illustrating a configuration of an example of a storage medium for canonical data creation according to the present invention.
  • FIG. 24 is a block diagram showing an example of the canonical data creation device according to the present invention.
  • FIG. 25 is a perspective view showing an example of the canonical data creating apparatus according to the present invention.
  • FIG. 26A is a diagram showing the molecular structure of the C 6 . molecule.
  • FIG. 26B is a diagram showing the canonicalized data of the C 6 . molecule.
  • FIG. 1 is a block diagram showing the configuration of a canonicalized data creation device 1 according to a preferred embodiment of the present invention.
  • The canonicalized data creation device 1 includes an image memory 10 for temporarily storing image data 10a such as a molecular structure diagram of a compound, a working memory 11 for temporarily storing symbol data 11a and the like, a main storage device 20 for storing an operating system (OS) 21 and a canonicalized data creation program 22, and a hard disk device 30 in which a bond table file 31 and a compound information file 33 are stored.
  • The canonicalized data creation device 1 also includes a display 40 for displaying a molecular structure diagram of a compound, a mouse 50 as a pointing device for receiving input of a hand-drawn figure, a keyboard 60 for receiving input of symbol data such as a chemical formula of a compound, a printer 70 for outputting a molecular structure diagram of a compound and the like, and a CPU 80 for controlling execution of the canonicalized data creation program 22 and the like.
  • Besides the mouse 50, pointing devices include a tablet, a digitizer, a light pen, and the like. Any of these devices may be provided instead of the mouse 50.
  • The canonicalized data creation program 22 is a program (canonicalized data creation means) that creates canonicalized data based on the unique data of each atom constituting the compound and the bond pair data between the atoms.
  • The canonicalized data creation program 22 includes a main routine 100 for controlling the processing, and a constituent atom classification routine (first processing routine, first processing unit) 101 for assigning a class number to each atom constituting the compound.
  • The canonicalized data creation program 22 further includes a canonical numbering routine (second processing routine, second processing unit) 102 for assigning a canonical number to each atom based on the class number, and a canonicalized data creation routine (third processing routine, third processing unit) 103 for creating canonicalized data based on the canonical number of each atom.
  • The hard disk device 30 is provided with a bond table file 31 capable of storing a plurality of bond tables 32.
  • The bond table 32 records the unique data for each atom constituting the compound and the bond pair data between atoms, and the canonicalized data creation program 22 can access these data via the bond table 32.
  • The bond table 32 consists of an atom table 32a, in which the unique data for each atom is recorded, and an atom pair table 32b, in which the bond pair data between atoms is recorded.
  • Specifically, the atom table 32a has columns for the input number (also called the atom number), the two-dimensional coordinates of the atom (X coordinate and Y coordinate), the element name (generally the element symbol is used, but the atomic number or the like may also be used), attributes, the number of atoms, and the number of bonds (see FIG. 2A).
  • The input number is a number by which the computer identifies each atom constituting the compound. It is a numeral in the example of FIG. 2A, but may instead be a symbol.
  • The bond atom pair data is preferably expressed as a combination of input numbers.
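To make the layout of the bond table concrete, the following is a minimal sketch of one way the atom table 32a and the atom pair table 32b could be held in memory. The class and field names (`AtomRow`, `AtomPairRow`, `BondTable`, `neighbours`) are hypothetical, the coordinates are placeholders, and only the columns that the text names unambiguously are modelled. The eight-atom example is 3,5-dimethyl-2,3,4,5-tetrahydropyridine with input numbers inferred from the numeric strings quoted later in the description, so it should be treated as an assumption rather than a copy of FIG. 8.

```python
from dataclasses import dataclass, field


@dataclass
class AtomRow:
    """One row of the atom table: unique data for a single atom."""
    input_number: int
    x: float = 0.0
    y: float = 0.0
    element: str = "C"      # an unwritten element name is read as carbon


@dataclass
class AtomPairRow:
    """One row of the atom pair table: a bond atom pair and its bond type."""
    pair: tuple
    bond_type: int = 1      # an unwritten multiplicity is read as a single bond


@dataclass
class BondTable:
    atoms: list = field(default_factory=list)
    pairs: list = field(default_factory=list)

    def neighbours(self, input_number):
        """Input numbers of the atoms bonded to the given atom, sorted."""
        found = []
        for row in self.pairs:
            if input_number in row.pair:
                other = row.pair[0] if row.pair[1] == input_number else row.pair[1]
                found.append(other)
        return sorted(found)


# Assumed numbering: atoms 1-6 form the ring, atom 3 is nitrogen, atoms 7 and 8 are methyl carbons.
TABLE = BondTable(
    atoms=[AtomRow(n, element="N" if n == 3 else "C") for n in range(1, 9)],
    pairs=[AtomPairRow((1, 2)), AtomPairRow((2, 3), bond_type=2), AtomPairRow((3, 4)),
           AtomPairRow((4, 5)), AtomPairRow((5, 6)), AtomPairRow((1, 6)),
           AtomPairRow((5, 7)), AtomPairRow((1, 8))],
)
```

Hydrogen atoms are left out, in line with the statement that molecular structure diagrams and the bond tables based on them are usually created without them.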
  • The hard disk device 30 also stores a compound information file 33, in which a list showing the relationship between each compound number and the canonicalized data corresponding to that compound (also referred to as canonical data) is recorded.
  • The compound information file 33 is a file storing a list in which the canonicalized data corresponding to each of the compounds C1 to C7 and the reference data for each of the compounds C1 to C7 (name, literature, physical properties, etc.) are associated with the compound numbers C1 to C7.
  • The canonicalized data and the reference data can be read out from this file.
  • The canonicalized data is data consisting of a plurality of symbols that uniquely specify the chemical structure of each compound.
  • The constituent atom classification routine 101 corresponds to the first step, the canonical numbering routine 102 to the second step, and the canonicalized data creation routine 103 to the third step.
  • By operating the mouse 50 or the keyboard 60, the operator can create, in the bond table file 31, the bond table 32 of the compound whose canonicalized data is to be created.
  • Input with the mouse 50 is performed by hand-drawing the molecular structure diagram of the compound on the display 40 using the mouse 50. The input number of each atom, determined by the order of input, is written into the input number column of the bond table 32 created in the main storage device 20. Further, the bond atom pairs indicating the bonding relationship of the atoms in the molecular structure diagram E1 are written into the bond atom pair column of the bond table 32.
  • In this way, the bond table 32 that specifies the compound is created from the hand-drawn molecular structure diagram E1.
  • The input of the molecular structure diagram using the mouse 50 will now be described more specifically.
  • When the operator clicks the mouse 50, the data for one of the atoms that make up a bond atom pair is input.
  • When the operator clicks again, data on the other atom, which forms the bond atom pair with the first atom, is input.
  • The click that entered the data for this other atom is then treated, when the next click is made, as the click that designates the first atom of the next bond atom pair.
  • A bond atom pair is thus specified by two consecutive clicks. That is, the operator can input all the bonds that make up the compound by continuing to specify bond atom pairs while shifting from one atom to the next.
  • The atom number (for example, 1, 2, 3, ...) is written in the input number column of the atom table 32a for each atom that is input.
  • The input data on the bonding relationship of the atoms is written in the bond atom pair column of the atom pair table 32b.
  • When the operator inputs the element name of an atom, it is written in the element name column of the atom table 32a.
  • When the operator inputs the multiplicity of the bond connecting a bond atom pair, it is written in the bond type column of the atom pair table 32b. When they are not written, the element name and the multiplicity are taken to be carbon and a single bond, respectively. Note that molecular structure diagrams and the bond tables based on them are usually created without hydrogen atoms.
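The click-by-click input described here can be illustrated with a short sketch. It assumes that each click has already been resolved to the input number of the atom under the pointer, which the text does not detail; the function names `pairs_from_clicks` and `fill_defaults` and the dictionary arguments are hypothetical.

```python
def pairs_from_clicks(clicks):
    """Each click closes one bond atom pair and opens the next one."""
    return list(zip(clicks, clicks[1:]))


def fill_defaults(pairs, elements=None, multiplicities=None):
    """Apply the stated defaults: carbon for atoms, single bond for bonds."""
    elements = elements or {}
    multiplicities = multiplicities or {}
    atoms = sorted({n for pair in pairs for n in pair})
    atom_table = {n: elements.get(n, "C") for n in atoms}
    pair_table = {pair: multiplicities.get(pair, 1) for pair in pairs}
    return atom_table, pair_table


# Hypothetical session: the operator traces a six-membered ring and returns to atom 1,
# marks atom 3 as nitrogen and the bond between atoms 2 and 3 as a double bond.
CLICKS = [1, 2, 3, 4, 5, 6, 1]
PAIRS = pairs_from_clicks(CLICKS)
ATOM_TABLE, PAIR_TABLE = fill_defaults(PAIRS, elements={3: "N"}, multiplicities={(2, 3): 2})
```

Branches such as methyl groups would need a way to start a new run of clicks from an existing atom; how the original device handles that is not stated in this passage, so it is left out of the sketch.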
  • Input with the keyboard 60 is performed by entering a symbol string that specifies the name of the bond table corresponding to a given compound. Based on the input symbol data 11a, the bond table 32 specified by that name is read from the bond table file 31.
  • The mouse 50 and the keyboard 60 constitute the input means A, and the bond table 32 can be obtained by using either the mouse 50 or the keyboard 60.
  • The canonicalized data creation program 22, serving as the canonicalized data creation means B, is then executed, and canonicalized data 34 is created based on the data in the bond table 32.
  • The canonicalized data 34 created in this way is written and stored in the compound information file 33 together with the reference data of the compound.
  • the reason that the canonicalized data 34 is created and stored from the connection table 32 is that the storage area can be made smaller than saving the connection table 32 as it is. That is, the canonicalized data 34 created based on the binding table 32 shown in FIG.
  • Input with the keyboard 60 may also be performed by directly writing the unique data and the bond pair data into the bond table created in the main storage device 20. Further, an apparatus for optically reading figures and characters, such as an image scanner or an optical character reader (OCR), may be used as the input device of the present invention to receive input of the bond table.
  • A canonicalized data creation method according to a preferred embodiment of the present invention will now be described.
  • In this method, the above-described canonicalized data creation device 1 is used.
  • The main routine 100 of the canonicalized data creation program 22 is started under the control of the OS 21.
  • The main routine 100 first calls the constituent atom classification routine (first processing routine) 101 to assign a class number to each atom constituting the compound (first step: S10).
  • This process classifies each atom constituting the compound into a different class for each group of equivalent atoms and gives each atom a class number corresponding to the class to which it belongs. For example, all six carbon atoms of benzene are equivalent, so they all have the same class number. The seven carbon atoms of toluene, on the other hand, are represented by five class numbers: the two carbon atoms at the ortho positions are equivalent to each other and share one class number, and the two carbon atoms at the meta positions are likewise equivalent to each other and share another.
  • The attribute $a_i$ is the type symbol (the atomic number in this example) of the atom of input number i.
  • The attribute $b_{ij}$ is the number (a vector quantity) of bonds whose type symbol is j among the bonds adjacent to the atom of input number i (that is, the bonds having the atom of input number i as one of their atoms), where 1 denotes a single bond, 2 a double bond, 3 a triple bond, and 4 an aromatic bond.
  • The attribute $d_{ij}$ is the number (a vector quantity) of paths that can be traversed through j bonds from the atom of input number i by the shortest path.
  • The attributes $(a_i, b_{ij}, d_{ij})$ are arranged side by side for each atom to form a numeric string (a 9-digit numeric string in this example).
  • The class number given here is the zeroth-order class number $C_i^0$; the first-order class number $C_i^1$, the second-order class number $C_i^2$, and so on are given in the loop processing from S103 onward. If the number of classes at the zeroth order is already equal to the total number of atoms, the processing may be terminated.
  • The order n is set to 1 (S103). Then the attribute $V_{ij}^n$ is given to each atom (S104).
  • The attribute $V_{ij}^n$ is the number of atoms, among those bonded to the atom of input number i, whose class number at order n - 1 is j. The attributes $(a_i, b_{ij}, d_{ij}, V_{ij}^n)$ are then arranged side by side for each atom, and the class numbers $C_i^n$ are given in ascending order of the resulting numeric string, thereby classifying the atoms into classes (S105). Next, it is checked whether the number of classes $N_n$ is equal to the number of classes $N_{n-1}$ in the previous loop, and the processing is terminated if they are equal (S106).
  • In the worked example that follows, each atom is first given the three types of attributes $(a_i, b_{ij}, d_{ij})$.
  • the input numbers recorded in the connection table 32 are arbitrary numbers given in the order in which each atom is handwritten as shown in FIG. .
  • The attribute $a_i$ is obtained as follows. As described above, the attribute $a_i$ is the type symbol of the atom of input number i.
  • The attribute $b_{ij}$ is obtained as follows.
  • The attribute $b_{ij}$ is the number of bonds whose type symbol is j among the bonds adjacent to the atom of input number i.
  • The bond table 32 records the bond type of each bond atom pair, and the attribute $b_{ij}$ can be obtained by reading out this bond type from the bond table 32.
  • Here, the attribute $b_{ij}$ is obtained by using the reference table 12 shown in FIGS. 9A and 9B.
  • The reference table 12 is an array D(x, y) that shows the bonding relationship between two atoms, and it is created based on the bond atom pair and bond type data in the bond table 32.
  • x is the number of the first atom in the bond atom pair, and
  • y is the number of the second atom.
  • The bond type j is written into the array element indicated by each bond atom pair, and the reference table 12 is thereby created. In the figure, a bond type of 2 is shown circled.
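The reference table D(x, y) and the bond-count attribute can be sketched as follows. This is an illustration under assumptions: the bond list is the eight-atom example with input numbers inferred from the numeric strings quoted in the description, the table is filled symmetrically (the text does not say whether only one triangle is used), and the function names are hypothetical.

```python
# (first atom, second atom, bond type); type 2 marks the double bond of the example.
BONDS = [(1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 1),
         (5, 6, 1), (1, 6, 1), (5, 7, 1), (1, 8, 1)]
N_ATOMS = 8


def build_reference_table(n_atoms, bonds):
    """Array D(x, y): the bond type j is written at each bond atom pair, 0 elsewhere."""
    table = [[0] * (n_atoms + 1) for _ in range(n_atoms + 1)]   # row/column 0 unused
    for x, y, bond_type in bonds:
        table[x][y] = bond_type
        table[y][x] = bond_type        # assumption: the table is kept symmetric
    return table


def bond_counts(table, i, n_types=4):
    """Attribute b_ij: number of bonds of type j (1..n_types) adjacent to atom i."""
    counts = [0] * n_types
    for bond_type in table[i][1:]:
        if bond_type:
            counts[bond_type - 1] += 1
    return tuple(counts)


D = build_reference_table(N_ATOMS, BONDS)
```

Reading one row of the table is enough to obtain the vector for one atom, which is why a lookup array is convenient here.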
  • The attribute $d_{ij}$ is the number of paths that can be traversed from the atom of input number i through j bonds by the shortest path. Referring to the molecular structure diagram of FIG. 8, the paths that can be taken from the atom of input number 1 through one bond are (input number 1 → input number 2), (input number 1 → input number 6), and (input number 1 → input number 8), three in total. The paths that can be taken from the atom of input number 1 through two bonds are (input number 1 → input number 2 → input number 3) and (input number 1 → input number 6 → input number 5), two in total.
  • The paths that can be taken from the atom of input number 1 by the shortest path through three bonds are (input number 1 → input number 2 → input number 3 → input number 4), (input number 1 → input number 6 → input number 5 → input number 4), and (input number 1 → input number 6 → input number 5 → input number 7), three in total. There are no paths from the atom of input number 1 that pass through four bonds by the shortest path. According to FIG. 8, there are paths from the atom of input number 1 that pass through four bonds, for example (input number 1 → input number 2 → input number 3 → input number 4 → input number 5). However, the atom of input number 5 can already be reached through two bonds (input number 1 → input number 6 → input number 5), so this is not a shortest path and is not counted.
  • For the atom of input number 1, the number of array elements with 1 connection path is 3,
  • the number of array elements with 2 connection paths is 2, and
  • the number of array elements with 3 connection paths is 3, so $d_{1j} = (3, 2, 3, 0)$ is obtained.
  • For the atom of input number 2, the number of connection paths is likewise written to all array elements. As a result, there are 2 array elements with 1 connection path, 3 array elements with 2 connection paths, 2 array elements with 3 connection paths, and 2 array elements with 4 connection paths, and $d_{2j} = (2, 3, 2, 2)$ is obtained.
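The shortest-path counts can also be obtained without a lookup table by a breadth-first search that tracks how many shortest paths reach each atom. This swaps in a standard graph technique and is not claimed to be the table-based procedure of the original; the function name, the `max_len=4` cut-off (chosen to match the four digits used in the example) and the adjacency list are assumptions, with the adjacency inferred from the paths and numeric strings quoted in the text.

```python
from collections import deque

# Adjacency of the eight-atom example, keyed by input number (assumed numbering).
ADJ = {1: [2, 6, 8], 2: [1, 3], 3: [2, 4], 4: [3, 5],
       5: [4, 6, 7], 6: [1, 5], 7: [5], 8: [1]}


def shortest_path_counts(adj, start, max_len=4):
    """Attribute d_ij: number of shortest paths from `start` that use exactly j bonds."""
    dist = {start: 0}
    ways = {start: 1}            # number of distinct shortest paths to each atom
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:                    # first time v is reached
                dist[v] = dist[u] + 1
                ways[v] = ways[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:         # another shortest route into v
                ways[v] += ways[u]
    counts = [0] * max_len
    for atom, length in dist.items():
        if 1 <= length <= max_len:
            counts[length - 1] += ways[atom]
    return tuple(counts)
```

For the atom of input number 1, atom 4 is reached by two different three-bond routes and atom 7 by one, which is how the third entry comes to three.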
  • The numeric string of the atom of input number 1 is "630003230", and
  • the numeric string of the atom of input number 2 is "611002322".
  • Similarly, the numeric strings of the atoms of input numbers 3 to 8 are "711002240", "620002322", "630003230", "620002420", "610001223", and "610001223", respectively.
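As a cross-check of the numeric strings above, the sketch below builds them from the three attributes and groups identical strings into zeroth-order classes. The bond list and atomic numbers are assumptions inferred from those same strings (FIG. 8 is not reproduced here), all names are hypothetical, and the digit-string encoding only works while every value is a single digit.

```python
from collections import deque

ATOMIC_NUMBER = {1: 6, 2: 6, 3: 7, 4: 6, 5: 6, 6: 6, 7: 6, 8: 6}
BONDS = [(1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 1),
         (5, 6, 1), (1, 6, 1), (5, 7, 1), (1, 8, 1)]


def neighbours(bonds):
    adj = {}
    for x, y, bond_type in bonds:
        adj.setdefault(x, {})[y] = bond_type
        adj.setdefault(y, {})[x] = bond_type
    return adj


def attribute_b(adj, i, n_types=4):
    counts = [0] * n_types
    for bond_type in adj[i].values():
        counts[bond_type - 1] += 1
    return counts


def attribute_d(adj, i, max_len=4):
    dist, ways, queue = {i: 0}, {i: 1}, deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], ways[v] = dist[u] + 1, ways[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                ways[v] += ways[u]
    counts = [0] * max_len
    for atom, length in dist.items():
        if 1 <= length <= max_len:
            counts[length - 1] += ways[atom]
    return counts


def numeric_string(adj, i):
    digits = [ATOMIC_NUMBER[i]] + attribute_b(adj, i) + attribute_d(adj, i)
    return "".join(str(value) for value in digits)


ADJ = neighbours(BONDS)
STRINGS = {i: numeric_string(ADJ, i) for i in sorted(ADJ)}
# Class numbers are given in ascending order of the numeric string.
RANK = {s: r for r, s in enumerate(sorted(set(STRINGS.values())), start=1)}
CLASS_0 = {i: RANK[s] for i, s in STRINGS.items()}
```

Atoms 1 and 5 share a string, as do atoms 7 and 8, so eight atoms fall into six zeroth-order classes.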
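  • As a result, the atoms are classified into six classes, and the number of classes $N_0$ becomes 6.
  • Next, each atom is given the attribute $V_{ij}^1$.
  • For the atoms of input numbers 1 and 2, $V_{1j}^1 = (1, 1, 0, 1, 0, 0)$ and $V_{2j}^1 = (0, 0, 0, 0, 1, 1)$ are obtained. Similarly, for the atoms of input numbers 3 to 8,
  • $V_{3j}^1 = (0, 1, 1, 0, 0, 0)$, $V_{4j}^1 = (0, 0, 0, 0, 1, 1)$, $V_{5j}^1 = (1, 0, 1, 1, 0, 0)$,
  • $V_{6j}^1 = (0, 0, 0, 0, 2, 0)$,
  • $V_{7j}^1 = (0, 0, 0, 0, 1, 0)$, and
  • $V_{8j}^1 = (0, 0, 0, 0, 1, 0)$ are obtained, respectively.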
  • The attribute $V_{ij}^1$ is obtained using the reference table 12 shown in FIGS. 9A and 9B.
  • The digit string of the atom of input number 1 is "5110100", and
  • the digit string of the atom of input number 2 is "2000011".
  • Similarly, the digit strings of the atoms of input numbers 3 to 8 are "6011000", "3000011", "5101100", "4000020", "1000010", and "1000010", respectively. In each string the leading digit is the zeroth-order class number of the atom, which stands in for the attributes $(a_i, b_{ij}, d_{ij})$, and the remaining six digits are $V_{ij}^1$.
  • Next, the processing of S106 is performed to check whether the number of classes $N_n$ is equal to $N_{n-1}$; if so, the processing is terminated. It is also checked whether the number of classes $N_n$ is equal to the total number of atoms, and the processing is terminated if they are equal. Here, the number of classes $N_1$ is 7 and the number of classes $N_0$ is 6, so $N_1$ is not equal to $N_0$. The total number of atoms is 8, so $N_1$ is not equal to the total number of atoms either. Since neither condition holds, the process of S107 is executed to set n to 2.
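The refinement loop (S103 to S107) can be sketched in a few lines. The sketch starts from the zeroth-order numeric strings quoted in the text and from an adjacency list inferred from them, both of which are assumptions about FIG. 8; function names are hypothetical. It compares `(previous class, V)` tuples instead of concatenated digit strings, a small deviation that gives the same ordering in this example and avoids ambiguity once a count exceeds nine.

```python
ADJ = {1: [2, 6, 8], 2: [1, 3], 3: [2, 4], 4: [3, 5],
       5: [4, 6, 7], 6: [1, 5], 7: [5], 8: [1]}
ZEROTH = {1: "630003230", 2: "611002322", 3: "711002240", 4: "620002322",
          5: "630003230", 6: "620002420", 7: "610001223", 8: "610001223"}


def classes_from_keys(keys):
    """Give class numbers 1, 2, ... in ascending order of the sort key."""
    rank = {key: r for r, key in enumerate(sorted(set(keys.values())), start=1)}
    return {atom: rank[key] for atom, key in keys.items()}


def refine(adj, cls):
    """One pass of S104-S105: V counts bonded atoms per previous-order class."""
    n_classes = max(cls.values())
    keys = {}
    for atom in adj:
        v = [0] * n_classes
        for nb in adj[atom]:
            v[cls[nb] - 1] += 1
        keys[atom] = (cls[atom], tuple(v))
    return classes_from_keys(keys)


def classify(adj, zeroth_keys):
    cls = classes_from_keys(zeroth_keys)
    counts = [len(set(cls.values()))]
    while counts[-1] < len(adj):          # stop when every atom has its own class
        cls = refine(adj, cls)
        counts.append(len(set(cls.values())))
        if counts[-1] == counts[-2]:      # stop when the class count no longer grows
            break
    return cls, counts


FINAL_CLASSES, CLASS_COUNTS = classify(ADJ, ZEROTH)
```

In this example the class count goes 6, 7, 8: the second pass separates the two methyl carbons because their ring neighbours fell into different classes in the first pass.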
  • The canonical number is a number given to each atom that is determined by the structure of the compound. The input number given when the molecular structure diagram is hand-drawn, by contrast, is an arbitrary number that changes depending on the order of drawing.
  • Canonicalized data must be unique data that depends only on the structure of the compound. For this reason, it is difficult to create unique canonicalized data directly from arbitrary input numbers. The canonicalized data creation program 22 therefore first converts the input numbers into canonical numbers and then creates the canonicalized data 34 based on these unique canonical numbers, which makes it possible to create unique canonicalized data 34.
  • the undetermined atom having the largest class number is extracted from the undetermined atoms bonded to the selected atom, and the canonical number of this undetermined atom is set to k (S208).
  • from the undetermined atoms bonded to the already determined atoms, the undetermined atom with the largest class number is selected, and this atom is given the canonical number k (S209).
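The selection rule in S208 and S209 can be sketched as a breadth-first numbering. The Python code below is an illustrative sketch, not the routine of the patent: it assumes that all class numbers are distinct, that the molecule is connected, that numbering starts at the atom with the largest class number, and that determined atoms are processed in canonical-number order. The function name and the five-atom example are hypothetical.

```python
def assign_canonical_numbers(class_numbers, bonds):
    """Assign canonical numbers 1, 2, ... to the atoms.

    class_numbers: dict mapping atom -> class number (assumed all distinct).
    bonds: iterable of (atom, atom) pairs; the graph is assumed connected.
    """
    neighbors = {atom: set() for atom in class_numbers}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    start = max(class_numbers, key=class_numbers.get)
    canonical = {start: 1}
    order = [start]  # determined atoms, in canonical-number order
    k = 2
    for atom in order:  # the list grows while it is being traversed
        pending = [x for x in neighbors[atom] if x not in canonical]
        # Undetermined neighbors with larger class numbers are numbered first.
        for x in sorted(pending, key=class_numbers.get, reverse=True):
            canonical[x] = k
            order.append(x)
            k += 1
    return canonical


# Hypothetical molecule: input numbers 1 to 5 with made-up class numbers.
example_classes = {1: 3, 2: 5, 3: 1, 4: 4, 5: 2}
example_bonds = [(1, 2), (2, 3), (2, 4), (4, 5)]
result = assign_canonical_numbers(example_classes, example_bonds)
```

When class numbers tie, the additional atom attributes would be needed to break the tie; the sketch leaves that case out.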
  • the atom of input number 1 is given the canonical number 5 and the atom of input number 6 is given the canonical number 6 respectively.
  • the atom of input number 7 is given the canonical number 7, and the atom of input number 8 is given the canonical number 8, respectively.
  • the constituent atom classification routine 101 and the canonicalization number assignment routine 102 of the present invention involve assigning numbers to atoms (S102, S105, S202, S208, S209) and selecting atoms (S202, S207, S208, S209).
  • whether the numbers are assigned in ascending or descending order, and whether priority in selecting atoms is given to larger or smaller numbers, is a free choice of the program creator (including the choice of using negative numbers), to the extent that the task of creating canonicalized data that uniquely corresponds to the structure of the compound is achieved.
  • the expression “the priority of the class number is the highest” in the present invention has the above-mentioned meaning, and does not necessarily mean that the larger number is selected.
  • the method of assigning canonical numbers to the atoms in ascending order from 1 is based on an arithmetic progression $\{\, p + q(n-1) \mid n = 1, 2, 3, \ldots, n_{\max} \,\}$, and the "1" here is a mathematical expression (the first term p of the arithmetic progression); it does not have to be exactly 1. Likewise, the common difference q need not be 1.
  • canonical numbers can also be assigned to the atoms in descending order from the first term p (usually the total number of atoms), that is, with a negative common difference q.
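The arithmetic-progression view of the numbering can be made concrete with a few lines of Python. This is a minimal sketch; the function name `canonical_labels` is hypothetical, and rejecting q = 0 is an added safeguard so that every label stays distinct.

```python
def canonical_labels(n_max, p=1, q=1):
    """Return the labels p + q*(n-1) for n = 1 .. n_max."""
    if q == 0:
        raise ValueError("q must be non-zero so that all labels are distinct")
    return [p + q * (n - 1) for n in range(1, n_max + 1)]


ascending = canonical_labels(8)              # first term 1, common difference 1
descending = canonical_labels(8, p=8, q=-1)  # first term = total number of atoms
stepped = canonical_labels(3, p=10, q=5)     # neither p nor q has to be 1
```

Any such choice works as long as the same p and q are used consistently, since only the relative order of the labels matters for uniqueness.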
  • $S_i$ is the type symbol (the element symbol in this example) of the atom with canonical number i (i > 0).
  • the element symbol of the atom with canonical number 1 is examined with reference to the atom table 32a.
  • the atoms bonded to the atom of canonical number 2 are examined with reference to the atom pair table 32b.
  • atoms with canonical numbers 1 and 4 are obtained.
  • bond atom pairs that were not referred to when $T_i$ was obtained in the processing of S302 are extracted (S303). This process is performed with reference to the atom pair table 32b. As a result, the bonding atom pair consisting of the atom of canonical number 5 and the atom of canonical number 6 is extracted. Then, three types of data, $R^1_j$, $R^2_j$, and $H_j$, are obtained for the extracted bond atom pairs (S304).
  • $R^1_j$ and $R^2_j$ are the canonical numbers of the two atoms that make up the bond. $H_j$ is the type symbol of the bond (the same symbols as for $T_i$ are used in this example). Note that $R^1_j$ and $R^2_j$ satisfy the relationship $R^1_j > R^2_j$.
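The ordering constraint on the two canonical numbers of a bond record can be enforced as follows. This is a minimal Python sketch; the function name `bond_record` and the bond type symbol "1" are hypothetical, and only the rule that the larger canonical number comes first is taken from the text.

```python
def bond_record(canon_a, canon_b, bond_symbol):
    """Return (R1, R2, H) with R1 > R2 for a bond between two atoms."""
    if canon_a == canon_b:
        raise ValueError("a bond must join two different atoms")
    r1, r2 = max(canon_a, canon_b), min(canon_a, canon_b)
    return (r1, r2, bond_symbol)


# The pair extracted in the example: canonical numbers 5 and 6.
record = bond_record(5, 6, "1")
```

Fixing the order this way means the same bond always yields the same record, regardless of which atom is listed first in the pair table.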
  • canonical tree structure data shown in FIG. 22 can be created.
  • the data obtained in the processes of S302 and S304 are arranged in a line to create canonical data (S305). That is, a delimiter symbol F different from the atom type symbol and the bond type symbol is defined, and the data obtained in the processing of S302 and S304 are arranged as follows.
  • N is the total number of atoms
  • M is the total number of extracted bonded atom pairs of S304.
  • the data sequence obtained in this way is a canonical data sequence uniquely corresponding to the structure of the compound. Specifically, when the delimiter F is set to "/" and the obtained data is arranged in a predetermined order,
  • This canonicalization data is written and stored in the compound information file 33 (S306). After that, the process ends.
  • FIG. 23 is a block diagram showing a configuration of a canonical data creating portable program product (storage medium) 2 according to the embodiment of the present invention.
  • the storage medium 2 for canonicalized data creation has a file area 2b for storing files and a program area 2a for storing programs.
  • in the file area 2b, the bond table file 31 and the compound information file 33 described above are stored.
  • in the program area 2a, the above-described canonicalization data creation program 22, which creates the canonicalization data based on the unique data of each atom constituting the compound and the bond pair data between the atoms, is stored.
  • the canonicalization data creation program 22 includes a main routine 100 that controls the processing, a constituent atom classification routine (first processing routine) 101 that assigns a class number to each atom constituting the compound, a canonical numbering routine (second processing routine) 102 that assigns a canonical number to each atom based on the class number, and a canonical data creation routine (third processing routine) 103 that creates canonical data based on the canonical number of each atom. As the storage medium 2 for canonical data creation, a disk-type storage medium such as a flexible disk or CD-ROM is used, for example. A tape-type storage medium such as a magnetic tape may also be used.
  • the canonical data creation program 22 stored in the canonical data creation storage medium 2 of the present embodiment can be executed by, for example, the information processing device (canonical data creation device) 1 shown in FIGS. 24 and 25.
  • FIG. 24 is a block diagram showing the configuration of the information processing device 1 of the present embodiment, and FIG. 25 is a perspective view thereof.
  • the information processing device 1 includes a medium drive device 3 for reading the canonical data creation program 22 stored in the canonical data creation storage medium 2.
  • the storage medium 2 for canonical data creation can be loaded into the medium drive device 3.
  • as a result, the information stored in the storage medium 2 for canonical data creation (the canonicalization data creation program 22, the bond table file 31, and the compound information file 33) becomes accessible from the information processing device 1.
  • the information processing device 1 can execute the canonical data creation program 22 stored in the program area 2a.
  • the configuration of the information processing device 1 is as follows. The information processing device 1 comprises the medium drive device 3 described above, a main storage device 20 storing an operating system (OS) 21, an image memory 10 storing image data such as a molecular structure diagram of a compound, a working memory 11 for temporarily storing 11a and the like, and a display 40 for displaying the molecular structure diagram of the compound and the like. The information processing device 1 further includes a mouse 50 as a pointing device for accepting input of handwritten figures, a keyboard 60 for accepting input of symbolic data such as chemical formulas, a printer 70 for outputting the molecular structure diagram of the compound, and a CPU 80 for controlling the execution of the canonical data creation program 22 and the like.
  • a flexible disk drive device, a CD-ROM drive device, a magnetic tape drive device, or the like is used, corresponding to the storage medium 2 for canonical data creation.
  • the constituent atom classification routine (first processing routine) 101, the canonical numbering routine (second processing routine) 102, and the canonical data creation routine (third processing routine) 103 are performed as described above, and the canonicalized data for uniquely identifying the compound can be obtained in a short time.
  • the data sequence including the atomic type symbol Si is used as the canonicalized data.
  • the most frequently occurring atomic type symbol (usually C for carbon) may be omitted from the data string. That is, by omitting the carbon symbol C from the above canonicalization data,
  • $m_{ik}$: minimum number of bonds between candidate atom i and the atom with canonical number k
  • Priorities are set in advance for this attribute, and an atom i having a higher priority is selected, and the canonicalization number of the atom is set to k. Thereafter, the process returns to S203.
  • an atom selection criterion based on the attribute value of the atom is shown.
  • for a scalar quantity, the priority depends on its magnitude.
  • for a vector quantity, when the elements of the two vectors i and k are $V_{ij}$ and $V_{kj}$, the magnitude of the elements at the smallest index j for which $V_{ij} \neq V_{kj}$ is used as the criterion for determining the priority.
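Comparing two vectors at the first position where they differ is a lexicographic comparison. The Python sketch below illustrates it; treating the larger element as the higher priority is an assumption (the text leaves the direction to the implementer), and the function name `higher_priority` is hypothetical.

```python
def higher_priority(v_i, v_k):
    """Return "i", "k", or None (tie) for two equal-length vectors.

    Assumption: at the first differing index, the larger element wins.
    """
    for a, b in zip(v_i, v_k):
        if a != b:
            return "i" if a > b else "k"
    return None


# Attribute vectors taken from the worked example in the text.
first = higher_priority((1, 0, 1, 1, 0, 0), (0, 0, 0, 0, 2, 0))   # differ at j = 1
second = higher_priority((0, 0, 0, 0, 1, 1), (0, 0, 0, 0, 2, 0))  # differ at j = 5
tie = higher_priority((0, 0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 0))
```

Python's built-in tuple comparison follows the same rule, so `v_i > v_k` gives the same answer as `higher_priority(v_i, v_k) == "i"` for equal-length tuples.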
  • the priorities of the respective attributes can be determined in this way. Also, when the priority is determined from multiple attributes, it is preferable to set priorities among the attributes in advance, and to give precedence to the determination made with the attribute having the higher priority.
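Ranking attributes in advance and consulting the next one only on a tie amounts to comparing tuples of attribute values. The Python sketch below shows this under stated assumptions: the attribute names "class" and "v", the candidate values, and the rule that larger values win are all hypothetical choices for illustration.

```python
def select_atom(candidates, attribute_order):
    """Pick the candidate atom with the highest priority.

    candidates: dict mapping atom -> dict of attribute name -> value.
    attribute_order: attribute names, most important first.
    Assumption: for every attribute, a larger value means higher priority.
    """
    def key(atom):
        return tuple(candidates[atom][name] for name in attribute_order)

    return max(candidates, key=key)


# Hypothetical candidates: atoms 3 and 7 tie on "class", so "v" decides.
candidates = {
    3: {"class": 4, "v": (0, 0, 2)},
    7: {"class": 4, "v": (0, 1, 0)},
    8: {"class": 2, "v": (9, 9, 9)},
}
chosen = select_atom(candidates, ["class", "v"])
chosen_v_first = select_atom(candidates, ["v", "class"])
```

Changing the attribute order changes the result, which is why the order has to be fixed in advance for the output to stay unique.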
  • using the above-described canonical data creation apparatus (canonical data creation program product) and the canonical data creation method according to the present invention, canonical data was obtained for the C 6 molecule shown in the figure.
  • the canonical data (Fig. 26B), which uniquely identifies the structure of the molecule, was obtained in only 1.5 seconds.
  • by contrast, when the canonical data for this molecule was determined on an information processing device of the same performance using the Morgan algorithm (H. L. Morgan, J. Chem. Doc., 5(2), 107 (1965)), which does not go through the process of classifying atoms into equivalent atoms, it took 550 seconds. Therefore, the speed and accuracy of the information processing required for creating canonical data can be greatly improved by employing the canonical data creating apparatus, creating method, and creating computer program product according to the present invention.
  • Industrial Applicability
  • the unique data for each atom and the bond pair data between the atoms received by the input means can be obtained by the canonical data creating means. Then, in the canonical data creating means,
  • by executing the canonicalization data creation program stored in the program area, the canonicalization data is created quickly and accurately on the basis of the unique data of each atom constituting the compound and the bond pair data between the atoms.
  • the canonical data created by the canonicalization data creation apparatus, the canonicalization data creation method, and the computer program product for canonicalization data creation according to the present invention is a very short sequence of letters, numbers, and symbols, and can therefore be stored in a small storage area. For this reason, if the canonical data creation device, creation method, and computer program product for creation of the present invention are used in a compound/reaction database system or the like, the storage space of that system can be significantly reduced.

Abstract

The invention relates to a canonical data generator comprising an input device that receives characteristic data on the atoms constituting a compound together with data on the bond pairs between atoms, and a canonical data generating unit that generates canonical data by which the chemical structure of a compound can be uniquely defined on the basis of the characteristic data and the bond pair data. The canonical data generating unit comprises three processing units: the first classifies equivalent atoms into classes on the basis of the characteristic data and the bond pair data and assigns class numbers that differ from class to class; the second assigns canonical numbers to the atoms, corresponding uniquely to the structure of the compound, on the basis of the class numbers; and the third generates the canonical data on the basis of the canonical numbers.
PCT/JP1997/001661 1996-05-20 1997-05-16 Generateur de donnees normalisees, methode de generation de donnees normalisees et support d'enregistrement pour la production de donnees normalisees WO1997044744A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP8125117A JPH09305612A (ja) 1996-05-20 1996-05-20 正準化データ作成用記憶媒体
JP8/125123 1996-05-20
JP8/125117 1996-05-20
JP8125123A JPH09305628A (ja) 1996-05-20 1996-05-20 正準化データ作成装置及び正準化データ作成方法

Publications (1)

Publication Number Publication Date
WO1997044744A1 true WO1997044744A1 (fr) 1997-11-27

Family

ID=26461641

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP1997/001661 WO1997044744A1 (fr) 1996-05-20 1997-05-16 Generateur de donnees normalisees, methode de generation de donnees normalisees et support d'enregistrement pour la production de donnees normalisees

Country Status (1)

Country Link
WO (1) WO1997044744A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001027052A1 (fr) * 1999-10-08 2001-04-19 Riken Procede de codage stereochimique d'une molecule

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6257017A (ja) * 1985-09-05 1987-03-12 Fuji Photo Film Co Ltd 化学反応情報の処理方法
JPH0498464A (ja) * 1990-08-10 1992-03-31 Fujitsu Ltd 化学構造のデータ表現方式
WO1996006391A2 (fr) * 1994-08-10 1996-02-29 Oxford Molecular Limited Systeme de gestion de base de donnees relationnelles servant a memoriser, a rechercher et a extraire des donnees de structures chimiques

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SYMPOSIUM ON INFORMATION SCIENCE ABSTRACTS OF THE SYMPOSIUM OF STRUCTURE-ACTIVITY RELATIONSHIP, Vol. 13th-18th, 19 November 1990, AKIRA ASANAGA, "Development and Application of Molecular Structure Processing Programs and a Library (in Japanese)", p. 25-28. *

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA KR US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA