JP2007277188A

JP2007277188A - Support system for compound search

Info

Publication number: JP2007277188A
Application number: JP2006107067A
Authority: JP
Inventors: Asako Koike; 麻子小池; Shigeo Sumino; 重雄炭野; Yoshiki Niwa; 芳樹丹羽
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2006-04-10
Filing date: 2006-04-10
Publication date: 2007-10-25

Abstract

PROBLEM TO BE SOLVED: To provide a method for searching a large-scale compound library for compounds that are similar to a query compound from a specific viewpoint or characteristic.

SOLUTION: The system has a compound database 120 with a characteristic/partial-structure association information storage area 121 and an importance score and similarity score derivation function information storage area 122; an input/output terminal 150 for entering the compound information to be searched; a database control mechanism 130 that searches for similar compounds using characteristic-specific scores; and an information control mechanism 110 that displays the search results for each characteristic in order of similarity to the query compound. The compound entered at the input/output terminal 150 is used as the search key, and the results are displayed in order of similarity for each characteristic.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to the field of compound search, and more particularly to a system and method that allow a user to search for compounds from various viewpoints.

Compounds have generally been searched by name, partial structure, composition formula, SMILES notation, or property value. Name searches commonly use a synonym dictionary and allow partial matches as well as exact matches. For searches based on partial structure, searching by functional group is widely used, and a search may target the functional groups thought to contribute most to a compound's function and properties. As a partial-structure fingerprint, MACCS, developed and published by MDL, is widely used. Searches by physical property value are typically performed by specifying a range of values. For exact-match structure search, there is also a method that uses a connection table, as disclosed in Japanese Patent No. 2758609 (Patent Document 1).

Patent Document 1: Japanese Patent No. 2758609

However, these methods are not suitable when a user wants to search for similar compounds from diverse viewpoints. Defining similarity requires some viewpoint, yet conventional similar-compound search can only test whether whole or partial structures are similar under a single similarity measure. If the user already knows which functional groups are closely tied to toxicity, it is possible to search the database for compounds with those functional groups, and it is possible to search for structures similar to a given compound. But when a compound is given as a query, conventional search cannot distinguish between finding compounds that are similar from the viewpoint of toxicity and finding compounds that are similar from the viewpoint of a particular medicinal effect. In other words, although the important functional groups, or combinations of them, differ between the toxicity viewpoint and the medicinal-effect viewpoint, only a similar-structure search on a single fixed scale is available.

To solve this problem, the present invention uses data on compounds with known properties to calculate and store in advance, for each viewpoint such as medicinal effect, absorption in the body, or toxicity, the importance of each partial structure, distinguishing the parts that matter for compound function and properties under that viewpoint from those that do not. A similarity search then calculates the similarity between compounds using the importance of each partial structure under the viewpoint the user selected. By using per-characteristic partial-structure weights, the system can present different similar compounds for the same query compound under different viewpoints.

To achieve the above object, the compound search support system of the present invention comprises: means for decomposing a compound structure into its constituent partial structures; means for using that decomposition to generate pairs relating compound partial structures to compound characteristics; means for storing the total number of such relations; means for storing the compound characteristic names, compound names, and compound structure information appearing in the relations, together with their occurrence counts and the total occurrence count of the compound partial structures; means for storing the characteristic-specific score of each partial structure; means for inputting the compound information to be searched; means for searching for similar compounds using the characteristic-specific scores; and means for displaying the search results for each characteristic in order of similarity to the query compound.

The compound search support system is further characterized in that, when a user registers or updates compound information that includes compound characteristics, the partial structures and their characteristic-specific scores are calculated automatically.

The system is also characterized in that the user can change, for each characteristic, the function that determines the score ranking.

Furthermore, the system is characterized in that it uses the total number of occurrences of each characteristic-name/partial-structure pair, the total number of occurrences of each characteristic name and of each partial structure, and the total number of partial-structure occurrences in the registered compounds to calculate one or more of the Dice coefficient, entropy gain, mutual information, Simpson coefficient, cosine, and the cumulative value of the hypergeometric distribution, and uses the result as the importance of a partial structure for a given characteristic.

According to the present invention, simply by entering a compound's structure information at the input/output terminal, or the compound's name if its structure is already registered, the user can search the database for similar compounds from the viewpoint of a user-specified compound characteristic and have them displayed in order of similarity score. By showing the partial structures on which the similarity is based, together with their importance under the specified characteristic, the system makes it easier for the user to consider by what mechanism the characteristic may arise. By allowing the user to change the functions that derive the partial-structure importance scores and the similarity scores, the system can return search results that better match the user's intent. Furthermore, because the importance of each partial structure for each characteristic is recalculated automatically whenever a new compound structure is registered, the calculation can use more recent information. As a result, the user can easily search for similar structures from the viewpoint of various characteristics without prior knowledge of characteristics and partial structures.

An embodiment of the compound search support system of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the system configuration of the compound search support system of the present invention. 100 is the compound search support system according to the present invention; 110 is an information control mechanism that displays the search interface shown in FIG. 4 and performs various processing; 120 is a compound database storing information about compounds; 121 is a characteristic/partial-structure association information storage area that stores the relationships between characteristics and the partial structures that make up compounds; 122 is a score derivation function information storage area that stores the functions used to calculate scores when search results are displayed; 130 is a database control mechanism for controlling the compound database; 140 is the Internet, used to access the compound search support system; and 150 is an input/output terminal for input to and output from the compound search support system.

The information control mechanism 110 is described in detail with reference to FIG. 2, and the characteristic/partial-structure association information storage area 121 and the score derivation function information storage area 122 are described in detail with reference to FIG. 3.

With this configuration, the compound search support system receives, through the information control mechanism 110, the compound information and search conditions that the user enters at the input/output terminal 150. The information control mechanism 110 searches the compound database 120 through the database control mechanism 130 using the compound information as the key, obtains the relevant partial structures and their importance from 121 and the user-specified score derivation function from 122, uses these to calculate a similarity score for each compound at 130, and displays the result on the input/output terminal 150.

Thus, simply by entering a compound's structure information at the input/output terminal, or the compound's name if its structure is already registered, the user can search the database for similar compounds from the viewpoint of a user-specified compound characteristic and have them displayed in order of similarity score.

FIG. 2 is a detailed block diagram of the information control mechanism. 200 is an information distribution mechanism that displays the compound search interface shown in FIG. 4 and accepts user input. 210 is a search mechanism that, when the user enters a structure, uses the compound structure decomposition mechanism 250 to decompose the compound structure into partial structures and accesses the compound database 120 through the database control mechanism 130. When the user uses a compound already registered in 120 as the query, the input may be a compound name, in which case the compound database 120 is accessed with the compound name as the key.

220 is a characteristic/partial-structure association information update mechanism that, for a newly registered compound and its characteristics, uses the compound structure decomposition mechanism 250 to decompose the compound into partial structures, forms sets of compound name, compound partial structure, and characteristic, and updates the information stored in the compound database 120. 230 is an importance score function change mechanism for changing the function that calculates the importance score of each partial structure for each characteristic, which is used when search results are displayed in order of similarity score to the query compound. 240 is a similarity score function change mechanism for changing the function that calculates similarity from these importance scores.

With this configuration, the information distribution mechanism 200 first displays the search interface shown in FIG. 4 to the user and obtains the compound structure or name that serves as the search key, together with the characteristics to be searched. The search mechanism 210 executes a search using the partial structures of the compound and the target characteristics as keys, obtains the importance of each partial structure for each characteristic, calculates the scores, and obtains the results. The information distribution mechanism 200 displays the obtained search results to the user. As supplementary data, it also displays the partial structures produced by the compound structure decomposition mechanism 250 and the importance of each of those partial structures for each characteristic (FIG. 4, 480).

When data is registered, the characteristic/partial-structure association information update mechanism 220 automatically recalculates the importance of each partial structure for each characteristic and registers it in area 121 of the compound database. The information distribution mechanism 200 therefore uses the latest characteristic and partial-structure information.

If the results displayed in order of similarity score do not match the user's intent, the user can change the ranking derivation functions through the importance score function change mechanism 230 and the similarity score function change mechanism 240.

According to the present invention, with this configuration the user only has to enter the compound structure or compound name to be used as the query and specify the compound characteristic that serves as the viewpoint for similarity; the system then searches the database for similar compounds for each characteristic and displays them in order of similarity score. By showing the partial structures on which the similarity is based, together with their importance under the specified characteristic, the system makes it easier for the user to consider by what mechanism the characteristic may arise. By allowing the user to change the ranking functions, the system can return search results that better match the user's intent. Furthermore, because the importance of each partial structure for each characteristic is recalculated automatically whenever a new compound structure is registered, the calculation can use more recent information. As a result, the user can easily search for similar structures from the viewpoint of various characteristics without prior knowledge of characteristics and partial structures.

FIG. 3 is an explanatory diagram of the tables stored in the characteristic/partial-structure association information storage area and in the importance score and similarity score derivation function information storage area.

300 is the compound name management table. 301 is the compound ID area that holds the primary key of table 300, 302 is the compound name area, and 303 is the area holding the total number of partial structures that make up the compound.

310 is the characteristic information management table. 311 is the characteristic ID area that holds the primary key of the characteristic information table 310, 312 is the characteristic name area, and 313 is the characteristic occurrence count area, which holds the number of times the characteristic name appears in the association information between characteristics and partial structures.

321 is the partial structure ID area that holds the primary key of the partial structure information table 320, 322 is the partial structure data area, and 323 is the total occurrence count area, which holds the number of times the partial structure appears in the association information between characteristics and partial structures.

330 is the compound constituent partial structure management table, which records the partial structures each compound is composed of. 331 is the compound ID area holding a foreign key to the compound name table 300, 332 is the ID area holding a foreign key to the partial structure information table, and 333 is the partial structure occurrence count area.

340 is the compound/characteristic table, which records the relationship between compounds and compound characteristics. 341 is a foreign key to the compound name management table (compound ID 301), and 342 is a foreign key to the characteristic information table 310.

350 is the characteristic/partial-structure association table, which records the association between compound characteristics and partial structures. 351 is the ID area holding a foreign key to the characteristic information table 310, 352 is the structure ID area holding a foreign key to the partial structure information table 320, 353 is the partial structure ID area holding a foreign key to the partial structure information table, and 354 is the association occurrence count area, which holds the number of times the characteristic ID and the partial structure ID appear together.

360 is the importance table, which holds the calculated importance of each partial structure for each characteristic. 361 is the association ID area holding a foreign key to the characteristic/partial-structure association table 350. 362 is the function area, described as determining the similarity between compounds from their partial-structure composition, and is a foreign key to table 370. 363 is the ID area holding a foreign key to the partial structure table.

The importance score managed in 364 can be calculated by the methods below, which are described in the importance score table 370. 371 is the ID area that serves as the primary key, and 372 is the area in which the importance score function is registered. Examples of the importance score function include 1) to 5) below, but the function is not limited to these.

M = number of partial structures over all compounds (the sum of 333)
N = number of occurrences of partial structures associated with a given characteristic ID (obtained using 352 and 353)
n = number of occurrences of a specific partial structure associated with a given characteristic ID (obtained using 352 and 353)
m = total number of occurrences of that partial structure (obtained from 332 and 333)

1) Cumulative value of the hypergeometric distribution w(M,N,m,n)

That is, the cumulative (summed) value of the hypergeometric distribution is used. The cumulative value is used rather than the hypergeometric probability itself because the probability alone can give f(M,N,m,x)=f(M,n,m,x') with x>x', in which case x and x' cannot be distinguished.
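The cumulative hypergeometric score in 1) can be illustrated with a short sketch. The text does not give the formula for w(M,N,m,n), so the code below assumes it is the upper-tail probability of seeing at least n co-occurrences; the function names and the choice of tail are assumptions made for illustration, not the patent's definition.

```python
from math import comb


def hypergeom_pmf(M, N, m, x):
    """Probability that exactly x of the m occurrences of a partial structure
    fall among the N occurrences associated with the characteristic,
    out of M partial-structure occurrences in total."""
    if x < max(0, m - (M - N)) or x > min(N, m):
        return 0.0
    return comb(N, x) * comb(M - N, m - x) / comb(M, m)


def hypergeom_upper_tail(M, N, m, n):
    """Assumed form of w(M, N, m, n): P(X >= n).
    Unlike the probability mass function, this value never increases as n
    grows, so a larger n cannot look less extreme than a smaller one."""
    return sum(hypergeom_pmf(M, N, m, x) for x in range(n, min(N, m) + 1))
```

Under this assumption a smaller value means the partial structure co-occurs with the characteristic more often than chance would suggest, so an importance score could be taken as, for example, the negative logarithm of the tail probability.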

2) Dice coefficient D(N,n,m)

3) Cosine coefficient C(N,n,m)

4) Simpson coefficient S(N,n,m)

5) Mutual Information (MI) S(M,N,m,n)

The partial structures above can be counted in two ways: by recording only whether a partial structure is present in a given compound, or by counting every occurrence of the partial structure.
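The formulas for items 2) to 5) are not given in the text. The sketch below uses the common textbook definitions of these coefficients, expressed with the counts M, N, m, and n defined above, and treats MI as pointwise mutual information; these definitions, the function names, and the `presence_only` flag are assumptions for illustration.

```python
from collections import Counter
from math import log, sqrt


def dice(N, n, m):
    """Dice coefficient: 2n / (N + m)."""
    return 2 * n / (N + m)


def cosine(N, n, m):
    """Cosine coefficient: n / sqrt(N * m)."""
    return n / sqrt(N * m)


def simpson(N, n, m):
    """Simpson (overlap) coefficient: n / min(N, m)."""
    return n / min(N, m)


def pointwise_mi(M, N, m, n):
    """Pointwise mutual information: log(n * M / (N * m))."""
    return log(n * M / (N * m)) if n > 0 else float("-inf")


def substructure_counts(fragments, presence_only=False):
    """The two counting methods: presence/absence per compound,
    or the number of occurrences of each partial structure."""
    counts = Counter(fragments)
    if presence_only:
        return {frag: 1 for frag in counts}
    return dict(counts)
```

Which counting method is used changes M, N, m, and n, and therefore every score derived from them.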

Using these importance scores, the similarity to the query compound from the viewpoint of the compound characteristic of interest is calculated with a similarity score function.

The similarity score function between compound j and the query compound is calculated by one of two alternative functions, but is not limited to these. Here, S_ik is the importance score, for characteristic k, of the i-th partial structure among the partial structures that the query compound and the search-target compound have as constituents, and N is the number of partial structures that make up the compound. These similarity score functions are managed in table 380, in which the similarity function ID 381 is the primary key and 382 is the similarity score function area.
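The two similarity score formulas are not present in the text, so the following is only a sketch of what such functions might look like given the definitions of S_ik and N. It assumes that the importance scores of the partial structures shared by the query and the target are summed and then normalized by the number of partial structures; both normalizations and all names here are hypothetical.

```python
from math import sqrt


def similarity_mean(query_subs, target_subs, importance):
    """Assumed form 1: sum of S_ik over shared partial structures,
    divided by N, the number of partial structures in the target."""
    target = set(target_subs)
    if not target:
        return 0.0
    shared = set(query_subs) & target
    return sum(importance.get(s, 0.0) for s in shared) / len(target)


def similarity_cosine_like(query_subs, target_subs, importance):
    """Assumed form 2: the same sum, normalized by the geometric mean
    of the partial-structure counts of the query and the target."""
    query, target = set(query_subs), set(target_subs)
    if not query or not target:
        return 0.0
    total = sum(importance.get(s, 0.0) for s in query & target)
    return total / sqrt(len(query) * len(target))
```

Here `importance` maps each partial structure to its importance score for one characteristic k, so calling either function with a different characteristic's scores gives a different similarity for the same pair of compounds.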

With the above configuration, the compound search support system of the present invention manages characteristic information, compound partial-structure information, and the association information between characteristics and partial structures, and can perform a similar-compound search that takes the importance for each characteristic into account.

According to the present invention, with this configuration the user can obtain similar-compound information for each characteristic simply by entering compound information. The user can also obtain the partial structures on which the similarity is based, together with their importance.

In addition, because several functions for calculating similarity are managed, the user can select the function that best matches the user's own intent.
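The tables of FIG. 3 can be pictured as a small relational schema. The sketch below builds one with Python's built-in `sqlite3` module; the table and column names are hypothetical, the reference numerals in the SQL comments show the assumed correspondence, and the key columns of table 350 are a guess because the description of areas 351 to 353 is ambiguous.

```python
import sqlite3

# Hypothetical names; numerals in comments are the reference numbers of FIG. 3.
SCHEMA = """
CREATE TABLE compound (                    -- table 300
    compound_id INTEGER PRIMARY KEY,       -- 301
    name TEXT NOT NULL,                    -- 302
    n_substructures INTEGER NOT NULL       -- 303
);
CREATE TABLE property (                    -- table 310
    property_id INTEGER PRIMARY KEY,       -- 311
    name TEXT NOT NULL,                    -- 312
    occurrences INTEGER NOT NULL           -- 313
);
CREATE TABLE substructure (                -- table 320
    substructure_id INTEGER PRIMARY KEY,   -- 321
    data TEXT NOT NULL,                    -- 322
    total_occurrences INTEGER NOT NULL     -- 323
);
CREATE TABLE compound_substructure (       -- table 330
    compound_id INTEGER REFERENCES compound(compound_id),              -- 331
    substructure_id INTEGER REFERENCES substructure(substructure_id),  -- 332
    occurrences INTEGER NOT NULL,                                      -- 333
    PRIMARY KEY (compound_id, substructure_id)
);
CREATE TABLE compound_property (           -- table 340
    compound_id INTEGER REFERENCES compound(compound_id),   -- 341
    property_id INTEGER REFERENCES property(property_id),   -- 342
    PRIMARY KEY (compound_id, property_id)
);
CREATE TABLE property_substructure (       -- table 350 (key layout assumed)
    relation_id INTEGER PRIMARY KEY,
    property_id INTEGER REFERENCES property(property_id),
    substructure_id INTEGER REFERENCES substructure(substructure_id),
    co_occurrences INTEGER NOT NULL        -- 354
);
CREATE TABLE importance_function (         -- table 370
    function_id INTEGER PRIMARY KEY,       -- 371
    definition TEXT NOT NULL               -- 372
);
CREATE TABLE importance (                  -- table 360
    relation_id INTEGER REFERENCES property_substructure(relation_id),  -- 361
    function_id INTEGER REFERENCES importance_function(function_id),    -- 362
    score REAL NOT NULL                                                 -- 364
);
CREATE TABLE similarity_function (         -- table 380
    function_id INTEGER PRIMARY KEY,       -- 381
    definition TEXT NOT NULL               -- 382
);
"""


def create_database(path=":memory:"):
    """Create an empty database with the sketched schema and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keeping the score functions in their own tables (370 and 380) is what lets a user switch functions at search time without touching the stored counts.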

FIG. 4 is an explanatory diagram of the compound search interface. 400 is the compound search interface, which executes a search using a compound as the key and displays similar structures as a ranking for each characteristic. 410 is the query compound input area, in which the user enters information on the compound to be searched; 420 is the area for entering the characteristics on which the similar-structure search should focus; 430 is the area for selecting the function that calculates the importance score of each partial structure; 440 is the area for selecting the function that calculates the similarity score for the compound as a whole; 450 is the search button that executes the search; 460 to 462 are search result display switching areas for switching the displayed similar-structure results between characteristics; 470 is the search result display area, which shows the similar-structure search results together with their similarity scores; and 480 is the area that displays the partial structures 481 that make up the query compound and their importance for each characteristic (482 to 484). 490 is the end button for closing the compound search interface 400.

The specific form of the display in the search result display area 470 is described in detail with reference to FIG. 5.

When the user enters compound information in the compound input area 410 and presses the search button 450, the search results are displayed in the search result display area 470, ordered from the compound most similar to the query compound from the viewpoint of the selected characteristic. If the user wants to view the results for another characteristic, selecting one of the search result display switching areas 460 to 462 changes the contents of the search result display area 470 to those for that characteristic.

If the ranking shown in the search result display area 470 does not match the user's intent, the user can change the selection of the importance score function 430 or the similarity score function 440. This changes the similarity measure, and the search results in the search result display area 470 are redisplayed accordingly.

According to the present invention, with this configuration the partial structures related to compounds can be displayed for each characteristic in descending order of similarity to the query compound.

In addition, because the user can change the function that serves as the measure of compound similarity, searches that better match the user's intent can be executed.

FIG. 5 is an explanatory diagram of the form of display in the search result display area. 500 shows, for each characteristic, the search results for similar compounds obtained using the query compound as the key. 501 is the similarity rank, 502 is the compound name, 503 is the compound structure, 504 is the similarity score, and 505 shows the partial structures on which the similarity judgment is based, together with how important each partial structure is for the characteristic.

FIG. 6 is a flowchart showing the detailed procedure of process 600, which registers characteristic/compound information and calculates the characteristic-specific importance of partial structures. When the characteristic/partial-structure association information update mechanism 220 executes the characteristic/partial-structure association information reading process 605, it reads one record of characteristic/partial-structure association information (step 610) and checks whether the characteristic and compound information are already registered (step 615). If they are registered, it checks whether any data remains to be read (step 665). If they are not registered, it decomposes the compound into partial structures using the compound structure decomposition mechanism 250 (step 620). Next, it checks whether the compound name is registered (step 625) and, if not, makes new entries in table 300 and table 330 (step 630). It then checks whether all of the partial structures are registered (step 635) and registers any unregistered ones in table 320 (step 640). If the relationship between the compound and its partial structures (table 330) is not registered, it is registered. If the number of occurrences of each partial structure in the compound (333) is not registered, it is registered, and the total number of occurrences of the partial structure in the compound database (323) is incremented.

Next, the mechanism checks whether the characteristic name is registered (step 650). If it is not, the name is newly registered (step 655); if it is, its occurrence count (characteristic occurrence count 313) is incremented by one. The compound-characteristic pair is registered in table 340 (step 660), and the relationship between partial structure and characteristic is registered in table 350 (step 665). If all of the characteristic/compound information has been read, processing proceeds to step 675; otherwise the next item is read in step 610 (step 670). In step 675, the importance score of each partial structure is calculated for each characteristic and registered in table 360.
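The registration flow of FIG. 6 can be sketched with in-memory dictionaries standing in for tables 300 to 350. This is an illustration under stated assumptions, not the patented implementation: the `CompoundDB` class and `decompose` function are hypothetical, a "structure" is simplified to a whitespace-separated list of fragment labels, and the characteristic/partial-structure count is assumed to be incremented once per compound rather than once per fragment occurrence.

```python
from collections import defaultdict


def decompose(structure):
    """Hypothetical stand-in for the compound structure decomposition mechanism 250."""
    return structure.split()


class CompoundDB:
    def __init__(self):
        self.compounds = {}                         # table 300: name -> total fragments (303)
        self.property_count = defaultdict(int)      # table 310: characteristic -> count (313)
        self.fragment_total = defaultdict(int)      # table 320: fragment -> total count (323)
        self.compound_fragments = {}                # table 330: name -> {fragment: count (333)}
        self.compound_property = set()              # table 340: (name, characteristic) pairs
        self.property_fragment = defaultdict(int)   # table 350: (characteristic, fragment) -> count (354)

    def register(self, name, structure, prop):
        """Register one characteristic/compound record; return False if already present."""
        if (name, prop) in self.compound_property:      # step 615
            return False
        fragments = decompose(structure)                # step 620
        if name not in self.compounds:                  # steps 625 to 640
            counts = defaultdict(int)
            for frag in fragments:
                counts[frag] += 1
                self.fragment_total[frag] += 1          # increment total occurrences (323)
            self.compounds[name] = len(fragments)
            self.compound_fragments[name] = dict(counts)
        self.property_count[prop] += 1                  # steps 650 and 655
        self.compound_property.add((name, prop))        # step 660
        for frag in set(fragments):                     # step 665 (once per compound: assumption)
            self.property_fragment[(prop, frag)] += 1
        return True
```

After all records are read, the counts held in `property_fragment`, `property_count`, and `fragment_total` are the inputs that an importance calculation such as step 675 would need.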

According to the present invention, this processing makes it possible to obtain, for each characteristic, the partial structures associated with the compounds.
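Step 675 calculates an importance score for each characteristic and partial structure, and the claims name several candidate measures. As a rough sketch, assuming four counts are available (co-occurrences of the characteristic and the partial structure, occurrences of each alone, and the total number of records), some of those measures could be computed as follows. The function name and argument names are hypothetical. Entropy gain is omitted for brevity, "mutual information" is read here as pointwise mutual information, and the "integrated value of the hypergeometric distribution" is read as the upper-tail probability; both readings are assumptions.

```python
import math


def importance_scores(n_pf, n_p, n_f, n_total):
    """Association measures between a characteristic p and a partial structure f.

    n_pf: records having both p and f; n_p: records having p;
    n_f: records having f; n_total: all records. Assumes n_p > 0 and n_f > 0.
    """
    dice = 2 * n_pf / (n_p + n_f)
    simpson = n_pf / min(n_p, n_f)
    cosine = n_pf / math.sqrt(n_p * n_f)
    # Pointwise mutual information; undefined (minus infinity) when there is no co-occurrence.
    pmi = math.log(n_pf * n_total / (n_p * n_f)) if n_pf > 0 else float("-inf")
    # Upper tail of the hypergeometric distribution: the probability of seeing
    # at least n_pf co-occurrences if p and f were independent.
    upper = min(n_p, n_f)
    tail = sum(
        math.comb(n_f, k) * math.comb(n_total - n_f, n_p - k)
        for k in range(n_pf, upper + 1)
    ) / math.comb(n_total, n_p)
    return {"dice": dice, "simpson": simpson, "cosine": cosine,
            "pmi": pmi, "hypergeom_tail": tail}
```

For the first four measures a larger value indicates a stronger association, whereas a smaller hypergeometric tail indicates a stronger one, so a system that offers them interchangeably would need to normalize the direction (for example by taking the negative logarithm of the tail).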

FIG. 7 is a flowchart showing the detailed procedure of the search processing. When the search mechanism 210 executes the search processing 700, the compound search interface is displayed (step 710) and input is accepted (step 720). After step 720, the compound entered in the compound input area 410 is split into partial structures using the compound structure decomposition mechanism 250 (step 731). For each search-target characteristic entered in 420, the importance score function ID specified in the input section 430 and all of the partial structures are used as keys to obtain the importance score (364) of each partial structure and the occurrence counts of the partial structures (323, 333). The IDs of the compounds that contain these partial structures are obtained from table 330 (331), and the name (302) and total number of partial structures (303) of each compound ID are obtained from table 300. Using these values and the similarity score function specified in the input section 440, the similarity to the query compound is calculated for each characteristic and each compound (step 732). The results are displayed in 470 in order of similarity, and the characteristic-specific importance of the partial structures is displayed in 480 (step 733). To switch the display between characteristics, the user selects one of 460-462 (step 740), and the display changes (step 750). Changing the importance score function (430) or the similarity score function (440) returns processing to step 732, and entering a new compound returns it to step 730. Pressing the end button 490 terminates the search system (step 760).
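The scoring part of the search (steps 731 to 733) can be sketched as below. The similarity function is user-selectable in the described system and is not fixed by this passage, so the importance-weighted Tanimoto ratio used here is an assumption chosen for illustration, as are the function names and the dictionary-based inputs (fragment counts per compound, importance weights per characteristic).

```python
def weighted_tanimoto(query, candidate, weights):
    """Importance-weighted overlap of two fragment-count dictionaries (assumed measure)."""
    keys = set(query) | set(candidate)
    shared = sum(weights.get(f, 0.0) * min(query.get(f, 0), candidate.get(f, 0)) for f in keys)
    union = sum(weights.get(f, 0.0) * max(query.get(f, 0), candidate.get(f, 0)) for f in keys)
    return shared / union if union else 0.0


def search(query, library, importance_by_property):
    """Return, for each characteristic, (compound, score) pairs in descending score order.

    query: fragment -> count for the query compound (result of step 731)
    library: compound name -> {fragment: count} (stand-in for table 330)
    importance_by_property: characteristic -> {fragment: importance} (stand-in for table 360)
    """
    results = {}
    for prop, weights in importance_by_property.items():
        scored = [
            (name, weighted_tanimoto(query, frags, weights))
            for name, frags in library.items()
            if set(query) & set(frags)      # only compounds sharing a partial structure
        ]
        results[prop] = sorted(scored, key=lambda pair: pair[1], reverse=True)  # step 732
    return results
```

Because the weights differ by characteristic, the same query can rank the same library differently under each characteristic, which is the behavior the display switching areas 460-462 expose to the user.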

According to the present invention, this processing makes it possible to display compound search results for each characteristic in order of similarity to the query compound. It also makes it possible to show the user how important each partial structure is for each characteristic.

The present invention relates to a method for searching a large-scale compound library for compounds similar to a query compound from a specific viewpoint or characteristic.

FIG. 1 is a block diagram showing the system configuration of the compound search support system of the present invention.

FIG. 2 is a detailed block diagram of the information control mechanism.

FIG. 3 is an explanatory diagram of the tables stored in the characteristic/compound information storage area and in the importance score and similarity score derivation function information storage area.

FIG. 4 is an explanatory diagram of the compound search interface.

FIG. 5 is an explanatory diagram of the form displayed in the search result display area.

FIG. 6 is a flowchart showing an outline of the procedure for characteristic/compound information registration in the compound search support system.

FIG. 7 is a flowchart showing the detailed procedure of the search execution processing.

Explanation of symbols

100 ... compound search support system, 110 ... information control mechanism, 120 ... compound database, 121 ... characteristic/compound partial structure related information storage area, 122 ... importance score and similarity score derivation function storage area, 130 ... database control mechanism, 140 ... Internet, 150 ... input/output terminal
200 ... information distribution mechanism, 210 ... search mechanism, 220 ... characteristic/partial structure related information update mechanism, 230 ... importance score function change mechanism, 240 ... similarity score function change mechanism, 250 ... compound structure decomposition mechanism
300 ... compound name management table, 301 ... compound ID area, 302 ... compound name area, 303 ... total partial structure count area
310 ... characteristic information management table, 311 ... characteristic ID area, 312 ... characteristic name area, 313 ... characteristic occurrence count area
320 ... partial structure information table, 321 ... partial structure ID area, 322 ... partial structure data area, 323 ... total partial structure occurrence count area
330 ... compound constituent partial structure table, 331 ... compound ID area, 332 ... partial structure ID area, 333 ... partial structure occurrence count area
340 ... compound/characteristic table, 341 ... compound ID area, 342 ... characteristic ID area
350 ... characteristic/partial structure relation table, 351 ... relation ID area, 352 ... characteristic ID area, 353 ... partial structure ID area, 354 ... relation occurrence count area
360 ... importance table, 361 ... relation ID area, 362 ... function ID area, 363 ... partial structure ID area, 364 ... importance score area
370 ... importance score table, 371 ... function ID area, 372 ... importance score function area
380 ... similarity score table, 381 ... similarity function ID area, 382 ... similarity score function area
400 ... compound search interface, 410 ... query compound input area, 420 ... search target characteristic selection area, 430 ... importance score function selection area, 440 ... similarity score function selection area, 450 ... search button, 460 to 463 ... search result display switching areas, 470 ... search result display area, 480 ... area showing partial structure importance by characteristic, 481 ... area showing the importance of the partial structure under characteristic P1, 482 ... area showing the importance of the partial structure under characteristic P2, 483 ... area showing the importance of the partial structure under characteristic P3, 490 ... end button
500 ... display form for compounds in order of similarity, 501 ... rank display area, 502 ... compound name display area, 503 ... compound structure display area, 504 ... similarity score display area, 505 ... display area for partial structures and their importance scores
600 ... characteristic/compound registration scheme, 605 ... start of registration, 610 ... step of reading characteristic/compound information, 615 ... step of checking whether the characteristic/compound name pair is registered, 620 ... step of decomposing the compound into partial structures, 625 ... step of checking whether the compound name is registered, 630 ... step of newly registering the compound name, 635 ... step of checking whether the partial structures and the compound-partial structure relations are registered, 640 ... step of newly registering the partial structures and the compound-partial structure relations, 645 ... step of incrementing the partial structure occurrence count, 650 ... step of checking whether the characteristic name is registered, 655 ... step of newly registering the characteristic name, 660 ... step of registering unregistered compound-characteristic pairs, 665 ... step of registering unregistered partial structure-characteristic ID pairs and incrementing the occurrence count of registered ones, 670 ... step of checking whether all characteristic/compound information has been read, 675 ... step of calculating the importance score of each partial structure for each characteristic, 680 ... end of registration

Claims

A compound search support system comprising: means for decomposing a compound structure into its constituent partial structures; means for generating, using the decomposing means, a set of relations between compound partial structures and compound characteristics; means for storing the total number of the relations; means for storing the compound characteristic names, compound names, and compound structure information appearing in the relations, together with their numbers of occurrences and the total number of occurrences of each compound partial structure; means for storing a characteristic-specific score of each partial structure; means for inputting information on a compound to be searched for; means for searching for similar compounds using the characteristic-specific scores; and means for displaying the search results for each characteristic in order of similarity to the query compound.

The compound search support system according to claim 1, wherein, when the user registers or updates compound information including compound characteristics, the partial structures and the characteristic-specific importance scores of the partial structures are calculated automatically.

The compound search support system according to claim 1, wherein the functions for deriving the characteristic-specific importance score of a partial structure and the similarity score between compounds can be changed by the user as appropriate for each characteristic.

The compound search support system according to claim 1, wherein the total number of occurrences of each pair of characteristic name and partial structure, the total numbers of occurrences of the characteristic names and of the partial structures, and the total number of partial structure occurrences in the registered compounds are used to calculate one or more of the Dice coefficient, entropy gain, mutual information, Simpson coefficient, cosine, and the integrated value of the hypergeometric distribution, and the result is used as the importance of the partial structure for a given characteristic.