JP4895689B2

JP4895689B2 - System and method for retrieving chemical structure from large-scale chemical structure database at high speed

Info

Publication number: JP4895689B2
Application number: JP2006150342A
Authority: JP
Inventors: 篤史吉森; 靖一田沼
Original assignee: INSTITUTE FOR THEORETICAL MEDICINE, INC.
Current assignee: INSTITUTE FOR THEORETICAL MEDICINE, INC.
Priority date: 2006-05-30
Filing date: 2006-05-30
Publication date: 2012-03-14
Anticipated expiration: 2026-05-30
Also published as: JP2007323182A

Description

The present invention relates to a system and method for rapidly retrieving chemical structures from a large-scale chemical structure database.

Chemical structure databases have become indispensable tools not only in modern chemistry and drug discovery research but also in areas such as patent information and reagent management. For example, SciFinder [http://www.jaici.or.jp/sci/SCHOLAR/index.html] is widely used for chemical structure searches against the Chemical Abstracts System, and ISIS [http://www.mdli.com/] is widely used for chemical structure searches against in-house corporate databases.

Many scientists have studied chemical structure search algorithms intensively since 1960. A substructure search determines whether a specified query structure is contained in a given target structure. In graph-theoretic terms, a substructure search checks whether the query graph (G_Q) is isomorphic to a subgraph of the target graph (G_T), and finding an isomorphism between G_Q and a subgraph of G_T is known to be an NP-complete problem [Non-Patent Document 1]. Deciding isomorphism quickly is therefore very difficult in general, but many substructure search methods address the problem by making efficient use of backtracking [Non-Patent Document 2].

However, when searching a chemical structure database of 50,000 or more structures, backtracking alone requires a long search time (tens of seconds or more) and is not practical. A technique called “screening” was therefore developed, which quickly eliminates graphs (chemical structures) that clearly cannot be isomorphic before backtracking is performed.
Screening methods usually represent a chemical structure as a bit string. Each bit stands for some fragment (a benzene ring, an amide group, and so on): 1 indicates that the fragment is present in the chemical structure and 0 that it is absent. The bit strings of the target structures in the chemical structure database are generated in advance and stored in the database, whereas the bit string of the query structure is generated at search time. The query bit string is then compared with each target bit string in turn; if even one bit that is set to 1 in the query bit string is not set in the target bit string, the target is eliminated as having no possibility of isomorphism. Because this comparison can be carried out with bitwise operators (AND, OR, XOR), it is very fast [Non-Patent Document 3]. Screening with bit strings, and improved variants of it, generally yields an overall speed-up of 10 to 20 times, and is therefore used in many current structure search systems.
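By way of illustration only, the bit-string comparison described above can be sketched as follows. The four-fragment key, the compound identifiers, and the function name `passes_screen` are hypothetical and chosen for this example; real screening keys contain far more fragments.

```python
def passes_screen(query_bits: int, target_bits: int) -> bool:
    """Return True if every bit set in the query is also set in the target."""
    return (query_bits & target_bits) == query_bits


# Hypothetical key: bit 0 = benzene ring, bit 1 = amide group,
# bit 2 = carboxyl group, bit 3 = nitro group.
targets = {
    "cmpd-1": 0b0011,  # benzene ring + amide
    "cmpd-2": 0b0101,  # benzene ring + carboxyl
    "cmpd-3": 0b1011,  # benzene ring + amide + nitro
}
query = 0b0011  # the query contains a benzene ring and an amide group

# Targets that survive screening still have to be verified by backtracking.
candidates = [cid for cid, bits in targets.items() if passes_screen(query, bits)]
print(candidates)  # ['cmpd-1', 'cmpd-3']
```

Note that every target bit string must still be visited once, which is the sequential scan that the following paragraph identifies as the bottleneck.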

Database searches are normally accelerated with an index (the equivalent of the index of a book). Because indexes are designed for numeric values and character strings, they cannot be used effectively in existing chemical structure searches, which require operations between bit strings. Most chemical structure searches must therefore repeat the bit operation sequentially from the beginning of the database to the end, and this is the largest bottleneck to faster searching.

Meanwhile, the search systems most widely used for large-scale databases in recent years are full-text search systems, typified by the Google and Yahoo! search engines, which “retrieve documents containing a specific character string from a large number of documents”. The speed of full-text search is evident to anyone who uses a search engine. A full-text search system typically breaks documents down into words using morphological analysis [Non-Patent Document 4] or the N-gram method [Non-Patent Document 5], and then indexes the words with, for example, an inverted index. This makes it possible to find quickly which words occur in which documents.
Non-Patent Document 1: M. Garey, D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, New York, 1979.
Non-Patent Document 2: J. Xu, J. Chem. Inf. Comput. Sci. 1996, 36, 25-34.
Non-Patent Document 3: M. F. Lynch, Chemical Information Systems, J. E. Ash, E. Hyde eds., Ellis Horwood, Chichester, 1985, 88-93.
Non-Patent Document 4: Toru Hisamitsu, Yoshihiko Nitta, “Efficient verb conjugation processing in Japanese morphological analysis”, IPSJ SIG Technical Report, 94-NL-103, September 1994, 1-7.
Non-Patent Document 5: Norimichi Yodo, Katsunobu Ito, Kiyohiro Shikano, Satoshi Nakamura, “Study on parameter reduction based on entropy of N-gram model”, IPSJ Journal, February 2001, Vol. 42, No. 2, 327-333.

In chemical structure searching, both simultaneous access requests from multiple users via the Web and the number of compounds registered in databases are increasing year by year, so faster chemical structure search methods than ever are needed. Moreover, a search method that achieves high speed without depending on an expensive commercial database and without special hardware such as a supercomputer can benefit directly from the so-called “cheap revolution” (the rapid rise in performance and fall in price of computers, together with the emergence of highly functional free software provided as open source, has made it possible to develop highly functional applications at low cost; for details of the “cheap revolution”, see Mochio Umeda, “Web Evolution”, Chikuma Shinsho), and makes it possible to develop a chemical structure search system with overwhelmingly high cost performance.
For these reasons, the development of a fast, inexpensive chemical structure search method is expected not only to satisfy simultaneous access requests from multiple users to large-scale databases, but also to make it possible to provide Web application systems based on chemical structure search, which have so far remained unexplored.

In view of the above problems, and as a result of intensive study, the inventors found that if a chemical structure can be represented as a “document” and its substructures as “words”, an existing full-text search system can be used as it is, so that a search system as fast as Google or Yahoo! can also be realized for chemical structure search. They developed the ITMChemString (ITCS) method, which allows the “screening” stage of chemical structure search to be performed by full-text search.
The gist of the present invention is as follows.
[1] A chemical structure search system for searching for compounds having a predetermined substructure, comprising:
means for representing the chemical structure of a compound input to a computer as a tree structure consisting of nodes corresponding to atoms and edges corresponding to bonds between atoms, selecting one of the nodes as a root node, determining a path from the root node by depth-first search, and converting the chemical structure of the compound into a character string according to the determined path;
means for converting all substructures constituting the compound into character strings on the basis of the tree structure; and
a storage medium for recording and storing, as a chemical structure database, the resulting character string representation of the chemical structure of the compound and the character string representations of the substructures of the compound, together with a unique ID identifying the compound.
[2] A method for searching for compounds having a predetermined substructure, comprising:
a step of issuing a search request to the chemical structure search system according to [1] above, using as the query the character string representations of the substructures of a compound obtained by 1) representing the chemical structure of the compound input to a computer as a tree structure consisting of nodes corresponding to atoms and edges corresponding to bonds between atoms, selecting one of the nodes as a root node, determining a path from the root node by depth-first search, and converting the chemical structure of the compound into a character string according to the determined path, and 2) converting all substructures constituting the compound into character strings on the basis of the tree structure;
a step in which the computer performs full-text search processing of the issued search request using the chemical structure database included in the chemical structure search system, and returns as search results the unique IDs identifying compounds and the character string representations of the chemical structures of the corresponding target compounds; and
a step of performing chemical structure search processing on the obtained IDs and character string representations of the chemical structures of the target compounds, and presenting as search results the IDs of the compounds having the substructure.

In the chemical structure search of the present invention, a chemical structure is represented as a “document” and its substructures as “words”, so an existing full-text search system can be used as it is, and a search system as fast as Google or Yahoo! can preferably be realized for chemical structure search as well.

In this specification, chemical structure search means both substructure search and full-structure search of compounds.

The present invention provides a chemical structure search system for searching for compounds having a predetermined substructure, as described above. The processing performed by each of the above means is described in detail below.

Each conversion into a character string may be performed manually or by information processing means such as a computer. For efficiency, however, it is preferable to use information processing means such as a computer and to build a program or system in which, once the chemical structure of a compound has been input, the subsequent processing is performed automatically by the computer.

The storage medium for recording and storing the chemical structure database may be any storage medium capable of recording and storing, as a database, the character string representation of the chemical structure of a compound and the character string representations of its substructures together with a unique ID identifying the compound. Examples of such storage media include hard disks, non-volatile memory, magnetic disks, optical disks, and magnetic tape located inside or outside a computer.

In addition to the unique ID identifying a compound, the character string representation of its chemical structure, and the character string representations of its substructures, the chemical structure database may also record and store other information about the compound in association with the unique ID. Such information includes intrinsic physical properties of the compound such as melting point, boiling point, molecular weight, logP, and molecular surface area, as well as reactivity, structure, and reaction pathway information.

An overview of the ITCS method of the present invention is shown in FIG. 1.

In the ITCS method, an ITCS is first generated for each target compound by the “ITCS generation process” (described in detail below). An ITCS is a character string representing the chemical structure of a compound (it corresponds to a document in ordinary full-text search).
Next, ITCS WORDs are generated by the “ITCS WORD generation process” (described in detail below). ITCS WORDs are obtained by enumerating all substructures constituting the chemical structure and converting them into character strings (they correspond to words in ordinary full-text search).
The generated ITCS and ITCS WORDs are stored in a database together with the ID and other additional information. The number of target compounds stored here ranges from one to tens of millions, and a relational database such as PostgreSQL, MySQL, or Oracle can be used.

A query compound is input through general chemical structure input means (for example, chemical structure drawing software customary in the field and running on a general-purpose computer, such as ChemDraw or ISIS/Draw) in a general-purpose molecular structure interchange format such as sdf or mol2, and its ITCS and ITCS WORDs are generated in the same way as for target compounds.
Next, a search request is issued to the database with the ITCS WORDs of the query compound as the query. The database passes the search request (the ITCS WORDs) to the “full-text search process” and obtains the IDs of the matching data in response. The “full-text search process” is a technique known to those skilled in the art; it is not particularly limited, and any system ordinarily used for full-text search can be used as appropriate. For example, Namazu, Rast, or Estraier can be used outside the database, and TSearch2 can be used inside the database in the case of PostgreSQL.
The database returns the IDs obtained by the “full-text search process”, together with the corresponding ITCSs, as the response to the query.

The process up to this point is the “screening”.
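As a rough illustration of how full-text search can play the role of screening, the sketch below builds an in-memory inverted index from ITCS WORDs to compound IDs and intersects the posting sets of the query words. It is a stand-in for the full-text search systems named above, not their actual API; the function names, the compound IDs, and the short word lists are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(records):
    """Map each ITCS WORD to the set of compound IDs that contain it."""
    index = defaultdict(set)
    for compound_id, words in records.items():
        for word in words:
            index[word].add(compound_id)
    return index


def screen(index, query_words):
    """Return the IDs of compounds that contain every query word."""
    hits = None
    for word in query_words:
        posting = index.get(word, set())
        hits = set(posting) if hits is None else hits & posting
    return hits if hits is not None else set()


# Hypothetical database: compound ID -> ITCS WORDs of that compound.
records = {
    1: {"NCC", "OdCN", "CaCaC"},
    2: {"NCC", "CaCaC"},
    3: {"NCC", "OdCN"},
}
index = build_inverted_index(records)
print(sorted(screen(index, {"NCC", "OdCN"})))  # [1, 3]
```

Unlike the bit-string scan, the lookup touches only the posting sets of the query words, so its cost does not grow with every record in the database.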

Finally, a structure search is carried out by the “chemical structure search process” on the ID + ITCS pairs obtained by screening and the ITCS of the query compound, and the IDs of the hit compounds and associated information (for example, intrinsic physical properties such as melting point, boiling point, molecular weight, logP, and molecular surface area, as well as reactivity, structure, and reaction pathway information) are presented.
A known backtracking method can be used for the “chemical structure search process”. The process can be used outside the database, and it can also be built into the database.
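For reference, a generic textbook-style backtracking matcher is sketched below; it is not the specific algorithm of Non-Patent Document 2. The molecule representation (a list of element symbols plus a dictionary of bonds keyed by atom-index pairs, with "" for a single bond and "d" for a double bond) and the function name `has_substructure` are assumptions made for this example.

```python
def has_substructure(q_atoms, q_bonds, t_atoms, t_bonds):
    """Backtracking test: does the target contain the query as a substructure?"""

    def bond(bonds, i, j):
        # Bonds are undirected, so look the pair up in both orientations.
        return bonds.get((i, j), bonds.get((j, i)))

    mapping = {}  # query atom index -> target atom index

    def extend(qi):
        if qi == len(q_atoms):
            return True  # every query atom has been placed
        for ti in range(len(t_atoms)):
            if ti in mapping.values() or t_atoms[ti] != q_atoms[qi]:
                continue
            # Every query bond to an already-mapped atom must exist in the
            # target with the same bond order.
            consistent = all(
                bond(q_bonds, qi, qj) is None
                or bond(q_bonds, qi, qj) == bond(t_bonds, ti, mapping[qj])
                for qj in mapping
            )
            if consistent:
                mapping[qi] = ti
                if extend(qi + 1):
                    return True
                del mapping[qi]  # undo the choice and try the next target atom
        return False

    return extend(0)


# Target: C-C-N (an ethylamine skeleton, hydrogens omitted).
target_atoms = ["C", "C", "N"]
target_bonds = {(0, 1): "", (1, 2): ""}

print(has_substructure(["C", "N"], {(0, 1): ""}, target_atoms, target_bonds))   # True
print(has_substructure(["C", "N"], {(0, 1): "d"}, target_atoms, target_bonds))  # False
```

Because the worst-case cost of such a matcher grows rapidly with molecule size, it is applied only to the candidates that survive screening.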

The above process makes it possible to retrieve chemical structures from a large-scale database at high speed.

“ITCS generation process”
Known character-string (linear) notations for chemical structures include WLN, SMILES, and ROSDAL, which are widely used. In the present invention, for reasons of extensibility, the ITCS notation was developed independently rather than adopting a known notation. The ITCS generation process is shown in FIG. 2.

Step 1:
Prepare the chemical structure in a general-purpose molecular structure interchange format such as sdf or mol2. At this point, bonds with aromatic character are identified using Hückel's rule.

Step 2:
Treat the chemical structure as a tree structure, with nodes corresponding to atoms and edges corresponding to bonds. The root node may be any atom. Each node holds the element, the atom ID number, and the visit number; each edge holds the bond order. Atom ID numbers may be assigned to the atoms in any manner when the chemical structure is created, but they are normally assigned by whatever software is used to prepare the molecular structure interchange format.

Step 3:
Starting from the root node of the tree created in Step 2, determine the route by depth-first search. Depth-first search is a known algorithm. During the search, the nodes are numbered from 0 in the order in which they are visited (the visit number), and each node stores its number.
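The visit numbering in Step 3 can be sketched as follows. The adjacency-list representation keyed by atom ID number and the function name `visit_numbers` are assumptions for this example; the order in which neighbours are listed determines the route.

```python
def visit_numbers(adjacency, root):
    """Depth-first search from root; returns {atom ID: visit number}, counted from 0."""
    visit = {}

    def walk(atom):
        visit[atom] = len(visit)  # next unused visit number
        for neighbour in adjacency[atom]:
            if neighbour not in visit:
                walk(neighbour)

    walk(root)
    return visit


# Hypothetical four-atom molecule: atom 1 bonded to atom 2, which is bonded to atoms 3 and 4.
adjacency = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}

print(visit_numbers(adjacency, root=1))  # {1: 0, 2: 1, 3: 2, 4: 3}
print(visit_numbers(adjacency, root=3))  # {3: 0, 2: 1, 1: 2, 4: 3}
```

Choosing a different root changes the visit numbers, which is why the same molecule can yield several different strings.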

Step 4:
Create the ITCS according to the route determined in Step 3.
ITCS rules:
・An atom is represented by its atom name (a character string such as C, N, or O).
・A bond is represented by 'd' for a double bond, 't' for a triple bond, 'a' for an aromatic bond, and '' (nothing) for a single bond.
・An atom that has already been visited on the route is represented by its visit number rather than its atom name. If the start node of the next edge to be visited differs from the end node of the edge currently being visited, the start atom of the next edge is written as '<visit number>'. Furthermore, if the end atom of the next edge has already been visited, that atom is written as '[visit number]'. For example, referring to Step 3 of FIG. 2, if the edge currently being visited is N(1)-C(2), the next edge to be visited is N(1)-C(3). In this case, the end of the current edge is C(2) and the start of the next edge is N(1). The two are not the same node, and N(1) has already been visited, so it is written as <1>.
The case in which [ ] is used is explained below, taking cyclohexane as an example.

If the edge currently being visited is C(4)-C(5) and the next edge to be visited is C(0)-C(5), the start of the next edge differs from the end of the current edge and C(0) has already been visited, so it is written as <0>. The end node of the next edge, C(5), has also already been visited, so the next edge is written as <0>[5].
・Finally, append '/' (a slash), followed by the atom ID numbers in visit-number order, separated by ',' (commas). Ordinarily, once a chemical structure has been converted into a character string, the atom ID numbers of the source chemical structure file can no longer be retained. In the present invention, they are retained by appending them after the slash. The atoms matched in a chemical structure search using ITCS can thus be mapped back to the atoms in the source chemical structure file, so that, for example, the matched atoms can be highlighted in the source chemical structure file.
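The rules of Step 4 can be sketched in code as follows. This is one possible reading rather than the reference implementation: in particular, the order in which ring-closing edges are written (from the earlier-visited atom, after its other branches have been explored) is inferred from the cyclohexane example, and the input representation (element list, bond triples, atom ID = index + 1) and the function name `itcs` are assumptions.

```python
def itcs(atoms, bonds, root=0):
    """Build an ITCS string.

    atoms: list of element symbols; the atom ID number is assumed to be index + 1.
    bonds: list of (i, j, symbol) with symbol '' (single), 'd', 't' or 'a'.
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j, symbol in bonds:
        adjacency[i].append((j, symbol))
        adjacency[j].append((i, symbol))

    visit = {root: 0}    # atom index -> visit number
    written = set()      # bonds already written to the string
    parts = [atoms[root]]
    prev_end = root      # end node of the edge written most recently

    def walk(u):
        nonlocal prev_end
        for v, symbol in adjacency[u]:
            edge = frozenset((u, v))
            if edge in written:
                continue
            if v in visit and visit[v] < visit[u]:
                continue  # assumed: a ring closure is written from the earlier atom
            written.add(edge)
            if prev_end != u:
                parts.append(f"<{visit[u]}>")  # edge starts at an earlier atom
            parts.append(symbol)
            if v in visit:
                parts.append(f"[{visit[v]}]")  # edge ends at a visited atom
                prev_end = v
            else:
                visit[v] = len(visit)
                parts.append(atoms[v])
                prev_end = v
                walk(v)

    walk(root)
    ids_in_visit_order = [str(i + 1) for i in sorted(visit, key=visit.get)]
    return "".join(parts) + "/" + ",".join(ids_in_visit_order)


ring = [(0, 1, ""), (1, 2, ""), (2, 3, ""), (3, 4, ""), (4, 5, ""), (5, 0, "")]
print(itcs(["C"] * 6, ring))                                            # CCCCCC<0>[5]/1,2,3,4,5,6
print(itcs(["C", "N", "C", "C"], [(0, 1, ""), (1, 2, ""), (1, 3, "")]))  # CNC<1>C/1,2,3,4
```

The first call reproduces the <0>[5] closure of the cyclohexane example, and the second reproduces the <1> branch marker of the N(1)-C(2), N(1)-C(3) example.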

“ITCS WORD generation process”
The purpose of this process is to enumerate all substructures constituting an arbitrary chemical structure and to convert them into character strings (corresponding to “words” in ordinary full-text search). The main part of the ITCS WORD generation process is shown in FIGS. 3 and 4.

Step 1:
The tree created in the ITCS generation process is taken as the “basic tree”.

Step 2:
Starting from the basic tree, construct a “growth tree” according to the “growth tree” construction rules below.
Apply the rules until the tree can grow no further. The depth of the tree must be set in advance (as n_depth) and can be anywhere from 1 to the number of bonds in the chemical structure. In view of the performance of current computer resources, it is normally set to 4 to 7.
“Growth tree” construction rules:
・Base nodes are selected in the order of a depth-first search of the growth tree; the initial base node is the root node of the basic tree.
・Nodes and edges are added to a base node by copying to it the node of the basic tree corresponding to the base node, that node's child nodes, and the edges joining them; if ancestor nodes exist, the ancestor nodes, their child nodes, and the edges joining them are copied as well. However, an ancestor node, its child nodes, and the edges joining them are not added if they already lie on the path from the root node of the growth tree to the base node. Here, an ancestor node is a node lying on the path from the base node to the root node.

Step 3:
Enumerate all paths from the root node to the terminal nodes of the growth tree constructed in Step 2. Cutting each path at depths 1, 2, ..., n_depth-1 from the root node then yields the paths corresponding to substructures.
All paths generated here are converted into character strings by the same algorithm as in the “ITCS generation process”. The resulting character strings are called ITCS WORDs.

Step 4:
Steps 1 to 3 are repeated with every atom of the chemical structure taken in turn as the root node. In this way, all substructures of the chemical structure up to length n_depth can be converted into character strings as ITCS WORDs.
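The sketch below illustrates only the outer loop of Steps 3 and 4: every atom is taken as the root in turn and paths of up to n_depth bonds are written out as strings. It is a deliberately simplified stand-in that enumerates unbranched paths only; the growth-tree construction of Step 2, which also yields branched and ring-closing words, is not reproduced. The input representation and the function name `path_words` are assumptions.

```python
def path_words(atoms, bonds, n_depth):
    """Strings for all unbranched paths of 1..n_depth bonds, from every root atom."""
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j, symbol in bonds:
        adjacency[i].append((j, symbol))
        adjacency[j].append((i, symbol))

    words = set()

    def grow(path, text):
        if len(path) > 1:
            words.add(text)            # each prefix of a path is itself a word
        if len(path) - 1 == n_depth:   # the path already has n_depth bonds
            return
        for neighbour, symbol in adjacency[path[-1]]:
            if neighbour not in path:
                grow(path + [neighbour], text + symbol + atoms[neighbour])

    for root in range(len(atoms)):     # Step 4: every atom serves as the root
        grow([root], atoms[root])
    return words


# C-C-N skeleton with single bonds only.
print(sorted(path_words(["C", "C", "N"], [(0, 1, ""), (1, 2, "")], n_depth=2)))
# ['CC', 'CCN', 'CN', 'NC', 'NCC']
```

Both CCN and NCC appear in the output because the same path is reached from either end.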

Step 5:
A single chemical structure can be represented by more than one ITCS WORD, so the candidate ITCS WORDs are sorted lexicographically (in alphabetical order) and the largest is used as the representative. For example, CCN and NCC denote the same structure; in lexicographic order NCC > CCN, so NCC is the representative. Each structure thus corresponds to exactly one ITCS WORD.
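Selecting the representative in Step 5 amounts to taking the maximum of the equivalent strings. The sketch below assumes that Python's default string ordering (by code point) is an acceptable stand-in for the alphabetical order mentioned in the text; how marker characters such as '<' and '[' are ranked is not specified there. The function name `representative` is hypothetical.

```python
def representative(equivalent_words):
    """Return the lexicographically largest of several strings for one substructure."""
    return max(equivalent_words)


print(representative(["CCN", "NCC"]))  # NCC
```

Applying the same rule when indexing target compounds and when building the query ensures that both sides use the same word for the same substructure.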

Step 6:
When the same substructure (ITCS WORD) occurs more than once in a chemical structure, a number is appended to the ITCS WORD. The maximum number is 6, and 1 is omitted. For example, when SCO<1>N occurs three times, the ITCS WORDs generated are SCO<1>N, SCO<1>N2, and SCO<1>N3. In addition, special substructures that cannot be expressed within the chosen n_depth length, such as long carbon chains or repeating units (for example, CCCCCCCCCCCCCC or CCNCCNCCNCCNCCNCCN), can be added at this step as external ITCS WORDs.
Note, however, that only the ITCS WORDs of length n_depth and the external ITCS WORDs need to be used as the query (shorter substructures are contained within the ITCS WORDs of length n_depth).
If n_depth is set to the number of bonds constituting the chemical structure, then in principle every substructure can be converted into a character string as an ITCS WORD, so the chemical structure search is completed by screening alone and an existing chemical structure search such as backtracking is no longer needed. However, the number of ITCS WORDs grows exponentially as n_depth increases, so in view of the performance of current computer resources 4 to 7 is appropriate at present. As computer performance improves, n_depth can be made larger.
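The occurrence numbering at the start of Step 6 can be sketched as follows; the function name `numbered_words` and the constant `MAX_COUNT` are introduced for this example only.

```python
from collections import Counter

MAX_COUNT = 6  # the text caps the appended number at 6


def numbered_words(words):
    """Expand repeated ITCS WORDs into word, word2, word3, ... (the suffix 1 is omitted)."""
    result = []
    for word, count in Counter(words).items():
        for k in range(1, min(count, MAX_COUNT) + 1):
            result.append(word if k == 1 else f"{word}{k}")
    return result


print(numbered_words(["SCO<1>N", "SCO<1>N", "SCO<1>N"]))
# ['SCO<1>N', 'SCO<1>N2', 'SCO<1>N3']
```

With this encoding, a query that requires a substructure three times matches only compounds whose word list includes the word with suffix 3.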

Chemical structure search example
A chemical structure search was carried out using the query structure shown below.

The ITCS and ITCS WORDs of the query structure are shown below.
ITCS:
CNCCNCC<4>C<1>[6]<0>CaCaCC<10>aCaCaC<8>a[14]<0>dO/1,2,7,10,4,9,8,13,3,6,11,16,15,14,12,5
ITCS WORD:
“CaCaCaCaCaCa[0] NCCaCaC NCCaC<2>aC NCCaC<0>C NCC<0>CC NCC<0>C<0>C NCCNC CaCaCC<0>C CaCaCaCaC CaCaCaC<2>C CaCaCaCC OdCNCC OdCNC<2>C OdCNC<1>C OdCN<1>CaC OdCCaCaC OdCCaC<2>aC NCCaCaC2 NCCaC<0>C4 NCC<0>CC4 NCC<0>C<0>C5 NCCNC6 CaCaCaCaC6 CaCaCaC<2>C4 CaCaCaCC4 OdCNCC2 OdCNC<1>C2 OdCN<1>CaC2 OdCCaCaC2”

The effectiveness of the ITCS method was verified by examining how the search time varies with the number of compounds in the database, with and without the ITCS method. Here, "without the ITCS method" means a search performed by backtrack-based chemical structure search alone. The results are shown in the following table and in FIG. 5. Examples of hit compounds are also shown below.

These results show that the ITCS method improved search speed by a factor of 100 or more over the backtrack method alone (with 100,000 stored records). Therefore, the present invention makes it possible to search a large-scale chemical structure database at high speed for compounds having a predetermined partial structure.
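The two-stage flow compared in this experiment, full-text screening on ITCS WORDs followed by chemical structure search on the surviving candidates, can be sketched as below. This is an illustrative outline under stated assumptions: the inverted index stands in for the full-text search engine, the `verify` callback stands in for the backtrack-based structure search, and the compound IDs and word sets in the usage example are made up.

```python
from collections import defaultdict


def build_index(db):
    """Build an inverted index from {compound ID: set of ITCS WORDs}."""
    index = defaultdict(set)
    for compound_id, words in db.items():
        for word in words:
            index[word].add(compound_id)
    return index


def screen(index, query_words):
    """Return the IDs of compounds whose records contain every query word."""
    postings = [index.get(word, set()) for word in query_words]
    if not postings:
        return set()
    return set.intersection(*postings)


def search(index, query_words, verify):
    """Screen by ITCS WORD, then confirm each candidate with verify(id)."""
    candidates = screen(index, query_words)
    return sorted(cid for cid in candidates if verify(cid))


# Hypothetical toy database; the words are taken from the query above.
toy_db = {
    1: {"NCCNC", "OdCNCC"},
    2: {"NCCNC"},
    3: {"NCCNC", "OdCNCC", "CaCaCaCaC"},
}
toy_index = build_index(toy_db)
print(search(toy_index, ["NCCNC", "OdCNCC"], verify=lambda cid: True))
```

Because the expensive verification runs only on compounds that pass the screen, the cost of the second stage depends on the number of candidates rather than on the size of the database.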

FIG. 1 is a schematic diagram of a system based on the chemical structure search method of the present invention.
FIG. 2 is a diagram explaining the ITCS generation process.
FIG. 3 is a diagram explaining the ITCS WORD generation process.
FIG. 4 is a continuation of FIG. 3.
FIG. 5 is a graph showing how the search time varies with the number of compounds in the database, with and without the ITCS method.

Claims

1. A chemical structure search system for searching for compounds having a predetermined partial structure, comprising:
means for expressing the chemical structure of a compound input to a computer as a tree structure composed of nodes corresponding to atoms and edges corresponding to bonds between atoms, selecting one of the nodes as a root node, determining a route by depth-first search from the root node, and converting the chemical structure of the compound into a character string according to the determined route;
means for converting into character strings, based on the tree structure, all partial structures constituting the compound up to a length of n_depth, which is a preset tree depth (where n_depth is the number of bonds constituting the chemical structure); and
a storage medium that records and stores, as a chemical structure database, the character string representation of the chemical structure of the compound and the character string representations of its partial structures so obtained, together with a unique ID identifying the compound.

2. A method for searching for a compound having a predetermined partial structure, comprising:
a step of sending a search request to the chemical structure search system according to claim 1, using as a query the character string representations of the partial structures of a compound obtained by
1) expressing the chemical structure of the compound input to a computer as a tree structure composed of nodes corresponding to atoms and edges corresponding to bonds between atoms, selecting one of the nodes as a root node, determining a route by depth-first search from the root node, and converting the chemical structure of the compound into a character string according to the determined route, and
2) converting into character strings, based on the tree structure, all partial structures constituting the compound that have the length n_depth set for the chemical structure search system according to claim 1;
a step of performing, by a computer, full-text search processing on the search request using the chemical structure database included in the chemical structure search system, and returning as a search result a unique ID identifying each compound and the character string representation of the chemical structure of the corresponding target compound; and
a step of performing chemical structure search processing on the returned IDs and character string representations of the chemical structures of the target compounds, and presenting as a search result the IDs of the compounds having the partial structure.

3. The method according to claim 2, wherein in the step of sending the search request, a character string representation of a special partial structure that constitutes the input compound and cannot be expressed within the length n_depth is also used as the query.