KR20120043977A

KR20120043977A - A method for retrieving a gene-disease-chemical relationship using multi-dimensional indexes from huge bio-literatures

Info

Publication number: KR20120043977A
Application number: KR1020100105299A
Authority: KR
Inventors: 김태경; 오정수; 이상혁; 허보경
Original assignee: 한국생명공학연구원
Priority date: 2010-10-27
Filing date: 2010-10-27
Publication date: 2012-05-07
Also published as: KR101448731B1

Abstract

PURPOSE: A method for extracting gene-disease-chemical relations from bio-literature using multi-dimensional indexes is provided to quickly extract gene-disease-chemical relations based on an inverted index and the multi-dimensional indexes, thereby supporting precise sentence-level search. CONSTITUTION: Multi-dimensional indexes for diseases, genes, and chemicals are built from bio-literature. The multi-dimensional indexes are stored according to an index storage structure. When a search word is received from a user, the diseases, genes, and chemicals are analyzed multi-dimensionally.

Description

A method for retrieving a gene-disease-chemical relationship using multi-dimensional indexes from huge bio-literatures

FIELD OF THE INVENTION The present invention relates to text mining techniques in the field of bioinformatics, and more particularly to a method for effectively extracting gene-disease-compound relationships by applying multidimensional indexes built from large-scale biotechnology literature, which increases the efficiency and accuracy of search and enables multidimensional analysis of genes, diseases, and compounds.

Conventionally, in the field of biology, the results of a large number of biological experiments are published in the literature every year, and accordingly, the strategic use of such information is becoming increasingly important.

In addition, the only way to identify gene-disease-compound relationships from biotechnology literature has been keyword search on PubMed. However, about 10,000 documents are currently being managed on PubMed, so this approach has clear limits.

Therefore, there is an increasing demand for an infrastructure that allows information of interest to be found quickly in such a large volume of documents, enabling the verification and inference of life phenomena.

As an example of the prior art for identifying gene-disease-compound relationships from biotechnology literature as described above, see, for example, "PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites," published May 16, 2008, Nucleic Acids Research, 2008, Vol.

That is, the document discloses a system that, given a disease or gene as a query, allows searching for related mutations, symptoms, and drugs.

However, the gene-disease-compound relationship analysis technique disclosed in the above-mentioned document has the disadvantage that it considers only X->Y relationships and cannot analyze X, Y->Z relationships.

In addition, as another example of the prior art, "Integration of text- and data-mining using ontologies successfully selects disease gene candidates," published February 22, 2005, Nucleic Acids Research, 2005, Vol. 33, No. 5, describes selecting candidate disease-causing genes using ontology, text mining, and data mining techniques.

In addition, as another example of the prior art, "Text-mining and information-retrieval services for molecular biology," published on June 28, 2005, Genome Biology, 2005, 6:224 (doi:10.1186/gb-2005-6-7-224), discloses a technique for automatically extracting functional relationships between genes and proteins from text through text mining in molecular biology.

However, there are limitations in identifying gene-disease-compound relationships by keyword-based retrieval from a large-scale biotechnology literature using the methods described in the prior art as described above.

First, the prior-art methods described above produce so-called false positives, in which a result is actually negative but is reported as positive because the target of the query is ambiguous. As a result, the number of documents retrieved is far larger than necessary, and the user needs a long time to check the information.

Second, the prior-art methods described above have no highlight function for genes, diseases, and compounds, so it is difficult for a user to identify them in a sentence at a glance.

Third, the prior-art methods described above often cannot provide summary information on gene-disease-compound relationships, and where summary information is presented, it is mostly the result of manual work and therefore cannot incorporate new information in real time.

Therefore, to solve the problems of the prior art described above, it is desirable to provide a new method for extracting gene-disease-compound relationships from large-scale biotechnology literature that can extract those relationships quickly and flexibly, search and confirm them at the sentence level, and implement an intuitive user interface using indexes. However, no system or method has yet satisfied all of these requirements.

The present invention has been made to solve the problems of the prior art described above. An object of the present invention is therefore to provide a method of extracting gene-disease-compound relationships from large-scale biotechnology literature using multidimensional indexes, which extracts the relationships quickly and flexibly through a multidimensional index structure, allows them to be searched and verified at the sentence level, and implements an intuitive user interface using the indexes.

In order to achieve the above object, the present invention provides a method for extracting a gene-disease-compound relationship from large-scale biotechnology literature, comprising the steps of: constructing a multidimensional index for diseases, genes, and compounds from the large-scale biotechnology literature; storing the constructed multidimensional index according to a predetermined index storage structure; and, when a user inputs a search word, performing multidimensional analysis of diseases, genes, and compounds from the large-scale biotechnology literature using the stored index.

Here, constructing the multidimensional index comprises: extracting only documents whose Abstract field is not null from the PubMed database; dividing the contents of each abstract into sentence units and forming a sentence table through curation; constructing an inverted index for the sentence table; and comparing the synonym dictionaries for genes, diseases, and compounds with the inverted index to build a dimension index for each of genes, diseases, and compounds.

The sentence table may be stored in the order [pubmed id, sentence id, sentence].

In addition, the index storage structure is a star schema structure.

In addition, in the index storage structure, the disease index stores the pubmed ID, sentence number, disease ID, disease name, start position, and end position, and is stored in association with the standard disease name and synonym information for the disease.

Further, in the index storage structure, the gene index stores the pubmed ID, sentence number, gene ID, gene name, start position, and end position, and is stored in association with the standard gene name and synonym information for the gene.

In addition, in the index storage structure, the compound index stores the pubmed ID, sentence number, compound ID, compound name, start position, and end position, and is stored in association with the compound name and synonym information for the compound.

In addition, the index storage structure is configured so that a multidimensional analysis model can be established by adding index information for other analysis dimensions besides the disease index, the gene index, and the compound index.

In addition, the method is configured so that sentences can be analyzed in one dimension (gene, disease, compound), two dimensions (gene-disease, disease-gene, gene-compound, compound-gene, disease-compound, and compound-disease relationships), and three dimensions (gene-disease-compound relationship).

In addition, the method is configured so that, when a user inputs a search word, a color or highlight is applied to each gene, disease, and compound on the screen showing the search results, which provides a visual effect and lets the user understand the contents intuitively.

In addition, the method is configured so that, when a user inputs a search word, the user can first grasp the content at the sentence level and then view the entire abstract, and can thus check the abstract with the sentence as the focus.

Further, the method is configured so that, when a user inputs a search word, keywords related to the search word are displayed, and when one of the keywords is selected, the search results and abstracts corresponding to the search word and the keyword are displayed, so that the user can easily analyze the relationship between the search word and the keyword.

In addition, the method is configured so that the necessary analysis can be performed immediately by accessing the index using SQL, without writing a separate program for extracting gene-disease-compound relationships.

As described above, the present invention provides a method of extracting gene-disease-compound relationships from large-scale biotechnology literature using multidimensional indexes that can extract the relationships quickly by using an inverted index and multidimensional indexes, supports precise sentence-level search, and can extract X, Y->Z relationships in addition to X->Y relationships.

That is, the present invention provides a method of extracting gene-disease-compound relationships in which abstract files are imported from the PubMed database, each abstract is separated into sentence units, an inverted index is generated for the separated sentences, a dimension index recording the name and position of each gene, disease, and compound is created, synonym dictionaries are used to improve search accuracy, and the indexes and sentences are linked to enable multidimensional analysis.

Therefore, according to the present invention, it is possible to derive the relationship between the biotechnological entities from a large volume of literature, and this can be applied to deriving new relational information from literatures of various fields such as chemistry and physics as well as biotechnology.

FIG. 1 is a view for explaining a procedure for constructing a multi-dimensional index for a disease-gene-compound from a large-capacity literature in a method for extracting a gene-disease-compound relationship from a large-capacity biotechnology literature according to the present invention.
FIG. 2 is a view for explaining a storage structure for extracting a disease-gene-compound in a method of extracting a gene-disease-compound relationship from a large-scale biotechnology literature according to the present invention.
FIG. 3 is a view showing a screen showing a basic search result extracted by applying a multi-dimensional analysis structure in a method for extracting a gene-disease-compound relationship from a large-scale biotechnology literature according to the present invention.
FIG. 4 is a view showing a screen for providing the entire abstract content of the extracted sentence in the method for extracting the gene-disease-compound relationship from the large-scale biotechnology literature according to the present invention.
FIG. 5 is a diagram illustrating an input screen and a result screen for multidimensional analysis in a method of extracting a gene-disease-compound relationship from a large-scale biotechnology literature according to the present invention.
FIG. 6 is a view showing the structure of the SQL for extracting the relationship between the compound-disease in the method for extracting the gene-disease-compound relationship from the large-scale biotechnology literature according to the present invention.
FIG. 7 is a view showing a gene-compound relationship analysis screen as an embodiment of a method for extracting a gene-disease-compound relationship from a large-scale biotechnology literature according to the present invention.
FIG. 8 is a view showing a gene-gene relationship analysis screen as an embodiment of a method for extracting a gene-disease-compound relationship from a large-scale biotechnology literature according to the present invention.
FIG. 9 is a view showing a disease-gene-abstract relationship analysis screen as an embodiment of a method for extracting a gene-disease-compound relationship from a large-scale biotechnology literature according to the present invention.
FIG. 10 is a view showing a disease-gene relationship analysis screen as an embodiment of a method for extracting a gene-disease-compound relationship from a large-scale biotechnology literature according to the present invention.

Hereinafter, with reference to the accompanying drawings, the details of the method for extracting the gene-disease-compound relationship from the large-scale biotechnology literature according to the present invention will be described.

Hereinafter, it is to be noted that the following description is only an embodiment for carrying out the present invention, and the present invention is not limited to the contents of the embodiments described below.

That is, the method of extracting a gene-disease-compound relationship from large-scale biotechnology literature according to the present invention, as described below, relates to a gene-disease-compound relationship analysis technique that builds, from large-scale literature, a multidimensional index with a star schema structure for analyzing genes, diseases, and compounds, and uses that index to highlight the genes, diseases, and compounds included in search results.

In addition, the present invention can be applied, for example, to information retrieval systems usable across all fields of biotechnology, supporting various search services by adding genes recently discovered in relation to diseases of interest to biotechnologists, or by adding dimensions such as organism (Organism) and body part (Anatomy).

Next, with reference to FIGS. 1 to 10, the specific configuration of the method of extracting a gene-disease-compound relationship from large-scale biotechnology literature according to the present invention is described.

First, FIG. 1 illustrates the process of constructing a multidimensional index in the method of extracting a gene-disease-compound relationship from large-scale biotechnology literature according to the present invention.

That is, as shown in FIG. 1, the procedure for constructing a multidimensional index for disease-gene-compound from large-scale literature first imports the abstract files from the PubMed database, then separates each abstract into sentence units, and creates an index of the position of each gene, disease, or compound present in the sentences.

Here, in constructing each of the indexes, the synonym term dictionary is used to increase the search accuracy.

As described above, after the index is generated, the indexes and sentences are connected to each other so that the user can perform multidimensional analysis.

More specifically, the procedure for constructing a multidimensional index first extracts documents from the PubMed database, with the extraction condition that the Abstract field is not null (step 1).

Subsequently, the content of each abstract is divided into sentence units to form a sentence table through curation, which is stored, for example, in the order [pubmed id, sentence id, sentence] (step 2).

Next, an inverted index is constructed for the sentence table obtained in the above step (step 3).

Subsequently, the gene, disease, and compound synonym dictionaries are compared with the sentence inverted index to build each dimension index (step 4).
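
As a rough illustration of steps 1 to 4, the following Python sketch builds a sentence table, an inverted index, and one dimension index from toy data. The records, the synonym dictionaries, the regular-expression sentence splitter, and all function names are assumptions made for this example; the text does not specify how sentences are split or how dictionary terms are matched.

```python
import re
from collections import defaultdict

# Hypothetical toy rows standing in for PubMed records: (pubmed_id, abstract).
records = [
    (1001, "BRCA1 mutations raise breast cancer risk. Tamoxifen is used in treatment."),
    (1002, None),  # no abstract, dropped in step 1
]

# Hypothetical synonym dictionaries: lowercased surface form -> (entity ID, standard name).
gene_dict = {"brca1": ("G1", "BRCA1")}
disease_dict = {"breast cancer": ("D1", "Breast Neoplasms")}
compound_dict = {"tamoxifen": ("C1", "Tamoxifen")}


def build_sentence_table(rows):
    """Steps 1-2: keep non-null abstracts and split them into sentences."""
    table = []
    for pubmed_id, abstract in rows:
        if not abstract:
            continue
        # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
        for sentence_id, sentence in enumerate(sentences, start=1):
            table.append((pubmed_id, sentence_id, sentence))
    return table


def build_inverted_index(table):
    """Step 3: map each lowercased token to the sentences that contain it."""
    index = defaultdict(set)
    for pubmed_id, sentence_id, sentence in table:
        for token in re.findall(r"\w+", sentence.lower()):
            index[token].add((pubmed_id, sentence_id))
    return index


def build_dimension_index(table, inverted, dictionary):
    """Step 4: look up each synonym through the inverted index and record positions."""
    sentences = {(pid, sid): text for pid, sid, text in table}
    rows = []
    for synonym, (entity_id, name) in dictionary.items():
        tokens = re.findall(r"\w+", synonym)
        if not tokens:
            continue
        # Candidate sentences must contain every token of the synonym.
        candidates = set.intersection(*(inverted.get(t, set()) for t in tokens))
        pattern = re.compile(r"\b" + re.escape(synonym) + r"\b", re.IGNORECASE)
        for key in sorted(candidates):
            for match in pattern.finditer(sentences[key]):
                # Row layout: pubmed ID, sentence number, entity ID, name, start, end.
                rows.append((key[0], key[1], entity_id, name, match.start(), match.end()))
    return rows


sentence_table = build_sentence_table(records)
inverted_index = build_inverted_index(sentence_table)
disease_index = build_dimension_index(sentence_table, inverted_index, disease_dict)
```

Using the inverted index to narrow the candidate sentences before running the position match keeps the dictionary comparison from scanning every sentence, which is one plausible reading of why step 3 precedes step 4.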

Next, with reference to FIG. 2, the method of storing the index constructed as described above is explained.

FIG. 2 shows an index storage structure for extracting a disease-gene-compound relationship.

That is, a key feature of the present invention is a storage structure that allows sentences (Sentence) to be viewed from the perspective (dimension) of each of the disease, gene, and compound indexes, as shown in FIG. 2. Such a structure is known in technical terms as a 'Star Schema'.

More specifically, as shown in FIG. 2, the disease index, for example, stores the pubmed ID, sentence number, disease ID, disease name, start position, and end position, and is stored in association with the standard disease name and synonym information for the disease.

Similarly, the gene index stores the pubmed ID, sentence number, gene ID, gene name, start position, and end position, and is stored in association with the standard gene name and synonym information for the gene.

Likewise, the compound index stores the pubmed ID, sentence number, compound ID, compound name, start position, and end position, and is stored in association with the compound name and synonym information for the compound.

For other analysis dimensions, only the corresponding index information needs to be added in the same manner, so another analysis dimension can easily be added to establish a multidimensional analysis model.
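
One possible way to lay out this storage structure is sketched below with SQLite through Python's sqlite3 module. The table names, column names, and the `add_dimension` helper are assumptions derived from the fields listed in the text; the actual schema shown in FIG. 2 may differ.

```python
import sqlite3


def create_sentence_table(conn):
    """Central table of the star: one row per sentence."""
    conn.execute(
        """
        CREATE TABLE sentence (
            pubmed_id   INTEGER NOT NULL,
            sentence_id INTEGER NOT NULL,
            sentence    TEXT    NOT NULL,
            PRIMARY KEY (pubmed_id, sentence_id)
        )
        """
    )


def add_dimension(conn, name):
    """Add an index table and a synonym table for one analysis dimension."""
    if not name.isidentifier():
        raise ValueError(f"invalid dimension name: {name!r}")
    # Index table: where each entity of this dimension occurs in a sentence.
    conn.execute(
        f"""
        CREATE TABLE {name}_index (
            pubmed_id   INTEGER NOT NULL,
            sentence_id INTEGER NOT NULL,
            {name}_id   TEXT    NOT NULL,
            {name}_name TEXT    NOT NULL,
            start_pos   INTEGER NOT NULL,
            end_pos     INTEGER NOT NULL,
            FOREIGN KEY (pubmed_id, sentence_id)
                REFERENCES sentence (pubmed_id, sentence_id)
        )
        """
    )
    # Synonym table: standard name and synonyms associated with each entity ID.
    conn.execute(
        f"""
        CREATE TABLE {name}_synonym (
            {name}_id     TEXT NOT NULL,
            standard_name TEXT NOT NULL,
            synonym       TEXT NOT NULL
        )
        """
    )


conn = sqlite3.connect(":memory:")
create_sentence_table(conn)
for dimension in ("disease", "gene", "compound"):
    add_dimension(conn, dimension)

# Adding a further dimension leaves the existing tables untouched.
add_dimension(conn, "anatomy")

tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

Because every index table refers to the sentence table through the same (pubmed_id, sentence_id) pair, a new dimension only needs its own pair of tables.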

With this structure, examples of query types that can be processed include searching for sentences or abstracts that contain a desired keyword, and searching for sentences or abstracts that satisfy one or more conditions for each entity type.

In other words, the present invention dramatically increases search performance and accuracy by using the indexes instead of directly accessing all of the roughly 100 million sentences. Sentences can be analyzed in one dimension (gene, disease, compound), two dimensions (gene-disease, disease-gene, gene-compound, compound-gene, disease-compound, and compound-disease relationships), and three dimensions (gene-disease-compound relationship).
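
The difference between one-, two-, and three-dimensional analysis can be illustrated with joins over the index tables. The sketch below is an assumption-laden example: the simplified tables, the toy rows, and the function names are invented here, and co-occurrence in the same sentence is used as the criterion for a relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sentence (pubmed_id INTEGER, sentence_id INTEGER, sentence TEXT);
    CREATE TABLE gene_index (pubmed_id INTEGER, sentence_id INTEGER, gene_name TEXT);
    CREATE TABLE disease_index (pubmed_id INTEGER, sentence_id INTEGER, disease_name TEXT);
    CREATE TABLE compound_index (pubmed_id INTEGER, sentence_id INTEGER, compound_name TEXT);
    INSERT INTO sentence VALUES
        (1, 1, 'Tamoxifen lowers breast cancer risk in BRCA1 carriers.'),
        (1, 2, 'BRCA1 is linked to ovarian cancer.'),
        (2, 1, 'Tamoxifen is metabolized in the liver.');
    INSERT INTO gene_index VALUES (1, 1, 'BRCA1'), (1, 2, 'BRCA1');
    INSERT INTO disease_index VALUES (1, 1, 'breast cancer'), (1, 2, 'ovarian cancer');
    INSERT INTO compound_index VALUES (1, 1, 'Tamoxifen'), (2, 1, 'Tamoxifen');
    """
)


def sentences_for_gene(gene):
    """One dimension: sentences that mention the gene."""
    sql = """
        SELECT g.pubmed_id, g.sentence_id
        FROM gene_index g
        WHERE g.gene_name = ?
        ORDER BY g.pubmed_id, g.sentence_id
    """
    return conn.execute(sql, (gene,)).fetchall()


def diseases_for_gene(gene):
    """Two dimensions (X -> Y): diseases in the same sentence as the gene."""
    sql = """
        SELECT DISTINCT d.disease_name
        FROM gene_index g
        JOIN disease_index d
          ON d.pubmed_id = g.pubmed_id AND d.sentence_id = g.sentence_id
        WHERE g.gene_name = ?
        ORDER BY d.disease_name
    """
    return [row[0] for row in conn.execute(sql, (gene,))]


def compounds_for_gene_and_disease(gene, disease):
    """Three dimensions (X, Y -> Z): compounds co-occurring with both."""
    sql = """
        SELECT c.compound_name, s.sentence
        FROM gene_index g
        JOIN disease_index d
          ON d.pubmed_id = g.pubmed_id AND d.sentence_id = g.sentence_id
        JOIN compound_index c
          ON c.pubmed_id = g.pubmed_id AND c.sentence_id = g.sentence_id
        JOIN sentence s
          ON s.pubmed_id = g.pubmed_id AND s.sentence_id = g.sentence_id
        WHERE g.gene_name = ? AND d.disease_name = ?
    """
    return conn.execute(sql, (gene, disease)).fetchall()
```

Only the final step touches the sentence table; the filtering itself runs entirely on the index tables.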

In addition, this storage structure can provide a very flexible structure that can easily add another dimension of analysis.

Next, FIG. 3 shows a basic search result extracted by applying the multidimensional analysis structure described above, namely a screen showing sentence-level results of a keyword search.

That is, as shown in FIG. 3, when the user enters a search word, a color is applied to each gene, disease, and compound on the screen showing the search results, which not only provides a visual effect but also lets the user understand the content intuitively.

Here, the gene, disease, and compound information in each sentence is taken from the index.
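
Since each index row carries a start and end position, highlighting can be done without re-scanning the sentence. The following is a minimal sketch under stated assumptions: the `highlight` function, the HTML `span` markup, and the class names are hypothetical, end positions are treated as exclusive, and spans are assumed not to overlap.

```python
import html


def highlight(sentence, spans):
    """Wrap each (start, end, kind) span in a <span> whose class names the entity type."""
    parts = []
    cursor = 0
    for start, end, kind in sorted(spans):
        # Text between the previous span and this one is copied unchanged.
        parts.append(html.escape(sentence[cursor:start]))
        parts.append(f'<span class="{kind}">{html.escape(sentence[start:end])}</span>')
        cursor = end
    parts.append(html.escape(sentence[cursor:]))
    return "".join(parts)


sentence = "BRCA1 mutations raise breast cancer risk."
# Positions as they might be read from the gene and disease indexes.
spans = [(22, 35, "disease"), (0, 5, "gene")]
marked = highlight(sentence, spans)
```

A stylesheet could then assign one color per class, for example one for genes, one for diseases, and one for compounds.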

FIG. 4 shows a screen that provides the entire abstract for an extracted sentence.

That is, as shown in FIG. 4, when the user inputs a search word, the user can first grasp the content at the sentence level and then view the entire abstract, so that the abstract can be checked with the sentence as the focus.
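
Because the sentence table keeps the pubmed ID and sentence number of every sentence, the full abstract can be reassembled around a matching sentence. The sketch below assumes the [pubmed id, sentence id, sentence] layout described earlier; the function name and the `>>` and `<<` markers are hypothetical choices for this example.

```python
def abstract_with_hit(sentence_table, pubmed_id, hit_sentence_id):
    """Rebuild one abstract in sentence order and mark the sentence that matched."""
    rows = sorted(
        (sid, text) for pid, sid, text in sentence_table if pid == pubmed_id
    )
    return " ".join(
        f">>{text}<<" if sid == hit_sentence_id else text for sid, text in rows
    )


# Toy sentence table rows: (pubmed id, sentence id, sentence).
toy_table = [
    (1, 2, "Tamoxifen is used in treatment."),
    (1, 1, "BRCA1 mutations raise breast cancer risk."),
    (2, 1, "An unrelated abstract."),
]
view = abstract_with_hit(toy_table, 1, 2)
```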

FIG. 5 shows an input screen and a result screen for multidimensional analysis.

That is, as shown in FIG. 5, when a user inputs a search word for a compound, search words for related diseases are displayed through synonym processing, and when the user selects one of them, the search results and abstracts for the compound and the disease are displayed, allowing the user to easily perform compound-disease relationship analysis.

FIG. 6 shows the SQL structure for extracting the compound-disease relationship as shown in FIG. 5.

That is, the method is configured so that the necessary analysis can be performed immediately by accessing the index using SQL, as shown in FIG. 6, without creating a separate program for extracting gene-disease-compound relationships.
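
FIG. 6 is not reproduced in this text, so the following is only a guess at the kind of SQL that could serve this purpose. It resolves a compound search word through a synonym table and counts the index rows in which each disease shares a sentence with that compound. All table names, column names, and rows are assumptions, and the query is run through Python's sqlite3 module purely for demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE compound_synonym (compound_id TEXT, synonym TEXT);
    CREATE TABLE compound_index (pubmed_id INTEGER, sentence_id INTEGER, compound_id TEXT);
    CREATE TABLE disease_index (pubmed_id INTEGER, sentence_id INTEGER, disease_name TEXT);
    INSERT INTO compound_synonym VALUES
        ('C1', 'aspirin'), ('C1', 'acetylsalicylic acid');
    INSERT INTO compound_index VALUES (10, 1, 'C1'), (11, 2, 'C1'), (12, 1, 'C1');
    INSERT INTO disease_index VALUES
        (10, 1, 'stroke'), (11, 2, 'stroke'), (12, 1, 'gastric ulcer');
    """
)

# Synonym lookup -> compound occurrences -> diseases in the same sentence.
COMPOUND_DISEASE_SQL = """
    SELECT d.disease_name, COUNT(*) AS n_sentences
    FROM compound_synonym cs
    JOIN compound_index c
      ON c.compound_id = cs.compound_id
    JOIN disease_index d
      ON d.pubmed_id = c.pubmed_id AND d.sentence_id = c.sentence_id
    WHERE cs.synonym = ?
    GROUP BY d.disease_name
    ORDER BY n_sentences DESC, d.disease_name
"""


def diseases_for_compound(term):
    """Return (disease, co-occurrence count) pairs for a compound search word."""
    # Synonyms are stored lowercased in this toy data, so the term is lowercased too.
    return conn.execute(COMPOUND_DISEASE_SQL, (term.lower(),)).fetchall()
```

Any synonym of the compound leads to the same compound ID, so `diseases_for_compound("Aspirin")` and `diseases_for_compound("acetylsalicylic acid")` run over the same index rows.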

In other words, the features of the present invention configured as described above are as follows. First, as shown in FIGS. 3 and 4, it supports keyword-based logical search at the sentence level and provides screens for confirming the final results of multidimensional analysis. Second, through the multidimensional index structure shown in FIG. 2, ad hoc queries can be performed from each of the gene, disease, and compound viewpoints.

FIGS. 7 to 10 show practical applications of multidimensional analysis results using the method of the present invention as described above.

That is, as shown in FIGS. 7 to 10, the present invention enables various multidimensional analyses such as gene-compound relationship analysis, gene-gene relationship analysis, disease-gene-abstract analysis, and disease-gene analysis.

As described above, by supporting high-performance sentence-level logical search, the present invention solves the problem that sentence-level search is currently not supported in biotechnology literature search.

In addition, according to the present invention, the highlight function for gene, disease, and compound keywords in the search results improves the user's intuitive understanding, and a flexible, high-performance analysis service using the multidimensional indexes of genes, diseases, and compounds can be provided.

That is, the present invention can provide bio text mining services for various cases, for example: listing diseases associated with a specific gene, listing genes associated with a specific disease, retrieving abstracts that contain a specific disease and gene, listing genes that co-occur with a specific gene, listing compounds related to a specific gene, listing diseases related to a specific compound, listing diseases related to a specific body part, listing compounds related to a specific body part, and listing compounds related to a specific species.

The details of the method for extracting the gene-disease-compound relationship from large-scale biotechnology literature according to the present invention have been described above through embodiments of the present invention. However, the present invention is not limited thereto, and it is obvious that those skilled in the art may make various modifications, changes, combinations, and substitutions according to design needs and various other factors.

Claims

A method for extracting a gene-disease-compound relationship from large-scale biotechnology literature, the method comprising:
constructing a multidimensional index for diseases, genes, and compounds from the large-scale biotechnology literature;
storing the constructed multidimensional index according to a predetermined index storage structure; and
performing, when a user inputs a search word, multidimensional analysis of diseases, genes, and compounds from the large-scale biotechnology literature using the stored index.

The method of claim 1,
wherein constructing the multidimensional index comprises:
extracting only documents in which the Abstract field is not null from the PubMed database;
dividing the contents of each abstract into sentence units to form a sentence table through curation;
building an inverted index on the sentence table; and
comparing the synonym dictionaries for genes, diseases, and compounds with the inverted index to construct respective dimension indexes for genes, diseases, and compounds.

The method of claim 2,
wherein, in forming the sentence table,
the sentence table is stored in the order [pubmed id, sentence id, sentence].

The method of claim 1,
wherein the index storage structure is a star schema structure.

The method of claim 4,
wherein, in the index storage structure, the disease index stores the pubmed ID, sentence number, disease ID, disease name, start position, and end position, and is stored in association with the standard disease name and synonym information for the disease.

The method of claim 4,
wherein, in the index storage structure, the gene index stores the pubmed ID, sentence number, gene ID, gene name, start position, and end position, and is stored in association with the standard gene name and synonym information for the gene.

The method of claim 4,
wherein, in the index storage structure, the compound index stores the pubmed ID, sentence number, compound ID, compound name, start position, and end position, and is stored in association with the compound name and synonym information for the compound.

The method of claim 4,
wherein the index storage structure is configured so that a multidimensional analysis model can be established by adding index information for other analysis dimensions in addition to the disease index, the gene index, and the compound index.

The method of claim 1,
wherein the method is configured so that
sentences are analyzed in one dimension (gene, disease, compound), two dimensions (gene-disease, disease-gene, gene-compound, compound-gene, disease-compound, and compound-disease relationships), and three dimensions (gene-disease-compound relationship).

The method of claim 1,
wherein, when the user inputs a search word, a color or highlight is applied to each gene, disease, and compound on the screen displaying the search results, so as to provide a visual effect and allow the user to understand the content intuitively.

The method of claim 1,
wherein, when the user inputs a search word, the method is configured so that the user can grasp the contents at the sentence level and then view the entire abstract, thereby checking the abstract with the sentence as the focus.

The method of claim 1,
wherein, when the user inputs a search word, keywords related to the search word are displayed, and when one of the keywords is selected, the search results and abstracts corresponding to the search word and the keyword are displayed, so that the user can easily analyze the relationship between the search word and the keyword.

The method of claim 1,
wherein the method is configured so that the necessary analysis can be performed immediately by accessing the index using SQL, without writing a separate program for extracting gene-disease-compound relationships.