CN111653319A - Method for constructing biomedical heterogeneous information network by fusing multi-source data

Info

Publication number: CN111653319A
Application number: CN202010552554.8A
Authority: CN
Inventors: 段磊; 何承鑫; 王婷婷; 张译丹; 邓赓
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2020-06-17
Filing date: 2020-06-17
Publication date: 2020-09-11

Abstract

The invention discloses a method for constructing a biomedical heterogeneous information network by fusing multi-source data, thereby providing technical support for upper-layer analysis. The multi-source data comprise structured data and unstructured data. For structured data, the identifiers of each biomedical entity type in different databases are mapped to one another, so that data from different data sources are integrated and associated with each other. The resulting data model is more scalable and extensible and is convenient for data analysis; compared with a traditional database, it is extensible, visualizable, and comprehensive. For unstructured data, the relationships between biomedical entities in biomedical documents are obtained through entity recognition, sentence simplification, triple extraction, predicate mapping, and related processing, and the biomedical heterogeneous information network is finally constructed by combining the result with the structured data.

Description

Method for constructing biomedical heterogeneous information network by fusing multi-source data

Technical Field

The invention relates to the technical field of information, in particular to a method for constructing a biomedical heterogeneous information network by fusing multi-source data.

Background

Currently, on the structured-data side, bioinformatics studies life activities systematically as a complex biological process. As research develops from the micro to the macro level across genomics, transcriptomics, proteomics, and metabolomics, biomedical data keep expanding, and a large number of databases covering different types of biomedical entities, such as GenBank and OMIM, keep appearing. However, these databases have several limitations. First, most databases provide only a single type of biomedical entity or relationship. For example, the HUGO Gene Nomenclature Committee (HGNC) database stores only gene data, and the DisGeNET database focuses only on the relationship between genes and diseases. Second, most databases scale poorly and are inconvenient to update. Third, many databases are relational databases and do not provide good visualization. Overcoming these limitations would give experts in the biomedical field higher-quality assistance and thereby advance biomedical technology. In addition, the data in different databases vary widely, and there is no uniform standard to govern them. It is therefore necessary to manage them in a canonical form in order to better discover knowledge.

On the other hand, the biomedical literature also provides rich and useful information, but most of it is unstructured, which makes it difficult for researchers to obtain the desired information from it. Heterogeneous information networks, which provide structured relationships between entities, are a possible solution. However, unstructured data in the biomedical literature are generally heterogeneous, complex, and massive, and there is no clear standard to consult for extracting the correct information and representing it reasonably and formally.

Progress in science and technology promotes the development of the biomedical field, and the continuously growing mass of biomedical data gives scientists a comprehensive foundation for acquiring potential knowledge. Bridging the gap between the ability to generate large amounts of data and biological understanding first requires that the massive bioinformatic data be managed in a standardized manner, so that the data can be better analyzed and knowledge discovered; for example, generating and analyzing massive amounts of data helps to better elucidate the biological mechanisms of complex diseases. Therefore, a biomedical heterogeneous information network is constructed by fusing multi-source data, so that technical support is provided for upper-layer analysis.

Disclosure of Invention

Therefore, in order to remedy the above-mentioned deficiencies, the present invention provides a method for constructing a biomedical heterogeneous information network by fusing multi-source data, thereby providing technical support for upper-layer analysis.

The invention is realized as follows: a method for constructing a biomedical heterogeneous information network by fusing multi-source data, characterized in that:

the specific implementation steps are as follows:

(I) for structured data, the identifiers of each biomedical entity type in different databases are mapped, and data from different data sources are integrated and associated with each other, so that the data model is more scalable and extensible and directly convenient for data analysis; compared with a traditional database, the new data model is extensible, visualizable, and comprehensive; the specific implementation steps are as follows:

step 1, establishing an identification mapping of biomedical entities through related biological entity databases, namely collecting the identification numbers of each entity in the different databases;

step 2, collecting databases that record relationships among the biological entities, and integrating these relationships according to the identification mapping;

step 3, constructing the integrated data into a network;

(II) for unstructured data, a biomedical heterogeneous information network is constructed from the biomedical literature through entity recognition, sentence simplification, triple extraction, predicate mapping, and related processing; the specific steps are as follows:

step 1, sentence segmentation is carried out on a document;

step 2, performing part-of-speech tagging on the obtained sentences;

step 3, identifying the biomedical entity;

step 4, analyzing sentence dependence;

step 5, simplifying the sentence according to the dependency tree;

step 6, extracting triples after the sentences are simplified;

step 7, performing predicate mapping through context projection mapping;

step 8, forming nodes and edges in the network from the corrected triples.

In the above method for constructing a biomedical heterogeneous information network by fusing multi-source data, the structured data are specifically processed as follows:

step 1, first, an identification mapping of the biomedical entities is established, namely, the system numbers of an entity in each information database are integrated through the inherent identifier of the biomedical entity (for example, the symbol of a gene) to form an identification mapping table of the entity; in this step, the entity data are downloaded from the given related databases by a crawler, the corresponding id is returned by exact matching on a given field, duplicates are checked and screened out, and the results are finally integrated into a table in which one entity corresponds to several ids; some entities have synonyms, in which case the synonyms are compared one by one to complete the matching on the given field and return the id;

step 2, databases that record relationships between biological entities are collected, and the relationships are integrated according to the identification mapping; the identification mapping table obtained above gathers information only for single entities, so the relationships among multiple entities still need to be integrated; the relationships are therefore collected as pairwise relationships between two entities, the entities on both sides are aligned by means of the identification mapping table, and the semantic information of each relationship is stored;

step 3, the integrated data are constructed into a network; the operations of the above two steps are based on a graph database (Neo4j), which is well suited to the subsequent management, collection, and visualization of the data; the final network is constructed by storing the identification mapping table of the entities as node types and importing the relationship tables between biological entities as edge types.
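The storage of mapping-table rows as nodes and relationship-table rows as edges in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the helper names `node_statement` and `edge_statement`, the labels `Gene` and `Disease`, the property name `symbol`, and the relationship type are hypothetical, and the code only builds parameterized Cypher `MERGE` statements as strings instead of connecting to a Neo4j server.

```python
def node_statement(label, symbol, ids):
    """Build a Cypher MERGE for one row of the identification mapping table.

    label, symbol and ids are hypothetical inputs; ids maps a (normalized)
    database name to the entity's identifier in that database.
    """
    query = f"MERGE (n:{label} {{symbol: $symbol}}) SET n += $ids"
    return query, {"symbol": symbol, "ids": ids}


def edge_statement(src_label, dst_label, rel_type, src, dst):
    """Build a Cypher MERGE for one row of a pairwise relationship table."""
    query = (
        f"MATCH (a:{src_label} {{symbol: $src}}), (b:{dst_label} {{symbol: $dst}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
    return query, {"src": src, "dst": dst}


# Node for the AKT3 example; the property keys "hgnc" and "entrez" are assumptions.
query, params = node_statement("Gene", "AKT3", {"hgnc": "393", "entrez": "10000"})
print(query)
print(params)

# Edge to a placeholder disease node ("DiseaseX" is not a real entity).
edge_query, edge_params = edge_statement(
    "Gene", "Disease", "DISEASE_CAUSING_GENE", "AKT3", "DiseaseX"
)
print(edge_query)
```

Using `MERGE` rather than `CREATE` means that re-running the import should not duplicate nodes or edges. The returned query and parameter dictionary could then be handed to a Neo4j client, for example a driver session's `run(query, parameters)` call.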

In the above method for constructing a biomedical heterogeneous information network by fusing multi-source data, the unstructured data are specifically processed as follows:

step 1, sentence segmentation is carried out on a document; the document is first preprocessed and then divided into sentences, namely, split with given symbols as separators;

step 2, part-of-speech tagging is performed on the obtained sentences; for the segmented sentences, attention must be paid to the nouns appearing in them and to the predicates between the nouns, and it must further be judged whether the nouns are useful; each segmented sentence is therefore used as input, and a part-of-speech tag sequence is obtained with an off-the-shelf POS tagging tool;

step 3, the biomedical entities are identified; a corpus with category labels is generated by using PubTator for biomedical named entity recognition (BioNER); this method can identify five kinds of biomedical entities and provide their entity types, including genes/proteins, chemical substances, diseases, species, and SNPs, and each typed entity mention is replaced by its entity type;

step 4, sentence dependencies are parsed; dependency parsing is performed with the Python NLP library spaCy 2 to obtain the dependency relations among the words in the sentence; the parsing result is a tree structure that indicates a set of directed grammatical relations among the words in the sentence;

step 5, the sentence is simplified according to the dependency tree; the syntactic dependency tree is traversed from the root, and whenever a non-leaf node whose part of speech is a noun is encountered, the node and its subtree are cut out and assembled into a clause; the noun node is duplicated, remaining in the original tree as a leaf and serving in the subtree as the root; each tree can be written as a short sentence, so the original long sentence is divided hierarchically into shorter sentences;

step 6, triples are extracted after the sentences are simplified; for the resulting short sentences, a frequent pattern mining method is used to extract the patterns that appear frequently in the sentences, the patterns containing at least one entity mention are retained, and these patterns are matched against the entities and relations in the original sentences so as to extract candidate triples;

step 7, predicate mapping is performed through context projection mapping; because one meaning may be expressed by several different words in the relations of the triples, the predicates need to be mapped: predicates with the same semantics are mapped to the same predicate, which compresses the relation categories and reduces redundancy; conversely, the same predicate may have different meanings in different contexts, so the representations of both the predicate and its context in the triple are considered; the similarity between the predicate of an extracted triple and the predicates of the triples in the knowledge base is measured by a Bi-LSTM network, the most similar predicate in the knowledge base is found, and it replaces the predicate of the extracted triple, thereby compressing the relation categories;

step 8, the corrected triples form nodes and edges in the network; the finally obtained triples are corrected and merged into the network generated from the structured data, which completes the construction of the biomedical heterogeneous information network from both structured and unstructured data sources; in the correction process, the entities in the triples are matched according to the identification mapping table of the entities, the relationships are then screened for duplicates, and the remaining relationships are added to the existing graph database, which completes the data storage.

The invention has the following advantages: the mutual independence of biomedical databases and the singleness of their data types hinder comprehensive association analysis between biomedical entities and limit the extensibility of the data schema. To address these problems of traditional relational databases, the data are integrated and an extensible heterogeneous biomedical information network is constructed. In future work, the network can continue to scale, and existing validated nodes and relationships can be added to it. As the network expands, the heterogeneous biomedical information network can be analyzed further.

The invention covers both structured data and unstructured data. For structured data, data from different data sources are integrated and associated with each other by mapping the identifiers of each biomedical entity type in different databases, so that the data model is more scalable and extensible and directly convenient for data analysis. Compared with a traditional database, the new data model is extensible, visualizable, and comprehensive. For unstructured data, a biomedical heterogeneous information network is constructed from biomedical documents through entity recognition, sentence simplification, triple extraction, predicate mapping, and related processing.

Drawings

FIG. 1 is an example of an identification mapping table for structured data;

FIG. 2 is an example of biomedical entity recognition by PubTator;

FIG. 3 is a sentence grammar dependency parsing example;

FIG. 4 is a diagram of a Bi-LSTM network framework.

Detailed Description

The present invention will be described in detail with reference to fig. 1 to 4, and the technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a method for constructing a biomedical heterogeneous information network by fusing multi-source data. The data fall into two kinds, structured data and unstructured data, which are processed as follows:

(I) Structured data (represented by existing databases related to biomedical information):

In implementation, a biomedical heterogeneous information network is constructed for data management, which yields a comprehensive database. By mapping the identifiers of each biomedical entity type in different databases, data from different data sources are integrated and associated with each other, so that the data model is more scalable and extensible and directly convenient for data analysis; compared with a traditional database, the new data model is extensible, visualizable, and comprehensive. As shown in fig. 1, the specific implementation steps are as follows:

Step 1, establishing an identification mapping of the biomedical entities through related biological entity databases, namely collecting the identification numbers of each entity in the different databases. Because the numbers of the same biological entity in different databases may be inconsistent, and the name of the entity may be expressed in different ways, an identification mapping of the entity needs to be established. For example, for the entity 'gene' (symbol: AKT3), many databases organize its information: the HGNC database (id: 393), the Entrez Gene database (id: 10000), the Ensembl database (id: ENSG00000117020), the OMIM database (id: 611223), the UniProtKB database (id: Q9Y243), and other databases all store relevant information, and each uses a different coding system to identify genes. Other biological entities are in a similar situation. To solve this problem, an identification mapping of the biomedical entities is first established: the system numbers of an entity in each information database are integrated through the inherent identifier of the biological entity (for example, the symbol of a gene) to form an identification mapping table of the entity. In this step, the entity data are downloaded from the given related databases by a crawler, the corresponding id is returned by exact matching on a given field, duplicates are checked and screened out, and the results are integrated into a table in which one entity corresponds to several ids. Some entities have synonyms; in that case, the synonyms are compared one by one to complete the matching on the given field and return the id.
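The construction of the identification mapping table can be sketched as follows. This is a minimal illustration, assuming the crawled records have already been reduced to (database, name, identifier) tuples; the function name `build_id_mapping`, the record layout, and the use of 'PKBG' as a synonym of AKT3 are assumptions made for illustration, and the crawler itself is not shown.

```python
from collections import defaultdict


def build_id_mapping(records, synonyms=None):
    """Merge per-database records into one row of ids per entity symbol.

    records:  iterable of (database, name, identifier) tuples (assumed layout).
    synonyms: dict mapping a canonical symbol to a list of its synonyms.
    """
    synonyms = synonyms or {}
    # Reverse index so that each synonym resolves to its canonical symbol.
    alias_to_symbol = {}
    for symbol, aliases in synonyms.items():
        for alias in aliases:
            alias_to_symbol[alias] = symbol

    table = defaultdict(dict)
    for database, name, identifier in records:
        # Exact match on the name field first, then the synonym comparison.
        symbol = alias_to_symbol.get(name, name)
        # setdefault keeps the first id per database, dropping duplicates.
        table[symbol].setdefault(database, identifier)
    return dict(table)


records = [
    ("HGNC", "AKT3", "393"),
    ("Entrez Gene", "AKT3", "10000"),
    ("Ensembl", "AKT3", "ENSG00000117020"),
    ("OMIM", "PKBG", "611223"),       # listed under a synonym (illustrative)
    ("UniProtKB", "AKT3", "Q9Y243"),
    ("HGNC", "AKT3", "393"),          # duplicate record, screened out
]
mapping = build_id_mapping(records, synonyms={"AKT3": ["PKBG"]})
print(mapping["AKT3"])
```

Keying the table on the inherent identifier (the gene symbol) gives each entity a single row that can later serve as a node in the network.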

Step 2, collecting databases that record relationships among the biological entities, and integrating these relationships according to the identification mapping. The identification mapping table obtained above gathers information only for single entities, so the relationships among multiple entities still need to be integrated. The relationships are collected as pairwise relationships between two entities, the entities on both sides are aligned by means of the identification mapping table, and the semantic information of each relationship is stored. For example, the DisGeNET database stores the relationship between the entity 'gene' and the entity 'disease' (disease-causing gene); its gene identifiers come from the Entrez Gene database and its disease identifiers come from the UMLS database, so the two types of entities are linked through the entity mapping table and the relationship is given the semantics 'disease-causing gene'. Similarly, the HumanNet database stores associations between genes and identifies them by Entrez Gene numbers (ids), so relationships between genes can be established directly, and so on.
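The alignment of a pairwise relationship table with the identification mapping table can be sketched as follows. This is an illustration under assumptions: the function names, the dictionary layout of the mapping tables, the relationship label, and the disease entry ('DiseaseX' with the placeholder identifier 'C0000000') are all hypothetical, and rows whose identifiers are missing from the mapping table are simply skipped.

```python
def index_by_database(mapping_table, database):
    """Invert a mapping table: identifier in `database` -> canonical entity key."""
    return {
        ids[database]: key
        for key, ids in mapping_table.items()
        if database in ids
    }


def integrate_relations(rows, source_index, target_index, label):
    """Translate (source_id, target_id) rows into labeled, de-duplicated edges."""
    edges = set()
    for source_id, target_id in rows:
        source = source_index.get(source_id)
        target = target_index.get(target_id)
        if source is None or target is None:
            continue  # identifier not present in the mapping table
        edges.add((source, label, target))
    return sorted(edges)


gene_table = {"AKT3": {"Entrez Gene": "10000", "HGNC": "393"}}
disease_table = {"DiseaseX": {"UMLS": "C0000000"}}  # placeholder disease entry

# Rows as a gene-disease source might provide them: (Entrez Gene id, UMLS id).
rows = [
    ("10000", "C0000000"),
    ("10000", "C0000000"),        # repeated row
    ("unmapped-id", "C0000000"),  # gene id absent from the mapping table
]
edges = integrate_relations(
    rows,
    index_by_database(gene_table, "Entrez Gene"),
    index_by_database(disease_table, "UMLS"),
    "disease_causing_gene",
)
print(edges)
```

Because every edge endpoint is a canonical key from the mapping table, relationships collected from sources that use different identifier systems end up attached to the same nodes.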

Aggregation of resource (source) information of (partially) related databases

(II) Unstructured data (represented by literature related to biomedical information):

A heterogeneous biomedical information network is constructed from biomedical documents through entity recognition, sentence simplification, triple extraction, predicate mapping, and related processing. The specific steps are as follows:

Step 1, sentence segmentation is carried out on a document. Documents contain punctuation marks and redundant placeholders, so the document is first preprocessed and then divided into sentences (that is, at periods), namely, split with given symbols as separators.
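The preprocessing and separator-based splitting can be sketched as follows. This is a minimal sketch: the function name `split_sentences` and the default separator set are assumptions, and the simple rule also splits after abbreviations such as 'e.g.', which a production sentence splitter would have to handle.

```python
import re


def split_sentences(text, separators=".!?"):
    """Collapse redundant whitespace, then split after the given separators."""
    text = re.sub(r"\s+", " ", text).strip()
    # Split on whitespace that directly follows one of the separator symbols.
    pattern = r"(?<=[" + re.escape(separators) + r"])\s+"
    return [sentence for sentence in re.split(pattern, text) if sentence]


document = "AKT3 encodes a kinase.   Its ids differ\nacross databases. "
sentences = split_sentences(document)
print(sentences)
```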

Step 2, part-of-speech tagging is performed on the obtained sentences. For the segmented sentences, attention must be paid to the nouns appearing in them and to the predicates between the nouns, and it must be judged whether the nouns are useful. Each segmented sentence is therefore used as input, and a part-of-speech tag sequence is obtained with an existing POS tagging tool.

Step 3, the biomedical entities are identified. A corpus with category labels is generated by using PubTator for biomedical named entity recognition (BioNER). This method can identify five kinds of biomedical entities and provide their entity types, including genes/proteins, chemical substances, diseases, species, and SNPs, and each typed entity mention is replaced by its type. The recognition result of PubTator is shown in fig. 2, in which purple represents genes and orange represents diseases.

Step 4, sentence dependencies are parsed. Named entities in the biomedical literature may consist of one or more words and may fall outside the vocabulary of a dependency parser, which results in parsing errors. Therefore, after the entity mentions have been replaced by their types in step 3, dependency parsing is performed with the Python NLP library spaCy 2 to obtain the dependency relations among the words in the sentence. The parsing result is a tree structure indicating a set of directed grammatical relations between the words in the sentence, as shown by way of example in fig. 3.

Step 5, the sentence is simplified according to the dependency tree. Sentences in the biomedical literature can be long, and therefore the dependency paths between the entities of interest can also be long. This may lead to sparse and incomplete sentence patterns. Therefore, lengthy and complex sentences need to be simplified before pattern mining.

To decompose a sentence, its structure and English grammar must be understood. In linguistics, words can be divided into content words and function words. Content words, which include nouns and most verbs and adjectives, carry lexical meaning: they denote an object, an action, or a characteristic. Function words, by contrast, serve grammatical purposes. In the biomedical literature, the complexity of sentences is mainly due to the complexity of noun structures, in which a noun can be modified by other nouns, adjectives, adjective clauses, and so on. Therefore, the syntactic dependency tree is traversed from the root, and whenever a non-leaf node whose part of speech is a noun is encountered, the node and its subtree are cut out and assembled into a clause; the noun node is duplicated, remaining in the original tree as a leaf and serving in the subtree as the root. Each tree can be written as a short sentence, so the original long sentence is divided hierarchically into shorter sentences.
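The cutting rule can be sketched on a small hand-built dependency tree. This is an illustration under assumptions: the `Node` class, the function names, and the example sentence are hypothetical, the tree is written by hand rather than produced by a parser, and entity-type placeholders such as GENE are treated as nouns.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    index: int   # position of the word in the sentence
    text: str
    pos: str     # coarse part-of-speech tag, e.g. "NOUN" or "VERB"
    children: list = field(default_factory=list)


def _prune(node, pending, is_root=False):
    """Copy a tree, turning non-leaf noun nodes into leaves and queueing them."""
    if not is_root and node.pos == "NOUN" and node.children:
        pending.append(node)                          # subtree becomes its own clause
        return Node(node.index, node.text, node.pos)  # noun stays behind as a leaf
    return Node(node.index, node.text, node.pos,
                [_prune(child, pending) for child in node.children])


def simplify(root):
    """Split one dependency tree into a list of smaller trees."""
    trees, pending = [], [root]
    while pending:
        trees.append(_prune(pending.pop(0), pending, is_root=True))
    return trees


def linearize(tree):
    """Write a tree back as a short sentence in original word order."""
    words = []

    def visit(node):
        words.append((node.index, node.text))
        for child in node.children:
            visit(child)

    visit(tree)
    return " ".join(text for _, text in sorted(words))


# "mutations of GENE cause DISEASE", with the verb "cause" as the root.
gene = Node(2, "GENE", "NOUN")
of = Node(1, "of", "ADP", [gene])
mutations = Node(0, "mutations", "NOUN", [of])
disease = Node(4, "DISEASE", "NOUN")
root = Node(3, "cause", "VERB", [mutations, disease])

clauses = [linearize(tree) for tree in simplify(root)]
print(clauses)
```

The noun 'mutations' appears in both clauses, once as a leaf of the main tree and once as the root of the cut-out subtree, which mirrors the duplication described above.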

Step 6, triples are extracted after the sentences are simplified. For the resulting short sentences, a frequent pattern mining method is used to extract the patterns that appear frequently in the sentences, the patterns containing at least one entity mention are retained, and these patterns are matched against the entities and relations in the original sentences, so that candidate triples are extracted.
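One simplified way to realize this step is sketched below. The source does not name a specific mining algorithm, so this sketch swaps in plain support counting over contiguous token patterns and, as a further simplification, only considers patterns that start and end with an entity type. The function names, the `min_support` threshold, the token layout, and the placeholder entity names are all assumptions.

```python
from collections import Counter

ENTITY_TYPES = {"GENE", "DISEASE", "CHEMICAL", "SPECIES", "SNP"}


def candidate_spans(sentence):
    """Yield (pattern, subject, predicate, object) for adjacent entity pairs.

    sentence: list of (surface_text, entity_type_or_None) tokens.
    """
    positions = [i for i, (_, kind) in enumerate(sentence) if kind in ENTITY_TYPES]
    for left, right in zip(positions, positions[1:]):
        between = tuple(word.lower() for word, _ in sentence[left + 1:right])
        if not between:
            continue
        # Pattern: entity type, connecting words, entity type.
        pattern = (sentence[left][1],) + between + (sentence[right][1],)
        yield pattern, sentence[left][0], " ".join(between), sentence[right][0]


def extract_triples(sentences, min_support=2):
    """Keep patterns seen in at least `min_support` sentences, then match them."""
    support = Counter()
    for sentence in sentences:
        support.update({pattern for pattern, *_ in candidate_spans(sentence)})
    frequent = {pattern for pattern, count in support.items() if count >= min_support}

    triples = []
    for sentence in sentences:
        for pattern, subject, predicate, obj in candidate_spans(sentence):
            if pattern in frequent:
                triples.append((subject, predicate, obj))
    return triples


def tokens(subject, words, obj):
    """Helper for the example: a GENE mention, plain words, a DISEASE mention."""
    return [(subject, "GENE")] + [(w, None) for w in words.split()] + [(obj, "DISEASE")]


sentences = [
    tokens("GeneA", "is associated with", "DiseaseX"),   # placeholder entities
    tokens("GeneB", "is associated with", "DiseaseY"),
    tokens("GeneC", "rarely accompanies", "DiseaseZ"),   # pattern seen only once
]
triples = extract_triples(sentences)
print(triples)
```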

Step 7, predicate mapping is performed through context projection mapping. Because one meaning may be expressed by several different words in the relations of the triples, the predicates need to be mapped: predicates with the same semantics are mapped to the same predicate, which compresses the relation categories and reduces redundancy. Conversely, the same predicate may have different meanings in different contexts, so the representations of both the predicate and its context in the triple are considered. The similarity between the predicate of an extracted triple and the predicates of the triples in the knowledge base is measured by a Bi-LSTM network, the most similar predicate in the knowledge base is found, and it replaces the predicate of the extracted triple, thereby compressing the relation categories. The network framework is shown in fig. 4, where <St, Pt, Ot> represents an extracted triple and <Sdb, Pdb, Odb> represents a triple in the knowledge base.
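The replacement logic of this step (score an extracted triple against each knowledge-base triple, then adopt the most similar predicate) can be sketched as follows. Note the substitution: instead of the Bi-LSTM network, this sketch uses a bag-of-words cosine similarity over the predicate words plus the subject and object types as a stand-in encoder. The function names, the `threshold` parameter, and the example knowledge base are assumptions, and the knowledge base is assumed to be non-empty.

```python
import math
from collections import Counter


def bag_of_words(triple):
    """Stand-in encoder: predicate words plus subject/object types as context."""
    subject_type, predicate, object_type = triple
    return Counter([subject_type, object_type] + predicate.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def map_predicate(triple, knowledge_base, encode=bag_of_words, threshold=0.5):
    """Replace the predicate with the most similar knowledge-base predicate."""
    query = encode(triple)
    best = max(knowledge_base, key=lambda kb: cosine(query, encode(kb)))
    if cosine(query, encode(best)) < threshold:
        return triple  # nothing similar enough; keep the original predicate
    return (triple[0], best[1], triple[2])


knowledge_base = [
    ("GENE", "associated with", "DISEASE"),
    ("CHEMICAL", "treats", "DISEASE"),
]
mapped = map_predicate(("GENE", "is associated with", "DISEASE"), knowledge_base)
print(mapped)
```

The `encode` parameter marks the point where a learned encoder of the predicate and its context would plug in; only the nearest-predicate selection is shown here.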

Step 8, the corrected triples form nodes and edges in the network. The finally obtained triples are corrected and merged into the network generated from the structured data, which completes the construction of the biomedical heterogeneous information network from both structured and unstructured data sources. In the correction process, the entities in the triples are matched according to the identification mapping table of the entities, the relationships are then screened for duplicates, and the remaining relationships are added to the existing graph database, which completes the data storage.
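The correction and merge can be sketched with an in-memory graph held as a node set and an edge set. This is an illustration under assumptions: the function name `merge_triples`, the alias lookup dictionary, and the placeholder entities are hypothetical, triples whose entities cannot be matched are skipped (the source does not say how they are handled), and a real deployment would write to the graph database instead of Python sets.

```python
def merge_triples(nodes, edges, triples, alias_to_node):
    """Add extracted triples to an existing graph, skipping duplicates.

    nodes, edges:  sets describing the graph built from structured data.
    alias_to_node: lookup derived from the identification mapping table,
                   mapping symbols and synonyms to canonical node keys.
    Returns the list of edges that were actually added.
    """
    added = []
    for subject, predicate, obj in triples:
        source = alias_to_node.get(subject)
        target = alias_to_node.get(obj)
        if source is None or target is None:
            continue  # entity not found in the mapping table (assumed policy)
        edge = (source, predicate, target)
        if edge in edges:
            continue  # relationship already stored
        nodes.update((source, target))
        edges.add(edge)
        added.append(edge)
    return added


nodes = {"AKT3", "DiseaseX"}
edges = {("AKT3", "disease_causing_gene", "DiseaseX")}
alias_to_node = {
    "AKT3": "AKT3",
    "PKBG": "AKT3",          # synonym resolved to the same node (illustrative)
    "DiseaseX": "DiseaseX",  # placeholder diseases
    "DiseaseY": "DiseaseY",
}
triples = [
    ("PKBG", "disease_causing_gene", "DiseaseX"),  # duplicate after matching
    ("AKT3", "associated_with", "DiseaseY"),       # new edge and new node
    ("GeneQ", "associated_with", "DiseaseX"),      # unmatched entity
]
added = merge_triples(nodes, edges, triples, alias_to_node)
print(added)
```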

In conclusion, the invention discloses a method for constructing a biomedical heterogeneous information network by fusing multi-source data, thereby providing technical support for upper-layer analysis. The multi-source data comprise structured data and unstructured data. For structured data, data from different data sources are integrated and associated with each other by mapping the identifiers of each biomedical entity type in different databases, so that the data model is more scalable and extensible and directly convenient for data analysis; compared with a traditional database, the new data model is extensible, visualizable, and comprehensive. For unstructured data, a biomedical heterogeneous information network is constructed from biomedical documents through entity recognition, sentence simplification, triple extraction, predicate mapping, and related processing.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for constructing a biomedical heterogeneous information network by fusing multi-source data, characterized in that:

the specific implementation steps are as follows:

(I) for structured data, data from different data sources are integrated and associated with each other by mapping the identifiers of each biomedical entity type in different databases, so that the data model is more scalable and extensible and directly convenient for data analysis; compared with a traditional database, the new data model is extensible, visualizable, and comprehensive; the specific implementation steps are as follows:

step 1, establishing identification mapping of biomedical entities through related biomedical databases, namely establishing identification numbers of the entities in different databases;

step 3, constructing a heterogeneous information network by using the integrated data;

(II) for unstructured data, a biomedical heterogeneous information network is constructed from biomedical documents through entity recognition, sentence simplification, triple extraction, predicate mapping, and related processing; the specific steps are as follows:

step 1, sentence segmentation is carried out on a document;

step 2, performing part-of-speech tagging on the obtained sentences;

step 3, identifying the biomedical entity;

step 4, analyzing sentence dependence;

step 5, simplifying the sentence according to the dependency tree;

step 6, extracting triples after the sentences are simplified;

step 7, performing predicate mapping through context projection mapping;

step 8, forming nodes and edges in the network from the corrected triples.

2. The method for constructing a biomedical heterogeneous information network by fusing multi-source data according to claim 1, characterized in that the structured data are specifically processed as follows:

step 1, first, an identification mapping of the biomedical entities is established, namely, the system numbers of an entity in each information database are gathered through the inherent identifier of the biomedical entity (for example, the symbol of a gene) to generate an identification mapping table of the entity; data in the given related databases are downloaded automatically by web crawler technology, the corresponding id (identifier) is returned by exact matching on a given field, and after duplicate checking and screening the results are integrated into a table in which one entity corresponds to several ids; for entities with synonyms, the synonyms are compared one by one to complete the matching on the given field and return the id;

step 2, databases that record associations among the biological entities are collected, and the associations are integrated according to the identification mapping; the identification mapping table of the biological entities gathers information only for single entities, so the relationships among multiple entities still need to be integrated; the relationships are therefore collected as pairwise relationships between two entities, the entities on both sides are aligned by means of the identification mapping table, and the semantic information of each relationship is stored;

step 3, the integrated data are used to construct a network; the operations of the above two steps are implemented on a graph database (Neo4j), which is well suited to the subsequent management, collection, and visualization of the data; the identification mapping table of the entities is stored as node types, and the relationship tables between biological entities are stored as edge types, which completes the construction of the final network.

3. The method for constructing a biomedical heterogeneous information network by fusing multi-source data according to claim 1, characterized in that the unstructured data are specifically processed as follows:

step 1, sentence segmentation is carried out on a document; the document is first preprocessed and then divided into sentences, namely, split with given symbols as separators;

step 2, part-of-speech tagging is performed on the obtained sentences; the nouns appearing in the segmented sentences and the predicates between the corresponding nouns need to be extracted, and it is then judged whether the nouns and predicates are useful;

therefore, for each segmented sentence, a part-of-speech tag sequence is obtained with an off-the-shelf POS tagging tool;

step 3, the biomedical entities are identified; the identification method can identify five kinds of biomedical entities and provide their entity types, including genes, proteins, chemical substances, diseases, and species, and each identified entity is replaced by the entity type to which it belongs;

step 4, sentence dependencies are parsed; dependency parsing of the sentence is performed with the Python NLP library spaCy 2 to obtain the dependency relations among the words in the sentence; the parsing result is a tree structure that indicates a set of directed grammatical relations among the words in the sentence;

step 5, the sentence is simplified according to the dependency tree; the syntactic dependency tree is traversed from the root, and whenever a non-leaf node whose part of speech is a noun is encountered, the node and its subtree are cut out and assembled into a clause; the noun node is duplicated, remaining in the original tree as a leaf and serving in the subtree as the root; each tree can form a short sentence, so the original long sentence is divided hierarchically into several shorter sentences;

step 8, the corrected triples form nodes and edges in the network; the finally obtained triples are corrected and merged into the network generated from the structured data, which completes the construction of the biomedical heterogeneous information network from both structured and unstructured data sources; in the correction process, the entities in the triples are matched according to the identification mapping table of the entities, the relationships are then screened for duplicates, and the remaining relationships are added to the existing graph database, which completes the data storage.