CN115713078A

CN115713078A - Knowledge graph construction method and device, storage medium and electronic equipment

Info

Publication number: CN115713078A
Application number: CN202211338614.1A
Authority: CN
Inventors: 孙小婉; 蔡巍; 招一强; 张霞
Original assignee: Shenyang Neusoft Intelligent Medical Technology Research Institute Co Ltd
Current assignee: Shenyang Neusoft Intelligent Medical Technology Research Institute Co Ltd
Priority date: 2022-10-28
Filing date: 2022-10-28
Publication date: 2023-02-24

Abstract

The disclosure relates to a knowledge graph construction method, a knowledge graph construction device, a storage medium and electronic equipment, wherein the method comprises the following steps: obtaining hepatocellular carcinoma text data from different data sources, and determining a triple data set according to the hepatocellular carcinoma text data, wherein each triple in the triple data set comprises an entity pair and an entity relationship, and the entity relationship is used for representing the relationship between two entities in the entity pair; for each entity in the triple data set, determining a corresponding target entity in a preset corpus according to a preset selection rule, and replacing the corresponding entity in the triple data set with the target entity to obtain a target triple data set; and constructing a hepatocellular carcinoma pathology knowledge graph according to the target triple data set. The purpose of the disclosure is to solve the technical problem in the related art that multi-source heterogeneous data is not easy to search, by constructing a hepatocellular carcinoma pathology knowledge graph.

Description

Knowledge graph construction method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the technical field of medical data processing, and in particular, to a method and an apparatus for constructing a knowledge graph, a storage medium, and an electronic device.

Background

Hepatocellular carcinoma is one of the most common malignant tumors in adults, has a high mortality rate, and is the most common cause of death in patients with cirrhosis. Despite ongoing advances in prevention, diagnosis and treatment techniques, morbidity and mortality continue to rise. Hepatocellular carcinoma has therefore become a hot topic in life science research, and its diagnosis and treatment increasingly draw on medical knowledge from the open domain. Because open-domain medical knowledge is abundant and heterogeneous, healthcare professionals must traverse multiple data portals to retrieve the knowledge needed to diagnose hepatocellular carcinoma. Searching such multi-source heterogeneous data is inconvenient for healthcare professionals.

Disclosure of Invention

The invention aims to provide a knowledge graph construction method, a knowledge graph construction device, a storage medium and electronic equipment, which solve the technical problem that multi-source heterogeneous data is not easy to search in the related technology by constructing a hepatocellular carcinoma pathological knowledge graph.

The first aspect of the present disclosure provides a method for constructing a knowledge graph, where the method includes:

obtaining hepatocellular carcinoma text data from different data sources, and determining a triple data set according to the hepatocellular carcinoma text data, wherein each triple in the triple data set comprises an entity pair and an entity relationship, and the entity relationship is used for representing the relationship between two entities in the entity pair;

for each entity in the triple data set, determining a corresponding target entity in a preset corpus according to a preset selection rule, and replacing the corresponding entity in the triple data set with the target entity to obtain a target triple data set;

and constructing a hepatocellular carcinoma pathology knowledge graph according to the target triple data set.

Optionally, when the preset selection rule is to select a corpus that is most similar to the semantics of the entities as the target entity, the determining, for each entity in the triple data set, a corresponding target entity in the preset corpus according to the preset selection rule includes:

for each of the entities in the triple data set, performing the following:

determining the similarity between the entity and each corpus in the preset corpus;

determining candidate corpora according to the similarity and a preset similarity threshold, or sequencing the corpora according to the similarity to obtain a corpus sequence, and determining the candidate corpora according to the corpus sequence and a preset selection order;

and determining semantic similarity between the candidate corpus and the entity, and determining the target entity according to the semantic similarity.

Optionally, the determining the similarity between the entity and each corpus in the corpus comprises:

determining sparse vectors of the entities and sparse vectors of the corpus based on a statistical language model;

determining dense vectors of the entities and dense vectors of the corpora based on a language representation model;

determining a first similarity according to the sparse vector of the entity and the sparse vector of the corpus, and determining a second similarity according to the dense vector of the entity and the dense vector of the corpus;

and adding the first similarity and the second similarity to obtain the similarity.

Optionally, the training process of the language representation model includes:

determining dense vectors of sample entities and dense vectors of sample candidate corpora through the pre-trained language characterization model, and determining sparse vectors of the sample entities and sparse vectors of the sample candidate corpora through the statistical language model;

determining the similarity between the sample entity and each sample corpus candidate according to the dense vector of the sample entity, the sparse vector of the sample entity, the dense vector of the sample corpus candidate and the sparse vector of the sample corpus candidate;

determining a sample similarity sequence according to the similarity, and determining a sample label sequence according to the sample similarity sequence; wherein each sample label in the sample label sequence is used for indicating a standard category of each sample similarity in the sample similarity sequence, and the standard category comprises synonyms and/or hypernyms;

and determining a loss function value according to the sample similarity sequence and the sample label sequence, and updating the parameters of the pre-trained language characterization model according to the loss function value.

Optionally, when the preset corpus is a unified medical language library, the determining, for each entity in the triple data sets, a corresponding target entity in the preset corpus according to a preset selection rule includes:

for each of the entities in the triple data set, performing the following:

determining a similarity of the entity to each synonym in the unified medical language library;

determining candidate synonyms according to the similarity and a preset similarity threshold, or sequencing the synonyms according to the similarity, so as to obtain a synonym sequence, and determining the candidate synonyms according to the synonym sequence and a preset selection order;

determining semantic similarity between the candidate synonym and the entity, and determining a target synonym according to the semantic similarity;

and determining the concept name of the target synonym in the unified medical language library, and determining the concept name as the target entity.

Optionally, when the hepatocellular carcinoma text data is unstructured text data, the determining the triple data set according to the hepatocellular carcinoma text data includes:

carrying out entity identification on the hepatocellular carcinoma text data through an entity identification model to obtain the entity;

pairing the entities based on a preset pairing rule to obtain an entity pair;

carrying out relationship identification on the entity pair through a relationship identification model to obtain the entity relationship;

and determining the triple data set according to the entity pairs and the corresponding entity relations.

Optionally, the method further comprises:

storing the hepatocellular carcinoma pathology knowledge graph in the form of a resource description framework (RDF); and/or storing the hepatocellular carcinoma pathology knowledge graph in the form of a graph database.
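As a minimal sketch of the resource description framework option, each triple of the target triple data set could be serialized as one N-Triples statement. The namespace `http://example.org/hcc/`, the function name `to_ntriples` and the URI layout are assumptions made for illustration only; the disclosure does not specify them.

```python
from urllib.parse import quote

# Hypothetical namespace for illustration; not specified in the disclosure.
BASE = "http://example.org/hcc/"


def to_ntriples(triples):
    """Serialize (entity, entity, relation) triples as N-Triples statements."""
    lines = []
    for head, tail, relation in triples:
        # Percent-encode names so that spaces do not break the IRI syntax.
        subject = f"<{BASE}entity/{quote(head)}>"
        predicate = f"<{BASE}relation/{quote(relation)}>"
        obj = f"<{BASE}entity/{quote(tail)}>"
        lines.append(f"{subject} {predicate} {obj} .")
    return "\n".join(lines)


example = to_ntriples(
    [("hepatocellular carcinoma", "alpha-fetoprotein", "tumor marker test")]
)
print(example)
```

The resulting text can be loaded by any RDF store that accepts N-Triples; a graph database would instead take the two entities as nodes and the relation as an edge.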

A second aspect of the present disclosure provides a knowledge-graph constructing apparatus, the apparatus comprising:

an acquisition module, used for acquiring hepatocellular carcinoma text data from different data sources and determining a triple data set according to the hepatocellular carcinoma text data, wherein each triple in the triple data set comprises an entity pair and an entity relationship, and the entity relationship is used for representing the relationship between two entities in the entity pair;

the processing module is used for determining a corresponding target entity in a preset corpus according to a preset selection rule aiming at each entity in the triple data set, and replacing the corresponding entity in the triple data set with the target entity to obtain a target triple data set;

and the construction module is used for constructing the hepatocellular carcinoma pathology knowledge graph according to the target triple data set.

A third aspect of the disclosure provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the first aspects.

A fourth aspect of the present disclosure provides an electronic device, comprising:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to perform the steps of the method of any of the first aspects.

Through the technical scheme, on one hand, medical knowledge about hepatocellular carcinoma in different data sources can be connected together, so that pathological knowledge of hepatocellular carcinoma can be inquired conveniently. On the other hand, entities with various expression modes can be mapped to target entities in the preset corpus through the preset selection rules, and the entities in the extracted data can be subjected to standard normalized mapping, so that the problem of large-scale information redundancy among different data sources is solved, the data accuracy in the hepatocellular carcinoma pathology knowledge graph is further ensured, and the accuracy of information query based on the knowledge graph is improved.

Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

FIG. 1 is a flow diagram illustrating a method of knowledge-graph construction according to an exemplary embodiment of the present disclosure;

FIG. 2 is a schematic diagram illustrating a tag sequence acquisition according to an exemplary embodiment of the present disclosure;

FIG. 3 is a schematic diagram illustrating candidate synonym acquisition according to an exemplary embodiment of the present disclosure;

FIG. 4 is a block diagram illustrating an architecture of a knowledge-graph building apparatus according to an exemplary embodiment of the present disclosure;

FIG. 5 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.

First, an application scenario of the present disclosure will be explained. Hepatocellular carcinoma is one of the most common malignant tumors in adults, has a high mortality rate, and is the most common cause of death in patients with cirrhosis. Despite advances in prevention, diagnosis and treatment technologies, morbidity and mortality are on the rise. Hepatocellular carcinoma has therefore become a hot topic in life science research, and its diagnosis and treatment increasingly draw on medical knowledge from the open domain. Because open-domain medical knowledge is abundant and complex, healthcare professionals must traverse multiple data portals to retrieve the knowledge needed to diagnose hepatocellular carcinoma. Searching such multi-source heterogeneous data is inconvenient for healthcare professionals. Since the concept of the knowledge graph was introduced, more and more fields have begun to represent unstructured, semi-structured and structured information from the internet and databases as knowledge graphs, but there is no open knowledge graph on hepatocellular carcinoma pathology in the related art.

In view of this, the present disclosure provides a method, an apparatus, a storage medium, and an electronic device for constructing a knowledge graph, which solve the technical problem in the related art that it is not easy to search multi-source heterogeneous data by constructing a hepatocellular carcinoma pathology knowledge graph.

The following detailed description of the embodiments of the disclosure refers to the accompanying drawings.

A method of knowledge-graph construction, as shown in FIG. 1, may include the following steps:

S1: obtaining hepatocellular carcinoma text data from different data sources, and determining a triple data set according to the hepatocellular carcinoma text data, wherein each triple in the triple data set comprises an entity pair and an entity relationship, and the entity relationship is used for representing the relationship between two entities in the entity pair.

The knowledge graph, known in the library and information science community as knowledge domain visualization or knowledge domain mapping, is a series of different graphs that display the relationship between the development process and the structure of knowledge. It describes knowledge resources and their carriers using visualization technology, and mines, analyzes, constructs, draws and displays knowledge and the interrelations among them. In plain terms, a relationship network is obtained by connecting all kinds of information together. Before the knowledge graph is constructed, a person skilled in the relevant field needs to manually design the schema of the knowledge graph according to the application scenario, and determine the entities used to construct the knowledge graph and the entity relationships that indicate the association between two entities. Since the knowledge graph constructed by the embodiment of the disclosure is a hepatocellular carcinoma pathology knowledge graph, when an entity is extracted from hepatocellular carcinoma text data, the extracted entity should be an entity related to hepatocellular carcinoma, for example, a disease entity, a human body morphological structure entity, a detection item entity and/or a histological grading entity; the entity relationship may include a hypernym, a hyponym, a disease occurrence site, a disease examination means and/or an attribute, and the like.

It should be understood that the different data sources refer to medical domain databases or search platforms from which hepatocellular carcinoma text data can be obtained, such as the clinical decision support system UpToDate, the biomedical literature database MEDLINE, the PubMed search platform, the National Center for Biotechnology Information (NCBI), and the clinical research database ClinicalTrials.gov. For example, when acquiring hepatocellular carcinoma text data, documents related to hepatocellular carcinoma pathology may be retrieved and downloaded using the PubMed platform. PubMed is an online repository containing over 24 million citations from MEDLINE and life science journals. MEDLINE is an international comprehensive biomedical bibliographic database created by the U.S. National Library of Medicine, and is the most commonly used foreign bibliographic abstract database in the biomedical field. Medical guidelines for hepatocellular carcinoma may be downloaded using UpToDate. UpToDate is a clinical decision support system based on the principles of evidence-based medicine; it is a main resource for doctors to acquire medical knowledge during diagnosis and treatment, and provides them with continuously updated information. ClinicalTrials.gov can also be used to obtain clinical trials for hepatocellular carcinoma. ClinicalTrials.gov is a resource offered by the U.S. National Library of Medicine, a global database of clinical studies, including 299634 studies from 50 countries and 208 cities. Biomedical knowledge related to hepatocellular carcinoma can also be acquired using NCBI. NCBI publishes various biological databases, for example, the nucleic acid sequence database GenBank and the biological project database BioProject, and also provides tools for retrieving and analyzing data, by which the corresponding hepatocellular carcinoma text data can be extracted from NCBI.

After the hepatocellular carcinoma text data is acquired, corresponding triples need to be extracted from it in order to construct the hepatocellular carcinoma pathology knowledge graph. The acquired hepatocellular carcinoma text data originates from different databases, and the data structures of these databases may not be consistent; for example, data from NCBI is structured, whereas data from sources such as the PubMed search platform and UpToDate is unstructured. Therefore, different triple extraction methods are needed for hepatocellular carcinoma text data with different data structures.

Illustratively, when the hepatocellular carcinoma text data is structured data, a search tool provided by the database being mined may be used to extract corresponding entities from the hepatocellular carcinoma text data, and then a triple data set may be constructed according to the relationships between the entities. For example, the extraction of entities is performed using the SemMedDB database. The SemMedDB database is a relational database that provides a search function. When extracting entities, a search can be performed with hepatocellular carcinoma as the keyword to obtain search data, and a triple data set is then constructed according to hepatocellular carcinoma and the search data. For example, the retrieved data is shown in Table 1:

TABLE 1 Search data obtained by searching with hepatocellular carcinoma as the keyword

The triples organized according to hepatocellular carcinoma and the search data described above may be:

Triple 1: (hepatocellular carcinoma, carcinoembryonic antigen, tumor marker test);

Triple 2: (hepatocellular carcinoma, alpha-fetoprotein, tumor marker test);

Triple 3: (hepatocellular carcinoma, abdominal MRI, diagnostic examination);

Triple 4: (hepatocellular carcinoma, microscopic examination, pathological examination);

Triple 5: (microscopic examination, high differentiation, histological grading);

Triple 6: (microscopic examination, MO, microvascular invasion).
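The triples listed above can be held in a simple data structure. The following is a minimal sketch, assuming each row of the search data supplies one entity and one relation for the search keyword; the names `Triple` and `build_triples` and the sample rows are hypothetical and only mirror the first four triples above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # first entity of the entity pair
    tail: str      # second entity of the entity pair
    relation: str  # relationship between the two entities


def build_triples(keyword, rows):
    """Pair the search keyword with each (entity, relation) row of search data."""
    return [Triple(keyword, entity, relation) for entity, relation in rows]


# Hypothetical search rows, mirroring the first four triples listed above.
rows = [
    ("carcinoembryonic antigen", "tumor marker test"),
    ("alpha-fetoprotein", "tumor marker test"),
    ("abdominal MRI", "diagnostic examination"),
    ("microscopic examination", "pathological examination"),
]
triples = build_triples("hepatocellular carcinoma", rows)
for triple in triples:
    print(tuple(triple))
```

Triples whose first entity is not the search keyword, such as the last two above, would be built the same way from the rows retrieved for that entity.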

When the hepatocellular carcinoma text data is unstructured data, an information extraction system can be used for extracting entities and entity relations from the hepatocellular carcinoma text data, so that corresponding triples are obtained. For example, entities and entity relationships in hepatocellular carcinoma text data are extracted by a SemRep information extraction system to obtain corresponding triples.

It should be understood that the SemRep information extraction system is a program designed based on the Unified Medical Language System (UMLS). Although it can directly extract the triples in a text, it cannot identify entities that are not in UMLS. Therefore, in order to extract triples from the remaining databases with unstructured data, the entities and entity relationships in the hepatocellular carcinoma text data can be extracted through a neural network model, and a triple data set is then constructed according to the extracted entities and entity relationships. That is, according to one embodiment of the present disclosure, when the hepatocellular carcinoma text data is unstructured text data, determining the triple data set according to the hepatocellular carcinoma text data may include:

carrying out entity recognition on the hepatocellular carcinoma text data through an entity recognition model to obtain an entity; pairing the entities based on a preset pairing rule to obtain an entity pair; carrying out relationship identification on the entity pair through a relationship identification model to obtain an entity relationship; and determining the triple data set according to the entity pairs and the corresponding entity relations.

It should be understood that the entity recognition model may be implemented based on existing neural network model structures. For example, a long short-term memory (LSTM) network combined with a conditional random field (CRF) may be used to identify entities in the hepatocellular carcinoma text data. Specifically, an LSTM + CRF model is trained with sample hepatocellular carcinoma text data as input and all entities contained in the sample text data as output, and the entities are then identified from the hepatocellular carcinoma text data through the trained LSTM + CRF model.

Similarly, the relationship identification model may also be implemented based on an existing neural network model structure, for example, a bidirectional long-short term memory model may be used to identify the relationship between the entity pair to obtain the entity relationship. Specifically, a bidirectional long-short term memory model is trained by taking a sample entity pair as input and a sample entity relationship as output in advance, and then the entity relationship is identified from the entity pair through the trained bidirectional long-short term memory model.

It should be understood that this is merely an illustrative example, and other neural network models may be selected as the entity recognition model and the relationship recognition model; the embodiment of the present disclosure does not limit this.

In addition, it should be understood that the preset pairing rule may be set reasonably according to the specific situation of the practical application, and the embodiment of the present disclosure does not limit this. In a possible embodiment, the preset pairing rule may be that the entities identified by the entity recognition model are paired two by two to obtain entity pairs. In other possible embodiments, the preset pairing rule may also be that the entities identified by the entity recognition model are combined randomly to obtain entity pairs.
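The first pairing rule described above can be sketched as follows. This is an illustration only, assuming that the pairing enumerates every unordered pair of distinct recognized entities; the function name `pair_entities` is hypothetical.

```python
from itertools import combinations


def pair_entities(entities):
    """Return every unordered pair of distinct entities, in first-seen order."""
    # dict.fromkeys removes repeated mentions while preserving order.
    unique = list(dict.fromkeys(entities))
    return list(combinations(unique, 2))


recognized = [
    "hepatocellular carcinoma",
    "alpha-fetoprotein",
    "abdominal MRI",
    "alpha-fetoprotein",  # repeated mention of the same entity
]
pairs = pair_entities(recognized)
for pair in pairs:
    print(pair)
```

Each resulting pair would then be passed to the relationship recognition model, which decides whether a relationship holds between the two entities and, if so, which one.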

S2: for each entity in the triple data set, determining a corresponding target entity in a preset corpus according to a preset selection rule, and replacing the corresponding entity in the triple data set with the target entity to obtain a target triple data set.

According to the foregoing step S1, the entities in the triple data set originate from different databases, and the same entity may be expressed differently in different databases. For example, the disease hepatocellular carcinoma is expressed as "HCC" in some databases and as "hepatocellular carcinoma" in others. Therefore, in triple data sets extracted from different data sources, the same entity easily corresponds to multiple expressions, or the same expression corresponds to multiple entities. In view of this, in order to ensure the data accuracy of the hepatocellular carcinoma pathology knowledge graph, a normalized mapping operation also needs to be performed on the entities in the triple data set. That is, through a preset selection rule, entities with multiple expressions are mapped to the same expression, or expressions with multiple entities are mapped to the same entity.

The preset selection rule can be set reasonably according to the specific situation of the practical application, and the embodiment of the disclosure does not limit it. In a possible embodiment, the preset selection rule may be to select the expression with the largest number of occurrences as the final expression. That is, for each entity, the expression that appears most frequently is selected as the target entity. For example, "HCC" and "hepatocellular carcinoma" both denote the disease hepatocellular carcinoma; when "HCC" occurs 264 times in the triple data set and "hepatocellular carcinoma" occurs 538 times, "hepatocellular carcinoma" is selected to denote the disease.
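The frequency-based rule above can be illustrated with the counts given in the example. The sketch assumes that the set of expressions referring to the same entity is already known; the function name `most_frequent_expression` is hypothetical.

```python
from collections import Counter


def most_frequent_expression(mentions, variants):
    """Pick the variant that occurs most often among the entity mentions."""
    counts = Counter(m for m in mentions if m in variants)
    if not counts:
        raise ValueError("none of the variants occurs in the mentions")
    return counts.most_common(1)[0][0]


# Occurrence counts taken from the example above.
mentions = ["HCC"] * 264 + ["hepatocellular carcinoma"] * 538
variants = {"HCC", "hepatocellular carcinoma"}
target = most_frequent_expression(mentions, variants)

# Replace every variant in a triple with the selected target entity.
triple = ("HCC", "alpha-fetoprotein", "tumor marker test")
normalized = tuple(target if part in variants else part for part in triple)
print(target, normalized)
```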

In other possible embodiments, a corpus may be constructed in advance; the corpus includes a plurality of pre-selected corpus entries (corpora), and each corpus entry corresponds to one and only one entity. Then, according to the semantics of each entity in the triple data set, the corpus entry with the highest semantic similarity is screened out from the corpus as the target entity. That is, according to an embodiment of the present disclosure, when the preset selection rule is to select the corpus entry that is semantically most similar to the entity as the target entity, for each entity in the triple data set, determining the corresponding target entity in the preset corpus according to the preset selection rule includes:

for each entity in the triple data set, performing the following:

determining the similarity between the entity and each corpus in the preset corpus; determining candidate corpora according to the similarity and a preset similarity threshold, or sorting the corpora by similarity to obtain a corpus sequence, and determining the candidate corpora according to the corpus sequence and a preset selection order; and determining semantic similarity between the candidate corpus and the entity, and determining a target entity according to the semantic similarity.

It should be understood that the higher the text similarity, the closer the corresponding text semantics, and that determining text semantics is more complicated and time-consuming than determining text similarity. Therefore, the similarity between the entity and each corpus is calculated first, and the candidate corpora with higher similarity are screened out from the corpus according to the similarity; semantic recognition is then performed only on the candidate corpora to obtain the target entity. Because the number of candidate corpora is far smaller than the number of corpora, the time for calculating text semantics is greatly reduced and the selection efficiency is improved.
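The two-stage selection described above can be sketched as follows. The two scoring functions are stand-ins chosen only so that the example runs: a character-bigram Jaccard score plays the role of the cheap similarity, and `difflib.SequenceMatcher` plays the role of the costlier semantic similarity. Neither is the model used in the disclosure, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def bigram_jaccard(a, b):
    """Cheap stand-in similarity: Jaccard overlap of character bigrams."""
    grams_a = {a.lower()[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b.lower()[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def stand_in_semantic(a, b):
    """Stand-in for the costlier semantic similarity model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_candidates(entity, corpus, similarity, threshold=None, top_k=None):
    """Stage 1: keep entries above a threshold, or the top_k most similar."""
    scored = [(similarity(entity, entry), entry) for entry in corpus]
    if threshold is not None:
        return [entry for score, entry in scored if score >= threshold]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in ranked[:top_k]]


def choose_target(entity, candidates, semantic_similarity):
    """Stage 2: run the costlier comparison on the candidates only."""
    return max(candidates, key=lambda entry: semantic_similarity(entity, entry))


corpus = [
    "hepatocellular carcinoma",
    "hepatocellular adenoma",
    "renal cell carcinoma",
    "alpha-fetoprotein",
]
entity = "hepatocellular carcinomas"
candidates = select_candidates(entity, corpus, bigram_jaccard, top_k=2)
target = choose_target(entity, candidates, stand_in_semantic)
print(candidates, target)
```

Passing `threshold=0.5` instead of `top_k=2` selects the candidates by the preset similarity threshold rather than by the preset selection order.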

In addition, it should be noted that the manner of selecting the target entity from the corpus candidates is not unique, and in a possible implementation manner, one corpus from the corpus candidates may be manually selected as the target entity.

In addition, it should be noted that finding data and screening corpora to construct a corpus takes considerable time and effort, which greatly increases the difficulty and duration of knowledge graph construction. Therefore, in a possible embodiment, an existing database in the related art may be used as the preset corpus, for example, UMLS. UMLS is a thesaurus, the largest collection of biomedical dictionaries, containing 2.9 million and 11.4 million entity names and synonyms, respectively. As shown in FIG. 2, UMLS includes a plurality of concepts, each concept corresponding to a box and/or a dot below the box in FIG. 2; each concept has an identifier ID, corresponding to the concept unique identifier in FIG. 2: xxxxxxx; and each concept has N synonyms, corresponding to synonym 1, synonym 2 and synonym 3 in the box of FIG. 2; one of the N synonyms is used as the preferred term to represent the concept, corresponding to synonym 1 in the box of FIG. 2. Because one concept in UMLS includes multiple synonyms, selecting different synonyms under the same concept as target entities would again produce different expressions for the same entity. To avoid this, when UMLS is used as the preset corpus, after the target synonym is determined, the concept to which the target synonym belongs needs to be determined, and the target entity is represented by that concept, that is, by the preferred term of that concept. That is, according to an embodiment of the present disclosure, when the preset corpus is a unified medical language library (i.e., UMLS) and the preset selection rule is to select, from the preset corpus, the corpus with the highest semantic similarity to an entity as the target entity, for each entity in the triple data set, determining a corresponding target entity in the preset corpus according to the preset selection rule includes:

As shown in FIG. 3, for each entity in the triple data set, the following operations are performed:

determining the similarity between the entity and each synonym in the unified medical language library; determining candidate synonyms according to the similarity and a preset similarity threshold, or sequencing the synonyms according to the similarity, obtaining a synonym sequence, and determining the candidate synonyms according to the synonym sequence and a preset selection order; determining semantic similarity between the candidate synonym and the entity, and determining a target synonym according to the semantic similarity; and determining the concept name of the target synonym in the unified medical language library, and determining the concept name as the target entity.
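The operations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny synonym table, the concept identifiers, the function names, and the character-bigram Jaccard score (used here as a stand-in for both the similarity and the semantic similarity) are hypothetical and are not taken from the disclosure or from UMLS itself.

```python
from typing import Optional

# Hypothetical miniature of a UMLS-like library: synonym -> concept ID,
# and concept ID -> preferred term. IDs and entries are made up.
SYNONYM_TO_CONCEPT = {
    "hepatocellular carcinoma": "C0000001",
    "liver cell carcinoma": "C0000001",
    "HCC": "C0000001",
    "hepatitis B": "C0000002",
    "hepatitis B virus infection": "C0000002",
}
PREFERRED_TERM = {
    "C0000001": "hepatocellular carcinoma",
    "C0000002": "hepatitis B",
}


def char_bigrams(text: str) -> set:
    """Lower-cased character bigrams of a string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def toy_similarity(a: str, b: str) -> float:
    """Stand-in score (Jaccard over character bigrams), NOT the patent's model."""
    ga, gb = char_bigrams(a), char_bigrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def normalize_entity(entity: str,
                     threshold: Optional[float] = None,
                     top_k: int = 3) -> Optional[str]:
    """Map an entity to the preferred term of its best-matching concept."""
    scored = [(syn, toy_similarity(entity, syn)) for syn in SYNONYM_TO_CONCEPT]
    if threshold is not None:
        # Option 1: keep synonyms whose similarity reaches the threshold.
        candidates = [syn for syn, score in scored if score >= threshold]
    else:
        # Option 2: sort by similarity and keep the first top_k synonyms.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        candidates = [syn for syn, _ in scored[:top_k]]
    if not candidates:
        return None
    # Re-score the candidates (stand-in for the semantic similarity step).
    target_synonym = max(candidates, key=lambda syn: toy_similarity(entity, syn))
    # The target entity is the concept's preferred term, not the synonym.
    return PREFERRED_TERM[SYNONYM_TO_CONCEPT[target_synonym]]
```

With this toy table, the surface form "liver cell carcinomas" matches the synonym "liver cell carcinoma" and is therefore normalized to the preferred term of the same concept.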

It should be understood that the knowledge graph constructed in this embodiment is a hepatocellular carcinoma pathology knowledge graph. Thus, in order to reduce the amount of computation and improve the efficiency of constructing the knowledge graph, when determining the similarity between an entity and each synonym in the unified medical language library, the similarity may be calculated only between the entity and the synonyms in the eight broad categories of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), genes, proteins, cells, diseases, phenotypic abnormalities and treatment techniques.

Determining the similarity between the entity and each corpus in the corpus may include:

determining sparse vectors of entities and sparse vectors of corpora based on a statistical language model; determining dense vectors of entities and dense vectors of corpora based on the language characterization model; determining a first similarity according to the sparse vector of the entity and the sparse vector of the corpus, and determining a second similarity according to the dense vector of the entity and the dense vector of the corpus; and adding the first similarity and the second similarity to obtain the similarity.

It should be understood that the statistical language model may be implemented based on existing models. For example, a character-level n-gram language model may be used to obtain the sparse vectors of entities. The basic idea of the character-level n-gram model is to compute the probability of a character from the characters that precede it, for example the probability of the third character from the first two characters (i.e., sparse vectors). For ease of understanding, the following description takes n = 2 as an example:

To calculate the probability of the current character $w_i$, one looks at what the preceding character is. That is, the probability $P(w_i \mid w_{i-1})$ of $w_i$ is calculated from its preceding character $w_{i-1}$, and can be expressed as $P(w_i \mid w_{i-1}) = P(w_i, w_{i-1}) / P(w_{i-1})$, where $P(w_i, w_{i-1})$ is the joint probability of $w_i$ and $w_{i-1}$, i.e. the frequency with which the two characters occur together, obtained by counting the number of times the two occur together and dividing by the total number of occurrences of all characters; and $P(w_{i-1})$ is the probability of $w_{i-1}$, i.e. the frequency of that character, obtained by counting the number of times it occurs and dividing by the total number of occurrences of all characters. Once the two probabilities $P(w_i, w_{i-1})$ and $P(w_{i-1})$ are obtained, the probability $P(w_i \mid w_{i-1})$ of $w_i$ follows. When the probabilities of all characters forming a word or term have been obtained, the probability of the word or term is obtained by averaging the probabilities of its characters.
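The n = 2 calculation described above can be illustrated with a short Python sketch. The function names and the toy strings are hypothetical; the counting scheme (pair count and character count, each divided by the total number of character occurrences) follows the description, and how the resulting probabilities are arranged into a sparse vector is not shown here.

```python
from collections import Counter


def build_bigram_model(texts):
    """Count characters and adjacent character pairs over all texts."""
    char_counts = Counter()
    pair_counts = Counter()
    for text in texts:
        char_counts.update(text)
        pair_counts.update(zip(text, text[1:]))
    total = sum(char_counts.values())  # total occurrences of all characters

    def cond_prob(prev, cur):
        """P(cur | prev) = P(cur, prev) / P(prev)."""
        if total == 0 or char_counts[prev] == 0:
            return 0.0
        joint = pair_counts[(prev, cur)] / total
        prior = char_counts[prev] / total
        return joint / prior

    return cond_prob


def term_probability(term, cond_prob):
    """Average of the conditional probabilities of the term's characters."""
    probs = [cond_prob(prev, cur) for prev, cur in zip(term, term[1:])]
    return sum(probs) / len(probs) if probs else 0.0


# Toy usage: the "corpus" is just two short strings.
cond_prob = build_bigram_model(["abab", "abc"])
p_b_given_a = cond_prob("a", "b")
p_abc = term_probability("abc", cond_prob)
```

In this toy corpus every "a" is followed by "b", so the conditional probability of "b" given "a" is 1, while "c" follows "b" in one of three occurrences of "b".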

It should be understood that, in the embodiments of the present disclosure, the total number of occurrences of all characters refers to the total number of occurrences of the characters in the entity and in each corpus entry of the preset corpus.

Similarly, the language representation model may also be implemented based on an existing neural network model structure, for example, a pre-trained biomedical language representation model for biomedical text mining (BioBERT) may be used for implementation. Specifically, a BioBERT model is trained by taking a sample entity as input and a dense vector corresponding to the sample entity as output in advance, and then the dense vector of the entity is obtained through the trained BioBERT model.

It should be understood that the BioBERT model is a pre-trained language characterization model, trained on a large amount of medical text. When it is used in the embodiments of the present disclosure, the application scenario differs from that of the original training, so directly using the officially released parameters may yield a model that is not well suited to this scheme, that is, a model with low accuracy. To solve this technical problem, the embodiments of the present disclosure first pre-train the BioBERT model with the officially provided parameters, i.e., train the text vector representation (i.e., dense vector) of each character, and then fine-tune the parameters of the trained BioBERT model by constructing a downstream task, so as to improve the accuracy of the BioBERT model. That is, according to one embodiment of the present disclosure, the training process of the language characterization model may include:

determining dense vectors of sample entities and dense vectors of sample candidate linguistic data through a pre-trained language characterization model, and determining sparse vectors of the sample entities and sparse vectors of the sample candidate linguistic data through a statistical language model; determining the similarity between the sample entity and each sample corpus candidate according to the dense vector of the sample entity, the sparse vector of the sample entity, the dense vector of the sample corpus candidate and the sparse vector of the sample corpus candidate; determining a sample similarity sequence according to the similarity, and determining a sample label sequence according to the sample similarity sequence; each sample label in the sample label sequence is used for indicating a standard category of each sample similarity in the sample similarity sequence, and the standard category comprises synonyms and/or hypernyms; and determining a loss function value according to the sample similarity sequence and the sample label sequence, and updating the parameters of the pre-trained language representation model according to the loss function value.

Illustratively, in the fine-tuning process, a training sample includes an entity d and 20 synonyms, the 20 synonyms coming from 4 concepts in UMLS. During training, the entity d and the 20 synonyms are first input into the pre-trained BioBERT model and the statistical language model to obtain the dense vector and the sparse vector of the entity d and the dense vector and the sparse vector of each synonym. Secondly, the similarity between the entity d and each synonym is calculated according to the dense vector of the entity d, the sparse vector of the entity d, the dense vector of each synonym and the sparse vector of each synonym, and the 7 synonyms with the highest similarity are selected as candidate synonyms of the entity d, denoted $[d_1, d_2, \ldots, d_7]$; meanwhile, a similarity sequence $z = [z_1, z_2, \ldots, z_7]$ is constructed from the similarity corresponding to each candidate synonym. Then, each candidate synonym is labeled according to its relation to the entity d to obtain a label sequence $y = [y_1, y_2, \ldots, y_7]$, as shown in fig. 2. For example, if the candidate synonym is a synonym of the entity d, it is labeled 2; if the candidate synonym is a hypernym of the entity d, it is labeled 1; and if the candidate synonym is neither a synonym nor a hypernym of the entity d, it is labeled 0; the labels are then arranged in the corresponding order to obtain the label sequence y. Finally, the loss function value is calculated from the similarity sequence z and the label sequence y, and the parameters of the pre-trained language characterization model are updated according to the loss function value.
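The construction of the candidate list, the similarity sequence z and the label sequence y can be sketched as follows. This is a minimal illustration: the function name, the dictionary inputs and the relation strings "synonym" and "hypernym" are assumptions made for the example; only the label values 2, 1 and 0 and the top-7 selection come from the description.

```python
def build_training_lists(similarities, relation_of, k=7):
    """Return (candidates, z, y) for one training entity.

    similarities: dict mapping each synonym to its similarity with the entity.
    relation_of:  dict mapping a synonym to "synonym" or "hypernym"
                  (hypothetical encoding); missing keys mean "neither".
    """
    # Keep the k synonyms with the highest similarity, in descending order.
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)[:k]
    candidates = [name for name, _ in ranked]
    z = [score for _, score in ranked]
    # Label values follow the description: synonym -> 2, hypernym -> 1, else 0.
    label_value = {"synonym": 2, "hypernym": 1}
    y = [label_value.get(relation_of.get(name), 0) for name in candidates]
    return candidates, z, y


# Toy usage with k=3 instead of 7 to keep the example short.
cands, z, y = build_training_lists(
    {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.1},
    {"a": "synonym", "c": "hypernym"},
    k=3,
)
```

The label sequence y is aligned position by position with z, which is what a listwise loss over the two sequences requires.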

In a possible implementation, the list cross entropy may be used as the loss function, and the loss function value may be obtained by maximum likelihood estimation. The list cross entropy is:

where ListLoss represents the loss function value; M represents the total number of entities in the training sample; $y_i$ denotes the $i$-th label in the label sequence; $z_i$ denotes the $i$-th similarity in the similarity sequence; $k_1$ denotes the length of the similarity sequence or label sequence; $\exp(z_i) / \sum_{j=1}^{k_1} \exp(z_j)$ represents the probability of the $i$-th similarity score in the similarity sequence; $\exp(y_i) / \sum_{j=1}^{k_1} \exp(y_j)$ represents the probability of the $i$-th label in the label sequence; $\sum_{j=1}^{k_1} \exp(z_j)$ represents the sum of the exponentials of the similarity sequence; and $\sum_{j=1}^{k_1} \exp(y_j)$ represents the sum of the exponentials of the label sequence.
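One form of list cross entropy that is consistent with the symbol definitions above is the standard listwise (ListNet-style, top-one) cross entropy, $\mathrm{ListLoss} = -\sum_{m=1}^{M} \sum_{i=1}^{k_1} \frac{\exp(y_i)}{\sum_{j=1}^{k_1}\exp(y_j)} \log \frac{\exp(z_i)}{\sum_{j=1}^{k_1}\exp(z_j)}$, where $y$ and $z$ are the sequences of the $m$-th entity. This exact form, and in particular the plain sum over the M entities without further normalization, is an assumption rather than something stated in the text. A minimal Python sketch under that assumption, with hypothetical function names:

```python
import math


def softmax(values):
    """Exponentiate and normalize; shifting by the maximum keeps exp() stable."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def list_cross_entropy(z, y):
    """Cross entropy between softmax(y) (labels) and softmax(z) (similarities)."""
    p_y = softmax(y)
    p_z = softmax(z)
    return -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))


def list_loss(batch):
    """Sum of the per-entity losses; batch is a list of (z, y) pairs."""
    return sum(list_cross_entropy(z, y) for z, y in batch)


# A ranking that agrees with the labels gives a smaller loss than a reversed one.
labels = [2, 1, 0]
good = list_cross_entropy([2.0, 1.0, 0.0], labels)
bad = list_cross_entropy([0.0, 1.0, 2.0], labels)
```

Minimizing this quantity pushes the similarity scores of synonyms above those of hypernyms, and both above unrelated candidates.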

After the language characterization model is trained, the dense vectors of the entities and the dense vectors of the corpus entries can be determined by the language characterization model, and the similarity between an entity and each corpus entry in the preset corpus is then determined according to these dense vectors together with the sparse vectors of the entity and of the corpus entries.

Illustratively, an entity is denoted as m and each corpus entry in the preset corpus is denoted as $e_i$; m and $e_i$ each have a corresponding sparse vector and a corresponding dense vector. When calculating the similarity, the cosine similarity between the sparse vector of m and the sparse vector of $e_i$ is first calculated to obtain the first similarity $S_{sparse}(m, e_i)$; the cosine similarity between the dense vector of m and the dense vector of $e_i$ is then calculated to obtain the second similarity $S_{dense}(m, e_i)$. Finally, the first similarity and the second similarity are added to obtain the similarity $z_i$, i.e. $z_i = S_{dense}(m, e_i) + S_{sparse}(m, e_i)$. In a possible embodiment, a weight $\lambda$ may also be set for the first similarity (the sparse similarity) so that the calculated similarity is more accurate, i.e. $z_i = S_{dense}(m, e_i) + \lambda S_{sparse}(m, e_i)$. In a possible embodiment, the value of $\lambda$ may be set as a parameter of the language characterization model.
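The combination of the two cosine similarities can be illustrated with a small Python sketch. The function names and the toy vectors are hypothetical; in the scheme described above the sparse vectors would come from the statistical language model and the dense vectors from the fine-tuned language characterization model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_similarity(m_sparse, m_dense, e_sparse, e_dense, lam=1.0):
    """z_i = S_dense(m, e_i) + lam * S_sparse(m, e_i)."""
    s_sparse = cosine(m_sparse, e_sparse)  # first similarity
    s_dense = cosine(m_dense, e_dense)     # second similarity
    return s_dense + lam * s_sparse


# Toy vectors: identical sparse vectors, orthogonal dense vectors.
z_unweighted = hybrid_similarity([1, 0], [1, 0], [1, 0], [0, 1])
z_weighted = hybrid_similarity([1, 0], [1, 0], [1, 0], [0, 1], lam=0.5)
```

With lam = 1.0 the function reduces to the plain sum of the two similarities; a smaller lam reduces the influence of surface-level character overlap relative to the dense representation.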

S3: constructing a hepatocellular carcinoma pathology knowledge graph according to the target triple data set.

In summary, the knowledge graph construction method can link together medical knowledge about hepatocellular carcinoma from different data sources, so that pathological knowledge of hepatocellular carcinoma can be queried. Specifically, hepatocellular carcinoma text data are obtained from different data sources, and entities and entity relationships are extracted from them, so that the constructed hepatocellular carcinoma pathology knowledge graph can cover most pathological knowledge related to hepatocellular carcinoma. This not only helps biomedical researchers find substances related to hepatocellular carcinoma, but also improves the accuracy and completeness of queries for hepatocellular carcinoma pathological knowledge. In addition, entities with various expressions are mapped to target entities in the preset corpus through the preset selection rule, so that the entities in the extracted data are normalized to a standard form. This reduces large-scale information redundancy among different data sources, further ensures the accuracy of the data in the hepatocellular carcinoma pathology knowledge graph, and improves the accuracy of information queries based on the knowledge graph.

It should be understood that, after the hepatocellular carcinoma pathology knowledge graph is constructed, the constructed hepatocellular carcinoma pathology knowledge graph needs to be stored in order to be able to use the hepatocellular carcinoma pathology knowledge graph for information query subsequently. That is, according to one embodiment of the present disclosure, the method may further include:

storing the hepatocellular carcinoma pathology knowledge graph in the form of a resource description framework; and/or storing the hepatocellular carcinoma pathology knowledge graph in the form of a graph database.

Illustratively, when the hepatocellular carcinoma pathology knowledge graph is stored in the form of a resource description framework (RDF), the data may be queried using SPARQL (SPARQL Protocol and RDF Query Language). When the hepatocellular carcinoma pathology knowledge graph is stored in the form of a graph database, the data can be queried and updated using the graph query language Cypher. According to one embodiment of the disclosure, the graph database Neo4j is used to store the hepatocellular carcinoma pathology knowledge graph, and the corresponding triples in the knowledge graph are imported into Neo4j through the tool neo4j-import. When the hepatocellular carcinoma pathology knowledge graph is opened in Neo4j, a network is displayed in which the nodes are entities and the edges between nodes are the entity relationships among the entities. Biomedical researchers can search for entities and entity relationships to explore and reason over the graph.
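As an illustration of preparing triples for bulk import into a graph database, the following Python sketch turns a list of triples into a node table and a relationship table in CSV form. The function name, the generated node identifiers, the "Entity" label and the relation name are hypothetical; the header fields (an ID column, ":LABEL", ":START_ID", ":END_ID", ":TYPE") follow the CSV header convention of Neo4j's import tooling and should be checked against the Neo4j version in use.

```python
import csv
import io


def triples_to_import_csv(triples):
    """Build (nodes_csv, relationships_csv) strings from (head, relation, tail) triples."""
    # One node per distinct entity; sorting makes the assigned IDs reproducible.
    entities = sorted({entity for head, _, tail in triples for entity in (head, tail)})
    node_id = {name: f"e{index}" for index, name in enumerate(entities)}

    nodes = io.StringIO()
    writer = csv.writer(nodes, lineterminator="\n")
    writer.writerow(["entityId:ID", "name", ":LABEL"])
    for name in entities:
        writer.writerow([node_id[name], name, "Entity"])

    relationships = io.StringIO()
    writer = csv.writer(relationships, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for head, relation, tail in triples:
        writer.writerow([node_id[head], node_id[tail], relation])

    return nodes.getvalue(), relationships.getvalue()


# Toy usage with a single made-up triple.
nodes_csv, rels_csv = triples_to_import_csv(
    [("hepatocellular carcinoma", "ASSOCIATED_WITH", "hepatitis B")]
)
```

The two strings can be written to files and passed to the import tool; each entity becomes one node regardless of how many triples mention it.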

In a possible embodiment, the hepatocellular carcinoma pathology knowledge graph constructed based on the above method comprises 5028 entities and 13296 triples. Specifically, the entities comprise 1328 drugs, 1849 proteins, 1403 diseases, 160 cells, 140 DNAs, 54 phenotypic abnormalities, 50 genes, 35 treatment technologies and 9 RNAs. By analyzing the data, it can be found that:

(1) 162406 pieces of hepatocellular carcinoma text data were obtained from different data sources, far more than the number of triples in the hepatocellular carcinoma pathology knowledge graph, which indicates that large-scale redundant information exists in the hepatocellular carcinoma text data across the different data sources. The knowledge graph construction method provided by the embodiment of the disclosure can help researchers filter out redundant information and improve research efficiency.

(2) The number of triples in the hepatocellular carcinoma pathology knowledge graph constructed in this embodiment is much larger than the number of entities, which means that one entity may be related to a plurality of different entities. This can help researchers analyze the relationships between different entities and discover the molecular mechanisms or treatment methods of hepatocellular carcinoma. For example, hepatocellular carcinoma is associated with hepatitis A, and hepatitis A is related to the glucagon family; it can thus be inferred that glucagon is likely to be associated with hepatocellular carcinoma, thereby helping biomedical researchers discover substances associated with hepatocellular carcinoma.
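The kind of two-step reasoning in the example above can be sketched as a simple traversal over the triples. This is only an illustration: the function name and relation names are hypothetical, relations are treated as undirected, and the result is a list of candidates for a researcher to examine, not an established association.

```python
from collections import defaultdict


def two_hop_candidates(triples, source):
    """Entities reachable from source in exactly two steps, with the linking entities."""
    neighbors = defaultdict(set)
    for head, _, tail in triples:
        # Treat every relation as undirected for exploration purposes.
        neighbors[head].add(tail)
        neighbors[tail].add(head)

    direct = set(neighbors.get(source, set()))
    candidates = {}
    for middle in direct:
        for far in neighbors[middle]:
            # Skip the source itself and entities already directly linked to it.
            if far != source and far not in direct:
                candidates.setdefault(far, set()).add(middle)
    return candidates


# Toy usage mirroring the example in the text.
toy_triples = [
    ("hepatocellular carcinoma", "associated_with", "hepatitis A"),
    ("hepatitis A", "related_to", "glucagon"),
]
found = two_hop_candidates(toy_triples, "hepatocellular carcinoma")
```

Here the traversal returns glucagon as a candidate, together with hepatitis A as the entity that links it to hepatocellular carcinoma.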

Based on the same concept, the embodiment of the present disclosure further provides a knowledge graph constructing apparatus, as shown in fig. 4, the knowledge graph constructing apparatus 400 includes:

the obtaining module 401 is configured to obtain hepatocellular carcinoma text data from different data sources, and determine a triple data set according to the hepatocellular carcinoma text data, where each triple in the triple data set includes an entity pair and an entity relationship, and the entity relationship is used to characterize a relationship between two entities in the entity pair;

a processing module 402, configured to determine, for each entity in the triple data set, a corresponding target entity in a preset corpus according to a preset selection rule, and replace the corresponding entity in the triple data set with the target entity, so as to obtain a target triple data set;

a constructing module 403, configured to construct a hepatocellular carcinoma pathology knowledge graph according to the target triple data set.

Optionally, when the preset selection rule is to select a corpus most similar to the semantics of the entity as the target entity, for each entity in the triple data set, the processing module 402 may be further configured to:

determining the similarity between the entity and each corpus in the preset corpus; determining a corpus candidate according to the similarity and a preset similarity threshold, or sequencing the corpuses according to the size sequence of the similarity to obtain a corpus sequence, and determining the corpus candidate according to the corpus sequence and a preset selection order; and determining semantic similarity between the candidate corpus and the entity, and determining a target entity according to the semantic similarity.

Optionally, the processing module 402 may be further configured to:

Optionally, the knowledge-graph building apparatus 400 may further include a training module for:

determining dense vectors of sample entities and dense vectors of sample candidate corpora through a pre-trained language characterization model, and determining sparse vectors of the sample entities and sparse vectors of the sample candidate corpora through a statistical language model;

determining a sample similarity sequence according to the size of the similarity, and determining a sample label sequence according to the sample similarity sequence; each sample label in the sample label sequence is used for indicating a standard category of each sample similarity in the sample similarity sequence, and the standard category comprises synonyms and/or hypernyms;

and determining a loss function value according to the sample similarity sequence and the sample label sequence, and updating the parameters of the pre-trained language representation model according to the loss function value.

Optionally, when the corpus is a unified medical language base, the processing module 402 is further configured to, for each entity in the triple dataset:

determining the similarity between the entity and each synonym in the unified medical language library; determining candidate synonyms according to the similarity and a preset similarity threshold, or sequencing the synonyms according to the similarity, obtaining a synonym sequence, and determining the candidate synonyms according to the synonym sequence and a preset selection order; determining semantic similarity between the candidate synonym and the entity, and determining a target synonym according to the semantic similarity; and determining the concept name of the target synonym in the unified medical language library, and determining the concept name as the target entity.

Optionally, when the hepatocellular carcinoma text data is unstructured text data, the processing module 402 is further configured to:

carrying out entity identification on the hepatocellular carcinoma text data through an entity identification model to obtain an entity; pairing the entities based on a preset pairing rule to obtain an entity pair; carrying out relationship identification on the entity pair through a relationship identification model to obtain an entity relationship; and determining the triple data set according to the entity pairs and the corresponding entity relations.

Optionally, the knowledge-graph constructing apparatus 400 further comprises a storage module, and the storage module can be configured to:

In summary, the knowledge graph construction apparatus 400 can link together medical knowledge about hepatocellular carcinoma from different data sources, so that pathological knowledge of hepatocellular carcinoma can be queried. Specifically, hepatocellular carcinoma text data are obtained from different data sources, and entities and entity relationships are extracted from them, so that the constructed hepatocellular carcinoma pathology knowledge graph can cover most pathological knowledge related to hepatocellular carcinoma. This not only helps biomedical researchers find substances related to hepatocellular carcinoma, but also improves the accuracy and completeness of queries for hepatocellular carcinoma pathological knowledge. In addition, entities with various expressions are mapped to target entities in the preset corpus through the preset selection rule, so that the entities in the extracted data are normalized to a standard form. This reduces large-scale information redundancy among different data sources, further ensures the accuracy of the data in the hepatocellular carcinoma pathology knowledge graph, and improves the accuracy of information queries based on the knowledge graph.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 5 is a block diagram illustrating an electronic device 500 in accordance with an example embodiment. As shown in fig. 5, the electronic device 500 may include: a processor 501 and a memory 502. The electronic device 500 may also include one or more of a multimedia component 503, an input/output (I/O) interface 504, and a communications component 505.

The processor 501 is configured to control the overall operation of the electronic device 500, so as to complete all or part of the steps of the above knowledge graph construction method. The memory 502 is used to store various types of data to support operation of the electronic device 500, such as instructions for any application or method operating on the electronic device 500 and application-related data, such as contact data, messages, pictures, audio, video, and the like. The memory 502 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The multimedia component 503 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 502 or transmitted through the communication component 505. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 504 provides an interface between the processor 501 and other interface modules, such as a keyboard, a mouse, buttons, and the like. These buttons may be virtual buttons or physical buttons. The communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or a combination of one or more of them, which is not limited herein. Accordingly, the communication component 505 may comprise a Wi-Fi module, a Bluetooth module, an NFC module, and the like.

In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described knowledge graph construction method.

In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the above-described method of knowledge-graph construction is also provided. For example, the computer readable storage medium may be the memory 502 described above comprising program instructions executable by the processor 501 of the electronic device 500 to perform the method of knowledge graph construction described above.

In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned method of knowledge-graph construction when executed by the programmable apparatus.

The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.

It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.

In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure as long as it does not depart from the gist of the present disclosure.

Claims

1. A method of knowledge graph construction, the method comprising:

acquiring hepatocellular carcinoma text data from different data sources, and determining a triple data set according to the hepatocellular carcinoma text data, wherein each triple in the triple data set comprises an entity pair and an entity relationship, and the entity relationship is used for representing the relationship between two entities in the entity pair;

2. The method according to claim 1, wherein when the predetermined selection rule is to select a corpus having a semantic similarity with the entity as the target entity, the determining, for each entity in the triple data set, a corresponding target entity in a predetermined corpus according to the predetermined selection rule comprises:

for each of the entities in the triple data set, performing the following:

determining a corpus candidate according to the similarity and a preset similarity threshold, or sequencing the corpuses according to the similarity to obtain a corpus sequence, and determining the corpus candidate according to the corpus sequence and a preset selection order;

3. The method according to claim 2, wherein the determining the similarity between the entity and each corpus in the corpus comprises:

determining a dense vector of the entity and a dense vector of the corpus based on a language characterization model;

4. The method of claim 3, wherein the training process of the language characterization model comprises:

5. The method according to claim 2, wherein when the predetermined corpus is a unified medical language library, the determining, for each entity in the triple data set, a corresponding target entity in the predetermined corpus according to a predetermined selection rule comprises:

for each of the entities in the triple data set, performing the following:

determining the concept name of the target synonym in the unified medical language library, and determining the concept name as the target entity.

6. The method of any of claims 1-5, wherein when the hepatocellular carcinoma text data is unstructured text data, the determining the triple data set from the hepatocellular carcinoma text data comprises:

pairing the entities based on a preset pairing rule to obtain an entity pair;

7. The method according to any one of claims 1-5, further comprising:

storing the hepatocellular carcinoma pathology knowledge graph in the form of a resource description framework; and/or storing the hepatocellular carcinoma pathology knowledge graph in the form of a graph database.

8. A knowledge-graph building apparatus, the apparatus comprising:

an acquisition module, configured to acquire hepatocellular carcinoma text data from different data sources and determine a triple data set according to the hepatocellular carcinoma text data, wherein each triple in the triple data set comprises an entity pair and an entity relationship, and the entity relationship is used for representing the relationship between two entities in the entity pair;

and a construction module, configured to construct the hepatocellular carcinoma pathology knowledge graph according to the target triple data set.

9. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. An electronic device, comprising:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 7.