CN115080764B - Medical similar entity classification method and system based on knowledge graph and clustering algorithm - Google Patents
Medical similar entity classification method and system based on knowledge graph and clustering algorithm
- Publication number
- CN115080764B (application CN202210856458.1A)
- Authority
- CN
- China
- Prior art keywords
- entity
- medical
- similar
- entities
- vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Medical Informatics (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Public Health (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Animal Behavior & Ethology (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to the technical field of knowledge graphs, and in particular to a medical similar entity classification method and system based on a knowledge graph and a clustering algorithm. The method comprises: forming the data of a medical database into a triple data set and using it as a training set to train a knowledge graph learning model, thereby obtaining a vectorized representation of the medical knowledge graph of the medical database; obtaining a representative vector of each triple by passing the triple through a mean pooling layer; clustering the representative vectors of entities and relations with the unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge graph; taking entities in the same cluster as positive samples and entities in different clusters as negative samples; training an entity similarity classification model on the positive and negative samples; and judging the similarity of entities with the entity similarity classification model. The invention removes the tedious manual labeling of similar entities and enables accurate construction of the medical knowledge graph without manual annotation.
Description
Technical Field
The invention relates to the technical field of knowledge graphs, in particular to a medical similar entity classification method and system based on knowledge graphs and a clustering algorithm.
Background
A knowledge graph is composed of nodes and edges; as a multi-relational graph, it generally contains multiple types of nodes and multiple types of edges. Entities (nodes) refer to things in the real world, such as people, place names, concepts, drugs and companies. Relations (edges) express some kind of connection between different entities, for example: a person "lives in" Beijing; Zhang and Li are "friends"; logistic regression is "prerequisite knowledge" for deep learning.
At present, medical knowledge graphs are widely applied, for example in intelligent question answering, visualization and search. However, similar entity classification on an already constructed knowledge graph without manual labeling remains to be developed, and this gap makes the medical knowledge graph difficult to construct.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a medical similar entity classification method and system based on a knowledge graph and a clustering algorithm, so as to solve the difficulty of classifying similar entities on an established knowledge graph without manual labeling, and to realize such classification.
To solve the above problems, the invention adopts the following technical scheme:
Current similar entity classification tasks rely on manual labeling of whether entities are similar. The invention instead provides a similar entity classification task that requires no manual labeling: first, the entity and relation nodes in a knowledge graph are converted into vector representations; clustering is then performed on the vectorized entity nodes and relation triples to obtain similar entities; positive and negative samples are constructed from the clustering results; and the positive and negative samples are used as input data to train a similar entity classification model.
In a first aspect, the present invention provides a method for classifying medical similar entities based on a knowledge graph and a clustering algorithm, comprising:
S100, forming the data of a medical database into a triple data set, taking the triple data set as a training set, selecting correct triples and incorrect triples from the training set, and inputting them into a knowledge graph learning model for training to generate the trained knowledge graph learning model; obtaining, based on the knowledge graph learning model, updated vectorized representations of the embedding layer entities and relations, thereby obtaining the vectorized medical knowledge graph of the medical database;
S200, based on the vectorized medical knowledge graph of the medical database, obtaining a representative vector of each triple by passing the triple through a mean pooling layer, and clustering the representative vectors of entities and relations by using the unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge graph;
S300, based on the similar term entity library in the medical knowledge graph, taking entities in the same cluster as positive samples and entities in different clusters as negative samples, inputting the positive samples and the negative samples to train an entity similarity classification model, and performing similarity judgment on entities based on the entity similarity classification model.
As an implementation manner, in step S200, clustering the representative vectors of entities and relations by using the unsupervised clustering algorithm Kmeans includes:
S201, randomly selecting K entities as central points in the data set of the medical knowledge graph;
S202, defining a loss function, and calculating the similarity between entities;
S203, for each entity in the data set, assigning the entity to its nearest central point according to the calculated cosine distance to obtain K new clusters, and for each new cluster, recalculating the central point of the cluster, until the loss function converges.
As an implementation manner, the loss function in step S202 is:
where A and B are the attribute vectors of two given vectors a and b, respectively; A_i and B_i are the components of attribute vectors A and B, respectively; α is the angle between vectors a and b; and dist(A, B) is the cosine distance between vectors a and b.
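A form of this loss consistent with the symbols defined here can be written as follows. It assumes the common convention that cosine distance is one minus the cosine of the angle; the text itself only names the quantity, so both that convention and the vector dimension n are assumptions.

```latex
\cos\alpha = \frac{\sum_{i=1}^{n} A_i B_i}
                  {\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}},
\qquad
\operatorname{dist}(A, B) = 1 - \cos\alpha
```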
As an implementation manner, the calculation of the entity similarity classification model in step S300 includes:
S301, mapping the positive samples and the negative samples through an embedding layer weight matrix to obtain embedding layer word vectors of the positive samples and the negative samples, which serve as the embedding layer matrix representation of the input data, wherein the embedding layer word vectors have 256 dimensions;
S302, extracting time series features of the embedding layer word vectors of the positive samples and the negative samples through an LSTM;
S303, carrying out binary classification through a linear layer, and judging whether the two entities are similar according to the following formula:
wherein W_3 is the weight matrix of the last linear layer; h_t is the final hidden state output of the LSTM network; p is the output probability of whether the entities are similar; ŷ is the result of passing the LSTM output through the linear layer; and softmax is a normalization function applied to ŷ so that the results lie in the interval from 0 to 1.
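Read together, these definitions suggest a two-step computation: a linear map of the final hidden state followed by softmax. A minimal way to write it, assuming the last linear layer has no separate bias term (the text does not mention one), is:

```latex
\hat{y} = W_3\, h_t, \qquad p = \operatorname{softmax}(\hat{y})
```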
As an implementation manner, in step S302, extracting the time series features of the embedding layer word vectors of the positive samples and the negative samples through the LSTM includes:
serially inputting the embedding layer word vectors of the positive samples and the negative samples into an LSTM computing unit, and obtaining Lstm_embedding vector representations in different sequence directions through calculation of the following formula:
wherein i_t is the input gate, f_t is the forget gate, o_t is the output gate and c_t is the memory cell; W denotes the weight matrices of the linear layers; x_t is the representation vector of the character currently input to the computing unit; h_{t-1} is the hidden state output corresponding to the previous character; c_{t-1} is the memory cell state corresponding to the previous character; b denotes the bias terms of the linear layers; and tanh and σ are activation functions.
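The gates and states named here match the standard LSTM cell. Assuming that standard formulation applies, with one weight matrix and bias per gate acting on the concatenation of h_{t-1} and x_t (the per-gate subscripts and the candidate state c̃_t are notational assumptions), the update can be written as:

```latex
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1};\,x_t] + b_i\right) \\
f_t &= \sigma\left(W_f\,[h_{t-1};\,x_t] + b_f\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1};\,x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1};\,x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```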
As an implementation manner, in step S100, selecting correct triples and incorrect triples from the training set and inputting them into the knowledge graph learning model for training includes:
the correct triple is S(h, l, t) and the incorrect triple is S'(h', l, t) or S'(h, l, t'), wherein h is the head entity, t is the tail entity, l is the relation between h and t, and h' and t' are obtained by replacing the head entity and the tail entity, respectively, with a random entity;
the correct triples and the incorrect triples are distinguished by a distance-based calculation, as follows:
wherein [x]_+ denotes max(0, x), and λ is an adjustable hyperparameter.
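The notation S(h, l, t), the corrupted set S', the hinge [x]_+ and the margin λ correspond to a translation-based margin ranking objective. Assuming that TransE-style objective, with d denoting a vector distance such as the L1 or L2 norm (the choice of d is an assumption), the loss can be written as:

```latex
\mathcal{L} = \sum_{(h,\,l,\,t)\,\in\,S}\;\sum_{(h',\,l,\,t')\,\in\,S'}
\bigl[\,\lambda + d(h + l,\; t) - d(h' + l,\; t')\,\bigr]_{+}
```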
In a second aspect, the present invention provides a medical similar entity classification system based on a knowledge graph and a clustering algorithm, including: a medical knowledge graph vectorization representation module, a similar term entity library construction module and an entity similarity judgment module;
the medical knowledge graph vectorization representation module is used for forming the data of a medical database into a triple data set, taking the triple data set as a training set, selecting correct triples and incorrect triples from the training set, inputting them into a knowledge graph learning model for training to generate the trained knowledge graph learning model, obtaining, based on the knowledge graph learning model, updated vectorized representations of the embedding layer entities and relations as the representation vectors of the knowledge graph, and thereby obtaining the vectorized representation of the medical knowledge graph of the medical database;
the similar term entity library construction module is used for obtaining, based on the vectorized medical knowledge graph of the medical database, a representative vector of each triple by passing the triple through a mean pooling layer, and clustering the representative vectors of entities and relations by using the unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge graph;
and the entity similarity judgment module is used for taking entities in the same cluster as positive samples and entities in different clusters as negative samples based on the similar term entity library in the medical knowledge graph, inputting the positive samples and the negative samples to train an entity similarity classification model, and performing similarity judgment on entities based on the entity similarity classification model.
As an implementation manner, the similar term entity library construction module comprises a central point selection unit, a similarity calculation unit and a central point re-determination unit;
the central point selecting unit is used for randomly selecting K entities as central points in the data set of the medical knowledge graph;
the similarity calculation unit is used for defining a loss function and calculating the similarity between entities;
and the central point re-determining unit is used for distributing each entity in the data set to a central point closest to the entity according to the calculated cosine distance, re-acquiring K clusters, and re-calculating the central point of each cluster for each newly acquired cluster until the loss function is converged.
As an implementable manner, the loss function in the similarity calculation unit is:
wherein A and B are the attribute vectors of two given vectors a and b, respectively; A_i and B_i are the components of attribute vectors A and B, respectively; α is the angle between vectors a and b; and dist(A, B) is the cosine distance between vectors a and b.
As an implementation manner, the entity similarity judgment module comprises a word vector determination unit, a time series feature extraction unit and a similarity judgment unit;
the word vector determining unit is used for mapping the positive sample and the negative sample through an embedding layer weight matrix to obtain word vectors of the embedding layers of the positive sample and the negative sample, the word vectors are used as the embedding layer matrix representation of input data, and the dimensionality of the word vectors is 256 dimensionalities;
the time series feature extraction unit is used for extracting time series features of the embedding layer word vectors of the positive samples and the negative samples through an LSTM;
the similarity judging unit is used for carrying out binary classification through a linear layer and judging whether the two entities are similar according to the following formula:
wherein W_3 is the weight matrix of the last linear layer; h_t is the final hidden state output of the LSTM network; p is the output probability of whether the entities are similar; ŷ is the result of passing the LSTM output through the linear layer; and softmax is a normalization function applied to ŷ so that the results lie in the interval from 0 to 1.
As an implementation manner, in the time series feature extraction unit, extracting the time series features of the embedding layer word vectors of the positive samples and the negative samples through the LSTM includes:
serially inputting the embedding layer word vectors of the positive samples and the negative samples into an LSTM computing unit, and obtaining Lstm_embedding vector representations in different sequence directions through calculation of the following formula:
wherein i_t is the input gate, f_t is the forget gate, o_t is the output gate and c_t is the memory cell; W denotes the weight matrices of the linear layers; x_t is the representation vector of the character currently input to the computing unit; h_{t-1} is the hidden state output corresponding to the previous character; c_{t-1} is the memory cell state corresponding to the previous character; b denotes the bias terms of the linear layers; and tanh and σ are activation functions.
As an implementation manner, selecting, by the medical knowledge graph vectorization representation module, correct triples and incorrect triples from the training set and inputting them into the knowledge graph learning model for training includes:
the correct triple is S(h, l, t) and the incorrect triple is S'(h', l, t) or S'(h, l, t'), wherein h is the head entity, t is the tail entity, l is the relation between h and t, and h' and t' are obtained by replacing the head entity and the tail entity, respectively, with a random entity;
the correct triples and the incorrect triples are distinguished by a distance-based calculation, as follows:
wherein [x]_+ denotes max(0, x), and λ is an adjustable hyperparameter.
In a third aspect, the invention provides a computer apparatus comprising:
a memory for storing a computer program;
and the processor is used for realizing the steps of the medical similar entity classification method based on the knowledge graph and the clustering algorithm when executing the computer program.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the above medical similar entity classification method based on the knowledge-graph and clustering algorithm.
The invention has the following beneficial effects: the medical similar entity classification method and system based on the knowledge graph and the clustering algorithm construct a vectorized medical knowledge graph, cluster it with an unsupervised clustering algorithm, and judge the similarity of entities with an LSTM-based entity similarity classification model, so that an accurate medical knowledge graph is formed.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings, in which:
Fig. 1 is a flow chart of a medical similar entity classification method based on a knowledge graph and a clustering algorithm according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of clustering representative vectors of entities and relations by using the unsupervised clustering algorithm Kmeans according to the embodiment of the present invention.
Fig. 3 is a schematic diagram of the calculation process of the entity similarity classification model according to the embodiment of the present invention.
Fig. 4 is a schematic diagram of a medical similar entity classification system based on a knowledge graph and a clustering algorithm according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
It should be noted that these examples only illustrate the present invention and do not limit it; simple modifications of the method based on the idea of the present invention fall within the scope of the claimed invention.
The method comprises the steps of converting entity nodes and relation nodes in a knowledge graph into embedding vector representation, clustering based on the vectorized entity nodes and the vectorized triple representation of the relation, obtaining similar entities, constructing positive and negative samples according to the clustering results of the similar entities, and using the positive and negative samples as input data to train a similar entity classification model.
Referring to fig. 1, a method for classifying medical similar entities based on a knowledge graph and a clustering algorithm includes:
S100, forming the data of a medical database into a triple data set, taking the triple data set as a training set, selecting correct triples and incorrect triples from the training set, and inputting them into a knowledge graph learning model for training to generate the trained knowledge graph learning model; obtaining, based on the knowledge graph learning model, updated vectorized representations of the embedding layer entities and relations as the representation vectors of the knowledge graph, thereby obtaining the vectorized representation of the medical knowledge graph of the medical database.
The correct triple is S(h, l, t) and the incorrect triple is S'(h', l, t) or S'(h, l, t'), wherein h is the head entity, t is the tail entity, l is the relation between h and t, and h' and t' are obtained by replacing the head entity and the tail entity, respectively, with a random entity.
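The construction of incorrect triples can be illustrated with a short sketch. The function name `corrupt_triple`, the toy entities and the rejection of candidates that already appear among the correct triples are all assumptions made for illustration; the text only states that the head or tail is replaced by a random entity.

```python
import random


def corrupt_triple(triple, entities, known_triples, rng):
    """Build an incorrect triple S'(h', l, t) or S'(h, l, t') from a correct one.

    The head or the tail (chosen at random) is replaced by a random entity.
    Rejecting candidates already in `known_triples` is an added safeguard,
    not something the text specifies.
    """
    h, l, t = triple
    while True:
        e = rng.choice(entities)
        corrupted = (e, l, t) if rng.random() < 0.5 else (h, l, e)
        if corrupted != triple and corrupted not in known_triples:
            return corrupted


# Hypothetical toy data for illustration only.
entities = ["aspirin", "ibuprofen", "headache", "fever"]
correct = {("aspirin", "treats", "headache"), ("ibuprofen", "treats", "fever")}
rng = random.Random(0)
negative = corrupt_triple(("aspirin", "treats", "headache"), entities, correct, rng)
```

Exactly one end of the triple changes and the relation is kept, which is what lets the training objective compare a correct triple with its corrupted counterpart.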
The correct triples and the incorrect triples are distinguished by a distance-based calculation, which proceeds as follows:
Entity vectors and relation vectors are first randomly initialized and then normalized to represent the knowledge graph. The vector of each entity and relation in the knowledge base is treated as a parameter to be determined, and the goal of the whole algorithm is to solve for all of these parameters, that is, to obtain the best network, with which prediction can then be carried out. To find each undetermined parameter, a loss function is introduced, as in linear regression. The loss function is:
wherein [x]_+ denotes max(0, x), and λ is an adjustable hyperparameter.
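As a minimal sketch of how such a hinge loss behaves, the following computes the loss for one correct/incorrect pair. It assumes a TransE-style score d(h + l, t) with an L2 distance; the function names, the L2 choice and the toy embeddings are assumptions, not details given in the text.

```python
import math


def l2_distance(h, l, t):
    """Distance d(h + l, t): L2 norm of h + l - t."""
    return math.sqrt(sum((hi + li - ti) ** 2 for hi, li, ti in zip(h, l, t)))


def margin_loss(pos, neg, margin=1.0):
    """[margin + d(pos) - d(neg)]_+ for one correct/incorrect pair.

    `pos` and `neg` are (head, relation, tail) tuples of equal-length vectors;
    `margin` plays the role of the hyperparameter lambda.
    """
    return max(0.0, margin + l2_distance(*pos) - l2_distance(*neg))


# Hypothetical 2-dimensional embeddings.
h, l, t = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]   # h + l == t, so distance 0
t_bad = [1.0, 3.0]                              # corrupted tail, distance 3
loss_separated = margin_loss((h, l, t), (h, l, t_bad), margin=1.0)
loss_violated = margin_loss((h, l, t_bad), (h, l, t), margin=1.0)
```

The loss is zero once the incorrect triple is farther away than the correct one by at least the margin, and positive otherwise, so training only updates embeddings for pairs that are not yet well separated.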
S200, based on the obtained vectorization-expressed medical knowledge graph of the medical database, obtaining representative vectors of triples from the triples through a mean pooling layer, and clustering the representative vectors of entities and relations by using an unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge graph.
For each obtained triple (h, r, t), the vector representations h', r' and t' of its head entity, relation and tail entity are passed through a mean pooling layer, and the resulting vector is used as the representative vector g of the triple. The mean pooling layer formula is defined as follows:
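A minimal sketch of this pooling step, assuming that mean pooling here means the element-wise average of the three vectors (the function name `mean_pool` and the sample values are hypothetical):

```python
def mean_pool(h_vec, r_vec, t_vec):
    """Element-wise mean of the head, relation and tail vectors of one triple."""
    return [(a + b + c) / 3.0 for a, b, c in zip(h_vec, r_vec, t_vec)]


# Hypothetical 2-dimensional embeddings h', r', t'.
g = mean_pool([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
```

The result g has the same dimension as the individual embeddings, so it can be fed directly to the clustering step.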
Referring to Fig. 2, as an implementation manner, clustering the representative vectors of entities and relations by using the unsupervised clustering algorithm Kmeans includes:
S201, randomly selecting K entities as central points in the data set of the medical knowledge graph;
S202, defining a loss function, and calculating the similarity between entities;
S203, for each entity in the data set, assigning the entity to its nearest central point according to the calculated cosine distance to obtain K new clusters, and for each new cluster, recalculating the central point of the cluster, until the loss function converges.
The loss function in step S202 is:
wherein A and B are the attribute vectors of two given vectors a and b, respectively; A_i and B_i are the components of attribute vectors A and B, respectively; α is the angle between vectors a and b; and dist(A, B) is the cosine distance between vectors a and b.
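Steps S201 to S203 can be sketched as follows. This is an illustrative version under stated assumptions: cosine distance is taken as one minus the cosine of the angle, each center is recomputed as the plain mean of its members, and the loop stops when assignments no longer change (used here in place of an explicit loss convergence check). The function names and sample vectors are hypothetical.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def kmeans_cosine(vectors, k, max_iter=100, seed=0):
    """Cluster `vectors` into k groups by cosine distance (steps S201 to S203)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]   # S201: random centers
    labels = None
    for _ in range(max_iter):
        # S203: assign each vector to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:                          # assignments are stable
            break
        labels = new_labels
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:                                   # keep old center if empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical 2-D representative vectors pointing in two distinct directions.
vectors = [[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.1, 3.0]]
labels, centers = kmeans_cosine(vectors, k=2)
```

Because cosine distance ignores vector length, the first two vectors end up in one cluster and the last two in the other even though their magnitudes differ.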
S300, based on the similar term entity library in the medical knowledge graph, taking the entities in the same cluster as positive samples, taking the entities in different clusters as negative samples, inputting the positive samples and the negative samples, training an entity similar classification model, and performing similar judgment on the entities based on the entity similar classification model.
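One way to turn cluster assignments into training samples is to label entity pairs, as sketched here. Treating samples as pairs, the label values 1 and 0, the function name `build_pairs` and the toy entities are assumptions made for illustration.

```python
from itertools import combinations


def build_pairs(entity_clusters):
    """Turn a mapping entity -> cluster id into labelled entity pairs.

    Pairs from the same cluster get label 1 (positive sample);
    pairs from different clusters get label 0 (negative sample).
    """
    pairs = []
    for (e1, c1), (e2, c2) in combinations(sorted(entity_clusters.items()), 2):
        pairs.append((e1, e2, 1 if c1 == c2 else 0))
    return pairs


# Hypothetical cluster assignments taken from the similar term entity library.
clusters = {"fever": 0, "pyrexia": 0, "fracture": 1}
pairs = build_pairs(clusters)
```

On a real library the number of cross-cluster pairs grows much faster than the number of same-cluster pairs, so negative pairs would typically be subsampled.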
Referring to Fig. 3, as an implementation manner, the calculation of the entity similarity classification model in step S300 includes:
S301, mapping the positive samples and the negative samples through an embedding layer weight matrix to obtain embedding layer word vectors of the positive samples and the negative samples, which serve as the embedding layer matrix representation of the input data, wherein the embedding layer word vectors have 256 dimensions;
S302, extracting time series features of the embedding layer word vectors of the positive samples and the negative samples through an LSTM;
S303, carrying out binary classification through a linear layer, and judging whether the two entities are similar according to the following formula:
wherein W_3 is the weight matrix of the last linear layer; h_t is the final hidden state output of the LSTM network; p is the output probability of whether the entities are similar; ŷ is the result of passing the LSTM output through the linear layer; and softmax is a normalization function applied to ŷ so that the results lie in the interval from 0 to 1.
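The classification step can be sketched as a linear map followed by softmax. The sketch assumes a bias-free linear layer with one row of W_3 per class, and that row 1 corresponds to "similar"; the class order, the function names and the sample numbers are assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_t, W3):
    """Apply the last linear layer to the final hidden state, then softmax.

    W3 has one row per class: row 0 = "not similar", row 1 = "similar"
    (this class order is an assumption).
    """
    y_hat = [sum(w * x for w, x in zip(row, h_t)) for row in W3]
    return softmax(y_hat)


# Hypothetical 3-dimensional hidden state and 2 x 3 weight matrix.
p = classify([0.5, -1.0, 2.0], [[0.1, 0.2, -0.3], [0.4, -0.1, 0.5]])
```

The two entries of `p` sum to 1, and the entity pair is judged similar when the "similar" entry is the larger one.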
Specifically, in step S302, extracting the time series features of the embedding layer word vectors of the positive samples and the negative samples through the LSTM includes:
serially inputting the embedding layer word vectors of the positive samples and the negative samples into an LSTM computing unit, and obtaining Lstm_embedding vector representations in different sequence directions through calculation of the following formula:
wherein i_t is the input gate, f_t is the forget gate, o_t is the output gate and c_t is the memory cell; W denotes the weight matrices of the linear layers; x_t is the representation vector of the character currently input to the computing unit; h_{t-1} is the hidden state output corresponding to the previous character; c_{t-1} is the memory cell state corresponding to the previous character; b denotes the bias terms of the linear layers; and tanh and σ are activation functions.
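A single step of the LSTM computing unit can be sketched in plain Python. This assumes the standard LSTM cell, with each gate's weight matrix acting on the concatenation of h_{t-1} and x_t; the parameter layout, the function names and the all-zero toy weights are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; `params` maps gate name -> (W, b), W acting on [h_prev; x_t]."""
    z = h_prev + x_t                                        # concatenation of the two lists
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]     # input gate
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]     # forget gate
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]     # output gate
    g_t = [math.tanh(a) for a in affine(*params["c"], z)]   # candidate memory
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hypothetical cell with hidden size 1 and input size 1; all weights are zero,
# so every gate evaluates to 0.5 and the candidate memory to 0.
zero = ([[0.0, 0.0]], [0.0])
params = {"i": zero, "f": zero, "o": zero, "c": zero}
h_t, c_t = lstm_step([1.0], [0.0], [1.0], params)
```

Calling `lstm_step` once per character, feeding each returned `(h_t, c_t)` into the next call, yields the final hidden state that the linear layer consumes.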
Referring to Fig. 4, the system for classifying medical similar entities based on a knowledge graph and a clustering algorithm includes: the medical knowledge graph vectorization representation module 100, the similar term entity library construction module 200 and the entity similarity judgment module 300;
the medical knowledge graph vectorization representation module 100 is configured to form the data of a medical database into a triple data set, use the triple data set as a training set, select correct triples and incorrect triples from the training set, input them into a knowledge graph learning model for training to generate the trained knowledge graph learning model, obtain, based on the knowledge graph learning model, updated vectorized representations of the embedding layer entities and relations as the representation vectors of the knowledge graph, and thereby obtain the vectorized representation of the medical knowledge graph of the medical database;
the similar term entity library construction module 200 is configured to obtain, based on the obtained vectorized medical knowledge graph of the medical database, representative vectors of triples from the triples through a mean pooling layer, and perform clustering on the representative vectors of entities and relations by using an unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge graph;
the entity similarity determination module 300 is configured to use entities in the same cluster as a positive sample and entities in different clusters as negative samples based on a similar term entity library in the medical knowledge graph, input the positive sample and the negative sample, train an entity similarity classification model, and perform similarity determination on the entities based on the entity similarity classification model.
As an implementation manner, the similar term entity library construction module 200 includes a central point selection unit 201, a similarity calculation unit 202, and a central point re-determination unit 203;
the central point selecting unit 201 is configured to randomly select K entities as central points in the data set of the medical knowledge graph;
the similarity calculation unit 202 is configured to define a loss function and calculate a similarity between entities;
the center point re-determining unit 203 is configured to, for each entity in the data set, allocate the entity to a center point closest to the entity according to the calculated cosine distance, re-acquire K clusters, and re-calculate a center point of each newly acquired cluster until the loss function converges.
As an implementable embodiment, the loss function in the similarity calculation unit 202 is:
wherein A and B are the attribute vectors of two given vectors a and b, respectively; A_i and B_i are the components of attribute vectors A and B, respectively; α is the angle between vectors a and b; and dist(A, B) is the cosine distance between vectors a and b.
As an implementation manner, the entity similarity judging module 300 includes a word vector determining unit 301, a time series feature extracting unit 302 and a similarity judging unit 303;
the word vector determining unit 301 is configured to map the positive sample and the negative sample through an embedding layer weight matrix to obtain word vectors of the positive sample and the negative sample embedding layers, and represent the word vectors as an embedding layer matrix of input data, where the dimensionality of the word vectors is 256 dimensions;
the time series feature extraction unit 302 is configured to extract time series features of the embedding layer word vectors of the positive samples and the negative samples through an LSTM;
the similarity determination unit 303 is configured to perform binary classification through a linear layer, and determine whether the two entities are similar according to the following formula:
wherein W_3 is the weight matrix of the last linear layer; h_t is the final hidden state output of the LSTM network; p is the output probability of whether the entities are similar; ŷ is the result of passing the LSTM output through the linear layer; and softmax is a normalization function applied to ŷ so that the results lie in the interval from 0 to 1.
In one implementation, in the time series feature extraction unit 302, extracting the time series features of the embedding-layer word vectors of the positive and negative samples through the LSTM network includes:
serially inputting the embedding-layer word vectors of the positive and negative samples into an LSTM computing unit, and obtaining Lstm_embedding vector representations in different sequence directions through calculation of the following formula:
wherein the formula defines the input gate (three quantities), the forget gate and the output gate; the parameter $W$ denotes the weight matrix of a linear layer; $x_t$ is the representation vector of the character currently input to the computing unit; $h_{t-1}$ is the hidden-state output corresponding to the previous character; $c_{t-1}$ is the memory-cell state corresponding to the previous character; $b$ denotes the bias weight matrix of the linear layer; and tanh and $\sigma$ are activation functions.
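The gate computation can be pictured with a single step of a textbook LSTM cell. This is a sketch of the standard formulation (input, forget and output gates, tanh candidate, no peephole connections), which may differ from the exact formulas of the patent; the `params` dictionary and its key names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Linear layer: weights @ vec + bias, with weights given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a standard LSTM cell (assumed formulation).

    Every weight matrix acts on the concatenation [h_prev, x_t], so each row
    has len(h_prev) + len(x_t) entries.
    """
    z = h_prev + x_t  # list concatenation, not element-wise addition
    i_t = [sigmoid(v) for v in affine(params["W_i"], params["b_i"], z)]  # input gate
    f_t = [sigmoid(v) for v in affine(params["W_f"], params["b_f"], z)]  # forget gate
    o_t = [sigmoid(v) for v in affine(params["W_o"], params["b_o"], z)]  # output gate
    c_hat = [math.tanh(v) for v in affine(params["W_c"], params["b_c"], z)]  # candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def run_lstm(sequence, params, hidden_size):
    """Feed the word vectors one after another; return the final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
    return h
```

A bidirectional variant, which the phrase "different sequence directions" suggests, would run `run_lstm` once on the sequence and once on its reverse with separate parameters and combine the two final states.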
In one implementation, in the medical knowledge graph vectorization representation module 100, selecting correct triples and incorrect triples from the training set and inputting them into the knowledge graph learning model for training includes:
the correct triple is S(h, l, t) and the incorrect triple is S'(h', l, t) or S'(h, l, t'), wherein h is the head entity, t is the tail entity, l is the relation between h and t, and h' and t' are obtained by replacing the head entity and the tail entity, respectively, with a random entity;
judging the correct triples and the incorrect triples by a distance calculation method, which is as follows:
$$L = \sum_{(h, l, t) \in S} \; \sum_{(h', l, t') \in S'} \left[ \lambda + d(h + l, t) - d(h' + l, t') \right]_{+}$$
wherein $d(h + l, t)$ and $d(h' + l, t')$ are the distances computed for a correct triple and an incorrect triple respectively, $[x]_{+}$ denotes max(0, x), and $\lambda$ is an adjustable hyperparameter.
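The triple corruption and margin-based distance comparison described above can be sketched as follows. The sketch assumes a translation-style scoring in which a triple's distance is the Euclidean distance between head plus relation and tail; the embedding dictionaries, the entity and relation names, and the helper names are all hypothetical.

```python
import math
import random


def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triple_distance(ent, rel, triple):
    """d(h + l, t): distance between the translated head and the tail."""
    h, l, t = triple
    translated = [a + b for a, b in zip(ent[h], rel[l])]
    return l2_distance(translated, ent[t])


def corrupt(triple, entity_ids, rng):
    """Build an incorrect triple by replacing the head or the tail with a random entity.

    Note: the result is not checked against the set of known correct triples.
    """
    h, l, t = triple
    if rng.random() < 0.5:
        return (rng.choice([e for e in entity_ids if e != h]), l, t)
    return (h, l, rng.choice([e for e in entity_ids if e != t]))


def margin_loss(ent, rel, positive, negative, margin=1.0):
    """[margin + d(positive) - d(negative)]_+ with [x]_+ = max(0, x)."""
    gap = margin + triple_distance(ent, rel, positive) - triple_distance(ent, rel, negative)
    return max(0.0, gap)
```

The loss is zero once the correct triple is closer than the incorrect one by at least the margin, so training only updates embeddings for pairs that violate the margin.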
A computer apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the medical similar entity classification method based on the knowledge graph and the clustering algorithm when executing the computer program.
The processor may be a Central Processing Unit (CPU) or another form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the computer to perform desired functions. The memory may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM), cache memory, and the like. Non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor to implement the method steps of the various embodiments of the application and/or other desired functions.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the above-mentioned steps of the method for classifying medically similar entities based on a knowledge-graph and clustering algorithm.
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Finally, it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that, while the invention has been described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (12)
1. A medical similar entity classification method based on knowledge graph and clustering algorithm is characterized by comprising the following steps:
S100, forming a triple data set from the data of a medical database, taking the triple data set as a training set, selecting correct triples and incorrect triples from the training set, inputting them into a knowledge graph learning model for training to generate the trained knowledge graph learning model, obtaining updated vectorized representations of the embedding-layer entities and relations based on the knowledge graph learning model, and thereby obtaining a vectorized medical knowledge graph of the medical database;
S200, based on the obtained vectorized medical knowledge graph of the medical database, obtaining representative vectors of the triples through a mean pooling layer, and clustering the representative vectors of entities and relations by using the unsupervised clustering algorithm K-means to obtain a similar term entity library in the medical knowledge graph;
S300, based on the similar term entity library in the medical knowledge graph, taking entities in the same cluster as positive samples and entities in different clusters as negative samples, inputting the positive samples and the negative samples to train an entity similarity classification model, and performing similarity judgment on entities based on the entity similarity classification model;
the calculation of the entity similarity classification model in step S300 includes:
S301, mapping the positive samples and the negative samples through an embedding-layer weight matrix to obtain the embedding-layer word vectors of the positive and negative samples, and using the word vectors as the embedding-layer matrix representation of the input data;
S302, extracting time series features of the embedding-layer word vectors of the positive and negative samples through an LSTM network;
S303, performing binary classification through a linear layer, and judging whether the entities are similar according to the following formula:
$$P = \mathrm{softmax}(W_3 h_t)$$
wherein $W_3$ is the weight matrix of the last linear layer; $h_t$ is the final hidden-state output of the LSTM network; $W_3 h_t$ is the result of passing the LSTM output through the linear layer; softmax is a normalization function that normalizes $W_3 h_t$ so that the results lie in the interval from 0 to 1; and $P$ is the probability that the final output is similar.
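Two data-preparation steps of the method above, the mean pooling of S200 and the sample construction of S300, can be sketched briefly. The sketch assumes that a triple's representative vector is the component-wise mean of its head, relation and tail vectors, and that positive and negative samples are entity pairs; both readings, and the function names, are assumptions made for illustration.

```python
from itertools import combinations


def mean_pool(vectors):
    """Component-wise mean, e.g. over the head, relation and tail vectors of a triple."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_pairs(cluster_of):
    """Split all entity pairs into positive and negative samples.

    cluster_of : dict mapping entity name -> cluster id from the clustering step.
    Pairs in the same cluster are positive samples; pairs from different
    clusters are negative samples.
    """
    positives, negatives = [], []
    for a, b in combinations(sorted(cluster_of), 2):
        if cluster_of[a] == cluster_of[b]:
            positives.append((a, b))
        else:
            negatives.append((a, b))
    return positives, negatives
```

Enumerating every pair grows quadratically with the number of entities, so a practical pipeline would probably sample negative pairs rather than list them all.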
2. The method for classifying medical similar entities based on a knowledge graph and a clustering algorithm as claimed in claim 1, wherein in step S200, clustering the representative vectors of entities and relations by using the unsupervised clustering algorithm K-means comprises:
S201, randomly selecting K entities in the data set of the medical knowledge graph as center points;
S202, defining a loss function and calculating the similarity between entities;
S203, for each entity in the data set, assigning the entity to the nearest center point according to the calculated cosine distance so as to re-acquire K clusters, and re-calculating the center point of each re-acquired cluster, until the loss function converges.
3. The method for classifying medical similar entities based on a knowledge graph and a clustering algorithm as claimed in claim 2, wherein the loss function in step S202 is:
$$\mathrm{dist}(A, B) = 1 - \cos\alpha = 1 - \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}}$$
wherein $A$ and $B$ are two attribute vectors, $A_i$ and $B_i$ are the components of $A$ and $B$ respectively, $\alpha$ is the angle between the vectors $A$ and $B$, and $\mathrm{dist}(A, B)$ is the cosine distance between the vectors $A$ and $B$.
4. The method for classifying medical similar entities based on a knowledge graph and a clustering algorithm as claimed in claim 1, wherein in step S302, extracting the time series features of the embedding-layer word vectors of the positive and negative samples through the LSTM network comprises:
serially inputting the embedding-layer word vectors of the positive and negative samples into an LSTM computing unit, and obtaining Lstm_embedding vector representations in different sequence directions through calculation of the following formula:
wherein the formula defines the input gate (three quantities), the forget gate and the output gate; $W_i$, $W_f$, $W_c$, $W_{xo}$, $W_{ho}$ and $W_{co}$ denote the weight matrices of their respective linear layers; $b_i$, $b_c$, $b_f$ and $b_o$ denote the bias weight matrices of their respective linear layers; the parameter $W$ denotes the weight matrix of a linear layer; $x_t$ is the representation vector of the character currently input to the computing unit; $h_{t-1}$ is the hidden-state output corresponding to the previous character; $c_{t-1}$ is the memory-cell state corresponding to the previous character; $b$ denotes the bias weight matrix of the linear layer; and tanh and $\sigma$ are activation functions.
5. The method for classifying medical similar entities based on a knowledge graph and a clustering algorithm as claimed in claim 1, wherein in step S100, selecting correct triples and incorrect triples from the training set and inputting them into the knowledge graph learning model for training comprises:
the correct triple is S(h, l, t) and the incorrect triple is S'(h', l, t) or S'(h, l, t'), wherein h is the head entity, t is the tail entity, l is the relation between h and t, and h' and t' are obtained by replacing the head entity and the tail entity, respectively, with a random entity;
judging the correct triples and the incorrect triples by a distance calculation method, which is as follows:
$$L = \sum_{(h, l, t) \in S} \; \sum_{(h', l, t') \in S'} \left[ \lambda + d(h + l, t) - d(h' + l, t') \right]_{+}$$
wherein $d(h + l, t)$ and $d(h' + l, t')$ are the distances computed for a correct triple and an incorrect triple respectively, $[x]_{+}$ denotes max(0, x), and $\lambda$ is an adjustable hyperparameter.
6. A medical similar entity classification system based on a knowledge graph and a clustering algorithm, characterized by comprising: a medical knowledge graph vectorization representation module, a similar term entity library construction module and an entity similarity judgment module;
the medical knowledge graph vectorization representation module is used for forming a triple data set from the data of a medical database, taking the triple data set as a training set, selecting correct triples and incorrect triples from the training set, inputting them into a knowledge graph learning model for training to generate the trained knowledge graph learning model, obtaining, based on the knowledge graph learning model, updated vectorized representations of the embedding-layer entities and relations as the representation vectors of the knowledge graph, and thereby obtaining the vectorized medical knowledge graph of the medical database;
the similar term entity library construction module is used for obtaining representative vectors of the triples through a mean pooling layer based on the obtained vectorized medical knowledge graph of the medical database, and clustering the representative vectors of entities and relations by using the unsupervised clustering algorithm K-means to obtain a similar term entity library in the medical knowledge graph;
the entity similarity judgment module is used for taking entities in the same cluster as positive samples and entities in different clusters as negative samples based on the similar term entity library in the medical knowledge graph, inputting the positive samples and the negative samples to train an entity similarity classification model, and performing similarity judgment on entities based on the entity similarity classification model;
the entity similarity judgment module comprises a word vector determination unit, a time series feature extraction unit and a similarity judgment unit;
the word vector determination unit is used for mapping the positive samples and the negative samples through an embedding-layer weight matrix to obtain the embedding-layer word vectors of the positive and negative samples, and using the word vectors as the embedding-layer matrix representation of the input data;
the time series feature extraction unit is used for extracting time series features of the embedding-layer word vectors of the positive and negative samples through an LSTM network;
the similarity judgment unit is used for performing binary classification through a linear layer and judging whether the entities are similar according to the following formula:
$$P = \mathrm{softmax}(W_3 h_t)$$
wherein $W_3$ is the weight matrix of the last linear layer; $h_t$ is the final hidden-state output of the LSTM network; $W_3 h_t$ is the result of passing the LSTM output through the linear layer; softmax is a normalization function that normalizes $W_3 h_t$ so that the results lie in the interval from 0 to 1; and $P$ is the probability that the final output is similar.
7. The medical similar entity classification system based on a knowledge graph and a clustering algorithm as claimed in claim 6, wherein the similar term entity library construction module comprises a center point selection unit, a similarity calculation unit and a center point re-determination unit;
the center point selection unit is used for randomly selecting K entities in the data set of the medical knowledge graph as center points;
the similarity calculation unit is used for defining a loss function and calculating the similarity between entities;
the center point re-determination unit is used for assigning each entity in the data set to the center point closest to it according to the calculated cosine distance so as to re-acquire K clusters, and re-calculating the center point of each re-acquired cluster, until the loss function converges.
8. The system of claim 7, wherein the loss function in the similarity calculation unit is:
$$\mathrm{dist}(A, B) = 1 - \cos\alpha = 1 - \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}}$$
wherein $A$ and $B$ are two attribute vectors, $A_i$ and $B_i$ are the components of $A$ and $B$ respectively, $\alpha$ is the angle between the vectors $A$ and $B$, and $\mathrm{dist}(A, B)$ is the cosine distance between the vectors $A$ and $B$.
9. The system of claim 6, wherein, in the time series feature extraction unit, extracting the time series features of the embedding-layer word vectors of the positive and negative samples through the LSTM network comprises:
serially inputting the embedding-layer word vectors of the positive and negative samples into an LSTM computing unit, and obtaining Lstm_embedding vector representations in different sequence directions through calculation of the following formula:
wherein the formula defines the input gate (three quantities), the forget gate and the output gate; $W_i$, $W_f$, $W_c$, $W_{xo}$, $W_{ho}$ and $W_{co}$ denote the weight matrices of their respective linear layers; $b_i$, $b_c$, $b_f$ and $b_o$ denote the bias weight matrices of their respective linear layers; the parameter $W$ denotes the weight matrix of a linear layer; $x_t$ is the representation vector of the character currently input to the computing unit; $h_{t-1}$ is the hidden-state output corresponding to the previous character; $c_{t-1}$ is the memory-cell state corresponding to the previous character; $b$ denotes the bias weight matrix of the linear layer; and tanh and $\sigma$ are activation functions.
10. The system of claim 6, wherein, in the medical knowledge graph vectorization representation module, selecting correct triples and incorrect triples from the training set and inputting them into the knowledge graph learning model for training comprises:
the correct triple is S(h, l, t) and the incorrect triple is S'(h', l, t) or S'(h, l, t'), wherein h is the head entity, t is the tail entity, l is the relation between h and t, and h' and t' are obtained by replacing the head entity and the tail entity, respectively, with a random entity;
judging the correct triples and the incorrect triples by a distance calculation method, which is as follows:
$$L = \sum_{(h, l, t) \in S} \; \sum_{(h', l, t') \in S'} \left[ \lambda + d(h + l, t) - d(h' + l, t') \right]_{+}$$
wherein $(h' + l, t')$ represents an incorrect triple, $[x]_{+}$ denotes max(0, x), and $\lambda$ is an adjustable hyperparameter.
11. A computer device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method for classifying medically similar entities based on a knowledge-graph and clustering algorithm according to any one of claims 1 to 5 when executing the computer program.
12. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the steps of the method for classifying medically similar entities based on a knowledge-graph and clustering algorithm according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210856458.1A CN115080764B (en) | 2022-07-21 | 2022-07-21 | Medical similar entity classification method and system based on knowledge graph and clustering algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115080764A | 2022-09-20 |
CN115080764B | 2022-11-01 |
Family
ID=83259597
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115080764B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115329102B (en) * | 2022-10-12 | 2023-02-03 | 北京道达天际科技股份有限公司 | Knowledge representation learning method based on news knowledge graph |
CN115687932B (en) * | 2022-12-23 | 2023-03-28 | 阿里健康科技(中国)有限公司 | Multi-element group data labeling method, model training method, device, equipment and medium |
CN115859987B (en) * | 2023-01-19 | 2023-06-16 | 阿里健康科技(中国)有限公司 | Entity mention identification module, and linking method, device and medium thereof |
CN117010494B (en) * | 2023-09-27 | 2024-01-05 | 之江实验室 | Medical data generation method and system based on causal expression learning |
CN117708306B (en) * | 2024-02-06 | 2024-05-03 | 神州医疗科技股份有限公司 | Medical question-answering architecture generation method and system based on layered question-answering structure |
CN117747124A (en) * | 2024-02-20 | 2024-03-22 | 浙江大学 | Medical large model logic inversion method and system based on network excitation graph decomposition |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334211A (en) * | 2019-06-14 | 2019-10-15 | 电子科技大学 | A kind of Chinese medicine diagnosis and treatment knowledge mapping method for auto constructing based on deep learning |
CN111291191A (en) * | 2018-12-07 | 2020-06-16 | 国家新闻出版广电总局广播科学研究院 | Radio and television knowledge graph construction method and device |
CN112364174A (en) * | 2020-10-21 | 2021-02-12 | 山东大学 | Patient medical record similarity evaluation method and system based on knowledge graph |
CN113111180A (en) * | 2021-03-22 | 2021-07-13 | 杭州祺鲸科技有限公司 | Chinese medical synonym clustering method based on deep pre-training neural network |
CN114564966A (en) * | 2022-03-04 | 2022-05-31 | 中国科学院地理科学与资源研究所 | Spatial relation semantic analysis method based on knowledge graph |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3905097A1 (en) * | 2020-04-30 | 2021-11-03 | Robert Bosch GmbH | Device and method for determining a knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||