CN115080764B - Medical similar entity classification method and system based on knowledge graph and clustering algorithm - Google Patents

Medical similar entity classification method and system based on knowledge graph and clustering algorithm

Info

Publication number
CN115080764B
CN115080764B (application number CN202210856458.1A)
Authority
CN
China
Prior art keywords
entity
medical
similar
entities
vectors
Prior art date
Legal status
Active
Application number
CN202210856458.1A
Other languages
Chinese (zh)
Other versions
CN115080764A (en)
Inventor
刘硕
杨雅婷
宋佳祥
朱宁
白焜太
许娟
史文钊
Current Assignee
Digital Health China Technologies Co Ltd
Original Assignee
Digital Health China Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Digital Health China Technologies Co Ltd filed Critical Digital Health China Technologies Co Ltd
Priority to CN202210856458.1A priority Critical patent/CN115080764B/en
Publication of CN115080764A publication Critical patent/CN115080764A/en
Application granted granted Critical
Publication of CN115080764B publication Critical patent/CN115080764B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Animal Behavior & Ethology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of knowledge graphs, and in particular to a medical similar entity classification method and system based on a knowledge graph and a clustering algorithm. The method comprises: forming the data of a medical database into a triple data set, using the triple data set as a training set, and training a knowledge graph learning model to obtain a vectorized medical knowledge graph of the medical database; passing the triples through a mean pooling layer to obtain representative vectors of the triples, and clustering the representative vectors of entities and relations with the unsupervised clustering algorithm Kmeans to obtain a similar-term entity library of the medical knowledge graph; taking entities in the same cluster as positive samples and entities in different clusters as negative samples, inputting the positive and negative samples to train an entity similarity classification model, and judging entity similarity based on the trained entity similarity classification model. The invention removes the tedious manual labeling of similar entities and realizes accurate construction of the medical knowledge graph without manual work.

Description

Medical similar entity classification method and system based on knowledge graph and clustering algorithm
Technical Field
The invention relates to the technical field of knowledge graphs, in particular to a medical similar entity classification method and system based on knowledge graphs and a clustering algorithm.
Background
A knowledge graph is composed of nodes and edges; such a multi-relation graph generally comprises several types of nodes and several types of edges. Entities (nodes) refer to things in the real world such as people, place names, concepts, drugs, companies, etc., and relationships (edges) express some kind of connection between different entities, for example a person "lives in" Beijing, Zhang San and Li Si are "friends", or logistic regression is "prerequisite knowledge" for deep learning.
At present, applications based on medical knowledge graphs are widespread, such as knowledge-graph-based intelligent question answering, visualization and search. However, a similar entity classification task that requires no manual labeling on top of a constructed knowledge graph has yet to be developed, which makes the construction of medical knowledge graphs difficult.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the invention provides a medical similar entity classification method and system based on a knowledge graph and a clustering algorithm, so as to solve the difficulty of classifying similar entities without manual labeling on the basis of an established knowledge graph, and to realize classification of similar entities in the knowledge graph without manual labeling.
In order to solve the problems, the invention adopts the following technical scheme:
Since current similar entity classification tasks rely on manually labeling whether entities are similar, a similar entity classification task that needs no manual labeling is provided: first, the entity and relation nodes in the knowledge graph are converted into vector representations; clustering is then performed on the triples formed by these vectorized entity and relation nodes, and similar entities are obtained from the clusters; positive and negative samples are constructed from the clustering result of the similar entities, and these samples serve as input data for training a similar entity classification model.
In a first aspect, the present invention provides a method for classifying medical similar entities based on a knowledge graph and a clustering algorithm, comprising:
s100, forming a triple data set by data of a medical database, taking the triple data set as a training set, selecting correct triples and error triples from the training set, inputting a knowledge graph learning model for training, generating the knowledge graph learning model, obtaining updated vectorization representations of embedded layer entities and relations based on the knowledge graph learning model, and obtaining a medical knowledge graph represented vectorially by the medical database;
s200, based on the obtained vectorization-expressed medical knowledge graph of the medical database, obtaining representative vectors of triples from the triples through a mean pooling layer, and clustering the representative vectors of entities and relations by using an unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge graph;
s300, based on the similar term entity library in the medical knowledge graph, taking the entities in the same cluster as positive samples, taking the entities in different clusters as negative samples, inputting the positive samples and the negative samples, training an entity similar classification model, and performing similar judgment on the entities based on the entity similar classification model.
As an implementation manner, in the step S200, the clustering the representative vectors of the entities and the relations by using an unsupervised clustering algorithm Kmeans includes:
s201, randomly selecting K entities as central points in a data set of the medical knowledge graph;
s202, defining a loss function, and calculating the similarity between entities;
and S203, for each entity in the data set, distributing the entity to the nearest central point according to the calculated cosine distance, re-acquiring K clusters, and for each re-acquired cluster, re-calculating the central point of the cluster until the loss function is converged.
As an implementation manner, the loss function in step S202 is:
$$\cos\alpha = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

$$\mathrm{dist}(A, B) = 1 - \cos\alpha$$

where A and B are the attribute vectors of the hypothetical vectors a and b, A_i and B_i respectively denote the components of the attribute vectors A and B, α is the angle between the vectors a and b, and dist(A, B) denotes the cosine distance between the vectors a and b.
As an implementation manner, the calculating of the entity similarity classification model in step S300 includes:
S301, mapping the positive samples and the negative samples through an embedding layer weight matrix to obtain word vectors of the positive-sample and negative-sample embedding layers, and using the word vectors as the embedding layer matrix representation of the input data, wherein the dimensionality of the embedding-layer word vectors is 256;
s302, extracting time series characteristics of word vectors of the positive sample embedding layer and the negative sample embedding layer through the inside of lstm;
S303, performing binary classification through the linear layer, and judging whether the two entities are similar according to the following formulas:

$$\hat{y} = W_3\,h_t$$

$$p = \mathrm{softmax}(\hat{y})$$

where W_3 is the weight matrix of the last linear layer; h_t is the final hidden-state output of the LSTM network; ŷ is the result output by the LSTM after passing through the linear layer; p is the finally output probability value of whether the entities are similar; and softmax is a normalization function that normalizes ŷ so that the results are distributed in the interval from 0 to 1.
As an implementable manner, in step S302, the extracting of the time-series features of the word vectors of the positive-sample and negative-sample embedding layers through the interior of the LSTM includes:

serially inputting the word vectors of the positive-sample and negative-sample embedding layers into an LSTM computing unit, and obtaining Lstm_embedding vector representations in different sequence directions through the following formulas:

$$i_t = \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right)$$

$$f_t = \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right)$$

$$o_t = \sigma\left(W_{xo}\,x_t + W_{ho}\,h_{t-1} + W_{co}\,c_t + b_o\right)$$

$$h_t = o_t \odot \tanh(c_t)$$

where i_t is the input gate, f_t is the forget gate, and o_t is the output gate; the parameters W denote the weight matrices of the linear layers for the memory cell; x_t is the representative vector corresponding to the character currently input to the computing module; h_{t-1} is the hidden-layer state output corresponding to the previous character; c_{t-1} is the memory-cell state corresponding to the previous character; b denotes the bias weight matrix of the linear layer; and tanh and σ are activation functions.
As an implementation manner, in the step S100, the selecting correct triples and incorrect triples from the training set, and inputting a knowledge-graph learning model for training includes:
the correct triplet is S (h, l, t), the error triplet is S '(h', l, t) or S '(h, l, t'), wherein h is a head entity, t is a tail entity, l is the relation between h and t, and h 'and t' are respectively obtained by replacing the head entity and the tail entity by a random entity;
and judging the correct triples and the incorrect triples by the following distance calculation:

$$d(h + l,\ t) = \lVert h + l - t \rVert$$

the loss function L being:

$$L = \sum_{(h,l,t)\in S}\ \sum_{(h',l,t')\in S'} \left[\lambda + d(h + l,\ t) - d(h' + l,\ t')\right]_{+}$$

where [x]+ denotes max(0, x) and λ is an adjustable hyper-parameter.
In a second aspect, the present invention provides a medical similar entity classification system based on knowledge graph and clustering algorithm, including: the system comprises a medical knowledge map vectorization representation module, a similar term entity library construction module and an entity similarity judgment module;
the medical knowledge map vectorization representation module is used for forming a triple data set by data of a medical database, taking the triple data set as a training set, selecting correct triples and error triples from the training set, inputting a knowledge map learning model for training, generating a knowledge map learning model, obtaining vectorization representations of updated embedded layer entities and relations as representation vectors of a knowledge map based on the knowledge map learning model, and obtaining the vectorization representations of the medical knowledge map of the medical database;
the similar term entity library construction module is used for acquiring representative vectors of triples through a mean pooling layer based on the obtained vectorized medical knowledge map of the medical database, and clustering the representative vectors of entities and relations by using an unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge map;
and the entity similarity judgment module is used for taking the entities in the same cluster as positive samples and the entities in different clusters as negative samples based on the similar term entity library in the medical knowledge graph, inputting the positive samples and the negative samples, training an entity similarity classification model, and performing similarity judgment on the entities based on the entity similarity classification model.
As an implementation manner, the similar term entity library construction module comprises a central point selection unit, a similarity calculation unit and a central point re-determination unit;
the central point selecting unit is used for randomly selecting K entities as central points in the data set of the medical knowledge map;
the similarity calculation unit is used for defining a loss function and calculating the similarity between entities;
and the central point re-determining unit is used for distributing each entity in the data set to a central point closest to the entity according to the calculated cosine distance, re-acquiring K clusters, and re-calculating the central point of each cluster for each newly acquired cluster until the loss function is converged.
As an implementable manner, the loss function in the similarity calculation unit is:
$$\cos\alpha = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

$$\mathrm{dist}(A, B) = 1 - \cos\alpha$$

where A and B are the attribute vectors of the hypothetical vectors a and b, A_i and B_i respectively denote the components of the attribute vectors A and B, α is the angle between the vectors a and b, and dist(A, B) denotes the cosine distance between the vectors a and b.
As an implementation manner, the entity similarity judgment module comprises a word vector determination unit, a time series feature extraction unit and a similarity judgment unit;
the word vector determining unit is used for mapping the positive samples and the negative samples through an embedding layer weight matrix to obtain word vectors of the positive-sample and negative-sample embedding layers, which are used as the embedding layer matrix representation of the input data, the dimensionality of the word vectors being 256;
the time series feature extraction unit is used for extracting time series features of the word vectors of the positive sample embedding layer and the negative sample embedding layer through the inside of lstm;
the similarity judging unit is used for performing binary classification through the linear layer and judging whether the two entities are similar according to the following formulas:

$$\hat{y} = W_3\,h_t$$

$$p = \mathrm{softmax}(\hat{y})$$

where W_3 is the weight matrix of the last linear layer; h_t is the final hidden-state output of the LSTM network; ŷ is the result output by the LSTM after passing through the linear layer; p is the finally output probability value of whether the entities are similar; and softmax is a normalization function that normalizes ŷ so that the results are distributed in the interval from 0 to 1.
As an implementable manner, in the time-series feature extraction unit, the extracting time-series features of the word vectors of the positive sample and negative sample embedding layers through the lstm interior includes:
serially inputting the word vectors of the positive sample embedding layer and the negative sample embedding layer into an LSTM calculating unit, and obtaining Lstm _ embedding vector representations in different sequence directions through calculation of the following formula:
$$i_t = \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right)$$

$$f_t = \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right)$$

$$o_t = \sigma\left(W_{xo}\,x_t + W_{ho}\,h_{t-1} + W_{co}\,c_t + b_o\right)$$

$$h_t = o_t \odot \tanh(c_t)$$

where i_t is the input gate, f_t is the forget gate, and o_t is the output gate; the parameters W denote the weight matrices of the linear layers for the memory cell; x_t is the representative vector corresponding to the character currently input to the computing module; h_{t-1} is the hidden-layer state output corresponding to the previous character; c_{t-1} is the memory-cell state corresponding to the previous character; b denotes the bias weight matrix of the linear layer; and tanh and σ are activation functions.
As an implementation manner, the selecting, by the medical knowledge-graph vectorization representation module, correct triples and incorrect triples from the training set, and inputting a knowledge-graph learning model for training includes:
the correct triplet is S (h, l, t), the error triplet is S '(h', l, t) or S '(h, l, t'), wherein h is a head entity, t is a tail entity, l is the relation between h and t, and h 'and t' are respectively obtained by replacing the head entity and the tail entity by a random entity;
and judging the correct triples and the incorrect triples by the following distance calculation:

$$d(h + l,\ t) = \lVert h + l - t \rVert$$

the loss function L being:

$$L = \sum_{(h,l,t)\in S}\ \sum_{(h',l,t')\in S'} \left[\lambda + d(h + l,\ t) - d(h' + l,\ t')\right]_{+}$$

where [x]+ denotes max(0, x) and λ is an adjustable hyper-parameter.
In a third aspect, the invention provides a computer apparatus comprising:
a memory for storing a computer program;
and the processor is used for realizing the steps of the medical similar entity classification method based on the knowledge graph and the clustering algorithm when executing the computer program.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the above medical similar entity classification method based on the knowledge-graph and clustering algorithm.
The invention has the following beneficial effects: the medical similar entity classification method and system based on the knowledge graph and clustering algorithm construct a vectorized medical knowledge graph, cluster it with an unsupervised clustering algorithm, and judge entity similarity with an LSTM-based entity similarity classification model, thereby forming an accurate medical knowledge graph.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings, in which:
fig. 1 is a flow chart of a medical similar entity classification method based on a knowledge graph and a clustering algorithm according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of clustering representative vectors of entities and relationships by using an unsupervised clustering algorithm Kmeans according to the embodiment of the present invention.
FIG. 3 is a schematic diagram of a calculation process of the entity similarity classification model according to the embodiment of the present invention.
Fig. 4 is a schematic diagram of a medical similar entity classification system based on a knowledge graph and a clustering algorithm according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
It should be noted that these examples are only for illustrating the present invention, not for limiting the present invention, and that the simple modification of the method based on the idea of the present invention is within the scope of the claimed invention.
The method comprises the steps of converting entity nodes and relation nodes in a knowledge graph into embedding vector representation, clustering based on the vectorized entity nodes and the vectorized triple representation of the relation, obtaining similar entities, constructing positive and negative samples according to the clustering results of the similar entities, and using the positive and negative samples as input data to train a similar entity classification model.
Referring to fig. 1, a method for classifying medical similar entities based on a knowledge graph and a clustering algorithm includes:
s100, forming a triple data set by data of a medical database, taking the triple data set as a training set, selecting correct triples and incorrect triples from the training set, inputting a knowledge graph learning model for training, generating the knowledge graph learning model, obtaining updated vectorization representation of embedded layer entities and relations as representation vectors of a knowledge graph based on the knowledge graph learning model, and obtaining the vectorization representation of the medical knowledge graph of the medical database.
The correct triplet is S (h, l, t), the error triplet is S '(h', l, t) or S '(h, l, t'), wherein h is a head entity, t is a tail entity, l is the relation between h and t, and h 'and t' are respectively obtained by replacing the head entity and the tail entity by a random entity;
judging the correct triples and the incorrect triples by the following distance calculation:

$$d(h + l,\ t) = \lVert h + l - t \rVert$$

An initialized entity vector and relation vector are randomly generated and then normalized, so that the knowledge graph is expressed by these vectors. The goal of the whole algorithm is to determine the undetermined parameters, namely the entity vectors and relation vectors: the method treats the vector of every element of the knowledge graph in the knowledge base as a parameter to be determined, and the objective is to obtain all of the undetermined coefficients, that is, to obtain the best network, with which prediction can then be carried out. How is each undetermined parameter found? As with linear regression, a loss function L is introduced:

$$L = \sum_{(h,l,t)\in S}\ \sum_{(h',l,t')\in S'} \left[\lambda + d(h + l,\ t) - d(h' + l,\ t')\right]_{+}$$

where [x]+ denotes max(0, x) and λ is an adjustable hyper-parameter.
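As a concrete illustration of the training procedure described above, the following sketch implements a TransE-style embedding update with the margin loss just given; it is only a minimal example, and the library (PyTorch), entity/relation counts, embedding dimension, margin and batch handling are assumptions made for illustration rather than details fixed by this disclosure.

```python
# Minimal TransE-style sketch of step S100 (assumptions: PyTorch, toy sizes, L2 distance).
import torch
import torch.nn.functional as F

num_entities, num_relations, dim, margin = 1000, 50, 128, 1.0  # placeholder sizes

entity_emb = torch.nn.Embedding(num_entities, dim)
relation_emb = torch.nn.Embedding(num_relations, dim)
optimizer = torch.optim.Adam(
    list(entity_emb.parameters()) + list(relation_emb.parameters()), lr=1e-3)

def distance(h, l, t):
    # d(h + l, t) = ||h + l - t||
    return torch.norm(entity_emb(h) + relation_emb(l) - entity_emb(t), p=2, dim=-1)

def train_step(pos_triples):
    # pos_triples: LongTensor of shape (batch, 3) holding (head, relation, tail) ids
    h, l, t = pos_triples.t()
    # corrupt either the head or the tail with a random entity to build the error triple
    corrupt_head = torch.rand(len(h)) < 0.5
    rand_ent = torch.randint(0, num_entities, (len(h),))
    h_neg = torch.where(corrupt_head, rand_ent, h)
    t_neg = torch.where(corrupt_head, t, rand_ent)
    # margin ranking loss: [lambda + d(pos) - d(neg)]_+
    loss = F.relu(margin + distance(h, l, t) - distance(h_neg, l, t_neg)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # keep entity vectors normalized, as described above
    with torch.no_grad():
        entity_emb.weight.data = F.normalize(entity_emb.weight.data, dim=-1)
    return loss.item()
```

After training converges, the embedding weights serve as the vectorized representation of the medical knowledge graph used in the following steps.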
S200, based on the obtained vectorization-expressed medical knowledge graph of the medical database, obtaining representative vectors of triples from the triples through a mean pooling layer, and clustering the representative vectors of entities and relations by using an unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge graph.
For each obtained triple (h, r, t), the vector representations h', r' and t' of h, r and t are passed through a mean pooling layer, and the pooled vector is used as the representative vector g of the triple. The mean pooling layer formula is defined as:

$$g = \frac{1}{3}\left(h' + r' + t'\right)$$
referring to fig. 2, as an implementation manner, the clustering the representative vectors of the entities and the relations by using an unsupervised clustering algorithm Kmeans includes:
s201, randomly selecting K entities as central points in a data set of the medical knowledge graph;
s202, defining a loss function, and calculating the similarity between entities;
s203, for each entity in the data set, distributing the entity to the central point closest to the entity according to the calculated cosine distance, re-acquiring K clusters, and for each re-acquired cluster, re-calculating the central point of the cluster until the loss function converges.
Wherein, the loss function in step S202 is:
$$\cos\alpha = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

$$\mathrm{dist}(A, B) = 1 - \cos\alpha$$

where A and B are the attribute vectors of the hypothetical vectors a and b, A_i and B_i respectively denote the components of the attribute vectors A and B, α is the angle between the vectors a and b, and dist(A, B) denotes the cosine distance between the vectors a and b.
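To make steps S201 to S203 concrete, the sketch below pools each triple into a representative vector and runs a Kmeans-style loop under the cosine distance defined above; the NumPy implementation, the data shapes, the value of K and the convergence tolerance are illustrative assumptions only.

```python
# Illustrative sketch of step S200 (assumptions: NumPy, toy data, K and tolerance chosen arbitrarily).
import numpy as np

def representative_vector(h_vec, r_vec, t_vec):
    # mean pooling over the three vectors of a triple: g = (h' + r' + t') / 3
    return (h_vec + r_vec + t_vec) / 3.0

def cosine_distance(a, b):
    # dist(A, B) = 1 - cos(alpha)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 1.0 - cos

def kmeans_cosine(vectors, k, n_iter=100, tol=1e-4):
    rng = np.random.default_rng(0)
    # S201: randomly pick K vectors as initial centers
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    prev_loss = np.inf
    for _ in range(n_iter):
        # S203: assign each vector to the nearest center under cosine distance
        dists = np.array([[cosine_distance(v, c) for c in centers] for v in vectors])
        labels = dists.argmin(axis=1)
        loss = dists[np.arange(len(vectors)), labels].sum()
        # recompute each center as the mean of its cluster
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        if abs(prev_loss - loss) < tol:  # stop once the loss has converged
            break
        prev_loss = loss
    return labels, centers
```

Entities whose representative vectors fall into the same cluster then form one group of the similar-term entity library.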
S300, based on the similar term entity library in the medical knowledge graph, taking the entities in the same cluster as positive samples, taking the entities in different clusters as negative samples, inputting the positive samples and the negative samples, training an entity similar classification model, and performing similar judgment on the entities based on the entity similar classification model.
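As one possible way of assembling the training data mentioned in step S300, the following sketch builds positive pairs from entities that share a cluster and negative pairs across clusters; the pair counts, sampling strategy and helper names are hypothetical and chosen only for this example.

```python
# Sketch: build positive / negative entity pairs from the Kmeans cluster labels
# (assumption: `labels` comes from the clustering sketch above, `entities` is a parallel list of names).
import itertools
import random

def build_pairs(entities, labels, n_negative=1000, seed=0):
    rng = random.Random(seed)
    positives, negatives = [], []
    # group entities by cluster id
    by_cluster = {}
    for ent, lab in zip(entities, labels):
        by_cluster.setdefault(int(lab), []).append(ent)
    # positive samples: every pair of entities inside the same cluster (label 1)
    for members in by_cluster.values():
        positives.extend((a, b, 1) for a, b in itertools.combinations(members, 2))
    # negative samples: random pairs drawn from two different clusters (label 0)
    cluster_ids = list(by_cluster)
    while len(negatives) < n_negative and len(cluster_ids) > 1:
        c1, c2 = rng.sample(cluster_ids, 2)
        negatives.append((rng.choice(by_cluster[c1]), rng.choice(by_cluster[c2]), 0))
    return positives + negatives
```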
Referring to fig. 3, as an implementation manner, the calculating of the entity similarity classification model in step S300 includes:
S301, mapping the positive samples and the negative samples through an embedding layer weight matrix to obtain word vectors of the positive-sample and negative-sample embedding layers, and using the word vectors as the embedding layer matrix representation of the input data, wherein the dimensionality of the embedding-layer word vectors is 256;
s302, extracting time series characteristics of word vectors of the positive sample embedding layer and the negative sample embedding layer through the inside of lstm;
S303, performing binary classification through the linear layer, and judging whether the two entities are similar according to the following formulas:

$$\hat{y} = W_3\,h_t$$

$$p = \mathrm{softmax}(\hat{y})$$

where W_3 is the weight matrix of the last linear layer; h_t is the final hidden-state output of the LSTM network; ŷ is the result output by the LSTM after passing through the linear layer; p is the finally output probability value of whether the entities are similar; and softmax is a normalization function that normalizes ŷ so that the results are distributed in the interval from 0 to 1.
Specifically, in the step S302, the extracting time-series features of the word vectors of the positive sample and the negative sample embedding layer through the lstm interior includes:
serially inputting the word vectors of the positive sample embedding layer and the negative sample embedding layer into an LSTM calculating unit, and obtaining Lstm _ embedding vector representations in different sequence directions through calculation of the following formula:
$$i_t = \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right)$$

$$f_t = \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right)$$

$$o_t = \sigma\left(W_{xo}\,x_t + W_{ho}\,h_{t-1} + W_{co}\,c_t + b_o\right)$$

$$h_t = o_t \odot \tanh(c_t)$$

where i_t is the input gate, f_t is the forget gate, and o_t is the output gate; the parameters W denote the weight matrices of the linear layers for the memory cell; x_t is the representative vector corresponding to the character currently input to the computing module; h_{t-1} is the hidden-layer state output corresponding to the previous character; c_{t-1} is the memory-cell state corresponding to the previous character; b denotes the bias weight matrix of the linear layer; and tanh and σ are activation functions.
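For illustration, the sketch below wires steps S301 to S303 together with PyTorch: a 256-dimensional embedding layer, an LSTM that extracts the sequence features, and a final linear layer followed by softmax; the vocabulary size, hidden size and the way an entity pair is encoded as a token sequence are assumptions made only for this sketch, not details taken from the disclosure.

```python
# Minimal sketch of the entity similarity classifier (assumptions: PyTorch; toy vocabulary;
# an entity pair is encoded as one concatenated token-id sequence).
import torch
import torch.nn as nn

class EntitySimilarityClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # S301: 256-dim embedding-layer word vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # S302: time-series feature extraction
        self.linear = nn.Linear(hidden_dim, 2)                        # S303: W3, binary classification

    def forward(self, pair_ids):
        # pair_ids: (batch, seq_len) token ids for an entity pair
        emb = self.embedding(pair_ids)
        _, (h_t, _) = self.lstm(emb)          # h_t: final hidden state of the LSTM
        logits = self.linear(h_t[-1])         # y_hat = W3 * h_t
        return torch.softmax(logits, dim=-1)  # p: probability of "similar" vs "not similar"

# Training-loop sketch: entities from the same cluster form positive pairs (label 1),
# entities from different clusters form negative pairs (label 0).
model = EntitySimilarityClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()                      # negative log-likelihood on log-probabilities

pair_ids = torch.randint(0, 5000, (8, 12))    # placeholder batch of 8 encoded entity pairs
labels = torch.randint(0, 2, (8,))            # placeholder similar / not-similar labels
probs = model(pair_ids)
loss = criterion(torch.log(probs + 1e-12), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```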
Referring to fig. 4, the system for classifying medical similar entities based on knowledge graph and clustering algorithm includes: the medical knowledge map vectorization representation module 100, the similar term entity library construction module 200 and the entity similarity judgment module 300;
the medical knowledge graph vectorization representation module 100 is configured to form a triple data set from data of a medical database, use the triple data set as a training set, select a correct triple and an incorrect triple from the training set, input a knowledge graph learning model for training, generate a knowledge graph learning model, obtain a vectorized representation of updated embedded layer entities and relationships as a representation vector of a knowledge graph based on the knowledge graph learning model, and obtain a vectorized representation of the medical knowledge graph of the medical database;
the similar term entity library construction module 200 is configured to obtain, based on the obtained vectorized medical knowledge graph of the medical database, representative vectors of triples from the triples through a mean pooling layer, and perform clustering on the representative vectors of entities and relations by using an unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge graph;
the entity similarity determination module 300 is configured to use entities in the same cluster as a positive sample and entities in different clusters as negative samples based on a similar term entity library in the medical knowledge graph, input the positive sample and the negative sample, train an entity similarity classification model, and perform similarity determination on the entities based on the entity similarity classification model.
As an implementation manner, the similar term entity library construction module 200 includes a central point selection unit 201, a similarity calculation unit 202, and a central point re-determination unit 203;
the central point selecting unit 201 is configured to randomly select K entities as central points in the data set of the medical knowledge graph;
the similarity calculation unit 202 is configured to define a loss function and calculate a similarity between entities;
the center point re-determining unit 203 is configured to, for each entity in the data set, allocate the entity to a center point closest to the entity according to the calculated cosine distance, re-acquire K clusters, and re-calculate a center point of each newly acquired cluster until the loss function converges.
As an implementable embodiment, the loss function in the similarity calculation unit 202 is:
$$\cos\alpha = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

$$\mathrm{dist}(A, B) = 1 - \cos\alpha$$

where A and B are the attribute vectors of the hypothetical vectors a and b, A_i and B_i respectively denote the components of the attribute vectors A and B, α is the angle between the vectors a and b, and dist(A, B) denotes the cosine distance between the vectors a and b.
As an implementation manner, the entity similarity judging module 300 includes a word vector determining unit 301, a time series feature extracting unit 302 and a similarity judging unit 303;
the word vector determining unit 301 is configured to map the positive samples and the negative samples through an embedding layer weight matrix to obtain word vectors of the positive-sample and negative-sample embedding layers, which are used as the embedding layer matrix representation of the input data, where the dimensionality of the word vectors is 256;
the time series feature extraction unit 302 is configured to extract time series features of the word vectors of the positive sample and negative sample embedding layers through an lstm interior;
the similarity determination unit 303 is configured to perform binary classification through a linear layer and determine whether the two entities are similar according to the following formulas:

$$\hat{y} = W_3\,h_t$$

$$p = \mathrm{softmax}(\hat{y})$$

where W_3 is the weight matrix of the last linear layer; h_t is the final hidden-state output of the LSTM network; ŷ is the result output by the LSTM after passing through the linear layer; p is the finally output probability value of whether the entities are similar; and softmax is a normalization function that normalizes ŷ so that the results are distributed in the interval from 0 to 1.
As an implementation manner, in the time-series feature extraction unit 302, the internally extracting, by lstm, the time-series features of the word vectors of the positive sample and the negative sample embedding layer includes:
serially inputting the word vectors of the positive sample embedding layer and the negative sample embedding layer into an LSTM calculating unit, and obtaining Lstm _ embedding vector representations in different sequence directions through calculation of the following formula:
$$i_t = \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right)$$

$$f_t = \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right)$$

$$o_t = \sigma\left(W_{xo}\,x_t + W_{ho}\,h_{t-1} + W_{co}\,c_t + b_o\right)$$

$$h_t = o_t \odot \tanh(c_t)$$

where i_t is the input gate, f_t is the forget gate, and o_t is the output gate; the parameters W denote the weight matrices of the linear layers for the memory cell; x_t is the representative vector corresponding to the character currently input to the computing module; h_{t-1} is the hidden-layer state output corresponding to the previous character; c_{t-1} is the memory-cell state corresponding to the previous character; b denotes the bias weight matrix of the linear layer; and tanh and σ are activation functions.
As an implementation, the selecting, in the medical knowledge-graph vectorization representation module 100, correct triples and incorrect triples from the training set, and inputting a knowledge-graph learning model for training includes:
the correct triple is S (h, l, t), the error triple is S '(h', l, t) or S '(h, l, t'), wherein h is a head entity, t is a tail entity, l is the relation between h and t, and h 'and t' are respectively obtained by replacing the head entity and the tail entity by a random entity;
judging the correct triples and the incorrect triples by the following distance calculation:

$$d(h + l,\ t) = \lVert h + l - t \rVert$$

the loss function L being:

$$L = \sum_{(h,l,t)\in S}\ \sum_{(h',l,t')\in S'} \left[\lambda + d(h + l,\ t) - d(h' + l,\ t')\right]_{+}$$

where [x]+ denotes max(0, x) and λ is an adjustable hyper-parameter.
A computer apparatus, comprising:
a memory for storing a computer program;
and the processor is used for realizing the steps of the medical similar entity classification method based on the knowledge graph and the clustering algorithm when executing the computer program.
The processor may be a Central Processing Unit (CPU) or another form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the computer to perform desired functions. The memory may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. Non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on a computer-readable storage medium and executed by a processor to implement the above method steps of the various embodiments of the application and/or other desired functions.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the above-mentioned steps of the method for classifying medically similar entities based on a knowledge-graph and clustering algorithm.
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Finally, it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that, while the invention has been described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. A medical similar entity classification method based on knowledge graph and clustering algorithm is characterized by comprising the following steps:
s100, forming a triple data set by data of a medical database, taking the triple data set as a training set, selecting correct triples and error triples from the training set, inputting a knowledge graph learning model for training, generating the knowledge graph learning model, obtaining updated vectorization representations of embedded layer entities and relations based on the knowledge graph learning model, and obtaining a medical knowledge graph represented vectorially by the medical database;
s200, based on the obtained vectorization-expressed medical knowledge graph of the medical database, obtaining representative vectors of triples from the triples through a mean pooling layer, and clustering the representative vectors of entities and relations by using an unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge graph;
s300, based on a similar term entity library in the medical knowledge graph, taking entities in the same cluster as positive samples, taking entities in different clusters as negative samples, inputting the positive samples and the negative samples, training an entity similar classification model, and performing similar judgment on the entities based on the entity similar classification model;
the calculation of the entity similarity classification model in step S300 includes:
s301, mapping the positive sample and the negative sample through an embedded layer weight matrix to obtain word vectors of the embedded layers of the positive sample and the negative sample, and representing the word vectors as an embedded layer matrix of input data;
s302, extracting time series characteristics of word vectors of the positive sample embedding layer and the negative sample embedding layer through the inside of lstm;
S303, performing binary classification through the linear layer, and judging whether the two entities are similar according to the following formulas:

$$\hat{y} = W_3\,h_t$$

$$p = \mathrm{softmax}(\hat{y})$$

wherein W_3 is the weight matrix of the last linear layer; h_t is the final hidden-state output of the LSTM network; ŷ is the result output by the LSTM after passing through the linear layer; p is the finally output probability value of whether the entities are similar; and softmax is a normalization function that normalizes ŷ so that the results are distributed in the interval from 0 to 1.
2. The method for classifying medical similar entities based on knowledge graph and clustering algorithm as claimed in claim 1, wherein in the step S200, the clustering the representative vectors of entities and relations by using unsupervised clustering algorithm Kmeans comprises:
s201, randomly selecting K entities as central points in a data set of the medical knowledge graph;
s202, defining a loss function, and calculating the similarity between entities;
and S203, for each entity in the data set, distributing the entity to the nearest central point according to the calculated cosine distance, re-acquiring K clusters, and for each re-acquired cluster, re-calculating the central point of the cluster until the loss function is converged.
3. The method for classifying medical similar entities based on knowledge graph and clustering algorithm as claimed in claim 2, wherein said loss function in step S202 is:
$$\cos\alpha = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

$$\mathrm{dist}(A, B) = 1 - \cos\alpha$$

wherein A and B are the attribute vectors of the hypothetical vectors a and b, A_i and B_i respectively denote the components of the attribute vectors A and B, α is the angle between the vectors a and b, and dist(A, B) denotes the cosine distance between the vectors a and b.
4. The method for classifying medical similar entities based on the knowledge graph and clustering algorithm as claimed in claim 1, wherein in the step S302, the extracting of the time-series features of the word vectors of the positive-sample and negative-sample embedding layers through the interior of the LSTM comprises:

serially inputting the word vectors of the positive sample embedding layer and the negative sample embedding layer into an LSTM calculating unit, and obtaining Lstm_embedding vector representations in different sequence directions through the following formulas:

$$i_t = \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right)$$

$$f_t = \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right)$$

$$o_t = \sigma\left(W_{xo}\,x_t + W_{ho}\,h_{t-1} + W_{co}\,c_t + b_o\right)$$

$$h_t = o_t \odot \tanh(c_t)$$

wherein i_t is the input gate, f_t is the forget gate, and o_t is the output gate; W_i, W_f, W_c, W_xo, W_ho and W_co respectively denote the weight matrices of the linear layers in which they appear, and b_i, b_c, b_f and b_o respectively denote the bias weight matrices of those linear layers; the parameters W denote the weight matrices of the linear layers for the memory cell; x_t is the representative vector corresponding to the character currently input to the computing module; h_{t-1} is the hidden-layer state output corresponding to the previous character; c_{t-1} is the memory-cell state corresponding to the previous character; b denotes the bias weight matrix of the linear layer; and tanh and σ are activation functions.
5. The method for classifying medical similar entities based on knowledge graph and clustering algorithm as claimed in claim 1, wherein in the step S100, the selecting correct triples and incorrect triples from the training set and inputting the knowledge graph learning model for training comprises:
the correct triplet is S (h, l, t), the error triplet is S '(h', l, t) or S '(h, l, t'), wherein h is a head entity, t is a tail entity, l is the relation between h and t, and h 'and t' are respectively obtained by replacing the head entity and the tail entity by a random entity;
and judging the correct triples and the incorrect triples by the following distance calculation:

$$d(h + l,\ t) = \lVert h + l - t \rVert$$

the loss function L being:

$$L = \sum_{(h,l,t)\in S}\ \sum_{(h',l,t')\in S'} \left[\lambda + d(h + l,\ t) - d(h' + l,\ t')\right]_{+}$$

wherein [x]+ denotes max(0, x) and λ is an adjustable hyper-parameter.
6. A medical similar entity classification system based on knowledge graph and clustering algorithm is characterized by comprising: the system comprises a medical knowledge map vectorization representation module, a similar term entity library construction module and an entity similarity judgment module;
the medical knowledge map vectorization representation module is used for forming a triple data set by data of a medical database, taking the triple data set as a training set, selecting correct triples and error triples from the training set, inputting a knowledge map learning model for training, generating a knowledge map learning model, obtaining vectorization representations of updated embedded layer entities and relations as representation vectors of a knowledge map based on the knowledge map learning model, and obtaining the vectorization representations of the medical knowledge map of the medical database;
the similar term entity library construction module is used for acquiring representative vectors of triples through a mean pooling layer based on the obtained vectorized medical knowledge map of the medical database, and clustering the representative vectors of entities and relations by using an unsupervised clustering algorithm Kmeans to obtain a similar term entity library in the medical knowledge map;
the entity similarity judgment module is used for taking the entities in the same cluster as positive samples and the entities in different clusters as negative samples based on the similar term entity library in the medical knowledge map, inputting the positive samples and the negative samples, training an entity similarity classification model, and performing similarity judgment on the entities based on the entity similarity classification model;
the entity similarity judgment module comprises a word vector determination unit, a time series feature extraction unit and a similarity judgment unit;
the word vector determining unit is used for mapping the positive samples and the negative samples through an embedded layer weight matrix to obtain word vectors of the embedded layers of the positive samples and the negative samples, and taking the word vectors as embedded layer matrix representation of input data;
the time series feature extraction unit is used for extracting time series features of the word vectors of the positive sample embedding layer and the negative sample embedding layer through the inside of lstm;
the similarity judging unit is used for performing binary classification through the linear layer and judging whether the two entities are similar according to the following formulas:

$$\hat{y} = W_3\,h_t$$

$$p = \mathrm{softmax}(\hat{y})$$

wherein W_3 is the weight matrix of the last linear layer; h_t is the final hidden-state output of the LSTM network; ŷ is the result output by the LSTM after passing through the linear layer; p is the finally output probability value of whether the entities are similar; and softmax is a normalization function that normalizes ŷ so that the results are distributed in the interval from 0 to 1.
7. The medical similar entity classification system based on the knowledge graph and the clustering algorithm as claimed in claim 6, wherein the similar term entity library construction module comprises a central point selection unit, a similarity calculation unit and a central point re-determination unit;
the central point selecting unit is used for randomly selecting K entities as central points in the data set of the medical knowledge map;
the similarity calculation unit is used for defining a loss function and calculating the similarity between entities;
and the central point re-determining unit is used for distributing each entity in the data set to a central point closest to the entity according to the calculated cosine distance, re-acquiring K clusters, and re-calculating the central point of each cluster for re-acquisition until the loss function is converged.
8. The system of claim 7, wherein the loss function in the similarity calculation unit is:
$$\cos\alpha = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

$$\mathrm{dist}(A, B) = 1 - \cos\alpha$$

wherein A and B are the attribute vectors of the hypothetical vectors a and b, A_i and B_i respectively denote the components of the attribute vectors A and B, α is the angle between the vectors a and b, and dist(A, B) denotes the cosine distance between the vectors a and b.
9. The system of claim 6, wherein the time-series feature extraction unit extracts the time-series features of the word vectors of the positive and negative sample embedding layers through lstm internally, and comprises:
serially inputting the word vectors of the positive sample embedding layer and the negative sample embedding layer into an LSTM calculating unit, and obtaining Lstm _ embedding vector representations in different sequence directions through calculation of the following formula:
$$i_t = \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right)$$

$$f_t = \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right)$$

$$o_t = \sigma\left(W_{xo}\,x_t + W_{ho}\,h_{t-1} + W_{co}\,c_t + b_o\right)$$

$$h_t = o_t \odot \tanh(c_t)$$

wherein i_t is the input gate, f_t is the forget gate, and o_t is the output gate; W_i, W_f, W_c, W_xo, W_ho and W_co respectively denote the weight matrices of the linear layers in which they appear, and b_i, b_c, b_f and b_o respectively denote the bias weight matrices of those linear layers; the parameters W denote the weight matrices of the linear layers for the memory cell; x_t is the representative vector corresponding to the character currently input to the computing module; h_{t-1} is the hidden-layer state output corresponding to the previous character; c_{t-1} is the memory-cell state corresponding to the previous character; b denotes the bias weight matrix of the linear layer; and tanh and σ are activation functions.
10. The system of claim 6, wherein the medical knowledge-graph vectorization module selects the correct triples and the incorrect triples from the training set, and inputs a knowledge-graph learning model to perform the training, the system comprising:
the correct triplet is S (h, l, t), the error triplet is S '(h', l, t) or S '(h, l, t'), wherein h is a head entity, t is a tail entity, l is the relation between h and t, and h 'and t' are respectively obtained by replacing the head entity and the tail entity by a random entity;
and judging the correct triples and the incorrect triples by the following distance calculation:

$$d(h + l,\ t) = \lVert h + l - t \rVert$$

the loss function L being:

$$L = \sum_{(h,l,t)\in S}\ \sum_{(h',l,t')\in S'} \left[\lambda + d(h + l,\ t) - d(h' + l,\ t')\right]_{+}$$

wherein (h' + l, t') corresponds to an error triple, [x]+ denotes max(0, x), and λ is an adjustable hyper-parameter.
11. A computer device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method for classifying medically similar entities based on a knowledge-graph and clustering algorithm according to any one of claims 1 to 5 when executing the computer program.
12. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the steps of the method for classifying medically similar entities based on a knowledge-graph and clustering algorithm according to any one of claims 1 to 5.
CN202210856458.1A 2022-07-21 2022-07-21 Medical similar entity classification method and system based on knowledge graph and clustering algorithm Active CN115080764B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210856458.1A CN115080764B (en) 2022-07-21 2022-07-21 Medical similar entity classification method and system based on knowledge graph and clustering algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210856458.1A CN115080764B (en) 2022-07-21 2022-07-21 Medical similar entity classification method and system based on knowledge graph and clustering algorithm

Publications (2)

Publication Number Publication Date
CN115080764A CN115080764A (en) 2022-09-20
CN115080764B true CN115080764B (en) 2022-11-01

Family

ID=83259597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210856458.1A Active CN115080764B (en) 2022-07-21 2022-07-21 Medical similar entity classification method and system based on knowledge graph and clustering algorithm

Country Status (1)

Country Link
CN (1) CN115080764B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329102B (en) * 2022-10-12 2023-02-03 北京道达天际科技股份有限公司 Knowledge representation learning method based on news knowledge graph
CN115687932B (en) * 2022-12-23 2023-03-28 阿里健康科技(中国)有限公司 Multi-element group data labeling method, model training method, device, equipment and medium
CN115859987B (en) * 2023-01-19 2023-06-16 阿里健康科技(中国)有限公司 Entity mention identification module, and linking method, device and medium thereof
CN117010494B (en) * 2023-09-27 2024-01-05 之江实验室 Medical data generation method and system based on causal expression learning
CN117708306B (en) * 2024-02-06 2024-05-03 神州医疗科技股份有限公司 Medical question-answering architecture generation method and system based on layered question-answering structure
CN117747124A (en) * 2024-02-20 2024-03-22 浙江大学 Medical large model logic inversion method and system based on network excitation graph decomposition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334211A (en) * 2019-06-14 2019-10-15 电子科技大学 A kind of Chinese medicine diagnosis and treatment knowledge mapping method for auto constructing based on deep learning
CN111291191A (en) * 2018-12-07 2020-06-16 国家新闻出版广电总局广播科学研究院 Radio and television knowledge graph construction method and device
CN112364174A (en) * 2020-10-21 2021-02-12 山东大学 Patient medical record similarity evaluation method and system based on knowledge graph
CN113111180A (en) * 2021-03-22 2021-07-13 杭州祺鲸科技有限公司 Chinese medical synonym clustering method based on deep pre-training neural network
CN114564966A (en) * 2022-03-04 2022-05-31 中国科学院地理科学与资源研究所 Spatial relation semantic analysis method based on knowledge graph

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3905097A1 (en) * 2020-04-30 2021-11-03 Robert Bosch GmbH Device and method for determining a knowledge graph

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291191A (en) * 2018-12-07 2020-06-16 国家新闻出版广电总局广播科学研究院 Radio and television knowledge graph construction method and device
CN110334211A (en) * 2019-06-14 2019-10-15 电子科技大学 A kind of Chinese medicine diagnosis and treatment knowledge mapping method for auto constructing based on deep learning
CN112364174A (en) * 2020-10-21 2021-02-12 山东大学 Patient medical record similarity evaluation method and system based on knowledge graph
CN113111180A (en) * 2021-03-22 2021-07-13 杭州祺鲸科技有限公司 Chinese medical synonym clustering method based on deep pre-training neural network
CN114564966A (en) * 2022-03-04 2022-05-31 中国科学院地理科学与资源研究所 Spatial relation semantic analysis method based on knowledge graph

Also Published As

Publication number Publication date
CN115080764A (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN115080764B (en) Medical similar entity classification method and system based on knowledge graph and clustering algorithm
CN110309267B (en) Semantic retrieval method and system based on pre-training model
CN114582470B (en) Model training method and device and medical image report labeling method
CN110097096B (en) Text classification method based on TF-IDF matrix and capsule network
CN112905795A (en) Text intention classification method, device and readable medium
ElGhany et al. Diagnosis of Various Skin Cancer Lesions Based on Fine-Tuned ResNet50 Deep Network.
CN113282756B (en) Text clustering intelligent evaluation method based on hybrid clustering
CN111881954A (en) Transduction reasoning small sample classification method based on progressive cluster purification network
CN115471739A (en) Cross-domain remote sensing scene classification and retrieval method based on self-supervision contrast learning
CN112214595A (en) Category determination method, device, equipment and medium
CN114925212A (en) Relation extraction method and system for automatically judging and fusing knowledge graph
CN114997287A (en) Model training and data processing method, device, equipment and storage medium
CN114706989A (en) Intelligent recommendation method based on technical innovation assets as knowledge base
CN113553442A (en) Unsupervised event knowledge graph construction method and system
CN116720632B (en) Engineering construction intelligent management method and system based on GIS and BIM
CN111950646A (en) Hierarchical knowledge model construction method and target identification method for electromagnetic image
CN112818164B (en) Music type identification method, device, equipment and storage medium
CN111199154B (en) Fault-tolerant rough set-based polysemous word expression method, system and medium
CN110457455B (en) Ternary logic question-answer consultation optimization method, system, medium and equipment
CN111882441A (en) User prediction interpretation Treeshap method based on financial product recommendation scene
Shi et al. Three-way spectral clustering
CN116975595B (en) Unsupervised concept extraction method and device, electronic equipment and storage medium
CN116431757B (en) Text relation extraction method based on active learning, electronic equipment and storage medium
Liu et al. Learning to describe collective search behavior of evolutionary algorithms in solution space
CN113837228B (en) Fine granularity object retrieval method based on punishment perception center loss function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant