CN114969386B - Disambiguation method, apparatus, electronic device, and medium applied to medical field - Google Patents

Publication number
CN114969386B
Authority
CN
China
Prior art keywords
data
disambiguated
sample
medical knowledge
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210926041.8A
Other languages
Chinese (zh)
Other versions
CN114969386A (en
Inventor
刘硕
杨雅婷
宋佳祥
朱宁
白焜太
许娟
史文钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Health China Technologies Co Ltd
Original Assignee
Digital Health China Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Health China Technologies Co Ltd
Priority to CN202210926041.8A
Publication of CN114969386A
Application granted
Publication of CN114969386B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The embodiments of the present disclosure disclose a disambiguation method, apparatus, electronic device, and medium applied to the medical field, and relate to the technical field of medical knowledge graph construction. One embodiment of the method comprises: acquiring a medical knowledge graph and data to be disambiguated; performing disambiguation processing on the data to be disambiguated based on the medical knowledge graph to obtain a new medical knowledge graph; and storing the new medical knowledge graph in a database of a target medical information platform. This embodiment achieves effective disambiguation of the data to be disambiguated and provides important support for updating and constructing the medical knowledge graph.

Description

Disambiguation method, apparatus, electronic device, and medium applied to medical field
Technical Field
The embodiments of the present disclosure relate to the technical field of medical knowledge graph construction, and in particular to a disambiguation method, apparatus, electronic device, and medium applied to the medical field.
Background
With the advent of the big data era, healthcare has become an important field of big data application. Medical data can be applied to the auxiliary diagnosis of diseases, the determination of treatment plans, epidemic prediction, the analysis of drug side effects, clinical research, and other areas, so medical knowledge graphs are widely constructed and applied. When a medical knowledge graph is updated from external data, the external data and the information already in the knowledge graph may refer to things in different ways: the same entity may appear under different names (synonyms), and the same name may denote different entities. This makes constructing and updating the medical knowledge graph very difficult. There is therefore an urgent need for an efficient and accurate entity disambiguation method for processing medical knowledge graphs and external data.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a disambiguation method, apparatus, electronic device, and medium applied to the medical field, so as to solve the prior-art problem of how to perform medical entity disambiguation efficiently.
In a first aspect of the embodiments of the present disclosure, a disambiguation method applied to the medical field is provided, including: acquiring a medical knowledge graph and data to be disambiguated; performing disambiguation processing on the data to be disambiguated based on the medical knowledge graph to obtain a new medical knowledge graph; and storing the new medical knowledge graph in a database of a target medical information platform.
In a second aspect of the embodiments of the present disclosure, a disambiguation apparatus applied to the medical field is provided, the apparatus including: an acquisition unit configured to acquire a medical knowledge graph and data to be disambiguated; a disambiguation unit configured to perform disambiguation processing on the data to be disambiguated based on the medical knowledge graph to obtain a new medical knowledge graph; and a storage unit configured to store the new medical knowledge graph in a database of a target medical information platform.
In a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps of the above method.
One of the above embodiments of the present disclosure has the following advantageous effects: first, a medical knowledge graph and data to be disambiguated are acquired; then, disambiguation processing is performed on the data to be disambiguated based on the medical knowledge graph to obtain a new medical knowledge graph; finally, the new medical knowledge graph is stored in a database of a target medical information platform. The method provided by the present disclosure achieves effective disambiguation of the data to be disambiguated and provides important support for updating and constructing the medical knowledge graph.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
Fig. 1 is a schematic illustration of one application scenario of a disambiguation method applied to the medical field according to some embodiments of the present disclosure;
Fig. 2 is a flow diagram of some embodiments of a disambiguation method applied to the medical field according to the present disclosure;
Fig. 3 is a schematic structural diagram of some embodiments of a disambiguation apparatus applied to the medical field according to the present disclosure;
Fig. 4 is a schematic block diagram of an electronic device suitable for implementing some embodiments of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
A disambiguation method, apparatus, electronic device, and medium applied to the medical field according to embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of one application scenario of a disambiguation method applied in the medical field according to some embodiments of the present disclosure.
In the application scenario of Fig. 1, first, the computing device 101 may acquire a medical knowledge graph 102 and data to be disambiguated 103. Then, based on the medical knowledge graph 102, the computing device 101 may perform disambiguation processing on the data to be disambiguated 103, as indicated by reference numeral 104, resulting in a new medical knowledge graph 105. Finally, the computing device 101 may store the new medical knowledge graph 105 in the database 106 of the target medical information platform.
The computing device 101 may be hardware or software. When the computing device 101 is hardware, it may be implemented as a distributed cluster composed of a plurality of servers or terminal devices, or as a single server or a single terminal device. When the computing device 101 is software, it may be installed in the hardware devices listed above. It may be implemented, for example, as multiple pieces of software or software modules that provide distributed services, or as a single piece of software or software module. No specific limitation is imposed here.
It should be understood that the number of computing devices 101 in FIG. 1 is merely illustrative. There may be any number of computing devices 101, as desired for an implementation.
Fig. 2 is a schematic flow diagram of some embodiments of a disambiguation method applied to the medical field according to the present disclosure. The disambiguation method of Fig. 2 may be performed by the computing device 101 of Fig. 1. As shown in Fig. 2, the disambiguation method applied to the medical field includes the following steps:
Step S201, acquiring a medical knowledge graph and data to be disambiguated.
In some embodiments, the executing body of the disambiguation method applied to the medical field (e.g., the computing device 101 shown in Fig. 1) may acquire a medical knowledge graph and data to be disambiguated. Here, the medical knowledge graph may be a set of triples of the form <entity 1, relationship, entity 2>, composed of entities representing medical terms and the relationships between those entities. The triple set of the knowledge graph may be denoted S; S contains a series of triples (h, l, t), where h and t belong to the entity set and l belongs to the relationship set. The data to be disambiguated may be a candidate data set to be filled into the medical knowledge graph as entities or as relationships between entities.
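The triple structure described above can be sketched in a few lines of Python. This is only an illustration: the `Triple` type, the helper functions, and the example facts and candidate strings are hypothetical and do not come from the patent.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    head: str      # h, an element of the entity set
    relation: str  # l, an element of the relationship set
    tail: str      # t, an element of the entity set


# Hypothetical example facts; the patent lists no concrete triples.
S: Set[Triple] = {
    Triple("colorectal cancer", "may_cause", "liver metastasis"),
    Triple("hepatitis", "has_symptom", "jaundice"),
}


def entities(triples):
    """Collect the entity set from the heads and tails of all triples."""
    return {t.head for t in triples} | {t.tail for t in triples}


def relations(triples):
    """Collect the relationship set from all triples."""
    return {t.relation for t in triples}


# Hypothetical data to be disambiguated: surface forms that may or may not
# correspond to entities already present in S.
candidates = ["hepatic metastasis", "icterus"]
```

Representing the graph as a set of immutable triples makes duplicate facts collapse automatically, which is convenient when candidate data is later merged in.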
Step S202, performing disambiguation processing on the data to be disambiguated based on the medical knowledge graph to obtain a new medical knowledge graph.
In some embodiments, the executing body may input the medical knowledge graph and the data to be disambiguated into a pre-trained entity disambiguation model, which outputs a mapping result from the entities of the data to be disambiguated to known entities in the medical knowledge graph. Then, the executing body may map the entities in the data to be disambiguated into the medical knowledge graph according to the mapping result to obtain a new medical knowledge graph.
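As a rough sketch of the second half of this step, the following Python shows one way a mapping result could be applied to produce a new graph. The function name `apply_mapping`, the dictionary form of the mapping, and the example data are assumptions made for illustration; the patent does not specify how the mapping result is represented.

```python
def apply_mapping(graph, candidate_triples, mapping):
    """Return a new graph in which every candidate triple has been added,
    with mentions replaced by the known entity they were mapped to.

    graph:             set of (head, relation, tail) tuples
    candidate_triples: iterable of (head, relation, tail) tuples
    mapping:           dict from a mention to a known entity (assumed format)
    """
    def canon(name):
        # Mentions without a mapping entry are kept unchanged.
        return mapping.get(name, name)

    new_graph = set(graph)  # copy, so the input graph is not modified
    for h, l, t in candidate_triples:
        new_graph.add((canon(h), l, canon(t)))
    return new_graph


# Hypothetical example data.
graph = {("hepatitis", "has_symptom", "jaundice")}
candidate_triples = [
    ("hepatitis", "has_symptom", "icterus"),
    ("hepatitis", "treated_by", "antiviral therapy"),
]
mapping = {"icterus": "jaundice"}
new_graph = apply_mapping(graph, candidate_triples, mapping)
```

In this example the first candidate collapses onto an existing triple once "icterus" is mapped to "jaundice", so only the second candidate enlarges the graph.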
In some optional implementations of some embodiments, the entity disambiguation model includes at least: an embedding layer, a multi-head attention mechanism layer, a forward computation layer, an average pooling layer, a linear layer network, and a semi-supervised mechanism network. The multi-head attention mechanism layer comprises three linear layers for matrix feature extraction; the forward computation layer comprises two linear layers and an activation layer. As an example, the activation function employed by the activation layer may be a Sigmoid function.
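To make the forward computation layer concrete, here is a minimal pure-Python sketch of two linear layers with a Sigmoid activation between them. The ordering (linear, then Sigmoid, then linear) follows the common Transformer feed-forward layout and is an assumption, since the text only lists the components; the sizes 8 and 16 and the random initialization are likewise placeholders rather than values from the patent.

```python
import math
import random


def linear(x, weight, bias):
    """Compute y = W x + b for one vector x; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def sigmoid(v):
    """Element-wise Sigmoid activation."""
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


def init_linear(n_in, n_out, rng):
    """Small random weights and zero biases (placeholder initialization)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out


def forward_layer(x, params):
    """Assumed order: linear -> Sigmoid -> linear."""
    (w1, b1), (w2, b2) = params
    return linear(sigmoid(linear(x, w1, b1)), w2, b2)


rng = random.Random(0)
d_model, d_hidden = 8, 16  # toy sizes; the text uses 768-dimensional vectors
params = (init_linear(d_model, d_hidden, rng),
          init_linear(d_hidden, d_model, rng))
out = forward_layer([0.5] * d_model, params)
```

The output has the same dimension as the input, which is what allows the layer to sit between the attention layer and the pooling layer without changing the vector size.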
In some optional implementations of some embodiments, the entity disambiguation model is obtained by training with the sample medical knowledge graph and the sample data to be disambiguated in a training sample as inputs, and the sample new medical knowledge graph in the training sample as the expected output.
In some optional implementations of some embodiments, the training sample of the entity disambiguation model is generated based on data included in the sample knowledge graph and the sample data to be disambiguated. As an example, the executing body may combine the data in the medical knowledge graph and the data in the data to be disambiguated in pairs, and label the combined data as positive and negative samples. As a specific example, pairwise combination means the following: a piece of original data in the annotated data is combined with the knowledge graph data annotated as corresponding to it to form a positive sample, and 5 pieces of knowledge graph data not annotated as corresponding to the original data are randomly selected and each combined with the original data to form negative samples. For instance, the original data "liver metastasis" is combined with the knowledge graph data annotated as corresponding to it to form a positive sample, while 5 pieces of data are randomly selected from the other data in the knowledge graph and each combined with the original data "liver metastasis" to form negative samples.
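The pairing scheme described above (one positive pair plus 5 randomly chosen negative pairs per annotated item) could be sketched as follows. The function name `build_pairs`, the label encoding (1 for positive, 0 for negative), and the entity names are hypothetical.

```python
import random


def build_pairs(mention, gold_entity, kg_entities, n_neg=5, rng=None):
    """Build labeled training pairs for one annotated mention.

    Returns one positive pair (label 1) with the annotated entity and up to
    n_neg negative pairs (label 0) with entities sampled at random from the
    rest of the knowledge graph.
    """
    rng = rng or random.Random()
    pool = [e for e in kg_entities if e != gold_entity]
    negatives = rng.sample(pool, min(n_neg, len(pool)))
    return [(mention, gold_entity, 1)] + [(mention, e, 0) for e in negatives]


# Hypothetical knowledge graph entities and annotation.
kg_entities = [
    "secondary malignant neoplasm of liver", "jaundice", "hepatitis",
    "cirrhosis", "colorectal cancer", "gastric ulcer", "anemia", "pneumonia",
]
pairs = build_pairs(
    "liver metastasis",
    "secondary malignant neoplasm of liver",
    kg_entities,
    rng=random.Random(42),
)
```

Sampling without replacement keeps the 5 negatives distinct; passing a seeded `random.Random` makes the sample set reproducible.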
In some optional implementations of some embodiments, the training of the entity disambiguation model comprises the following steps:
First, inputting the sample knowledge graph and the sample data to be disambiguated into the embedding layer of an initial model to generate an input vector matrix. Here, the initial model may be a BERT model. The embedding layer may be configured to perform word segmentation on the sentences in the sample knowledge graph and the sample data to be disambiguated, perform word embedding on the resulting words to obtain word vectors, and then concatenate the word vectors to obtain a vector for each sentence, thereby obtaining the input vector matrix, whose vector dimension is 768. Word embedding is the general term for language modeling and representation learning techniques in Natural Language Processing (NLP). Conceptually, it refers to embedding from a high-dimensional space, whose dimension equals the number of all words (one dimension per word), into a continuous vector space of much lower dimension, with each word or phrase mapped to a vector over the real numbers. Specifically, a word vector may be the real-valued vector to which a word or phrase is mapped by a word embedding method.
Second, inputting the input vector matrix into the multi-head attention mechanism layer of the initial model to generate an input vector matrix with attention information, whose vector dimension is 768.
Third, inputting the input vector matrix with attention information into the forward computation layer of the initial model and outputting an activated input vector matrix, whose vector dimension is 768.
Fourth, inputting the activated input vector matrix into the average pooling layer of the initial model to obtain the statement vector of the sample knowledge graph, the statement vector of the sample data to be disambiguated, and a relation vector representing the relationship between the statements.
Fifth, concatenating the statement vector of the sample knowledge graph, the statement vector of the sample data to be disambiguated, and the relation vector to obtain a concatenated vector.
Sixth, inputting the concatenated vector into the linear layer network of the initial model to obtain an output result.
Seventh, normalizing the output result to obtain a score representing the similarity between the statement vector of the sample knowledge graph and the statement vector of the sample data to be disambiguated. Here, normalization is performed with the Softmax function, which converts the data into a value in the interval (0, 1).
Eighth, in response to determining that the score is greater than or equal to a preset threshold, filling the sample data to be disambiguated into the sample medical knowledge graph and outputting a new sample medical knowledge graph. As an example, the filling here may place the statement of the data to be disambiguated at the position in the medical knowledge graph indicated by the information expressed by the relation vector.
Ninth, determining whether training is completed based on the sample new medical knowledge graph (the expected output in the training sample) and the new sample medical knowledge graph (the output of the eighth step). For example, the executing body may determine whether the two graphs are consistent. If they are consistent, the executing body may determine that this round of training is completed and the next round of training can be performed; if not, the executing body may determine that training is not completed.
Tenth, in response to determining that training of the initial model is completed, determining the initial model as the entity disambiguation model.
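The fourth to seventh steps (average pooling, concatenation, linear layer, Softmax) can be illustrated with a small pure-Python sketch. Several details are assumptions that the text does not settle: the relation vector is taken to be the element-wise absolute difference of the two statement vectors (a convention used in Sentence-BERT style models), the linear layer is assumed to emit two logits (no match, match), and the toy weights and 2-dimensional token vectors are placeholders.

```python
import math


def mean_pool(token_vectors):
    """Average pooling over the token axis: one vector per statement."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


def softmax(logits):
    """Numerically stable Softmax; every output lies in (0, 1)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def match_score(kg_tokens, data_tokens, weight, bias):
    u = mean_pool(kg_tokens)                   # statement vector, graph side
    v = mean_pool(data_tokens)                 # statement vector, candidate side
    rel = [abs(a - b) for a, b in zip(u, v)]   # ASSUMED relation vector |u - v|
    spliced = u + v + rel                      # concatenation (list join)
    logits = [sum(w * x for w, x in zip(row, spliced)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)[1]                  # ASSUMED: index 1 = "match"


# Toy weights: the "no match" logit grows with the relation vector.
weight = [[0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
          [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
bias = [0.0, 0.0]
kg_tokens = [[1.0, 0.0], [0.0, 1.0]]
score_same = match_score(kg_tokens, [[0.5, 0.5]], weight, bias)
score_diff = match_score(kg_tokens, [[1.0, 0.0]], weight, bias)
```

With these toy weights, a candidate whose pooled vector equals that of the graph statement scores higher than one whose pooled vector differs, which is the behavior the preset threshold in the eighth step relies on.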
In some optional implementations of some embodiments, the method further comprises: in response to determining that the score is less than the preset threshold, sending the sample data to be disambiguated to an audit page for display; receiving an audit result entered on the audit page; and determining, based on the audit result, whether to fill the sample data to be disambiguated into the sample medical knowledge graph. Here, the audit result is one of: confirm filling, confirm not filling.
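The threshold routing and manual audit described here could be sketched as below. The function names, the threshold value 0.8, and the use of a boolean per item to stand for the two audit outcomes are illustrative assumptions; the patent gives no concrete threshold or interface.

```python
def route(scored_items, threshold=0.8):
    """Split scored items into those filled automatically and those queued
    for manual audit. threshold=0.8 is a placeholder value."""
    auto_fill, audit_queue = [], []
    for item, score in scored_items:
        if score >= threshold:
            auto_fill.append(item)
        else:
            audit_queue.append(item)
    return auto_fill, audit_queue


def apply_audit(audit_queue, decisions):
    """Keep only items the reviewer confirmed for filling.

    decisions maps an item to True (confirm filling) or False (confirm not
    filling); items without a decision are conservatively not filled.
    """
    return [item for item in audit_queue if decisions.get(item) is True]


# Hypothetical scored candidates and reviewer decisions.
scored = [("icterus", 0.93), ("hepatic mets", 0.61), ("liver spot", 0.22)]
auto_fill, audit_queue = route(scored)
confirmed = apply_audit(audit_queue, {"hepatic mets": True, "liver spot": False})
to_fill = auto_fill + confirmed
```

Keeping the audit queue separate from the automatic path means low-confidence items never enter the graph without an explicit reviewer decision.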
In some optional implementations of some embodiments, the method further comprises: in response to determining that training of the initial model is not completed, adjusting the relevant parameters in the initial model, reselecting a training sample, and continuing to execute the training steps with the adjusted initial model as the initial model. Here, adjusting the relevant parameters facilitates training of the model, and using the adjusted initial model as the initial model serves as an iterative update.
In some optional implementations of some embodiments, inputting the medical knowledge graph and the data to be disambiguated into the pre-trained entity disambiguation model and outputting the mapping result of the data to be disambiguated to entities in the medical knowledge graph specifically includes the following. After training of the entity disambiguation model is completed using the semi-supervised mechanism network, the trained model is used to express the entity data of the medical knowledge graph as 512-dimensional representation vectors, and the 512-dimensional representation vectors of all entity data in the medical knowledge graph are stored in a JSON file. The dimensions of the vectors stored in the JSON file are [k, 512], where k is the number of entities in the medical knowledge graph and 512 is the vector dimension. Then, when new data to be disambiguated is processed, the data to be disambiguated only needs to pass through the embedding layer of the initial model to obtain an n-dimensional representation vector a. The previously stored JSON file is loaded to obtain the vector matrix b of all entity data in the medical knowledge graph, and a matrix calculation is performed between the representation vector a of the data to be disambiguated and the vector matrix b. The calculation formula is as follows:
[Formula image not reproduced: matrix calculation between the representation vector a and the vector matrix b]
The result of this calculation has a value range of [-1, 1]. It is then converted by normalization into a value in the interval (0, 1). The normalization formula is as follows:
[Formula image not reproduced: normalization of the calculation result, yielding d]
where d represents the similarity between the data to be disambiguated and the entity data in the medical knowledge graph. Finally, the pairs of data to be disambiguated and entity data in the medical knowledge graph whose similarity exceeds a preset threshold are selected as the mapping result. Because the representation vectors of the entity data in the medical knowledge graph are stored in advance, and a matrix calculation is then performed between the representation vector of the data to be disambiguated and those stored vectors, only one matrix calculation is needed to obtain the result (the entity data finally mapped to). Compared with traditional methods, this can greatly reduce time consumption and improve disambiguation speed.
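The two formulas appear only as images in the original, so the following Python sketch rests on stated assumptions rather than on the patent's exact definitions: the matrix calculation is taken to be cosine similarity (consistent with the stated range [-1, 1]), and the normalization is taken to be d = (1 + cos) / 2, which maps that range onto [0, 1]. The JSON layout with the keys "entities" and "vectors", the threshold 0.9, and the 2-dimensional toy vectors are also hypothetical.

```python
import json
import math


def cosine(a, row):
    """Cosine similarity between vector a and one row of matrix b (assumed formula)."""
    dot = sum(x * y for x, y in zip(a, row))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in row))
    return dot / norm if norm else 0.0


def disambiguate(a, stored_json, threshold=0.9):
    """Score vector a against every stored entity vector in one pass and
    return (entity, d) for the best match, or (None, d) below the threshold."""
    store = json.loads(stored_json)
    names, b = store["entities"], store["vectors"]   # b has shape [k, dim]
    # ASSUMED normalization: d = (1 + cos) / 2
    scores = [0.5 * (1.0 + cosine(a, row)) for row in b]
    best = max(range(len(names)), key=scores.__getitem__)
    if scores[best] > threshold:
        return names[best], scores[best]
    return None, scores[best]


# Toy store with k = 2 entities and 2-dimensional vectors (the text uses 512).
stored_json = json.dumps({
    "entities": ["jaundice", "hepatitis"],
    "vectors": [[1.0, 0.0], [0.0, 1.0]],
})
match = disambiguate([2.0, 0.0], stored_json)
no_match = disambiguate([-1.0, -1.0], stored_json)
```

Because the entity vectors are computed once and reloaded from the file, each new item costs one embedding pass plus one sweep over the stored matrix, which is the source of the speed-up the paragraph describes.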
Step S203, storing the new medical knowledge graph in a database of a target medical information platform.
In some embodiments, the executing body may store the new medical knowledge graph in a database of the target medical information platform. Here, the target medical information platform may be a platform for storing and presenting medical knowledge graphs and other medical field information.
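As one possible illustration of this storage step, the sketch below writes the triples of a graph into a relational table using SQLite from the Python standard library. The choice of SQLite, the table name `kg_triple`, and the schema are assumptions; the patent does not say what kind of database the target platform uses.

```python
import sqlite3


def store_graph(conn, triples):
    """Persist (head, relation, tail) triples; re-storing a triple is a no-op."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kg_triple ("
        "head TEXT NOT NULL, relation TEXT NOT NULL, tail TEXT NOT NULL, "
        "UNIQUE (head, relation, tail))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO kg_triple (head, relation, tail) VALUES (?, ?, ?)",
        sorted(triples),
    )
    conn.commit()


# Hypothetical new graph, stored in an in-memory database for demonstration.
new_graph = {
    ("hepatitis", "has_symptom", "jaundice"),
    ("hepatitis", "treated_by", "antiviral therapy"),
}
conn = sqlite3.connect(":memory:")
store_graph(conn, new_graph)
store_graph(conn, new_graph)  # second call adds nothing
count = conn.execute("SELECT COUNT(*) FROM kg_triple").fetchone()[0]
```

The uniqueness constraint together with `INSERT OR IGNORE` makes the write idempotent, so repeated updates with overlapping graphs do not duplicate facts.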
One of the above embodiments of the present disclosure has the following advantageous effects: first, a medical knowledge graph and data to be disambiguated are acquired; then, disambiguation processing is performed on the data to be disambiguated based on the medical knowledge graph to obtain a new medical knowledge graph; finally, the new medical knowledge graph is stored in a database of a target medical information platform. The method provided by the present disclosure achieves effective disambiguation of the data to be disambiguated and provides important support for updating and constructing the medical knowledge graph. In addition, a semi-supervised mechanism network is adopted in the process of training the entity disambiguation model. After training, further judgment can be made according to the prediction results (similarity confirmation and comparison against the preset threshold): data to be disambiguated that falls below the preset threshold is sent to an audit page for display, training continues after the result of the manual audit is received, and this iterative update is repeated, improving the generalization and accuracy of the model.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic structural diagram of some embodiments of a disambiguation apparatus applied to the medical field according to the present disclosure. As shown in Fig. 3, the disambiguation apparatus applied to the medical field includes: an acquisition unit 301, a disambiguation unit 302, and a storage unit 303. The acquisition unit 301 is configured to acquire a medical knowledge graph and data to be disambiguated; the disambiguation unit 302 is configured to perform disambiguation processing on the data to be disambiguated based on the medical knowledge graph to obtain a new medical knowledge graph; and the storage unit 303 is configured to store the new medical knowledge graph in a database of the target medical information platform.
In some optional implementations of some embodiments, the disambiguation unit 302 of the disambiguation apparatus applied to the medical field is further configured to: input the medical knowledge graph and the data to be disambiguated into a pre-trained entity disambiguation model, and output a mapping result of the data to be disambiguated to entities in the medical knowledge graph; and generate a new medical knowledge graph based on the mapping result.
In some optional implementations of some embodiments, the entity disambiguation model includes at least: an embedding layer, a multi-head attention mechanism layer, a forward computation layer, an average pooling layer, a linear layer network, and a semi-supervised mechanism network. The multi-head attention mechanism layer comprises three linear layers for matrix feature extraction; the forward computation layer comprises two linear layers and an activation layer.
In some optional implementations of some embodiments, the entity disambiguation model is obtained by training with the sample medical knowledge graph and the sample data to be disambiguated in a training sample as inputs, and the sample new medical knowledge graph in the training sample as the expected output.
In some optional implementations of some embodiments, the training sample of the entity disambiguation model is generated based on data included in the sample knowledge graph and the sample data to be disambiguated, and the training of the entity disambiguation model includes the following steps: inputting the sample knowledge graph and the sample data to be disambiguated into the embedding layer of an initial model to generate an input vector matrix; inputting the input vector matrix into the multi-head attention mechanism layer of the initial model to generate an input vector matrix with attention information; inputting the input vector matrix with attention information into the forward computation layer of the initial model, and outputting an activated input vector matrix; inputting the activated input vector matrix into the average pooling layer of the initial model to obtain the statement vector of the sample knowledge graph, the statement vector of the sample data to be disambiguated, and a relation vector representing the relationship between the statements; concatenating the statement vector of the sample knowledge graph, the statement vector of the sample data to be disambiguated, and the relation vector to obtain a concatenated vector; inputting the concatenated vector into the linear layer network of the initial model to obtain an output result; normalizing the output result to obtain a score representing the similarity between the statement vector of the sample knowledge graph and the statement vector of the sample data to be disambiguated; in response to determining that the score is greater than or equal to a preset threshold, filling the sample data to be disambiguated into the sample medical knowledge graph and outputting a new sample medical knowledge graph; determining whether training is completed based on the sample new medical knowledge graph and the new sample medical knowledge graph; and in response to determining that training of the initial model is completed, determining the initial model as the entity disambiguation model.
In some optional implementations of some embodiments, after the sample data to be disambiguated is filled into the sample medical knowledge graph in response to determining that the score is greater than or equal to the preset threshold and a new sample medical knowledge graph is output, the disambiguation apparatus is further configured to: in response to determining that the score is less than the preset threshold, send the sample data to be disambiguated to an audit page for display; receive an audit result entered on the audit page; and determine, based on the audit result, whether to fill the sample data to be disambiguated into the sample medical knowledge graph.
In some optional implementations of some embodiments, the disambiguation apparatus applied to the medical field is further configured to: in response to determining that training of the initial model is not completed, adjust the relevant parameters in the initial model, reselect a training sample, and continue to execute the training steps with the adjusted initial model as the initial model.
In some optional implementations of some embodiments, inputting the medical knowledge graph and the data to be disambiguated into the pre-trained entity disambiguation model and outputting the mapping result of the data to be disambiguated to entities in the medical knowledge graph includes: generating a representation vector of at least one entity in the medical knowledge graph to obtain a representation vector set; storing the representation vector set in a target file; inputting the data to be disambiguated into the embedding layer of the entity disambiguation model to obtain a representation vector of the data to be disambiguated; loading the target file to obtain the representation vector matrix of the entities in the medical knowledge graph; performing a matrix calculation on the representation vector of the data to be disambiguated and the representation vector matrix to obtain a calculation result; normalizing the calculation result to obtain a similarity score between the data to be disambiguated and the entities in the medical knowledge graph; and in response to determining that the similarity score exceeds a preset score threshold, outputting the mapping result of the data to be disambiguated to the entities in the medical knowledge graph.
It will be understood that the elements described in the apparatus correspond to various steps in the method described with reference to figure 2. Thus, the operations, features and advantages described above with respect to the method are also applicable to the apparatus and the units included therein, and are not described herein again.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by functions and internal logic of the process, and should not constitute any limitation to the implementation process of the embodiments of the present disclosure.
Fig. 4 is a schematic diagram of a computer device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the computer device 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps in the various method embodiments described above are implemented when the processor 401 executes the computer program 403. Alternatively, the processor 401 implements the functions of the respective modules/units in the above-described respective apparatus embodiments when executing the computer program 403.
Illustratively, the computer program 403 may be partitioned into one or more modules/units, which are stored in the memory 402 and executed by the processor 401 to accomplish the present disclosure. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 403 in the computer device 4.
The computer device 4 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The computer device 4 may include, but is not limited to, the processor 401 and the memory 402. Those skilled in the art will appreciate that Fig. 4 is merely an example of the computer device 4 and does not limit it; the computer device 4 may include more or fewer components than shown, may combine certain components, or may use different components. For example, the computer device may also include input/output devices, network access devices, buses, and the like.
The processor 401 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 402 may be an internal storage unit of the computer device 4, for example, a hard disk or internal memory of the computer device 4. The memory 402 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 4. Further, the memory 402 may include both an internal storage unit and an external storage device of the computer device 4. The memory 402 is used for storing the computer program and other programs and data required by the computer device. The memory 402 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the description of each embodiment has its own emphasis, and reference may be made to the related description of other embodiments for parts that are not described or recited in any embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/computer device and method may be implemented in other ways. For example, the apparatus/computer device embodiments described above are merely illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical, or in another form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods in the above embodiments may be implemented by a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, it implements the steps of the above method embodiments. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be added to or reduced as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals or telecommunications signals.
The above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and they should be construed as being included in the scope of the present disclosure.

Claims (8)

1. A disambiguation method applied to a medical field, comprising:
acquiring a medical knowledge graph and data to be disambiguated;
performing disambiguation processing on the data to be disambiguated based on the medical knowledge graph to obtain a new medical knowledge graph;
storing the new medical knowledge graph to a database of a target medical information platform;
wherein performing the disambiguation processing on the data to be disambiguated based on the medical knowledge graph to obtain the new medical knowledge graph comprises:
inputting the medical knowledge graph and the data to be disambiguated into a pre-trained entity disambiguation model, and outputting a mapping result of the data to be disambiguated to an entity in the medical knowledge graph;
generating a new medical knowledge graph based on the mapping result;
the entity disambiguation model is obtained by training with a sample knowledge graph and sample to-be-disambiguated data in a training sample as input and a sample new medical knowledge graph in the training sample as expected output;
the training sample of the entity disambiguation model is generated based on pairwise combination of data contained in the sample knowledge graph and the sample to-be-disambiguated data, wherein the pairwise combination means that original data in the labeled data and the data in the knowledge graph labeled as corresponding to it are combined as a positive sample, and, for the original data, 5 items of data in the knowledge graph that are not labeled as corresponding to it are randomly selected as negative samples; the training of the entity disambiguation model comprises the following steps:
inputting the sample knowledge graph and the sample data to be disambiguated to an embedding layer of an initial model to generate an input vector matrix;
inputting the input vector matrix to a multi-head attention mechanism layer of the initial model to generate an input vector matrix with attention information;
inputting the input vector matrix with the attention information into a forward computing layer of the initial model, and outputting an activated input vector matrix;
inputting the activated input vector matrix into an average pooling layer of the initial model to obtain statement vectors of the sample knowledge graph, statement vectors of the sample data to be disambiguated and relation vectors for representing statement relations;
splicing the statement vector of the sample knowledge graph, the statement vector of the sample data to be disambiguated and the relation vector to obtain a spliced vector;
inputting the splicing vector to a linear layer network of the initial model to obtain an output result;
normalizing the output result to obtain a score of similarity between a statement vector for representing the sample knowledge graph and a statement vector of the sample data to be disambiguated;
in response to determining that the score is greater than or equal to a preset threshold, filling the sample to-be-disambiguated data into the sample knowledge graph, and outputting a new sample medical knowledge graph;
determining whether training is complete based on the sample new medical knowledge-graph and the new sample medical knowledge-graph;
in response to determining that the initial model training is complete, determining the initial model as the entity disambiguation model;
the entity disambiguation model comprising at least: the system comprises an embedding layer, a multi-head attention mechanism layer, a forward computing layer, an average pooling layer, a linear layer network and a semi-supervision mechanism network; the multi-head attention mechanism layer comprises three linear layers for matrix characteristic extraction; the forward computing layer comprises two linear layers and an activation layer;
generating a representation vector of at least one entity in the medical knowledge-graph to obtain a representation vector set;
storing the set of representation vectors to a target file;
inputting the data to be disambiguated to an embedding layer in the entity disambiguation model to obtain a representation vector of the data to be disambiguated;
loading the target file to obtain a representation vector matrix of an entity in the medical knowledge graph;
performing matrix calculation on the representation vector of the data to be disambiguated and the representation vector matrix to obtain a calculation result;
normalizing the calculation result to obtain a similarity score for representing the data to be disambiguated and the entity in the medical knowledge graph;
in response to determining that the similarity score exceeds a preset score threshold, outputting a mapping result of the data to be disambiguated to an entity within the medical knowledge-graph;
representing the entity data contained in the medical knowledge graph as 512-dimensional representation vectors by using a trained model, and storing the 512-dimensional representation vectors of all entity data in the medical knowledge graph in a JSON file, wherein the shape of the vectors stored in the JSON file is [k, 512], where k is the number of entities in the medical knowledge graph and 512 is the vector dimension;
in a new disambiguation process for data to be disambiguated, the data to be disambiguated only needs to pass through an embedding layer of an initial model to obtain an n-dimensional representation vector a; the previously stored JSON file is then loaded to obtain a vector matrix b of all entity data in the medical knowledge graph; matrix calculation is then performed on the representation vector a corresponding to the data to be disambiguated and the vector matrix b, the calculation formula being:
[formula image DEST_PATH_IMAGE002 not reproduced in this text]
to obtain a calculation result [symbol image DEST_PATH_IMAGE004 not reproduced in this text] whose value range is [-1, 1]; this value is then converted into a value in the (0, 1) interval through normalization, the normalization formula being:
[formula image DEST_PATH_IMAGE006 not reproduced in this text]
where d represents the similarity between the data to be disambiguated and the entity data in the medical knowledge graph; finally, the data to be disambiguated whose similarity exceeds a preset threshold and the entity data in the medical knowledge graph to which it maps are selected as the result.
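The pairwise combination of training samples recited in claim 1 can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function name `build_training_pairs`, the dictionary layout of the labeled data, and the fixed random seed are hypothetical and are not taken from the disclosure; only the rule of one positive sample plus 5 randomly selected negative samples per item of original data comes from the text.

```python
import random


def build_training_pairs(labeled, entities, negatives_per_item=5, seed=0):
    """Build (mention, entity, label) pairs from labeled data.

    labeled:  dict mapping each item of original data to its labeled entity.
    entities: list of all entity names in the sample knowledge graph.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    pairs = []
    for mention, gold in labeled.items():
        pairs.append((mention, gold, 1))  # positive sample: labeled correspondence
        candidates = [e for e in entities if e != gold]
        count = min(negatives_per_item, len(candidates))
        for wrong in rng.sample(candidates, count):
            pairs.append((mention, wrong, 0))  # negative sample: non-corresponding entity
    return pairs
```

With two labeled items and at least six entities in the graph, this yields 2 positive and 10 negative pairs.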
2. The disambiguation method applied to the medical field of claim 1, wherein, after filling the sample to-be-disambiguated data into the sample medical knowledge graph and outputting the new sample medical knowledge graph in response to determining that the score is greater than or equal to the preset threshold, the method further comprises:
in response to determining that the score is less than the preset threshold, sending the sample to-be-disambiguated data to an audit page for display;
receiving an audit result input for the audit page;
determining, based on the audit result, whether to fill the sample to-be-disambiguated data into the sample medical knowledge graph.
3. The disambiguation method applied to the medical field of claim 1, further comprising:
in response to determining that the initial model training is not complete, adjusting relevant parameters in the initial model, reselecting a training sample, and continuing to perform the training steps using the adjusted initial model as the initial model.
4. A disambiguation apparatus applied to a medical field, comprising:
an acquisition unit configured to acquire a medical knowledge graph and data to be disambiguated;
a disambiguation unit configured to perform disambiguation processing on the data to be disambiguated based on the medical knowledge graph, resulting in a new medical knowledge graph;
a storage unit configured to store the new medical knowledge graph to a database of a target medical information platform;
the disambiguation unit is further configured to:
inputting the medical knowledge graph and the data to be disambiguated into a pre-trained entity disambiguation model, and outputting a mapping result of the data to be disambiguated to an entity in the medical knowledge graph;
generating a new medical knowledge graph based on the mapping result;
the entity disambiguation model is obtained by training with a sample medical knowledge graph and sample to-be-disambiguated data in a training sample as input and a sample new medical knowledge graph in the training sample as expected output;
the training sample of the entity disambiguation model is generated based on pairwise combination of data contained in the sample knowledge graph and the sample to-be-disambiguated data, wherein the pairwise combination means that original data in the labeled data and the data in the knowledge graph labeled as corresponding to it are combined as a positive sample, and, for the original data, 5 items of data in the knowledge graph that are not labeled as corresponding to it are randomly selected as negative samples; the training of the entity disambiguation model comprises the following steps:
inputting the sample knowledge graph and the sample data to be disambiguated to an embedding layer of an initial model to generate an input vector matrix;
inputting the input vector matrix to a multi-head attention mechanism layer of the initial model to generate an input vector matrix with attention information;
inputting the input vector matrix with the attention information into a forward computing layer of the initial model, and outputting an activated input vector matrix;
inputting the activated input vector matrix into an average pooling layer of the initial model to obtain statement vectors of the sample knowledge graph, statement vectors of the sample data to be disambiguated and relation vectors for representing statement relations;
splicing the statement vector of the sample knowledge graph, the statement vector of the sample data to be disambiguated and the relation vector to obtain a spliced vector;
inputting the splicing vector to a linear layer network of the initial model to obtain an output result;
normalizing the output result to obtain a score of similarity between a statement vector for representing the sample knowledge graph and a statement vector of the sample data to be disambiguated;
in response to determining that the score is greater than or equal to a preset threshold, filling the sample to-be-disambiguated data into the sample medical knowledge graph, and outputting a new sample medical knowledge graph;
determining whether training is complete based on the sample new medical knowledge-graph and the new sample medical knowledge-graph;
in response to determining that the initial model training is complete, determining the initial model as the entity disambiguation model;
the entity disambiguation model comprises at least: the system comprises an embedding layer, a multi-head attention mechanism layer, a forward computing layer, an average pooling layer, a linear layer network and a semi-supervision mechanism network; the multi-head attention mechanism layer comprises three linear layers for matrix characteristic extraction; the forward computing layer comprises two linear layers and an activation layer;
generating a representation vector of at least one entity in the medical knowledge-graph to obtain a representation vector set;
storing the set of representation vectors to a target file;
inputting the data to be disambiguated to an embedding layer in the entity disambiguation model to obtain a representation vector of the data to be disambiguated;
loading the target file to obtain a representation vector matrix of an entity in the medical knowledge graph;
performing matrix calculation on the representation vector of the data to be disambiguated and the representation vector matrix to obtain a calculation result;
normalizing the calculation result to obtain a similarity score for representing the data to be disambiguated and the entity in the medical knowledge graph;
in response to determining that the similarity score exceeds a preset score threshold, outputting a mapping result of the data to be disambiguated to an entity within the medical knowledge-graph;
representing the entity data contained in the medical knowledge graph as 512-dimensional representation vectors by using a trained model, and storing the 512-dimensional representation vectors of all entity data in the medical knowledge graph in a JSON file, wherein the shape of the vectors stored in the JSON file is [k, 512], where k is the number of entities in the medical knowledge graph and 512 is the vector dimension;
in a new disambiguation process for data to be disambiguated, the data to be disambiguated only needs to pass through an embedding layer of an initial model to obtain an n-dimensional representation vector a; the previously stored JSON file is then loaded to obtain a vector matrix b of all entity data in the medical knowledge graph; matrix calculation is then performed on the representation vector a corresponding to the data to be disambiguated and the vector matrix b, the calculation formula being:
[formula image DEST_PATH_IMAGE002 not reproduced in this text]
to obtain a calculation result [symbol image DEST_PATH_IMAGE004 not reproduced in this text] whose value range is [-1, 1]; this value is then converted into a value in the (0, 1) interval through normalization, the normalization formula being:
[formula image DEST_PATH_IMAGE006 not reproduced in this text]
where d represents the similarity between the data to be disambiguated and the entity data in the medical knowledge graph; finally, the data to be disambiguated whose similarity exceeds a preset threshold and the entity data in the medical knowledge graph to which it maps are selected as the result.
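The pooling, splicing, linear-layer, and normalization steps recited in the training procedure can be sketched as follows. This is a simplified Python sketch under stated assumptions: the function names are hypothetical, the relation vector is assumed to be the element-wise absolute difference of the two statement vectors, and the normalization is assumed to be a sigmoid; the claims do not specify either choice. The embedding, multi-head attention, and forward computing layers are omitted, and the token vectors are taken as given.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors (rows) into a single statement vector."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


def score_pair(entity_tokens, mention_tokens, weights, bias=0.0):
    """Pool both inputs, splice (u, v, relation), apply a linear layer, then normalize."""
    u = mean_pool(entity_tokens)    # statement vector of the knowledge graph entry
    v = mean_pool(mention_tokens)   # statement vector of the data to be disambiguated
    relation = [abs(x - y) for x, y in zip(u, v)]  # assumed relation vector
    spliced = u + v + relation      # list concatenation: the spliced vector
    logit = sum(w * x for w, x in zip(weights, spliced)) + bias  # single linear output
    return 1.0 / (1.0 + math.exp(-logit))  # assumed sigmoid normalization to (0, 1)
```

With weights that penalize the relation components, identical inputs score higher than dissimilar ones, which is the behavior the similarity score is meant to capture.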
5. The disambiguation apparatus applied to the medical field of claim 4, wherein said apparatus is further configured to:
in response to determining that the score is less than the preset threshold, sending the sample to-be-disambiguated data to an audit page for display;
receiving an audit result input for the audit page;
determining, based on the audit result, whether to fill the sample to-be-disambiguated data into the sample medical knowledge graph.
6. The disambiguation apparatus applied to the medical field of claim 4, wherein the apparatus is further configured to:
in response to determining that the initial model training is not complete, adjusting relevant parameters in the initial model, reselecting a training sample, and continuing to perform the training steps using the adjusted initial model as the initial model.
7. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor realizes the steps of the method according to any one of claims 1 to 3 when executing the computer program.
8. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.
CN202210926041.8A 2022-08-03 2022-08-03 Disambiguation method, apparatus, electronic device, and medium applied to medical field Active CN114969386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210926041.8A CN114969386B (en) 2022-08-03 2022-08-03 Disambiguation method, apparatus, electronic device, and medium applied to medical field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210926041.8A CN114969386B (en) 2022-08-03 2022-08-03 Disambiguation method, apparatus, electronic device, and medium applied to medical field

Publications (2)

Publication Number Publication Date
CN114969386A CN114969386A (en) 2022-08-30
CN114969386B true CN114969386B (en) 2022-11-18

Family

ID=82968676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210926041.8A Active CN114969386B (en) 2022-08-03 2022-08-03 Disambiguation method, apparatus, electronic device, and medium applied to medical field

Country Status (1)

Country Link
CN (1) CN114969386B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9542652B2 (en) * 2013-02-28 2017-01-10 Microsoft Technology Licensing, Llc Posterior probability pursuit for entity disambiguation
CN112037920A (en) * 2020-08-31 2020-12-04 康键信息技术(深圳)有限公司 Medical knowledge map construction method, device, equipment and storage medium
CN114021570A (en) * 2021-11-05 2022-02-08 平安普惠企业管理有限公司 Entity disambiguation method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN114969386A (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN112074806B (en) System, method and computer storage medium for block floating point computing
CN109871532B (en) Text theme extraction method and device and storage medium
WO2022007823A1 (en) Text data processing method and device
CN108763535B (en) Information acquisition method and device
EP4336378A1 (en) Data processing method and related device
CN110427486B (en) Body condition text classification method, device and equipment
CN111782826A (en) Knowledge graph information processing method, device, equipment and storage medium
CN116386800B (en) Medical record data segmentation method and system based on pre-training language model
CN114925320A (en) Data processing method and related device
CN112130805A (en) Chip comprising floating-point adder, equipment and control method of floating-point operation
CN115374771A (en) Text label determination method and device
CN113326383B (en) Short text entity linking method, device, computing equipment and storage medium
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN114969386B (en) Disambiguation method, apparatus, electronic device, and medium applied to medical field
CN116775836A (en) Textbook text question-answering method and system based on multi-level attention
CN116957043A (en) Model quantization method, device, equipment and medium
CN115034225A (en) Word processing method and device applied to medical field, electronic equipment and medium
CN114792097B (en) Method and device for determining prompt vector of pre-training model and electronic equipment
CN112765936B (en) Training method and device for operation based on language model
CN115906861A (en) Statement emotion analysis method and device based on interaction aspect information fusion
US20210406294A1 (en) Relevance approximation of passage evidence
CN110852348B (en) Feature map processing method, image processing method and device
CN111382246B (en) Text matching method, matching device, terminal and computer readable storage medium
CN112784003A (en) Method for training statement repeat model, statement repeat method and device thereof
CN112507698B (en) Word vector generation method, device, terminal equipment and computer readable storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant