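WO2023025255A1 - Multi-center medical diagnosis knowledge graph representation learning method and system - Google Patents
Multi-center medical diagnosis knowledge graph representation learning method and system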
- Publication number
- WO2023025255A1 (PCT/CN2022/114879; CN2022114879W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- server
- disease classification
- codes
- medical
- medical diagnosis
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/008—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
Definitions
- The invention belongs to the technical field of medical information, and in particular relates to a multi-center medical diagnosis knowledge graph representation learning method and system.
- A knowledge graph describes concepts, entities and their relationships in the objective world in a structured form and expresses information in a form closer to how humans perceive the world. It provides a better way to organize, manage and understand information, and can be used to mine, analyze and construct knowledge and to draw and display the interrelationships among pieces of knowledge.
- Representation learning aims to represent the semantic information of the research object as a dense, low-dimensional vector.
- Knowledge graph representation learning learns representations of the entities and relationships in a knowledge graph. By learning from and training on large-scale knowledge graphs and raw data, it obtains distributed vector representations of knowledge in a low-dimensional dense space. These vectors express the semantic information of entities and relationships and facilitate knowledge computation and reasoning.
- Using a medical ontology to encode hierarchical clinical structure and the relationships between medical concepts can reduce the need for large amounts of data and effectively shrink the search space without losing information.
- Fortunately, the healthcare domain has many well-organized ontologies, such as the International Classification of Diseases (ICD), the Clinical Classifications Software (CCS), or the Systematized Nomenclature of Medicine Clinical Terms (SNOMED).
- In a medical ontology, nodes that are close to each other (i.e., medical concepts) are likely to be associated with similar patients, which allows knowledge to be transferred between them.
- Medical ontologies can be useful when there is too little data to train a deep learning model. Even when data is sufficient, they can serve to simplify the model without loss of information, by learning more interpretable representations that conform to the ontology structure.
- Knowledge representation learning models based only on the structural information of a medical knowledge graph cannot solve problems such as the weak semantic representation caused by complex relationship modeling and data sparsity.
- Existing work uses large amounts of text from outside the knowledge graph to expand its structural information and reduce the impact of data sparsity.
- Existing methods ignore the structural and correlation information inherent in the data. In addition, there is no method for enlarging the amount of data used for knowledge representation learning while preserving privacy and data security.
- To address these deficiencies of the prior art, the present invention proposes a multi-center medical diagnosis knowledge graph representation learning method and system. While protecting the data privacy and security of each participating medical institution, it uses multi-center data to increase data density. In addition, while learning from large-scale knowledge graphs and raw data, it incorporates the hierarchical information and complex associations in the knowledge source that conform to human cognition, mines correlations in the data, and enriches semantic information, thereby addressing the weak semantic representation caused by data sparsity.
- In one aspect, the present invention discloses a multi-center medical diagnosis knowledge graph representation learning method.
- The method is based on federated learning and homomorphic encryption, uses multi-center data, combines hierarchical information with complex association relationships, and realizes knowledge representation learning of structural information. It includes the following steps:
- (1) The first server builds a global medical diagnosis knowledge graph, which represents the hierarchical structure of medical diagnosis concepts as a directed acyclic graph made up of leaf nodes and ancestor nodes.
- A leaf node is the smallest (most specific) disease classification code, and its ancestor nodes are the upper-level disease classification codes corresponding to that leaf code.
- (2) The first server distributes the constructed global medical diagnosis knowledge graph to each medical institution participant.
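- (3) Each medical institution participant computes disease diagnosis co-occurrence statistics internally. The set of all disease classification codes in the participant's electronic medical records is denoted $C=\{c_1,c_2,\ldots,c_{|C|}\}$, with $|C|$ codes in total. Each patient's medical record is treated as a sequence of visits $V=\{V_1,V_2,\ldots,V_T\}$, $T$ visits in total, and the set of disease classification codes of one visit is denoted $V_t$. Adding the upper-level code of every code in $V_t$ gives the enhanced disease classification code set $V'_t$; the codes in $V'_t$ are combined pairwise into code pairs, and the co-occurrence information of each pair is computed. The set of all disease classification codes in the medical diagnosis knowledge graph is denoted $C'=\{c'_1,c'_2,\ldots,c'_N\}$, with $N$ codes in total. The participant builds a co-occurrence matrix $M$ over all codes in the knowledge graph. The element $M_{ij}$ in row $i$, column $j$ of $M$ is the co-occurrence information of codes $c'_i$ and $c'_j$, accumulated over the participant's $P$ patients from the co-occurrence information of $c'_i$ and $c'_j$ in the enhanced code set $V'_t$ of a visit of patient $p$.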
- (4) Data encryption calculation: the second server generates the encryption algorithm, encryption key, decryption algorithm and decryption key, and distributes the encryption algorithm and encryption key to each medical institution participant. Each participant uses the encryption algorithm and encryption key to encrypt its co-occurrence matrix and uploads it to the first server.
- Working on ciphertexts, the first server adds up the co-occurrence information of the same two codes to obtain the global co-occurrence matrix in ciphertext form, and sends it to the second server.
- The second server recovers the global co-occurrence matrix with the decryption algorithm and decryption key, and returns it to the first server.
- (5) Knowledge representation learning: on the first server, each disease classification code is expressed as a representation vector of real numbers, and an objective function $J$ is constructed from the following quantities.
- $w_i$ and $w_j$ are the representation vectors of codes $c'_i$ and $c'_j$, $b_i$ and $b_j$ are the bias terms of the two representation vectors, $X_{ij}$ is the co-occurrence information of $c'_i$ and $c'_j$ in the global co-occurrence matrix, and $f$ is a weighting function.
- The objective function is optimized until convergence to obtain the representation vectors $w_i$ and $w_j$.
- Both the first server and the second server are third-party servers. The third-party servers must be honest and can communicate with each other.
- Each medical institution participant deploys its own electronic medical record database internally. The raw data in that database may not leave the participant, and participants cannot communicate with each other directly; they can only communicate with the third-party servers.
- The medical diagnosis ontologies used to build the global medical diagnosis knowledge graph include ICD, CCS, and SNOMED.
- The constructed medical diagnosis knowledge graph is stored as a dictionary, and each element of the dictionary records the hierarchy information of one disease.
- The enhanced disease classification code set $V'_t$ is constructed as follows: the ancestor nodes of each leaf node are looked up in the medical diagnosis knowledge graph, and the upper-level disease classification codes corresponding to common ancestor nodes are added to $V_t$ repeatedly.
- In step (3), for the two codes $c_i$, $c_j$ of a code pair, the co-occurrence information is the product of their occurrence counts weighted by the reciprocal of their distance:
  $$\mathrm{cooccurrence}(c_i, c_j, V'_t) = \frac{\mathrm{count}(c_i, V'_t) \times \mathrm{count}(c_j, V'_t)}{d_{ij}}$$
- Here $\mathrm{count}(c_i, V'_t)$ is the number of times $c_i$ appears in $V'_t$, $\mathrm{count}(c_j, V'_t)$ is the number of times $c_j$ appears in $V'_t$, $d_{ij}$ is the distance between the two codes $c_i$ and $c_j$, and $\mathrm{cooccurrence}(c_i, c_j, V'_t)$ is the co-occurrence information of the code pair.
- In step (3), the per-visit co-occurrence term for $c'_i$ and $c'_j$ is calculated as follows: if both codes appear in the enhanced disease classification code set $V'_t$ of a visit of patient $p$, so that a code $c_i$ equal to $c'_i$ and a code $c_j$ equal to $c'_j$ can be found in that $V'_t$, the term equals $\mathrm{cooccurrence}(c_i, c_j, V'_t)$; otherwise it equals 0.
- Step (4) is specifically:
- The $K$ co-occurrence matrices of the $K$ medical institution participants are denoted $M_1, M_2, \ldots, M_K$, with $K \ge 2$; each participant holds its own co-occurrence information for any two codes $c'_i$ and $c'_j$.
- The second server uses an additive homomorphic encryption algorithm to obtain the encryption algorithm ENC, the decryption algorithm DEC, the encryption key $KEY_E$ and the decryption key $KEY_D$, and sends ENC and $KEY_E$ to each medical institution participant.
- Each medical institution participant first encrypts its co-occurrence information into ciphertext and then sends the ciphertext to the first server.
- The first server operates directly on the ciphertexts. By the additive homomorphic property, it only needs to compute the product of the ciphertexts: the encrypted co-occurrence information $EncX_{ij}$ of codes $c'_i$ and $c'_j$ is the product of the $K$ participants' ciphertexts for that code pair.
- This is repeated for any two disease classification codes in the medical diagnosis knowledge graph, finally giving the global co-occurrence matrix $EncX$ in ciphertext form.
- The first server sends $EncX$ to the second server; the second server decrypts it to obtain the global co-occurrence matrix $X = \mathrm{DEC}(KEY_D, EncX)$ and returns it to the first server.
- In step (5), $f$ is a piecewise function.
- MAX and $\alpha$ are both hyperparameters whose optimal values are set according to experimental results. Once the co-occurrence information exceeds MAX, $f(X_{ij})$ stays constant at 1.
- In another aspect, the present invention discloses a multi-center medical diagnosis knowledge graph representation learning system, which includes:
- Global medical diagnosis knowledge graph construction module: builds the global medical diagnosis knowledge graph on the first server. The graph represents the hierarchical structure of medical diagnosis concepts as a directed acyclic graph made up of leaf nodes and ancestor nodes; a leaf node is the smallest (most specific) disease classification code, and its ancestor nodes are the upper-level disease classification codes corresponding to that leaf code.
- Medical diagnosis knowledge graph distribution module: distributes the global medical diagnosis knowledge graph constructed by the first server to each medical institution participant.
- Disease diagnosis co-occurrence statistics module: deployed at each medical institution participant. The set of all disease classification codes in the participant's electronic medical records is denoted $C=\{c_1,c_2,\ldots,c_{|C|}\}$, with $|C|$ codes in total. Each patient's medical record is treated as a sequence of visits $V=\{V_1,V_2,\ldots,V_T\}$, $T$ visits in total; the set of disease classification codes of one visit is denoted $V_t$, and adding the upper-level code of every code in $V_t$ gives the enhanced disease classification code set $V'_t$. The codes in $V'_t$ are combined pairwise, the co-occurrence information of each pair is computed, and the participant builds its co-occurrence matrix $M$ over all codes in the knowledge graph as in step (3).
- Data encryption calculation module: the second server generates the encryption algorithm, encryption key, decryption algorithm and decryption key, and distributes the encryption algorithm and encryption key to each medical institution participant. Each participant encrypts its co-occurrence matrix and uploads it to the first server. Working on ciphertexts, the first server adds up the co-occurrence information of the same two codes to obtain the global co-occurrence matrix in ciphertext form and sends it to the second server. The second server recovers the global co-occurrence matrix with the decryption algorithm and decryption key and returns it to the first server.
- Knowledge representation learning module: deployed on the first server. Each disease classification code is expressed as a representation vector of real numbers, and an objective function $J$ is constructed from the following quantities.
- $w_i$ and $w_j$ are the representation vectors of codes $c'_i$ and $c'_j$, $b_i$ and $b_j$ are the bias terms of the two representation vectors, $X_{ij}$ is the co-occurrence information of $c'_i$ and $c'_j$ in the global co-occurrence matrix, and $f$ is a weighting function.
- A co-occurrence matrix of the disease classification codes is built and the co-occurrence information of each pair of codes is computed; code pairs that co-occur more often and lie closer together in the graph have larger co-occurrence information.
- Figure 1 is a schematic diagram of the network architecture of the multi-center medical diagnosis knowledge graph representation learning method provided by the embodiment of the present invention.
- Figure 2 is a flow chart of the multi-center medical diagnosis knowledge graph representation learning method provided by the embodiment of the present invention.
- Figure 3 is an example of the structure of the medical diagnosis knowledge graph provided by the embodiment of the present invention.
- The present invention provides a multi-center medical diagnosis knowledge graph representation learning method.
- The method is based on federated learning and homomorphic encryption, uses multi-center data, combines hierarchical information with complex association relationships, and realizes knowledge representation learning of structural information.
- The method is based on the network architecture shown in Figure 1, which includes two third-party servers (the first server and the second server) and multiple medical institution participants. The third-party servers must be honest and can communicate with each other.
- Each medical institution participant deploys its own electronic medical record database internally, and the raw data in that database may not leave the participant. Participants cannot communicate with each other directly; they can only communicate with the third-party servers.
- The multi-center medical diagnosis knowledge graph representation learning method provided in this embodiment has the following steps:
- The first server is responsible for constructing the global medical diagnosis knowledge graph.
- The global medical diagnosis knowledge graph represents the hierarchical structure of medical diagnosis concepts as a directed acyclic graph.
- The graph is composed of leaf nodes and ancestor nodes.
- The leaf nodes are the smallest (most specific) disease classification codes.
- The ancestor nodes are the upper-level disease classification codes corresponding to the leaf-node codes.
- ICD-10 is used as the medical diagnosis ontology to construct the global medical diagnosis knowledge graph.
- Other knowledge sources commonly used in the medical field, such as CCS and SNOMED, can also be chosen as the medical diagnosis ontology.
- For example, viral pharyngitis J02.801 is a leaf node, and its ancestor nodes are built from the disease hierarchy in ICD-10: respiratory diseases J00-J99, acute upper respiratory tract infections J00-J06, and acute pharyngitis J02, as shown in Figure 3.
- The constructed medical diagnosis knowledge graph is stored as a dictionary, and each element of the dictionary records the hierarchy information of one disease.
- The hierarchy information is stored as {J02.801: [J02.801, root, J00-J99, J00-J06, J02]}, where root represents the root node.
- The first server distributes the constructed global medical diagnosis knowledge graph to each medical institution participant. Because the knowledge graph is publicly available, it does not need to be encrypted.
- Taking a single visit $V_t$ as the unit, each medical institution participant adds the upper-level disease classification code of every code in $V_t$ to obtain the enhanced disease classification code set $V'_t$. That is, the ancestor nodes of each leaf node are looked up in the medical diagnosis knowledge graph, and the upper-level codes corresponding to common ancestor nodes are added repeatedly.
- The number of occurrences of each disease classification code and of its upper-level codes in $V'_t$ is counted. The codes in $V'_t$ are combined pairwise into code pairs, and the co-occurrence information of a pair is obtained by multiplying the occurrence counts of its two codes. The distance between the two codes, i.e. the number of edges on the shortest path connecting the two nodes, is also computed, and its reciprocal is used as a weight:
  $$\mathrm{cooccurrence}(c_i, c_j, V'_t) = \frac{\mathrm{count}(c_i, V'_t) \times \mathrm{count}(c_j, V'_t)}{d_{ij}}$$
- Here $\mathrm{count}(c_i, V'_t)$ is the number of times $c_i$ appears in $V'_t$, $\mathrm{count}(c_j, V'_t)$ is the number of times $c_j$ appears in $V'_t$, $d_{ij}$ is the distance between the two codes $c_i$ and $c_j$, and $\mathrm{cooccurrence}(c_i, c_j, V'_t)$ is the co-occurrence information of the code pair.
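The enhancement and pair-scoring steps can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function names are hypothetical, the J04.000 entry is an assumed second dictionary element, and the distance routine assumes each code has a single parent (a tree), which is a simplification of a general directed acyclic graph.

```python
from collections import Counter

# Hierarchy dictionary in the layout quoted in the text:
# leaf -> [leaf, root, ..., direct parent]. The J04.000 entry is an assumption.
GRAPH = {
    "J02.801": ["J02.801", "root", "J00-J99", "J00-J06", "J02"],
    "J04.000": ["J04.000", "root", "J00-J99", "J00-J06", "J04"],
}


def build_parents(graph):
    """Derive a child -> parent map from the stored root-to-leaf chains."""
    parent = {}
    for leaf, entry in graph.items():
        chain = entry[1:] + [leaf]  # root, ..., direct parent, leaf
        for upper, lower in zip(chain, chain[1:]):
            parent[lower] = upper
    return parent


def enhance(visit, graph):
    """Build V'_t: every leaf followed by its ancestors, repeats kept."""
    enhanced = []
    for leaf in visit:
        enhanced.append(leaf)
        enhanced.extend(reversed(graph[leaf][1:]))  # direct parent up to root
    return enhanced


def path_to_root(code, parent):
    path = [code]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def distance(a, b, parent):
    """Number of edges on the shortest path between two codes (tree assumed)."""
    depth_from_b = {code: k for k, code in enumerate(path_to_root(b, parent))}
    for k, code in enumerate(path_to_root(a, parent)):
        if code in depth_from_b:  # first shared node is the lowest common ancestor
            return k + depth_from_b[code]
    raise ValueError("codes are not connected")


def cooccurrence(ci, cj, enhanced, parent):
    """count(ci) * count(cj) / d_ij for two distinct codes."""
    counts = Counter(enhanced)
    return counts[ci] * counts[cj] / distance(ci, cj, parent)


parent = build_parents(GRAPH)
enhanced = enhance(["J02.801", "J04.000"], GRAPH)
score = cooccurrence("J02.801", "J00-J06", enhanced, parent)
```

Pairs of identical codes have distance 0 and are deliberately not handled here, because the text does not reproduce how the diagonal entries are defined.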
- The medical institution participant constructs a co-occurrence matrix $M$ over all disease classification codes in the medical diagnosis knowledge graph, as shown in Table 1.
- $M_{ij}$ is the co-occurrence information of the two codes $c'_i$ and $c'_j$.
- $P$ is the total number of patients at the medical institution participant.
- The co-occurrence matrix $M$ is symmetric, i.e. $M_{ij}$ and $M_{ji}$ are equal, and the diagonal holds the co-occurrence information of a disease classification code with itself.
- For example, for a visit $V_t$ = [J02.801, J04.000], the enhanced set is $V'_t$ = [J02.801, J02, J00-J06, J00-J99, root, J04.000, J04, J00-J06, J00-J99, root].
- In this $V'_t$ the code J02.801 occurs once, J00-J06 occurs twice, and the distance between them is 2, so their co-occurrence information is $1 \times 2 / 2 = 1$.
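A participant's matrix $M$ could then be filled by accumulating the per-visit pair scores. The sketch below uses NumPy and assumes that the scores are summed over all patients and visits, which is inferred from the description of $P$ because the formula itself is not reproduced in this text. The function name and the input layout (one list of per-visit dictionaries per patient) are hypothetical.

```python
import numpy as np


def build_cooccurrence_matrix(codes, patients):
    """Accumulate per-visit pair scores into a symmetric N x N matrix.

    codes    : list of all N codes in the knowledge graph (fixes row/column order)
    patients : list with one entry per patient; each entry is a list of visits,
               and each visit is a dict {(code_i, code_j): co-occurrence score}
    """
    index = {code: k for k, code in enumerate(codes)}
    M = np.zeros((len(codes), len(codes)))
    for visits in patients:            # patient p
        for pair_scores in visits:     # visit t of patient p
            for (ci, cj), value in pair_scores.items():
                i, j = index[ci], index[cj]
                M[i, j] += value
                if i != j:             # keep the matrix symmetric
                    M[j, i] += value
    return M


codes = ["J02.801", "J00-J06", "J02"]
patients = [
    [{("J02.801", "J00-J06"): 1.0}],
    [{("J02.801", "J00-J06"): 1.0, ("J02.801", "J02"): 1.0}],
]
M = build_cooccurrence_matrix(codes, patients)
```

Code pairs that never co-occur in any visit simply stay at 0, matching the rule that the per-visit term is 0 when the two codes do not appear together.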
- The second server generates an encryption algorithm, an encryption key, a decryption algorithm and a decryption key, and distributes the encryption algorithm and the encryption key to each medical institution participant.
- Each medical institution participant uses the encryption algorithm and the encryption key to encrypt its co-occurrence matrix and uploads it to the first server.
- The first server adds up the co-occurrence information of the same two codes to obtain the global co-occurrence matrix in ciphertext form, and sends it to the second server.
- The second server recovers the global co-occurrence matrix with the decryption algorithm and the decryption key, and returns it to the first server. There is no risk of data leakage during the whole process.
- The specific implementation process is as follows:
- The $K$ co-occurrence matrices of the $K$ medical institution participants are denoted $M_1, M_2, \ldots, M_K$, with $K \ge 2$; each participant holds its own co-occurrence information for any two codes $c'_i$ and $c'_j$.
- The second server uses an additive homomorphic encryption algorithm to obtain the encryption algorithm ENC, the decryption algorithm DEC, the encryption key $KEY_E$ and the decryption key $KEY_D$, and sends ENC and $KEY_E$ to each medical institution participant.
- Each medical institution participant first encrypts its co-occurrence information into ciphertext and then sends the ciphertext to the first server.
- The first server does not decrypt anything; it operates directly on the ciphertexts. By the additive homomorphic property, it only needs to compute the product of the ciphertexts.
- The encrypted co-occurrence information $EncX_{ij}$ of the two codes $c'_i$ and $c'_j$ is therefore the product of the $K$ participants' ciphertexts for that code pair.
- These steps are followed for any two disease classification codes in the medical diagnosis knowledge graph to compute their co-occurrence information in ciphertext form, finally giving the global co-occurrence matrix $EncX$ in ciphertext form.
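The text does not name a specific additive homomorphic scheme. As an illustration only, the sketch below uses a textbook Paillier construction, in which multiplying ciphertexts adds the underlying plaintexts. The fixed small primes are for demonstration and are not secure, and the fixed-point scaling of fractional co-occurrence values is an assumption, since Paillier works on integers. All function names are hypothetical.

```python
import math
import secrets

SCALE = 1000  # assumed fixed-point factor for fractional co-occurrence values


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier key pair (second server). Demo primes, NOT secure."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)        # valid because the generator is g = n + 1
    return n, (lam, mu)         # public KEY_E, private KEY_D


def encode(value):
    return round(value * SCALE)


def enc(n, m):
    """Encrypt integer m under public key n (each participant)."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def aggregate(n, ciphertexts):
    """First server: multiply ciphertexts, which adds the plaintexts."""
    total = 1
    for c in ciphertexts:
        total = (total * c) % (n * n)
    return total


def dec(n, key_d, c):
    """Second server: decrypt with the private key."""
    lam, mu = key_d
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


n, key_d = keygen()
local_values = [1.0, 2.5, 0.333]   # one code pair's value at K = 3 participants
ciphertexts = [enc(n, encode(v)) for v in local_values]
global_scaled = dec(n, key_d, aggregate(n, ciphertexts))
global_value = global_scaled / SCALE
```

In this arrangement the first server only ever sees ciphertexts and the second server only sees the aggregated sum, which is the division of roles the steps above describe.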
- Each disease classification code is expressed as a representation vector of real numbers, and the relationship between the representation vectors and the global co-occurrence matrix is expressed through the following quantities.
- $w_i$ and $w_j$ are the representation vectors to be solved for the disease classification codes $c'_i$ and $c'_j$; each is randomly initialized as a 128-dimensional vector with values between -0.1 and 0.1. The superscript $T$ denotes transposition. $b_i$ and $b_j$ are the bias terms of the two representation vectors, initialized to 0. $X_{ij}$ is the co-occurrence information of $c'_i$ and $c'_j$ in the global co-occurrence matrix $X$.
- $f$ is the weighting function.
- The hyperparameters MAX and $\alpha$ are set to their optimal values according to experimental results; they can be set to 100 and 0.75, respectively.
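The objective and the weighting function themselves are not reproduced in this text. One reading consistent with the symbols defined here (a transposed product of two vectors, two bias terms, a weight that is capped at 1 above MAX, and the values 100 and 0.75) is the GloVe-style weighted least-squares form. The following is a reconstruction under that assumption, not the patent's verbatim formula; in particular the $\log X_{ij}$ target, the summation range, and the shape of $f$ below MAX are assumed.

```latex
% Assumed relationship between the vectors and the global co-occurrence matrix
w_i^{T} w_j + b_i + b_j = \log X_{ij}

% Assumed objective: weighted squared error over all code pairs
J = \sum_{i,j=1}^{N} f(X_{ij}) \left( w_i^{T} w_j + b_i + b_j - \log X_{ij} \right)^{2}

% Assumed weighting function, constant at 1 once X_{ij} reaches MAX
f(X_{ij}) =
\begin{cases}
  \left( X_{ij} / \mathrm{MAX} \right)^{\alpha}, & X_{ij} < \mathrm{MAX} \\
  1, & \text{otherwise}
\end{cases}
```

Under this reading, frequent pairs cannot dominate the loss because their weight saturates, while rare pairs are down-weighted smoothly.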
- The objective function is optimized with the AdaDelta gradient descent algorithm on randomly sampled elements of the global co-occurrence matrix $X$, with the learning rate set to 0.05, for 50 iterations until convergence, yielding the representation vectors $w_i$ and $w_j$.
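A compact NumPy sketch of such a training loop is given below. It assumes a GloVe-style weighted squared-error objective with a $\log X_{ij}$ target (the exact formula is not reproduced in this text), uses a single shared set of vectors for both codes of a pair, and replaces random sampling with full-batch gradients over the non-zero entries for brevity. The AdaDelta update is written with a learning-rate multiplier, as common library implementations do; the function names and the `rho` and `eps` defaults are assumptions.

```python
import numpy as np


def weight(X, x_max=100.0, alpha=0.75):
    """Assumed weighting function: (X / MAX)^alpha below MAX, 1 otherwise."""
    return np.where(X < x_max, (X / x_max) ** alpha, 1.0)


def loss_and_grads(W, b, X):
    """Weighted squared error over non-zero entries of X, with gradients."""
    mask = X > 0
    log_x = np.log(np.where(mask, X, 1.0))
    res = np.where(mask, W @ W.T + b[:, None] + b[None, :] - log_x, 0.0)
    A = weight(X) * res
    loss = float(np.sum(A * res))
    grad_W = 2.0 * (A + A.T) @ W
    grad_b = 2.0 * (A.sum(axis=1) + A.sum(axis=0))
    return loss, grad_W, grad_b


def train(X, dim=128, lr=0.05, iters=50, rho=0.9, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Vectors start uniformly in [-0.1, 0.1]; biases start at 0, as in the text.
    params = [rng.uniform(-0.1, 0.1, size=(n, dim)), np.zeros(n)]
    avg_sq_grad = [np.zeros_like(p) for p in params]
    avg_sq_step = [np.zeros_like(p) for p in params]
    history = []
    for _ in range(iters):
        loss, grad_W, grad_b = loss_and_grads(params[0], params[1], X)
        history.append(loss)
        for p, g, eg, ed in zip(params, (grad_W, grad_b), avg_sq_grad, avg_sq_step):
            eg *= rho
            eg += (1.0 - rho) * g * g                      # running average of g^2
            step = -np.sqrt(ed + eps) / np.sqrt(eg + eps) * g
            ed *= rho
            ed += (1.0 - rho) * step * step                # running average of step^2
            p += lr * step
    return params[0], params[1], history


X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 4.0],
              [1.0, 4.0, 0.0]])
W, b, history = train(X, dim=8)
```

With so few iterations and AdaDelta's small early steps, the loss on this toy matrix only drops slightly; the sketch is meant to show the update structure, not a tuned training schedule.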
- The representation vectors obtained through knowledge graph representation learning can be used to compute the similarity between diseases, and can also be combined with patients' medical records and fed into a deep learning model for prediction tasks, for example predicting the diseases likely to appear at the next visit from a patient's historical visit records. In electronic medical records, each patient's record can be regarded as multiple visits, and each visit contains a series of disease classification codes, i.e. a subset of $C'$.
- The disease classification code set of one visit of a patient can be expressed as a binary vector $x_t \in \{0,1\}^N$, whose $i$-th element indicates whether the code $c'_i$ appears in this visit: 1 if it does, 0 otherwise.
- The binary vector $x_t$ of each visit can be multiplied by the representation vectors and then transformed nonlinearly, and the result used as the input of an RNN prediction model that predicts the disease classification codes of the next visit, and hence the diseases likely to occur.
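The visit encoding can be illustrated with NumPy. The sketch stacks the representation vectors as rows of an $N \times d$ matrix, so that multiplying by $x_t$ sums the vectors of the codes present in the visit. The choice of `tanh` as the nonlinearity and the function names are assumptions, since the text does not specify them, and the RNN itself is left out.

```python
import numpy as np


def visit_to_binary(visit_codes, all_codes):
    """x_t in {0,1}^N: 1 where the code appears in the visit, else 0."""
    present = set(visit_codes)
    return np.array([1.0 if code in present else 0.0 for code in all_codes])


def visit_embedding(x_t, W):
    """Sum the representation vectors of the codes present, then apply tanh."""
    return np.tanh(x_t @ W)


all_codes = ["J02.801", "J02", "J04.000"]      # toy code list, N = 3
W = np.array([[0.1, 0.2],                      # toy 2-dimensional vectors
              [0.3, 0.4],
              [0.5, 0.6]])
x_t = visit_to_binary(["J02.801", "J04.000"], all_codes)
v_t = visit_embedding(x_t, W)                  # one RNN input step
```

A sequence of such vectors, one per visit, would form the input sequence for the prediction model.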
- The embodiment of the present invention also provides a multi-center medical diagnosis knowledge graph representation learning system. It comprises the same modules as the system described above: the global medical diagnosis knowledge graph construction module, the medical diagnosis knowledge graph distribution module, the disease diagnosis co-occurrence statistics module, the data encryption calculation module, and the knowledge representation learning module.
Abstract
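The invention discloses a multi-center medical diagnosis knowledge graph representation learning method and system. Based on existing medical diagnosis ontologies, the invention represents the hierarchical structure of medical diagnosis concepts as a directed acyclic graph and builds a global medical diagnosis knowledge graph. Using this graph, it builds a co-occurrence matrix of all disease classification codes and computes the co-occurrence information of each code pair; pairs that co-occur more often and lie closer together have larger co-occurrence information. Based on federated learning, and while protecting the data privacy and security of each participating medical institution, it uses multi-center data and sums the co-occurrence information to increase data density and solve the data sparsity problem. While learning from large-scale knowledge graphs and raw data, it incorporates the hierarchical information and complex associations in the knowledge source that conform to human cognition, mines correlations in the data, enriches semantic information, and learns high-quality representations of knowledge that facilitate knowledge computation and reasoning.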
Description
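The invention belongs to the technical field of medical information, and in particular relates to a multi-center medical diagnosis knowledge graph representation learning method and system.
A knowledge graph describes concepts, entities and their relationships in the objective world in a structured form and expresses information in a form closer to how humans perceive the world. It provides a better way to organize, manage and understand information, and can be used to mine, analyze and construct knowledge and to draw and display the interrelationships among pieces of knowledge. Representation learning aims to represent the semantic information of the research object as a dense, low-dimensional vector. Knowledge graph representation learning learns representations of the entities and relationships in a knowledge graph; by learning from and training on large-scale knowledge graphs and raw data, it obtains distributed vector representations of knowledge in a low-dimensional dense space that express the semantic information of entities and relationships and facilitate knowledge computation and reasoning.
Using a medical ontology to encode hierarchical clinical structure and the relationships between medical concepts can reduce the need for large amounts of data and effectively shrink the search space without losing information. Fortunately, the healthcare domain has many well-organized ontologies, such as the International Classification of Diseases (ICD), the Clinical Classifications Software (CCS), or the Systematized Nomenclature of Medicine Clinical Terms (SNOMED). In a medical ontology, nodes that are close to each other (i.e., medical concepts) are likely to be associated with similar patients, which allows knowledge to be transferred between them. Medical ontologies can be useful when there is too little data to train a deep learning model. Even when data is sufficient, they can serve to simplify the model without loss of information, by learning more interpretable representations that conform to the ontology structure.
Knowledge representation learning models based only on the structural information of a medical knowledge graph cannot solve problems such as the weak semantic representation caused by complex relationship modeling and data sparsity. Existing work uses large amounts of text from outside the knowledge graph to expand its structural information and reduce the impact of data sparsity. Existing methods ignore the structural and correlation information inherent in the data. In addition, there is no method for enlarging the amount of data used for knowledge representation learning while preserving privacy and data security.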
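Summary of the Invention
To address the deficiencies of the prior art, the invention proposes a multi-center medical diagnosis knowledge graph representation learning method and system. While protecting the data privacy and security of each participating medical institution, it uses multi-center data to increase data density. In addition, while learning from large-scale knowledge graphs and raw data, it incorporates the hierarchical information and complex associations in the knowledge source that conform to human cognition, mines correlations in the data, and enriches semantic information, thereby addressing the weak semantic representation caused by data sparsity.
The object of the invention is achieved through the following technical solution:
In one aspect, the invention discloses a multi-center medical diagnosis knowledge graph representation learning method. The method is based on federated learning and homomorphic encryption, uses multi-center data, combines hierarchical information with complex association relationships, and realizes knowledge representation learning of structural information. It includes the following steps:
(1) The first server builds a global medical diagnosis knowledge graph, which represents the hierarchical structure of medical diagnosis concepts as a directed acyclic graph made up of leaf nodes and ancestor nodes. A leaf node is the smallest (most specific) disease classification code, and its ancestor nodes are the upper-level disease classification codes corresponding to that leaf code.
(2) The first server distributes the constructed global medical diagnosis knowledge graph to each medical institution participant.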
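(3) Each medical institution participant computes disease diagnosis co-occurrence statistics internally, specifically:
The set of all disease classification codes in the electronic medical records of a medical institution participant is denoted $C=\{c_1,c_2,\ldots,c_{|C|}\}$, with $|C|$ codes in total. Each patient's medical record at the participant is treated as a sequence of visits $V=\{V_1,V_2,\ldots,V_T\}$, $T$ visits in total, and the set of disease classification codes of one visit is denoted $V_t$. Adding the upper-level code of every code in $V_t$ gives the enhanced disease classification code set $V'_t$. The codes in $V'_t$ are combined pairwise into code pairs, and the co-occurrence information of each pair is computed.
The set of all disease classification codes in the medical diagnosis knowledge graph is denoted $C'=\{c'_1,c'_2,\ldots,c'_N\}$, with $N$ codes in total.
The participant builds a co-occurrence matrix $M$ over all disease classification codes in the medical diagnosis knowledge graph. The element $M_{ij}$ in row $i$, column $j$ of $M$ is the co-occurrence information of the two codes $c'_i$ and $c'_j$, accumulated over the participant's $P$ patients from the co-occurrence information of $c'_i$ and $c'_j$ in the enhanced code set $V'_t$ of a visit of patient $p$.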
(4)数据加密计算:第二服务器生成加密算法、加密密钥、解密算法和解密密钥,并将加密算法和加密密钥分发给各医疗机构参与方;各医疗机构参与方使用加密算法和加密密钥对其共现矩阵进行加密并上传至第一服务器;第一服务器在密文状态下,加和相同两编码的共现信息,得到密文状态下的全局共现矩阵,发送给第二服务器;第二服务器通过解密算法和解密密钥得到全局共现矩阵,返回给第一服务器;
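(5) Knowledge representation learning: on the first server, each disease classification code is expressed as a representation vector of real numbers, and the following objective function $J$ is constructed:
$$J=\sum_{i,j=1}^{N} f(X_{ij})\left(w_i^{T}w_j+b_i+b_j-\log X_{ij}\right)^2$$
where $w_i$ and $w_j$ are the representation vectors of codes $c'_i$ and $c'_j$, $b_i$ and $b_j$ are the bias terms of the two representation vectors, $X_{ij}$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the global co-occurrence matrix, and $f$ is a weighting function;
The objective function is optimized until convergence, yielding the representation vectors $w_i$ and $w_j$.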
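Further, the first server and the second server are both third-party servers. The third-party servers must be honest and can communicate with each other. Each participating medical institution deploys its own electronic medical record database internally, and the raw data in that database may not leave the institution. The institutions cannot communicate with each other directly; they can communicate only with the third-party servers.
Further, the medical diagnosis ontologies used in constructing the global medical diagnosis knowledge graph include ICD, CCS, and SNOMED.
Further, the constructed medical diagnosis knowledge graph is stored as a dictionary, each element of which records the hierarchy information of one disease.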
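Further, the augmented code set $V'_t$ is constructed as follows: the ancestor nodes of each leaf node are looked up in the medical diagnosis knowledge graph, and the upper-level codes of shared ancestor nodes are added to $V_t$ repeatedly, once for each leaf that has them.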
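Further, in step (3), for the two codes $c_i$, $c_j$ of a code pair, the co-occurrence information is computed as:
$$\mathrm{cooccurrence}(c_i,c_j,V'_t)=\frac{\mathrm{count}(c_i,V'_t)\times\mathrm{count}(c_j,V'_t)}{d_{ij}}$$
where $\mathrm{count}(c_i,V'_t)$ is the number of occurrences of $c_i$ in $V'_t$, $\mathrm{count}(c_j,V'_t)$ is the number of occurrences of $c_j$ in $V'_t$, $d_{ij}$ is the distance between the two codes $c_i$, $c_j$, and $\mathrm{cooccurrence}(c_i,c_j,V'_t)$ is the co-occurrence information of the pair.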
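Further, in step (3), the per-visit term $\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$ is computed as follows: if codes $c'_i$ and $c'_j$ both appear in the augmented code set $V'_t$ of one visit of patient $p$, then a code $c_i$ equal to $c'_i$ and a code $c_j$ equal to $c'_j$ can be found in that $V'_t$, and $\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$ equals $\mathrm{cooccurrence}(c_i,c_j,V'_t)$; otherwise it equals 0.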
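Further, step (4) is specifically:
The second server uses an additively homomorphic encryption scheme to obtain the encryption algorithm ENC, the decryption algorithm DEC, the encryption key $\mathrm{KEY}_E$, and the decryption key $\mathrm{KEY}_D$, and sends ENC and $\mathrm{KEY}_E$ to each participating medical institution;
The first server operates directly on ciphertexts. By additive homomorphism, it only needs to compute the product of the ciphertexts. With $M^{(k)}$ denoting the co-occurrence matrix of participant $k$, the encrypted co-occurrence information $\mathrm{EncX}_{ij}$ of codes $c'_i$ and $c'_j$ is:
$$\mathrm{EncX}_{ij}=\prod_{k}\mathrm{ENC}\left(\mathrm{KEY}_E, M^{(k)}_{ij}\right)$$
The encrypted co-occurrence information is computed for every pair of disease classification codes in the knowledge graph, giving the global co-occurrence matrix in ciphertext form, EncX;
The first server sends EncX to the second server, which decrypts it to obtain the global co-occurrence matrix $X$, i.e. $X=\mathrm{DEC}(\mathrm{KEY}_D,\mathrm{EncX})$, and returns it to the first server.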
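Further, in step (5), $f$ is the following piecewise function:
$$f(X_{ij})=\begin{cases}\left(X_{ij}/\mathrm{MAX}\right)^{\alpha}, & X_{ij}<\mathrm{MAX}\\ 1, & \text{otherwise}\end{cases}$$
where MAX and $\alpha$ are hyperparameters set to the best values found experimentally; once the co-occurrence information exceeds MAX, $f(X_{ij})$ stays constant at 1.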
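In another aspect, the invention discloses a multi-center medical diagnosis knowledge graph representation learning system, comprising:
A global medical diagnosis knowledge graph construction module: constructs the global medical diagnosis knowledge graph on the first server. The graph represents the hierarchy of medical diagnosis concepts as a directed acyclic graph and consists of leaf nodes and ancestor nodes; the leaf nodes are the finest-grained disease classification codes, and their ancestor nodes are the corresponding upper-level disease classification codes;
A medical diagnosis knowledge graph distribution module: distributes the global medical diagnosis knowledge graph constructed by the first server to each participating medical institution;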
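A disease diagnosis co-occurrence statistics module: deployed at each participating medical institution. Denote the set of all disease classification codes in the institution's electronic medical records as $C=\{c_1,c_2,\dots,c_{|C|}\}$, with $|C|$ codes in total. Each patient's record is treated as a sequence of visits $V=\{V_1,V_2,\dots,V_T\}$, with $T$ visits in total, and the code set of visit $t$ is denoted $V_t$. Adding the upper-level codes of every code in $V_t$ yields the augmented code set $V'_t$; the codes in $V'_t$ are combined pairwise into code pairs and the co-occurrence information of each pair is computed. Denote the set of all disease classification codes in the knowledge graph as $C'=\{c'_1,c'_2,\dots,c'_N\}$, with $N$ codes in total.
The institution builds a co-occurrence matrix $M$ over all disease classification codes in the knowledge graph. The element $M_{ij}$ in row $i$, column $j$ is the co-occurrence information of codes $c'_i$ and $c'_j$:
$$M_{ij}=\sum_{p=1}^{P}\sum_{t=1}^{T}\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$$
where $P$ is the total number of patients at the institution and $\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the augmented code set $V'_t$ of one visit of patient $p$;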
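An encrypted data computation module: the second server generates an encryption algorithm, an encryption key, a decryption algorithm, and a decryption key, and distributes the encryption algorithm and encryption key to the participating medical institutions; each institution encrypts its co-occurrence matrix with the encryption algorithm and key and uploads it to the first server; the first server, working on ciphertexts, sums the co-occurrence information of identical code pairs to obtain the global co-occurrence matrix in ciphertext form and sends it to the second server; the second server recovers the global co-occurrence matrix with the decryption algorithm and key and returns it to the first server;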
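A knowledge representation learning module: deployed on the first server. Each disease classification code is expressed as a representation vector of real numbers, and the following objective function $J$ is constructed:
$$J=\sum_{i,j=1}^{N} f(X_{ij})\left(w_i^{T}w_j+b_i+b_j-\log X_{ij}\right)^2$$
where $w_i$ and $w_j$ are the representation vectors of codes $c'_i$ and $c'_j$, $b_i$ and $b_j$ are the bias terms of the two representation vectors, $X_{ij}$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the global co-occurrence matrix, and $f$ is a weighting function;
The objective function is optimized until convergence, yielding the representation vectors $w_i$ and $w_j$.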
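The beneficial effects of the invention are:
1. Building on existing medical diagnosis ontologies (ICD, CCS, SNOMED, etc.), the hierarchy of medical diagnosis concepts is represented as a directed acyclic graph and a global medical diagnosis knowledge graph is constructed. Using this graph, a co-occurrence matrix over all disease classification codes is built and the co-occurrence information of each code pair is computed; pairs that appear together more often and lie closer together receive larger co-occurrence information.
2. Based on federated learning, and while protecting the data privacy and security of each participating medical institution, multi-center data are used and co-occurrence information is summed to increase data density and address data sparsity.
3. While learning from the large-scale knowledge graph and the raw data, the method incorporates the hierarchical information and complex associations of the knowledge source, which match human cognition, mines correlations in the data, enriches semantic information, and learns high-quality knowledge representations that facilitate computation and reasoning over knowledge.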
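Figure 1 is a schematic diagram of the network architecture of the multi-center medical diagnosis knowledge graph representation learning method provided by an embodiment of the invention;
Figure 2 is a flowchart of the multi-center medical diagnosis knowledge graph representation learning method provided by an embodiment of the invention;
Figure 3 is an example of the structure of the medical diagnosis knowledge graph provided by an embodiment of the invention.
To make the above objects, features, and advantages of the invention easier to understand, specific embodiments of the invention are described in detail below with reference to the drawings.
Many specific details are set out in the following description to aid understanding of the invention. The invention can, however, be implemented in ways other than those described here, and those skilled in the art can make similar generalizations without departing from its substance; the invention is therefore not limited to the specific embodiments disclosed below.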
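The invention provides a multi-center medical diagnosis knowledge graph representation learning method. Based on federated learning and homomorphic encryption, the method uses multi-center data and combines hierarchical information with complex associations to learn knowledge representations of structural information. The method runs on the network architecture shown in Figure 1, which comprises two third-party servers (the first server and the second server) and multiple participating medical institutions. The third-party servers must be honest and can communicate with each other. Each participating medical institution deploys its own electronic medical record database internally, and the raw data in that database may not leave the institution. The institutions cannot communicate with each other directly; they can communicate only with the third-party servers.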
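As shown in Figure 2, the multi-center medical diagnosis knowledge graph representation learning method of this embodiment proceeds as follows:
1. Construction of the global medical diagnosis knowledge graph
The first server is responsible for constructing the global medical diagnosis knowledge graph. The graph represents the hierarchy of medical diagnosis concepts as a directed acyclic graph. It consists of leaf nodes and ancestor nodes: the leaf nodes are the finest-grained disease classification codes, and their ancestor nodes are the corresponding upper-level disease classification codes.
This embodiment uses ICD-10 as the medical diagnosis ontology for constructing the global graph. Other knowledge sources common in the medical domain, such as CCS and SNOMED, can also serve as the ontology. Taking viral pharyngitis as an example, viral pharyngitis J02.801 is the leaf node; from the disease hierarchy in ICD-10, its ancestor nodes are diseases of the respiratory system J00-J99, acute upper respiratory infections J00-J06, and acute pharyngitis J02, as shown in Figure 3.
The constructed medical diagnosis knowledge graph is stored as a dictionary, each element of which records the hierarchy information of one disease. For viral pharyngitis, the hierarchy information is stored as {J02.801:[J02.801,root,J00-J99,J00-J06,J02]}, where root denotes the root node.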
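The dictionary storage described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the `PARENT` map, the helper names `ancestors` and `build_hierarchy_dict`, and the assumption that each code has a single parent are all hypothetical, and only the handful of ICD-10 codes mentioned in the text are included.

```python
# Hypothetical child -> parent map for the few ICD-10 codes named in the text.
# Assumption: every code has exactly one parent, so the hierarchy is a tree.
PARENT = {
    "J00-J99": "root",
    "J00-J06": "J00-J99",
    "J02": "J00-J06",
    "J02.801": "J02",
    "J04": "J00-J06",
    "J04.000": "J04",
}

def ancestors(code, parent=PARENT):
    """Return the ancestors of `code`, ordered from the root downwards."""
    chain = []
    while code in parent:
        code = parent[code]
        chain.append(code)
    return chain[::-1]

def build_hierarchy_dict(leaves, parent=PARENT):
    """Map each leaf code to [leaf, root, ..., direct parent]."""
    return {leaf: [leaf] + ancestors(leaf, parent) for leaf in leaves}

kg = build_hierarchy_dict(["J02.801", "J04.000"])
print(kg["J02.801"])
```

The entry for J02.801 has the same layout as the stored example in the text: the leaf first, then its ancestors from the root down to its direct parent.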
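2. Distribution of the medical diagnosis knowledge graph
The first server distributes the constructed global medical diagnosis knowledge graph to each participating medical institution. Because the knowledge graph is built from publicly available sources, it need not be encrypted.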
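3. Co-occurrence statistics for disease diagnoses within each participating medical institution
Denote the set of all disease classification codes in the electronic medical records of a participating medical institution as $C=\{c_1,c_2,\dots,c_{|C|}\}$, with $|C|$ codes in total. Each patient's record at the institution can be treated as a sequence of visits $V=\{V_1,V_2,\dots,V_T\}$, with $T$ visits in total, and the set of disease classification codes of visit $t$ is denoted $V_t$.
Taking a single visit $V_t$ as the unit, the institution adds the upper-level codes of every code in $V_t$ to obtain the augmented code set $V'_t$. That is, it looks up the ancestor nodes of each leaf node in the knowledge graph; upper-level codes belonging to shared ancestors are added repeatedly, once for each leaf that has them.
The number of occurrences of each disease classification code and of its upper-level codes in $V'_t$ is counted. The codes in $V'_t$ are combined pairwise into code pairs, and the co-occurrence information of a pair is computed by multiplying the occurrence counts of its two codes. The distance between the two codes, i.e. the number of edges on the shortest path connecting the two nodes, is also computed, and its reciprocal is used as a weight.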
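For the two codes $c_i$, $c_j$ of a code pair, the co-occurrence information is:
$$\mathrm{cooccurrence}(c_i,c_j,V'_t)=\frac{\mathrm{count}(c_i,V'_t)\times\mathrm{count}(c_j,V'_t)}{d_{ij}}$$
where $\mathrm{count}(c_i,V'_t)$ is the number of occurrences of $c_i$ in $V'_t$, $\mathrm{count}(c_j,V'_t)$ is the number of occurrences of $c_j$ in $V'_t$, $d_{ij}$ is the distance between the two codes $c_i$, $c_j$, and $\mathrm{cooccurrence}(c_i,c_j,V'_t)$ is the co-occurrence information of the pair.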
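A short Python sketch can make the per-visit computation concrete: augment a visit with ancestor codes, count occurrences, and divide the product of the counts by the graph distance. The `PARENT` map and the function names are hypothetical, and the hierarchy is assumed to be a tree (one parent per code), so the shortest path always runs through the lowest common ancestor.

```python
from collections import Counter

# Hypothetical child -> parent map; assumes a single parent per code.
PARENT = {
    "J00-J99": "root", "J00-J06": "J00-J99",
    "J02": "J00-J06", "J02.801": "J02",
    "J04": "J00-J06", "J04.000": "J04",
}

def path_to_root(code):
    """The code itself followed by its ancestors up to the root."""
    path = [code]
    while code in PARENT:
        code = PARENT[code]
        path.append(code)
    return path

def augment_visit(visit):
    """V_t -> V'_t: shared ancestors are repeated once per leaf."""
    augmented = []
    for code in visit:
        augmented.extend(path_to_root(code))
    return augmented

def distance(a, b):
    """Number of edges on the shortest path between two distinct codes."""
    depth_a = {node: steps for steps, node in enumerate(path_to_root(a))}
    for steps_b, node in enumerate(path_to_root(b)):
        if node in depth_a:  # first shared node is the lowest common ancestor
            return depth_a[node] + steps_b
    raise ValueError("codes share no ancestor")

def cooccurrence(a, b, augmented):
    """count(a) * count(b) / d_ab for two distinct codes in V'_t."""
    counts = Counter(augmented)
    return counts[a] * counts[b] / distance(a, b)

v_aug = augment_visit(["J02.801", "J04.000"])
print(cooccurrence("J02.801", "J00-J06", v_aug))
```

The function is only meaningful for two different codes; a code paired with itself has distance 0, and the text records such diagonal entries as 0 rather than computing them.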
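Denote the set of all disease classification codes in the medical diagnosis knowledge graph as $C'=\{c'_1,c'_2,\dots,c'_N\}$, with $N$ codes in total.
The institution builds a co-occurrence matrix $M$ over all disease classification codes in the knowledge graph, as shown in Table 1. $M_{ij}$ is the co-occurrence information of codes $c'_i$ and $c'_j$:
$$M_{ij}=\sum_{p=1}^{P}\sum_{t=1}^{T}\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$$
where $P$ is the total number of patients at the institution and $\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the augmented code set $V'_t$ of one visit of patient $p$. If $c'_i$ and $c'_j$ both appear in that $V'_t$, then a code $c_i$ equal to $c'_i$ and a code $c_j$ equal to $c'_j$ can be found in it, and $\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$ equals $\mathrm{cooccurrence}(c_i,c_j,V'_t)$; otherwise it is recorded as 0. The co-occurrence matrix $M$ is symmetric, so $M_{ij}$ and $M_{ji}$ are equal; the diagonal, which would hold the co-occurrence of a code with itself, is recorded as 0.
Table 1. Example structure of the co-occurrence matrix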
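Taking the medical diagnosis knowledge graph of Figure 3 as an example, suppose a patient is diagnosed with viral pharyngitis and acute laryngitis at one visit, so $V_t$=[J02.801,J04.000]. Adding the upper-level codes gives the augmented set $V'_t$=[J02.801,J02,J00-J06,J00-J99,root,J04.000,J04,J00-J06,J00-J99,root]. Code J02.801 occurs once and J00-J06 occurs twice, the distance between them is 2, and the co-occurrence information is therefore 1 × 2 / 2 = 1.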
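To show how the per-visit values accumulate into an institution's matrix $M$, the following Python sketch sums them over all patients and visits. It is an assumption-laden illustration: the `PARENT` map, the function names, the sparse dictionary representation of $M$, and the single-parent hierarchy are hypothetical choices, not details given in the source.

```python
from collections import Counter
from itertools import combinations

# Hypothetical child -> parent map; assumes a single parent per code.
PARENT = {
    "J00-J99": "root", "J00-J06": "J00-J99",
    "J02": "J00-J06", "J02.801": "J02",
    "J04": "J00-J06", "J04.000": "J04",
}

def path_to_root(code):
    path = [code]
    while code in PARENT:
        code = PARENT[code]
        path.append(code)
    return path

def distance(a, b):
    """Edges on the shortest path between two distinct codes in the tree."""
    depth_a = {node: steps for steps, node in enumerate(path_to_root(a))}
    for steps_b, node in enumerate(path_to_root(b)):
        if node in depth_a:
            return depth_a[node] + steps_b
    raise ValueError("codes share no ancestor")

def local_cooccurrence_matrix(patients):
    """patients: list of patients; each patient is a list of visits;
    each visit is a list of leaf codes. Returns a sparse symmetric M
    as {(code_i, code_j): value}; absent keys, including the diagonal, are 0."""
    M = {}
    for visits in patients:
        for visit in visits:
            augmented = [c for leaf in visit for c in path_to_root(leaf)]
            counts = Counter(augmented)
            for a, b in combinations(sorted(counts), 2):
                value = counts[a] * counts[b] / distance(a, b)
                M[(a, b)] = M.get((a, b), 0.0) + value
                M[(b, a)] = M.get((b, a), 0.0) + value
    return M

# Two hypothetical patients with the same single visit as in the example.
M = local_cooccurrence_matrix([[["J02.801", "J04.000"]],
                               [["J02.801", "J04.000"]]])
print(M[("J02.801", "J00-J06")])
```

Storing only the nonzero entries keeps the sketch small; a dense $N \times N$ array indexed by the codes of $C'$ would match the matrix form in the text more literally.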
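4. Encrypted data computation
The second server generates an encryption algorithm, an encryption key, a decryption algorithm, and a decryption key, and distributes the encryption algorithm and encryption key to the participating medical institutions. Each institution encrypts its co-occurrence matrix with the encryption algorithm and key and uploads it to the first server. The first server, working on ciphertexts, sums the co-occurrence information of identical code pairs to obtain the global co-occurrence matrix in ciphertext form and sends it to the second server. The second server recovers the global co-occurrence matrix with the decryption algorithm and key and returns it to the first server. Given honest servers, the first server sees each institution's matrix only as ciphertext and the second server sees only the aggregate, so no institution's individual co-occurrence data are exposed. The procedure is as follows:
The second server uses an additively homomorphic encryption scheme to obtain the encryption algorithm ENC, the decryption algorithm DEC, the encryption key $\mathrm{KEY}_E$, and the decryption key $\mathrm{KEY}_D$, and sends ENC and $\mathrm{KEY}_E$ to each participating medical institution.
The first server performs no decryption and operates directly on ciphertexts. By additive homomorphism, it only needs to compute the product of the ciphertexts. With $M^{(k)}$ denoting the co-occurrence matrix of participant $k$, the encrypted co-occurrence information $\mathrm{EncX}_{ij}$ of codes $c'_i$ and $c'_j$ is:
$$\mathrm{EncX}_{ij}=\prod_{k}\mathrm{ENC}\left(\mathrm{KEY}_E, M^{(k)}_{ij}\right)$$
For every pair of disease classification codes in the knowledge graph, the encrypted co-occurrence information is computed in this way, giving the global co-occurrence matrix in ciphertext form, EncX. The first server sends EncX to the second server, which decrypts it to obtain the global co-occurrence matrix $X$, i.e. $X=\mathrm{DEC}(\mathrm{KEY}_D,\mathrm{EncX})$, and returns it to the first server.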
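The "product of ciphertexts equals sum of plaintexts" property can be illustrated with a toy Paillier cryptosystem, one well-known additively homomorphic scheme. The source does not name the scheme it uses, so Paillier is an assumption here, as are the tiny fixed primes (far too small to be secure), the fixed-point factor `SCALE` used to turn fractional co-occurrence values into integers, and all function names.

```python
import math
import random

SCALE = 1000  # assumed fixed-point factor: Paillier encrypts integers only

def keygen(p=104729, q=1299709):
    """Toy Paillier keys from two small known primes (NOT secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, (lam, mu, n)  # (KEY_E, KEY_D)

def enc(key_e, value):
    """ENC: encrypt a non-negative value as a scaled integer."""
    n, n2 = key_e, key_e * key_e
    m = round(value * SCALE)
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def dec(key_d, cipher):
    """DEC: recover the scaled integer and undo the scaling."""
    lam, mu, n = key_d
    m = (pow(cipher, lam, n * n) - 1) // n * mu % n
    return m / SCALE

def aggregate(key_e, ciphers):
    """First server: multiplying ciphertexts adds the hidden plaintexts."""
    total = 1
    for c in ciphers:
        total = total * c % (key_e * key_e)
    return total

key_e, key_d = keygen()                               # second server
uploads = [enc(key_e, v) for v in (1.0, 0.25, 2.5)]   # M_ij from three institutions
enc_x_ij = aggregate(key_e, uploads)                  # first server, ciphertext only
x_ij = dec(key_d, enc_x_ij)                           # second server
print(x_ij)
```

In practice a vetted cryptographic library and production-size keys would replace this toy arithmetic; the sketch only shows why the first server can aggregate without ever decrypting.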
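5. Knowledge representation learning
On the first server, following the principle of the GloVe algorithm, each disease classification code is expressed as a representation vector of real numbers. The relationship between the representation vectors and the global co-occurrence matrix is:
$$w_i^{T}w_j+b_i+b_j=\log X_{ij}$$
where $w_i$ and $w_j$ are the representation vectors of the disease classification codes $c'_i$ and $c'_j$ that are ultimately to be solved for, each randomly initialized as a 128-dimensional vector with entries between -0.1 and 0.1; the superscript $T$ denotes transposition; $b_i$ and $b_j$ are the bias terms of the two representation vectors, initialized to 0; and $X_{ij}$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the global co-occurrence matrix $X$.
Based on this formula, the objective function $J$ is constructed:
$$J=\sum_{i,j=1}^{N} f(X_{ij})\left(w_i^{T}w_j+b_i+b_j-\log X_{ij}\right)^2$$
where $f$ is a weighting function. So that code pairs that co-occur more often receive higher weight, $f$ is non-decreasing; at the same time the weight must not grow too large and should stop increasing beyond a certain point. If two codes $c'_i$ and $c'_j$ never appear together, i.e. $X_{ij}=0$, they do not take part in the objective, i.e. $f(0)=0$. Given these requirements, $f$ is the following piecewise function:
$$f(X_{ij})=\begin{cases}\left(X_{ij}/\mathrm{MAX}\right)^{\alpha}, & X_{ij}<\mathrm{MAX}\\ 1, & \text{otherwise}\end{cases}$$
That is, once the co-occurrence information exceeds the threshold MAX, the weight stays constant at 1. The hyperparameters MAX and $\alpha$ are set to the best values found experimentally and can be set to 100 and 0.75 respectively.
The objective is optimized with the AdaDelta gradient descent algorithm, randomly sampling the elements of the global co-occurrence matrix $X$, with a learning rate of 0.05 and 50 iterations, until convergence, yielding the representation vectors $w_i$ and $w_j$.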
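The weighting function and the objective can be tried out with a small pure-Python sketch. Note the simplifications, all of which are assumptions rather than the patent's method: plain stochastic gradient descent stands in for AdaDelta, a single vector per code is used, the co-occurrence values in `X` are made up for illustration, and the function names are hypothetical. The initialization range, dimension, learning rate, and epoch count follow the values given in the text.

```python
import math
import random

def weight(x, x_max=100.0, alpha=0.75):
    """f(x): (x/MAX)^alpha below MAX, constant 1 from MAX upwards; f(0) = 0."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def init_vectors(X, dim=128, seed=0):
    """Vectors uniform in [-0.1, 0.1], biases 0, as described in the text."""
    rng = random.Random(seed)
    codes = sorted({c for pair in X for c in pair})
    w = {c: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for c in codes}
    b = {c: 0.0 for c in codes}
    return w, b

def objective(X, w, b):
    """J = sum over pairs with X_ij > 0 of f(X_ij) * (w_i.w_j + b_i + b_j - log X_ij)^2."""
    return sum(weight(x) * (dot(w[i], w[j]) + b[i] + b[j] - math.log(x)) ** 2
               for (i, j), x in X.items() if x > 0)

def train(X, w, b, lr=0.05, epochs=50, seed=0):
    """Plain SGD over randomly ordered nonzero entries (AdaDelta swapped out)."""
    rng = random.Random(seed)
    pairs = [(i, j, x) for (i, j), x in X.items() if x > 0]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j, x in pairs:
            err = dot(w[i], w[j]) + b[i] + b[j] - math.log(x)
            g = 2.0 * weight(x) * err
            wi, wj = w[i][:], w[j][:]  # copies so both updates use old values
            for k in range(len(wi)):
                w[i][k] -= lr * g * wj[k]
                w[j][k] -= lr * g * wi[k]
            b[i] -= lr * g
            b[j] -= lr * g

# Made-up symmetric global co-occurrence values for three codes.
X = {("J02.801", "J00-J06"): 12.0, ("J00-J06", "J02.801"): 12.0,
     ("J02.801", "J04.000"): 3.0, ("J04.000", "J02.801"): 3.0,
     ("J00-J06", "J04.000"): 40.0, ("J04.000", "J00-J06"): 40.0}
w, b = init_vectors(X)
before = objective(X, w, b)
train(X, w, b)
after = objective(X, w, b)
print(before > after)
```

Pairs with $X_{ij}=0$ are skipped entirely, which is how $f(0)=0$ shows up in code and also avoids evaluating $\log 0$.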
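The representation vectors obtained by knowledge graph representation learning can be used to compute the similarity between diseases, and can also be combined with patient records and fed into a deep learning model for prediction tasks, for example predicting from a patient's visit history which diseases may appear at the next visit. In electronic medical records, each patient's record can be treated as a sequence of visits, each containing a set of disease classification codes, i.e. a subset of $C'$. The code set of one visit can be represented as a binary vector $x_t\in\{0,1\}^N$, whose $i$-th element indicates whether code $c'_i$ appears in that visit: 1 if it does, 0 otherwise. When training the deep learning model, the binary vector $x_t$ of each visit can be multiplied with the representation vectors and passed through a nonlinear transformation to form the input of an RNN prediction model, which predicts the disease classification codes of the next visit and thus the diseases that may appear.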
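The construction of the visit vector $x_t$ and its combination with the learned vectors might look like the sketch below. The source does not specify the nonlinearity or the matrix layout, so `tanh`, the row-per-code layout of `W`, the toy numbers, and the function names are all assumptions; the RNN itself is left out.

```python
import math

def multi_hot(visit_codes, vocabulary):
    """x_t in {0,1}^N: 1 where the code of C' appears in the visit, else 0."""
    present = set(visit_codes)
    return [1 if code in present else 0 for code in vocabulary]

def visit_embedding(x_t, W):
    """tanh(x_t^T W), where row i of W is the vector of code c'_i.
    The choice of tanh is an assumption; the text only says 'nonlinear'."""
    dim = len(W[0])
    summed = [sum(x * row[k] for x, row in zip(x_t, W)) for k in range(dim)]
    return [math.tanh(v) for v in summed]

# Toy vocabulary and 2-dimensional vectors, purely illustrative.
vocabulary = ["J02.801", "J04.000", "J00-J06"]
W = [[1.0, 2.0], [0.5, -0.5], [3.0, 4.0]]
x_t = multi_hot(["J04.000"], vocabulary)
v_t = visit_embedding(x_t, W)  # would be one time step of the RNN input
print(x_t)
```

Because $x_t$ is binary, the product simply sums the vectors of the codes present in the visit before the nonlinearity is applied.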
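An embodiment of the invention further provides a multi-center medical diagnosis knowledge graph representation learning system, comprising:
A global medical diagnosis knowledge graph construction module: constructs the global medical diagnosis knowledge graph on the first server. The graph represents the hierarchy of medical diagnosis concepts as a directed acyclic graph and consists of leaf nodes and ancestor nodes; the leaf nodes are the finest-grained disease classification codes, and their ancestor nodes are the corresponding upper-level disease classification codes;
A medical diagnosis knowledge graph distribution module: distributes the global medical diagnosis knowledge graph constructed by the first server to each participating medical institution;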
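A disease diagnosis co-occurrence statistics module: deployed at each participating medical institution. Denote the set of all disease classification codes in the institution's electronic medical records as $C=\{c_1,c_2,\dots,c_{|C|}\}$, with $|C|$ codes in total. Each patient's record is treated as a sequence of visits $V=\{V_1,V_2,\dots,V_T\}$, with $T$ visits in total, and the code set of visit $t$ is denoted $V_t$. Adding the upper-level codes of every code in $V_t$ yields the augmented code set $V'_t$; the codes in $V'_t$ are combined pairwise into code pairs and the co-occurrence information of each pair is computed. Denote the set of all disease classification codes in the knowledge graph as $C'=\{c'_1,c'_2,\dots,c'_N\}$, with $N$ codes in total.
The institution builds a co-occurrence matrix $M$ over all disease classification codes in the knowledge graph. The element $M_{ij}$ in row $i$, column $j$ is the co-occurrence information of codes $c'_i$ and $c'_j$:
$$M_{ij}=\sum_{p=1}^{P}\sum_{t=1}^{T}\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$$
where $P$ is the total number of patients at the institution and $\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the augmented code set $V'_t$ of one visit of patient $p$;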
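An encrypted data computation module: the second server generates an encryption algorithm, an encryption key, a decryption algorithm, and a decryption key, and distributes the encryption algorithm and encryption key to the participating medical institutions; each institution encrypts its co-occurrence matrix with the encryption algorithm and key and uploads it to the first server; the first server, working on ciphertexts, sums the co-occurrence information of identical code pairs to obtain the global co-occurrence matrix in ciphertext form and sends it to the second server; the second server recovers the global co-occurrence matrix with the decryption algorithm and key and returns it to the first server;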
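A knowledge representation learning module: deployed on the first server. Each disease classification code is expressed as a representation vector of real numbers, and the following objective function $J$ is constructed:
$$J=\sum_{i,j=1}^{N} f(X_{ij})\left(w_i^{T}w_j+b_i+b_j-\log X_{ij}\right)^2$$
where $w_i$ and $w_j$ are the representation vectors of codes $c'_i$ and $c'_j$, $b_i$ and $b_j$ are the bias terms of the two representation vectors, $X_{ij}$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the global co-occurrence matrix, and $f$ is a weighting function;
The objective function is optimized until convergence, yielding the representation vectors $w_i$ and $w_j$.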
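The above describes only preferred embodiments of the invention. Although the invention has been disclosed through these preferred embodiments, they are not intended to limit it. Anyone skilled in the art may, without departing from the scope of the technical solution of the invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution, or to adapt it into equivalent embodiments. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments in accordance with the technical substance of the invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the invention.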
Claims (10)
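- A multi-center medical diagnosis knowledge graph representation learning method, characterized in that the method comprises: (1) a first server constructs a global medical diagnosis knowledge graph, which represents the hierarchy of medical diagnosis concepts as a directed acyclic graph and consists of leaf nodes and ancestor nodes, the leaf nodes being the finest-grained disease classification codes and their ancestor nodes being the corresponding upper-level disease classification codes; (2) the first server distributes the constructed global medical diagnosis knowledge graph to each participating medical institution; (3) each participating medical institution computes disease diagnosis co-occurrence statistics internally, specifically: the set of all disease classification codes in the electronic medical records of a participating medical institution is denoted $C=\{c_1,c_2,\dots,c_{|C|}\}$, with $|C|$ codes in total; each patient's record is treated as a sequence of visits $V=\{V_1,V_2,\dots,V_T\}$, with $T$ visits in total, and the code set of visit $t$ is denoted $V_t$; the upper-level codes of every code in $V_t$ are added to $V_t$ to obtain the augmented code set $V'_t$; the codes in $V'_t$ are combined pairwise into code pairs and the co-occurrence information of each pair is computed; the set of all disease classification codes in the knowledge graph is denoted $C'=\{c'_1,c'_2,\dots,c'_N\}$, with $N$ codes in total; the institution builds a co-occurrence matrix $M$ over all codes in the knowledge graph, whose element $M_{ij}$ in row $i$, column $j$ is the co-occurrence information of codes $c'_i$ and $c'_j$, $M_{ij}=\sum_{p=1}^{P}\sum_{t=1}^{T}\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$, where $P$ is the total number of patients at the institution and $\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the augmented code set $V'_t$ of one visit of patient $p$; (4) encrypted data computation: a second server generates an encryption algorithm, an encryption key, a decryption algorithm, and a decryption key, and distributes the encryption algorithm and encryption key to the participating medical institutions; each institution encrypts its co-occurrence matrix with the encryption algorithm and key and uploads it to the first server; the first server, working on ciphertexts, sums the co-occurrence information of identical code pairs to obtain the global co-occurrence matrix in ciphertext form and sends it to the second server; the second server recovers the global co-occurrence matrix with the decryption algorithm and key and returns it to the first server; (5) knowledge representation learning: on the first server, each disease classification code is expressed as a representation vector of real numbers, and the following objective function is constructed: $J=\sum_{i,j=1}^{N} f(X_{ij})(w_i^{T}w_j+b_i+b_j-\log X_{ij})^2$, where $w_i$ and $w_j$ are the representation vectors of codes $c'_i$ and $c'_j$, $b_i$ and $b_j$ are the bias terms of the two representation vectors, $X_{ij}$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the global co-occurrence matrix, and $f$ is a weighting function; the objective function is optimized until convergence, yielding the representation vectors $w_i$ and $w_j$.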
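- The multi-center medical diagnosis knowledge graph representation learning method according to claim 1, characterized in that the first server and the second server are both third-party servers; the third-party servers must be honest and can communicate with each other; each participating medical institution deploys its own electronic medical record database internally, and the raw data in that database may not leave the institution; the institutions cannot communicate with each other directly and can communicate only with the third-party servers.
- The multi-center medical diagnosis knowledge graph representation learning method according to claim 1, characterized in that the medical diagnosis ontologies used in constructing the global medical diagnosis knowledge graph include ICD, CCS, and SNOMED.
- The multi-center medical diagnosis knowledge graph representation learning method according to claim 1, characterized in that the constructed medical diagnosis knowledge graph is stored as a dictionary, each element of which records the hierarchy information of one disease.
- The multi-center medical diagnosis knowledge graph representation learning method according to claim 1, characterized in that the augmented code set $V'_t$ is constructed as follows: the ancestor nodes of each leaf node are looked up in the medical diagnosis knowledge graph, and the upper-level codes of shared ancestor nodes are added to $V_t$ repeatedly.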
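- The multi-center medical diagnosis knowledge graph representation learning method according to claim 1, characterized in that step (4) is specifically: the second server uses an additively homomorphic encryption scheme to obtain the encryption algorithm ENC, the decryption algorithm DEC, the encryption key $\mathrm{KEY}_E$, and the decryption key $\mathrm{KEY}_D$, and sends ENC and $\mathrm{KEY}_E$ to each participating medical institution; the first server operates directly on ciphertexts; by additive homomorphism, it only needs to compute the product of the ciphertexts, so that, with $M^{(k)}$ denoting the co-occurrence matrix of participant $k$, the encrypted co-occurrence information of codes $c'_i$ and $c'_j$ is $\mathrm{EncX}_{ij}=\prod_{k}\mathrm{ENC}(\mathrm{KEY}_E, M^{(k)}_{ij})$; the encrypted co-occurrence information is computed for every pair of disease classification codes in the knowledge graph, giving the global co-occurrence matrix in ciphertext form, EncX; the first server sends EncX to the second server, which decrypts it to obtain the global co-occurrence matrix $X$, i.e. $X=\mathrm{DEC}(\mathrm{KEY}_D,\mathrm{EncX})$, and returns it to the first server.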
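- A multi-center medical diagnosis knowledge graph representation learning system, characterized in that the system comprises: a global medical diagnosis knowledge graph construction module, which constructs the global medical diagnosis knowledge graph on a first server, the graph representing the hierarchy of medical diagnosis concepts as a directed acyclic graph and consisting of leaf nodes and ancestor nodes, the leaf nodes being the finest-grained disease classification codes and their ancestor nodes being the corresponding upper-level disease classification codes; a medical diagnosis knowledge graph distribution module, which distributes the global medical diagnosis knowledge graph constructed by the first server to each participating medical institution; a disease diagnosis co-occurrence statistics module, deployed at each participating medical institution, wherein the set of all disease classification codes in the institution's electronic medical records is denoted $C=\{c_1,c_2,\dots,c_{|C|}\}$, with $|C|$ codes in total; each patient's record is treated as a sequence of visits $V=\{V_1,V_2,\dots,V_T\}$, with $T$ visits in total, and the code set of visit $t$ is denoted $V_t$; the upper-level codes of every code in $V_t$ are added to $V_t$ to obtain the augmented code set $V'_t$; the codes in $V'_t$ are combined pairwise into code pairs and the co-occurrence information of each pair is computed; the set of all disease classification codes in the knowledge graph is denoted $C'=\{c'_1,c'_2,\dots,c'_N\}$, with $N$ codes in total; the institution builds a co-occurrence matrix $M$ over all codes in the knowledge graph, whose element $M_{ij}$ in row $i$, column $j$ is the co-occurrence information of codes $c'_i$ and $c'_j$, $M_{ij}=\sum_{p=1}^{P}\sum_{t=1}^{T}\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$, where $P$ is the total number of patients at the institution and $\mathrm{cooccurrence}_p(c'_i,c'_j,V'_t)$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the augmented code set $V'_t$ of one visit of patient $p$; an encrypted data computation module, wherein a second server generates an encryption algorithm, an encryption key, a decryption algorithm, and a decryption key, and distributes the encryption algorithm and encryption key to the participating medical institutions; each institution encrypts its co-occurrence matrix with the encryption algorithm and key and uploads it to the first server; the first server, working on ciphertexts, sums the co-occurrence information of identical code pairs to obtain the global co-occurrence matrix in ciphertext form and sends it to the second server; the second server recovers the global co-occurrence matrix with the decryption algorithm and key and returns it to the first server; and a knowledge representation learning module, deployed on the first server, wherein each disease classification code is expressed as a representation vector of real numbers and the following objective function is constructed: $J=\sum_{i,j=1}^{N} f(X_{ij})(w_i^{T}w_j+b_i+b_j-\log X_{ij})^2$, where $w_i$ and $w_j$ are the representation vectors of codes $c'_i$ and $c'_j$, $b_i$ and $b_j$ are the bias terms of the two representation vectors, $X_{ij}$ is the co-occurrence information of codes $c'_i$ and $c'_j$ in the global co-occurrence matrix, and $f$ is a weighting function; the objective function is optimized until convergence, yielding the representation vectors $w_i$ and $w_j$.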
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023535611A JP7433541B2 (ja) | 2021-08-27 | 2022-08-25 | 多中心医学診断知識グラフ表示学習方法及びシステム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110995013.7A CN113434626B (zh) | 2021-08-27 | 2021-08-27 | 一种多中心医学诊断知识图谱表示学习方法及系统 |
CN202110995013.7 | 2021-08-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023025255A1 true WO2023025255A1 (zh) | 2023-03-02 |
Family
ID=77798239
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/114879 WO2023025255A1 (zh) | 2021-08-27 | 2022-08-25 | 一种多中心医学诊断知识图谱表示学习方法及系统 |
Country Status (3)
Country | Link |
---|---|
JP (1) | JP7433541B2 (zh) |
CN (1) | CN113434626B (zh) |
WO (1) | WO2023025255A1 (zh) |
-
2021
- 2021-08-27 CN CN202110995013.7A patent/CN113434626B/zh active Active
-
2022
- 2022-08-25 JP JP2023535611A patent/JP7433541B2/ja active Active
- 2022-08-25 WO PCT/CN2022/114879 patent/WO2023025255A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
JP7433541B2 (ja) | 2024-02-19 |
CN113434626A (zh) | 2021-09-24 |
JP2023547562A (ja) | 2023-11-10 |
CN113434626B (zh) | 2021-12-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22860601 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2023535611 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 22860601 Country of ref document: EP Kind code of ref document: A1 |