WO2023098291A1 - Data processing method and data processing apparatus - Google Patents

Data processing method and data processing apparatus

Info

Publication number
WO2023098291A1
Authority
WO
WIPO (PCT)
Prior art keywords
knowledge
data
representation
graph
entity
Prior art date
Application number
PCT/CN2022/124247
Other languages
French (fr)
Chinese (zh)
Inventor
沈雯
乔楠
张雷
陶建军
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2023098291A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and in particular, to a data processing method and a data processing device.
  • AI: artificial intelligence
  • deep learning technology is an AI technology based on deep neural network algorithms, which processes data by simulating the working mechanism of the human brain.
  • AI models (such as deep learning models) are often used to complete tasks in various application scenarios, and AI models can also be called AI task models.
  • the AI model requires a large amount of sample data for training, and some current technical solutions often only use sample data with a relatively single data type to train the AI model.
  • CDSS: clinical decision support system
  • For example, in a CDSS, the sample data used to train a deep-learning-based disease diagnosis model often comes only from electronic medical records (EMR), and the sample data type is the text in the EMR. Because the sample data has a single source and a single type, the prediction accuracy of the disease diagnosis model is low, and it assists clinical decision-making poorly.
  • In some technical solutions, the sample data used for AI model training can come from different data sources and have different data types. With such heterogeneous data, however, the AI model cannot learn the characteristics of the sample data during training, resulting in low task prediction accuracy of the trained AI task model.
  • Embodiments of the present application provide a data processing method and a data processing device for improving the prediction accuracy of an AI task model.
  • the first aspect of the embodiments of the present application provides a data processing method.
  • the method is executed by a computer device, or by a component of the computer device, such as a processor, a chip or a chip system of the computer device, or by a logic module or software that can realize all or part of the device's functions.
  • the data processing method includes: the computer equipment obtains a variety of data, and the various data have different data sources and different data types.
  • The source of the data depends on the type of task to be trained and includes human-generated data or machine-generated data; the data type can be text, numeric values, or images.
  • Computer equipment performs knowledge extraction on various data to obtain knowledge graphs.
  • Knowledge graphs include multiple knowledge entities and the associations between multiple knowledge entities.
  • Knowledge entities include key elements extracted from the various data. The multiple knowledge entities have different data types.
  • The computer device uses the knowledge representation algorithm corresponding to the data type of each knowledge entity to perform knowledge representation for that knowledge entity, and initializes the weights of the association relationships between the multiple knowledge entities in the knowledge graph, to obtain a vector graph. The vector graph is used for training an artificial intelligence (AI) task model.
  • The sample data that the computer device uses to train the AI task model comes from multiple sources and has multiple data types.
  • Through the knowledge representation algorithms corresponding to the different data types, the computer device represents the abstract knowledge graph as a vector graph that the computer device can recognize.
  • The computer device trains the AI task model on the vector graph obtained from data of multiple sources and multiple types, which improves the prediction accuracy of the AI task model.
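  • As a rough illustration of the step described above, the following Python sketch turns a tiny knowledge graph into a vector graph: each entity is encoded by a function chosen for its data type, and each association receives an initial weight. The toy encoders, the dimension `DIM`, the uniform weight initialization, and all names are assumptions made for this sketch; the application does not specify them.

```python
import hashlib
import random

DIM = 4  # hypothetical embedding size, chosen only for this sketch


def embed_text(value):
    """Toy text encoder: a deterministic hash-based vector (stand-in for a real text model)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:DIM]]


def embed_number(value):
    """Toy numeric encoder: the value itself followed by zero padding."""
    return [float(value)] + [0.0] * (DIM - 1)


ENCODERS = {"text": embed_text, "number": embed_number}


def build_vector_graph(entities, relations, seed=0):
    """Return (node_vectors, edge_weights) for a knowledge graph.

    entities:  {name: (data_type, value)}
    relations: iterable of (source_name, target_name) pairs
    """
    rng = random.Random(seed)  # seeded so the initial weights are reproducible
    node_vectors = {
        name: ENCODERS[data_type](value)
        for name, (data_type, value) in entities.items()
    }
    edge_weights = {}
    for source, target in relations:
        if source not in node_vectors or target not in node_vectors:
            raise KeyError(f"relation refers to unknown entity: {source!r} -> {target!r}")
        edge_weights[(source, target)] = rng.uniform(0.0, 1.0)  # assumed initialization
    return node_vectors, edge_weights


# Hypothetical sample data.
entities = {
    "symptom": ("text", "frequent urination"),
    "age": ("number", 42),
}
relations = [("age", "symptom")]
node_vectors, edge_weights = build_vector_graph(entities, relations)
```

  • In this sketch the edge weights are only starting values; a training procedure would be expected to update them later.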
  • The computer device performs knowledge extraction on the various data based on different knowledge levels, so as to obtain a multi-knowledge-level knowledge graph.
  • For example, the computer device can perform knowledge extraction based on multiple knowledge levels, such as the symptom level, the gene level, or the microbial level, so as to obtain a knowledge graph in which the multiple knowledge levels are associated with each other.
  • The knowledge graph acquired by the computer device in the embodiment of the present application has multiple interrelated knowledge levels, and the AI task model is trained on this multi-level knowledge graph. Because the knowledge graph spans multiple knowledge levels, its coverage is improved, which further improves the prediction accuracy of the AI task model.
  • Knowledge entities from different knowledge levels have association relationships. The association relationships are obtained from the various data; for example, the computer device analyzes the semantic information of the various data to obtain the association relationships between knowledge entities.
  • the association relationship is obtained according to preset rules, for example, the computer device pre-stores knowledge association rules based on domain knowledge, and the computer device establishes association relationships between knowledge entities at different levels based on the preset knowledge association rules.
  • The computer device obtains the associations contained in the various data themselves, and also establishes association relationships between knowledge entities of the same level or of different levels according to the preset rules. Using several ways of obtaining associations fully exploits the internal relationships between knowledge entities at different knowledge levels and increases the amount of data used to train the AI task model.
  • According to the data type of each knowledge entity and a preset relationship, the computer device determines, from a preset algorithm library, the knowledge representation algorithm corresponding to the data type of the knowledge entity. The computer device then performs knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm and obtains a representation vector corresponding to the knowledge entity.
  • For example, for a knowledge entity of the text type, the computer device selects the knowledge representation algorithm from the preset algorithm library according to the preset relationship between the text type and the knowledge representation algorithm, for example a knowledge graph embedding (KGE) algorithm, a bidirectional encoder representations from transformers (BERT) algorithm, or a word vector (word2vec) algorithm.
  • the computer device selects the corresponding knowledge representation algorithm from the preset algorithm library according to the data type of the knowledge entity, thereby improving the representation efficiency of the knowledge entity and the associated relationship.
  • Alternatively, according to the data type of each knowledge entity, the computer device determines the knowledge representation algorithm that the user has input for that data type, performs knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm, and obtains the representation vector corresponding to the knowledge entity.
  • the knowledge representation algorithm in the embodiment of the present application may be a user-defined knowledge representation algorithm, thereby improving applicability to different knowledge representation algorithms.
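  • The selection logic described above (a preset algorithm library keyed by data type, with an optional user-defined algorithm) could be sketched as a simple lookup. The two toy algorithms and the names `PRESET_LIBRARY`, `select_algorithm`, and `represent` are hypothetical; in practice the library entries might wrap models such as BERT or word2vec.

```python
def text_length_encoder(value):
    """Toy preset algorithm for text entities (placeholder for a real text model)."""
    return [float(len(str(value)))]


def identity_encoder(value):
    """Toy preset algorithm for numeric entities."""
    return [float(value)]


# Preset relationship: data type -> knowledge representation algorithm.
PRESET_LIBRARY = {"text": text_length_encoder, "number": identity_encoder}


def select_algorithm(data_type, user_defined=None):
    """Prefer a user-supplied algorithm for the data type, else use the preset library."""
    if user_defined and data_type in user_defined:
        return user_defined[data_type]
    if data_type in PRESET_LIBRARY:
        return PRESET_LIBRARY[data_type]
    raise ValueError(f"no knowledge representation algorithm for data type {data_type!r}")


def represent(value, data_type, user_defined=None):
    """Compute the representation vector of one knowledge entity."""
    return select_algorithm(data_type, user_defined)(value)


preset_vector = represent("cough", "text")
custom_vector = represent("cough", "text", {"text": lambda value: [0.0, 1.0]})
```

  • Giving the user-defined mapping priority over the preset library is a design choice of this sketch, not something the text prescribes.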
  • the AI task model is an AI model for disease diagnosis
  • the various data include at least two of the following data: medical record data, imaging examination reports, gene regulatory expression networks, and metabolic networks.
  • the data processing method in the embodiment of the present application can be applied to the medical field.
  • The trained AI task model can perform disease diagnosis, and sample data from various sources can be used to train the disease diagnosis model, thereby improving its diagnostic accuracy.
  • the computer device trains the AI task model according to the vector graph to obtain the trained AI task model.
  • the computer equipment trains the AI task model through the obtained vector graph represented by the knowledge graph, which improves the feasibility of training the AI task model.
  • the computer device updates the weights in the vector graph.
  • During training, the computer device can continuously update the weights in the vector graph, thereby improving the accuracy of the trained AI task model.
  • The computer device uses the trained AI task model to perform task prediction and obtain a prediction result. Based on the updated vector graph, it identifies the key knowledge entities and/or key associations in the knowledge graph that correspond to the task prediction, and thereby obtains an interpretable knowledge graph.
  • the computer device can identify the key knowledge entities and/or key associations of the knowledge graph applied in the task prediction, which improves the interpretability of the model prediction results.
  • The computer device outputs the interpretable knowledge graph through a graphical user interface (GUI).
  • Outputting the interpretable knowledge graph through the GUI improves the feasibility of the solution.
  • a second aspect of the embodiments of the present application provides a data processing device, where the data processing device includes an interface unit and a processing unit.
  • the interface unit is used to obtain multiple data, and various data in the multiple data have different sources and different data types.
  • the processing unit is used to perform knowledge extraction on various data to obtain a knowledge map.
  • the knowledge map includes multiple knowledge entities and the associations between the multiple knowledge entities.
  • the multiple knowledge entities include different data types.
  • The processing unit is also used to perform knowledge representation for each knowledge entity by using the knowledge representation algorithm corresponding to the data type of that knowledge entity, and to initialize the weights of the association relationships between the multiple knowledge entities in the knowledge graph, to obtain a vector graph. The vector graph is used to train an artificial intelligence (AI) task model.
  • the processing unit is specifically configured to perform knowledge extraction on various data based on different knowledge levels, and obtain a multi-knowledge level knowledge map.
  • knowledge entities from different knowledge levels include an association relationship, and the association relationship is obtained from various data, or the association relationship is obtained according to a preset rule.
  • The processing unit is specifically configured to determine, according to the data type of each knowledge entity and a preset relationship, the knowledge representation algorithm corresponding to the data type of the knowledge entity from the preset algorithm library, perform knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm, and obtain the representation vector corresponding to the knowledge entity.
  • Alternatively, the processing unit determines, according to the data type of each knowledge entity, the knowledge representation algorithm that the user has input for that data type, performs knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm, and obtains the representation vector corresponding to the knowledge entity.
  • the AI task model is an AI model for disease diagnosis
  • the various data include at least two of the following data: medical record data, imaging examination reports, gene regulatory expression networks, and metabolic networks.
  • the processing unit is further configured to train the AI task model according to the vector graph to obtain a trained AI task model.
  • The processing unit is specifically configured to update the weights in the vector graph.
  • The processing unit is also used to perform task prediction with the trained AI task model to obtain a prediction result, and to identify, based on the updated vector graph, the key knowledge entities and/or key associations in the knowledge graph that correspond to the task prediction, to obtain an interpretable knowledge graph.
  • the processing unit is further configured to output an explainable knowledge map through a graphical user interface GUI.
  • The third aspect of the embodiments of the present application provides a computer device. The computer device includes a processor coupled with a memory, and the memory is used to store instructions. When the instructions are executed by the processor, the computer device executes the method described in the first aspect or in any possible implementation manner of the first aspect.
  • The fourth aspect of the embodiments of the present application provides a computer-readable storage medium on which instructions are stored. When the instructions are executed, the computer executes the method described in the first aspect or in any possible implementation manner of the first aspect.
  • The fifth aspect of the embodiments of the present application provides a computer program product.
  • the computer program product includes instructions. When the instructions are executed, the computer implements the method described in the first aspect or any possible implementation manner of the first aspect.
  • For the beneficial effects achieved by the data processing device, computer device, computer-readable medium, or computer program product provided above, refer to the beneficial effects of the corresponding method; they are not repeated here.
  • FIG. 1 is a schematic diagram of a system architecture of a data processing method provided in an embodiment of the present application
  • FIG. 2 is a schematic flow diagram of a data processing method provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a knowledge extraction provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a knowledge representation provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of establishing an AI task model provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a data processing effect provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a data processing device provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the embodiment of the present application provides a data processing method and a data processing device, which are used to improve the accuracy of clinical decision-making.
  • Words such as “exemplary” or “for example” are used to present an example, illustration, or explanation. Any embodiment or design scheme described as “exemplary” or “for example” in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as “exemplary” or “for example” is intended to present related concepts in a concrete manner.
  • Deep learning (DL) is a machine learning technology based on deep neural network algorithms; its main feature is the use of multiple nonlinear transformations to process and analyze data. It is applied, for example, in image recognition, speech recognition, natural language processing, and the processing of medical imaging data.
  • Graph deep learning applies deep learning algorithms to graph-structured data, for example through a graph neural network or a graph convolutional neural network.
  • A graph convolutional network (GCN) is a neural network method that implements convolution on graph-structured data, for example by means of the Laplacian matrix or the Fourier transform.
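  • To make the idea of convolution on graph-structured data concrete, the following sketch implements one propagation step in a widely used formulation: node features are averaged over neighbors using the symmetrically normalized adjacency matrix with self-loops, then multiplied by a weight matrix and passed through ReLU. This particular formulation is an assumption chosen for illustration; the application does not state which variant it uses, and the helper names are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a square adjacency matrix A."""
    n = len(adj)
    with_loops = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    degree = [sum(row) for row in with_loops]
    return [
        [with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(left, right):
    """Plain nested-list matrix product."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*right)] for row in left
    ]


def gcn_layer(adj, features, weight):
    """One graph convolution step: ReLU(normalized_adjacency @ features @ weight)."""
    propagated = matmul(normalize_adjacency(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weight)]


# Two connected nodes with one feature each and an identity weight matrix.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weight = [[1.0]]
output = gcn_layer(adjacency, features, weight)
```

  • With two mutually connected nodes, each output feature is the mean of the two input features, which shows how a node's representation mixes in information from its neighbors.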
  • A gene regulatory network (GRN) is a network of interactions between DNA and proteins. Gene activity is regulated by transcription factors that bind to DNA, and most transcription factors bind to multiple binding sites in the genome. Therefore, all cells have complex gene regulatory networks. For example, the human genome encodes approximately 1400 transcription factors that regulate the expression of more than 20,000 human genes. Techniques for studying gene regulatory networks include binding-site analysis methods such as ChIP-chip or ChIP-seq.
  • Metabolic network is a network in which various chemical substances in living cells are connected by biochemical reactions. Biochemical reactions are catalyzed by enzymes to convert one chemical substance into another chemical substance. Thus, all chemicals in a cell are part of a complex network of biochemical reactions known as a metabolic network.
  • EMR: electronic medical records
  • Real-world data (RWD) refers to data obtained from sources other than traditional clinical trials. Data sources include large-scale simple clinical trials, clinical trials in actual medical care, prospective observational studies or registry studies, database analysis, case reports, health management reports, or electronic health records.
  • the knowledge graph is a graph-based data structure consisting of nodes and edges, where each node is a knowledge entity and each edge is an association relationship between knowledge entities.
  • Knowledge entities can be things in the real world, such as names, genders, or symptoms, and association relationships express the connections between different knowledge entities.
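  • The node-and-edge structure described above can be sketched as a small Python data structure. The class names, the choice to store associations as undirected pairs, and the sample entities are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    name: str
    data_type: str  # e.g. "text" or "number"


@dataclass
class KnowledgeGraph:
    entities: dict = field(default_factory=dict)  # name -> Entity
    edges: set = field(default_factory=set)       # undirected pairs stored as frozensets

    def add_entity(self, name, data_type):
        self.entities[name] = Entity(name, data_type)

    def add_relation(self, first, second):
        for name in (first, second):
            if name not in self.entities:
                raise KeyError(f"unknown entity: {name!r}")
        self.edges.add(frozenset((first, second)))

    def neighbors(self, name):
        """Names of all entities directly associated with the given entity, sorted."""
        return sorted(
            other
            for edge in self.edges if name in edge
            for other in edge if other != name
        )


# Hypothetical sample data.
graph = KnowledgeGraph()
for name in ("male", "high blood sugar", "frequent urination"):
    graph.add_entity(name, "text")
graph.add_relation("male", "high blood sugar")
graph.add_relation("male", "frequent urination")
```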
  • FIG. 1 is a schematic diagram of a system architecture of a data processing method provided by an embodiment of the present application.
  • The data processing system 10 includes a knowledge extraction module 101, a knowledge representation module 102, a knowledge modeling module 103, and an attention visualization module 104. Each module in the data processing system 10 can be called independently, and the data processing system 10 can also be flexibly extended with other modules, which is not specifically limited.
  • Each module of the data processing system 10 is a logical unit obtained by dividing the system by function. The physical entity of the data processing system 10 can be a centralized or distributed computer device or server, or a component of a computer device or server, such as a processor, a chip, or a system-on-a-chip of a computer device.
  • the data processing system 10 in the embodiment of the present application can integrate data based on relevant domain knowledge, train an AI task model, and use the AI task model to perform interpretable prediction of target events.
  • The data processing system 10 is a general framework for heterogeneous data processing and target event prediction, and it can be applied to various fields. The following uses the medical field as an example to introduce the modules in the data processing system 10.
  • the knowledge extraction module 101 can extract usable knowledge entities from various sources and types of data, and establish associations between knowledge entities at different knowledge levels based on data semantics or domain knowledge to form a knowledge graph.
  • the knowledge graph contains knowledge entities and the relationship between knowledge entities.
  • Knowledge entities include element information extracted from a variety of data, and associations between knowledge entities include links between the extracted element information.
  • For example, the data sources for evaluating a patient's physical condition can be electronic medical records, imaging data, a gene regulatory network, a protein metabolism network, and so on.
  • The knowledge extraction module 101 can extract knowledge entities from these heterogeneous data of different sources and different data types, thereby forming a knowledge graph with connection characteristics, in which the knowledge entities are the nodes and the associations between knowledge entities are the edges. This maximizes the extraction of key information from the heterogeneous data.
  • the knowledge entity nodes in the above knowledge graph can be at different knowledge levels, and the relationship between knowledge entity nodes can also be a cross-knowledge level relationship.
  • For example, a multi-level knowledge graph includes knowledge graphs at different knowledge levels, such as a symptom layer, a gene sequencing data layer, or a metabolic data layer, and the knowledge entities in the symptom layer may be associated with the knowledge entities in the gene sequencing data layer.
  • The knowledge representation module 102 is used to represent the above knowledge graph, including its knowledge entities and the associations between them, as a vector graph. It can be understood that the knowledge graph obtained by the knowledge extraction module 101 cannot be used directly for training the AI task model; the knowledge representation module 102 must first represent the knowledge entities in the knowledge graph as data in vector form, and that data is then used to train the AI task model.
  • The knowledge representation module 102 includes a node module for representing knowledge entities and an edge module for representing the associations between knowledge entities. The node module and the edge module each contain multiple sub-modules, and different sub-modules are used to represent knowledge entities or associations of different data types.
  • The knowledge modeling module 103 is used to train a deep learning model on the vector graph; training on different vector graphs yields deep learning models that support different downstream tasks.
  • Deep learning models include the graph convolutional network (GCN), the graph attention network (GAT), or graph sample and aggregate (GraphSAGE), and the deep learning model can also be integrated into a Transformer structure.
  • Downstream tasks include auxiliary diagnosis tasks, examination recommendation tasks, or drug recommendation tasks, etc.
  • The attention visualization module 104 uses the updated vector graph to identify key nodes and key edges in the knowledge graph and displays them visually, so that the key node and edge information in the knowledge graph is highlighted.
  • The updated vector graph is the vector graph obtained after the training of the deep learning model is completed.
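  • The text does not say how key nodes and key edges are identified. One plausible reading, sketched below purely as an assumption, is to rank the associations by their weights in the updated vector graph and treat the endpoints of the top-ranked associations as key entities. The function names and the example weights are hypothetical.

```python
def key_associations(edge_weights, top_k=2):
    """Return the top_k edges with the largest learned weights, highest first."""
    ranked = sorted(edge_weights.items(), key=lambda item: item[1], reverse=True)
    return [edge for edge, _ in ranked[:top_k]]


def key_entities(edges):
    """Collect the endpoints of the key edges, preserving first-seen order."""
    seen = []
    for source, target in edges:
        for name in (source, target):
            if name not in seen:
                seen.append(name)
    return seen


# Hypothetical weights after training.
learned = {
    ("hypotension", "PIK3R1"): 0.91,
    ("hypotension", "insomnia"): 0.12,
    ("colon cancer", "abdominal pain"): 0.78,
}
top_edges = key_associations(learned)
top_nodes = key_entities(top_edges)
```

  • A visualization layer could then highlight `top_edges` and `top_nodes` when drawing the knowledge graph.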
  • FIG. 2 is a schematic flowchart of a data processing method provided by an embodiment of the present application. The method is applied to the data processing system shown in FIG. 1 . Taking computer equipment execution as an example, the data processing method includes the following steps:
  • Computer equipment acquires multiple types of data, including data from different data sources and data types.
  • the computer equipment obtains a variety of data, which are sample data for training AI task models.
  • Various data acquired by computing devices have different data sources and data types, and the data sources are different according to the types of tasks.
  • the specific multiple data can be data generated by humans or data generated by machines.
  • Types of data include text, numbers, or images.
  • a computer acquires various medical data.
  • These medical data can be real-world data RWD, and various medical data have different data sources.
  • The sources of the various medical data can be large-scale simple clinical trials, clinical trials in actual medical care, prospective observational studies, registry studies, retrospective database analysis, case reports, health management reports, medical record data, imaging examination reports, gene regulatory expression networks, metabolic networks, protein profiles, or microbial profiles.
  • a computer device performs knowledge extraction on various data to obtain a knowledge map, and the knowledge map includes multiple knowledge entities and the relationship between knowledge entities.
  • After the computer device acquires the various data, it extracts the knowledge entities and the association relationships between them from the acquired data; the knowledge entities and their association relationships form a knowledge graph.
  • The computer device classifies the extracted knowledge entities according to different knowledge levels and establishes the association relationships between knowledge entities at different knowledge levels, thereby obtaining a multi-knowledge-level knowledge graph.
  • After the computer device extracts the knowledge graph, it needs to standardize the knowledge entities in the knowledge graph. For example, the knowledge entity that the computer device extracts from an electronic medical record is "stomach pain", and the standardized knowledge entity is "abdominal pain".
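  • Standardization of this kind is often implemented as a lookup against a table of synonyms. The following is a minimal sketch under that assumption; only the "stomach pain" to "abdominal pain" mapping comes from the text, while the second table entry and the function name are hypothetical.

```python
# Synonym table: raw expression -> standardized knowledge entity.
SYNONYMS = {
    "stomach pain": "abdominal pain",  # example from the text
    "tummy ache": "abdominal pain",    # hypothetical extra entry
}


def normalize_entity(raw):
    """Lowercase, collapse whitespace, then map known synonyms to the standard term."""
    key = " ".join(raw.lower().split())
    return SYNONYMS.get(key, key)


standardized = normalize_entity("  Stomach   Pain ")
```

  • Terms that are not in the table are returned in cleaned form, so unknown entities pass through rather than being dropped.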
  • FIG. 3 is a schematic diagram of a multi-level knowledge map in the medical field provided by an embodiment of the present application.
  • the computer device performs knowledge extraction on various medical data, including electronic medical records, radiation information, gene information, protein information or microbial information.
  • Computer equipment divides the extracted knowledge entities into genetic level, phenotypic level and metagenomic level according to different knowledge levels.
  • knowledge entities at the gene level are, for example, PTPN11 gene, PIK3R1 gene or CDC42 gene.
  • Knowledge entities at the level of phenotypic symptoms are, for example, frequent bowel movements, hypotension, or insomnia.
  • Knowledge entities at the microbiological level are, for example, Prevotella, Haldemannia or Dorerella.
  • The association relationships between knowledge entities are established based on domain knowledge, and include association relationships between knowledge entities in the same knowledge level and association relationships between knowledge entities at different knowledge levels.
  • An example of an association within the same knowledge level: at the level of phenotypic symptoms, colon cancer is associated with frequent bowel movements, abdominal pain, and familial adenomatous polyposis (FAP).
  • An example of an association across different knowledge levels: hypotension at the symptom level is associated with the PIK3R1 gene, the EGFR gene, and the KRAS gene at the gene level.
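  • The distinction between same-level and cross-level associations can be illustrated with a few of the entities named above. In this sketch each entity carries a level label, and an association is cross-level when its two endpoints have different labels; the dictionary layout and the function name are assumptions.

```python
# Knowledge level of each entity (entities taken from the examples in the text).
LEVEL_OF = {
    "colon cancer": "symptom",
    "abdominal pain": "symptom",
    "hypotension": "symptom",
    "PIK3R1": "gene",
    "EGFR": "gene",
    "KRAS": "gene",
}

ASSOCIATIONS = [
    ("colon cancer", "abdominal pain"),
    ("hypotension", "PIK3R1"),
    ("hypotension", "EGFR"),
    ("hypotension", "KRAS"),
]


def split_by_level(associations, level_of):
    """Partition associations into (same_level, cross_level) lists."""
    same_level, cross_level = [], []
    for first, second in associations:
        bucket = same_level if level_of[first] == level_of[second] else cross_level
        bucket.append((first, second))
    return same_level, cross_level


same_level, cross_level = split_by_level(ASSOCIATIONS, LEVEL_OF)
```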
  • the computer device can establish the association relationship between knowledge entities based on various data, and can also establish the association relationship between knowledge entities based on preset rules.
  • Establishing association relationships between knowledge entities based on the various data means that the computer device analyzes the semantics of the data and mines the associations contained in the data themselves. For example, an electronic medical record states that “a 42-year-old male patient’s symptoms are drinking too much water, high blood sugar and frequent urination”. From this record the computer device extracts the knowledge entities age 42, gender male, the symptom of drinking too much water, the symptom of high blood sugar, and the symptom of frequent urination, and it establishes the association relationships between these knowledge entities based on semantics.
  • Establishing associations between knowledge entities based on preset rules includes computer equipment establishing associations between knowledge entities according to rules formed by domain knowledge and experience.
  • the preset rule stored in the computing device is "Prevotella causes hypotension", and when the knowledge entities extracted by the computer device are hypotension and Prevotella, the computer device establishes an association between Prevotella and hypotension.
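The rule-based route can be sketched as follows. This is a minimal illustration, assuming that preset rules are stored as (head, relation, tail) triples and that an association is created only when both endpoints were extracted; the names `PRESET_RULES` and `apply_preset_rules` are hypothetical and do not come from the application.

```python
# Hypothetical preset rules, stored as (head entity, relation, tail entity).
# The single rule below is the example given in the text.
PRESET_RULES = [
    ("Prevotella", "causes", "hypotension"),
]


def apply_preset_rules(extracted_entities, rules=PRESET_RULES):
    """Return the rule-based associations whose two endpoints were both extracted."""
    entities = set(extracted_entities)
    return [
        (head, relation, tail)
        for head, relation, tail in rules
        if head in entities and tail in entities
    ]


# Both endpoints of the rule are present, so one association is established.
associations = apply_preset_rules(["hypotension", "Prevotella", "insomnia"])
```

If only one of the two entities is extracted, the rule does not fire and no association is added.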
  • The knowledge entities extracted by the computer device also have multiple data types; for example, the type of a knowledge entity is text or numeric value.
  • The unextracted knowledge entities include hidden nodes that cannot be covered by domain knowledge.
  • The unextracted associations include hidden associations that cannot be covered by domain knowledge. Because the computer device cannot obtain these hidden nodes and hidden associations from data semantics or domain knowledge during knowledge entity extraction, it establishes virtual knowledge nodes and virtual associations for them during knowledge representation. That is, the vector graph that the computer device obtains through representation contains virtual knowledge nodes and virtual associations that are not reflected in the knowledge graph.
  • For example, the knowledge graph obtained by the computer device contains two knowledge entities, "headache" and "cough", and there is no association relationship between these two knowledge entities.
  • During knowledge representation, the computer device can add a virtual knowledge node, such as "influencing factor 1", and add "hidden association 1" and "hidden association 2" between "influencing factor 1" and "headache" and "cough" respectively. These virtual knowledge nodes and virtual associations do not exist in the extracted knowledge graph, but are reflected in the nodes and weights of the vector graph after representation.
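The "headache" and "cough" example can be sketched in a few lines. This is an illustration under stated assumptions: edges are stored as a mapping from an unordered node pair to a weight, and virtual associations start at a weight of 0.0. The application does not say how virtual associations are initialized, and `add_virtual_node` is a hypothetical helper.

```python
def add_virtual_node(edge_weights, virtual_node, targets, initial_weight=0.0):
    """Connect a virtual node to each target with an initial weight.

    edge_weights maps an undirected edge, stored as a frozenset of two
    node names, to its weight. The input mapping is left unchanged.
    """
    updated = dict(edge_weights)
    for target in targets:
        updated[frozenset((virtual_node, target))] = initial_weight
    return updated


# "headache" and "cough" are both in the knowledge graph but are not linked,
# so the vector graph starts with no edge between them.
vector_graph_edges = add_virtual_node(
    {}, "influencing factor 1", ["headache", "cough"]
)
```

The two new edges play the role of "hidden association 1" and "hidden association 2"; there is still no direct edge between "headache" and "cough".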
  • The computer device performs knowledge representation on each knowledge entity based on a knowledge representation algorithm, and initializes the weights of the relationships among the multiple knowledge entities in the knowledge graph, to obtain a vector graph.
  • The computer device represents each knowledge entity based on a knowledge representation algorithm and represents the association relationships between the knowledge entities, so as to obtain the vector graph corresponding to the knowledge graph.
  • The computer device selects, according to the data type of a knowledge entity, the knowledge representation algorithm corresponding to that data type, and uses it to represent the knowledge entity to obtain the representation vector of the knowledge entity.
  • The computer device represents each association relationship to obtain the representation vector of the association relationship. The representation vectors of the knowledge entities and of the association relationships constitute the vector graph, which contains the initialized weights between the multiple knowledge entities.
  • Figure 4 is a schematic diagram of knowledge representation based on different data types provided by an embodiment of this application.
  • Knowledge entities are divided according to data type, and the types of knowledge entities include text nodes, value nodes, virtual nodes and other nodes.
  • The knowledge representation algorithm corresponding to a text node is, for example, a knowledge graph embedding (KGE) algorithm, the bidirectional encoder representations from transformers (BERT) algorithm or the word vector (word2vec) algorithm.
  • The computer device obtains the representation vector of a text node through the knowledge graph embedding algorithm.
  • The computer device can first apply the knowledge graph embedding algorithm to an external knowledge graph, using deep learning, to obtain the representation vectors of the external knowledge graph.
  • The computer device then matches the text nodes in the knowledge graph against the knowledge entities in the external knowledge graph to obtain the representation vectors for the knowledge graph.
  • The computer device can also use a model pre-trained with the BERT algorithm in the medical field to obtain the representation vector of a text node.
  • The knowledge representation algorithm corresponding to a value node is, for example, the multilayer perceptron (MLP) algorithm.
  • Based on an MLP model, the computer device classifies and encodes value nodes such as height, weight, age or examination values, and maps them to representation vectors so as to capture what high and low values of the data mean.
  • The computer device obtains representation vectors based on an aggregated embedding algorithm.
  • The computer device obtains representation vectors based on a random embedding algorithm.
  • The computer device obtains an edge representation vector through an edge embedding algorithm.
  • The knowledge representation algorithm in the embodiments of the present application may be a knowledge representation algorithm in a preset algorithm library or a knowledge representation algorithm input by a user, which is not specifically limited. In the preset algorithm library there is a preset one-to-one or many-to-one relationship between knowledge representation algorithms and data types, and the data types include text or value.
  • A knowledge representation algorithm input by the user supplements the knowledge representation algorithms in the preset algorithm library.
  • The above knowledge representation process is executed by the knowledge representation module in the computer device.
  • The knowledge representation module can be flexibly decoupled within the computer device and adjusted or customized according to the characteristics of the field, which makes it extensible and interactive. The knowledge representation module has different built-in representation sub-modules, which represent knowledge entities and association relationships of different data types.
  • The computer device trains the AI task model according to the vector graph.
  • The computer device trains the AI task model based on the vector graph, and the trained AI task model can be used to perform various downstream tasks.
  • Downstream tasks include medical consultation, drug recommendation, diagnostic decision support and treatment decision support.
  • The computer device iteratively trains the AI task model based on the multiple vector graphs obtained through the aforementioned steps S201-S203 until the deviation between the training output of the AI task model and the target output meets the requirement, at which point the training of the AI task model is complete.
  • During training, the computer device obtains a dynamically updated vector graph, in which the nodes and weights are updated.
  • The computer device represents the data to be predicted based on steps S201-S203 above, and obtains the prediction result from the resulting vector graph and the trained AI task model.
  • Based on the vector graph updated by the above training, the computer device also identifies the key knowledge entities and/or key associations in the knowledge graph corresponding to the task prediction, obtains an interpretable knowledge graph, and outputs the interpretable knowledge graph through a graphical user interface. Specifically, after acquiring the updated vector graph, the computer device determines, based on the weights of the edges between nodes in the updated vector graph, the association relationships in the knowledge graph that correspond to the edges whose weights exceed a preset threshold, and marks those association relationships and the knowledge entities they connect in the knowledge graph.
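The threshold step described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the edge weights and the threshold are made-up values, the entity pairs reuse associations mentioned elsewhere in the text, and `find_key_edges` is a hypothetical helper.

```python
def find_key_edges(edge_weights, threshold):
    """Return the edges whose weight exceeds the threshold, and their nodes."""
    key_edges = {
        edge: weight for edge, weight in edge_weights.items() if weight > threshold
    }
    key_nodes = {node for edge in key_edges for node in edge}
    return key_edges, key_nodes


# Hypothetical edge weights after training; the numbers are illustrative only.
trained_weights = {
    ("hypotension", "PIK3R1 gene"): 0.82,
    ("hypotension", "EGFR gene"): 0.35,
    ("Prevotella", "hypotension"): 0.10,
}
key_edges, key_nodes = find_key_edges(trained_weights, threshold=0.5)
```

The returned edges and nodes are the ones that would be marked in the interpretable knowledge graph; how they are rendered (color, line weight) is a separate display choice.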
  • FIG. 5 is a schematic diagram of an interpretable knowledge graph provided by an embodiment of the present application.
  • The knowledge graph shown in Figure 5 corresponds to a disease diagnosis task. Based on the vector graph obtained when training of the AI task model is complete, the computer device marks the key nodes and key edges in the knowledge graph corresponding to the task, and obtains the interpretable knowledge graph.
  • The interpretable knowledge graph reflects the contribution of the key nodes and key edges to the AI task model.
  • The marks can be displayed visually using different colors and weight values according to the degree of contribution, so that the key node and edge information in the graph is highlighted. For example, in Figure 5 the nodes and edges in bold are the key nodes and edges corresponding to the disease diagnosis task.
  • In the embodiments of this application, the algorithms that the computer device uses to train the AI task model include the graph convolutional network (GCN), the graph attention network (GAT) or graph sample and aggregate (GraphSAGE).
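For background, the layer-wise propagation rule of a graph convolutional network, as it is commonly stated, is given below. The application does not say which formulation it uses, so this is context only. Here $A$ is the adjacency matrix of the graph (taking its entries to be the edge weights of the vector graph is an assumption), $H^{(l)}$ is the matrix of node representations at layer $l$ (with $H^{(0)}$ assumed to hold the representation vectors of the knowledge entities), $W^{(l)}$ is a trainable weight matrix and $\sigma$ is a nonlinear activation function.

```latex
H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right),
\qquad \tilde{A} = A + I,
\qquad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}
```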
  • FIG. 6 is a schematic diagram of F1 scores of a data processing method provided by an embodiment of the present application.
  • Figure 6 compares F1 scores in the disease classification task for the data processing method provided in the embodiment of the present application. The F1 score is the harmonic mean of precision and recall, where precision is the proportion of samples predicted to be positive that are truly positive, and recall is the proportion of positive samples that are predicted correctly.
  • The F1 score is used to evaluate the classification accuracy of the disease classification task: the higher the F1 score, the more accurate the disease classification.
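The definition above corresponds to F1 = 2 · precision · recall / (precision + recall). A small sketch of the computation from raw counts follows; the function name and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0  # assumption: treat an undefined score as 0.0
    precision = true_positives / predicted_positive
    recall = true_positives / actual_positive
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 6 true positives, 2 false positives, 6 false negatives:
# precision = 6 / 8 = 0.75, recall = 6 / 12 = 0.5
score = f1_score(6, 2, 6)
```

With precision 0.75 and recall 0.5, the harmonic mean is 0.6, which is lower than the arithmetic mean of 0.625; the F1 score penalizes an imbalance between the two.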
  • Compared with the disease classification AI task model trained by the direct text representation method based on the BERT model, the disease classification AI task model trained based on the multi-dimensional graph embedding representation algorithm improves the F1 score by up to 5.7%.
  • The sample data used by the computer device to train the AI task model comes from multiple sources and has multiple types, and the AI task model is trained based on the knowledge graph extracted from this data, which improves the prediction accuracy of the AI task model.
  • The extracted knowledge graph includes knowledge entities and association relationships at multiple knowledge levels, which further improves the prediction accuracy of the AI task model.
  • FIG. 7 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • The apparatus is used to implement the steps performed by the corresponding device in the foregoing embodiments.
  • The data processing apparatus 700 includes an interface unit 701 and a processing unit 702.
  • The interface unit 701 is used to obtain various data, and the various data have different sources and different data types.
  • The processing unit 702 is used to perform knowledge extraction on the various data to obtain a knowledge graph.
  • The knowledge graph includes a plurality of knowledge entities and association relationships among the plurality of knowledge entities, and the plurality of knowledge entities have different data types.
  • The processing unit 702 is also used to perform knowledge representation on each knowledge entity using the knowledge representation algorithm corresponding to the data type of that knowledge entity, and to initialize the weights of the relationships among the multiple knowledge entities in the knowledge graph, to obtain a vector graph. The vector graph is used to train an artificial intelligence (AI) task model.
  • The processing unit 702 is specifically configured to perform knowledge extraction on the various data based on different knowledge levels to obtain a knowledge graph with multiple knowledge levels.
  • Association relationships exist between knowledge entities from different knowledge levels, and the association relationships are obtained from the various data or according to preset rules.
  • The processing unit 702 is specifically configured to determine, according to the data type of each knowledge entity and the preset relationship, the knowledge representation algorithm corresponding to that data type from the preset algorithm library, perform knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm, and obtain the representation vector corresponding to the knowledge entity.
  • Alternatively, the knowledge representation algorithm input by the user for the data type is determined according to the data type of each knowledge entity, knowledge representation is performed on the knowledge entity according to that algorithm, and the representation vector corresponding to the knowledge entity is obtained.
  • The AI task model is an AI model for disease diagnosis.
  • The various data include at least two of the following: medical record data, imaging examination reports, gene regulatory expression networks and metabolic networks.
  • The processing unit 702 is further configured to train the AI task model according to the vector graph to obtain a trained AI task model.
  • The processing unit 702 is specifically configured to update the weights in the vector graph.
  • The processing unit 702 is also configured to use the trained AI task model to perform task prediction and obtain a prediction result, and to identify, based on the updated vector graph, the key knowledge entities and/or key associations in the knowledge graph corresponding to the task prediction, to obtain an interpretable knowledge graph.
  • The processing unit 702 is further configured to output the interpretable knowledge graph through a graphical user interface (GUI).
  • Each unit in the apparatus can be implemented in the form of software called by a processing element or in the form of hardware; alternatively, some units can be implemented in the form of software called by a processing element and other units in the form of hardware.
  • Each unit can be a separate processing element, or can be integrated in a chip of the apparatus.
  • A unit can also be stored in the memory in the form of a program, which is called by a processing element of the apparatus to execute the function of the unit.
  • All or part of these units can be integrated together or implemented independently.
  • The processing element mentioned here may also be a processor, which may be an integrated circuit with signal processing capabilities.
  • Each step of the above method, or each unit above, may be implemented by an integrated logic circuit of hardware in the processing element or in the form of software called by the processing element.
  • FIG. 8 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • The computer device 800 includes a processor 810, a memory 820 and an interface 830, which are coupled through a bus (not marked in the figure).
  • The memory 820 stores instructions, and when the instructions in the memory 820 are executed, the computer device 800 executes the method executed by the computer device in the above method embodiment.
  • The computer device 800 may be one or more integrated circuits configured to implement the above method, for example one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), one or more field programmable gate arrays (FPGA), or a combination of at least two of these integrated circuit forms.
  • The units in the apparatus can be implemented in the form of a program scheduled by a processing element.
  • The processing element can be a general-purpose processor, such as a central processing unit (CPU) or another processor that can call programs.
  • These units can be integrated together and implemented in the form of a system-on-a-chip (SOC).
  • The processor 810 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component or any combination thereof.
  • A general-purpose processor can be a microprocessor or any conventional processor.
  • The memory 820 may include read-only memory and random access memory, and provides instructions and data to the processor 810.
  • The memory 820 may also include non-volatile random access memory.
  • The memory 820 may be provided with multiple partitions, each of which is used to store private keys of different software modules.
  • The memory 820 can be volatile memory or non-volatile memory, or can include both volatile and non-volatile memory.
  • The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory.
  • The volatile memory can be random access memory (RAM), which acts as an external cache.
  • Forms of RAM include static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM) and direct rambus random access memory (DR RAM).
  • The bus may also include a power bus, a control bus and a status signal bus.
  • The bus can be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), and so on.
  • A computer-readable storage medium is also provided, in which computer-executable instructions are stored.
  • When the processor of a device executes the computer-executable instructions, the device executes the method performed by the computer device in the above method embodiments.
  • In another embodiment of the present application, a computer program product is provided. The computer program product includes computer-executable instructions, which are stored in a computer-readable storage medium.
  • When the processor of a device executes the computer-executable instructions, the device executes the method performed by the computer device in the foregoing method embodiments.
  • The disclosed systems, apparatuses and methods can be implemented in other ways.
  • The apparatus embodiments described above are only illustrative.
  • The division of the units is only a logical function division, and there may be other division methods in actual implementation.
  • Multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • The mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between apparatuses or units may be in electrical, mechanical or other forms.
  • The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • The functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present application essentially, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, a network device or the like) execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.

Abstract

Disclosed in embodiments of the present invention are a data processing method and a data processing apparatus, which are used for improving the prediction accuracy of an AI task model. The method provided by an embodiment of the present invention comprises: acquiring multiple types of data, the data being from different sources and having different data types; carrying out knowledge extraction on the multiple types of data to obtain a knowledge graph, the knowledge graph comprising a plurality of knowledge entities and relationships among the plurality of knowledge entities, and the plurality of knowledge entities having different data types; carrying out knowledge representation on each knowledge entity by using a knowledge representation algorithm corresponding to the data type of the knowledge entity, and initializing the weights of the relationships among the plurality of knowledge entities in the knowledge graph so as to obtain a vector graph, the vector graph being used for training an artificial intelligence (AI) task model.

Description

A data processing method and data processing apparatus
This application claims priority to Chinese patent application No. 202111453147.2, entitled "A Data Processing Method and Data Processing Apparatus" and filed with the China Patent Office on November 30, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of artificial intelligence, and in particular to a data processing method and a data processing apparatus.
Background
In recent years, technologies related to artificial intelligence (AI) have been applied more and more widely across industries. Deep learning is an AI technology based on deep neural network algorithms that processes data by simulating the working mechanism of the human brain. At present, AI models (for example, deep learning models) are often used to complete tasks in various application scenarios; an AI model can also be called an AI task model.
In current AI technology, an AI model needs a large amount of sample data for training, and some current technical solutions train the AI model using only sample data of a relatively single data type. For example, when AI technology is applied in a clinical decision support system (CDSS) in the medical field, the sample data needed to train the deep-learning-based disease diagnosis model in the CDSS often comes only from electronic medical records, and the type of the sample data is the text in those records. Because the sample data has a single source and a single type, the prediction accuracy of the disease diagnosis model is low and its support for clinical decision-making is poor.
In some scenarios, the sample data used for AI model training can come from different data sources and have different data types. However, when such sample data is currently used to train an AI model, sample data from different sources and of different data types cannot be represented well, so the AI model cannot learn the features of the sample data during training, and the trained AI task model has low task prediction accuracy.
Therefore, how to represent sample data from different sources and of different data types, so that an AI task model trained on the represented data achieves higher task prediction accuracy, is a technical problem that urgently needs to be solved.
Summary
Embodiments of the present application provide a data processing method and a data processing apparatus for improving the prediction accuracy of an AI task model.
A first aspect of the embodiments of the present application provides a data processing method. The method is executed by a computer device, or by a component of the computer device such as a processor, a chip or a chip system, or is implemented by a logic module or software that can realize all or part of the functions of the device. Taking a computer device as an example, the data processing method includes the following. The computer device obtains various data, which have different data sources and different data types. The source of the data is related to the type of task to be trained and includes human-generated data or machine-generated data, and the data types include text, numeric values or images. The computer device performs knowledge extraction on the various data to obtain a knowledge graph. The knowledge graph includes multiple knowledge entities and the association relationships between them; the knowledge entities include key elements extracted from the various data, and the multiple knowledge entities have different data types. The computer device performs knowledge representation on each knowledge entity using the knowledge representation algorithm corresponding to the data type of that knowledge entity, and initializes the weights of the relationships between the multiple knowledge entities in the knowledge graph, to obtain a vector graph. The vector graph is used to train an artificial intelligence (AI) task model.
In the embodiments of the present application, the sample data used by the computer device to train the AI task model comes from multiple sources and has multiple types. In addition, the computer device uses the knowledge representation algorithms corresponding to the different data types to represent the abstract knowledge graph as a vector graph that the computer device can recognize. The computer device trains the AI task model based on the vector graph obtained from data of multiple sources and types, which improves the prediction accuracy of the AI task model.
In a possible implementation, when the computer device performs knowledge extraction on the various data to obtain the knowledge graph, it performs the extraction based on different knowledge levels, so as to obtain a knowledge graph with multiple knowledge levels. For example, when the computer device performs knowledge extraction on various medical data to obtain a knowledge graph for the treatment field, it can extract knowledge at multiple knowledge levels, such as the symptom level, the gene level or the microbial level, so as to obtain a knowledge graph with associations across multiple knowledge levels.
In the embodiments of the present application, the knowledge graph obtained by the computer device has multiple interrelated knowledge levels, and the AI task model is trained based on this multi-level knowledge graph. Because the knowledge graph involves multiple knowledge levels, its coverage is increased, which further improves the prediction accuracy of the AI task model.
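A multi-level knowledge graph of this kind can be sketched with a small data structure. This is an illustration only: the class `KnowledgeGraph`, its methods and the level labels are hypothetical, and the entity names are examples used elsewhere in this application.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Toy multi-level knowledge graph; all names here are illustrative."""

    levels: dict = field(default_factory=dict)  # entity name -> knowledge level
    edges: set = field(default_factory=set)     # frozenset({entity_a, entity_b})

    def add_entity(self, name, level):
        self.levels[name] = level

    def associate(self, a, b):
        if a not in self.levels or b not in self.levels:
            raise KeyError("both entities must be added before they are associated")
        self.edges.add(frozenset((a, b)))

    def cross_level_edges(self):
        """Associations whose two entities sit at different knowledge levels."""
        return {
            edge for edge in self.edges
            if len({self.levels[node] for node in edge}) > 1
        }


graph = KnowledgeGraph()
graph.add_entity("colon cancer", "symptom")
graph.add_entity("frequent bowel movements", "symptom")
graph.add_entity("hypotension", "symptom")
graph.add_entity("PIK3R1 gene", "gene")
graph.associate("colon cancer", "frequent bowel movements")  # same level
graph.associate("hypotension", "PIK3R1 gene")                # across levels
```

Tagging every entity with its level keeps same-level and cross-level associations in one structure while still allowing them to be told apart, as `cross_level_edges` does.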
In a possible implementation, association relationships exist between knowledge entities from different knowledge levels. The association relationships are obtained from the multiple types of data; for example, the computer device analyzes the semantic information of the data to obtain the relationships between knowledge entities. Alternatively, the association relationships are obtained according to preset rules; for example, knowledge association rules determined based on domain knowledge are pre-stored in the computer device, and the computer device establishes association relationships between knowledge entities at different levels based on these preset rules.
In this embodiment of the present application, the computer device obtains the associations inherent in the multiple types of data and also establishes association relationships between knowledge entities at the same level or at different levels according to preset rules. Using multiple ways of obtaining association relationships fully exploits the intrinsic connections between knowledge entities at different knowledge levels and increases the amount of data available for training the AI task model.
In a possible implementation, in the process of performing knowledge representation on each knowledge entity, the computer device determines, according to the data type of the knowledge entity and a preset relationship, the knowledge representation algorithm corresponding to that data type from a preset algorithm library, and performs knowledge representation on the knowledge entity using the corresponding algorithm to obtain the representation vector of the knowledge entity. For example, when the data type of the knowledge entity is text, the computer device selects a knowledge representation algorithm from the preset algorithm library according to the preset relationship between the text type and knowledge representation algorithms. The knowledge representation algorithm corresponding to the text type is, for example, a knowledge graph embedding (KGE) algorithm, a bidirectional encoder representations from transformers (BERT) algorithm, or a word vector (word2vec) algorithm.
In this embodiment of the present application, the computer device selects the corresponding knowledge representation algorithm from the preset algorithm library according to the data type of the knowledge entity, which improves the efficiency of representing knowledge entities and association relationships.
In a possible implementation, the computer device determines, according to the data type of each knowledge entity, a user-input knowledge representation algorithm corresponding to that data type, and performs knowledge representation on the knowledge entity using the corresponding algorithm to obtain the representation vector of the knowledge entity.
In this embodiment of the present application, the knowledge representation algorithm may be a user-defined knowledge representation algorithm, which improves applicability to different knowledge representation algorithms.
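The selection of a knowledge representation algorithm by data type, from either a preset library or a user-supplied algorithm, can be sketched as a simple lookup. The sketch below is illustrative only: the names `PRESET_LIBRARY`, `represent`, `embed_text` and `embed_number` are hypothetical, the toy encoders merely stand in for algorithms such as word2vec or BERT, and letting user-supplied entries override preset ones is an assumption rather than something the application specifies.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Algorithm = Callable[[object], Vector]


def embed_text(value: object) -> Vector:
    """Toy stand-in for a text encoder such as word2vec or BERT."""
    vec = [0.0, 0.0, 0.0, 0.0]
    for ch in str(value):
        vec[ord(ch) % 4] += 1.0  # bucket character codes into 4 dimensions
    return vec


def embed_number(value: object) -> Vector:
    """Represent a numeric entity as a one-dimensional vector."""
    return [float(value)]


# Hypothetical preset relationship: data type -> representation algorithm.
PRESET_LIBRARY: Dict[str, Algorithm] = {"text": embed_text, "numeric": embed_number}


def represent(value: object, data_type: str,
              user_algorithms: Optional[Dict[str, Algorithm]] = None) -> Vector:
    """Pick the algorithm for data_type; user-supplied entries take precedence (assumed)."""
    library = {**PRESET_LIBRARY, **(user_algorithms or {})}
    if data_type not in library:
        raise ValueError(f"no representation algorithm for data type {data_type!r}")
    return library[data_type](value)
```

For example, `represent(42, "numeric")` uses the preset numeric encoder, while `represent("abc", "text", {"text": my_encoder})` would route text entities to a user-defined `my_encoder`.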
In a possible implementation, the AI task model is an AI model for disease diagnosis, and the multiple types of data include at least two of the following: medical record data, imaging examination reports, gene regulatory expression networks, and metabolic networks.
The data processing method in this embodiment of the present application can be applied to the medical field. The trained AI task model may be an AI task model for disease diagnosis; training the disease diagnosis model on sample data from multiple sources improves its diagnostic accuracy.
In a possible implementation, the computer device trains the AI task model according to the vector graph to obtain a trained AI task model.
In this embodiment of the present application, the computer device trains the AI task model using the vector graph obtained by representing the knowledge graph, which improves the feasibility of training the AI task model.
In a possible implementation, in the process of training the AI task model according to the vector graph, the computer device updates the weights in the vector graph.
In this embodiment of the present application, the computer device can continuously update the weights in the vector graph, thereby improving the accuracy of the trained AI task model.
In a possible implementation, the computer device performs task prediction using the trained AI task model to obtain a prediction result, and, based on the updated vector graph, marks the key knowledge entities and/or key association relationships in the knowledge graph corresponding to the task prediction, thereby obtaining an interpretable knowledge graph.
In this embodiment of the present application, the computer device can mark the key knowledge entities and/or key association relationships of the knowledge graph that are used in the task prediction, which improves the interpretability of the model's prediction results.
In a possible implementation, the computer device outputs the interpretable knowledge graph through a graphical user interface (GUI).
In this embodiment of the present application, the computer device outputs the interpretable knowledge graph through a graphical user interface (GUI), which improves the feasibility of the solution.
A second aspect of the embodiments of the present application provides a data processing apparatus that includes an interface unit and a processing unit. The interface unit is configured to acquire multiple types of data, where the various data have different sources and different data types. The processing unit is configured to perform knowledge extraction on the multiple types of data to obtain a knowledge graph; the knowledge graph includes multiple knowledge entities and the association relationships between them, and the multiple knowledge entities include different data types. The processing unit is further configured to perform knowledge representation on each knowledge entity using the knowledge representation algorithm corresponding to the data type of that knowledge entity, and to initialize the weights of the relationships between the multiple knowledge entities in the knowledge graph, to obtain a vector graph. The vector graph is used to train an artificial intelligence (AI) task model.
In a possible implementation, the processing unit is specifically configured to perform knowledge extraction on the multiple types of data based on different knowledge levels to obtain a multi-knowledge-level knowledge graph.
In a possible implementation, association relationships exist between knowledge entities from different knowledge levels; the association relationships are obtained from the multiple types of data, or the association relationships are obtained according to preset rules.
In a possible implementation, the processing unit is specifically configured to determine, according to the data type of each knowledge entity and a preset relationship, the knowledge representation algorithm corresponding to that data type from a preset algorithm library, and to perform knowledge representation on the knowledge entity using the corresponding algorithm to obtain the representation vector of the knowledge entity.
In a possible implementation, the processing unit is specifically configured to determine, according to the data type of each knowledge entity, a user-input knowledge representation algorithm corresponding to that data type, and to perform knowledge representation on the knowledge entity using the corresponding algorithm to obtain the representation vector of the knowledge entity.
In a possible implementation, the AI task model is an AI model for disease diagnosis, and the multiple types of data include at least two of the following: medical record data, imaging examination reports, gene regulatory expression networks, and metabolic networks.
In a possible implementation, the processing unit is further configured to train the AI task model according to the vector graph to obtain a trained AI task model.
In a possible implementation, the processing unit is specifically configured to update the weights in the vector graph.
In a possible implementation, the processing unit is further configured to perform task prediction using the trained AI task model to obtain a prediction result, and, based on the updated vector graph, to mark the key knowledge entities and/or key association relationships in the knowledge graph corresponding to the task prediction, thereby obtaining an interpretable knowledge graph.
In a possible implementation, the processing unit is further configured to output the interpretable knowledge graph through a graphical user interface (GUI).
A third aspect of the embodiments of the present application provides a computer device. The computer device includes a processor coupled to a memory; the memory is configured to store instructions that, when executed by the processor, cause the computer device to perform the method described in the first aspect or in any possible implementation of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing instructions that, when executed, cause a computer to perform the method described in the first aspect or in any possible implementation of the first aspect.
A fifth aspect of the embodiments of the present application provides a computer program product that includes instructions that, when executed, cause a computer to implement the method described in the first aspect or in any possible implementation of the first aspect.
It can be understood that, for the beneficial effects achievable by the data processing apparatus, computer device, computer-readable storage medium, and computer program product provided above, reference may be made to the beneficial effects of the corresponding method; they are not repeated here.
Description of Drawings
FIG. 1 is a schematic diagram of the system architecture of a data processing method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of knowledge extraction provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of knowledge representation provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of building an AI task model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a data processing effect provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a data processing method and a data processing apparatus for improving the accuracy of clinical decision-making.
The terms "first", "second", "third", "fourth", and so on (if present) in the description, the claims, and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "include" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present application should not be construed as preferred or advantageous over other embodiments or designs. Rather, such words are intended to present the related concept in a concrete manner.
Some terms used in the present application are explained below to facilitate understanding by those skilled in the art.
Deep learning (DL) is a machine learning technology based on deep neural network algorithms. Its main characteristic is that it processes and analyzes data using multiple nonlinear transformations. It is applied, for example, in image recognition, speech recognition, natural language processing, and medical imaging data.
Graph deep learning (GDL) applies deep learning algorithms to graph-structured data, for example through graph neural networks or graph convolutional networks. A graph convolutional network (GCN) is a class of neural network methods that implement convolution on graph-structured data, for example by means of the Laplacian matrix or the Fourier transform.
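As an illustration of convolution on graph-structured data, one widely used GCN layer (the formulation of Kipf and Welling from the general literature, not a formula stated in this application) propagates node features as follows. Here $A$ is the adjacency matrix of the graph, $I$ the identity matrix that adds self-loops, $H^{(l)}$ the node feature matrix at layer $l$, $W^{(l)}$ a trainable weight matrix, and $\sigma$ a nonlinear activation.

```latex
H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right),
\qquad \tilde{A} = A + I,
\qquad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}
```

The normalized matrix $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is closely related to the symmetric normalized graph Laplacian $L = I - D^{-1/2} A D^{-1/2}$, which is where the Laplacian matrix enters.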
A gene regulatory network (GRN) is a network of interactions between DNA and proteins. Gene activity is regulated by transcription factors that bind to DNA, and most transcription factors bind to multiple binding sites in the genome; all cells therefore have complex gene regulatory networks. For example, the human genome encodes approximately 1,400 transcription factors, which regulate the expression of more than 20,000 human genes. Techniques for studying gene regulatory networks include binding-site analysis methods such as ChIP-chip and ChIP-seq.
A metabolic network (ME) is the network formed when the various chemical substances in a living cell are connected by biochemical reactions. Biochemical reactions are catalyzed by enzymes and convert one chemical substance into another. All chemical substances in a cell are therefore part of a complex network of biochemical reactions, and this network is called the metabolic network.
Electronic medical records (EMR) are computer-based electronic patient records.
Real-world data (RWD) refers to data obtained from sources other than traditional clinical trials. Examples of such sources include large simple clinical trials, clinical trials in actual medical practice, prospective observational or registry studies, database analyses, case reports, health management reports, and electronic health records.
A knowledge graph is a graph-based data structure composed of nodes and edges, where each node is a knowledge entity and each edge is an association relationship between knowledge entities. A knowledge entity can be a thing in the real world, such as a name, a gender, or a symptom, and an association relationship expresses some connection between different knowledge entities.
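The node-and-edge structure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class and field names (`Entity`, `KnowledgeGraph`, `level`, `data_type`) and the relation label `associated_with` are hypothetical and are not defined by the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entity:
    name: str
    data_type: str  # e.g. "text" or "numeric"
    level: str      # knowledge level, e.g. "phenotypic" or "genetic"


@dataclass
class KnowledgeGraph:
    entities: Dict[str, Entity] = field(default_factory=dict)
    # Each edge is stored as (head entity name, relation label, tail entity name).
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def add_relation(self, head: str, relation: str, tail: str) -> None:
        if head not in self.entities or tail not in self.entities:
            raise KeyError("both endpoints must be added as entities first")
        self.relations.append((head, relation, tail))

    def neighbors(self, name: str) -> List[str]:
        """Return entities connected to `name` by any edge, in insertion order."""
        out = []
        for head, _, tail in self.relations:
            if head == name:
                out.append(tail)
            elif tail == name:
                out.append(head)
        return out


kg = KnowledgeGraph()
kg.add_entity(Entity("hypotension", "text", "phenotypic"))
kg.add_entity(Entity("PIK3R1", "text", "genetic"))
kg.add_relation("hypotension", "associated_with", "PIK3R1")  # a cross-level edge
```

Storing the knowledge level on each entity keeps cross-level edges, such as the one above, distinguishable from edges within a single level.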
Some terms used in the embodiments of the present application have been introduced above. The data processing method and data processing apparatus provided by the embodiments are described below with reference to the accompanying drawings.
Refer to FIG. 1, which is a schematic diagram of the system architecture of a data processing method provided by an embodiment of the present application. As shown in FIG. 1, the data processing system 10 includes a knowledge extraction module 101, a knowledge representation module 102, a knowledge modeling module 103, and an attention visualization module 104. Each module in the data processing system 10 can be invoked independently, and the data processing system 10 can also be flexibly extended with other modules; this is not specifically limited.
It can be understood that the modules of the data processing system 10 are logical units divided by system function. The physical entity of the data processing system 10 may be a centralized or distributed computer device or server, or may be a component of a computer device or server, for example a processor, a chip, or a chip system of the computer device.
The data processing system 10 in this embodiment of the present application can integrate data based on knowledge of the relevant domain, train an AI task model, and use the AI task model to make interpretable predictions about target events. The data processing system 10 is a general framework for heterogeneous data processing and target event prediction and can be applied to various fields. The modules of the data processing system 10 are described below using the medical field as an example.
The knowledge extraction module 101 can extract usable knowledge entities from data of multiple sources and multiple types, and can establish association relationships between knowledge entities at different knowledge levels based on data semantics or domain knowledge, forming a knowledge graph. The knowledge graph contains knowledge entities and the association relationships between them. The knowledge entities include element information distilled from the multiple types of data, and the association relationships include the connections between the distilled element information. For example, in the medical field, the data used to evaluate a patient's physical condition may come from electronic medical records, imaging data, gene regulatory networks, protein metabolism networks, and so on. The knowledge extraction module 101 can extract knowledge entities from these heterogeneous data of different sources and data types, forming a connected knowledge graph in which the knowledge entities are the nodes and the association relationships between them are the edges. Such a knowledge graph can capture as much of the key information in the heterogeneous data as possible.
The knowledge entity nodes in the knowledge graph may be at different knowledge levels, and the association relationships between knowledge entity nodes may also cross knowledge levels. For example, the knowledge graph may be a multi-level graph comprising a phenotypic symptom layer, a gene sequencing data layer, a metabolic data layer, and so on, and a knowledge entity in the phenotypic symptom layer may be associated with a knowledge entity in the gene sequencing data layer.
The knowledge representation module 102 is configured to represent the knowledge graph, including its knowledge entities and the association relationships between them, as a vector graph. It can be understood that the knowledge graph obtained by the knowledge extraction module 101 cannot be used directly to train the AI task model; the knowledge representation module 102 must first represent the knowledge entities in the knowledge graph as vector data, which is then used to train the AI task model. The knowledge representation module 102 includes a node module for representing knowledge entities and an edge module for representing the association relationships between knowledge entities. The node module and the edge module each contain multiple sub-modules, and different sub-modules are used to represent knowledge entities or association relationships of different data types.
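A minimal sketch of how a vector graph might be assembled from a knowledge graph is given below: every entity is mapped to a vector and every association relationship receives an initial weight. The application does not specify how the weights are initialized, so the uniform random initialization here is an assumption, and the names `build_vector_graph` and `toy_represent` are hypothetical.

```python
import random
from typing import Callable, Dict, Iterable, List, Tuple

Vector = List[float]
Edge = Tuple[str, str]


def build_vector_graph(
    entities: Dict[str, object],
    relations: Iterable[Edge],
    represent: Callable[[object], Vector],
    seed: int = 0,
) -> Tuple[Dict[str, Vector], Dict[Edge, float]]:
    """Return (node vectors, initial edge weights) for a knowledge graph."""
    rng = random.Random(seed)  # seeded so the initialization is reproducible
    node_vectors = {name: represent(value) for name, value in entities.items()}
    # Assumed initialization: one weight per edge, drawn uniformly from [0, 1].
    edge_weights = {edge: rng.uniform(0.0, 1.0) for edge in relations}
    return node_vectors, edge_weights


def toy_represent(value: object) -> Vector:
    """Toy encoder: numbers map to themselves, text maps to its length."""
    if isinstance(value, (int, float)):
        return [float(value)]
    return [float(len(str(value)))]


nodes, weights = build_vector_graph(
    {"age": 42, "symptom": "frequent urination"},
    [("age", "symptom")],
    toy_represent,
)
```

In a real system the `represent` argument would be replaced by the data-type-specific knowledge representation algorithms, and the edge weights would subsequently be updated during model training.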
The knowledge modeling module 103 is configured to train a deep learning model based on the vector graph; deep learning models trained on different vector graphs support different downstream tasks. The deep learning model includes a graph convolutional network (GCN), a graph attention network (GAT), or graph sample and aggregate (GraphSAGE), and may also incorporate a Transformer structure. Downstream tasks include auxiliary diagnosis, examination recommendation, drug recommendation, and the like.
The attention visualization module 104 is configured to mark the key nodes and key edges in the knowledge graph based on the updated vector graph and to display them visually, so that key node and edge information in the knowledge graph is highlighted. The updated vector graph is the vector graph obtained after training of the deep learning model is complete.
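One simple way to mark key edges and key nodes from an updated vector graph is to rank edges by their learned weight and keep the strongest ones. The sketch below assumes that a larger weight means a more important edge and uses a hypothetical `top_k` cutoff; the weights in the usage example are made-up illustrative numbers, not results from the application.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def key_elements(edge_weights: Dict[Edge, float],
                 top_k: int = 2) -> Tuple[List[Edge], List[str]]:
    """Return the top_k highest-weight edges and the nodes they touch."""
    ranked = sorted(edge_weights.items(), key=lambda item: item[1], reverse=True)
    key_edges = [edge for edge, _ in ranked[:top_k]]
    key_nodes = sorted({node for edge in key_edges for node in edge})
    return key_edges, key_nodes


# Illustrative (made-up) weights after training.
updated = {
    ("colon cancer", "abdominal pain"): 0.9,
    ("hypotension", "PIK3R1"): 0.2,
    ("colon cancer", "FAP"): 0.7,
}
edges, nodes = key_elements(updated, top_k=2)
```

The returned edges and nodes are the ones a GUI would highlight when rendering the interpretable knowledge graph.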
Refer to FIG. 2, which is a schematic flowchart of a data processing method provided by an embodiment of the present application. The method is applied to the data processing system shown in FIG. 1. Taking execution by a computer device as an example, the data processing method includes the following steps:
201. The computer device acquires multiple types of data, which include data of different data sources and data types.
The computer device acquires multiple types of data, which serve as sample data for training the AI task model. The acquired data have different data sources and data types. The data sources vary with the type of task; specifically, the data may be generated by people or by machines. The data types include text, numeric values, and images.
In the medical scenario example, the computer device acquires multiple kinds of medical data. The medical data may be real-world data (RWD), and the various medical data have different data sources. For example, the sources may be large simple clinical trials, clinical trials in actual medical practice, prospective observational studies, registry studies, retrospective database analyses, case reports, health management reports, medical record data, imaging examination reports, gene regulatory expression networks, metabolic networks, protein information, or microbial information.
202. The computer device performs knowledge extraction on the multiple types of data to obtain a knowledge graph, which includes multiple knowledge entities and the association relationships between them.
After acquiring the multiple types of data, the computer device extracts knowledge entities and the association relationships between them from the acquired data. The knowledge entities and the association relationships between them form the knowledge graph.
Specifically, in the process of knowledge extraction, the computer device classifies the extracted knowledge entities by knowledge level and establishes association relationships between knowledge entities at different knowledge levels. The knowledge entities and association relationships at the different levels form a multi-knowledge-level knowledge graph.
It should be noted that, after extracting the knowledge graph, the computer device needs to standardize the knowledge entities in it. For example, if the knowledge entity that the computer device extracts from an electronic medical record is the colloquial "stomach ache", the standardized knowledge entity is "abdominal pain".
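The standardization step can be sketched as a lookup from colloquial surface forms to a standard term. The synonym table and the function name `standardize` below are hypothetical; a practical system would more likely map entities onto an established medical terminology.

```python
from typing import Dict

# Hypothetical synonym table: colloquial form -> standardized term.
SYNONYMS: Dict[str, str] = {
    "stomach ache": "abdominal pain",
    "tummy pain": "abdominal pain",
}


def standardize(entity: str, synonyms: Dict[str, str] = SYNONYMS) -> str:
    """Normalize case and whitespace, then map known synonyms to the standard term."""
    key = " ".join(entity.lower().split())
    return synonyms.get(key, key)
```

Entities without an entry in the table are returned in their normalized (lowercased, whitespace-collapsed) form, so standardization never discards an extracted entity.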
Refer to FIG. 3, which is a schematic diagram of a multi-knowledge-level knowledge graph in the medical field provided by an embodiment of the present application. In the example shown in FIG. 3, the computer device performs knowledge extraction on multiple kinds of medical data, including electronic medical records, radiology information, gene information, protein information, or microbial information. The computer device divides the extracted knowledge entities by knowledge level into a genetic level, a phenotypic (symptom) level, and a metagenomic (microbial) level.
As shown in FIG. 3, knowledge entities at the genetic level include, for example, the PTPN11 gene, the PIK3R1 gene, and the CDC42 gene. Knowledge entities at the phenotypic level include, for example, frequent bowel movements, hypotension, and insomnia. Knowledge entities at the metagenomic level include, for example, Prevotella, Holdemania, and Dorea.
In the example shown in FIG. 3, after classifying the knowledge entities by knowledge level, the computer device establishes association relationships between knowledge entities based on domain knowledge, including relationships between knowledge entities within the same knowledge level and relationships between knowledge entities at different knowledge levels. As an example of a relationship within one level, at the phenotypic level colon cancer is associated with frequent bowel movements, abdominal pain, and familial adenomatous polyposis (FAP). As an example of a relationship across levels, hypotension at the phenotypic level is associated with the PIK3R1, EGFR, and KRAS genes at the genetic level.
本申请实施例中计算机设备可以基于多种数据建立知识实体之间的关联关系,也可以 基于预置规则建立知识实体之间关联关系。基于多种数据建立知识实体之间关联关系包括计算机设备对多种数据的语义进行分析,挖掘出多种数据自身包含的关联。例如,电子病例中记录了“一位42岁的男性患者的症状为喝水多、高血糖和尿频”,计算机设备基于电子病历数据抽取的知识实体为年龄42、性别男、症状喝水多、症状高血糖和症状尿频,并基于语义建立这些知识实体之间的关联关系。In the embodiment of the present application, the computer device can establish the association relationship between knowledge entities based on various data, and can also establish the association relationship between knowledge entities based on preset rules. Establishing the association relationship between knowledge entities based on various data includes analyzing the semantics of various data by computer equipment, and mining the association contained in various data itself. For example, it is recorded in the electronic medical record that “a 42-year-old male patient’s symptoms are drinking too much water, high blood sugar and frequent urination”, and the knowledge entities extracted by the computer equipment based on the electronic medical record data are age 42, gender male, symptoms of drinking too much water, Symptoms of hyperglycemia and symptoms of frequent urination, and establish the association relationship between these knowledge entities based on semantics.
Establishing associations between knowledge entities based on preset rules means that the computer device establishes the associations according to rules formed from domain knowledge and experience. For example, the computer device stores the preset rule "Prevotella causes hypotension." When the knowledge entities extracted by the computer device include hypotension and Prevotella, the computer device establishes an association between Prevotella and hypotension.
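The rule-based step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triple layout of a rule, the names `PRESET_RULES` and `apply_preset_rules`, and the relation label "causes" are hypothetical and are not specified by the application.

```python
# Hypothetical rule table: (source entity, relation, target entity).
PRESET_RULES = [
    ("Prevotella", "causes", "hypotension"),
]


def apply_preset_rules(entities, rules=PRESET_RULES):
    """Return one edge for every rule whose two endpoints were both extracted."""
    present = set(entities)
    return [(s, r, t) for (s, r, t) in rules if s in present and t in present]


# Both endpoints of the rule were extracted, so an association is created.
edges = apply_preset_rules(["hypotension", "Prevotella", "abdominal pain"])
print(edges)
```

A rule fires only when both of its entities appear among the extracted knowledge entities, which mirrors the Prevotella and hypotension example.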
In the embodiments of this application, the knowledge entities extracted by the computer device also have multiple data types; for example, the type of a knowledge entity may be text or numeric. It should be noted that when the computer device performs knowledge extraction on the multiple types of data, some knowledge entities and associations are not extracted. The unextracted knowledge entities include hidden nodes that domain knowledge does not cover, and the unextracted associations include hidden associations that domain knowledge does not cover. Because the computer device cannot obtain these hidden nodes and hidden associations from data semantics or domain knowledge during knowledge-entity extraction, it creates virtual knowledge nodes and virtual associations for them during knowledge representation. In other words, the vector graph produced by the representation contains virtual knowledge nodes and virtual associations that do not appear in the knowledge graph.
For example, the knowledge graph obtained by the computer device contains the two knowledge entities "headache" and "cough", and no association has been established between them. During knowledge representation, the computer device may add a virtual knowledge node, for example "influencing factor 1", and add "hidden association 1" and "hidden association 2" between "influencing factor 1" and "headache" and "cough", respectively. These virtual knowledge nodes and virtual associations do not exist in the extracted knowledge graph, but they are reflected in the nodes and weights of the vector graph obtained after representation.
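The headache and cough example can be sketched as follows. The data layout (a set of node names and a dictionary of edges) and the helper name `add_virtual_node` are assumptions made for illustration; the application does not prescribe how the vector graph stores virtual nodes.

```python
def add_virtual_node(nodes, edges, name, targets):
    """Return copies of nodes/edges extended with a virtual node and hidden edges.

    The inputs (the extracted knowledge graph) are left unchanged, so the
    virtual node exists only in the returned vector-graph structures.
    """
    new_nodes = nodes | {name}
    hidden = {
        (name, target): f"hidden association {i}"
        for i, target in enumerate(targets, start=1)
    }
    return new_nodes, {**edges, **hidden}


kg_nodes = {"headache", "cough"}  # extracted knowledge graph: no edge between them
kg_edges = {}
vg_nodes, vg_edges = add_virtual_node(
    kg_nodes, kg_edges, "influencing factor 1", ["headache", "cough"]
)
```

Keeping the knowledge graph untouched and extending only the returned copies reflects the statement that virtual nodes and associations appear in the vector graph but not in the extracted knowledge graph.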
203. The computer device performs knowledge representation on each knowledge entity based on a knowledge representation algorithm, and initializes weights for the relationships among the multiple knowledge entities in the knowledge graph, to obtain a vector graph.
The computer device performs knowledge representation on each knowledge entity based on a knowledge representation algorithm, and performs association representation on the associations between the knowledge entities, thereby obtaining the vector graph corresponding to the knowledge graph.
Specifically, when representing a knowledge entity, the computer device selects the knowledge representation algorithm that corresponds to the data type of the knowledge entity, and uses that algorithm to obtain the representation vector of the knowledge entity. The computer device also represents each association to obtain the representation vector of the association. The representation vectors of the knowledge entities and the representation vectors of the associations together form the vector graph, and the resulting vector graph contains the initialized weights between the multiple knowledge entities.
Refer to FIG. 4, which is a schematic diagram of knowledge representation based on different data types according to an embodiment of this application. As shown in FIG. 4, knowledge entities are divided by data type, and the types of knowledge entity include text nodes, numeric nodes, virtual nodes, and other nodes. The knowledge representation algorithm for text nodes is, for example, a knowledge graph embedding (KGE) algorithm, a bidirectional encoder representations from transformers (BERT) algorithm, or a word vector (word2vec) algorithm. For example, the computer device obtains the representation vectors of text nodes through a knowledge graph embedding algorithm. Specifically, the computer device may first apply the knowledge graph embedding algorithm to an external knowledge graph to learn representation vectors for that external knowledge graph, and then match the text nodes in the knowledge graph against the knowledge entities in the external knowledge graph, thereby obtaining the representation vectors for the text nodes in the knowledge graph. As another example, the computer device may use a BERT model pre-trained for the medical domain to obtain the representation vectors of text nodes.
As shown in FIG. 4, the knowledge representation algorithm for numeric nodes is, for example, a multilayer perceptron (MLP) algorithm. For example, the computer device uses an MLP model to classify and encode numeric nodes such as height, weight, age, or test values, and maps them to representation vectors, so as to capture what high and low values mean.
As shown in FIG. 4, for virtual knowledge nodes, the computer device obtains representation vectors based on an aggregated embedding algorithm; for other nodes, the computer device obtains representation vectors based on a random embedding algorithm. In the example shown in FIG. 4, for the associations between knowledge entities, the computer device obtains the representation vectors of the edges through an edge embedding algorithm.
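The per-type selection of a representation algorithm described for FIG. 4 can be sketched as a simple dispatch table. Everything below is a stand-in under stated assumptions: the encoder functions are toy placeholders rather than real KGE, BERT, word2vec, or MLP models, the vector size `DIM` is arbitrary, and averaging neighbour vectors is only one possible reading of "aggregated embedding".

```python
import hashlib
import random

DIM = 4  # illustrative embedding size (assumption)


def text_embedding(value):
    # Placeholder for KGE / BERT / word2vec: a deterministic hash-based vector.
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def numeric_embedding(value):
    # Placeholder for an MLP: a few fixed features of the number.
    x = float(value)
    return [x, x * x, 1.0 if x > 0 else 0.0, 1.0]


def random_embedding(value):
    # "Other" nodes: a random vector, seeded by the value so runs are repeatable.
    rng = random.Random(str(value))
    return [rng.random() for _ in range(DIM)]


def aggregated_embedding(neighbor_vectors):
    # Virtual nodes: element-wise mean of the neighbours' vectors (assumption).
    n = len(neighbor_vectors)
    return [sum(column) / n for column in zip(*neighbor_vectors)]


# Preset mapping from data type to representation algorithm.
REGISTRY = {"text": text_embedding, "numeric": numeric_embedding}


def represent(node_type, value):
    """Pick the encoder registered for the node type, else fall back to random."""
    return REGISTRY.get(node_type, random_embedding)(value)


headache = represent("text", "headache")
age = represent("numeric", 42)
virtual = aggregated_embedding([headache, age])
```

A dictionary keyed by data type keeps the mapping easy to extend, for example by registering a user-supplied encoder under a new key.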
The knowledge representation algorithm in the embodiments of this application may be one from a preset algorithm library, or one input by a user; this is not specifically limited. In the preset algorithm library, knowledge representation algorithms and data types have a preset one-to-one or many-to-one relationship, and the fixed data types include text or numeric values. A knowledge representation algorithm input by a user supplements the knowledge representation algorithms in the preset algorithm library.
The knowledge representation process described above is performed by a knowledge representation module in the computer device. The module can be flexibly decoupled within the computer device and adjusted or customized according to the characteristics of the domain, which makes it extensible and interactive. It can be understood that the knowledge representation module contains different representation sub-modules, which represent knowledge entities and associations of different data types.
204. The computer device trains an AI task model based on the vector graph.
The computer device trains the AI task model based on the vector graph, and the trained AI task model can be used to perform various downstream tasks. Taking the medical field as an example, downstream tasks include medical consultation, drug recommendation, diagnostic decision support, and treatment decision support.
Specifically, the computer device iteratively trains the AI task model on multiple vector graphs obtained through steps S201 to S203 above, until the deviation between the training output of the AI task model and the target output meets the requirement, at which point training is complete. At the same time, the computer device obtains a vector graph that has been dynamically updated during training, in which the nodes and weights have been updated. When the trained AI task model is used for task prediction (for example, diagnosing patient A based on multiple types of data about patient A), the computer device represents the data to be predicted in the manner of steps S201 to S203, and obtains the prediction result from the resulting vector graph and the trained AI task model.
Based on the vector graph updated by the training described above, the computer device also marks the key knowledge entities and/or key associations in the knowledge graph corresponding to the task prediction, obtains an interpretable knowledge graph, and outputs the interpretable knowledge graph through a graphical user interface. Specifically, after obtaining the updated vector graph, the computer device uses the weights of the edges between nodes in the updated vector graph to determine the associations in the knowledge graph that correspond to edges whose weight exceeds a preset threshold. The computer device then marks those associations, and the knowledge entities they connect, in the knowledge graph.
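The thresholding step can be sketched as follows. The edge-weight dictionary, the function name `key_relations`, and the weight values are hypothetical and chosen only to illustrate the selection; they are not results from the application.

```python
def key_relations(edge_weights, threshold):
    """Select edges whose trained weight exceeds the threshold, plus their endpoints."""
    key_edges = {edge for edge, weight in edge_weights.items() if weight > threshold}
    key_nodes = {node for edge in key_edges for node in edge}
    return key_edges, key_nodes


# Made-up weights from an updated vector graph, keyed by (entity, entity).
weights = {
    ("colon cancer", "abdominal pain"): 0.9,
    ("colon cancer", "headache"): 0.1,
}
edges_to_mark, nodes_to_mark = key_relations(weights, threshold=0.5)
```

The returned sets identify which associations and knowledge entities would be highlighted in the interpretable knowledge graph.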
Refer to FIG. 5, which is a schematic diagram of an interpretable knowledge graph according to an embodiment of this application. The knowledge graph shown in FIG. 5 corresponds to a disease diagnosis task. Based on the vector graph obtained after the AI task model is trained, the computer device annotates the key nodes and key edges in the knowledge graph corresponding to the task, to obtain an interpretable knowledge graph that reflects the contribution of the key nodes and key edges to the AI task model. The annotations may be visualized with different colors, weight values, and so on according to the degree of contribution, so that the key nodes and edges in the graph stand out. For example, in FIG. 5, the nodes and edges in bold are the key nodes and key edges for the disease diagnosis task.
In the embodiments of this application, the algorithms that the computer device uses to train the AI task model include a graph convolutional network (GCN), a graph attention network (GAT), or graph sample and aggregate (GraphSAGE).
Refer to FIG. 6, which is a schematic diagram of F1 scores for a data processing method according to an embodiment of this application. FIG. 6 compares the F1 scores of the data processing method provided in the embodiments of this application on a disease classification task. The F1 score is the harmonic mean of precision and recall, where precision is the proportion of samples predicted positive that are truly positive, and recall is the proportion of positive samples that are correctly predicted. The F1 score is used to evaluate classification accuracy on the disease classification task; the higher the F1 score, the more accurate the disease classification. As FIG. 6 shows, compared with a disease classification AI task model trained with direct text representation based on a BERT model, the disease classification AI task model trained with the multi-dimensional graph embedding representation algorithm improves the F1 score by up to 5.7%.
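For reference, the definitions above can be written out as formulas. The symbols TP, FP, and FN (true positives, false positives, and false negatives) are standard notation introduced here for illustration and do not appear in the application.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
\]
```

Because the harmonic mean is dominated by the smaller of its two arguments, a model scores well on F1 only when precision and recall are both high.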
In the embodiments of this application, the sample data that the computer device uses to train the AI task model comes from multiple sources and has multiple types. Training the AI task model on a knowledge graph extracted from such data improves the prediction accuracy of the AI task model. In addition, the extracted knowledge graph includes knowledge entities and associations at multiple knowledge levels, which further improves the prediction accuracy of the AI task model.
The data processing method provided in the embodiments of this application has been described above. The data processing apparatus in the embodiments of this application is described below with reference to the accompanying drawings.
Refer to FIG. 7, which is a schematic structural diagram of a data processing apparatus according to an embodiment of this application. The apparatus is configured to implement the steps of the corresponding device in the foregoing embodiments. As shown in FIG. 7, the data processing apparatus 700 includes an interface unit 701 and a processing unit 702.
The interface unit 701 is configured to obtain multiple types of data, where the data have different sources and different data types. The processing unit 702 is configured to perform knowledge extraction on the multiple types of data to obtain a knowledge graph. The knowledge graph includes multiple knowledge entities and the associations among them, and the multiple knowledge entities have different data types. The processing unit 702 is further configured to perform knowledge representation on each knowledge entity using the knowledge representation algorithm that corresponds to the data type of that knowledge entity, and to initialize weights for the relationships among the multiple knowledge entities in the knowledge graph, to obtain a vector graph. The vector graph is used to train an artificial intelligence (AI) task model.
In a possible implementation, the processing unit 702 is specifically configured to perform knowledge extraction on the multiple types of data based on different knowledge levels, to obtain a knowledge graph with multiple knowledge levels.
In a possible implementation, associations exist between knowledge entities from different knowledge levels, and the associations are obtained from the multiple types of data or according to preset rules.
In a possible implementation, the processing unit 702 is specifically configured to determine, based on the data type of each knowledge entity and a preset relationship, the knowledge representation algorithm in a preset algorithm library that corresponds to the data type of the knowledge entity, and to perform knowledge representation on the knowledge entity with the corresponding algorithm to obtain the representation vector of the knowledge entity.
In a possible implementation, the processing unit 702 is specifically configured to determine, based on the data type of each knowledge entity, a user-input knowledge representation algorithm that corresponds to the data type, and to perform knowledge representation on the knowledge entity with the corresponding algorithm to obtain the representation vector of the knowledge entity.
In a possible implementation, the AI task model is an AI model for disease diagnosis, and the multiple types of data include at least two of the following: medical record data, imaging examination reports, gene regulatory expression networks, and metabolic networks.
In a possible implementation, the processing unit 702 is further configured to train the AI task model based on the vector graph, to obtain a trained AI task model.
In a possible implementation, the processing unit 702 is specifically configured to update the weights in the vector graph.
In a possible implementation, the processing unit 702 is further configured to perform task prediction using the trained AI task model to obtain a prediction result, and, based on the updated vector graph, to mark the key knowledge entities and/or key associations in the knowledge graph corresponding to the task prediction, to obtain an interpretable knowledge graph.
In a possible implementation, the processing unit 702 is further configured to output the interpretable knowledge graph through a graphical user interface (GUI).
It should be understood that the division of the above apparatus into units is merely a division of logical functions. In an actual implementation, the units may be fully or partially integrated into one physical entity, or may be physically separate. The units in the apparatus may all be implemented as software invoked by a processing element, or all be implemented as hardware, or some units may be implemented as software invoked by a processing element while others are implemented as hardware. For example, each unit may be a separately provided processing element, or may be integrated into a chip of the apparatus. A unit may also be stored in a memory in the form of a program, and a processing element of the apparatus invokes the program and performs the function of the unit. In addition, all or some of these units may be integrated together or implemented independently. The processing element described here may also be referred to as a processor, and may be an integrated circuit with signal processing capability. In an implementation, the steps of the above method or the above units may be implemented by integrated logic circuits of hardware in the processor element, or as software invoked by the processing element.
It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of combined actions. However, those skilled in the art should understand that this application is not limited by the described order of actions. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by this application.
Other reasonable combinations of steps that those skilled in the art can conceive based on the above description also fall within the protection scope of this application.
Refer to FIG. 8, which is a schematic diagram of a computer device according to an embodiment of this application. As shown in FIG. 8, the computer device 800 includes a processor 810, a memory 820, and an interface 830, which are coupled through a bus (not labeled in the figure). The memory 820 stores instructions, and when the instructions in the memory 820 are executed, the computer device 800 performs the method performed by the computer device in the foregoing method embodiments.
The computer device 800 may be one or more integrated circuits configured to implement the above method, for example one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), one or more field-programmable gate arrays (FPGAs), or a combination of at least two of these forms of integrated circuit. As another example, when a unit in the apparatus is implemented as a program scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of invoking programs. As a further example, the units may be integrated together and implemented as a system-on-a-chip (SoC).
The processor 810 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 820 may include read-only memory and random access memory, and provides instructions and data to the processor 810. The memory 820 may also include non-volatile random access memory. For example, the memory 820 may be divided into multiple partitions, each used to store the private keys of a different software module.
The memory 820 may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
In addition to a data bus, the bus may include a power bus, a control bus, a status signal bus, and the like. The bus may be a Peripheral Component Interconnect Express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), Compute Express Link (CXL), cache coherent interconnect for accelerators (CCIX), or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on.
Another embodiment of this application further provides a computer-readable storage medium that stores computer-executable instructions. When a processor of a device executes the computer-executable instructions, the device performs the method performed by the computer device in the foregoing method embodiments.
Another embodiment of this application further provides a computer program product. The computer program product includes computer-executable instructions stored in a computer-readable storage medium. When a processor of a device executes the computer-executable instructions, the device performs the method performed by the computer device in the foregoing method embodiments.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division of logical functions, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through certain interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Claims (21)

  1. A data processing method, characterized in that the method comprises:
    obtaining multiple kinds of data, wherein the kinds of data have different sources and different data types;
    performing knowledge extraction on the multiple kinds of data to obtain a knowledge graph, wherein the knowledge graph comprises a plurality of knowledge entities and association relationships between the plurality of knowledge entities, and the plurality of knowledge entities comprise different data types; and
    performing knowledge representation on each knowledge entity by using a knowledge representation algorithm corresponding to the data type of that knowledge entity, and initializing weights for the association relationships between the plurality of knowledge entities in the knowledge graph, to obtain a vector graph, wherein the vector graph is used to train an artificial intelligence (AI) task model.
  2. The method according to claim 1, characterized in that the performing knowledge extraction on the multiple kinds of data to obtain a knowledge graph comprises:
    performing knowledge extraction on the multiple kinds of data based on different knowledge levels, to obtain the knowledge graph with multiple knowledge levels.
  3. The method according to claim 2, characterized in that association relationships exist between knowledge entities from different knowledge levels, and the association relationships are obtained from the multiple kinds of data, or the association relationships are obtained according to a preset rule.
  4. The method according to any one of claims 1 to 3, characterized in that the performing knowledge representation on each knowledge entity by using a knowledge representation algorithm corresponding to the data type of that knowledge entity comprises:
    determining, according to the data type of each knowledge entity and a preset relationship, a knowledge representation algorithm corresponding to the data type of the knowledge entity from a preset algorithm library, and performing knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm, to obtain a representation vector corresponding to the knowledge entity; or
    determining, according to the data type of each knowledge entity, a knowledge representation algorithm that is input by a user and corresponds to the data type, and performing knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm, to obtain a representation vector corresponding to the knowledge entity.
  5. The method according to any one of claims 1 to 4, characterized in that the AI task model is an AI model for disease diagnosis, and the multiple kinds of data comprise at least two of the following: medical record data, imaging examination reports, a gene regulatory expression network, and a metabolic network.
  6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
    training the AI task model according to the vector graph, to obtain a trained AI task model.
  7. The method according to claim 6, characterized in that the training the AI task model according to the vector graph comprises: updating the weights in the vector graph.
  8. The method according to claim 7, characterized in that the method further comprises: performing task prediction by using the trained AI task model, to obtain a prediction result; and
    identifying, based on the updated vector graph, key knowledge entities and/or key association relationships in the knowledge graph corresponding to the task prediction, to obtain an interpretable knowledge graph.
  9. The method according to claim 8, characterized in that the method further comprises:
    outputting the interpretable knowledge graph through a graphical user interface (GUI).
  10. A data processing apparatus, characterized in that the apparatus comprises an interface unit and a processing unit, wherein:
    the interface unit is configured to obtain multiple kinds of data, wherein the kinds of data have different sources and different data types;
    the processing unit is configured to perform knowledge extraction on the multiple kinds of data to obtain a knowledge graph, wherein the knowledge graph comprises a plurality of knowledge entities and association relationships between the plurality of knowledge entities, and the plurality of knowledge entities comprise different data types; and
    the processing unit is further configured to perform knowledge representation on each knowledge entity by using a knowledge representation algorithm corresponding to the data type of that knowledge entity, and to initialize weights for the association relationships between the plurality of knowledge entities in the knowledge graph, to obtain a vector graph, wherein the vector graph is used to train an artificial intelligence (AI) task model.
  11. The apparatus according to claim 10, characterized in that the processing unit is specifically configured to:
    perform knowledge extraction on the multiple kinds of data based on different knowledge levels, to obtain the knowledge graph with multiple knowledge levels.
  12. The apparatus according to claim 11, characterized in that association relationships exist between knowledge entities from different knowledge levels, and the association relationships are obtained from the multiple kinds of data, or the association relationships are obtained according to a preset rule.
  13. The apparatus according to any one of claims 10 to 12, characterized in that the processing unit is specifically configured to:
    determine, according to the data type of each knowledge entity and a preset relationship, a knowledge representation algorithm corresponding to the data type of the knowledge entity from a preset algorithm library, and perform knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm, to obtain a representation vector corresponding to the knowledge entity; or
    determine, according to the data type of each knowledge entity, a knowledge representation algorithm that is input by a user and corresponds to the data type, and perform knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm, to obtain a representation vector corresponding to the knowledge entity.
  14. The apparatus according to any one of claims 10 to 13, characterized in that the AI task model is an AI model for disease diagnosis, and the multiple kinds of data comprise at least two of the following: medical record data, imaging examination reports, a gene regulatory expression network, and a metabolic network.
  15. The apparatus according to any one of claims 10 to 14, characterized in that the processing unit is further configured to:
    train the AI task model according to the vector graph, to obtain a trained AI task model.
  16. The apparatus according to claim 15, characterized in that the processing unit is specifically configured to update the weights in the vector graph.
  17. The apparatus according to claim 16, characterized in that the processing unit is further configured to perform task prediction by using the trained AI task model, to obtain a prediction result; and
    identify, based on the updated vector graph, key knowledge entities and/or key association relationships in the knowledge graph corresponding to the task prediction, to obtain an interpretable knowledge graph.
  18. The apparatus according to claim 17, characterized in that the processing unit is further configured to:
    output the interpretable knowledge graph through a graphical user interface (GUI).
  19. A computer device, characterized in that the computer device comprises a processor coupled to a memory, wherein the memory is configured to store instructions, and when the instructions are executed by the processor, the computer device is caused to perform the method according to any one of claims 1 to 9.
  20. A computer-readable storage medium storing instructions, characterized in that when the instructions are executed, a computer is caused to perform the method according to any one of claims 1 to 9.
  21. A computer program product comprising instructions, characterized in that when the instructions are executed, a computer is caused to implement the method according to any one of claims 1 to 9.
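
For illustration only, the flow recited in claims 1, 4, 7 and 8 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the `Entity` class, the two toy representation functions, the `REPRESENTATION_ALGORITHMS` registry, the uniform initial weight of 1.0, and the rule of ranking relationships by absolute weight are all hypothetical choices made here, since the claims do not fix any of them. The training step is replaced by a hand-written weight change.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entity:
    """A knowledge entity with a declared data type (hypothetical structure)."""
    name: str
    data_type: str  # e.g. "text" or "numeric"
    value: object


def embed_text(value, dim=4):
    # Toy stand-in for a text representation algorithm:
    # a deterministic vector derived from a SHA-256 digest.
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def embed_numeric(value, dim=4):
    # Toy stand-in for a numeric representation algorithm:
    # the value itself, zero-padded to the shared dimension.
    return [float(value)] + [0.0] * (dim - 1)


# Assumed "preset algorithm library": data type -> representation algorithm.
REPRESENTATION_ALGORITHMS = {"text": embed_text, "numeric": embed_numeric}


def build_vector_graph(entities, relations,
                       algorithms=REPRESENTATION_ALGORITHMS,
                       initial_weight=1.0):
    """Represent each entity with the algorithm matching its data type and
    initialize one weight per association relationship."""
    vectors = {e.name: algorithms[e.data_type](e.value) for e in entities}
    weights = {(src, dst): initial_weight for src, dst in relations}
    return vectors, weights


def key_relations(weights, top_k=1):
    """Assumed identification rule: the relationships with the largest
    absolute weight after training are treated as the key ones."""
    ranked = sorted(weights, key=lambda edge: abs(weights[edge]), reverse=True)
    return ranked[:top_k]


entities = [
    Entity("symptom:cough", "text", "persistent cough"),
    Entity("lab:crp", "numeric", 12.5),
    Entity("disease:pneumonia", "text", "pneumonia"),
]
relations = [
    ("symptom:cough", "disease:pneumonia"),
    ("lab:crp", "disease:pneumonia"),
]

vectors, weights = build_vector_graph(entities, relations)

# Stand-in for training: a real AI task model would update these weights.
weights[("lab:crp", "disease:pneumonia")] = 2.3

print(key_relations(weights, top_k=1))
```

Passing a different dictionary as `algorithms` corresponds loosely to the second alternative of claim 4, in which the user supplies the representation algorithm for a given data type.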
PCT/CN2022/124247 2021-11-30 2022-10-10 Data processing method and data processing apparatus WO2023098291A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111453147.2A CN116205306A (en) 2021-11-30 2021-11-30 Data processing method and data processing device
CN202111453147.2 2021-11-30

Publications (1)

Publication Number Publication Date
WO2023098291A1 (en) 2023-06-08

Family

ID=86508272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124247 WO2023098291A1 (en) 2021-11-30 2022-10-10 Data processing method and data processing apparatus

Country Status (2)

Country Link
CN (1) CN116205306A (en)
WO (1) WO2023098291A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644062A (en) * 2017-08-29 2018-01-30 广州思涵信息科技有限公司 The knowledge content Weight Analysis System and method of a kind of knowledge based collection of illustrative plates
CN110379507A (en) * 2019-06-27 2019-10-25 南京市卫生信息中心 A kind of aided diagnosis method based on patient's vector-valued image
US20200118010A1 (en) * 2018-10-16 2020-04-16 Samsung Electronics Co., Ltd. System and method for providing content based on knowledge graph
CN111753098A (en) * 2020-06-23 2020-10-09 陕西师范大学 Teaching method and system based on cross-media dynamic knowledge graph

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117826771A (en) * 2024-03-05 2024-04-05 广东云湾科技有限公司 Cold rolling mill control system abnormality detection method and system based on AI analysis
CN117826771B (en) * 2024-03-05 2024-05-14 广东云湾科技有限公司 Cold rolling mill control system abnormality detection method and system based on AI analysis

Also Published As

Publication number Publication date
CN116205306A (en) 2023-06-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22900098; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2022900098; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2022900098; Country of ref document: EP; Effective date: 20240612)