WO2023098291A1

WO2023098291A1 - Data processing method and data processing apparatus

Info

Publication number: WO2023098291A1
Application number: PCT/CN2022/124247
Authority: WO
Inventors: 沈雯; 乔楠; 张雷; 陶建军
Original assignee: 华为云计算技术有限公司
Priority date: 2021-11-30
Filing date: 2022-10-10
Publication date: 2023-06-08
Also published as: CN116205306A

Abstract

Disclosed in embodiments of the present invention are a data processing method and a data processing apparatus, which are used for improving the prediction accuracy of an AI task model. The method provided by an embodiment of the present invention comprises: acquiring multiple types of data, the data being from different sources and having different data types; carrying out knowledge extraction on the multiple types of data to obtain a knowledge graph, the knowledge graph comprising a plurality of knowledge entities and relationships among the plurality of knowledge entities, and the plurality of knowledge entities having different data types; carrying out knowledge representation on each knowledge entity by using a knowledge representation algorithm corresponding to the data type of the knowledge entity, and initializing the weights of the relationships among the plurality of knowledge entities in the knowledge graph so as to obtain a vector graph, the vector graph being used for training an artificial intelligence (AI) task model.

Description

A data processing method and data processing device

This application claims the priority of a Chinese patent application with application number "202111453147.2" and application title "A Data Processing Method and Data Processing Device" filed with the China Patent Office on November 30, 2021, the entire contents of which are incorporated by reference in this application.

Technical field

The embodiments of the present application relate to the field of artificial intelligence, and in particular, to a data processing method and a data processing device.

Background

In recent years, artificial intelligence (AI) related technologies have been more and more widely used in various industries. Among them, deep learning technology is an AI technology based on deep neural network algorithms, which processes data by simulating the working mechanism of the human brain. At present, AI models (such as deep learning models) are often used to complete tasks in various application scenarios, and AI models can also be called AI task models.

In current AI technology, an AI model requires a large amount of sample data for training, and some current technical solutions use only sample data of a relatively single data type to train the AI model. For example, when AI technology is applied to a clinical decision support system (CDSS) in the medical field, the sample data used to train a deep-learning-based disease diagnosis model in the CDSS often comes only from electronic medical records (EMR), and the type of the sample data is the text in the EMR. Because the source and type of the sample data are single, the prediction accuracy of the disease diagnosis model is low, and the effect of assisting clinical decision-making is poor.

In some scenarios, the sample data used for AI model training can come from different data sources and have different data types. However, because sample data from different sources and of different data types cannot be represented well, the AI model cannot learn the characteristics of the sample data during training, resulting in low task prediction accuracy of the trained AI task model.

Therefore, how to represent sample data from different sources and of different data types, so that an AI task model trained on the represented data achieves higher task prediction accuracy, is a technical problem that urgently needs to be solved.

Summary of the invention

Embodiments of the present application provide a data processing method and a data processing device for improving the prediction accuracy of an AI task model.

A first aspect of the embodiments of the present application provides a data processing method. The method is executed by a computer device, or by a component of the computer device, such as a processor, a chip or a chip system of the computer device, or by a logic module or software that can implement all or part of the functions of the computer device. Taking execution by a computer device as an example, the data processing method includes the following. The computer device obtains multiple types of data, which have different data sources and different data types. The source of the data is related to the type of task to be trained and includes human-generated data or machine-generated data; the data type may be text, numeric, or image. The computer device performs knowledge extraction on the multiple types of data to obtain a knowledge graph. The knowledge graph includes multiple knowledge entities and the association relationships between the multiple knowledge entities. The knowledge entities include key elements extracted from the multiple types of data, and the multiple knowledge entities have different data types. The computer device performs knowledge representation on each knowledge entity by using the knowledge representation algorithm corresponding to the data type of that knowledge entity, and initializes the weights of the association relationships between the multiple knowledge entities in the knowledge graph to obtain a vector graph, which is used for training an artificial intelligence (AI) task model.

In the embodiments of the present application, the sample data used by the computer device to train the AI task model comes from multiple sources and is of multiple types. In addition, the computer device uses knowledge representation algorithms corresponding to the different data types to represent the abstract knowledge graph as a vector graph that the computer device can recognize. The computer device trains the AI task model based on the vector graph obtained from data of multiple sources and multiple types, which improves the prediction accuracy of the AI task model.

In a possible implementation, in the process of performing knowledge extraction on the multiple types of data to obtain the knowledge graph, the computer device performs the knowledge extraction based on different knowledge levels, so as to obtain a multi-knowledge-level knowledge graph. For example, when the computer device extracts knowledge from multiple types of medical data to obtain a knowledge graph in the medical field, it can perform knowledge extraction based on multiple knowledge levels such as the symptom level, the gene level, or the microbial level, so as to obtain a knowledge graph in which the multiple knowledge levels are associated.

The knowledge graph acquired by the computer device in the embodiments of the present application is a knowledge graph in which multiple knowledge levels are associated with one another, and the AI task model is trained based on this multi-knowledge-level knowledge graph. Because the knowledge graph involves multiple knowledge levels, the coverage of the knowledge graph is improved, which further improves the prediction accuracy of the AI task model.

In a possible implementation, association relationships exist between knowledge entities from different knowledge levels. An association relationship is obtained from the multiple types of data; for example, the computer device analyzes the semantic information of the data to obtain the association relationships between knowledge entities. Alternatively, the association relationship is obtained according to preset rules; for example, the computer device pre-stores knowledge association rules based on domain knowledge and establishes association relationships between knowledge entities at different levels based on these preset rules.

In the embodiments of the present application, the computer device obtains the associations contained in the multiple types of data themselves, and also establishes association relationships between knowledge entities of the same level or of different levels according to preset rules. Using multiple ways of obtaining associations fully mines the internal relationships between knowledge entities of different knowledge levels and increases the amount of data used to train the AI task model.

In a possible implementation, in the process of performing knowledge representation on each knowledge entity, the computer device determines, according to the data type of each knowledge entity and a preset relationship, the knowledge representation algorithm corresponding to that data type from a preset algorithm library. The computer device then performs knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm and obtains a representation vector corresponding to the knowledge entity. For example, when the data type of the knowledge entity is text, the computer device selects, from the preset algorithm library and according to the preset relationship between the text type and knowledge representation algorithms, a knowledge graph embedding (KGE) algorithm, a bidirectional encoder representations from transformers (BERT) algorithm, or a word vector (word2vec) algorithm.

In the embodiment of the present application, the computer device selects the corresponding knowledge representation algorithm from the preset algorithm library according to the data type of the knowledge entity, thereby improving the representation efficiency of the knowledge entity and the associated relationship.

In a possible implementation, the computer device determines, according to the data type of each knowledge entity, the knowledge representation algorithm that the user has input for that data type, performs knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm, and obtains the representation vector corresponding to the knowledge entity.

The knowledge representation algorithm in the embodiment of the present application may be a user-defined knowledge representation algorithm, thereby improving applicability to different knowledge representation algorithms.
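The selection of a knowledge representation algorithm by data type, whether from a preset library or supplied by the user, can be sketched as a lookup table. The following is only an illustrative sketch: the names `ALGORITHM_LIBRARY`, `register_algorithm` and `represent` are hypothetical, and the two toy encoders are placeholders that merely stand in for real algorithms such as KGE, BERT or word2vec.

```python
from typing import Callable, Dict, List

Vector = List[float]


def toy_text_encoder(value: object) -> Vector:
    # Placeholder only: a real system would call KGE, BERT, word2vec, etc.
    text = str(value)
    return [float(len(text)), float(text.count(" ") + 1)]


def toy_numeric_encoder(value: object) -> Vector:
    # Placeholder only: represent a number as a one-dimensional vector.
    return [float(value)]


# Hypothetical preset algorithm library: data type -> representation algorithm.
ALGORITHM_LIBRARY: Dict[str, Callable[[object], Vector]] = {
    "text": toy_text_encoder,
    "numeric": toy_numeric_encoder,
}


def register_algorithm(data_type: str, algorithm: Callable[[object], Vector]) -> None:
    """Add or override an entry, modelling a user-defined algorithm."""
    ALGORITHM_LIBRARY[data_type] = algorithm


def represent(value: object, data_type: str) -> Vector:
    """Look up the algorithm for the data type and return a representation vector."""
    if data_type not in ALGORITHM_LIBRARY:
        raise ValueError(f"no knowledge representation algorithm for {data_type!r}")
    return ALGORITHM_LIBRARY[data_type](value)
```

Keeping the mapping in a plain dictionary means that a user-defined algorithm is registered in the same way as a preset one, so the calling code does not need to distinguish the two cases.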

In a possible implementation, the AI task model is an AI model for disease diagnosis, and the various data include at least two of the following data: medical record data, imaging examination reports, gene regulatory expression networks, and metabolic networks.

The data processing method in the embodiments of the present application can be applied to the medical field. The trained AI task model can be an AI task model for disease diagnosis, and sample data from multiple sources can be used to train the disease diagnosis model, thereby improving the diagnostic accuracy of the disease diagnosis model.

In a possible implementation manner, the computer device trains the AI task model according to the vector graph to obtain the trained AI task model.

In the embodiment of the present application, the computer equipment trains the AI task model through the obtained vector graph represented by the knowledge graph, which improves the feasibility of training the AI task model.

In a possible implementation manner, during the computer device training the AI task model according to the vector graph, the computer device updates the weights in the vector graph.

In the embodiments of the present application, the computer device can continuously update the weights in the vector graph, thereby improving the accuracy of the trained AI task model.

In a possible implementation, the computer device uses the trained AI task model to perform task prediction to obtain a prediction result, and, based on the updated vector graph, identifies the key knowledge entities and/or key association relationships in the knowledge graph that correspond to the task prediction, so as to obtain an interpretable knowledge graph.

In the embodiment of the present application, the computer device can identify the key knowledge entities and/or key associations of the knowledge graph applied in the task prediction, which improves the interpretability of the model prediction results.

In a possible implementation manner, the computer device outputs an interpretable knowledge map through a graphical user interface (GUI).

In the embodiment of the present application, the computer device outputs the explainable knowledge map through the graphical user interface GUI, which improves the feasibility of the solution.

A second aspect of the embodiments of the present application provides a data processing device, where the data processing device includes an interface unit and a processing unit. Wherein, the interface unit is used to obtain multiple data, and various data in the multiple data have different sources and different data types. The processing unit is used to perform knowledge extraction on various data to obtain a knowledge map. The knowledge map includes multiple knowledge entities and the associations between the multiple knowledge entities. The multiple knowledge entities include different data types. The processing unit is also used to perform knowledge representation for each knowledge entity by using a knowledge representation algorithm corresponding to the data type of each knowledge entity, and initialize the weights of the relationships between multiple knowledge entities in the knowledge graph to obtain a vector graph, Vector graphs are used to train artificial intelligence AI task models.

In a possible implementation manner, the processing unit is specifically configured to perform knowledge extraction on various data based on different knowledge levels, and obtain a multi-knowledge level knowledge map.

In a possible implementation manner, knowledge entities from different knowledge levels include an association relationship, and the association relationship is obtained from various data, or the association relationship is obtained according to a preset rule.

In a possible implementation, the processing unit is specifically configured to determine the knowledge representation algorithm corresponding to the data type of the knowledge entity from the preset algorithm library according to the data type of each knowledge entity according to the preset relationship, and according to the corresponding knowledge representation The algorithm performs knowledge representation on the knowledge entity and obtains the representation vector corresponding to the knowledge entity.

In a possible implementation, the knowledge representation algorithm corresponding to the data type input by the user is determined according to the data type of each knowledge entity, the knowledge representation is performed on the knowledge entity according to the corresponding knowledge representation algorithm, and the representation vector corresponding to the knowledge entity is obtained.

In a possible implementation manner, the processing unit is further configured to train the AI task model according to the vector graph to obtain a trained AI task model.

In a possible implementation manner, the processing unit is specifically configured to update the weights in the vector map.

In a possible implementation, the processing unit is further configured to use the trained AI task model to perform task prediction to obtain a prediction result, and to identify, based on the updated vector graph, the key knowledge entities and/or key association relationships in the knowledge graph corresponding to the task prediction, so as to obtain an interpretable knowledge graph.

In a possible implementation manner, the processing unit is further configured to output an explainable knowledge map through a graphical user interface GUI.

A third aspect of the embodiments of the present application provides a computer device. The computer device includes a processor, the processor is coupled with a memory, and the memory is used to store instructions. When the instructions are executed by the processor, the computer device executes the method described in the first aspect or any possible implementation of the first aspect.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium on which instructions are stored. When the instructions are executed, a computer executes the method described in the first aspect or any possible implementation of the first aspect.

The fifth aspect of the embodiments of the present application is a computer program product. The computer program product includes instructions. When the instructions are executed, the computer implements the method described in the first aspect or any possible implementation manner of the first aspect.

It can be understood that the beneficial effects achieved by the data processing device, computer equipment, computer readable medium or computer program product provided above can refer to the beneficial effects in the corresponding method, and will not be repeated here.

Description of drawings

FIG. 1 is a schematic diagram of a system architecture of a data processing method provided in an embodiment of the present application;

FIG. 2 is a schematic flow diagram of a data processing method provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of a knowledge extraction provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of a knowledge representation provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of establishing an AI task model provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of a data processing effect provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a data processing device provided in an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a computer device provided by an embodiment of the present application.

Detailed description

The embodiment of the present application provides a data processing method and a data processing device, which are used to improve the accuracy of clinical decision-making.

The terms "first", "second", "third", "fourth", etc. (if any) in the description and claims of the present application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that the terms so used are interchangeable under appropriate circumstances, such that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or elements is not necessarily limited to the steps or elements expressly listed, but may include other steps or elements that are not expressly listed or that are inherent to the process, method, product or device.

In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, illustration or explanation. Any embodiment or design scheme described as "exemplary" or "for example" in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a concrete manner.

Hereinafter, some terms used in this application are explained to facilitate the understanding of those skilled in the art.

Deep learning (DL) is a machine learning technology based on deep neural network algorithms, and its main feature is the use of multiple nonlinear transformations to process and analyze data. For example, it is applied in scenarios such as image recognition, speech recognition, natural language processing, and the analysis of medical imaging data.

Graph deep learning (graph deep learning, GDL) is to apply various algorithms of deep learning to graph structure data, such as graph neural network or graph convolutional neural network. Graph convolutional network (GCN) is a kind of neural network method that realizes convolution on graph-structured data, for example, realizes convolution on graph-structured data by methods such as Laplace matrix or Fourier transform.
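As an illustration of convolution on graph-structured data, the following sketch implements one widely used propagation rule, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where A is the adjacency matrix, I adds self-loops, and D is the degree matrix of A + I. This particular rule is an assumption chosen for illustration; the text above does not say which formulation the application uses. The function names are hypothetical, and plain Python lists are used so that the example needs no third-party library.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two dense matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Return D^(-1/2) (A + I) D^(-1/2), with D the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [
        [a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def gcn_layer(adj: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One graph convolution layer followed by a ReLU activation."""
    propagated = matmul(matmul(normalized_adjacency(adj), features), weights)
    return [[max(0.0, x) for x in row] for row in propagated]


# Two connected nodes with one feature each; the layer averages their features.
output = gcn_layer([[0.0, 1.0], [1.0, 0.0]], [[1.0], [3.0]], [[1.0]])
```

Each node's new feature is a normalized sum over itself and its neighbours, which is why the two connected nodes in the usage example end up with the same value.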

Gene regulatory network (GRN) is a network of interactions between DNA and proteins. The activity of genes is regulated by transcription factors that bind to DNA, and most transcription factors bind to multiple binding sites in the genome. Therefore, all cells have complex gene regulatory networks. For example, the human genome encodes approximately 1400 transcription factors that regulate the expression of more than 20,000 human genes. Techniques for studying gene regulatory networks include binding site analysis by ChIP-chip or ChIP-seq.

Metabolic network (ME) is a network in which various chemical substances in living cells are connected by biochemical reactions. Biochemical reactions are catalyzed by enzymes to convert one chemical substance into another chemical substance. Thus, all chemicals in a cell are part of a complex network of biochemical reactions known as a metabolic network.

Electronic medical records (EMR) are electronic patient records based on computer systems.

Real world data (RWD) refers to data obtained from sources other than traditional clinical trials. Such sources include large-scale simple clinical trials, clinical trials in actual medical care, prospective observational studies or registry studies, database analysis, case reports, health management reports, and electronic health records.

The knowledge graph is a graph-based data structure consisting of nodes and edges, where each node is a knowledge entity and each edge is an association relationship between knowledge entities. Knowledge entities can be things in the real world, such as names, gender, or symptoms, etc., and association relations are used to express certain connections between different knowledge entities.
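The node-and-edge structure described above can be sketched as a small data class. This is a minimal illustration rather than the data model of the application: the class name `KnowledgeGraph`, its methods, and the relation labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KnowledgeGraph:
    # Node store: knowledge entity name -> data type (e.g. "text", "numeric").
    entities: Dict[str, str] = field(default_factory=dict)
    # Edge store: (head entity, relation label, tail entity).
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, name: str, data_type: str) -> None:
        self.entities[name] = data_type

    def add_relation(self, head: str, label: str, tail: str) -> None:
        # An edge may only connect entities that are already nodes of the graph.
        for name in (head, tail):
            if name not in self.entities:
                raise KeyError(f"unknown knowledge entity: {name}")
        self.relations.append((head, label, tail))

    def neighbors(self, name: str) -> List[str]:
        """Entities connected to `name` by any association, in insertion order."""
        linked = []
        for head, _, tail in self.relations:
            if head == name:
                linked.append(tail)
            elif tail == name:
                linked.append(head)
        return linked


graph = KnowledgeGraph()
graph.add_entity("patient", "text")
graph.add_entity("male", "text")
graph.add_entity("abdominal pain", "text")
graph.add_relation("patient", "has_gender", "male")
graph.add_relation("patient", "has_symptom", "abdominal pain")
```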

Some terms in the embodiments of the present application have been introduced above, and the data processing method and data processing device provided in the embodiments of the present application will be described below with reference to the accompanying drawings.

Please refer to FIG. 1 . FIG. 1 is a schematic diagram of a system architecture of a data processing method provided by an embodiment of the present application. As shown in Figure 1, the data processing system 10 includes a knowledge extraction module 101, a knowledge representation module 102, a knowledge modeling module 103 and an attention visualization module 104, wherein each module in the data processing system 10 can be called independently, and the data processing The system 10 can also flexibly expand other modules, which is not specifically limited.

It can be understood that each module of the data processing system 10 is a logical unit obtained by dividing the system by function, and the physical entity of the data processing system 10 can be a centralized or distributed computer device or server, or a component of a computer device or server, such as a processor, a chip or a system-on-a-chip of a computer device.

The data processing system 10 in the embodiment of the present application can integrate data based on relevant domain knowledge, train an AI task model, and use the AI task model to perform interpretable prediction of target events. The data processing system 10 is a general framework for heterogeneous data processing and target event prediction, which can be applied to various fields. The following medical field is used as an example to introduce various modules in the data processing system 10 .

The knowledge extraction module 101 can extract usable knowledge entities from data of various sources and types, and establish associations between knowledge entities at different knowledge levels based on data semantics or domain knowledge to form a knowledge graph. The knowledge graph contains knowledge entities and the association relationships between knowledge entities. Knowledge entities include element information extracted from the multiple types of data, and the associations between knowledge entities include links between the extracted element information. For example, in the medical field, the data sources for evaluating a patient's physical state can be electronic medical records, imaging data, a gene regulatory network, a protein metabolism network, and so on. The knowledge extraction module 101 can extract knowledge entities from these heterogeneous data of different sources and different data types, thereby forming a knowledge graph with connection characteristics, in which knowledge entities are the nodes of the knowledge graph and the associations between knowledge entities are its edges. In this way, the key information in the heterogeneous data can be extracted to the greatest extent.

The knowledge entity nodes in the above knowledge graph can be at different knowledge levels, and the association relationships between knowledge entity nodes can also cross knowledge levels. For example, a multi-knowledge-level knowledge graph includes a symptom representation layer, a gene sequencing data layer, a metabolic data layer, and so on, and the knowledge entities in the symptom representation layer may be associated with the knowledge entities in the gene sequencing data layer.

The knowledge representation module 102 is used to represent the above knowledge graph, including the knowledge entities in the knowledge graph and the associations between the knowledge entities, as a vector graph. It can be understood that the knowledge graph obtained by the knowledge extraction module 101 cannot be used directly for training the AI task model; the knowledge representation module 102 is required to represent the knowledge entities in the knowledge graph as data in vector form, and this data is then used to train the AI task model. The knowledge representation module 102 includes a node module for representing knowledge entities and an edge module for representing associations between knowledge entities. The node module and the edge module are each provided with multiple sub-modules, and different sub-modules are used to represent knowledge entities or association relationships of different data types.

The knowledge modeling module 103 is used to train a deep learning model based on the vector graph; training on different vector graphs yields deep learning models that support different downstream tasks. Deep learning models include the graph convolutional network (GCN), the graph attention network (GAT), or graph sample and aggregate (GraphSAGE), and the deep learning model can also be integrated into a Transformer structure. Downstream tasks include auxiliary diagnosis tasks, examination recommendation tasks, drug recommendation tasks, and so on.
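The step from knowledge graph to vector graph can be sketched as follows: each knowledge entity is encoded according to its data type, and each association receives an initial weight. The text does not specify how the weights are initialized, so uniform random initialization is an assumption made here, and the function name `build_vector_graph` and the lambda encoders are hypothetical stand-ins for the node and edge sub-modules.

```python
import random
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def build_vector_graph(
    entities: Dict[str, Tuple[str, object]],
    relations: List[Tuple[str, str]],
    encoders: Dict[str, Callable[[object], Vector]],
    seed: int = 0,
) -> Tuple[Dict[str, Vector], Dict[Tuple[str, str], float]]:
    """Return (node vectors, initial edge weights) for a knowledge graph."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    # Node side: choose the encoder that matches each entity's data type.
    node_vectors = {
        name: encoders[data_type](value)
        for name, (data_type, value) in entities.items()
    }
    # Edge side: ASSUMPTION, weights start as uniform random values in [0, 1].
    edge_weights = {(head, tail): rng.uniform(0.0, 1.0) for head, tail in relations}
    return node_vectors, edge_weights


entities = {
    "age": ("numeric", 42),
    "hyperglycemia": ("text", "hyperglycemia"),
}
relations = [("age", "hyperglycemia")]
encoders = {
    "numeric": lambda value: [float(value)],     # toy numeric encoder
    "text": lambda value: [float(len(value))],   # toy text encoder
}
node_vectors, edge_weights = build_vector_graph(entities, relations, encoders)
```

The edge weights produced here are only starting values; according to the text, they are updated while the AI task model is trained.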

The attention visualization module 104 uses the updated vector graph to identify key nodes and key edges in the knowledge graph, and perform visual display, so that the key nodes and edge information in the knowledge graph can be highlighted. The updated vector map is the vector map obtained after the training of the deep learning model is completed.
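One simple way to pick out key edges and key nodes from an updated vector graph is to rank the edges by the magnitude of their learned weights. This ranking criterion is an assumption made for illustration; the text does not state how the attention visualization module 104 selects them. The function names and the weight values below are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def key_edges(edge_weights: Dict[Edge, float], top_k: int) -> List[Edge]:
    """Return the top_k edges with the largest absolute weight."""
    ranked = sorted(edge_weights, key=lambda edge: abs(edge_weights[edge]), reverse=True)
    return ranked[:top_k]


def key_nodes(edges: List[Edge]) -> List[str]:
    """Return the nodes touched by the given edges, without duplicates."""
    seen: List[str] = []
    for head, tail in edges:
        for node in (head, tail):
            if node not in seen:
                seen.append(node)
    return seen


# Made-up weights standing in for a vector graph after training.
updated_weights = {
    ("hypotension", "PIK3R1"): 0.91,
    ("hypotension", "insomnia"): 0.12,
    ("Prevotella", "hypotension"): 0.77,
}
highlighted_edges = key_edges(updated_weights, top_k=2)
highlighted_nodes = key_nodes(highlighted_edges)
```

The resulting lists could then be passed to any graph drawing tool to highlight the selected nodes and edges.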

Please refer to FIG. 2 . FIG. 2 is a schematic flowchart of a data processing method provided by an embodiment of the present application. The method is applied to the data processing system shown in FIG. 1 . Taking computer equipment execution as an example, the data processing method includes the following steps:

201. Computer equipment acquires multiple types of data, including data from different data sources and data types.

The computer equipment obtains a variety of data, which are sample data for training AI task models. Various data acquired by computing devices have different data sources and data types, and the data sources are different according to the types of tasks. The specific multiple data can be data generated by humans or data generated by machines. Types of data include text, numbers, or images.

In an example of a medical scenario, a computer device acquires multiple types of medical data. These medical data can be real world data (RWD), and the various medical data have different data sources. For example, the sources of the medical data can be large-scale simple clinical trials, clinical trials in actual medical care, prospective observational studies, registry studies, retrospective database analysis, case reports, health management reports, medical record data, imaging examination reports, gene regulatory expression networks, metabolic networks, protein profiles or microbial profiles.

202. A computer device performs knowledge extraction on various data to obtain a knowledge map, and the knowledge map includes multiple knowledge entities and the relationship between knowledge entities.

After the computer device acquires various data, it extracts the knowledge entity and the association relationship between the knowledge entities based on the acquired various data, and the knowledge entity and the association relationship between the knowledge entities form a knowledge graph.

Specifically, in the process of knowledge extraction, the computer device classifies the extracted knowledge entities according to different knowledge levels and establishes association relationships between knowledge entities at different knowledge levels, thereby obtaining a multi-knowledge-level knowledge graph.

It should be noted that after the computer device extracts the knowledge graph, it needs to standardize the knowledge entities in the knowledge graph. For example, the knowledge entity extracted by computer equipment from electronic medical records is "stomach pain", and the standardized knowledge entity is "abdominal pain".
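The standardization step can be sketched as a lookup in a synonym table. This is a minimal illustration: the table `STANDARD_TERMS` and the function `standardize` are hypothetical, and the only mapping included is the one given in the text. A real system would presumably rely on a curated medical terminology, which the text does not name.

```python
from typing import Dict

# Hypothetical synonym table: extracted wording -> standardized knowledge entity.
STANDARD_TERMS: Dict[str, str] = {
    "stomach pain": "abdominal pain",
}


def standardize(entity: str) -> str:
    """Normalize case and whitespace, then map to the standard term if known."""
    key = " ".join(entity.lower().split())
    return STANDARD_TERMS.get(key, key)
```

For example, `standardize("Stomach  pain")` returns `"abdominal pain"`, while an entity with no entry in the table is returned in its normalized form.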

Please refer to FIG. 3 . FIG. 3 is a schematic diagram of a multi-level knowledge map in the medical field provided by an embodiment of the present application. In an example shown in FIG. 3 , the computer device performs knowledge extraction on various medical data, including electronic medical records, radiation information, gene information, protein information or microbial information. Computer equipment divides the extracted knowledge entities into genetic level, phenotypic level and metagenomic level according to different knowledge levels.

As shown in FIG. 3, knowledge entities at the gene level include, for example, the PTPN11 gene, the PIK3R1 gene, or the CDC42 gene. Knowledge entities at the phenotypic symptom level include, for example, frequent bowel movements, hypotension, or insomnia. Knowledge entities at the microbial level include, for example, Prevotella, Haldemannia, or Dorerella.

In the example shown in FIG. 3, after the computer device classifies the knowledge entities according to different knowledge levels, it establishes association relationships between knowledge entities based on domain knowledge, including association relationships between knowledge entities within the same knowledge level and association relationships between knowledge entities at different knowledge levels. As an example of an association within the same knowledge level, at the phenotypic symptom level, colon cancer is associated with frequent bowel movements, abdominal pain, and familial adenomatous polyposis (FAP). As an example of an association across knowledge levels, hypotension at the symptom level is associated with the PIK3R1 gene, the EGFR gene and the KRAS gene at the gene level.

In the embodiments of the present application, the computer device can establish association relationships between knowledge entities based on the multiple types of data, and can also establish them based on preset rules. Establishing association relationships based on the data means that the computer device analyzes the semantics of the data and mines the associations contained in the data themselves. For example, an electronic medical record states that "a 42-year-old male patient's symptoms are drinking too much water, high blood sugar and frequent urination". The knowledge entities that the computer device extracts from this record are age 42, gender male, the symptom of drinking too much water, the symptom of hyperglycemia, and the symptom of frequent urination, and the computer device establishes the association relationships between these knowledge entities based on semantics.
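The distinction between same-level and cross-level associations can be sketched by tagging each knowledge entity with its knowledge level. The entities and associations below are taken from the FIG. 3 example; the level labels, the dictionary `LEVEL_OF` and the function `split_by_scope` are hypothetical names used for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

# Knowledge level of each entity (level labels are illustrative).
LEVEL_OF: Dict[str, str] = {
    "colon cancer": "phenotype",
    "frequent bowel movements": "phenotype",
    "abdominal pain": "phenotype",
    "hypotension": "phenotype",
    "PIK3R1": "gene",
    "EGFR": "gene",
    "KRAS": "gene",
}

ASSOCIATIONS: List[Edge] = [
    ("colon cancer", "frequent bowel movements"),
    ("colon cancer", "abdominal pain"),
    ("hypotension", "PIK3R1"),
    ("hypotension", "EGFR"),
    ("hypotension", "KRAS"),
]


def split_by_scope(
    associations: List[Edge], level_of: Dict[str, str]
) -> Tuple[List[Edge], List[Edge]]:
    """Separate associations into (same-level, cross-level) lists."""
    same_level: List[Edge] = []
    cross_level: List[Edge] = []
    for head, tail in associations:
        if level_of[head] == level_of[tail]:
            same_level.append((head, tail))
        else:
            cross_level.append((head, tail))
    return same_level, cross_level


same_level, cross_level = split_by_scope(ASSOCIATIONS, LEVEL_OF)
```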

Establishing association relationships between knowledge entities based on preset rules means that the computer device establishes the associations according to rules formed from domain knowledge and experience. For example, the preset rule stored in the computer device is "Prevotella causes hypotension". When the knowledge entities extracted by the computer device include hypotension and Prevotella, the computer device establishes an association between Prevotella and hypotension.
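The rule-based case can be illustrated with a minimal sketch. It assumes that preset rules are stored as (head entity, relation, tail entity) triples and that a rule fires only when both of its entities were extracted. The names `PRESET_RULES` and `apply_preset_rules` are hypothetical and do not come from the application.

```python
# Hypothetical preset rules, stored as (head entity, relation, tail entity).
PRESET_RULES = [
    ("Prevotella", "causes", "hypotension"),
]


def apply_preset_rules(entities, rules=PRESET_RULES):
    """Return the rule triples whose head and tail were both extracted."""
    found = set(entities)
    return [(h, r, t) for (h, r, t) in rules if h in found and t in found]


# Both entities of the rule are present, so one association is established.
edges = apply_preset_rules(["hypotension", "Prevotella", "insomnia"])
```

A rule whose head or tail entity was not extracted produces no association, so the rule base can be larger than the set of entities found in any one record.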

In the embodiment of the present application, the knowledge entities extracted by the computer device also have multiple data types; for example, the type of a knowledge entity may be text or value. It is worth noting that, when the computer device performs knowledge extraction on the multiple types of data, some knowledge entities and association relationships are not extracted. The unextracted knowledge entities include hidden nodes that cannot be covered by domain knowledge, and the unextracted associations include hidden associations that cannot be covered by domain knowledge. Because the computer device cannot obtain these hidden nodes and hidden associations from data semantics or domain knowledge during knowledge extraction, it establishes virtual knowledge nodes and virtual associations for them during knowledge representation. That is, the vector graph obtained through representation contains virtual knowledge nodes and virtual associations that are not reflected in the knowledge graph.

For example, the knowledge graph obtained by the computer device contains two knowledge entities, "headache" and "cough", with no association relationship between them. During knowledge representation, the computer device can add a virtual knowledge node, such as "influencing factor 1", and add "hidden association 1" between "influencing factor 1" and "headache" and "hidden association 2" between "influencing factor 1" and "cough". These virtual knowledge nodes and virtual associations do not exist in the extracted knowledge graph, but are reflected in the nodes and weights of the vector graph after representation.
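The headache and cough example can be sketched as follows. This is only an illustration under stated assumptions: the graph is held as two plain dictionaries, the function name `add_virtual_node` is hypothetical, and the initial weight of 0.5 is an arbitrary placeholder, since the application does not specify how virtual associations are initialized.

```python
def add_virtual_node(nodes, edges, name, targets, init_weight=0.5):
    """Return copies of nodes/edges extended with one virtual node.

    nodes: dict mapping node name -> node type
    edges: dict mapping (source, target) -> weight
    init_weight: assumed initial weight of each virtual association
    """
    new_nodes = dict(nodes)
    new_edges = dict(edges)
    new_nodes[name] = "virtual"
    for target in targets:
        if target not in nodes:
            raise KeyError(f"unknown target node: {target}")
        new_edges[(name, target)] = init_weight
    return new_nodes, new_edges


# Extracted knowledge graph: two entities, no association between them.
kg_nodes = {"headache": "text", "cough": "text"}
kg_edges = {}

# The represented graph gains one virtual node and two virtual associations.
vec_nodes, vec_edges = add_virtual_node(
    kg_nodes, kg_edges, "influencing factor 1", ["headache", "cough"]
)
```

The inputs are copied rather than modified, which mirrors the statement that the virtual elements appear only in the vector graph and not in the extracted knowledge graph.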

203. The computer device performs knowledge representation for each knowledge entity based on the knowledge representation algorithm, and initializes the weights of the relationships among multiple knowledge entities in the knowledge graph to obtain a vector graph.

The computer device performs knowledge representation on each knowledge entity based on a knowledge representation algorithm, and represents the association relationships between the knowledge entities, so as to obtain the vector graph corresponding to the knowledge graph.

Specifically, when representing a knowledge entity, the computer device selects the knowledge representation algorithm corresponding to the data type of the knowledge entity and uses that algorithm to obtain a representation vector of the knowledge entity. The computer device also represents each association relationship to obtain a representation vector of the association relationship. The representation vectors of the knowledge entities and the representation vectors of the association relationships constitute the vector graph, and the vector graph contains the initialized weights between the multiple knowledge entities.

Please refer to FIG. 4, which is a schematic diagram of knowledge representation based on different data types provided by the embodiment of this application. As shown in FIG. 4, knowledge entities are divided according to data type, and the types of knowledge entities include text nodes, value nodes, virtual nodes and other nodes. The knowledge representation algorithm corresponding to a text node is, for example, the knowledge graph embedding (KGE) algorithm, the bidirectional encoder representations from transformers (BERT) algorithm or the word vector (word2vec) algorithm. For example, the computer device obtains the representation vector of a text node through the knowledge graph embedding algorithm. Specifically, the computer device can first apply the knowledge graph embedding algorithm to an external-source knowledge graph to learn representation vectors for its knowledge entities, and then match the text nodes in the knowledge graph against the knowledge entities in the external-source knowledge graph to obtain the representation vectors of the text nodes. For another example, the computer device can also use a model pre-trained with the BERT algorithm in the medical field to obtain the representation vector of a text node.

As shown in FIG. 4, the knowledge representation algorithm corresponding to a value node is, for example, the multilayer perceptron (MLP) algorithm. For example, the computer device uses an MLP model to classify and encode value nodes such as height, weight, age or examination values, and maps them to representation vectors that capture the meaning of high and low values in the data.

As shown in FIG. 4, for virtual knowledge nodes, the computer device obtains representation vectors based on an aggregated embedding algorithm, and for other nodes, the computer device obtains representation vectors based on a random embedding algorithm. In the example shown in FIG. 4, for the association relationships between knowledge entities, the computer device obtains edge representation vectors through an edge embedding algorithm.

The knowledge representation algorithm in the embodiment of the present application may be a knowledge representation algorithm in a preset algorithm library or a knowledge representation algorithm input by a user, which is not specifically limited. In the preset algorithm library, there is a preset one-to-one or many-to-one relationship between knowledge representation algorithms and data types, and the preset data types include text or value. A knowledge representation algorithm input by the user supplements the knowledge representation algorithms in the preset algorithm library.
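The selection of a representation algorithm by data type can be sketched as a lookup table. Everything below is an assumption made for illustration: the three embedding functions are trivial stand-ins (a hash-based vector in place of KGE, BERT or word2vec, a repeated number in place of an MLP), and the names `PRESET_LIBRARY`, `represent` and `DIM` are hypothetical.

```python
import hashlib
import random

DIM = 4  # assumed embedding size, chosen only for illustration


def embed_text(value):
    """Stand-in for KGE/BERT/word2vec: a deterministic hash-based vector."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def embed_value(value):
    """Stand-in for an MLP: repeat the raw number across the vector."""
    return [float(value)] * DIM


def embed_random(value):
    """Random embedding for types with no dedicated algorithm (seeded by value)."""
    rng = random.Random(str(value))
    return [rng.random() for _ in range(DIM)]


# Preset algorithm library: data type -> representation algorithm.
PRESET_LIBRARY = {"text": embed_text, "value": embed_value}


def represent(entity, data_type, user_algorithms=None):
    """Pick the algorithm for data_type and return the entity's vector."""
    library = dict(PRESET_LIBRARY)
    # User-supplied algorithms supplement the preset library.
    library.update(user_algorithms or {})
    algorithm = library.get(data_type, embed_random)
    return algorithm(entity)


text_vec = represent("hypotension", "text")
value_vec = represent(42, "value")
custom_vec = represent("scan-001", "image", {"image": lambda v: [0.0] * DIM})
```

In this sketch a user-supplied entry also overrides a preset entry for the same data type. That is a design choice of the example, not something the application states.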

The above knowledge representation process is executed by the knowledge representation module in the computer device. The knowledge representation module can be flexibly decoupled in the computer device, adjusted or customized according to the characteristics of the field, and has scalability and interactivity. It can be understood that the knowledge representation module has built-in different representation sub-modules, and the representation sub-modules are used to represent knowledge entities and association relationships of different data types.

204. The computer equipment trains the AI task model according to the vector graph.

The computer trains the AI task model based on the vector graph, and the trained AI task model can be used to perform various downstream tasks. Taking the medical field as an example, downstream tasks include medical consultation, drug recommendation, diagnostic decision support or treatment decision support, etc.

Specifically, the computer device iteratively trains the AI task model based on the multiple vector graphs obtained through the aforementioned steps S201-S203 until the deviation between the training output of the AI task model and the target output meets the requirement, at which point training of the AI task model is complete. At the same time, the computer device obtains a dynamically updated vector graph from the training process, in which the nodes and weights have been updated. When the trained AI task model is used to perform task prediction (for example, to diagnose patient A's disease based on various data of patient A), the computer device represents the data to be predicted according to steps S201-S203, and obtains the prediction result from the resulting vector graph and the trained AI task model.

The computer device also identifies, based on the vector graph updated during the above training, the key knowledge entities and/or key associations in the knowledge graph corresponding to the task prediction, obtains an interpretable knowledge graph, and outputs the interpretable knowledge graph through a graphical user interface. Specifically, after the computer device acquires the updated vector graph, it determines, based on the weights of the edges between nodes in the updated vector graph, the association relationships in the knowledge graph that correspond to edges whose weight exceeds a preset threshold, and marks these association relationships and the knowledge entities they connect in the knowledge graph.
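The threshold step can be sketched as a simple filter over edge weights. The function name `key_associations`, the weight values and the threshold of 0.5 are all made up for illustration; the application only states that edges whose weight exceeds a preset threshold are selected.

```python
def key_associations(edge_weights, threshold):
    """Return edges whose weight exceeds threshold and the nodes they connect."""
    key_edges = {e: w for e, w in edge_weights.items() if w > threshold}
    key_nodes = {node for edge in key_edges for node in edge}
    return key_edges, key_nodes


# Illustrative edge weights from an updated vector graph (values are invented).
updated_weights = {
    ("colon cancer", "abdominal pain"): 0.82,
    ("colon cancer", "insomnia"): 0.07,
    ("hypotension", "PIK3R1"): 0.64,
}

edges, nodes = key_associations(updated_weights, threshold=0.5)
```

The returned edges and nodes are the ones that would be highlighted in the interpretable knowledge graph.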

Please refer to FIG. 5, which is a schematic diagram of an interpretable knowledge graph provided by an embodiment of the present application. The knowledge graph shown in FIG. 5 corresponds to a disease diagnosis task. Based on the vector graph obtained after training of the AI task model is complete, the computer device marks the key nodes and key edges in the knowledge graph corresponding to the task, and obtains the interpretable knowledge graph. The interpretable knowledge graph reflects the contribution of the key nodes and key edges to the AI task model. The marking can be displayed visually with different colors and weight values according to the degree of contribution, so that the key node and edge information in the graph is highlighted. For example, in the example shown in FIG. 5, the nodes and edges in bold are the key nodes and edges corresponding to the disease diagnosis task.

In the embodiment of the present application, the algorithms that the computer device uses to train the AI task model include the graph convolutional network (GCN), the graph attention network (GAT) or graph sample and aggregate (GraphSAGE).
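As a rough illustration of the first of these algorithms, the sketch below implements one GCN propagation step, ReLU(D^-1/2 (A + I) D^-1/2 H W), in plain Python. It assumes that the adjacency matrix `adj` holds the edge weights of the vector graph and that `features` holds the node representation vectors. The function name `gcn_layer` is hypothetical, and the application does not disclose the network architecture it uses.

```python
import math


def gcn_layer(adj, features, weights):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so that each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by node degree.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbor features: norm @ features.
    f_in = len(features[0])
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n))
            for f in range(f_in)] for i in range(n)]
    # Linear transform followed by ReLU: max(0, agg @ weights).
    f_out = len(weights[0])
    return [[max(0.0, sum(agg[i][f] * weights[f][o] for f in range(f_in)))
             for o in range(f_out)] for i in range(n)]


# Two connected nodes with one-hot features and an identity weight matrix.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(adj, features, weights)
```

In a trained model the weight matrix would be learned by backpropagation, and several such layers would be stacked before a task-specific output layer.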

Please refer to FIG. 6, which is a schematic diagram of F1 scores of the data processing method provided by an embodiment of the present application. FIG. 6 compares F1 scores in a disease classification task. The F1 score is the harmonic mean of the precision rate and the recall rate, where the precision rate indicates how many of the samples predicted to be positive are truly positive, and the recall rate indicates how many of the positive samples are predicted correctly. The F1 score is used to evaluate the classification accuracy of the disease classification task: the higher the F1 score, the more accurate the disease classification. It can be seen from FIG. 6 that, compared with the disease classification AI task model trained with the direct text representation method based on the BERT model, the disease classification AI task model trained based on the multi-dimensional graph embedding representation algorithm improves the F1 score by up to 5.7%.
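The F1 definition above can be made concrete with a short sketch for the binary case, where F1 = 2PR / (P + R), with P the precision rate and R the recall rate. The function name `f1_score` and the label lists are illustrative only and are unrelated to the data behind FIG. 6.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
```

Here precision and recall are both 2/3, so the F1 score is also 2/3. For a multi-class disease classification task, per-class scores computed this way are typically averaged, and the application does not say which averaging it uses.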

In the embodiment of the present application, the sample data that the computer device uses to train the AI task model comes from multiple sources and has multiple data types, and the AI task model is trained based on the knowledge graph extracted from this multi-source, multi-type data, which improves the prediction accuracy of the AI task model. At the same time, the extracted knowledge graph includes knowledge entities and association relationships at multiple knowledge levels, which further improves the prediction accuracy of the AI task model.

A data processing method provided by the embodiment of the present application is described above, and the data processing device involved in the embodiment of the present application is described below with reference to the accompanying drawings.

Please refer to FIG. 7 . FIG. 7 is a schematic structural diagram of a data processing device provided by an embodiment of the present application. The apparatus is used to implement each step of the corresponding equipment in the foregoing embodiments. As shown in FIG. 7 , the data processing apparatus 700 includes an interface unit 701 and a processing unit 702 .

The interface unit 701 is configured to obtain multiple types of data, and the multiple types of data have different sources and different data types. The processing unit 702 is configured to perform knowledge extraction on the multiple types of data to obtain a knowledge graph. The knowledge graph includes a plurality of knowledge entities and association relationships among the plurality of knowledge entities, and the plurality of knowledge entities have different data types. The processing unit 702 is further configured to perform knowledge representation on each knowledge entity using the knowledge representation algorithm corresponding to the data type of the knowledge entity, and to initialize the weights of the association relationships among the plurality of knowledge entities in the knowledge graph, so as to obtain a vector graph. The vector graph is used to train an artificial intelligence (AI) task model.

In a possible implementation manner, the processing unit 702 is specifically configured to perform knowledge extraction on the multiple types of data based on different knowledge levels to obtain a knowledge graph with multiple knowledge levels.

In a possible implementation manner, the processing unit 702 is specifically configured to determine, according to the data type of each knowledge entity and the preset relationship, the knowledge representation algorithm corresponding to that data type from the preset algorithm library, and to perform knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm to obtain the representation vector corresponding to the knowledge entity.

In a possible implementation manner, the processing unit 702 is further configured to train the AI task model according to the vector graph to obtain a trained AI task model.

In a possible implementation manner, the processing unit 702 is specifically configured to update the weights in the vector graph.

In a possible implementation manner, the processing unit 702 is further configured to use the trained AI task model to perform task prediction to obtain a prediction result, and to identify, based on the updated vector graph, the key knowledge entities and/or key associations in the knowledge graph corresponding to the task prediction, so as to obtain an interpretable knowledge graph.

In a possible implementation manner, the processing unit 702 is further configured to output the interpretable knowledge graph through a graphical user interface (GUI).

It should be understood that the division of units in the above device is only a division of logical functions. In actual implementation, the units may be fully or partially integrated into one physical entity or may be physically separated. The units in the device can all be implemented in the form of software called by a processing element, or all be implemented in the form of hardware; alternatively, some units can be implemented in the form of software called by a processing element and other units in the form of hardware. For example, each unit can be a separately established processing element, or it can be integrated in a chip of the device. In addition, a unit can also be stored in the memory in the form of a program, and a processing element of the device calls the program and executes the function of the unit. In addition, all or part of these units can be integrated together or implemented independently. The processing element mentioned here may also be called a processor, which may be an integrated circuit with signal processing capabilities. In the process of implementation, each step of the above method or each of the above units may be implemented by an integrated logic circuit of hardware in the processing element or in the form of software called by the processing element.

It is worth noting that, for the sake of simple description, the above method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present application is not limited by the described action sequence. Secondly, those skilled in the art should also know that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the present application.

Other reasonable step combinations conceivable by those skilled in the art based on the above description also fall within the protection scope of the present application.

Please refer to FIG. 8, which is a schematic diagram of a computer device provided by an embodiment of the present application. As shown in FIG. 8, the computer device 800 includes a processor 810, a memory 820 and an interface 830, and the processor 810, the memory 820 and the interface 830 are coupled through a bus (not marked in the figure). The memory 820 stores instructions, and when the instructions in the memory 820 are executed, the computer device 800 executes the method executed by the computer device in the above method embodiments.

The computer device 800 may be one or more integrated circuits configured to implement the above method, for example: one or more application-specific integrated circuits (application specific integrated circuit, ASIC), or one or more digital signal processors (digital signal processor, DSP), or one or more field programmable gate arrays (field programmable gate array, FPGA), or a combination of at least two of these integrated circuit forms. For another example, when the units in the device are implemented in the form of a program scheduled by a processing element, the processing element can be a general-purpose processor, such as a central processing unit (central processing unit, CPU) or another processor that can call programs. For another example, these units can be integrated together and implemented in the form of a system-on-a-chip (SOC).

Processor 810 may be a central processing unit (central processing unit, CPU), or another general-purpose processor, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or another programmable logic device, a transistor logic device, a hardware component or any combination thereof. A general-purpose processor can be a microprocessor or any conventional processor.

Memory 820 may include read-only memory and random-access memory, and provides instructions and data to processor 810 . Memory 820 may also include non-volatile random access memory. For example, the memory 820 may be provided with multiple partitions, each of which is used to store private keys of different software modules.

Memory 820 can be volatile memory or non-volatile memory, or can include both volatile and non-volatile memory. The non-volatile memory can be read-only memory (read-only memory, ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM) or flash memory. The volatile memory can be random access memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM) and direct rambus random access memory (direct rambus RAM, DR RAM).

In addition to a data bus, the bus may also include a power bus, a control bus and a status signal bus. The bus can be a peripheral component interconnect express (Peripheral Component Interconnect Express, PCIe) bus, an extended industry standard architecture (extended industry standard architecture, EISA) bus, a unified bus (unified bus, Ubus or UB), a compute express link (compute express link, CXL), a cache coherent interconnect for accelerators (CCIX), etc. The bus can be divided into an address bus, a data bus, a control bus and so on.

In another embodiment of the present application, a computer-readable storage medium is also provided, and computer-executable instructions are stored in the computer-readable storage medium. When the processor of a device executes the computer-executable instructions, the device executes the method performed by the computer device in the foregoing method embodiments.

In another embodiment of the present application, a computer program product is also provided, the computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium. When the processor of the device executes the computer-executed instructions, the device executes the method performed by the computer device in the foregoing method embodiments.

Those skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the above-described system, device and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and there may be other division methods in actual implementation. For example, multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM, read-only memory), random access memory (RAM, random access memory), a magnetic disk or an optical disc.

Claims

1. A data processing method, characterized by comprising:

obtaining multiple types of data, the multiple types of data having different sources and different data types;

performing knowledge extraction on the multiple types of data to obtain a knowledge graph, the knowledge graph comprising a plurality of knowledge entities and association relationships among the plurality of knowledge entities, and the plurality of knowledge entities having different data types;

performing knowledge representation on each knowledge entity using a knowledge representation algorithm corresponding to the data type of the knowledge entity, and initializing weights of the association relationships among the plurality of knowledge entities in the knowledge graph to obtain a vector graph, the vector graph being used to train an artificial intelligence (AI) task model.

2. The method according to claim 1, wherein the performing knowledge extraction on the multiple types of data to obtain a knowledge graph comprises:

performing knowledge extraction on the multiple types of data based on different knowledge levels to obtain the knowledge graph with multiple knowledge levels.

3. The method according to claim 2, wherein knowledge entities from different knowledge levels have an association relationship, and the association relationship is obtained from the multiple types of data or is obtained based on a preset rule.

4. The method according to any one of claims 1-3, wherein the performing knowledge representation on each knowledge entity using a knowledge representation algorithm corresponding to the data type of the knowledge entity comprises:

determining, according to the data type of each knowledge entity and a preset relationship, the knowledge representation algorithm corresponding to the data type of the knowledge entity from a preset algorithm library, and performing knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm to obtain a representation vector corresponding to the knowledge entity; or,

determining, according to the data type of each knowledge entity, the knowledge representation algorithm that is input by a user and corresponds to the data type, and performing knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm to obtain the representation vector corresponding to the knowledge entity.

5. The method according to any one of claims 1-4, wherein the AI task model is an AI model for disease diagnosis, and the multiple types of data include at least two of the following: medical record data, an imaging examination report, a gene regulatory expression network and a metabolic network.

6. The method according to any one of claims 1-5, wherein the method further comprises:

training the AI task model according to the vector graph to obtain a trained AI task model.

7. The method according to claim 6, wherein the training the AI task model according to the vector graph comprises: updating weights in the vector graph.

8. The method according to claim 7, wherein the method further comprises:

using the trained AI task model to perform task prediction to obtain a prediction result;

identifying, based on the updated vector graph, key knowledge entities and/or key associations in the knowledge graph corresponding to the task prediction to obtain an interpretable knowledge graph.

9. The method according to claim 8, wherein the method further comprises:

outputting the interpretable knowledge graph through a graphical user interface (GUI).

10. A data processing device, characterized by comprising an interface unit and a processing unit;

the interface unit is configured to obtain multiple types of data, the multiple types of data having different sources and different data types;

the processing unit is configured to perform knowledge extraction on the multiple types of data to obtain a knowledge graph, the knowledge graph comprising a plurality of knowledge entities and association relationships among the plurality of knowledge entities, and the plurality of knowledge entities having different data types;

the processing unit is further configured to perform knowledge representation on each knowledge entity using a knowledge representation algorithm corresponding to the data type of the knowledge entity, and to initialize weights of the association relationships among the plurality of knowledge entities in the knowledge graph to obtain a vector graph, the vector graph being used to train an artificial intelligence (AI) task model.

11. The device according to claim 10, wherein the processing unit is specifically configured to:

perform knowledge extraction on the multiple types of data based on different knowledge levels to obtain the knowledge graph with multiple knowledge levels.

12. The device according to claim 11, wherein knowledge entities from different knowledge levels have an association relationship, and the association relationship is obtained from the multiple types of data or is obtained based on a preset rule.

13. The device according to any one of claims 10-12, wherein the processing unit is specifically configured to:

determine, according to the data type of each knowledge entity and a preset relationship, the knowledge representation algorithm corresponding to the data type of the knowledge entity from a preset algorithm library, and perform knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm to obtain a representation vector corresponding to the knowledge entity; or,

determine, according to the data type of each knowledge entity, the knowledge representation algorithm that is input by a user and corresponds to the data type, and perform knowledge representation on the knowledge entity according to the corresponding knowledge representation algorithm to obtain the representation vector corresponding to the knowledge entity.

14. The device according to any one of claims 10-13, wherein the AI task model is an AI model for disease diagnosis, and the multiple types of data include at least two of the following: medical record data, an imaging examination report, a gene regulatory expression network and a metabolic network.

15. The device according to any one of claims 10-14, wherein the processing unit is further configured to:

train the AI task model according to the vector graph to obtain a trained AI task model.

16. The device according to claim 15, wherein the processing unit is specifically configured to update weights in the vector graph.

17. The device according to claim 16, wherein the processing unit is further configured to:

use the trained AI task model to perform task prediction to obtain a prediction result;

identify, based on the updated vector graph, key knowledge entities and/or key associations in the knowledge graph corresponding to the task prediction to obtain an interpretable knowledge graph.

18. The device according to claim 17, wherein the processing unit is further configured to:

output the interpretable knowledge graph through a graphical user interface (GUI).

19. A computer device, characterized by comprising a processor, wherein the processor is coupled with a memory, the memory is configured to store instructions, and when the instructions are executed by the processor, the computer device performs the method according to any one of claims 1 to 9.

20. A computer-readable storage medium on which instructions are stored, wherein when the instructions are executed, a computer executes the method according to any one of claims 1 to 9.

21. A computer program product, wherein the computer program product includes instructions, and when the instructions are executed, a computer implements the method according to any one of claims 1 to 9.