CN104915717B - Data processing method, knowledge base reasoning method and related device - Google Patents


Info

Publication number
CN104915717B
CN104915717B
Authority
CN
China
Prior art keywords
data
knowledge base
key
rule
inference
Prior art date
Legal status
Active
Application number
CN201510295748.3A
Other languages
Chinese (zh)
Other versions
CN104915717A (en)
Inventor
伍海江
郭玉箐
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510295748.3A
Publication of CN104915717A
Application granted
Publication of CN104915717B
Legal status: Active

Abstract

The invention discloses a data processing method, a knowledge base reasoning method and a related device. The data processing method obtains, in the following way, a rule dependency relationship used to determine the scheduling order of each rule during knowledge base reasoning: data are selected from the knowledge base to form a first data set; the first data set is reasoned over according to the rules of the knowledge base; and the rule dependency relationship is determined according to the reasoning result of each rule and the corresponding input data. The invention broadens the applicable scenarios and improves the flexibility of knowledge base reasoning technology, and improves rule execution efficiency and overall reasoning performance.

Description

Data processing method, knowledge base reasoning method and related device
Technical Field
The invention relates to the field of data processing, in particular to a data processing method applied to knowledge base reasoning, a knowledge base reasoning method and a related device.
Background
Knowledge in a knowledge base is divided into two categories: ontology and fact. Both are represented as triples (subject, predicate, object), abbreviated as (s, p, o). A fact is a basic description of the real world or a virtual world; for example, "the wife of Yao Ming is Ye Li" can be represented by the triple (Yao Ming, wife, Ye Li). An ontology is an abstraction over facts, and includes concepts, properties, relationships between concepts, relationships between properties, and constraints on concepts and properties: for example, defining the concepts "person" and "athlete", defining the attributes "wife" and "husband", defining a relationship between concepts (athlete, subClassOf, person), and defining a relationship between attributes (wife, inverseOf, husband).
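As a purely illustrative sketch (not part of the patent text), the facts and ontology statements above can be written down directly as triples; all identifiers below are example names:

```python
# A minimal illustration of the triple representation described above.
# All identifiers are example names, not prescribed by the patent.

# Facts: basic descriptions of the real or a virtual world.
facts = [
    ("YaoMing", "wife", "YeLi"),         # "the wife of Yao Ming is Ye Li"
    ("YaoMing", "rdf:type", "Athlete"),  # Yao Ming is an athlete
]

# Ontology: abstractions over the facts.
ontology = [
    ("Athlete", "rdfs:subClassOf", "Person"),  # relationship between concepts
    ("wife", "owl:inverseOf", "husband"),      # relationship between attributes
]

print(facts + ontology)
```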
Knowledge base reasoning refers to the automatic generation, by a program, of new facts that are not explicitly expressed in the knowledge base, based on the knowledge already in the knowledge base and on the semantics or rules of the knowledge description language. In fig. 1, concepts are represented by shaded ellipses, instances by transparent ellipses and attributes by arrows, and new facts are inferred from the existing triples according to the inference rules.
Existing knowledge base reasoning methods are generally based on the distributed programming framework Hadoop. The main steps are as follows: in the Map (mapping) stage, the data are divided among different Map nodes, the Map function implemented by the Hadoop user is run on each node, and key-value pairs are output; the outputs of the Map functions are then sorted and merged by key, and all key-value pairs with the same key are transmitted over the local area network to the same computing node. In the Reduce stage, the nodes receiving the data execute the Reduce function implemented by the Hadoop user and write the results to the hard disk. One specific implementation is shown in fig. 2.
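The following single-machine Python sketch imitates this Map → sort/merge-by-key → Reduce flow; it assumes nothing about Hadoop's actual API, and all function and variable names are illustrative:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-machine imitation of the Map, shuffle and Reduce stages."""
    # Map stage: each input record is turned into (key, value) pairs.
    mapped = []
    for record in records:
        mapped.extend(map_fn(record))

    # Shuffle: sort/merge by key so that equal keys reach the same reducer.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce stage: one call per key with all of that key's values.
    output = []
    for key, values in sorted(groups.items()):
        output.extend(reduce_fn(key, values))
    return output

# Example: count triples per predicate.
triples = [("YaoMing", "wife", "YeLi"), ("Athlete", "rdfs:subClassOf", "Person")]
print(run_mapreduce(
    triples,
    map_fn=lambda t: [(t[1], t)],            # key = predicate
    reduce_fn=lambda k, vs: [(k, len(vs))],  # emit (predicate, count)
))
```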
Existing Hadoop-based knowledge base inference techniques, for example inference engines oriented to the RDFS and OWL rule sets with limited expressive power, have all of their rules hard-coded into the inference engine. Such knowledge base inference techniques cannot extend the inference semantics as needed, are also fixed in terms of rule scheduling, and cannot optimize the scheduling order of the rules when the rules or rule set change. Therefore, the existing knowledge base reasoning technology has the disadvantages of a small application range, poor flexibility and low reasoning efficiency.
In addition, existing knowledge base reasoning technology is prone to problems such as insufficient memory and high hard disk overhead, which further reduce reasoning efficiency.
Disclosure of Invention
In order to overcome the defects of the conventional knowledge base reasoning technology, the embodiments of the invention provide a data processing method, a knowledge base reasoning method and a related device, which broaden the applicable scenarios and improve the flexibility of knowledge base reasoning technology, and improve rule execution efficiency and overall reasoning performance.
In a first aspect, an embodiment of the present invention provides a data processing method applied to knowledge base inference, including:
obtaining a rule dependency relationship for determining a scheduling order of each rule in the knowledge base reasoning process by adopting the following modes:
selecting data from the knowledge base to form a first data set;
reasoning on the first data set according to rules of the knowledge base;
and determining the rule dependency relationship according to the reasoning result of each rule and the corresponding input data.
Optionally, in an implementation manner of the embodiment of the present invention, the determining the rule dependency relationship according to the inference result of each rule and the corresponding input data includes: matching and judging the inference result of each rule with the conditions of other rules; if the inference result of the first rule is matched with the condition of the second rule, establishing an edge pointing to the second rule from the first rule in the constructed directed acyclic graph for representing the rule dependency relationship; wherein the first rule and the second rule are used to represent any one of the rules of the knowledge base.
Further, the method further comprises: and carrying out topological sorting on the directed acyclic graph according to a topological sorting algorithm, and determining the scheduling sequence of each rule in the inference process of the knowledge base.
Optionally, in another implementation manner of the embodiment of the present invention, the method further includes: and determining the scheduling sequence of each rule in the inference process of the knowledge base according to the rule dependency relationship.
In a second aspect, an embodiment of the present invention provides a knowledge base inference method, including:
obtaining a rule dependency relationship for determining a scheduling order of each rule in a knowledge base reasoning process by using the method according to the first aspect of the embodiment of the invention;
determining the scheduling sequence of each rule in the inference process of the knowledge base according to the rule dependency relationship;
and carrying out knowledge base reasoning on the knowledge base according to the determined scheduling sequence of each rule.
Optionally, in an implementation manner of this embodiment, the method further includes: carrying out deduplication processing in the inference process of the knowledge base, wherein the deduplication processing comprises the following steps: and performing the deduplication processing after each MapReduce job for executing inference is finished, or performing the deduplication processing after each round of rule iteration, or performing the deduplication processing after determining that a new inference result is not generated.
In a third aspect, an embodiment of the present invention provides a data processing apparatus applied to knowledge base inference, including:
the relation module is used for obtaining rule dependency relation used for determining the scheduling sequence of each rule in the inference process of the knowledge base;
the relationship module includes:
the sampling submodule is used for selecting data from the knowledge base to form a first data set;
the reasoning submodule is used for reasoning the first data set according to the rule of the knowledge base;
and the determining submodule is used for determining the rule dependency relationship according to the inference result of each rule and the corresponding input data.
Optionally, in an implementation manner of this embodiment, the determining sub-module is specifically configured to perform the following processing: matching and judging the inference result of each rule with the conditions of other rules; if the inference result of the first rule is matched with the condition of the second rule, establishing an edge pointing to the second rule from the first rule in the constructed directed acyclic graph for representing the rule dependency relationship; wherein the first rule and the second rule are used to represent any one of the rules of the knowledge base.
Further optionally, the apparatus further includes a first ordering module, configured to perform topology ordering on the directed acyclic graph according to a topology ordering algorithm, and determine a scheduling order of each rule in the inference process of the knowledge base.
Optionally, in another implementation manner of this embodiment, the apparatus further includes a second order module, configured to determine, according to the rule dependency relationship, a scheduling order of each rule in the inference process of the knowledge base.
In a fourth aspect, an embodiment of the present invention provides an inference apparatus, including:
a data processing apparatus according to a third aspect of an embodiment of the present invention;
the sequence module is used for determining the scheduling sequence of each rule in the reasoning process of the knowledge base according to the rule dependency relationship;
and the reasoning module is used for carrying out knowledge base reasoning on the knowledge base according to the determined scheduling sequence of each rule.
Optionally, in an implementation manner of this embodiment, the apparatus further includes: the duplication removing module is used for carrying out duplication removing processing in the process of carrying out knowledge base inference by the inference module and comprises the following steps: and performing the deduplication processing after each MapReduce job for executing inference is finished, or performing the deduplication processing after each round of rule iteration, or performing the deduplication processing after determining that a new inference result is not generated.
In a fifth aspect, an embodiment of the present invention provides a data processing method applied to knowledge base inference, including:
storing data in the knowledge base in the following manner:
classifying the data in the knowledge base according to a preset classification strategy;
determining first data and second data according to the data volume of each type of data, wherein the first data are used as input of a MapReduce task, the second data are used to participate in knowledge base reasoning in the reduction (Reduce) stage of the MapReduce task, and the MapReduce task is used for carrying out knowledge base reasoning according to the first data and the second data;
and storing the first data to a hard disk and storing the second data to a memory.
Optionally, in an implementation manner of this embodiment, the classifying the data in the knowledge base according to a preset classification policy includes: and classifying and storing the data in the knowledge base in corresponding input files according to predicate types of the data in the knowledge base.
Optionally, in another implementation manner of this embodiment, determining the first data and the second data according to the data amount of each type of data includes: and judging according to the data volume of each type of data, wherein the type of data with the largest data volume in each type of data is used as the first data, and the rest of data is used as the second data.
Optionally, in another implementation manner of this embodiment, the MapReduce task being configured to perform knowledge base inference according to the first data and the second data includes: in the Map stage of the MapReduce task, taking the first data as input and generating key-value pairs that classify and represent the first data; and in the reduction (Reduce) stage of the MapReduce task, carrying out knowledge base reasoning on the second data and the key-value pairs input to each Reduce node; wherein the key-value pairs that classify and represent the first data take the object, predicate or subject of each datum in the first data as the key and the datum itself as the value.
Optionally, in another implementation manner of this embodiment, storing the second data in the memory includes: reading the second data into the memory of each Reduce node that executes the MapReduce task, and generating key-value pairs that classify and represent the second data; wherein the key-value pairs that classify and represent the second data take the object, predicate or subject of each datum in the second data as the key and the datum itself as the value.
In a sixth aspect, an embodiment of the present invention provides a knowledge base inference method, including:
reading first data from a hard disk as input of a MapReduce task and executing the MapReduce task;
generating a key-value pair which expresses the first data in a classified manner at a Map mapping stage of the MapReduce task;
in a reduction Reduce stage of the MapReduce task, carrying out knowledge base reasoning on second data in a memory according to key-value pairs input into each Reduce node;
the first data and the second data are obtained by processing data in a knowledge base by using the method according to the fifth aspect of the embodiment of the invention.
Optionally, in an implementation manner of this embodiment, generating a key-value pair that classifies and represents the first data includes: and generating a key-value pair with the object, the predicate or the subject of the data in the first data as a key and the data in the first data as a value.
Optionally, in another implementation manner of this embodiment, performing knowledge base inference on the second data in the memory according to the key-value pairs input to each Reduce node includes: at each Reduce node, performing a matching judgment between the key of each key-value pair input to the Reduce node and the keys of the key-value pairs that classify and represent the second data in the memory of that Reduce node, and, if the matching succeeds, performing join processing to obtain an inference result; wherein the key-value pairs input to the Reduce node are the key-value pairs obtained by merging, by key, the key-value pairs that classify and represent the first data, and the key-value pairs that classify and represent the second data take the object, predicate or subject of each datum in the second data as the key and the datum itself as the value.
Optionally, in another implementation manner of this embodiment, the method further includes: carrying out deduplication processing in the process of knowledge base reasoning, wherein the deduplication processing comprises the following steps: and performing the deduplication processing after each inference task is finished, or performing the deduplication processing after each round of rule iteration, or performing the deduplication processing after determining that a new inference result is not generated.
In a seventh aspect, an embodiment of the present invention provides a data processing apparatus applied to knowledge base inference, including:
the storage processing module is used for storing the data in the knowledge base at corresponding positions;
the storage processing module includes:
a classification submodule for classifying the data in the knowledge base according to a preset classification strategy,
the determination submodule is used for determining first data and second data according to the data volume of each type of data, wherein the first data are used as input of a MapReduce task, the second data are used to participate in knowledge base reasoning in the reduction (Reduce) stage of the MapReduce task, and the MapReduce task is used for carrying out knowledge base reasoning according to the first data and the second data;
the first storage submodule is used for storing the first data to a hard disk;
and the second storage submodule is used for storing the second data to the memory.
Optionally, in an implementation manner of this embodiment, the classification sub-module is specifically configured to store the data in the knowledge base in a classification manner in the corresponding input file according to a predicate type of each data in the knowledge base. Or the determining sub-module is specifically configured to perform determination according to the data volume of each type of data, and use the type of data with the largest data volume in each type of data as the first data, and use the remaining data as the second data.
Optionally, in another implementation manner of this embodiment, the MapReduce task being configured to perform knowledge base inference according to the first data and the second data includes: in the Map stage of the MapReduce task, taking the first data as input and generating key-value pairs that classify and represent the first data; and in the reduction (Reduce) stage of the MapReduce task, carrying out knowledge base reasoning on the second data and the key-value pairs input to each Reduce node; wherein the key-value pairs that classify and represent the first data take the object, predicate or subject of each datum in the first data as the key and the datum itself as the value.
Optionally, in yet another implementation manner of this embodiment, the second storage submodule is specifically configured to read the second data into the memory of each Reduce node that executes the MapReduce task and to generate key-value pairs that classify and represent the second data; these key-value pairs take the object, predicate or subject of each datum in the second data as the key and the datum itself as the value.
In an eighth aspect, an embodiment of the present invention provides a knowledge base inference apparatus, including:
the reading module is used for reading first data from the hard disk;
the execution module is used for carrying out knowledge base reasoning in a mode of executing a MapReduce task, and comprises the following steps:
a first execution submodule, configured to, in the Map stage of the MapReduce task, take the first data as input and generate key-value pairs that classify and represent the first data,
and a second execution submodule, configured to, in the reduction (Reduce) stage of the MapReduce task, carry out knowledge base reasoning on the second data in the memory according to the key-value pairs input to each Reduce node; wherein,
the first data and the second data are first data and second data obtained by processing data in a knowledge base by the apparatus according to the seventh aspect of the embodiment of the present invention.
Optionally, in an implementation manner of this embodiment, the first execution sub-module is specifically configured to generate key-value pairs in which the object, predicate or subject of each datum in the first data is the key and the datum itself is the value; and/or the second execution submodule is specifically configured to perform, at each Reduce node, a matching judgment between the key of the key-value pairs input to the Reduce node and the keys of the key-value pairs that classify and represent the second data in the memory of that Reduce node, and to perform join processing to obtain an inference result if the matching succeeds; wherein the key-value pairs input to the Reduce node are the key-value pairs obtained by merging, by key, the key-value pairs that classify and represent the first data, and the key-value pairs that classify and represent the second data take the object, predicate or subject of each datum in the second data as the key and the datum itself as the value.
Optionally, in another implementation manner of this embodiment, the apparatus further includes: the duplication removing module is used for carrying out duplication removing processing in the process of carrying out knowledge base reasoning by the execution module and comprises: and performing the deduplication processing after each inference task is finished, or performing the deduplication processing after each round of rule iteration, or performing the deduplication processing after determining that a new inference result is not generated.
In a ninth aspect, an embodiment of the present invention further provides a knowledge base inference apparatus, which includes the data processing apparatus according to the seventh aspect of the embodiment of the present invention.
In a tenth aspect, an embodiment of the present invention provides a knowledge base inference method which simultaneously employs the data processing methods according to the first aspect and the fifth aspect of the embodiments of the present invention, or which simultaneously employs the knowledge base inference methods according to the second aspect and the sixth aspect of the embodiments of the invention.
In an eleventh aspect, an embodiment of the present invention provides a knowledge base inference apparatus which includes the data processing apparatuses according to the third aspect and the seventh aspect of the present invention, or which includes all the modules of the knowledge base inference apparatuses according to the fourth aspect and the eighth aspect of the present invention.
The adoption of the various embodiments of the invention has the following beneficial effects:
by determining the rule dependency relationship, the knowledge base inference technology is not limited to a fixed rule which can only carry out inference in a hard coding mode, but can flexibly expand, change and define a rule/rule set, so that the application range of the knowledge base inference technology is enlarged, and the flexibility of the knowledge base inference technology is improved;
by reasoning according to the scheduling sequence of the rules, unnecessary repeated calling of the rules is avoided, and the rule execution efficiency and reasoning performance are improved;
by caching the data with small data volume to the Reduce computing node in advance, the problem of memory overflow can be effectively avoided, meanwhile, the read-write operation on the hard disk can be reduced, and the reasoning performance is improved.
Drawings
FIG. 1 is a schematic diagram comparing a knowledge base before and after reasoning in a knowledge base reasoning process;
FIG. 2 is a schematic diagram illustrating an execution flow of a MapReduce task;
FIG. 3A is a flow diagram illustrating a data processing method applied to knowledge base reasoning, in accordance with an embodiment of the present invention;
FIG. 3B is a rule dependency graph of the RDFS rule set on LUBM data;
FIG. 4 is a flow diagram of a knowledge base inference method according to an embodiment of the invention;
FIG. 5 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 6 is a block diagram of a knowledge base inference engine in accordance with an embodiment of the present invention;
FIG. 7 is a flow diagram of a data processing method applied to knowledge base reasoning, in accordance with an embodiment of the present invention;
FIG. 8 is a flow diagram of a knowledge base inference method according to an embodiment of the invention;
FIG. 9 is a block diagram of a data processing apparatus for knowledge base reasoning, in accordance with an embodiment of the present invention;
FIG. 10 is a block diagram of a knowledge base inference engine in accordance with an embodiment of the present invention;
FIG. 11 is a block diagram of a knowledge base inference apparatus according to an embodiment of the present invention.
Detailed Description
Various aspects of the invention are described in detail below with reference to the figures and the detailed description. Well-known modules, units and their interconnections, links, communications or operations with each other are not shown or described in detail. Furthermore, the described features, architectures, or functions can be combined in any manner in one or more implementations. It will be understood by those skilled in the art that the various embodiments described below are illustrative only and are not intended to limit the scope of the present invention. It will also be readily understood that the modules or units, or steps, of the embodiments described herein and illustrated in the figures can be combined and designed in a wide variety of different configurations.
Fig. 3A is a flowchart of a data processing method applied to knowledge base inference, which may be used to obtain rule dependencies used to determine a scheduling order of rules in a knowledge base inference process according to an embodiment of the present invention. Referring to fig. 3A, the method includes:
300: data is selected from the knowledge base to form a first data set.
Optionally, in an implementation manner of this embodiment, the first data set is obtained by randomly sampling the knowledge base, for example, randomly sampling 10% of data in the knowledge base.
302: the first data set is inferred according to rules of a knowledge base. In other words, the rules of the knowledge base are executed on the first data set.
It should be noted that the rules of the knowledge base mentioned in the embodiments of the present invention include, but are not limited to, a rule set based on existing resource description languages such as RDFS, OWL2DL, OWL Full, and the like, and may also include user-defined rules. Illustratively, one user-customized rule is as follows:
rule: condition 1, …, condition n → conclusion 1, …, conclusion m
Condition: (s, p, o) ∈ (R ∪ V) × (R ∪ V) × (R ∪ V ∪ L)
Conclusion: (s, p, o) ∈ (R ∪ V) × (R ∪ V) × (R ∪ V ∪ L)
Where R represents an identifier of a concept, attribute, or instance, V represents a variable identifier, and L represents a string value or value.
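A possible in-memory representation of such a rule is sketched below; the Rule structure, the leading "?" convention for variables and the example rules are illustrative assumptions, not prescribed by the patent:

```python
from collections import namedtuple

# A rule is a set of condition patterns and conclusion patterns over triples.
# Positions holding variables are written with a leading "?" (a convention
# chosen here for readability only).
Rule = namedtuple("Rule", ["name", "conditions", "conclusions"])

# The standard RDFS rule rdfs9 written in this form:
rdfs9 = Rule(
    name="rdfs9",
    conditions=[("?s", "rdf:type", "?x"), ("?x", "rdfs:subClassOf", "?y")],
    conclusions=[("?s", "rdf:type", "?y")],
)

# A user-defined rule can be expressed the same way, e.g. "wife implies husband":
wife_husband = Rule(
    name="wife-husband",
    conditions=[("?a", "wife", "?b")],
    conclusions=[("?b", "husband", "?a")],
)
```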
304: and determining rule dependency relationship according to the inference result of each rule and corresponding input data.
In this embodiment, the dependency of one rule (say rule A) on another rule (say rule B) means that the inference result of rule B triggers the execution of rule A, thereby affecting the scheduling of the rules. In other words, in this embodiment, whether rule A depends on rule B can be determined by judging whether rule A can be triggered by the inference result of rule B.
Therefore, whether or not the rules of the knowledge base are dependent on each other can be determined according to the rule dependency mentioned in the present embodiment.
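A minimal sketch of this dependency test is shown below, assuming rules are given as lists of condition patterns with "?"-prefixed variables and that each rule's inference results on the sampled first data set are available; the simplistic matcher ignores repeated-variable consistency:

```python
def matches(triple, pattern):
    """A concrete triple matches a condition pattern if every non-variable
    position is equal; variables carry a leading '?'. (For brevity this
    matcher ignores consistency between repeated variables.)"""
    return all(p.startswith("?") or p == t for t, p in zip(triple, pattern))

def build_dependency_edges(rules, results):
    """rules: {rule name: list of condition patterns};
    results: {rule name: triples inferred by that rule on the first data set}.
    Returns edges (a, b) meaning rule b depends on rule a, i.e. an edge from
    a to b in the directed acyclic graph of rule dependencies."""
    edges = set()
    for a, inferred in results.items():
        for b, conditions in rules.items():
            if a == b:
                continue  # only compare against the conditions of *other* rules
            if any(matches(t, c) for t in inferred for c in conditions):
                edges.add((a, b))
    return edges

# Toy example with the two RDFS rules discussed below:
rules = {
    "rdfs9":  [("?s", "rdf:type", "?x"), ("?x", "rdfs:subClassOf", "?y")],
    "rdfs11": [("?x", "rdfs:subClassOf", "?y"), ("?y", "rdfs:subClassOf", "?z")],
}
results = {
    "rdfs9":  [("YaoMing", "rdf:type", "Person")],
    "rdfs11": [("Athlete", "rdfs:subClassOf", "Person")],
}
print(build_dependency_edges(rules, results))  # {('rdfs11', 'rdfs9')}
```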
By adopting the data processing method provided by the embodiment shown in fig. 3A, the invention ensures that the knowledge base reasoning technology is not limited to a fixed rule which can only carry out reasoning in a hard coding mode by determining the rule dependency relationship, but can flexibly expand, change and define the rule/rule set, thereby increasing the application range of the knowledge base reasoning technology and improving the flexibility of the knowledge base reasoning technology.
Optionally, in an implementation manner of this embodiment, as shown by a dashed box in fig. 3A, the method further includes:
306: and determining the scheduling sequence of each rule in the inference process of the knowledge base according to the rule dependency relationship.
The rule scheduling order has a significant impact on inference performance. For example, rule rdfs9, "(?s rdf:type ?x), (?x rdfs:subClassOf ?y) → (?s rdf:type ?y)", depends on rule rdfs11, "(?x rdfs:subClassOf ?y), (?y rdfs:subClassOf ?z) → (?x rdfs:subClassOf ?z)". If rule rdfs9 is executed first, the scheduling flow is "rdfs9 → rdfs11 → rdfs9"; if rule rdfs11 is executed first, the scheduling flow is "rdfs11 → rdfs9". Obviously, the two different rule scheduling orders achieve different inference performance.
By adopting the method provided by the implementation mode, the scheduling sequence of the rule can be determined according to the rule dependency relationship no matter the rule set is fixed or changed, so that unnecessary repeated calling of the rule is avoided, and the rule execution efficiency and the inference performance (for example, the inference performance of an inference engine applying the embodiment or the implementation mode of the invention) are improved.
Optionally, in an implementation manner of this embodiment, the process 304 is implemented in the following manner:
matching and judging the inference result of each rule with the conditions of other rules; if the inference result of the first rule matches the condition of the second rule, for example, the inference result of the first rule is included in the condition of the second rule, an edge pointing from the first rule to the second rule is established in the constructed directed acyclic graph representing the rule dependency. Wherein the first rule and the second rule are used to represent any one of the rules of the knowledge base.
It should be noted that, the inference result of the first rule is matched with the condition of the second rule, which means that the inference result of the first rule can be used as input data of the second rule, and at this time, the second rule has dependency on the first rule.
The method provided by this implementation can determine the dependencies among the rules, form the rule dependency relationship from these dependencies, and embody the rule dependency relationship in the form of a directed acyclic graph. Illustratively, fig. 3B shows the rule dependency graph of the RDFS rule set on LUBM data (a benchmark dataset for testing inference engine performance).
Further optionally, in this implementation, the process 306 may be implemented in the following manner: performing topological sorting on the directed acyclic graph according to a topological sorting algorithm, and determining the scheduling order of each rule in the inference process of the knowledge base. The result of the topological sorting can be used as the scheduling order of the rules, and the topological sorting algorithm can be any existing algorithm, which is not limited or described in detail in the present invention.
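A topological sort of the dependency graph can be obtained, for example, with Kahn's algorithm; the sketch below is one such existing algorithm and is not specific to the patent:

```python
from collections import defaultdict, deque

def topological_order(rule_names, edges):
    """Kahn's algorithm over the rule dependency graph.
    edges contains pairs (a, b) meaning rule b depends on rule a,
    so a should be scheduled before b."""
    indegree = {r: 0 for r in rule_names}
    successors = defaultdict(list)
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1

    queue = deque(r for r, d in indegree.items() if d == 0)
    order = []
    while queue:
        r = queue.popleft()
        order.append(r)
        for nxt in successors[r]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order

print(topological_order(["rdfs9", "rdfs11"], {("rdfs11", "rdfs9")}))
# -> ['rdfs11', 'rdfs9']: rdfs11 is scheduled before rdfs9
```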
Fig. 4 is a flow chart of a knowledge base inference method according to an embodiment of the present invention, and referring to fig. 4, the method includes:
40: and determining rule dependency relations according to the rules of the knowledge base and partial data in the knowledge base.
Optionally, in an implementation manner of this embodiment, the embodiment shown in fig. 3A or its implementation manner may be adopted to implement the process 40, which is not described in detail in this embodiment.
42: and determining the scheduling sequence of each rule in the inference process of the knowledge base according to the rule dependency relationship.
Optionally, in an implementation manner of this embodiment, the directed acyclic graph used for representing the rule dependency relationship may be topologically ordered by an existing topological ordering algorithm, so as to obtain a scheduling order of each rule.
44: and carrying out knowledge base reasoning on the knowledge base according to the determined scheduling sequence of each rule.
In this embodiment, the specific inference process is not limited, and it falls within the scope of the present invention as long as the inference is performed according to the scheduling order of the rules determined by the method provided by the embodiment of the present invention.
Optionally, in one implementation of this embodiment, in process 44, during knowledge base inference according to the topological ordering result (i.e., the scheduling order of the rules), if none of the rules on which a rule depends has produced a new inference result, that rule is removed from the scheduling queue. Each rule is executed cyclically according to the topological sorting result until no rule generates new fact data, at which point the knowledge base inference is finished.
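The scheduling loop described above might look roughly as follows; this is a much-simplified single-machine sketch, and apply_rule, dependencies and the removal policy are illustrative assumptions:

```python
def infer_to_fixpoint(order, dependencies, apply_rule, facts):
    """order: rule names in topological order; dependencies[r]: names of the
    rules that r depends on; apply_rule(r, facts): set of triples produced by
    rule r on the current facts; facts: set of triples, updated in place."""
    produced_new = {r: True for r in order}  # optimistic before the first round
    active = list(order)                     # the scheduling queue
    changed = True
    while changed and active:
        changed = False
        for r in list(active):
            deps = dependencies.get(r, [])
            # Remove a rule once none of the rules it depends on produced
            # anything new in the previous round (simplified removal policy).
            if deps and not any(produced_new.get(d, False) for d in deps):
                active.remove(r)
                continue
            new_facts = apply_rule(r, facts) - facts
            produced_new[r] = bool(new_facts)
            if new_facts:
                facts |= new_facts
                changed = True
    return facts
```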
By adopting the knowledge base reasoning method provided by the embodiment, on one hand, the method is not limited to a fixed rule which adopts a hard coding mode to carry out reasoning, but can flexibly expand, change and define the rule/rule set, thereby improving the application scene and flexibility of the knowledge base reasoning technology; on the other hand, by reasoning according to the scheduling sequence of each rule, unnecessary repeated calling of the rules is avoided, and the rule execution efficiency and the overall reasoning performance are improved.
Optionally, in an implementation manner of this embodiment, as shown by a dashed box in fig. 4, the method further includes:
46: and carrying out deduplication processing in the inference process of the knowledge base.
It should be noted that the result of the knowledge base inference inevitably overlaps with the existing data in the knowledge base, and the overlapping inference result increases the storage space of the knowledge base on one hand and affects the inference performance on the other hand. For example, when reasoning is performed, data needs to be read from a hard disk and reasoning results need to be written into the hard disk, so that a large number of repeated reasoning results can greatly prolong the reasoning time. Meanwhile, if the repeated inference results trigger the execution of rules, more repeated data can be generated, and the inference performance is reduced and the cycle is vicious.
Therefore, the embodiment of the invention can effectively save the storage space and improve the reasoning performance by carrying out the duplicate removal processing in the reasoning process of the knowledge base.
Optionally, in an implementation manner of this embodiment, when performing knowledge base inference based on Hadoop, the deduplication processing may be performed in the following manner: in the Map stage, outputting a key-value pair with the triple as the key and null as the value; in the Reduce stage, outputting the key to a file.
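Simulated outside Hadoop, that deduplication job reduces to the following sketch (illustrative names, no Hadoop API):

```python
from collections import defaultdict

def dedup_job(triples):
    """Deduplication as one MapReduce job: Map emits (triple, None), the
    shuffle groups identical triples under one key, Reduce writes each key once."""
    mapped = [(t, None) for t in triples]      # Map stage: key = triple, value = null
    groups = defaultdict(list)
    for key, value in mapped:                  # shuffle/merge by key
        groups[key].append(value)
    return [key for key in groups]             # Reduce stage: output the key only

triples = [("a", "rdf:type", "B"), ("a", "rdf:type", "B"), ("c", "p", "d")]
print(dedup_job(triples))  # the duplicate triple is collapsed
```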
Too many deduplication operations can themselves degrade performance during knowledge base reasoning. Thus, one implementation of the embodiments of the present invention provides the following deduplication strategies: performing deduplication after each MapReduce job that executes inference is finished, which is suitable for the case where an inference task generates a large number of repeated results that trigger other rules; performing deduplication after each round of rule iteration, which is suitable for the case where the number of repeated inference results is large but they do not trigger other rules; and performing deduplication after determining that no new inference result is generated (i.e., after all inference results have been obtained), which is suitable for the case where the number of repeated inference results is small and they do not trigger other rules.
It should be understood by those skilled in the art that one MapReduce job completes the connection of two conditions in one rule, and when a rule contains multiple conditions, the execution of one rule requires the execution of multiple rounds of MapReduce jobs.
Fig. 5 is a block diagram of a data processing apparatus according to an embodiment of the present invention, and the data processing apparatus 5 is applied to knowledge base inference, and the data processing apparatus 5 is described in detail below with reference to fig. 5.
The data processing apparatus 5 comprises a relationship module 50 for deriving rule dependencies for determining the scheduling order of the rules in the knowledge base inference process, for which purpose the relationship module 50 comprises a sampling submodule 501, an inference submodule 502 and a determination submodule 503.
The sampling submodule 501 is configured to select data from the knowledge base to form a first data set.
The reasoning submodule 502 is configured to reason about the first data set according to rules of the knowledge base.
The determining submodule 503 is configured to determine rule dependency relationships according to the inference results of the rules and the corresponding input data.
By adopting the data processing device 5 provided by the embodiment, the knowledge base inference technology is not limited to a fixed rule which can only carry out inference in a hard coding mode, but can flexibly expand, change and define the rule/rule set by determining the rule dependency relationship. The application scene and the flexibility of the knowledge base reasoning technology are improved.
Optionally, in an implementation manner of this embodiment, the determining submodule 503 is specifically configured to perform the following processing: matching and judging the inference result of each rule with the conditions of other rules; if the inference result of the first rule is matched with the condition of the second rule, establishing an edge pointing to the second rule from the first rule in the constructed directed acyclic graph for representing the rule dependency relationship; wherein the first rule and the second rule are used to represent any one of the rules of the knowledge base.
Further optionally, in this implementation manner, the data processing apparatus may further include a first ordering module, configured to perform topology ordering on the directed acyclic graph according to a topology ordering algorithm, and determine a scheduling order of each rule in the inference process of the knowledge base.
Optionally, in an implementation manner of this embodiment, as shown by a dashed box in fig. 5, the data processing apparatus 5 may further include a second order module 51, configured to determine a scheduling order of each rule in the knowledge base inference process according to the rule dependency relationship. It will be appreciated by those skilled in the art that the aforementioned first sequential module may be used as a specific implementation of the second sequential module 51.
It should be noted that, in each device embodiment of the present invention, for the detailed description of the processing executed by each module and sub-module, the explanation of the related names, terms, and conditions, and the detailed analysis and description of the technical problem to be solved and the technical effect to be achieved, please refer to the description in the corresponding method embodiment, which is not repeated.
Fig. 6 is a block diagram schematically illustrating a knowledge base inference apparatus according to an embodiment of the present invention, and referring to fig. 6, the knowledge base inference apparatus 6 includes a data processing apparatus 5, a sequence module 60, and an inference module 61. The following description will be made separately.
The data processing device 5 is applied in the knowledge base inference device 6 for obtaining rule dependencies for determining the scheduling order of the rules in the knowledge base inference process. For the description of the data processing device 5, please refer to the foregoing description, which is not repeated herein.
And the sequence module 60 is used for determining the scheduling sequence of each rule in the inference process of the knowledge base according to the rule dependency relationship. For example, for a directed acyclic graph used for representing rule dependency relationships, a scheduling order of the rules is calculated by a topological sorting algorithm.
And the inference module 61 is used for performing knowledge base inference on the knowledge base according to the determined scheduling sequence of each rule.
By adopting the knowledge base reasoning device 6 provided by the embodiment, on one hand, the flexibility of the applicable scene of the knowledge base reasoning technology is improved; on the other hand, the rule execution efficiency and the overall reasoning performance are improved.
Optionally, in an implementation manner of the present embodiment, as shown by a dashed box in fig. 6, the knowledge base inference apparatus 6 further includes a deduplication module 62, configured to perform deduplication processing during the process of the inference module 61 performing knowledge base inference, for example, perform deduplication processing after each MapReduce job for performing inference is finished, or perform deduplication processing after each round of rule iteration, or perform deduplication processing after determining that a new inference result is not generated.
Fig. 7 is a data processing method applied to knowledge base reasoning, which is used for determining and storing a storage manner of data in a knowledge base, according to an embodiment of the present invention, and with reference to fig. 7, the method includes:
700: and classifying the data in the knowledge base according to a preset classification strategy.
It should be noted that, in various embodiments of the present invention, the data in the knowledge base is exemplified by triples. Other forms of data may be applied as desired by those skilled in the art and are within the scope of the invention.
Optionally, in an implementation manner of this embodiment, the process 700 may include: and classifying and storing the data in the knowledge base in corresponding input files according to predicate types of the data in the knowledge base. For example, a triplet with a predicate of "rdf: type" is stored in a file with a suffix name of "-type-data", and a triplet with a predicate of "rdfs: subClassOf" is placed in a file with a suffix name of "-subclass-schema".
Optionally, in an implementation manner of this embodiment, the process 700 may include: classifying and storing the data in the knowledge base in corresponding input files according to the predicate type of each datum in the knowledge base and the data volume corresponding to each predicate type. For example, for predicate types with a large data volume, the corresponding data are stored in one or more dedicated input files (which contain no data of other predicate types), while for predicate types with a small data volume, the corresponding data are stored together in one shared input file (for example, data of multiple predicate types are stored in one input file). The data volume here reflects how common or uncommon a predicate type is; those skilled in the art can flexibly set the criterion for judging the data volume as needed, which is not limited by the present invention.
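A sketch of such predicate-based classification is given below; the threshold, the file-name suffixes and the pooling of rare predicates into one shared file are illustrative assumptions:

```python
from collections import defaultdict

# Illustrative threshold and file-name suffixes; the patent does not fix them.
COMMON_PREDICATE_MIN = 1000

def classify_by_predicate(triples):
    """Group triples by predicate; data of a common predicate go to their own
    input 'file', data of all rare predicates are pooled into one shared 'file'."""
    by_predicate = defaultdict(list)
    for s, p, o in triples:
        by_predicate[p].append((s, p, o))

    files = {}
    for p, rows in by_predicate.items():
        if len(rows) >= COMMON_PREDICATE_MIN:
            files[p + "-data"] = rows                         # e.g. "rdf:type-data"
        else:
            files.setdefault("misc-schema", []).extend(rows)  # shared file
    return files
```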
Of course, in other implementation manners of the present embodiment, the "preset classification policy" is not limited to the above example, and a person skilled in the art may flexibly change the classification policy based on the present embodiment.
702: and determining the first data and the second data according to the data volume of each type of data.
The first data are used as input of the MapReduce task, the second data are used to participate in knowledge base reasoning in the reduction (Reduce) stage of the MapReduce task, and the MapReduce task is used for carrying out knowledge base reasoning according to the first data and the second data.
Optionally, in an implementation manner of this embodiment, the first data may be data that occupies a storage space that exceeds a certain proportion (for example, 80%, 50%, and the like, and the specific proportion value of the present invention is not particularly limited).
Optionally, in an implementation manner of this embodiment, the process 702 may be implemented in the following manner: and judging according to the data volume of each type of data, wherein the type of data with the largest data volume in each type of data is used as first data, and the rest of data is used as second data.
As mentioned above, each type of data can be stored in one or more input files, and therefore, for a certain type of data, the data size of the data can be determined according to the sizes of all the input files storing the data.
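Continuing the assumption of per-class "files" from the previous sketch, selecting the first and second data might look like this; measuring size by triple count rather than input file size is an illustrative simplification:

```python
def split_first_second(files):
    """files: {input file name: list of triples}. The class with the largest
    data volume becomes the first data (the MapReduce input kept on disk);
    everything else becomes the second data (cached in memory at the Reduce
    nodes). Size is measured here by triple count; in practice the sizes of
    the input files storing each class can be used instead."""
    largest = max(files, key=lambda name: len(files[name]))
    first_data = files[largest]
    second_data = [t for name, rows in files.items() if name != largest
                   for t in rows]
    return first_data, second_data
```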
Optionally, in an implementation manner of this embodiment, the MapReduce task may specifically perform database inference according to the first data and the second data in the following manner: inputting first data and generating a key-value pair which is classified and represents the first data in a Map stage of a MapReduce task; and in the Reduce stage of the MapReduce task, performing knowledge base reasoning on the second data and the key-value pairs input into each Reduce node (a specific knowledge base reasoning method is explained below).
Further optionally, the key-value pairs that classify and represent the first data take the object, predicate or subject of each datum in the first data as the key and the datum itself as the value.
704: and storing the first data to the hard disk and the second data to the memory.
Optionally, in an implementation manner of this embodiment, before the MapReduce task is executed, the second data are read into the memory of each Reduce node that executes the MapReduce task, and key-value pairs that classify and represent the second data are generated; these key-value pairs take the object, predicate or subject of each datum in the second data as the key and the datum itself as the value. In such a key-value pair, the value may be a single datum (e.g., a triple) or a data list (e.g., a list of triples) containing a plurality of data.
The embodiments of the present invention do not particularly limit which component serves as the key of the key-value pairs representing the first data and the second data; a person skilled in the art may flexibly decide which of the object, subject and predicate is the key for the first data and for the second data according to the conditions in the rules, the inference strategy, and so on. For example, the key-value pairs of the first data may take the object as the key while the key-value pairs of the second data take the subject as the key; or the key-value pairs of the first data may take the subject as the key while the key-value pairs of the second data take the object as the key; and the like.
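A sketch of building that in-memory index at a Reduce node is shown below; the key_position parameter stands in for the flexible choice of subject, predicate or object as the key:

```python
from collections import defaultdict

def index_second_data(second_data, key_position="o"):
    """Build the in-memory key -> list-of-triples index cached at every Reduce
    node before the MapReduce task runs. key_position selects which component
    of the triple serves as the key: 's' (subject), 'p' (predicate) or 'o' (object)."""
    pos = {"s": 0, "p": 1, "o": 2}[key_position]
    index = defaultdict(list)
    for triple in second_data:
        index[triple[pos]].append(triple)  # the value may grow into a triple list
    return index
```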
The existing knowledge base inference technology based on Hadoop needs to cache various types of data on Reduce nodes for inference. When data skew occurs, the problem of memory overflow is easily caused. By adopting the method provided by the embodiment, the data is classified and stored before the inference of the knowledge base, for example, the data with small data volume is cached to the Reduce computing node in advance, so that the problem of memory overflow can be effectively avoided, and meanwhile, the read-write operation on the hard disk can be reduced.
Fig. 8 is a flowchart illustrating a method for knowledge base inference according to an embodiment of the present invention, where the first data and the second data are obtained by processing data in a knowledge base according to the embodiment shown in fig. 7 or various implementations thereof. Referring to fig. 8, the method includes:
80: and reading the first data from the hard disk as an input of the MapReduce task and executing the MapReduce task. The MapReduce task is used for carrying out knowledge base reasoning according to the first data and the second data.
82: and in a Map stage of the MaReduce task, generating a key-value pair which expresses the first data in a classified mode.
Optionally, in an implementation manner of this embodiment, the processing 82 may specifically include: in a Map stage of the MaReduce task, generating a key with an object, a predicate or a subject of data (such as a triple) in first data as the key and a key-value pair with data in the first data as the value according to the first data input to the Map node.
84: and in the reduction stage of the MapReduce task, carrying out knowledge base reasoning on second data in the memory according to the key-value pairs input into each reduction node.
Optionally, in an implementation manner of this embodiment, the second data in the memory includes: in the memory of each Reduce node executing the MapReduce task, a key-value pair representing the second data is classified. In other words, all the second data may be saved in the memory of each Reduce node in the form of key-value pairs. The key-value pairs which are classified and represent the second data comprise key which takes the object, predicate or subject of the data in the second data as key and takes the data in the second data as value. Further optionally, in the key-value pair, the value corresponding to one key may be a data list (e.g., a triplet list) containing a plurality of data.
Optionally, in an implementation manner of this embodiment, the processing 84 may specifically include: and at each Reduce node, performing matching judgment according to the key of the key-value pair input into the Reduce node and the key of the key-value pair representing the second data in the memory of the Reduce node in a classified manner, and performing connection processing to obtain an inference result if the matching is successful.
The key-value pairs input to a Reduce node may be the key-value pairs obtained by merging, by key, the key-value pairs that classify and represent the first data. For example, in the shuffle processing between the Map stage and the Reduce stage, the key-value pairs output by the Map stage are merged by key into one or more key-value pairs whose value is a data list, and the merged key-value pairs are distributed to the different Reduce nodes according to their keys. That is, the value of a key-value pair input to a Reduce node may be a data list (e.g., a triple list) containing a plurality of data.
In the case of a successful match, the join processing may specifically include: taking data items from the value (a data list) of the key-value pair input to the Reduce node and joining each of them with the value corresponding to the matched key in the second data (when that value is a single data item), or with each data item in that value (when the value is a data list), so as to obtain the inference results. The order in which data are taken from the value of the key-value pair input to the Reduce node is not limited in this embodiment, and those skilled in the art can flexibly set or optimize it.
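Putting the pieces together, a single-machine sketch of this Reduce-side join follows; the shuffle is simulated with a dictionary, and the example data correspond to an rdfs9-style join of type facts with cached subClassOf triples (all names illustrative):

```python
from collections import defaultdict

def reduce_side_join(first_data, second_index, key_position=2):
    """Sketch of the join described above: key-value pairs generated from the
    first data (read from disk) are matched against the in-memory index of the
    second data at the Reduce nodes; matching keys are joined into results.
    key_position selects the key component of the first data (0=s, 1=p, 2=o)."""
    # Map stage: emit (key, triple) for each triple of the first data.
    mapped = [(t[key_position], t) for t in first_data]

    # Shuffle: merge values by key, as happens between Map and Reduce.
    grouped = defaultdict(list)
    for key, triple in mapped:
        grouped[key].append(triple)

    # Reduce stage: join against the second data cached in memory.
    results = []
    for key, triples in grouped.items():
        for cached in second_index.get(key, []):  # successful key match
            for t in triples:
                results.append((t, cached))       # joined pair of triples
    return results

# Example: join "?s rdf:type ?x" facts (first data, keyed by the object ?x)
# with "?x rdfs:subClassOf ?y" triples cached in memory (keyed by the subject ?x).
first_data = [("YaoMing", "rdf:type", "Athlete")]
second_index = {"Athlete": [("Athlete", "rdfs:subClassOf", "Person")]}
print(reduce_side_join(first_data, second_index, key_position=2))
```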
Optionally, in an implementation manner of this embodiment, as shown by a dashed box in fig. 8, the method may further include:
86: and carrying out deduplication processing in the process of knowledge base reasoning. For example, the deduplication processing is performed after each MapReduce job that performs inference is finished, or after each round of rule iteration, or after it is determined that a new inference result is not generated.
By adopting the knowledge base reasoning method provided by the embodiment, the read-write operation on the hard disk can be reduced while the memory overflow is avoided, and the reasoning performance is effectively improved (namely, the reasoning efficiency is improved).
Fig. 9 is a block diagram of a data processing apparatus applied to knowledge base inference according to an embodiment of the present invention, and referring to fig. 9, the data processing apparatus 9 includes a storage processing module 90 for storing data in a knowledge base at a corresponding location, and includes a classification submodule 901, a determination submodule 902, a first storage submodule 903 and a second storage submodule 904. The following description will be made separately.
The classification submodule 901 is configured to classify data in the knowledge base according to a preset classification policy.
Optionally, in an implementation manner of this embodiment, the classification sub-module 901 is specifically configured to store the data in the knowledge base in a classification manner in the corresponding input file according to the predicate type of each data in the knowledge base.
The determining sub-module 902 is configured to determine the first data and the second data according to the data amount of each type of data.
The first data are used as input of the MapReduce task, the second data are used to participate in knowledge base reasoning in the reduction (Reduce) stage of the MapReduce task, and the MapReduce task is used for carrying out knowledge base reasoning according to the first data and the second data.
Optionally, in an implementation manner of this embodiment, the determining sub-module 902 is specifically configured to perform a judgment according to data amounts of various types of data, and use the type of data with the largest data amount in the various types of data as the first data, and use the remaining data (i.e., all data with the data amount that is not the largest) as the second data.
Optionally, in an implementation manner of this embodiment, the MapReduce task specifically performs knowledge base inference according to the first data and the second data in the following way: in the Map stage of the MapReduce task, taking the first data as input and generating key-value pairs that classify and represent the first data; and in the reduction (Reduce) stage of the MapReduce task, carrying out knowledge base reasoning on the second data and the key-value pairs input to each Reduce node. The key-value pairs that classify and represent the first data take the object, predicate or subject of each datum in the first data as the key and the datum itself as the value. The key-value pairs input to each Reduce node are obtained by merging, by key, the key-value pairs that classify and represent the first data, and are distributed to the Reduce nodes according to their keys.
The first storage submodule 903 is configured to store the first data in the hard disk.
And a second storage submodule 904, configured to store the second data in the memory. For example, the second data are read into the memory of each Reduce node executing the MapReduce task, and key-value pairs representing the second data in a classified manner are generated; such a key-value pair takes the object, predicate or subject of a piece of data in the second data as the key, and that piece of data as the value.
By adopting the data processing apparatus 9 provided by this embodiment, the data with a small data volume are cached on the Reduce computing nodes in advance, so that the problem of memory overflow can be effectively avoided.
Fig. 10 is a block diagram of a knowledge base inference apparatus according to an embodiment of the present invention. The first data and second data on which the knowledge base inference apparatus 10 performs knowledge base inference are first data and second data obtained by processing the data in a knowledge base using the method provided by the embodiment shown in fig. 7 or an implementation thereof, or by the apparatus provided by the embodiment shown in fig. 9 or an implementation thereof.
Referring to fig. 10, the knowledge base inference apparatus 10 includes a reading module 101 and an execution module 102, which are described below separately.
The reading module 101 is configured to read first data from a hard disk.
And the execution module 102 is used for carrying out knowledge base reasoning in a manner of executing the MapReduce task. In particular, the execution module 102 may include a first execution submodule and a second execution submodule described below.
And the first execution submodule is used for taking the first data as input and generating key-value pairs representing the first data in a classified manner in the Map stage of the MapReduce task.
Optionally, in an implementation manner of this embodiment, the first execution sub-module may be specifically configured to generate a key-value pair in which an object, a predicate, or a subject of data in the first data is a key, and data in the first data is a value.
And the second execution submodule is used for carrying out knowledge base reasoning on second data in the memory according to the key-value pairs input to each Reduce node in the reduction stage of the MapReduce task.
Optionally, in an implementation manner of this embodiment, the second execution submodule is specifically configured to, at each Reduce node, perform a matching judgment between the key of the key-value pair input to the Reduce node and the keys of the key-value pairs that represent the second data in a classified manner in the memory of the Reduce node, and, if the matching succeeds, perform connection processing to obtain an inference result.
The key-value pairs input to the Reduce node include: key-value pairs obtained by combining, according to their keys, the key-value pairs that represent the first data in a classified manner. A key-value pair that represents the second data in a classified manner takes the object, predicate or subject of a piece of data in the second data as the key and that piece of data as the value.
Optionally, in an implementation manner of this embodiment, as shown by a dashed box in fig. 10, the knowledge base inference apparatus 10 further includes a deduplication module 103, configured to perform deduplication processing during the process of performing knowledge base inference by the execution module 102. For example, the deduplication processing is performed after each MapReduce job that executes inference is finished, or after each round of rule iteration, or after it is determined that no new inference result is generated.
By adopting the knowledge base inference apparatus 10 provided by this embodiment, read and write operations on the hard disk can be reduced while memory overflow is avoided, and reasoning performance is effectively improved.
Fig. 11 is a block diagram schematically illustrating a knowledge base inference apparatus according to an embodiment of the present invention. Referring to fig. 11, the knowledge base inference apparatus 11 includes the data processing apparatus 9 of the embodiment shown in fig. 9 or an implementation thereof. In one implementation manner of this embodiment, the knowledge base inference apparatus 11 may also have the modules and functions of the knowledge base inference apparatus 10.
In an embodiment of the invention, a knowledge base reasoning method is also provided. It may simultaneously adopt the data processing methods provided by the embodiments shown in fig. 3A and fig. 5 or implementations thereof, and may also simultaneously adopt the knowledge base reasoning methods provided by the embodiments shown in fig. 4 and fig. 6 or implementations thereof.
In an embodiment of the invention, a knowledge base reasoning apparatus is also provided. It may include both the data processing apparatus 7 and the data processing apparatus 9, or may include all the modules of the knowledge base inference apparatus 8 and the knowledge base inference apparatus 10.
Embodiments of the invention are described below in connection with a more specific implementation.
A Hadoop-based knowledge base inference method that can use the embodiments shown in figs. 3A-6 of the present invention, or implementations thereof, is as follows:
with the rule "rdfs 9:? s rdf type? x; is there a x rdfs subclasof? y? s rdf type? y "is an example, where,"? s ","? x ","? y "represents variables,"? x "is a shared variable that occurs in both conditions.
In the knowledge base reasoning process, taking the execution of rule rdfs9 as an example, each piece of data whose predicate is "rdf:type" outputs a key-value pair with the object of the triple as the key and the triple as the value, and each piece of data whose predicate is "rdfs:subClassOf" outputs a key-value pair with the subject of the triple as the key and the triple as the value. Thus, on a Reduce node, the value of one key contains two types of triples: triples whose predicate is "rdf:type" and triples whose predicate is "rdfs:subClassOf". The two types of triples are stored in two separate lists, and an inference result can be generated by taking one triple from each list and connecting them.
For example, assuming the knowledge base contains the triples "<Yaoming rdf:type basketball player>" and "<basketball player rdfs:subClassOf player>", the new knowledge "<Yaoming rdf:type player>" can be obtained after reasoning. In the Map stage, "<Yaoming rdf:type basketball player>" matches the first condition of the rule, "?s rdf:type ?x", so the key-value pair (basketball player, <Yaoming rdf:type basketball player>) is output; "<basketball player rdfs:subClassOf player>" matches the second condition of the rule, "?x rdfs:subClassOf ?y", so the key-value pair (basketball player, <basketball player rdfs:subClassOf player>) is output. Therefore, in the Reduce stage, the value list corresponding to the key "basketball player" contains both triples whose predicate is "rdf:type" and triples whose predicate is "rdfs:subClassOf". The two types of triples are stored in two separate lists, and an inference result can be generated by taking one triple from each list and connecting them.
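The following small, self-contained Python sketch (purely illustrative; it mimics the join described above in memory rather than on Hadoop) reproduces this example: both conditions of rdfs9 are keyed on the shared variable ?x and then connected to produce the new triple.

```python
from collections import defaultdict

triples = [("Yaoming", "rdf:type", "basketball player"),
           ("basketball player", "rdfs:subClassOf", "player")]

type_by_class = defaultdict(list)      # keyed on ?x, the object of the rdf:type triple
subclass_by_class = defaultdict(list)  # keyed on ?x, the subject of the rdfs:subClassOf triple
for s, p, o in triples:
    if p == "rdf:type":
        type_by_class[o].append((s, p, o))
    elif p == "rdfs:subClassOf":
        subclass_by_class[s].append((s, p, o))

# Connect the two lists that share the same key to produce the inference results.
inferred = [(s, "rdf:type", y)
            for x in type_by_class.keys() & subclass_by_class.keys()
            for (s, _, _) in type_by_class[x]
            for (_, _, y) in subclass_by_class[x]]
print(inferred)   # [('Yaoming', 'rdf:type', 'player')]
```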
However, the foregoing knowledge base inference method has a drawback that if a triple list corresponding to a key is too large, a memory overflow (out of memory) may occur, and there are many hard disk read-write operations. The embodiments of fig. 7-11 provided by the present invention solve this problem.
Specifically, in one particular implementation of the embodiments shown in fig. 7-11, the knowledge base inference process is as follows:
Before executing the MapReduce task, the triples of the knowledge base are stored into different files according to the type of the triple predicate, and the size of each input file is recorded. For example, triples with the predicate "rdf:type" are placed in a file with the suffix "-type-data", and triples with the predicate "rdfs:subClassOf" are placed in a file with the suffix "-subclass-schema". A judgment is then made according to the size of each input file: if the "-type-data" file is the largest, or the "-type-data" file is larger than a certain proportion (for example 80%) of the memory, the "-type-data" file is written to the hard disk, and the "-subclass-schema" file is read into the memory of each Reduce computing node in advance.
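A hedged sketch of this pre-processing step is given below (the suffix names and the 80% threshold follow the text above; the size measure, function name and parameters are assumptions of this example): triples are grouped by predicate, the largest group, or any group exceeding the memory threshold, is written to the hard disk as first data, and the remaining groups are cached in memory as second data.

```python
from collections import defaultdict

def partition(triples, memory_bytes, threshold=0.8):
    """Group triples by predicate; decide which group is 'first data' (hard disk)
    and which groups are 'second data' (Reduce-node memory)."""
    groups = defaultdict(list)
    for t in triples:
        groups[t[1]].append(t)              # t[1] is the predicate, e.g. "rdf:type"
    # Rough size estimate per predicate group (an assumption of this sketch).
    sizes = {p: sum(len(str(t)) for t in ts) for p, ts in groups.items()}
    largest = max(sizes, key=sizes.get)
    to_disk, to_memory = {}, {}
    for p, ts in groups.items():
        if p == largest or sizes[p] > threshold * memory_bytes:
            to_disk[p] = ts                 # first data: input of the MapReduce task
        else:
            to_memory[p] = ts               # second data: cached on every Reduce node
    return to_disk, to_memory

triples = [("Yaoming", "rdf:type", "basketball player"),
           ("Yaoming", "rdf:type", "athlete"),
           ("Yaoming", "rdf:type", "person"),
           ("basketball player", "rdfs:subClassOf", "player")]
to_disk, to_memory = partition(triples, memory_bytes=1024)
print(list(to_disk), list(to_memory))       # ['rdf:type'] ['rdfs:subClassOf']
```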
When the MapReduce task is executed, only the file with the suffix "-type-data" is taken as input in the Map stage, so that the value corresponding to a key at a Reduce node contains only triples whose predicate is "rdf:type".
When knowledge base reasoning is carried out, triples are taken from the key-value pairs input to the Reduce node (i.e., from the output of the Map stage of the MapReduce task; the order in which they are taken is not limited) and a matching judgment is made against the triples in the in-memory list of triples whose predicate is "rdfs:subClassOf"; if a match is found, the triples are connected to generate an inference result; if not, the matching judgment continues with the triples that have not yet been judged, and so on.
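The Reduce-side processing can be sketched as follows (again a non-authoritative illustration; the cache variable and function names are invented for this example): the value list of a key now holds only "rdf:type" triples, and they are joined against the "rdfs:subClassOf" triples cached in the node's memory.

```python
# Cached second data: one list of rdfs:subClassOf triples per key (an illustrative
# stand-in for the "-subclass-schema" file read into each Reduce node's memory).
SUBCLASS_CACHE = {"basketball player": [("basketball player", "rdfs:subClassOf", "player")]}

def reduce_join(key, type_triples, cache=SUBCLASS_CACHE):
    """Join the grouped rdf:type triples (the key's value list) with the cached triples."""
    results = []
    for (s, _, _) in type_triples:
        for (_, _, y) in cache.get(key, []):    # matching judgment on the key
            results.append((s, "rdf:type", y))  # connection processing
    return results

print(reduce_join("basketball player", [("Yaoming", "rdf:type", "basketball player")]))
# [('Yaoming', 'rdf:type', 'player')]
```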
By adopting the implementation mode, the problem of insufficient memory can be effectively avoided, the read-write operation of the hard disk is reduced, and the reasoning efficiency is improved.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by combining software with a hardware platform. With this understanding, the technical solution of the present invention, or the part of it that contributes to the prior art, may be embodied in the form of a software product, which can be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a smart phone, a network device, etc.) to execute the methods of the embodiments or of some parts of the embodiments.
The terms and expressions used in the specification of the present invention have been set forth for illustrative purposes only and are not meant to be limiting. It will be appreciated by those skilled in the art that changes could be made to the details of the above-described embodiments without departing from the underlying principles thereof. The scope of the invention is, therefore, indicated by the appended claims, in which all terms are intended to be interpreted in their broadest reasonable sense unless otherwise indicated.

Claims (31)

1. A data processing method for knowledge base reasoning, the method comprising:
obtaining a rule dependency relationship by adopting the following mode, wherein the rule dependency relationship is used for determining the scheduling sequence of each rule in the inference process of the knowledge base:
selecting data from the knowledge base to form a first data set;
reasoning on the first data set according to rules of the knowledge base;
and determining the rule dependency relationship according to the reasoning result of each rule and the corresponding input data.
2. The method of claim 1, wherein determining the rule dependencies based on the inference results of each rule and corresponding input data comprises:
matching and judging the inference result of each rule with the conditions of other rules;
if the inference result of the first rule is matched with the condition of the second rule, establishing an edge pointing to the second rule from the first rule in the constructed directed acyclic graph for representing the rule dependency relationship;
wherein the first rule and the second rule are used to represent any one of the rules of the knowledge base.
3. The method of claim 2, wherein the method further comprises:
and carrying out topological sorting on the directed acyclic graph according to a topological sorting algorithm, and determining the scheduling sequence of each rule in the inference process of the knowledge base.
4. The method of claim 1, wherein the method further comprises:
and determining the scheduling sequence of each rule in the inference process of the knowledge base according to the rule dependency relationship.
5. A knowledge base inference method, characterized in that the inference method comprises:
obtaining a rule dependency for determining a scheduling order of rules in a knowledge base inference process using the method of claim 1 or 2;
determining the scheduling sequence of each rule in the inference process of the knowledge base according to the rule dependency relationship;
and carrying out knowledge base reasoning on the knowledge base according to the determined scheduling sequence of each rule.
6. The inference method of claim 5, wherein the method further comprises:
carrying out deduplication processing in the inference process of the knowledge base, wherein the deduplication processing comprises the following steps:
and performing the deduplication processing after each MapReduce job for executing inference is finished, or performing the deduplication processing after each round of rule iteration, or performing the deduplication processing after determining that a new inference result is not generated.
7. A data processing apparatus for application to knowledge base reasoning, the apparatus comprising:
the relation module is used for obtaining a rule dependency relation, and the rule dependency relation is used for determining the scheduling sequence of each rule in the inference process of the knowledge base;
the relationship module includes:
the sampling submodule is used for selecting data from the knowledge base to form a first data set;
the reasoning submodule is used for reasoning the first data set according to the rule of the knowledge base;
and the determining submodule is used for determining the rule dependency relationship according to the inference result of each rule and the corresponding input data.
8. The data processing apparatus according to claim 7, wherein the determination submodule is specifically configured to perform:
matching and judging the inference result of each rule with the conditions of other rules;
if the inference result of the first rule is matched with the condition of the second rule, establishing an edge pointing to the second rule from the first rule in the constructed directed acyclic graph for representing the rule dependency relationship;
wherein the first rule and the second rule are used to represent any one of the rules of the knowledge base.
9. The data processing apparatus of claim 8, further comprising:
and the first sequence module is used for carrying out topological sequencing on the directed acyclic graph according to a topological sequencing algorithm and determining the scheduling sequence of each rule in the inference process of the knowledge base.
10. The data processing apparatus of claim 7, further comprising:
and the second sequence module is used for determining the scheduling sequence of each rule in the reasoning process of the knowledge base according to the rule dependency relationship.
11. A knowledge base inference apparatus, characterized in that the inference apparatus comprises:
the data processing apparatus of claim 7 or 8;
the sequence module is used for determining the scheduling sequence of each rule in the reasoning process of the knowledge base according to the rule dependency relationship;
and the reasoning module is used for carrying out knowledge base reasoning on the knowledge base according to the determined scheduling sequence of each rule.
12. The inference apparatus of claim 11, further comprising:
the duplication removing module is used for carrying out duplication removing processing in the process of carrying out knowledge base inference by the inference module and comprises the following steps:
and performing the deduplication processing after each MapReduce job for executing inference is finished, or performing the deduplication processing after each round of rule iteration, or performing the deduplication processing after determining that a new inference result is not generated.
13. A data processing method for knowledge base reasoning, the method comprising:
storing data in the knowledge base in the following manner:
classifying the data in the knowledge base according to a preset classification strategy;
determining first data and second data according to the data quantity of each type of data, wherein the first data are used as input of a MapReduce task, the second data are used for participating in knowledge base reasoning in a reduction Reduce stage of the MapReduce task, and the MapReduce task is used for carrying out database reasoning according to the first data and the second data;
and storing the first data to a hard disk and storing the second data to a memory.
14. The method of claim 13, wherein the classifying the data in the knowledge base according to a preset classification policy comprises:
and classifying and storing the data in the knowledge base in corresponding input files according to predicate types of the data in the knowledge base.
15. The method of claim 13, wherein determining the first data and the second data based on the amount of each type of data comprises:
and judging according to the data volume of each type of data, wherein the type of data with the largest data volume in each type of data is used as the first data, and the rest of data is used as the second data.
16. The method of any one of claims 13-15, wherein the MapReduce task is for database reasoning from the first and second data, comprising:
inputting the first data and generating a key-value pair which is classified and represents the first data in a Map stage of the MapReduce task;
in a reduction Reduce stage of the MapReduce task, carrying out knowledge base reasoning on the second data and a key-value pair input into each Reduce node;
wherein classifying the key-value pair representing the first data comprises: and taking the object, predicate or subject of the data in the first data as a key, and taking the data in the first data as a key-value pair of value.
17. The method of any of claims 13-15, wherein storing the second data to memory comprises:
reading the second data into a memory of each Reduce node executing the MapReduce task, and generating a key-value pair representing the second data in a classified manner;
wherein classifying the key-value pair representing the second data comprises: and taking the object, predicate or subject of the data in the second data as a key, and taking the data in the second data as a key-value pair of value.
18. A method of knowledge base inference, the method comprising:
reading first data from a hard disk as input of a MapReduce task and executing the MapReduce task;
generating a key-value pair which expresses the first data in a classified manner at a Map mapping stage of the MapReduce task;
in a reduction Reduce stage of the MapReduce task, carrying out knowledge base reasoning on second data in a memory according to key-value pairs input into each Reduce node;
wherein the first data and the second data are first data and second data obtained by processing data in a knowledge base by the method according to any one of claims 13 to 17.
19. The method of claim 18, wherein generating a key-value pair that classifies the representation of the first data comprises:
and generating a key-value pair with the object, the predicate or the subject of the data in the first data as a key and the data in the first data as a value.
20. The method of claim 18,
and the knowledge base reasoning is carried out on the second data in the memory and the key-value pairs input to each Reduce node, and the method comprises the following steps: in each Reduce node, matching judgment is carried out according to the key of the key-value pair input into the Reduce node and the key of the key-value pair in the memory of the Reduce node, and if matching is successful, connection processing is carried out to obtain an inference result;
wherein,
the key-value pair input to the Reduce node comprises: a key-value pair obtained by combining, according to the key, the key-value pairs representing the first data in a classified manner,
classifying the key-value pair representing the second data includes: and taking the object, predicate or subject of the data in the second data as a key, and taking the data in the second data as a key-value pair of value.
21. The method of any one of claims 18-20, further comprising:
carrying out deduplication processing in the process of knowledge base reasoning, wherein the deduplication processing comprises the following steps:
and performing the deduplication processing after each MapReduce job for executing inference is finished, or performing the deduplication processing after each round of rule iteration, or performing the deduplication processing after determining that a new inference result is not generated.
22. A data processing apparatus for use in knowledge base reasoning, the apparatus comprising:
the storage processing module is used for storing the data in the knowledge base at corresponding positions;
the storage processing module includes:
a classification submodule for classifying the data in the knowledge base according to a preset classification strategy,
the determination submodule is used for determining first data and second data according to the data quantity of various types of data, wherein the first data are used as input of a MapReduce task, the second data are used for participating in knowledge base reasoning in a reduction Reduce stage of the MapReduce task, and the MapReduce task is used for carrying out database reasoning according to the first data and the second data;
the first storage submodule is used for storing the first data to a hard disk;
and the second storage submodule is used for storing the second data to the memory.
23. The apparatus of claim 22,
the classification sub-module is specifically configured to store the data in the knowledge base in a classification manner in the corresponding input file according to predicate types of the data in the knowledge base.
24. The apparatus of claim 22,
the determining submodule is specifically configured to perform determination according to data volumes of various types of data, and use the type of data with the largest data volume in the various types of data as the first data, and use the remaining data as the second data.
25. The apparatus of any one of claims 22-24, wherein the MapReduce task is to perform database reasoning from the first data and second data, comprising:
inputting the first data and generating a key-value pair which is classified and represents the first data in a Map stage of the MapReduce task;
in a reduction Reduce stage of the MapReduce task, carrying out knowledge base reasoning on the second data and a key-value pair input into each Reduce node;
wherein classifying the key-value pair representing the first data comprises: and taking the object, predicate or subject of the data in the first data as a key, and taking the data in the first data as a key-value pair of value.
26. The apparatus of any one of claims 22-24,
the second storage submodule is specifically configured to read the second data into a memory of each Reduce node executing the MapReduce task, and generate a key-value pair representing the second data in a classified manner;
wherein classifying the key-value pair representing the second data comprises: and taking the object, predicate or subject of the data in the second data as a key, and taking the data in the second data as a key-value pair of value.
27. A knowledge base reasoning apparatus comprising:
the reading module is used for reading first data from the hard disk;
the execution module is used for carrying out knowledge base reasoning in a mode of executing a MapReduce task, and comprises the following steps:
a first execution submodule, configured to, in a Map phase of the MapReduce task, take the first data as input and generate a key-value pair representing the first data in a classified manner,
the second execution submodule is used for carrying out knowledge base reasoning on second data in the memory according to the key-value input to each Reduce node in the reduction stage of the MapReduce task;
wherein the first data and the second data are first data and second data obtained by processing data in a knowledge base by the apparatus according to any one of claims 22-26.
28. The knowledge base inference apparatus of claim 27,
the first execution submodule is specifically configured to generate a key-value pair with an object, a predicate, or a subject of data in the first data as a key and data in the first data as a value.
29. The knowledge base inference apparatus of claim 27,
the second execution submodule is specifically used for carrying out matching judgment, on each Reduce node, according to the key of the key-value pair input into the Reduce node and the key of the key-value pair in the memory of the Reduce node, and carrying out connection processing to obtain an inference result if matching is successful; wherein the key-value pair input to the Reduce node comprises: a key-value pair obtained by combining, according to the key, the key-value pairs representing the first data in a classified manner,
classifying the key-value pair representing the second data includes: and taking the object, predicate or subject of the data in the second data as a key, and taking the data in the second data as a key-value pair of value.
30. The knowledge base inference apparatus of any of claims 27-29, further comprising:
the duplication removing module is used for carrying out duplication removing processing in the process of carrying out knowledge base reasoning by the execution module and comprises:
and performing the deduplication processing after each MapReduce job for executing inference is finished, or performing the deduplication processing after each round of rule iteration, or performing the deduplication processing after determining that a new inference result is not generated.
31. Knowledge base reasoning apparatus comprising data processing apparatus according to any one of claims 22 to 26.
CN201510295748.3A 2015-06-02 2015-06-02 Data processing method, Analysis of Knowledge Bases Reasoning method and relevant apparatus Active CN104915717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510295748.3A CN104915717B (en) 2015-06-02 2015-06-02 Data processing method, Analysis of Knowledge Bases Reasoning method and relevant apparatus

Publications (2)

Publication Number Publication Date
CN104915717A CN104915717A (en) 2015-09-16
CN104915717B true CN104915717B (en) 2017-11-14

Family

ID=54084764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510295748.3A Active CN104915717B (en) 2015-06-02 2015-06-02 Data processing method, Analysis of Knowledge Bases Reasoning method and relevant apparatus

Country Status (1)

Country Link
CN (1) CN104915717B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104932956B (en) * 2015-06-19 2018-02-27 华南理工大学 A kind of cloud disaster-tolerant backup method towards big data
CN105354224B (en) * 2015-09-30 2019-07-23 百度在线网络技术(北京)有限公司 The treating method and apparatus of knowledge data
CN106251314A (en) * 2016-08-19 2016-12-21 深圳市唯特视科技有限公司 A kind of method that image reasoning is rebuild
CN108804473B (en) * 2017-05-04 2022-02-11 华为技术有限公司 Data query method, device and database system
CN109299283B (en) * 2018-08-29 2021-11-16 创新先进技术有限公司 Data reasoning method, device, server and medium based on knowledge graph
CN110781258B (en) * 2019-09-16 2021-07-09 北京三快在线科技有限公司 Packet query method and device, electronic equipment and readable storage medium
CN110569061A (en) * 2019-09-24 2019-12-13 河北环境工程学院 Automatic construction system of software engineering knowledge base based on big data
CN111539529B (en) * 2020-04-15 2023-06-20 东莞证券股份有限公司 Event reasoning method and device
WO2021223215A1 (en) * 2020-05-08 2021-11-11 Paypal, Inc. Automated decision platform
CN113282606A (en) * 2021-05-14 2021-08-20 杭州网易云音乐科技有限公司 Data processing method, data processing device, storage medium and computing equipment
CN115033716B (en) * 2022-08-10 2023-01-20 深圳市人马互动科技有限公司 General self-learning system and self-learning method based on same

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1937551A (en) * 2006-07-25 2007-03-28 中山大学 Context routine sensing system and method for digital home network
CN102402599A (en) * 2011-11-17 2012-04-04 天津大学 Dynamic maintenance system for large-scale semantic knowledge base
CN103745191A (en) * 2013-11-15 2014-04-23 中国科学院遥感与数字地球研究所 Landform analysis based method for automatically identifying tablelands, ridges and loess hills in loess region
CN104598937A (en) * 2015-01-22 2015-05-06 百度在线网络技术(北京)有限公司 Recognizing method and device for text information

Also Published As

Publication number Publication date
CN104915717A (en) 2015-09-16

Similar Documents

Publication Publication Date Title
CN104915717B (en) Data processing method, Analysis of Knowledge Bases Reasoning method and relevant apparatus
EP2674875B1 (en) Method, controller, program and data storage system for performing reconciliation processing
US9928113B2 (en) Intelligent compiler for parallel graph processing
US10042911B2 (en) Discovery of related entities in a master data management system
Di Ciccio et al. A two-step fast algorithm for the automated discovery of declarative workflows
Fan et al. Querying big graphs within bounded resources
US20230139783A1 (en) Schema-adaptable data enrichment and retrieval
WO2021011914A1 (en) Scheduling operations on a computation graph
CN113703775A (en) Compiling method, device, equipment and storage medium
Ma et al. A hybrid approach using genetic programming and greedy search for QoS-aware web service composition
CN109656798B (en) Vertex reordering-based big data processing capability test method for supercomputer
Arnaiz-González et al. MR-DIS: democratic instance selection for big data by MapReduce
CN111666468A (en) Method for searching personalized influence community in social network based on cluster attributes
Mota et al. A compact timed state space approach for the analysis of manufacturing systems: key algorithmic improvements
JP5555238B2 (en) Information processing apparatus and program for Bayesian network structure learning
JP7306432B2 (en) Information processing method, information processing device and program
Bengre et al. A learning-based scheduler for high volume processing in data warehouse using graph neural networks
Huang et al. Efficient Algorithms for Parallel Bi-core Decomposition
Jayachitra Devi et al. Link prediction model based on geodesic distance measure using various machine learning classification models
CN112100446B (en) Search method, readable storage medium, and electronic device
CN114265701B (en) Resource processing method, device, computer equipment and storage medium
Han et al. An efficient skyline framework for matchmaking applications
US20230385337A1 (en) Systems and methods for metadata based path finding
US20220413727A1 (en) Quality-performance optimized identification of duplicate data
Sahu An Improved Pattern Mining Technique for Graph Pattern Analysis Using Novel Behavior of Artificial Bee Colony Algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20171225

Address after: 100085 Beijing, Haidian District, No. ten on the ground floor, No. 10 Baidu building, layer 2

Patentee after: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Address before: 100085 Beijing, Haidian District, No. ten on the ground floor, No. 10 Baidu building, layer three

Patentee before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right