CN116610725B

CN116610725B - Entity enhancement rule mining method and device applied to big data

Info

Publication number: CN116610725B
Application number: CN202310568228.XA
Authority: CN
Inventors: 王尧舒; 谢珉; 樊文飞
Original assignee: Shenzhen Institute of Computing Sciences
Current assignee: Shenzhen Institute of Computing Sciences
Priority date: 2023-05-18
Filing date: 2023-05-18
Publication date: 2024-03-12
Anticipated expiration: 2043-05-18
Also published as: CN116610725A

Abstract

The application provides an entity enhancement rule mining method and device applied to big data. The method comprises the following steps: acquiring a data set and determining an entity enhancement rule set in the data set; recording the predicates to be selected corresponding to the incomplete rules in the entity enhancement rule set; determining the ranking scores corresponding to the entity enhancement rules in the entity enhancement rule set, and generating a first target rule set from the first K complete rules by ranking score; when an output request for updating the target rule set is received, determining the rule types of the first K entity enhancement rules by ranking score; and when the rule types of the first K entity enhancement rules contain incomplete rules, generating a second target rule set from the awakened incomplete rules and the remaining first K entity enhancement rules. According to the embodiments of the application, the K entity enhancement rules with the highest ranking scores can be returned quickly without obtaining all rules in the data set in advance, which improves the efficiency of entity enhancement rule mining on the data set.

Description

Entity enhancement rule mining method and device applied to big data

Technical Field

The application relates to the field of computers, in particular to an entity enhancement rule mining method and device applied to big data.

Background

Rule discovery in big data means that, given a data set D, all rules that hold on the data set are determined. One major problem faced by rule discovery is that a large number of rule candidates are generated. Thus one effective strategy is to perform top-k rule discovery. Specifically, given a scoring function score(), let φ denote a rule; for each rule φ, a ranking score is calculated, denoted score(φ). The k rules with the highest ranking scores on data set D may then be returned as the result of top-k rule discovery.
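The selection step of top-k rule discovery can be sketched as follows. This is a minimal illustration only: the rule names and score values are made up, and the scoring function itself is left abstract because the application does not fix one.

```python
import heapq

def top_k_rules(rules, score, k):
    """Return the k rules with the highest ranking score, best first."""
    return heapq.nlargest(k, rules, key=score)

# Hypothetical rules with made-up ranking scores, for illustration only.
toy_scores = {"phi1": 0.9, "phi2": 0.4, "phi3": 0.7, "phi4": 0.1}
best_two = top_k_rules(toy_scores, toy_scores.get, 2)
```

Note that this selection presupposes that every candidate rule has already been scored, which is exactly the cost the application sets out to avoid.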

The existing top-k rule mining algorithms discover entity enhancement rules in data in a depth-first or breadth-first single-machine mode, which is essentially a process of enumerating all permutations and combinations of predicates over the whole data set D. Any combination of one or more of the possible predicates may form a valid rule together with the REE result e. Therefore, in order to perform rule mining, the existing methods need to test all permutations and combinations of predicates. For each valid rule φ, its ranking score score(φ) is calculated and recorded. Once the ranking scores of all valid rules are recorded, the k top-ranked rules can be obtained and returned as the result of the top-k rule mining algorithm.

When the user is not satisfied with the currently returned rule, the search is typically continued for the next set of top-k rules, however, to meet this requirement of the user, existing algorithms need to initially enumerate and verify all possible rules that meet the data. In practice, however, the user is often interested in only the top-ranked rules, without having to traverse all rules, and traversing all rules can impact rule discovery efficiency.

Disclosure of Invention

In view of the foregoing, the present application has been developed to provide a method and apparatus for entity-enhanced rule mining for big data that overcomes or at least partially solves the foregoing, the method comprising:

acquiring a data set and determining an entity enhancement rule set in the data set; the rule types of the entity enhanced rule set comprise complete rules and incomplete rules;

recording the corresponding predicate to be selected of the incomplete rule in the entity enhancement rule set;

Determining a sorting score corresponding to entity enhancement rules in the entity enhancement rule set, generating a first target rule set by using the first K complete rules in the sorting score, and deleting entity enhancement rules contained in the first target rule set in the entity enhancement rule set;

when an output request for updating the target rule set is received, determining the rule types of the first K entity enhancement rules by ranking score;

when the rule types of the first K entity enhancement rules contain the incomplete rules, waking up the corresponding incomplete rules according to the predicates to be selected, and generating a second target rule set according to the awakened incomplete rules and the remaining first K entity enhancement rules.

Further, the determining the entity-enhanced rule set in the dataset comprises:

determining a condition predicate set to be selected in the data set and a result predicate;

generating a target conditional predicate set by at least one iteration in the conditional predicate set to be selected according to a preset mode;

determining an entity enhancement rule according to the result predicates of the target condition predicate set;

generating an entity enhancement rule set according to at least one entity enhancement rule.

Further, the method further comprises:

judging whether a subset of a target conditional predicate set in the entity enhancement rule to be identified exists or not according to any entity enhancement rule to be identified, wherein the subset is matched with a result predicate of the entity enhancement rule to be identified;

and if the subset does not exist, the rule type of the entity enhancement rule to be identified is a complete rule.

Further, the method further comprises:

aiming at any entity enhancement rule to be identified, stopping generating a target predicate set by iteration when a preset stop expansion condition is met in the process of generating the target condition predicate set by at least one iteration in a condition predicate set to be selected according to a preset mode, and determining the rule type of the entity enhancement rule to be identified as an incomplete rule;

the preset stop expansion condition comprises that the ranking score of the entity enhancement rule to be identified ranks after the K-th.

Further, the method further comprises:

constructing a preset heap; the heap is used for storing the predicates to be selected corresponding to the incomplete rules in the entity enhancement rule set, and for storing the entity enhancement rule set after the target rule set has been deleted from it.

Further, the method further comprises:

and when the rule types of the current K entity enhancement rules do not contain the incomplete rule, generating a second target rule set according to the current first K entity enhancement rules.

Further, the determining the ranking score of the entity enhancement rule in the entity enhancement rule set comprises:

dividing the data set into a plurality of sets to be processed corresponding to entity enhancement rules in an entity enhancement rule set based on the result predicates;

distributing the set to be processed to a plurality of processing units; the processing unit is used for outputting the preliminary ordering information of the corresponding entity enhancement rule according to the received set to be processed;

and processing the preliminary ranking information of the plurality of processing units to obtain ranking scores corresponding to the entity enhancement rules.

An entity enhanced rule mining apparatus applied to big data, comprising:

the acquisition module is used for acquiring a data set and determining an entity enhancement rule set in the data set; the rule types of the entity enhanced rule set comprise complete rules and incomplete rules;

the predicate record module is used for recording the corresponding predicate to be selected of the incomplete rule in the entity enhancement rule set;

The first target rule set processing module is used for determining a sorting score corresponding to entity enhancement rules in the entity enhancement rule set, generating a first target rule set from the first K complete rules in the sorting score, and deleting entity enhancement rules contained in the first target rule set in the entity enhancement rule set;

the rule type determining module is used for determining rule types of the first K entity enhancement rules in the ordering scores when receiving an output request for updating the target rule set;

and the second target rule set generation module is used for waking up the corresponding incomplete rule according to the predicate to be selected when the rule types of the current K entity enhancement rules contain the incomplete rule, and generating a second target rule set according to the incomplete rule after waking up and the rest of the previous K entity enhancement rules.

A computer device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, which when executed by the processor implements the steps of the entity enhanced rule mining method as described above applied to big data.

A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of an entity enhanced rule mining method as described above applied to big data.

The application has the following advantages:

In the prior art, when the user is not satisfied with the currently returned rules and wants the next set of top-k rules, the existing algorithms need to have enumerated and verified, at the outset, all possible rules that hold on the data. In practice, however, the user is often interested only in the top-ranked rules; traversing all rules is unnecessary and reduces rule discovery efficiency. In contrast, in the embodiments of the present application, when entity enhancement rules are mined from the data, their rule types are divided into complete rules and incomplete rules, the predicates to be selected corresponding to the incomplete rules are recorded, and the first K complete rules by ranking score are output as a first target rule set; that is, the top-K rules mined from the data set are output without traversing all rules. When an output request from the user for updating the target rule set is received, it is determined that the user needs the next K entity enhancement rules; if the current first K entity enhancement rules contain incomplete rules, the corresponding incomplete rules are woken up according to the predicates to be selected, and a second target rule set is generated from the awakened incomplete rules and the remaining first K entity enhancement rules, which avoids enumerating and verifying all possible rules that hold on the data. Specifically, the method comprises: acquiring a data set and determining an entity enhancement rule set in the data set, the rule types of the entity enhancement rule set comprising complete rules and incomplete rules; recording the predicates to be selected corresponding to the incomplete rules in the entity enhancement rule set; determining the ranking scores corresponding to the entity enhancement rules in the entity enhancement rule set, generating a first target rule set from the first K complete rules by ranking score, and deleting from the entity enhancement rule set the entity enhancement rules contained in the first target rule set; when an output request for updating the target rule set is received, determining the rule types of the first K entity enhancement rules by ranking score; and when the rule types of the first K entity enhancement rules contain the incomplete rules, waking up the corresponding incomplete rules according to the predicates to be selected, and generating a second target rule set, namely the next top-K rules, from the awakened incomplete rules and the remaining first K entity enhancement rules. In this way, the defect that all possible rules that hold on the data must be enumerated and verified at the beginning of rule mining in order to serve a user who is dissatisfied with the currently returned rules is overcome: the top-K rules can be output without traversing all rules in the data set, the K rules with the highest ranking scores can be returned at any time on the user's request without obtaining all rules in advance, and the efficiency of mining rules from the data is improved.

Drawings

In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.

FIG. 1 is a flowchart of steps for entity enhanced rule mining for big data according to one embodiment of the present application;

FIG. 2 is a flowchart of an entity enhancement rule mining method applied to big data according to an embodiment of the present application;

FIG. 3 is a block diagram of an entity enhanced rule mining apparatus for big data according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present application more comprehensible, the present application is described in further detail below with reference to the accompanying drawings and detailed description. It will be apparent that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

The inventors found by analyzing the prior art that:

entity enhancement rules (Rules for Entity Enhancing), REE for short. The basic component of REE is predicate p, defined as follows:

p ::= R(t) | t.A ◎ c | t.A ◎ s.B | M(t.A, s.B)

where ◎ is an operator, which may be = (equal) or ≠ (unequal); R(t) states that t is a tuple variable of the relation table R; t.A denotes attribute A of the variable t; M is a machine learning model that returns true if t.A and s.B are related, and false otherwise. t.A ◎ c contains a constant and is called a constant predicate; t.A ◎ s.B contains no constant and is called a variable predicate; M(t.A, s.B) is called a machine learning predicate.

Based on predicates, an REE φ is defined as φ: X → e, where (1) X is a conjunction of predicates, called the condition of the REE; and (2) e is a predicate, called the result of the REE.

One specific example of REEs is as follows:

Express(t) ∧ Express(s) ∧ t.addressee = s.addressee ∧ t.address = "A province B city" → s.postal code = "XYZZZZ"

This REE describes the following scenario: if two express parcels t and s have the same addressee, and t is addressed to "A province B city", then the postal code of s must be "XYZZZZ".
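To make the semantics of such a rule concrete, the sketch below checks the example REE on a small in-memory table. The table contents, the key names (`addressee`, `address`, `postal_code`) and the choice to test every ordered pair of tuples, including a tuple paired with itself, are assumptions made for illustration; the application does not prescribe a data layout.

```python
from itertools import product

# Hypothetical toy relation "Express"; all values are made up.
express = [
    {"addressee": "Li", "address": "A province B city", "postal_code": "XYZZZZ"},
    {"addressee": "Li", "address": "C province D city", "postal_code": "XYZZZZ"},
    {"addressee": "Wu", "address": "E province F city", "postal_code": "QRSSSS"},
]

def condition(t, s):
    """X: t.addressee = s.addressee AND t.address = 'A province B city'."""
    return t["addressee"] == s["addressee"] and t["address"] == "A province B city"

def result(t, s):
    """e: s.postal_code = 'XYZZZZ'."""
    return s["postal_code"] == "XYZZZZ"

def violations(rows, cond, res):
    """Return the ordered pairs (t, s) that satisfy X but not e."""
    return [(t, s) for t, s in product(rows, repeat=2) if cond(t, s) and not res(t, s)]

rule_holds = not violations(express, condition, result)

# A row that breaks the rule: same addressee as the first row, other postal code.
bad_row = {"addressee": "Li", "address": "C province D city", "postal_code": "AAAAAA"}
broken = violations(express + [bad_row], condition, result)
```

On the toy table the rule holds; after adding `bad_row`, exactly one pair violates it (the first row as t, `bad_row` as s).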

The existing top-k rule mining algorithms discover REE rules in data in a depth-first or breadth-first single-machine mode, which is essentially a process of enumerating all permutations and combinations of predicates over the whole data set D. Any combination of one or more of the possible predicates may form a valid rule together with the REE result e. Therefore, in order to perform rule mining, the existing methods need to try all permutations and combinations of predicates. For each valid rule φ, its ranking score score(φ) is calculated and recorded. Once the ranking scores of all valid rules are recorded, the k top-ranked rules can be obtained and returned as the result of the top-k rule mining algorithm. Since the complete rule ordering is known, when the user is not satisfied with the current top-k rules, the next k highest-scoring rules can be taken from the sequence and returned as the result. When the data is large-scale, the efficiency of such a brute-force single-machine algorithm is undoubtedly low.

In order to obtain the top-k rules, the existing methods not only need to enumerate and verify all possible rules that hold on the data D, but also cannot effectively parallelize the calculation. When the user is not satisfied with the current top-k rules, these methods also cannot efficiently return the next k highest-scoring rules to the user in the way a search engine would. Therefore, the mining efficiency of existing top-k rule discovery is very low.

It should be noted that, in any embodiment of the present invention, the present invention may be applicable to entity enhancement rules composed of one or more predicates, and the embodiment of the present invention does not limit the applicable entity enhancement rules.

Referring to fig. 1, an entity enhancement rule mining method applied to big data according to an embodiment of the present application is shown; the method comprises the following steps:

s110, acquiring a data set and determining an entity enhancement rule set in the data set; the rule types of the entity enhanced rule set comprise complete rules and incomplete rules;

s120, recording corresponding predicate candidates of incomplete rules in the entity enhancement rule set;

s130, determining a sorting score corresponding to entity enhancement rules in the entity enhancement rule set, generating a first target rule set by using the first K complete rules in the sorting score, and deleting entity enhancement rules contained in the first target rule set in the entity enhancement rule set;

s140, when an output request aiming at an updating target rule set is received, determining rule types of the first K entity enhancement rules in the ordering score;

s150, when rule types of the current K entity enhancement rules contain the incomplete rules, waking up the corresponding incomplete rules according to the predicate to be selected, and generating a second target rule set according to the incomplete rules after waking up and the rest of the first K entity enhancement rules.

In an embodiment of the present application, a data set is acquired and an entity enhancement rule set in the data set is determined, the rule types of the entity enhancement rule set comprising complete rules and incomplete rules; the predicates to be selected corresponding to the incomplete rules in the entity enhancement rule set are recorded; the ranking scores corresponding to the entity enhancement rules in the entity enhancement rule set are determined, a first target rule set is generated from the first K complete rules by ranking score, and the entity enhancement rules contained in the first target rule set are deleted from the entity enhancement rule set; when an output request for updating the target rule set is received, the rule types of the first K entity enhancement rules by ranking score are determined; and when the rule types of the first K entity enhancement rules contain the incomplete rules, the corresponding incomplete rules are woken up according to the predicates to be selected, and a second target rule set, namely the next top-K rules, is generated from the awakened incomplete rules and the remaining first K entity enhancement rules. In this way, the defect that all possible rules that hold on the data must be enumerated and verified at the beginning of rule mining in order to serve a user who is dissatisfied with the currently returned rules is overcome: the top-K rules can be output without traversing all rules in the data set, the K rules with the highest ranking scores can be returned at any time on the user's request without obtaining all rules in advance, and the efficiency of mining rules from the data is improved.

Next, an entity enhancement rule mining method applied to big data in the present exemplary embodiment will be further described.

Acquiring a dataset and determining an entity enhancement rule set in the dataset as described in the step S110; the rule categories of the entity enhanced rule set include complete rules and incomplete rules.

The method of acquiring the data set is not limited in this embodiment, and may be read from a local storage medium or received online. Alternatively, the data set may be a dynamically updated state, i.e., the data set may be a timed or non-timed change of data.

In one embodiment of the present invention, the specific process of determining the entity-enhanced rule set in the dataset described in step S110 may be further described in conjunction with the following description, and may specifically include: determining a condition predicate set to be selected in the data set and a result predicate; generating a target conditional predicate set by at least one iteration in the conditional predicate set to be selected according to a preset mode; determining an entity enhancement rule according to the result predicates of the target condition predicate set; generating an entity enhancement rule set according to at least one entity enhancement rule.

A to-be-selected conditional predicate set P_re and a target conditional predicate set P_sel can be constructed. The result predicate e is extracted from the data set, and all identified possible predicates are stored in the to-be-selected conditional predicate set P_re. The target conditional predicate set P_sel is initially an empty set and is used to store the predicates selected as the condition of the REE. By traversing the search space in a breadth-first manner, predicates in P_re are iteratively selected and added to P_sel, a process also called expanding P_sel, until one of the following stop conditions is met: (1) P_re becomes an empty set; or (2) P_sel → e is a valid REE rule, expressed as φ: X → e. The entity enhancement rules obtained by mining the data set are then collected into an entity enhancement rule set, which is stored.
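The breadth-first expansion of P_sel can be sketched as follows. Several simplifications are assumptions of this sketch and not part of the application: predicates are constant predicates over a single tuple, written as (attribute, value) pairs; a rule counts as valid when at least one row matches X and every matching row also satisfies e; and supersets of an already valid condition are skipped.

```python
def satisfies(row, preds):
    """True if the row satisfies every (attribute, value) predicate."""
    return all(row.get(attr) == val for attr, val in preds)

def is_valid(rows, X, e):
    """Assumed validity test: X matches some row, and all matches satisfy e."""
    matched = [r for r in rows if satisfies(r, X)]
    return bool(matched) and all(satisfies(r, [e]) for r in matched)

def mine_rules(rows, P_re, e):
    """Expand P_sel level by level with predicates taken from P_re."""
    rules = []
    frontier = [frozenset()]          # P_sel starts as the empty set
    while frontier:
        next_frontier = set()
        for P_sel in frontier:
            for p in P_re:
                if p in P_sel:
                    continue
                X = P_sel | {p}
                if any(found <= X for found in rules):
                    continue          # a subset is already a valid rule
                if is_valid(rows, X, e):
                    rules.append(X)   # stop condition (2): X -> e is valid
                else:
                    next_frontier.add(X)
        # Stop condition (1): nothing left to add, so the frontier empties.
        frontier = sorted(next_frontier, key=sorted)
    return rules

# Hypothetical toy data.
rows = [
    {"city": "B", "province": "A", "zip": "XYZZZZ"},
    {"city": "B", "province": "A", "zip": "XYZZZZ"},
    {"city": "D", "province": "A", "zip": "QRSSSS"},
]
P_re = [("city", "B"), ("province", "A")]
found = mine_rules(rows, P_re, ("zip", "XYZZZZ"))
```

On the toy data, only the condition {city = B} yields a valid rule; {province = A} is violated by the third row, and its only extension is skipped because it contains {city = B}.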

In the embodiment of the application, rule types of the entity enhancement rule set are divided into two types, wherein one type is a complete rule, and the other type is an incomplete rule.

In the embodiment of the application, whether any entity enhancement rule is a complete rule is identified by the following steps: judging whether a subset of a target conditional predicate set in the entity enhancement rule to be identified exists or not according to any entity enhancement rule to be identified, wherein the subset is matched with a result predicate of the entity enhancement rule to be identified; and if the subset does not exist, the entity enhancement rule to be identified is a complete rule.

A complete rule is also known as a minimal rule φ: X → e. Specifically, if there is a known REE rule φ′: X′ → e such that X′ is a subset of X, then φ is not a minimal rule, that is, φ is currently not a complete rule. Conversely, if for a rule φ: X → e no subset X′ exists such that φ′: X′ → e is a rule, then φ is a complete rule.
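The minimality test can be sketched as below. The sketch reads "subset" as proper subset, since X is trivially a subset of itself; that reading, and the representation of rules as (condition set, result) pairs, are assumptions for illustration.

```python
def is_complete(X, e, valid_rules):
    """True if no known valid rule X' -> e has X' as a proper subset of X."""
    X = frozenset(X)
    return not any(e2 == e and frozenset(X2) < X for X2, e2 in valid_rules)

# Hypothetical known rule: {p1} -> e.
known = [({"p1"}, "e")]
```

With this `known` list, `is_complete({"p1", "p2"}, "e", known)` is false because {p1} already implies e, while `is_complete({"p2", "p3"}, "e", known)` is true.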

In the embodiment of the application, whether any entity enhancement rule is an incomplete rule is identified through the following steps: aiming at any entity enhancement rule to be identified, stopping generating a target predicate set by iteration when a preset stop expansion condition is met in the process of generating the target condition predicate set by at least one iteration in a condition predicate set to be selected according to a preset mode, and determining that the entity enhancement rule to be identified is an incomplete rule; the preset stop expansion condition comprises that the ordering score of the entity enhancement rule to be identified is behind the Kth.

An incomplete rule is also called a rule whose expansion (the expansion of its target predicate set) has not been completed. If, during the expansion process, it is found that no rule φ obtained by continuing to expand X can score higher than the top-k rules in the current heap Σ, the expansion of X is stopped even though the stop conditions of the expansion have not yet been satisfied, and φ: X → e is treated as an incomplete rule.
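The stop-expansion test can be sketched as a comparison against the k-th best score currently held. How an upper bound on the score of all further expansions of X is obtained is not specified here, so `upper_bound` is a hypothetical input supplied by the caller.

```python
import heapq

def kth_best_score(scores, k):
    """Score of the k-th best rule so far, or -inf if fewer than k exist."""
    if len(scores) < k:
        return float("-inf")
    return heapq.nlargest(k, scores)[-1]

def should_stop_expansion(upper_bound, scores_in_heap, k):
    """Stop when no expansion of X can beat the current k-th best rule."""
    return upper_bound <= kth_best_score(scores_in_heap, k)

# Hypothetical scores of the rules currently in the heap.
current_scores = [0.9, 0.8, 0.5]
```

With k = 2 the threshold is 0.8, so a candidate whose expansions are bounded by 0.7 is parked as an incomplete rule, while one bounded by 0.85 keeps expanding. When fewer than k rules are known, nothing is pruned.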

In this embodiment of the present application, the identification of whether an entity enhancement rule is a complete rule or an incomplete rule may be invoked at any time and may be invoked multiple times.

The "record the corresponding alternative predicates of incomplete rules in the entity-enhanced rule set" described in step S120 may be further described in connection with the following description.

In the above case, the incomplete rule φ cannot be expanded into one of the top-k rules with the k highest ranking scores. However, in the later instant discovery (step S150), once some of the high-scoring rules have been returned, φ may still become a top-k rule through expansion. Therefore, the to-be-selected predicate set corresponding to the incomplete rule φ needs to be stored so that the expansion of φ can be resumed at any time. This strategy of resuming the expansion of φ at any time is also called the lazy wake-up strategy.

In an embodiment of the present application, the entity enhancement rule mining method applied to big data further includes: constructing a preset heap; the heap is used for storing the predicates to be selected corresponding to the incomplete rules in the entity enhancement rule set, and for storing the entity enhancement rule set after the target rule set has been deleted from it.

According to the embodiment of the invention, the heap sigma can be constructed, the complete rule and/or the incomplete rule and the predicate to be selected corresponding to the incomplete rule can be stored in the heap sigma, and after any rule is output as the target rule set, the target rule set which is already output is deleted in the heap sigma, so that the problems that when a user needs to output the next group of top-k rules, the same rule is repeatedly output, the output result is not matched with the user requirement, and the resource waste is caused are avoided.

In an embodiment of the present invention, the specific process in step S150 of "when the rule types of the first K entity enhancement rules contain the incomplete rules, waking up the corresponding incomplete rules according to the predicates to be selected, and generating a second target rule set according to the awakened incomplete rules and the remaining first K entity enhancement rules" may be further explained in conjunction with the following description.

Since the predicates to be selected corresponding to the incomplete rule were stored in the preceding steps, the continued expansion of the incomplete rule, that is, the continued iterative updating of its target predicate set, can be completed very quickly, thereby waking up the incomplete rule. Once an awakened incomplete rule has been expanded into a complete rule, it remains in the heap Σ; this continues until the k rules with the highest ranking scores stored in the heap Σ are all complete rules, at which point the top k complete rules by current ranking score are returned to the user as the second target rule set. When the user is not satisfied with the currently returned rules, step S150 can immediately find the next set of top-k rules with the highest ranking scores for the user; this is also called instant discovery.

In an embodiment of the present invention, the method further includes: and when the rule types of the current K entity enhancement rules do not contain the incomplete rule, generating a second target rule set according to the current first K entity enhancement rules.

If only complete rules exist among the current top k rules in the heap Σ, no incomplete rule needs to be woken up, that is, the iteration of the target predicate set corresponding to an incomplete rule does not need to be continued, and the top k complete rules by current ranking score are simply returned to the user as the second target rule set. Combined with step S150, this further increases the speed of determining the next top-k rule set when the user is not satisfied with the currently returned rules.
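The instant discovery loop with lazy wake-up can be sketched as follows. The sketch assumes that the score stored with an incomplete rule is an upper bound on the score of anything it expands into, so that a complete rule at the top of the heap can be output safely; the rule dictionaries, the `wake` callback and `toy_wake` are hypothetical names introduced for illustration.

```python
import heapq
from itertools import count

_tie = count()  # tie-breaker so that equal scores never compare rule dicts

def push(heap, rule):
    """Store a rule in the heap, highest ranking score first."""
    heapq.heappush(heap, (-rule["score"], next(_tie), rule))

def next_top_k(heap, k, wake):
    """Return the next k complete rules, waking incomplete ones on the way."""
    output = []
    while heap and len(output) < k:
        rule = heapq.heappop(heap)[2]
        if rule["complete"]:
            output.append(rule)       # popped, so it is never output twice
        else:
            for woken in wake(rule):  # resume expansion from stored predicates
                push(heap, woken)
    return output

def toy_wake(rule):
    """Hypothetical expansion: add the first stored to-be-selected predicate."""
    return [{"name": rule["name"] + "+" + rule["candidates"][0],
             "score": rule["score"] - 5, "complete": True, "candidates": []}]

sigma = []
push(sigma, {"name": "A", "score": 90, "complete": True, "candidates": []})
push(sigma, {"name": "B", "score": 80, "complete": True, "candidates": []})
push(sigma, {"name": "C", "score": 70, "complete": False, "candidates": ["p3"]})
push(sigma, {"name": "D", "score": 60, "complete": True, "candidates": []})

first = next_top_k(sigma, 2, toy_wake)   # no incomplete rule in the top 2
second = next_top_k(sigma, 2, toy_wake)  # wakes C before answering
```

The first request returns A and B without touching C. The second request meets the incomplete rule C, expands it into "C+p3" with score 65, and returns "C+p3" and D.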

Meanwhile, in practical application, in order to ensure that the rule found in real time is an unrepeated rule, an additional check is required for the output complete rule. The rule is output only if it cannot be deduced from other already output rules.

In the embodiment of the present application, a specific calculation formula of the ranking score is not limited.

In addition, the inventors found by analyzing the prior art that: the existing top-k rule mining algorithms discover REE rules in data in a depth-first or breadth-first single-machine mode, which is essentially a process of enumerating all permutations and combinations of predicates over the whole data set D. Any combination of one or more of the possible predicates may form a valid rule together with the REE result e. Therefore, in order to perform rule mining, the existing methods need to try all permutations and combinations of predicates. For each valid rule φ, its ranking score score(φ) is calculated and recorded. Once the ranking scores of all valid rules are recorded, the k top-ranked rules can be obtained as the result of the top-k rule mining algorithm. Since the complete rule ordering is known, when the user is not satisfied with the current top-k rules, the next k highest-scoring rules can be taken from the sequence and returned as the result. When the data is large-scale, the efficiency of such a brute-force single-machine algorithm is undoubtedly low. Even if more computing resources are used, an improvement in the operating efficiency of rule mining is not necessarily guaranteed.

In an alternative embodiment of the present application, the "calculate ranking score of each entity enhancement rule" described in step S130 may include: dividing the data set into a plurality of sets to be processed based on the result predicates; distributing the set to be processed to a plurality of processing units; the processing unit is used for outputting the preliminary ordering information of the corresponding entity enhancement rule according to the received set to be processed; and processing the preliminary ranking information of the plurality of processing units to obtain ranking scores.
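The divide, distribute and merge steps can be sketched with a thread pool. Everything concrete here is an assumption for illustration: each to-be-processed set contains the rows that carry the rule's result attribute, the preliminary ranking information is a pair (matching rows, rows that also satisfy the result), and the merged ranking score is their ratio. The application itself does not fix the ranking score formula.

```python
from concurrent.futures import ThreadPoolExecutor

def build_work_sets(rows, rules):
    """One to-be-processed set per rule: rows carrying its result attribute."""
    return {name: [r for r in rows if r.get(rule["result_attr"]) is not None]
            for name, rule in rules.items()}

def worker(rule, work_set):
    """Processing unit: return preliminary ranking information for one rule."""
    matched = [r for r in work_set if rule["cond"](r)]
    correct = sum(1 for r in matched
                  if r[rule["result_attr"]] == rule["result_val"])
    return len(matched), correct

def ranking_scores(rows, rules, n_workers=2):
    """Distribute the work sets, then merge the preliminary information."""
    work_sets = build_work_sets(rows, rules)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {name: pool.submit(worker, rules[name], ws)
                   for name, ws in work_sets.items()}
        prelim = {name: f.result() for name, f in futures.items()}
    return {name: (c / m if m else 0.0) for name, (m, c) in prelim.items()}

# Hypothetical toy data and rules.
rows = [
    {"city": "B", "prov": "A", "zip": "XYZZZZ"},
    {"city": "B", "prov": "A", "zip": "XYZZZZ"},
    {"city": "D", "prov": "A", "zip": "QRSSSS"},
    {"city": "B", "prov": "A", "zip": None},
]
rules = {
    "r1": {"cond": lambda r: r["city"] == "B",
           "result_attr": "zip", "result_val": "XYZZZZ"},
    "r2": {"cond": lambda r: r["prov"] == "A",
           "result_attr": "zip", "result_val": "XYZZZZ"},
}
scores = ranking_scores(rows, rules)
```

On the toy data, r1 matches two rows and both satisfy the result, while r2 matches three rows of which two satisfy it.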

Step S130 is further described below. Referring to fig. 2, a flow chart of an entity enhancement rule mining method applied to big data according to an embodiment of the present application is shown. To improve processing efficiency, a bulk synchronous parallel (BSP) computing model may be introduced in practice to discover entity enhancement rules in parallel. The specific steps are as follows:

The bulk synchronous parallel computing model is based on breadth-first search and consists of one scheduling unit (coordinator) and n processing units (workers). Under this model, the scheduling unit is responsible for generating and distributing tasks, maintaining the overall top-k rules, and load balancing, while the processing units are responsible for performing instant rule discovery in parallel. The overall computation is divided into a number of supersteps, each bounded by a fixed time.

In general, the scheduling unit maintains an overall heap Σ. In the i-th superstep, the scheduling unit records the ranking score of the rule that currently has the k-th highest score in heap Σ, denoted by Ti.

The scheduling unit first splits rule discovery into a plurality of small tasks according to the REE result e to be discovered, and each task consists of a triplet <Psel, Pre, e>, where Psel stores the predicates that have already been selected to make up the REE condition, and Pre stores the candidate predicates. Initially, Psel is an empty set and Pre is the set of all possible predicates. After the scheduling unit distributes the value of Ti and all the tasks evenly to the processing units, the processing units perform the top-k rule discovery of the i-th superstep simultaneously, each maintaining a local heap that stores the rules it discovers. When performing local rule expansion, a processing unit extracts the required data according to its assigned tasks and iteratively selects predicates from Pre to add to Psel until at least one of the following three conditions is satisfied: (1) Pre becomes an empty set; (2) Psel → e is a complete rule; (3) Psel → e is an incomplete rule.
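The task triplet and the three stopping conditions can be illustrated with the sketch below. It is a simplification under stated assumptions: it follows a single expansion path by always taking the first candidate predicate, whereas a breadth-first search would branch over every candidate, and the `classify` callback that decides whether the current rule is complete, incomplete, or should be expanded further is hypothetical. The names `Task` and `expand` are also illustrative.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple


@dataclass(frozen=True)
class Task:
    p_sel: FrozenSet[str]   # predicates already selected for the condition
    p_re: Tuple[str, ...]   # candidate predicates still available
    e: str                  # result predicate to be discovered


def expand(task: Task,
           classify: Callable[[FrozenSet[str], str], str]) -> Tuple[Task, str]:
    """Move predicates from p_re to p_sel until a stopping condition holds.

    `classify` is an assumed callback returning "complete", "incomplete",
    or "continue" for the rule p_sel -> e.
    """
    p_sel, p_re = set(task.p_sel), list(task.p_re)
    while p_re:
        p_sel.add(p_re.pop(0))
        status = classify(frozenset(p_sel), task.e)
        if status in ("complete", "incomplete"):      # conditions (2) and (3)
            return Task(frozenset(p_sel), tuple(p_re), task.e), status
    return Task(frozenset(p_sel), (), task.e), "exhausted"  # condition (1)


def toy_classify(p_sel: FrozenSet[str], e: str) -> str:
    # Toy stand-in: the rule is complete once both p1 and p2 are selected.
    return "complete" if {"p1", "p2"} <= p_sel else "continue"


start = Task(frozenset(), ("p1", "p2", "p3"), "e")
finished, status = expand(start, toy_classify)
```

For an incomplete rule, the returned task still carries the unused candidates in `p_re`, which is the information that has to be recorded so that the rule can be woken up later.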

At the end of each superstep, the processing units send the top-k rules they have found to the scheduling unit for aggregation. The scheduling unit adjusts the overall heap Σ according to the collected rules and calculates the ranking score Ti+1 to be used in the next superstep. In each superstep, the scheduling unit also performs load balancing: it splits the tasks of the processing unit W with the heaviest workload and allocates part of them to idle processing units. If W holds more than one task, half of its tasks are allocated away. If W has only one task left, that task is split into a plurality of subtasks for distribution by splitting the data corresponding to the task. When all processing units have completed all computations, parallel rule discovery ends. At that point the ranking score of each entity enhancement rule is available, and the top-k rules can be output according to the ranking scores.
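The two duties of the scheduling unit at the end of a superstep can be sketched as follows. This is an illustration under assumptions that the application does not spell out: workload is measured by the number of queued tasks, a single task is split into exactly two subtasks, and the `split_task` callback that divides a task's data is hypothetical, as are the function names.

```python
import heapq
from typing import Callable, Dict, List


def kth_highest_score(scores: List[float], k: int) -> float:
    """Threshold for the next superstep: the k-th highest score collected so
    far, or minus infinity while fewer than k rules are known."""
    top = heapq.nlargest(k, scores)
    return top[-1] if len(top) == k else float("-inf")


def rebalance(queues: Dict[str, list], split_task: Callable) -> Dict[str, list]:
    """Give every idle unit work taken from the currently heaviest unit."""
    idle = [w for w, q in queues.items() if not q]
    for w_idle in idle:
        heavy = max(queues, key=lambda w: len(queues[w]))
        tasks = queues[heavy]
        if len(tasks) > 1:
            keep = len(tasks) - len(tasks) // 2    # heavy unit keeps the larger half
            queues[heavy], queues[w_idle] = tasks[:keep], tasks[keep:]
        elif len(tasks) == 1:
            first, second = split_task(tasks[0])   # split the task's data in two
            queues[heavy], queues[w_idle] = [first], [second]
    return queues


threshold = kth_highest_score([0.9, 0.5, 0.7, 0.8], k=3)
balanced = rebalance({"w1": ["t1", "t2", "t3", "t4"], "w2": [], "w3": ["t5"]},
                     split_task=lambda t: (t, t))
```

Passing the threshold to the processing units lets each of them discard candidates that cannot enter the overall top k, which is one plausible use of Ti; the application itself only states that the value is distributed.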

In the embodiment of the application, through the processing steps related to instant discovery, the next k rules with the highest ranking scores can be returned at any time according to the user's requirements, without all rules having to be obtained in advance. Through the processing steps related to parallel discovery, which are based on the bulk synchronous parallel computing model, tasks are generated, distributed, and adjusted by load balancing, and the scheduling unit and the processing units cooperate to perform parallel top-k rule discovery. Because the running time of top-k rule discovery is reduced when more computing resources are used, the method has parallel scalability.

The embodiments of the present application are illustrated below by way of an example. Assume k=3 and that the current heap Σ holds a list of rules ordered by ranking score from high to low. A rule with superscript c is a complete rule; a rule with superscript p is an incomplete rule. Since the top three rules include incomplete rules, the incomplete rules are expanded (woken up) in descending order of ranking score, starting with the highest-ranked incomplete rule. Suppose that this expansion yields three new rules and heap Σ is updated accordingly. If the top three rules in Σ are now all complete rules, these three rules are returned as the result of top-k rule discovery.
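The wake-up procedure in this example can be sketched in code. The sketch is illustrative only: the rule names `r1` to `r7`, their scores and the `expansions` table are hypothetical, a sorted list stands in for the heap Σ, and it is assumed that a woken incomplete rule is replaced in Σ by the rules its expansion produces.

```python
from typing import Callable, List, Tuple

# (ranking score, rule name, "c" for complete / "p" for incomplete)
Rule = Tuple[float, str, str]


def next_top_k(pool: List[Rule], k: int,
               wake_up: Callable[[Rule], List[Rule]]) -> List[Rule]:
    """Return the top k rules once all of them are complete, waking up
    incomplete rules in descending score order until that is the case.
    The returned rules are deleted from `pool`."""
    while True:
        pool.sort(key=lambda r: r[0], reverse=True)
        top = pool[:k]
        pending = [r for r in top if r[2] == "p"]
        if not pending:
            del pool[:k]
            return top
        rule = pending[0]               # highest-scoring incomplete rule
        pool.remove(rule)
        pool.extend(wake_up(rule))      # its expansions replace it


# Hypothetical expansion results for the incomplete rule r2.
expansions = {"r2": [(0.95, "r5", "c"), (0.92, "r6", "c"), (0.50, "r7", "p")]}
pool = [(0.9, "r1", "c"), (0.8, "r2", "p"), (0.7, "r3", "c"), (0.6, "r4", "c")]
first = next_top_k(pool, 3, lambda r: expansions.get(r[1], []))
```

The top k are only committed after every rule among them is complete, because a newly produced rule may outrank a complete rule that was already near the top. The sketch assumes that expansion eventually produces complete rules or nothing; otherwise the loop would not terminate.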

In practical application, rule mining was performed on the same data sets using both the entity enhancement rule mining method applied to big data and existing methods. Experiments were carried out on a plurality of data sets, and the efficiency of the instant top-k rule discovery algorithm and the parallel rule discovery algorithm used in this scheme was comprehensively compared with that of existing rule discovery algorithms. The results show that:

(1) When a user repeatedly requests the next k rules with the highest scores, the advantage of the instant rule discovery algorithm is obvious. Because the algorithm effectively maintains the incomplete rules, their expansion can be quickly awakened and resumed, and the algorithm is 95 times faster than the traditional algorithm. Moreover, the next k rules may already have been enumerated and stored, so once the instant discovery algorithm has accumulated enough incomplete rules, the time needed to discover the next k rules can be reduced further.

(2) The improvement brought by parallel rule discovery is pronounced when more computing resources are used. For example, when rule discovery is performed on a data set, the processing efficiency improves by 3.15 times when the number of computing devices of the same kind is increased from 4 to 20. It can therefore be determined that the parallel rule discovery algorithm is parallelly scalable.

For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.

Referring to fig. 3, an entity enhancement rule mining apparatus applied to big data according to an embodiment of the present application is shown; the apparatus specifically comprises:

an acquisition module 310, configured to acquire a data set and determine an entity enhancement rule set in the data set; the rule types of the entity enhanced rule set comprise complete rules and incomplete rules;

a predicate-to-be-selected recording module 320, configured to record a predicate-to-be-selected corresponding to an incomplete rule in the entity-enhanced rule set;

a first target rule set processing module 330, configured to determine a ranking score corresponding to an entity enhancement rule in the entity enhancement rule set, generate a first target rule set from first K complete rules in the ranking score, and delete an entity enhancement rule included in the first target rule set in the entity enhancement rule set;

a rule type determining module 340, configured to determine rule types of the first K entity enhancement rules in the ranking score when receiving an output request for updating the target rule set;

a second target rule set generating module 350, configured to wake up the corresponding incomplete rules according to the predicates to be selected when the rule types of the first K entity enhancement rules include incomplete rules, and generate a second target rule set according to the awakened incomplete rules and the remaining rules among the first K entity enhancement rules.

In an embodiment of the present invention, the acquisition module 310 includes:

a predicate determination submodule for determining a conditional predicate set to be selected in the dataset and a result predicate;

the target conditional predicate iteration submodule is used for generating a target conditional predicate set by iterating at least one of the conditional predicate sets to be selected according to a preset mode;

and the rule determination submodule is used for determining entity enhancement rules according to the target condition predicate set and the result predicate.

In an embodiment of the invention, the apparatus further comprises:

the complete rule recognition module is used for judging, for any entity enhancement rule to be recognized, whether there exists a subset of the target condition predicate set of that rule that matches the result predicate of that rule; if no such subset exists, the entity enhancement rule to be recognized is a complete rule.
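The subset test performed by the complete rule recognition module can be sketched as follows. This is a hedged illustration: the `holds` callback, which decides whether a set of condition predicates matches the result predicate on the data, is hypothetical, and the sketch additionally assumes that a complete rule must itself hold and that only proper subsets are examined. The exhaustive enumeration is exponential in the number of predicates and is shown only for clarity.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Set


def is_complete(conds: Set[str], e: str,
                holds: Callable[[FrozenSet[str], str], bool]) -> bool:
    """A rule conds -> e is treated as complete when it holds and no proper
    subset of its condition predicates already matches e."""
    if not holds(frozenset(conds), e):
        return False
    for size in range(len(conds)):                 # sizes 0 .. len(conds) - 1
        for subset in combinations(sorted(conds), size):
            if holds(frozenset(subset), e):
                return False
    return True


def toy_holds(conds: FrozenSet[str], e: str) -> bool:
    # Toy stand-in: e is matched whenever both a and b are in the condition.
    return {"a", "b"} <= conds


minimal = is_complete({"a", "b"}, "e", toy_holds)
padded = is_complete({"a", "b", "c"}, "e", toy_holds)
```

In the toy data, `{a, b}` is complete, while `{a, b, c}` is not, because its subset `{a, b}` already matches the result predicate.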

In an embodiment of the invention, the apparatus further comprises:

the incomplete rule recognition module is used for, for any entity enhancement rule to be recognized, stopping the iterative generation of the target condition predicate set when a preset stop-expansion condition is met while the target condition predicate set is being generated by iterating at least one of the conditional predicate sets to be selected in the preset manner, and determining that the entity enhancement rule to be recognized is an incomplete rule.

In an embodiment of the invention, the apparatus further comprises:

the heap construction module is used for constructing a preset heap; the heap is used for storing the predicates to be selected corresponding to the incomplete rules in the entity enhancement rule set and storing the entity enhancement rule set after the target rule set has been deleted.

In an embodiment of the invention, the apparatus further comprises:

and the direct generation module is used for generating a second target rule set according to the first K entity enhancement rules when the rule types of the first K entity enhancement rules do not contain incomplete rules.

In one embodiment of the present invention, the first target rule set processing module 330 includes:

a distribution sub-module for distributing the set to be processed to a plurality of processing units; the processing unit is used for outputting the preliminary ordering information of the corresponding entity enhancement rule according to the received set to be processed;

and the integration sub-module is used for processing the preliminary ranking information of the plurality of processing units to obtain ranking scores corresponding to the entity enhancement rules.

Referring to fig. 4, a computer device for implementing the entity enhancement rule mining method applied to big data according to the present invention is shown, which may specifically include the following:

The computer device 12 is embodied in the form of a general purpose computing device, and the components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.

Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.

Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (commonly referred to as a "hard disk drive"). Although not shown in fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk such as a CD-ROM, DVD-ROM, or other optical media may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. The memory may include at least one program product having a set (e.g., at least one) of program modules 42, the program modules 42 being configured to carry out the functions of embodiments of the invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, a memory, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules 42, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.

The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, camera, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown in fig. 4, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, data backup storage systems 34, and the like.

The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the entity-enhanced rule mining method for big data provided by the embodiment of the present invention.

That is, when executing the program, the processing unit 16 implements: acquiring a data set and determining an entity enhancement rule set in the data set, where the rule types of the entity enhancement rule set comprise complete rules and incomplete rules; recording the predicates to be selected corresponding to the incomplete rules in the entity enhancement rule set; determining the ranking scores corresponding to the entity enhancement rules in the entity enhancement rule set, generating a first target rule set from the first K complete rules by ranking score, and deleting from the entity enhancement rule set the entity enhancement rules contained in the first target rule set; when an output request for updating the target rule set is received, determining the rule types of the first K entity enhancement rules by ranking score; and when the rule types of the first K entity enhancement rules contain incomplete rules, waking up the corresponding incomplete rules according to the predicates to be selected, and generating a second target rule set according to the awakened incomplete rules and the remaining rules among the first K entity enhancement rules.

In an embodiment of the present invention, the present invention further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an entity enhanced rule mining method applied to big data as provided in all embodiments of the present application:

that is, when executed by a processor, the program implements: acquiring a data set and determining an entity enhancement rule set in the data set, where the rule types of the entity enhancement rule set comprise complete rules and incomplete rules; recording the predicates to be selected corresponding to the incomplete rules in the entity enhancement rule set; determining the ranking scores corresponding to the entity enhancement rules in the entity enhancement rule set, generating a first target rule set from the first K complete rules by ranking score, and deleting from the entity enhancement rule set the entity enhancement rules contained in the first target rule set; when an output request for updating the target rule set is received, determining the rule types of the first K entity enhancement rules by ranking score; and when the rule types of the first K entity enhancement rules contain incomplete rules, waking up the corresponding incomplete rules according to the predicates to be selected, and generating a second target rule set according to the awakened incomplete rules and the remaining rules among the first K entity enhancement rules.

Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

While preferred embodiments of the present application have been described, additional variations and modifications to those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the present application.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal device comprising the element.

The method and apparatus for entity enhancement rule mining applied to big data provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present application, and the above description of the embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may, in accordance with the ideas of the present application, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims

1. An entity enhancement rule mining method applied to big data is characterized in that,

2. The method of claim 1, wherein the determining the set of entity-enhanced rules in the dataset comprises:

3. The method according to claim 2, wherein the method further comprises:

4. A method according to claim 3, characterized in that the method further comprises:

5. The method according to claim 4, wherein the method further comprises:

6. The method of claim 5, wherein the method further comprises:

7. The method of claim 2, wherein the determining the ranking score of the entity enhancement rule in the entity enhancement rule set comprises:

8. An entity enhanced rule mining apparatus for big data, comprising:

9. A computer device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, which computer program, when executed by the processor, implements the method of any one of claims 1 to 7.

10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 7.