WO2023236238A1

WO2023236238A1 - Relational data-based data processing method and apparatus thereof

Info

Publication number: WO2023236238A1
Application number: PCT/CN2022/099183
Authority: WO
Inventors: 谢珉; 王尧舒; 樊文飞
Original assignee: 深圳计算科学研究院
Priority date: 2022-06-09
Filing date: 2022-06-16
Publication date: 2023-12-14
Also published as: CN115033650A

Abstract

The present application provides a relational data-based data processing method and an apparatus thereof, being used for recovering target data of a missing data segment by means of a data relation, and verifying the validity of the recovered target data. The method comprises: obtaining target data, and performing data screening according to the word meaning of the target data to determine sampling data; generating a template predicate according to the sampling data, and constructing a target template according to the template predicate; performing data screening on the target data according to constant predicates to construct a predicate total set; performing association rule mining according to the predicate total set to generate a candidate rule total set; and determining a valid rule in the candidate rule total set according to the target data, and determining valid data according to the valid rule. Therefore, when rule discovery with a constant is performed in large-scale relational data, a valid rule with the constant can also be found without enumerating all possible constants, thereby greatly improving the execution efficiency of rule discovery.

Description

A data processing method and device based on relational data

Technical field

The present application relates to the field of data processing, in particular to a data processing method and device based on relational data.

Background technique

Rule discovery in large-scale relational data is a time-consuming and labor-intensive process. When constants are allowed in rules, the cost of rule discovery grows exponentially.

For example, consider the following simple conditional functional dependency (CFD):

Address="Shenzhen City, Guangdong Province"->Postal code="518000"

The scenario described by this CFD is that if the address attribute of a record is "Shenzhen City, Guangdong Province", then its postal code attribute must be "518000". This rule can be widely used for error checking and correction in relational data. Specifically, when stored data violates this rule (that is, the address attribute is "Shenzhen City, Guangdong Province" but the postal code attribute is not "518000"), it can be concluded that the data contains an error, and the data can then be corrected. In this rule, "Shenzhen City, Guangdong Province" and "518000" are both constants, and Address and Postal code are attribute names of the data.
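The error check described above can be sketched in Python. This is a minimal illustration that assumes records are plain dictionaries; the function name `cfd_violations` and the sample rows are hypothetical and do not come from the application.

```python
def cfd_violations(rows, lhs_attr, lhs_const, rhs_attr, rhs_const):
    """Return the rows that match the left-hand constant but not the right-hand one."""
    return [
        row for row in rows
        if row.get(lhs_attr) == lhs_const and row.get(rhs_attr) != rhs_const
    ]


# Hypothetical sample records, for illustration only.
rows = [
    {"Address": "Shenzhen City, Guangdong Province", "Postal code": "518000"},
    {"Address": "Shenzhen City, Guangdong Province", "Postal code": "510000"},
    {"Address": "Guangzhou City, Guangdong Province", "Postal code": "510000"},
]

bad = cfd_violations(
    rows, "Address", "Shenzhen City, Guangdong Province", "Postal code", "518000"
)
print(bad)  # the records flagged as erroneous under this CFD
```

Only records whose address matches the left-hand constant are examined, so the third record is never flagged even though its postal code differs from "518000".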

When rule discovery with constants is performed on large-scale data, it is necessary to consider not only the permutations and combinations of different data attributes, but also the constants that each attribute may match. This enumeration process is very expensive. For example, consider the following CFDs:

Address="Guangzhou City, Guangdong Province"->Postcode="510000"

Address="Dongguan City, Guangdong Province"->Postcode="523000"

Address="Foshan City, Guangdong Province"->Postcode="528010"

The scenarios described by these CFDs are similar; the only difference lies in the constants used. Although the attributes of the rules (that is, address and postal code) are all the same, the matching constants differ. If all possible matching constants in the data must be enumerated, the efficiency of rule discovery is greatly reduced: it can take days or even weeks to perform rule discovery on a relational data set of ordinary size.

The limited expressive power of CFD rules restricts their applicability in practical scenarios. To support constant predicates, CFD rule mining must enumerate all possible combinations of attributes and constants, which is time-consuming and labor-intensive.

Contents of the invention

In view of the above problems, the present application is proposed in order to provide a data processing method and device based on relational data that overcome, or at least partially solve, those problems, including:

A data processing method based on relational data. The method is used to repair target data of missing data segments through data relationships and verify the validity of the repaired target data, including:

Obtain target data, and perform data filtering to determine sampling data according to the word meaning of the target data, wherein the sampling data is a constant predicate and includes at least one;

Generate a template predicate based on the sampled data, and construct a target template based on the template predicate;

Perform data filtering on the target data according to the constant predicate to construct a predicate aggregate set;

Perform association rule mining based on the predicate set to generate a candidate rule set;

Valid rules in the total set of candidate rules are determined based on the target data, and valid data are determined based on the valid rules.

Further, the step of obtaining the target data and performing data filtering to determine sampled data based on the word meaning of the target data, wherein the sampled data is a constant predicate and includes at least one, includes:

Obtain data attributes within the target data;

Determine the word meaning type of the target data according to the data attribute, wherein the word meaning type includes constants and no constants;

The target data whose word meaning type corresponds to the constant are screened and determined as the sampled data.

Further, the step of generating a template predicate based on the sampled data and constructing a target template based on the template predicate includes:

Generate template predicates based on the sampled data;

When the template predicate has a valid value in the target data, it is determined that the template predicate is a valid predicate;

Build a target template based on the valid predicates.

Further, the step of constructing a target template based on the valid predicate includes:

Combining the valid predicates to generate permutations;

Screen the permutations and combinations to determine template predicate combinations;

The target template is constructed based on the template predicate combination and the valid predicate.

Further, the step of filtering the target data based on the constant predicate to construct a predicate aggregate set includes:

Perform data screening on the target data according to the constant predicate to determine non-constant predicate data, wherein the non-constant predicate data is a set of non-constant predicates;

Perform constant value supplementation on the non-constant predicate data according to the target template to generate constant predicate data, wherein the constant predicate data is a set of constant predicates;

The predicate total set is constructed based on the non-constant predicate set and the constant predicate set.

Further, the step of performing association rule mining based on the predicate aggregate set to generate a candidate rule set includes:

Perform a depth-first search based on the total set of predicates to generate a first set of candidate rules; or,

Perform a breadth-first search based on the total set of predicates to generate a second set of candidate rules;

The total set of candidate rules is generated according to the first set of candidate rules or the second set of candidate rules.

Further, the steps of determining effective rules in the total set of candidate rules based on the target data, and determining effective data based on the effective rules include:

Obtain each sub-candidate rule in the total set of candidate rules;

Verify the validity of each sub-candidate rule according to the target data; wherein, when sub-target data corresponding to the current sub-candidate rule exists in the target data, the current sub-candidate rule is determined to be a valid rule;

Obtain the sub-goal data corresponding to the valid rule, and mark the sub-goal data as the valid data.

This application also discloses a data processing device based on relational data. The device is used to repair target data of missing data segments through data relationships and verify the validity of the repaired target data, including:

An acquisition module is used to acquire target data, and perform data filtering to determine sampling data according to the word meaning of the target data, where the sampling data is a constant predicate and includes at least one;

A first building module, configured to generate a template predicate based on the sampled data, and construct a target template based on the template predicate;

The second building module is used to perform data filtering on the target data based on the constant predicate to build a predicate aggregate set;

A generation module, configured to perform association rule mining based on the predicate set to generate a candidate rule set;

A determining module, configured to determine valid rules in the total set of candidate rules based on the target data, and determine valid data based on the valid rules.

This application also discloses a device, including a processor, a memory, and a computer program stored on the memory and capable of running on the processor. When the computer program is executed by the processor, the steps of the data processing method based on relational data as described above are implemented.

This application also discloses a computer-readable storage medium. A computer program is stored on the computer-readable storage medium. When the computer program is executed by a processor, the steps of the data processing method based on relational data as described above are implemented.

This application has the following advantages:

In the embodiments of the present application, target data is obtained, and data filtering is performed according to the word meaning of the target data to determine sampled data, where the sampled data is a constant predicate and includes at least one; a template predicate is generated based on the sampled data, and a target template is built based on the template predicate; data screening is performed on the target data based on the constant predicate to build a predicate total set; association rule mining is performed based on the predicate total set to generate a candidate rule total set; valid rules in the candidate rule total set are determined based on the target data, and valid data is determined based on the valid rules. With this constant-repairing data processing method, valid rules with constants can be discovered in large-scale relational data without enumerating all possible constants, which greatly improves the execution efficiency of rule discovery.

Description of the drawings

In order to explain the technical solution of the present application more clearly, the drawings needed in the description of the present application are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.

Figure 1 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application;

Figure 2 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application;

Figure 3 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application;

Figure 4 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application;

Figure 5 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application;

Figure 6 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application;

Figure 7 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application;

Figure 8 is a structural block diagram of a data processing device based on relational data provided by an embodiment of the present application;

Figure 9 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.

Detailed description

In order to make the purpose, features and advantages of the present application more obvious and understandable, the present application will be described in further detail below in conjunction with the accompanying drawings and specific implementation modes. Obviously, the described embodiments are part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of this application.

It should be noted that, unlike CFD rules, the rules utilized in this invention are Entity Enhancing Rules (hereinafter referred to as REE). The basic component of REE is the predicate p, which is defined as follows:

p ::= R(t) | t.A ◎ c | t.A ◎ s.B | M(t.A, s.B)

Here, ◎ is a comparison operator, which can be either "equal to" or "not equal to".

R(t) indicates that t is a tuple variable in the relational table R.

t.A denotes attribute A of tuple variable t. M is a machine learning model that returns true if t.A and s.B are related and false otherwise.

t.A◎c has a constant and is called a constant predicate.

t.A◎s.B does not have a constant and is called a variable predicate.

M(t.A,s.B) is called a machine learning predicate.
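The three predicate kinds can be modelled as small data classes. The sketch below is only one possible encoding: the class names, the `evaluate` helper, and the stand-in model (a case-insensitive string comparison) are assumptions made for illustration, and the relation atom R(t) is left implicit in the binding of tuple variables to rows.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class ConstantPredicate:
    """t.A op c: compares an attribute of one tuple variable with a constant."""
    var: str
    attr: str
    op: str  # "=" or "!="
    const: str


@dataclass(frozen=True)
class VariablePredicate:
    """t.A op s.B: compares attributes of two tuple variables."""
    left_var: str
    left_attr: str
    op: str  # "=" or "!="
    right_var: str
    right_attr: str


@dataclass(frozen=True)
class MLPredicate:
    """M(t.A, s.B): true when the model judges the two values to be related."""
    model: Callable[[str, str], bool]
    left_var: str
    left_attr: str
    right_var: str
    right_attr: str


Predicate = Union[ConstantPredicate, VariablePredicate, MLPredicate]


def compare(op, a, b):
    """Apply the comparison operator: equality or inequality."""
    return a == b if op == "=" else a != b


def evaluate(pred, binding):
    """Evaluate a predicate under a binding such as {"t": row1, "s": row2}."""
    if isinstance(pred, ConstantPredicate):
        return compare(pred.op, binding[pred.var].get(pred.attr), pred.const)
    left = binding[pred.left_var].get(pred.left_attr)
    right = binding[pred.right_var].get(pred.right_attr)
    if isinstance(pred, VariablePredicate):
        return compare(pred.op, left, right)
    return pred.model(left, right)


# Hypothetical rows bound to the tuple variables t and s.
binding = {
    "t": {"Recipient": "Li", "Address": "Shenzhen City, Guangdong Province"},
    "s": {"Recipient": "Li", "Address": "shenzhen city, guangdong province"},
}
in_shenzhen = ConstantPredicate("t", "Address", "=", "Shenzhen City, Guangdong Province")
same_recipient = VariablePredicate("t", "Recipient", "=", "s", "Recipient")
same_address = VariablePredicate("t", "Address", "=", "s", "Address")
# Stand-in for a learned model: a case-insensitive match.
related = MLPredicate(lambda a, b: a.lower() == b.lower(), "t", "Address", "s", "Address")
```

With this binding, the variable predicate on Address is false because the strings differ in case, while the machine learning predicate on the same attributes is true under the stand-in model.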

Based on predicates, the definition of REE is: X->e; among them, (1)

A specific REE example is as follows:

Express(t)∧Express(s)∧t.Recipient=s.Recipient∧t.Address="Shenzhen City, Guangdong Province"->s.Postcode="518000"

The scenario described by this REE is that if the recipient of express t and express s is the same person, and the address of t is "Shenzhen City, Guangdong Province", then the postal code of s must be "518000".
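An REE of this shape can be checked over pairs of records. The following sketch is an assumption-laden illustration: the helper name `ree_violations`, the sample rows, and the choice to pair only distinct records are hypothetical, and the consequence uses the postal code "518000" given for Shenzhen in the background example.

```python
from itertools import permutations


def ree_violations(rows, precondition, consequence):
    """Return ordered pairs (t, s) of distinct rows that satisfy X but not e."""
    return [
        (t, s)
        for t, s in permutations(rows, 2)
        if precondition(t, s) and not consequence(t, s)
    ]


SHENZHEN = "Shenzhen City, Guangdong Province"

# Hypothetical express records, for illustration only.
rows = [
    {"Recipient": "Li", "Address": SHENZHEN, "Postcode": "518000"},
    {"Recipient": "Li", "Address": None, "Postcode": "518000"},
    {"Recipient": "Li", "Address": None, "Postcode": "523000"},
    {"Recipient": "Wang", "Address": SHENZHEN, "Postcode": "510000"},
]


def precondition(t, s):
    # X: t.Recipient = s.Recipient and t.Address = "Shenzhen City, Guangdong Province"
    return t["Recipient"] == s["Recipient"] and t["Address"] == SHENZHEN


def consequence(t, s):
    # e: s.Postcode = "518000"
    return s["Postcode"] == "518000"


violations = ree_violations(rows, precondition, consequence)
print(len(violations))
```

Because the rule ranges over two tuple variables, the check is quadratic in the number of rows; a practical implementation would likely index rows by Recipient first.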

REE rules can be found in relational data through depth-first or breadth-first search methods.

The rules most closely related to REE rules are the CFD rules described above. CFD rules support constant predicates and variable predicates over only one tuple variable, so they can be regarded as a special case of REE rules.

CFD-based rule mining algorithms also use breadth-first or depth-first search methods for rule mining.

The overall technical solution of the present invention is divided into two steps: template mining and constant repair. In order to improve mining efficiency, a part of the data is extracted from the full data D to form the sampling data D_s. Template mining is performed on the sampled data D_s, while constant repair is performed on the full data D. Template mining is performed first, and constant repair is then performed based on the mined templates.
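As a rough illustration of extracting D_s from D, the sketch below draws a uniform random sample. Uniform sampling, the function name `draw_sample`, and the `fraction` and `seed` parameters are all assumptions; the application itself describes screening by word meaning and does not prescribe this procedure.

```python
import random


def draw_sample(full_data, fraction, seed=0):
    """Draw a uniform random sample D_s from the full data D without replacement."""
    size = min(len(full_data), max(1, round(len(full_data) * fraction)))
    return random.Random(seed).sample(full_data, size)


# Hypothetical full data D: 100 records with an integer id.
D = [{"id": i} for i in range(100)]
D_s = draw_sample(D, 0.1)
print(len(D_s))
```

Fixing the seed keeps the sample reproducible, so that templates mined on D_s can be regenerated for the later constant-repair pass over D.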

Referring to Figure 1, there is shown a step flow chart of a data processing method based on relational data provided by an embodiment of the present application;

A data processing method based on relational data. The method is used to repair target data of missing data segments through data relationships and verify the validity of the repaired target data. The method includes:

S110. Obtain target data, and perform data filtering to determine sampling data according to the word meaning of the target data, where the sampling data is a constant predicate and includes at least one;

S120. Generate a template predicate based on the sampled data, and construct a target template based on the template predicate;

S130. Perform data filtering on the target data according to the constant predicate to build a predicate total set;

S140. Perform association rule mining based on the predicate set to generate a candidate rule set;

S150: Determine valid rules in the total set of candidate rules based on the target data, and determine valid data based on the valid rules.

Below, the data processing method based on relational data in this exemplary embodiment will be further described.

As described in step S110, target data is obtained, and data filtering is performed to determine sampling data according to the word meaning of the target data, wherein the sampling data is a constant predicate and includes at least one.

In an embodiment of the present invention, the specific process of "obtaining target data, and performing data filtering to determine sampling data according to the word meaning of the target data, wherein the sampling data is a constant predicate and includes at least one" described in step S110 can be further explained in conjunction with the following description.

Referring to Figure 2, a flow chart of steps of a data processing method based on relational data provided by an embodiment of the present application is shown;

As described in the following steps,

S210. Obtain the data attributes in the target data;

S220. Determine the word meaning type of the target data according to the data attribute, wherein the word meaning type includes constants and no constants;

S230: Screen the target data whose word meaning type corresponds to the constant, and determine it as the sampled data.

It should be noted that each item of target data has its own corresponding data attributes, and these attributes are obtained first.

The word meaning type of the target data is then determined from the data attributes; the word meaning type is either "constant" or "no constant".

Finally, the target data whose word meaning type is "constant" is screened out and marked as the sampled data.

As described in step S120, a template predicate is generated based on the sampled data, and a target template is constructed based on the template predicate.

In an embodiment of the present invention, the specific process of "generating a template predicate based on the sampled data and constructing a target template based on the template predicate" described in step S120 can be further explained in conjunction with the following description.

Referring to Figure 3, a flow chart of steps of a data processing method based on relational data provided by an embodiment of the present application is shown;

As described in the following steps,

S310. Generate a template predicate based on the sampled data;

S320. When the template predicate has a valid value in the target data, determine that the template predicate is a valid predicate;

S330. Construct a target template based on the effective predicate.

It should be noted that template predicates are generated based on the sampled data: predicates are formed from the data attributes of the sampled data, and several predicates may be generated in this way. A predicate generated from a data attribute is initially regarded as an invalid predicate; whether it becomes a valid template predicate is determined by whether a corresponding constant value exists for it.

It should be noted that, when a template predicate is validated, as long as at least one piece of sub-data in the target data has a valid value on the data attribute of the template predicate, the predicate is taken as a valid predicate, and the template REE, that is, the target template, is then formed from the valid predicates.
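The validity test for template predicates can be sketched as follows. This is a minimal illustration assuming that a "valid value" means a non-null value; the function name `valid_template_predicates` and the string encoding of predicates are hypothetical.

```python
def valid_template_predicates(sample_rows, target_rows, var="t"):
    """Generate one template predicate per attribute seen in the sample and keep
    those for which at least one target row has a non-null value."""
    attrs = sorted({attr for row in sample_rows for attr in row})
    return [
        f'{var}.{attr}="_"'
        for attr in attrs
        if any(row.get(attr) is not None for row in target_rows)
    ]


# Hypothetical data: the Remark attribute is null everywhere.
sample_rows = [
    {"Address": "Shenzhen City, Guangdong Province", "Postcode": "518000", "Remark": None},
]
target_rows = [
    {"Address": "Shenzhen City, Guangdong Province", "Postcode": None, "Remark": None},
    {"Address": None, "Postcode": "518000", "Remark": None},
]

templates = valid_template_predicates(sample_rows, target_rows)
print(templates)
```

The template predicate on Remark is generated but discarded, because no row in the target data carries a value for that attribute.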

In a specific implementation, the template REE, that is, the target template, is a REE rule in which all constants in the REE are represented by the wildcard character "_".

It should be noted that the wildcard can match any constant value.

In a specific implementation, the template REE corresponding to the REE example above is as follows:

Express(t)∧Express(s)∧t.Recipient=s.Recipient∧t.Address="_"->s.Postcode="_".

The advantage of the template REE is that multiple REE rules that differ only in the constants of their constant predicates can be represented by the same template REE.

For example, the multiple CFDs listed in the background can all be expressed as:

Express(t)∧t.Address="_"->t.Postcode="_"

In order to distinguish multiple REE rules under the same template REE, each REE rule appends a pattern tuple to the template REE to assign its constants;

For example, the pattern tuple of the CFD Address="Guangzhou City, Guangdong Province"->Postcode="510000" is ("Guangzhou City, Guangdong Province", "510000").
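The relationship between rules, templates, and pattern tuples can be sketched with plain strings. The code below is a simplified illustration that assumes every constant is written between straight double quotes and contains no embedded quotes; the function names are hypothetical.

```python
import re
from collections import defaultdict

CONSTANT = re.compile(r'"([^"]*)"')


def to_template(rule):
    """Replace every quoted constant in a rule string with the wildcard "_"."""
    return CONSTANT.sub('"_"', rule)


def pattern_tuple(rule):
    """Return the quoted constants of a rule, in order of appearance."""
    return tuple(CONSTANT.findall(rule))


def build_tableaux(rules):
    """Group rules by template; each template maps to its list of pattern tuples."""
    tableaux = defaultdict(list)
    for rule in rules:
        tableaux[to_template(rule)].append(pattern_tuple(rule))
    return dict(tableaux)


# The three CFDs from the background section.
rules = [
    'Address="Guangzhou City, Guangdong Province"->Postcode="510000"',
    'Address="Dongguan City, Guangdong Province"->Postcode="523000"',
    'Address="Foshan City, Guangdong Province"->Postcode="528010"',
]

tableaux = build_tableaux(rules)
for template, tuples in tableaux.items():
    print(template, tuples)
```

All three rules collapse into a single template, and the list of pattern tuples attached to it plays the role of the pattern tableau.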

Multiple pattern tuples form the pattern tableau form of REE, as shown in Table 1:

Table 1

As an example, consider constant predicates that differ in constant value only on the same attribute.

In a specific implementation, for constant predicates such as t.address="Shenzhen City, Guangdong Province", t.address="Guangzhou City, Guangdong Province" and t.address="Dongguan City, Guangdong Province", only one template predicate is enumerated, namely t.address="_".

As described in step S330, construct a target template according to the effective predicate;

In an embodiment of the present invention, the specific process of "constructing a target template based on the effective predicate" in step S330 may be further explained in conjunction with the following description.

Referring to Figure 4, a flow chart of steps of a data processing method based on relational data provided by an embodiment of the present application is shown;

As described in the following steps,

S410. Combine the valid predicates to generate permutations;

S420. Screen the permutations and combinations to determine template predicate combinations;

S430: Construct the target template based on the template predicate combination and the valid predicate.

It should be noted that several valid predicates are arranged and combined to form permutations and combinations.

It should be noted that, rather than enumerating all permutations and combinations of template predicates, a preliminary screening is performed using the concept of the free itemset from transaction databases. Only the template predicate combinations that pass this screening form a valid template REE, that is, the target template.
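The free itemset idea can be illustrated as follows. In frequent-pattern mining, an itemset is called free when no proper subset has the same support. The sketch below assumes that each record is turned into a transaction listing the attributes on which it has a value; this mapping, the function names, and the sample transactions are assumptions, since the application does not spell out how the concept is applied.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for tx in transactions if itemset <= tx)


def is_free(itemset, transactions):
    """An itemset is free if removing any single item strictly increases support.

    Checking the immediate subsets is enough, because support can only grow
    as items are removed.
    """
    base = support(itemset, transactions)
    return all(support(itemset - {item}, transactions) > base for item in itemset)


def free_combinations(items, transactions, max_size):
    """Enumerate the free itemsets with at most max_size items, level by level."""
    result = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(items), size):
            candidate = frozenset(combo)
            if is_free(candidate, transactions):
                result.append(candidate)
    return result


# Hypothetical transactions: the attributes that are non-null in each record.
transactions = [
    frozenset({"Address", "Postcode", "Recipient"}),
    frozenset({"Address", "Postcode"}),
    frozenset({"Recipient"}),
    frozenset({"Address", "Postcode", "Recipient"}),
]

free = free_combinations({"Address", "Postcode", "Recipient"}, transactions, 2)
print(free)
```

Here Address and Postcode always occur together, so the pair {Address, Postcode} has the same support as Address alone and is screened out as redundant. This simple version tests every combination; a fuller implementation could also skip any candidate that has a non-free subset.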

In a specific implementation, for example, when the template predicate is t.address="_" and the sampled data contains an express record whose address is not null (a null value means there is no data), the sampled data has a valid value on the data attribute of this template predicate.

It should be noted that, for the template REE, that is, the target template, as long as at least one set of data attributes satisfies it, the rule participates in subsequent rule verification as a valid rule candidate. For example, for Express(t)∧t.Address="_"->t.Postcode="_", if the data contains an express record in which neither the address nor the postal code is null, then the data attributes of this record satisfy the template REE, that is, the target template.

As described in step S130, perform data filtering on the target data according to the constant predicate to build a predicate aggregate set;

In an embodiment of the present invention, the specific process of "filtering the target data according to the constant predicate to construct a predicate total set" described in step S130 can be further explained in conjunction with the following description.

Referring to Figure 5, a flow chart of steps of a data processing method based on relational data provided by an embodiment of the present application is shown;

As described in the following steps,

S510. Perform data screening on the target data according to the constant predicate to determine non-constant predicate data, where the non-constant predicate data is a set of non-constant predicates;

S520: Supplement the non-constant predicate data with a constant value according to the target template to generate constant predicate data, where the constant predicate data is a set of constant predicates;

S530. Construct the predicate total set based on the non-constant predicate set and the constant predicate set.

It should be noted that, after the template REE has been mined on D_s, that is, the sampled data, the template REE is used to perform constant repair on the full data D, that is, the target data. Constant repair comprises the following four main steps: (1) use non-constant predicates to confirm the enumeration range; (2) use templates to supplement constants; (3) generate candidate rules; (4) verify rules.

It should be noted that the target data is screened according to the constant predicate to determine the non-constant predicate data, where the non-constant predicate data is a set of non-constant predicates: the target data that is constant predicate data is filtered out, and the remaining target data is identified as non-constant predicate data.

It should be noted that, for the screened non-constant predicate data, constants are supplemented according to the template predicates in the template REE, that is, the target template. Each template predicate is supplemented with constants, thereby constructing constant predicate data that can be enumerated.

As an example, for the template predicate t.address="_", we find all the address attribute values in the data and fill in the position of the wildcard character "_" to form the constant predicate data.
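Filling the wildcard with observed values can be sketched in a few lines. The function name `fill_template_predicate` and the sample rows are hypothetical, and null values are assumed to be represented by `None`.

```python
def fill_template_predicate(rows, attr, var="t"):
    """Instantiate var.attr="_" with every distinct non-null value of attr in rows."""
    values = sorted({row[attr] for row in rows if row.get(attr) is not None})
    return [f'{var}.{attr}="{value}"' for value in values]


# Hypothetical records; duplicates and nulls produce no extra predicates.
rows = [
    {"Address": "Shenzhen City, Guangdong Province"},
    {"Address": "Guangzhou City, Guangdong Province"},
    {"Address": "Shenzhen City, Guangdong Province"},
    {"Address": None},
]

constant_predicates = fill_template_predicate(rows, "Address")
print(constant_predicates)
```

Each distinct value yields exactly one constant predicate, so the number of predicates grows with the number of distinct values rather than with the number of rows.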

It should be noted that based on the set of constant predicates and the set of non-constant predicates, a total set of predicates can be formed. The total set of predicates includes all non-constant predicates and all constant predicates obtained previously.

As an example, given a template REE, that is, the target template, the sub-data in the full data D, that is, the target data, is filtered through the non-constant predicates in the template REE. Only the sub-data that satisfies the non-constant predicates participates in the next step of constant supplementation. This approach avoids expensive constant supplementation over the full data D and greatly improves the execution efficiency of the algorithm while ensuring completeness.
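The range confirmation step can be illustrated for a template whose only non-constant predicate is t.Recipient=s.Recipient. In the sketch below, constants are collected only from records that share a recipient with at least one other record; the function name, the parameters, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def constants_in_range(rows, join_attr, const_attr):
    """Collect const_attr values only from rows that take part in some pair
    (t, s) of distinct rows with t.join_attr = s.join_attr."""
    groups = defaultdict(list)
    for row in rows:
        if row.get(join_attr) is not None:
            groups[row[join_attr]].append(row)
    values = set()
    for group in groups.values():
        if len(group) >= 2:  # at least one pair satisfies the variable predicate
            values.update(
                row[const_attr] for row in group if row.get(const_attr) is not None
            )
    return sorted(values)


# Hypothetical express records.
rows = [
    {"Recipient": "Li", "Address": "Shenzhen City, Guangdong Province"},
    {"Recipient": "Li", "Address": "Guangzhou City, Guangdong Province"},
    {"Recipient": "Wang", "Address": "Dongguan City, Guangdong Province"},
    {"Recipient": "Zhao", "Address": "Foshan City, Guangdong Province"},
    {"Recipient": "Zhao", "Address": None},
]

candidates = constants_in_range(rows, "Recipient", "Address")
print(candidates)
```

The Dongguan address is never enumerated, because its record has no partner with the same recipient and therefore cannot contribute to any pair that satisfies the template's precondition.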

As described in step S140, perform association rule mining based on the predicate total set to generate a candidate rule set;

In an embodiment of the present invention, the specific process of "mining association rules to generate a set of candidate rules based on the total set of predicates" described in step S140 can be further explained in conjunction with the following description.

Referring to Figure 6, a flow chart of steps of a data processing method based on relational data provided by an embodiment of the present application is shown;

As described in the following steps,

S610. Perform a depth-first search based on the total set of predicates to generate a first candidate rule set; or,

S620. Perform a breadth-first search based on the total set of predicates to generate a second set of candidate rules;

S630. Generate the total set of candidate rules based on the first candidate rule set or the second candidate rule set.

It should be noted that a depth-first search or a breadth-first search is performed based on the predicate total set to obtain several candidate rules, and these candidate rules form the total set of candidate rules.

As an example, for the template REE, that is, the target template, depth-first or breadth-first rule mining is performed over the predicate total set to obtain candidate rules.
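A breadth-first enumeration of candidate rules can be sketched as a level-wise loop over precondition sizes. This is a bare illustration without any pruning; the function name `bfs_candidates`, the `max_lhs_size` bound, and the string encoding of predicates are assumptions.

```python
from itertools import combinations


def bfs_candidates(predicates, max_lhs_size):
    """Enumerate candidate rules X -> e level by level (breadth-first).

    Level k yields every rule whose precondition X has exactly k predicates
    and whose consequence e is a predicate outside X.
    """
    ordered = sorted(predicates)
    for size in range(1, max_lhs_size + 1):
        for lhs in combinations(ordered, size):
            for rhs in ordered:
                if rhs not in lhs:
                    yield lhs, rhs


# Hypothetical predicate total set.
predicates = [
    't.Recipient=s.Recipient',
    't.Address="Shenzhen City, Guangdong Province"',
    's.Postcode="518000"',
]

candidates = list(bfs_candidates(predicates, 2))
for lhs, rhs in candidates:
    print(" ∧ ".join(lhs), "->", rhs)
```

A depth-first variant would instead extend one precondition predicate by predicate before backtracking. Either order visits the same candidates when no pruning is applied.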

As described in step S150, determine valid rules in the total set of candidate rules based on the target data, and determine valid data based on the valid rules;

In an embodiment of the present invention, the specific process of "determining valid rules in the total set of candidate rules based on the target data and determining valid data based on the valid rules" in step S150 can be further explained in conjunction with the following description.

Referring to Figure 7, a flow chart of steps of a data processing method based on relational data provided by an embodiment of the present application is shown;

As described in the following steps,

S710. Obtain each sub-candidate rule in the total set of candidate rules;

S720. Verify each sub-candidate rule according to the target data to determine the validity of each sub-candidate rule; wherein, when sub-target data corresponding to the current sub-candidate rule exists in the target data, the current sub-candidate rule is determined to be a valid rule;

S730: Obtain the sub-goal data corresponding to the valid rule, and mark the sub-goal data as the valid data.

It should be noted that whether each candidate rule is a valid rule must be determined on the full data D, that is, the target data.

As an example, the validity of each candidate rule is verified on the full data D, that is, the target data; when sub-target data corresponding to the current sub-candidate rule exists in the target data, the candidate rule is valid.

In a specific implementation, for example, suppose the candidate rule is Express(t)∧Express(s)∧t.Recipient=s.Recipient∧t.Address="Shenzhen City, Guangdong Province"->s.Postcode="518000". If, in the data, the recipient of express t and express s is the same person, the address of express t is "Shenzhen City, Guangdong Province", and the postal code of express s is "518000", then the data of express t and express s satisfies the candidate rule, and the candidate rule is considered valid. The valid rules form the final output.
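The verification pass over the candidate set can be sketched as follows. A candidate is kept when at least one pair of records satisfies both its precondition and its consequence, which mirrors the criterion stated above; the function name, the encoding of rules as pairs of Python functions, and the sample rows are assumptions, and the example uses the postal code "518000" given for Shenzhen in the background.

```python
from itertools import permutations


def verify_candidates(rows, candidates):
    """Keep the candidates supported by at least one pair (t, s) of distinct rows.

    candidates maps a rule name to (precondition, consequence), two functions
    of (t, s). Returns {name: supporting pairs} for the rules judged valid.
    """
    valid = {}
    for name, (precondition, consequence) in candidates.items():
        supporting = [
            (t, s)
            for t, s in permutations(rows, 2)
            if precondition(t, s) and consequence(t, s)
        ]
        if supporting:
            valid[name] = supporting
    return valid


SHENZHEN = "Shenzhen City, Guangdong Province"


def same_recipient_in_shenzhen(t, s):
    return t["Recipient"] == s["Recipient"] and t["Address"] == SHENZHEN


# Hypothetical express records and two hypothetical candidate rules.
rows = [
    {"Recipient": "Li", "Address": SHENZHEN, "Postcode": "518000"},
    {"Recipient": "Li", "Address": None, "Postcode": "518000"},
]
candidates = {
    "postcode_518000": (same_recipient_in_shenzhen, lambda t, s: s["Postcode"] == "518000"),
    "postcode_523000": (same_recipient_in_shenzhen, lambda t, s: s["Postcode"] == "523000"),
}

valid = verify_candidates(rows, candidates)
print(list(valid))
```

The supporting pairs returned for each valid rule correspond to the sub-target data that is marked as valid data. A stricter check could additionally require that no pair satisfies the precondition while failing the consequence.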

Technical effects of the present invention:

There is no concept like a template in CFD mining; therefore, all constants in the full data must be enumerated to generate valid CFD rules with constants. In contrast, using the definition of the template REE, a small sample D_s is first extracted from the full data for template mining. Because the amount of data is small, this process is very fast compared with CFD-style constant enumeration on the full data. Secondly, with the mined template REE, there is no need to perform expensive constant enumeration over the full data D; only the constants that might form a valid rule need to be enumerated. Constants that are unlikely to form valid rules are excluded during range confirmation before enumeration, which avoids redundant and invalid operations.
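The difference in search space can be made concrete with a toy count. The sketch below assumes that a naive enumeration considers every ordered pair of distinct attributes together with every pair of their observed values, while template mining considers one template per ordered attribute pair; both counting functions are hypothetical simplifications.

```python
from itertools import permutations


def count_constant_candidates(rows):
    """Count naive candidates A="a" -> B="b" over all ordered attribute pairs."""
    attrs = sorted({attr for row in rows for attr in row})
    distinct = {
        attr: {row[attr] for row in rows if row.get(attr) is not None}
        for attr in attrs
    }
    return sum(len(distinct[a]) * len(distinct[b]) for a, b in permutations(attrs, 2))


def count_template_candidates(rows):
    """Count templates A="_" -> B="_", one per ordered attribute pair."""
    attrs = {attr for row in rows for attr in row}
    return len(attrs) * (len(attrs) - 1)


# The three address and postal code pairs from the background section.
rows = [
    {"Address": "Guangzhou City, Guangdong Province", "Postcode": "510000"},
    {"Address": "Dongguan City, Guangdong Province", "Postcode": "523000"},
    {"Address": "Foshan City, Guangdong Province", "Postcode": "528010"},
]

print(count_constant_candidates(rows), count_template_candidates(rows))
```

The naive count grows with the product of the numbers of distinct values, whereas the template count depends only on the number of attributes. Constants still have to be filled in afterwards, but only within the range confirmed by the template.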

We compared the accuracy and mining efficiency of three rule mining methods in multiple public data, including: (1) the rule mining method in this invention that performs template mining on D _s and then performs constant repair on D; (2) directly Methods for rule mining on D _s ; and (3) methods for rule mining directly on D.

By comparison with the above method (2), the mining method of the present invention improves the rule recall rate by 2%; after constant repair, the mined rules are more accurate.

By comparison with method (3), the present invention can improve the operating efficiency by an average of 12.2 times. On a large DBLP data set with 3 relational tables, 18 attributes, and 1.8 million pieces of data, the running time of the present invention is 406 seconds, while the running time of method (3) is 2096 seconds; in other words, the mining efficiency is higher.
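As a quick check on the figures above, the two DBLP running times can be turned into a speedup factor. The short Python sketch below uses only the two times quoted in the text; the variable names are illustrative. On this one data set the ratio is roughly 5.2, so the 12.2-times figure is best read as an average over all the data sets tested rather than as the DBLP result.

```python
# Running times quoted in the text for the DBLP data set, in seconds.
time_invention = 406
time_method_3 = 2096

# Speedup of the invention over mining directly on the full data D.
speedup = time_method_3 / time_invention
print(f"DBLP speedup: {speedup:.2f}x")
```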

As for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple. For relevant details, please refer to the partial description of the method embodiment.

Referring to Figure 8, a structural block diagram of a data processing device based on relational data provided by an embodiment of the present application is shown.

The device is used for rule mining and constant repair in relational data, where the relational data includes full data and sampled data drawn from the full data. The device specifically includes:

The acquisition module 810 is used to acquire target data, and perform data filtering to determine sampling data according to the word meaning of the target data, where the sampling data is a constant predicate and includes at least one;

The first building module 820 is used to generate a template predicate based on the sampled data, and build a target template based on the template predicate;

The second building module 830 is used to perform data filtering on the target data according to the constant predicate to build a total set of predicates;

The generation module 840 is used to perform association rule mining based on the total set of predicates to generate a total set of candidate rules;

The determination module 850 is configured to determine valid rules in the total set of candidate rules based on the target data, and determine valid data based on the valid rules.

In an embodiment of the present invention, the acquisition module 810 includes:

The first acquisition sub-module is used to acquire data attributes in the target data;

The first determination sub-module is used to determine the word meaning type of the target data according to the data attribute, wherein the word meaning type includes constants and no constants;

The second determination sub-module is used to filter the target data whose word meaning type corresponds to the constant, and determine it as the sampled data.

In an embodiment of the present invention, the first building module 820 includes:

The first generation sub-module is used to generate template predicates based on the sampled data;

The third determination sub-module is used to determine that the template predicate is a valid predicate when the template predicate has a valid value in the target data;

The first construction sub-module is used to construct the target template based on the valid predicate.

In an embodiment of the present invention, the first construction sub-module includes:

The first generation unit is used to combine the valid predicates to generate permutations and combinations;

The first determination unit is used to screen the permutations and combinations to determine template predicate combinations;

The first construction unit is used to build the target template based on the template predicate combinations and the valid predicates.
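The work of these three units can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the predicates are written as hypothetical `(left, operator, right)` triples with `"_"` as the constant placeholder, and the screening rule (keep a combination only if no attribute reference is repeated and at least one placeholder predicate is present) is an assumption made for illustration, since the text does not spell out the screening criterion. Order within a combination is ignored here, so `itertools.combinations` is used.

```python
from itertools import combinations

# Hypothetical valid template predicates; "_" marks a constant placeholder.
VALID_PREDICATES = [
    ("t.recipient", "=", "s.recipient"),
    ("t.address", "=", "_"),
    ("s.zip_code", "=", "_"),
]


def attributes(pred):
    """Attribute references used by a predicate, excluding the placeholder."""
    left, _, right = pred
    return {x for x in (left, right) if x != "_"}


def screen(combo):
    """Assumed screening rule: no attribute reference appears twice, and
    the combination contains at least one placeholder predicate."""
    seen = set()
    for pred in combo:
        attrs = attributes(pred)
        if attrs & seen:
            return False
        seen |= attrs
    return any(pred[2] == "_" for pred in combo)


def template_combinations(preds, max_size=3):
    """Enumerate predicate combinations up to max_size and keep the screened ones."""
    kept = []
    for size in range(1, max_size + 1):
        for combo in combinations(preds, size):
            if screen(combo):
                kept.append(combo)
    return kept


COMBOS = template_combinations(VALID_PREDICATES)
for combo in COMBOS:
    print(combo)
```

Each surviving combination is a candidate body for a target template; the placeholders are filled with constants only later, against the full data.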

In an embodiment of the present invention, the second building module 830 includes:

The first screening sub-module is used to perform data screening on the target data according to the constant predicate to determine non-constant predicate data, where the non-constant predicate data is a set of non-constant predicates;

The second generation sub-module is used to supplement the non-constant predicate data with constant values according to the target template to generate constant predicate data, where the constant predicate data is a set of constant predicates;

The second construction sub-module is used to construct the predicate total set based on the non-constant predicate set and the constant predicate set.
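The constant supplementation step can be sketched in Python as follows. This is a sketch only: the rows, the predicate triples, and the frequency threshold `min_count` are assumptions introduced for illustration. The text states only that constants unlikely to form valid rules are excluded before enumeration; a simple occurrence count stands in for that range confirmation here.

```python
from collections import Counter

# Hypothetical full data D; column names are assumptions.
ROWS = [
    {"recipient": "Li", "address": "Shenzhen City, Guangdong Province", "zip_code": "510000"},
    {"recipient": "Li", "address": "Shenzhen City, Guangdong Province", "zip_code": "510000"},
    {"recipient": "Wang", "address": "Beijing", "zip_code": "100000"},
]

# Non-constant predicates, and template predicates with "_" as placeholder.
NON_CONSTANT = [("t.recipient", "=", "s.recipient")]
TEMPLATE = [("t.address", "=", "_"), ("s.zip_code", "=", "_")]


def fill_constants(template, rows, min_count=2):
    """Replace each placeholder with the constants that occur at least
    min_count times in the corresponding column (assumed filter)."""
    filled = []
    for ref, op, value in template:
        if value != "_":
            filled.append((ref, op, value))
            continue
        column = ref.split(".", 1)[1]  # "t.address" -> "address"
        counts = Counter(row[column] for row in rows)
        for constant in sorted(counts):
            if counts[constant] >= min_count:
                filled.append((ref, op, constant))
    return filled


def predicate_total_set(non_constant, template, rows, min_count=2):
    """Total set = non-constant predicates plus constant predicates."""
    return list(non_constant) + fill_constants(template, rows, min_count)


TOTAL = predicate_total_set(NON_CONSTANT, TEMPLATE, ROWS)
for predicate in TOTAL:
    print(predicate)
```

Because only the columns named in the template are scanned, and rare values are dropped, the number of constant predicates stays far below what a full enumeration of every constant in D would produce.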

In an embodiment of the present invention, the generation module 840 includes:

The third generation sub-module is used to perform a depth-first search based on the total set of predicates to generate a first set of candidate rules; or,

The fourth generation sub-module is used to perform a breadth-first search based on the total set of predicates to generate a second set of candidate rules;

The fifth generation sub-module is used to generate the total set of candidate rules based on the first set of candidate rules or the second set of candidate rules.
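The two search orders can be sketched in Python as follows. This is an illustrative sketch rather than the mining procedure of the specification: predicates are plain strings, a candidate rule is modelled as a pair (precondition, consequence), and the size limit `max_precondition` is an assumed parameter. Both functions enumerate the same candidates; they differ only in the order in which preconditions are expanded.

```python
from collections import deque

# Hypothetical total set of predicates.
PREDICATES = [
    "t.recipient = s.recipient",
    "t.address = 'Shenzhen City, Guangdong Province'",
    "s.zip_code = '510000'",
]


def rules_for(pre, predicates):
    """Pair a precondition (tuple of indices) with every predicate outside it."""
    body = tuple(predicates[i] for i in pre)
    return [(body, predicates[j]) for j in range(len(predicates)) if j not in pre]


def candidate_rules_bfs(predicates, max_precondition=2):
    """Breadth-first: all preconditions of size k before any of size k + 1."""
    rules, queue = [], deque([()])
    while queue:
        pre = queue.popleft()
        if pre:
            rules.extend(rules_for(pre, predicates))
        if len(pre) < max_precondition:
            start = pre[-1] + 1 if pre else 0
            for i in range(start, len(predicates)):
                queue.append(pre + (i,))
    return rules


def candidate_rules_dfs(predicates, max_precondition=2):
    """Depth-first: extend one precondition fully before trying the next."""
    rules = []

    def visit(pre):
        if pre:
            rules.extend(rules_for(pre, predicates))
        if len(pre) < max_precondition:
            start = pre[-1] + 1 if pre else 0
            for i in range(start, len(predicates)):
                visit(pre + (i,))

    visit(())
    return rules


BFS_RULES = candidate_rules_bfs(PREDICATES)
DFS_RULES = candidate_rules_dfs(PREDICATES)
print(len(BFS_RULES), len(DFS_RULES))
```

Breadth-first order finds short rules first, which suits level-wise pruning, while depth-first order keeps fewer partial preconditions in memory at a time; either result can serve as the total set of candidate rules.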

In an embodiment of the present invention, the determination module 850 includes:

The second acquisition sub-module is used to acquire each sub-candidate rule in the total set of candidate rules;

The fourth determination sub-module is used to verify the validity of each of the sub-candidate rules based on the target data; wherein, when sub-target data corresponding to the current sub-candidate rule exist in the target data, the current sub-candidate rule is determined to be a valid rule;

The third acquisition sub-module is used to acquire the sub-target data corresponding to the valid rule, and mark the sub-target data as the valid data.

Referring to Figure 9, a computer device for a data processing method based on relational data of the present invention is shown, which may specifically include the following:

The above-mentioned computer device 12 takes the form of a general-purpose computing device. The components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing units 16).

The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus structures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12, including volatile and nonvolatile media, removable and non-removable media.

System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (commonly referred to as "hard drives"). Although not shown in Figure 9, a magnetic disk drive may be provided for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disc drive may be provided for reading from and writing to a removable non-volatile optical disc (e.g., CD-ROM, DVD-ROM or other optical media). In these cases, each drive may be connected to bus 18 through one or more data media interfaces. The memory may include at least one program product having a set of (e.g., at least one) program modules 42 configured to perform the functions of embodiments of the invention.

A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in memory. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules 42, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the described embodiments of the invention.

Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, camera, etc.), with one or more devices that enable an operator to interact with computer device 12, and/or with any device (e.g., network card, modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. This communication may occur through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be understood that, although not shown in Figure 9, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, and data backup storage systems 34.

The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing a data processing method based on relational data provided by the embodiment of the present invention.

That is, when the above-mentioned processing unit 16 executes the above-mentioned program, it implements: acquiring target data, and performing data filtering according to the word meaning of the target data to determine sampling data, wherein the sampling data is a constant predicate and includes at least one; generating a template predicate according to the sampling data, and constructing a target template based on the template predicate; performing data filtering on the target data based on the constant predicate to build a total set of predicates; performing association rule mining based on the total set of predicates to generate a total set of candidate rules; and determining valid rules in the total set of candidate rules based on the target data, and determining valid data based on the valid rules. In this way, the constants are repaired by the proposed data processing method.

In an embodiment of the present invention, a computer-readable storage medium is also provided, on which a computer program is stored. When the program is executed by a processor, the data processing method based on relational data as provided in all embodiments of the present application is implemented:

That is, when the program is executed by the processor, the following is implemented: acquiring target data, and performing data filtering according to the word meaning of the target data to determine sampling data, wherein the sampling data is a constant predicate and includes at least one; generating a template predicate according to the sampling data, and constructing a target template based on the template predicate; performing data filtering on the target data based on the constant predicate to build a total set of predicates; performing association rule mining based on the total set of predicates to generate a total set of candidate rules; and determining valid rules in the total set of candidate rules based on the target data, and determining valid data based on the valid rules. In this way, the constants are repaired by the proposed data processing method.

Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more conductors, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. As used herein, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take a variety of forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.

Computer program code for performing the operations of the present invention may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or similar programming languages. The program code may execute entirely on the operator's computer, partly on the operator's computer, as a stand-alone software package, partly on the operator's computer and partly on a remote computer, or entirely on the remote computer or server. In situations involving a remote computer, the remote computer can be connected to the operator's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (e.g., via the Internet using an Internet service provider).

Each embodiment in this specification is described in a progressive manner. Each embodiment focuses on its differences from other embodiments, and for the same or similar parts the embodiments may be referred to each other.

Although preferred embodiments of the embodiments of the present application have been described, those skilled in the art may make additional changes and modifications to these embodiments once the basic inventive concepts are understood. Therefore, the appended claims are intended to be construed to include the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.

Finally, it should be noted that in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or sequence between these entities or operations. Furthermore, the terms "comprises", "includes", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that includes a list of elements includes not only those elements, but also other elements not expressly listed, or elements inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the stated element.

The above is a detailed introduction to the data processing method and device based on relational data provided by this application. Specific examples are used herein to illustrate the principles and implementations of this application, and the description of the above embodiments is only intended to help understand the method and core ideas of this application. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementations and scope of application based on the ideas of this application. In summary, the content of this specification should not be construed as a limitation on this application.

Claims

1. A data processing method based on relational data, the method being used to repair target data with missing data segments through data relationships and to verify the validity of the repaired target data, characterized by comprising:

Obtain target data, and perform data filtering to determine sampling data according to the word meaning of the target data, wherein the sampling data is a constant predicate and includes at least one;

Generate a template predicate based on the sampled data, and construct a target template based on the template predicate;

Perform data filtering on the target data according to the constant predicate to construct a total set of predicates;

Perform association rule mining based on the total set of predicates to generate a total set of candidate rules;

Determine valid rules in the total set of candidate rules based on the target data, and determine valid data based on the valid rules.

2. The method according to claim 1, characterized in that the step of obtaining target data and performing data screening to determine sampling data according to the word meaning of the target data, wherein the sampling data is a constant predicate and includes at least one, includes:

Obtain data attributes within the target data;

Determine the word meaning type of the target data according to the data attribute, wherein the word meaning type includes constants and no constants;

Screen the target data whose word meaning type corresponds to the constant, and determine it as the sampled data.

3. The method according to claim 1, characterized in that the step of generating a template predicate based on the sampled data and constructing a target template based on the template predicate includes:

Generate template predicates based on the sampled data;

When the template predicate has a valid value in the target data, it is determined that the template predicate is a valid predicate;

Build the target template based on the valid predicate.

4. The method according to claim 3, characterized in that the step of constructing a target template based on the valid predicate includes:

Combine the valid predicates to generate permutations and combinations;

Screen the permutations and combinations to determine template predicate combinations;

Construct the target template based on the template predicate combination and the valid predicate.

5. The method according to claim 1, characterized in that the step of performing data filtering on the target data according to the constant predicate to construct a total set of predicates includes:

Perform data screening on the target data according to the constant predicate to determine non-constant predicate data, wherein the non-constant predicate data is a set of non-constant predicates;

Perform constant value supplementation on the non-constant predicate data according to the target template to generate constant predicate data, wherein the constant predicate data is a set of constant predicates;

Construct the total set of predicates based on the non-constant predicate set and the constant predicate set.

6. The method according to claim 1, characterized in that the step of performing association rule mining based on the total set of predicates to generate a total set of candidate rules includes:

Perform a depth-first search based on the total set of predicates to generate a first set of candidate rules; or,

Perform a breadth-first search based on the total set of predicates to generate a second set of candidate rules;

Generate the total set of candidate rules according to the first set of candidate rules or the second set of candidate rules.

7. The method according to claim 1, characterized in that the step of determining valid rules in the total set of candidate rules based on the target data and determining valid data based on the valid rules includes:

Obtain each sub-candidate rule in the total set of candidate rules;

Verify the validity of each sub-candidate rule according to the target data; wherein, when sub-target data corresponding to the current sub-candidate rule exist in the target data, the current sub-candidate rule is determined to be a valid rule;

Obtain the sub-target data corresponding to the valid rule, and mark the sub-target data as the valid data.

8. A data processing device based on relational data, the device being used to repair target data with missing data segments through data relationships and to verify the validity of the repaired target data, characterized by comprising:

An acquisition module is used to acquire target data, and perform data screening to determine sampling data according to the word meaning of the target data, wherein the sampling data is a constant predicate and includes at least one;

A first building module, configured to generate a template predicate based on the sampled data, and construct a target template based on the template predicate;

A second building module, configured to perform data filtering on the target data based on the constant predicate to build a total set of predicates;

A generation module, configured to perform association rule mining based on the total set of predicates to generate a total set of candidate rules;

A determination module, configured to determine valid rules in the total set of candidate rules based on the target data, and determine valid data based on the valid rules.

9. A computer device, characterized in that it includes a processor, a memory, and a computer program stored on the memory and capable of running on the processor, wherein when the computer program is executed by the processor, the method according to any one of claims 1 to 7 is implemented.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1 to 7 is implemented.