WO2023236238A1 - Relational data-based data processing method and apparatus thereof - Google Patents

Publication number
WO2023236238A1
WO2023236238A1 PCT/CN2022/099183 CN2022099183W WO2023236238A1 WO 2023236238 A1 WO2023236238 A1 WO 2023236238A1 CN 2022099183 W CN2022099183 W CN 2022099183W WO 2023236238 A1 WO2023236238 A1 WO 2023236238A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
predicate
template
constant
target
Prior art date
Application number
PCT/CN2022/099183
Other languages
French (fr)
Chinese (zh)
Inventor
谢珉
王尧舒
樊文飞
Original Assignee
深圳计算科学研究院
Priority date
Filing date
Publication date
Application filed by 深圳计算科学研究院
Publication of WO2023236238A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • The present application relates to the field of data processing, and in particular to a data processing method and device based on relational data.
  • Rule discovery in large-scale relational data is a time-consuming and labor-intensive process.
  • The computational complexity of rule discovery grows exponentially.
  • CFD rule mining must enumerate all possible combinations of attributes and constants, which is time-consuming and labor-intensive.
  • In view of these problems, this application provides a data processing method and device based on relational data that overcome, or at least partially solve, these problems, including:
  • A data processing method based on relational data, used to repair target data with missing data segments through data relationships and to verify the validity of the repaired target data, including:
  • The sampled data is a constant predicate, and there is at least one such predicate.
  • Valid rules in the total set of candidate rules are determined based on the target data, and valid data are determined based on the valid rules.
  • The step of obtaining the target data and filtering it according to the word meaning of the target data to determine the sampled data, wherein the sampled data is a constant predicate and includes at least one, includes:
  • the target data whose word meaning type corresponds to the constant are screened and determined as the sampled data.
  • the step of generating a template predicate based on the sampled data and constructing a target template based on the template predicate includes:
  • When the template predicate has a valid value in the target data, it is determined that the template predicate is a valid predicate;
  • The step of constructing a target template based on the valid predicate includes:
  • the target template is constructed based on the template predicate combination and the valid predicate.
  • the step of filtering the target data based on the constant predicate to construct a predicate aggregate set includes:
  • the predicate total set is constructed based on the non-constant predicate set and the constant predicate set.
  • the step of performing association rule mining based on the predicate aggregate set to generate a candidate rule set includes:
  • the total set of candidate rules is generated according to the first set of candidate rules or the second set of candidate rules.
  • the steps of determining effective rules in the total set of candidate rules based on the target data, and determining effective data based on the effective rules include:
  • the current sub-candidate rule is a valid rule
  • This application also discloses a data processing device based on relational data.
  • the device is used to repair target data of missing data segments through data relationships and verify the validity of the repaired target data, including:
  • An acquisition module is used to acquire target data, and perform data filtering to determine sampling data according to the word meaning of the target data, where the sampling data is a constant predicate and includes at least one;
  • a first building module configured to generate a template predicate based on the sampled data, and construct a target template based on the template predicate;
  • the second building module is used to perform data filtering on the target data based on the constant predicate to build a predicate aggregate set
  • a generation module configured to perform association rule mining based on the predicate set to generate a candidate rule set
  • a determining module configured to determine valid rules in the total set of candidate rules based on the target data, and determine valid data based on the valid rules.
  • This application also discloses a device, including a processor, a memory, and a computer program stored on the memory and capable of running on the processor.
  • When the computer program is executed by the processor, the above-mentioned method is implemented.
  • This application also discloses a computer-readable storage medium.
  • A computer program is stored on the computer-readable storage medium.
  • When the computer program is executed by a processor, the steps of the data processing method based on relational data described above are implemented.
  • Target data is obtained and filtered according to the word meaning of the target data to determine sampled data, where the sampled data is a constant predicate and includes at least one; a template predicate is generated based on the sampled data, and a target template is constructed based on the template predicate; the target data is filtered based on the constant predicate to build a total set of predicates; association rule mining is performed on the total set of predicates to generate a total set of candidate rules; valid rules in the total set of candidate rules are determined based on the target data, and valid data is determined based on the valid rules.
  • Figure 1 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application
  • Figure 2 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application
  • Figure 3 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application
  • Figure 4 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application
  • Figure 5 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application
  • Figure 6 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application
  • Figure 7 is a step flow chart of a data processing method based on relational data provided by an embodiment of the present application.
  • Figure 8 is a structural block diagram of a data processing device based on relational data provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • REE: Entity Enhancing Rules.
  • ⊕ is an operator, which can be "equal to" (=) or "not equal to" (≠).
  • R(t) indicates that t is a tuple variable in the relational table R.
  • t.A denotes attribute A of tuple variable t. M is a machine learning model: if t.A and s.B are related, the model returns true; otherwise it returns false.
  • t.A ⊕ c contains a constant c and is called a constant predicate.
  • t.A ⊕ s.B contains no constant and is called a variable predicate.
  • M(t.A, s.B) is called a machine learning predicate.
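  • The three predicate kinds defined above can be sketched as small Python classes. This is only an illustrative data model, not the application's implementation: the class names, the `holds` method, the `"="` / `"!="` operator strings and the toy case-insensitive model are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Row = Dict[str, object]  # one tuple of a relation, keyed by attribute name


def _compare(op: str, left: object, right: object) -> bool:
    """Apply the operator, which is either "=" or "!=" (hypothetical encoding)."""
    if op not in ("=", "!="):
        raise ValueError(f"unsupported operator: {op}")
    return (left == right) if op == "=" else (left != right)


@dataclass(frozen=True)
class ConstantPredicate:
    """t.A op c: compares an attribute of one tuple with a constant."""
    attr: str
    op: str
    const: object

    def holds(self, t: Row) -> bool:
        return _compare(self.op, t.get(self.attr), self.const)


@dataclass(frozen=True)
class VariablePredicate:
    """t.A op s.B: compares attributes of two tuple variables, no constant."""
    left_attr: str
    op: str
    right_attr: str

    def holds(self, t: Row, s: Row) -> bool:
        return _compare(self.op, t.get(self.left_attr), s.get(self.right_attr))


@dataclass(frozen=True)
class MLPredicate:
    """M(t.A, s.B): a model decides whether the two values are related."""
    left_attr: str
    right_attr: str
    model: Callable[[object, object], bool]

    def holds(self, t: Row, s: Row) -> bool:
        return bool(self.model(t.get(self.left_attr), s.get(self.right_attr)))


# Toy data and a toy "model" (case-insensitive match), both made up here.
t = {"name": "Ann", "city": "Shenzhen"}
s = {"name": "ann", "city": "Shenzhen"}
in_shenzhen = ConstantPredicate("city", "=", "Shenzhen")
same_city = VariablePredicate("city", "=", "city")
same_name = VariablePredicate("name", "=", "name")
similar_name = MLPredicate("name", "name",
                           lambda a, b: str(a).lower() == str(b).lower())
```

  • In this sketch the exact-match predicate on `name` fails for the two rows while the machine learning predicate accepts them, which is the kind of case a machine learning predicate is meant to cover.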
  • REE X->e; among them, (1)
  • REE rules can be found in relational data through depth-first or breadth-first search methods.
  • CFD rules support constant predicates and variable predicates with only one tuple variable, which can be regarded as a special case of REE rules.
  • CFD-based rule mining algorithms also use breadth-first or depth-first search methods for rule mining.
  • The overall technical solution of the present invention is divided into two steps: template mining and constant repair.
  • Template mining is performed on the sampled data D_s.
  • Constant repair is performed on the full data D.
  • Template mining is performed first, and constant repair is then performed based on the mined template.
  • Referring to FIG. 1, a step flow chart of a data processing method based on relational data provided by an embodiment of the present application is shown.
  • a data processing method based on relational data is used to repair target data of missing data segments through data relationships and verify the validity of the repaired target data.
  • the method includes:
  • S150 Determine valid rules in the total set of candidate rules based on the target data, and determine valid data based on the valid rules.
  • In step S110, target data is obtained and filtered according to the word meaning of the target data to determine sampled data, wherein the sampled data is a constant predicate and includes at least one.
  • S230 Screen the target data whose word meaning type corresponds to the constant, and determine it as the sampled data.
  • The data attributes in the target data are obtained; each piece of target data has its own corresponding data attributes.
  • The word meaning type of the target data is determined according to the data attributes, where the word meaning type is either "constant" or "no constant".
  • The target data whose word meaning type is "constant" is screened out and marked as the sampled data.
  • a template predicate is generated based on the sampled data, and a target template is constructed based on the template predicate.
  • The specific process of "generating a template predicate based on the sampled data and constructing a target template based on the template predicate" in step S120 can be further explained in conjunction with the following description.
  • A template predicate is generated based on the sampled data: several predicates are formed from the data attributes of the sampled data. A predicate generated from a data attribute is initially treated as invalid, and its corresponding constant values determine whether it is a template predicate.
  • When validating a template predicate, as long as at least one sub-data item in the target data has a valid value on the data attribute of the template predicate, the predicate is treated as a valid predicate; the template REE, that is, the target template, is then formed from the valid predicates.
  • The template REE, that is, the target template, is an REE rule in which every constant is replaced by the wildcard character "_".
  • the template REE corresponding to the sample REE is as follows:
  • A property of the template REE is that REE rules that differ only in the constants of their constant predicates can be represented by the same template REE.
  • Each REE rule appends a pattern tuple to the template REE to assign the constants;
  • t.address "Shenzhen City, Guangdong province”
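  • The relationship between a rule, its template REE and its pattern tuple can be sketched as follows. This is a minimal illustration under stated assumptions: predicates are encoded as `(attribute, operator, constant)` triples, the `=` operator in the example and the second address value are made up for illustration, and `to_template` / `instantiate` are hypothetical helper names.

```python
from typing import List, Tuple

WILDCARD = "_"
# Hypothetical encoding of a constant predicate: (attribute, operator, constant).
ConstPred = Tuple[str, str, object]


def to_template(preds: List[ConstPred]) -> Tuple[List[ConstPred], tuple]:
    """Replace every constant by the wildcard and collect the constants
    into a pattern tuple, in predicate order."""
    template = [(attr, op, WILDCARD) for attr, op, _const in preds]
    pattern = tuple(const for _attr, _op, const in preds)
    return template, pattern


def instantiate(template: List[ConstPred], pattern: tuple) -> List[ConstPred]:
    """Assign the constants of a pattern tuple back to the template."""
    return [(attr, op, const) for (attr, op, _w), const in zip(template, pattern)]


# Two rules that differ only in the constant of their constant predicate.
rule_a = [("address", "=", "Shenzhen City, Guangdong province")]
rule_b = [("address", "=", "Guangzhou City, Guangdong province")]

template_a, pattern_a = to_template(rule_a)
template_b, pattern_b = to_template(rule_b)
```

  • Both rules map to the same template `[("address", "=", "_")]`, and each keeps its own one-element pattern tuple, so the template only needs to be mined once.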
  • In step S330, a target template is constructed according to the valid predicate.
  • The specific process of "constructing a target template based on the valid predicate" in step S330 may be further explained in conjunction with the following description.
  • S430 Construct the target template based on the template predicate combination and the valid predicate.
  • Several valid predicates are combined to form permutations and combinations.
  • The permutations and combinations are screened to determine the template predicate combinations: instead of enumerating all permutations and combinations, they are first screened using the concept of free itemsets from transaction databases.
  • The address is not a null value (a null value means there is no data);
  • that is, the sampled data has a valid value on the data attribute of this template predicate.
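  • The validity check and the free-itemset screening can be sketched as below. The text only states that the free itemset concept is used; how predicates map to items is not given, so this sketch assumes one item per attribute, with an attribute "present" in a row when its value is not null. A combination is kept as free when dropping any single attribute strictly increases its support; single attributes are always kept, which is a simplification made here. All function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def valid_attributes(rows: List[Row], attrs: Sequence[str]) -> List[str]:
    """An attribute yields a valid predicate if at least one row has a
    non-null value on it."""
    return [a for a in attrs if any(r.get(a) is not None for r in rows)]


def support(rows: List[Row], itemset: Sequence[str]) -> int:
    """Number of rows in which every attribute of the itemset is non-null."""
    return sum(1 for r in rows if all(r.get(a) is not None for a in itemset))


def free_combinations(rows: List[Row], attrs: Sequence[str],
                      max_size: int = 2) -> List[FrozenSet[str]]:
    """Keep a combination only if no sub-combination obtained by dropping
    one attribute has the same support (free itemset test). Because support
    never grows when attributes are added, checking these immediate
    sub-combinations is enough."""
    kept = []
    for k in range(1, max_size + 1):
        for combo in combinations(attrs, k):
            s = support(rows, combo)
            subsets = [combo[:i] + combo[i + 1:] for i in range(k)]
            if k == 1 or all(support(rows, sub) > s for sub in subsets):
                kept.append(frozenset(combo))
    return kept


# Made-up sample rows; "fax" is null everywhere, so it is not valid.
rows = [
    {"name": "a", "address": "x", "zip": "1", "fax": None},
    {"name": "b", "address": "w", "zip": None, "fax": None},
    {"name": "c", "address": "y", "zip": "2", "fax": None},
    {"name": "d", "address": None, "zip": "3", "fax": None},
]
valid = valid_attributes(rows, ["name", "address", "zip", "fax"])
combos = free_combinations(rows, valid, max_size=2)
```

  • In this toy data `{name, address}` is dropped because it has the same support as `{address}` alone, while `{address, zip}` is kept because each attribute removes rows the other does not.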
  • In step S130, the target data is filtered according to the constant predicate to build a total set of predicates.
  • The specific process of "filtering the target data according to the constant predicate to construct a total set of predicates" in step S130 can be further explained in conjunction with the following description.
  • S520 Supplement the non-constant predicate data with a constant value according to the target template to generate constant predicate data, where the constant predicate data is a set of constant predicates;
  • The template REE is used to perform constant repair on the full data D, that is, the target data. Constant repair consists of four main steps: (1) use non-constant predicates to confirm the enumeration range; (2) use templates to supplement constants; (3) generate candidate rules; (4) verify rules.
  • The target data is filtered according to the constant predicate to determine the non-constant predicate data, where the non-constant predicate data is the set of non-constant predicates: target data that is constant predicate data is filtered out, and the remaining target data is identified as non-constant predicate data.
  • Constants are supplemented according to the template predicates in the template REE, that is, the target template.
  • Each template predicate is supplemented with constants, thereby constructing constant predicate data that can be enumerated.
  • the total set of predicates includes all non-constant predicates and all constant predicates obtained previously.
  • the full data D, that is, the sub-data in the target data
  • The full data D is filtered through the non-constant predicates in the template REE; only the sub-data that satisfies the non-constant predicates participates in the next step.
  • This approach avoids expensive constant supplementation over the full data D and greatly improves the execution efficiency of the algorithm while ensuring completeness.
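  • The idea of narrowing the enumeration range before supplementing constants can be sketched as follows. This is an assumption-laden simplification: the non-constant predicates are modelled as a single function over one row, the wildcard attributes are given by name, and `supplement_constants` and the column names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, object]


def supplement_constants(rows: List[Row],
                         non_constant_ok: Callable[[Row], bool],
                         wildcard_attrs: Sequence[str]) -> List[Tuple]:
    """Step (1): keep only rows that satisfy the non-constant predicates.
    Step (2): read candidate constants for the wildcard attributes from the
    surviving rows only, instead of enumerating every value in the data."""
    survivors = [r for r in rows if non_constant_ok(r)]
    patterns = {
        tuple(r[a] for a in wildcard_attrs)
        for r in survivors
        if all(r.get(a) is not None for a in wildcard_attrs)
    }
    return sorted(patterns)


# Made-up full data; the non-constant predicate compares two attributes.
full_data = [
    {"ship_city": "Shenzhen", "bill_city": "Shenzhen"},
    {"ship_city": "Beijing", "bill_city": "Shanghai"},
    {"ship_city": "Shenzhen", "bill_city": "Shenzhen"},
    {"ship_city": "Guangzhou", "bill_city": "Guangzhou"},
]
patterns = supplement_constants(
    full_data,
    non_constant_ok=lambda r: r["ship_city"] == r["bill_city"],
    wildcard_attrs=["ship_city"],
)
```

  • Here "Beijing" is never considered as a constant because its row fails the non-constant predicate, which is how the filtering keeps the number of enumerated constants small.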
  • In step S140, association rule mining is performed based on the total set of predicates to generate a total set of candidate rules.
  • The specific process of "performing association rule mining based on the total set of predicates to generate a total set of candidate rules" in step S140 can be further explained in conjunction with the following description.
  • A depth-first search or a breadth-first search is performed over the total set of predicates to obtain several candidate rules, which together form the total set of candidate rules.
  • For the template REE, that is, the target template, depth-first or breadth-first rule mining is performed over the total set of predicates to obtain candidate rules.
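  • A breadth-first enumeration of candidate rules of the form X -> e can be sketched as below. This is a plain level-wise enumeration without any pruning, shown only to make the search order concrete; the cap `max_lhs` on the size of X and the string predicate labels are assumptions made for the example.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple

Candidate = Tuple[FrozenSet[str], str]  # (precondition X, consequence e)


def bfs_candidates(predicates: Sequence[str], max_lhs: int = 2) -> List[Candidate]:
    """Enumerate candidates level by level: all rules whose precondition has
    one predicate, then two, and so on up to max_lhs."""
    candidates: List[Candidate] = []
    for size in range(1, max_lhs + 1):
        for lhs in combinations(predicates, size):
            for e in predicates:
                if e not in lhs:  # the consequence must not repeat X
                    candidates.append((frozenset(lhs), e))
    return candidates


# Hypothetical predicate labels standing in for the total set of predicates.
total_predicates = ["p1", "p2", "p3"]
candidates = bfs_candidates(total_predicates, max_lhs=2)
```

  • With three predicates this yields six candidates at the first level and three at the second; a depth-first variant would visit the same candidates in a different order.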
  • In step S150, valid rules in the total set of candidate rules are determined based on the target data, and valid data is determined based on the valid rules.
  • The specific process of "determining valid rules in the total set of candidate rules based on the target data and determining valid data based on the valid rules" in step S150 can be further explained in conjunction with the following description.
  • S730 Obtain the sub-goal data corresponding to the valid rule, and mark the sub-goal data as the valid data.
  • Valid rules in the total set of candidate rules are determined based on the target data, and valid data is determined based on the valid rules: each candidate rule must be checked against the full data D, that is, the target data, to decide whether it is a valid rule.
  • For each candidate rule, its validity is verified on the full data D, that is, the target data: when some sub-target data in the target data corresponds to the current sub-candidate rule, the candidate rule is valid.
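  • The verification step can be sketched as follows. The text does not define what it means for sub-target data to "correspond" to a rule; this sketch assumes it means that a row satisfies both the precondition X and the consequence e. Rules are encoded as `(name, preconditions, consequence)` with plain Python functions as predicates, and all names and data are made up.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Pred = Callable[[Row], bool]
Rule = Tuple[str, List[Pred], Pred]  # (name, precondition X, consequence e)


def verify(rules: List[Rule], rows: List[Row]) -> Tuple[List[str], List[Row]]:
    """A rule is kept as valid if at least one row satisfies X and e; the
    rows matched by valid rules are collected as the valid data."""
    valid_rules: List[str] = []
    valid_rows: List[Row] = []
    for name, lhs, rhs in rules:
        matched = [r for r in rows if all(p(r) for p in lhs) and rhs(r)]
        if matched:
            valid_rules.append(name)
            for r in matched:
                if r not in valid_rows:  # avoid duplicates across rules
                    valid_rows.append(r)
    return valid_rules, valid_rows


target_data = [
    {"city": "Shenzhen", "province": "Guangdong"},
    {"city": "Beijing", "province": "Beijing"},
]
rules = [
    ("r1", [lambda r: r["city"] == "Shenzhen"],
     lambda r: r["province"] == "Guangdong"),
    ("r2", [lambda r: r["city"] == "Chengdu"],
     lambda r: r["province"] == "Sichuan"),
]
valid_rules, valid_rows = verify(rules, target_data)
```

  • Rule `r2` is rejected because no row in the toy data matches it. A stricter criterion, such as minimum support or confidence thresholds, could replace the simple existence test if the application requires one.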
  • the mining method of the present invention improves the rule recall rate by 2%; after constant repair, the mined rules are more accurate.
  • the present invention can improve the operating efficiency by an average of 12.2 times.
  • the running time of the present invention is 406 seconds, while the running time of method (3) is 2096 seconds; in other words, the mining efficiency is higher.
  • the description is relatively simple. For relevant details, please refer to the partial description of the method embodiment.
  • Referring to FIG. 8, a structural block diagram of a data processing device based on relational data provided by an embodiment of the present application is shown;
  • the device is used for rule mining and repair of constants in relational data.
  • the relational data includes full data and sampled data in the full data, specifically including:
  • the acquisition module 810 is used to acquire target data, and perform data filtering to determine sampling data according to the word meaning of the target data, where the sampling data is a constant predicate and includes at least one;
  • the first building module 820 is used to generate a template predicate based on the sampled data, and build a target template based on the template predicate;
  • the second building module 830 is used to perform data filtering on the target data according to the constant predicate to build a total set of predicates
  • the generation module 840 is used to perform association rule mining based on the predicate aggregate set to generate a candidate rule aggregate set
  • the determination module 850 is configured to determine valid rules in the total set of candidate rules based on the target data, and determine valid data based on the valid rules.
  • the acquisition module 810 includes:
  • the first acquisition sub-module is used to acquire data attributes in the target data
  • the first determination sub-module is used to determine the word meaning type of the target data according to the data attribute, wherein the word meaning type includes constants and no constants;
  • the second determination sub-module is used to filter the target data whose word meaning type corresponds to the constant, and determine it as the sampled data.
  • the first building module 820 includes:
  • the first generation sub-module is used to generate template predicates based on the sampled data
  • the third determination sub-module is used to determine that the template predicate is a valid predicate when the template predicate has a valid value in the target data;
  • the first construction sub-module is used to construct the target template based on the valid predicate.
  • the first building submodule includes:
  • the first generation unit is used to combine the valid predicates to generate permutations
  • the first determination unit is used to screen the permutations and combinations to determine template predicate combinations
  • a first building unit configured to build the target template based on the template predicate combination and the valid predicate.
  • the second building module 830 includes:
  • the first screening sub-module is used to perform data screening on the target data according to the constant predicate to determine non-constant predicate data, where the non-constant predicate data is a set of non-constant predicates;
  • the second generation sub-module is used to supplement the non-constant predicate data with constant values according to the target template to generate constant predicate data, where the constant predicate data is a set of constant predicates;
  • the second construction sub-module is used to construct the predicate total set based on the non-constant predicate set and the constant predicate set.
  • the generation module 840 includes:
  • the third generation sub-module is used to perform a depth-first search based on the total set of predicates to generate a first set of candidate rules; or,
  • the fourth generation sub-module is used to perform a breadth-first search based on the total set of predicates to generate a second set of candidate rules
  • the fifth generation sub-module is used to generate the total set of candidate rules based on the first set of candidate rules or the second set of candidate rules.
  • the determination module 850 includes:
  • the second acquisition sub-module is used to acquire each sub-candidate rule in the total set of candidate rules
  • the fourth determination sub-module is used to verify and determine the validity of each of the sub-candidate rules based on the target data; wherein, when some sub-target data in the target data corresponds to the current sub-candidate rule, the current sub-candidate rule is determined to be a valid rule;
  • the third acquisition sub-module is used to acquire the sub-goal data corresponding to the valid rule, and mark the sub-goal data as the valid data.
  • a computer device for a data processing method based on relational data of the present invention is shown, which may specifically include the following:
  • the above-mentioned computer device 12 is in the form of a general computing device.
  • The components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
  • The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus structures.
  • For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12, including volatile and nonvolatile media, removable and non-removable media.
  • System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32 .
  • Computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (commonly referred to as "hard drives").
  • A disk drive may be provided for reading from and writing to removable non-volatile magnetic disks (e.g., "floppy disks"), as well as an optical disc drive for reading from and writing to removable non-volatile optical discs (e.g., CD-ROM, DVD-ROM or other optical media).
  • each drive may be connected to bus 18 through one or more data media interfaces.
  • The memory may include at least one program product having a set of (e.g., at least one) program modules 42 configured to perform the functions of embodiments of the invention.
  • A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in memory.
  • Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules 42, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • Program modules 42 generally perform functions and/or methods in the described embodiments of the invention.
  • Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, camera, etc.), with one or more devices that enable an operator to interact with computer device 12, and/or with any device (e.g., network card, modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. This communication may occur through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18.
  • the processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing a data processing method based on relational data provided by the embodiment of the present invention.
  • When the above-mentioned processing unit 16 executes the above-mentioned program, it achieves the following: acquiring target data and filtering it according to the word meaning of the target data to determine sampled data, wherein the sampled data is a constant predicate and includes at least one; generating a template predicate based on the sampled data, and constructing a target template based on the template predicate; filtering the target data based on the constant predicate to build a total set of predicates; performing association rule mining based on the total set of predicates to generate a total set of candidate rules; and determining valid rules in the total set of candidate rules based on the target data, and determining valid data based on the valid rules.
  • the constants are repaired by proposing a data processing method.
  • the present invention also provides a computer-readable storage medium on which a computer program is stored.
  • When the program is executed by a processor, the data processing method based on relational data as provided in all embodiments of the present application is implemented.
  • That is, when the program is executed by the processor, the following is achieved: obtaining the target data and filtering it according to the word meaning of the target data to determine sampled data, wherein the sampled data is a constant predicate and includes at least one; generating a template predicate based on the sampled data, and constructing a target template based on the template predicate; filtering the target data based on the constant predicate to build a total set of predicates; performing association rule mining based on the total set of predicates to generate a total set of candidate rules; and determining valid rules in the total set of candidate rules based on the target data, and determining valid data based on the valid rules.
  • the constants are repaired by proposing a data processing method.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (non-exhaustive list) of computer readable storage media include: electrical connections having one or more conductors, portable computer disks, hard drives, random access memory (RAM), read only memory (ROM), Erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including - but not limited to - electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device .
  • Computer program code for performing the operations of the present invention may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • The program code may execute entirely on the operator's computer, partly on the operator's computer, as a stand-alone software package, partly on the operator's computer and partly on a remote computer, or entirely on the remote computer or server.
  • In the case of a remote computer, the remote computer can be connected to the operator's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (e.g., via the Internet using an Internet service provider).

Abstract

The present application provides a relational data-based data processing method and an apparatus thereof, being used for recovering target data of a missing data segment by means of a data relation, and verifying the validity of the recovered target data. The method comprises: obtaining target data, and performing data screening according to the word meaning of the target data to determine sampling data; generating a template predicate according to the sampling data, and constructing a target template according to the template predicate; performing data screening on the target data according to constant predicates to construct a predicate total set; performing association rule mining according to the predicate total set to generate a candidate rule total set; and determining a valid rule in the candidate rule total set according to the target data, and determining valid data according to the valid rule. Therefore, when rule discovery with a constant is performed in large-scale relational data, a valid rule with the constant can also be found without enumerating all possible constants, thereby greatly improving the execution efficiency of rule discovery.

Description

A data processing method and apparatus based on relational data
Technical field
The present application relates to the field of data processing, and in particular to a data processing method and apparatus based on relational data.
Background
Rule discovery in large-scale relational data is time-consuming and labor-intensive. When rules are allowed to contain constants, the cost of rule discovery grows exponentially.
For example, consider the following simple conditional functional dependency (CFD):
address = "Shenzhen City, Guangdong Province" -> postal_code = "518000"
This CFD states that if the address attribute of a record is Shenzhen City, Guangdong Province, then its postal code attribute must be 518000. Rules of this kind are widely used for error detection and correction in relational data. Specifically, when stored data violates the rule (that is, the address attribute is Shenzhen City, Guangdong Province, but the postal code attribute is not 518000), the data is known to contain an error and can then be corrected. In this rule, "Shenzhen City, Guangdong Province" and "518000" are constants, while address and postal code are attribute names of the data.
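The error-detection use of such a CFD can be illustrated with a minimal Python sketch. The function name `cfd_violations`, the attribute keys `address` and `postal_code`, and the sample rows are illustrative assumptions and do not come from the application.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def cfd_violations(rows: List[Row], lhs_attr: str, lhs_const: str,
                   rhs_attr: str, rhs_const: str) -> List[int]:
    """Return the indexes of rows that match the condition of the CFD
    (lhs_attr == lhs_const) but not its consequence (rhs_attr == rhs_const)."""
    return [
        i for i, row in enumerate(rows)
        if row.get(lhs_attr) == lhs_const and row.get(rhs_attr) != rhs_const
    ]


# Hypothetical sample data: the second row violates the Shenzhen rule.
rows = [
    {"address": "Shenzhen City, Guangdong Province", "postal_code": "518000"},
    {"address": "Shenzhen City, Guangdong Province", "postal_code": "510000"},
    {"address": "Guangzhou City, Guangdong Province", "postal_code": "510000"},
]
violating = cfd_violations(
    rows, "address", "Shenzhen City, Guangdong Province", "postal_code", "518000"
)
```

Rows flagged in this way are the ones that would be handed to a subsequent correction step.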
Discovering rules with constants in large-scale data requires considering not only the combinations of different data attributes, but also the constants that each attribute may match. This enumeration is very expensive. Consider the following CFDs:
address = "Guangzhou City, Guangdong Province" -> postal_code = "510000"
address = "Dongguan City, Guangdong Province" -> postal_code = "523000"
address = "Foshan City, Guangdong Province" -> postal_code = "528010"
These CFDs describe similar scenarios and differ only in the constants used: the attributes of the rules (address and postal code) are the same, but the matched constants differ. Enumerating every constant that could possibly match in the data greatly reduces the efficiency of rule discovery; rule discovery on relational data of ordinary size may take days or even weeks.
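To give a rough sense of how quickly the enumeration grows, the sketch below counts the candidate rules of the simplest shape, A = a -> B = b, that naive enumeration would have to consider. The function `count_constant_cfd_candidates` and the sample rows are hypothetical; the application does not prescribe this counting.

```python
from itertools import permutations


def count_constant_cfd_candidates(rows):
    """Count candidate rules of the form A = a -> B = b, one per ordered pair
    of distinct attributes and per pair of constants observed for them."""
    domains = {}
    for row in rows:
        for attr, value in row.items():
            domains.setdefault(attr, set()).add(value)
    return sum(len(domains[a]) * len(domains[b])
               for a, b in permutations(domains, 2))


# Four cities and four postal codes taken from the CFDs discussed above.
rows = [
    {"address": "Shenzhen City, Guangdong Province", "postal_code": "518000"},
    {"address": "Guangzhou City, Guangdong Province", "postal_code": "510000"},
    {"address": "Dongguan City, Guangdong Province", "postal_code": "523000"},
    {"address": "Foshan City, Guangdong Province", "postal_code": "528010"},
]
candidates = count_constant_cfd_candidates(rows)
```

Even with two attributes and four values each, 32 single-predicate candidates arise; longer preconditions and larger domains multiply this further.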
The limited expressive power of CFD rules restricts their applicability in practice. To support constant predicates, CFD rule mining must enumerate all possible combinations of attributes and constants, which is time-consuming and labor-intensive.
Summary of the invention
In view of the above problems, the present application is proposed to provide a data processing method and apparatus based on relational data that overcome, or at least partially solve, those problems, including:
A data processing method based on relational data, the method being used to repair target data having missing data segments by means of data relations and to verify the validity of the repaired target data, the method comprising:
obtaining target data, and performing data screening according to the word meaning of the target data to determine sampling data, wherein the sampling data is a constant predicate and includes at least one;
generating a template predicate according to the sampling data, and constructing a target template according to the template predicate;
performing data screening on the target data according to the constant predicate to construct a predicate total set;
performing association rule mining according to the predicate total set to generate a candidate rule total set;
determining valid rules in the candidate rule total set according to the target data, and determining valid data according to the valid rules.
Further, the step of obtaining target data, and performing data screening according to the word meaning of the target data to determine sampling data, wherein the sampling data is a constant predicate and includes at least one, comprises:
obtaining the data attributes in the target data;
determining the word meaning type of the target data according to the data attributes, wherein the word meaning type is either "with constant" or "without constant";
screening out the target data whose word meaning type is "with constant", and determining it as the sampling data.
Further, the step of generating a template predicate according to the sampling data, and constructing a target template according to the template predicate, comprises:
generating a template predicate according to the sampling data;
when the template predicate has a valid value in the target data, determining that the template predicate is a valid predicate;
constructing a target template according to the valid predicates.
Further, the step of constructing a target template according to the valid predicates comprises:
combining the valid predicates with one another to generate permutations and combinations;
screening the permutations and combinations to determine template predicate combinations;
constructing the target template according to the template predicate combinations and the valid predicates.
Further, the step of performing data screening on the target data according to the constant predicate to construct a predicate total set comprises:
performing data screening on the target data according to the constant predicate to determine non-constant predicate data, wherein the non-constant predicate data is a set of non-constant predicates;
supplementing the non-constant predicate data with constant values according to the target template to generate constant predicate data, wherein the constant predicate data is a set of constant predicates;
constructing the predicate total set according to the set of non-constant predicates and the set of constant predicates.
Further, the step of performing association rule mining according to the predicate total set to generate a candidate rule total set comprises:
performing a depth-first search according to the predicate total set to generate a first candidate rule set; or,
performing a breadth-first search according to the predicate total set to generate a second candidate rule set;
generating the candidate rule total set according to the first candidate rule set or the second candidate rule set.
Further, the step of determining valid rules in the candidate rule total set according to the target data, and determining valid data according to the valid rules, comprises:
obtaining each sub-candidate rule in the candidate rule total set;
verifying each sub-candidate rule against the target data to determine its validity, wherein, when the target data contains sub-target data corresponding to the current sub-candidate rule, the current sub-candidate rule is determined to be a valid rule;
obtaining the sub-target data corresponding to the valid rule, and marking the sub-target data as the valid data.
The present application also discloses a data processing apparatus based on relational data, the apparatus being used to repair target data having missing data segments by means of data relations and to verify the validity of the repaired target data, the apparatus comprising:
an acquisition module, configured to obtain target data, and perform data screening according to the word meaning of the target data to determine sampling data, wherein the sampling data is a constant predicate and includes at least one;
a first construction module, configured to generate a template predicate according to the sampling data, and construct a target template according to the template predicate;
a second construction module, configured to perform data screening on the target data according to the constant predicate to construct a predicate total set;
a generation module, configured to perform association rule mining according to the predicate total set to generate a candidate rule total set;
a determination module, configured to determine valid rules in the candidate rule total set according to the target data, and determine valid data according to the valid rules.
The present application also discloses a device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the data processing method based on relational data described above.
The present application also discloses a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the data processing method based on relational data described above.
The present application has the following advantages:
In the embodiments of the present application, target data is obtained, and data screening is performed according to the word meaning of the target data to determine sampling data, wherein the sampling data is a constant predicate and includes at least one; a template predicate is generated according to the sampling data, and a target template is constructed according to the template predicate; data screening is performed on the target data according to the constant predicate to construct a predicate total set; association rule mining is performed according to the predicate total set to generate a candidate rule total set; valid rules in the candidate rule total set are determined according to the target data, and valid data is determined according to the valid rules. By providing a data processing method that repairs constants, valid rules with constants can be discovered in large-scale relational data without enumerating all possible constants, which greatly improves the execution efficiency of rule discovery.
Description of the drawings
To explain the technical solutions of the present application more clearly, the drawings used in the description of the present application are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application;
FIG. 2 is a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application;
FIG. 3 is a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application;
FIG. 4 is a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application;
FIG. 5 is a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application;
FIG. 6 is a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application;
FIG. 7 is a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application;
FIG. 8 is a structural block diagram of a data processing apparatus based on relational data provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
Detailed description
To make the above objects, features and advantages of the present application clearer and easier to understand, the present application is described in further detail below with reference to the drawings and specific embodiments. The described embodiments are obviously some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application without creative effort fall within the scope of protection of this application.
It should be noted that, unlike CFD rules, the rules used in the present invention are entity enhancing rules (hereinafter REEs). The basic building block of an REE is a predicate p, defined as follows:
p := R(t) | t.A ◎ c | t.A ◎ s.B | M(t.A, s.B)
where ◎ is an operator, either equality or inequality.
R(t) states that t is a tuple variable of relation R.
t.A denotes attribute A of variable t; M is a machine learning model that returns true if t.A and s.B are related, and false otherwise.
t.A ◎ c contains a constant and is called a constant predicate.
t.A ◎ s.B contains no constant and is called a variable predicate.
M(t.A, s.B) is called a machine learning predicate.
Based on predicates, an REE is defined as X -> e, where (1) X is a conjunction of predicates, called the precondition of the REE; and (2) e is a single predicate, called the consequence of the REE.
A concrete REE is as follows:
Express(t) ∧ Express(s) ∧ t.recipient = s.recipient ∧ t.address = "Shenzhen City, Guangdong Province" -> s.postal_code = "510000"
This REE states that if express parcels t and s have the same recipient, and the address of t is "Shenzhen City, Guangdong Province", then the postal code of s must be "510000".
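Because an REE ranges over two tuple variables, checking it means examining pairs of tuples. The following is a minimal sketch that hard-codes the sample REE above; the function name `ree_example_violations`, the attribute keys, and the parcel records are assumptions made for illustration only.

```python
from itertools import product


def ree_example_violations(parcels):
    """Return index pairs (i, j) for tuples (t, s) that satisfy the
    precondition of the sample REE but not its consequence."""
    violations = []
    for (i, t), (j, s) in product(enumerate(parcels), repeat=2):
        precondition = (
            t["recipient"] == s["recipient"]                          # variable predicate
            and t["address"] == "Shenzhen City, Guangdong Province"   # constant predicate
        )
        consequence = s["postal_code"] == "510000"                    # constant predicate
        if precondition and not consequence:
            violations.append((i, j))
    return violations


# Hypothetical parcel records.
parcels = [
    {"recipient": "Li", "address": "Shenzhen City, Guangdong Province", "postal_code": "510000"},
    {"recipient": "Li", "address": "Guangzhou City, Guangdong Province", "postal_code": "518000"},
    {"recipient": "Wang", "address": "Shenzhen City, Guangdong Province", "postal_code": "523000"},
]
pairs = ree_example_violations(parcels)
```

Note that t and s may bind to the same tuple, which is why a pair such as (2, 2) can be reported.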
REE rules can be discovered in relational data by depth-first or breadth-first search.
Most closely related to REE rules are the CFD rules described above; CFD rules support constant predicates and variable predicates over a single tuple variable only, and can be regarded as a special case of REE rules.
CFD-based rule mining algorithms likewise use breadth-first or depth-first search for rule mining.
The overall technical solution of the present invention has two steps: template mining and constant repair. To improve mining efficiency, part of the full data D is extracted to form the sampling data D_s. Template mining is performed on the sampling data D_s, whereas constant repair is performed on the full data D. Template mining comes first; constant repair is then performed on the basis of the mined templates.
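As a minimal sketch of the first phase, the sample D_s can be drawn from D as below. Uniform random sampling, the `fraction` parameter and the fixed seed are assumptions of this sketch; the application itself selects the sampling data by screening on the word meaning of the target data.

```python
import random


def draw_sample(full_data, fraction, seed=0):
    """Draw the sampling data D_s from the full data D without replacement.
    Uniform sampling is an assumption, not the application's selection rule."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    size = min(len(full_data), max(1, int(len(full_data) * fraction)))
    return rng.sample(full_data, size)


D = list(range(100))        # stand-in for the tuples of the full data D
D_s = draw_sample(D, 0.25)  # templates would be mined on D_s, constants repaired on D
```

Templates mined on D_s are cheap to obtain; the more expensive pass over D is reserved for filling in and checking constants.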
Referring to FIG. 1, a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application is shown;
A data processing method based on relational data, the method being used to repair target data having missing data segments by means of data relations and to verify the validity of the repaired target data, the method comprising:
S110. obtaining target data, and performing data screening according to the word meaning of the target data to determine sampling data, wherein the sampling data is a constant predicate and includes at least one;
S120. generating a template predicate according to the sampling data, and constructing a target template according to the template predicate;
S130. performing data screening on the target data according to the constant predicate to construct a predicate total set;
S140. performing association rule mining according to the predicate total set to generate a candidate rule total set;
S150. determining valid rules in the candidate rule total set according to the target data, and determining valid data according to the valid rules.
In the embodiments of the present application, target data is obtained, and data screening is performed according to the word meaning of the target data to determine sampling data, wherein the sampling data is a constant predicate and includes at least one; a template predicate is generated according to the sampling data, and a target template is constructed according to the template predicate; data screening is performed on the target data according to the constant predicate to construct a predicate total set; association rule mining is performed according to the predicate total set to generate a candidate rule total set; valid rules in the candidate rule total set are determined according to the target data, and valid data is determined according to the valid rules. By providing a data processing method that repairs constants, valid rules with constants can be discovered in large-scale relational data without enumerating all possible constants, which greatly improves the execution efficiency of rule discovery.
The method for mining predicate combination rules based on reinforcement learning in this exemplary embodiment is further described below.
As described in step S110, target data is obtained, and data screening is performed according to the word meaning of the target data to determine sampling data, wherein the sampling data is a constant predicate and includes at least one.
In an embodiment of the present invention, the specific process of step S110, "obtaining target data, and performing data screening according to the word meaning of the target data to determine sampling data, wherein the sampling data is a constant predicate and includes at least one", can be further explained with reference to the following description.
Referring to FIG. 2, a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application is shown;
As described in the following steps,
S210. obtaining the data attributes in the target data;
S220. determining the word meaning type of the target data according to the data attributes, wherein the word meaning type is either "with constant" or "without constant";
S230. screening out the target data whose word meaning type is "with constant", and determining it as the sampling data.
It should be noted that, in obtaining the data attributes in the target data, each item of target data has data attributes in one-to-one correspondence with it.
It should be noted that the word meaning type of the target data is determined from its data attributes, the word meaning type being either "with constant" or "without constant".
It should be noted that the target data whose word meaning type is "with constant" is screened out and marked as the sampling data.
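Steps S210 to S230 can be sketched as follows. The application does not state how an attribute is classified as "with constant"; the criterion used here (an attribute is treated as "with constant" when it takes at most `max_distinct` distinct non-null values) is purely a hypothetical stand-in, as are the function name and sample rows.

```python
def constant_attributes(rows, max_distinct=3):
    """S210: collect the attributes and their observed values.
    S220: classify each attribute as 'with constant' or 'without constant'
          (hypothetical criterion: few distinct non-null values).
    S230: return the attributes classified as 'with constant'."""
    values = {}
    for row in rows:
        for attr, value in row.items():
            if value is not None:
                values.setdefault(attr, set()).add(value)
    return sorted(attr for attr, seen in values.items()
                  if len(seen) <= max_distinct)


# Hypothetical rows: addresses repeat, tracking numbers never do.
rows = [
    {"address": "Shenzhen City, Guangdong Province", "tracking_no": "A1"},
    {"address": "Shenzhen City, Guangdong Province", "tracking_no": "A2"},
    {"address": "Guangzhou City, Guangdong Province", "tracking_no": "A3"},
    {"address": "Guangzhou City, Guangdong Province", "tracking_no": "A4"},
]
selected = constant_attributes(rows, max_distinct=2)
```

Under this assumed criterion, only `address` would be a candidate for constant predicates, while `tracking_no` would be left to variable predicates.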
As described in step S120, a template predicate is generated according to the sampling data, and a target template is constructed according to the template predicate.
In an embodiment of the present invention, the specific process of step S120, "generating a template predicate according to the sampling data, and constructing a target template according to the template predicate", can be further explained with reference to the following description.
Referring to FIG. 3, a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application is shown;
As described in the following steps,
S310. generating a template predicate according to the sampling data;
S320. when the template predicate has a valid value in the target data, determining that the template predicate is a valid predicate;
S330. constructing a target template according to the valid predicates.
It should be noted that, in generating template predicates according to the sampling data, predicates are formed from the data attributes of the sampling data, and several predicates are generated from those attributes. A predicate generated from a data attribute is initially an invalid predicate; whether it becomes a template predicate is decided by its corresponding constant values.
It should be noted that, when a template predicate is verified, as long as at least one item of sub-data in the target data has a valid value on the data attribute of the template predicate, the predicate is taken as a valid predicate, and the valid predicates then make up the template REE, i.e. the target template.
In a specific implementation, the template REE, i.e. the target template, is an REE rule in which every constant is replaced by the wildcard "_".
It should be noted that the wildcard can match any constant value.
In a specific implementation, the template REE corresponding to the sample REE is as follows:
Express(t) ∧ Express(s) ∧ t.recipient = s.recipient ∧ t.address = "_" -> s.postal_code = "_".
The advantage of a template REE is that one or more REE rules that differ only in the constants of their constant predicates can be represented by the same template REE.
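The wildcard mechanism can be sketched for a single constant predicate as follows. Representing a predicate as the tuple `(variable, attribute, operator, constant)` and the helper names `to_template` and `matches` are assumptions of this sketch.

```python
WILDCARD = "_"


def to_template(predicate):
    """Replace the constant of a constant predicate with the wildcard."""
    var, attr, op, _constant = predicate
    return (var, attr, op, WILDCARD)


def matches(template, predicate):
    """A template predicate matches a constant predicate when variable,
    attribute and operator agree and the template constant is the wildcard
    or equals the predicate's constant."""
    return (template[:3] == predicate[:3]
            and template[3] in (WILDCARD, predicate[3]))


shenzhen = ("t", "address", "=", "Shenzhen City, Guangdong Province")
guangzhou = ("t", "address", "=", "Guangzhou City, Guangdong Province")
template = to_template(shenzhen)
```

Both address predicates collapse to the same template, so only one template needs to be enumerated for them.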
Multiple CFDs can all be expressed as:
Express(t) ∧ t.address = "_" -> t.postal_code = "_"
To distinguish the multiple REE rules under the same template REE, each REE rule attaches a pattern tuple to the template REE to assign the constants;
for example, the pattern tuple of the CFD address = "Guangzhou City, Guangdong Province" -> postal_code = "510000" is ("Guangzhou City, Guangdong Province", "510000").
Multiple pattern tuples make up the pattern tableau form of an REE, as shown in Table 1:
Table 1 (pattern tableau; rendered as image PCTCN2022099183-appb-000001 in the original publication)
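Since the contents of Table 1 are only available as an image, the sketch below builds a pattern tableau from the three CFDs quoted in the background section instead. The function `build_pattern_tableaux` and the 4-tuple rule encoding are illustrative assumptions.

```python
from collections import defaultdict


def build_pattern_tableaux(rules):
    """Group constant rules by their template; each rule contributes one
    pattern tuple (lhs constant, rhs constant) to its template's tableau.
    Each rule is given as (lhs_attr, lhs_const, rhs_attr, rhs_const)."""
    tableaux = defaultdict(list)
    for lhs_attr, lhs_const, rhs_attr, rhs_const in rules:
        template = f'{lhs_attr} = "_" -> {rhs_attr} = "_"'
        tableaux[template].append((lhs_const, rhs_const))
    return dict(tableaux)


rules = [
    ("address", "Guangzhou City, Guangdong Province", "postal_code", "510000"),
    ("address", "Dongguan City, Guangdong Province", "postal_code", "523000"),
    ("address", "Foshan City, Guangdong Province", "postal_code", "528010"),
]
tableaux = build_pattern_tableaux(rules)
```

All three rules fall under a single template, and the tableau for that template holds one pattern tuple per rule.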
As an example, for constant predicates that differ only in their constant value on the same attribute, such as t.address = "Shenzhen City, Guangdong Province", t.address = "Guangzhou City, Guangdong Province" and t.address = "Dongguan City, Guangdong Province", only one template predicate is enumerated, namely t.address = "_".
As described in step S330, a target template is constructed according to the valid predicates;
In an embodiment of the present invention, the specific process of step S330, "constructing a target template according to the valid predicates", can be further explained with reference to the following description.
Referring to FIG. 4, a flowchart of the steps of a data processing method based on relational data provided by an embodiment of the present application is shown;
As described in the following steps,
S410. combining the valid predicates with one another to generate permutations and combinations;
S420. screening the permutations and combinations to determine template predicate combinations;
S430. constructing the target template according to the template predicate combinations and the valid predicates.
It should be noted that the valid predicates are combined with one another to generate permutations and combinations: several valid predicates are arranged and combined to form them.
It should be noted that the permutations and combinations are screened to determine the template predicate combinations: rather than enumerating all of them, the concept of a free itemset from transaction databases is used to pre-screen the permutations and combinations, thereby determining the template predicate combinations.
As an example, rather than enumerating all permutations and combinations of template predicates, the concept of a free itemset from transaction databases is used to pre-screen them; only template predicate combinations that pass the screening form a valid template REE, i.e. the target template.
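One way to read the free-itemset screening is sketched below. An itemset is free when none of its proper subsets has the same support; it suffices to compare against the subsets obtained by dropping one item. Treating each row as a transaction containing the attributes on which it is non-null, keeping single attributes unconditionally, and the names `support` and `free_combinations` are assumptions of this sketch, since the application does not spell out how the concept is applied.

```python
from itertools import combinations


def support(rows, attrs):
    """Number of rows that are non-null on every attribute in attrs."""
    return sum(all(row.get(a) is not None for a in attrs) for row in rows)


def free_combinations(rows, attrs, max_size):
    """Keep a combination only if each subset obtained by dropping one
    attribute has strictly larger support (the free-itemset condition).
    Single attributes are kept unconditionally in this sketch."""
    kept = []
    for size in range(1, max_size + 1):
        for combo in combinations(attrs, size):
            s = support(rows, combo)
            if size == 1 or all(support(rows, sub) > s
                                for sub in combinations(combo, size - 1)):
                kept.append(combo)
    return kept


# Hypothetical rows: address and postal_code are always null together.
rows = [
    {"address": "A", "postal_code": "P", "recipient": "R"},
    {"address": "A", "postal_code": "P", "recipient": None},
    {"address": None, "postal_code": None, "recipient": "R"},
    {"address": "A", "postal_code": "P", "recipient": "R"},
]
kept = free_combinations(rows, ["address", "postal_code", "recipient"], 3)
```

Here the pair (address, postal_code) is discarded because adding postal_code to address does not change the support, and the triple is discarded for the same reason, so fewer combinations reach template construction.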
In a specific implementation, for example, when the template predicate is t.address = "_" and the sampling data contains an express-delivery record whose address is not null (null meaning that no data is present), that sampling data has a valid value on the data attribute of this template predicate.
It should be noted that, for a template REE (the target template), as long as the data attributes of at least one set of data satisfy the template REE, the rule takes part in subsequent rule verification as a valid rule candidate. For example, for Express(t) ∧ t.address = "_" -> t.postal_code = "_", if the data contains an express-delivery record whose address and postal code are both non-null, the data attributes of that record satisfy the template REE, i.e. the target template.
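The two checks just described can be sketched as follows for single-tuple templates. The helper names `has_valid_value` and `template_is_candidate`, and the decision to treat the empty string as null, are assumptions of this sketch.

```python
def _is_present(value):
    """Treat None and the empty string as null (an assumption of this sketch)."""
    return value not in (None, "")


def has_valid_value(rows, attr):
    """A template predicate on attr is valid if at least one row is non-null on attr."""
    return any(_is_present(row.get(attr)) for row in rows)


def template_is_candidate(rows, attrs):
    """A single-tuple template is a rule candidate if at least one row is
    non-null on every attribute the template mentions."""
    return any(all(_is_present(row.get(a)) for a in attrs) for row in rows)


# Hypothetical records: no single row has both address and postal code.
rows = [
    {"address": "Shenzhen City, Guangdong Province", "postal_code": None},
    {"address": None, "postal_code": "518000"},
]
both = template_is_candidate(rows, ["address", "postal_code"])
```

Each predicate is individually valid here, yet the template fails because the non-null values never occur in the same record.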
As described in step S130, data screening is performed on the target data according to the constant predicate to construct a total set of predicates;
In an embodiment of the present invention, the specific process of "performing data screening on the target data according to the constant predicate to construct a total set of predicates" in step S130 can be further explained with reference to the following description.
Referring to Figure 5, a flow chart of the steps of a relational data-based data processing method provided by an embodiment of the present application is shown;
As described in the following steps,
S510. Perform data screening on the target data according to the constant predicate to determine non-constant predicate data, wherein the non-constant predicate data is a set of non-constant predicates;
S520. Supplement the non-constant predicate data with constant values according to the target template to generate constant predicate data, wherein the constant predicate data is a set of constant predicates;
S530. Construct the total set of predicates from the set of non-constant predicates and the set of constant predicates.
It should be noted that, after the template REE has been mined on D_s (the sampled data), the template REE is used to perform constant repair on the full data D (the target data). Constant repair comprises the following four main steps: (1) confirming the enumeration range with non-constant predicates; (2) supplementing constants using the template; (3) generating candidate rules; (4) verifying the rules.
It should be noted that data screening is performed on the target data according to the constant predicate to determine the non-constant predicate data, wherein the non-constant predicate data is a set of non-constant predicates: the constant predicate is used to screen out the part of the target data that is constant predicate data, and the remaining target data, which is not constant predicate data, is identified as the non-constant predicate data.
It should be noted that, for the non-constant predicate data obtained by the screening, constants are supplemented according to the template predicates in the template REE (the target template); every template predicate is supplemented with constants, so as to construct constant predicate data that can be enumerated.
As an example, for the template predicate t.address="_", all address attribute values in the data are found and filled into the position of the wildcard "_" to form the constant predicate data.
It should be noted that the set of constant predicates and the set of non-constant predicates together form a total set of predicates, which includes all the non-constant predicates and all the constant predicates obtained above.
As an example, given a template REE (the target template), the sub-data of the full data D (the target data) is screened using the constant predicates in the template REE; only sub-data that qualifies as non-constant predicate data takes part in the subsequent constant supplementation. This avoids costly constant supplementation over the entire full data D and, while preserving completeness, greatly improves the execution efficiency of the algorithm.
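Steps S510 to S530 can be sketched as follows. This is one possible reading rather than the patented implementation: it assumes a single non-constant predicate of the form t.A = s.A between two distinct records, represents predicates as plain strings, and uses hypothetical helper names (`rows_in_range`, `fill_constants`, `build_predicate_pool`) and sample rows.

```python
from collections import Counter


def rows_in_range(rows, join_attr):
    """Keep rows that can satisfy t.join_attr = s.join_attr with another row."""
    counts = Counter(r[join_attr] for r in rows if r.get(join_attr) is not None)
    return [r for r in rows
            if r.get(join_attr) is not None and counts[r[join_attr]] >= 2]


def fill_constants(rows, template_attrs):
    """Instantiate t.attr="_" with every distinct non-null value of attr."""
    return {f't.{attr}="{row[attr]}"'
            for attr in template_attrs
            for row in rows
            if row.get(attr) is not None}


def build_predicate_pool(rows, non_constant_predicates, join_attr, template_attrs):
    """Union of the non-constant predicates and the filled constant predicates."""
    scoped = rows_in_range(rows, join_attr)                    # range confirmation
    constant_predicates = fill_constants(scoped, template_attrs)  # constant supplement
    return set(non_constant_predicates) | constant_predicates     # total set


rows = [
    {"recipient": "Li", "address": "Shenzhen City, Guangdong Province"},
    {"recipient": "Li", "address": "Shenzhen City, Guangdong Province"},
    {"recipient": "Wang", "address": "Beijing"},
]
pool = build_predicate_pool(rows, {"t.recipient=s.recipient"}, "recipient", ["address"])
```

The record for "Wang" has no partner with the same recipient, so its address never enters the pool; this is the sense in which range confirmation removes constants that cannot form a valid rule before any enumeration takes place.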
As described in step S140, association rule mining is performed on the total set of predicates to generate a total set of candidate rules;
In an embodiment of the present invention, the specific process of "performing association rule mining on the total set of predicates to generate a total set of candidate rules" in step S140 can be further explained with reference to the following description.
Referring to Figure 6, a flow chart of the steps of a relational data-based data processing method provided by an embodiment of the present application is shown;
As described in the following steps,
S610. Perform a depth-first search on the total set of predicates to generate a first candidate rule set; or,
S620. Perform a breadth-first search on the total set of predicates to generate a second candidate rule set;
S630. Generate the total set of candidate rules from the first candidate rule set or the second candidate rule set.
It should be noted that a depth-first search or a breadth-first search over the total set of predicates yields a number of candidate rules, and these candidate rules form the total set of candidate rules.
As an example, rule mining based on the template REE (the target template) is carried out again over the total set of predicates, in either depth-first or breadth-first order, to obtain the candidate rules.
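The two traversal orders can be sketched as below. The sketch assumes that a candidate rule is a pair (precondition set, consequence predicate), that predicates are strings, and that the precondition size is capped by a hypothetical parameter `max_body`; none of these details is fixed by the text above. Both functions produce the same candidates and differ only in the order of visiting them.

```python
from collections import deque


def rules_for(body, predicates):
    """Pair a precondition set with every predicate outside it as consequence."""
    return [(body, p) for p in predicates if p not in body]


def candidates_bfs(predicates, max_body):
    """Enumerate candidate rules level by level (smaller preconditions first)."""
    preds = sorted(predicates)
    queue = deque([()])  # each entry is a tuple of indices into preds
    rules = []
    while queue:
        idxs = queue.popleft()
        if idxs:
            rules.extend(rules_for(frozenset(preds[i] for i in idxs), preds))
        if len(idxs) < max_body:
            start = idxs[-1] + 1 if idxs else 0
            queue.extend(idxs + (j,) for j in range(start, len(preds)))
    return rules


def candidates_dfs(predicates, max_body):
    """Enumerate the same candidate rules, extending one branch at a time."""
    preds = sorted(predicates)
    rules = []

    def visit(idxs):
        if idxs:
            rules.extend(rules_for(frozenset(preds[i] for i in idxs), preds))
        if len(idxs) < max_body:
            start = idxs[-1] + 1 if idxs else 0
            for j in range(start, len(preds)):
                visit(idxs + (j,))

    visit(())
    return rules


bfs = candidates_bfs({"a", "b", "c"}, max_body=2)
dfs = candidates_dfs({"a", "b", "c"}, max_body=2)
```

Breadth-first order makes it easy to stop at a given precondition size, whereas depth-first order keeps only one branch in memory at a time; which one suits a given data set is a design choice that the text leaves open.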
As described in step S150, the valid rules in the total set of candidate rules are determined according to the target data, and valid data is determined according to the valid rules;
In an embodiment of the present invention, the specific process of "determining the valid rules in the total set of candidate rules according to the target data, and determining valid data according to the valid rules" in step S150 can be further explained with reference to the following description.
Referring to Figure 7, a flow chart of the steps of a relational data-based data processing method provided by an embodiment of the present application is shown;
As described in the following steps,
S710. Obtain each sub-candidate rule in the total set of candidate rules;
S720. Verify each sub-candidate rule against the target data to determine its validity, wherein, when sub-target data corresponding to the current sub-candidate rule exists in the target data, the current sub-candidate rule is determined to be a valid rule;
S730. Obtain the sub-target data corresponding to the valid rule, and mark the sub-target data as the valid data.
It should be noted that the valid rules in the total set of candidate rules are determined according to the target data, and the valid data is determined according to the valid rules; whether a candidate rule is a valid rule must be determined on the full data D, i.e., the target data.
As an example, the validity of each candidate rule is verified on the full data D (the target data); when sub-target data corresponding to the current sub-candidate rule exists in the target data, the candidate rule is valid.
In a specific implementation, suppose the candidate rule is express(t) ∧ express(s) ∧ t.recipient = s.recipient ∧ t.address = "Shenzhen City, Guangdong Province" -> s.zip = "510000". The rule states that if express t and express s in the data have the same recipient and the address of express t is "Shenzhen City, Guangdong Province", then the zip code of express s must be "510000". When the data corresponding to express t and express s satisfies this, the candidate rule is deemed valid; the valid rules form the final output.
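The verification of this example rule can be sketched as follows. The sketch follows the existence criterion of step S720 literally, so a rule counts as valid as soon as one pair of records satisfies both sides; it does not look for violating pairs. The record layout, the function names and the sample rows are assumptions, and the pairwise scan is quadratic in the number of records, which is acceptable only for illustration.

```python
from itertools import permutations

ADDRESS = "Shenzhen City, Guangdong Province"
ZIP = "510000"


def precondition(t, s):
    """t.recipient = s.recipient ∧ t.address = ADDRESS"""
    return t["recipient"] == s["recipient"] and t["address"] == ADDRESS


def consequence(t, s):
    """s.zip = ZIP"""
    return s["zip"] == ZIP


def supporting_pairs(rows, pre, post):
    """Index pairs (t, s) with t != s that satisfy both sides of the rule."""
    return [(i, j) for i, j in permutations(range(len(rows)), 2)
            if pre(rows[i], rows[j]) and post(rows[i], rows[j])]


rows = [
    {"recipient": "Li", "address": ADDRESS, "zip": "510000"},
    {"recipient": "Li", "address": ADDRESS, "zip": "510000"},
    {"recipient": "Wang", "address": "Beijing", "zip": "100000"},
]
pairs = supporting_pairs(rows, precondition, consequence)
rule_is_valid = bool(pairs)                               # S720: at least one match
valid_rows = sorted({i for pair in pairs for i in pair})  # S730: mark matching records
```

With these rows the two records for "Li" support the rule in both directions, so the rule is kept and both records are marked as valid data.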
Technical effects of the present invention:
CFD has no concept comparable to a template, so all constants in the full data have to be enumerated to generate valid CFD rules with constants. By contrast, using the definition of the template REE, a small sampled data set D_s is first drawn from the global data for template mining. Because the amount of data is small, this process is very fast compared with CFD's constant enumeration over the global data. Second, with the mined template REE, costly constant enumeration over the full data D is unnecessary: only the constants that could form valid rules need to be enumerated. Constants that cannot form valid rules are excluded during the range confirmation that precedes enumeration, which avoids redundant and useless computation.
We compared the accuracy and mining efficiency of three rule mining methods on several public data sets: (1) the rule mining method of the present invention, which performs template mining on D_s followed by constant repair on D; (2) rule mining directly on D_s; and (3) rule mining directly on D.
Compared with method (2), the mining method of the present invention improves rule recall by 2%; after constant repair, the mined rules are more accurate.
Compared with method (3), the present invention improves running efficiency by a factor of 12.2 on average. On the large DBLP data set, which has 3 relational tables, 18 attributes, and 1.8 million records, the running time of the present invention is 406 seconds versus 2096 seconds for method (3), a speedup of about 5.2 times on this data set.
Since the apparatus embodiments are substantially similar to the method embodiments, they are described only briefly; for related details, refer to the corresponding description of the method embodiments.
Referring to Figure 8, a structural block diagram of a relational data-based data processing apparatus provided by an embodiment of the present application is shown;
The apparatus is used for rule mining and repair of constants in relational data, the relational data comprising full data and sampled data drawn from the full data, and specifically includes:
An acquisition module 810, configured to acquire target data and to perform data screening according to the semantics of the target data to determine sampled data, wherein the sampled data is a constant predicate and includes at least one;
A first construction module 820, configured to generate template predicates from the sampled data and to construct a target template from the template predicates;
A second construction module 830, configured to perform data screening on the target data according to the constant predicate to construct a total set of predicates;
A generation module 840, configured to perform association rule mining on the total set of predicates to generate a total set of candidate rules;
A determination module 850, configured to determine the valid rules in the total set of candidate rules according to the target data, and to determine valid data according to the valid rules.
In an embodiment of the present invention, the acquisition module 810 includes:
A first acquisition submodule, configured to acquire the data attributes in the target data;
A first determination submodule, configured to determine the semantic type of the target data according to the data attributes, wherein the semantic type is either "with constant" or "without constant";
A second determination submodule, configured to select the target data whose semantic type is "with constant" and to determine it as the sampled data.
In an embodiment of the present invention, the first construction module 820 includes:
A first generation submodule, configured to generate template predicates from the sampled data;
A third determination submodule, configured to determine that a template predicate is a valid predicate when the template predicate has a valid value in the target data;
A first construction submodule, configured to construct the target template from the valid predicates.
In an embodiment of the present invention, the first construction submodule includes:
A first generation unit, configured to combine the valid predicates with one another to generate permutations and combinations;
A first determination unit, configured to screen the permutations and combinations to determine template predicate combinations;
A first construction unit, configured to construct the target template from the template predicate combinations and the valid predicates.
In an embodiment of the present invention, the second construction module 830 includes:
A first screening submodule, configured to perform data screening on the target data according to the constant predicate to determine non-constant predicate data, wherein the non-constant predicate data is a set of non-constant predicates;
A second generation submodule, configured to supplement the non-constant predicate data with constant values according to the target template to generate constant predicate data, wherein the constant predicate data is a set of constant predicates;
A second construction submodule, configured to construct the total set of predicates from the set of non-constant predicates and the set of constant predicates.
In an embodiment of the present invention, the generation module 840 includes:
A third generation submodule, configured to perform a depth-first search on the total set of predicates to generate a first candidate rule set; or,
A fourth generation submodule, configured to perform a breadth-first search on the total set of predicates to generate a second candidate rule set;
A fifth generation submodule, configured to generate the total set of candidate rules from the first candidate rule set or the second candidate rule set.
In an embodiment of the present invention, the determination module 850 includes:
A second acquisition submodule, configured to obtain each sub-candidate rule in the total set of candidate rules;
A fourth determination submodule, configured to verify each sub-candidate rule against the target data to determine its validity, wherein, when sub-target data corresponding to the current sub-candidate rule exists in the target data, the current sub-candidate rule is determined to be a valid rule;
A third acquisition submodule, configured to obtain the sub-target data corresponding to the valid rule and to mark the sub-target data as the valid data.
Referring to Figure 9, a computer device for the relational data-based data processing method of the present invention is shown, which may specifically include the following:
The computer device 12 takes the form of a general-purpose computing device. The components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer device 12, including volatile and non-volatile media, and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read from and write to a non-removable, non-volatile magnetic medium (commonly called a "hard disk drive"). Although not shown in Figure 9, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical medium) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory may include at least one program product having a set of (for example, at least one) program modules 42 that are configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules 42, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, or a camera), with one or more devices that enable an operator to interact with the computer device 12, and/or with any device (such as a network card or a modem) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. In addition, the computer device 12 may communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in Figure 9, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, and data backup storage systems 34.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the relational data-based data processing method provided by the embodiments of the present invention.
That is, when executing the above program, the processing unit 16 implements the following: acquiring target data, and performing data screening according to the semantics of the target data to determine sampled data, wherein the sampled data is a constant predicate and includes at least one; generating template predicates from the sampled data, and constructing a target template from the template predicates; performing data screening on the target data according to the constant predicate to construct a total set of predicates; performing association rule mining on the total set of predicates to generate a total set of candidate rules; and determining the valid rules in the total set of candidate rules according to the target data, and determining valid data according to the valid rules. The proposed data processing method thereby repairs constants.
In an embodiment of the present invention, a computer-readable storage medium is also provided, on which a computer program is stored; when executed by a processor, the program implements the relational data-based data processing method provided by all embodiments of the present application:
That is, when executed by the processor, the program implements the following: acquiring target data, and performing data screening according to the semantics of the target data to determine sampled data, wherein the sampled data is a constant predicate and includes at least one; generating template predicates from the sampled data, and constructing a target template from the template predicates; performing data screening on the target data according to the constant predicate to construct a total set of predicates; performing association rule mining on the total set of predicates to generate a total set of candidate rules; and determining the valid rules in the total set of candidate rules according to the target data, and determining valid data according to the valid rules. The proposed data processing method thereby repairs constants.
Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal that is propagated in baseband or as part of a carrier wave and that carries computer-readable program code. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the operator's computer, partly on the operator's computer, as a stand-alone software package, partly on the operator's computer and partly on a remote computer, or entirely on the remote computer or server. Where a remote computer is involved, the remote computer may be connected to the operator's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements that are not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.
The relational data-based data processing method and apparatus provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, a person of ordinary skill in the art may, based on the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

  1. A relational data-based data processing method, the method being used to repair target data having missing data segments by means of data relationships and to verify the validity of the repaired target data, characterized by comprising:
    acquiring target data, and performing data screening according to the semantics of the target data to determine sampled data, wherein the sampled data is a constant predicate and includes at least one;
    generating template predicates from the sampled data, and constructing a target template from the template predicates;
    performing data screening on the target data according to the constant predicate to construct a total set of predicates;
    performing association rule mining on the total set of predicates to generate a total set of candidate rules;
    determining the valid rules in the total set of candidate rules according to the target data, and determining valid data according to the valid rules.
  2. The method according to claim 1, characterized in that the step of obtaining target data, and performing data screening according to the word meaning of the target data to determine sampled data, wherein the sampled data are constant predicates and include at least one constant predicate, comprises:
    obtaining data attributes in the target data;
    determining the word-meaning type of the target data according to the data attributes, wherein the word-meaning type includes "with constants" and "without constants";
    screening out the target data whose word-meaning type is "with constants", and determining it as the sampled data.
  3. The method according to claim 1, characterized in that the step of generating template predicates from the sampled data, and constructing a target template from the template predicates, comprises:
    generating template predicates from the sampled data;
    when a template predicate has a valid value in the target data, determining that the template predicate is a valid predicate;
    constructing the target template from the valid predicates.
  4. The method according to claim 3, characterized in that the step of constructing the target template from the valid predicates comprises:
    combining the valid predicates with one another to generate permutations and combinations;
    screening the permutations and combinations to determine template predicate combinations;
    constructing the target template from the template predicate combinations and the valid predicates.
  5. The method according to claim 1, characterized in that the step of performing data screening on the target data according to the constant predicates to construct a total predicate set comprises:
    performing data screening on the target data according to the constant predicates to determine non-constant predicate data, wherein the non-constant predicate data is a set of non-constant predicates;
    supplementing the non-constant predicate data with constant values according to the target template to generate constant predicate data, wherein the constant predicate data is a set of constant predicates;
    constructing the total predicate set from the set of non-constant predicates and the set of constant predicates.
  6. The method according to claim 1, characterized in that the step of performing association rule mining on the total predicate set to generate a candidate rule set comprises:
    performing a depth-first search on the total predicate set to generate a first candidate rule set; or,
    performing a breadth-first search on the total predicate set to generate a second candidate rule set;
    generating the total candidate rule set from the first candidate rule set or the second candidate rule set.
  7. The method according to claim 1, characterized in that the step of determining valid rules in the total candidate rule set according to the target data, and determining valid data according to the valid rules, comprises:
    obtaining each sub-candidate rule in the total candidate rule set;
    verifying each sub-candidate rule against the target data to determine the validity of each sub-candidate rule, wherein, when the target data contains sub-target data corresponding to a current sub-candidate rule, the current sub-candidate rule is determined to be a valid rule;
    obtaining the sub-target data corresponding to the valid rule, and marking the sub-target data as the valid data.
  8. A data processing apparatus based on relational data, the apparatus being used to repair target data having missing data segments by means of data relationships and to verify the validity of the repaired target data, characterized by comprising:
    an acquisition module, configured to obtain target data, and perform data screening according to the word meaning of the target data to determine sampled data, wherein the sampled data are constant predicates and include at least one constant predicate;
    a first construction module, configured to generate template predicates from the sampled data, and construct a target template from the template predicates;
    a second construction module, configured to perform data screening on the target data according to the constant predicates to construct a total predicate set;
    a generation module, configured to perform association rule mining on the total predicate set to generate a total candidate rule set;
    a determination module, configured to determine valid rules in the total candidate rule set according to the target data, and determine valid data according to the valid rules.
  9. A computer device, characterized by comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
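
The flow of claims 1 to 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. Everything in it is an assumption made for illustration: the sample rows, the function names, the heuristic that an attribute is "with constants" when it has at most `max_distinct` distinct values, and the simplification that only single-tuple constant predicates of the form `attribute = value` are used. The two-tuple non-constant predicates of claim 5 and the combination screening of claim 4 are omitted for brevity, and only the breadth-first variant of claim 6 is shown.

```python
from itertools import combinations

# Hypothetical target data; the last row has a missing data segment.
ROWS = [
    {"city": "Shenzhen", "country": "CN", "zip": "518000"},
    {"city": "Shenzhen", "country": "CN", "zip": "518001"},
    {"city": "Beijing", "country": "CN", "zip": "100000"},
    {"city": "Shenzhen", "country": None, "zip": "518002"},
]

def sample_constant_predicates(rows, max_distinct=2):
    """Claims 1-2 (assumed heuristic): keep attributes with few distinct values."""
    preds = set()
    for attr in rows[0]:
        values = {r[attr] for r in rows if r[attr] is not None}
        if 0 < len(values) <= max_distinct:
            preds |= {(attr, v) for v in values}
    return preds

def build_template(rows, constant_preds):
    """Claim 3: a template predicate (attr = _) is valid if the data has a value for it."""
    attrs = sorted({a for a, _ in constant_preds})
    return [a for a in attrs if any(r[a] is not None for r in rows)]

def build_predicate_set(rows, template):
    """Claim 5 (simplified): fill each template slot with constants observed in the data."""
    return {(a, r[a]) for a in template for r in rows if r[a] is not None}

def mine_candidates(preds, max_len=2):
    """Claim 6, breadth-first variant: enumerate antecedents level by level."""
    preds = sorted(preds)
    candidates = []
    for size in range(1, max_len + 1):  # one level per antecedent size
        for lhs in combinations(preds, size):
            lhs_attrs = {a for a, _ in lhs}
            if len(lhs_attrs) < size:  # skip two constants on the same attribute
                continue
            candidates += [(lhs, rhs) for rhs in preds if rhs[0] not in lhs_attrs]
    return candidates

def validate(rows, candidates):
    """Claim 7: a rule is valid if some row matches it; those rows are the valid data."""
    valid_rules, valid_rows = [], set()
    for lhs, rhs in candidates:
        support = [i for i, r in enumerate(rows)
                   if all(r[a] == v for a, v in lhs) and r[rhs[0]] == rhs[1]]
        if support:
            valid_rules.append((lhs, rhs))
            valid_rows.update(support)
    return valid_rules, sorted(valid_rows)

constant_preds = sample_constant_predicates(ROWS)
template = build_template(ROWS, constant_preds)
predicates = build_predicate_set(ROWS, template)
candidates = mine_candidates(predicates)
valid_rules, valid_rows = validate(ROWS, candidates)
print(valid_rules)
print(valid_rows)
```

In this sketch a rule counts as valid as soon as one row supports it, which mirrors the wording of claim 7. A practical system would likely add support and confidence thresholds before using a rule such as `city = Shenzhen -> country = CN` to fill the missing `country` value in the last row.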
PCT/CN2022/099183 2022-06-09 2022-06-16 Relational data-based data processing method and apparatus thereof WO2023236238A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210648304.3 2022-06-09
CN202210648304.3A CN115033650A (en) 2022-06-09 2022-06-09 Data processing method and device based on relational data

Publications (1)

Publication Number Publication Date
WO2023236238A1 (en) 2023-12-14

Family

ID=83122974

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099183 WO2023236238A1 (en) 2022-06-09 2022-06-16 Relational data-based data processing method and apparatus thereof

Country Status (2)

Country Link
CN (1) CN115033650A (en)
WO (1) WO2023236238A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116610725B (en) * 2023-05-18 2024-03-12 深圳计算科学研究院 Entity enhancement rule mining method and device applied to big data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103500208A (en) * 2013-09-30 2014-01-08 中国科学院自动化研究所 Deep layer data processing method and system combined with knowledge base
CN103699663A (en) * 2013-12-27 2014-04-02 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base
US20160125067A1 (en) * 2014-10-31 2016-05-05 International Business Machines Corporation Entity resolution between datasets
WO2018201916A1 (en) * 2017-05-04 2018-11-08 华为技术有限公司 Data query method, device, and database system
US20200193286A1 (en) * 2017-05-09 2020-06-18 Sri International Deep adaptive semantic logic network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI-LONG TAN, WAN DING-SHENG; QIAN ZHEN-XING: "Research and application of FastCFD algorithm based on conditional function dependencies", XINXI JISHU = INFORMATION TECHNOLOGY, XINXI CHANYEBU DIANZI XINXI ZHONGXIN, CN, vol. 7, no. 7, 24 July 2018 (2018-07-24), CN, pages 1 - 5, XP093114083, ISSN: 1009-2552, DOI: 10.13274/j.cnki.hdzj.2018.07.001 *

Also Published As

Publication number Publication date
CN115033650A (en) 2022-09-09

Similar Documents

Publication Publication Date Title
CN111709527A (en) Operation and maintenance knowledge map library establishing method, device, equipment and storage medium
WO2020082673A1 (en) Invoice inspection method and apparatus, computing device and storage medium
US11775504B2 (en) Computer estimations based on statistical tree structures
WO2022121337A1 (en) Data exploration method and apparatus, and electronic device and storage medium
WO2021196935A1 (en) Data checking method and apparatus, electronic device, and storage medium
WO2023236238A1 (en) Relational data-based data processing method and apparatus thereof
US11544328B2 (en) Method and system for streamlined auditing
CN111738290B (en) Image detection method, model construction and training method, device, equipment and medium
CN114281803A (en) Data migration method, device, equipment, medium and program product
CN111062208B (en) File auditing method, device, equipment and storage medium
CN112965943A (en) Data processing method and device, electronic equipment and storage medium
WO2023173733A1 (en) Data tracking method and apparatus, electronic device and storage medium
WO2023236240A1 (en) Data screening method and apparatus based on reinforcement learning
CN111680083A (en) Intelligent multi-stage government financial data acquisition system and data acquisition method
CN113792138B (en) Report generation method and device, electronic equipment and storage medium
CN107992457B (en) Information conversion method, device, terminal equipment and storage medium
WO2023236239A1 (en) Multi-round sampling based data screening rule validation method, and apparatus thereof
WO2022062834A1 (en) Data exploration method and apparatus, electronic device and storage medium
US11687574B2 (en) Record matching in a database system
CN111986020A (en) Financial loan risk assessment method, device, equipment and storage medium
CN112700322B (en) Order sampling detection method, order sampling detection device, electronic equipment and storage medium
CN116467223B (en) Method, device, system, equipment and medium for generating test report
CN112115214B (en) Address standardization method, address standardization device and electronic equipment
US20230035639A1 (en) Embedding service for unstructured data
US20230214394A1 (en) Data search method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22945378

Country of ref document: EP

Kind code of ref document: A1