CN115168504A - Function dependence determination method and device - Google Patents

Publication number
CN115168504A
CN115168504A CN202210699530.4A CN202210699530A CN115168504A CN 115168504 A CN115168504 A CN 115168504A CN 202210699530 A CN202210699530 A CN 202210699530A CN 115168504 A CN115168504 A CN 115168504A
Authority
CN
China
Prior art keywords
function
tuple
fields
field
invalid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210699530.4A
Other languages
Chinese (zh)
Inventor
王天振
陈长城
庞艳蓓
顾云帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba Cloud Computing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Cloud Computing Ltd
Priority to CN202210699530.4A
Publication of CN115168504A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F16/288: Entity relationship models

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

One or more embodiments of the present specification provide a method and an apparatus for determining functional dependencies, including: acquiring an original data set, wherein the original data set comprises a plurality of tuples and each tuple comprises field values corresponding to at least two fields; generating a corresponding partition set for each field, wherein the partition set comprises at least one tuple set and the tuples in each tuple set have the same value on the field corresponding to the partition set; sampling all tuple sets to obtain tuple pairs, wherein each tuple pair comprises two tuples belonging to the same tuple set; generating corresponding invalid functional dependencies for each tuple pair and adding them to an invalid functional dependency set, so as to obtain all invalid functional dependencies corresponding to the sampled tuple pairs; and inverting the invalid functional dependency set to obtain a valid functional dependency set, wherein the valid functional dependency set comprises the valid functional dependencies corresponding to the original data set.

Description

Function dependence determining method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of data processing, and in particular, to a method and an apparatus for determining functional dependencies.
Background
As data assets grow, data processing flows become more complex, and manual data processing can no longer cope with the explosive growth of enterprise data in the big data era. When facing a data set with a huge number of fields and a wide variety of data types, the relationships between the fields are usually determined first and the data set is then processed further; using functional dependencies to characterize the relationships between fields is the mainstream approach at the present stage.
Although the related art proposes some ways of determining functional dependencies from a data set, they tend to be inefficient and increase computational and time overhead.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a method and an apparatus for determining functional dependencies, which can address the deficiencies in the related art.
To achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
according to a first aspect of one or more embodiments of the present specification, there is provided a method for determining functional dependencies, the method including:
acquiring an original data set, wherein the original data set comprises a plurality of tuples, and each tuple comprises field values corresponding to at least two fields;
generating a corresponding partition set for each field, wherein the partition set comprises at least one tuple set, and the tuples in each tuple set have the same value on the field corresponding to the partition set;
sampling all tuple sets to obtain tuple pairs, wherein each tuple pair comprises two tuples belonging to the same tuple set;
generating corresponding invalid functional dependencies for each tuple pair and adding them to an invalid functional dependency set, so as to obtain all invalid functional dependencies corresponding to the sampled tuple pairs;
and inverting the invalid functional dependency set to obtain a valid functional dependency set, wherein the valid functional dependency set comprises the valid functional dependencies corresponding to the original data set.
According to a second aspect of one or more embodiments of the present specification, there is provided a sensitive field inference method based on functional dependencies, the method comprising:
acquiring functional dependencies generated based on a data set, wherein the functional dependencies are used for representing the relationships between the fields contained in the data set;
determining the sensitive fields marked by a user among the fields of the data set;
determining a target functional dependency from the functional dependencies, wherein the fields contained in the left set of the target functional dependency belong to the sensitive fields;
and if the fields contained in the right set of the target functional dependency are different from the sensitive fields, judging the fields contained in the right set of the target functional dependency to be potential sensitive fields.
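The inference steps of the second aspect can be sketched as follows. This is only a minimal illustration under stated assumptions, not the claimed implementation: the function name `infer_potentially_sensitive`, the representation of a functional dependency as a (left set, right set) pair, and the example field names are all hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Assumed representation: a functional dependency is a (left set, right set) pair.
FD = Tuple[FrozenSet[str], FrozenSet[str]]


def infer_potentially_sensitive(fds: Iterable[FD], sensitive: Iterable[str]) -> Set[str]:
    """Return right-set fields of dependencies whose left set is entirely sensitive."""
    marked = set(sensitive)
    potential: Set[str] = set()
    for left, right in fds:
        # A target dependency is one whose left-set fields all belong to the marked fields.
        if left and left <= marked:
            # Right-set fields that are not already marked become potential sensitive fields.
            potential |= set(right) - marked
    return potential


# Hypothetical field names, used only for illustration.
EXAMPLE_FDS = [
    (frozenset({"id_number"}), frozenset({"phone"})),
    (frozenset({"city"}), frozenset({"zip_code"})),
]
EXAMPLE_RESULT = infer_potentially_sensitive(EXAMPLE_FDS, {"id_number"})
```

In this sketch only "phone" is inferred, because "city" was not marked as sensitive by the user.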
According to a third aspect of one or more embodiments herein, there is provided an apparatus for determining functional dependencies, the apparatus comprising:
a first acquisition unit: acquiring an original data set, wherein the original data set comprises a plurality of tuples, and each tuple comprises field values corresponding to at least two fields;
a dividing unit: generating a corresponding partition set for each field, wherein the partition set comprises at least one tuple set, and the tuples in each tuple set have the same value on the field corresponding to the partition set;
a sampling unit: sampling all tuple sets to obtain tuple pairs, wherein each tuple pair comprises two tuples belonging to the same tuple set;
a generation unit: generating corresponding invalid functional dependencies for each tuple pair and adding them to an invalid functional dependency set, so as to obtain all invalid functional dependencies corresponding to the sampled tuple pairs;
an inversion unit: inverting the invalid functional dependency set to obtain a valid functional dependency set, wherein the valid functional dependency set comprises the valid functional dependencies corresponding to the original data set.
According to a fourth aspect of one or more embodiments of the present specification, there is provided a sensitive field inference apparatus based on functional dependencies, the apparatus comprising:
a second acquisition unit: acquiring functional dependencies generated based on a data set, wherein the functional dependencies are used for representing the relationships between the fields contained in the data set;
a fourth determination unit: determining the sensitive fields marked by a user among the fields of the data set;
a fifth determination unit: determining a target functional dependency from the functional dependencies, wherein the fields contained in the left set of the target functional dependency belong to the sensitive fields;
a determination unit: if the fields contained in the right set of the target functional dependency are different from the sensitive fields, judging the fields contained in the right set of the target functional dependency to be potential sensitive fields.
According to a fifth aspect of one or more embodiments herein, there is provided an electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method according to the first aspect or the second aspect by executing the executable instructions.
According to a sixth aspect of one or more embodiments of the present description, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to the first or second aspect.
As can be seen from the above technical solutions, in the method for determining functional dependencies provided in one or more embodiments of the present specification, an original data set including multiple tuples is first acquired, where each tuple includes field values corresponding to at least two fields. A partition set including at least one tuple set is then generated for each field, and all tuple sets are sampled to obtain tuple pairs, where each tuple pair includes two tuples belonging to the same tuple set. Corresponding invalid functional dependencies are generated for each tuple pair and added to the invalid functional dependency set, so as to obtain all invalid functional dependencies corresponding to the sampled tuple pairs. Because the tuples in the same tuple set have the same value on some field, the sampled tuple pairs can reliably be used to generate invalid functional dependencies. Compared with directly comparing the tuples of the original data set, generating the partition sets reduces the number of tuple pairs, and thereby reduces the computational and time overhead of generating invalid functional dependencies from tuple pairs. The invalid functional dependency set is then inverted to obtain a valid functional dependency set, which includes the valid functional dependencies corresponding to the original data set. Compared with directly verifying whether each functional dependency holds, the functional dependencies to be verified are generated by inverting the invalid functional dependencies, and the valid functional dependencies are determined using the invalid functional dependency set; this reduces the number of functional dependencies to be verified, simplifies the verification step, and improves the efficiency of determining functional dependencies.
Drawings
Fig. 1 is a system architecture diagram of a method for determining functional dependencies according to an exemplary embodiment.
Fig. 2 is a flowchart of a method for determining functional dependencies according to an exemplary embodiment.
Fig. 3 is a schematic diagram of generating partition sets according to an exemplary embodiment.
Fig. 4 is a schematic diagram of sequential sampling according to an exemplary embodiment.
Fig. 5 is a schematic diagram of generating invalid functional dependencies according to an exemplary embodiment.
Fig. 6 is a schematic diagram of constructing a binary search tree according to an exemplary embodiment.
Fig. 7 is a schematic diagram of generating valid functional dependencies according to an exemplary embodiment.
Fig. 8 is a schematic diagram of sliding window sampling according to an exemplary embodiment.
Fig. 9 is a flowchart of a sensitive field inference method based on functional dependencies according to an exemplary embodiment.
Fig. 10 is a schematic block diagram of an apparatus according to an exemplary embodiment.
Fig. 11 is a block diagram of an apparatus for determining functional dependencies according to an exemplary embodiment.
Fig. 12 is a block diagram of a sensitive field inference apparatus based on functional dependencies according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims that follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the methods may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
To further illustrate one or more embodiments of the present disclosure, the following examples are provided:
Fig. 1 is a system architecture diagram of a method for determining functional dependencies according to an exemplary embodiment. As shown in Fig. 1, the system architecture includes a database 11 and a server 12; the database 11 is used for storing and providing data sets, and the server 12 is used for processing the data sets provided by the database 11 to determine the functional dependencies.
A functional dependency means that when one attribute set can determine another attribute set, the latter is said to depend on the former. Stated mathematically: let R(U) be a relation schema over an attribute set U, and let X and Y be subsets of U. If, for any two tuples r1 and r2 of any possible relation of R(U), r1[X] = r2[X] implies r1[Y] = r2[Y] (equivalently, r1[Y] ≠ r2[Y] implies r1[X] ≠ r2[X]), then X is said to determine Y, or Y is said to depend on X.
Taking a student table as an example, a student's name can be determined from the student ID, and the student ID can also be determined from the name; the name attribute is then said to depend on the student ID attribute, and the student ID attribute to depend on the name attribute, and the two functional dependencies are written "{student ID} → {name}" and "{name} → {student ID}". A student's name cannot be determined from the student's age, but the age can be determined from the name; the name attribute is then said to determine the age attribute, or the age attribute to depend on the name attribute, and the functional dependency is written "{name} → {age}". A student's school cannot be determined from the student's age, and the student's age cannot be determined from the school; in this case there is no functional dependency between the age attribute and the school attribute.
It can be seen that the kind of relationship between attribute set X and attribute set Y corresponds to the functional dependencies that exist: when X and Y are in a one-to-one relationship, such as student ID and name, both "X → Y" and "Y → X" exist; when X and Y are in a one-to-many relationship, such as age and name, the functional dependency "Y → X" exists; when X and Y are in a many-to-many relationship, such as students and courses, there is no functional dependency between X and Y.
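The definition above can be checked directly on a small table. The following is a minimal sketch that tests whether X → Y holds on a list of rows; the function name `fd_holds` and the sample student rows are hypothetical and serve only to illustrate the definition.

```python
from typing import Dict, Hashable, List, Sequence

Row = Dict[str, Hashable]


def fd_holds(rows: List[Row], left: Sequence[str], right: Sequence[str]) -> bool:
    """Return True if no two rows agree on `left` while differing on `right`."""
    seen = {}
    for row in rows:
        key = tuple(row[f] for f in left)
        value = tuple(row[f] for f in right)
        # setdefault stores the first right-side value seen for this left-side value;
        # any later row with the same key but another value violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical student rows, used only for illustration.
STUDENTS = [
    {"student_id": "S1", "name": "Ann", "age": 20},
    {"student_id": "S2", "name": "Bob", "age": 20},
    {"student_id": "S3", "name": "Cid", "age": 21},
]
```

On these rows, `fd_holds(STUDENTS, ["name"], ["age"])` is true while `fd_holds(STUDENTS, ["age"], ["name"])` is false, mirroring the one-to-many case between age and name.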
In the technical solution of the present specification, the database 11 may be any type of relational database, and the data set maintained by the database 11 is carried in a table, where each record in the data set is stored as one row of the table, i.e. a tuple. For example, in the student table scenario, the name, student ID, age and other information of each student is recorded in one row of the student table and forms the corresponding tuple. The columns of the table are the fields, and each row contains the field values corresponding to the respective fields; for example, the information of each student includes field values corresponding to the fields name, student ID, age and so on, i.e., the attributes of that student in the dimensions characterized by those fields.
The server 12 may be a physical server comprising a single host, or a virtual server hosted by a cluster of hosts. In operation, the server 12 obtains a data set from the database 11 and processes it according to the technical solution of the present specification, thereby determining the functional dependencies. Although the database 11 is illustrated as independent of the server 12 in the embodiment shown in Fig. 1, in some cases the database 11 may also be located in the local storage space of the server 12, and this specification does not limit this.
The technical solution of the present specification can analyze a data set efficiently and quickly to obtain the corresponding functional dependencies, so that further processing can be performed based on them. For example, in a sensitive field analysis scenario, the determined functional dependencies can be used to analyze known sensitive fields and thereby infer potential sensitive fields. The technical solution of the present specification is described below with reference to embodiments.
Fig. 2 is a flowchart of a method for determining function dependence according to an exemplary embodiment. As shown in fig. 2, the method may include the steps of:
Step 201, acquiring an original data set, where the original data set includes multiple tuples, and each tuple includes field values corresponding to at least two fields.
A data set is a collection of data, usually in tabular form. Each column represents a specific field, each row corresponds to a tuple, and each tuple has a field value for each field. The original data set refers to the unprocessed data set, as opposed to the replacement data set; the process of converting the original data set into the replacement data set is described in detail in later steps and is not repeated here.
Tuple number  Name       Gender  Age  Blood pressure  Medication
1             Xiao Hong  Female  11   Normal          Drug X
2             Xiao Bai   Male    20   Low             Drug B
3             Xiao Hei   Male    65   Normal          Drug X
4             Xiao Lan   Male    35   High            Drug A
5             Xiao Lü    Female  24   High            Drug A
6             Da Huang   Male    24   Normal          Drug X
TABLE 1
Table 1 is a data set provided in an exemplary embodiment and serves as the running example for the method for determining functional dependencies. As shown in Table 1, the original data set includes 6 tuples, with tuple numbers "1, 2, 3, 4, 5, 6", and 5 different fields: "name, gender, age, blood pressure, medication". Each of the 6 tuples has a field value for each of these 5 fields. For example, the tuple with tuple number "1" has the field value "Xiao Hong" on the "name" field, "female" on the "gender" field, "11" on the "age" field, "normal" on the "blood pressure" field, and "Drug X" on the "medication" field. The field values of different tuples on the same field may be the same or different. For example, the tuple with tuple number "3" has the field value "Xiao Hei" on the "name" field and "normal" on the "blood pressure" field; compared with the tuple with tuple number "1", its value on the "name" field is different and its value on the "blood pressure" field is the same. In the subsequent steps, it is precisely these differences between the field values of two tuples on the same field that are compared to generate the corresponding invalid functional dependencies.
Step 202, generating a corresponding partition set for each field, where the partition set includes at least one tuple set, and the tuples in each tuple set have the same field value on the field corresponding to the partition set.
Fig. 3 is a schematic diagram of generating partition sets according to an exemplary embodiment. As shown in Fig. 3, 5 partition sets are generated for the 5 fields in Table 1; each partition set includes multiple tuple sets, and the tuples in each tuple set have the same field value on the field corresponding to the partition set. For example, the "gender" field has only 2 field values, "male" and "female". The tuples whose field value on the "gender" field is "male" have tuple numbers "2, 3, 4, 6", and the tuples whose field value is "female" have tuple numbers "1, 5". The tuples numbered "2, 3, 4, 6" and the tuples numbered "1, 5" are added to two different tuple sets, and both tuple sets belong to the "gender" partition set. The final result is shown in Fig. 3: there are 5 partition sets, corresponding to "name, gender, age, blood pressure, medication" respectively. The "gender" partition set contains 2 tuple sets: the first tuple set "{2,3,4,6}" contains 4 tuples whose field value on the gender field is "male", and the second tuple set "{1,5}" contains 2 tuples whose field value on the gender field is "female".
In this embodiment, tuples with the same field value are placed in the same tuple set, so that the subsequent sampling only pairs tuples that share a field value on some field. This reduces the number of sampled tuple pairs, improves sampling efficiency, and further reduces the computational cost of subsequently generating functional dependencies.
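Step 202 can be illustrated on the data of Table 1. The sketch below is one possible way to build the partition sets and is not taken from the specification; the names `FIELDS`, `TABLE_1` and `build_partitions` are hypothetical, and singleton tuple sets are kept on the assumption that the partition set covers every tuple.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

FIELDS = ["name", "gender", "age", "blood_pressure", "medication"]

# Table 1, keyed by tuple number.
TABLE_1: Dict[int, Tuple] = {
    1: ("Xiao Hong", "female", 11, "normal", "Drug X"),
    2: ("Xiao Bai", "male", 20, "low", "Drug B"),
    3: ("Xiao Hei", "male", 65, "normal", "Drug X"),
    4: ("Xiao Lan", "male", 35, "high", "Drug A"),
    5: ("Xiao Lü", "female", 24, "high", "Drug A"),
    6: ("Da Huang", "male", 24, "normal", "Drug X"),
}


def build_partitions(table: Dict[int, Tuple], fields: Sequence[str]) -> Dict[str, List[Set[int]]]:
    """For each field, group tuple numbers by their value on that field."""
    partitions: Dict[str, List[Set[int]]] = {}
    for index, field in enumerate(fields):
        groups = defaultdict(set)
        for tuple_no, values in table.items():
            groups[values[index]].add(tuple_no)
        # One tuple set per distinct field value.
        partitions[field] = list(groups.values())
    return partitions


PARTITIONS = build_partitions(TABLE_1, FIELDS)
```

For the "gender" field this yields the two tuple sets {2, 3, 4, 6} and {1, 5} described above.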
Step 203, sampling all tuple sets to obtain tuple pairs, wherein each tuple pair comprises two tuples belonging to the same tuple set.
In the sampling process, the partition sets, or the tuple sets within each partition set, may be sorted and the tuple sets sampled in that order; alternatively, the tuple sets may be sampled randomly without sorting. This specification does not limit this.
In an embodiment, each partition set is used as a queue unit, and the tuple sets in each queue unit are sorted according to efficiency values, where the efficiency value of a tuple set is positively correlated with the number of tuples it contains. When the tuple sets in any queue unit are sampled, they are sampled in descending order of efficiency value to obtain the tuple pairs.
Fig. 4 is a schematic diagram of sequential sampling according to an exemplary embodiment. As shown in Fig. 4, in the "gender" partition set, the tuple set with the field value "male" contains 4 tuples and the tuple set with the field value "female" contains 2 tuples, so the efficiency value of the "male" tuple set is higher than that of the other tuple set and it is sampled first. Sampling the tuple set with the field value "male" gives "{2,3}, {3,4}, {4,6}, {2,4}, {3,6}, {2,6}", and sampling the tuple set with the field value "female" gives "{1,5}", for a total of 7 tuple pairs. Similarly, the tuple sets in the other partition sets are sorted and sampled according to their efficiency values. The final sampling result is "{2,3}, {3,4}, {4,6}, {1,5}, {5,6}, {1,3}, {3,6}, {4,5}, {2,4}, {1,6}, {2,6}", a total of 11 tuple pairs.
This embodiment sorts the tuple sets according to their efficiency values and samples the tuple sets with high efficiency values first, so that most tuple pairs are extracted in the early stage of sampling and the overall growth curve flattens out earlier; sampling can therefore be finished early, saving sampling time.
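The ordered sampling of the worked example can be sketched as follows. This is a simplified illustration under stated assumptions: the efficiency value is taken to be simply the number of tuples in the tuple set, every pair inside a tuple set is enumerated exhaustively (as the worked example appears to do), and pairs already produced by an earlier partition set are skipped. The names `PARTITION_SETS` and `sample_pairs` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Partition sets of Table 1, written out by hand for this sketch.
PARTITION_SETS: Dict[str, List[Set[int]]] = {
    "name": [{1}, {2}, {3}, {4}, {5}, {6}],
    "gender": [{1, 5}, {2, 3, 4, 6}],
    "age": [{1}, {2}, {3}, {4}, {5, 6}],
    "blood_pressure": [{1, 3, 6}, {2}, {4, 5}],
    "medication": [{1, 3, 6}, {2}, {4, 5}],
}


def sample_pairs(partitions: Dict[str, List[Set[int]]]) -> List[Tuple[int, int]]:
    """Enumerate tuple pairs, visiting larger tuple sets first within each partition set."""
    seen = set()
    ordered: List[Tuple[int, int]] = []
    for tuple_sets in partitions.values():
        # Assumed efficiency value: the size of the tuple set, largest first.
        for tuple_set in sorted(tuple_sets, key=len, reverse=True):
            for pair in combinations(sorted(tuple_set), 2):
                if pair not in seen:  # skip pairs found via an earlier field
                    seen.add(pair)
                    ordered.append(pair)
    return ordered


SAMPLED_PAIRS = sample_pairs(PARTITION_SETS)
```

Under these assumptions the sketch produces the same 11 distinct tuple pairs as the example, although not necessarily in the order listed in the text.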
Step 204, generating corresponding invalid functional dependencies for each tuple pair and adding them to the invalid functional dependency set, so as to obtain all invalid functional dependencies corresponding to the sampled tuple pairs.
An invalid functional dependency is a functional dependency that does not hold: its left set consists of all fields on which the two tuples of the corresponding tuple pair have the same field value, and its right set is any one field on which the two tuples have different field values. In a data set, fields on which two tuples agree cannot determine fields on which they differ. Taking the tuple pair "{2,3}" as an example, tuples 2 and 3 have the same value only on the "gender" field and different values on all other fields; clearly the values of tuples 2 and 3 on the other fields cannot be determined by the "gender" field, so a functional dependency generated in this way is certain not to hold, i.e. it is invalid.
Fig. 5 is a schematic diagram of generating invalid functional dependencies according to an exemplary embodiment. As shown in Fig. 5, invalid functional dependencies are generated from the differences between the field values of the 2 tuples in a tuple pair over all fields. Taking the tuple pair "{2,3}" as an example, tuples 2 and 3 have the same field value only on the "gender" field and different field values on the remaining fields; taking the "gender" field as the left set and each remaining field as a right set, 4 invalid functional dependencies can be generated: "{gender} → {name}, {gender} → {age}, {gender} → {blood pressure}, {gender} → {medication}". The tuples in all other tuple pairs are compared in the same way to generate invalid functional dependencies. The tuple pairs "{2,3}, {3,4}, {4,6}, {1,5}, {2,4}, {2,6}" all yield the same invalid functional dependencies: {gender} → {name}, {gender} → {age}, {gender} → {blood pressure}, {gender} → {medication}. The tuple pair "{5,6}" yields: {age} → {name}, {age} → {gender}, {age} → {blood pressure}, {age} → {medication}. The tuple pairs "{1,3}, {4,5}, {1,6}" each yield: {blood pressure, medication} → {name}, {blood pressure, medication} → {gender}, {blood pressure, medication} → {age}. The tuple pair "{3,6}" yields: {gender, blood pressure, medication} → {name}, {gender, blood pressure, medication} → {age}. In this embodiment, invalid functional dependencies are obtained by comparing the field values of each tuple pair over all fields, which lays the foundation for the subsequent generation of valid functional dependencies.
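The comparison of a tuple pair described above can be sketched as follows, using the data of Table 1. The names `FIELD_NAMES`, `ROWS` and `invalid_fds_for_pair` are hypothetical, and an invalid functional dependency is represented here as a (left set, right field) pair purely for illustration.

```python
from typing import Dict, FrozenSet, Sequence, Set, Tuple

FIELD_NAMES = ["name", "gender", "age", "blood_pressure", "medication"]

# Table 1, keyed by tuple number.
ROWS: Dict[int, Tuple] = {
    1: ("Xiao Hong", "female", 11, "normal", "Drug X"),
    2: ("Xiao Bai", "male", 20, "low", "Drug B"),
    3: ("Xiao Hei", "male", 65, "normal", "Drug X"),
    4: ("Xiao Lan", "male", 35, "high", "Drug A"),
    5: ("Xiao Lü", "female", 24, "high", "Drug A"),
    6: ("Da Huang", "male", 24, "normal", "Drug X"),
}


def invalid_fds_for_pair(
    rows: Dict[int, Tuple], fields: Sequence[str], a: int, b: int
) -> Set[Tuple[FrozenSet[str], str]]:
    """Left set: fields where the two tuples agree; one invalid FD per differing field."""
    row_a, row_b = rows[a], rows[b]
    same = frozenset(f for f, x, y in zip(fields, row_a, row_b) if x == y)
    differing = [f for f, x, y in zip(fields, row_a, row_b) if x != y]
    return {(same, f) for f in differing}


PAIR_2_3 = invalid_fds_for_pair(ROWS, FIELD_NAMES, 2, 3)
PAIR_3_6 = invalid_fds_for_pair(ROWS, FIELD_NAMES, 3, 6)
```

For the pair {2,3} the sketch returns the four dependencies with left set {gender}; for the pair {3,6} it returns the two dependencies with left set {gender, blood pressure, medication}.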
Among the invalid functional dependencies generated in this embodiment, the two dependencies "{gender} → {name}" and "{gender, blood pressure, medication} → {name}" are in an inclusion relationship: their right sets are the same, and the left set of the first is contained in the left set of the second. Given that the "name" field cannot be determined even by the three fields "gender", "blood pressure" and "medication" together, the first invalid functional dependency is unnecessary, and keeping it would reduce the efficiency of generating the valid functional dependencies.
In one embodiment, unnecessary invalid functional dependencies are determined, where the left set of an unnecessary invalid functional dependency is contained in the left set of another invalid functional dependency, and its right set is the same as the right set of that other invalid functional dependency; adding candidate fields to the left set of any invalid functional dependency then comprises: adding candidate fields to the left set of any invalid functional dependency other than the unnecessary invalid functional dependencies.
Furthermore, unnecessary invalid functional dependencies are pruned by constructing a binary search tree that uses the left sets of the invalid functional dependencies as paths, which improves the efficiency of converting invalid functional dependencies into valid functional dependencies.
Fig. 6 is a schematic diagram of constructing a binary search tree according to an exemplary embodiment. As shown in Fig. 6, a binary search tree is built for each field; for ease of drawing, the five fields "name, gender, age, blood pressure, medication" are denoted "A, B, C, D, E". Taking the "blood pressure" field as an example, with the "blood pressure" field as the root node, the invalid functional dependencies whose left set contains the blood pressure field are found among all invalid functional dependencies, giving "{blood pressure, medication} → {name}, {blood pressure, medication} → {gender}, {blood pressure, medication} → {age}, {gender, blood pressure, medication} → {name}, {gender, blood pressure, medication} → {age}". The right sets of these 5 invalid functional dependencies contain 3 fields, "{name}, {gender}, {age}", so A, B and C are drawn beside the root node (in the actual construction process the right sets do not appear in the binary tree; they are shown beside the nodes in this specification for ease of understanding). The left sets of all 5 invalid functional dependencies contain "{blood pressure, medication}", so a node E is connected to the root node, with A, B and C drawn beside node E as the right set. Since the left sets of the last two invalid functional dependencies also contain the gender field, and their right sets are "{name}" and "{age}" respectively, the original E node is split into two nodes, with A and C drawn beside the left E node and B drawn beside the right E node. A child node B of the left E node is then constructed, with A and C drawn beside it as the right set.
As shown, the constructed binary search tree has two branches, represented as "DEB → A, DEB → C" and "DE → B", namely "{gender, blood pressure, medication} → {name}, {gender, blood pressure, medication} → {age}, {blood pressure, medication} → {gender}". It is easy to see that the two invalid functional dependencies "{blood pressure, medication} → {name}" and "{blood pressure, medication} → {age}" have been pruned. This is not difficult to understand: since the name field cannot be determined by the three fields gender, blood pressure and medication together, it certainly cannot be determined by the two fields blood pressure and medication alone. In this embodiment, unnecessary invalid functional dependencies are pruned by constructing a binary search tree that uses the left sets of the invalid functional dependencies as paths, which reduces the subsequent computational overhead.
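The effect of the pruning can be reproduced with a much simpler technique than the binary search tree of Fig. 6. The sketch below uses a direct pairwise subset check, which is an assumption made for brevity and not the tree-based method of the embodiment; the names `prune_unnecessary` and `BLOOD_PRESSURE_FDS` are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

InvalidFD = Tuple[FrozenSet[str], str]  # (left set, right field)


def prune_unnecessary(invalid_fds: Set[InvalidFD]) -> Set[InvalidFD]:
    """Drop an invalid FD if another one has the same right field and a strictly larger left set."""
    kept: Set[InvalidFD] = set()
    for left, right in invalid_fds:
        # `left < other_left` is the proper-subset test on frozensets.
        dominated = any(
            other_right == right and left < other_left
            for other_left, other_right in invalid_fds
        )
        if not dominated:
            kept.add((left, right))
    return kept


DE = frozenset({"blood_pressure", "medication"})
DEB = frozenset({"gender", "blood_pressure", "medication"})

# The five invalid functional dependencies whose left set contains "blood pressure".
BLOOD_PRESSURE_FDS = {
    (DE, "name"), (DE, "gender"), (DE, "age"),
    (DEB, "name"), (DEB, "age"),
}
PRUNED = prune_unnecessary(BLOOD_PRESSURE_FDS)
```

The three remaining dependencies correspond to the branches "DEB → A", "DEB → C" and "DE → B". The pairwise check is quadratic in the number of invalid dependencies, which is the cost the tree structure is meant to avoid.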
In a data set, the data types of the field values are not necessarily the same, and when the field values of two tuples are compared, the values themselves often occupy a large amount of storage space because of their data types. The specific field values can therefore be replaced by another representation before they are compared.
In an embodiment, each field value of the tuples in the original data set is replaced with a serial number so as to generate a replacement data set, wherein the serial number is formed by combining the serial number of the partition set corresponding to the field to which the field value belongs with the serial number of the tuple set in which the tuple holding the field value is located;
the generating a corresponding invalid function dependency for each tuple pair comprises: generating the invalid function dependency from the replacement data set.
For example, assume that the five fields "name, gender, age, blood pressure, and medication" correspond to sequence numbers "1, 2, 3, 4, and 5", respectively. If the tuple set containing a tuple in the "name" partition set has sequence number 1, the field value of that tuple in the "name" field may be replaced by "1-1". (For ease of understanding, the tuple sets may be numbered according to the order in which they are written in the partition set in fig. 2; for example, in the "gender" partition set, the tuple set with the field value "male" has sequence number 1, and the tuple set with the field value "female" has sequence number 2.) The other field values are replaced in the same way, and the result after replacement is shown in table 2. When the field values of the two tuples in a tuple pair are subsequently compared, the sequence numbers can be compared directly, without comparing the specific field values.
Tuple No.   Name   Gender   Age   Blood pressure   Medication
1           1-1    2-2      3-2   4-1              5-1
2           1-2    2-1      3-3   4-3              5-3
3           1-3    2-1      3-4   4-1              5-1
4           1-4    2-1      3-5   4-2              5-2
5           1-5    2-2      3-1   4-2              5-2
6           1-6    2-1      3-1   4-1              5-1
Table 2
In this embodiment, the original field values are replaced with serial numbers of a uniform format. On the one hand, this takes the characteristics of the underlying storage structure of a computer into account and reduces the time and space overhead required for calculation; on the other hand, the original data is transformed without affecting subsequent calculation, which reduces the risk of privacy disclosure.
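The replacement of field values by serial numbers can be sketched as follows. This is a minimal illustration under stated assumptions: tuple sets are numbered in order of first appearance (the specification instead refers to the order shown in fig. 2), and the function name encode and the sample records are hypothetical and are not taken from table 1.

```python
from typing import Dict, List


def encode(rows: List[Dict[str, str]], fields: List[str]) -> List[Dict[str, str]]:
    """Replace each field value with "<field number>-<tuple set number>"."""
    # Per field: map each distinct value to the number of its tuple set.
    set_numbers: Dict[str, Dict[str, int]] = {f: {} for f in fields}
    encoded = []
    for row in rows:
        new_row = {}
        for field_no, field in enumerate(fields, start=1):
            numbering = set_numbers[field]
            value = row[field]
            if value not in numbering:
                # Assumption: number tuple sets in order of first appearance.
                numbering[value] = len(numbering) + 1
            new_row[field] = f"{field_no}-{numbering[value]}"
        encoded.append(new_row)
    return encoded


# Hypothetical records, used only to show the encoding.
fields = ["gender", "blood_pressure"]
rows = [
    {"gender": "female", "blood_pressure": "high"},
    {"gender": "male", "blood_pressure": "normal"},
    {"gender": "male", "blood_pressure": "high"},
]
encoded = encode(rows, fields)
```

After encoding, two tuples agree on a field exactly when their codes for that field are equal, so later comparisons never touch the original values.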
Step 205, inverting the invalid function dependency set to obtain a valid function dependency set, where the valid function dependency set includes the valid function dependencies corresponding to the original data set.
Generalization is a relationship between function dependencies: the left set of a generalization of an invalid function dependency is contained in the left set of that invalid function dependency, and the right sets of the two are identical. For example, the invalid function dependency "{ gender, blood pressure, medication } → { name }" completely covers the invalid function dependency "{ blood pressure, medication } → { name }", so "{ blood pressure, medication } → { name }" is called a generalization of "{ gender, blood pressure, medication } → { name }".
In an embodiment, inverting the invalid function dependency set to obtain a valid function dependency set includes: extracting any invalid function dependency from the invalid function dependency set, wherein the left set of the invalid function dependency consists of all fields on which the two tuples of the corresponding tuple pair have the same field value, and the right set is any one field on which the two tuples have different field values; adding a candidate field to the left set of the invalid function dependency to obtain a function dependency to be verified, wherein the candidate field is a field of the at least two fields other than those contained in the left set and the right set of the invalid function dependency; and if the function dependency to be verified neither belongs to the invalid function dependency set nor is a generalization of an invalid function dependency in the set, determining the function dependency to be verified to be a valid function dependency and adding it to the valid function dependency set.
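The way a tuple pair yields invalid function dependencies (left set: all fields with equal values; right set: any one field with different values) can be sketched in Python using the codes of table 2. The names FIELDS, TABLE_2, and invalid_fds_for_pair are hypothetical and serve only as an illustration.

```python
from typing import FrozenSet, Set, Tuple

FIELDS = ["name", "gender", "age", "blood_pressure", "medication"]

# Encoded rows copied from table 2, keyed by tuple sequence number.
TABLE_2 = {
    1: ["1-1", "2-2", "3-2", "4-1", "5-1"],
    2: ["1-2", "2-1", "3-3", "4-3", "5-3"],
    3: ["1-3", "2-1", "3-4", "4-1", "5-1"],
    4: ["1-4", "2-1", "3-5", "4-2", "5-2"],
    5: ["1-5", "2-2", "3-1", "4-2", "5-2"],
    6: ["1-6", "2-1", "3-1", "4-1", "5-1"],
}


def invalid_fds_for_pair(a: int, b: int) -> Set[Tuple[FrozenSet[str], str]]:
    """Return one invalid FD per differing field of the tuple pair (a, b)."""
    row_a, row_b = TABLE_2[a], TABLE_2[b]
    # Left set: every field on which the two tuples carry the same code.
    same = frozenset(f for f, x, y in zip(FIELDS, row_a, row_b) if x == y)
    # Right set: each field on which the codes differ, taken one at a time.
    differing = [f for f, x, y in zip(FIELDS, row_a, row_b) if x != y]
    return {(same, f) for f in differing}


pair_4_5 = invalid_fds_for_pair(4, 5)
pair_3_6 = invalid_fds_for_pair(3, 6)
```

For the pair {4,5} this yields the three invalid function dependencies with left set { blood pressure, medication }, and for the pair {3,6} the two with left set { gender, blood pressure, medication }, which are the five dependencies used in the fig. 6 example.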
Fig. 7 is a schematic diagram of generating a valid function dependency according to an exemplary embodiment. As shown in fig. 7, a candidate field is added to the left set of any invalid function dependency, and the result is verified so that the invalid function dependency is converted into a valid function dependency. Taking the invalid function dependency "{ blood pressure, medication } → { age }" as an example, its left and right sets contain the fields "blood pressure", "medication", and "age", so the candidate fields are the "gender" field and the "name" field. These 2 fields are added to the left set of the invalid function dependency respectively to obtain 2 function dependencies to be verified, namely "{ gender, blood pressure, medication } → { age }" and "{ name, blood pressure, medication } → { age }". Of these, the first belongs to the invalid function dependency set and the second does not, so "{ name, blood pressure, medication } → { age }" can be judged to be a valid function dependency. In this embodiment, a candidate field is added to an invalid function dependency to obtain a function dependency to be verified, and whether the function dependency to be verified is valid is determined by checking whether it belongs to the invalid function dependency set. This process converts invalid function dependencies into valid function dependencies and makes the judgment using the invalid function dependency set, without directly verifying whether a function dependency is valid, thereby reducing the calculation cost and improving the efficiency of determining function dependencies.
Further, if the function dependency to be verified belongs to the invalid function dependency set or is a generalization of an invalid function dependency in the set, other candidate fields different from the ones already added continue to be added to the left set of the function dependency to be verified, until the function dependency to be verified is determined to be a valid function dependency. Continuing the example, a candidate field is added to the first function dependency to be verified: since the candidate field already added to "{ gender, blood pressure, medication } → { age }" is the "gender" field, the candidate field added next should be the "name" field, which yields the function dependency to be verified "{ name, gender, blood pressure, medication } → { age }". This does not belong to the invalid function dependency set, so it can be determined that "{ name, gender, blood pressure, medication } → { age }" is a valid function dependency. This embodiment allows invalid function dependencies to be fully converted into valid function dependencies and avoids omissions among the determined function dependencies.
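The inversion described above (add a candidate field, check against the invalid set and its generalizations, and keep extending while the result is still invalid) can be sketched in Python as follows. This is an illustrative sketch rather than the claimed implementation: the names FD, is_covered, and invert are hypothetical, the invalid set is limited to the five dependencies of the fig. 6 example, and no minimization of the resulting valid function dependencies is attempted.

```python
from typing import FrozenSet, Set, Tuple

FD = Tuple[FrozenSet[str], str]  # (left set, single right-hand field)


def is_covered(lhs: FrozenSet[str], rhs: str, invalid: Set[FD]) -> bool:
    """True if lhs -> rhs is in the invalid set or is a generalization
    of some invalid function dependency in the set."""
    return any(rhs == inv_rhs and lhs <= inv_lhs for inv_lhs, inv_rhs in invalid)


def invert(invalid: Set[FD], fields: FrozenSet[str]) -> Set[FD]:
    valid: Set[FD] = set()
    visited: Set[FD] = set()
    pending = list(invalid)
    while pending:
        lhs, rhs = pending.pop()
        # Candidate fields: everything outside the left set and the right set.
        for candidate in fields - lhs - {rhs}:
            to_verify = (lhs | {candidate}, rhs)
            if to_verify in visited:
                continue
            visited.add(to_verify)
            if is_covered(to_verify[0], rhs, invalid):
                pending.append(to_verify)  # still invalid: keep adding fields
            else:
                valid.add(to_verify)
    return valid


F = frozenset
ALL_FIELDS = F({"name", "gender", "age", "blood_pressure", "medication"})
INVALID: Set[FD] = {
    (F({"blood_pressure", "medication"}), "name"),
    (F({"blood_pressure", "medication"}), "gender"),
    (F({"blood_pressure", "medication"}), "age"),
    (F({"gender", "blood_pressure", "medication"}), "name"),
    (F({"gender", "blood_pressure", "medication"}), "age"),
}
valid = invert(INVALID, ALL_FIELDS)
```

On this input the result contains both "{ name, blood pressure, medication } → { age }" and "{ name, gender, blood pressure, medication } → { age }", matching the two outcomes walked through in the text.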
As described above, in the sampling process the tuple sets can be sorted by efficiency value. Since more tuple pairs are extracted from tuple sets with high efficiency values and fewer from tuple sets with low efficiency values, when a data set with many tuples and fields is processed, sampling the tuple sets with high efficiency values first ensures that most tuple pairs are extracted in the early stage of sampling. In practice, to ensure sampling efficiency, the completeness of sampling can be sacrificed to some extent in order to improve the efficiency of generating function dependencies.
In an embodiment, performing multiple rounds of sampling on all tuple sets to obtain tuple pairs includes: in the first round of sampling, performing sliding sampling on all tuple sets with a sampling window of an initial size; or, in a round other than the first, enlarging the sampling window relative to the previous round and restarting one round of sliding sampling on all tuple sets; wherein the tuples at the two ends of the sliding window are used to generate a tuple pair; if the increase rate of the invalid function dependencies reaches a first preset lower threshold after the current round of sliding sampling is finished, ending the sampling; otherwise, entering the next round of sliding sampling.
Fig. 8 is a schematic diagram of sliding window sampling according to an exemplary embodiment. As shown in fig. 8, multiple rounds of sampling are performed on the tuple sets using a sliding window (a tuple set containing only 1 tuple cannot yield two tuples as a tuple pair and therefore has no sampling value, so it is not queued here). In the first round of sampling, the initial size of the sampling window is 2; sliding sampling is performed on all tuple sets starting from the tuple set at the head of the queue, namely the tuple set with the field value "male" in the "gender" partition set, and the two tuples at the two ends of the sliding window are extracted as a tuple pair. The result of the first round of sampling is: {2,3}, {3,4}, {4,6}, {1,5}, {5,6}, {1,3}, {3,6}, and {4,5}. As can be seen from fig. 4, the invalid function dependencies generated by these tuple pairs include all invalid function dependencies, which will not be described again here. Because this is the first round of sampling, the increase rate of the invalid function dependencies is 100%.
In the second round of sampling, the size of the sliding window is increased to 3 and all tuple sets are traversed again; the result of the second round of sampling is: {2,4}, {3,6}, and {1,6}. The invalid function dependencies generated by these tuple pairs necessarily duplicate those generated in the first round, so the increase rate of the invalid function dependencies in this round is 0. If the first preset lower threshold is 5%, the increase rate after the second round has already reached (that is, fallen to or below) the first preset lower threshold, and the sampling may be ended. To verify that a third round of sampling is of little value, the sliding window size is nevertheless increased to 4 here for a third round, whose result is: {2,6}. The invalid function dependencies generated by the tuple pair {2,6} also duplicate those generated in the first round. It can therefore be seen that the sampling efficiency becomes lower and lower as the rounds proceed, and setting the first preset lower threshold reduces time and calculation overhead while ensuring accuracy.
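The three sampling rounds above can be reproduced with a short Python sketch. The tuple sets are derived from table 2 (sets with a single tuple are omitted), and two points are assumptions for illustration: the tuples inside each set are kept in ascending order, and a tuple pair that occurs in several tuple sets is reported only once per round. The names TUPLE_SETS and slide are hypothetical.

```python
from typing import List, Tuple

# Queue of tuple sets with at least two tuples, derived from table 2.
TUPLE_SETS: List[List[int]] = [
    [2, 3, 4, 6],  # gender = male
    [1, 5],        # gender = female
    [5, 6],        # age, code 3-1
    [1, 3, 6],     # blood pressure, code 4-1
    [4, 5],        # blood pressure, code 4-2
    [1, 3, 6],     # medication, code 5-1
    [4, 5],        # medication, code 5-2
]


def slide(tuple_sets: List[List[int]], window: int) -> List[Tuple[int, int]]:
    """One round of sliding sampling: pair the tuples at both ends of a
    window of the given size, moving one position at a time."""
    pairs: List[Tuple[int, int]] = []
    seen = set()
    for tuples in tuple_sets:
        for start in range(len(tuples) - window + 1):
            pair = (tuples[start], tuples[start + window - 1])
            if pair not in seen:  # skip pairs already produced this round
                seen.add(pair)
                pairs.append(pair)
    return pairs


round_1 = slide(TUPLE_SETS, 2)
round_2 = slide(TUPLE_SETS, 3)
round_3 = slide(TUPLE_SETS, 4)
```

Each larger window reaches fewer tuple sets, which is why the later rounds contribute so few new pairs.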
Further, if the increase rate of the valid function dependencies calculated after the current round of sliding sampling is finished has not reached a second preset lower threshold, the next round of sliding sampling is entered, until the increase rate of the valid function dependencies calculated after some round of sliding sampling reaches the second preset lower threshold.
After each round of sampling is finished, not only the increase rate of the invalid function dependencies but also the increase rate of the valid function dependencies needs to be calculated, where the latter refers to the increase rate of the valid function dependencies generated from the invalid function dependencies newly added in that round. For example, suppose that after one round of sampling an invalid function dependency "{ age } → { name }" is newly added, and the valid function dependencies generated from it are "{ gender, age } → { name }, { blood pressure, age } → { name }, { medication, age } → { name }". If these 3 valid function dependencies were not generated in previous rounds, the increase rate of the valid function dependencies is the ratio of the number of newly added valid function dependencies (3) to the total number of valid function dependencies (the number previously generated plus 3).
In the above embodiment, since all invalid function dependencies are obtained in the first round of sampling, the increase rate of the invalid function dependencies after the second round is 0, and there is no need to continue sampling. In other cases, however, the increase rate of the invalid function dependencies may reach the first preset lower threshold while the increase rate of the valid function dependencies has not yet reached the second preset lower threshold; the sampling then does not end and must be continued. The sampling ends only when the increase rate of the invalid function dependencies reaches the first preset lower threshold and the increase rate of the valid function dependencies reaches the second preset lower threshold.
This embodiment ensures the completeness of sampling to a certain extent while improving the sampling efficiency, thereby reducing the possibility of missing function dependencies.
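The two-threshold stopping rule can be written down as a small Python sketch. Two assumptions are made for illustration: the increase rate of the invalid function dependencies is computed with the same ratio that the text gives for the valid ones (newly added count divided by the new total), and "reaching" a lower threshold is read as falling to or below it. The function names growth_rate and should_stop are hypothetical.

```python
def growth_rate(new_count: int, previous_total: int) -> float:
    """Ratio of newly added dependencies to the total after this round."""
    total = previous_total + new_count
    return new_count / total if total else 0.0


def should_stop(
    new_invalid: int,
    previous_invalid: int,
    new_valid: int,
    previous_valid: int,
    first_threshold: float,
    second_threshold: float,
) -> bool:
    """Stop only when both increase rates have dropped to their thresholds."""
    invalid_rate = growth_rate(new_invalid, previous_invalid)
    valid_rate = growth_rate(new_valid, previous_valid)
    return invalid_rate <= first_threshold and valid_rate <= second_threshold


# 3 new valid dependencies on top of 9 earlier ones: 3 / (9 + 3) = 0.25
example_rate = growth_rate(3, 9)
```

With both thresholds at 5%, a round that adds nothing new ends the sampling, while a round that adds no invalid dependencies but still raises the valid count by 25% does not.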
In the embodiments described above, valid function dependencies may be generated based on the original data set, which may be used to characterize relationships between fields in the original data set. Taking the valid function dependency "{ gender, age } → { name }" as an example, the function dependency includes a "gender" field and an "age" field in the left set and a "name" field in the right set, so the valid function dependency can represent: the field value of a tuple in the "name" field can be determined by the field value of the tuple in the "gender" field and the field value in the "age" field. Due to the special role of function dependence, function dependence is often applied in scenarios where field relationships are determined.
In one embodiment, the user needs to determine the relationship between the "gender" field and the "age" field in the data set shown in table 1. It is determined that neither "{ gender } → { age }" nor "{ age } → { gender }" exists in the valid function dependency set, and that "{ gender, blood pressure, medication } → { age }" and "{ age } → { gender }" exist in the invalid function dependency set. It can be seen that, given the field value of a tuple in the "gender" field, the field value of the tuple in the "age" field cannot be determined, and vice versa. Thus, there is no direct connection between the "gender" field and the "age" field, and neither field can be derived from the other.
In another embodiment, the user needs to determine the relationship between the "name" field and the "age" field in the data set shown in table 1. It is determined that "{ name } → { age }" exists in the valid function dependency set while "{ age } → { name }" does not, and that "{ age } → { name }" exists in the invalid function dependency set. It can be seen that, given the field value of a tuple in the "name" field, the field value of the tuple in the "age" field can be determined, whereas given the field value of a tuple in the "age" field, the field value of the tuple in the "name" field cannot be determined. Therefore, the field value of the "age" field can be derived from the field value of the "name" field, but not the other way round, and the relationship between the "name" field and the "age" field is: the "name" field unilaterally determines the "age" field.
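The two look-ups above amount to checking the valid function dependency set in both directions. A minimal Python sketch follows; the function names determines and relationship are hypothetical, and the valid set contains only the single-field dependency mentioned in the text.

```python
from typing import FrozenSet, Set, Tuple

FD = Tuple[FrozenSet[str], str]  # (left set, single right-hand field)


def determines(src: str, dst: str, valid_fds: Set[FD]) -> bool:
    """True if the single-field dependency { src } -> { dst } is valid."""
    return (frozenset({src}), dst) in valid_fds


def relationship(a: str, b: str, valid_fds: Set[FD]) -> str:
    a_to_b = determines(a, b, valid_fds)
    b_to_a = determines(b, a, valid_fds)
    if a_to_b and b_to_a:
        return f"{a} and {b} determine each other"
    if a_to_b:
        return f"{a} unilaterally determines {b}"
    if b_to_a:
        return f"{b} unilaterally determines {a}"
    return f"no direct relationship between {a} and {b}"


# Only the dependency named in the text: { name } -> { age }.
valid_fds: Set[FD] = {(frozenset({"name"}), "age")}
name_age = relationship("name", "age", valid_fds)
gender_age = relationship("gender", "age", valid_fds)
```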
In particular, the method for determining function dependencies may be applied to sensitive field identification scenarios. In an embodiment, the method further includes: determining the sensitive fields marked by the user among the at least two fields; determining a target function dependency from the valid function dependencies, wherein the fields contained in the left set of the target function dependency belong to the sensitive fields; and if a field contained in the right set of the target function dependency is not one of the sensitive fields, determining that field to be a potentially sensitive field.
Sensitive fields may refer to data fields whose disclosure could cause serious harm to society or to an individual, including private information of an individual such as name and gender. The user may mark fields simply as sensitive or non-sensitive, or may grade them, for example: the fields are divided into levels, non-sensitive fields are set to level 0, the level of a sensitive field is set higher as its sensitivity increases, and the highest level is 5. This specification does not limit the specific marking method.
If the sensitive fields marked by the user are gender and age, the valid function dependencies that take these two sensitive fields as the left set are: { gender, age } → { name }, { gender, age } → { blood pressure }, { gender, age } → { medication }. The right sets of these 3 valid function dependencies are the "name" field, the "blood pressure" field, and the "medication" field respectively, and none of these 3 fields is a marked sensitive field, so the "name" field, the "blood pressure" field, and the "medication" field can be judged to be potentially sensitive fields.
The embodiment applies the method for determining the function dependence to the field of sensitive field identification, determines the relation between fields based on the function dependence, and infers the potential sensitive fields through the function dependence and the sensitive fields, thereby avoiding the manual marking of the sensitive fields and reducing the time and labor cost.
Fig. 9 is a flowchart of a method for inferring sensitive fields based on function dependencies according to an exemplary embodiment. As shown in fig. 9, the method may include the following steps:
Step 901, obtaining function dependencies generated based on a data set, where the function dependencies are used for characterizing the relationships between fields in the data set.
The function-dependent determination method can be implemented by the foregoing embodiments, and can also be implemented by using other algorithms in the related art, which is not limited in this specification.
Step 902, determining sensitive fields marked by the user in the fields of the data set.
Step 903, determining a target function dependency from the function dependencies, wherein the fields contained in the left set of the target function dependency belong to the sensitive fields.
Step 904, if the field in the right set of the target function dependency is different from the sensitive field, determining that the field in the right set of the target function dependency is a potential sensitive field.
Still taking the data set of table 1 as an example, if the sensitive fields marked by the user are the "age" field and the "blood pressure" field, the function dependence with these two fields as the left set includes: { blood pressure, age } → { name }, { blood pressure, age } → { sex }, { blood pressure, age } → { medication }. The right set on which these 3 functions depend are the "name" field, "gender" field, and "medication" field, respectively, and none of these 3 fields are sensitive fields, so the "name" field, "gender" field, and "medication" field can be determined as potentially sensitive fields.
According to the embodiment, the relation between the fields in the data set is represented through the function dependence, so that a user can reason out potential sensitive fields based on the function dependence and the marked sensitive fields, the time and labor cost is reduced, and the efficiency of identifying the sensitive fields is improved.
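Steps 901 to 904 can be illustrated with a minimal Python sketch. It reads "the fields contained in the left set belong to the sensitive fields" as the left set being a subset of the marked sensitive fields, which is an interpretive assumption; the function name infer_potentially_sensitive is hypothetical. The input uses the three dependencies of the example above plus "{ name } → { age }" from an earlier embodiment, to show a dependency that is not selected.

```python
from typing import FrozenSet, Set, Tuple

FD = Tuple[FrozenSet[str], str]  # (left set, single right-hand field)


def infer_potentially_sensitive(fds: Set[FD], sensitive: Set[str]) -> Set[str]:
    """Return right-hand fields of dependencies whose left set consists
    only of marked sensitive fields and which are not marked themselves."""
    potential: Set[str] = set()
    for lhs, rhs in fds:
        is_target = bool(lhs) and lhs <= sensitive  # step 903
        if is_target and rhs not in sensitive:      # step 904
            potential.add(rhs)
    return potential


F = frozenset
function_dependencies: Set[FD] = {
    (F({"blood_pressure", "age"}), "name"),
    (F({"blood_pressure", "age"}), "gender"),
    (F({"blood_pressure", "age"}), "medication"),
    (F({"name"}), "age"),  # left set is not sensitive, so it is ignored
}
marked = {"age", "blood_pressure"}
potential = infer_potentially_sensitive(function_dependencies, marked)
```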
FIG. 10 is a schematic block diagram of a device provided in an exemplary embodiment. Referring to fig. 10, at the hardware level, the device includes a processor 1002, an internal bus 1004, a network interface 1006, a memory 1008, and a non-volatile storage 1010, and may of course also include hardware required for other functions. One or more embodiments of the present description can be implemented in software, for example by the processor 1002 reading the corresponding computer program from the non-volatile storage 1010 into the memory 1008 and then running it. Of course, besides a software implementation, the one or more embodiments in this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units, and may also be hardware or a logic device.
Referring to fig. 11, a function dependency determining apparatus may be applied to the device shown in fig. 10 to implement the technical solution of the present specification, and the apparatus may include:
a first obtaining unit 1101, configured to obtain an original data set, where the original data set includes multiple tuples, and each tuple includes field values corresponding to at least two fields;
a dividing unit 1102, configured to generate a corresponding division set for each field, where the division set includes at least one tuple set, and field values of tuples in each tuple set on fields corresponding to the division set are the same;
a sampling unit 1103, configured to sample all tuple sets to obtain tuple pairs, where each tuple pair includes two tuples belonging to the same tuple set;
a generating unit 1104, configured to generate a corresponding invalid function dependency for each tuple pair and add the invalid function dependency to an invalid function dependency set, so as to obtain all invalid function dependencies corresponding to the sampled tuple pairs;
an inversion unit 1105, configured to invert the invalid function dependency set to obtain a valid function dependency set, where the valid function dependency set includes the valid function dependencies corresponding to the original data set.
Optionally, the apparatus further includes:
a replacing unit 1106, configured to replace field values of the tuples in the original data set with: the serial number of the field value formed by combining the serial number of the partition set corresponding to the field to which the corresponding field value belongs and the serial number of the tuple set in which the tuple to which the field value belongs is positioned so as to generate a replacement data set;
the generating unit 1104 is specifically configured to generate the invalid function dependency from the replacement data set.
Optionally, the sampling unit 1103 is specifically configured to use each partition set as a queue unit, and sort the tuple sets in each queue unit according to an efficiency value, where the size of the efficiency value is positively correlated with the number of tuples included in the corresponding tuple set;
when the tuple sets in any queue unit are sampled, the tuple sets in the queue unit are sampled sequentially in descending order of efficiency value to obtain the tuple pairs.
Optionally, the sampling unit 1103 is specifically configured to perform sliding sampling on all tuple sets according to a sampling window of an initial size in the first-round sampling; or in the non-first-round sampling, the sampling window is increased on the basis of the previous round, and one round of sliding sampling is restarted on all tuple sets; wherein, the tuples at two ends of the sliding window are used for generating a tuple pair;
if the increase rate of the invalid function dependencies reaches a first preset lower threshold after the current round of sliding sampling is finished, the sampling is ended; otherwise, the next round of sliding sampling is entered.
Optionally, the sampling unit 1103 is specifically configured to enter the next round of sliding sampling if the increase rate of the valid function dependencies calculated after the current round of sliding sampling is finished has not reached a second preset lower threshold, until the increase rate of the valid function dependencies calculated after some round of sliding sampling reaches the second preset lower threshold.
Optionally, the inversion unit 1105 is specifically configured to:
extracting any invalid function dependency in the invalid function dependency set, wherein the left part set of the invalid function dependency is all fields of which two tuples contained in the corresponding tuple pair have the same field value, and the right part set of the invalid function dependency is any field of which two tuples contained in the corresponding tuple pair have different field values;
adding a candidate field in the left part set of any invalid function dependency to obtain a function dependency to be verified, wherein the candidate field is other fields of the at least two fields, which are different from fields contained in the left part set and the right part set of any invalid function dependency;
and if the function dependency to be verified neither belongs to the invalid function dependency set nor is a generalization of an invalid function dependency in the set, determining the function dependency to be verified to be a valid function dependency, and adding it to the valid function dependency set.
Optionally, the apparatus further includes:
a first determination unit 1107 configured to determine an unnecessary invalid function dependency whose left set is included in the left set of other invalid function dependencies and whose right set is the same as the right set of other invalid function dependencies;
the inversion unit 1105 is specifically configured to add a candidate field in a left part of any invalid function dependency different from the unnecessary invalid function dependencies.
Optionally, the apparatus further includes:
an adding unit 1108, configured to, if the function dependency to be verified belongs to the invalid function dependency set or generalization of invalid function dependencies in the set, continue to add other candidate fields different from the added candidate fields in the left part set of the function dependency to be verified until the function dependency to be verified is determined to be the valid function dependency.
Optionally, the apparatus further includes:
a second determining unit 1109, configured to determine a sensitive field marked by the user in the at least two fields;
a third determining unit 1110, configured to determine a target function dependency from the valid function dependencies, where the fields included in the left set of the target function dependency belong to the sensitive fields;
a first judging unit 1111, configured to judge that a field included in the right set of the target function dependency is a potentially sensitive field if that field is different from the sensitive fields.
Referring to fig. 12, an apparatus for inferring sensitive fields based on function dependencies may be applied to the device shown in fig. 10 to implement the technical solution of the present specification, and the apparatus may include:
the second acquisition unit is used for acquiring a function dependence generated based on a data set, and the function dependence is used for representing the relation of fields contained in the data set;
a fourth determining unit, configured to determine a sensitive field marked by a user in a field of the data set;
a fifth determining unit, configured to determine a target function dependency from the function dependencies, where the fields included in the left set of the target function dependency belong to the sensitive fields;
and a second judging unit, configured to judge that a field included in the right set of the target function dependency is a potentially sensitive field if that field is different from the sensitive fields.
The systems, apparatuses, modules or units described in the above embodiments may be specifically implemented by a computer chip or an entity, or implemented by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, in the form of random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus comprising the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein in one or more embodiments to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "at the time of ...", "when ...", or "in response to a determination", depending on the context.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (12)

1. A function dependency determination method, the method comprising:
acquiring an original data set, wherein the original data set comprises a plurality of tuples, and each tuple comprises field values corresponding to at least two fields;
generating a corresponding partition set for each field, wherein the partition set comprises at least one tuple set, and the tuples in each tuple set have the same value on the field corresponding to the partition set;
sampling all tuple sets to obtain tuple pairs, wherein each tuple pair comprises two tuples belonging to the same tuple set;
generating a corresponding invalid function dependency for each tuple pair and adding it to an invalid function dependency set, so as to obtain all invalid function dependencies corresponding to the sampled tuple pairs; and
inverting the invalid function dependency set to obtain a valid function dependency set, wherein the valid function dependency set comprises the valid function dependencies corresponding to the original data set.
2. The method of claim 1, further comprising:
replacing each field value of the tuples in the original data set with a number formed by combining the serial number of the partition set corresponding to the field to which the field value belongs and the serial number of the tuple set containing the tuple to which the field value belongs, so as to generate a replacement data set;
wherein generating a corresponding invalid function dependency for each tuple pair comprises: generating the invalid function dependency from the replacement data set.
3. The method of claim 1, wherein sampling all tuple sets to obtain tuple pairs comprises:
taking each partition set as a queue unit, and sorting the tuple sets in each queue unit by an efficiency value, wherein the efficiency value is positively correlated with the number of tuples contained in the corresponding tuple set; and
when the tuple sets in any queue unit are sampled, sampling the tuple sets in that queue unit in descending order of efficiency value to obtain tuple pairs.
4. The method of claim 1, wherein sampling all tuple sets to obtain tuple pairs comprises:
in the first round of sampling, performing sliding sampling on all tuple sets with a sampling window of an initial size; or, in a subsequent round of sampling, enlarging the sampling window relative to the previous round and restarting a round of sliding sampling on all tuple sets; wherein the tuples at the two ends of the sliding window are used to generate a tuple pair; and
if the increase rate of the invalid function dependencies calculated after the current round of sliding sampling is finished reaches a first preset lower limit threshold, finishing the sampling; otherwise, entering the next round of sliding sampling.
5. The method of claim 4, wherein sampling all tuple sets to obtain tuple pairs comprises:
if the increase rate of the valid function dependencies calculated after the current round of sliding sampling is finished does not reach a second preset lower limit threshold, entering the next round of sliding sampling, until the increase rate of the valid function dependencies calculated after any round of sliding sampling reaches a preset threshold.
6. The method of claim 1, wherein inverting the invalid function dependency set to obtain a valid function dependency set comprises:
extracting any invalid function dependency in the invalid function dependency set, wherein the left set of the invalid function dependency consists of all fields on which the two tuples of the corresponding tuple pair have the same field value, and the right set of the invalid function dependency consists of any one field on which the two tuples of the corresponding tuple pair have different field values;
adding a candidate field to the left set of the invalid function dependency to obtain a function dependency to be verified, wherein the candidate field is a field, among the at least two fields, other than the fields contained in the left set and the right set of the invalid function dependency; and
if the function dependency to be verified neither belongs to the invalid function dependency set nor is a generalization of an invalid function dependency in the set, determining that the function dependency to be verified is a valid function dependency, and adding it to the valid function dependency set.
7. The method of claim 6, further comprising:
determining unnecessary invalid function dependencies, wherein the left set of an unnecessary invalid function dependency is contained in the left set of another invalid function dependency, and the right set of the unnecessary invalid function dependency is the same as the right set of that other invalid function dependency;
wherein adding a candidate field to the left set of the invalid function dependency comprises: adding a candidate field to the left set of any invalid function dependency other than the unnecessary invalid function dependencies.
8. The method of claim 6, further comprising:
if the function dependency to be verified belongs to the invalid function dependency set or is a generalization of an invalid function dependency in the set, continuing to add, to the left set of the function dependency to be verified, other candidate fields different from the candidate fields already added, until the function dependency to be verified is determined to be a valid function dependency.
9. The method of claim 1, further comprising:
determining the sensitive fields marked by a user among the at least two fields;
determining a target function dependency from the valid function dependencies, wherein the fields contained in the left set of the target function dependency belong to the sensitive fields; and
if a field contained in the right set of the target function dependency is different from the sensitive fields, determining that field to be a potential sensitive field.
10. A method of inferring sensitive fields based on function dependencies, the method comprising:
obtaining function dependencies generated based on a data set, the function dependencies being used to characterize relationships between fields in the data set;
determining the sensitive fields marked by a user among the fields of the data set;
determining a target function dependency from the function dependencies, wherein the fields contained in the left set of the target function dependency belong to the sensitive fields; and
if a field contained in the right set of the target function dependency is different from the sensitive fields, determining that field to be a potential sensitive field.
11. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method of any one of claims 1-10 by executing the executable instructions.
12. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, carry out the steps of the method according to any one of claims 1-10.
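
By way of illustration only, and without limiting the claims, the flow of claims 1, 6 and 9 may be sketched in Python as follows. The sketch rests on several assumptions that are not part of this specification: all function names and the sample data are hypothetical; tuple pairs are enumerated exhaustively within each tuple set instead of being sampled with a sliding window; inversion is done by enumerating candidate left sets by size rather than by adding candidate fields one at a time; and only non-empty left sets are considered.

```python
from collections import defaultdict
from itertools import combinations


def partitions(rows, fields):
    """Build one partition set per field: groups of row indices sharing a value."""
    result = {}
    for f in fields:
        groups = defaultdict(list)
        for i, row in enumerate(rows):
            groups[row[f]].append(i)
        # Tuple sets with a single tuple cannot yield a tuple pair.
        result[f] = [g for g in groups.values() if len(g) > 1]
    return result


def sample_pairs(parts):
    """Assumption: take every pair inside each tuple set (no sliding window)."""
    pairs = set()
    for tuple_sets in parts.values():
        # Larger tuple sets first, loosely mirroring the efficiency ordering.
        for ts in sorted(tuple_sets, key=len, reverse=True):
            pairs.update(combinations(ts, 2))
    return pairs


def invalid_fds(rows, fields, pairs):
    """Each pair yields (agreeing fields) -> (one differing field) as invalid."""
    invalid = set()
    for i, j in pairs:
        agree = frozenset(f for f in fields if rows[i][f] == rows[j][f])
        for f in fields:
            if rows[i][f] != rows[j][f]:
                invalid.add((agree, f))
    return invalid


def invert(fields, invalid):
    """Simplified inversion: keep minimal left sets not covered by any invalid one."""
    valid = []
    for rhs in fields:
        others = [f for f in fields if f != rhs]
        blockers = [lhs for lhs, r in invalid if r == rhs]
        minimal = []
        for size in range(1, len(others) + 1):
            for cand in map(frozenset, combinations(others, size)):
                if any(m <= cand for m in minimal):
                    continue  # a smaller valid left set already covers this one
                if not any(cand <= b for b in blockers):
                    minimal.append(cand)
        valid.extend((lhs, rhs) for lhs in minimal)
    return valid


def potential_sensitive(valid, sensitive):
    """Right-side fields determined solely by user-marked sensitive fields."""
    return {rhs for lhs, rhs in valid if lhs <= sensitive and rhs not in sensitive}


# Hypothetical sample data, for illustration only.
fields = ["id_no", "name", "city", "zip"]
rows = [
    {"id_no": "A1", "name": "Ann", "city": "Hangzhou", "zip": "310000"},
    {"id_no": "A2", "name": "Bob", "city": "Hangzhou", "zip": "310000"},
    {"id_no": "A3", "name": "Ann", "city": "Beijing", "zip": "100000"},
    {"id_no": "A4", "name": "Cid", "city": "Beijing", "zip": "100000"},
]
pairs = sample_pairs(partitions(rows, fields))
valid = invert(fields, invalid_fds(rows, fields, pairs))
inferred = potential_sensitive(valid, {"zip"})
```

On this sample data, marking `zip` as sensitive leads the sketch to flag `city` as a potential sensitive field, because `zip` determines `city`. Exhaustive pairing is practical only for small data sets; the sliding-window sampling of claims 4 and 5 trades completeness of the invalid function dependency set for speed.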

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210699530.4A CN115168504A (en) 2022-06-20 2022-06-20 Function dependence determination method and device

Publications (1)

Publication Number Publication Date
CN115168504A true CN115168504A (en) 2022-10-11

Family

ID=83488260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210699530.4A Pending CN115168504A (en) 2022-06-20 2022-06-20 Function dependence determination method and device

Country Status (1)

Country Link
CN (1) CN115168504A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination