CN113568942A

CN113568942A - Data set frequent item set mining availability evaluation method

Info

Publication number: CN113568942A
Application number: CN202110579345.7A
Authority: CN
Inventors: 吴卓超
Original assignee: Nanjing Normal University
Current assignee: Nanjing Normal University
Priority date: 2021-05-26
Filing date: 2021-05-26
Publication date: 2021-10-29

Abstract

The invention discloses a method for evaluating the frequent item set mining availability of a data set, comprising the following steps: (1) let C = {I₁, I₂, …, I_n} be a collection of items, and let D₁ and D₂ be given transactional data sets in which each transaction T is a non-empty set of items such that T ⊆ C; D₁ and D₂ are mined with the Apriori algorithm to obtain their maximal frequent item sets, recorded as FIS₁ and FIS₂; (2) each item set MIS₁ in FIS₁ is matched with an item set MIS₂ in FIS₂ by an item set matching algorithm F to obtain a paired item set table Pairs, which consists of item set pairs <MIS₁, MIS₂, score₁>, where score₁ denotes the item similarity of MIS₁ and MIS₂ calculated during matching; (3) for each pair <MIS₁, MIS₂, score₁> in Pairs, the support similarity score₂ of MIS₁ and MIS₂ is computed, the composite similarity score of MIS₁ and MIS₂ is then calculated, and the pair is updated to <MIS₁, MIS₂, score>; (4) the composite similarity scores of all pairs in Pairs are accumulated and divided by the number of pairs in Pairs to obtain the similarity SCORE of D₁ and D₂, whose value range is [0, 1].

Description

Data set frequent item set mining availability evaluation method

Technical Field

The invention relates to a method for evaluating the frequent item set mining availability of a data set, that is, how usable a data set is for frequent item set mining analysis.

Background

Frequent item set mining has been widely studied, but evaluating the usability of a data set for frequent item set mining is still at an early stage. No research is dedicated specifically to this usability evaluation, and the evaluation indexes currently used in the field of frequent item set mining analysis are precision, relative error (RE), and the like.

However, precision mainly measures the item similarity of the frequent item sets, while RE uses the median of the support similarity to represent the support similarity between frequent item sets. The two indexes are relatively independent, and each gives only a one-sided comparison. The similarity of frequent item sets is inseparable from both item similarity and support similarity, yet two separate indexes cannot compare that similarity in a unified dimension, so the availability of a data set for frequent item set mining analysis cannot be quantified.

Disclosure of Invention

The invention aims to provide a method for evaluating the frequent item set mining availability of a data set. The method combines item set similarity with support similarity into a new measurement index, SCORE, which reflects the similarity of two data sets and quantifies the availability of a data set for frequent item set mining analysis. The composite similarity between the data sets is calculated to evaluate that availability: the higher the similarity, the better the availability.

The technical scheme adopted by the invention is as follows: a method of dataset frequent item set mining availability assessment, the method comprising the steps of:

Step (1): given data sets D₁ and D₂, mine D₁ and D₂ with the Apriori algorithm to obtain the maximal frequent item sets, recorded as FIS₁ and FIS₂, where l₁ and l₂ are the largest item set cardinalities in FIS₁ and FIS₂, respectively;

Step (2): pair each item set I₁ in FIS₁ with an item set I₂ in FIS₂ to obtain pairing results pair <I₁, I₂, score₁>, which are added to Pairs, where score₁ denotes the item similarity of I₁ and I₂;

(a) for I₁ in FIS₁ and I₂ in FIS₂, if I₁ and I₂ are identical, match them with score₁ = 1; then set k = 1;

(b) for I₁ in FIS₁ and I₂ in FIS₂, calculate the distance dis between I₁ and I₂; if dis equals k, add I₂ to the candidate matching set of I₁ and add I₁ to the candidate matching set of I₂;

(c) for each item set I₁ in FIS₁, if its candidate matching set PList is empty, skip the current item set; otherwise select the optimal item set I₂ in PList, compute score₁ for I₁ and I₂, and match them;

(d) increment k; if k is less than MAX(l₁, l₂), return to step (b); if k equals MAX(l₁, l₂), match the first n item sets of FIS₁ one by one with the first n item sets of FIS₂, setting score₁ = 0.1 for each match, where n = MIN(|FIS₁|, |FIS₂|);

(e) match each item set remaining in FIS₁ and FIS₂ with the empty set, setting score₁ = 0.

Step (3): for each pair <I₁, I₂, score₁> in Pairs, calculate the support similarity score₂ of I₁ and I₂, obtain from it the similarity score of I₁ and I₂, and update the pair to <I₁, I₂, score>;

Step (4): add the scores of all pairs in Pairs and divide the sum by the number of pairs in Pairs to obtain the similarity SCORE of D₁ and D₂, whose value range is [0, 1].

The quantities score₁, score₂, and score in step (3) are defined as follows:

Definition (item similarity score₁): the item-based similarity of item sets I₁ and I₂ is recorded as score₁ and is calculated as follows:

if I₁ and I₂ are identical, score₁ = 1;

if I₁ and I₂ are different,

if one of I₁ and I₂ is an empty set, score₁ = 0;

Definition (support similarity score₂): the support-based similarity of the paired item sets I₁ and I₂ is recorded as score₂ and is calculated as follows:

for I₁ and I₂ of a pair in Pairs, where I₁ has support s₁ and I₂ has support s₂,

Definition (item set similarity score): the similarity of item sets I₁ and I₂ is recorded as score. It is based mainly on the item similarity score₁ and is further refined by the support similarity score₂. The calculation is as follows: score = score₁ * score₂.
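As a hedged illustration, the three definitions can be sketched in Python. The formulas for the "different item sets" case of score₁ and for score₂ are not reproduced in the text above, so the sketch assumes score₁ = |I₁ ∩ I₂| / MAX(|I₁|, |I₂|) and score₂ = MIN(s₁, s₂) / MAX(s₁, s₂). Both are assumptions, chosen only because they reproduce the values reported in Example 1 (0.75 and 0.5 for score₁; 0.75 and 1 for score₂). The function names are hypothetical.

```python
def item_similarity(i1: frozenset, i2: frozenset) -> float:
    """score1: item-based similarity of two item sets."""
    if not i1 or not i2:
        return 0.0          # one side is the empty set
    if i1 == i2:
        return 1.0          # identical item sets
    # ASSUMPTION: overlap divided by the larger cardinality
    # (the formula itself is missing from the text).
    return len(i1 & i2) / max(len(i1), len(i2))


def support_similarity(s1: float, s2: float) -> float:
    """score2: support-based similarity of two paired item sets."""
    if min(s1, s2) <= 0:
        return 0.0          # pair with the empty set (no support)
    # ASSUMPTION: ratio of the smaller to the larger support
    # (the formula itself is missing from the text).
    return min(s1, s2) / max(s1, s2)


def composite_similarity(i1, s1, i2, s2) -> float:
    """score = score1 * score2."""
    return item_similarity(i1, i2) * support_similarity(s1, s2)


print(composite_similarity(frozenset("abc"), 4, frozenset("abcd"), 3))  # 0.5625
print(composite_similarity(frozenset("bdf"), 3, frozenset("bdgh"), 3))  # 0.5
```

Under these assumptions score₁ acts as the main signal and score₂ can only scale it down, which matches the statement that score is based mainly on score₁.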

The matching operation used in steps (a), (c), (d), and (e) of step (2) adds <I₁, I₂, score₁> to Pairs, with the score value set according to the scenario, and at the same time deletes I₁ and I₂ from FIS₁ and FIS₂, respectively.

The distance dis in step (b) of algorithm (2) is defined as follows:

definition (item set distance dis): the item set distance represents the number of non-coincident items between the item sets, is recorded as dis, and is calculated as follows:

dis＝MAX(l₁,l₂)-|I₁∩I₂|
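A minimal sketch of the distance, for illustration only. The formula above writes l₁ and l₂; this sketch reads them as the cardinalities of the two item sets being compared, which is the form given in claim 4. That reading is an interpretation, and the function name is hypothetical.

```python
def item_set_distance(i1: frozenset, i2: frozenset) -> int:
    """dis = MAX(|I1|, |I2|) - |I1 ∩ I2|: the number of non-coincident items."""
    return max(len(i1), len(i2)) - len(i1 & i2)


print(item_set_distance(frozenset("abc"), frozenset("abcd")))  # 1
print(item_set_distance(frozenset("bdf"), frozenset("bdgh")))  # 2
```

Identical item sets have distance 0, and two disjoint item sets have a distance equal to the larger cardinality.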

The matching algorithm of step (2) is a heuristic that matches item sets on a nearest-first principle: two item sets that are close to each other are matched preferentially. The parameter k controls the heuristic. In the current round, only pairs of item sets whose distance is k are considered for matching, and iterating over k ensures that closer item sets are matched first. The nearest-match principle turns an unordered search into an ordered matching in which each round of computation is reused rather than repeated. In each k-distance search, all item sets at distance k-1 are excluded and item sets at distance k+1 are outside the search range, so the search space is effectively reduced and the repeated computation of matching each item set against all the others is avoided.

In step (2), the candidate matching set PList stores the item sets whose distance from the current item set I is k; collecting every item set at distance k from I yields its candidate matching set. The optimal match is then selected from the candidate matching set, which preserves the nearest-match principle. The selection works as follows: if PList contains exactly one item set I₂, the current item set is paired with it directly; if PList contains several item sets, the current item set is paired with the item set I₂ in PList whose own candidate matching set has the smallest cardinality, so that each selected match has the least influence on the matching of the other item sets.
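The selection rule can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function name, the argument layout, and the sample data are hypothetical, and the tie-breaking rule is an assumption because the text does not specify one.

```python
def select_min_influence(plist, candidates_of):
    """Pick the partner for the current item set from its candidate list.

    plist         -- candidate item sets I2 for the current item set
    candidates_of -- dict mapping each I2 to its own candidate matching set
    """
    if not plist:
        return None                      # nothing at distance k: skip this item set
    if len(plist) == 1:
        return plist[0]                  # exactly one candidate: pair directly
    # Several candidates: take the one whose own candidate set is smallest.
    # ASSUMPTION: ties are broken by position in plist (not stated in the text).
    return min(plist, key=lambda i2: len(candidates_of[i2]))


# Hypothetical data: x could also pair with two item sets, y with only one.
x, y = frozenset("abcd"), frozenset("abce")
candidates_of = {x: [frozenset("abc"), frozenset("abd")], y: [frozenset("abc")]}
print(sorted(select_min_influence([x, y], candidates_of)))  # ['a', 'b', 'c', 'e']
```

Choosing y leaves x available for the item set that has no other candidate, which is the sense in which the choice has the least influence on other matches.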

Compared with the prior art, the invention has the following technical effects. Item set similarity and support similarity are combined into a new measurement index, SCORE, which reflects the similarity of two data sets. The data sets before and after processing by a privacy-preserving frequent item set publishing algorithm can therefore be compared, which quantifies the usability of privacy-preserving frequent item set publishing. In addition, item set matching uses a nearest-first heuristic, so the algorithm preferentially matches the closest item sets instead of matching each item set against all the others. This avoids a large amount of repeated computation, reduces the search space, and improves the running efficiency of the algorithm while ensuring optimal matching.

Drawings

FIG. 1 is a process flow diagram of the present invention.

Detailed Description

The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

Example 1: referring to FIG. 1, a method for evaluating the availability of privacy-preserving frequent item set publishing comprises the following steps:

Step one: given data sets D₁ and D₂, mine D₁ and D₂ with the Apriori algorithm to obtain the maximal frequent item sets, recorded as FIS₁ and FIS₂, where l₁ and l₂ are the largest item set cardinalities in FIS₁ and FIS₂, respectively;

In this example, the support threshold is set to 3. Mining gives FIS₁ = {{a,b,c}:4, {a,c,d}:4, {b,d,e}:3, {a,d,e}:3, {b,d,f}:3} and FIS₂ = {{a,b,c,d}:3, {b,c,d,e}:4, {a,d,e,f}:3, {b,d,g,h}:3}, so l₁ = 3 and l₂ = 4;
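For readers who want to see how a table such as FIS₁ could be produced, here is a small sketch. It is not the Apriori algorithm itself: it enumerates item combinations level by level (stopping at the first level with no frequent set, which is safe because support can only shrink as a set grows) and then keeps the maximal sets. The transactions are hypothetical, since the example does not list D₁ or D₂.

```python
from itertools import combinations


def maximal_frequent_item_sets(transactions, min_support):
    """Return {item set: support} for the maximal frequent item sets."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                frequent[candidate] = support
                found = True
        if not found:
            break  # no frequent set of this size, so none of a larger size
    # Keep only sets that have no frequent proper superset.
    return {s: sup for s, sup in frequent.items()
            if not any(s < other for other in frequent)}


# Hypothetical transactional data set, support threshold 3.
transactions = [frozenset("abc"), frozenset("abc"), frozenset("abc"),
                frozenset("ab"), frozenset("cd")]
result = maximal_frequent_item_sets(transactions, 3)
print({"".join(sorted(s)): sup for s, sup in result.items()})  # {'abc': 3}
```

The brute-force enumeration is exponential in the number of items and is only meant to make the notion of a maximal frequent item set concrete.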

Step two: pair each item set I₁ in FIS₁ with an item set I₂ in FIS₂ to obtain pairing results pair <I₁, I₂, score₁>, which are added to Pairs, where score₁ denotes the item similarity of I₁ and I₂. The specific pairing steps are as follows:

(a) for I₁ in FIS₁ and I₂ in FIS₂, if I₁ and I₂ are identical, match them with score₁ = 1; then set k = 1. In this example there are no identical item sets, so Pairs is currently { };

Steps (b) and (c) then carry out the matching;

In this example, when k = 1, step (b) gives the following candidate matching sets for FIS₁ and FIS₂:

TABLE 1 Candidate matching sets of FIS₁

Frequent item set I	Candidate matching set
{a,b,c}	{a,b,c,d}
{a,c,d}	{a,b,c,d}
{b,d,e}	{b,c,d,e}
{a,d,e}	{a,d,e,f}
{b,d,f}	(empty)

TABLE 2 Candidate matching sets of FIS₂

Frequent item set I	Candidate matching set
{a,b,c,d}	{a,b,c},{a,c,d}
{b,c,d,e}	{b,d,e}
{a,d,e,f}	{a,d,e}
{b,d,g,h}	(empty)
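The two tables can be reproduced with a short sketch of step (b). The function names are hypothetical, and the distance is taken as MAX(|I₁|, |I₂|) - |I₁ ∩ I₂|.

```python
def dis(a, b):
    # Item set distance: MAX(|a|, |b|) - |a ∩ b|.
    return max(len(a), len(b)) - len(a & b)


def candidate_sets(fis1, fis2, k):
    """Step (b): for each item set, collect the item sets at distance exactly k."""
    c1 = {i1: [] for i1 in fis1}
    c2 = {i2: [] for i2 in fis2}
    for i1 in fis1:
        for i2 in fis2:
            if dis(i1, i2) == k:
                c1[i1].append(i2)
                c2[i2].append(i1)
    return c1, c2


FIS1 = [frozenset(s) for s in ("abc", "acd", "bde", "ade", "bdf")]
FIS2 = [frozenset(s) for s in ("abcd", "bcde", "adef", "bdgh")]

c1, c2 = candidate_sets(FIS1, FIS2, k=1)
for i1, plist in c1.items():
    print("".join(sorted(i1)), ["".join(sorted(i2)) for i2 in plist])
# abc ['abcd']
# acd ['abcd']
# bde ['bcde']
# ade ['adef']
# bdf []
```

Both {a,b,c} and {a,c,d} list {a,b,c,d} as their only candidate, so only the one processed first can be matched in this round.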

In step (c), for the frequent item sets in FIS₁: {a,b,c} is matched with {a,b,c,d}, the calculated score₁ is 0.75, and <{a,b,c},{a,b,c,d},0.75> is added to Pairs; {b,d,e} is matched with {b,c,d,e}, the calculated score₁ is 0.75, and the pair is added to Pairs; {a,d,e} is matched with {a,d,e,f}, the calculated score₁ is 0.75, and the pair is added to Pairs. The matched item sets are then deleted from FIS₁ and FIS₂, giving FIS₁ = {{a,c,d},{b,d,f}} and FIS₂ = {{b,d,g,h}};

In step (d), k = 2 and the process returns to step (b);

When k = 2, the candidate matching sets are as follows:

TABLE 3 Candidate matching sets of FIS₁

Frequent item set I	Candidate matching set
{a,c,d}	(empty)
{b,d,f}	{b,d,g,h}

TABLE 4 Candidate matching sets of FIS₂

Frequent item set I	Candidate matching set
{b,d,g,h}	{b,d,f}

In step (c), for FIS₁, {b,d,f} is matched with {b,d,g,h}, the calculated score₁ is 0.5, and <{b,d,f},{b,d,g,h},0.5> is added to Pairs. The matched item sets are then deleted from FIS₁ and FIS₂, giving FIS₁ = {{a,c,d}} and FIS₂ = {};

In step (d), k = 3 and the process returns to step (b):

TABLE 5 Candidate matching sets of FIS₁

Frequent item set I	Candidate matching set
{a,c,d}	(empty)

TABLE 6 Candidate matching sets of FIS₂

Frequent item set I	Candidate matching set

After step (c), FIS₁ = {{a,c,d}} and FIS₂ = {};

In step (d), k = 4 = MAX(l₁, l₂), and the first 0 item sets of FIS₁ and FIS₂ are matched one by one (n = MIN(|FIS₁|, |FIS₂|) = 0, so no pair is added).

Step (e) adds <{a,c,d},{},0> to Pairs, so that finally Pairs = {<{a,b,c},{a,b,c,d},0.75>, <{b,d,e},{b,c,d,e},0.75>, <{a,d,e},{a,d,e,f},0.75>, <{b,d,f},{b,d,g,h},0.5>, <{a,c,d},{},0>};
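The whole of step two can be sketched end to end. The sketch below is one possible reading of steps (a) to (e), not the patented implementation. It assumes score₁ = |I₁ ∩ I₂| / MAX(|I₁|, |I₂|) for non-identical item sets (the formula is missing from the text, but this choice reproduces the 0.75 and 0.5 values above), breaks ties by list order, and uses hypothetical function names.

```python
def dis(a, b):
    # Item set distance: MAX(|a|, |b|) - |a ∩ b|.
    return max(len(a), len(b)) - len(a & b)


def score1(a, b):
    # ASSUMPTION: overlap / larger cardinality; 0 when either side is empty.
    return len(a & b) / max(len(a), len(b)) if a and b else 0.0


def match_item_sets(fis1, fis2):
    rest1, rest2, pairs = list(fis1), list(fis2), []
    # (a) identical item sets are paired with score1 = 1.
    for i1 in list(rest1):
        if i1 in rest2:
            pairs.append((i1, i1, 1.0))
            rest1.remove(i1)
            rest2.remove(i1)
    k_max = max((len(s) for s in list(fis1) + list(fis2)), default=0)  # MAX(l1, l2)
    for k in range(1, k_max):
        # (b) candidate matching sets at distance exactly k.
        cand1 = {i1: [i2 for i2 in rest2 if dis(i1, i2) == k] for i1 in rest1}
        cand2 = {i2: [i1 for i1 in rest1 if dis(i1, i2) == k] for i2 in rest2}
        # (c) pair each I1 with its still-unmatched candidate of least influence.
        for i1 in list(rest1):
            plist = [i2 for i2 in cand1[i1] if i2 in rest2]
            if not plist:
                continue
            best = min(plist, key=lambda i2: sum(x in rest1 for x in cand2[i2]))
            pairs.append((i1, best, score1(i1, best)))
            rest1.remove(i1)
            rest2.remove(best)
    # (d) k == MAX(l1, l2): pair the first n leftovers with score1 = 0.1.
    n = min(len(rest1), len(rest2))
    pairs += [(i1, i2, 0.1) for i1, i2 in zip(rest1[:n], rest2[:n])]
    # (e) whatever remains is paired with the empty set, score1 = 0.
    pairs += [(i1, frozenset(), 0.0) for i1 in rest1[n:]]
    pairs += [(frozenset(), i2, 0.0) for i2 in rest2[n:]]
    return pairs


FIS1 = [frozenset(s) for s in ("abc", "acd", "bde", "ade", "bdf")]
FIS2 = [frozenset(s) for s in ("abcd", "bcde", "adef", "bdgh")]
pairs = match_item_sets(FIS1, FIS2)
for a, b, s in pairs:
    print("".join(sorted(a)) or "{}", "".join(sorted(b)) or "{}", s)
```

On the example data this yields the same five pairs as listed above, in the same order.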

Step three: for each pair <I₁, I₂, score₁> in Pairs, calculate the support similarity score₂ of I₁ and I₂, obtain from it the similarity score of I₁ and I₂, and update the pair to <I₁, I₂, score>;

In step three: for <{a,b,c},{a,b,c,d},0.75>, score₂ = 0.75 and the final score is 0.5625; for <{b,d,e},{b,c,d,e},0.75>, score₂ = 0.75 and the final score is 0.5625; for <{a,d,e},{a,d,e,f},0.75>, score₂ = 1 and the final score is 0.75; for <{b,d,f},{b,d,g,h},0.5>, score₂ = 1 and the final score is 0.5; for <{a,c,d},{},0>, score₂ = 0 and the final score is 0. Finally, Pairs = {<{a,b,c},{a,b,c,d},0.5625>, <{b,d,e},{b,c,d,e},0.5625>, <{a,d,e},{a,d,e,f},0.75>, <{b,d,f},{b,d,g,h},0.5>, <{a,c,d},{},0>};

Step four: add the scores of all pairs in Pairs and divide the sum by the number of pairs in Pairs to obtain the similarity SCORE of D₁ and D₂, whose value range is [0, 1].
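The text does not state the final value for this example. As a worked check, averaging the five composite scores listed in step three can be sketched as follows; the function name is hypothetical, and the return value for an empty Pairs is an assumption.

```python
def dataset_score(pairs):
    """SCORE = mean of the composite scores of all pairs in Pairs."""
    if not pairs:
        return 0.0  # ASSUMPTION: the text does not define SCORE for an empty Pairs
    return sum(score for _, _, score in pairs) / len(pairs)


# Composite scores from step three of Example 1.
pairs = [
    ("abc", "abcd", 0.5625),
    ("bde", "bcde", 0.5625),
    ("ade", "adef", 0.75),
    ("bdf", "bdgh", 0.5),
    ("acd", "", 0.0),
]
print(dataset_score(pairs))  # 0.475
```

That is, (0.5625 + 0.5625 + 0.75 + 0.5 + 0) / 5 = 0.475. The unmatched item set {a,c,d} contributes 0 to the sum but still counts in the denominator, so leftovers lower SCORE.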

It should be noted that the above embodiments are only preferred embodiments of the present invention and are not intended to limit its scope of protection; all equivalent replacements or substitutions made on the basis of the above technical solutions fall within the scope of protection of the present invention.

Claims

1. A method for evaluating the frequent item set mining availability of a data set, characterized in that the method comprises the following steps:

step (1): let C = {I₁, I₂, …, I_n} be a collection of items, and let D₁ and D₂ be given transactional data sets in which each transaction T is a non-empty set of items such that T ⊆ C; D₁ and D₂ are mined with the Apriori algorithm to obtain the maximal frequent item sets, recorded as FIS₁ and FIS₂; definition (maximal frequent item set MIS): a maximal frequent item set MIS is an item set that is itself frequent but none of whose supersets is frequent; FIS₁ and FIS₂ each contain several MIS together with their support information; l₁ and l₂ denote the maximum values of |MIS₁| and |MIS₂| in FIS₁ and FIS₂, respectively; herein MIS₁ and MIS₂ denote an item set from FIS₁ and FIS₂, respectively, and likewise below;

step (2): each item set MIS₁ in FIS₁ is matched with an item set MIS₂ in FIS₂ by an item set matching algorithm F to obtain a paired item set table Pairs, which consists of several item set pairs <MIS₁, MIS₂, score₁>, where score₁ denotes the item similarity of MIS₁ and MIS₂ calculated during matching;

step (3): for every pair <MIS₁, MIS₂, score₁> in Pairs, the support similarity score₂ of MIS₁ and MIS₂ is computed, the composite similarity score of MIS₁ and MIS₂ is then calculated, and the pair is updated to <MIS₁, MIS₂, score>;

step (4): the composite similarity scores of all pairs in Pairs are accumulated and divided by the number of pairs in Pairs to obtain the similarity SCORE of D₁ and D₂, whose value range is [0, 1].

2. The method for evaluating the frequent item set mining availability of a data set according to claim 1, characterized in that the item set matching algorithm F in step (2) is described as follows:

(a) set score₁ = 1, add the identical item sets of FIS₁ and FIS₂ to Pairs in the form <MIS₁, MIS₂, score₁>, delete the matched MIS₁ and MIS₂ from FIS₁ and FIS₂ respectively, and set k = 1;

(b) initialize the candidate matching set of every item set in FIS₁ and FIS₂ to the empty set; for any item set MIS₁ in FIS₁ and any item set MIS₂ in FIS₂, compute the distance dis between MIS₁ and MIS₂; if dis equals k, add MIS₂ to the candidate matching set of MIS₁ and add MIS₁ to the candidate matching set of MIS₂;

(c) for any item set MIS₁ in FIS₁, if its candidate matching set PList is empty, skip the current item set; otherwise select an item set MIS₂ in PList according to the minimum influence matching strategy, calculate score₁, add <MIS₁, MIS₂, score₁> to Pairs, and delete MIS₁ and MIS₂ from FIS₁ and FIS₂ respectively;

(d) increment k; if k is less than MAX(l₁, l₂), return to step (b); if k equals MAX(l₁, l₂), match the first n item sets of FIS₁ one by one with the first n item sets of FIS₂, where n = MIN(|FIS₁|, |FIS₂|), setting score₁ = 0.1 during matching and adding the pairs to Pairs, and finally delete the matched item sets from FIS₁ and FIS₂;

(e) set score₁ = 0, match the remaining item sets of FIS₁ and FIS₂ with the empty set, and add the pairs to Pairs.

3. The method for evaluating the frequent item set mining availability of a data set according to claim 1, characterized in that the item similarity score₁, the support similarity score₂, and the composite similarity score in steps (2) and (3) are defined as follows:

definition (item similarity score₁): the item-based similarity of item sets MIS₁ and MIS₂ is recorded as score₁ and is calculated as follows:

if MIS₁ and MIS₂ are identical, score₁ = 1;

if MIS₁ and MIS₂ are different and neither is an empty set,

if one of MIS₁ and MIS₂ is an empty set, score₁ = 0;

definition (support similarity score₂): the support-based similarity of the paired item sets MIS₁ and MIS₂ is recorded as score₂ and is calculated as follows: for a pair <MIS₁, MIS₂, score₁> in Pairs, where MIS₁ has support s₁ and MIS₂ has support s₂,

definition (composite similarity score): the composite similarity of item sets MIS₁ and MIS₂ is recorded as score; it is based mainly on the item similarity score₁ and is further refined by the support similarity score₂, and is calculated as follows: score = score₁ * score₂.

4. The method for evaluating the frequent item set mining availability of a data set according to claim 2, characterized in that, in the item set matching algorithm F, the distance dis in step (b) is defined as follows:

dis＝MAX(|MIS₁|,|MIS₂|)-|MIS₁∩MIS₂|。

5. The method for evaluating the frequent item set mining availability of a data set according to claim 2, characterized in that the item set matching algorithm F matches item sets with a nearest-first heuristic whose rule is to match preferentially two item sets that are close to each other, wherein k controls the heuristic rule: in the current matching round only pairs of item sets at distance k are considered, and through the iteration of k the closer item sets are matched first, that is, when matching at distance k is performed, all pairs of item sets at a distance less than k have already been matched; the nearest-match principle turns an unordered search into an ordered matching in which each round of computation can be reused without repetition; and in each k-distance search all item sets at distance k-1 are excluded and item sets at distance k+1 are outside the search range, so that the search space is effectively reduced and the repeated computation caused by matching against all the other item sets in every match is avoided.

6. The method for evaluating the frequent item set mining availability of a data set according to claim 2, characterized in that, in step (c), the candidate matching set is described as follows: the candidate matching set stores the item sets whose distance from the current item set MIS is k; storing every item set at distance k from the current item set MIS yields the candidate matching set, so that, while the nearest-match principle is preserved, the minimum influence matching strategy applied within the candidate matching set ensures the quality of the matching result; in addition, in each iteration the matched item sets are deleted from the candidate matching sets, and the changes in the item sets that each item set can still be matched with are recorded in real time, thereby avoiding repeated matching.

7. The method for evaluating the frequent item set mining availability of a data set according to claim 2, characterized in that the minimum influence matching strategy in step (c) is described as follows: if the PList of the current MIS₁ contains exactly one item set MIS₂, MIS₁ is paired with it directly; if it contains several item sets, one item set MIS₂ is chosen from PList for pairing, the selection condition being that, among the item sets in PList, MIS₂ has the fewest matchable item sets in its own candidate matching set, which ensures that each selected match has the least influence on the matching of the other item sets.