CN103838854A - Completely-weighted mode mining method for discovering association rules among texts - Google Patents

Completely-weighted mode mining method for discovering association rules among texts

Publication number
CN103838854A
Authority
CN
China
Prior art keywords
complete
awcpir
negative
item
weighted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410096985.2A
Other languages
Chinese (zh)
Other versions
CN103838854B (en)
Inventor
黄名选
元昌安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
GUANGXI COLLEGE OF EDUCATION
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGXI COLLEGE OF EDUCATION filed Critical GUANGXI COLLEGE OF EDUCATION
Priority to CN201410096985.2A priority Critical patent/CN103838854B/en
Publication of CN103838854A publication Critical patent/CN103838854A/en
Application granted granted Critical
Publication of CN103838854B publication Critical patent/CN103838854B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis

Abstract

The invention discloses a completely-weighted mode mining method for discovering association rules among texts. The completely-weighted data to be processed are preprocessed, and a completely-weighted database and an item library are established. Completely-weighted frequent itemsets and negative itemsets are mined, and interesting completely-weighted frequent itemsets and negative itemsets are obtained through pruning. Valid completely-weighted positive and negative association rules are then mined through a support-CPIR model-correlation-interestingness evaluation framework. The method overcomes defects of existing weighted mining techniques. It incorporates into the mining process the characteristic of completely-weighted data that item weights are objectively distributed in the database and vary with the transaction records, so that more realistic and reasonable completely-weighted positive and negative association patterns are obtained and invalid, uninteresting association patterns are avoided. The numbers of mined candidate itemsets, frequent itemsets, negative itemsets and positive and negative association rule patterns are all smaller than those mined by the prior art. Mining efficiency is greatly improved, and the method has good scalability.

Description

All-weighted pattern mining method for discovering association rules between text words
Technical field
The invention belongs to the field of data mining. Specifically, it is an all-weighted (completely weighted) positive and negative pattern mining method for discovering association rules between text words. It is applicable to fields such as feature-word association pattern discovery in text mining and query expansion in text information retrieval.
Background
Over the past 20 years, association rule mining has attracted great interest from many researchers and has become one of the focal points of data mining research. The research concentrates on two directions: mining based on item frequency and mining based on item weights.
Positive and negative association pattern mining based on item frequency treats all items in the database equally and uses the probability that an itemset occurs in the database as its support. Its defect is that it considers only item frequency and ignores item weights, which often increases the number of invalid, redundant and uninteresting association rules.
To overcome this defect, positive and negative association rule mining based on item weights has received attention. It introduces weights to reflect that items differ in importance and that an item can have different weights in the database. It divides into weighted positive and negative association rule mining and all-weighted positive and negative association rule mining. In weighted mining, the item weights reflect the different importance of the items. As research has deepened, weighted negative association rules have become increasingly important: alongside favorable factors, one also hopes to discover unfavorable factors, and the analysis of negative association rules serves this purpose. The defect of weighted association rule mining is that it ignores the case where an item has a different weight in each transaction record of the database. Data whose item weights are objectively distributed in the transaction records and vary from record to record are called all-weighted data. Existing weighted association rule mining methods are not suitable for all-weighted data. For this reason, all-weighted association rule mining has received attention since 2003, and all-weighted positive and negative association rule mining now has important theoretical and practical value in fields such as text mining and information retrieval. Existing all-weighted association rule mining methods effectively overcome the defect of weighted association rule mining, but they do not solve the problem of mining all-weighted negative association rules. To address these problems, the present invention further investigates all-weighted positive and negative association rule mining and proposes a new all-weighted positive and negative association rule mining method based on the intra-itemset weight ratio and dimension ratio. Applied to query expansion in text information retrieval, it can improve retrieval performance; applied to text mining, it can discover realistic and reasonable positive and negative feature-word association patterns.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing an all-weighted pattern mining method for discovering association rules between text words, enriching the results of association rule mining based on item weights and solving the technical problems in all-weighted positive and negative association rule mining. The method has important theoretical value and broad application prospects in fields such as text mining and text information retrieval.
The technical solution adopted to achieve this object is an all-weighted pattern mining method for discovering association rules between text words, comprising the following steps:
(1) All-weighted data preprocessing stage:
Massive amounts of all-weighted data exist in the real world, such as text data. The preprocessing method depends on the specific data. For Chinese text, it consists of word segmentation, stop-word removal, feature-word extraction and weight calculation. For English text, it consists of stemming, stop-word removal, lexical analysis, feature-word extraction and weight calculation. The result of preprocessing is an all-weighted database and an item library.
The feature-word weight for text data is computed as:
$$w_{ij} = \left(0.5 + 0.5 \times \frac{tf_{ij}}{\max_j(tf_{ij})}\right) \times idf_i$$
where w_ij is the weight of the i-th feature word in the j-th document, tf_ij is the term frequency of the i-th feature word in the j-th document, and idf_i is the inverse document frequency of the i-th feature word, with idf_i = log(N / df_i), where N is the total number of documents in the document collection and df_i is the number of documents containing the i-th feature word.
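The weight formula of step (1) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `feature_weights` and the sample documents are hypothetical, the logarithm is taken as the natural logarithm because the source does not state a base, and the maximum in the denominator is assumed to be the largest term frequency within the same document (the usual augmented-frequency normalization), although the notation max_j(tf_ij) could also be read as a maximum over documents.

```python
import math

def feature_weights(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists (already segmented, stop words removed).
    """
    n_docs = len(docs)
    df = {}  # df[t] = number of documents containing term t
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    weights = []
    for tokens in docs:
        tf = {}  # raw term frequency inside this document
        for term in tokens:
            tf[term] = tf.get(term, 0) + 1
        # Assumption: normalize by the largest term frequency in this document.
        max_tf = max(tf.values(), default=1)
        weights.append({
            term: (0.5 + 0.5 * count / max_tf) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical two-document collection.
docs = [["data", "data", "mining"], ["mining", "rule"]]
w = feature_weights(docs)
```

A term that occurs in every document gets idf = log(1) = 0 and therefore weight 0, which is why "mining" carries no weight in this toy collection.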
(2) All-weighted frequent itemset and negative itemset mining stage, comprising steps 2.1 and 2.2:
2.1 Extract the all-weighted candidate 1-itemsets awC_1 from the item library and mine the all-weighted frequent 1-itemsets awL_1, following steps 2.1.1 to 2.1.3:
2.1.1 Extract the all-weighted candidate 1-itemsets awC_1 from the item library;
2.1.2 Accumulate the weight sum of each all-weighted candidate 1-itemset awC_1 in the all-weighted database (All-Weighted Database, AWD) and compute its support;
The support of awC_1 is computed as:
$$\mathrm{awsup}(awC_1) = \frac{w_{awC_1}}{n \times k}$$
where w_{awC_1} (defined by equation image BDA0000477363740000022, which is not reproduced in the source text) is the sum of the weights of the items i_j in the transaction records T_i, n is the total number of transaction records in AWD, and k is the length of the itemset awC_1 (its number of items).
2.1.3 Add the all-weighted frequent 1-itemsets awL_1, that is, the candidate 1-itemsets awC_1 whose support is greater than or equal to the minimum support threshold minsup, to the frequent itemset collection awPIS;
2.2 Starting from the all-weighted candidate 2-itemsets, operate according to steps 2.2.1 to 2.2.4:
2.2.1 Apply the Apriori join to the all-weighted frequent (i-1)-itemsets awL_{i-1} to generate the all-weighted candidate i-itemsets awC_i, where i ≥ 2;
2.2.2 Accumulate the weight sum of each all-weighted candidate i-itemset awC_i in AWD and compute its support awsup(awC_i) as:
$$\mathrm{awsup}(awC_i) = \frac{w_{awC_i}}{n \times k}$$
where w_{awC_i} (defined by equation image BDA0000477363740000032, which is not reproduced in the source text) is the sum of the weights of the items i_j in the transaction records T_i, n is the total number of transaction records in AWD, and k is the length of the itemset awC_i.
2.2.3 From the all-weighted candidate i-itemsets awC_i, take the frequent i-itemsets awL_i whose support is not less than the support threshold minsup and store them in the all-weighted frequent itemset collection awPIS; at the same time, store the all-weighted negative i-itemsets awN_i, whose support is less than minsup, in the all-weighted negative itemset collection awNIS.
2.2.4 Increase i by 1. If the frequent (i-1)-itemset collection awL_{i-1} is empty (its length is 0), go to step (3); otherwise repeat steps 2.2.1 to 2.2.3;
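The level-wise loop of step (2) can be sketched as follows. This is a simplified sketch under stated assumptions, not the patent's code: the names `awsup` and `mine_itemsets` and the toy database are hypothetical, each transaction record is modeled as a dict from item to weight, and the weight sum of an itemset is assumed to be taken over the records that contain every item of the itemset (the defining equation image is not available in the source text). The join is a plain pairwise union of frequent (i-1)-itemsets rather than an optimized Apriori join.

```python
from itertools import combinations

def awsup(itemset, awd):
    """All-weighted support: weight sum / (n * k).

    Assumption: only records containing every item of the itemset contribute.
    """
    items = list(itemset)
    total = sum(
        sum(rec[i] for i in items)
        for rec in awd
        if all(rec.get(i, 0) > 0 for i in items)
    )
    return total / (len(awd) * len(items))

def mine_itemsets(awd, minsup):
    """Return (awPIS, awNIS): frequent itemsets and negative itemsets."""
    items = sorted({i for rec in awd for i in rec})
    # Step 2.1: frequent 1-itemsets.
    level = [frozenset([i]) for i in items if awsup([i], awd) >= minsup]
    aw_pis, aw_nis = list(level), []
    size = 2
    # Step 2.2: stop when the previous level is empty.
    while level:
        # Join pairs of frequent (size-1)-itemsets whose union has `size` items.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        level = []
        for cand in sorted(candidates, key=sorted):
            if awsup(cand, awd) >= minsup:
                level.append(cand)   # frequent itemset of this size
            else:
                aw_nis.append(cand)  # negative itemset of this size
        aw_pis.extend(level)
        size += 1
    return aw_pis, aw_nis

# Hypothetical all-weighted database with four transaction records.
awd = [{"a": 0.9}, {"a": 0.8, "b": 0.1}, {"b": 0.9}, {"b": 0.7}]
pis, nis = mine_itemsets(awd, minsup=0.2)
```

In this toy database both single items are frequent (support 0.425 each), while the pair {a, b} has support 0.9 / (4 × 2) = 0.1125 and is therefore stored as a negative itemset.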
(3) Pruning stage: obtain the interesting all-weighted frequent itemsets and negative itemsets by pruning.
3.1 For each frequent i-itemset awL_i in the frequent itemset collection awPIS, compute IAWFI(awL_i) and prune the frequent itemsets whose IAWFI(awL_i) value is false, which yields the interesting all-weighted frequent itemset collection awPIS. IAWFI(awL_i) is computed as:
[Equation image BDA0000477363740000033, defining IAWFI(awL_i), is not reproduced in the source text.]
where
$$\mathrm{awItemsetInt}(I_1 \cup I_2) = \mathrm{awsup}(I_1) \times \mathrm{awsup}(I_1 \cup I_2) \times (1 - \mathrm{awsup}(I_2))$$
$$\mathrm{awItemsetInt}(\neg I_1 \cup \neg I_2) = \mathrm{awsup}(I_2) \times (1 - \mathrm{awsup}(I_1)) \times (1 - \mathrm{awsup}(I_1) - \mathrm{awsup}(I_2) + \mathrm{awsup}(I_1 \cup I_2))$$
minInt is the minimum interestingness threshold and minsup is the minimum support threshold.
3.2 For each negative i-itemset awN_i in the negative itemset collection awNIS, compute IAWNI(awN_i) and prune the negative itemsets whose IAWNI(awN_i) value is false, which yields the interesting all-weighted negative itemset collection awNIS. The defining expression for IAWNI(awN_i) is missing from the source text; it uses the following quantities:
$$\mathrm{awItemsetInt}(I_1 \cup I_2) = \mathrm{awsup}(I_1) \times \mathrm{awsup}(I_1 \cup I_2) \times (1 - \mathrm{awsup}(I_2))$$
$$\mathrm{awItemsetInt}(I_1 \cup \neg I_2) = \mathrm{awsup}(I_1) \times \mathrm{awsup}(I_2) \times (\mathrm{awsup}(I_1) - \mathrm{awsup}(I_1 \cup I_2))$$
$$\mathrm{awItemsetInt}(\neg I_1 \cup I_2) = (1 - \mathrm{awsup}(I_1)) \times (1 - \mathrm{awsup}(I_2)) \times (\mathrm{awsup}(I_2) - \mathrm{awsup}(I_1 \cup I_2))$$
$$\mathrm{awItemsetInt}(\neg I_1 \cup \neg I_2) = \mathrm{awsup}(I_2) \times (1 - \mathrm{awsup}(I_1)) \times (1 - \mathrm{awsup}(I_1) - \mathrm{awsup}(I_2) + \mathrm{awsup}(I_1 \cup I_2))$$
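The four interest values used in the pruning stage depend only on the three supports awsup(I_1), awsup(I_2) and awsup(I_1 ∪ I_2). The sketch below computes them exactly as written in the formulas of step (3); the function name `itemset_interests`, the dictionary keys and the sample supports are hypothetical. The pruning predicates IAWFI and IAWNI themselves are not sketched, because their defining expressions are not available in the source text.

```python
def itemset_interests(s1, s2, s12):
    """Interest values of an itemset split into sub-itemsets I1 and I2.

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 ∪ I2).
    """
    return {
        "I1, I2": s1 * s12 * (1 - s2),
        "I1, not I2": s1 * s2 * (s1 - s12),
        "not I1, I2": (1 - s1) * (1 - s2) * (s2 - s12),
        "not I1, not I2": s2 * (1 - s1) * (1 - s1 - s2 + s12),
    }

# Hypothetical supports for illustration.
interests = itemset_interests(s1=0.5, s2=0.4, s12=0.3)
```

A pruning rule would then compare the relevant entries with the minimum interestingness threshold minInt.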
(4) Mine the valid all-weighted positive and negative association rules from the interesting all-weighted frequent itemset collection awPIS, comprising the following steps:
4.1 Take a frequent itemset awL_i from awPIS, find all proper subsets of awL_i, build the proper-subset collection of awL_i, and then carry out the following operations:
4.2.1 Take any two proper subsets I_1 and I_2 from the proper-subset collection of awL_i. When I_1 and I_2 are disjoint (I_1 ∩ I_2 = ∅), their item counts sum to the item count of the original frequent itemset (I_1 ∪ I_2 = awL_i), and both supports are not less than the support threshold (awsup(I_1) ≥ minsup, awsup(I_2) ≥ minsup), compute the intra-itemset weight ratio awIWR(I_1, I_2) and the dimension ratio awIDR(I_1, I_2) of the frequent itemset (I_1 ∪ I_2):
$$\mathrm{awIWR}(I_1, I_2) = \frac{w_{12}}{w_1 \times w_2}, \qquad \mathrm{awIDR}(I_1, I_2) = \frac{k_{12}}{k_1 \times k_2}$$
where w_12, w_1 and w_2 are the weight sums in AWD of the all-weighted itemset (I_1, I_2) and of its sub-itemsets I_1 and I_2 respectively, k_12, k_1 and k_2 are the item counts of (I_1, I_2), I_1 and I_2 respectively, and n is the total number of transaction records in the database.
4.2.2 When the product of the total number of transaction records n and the weight ratio from step 4.2.1 is greater than the dimension ratio (n × awIWR(I_1, I_2) > awIDR(I_1, I_2)), proceed as follows:
4.2.2.1 If the awCPIR value of I_1 → I_2 is not less than the confidence threshold (awCPIR(I_1 → I_2) ≥ minconf), output the all-weighted association rule I_1 → I_2; if the awCPIR value of I_2 → I_1 is not less than the confidence threshold (awCPIR(I_2 → I_1) ≥ minconf), output the all-weighted association rule I_2 → I_1. The two values are computed as:
$$\mathrm{awCPIR}(I_1 \to I_2) = \frac{\mathrm{awsup}(I_1 \cup I_2) - \mathrm{awsup}(I_1)\,\mathrm{awsup}(I_2)}{\mathrm{awsup}(I_1)\,(1 - \mathrm{awsup}(I_2))}$$
$$\mathrm{awCPIR}(I_2 \to I_1) = \frac{\mathrm{awsup}(I_1 \cup I_2) - \mathrm{awsup}(I_1)\,\mathrm{awsup}(I_2)}{\mathrm{awsup}(I_2)\,(1 - \mathrm{awsup}(I_1))}$$
4.2.2.2 If the support of (¬I_1 ∪ ¬I_2) is not less than the support threshold (awsup(¬I_1 ∪ ¬I_2) ≥ minsup), then: (a) if awCPIR(¬I_1 → ¬I_2) ≥ minconf, output the all-weighted negative association rule ¬I_1 → ¬I_2; (b) if awCPIR(¬I_2 → ¬I_1) ≥ minconf, output the all-weighted negative association rule ¬I_2 → ¬I_1. The support is computed as:
$$\mathrm{awsup}(\neg I_1 \to \neg I_2) = \mathrm{awsup}(\neg I_1 \cup \neg I_2) = 1 - \mathrm{awsup}(I_1) - \mathrm{awsup}(I_2) + \mathrm{awsup}(I_1 \cup I_2)$$
[Equation images BDA0000477363740000051 and BDA0000477363740000052, giving awCPIR(¬I_1 → ¬I_2) and awCPIR(¬I_2 → ¬I_1), are not reproduced in the source text.]
4.2.3 When the product of the total number of transaction records n and the weight ratio from step 4.2.1 is less than the dimension ratio (n × awIWR(I_1, I_2) < awIDR(I_1, I_2)), proceed as follows:
4.2.3.1 If the support of (I_1 ∪ ¬I_2) is not less than the support threshold (awsup(I_1 ∪ ¬I_2) ≥ minsup), then: (a) if awCPIR(I_1 → ¬I_2) ≥ minconf, output the all-weighted negative association rule I_1 → ¬I_2; (b) if awCPIR(¬I_2 → I_1) ≥ minconf, output the all-weighted negative association rule ¬I_2 → I_1. The support is computed as:
$$\mathrm{awsup}(I_1 \to \neg I_2) = \mathrm{awsup}(I_1 \cup \neg I_2) = \mathrm{awsup}(I_1) - \mathrm{awsup}(I_1 \cup I_2)$$
[Equation images BDA0000477363740000053 and BDA0000477363740000054, giving awCPIR(I_1 → ¬I_2) and awCPIR(¬I_2 → I_1), are not reproduced in the source text.]
4.2.3.2 If the support of (¬I_1 ∪ I_2) is not less than the support threshold (awsup(¬I_1 ∪ I_2) ≥ minsup), then: (a) if awCPIR(¬I_1 → I_2) ≥ minconf, output the all-weighted negative association rule ¬I_1 → I_2; (b) if awCPIR(I_2 → ¬I_1) ≥ minconf, output the all-weighted negative association rule I_2 → ¬I_1. The support is computed as:
$$\mathrm{awsup}(\neg I_1 \to I_2) = \mathrm{awsup}(\neg I_1 \cup I_2) = \mathrm{awsup}(I_2) - \mathrm{awsup}(I_1 \cup I_2)$$
[Equation images BDA0000477363740000055 and BDA0000477363740000056, giving awCPIR(¬I_1 → I_2) and awCPIR(I_2 → ¬I_1), are not reproduced in the source text.]
4.2.4 Repeat steps 4.2.1 to 4.2.3; once every proper subset in the proper-subset collection of awL_i has been taken out exactly once, go to step 4.2.5;
4.2.5 Return to step 4.1; once every frequent itemset awL_i in the interesting all-weighted frequent itemset collection awPIS has been taken out exactly once, go to step (5);
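The branching of steps 4.2.1 to 4.2.3 for a single pair (I_1, I_2) can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the names `cpir` and `mine_pair` and all input values are hypothetical. The positive-rule awCPIR follows the formulas of step 4.2.2.1. The awCPIR formulas for negative rules exist only as images in the source, so the sketch assumes they follow from the same ratio (joint support minus the product of the two supports, divided by the antecedent support times one minus the consequent support) with the negated supports substituted. Supports are assumed to lie strictly between 0 and 1 so that no denominator is zero.

```python
def cpir(joint, ante, cons):
    """Ratio used for awCPIR of a rule ante -> cons, written with supports."""
    return (joint - ante * cons) / (ante * (1 - cons))

def mine_pair(w1, k1, w2, k2, w12, n, minsup, minconf):
    """Return the rules produced for one pair of disjoint sub-itemsets.

    w1, w2, w12: weight sums of I1, I2 and I1 ∪ I2 in the database.
    k1, k2: item counts of I1 and I2; n: number of transaction records.
    """
    k12 = k1 + k2  # I1 and I2 are disjoint
    s1, s2, s12 = w1 / (n * k1), w2 / (n * k2), w12 / (n * k12)
    aw_iwr = w12 / (w1 * w2)   # intra-itemset weight ratio
    aw_idr = k12 / (k1 * k2)   # dimension ratio
    if n * aw_iwr > aw_idr:    # step 4.2.2
        nn = 1 - s1 - s2 + s12  # awsup(not I1 ∪ not I2)
        candidates = [
            ("I1 -> I2", True, cpir(s12, s1, s2)),
            ("I2 -> I1", True, cpir(s12, s2, s1)),
            # Assumed form for negated rules (source formulas are images).
            ("not I1 -> not I2", nn >= minsup, cpir(nn, 1 - s1, 1 - s2)),
            ("not I2 -> not I1", nn >= minsup, cpir(nn, 1 - s2, 1 - s1)),
        ]
    elif n * aw_iwr < aw_idr:  # step 4.2.3
        pn, np_ = s1 - s12, s2 - s12  # awsup(I1 ∪ not I2), awsup(not I1 ∪ I2)
        candidates = [
            ("I1 -> not I2", pn >= minsup, cpir(pn, s1, 1 - s2)),
            ("not I2 -> I1", pn >= minsup, cpir(pn, 1 - s2, s1)),
            ("not I1 -> I2", np_ >= minsup, cpir(np_, 1 - s1, s2)),
            ("I2 -> not I1", np_ >= minsup, cpir(np_, s2, 1 - s1)),
        ]
    else:
        candidates = []
    return [name for name, gate, value in candidates if gate and value >= minconf]

# Hypothetical inputs: n = 5 records, single-item I1 and I2.
positive_case = mine_pair(2.0, 1, 1.5, 1, 2.0, n=5, minsup=0.1, minconf=0.4)
negative_case = mine_pair(2.0, 1, 1.5, 1, 0.5, n=5, minsup=0.1, minconf=0.4)
```

In the first call the supports are 0.4, 0.3 and 0.2, so the pair is positively correlated and the rules I2 -> I1 and not I1 -> not I2 pass the confidence threshold. In the second call the joint support drops to 0.05, the correlation test flips, and only the rules I1 -> not I2 and I2 -> not I1 pass.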
(5) Mine the valid all-weighted negative association rules from the interesting all-weighted negative itemset collection awNIS, comprising the following steps:
5.1 Take a negative itemset awN_i from awNIS, find all proper subsets of awN_i, build the proper-subset collection of awN_i, and then carry out the following operations:
5.2.1 Take any two proper subsets I_1 and I_2 from the proper-subset collection of awN_i. When I_1 and I_2 are disjoint (I_1 ∩ I_2 = ∅), their item counts sum to the item count of the original negative itemset (I_1 ∪ I_2 = awN_i), and both supports are greater than or equal to the support threshold (awsup(I_1) ≥ minsup, awsup(I_2) ≥ minsup), compute the intra-itemset weight ratio awIWR(I_1, I_2) and the dimension ratio awIDR(I_1, I_2) of the negative itemset (I_1 ∪ I_2), using the formulas of step 4.2.1.
5.2.2 When the product of the total number of transaction records n and the weight ratio from step 5.2.1 is greater than the dimension ratio (n × awIWR(I_1, I_2) > awIDR(I_1, I_2)), proceed as follows:
5.2.2.1 If the support of (¬I_1 ∪ ¬I_2) is greater than or equal to the support threshold (awsup(¬I_1 ∪ ¬I_2) ≥ minsup), then: (a) if awCPIR(¬I_1 → ¬I_2) ≥ minconf, output the all-weighted negative association rule ¬I_1 → ¬I_2; (b) if awCPIR(¬I_2 → ¬I_1) ≥ minconf, output the all-weighted negative association rule ¬I_2 → ¬I_1. awsup(¬I_1 ∪ ¬I_2), awCPIR(¬I_1 → ¬I_2) and awCPIR(¬I_2 → ¬I_1) are computed with the formulas of step 4.2.2.2.
5.2.3 When the product of the total number of transaction records n and the weight ratio from step 5.2.1 is less than the dimension ratio (n × awIWR(I_1, I_2) < awIDR(I_1, I_2)), proceed as follows:
5.2.3.1 If the support of (I_1 ∪ ¬I_2) is greater than or equal to the support threshold (awsup(I_1 ∪ ¬I_2) ≥ minsup), then: (a) if awCPIR(I_1 → ¬I_2) ≥ minconf, output the all-weighted negative association rule I_1 → ¬I_2; (b) if awCPIR(¬I_2 → I_1) ≥ minconf, output the all-weighted negative association rule ¬I_2 → I_1. awsup(I_1 ∪ ¬I_2), awCPIR(I_1 → ¬I_2) and awCPIR(¬I_2 → I_1) are computed with the formulas of step 4.2.3.1;
5.2.3.2 If the support of (¬I_1 ∪ I_2) is greater than or equal to the support threshold (awsup(¬I_1 ∪ I_2) ≥ minsup), then: (a) if awCPIR(¬I_1 → I_2) ≥ minconf, output the all-weighted negative association rule ¬I_1 → I_2; (b) if awCPIR(I_2 → ¬I_1) ≥ minconf, output the all-weighted negative association rule I_2 → ¬I_1. awsup(¬I_1 ∪ I_2), awCPIR(¬I_1 → I_2) and awCPIR(I_2 → ¬I_1) are computed with the formulas of step 4.2.3.2;
5.2.4 Repeat steps 5.2.1 to 5.2.3; once every proper subset in the proper-subset collection of awN_i has been taken out exactly once, go to step 5.2.5;
5.2.5 Return to step 5.1; once every negative itemset awN_i in the interesting all-weighted negative itemset collection awNIS has been taken out exactly once, the mining ends.
This completes the all-weighted positive and negative association rule mining.
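The comparison between n × awIWR and awIDR used in steps 4.2.2, 4.2.3, 5.2.2 and 5.2.3 can be read as a correlation test on supports. The short derivation below is an added illustration, assuming the support definition awsup(I) = w_I / (n × k_I) from step (2) and positive weight sums w_1 and w_2:

$$
\begin{aligned}
\mathrm{awsup}(I_1 \cup I_2) > \mathrm{awsup}(I_1)\,\mathrm{awsup}(I_2)
&\iff \frac{w_{12}}{n\,k_{12}} > \frac{w_1}{n\,k_1} \cdot \frac{w_2}{n\,k_2} \\
&\iff n \times \frac{w_{12}}{w_1 \times w_2} > \frac{k_{12}}{k_1 \times k_2} \\
&\iff n \times \mathrm{awIWR}(I_1, I_2) > \mathrm{awIDR}(I_1, I_2)
\end{aligned}
$$

Under these assumptions, the "greater than" branch corresponds to I_1 and I_2 being positively correlated (joint support above the product of the individual supports), and the "less than" branch to their being negatively correlated, which matches the kinds of rules each branch outputs.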
Compared with the prior art, the present invention has the following beneficial effects:
(1) To address the defects of existing weighted positive and negative association rule mining, the invention constructs an evaluation framework for all-weighted positive and negative association patterns, namely support-CPIR model (Conditional Probability Increment Ratio)-correlation-interest (SCPIRCI), together with pruning strategies for frequent itemsets and negative itemsets. It proposes a new all-weighted positive and negative association rule mining method based on the SCPIRCI evaluation framework, which effectively solves the problem of mining all-weighted positive and negative association rules. The invention takes into account the all-weighted data characteristic that item weights vary with the database records and adopts a new itemset pruning strategy, so that mining time is significantly reduced and mining efficiency is greatly improved.
(2) It proposes the concepts of the intra-itemset weight ratio and the dimension ratio of all-weighted itemsets, enriching the theory of all-weighted data mining.
(3) Through a large number of rigorous experiments, the invention was compared with a traditional unweighted positive and negative association rule mining method. With the Chinese Web test collection CWT200g as the experimental document collection, the mining performance of the invention was analyzed with respect to changes in support, confidence, number of items and document collection size. The results show that, compared with the baseline method, the invention achieves good mining performance and greatly improved mining efficiency. Whether the support threshold or the confidence threshold is varied, the numbers of candidate itemsets, frequent itemsets, negative itemsets and positive and negative association rules mined by the invention are all much smaller than those mined by the baseline method. When the number of items and the size of the transaction document collection vary, the invention also shows good scalability. The main reasons are as follows. The baseline is an unweighted positive and negative association rule mining method based on item frequency. It does not consider itemset weights and does not fully reflect the inherent characteristics of all-weighted data, so it produces many invalid and spurious itemsets and positive and negative association rule patterns, which inflates the numbers of itemsets and rules and greatly lowers mining efficiency. The invention is an all-weighted positive and negative association rule mining method based on item weights. It overcomes the inherent defect of the baseline by incorporating the characteristic of the all-weighted data model (item weights are objectively distributed in the transaction records and vary from record to record) into the whole mining process, so that the mined association rules are more reasonable and closer to reality. At the same time, the new pruning strategy significantly reduces the number of invalid and uninteresting frequent itemsets and negative itemsets, effectively reduces the occurrence of uninteresting rules, and greatly improves mining efficiency.
Brief description of the drawings
Fig. 1 is a block diagram of the all-weighted pattern mining method for discovering association rules between text words according to the invention.
Fig. 2 is an overall flow diagram of the all-weighted pattern mining method for discovering association rules between text words according to the invention.
Fig. 3 compares the numbers of candidate itemsets mined under different support thresholds in Experiment 1.
Fig. 4 compares the numbers of frequent itemsets mined under different support thresholds in Experiment 1.
Fig. 5 compares the numbers of rules (A → B) mined under different support thresholds in Experiment 1.
Fig. 6 compares the numbers of negative rules (A → ¬B) mined under different support thresholds in Experiment 1.
Fig. 7 compares the numbers of negative rules (¬A → B) mined under different support thresholds in Experiment 1.
Fig. 8 compares the numbers of negative rules (¬A → ¬B) mined under different support thresholds in Experiment 1.
Fig. 9 shows how the numbers of candidate, frequent and negative itemsets change with the number of items in Experiment 2.
Fig. 10 shows how the number of positive and negative association rules changes with the number of items in Experiment 2.
Fig. 11 shows how the number of negative association rules changes with the number of items in Experiment 2.
Fig. 12 shows how the numbers of candidate, frequent and negative itemsets change with the document collection size in Experiment 2.
Fig. 13 shows how the number of negative association rules changes with the document collection size in Experiment 2.
Fig. 14 shows how the number of positive and negative association rules changes with the document collection size in Experiment 2.
Detailed description of the embodiments
To better explain the technical solution of the invention, the all-weighted data model and the related concepts involved in the invention are described below:
1. The difference between weighted association rule mining and all-weighted association rule mining
Weighted association rule mining and all-weighted association rule mining differ mainly in the source of their item weights and in the data model being mined. In the former, the item weights are set subjectively by the user, are independent of the transaction database, and remain constant throughout the mining process once set. For example, consider copy paper and fax machines in a shop. Copy paper is cheaper than a fax machine, and its per-unit profit is lower. To reflect the different contributions of the goods to profit, the user assigns a higher weight to the fax machine, which has the higher per-unit profit, and a relatively lower weight to copy paper. Once set, these weights stay fixed and are independent of the transaction database. In the latter, the item weights are not set by the user but derive from each transaction record of the transaction database and vary from record to record. For example, in a massive text database, the weight of each feature-word item derives from each document in the database and varies with the document: the same feature word has different weights in different documents.
The item-weighted data model and the all-weighted item data model are the data models for weighted association rule mining and all-weighted association rule mining respectively, and they are two entirely different kinds of data model, as shown in Table 1 and Table 2, where {i_1, i_2, ..., i_m} is the item set and {T_1, T_2, ..., T_n} is the transaction set. In the item-weighted data model, {w_1, w_2, ..., w_m} are the item weights; in the entry "1/0", "1" means that the item occurs in the transaction record and "0" means that it does not. In the all-weighted data model, the entry "w[T_i][i_j]/0 (1 ≤ i ≤ n, 1 ≤ j ≤ m)" denotes the weight of the item: if the item occurs in the transaction record, its weight is w[T_i][i_j]; otherwise it is 0.
Table 1: Item-weighted data model. Table 2: All-weighted item data model.
[Table image BDA0000477363740000091 is not reproduced in the source text.]
Example: Table 3 has 5 items and 5 transaction records, with item set {i_1, i_2, i_3, i_4, i_5} = {Apple, Orange, Banana, Milk, Coca-cola}. As Table 3 shows, i_1 does not appear in transaction record T_3. Table 4 is an all-weighted item data example with the same numbers of items and transaction records as Table 3. In Table 4, item i_1 has weights 0.85, 0.93, 0.65 and 0.75 in transaction records T_1, T_2, T_3 and T_5 respectively, and it does not appear in transaction record T_4, so its weight there is 0.
Table 3: Item-weighted data example. Table 4: All-weighted item data example.
[Table image BDA0000477363740000092 is not reproduced in the source text.]
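The two data models can be contrasted with a small Python sketch. Since the table images are not available, only the facts stated in the example are taken from the source: in the item-weighted example, Apple (i_1) is absent from T3, and in the all-weighted example, Apple has weights 0.85, 0.93, 0.65 and 0.75 in T1, T2, T3 and T5 and is absent from T4. The user-assigned item weights, the other occurrence flags, and the names `item_weights`, `occurrence`, `awd` and `weight_in` are hypothetical.

```python
# Item-weighted model: one user-assigned weight per item (fixed for all records)
# plus a 1/0 occurrence flag per transaction record.
item_weights = {"Apple": 0.6}  # hypothetical user-assigned weight
occurrence = {
    "T1": {"Apple": 1},  # hypothetical flag
    "T2": {"Apple": 1},  # hypothetical flag
    "T3": {"Apple": 0},  # stated in the example: i_1 does not appear in T3
    "T4": {"Apple": 1},  # hypothetical flag
    "T5": {"Apple": 1},  # hypothetical flag
}

# All-weighted model: the weight of an item is stored per transaction record
# and varies from record to record; a missing entry means weight 0.
awd = {
    "T1": {"Apple": 0.85},
    "T2": {"Apple": 0.93},
    "T3": {"Apple": 0.65},
    "T4": {},              # Apple does not appear in T4
    "T5": {"Apple": 0.75},
}

def weight_in(record_id, item):
    """Weight w[T_i][i_j] in the all-weighted model, 0 if the item is absent."""
    return awd[record_id].get(item, 0.0)

apple_weights = [weight_in(t, "Apple") for t in ("T1", "T2", "T3", "T4", "T5")]
```

The design point is that the item-weighted model needs one weight per item, while the all-weighted model needs one weight per (record, item) pair.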
2. Basic concepts of all-weighted data mining
Let the all-weighted database be AWD = {T_1, T_2, ..., T_n} with n transactions, where T_i (1 ≤ i ≤ n) is the i-th transaction in AWD. Let the itemset I = {i_1, i_2, ..., i_m} be the set of all items in AWD, with m items, where i_j (1 ≤ j ≤ m) is the j-th item in AWD, and let w[T_i][i_j] (1 ≤ i ≤ n, 1 ≤ j ≤ m) denote the weight of item i_j in transaction record T_i; see the all-weighted item data model in Table 2. Let I_1 and I_2 be sub-itemsets of I that satisfy the two conditions given by formula images BDA0000477363740000095 and BDA0000477363740000094, which are not reproduced in the source text. The following basic definitions are given:
Definition 1 (all-weighted support, awsup): The all-weighted support awsup(I) is computed as shown in formula (1).
$$\mathrm{awsup}(I) = \frac{W_I}{n \times k} \qquad (1)$$
where W_I is the weight sum of itemset I in AWD (its defining expression is missing from the source text), n is the total number of transaction records in AWD, and k is the length of itemset I (the number of items in I).
The supports of all-weighted negative itemsets and negative association rules are given in formulas (2) to (5).
awsup(﹁I) = 1 – awsup(I) (2)
awsup(I1 → ﹁I2) = awsup(I1 ∪ ﹁I2) = awsup(I1) – awsup(I1 ∪ I2) (3)
awsup(﹁I1 → I2) = awsup(﹁I1 ∪ I2) = awsup(I2) – awsup(I1 ∪ I2) (4)
awsup(﹁I1 → ﹁I2) = awsup(﹁I1 ∪ ﹁I2) = 1 – awsup(I1) – awsup(I2) + awsup(I1 ∪ I2) (5)
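Formulas (2) to (5) need only the three positive supports awsup(I1), awsup(I2) and awsup(I1 ∪ I2). The sketch below shows this under the assumption that those supports are already known; the function names and dictionary keys are hypothetical, and the sample values 0.636, 0.176 and 0.223 are the supports of (i1), (i2, i3) and (i1, i2, i3) from the worked embodiment in this description.

```python
def negated_support(s: float) -> float:
    """Formula (2): support of the negated itemset."""
    return 1 - s


def negative_supports(s1: float, s2: float, s12: float) -> dict:
    """Formulas (3) to (5), from awsup(I1), awsup(I2) and awsup(I1 ∪ I2)."""
    return {
        "I1 -> not I2": s1 - s12,               # formula (3)
        "not I1 -> I2": s2 - s12,               # formula (4)
        "not I1 -> not I2": 1 - s1 - s2 + s12,  # formula (5)
    }


neg = negative_supports(0.636, 0.176, 0.223)
```

With these inputs the third entry evaluates to about 0.411, the value of awsup(﹁I1 ∪ ﹁I2) quoted in the embodiment.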
Definition 2 (all-weighted frequent itemset and negative itemset): let the minimum support threshold be minsup. An all-weighted itemset I is called an all-weighted frequent itemset if awsup(I) ≥ minsup. An all-weighted itemset (I1 ∪ I2), where I1 and I2 are both frequent itemsets, is called an all-weighted negative itemset if awsup(I1 ∪ I2) < minsup.
Example: let minsup = 0.1. In the Table 4 data, awsup(i2) = (0.21+0.35+0.05)/(5×1) = 0.122 > minsup, awsup(i4) = 0.192 > minsup and awsup(i2 ∪ i4) = 0.06 < minsup, so the itemset (i2 ∪ i4) is an all-weighted negative itemset.
Definition 3 (all-weighted itemset interest: All-weighted Itemset Interest, awItemsetInt): interest measures how much attention a user pays to a mined association pattern; the higher its value, the more novel the pattern and the more attention the user pays to it. Based on the interest model defined for unweighted data mining (Cheng Jihua, Guo Jiansheng, Shi Pengfei. Research on multi-strategy methods for mining rules of interest [J]. Chinese Journal of Computers, 2000, 23(1): 47-51.), the all-weighted itemset interest (awItemsetInt) is computed as shown in formulas (6) to (9):
awItemsetInt(I1 ∪ I2) = awsup(I1) × awsup(I1 ∪ I2) × (1 – awsup(I2)) (6)
awItemsetInt(I1 ∪ ﹁I2) = awsup(I1) × awsup(I2) × (awsup(I1) – awsup(I1 ∪ I2)) (7)
awItemsetInt(﹁I1 ∪ I2) = (1 – awsup(I1)) × (1 – awsup(I2)) × (awsup(I2) – awsup(I1 ∪ I2)) (8)
awItemsetInt(﹁I1 ∪ ﹁I2) = awsup(I2) × (1 – awsup(I1)) × (1 – awsup(I1) – awsup(I2) + awsup(I1 ∪ I2)) (9)
Definition 4 (all-weighted CPIR value: All-weighted Conditional Probability Increment Ratio, awCPIR for short): the CPIR model uses the ratio of conditional probability to prior probability to express how much p(I2|I1) increases relative to p(I2). The literature gives its formula as CPIR(I2|I1) = (p(I2|I1) – p(I2))/(1 – p(I2)). Based on the CPIR formula and the needs of all-weighted data mining, the awCPIR of all-weighted positive and negative association rules is computed as shown in formulas (10) to (13):
$$\mathrm{awCPIR}(I_1 \rightarrow I_2) = \frac{\mathrm{awsup}(I_2 \cup I_1) - \mathrm{awsup}(I_1)\,\mathrm{awsup}(I_2)}{\mathrm{awsup}(I_1)\,(1 - \mathrm{awsup}(I_2))} \qquad (10)$$
Figure BDA0000477363740000104
Figure BDA0000477363740000111
The awCPIR value is used as the confidence of an all-weighted association rule: the larger its value, the more credible the rule and the more attention it deserves from the user.
Example: in the all-weighted data of Table 4, awsup(i1) = 0.636, awsup(﹁i1) = 1 – 0.636 = 0.364, awsup(i2) = 0.122, awsup(i1 ∪ i2) = 0.294, awCPIR(i1 → i2) = |0.294 – 0.636 × 0.122| / (0.636 × (1 – 0.122)) = 0.39, awCPIR(i1 → ﹁i2) = 2.79, awCPIR(﹁i1 → i2) = 0.68, awCPIR(﹁i1 → ﹁i2) = 4.86.
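The positive-rule case, formula (10), can be checked numerically with a short sketch. The function name `awcpir_positive` is hypothetical, and the sketch covers formula (10) only: formulas (11) to (13) are available here only as images, so the negative-rule variants are not reproduced.

```python
def awcpir_positive(s1: float, s2: float, s12: float) -> float:
    """Formula (10): awCPIR(I1 -> I2).

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 ∪ I2).
    """
    return (s12 - s1 * s2) / (s1 * (1 - s2))


# awCPIR(i1 -> i2) from the example above.
cpir_i1_i2 = awcpir_positive(0.636, 0.122, 0.294)
```

The value is about 0.3875, which rounds to the 0.39 stated in the example. Because the formula is not symmetric in I1 and I2, the two directions of a rule generally score differently.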
Definition 5 (all-weighted intra-itemset weight ratio: All-weighted Weight Ratio from Itemset, awIWR for short): let w12, w1 and w2 be the weight sums in the all-weighted database AWD of the all-weighted itemset (I1, I2) and of its sub-itemsets I1 and I2, respectively. The ratio of w12 to (w1 × w2) is called the intra-itemset weight ratio of the all-weighted itemset, or weight ratio for short (awIWR(I1, I2)), as shown in formula (14).
$$\mathrm{awIWR}(I_1, I_2) = \frac{w_{12}}{w_1 \times w_2} \qquad (14)$$
Definition 6 (all-weighted intra-itemset dimension ratio: All-weighted Dimension Ratio from Itemset, awIDR for short): let k12, k1 and k2 be the numbers of items of the itemset (I1, I2) and of its sub-itemsets I1 and I2, respectively. The ratio of k12 to (k1 × k2) is called the intra-itemset dimension ratio of the all-weighted itemset, or dimension ratio for short (awIDR(I1, I2)), as shown in formula (15).
$$\mathrm{awIDR}(I_1, I_2) = \frac{k_{12}}{k_1 \times k_2} \qquad (15)$$
Definition 7 (all-weighted itemset correlation: All-weighted itemset correlation, awISCorr for short): based on the traditional definition of itemset correlation (Chengqi Zhang, Shichao Zhang. Association rule mining: models and algorithms [M]. Springer-Verlag, Berlin, Heidelberg, 2002: 47-84, ISBN: 3-540-43533-6.), the correlation of the all-weighted itemset (I1, I2) (awISCorr(I1, I2),
Figure BDA0000477363740000115
) is computed as shown in formula (16).
$$\mathrm{awISCorr}(I_1, I_2) = \frac{\mathrm{awsup}(I_1 \cup I_2)}{\mathrm{awsup}(I_1) \times \mathrm{awsup}(I_2)} \qquad (16)$$
According to the properties of correlation, in the all-weighted data mining environment the correlation of the itemset (I1, I2) has the following properties:
Property 1:
Property 2:
Figure BDA0000477363740000117
Property 3:
Figure BDA0000477363740000118
Property 4:
Figure BDA00004773637400001110
② awISCorr(﹁I1, I2) < 1; ③ awISCorr(﹁I1, ﹁I2) > 1.
Property 5: ② awISCorr(﹁I1, I2) > 1; ③ awISCorr(﹁I1, ﹁I2) < 1.
Corollary: in the all-weighted data mining environment, for a given itemset (I1, I2): ① if n × awIWR(I1, I2) > awIDR(I1, I2), the all-weighted sub-itemsets I1 and I2 are positively correlated, and the all-weighted positive association rule I1 → I2 and negative association rule ﹁I1 → ﹁I2 can be mined; ② if n × awIWR(I1, I2) < awIDR(I1, I2), the all-weighted sub-itemsets I1 and I2 are negatively correlated, and the all-weighted negative association rules I1 → ﹁I2 and ﹁I1 → I2 can be mined;
Figure BDA0000477363740000122
Figure BDA0000477363740000121
According to this corollary, when mining all-weighted association rules it suffices to compute the intra-itemset weight ratio awIWR(I1, I2) and the dimension ratio awIDR(I1, I2). There is no need to compute the itemset correlation, and the all-weighted positive and negative association rules can be mined directly from the frequent itemsets and negative itemsets.
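The link between the corollary and formula (16) can be made explicit by substitution. The following is a derivation sketch, assuming that formula (1) gives awsup(I1 ∪ I2) = w12/(n × k12), awsup(I1) = w1/(n × k1) and awsup(I2) = w2/(n × k2), with the weight sums and item counts of Definitions 5 and 6:

```latex
\[
\mathrm{awISCorr}(I_1, I_2)
= \frac{w_{12} / (n\,k_{12})}{\bigl(w_1 / (n\,k_1)\bigr)\,\bigl(w_2 / (n\,k_2)\bigr)}
= n \cdot \frac{w_{12}}{w_1\,w_2} \cdot \frac{k_1\,k_2}{k_{12}}
= \frac{n \times \mathrm{awIWR}(I_1, I_2)}{\mathrm{awIDR}(I_1, I_2)}
\]
```

Since every quantity is positive, awISCorr(I1, I2) > 1 holds exactly when n × awIWR(I1, I2) > awIDR(I1, I2), and awISCorr(I1, I2) < 1 exactly when the inequality is reversed.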
Example: for (i1, i2, i3), let I1 = (i1, i2) and I2 = (i3). Then awIWR(I1, I2) = 3.34/(2.94 × 2.85) = 0.399, awIDR(I1, I2) = 3/(2 × 1) = 1.5, and n × awIWR(I1, I2) = 5 × 0.399 = 1.995 > 1.5 = awIDR(I1, I2). By the corollary, I1 and I2 are positively correlated, and the association rule I1 → I2 and the negative association rule ﹁I1 → ﹁I2 can be mined. Verification with formula (16): awsup(i1 ∪ i2) = 0.294, awsup(i3) = 0.57, awsup(i1 ∪ i2 ∪ i3) = 0.223, so awISCorr(I1, I2) = 0.223/(0.294 × 0.57) = 1.33 > 1. By Property 1 and Property 4, I1 and I2 are positively correlated, and the association rule I1 → I2 and the negative association rule ﹁I1 → ﹁I2 can be mined, so the two conclusions agree.
Similarly, for the all-weighted itemset (i2, i4), awIWR(i2, i4) = 0.102, awIDR(i2, i4) = 2, and n × awIWR(i2, i4) = 0.51 < 2 = awIDR(i2, i4). By the corollary, i2 and i4 are negatively correlated, and the rules i2 → ﹁i4 and ﹁i2 → i4 can be mined.
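The two examples above can be reproduced with a few lines of Python. This is an illustrative sketch: the function names are hypothetical, and returning "independent" when both sides are equal is an assumption, since the text here states only the two strict inequalities.

```python
def aw_iwr(w12: float, w1: float, w2: float) -> float:
    """Formula (14): intra-itemset weight ratio."""
    return w12 / (w1 * w2)


def aw_idr(k12: int, k1: int, k2: int) -> float:
    """Formula (15): intra-itemset dimension ratio."""
    return k12 / (k1 * k2)


def correlation_sign(n, w12, w1, w2, k12, k1, k2) -> str:
    """Apply the corollary: compare n * awIWR with awIDR."""
    lhs = n * aw_iwr(w12, w1, w2)
    rhs = aw_idr(k12, k1, k2)
    if lhs > rhs:
        return "positive"
    if lhs < rhs:
        return "negative"
    return "independent"  # assumption: equality case is not stated in the text


# I1 = (i1, i2), I2 = (i3): weight sums 3.34, 2.94, 2.85; lengths 3, 2, 1.
sign_123 = correlation_sign(5, 3.34, 2.94, 2.85, 3, 2, 1)
# I1 = (i2), I2 = (i4): weight sums 0.06, 0.61, 0.96; lengths 2, 1, 1.
sign_24 = correlation_sign(5, 0.06, 0.61, 0.96, 2, 1, 1)
```

The first call classifies the pair as positively correlated and the second as negatively correlated, matching the conclusions of the two examples without evaluating formula (16).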
Definition 8 (valid all-weighted positive and negative association rules): let minconf be the minimum confidence threshold. If the all-weighted itemsets I1 and I2 satisfy the following 3 conditions, the association rules I1 → I2, ﹁I1 → ﹁I2, I1 → ﹁I2 and ﹁I1 → I2 are called valid all-weighted positive and negative association rules: ① I1 and I2 are all-weighted frequent itemsets and I1 ∩ I2 = φ; ② the supports of I1 → I2, ﹁I1 → ﹁I2, I1 → ﹁I2 and ﹁I1 → I2 are greater than or equal to minsup; ③ the awCPIR values of I1 → I2, ﹁I1 → ﹁I2, I1 → ﹁I2 and ﹁I1 → I2 are not less than minconf.
Example: suppose minsup = 0.1 and minconf = 0.3. From the example above, the supports of the all-weighted itemsets (i1, i2), (i3) and (i1, i2, i3) are all greater than minsup, and (i1, i2) and (i3) are positively correlated. Moreover, awCPIR((i1, i2) → (i3)) = |0.223 – 0.294 × 0.57| / (0.294 × (1 – 0.57)) = 0.438 > minconf and awCPIR(﹁(i1, i2) → ﹁(i3)) = 0.138 < minconf, so by Property 4 and Definition 8, (i1, i2) → (i3) is a valid all-weighted positive association rule, whereas the negative rule ﹁(i1, i2) → ﹁(i3) is not valid. Similarly, for the all-weighted itemset (i2, i4): awsup(i2) = 0.122 > minsup, awsup(i4) = 0.192 > minsup, awsup(i2 ∪ ﹁i4) = 0.062 < minsup, awsup(﹁i2 ∪ i4) = 0.132 > minsup and awCPIR(﹁i2 → i4) = 0.052 < minconf, so by Definition 8 the negative association rules i2 → ﹁i4 and ﹁i2 → i4 are not valid all-weighted negative association rules.
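For a positive rule I1 → I2, the three conditions of Definition 8 reduce to a pair of threshold checks. The sketch below illustrates this for the positive case only, using formula (10); the function name is hypothetical, disjointness of I1 and I2 is assumed to have been checked by the caller, and the negative-rule cases are omitted because formulas (11) to (13) are not reproduced here.

```python
def is_valid_positive_rule(s1: float, s2: float, s12: float,
                           minsup: float, minconf: float) -> bool:
    """Check Definition 8 for I1 -> I2, with I1 and I2 assumed disjoint.

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 ∪ I2).
    """
    # Conditions 1 and 2: both sides frequent and the rule support >= minsup.
    if s1 < minsup or s2 < minsup or s12 < minsup:
        return False
    # Condition 3: awCPIR per formula (10) must reach minconf.
    awcpir = (s12 - s1 * s2) / (s1 * (1 - s2))
    return awcpir >= minconf


# (i1, i2) -> (i3) with the supports from the example above.
valid_at_03 = is_valid_positive_rule(0.294, 0.57, 0.223, minsup=0.1, minconf=0.3)
valid_at_05 = is_valid_positive_rule(0.294, 0.57, 0.223, minsup=0.1, minconf=0.5)
```

With minconf = 0.3 the rule passes, as in the example (awCPIR is about 0.438); raising minconf to 0.5 rejects it.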
The technical scheme of the present invention is further described below through a specific embodiment.
The process by which the present invention mines all-weighted association rules from the all-weighted data example of Table 4 is as follows (where minsup = 0.1, minInt = 0.1, minconf = 0.4, w denotes the itemset weight and s denotes the itemset support):
Step 1: awPIS = {φ}; awNIS = {φ};
Step 2: C1 = {(i1): w = 3.18, s = 0.636; (i2): w = 0.61, s = 0.122; (i3): w = 2.85, s = 0.57; (i4): w = 0.96, s = 0.192; (i5): w = 0.92, s = 0.184} ⇒ L1 = {(i1), (i2), (i3), (i4), (i5)} ⇒ awPIS = {L1}.
Step 3: ① C2 = {(i1, i2): w = 2.94, s = 0.294; (i1, i3): w = 4.43, s = 0.443; (i1, i4): w = 0.76, s = 0.076; (i1, i5): w = 2.52, s = 0.192; (i2, i3): w = 1.76, s = 0.176; (i2, i4): w = 0.06, s = 0.006; (i2, i5): w = 0.95, s = 0.095; (i3, i4): w = 1.8, s = 0.18; (i3, i5): w = 0.82, s = 0.082; (i4, i5): w = 0.91, s = 0.091} ⇒ L2 = {(i1, i2), (i1, i3), (i1, i5), (i2, i3), (i3, i4)}, N2 = {(i1, i4), (i2, i4), (i2, i5), (i3, i5), (i4, i5)} ⇒ awPIS = {L1 ∪ L2}, awNIS = {N2};
C3 = {(i1, i2, i3): w = 3.34, s = 0.223; (i1, i2, i5): w = 1.7, s = 0.113; (i1, i3, i5): w = 1.67, s = 0.111} ⇒ L3 = {(i1, i2, i3), (i1, i2, i5), (i1, i3, i5)}, N3 = {φ} ⇒ awPIS = {L1 ∪ L2 ∪ L3}, awNIS = {N2 ∪ N3};
C4 = {(i1, i2, i3, i5): w = 0, s = 0} ⇒ L4 = {φ}.
Step 4: Pruning: prune the itemsets in the frequent itemset collection awPIS. The pruned frequent itemsets are (i2, i3), (i3, i4), (i1, i2, i5) and (i1, i3, i5); after pruning, awPIS = {(i1, i2), (i1, i3), (i1, i5), (i1, i2, i3)}.
Step 5: Similarly, in the negative itemset collection awNIS, the pruned negative itemset is (i3, i5); after pruning, awNIS = {(i1, i4), (i2, i4), (i2, i5), (i4, i5)}.
Step 6: Mine the all-weighted positive and negative association rules from the frequent itemset collection awPIS and the negative itemset collection awNIS. Taking the frequent itemset (i1, i2, i3) and the negative itemset (i4, i5) as examples, the mining process is as follows:
For the frequent itemset (i1, i2, i3), take its subsets I1 = (i1) and I2 = (i2, i3) as an example. From the examples above, awsup(i1) and awsup(i2, i3) are both greater than minsup, awIDR(I1, I2) = 1.5, n × awIWR(I1, I2) = 2.98 > awIDR(I1, I2), awsup(I1 ∪ I2) = 0.223 > minsup, awCPIR(I1 → I2) = 0.212 < minconf and awCPIR(I2 → I1) = 1.73 > minconf; awsup(﹁I1 ∪ ﹁I2) = 0.411 > minsup, awCPIR(﹁I1 → ﹁I2) = 1.73 > minconf and awCPIR(﹁I2 → ﹁I1) = 0.212 < minconf. Therefore I2 → I1 and ﹁I1 → ﹁I2 (that is, (i2, i3) → (i1) and ﹁(i1) → ﹁(i2, i3)) are valid all-weighted positive and negative association rules.
For the negative itemset (i4, i5), with subsets I1 = (i4) and I2 = (i5): from the examples above, awsup(i4) and awsup(i5) are both greater than minsup, awIDR(I1, I2) = 2, n × awIWR(I1, I2) = 1.03 < awIDR(I1, I2), awsup(I1 ∪ ﹁I2) = 0.101 > minsup, awsup(﹁I1 ∪ I2) = 0.093 < minsup, awCPIR(I1 → ﹁I2) = 1.577 > minconf and awCPIR(﹁I2 → I1) = 0.084 < minconf. Therefore I1 → ﹁I2 (that is, (i4) → ﹁(i5)) is a valid all-weighted negative association rule.
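The level-wise search of Steps 1 to 3 can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: the function names and the four-transaction database are hypothetical, the itemset weight is assumed to be the sum of its items' weights over the transactions containing all of them, and the pruning of Steps 4 and 5 (IAWFI and IAWNI) and the rule generation of Step 6 are left out.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Transaction = Dict[str, float]
Itemset = Tuple[str, ...]


def awsup(db: List[Transaction], itemset: Itemset) -> float:
    """Formula (1); only transactions holding every item contribute."""
    total = sum(sum(t[i] for i in itemset) for t in db
                if all(t.get(i, 0.0) > 0 for i in itemset))
    return total / (len(db) * len(itemset))


def apriori_join(prev: List[Itemset]) -> List[Itemset]:
    """Join sorted (k-1)-itemsets that share their first k-2 items."""
    prev = sorted(prev)
    return [a + (b[-1],) for a, b in combinations(prev, 2) if a[:-1] == b[:-1]]


def mine_itemsets(db: List[Transaction], minsup: float):
    """Return (frequent itemsets, negative itemsets) found level by level."""
    candidates = sorted({(i,) for t in db for i in t})
    frequent, negative = [], []
    size = 1
    while candidates:
        level_frequent = [c for c in candidates if awsup(db, c) >= minsup]
        if size >= 2:  # negative itemsets start at the 2-itemset level
            negative += [c for c in candidates if c not in level_frequent]
        frequent += level_frequent
        candidates = apriori_join(level_frequent)
        size += 1
    return frequent, negative


# Hypothetical all-weighted database with items a, b, c.
db = [{"a": 0.8, "b": 0.6}, {"a": 0.7, "b": 0.5, "c": 0.6},
      {"a": 0.9, "c": 0.5}, {"b": 0.4}]
frequent, negative = mine_itemsets(db, minsup=0.2)
```

On this toy database the frequent itemsets are (a), (b), (c), (a, b) and (a, c), while (b, c) and (a, b, c) fall below minsup and are kept as negative itemsets. Applied to L2 of Step 3, `apriori_join` yields the same three candidates listed in C3.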
The beneficial effects of the present invention are further described below through experiments.
To verify the validity, correctness and scalability of the present invention, part of the corpus of the Chinese Web test collection CWT200g (Chinese Web Test Collection with 200 GB of web pages), provided by the Network Laboratory of Peking University, was selected as the experimental test data. The experiments ran on an Intel(R) Core(TM) i7-3770 CPU @ 3.4 GHz with 4.0 GB of memory under Windows 7; the method was implemented in Delphi 2006 with SQL Server 2008 as the database system. A typical unweighted positive and negative association rule mining method (Xindong Wu, Chengqi Zhang, and Shichao Zhang, Efficient Mining of Both Positive and Negative Association Rules, ACM Transactions on Information Systems, 22 (2004), 3: 381-405.), denoted PNAR-Mining, was chosen as the comparison method.
The Chinese Web test collection CWT200g has a capacity of 197 GB and contains 37,482,913 web pages, each compressed and stored in the Tianwang storage format. 12,024 plain-text documents were extracted from CWT200g as the experimental document test set. The documents were segmented with the Chinese lexical analysis system ICTCLAS (developed by the Institute of Computing Technology, Chinese Academy of Sciences). The feature-word weight w_ij is computed as w_ij = (0.5 + 0.5 × tf_ij / max_j(tf_ij)) × idf_i. The test documents were preprocessed as follows: word segmentation, stop-word removal, feature-word extraction and weight computation, followed by construction of a text database and feature dictionary based on the vector space model. Preprocessing the document test set yielded 8,751 feature words, whose document frequency df (the number of documents containing the feature word) ranges from 51 to 11,258. As required for mining, feature words with low or high df values were removed, and the feature words with df between 1,500 and 5,838 (400 feature words in total) were extracted to build the feature-word item library. The feature words occur 1,019,494 times in total in the 12,024 test documents, on average 85 times per document. The experimental parameters are shown in Table 5.
Table 5: Experimental parameters
Figure 2014100969852100002DEST_PATH_IMAGE001
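The feature-word weight formula used in preprocessing, w_ij = (0.5 + 0.5 × tf_ij / max_j(tf_ij)) × idf_i, can be written as a one-line function. In this sketch the function and parameter names are hypothetical, and idf is passed in as a precomputed value because the description does not state how it is calculated.

```python
def feature_weight(tf: int, max_tf: int, idf: float) -> float:
    """Augmented term frequency scaled by an inverse document frequency.

    tf     -- frequency of the feature word in the document
    max_tf -- the maximum term frequency used for normalisation
    idf    -- inverse document frequency of the word (assumed precomputed)
    """
    return (0.5 + 0.5 * tf / max_tf) * idf


# A word occurring 3 times where the maximum frequency is 6, with idf 2.0.
w = feature_weight(3, 6, 2.0)  # (0.5 + 0.25) * 2.0
```

The 0.5 offset keeps the term-frequency factor between 0.5 and 1, so a feature word that occurs at all never receives a weight below half of its idf.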
Experiment 1: Comparison of mining performance under varying support thresholds
Under different support thresholds, Figures 3 to 8 compare the numbers of itemsets (candidate itemsets (Candidate Itemset, CI), frequent itemsets (Frequent Itemset, FI) and negative itemsets (Negative Itemset, NI)) and of positive and negative association rules (Positive and Negative Association Rule, PNAR) mined from the experimental document test set by the method of the present invention, AWPNAR-Mining, and by the comparison method PNAR-Mining (ItemNum=50, minconf=0.0002, minInt=0.0002, TRecordNum=12024).
Experiment 2: Comparison of mining performance under varying confidence thresholds
Under varying confidence thresholds, Table 6 compares the numbers of positive and negative association rules (A → B, A → ﹁B, ﹁A → B and ﹁A → ﹁B) mined from the experimental document test set by the method of the present invention, AWPNAR-Mining, and by the comparison method PNAR-Mining (minsup=0.03, minInt=0.0002, ItemNum=50, TRecordNum=12024).
Table 6: Numbers of positive and negative association rules mined under different confidence thresholds
Experiment 3: Comparison of mining time efficiency
To compare the time efficiency of the two methods, the mining times of the method of the present invention, AWPNAR-Mining, and of the comparison method PNAR-Mining were measured under varying support thresholds and under varying confidence thresholds. The results are shown in Table 7 and Table 8 (minInt=0.0002, ItemNum=50, TRecordNum=12024). Table 7 compares the time the two methods take to mine itemsets and association rules from the experimental document test set under varying support thresholds (minconf=0.0002), and Table 8 compares the time they take to mine positive and negative association rules under varying confidence thresholds (minsup=0.03).
Table 7: Time (unit: seconds) to mine itemsets and association rules under different support thresholds
Figure BDA0000477363740000152
Table 8: Time (unit: seconds) to mine positive and negative association rules under different confidence thresholds
Figure BDA0000477363740000161
Experiment 4: Scalability analysis
The scalability of the method of the present invention was tested and analysed in two settings: a varying number of items and a varying size of the test data set.
To test the scalability of the present invention, the experimental parameters were set to ItemNum=50, TRecordNum=12024, minsup=0.05, minconf=0.07 and minInt=0.001. Figures 9 to 14 show how the numbers of frequent itemsets (FI), negative itemsets (NI) and positive and negative association rules (PNAR) mined from the test data set by the AWPNAR-Mining method of the present invention change as the number of items and the size of the test data set vary, respectively.
In summary, the experimental results show that, compared with the comparison method PNAR-Mining, the AWPNAR-Mining method of the present invention achieves good mining performance and substantially better mining efficiency. Under both varying support thresholds and varying confidence thresholds, the numbers of candidate itemsets, frequent itemsets, negative itemsets and positive and negative association rules mined by the present invention are all far smaller than those of the comparison method.

Claims (2)

1. An all-weighted pattern mining method for discovering association rules between text words, characterized by comprising the following steps:
(1) All-weighted data preprocessing stage: preprocess the all-weighted data to be processed, and build the all-weighted database and the item library;
(2) All-weighted frequent itemset and negative itemset mining stage, comprising the following steps 2.1 and 2.2:
2.1 Extract the all-weighted candidate 1-itemsets from the item library and mine the all-weighted frequent 1-itemsets, according to steps 2.1.1 to 2.1.3:
2.1.1 Extract the all-weighted candidate 1-itemsets from the item library;
2.1.2 Accumulate the weight sum of each all-weighted candidate 1-itemset in the all-weighted database and compute its support;
2.1.3 Add the all-weighted frequent 1-itemsets, that is, the all-weighted candidate 1-itemsets whose support is greater than or equal to the minimum support threshold, to the all-weighted frequent itemset collection;
2.2 Starting from the all-weighted candidate 2-itemsets, operate according to steps 2.2.1 to 2.2.4:
2.2.1 Perform the Apriori join on the all-weighted frequent (i-1)-itemsets to generate the all-weighted candidate i-itemsets, where i ≥ 2;
2.2.2 Accumulate the weight sum of each all-weighted candidate i-itemset in the all-weighted database and compute its support;
2.2.3 From the all-weighted candidate i-itemsets, take the frequent i-itemsets whose support is not less than the support threshold and store them in the all-weighted frequent itemset collection; at the same time, store the all-weighted negative i-itemsets whose support is less than the support threshold in the all-weighted negative itemset collection;
2.2.4 Increase i by 1; if the frequent (i-1)-itemset collection is empty, go to step (3); otherwise, repeat steps 2.2.1 to 2.2.3;
(3) Pruning stage: obtain the interesting all-weighted frequent itemsets and negative itemsets by pruning:
3.1 For each frequent i-itemset awLi in the frequent itemset collection, compute the value of IAWFI(awLi) and prune the frequent itemsets whose IAWFI(awLi) value is false; pruning yields the collection of interesting all-weighted frequent itemsets;
3.2 For each negative i-itemset awNi in the all-weighted negative itemset collection, compute the value of IAWNI(awNi) and prune the negative itemsets whose IAWNI(awNi) value is false; pruning yields the collection of interesting all-weighted negative itemsets;
(4) Mine the valid all-weighted positive and negative association rules from the collection of interesting all-weighted frequent itemsets, comprising the following steps:
4.1 Take a frequent itemset awLi from the collection of interesting all-weighted frequent itemsets, obtain all proper subsets of awLi and build the proper-subset collection of awLi, then perform the following operations:
4.2.1 Take any two proper subsets I1 and I2 from the proper-subset collection of awLi; when the intersection of I1 and I2 is empty, the numbers of items of I1 and I2 sum to the number of items of the original frequent itemset, and the supports of I1 and I2 are both not less than the support threshold, compute the intra-itemset weight ratio awIWR(I1, I2) and the dimension ratio awIDR(I1, I2) of the frequent itemset (I1 ∪ I2);
4.2.2 When the product of the total number of transaction records n in the database and the weight ratio awIWR(I1, I2) of step 4.2.1 is greater than the dimension ratio awIDR(I1, I2), that is, n × awIWR(I1, I2) > awIDR(I1, I2), proceed as follows:
4.2.2.1 If the awCPIR value of I1 → I2 (awCPIR(I1 → I2)) is not less than the confidence threshold minconf, mine the all-weighted association rule I1 → I2; if the awCPIR value of I2 → I1 (awCPIR(I2 → I1)) is not less than the confidence threshold minconf, mine the all-weighted association rule I2 → I1;
4.2.2.2 If the support of (﹁I1 ∪ ﹁I2) is not less than the support threshold minsup, then: ① if the awCPIR value of ﹁I1 → ﹁I2 (awCPIR(﹁I1 → ﹁I2)) is not less than the confidence threshold minconf, mine the all-weighted negative association rule ﹁I1 → ﹁I2; ② if the awCPIR value of ﹁I2 → ﹁I1 (awCPIR(﹁I2 → ﹁I1)) is not less than the confidence threshold minconf, mine the all-weighted negative association rule ﹁I2 → ﹁I1;
4.2.3 When the product of the total number of transaction records n in the database and the weight ratio awIWR(I1, I2) of step 4.2.1 is less than the dimension ratio awIDR(I1, I2), that is, n × awIWR(I1, I2) < awIDR(I1, I2), proceed as follows:
4.2.3.1 If the support of (I1 ∪ ﹁I2) is not less than the support threshold minsup, then: ① if the awCPIR value of I1 → ﹁I2 (awCPIR(I1 → ﹁I2)) is not less than the confidence threshold minconf, mine the all-weighted negative association rule I1 → ﹁I2; ② if the awCPIR value of ﹁I2 → I1 (awCPIR(﹁I2 → I1)) is not less than the confidence threshold minconf, mine the all-weighted negative association rule ﹁I2 → I1;
4.2.3.2 If the support of (﹁I1 ∪ I2) is not less than the support threshold minsup, then: ① if the awCPIR value of ﹁I1 → I2 (awCPIR(﹁I1 → I2)) is not less than the confidence threshold minconf, mine the all-weighted negative association rule ﹁I1 → I2; ② if the awCPIR value of I2 → ﹁I1 (awCPIR(I2 → ﹁I1)) is not less than the confidence threshold minconf, mine the all-weighted negative association rule I2 → ﹁I1;
4.2.4 Repeat steps 4.2.1 to 4.2.3; once every proper subset in the proper-subset collection of awLi has been taken exactly once, go to step 4.2.5;
4.2.5 Repeat step 4.1; once every frequent itemset awLi in the collection of interesting all-weighted frequent itemsets has been taken exactly once, go to step (5);
(5) Mine the valid all-weighted negative association rules from the collection of interesting all-weighted negative itemsets, comprising the following steps:
5.1 Take a negative itemset awNi from the collection of interesting all-weighted negative itemsets, obtain all proper subsets of awNi and build the proper-subset collection of awNi, then perform the following operations:
5.2.1 Take any two proper subsets I1 and I2 from the proper-subset collection of awNi; when the intersection of I1 and I2 is empty, the numbers of items of I1 and I2 sum to the number of items of the original itemset, and the supports of I1 and I2 are both greater than or equal to the support threshold, compute the intra-itemset weight ratio awIWR(I1, I2) and the dimension ratio awIDR(I1, I2) of the negative itemset (I1 ∪ I2);
5.2.2 When the product of the total number of transaction records n in the database and the weight ratio awIWR(I1, I2) of step 5.2.1 is greater than the dimension ratio awIDR(I1, I2), that is, n × awIWR(I1, I2) > awIDR(I1, I2), proceed as follows:
5.2.2.1 If the support of (﹁I1 ∪ ﹁I2) is greater than or equal to the support threshold minsup, then: ① if the awCPIR value of ﹁I1 → ﹁I2 (awCPIR(﹁I1 → ﹁I2)) is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule ﹁I1 → ﹁I2; ② if the awCPIR value of ﹁I2 → ﹁I1 (awCPIR(﹁I2 → ﹁I1)) is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule ﹁I2 → ﹁I1;
5.2.3 When the product of the total number of transaction records n in the database and the weight ratio awIWR(I1, I2) of step 5.2.1 is less than the dimension ratio awIDR(I1, I2), that is, n × awIWR(I1, I2) < awIDR(I1, I2), proceed as follows:
5.2.3.1 If the support of (I1 ∪ ﹁I2) is greater than or equal to the support threshold minsup, then: ① if the awCPIR value of I1 → ﹁I2 (awCPIR(I1 → ﹁I2)) is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule I1 → ﹁I2; ② if the awCPIR value of ﹁I2 → I1 (awCPIR(﹁I2 → I1)) is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule ﹁I2 → I1;
5.2.3.2 If the support of (﹁I1 ∪ I2) is greater than or equal to the support threshold minsup, then: ① if the awCPIR value of ﹁I1 → I2 (awCPIR(﹁I1 → I2)) is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule ﹁I1 → I2; ② if the awCPIR value of I2 → ﹁I1 (awCPIR(I2 → ﹁I1)) is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule I2 → ﹁I1;
5.2.4 Repeat steps 5.2.1 to 5.2.3; once every proper subset in the proper-subset collection of awNi has been taken exactly once, go to step 5.2.5;
5.2.5 Repeat step 5.1; once every negative itemset awNi in the collection of interesting all-weighted negative itemsets has been taken exactly once, the mining of all-weighted positive and negative association rules ends;
In symbols such as "﹁I1, ﹁I2, I1 ∪ ﹁I2, I1 → ﹁I2", "﹁" is the negation symbol: ﹁I1 denotes the event that I1 does not occur in a transaction and is called the negative itemset of I1; (I1 ∪ ﹁I2) denotes an itemset with the sub-itemset I1 and the negative sub-itemset ﹁I2; the association rule I1 → ﹁I2 means that if the sub-itemset I1 occurs, then the sub-itemset I2 does not occur.
2. The all-weighted pattern mining method for discovering association rules between text words according to claim 1, characterized in that the preprocessing of the all-weighted data to be processed comprises: when the data are Chinese text, performing word segmentation, removing stop words, extracting feature words and computing their weights; when the data are English text, performing stemming, removing stop words, performing lexical analysis, extracting feature words and computing their weights.
CN201410096985.2A 2014-03-14 2014-03-14 Completely-weighted mode mining method for discovering association rules among texts Expired - Fee Related CN103838854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410096985.2A CN103838854B (en) 2014-03-14 2014-03-14 Completely-weighted mode mining method for discovering association rules among texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410096985.2A CN103838854B (en) 2014-03-14 2014-03-14 Completely-weighted mode mining method for discovering association rules among texts

Publications (2)

Publication Number Publication Date
CN103838854A true CN103838854A (en) 2014-06-04
CN103838854B CN103838854B (en) 2017-03-22

Family

ID=50802351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410096985.2A Expired - Fee Related CN103838854B (en) 2014-03-14 2014-03-14 Completely-weighted mode mining method for discovering association rules among texts

Country Status (1)

Country Link
CN (1) CN103838854B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104182527A (en) * 2014-08-27 2014-12-03 广西教育学院 Partial-sequence itemset based Chinese-English test word association rule mining method and system
CN104217013A (en) * 2014-09-22 2014-12-17 广西教育学院 Course positive and negative mode excavation method and system based on item weighing and item set association degree
CN104239430A (en) * 2014-08-27 2014-12-24 广西教育学院 Item weight change based method and system for mining education data association rules
CN104239536A (en) * 2014-09-22 2014-12-24 广西教育学院 Completely-weighted course positive and negative association pattern mining method and system based on mutual information
CN109471885A (en) * 2018-09-30 2019-03-15 齐鲁工业大学 Based on the data analysing method and system for weighting positive and negative sequence pattern

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5809499A (en) * 1995-10-20 1998-09-15 Pattern Discovery Software Systems, Ltd. Computational method for discovering patterns in data sets
CN101650730A (en) * 2009-09-08 2010-02-17 中国科学院计算技术研究所 Method and system for discovering weighted-value frequent-item in data flow
CN102306183A (en) * 2011-08-30 2012-01-04 王洁 Transaction data stream closed weighted frequent pattern (DS_CWFP) mining method
CN103279570A (en) * 2013-06-19 2013-09-04 广西教育学院 Text database oriented matrix weighting negative pattern mining method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104182527A (en) * 2014-08-27 2014-12-03 广西教育学院 Partial-order itemset based Chinese-English text word association rule mining method and system
CN104239430A (en) * 2014-08-27 2014-12-24 广西教育学院 Item weight change based method and system for mining education data association rules
CN104239430B (en) * 2014-08-27 2017-04-12 广西教育学院 Item weight change based method and system for mining education data association rules
CN104182527B (en) * 2014-08-27 2017-07-18 广西财经学院 Partial-order itemset based Chinese-English text word association rule mining method and system
CN104217013A (en) * 2014-09-22 2014-12-17 广西教育学院 Course positive and negative pattern mining method and system based on item weighting and itemset association degree
CN104239536A (en) * 2014-09-22 2014-12-24 广西教育学院 Completely-weighted course positive and negative association pattern mining method and system based on mutual information
CN104217013B (en) * 2014-09-22 2017-06-13 广西教育学院 Course positive and negative pattern mining method and system based on item weighting and itemset association degree
CN109471885A (en) * 2018-09-30 2019-03-15 齐鲁工业大学 Data analysis method and system based on weighted positive and negative sequence patterns
CN109471885B (en) * 2018-09-30 2022-05-31 齐鲁工业大学 Data analysis method and system based on weighted positive and negative sequence patterns

Also Published As

Publication number Publication date
CN103838854B (en) 2017-03-22

Similar Documents

Publication Publication Date Title
CN106991092B (en) Method and equipment for mining similar referee documents based on big data
Cobo et al. Science mapping software tools: Review, analysis, and cooperative study among tools
Zhang et al. A hybrid term–term relations analysis approach for topic detection
CN103279570B (en) Text database oriented matrix weighting negative pattern mining method
CN103955542A (en) Item all-weighted positive and negative association pattern mining method between text terms and mining system thereof
CN103838854A (en) Completely-weighted mode mining method for discovering association rules among texts
CN103390051A (en) Topic detection and tracking method based on microblog data
CN103544242A (en) Microblog-oriented emotion entity searching system
CN102831119B (en) Short text clustering apparatus and method
CN110287329A (en) E-commerce category attribute mining method based on product text classification
CN104182527A (en) Partial-order itemset based Chinese-English text word association rule mining method and system
CN106202065A (en) Cross-language topic detection method and system
Wang et al. Understanding geological reports based on knowledge graphs using a deep learning approach
Tang et al. Lily results for OAEI 2018.
Hu et al. Grounding Topic Models with Knowledge Bases.
Qiu et al. Construction and application of a knowledge graph for iron deposits using text mining analytics and a deep learning algorithm
Cekinel et al. Event prediction from news text using subgraph embedding and graph sequence mining
Zhao et al. Semi-supervised classification based mixed sampling for imbalanced data
Ay et al. Turkish abstractive text document summarization using text to text transfer transformer
Xie et al. The twenty-first century of structural engineering research: A topic modeling approach
CN109241275A (en) Text topic clustering algorithm based on natural language processing
Jadhav et al. Pattern based topic model for data mining
Jingli et al. Web clustering based on tag set similarity
CN105718430A (en) Grouping minimum value-based method for calculating fingerprint similarity
KR20210056631A (en) Issue occurrence prediction system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
CB03 Change of inventor or designer information

Inventor after: Huang Mingxuan

Inventor before: Huang Mingxuan

Inventor before: Yuan Changan

COR Change of bibliographic data
TA01 Transfer of patent application right

Effective date of registration: 20160317

Address after: No. 100 Mingxiu West Road, Nanning, Guangxi Zhuang Autonomous Region, 530003

Applicant after: Guangxi Finance and Economics Institute

Address before: No. 37 Jianzheng Road, Qingxiu District, Nanning, Guangxi Zhuang Autonomous Region, 530023

Applicant before: Guangxi College of Education

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170322

Termination date: 20180314