Chinese word weighting positive and negative mode mining method and system based on correlation coefficient
Technical Field
The invention belongs to the field of text mining, and in particular relates to a correlation coefficient-based method and system for mining weighted positive and negative patterns among Chinese words. It is suitable for feature word association pattern discovery, query expansion in Chinese text information retrieval, cross-language information retrieval, and similar applications in Chinese text mining. The positive and negative association patterns of feature words can be applied to web search engines such as Baidu and Google to realize query expansion, which helps improve query performance and meets users' information query needs.
Background
Over the past 20 years, research on association pattern mining has achieved remarkable results, which can be grouped into three major categories: unweighted positive and negative association pattern mining, weighted positive and negative association pattern mining, and matrix-weighted (also called fully weighted) positive and negative association pattern mining.
Association pattern mining research began with the Apriori method, an unweighted association pattern mining method proposed by Agrawal et al. (AGRAWAL R, IMIELINSKI T, SWAMI A. Mining association rules between sets of items in large databases[C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. Washington D.C.: ACM Press, 1993: 207-216). On this basis, scholars have proposed improved unweighted association pattern mining methods from different angles. The drawback of unweighted positive and negative association pattern mining is that the differing importance of items and the differing weights of items in the transaction database are not taken into account, resulting in a large number of invalid, redundant and uninteresting association patterns.
Item-weighted association pattern mining overcomes some defects of the traditional mining technology by introducing item weights to reflect the different importance of items. Research on item-weighted association pattern mining began in 1998, typified by the weighted association rule mining method proposed by Cai et al. (CAI C H, FU A W C, et al. Mining association rules with weighted items[C]// Proceedings of the IEEE International Database Engineering and Applications Symposium. Washington D.C.: IEEE Computer Society, 1998: 68-77). Improved methods have since appeared, such as the WIT-tree-based weighted frequent itemset mining method proposed by Vo et al. (VO B, COENEN F, LE B. A new method for mining Frequent Weighted Itemsets based on WIT-trees[J]. Expert Systems with Applications, 2013(40): 1256-1264). The drawback of item-weighted positive and negative association pattern mining is that it ignores the case in which item weights differ across the records of the transaction database.
Item matrix-weighted association pattern mining attaches importance to the inherent characteristics of matrix-weighted data, i.e., it considers the case in which items have different weights in each transaction record of the database, overcoming the drawback of item-weighted association pattern mining. Data in which the item weights are objectively distributed in the transaction records and vary with the record are generally referred to as matrix-weighted data, also called fully weighted data. Research on matrix-weighted association pattern mining began in 2003, typified by the fully weighted association rule mining method proposed by Tan Yihong et al. (Tan Yihong, Lin Yaping). These methods effectively mine matrix-weighted association rules, but cannot mine matrix-weighted negative association patterns. With the rapid growth of matrix-weighted data (such as network text data), matrix-weighted positive and negative association pattern mining has ever higher application value in text information retrieval, text mining and related fields, and the antecedent or consequent of an association rule can serve as a source of query expansion words for information retrieval. Aiming at these problems, the invention provides a correlation coefficient-based method and system for mining weighted positive and negative patterns among Chinese words. Experimental results show that the proposed method can effectively reduce the number of feature word candidate itemsets and the mining time, that its mining performance is superior to that of existing unweighted positive and negative association pattern mining methods, and that the mined feature word association patterns can provide a reliable source of query expansion words for retrieval systems such as web search engines, so as to improve their query performance.
Disclosure of Invention
The invention aims to deeply explore Chinese text feature word association patterns. It provides a correlation coefficient-based method and system for mining weighted positive and negative patterns among Chinese words, improving Chinese text mining efficiency. Applied to a web search engine to realize query expansion, it can improve retrieval performance; applied to Chinese text mining, it can discover more practical and reasonable Chinese feature word association patterns, so as to improve the accuracy of text clustering and classification.
The technical scheme adopted by the invention is as follows: a correlation coefficient-based Chinese inter-word weighted positive and negative pattern mining method comprises the following steps:
(1) Preprocessing the Chinese text: the Chinese text information data to be processed are preprocessed: the Chinese text is segmented into words and stop words are removed, feature words are extracted and their weights computed, and a text information database and a feature word item database based on the vector space model are constructed.
The text feature word weight calculation formula is: w_ij = (1 + ln(tf_ij)) × idf_i,
where w_ij is the weight of the ith feature word in the jth document, idf_i is the inverse document frequency of the ith feature word, with value idf_i = log(N/df_i), N is the total number of documents in the document set, df_i is the number of documents containing the ith feature word, and tf_ij is the frequency of the ith feature word in the jth document;
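The weighting scheme above can be sketched as follows in Python (a minimal illustration: the function name and data layout are not from the original, and since the base of the logarithm in idf_i is not fixed by the text, the natural logarithm is assumed):

```python
import math

def feature_word_weights(docs):
    """w_ij = (1 + ln(tf_ij)) * idf_i with idf_i = log(N / df_i).

    `docs` is a list of token lists, i.e. Chinese documents that have
    already been segmented with stop words removed (step (1))."""
    n_docs = len(docs)
    df = {}                                   # document frequency df_i
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    weights = []                              # one {word: w_ij} dict per document
    for doc in docs:
        tf = {}                               # term frequency tf_ij
        for word in doc:
            tf[word] = tf.get(word, 0) + 1
        weights.append({w: (1 + math.log(tf[w])) * math.log(n_docs / df[w])
                        for w in tf})
    return weights
```

A word occurring in every document gets idf = 0 and therefore weight 0, which is the intended effect of the inverse document frequency factor.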
(2) Mining the Chinese feature word matrix-weighted frequent 1-itemsets L1: take each candidate 1-itemset C1 from the item library, accumulate the itemset weight w(C1), compute its support mwS(C1), compare it with ms, and add each mined matrix-weighted frequent 1-itemset L1 to mwPIS. The support mwS(C1) of a candidate 1-itemset C1 is computed as:
mwS(C1) = w(C1)/n,
where n is the total number of records in the text information database.
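A sketch of this step in Python (the helper name and data layout are illustrative; the formula mwS(C1) = w(C1)/n is consistent with Table 1 of the embodiment, e.g. w(i1) = 2.8 over n = 5 records gives mwS(i1) = 0.56):

```python
def mw_support_1(records, n):
    """Support of matrix-weighted candidate 1-itemsets: mwS(C1) = w(C1)/n.

    `records` is the text information database from step (1): one
    {feature_word: weight} dict per document record; `n` is the total
    number of records."""
    acc = {}
    for rec in records:
        for item, w in rec.items():           # accumulate w(C1) over records
            acc[item] = acc.get(item, 0.0) + w
    return {item: total / n for item, total in acc.items()}
```

Comparing each resulting support with the minimum support threshold ms yields the frequent 1-itemsets L1.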
(3) Mining the interesting Chinese feature word matrix-weighted frequent i-itemsets Li and negative i-itemsets Ni (i ≥ 2), comprising the following steps (3.1) to (3.3):
(3.1) Apriori-join the frequent (i-1)-itemsets Li-1 to generate the candidate i-itemsets Ci, accumulate the weight w(Ci), and compute the support mwS(Ci):
mwS(Ci) = w(Ci)/(i × n).
(3.2) If the support mwS(Ci) of a candidate i-itemset Ci is greater than or equal to the minimum support threshold ms, i.e., mwS(Ci) ≥ ms, compute its frequent itemset relevance mwFIR(Ci); add each Ci whose relevance is greater than or equal to the minimum frequent relevance threshold mFr (i.e., mwFIR(Ci) ≥ mFr) to the weighted frequent i-itemsets Li and the frequent itemset set mwPIS. The frequent itemset relevance is computed as:
mwFIR(Ci) = mwS(Ci)/mwS(Imax),
where Imax is the sub-itemset of Ci with the maximum support.
(3.3) If the support mwS(Ci) of a candidate i-itemset Ci is less than the minimum support threshold ms, i.e., mwS(Ci) < ms, compute its negative itemset relevance mwNIR(Ci); add each Ci whose relevance is greater than or equal to the minimum negative itemset relevance threshold mNr (i.e., mwNIR(Ci) ≥ mNr) to the weighted negative i-itemsets Ni and the negative itemset set mwNIS. The negative itemset relevance is computed as:
mwNIR(Ci) = mwS(Ci)/(1 - mwS(Imax)),
where Imax is the sub-itemset of Ci with the maximum support.
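Steps (3.1) to (3.3) can be sketched as follows (a hypothetical helper, not the original implementation; the support formula mwS(I) = w(I)/(|I|·n) matches the values in Tables 1 and 3 of the embodiment, and the two relevance measures match the mwNIR computation 0.096/(1−0.56) = 0.218 given there):

```python
from itertools import combinations

def mw_support(records, items, n):
    """mwS(I) = w(I) / (|I| * n): sum the weights of the items of I over
    the records containing all of I, normalized by dimension and n."""
    total = sum(sum(rec[it] for it in items)
                for rec in records if all(it in rec for it in items))
    return total / (len(items) * n)

def classify_candidate(records, items, n, ms, mFr, mNr):
    """Steps (3.2)/(3.3): keep a candidate itemset as interesting-frequent
    or interesting-negative, or prune it, using mwFIR / mwNIR."""
    s = mw_support(records, items, n)
    # support of the proper sub-itemset with maximum support (Imax)
    s_max = max(mw_support(records, sub, n)
                for r in range(1, len(items))
                for sub in combinations(items, r))
    if s >= ms:                               # candidate frequent itemset
        return "frequent" if s / s_max >= mFr else "pruned"        # mwFIR
    return "negative" if s / (1 - s_max) >= mNr else "pruned"      # mwNIR
```

The double-threshold pruning (mFr for frequent itemsets, mNr for negative itemsets) is what removes uninteresting candidates before rule generation.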
(4) Mining effective Chinese feature word matrix-weighted positive and negative association rule patterns from the Chinese feature word frequent itemset set mwPIS, comprising the following steps (4.1) to (4.6):
(4.1) Extract a feature word frequent itemset Li from mwPIS and find all proper subsets of Li.
(4.2) Take any two proper subsets I1 and I2 from the set of proper subsets of Li such that I1 ∩ I2 = ∅ and I1 ∪ I2 = Li. When the supports mwS(I1) and mwS(I2) are both greater than or equal to ms, i.e., mwS(I1) ≥ ms and mwS(I2) ≥ ms, compute the correlation coefficient mwPCC(I1, I2) of the matrix-weighted frequent itemset (I1, I2). The calculation formulas are:
mwS(I1) = w(I1)/(i1 × n), mwS(I2) = w(I2)/(i2 × n),
mwPCC(I1, I2) = (mwS(I1, I2) - mwS(I1) × mwS(I2)) / sqrt(mwS(I1) × (1 - mwS(I1)) × mwS(I2) × (1 - mwS(I2))),
where i1 and i2 are the numbers of items (i.e., the dimensions) of I1 and I2, and mwS(*) > 0, mwS(*) ≠ 1.
(4.3) If the correlation coefficient mwPCC(I1, I2) of the matrix-weighted frequent itemset (I1, I2) is greater than or equal to the correlation threshold β, i.e., mwPCC(I1, I2) ≥ β: compute VMWAR(I1, I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong association rule I1→I2 is obtained and added to the matrix-weighted positive association rule set mwPAR. Compute the effective matrix-weighted negative association rule evaluation value VMWAR(¬I1, ¬I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule ¬I1→¬I2 is obtained and added to mwNAR. VMWAR(·, ·, mc, mi) is the effective rule evaluation value (formulas (17) to (20) below): it equals 1 when the rule satisfies the minimum confidence threshold mc and the minimum interestingness threshold mi, and 0 otherwise.
(4.4) If the correlation coefficient mwPCC(I1, I2) of the matrix-weighted itemset (I1, I2) is less than or equal to -β, i.e., mwPCC(I1, I2) ≤ -β: compute VMWAR(I1, ¬I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule I1→¬I2 is obtained and added to mwNAR. Compute VMWAR(¬I1, I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule ¬I1→I2 is obtained and added to mwNAR.
(4.5) Return to step (4.2) until every proper subset in the proper subset set of the feature word frequent itemset Li has been taken out exactly once; then go to step (4.6).
(4.6) Return to step (4.1) until every frequent itemset Li in the feature word frequent itemset set has been taken out exactly once; the operation of step (4) then finishes, and the method proceeds to step (5).
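The correlation test that drives steps (4.2) to (4.4) can be sketched as follows. The mwPCC formula is reconstructed from the embodiment (it reproduces the value −0.1614 computed there for supports 0.56, 0.184, 0.072); the VMWAR confidence/interestingness test is omitted, so the function only reports which rule directions are candidates:

```python
import math

def mwPCC(s1, s2, s12):
    """Correlation coefficient of a matrix-weighted itemset pair (I1, I2),
    from the supports mwS(I1)=s1, mwS(I2)=s2, mwS(I1,I2)=s12.
    Requires 0 < s1, s2 < 1 (mwS(*) > 0, mwS(*) != 1)."""
    return (s12 - s1 * s2) / math.sqrt(s1 * (1 - s1) * s2 * (1 - s2))

def candidate_rules(s1, s2, s12, beta):
    """Rule forms considered in steps (4.3)/(4.4); each must still pass
    the VMWAR confidence/interestingness test before being accepted."""
    pcc = mwPCC(s1, s2, s12)
    if pcc >= beta:
        return ["I1 -> I2", "not I1 -> not I2"]
    if pcc <= -beta:
        return ["I1 -> not I2", "not I1 -> I2"]
    return []
```

When |mwPCC| falls inside (−β, β), the itemset pair is treated as uncorrelated and no rule is generated from it.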
(5) Mining effective Chinese feature word matrix-weighted negative association rule patterns from the negative itemset set mwNIS, comprising the following steps (5.1) to (5.6):
(5.1) Extract a feature word negative itemset Ni from the Chinese feature word negative itemset set mwNIS and find all proper subsets of Ni.
(5.2) Take any two proper subsets I1 and I2 from the set of proper subsets of Ni such that I1 ∩ I2 = ∅ and I1 ∪ I2 = Ni. When the supports mwS(I1) and mwS(I2) are both greater than or equal to ms, compute the correlation coefficient mwPCC(I1, I2) of the matrix-weighted negative itemset (I1, I2); the formula for mwPCC(I1, I2) is the same as in step (4.2).
(5.3) If the correlation coefficient mwPCC(I1, I2) of the matrix-weighted negative itemset (I1, I2) is greater than or equal to the correlation threshold β, i.e., mwPCC(I1, I2) ≥ β, compute VMWAR(¬I1, ¬I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule ¬I1→¬I2 is obtained and added to mwNAR. VMWAR(¬I1, ¬I2, mc, mi) is computed by the formula in step (4.3).
(5.4) If the correlation coefficient mwPCC(I1, I2) of the matrix-weighted negative itemset (I1, I2) is less than or equal to -β, i.e., mwPCC(I1, I2) ≤ -β: compute the effective matrix-weighted negative association rule evaluation value VMWAR(I1, ¬I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule I1→¬I2 is obtained and added to the matrix-weighted negative association rule set mwNAR. Compute the evaluation value VMWAR(¬I1, I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule ¬I1→I2 is obtained and added to mwNAR. VMWAR(I1, ¬I2, mc, mi) and VMWAR(¬I1, I2, mc, mi) are computed by the formulas in step (4.4).
(5.5) Return to step (5.2) until every proper subset in the proper subset set of the feature word negative itemset Ni has been taken out exactly once; then go to step (5.6).
(5.6) Return to step (5.1) until every negative itemset Ni in the feature word negative itemset set has been taken out exactly once; the operation of step (5) then finishes;
and the matrix-weighted positive and negative pattern mining of Chinese feature words ends. Here ms is the minimum support threshold, mc is the minimum confidence threshold, mi is the minimum interestingness threshold, and β is the correlation coefficient threshold.
A mining system suitable for the correlation coefficient-based Chinese inter-word weighted positive-negative pattern mining method is characterized by comprising the following 4 modules:
the Chinese text information preprocessing module: the method is used for performing word segmentation and stop word deletion on the Chinese text information to be processed, extracting the characteristic words and calculating the weight of the characteristic words to construct a Chinese text information base and a characteristic word item base.
The Chinese feature word candidate itemset generation module: this module first mines the matrix-weighted Chinese feature word candidate 1-itemsets from the feature word item library and the Chinese text information library and calculates their supports; from the i-itemsets (i ≥ 2) onward, the Chinese feature word candidate i-itemsets are generated by Apriori-joining the frequent (i-1)-itemsets.
The Chinese characteristic word frequent item set and negative item set generating module: the module calculates the support degree of a candidate i-item set of the Chinese characteristic words, and compares the support degree with a minimum support degree threshold value to obtain a frequent i-item set and a negative i-item set of the Chinese characteristic words; calculating the association degree of the frequent item set, and comparing the association degree with a threshold value of the association degree of the frequent item set to obtain an interesting Chinese feature word frequent item set; and calculating the relevance of the negative term set, and comparing the relevance with a threshold value of the relevance of the negative term set to obtain an interesting Chinese feature word negative term set.
The Chinese feature word positive and negative association rule generation and result display module comprises: the module firstly generates a proper subset of a Chinese feature word frequent item set, calculates a correlation coefficient, an interest degree and a confidence degree of a Chinese feature word association rule mode, compares the correlation coefficient, the interest degree threshold and the confidence degree threshold, and excavates an effective matrix weighting Chinese feature word strong positive and negative association rule mode from the frequent item set; then generating a proper subset of the negative term set of the Chinese characteristic words, calculating the correlation coefficient, the interest degree and the confidence degree of the negative association rule mode of the Chinese characteristic words, comparing the correlation coefficient threshold, the interest degree threshold and the confidence degree threshold, and mining an effective matrix weighting strong negative association rule mode of the Chinese characteristic words from the negative term set; and finally, displaying the effective matrix weighted Chinese feature word positive and negative association rule mode to a user in a required form for analysis and use by the user.
The Chinese characteristic word frequent item set and negative item set generating module comprises the following 2 modules:
a characteristic word frequent item set generation module: the module calculates the support degree of the candidate item set of the Chinese characteristic words, compares the support degree with a support degree threshold value to obtain a frequent item set, calculates the association degree of the frequent item set, and compares the association degree with a correlation degree threshold value to obtain an interesting matrix weighted Chinese characteristic word frequent item set.
The feature word negative item set generation module: the module calculates the support degree of the candidate item set of the Chinese characteristic words, compares the support degree with a support degree threshold value to obtain a negative item set, calculates the relevance degree of the negative item set, and compares the relevance degree with a relevance degree threshold value to obtain an interesting matrix weighted negative item set of the Chinese characteristic words.
The Chinese feature word positive-negative association rule generation and result display module comprises the following 3 modules:
a strong positive and negative association rule generation module from a frequent item set: the module generates a proper subset of a Chinese feature word frequent item set, calculates a correlation coefficient, an interest degree and a confidence degree of the Chinese feature word association rule mode, compares the correlation coefficient, the interest degree threshold and the confidence degree threshold, and mines an effective matrix weighting Chinese feature word strong positive and negative association rule mode from the frequent item set.
A strong negative association rule generation module from the negative set of terms: the module generates a proper subset of the negative term set of the Chinese characteristic words, calculates the correlation coefficient, the interestingness and the confidence coefficient of the negative association rule mode of the Chinese characteristic words, compares the correlation coefficient threshold, the interestingness threshold and the confidence coefficient threshold, and mines an effective matrix weighting strong negative association rule mode of the Chinese characteristic words from the negative term set.
The strong positive and negative association rule display module of the feature words: the module displays the effective matrix weighted Chinese feature word positive and negative association rule mode to the user in a required form for analysis and use by the user.
The support degree threshold ms, the confidence degree threshold mc, the interest degree threshold mi and the correlation coefficient threshold beta in the mining system are input by a user.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a correlation coefficient-based Chinese feature word matrix-weighted positive and negative association pattern mining method and mining system. The invention adopts a new itemset pruning technique, avoids the generation of many invalid, false and uninteresting association patterns, greatly improves mining efficiency, and makes the mined association patterns closer to actual conditions. Compared with existing mining methods, the numbers of mined candidate itemsets, frequent itemsets, negative itemsets, and positive and negative association rule patterns are greatly reduced, the mining time is much shorter, and mining efficiency is greatly improved. Experimental results show that the double-threshold pruning strategy proposed by the method is effective, with an obvious pruning effect, and that more practical Chinese feature word association patterns can be obtained; the method has high application value and broad application prospects in text mining, information retrieval and related fields. Applying the positive and negative association patterns of Chinese feature words to web search engines such as Baidu and Google to realize query expansion improves query performance and meets users' information query needs.
(2) Using the Chinese standard dataset CWT200g as experimental data, the method is compared with a classical unweighted association pattern mining method. Experimental results show that, whether the support threshold or the confidence threshold varies, the number of candidate itemsets mined by the method is smaller than that of the comparison method, and its mining time is shorter, with a larger reduction, so the mining efficiency is greatly improved.
Drawings
FIG. 1 is a block diagram of a correlation coefficient-based mining method for weighted positive and negative patterns of Chinese words.
FIG. 2 is an overall flow chart of the correlation coefficient-based Chinese inter-word weighted positive-negative pattern mining method of the present invention.
FIG. 3 is a block diagram of the system for mining weighted positive and negative patterns between Chinese words based on correlation coefficients according to the present invention.
FIG. 4 is a block diagram of a module for generating a frequent term set and a negative term set of Chinese feature words according to the present invention.
FIG. 5 is a block diagram of the structure of the Chinese feature word positive-negative association rule generation and result display module according to the present invention.
Detailed Description
To better illustrate the technical solution of the present invention, the following introduces the chinese text data model and related concepts related to the present invention as follows:
I. Basic concepts
Let MWD = {r1, r2, …, rn} be the transaction database, where the number of transaction records is n; let Is = {i1, i2, …, im} denote the set of all items (Itemset, Is) in MWD, where the number of items is m; ij (1 ≤ j ≤ m) denotes the jth item in MWD, and its weight in transaction record ri is w[ri][ij] (0 ≤ w[ri][ij] ≤ 1). Let I1, I2 be sub-itemsets of itemset I such that I1 ∪ I2 = I and I1 ∩ I2 = ∅. The following basic definitions are given.
Definition 1. Matrix-weighted Support (mwS): the matrix-weighted support of an itemset I is mwS(I) = w(I)/(k × n), where w(I) is the weight of I accumulated over all transaction records containing I, k is the number of items in I, and n is the number of records (Tan Yihong, Lin Yaping. Mining of fully weighted association rules in the vector space model. Computer Engineering and Applications, 2003(13): 208-), as shown in formula (1).
The matrix weighted negative term set and the negative association rule support degree are shown as formulas (2) to (5).
mwS(﹁I)=1–mwS(I) (2)
mwS(I1→﹁I2)=mwS(I1,﹁I2)=mwS(I1)–mwS(I1,I2) (3)
mwS(﹁I1→I2)=mwS(﹁I1,I2)=mwS(I2)–mwS(I1,I2) (4)
mwS(﹁I1→﹁I2)=mwS(﹁I1,﹁I2)=1–mwS(I1)–mwS(I2)+mwS(I1,I2) (5)
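Formulas (3) to (5) can be checked with a few lines of Python, here using the supports that appear later in the embodiment for the itemset pair (i1, i5) (mwS(I1) = 0.56, mwS(I2) = 0.168, mwS(I1, I2) = 0.214; the helper name is illustrative):

```python
def negative_rule_supports(s1, s2, s12):
    """Formulas (3)-(5): supports of the negated itemsets/rules derived
    from mwS(I1)=s1, mwS(I2)=s2 and mwS(I1,I2)=s12."""
    return {
        "I1,not I2":     s1 - s12,                # formula (3)
        "not I1,I2":     s2 - s12,                # formula (4)
        "not I1,not I2": 1 - s1 - s2 + s12,       # formula (5)
    }

# with the embodiment's supports for (i1, i5):
sup = negative_rule_supports(0.56, 0.168, 0.214)
```

The value 1 − 0.56 − 0.168 + 0.214 = 0.486 agrees with the mwS(¬I1, ¬I2) computed in the worked example below.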
Definition 2. Matrix-weighted frequent itemset and negative itemset: for a matrix-weighted itemset I, if mwS(I) ≥ ms, the itemset I is called a matrix-weighted frequent itemset; when I1 and I2 are both matrix-weighted frequent itemsets, if mwS(I1, I2) < ms, the itemset (I1, I2) is called a matrix-weighted negative itemset, where ms is the minimum support threshold.
Definition 3. Matrix-weighted Confidence (mwC): the matrix-weighted positive and negative association rule confidence calculation formulas are given as formulas (6) to (9):
Definition 4. Matrix-weighted Pattern Correlation Coefficient (mwPCC): the correlation coefficient mwPCC(I1, I2) of a matrix-weighted pattern (I1, I2) is given by formula (10):
mwPCC(I1, I2) = (mwS(I1, I2) - mwS(I1) × mwS(I2)) / sqrt(mwS(I1) × (1 - mwS(I1)) × mwS(I2) × (1 - mwS(I2)))    (10)
where mwS(*) > 0 and mwS(*) ≠ 1.
Definition 5. Matrix-weighted Frequent Itemset Relevance (mwFIR): for a matrix-weighted frequent itemset FI = (i1, i2, …, im) (m > 1), let Imax be its sub-itemset with the maximum support. Taking the conditional probability of the frequent itemset FI given that this sub-itemset occurs as the relevance of FI, the relevance mwFIR(FI) among the sub-itemsets of the matrix-weighted frequent itemset FI is given by formula (11):
mwFIR(FI) = mwS(FI)/mwS(Imax)    (11)
Definition 6. Matrix-weighted Negative Itemset Relevance (mwNIR): for a matrix-weighted negative itemset NI = (i1, i2, …, ir) (r > 1), let Imax be its sub-itemset with the maximum support. Taking the conditional probability of the negative itemset NI given that this sub-itemset does not occur as the relevance of NI, the relevance mwNIR(NI) between the matrix-weighted negative itemset NI and its sub-itemsets is given by formula (12):
mwNIR(NI) = mwS(NI)/(1 - mwS(Imax))    (12)
Defining 7 matrix weighted positive and negative association rule interestingness: matrix-weighted positive and negative association rule interest (mwARI) formulas are shown in formulas (13) to (16).
II. Effective matrix-weighted positive and negative association rule mining idea
Assume the minimum confidence threshold is mc, the minimum interestingness threshold is mi, and the correlation coefficient threshold is β (β ∈ (0, 1)). The basic idea of effective matrix-weighted association rule mining is as follows:
(1) For an interesting matrix-weighted frequent itemset (I1, I2) whose itemsets I1 and I2 are both frequent itemsets: if mwPCC(I1, I2) ≥ β, VMWAR(I1, I2, mc, mi) = 1 and VMWAR(¬I1, ¬I2, mc, mi) = 1, then I1→I2 and ¬I1→¬I2 are effective matrix-weighted positive and negative association rules; if mwPCC(I1, I2) ≤ -β, when VMWAR(I1, ¬I2, mc, mi) = 1 and VMWAR(¬I1, I2, mc, mi) = 1, then I1→¬I2 and ¬I1→I2 are effective matrix-weighted negative rules.
Here, VMWAR(I1, I2, mc, mi), VMWAR(¬I1, ¬I2, mc, mi), VMWAR(I1, ¬I2, mc, mi) and VMWAR(¬I1, I2, mc, mi) are calculated by formulas (17) to (20).
(2) For an interesting matrix-weighted negative itemset (I1, I2) whose itemsets I1 and I2 are both frequent itemsets: if mwPCC(I1, I2) ≥ β and VMWAR(¬I1, ¬I2, mc, mi) = 1, then ¬I1→¬I2 is an effective matrix-weighted negative association rule; if mwPCC(I1, I2) ≤ -β, VMWAR(I1, ¬I2, mc, mi) = 1 and VMWAR(¬I1, I2, mc, mi) = 1, then I1→¬I2 and ¬I1→I2 are effective matrix-weighted negative association rules.
III. Interesting pruning strategies for matrix-weighted itemsets
Let the minimum frequent relevance threshold be mFr and the minimum negative relevance threshold be mNr.
Pruning strategy for an interesting matrix-weighted frequent itemset I: when mwS(I) ≥ ms, if mwFIR(I) ≥ mFr, itemset I is an interesting matrix-weighted frequent itemset and should be retained; otherwise, if mwFIR(I) < mFr, itemset I is pruned.
Pruning strategy for an interesting matrix-weighted negative itemset I: when mwS(I) < ms, if mwNIR(I) ≥ mNr, itemset I is an interesting matrix-weighted negative itemset and should be retained; otherwise, if mwNIR(I) < mNr, itemset I is pruned.
The technical solution of the present invention is further illustrated by the following specific examples.
The excavation method and system adopted by the present invention in a specific embodiment are shown in fig. 1-5.
Example: the following is an example of a Chinese text database with 5 Chinese document records and 5 feature word items and their weights, i.e., the document set is {d1, d2, d3, d4, d5} and the feature word set is {i1, i2, i3, i4, i5} = {program, queue, function, environment, member}.
The mining method of the invention is applied to this Chinese document data example to mine the Chinese feature word matrix-weighted positive and negative association patterns. The mining process is as follows (ms = 0.15, mc = 0.3, mFr = 0.3, mNr = 0.12, mi = 0.26, β = 0.1):
1. Mine the matrix-weighted feature word frequent 1-itemsets L1, as shown in Table 1, where n = 5.
Table 1:
C1      w(C1)   mwS(C1)
(i1)    2.8     0.56
(i2)    0.55    0.11
(i3)    2.6     0.52
(i4)    0.92    0.184
(i5)    0.84    0.168
As can be seen from Table 1, L1 = {(i1), (i3), (i4), (i5)}, and the feature word frequent itemset set mwPIS = {(i1), (i3), (i4), (i5)}.
2. Mine the matrix-weighted feature word frequent k-itemsets Lk and negative k-itemsets Nk, k ≥ 2.
k = 2:
(1) Apriori-join the feature word frequent 1-itemsets L1 to generate the feature word candidate 2-itemsets C2, and compute w(C2) and mwS(C2), as shown in Table 2.
Table 2:
For Table 2, the following operations are performed:
① For mwS(C2) ≥ ms, compute mwFIR(C2); add each matrix-weighted frequent 2-itemset with mwFIR(C2) ≥ mFr to L2 and the frequent itemset set mwPIS, i.e., L2 = {(i1, i3), (i1, i5)}, mwPIS = {(i1), (i3), (i4), (i5), (i1, i3), (i1, i5)}.
② For mwS(C2) < ms, compute mwNIR(C2); add each matrix-weighted negative 2-itemset with mwNIR(C2) ≥ mNr to N2 and the negative itemset set mwNIS: N2 = {(i1, i4), (i3, i5)}, mwNIS = {(i1, i4), (i3, i5)}.
k = 3:
(1) Apriori-join L2 to generate the Chinese feature word candidate 3-itemsets C3, and accumulate w(C3) and mwS(C3), as shown in Table 3.
Table 3:
C3            w(C3)   mwS(C3)   mwNIR(C3) (when mwS(C3) < ms)
(i1, i3, i5)  1.44    0.096     0.096/(1-0.56) = 0.218
For Table 3, the following operations are performed:
The proper sub-itemsets of (i1, i3, i5) are {(i1), (i3), (i5), (i1, i3), (i1, i5), (i3, i5)}; among these, the one with the maximum support is (i1), with value 0.56. Since mwS(C3) < ms, mwNIR(C3) = 0.096/(1-0.56) = 0.218 ≥ mNr,
i.e., N3 = {(i1, i3, i5)}, mwNIS = {(i1, i4), (i3, i5), (i1, i3, i5)}.
When k = 4, since L3 is empty, mining of the matrix-weighted feature word frequent k-itemsets Lk and negative k-itemsets Nk ends, and the procedure proceeds to step 3 below. The final itemset mining results are: mwPIS = {(i1), (i3), (i4), (i5), (i1, i3), (i1, i5)}, mwNIS = {(i1, i4), (i3, i5), (i1, i3, i5)}.
3. Mine the effective matrix-weighted Chinese feature word positive and negative association rule patterns from the frequent itemset set mwPIS.
Taking the feature word frequent itemset (i1, i5) in mwPIS as an example, the effective matrix-weighted positive and negative association rule pattern mining process is as follows:
The proper subsets of the frequent itemset (i1, i5) are {(i1), (i5)}; let I1 = (i1), I2 = (i5).
mwS(I1) = 0.56 ≥ ms, mwS(I2) = 0.168 ≥ ms, mwS(I1, I2) = 0.214.
Computing the correlation coefficient gives mwPCC(I1, I2) ≈ 0.646. Because mwPCC(I1, I2) > β = 0.1:
(1) Because VMWAR(I1, I2, mc, mi) = 1, the effective matrix-weighted Chinese feature word association rule I1→I2 is obtained, i.e., (i1)→(i5), or (program)→(member).
(2) mwS(¬I1, ¬I2) = 1 - 0.56 - 0.168 + 0.214 = 0.486, mwS(¬I1) = 1 - 0.56 = 0.44,
mwS(¬I2) = 1 - 0.168 = 0.832.
Because VMWAR(¬I1, ¬I2, mc, mi) ≠ 1, the rule ¬I1→¬I2 cannot be mined.
In summary, for the Chinese feature word frequent itemset (i1, i5), one effective matrix-weighted Chinese feature word association rule pattern can be mined: (i1)→(i5), or (program)→(member) (ms = 0.15, mc = 0.3, mFr = 0.3, mNr = 0.12, mi = 0.26, β = 0.1).
4. And mining an effective matrix weighted Chinese characteristic word negative association rule pattern from the negative item set mwNIS.
Taking the feature word negative itemsets (i3, i5) and (i1, i4) in mwNIS as examples, the mining process of effective matrix-weighted Chinese feature word strong negative association rule patterns is as follows:
set of negative terms (i)3,i5) Is { (i)3),(i5) Is given by I1=(i3),I2=(i5)。
mwS(I1) = 0.52 ≥ ms, mwS(I2) = 0.168 ≥ ms, mwS(I1, I2) = 0.084.
Compute the matrix-weighted correlation coefficient:
mwPCC(I1, I2) = (0.084 - 0.52 × 0.168) / sqrt(0.52 × 0.48 × 0.168 × 0.832) ≈ -0.018.
Since mwPCC(I1, I2) ≈ -0.018 > -β = -0.1, no association rule is mined for this itemset.
The negative itemset (i1, i4) has the subsets {(i1), (i4)}; let I1 = (i1), I2 = (i4).
mwS(I1) = 0.56 ≥ ms, mwS(I2) = 0.184 ≥ ms, mwS(I1, I2) = 0.072.
Compute the matrix-weighted correlation coefficient:
mwPCC(I1, I2) = (0.072 - 0.56 × 0.184) / sqrt(0.56 × 0.44 × 0.184 × 0.816) = -0.1614.
Since mwPCC(I1, I2) = -0.1614 < -β = -0.1:
(1) mwS(I1, ¬I2) = 0.56 - 0.072 = 0.488, mwS(¬I2) = 1 - 0.184 = 0.816.
Since VMWAR(I1, ¬I2, mc, mi) = 0, the negative association rule I1→¬I2 cannot be mined.
(2) mwS(¬I1, I2) = 0.184 - 0.072 = 0.112.
Since VMWAR(¬I1, I2, mc, mi) = 1, the effective negative association rule ¬I1→I2, i.e., ¬(i1)→(i4), or ¬(program)→(environment), is obtained.
In summary, for the Chinese feature word negative itemset (i1, i4), the effective matrix-weighted Chinese feature word negative association rule ¬I1→I2, i.e., ¬(i1)→(i4), or ¬(program)→(environment), can be mined (ms = 0.15, mc = 0.3, mFr = 0.3, mNr = 0.12, mi = 0.26, β = 0.1).
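The negated supports used in step 4 follow directly from inclusion-exclusion over matrix-weighted supports. The short sketch below (function and key names are illustrative) recomputes them for the negative itemset (i1, i4):

```python
def negated_supports(sa, sb, sab):
    """Inclusion-exclusion identities used in the example:
    mwS(A,~B) = mwS(A) - mwS(A,B); mwS(~A,B) = mwS(B) - mwS(A,B);
    mwS(~A,~B) = 1 - mwS(A) - mwS(B) + mwS(A,B)."""
    return {
        "A,~B": round(sa - sab, 3),
        "~A,B": round(sb - sab, 3),
        "~A,~B": round(1 - sa - sb + sab, 3),
    }

# Negative itemset (i1, i4): mwS(i1) = 0.56, mwS(i4) = 0.184, mwS(i1, i4) = 0.072
print(negated_supports(0.56, 0.184, 0.072))
# prints {'A,~B': 0.488, '~A,B': 0.112, '~A,~B': 0.328}
```

The first two values are exactly the mwS(I1, ¬I2) = 0.488 and mwS(¬I1, I2) = 0.112 appearing in the example above.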
The beneficial effects of the present invention are further illustrated by the following experiments.
In order to verify the effectiveness and correctness of the invention, an experimental source program was written, and a classical unweighted positive and negative association rule mining algorithm (WU Xin-dong, ZHANG Cheng-qi, ZHANG Shi-chao. Efficient mining of both positive and negative association rules [J]. ACM Transactions on Information Systems, 2004, 22(3): 381-405), referred to as the PNARMiner algorithm, was selected as the experimental comparison algorithm. The mining performance of the algorithms is compared and analyzed from 4 aspects: support threshold variation, combined parameter variation, rule interestingness variation and correlation coefficient variation. In the following tables, the association rules (AR) A→B, A→¬B, ¬A→B and ¬A→¬B are denoted AR1, AR2, AR3 and AR4, respectively.
The experimental data come from 12024 pure Chinese text documents in part of the corpus of the Chinese Web Test collection CWT200g (Chinese Web Test collection with 200GB Web Pages) provided by the Network Laboratory of Peking University. A text database and a feature word item library are obtained through document preprocessing such as word segmentation, stop-word removal, feature word extraction and weight calculation. The preprocessed experimental data are as follows: 8751 feature words are obtained from the experimental document set, with document frequencies df (i.e., the number of documents containing the feature word) ranging from 51 to 11258. Feature words whose df values are not less than 1500 and not more than 5838 are extracted to construct the feature word item library for mining (400 feature words in total). The experimental parameters are: n = 12024, the number of mined items (ItemNumber, ItemNum) is 50, the thresholds ms, mFr, mNr, mc, mi and β are given in each experiment, and the maximum length of the mined itemsets is 4.
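The df-based feature selection described above reduces to a single filtering pass over the term statistics. The sketch below is purely illustrative: the term_df dictionary is an invented stand-in, not the actual CWT200g corpus data.

```python
# Hypothetical document-frequency statistics produced by preprocessing
# (word segmentation, stop-word removal, feature word extraction).
term_df = {
    "program": 5100, "member": 2300, "environment": 1700,
    "rare_term": 51, "common_term": 11258,
}

DF_MIN, DF_MAX = 1500, 5838  # df bounds used to build the 400-word item library

item_library = sorted(t for t, df in term_df.items() if DF_MIN <= df <= DF_MAX)
print(item_library)  # prints ['environment', 'member', 'program']
```

Terms that are too rare (df < 1500) or too common (df > 5838) are dropped, which is what narrows the 8751 extracted feature words down to the 400-word item library.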
Experiment one: mining performance under a varying support threshold
The numbers of candidate itemsets (CI), frequent itemsets (FI), negative itemsets (NI) and association rules mined by the two algorithms as the support threshold varies are shown in Tables 1 and 2,
where the experimental parameters are: mc = 0.07, mFr = 0.06, mNr = 0.001, mi = 0.01, β = 0.05.
As can be seen from Tables 1 and 2, as the support threshold increases, the numbers of each type of itemset and association rule gradually decrease, and the numbers of itemsets and association rules mined by the MWARM-SRCCCI algorithm herein are smaller than those of the comparison algorithm PNARMiner: the number of itemsets is reduced by up to 94.9%, and the number of rules by up to 99.7%.
Experiment two: mining performance under combined parameter threshold variation
Since a valid matrix-weighted association rule is the result of comprehensively evaluating the support, confidence, interestingness and correlation coefficient of the rule, 7 groups of combined parameter (GP) values are set, each group giving the thresholds {ms, mc, mi, mNr}: GP1 = {0.03, 0.01, 0.01, 0.05}, GP2 = {0.035, 0.015, 0.015, 0.055}, GP3 = {0.038, 0.02, 0.018, 0.055}, GP4 = {0.04, 0.035, 0.02, 0.065}, GP5 = {0.05, 0.04, 0.03, 0.07}, GP6 = {0.06, 0.07, 0.04, 0.08} and GP7. The numbers of positive and negative association rules mined under these combined parameters are shown in the table below.
TABLE 3 comparison of the number of positive and negative association rules mined under combined parameter changes
The experimental results in Table 3 show that, as the combined parameter values increase, the numbers of all types of association rules gradually decrease, and the algorithm herein mines fewer rules than the comparison algorithm: the number of positive association rule (A→B) patterns decreases by up to 95.36%, while among the negative association rules the largest reduction reaches 93.99% and the smallest reaches 82.85%.
Experiment three: mining time efficiency comparison
The time for the two algorithms to mine itemsets and association rules under varying support thresholds and varying combined parameters is shown in Tables 4 and 5.
The results in Tables 4 and 5 show that the MWARM-SRCCCI algorithm herein takes less time to mine itemsets and association rules than the comparison algorithm PNARMiner, with reductions of 51.92% and 74.74% under support threshold variation (mc = 0.07, mFr = 0.06, mNr = 0.001, mi = 0.01, β = 0.05) and combined parameter variation, respectively, indicating that the mining efficiency of the algorithm herein is indeed improved.
Experiment four: mining performance under rule interestingness threshold variation
This experiment mainly verifies the validity of the association rule interestingness threshold mi of the algorithm herein; the number of matrix-weighted association rules mined by the algorithm as mi varies is shown in Table 6.
TABLE 6 number of rules mined by the algorithm herein when mi varies
| mi   | AR1  | AR2 | AR3 | AR4   |
|------|------|-----|-----|-------|
| 0.01 | 1320 | 247 | 247 | 10838 |
| 0.03 | 1320 | 153 | 153 | 1096  |
| 0.05 | 1320 | 70  | 70  | 108   |
| 0.09 | 1320 | 0   | 0   | 0     |
| 0.80 | 1130 | 0   | 0   | 0     |
| 0.90 | 468  | 0   | 0   | 0     |
| 0.95 | 34   | 0   | 0   | 0     |
As can be seen from Table 6, as the interestingness threshold mi increases, the number of association rules decreases. It can be seen that mi has a large impact on the matrix-weighted positive and negative association rule patterns: negative association rules appear only at low mi values, while positive association rules are affected only at high mi values.
Experiment five: itemset pruning performance analysis
In order to verify the effectiveness of the itemset pruning strategy proposed by the present invention, the pruning performance of the MWARM-SRCCCI algorithm is experimentally analyzed in two cases, namely variation of the frequent itemset correlation threshold mFr and variation of the negative itemset correlation threshold mNr. The experimental results are shown in Tables 7 and 8, where mFr = 0 and mNr = 0 correspond to the case without pruning.
Tables 7 and 8 show that, as mFr and mNr increase, more frequent itemsets and negative itemsets are pruned, and the pruning effect becomes more significant. Meanwhile, the mNr value is much smaller than the mFr value, indicating the advantage of setting dual correlation thresholds.
The above experimental results show that the present invention has good mining performance: compared with the existing mining algorithm, the numbers of mined candidate itemsets, frequent itemsets, negative itemsets, and positive and negative association rule patterns are greatly reduced, the pruning effect is obvious, the mining time is greatly shortened, and the mining efficiency is significantly improved.