Chinese word weighting positive and negative mode mining method and system based on correlation coefficient
Technical Field
The invention belongs to the field of text mining, and in particular relates to a correlation coefficient-based method and system for mining weighted positive and negative patterns among Chinese words. It is suitable for feature word association pattern discovery, query expansion in Chinese text information retrieval, cross-language information retrieval, and similar applications in Chinese text mining. The positive and negative association patterns of feature words can be applied to web search engines such as Baidu and Google to realize query expansion, which helps improve query performance and meets users' information query needs.
Background
Over the past 20 years, research on association pattern mining has achieved remarkable results, which can be grouped into three major categories: unweighted positive and negative association pattern mining, weighted positive and negative association pattern mining, and matrix-weighted (also called fully weighted) positive and negative association pattern mining.
Association pattern mining research began with the Apriori method, an unweighted association pattern mining method proposed by Agrawal et al. (AGRAWAL R, IMIELINSKI T, SWAMI A. Mining association rules between sets of items in large databases[C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. Washington D.C.: ACM Press, 1993: 207-216). On this basis, scholars have proposed improved unweighted association pattern mining methods from different angles. The drawback of unweighted positive and negative association pattern mining is that the differing importance of items and the differing weights of items in the transaction database are not taken into account, resulting in a large number of invalid, redundant and uninteresting association patterns.
Item-weighted association pattern mining overcomes some defects of the traditional mining technology by introducing item weights to reflect the different importance of items. Research on item-weighted association pattern mining began in 1998, typified by the weighted association rule mining method proposed by Cai et al. (CAI C H, FU A W C, et al. Mining association rules with weighted items[C]// Proceedings of the IEEE International Database Engineering and Applications Symposium. Washington D.C.: IEEE Computer Society, 1998: 68-77). Improved methods have since appeared, such as the WIT-tree-based weighted frequent itemset mining method proposed by Vo et al. (VO B, COENEN F, LE B. A new method for mining Frequent Weighted Itemsets based on WIT-trees[J]. Expert Systems with Applications, 2013(40): 1256-1264). The drawback of item-weighted positive and negative association pattern mining is that it ignores the case in which item weights differ across the records of the transaction database.
Item matrix-weighted association pattern mining attaches importance to the inherent characteristics of matrix-weighted data, i.e., it considers the case in which items have different weights in each transaction record of the database, overcoming the drawback of item-weighted association pattern mining. Data in which the item weights are objectively distributed in the transaction records and vary with the record are generally referred to as matrix-weighted data, also called fully weighted data. Research on matrix-weighted association pattern mining began in 2003, typified by the fully weighted association rule mining method proposed by Tan Yihong et al. (Tan Yihong, Lin Yaping). These methods effectively mine matrix-weighted association rules, but cannot mine matrix-weighted negative association patterns. With the rapid growth of matrix-weighted data (such as network text data), matrix-weighted positive and negative association pattern mining has ever higher application value in text information retrieval, text mining and related fields, and the antecedent or consequent of an association rule can serve as a source of query expansion words for information retrieval. Aiming at these problems, the invention provides a correlation coefficient-based method and system for mining weighted positive and negative patterns among Chinese words. Experimental results show that the proposed method can effectively reduce the number of feature word candidate itemsets and the mining time, that its mining performance is superior to that of existing unweighted positive and negative association pattern mining methods, and that the mined feature word association patterns can provide a reliable source of query expansion words for retrieval systems such as web search engines, so as to improve their query performance.
Disclosure of Invention
The invention aims to deeply explore Chinese text feature word association patterns. It provides a correlation coefficient-based method and system for mining weighted positive and negative patterns among Chinese words, improving Chinese text mining efficiency. Applied to a web search engine to realize query expansion, it can improve retrieval performance; applied to Chinese text mining, it can discover more practical and reasonable Chinese feature word association patterns, so as to improve the accuracy of text clustering and classification.
The technical scheme adopted by the invention is as follows: a correlation coefficient-based Chinese inter-word weighted positive and negative pattern mining method comprises the following steps:
(1) Preprocessing the Chinese text: the Chinese text information data to be processed are preprocessed: the Chinese text is segmented into words and stop words are removed, feature words are extracted and their weights computed, and a text information database and a feature word item database based on the vector space model are constructed.
The text feature word weight calculation formula is: w_ij = (1 + ln(tf_ij)) × idf_i,
where w_ij is the weight of the ith feature word in the jth document, idf_i is the inverse document frequency of the ith feature word, with value idf_i = log(N/df_i), N is the total number of documents in the document set, df_i is the number of documents containing the ith feature word, and tf_ij is the frequency of the ith feature word in the jth document;
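The weighting scheme above can be sketched as follows in Python (a minimal illustration: the function name and data layout are not from the original, and since the base of the logarithm in idf_i is not fixed by the text, the natural logarithm is assumed):

```python
import math

def feature_word_weights(docs):
    """w_ij = (1 + ln(tf_ij)) * idf_i with idf_i = log(N / df_i).

    `docs` is a list of token lists, i.e. Chinese documents that have
    already been segmented with stop words removed (step (1))."""
    n_docs = len(docs)
    df = {}                                   # document frequency df_i
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    weights = []                              # one {word: w_ij} dict per document
    for doc in docs:
        tf = {}                               # term frequency tf_ij
        for word in doc:
            tf[word] = tf.get(word, 0) + 1
        weights.append({w: (1 + math.log(tf[w])) * math.log(n_docs / df[w])
                        for w in tf})
    return weights
```

A word occurring in every document gets idf = 0 and therefore weight 0, which is the intended effect of the inverse document frequency factor.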
(2) Mining the Chinese feature word matrix-weighted frequent 1-itemsets L1: take each candidate 1-itemset C1 from the item library, accumulate the itemset weight w(C1), compute its support mwS(C1), compare it with ms, and add each mined matrix-weighted frequent 1-itemset L1 to mwPIS. The support mwS(C1) of a candidate 1-itemset C1 is computed as:
mwS(C1) = w(C1)/n,
where n is the total number of records in the text information database.
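A sketch of this step in Python (the helper name and data layout are illustrative; the formula mwS(C1) = w(C1)/n is consistent with Table 1 of the embodiment, e.g. w(i1) = 2.8 over n = 5 records gives mwS(i1) = 0.56):

```python
def mw_support_1(records, n):
    """Support of matrix-weighted candidate 1-itemsets: mwS(C1) = w(C1)/n.

    `records` is the text information database from step (1): one
    {feature_word: weight} dict per document record; `n` is the total
    number of records."""
    acc = {}
    for rec in records:
        for item, w in rec.items():           # accumulate w(C1) over records
            acc[item] = acc.get(item, 0.0) + w
    return {item: total / n for item, total in acc.items()}
```

Comparing each resulting support with the minimum support threshold ms yields the frequent 1-itemsets L1.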
(3) Mining the interesting Chinese feature word matrix-weighted frequent i-itemsets Li and negative i-itemsets Ni (i ≥ 2), comprising the following steps (3.1) to (3.3):
(3.1) Apriori-join the frequent (i-1)-itemsets Li-1 to generate the candidate i-itemsets Ci, accumulate the weight w(Ci), and compute the support mwS(Ci):
mwS(Ci) = w(Ci)/(i × n).
(3.2) If the support mwS(Ci) of a candidate i-itemset Ci is greater than or equal to the minimum support threshold ms, i.e., mwS(Ci) ≥ ms, compute its frequent itemset relevance mwFIR(Ci); add each Ci whose relevance is greater than or equal to the minimum frequent relevance threshold mFr (i.e., mwFIR(Ci) ≥ mFr) to the weighted frequent i-itemsets Li and the frequent itemset set mwPIS. The frequent itemset relevance is computed as:
mwFIR(Ci) = mwS(Ci)/mwS(Imax),
where Imax is the sub-itemset of Ci with the maximum support.
(3.3) If the support mwS(Ci) of a candidate i-itemset Ci is less than the minimum support threshold ms, i.e., mwS(Ci) < ms, compute its negative itemset relevance mwNIR(Ci); add each Ci whose relevance is greater than or equal to the minimum negative itemset relevance threshold mNr (i.e., mwNIR(Ci) ≥ mNr) to the weighted negative i-itemsets Ni and the negative itemset set mwNIS. The negative itemset relevance is computed as:
mwNIR(Ci) = mwS(Ci)/(1 - mwS(Imax)),
where Imax is the sub-itemset of Ci with the maximum support.
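Steps (3.1) to (3.3) can be sketched as follows (a hypothetical helper, not the original implementation; the support formula mwS(I) = w(I)/(|I|·n) matches the values in Tables 1 and 3 of the embodiment, and the two relevance measures match the mwNIR computation 0.096/(1−0.56) = 0.218 given there):

```python
from itertools import combinations

def mw_support(records, items, n):
    """mwS(I) = w(I) / (|I| * n): sum the weights of the items of I over
    the records containing all of I, normalized by dimension and n."""
    total = sum(sum(rec[it] for it in items)
                for rec in records if all(it in rec for it in items))
    return total / (len(items) * n)

def classify_candidate(records, items, n, ms, mFr, mNr):
    """Steps (3.2)/(3.3): keep a candidate itemset as interesting-frequent
    or interesting-negative, or prune it, using mwFIR / mwNIR."""
    s = mw_support(records, items, n)
    # support of the proper sub-itemset with maximum support (Imax)
    s_max = max(mw_support(records, sub, n)
                for r in range(1, len(items))
                for sub in combinations(items, r))
    if s >= ms:                               # candidate frequent itemset
        return "frequent" if s / s_max >= mFr else "pruned"        # mwFIR
    return "negative" if s / (1 - s_max) >= mNr else "pruned"      # mwNIR
```

The double-threshold pruning (mFr for frequent itemsets, mNr for negative itemsets) is what removes uninteresting candidates before rule generation.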
(4) Mining effective Chinese feature word matrix-weighted positive and negative association rule patterns from the Chinese feature word frequent itemset set mwPIS, comprising the following steps (4.1) to (4.6):
(4.1) Extract a feature word frequent itemset Li from mwPIS and find all proper subsets of Li.
(4.2) Take any two proper subsets I1 and I2 from the set of proper subsets of Li such that I1 ∩ I2 = ∅ and I1 ∪ I2 = Li. When the supports mwS(I1) and mwS(I2) are both greater than or equal to ms, i.e., mwS(I1) ≥ ms and mwS(I2) ≥ ms, compute the correlation coefficient mwPCC(I1, I2) of the matrix-weighted frequent itemset (I1, I2). The calculation formulas are:
mwS(I1) = w(I1)/(i1 × n), mwS(I2) = w(I2)/(i2 × n),
mwPCC(I1, I2) = (mwS(I1, I2) - mwS(I1) × mwS(I2)) / sqrt(mwS(I1) × (1 - mwS(I1)) × mwS(I2) × (1 - mwS(I2))),
where i1 and i2 are the numbers of items (i.e., the dimensions) of I1 and I2, and mwS(*) > 0, mwS(*) ≠ 1.
(4.3) If the correlation coefficient mwPCC(I1, I2) of the matrix-weighted frequent itemset (I1, I2) is greater than or equal to the correlation threshold β, i.e., mwPCC(I1, I2) ≥ β: compute VMWAR(I1, I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong association rule I1→I2 is obtained and added to the matrix-weighted positive association rule set mwPAR. Compute the effective matrix-weighted negative association rule evaluation value VMWAR(¬I1, ¬I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule ¬I1→¬I2 is obtained and added to mwNAR. VMWAR(·, ·, mc, mi) is the effective rule evaluation value (formulas (17) to (20) below): it equals 1 when the rule satisfies the minimum confidence threshold mc and the minimum interestingness threshold mi, and 0 otherwise.
(4.4) If the correlation coefficient mwPCC(I1, I2) of the matrix-weighted itemset (I1, I2) is less than or equal to -β, i.e., mwPCC(I1, I2) ≤ -β: compute VMWAR(I1, ¬I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule I1→¬I2 is obtained and added to mwNAR. Compute VMWAR(¬I1, I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule ¬I1→I2 is obtained and added to mwNAR.
(4.5) Return to step (4.2) until every proper subset in the proper subset set of the feature word frequent itemset Li has been taken out exactly once; then go to step (4.6).
(4.6) Return to step (4.1) until every frequent itemset Li in the feature word frequent itemset set has been taken out exactly once; the operation of step (4) then finishes, and the method proceeds to step (5).
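The correlation test that drives steps (4.2) to (4.4) can be sketched as follows. The mwPCC formula is reconstructed from the embodiment (it reproduces the value −0.1614 computed there for supports 0.56, 0.184, 0.072); the VMWAR confidence/interestingness test is omitted, so the function only reports which rule directions are candidates:

```python
import math

def mwPCC(s1, s2, s12):
    """Correlation coefficient of a matrix-weighted itemset pair (I1, I2),
    from the supports mwS(I1)=s1, mwS(I2)=s2, mwS(I1,I2)=s12.
    Requires 0 < s1, s2 < 1 (mwS(*) > 0, mwS(*) != 1)."""
    return (s12 - s1 * s2) / math.sqrt(s1 * (1 - s1) * s2 * (1 - s2))

def candidate_rules(s1, s2, s12, beta):
    """Rule forms considered in steps (4.3)/(4.4); each must still pass
    the VMWAR confidence/interestingness test before being accepted."""
    pcc = mwPCC(s1, s2, s12)
    if pcc >= beta:
        return ["I1 -> I2", "not I1 -> not I2"]
    if pcc <= -beta:
        return ["I1 -> not I2", "not I1 -> I2"]
    return []
```

When |mwPCC| falls inside (−β, β), the itemset pair is treated as uncorrelated and no rule is generated from it.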
(5) Mining effective Chinese feature word matrix-weighted negative association rule patterns from the negative itemset set mwNIS, comprising the following steps (5.1) to (5.6):
(5.1) Extract a feature word negative itemset Ni from the Chinese feature word negative itemset set mwNIS and find all proper subsets of Ni.
(5.2) Take any two proper subsets I1 and I2 from the set of proper subsets of Ni such that I1 ∩ I2 = ∅ and I1 ∪ I2 = Ni. When the supports mwS(I1) and mwS(I2) are both greater than or equal to ms, compute the correlation coefficient mwPCC(I1, I2) of the matrix-weighted negative itemset (I1, I2); the formula for mwPCC(I1, I2) is the same as in step (4.2).
(5.3) If the correlation coefficient mwPCC(I1, I2) of the matrix-weighted negative itemset (I1, I2) is greater than or equal to the correlation threshold β, i.e., mwPCC(I1, I2) ≥ β, compute VMWAR(¬I1, ¬I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule ¬I1→¬I2 is obtained and added to mwNAR. VMWAR(¬I1, ¬I2, mc, mi) is computed by the formula in step (4.3).
(5.4) If the correlation coefficient mwPCC(I1, I2) of the matrix-weighted negative itemset (I1, I2) is less than or equal to -β, i.e., mwPCC(I1, I2) ≤ -β: compute the effective matrix-weighted negative association rule evaluation value VMWAR(I1, ¬I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule I1→¬I2 is obtained and added to the matrix-weighted negative association rule set mwNAR. Compute the evaluation value VMWAR(¬I1, I2, mc, mi); if its value equals 1, the matrix-weighted Chinese feature word strong negative association rule ¬I1→I2 is obtained and added to mwNAR. VMWAR(I1, ¬I2, mc, mi) and VMWAR(¬I1, I2, mc, mi) are computed by the formulas in step (4.4).
(5.5) Return to step (5.2) until every proper subset in the proper subset set of the feature word negative itemset Ni has been taken out exactly once; then go to step (5.6).
(5.6) Return to step (5.1) until every negative itemset Ni in the feature word negative itemset set has been taken out exactly once; the operation of step (5) then finishes;
and the matrix-weighted positive and negative pattern mining of Chinese feature words ends. Here ms is the minimum support threshold, mc is the minimum confidence threshold, mi is the minimum interestingness threshold, and β is the correlation coefficient threshold.
A mining system suitable for the correlation coefficient-based Chinese inter-word weighted positive-negative pattern mining method is characterized by comprising the following 4 modules:
the Chinese text information preprocessing module: the method is used for performing word segmentation and stop word deletion on the Chinese text information to be processed, extracting the characteristic words and calculating the weight of the characteristic words to construct a Chinese text information base and a characteristic word item base.
The Chinese feature word candidate itemset generation module: this module first mines the matrix-weighted Chinese feature word candidate 1-itemsets from the feature word item library and the Chinese text information library and calculates their supports; from the i-itemsets (i ≥ 2) onward, the Chinese feature word candidate i-itemsets are generated by Apriori-joining the frequent (i-1)-itemsets.
The Chinese characteristic word frequent item set and negative item set generating module: the module calculates the support degree of a candidate i-item set of the Chinese characteristic words, and compares the support degree with a minimum support degree threshold value to obtain a frequent i-item set and a negative i-item set of the Chinese characteristic words; calculating the association degree of the frequent item set, and comparing the association degree with a threshold value of the association degree of the frequent item set to obtain an interesting Chinese feature word frequent item set; and calculating the relevance of the negative term set, and comparing the relevance with a threshold value of the relevance of the negative term set to obtain an interesting Chinese feature word negative term set.
The Chinese feature word positive and negative association rule generation and result display module comprises: the module firstly generates a proper subset of a Chinese feature word frequent item set, calculates a correlation coefficient, an interest degree and a confidence degree of a Chinese feature word association rule mode, compares the correlation coefficient, the interest degree threshold and the confidence degree threshold, and excavates an effective matrix weighting Chinese feature word strong positive and negative association rule mode from the frequent item set; then generating a proper subset of the negative term set of the Chinese characteristic words, calculating the correlation coefficient, the interest degree and the confidence degree of the negative association rule mode of the Chinese characteristic words, comparing the correlation coefficient threshold, the interest degree threshold and the confidence degree threshold, and mining an effective matrix weighting strong negative association rule mode of the Chinese characteristic words from the negative term set; and finally, displaying the effective matrix weighted Chinese feature word positive and negative association rule mode to a user in a required form for analysis and use by the user.
The Chinese characteristic word frequent item set and negative item set generating module comprises the following 2 modules:
a characteristic word frequent item set generation module: the module calculates the support degree of the candidate item set of the Chinese characteristic words, compares the support degree with a support degree threshold value to obtain a frequent item set, calculates the association degree of the frequent item set, and compares the association degree with a correlation degree threshold value to obtain an interesting matrix weighted Chinese characteristic word frequent item set.
The feature word negative item set generation module: the module calculates the support degree of the candidate item set of the Chinese characteristic words, compares the support degree with a support degree threshold value to obtain a negative item set, calculates the relevance degree of the negative item set, and compares the relevance degree with a relevance degree threshold value to obtain an interesting matrix weighted negative item set of the Chinese characteristic words.
The Chinese feature word positive-negative association rule generation and result display module comprises the following 3 modules:
a strong positive and negative association rule generation module from a frequent item set: the module generates a proper subset of a Chinese feature word frequent item set, calculates a correlation coefficient, an interest degree and a confidence degree of the Chinese feature word association rule mode, compares the correlation coefficient, the interest degree threshold and the confidence degree threshold, and mines an effective matrix weighting Chinese feature word strong positive and negative association rule mode from the frequent item set.
A strong negative association rule generation module from the negative set of terms: the module generates a proper subset of the negative term set of the Chinese characteristic words, calculates the correlation coefficient, the interestingness and the confidence coefficient of the negative association rule mode of the Chinese characteristic words, compares the correlation coefficient threshold, the interestingness threshold and the confidence coefficient threshold, and mines an effective matrix weighting strong negative association rule mode of the Chinese characteristic words from the negative term set.
The strong positive and negative association rule display module of the feature words: the module displays the effective matrix weighted Chinese feature word positive and negative association rule mode to the user in a required form for analysis and use by the user.
The support degree threshold ms, the confidence degree threshold mc, the interest degree threshold mi and the correlation coefficient threshold beta in the mining system are input by a user.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a correlation coefficient-based Chinese feature word matrix-weighted positive and negative association pattern mining method and mining system. The invention adopts a new itemset pruning technique, avoids the generation of many invalid, false and uninteresting association patterns, greatly improves mining efficiency, and makes the mined association patterns closer to actual conditions. Compared with existing mining methods, the numbers of mined candidate itemsets, frequent itemsets, negative itemsets, and positive and negative association rule patterns are greatly reduced, the mining time is much shorter, and mining efficiency is greatly improved. Experimental results show that the double-threshold pruning strategy proposed by the method is effective, with an obvious pruning effect, and that more practical Chinese feature word association patterns can be obtained; the method has high application value and broad application prospects in text mining, information retrieval and related fields. Applying the positive and negative association patterns of Chinese feature words to web search engines such as Baidu and Google to realize query expansion improves query performance and meets users' information query needs.
(2) Using the Chinese standard dataset CWT200g as experimental data, the method is compared with a classical unweighted association pattern mining method. Experimental results show that, whether the support threshold or the confidence threshold varies, the number of candidate itemsets mined by the method is smaller than that of the comparison method, and its mining time is shorter, with a larger reduction, so the mining efficiency is greatly improved.
Drawings
FIG. 1 is a block diagram of a correlation coefficient-based mining method for weighted positive and negative patterns of Chinese words.
FIG. 2 is an overall flow chart of the correlation coefficient-based Chinese inter-word weighted positive-negative pattern mining method of the present invention.
FIG. 3 is a block diagram of the system for mining weighted positive and negative patterns between Chinese words based on correlation coefficients according to the present invention.
FIG. 4 is a block diagram of a module for generating a frequent term set and a negative term set of Chinese feature words according to the present invention.
FIG. 5 is a block diagram of the structure of the Chinese feature word positive-negative association rule generation and result display module according to the present invention.
Detailed Description
To better illustrate the technical solution of the present invention, the following introduces the chinese text data model and related concepts related to the present invention as follows:
I. Basic concepts
Let MWD = {r1, r2, …, rn} be the transaction database, where the number of transaction records is n; let Is = {i1, i2, …, im} denote the set of all items (Itemset, Is) in MWD, where the number of items is m; ij (1 ≤ j ≤ m) denotes the jth item in MWD, and its weight in transaction record ri is w[ri][ij] (0 ≤ w[ri][ij] ≤ 1). Let I1, I2 be sub-itemsets of itemset I such that I1 ∪ I2 = I and I1 ∩ I2 = ∅. The following basic definitions are given.
Definition 1. Matrix-weighted Support (mwS): the matrix-weighted support of an itemset I is mwS(I) = w(I)/(k × n), where w(I) is the weight of I accumulated over all transaction records containing I, k is the number of items in I, and n is the number of records (Tan Yihong, Lin Yaping. Mining of fully weighted association rules in the vector space model. Computer Engineering and Applications, 2003(13): 208-), as shown in formula (1).
The matrix weighted negative term set and the negative association rule support degree are shown as formulas (2) to (5).
mwS(﹁I)=1–mwS(I) (2)
mwS(I1→﹁I2)=mwS(I1,﹁I2)=mwS(I1)–mwS(I1,I2) (3)
mwS(﹁I1→I2)=mwS(﹁I1,I2)=mwS(I2)–mwS(I1,I2) (4)
mwS(﹁I1→﹁I2)=mwS(﹁I1,﹁I2)=1–mwS(I1)–mwS(I2)+mwS(I1,I2) (5)
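Formulas (3) to (5) can be checked with a few lines of Python, here using the supports that appear later in the embodiment for the itemset pair (i1, i5) (mwS(I1) = 0.56, mwS(I2) = 0.168, mwS(I1, I2) = 0.214; the helper name is illustrative):

```python
def negative_rule_supports(s1, s2, s12):
    """Formulas (3)-(5): supports of the negated itemsets/rules derived
    from mwS(I1)=s1, mwS(I2)=s2 and mwS(I1,I2)=s12."""
    return {
        "I1,not I2":     s1 - s12,                # formula (3)
        "not I1,I2":     s2 - s12,                # formula (4)
        "not I1,not I2": 1 - s1 - s2 + s12,       # formula (5)
    }

# with the embodiment's supports for (i1, i5):
sup = negative_rule_supports(0.56, 0.168, 0.214)
```

The value 1 − 0.56 − 0.168 + 0.214 = 0.486 agrees with the mwS(¬I1, ¬I2) computed in the worked example below.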
Definition 2. Matrix-weighted frequent itemset and negative itemset: for a matrix-weighted itemset I, if mwS(I) ≥ ms, the itemset I is called a matrix-weighted frequent itemset; when I1 and I2 are both matrix-weighted frequent itemsets, if mwS(I1, I2) < ms, the itemset (I1, I2) is called a matrix-weighted negative itemset, where ms is the minimum support threshold.
Definition 3. Matrix-weighted Confidence (mwC): the matrix-weighted positive and negative association rule confidence calculation formulas are given as formulas (6) to (9):
Definition 4. Matrix-weighted Pattern Correlation Coefficient (mwPCC): the correlation coefficient mwPCC(I1, I2) of a matrix-weighted pattern (I1, I2) is given by formula (10):
mwPCC(I1, I2) = (mwS(I1, I2) - mwS(I1) × mwS(I2)) / sqrt(mwS(I1) × (1 - mwS(I1)) × mwS(I2) × (1 - mwS(I2)))    (10)
where mwS(*) > 0 and mwS(*) ≠ 1.
Definition 5. Matrix-weighted Frequent Itemset Relevance (mwFIR): for a matrix-weighted frequent itemset FI = (i1, i2, …, im) (m > 1), let Imax be its sub-itemset with the maximum support. Taking the conditional probability of the frequent itemset FI given that this sub-itemset occurs as the relevance of FI, the relevance mwFIR(FI) among the sub-itemsets of the matrix-weighted frequent itemset FI is given by formula (11):
mwFIR(FI) = mwS(FI)/mwS(Imax)    (11)
Definition 6. Matrix-weighted Negative Itemset Relevance (mwNIR): for a matrix-weighted negative itemset NI = (i1, i2, …, ir) (r > 1), let Imax be its sub-itemset with the maximum support. Taking the conditional probability of the negative itemset NI given that this sub-itemset does not occur as the relevance of NI, the relevance mwNIR(NI) between the matrix-weighted negative itemset NI and its sub-itemsets is given by formula (12):
mwNIR(NI) = mwS(NI)/(1 - mwS(Imax))    (12)
Defining 7 matrix weighted positive and negative association rule interestingness: matrix-weighted positive and negative association rule interest (mwARI) formulas are shown in formulas (13) to (16).
II. Effective matrix-weighted positive and negative association rule mining idea
Assume the minimum confidence threshold is mc, the minimum interestingness threshold is mi, and the correlation coefficient threshold is β (β ∈ (0, 1)). The basic idea of effective matrix-weighted association rule mining is as follows:
(1) For an interesting matrix-weighted frequent itemset (I1, I2) whose itemsets I1 and I2 are both frequent itemsets: if mwPCC(I1, I2) ≥ β, VMWAR(I1, I2, mc, mi) = 1 and VMWAR(¬I1, ¬I2, mc, mi) = 1, then I1→I2 and ¬I1→¬I2 are effective matrix-weighted positive and negative association rules; if mwPCC(I1, I2) ≤ -β, when VMWAR(I1, ¬I2, mc, mi) = 1 and VMWAR(¬I1, I2, mc, mi) = 1, then I1→¬I2 and ¬I1→I2 are effective matrix-weighted negative rules.
Here, VMWAR(I1, I2, mc, mi), VMWAR(¬I1, ¬I2, mc, mi), VMWAR(I1, ¬I2, mc, mi) and VMWAR(¬I1, I2, mc, mi) are calculated by formulas (17) to (20).
(2) For an interesting matrix-weighted negative itemset (I1, I2) whose itemsets I1 and I2 are both frequent itemsets: if mwPCC(I1, I2) ≥ β and VMWAR(¬I1, ¬I2, mc, mi) = 1, then ¬I1→¬I2 is an effective matrix-weighted negative association rule; if mwPCC(I1, I2) ≤ -β, VMWAR(I1, ¬I2, mc, mi) = 1 and VMWAR(¬I1, I2, mc, mi) = 1, then I1→¬I2 and ¬I1→I2 are effective matrix-weighted negative association rules.
III. Interesting pruning strategies for matrix-weighted itemsets
Let the minimum frequent relevance threshold be mFr and the minimum negative relevance threshold be mNr.
Pruning strategy for an interesting matrix-weighted frequent itemset I: when mwS(I) ≥ ms, if mwFIR(I) ≥ mFr, itemset I is an interesting matrix-weighted frequent itemset and should be retained; otherwise, if mwFIR(I) < mFr, itemset I is pruned.
Pruning strategy for an interesting matrix-weighted negative itemset I: when mwS(I) < ms, if mwNIR(I) ≥ mNr, itemset I is an interesting matrix-weighted negative itemset and should be retained; otherwise, if mwNIR(I) < mNr, itemset I is pruned.
The technical solution of the present invention is further illustrated by the following specific examples.
The excavation method and system adopted by the present invention in a specific embodiment are shown in fig. 1-5.
Example: the following is an example of a Chinese text database with 5 Chinese document records and 5 feature word items and their weights, i.e., the document set is {d1, d2, d3, d4, d5} and the feature word set is {i1, i2, i3, i4, i5} = {program, queue, function, environment, member}.
The mining method of the invention is applied to this Chinese document data example to mine the Chinese feature word matrix-weighted positive and negative association patterns. The mining process is as follows (ms = 0.15, mc = 0.3, mFr = 0.3, mNr = 0.12, mi = 0.26, β = 0.1):
1. Mine the matrix-weighted feature word frequent 1-itemsets L1, as shown in Table 1, where n = 5.
Table 1:
C1      w(C1)   mwS(C1)
(i1)    2.8     0.56
(i2)    0.55    0.11
(i3)    2.6     0.52
(i4)    0.92    0.184
(i5)    0.84    0.168
As can be seen from Table 1, L1 = {(i1), (i3), (i4), (i5)}, and the feature word frequent itemset set mwPIS = {(i1), (i3), (i4), (i5)}.
2. Mine the matrix-weighted feature word frequent k-itemsets Lk and negative k-itemsets Nk, k ≥ 2.
k = 2:
(1) Apriori-join the feature word frequent 1-itemsets L1 to generate the feature word candidate 2-itemsets C2, and compute w(C2) and mwS(C2), as shown in Table 2.
Table 2:
For Table 2, the following operations are performed:
① For mwS(C2) ≥ ms, compute mwFIR(C2); add each matrix-weighted frequent 2-itemset with mwFIR(C2) ≥ mFr to L2 and the frequent itemset set mwPIS, i.e., L2 = {(i1, i3), (i1, i5)}, mwPIS = {(i1), (i3), (i4), (i5), (i1, i3), (i1, i5)}.
② For mwS(C2) < ms, compute mwNIR(C2); add each matrix-weighted negative 2-itemset with mwNIR(C2) ≥ mNr to N2 and the negative itemset set mwNIS: N2 = {(i1, i4), (i3, i5)}, mwNIS = {(i1, i4), (i3, i5)}.
k = 3:
(1) Apriori-join L2 to generate the Chinese feature word candidate 3-itemsets C3, and accumulate w(C3) and mwS(C3), as shown in Table 3.
Table 3:
C3            w(C3)   mwS(C3)   mwNIR(C3) (when mwS(C3) < ms)
(i1, i3, i5)  1.44    0.096     0.096/(1-0.56) = 0.218
For Table 3, the following operations are performed:
The proper sub-itemsets of (i1, i3, i5) are {(i1), (i3), (i5), (i1, i3), (i1, i5), (i3, i5)}; among these, the one with the maximum support is (i1), with value 0.56. Since mwS(C3) < ms, mwNIR(C3) = 0.096/(1-0.56) = 0.218 ≥ mNr,
i.e., N3 = {(i1, i3, i5)}, mwNIS = {(i1, i4), (i3, i5), (i1, i3, i5)}.
When k = 4, since L3 is empty, mining of the matrix-weighted feature word frequent k-itemsets Lk and negative k-itemsets Nk ends, and the procedure proceeds to step 3 below. The final itemset mining results are: mwPIS = {(i1), (i3), (i4), (i5), (i1, i3), (i1, i5)}, mwNIS = {(i1, i4), (i3, i5), (i1, i3, i5)}.
3. Mine the effective matrix-weighted Chinese feature word positive and negative association rule patterns from the frequent itemset set mwPIS.
Taking the feature word frequent itemset (i1, i5) in mwPIS as an example, the effective matrix-weighted positive and negative association rule pattern mining process is as follows:
The proper subsets of the frequent itemset (i1, i5) are {(i1), (i5)}; let I1 = (i1), I2 = (i5).
mwS(I1) = 0.56 ≥ ms, mwS(I2) = 0.168 ≥ ms, mwS(I1, I2) = 0.214.
Computing the correlation coefficient gives mwPCC(I1, I2) ≈ 0.646. Because mwPCC(I1, I2) > β = 0.1:
(1) Because VMWAR(I1, I2, mc, mi) = 1, the effective matrix-weighted Chinese feature word association rule I1→I2 is obtained, i.e., (i1)→(i5), or (program)→(member).
(2) mwS(¬I1, ¬I2) = 1 - 0.56 - 0.168 + 0.214 = 0.486, mwS(¬I1) = 1 - 0.56 = 0.44,
mwS(¬I2) = 1 - 0.168 = 0.832.
Because VMWAR(¬I1, ¬I2, mc, mi) ≠ 1, the rule ¬I1→¬I2 cannot be mined.
In summary, for the Chinese feature word frequent itemset (i1, i5), one effective matrix-weighted Chinese feature word association rule pattern can be mined: (i1)→(i5), or (program)→(member) (ms = 0.15, mc = 0.3, mFr = 0.3, mNr = 0.12, mi = 0.26, β = 0.1).
4. And mining an effective matrix weighted Chinese characteristic word negative association rule pattern from the negative item set mwNIS.
Taking the feature word negative itemsets (i3, i5) and (i1, i4) in mwNIS as examples, the mining process of effective matrix-weighted Chinese feature word strong negative association rule patterns is as follows:
set of negative terms (i)3,i5) Is { (i)3),(i5) Is given by I1=(i3),I2=(i5)。
mwS(I1) = 0.52 ≥ ms, mwS(I2) = 0.168 ≥ ms, mwS(I1, I2) = 0.084.
Compute the matrix-weighted correlation coefficient:
mwPCC(I1, I2) = (0.084 - 0.52 × 0.168) / sqrt(0.52 × 0.48 × 0.168 × 0.832) ≈ -0.018.
Since mwPCC(I1, I2) ≈ -0.018 > -β = -0.1, no association rule is mined for this itemset.
The negative itemset (i1, i4) has the subsets {(i1), (i4)}; let I1 = (i1), I2 = (i4).
mwS(I1) = 0.56 ≥ ms, mwS(I2) = 0.184 ≥ ms, mwS(I1, I2) = 0.072.
Compute the matrix-weighted correlation coefficient:
mwPCC(I1, I2) = (0.072 - 0.56 × 0.184) / sqrt(0.56 × 0.44 × 0.184 × 0.816) = -0.1614.
Since mwPCC(I1, I2) = -0.1614 < -β = -0.1:
(1) mwS(I1, ¬I2) = 0.56 - 0.072 = 0.488, mwS(¬I2) = 1 - 0.184 = 0.816.
Since VMWAR(I1, ¬I2, mc, mi) = 0, the negative association rule I1→¬I2 cannot be mined.
(2) mwS(¬I1, I2) = 0.184 - 0.072 = 0.112.
Since VMWAR(¬I1, I2, mc, mi) = 1, the effective negative association rule ¬I1→I2, i.e., ¬(i1)→(i4), or ¬(program)→(environment), is obtained.
In summary, for the Chinese feature word negative itemset (i1, i4), the effective matrix-weighted Chinese feature word negative association rule ¬I1→I2, i.e., ¬(i1)→(i4), or ¬(program)→(environment), can be mined (ms = 0.15, mc = 0.3, mFr = 0.3, mNr = 0.12, mi = 0.26, β = 0.1).
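The negated supports used in step 4 follow directly from inclusion-exclusion over matrix-weighted supports. The short sketch below (function and key names are illustrative) recomputes them for the negative itemset (i1, i4):

```python
def negated_supports(sa, sb, sab):
    """Inclusion-exclusion identities used in the example:
    mwS(A,~B) = mwS(A) - mwS(A,B); mwS(~A,B) = mwS(B) - mwS(A,B);
    mwS(~A,~B) = 1 - mwS(A) - mwS(B) + mwS(A,B)."""
    return {
        "A,~B": round(sa - sab, 3),
        "~A,B": round(sb - sab, 3),
        "~A,~B": round(1 - sa - sb + sab, 3),
    }

# Negative itemset (i1, i4): mwS(i1) = 0.56, mwS(i4) = 0.184, mwS(i1, i4) = 0.072
print(negated_supports(0.56, 0.184, 0.072))
# prints {'A,~B': 0.488, '~A,B': 0.112, '~A,~B': 0.328}
```

The first two values are exactly the mwS(I1, ¬I2) = 0.488 and mwS(¬I1, I2) = 0.112 appearing in the example above.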
The beneficial effects of the present invention are further illustrated by the following experiments.
In order to verify the effectiveness and correctness of the invention, an experimental source program was written, and a classical unweighted positive and negative association rule mining algorithm (WU Xin-dong, ZHANG Cheng-qi, ZHANG Shi-chao. Efficient mining of both positive and negative association rules [J]. ACM Transactions on Information Systems, 2004, 22(3): 381-405), referred to as the PNARMiner algorithm, was selected as the experimental comparison algorithm. The mining performance of the algorithms is compared and analyzed from 4 aspects: support threshold variation, combined parameter variation, rule interestingness variation and correlation coefficient variation. In the following tables, the association rules (AR) A→B, A→¬B, ¬A→B and ¬A→¬B are denoted AR1, AR2, AR3 and AR4, respectively.
The experimental data come from 12024 pure Chinese text documents in part of the corpus of the Chinese Web Test collection CWT200g (Chinese Web Test collection with 200GB Web Pages) provided by the Network Laboratory of Peking University. A text database and a feature word item library are obtained through document preprocessing such as word segmentation, stop-word removal, feature word extraction and weight calculation. The preprocessed experimental data are as follows: 8751 feature words are obtained from the experimental document set, with document frequencies df (i.e., the number of documents containing the feature word) ranging from 51 to 11258. Feature words whose df values are not less than 1500 and not more than 5838 are extracted to construct the feature word item library for mining (400 feature words in total). The experimental parameters are: n = 12024, the number of mined items (ItemNumber, ItemNum) is 50, the thresholds ms, mFr, mNr, mc, mi and β are given in each experiment, and the maximum length of the mined itemsets is 4.
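The df-based feature selection described above reduces to a single filtering pass over the term statistics. The sketch below is purely illustrative: the term_df dictionary is an invented stand-in, not the actual CWT200g corpus data.

```python
# Hypothetical document-frequency statistics produced by preprocessing
# (word segmentation, stop-word removal, feature word extraction).
term_df = {
    "program": 5100, "member": 2300, "environment": 1700,
    "rare_term": 51, "common_term": 11258,
}

DF_MIN, DF_MAX = 1500, 5838  # df bounds used to build the 400-word item library

item_library = sorted(t for t, df in term_df.items() if DF_MIN <= df <= DF_MAX)
print(item_library)  # prints ['environment', 'member', 'program']
```

Terms that are too rare (df < 1500) or too common (df > 5838) are dropped, which is what narrows the 8751 extracted feature words down to the 400-word item library.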
Experiment one: mining performance under a varying support threshold
The numbers of candidate itemsets (CI), frequent itemsets (FI), negative itemsets (NI) and association rules mined by the two algorithms as the support threshold varies are shown in Tables 1 and 2,
where the experimental parameters are: mc = 0.07, mFr = 0.06, mNr = 0.001, mi = 0.01, β = 0.05.
As can be seen from Tables 1 and 2, as the support threshold increases, the numbers of each type of itemset and association rule gradually decrease, and the numbers of itemsets and association rules mined by the MWARM-SRCCCI algorithm herein are smaller than those of the comparison algorithm PNARMiner: the number of itemsets is reduced by up to 94.9%, and the number of rules by up to 99.7%.
Experiment two: mining performance under combined parameter threshold variation
Since a valid matrix-weighted association rule is the result of comprehensively evaluating the support, confidence, interestingness and correlation coefficient of the rule, 7 groups of combined parameter (GP) values are set, each group giving the thresholds {ms, mc, mi, mNr}: GP1 = {0.03, 0.01, 0.01, 0.05}, GP2 = {0.035, 0.015, 0.015, 0.055}, GP3 = {0.038, 0.02, 0.018, 0.055}, GP4 = {0.04, 0.035, 0.02, 0.065}, GP5 = {0.05, 0.04, 0.03, 0.07}, GP6 = {0.06, 0.07, 0.04, 0.08} and GP7. The numbers of positive and negative association rules mined under these combined parameters are shown in the table below.
TABLE 3 comparison of the number of positive and negative association rules mined under combined parameter changes
The experimental results in Table 3 show that, as the combined parameter values increase, the numbers of all types of association rules gradually decrease, and the algorithm herein mines fewer rules than the comparison algorithm: the number of positive association rule (A→B) patterns decreases by up to 95.36%, while among the negative association rules the largest reduction reaches 93.99% and the smallest reaches 82.85%.
Experiment three: mining time efficiency comparison
The time for the two algorithms to mine itemsets and association rules under varying support thresholds and varying combined parameters is shown in Tables 4 and 5.
The results in Tables 4 and 5 show that the MWARM-SRCCCI algorithm herein takes less time to mine itemsets and association rules than the comparison algorithm PNARMiner, with reductions of 51.92% and 74.74% under support threshold variation (mc = 0.07, mFr = 0.06, mNr = 0.001, mi = 0.01, β = 0.05) and combined parameter variation, respectively, indicating that the mining efficiency of the algorithm herein is indeed improved.
Experiment four: mining performance under rule interestingness threshold variation
This experiment mainly verifies the validity of the association rule interestingness threshold mi of the algorithm herein; the number of matrix-weighted association rules mined by the algorithm as mi varies is shown in Table 6.
TABLE 6 number of rules mined by the algorithm herein when mi varies
| mi   | AR1  | AR2 | AR3 | AR4   |
|------|------|-----|-----|-------|
| 0.01 | 1320 | 247 | 247 | 10838 |
| 0.03 | 1320 | 153 | 153 | 1096  |
| 0.05 | 1320 | 70  | 70  | 108   |
| 0.09 | 1320 | 0   | 0   | 0     |
| 0.80 | 1130 | 0   | 0   | 0     |
| 0.90 | 468  | 0   | 0   | 0     |
| 0.95 | 34   | 0   | 0   | 0     |
As can be seen from Table 6, as the interestingness threshold mi increases, the number of association rules decreases. It can be seen that mi has a large impact on the matrix-weighted positive and negative association rule patterns: negative association rules appear only at low mi values, while positive association rules are affected only at high mi values.
Experiment five: itemset pruning performance analysis
In order to verify the effectiveness of the itemset pruning strategy proposed by the present invention, the pruning performance of the MWARM-SRCCCI algorithm is experimentally analyzed in two cases, namely variation of the frequent itemset correlation threshold mFr and variation of the negative itemset correlation threshold mNr. The experimental results are shown in Tables 7 and 8, where mFr = 0 and mNr = 0 correspond to the case without pruning.
Tables 7 and 8 show that, as mFr and mNr increase, more frequent itemsets and negative itemsets are pruned, and the pruning effect becomes more significant. Meanwhile, the mNr value is much smaller than the mFr value, indicating the advantage of setting dual correlation thresholds.
The above experimental results show that the present invention has good mining performance: compared with the existing mining algorithm, the numbers of mined candidate itemsets, frequent itemsets, negative itemsets, and positive and negative association rule patterns are greatly reduced, the pruning effect is obvious, the mining time is greatly shortened, and the mining efficiency is significantly improved.