CN107247970A

CN107247970A - Method and device for mining association rules of commodity qualification rates

Info

Publication number: CN107247970A
Application number: CN201710487560.8A
Authority: CN
Inventors: 王连印; 凌建华; 魏旭晖; 黄景涛; 黄晖
Original assignee: SHANGHAI TENLY SOFTWARE Inc; Information Center Of State Administration Of Quality Supervision Inspection And Quarantine
Current assignee: SHANGHAI TENLY SOFTWARE Inc; Information Center Of State Administration Of Quality Supervision Inspection And Quarantine
Priority date: 2017-06-23
Filing date: 2017-06-23
Publication date: 2017-10-13

Abstract

The present invention provides a method and device for mining association rules of commodity qualification rates. The device comprises: a storage module for obtaining and storing an original training data set; a first mining module for performing feature classification on the training data set using a decision tree algorithm and extracting a classification feature-variable importance data set; a second mining module for setting a feature-parameter importance threshold and cross-checking the obtained feature-variable importance data set against tuning-parameter data, so as to exclude distractor items from the multidimensional association rule training data set and screen out a pure feature-variable parameter set; and a third mining module for obtaining a commodity qualification rate rule model from the pure feature-variable parameter set by means of multidimensional association rules. The advantages of the invention are: the input variables of the association rule model are optimized; the normalized information gain of the generated decision tree is used, which avoids the computational performance problems a decision tree faces with continuous variables and ordinal data; and the generalization and pruning optimization problems of the generated decision tree do not arise.

Description

Method and device for mining association rules of commodity qualification rates

Technical field

The present invention relates to a method and device for mining association rules of commodity qualification rates.

Background art

Inspection and quarantine business statistics are the collection and tabulation of data generated by routine inspection and quarantine operations. They reflect, in aggregate, the operating status of the inspection and quarantine business over a given period, and support analysis of each line of that business from different angles, including the data produced by enterprise inspection declarations, centralized document review, on-site inspection, laboratory testing and the like.

In daily inspection and quarantine operations, sampling inspection is generally used, because comprehensive inspection and testing is almost impossible. Since not every batch of a given commodity is inspected, mining the quality patterns of import and export commodities and determining the key inspection contents, tests and risk levels has become an important way for big data to help quality inspection departments solve this difficult problem.

At present the industry uses big data analysis to discover such patterns, most commonly with multidimensional association rules. However, multidimensional association rules have the following shortcomings:

The database tables are very large and the method has no ability to screen its input data, so too much invalid or unrelated variable information is produced and the generated model tends to be overly general. In addition, when the support is low or a large number of hash functions are added, the efficiency of mining multidimensional association rules becomes very low.

Summary of the invention

To address the technical problems that the multidimensional association rules used for big data analysis of inspection and quarantine commodities operate on huge data volumes, cannot screen their input, and are inefficient, the present invention provides a method and device that optimize multidimensional association rules with a decision tree model algorithm, as follows:

A method for mining association rules of commodity qualification rates, the mining method comprising the following steps:

A. Obtain an original training data set;

B. Perform feature classification on the training data set using a decision tree algorithm, and extract a classification feature-variable importance data set;

C. Set a feature-parameter importance threshold and cross-check the feature-variable importance data set obtained in step B against the tuning-parameter data, so as to exclude distractor items from the multidimensional association rule training data set and screen out a pure feature-variable parameter set;

D. Obtain a commodity qualification rate rule model from the pure feature-variable parameter set obtained in step C by means of multidimensional association rules.
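The screening in step C can be illustrated with a minimal Python sketch. The text does not specify how the importance data set is crossed with the tuning-parameter data, so this sketch assumes the tuning data simply supplies a set of feature names to exclude. The function name `select_pure_features`, the feature names and the scores are all hypothetical.

```python
def select_pure_features(importance, threshold, excluded=frozenset()):
    """Keep features whose importance reaches the threshold and which the
    (assumed) tuning data has not flagged as distractors."""
    return sorted(
        name
        for name, score in importance.items()
        if score >= threshold and name not in excluded
    )


# Hypothetical importance scores produced by step B.
importance = {
    "origin_country": 0.46,
    "hs_code": 0.31,
    "port": 0.18,
    "declaration_id": 0.05,
}

# "port" stands in for a feature that the tuning data marks as a distractor.
pure_features = select_pure_features(importance, threshold=0.10, excluded={"port"})
print(pure_features)  # ['hs_code', 'origin_country']
```

Only the surviving features would then be passed to the association rule mining of step D, which is what keeps unrelated variables out of the rule search.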

On the basis of the above technical solution, further, the decision tree algorithm used in step B to perform feature classification on the training data set is the C4.5 decision tree algorithm.

Further, a device for mining association rules of commodity qualification rates is characterized by comprising:

a storage module for obtaining and storing an original training data set;

a first mining module for performing feature classification on the training data set using a decision tree algorithm, and extracting a classification feature-variable importance data set;

a second mining module for setting a feature-parameter importance threshold and cross-checking the obtained feature-variable importance data set against the tuning-parameter data, so as to exclude distractor items from the multidimensional association rule training data set and screen out a pure feature-variable parameter set;

a third mining module for obtaining a commodity qualification rate rule model from the pure feature-variable parameter set by means of multidimensional association rules.

The advantages of the invention are: the input variables of the association rule model are optimized; at the same time, the normalized information gain of the generated decision tree is used, which avoids the computational performance problems a decision tree faces with continuous variables and ordinal data; and the generalization and pruning optimization problems of the generated decision tree do not arise.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method for mining association rules of commodity qualification rates according to the present invention;

Fig. 2 is a schematic structural diagram of the device for mining association rules of commodity qualification rates according to the present invention.

Detailed description

Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.

As shown in Fig. 1, a method for mining association rules of commodity qualification rates comprises the following steps:

A. Obtain an original training data set;

Step B is specifically as follows:

B1: For the training set obtained in step A, determine whether it is a multi-node or a single-node data set; if it is a single-node data set, proceed directly to step D to build the model;

B2: Let S be a set of n data samples, and let the sample set be divided into c different classes $C_i$ ($i = 1, 2, \dots, c$), where each class $C_i$ contains $n_i$ samples. The information entropy, or expected information, of dividing S into c classes is

$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $p_i$ is the probability that a sample in S belongs to the i-th class $C_i$, i.e. $p_i = n_i / n$.

Let $V(A)$ be the set of all distinct values of attribute A, and let $S_v$ be the subset of samples in S whose value of attribute A is v, i.e. $S_v = \{ s \in S \mid A(s) = v \}$. After attribute A is selected, $E(S_v)$ is the entropy of classifying the sample set $S_v$ at each branch node. The expected entropy caused by selecting A is defined as the weighted sum of the entropies of the subsets $S_v$, where each weight is the proportion $|S_v| / |S|$ of the original sample set S that belongs to $S_v$. That is, the expected entropy is

$$E(S, A) = \sum_{v \in V(A)} \frac{|S_v|}{|S|} \, E(S_v)$$

where $E(S_v)$ is the information entropy of dividing the samples in $S_v$ into c classes. The information gain $\mathrm{Gain}(S, A)$ of attribute A relative to sample set S is defined as

$$\mathrm{Gain}(S, A) = E(S) - E(S, A)$$

The information gain $\mathrm{Gain}(S, A)$ is the expected reduction in entropy obtained by knowing the value of attribute A. The larger $\mathrm{Gain}(S, A)$ is, the more information the test attribute A provides for classification.

When information gain is used to split the training data set on a feature, it is biased toward features that take many distinct values. Using the information gain ratio corrects this problem and serves as another criterion for feature selection. The information gain ratio $\mathrm{GainRatio}(S, A)$ is defined as

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}, \qquad \mathrm{SplitInfo}(S, A) = -\sum_{v \in V(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
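The quantities in step B2 can be sketched in Python as follows. This is a minimal illustration for categorical attributes only, using base-2 logarithms and the usual C4.5 split information as the denominator of the gain ratio. The function names and the toy data are assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels (or attribute values), in bits."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def partition(values, labels):
    """Group the class labels by the attribute value of each sample."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return groups


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the attribute."""
    n = len(labels)
    expected = sum(len(g) / n * entropy(g) for g in partition(values, labels).values())
    return entropy(labels) - expected


def gain_ratio(values, labels):
    """Information gain divided by the split information of the attribute."""
    split_info = entropy(values)  # entropy of the attribute's own value distribution
    if split_info == 0:           # attribute has a single value: it cannot split the data
        return 0.0
    return information_gain(values, labels) / split_info


# Toy data: inspection result per batch, and two hypothetical attributes.
result = ["pass", "pass", "fail", "fail"]
origin = ["X", "X", "Y", "Y"]    # separates the classes perfectly
weekday = ["Mon", "Tue", "Mon", "Tue"]  # carries no information about the result

print(gain_ratio(origin, result))   # 1.0
print(gain_ratio(weekday, result))  # 0.0
```

Continuous attributes, which C4.5 handles by searching for a split threshold, are deliberately left out of this sketch.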

B3: Select the feature with the currently largest information gain ratio to build the current node, and record this feature as a classification parameter;

B4: Build the decision tree from the corresponding nodes while traversing the data set, obtaining all information gain ratios.

B5: Normalize the information gain ratios, then output and save them as the classification feature-variable importance data set.
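Steps B3 and B5 can be sketched as follows, assuming the information gain ratios of the candidate features have already been computed. The text does not state which normalization is used, so this sketch assumes scaling so that the importances sum to 1. The function name, feature names and ratio values are hypothetical.

```python
def normalize_importance(gain_ratios):
    """Scale gain ratios so they sum to 1 (an assumed normalization scheme)."""
    total = sum(gain_ratios.values())
    if total == 0:
        return {name: 0.0 for name in gain_ratios}
    return {name: value / total for name, value in gain_ratios.items()}


# Hypothetical gain ratios collected while building the tree (step B4).
gain_ratios = {"origin_country": 0.30, "hs_code": 0.15, "port": 0.05}

# B3: the feature with the largest gain ratio becomes the current node.
best_feature = max(gain_ratios, key=gain_ratios.get)
print(best_feature)  # origin_country

# B5: the normalized ratios form the feature-variable importance data set.
importance = normalize_importance(gain_ratios)
```

Other schemes, such as min-max scaling, would fit the description equally well; the choice affects only how the threshold in step C is interpreted.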

Step C is as follows:

C1: Input the feature-variable importance data set DB obtained in step B and the minimum support for the multidimensional association rules;

C2: Scan the data set to first find all frequent itemsets, i.e. the itemsets whose frequency of occurrence is at least the predefined minimum support. Then generate strong association rules from the frequent itemsets; these rules must satisfy both the minimum support and the minimum confidence. The desired rules are generated from the frequent itemsets found in the first step: all rules containing only items of the set are produced, where the right-hand side of each rule has exactly one item.

The definitions are as follows. An association rule can be expressed as an implication of the form A → B, where A and B are logical formulas in conjunctive normal form and A ∩ B = ∅. Its main parameters are support and confidence.

(1) Support S

The percentage of transactions in transaction set D that contain both A and B is called the support S of the rule A → B.

The support is calculated as:

S(A → B) = (number of transactions containing both A and B) / (total number of transactions) × 100%

(2) Confidence C

Among the transactions in transaction set D that contain A, the percentage that also contain B is called the confidence C of the rule A → B.

The confidence is calculated as:

C(A → B) = (number of transactions containing both A and B) / (number of transactions containing A) × 100%

A rule that satisfies both the minimum support and the minimum confidence is called a strong association rule; these are the rules that association rule mining aims to find.
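The two measures can be computed directly from their definitions, as in the minimal sketch below. Each transaction is assumed to be a set of `attribute=value` items describing one inspected batch; the item names and the data are hypothetical. The functions return fractions between 0 and 1, which can be multiplied by 100 to obtain percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    count_a = sum(a <= t for t in transactions)
    if count_a == 0:
        return 0.0
    return sum(both <= t for t in transactions) / count_a


# Hypothetical inspection records, one set of items per batch.
transactions = [
    frozenset({"origin=X", "category=toys", "result=fail"}),
    frozenset({"origin=X", "category=toys", "result=fail"}),
    frozenset({"origin=X", "category=food", "result=pass"}),
    frozenset({"origin=Y", "category=toys", "result=pass"}),
]

rule_lhs = {"origin=X", "category=toys"}
rule_rhs = {"result=fail"}
print(support(transactions, rule_lhs | rule_rhs))   # 0.5
print(confidence(transactions, rule_lhs, rule_rhs))  # 1.0
```

With a minimum support of 0.4 and a minimum confidence of 0.8, for instance, this rule would count as a strong association rule.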

C3: Use the downward closure property: if an itemset is frequent, then every non-empty subset of it must also be frequent. Proceeding level by level in this way, generate all frequent itemsets, and then find the qualifying association rules among the frequent itemsets.
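The property used in step C3 follows from the fact that support cannot increase when items are added to an itemset. A short derivation, writing $\mathrm{supp}(X)$ for the fraction of transactions in $D$ that contain itemset $X$ and $s_{\min}$ for the minimum support (both symbols introduced here for illustration):

```latex
% Every transaction that contains Y also contains any subset X of Y:
X \subseteq Y \;\Longrightarrow\; \{\, t \in D : Y \subseteq t \,\} \subseteq \{\, t \in D : X \subseteq t \,\}
\;\Longrightarrow\; \mathrm{supp}(X) \ge \mathrm{supp}(Y).

% Hence a frequent itemset has only frequent subsets:
\mathrm{supp}(Y) \ge s_{\min} \;\Longrightarrow\; \mathrm{supp}(X) \ge s_{\min} \quad \text{for all } \emptyset \ne X \subseteq Y.

% Contrapositive, which justifies pruning candidates:
\mathrm{supp}(X) < s_{\min} \;\Longrightarrow\; \mathrm{supp}(Y) < s_{\min} \quad \text{for all } Y \supseteq X.
```

The contrapositive is the form used in practice: a candidate itemset can be discarded without scanning the data as soon as one of its subsets is known to be infrequent.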

C4: Generate the frequent itemsets of the next size through two steps, joining and pruning. For example:

1. Join: the candidate set is $C_k = L_{k-1} \bowtie L_{k-1}$, where $L_{k-1}$ is the set of frequent (k-1)-itemsets. Merge the itemsets that differ only in their last element. For example, from the frequent 2-itemsets

{1,2}, {1,3}, {1,4}, {2,3}, {2,4}

generate the frequent 3-itemsets:

Because {1,2}, {1,3} and {1,4} are identical except for their last element, the union of {1,2} and {1,3} gives {1,2,3}, the union of {1,2} and {1,4} gives {1,2,4}, and the union of {1,3} and {1,4} gives {1,3,4}. However, the subset {3,4} of {1,3,4} is not among the frequent 2-itemsets, so {1,3,4} must be removed. By the same reasoning, {2,3} and {2,4} join to give {2,3,4}, which is also removed because it contains {3,4}.

2. Prune: if a merged set does not satisfy the support requirement, delete it.
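The join step and the subset-based pruning described above can be sketched in Python as follows. Itemsets are represented as sorted tuples, and the function name `apriori_gen` is an assumption borrowed from common descriptions of the Apriori algorithm. The sketch covers candidate generation only; counting the support of the surviving candidates against the data (item 2 above) is a separate pass.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.

    frequent_prev: iterable of sorted tuples, all of the same length k-1.
    """
    prev = sorted(frequent_prev)
    if not prev:
        return []
    prev_set = set(prev)
    k = len(prev[0]) + 1
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] != b[:-1]:
                break  # the list is sorted, so no later itemset shares a's prefix
            candidate = a + (b[-1],)  # join: same prefix, different last element
            # Keep the candidate only if every (k-1)-subset is frequent.
            if all(sub in prev_set for sub in combinations(candidate, k - 1)):
                candidates.append(candidate)
    return candidates


frequent_2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
print(apriori_gen(frequent_2))  # [(1, 2, 3), (1, 2, 4)]
```

On the example from the text, {1,3,4} and {2,3,4} are both dropped because {3,4} is not a frequent 2-itemset, leaving {1,2,3} and {1,2,4} as candidates.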

C5: For all frequent itemsets that satisfy the minimum support, obtain the strong association rules according to the minimum confidence.
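Step C5 can be sketched as follows, assuming the frequent itemsets and their supports are already available as a dictionary. In line with the rule form described in step C2, each rule has exactly one item on its right-hand side. The function name, the item names and the support values are hypothetical.

```python
def strong_rules(frequent, min_confidence):
    """Derive single-consequent rules from frequent itemsets.

    frequent: dict mapping frozenset itemsets to their support (as a fraction).
    Returns a list of (antecedent, consequent, support, confidence) tuples.
    """
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for item in sorted(itemset):
            antecedent = itemset - {item}
            # By downward closure the antecedent is itself a frequent itemset.
            conf = sup / frequent[antecedent]
            if conf >= min_confidence:
                rules.append((tuple(sorted(antecedent)), item, sup, conf))
    return rules


# Hypothetical frequent itemsets with their supports.
frequent = {
    frozenset({"origin=X"}): 0.75,
    frozenset({"result=fail"}): 0.5,
    frozenset({"origin=X", "result=fail"}): 0.5,
}

for rule in strong_rules(frequent, min_confidence=0.8):
    print(rule)  # (('result=fail',), 'origin=X', 0.5, 1.0)
```

For a qualification rate model, one would probably keep only the rules whose right-hand side is an inspection result; that filter is omitted here because the text does not describe it.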

As shown in Fig. 2, a device for mining association rules of commodity qualification rates is characterized by comprising:

a storage module 10 for obtaining and storing an original training data set;

a first mining module 11 for performing feature classification on the training data set using a decision tree algorithm, and extracting a classification feature-variable importance data set;

a second mining module 12 for setting a feature-parameter importance threshold and cross-checking the obtained feature-variable importance data set against the tuning-parameter data, so as to exclude distractor items from the multidimensional association rule training data set and screen out a pure feature-variable parameter set;

a third mining module 13 for obtaining a commodity qualification rate rule model from the pure feature-variable parameter set by means of multidimensional association rules.

Although embodiments of the invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the invention. A person of ordinary skill in the art may change, modify, substitute and vary the above embodiments within the scope of the invention without departing from its principles and purpose. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A method for mining association rules of commodity qualification rates, characterized in that the mining method comprises the following steps:

A. Obtain an original training data set;

2. The method for mining association rules of commodity qualification rates according to claim 1, characterized in that the decision tree algorithm used in step B to perform feature classification on the training data set is the C4.5 decision tree algorithm.

3. A device for mining association rules of commodity qualification rates, characterized by comprising:

a storage module for obtaining and storing an original training data set;