CN107247970A - Method and device for mining association rules for commodity qualification rates - Google Patents
- Publication number
- CN107247970A (application number CN201710487560.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- characteristic
- decision tree
- qualification rate
- association rules
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a method and a device for mining association rules for commodity qualification rates. The device comprises: a storage module, for acquiring and storing an original training data set; a first mining module, for classifying the features of the training data set using a decision tree algorithm and extracting a feature variable importance data set; a second mining module, for setting a feature parameter importance threshold, crossing the resulting feature variable importance data set with parameter-tuning data to exclude interfering items from the multidimensional association rule training data set, and screening to obtain a clean feature variable parameter set; and a third mining module, for deriving a commodity qualification rate rule model from the clean feature variable parameter set by means of multidimensional association rules. The advantages of the invention are that the input variables of the association rule model are optimized; that the normalized information gain of the generated decision tree is used, which avoids the computational performance problems a decision tree faces with continuous variables and ordinal data; and that no generalization or pruning optimization of the generated decision tree is required.
Description
Technical field
The present invention relates to a method and a device for mining association rules for commodity qualification rates.
Background art
Inspection and quarantine business statistics are the collection and aggregation of the data generated by routine inspection and quarantine work. They reflect the overall operating condition of inspection and quarantine work over a given period and support analysis of each line of that work from different angles. They include the data generated by enterprise inspection declarations, centralized document review, on-site inspection, laboratory testing, and so on.
Routine inspection and quarantine work generally relies on sampling, because comprehensive inspection is practically impossible. Since not every batch of goods can be inspected, mining the quality patterns of import and export commodities to determine the key inspection items, the tests to perform, and the degree of risk has become an important way for big data to help quality inspection authorities solve this difficult problem.
At present the industry analyzes such patterns with big data techniques, most commonly multidimensional association rules. Multidimensional association rules, however, have the following shortcomings: the database tables are very large and the method has no ability to screen its input data, so excessive invalid or unrelated variable information is produced; the generated model tends to over-generalize; and when the support is low and a large number of hash functions are added, the efficiency of mining multidimensional association rules becomes very low.
Summary of the invention
To address the technical problems that the multidimensional association rules used in big data analysis of commodities under inspection and quarantine involve huge volumes of data, lack screening ability, and are inefficient, the present invention provides a method and a device that use a decision tree model algorithm to optimize multidimensional association rules, as follows:
A method for mining association rules for commodity qualification rates, comprising the following steps:
A. Acquire an original training data set.
B. Classify the features of the training data set using a decision tree algorithm, and extract a feature variable importance data set.
C. Set a feature parameter importance threshold, cross the feature variable importance data set obtained in step B with parameter-tuning data to exclude interfering items from the multidimensional association rule training data set, and screen to obtain a clean feature variable parameter set.
D. Derive a commodity qualification rate rule model from the clean feature variable parameter set obtained in step C by means of multidimensional association rules.
On the basis of the above technical solution, further, the decision tree algorithm used in step B to classify the features of the training data set is the C4.5 decision tree algorithm.
Further, a device for mining association rules for commodity qualification rates, characterized by comprising:
a storage module, for acquiring and storing an original training data set;
a first mining module, for classifying the features of the training data set using a decision tree algorithm and extracting a feature variable importance data set;
a second mining module, for setting a feature parameter importance threshold, crossing the resulting feature variable importance data set with parameter-tuning data to exclude interfering items from the multidimensional association rule training data set, and screening to obtain a clean feature variable parameter set;
a third mining module, for deriving a commodity qualification rate rule model from the clean feature variable parameter set by means of multidimensional association rules.
The advantages of the invention are as follows: the input variables of the association rule model are optimized; the normalized information gain of the generated decision tree is used, which avoids the computational performance problems a decision tree faces with continuous variables and ordinal data; and no generalization or pruning optimization of the generated decision tree is required.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the method for mining association rules for commodity qualification rates according to the present invention;
Fig. 2 is a structural diagram of the device for mining association rules for commodity qualification rates according to the present invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.
As shown in Fig. 1, a method for mining association rules for commodity qualification rates comprises the following steps:
A. Acquire an original training data set.
B. Classify the features of the training data set using a decision tree algorithm, and extract a feature variable importance data set.
C. Set a feature parameter importance threshold, cross the feature variable importance data set obtained in step B with parameter-tuning data to exclude interfering items from the multidimensional association rule training data set, and screen to obtain a clean feature variable parameter set.
D. Derive a commodity qualification rate rule model from the clean feature variable parameter set obtained in step C by means of multidimensional association rules.
Step B is as follows:
B1: Based on the training set obtained in step A, determine whether the training set is a multi-node or a single-node data set. If it is a single-node data set, proceed directly to step D to build the model.
B2: Let S be a set of n data samples, divided into c distinct classes $C_i$ ($i = 1, 2, \dots, c$), where class $C_i$ contains $n_i$ samples. The information entropy, or expected information, of partitioning S into c classes is

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $p_i$ is the probability that a sample in S belongs to the i-th class $C_i$, i.e. $p_i = n_i / n$.
Let $\mathrm{Values}(A)$ be the set of all distinct values of attribute A, and let $S_v$ be the subset of samples in S for which attribute A takes the value v, i.e. $S_v = \{ s \in S \mid A(s) = v \}$. After attribute A is selected, $\mathrm{Entropy}(S_v)$ is the entropy of classifying the sample set $S_v$ at the corresponding branch node. The expected entropy caused by selecting A is defined as the weighted sum of the entropies of the subsets $S_v$, where each weight is the proportion of the original sample set S that belongs to $S_v$, i.e. $|S_v| / |S|$. The expected entropy is therefore

$$\mathrm{Entropy}(S, A) = \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$$

where $\mathrm{Entropy}(S_v)$ is the information entropy of dividing the samples in $S_v$ into c classes. The information gain $\mathrm{Gain}(S, A)$ of attribute A relative to sample set S is defined as

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \mathrm{Entropy}(S, A)$$

The information gain $\mathrm{Gain}(S, A)$ is the expected reduction in entropy that results from knowing the value of attribute A. The larger $\mathrm{Gain}(S, A)$ is, the more information the test attribute A provides for classification.
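As a minimal sketch of the two quantities defined above, the following Python computes the entropy of a label set and the information gain of one attribute. The record fields (`origin`, `channel`), the labels and the function names are hypothetical and chosen only for illustration; they do not come from the patent.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of the labels minus the expected entropy after splitting on attr."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    # Weighted sum of subset entropies, weights |S_v| / |S|.
    expected = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - expected


# Hypothetical toy inspection records (assumed data, not from the patent).
rows = [
    {"origin": "X", "channel": "sea"},
    {"origin": "X", "channel": "air"},
    {"origin": "Y", "channel": "sea"},
    {"origin": "Y", "channel": "air"},
]
labels = ["pass", "pass", "fail", "fail"]

gain_origin = information_gain(rows, labels, "origin")
gain_channel = information_gain(rows, labels, "channel")
```

In this toy data `origin` separates the two classes completely while `channel` carries no information about the label, so the first gain equals the full entropy of the labels and the second is zero.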
When information gain is used to choose the feature on which to split the training data set, it is biased toward features with many distinct values. Using the information gain ratio corrects this problem, and it serves as another criterion for feature selection. The information gain ratio is defined as

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}, \qquad \mathrm{SplitInfo}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
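The bias that the gain ratio corrects can be illustrated with a short Python sketch. It assumes the standard C4.5 form of the ratio (information gain divided by the entropy of the attribute's own value distribution); the attribute names `batch_id` and `origin` and the data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(attr_values, labels):
    """Information gain of an attribute divided by its split information."""
    n = len(labels)
    by_value = {}
    for value, label in zip(attr_values, labels):
        by_value.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(
        len(s) / n * entropy(s) for s in by_value.values()
    )
    split_info = entropy(attr_values)  # entropy of the attribute values themselves
    return gain / split_info if split_info > 0 else 0.0


labels = ["pass", "pass", "fail", "fail"]
batch_id = ["b1", "b2", "b3", "b4"]  # hypothetical many-valued attribute
origin = ["X", "X", "Y", "Y"]        # hypothetical two-valued attribute

ratio_batch = gain_ratio(batch_id, labels)
ratio_origin = gain_ratio(origin, labels)
```

Both attributes have the same information gain on this data, but `batch_id` splits the set into four singletons, so its split information is larger and its gain ratio is lower than that of `origin`.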
B3: Select the feature with the largest current information gain ratio to build the current node, and record this feature classification parameter.
B4: Build the decision tree node by node while traversing the data set, obtaining all the information gain ratios.
B5: Normalize the information gain ratios and save the output as the feature variable importance data set.
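The normalization in B5 and the threshold screening that follows it can be sketched as below. The patent does not state which normalization is used, so dividing each gain ratio by the sum of all gain ratios is an assumption, as are the feature names, the scores and the threshold value.

```python
def normalize_importance(gain_ratios):
    """Scale gain ratios so they sum to 1 (assumed normalization scheme)."""
    total = sum(gain_ratios.values())
    if total == 0:
        return {name: 0.0 for name in gain_ratios}
    return {name: value / total for name, value in gain_ratios.items()}


def select_features(importance, threshold):
    """Keep the features whose normalized importance reaches the threshold."""
    return [name for name, score in importance.items() if score >= threshold]


# Hypothetical gain ratios recorded while building the tree.
gain_ratios = {"origin": 1.0, "hs_code": 0.75, "channel": 0.25}

importance = normalize_importance(gain_ratios)
kept = select_features(importance, threshold=0.2)
```

Features below the threshold (here `channel`) are treated as interfering items and are left out of the data passed on to association rule mining.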
Step C is as follows:
C1: Input the feature variable importance data set DB obtained in step B and the minimum support for the multidimensional association rules.
C2: First scan the data set to find all frequent itemsets, that is, itemsets that occur at least as often as the predefined minimum support. Then generate strong association rules from the frequent itemsets; these rules must satisfy both the minimum support and the minimum confidence. The desired rules are produced from the frequent itemsets just found: all rules that contain only items of a given itemset are generated, and the right-hand side of each rule contains a single item.
An association rule is defined as follows. It can be expressed as an implication of the form A → B, where A and B are logical formulas in conjunctive normal form and A ∩ B = ∅. Its main parameters are support and confidence.
(1) Support S
The percentage of transactions in transaction set D that contain both A and B is called the support S of rule A → B.
Support is calculated as:
S(A → B) = (number of transactions containing both A and B) / (total number of transactions) × 100%
(2) Confidence C
The ratio, in transaction set D, of the number of transactions containing both A and B to the number of transactions containing A is called the confidence C of rule A → B.
Confidence is calculated as:
C(A → B) = (number of transactions containing both A and B) / (number of transactions containing A) × 100%
A rule that satisfies both the minimum support and the minimum confidence is called a strong association rule; these are the rules that association rule mining aims to find.
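The two measures can be checked on a handful of transactions. The sketch below returns fractions rather than percentages, and the item strings such as `origin=X` are a hypothetical encoding of attribute-value pairs, not something the patent prescribes.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of A and B together divided by the support of A alone."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base > 0 else 0.0


# Hypothetical inspection records encoded as sets of attribute=value items.
transactions = [
    {"origin=X", "channel=sea", "result=pass"},
    {"origin=X", "channel=air", "result=pass"},
    {"origin=Y", "channel=sea", "result=fail"},
    {"origin=X", "channel=sea", "result=fail"},
]

s = support(transactions, {"origin=X", "result=pass"})
c = confidence(transactions, {"origin=X"}, {"result=pass"})
```

Two of the four records contain both `origin=X` and `result=pass`, and three contain `origin=X`, so the rule `origin=X → result=pass` has support 2/4 and confidence 2/3.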
C3: Use the downward closure property: if an itemset is frequent, then every non-empty subset of it must also be frequent. Proceeding level by level in this way, generate all frequent itemsets, and then find the qualifying association rules among the frequent itemsets.
C4: Generate the frequent itemsets through two steps, join and prune. For example:
1. Join $L_{k-1}$ with itself, where $L_{k-1}$ is the set of frequent (k-1)-itemsets, merging the itemsets that differ only in their last element. Starting from the frequent 2-itemsets
{1,2}, {1,3}, {1,4}, {2,3}, {2,4}
generate the frequent 3-itemsets:
Because {1,2}, {1,3} and {1,4} are identical except for their last element, the union of {1,2} and {1,3} gives {1,2,3}, the union of {1,2} and {1,4} gives {1,2,4}, and the union of {1,3} and {1,4} gives {1,3,4}. However, because the subset {3,4} of {1,3,4} is not among the frequent 2-itemsets, {1,3,4} must be removed. (By the same rule, {2,3} and {2,4} join to {2,3,4}, which is also removed because {3,4} is not frequent.)
2. For each merged set, if its support does not meet the requirement, delete the merged set.
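The join and prune steps can be reproduced on the example above with a short Python sketch. The function name `apriori_gen` follows common usage in descriptions of the Apriori algorithm and is not taken from the patent; the support check of step 2 is left out because it needs the transaction data.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Join (k-1)-itemsets that share all but their last item, then prune."""
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    prev_set = set(prev)
    candidates, pruned = [], []
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:  # differ only in the last element
            cand = a + (b[-1],)
            # Downward closure: every (k-1)-subset must itself be frequent.
            subsets = combinations(cand, len(cand) - 1)
            if all(sub in prev_set for sub in subsets):
                candidates.append(cand)
            else:
                pruned.append(cand)
    return candidates, pruned


# Frequent 2-itemsets from the example in the text.
L2 = [{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}]
candidates, pruned = apriori_gen(L2)
```

Here `candidates` holds (1, 2, 3) and (1, 2, 4), while `pruned` holds (1, 3, 4) and (2, 3, 4), both rejected because {3, 4} is not a frequent 2-itemset.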
C5: For all frequent itemsets that satisfy the minimum support, obtain the strong association rules according to the minimum confidence.
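A sketch of this last step, assuming the frequent itemsets have already been found and that each rule has a single item on its right-hand side, as stated in C2. The transactions, the item encoding and the confidence threshold are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def strong_rules(transactions, frequent_itemsets, min_confidence):
    """Emit rules (antecedent, consequent, confidence) with one consequent item."""
    rules = []
    for itemset in frequent_itemsets:
        if len(itemset) < 2:
            continue
        for item in sorted(itemset):
            antecedent = itemset - {item}
            base = support(transactions, antecedent)
            if base == 0:
                continue
            conf = support(transactions, itemset) / base
            if conf >= min_confidence:
                rules.append((tuple(sorted(antecedent)), item, conf))
    return rules


# Hypothetical records after feature screening.
transactions = [
    frozenset({"origin=X", "result=pass"}),
    frozenset({"origin=X", "result=pass"}),
    frozenset({"origin=X", "result=fail"}),
    frozenset({"origin=Y", "result=fail"}),
]
frequent = [frozenset({"origin=Y", "result=fail"})]

rules = strong_rules(transactions, frequent, min_confidence=0.8)
```

Only `origin=Y → result=fail` survives: every record with `origin=Y` failed, whereas only half of the failed records have `origin=Y`, so the reverse rule falls below the confidence threshold.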
As shown in Fig. 2, a device for mining association rules for commodity qualification rates comprises:
a storage module 10, for acquiring and storing an original training data set;
a first mining module 11, for classifying the features of the training data set using a decision tree algorithm and extracting a feature variable importance data set;
a second mining module 12, for setting a feature parameter importance threshold, crossing the resulting feature variable importance data set with parameter-tuning data to exclude interfering items from the multidimensional association rule training data set, and screening to obtain a clean feature variable parameter set;
a third mining module 13, for deriving a commodity qualification rate rule model from the clean feature variable parameter set by means of multidimensional association rules.
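One way to picture how the four modules hand data to one another is the sketch below. The class name `MiningDevice`, its field names and the lambda stubs are all hypothetical placeholders; a real implementation would plug in decision tree scoring, threshold screening and association rule mining in their place.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MiningDevice:
    load_data: Callable        # storage module 10
    score_features: Callable   # first mining module 11
    screen_features: Callable  # second mining module 12
    mine_rules: Callable       # third mining module 13

    def run(self):
        data = self.load_data()
        importance = self.score_features(data)
        kept = self.screen_features(importance)
        return self.mine_rules(data, kept)


# Placeholder stubs standing in for the real modules (assumed behavior).
device = MiningDevice(
    load_data=lambda: [{"origin": "X", "channel": "sea", "result": "pass"}],
    score_features=lambda data: {"origin": 0.9, "channel": 0.1},
    screen_features=lambda imp: [k for k, v in imp.items() if v >= 0.5],
    mine_rules=lambda data, kept: [{k: row[k] for k in kept} for row in data],
)
result = device.run()
```

Keeping each module behind a plain callable makes the data flow explicit: only the features that pass screening reach the rule mining stage.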
Although embodiments of the invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the invention. A person of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the invention without departing from its principle and purpose. The scope of the invention is defined by the appended claims and their equivalents.
Claims (3)
1. A method for mining association rules for commodity qualification rates, characterized in that the method comprises the following steps:
A. acquiring an original training data set;
B. classifying the features of the training data set using a decision tree algorithm, and extracting a feature variable importance data set;
C. setting a feature parameter importance threshold, crossing the feature variable importance data set obtained in step B with parameter-tuning data to exclude interfering items from the multidimensional association rule training data set, and screening to obtain a clean feature variable parameter set;
D. deriving a commodity qualification rate rule model from the clean feature variable parameter set obtained in step C by means of multidimensional association rules.
2. The method for mining association rules for commodity qualification rates according to claim 1, characterized in that the decision tree algorithm used in step B to classify the features of the training data set is the C4.5 decision tree algorithm.
3. A device for mining association rules for commodity qualification rates, characterized by comprising:
a storage module, for acquiring and storing an original training data set;
a first mining module, for classifying the features of the training data set using a decision tree algorithm and extracting a feature variable importance data set;
a second mining module, for setting a feature parameter importance threshold, crossing the resulting feature variable importance data set with parameter-tuning data to exclude interfering items from the multidimensional association rule training data set, and screening to obtain a clean feature variable parameter set;
a third mining module, for deriving a commodity qualification rate rule model from the clean feature variable parameter set by means of multidimensional association rules.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710487560.8A CN107247970A (en) | 2017-06-23 | 2017-06-23 | Method and device for mining association rules for commodity qualification rates |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710487560.8A CN107247970A (en) | 2017-06-23 | 2017-06-23 | Method and device for mining association rules for commodity qualification rates |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107247970A true CN107247970A (en) | 2017-10-13 |
Family
ID=60019539
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710487560.8A Pending CN107247970A (en) | 2017-06-23 | 2017-06-23 | Method and device for mining association rules for commodity qualification rates |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107247970A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108520039A (en) * | 2018-04-02 | 2018-09-11 | 河南大学 | A kind of big data method for optimization analysis |
CN110119551A (en) * | 2019-04-29 | 2019-08-13 | 西安电子科技大学 | Shield machine cutter abrasion degeneration linked character analysis method based on machine learning |
CN111670445A (en) * | 2018-01-31 | 2020-09-15 | Asml荷兰有限公司 | Substrate marking method based on process parameters |
CN117376108A (en) * | 2023-12-07 | 2024-01-09 | 深圳市亲邻科技有限公司 | Intelligent operation and maintenance method and system for Internet of things equipment |
CN117725527A (en) * | 2023-12-27 | 2024-03-19 | 北京领雁科技股份有限公司 | Score model optimization method based on machine learning analysis rules |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101419627A (en) * | 2008-12-03 | 2009-04-29 | 山东中烟工业公司 | Cigarette composition maintenance action digging system based on associations ruler and method thereof |
CN102567807A (en) * | 2010-12-23 | 2012-07-11 | 上海亚太计算机信息系统有限公司 | Method for predicating gas card customer churn |
CN104239437A (en) * | 2014-08-28 | 2014-12-24 | 国家电网公司 | Power-network-dispatching-oriented intelligent warning analysis method |
CN106407349A (en) * | 2016-09-06 | 2017-02-15 | 北京三快在线科技有限公司 | Product recommendation method and device |
-
2017
- 2017-06-23 CN CN201710487560.8A patent/CN107247970A/en active Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111670445A (en) * | 2018-01-31 | 2020-09-15 | Asml荷兰有限公司 | Substrate marking method based on process parameters |
CN111670445B (en) * | 2018-01-31 | 2024-03-22 | Asml荷兰有限公司 | Substrate marking method based on process parameters |
CN108520039A (en) * | 2018-04-02 | 2018-09-11 | 河南大学 | A kind of big data method for optimization analysis |
CN110119551A (en) * | 2019-04-29 | 2019-08-13 | 西安电子科技大学 | Shield machine cutter abrasion degeneration linked character analysis method based on machine learning |
CN117376108A (en) * | 2023-12-07 | 2024-01-09 | 深圳市亲邻科技有限公司 | Intelligent operation and maintenance method and system for Internet of things equipment |
CN117376108B (en) * | 2023-12-07 | 2024-03-01 | 深圳市亲邻科技有限公司 | Intelligent operation and maintenance method and system for Internet of things equipment |
CN117725527A (en) * | 2023-12-27 | 2024-03-19 | 北京领雁科技股份有限公司 | Score model optimization method based on machine learning analysis rules |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107247970A (en) | A kind of method for digging and device of commodity qualification rate correlation rule | |
Aldino et al. | Implementation of K-means algorithm for clustering corn planting feasibility area in south lampung regency | |
Tang et al. | When do random forests fail? | |
US8346779B2 (en) | Method and system for extended bitmap indexing | |
CN102364498B (en) | Multi-label-based image recognition method | |
CN108960833B (en) | Abnormal transaction identification method, equipment and storage medium based on heterogeneous financial characteristics | |
CN110135494A (en) | Feature selection approach based on maximum information coefficient and Geordie index | |
CN104462184B (en) | A kind of large-scale data abnormality recognition method based on two-way sampling combination | |
CN106339942A (en) | Financial information processing method and system | |
CN104933444B (en) | A kind of design method of the multi-level clustering syncretizing mechanism towards multidimensional property data | |
CN105373606A (en) | Unbalanced data sampling method in improved C4.5 decision tree algorithm | |
CN104216874B (en) | Positive and negative mode excavation method and system are weighted between the Chinese word based on coefficient correlation | |
CN108846338A (en) | Polarization characteristic selection and classification method based on object-oriented random forest | |
CN109299185B (en) | Analysis method for convolutional neural network extraction features aiming at time sequence flow data | |
CN108280236A (en) | A kind of random forest visualization data analysing method based on LargeVis | |
CN110533116A (en) | Based on the adaptive set of Euclidean distance at unbalanced data classification method | |
CN110297853A (en) | Frequent Set method for digging and device | |
CN108596227B (en) | Mining method for dominant influence factors of electricity consumption behaviors of users | |
CN105045806A (en) | Dynamic splitting and maintenance method of quantile query oriented summary data | |
CN115952067A (en) | Database operation abnormal behavior detection method and readable storage medium | |
CN102799616A (en) | Outlier point detection method in large-scale social network | |
CN109389172B (en) | Radio signal data clustering method based on non-parameter grid | |
Zhang et al. | Multiscale analysis of time irreversibility based on phase-space reconstruction and horizontal visibility graph approach | |
CN105938561A (en) | Canonical-correlation-analysis-based computer data attribute reduction method | |
Dong | Application of Big Data Mining Technology in Blockchain Computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20171013 |