CN108170769A - A kind of assembling manufacturing qualitative data processing method based on decision Tree algorithms - Google Patents
- Publication number
- CN108170769A CN108170769A CN201711426288.9A CN201711426288A CN108170769A CN 108170769 A CN108170769 A CN 108170769A CN 201711426288 A CN201711426288 A CN 201711426288A CN 108170769 A CN108170769 A CN 108170769A
- Authority
- CN
- China
- Prior art keywords
- attribute
- data
- ratio
- information gain
- decision tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
The invention discloses an assembly-manufacturing quality data processing method based on a decision tree algorithm, belonging to the field of quality prediction. The method includes: establishing a quality data model from the quality business flow and quality data tables and storing it in an Oracle database; extracting the relevant quality business data into a data warehouse with the ETL tool Informatica; connecting the big data platform Splunk to the quality data warehouse through the Splunk database interface DB Connect to realize real-time data extraction; integrating an improved C4.5 decision tree data mining algorithm into the Splunk platform, using the divide-and-conquer idea to Map-Reduce the decision tree algorithm, and building a Splunk cluster to realize parallel computation and parallel search of the data; and performing classification mining on the quality data, so as to assist decision makers in making quality data decisions. The method greatly improves operating efficiency, can handle massive quality data, and has high practical value.
Description
Technical field
The present invention relates to an assembly-manufacturing quality data processing method based on a decision tree algorithm, and belongs to the field of quality classification and prediction.
Background technology
With the rapid development and application of industrial automation and computer information technology, massive quality data has accumulated at assembly and manufacturing sites. This has aroused great interest in analyzing these data to identify and mine the rules hidden within them, so as to better guide manufacturing practice.
Quality data mining focuses on quality fluctuation in the assembly and manufacturing process: by analyzing and predicting the homogeneity and difference among the data, the technology can be adjusted in time. Data mining methods such as fuzzy classification, artificial neural networks, Bayesian methods and support vector machines can analyze and predict the quality data process, and have achieved good application results. By contrast, the decision tree method has the advantages of easily handling varied data, robustness to noisy data, easily understood production rules, and high recognition efficiency. The C4.5 decision tree algorithm can model predictive variables in the form of regression equations, so it is well suited to quality data prediction research for complex assembly and manufacturing processes with many characteristics.
At present, many scholars at home and abroad have studied the optimization of the decision tree C4.5 algorithm in depth. Xu Zhao et al. proposed a decision tree fault diagnosis method based on a discernibility reduction matrix, which efficiently generates the fault sample decision table while ensuring the correctness of the diagnosis. Wang Wenxia et al., addressing the defect that the traditional C4.5 decision tree classification algorithm requires multiple scans and therefore runs inefficiently, proposed a new improved C4.5 algorithm, which optimizes the logarithmic operations in the information gain derivation to reduce the run time, and improves the handling of continuous attributes from dichotomous splitting to optimal split point splitting to improve algorithm efficiency. Huang Xiuxia, addressing the long computation time of the C4.5 algorithm and the problem of interdependence among attributes, proposed a C4.5 algorithm based on the GINI index mean between attributes (GC4.5). However, with the explosive growth of quality data, these optimized algorithms are no longer sufficient for such massive "big data".
Summary of the invention
The object of the present invention is to address the deficiencies of the prior art by providing an assembly-manufacturing quality data processing method based on a decision tree algorithm, suitable for classifying and predicting quality data of huge volume, so that when a failure occurs it can be accurately predicted and the fault location can be accurately classified and handled.
In order to achieve the above objectives, the idea of the invention is as follows:
The present invention introduces the big data search platform Splunk, realizes distributed data indexing and data search through a Splunk cluster, and then embeds the parallel decision tree C4.5 algorithm into the Splunk search engine instruction system, realizing parallel computation of the improved decision tree C4.5 algorithm on the data platform. This not only greatly improves the search and query speed of the data and reduces the error rate, but also improves the operation speed of the algorithm.
During assembly manufacturing, a quality data model is established from the quality business flow and quality data tables and stored in an Oracle database; the relevant quality business data are extracted into a data warehouse by the ETL tool Informatica; and the big data platform Splunk is connected to the quality data warehouse through the Splunk database interface DB Connect to realize real-time extraction of the data. The improved C4.5 decision tree data mining algorithm is integrated into the Splunk platform, the divide-and-conquer idea is used to Map-Reduce the decision tree algorithm, and a Splunk cluster is built to realize parallel computation and parallel search of the data. Classification mining is performed on the quality data generated during assembly manufacturing, achieving the purpose of assisting decision makers in making decisions on assembly-manufacturing quality data.
According to the foregoing inventive concept, the technical solution adopted by the present invention is as follows:
An assembly-manufacturing quality data processing method based on a decision tree algorithm, the concrete operation steps of which are as follows:
(1) Establish the quality data comparison table: analyze the quality data business flow, and establish the quality data input fields and the target output field: deviation process, responsible department, appearance effect, sealing effect, failure cause, whether the part is an important part, deviation degree, and disposition category;
(2) Establish the quality data processing model: number each attribute value of the quality data and map it;
(3) A sample processing module assembles the attribute values of the classification attributes into training sample data D; the algorithm platform receives the training sample data D and trains the C4.5 decision tree model;
(4) Improve the C4.5 decision tree algorithm by adding a balance factor to the model;
(5) Decision tree parallel mining analysis: using the quality data model in step (2), apply Map/Reduce parallelization to the improved decision tree algorithm under Splunk, mainly realizing parallelization by dividing the data set horizontally and vertically. Horizontal division splits the data set horizontally so that the data sets read by each Map function nearby are the same size, avoiding load imbalance. Vertical division assigns the computation of the information gain and information gain ratio of one or several complete attributes to a single processor; each processor handles in parallel the computation of the information gain and information gain ratio required to split one or more attributes. Under the vertical division mode, the split point computation of each attribute is performed in parallel.
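The horizontal division described in step (5) can be sketched as follows. This is a minimal illustration, not the patented Splunk implementation: the "processors" are simulated with a Python thread pool, and the record values are invented for demonstration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Invented mini quality records: (responsible_department, disposition_category)
data = [("R1", "C1"), ("R1", "C3"), ("R2", "C3"), ("R2", "C3"),
        ("R3", "C6"), ("R3", "C1"), ("R4", "C3"), ("R5", "C4")]

def count_chunk(chunk):
    """Map side of the horizontal division: each worker counts its own slice."""
    return Counter(chunk)

# Horizontal division: equal-sized slices so every worker gets the same load.
n_workers = 4
size = len(data) // n_workers
chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partial_counts = list(pool.map(count_chunk, chunks))

# Reduce side: merge the partial counts; the result matches a serial scan.
merged = Counter()
for p in partial_counts:
    merged.update(p)
print(merged == Counter(data))   # -> True
```

Because every chunk has the same size, no worker becomes a straggler, which is exactly the load-balance property the step above requires.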
In step (1), the quality data comparison table is established as follows: according to the quality data tables, the relevant quality management personnel in the enterprise provide a quality business flow and parameters covering the business as comprehensively as possible, and the comparison table of attributes and flow processing is established.
In step (2), the quality data processing model is established as follows: each transaction is defined as a quality data record in the quality data table with all data items separated; there are no two identical transactions.
Step (3), training the C4.5 decision tree model, includes the following steps:
a) Calculate the information entropy of the training sample data D:
Info(D) = -Σ_{i=1..m} p_i log2(p_i)
where p_i is the probability that an arbitrary sample in D belongs to class C_i;
b) Calculate the information entropy of attribute A: attribute A has V different values {a_1, a_2, ..., a_v}, which divide D into V subsets {D_1, D_2, ..., D_v}, where D_j is the subset of samples of D taking the value a_j on attribute A. The information entropy of attribute A is:
Info(A) = Σ_{j=1..V} (|D_j| / |D|) × Info(D_j)
where the term |D_j| / |D| is the weight of subset D_j in the total sample, and Info(A) is the information entropy required to classify the samples of D after division by A;
c) From steps a) and b), the information gain of attribute A is obtained:
Gain(A) = Info(D) - Info(A)
d) Information gain tends to favor attributes with many values, which does not necessarily bring a good prediction effect. To overcome this bias, the split information SplitInfo(A) is used:
SplitInfo(A) = -Σ_{j=1..V} (|D_j| / |D|) log2(|D_j| / |D|)
e) From steps c) and d), the information gain ratio GainRatio(A) is calculated:
GainRatio(A) = Gain(A) / SplitInfo(A)
f) Find the classification attribute with the largest information gain ratio and take it as the classification attribute to be divided;
g) Sort the attribute values of the classification attribute to be divided in the training sample data D in increasing order to obtain a data set. The data set can be divided into N+1 kinds of two different sub-data sets, corresponding to N+1 division points. The positions of the N-1 division points between the first and the last division point are determined by computing the averages of adjacent attribute values pairwise, ensuring that all attribute values of the classification attribute to be divided lie between the first and the last division point. For the N+1 kinds of two different sub-data sets, the information gain ratio of every division point is calculated, the division point with the largest information gain ratio is taken as the best division position, and the training sample data D is then divided at the best division position according to the classification attribute to be divided into as many classes as there are class labels.
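The entropy, gain and gain-ratio quantities in steps a)–e) can be expressed compactly. The following is a small Python sketch under the standard C4.5 definitions, with a toy data set invented for illustration.

```python
import math
from collections import Counter

def info(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_attr(values, labels):
    """Info(A): weighted entropy of the subsets D_j induced by attribute A."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    return sum((len(s) / n) * info(s) for s in subsets.values())

def split_info(values):
    """SplitInfo(A): same formula as Info, applied to the attribute values."""
    return info(values)

def gain_ratio(values, labels):
    g = info(labels) - info_attr(values, labels)   # Gain(A)
    si = split_info(values)
    return g / si if si > 0 else 0.0

# Toy illustration: attribute A separates the two classes perfectly.
labels = ["pass", "pass", "fail", "fail"]
A = ["a1", "a1", "a2", "a2"]
print(gain_ratio(A, labels))   # -> 1.0
```

A perfect split drives Info(A) to zero, so the gain equals Info(D) and the ratio reaches its maximum.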
Step (4), improving the C4.5 decision tree algorithm: when the algorithm selects the best attribute to divide the data set, if the attribute chosen by the highest information gain ratio has the largest number of value categories in the current optional attribute subset, a balance factor is added to the attribute selection measure to adjust the information gain and hence the information gain ratio, overcoming the multi-value bias as far as possible. If an attribute meets the balance condition, its classification information entropy is corrected:
a) If attribute A meets the balance condition, its corrected classification information entropy is Info*(A), obtained by adjusting Info(A) with the balance factor λ. The value of λ is jointly determined by the values of the two variables, the currently computed attribute A and the sample data D; the association between split attribute A and sample data D is represented by a linked table;
b) The corrected information gain is:
Gain*(A) = Info(D) - Info*(A)
c) The corrected information gain ratio is:
GainRatio*(A) = Gain*(A) / SplitInfo(A)
Step (5), the decision tree parallel mining analysis: the information gain ratio of each attribute is calculated by the Map-Reduce parallel computing method, specifically:
a) To obtain the decision tree model, the information gain ratios are calculated first; after comparison, the attribute with the largest gain ratio is selected as the root node of the decision tree model to divide the samples, and the information gain ratio of each attribute is calculated. The mapper function is responsible for filtering and transforming input in <key, value> form, and its output is likewise intermediate data in <key, value> form; the C4.5 parallel decision tree computation gathers the values with identical keys together as the input of the reducer function, which performs the centralized computation to obtain the final result;
b) Under the horizontal division mode, the computation of the best attribute split is effectively parallelized: since each processor processes in parallel a part of the training data set that would otherwise require serial processing, the time spent traversing massive data records is reduced;
c) Under the vertical division mode, since each processor processes in parallel attributes that would otherwise require serial processing, the time complexity of computing the information gain and information gain ratio is significantly lower than in the serial case, and the communication cost is also relatively reduced.
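The mapper/reducer flow of step a) can be mimicked sequentially in plain Python. This is a simulation sketch, not Splunk or Hadoop; the record fields ("dept", "cause") and their values are invented.

```python
import math
from collections import Counter, defaultdict

# Toy records: ({attribute_name: value}, class_label); fields are invented.
records = [({"dept": "R1", "cause": "F1"}, "C1"),
           ({"dept": "R1", "cause": "F4"}, "C3"),
           ({"dept": "R2", "cause": "F4"}, "C3"),
           ({"dept": "R2", "cause": "F1"}, "C1")]

def mapper(record):
    """Emit <key, value> pairs: key = attribute name, value = (attr value, label)."""
    attrs, label = record
    for name, v in attrs.items():
        yield name, (v, label)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def reducer(pairs):
    """Centralized computation: gain ratio of one attribute from its grouped pairs."""
    labels = [lab for _, lab in pairs]
    info_d = entropy(list(Counter(labels).values()))
    by_value = defaultdict(list)
    for v, lab in pairs:
        by_value[v].append(lab)
    n = len(pairs)
    info_a = sum(len(s) / n * entropy(list(Counter(s).values()))
                 for s in by_value.values())
    split = entropy([len(s) for s in by_value.values()])
    gain = info_d - info_a
    return gain / split if split else 0.0

# Shuffle phase: bring together intermediate values that share the same key.
groups = defaultdict(list)
for rec in records:
    for k, v in mapper(rec):
        groups[k].append(v)

ratios = {k: reducer(vs) for k, vs in groups.items()}
print(max(ratios, key=ratios.get))   # -> cause
```

Here "cause" wins because it splits the toy labels perfectly, so it would be chosen as the root node, matching the selection rule described above.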
Compared with the prior art, the advantages of the present invention are as follows:
The present invention applies the improved decision tree algorithm to quality data and employs big data technology; the parallel search capability of Splunk greatly reduces the time the decision tree algorithm spends searching related data. Comparison of the experimental results with practical quality processing shows strong practicability: the time spent on the processing flow when a product has quality problems is greatly reduced, helping decision makers who are unfamiliar with the business to make effective decisions. After the decision tree parallel algorithm is added, operating efficiency is greatly improved and massive quality data can be handled, so the method has high practical value.
Description of the drawings
Fig. 1 is the main flow block diagram of the present invention.
Fig. 2 is the quality data model of the step (2) of the present invention.
Fig. 3 is the algorithm calculation flow chart of the step (5) of the present invention.
Specific embodiment
The present invention is described in detail with reference to the accompanying drawings and detailed description.The present embodiment is with the technology of the present invention side
Implemented premised on case, give detailed embodiment and specific operating process, but protection scope of the present invention is unlimited
In following embodiments.
As shown in Figure 1, an assembly-manufacturing quality data processing method based on a decision tree algorithm has the following concrete operation steps:
(1) Establish the quality data comparison table: analyze the quality data business flow, and establish the quality data input fields and the target output field: deviation process, responsible department, appearance effect, sealing effect, failure cause, whether the part is an important part, deviation degree, and disposition category;
(2) Establish the quality data processing model: number each attribute value of the quality data and map it;
(3) A sample processing module assembles the attribute values of the classification attributes into training sample data D; the algorithm platform receives the training sample data D and trains the C4.5 decision tree model;
(4) Improve the C4.5 decision tree algorithm by adding a balance factor to the model;
(5) Decision tree parallel mining analysis: using the quality data model in step (2), apply Map/Reduce parallelization to the improved decision tree algorithm under Splunk, mainly realizing parallelization by dividing the data set horizontally and vertically. Horizontal division splits the data set horizontally so that the data sets read by each Map function nearby are the same size, avoiding load imbalance. Vertical division assigns the computation of the information gain and information gain ratio of one or several complete attributes to a single processor; each processor handles in parallel the computation of the information gain and information gain ratio required to split one or more attributes. Under the vertical division mode, the split point computation of each attribute is performed in parallel.
In step (1), the quality data comparison table is established as follows: according to the quality data tables, the relevant quality management personnel in the enterprise provide a quality business flow and parameters covering the business as comprehensively as possible, and the comparison table of attributes and flow processing is established.
As shown in Fig. 2, step (2) establishes the quality data processing model: each transaction is defined as a quality data record in the quality data table with all data items separated; there are no two identical transactions.
Step (3), training the C4.5 decision tree model, includes the following steps:
a) Calculate the information entropy of the training sample data D:
Info(D) = -Σ_{i=1..m} p_i log2(p_i)
where p_i is the probability that an arbitrary sample in D belongs to class C_i;
b) Calculate the information entropy of attribute A: attribute A has V different values {a_1, a_2, ..., a_v}, which divide D into V subsets {D_1, D_2, ..., D_v}, where D_j is the subset of samples of D taking the value a_j on attribute A. The information entropy of attribute A is:
Info(A) = Σ_{j=1..V} (|D_j| / |D|) × Info(D_j)
where the term |D_j| / |D| is the weight of subset D_j in the total sample, and Info(A) is the information entropy required to classify the samples of D after division by A;
c) From steps a) and b), the information gain of attribute A is obtained:
Gain(A) = Info(D) - Info(A)
d) Information gain tends to favor attributes with many values, which does not necessarily bring a good prediction effect. To overcome this bias, the split information SplitInfo(A) is used:
SplitInfo(A) = -Σ_{j=1..V} (|D_j| / |D|) log2(|D_j| / |D|)
e) From steps c) and d), the information gain ratio GainRatio(A) is calculated:
GainRatio(A) = Gain(A) / SplitInfo(A)
f) Find the classification attribute with the largest information gain ratio and take it as the classification attribute to be divided;
g) Sort the attribute values of the classification attribute to be divided in the training sample data D in increasing order to obtain a data set. The data set can be divided into N+1 kinds of two different sub-data sets, corresponding to N+1 division points. The positions of the N-1 division points between the first and the last division point are determined by computing the averages of adjacent attribute values pairwise, ensuring that all attribute values of the classification attribute to be divided lie between the first and the last division point. For the N+1 kinds of two different sub-data sets, the information gain ratio of every division point is calculated, the division point with the largest information gain ratio is taken as the best division position, and the training sample data D is then divided at the best division position according to the classification attribute to be divided into as many classes as there are class labels.
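Step g)'s search for the best division position over a sorted attribute can be sketched as follows. This is a generic C4.5-style binary-split sketch under the midpoint rule; the numeric values are invented, and the patent's final multi-way division is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Candidate split points are midpoints of adjacent sorted attribute values;
    pick the one with the highest information gain ratio."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    base = entropy(ys)                         # Info(D)
    best = (None, -1.0)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                           # no boundary between equal values
        t = (xs[i] + xs[i - 1]) / 2            # midpoint of adjacent values
        left, right = ys[:i], ys[i:]
        n = len(ys)
        info_a = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        split = entropy(["L"] * len(left) + ["R"] * len(right))
        ratio = (base - info_a) / split if split else 0.0
        if ratio > best[1]:
            best = (t, ratio)
    return best

# Toy numeric attribute, e.g. a deviation measurement (invented numbers)
t, r = best_split_point([0.1, 0.2, 0.8, 0.9], ["ok", "ok", "bad", "bad"])
print(t, r)   # -> 0.5 1.0
```

The midpoint 0.5 separates the two classes exactly, so it is selected as the best division position with the maximal gain ratio.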
Step (4), improving the C4.5 decision tree algorithm: when the algorithm selects the best attribute to divide the data set, if the attribute chosen by the highest information gain ratio has the largest number of value categories in the current optional attribute subset, a balance factor is added to the attribute selection measure to adjust the information gain and hence the information gain ratio, overcoming the multi-value bias as far as possible. If an attribute meets the balance condition, its classification information entropy is corrected:
a) If attribute A meets the balance condition, its corrected classification information entropy is Info*(A), obtained by adjusting Info(A) with the balance factor λ. The value of λ is jointly determined by the values of the two variables, the currently computed attribute A and the sample data D; the association between split attribute A and sample data D is represented by the linked table shown below:
|       | D1  | D2  | …   | Dk  | Total |
|-------|-----|-----|-----|-----|-------|
| A1    | S11 | S21 | …   | Sk1 | S*1   |
| A2    | S12 | S22 | …   | Sk2 | S*2   |
| …     | …   | …   | …   | …   | …     |
| An    | S1n | S2n | …   | Skn | S*n   |
| Total | S1* | S2* | …   | Sk* | N     |
In the table, S*i is the total number of examples with attribute value A = Ai, Sj* is the total number of examples of class Dj, N is the total number of examples in the data set, and Sji is the number of examples with attribute value Ai that belong to class Dj. The expected value when the selected attribute is not associated with the classification is:
Eji = (Sj* × S*i) / N
b) The corrected information gain is:
Gain*(A) = Info(D) - Info*(A)
c) The corrected information gain ratio is:
GainRatio*(A) = Gain*(A) / SplitInfo(A)
As shown in Fig. 3, step (5), the decision tree parallel mining analysis: the information gain ratio of each attribute is calculated by the Map-Reduce parallel computing method, specifically:
a) To obtain the decision tree model, the information gain ratios are calculated first; after comparison, the attribute with the largest gain ratio is selected as the root node of the decision tree model to divide the samples, and the information gain ratio of each attribute is calculated. The mapper function is responsible for filtering and transforming input in <key, value> form, and its output is likewise intermediate data in <key, value> form; the C4.5 parallel decision tree computation gathers the values with identical keys together as the input of the reducer function, which performs the centralized computation to obtain the final result;
b) Under the horizontal division mode, the computation of the best attribute split is effectively parallelized: since each processor processes in parallel a part of the training data set that would otherwise require serial processing, the time spent traversing massive data records is reduced;
c) Under the vertical division mode, since each processor processes in parallel attributes that would otherwise require serial processing, the time complexity of computing the information gain and information gain ratio is significantly lower than in the serial case, and the communication cost is also relatively reduced.
A specific embodiment of the present invention analyzes the product quality data generated in the assembly and manufacturing process of a certain manufacturing enterprise, as follows:
The data objects processed by the decision tree analysis are the quality data fields enumerated in Fig. 2, including deviation process, responsible department, appearance effect, sealing effect, failure cause, whether the part is an important part, and deviation degree, with the disposition category as the output result.
The attribute values of deviation process are functional experiment (D1), part manufacture (D2), assembly (D3), warehouse inspection (D4) and customer inspection (D5). The attribute values of responsible department are mainly part machining department (R1), assembly department (R2), manufacturing department (R3), supplier (R4) and quality inspection department (R5). The attribute values of the appearance effect field are yes (A1) and no (A2); the attribute values of the sealing effect field are yes (S1) and no (S2). The attribute values of failure cause include supplier quality problem (F1), design improvement or change (F2), expired material (F3), manufacturing quality problem (F4) and process change (F5). The attribute values of whether the part is important include general part (I1), important part (I2) and key part (I3). Deviation degree mainly has five attribute values: larger (De1), smaller (De2), general (De3), major (De4) and important (De5). Disposition category mainly has six attribute values: rework (C1), repair (C2), preliminary treatment (C3), use as-is (C4), return to seller (C5) and scrap (C6).
The information gain ratio of each attribute is calculated by the Map-Reduce parallel computing method. To obtain the decision tree model, the information gain ratios are calculated first; after comparison, the attribute with the largest gain ratio is selected as the root node of the decision tree model to divide the samples. The information gain ratio of each attribute is calculated below.
The disposition category field mainly includes the six attribute values rework, return to seller, preliminary treatment, repair, use as-is and scrap, and the total number of records is 20004. The mapper function filters and transforms input in <key, value> form and outputs intermediate data likewise in <key, value> form; the C4.5 parallel computation gathers the values with identical keys together as the input of the reducer function, and the reducer function performs the centralized computation to obtain the final result. The disposition classification results are: 2225 records of rework, 2251 of return to seller, 8611 of preliminary treatment, 1717 of repair, 3919 of use as-is and 1281 of scrap. Using the above formula, the entropy Info(C) of the disposition result is:
Info(C) = -(2225/20004)log2(2225/20004) - (2251/20004)log2(2251/20004) - (8611/20004)log2(8611/20004) - (1717/20004)log2(1717/20004) - (3919/20004)log2(3919/20004) - (1281/20004)log2(1281/20004) = 2.2492
For the responsible department field: the attribute value part machining department has 5335 records, of which 676 are rework, 997 return to seller, 2552 preliminary treatment, 110 repair, 873 use as-is and 127 scrap. The assembly department has 10181 records, of which 1011 are rework, 644 return to seller, 5003 preliminary treatment, 1026 repair, 1877 use as-is and 620 scrap. The manufacturing department has 2035 records, of which 282 are rework, 256 return to seller, 513 preliminary treatment, 284 repair, 455 use as-is and 245 scrap. The supplier has 1902 records, of which 175 are rework, 311 return to seller, 422 preliminary treatment, 234 repair, 508 use as-is and 252 scrap. The quality inspection department has 551 records, of which 81 are rework, 43 return to seller, 121 preliminary treatment, 63 repair, 206 use as-is and 37 scrap.
The information entropy of the responsible department attribute with respect to the classification is Info(R) = 2.1702, so its information gain Gain(R) is:
Gain(R) = Info(C) - Info(R) = 2.2492 - 2.1702 = 0.0790
The attribute split information SplitInfo(R) of the responsible department attribute is:
SplitInfo(R) = -(5335/20004)log2(5335/20004) - (10181/20004)log2(10181/20004) - (2035/20004)log2(2035/20004) - (1902/20004)log2(1902/20004) - (551/20004)log2(551/20004)
The information gain ratio of the responsible department attribute is GainRatio(R) = Gain(R) / SplitInfo(R).
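The quoted entropy and gain figures can be checked numerically from the record counts given above. The small sketch below recomputes Info(C), Info(R) and Gain(R); any last-digit difference against the quoted 2.2492 and 2.1702 is rounding.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# Disposition totals: rework, return to seller, preliminary, repair, use as-is, scrap
class_totals = [2225, 2251, 8611, 1717, 3919, 1281]
N = sum(class_totals)            # 20004 records
info_c = entropy(class_totals)   # Info(C), approx. 2.2492

# Per-department disposition counts taken from the text
dept_counts = {
    "part machining": [676, 997, 2552, 110, 873, 127],
    "assembly":       [1011, 644, 5003, 1026, 1877, 620],
    "manufacturing":  [282, 256, 513, 284, 455, 245],
    "supplier":       [175, 311, 422, 234, 508, 252],
    "quality insp.":  [81, 43, 121, 63, 206, 37],
}
info_r = sum(sum(c) / N * entropy(c) for c in dept_counts.values())  # Info(R), approx. 2.1702
gain_r = info_c - info_r                                             # Gain(R), approx. 0.0790
print(round(info_c, 3), round(info_r, 3), round(gain_r, 4))
```

The department counts sum exactly to the 20004 total, so the weighted entropy reproduces the quoted Gain(R) = 0.0790.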
Similarly, the information gain ratios of the other attributes are: GainRatio(D) = 0.0136, GainRatio(I) = 0.0058, GainRatio(A) = 0.0045, GainRatio(S) = 0.0037, GainRatio(F) = 0.0158 and GainRatio(De) = 0.0067.
Since GainRatio(R) > GainRatio(F) > GainRatio(D) > GainRatio(De) > GainRatio(I) > GainRatio(A) > GainRatio(S), the information gain ratio of attribute "R" is the largest and its number of attribute values is also the largest, meeting the condition for introducing the balance factor to correct the information gain ratio; therefore the balance factor λ is used to adjust the information gain ratio of the "R" attribute. The attribute association table of "R" takes the form of the table above, and the expected values are:
Similarly, E31 = 2296.52, E41 = 457.92, E51 = 1045.18, E61 = 341.64, E12 = 1132.41, E22 = 1145.13, E32 = 4382.55, E42 = 873.86, E52 = 1994.57, E62 = 651.96, E13 = 226.35, E23 = 228.99, E33 = 875.99, E43 = 174.67, E53 = 398.68, E63 = 130.32, E14 = 211.56, E24 = 214.03, E34 = 818.74, E44 = 163.25, E54 = 372.62, E64 = 121.80, E15 = 61.29, E25 = 62.00, E35 = 237.19, E45 = 47.29, E55 = 107.95, E65 = 35.28.
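Judging from the quoted numbers, the expected values follow the usual independence formula Eji = Sj* × S*i / N over the contingency totals; this reading is an inference from the values, reproduced by the sketch below.

```python
# Contingency totals from the worked example
class_totals = [2225, 2251, 8611, 1717, 3919, 1281]   # Sj*: disposition totals D1..D6
dept_totals = [5335, 10181, 2035, 1902, 551]          # S*i: responsible-department totals A1..A5
N = 20004

# E[j][i]: expected count for class Dj under attribute value Ai if "R" were
# independent of the disposition class.
E = [[sj * si / N for si in dept_totals] for sj in class_totals]
print(round(E[2][0], 2), round(E[0][1], 2))   # -> 2296.52 1132.41
```

These reproduce the quoted E31 and E12 (and the rest of the list) to two decimal places, which supports the independence-table reading.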
λ=0.0059, after introducing balance factor, the classification information entropy of computation attribute " R " is:
Gain*(R)=Info (C)-Info*(R)=2.2492-2.2351=0.0141
At this point, the information gain ratios of the attributes are ordered GainRatio(F) > GainRatio(D) > GainRatio(R) > GainRatio(De) > GainRatio(I) > GainRatio(A) > GainRatio(S). After the balance factor is introduced, the failure-cause attribute has the largest information gain ratio of all the attributes, so it is selected first as the root node of the decision tree; a branch is drawn for each value of that attribute, and the samples are divided accordingly.
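The attribute-selection computation described above (entropy, attribute entropy, split information, and gain ratio) can be sketched in Python. This is a generic C4.5-style illustration on hypothetical data, not the patent's implementation, and it does not reproduce the numerical values reported above:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """C4.5 information gain ratio of the attribute at attr_index."""
    n = len(rows)
    # Partition the class labels by the attribute's value.
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(y)
    cond = sum(len(p) / n * entropy(p) for p in parts.values())               # Info(A)
    split = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())  # SplitInfo(A)
    gain = entropy(labels) - cond                                             # Gain(A)
    return gain / split if split > 0 else 0.0
```

The attribute with the largest returned gain ratio would then be taken as the splitting node, as in the selection of the failure-cause attribute above.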
Analyzing and extracting the hierarchical relationships of the resulting decision tree from the experimental results helps decision makers complete a reasonable assessment of quality fault data in the shortest possible time, and gives managers who are unfamiliar with the quality business practical quality-business guidance.
Claims (6)
1. An assembly-manufacturing quality data processing method based on a decision tree algorithm, characterized in that the concrete operation steps are as follows:
(1) Establish the quality data comparison table: analyze the business flow of the quality data, and establish the quality data input fields and the target output field: deviation process, responsible department, appearance influence, sealing influence, failure cause, whether the part is an important part, deviation degree, and disposition category;
(2) Establish the quality data processing model: number each attribute value of the quality data and map it;
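As a minimal sketch of the numbering and mapping in step (2) (the field names and values here are hypothetical, not taken from the patent):

```python
def build_codebook(records, fields):
    """Assign a sequential number to each distinct value of every field
    and map the raw records onto those numbers."""
    codebook = {f: {} for f in fields}
    for rec in records:
        for f in fields:
            # First time a value is seen, give it the next free number.
            codebook[f].setdefault(rec[f], len(codebook[f]))
    encoded = [[codebook[f][rec[f]] for f in fields] for rec in records]
    return codebook, encoded
```

The encoded rows can then serve directly as the training sample data of step (3).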
(3) The sample processing module converts the attribute values of the classification attributes into training sample data D; the algorithm platform receives the training sample data D and trains a C4.5 decision tree model;
(4) Improve the C4.5 decision tree algorithm by adding a balance factor to the model;
(5) Decision tree parallel mining analysis: using the quality data model of step (2), the improved decision tree algorithm is parallelized with Map/Reduce under Splunk, mainly by dividing the data set horizontally and vertically. Horizontal division splits the data set into slices so that the data sets read by the individual Map functions are roughly the same size, avoiding load imbalance. Vertical division assigns the information gain and information gain ratio computations of one or several complete attributes to an individual processor, so that each processor handles in parallel the information gain and gain ratio calculations needed to split one or more attributes. Under the vertical division mode, the split point computation of each attribute is performed in parallel.
2. The assembly-manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that step (1) establishes the quality data comparison table as follows: based on the quality data table provided by the relevant quality administrators of the enterprise, the quality business flow and its parameters are covered as comprehensively as possible, and a comparison table of attributes and flow processing is established.
3. The assembly-manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that step (2) establishes the quality data processing model: each transaction is defined as a quality data record in the quality data table whose data items are all separate, and no two transactions are identical.
4. The assembly-manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that step (3), training the C4.5 decision tree model, comprises the following steps:
a) Calculate the information entropy of the training sample data D:
Info(D) = -Σ Pi log2(Pi)
where Pi is the probability that an arbitrary sample in D belongs to class Ci;
b) Calculate the information entropy of attribute A: attribute A has V distinct values {a1, a2, ..., av}, which divide D into V subsets {D1, D2, ..., Dv}, where Dj is the subset of D whose samples take the value aj on attribute A. The information entropy of attribute A is:
Info(A) = Σ (|Dj| / |D|) × Info(Dj)
where |Dj| / |D| is the weight of subset Dj in the total sample, and Info(A) is the entropy required to classify the samples of D after splitting on A;
c) From steps a) and b), obtain the information gain of attribute A:
Gain(A) = Info(D) - Info(A)
d) Information gain tends to favor attributes with many values, which does not necessarily bring good prediction results. To overcome this bias, the split information SplitInfo(A) is used:
SplitInfo(A) = -Σ (|Dj| / |D|) × log2(|Dj| / |D|)
e) From steps c) and d), calculate the information gain ratio GainRatio(A):
GainRatio(A) = Gain(A) / SplitInfo(A)
f) Find the classification attribute with the largest information gain ratio and use it as the classification attribute to be split;
g) Sort the values of the classification attribute to be split in the training sample data D in increasing order to obtain a data set, and divide this data set into N+1 pairs of distinct sub-data sets corresponding to N+1 split points. The positions of the N-1 split points lying between the first and the last split point are determined by computing the averages of the N-1 pairs of adjacent attribute values, ensuring that all values of the classification attribute to be split lie between the first and the last split point. For the N+1 pairs of sub-data sets, calculate the information gain ratio of every split point and take the split point with the largest information gain ratio as the best split position; then, at the best split position, split the training sample data D on the classification attribute into as many classes as there are class labels.
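A hedged sketch of the split-point search in step g), simplified to a binary split at the midpoints of adjacent sorted values and ranked by information gain ratio (an illustration of the technique, not the patent's exact N+1 partition bookkeeping):

```python
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Rank the midpoints of adjacent sorted values by the gain ratio of
    the resulting binary split, and return (threshold, gain ratio)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = _entropy(labels)
    best = (-1.0, None)  # (gain ratio, threshold)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - len(left) / n * _entropy(left) - len(right) / n * _entropy(right)
        split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in (left, right))
        ratio = gain / split_info if split_info else 0.0
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if ratio > best[0]:
            best = (ratio, threshold)
    return best[1], best[0]
```

Each candidate threshold is the average of two adjacent attribute values, mirroring the claim's midpoint construction.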
5. The assembly-manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that step (4) improves the C4.5 decision tree algorithm as follows: when the best attribute is selected to divide the data set, if the attribute chosen by the highest information gain ratio also has the largest number of value classes in the current candidate attribute subset, a balance factor is added to the attribute selection measure to adjust the information gain and, in turn, the information gain ratio, mitigating the multi-value bias as far as possible; if an attribute satisfies the balance condition, its class information entropy is corrected:
a) If the attribute satisfies the balance condition, the corrected class information entropy is:
The balance factor λ is defined as:
where the value of λ is jointly determined by the attribute A currently under computation and the sample data D. The association between the split attribute A and the sample data D is represented by the association table shown below:
In the formulas, S*i is the total number of instances with A = ai, Sj* is the total number of instances of class Dj, N is the total number of instances in the data set, and Sji is the number of instances with A = ai that belong to class Dj. The expected value when the selected attribute is not associated with the class is:
Eji = (Sj* × S*i) / N
b) The corrected information gain is:
Gain*(A) = Info(D) - Info*(A)
c) The corrected information gain ratio is:
GainRatio*(A) = Gain*(A) / SplitInfo(A)
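The association-table bookkeeping behind the balance factor can be illustrated as follows. The exact definition of λ is given by a formula of the claim that is not reproduced here, so only the expected counts Eji = Sj* × S*i / N, which the claim does state, are computed:

```python
def expected_counts(table):
    """Eji = Sj* * S*i / N: the co-occurrence counts expected if the
    attribute value and the class were statistically independent.
    table[j][i] holds Sji, the observed count of class j with value i."""
    row_totals = [sum(row) for row in table]        # Sj*
    col_totals = [sum(col) for col in zip(*table)]  # S*i
    n = sum(row_totals)                             # N
    return [[rj * ci / n for ci in col_totals] for rj in row_totals]
```

The deviation of the observed Sji from these expected values is what the balance factor λ summarizes before it corrects the class information entropy.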
6. The assembly-manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that step (5), the decision tree parallel mining analysis, calculates the information gain ratio of each attribute by the Map-Reduce parallel computing method; specifically:
a) To obtain the decision tree model, first calculate the information gain ratios and compare them; the attribute with the largest gain ratio is selected as the root node of the decision tree model and the samples are divided, after which the information gain ratio of each attribute is calculated again. The mapper function is responsible for filtering and transforming inputs in <key, value> form and likewise outputs intermediate data in <key, value> form; the C4.5 decision tree parallel computation gathers the values that share the same key as the input of the reducer function, and the reducer function computes over them centrally to obtain the final result;
b) Under the horizontal division mode, the computation of the best attribute split is effectively parallelized; with each processor handling in parallel a part of the training data set that would otherwise be processed serially, the time spent traversing massive numbers of data records is reduced;
c) Under the vertical division mode, since each processor processes in parallel attributes that would otherwise be handled serially, the time complexity of computing the information gain and information gain ratio is significantly lower than in the serial case, and the communication cost is relatively reduced as well.
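A toy map/shuffle/reduce sketch of the vertical division in claim 6, with plain Python functions standing in for a real Map-Reduce runtime (one reducer key per attribute; data and function names are illustrative):

```python
import math
from collections import Counter, defaultdict

def mapper(record):
    """Emit (attribute_index, (value, class)) pairs for one record,
    so each attribute becomes one key (vertical division)."""
    *attrs, label = record
    return [(i, (v, label)) for i, v in enumerate(attrs)]

def shuffle(mapped):
    """Group intermediate pairs by key, as between the map and reduce phases."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reducer(pairs):
    """Compute one attribute's information gain ratio from its (value, class) pairs."""
    n = len(pairs)
    def H(ys):
        return -sum(c / len(ys) * math.log2(c / len(ys))
                    for c in Counter(ys).values())
    parts = defaultdict(list)
    for value, label in pairs:
        parts[value].append(label)
    cond = sum(len(p) / n * H(p) for p in parts.values())
    split = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
    gain = H([label for _, label in pairs]) - cond
    return gain / split if split else 0.0

def best_attribute(records):
    """Map, shuffle, reduce; return the attribute index with the largest ratio."""
    groups = shuffle(mapper(r) for r in records)
    ratios = {key: reducer(vals) for key, vals in groups.items()}
    return max(ratios, key=ratios.get)
```

Because each reducer key holds one complete attribute, the reducers are independent and can run on separate processors, which is the point of the vertical division.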
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711426288.9A CN108170769A (en) | 2017-12-26 | 2017-12-26 | A kind of assembling manufacturing qualitative data processing method based on decision Tree algorithms |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108170769A true CN108170769A (en) | 2018-06-15 |
Family
ID=62520859
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711426288.9A Pending CN108170769A (en) | 2017-12-26 | 2017-12-26 | A kind of assembling manufacturing qualitative data processing method based on decision Tree algorithms |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108170769A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063116A (en) * | 2018-07-27 | 2018-12-21 | 考拉征信服务有限公司 | Data identification method, device, electronic equipment and computer readable storage medium |
CN109101632A (en) * | 2018-08-15 | 2018-12-28 | 中国人民解放军海军航空大学 | Product quality abnormal data retrospective analysis method based on manufacture big data |
CN109492833A (en) * | 2018-12-25 | 2019-03-19 | 洛阳中科协同科技有限公司 | A kind of bearing ring quality of production prediction technique based on decision Tree algorithms |
CN111125078A (en) * | 2019-12-19 | 2020-05-08 | 华北电力大学 | Defect data correction method for relay protection device |
CN111241056A (en) * | 2019-12-31 | 2020-06-05 | 国网浙江省电力有限公司电力科学研究院 | Power energy consumption data storage optimization method based on decision tree model |
CN111695588A (en) * | 2020-04-14 | 2020-09-22 | 北京迅达云成科技有限公司 | Distributed decision tree learning system based on cloud computing |
CN112269778A (en) * | 2020-10-15 | 2021-01-26 | 西安工程大学 | Equipment fault diagnosis method |
CN112862126A (en) * | 2021-03-04 | 2021-05-28 | 扬州浩辰电力设计有限公司 | Intelligent substation secondary equipment defect elimination recommendation method based on decision tree |
CN113689036A (en) * | 2021-08-24 | 2021-11-23 | 成都电科智联科技有限公司 | Thermal imager quality problem reason prediction method based on decision tree C4.5 algorithm |
CN115859944A (en) * | 2023-02-15 | 2023-03-28 | 莱芜职业技术学院 | Computer data mining method based on big data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102054002A (en) * | 2009-10-28 | 2011-05-11 | 中国移动通信集团公司 | Method and device for generating decision tree in data mining system |
US20150262064A1 (en) * | 2014-03-17 | 2015-09-17 | Microsoft Corporation | Parallel decision tree processor architecture |
CN106294667A (en) * | 2016-08-05 | 2017-01-04 | 四川九洲电器集团有限责任公司 | A kind of decision tree implementation method based on ID3 and device |
Non-Patent Citations (3)
Title |
---|
唐露新等: "一种改进型C4.5算法在STM焊接质量中的应用研究", 《舰船电子工程》 * |
孙媛: "基于Hadoop平台的决策树算法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
熊冰妍: "不平衡数据分类方法及其在手机换机预测中的应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2018-06-15