CN108170769A - A kind of assembling manufacturing qualitative data processing method based on decision Tree algorithms - Google Patents
- Publication number
- CN108170769A CN108170769A CN201711426288.9A CN201711426288A CN108170769A CN 108170769 A CN108170769 A CN 108170769A CN 201711426288 A CN201711426288 A CN 201711426288A CN 108170769 A CN108170769 A CN 108170769A
- Authority
- CN
- China
- Prior art keywords
- attribute
- data
- ratio
- information gain
- decision tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
The invention discloses an assembly-manufacturing quality data processing method based on a decision tree algorithm, belonging to the field of quality prediction. The method includes: establishing a quality data model from the quality business flow and quality data tables and storing it in an Oracle database; extracting the relevant quality business data into a data warehouse with the ETL tool Informatica; connecting the big data platform Splunk to the quality data warehouse through the Splunk database interface DB Connect to realize real-time data extraction; integrating an improved C4.5 decision tree data mining algorithm into the Splunk platform, using the divide-and-conquer idea to Map-Reduce the decision tree algorithm, and building a Splunk cluster to realize parallel computation and parallel search of the data; and performing classification mining on the quality data, so as to assist decision makers in making quality data decisions. The method greatly improves operating efficiency, can handle massive quality data, and has high practical value.
Description
Technical field
The present invention relates to an assembly-manufacturing quality data processing method based on a decision tree algorithm, and belongs to the field of quality classification and prediction.
Background technology
With the rapid development and application of industrial automation and computer information technology, massive quality data has accumulated at assembly and manufacturing sites. This has aroused great interest in analyzing these data to identify and mine the rules hidden within them, so as to better guide manufacturing practice.
Quality data mining focuses on quality fluctuation in the assembly and manufacturing process: by analyzing and predicting the homogeneity and difference among the data, the technology can be adjusted in time. Data mining methods such as fuzzy classification, artificial neural networks, Bayesian methods and support vector machines can analyze and predict the quality data process, and have achieved good application results. By contrast, the decision tree method has the advantages of easily handling varied data, robustness to noisy data, easily understood production rules, and high recognition efficiency. The C4.5 decision tree algorithm can model predictive variables in the form of regression equations, so it is well suited to quality data prediction research for complex assembly and manufacturing processes with many characteristics.
At present, many scholars at home and abroad have studied the optimization of the decision tree C4.5 algorithm in depth. Xu Zhao et al. proposed a decision tree fault diagnosis method based on a discernibility reduction matrix, which efficiently generates the fault sample decision table while ensuring the correctness of the diagnosis. Wang Wenxia et al., addressing the defect that the traditional C4.5 decision tree classification algorithm requires multiple scans and therefore runs inefficiently, proposed a new improved C4.5 algorithm, which optimizes the logarithmic operations in the information gain derivation to reduce the run time, and improves the handling of continuous attributes from dichotomous splitting to optimal split point splitting to improve algorithm efficiency. Huang Xiuxia, addressing the long computation time of the C4.5 algorithm and the problem of interdependence among attributes, proposed a C4.5 algorithm based on the GINI index mean between attributes (GC4.5). However, with the explosive growth of quality data, these optimized algorithms are no longer sufficient for such massive "big data".
Summary of the invention
The object of the present invention is to address the deficiencies of the prior art by providing an assembly-manufacturing quality data processing method based on a decision tree algorithm, suitable for classifying and predicting quality data of huge volume, so that when a failure occurs it can be accurately predicted and the fault location can be accurately classified and handled.
In order to achieve the above objectives, the idea of the invention is as follows:
The present invention introduces the big data search platform Splunk, realizes distributed data indexing and data search through a Splunk cluster, and then embeds the parallel decision tree C4.5 algorithm into the Splunk search engine instruction system, realizing parallel computation of the improved decision tree C4.5 algorithm on the data platform. This not only greatly improves the search and query speed of the data and reduces the error rate, but also improves the operation speed of the algorithm.
During assembly manufacturing, a quality data model is established from the quality business flow and quality data tables and stored in an Oracle database; the relevant quality business data are extracted into a data warehouse by the ETL tool Informatica; and the big data platform Splunk is connected to the quality data warehouse through the Splunk database interface DB Connect to realize real-time extraction of the data. The improved C4.5 decision tree data mining algorithm is integrated into the Splunk platform, the divide-and-conquer idea is used to Map-Reduce the decision tree algorithm, and a Splunk cluster is built to realize parallel computation and parallel search of the data. Classification mining is performed on the quality data generated during assembly manufacturing, achieving the purpose of assisting decision makers in making decisions on assembly-manufacturing quality data.
According to the foregoing inventive concept, the technical solution adopted by the present invention is as follows:
An assembly-manufacturing quality data processing method based on a decision tree algorithm, the concrete operation steps of which are as follows:
(1) Establish the quality data comparison table: analyze the quality data business flow, and establish the quality data input fields and the target output field: deviation process, responsible department, appearance effect, sealing effect, failure cause, whether the part is an important part, deviation degree, and disposition category;
(2) Establish the quality data processing model: number each attribute value of the quality data and map it;
(3) A sample processing module assembles the attribute values of the classification attributes into training sample data D; the algorithm platform receives the training sample data D and trains the C4.5 decision tree model;
(4) Improve the C4.5 decision tree algorithm by adding a balance factor to the model;
(5) Decision tree parallel mining analysis: using the quality data model in step (2), apply Map/Reduce parallelization to the improved decision tree algorithm under Splunk, mainly realizing parallelization by dividing the data set horizontally and vertically. Horizontal division splits the data set horizontally so that the data sets read by each Map function nearby are the same size, avoiding load imbalance. Vertical division assigns the computation of the information gain and information gain ratio of one or several complete attributes to a single processor; each processor handles in parallel the computation of the information gain and information gain ratio required to split one or more attributes. Under the vertical division mode, the split point computation of each attribute is performed in parallel.
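The horizontal division described in step (5) can be sketched as follows. This is a minimal illustration, not the patented Splunk implementation: the "processors" are simulated with a Python thread pool, and the record values are invented for demonstration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Invented mini quality records: (responsible_department, disposition_category)
data = [("R1", "C1"), ("R1", "C3"), ("R2", "C3"), ("R2", "C3"),
        ("R3", "C6"), ("R3", "C1"), ("R4", "C3"), ("R5", "C4")]

def count_chunk(chunk):
    """Map side of the horizontal division: each worker counts its own slice."""
    return Counter(chunk)

# Horizontal division: equal-sized slices so every worker gets the same load.
n_workers = 4
size = len(data) // n_workers
chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partial_counts = list(pool.map(count_chunk, chunks))

# Reduce side: merge the partial counts; the result matches a serial scan.
merged = Counter()
for p in partial_counts:
    merged.update(p)
print(merged == Counter(data))   # -> True
```

Because every chunk has the same size, no worker becomes a straggler, which is exactly the load-balance property the step above requires.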
In step (1), the quality data comparison table is established as follows: according to the quality data tables, the relevant quality management personnel in the enterprise provide a quality business flow and parameters covering the business as comprehensively as possible, and the comparison table of attributes and flow processing is established.
In step (2), the quality data processing model is established as follows: each transaction is defined as a quality data record in the quality data table with all data items separated; there are no two identical transactions.
Step (3), training the C4.5 decision tree model, includes the following steps:
a) Calculate the information entropy of the training sample data D:
Info(D) = -Σ_{i=1..m} p_i log2(p_i)
where p_i is the probability that an arbitrary sample in D belongs to class C_i;
b) Calculate the information entropy of attribute A: attribute A has V different values {a_1, a_2, ..., a_v}, which divide D into V subsets {D_1, D_2, ..., D_v}, where D_j is the subset of samples of D taking the value a_j on attribute A. The information entropy of attribute A is:
Info(A) = Σ_{j=1..V} (|D_j| / |D|) × Info(D_j)
where the term |D_j| / |D| is the weight of subset D_j in the total sample, and Info(A) is the information entropy required to classify the samples of D after division by A;
c) From steps a) and b), the information gain of attribute A is obtained:
Gain(A) = Info(D) - Info(A)
d) Information gain tends to favor attributes with many values, which does not necessarily bring a good prediction effect. To overcome this bias, the split information SplitInfo(A) is used:
SplitInfo(A) = -Σ_{j=1..V} (|D_j| / |D|) log2(|D_j| / |D|)
e) From steps c) and d), the information gain ratio GainRatio(A) is calculated:
GainRatio(A) = Gain(A) / SplitInfo(A)
f) Find the classification attribute with the largest information gain ratio and take it as the classification attribute to be divided;
g) Sort the attribute values of the classification attribute to be divided in the training sample data D in increasing order to obtain a data set. The data set can be divided into N+1 kinds of two different sub-data sets, corresponding to N+1 division points. The positions of the N-1 division points between the first and the last division point are determined by computing the averages of adjacent attribute values pairwise, ensuring that all attribute values of the classification attribute to be divided lie between the first and the last division point. For the N+1 kinds of two different sub-data sets, the information gain ratio of every division point is calculated, the division point with the largest information gain ratio is taken as the best division position, and the training sample data D is then divided at the best division position according to the classification attribute to be divided into as many classes as there are class labels.
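The entropy, gain and gain-ratio quantities in steps a)–e) can be expressed compactly. The following is a small Python sketch under the standard C4.5 definitions, with a toy data set invented for illustration.

```python
import math
from collections import Counter

def info(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_attr(values, labels):
    """Info(A): weighted entropy of the subsets D_j induced by attribute A."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    return sum((len(s) / n) * info(s) for s in subsets.values())

def split_info(values):
    """SplitInfo(A): same formula as Info, applied to the attribute values."""
    return info(values)

def gain_ratio(values, labels):
    g = info(labels) - info_attr(values, labels)   # Gain(A)
    si = split_info(values)
    return g / si if si > 0 else 0.0

# Toy illustration: attribute A separates the two classes perfectly.
labels = ["pass", "pass", "fail", "fail"]
A = ["a1", "a1", "a2", "a2"]
print(gain_ratio(A, labels))   # -> 1.0
```

A perfect split drives Info(A) to zero, so the gain equals Info(D) and the ratio reaches its maximum.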
Step (4), improving the C4.5 decision tree algorithm: when the algorithm selects the best attribute to divide the data set, if the attribute chosen by the highest information gain ratio has the largest number of value categories in the current optional attribute subset, a balance factor is added to the attribute selection measure to adjust the information gain and hence the information gain ratio, overcoming the multi-value bias as far as possible. If an attribute meets the balance condition, its classification information entropy is corrected:
a) If attribute A meets the balance condition, its corrected classification information entropy is Info*(A), obtained by adjusting Info(A) with the balance factor λ. The value of λ is jointly determined by the values of the two variables, the currently computed attribute A and the sample data D; the association between split attribute A and sample data D is represented by a linked table;
b) The corrected information gain is:
Gain*(A) = Info(D) - Info*(A)
c) The corrected information gain ratio is:
GainRatio*(A) = Gain*(A) / SplitInfo(A)
Step (5), the decision tree parallel mining analysis: the information gain ratio of each attribute is calculated by the Map-Reduce parallel computing method, specifically:
a) To obtain the decision tree model, the information gain ratios are calculated first; after comparison, the attribute with the largest gain ratio is selected as the root node of the decision tree model to divide the samples, and the information gain ratio of each attribute is calculated. The mapper function is responsible for filtering and transforming input in <key, value> form, and its output is likewise intermediate data in <key, value> form; the C4.5 parallel decision tree computation gathers the values with identical keys together as the input of the reducer function, which performs the centralized computation to obtain the final result;
b) Under the horizontal division mode, the computation of the best attribute split is effectively parallelized: since each processor processes in parallel a part of the training data set that would otherwise require serial processing, the time spent traversing massive data records is reduced;
c) Under the vertical division mode, since each processor processes in parallel attributes that would otherwise require serial processing, the time complexity of computing the information gain and information gain ratio is significantly lower than in the serial case, and the communication cost is also relatively reduced.
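The mapper/reducer flow of step a) can be mimicked sequentially in plain Python. This is a simulation sketch, not Splunk or Hadoop; the record fields ("dept", "cause") and their values are invented.

```python
import math
from collections import Counter, defaultdict

# Toy records: ({attribute_name: value}, class_label); fields are invented.
records = [({"dept": "R1", "cause": "F1"}, "C1"),
           ({"dept": "R1", "cause": "F4"}, "C3"),
           ({"dept": "R2", "cause": "F4"}, "C3"),
           ({"dept": "R2", "cause": "F1"}, "C1")]

def mapper(record):
    """Emit <key, value> pairs: key = attribute name, value = (attr value, label)."""
    attrs, label = record
    for name, v in attrs.items():
        yield name, (v, label)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def reducer(pairs):
    """Centralized computation: gain ratio of one attribute from its grouped pairs."""
    labels = [lab for _, lab in pairs]
    info_d = entropy(list(Counter(labels).values()))
    by_value = defaultdict(list)
    for v, lab in pairs:
        by_value[v].append(lab)
    n = len(pairs)
    info_a = sum(len(s) / n * entropy(list(Counter(s).values()))
                 for s in by_value.values())
    split = entropy([len(s) for s in by_value.values()])
    gain = info_d - info_a
    return gain / split if split else 0.0

# Shuffle phase: bring together intermediate values that share the same key.
groups = defaultdict(list)
for rec in records:
    for k, v in mapper(rec):
        groups[k].append(v)

ratios = {k: reducer(vs) for k, vs in groups.items()}
print(max(ratios, key=ratios.get))   # -> cause
```

Here "cause" wins because it splits the toy labels perfectly, so it would be chosen as the root node, matching the selection rule described above.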
Compared with the prior art, the advantages of the present invention are as follows:
The present invention applies the improved decision tree algorithm to quality data and employs big data technology; the parallel search capability of Splunk greatly reduces the time the decision tree algorithm spends searching related data. Comparison of the experimental results with practical quality processing shows strong practicability: the time spent on the processing flow when a product has quality problems is greatly reduced, helping decision makers who are unfamiliar with the business to make effective decisions. After the decision tree parallel algorithm is added, operating efficiency is greatly improved and massive quality data can be handled, so the method has high practical value.
Description of the drawings
Fig. 1 is the main flow block diagram of the present invention.
Fig. 2 is the quality data model of the step (2) of the present invention.
Fig. 3 is the algorithm calculation flow chart of the step (5) of the present invention.
Specific embodiment
The present invention is described in detail with reference to the accompanying drawings and detailed description.The present embodiment is with the technology of the present invention side
Implemented premised on case, give detailed embodiment and specific operating process, but protection scope of the present invention is unlimited
In following embodiments.
As shown in Figure 1, an assembly-manufacturing quality data processing method based on a decision tree algorithm has the following concrete operation steps:
(1) Establish the quality data comparison table: analyze the quality data business flow, and establish the quality data input fields and the target output field: deviation process, responsible department, appearance effect, sealing effect, failure cause, whether the part is an important part, deviation degree, and disposition category;
(2) Establish the quality data processing model: number each attribute value of the quality data and map it;
(3) A sample processing module assembles the attribute values of the classification attributes into training sample data D; the algorithm platform receives the training sample data D and trains the C4.5 decision tree model;
(4) Improve the C4.5 decision tree algorithm by adding a balance factor to the model;
(5) Decision tree parallel mining analysis: using the quality data model in step (2), apply Map/Reduce parallelization to the improved decision tree algorithm under Splunk, mainly realizing parallelization by dividing the data set horizontally and vertically. Horizontal division splits the data set horizontally so that the data sets read by each Map function nearby are the same size, avoiding load imbalance. Vertical division assigns the computation of the information gain and information gain ratio of one or several complete attributes to a single processor; each processor handles in parallel the computation of the information gain and information gain ratio required to split one or more attributes. Under the vertical division mode, the split point computation of each attribute is performed in parallel.
In step (1), the quality data comparison table is established as follows: according to the quality data tables, the relevant quality management personnel in the enterprise provide a quality business flow and parameters covering the business as comprehensively as possible, and the comparison table of attributes and flow processing is established.
As shown in Fig. 2, step (2) establishes the quality data processing model: each transaction is defined as a quality data record in the quality data table with all data items separated; there are no two identical transactions.
Step (3), training the C4.5 decision tree model, includes the following steps:
a) Calculate the information entropy of the training sample data D:
Info(D) = -Σ_{i=1..m} p_i log2(p_i)
where p_i is the probability that an arbitrary sample in D belongs to class C_i;
b) Calculate the information entropy of attribute A: attribute A has V different values {a_1, a_2, ..., a_v}, which divide D into V subsets {D_1, D_2, ..., D_v}, where D_j is the subset of samples of D taking the value a_j on attribute A. The information entropy of attribute A is:
Info(A) = Σ_{j=1..V} (|D_j| / |D|) × Info(D_j)
where the term |D_j| / |D| is the weight of subset D_j in the total sample, and Info(A) is the information entropy required to classify the samples of D after division by A;
c) From steps a) and b), the information gain of attribute A is obtained:
Gain(A) = Info(D) - Info(A)
d) Information gain tends to favor attributes with many values, which does not necessarily bring a good prediction effect. To overcome this bias, the split information SplitInfo(A) is used:
SplitInfo(A) = -Σ_{j=1..V} (|D_j| / |D|) log2(|D_j| / |D|)
e) From steps c) and d), the information gain ratio GainRatio(A) is calculated:
GainRatio(A) = Gain(A) / SplitInfo(A)
f) Find the classification attribute with the largest information gain ratio and take it as the classification attribute to be divided;
g) Sort the attribute values of the classification attribute to be divided in the training sample data D in increasing order to obtain a data set. The data set can be divided into N+1 kinds of two different sub-data sets, corresponding to N+1 division points. The positions of the N-1 division points between the first and the last division point are determined by computing the averages of adjacent attribute values pairwise, ensuring that all attribute values of the classification attribute to be divided lie between the first and the last division point. For the N+1 kinds of two different sub-data sets, the information gain ratio of every division point is calculated, the division point with the largest information gain ratio is taken as the best division position, and the training sample data D is then divided at the best division position according to the classification attribute to be divided into as many classes as there are class labels.
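Step g)'s search for the best division position over a sorted attribute can be sketched as follows. This is a generic C4.5-style binary-split sketch under the midpoint rule; the numeric values are invented, and the patent's final multi-way division is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Candidate split points are midpoints of adjacent sorted attribute values;
    pick the one with the highest information gain ratio."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    base = entropy(ys)                         # Info(D)
    best = (None, -1.0)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                           # no boundary between equal values
        t = (xs[i] + xs[i - 1]) / 2            # midpoint of adjacent values
        left, right = ys[:i], ys[i:]
        n = len(ys)
        info_a = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        split = entropy(["L"] * len(left) + ["R"] * len(right))
        ratio = (base - info_a) / split if split else 0.0
        if ratio > best[1]:
            best = (t, ratio)
    return best

# Toy numeric attribute, e.g. a deviation measurement (invented numbers)
t, r = best_split_point([0.1, 0.2, 0.8, 0.9], ["ok", "ok", "bad", "bad"])
print(t, r)   # -> 0.5 1.0
```

The midpoint 0.5 separates the two classes exactly, so it is selected as the best division position with the maximal gain ratio.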
Step (4), improving the C4.5 decision tree algorithm: when the algorithm selects the best attribute to divide the data set, if the attribute chosen by the highest information gain ratio has the largest number of value categories in the current optional attribute subset, a balance factor is added to the attribute selection measure to adjust the information gain and hence the information gain ratio, overcoming the multi-value bias as far as possible. If an attribute meets the balance condition, its classification information entropy is corrected:
a) If attribute A meets the balance condition, its corrected classification information entropy is Info*(A), obtained by adjusting Info(A) with the balance factor λ. The value of λ is jointly determined by the values of the two variables, the currently computed attribute A and the sample data D; the association between split attribute A and sample data D is represented by the linked table shown below:
|       | D1  | D2  | …   | Dk  | Total |
|-------|-----|-----|-----|-----|-------|
| A1    | S11 | S21 | …   | Sk1 | S*1   |
| A2    | S12 | S22 | …   | Sk2 | S*2   |
| …     | …   | …   | …   | …   | …     |
| An    | S1n | S2n | …   | Skn | S*n   |
| Total | S1* | S2* | …   | Sk* | N     |
In the table, S*i is the total number of examples with attribute value A = Ai, Sj* is the total number of examples of class Dj, N is the total number of examples in the data set, and Sji is the number of examples with attribute value Ai that belong to class Dj. The expected value when the selected attribute is not associated with the classification is:
Eji = (Sj* × S*i) / N
b) The corrected information gain is:
Gain*(A) = Info(D) - Info*(A)
c) The corrected information gain ratio is:
GainRatio*(A) = Gain*(A) / SplitInfo(A)
As shown in Fig. 3, step (5), the decision tree parallel mining analysis: the information gain ratio of each attribute is calculated by the Map-Reduce parallel computing method, specifically:
a) To obtain the decision tree model, the information gain ratios are calculated first; after comparison, the attribute with the largest gain ratio is selected as the root node of the decision tree model to divide the samples, and the information gain ratio of each attribute is calculated. The mapper function is responsible for filtering and transforming input in <key, value> form, and its output is likewise intermediate data in <key, value> form; the C4.5 parallel decision tree computation gathers the values with identical keys together as the input of the reducer function, which performs the centralized computation to obtain the final result;
b) Under the horizontal division mode, the computation of the best attribute split is effectively parallelized: since each processor processes in parallel a part of the training data set that would otherwise require serial processing, the time spent traversing massive data records is reduced;
c) Under the vertical division mode, since each processor processes in parallel attributes that would otherwise require serial processing, the time complexity of computing the information gain and information gain ratio is significantly lower than in the serial case, and the communication cost is also relatively reduced.
A specific embodiment of the present invention analyzes the product quality data generated in the assembly and manufacturing process of a certain manufacturing enterprise, as follows:
The data objects processed by the decision tree analysis are the quality data fields enumerated in Fig. 2, including deviation process, responsible department, appearance effect, sealing effect, failure cause, whether the part is an important part, and deviation degree, with the disposition category as the output result.
The attribute values of deviation process are functional experiment (D1), part manufacture (D2), assembly (D3), warehouse inspection (D4) and customer inspection (D5). The attribute values of responsible department are mainly part machining department (R1), assembly department (R2), manufacturing department (R3), supplier (R4) and quality inspection department (R5). The attribute values of the appearance effect field are yes (A1) and no (A2); the attribute values of the sealing effect field are yes (S1) and no (S2). The attribute values of failure cause include supplier quality problem (F1), design improvement or change (F2), expired material (F3), manufacturing quality problem (F4) and process change (F5). The attribute values of whether the part is important include general part (I1), important part (I2) and key part (I3). Deviation degree mainly has five attribute values: larger (De1), smaller (De2), general (De3), major (De4) and important (De5). Disposition category mainly has six attribute values: rework (C1), repair (C2), preliminary treatment (C3), use as-is (C4), return to seller (C5) and scrap (C6).
The information gain ratio of each attribute is calculated by the Map-Reduce parallel computing method. To obtain the decision tree model, the information gain ratios are calculated first; after comparison, the attribute with the largest gain ratio is selected as the root node of the decision tree model to divide the samples. The information gain ratio of each attribute is calculated below.
The disposition category field mainly includes the six attribute values rework, return to seller, preliminary treatment, repair, use as-is and scrap, and the total number of records is 20004. The mapper function filters and transforms input in <key, value> form and outputs intermediate data likewise in <key, value> form; the C4.5 parallel computation gathers the values with identical keys together as the input of the reducer function, and the reducer function performs the centralized computation to obtain the final result. The disposition classification results are: 2225 records of rework, 2251 of return to seller, 8611 of preliminary treatment, 1717 of repair, 3919 of use as-is and 1281 of scrap. Using the above formula, the entropy Info(C) of the disposition result is:
Info(C) = -(2225/20004)log2(2225/20004) - (2251/20004)log2(2251/20004) - (8611/20004)log2(8611/20004) - (1717/20004)log2(1717/20004) - (3919/20004)log2(3919/20004) - (1281/20004)log2(1281/20004) = 2.2492
For the responsible department field: the attribute value part machining department has 5335 records, of which 676 are rework, 997 return to seller, 2552 preliminary treatment, 110 repair, 873 use as-is and 127 scrap. The assembly department has 10181 records, of which 1011 are rework, 644 return to seller, 5003 preliminary treatment, 1026 repair, 1877 use as-is and 620 scrap. The manufacturing department has 2035 records, of which 282 are rework, 256 return to seller, 513 preliminary treatment, 284 repair, 455 use as-is and 245 scrap. The supplier has 1902 records, of which 175 are rework, 311 return to seller, 422 preliminary treatment, 234 repair, 508 use as-is and 252 scrap. The quality inspection department has 551 records, of which 81 are rework, 43 return to seller, 121 preliminary treatment, 63 repair, 206 use as-is and 37 scrap.
The information entropy of the responsible department attribute with respect to the classification is Info(R) = 2.1702, so its information gain Gain(R) is:
Gain(R) = Info(C) - Info(R) = 2.2492 - 2.1702 = 0.0790
The attribute split information SplitInfo(R) of the responsible department attribute is:
SplitInfo(R) = -(5335/20004)log2(5335/20004) - (10181/20004)log2(10181/20004) - (2035/20004)log2(2035/20004) - (1902/20004)log2(1902/20004) - (551/20004)log2(551/20004)
The information gain ratio of the responsible department attribute is GainRatio(R) = Gain(R) / SplitInfo(R).
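The quoted entropy and gain figures can be checked numerically from the record counts given above. The small sketch below recomputes Info(C), Info(R) and Gain(R); any last-digit difference against the quoted 2.2492 and 2.1702 is rounding.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# Disposition totals: rework, return to seller, preliminary, repair, use as-is, scrap
class_totals = [2225, 2251, 8611, 1717, 3919, 1281]
N = sum(class_totals)            # 20004 records
info_c = entropy(class_totals)   # Info(C), approx. 2.2492

# Per-department disposition counts taken from the text
dept_counts = {
    "part machining": [676, 997, 2552, 110, 873, 127],
    "assembly":       [1011, 644, 5003, 1026, 1877, 620],
    "manufacturing":  [282, 256, 513, 284, 455, 245],
    "supplier":       [175, 311, 422, 234, 508, 252],
    "quality insp.":  [81, 43, 121, 63, 206, 37],
}
info_r = sum(sum(c) / N * entropy(c) for c in dept_counts.values())  # Info(R), approx. 2.1702
gain_r = info_c - info_r                                             # Gain(R), approx. 0.0790
print(round(info_c, 3), round(info_r, 3), round(gain_r, 4))
```

The department counts sum exactly to the 20004 total, so the weighted entropy reproduces the quoted Gain(R) = 0.0790.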
Similarly, the information gain ratios of the other attributes are: GainRatio(D) = 0.0136, GainRatio(I) = 0.0058, GainRatio(A) = 0.0045, GainRatio(S) = 0.0037, GainRatio(F) = 0.0158 and GainRatio(De) = 0.0067.
Since GainRatio(R) > GainRatio(F) > GainRatio(D) > GainRatio(De) > GainRatio(I) > GainRatio(A) > GainRatio(S), the information gain ratio of attribute "R" is the largest and its number of attribute values is also the largest, meeting the condition for introducing the balance factor to correct the information gain ratio; therefore the balance factor λ is used to adjust the information gain ratio of the "R" attribute. The attribute association table of "R" takes the form of the table above, and the expected values are:
Similarly, E31 = 2296.52, E41 = 457.92, E51 = 1045.18, E61 = 341.64, E12 = 1132.41, E22 = 1145.13, E32 = 4382.55, E42 = 873.86, E52 = 1994.57, E62 = 651.96, E13 = 226.35, E23 = 228.99, E33 = 875.99, E43 = 174.67, E53 = 398.68, E63 = 130.32, E14 = 211.56, E24 = 214.03, E34 = 818.74, E44 = 163.25, E54 = 372.62, E64 = 121.80, E15 = 61.29, E25 = 62.00, E35 = 237.19, E45 = 47.29, E55 = 107.95, E65 = 35.28.
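Judging from the quoted numbers, the expected values follow the usual independence formula Eji = Sj* × S*i / N over the contingency totals; this reading is an inference from the values, reproduced by the sketch below.

```python
# Contingency totals from the worked example
class_totals = [2225, 2251, 8611, 1717, 3919, 1281]   # Sj*: disposition totals D1..D6
dept_totals = [5335, 10181, 2035, 1902, 551]          # S*i: responsible-department totals A1..A5
N = 20004

# E[j][i]: expected count for class Dj under attribute value Ai if "R" were
# independent of the disposition class.
E = [[sj * si / N for si in dept_totals] for sj in class_totals]
print(round(E[2][0], 2), round(E[0][1], 2))   # -> 2296.52 1132.41
```

These reproduce the quoted E31 and E12 (and the rest of the list) to two decimal places, which supports the independence-table reading.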
λ=0.0059, after introducing balance factor, the classification information entropy of computation attribute " R " is:
Gain*(R)=Info (C)-Info*(R)=2.2492-2.2351=0.0141
At this point, the information gain ratios of the attributes are ordered GainRatio(F) > GainRatio(D) > GainRatio(R) > GainRatio(De) > GainRatio(I) > GainRatio(A) > GainRatio(S). After the balance factor is introduced, the failure-cause attribute has the largest information gain ratio of all the attributes, so it is selected first as the root node of the decision tree; a branch is drawn for each value of that attribute, and the samples are divided accordingly.
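The attribute-selection computation described above (entropy, attribute entropy, split information, and gain ratio) can be sketched in Python. This is a generic C4.5-style illustration on hypothetical data, not the patent's implementation, and it does not reproduce the numerical values reported above:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """C4.5 information gain ratio of the attribute at attr_index."""
    n = len(rows)
    # Partition the class labels by the attribute's value.
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(y)
    cond = sum(len(p) / n * entropy(p) for p in parts.values())               # Info(A)
    split = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())  # SplitInfo(A)
    gain = entropy(labels) - cond                                             # Gain(A)
    return gain / split if split > 0 else 0.0
```

The attribute with the largest returned gain ratio would then be taken as the splitting node, as in the selection of the failure-cause attribute above.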
Analyzing and extracting the hierarchical relationships of the resulting decision tree from the experimental results helps decision makers complete a reasonable assessment of quality fault data in the shortest possible time, and gives managers who are unfamiliar with the quality business practical quality-business guidance.
Claims (6)
1. An assembly-manufacturing quality data processing method based on a decision tree algorithm, characterized in that the concrete operation steps are as follows:
(1) Establish the quality data comparison table: analyze the business flow of the quality data, and establish the quality data input fields and the target output field: deviation process, responsible department, appearance influence, sealing influence, failure cause, whether the part is an important part, deviation degree, and disposition category;
(2) Establish the quality data processing model: number each attribute value of the quality data and map it;
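As a minimal sketch of the numbering and mapping in step (2) (the field names and values here are hypothetical, not taken from the patent):

```python
def build_codebook(records, fields):
    """Assign a sequential number to each distinct value of every field
    and map the raw records onto those numbers."""
    codebook = {f: {} for f in fields}
    for rec in records:
        for f in fields:
            # First time a value is seen, give it the next free number.
            codebook[f].setdefault(rec[f], len(codebook[f]))
    encoded = [[codebook[f][rec[f]] for f in fields] for rec in records]
    return codebook, encoded
```

The encoded rows can then serve directly as the training sample data of step (3).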
(3) The sample processing module converts the attribute values of the classification attributes into training sample data D; the algorithm platform receives the training sample data D and trains a C4.5 decision tree model;
(4) Improve the C4.5 decision tree algorithm by adding a balance factor to the model;
(5) Decision tree parallel mining analysis: using the quality data model of step (2), the improved decision tree algorithm is parallelized with Map/Reduce under Splunk, mainly by dividing the data set horizontally and vertically. Horizontal division splits the data set into slices so that the data sets read by the individual Map functions are roughly the same size, avoiding load imbalance. Vertical division assigns the information gain and information gain ratio computations of one or several complete attributes to an individual processor, so that each processor handles in parallel the information gain and gain ratio calculations needed to split one or more attributes. Under the vertical division mode, the split point computation of each attribute is performed in parallel.
2. The assembly-manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that step (1) establishes the quality data comparison table as follows: based on the quality data table provided by the relevant quality administrators of the enterprise, the quality business flow and its parameters are covered as comprehensively as possible, and a comparison table of attributes and flow processing is established.
3. The assembly-manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that step (2) establishes the quality data processing model: each transaction is defined as a quality data record in the quality data table whose data items are all separate, and no two transactions are identical.
4. The assembly-manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that step (3), training the C4.5 decision tree model, comprises the following steps:
a) Calculate the information entropy of the training sample data D:
Info(D) = -Σ Pi log2(Pi)
where Pi is the probability that an arbitrary sample in D belongs to class Ci;
b) Calculate the information entropy of attribute A: attribute A has V distinct values {a1, a2, ..., av}, which divide D into V subsets {D1, D2, ..., Dv}, where Dj is the subset of D whose samples take the value aj on attribute A. The information entropy of attribute A is:
Info(A) = Σ (|Dj| / |D|) × Info(Dj)
where |Dj| / |D| is the weight of subset Dj in the total sample, and Info(A) is the entropy required to classify the samples of D after splitting on A;
c) From steps a) and b), obtain the information gain of attribute A:
Gain(A) = Info(D) - Info(A)
d) Information gain tends to favor attributes with many values, which does not necessarily bring good prediction results. To overcome this bias, the split information SplitInfo(A) is used:
SplitInfo(A) = -Σ (|Dj| / |D|) × log2(|Dj| / |D|)
e) From steps c) and d), calculate the information gain ratio GainRatio(A):
GainRatio(A) = Gain(A) / SplitInfo(A)
f) Find the classification attribute with the largest information gain ratio and use it as the classification attribute to be split;
g) Sort the values of the classification attribute to be split in the training sample data D in increasing order to obtain a data set, and divide this data set into N+1 pairs of distinct sub-data sets corresponding to N+1 split points. The positions of the N-1 split points lying between the first and the last split point are determined by computing the averages of the N-1 pairs of adjacent attribute values, ensuring that all values of the classification attribute to be split lie between the first and the last split point. For the N+1 pairs of sub-data sets, calculate the information gain ratio of every split point and take the split point with the largest information gain ratio as the best split position; then, at the best split position, split the training sample data D on the classification attribute into as many classes as there are class labels.
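A hedged sketch of the split-point search in step g), simplified to a binary split at the midpoints of adjacent sorted values and ranked by information gain ratio (an illustration of the technique, not the patent's exact N+1 partition bookkeeping):

```python
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Rank the midpoints of adjacent sorted values by the gain ratio of
    the resulting binary split, and return (threshold, gain ratio)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = _entropy(labels)
    best = (-1.0, None)  # (gain ratio, threshold)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - len(left) / n * _entropy(left) - len(right) / n * _entropy(right)
        split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in (left, right))
        ratio = gain / split_info if split_info else 0.0
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if ratio > best[0]:
            best = (ratio, threshold)
    return best[1], best[0]
```

Each candidate threshold is the average of two adjacent attribute values, mirroring the claim's midpoint construction.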
5. The assembly-manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that step (4) improves the C4.5 decision tree algorithm as follows: when the best attribute is selected to divide the data set, if the attribute chosen by the highest information gain ratio also has the largest number of value classes in the current candidate attribute subset, a balance factor is added to the attribute selection measure to adjust the information gain and, in turn, the information gain ratio, mitigating the multi-value bias as far as possible; if an attribute satisfies the balance condition, its class information entropy is corrected:
a) If the attribute satisfies the balance condition, the corrected class information entropy is:
The balance factor λ is defined as:
where the value of λ is jointly determined by the attribute A currently under computation and the sample data D. The association between the split attribute A and the sample data D is represented by the association table shown below:
In the formulas, S*i is the total number of instances with A = ai, Sj* is the total number of instances of class Dj, N is the total number of instances in the data set, and Sji is the number of instances with A = ai that belong to class Dj. The expected value when the selected attribute is not associated with the class is:
Eji = (Sj* × S*i) / N
b) The corrected information gain is:
Gain*(A) = Info(D) - Info*(A)
c) The corrected information gain ratio is:
GainRatio*(A) = Gain*(A) / SplitInfo(A)
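The association-table bookkeeping behind the balance factor can be illustrated as follows. The exact definition of λ is given by a formula of the claim that is not reproduced here, so only the expected counts Eji = Sj* × S*i / N, which the claim does state, are computed:

```python
def expected_counts(table):
    """Eji = Sj* * S*i / N: the co-occurrence counts expected if the
    attribute value and the class were statistically independent.
    table[j][i] holds Sji, the observed count of class j with value i."""
    row_totals = [sum(row) for row in table]        # Sj*
    col_totals = [sum(col) for col in zip(*table)]  # S*i
    n = sum(row_totals)                             # N
    return [[rj * ci / n for ci in col_totals] for rj in row_totals]
```

The deviation of the observed Sji from these expected values is what the balance factor λ summarizes before it corrects the class information entropy.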
6. The assembly-manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that step (5), the decision tree parallel mining analysis, calculates the information gain ratio of each attribute by the Map-Reduce parallel computing method; specifically:
a) To obtain the decision tree model, first calculate the information gain ratios and compare them; the attribute with the largest gain ratio is selected as the root node of the decision tree model and the samples are divided, after which the information gain ratio of each attribute is calculated again. The mapper function is responsible for filtering and transforming inputs in <key, value> form and likewise outputs intermediate data in <key, value> form; the C4.5 decision tree parallel computation gathers the values that share the same key as the input of the reducer function, and the reducer function computes over them centrally to obtain the final result;
b) Under the horizontal division mode, the computation of the best attribute split is effectively parallelized; with each processor handling in parallel a part of the training data set that would otherwise be processed serially, the time spent traversing massive numbers of data records is reduced;
c) Under the vertical division mode, since each processor processes in parallel attributes that would otherwise be handled serially, the time complexity of computing the information gain and information gain ratio is significantly lower than in the serial case, and the communication cost is relatively reduced as well.
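A toy map/shuffle/reduce sketch of the vertical division in claim 6, with plain Python functions standing in for a real Map-Reduce runtime (one reducer key per attribute; data and function names are illustrative):

```python
import math
from collections import Counter, defaultdict

def mapper(record):
    """Emit (attribute_index, (value, class)) pairs for one record,
    so each attribute becomes one key (vertical division)."""
    *attrs, label = record
    return [(i, (v, label)) for i, v in enumerate(attrs)]

def shuffle(mapped):
    """Group intermediate pairs by key, as between the map and reduce phases."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reducer(pairs):
    """Compute one attribute's information gain ratio from its (value, class) pairs."""
    n = len(pairs)
    def H(ys):
        return -sum(c / len(ys) * math.log2(c / len(ys))
                    for c in Counter(ys).values())
    parts = defaultdict(list)
    for value, label in pairs:
        parts[value].append(label)
    cond = sum(len(p) / n * H(p) for p in parts.values())
    split = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
    gain = H([label for _, label in pairs]) - cond
    return gain / split if split else 0.0

def best_attribute(records):
    """Map, shuffle, reduce; return the attribute index with the largest ratio."""
    groups = shuffle(mapper(r) for r in records)
    ratios = {key: reducer(vals) for key, vals in groups.items()}
    return max(ratios, key=ratios.get)
```

Because each reducer key holds one complete attribute, the reducers are independent and can run on separate processors, which is the point of the vertical division.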
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711426288.9A CN108170769A (en) | 2017-12-26 | 2017-12-26 | A kind of assembling manufacturing qualitative data processing method based on decision Tree algorithms |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108170769A true CN108170769A (en) | 2018-06-15 |
Family
ID=62520859
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711426288.9A Pending CN108170769A (en) | 2017-12-26 | 2017-12-26 | A kind of assembling manufacturing qualitative data processing method based on decision Tree algorithms |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108170769A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063116A (en) * | 2018-07-27 | 2018-12-21 | 考拉征信服务有限公司 | Data identification method, device, electronic equipment and computer readable storage medium |
CN109101632A (en) * | 2018-08-15 | 2018-12-28 | 中国人民解放军海军航空大学 | Product quality abnormal data retrospective analysis method based on manufacture big data |
CN109492833A (en) * | 2018-12-25 | 2019-03-19 | 洛阳中科协同科技有限公司 | A kind of bearing ring quality of production prediction technique based on decision Tree algorithms |
CN111125078A (en) * | 2019-12-19 | 2020-05-08 | 华北电力大学 | Defect data correction method for relay protection device |
CN111241056A (en) * | 2019-12-31 | 2020-06-05 | 国网浙江省电力有限公司电力科学研究院 | Power energy consumption data storage optimization method based on decision tree model |
CN111695588A (en) * | 2020-04-14 | 2020-09-22 | 北京迅达云成科技有限公司 | Distributed decision tree learning system based on cloud computing |
CN112269778A (en) * | 2020-10-15 | 2021-01-26 | 西安工程大学 | Equipment fault diagnosis method |
CN112862126A (en) * | 2021-03-04 | 2021-05-28 | 扬州浩辰电力设计有限公司 | Intelligent substation secondary equipment defect elimination recommendation method based on decision tree |
CN113689036A (en) * | 2021-08-24 | 2021-11-23 | 成都电科智联科技有限公司 | Thermal imager quality problem reason prediction method based on decision tree C4.5 algorithm |
CN115859944A (en) * | 2023-02-15 | 2023-03-28 | 莱芜职业技术学院 | Computer data mining method based on big data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102054002A (en) * | 2009-10-28 | 2011-05-11 | 中国移动通信集团公司 | Method and device for generating decision tree in data mining system |
US20150262064A1 (en) * | 2014-03-17 | 2015-09-17 | Microsoft Corporation | Parallel decision tree processor architecture |
CN106294667A (en) * | 2016-08-05 | 2017-01-04 | 四川九洲电器集团有限责任公司 | A kind of decision tree implementation method based on ID3 and device |
Non-Patent Citations (3)
Title |
---|
唐露新等: "一种改进型C4.5算法在STM焊接质量中的应用研究", 《舰船电子工程》 * |
孙媛: "基于Hadoop平台的决策树算法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
熊冰妍: "不平衡数据分类方法及其在手机换机预测中的应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2018-06-15