CN108170769A - Assembly-manufacturing quality data processing method based on a decision tree algorithm - Google Patents

Assembly-manufacturing quality data processing method based on a decision tree algorithm

Info

Publication number
CN108170769A
CN108170769A (application CN201711426288.9A)
Authority
CN
China
Prior art keywords: attribute, data, ratio, information gain, decision tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711426288.9A
Other languages
Chinese (zh)
Inventor
蔡红霞
魏壮宇
任民山
丁阳
张英雄
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN201711426288.9A priority Critical patent/CN108170769A/en
Publication of CN108170769A publication Critical patent/CN108170769A/en


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 — Information retrieval of structured data, e.g. relational data
    • G06F 16/21 — Design, administration or maintenance of databases
    • G06F 16/215 — Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/25 — Integrating or interfacing systems involving database management systems
    • G06F 16/254 — Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F 16/28 — Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 — Relational databases
    • G06F 16/285 — Clustering or classification
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 — Classification techniques
    • G06F 18/243 — Classification techniques relating to the number of classes
    • G06F 18/24323 — Tree-organised classifiers

Abstract

The invention discloses an assembly-manufacturing quality data processing method based on a decision tree algorithm, belonging to the field of quality prediction. The method establishes a quality data model from the quality business flow and the quality data tables and stores it in an Oracle database; the relevant quality business data are extracted into a data warehouse by an ETL tool, and the big-data platform Splunk is connected to the quality data warehouse through the Splunk database interface DB Connect, realizing real-time extraction of the data. An improved C4.5 decision tree data-mining algorithm is integrated into the Splunk platform; following a divide-and-conquer strategy, the decision tree algorithm is recast in Map-Reduce form, and a Splunk big-data cluster is completed, realizing parallel computation and parallel search over the data. Classification mining is performed on the quality data, achieving the purpose of assisting decision makers in making decisions about the quality data. The operating efficiency of the method is greatly improved; it can handle massive quality data and has high practical value.

Description

Assembly-manufacturing quality data processing method based on a decision tree algorithm
Technical field
The present invention relates to an assembly-manufacturing quality data processing method based on a decision tree algorithm, and belongs to the field of quality classification and prediction.
Background technology
With the rapid development and application of industrial automation and computer information technology, massive quality data accumulate at assembly and manufacturing sites. This has aroused great interest in analyzing these data to identify and mine the rules hidden in them, so as to better guide manufacturing practice.
Quality data mining focuses on the quality fluctuations in the assembly and manufacturing process; by analyzing and predicting the homogeneity and differences among the data, the process can be adjusted in time. Data-mining methods such as fuzzy classification, artificial neural networks, Bayesian methods and support vector machines can analyze and predict the quality data process and have achieved good application results. By comparison, decision trees handle changing data easily, are robust to noisy data, produce rules that are easy to understand, and recognize with high efficiency. The C4.5 decision tree algorithm can model predictor variables in the form of regression equations, so the method is well suited to forecasting research on quality data of complex assembly and manufacturing processes with many characteristics.
At present, many scholars at home and abroad have studied the optimization of the C4.5 decision tree algorithm in depth. Xu Zhao et al. proposed a decision-tree fault-diagnosis method based on a discernibility reduction matrix, which efficiently generates the fault-sample decision table while guaranteeing the correctness of the diagnosis. Wang Wenxia et al., addressing the low operating efficiency caused by the multiple scans the traditional C4.5 decision tree algorithm requires, proposed a new improved C4.5 algorithm that optimizes the logarithmic operations in the information-gain derivation to reduce run time, and changes the handling of continuous attributes from information-entropy splitting to optimal-split-point splitting to improve efficiency. Huang Xiuxia, addressing the long computation time of C4.5 and the interdependence among attributes, proposed a C4.5 algorithm based on the GINI index mean between attributes (GC4.5). However, as quality data grow explosively, these optimized algorithms are no longer sufficient for such massive "big data".
Invention content
The object of the present invention is to address the deficiencies of the prior art by providing an assembly-manufacturing quality data processing method based on a decision tree algorithm, suitable for classifying and predicting huge volumes of quality data, so that when a failure occurs it can be predicted accurately and the fault location classified and handled.
To achieve the above objectives, the idea of the invention is as follows:
The present invention introduces the big-data search platform Splunk, realizing distributed processing of data indexing and data search through a Splunk cluster, and then embeds the parallel C4.5 decision tree algorithm into the Splunk search-command system, realizing parallel computation of the improved C4.5 algorithm on the data platform. This not only greatly improves the search and query speed of the data and lowers the error rate, but also increases the running speed of the algorithm.
During assembly manufacturing, a quality data model is established from the quality business flow and the quality data tables and stored in an Oracle database. The relevant quality business data are extracted into a data warehouse by an ETL tool, and the big-data platform Splunk is connected to the quality data warehouse through the Splunk database interface DB Connect, realizing real-time extraction of the data. The improved C4.5 decision tree data-mining algorithm is integrated into the Splunk platform; following a divide-and-conquer strategy, the algorithm is recast in Map-Reduce form, and a Splunk big-data cluster is completed, realizing parallel computation and parallel search. Classification mining is performed on the quality data generated during assembly manufacturing, achieving the purpose of assisting decision makers in making decisions about assembly-manufacturing quality data.
According to the above concept, the technical solution adopted by the present invention is:
An assembly-manufacturing quality data processing method based on a decision tree algorithm, whose concrete operating steps are as follows:
(1) Establish the quality data comparison table: analyze the business flow of the quality data and establish the quality data input fields and target output field: deviation process, responsible department, appearance effect, sealing effect, failure cause, whether the part is an important component, deviation degree, and disposition class;
(2) Establish the quality data processing model: number each attribute value of the quality data and map it;
(3) The sample processing module assembles the attribute values of the classification attributes into training sample data D; the algorithm platform receives the training sample data D and trains the C4.5 decision-tree model;
(4) Improve the C4.5 decision tree algorithm by adding a balance factor to the model;
(5) Decision-tree parallel mining analysis: using the quality data model of step (2), apply Map/Reduce parallelization to the improved decision tree algorithm under Splunk, realizing parallelism mainly by dividing the data set horizontally and vertically. Horizontal division splits the data set into horizontal segments; the data-set slices read by the Map functions are of equal size, avoiding load imbalance. Vertical division assigns the information-gain and gain-ratio calculations of one or several complete attributes to an individual processor, so that each processor handles in parallel the information-gain and gain-ratio calculations required to split one or more attributes; under the vertical division mode, the split-point calculations of the different attributes are performed in parallel.
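The horizontal and vertical division of step (5) can be sketched as follows. This is a minimal single-process illustration of the partitioning idea, not the patented Splunk implementation; the function names are hypothetical.

```python
def horizontal_split(rows, n_workers):
    # Horizontal division: cut the record list into n_workers nearly equal
    # slices, so every Map task reads a chunk of the same size and no
    # worker is overloaded.
    base, extra = divmod(len(rows), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


def vertical_split(attributes, n_workers):
    # Vertical division: hand one or several complete attributes to each
    # processor; that processor then computes the information gain and
    # gain ratio for exactly those attributes.
    groups = [[] for _ in range(n_workers)]
    for i, attr in enumerate(attributes):
        groups[i % n_workers].append(attr)
    return groups
```

Horizontal division parallelizes over records, vertical division over attributes; the patent combines both so that split-point searches for different attributes also run concurrently.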
Step (1) establishes the quality data comparison table as follows: according to the quality data tables, the relevant quality-management personnel of the enterprise provide coverage of the quality business flow and parameters as comprehensively as possible, and the comparison table of attributes and flow dispositions is established.
Step (2) establishes the quality data processing model as follows: all data items in a transaction are defined as quality data records in separate quality data tables, and no two transactions are identical.
Training the C4.5 decision-tree model in step (3) includes the following steps:
a) Calculate the information entropy of the training sample data D:
Info(D) = -Σ_{i=1..m} p_i · log2(p_i)
where p_i is the probability that an arbitrary sample in D belongs to class C_i;
b) Calculate the information entropy of attribute A: attribute A has V distinct values {a1, a2, ..., av}, which divide D into V subsets {D1, D2, ..., Dv}, where Dj is the subset of D whose samples take value aj on attribute A. The information entropy of attribute A is:
Info(A) = Σ_{j=1..V} (|Dj| / |D|) · Info(Dj)
where |Dj| / |D| is the weight of subset Dj in the total sample, and Info(A) is the entropy required to classify a sample of D based on the division by A;
c) From steps a) and b), the information gain of attribute A is:
Gain(A) = Info(D) - Info(A)
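Steps a)–c) can be sketched in Python. This is a minimal illustration of the standard C4.5 quantities, not the patented code; the field names are hypothetical.

```python
from collections import Counter
from math import log2


def info(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def info_attr(rows, attr, label):
    # Info(A) = sum(|Dj|/|D| * Info(Dj)): partition the rows by the value
    # they take on attribute A, then weight each subset's entropy.
    n = len(rows)
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row[label])
    return sum(len(p) / n * info(p) for p in parts.values())


def gain(rows, attr, label):
    # Gain(A) = Info(D) - Info(A).
    return info([row[label] for row in rows]) - info_attr(rows, attr, label)
```

On a toy set where the department perfectly determines the disposition class, the gain equals the full entropy of the class label.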
d) Information gain tends to favor attributes with many values, which do not necessarily give good predictions. To overcome this bias, the split information SplitInfo(A) is used:
SplitInfo(A) = -Σ_{j=1..V} (|Dj| / |D|) · log2(|Dj| / |D|)
e) From steps c) and d), the information gain ratio GainRatio(A) is calculated:
GainRatio(A) = Gain(A) / SplitInfo(A)
f) Find the classification attribute with the maximum information gain ratio among the attributes, and take it as the splitting attribute;
g) Sort the values of the splitting attribute in the training sample data D in increasing order to obtain a data set with N+1 candidate split points, each of which divides the data set into two different sub-data-sets. The positions of the N-1 split points lying between the first and the last are determined by calculating the averages of the N-1 pairs of adjacent attribute values, so that all values of the splitting attribute lie between the first and the last split point. For the N+1 pairs of sub-data-sets, calculate the information gain ratio of every split point and take the split point with the maximum gain ratio as the best split position; then, at the best split position, divide the training sample data D according to the splitting attribute into as many classes as there are class labels.
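Steps d)–g) can likewise be sketched. The split-point search below is a minimal reading of step g) — midpoints of adjacent sorted values, scored here by information gain — under assumed field names, not the patented implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, attr, label):
    # GainRatio(A) = Gain(A) / SplitInfo(A).
    n = len(rows)
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row[label])
    info_a = sum(len(p) / n * entropy(p) for p in parts.values())
    split = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    g = entropy([row[label] for row in rows]) - info_a
    return g / split if split else 0.0


def best_threshold(rows, attr, label):
    # Step g): sort the distinct values, try the midpoint of every pair
    # of adjacent values as a binary split point, keep the best one.
    values = sorted({row[attr] for row in rows})
    base = entropy([row[label] for row in rows])
    best_t, best_g = None, -1.0
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [row[label] for row in rows if row[attr] <= t]
        right = [row[label] for row in rows if row[attr] > t]
        g = base - (len(left) / len(rows)) * entropy(left) \
                 - (len(right) / len(rows)) * entropy(right)
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g
```

On a numeric attribute whose low and high values fall in different classes, the midpoint between the two clusters is selected.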
Step (4) improves the C4.5 decision tree algorithm as follows: when the algorithm selects the best attribute to divide the data set, if the attribute chosen by the highest information gain ratio has the largest number of value categories in the current candidate attribute subset, a balance factor is added to the attribute-selection measure to adjust the information gain, and hence the information gain ratio, overcoming the multi-value bias as far as possible. If an attribute satisfies the balance condition, its classification entropy is corrected as follows:
a) If the attribute satisfies the balance condition, its corrected classification entropy Info*(A) is obtained by adjusting Info(A) with a balance factor λ. The value of λ is determined jointly by the values of the two variables, the current attribute A and the sample data D; the association between the splitting attribute A and the sample data D is represented by a linked list;
b) The corrected information gain is:
Gain*(A) = Info(D) - Info*(A)
c) The corrected information gain ratio is:
GainRatio*(A) = Gain*(A) / SplitInfo(A)
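The effect of the correction can be checked numerically. The patent defines λ through the attribute-association table (the exact expression is not reproduced here), so the sketch below takes the λ-adjusted entropy Info*(A) as a given input; the numbers are the responsible-department figures from the embodiment (Info(D) = 2.2492, Info*(R) = 2.2351, department totals 5335/10181/2035/1902/551 out of N = 20004).

```python
from math import log2


def corrected_gain_ratio(info_d, info_star, subset_sizes):
    # Gain*(A) = Info(D) - Info*(A);
    # GainRatio*(A) = Gain*(A) / SplitInfo(A),
    # with SplitInfo computed from the subset sizes |Dj|.
    n = sum(subset_sizes)
    split_info = -sum(s / n * log2(s / n) for s in subset_sizes)
    gain_star = info_d - info_star
    return gain_star, split_info, gain_star / split_info


# Responsible-department figures quoted from the embodiment below.
gain_star, split_info, ratio = corrected_gain_ratio(
    2.2492, 2.2351, [5335, 10181, 2035, 1902, 551])
```

The corrected gain ratio of the multi-valued attribute drops below those of the failure cause and deviation process, matching the re-ordering reported in the embodiment.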
The decision-tree parallel mining analysis of step (5) calculates the information gain ratio of each attribute by the Map-Reduce parallel computing method, specifically:
a) To obtain the decision-tree model, the information gain ratios are calculated first; after comparing their sizes, the attribute with the maximum gain ratio is selected as the root node of the decision-tree model to divide the sample, and the information gain ratio of each attribute is calculated. The mapper function is responsible for filtering and transforming input in <key, value> form; its output is intermediate data, likewise in <key, value> form. The C4.5 decision-tree parallel computation brings together the values with identical keys as the input of the reducer function, and the reducer function computes them centrally to obtain the final result;
b) Under the horizontal division mode, the calculation of the best attribute split is effectively parallelized; since each processor processes in parallel a training data subset that would otherwise be processed serially, the time spent traversing massive data records is reduced.
c) Under the vertical division mode, since each processor processes in parallel attributes that would otherwise be processed serially, the time complexity of calculating information gain and information gain ratio is greatly reduced compared with serial computation, and the communication cost is also relatively reduced.
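The mapper/reducer flow above can be imitated in memory. This is a minimal single-process sketch of the <key, value> flow — not the Splunk cluster implementation — with hypothetical field names.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(counter):
    n = sum(counter.values())
    return -sum(c / n * log2(c / n) for c in counter.values())


def mapper(chunk, attrs, label):
    # Emit <key, value> = <attribute name, (attribute value, class label)>.
    for row in chunk:
        for a in attrs:
            yield a, (row[a], row[label])


def reducer(pairs):
    # All values sharing one key (one attribute) arrive together;
    # aggregate class counts per attribute value, return the gain ratio.
    total, by_value = Counter(), defaultdict(Counter)
    for value, label in pairs:
        total[label] += 1
        by_value[value][label] += 1
    n = sum(total.values())
    info_a = sum(sum(c.values()) / n * entropy(c) for c in by_value.values())
    split = -sum(sum(c.values()) / n * log2(sum(c.values()) / n)
                 for c in by_value.values())
    g = entropy(total) - info_a
    return g / split if split else 0.0


def run(chunks, attrs, label):
    # Shuffle: group intermediate pairs by key, then reduce per key.
    grouped = defaultdict(list)
    for chunk in chunks:
        for key, value in mapper(chunk, attrs, label):
            grouped[key].append(value)
    return {key: reducer(vals) for key, vals in grouped.items()}
```

Each chunk plays the role of one horizontally divided slice; each reducer key plays the role of one vertically divided attribute.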
Compared with the prior art, the advantages of the present invention are:
The present invention applies the improved decision tree algorithm to quality data and employs big-data technology; the parallel search capability of Splunk greatly reduces the time the decision tree algorithm spends searching related data. Comparison of experimental results with actual quality processing shows strong practicability: when quality problems occur in a product, the time spent on the disposition flow is greatly reduced, helping decision makers unfamiliar with the business make effective decisions. After the decision-tree parallel algorithm is added, operating efficiency is greatly improved and massive quality data can be handled, which has high practical value.
Description of the drawings
Fig. 1 is the main flow block diagram of the present invention.
Fig. 2 is the quality data model of the step (2) of the present invention.
Fig. 3 is the algorithm calculation flow chart of the step (5) of the present invention.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. The present embodiment is implemented on the premise of the technical solution of the present invention and gives a detailed implementation and concrete operating process, but the protection scope of the present invention is not limited to the following embodiment.
As shown in Fig. 1, an assembly-manufacturing quality data processing method based on a decision tree algorithm comprises the following concrete operating steps:
(1) Establish the quality data comparison table: analyze the business flow of the quality data and establish the quality data input fields and target output field: deviation process, responsible department, appearance effect, sealing effect, failure cause, whether the part is an important component, deviation degree, and disposition class;
(2) Establish the quality data processing model: number each attribute value of the quality data and map it;
(3) The sample processing module assembles the attribute values of the classification attributes into training sample data D; the algorithm platform receives the training sample data D and trains the C4.5 decision-tree model;
(4) Improve the C4.5 decision tree algorithm by adding a balance factor to the model;
(5) Decision-tree parallel mining analysis: using the quality data model of step (2), apply Map/Reduce parallelization to the improved decision tree algorithm under Splunk, realizing parallelism mainly by dividing the data set horizontally and vertically. Horizontal division splits the data set into horizontal segments; the data-set slices read by the Map functions are of equal size, avoiding load imbalance. Vertical division assigns the information-gain and gain-ratio calculations of one or several complete attributes to an individual processor, so that each processor handles in parallel the information-gain and gain-ratio calculations required to split one or more attributes; under the vertical division mode, the split-point calculations of the different attributes are performed in parallel.
Step (1) establishes the quality data comparison table as follows: according to the quality data tables, the relevant quality-management personnel of the enterprise provide coverage of the quality business flow and parameters as comprehensively as possible, and the comparison table of attributes and flow dispositions is established.
As shown in Fig. 2, step (2) establishes the quality data processing model as follows: all data items in a transaction are defined as quality data records in separate quality data tables, and no two transactions are identical.
Training the C4.5 decision-tree model in step (3) includes the following steps:
a) Calculate the information entropy of the training sample data D:
Info(D) = -Σ_{i=1..m} p_i · log2(p_i)
where p_i is the probability that an arbitrary sample in D belongs to class C_i;
b) Calculate the information entropy of attribute A: attribute A has V distinct values {a1, a2, ..., av}, which divide D into V subsets {D1, D2, ..., Dv}, where Dj is the subset of D whose samples take value aj on attribute A. The information entropy of attribute A is:
Info(A) = Σ_{j=1..V} (|Dj| / |D|) · Info(Dj)
where |Dj| / |D| is the weight of subset Dj in the total sample, and Info(A) is the entropy required to classify a sample of D based on the division by A;
c) From steps a) and b), the information gain of attribute A is:
Gain(A) = Info(D) - Info(A)
d) Information gain tends to favor attributes with many values, which do not necessarily give good predictions. To overcome this bias, the split information SplitInfo(A) is used:
SplitInfo(A) = -Σ_{j=1..V} (|Dj| / |D|) · log2(|Dj| / |D|)
e) From steps c) and d), the information gain ratio GainRatio(A) is calculated:
GainRatio(A) = Gain(A) / SplitInfo(A)
f) Find the classification attribute with the maximum information gain ratio among the attributes, and take it as the splitting attribute;
g) Sort the values of the splitting attribute in the training sample data D in increasing order to obtain a data set with N+1 candidate split points, each of which divides the data set into two different sub-data-sets. The positions of the N-1 split points lying between the first and the last are determined by calculating the averages of the N-1 pairs of adjacent attribute values, so that all values of the splitting attribute lie between the first and the last split point. For the N+1 pairs of sub-data-sets, calculate the information gain ratio of every split point and take the split point with the maximum gain ratio as the best split position; then, at the best split position, divide the training sample data D according to the splitting attribute into as many classes as there are class labels.
Step (4) improves the C4.5 decision tree algorithm as follows: when the algorithm selects the best attribute to divide the data set, if the attribute chosen by the highest information gain ratio has the largest number of value categories in the current candidate attribute subset, a balance factor is added to the attribute-selection measure to adjust the information gain, and hence the information gain ratio, overcoming the multi-value bias as far as possible. If an attribute satisfies the balance condition, its classification entropy is corrected as follows:
a) If the attribute satisfies the balance condition, its corrected classification entropy Info*(A) is obtained by adjusting Info(A) with a balance factor λ, which is defined from the attribute association table.
The value of λ is determined jointly by the values of the two variables, the current attribute A and the sample data D; the association between the splitting attribute A and the sample data D is represented by a linked list, as shown in the following table:
         D1     D2     ...    Dk     Total
A1       S11    S21    ...    Sk1    S*1
A2       S12    S22    ...    Sk2    S*2
...      ...    ...    ...    ...    ...
An       S1n    S2n    ...    Skn    S*n
Total    S1*    S2*    ...    Sk*    N
In the table, S*i is the total number of examples with A = Ai, Sj* is the total number of examples in class Dj, N is the total number of examples in the data set, and Sji is the number of examples with A = Ai that belong to class Dj. When the selected attribute is not associated with the classification, the expected value is:
Eji = (Sj* × S*i) / N
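The expected-value formula can be verified against the Eji figures listed later in the embodiment. A small sketch, assuming the class totals and attribute-value totals of the responsible-department attribute (N = 20004):

```python
def expected_counts(class_totals, value_totals):
    # Eji = Sj* * S*i / N: the cell counts expected if the attribute
    # carried no information about the class.
    n = sum(class_totals)
    assert n == sum(value_totals)
    return [[sj * si / n for si in value_totals] for sj in class_totals]


# Disposition-class totals and responsible-department totals from the
# embodiment of this patent.
E = expected_counts([2225, 2251, 8611, 1717, 3919, 1281],
                    [5335, 10181, 2035, 1902, 551])
```

The grid of expected counts sums back to N, and individual cells reproduce the patent's values (for instance E12 ≈ 1132.41 and E31 ≈ 2296.52).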
b) The corrected information gain is:
Gain*(A) = Info(D) - Info*(A)
c) The corrected information gain ratio is:
GainRatio*(A) = Gain*(A) / SplitInfo(A)
As shown in Fig. 3, the decision-tree parallel mining analysis of step (5) calculates the information gain ratio of each attribute by the Map-Reduce parallel computing method, specifically:
a) To obtain the decision-tree model, the information gain ratios are calculated first; after comparing their sizes, the attribute with the maximum gain ratio is selected as the root node of the decision-tree model to divide the sample, and the information gain ratio of each attribute is calculated. The mapper function is responsible for filtering and transforming input in <key, value> form; its output is intermediate data, likewise in <key, value> form. The C4.5 decision-tree parallel computation brings together the values with identical keys as the input of the reducer function, and the reducer function computes them centrally to obtain the final result;
b) Under the horizontal division mode, the calculation of the best attribute split is effectively parallelized; since each processor processes in parallel a training data subset that would otherwise be processed serially, the time spent traversing massive data records is reduced.
c) Under the vertical division mode, since each processor processes in parallel attributes that would otherwise be processed serially, the time complexity of calculating information gain and information gain ratio is greatly reduced compared with serial computation, and the communication cost is also relatively reduced.
A specific embodiment of the present invention, analyzing the product quality data generated in the assembly and manufacturing process of a certain manufacturing enterprise, is as follows:
The data objects processed by the decision tree analysis are the quality data fields enumerated in Fig. 2, including deviation process, responsible department, appearance effect, sealing effect, failure cause, whether the part is an important component, and deviation degree, with the disposition class as the output result. The attribute values of the deviation process are functional test (D1), part manufacturing (D2), assembly (D3), warehouse inspection (D4) and customer inspection (D5). The attribute values of the responsible department are mainly the part machining department (R1), assembly department (R2), manufacturing department (R3), supplier (R4) and quality inspection department (R5). The attribute values of the appearance-effect field are yes (A1) and no (A2); the attribute values of the sealing-effect field are yes (S1) and no (S2). The attribute values of the failure cause are supplier quality problem (F1), design improvement or change (F2), expired material (F3), manufacturing quality problem (F4) and process change (F5). The attribute values of whether the part is an important component are ordinary part (I1), important part (I2) and key part (I3). The deviation degree mainly has five attribute values: larger (De1), smaller (De2), general (De3), major (De4) and important (De5). The disposition class mainly has six attribute values: rework (C1), repair (C2), preliminary treatment (C3), use as-is (C4), return to seller (C5) and scrap (C6).
The information gain ratio of each attribute is calculated by the Map-Reduce parallel computing method. To obtain the decision-tree model, the information gain ratios are calculated first; after comparing their sizes, the attribute with the maximum gain ratio is selected as the root node of the decision-tree model to divide the sample. The information gain ratio of each attribute is calculated below.
The disposition-class field mainly includes the six attribute values rework, return to seller, preliminary treatment, repair, use as-is and scrap, with 20004 records in total. The mapper function filters and transforms input in <key, value> form; its output is intermediate data, likewise in <key, value> form. The C4.5 parallel computation brings together the values with identical keys as the input of the reducer function, and the reducer function computes them centrally to obtain the final result: the disposition classification counts are 2225 rework, 2251 return to seller, 8611 preliminary treatment, 1717 repair, 3919 use as-is and 1281 scrap. With the above formula, the entropy Info(C) of the disposition result is 2.2492.
For the responsible-department field, the attribute value part machining department has 5335 records, of which 676 are rework, 997 return to seller, 2552 preliminary treatment, 110 repair, 873 use as-is and 127 scrap. The assembly department has 10181 records: 1011 rework, 644 return to seller, 5003 preliminary treatment, 1026 repair, 1877 use as-is and 620 scrap. The manufacturing department has 2035 records: 282 rework, 256 return to seller, 513 preliminary treatment, 284 repair, 455 use as-is and 245 scrap. The supplier has 1902 records: 175 rework, 311 return to seller, 422 preliminary treatment, 234 repair, 508 use as-is and 252 scrap. The quality inspection department has 551 records: 81 rework, 43 return to seller, 121 preliminary treatment, 63 repair, 206 use as-is and 37 scrap.
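The record counts above are sufficient to reproduce the entropy and gain figures of the embodiment; a verification sketch:

```python
from math import log2

# Disposition-class record counts per responsible department, in the order
# rework, return to seller, preliminary treatment, repair, use as-is, scrap
# (taken from the embodiment above).
counts = {
    "R1 part machining": [676, 997, 2552, 110, 873, 127],
    "R2 assembly":       [1011, 644, 5003, 1026, 1877, 620],
    "R3 manufacturing":  [282, 256, 513, 284, 455, 245],
    "R4 supplier":       [175, 311, 422, 234, 508, 252],
    "R5 quality insp.":  [81, 43, 121, 63, 206, 37],
}


def entropy(freqs):
    n = sum(freqs)
    return -sum(f / n * log2(f / n) for f in freqs if f)


N = sum(sum(row) for row in counts.values())                 # 20004 records
class_totals = [sum(col) for col in zip(*counts.values())]   # per class
info_c = entropy(class_totals)                               # Info(C)
info_r = sum(sum(row) / N * entropy(row) for row in counts.values())
gain_r = info_c - info_r                                     # Gain(R)
```

Running this reproduces the values quoted below: Info(C) ≈ 2.2492, Info(R) ≈ 2.1702, Gain(R) ≈ 0.0790.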
The conditional entropy of each attribute with respect to the classification is calculated in turn. For the responsible-department attribute, the entropy is Info(R) = 2.1702, and the information gain Gain(R) is:
Gain(R) = Info(C) - Info(R) = 2.2492 - 2.1702 = 0.0790
The split information SplitInfo(R) of the responsible-department attribute, and from it the information gain ratio GainRatio(R), are computed in the same way from the department record counts.
Similarly, the information gain ratios of the other attributes are: GainRatio(D) = 0.0136, GainRatio(I) = 0.0058, GainRatio(A) = 0.0045, GainRatio(S) = 0.0037, GainRatio(F) = 0.0158, GainRatio(De) = 0.0067.
Since GainRatio(R) > GainRatio(F) > GainRatio(D) > GainRatio(De) > GainRatio(I) > GainRatio(A) > GainRatio(S), the information gain ratio of attribute "R" is the largest while its number of attribute values is also the largest, satisfying the condition for introducing the balance factor to correct the information gain ratio; therefore the balance factor λ is used to adjust the information gain ratio of the "R" attribute. The attribute association table of "R" is shown below:

         Rework  Return  Prelim.  Repair  As-is  Scrap  Total
R1       676     997     2552     110     873    127    5335
R2       1011    644     5003     1026    1877   620    10181
R3       282     256     513      284     455    245    2035
R4       175     311     422      234     508    252    1902
R5       81      43      121      63      206    37     551
Total    2225    2251    8611     1717    3919   1281   20004
Similarly, E31 = 2296.52, E41 = 457.92, E51 = 1045.18, E61 = 341.64, E12 = 1132.41, E22 = 1145.13, E32 = 4382.55, E42 = 873.86, E52 = 1994.57, E62 = 651.96, E13 = 226.35, E23 = 228.99, E33 = 875.99, E43 = 174.67, E53 = 398.68, E63 = 130.32, E14 = 211.56, E24 = 214.03, E34 = 818.74, E44 = 163.25, E54 = 372.62, E64 = 121.80, E15 = 61.29, E25 = 62.00, E35 = 237.19, E45 = 47.29, E55 = 107.95, E65 = 35.28.
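These Eji values are consistent with the usual contingency-table expectation Eji = Sj* × S*i / N, with the row and column totals taken from the department table above; a sketch:

```python
# Expected counts E_ji = (class total x department total) / N, i.e. the counts
# expected if the responsible department were independent of the disposal class.
dept_totals  = [5335, 10181, 2035, 1902, 551]        # S*i, i = 1..5
class_totals = [2225, 2251, 8611, 1717, 3919, 1281]  # Sj*, j = 1..6
N = sum(dept_totals)                                 # 20004 records

E = [[cj * di / N for di in dept_totals] for cj in class_totals]

# E[2][0] is E31 in the text (preliminary treatment x part machining) ~ 2296.52,
# E[3][0] is E41 ~ 457.92, E[0][1] is E12 ~ 1132.41, E[5][4] is E65 ~ 35.28.
```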
With λ = 0.0059, after introducing the balance factor, the class information entropy of attribute "R" is computed as:
Gain*(R) = Info(C) − Info*(R) = 2.2492 − 2.2351 = 0.0141
At this point, the information gain ratios of the attributes are ordered as GainRatio(F) > GainRatio(D) > GainRatio(R) > GainRatio(De) > GainRatio(I) > GainRatio(A) > GainRatio(S). After the balance factor is introduced, the failure-cause attribute has the largest information gain ratio among all attributes, so it is selected first as the root node of the decision tree; a branch is drawn for each component value of this attribute and the samples are partitioned accordingly.
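Assuming the corrected ratio is obtained as GainRatio*(R) = Gain*(R) / SplitInfo(R) (the corrected-ratio formula itself is not reproduced in this extract), the stated ordering can be checked numerically:

```python
import math

# SplitInfo(R) from the department totals in the table above.
dept_totals = [5335, 10181, 2035, 1902, 551]
N = sum(dept_totals)
split_info_R = -sum(t / N * math.log2(t / N) for t in dept_totals)

# Gain*(R) = 0.0141 is stated in the text; dividing by SplitInfo(R) gives the
# corrected ratio, compared against the other (stated) gain ratios.
gain_ratio = {"F": 0.0158, "D": 0.0136, "R": 0.0141 / split_info_R,
              "De": 0.0067, "I": 0.0058, "A": 0.0045, "S": 0.0037}
order = sorted(gain_ratio, key=gain_ratio.get, reverse=True)
root = order[0]  # 'F' (failure cause) is chosen as the root node
```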
From the experimental results, extracting and analysing the hierarchical relationships of the decision tree above helps decision makers complete a reasonable assessment of quality fault data in the shortest time, and gives managers who are unfamiliar with the quality business practical guidance on it.

Claims (6)

1. An assembly manufacturing quality data processing method based on a decision tree algorithm, characterized in that the concrete operation steps are as follows:
(1) establishing a quality data comparison table: analysing the business flow of the quality data, and establishing the quality data input fields and the target output field: deviation process, responsible department, appearance effect, sealing effect, failure cause, whether the part is an important part, deviation degree, and disposal class;
(2) establishing a quality data processing model: numbering and mapping each attribute value of the quality data;
(3) a sample processing module writing the attribute values of the classification attributes into training sample data D; an algorithm platform receiving the training sample data D and training a C4.5 decision tree model;
(4) improving the C4.5 decision tree algorithm by adding a balance factor to the model;
(5) analysing the decision tree by parallelized mining: using the quality data model of step (2), parallelizing the improved decision tree algorithm with Map/Reduce under Splunk, mainly by dividing the data set horizontally and vertically; horizontal division of the data set means splitting it horizontally so that the data slices read nearby by the Map functions are all of the same size, avoiding load imbalance; vertical division of the data set means assigning the calculation of the information gain and information gain ratio of one or several complete attributes to an individual processor, each processor handling in parallel the information gain and information gain ratio calculations required to split on one or more attributes; under the vertical division model, the split-point calculations of the individual attributes are executed in parallel.
2. The assembly manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that establishing the quality data comparison table in step (1) comprises: according to the quality data table, and to the quality business flow and parameters provided by the relevant quality administrators in the enterprise, building the comparison table of attributes and flow disposal so as to cover the quality business flow and parameters as comprehensively as possible.
3. The assembly manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that establishing the quality data processing model in step (2) comprises: defining a quality data table in which all data items within a transaction are distinct, and no two identical transactions exist among the quality data records.
4. The assembly manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that training the C4.5 decision tree model in step (3) comprises the following steps:
a) calculating the information entropy of the training sample data D:
Info(D) = −Σi Pi log2(Pi)
wherein Pi is the probability that an arbitrary sample in D belongs to class Ci;
b) calculating the information entropy of attribute A: attribute A has V distinct values {a1, a2, ..., av}, which divide D into V subsets {D1, D2, ..., Dv}, wherein Dj is the subset of D whose samples take the value aj on attribute A; the information entropy of attribute A is:
Info(A) = Σj (|Dj| / |D|) Info(Dj)
wherein the term |Dj| / |D| is the weight of subset Dj within the total sample, and Info(A) is the information entropy required to classify the samples of D when dividing by A;
c) obtaining the information gain of attribute A from steps a) and b):
Gain(A) = Info(D) − Info(A)
d) the information gain tends to favour attributes with many values, which does not necessarily yield good prediction; to overcome this bias, the split information SplitInfo(A) is used:
SplitInfo(A) = −Σj (|Dj| / |D|) log2(|Dj| / |D|)
e) calculating the information gain ratio GainRatio(A) from steps c) and d):
GainRatio(A) = Gain(A) / SplitInfo(A)
f) finding the classification attribute with the largest information gain ratio among the attributes, and taking it as the classification attribute to be split;
g) arranging the attribute values of the classification attribute to be split in the training sample data D in increasing order to obtain a data set; the data set can be divided into two distinct sub-data-sets in N+1 different ways, corresponding to N+1 split points; the positions of the N−1 split points lying between the first split point and the last split point are determined by computing the N−1 averages of pairwise adjacent attribute values, and it is ensured that all attribute values of the classification attribute to be split lie between the first split point and the last split point; according to the N+1 pairs of sub-data-sets, the information gain ratio of every split point is calculated, the split point with the largest information gain ratio is taken as the best split position, and the training sample data D is then split at the best split position, according to the classification attribute to be split, into a number of classes equal to the number of class labels.
5. The assembly manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that improving the C4.5 decision tree algorithm in step (4) comprises: when the best attribute is selected in the algorithm to divide the data set, if the attribute chosen by the highest information gain ratio also has the largest number of value classes within the currently selectable attribute subset, a balance factor is added to the attribute selection measure to adjust the information gain and thereby the information gain ratio, overcoming the multi-value bias as far as possible; if an attribute meets the balance condition, its class information entropy is corrected;
a) when an attribute meets the balance condition, the corrected class information entropy is:
the balance factor is defined as:
wherein the value of λ is jointly determined by the values of the two variables, the attribute A currently being computed and the sample data D; the association between the split attribute A and the sample data D is represented by the association table shown below:

           D1    D2    ...   Dk    Total
A1         S11   S21   ...   Sk1   S*1
A2         S12   S22   ...   Sk2   S*2
...
An         S1n   S2n   ...   Skn   S*n
Total      S1*   S2*   ...   Sk*   N

wherein S*i is the total number of examples with A = Ai, Sj* is the total number of examples of Dj, N is the total number of examples in the data set, and Sji is the number of examples with A = Ai that belong to class Dj; the expected value when the selected attribute is not associated with the class is:
Eji = (Sj* × S*i) / N
b) the corrected information gain is:
Gain*(A) = Info(D) − InfoV*(A)
c) the corrected information gain ratio is:
GainRatio*(A) = Gain*(A) / SplitInfo(A)
6. The assembly manufacturing quality data processing method based on a decision tree algorithm according to claim 1, characterized in that the parallelized decision tree mining analysis in step (5) comprises: calculating the information gain ratio of each attribute by the Map-Reduce parallel computation method; specifically:
a) to obtain the decision tree model, the information gain ratios are first calculated and compared, and the attribute with the largest gain ratio is selected as the root node of the decision tree model to divide the samples, the information gain ratio of each attribute being calculated; the mapper function is responsible for screening and transforming input in <key, value> form, and its output is intermediate data likewise in <key, value> form; the parallel C4.5 computation gathers together the intermediate values sharing the same key as the input of the reducer function, which computes them centrally to obtain the final result;
b) under the horizontal division mode, the computation of the best attribute split is effectively parallelized; since each processor processes in parallel a part of the training data set that would otherwise require serial processing, the time spent traversing massive data records is reduced;
c) under the vertical division mode, since each processor processes in parallel attributes that would otherwise require serial processing, the time complexity of calculating the information gain and information gain ratio is significantly lower than in the serial case, and the communication cost is also relatively reduced.
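As an illustrative sketch only (not the patented implementation), the vertical division of claims 1 and 6 can be mimicked in Python by dispatching each attribute's information-gain-ratio calculation to its own worker; the toy records, field names, and the thread pool standing in for the Map/Reduce processors are all assumptions:

```python
import math
from concurrent.futures import ThreadPoolExecutor

def entropy(freqs):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(freqs)
    return -sum(f / total * math.log2(f / total) for f in freqs if f)

def gain_ratio(rows, attr, label):
    """Information gain ratio of one categorical attribute over dict records."""
    by_value, by_label = {}, {}
    for r in rows:
        by_value.setdefault(r[attr], []).append(r[label])
        by_label[r[label]] = by_label.get(r[label], 0) + 1
    n = len(rows)
    info_d = entropy(list(by_label.values()))      # entropy of the whole set
    info_a = split = 0.0
    for labels in by_value.values():               # one subset per attribute value
        w = len(labels) / n
        freqs = {}
        for lab in labels:
            freqs[lab] = freqs.get(lab, 0) + 1
        info_a += w * entropy(list(freqs.values()))
        split -= w * math.log2(w)
    return (info_d - info_a) / split if split else 0.0

# Vertical division: each attribute's calculation goes to a separate worker.
data = [
    {"dept": "assembly",  "important": "yes", "label": "rework"},
    {"dept": "assembly",  "important": "no",  "label": "scrap"},
    {"dept": "machining", "important": "yes", "label": "rework"},
    {"dept": "machining", "important": "no",  "label": "use_as_is"},
]
attrs = ["dept", "important"]
with ThreadPoolExecutor() as pool:
    ratios = dict(zip(attrs,
                      pool.map(lambda a: gain_ratio(data, a, "label"), attrs)))
best = max(ratios, key=ratios.get)  # attribute chosen as the split node
```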
CN201711426288.9A 2017-12-26 2017-12-26 A kind of assembling manufacturing qualitative data processing method based on decision Tree algorithms Pending CN108170769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711426288.9A CN108170769A (en) 2017-12-26 2017-12-26 A kind of assembling manufacturing qualitative data processing method based on decision Tree algorithms


Publications (1)

Publication Number Publication Date
CN108170769A true CN108170769A (en) 2018-06-15

Family

ID=62520859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711426288.9A Pending CN108170769A (en) 2017-12-26 2017-12-26 A kind of assembling manufacturing qualitative data processing method based on decision Tree algorithms

Country Status (1)

Country Link
CN (1) CN108170769A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054002A (en) * 2009-10-28 2011-05-11 中国移动通信集团公司 Method and device for generating decision tree in data mining system
US20150262064A1 (en) * 2014-03-17 2015-09-17 Microsoft Corporation Parallel decision tree processor architecture
CN106294667A (en) * 2016-08-05 2017-01-04 四川九洲电器集团有限责任公司 A kind of decision tree implementation method based on ID3 and device


Non-Patent Citations (3)

Title
Tang Luxin et al., "Application research of an improved C4.5 algorithm in STM welding quality", Ship Electronic Engineering *
Sun Yuan, "Research on decision tree algorithms based on the Hadoop platform", China Master's Theses Full-text Database, Information Science and Technology *
Xiong Bingyan, "Imbalanced data classification methods and their application in handset replacement prediction", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (14)

Publication number Priority date Publication date Assignee Title
CN109063116B (en) * 2018-07-27 2020-04-21 考拉征信服务有限公司 Data identification method and device, electronic equipment and computer readable storage medium
CN109063116A (en) * 2018-07-27 2018-12-21 考拉征信服务有限公司 Data identification method, device, electronic equipment and computer readable storage medium
CN109101632A (en) * 2018-08-15 2018-12-28 中国人民解放军海军航空大学 Product quality abnormal data retrospective analysis method based on manufacture big data
CN109492833A (en) * 2018-12-25 2019-03-19 洛阳中科协同科技有限公司 A kind of bearing ring quality of production prediction technique based on decision Tree algorithms
CN111125078A (en) * 2019-12-19 2020-05-08 华北电力大学 Defect data correction method for relay protection device
CN111241056B (en) * 2019-12-31 2024-03-01 国网浙江省电力有限公司营销服务中心 Power energy data storage optimization method based on decision tree model
CN111241056A (en) * 2019-12-31 2020-06-05 国网浙江省电力有限公司电力科学研究院 Power energy consumption data storage optimization method based on decision tree model
CN111695588A (en) * 2020-04-14 2020-09-22 北京迅达云成科技有限公司 Distributed decision tree learning system based on cloud computing
CN112269778A (en) * 2020-10-15 2021-01-26 西安工程大学 Equipment fault diagnosis method
CN112862126A (en) * 2021-03-04 2021-05-28 扬州浩辰电力设计有限公司 Intelligent substation secondary equipment defect elimination recommendation method based on decision tree
CN112862126B (en) * 2021-03-04 2023-10-13 扬州浩辰电力设计有限公司 Decision tree-based recommendation method for eliminating defects of secondary equipment of intelligent substation
CN113689036A (en) * 2021-08-24 2021-11-23 成都电科智联科技有限公司 Thermal imager quality problem reason prediction method based on decision tree C4.5 algorithm
CN115859944A (en) * 2023-02-15 2023-03-28 莱芜职业技术学院 Computer data mining method based on big data
CN115859944B (en) * 2023-02-15 2023-10-17 莱芜职业技术学院 Big data-based computer data mining method

Similar Documents

Publication Publication Date Title
CN108170769A (en) A kind of assembling manufacturing qualitative data processing method based on decision Tree algorithms
CN111199343B (en) Multi-model fusion tobacco market supervision abnormal data mining method
CN106022477A (en) Intelligent analysis decision system and method
WO2019042099A1 (en) Chinese medicine production process knowledge system
CN107451666A (en) Breaker based on big data analysis assembles Tracing back of quality questions system and method
CN108573078B (en) Fracturing effect prediction method based on data mining
CN113590698B (en) Artificial intelligence technology-based data asset classification modeling and hierarchical protection method
CN115641162A (en) Prediction data analysis system and method based on construction project cost
CN111126865B (en) Technology maturity judging method and system based on technology big data
CN113537807A (en) Enterprise intelligent wind control method and device
CN116777284A (en) Space and attribute data integrated quality inspection method
CN115409120A (en) Data-driven-based auxiliary user electricity stealing behavior detection method
CN113628024A (en) Financial data intelligent auditing system and method based on big data platform system
CN110597796B (en) Big data real-time modeling method and system based on full life cycle
Fana et al. Data Warehouse Design With ETL Method (Extract, Transform, And Load) for Company Information Centre
CN112506930B (en) Data insight system based on machine learning technology
CN114358812A (en) Multi-dimensional power marketing analysis method and system based on operation and maintenance big data
CN112001539B (en) High-precision passenger transport prediction method and passenger transport prediction system
CN113920366A (en) Comprehensive weighted main data identification method based on machine learning
CN115481841A (en) Material demand prediction method based on feature extraction and improved random forest
CN112215514A (en) Operation analysis report generation method and system
Munawar et al. Business Intelligence Framework for Mapping Analysis of Crafts Creative Industry Products Exports in West Java, Indonesia
CN117171145B (en) Analysis processing method, equipment and storage medium for enterprise management system data
CN117076454B (en) Engineering quality acceptance form data structured storage method and system
Kun et al. Supplier Management Decision Support System Under Data Mining Algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180615