CN108960437A - Unbalanced data learning method based on industry manufacture big data - Google Patents

Unbalanced data learning method based on industry manufacture big data

Info

Publication number
CN108960437A
Authority
CN
China
Prior art keywords
cost matrix
data
cost
unbalanced
column
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810858296.9A
Other languages
Chinese (zh)
Inventor
张彩霞
王向东
王新东
胡绍林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan University
Original Assignee
Foshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-07-31
Filing date
2018-07-31
Publication date
2018-12-07
Application filed by Foshan University
Priority to CN201810858296.9A
Publication of CN108960437A

Abstract

The invention discloses an unbalanced data learning method based on industrial manufacturing big data, comprising the following steps: 101, determining the acquisition source and the acquisition mode of the industrial manufacturing big data; 102, acquiring industrial manufacturing big data from the acquisition source according to the acquisition mode of step 101 to form an unbalanced dataset; 103, modifying the unbalanced dataset through a sampling mechanism to provide a balanced data distribution; 104, introducing the unbalanced dataset into the SFBP cost matrix framework, searching and comparing the elements of the cost matrix framework one by one, and counting the number of elements in each row and each column that satisfy the constraint condition; then, by comparing the proportion of constraint-satisfying elements in each row and each column, adding a corresponding number of rows and columns of cost values to the SFBP cost matrix framework, thereby changing the cost matrix framework so as to optimize the degree of balance of the unbalanced dataset.

Description

Unbalanced data learning method based on industrial manufacturing big data
Technical field
The present invention relates to the field of industrial manufacturing big data processing, and in particular to an unbalanced data learning method based on industrial manufacturing big data.
Background
The unbalanced learning problem is mainly concerned with the performance of learning algorithms when the data are under-represented and the class distribution is severely skewed. In manufacturing, the data of tracking and command networks and of measurement and control systems come from different devices and concern different objects, and therefore exhibit a typically unbalanced form. Because of the inherently complex characteristics of unbalanced datasets, learning from such data requires new understanding, new principles, new algorithms and new tools that efficiently transform large amounts of raw data into information and knowledge representations.
Summary of the invention
In order to overcome the above deficiencies of the prior art, the present invention provides an unbalanced data learning method based on industrial manufacturing big data.
The technical solution adopted by the present invention to solve the technical problem is as follows:
An unbalanced data learning method based on industrial manufacturing big data, comprising the following steps:
101, determining the acquisition source and the acquisition mode of the industrial manufacturing big data;
102, acquiring industrial manufacturing big data from the acquisition source according to the acquisition mode of step 101 to form an unbalanced dataset;
103, modifying the unbalanced dataset through a sampling mechanism to provide a balanced data distribution;
104, introducing the unbalanced dataset into the SFBP cost matrix framework, searching and comparing the elements of the cost matrix framework one by one, and counting the number of elements in each row and each column that satisfy the constraint condition; then, by comparing the proportion of constraint-satisfying elements in each row and each column, adding a corresponding number of rows and columns of cost values to the SFBP cost matrix framework, thereby changing the cost matrix framework so as to optimize the degree of balance of the unbalanced dataset.
The sampling mechanism includes random oversampling and undersampling, synthetic sampling with data generation, adaptive synthetic sampling, sampling with data cleaning, cluster-based sampling, and sampling integrated with Boosting.
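The first mechanism in the list, random oversampling and undersampling, can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation; the function names, the list-based data layout and the fixed random seed are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class is as large as the largest one."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, cnt in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - cnt):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_x, out_y = [], []
    for cls in counts:
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for x in rng.sample(pool, target):
            out_x.append(x)
            out_y.append(cls)
    return out_x, out_y


# Hypothetical data: five "normal" readings (class 0) and one "fault" reading (class 1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [5.0]]
y = [0, 0, 0, 0, 0, 1]
over_x, over_y = random_oversample(X, y)
under_x, under_y = random_undersample(X, y)
```

Oversampling keeps all original data but repeats minority samples, while undersampling discards majority samples; which trade-off is acceptable depends on the dataset.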
The construction of the cost matrix specifically comprises the following steps:
Step 1: set the insertion operation cost value Ci and the deletion operation cost value Cd;
Step 2: construct the original cost matrix of the SFBP algorithm;
Step 3: count, row by row, the number of elements of the replacement-operation part of the original cost matrix that satisfy the constraint condition, where the constraint condition is:
αCs>(Ci+Cd)
where α is a given parameter, α ∈ (0,1]; Cs is the element value, i.e. the replacement operation cost value; Ci is the set insertion operation cost value; and Cd is the set deletion operation cost value;
Step 4: for each row of the replacement-operation part of the original cost matrix, calculate the ratio of the number of elements satisfying the constraint condition to the total number of elements in that row, i.e. the row proportion ti, and count the number of rows m whose row proportion ti is less than the preset proportion parameter q; where i=1,2,...,M, M is the total number of rows of the replacement-operation part of the original cost matrix, and q ∈ (0,1];
Step 5: for each column of the replacement-operation part of the original cost matrix, calculate the ratio of the number of elements satisfying the constraint condition to the total number of elements in that column, i.e. the column proportion tj, and count the number of columns n whose column proportion tj is less than the preset proportion parameter q; where j=1,2,...,N, N is the total number of columns of the replacement-operation part of the original cost matrix, and q ∈ (0,1];
Step 6: from the number of rows m obtained in step 4 and the number of columns n obtained in step 5, calculate r=max(m, n), and add r rows and r columns of elements to the entire original cost matrix, thereby obtaining the solution cost matrix used for the subsequent calculation of the graph edit distance.
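As an illustration of steps 1 and 3, the following minimal Python sketch checks the constraint condition αCs>(Ci+Cd) element by element and counts the satisfying elements in each row. The function names, the list-of-lists layout of the replacement-operation part and the numeric values are assumptions made for the example; they are not taken from the patent.

```python
def satisfies_constraint(cs, ci, cd, alpha):
    """Return True when alpha * Cs > Ci + Cd for one replacement cost Cs."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    return alpha * cs > ci + cd


def count_per_row(sub_costs, ci, cd, alpha):
    """Step 3: number of constraint-satisfying elements in each row."""
    return [
        sum(satisfies_constraint(cs, ci, cd, alpha) for cs in row)
        for row in sub_costs
    ]


# Step 1 (hypothetical values): insertion and deletion costs.
CI, CD, ALPHA = 1.0, 1.0, 0.5

# Hypothetical replacement-operation part of an original cost matrix.
sub_costs = [
    [1.0, 5.0, 9.0],
    [2.0, 2.0, 2.0],
]

row_counts = count_per_row(sub_costs, CI, CD, ALPHA)
print(row_counts)  # [2, 0]
```

With these values an element satisfies the constraint only when Cs > 4, so the first row contributes two elements and the second row none.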
The beneficial effects of the present invention are as follows:
By integrating multiple sampling mechanisms to balance the data distribution of the dataset, and by using a cost matrix to describe the cost of misclassifying any particular data sample, the present invention searches and compares the elements of the cost matrix framework one by one and counts the number of elements in each row and each column that satisfy the constraint condition. By comparing the proportion of constraint-satisfying elements in each row and each column, a corresponding number of rows and columns of cost values is added to the SFBP cost matrix framework, changing the cost matrix framework so as to achieve the optimization objective. Once the optimization objective is reached, a general-purpose solving algorithm can be used to solve the cost matrix, which avoids the restriction that the constraint condition places on the use of the algorithm, so that the algorithm can be better applied to industrial manufacturing big data.
Detailed description
The unbalanced data learning method based on industrial manufacturing big data of the present invention comprises the following steps:
101, determining the acquisition source and the acquisition mode of the industrial manufacturing big data;
102, acquiring industrial manufacturing big data from the acquisition source according to the acquisition mode of step 101 to form an unbalanced dataset;
103, modifying the unbalanced dataset through a sampling mechanism to provide a balanced data distribution;
104, introducing the unbalanced dataset into the SFBP cost matrix framework, searching and comparing the elements of the cost matrix framework one by one, and counting the number of elements in each row and each column that satisfy the constraint condition; then, by comparing the proportion of constraint-satisfying elements in each row and each column, adding a corresponding number of rows and columns of cost values to the SFBP cost matrix framework, thereby changing the cost matrix framework so as to optimize the degree of balance of the unbalanced dataset.
The sampling mechanism includes random oversampling and undersampling, synthetic sampling with data generation, adaptive synthetic sampling, sampling with data cleaning, cluster-based sampling, and sampling integrated with Boosting.
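Synthetic sampling with data generation can be illustrated by interpolating between a minority sample and one of its nearest minority neighbours, in the style of the well-known SMOTE technique. The description does not name a specific algorithm, so the choice of SMOTE-style interpolation, the function name and the parameters `k` and `seed` are assumptions of this sketch.

```python
import math
import random


def synthetic_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority samples and their neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate, in [0, 1)
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = synthetic_oversample(minority, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority samples, the generated data stay inside the region already occupied by the minority class.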
The construction of the cost matrix specifically comprises the following steps:
Step 1: set the insertion operation cost value Ci and the deletion operation cost value Cd;
Step 2: construct the original cost matrix of the SFBP algorithm;
Step 3: count, row by row, the number of elements of the replacement-operation part of the original cost matrix that satisfy the constraint condition, where the constraint condition is:
αCs>(Ci+Cd)
where α is a given parameter, α ∈ (0,1]; Cs is the element value, i.e. the replacement operation cost value; Ci is the set insertion operation cost value; and Cd is the set deletion operation cost value;
Step 4: for each row of the replacement-operation part of the original cost matrix, calculate the ratio of the number of elements satisfying the constraint condition to the total number of elements in that row, i.e. the row proportion ti, and count the number of rows m whose row proportion ti is less than the preset proportion parameter q; where i=1,2,...,M, M is the total number of rows of the replacement-operation part of the original cost matrix, and q ∈ (0,1];
Step 5: for each column of the replacement-operation part of the original cost matrix, calculate the ratio of the number of elements satisfying the constraint condition to the total number of elements in that column, i.e. the column proportion tj, and count the number of columns n whose column proportion tj is less than the preset proportion parameter q; where j=1,2,...,N, N is the total number of columns of the replacement-operation part of the original cost matrix, and q ∈ (0,1];
Step 6: from the number of rows m obtained in step 4 and the number of columns n obtained in step 5, calculate r=max(m, n), and add r rows and r columns of elements to the entire original cost matrix, thereby obtaining the solution cost matrix used for the subsequent calculation of the graph edit distance.
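Steps 3 to 6 can be sketched as follows. This is a non-authoritative illustration under stated assumptions: the replacement-operation part is treated as the entire original cost matrix, and the value written into the added rows and columns (the hypothetical `fill` parameter) is not specified by the description.

```python
def build_solution_matrix(sub_costs, ci, cd, alpha, q, fill=0.0):
    """Return (extended matrix, m, n, r) following steps 3 to 6."""
    rows, cols = len(sub_costs), len(sub_costs[0])

    # Step 3: mark the elements that satisfy alpha * Cs > Ci + Cd.
    mask = [[alpha * cs > ci + cd for cs in row] for row in sub_costs]

    # Step 4: row proportions t_i and number of rows m with t_i < q.
    row_ratio = [sum(flags) / cols for flags in mask]
    m = sum(t < q for t in row_ratio)

    # Step 5: column proportions t_j and number of columns n with t_j < q.
    col_ratio = [sum(mask[i][j] for i in range(rows)) / rows for j in range(cols)]
    n = sum(t < q for t in col_ratio)

    # Step 6: add r = max(m, n) rows and r columns.
    # The fill value is an assumption of this sketch, not given in the source.
    r = max(m, n)
    extended = [list(row) + [fill] * r for row in sub_costs]
    extended += [[fill] * (cols + r) for _ in range(r)]
    return extended, m, n, r


# Hypothetical 2 x 3 replacement-operation part and parameters.
sub_costs = [
    [5.0, 5.0, 1.0],
    [5.0, 1.0, 1.0],
]
extended, m, n, r = build_solution_matrix(sub_costs, ci=1.0, cd=1.0, alpha=0.5, q=0.6)
print(m, n, r)  # 1 2 2
```

In this example the row proportions are 2/3 and 1/3 and the column proportions are 1, 1/2 and 0, so with q = 0.6 one row and two columns fall below the threshold and the 2 x 3 matrix grows to 4 x 5.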
By integrating multiple sampling mechanisms to balance the data distribution of the dataset, and by using a cost matrix to describe the cost of misclassifying any particular data sample, the present invention searches and compares the elements of the cost matrix framework one by one and counts the number of elements in each row and each column that satisfy the constraint condition. By comparing the proportion of constraint-satisfying elements in each row and each column, a corresponding number of rows and columns of cost values is added to the SFBP cost matrix framework, changing the cost matrix framework so as to achieve the optimization objective. Once the optimization objective is reached, a general-purpose solving algorithm can be used to solve the cost matrix, which avoids the restriction that the constraint condition places on the use of the algorithm, so that the algorithm can be better applied to industrial manufacturing big data.
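The description does not say which general-purpose solving algorithm is meant. One common way to solve a cost matrix is as a minimum-cost assignment problem, usually with the Hungarian algorithm. The sketch below swaps in a brute-force search over all permutations as a stand-in that is only practical for very small matrices; it also assumes a square cost matrix, and the function name is hypothetical.

```python
from itertools import permutations


def solve_assignment(cost):
    """Return (minimum total cost, column assigned to each row) by exhaustive search."""
    size = len(cost)
    if any(len(row) != size for row in cost):
        raise ValueError("cost matrix must be square")
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(size)):
        total = sum(cost[i][perm[i]] for i in range(size))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_cost, list(best_perm)


# Hypothetical 3 x 3 solution cost matrix.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, assignment = solve_assignment(cost)
print(total, assignment)  # 5 [1, 0, 2]
```

The exhaustive search grows factorially with the matrix size, so a polynomial-time assignment solver would be needed for matrices of realistic size.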
Unlike sampling, which attempts to balance the data distribution by adjusting the representative proportions of the samples of each class, cost-sensitive learning methods for unbalanced data include cost-sensitive data-space weighting, cost-sensitive bootstrap sampling, cost-minimization techniques such as various meta-techniques, and related cost-sensitive techniques such as cost-sensitive decision trees and cost-sensitive neural networks.
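The idea shared by these cost-sensitive methods can be illustrated with a minimum-expected-cost decision rule: instead of predicting the most probable class, the classifier predicts the class whose expected misclassification cost is lowest. The cost values, class probabilities and function name below are assumptions chosen for the example.

```python
def min_expected_cost_class(probabilities, cost_matrix):
    """Pick the class with the lowest expected cost.

    cost_matrix[i][j] is the cost of predicting class i when the true class is j.
    """
    n_classes = len(probabilities)
    expected = [
        sum(cost_matrix[i][j] * probabilities[j] for j in range(n_classes))
        for i in range(n_classes)
    ]
    best = min(range(n_classes), key=expected.__getitem__)
    return best, expected


# Hypothetical posterior: 80 % "normal" (class 0), 20 % "fault" (class 1).
probs = [0.8, 0.2]

# Hypothetical costs: missing a fault (predict 0, true 1) is ten times worse
# than a false alarm (predict 1, true 0).
costs = [
    [0.0, 10.0],
    [1.0, 0.0],
]

label, expected = min_expected_cost_class(probs, costs)
print(label)  # 1
```

Although class 0 is four times more probable, the expected cost of predicting it (about 2.0) exceeds that of predicting class 1 (0.8), so the rare class is chosen.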
In addition, the invention also includes kernel-based learning methods: using statistical learning theory and Vapnik-Chervonenkis (VC) dimension theory in combination with an SVM, the total classification error is minimized and autonomous learning is realized.
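As a rough illustration of a kernel-based method, the sketch below evaluates an SVM-style decision function with an RBF kernel and a class-weighted hinge loss, one common way of making an SVM objective sensitive to class imbalance. It does not train a model. The kernel choice, the class weights, the support vectors and the coefficients are all hypothetical values chosen for the example, not taken from the patent.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, support_vectors, coeffs, bias, gamma=1.0):
    """f(x) = sum_i coeff_i * K(sv_i, x) + bias, where coeff_i stands for alpha_i * y_i."""
    return sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, coeffs)
    ) + bias


def weighted_hinge_loss(xs, ys, class_weight, **model):
    """Sum of class_weight[y] * max(0, 1 - y * f(x)) over all samples, with y in {-1, +1}."""
    return sum(
        class_weight[y] * max(0.0, 1.0 - y * decision_value(x, **model))
        for x, y in zip(xs, ys)
    )


# Hypothetical model: one support vector per class.
svs = [[0.0], [2.0]]
coeffs = [-1.0, 1.0]

# Hypothetical data: the minority class (+1) is weighted ten times more heavily.
xs, ys = [[0.0], [2.0]], [-1, 1]
loss = weighted_hinge_loss(
    xs, ys, {-1: 1.0, 1: 10.0}, support_vectors=svs, coeffs=coeffs, bias=0.0
)
```

Raising the weight of the minority class makes margin violations on minority samples more expensive, which pushes a trained decision boundary away from them.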
It should be noted that the above describes preferred embodiments of the present invention. The invention is not limited to the above embodiments; any implementation that achieves the technical effect of the invention by the same means shall fall within the protection scope of the invention.

Claims (3)

1. An unbalanced data learning method based on industrial manufacturing big data, characterized by comprising the following steps:
101, determining the acquisition source and the acquisition mode of the industrial manufacturing big data;
102, acquiring industrial manufacturing big data from the acquisition source according to the acquisition mode of step 101 to form an unbalanced dataset;
103, modifying the unbalanced dataset through a sampling mechanism to provide a balanced data distribution;
104, introducing the unbalanced dataset into the SFBP cost matrix framework, searching and comparing the elements of the cost matrix framework one by one, and counting the number of elements in each row and each column that satisfy the constraint condition; then, by comparing the proportion of constraint-satisfying elements in each row and each column, adding a corresponding number of rows and columns of cost values to the SFBP cost matrix framework, thereby changing the cost matrix framework so as to optimize the degree of balance of the unbalanced dataset.
2. The unbalanced data learning method based on industrial manufacturing big data according to claim 1, characterized in that the sampling mechanism includes random oversampling and undersampling, synthetic sampling with data generation, adaptive synthetic sampling, sampling with data cleaning, cluster-based sampling, and sampling integrated with Boosting.
3. The unbalanced data learning method based on industrial manufacturing big data according to claim 1, characterized in that the construction of the cost matrix specifically comprises the following steps:
Step 1: set the insertion operation cost value Ci and the deletion operation cost value Cd;
Step 2: construct the original cost matrix of the SFBP algorithm;
Step 3: count, row by row, the number of elements of the replacement-operation part of the original cost matrix that satisfy the constraint condition, where the constraint condition is:
αCs>(Ci+Cd)
where α is a given parameter, α ∈ (0,1]; Cs is the element value, i.e. the replacement operation cost value; Ci is the set insertion operation cost value; and Cd is the set deletion operation cost value;
Step 4: for each row of the replacement-operation part of the original cost matrix, calculate the ratio of the number of elements satisfying the constraint condition to the total number of elements in that row, i.e. the row proportion ti, and count the number of rows m whose row proportion ti is less than the preset proportion parameter q; where i=1,2,...,M, M is the total number of rows of the replacement-operation part of the original cost matrix, and q ∈ (0,1];
Step 5: for each column of the replacement-operation part of the original cost matrix, calculate the ratio of the number of elements satisfying the constraint condition to the total number of elements in that column, i.e. the column proportion tj, and count the number of columns n whose column proportion tj is less than the preset proportion parameter q; where j=1,2,...,N, N is the total number of columns of the replacement-operation part of the original cost matrix, and q ∈ (0,1];
Step 6: from the number of rows m obtained in step 4 and the number of columns n obtained in step 5, calculate r=max(m, n), and add r rows and r columns of elements to the entire original cost matrix, thereby obtaining the solution cost matrix used for the subsequent calculation of the graph edit distance.
CN201810858296.9A 2018-07-31 2018-07-31 Unbalanced data learning method based on industry manufacture big data Pending CN108960437A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810858296.9A CN108960437A (en) 2018-07-31 2018-07-31 Unbalanced data learning method based on industry manufacture big data

Publications (1)

Publication Number Publication Date
CN108960437A true CN108960437A (en) 2018-12-07

Family

ID=64465627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810858296.9A Pending CN108960437A (en) 2018-07-31 2018-07-31 Unbalanced data learning method based on industry manufacture big data

Country Status (1)

Country Link
CN (1) CN108960437A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140089237A1 (en) * 2012-09-25 2014-03-27 Reunify Llc Methods and systems for scalable group detection from multiple data streams
CN105630936A (en) * 2015-12-22 2016-06-01 北京奇虎科技有限公司 Unbalanced data processing method and device based on single-class decision tree
CN107609592A (en) * 2017-09-15 2018-01-19 桂林电子科技大学 A kind of figure edit distance approach towards Letter identification
CN107886135A (en) * 2017-12-01 2018-04-06 江苏蓝深远望科技股份有限公司 A kind of parallel random forests algorithm for handling uneven big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KOLOMVATSOS ET AL.: "A load balance module for post-emergency management", 《EXPERT SYSTEMS WITH APPLICATIONS》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination