CN108960437A - Imbalanced data learning method based on industrial manufacturing big data

Info

Publication number: CN108960437A
Application number: CN201810858296.9A
Authority: CN
Inventors: 张彩霞; 王向东; 王新东; 胡绍林
Original assignee: Foshan University
Current assignee: Foshan University
Priority date: 2018-07-31
Filing date: 2018-07-31
Publication date: 2018-12-07

Abstract

The invention discloses an imbalanced data learning method based on industrial manufacturing big data, comprising the following steps: 101, determining the acquisition source and acquisition mode of the industrial manufacturing big data; 102, obtaining the industrial manufacturing big data from the acquisition source according to the acquisition mode of step 101 to form an imbalanced dataset; 103, modifying the imbalanced dataset through a sampling mechanism to provide a balanced data distribution; 104, introducing the imbalanced dataset into the SFBP cost matrix framework, searching and comparing the elements of the cost matrix framework one by one, and counting the number of elements in each row and each column that satisfy the constraint condition; then, by comparing the proportion of constraint-satisfying elements in each row and each column, adding the corresponding number of cost-value rows and columns to the SFBP cost matrix framework, thereby changing the cost matrix framework so as to optimize the degree of balance of the imbalanced dataset.

Description

Imbalanced data learning method based on industrial manufacturing big data

Technical field

The present invention relates to the field of industrial manufacturing big data processing, and in particular to an imbalanced data learning method based on industrial manufacturing big data.

Background technique

The imbalanced learning problem is chiefly concerned with the performance of learning algorithms when data are under-represented and class distributions are severely skewed. In manufacturing, tracking and command network data and measurement and control data come from different devices and concern different objects, and therefore exhibit a typical imbalanced form. Because of the inherently complex characteristics of imbalanced datasets, learning from such data requires new understandings, principles, algorithms and tools that can efficiently transform large amounts of raw data into information and knowledge representations.

Summary of the invention

In order to overcome the above deficiencies of the prior art, the present invention provides an imbalanced data learning method based on industrial manufacturing big data.

The technical solution adopted by the present invention to solve the technical problem is as follows:

An imbalanced data learning method based on industrial manufacturing big data, comprising the following steps:

101, determining the acquisition source and acquisition mode of the industrial manufacturing big data;

102, obtaining the industrial manufacturing big data from the acquisition source according to the acquisition mode of step 101 to form an imbalanced dataset;

103, modifying the imbalanced dataset through a sampling mechanism to provide a balanced data distribution;

104, introducing the imbalanced dataset into the SFBP cost matrix framework, searching and comparing the elements of the cost matrix framework one by one, and counting the number of elements in each row and each column that satisfy the constraint condition; then, by comparing the proportion of constraint-satisfying elements in each row and each column, adding the corresponding number of cost-value rows and columns to the SFBP cost matrix framework, thereby changing the cost matrix framework so as to optimize the degree of balance of the imbalanced dataset.

The sampling mechanism includes random oversampling and undersampling, synthetic sampling with data generation, adaptive synthetic sampling, sampling with data cleaning, cluster-based sampling, and sampling integrated with Boosting.
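As an illustration of the first mechanism in this list, the following is a minimal sketch of random oversampling and undersampling in plain Python. The function name `random_resample`, the `mode` and `seed` parameters, and the toy "normal"/"fault" readings are assumptions made for the example; the source does not specify an implementation.

```python
import random
from collections import Counter, defaultdict

def random_resample(samples, labels, mode="over", seed=0):
    """Balance classes by random oversampling or undersampling.

    mode="over":  duplicate random samples of smaller classes up to the largest class size.
    mode="under": randomly keep only as many samples per class as the smallest class has.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    sizes = [len(xs) for xs in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if mode == "over":
            # Keep every original sample and add random duplicates.
            chosen = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        else:
            # Draw `target` samples without replacement.
            chosen = rng.sample(xs, target)
        out_x.extend(chosen)
        out_y.extend([y] * len(chosen))
    return out_x, out_y

# Hypothetical toy data: 8 "normal" readings versus 2 "fault" readings.
X = [[0.1 * i] for i in range(10)]
y = ["normal"] * 8 + ["fault"] * 2
X_over, y_over = random_resample(X, y, mode="over")
X_under, y_under = random_resample(X, y, mode="under")
print(Counter(y_over), Counter(y_under))
```

Oversampling keeps all original samples at the price of duplicates, while undersampling discards majority-class samples; which trade-off is acceptable depends on the dataset.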

The construction process of the cost matrix specifically comprises the following steps:

Step 1: set the insertion operation cost value Ci and the deletion operation cost value Cd;

Step 2: construct the original cost matrix of the SFBP algorithm;

Step 3: count, row by row, the number of elements in the replacement operation part of the original cost matrix that satisfy the constraint condition, where the constraint condition is:

αCs > (Ci + Cd)

where α is a given parameter, α ∈ (0,1]; Cs is the element value, i.e. the replacement operation cost value; Ci is the set insertion operation cost value; and Cd is the set deletion operation cost value;

Step 4: for every row of the replacement operation part of the original cost matrix, calculate the ratio of the number of elements satisfying the constraint condition to the total number of elements in that row, i.e. the row ratio ti, and count the number m of rows whose row ratio ti is less than the preset ratio parameter q; where i = 1, 2, ..., M, M is the total number of rows of the replacement operation part of the original cost matrix, and q ∈ (0,1];

Step 5: for every column of the replacement operation part of the original cost matrix, calculate the ratio of the number of elements satisfying the constraint condition to the total number of elements in that column, i.e. the column ratio tj, and count the number n of columns whose column ratio tj is less than the preset ratio parameter q; where j = 1, 2, ..., N, N is the total number of columns of the replacement operation part of the original cost matrix, and q ∈ (0,1];

Step 6: from the number of rows m obtained in step 4 and the number of columns n obtained in step 5, calculate r = max(m, n), and add r rows and r columns of elements to the entire original cost matrix, thereby obtaining the cost matrix to be solved in the subsequent calculation of the graph edit distance.
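Steps 3 to 6 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `augment_cost_matrix` is hypothetical, the input is taken to be only the replacement operation part of the cost matrix, and, because the source does not say what values the added rows and columns hold, the sketch fills added columns with Cd, added rows with Ci, and the new corner block with zeros.

```python
def augment_cost_matrix(sub, ci, cd, alpha=1.0, q=0.5):
    """Apply steps 3 to 6 to the replacement (substitution) block `sub`.

    Returns the enlarged matrix together with m, n and r = max(m, n).
    """
    if not (0 < alpha <= 1 and 0 < q <= 1):
        raise ValueError("alpha and q must lie in (0, 1]")
    n_rows, n_cols = len(sub), len(sub[0])

    # Step 3: mark the elements that satisfy alpha * Cs > Ci + Cd.
    hit = [[alpha * cs > ci + cd for cs in row] for row in sub]

    # Step 4: row ratios t_i and the number m of rows with t_i < q.
    row_ratio = [sum(row) / n_cols for row in hit]
    m = sum(t < q for t in row_ratio)

    # Step 5: column ratios t_j and the number n of columns with t_j < q.
    col_ratio = [sum(hit[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    n = sum(t < q for t in col_ratio)

    # Step 6: r = max(m, n); append r columns and r rows.
    r = max(m, n)
    # ASSUMPTION: the fill values below are not given in the source.
    out = [list(row) + [cd] * r for row in sub]
    out += [[ci] * n_cols + [0.0] * r for _ in range(r)]
    return out, m, n, r

# Hypothetical 2 x 4 replacement block with Ci = 1.5 and Cd = 2.5.
sub = [[5, 5, 1, 1],
       [5, 1, 1, 1]]
solved, m, n, r = augment_cost_matrix(sub, ci=1.5, cd=2.5, alpha=1.0, q=0.5)
for row in solved:
    print(row)
```

In this toy input one row and two columns fall below q, so r = 2 and the 2 x 4 block grows to 4 x 6.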

The beneficial effects of the present invention are as follows:

The present invention balances the data distribution of the dataset by integrating multiple sampling mechanisms and, in combination with a cost matrix, describes the cost of misclassifying any particular data sample. The elements of the cost matrix framework are searched and compared one by one, and the number of elements satisfying the constraint condition in each row and each column is counted; by comparing the proportion of constraint-satisfying elements in each row and each column, the corresponding number of cost-value rows and columns is added to the SFBP cost matrix framework, changing the cost matrix framework to achieve the optimization. Once the optimization objective is reached, the cost matrix can be solved with a general-purpose solving algorithm, which avoids the restrictions that the constraint condition places on the choice of algorithm, so that the algorithm is better suited to industrial manufacturing big data.

Specific embodiment

The imbalanced data learning method based on industrial manufacturing big data of the present invention comprises the following steps:

Step 2: construct the original cost matrix of the SFBP algorithm;

αCs > (Ci + Cd)

The sampling mechanism attempts to balance the data distribution by adjusting the representative proportions of samples of different classes. The cost-sensitive learning methods for imbalanced data include cost-sensitive data-space weighting, such as cost-sensitive bootstrap sampling; cost-minimization techniques, such as various meta-techniques; and techniques with built-in cost sensitivity, such as cost-sensitive decision trees and cost-sensitive neural networks.
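To make the idea of cost-sensitive learning concrete, the sketch below applies the standard minimum-expected-cost decision rule: given estimated class probabilities and a misclassification cost matrix, choose the prediction with the lowest expected cost. The function name, the two-class cost values and the probabilities are hypothetical; the source does not prescribe this particular rule.

```python
def min_expected_cost_class(posteriors, cost):
    """Return (best_class_index, expected_costs).

    posteriors[j] is the estimated probability that the true class is j;
    cost[i][j] is the cost of predicting class i when the true class is j.
    """
    expected = [sum(p * c for p, c in zip(posteriors, row)) for row in cost]
    best = min(range(len(expected)), key=expected.__getitem__)
    return best, expected

# Hypothetical costs: missing a fault (class 1) is ten times worse
# than raising a false alarm on a normal sample (class 0).
cost = [[0.0, 10.0],   # predict "normal"
        [1.0, 0.0]]    # predict "fault"

# The sample is probably normal (0.8), yet the rule still predicts "fault".
label, expected = min_expected_cost_class([0.8, 0.2], cost)
print(label, expected)
```

The example shows why a cost matrix matters for imbalanced data: a plain maximum-probability rule would pick the majority class, whereas the asymmetric costs shift the decision toward the rare class.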

In addition, the invention also includes a kernel-based learning method: using statistical learning and Vapnik-Chervonenkis (VC) dimension theory, the total classification error is minimized in combination with an SVM, realizing autonomous learning.
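The source names only SVM and VC-dimension theory and gives no training procedure. As a rough illustration, the sketch below swaps in a linear soft-margin SVM with class-dependent error weights, trained by batch subgradient descent. The function names, the hyperparameters `lam`, `lr` and `epochs`, the class weights and the toy data are all assumptions.

```python
def train_weighted_linear_svm(X, y, class_weight, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on a class-weighted hinge loss.

    Minimises lam/2 * ||w||^2 + mean(c_i * max(0, 1 - y_i * (w.x_i + b))),
    where each y_i is +1 or -1 and c_i = class_weight[y_i].
    """
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]   # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:                # sample violates the margin
                c = class_weight[yi] / n
                for k in range(dim):
                    grad_w[k] -= c * yi * xi[k]
                grad_b -= c * yi
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Classify x by the sign of the decision function."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1

# Hypothetical imbalanced toy set: five negatives, one positive.
X = [[-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0], [-3.0, -3.0], [-2.5, -2.5], [2.0, 2.0]]
y = [-1, -1, -1, -1, -1, 1]
# ASSUMPTION: weight each class inversely to its frequency.
w, b = train_weighted_linear_svm(X, y, class_weight={-1: 1.0, 1: 5.0})
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

Weighting the minority class more heavily keeps its single sample from being outvoted by the majority when the total error is minimised; a kernel SVM would replace the dot products with kernel evaluations.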

It should be noted that the above describes preferred embodiments of the present invention; the invention is not limited to the above embodiments, and any implementation that achieves the technical effect of the invention by the same means shall fall within the protection scope of the present invention.

Claims

1. An imbalanced data learning method based on industrial manufacturing big data, characterized by comprising the following steps:

102, obtaining the industrial manufacturing big data from the acquisition source according to the acquisition mode of step 101 to form an imbalanced dataset;

104, introducing the imbalanced dataset into the SFBP cost matrix framework, searching and comparing the elements of the cost matrix framework one by one, and counting the number of elements in each row and each column that satisfy the constraint condition; then, by comparing the proportion of constraint-satisfying elements in each row and each column, adding the corresponding number of cost-value rows and columns to the SFBP cost matrix framework, thereby changing the cost matrix framework so as to optimize the degree of balance of the imbalanced dataset.

2. The imbalanced data learning method based on industrial manufacturing big data according to claim 1, characterized in that the sampling mechanism includes random oversampling and undersampling, synthetic sampling with data generation, adaptive synthetic sampling, sampling with data cleaning, cluster-based sampling, and sampling integrated with Boosting.

3. The imbalanced data learning method based on industrial manufacturing big data according to claim 1, characterized in that the construction process of the cost matrix specifically comprises the following steps:

Step 2: construct the original cost matrix of the SFBP algorithm;

Step 3: count, row by row, the number of elements in the replacement operation part of the original cost matrix that satisfy the constraint condition, where the constraint condition is:

αCs > (Ci + Cd)

where α is a given parameter, α ∈ (0,1]; Cs is the element value, i.e. the replacement operation cost value; Ci is the set insertion operation cost value; and Cd is the set deletion operation cost value;

Step 4: for every row of the replacement operation part of the original cost matrix, calculate the ratio of the number of elements satisfying the constraint condition to the total number of elements in that row, i.e. the row ratio ti, and count the number m of rows whose row ratio ti is less than the preset ratio parameter q; where i = 1, 2, ..., M, M is the total number of rows of the replacement operation part of the original cost matrix, and q ∈ (0,1];

Step 5: for every column of the replacement operation part of the original cost matrix, calculate the ratio of the number of elements satisfying the constraint condition to the total number of elements in that column, i.e. the column ratio tj, and count the number n of columns whose column ratio tj is less than the preset ratio parameter q; where j = 1, 2, ..., N, N is the total number of columns of the replacement operation part of the original cost matrix, and q ∈ (0,1];

Step 6: from the number of rows m obtained in step 4 and the number of columns n obtained in step 5, calculate r = max(m, n), and add r rows and r columns of elements to the entire original cost matrix, thereby obtaining the cost matrix to be solved in the subsequent calculation of the graph edit distance.