CN107590262A - Semi-supervised learning method for big data analysis

Info

Publication number: CN107590262A
Application number: CN201710861920.6A
Authority: CN
Inventors: 黄国华
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-09-21
Filing date: 2017-09-21
Publication date: 2018-01-16

Abstract

The invention discloses a semi-supervised learning method for big data analysis. Big data are first extracted from multiple data sources and converted according to certain rules into a data format suitable for computer processing. The converted big data then undergo data processing. Finally, a database is built from the processed big data, multiple semi-supervised classifiers are constructed, and a final safe semi-supervised classifier is constructed. Specifically, for a given training data set, multiple semi-supervised classifiers that differ strongly from one another are built first, and the final safe semi-supervised classifier is then built by maximizing the performance improvement in the worst case. In practice the method rarely causes performance degradation, while achieving performance highly comparable to existing classical techniques.

Description

Semi-supervised learning method for big data analysis

Technical field

The present invention relates to the technical field of big data, and specifically to a semi-supervised learning method for big data analysis. It applies the key technique of learning from unlabeled data: without external intervention, it automatically uses a large amount of unlabeled data to improve the learner's generalization ability over the whole data distribution.

Background art

In big data application systems, many industries today, and the intelligence analysis field in particular, can obtain different data from multiple data sources. These include information from industry and commerce, civil aviation, entry and exit, household registration and similar systems, registration information from all kinds of portal websites (such as group-buying websites, recruitment websites and social networking sites), and various data obtained by web crawlers. The data types include structured data, semi-structured data and unstructured data. The content and format of these data are disorganized, and true and false information are mixed together. Big data analysis technology is therefore needed to mine useful, valuable information from massive multi-source data and to provide data support for all kinds of analysis applications.

Machine learning methods attempt to use the historical data of a task to improve performance on that task. To achieve good learning performance, machine learning methods such as supervised learning usually require that the historical data carry explicit concept labels (referred to as labeled data) and that a large amount of labeled data be available. In many practical tasks, obtaining concept labels consumes a great deal of manpower and material resources, so labeled data are usually scarce, whereas large amounts of historical data without concept labels (referred to as unlabeled data) are easy to obtain. How to use a large amount of unlabeled data to improve on the performance obtained from a small amount of labeled data alone has become an important topic in machine learning, and semi-supervised learning is one of the two mainstream techniques in this area.

Semi-supervised learning methods have been widely applied in many areas. In many cases, however, existing semi-supervised learning methods suffer performance degradation when they use unlabeled data: the performance of the semi-supervised method can be significantly lower than that of a supervised learning method trained directly on the small amount of labeled data. This phenomenon has seriously hindered the use of semi-supervised learning in practical tasks, because users generally expect that using a semi-supervised method will not make performance worse. A safe semi-supervised learning method is therefore needed: on the one hand it should usually improve performance, and on the other hand it should rarely cause a significant drop in performance. Because semi-supervised learning problems are widespread, results in this area will be useful in many practical tasks.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method that extracts big data from multiple heterogeneous data sources and applies rule-based conversion to the big data; performs data processing on the converted big data; and builds a database from the processed big data and carries out supervised learning on it to obtain prediction results.

To solve the above technical problem, the present invention adopts the following technical solution:

A semi-supervised learning method for big data analysis, comprising the following steps:

Step 1, extracting big data from multiple data sources, the big data including structured data and unstructured data, and converting the big data according to certain rules to obtain a data format suitable for computer processing;

Step 2, performing data processing on the big data after the rule-based conversion;

Step 3, building a database from the big data after data processing;

Step 4, randomly initializing multiple semi-supervised classifiers for the labeled data and unlabeled data in the database;

Step 5, for each initial semi-supervised classifier, optimizing the prediction result of the semi-supervised classifier by an optimization method according to the objective function of the semi-supervised classifier;

Step 6, dividing the prediction results of the optimized semi-supervised classifiers into multiple objective values, and extracting the semi-supervised classifier whose objective value is optimal;

Step 7, training on the labeled data to obtain a supervised learning method, and predicting the unlabeled data with the supervised learning method to obtain a prediction result on the unlabeled data;

Step 8, defining, according to the prediction result of the supervised learning method, a performance improvement function for any prediction result on the unlabeled data;

Step 9, for any prediction result on the unlabeled data, evaluating the performance improvement function to obtain the minimum performance improvement, and defining the performance improvement function corresponding to that minimum as the worst-case performance improvement function;

Step 10, optimizing the prediction result on the unlabeled data by an optimization method according to the worst-case performance improvement objective function, and outputting the optimization result as the prediction result of the final safe semi-supervised classifier.
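
Steps 7 to 10 can be illustrated with a minimal sketch. The text does not specify the performance improvement function or the optimizer, so the sketch makes two assumptions: the improvement of a prediction is measured as its accuracy gain over the supervised baseline when one of the candidate semi-supervised predictions is taken as the hypothetical ground truth, and the optimization is a plain search over a finite pool (the baseline plus the candidates). The names `improvement`, `worst_case_improvement` and `safe_prediction` are hypothetical.

```python
from typing import List, Sequence, Tuple

Labels = Sequence[int]


def accuracy(pred: Labels, truth: Labels) -> float:
    """Fraction of positions where pred agrees with truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def improvement(pred: Labels, baseline: Labels, assumed_truth: Labels) -> float:
    """Assumed improvement function (Step 8): accuracy gain of pred over the
    supervised baseline if assumed_truth were the real labeling."""
    return accuracy(pred, assumed_truth) - accuracy(baseline, assumed_truth)


def worst_case_improvement(pred: Labels, baseline: Labels,
                           candidates: Sequence[Labels]) -> float:
    """Step 9: smallest improvement over all candidate labelings."""
    return min(improvement(pred, baseline, c) for c in candidates)


def safe_prediction(baseline: Labels,
                    candidates: Sequence[Labels]) -> Tuple[List[int], float]:
    """Step 10 (simplified): pick the labeling in the pool that maximizes the
    worst-case improvement. The baseline is in the pool and scores exactly 0,
    so the returned score is never negative."""
    pool = [list(baseline)] + [list(c) for c in candidates]
    scored = [(worst_case_improvement(p, baseline, candidates), p) for p in pool]
    best_score, best = max(scored, key=lambda item: item[0])
    return best, best_score
```

When the candidate classifiers disagree with one another, this search falls back to the supervised baseline; when they agree on a labeling that differs from the baseline, that labeling is returned. This is the sense in which the result is "safe" under the stated assumptions.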

Initializing a semi-supervised classifier means initializing its prediction result on the unlabeled data.
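
A minimal sketch of this initialization, assuming binary class labels and a fixed random seed for reproducibility (both are illustrative choices, and `init_predictions` is a hypothetical helper name): each classifier starts from its own random labeling of the unlabeled data.

```python
import random
from typing import List, Sequence


def init_predictions(n_unlabeled: int, n_classifiers: int,
                     classes: Sequence[int] = (0, 1),
                     seed: int = 0) -> List[List[int]]:
    """Return one random label vector over the unlabeled data per classifier."""
    rng = random.Random(seed)  # local generator so the global state is untouched
    return [[rng.choice(classes) for _ in range(n_unlabeled)]
            for _ in range(n_classifiers)]
```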

The objective function of the semi-supervised classifier includes the margin between data of different classes and the probability likelihood.

The supervised learning method in Step 7 includes generative model methods, the nearest-neighbor supervised learning method (KNN) and the support vector machine learning method (SVM).
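
As an illustration of Step 7 with the nearest-neighbor option, the following sketch trains nothing explicitly: it labels each unlabeled point by majority vote among its k closest labeled points. Euclidean distance, k = 3 and the function name `knn_predict` are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Point = Sequence[float]


def knn_predict(labeled: Sequence[Tuple[Point, Hashable]],
                unlabeled: Sequence[Point], k: int = 3) -> List[Hashable]:
    """Predict a label for every unlabeled point from its k nearest labeled points."""
    predictions = []
    for x in unlabeled:
        # Sort labeled examples by Euclidean distance to x and keep the k closest.
        nearest = sorted(labeled, key=lambda item: math.dist(item[0], x))[:k]
        votes = Counter(label for _, label in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```

The returned list plays the role of the supervised prediction result on the unlabeled data that Steps 8 to 10 use as a baseline.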

The rule-based conversion of the big data includes data cleaning and data preprocessing, which include at least one of the following: format standardization, removal of abnormal data, error correction, and deduplication.
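
The four cleaning operations can be sketched on a toy record set. The record fields (`name`, `age`), the valid age range and the specific correction rule (a letter "O" typed in place of a zero) are hypothetical and chosen only to show one instance of each operation.

```python
from typing import Dict, Iterable, List


def clean_records(records: Iterable[Dict]) -> List[Dict]:
    """Apply format standardization, error correction, abnormal-data removal
    and deduplication to records with hypothetical 'name' and 'age' fields."""
    cleaned, seen = [], set()
    for rec in records:
        # Format standardization: collapse whitespace and use title case.
        name = " ".join(str(rec.get("name", "")).split()).title()
        # Error correction: treat a letter O in the age as a mistyped zero.
        raw_age = str(rec.get("age", "")).strip().replace("O", "0")
        if not name or not raw_age.isdecimal():
            continue  # unusable record
        age = int(raw_age)
        # Abnormal data removal: drop ages outside an assumed plausible range.
        if not 0 <= age <= 120:
            continue
        # Deduplication on the standardized values.
        key = (name, age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned
```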

Where the big data are structured data, the data processing performed on the converted big data includes at least one of the following: object extraction, data association, confidence calculation, label calculation, and model calculation.

Where the big data are unstructured data, the data processing performed on the converted big data includes at least one of the following: word segmentation and feature value extraction.
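
A minimal sketch of these two operations on English text, assuming a regular-expression tokenizer as a stand-in for word segmentation and bag-of-words counts as the extracted feature values. Text in languages without word delimiters, such as Chinese, would need a dedicated segmenter, which is not shown here. The names `tokenize` and `bag_of_words` are hypothetical.

```python
import re
from collections import Counter
from typing import List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(docs: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Return a sorted vocabulary and one term-count vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors
```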

After the first step of data processing has been performed on the big data, the multi-dimensional feature data also need to undergo dimensionality reduction. The dimensionality reduction methods include linear dimensionality reduction and nonlinear dimensionality reduction.
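
For the linear case, one common choice is principal component analysis; the text does not name a specific technique, so this is an assumption. The sketch below projects the data onto the first principal component, found by power iteration on the sample covariance matrix. The function name `first_principal_component` is hypothetical, and the nonlinear case is not illustrated.

```python
import math
from typing import List, Sequence, Tuple


def first_principal_component(rows: Sequence[Sequence[float]],
                              iterations: int = 200
                              ) -> Tuple[List[float], List[float]]:
    """Return the leading principal direction and the 1-D projection of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration; assumes the start vector is not orthogonal to the
    # leading eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # data have no variance; keep the current direction
        v = [x / norm for x in w]
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected
```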

The present invention analyzes various kinds of information from multi-source heterogeneous data and builds a database, thereby providing data support for data analysis, behavior analysis, user profile analysis and relationship discovery. By using a safe semi-supervised learning method, it rarely causes a significant drop in performance in practice, while achieving performance highly comparable to the prior art.

Brief description of the drawings

Figure 1 is a schematic flowchart of the present invention.

Embodiment

To help those skilled in the art understand it, the invention is further described below with reference to the accompanying drawing.

The present invention is directed at massive multi-source data, which include structured data, semi-structured data and unstructured data. The various characteristic attributes of persons and the topology graph of relationships between persons are calculated from all of the data. A complex processing procedure is performed on the data, including data extraction, data cleaning, data backfilling and attribute value calculation. The calculated attribute values are inserted into a unified object table so that they can be retrieved and displayed through an interface.

As shown in Figure 1, a semi-supervised learning method for big data analysis comprises Steps 1 to 10 as set out in the summary above.

It should be noted that the above description does not limit the present invention. Any obvious replacement made without departing from the inventive concept of the present invention falls within the scope of protection of the present invention.

Claims

1. A semi-supervised learning method for big data analysis, comprising the following steps:

Step 1, extracting big data from multiple data sources, the big data including structured data and unstructured data, and converting the big data according to certain rules to obtain a data format suitable for computer processing;

Step 2, performing data processing on the big data after the rule-based conversion;

Step 3, building a database from the big data after data processing;

Step 4, randomly initializing multiple semi-supervised classifiers for the labeled data and unlabeled data in the database;

Step 5, for each initial semi-supervised classifier, optimizing the prediction result of the semi-supervised classifier by an optimization method according to the objective function of the semi-supervised classifier;

Step 6, dividing the prediction results of the optimized semi-supervised classifiers into multiple objective values, and extracting the semi-supervised classifier whose objective value is optimal;

Step 7, training on the labeled data to obtain a supervised learning method, and predicting the unlabeled data with the supervised learning method to obtain a prediction result on the unlabeled data;

Step 8, defining, according to the prediction result of the supervised learning method, a performance improvement function for any prediction result on the unlabeled data;

Step 9, for any prediction result on the unlabeled data, evaluating the performance improvement function to obtain the minimum performance improvement, and defining the performance improvement function corresponding to that minimum as the worst-case performance improvement function;

Step 10, optimizing the prediction result on the unlabeled data by an optimization method according to the worst-case performance improvement objective function, and outputting the optimization result as the prediction result of the final safe semi-supervised classifier.

2. The semi-supervised learning method for big data analysis according to claim 1, characterized in that initializing a semi-supervised classifier means initializing its prediction result on the unlabeled data.

3. The semi-supervised learning method for big data analysis according to claim 2, characterized in that the objective function of the semi-supervised classifier includes the margin between data of different classes and the probability likelihood.

4. The semi-supervised learning method for big data analysis according to claim 3, characterized in that the supervised learning method in Step 7 includes generative model methods, the nearest-neighbor supervised learning method (KNN) and the support vector machine learning method (SVM).

5. The semi-supervised learning method for big data analysis according to claim 1, characterized in that the rule-based conversion of the big data includes data cleaning and data preprocessing, which include at least one of the following: format standardization, removal of abnormal data, error correction, and deduplication.

6. The semi-supervised learning method for big data analysis according to claim 1, characterized in that, where the big data are structured data, the data processing performed on the converted big data includes at least one of the following: object extraction, data association, confidence calculation, label calculation, and model calculation.

7. The semi-supervised learning method for big data analysis according to claim 1, characterized in that, where the big data are unstructured data, the data processing performed on the converted big data includes at least one of the following: word segmentation and feature value extraction.

8. The semi-supervised learning method for big data analysis according to claim 1, characterized in that, after the first step of data processing has been performed on the big data, the multi-dimensional feature data also need to undergo dimensionality reduction, and the dimensionality reduction methods include linear dimensionality reduction and nonlinear dimensionality reduction.