CN107590262A - Semi-supervised learning method for big data analysis - Google Patents

Semi-supervised learning method for big data analysis

Publication number
CN107590262A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710861920.6A
Other languages
Chinese (zh)
Inventor
黄国华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-09-21
Filing date
2017-09-21
Publication date
2018-01-16
Application filed by Individual
Priority to CN201710861920.6A
Publication of CN107590262A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a semi-supervised learning method for big data analysis. Big data is first extracted from multiple data sources and converted according to certain rules into a data format suitable for computer processing. The converted big data is then processed. Finally, a database is built from the processed big data, multiple semi-supervised classifiers are constructed, and a final safe semi-supervised classifier is built. Specifically, multiple highly diverse semi-supervised classifiers are first built for a given training data set, and the final safe semi-supervised classifier is then built by maximizing the performance improvement in the worst case. The method rarely causes performance degradation in practice, while achieving performance highly comparable to existing classical techniques.

Description

Semi-supervised learning method for big data analysis
Technical field
The present invention relates to the technical field of big data, and specifically to a semi-supervised learning method for big data analysis. It is an important technique for learning from unlabeled data: without outside intervention, it automatically uses large amounts of unlabeled data to improve a learner's generalization ability over the whole data distribution.
Background art
In big data application systems, many industries today, especially the field of intelligence analysis, can obtain different data from multiple data sources: information from industry and commerce registries, civil aviation, entry and exit records, household registration, and so on; registration information from various portal websites (such as group-buying, recruitment, and social networking sites); and various data obtained by web crawlers. The data types include structured, semi-structured, and unstructured data. The content and format of these data are disorganized, and true and false information is mixed together. Big data analysis technology is therefore needed to mine useful, valuable information from massive multi-source data and provide data support for all kinds of analysis applications.
Machine learning methods attempt to use a task's historical data to improve performance on that task. To learn well, machine learning methods such as supervised learning usually require that the historical data carry explicit concept labels (such data is called labeled data) and that a large amount of labeled data be available. In many practical tasks, obtaining concept labels consumes considerable manpower and material resources, so labeled data is usually scarce, whereas historical data without concept labels (called unlabeled data) is easy to obtain in large quantities. How to use large amounts of unlabeled data to improve on the performance obtained from a small amount of labeled data alone has become an important topic in machine learning, and semi-supervised learning is one of the two mainstream techniques in this area.
Semi-supervised learning methods are widely applied in many areas. In many cases, however, existing semi-supervised learning methods suffer performance degradation when they use unlabeled data: the performance of the semi-supervised method can be significantly lower than that of a supervised method trained directly on the small amount of labeled data. This phenomenon has seriously hindered the application of semi-supervised learning to practical tasks, because users generally expect that using a semi-supervised method will not reduce performance. A safe semi-supervised learning method is therefore needed: one that usually improves performance and rarely causes a significant drop in performance. Because semi-supervised learning problems are ubiquitous, results in this direction will be useful in many practical tasks.
Summary of the invention
The technical problem to be solved by the invention is to provide a method that extracts big data from multiple heterogeneous data sources and applies rule-based conversion to it; performs data processing on the converted big data; and builds a database from the processed big data and carries out supervised learning on it to obtain prediction results.
To solve the above technical problem, the invention adopts the following technical solution:
A semi-supervised learning method for big data analysis, comprising the following steps:
Step 1, extract big data from multiple data sources, the big data including structured data and unstructured data, and convert the big data according to certain rules to obtain a data format suitable for computer processing;
Step 2, perform data processing on the big data after rule conversion;
Step 3, build a database from the processed big data;
Step 4, for the labeled data and unlabeled data in the database, randomly initialize multiple semi-supervised classifiers;
Step 5, for each initial semi-supervised classifier, optimize the prediction results of the semi-supervised classifier with an optimization method according to the objective function of the semi-supervised classifier;
Step 6, evaluate the prediction results of the optimized semi-supervised classifiers to obtain multiple objective values, and select the semi-supervised classifier whose objective value is optimal;
Step 7, train on the labeled data to obtain a supervised learning method, and use the supervised learning method to predict the unlabeled data, obtaining prediction results on the unlabeled data;
Step 8, based on the prediction results of the supervised learning method, define a performance improvement function for any prediction result on the unlabeled data;
Step 9, for any prediction result on the unlabeled data, the performance improvement function yields a minimum performance improvement; the performance improvement function corresponding to this minimum is defined as the worst-case performance improvement function;
Step 10, according to the worst-case performance improvement objective function, optimize the prediction results on the unlabeled data with an optimization method, and output the optimized result as the prediction result of the final safe semi-supervised classifier.
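Steps 7 to 10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: labels are binary (+1/-1), the performance improvement function is taken to be the number of additional correct predictions relative to the supervised baseline, the true labeling is assumed to be one of the candidate semi-supervised predictions, and the optimizer is a simple greedy label-flipping search. None of these choices is specified in the source, and all function names are hypothetical.

```python
from typing import List, Sequence

Labels = List[int]  # each label is +1 or -1 (assumption: binary classification)


def gain(pred: Sequence[int], truth: Sequence[int], baseline: Sequence[int]) -> int:
    """Assumed performance improvement function: extra correct labels vs. the baseline."""
    correct_pred = sum(p == t for p, t in zip(pred, truth))
    correct_base = sum(b == t for b, t in zip(baseline, truth))
    return correct_pred - correct_base


def worst_case_gain(pred: Sequence[int],
                    candidates: Sequence[Sequence[int]],
                    baseline: Sequence[int]) -> int:
    """Step 9: minimum improvement, treating each candidate labeling as the possible truth."""
    return min(gain(pred, truth, baseline) for truth in candidates)


def safe_prediction(candidates: Sequence[Sequence[int]],
                    baseline: Sequence[int]) -> Labels:
    """Step 10 (greedy stand-in): flip labels while the worst-case gain strictly grows."""
    pred = list(baseline)  # start from the supervised prediction, whose gain is 0
    best = worst_case_gain(pred, candidates, baseline)
    improved = True
    while improved:
        improved = False
        for i in range(len(pred)):
            pred[i] = -pred[i]  # try flipping one label
            score = worst_case_gain(pred, candidates, baseline)
            if score > best:
                best, improved = score, True  # keep the flip
            else:
                pred[i] = -pred[i]  # undo the flip
    return pred
```

Because the search starts from the supervised baseline, whose worst-case improvement is zero, and only accepts flips that raise the worst-case value, the returned prediction never scores below the baseline under the stated assumption that the truth is among the candidates.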
Initializing a semi-supervised classifier means initializing the prediction results on the unlabeled data.
The objective function of the semi-supervised classifier includes the margin between data of different classes and the probability likelihood.
The supervised learning method in Step 7 includes generative model methods, the nearest-neighbor supervised learning method (KNN), and the support vector machine learning method (SVM).
Rule conversion of the big data includes data cleaning and data preprocessing, which include at least one of the following: format standardization, removal of abnormal data, error correction, and deduplication.
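The cleaning operations listed above could look roughly like the sketch below. The record fields (`name`, `phone`, `age`) and the validity rule (age between 0 and 120) are hypothetical examples chosen only for illustration; the source does not describe any concrete schema or rule.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def normalize(record: Record) -> Record:
    """Format standardization and simple error correction on hypothetical fields."""
    name = str(record.get("name", "")).strip().title()
    # Keep digits only, so "138-0000-0000" and "13800000000" compare equal.
    phone = "".join(ch for ch in str(record.get("phone", "")) if ch.isdigit())
    return {"name": name, "phone": phone, "age": record.get("age")}


def is_valid(record: Record) -> bool:
    """Abnormal-data check: a non-empty name and a plausible integer age (assumed rule)."""
    age = record["age"]
    return isinstance(age, int) and 0 <= age <= 120 and record["name"] != ""


def clean(records: Iterable[Record]) -> List[Record]:
    """Normalize, drop abnormal records, and remove duplicates by (name, phone)."""
    seen = set()
    cleaned: List[Record] = []
    for raw in records:
        rec = normalize(raw)
        if not is_valid(rec):
            continue
        key = (rec["name"], rec["phone"])
        if key in seen:
            continue  # deduplication
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

Normalizing before deduplicating matters here: two records that differ only in spacing or punctuation collapse to the same key.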
When the big data is structured data, data processing of the rule-converted big data includes at least one of the following: object extraction, data association, confidence calculation, label calculation, and model calculation.
When the big data is unstructured data, data processing of the rule-converted big data includes at least one of the following: word segmentation and feature value extraction.
After the first stage of data processing, the multidimensional feature data must also undergo dimensionality reduction; the dimensionality reduction methods include linear dimensionality reduction and nonlinear dimensionality reduction.
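As one example of linear dimensionality reduction, the sketch below projects feature vectors onto their first principal component, found by power iteration on the covariance matrix. Principal component analysis is a swapped-in choice: the source names no specific algorithm, and the function names and iteration count are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def first_component(rows: Sequence[Sequence[float]],
                    iterations: int = 100) -> Tuple[List[float], List[float]]:
    """Return (feature means, unit vector of the first principal component).

    Assumes at least two rows. Power iteration can fail to converge to the top
    component if the start vector happens to be orthogonal to it.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero variance along v; nothing to iterate on
        v = [x / norm for x in w]
    return means, v


def project(rows: Sequence[Sequence[float]],
            means: Sequence[float],
            component: Sequence[float]) -> List[float]:
    """Reduce each row to one coordinate along the given component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(component)))
            for r in rows]
```

Nonlinear methods would replace the single linear projection with a learned mapping; they are not sketched here.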
By analyzing various kinds of information from multi-source heterogeneous data and building a database, the invention provides data support for data analysis, behavior analysis, user profile analysis, and relationship discovery. By using a safe semi-supervised learning method, it rarely causes a significant drop in performance during implementation, while achieving performance highly comparable to the prior art.
Brief description of the drawings
Figure 1 is a schematic flow chart of the invention.
Detailed description
To help those skilled in the art understand the invention, it is further described below with reference to the accompanying drawing.
The invention targets massive multi-source data, which includes structured data, semi-structured data, and unstructured data. Various feature attributes of persons and a topological graph of the relationships between persons are computed from all the data. A complex processing pipeline is applied to the data, including data extraction, data cleaning, data backfilling, and attribute value calculation. The calculated attribute values are inserted into a unified object table so that they can be retrieved and displayed through an interface.
As shown in Figure 1, a semi-supervised learning method for big data analysis comprises the following steps:
Step 1, extract big data from multiple data sources, the big data including structured data and unstructured data, and convert the big data according to certain rules to obtain a data format suitable for computer processing;
Step 2, perform data processing on the big data after rule conversion;
Step 3, build a database from the processed big data;
Step 4, for the labeled data and unlabeled data in the database, randomly initialize multiple semi-supervised classifiers;
Step 5, for each initial semi-supervised classifier, optimize the prediction results of the semi-supervised classifier with an optimization method according to the objective function of the semi-supervised classifier;
Step 6, evaluate the prediction results of the optimized semi-supervised classifiers to obtain multiple objective values, and select the semi-supervised classifier whose objective value is optimal;
Step 7, train on the labeled data to obtain a supervised learning method, and use the supervised learning method to predict the unlabeled data, obtaining prediction results on the unlabeled data;
Step 8, based on the prediction results of the supervised learning method, define a performance improvement function for any prediction result on the unlabeled data;
Step 9, for any prediction result on the unlabeled data, the performance improvement function yields a minimum performance improvement; the performance improvement function corresponding to this minimum is defined as the worst-case performance improvement function;
Step 10, according to the worst-case performance improvement objective function, optimize the prediction results on the unlabeled data with an optimization method, and output the optimized result as the prediction result of the final safe semi-supervised classifier.
Initializing a semi-supervised classifier means initializing the prediction results on the unlabeled data.
The objective function of the semi-supervised classifier includes the margin between data of different classes and the probability likelihood.
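Steps 4 to 6 can be illustrated with a toy example. The sketch below is an assumption-laden stand-in rather than the patented classifier: it uses a nearest-centroid model, an objective equal to the total squared distance of every point to its class centroid (a rough proxy for a likelihood term under an isotropic Gaussian model), and alternating reassignment as the optimizer. It also assumes binary labels and at least one labeled point per class. All names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Labeled = Sequence[Tuple[Point, int]]
CLASSES = (-1, 1)


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroids(labeled: Labeled, unlabeled: Sequence[Point],
              preds: Sequence[int]) -> Dict[int, Point]:
    """Class centroids from labeled points plus currently predicted unlabeled points."""
    result: Dict[int, Point] = {}
    for cls in CLASSES:
        members = [x for x, y in labeled if y == cls]
        members += [x for x, y in zip(unlabeled, preds) if y == cls]
        dim = len(members[0])
        result[cls] = tuple(sum(m[j] for m in members) / len(members) for j in range(dim))
    return result


def objective(labeled: Labeled, unlabeled: Sequence[Point], preds: Sequence[int]) -> float:
    """Assumed objective (lower is better): squared distance to the assigned centroid."""
    cents = centroids(labeled, unlabeled, preds)
    cost = sum(sq_dist(x, cents[y]) for x, y in labeled)
    return cost + sum(sq_dist(x, cents[y]) for x, y in zip(unlabeled, preds))


def optimize(labeled: Labeled, unlabeled: Sequence[Point], preds: Sequence[int],
             max_iter: int = 50) -> Tuple[List[int], float]:
    """Step 5: refine the predictions on unlabeled data until they stop changing."""
    preds = list(preds)
    for _ in range(max_iter):
        cents = centroids(labeled, unlabeled, preds)
        updated = [min(CLASSES, key=lambda c: sq_dist(x, cents[c])) for x in unlabeled]
        if updated == preds:
            break
        preds = updated
    return preds, objective(labeled, unlabeled, preds)


def random_starts(labeled: Labeled, unlabeled: Sequence[Point],
                  n_starts: int = 5, seed: int = 0) -> List[Tuple[List[int], float]]:
    """Step 4: randomly initialize predictions on unlabeled data, then optimize each."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_starts):
        init = [rng.choice(CLASSES) for _ in unlabeled]
        results.append(optimize(labeled, unlabeled, init))
    return results


def best_candidate(candidates: Sequence[Tuple[List[int], float]]) -> List[int]:
    """Step 6: keep the candidate whose objective value is best (smallest here)."""
    return min(candidates, key=lambda item: item[1])[0]
```

Different random initializations can converge to different local optima, which is why several candidates are produced and compared by objective value.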
The supervised learning method in Step 7 includes generative model methods, the nearest-neighbor supervised learning method (KNN), and the support vector machine learning method (SVM).
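Of the supervised learners listed, KNN is the simplest to sketch. The example below shows how Step 7 might train on labeled data and predict the unlabeled data; the choice of k = 3, Euclidean distance, and majority voting are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def knn_predict(labeled: Sequence[Tuple[Point, int]], query: Point, k: int = 3) -> int:
    """Predict the label of one point by majority vote among its k nearest labeled points."""
    neighbors = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def predict_unlabeled(labeled: Sequence[Tuple[Point, int]],
                      unlabeled: Sequence[Point], k: int = 3) -> List[int]:
    """Step 7: supervised baseline predictions on all unlabeled points."""
    return [knn_predict(labeled, x, k) for x in unlabeled]
```

These baseline predictions are what the later steps compare the semi-supervised predictions against.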
Rule conversion of the big data includes data cleaning and data preprocessing, which include at least one of the following: format standardization, removal of abnormal data, error correction, and deduplication.
When the big data is structured data, data processing of the rule-converted big data includes at least one of the following: object extraction, data association, confidence calculation, label calculation, and model calculation.
When the big data is unstructured data, data processing of the rule-converted big data includes at least one of the following: word segmentation and feature value extraction.
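For unstructured text, word segmentation followed by feature value extraction might look like the sketch below. Forward maximum matching against a small dictionary and plain term counts are swapped-in techniques chosen for brevity; the source does not say which segmenter or features are used, and the vocabulary shown in the usage note is a hypothetical example.

```python
from collections import Counter
from typing import List, Sequence, Set


def segment(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def term_counts(tokens: Sequence[str], terms: Sequence[str]) -> List[int]:
    """Feature vector: how often each chosen term occurs among the tokens."""
    counts = Counter(tokens)
    return [counts[t] for t in terms]
```

With the vocabulary {"大数据", "分析", "半监督", "学习"}, the text "大数据分析半监督学习" is split into those four words, and the resulting counts can serve as numeric features for the classifiers.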
After the first stage of data processing, the multidimensional feature data must also undergo dimensionality reduction; the dimensionality reduction methods include linear dimensionality reduction and nonlinear dimensionality reduction.
It should be noted that the above description does not limit the invention; any obvious substitution made without departing from the inventive concept falls within the scope of protection of the invention.

Claims (8)

1. A semi-supervised learning method for big data analysis, comprising the following steps:
Step 1, extract big data from multiple data sources, the big data including structured data and unstructured data, and convert the big data according to certain rules to obtain a data format suitable for computer processing;
Step 2, perform data processing on the big data after rule conversion;
Step 3, build a database from the processed big data;
Step 4, for the labeled data and unlabeled data in the database, randomly initialize multiple semi-supervised classifiers;
Step 5, for each initial semi-supervised classifier, optimize the prediction results of the semi-supervised classifier with an optimization method according to the objective function of the semi-supervised classifier;
Step 6, evaluate the prediction results of the optimized semi-supervised classifiers to obtain multiple objective values, and select the semi-supervised classifier whose objective value is optimal;
Step 7, train on the labeled data to obtain a supervised learning method, and use the supervised learning method to predict the unlabeled data, obtaining prediction results on the unlabeled data;
Step 8, based on the prediction results of the supervised learning method, define a performance improvement function for any prediction result on the unlabeled data;
Step 9, for any prediction result on the unlabeled data, the performance improvement function yields a minimum performance improvement; the performance improvement function corresponding to this minimum is defined as the worst-case performance improvement function;
Step 10, according to the worst-case performance improvement objective function, optimize the prediction results on the unlabeled data with an optimization method, and output the optimized result as the prediction result of the final safe semi-supervised classifier.
2. The semi-supervised learning method for big data analysis according to claim 1, characterized in that initializing the semi-supervised classifier means initializing the prediction results on the unlabeled data.
3. The semi-supervised learning method for big data analysis according to claim 2, characterized in that the objective function of the semi-supervised classifier includes the margin between data of different classes and the probability likelihood.
4. The semi-supervised learning method for big data analysis according to claim 3, characterized in that the supervised learning method in Step 7 includes generative model methods, the nearest-neighbor supervised learning method (KNN), and the support vector machine learning method (SVM).
5. The semi-supervised learning method for big data analysis according to claim 1, characterized in that rule conversion of the big data includes data cleaning and data preprocessing, which include at least one of the following: format standardization, removal of abnormal data, error correction, and deduplication.
6. The semi-supervised learning method for big data analysis according to claim 1, characterized in that, when the big data is structured data, data processing of the rule-converted big data includes at least one of the following: object extraction, data association, confidence calculation, label calculation, and model calculation.
7. The semi-supervised learning method for big data analysis according to claim 1, characterized in that, when the big data is unstructured data, data processing of the rule-converted big data includes at least one of the following: word segmentation and feature value extraction.
8. The semi-supervised learning method for big data analysis according to claim 1, characterized in that, after the first stage of data processing, the multidimensional feature data must also undergo dimensionality reduction, and the dimensionality reduction methods include linear dimensionality reduction and nonlinear dimensionality reduction.
CN201710861920.6A 2017-09-21 2017-09-21 Semi-supervised learning method for big data analysis Pending CN107590262A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710861920.6A CN107590262A (en) 2017-09-21 2017-09-21 Semi-supervised learning method for big data analysis

Publications (1)

Publication Number Publication Date
CN107590262A true CN107590262A (en) 2018-01-16

Family

ID=61047545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710861920.6A Pending CN107590262A (en) 2017-09-21 2017-09-21 Semi-supervised learning method for big data analysis

Country Status (1)

Country Link
CN (1) CN107590262A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390171A (en) * 2013-07-24 2013-11-13 南京大学 Safe semi-supervised learning method
CN104636493A (en) * 2015-03-04 2015-05-20 浪潮电子信息产业股份有限公司 Method for classifying dynamic data on basis of multi-classifier fusion
CN106528874A (en) * 2016-12-08 2017-03-22 重庆邮电大学 Spark memory computing big data platform-based CLR multi-label data classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180116
