CN107590262A - The semi-supervised learning method of big data analysis - Google Patents
- Publication number: CN107590262A
- Application number: CN201710861920.6A
- Authority: CN (China)
- Prior art keywords: data, semi, big data, supervised, learning method
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a semi-supervised learning method for big data analysis. Big data is first extracted from multiple data sources and converted according to certain rules into a data format suitable for computer processing. Data processing is then performed on the rule-converted big data. Finally, a database is built from the processed big data, multiple semi-supervised classifiers are constructed, and a final safe semi-supervised classifier is built from them. Specifically, for a given training data set, multiple semi-supervised classifiers with large diversity are built first, and the final safe semi-supervised classifier is then built by maximizing the performance improvement in the worst case. In practice the method seldom causes performance degradation, while achieving performance highly comparable to existing classical techniques.
Description
Technical field
The present invention relates to the technical field of big data, and specifically to a semi-supervised learning method for big data analysis. It is an important technique for learning from unlabeled data: without external intervention, it automatically uses a large amount of unlabeled data to improve a learner's generalization ability over the whole data distribution.
Background art
In big data application systems, many industries today, and the intelligence analysis field in particular, can obtain different data from multiple data sources. These include information from sources such as industry and commerce, civil aviation, entry and exit, and household registration; registration information from all kinds of portal websites (such as group-buying, recruitment, and social networking sites); and various data obtained by web crawlers. The data types in turn include structured, semi-structured, and unstructured data. The content and format of these data are disorganized, and true and false information is mixed together. Big data analysis technology is therefore needed to mine useful, valuable information from massive multi-source data and to provide data support for all kinds of analysis applications.
Machine learning methods attempt to use the historical data of a task to improve performance on that task. To achieve good learning performance, machine learning methods such as supervised learning usually require that the historical data carry explicit concept labels (referred to as labeled data) and that a large amount of labeled data be available. In many real tasks, obtaining concept labels consumes considerable human and material resources, so labeled data is usually scarce, whereas large amounts of historical data without concept labels (referred to as unlabeled data) are easy to obtain. How to use large amounts of unlabeled data to improve on the performance obtained from a small amount of labeled data alone has become an important topic in machine learning, and semi-supervised learning is one of the two mainstream techniques in this area.
Semi-supervised learning methods have been widely applied in many areas. In many cases, however, existing semi-supervised learning methods suffer performance degradation when they use unlabeled data; that is, the performance of the semi-supervised learning method can be significantly lower than that of a supervised learning method trained directly on the small amount of labeled data. This phenomenon has seriously limited the application of semi-supervised learning in practical tasks, because users generally expect that using a semi-supervised learning method will not reduce performance. A safe semi-supervised learning method is therefore needed: one that on the one hand usually improves performance, and on the other hand seldom causes a significant performance drop. Because semi-supervised learning problems are widespread, results in this direction can play a role in many practical tasks.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method that extracts big data from multiple heterogeneous data sources and applies rule-based conversion to it; performs data processing on the rule-converted big data; and builds a database from the processed big data and performs supervised learning on it to obtain prediction results.
To solve the above technical problem, the present invention adopts the following technical solution:
A semi-supervised learning method for big data analysis, comprising the following steps:
Step 1, extract big data from multiple data sources, the big data including structured data and unstructured data, and convert the big data according to certain rules to obtain a data format suitable for computer processing;
Step 2, perform data processing on the rule-converted big data;
Step 3, build a database from the processed big data;
Step 4, for the labeled data and unlabeled data in the database, randomly initialize multiple semi-supervised classifiers;
Step 5, for each initial semi-supervised classifier, optimize the prediction results of the semi-supervised classifier by an optimization method according to the objective function of the semi-supervised classifier;
Step 6, group the prediction results of the optimized semi-supervised classifiers by objective value, and extract the semi-supervised classifiers whose objective value is optimal;
Step 7, train on the labeled data to obtain a supervised learning method, and use the supervised learning method to predict the unlabeled data, obtaining prediction results on the unlabeled data;
Step 8, based on the prediction results of the supervised learning method, define a performance improvement function for any prediction result on the unlabeled data;
Step 9, for any prediction result on the unlabeled data, find the case in which the performance improvement function takes its minimum value, and define the performance improvement function corresponding to that minimum as the worst-case performance improvement function;
Step 10, according to the worst-case performance improvement objective function, optimize the prediction results on the unlabeled data by an optimization method, and output the optimization result as the prediction result of the final safe semi-supervised classifier.
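Steps 8 to 10 can be written compactly as the following sketch. The symbols are assumptions introduced here and do not appear in the source: $y^{\text{sup}}$ is the supervised prediction from step 7, $\hat{y}_1, \dots, \hat{y}_T$ are the predictions of the semi-supervised classifiers kept in step 6, $u$ is the number of unlabeled examples, and labels are taken to be $\pm 1$. The source does not state the exact form of its performance improvement function; a simple agreement count is assumed.

```latex
\begin{aligned}
\operatorname{gain}(y, \hat{y}, y^{\text{sup}})
  &= \sum_{j=1}^{u} \mathbb{1}\,[\,y_j = \hat{y}_j\,]
   - \sum_{j=1}^{u} \mathbb{1}\,[\,y^{\text{sup}}_j = \hat{y}_j\,]
  && \text{(step 8)} \\
J(y) &= \min_{t \in \{1, \dots, T\}} \operatorname{gain}(y, \hat{y}_t, y^{\text{sup}})
  && \text{(step 9)} \\
y^{\text{safe}} &= \arg\max_{y \in \{-1, +1\}^{u}} J(y)
  && \text{(step 10)}
\end{aligned}
```

Under this assumed form, $\operatorname{gain}(y^{\text{sup}}, \hat{y}_t, y^{\text{sup}}) = 0$ for every $t$, so $J(y^{\text{safe}}) \ge 0$: if the true labeling is one of the candidates, the output agrees with it at least as often as the supervised prediction does.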
Initializing a semi-supervised classifier means initializing its prediction results on the unlabeled data.
The objective function of the semi-supervised classifier includes the margin between data of different classes and the probability likelihood.
The supervised learning method in step 7 includes generative model methods, the nearest-neighbor supervised learning method (KNN), and the support vector machine learning method (SVM).
The rule-based conversion of the big data includes data cleaning and data preprocessing, which include at least one of the following: format standardization, abnormal data removal, error correction, and deduplication.
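The cleaning operations named above could look like the following minimal sketch. The record fields (`name`, `date`, `age`), the accepted date formats, and the valid age range are all hypothetical choices made for illustration; the source does not specify any schema or rules. Error correction is represented only by the whitespace and date-separator fixes.

```python
from datetime import datetime


def clean_records(records):
    """Standardize formats, drop abnormal rows, and remove duplicates."""
    cleaned, seen = [], set()
    for rec in records:
        # Format standardization: collapse whitespace, unify capitalization.
        name = " ".join(rec.get("name", "").split()).title()
        # Simple error correction: accept "/" as a date separator.
        raw_date = rec.get("date", "").strip().replace("/", "-")
        try:
            date = datetime.strptime(raw_date, "%Y-%m-%d").date().isoformat()
        except ValueError:
            continue  # abnormal data removal: unparseable date
        age = rec.get("age")
        if not isinstance(age, int) or not 0 <= age <= 120:
            continue  # abnormal data removal: implausible age (assumed range)
        key = (name, date, age)
        if key in seen:
            continue  # deduplication
        seen.add(key)
        cleaned.append({"name": name, "date": date, "age": age})
    return cleaned


records = [
    {"name": "  zhang  wei ", "date": "2017/09/21", "age": 30},
    {"name": "Zhang Wei", "date": "2017-09-21", "age": 30},
    {"name": "Li Na", "date": "not a date", "age": 25},
    {"name": "Wang Fang", "date": "2018-01-16", "age": 400},
]
result = clean_records(records)
```

In this example the first two records collapse into one after standardization, and the last two are dropped as abnormal.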
Where the big data is structured data, the data processing performed on the rule-converted big data includes at least one of the following: object extraction, data association, confidence calculation, label calculation, and model calculation.
Where the big data is unstructured data, the data processing performed on the rule-converted big data includes at least one of the following: word segmentation and feature value extraction.
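For unstructured text, word segmentation followed by feature value extraction can be sketched as below. This is an illustration under stated assumptions: a regular-expression tokenizer stands in for a real word segmenter (Chinese text would need a dedicated segmentation tool), term-frequency vectors stand in for whatever feature values the method actually extracts, and the function names are hypothetical.

```python
import re
from collections import Counter


def segment(text):
    """Split text into lowercase word tokens (stand-in for a real segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_features(docs):
    """Return (vocabulary, term-frequency vectors) for a list of documents."""
    tokenized = [segment(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # A Counter returns 0 for words the document does not contain.
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


vocab, vectors = extract_features(["Big data, big value", "Data mining"])
```

Each document becomes a fixed-length numeric vector over a shared vocabulary, which is the form a classifier can consume.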
After the first data processing step is performed on the big data, the multi-dimensional feature data must also undergo dimensionality reduction; the dimensionality reduction methods include linear dimensionality reduction and nonlinear dimensionality reduction.
By analyzing various information from multi-source heterogeneous data and building a database, the present invention provides data support for all kinds of data analysis, behavior analysis, user profile analysis, and relationship discovery. By using a safe semi-supervised learning method, it seldom causes a significant performance drop in practice, while achieving performance highly comparable to the prior art.
Brief description of the drawings
Figure 1 is a schematic flow chart of the present invention.
Embodiment
To help those skilled in the art understand the invention, it is further described below with reference to the accompanying drawing.
The present invention is directed at massive multi-source data, including structured, semi-structured, and unstructured data, from which the various feature attributes of persons and the topological graph of relationships between persons are computed. A complex processing procedure is performed on the data, including data extraction, data cleaning, data backfilling, and attribute value calculation. The calculated attribute values are inserted into a unified object table so that they can be retrieved and displayed through an interface.
As shown in Figure 1, a semi-supervised learning method for big data analysis comprises the following steps:
Step 1, extract big data from multiple data sources, the big data including structured data and unstructured data, and convert the big data according to certain rules to obtain a data format suitable for computer processing;
Step 2, perform data processing on the rule-converted big data;
Step 3, build a database from the processed big data;
Step 4, for the labeled data and unlabeled data in the database, randomly initialize multiple semi-supervised classifiers;
Step 5, for each initial semi-supervised classifier, optimize the prediction results of the semi-supervised classifier by an optimization method according to the objective function of the semi-supervised classifier;
Step 6, group the prediction results of the optimized semi-supervised classifiers by objective value, and extract the semi-supervised classifiers whose objective value is optimal;
Step 7, train on the labeled data to obtain a supervised learning method, and use the supervised learning method to predict the unlabeled data, obtaining prediction results on the unlabeled data;
Step 8, based on the prediction results of the supervised learning method, define a performance improvement function for any prediction result on the unlabeled data;
Step 9, for any prediction result on the unlabeled data, find the case in which the performance improvement function takes its minimum value, and define the performance improvement function corresponding to that minimum as the worst-case performance improvement function;
Step 10, according to the worst-case performance improvement objective function, optimize the prediction results on the unlabeled data by an optimization method, and output the optimization result as the prediction result of the final safe semi-supervised classifier.
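The worst-case selection in steps 8 to 10 can be illustrated with the following sketch. It rests on several assumptions not stated in the source: labels are +1 and -1, the performance improvement of a labeling is measured as its agreement count with a hypothetical true labeling minus the baseline's agreement count, the true labeling is assumed to be one of the candidate predictions from step 6, and the search is restricted to the candidates plus the supervised baseline rather than all possible labelings. The function names are hypothetical.

```python
def agreement(a, b):
    """Number of positions at which two labelings agree."""
    return sum(1 for p, q in zip(a, b) if p == q)


def gain(y, y_true, y_sup):
    """Improvement of y over the supervised baseline if y_true were correct."""
    return agreement(y, y_true) - agreement(y_sup, y_true)


def safe_prediction(candidates, y_sup):
    """Pick the labeling whose worst-case gain over the baseline is largest."""
    # The baseline itself is always an option; its gain is 0 in every case.
    options = [y_sup] + candidates

    def worst_case(y):
        return min(gain(y, y_true, y_sup) for y_true in candidates)

    best = max(options, key=worst_case)  # ties resolve to the earliest option
    return best, worst_case(best)


y_sup = [1, 1, -1, -1]
candidates = [[1, 1, 1, -1], [1, -1, 1, -1], [1, 1, 1, 1]]
best, worst = safe_prediction(candidates, y_sup)
```

Because the baseline is among the options and ties resolve in its favor, the selected labeling never has a negative worst-case gain under these assumptions, which is the sense in which the result is "safe".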
Initializing a semi-supervised classifier means initializing its prediction results on the unlabeled data.
The objective function of the semi-supervised classifier includes the margin between data of different classes and the probability likelihood.
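The random initialization, per-classifier optimization, and selection by objective value (steps 4 to 6) might be sketched as follows. Note the substitution: instead of a margin or likelihood objective, this sketch uses a simple nearest-centroid refinement with a within-class squared-distance objective (lower is better), chosen only because it fits in a few lines. It also assumes labels of +1 and -1 and that the labeled data contains both classes. All names are hypothetical.

```python
import random


def centroid(points):
    """Mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centroids(X, y):
    # Assumes both classes occur in y, so neither point list is empty.
    return {c: centroid([x for x, lab in zip(X, y) if lab == c]) for c in (-1, 1)}


def optimize(X_l, y_l, X_u, y_u, max_iter=50):
    """Alternate between fitting centroids and relabeling the unlabeled points."""
    y_u = list(y_u)
    for _ in range(max_iter):
        cents = fit_centroids(X_l + X_u, y_l + y_u)
        new = [min((-1, 1), key=lambda c: sq_dist(x, cents[c])) for x in X_u]
        if new == y_u:
            break  # converged to a (possibly local) optimum
        y_u = new
    cents = fit_centroids(X_l + X_u, y_l + y_u)
    objective = sum(sq_dist(x, cents[lab]) for x, lab in zip(X_l + X_u, y_l + y_u))
    return y_u, objective


def build_classifiers(X_l, y_l, X_u, n_init=5, seed=0):
    """Run several random initializations; return distinct results, best first."""
    rng = random.Random(seed)
    results = {}
    for _ in range(n_init):
        init = [rng.choice((-1, 1)) for _ in X_u]  # initialize predictions
        pred, obj = optimize(X_l, y_l, X_u, init)
        results[tuple(pred)] = obj  # keep each distinct prediction once
    return sorted(results.items(), key=lambda kv: kv[1])


X_l, y_l = [[0.0], [10.0]], [-1, 1]
X_u = [[1.0], [2.0], [9.0], [8.0]]
pred, obj = optimize(X_l, y_l, X_u, [1, 1, 1, 1])
stuck, stuck_obj = optimize(X_l, y_l, X_u, [1, 1, -1, -1])
ranked = build_classifiers(X_l, y_l, X_u, n_init=8)
```

Different initializations can settle in different local optima, as the `stuck` run shows, which is why several classifiers are built and then ranked by objective value.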
The supervised learning method in step 7 includes generative model methods, the nearest-neighbor supervised learning method (KNN), and the support vector machine learning method (SVM).
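As an illustration of step 7 with the nearest-neighbor option, the sketch below trains nothing explicitly: it predicts each unlabeled point by majority vote among its k closest labeled points. The function name, the choice of squared Euclidean distance, and the default k=3 are assumptions made for this example.

```python
from collections import Counter


def knn_predict(X_l, y_l, X_u, k=3):
    """Predict each unlabeled point by majority vote of its k nearest labeled points."""
    preds = []
    for x in X_u:
        def dist_to(i):
            return sum((a - b) ** 2 for a, b in zip(X_l[i], x))

        nearest = sorted(range(len(X_l)), key=dist_to)[:k]
        votes = Counter(y_l[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])
    return preds


X_l = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_l = [-1, -1, -1, 1, 1, 1]
X_u = [[0.5, 0.5], [5.5, 5.5]]
y_sup = knn_predict(X_l, y_l, X_u)
```

The resulting `y_sup` plays the role of the supervised baseline prediction on the unlabeled data that steps 8 to 10 compare against.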
The rule-based conversion of the big data includes data cleaning and data preprocessing, which include at least one of the following: format standardization, abnormal data removal, error correction, and deduplication.
Where the big data is structured data, the data processing performed on the rule-converted big data includes at least one of the following: object extraction, data association, confidence calculation, label calculation, and model calculation.
Where the big data is unstructured data, the data processing performed on the rule-converted big data includes at least one of the following: word segmentation and feature value extraction.
After the first data processing step is performed on the big data, the multi-dimensional feature data must also undergo dimensionality reduction; the dimensionality reduction methods include linear dimensionality reduction and nonlinear dimensionality reduction.
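Linear dimensionality reduction can be illustrated with a one-component principal component analysis, sketched below using power iteration on the covariance matrix. PCA is one common linear technique and is an assumption here; the source names no specific method. The sketch also assumes the starting vector is not orthogonal to the leading direction. Nonlinear reduction is not shown.

```python
import math


def first_principal_component(X, n_iter=200):
    """Return (mean, unit direction of largest variance) via power iteration."""
    n, dim = len(X), len(X[0])
    mean = [sum(row[d] for row in X) / n for d in range(dim)]
    centered = [[row[d] - mean[d] for d in range(dim)] for row in X]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(dim)]
           for i in range(dim)]
    v = [1.0 / math.sqrt(dim)] * dim  # assumed not orthogonal to the answer
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # data has no variance along v; keep the current vector
        v = [x / norm for x in w]
    return mean, v


def project(X, mean, v):
    """Reduce each row to a single coordinate along direction v."""
    return [sum((row[d] - mean[d]) * v[d] for d in range(len(v))) for row in X]


X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean, v = first_principal_component(X)
z = project(X, mean, v)
```

Here the two-dimensional points lie on a line, so a single coordinate per point preserves all of their variance.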
It should be noted that the above description does not limit the present invention. Any obvious substitution made without departing from the inventive concept of the present invention falls within the protection scope of the present invention.
Claims (8)
1. A semi-supervised learning method for big data analysis, comprising the following steps:
Step 1, extract big data from multiple data sources, the big data including structured data and unstructured data, and convert the big data according to certain rules to obtain a data format suitable for computer processing;
Step 2, perform data processing on the rule-converted big data;
Step 3, build a database from the processed big data;
Step 4, for the labeled data and unlabeled data in the database, randomly initialize multiple semi-supervised classifiers;
Step 5, for each initial semi-supervised classifier, optimize the prediction results of the semi-supervised classifier by an optimization method according to the objective function of the semi-supervised classifier;
Step 6, group the prediction results of the optimized semi-supervised classifiers by objective value, and extract the semi-supervised classifiers whose objective value is optimal;
Step 7, train on the labeled data to obtain a supervised learning method, and use the supervised learning method to predict the unlabeled data, obtaining prediction results on the unlabeled data;
Step 8, based on the prediction results of the supervised learning method, define a performance improvement function for any prediction result on the unlabeled data;
Step 9, for any prediction result on the unlabeled data, find the case in which the performance improvement function takes its minimum value, and define the performance improvement function corresponding to that minimum as the worst-case performance improvement function;
Step 10, according to the worst-case performance improvement objective function, optimize the prediction results on the unlabeled data by an optimization method, and output the optimization result as the prediction result of the final safe semi-supervised classifier.
2. The semi-supervised learning method for big data analysis according to claim 1, characterized in that initializing a semi-supervised classifier means initializing its prediction results on the unlabeled data.
3. The semi-supervised learning method for big data analysis according to claim 2, characterized in that the objective function of the semi-supervised classifier includes the margin between data of different classes and the probability likelihood.
4. The semi-supervised learning method for big data analysis according to claim 3, characterized in that the supervised learning method in step 7 includes generative model methods, the nearest-neighbor supervised learning method (KNN), and the support vector machine learning method (SVM).
5. The semi-supervised learning method for big data analysis according to claim 1, characterized in that the rule-based conversion of the big data includes data cleaning and data preprocessing, which include at least one of the following: format standardization, abnormal data removal, error correction, and deduplication.
6. The semi-supervised learning method for big data analysis according to claim 1, characterized in that, where the big data is structured data, the data processing performed on the rule-converted big data includes at least one of the following: object extraction, data association, confidence calculation, label calculation, and model calculation.
7. The semi-supervised learning method for big data analysis according to claim 1, characterized in that, where the big data is unstructured data, the data processing performed on the rule-converted big data includes at least one of the following: word segmentation and feature value extraction.
8. The semi-supervised learning method for big data analysis according to claim 1, characterized in that, after the first data processing step is performed on the big data, the multi-dimensional feature data must also undergo dimensionality reduction, the dimensionality reduction methods including linear dimensionality reduction and nonlinear dimensionality reduction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710861920.6A CN107590262A (en) | 2017-09-21 | 2017-09-21 | The semi-supervised learning method of big data analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710861920.6A CN107590262A (en) | 2017-09-21 | 2017-09-21 | The semi-supervised learning method of big data analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107590262A (en) | 2018-01-16 |
Family ID: 61047545
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710861920.6A (Pending) CN107590262A (en) | 2017-09-21 | 2017-09-21 | The semi-supervised learning method of big data analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107590262A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108923962A (en) * | 2018-06-25 | 2018-11-30 | 哈尔滨工业大学 | A kind of Local network topology measurement task selection method based on semi-supervised clustering |
CN109977094A (en) * | 2019-01-30 | 2019-07-05 | 中南大学 | A method of the semi-supervised learning for structural data |
CN111797832A (en) * | 2020-07-14 | 2020-10-20 | 成都数之联科技有限公司 | Automatic generation method and system of image interesting region and image processing method |
CN111896681A (en) * | 2020-07-08 | 2020-11-06 | 南昌工程学院 | Semi-supervised semi-learning type atmospheric pollutant system |
CN113168907A (en) * | 2018-11-30 | 2021-07-23 | 第一百欧有限公司 | Diagnostic system providing method using semi-supervised learning and diagnostic system using the same |
WO2023273249A1 (en) * | 2021-06-30 | 2023-01-05 | 国网上海市电力公司 | Tsvm-model-based abnormality detection method for automatic verification system of smart electricity meter |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103390171A (en) * | 2013-07-24 | 2013-11-13 | 南京大学 | Safe semi-supervised learning method |
CN104636493A (en) * | 2015-03-04 | 2015-05-20 | 浪潮电子信息产业股份有限公司 | Method for classifying dynamic data on basis of multi-classifier fusion |
CN106528874A (en) * | 2016-12-08 | 2017-03-22 | 重庆邮电大学 | Spark memory computing big data platform-based CLR multi-label data classification method |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108923962A (en) * | 2018-06-25 | 2018-11-30 | 哈尔滨工业大学 | A kind of Local network topology measurement task selection method based on semi-supervised clustering |
CN108923962B (en) * | 2018-06-25 | 2021-05-28 | 哈尔滨工业大学 | Local network topology measurement task selection method based on semi-supervised clustering |
CN113168907A (en) * | 2018-11-30 | 2021-07-23 | 第一百欧有限公司 | Diagnostic system providing method using semi-supervised learning and diagnostic system using the same |
CN109977094A (en) * | 2019-01-30 | 2019-07-05 | 中南大学 | A method of the semi-supervised learning for structural data |
CN109977094B (en) * | 2019-01-30 | 2021-02-19 | 中南大学 | Semi-supervised learning method for structured data |
CN111896681A (en) * | 2020-07-08 | 2020-11-06 | 南昌工程学院 | Semi-supervised semi-learning type atmospheric pollutant system |
CN111797832A (en) * | 2020-07-14 | 2020-10-20 | 成都数之联科技有限公司 | Automatic generation method and system of image interesting region and image processing method |
CN111797832B (en) * | 2020-07-14 | 2024-02-02 | 成都数之联科技股份有限公司 | Automatic generation method and system for image region of interest and image processing method |
WO2023273249A1 (en) * | 2021-06-30 | 2023-01-05 | 国网上海市电力公司 | Tsvm-model-based abnormality detection method for automatic verification system of smart electricity meter |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107590262A (en) | The semi-supervised learning method of big data analysis | |
CN104199972B (en) | A kind of name entity relation extraction and construction method based on deep learning | |
CN105631479A (en) | Imbalance-learning-based depth convolution network image marking method and apparatus | |
CN111767725B (en) | Data processing method and device based on emotion polarity analysis model | |
CN109726246A (en) | One kind being associated with reason retrogressive method with visual power grid accident based on data mining | |
CN110968667A (en) | Periodical and literature table extraction method based on text state characteristics | |
CN106815307A (en) | Public Culture knowledge mapping platform and its use method | |
CN103559199B (en) | Method for abstracting web page information and device | |
CN102298638A (en) | Method and system for extracting news webpage contents by clustering webpage labels | |
US20150095022A1 (en) | List recognizing method and list recognizing system | |
CN103778205A (en) | Commodity classifying method and system based on mutual information | |
CN104112026A (en) | Short message text classifying method and system | |
CN104217038A (en) | Knowledge network building method for financial news | |
CN106339481B (en) | The compound new word discovery method of Chinese based on maximum confidence | |
CN104850617A (en) | Short text processing method and apparatus | |
CN110222328A (en) | Participle and part-of-speech tagging method, apparatus, equipment and storage medium neural network based | |
CN104933032A (en) | Method for extracting keywords of blog based on complex network | |
CN103116636B (en) | The big Data subject method for digging of the text of feature based spatial decomposition and device | |
CN107463624B (en) | A kind of method and system that city interest domain identification is carried out based on social media data | |
CN106844588A (en) | A kind of analysis method and system of the user behavior data based on web crawlers | |
CN108763192A (en) | Entity relation extraction method and device for text-processing | |
CN109325204B (en) | Automatic extraction method of webpage content | |
CN109165295B (en) | Intelligent resume evaluation method | |
CN110347841A (en) | A kind of method, apparatus, storage medium and the electronic equipment of document content classification | |
CN107992508B (en) | Chinese mail signature extraction method and system based on machine learning |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180116 |