CN108959350A - Classification prediction technique based on big data distributed computing
- Publication number
- CN108959350A (application number CN201810375744.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- classification
- big data
- distributed computing
- test sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
The present invention provides a classification prediction method based on big data distributed computing, comprising the following steps. Step 1: clean the data in a distributed file system to remove the noise in the distributed file system and form a standardized data set. Step 2: label the data in the standardized data set using a clustering algorithm, and obtain training samples and test samples. Step 3: based on the standardized data set labeled by the clustering algorithm, use the MapReduce computing framework of the big data platform to iteratively compute the training model parameters from the training samples, and use those parameters to predict the classes of the test samples. The method facilitates classification prediction and verification on labeled data.
Description
Technical field
The present invention relates to the field of big data processing technology, and in particular to a classification prediction method based on big data distributed computing.
Background art
In the big data era, data mining is a central task. Big data mining is the process of discovering valuable, potentially useful information and knowledge hidden in massive, incomplete, noisy, fuzzy, and random large databases; it is also a decision support process. It is based mainly on artificial intelligence, machine learning, pattern learning, statistics, and the like. By analyzing big data in a highly automated way and reasoning inductively, potential patterns are mined from the data. In the related art, big data classification prediction algorithms in particular often fail to reflect the actual situation accurately when used for application classification and trend prediction.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the above technical problems in the prior art. To this end, the present invention provides a classification prediction method based on big data distributed computing that has strong generalization ability and small generalization error.
The classification prediction method based on big data distributed computing according to the present invention comprises the following steps. Step 1: clean the data in a distributed file system to remove the noise in the distributed file system and form a standardized data set. Step 2: label the data in the standardized data set using a clustering algorithm, and obtain training samples and test samples. Step 3: based on the standardized data set labeled by the clustering algorithm, use the MapReduce computing framework of the big data platform to iteratively compute the training model parameters from the training samples, and use those parameters to predict the classes of the test samples.
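Step 1 can be illustrated with a minimal single-machine sketch. The patent does not specify the cleaning rules, so everything here is an assumption made for illustration: records are lists of numeric features with `None` marking a missing value, "noise" is taken to mean incomplete rows and rows more than `z_max` standard deviations from a column mean, and "standardized" is taken to mean z-scored. The function name `clean_and_standardize` is hypothetical.

```python
from statistics import mean, pstdev


def clean_and_standardize(records, z_max=3.0):
    """Drop incomplete and outlying rows, then z-score every feature column.

    `records` is a list of equal-length rows; None marks a missing value.
    The z_max outlier threshold is an illustrative assumption.
    """
    # 1. Remove rows that contain missing values.
    rows = [r for r in records if all(v is not None for v in r)]
    if not rows:
        return []

    def column_stats(data):
        cols = list(zip(*data))
        # A constant column has zero deviation; use 1.0 to avoid dividing by zero.
        return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]

    # 2. Remove rows with any feature further than z_max deviations from its mean.
    mu, sigma = column_stats(rows)
    rows = [r for r in rows
            if all(abs(v - m) / s <= z_max for v, m, s in zip(r, mu, sigma))]
    if not rows:
        return []

    # 3. Standardize the surviving rows (zero mean, unit variance per column).
    mu, sigma = column_stats(rows)
    return [[(v - m) / s for v, m, s in zip(r, mu, sigma)] for r in rows]
```

On a real platform the same three passes would run over data stored in the distributed file system rather than over an in-memory list.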
The classification prediction method based on big data distributed computing according to embodiments of the present invention can store massive data on a big data platform and use the platform's MapReduce computing framework to clean, cluster, and label the data. Finally, based on the labeled data, a Gradient Boosting algorithm is used to perform classification prediction and verification.
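The MapReduce pattern mentioned above can be sketched in plain Python without any Hadoop API. This is only a simulation under stated assumptions: each partition is a non-empty in-memory list of numeric rows, and the job shown (global per-feature mean and variance, which a standardization pass would need) is an illustrative choice, not one named in the patent. The function names are hypothetical.

```python
from functools import reduce


def map_partition(rows):
    """Map phase: summarize one partition as (count, sums, sums of squares)."""
    n_features = len(rows[0])
    sums = [0.0] * n_features
    sq_sums = [0.0] * n_features
    for row in rows:
        for j, v in enumerate(row):
            sums[j] += v
            sq_sums[j] += v * v
    return len(rows), sums, sq_sums


def reduce_stats(a, b):
    """Reduce phase: merge two partial summaries by element-wise addition."""
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])


def global_mean_variance(partitions):
    """Combine per-partition summaries into the global mean and variance."""
    n, sums, sq_sums = reduce(reduce_stats, map(map_partition, partitions))
    means = [s / n for s in sums]
    variances = [sq / n - m * m for sq, m in zip(sq_sums, means)]
    return means, variances
```

Because the partial summaries are merged by addition, the reduce step is associative, so partitions can be processed on separate nodes and combined in any order.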
Meanwhile, for massive big data, the classification prediction method based on big data distributed computing according to embodiments of the present invention uses an ensemble learning algorithm, which decomposes the same problem into multiple different modules to improve the generalization ability of the classifier. Generating multiple classifiers reduces the generalization error of the algorithm and thereby improves its performance, and this idea is applied to the field of data classification.
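The idea that several classifiers together can err less than any one of them can be seen in a toy majority-vote sketch. The data and the three weak classifiers below are invented for illustration and are not from the patent; majority voting is also only one ensemble scheme, simpler than the gradient boosting the patent uses.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Return the label predicted by most classifiers for input x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


def accuracy(predict, samples):
    """Fraction of (x, label) pairs that `predict` classifies correctly."""
    return sum(predict(x) == y for x, y in samples) / len(samples)


# Toy data: the true label is 1 when x >= 5, otherwise 0.
samples = [(x, int(x >= 5)) for x in range(10)]

# Three deliberately imperfect classifiers that err on different inputs.
weak_classifiers = [
    lambda x: int(x >= 4),             # wrong only at x = 4
    lambda x: int(x >= 6),             # wrong only at x = 5
    lambda x: int(x >= 5 and x != 8),  # wrong only at x = 8
]


def ensemble(x):
    """Combine the three weak classifiers by majority vote."""
    return majority_vote(weak_classifiers, x)
```

On this toy data each weak classifier scores 0.9 and the ensemble scores 1.0, because no two classifiers are wrong on the same input. The benefit shrinks when the classifiers' errors are correlated.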
In addition, the classification prediction method based on big data distributed computing according to the above embodiments of the present invention may also have the following additional technical features:
According to one embodiment of the present invention, the method further comprises step 4: verifying the predicted classes of the test samples; if the prediction accuracy on the test samples is low, adjusting the parameters of the GBM algorithm and continuing to train the model on the training samples, so as to obtain results with higher prediction accuracy on the test samples.
According to one embodiment of the present invention, in step 4 the predicted classes of the test samples are verified manually.
According to one embodiment of the present invention, in step 2, after the data in the standardized data set are labeled using the clustering algorithm, the training samples and the test samples are obtained through manual confirmation.
According to one embodiment of the present invention, adjusting the parameters of the GBM algorithm in step 4 comprises the following steps. Step 5: select a relatively high learning rate and, based on that learning rate, adjust the number of base learners. Step 6: based on the number of base learners from step 5, adjust the parameters of the decision trees. Step 7: reduce the learning rate and correspondingly increase the number of base learners.
According to one embodiment of the present invention, in step 7, the proportion of the training samples is adjusted before the learning rate is reduced.
Additional aspects and advantages of the present invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
Fig. 1 is a block diagram of the classification prediction method based on big data distributed computing according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
In the description of the present invention, it should be understood that orientation or positional terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", and "counterclockwise" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore should not be construed as limiting the present invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality of" means two or more, unless specifically defined otherwise.
In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled", and "fixed" should be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be an internal communication between two elements or an interaction between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may include the first and second features being in direct contact, or the first and second features not being in direct contact but contacting through another feature between them. Moreover, a first feature being "on", "above", or "over" a second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or merely indicates that the first feature is at a lower level than the second feature.
As shown in Fig. 1, the classification prediction method based on big data distributed computing according to an embodiment of the present invention comprises the following steps:
Step 1: clean the data in a distributed file system (HDFS) to remove the noise in the distributed file system and form a standardized data set.
Step 2: label the data in the standardized data set using a clustering algorithm, and obtain training samples and test samples.
Step 3: based on the standardized data set labeled by the clustering algorithm, use the MapReduce computing framework of the big data platform to iteratively compute the training model parameters from the training samples, and use those parameters to predict the classes of the test samples.
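Step 2 can be sketched as follows. The patent does not name the clustering algorithm or the split ratio, so this sketch assumes plain k-means (Lloyd's algorithm) with cluster indices used as labels, and a 75/25 random split; the function names and the `test_ratio` default are hypothetical. The manual confirmation of samples described in one embodiment is not modeled.

```python
import random


def kmeans_labels(points, k, iterations=20, seed=0):
    """Assign each point a cluster index using plain k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels


def train_test_split(points, labels, test_ratio=0.25, seed=0):
    """Shuffle labeled points and split them into training and test samples."""
    pairs = list(zip(points, labels))
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_ratio)
    return pairs[n_test:], pairs[:n_test]
```

The cluster indices carry no inherent meaning; in practice each cluster would still need to be mapped to a meaningful class name, which is one place where manual confirmation could fit.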
The classification prediction method based on big data distributed computing according to embodiments of the present invention can store massive data on a big data platform and use the platform's MapReduce computing framework to clean, cluster, and label the data. Finally, based on the labeled data, a Gradient Boosting algorithm is used to perform classification prediction and verification.
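To make the Gradient Boosting step concrete, here is a minimal single-machine sketch. It is not the patent's implementation: it assumes a single numeric feature with at least two distinct values, 0/1 labels, squared-error loss, and one-split regression trees (stumps) as base learners, and it does not distribute any work over MapReduce. The names `fit_stump` and `fit_gbm` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not right:  # the largest value leaves nothing on the right
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def fit_gbm(xs, ys, n_estimators=50, learning_rate=0.1):
    """Least-squares gradient boosting: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_estimators):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)
```

A class prediction can then be obtained by thresholding the model output, for example `int(model(x) >= 0.5)` for 0/1 labels. The `learning_rate` and `n_estimators` arguments correspond to the learning rate and the number of base learners discussed in the tuning steps.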
Meanwhile, for massive big data, the classification prediction method based on big data distributed computing according to embodiments of the present invention uses an ensemble learning algorithm, which decomposes the same problem into multiple different modules to improve the generalization ability of the classifier. Generating multiple classifiers reduces the generalization error of the algorithm and thereby improves its performance, and this idea is applied to the field of data classification.
It should be understood that embodiments of the present invention mainly use the data storage and computing capabilities of the big data platform to handle training on massive data and classification prediction. The process is broadly divided into three steps:
Clean and label the data to construct the training samples and the test samples;
Train the data model on the training samples, then use the resulting model to verify the accuracy on the test samples;
If the accuracy is not high enough, adjust the model parameters and continue training on the training data and verifying the accuracy on the test samples until the expected accuracy is reached.
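The train, verify, and adjust cycle in these three steps can be sketched as a simple loop. The patent does not say how candidate parameter settings are chosen or what the expected accuracy is, so `candidate_params`, `target_accuracy`, and the caller-supplied `train` and `evaluate` callables are all hypothetical placeholders.

```python
def train_until_accurate(train, evaluate, candidate_params, target_accuracy):
    """Try parameter settings in order until test accuracy reaches the target.

    `train(params)` returns a model and `evaluate(model)` returns its accuracy
    on the test samples; both are caller-supplied (hypothetical) callables.
    Returns the best (model, accuracy, params) seen.
    """
    best_model, best_acc, best_params = None, -1.0, None
    for params in candidate_params:
        model = train(params)
        acc = evaluate(model)
        if acc > best_acc:
            best_model, best_acc, best_params = model, acc, params
        if acc >= target_accuracy:
            break  # expected accuracy reached, stop adjusting
    return best_model, best_acc, best_params
```

If no setting reaches the target, the loop still returns the best model it saw, so the caller can decide whether to extend the candidate list.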
According to one embodiment of the present invention, the classification prediction method based on big data distributed computing may further comprise step 4: verifying the predicted classes of the test samples; if the prediction accuracy on the test samples is low, adjusting the parameters of the GBM (Gradient Boosting Machine) algorithm and continuing to train the model on the training samples, so as to obtain results with higher prediction accuracy on the test samples. This improves the accuracy verified on the test samples.
It should be understood that, preferably, the predicted classes of the test samples are verified manually in step 4.
It should be understood that in step 2, after the data in the standardized data set are labeled using the clustering algorithm, the training samples and the test samples are obtained through manual confirmation. This improves the accuracy of the training samples and the test samples.
In the classification prediction method based on big data distributed computing according to an embodiment of the present invention, adjusting the parameters of the GBM algorithm in step 4 comprises the following steps:
Step 5: select a relatively high learning rate and, based on that learning rate, adjust the number of base learners;
Step 6: based on the number of base learners from step 5, adjust the parameters of the decision trees;
Step 7: reduce the learning rate and correspondingly increase the number of base learners.
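Steps 5 to 7 can be read as a staged search, sketched below under several assumptions. The parameter names `learning_rate`, `n_estimators`, and `max_depth` are borrowed from common gradient boosting libraries rather than from the patent; tree depth stands in for "the parameters of the decision trees"; `score` is a hypothetical caller-supplied validation function; and scaling the number of learners in inverse proportion to the learning rate is a common heuristic, whereas the patent only says the number is increased correspondingly.

```python
def tune_gbm(score, n_estimators_grid, max_depth_grid,
             high_lr=0.1, lr_factor=0.1):
    """Staged GBM tuning following steps 5 to 7.

    `score(params) -> float` (higher is better) is a caller-supplied,
    hypothetical validation function taking a parameter dict.
    """
    # Step 5: fix a relatively high learning rate, then pick the number
    # of base learners that scores best at that rate.
    params = {"learning_rate": high_lr, "max_depth": max_depth_grid[0]}
    params["n_estimators"] = max(
        n_estimators_grid,
        key=lambda n: score({**params, "n_estimators": n}))

    # Step 6: with that number fixed, tune a decision-tree parameter (depth).
    params["max_depth"] = max(
        max_depth_grid,
        key=lambda d: score({**params, "max_depth": d}))

    # Step 7: lower the learning rate and raise the number of base learners
    # in inverse proportion (an assumed heuristic).
    params["learning_rate"] = high_lr * lr_factor
    params["n_estimators"] = round(params["n_estimators"] / lr_factor)
    return params
```

Tuning the learner count at a high learning rate keeps each trial cheap; the final low-rate, many-learner setting is only trained once.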
Further, according to one embodiment of the present invention, in step 7, the proportion of the training samples is adjusted before the learning rate is reduced.
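The patent does not define "the proportion of the training samples". One plausible reading, assumed here and not confirmed by the source, is the subsample ratio of stochastic gradient boosting: the fraction of training samples drawn at random, without replacement, to fit each base learner. Under that assumption a minimal sketch is:

```python
import random


def subsample(training_samples, ratio, seed=None):
    """Draw a random subset holding `ratio` of the training samples.

    Sampling is without replacement; `ratio` must lie in (0, 1].
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    size = max(1, round(len(training_samples) * ratio))
    return random.Random(seed).sample(training_samples, size)
```

A ratio below 1 makes each base learner see slightly different data, which tends to reduce overfitting at the cost of noisier individual learners.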
The classification prediction method based on big data distributed computing according to embodiments of the present invention is directed at big data classification prediction algorithms. Classification finds the common characteristics of a group of data objects in a massive data storage system and divides them into different classes according to a classification pattern; its purpose is to map the data items in the database to a given class by means of a classification model. It can be applied to application classification and trend prediction.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may join and combine the different embodiments or examples described in this specification.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Claims (6)
1. A classification prediction method based on big data distributed computing, characterized by comprising the following steps:
Step 1: cleaning the data in a distributed file system to remove the noise in the distributed file system and form a standardized data set;
Step 2: labeling the data in the standardized data set using a clustering algorithm, and obtaining training samples and test samples;
Step 3: based on the standardized data set labeled using the clustering algorithm, using the MapReduce computing framework of a big data platform to iteratively compute training model parameters from the training samples, and using the training model parameters to predict the classes of the test samples.
2. The classification prediction method based on big data distributed computing according to claim 1, characterized by further comprising step 4: verifying the predicted classes of the test samples; if the prediction accuracy on the test samples is low, adjusting the parameters of the GBM algorithm and continuing to train the model on the training samples, so as to obtain results with higher prediction accuracy on the test samples.
3. The classification prediction method based on big data distributed computing according to claim 2, characterized in that in step 4 the predicted classes of the test samples are verified manually.
4. The classification prediction method based on big data distributed computing according to claim 1, characterized in that in step 2, after the data in the standardized data set are labeled using the clustering algorithm, the training samples and the test samples are obtained through manual confirmation.
5. The classification prediction method based on big data distributed computing according to claim 2, characterized in that adjusting the parameters of the GBM algorithm in step 4 comprises the following steps:
Step 5: selecting a relatively high learning rate and, based on that learning rate, adjusting the number of base learners;
Step 6: based on the number of base learners from step 5, adjusting the parameters of the decision trees;
Step 7: reducing the learning rate and correspondingly increasing the number of base learners.
6. The classification prediction method based on big data distributed computing according to claim 5, characterized in that in step 7, the proportion of the training samples is adjusted before the learning rate is reduced.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810375744.XA CN108959350A (en) | 2018-04-25 | 2018-04-25 | Classification prediction technique based on big data distributed computing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108959350A true CN108959350A (en) | 2018-12-07 |
Family
ID=64499076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810375744.XA Pending CN108959350A (en) | 2018-04-25 | 2018-04-25 | Classification prediction technique based on big data distributed computing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108959350A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102567391A (en) * | 2010-12-20 | 2012-07-11 | 中国移动通信集团广东有限公司 | Method and device for building classification forecasting mixed model |
CN104268260A (en) * | 2014-10-10 | 2015-01-07 | 中国科学院重庆绿色智能技术研究院 | Method, device and system for classifying streaming data |
CN106528874A (en) * | 2016-12-08 | 2017-03-22 | 重庆邮电大学 | Spark memory computing big data platform-based CLR multi-label data classification method |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109584668A (en) * | 2018-12-29 | 2019-04-05 | 中铁工程装备集团有限公司 | A kind of rock tunnel(ling) machine training platform based on virtual reality and big data |
CN111553375A (en) * | 2019-02-08 | 2020-08-18 | Sap欧洲公司 | Using transformations to verify computer vision quality |
CN111553375B (en) * | 2019-02-08 | 2024-04-05 | Sap欧洲公司 | Using transformations to verify computer vision quality |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN100550054C (en) | A kind of image solid matching method and device thereof | |
US10706332B2 (en) | Analog circuit fault mode classification method | |
CN105389480B (en) | Multiclass imbalance genomics data iteration Ensemble feature selection method and system | |
CN101268505B (en) | Method and system for classifying a video | |
CN106202177A (en) | A kind of file classification method and device | |
CN107045788A (en) | Traffic Forecasting Methodology and device | |
CN107067045A (en) | Data clustering method, device, computer-readable medium and electronic equipment | |
CN108985380B (en) | Point switch fault identification method based on cluster integration | |
CN110008259A (en) | The method and terminal device of visualized data analysis | |
CN101915764A (en) | Road surface crack detection method based on dynamic programming | |
US20200320709A1 (en) | Verifying and/or improving the consistency of labels within the scope of medical image processing | |
CN109948740A (en) | A kind of classification method based on tranquillization state brain image | |
CN103714261A (en) | Intelligent auxiliary medical treatment decision supporting method of two-stage mixed model | |
CN104050361A (en) | Intelligent analysis early warning method for dangerousness tendency of prison persons serving sentences | |
CN108959350A (en) | Classification prediction technique based on big data distributed computing | |
CN110097098A (en) | Data classification method and device, medium and electronic equipment based on base classifier | |
CN103745233B (en) | The hyperspectral image classification method migrated based on spatial information | |
CN102663418B (en) | An image set modeling and matching method based on regression model | |
CN109947881A (en) | A kind of POI judging method, device, mobile terminal and computer readable storage medium | |
Lugaresi et al. | Generation and tuning of discrete event simulation models for manufacturing applications | |
Sanchez-Rodriguez et al. | Detecting brain network communities: Considering the role of information flow and its different temporal scales | |
CN109656808A (en) | A kind of Software Defects Predict Methods based on hybrid active learning strategies | |
CN104978569A (en) | Sparse representation based incremental face recognition method | |
CN109615002A (en) | Decision tree SVM university student's consumer behavior evaluation method based on PSO | |
CN108710967A (en) | Expressway traffic accident Severity forecasting method based on data fusion and support vector machines |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20181207 |