CN108959350A - Classification prediction technique based on big data distributed computing - Google Patents


Info

Publication number
CN108959350A
Authority
CN
China
Prior art keywords
data
classification
big data
distributed computing
test sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810375744.XA
Other languages
Chinese (zh)
Inventor
魏素扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Ming Crane Cloud Technology Co Ltd
Original Assignee
Jiangsu Ming Crane Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-04-25
Filing date
2018-04-25
Publication date
2018-12-07
Application filed by Jiangsu Ming Crane Cloud Technology Co Ltd
Priority to CN201810375744.XA
Publication of CN108959350A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Abstract

The present invention provides a classification prediction method based on big data distributed computing, comprising the following steps. Step 1: perform data cleaning on the data in a distributed file system, removing the noise in the distributed file system and forming a standardized data set. Step 2: label the data in the standardized data set using a clustering algorithm, and obtain training samples and test samples. Step 3: based on the standardized data set labeled by the clustering algorithm, use the MapReduce computing framework of a big data platform to iteratively compute the training model parameters from the training samples, and use those parameters to predict the classes of the test samples. The classification prediction method according to the present invention makes it convenient to perform classification prediction and verification on labeled data.

Description

Classification prediction technique based on big data distributed computing
Technical field
The present invention relates to the technical field of big data processing, and in particular to a classification prediction method based on big data distributed computing.
Background
In the big data era, data mining is a central task. Big data mining is the process of discovering the valuable, potentially useful information and knowledge hidden in massive, incomplete, noisy, fuzzy, and random large databases; it is also a decision support process. It is based mainly on artificial intelligence, machine learning, pattern learning, statistics, and the like. By analyzing big data in a highly automated way and reasoning inductively, potential patterns are mined from the data. In the related art, big data classification prediction algorithms in particular, when applied to classification and trend prediction, often fail to reflect the actual situation accurately.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the above technical problems in the prior art. In view of this, the present invention provides a classification prediction method based on big data distributed computing that has strong generalization ability and small generalization error.
The classification prediction method based on big data distributed computing according to the present invention comprises the following steps. Step 1: perform data cleaning on the data in a distributed file system, removing the noise in the distributed file system and forming a standardized data set. Step 2: label the data in the standardized data set using a clustering algorithm, and obtain training samples and test samples. Step 3: based on the standardized data set labeled by the clustering algorithm, use the MapReduce computing framework of a big data platform to iteratively compute the training model parameters from the training samples, and use the training model parameters to predict the classes of the test samples.
The classification prediction method based on big data distributed computing according to an embodiment of the present invention can store massive data on a big data platform and use the platform's MapReduce computing framework to clean, cluster, and label the data. Finally, based on the labeled data, the Gradient Boosting algorithm is used to perform classification prediction and verification.
Meanwhile, for massive big data, the classification prediction method based on big data distributed computing according to an embodiment of the present invention uses an ensemble learning algorithm, which decomposes one problem into multiple different modules in order to improve the generalization ability of the classifier. The idea is to reduce the generalization error of the algorithm by generating multiple classifiers, thereby improving the algorithm's performance, and the method applies this idea to the field of data classification.
In addition, the classification prediction method based on big data distributed computing according to the above embodiment of the present invention may also have the following additional technical features:
According to one embodiment of the present invention, the method further comprises step 4: verifying the predicted classes of the test samples; if the prediction accuracy on the test samples is low, adjusting the parameters of the GBM algorithm and continuing to train the model on the training samples, so as to obtain a result with higher prediction accuracy on the test samples.
According to one embodiment of the present invention, the predicted classes of the test samples are verified manually in step 4.
According to one embodiment of the present invention, in step 2, after the data in the standardized data set are labeled using the clustering algorithm, the training samples and the test samples are obtained through manual confirmation.
According to one embodiment of the present invention, adjusting the parameters of the GBM algorithm in step 4 comprises the following steps. Step 5: select a relatively high learning rate and, based on that learning rate, adjust the number of base learners. Step 6: adjust the parameters of the decision trees based on the number of base learners from step 5. Step 7: reduce the learning rate and correspondingly increase the number of base learners.
According to one embodiment of the present invention, in step 7, the ratio of the training samples is adjusted before the learning rate is reduced.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
Fig. 1 is a block diagram of the classification prediction method based on big data distributed computing according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
In the description of the present invention, it should be understood that orientation or positional terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", and "counterclockwise" refer to the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be construed as limiting the present invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined as "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality of" means two or more, unless otherwise specifically defined.
In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled", and "fixed" should be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal communication between two elements or an interaction between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are not in direct contact but are in contact through another feature between them. Moreover, a first feature being "on", "above", or "over" a second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or merely indicates that the first feature is at a lower level than the second feature.
As shown in Fig. 1, the classification prediction method based on big data distributed computing according to an embodiment of the present invention comprises the following steps:
Step 1: perform data cleaning on the data in a distributed file system (HDFS), removing the noise in the distributed file system and forming a standardized data set.
Step 2: label the data in the standardized data set using a clustering algorithm, and obtain training samples and test samples.
Step 3: based on the standardized data set labeled by the clustering algorithm, use the MapReduce computing framework of a big data platform to iteratively compute the training model parameters from the training samples, and use the training model parameters to predict the classes of the test samples.
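Steps 1 and 2 above can be sketched as follows. This is a minimal illustration that assumes small in-memory numeric records rather than files in HDFS. The function names, the rule of dropping incomplete rows, the z-score standardization, the farthest-first k-means initialization, and the 75/25 split are all assumptions made for the example; the patent does not specify the cleaning rules, the clustering algorithm, or the split ratio.

```python
import random
import statistics

def clean_and_standardize(rows):
    """Step 1 (sketch): drop rows with missing values, then z-score each column."""
    complete = [r for r in rows if all(v is not None for v in r)]
    columns = list(zip(*complete))
    means = [statistics.fmean(c) for c in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    spreads = [statistics.pstdev(c) or 1.0 for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(r, means, spreads)) for r in complete]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_labels(points, k, iterations=20):
    """Step 2 (sketch): label every point with the index of its k-means cluster."""
    # Deterministic farthest-first initialization: start from the first point,
    # then repeatedly add the point farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(squared_distance(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(statistics.fmean(c) for c in zip(*members))
    return labels

def split_samples(points, labels, test_fraction=0.25, seed=0):
    """Shuffle the labeled points and split them into training and test samples."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)
    cut = int(len(order) * (1 - test_fraction))
    train = [(points[i], labels[i]) for i in order[:cut]]
    test = [(points[i], labels[i]) for i in order[cut:]]
    return train, test
```

In the embodiment that adds manual confirmation, a reviewer would inspect the cluster labels before `split_samples` is called; that review step is outside this sketch.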
The classification prediction method based on big data distributed computing according to an embodiment of the present invention can store massive data on a big data platform and use the platform's MapReduce computing framework to clean, cluster, and label the data. Finally, based on the labeled data, the Gradient Boosting algorithm is used to perform classification prediction and verification.
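The idea of iteratively computing model parameters with a map phase and a reduce phase can be illustrated with a small single-process sketch. It is an assumption-laden stand-in: a logistic regression trained by gradient descent replaces the patent's Gradient Boosting model to keep the example short, the partitions are plain Python lists rather than HDFS blocks, and the names `map_gradient`, `reduce_gradients`, `train`, and `predict` are hypothetical.

```python
import math
from functools import reduce

def map_gradient(partition, weights):
    """Map phase: one partition emits its summed log-loss gradient and its row count."""
    gradient = [0.0] * len(weights)
    for features, label in partition:
        z = sum(w * x for w, x in zip(weights, features))
        error = 1.0 / (1.0 + math.exp(-z)) - label  # predicted probability minus label
        for i, x in enumerate(features):
            gradient[i] += error * x
    return gradient, len(partition)

def reduce_gradients(left, right):
    """Reduce phase: add the partial gradients and row counts of two partitions."""
    return [a + b for a, b in zip(left[0], right[0])], left[1] + right[1]

def train(partitions, n_features, learning_rate=0.5, iterations=200):
    """Run one map/reduce round per iteration and update the shared weights."""
    weights = [0.0] * n_features
    for _ in range(iterations):
        # On a real cluster each partition would be processed by a separate mapper.
        partials = [map_gradient(p, weights) for p in partitions]
        total, count = reduce(reduce_gradients, partials)
        weights = [w - learning_rate * g / count for w, g in zip(weights, total)]
    return weights

def predict(weights, features):
    """Classify as 1 when the linear score is positive, otherwise 0."""
    return 1 if sum(w * x for w, x in zip(weights, features)) > 0 else 0
```

Each sample here is a `(features, label)` pair whose first feature is a constant 1.0 acting as the bias term; that encoding is a convention of the sketch, not of the patent.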
Meanwhile the classification prediction technique based on big data distributed computing of embodiment according to the present invention, for magnanimity Big data, using Ensemble Learning Algorithms, the learning algorithm the same PROBLEM DECOMPOSITION into multiple and different modules, to improve The generalization ability of classifier;The extensive error of algorithm is reduced by generating multiple classifiers, thus the thought of boosting algorithm performance, And the thought data classification field is applied to.
It can be understood that, in an embodiment of the present invention, the data storage and computing capabilities of the big data platform are used mainly to handle training on massive data and classification prediction. The process is broadly divided into three steps:
Clean and label the data, and construct the training samples and test samples;
Train the data model using the training samples, and use the model result to verify the accuracy on the test samples;
If the accuracy is not high, adjust the model parameters and continue training on the training data and verifying the test samples until the expected accuracy is reached.
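The train, verify, and adjust loop in the three steps above can be sketched as a small driver function. The callables `train_fn` and `accuracy_fn`, the list of candidate parameter settings, and the target accuracy are hypothetical placeholders; the patent does not state how candidate parameters are generated or what accuracy counts as sufficient.

```python
def train_until_accurate(train_fn, accuracy_fn, candidate_params, target_accuracy):
    """Train with each candidate setting in turn until the test accuracy is high enough.

    Returns (params, model, accuracy) for the best setting seen so far, which is the
    first one to reach target_accuracy if any of them does.
    """
    best = None
    for params in candidate_params:
        model = train_fn(params)          # train on the training samples
        accuracy = accuracy_fn(model)     # verify on the test samples
        if best is None or accuracy > best[2]:
            best = (params, model, accuracy)
        if accuracy >= target_accuracy:
            break                         # expected accuracy reached, stop adjusting
    return best

# Toy usage: the "model" is just a threshold and the test samples are (x, label) pairs.
test_samples = [(1, 0), (2, 0), (3, 1), (4, 1)]

def toy_train(params):
    return params["threshold"]

def toy_accuracy(threshold):
    hits = sum(int(x > threshold) == label for x, label in test_samples)
    return hits / len(test_samples)

candidates = [{"threshold": t} for t in (0.5, 1.5, 2.5, 3.5)]
best_params, best_model, best_accuracy = train_until_accurate(
    toy_train, toy_accuracy, candidates, target_accuracy=1.0
)
```

In the embodiment with manual verification, `accuracy_fn` would be replaced by a human review of the predicted classes rather than an automatic comparison.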
According to one embodiment of the present invention, the classification prediction method based on big data distributed computing may further comprise step 4: verifying the predicted classes of the test samples; if the prediction accuracy on the test samples is low, adjusting the parameters of the GBM (Gradient Boosting Machine) algorithm and continuing to train the model on the training samples, so as to obtain a result with higher prediction accuracy on the test samples. In this way, the accuracy obtained when verifying the test samples can be improved.
It can be understood that, preferably, the predicted classes of the test samples are verified manually in step 4.
It can be understood that, in step 2, after the data in the standardized data set are labeled using the clustering algorithm, the training samples and test samples are obtained through manual confirmation. In this way, the accuracy of the training samples and test samples can be improved.
In the classification prediction method based on big data distributed computing according to an embodiment of the present invention, adjusting the parameters of the GBM algorithm in step 4 comprises the following steps:
Step 5: select a relatively high learning rate and, based on that learning rate, adjust the number of base learners;
Step 6: adjust the parameters of the decision trees based on the number of base learners from step 5;
Step 7: reduce the learning rate and correspondingly increase the number of base learners.
Further, according to one embodiment of the present invention, in step 7, the ratio of the training samples is adjusted before the learning rate is reduced.
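The tuning order in steps 5 to 7 can be sketched as a staged search. This is one possible reading, not the patent's procedure: `evaluate` is a hypothetical callable that trains a GBM with the given parameters and returns a validation score (higher is better), the parameter names `learning_rate`, `n_estimators`, `max_depth`, and `subsample` are borrowed from common gradient boosting libraries, tree depth stands in for "the parameters of the decision trees", and "the ratio of the training samples" is interpreted as the row subsampling ratio.

```python
def tune_gbm(evaluate, n_estimators_grid, max_depth_grid, subsample_grid,
             high_rate=0.1, shrink=10):
    """Tune one parameter group at a time, in the order given by steps 5 to 7."""
    params = {
        "learning_rate": high_rate,
        "n_estimators": n_estimators_grid[0],
        "max_depth": max_depth_grid[0],
        "subsample": 1.0,
    }

    def best_value(name, values):
        # Score each candidate with every other parameter held at its current value.
        return max(values, key=lambda v: evaluate({**params, name: v}))

    # Step 5: with a relatively high learning rate, choose the number of base learners.
    params["n_estimators"] = best_value("n_estimators", n_estimators_grid)
    # Step 6: with that number fixed, tune the decision tree parameters.
    params["max_depth"] = best_value("max_depth", max_depth_grid)
    # Before step 7: adjust the ratio of training samples used per learner.
    params["subsample"] = best_value("subsample", subsample_grid)
    # Step 7: lower the learning rate and raise the number of base learners in proportion.
    params["learning_rate"] = high_rate / shrink
    params["n_estimators"] = params["n_estimators"] * shrink
    return params
```

Scaling the learner count by the same factor as the learning rate reduction is a rule of thumb chosen for the sketch; the source only says the number is increased "correspondingly".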
The classification prediction method based on big data distributed computing according to an embodiment of the present invention is directed at big data classification prediction algorithms. Classification finds the common features of a set of data objects in a mass data storage system and divides the objects into different classes according to a classification pattern; its purpose is to map the data items in the database to a given class by means of a classification model. It can be applied to application classification and trend prediction.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that the specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may join and combine the different embodiments or examples described in this specification.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those skilled in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims (6)

1. A classification prediction method based on big data distributed computing, characterized by comprising the following steps:
Step 1: performing data cleaning on the data in a distributed file system, removing the noise in the distributed file system, and forming a standardized data set;
Step 2: labeling the data in the standardized data set using a clustering algorithm, and obtaining training samples and test samples;
Step 3: based on the standardized data set labeled by the clustering algorithm, using the MapReduce computing framework of a big data platform to iteratively compute the training model parameters from the training samples, and using the training model parameters to predict the classes of the test samples.
2. The classification prediction method based on big data distributed computing according to claim 1, characterized by further comprising step 4: verifying the predicted classes of the test samples; if the prediction accuracy on the test samples is low, adjusting the parameters of the GBM algorithm and continuing to train the model on the training samples, so as to obtain a result with higher prediction accuracy on the test samples.
3. The classification prediction method based on big data distributed computing according to claim 2, characterized in that the predicted classes of the test samples are verified manually in step 4.
4. The classification prediction method based on big data distributed computing according to claim 1, characterized in that, in step 2, after the data in the standardized data set are labeled using the clustering algorithm, the training samples and the test samples are obtained through manual confirmation.
5. The classification prediction method based on big data distributed computing according to claim 2, characterized in that adjusting the parameters of the GBM algorithm in step 4 comprises the following steps:
Step 5: selecting a relatively high learning rate and, based on that learning rate, adjusting the number of base learners;
Step 6: adjusting the parameters of the decision trees based on the number of base learners from step 5;
Step 7: reducing the learning rate and correspondingly increasing the number of base learners.
6. The classification prediction method based on big data distributed computing according to claim 5, characterized in that, in step 7, the ratio of the training samples is adjusted before the learning rate is reduced.
CN201810375744.XA 2018-04-25 2018-04-25 Classification prediction technique based on big data distributed computing Pending CN108959350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810375744.XA CN108959350A (en) 2018-04-25 2018-04-25 Classification prediction technique based on big data distributed computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810375744.XA CN108959350A (en) 2018-04-25 2018-04-25 Classification prediction technique based on big data distributed computing

Publications (1)

Publication Number Publication Date
CN108959350A true CN108959350A (en) 2018-12-07

Family

ID=64499076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810375744.XA Pending CN108959350A (en) 2018-04-25 2018-04-25 Classification prediction technique based on big data distributed computing

Country Status (1)

Country Link
CN (1) CN108959350A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584668A (en) * 2018-12-29 2019-04-05 中铁工程装备集团有限公司 A kind of rock tunnel(ling) machine training platform based on virtual reality and big data
CN111553375A (en) * 2019-02-08 2020-08-18 Sap欧洲公司 Using transformations to verify computer vision quality

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567391A (en) * 2010-12-20 2012-07-11 中国移动通信集团广东有限公司 Method and device for building classification forecasting mixed model
CN104268260A (en) * 2014-10-10 2015-01-07 中国科学院重庆绿色智能技术研究院 Method, device and system for classifying streaming data
CN106528874A (en) * 2016-12-08 2017-03-22 重庆邮电大学 Spark memory computing big data platform-based CLR multi-label data classification method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584668A (en) * 2018-12-29 2019-04-05 中铁工程装备集团有限公司 A kind of rock tunnel(ling) machine training platform based on virtual reality and big data
CN111553375A (en) * 2019-02-08 2020-08-18 Sap欧洲公司 Using transformations to verify computer vision quality
CN111553375B (en) * 2019-02-08 2024-04-05 Sap欧洲公司 Using transformations to verify computer vision quality

Similar Documents

Publication Publication Date Title
CN100550054C (en) A kind of image solid matching method and device thereof
US10706332B2 (en) Analog circuit fault mode classification method
CN105389480B (en) Multiclass imbalance genomics data iteration Ensemble feature selection method and system
CN101268505B (en) Method and system for classifying a video
CN106202177A (en) A kind of file classification method and device
CN107045788A (en) Traffic Forecasting Methodology and device
CN107067045A (en) Data clustering method, device, computer-readable medium and electronic equipment
CN108985380B (en) Point switch fault identification method based on cluster integration
CN110008259A (en) The method and terminal device of visualized data analysis
CN101915764A (en) Road surface crack detection method based on dynamic programming
US20200320709A1 (en) Verifying and/or improving the consistency of labels within the scope of medical image processing
CN109948740A (en) A kind of classification method based on tranquillization state brain image
CN103714261A (en) Intelligent auxiliary medical treatment decision supporting method of two-stage mixed model
CN104050361A (en) Intelligent analysis early warning method for dangerousness tendency of prison persons serving sentences
CN108959350A (en) Classification prediction technique based on big data distributed computing
CN110097098A (en) Data classification method and device, medium and electronic equipment based on base classifier
CN103745233B (en) The hyperspectral image classification method migrated based on spatial information
CN102663418B (en) An image set modeling and matching method based on regression model
CN109947881A (en) A kind of POI judging method, device, mobile terminal and computer readable storage medium
Lugaresi et al. Generation and tuning of discrete event simulation models for manufacturing applications
Sanchez-Rodriguez et al. Detecting brain network communities: Considering the role of information flow and its different temporal scales
CN109656808A (en) A kind of Software Defects Predict Methods based on hybrid active learning strategies
CN104978569A (en) Sparse representation based incremental face recognition method
CN109615002A (en) Decision tree SVM university student's consumer behavior evaluation method based on PSO
CN108710967A (en) Expressway traffic accident Severity forecasting method based on data fusion and support vector machines

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181207