CN108959350A - Classification prediction method based on big data distributed computing

Info

Publication number: CN108959350A
Application number: CN201810375744.XA
Authority: CN
Inventors: 魏素扬
Original assignee: Jiangsu Ming Crane Cloud Technology Co Ltd
Current assignee: Jiangsu Ming Crane Cloud Technology Co Ltd
Priority date: 2018-04-25
Filing date: 2018-04-25
Publication date: 2018-12-07

Abstract

The present invention provides a classification prediction method based on big data distributed computing, comprising the following steps. Step 1: perform data cleaning on the data in a distributed file system, remove the noise in the distributed file system, and form a standardized data set. Step 2: label the data in the standardized data set using a clustering algorithm, and obtain training samples and test samples. Step 3: based on the standardized data set labeled by the clustering algorithm, use the MapReduce computing framework of a big data platform to iteratively compute the training model parameters from the training samples, and use the training model parameters to predict the classes of the test samples. The classification prediction method based on big data distributed computing according to the present invention facilitates classification prediction and verification of labeled data.

Description

Classification prediction method based on big data distributed computing

Technical field

The present invention relates to the technical field of big data processing, and in particular to a classification prediction method based on big data distributed computing.

Background art

In the big data era, data mining is a primary task. Big data mining is the process of discovering valuable, potentially useful information and knowledge hidden in massive, incomplete, noisy, fuzzy, and random large databases; it is also a decision support process. It is based mainly on artificial intelligence, machine learning, pattern learning, statistics, and the like. By analyzing big data in a highly automated way and reasoning inductively, potential patterns are mined from the data. In the related art, big data classification prediction algorithms in particular often fail to reflect the actual situation accurately when used for application classification and trend prediction.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the above technical problems in the prior art. To this end, the present invention provides a classification prediction method based on big data distributed computing that has strong generalization ability and a small generalization error.

The classification prediction method based on big data distributed computing according to the present invention comprises the following steps. Step 1: perform data cleaning on the data in a distributed file system, remove the noise in the distributed file system, and form a standardized data set. Step 2: label the data in the standardized data set using a clustering algorithm, and obtain training samples and test samples. Step 3: based on the standardized data set labeled by the clustering algorithm, use the MapReduce computing framework of a big data platform to iteratively compute the training model parameters from the training samples, and use the training model parameters to predict the classes of the test samples.

The classification prediction method based on big data distributed computing according to embodiments of the present invention can store massive data on a big data platform and use the platform's MapReduce computing framework to clean, cluster, and label the data. Finally, based on the labeled data, the Gradient Boosting algorithm is used to perform classification prediction and verification.
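As a rough illustration of how the cleaning and standardization could be phrased as map and reduce passes, the sketch below runs in a single Python process rather than on a real MapReduce cluster. It assumes that "noise" means records with missing or non-finite fields and that "standardized" means z-score scaling; neither definition comes from the patent, and the function names `is_clean`, `map_stats`, `reduce_stats`, and `standardize` are hypothetical.

```python
import math
from functools import reduce

def is_clean(record):
    """Assumed noise rule: every field must be a finite number."""
    return all(isinstance(v, (int, float)) and math.isfinite(v) for v in record)

def map_stats(partition):
    """Map phase: drop noisy records, emit (count, column sums, column sums of squares)."""
    rows = [r for r in partition if is_clean(r)]
    if not rows:
        return (0, [], [])
    cols = list(zip(*rows))
    return (len(rows),
            [sum(c) for c in cols],
            [sum(v * v for v in c) for c in cols])

def reduce_stats(a, b):
    """Reduce phase: merge the partial statistics of two partitions."""
    if a[0] == 0:
        return b
    if b[0] == 0:
        return a
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])

def standardize(partitions):
    """Return the cleaned records scaled to zero mean and unit variance per column."""
    n, sums, squares = reduce(reduce_stats, map(map_stats, partitions), (0, [], []))
    if n == 0:
        return []
    means = [s / n for s in sums]
    # A constant column has zero deviation; divide by 1.0 instead to avoid errors.
    stds = [math.sqrt(max(q / n - m * m, 0.0)) or 1.0 for q, m in zip(squares, means)]
    return [[(v - m) / sd for v, m, sd in zip(r, means, stds)]
            for p in partitions for r in p if is_clean(r)]
```

Each partition stands in for one block of the distributed file system; only the small tuples produced by `map_stats` need to be moved between nodes.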

Meanwhile, for massive big data, the classification prediction method based on big data distributed computing according to embodiments of the present invention uses an ensemble learning algorithm, which decomposes the same problem into multiple different modules in order to improve the generalization ability of the classifier. The idea of reducing the generalization error of the algorithm by generating multiple classifiers, and thereby improving the performance of the algorithm, is applied to the field of data classification.
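For reference, the additive structure of gradient boosting, the ensemble method named in this description, is commonly written as below. The notation is not taken from the patent and is an assumption for illustration: a loss function L, M base learners h_m, a learning rate ν, and n training samples (x_i, y_i). This is the simplified form in which any line-search step size is absorbed into h_m.

```latex
\begin{aligned}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c), \\
r_{im} &= -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \dots, n, \\
h_m &= \text{base learner fitted to } \{(x_i, r_{im})\}_{i=1}^{n}, \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x), \quad 0 < \nu \le 1, \quad m = 1, \dots, M.
\end{aligned}
```

Each new base learner corrects the errors left by the ensemble built so far. A smaller ν shrinks each correction, so more learners are needed to reach the same fit.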

In addition, the classification prediction method based on big data distributed computing according to the above embodiments of the present invention may also have the following additional technical features:

According to one embodiment of the present invention, the method further comprises step 4: verify the predicted classes of the test samples; if the prediction accuracy on the test samples is low, adjust the parameters of the GBM algorithm and continue training the model on the training samples, so as to obtain more accurate classification predictions for the test samples.

According to one embodiment of the present invention, in step 4 the predicted classes of the test samples are verified manually.

According to one embodiment of the present invention, in step 2, after the data in the standardized data set are labeled using the clustering algorithm, the training samples and the test samples are obtained through manual confirmation.

According to one embodiment of the present invention, in step 4, adjusting the parameters of the GBM algorithm comprises the following steps. Step 5: select a relatively high learning rate and, based on that learning rate, adjust the number of base learners. Step 6: based on the number of base learners determined in step 5, adjust the parameters of the decision trees. Step 7: reduce the learning rate and correspondingly increase the number of base learners.

According to one embodiment of the present invention, in step 7, the ratio of the training samples is adjusted before the learning rate is reduced.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the present invention.

Brief description of the drawings

Fig. 1 is a block diagram of the classification prediction method based on big data distributed computing according to an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and shall not be construed as limiting it.

In the description of the present invention, it is to be understood that orientation or positional terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", and "counterclockwise" refer to the orientations or positional relationships shown in the drawings. They are used merely to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore shall not be construed as limiting the present invention.

In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or as implicitly indicating the number of technical features referred to. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality of" means two or more, unless specifically defined otherwise.

In the present invention, unless otherwise expressly specified and limited, terms such as "installed", "connected", "coupled", and "fixed" shall be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal connection between two elements or an interaction between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.

In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may include the first and second features being in direct contact, or may include the first and second features not being in direct contact but being in contact through another feature between them. Moreover, a first feature being "over", "above", or "on top of" a second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or merely indicates that the first feature is at a lower level than the second feature.

As shown in Fig. 1, the classification prediction method based on big data distributed computing according to an embodiment of the present invention comprises the following steps:

Step 1: perform data cleaning on the data in a distributed file system (HDFS), remove the noise in the distributed file system, and form a standardized data set.

Step 2: label the data in the standardized data set using a clustering algorithm, and obtain training samples and test samples.

Step 3: based on the standardized data set labeled by the clustering algorithm, use the MapReduce computing framework of a big data platform to iteratively compute the training model parameters from the training samples, and use the training model parameters to predict the classes of the test samples.
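The iterative computation in step 3 can be pictured as repeated map and reduce passes over partitions of the training samples. The sketch below is single-process Python, not Hadoop code, and it substitutes logistic regression trained by gradient descent for the GBM model so that the map and reduce roles stay easy to see. The function names, the learning rate, and the iteration count are all assumptions made for illustration.

```python
import math
from functools import reduce

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))

def map_gradient(partition, weights):
    """Map phase: partial log-loss gradient over one partition of (features, label) pairs."""
    grad = [0.0] * len(weights)
    for x, y in partition:
        err = sigmoid(sum(w * v for w, v in zip(weights, x))) - y
        for j, v in enumerate(x):
            grad[j] += err * v
    return grad, len(partition)

def reduce_gradient(a, b):
    """Reduce phase: add up partial gradients and sample counts."""
    return [p + q for p, q in zip(a[0], b[0])], a[1] + b[1]

def train(partitions, n_features, learning_rate=0.5, iterations=200):
    """One map/reduce pass per iteration, then a parameter update on the driver."""
    weights = [0.0] * n_features
    for _ in range(iterations):
        grad, n = reduce(reduce_gradient,
                         (map_gradient(p, weights) for p in partitions))
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad)]
    return weights

def predict(weights, x):
    """Predict class 1 or 0 for one sample using the trained parameters."""
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x))) >= 0.5 else 0
```

In this sketch labels are 0 or 1 and the first feature of every sample is a constant 1.0 acting as a bias term. Only the current weights travel to the mappers and only summed gradients travel back, which is what makes each iteration fit the MapReduce pattern.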

It can be understood that embodiments of the present invention mainly use the data storage capacity and computing power of the big data platform to handle training and classification prediction on massive data. The process is divided mainly into three steps:

Data cleaning and labeling, to construct the training samples and the test samples;

Training the data model using the training samples, and using the model results to verify the accuracy on the test samples;

If the accuracy is not high, adjusting the model parameters and continuing to train on the training data and to verify the accuracy on the test samples until the expected accuracy is reached.

According to one embodiment of the present invention, the classification prediction method based on big data distributed computing may further comprise step 4: verify the predicted classes of the test samples; if the prediction accuracy on the test samples is low, adjust the parameters of the GBM (Gradient Boosting Machine) algorithm and continue training the model on the training samples, so as to obtain more accurate classification predictions for the test samples. The accuracy verified on the test samples can thereby be improved.
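The verify-and-adjust loop of step 4 might look like the following minimal sketch. It makes several assumptions not stated in the patent: labels are binary (-1 or +1), the loss is squared error, the base learners are one-split decision stumps, the only parameter adjusted is the number of base learners, and the model is retrained from scratch on each attempt rather than continued. Verification here is an automatic accuracy check, whereas the description prefers manual verification. All function names are hypothetical.

```python
def fit_stump(X, residuals):
    """Return the (feature, threshold, left value, right value) stump with least squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [r for x, r in zip(X, residuals) if x[j] <= t]
            right = [r for x, r in zip(X, residuals) if x[j] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    if best is None:  # no valid split: fall back to a constant prediction
        return (0, float("inf"), sum(residuals) / len(residuals), 0.0)
    return best[1:]

def stump_predict(stump, x):
    j, t, lv, rv = stump
    return lv if x[j] <= t else rv

def train_gbm(X, y, n_estimators, learning_rate):
    """Gradient boosting with squared loss: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - s for yi, s in zip(y, scores)]  # negative gradient of squared loss
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, x) for s, x in zip(scores, X)]
    return base, learning_rate, stumps

def predict(model, x):
    base, lr, stumps = model
    score = base + lr * sum(stump_predict(s, x) for s in stumps)
    return 1 if score >= 0 else -1

def accuracy(model, X, y):
    return sum(predict(model, x) == yi for x, yi in zip(X, y)) / len(y)

def train_until_accurate(X_train, y_train, X_test, y_test,
                         target=0.95, learning_rate=0.5, max_attempts=6):
    """Step 4 loop: verify on the test samples; if accuracy is low, adjust and retrain."""
    for attempt in range(max_attempts):
        n_estimators = 2 ** attempt  # assumed adjustment: double the base learners
        model = train_gbm(X_train, y_train, n_estimators, learning_rate)
        acc = accuracy(model, X_test, y_test)
        if acc >= target:
            break
    return model, n_estimators, acc
```

The target accuracy of 0.95 and the doubling schedule are placeholders; in practice the threshold and the parameters to adjust would depend on the data and on the tuning procedure chosen.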

It can be understood that, preferably, the predicted classes of the test samples are verified manually in step 4.

It can be understood that in step 2, after the data in the standardized data set are labeled using the clustering algorithm, the training samples and the test samples are obtained through manual confirmation. The accuracy of the training samples and the test samples can thereby be improved.
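The patent does not name a specific clustering algorithm, so the sketch below assumes k-means as one possible choice for the labeling in step 2. It also shows one hypothetical way to shortlist records for the manual confirmation: flag points that lie almost as close to a second centroid as to their own. The function names, the split ratio, and the review threshold are illustrative assumptions.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda c: dist2(point, centroids[c]))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns one cluster label per point, plus the centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centroids) for p in points], centroids

def review_candidates(points, centroids, ratio=0.5):
    """Indices of ambiguous points (needs at least two centroids)."""
    flagged = []
    for i, p in enumerate(points):
        d = sorted(dist2(p, c) for c in centroids)
        if d[0] >= ratio * d[1]:
            flagged.append(i)
    return flagged

def split(points, labels, test_ratio=0.25, seed=0):
    """Shuffle the labeled points and divide them into training and test samples."""
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_ratio))
    train = [(points[i], labels[i]) for i in idx[n_test:]]
    test = [(points[i], labels[i]) for i in idx[:n_test]]
    return train, test
```

A reviewer would then confirm or correct the cluster labels, starting with the flagged indices, before the split is used for training.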

In the classification prediction method based on big data distributed computing according to an embodiment of the present invention, adjusting the parameters of the GBM algorithm in step 4 comprises the following steps:

Step 5: select a relatively high learning rate and, based on that learning rate, adjust the number of base learners;

Step 6: based on the number of base learners determined in step 5, adjust the parameters of the decision trees;

Step 7: reduce the learning rate and correspondingly increase the number of base learners.
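Steps 5 to 7 can be read as a staged search, sketched below under stated assumptions. The caller supplies `evaluate`, a hypothetical function that trains a GBM with the given parameters and returns its validation accuracy. The parameter names `learning_rate`, `n_estimators`, and `max_depth` are borrowed from common GBM libraries, and the candidate grids and the shrink factor are arbitrary. Tree depth stands in for "the parameters of the decision trees", which the patent does not enumerate.

```python
def tune_gbm(evaluate, high_lr=0.1, tree_counts=(20, 40, 60, 80),
             depths=(3, 5, 7), shrink=10):
    """Staged tuning; evaluate(params) must return a score where higher is better."""
    params = {"learning_rate": high_lr, "max_depth": depths[0]}

    # Step 5: at a relatively high learning rate, choose the number of base learners.
    params["n_estimators"] = max(
        tree_counts, key=lambda n: evaluate({**params, "n_estimators": n}))

    # Step 6: with that number of learners fixed, tune the decision tree parameters.
    params["max_depth"] = max(
        depths, key=lambda d: evaluate({**params, "max_depth": d}))

    # Step 7: lower the learning rate and raise the number of learners to match.
    params["learning_rate"] = high_lr / shrink
    params["n_estimators"] = params["n_estimators"] * shrink
    return params
```

Tuning at a high learning rate first keeps each evaluation cheap; the final slower, larger model is trained only once the other parameters are settled.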

Further, according to one embodiment of the present invention, in step 7, the ratio of the training samples is adjusted before the learning rate is reduced.
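The patent does not define "the ratio of the training samples". One plausible reading, assumed here, is the subsampling ratio of stochastic gradient boosting, where each base learner is fitted to a random fraction of the training samples. Under that assumption, a minimal sketch of drawing the subsample and choosing the ratio follows; `evaluate` is a hypothetical caller-supplied function that returns validation accuracy for a given ratio.

```python
import random

def subsample_indices(n_samples, ratio, rng):
    """Pick, without replacement, the rows one base learner is trained on."""
    k = max(1, round(n_samples * ratio))
    return rng.sample(range(n_samples), k)

def tune_subsample(evaluate, ratios=(0.6, 0.7, 0.8, 0.9, 1.0)):
    """Return the candidate ratio with the best score from evaluate(ratio)."""
    return max(ratios, key=evaluate)
```

If the phrase instead refers to the train/test split proportion, the same selection pattern applies with the split ratio as the candidate value.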

The classification prediction method based on big data distributed computing according to embodiments of the present invention is directed at big data classification prediction algorithms. Classification finds the common features of a group of data objects in a massive data storage system and divides them into different classes according to a classification pattern; its purpose is to map the data items in the database to a given category by means of a classification model. The method can be applied to application classification and trend prediction.

In this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may join and combine the different embodiments or examples described in this specification.

Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims

1. A classification prediction method based on big data distributed computing, characterized by comprising the following steps:

Step 1: perform data cleaning on the data in a distributed file system, remove the noise in the distributed file system, and form a standardized data set;

Step 2: label the data in the standardized data set using a clustering algorithm, and obtain training samples and test samples;

Step 3: based on the standardized data set labeled by the clustering algorithm, use the MapReduce computing framework of a big data platform to iteratively compute the training model parameters from the training samples, and use the training model parameters to predict the classes of the test samples.

2. The classification prediction method based on big data distributed computing according to claim 1, characterized by further comprising step 4: verify the predicted classes of the test samples; if the prediction accuracy on the test samples is low, adjust the parameters of the GBM algorithm and continue training the model on the training samples, so as to obtain more accurate classification predictions for the test samples.

3. The classification prediction method based on big data distributed computing according to claim 2, characterized in that in step 4 the predicted classes of the test samples are verified manually.

4. The classification prediction method based on big data distributed computing according to claim 1, characterized in that in step 2, after the data in the standardized data set are labeled using the clustering algorithm, the training samples and the test samples are obtained through manual confirmation.

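5. The classification prediction method based on big data distributed computing according to claim 2, characterized in that in step 4, adjusting the parameters of the GBM algorithm comprises the following steps:

Step 5: select a relatively high learning rate and, based on that learning rate, adjust the number of base learners;

Step 6: based on the number of base learners determined in step 5, adjust the parameters of the decision trees;

Step 7: reduce the learning rate and correspondingly increase the number of base learners.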

6. The classification prediction method based on big data distributed computing according to claim 5, characterized in that in step 7, the ratio of the training samples is adjusted before the learning rate is reduced.