CN107169572A

CN107169572A - A machine learning service assembly method based on Mahout

Info

Publication number: CN107169572A
Application number: CN201611203680.2A
Authority: CN
Inventors: 郭文忠; 黄益成; 陈星�
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2017-09-15
Anticipated expiration: 2036-12-23
Also published as: CN107169572B

Abstract

The present invention provides a machine learning service assembly method based on Mahout, characterized in that it comprises the following steps. Step S1: preprocess data of different formats. Step S2: train models. Step S3: evaluate the models. Step S4: encapsulate the operators in a unified way. Step S5: assemble machine learning workflow paths according to the machine learning method the user wants to use and the format of the data to be processed, as described by the user. Step S6: after these machine learning workflows finish running on the Hadoop platform through Oozie, the model evaluation operator of each workflow gives the evaluation result of that workflow, and the user selects a machine learning workflow according to this result. Compared with the prior art, the present invention can quickly and effectively customize and tune reusable machine learning flows, so that data mining can be carried out efficiently on the Hadoop platform.

Description

A machine learning service assembly method based on Mahout

Technical field

The present invention relates to a machine learning service assembly method based on Mahout.

Background

The volume of data that human society produces and stores every day keeps growing, and users' data analysis requirements are becoming increasingly diverse. How to build a machine learning flow that can process large-scale data simply, quickly and efficiently has become an urgent need. Mahout is an open-source distributed machine learning algorithm library built on Hadoop. Its appearance addressed the shortcomings of traditional machine learning libraries: the lack of an active technical community, poor scalability, the inability to process distributed massive data, and not being open source. However, Mahout provides many machine learning algorithms, and each algorithm has from a few to dozens of tunable parameters, so data mining with Mahout on the Hadoop platform still carries a very high learning cost.

Summary of the invention

To solve the above problem, the present invention provides a Mahout machine learning service assembly method from the perspective of reuse.

The present invention adopts the following technical solution: a machine learning service assembly method based on Mahout, characterized in that it comprises the following steps. Step S1: preprocess data of different formats and convert them into the feature vectors used for model training. Step S2: train clustering models, classification models and collaborative filtering recommendation models. Step S3: evaluate the trained models. Step S4: steps S1, S2 and S3 each correspond to a series of operators in the Mahout algorithm library; encapsulate these operators in a unified way so that they become a series of services that conform to the calling specification of the Oozie workflow platform. Step S5: according to the machine learning method the user wants to use and the format of the data to be processed, as described by the user, assemble one or more machine learning workflow paths that meet the requirements. Step S6: after these machine learning workflows finish running on the Hadoop platform through Oozie, the model evaluation operator of each workflow gives the evaluation result of that workflow, and the user selects a machine learning workflow according to this result.

In an embodiment of the present invention, the method further comprises step S7: the machine learning workflow selected by the user is stored in a knowledge base, so that this user, or users with similar requirements, can reuse it later.

In an embodiment of the present invention, the data preprocessing in step S1 includes: SeqDirectory, Lucene2Seq, Seq2Sparse, Arff.Vector, Split, SplitDataSet, Describe and Hive.

In an embodiment of the present invention, the clustering models in step S2 include clustering models that use five clustering algorithms: Canopy, K-Means, fuzzy K-Means, LDA and spectral clustering.

Further, the classification models in step S2 include classification models that use the naive Bayes algorithm and the random forest algorithm.

Further, the collaborative filtering recommendation models in step S2 are models that use the matrix factorization collaborative filtering algorithm and the item-based collaborative filtering recommendation algorithm.

In an embodiment of the present invention, the evaluation of the trained models in step S3 comprises the following steps. Step S31: clustering models are evaluated using inter-cluster distance and cluster output inspection. Step S32: classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix. Step S33: collaborative filtering recommendation models are evaluated using the accuracy of the model.

Compared with the prior art, the present invention can quickly and effectively customize and tune reusable machine learning flows, so that data mining can be carried out efficiently on the Hadoop platform.

Brief description of the drawings

Fig. 1 is the main flow chart of the present invention.

Fig. 2 is the machine learning workflow diagram of the present invention.

Fig. 3 is the workflow diagram generated according to the requirements in one embodiment of the present invention.

Detailed description of the embodiments

The present invention is further explained below with reference to the accompanying drawings and specific embodiments.

A machine learning service assembly method based on Mahout comprises the following steps. Step S1: preprocess data of different formats and convert them into the feature vectors used for model training. Step S2: train clustering models, classification models and collaborative filtering recommendation models. Step S3: evaluate the trained models. Step S4: steps S1, S2 and S3 each correspond to a series of operators in the Mahout algorithm library; encapsulate these operators in a unified way so that they become a series of services that conform to the calling specification of the Oozie workflow platform. Step S5: according to the machine learning method the user wants to use and the format of the data to be processed, as described by the user, assemble one or more machine learning workflow paths that meet the requirements. Step S6: after these machine learning workflows finish running on the Hadoop platform through Oozie, the model evaluation operator of each workflow gives the evaluation result of that workflow, and the user selects a machine learning workflow according to this result. Oozie is a Java web application for scheduling Hadoop jobs. It combines multiple jobs, in sequence, into one logical unit of work. It is integrated with Hadoop and supports Hadoop tasks such as MapReduce, Pig, Hive and Sqoop.
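The unified encapsulation of step S4 can be pictured with a small sketch. It assumes that each operator is wrapped as an Oozie `java` action that calls Mahout's command-line driver class `org.apache.mahout.driver.MahoutDriver`. The helper name `mahout_action`, the `${jobTracker}` and `${nameNode}` properties, the `fail` transition and the HDFS paths are illustrative assumptions and are not taken from the patent.

```python
import xml.etree.ElementTree as ET

# Assumption: operators are run through Mahout's command-line driver class.
MAHOUT_DRIVER = "org.apache.mahout.driver.MahoutDriver"

def mahout_action(name, command, params, ok_to, error_to="fail"):
    """Build an Oozie <action> element that wraps one Mahout command (sketch)."""
    action = ET.Element("action", {"name": name})
    java = ET.SubElement(action, "java")
    ET.SubElement(java, "job-tracker").text = "${jobTracker}"
    ET.SubElement(java, "name-node").text = "${nameNode}"
    ET.SubElement(java, "main-class").text = MAHOUT_DRIVER
    # The Mahout command name comes first, then "--flag value" pairs.
    ET.SubElement(java, "arg").text = command
    for flag, value in params.items():
        ET.SubElement(java, "arg").text = "--" + flag
        ET.SubElement(java, "arg").text = str(value)
    # Transitions to the next operator on success, or to a kill node on error.
    ET.SubElement(action, "ok", {"to": ok_to})
    ET.SubElement(action, "error", {"to": error_to})
    return action

action = mahout_action(
    "seqdirectory", "seqdirectory",
    {"input": "/data/txt", "output": "/work/seq"},  # hypothetical HDFS paths
    ok_to="seq2sparse",
)
print(ET.tostring(action, encoding="unicode"))
```

Because every operator is reduced to the same shape (a name, a command, a parameter map and a successor), assembling a workflow becomes a matter of chaining such actions.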

(1)SeqDirectory

This operator converts a directory of text documents into files in SequenceFile format. SequenceFile is a serialized file format that Hadoop designed for storing key-value pairs in binary form. Because Mahout is built on Hadoop, the machine learning algorithms it provides naturally also require their input files to be in SequenceFile format. The Mahout command for the SeqDirectory operator is "seqdirectory". By encapsulating the SeqDirectory operator as a service, we expose 2 parameters to the user. Besides the output path, the "--input" parameter specifies the input path of the files to be processed. Because this operator is an entry operator, the user must specify this parameter.

(2)Lucene2Seq

Similar to SeqDirectory, this operator produces SequenceFile files, except that its input is the index files generated by the Lucene analyzer. The Mahout command for the Lucene2Seq operator is "lucene2seq". After encapsulation, we expose 2 Lucene2Seq parameters to the user: the "--input" parameter specifies the input path of the Lucene index files to be processed and, since this is also an entry operator, must be specified; the other is the output path parameter, which has a default.

(3)Seq2Sparse

The Seq2Sparse operator is an important tool for generating vectors from text documents. It reads text data that the SeqDirectory operator has converted into SequenceFile format, first generates a dictionary file from the data, and then generates the feature vectors of the text based on the dictionary. The feature weighting of these vectors can be simple tf term-frequency weighting or the tf-idf weighting popular in industry; the weighting scheme is specified with "--weight". The Mahout command for the Seq2Sparse operator is "seq2sparse". The seq2sparse operator has dozens of parameters that can be specified. Through our encapsulation, besides the output path, 5 parameters are exposed to the user: "--analyzerName" specifies the class name of the text analyzer to use, and its default is the Lucene standard analyzer; "--weight" specifies the weighting mechanism to use, where tf is weighting based on term frequency and tfidf is weighting based on TF-IDF, and its default is tfidf; "--minSupport" specifies the minimum frequency in the whole collection for a word to be put into the dictionary file, words below this frequency are ignored, and its default is 2; "--minDF" specifies the minimum number of documents a word must appear in to be put into the dictionary file, words below this count are ignored, and its default is 1; "--maxDFPercent" specifies the maximum percentage of documents a word may appear in to be put into the dictionary file, and its default is 99.
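As an illustration of how an encapsulated service might expose only a handful of parameters with defaults, the following sketch builds the argument list for a seq2sparse call. The defaults are the ones stated above; the function name `seq2sparse_command`, the validation rules and the paths are assumptions made for this example, not the patent's implementation.

```python
# Parameters the Seq2Sparse service exposes, with the defaults stated in the text.
# "analyzerName" is also exposed, but its default class name is not spelled out,
# so it is only passed through when the caller supplies it.
SEQ2SPARSE_DEFAULTS = {
    "weight": "tfidf",
    "minSupport": 2,
    "minDF": 1,
    "maxDFPercent": 99,
}

def seq2sparse_command(input_path, output_path, **overrides):
    """Return the argument list for a `mahout seq2sparse` call (sketch)."""
    unknown = set(overrides) - set(SEQ2SPARSE_DEFAULTS) - {"analyzerName"}
    if unknown:
        raise ValueError(f"parameters not exposed by the service: {sorted(unknown)}")
    if overrides.get("weight", "tfidf") not in ("tf", "tfidf"):
        raise ValueError("--weight must be 'tf' or 'tfidf'")
    params = {**SEQ2SPARSE_DEFAULTS, **overrides}
    args = ["mahout", "seq2sparse", "--input", input_path, "--output", output_path]
    for name, value in params.items():
        args += [f"--{name}", str(value)]
    return args

# Hypothetical paths; the input would normally be SeqDirectory's output path.
cmd = seq2sparse_command("/work/seq", "/work/vectors", weight="tf")
print(" ".join(cmd))
```

Rejecting parameters that the service does not expose keeps the dozens of remaining seq2sparse options out of the user's way, which is the point of the encapsulation.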

(4)ARFF.Vector

ARFF is the default data set storage format of Weka. Each ARFF file corresponds to a two-dimensional table. Each row of the table is an instance of the data set, and each column is an attribute of the data set. This operator generates, from an ARFF file, feature vectors that the model training operators can recognize. The Mahout command for the ARFF.Vector operator is "arff.vector". We expose three ARFF.Vector parameters to the user: "--input" specifies the input path of the ARFF file, "--output" specifies the output path of the feature vectors, and "--dicout" specifies the output path of the vector dictionary file. Of these, "--input" is required, while "--output" and "--dicout" are optional.

(5)CSV2Vector

This operator converts a CSV file directly into feature vectors. It is not a ready-made command provided in the Mahout library; we implemented and encapsulated this function on the basis of the Mahout source code. The Java class used by the CSV2Vector operator is "services.encapsulation.csv2vector"; when encapsulating for Oozie, this class name is written inside the <main-class> element so that it can be called. Similarly, the operator exposes two parameters to the user, the input file path "--input" and the output file path "--output"; the input path must be specified and the output path is optional.

(6)SplitDataSet

For the collaborative filtering recommendation algorithms in Mahout, the usual input is a CSV file with three columns of data. The first column is the user ID, the second column is the item ID, and the third column is the user's preference value for the item. The SplitDataSet operator splits the CSV input data into two parts, a training set and a test set. Its Mahout command is "splitdataset". The parameters the user can specify are: "--input" for the input file path, "--output" for the output file path, "--trainingPercentage" for the percentage of the data used for training, and "--probePercentage" for the percentage of the data used for testing. Of these, "--input" must be specified and the other three are optional; the default of "--trainingPercentage" is 0.9 and the default of "--probePercentage" is 0.1.
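The effect of the two percentage parameters can be sketched as a random split of (user, item, preference) rows. This is a minimal illustration and not Mahout's implementation; the function name `split_dataset`, the fixed seed and the synthetic rows are assumptions made for the example.

```python
import random

def split_dataset(rows, training_percentage=0.9, probe_percentage=0.1, seed=42):
    """Randomly assign (user, item, preference) rows to a training or probe set."""
    if training_percentage + probe_percentage > 1.0:
        raise ValueError("percentages must not sum to more than 1")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    training, probe = [], []
    for row in rows:
        r = rng.random()
        if r < training_percentage:
            training.append(row)
        elif r < training_percentage + probe_percentage:
            probe.append(row)
        # Rows falling outside both ranges are left out of the split.
    return training, probe

# Synthetic preference data: 20 users x 10 items, preference values from 1 to 5.
rows = [(u, i, float((u * i) % 5 + 1)) for u in range(1, 21) for i in range(1, 11)]
training, probe = split_dataset(rows)
print(len(training), len(probe))
```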

(7)Split

The Split operator splits feature vectors into a training set and a test set. Its Mahout command is "split". The user can specify five parameters: "--input" specifies the input path; "--trainingOutput" specifies the training set output path; "--testOutput" specifies the test set output path; "--randomSelectionSize" specifies the number of items randomly selected as test data, with a default of 100; "--randomSelectionPct" specifies the percentage of items randomly selected as test data, with a default of 0.2. Except for "--input", which must be specified, the other four parameters are optional.

(8)Describe

This operator marks the fields and the target variable in an ARFF data set so that the classification algorithms can recognize the data set and carry out model training. Its Mahout command is "describe". The user can adjust three parameters: "--path" specifies the input data path, "--file" specifies the path of the generated descriptor file, and "--descriptor" describes the data types of the fields and the field chosen as the target variable. The output path is optional, the input data is obtained from the preceding operator in the service assembly stage, and the "--descriptor" parameter must be specified by the user.

(9)HiveAction

Oozie supports Hive action nodes. By encapsulating Hive as a service, we greatly increase the data preprocessing potential of this method. In the Hive action node code in Oozie, ${preprocess.hql} specifies the file path of the HiveQL script, in which the Hive SQL statements are written. When a user uses a Hive action node, the HiveQL command script must be specified.

(1) Canopy clustering

Canopy clustering is an unsupervised pre-clustering algorithm. It is typically used as a preprocessing step for the K-Means or fuzzy K-Means algorithm, and it is intended to speed up clustering operations on large data sets. Because the size of the data set cannot be determined in advance, while K-Means requires the number of clusters as input from the very start, using K-Means directly is not a good choice. The Mahout command for the Canopy operator is "canopy", and 4 parameters are exposed to the user: the output path "--output", "--t1" and "--t2" for determining the clustering granularity, and the similarity distance measure "--distanceMeasure". The values of t1 and t2 are required. The distance measure is optional, and its default is the squared Euclidean distance measure.
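How the two thresholds control the granularity can be seen in a short sketch of the textbook Canopy procedure. It is a single-machine illustration rather than Mahout's distributed implementation, and it uses the plain Euclidean distance instead of the squared Euclidean default mentioned above; the function name and sample points are assumptions.

```python
import math

def canopy(points, t1, t2):
    """Canopy pre-clustering with loose threshold t1 and tight threshold t2 (t1 > t2)."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # the next unprocessed point seeds a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)          # within the loose threshold: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not within the tight threshold: stays a candidate
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

points = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
result = canopy(points, t1=3.0, t2=1.0)
for center, members in result:
    print(center, members)
```

A smaller t2 leaves more points available as future canopy centers, so it produces more canopies; the resulting canopy centers can then serve as initial centers for K-Means.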

(2) K-Means clustering

K-Means is a popular clustering method in data mining and is widely used as a clustering algorithm in many scientific fields. Its goal is to partition n data points into k clusters, with each data point belonging to the cluster with the nearest mean. The Mahout command for K-Means is "kmeans". When Canopy pre-clustering is not used, the K-Means algorithm requires the number of clusters to be produced, namely the K value, to be specified from the very start. K-Means therefore has two parameter schemes: when Canopy is not used for pre-clustering, the number of result clusters "--clusters" must be specified; when Canopy is used, it need not be. In addition, the operator has 4 optional parameters: "--distanceMeasure", whose default is the squared Euclidean distance measure; the maximum number of iterations "--maxIter", whose default is 20; the convergence threshold "--convergenceDelta", whose default is 0.5; and the output path parameter "--output".
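The roles of the maximum iteration count and the convergence threshold can be illustrated with a plain, single-machine K-Means loop. This is a sketch under simplifying assumptions: the initial centers are passed in (for example from a Canopy run), the plain Euclidean distance is used instead of the squared Euclidean default, and the function name and sample data are made up for the example.

```python
import math

def kmeans(points, centers, max_iter=20, convergence_delta=0.5):
    """Plain K-Means: stop after max_iter rounds or once no center moves
    farther than convergence_delta in one round."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        moved = max(math.dist(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= convergence_delta:
            break
    return centers, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, assignment = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers, assignment)
```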

(3) Fuzzy K-Means clustering

Fuzzy K-Means (also called fuzzy C-Means) is an extension of K-Means. K-Means finds hard clusters (a point belongs to only one cluster), whereas fuzzy K-Means finds soft clusters in a more statistically formal way (a given point can belong to several clusters, each with a certain probability). The Mahout command for fuzzy K-Means is "fkmeans". Its parameter encapsulation scheme is the same as that of kmeans and is not repeated here.

(4) Spectral clustering

In clustering, spectral clustering uses the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The Mahout command for spectral clustering is "spectralkmeans". The number of clusters "--clusters" must be specified. The maximum number of iterations "--maxIter", whose default is 20, the output path "--output", the distance measure "--distanceMeasure", whose default is the squared Euclidean distance measure, and the convergence threshold "--convergenceDelta", whose default is 0.5, are optional parameters.

(5) LDA clustering

Latent Dirichlet allocation (LDA) is a powerful machine learning algorithm that can cluster a set of words into topics and a set of documents into mixtures of topics. In natural language processing, LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups, which explain why some parts of the data are similar. The Mahout command for the LDA service is "lda". The number of topics of the model, "--num_topics", must be specified; the maximum number of iterations "--maxIter", whose default is 20, and the output path "--output" are optional parameters.

(1) Naive Bayes

Classification is needed on many occasions in everyday life, and naive Bayes is a simple, effective and common classification algorithm. Its basic idea is very simple: for a given item to be classified, compute the probability of each class given that the item occurs, and assign the item to the class with the highest probability. The Mahout command for naive Bayes is "trainnb". The model output path "--output" and the label index output path "--labelIndex", whose default is "/mahout/trainnb/labelIndex", are optional parameters.

(2) Random forest

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They build multiple decision trees at training time and output the class that is the mode of the classes predicted by the individual trees (classification) or their mean prediction (regression). Random decision forests correct the tendency of decision trees to overfit their training set. The Mahout command for the random forest algorithm is "buildforest", and its optional parameters are "--output" and "--selection". The second parameter is the number of variables selected at random at each tree node; its default is the square root of the number of explanatory variables.

(1) Matrix factorization collaborative filtering

This algorithm is a collaborative filtering algorithm based on ALS-WR matrix factorization. Current collaborative filtering algorithms fall mainly into two kinds: factorization-based and neighborhood-based. Because factorization can take the influence of user ratings into account globally, it can perform better than neighborhood-based collaborative filtering both in theory and in practice. The Mahout command for this operator is "parallelALS". The feature dimension "--numFeatures" and the regularization parameter "--lambda" must be specified; the number of iterations "--numIterations", whose default is 10, and the output path "--output" are optional parameters.

(2) Item-based collaborative filtering recommendation

Item-based collaborative filtering recommendation uses the user's preference data for other items to find similar items that the user has not yet rated and recommends them to the user. It is a collaborative filtering recommendation algorithm widely used in industry today. The Mahout command for this operator is "itemsimilarity". Besides the output path "--output", the optional parameters are the similarity measure "--similarityClassname", the maximum number of similarities per item "--maxSimilaritiesPerItem", the maximum number of preferences "--maxPrefs", the minimum number of preferences per user "--minPrefsPerUser", and "--booleanData", which indicates whether to treat the input as data without preference values. Their defaults are, in order: Similarity_euclidean_distance, 100, 500, 1 and false.

In an embodiment of the present invention, the evaluation of the trained models in step S3 comprises the following steps:

Step S31: Clustering models are evaluated using inter-cluster distance and cluster output inspection.

(1) Inter-cluster distance evaluation

Inter-cluster distance reflects clustering quality well. In a good clustering result, the center points of different clusters are unlikely to be too close together; being very close means that the clustering process produced several groups with similar features, so that the differences between clusters are not significant enough. We therefore do not want the distances between clusters to be too small. The inter-cluster distance evaluation operator is not among the commands Mahout provides, so we coded and encapsulated this operator service ourselves. The Java class corresponding to the operator is InterClusterDistances. One of its parameters is the path of the clustering result files to be evaluated; this path is obtained during service assembly from the output parameter of the clustering model training service and need not be specified by the user. The other, optional parameter is the output path of the evaluation result.
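The patent does not show the InterClusterDistances class itself. As a rough idea of what such an evaluation computes, the sketch below summarizes the pairwise distances between cluster centers; the choice of minimum, maximum and mean as the summary, the Euclidean measure and the sample centroids are assumptions made for this example.

```python
import math
from itertools import combinations

def inter_cluster_distances(centroids):
    """Return (min, max, mean) of the pairwise distances between cluster centroids."""
    distances = [math.dist(a, b) for a, b in combinations(centroids, 2)]
    if not distances:
        raise ValueError("need at least two centroids")
    return min(distances), max(distances), sum(distances) / len(distances)

centroids = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # hypothetical cluster centers
d_min, d_max, d_mean = inter_cluster_distances(centroids)
print(d_min, d_max, d_mean)
```

A very small minimum distance relative to the mean would point to two clusters that are nearly duplicates of each other.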

(2) Cluster output inspection

ClusterDumper is the main tool in Mahout for inspecting cluster output. With ClusterDumper it is very convenient to read the output of Mahout's clustering algorithms, and from this output the quality of the clustering can easily be assessed. Every cluster has a center vector, which is the mean of all the points in the cluster. Among the algorithms here, this holds for Canopy, K-Means, fuzzy K-Means and spectral clustering, but not for LDA clustering. In particular, when the objects being clustered are text documents, the features are words, so the words with the highest weights in the center point reflect the meaning expressed by the documents in that cluster.

The command for this operator is "clusterdumper". Its input is the set of clusters, which can be obtained in the service assembly stage from the output parameter of clustering model training and need not be specified. In addition, the dictionary generated when the data was converted into vectors is an optional input. The optional parameter "--numWords" is the number of words to print, and its default is 10.

Step S32: Classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix.

When evaluating a classification model, the two most commonly used evaluation criteria are accuracy and the confusion matrix. Accuracy is easy to understand. As for the confusion matrix, it is probably the most direct measure for a classifier that outputs non-score results. A confusion matrix is a cross-tabulation of the model's outputs against the known correct target values. Each row of the matrix corresponds to a true target value and each column corresponds to a model output value. The element in row i and column j is the number of test samples of class i that the model assigned to class j. For a good model, the large elements of the confusion matrix are concentrated on the diagonal. These diagonal elements are the numbers of samples in class i that were correctly assigned to class i.
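The definition above translates directly into a few lines of code. The following sketch builds a confusion matrix with rows as true classes and columns as model outputs, and derives accuracy from its diagonal; the class labels and predictions are made-up sample data.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are true classes, columns are model outputs, as described above."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of test samples on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

labels = ["grain", "crude", "trade"]  # hypothetical class labels
actual = ["grain", "grain", "crude", "crude", "trade", "trade"]
predicted = ["grain", "crude", "crude", "crude", "trade", "grain"]

m = confusion_matrix(actual, predicted, labels)
for row in m:
    print(row)
print(accuracy(m))
```

In this sample the off-diagonal entries show one "grain" document classified as "crude" and one "trade" document classified as "grain", detail that the single accuracy figure hides.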

Through a test pass, the naive Bayes evaluation operator can compute the accuracy and the confusion matrix of a naive Bayes model. Its command is "testnb", the counterpart of the "trainnb" training command. The evaluation operator takes as input the feature vectors used for training together with the trained model and label index; these paths are obtained directly from the parameters of preceding operators during service assembly and need not be specified by the user.

(1) Random forest evaluation

The random forest evaluation operator tests the accuracy of a random forest model. Its command is "testforest", and its input path parameters are obtained during service assembly.

Step S33: Collaborative filtering recommendation models are evaluated using the accuracy of the model.

(1) Matrix factorization collaborative filtering evaluation

This operator derives the accuracy of the recommendation model by computing RMSE and MAE. Its command is "evaluateFactorization". Its inputs are the user matrix model, the item matrix model and the test data set, which can be obtained in the service assembly stage from the parameters of preceding operators.
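RMSE (root mean squared error) and MAE (mean absolute error) compare predicted preference values with the held-out true values. A minimal sketch with made-up preference values follows; the function name `rmse_mae` is an assumption for this example.

```python
import math

def rmse_mae(actual, predicted):
    """Root mean squared error and mean absolute error of predicted preferences."""
    errors = [p - a for a, p in zip(actual, predicted)]
    if not errors:
        raise ValueError("no test preferences to evaluate")
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae

actual = [4.0, 3.0, 5.0, 1.0]     # held-out preference values (sample data)
predicted = [3.0, 3.0, 4.0, 3.0]  # model predictions for the same user-item pairs
rmse, mae = rmse_mae(actual, predicted)
print(rmse, mae)
```

RMSE penalizes large individual errors more heavily than MAE, which is why the single error of 2 in the sample pushes RMSE above MAE.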

(2) Item-based collaborative filtering recommendation evaluation

This operator uses the item similarity model obtained from training and tests it against the test data set to obtain the accuracy of the model. Its input paths are likewise obtained in the service assembly stage.

The present invention divides the machine learning process into three stages: data preprocessing, model training and model evaluation. Each stage has a series of operators in the Mahout algorithm library that belong to it. An overview of the Mahout machine learning service assembly method from the perspective of reuse is shown in Fig. 1.

After the above operators are encapsulated into services with a unified calling specification, they can be used for service assembly to generate a complete picture of the machine learning workflow. Fig. 2 shows the full picture of the machine learning workflows that the operators we encapsulated can be assembled into. Circles in the figure represent operators that can individually perform some task; there are 25 operators in total. Lines represent the order and flow of operator execution. Squares represent intermediate results output by operators; these modules are included only to complete the workflow diagram and do not represent tasks that can be run. We divide the machine learning process into three stages: data preprocessing, model training and model evaluation. The circles on the left represent data preprocessing operator services, such as SeqDirectory, which converts text document files into SequenceFile files. The medium-blue circles in the middle represent model training operator services, such as the K-Means clustering operator service. The circles on the right represent model evaluation operator services, such as InterClusterDistances, which evaluates the distances between the clusters of a clustering result.

Considering general data mining needs and the scope supported by the Hadoop ecosystem, we finally chose to support five input data formats: Hive table data, CSV files, ARFF files, Lucene index files and TXT text document files. These files are stored in HDFS, a distributed file storage system that can conveniently store massive amounts of data. Among the preprocessing operators, HiveAction processes Hive table data input and outputs CSV files to HDFS; Lucene2Seq processes Lucene index files and converts them into SequenceFile files; ARFF.Vector processes ARFF files and generates the corresponding feature vectors from them; CSV2Vector and SplitDataSet process CSV input, where CSV2Vector generates feature vectors from the CSV file and SplitDataSet splits the CSV data into a training set and a test set in a given proportion. Together with SeqDirectory, which handles TXT text documents, these 6 operators are the entries of the whole machine learning workflow, so their input data file paths must be specified by the user for the system to know which data to process. The remaining 19 operators do not need an input file path to be specified, because their input is the output path of some earlier operator in the workflow; it only has to be obtained from the preceding operator during service assembly and wired in as needed.
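The relation between input formats and entry operators described above can be written down as a small lookup table. The sketch below is only an illustration of that mapping; the format keys and the function name `entry_operators` are assumptions, while the operator names come from the text.

```python
# Entry operators per supported input format, following the description above.
ENTRY_OPERATORS = {
    "hive": ["HiveAction"],
    "csv": ["CSV2Vector", "SplitDataSet"],
    "arff": ["ARFF.Vector"],
    "lucene": ["Lucene2Seq"],
    "txt": ["SeqDirectory"],
}

def entry_operators(data_format):
    """Return the operators that can start a workflow for the given input format."""
    try:
        return ENTRY_OPERATORS[data_format.lower()]
    except KeyError:
        raise ValueError(f"unsupported input format: {data_format!r}") from None

print(entry_operators("TXT"))
print(entry_operators("csv"))
```

Only these entry operators need a user-supplied input path; every later operator takes its input from the output path of its predecessor.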

The specific embodiment of the present invention uses the Reuters-21578 news data set, which is widely used in machine learning research. The data were originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system. The Reuters-21578 data set is split into 22 files; each file contains 1000 documents except the last, which contains only 578. These files are in SGML format, which is similar to XML. Before this case study began, they had been processed into TXT documents in advance and stored in HDFS.

Because the input files are in TXT format, we naturally selected the SeqDirectory preprocessing operator as the entry of the whole machine learning workflow. In addition, the two clustering algorithms K-Means and LDA are selected for text topic discovery, and the ClusterDumper and LDAPrintTopics operators are used to output the final topic discovery results.

According to the above requirements, and following the dependencies between operators described in the full machine learning workflow diagram, the encapsulated operators can quickly be used to generate two machine learning workflow paths. Starting from the original full workflow diagram and removing the irrelevant operators gives the workflow shown in Fig. 3. The figure shows that we obtain two workflow paths: SeqDirectory → Seq2Sparse → K-Means → ClusterDumper and SeqDirectory → Seq2Sparse → LDA → LDAPrintTopics.
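Pruning the full workflow diagram down to the paths that match the user's requirements can be sketched as a depth-first walk over an operator dependency graph. The graph below is a hypothetical fragment (the full diagram of Fig. 2 is not reproduced in the text), and the function name `assemble_paths` is an assumption for this example.

```python
# Hypothetical fragment of the full workflow graph: operator -> successor operators.
GRAPH = {
    "SeqDirectory": ["Seq2Sparse"],
    "Seq2Sparse": ["K-Means", "Fuzzy K-Means", "LDA"],
    "K-Means": ["ClusterDumper"],
    "Fuzzy K-Means": ["ClusterDumper"],
    "LDA": ["LDAPrintTopics"],
    "ClusterDumper": [],
    "LDAPrintTopics": [],
}

def assemble_paths(graph, entry, wanted_algorithms):
    """Enumerate entry-to-evaluation paths that use one of the wanted algorithms."""
    paths = []

    def walk(node, path):
        path = path + [node]
        if not graph[node]:  # no successors: an evaluation operator ends the path
            if any(op in wanted_algorithms for op in path):
                paths.append(path)
            return
        for successor in graph[node]:
            walk(successor, path)

    walk(entry, [])
    return paths

for path in assemble_paths(GRAPH, "SeqDirectory", {"K-Means", "LDA"}):
    print(" → ".join(path))
```

The path through Fuzzy K-Means exists in the graph but is dropped because the user did not ask for that algorithm, which mirrors the removal of irrelevant operators described above.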

From the workflow diagram we see that six operator services are needed in total, which gives the lists of required and optional parameters for these operators. As shown in Table 1, only three parameters must be specified: the input path of the raw data, the K value of the K-Means algorithm and the number of topics in the LDA algorithm. Other parameter details can be obtained automatically in the service assembly stage.

Table 1. Parameters that must be specified for the workflow operators

Other important optional parameters can of course also be specified. In this example it is not known in advance whether the default Euclidean distance measure or the cosine distance measure gives the better clustering result for K-Means, so we decided to use two K-Means parameter schemes at the same time. In particular, for the cosine distance measure, the K-Means convergence threshold parameter "--convergenceDelta" is set to 0.1 rather than the default 0.5, because the cosine distance ranges from 0 to 1. According to the specified parameter schemes, three clustering workflows are finally generated: one implements the LDA clustering process, one implements K-Means clustering with the Euclidean distance measure, and one implements K-Means clustering with the cosine distance measure.
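The reason for the smaller threshold can be checked with a short sketch: for vectors with non-negative components, such as tf or tf-idf vectors, the cosine distance stays between 0 and 1, so a threshold of 0.5 would already cover half of the whole range. The scheme list below restates the three parameter schemes from the text in a hypothetical dictionary layout; it is not the patent's configuration format.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; lies in [0, 1] for vectors with non-negative components."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm

print(cosine_distance((1.0, 2.0), (2.0, 4.0)))  # same direction: close to 0
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))  # orthogonal: 1.0

# The three parameter schemes described above (layout is illustrative only).
SCHEMES = [
    {"algorithm": "lda"},
    {"algorithm": "kmeans", "distanceMeasure": "euclidean", "convergenceDelta": 0.5},
    {"algorithm": "kmeans", "distanceMeasure": "cosine", "convergenceDelta": 0.1},
]
```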

After the three workflow paths above are executed on Oozie, their final evaluation results are obtained, and the output is listed in the tables below. Five topics are listed for each result, and for each topic the five words with the largest weights are listed.
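Selecting the five highest-weighted words of a topic can be sketched as follows. The function name `top_terms` and the toy weights are hypothetical and purely illustrative; they are not the results of the experiment.

```python
import heapq
from typing import Dict, List


def top_terms(term_weights: Dict[str, float], n: int = 5) -> List[str]:
    """Return the n terms with the largest weights, heaviest first."""
    best = heapq.nlargest(n, term_weights.items(), key=lambda item: item[1])
    return [term for term, _ in best]


# Toy weights for a single topic (made up for illustration only).
toy_topic = {
    "alpha": 0.30, "beta": 0.05, "gamma": 0.20, "delta": 0.10,
    "epsilon": 0.15, "zeta": 0.12, "eta": 0.08,
}

print(top_terms(toy_topic))
```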

As shown in Table 2, the topic recognition effect of LDA clustering is generally very good. The five top-ranked keywords give a rough picture of the topic discussed by the news documents in each cluster. For example, the keywords of the 1st cluster, wheat, agriculture, export and tonnes, show that it discusses agricultural topics related to wheat yield. The keywords of the 2nd cluster, ibm, computers, att (American Telephone and Telegraph Company) and personal, show that it discusses topics involving computers and related companies. The other three clusters discuss banking and finance, oil and energy, and securities respectively, and are also fairly distinct.

Table 2. LDA clustering results

As shown in Table 3, K-Means text clustering with the cosine distance measure is weaker than LDA clustering, but is still generally good. The keywords of the 1st, 2nd and 3rd clusters show that they discuss topics related to wheat yield, stocks, and rising crude oil prices respectively. The 4th cluster can roughly be seen to discuss something that increased relative to the previous year, but exactly what increased is unclear, which indicates that this cluster is not good. Note also that the keywords of the 5th cluster are all numerals; although the cluster appears to be highly correlated internally, it is meaningless.

Table 3. K-Means clustering results with the cosine distance measure

Finally, Table 4 shows the results of K-Means text clustering with the Euclidean distance measure, whose clustering effect is much poorer than that of the other two. The keywords of the 1st and 2nd clusters are barely related to one another, and for the 3rd, 4th and 5th clusters it is hard to see any relationship among the keywords at all; they even give an impression of incoherence. In particular, words such as he, said, vs and would contribute nothing to topic identification.

Table 4. K-Means clustering results with the Euclidean distance measure

It can therefore be concluded that, among the three machine learning workflows obtained by service assembly, the LDA clustering workflow is the most effective at discovering text topics, the K-Means clustering workflow with the cosine distance measure comes second, and the K-Means clustering workflow with the Euclidean distance measure is the worst. If the user is satisfied with the LDA clustering results, this machine learning workflow can be saved so that it can easily be reused in the future. The user can of course also continue to tune the parameters of the operators in these workflows in order to reach a better clustering effect.

In addition, the evaluation results show that tokens such as 7-apr-1987, numerals, he and said appear in the results of all three clusterings, which suggests that text tokenization and feature vector generation in the preprocessing stage were not handled well enough. Therefore, to further improve the discovery of text topics, adjusting and debugging the parameters of the preprocessing operator Seq2Sparse is also a good option.
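The kind of token filtering that better preprocessing would need to achieve can be sketched as follows. This is an assumption-laden illustration in plain Python and does not use or describe the parameters of Seq2Sparse itself: the stop list, both regular expressions, and the names `keep_token` and `clean` are hypothetical.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop list built from the words noted in the results.
STOPWORDS = {"he", "said", "vs", "would"}

# Dates such as 7-apr-1987, and plain numerals such as 1987 or 3.5.
DATE_RE = re.compile(r"^\d{1,2}-[a-z]{3}-\d{4}$")
NUMBER_RE = re.compile(r"^\d+([.,]\d+)*$")


def keep_token(token: str) -> bool:
    """Return False for stop words, date tokens and numerals."""
    t = token.lower()
    if t in STOPWORDS:
        return False
    if DATE_RE.match(t) or NUMBER_RE.match(t):
        return False
    return True


def clean(tokens: List[str]) -> List[str]:
    """Keep only the tokens that can carry topic information."""
    return [t for t in tokens if keep_token(t)]


print(clean(["wheat", "7-apr-1987", "he", "said", "1987", "export", "3.5"]))
```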

The experiment shows that this method makes it possible to quickly and effectively customize and tune reusable machine learning flows, so that data mining work can be carried out efficiently on the Hadoop platform.

The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims

1. A machine learning service assembly method based on Mahout, characterized in that it comprises the following steps:

Step S1: preprocessing data of different formats and converting them into feature vectors for use in model training;

Step S2: training clustering models, classification models and collaborative filtering recommendation models;

Step S3: evaluating the trained models;

Step S4: steps S1, S2 and S3 correspond to a series of operators in the Mahout algorithm library; these operators are uniformly encapsulated into a series of services that conform to the invocation specification of the Oozie workflow platform;

Step S5: according to the machine learning method to be used and the format of the data to be processed, as described by the user, assembling one or more machine learning workflow paths that meet the requirements;

Step S6: after these machine learning workflows have finished running on the Hadoop platform through Oozie, the model evaluation operator of each workflow provides the evaluation result of that workflow; the user selects a machine learning workflow according to this evaluation result.

2. The machine learning service assembly method based on Mahout according to claim 1, characterized in that it further comprises step S7: storing the machine learning workflow selected by the user in a knowledge base, for later reuse by that user or by users with similar requirements.

3. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the data preprocessing in step S1 includes: SeqDirectory, Lucene2Seq, Seq2Sparse, Arff.Vector, Split, SplitDataSet, Describe and Hive.

4. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the clustering models in step S2 include clustering models using five clustering algorithms: Canopy, K-Means, fuzzy K-Means, LDA and spectral clustering.

5. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the classification models in step S2 include classification models using the naive Bayes algorithm and the random forest algorithm.

6. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the collaborative filtering recommendation models in step S2 include collaborative filtering recommendation models using the matrix factorization collaborative filtering recommendation algorithm and the item-based collaborative filtering recommendation algorithm.

7. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the evaluation of the trained models in step S3 comprises the following steps:

Step S32: classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix;