CN107169572B - Machine learning service assembly method based on Mahout - Google Patents

Machine learning service assembly method based on Mahout

Info

Publication number
CN107169572B
Authority
CN
China
Prior art keywords
model
machine learning
mahout
data
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611203680.2A
Other languages
Chinese (zh)
Other versions
CN107169572A (en)
Inventor
郭文忠
黄益成
陈星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201611203680.2A priority Critical patent/CN107169572B/en
Publication of CN107169572A publication Critical patent/CN107169572A/en
Application granted granted Critical
Publication of CN107169572B publication Critical patent/CN107169572B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a machine learning service assembly method based on Mahout, characterized by comprising the following steps. Step S1: pre-process data of different formats. Step S2: train the models. Step S3: evaluate the models. Step S4: encapsulate the operators in a unified manner. Step S5: assemble machine learning workflow paths according to the machine learning method described by the user and the format of the data to be processed. Step S6: after these machine learning workflows have been run by Oozie on the Hadoop platform, the model evaluation operator of each workflow reports its evaluation result, and the user selects a machine learning workflow according to these results. Compared with the prior art, the present invention can quickly and effectively customize and tune reusable machine learning flows, so that data mining work can be carried out efficiently on the Hadoop platform.

Description

A machine learning service assembly method based on Mahout
Technical field
The present invention relates to a machine learning service assembly method based on Mahout.
Background technology
Human society now produces and stores an ever larger volume of data every day, accompanied by increasingly diverse data analysis requirements from users. How to build machine learning flows that can handle large-scale data simply, quickly and effectively has become a pressing need. Mahout is an open-source distributed machine learning algorithm library built on Hadoop; its appearance addresses the shortcomings of traditional machine learning libraries, such as the lack of an active technical community, poor scalability, the inability to process distributed massive data, and closed source code. However, because Mahout provides numerous machine learning algorithms, each with anywhere from a few to dozens of tunable parameters, performing data mining with Mahout on the Hadoop platform still carries a very high learning cost.
Invention content
To solve the above problems, the present invention provides a Mahout machine learning service assembly method from the perspective of reuse.
The present invention adopts the following technical scheme: a machine learning service assembly method based on Mahout, characterized by comprising the following steps. Step S1: pre-process data of different formats and convert them into the feature vectors used for model training. Step S2: train clustering models, classification models and collaborative filtering recommendation models. Step S3: evaluate the trained models. Step S4: Steps S1, S2 and S3 correspond to a series of operators in the Mahout algorithm library; encapsulate these operators in a unified manner so that they become a series of services that conform to the invocation specification of the Oozie workflow platform. Step S5: according to the machine learning method described by the user and the format of the data to be processed, assemble one or more machine learning workflow paths that meet the demand. Step S6: after these machine learning workflows have been run by Oozie on the Hadoop platform, the model evaluation operator of each workflow reports its evaluation result; the user selects a machine learning workflow according to these results.
In an embodiment of the present invention, the method further comprises Step S7: store the machine learning workflow selected by the user into a knowledge base, so that it can later be reused by this user or by users with similar demands.
In an embodiment of the present invention, the data preprocessing in Step S1 includes: SeqDirectory, Lucene2Seq, Seq2Sparse, ARFF.Vector, Split, SplitDataSet, Describe and Hive.
In an embodiment of the present invention, the clustering models in Step S2 include clustering models built with five clustering algorithms: Canopy, K-Means, fuzzy K-Means, LDA and spectral clustering.
Further, the classification models in Step S2 include classification models built with the naive Bayes algorithm and the random forest algorithm.
Further, the collaborative filtering recommendation models in Step S2 use the matrix factorization collaborative filtering algorithm and the item-based collaborative filtering recommendation algorithm.
In an embodiment of the present invention, the evaluation of the trained models in Step S3 includes the following steps. Step S31: clustering models are evaluated using inter-cluster distance and cluster output inspection. Step S32: classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix. Step S33: collaborative filtering recommendation models are evaluated using the accuracy of the model.
Compared with the prior art, the present invention can quickly and effectively customize and tune reusable machine learning flows, so that data mining work can be carried out efficiently on the Hadoop platform.
Description of the drawings
Fig. 1 is the overall flow diagram of the present invention.
Fig. 2 is the machine learning workflow diagram of the present invention.
Fig. 3 is the workflow diagram generated according to the demand in one embodiment of the invention.
Detailed description of the embodiments
The present invention is further explained below with reference to the drawings and specific embodiments.
A machine learning service assembly method based on Mahout comprises the following steps. Step S1: pre-process data of different formats and convert them into the feature vectors used for model training. Step S2: train clustering models, classification models and collaborative filtering recommendation models. Step S3: evaluate the trained models. Step S4: Steps S1, S2 and S3 correspond to a series of operators in the Mahout algorithm library; encapsulate these operators in a unified manner so that they become a series of services that conform to the invocation specification of the Oozie workflow platform. Step S5: according to the machine learning method described by the user and the format of the data to be processed, assemble one or more machine learning workflow paths that meet the demand. Step S6: after these machine learning workflows have been run by Oozie on the Hadoop platform, the model evaluation operator of each workflow reports its evaluation result; the user selects a machine learning workflow according to these results. Oozie is a Java web application used for scheduling Hadoop jobs. Oozie combines multiple sequences of operations into one logical unit of work. It is integrated with Hadoop and supports Hadoop tasks such as MapReduce, Pig, Hive and Sqoop.
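By way of illustration, the sketch below shows what a minimal assembled workflow of this kind could look like as an Oozie workflow definition, with two Mahout operators run as Java actions through Mahout's command-line driver. Invoking the encapsulated services through org.apache.mahout.driver.MahoutDriver is an assumption of this sketch, and the workflow name and path properties such as ${rawText} and ${seqFiles} are illustrative, not part of the invention's fixed interface.

```xml
<!-- Minimal sketch of an assembled workflow; property names such as
     ${rawText} and ${seqFiles} are illustrative assumptions. -->
<workflow-app name="mahout-demo" xmlns="uri:oozie:workflow:0.4">
  <start to="seqdirectory"/>
  <!-- Data preprocessing operator: text documents -> SequenceFile -->
  <action name="seqdirectory">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <main-class>org.apache.mahout.driver.MahoutDriver</main-class>
      <arg>seqdirectory</arg>
      <arg>--input</arg><arg>${rawText}</arg>
      <arg>--output</arg><arg>${seqFiles}</arg>
    </java>
    <ok to="seq2sparse"/>
    <error to="fail"/>
  </action>
  <!-- Feature extraction operator: SequenceFile -> tf-idf vectors -->
  <action name="seq2sparse">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <main-class>org.apache.mahout.driver.MahoutDriver</main-class>
      <arg>seq2sparse</arg>
      <arg>--input</arg><arg>${seqFiles}</arg>
      <arg>--output</arg><arg>${vectors}</arg>
      <arg>--weight</arg><arg>tfidf</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

In this view, "assembling a workflow path" amounts to chaining such actions so that each action's output path becomes the next action's input path.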
In an embodiment of the present invention, the method further comprises Step S7: store the machine learning workflow selected by the user into a knowledge base, so that it can later be reused by this user or by users with similar demands.
In an embodiment of the present invention, the data preprocessing in Step S1 includes: SeqDirectory, Lucene2Seq, Seq2Sparse, ARFF.Vector, Split, SplitDataSet, Describe and Hive.
(1)SeqDirectory
This operator converts a directory of text documents into files in SequenceFile format. A SequenceFile is a serialized file format that Hadoop designed for storing key-value pairs in binary form. Since Mahout is built on Hadoop, the machine learning algorithms it provides naturally require SequenceFile-format input as well. The command used by the SeqDirectory operator in Mahout is "seqdirectory". We encapsulate the SeqDirectory operator as a service and expose two parameters to the user: besides the output path, the "--input" parameter specifies the input path of the files to be processed; since this operator is an entry operator, the user must specify this parameter.
(2)Lucene2Seq
Similar to SeqDirectory, this operator converts Lucene index files into SequenceFile-format files; the difference is that its input is an index file generated by the Lucene analyzer. The command used by the Lucene2Seq operator in Mahout is "lucene2seq". After encapsulation, we expose two parameters of Lucene2Seq to the user: the "--input" parameter specifies the input path of the Lucene index files to be processed and, as with any entry operator, must be specified; the other is the default output path parameter.
(3)Seq2Sparse
The Seq2Sparse operator is an important tool for generating vectors from text documents. It reads text data that the SeqDirectory operator has converted into SequenceFile format, first generates a dictionary file from these data, and then generates the feature vectors of the text based on the dictionary. The feature weighting scheme of the vectors can be simple tf term-frequency vectors or the tf-idf vectors popular in industry; the weighting scheme is specified with "--weight". The command used by the Seq2Sparse operator in Mahout is "seq2sparse". The seq2sparse operator has dozens of assignable parameters; in our encapsulation, besides the output path, we expose five parameters to the user: "--analyzerName" specifies the class name of the text analyzer to use, with the Lucene standard analyzer as the default; "--weight" specifies the weighting mechanism, where tf is term-frequency weighting and tfidf is TF-IDF weighting, with tfidf as the default; "--minSupport" is the minimum frequency in the whole collection for a word to be put into the dictionary file, words below this frequency are ignored, and the default value is 2; "--minDF" specifies the minimum number of documents a word must appear in to be put into the dictionary file, words below this threshold are ignored, and the default value is 1; "--maxDFPercent" specifies the maximum percentage of documents a word may appear in and still be put into the dictionary file, with a default value of 99.
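For orientation, the tf-idf weighting that the "--weight tfidf" option refers to has the standard form sketched below; the exact smoothing and normalization applied inside Mahout's seq2sparse job may differ in detail, so this should be read only as the underlying idea.

```latex
% Standard tf-idf weight of term t in document d over a corpus of N documents,
% where tf_{t,d} is the term frequency in d and df_t the number of documents containing t
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```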
(4)ARFF.Vector
ARFF files are the default dataset storage format of Weka. Each ARFF file corresponds to a two-dimensional table: each row of the table is an instance of the dataset, and each column is an attribute of the dataset. This operator generates, from an ARFF file, feature vectors that the model training operators can recognize. The command used by the ARFF.Vector operator in Mahout is "arff.vector". We expose three parameters of ARFF.Vector to the user: the "--input" parameter specifies the input path of the ARFF file, "--output" specifies the output path of the feature vectors, and "--dicout" specifies the output path of the vector dictionary file. Of these, "--input" is mandatory, while "--output" and "--dicout" are optional.
(5)CSV2Vector
This operator converts a CSV file directly into feature vectors. It is not a directly usable command provided by the Mahout library; we implemented this function on top of the Mahout source code. The Java class used by the CSV2Vector operator is "services.encapsulation.csv2vector"; when encapsulating for Oozie, this class name is written between the <main-class> tags so that it can be invoked. Similarly, this operator exposes two parameters to the user, the input file path "--input" and the output file path "--output", where the input path must be specified and the output path is optional.
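A minimal sketch of wiring such a custom operator into an Oozie Java action is shown below; the class name is the one given above, while the action name, the transition targets and the ${csvIn}/${vecOut} properties are illustrative assumptions, and the snippet is a fragment of a larger workflow.

```xml
<!-- Sketch of the CSV2Vector operator as an Oozie Java action;
     ${csvIn} and ${vecOut} are illustrative property names. -->
<action name="csv2vector">
  <java>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <main-class>services.encapsulation.csv2vector</main-class>
    <arg>--input</arg><arg>${csvIn}</arg>
    <arg>--output</arg><arg>${vecOut}</arg>
  </java>
  <ok to="next-operator"/>
  <error to="fail"/>
</action>
```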
(6)SplitDataSet
For the collaborative filtering recommendation algorithms in Mahout, the usual input is a CSV-format file with three columns: the first column is the user ID, the second is the item ID, and the third is the user's preference value for the item. The SplitDataSet operator cuts such CSV-format input data into two parts, a training set and a test set. The command it uses in Mahout is "splitdataset". The parameters the user may specify are: "--input", the file input path; "--output", the file output path; "--trainingPercentage", the percentage of the data used for training; and "--probePercentage", the percentage of the data used for testing. "--input" must be specified and the other three are optional; the default value of "--trainingPercentage" is 0.9 and the default value of "--probePercentage" is 0.1.
(7)Split
The Split operator cuts feature vectors into a training set and a test set. The command it uses in Mahout is "split". The user may specify five parameters: "--input" specifies the input path; "--trainingOutput" specifies the training-set output path; "--testOutput" specifies the test-set output path; "--randomSelectionSize" specifies the number of items randomly selected as test data, with a default value of 100; "--randomSelectionPct" specifies the percentage randomly selected as test data, with a default value of 0.2. Except for "--input", which must be specified, the other four parameters are optional.
(8)Describe
This operator labels the fields and the target variable in an ARFF dataset so that classification algorithms can recognize the dataset and carry out further model training. The command it uses in Mahout is "describe". The user may adjust three parameters: "--path" specifies the input data path, "--file" specifies the path of the generated descriptor file, and "--descriptor" describes the data type of each field and selects the field used as the target variable. Apart from the optional output path, the input data is obtained from the preceding operator during the service assembly stage, whereas the "--descriptor" parameter must be specified by the user.
(9)HiveAction
Oozie supports Hive task nodes; by encapsulating Hive as a service we greatly increase the data preprocessing potential of this method. In the Hive action node of an Oozie workflow, ${preprocess.hql} specifies the file path of the HiveQL script in which the Hive SQL statements are written. When the user uses a Hive action node, the HiveQL command script must be specified.
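A minimal sketch of such a Hive action node, assuming the standard Oozie hive-action schema, could look as follows; the action name and the transition targets are illustrative, and the snippet is a fragment of a larger workflow.

```xml
<!-- Sketch of a Hive preprocessing action; ${preprocess.hql} points to the
     HiveQL script, as described above. -->
<action name="hive-preprocess">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <script>${preprocess.hql}</script>
  </hive>
  <ok to="next-operator"/>
  <error to="fail"/>
</action>
```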
In an embodiment of the present invention, the clustering models in Step S2 include clustering models built with five clustering algorithms: Canopy, K-Means, fuzzy K-Means, LDA and spectral clustering.
(1) Canopy clustering
The Canopy clustering algorithm is an unsupervised pre-clustering algorithm. It is typically used as a preprocessing step for the K-Means or fuzzy K-Means algorithms and is intended to speed up clustering on large datasets. Since the size of a dataset cannot be determined in advance, while K-Means requires the number of clusters as input from the very start, using the K-Means algorithm directly is not always a good choice. The command used by the Canopy operator in Mahout is "canopy", and four parameters are exposed to the user: the output path "--output", the thresholds "--t1" and "--t2" that determine the cluster granularity, and the similarity distance measure "--distanceMeasure". The values of t1 and t2 are mandatory; the distance measure is optional and defaults to the squared Euclidean distance measure.
(2) K-Means clustering
K-Means clustering is a popular clustering method in data mining and is widely used as a clustering algorithm across many scientific fields. The purpose of K-Means clustering is to divide n data points into k clusters, with each data point belonging to the cluster with the nearest mean. The command K-Means uses in Mahout is "kmeans". When Canopy pre-clustering is not used, the K-Means algorithm requires the number of final clusters, the K value, to be specified at the very start; there are therefore two parameter schemes for K-Means. When Canopy pre-clustering is not used, K-Means needs the number of clusters in the result, "--clusters", to be specified. In addition, the operator has four optional parameters: "--distanceMeasure", which defaults to the squared Euclidean distance measure; the maximum number of iterations "--maxIter", with a default value of 20; the convergence threshold "--convergenceDelta", with a default value of 0.5; and the output path parameter "--output".
(3) Fuzzy K-Means clustering
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means. Whereas K-Means finds hard clusters (a point belongs to only one cluster), fuzzy K-Means is a more statistically formalized approach that finds soft clusters (i.e., a given point may belong to several clusters with certain probabilities). The command for fuzzy K-Means in Mahout is "fkmeans"; its parameter encapsulation scheme is identical to that of kmeans and is not repeated here.
(4) Spectral clustering
In clustering, spectral clustering uses the spectrum (eigenvalues) of the data similarity matrix to perform dimensionality reduction before clustering in fewer dimensions. The command spectral clustering uses in Mahout is "spectralkmeans". The number of clusters "--clusters" is a mandatory parameter; the maximum number of iterations "--maxIter" with a default of 20, the output path "--output", "--distanceMeasure" with the squared Euclidean distance measure as default, and the convergence threshold "--convergenceDelta" with a default of 0.5 are optional parameters.
(5) LDA clustering
Latent Dirichlet Allocation (LDA) is a powerful machine learning algorithm that can cluster a collection of words into a topic and a collection of documents into a mixture of topics. In natural language processing, LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups, explaining why some parts of the data are similar. The command the LDA service uses in Mahout is "lda". The number of topics of the model, "--num_topics", is a mandatory parameter; the maximum number of iterations "--maxIter", with a default value of 20, and the output path "--output" are optional parameters.
Further, the classification models in Step S2 include classification models built with the naive Bayes algorithm and the random forest algorithm.
(1) Naive Bayes
Many situations in real life call for classification, and naive Bayes is a simple, effective and commonly used classification algorithm. Its basic idea is straightforward: for a given item to be classified, compute the probability of each class conditioned on the appearance of this item, and assign the item to the class with the largest probability. The command naive Bayes uses in Mahout is "trainnb". The model output path "--output" and the label index output path "--labelIndex", whose default value is "/mahout/trainnb/labelIndex", are optional parameters.
(2) Random forest
Random forest, or random decision forest, is an ensemble learning method for classification, regression and other tasks; it builds multiple decision trees at training time and outputs the mode of the classes of the individual trees (classification) or their mean prediction (regression). Random decision forests correct for the tendency of decision trees to overfit their training set. The command the random forest algorithm uses in Mahout is "buildforest"; the optional parameters are "--output" and "--selection". The second parameter indicates the number of variables randomly selected at each tree node, and its default value is the square root of the number of explanatory variables.
Further, the collaborative filtering recommendation models in Step S2 use the matrix factorization collaborative filtering algorithm and the item-based collaborative filtering recommendation algorithm.
(1) Matrix factorization collaborative filtering
This algorithm is collaborative filtering based on ALS-WR matrix factorization. Current collaborative filtering algorithms can mainly be divided into factorization-based and neighborhood-based approaches. Because factorization takes the global influence of users and ratings into account, it tends to perform better in theory and in practice than neighborhood-based collaborative filtering. The command this operator uses in Mahout is "parallelALS". The feature dimension "--numFeatures" and the regularization parameter "--lambda" are mandatory parameters; the number of iterations "--numIterations", with a default value of 10, and the output path "--output" are optional parameters.
(2) Item-based collaborative filtering recommendation
Item-based collaborative filtering recommendation uses the user's preference data for other items to find similar items that the user has not yet rated and recommends them to the user. This is currently the most widely used collaborative filtering recommendation algorithm in industry. The command this operator uses in Mahout is "itemsimilarity". Besides the output path "--output", the optional parameters also include the similarity measure "--similarityClassname", the maximum number of similarities per item "--maxSimilaritiesPerItem", the maximum number of preferences "--maxPrefs", the minimum number of preferences per user "--minPrefsPerUser", and "--booleanData", which indicates whether the input should be treated as data without preference values. Their default values are, in order: SIMILARITY_EUCLIDEAN_DISTANCE, 100, 500, 1 and false.
In an embodiment of the present invention, the evaluation of the trained models in Step S3 includes the following steps:
Step S31: clustering models are evaluated using inter-cluster distance and cluster output inspection;
(1) Inter-cluster distance evaluation
Inter-cluster distance reflects clustering quality well. In a good clustering result the centers of different clusters cannot be too close to each other; being too close means the clustering process has produced several groups with similar features, so that the differences between clusters are not significant enough. We therefore do not want the distances between clusters to be too small; inter-cluster distance is closely related to the quality of the clustering result. The inter-cluster distance evaluation operator is not among the commands provided by Mahout, so we implemented and encapsulated this operator as a service ourselves. The Java class corresponding to the operator is InterClusterDistances. One parameter of the operator is the file path of the clustering result to be evaluated, which can be obtained during service assembly from the output parameter of the clustering model training service and need not be specified by the user. In addition there is one optional parameter, the output path of the evaluation result.
(2) Cluster output inspection
The main tool for inspecting cluster output in Mahout is ClusterDumper. Reading the output of a clustering algorithm in Mahout with ClusterDumper is very convenient, and based on this output we can easily assess the quality of the clustering. Every cluster has a center vector, which is the average of all the points in the cluster. In this project this holds for Canopy clustering, K-Means clustering, fuzzy K-Means clustering and spectral clustering, with the exception of LDA clustering. In particular, when the target of clustering is text documents, the features are words; in other words, the words with the highest weights at the center point reflect the meaning expressed by the documents in that cluster.
The run command of this operator is "clusterdumper". Its input is the cluster set, which can be obtained during the service assembly stage from the output parameter of the clustering model training and need not be specified. The dictionary generated when the data were turned into vectors is another optional input. The optional parameter "--numWords" indicates the number of words to print, with a default value of 10.
Step S32: classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix;
For classification model evaluation, the two most common evaluation criteria are accuracy and the confusion matrix. Accuracy is easy to understand. As for the confusion matrix, for a classifier that outputs non-scored results it is perhaps the most direct indicator. A confusion matrix is a cross-tabulation of the model's output results against the true expected values: each row of the matrix corresponds to a true expected value and each column to an output value of the model. The element in row i, column j of the matrix is the number of test samples of class i that the model assigned to class j. For a good model, the large elements of the confusion matrix are concentrated on the diagonal; these diagonal elements are the numbers of samples of class i correctly assigned to class i.
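In these terms, the accuracy reported for a classification model can be written from the confusion matrix C as follows (a standard definition, stated here only to make the two criteria's relationship explicit):

```latex
% Accuracy from a confusion matrix C, where C_{ij} counts test samples of
% true class i assigned by the model to class j
\mathrm{accuracy} = \frac{\sum_{i} C_{ii}}{\sum_{i}\sum_{j} C_{ij}}
```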
The naive Bayes evaluation operator computes, through the testing process, the accuracy and the confusion matrix of the naive Bayes model. Its run command is "testnb", corresponding to the training command "trainnb". This evaluation operator needs as input the feature vectors used for training, together with the trained model and label index; during service assembly these paths are obtained directly from the parameters of the preceding operators and need not be specified by the user.
(1) Random forest evaluation
The random forest evaluation operator is used to test the accuracy of a random forest model. Its command is "testforest", and its input path parameters are obtained during service assembly.
Step S33: collaborative filtering recommendation models are evaluated using the accuracy of the model.
(1) Matrix factorization collaborative filtering evaluation
This operator obtains the accuracy of the recommendation model by computing RMSE and MAE; its run command is "evaluateFactorization". The inputs of the operator are the user matrix model, the item matrix model and the test dataset, which can be obtained during the service assembly stage from the parameters of the preceding operators.
(2) Item-based collaborative filtering recommendation evaluation
This operator tests the item similarity model obtained from training against the test dataset to obtain the accuracy of the model. Its input paths are likewise obtained during the service assembly stage.
The present invention divides the machine learning process into three stages: data preprocessing, model training and model evaluation. Each stage has a series of operators belonging to it in the Mahout algorithm library. An overview of the Mahout machine learning service assembly method from the perspective of reuse is shown in Fig. 1.
After the above operators are encapsulated into services with a unified invocation specification, they can be used for service assembly to generate a complete overall machine learning workflow diagram. Fig. 2 shows the full picture of the machine learning workflows that can be assembled from the operators we encapsulated. Each circle in the figure denotes an operator that can independently perform some task; there are 25 operators in total. The lines indicate the order and flow in which the operators execute. The squares denote the intermediate results output by the operators; they exist only to make the workflow diagram easier to read and do not represent runnable tasks. We divide the machine learning process into data preprocessing, model training and model evaluation. The circles on the left represent data preprocessing operator services, such as SeqDirectory, which converts text document files into SequenceFile-format files; the circles in the middle represent model training operator services, such as the K-Means clustering operator service; the circles on the right represent model evaluation operator services, such as InterClusterDistances, which evaluates the distances between clusters in a clustering result.
Based on general data mining demands and the support range of the Hadoop ecosystem, we finally chose to support five input data format types: Hive table data, CSV files, ARFF files, Lucene index files and TXT text document files. These files are stored in HDFS. HDFS is a distributed file storage system that can store massive data very conveniently. Among the preprocessing operators, HiveAction handles Hive table data as input and outputs CSV-format files to HDFS; the Lucene2Seq operator handles Lucene index files and converts them into SequenceFile-format files; SeqDirectory handles TXT text documents and converts them into SequenceFile format; ARFF.Vector handles ARFF-format files and generates the corresponding feature vectors; CSV2Vector and SplitDataSet handle CSV-format input, where CSV2Vector generates feature vectors from the CSV file and SplitDataSet cuts the CSV data into a training set and a test set according to a given ratio. Since these six operators are the entry points of the whole machine learning workflow, their data file input paths must be specified by the user so that the system knows which data to process. The remaining 19 operators do not need their input file paths specified, because their input is simply the output path of some preceding operator in the workflow; during service assembly it can be obtained from the preceding operator and assembled as needed.
The specific embodiment of the present invention uses the Reuters-21578 news dataset, which is widely used in machine learning research. The collection and labeling of these data were originally carried out by Carnegie Group and Reuters during the development of the CONSTRUE text categorization system. The Reuters-21578 dataset is divided into 22 files; each file contains 1000 documents, except the last one, which contains only 578 documents. These files are in SGML format, which is similar to XML; before this case study began they had already been converted into TXT documents and stored in HDFS.
Since the input files are in TXT format, we naturally chose the SeqDirectory preprocessing operator as the entry of the whole machine learning workflow. In addition, we chose the two clustering algorithms K-Means and LDA for text topic discovery, and used the ClusterDumper and LDAPrintTopics operators to output the final topic discovery results.
According to the above demand, and combined with the dependency relationships between operators described in the overall machine learning workflow diagram, two machine learning workflow paths can be generated quickly from the encapsulated operators. After removing the irrelevant operators from the original overall workflow diagram, the workflow shown in Fig. 3 is obtained. From the figure we can see that two workflow paths were obtained: SeqDirectory → Seq2Sparse → K-Means → ClusterDumper and SeqDirectory → Seq2Sparse → LDA → LDAPrintTopics.
From the workflow diagram we learn that six operator services are needed in total, and we then obtain the list of mandatory and optional parameters of these operators. As shown in Table 1, only three parameters must be specified: the input path of the raw data, the K value of the K-Means algorithm and the number of topics of the LDA algorithm; the details of the other parameters can be obtained automatically during the service assembly stage.
Table 1: Parameters of the workflow operators that must be specified
Of course, other optional but important parameters can also be specified. In the present example, since it was unclear whether the default Euclidean distance measure or the cosine distance measure would give a better clustering effect for K-Means, we decided to use two sets of K-Means parameter schemes at the same time. In particular, because the cosine distance measure was chosen, the convergence threshold parameter "--convergenceDelta" of the K-Means algorithm was also set to 0.1 instead of the default 0.5, since the range of the cosine distance is 0 to 1. Three clustering workflows are therefore finally generated according to the specified parameter schemes: one implements the LDA clustering process, one the K-Means clustering process with the Euclidean distance measure, and one the K-Means clustering process with the cosine distance measure.
After the three workflow paths above were executed on Oozie, their final evaluation results were obtained; the output results are listed in the tables below. Five topics are listed for each result, and for each topic the five words with the largest weights are listed.
As shown in Table 2, the topic recognition effect of LDA clustering is generally quite good. From the top five keywords of each cluster, the topic discussed by the news documents in that cluster can largely be made out. For example, from the keywords wheat, agriculture, export and tonnes of the first cluster, it can be seen that it discusses an agricultural topic related to wheat production. From the keywords ibm, computers, att and personal of the second cluster, it can be seen that it discusses a topic involving computers and related companies. The other three clusters discuss topics related to banking and finance, oil and energy, and securities, respectively, and also show fairly clear effects.
Table 2: LDA clustering results
As shown in Table 3, the K-Means text clustering with the cosine distance measure is somewhat weaker than LDA in clustering effect, but still generally good. The keywords of the first, second and third clusters show that they discuss topics related to wheat production, stocks, and the rise in crude oil prices, respectively. The fourth cluster can roughly be seen to discuss something increasing every year, but exactly what is increasing is unclear, which indicates that this cluster's effect is not good. Note also that the keywords of the fifth topic are all numbers; although the internal correlation of this cluster appears very high, it is meaningless.
Table 3: K-Means cosine distance clustering results
Finally, as shown in Table 4, the result of the K-Means text clustering with the Euclidean distance measure is much worse than the other two. The keyword correlations of the first and second clusters are barely acceptable, but it is hard to see any correlation among the keywords of the third, fourth and fifth clusters; they even feel somewhat "irrelevant". In particular, words such as he, said, vs and would are truly meaningless for topic identification.
Table 4: K-Means Euclidean distance clustering results
It can therefore be concluded that, among the three machine learning workflows obtained by service assembly, the LDA clustering workflow discovers text topics best, the K-Means clustering workflow with the cosine distance measure comes second, and the K-Means clustering workflow with the Euclidean distance measure is the worst. If the user is satisfied with the LDA clustering results, this machine learning workflow can be saved and conveniently reused in the future. Of course, the user can also continue to tune the parameters of the operators in these workflows in the hope of achieving a better clustering effect.
In addition, the appearance of words such as 7-apr-1987, number, he and said in the results of the three clusterings shows that the word segmentation that generates the feature vectors during the preprocessing stage may not have been handled well enough. Therefore, to further improve the effect of text topic discovery, adjusting and debugging the parameters of the preprocessing operator Seq2Sparse would also be a good choice.
Experiments show that with this method reusable machine learning flows can be customized and tuned quickly and effectively, so that data mining work can be carried out efficiently on the Hadoop platform.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (4)

1. A machine learning service assembly method based on Mahout, characterized by comprising the following steps:
Step S1: pre-process data of different formats and convert them into the feature vectors used for model training;
Step S2: train clustering models, classification models and collaborative filtering recommendation models;
Step S3: evaluate the trained models;
Step S4: encapsulate the three stages of Step S1, Step S2 and Step S3, in the form of operators in the Mahout algorithm library, in a unified manner, so that they become a series of services that conform to the invocation specification of the Oozie workflow platform;
Step S5: according to the machine learning method described by the user and the format of the data to be processed, assemble a plurality of machine learning workflow paths that meet the demand;
Step S6: after these machine learning workflows have been run by Oozie on the Hadoop platform, the model evaluation operator of each workflow reports its evaluation result; the user selects a machine learning workflow according to these results;
Step S7: store the machine learning workflow selected by the user into a knowledge base, so that the user can reuse it later;
the evaluation of the trained models in Step S3 includes the following steps:
Step S31: clustering models are evaluated using inter-cluster distance and cluster output inspection;
Step S32: classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix; the confusion matrix is a cross-tabulation of the classification model's output results against the true target values; each row of the confusion matrix corresponds to a true target value, and each column corresponds to an output value of the classification model;
Step S33: collaborative filtering recommendation models are evaluated using the accuracy of the model;
the collaborative filtering recommendation models in Step S2 use the matrix factorization collaborative filtering algorithm and the item-based collaborative filtering recommendation algorithm.
2. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the data preprocessing in Step S1 includes: SeqDirectory, Lucene2Seq, Seq2Sparse, ARFF.Vector, Split, SplitDataSet, Describe and Hive.
3. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the clustering models in Step S2 include clustering models built with five clustering algorithms: Canopy, K-Means, fuzzy K-Means, LDA and spectral clustering.
4. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the classification models in Step S2 include classification models built with the naive Bayes algorithm and the random forest algorithm.
CN201611203680.2A 2016-12-23 2016-12-23 Machine learning service assembly method based on Mahout Active CN107169572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611203680.2A CN107169572B (en) Machine learning service assembly method based on Mahout

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611203680.2A CN107169572B (en) Machine learning service assembly method based on Mahout

Publications (2)

Publication Number Publication Date
CN107169572A CN107169572A (en) 2017-09-15
CN107169572B true CN107169572B (en) 2018-09-18

Family

ID=59848573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611203680.2A Active CN107169572B (en) Machine learning service assembly method based on Mahout

Country Status (1)

Country Link
CN (1) CN107169572B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697447A (en) * 2017-10-20 2019-04-30 富士通株式会社 Disaggregated model construction device, method and electronic equipment based on random forest
CN108897587B (en) * 2018-06-22 2021-11-12 北京优特捷信息技术有限公司 Pluggable machine learning algorithm operation method and device and readable storage medium
CN108875049A (en) * 2018-06-27 2018-11-23 中国建设银行股份有限公司 text clustering method and device
CN109871809A (en) * 2019-02-22 2019-06-11 福州大学 A kind of machine learning process intelligence assemble method based on semantic net
CN110276018A (en) * 2019-05-29 2019-09-24 深圳技术大学 Personalized recommendation method, terminal and storage medium for online education system
CN110928529B (en) * 2019-11-06 2021-10-26 第四范式(北京)技术有限公司 Method and system for assisting operator development
CN111104214B (en) * 2019-12-26 2020-12-15 北京九章云极科技有限公司 Workflow application method and device
CN111459820B (en) * 2020-03-31 2021-01-05 北京九章云极科技有限公司 Model application method and device and data analysis processing system
CN111611239B (en) * 2020-04-17 2024-07-12 第四范式(北京)技术有限公司 Method, device, equipment and storage medium for predicting business
CN112130933A (en) * 2020-08-04 2020-12-25 中科天玑数据科技股份有限公司 Method and device for constructing and calling operator set
CN112183768B (en) * 2020-10-23 2022-07-08 福州大学 Intelligent deep learning process assembling method based on semantic net

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9176949B2 (en) * 2011-07-06 2015-11-03 Altamira Technologies Corporation Systems and methods for sentence comparison and sentence-based search
CN103578007A (en) * 2012-07-20 2014-02-12 三星电子(中国)研发中心 Mixed recommendation system and method for intelligent device
CN104462373A (en) * 2014-12-09 2015-03-25 南京大学 Personalized recommendation engine implementing method based on multiple Agents

Also Published As

Publication number Publication date
CN107169572A (en) 2017-09-15

Similar Documents

Publication Publication Date Title
CN107169572B (en) Machine learning service assembly method based on Mahout
Xiao et al. Feature-selection-based dynamic transfer ensemble model for customer churn prediction
CN109471938A (en) A kind of file classification method and terminal
AlQahtani Product sentiment analysis for amazon reviews
CN112559900B (en) Product recommendation method and device, computer equipment and storage medium
Syamala et al. A Filter Based Improved Decision Tree Sentiment Classification Model for Real-Time Amazon Product Review Data.
Idris et al. Ensemble based efficient churn prediction model for telecom
CN109766911A (en) A kind of behavior prediction method
CN107169061A (en) A kind of text multi-tag sorting technique for merging double information sources
CN109977225A (en) Public opinion analysis method and device
Sana et al. A novel customer churn prediction model for the telecommunication industry using data transformation methods and feature selection
Gabbay et al. Isolation forests and landmarking-based representations for clustering algorithm recommendation using meta-learning
Zheng et al. Joint learning of entity semantics and relation pattern for relation extraction
Jesus et al. Dynamic feature selection based on pareto front optimization
Silva et al. Developing and Assessing a Human-Understandable Metric for Evaluating Local Interpretable Model-Agnostic Explanations.
Raman et al. Multigraph attention network for analyzing company relations
BURLĂCIOIU et al. TEXT MINING IN BUSINESS. A STUDY OF ROMANIAN CLIENT’S PERCEPTION WITH RESPECT TO USING TELECOMMUNICATION AND ENERGY APPS.
Gupta et al. Artificial intelligence based predictive analysis of customer churn
Sinaga et al. Sentiment Analysis on Hotel Ratings Using Dynamic Convolution Neural Network
Kumar et al. Opinion Mining on Amazon Musical Product Reviews using Supervised Machine Learning Techniques
Sunil et al. Customer review classification using machine learning and deep learning techniques
Alloghani et al. Sentiment analysis for decision-making using machine learning algorithms
Mollah et al. Adapting Contextual Embedding to Identify Sentiment of E-commerce Consumer Reviews with Addressing Class Imbalance Issues
Kumar et al. Shareable Representations for Search Query Understanding
Martino et al. A Hybrid Score to Optimize Clustering Hyperparameters for Online Search Term Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant