CN107169572A - A Mahout-based machine learning service assembly method - Google Patents

A Mahout-based machine learning service assembly method

Info

Publication number
CN107169572A
Authority
CN
China
Prior art keywords
machine learning
model
mahout
data
workflow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611203680.2A
Other languages
Chinese (zh)
Other versions
CN107169572B (en)
Inventor
郭文忠
黄益成
陈星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN201611203680.2A
Publication of CN107169572A
Application granted
Publication of CN107169572B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Abstract

The present invention provides a Mahout-based machine learning service assembly method, characterized in that it comprises the following steps. Step S1: preprocess data of different formats. Step S2: train models. Step S3: evaluate the models. Step S4: encapsulate the operators in a unified way. Step S5: assemble machine learning workflow paths according to the machine learning method the user wants to use and the format of the data to be processed, as described by the user. Step S6: after Oozie has finished running these machine learning workflows on the Hadoop platform, the model evaluation operator of each workflow reports that workflow's evaluation result, and the user selects a machine learning workflow based on these results. Compared with the prior art, the invention makes it possible to quickly and effectively customize and tune reusable machine learning workflows, so that data mining can be carried out efficiently on the Hadoop platform.

Description

A Mahout-based machine learning service assembly method
Technical field
The present invention relates to a Mahout-based machine learning service assembly method.
Background
The volume of data that human society produces and stores every day keeps growing, and users' data analysis needs are becoming ever more diverse. How to build, simply, quickly and efficiently, a machine learning workflow that can process large-scale data has become an urgent need. Mahout is an open-source distributed machine learning algorithm library built on Hadoop. It addresses the shortcomings of traditional machine learning libraries: no active technical community, poor scalability, inability to process massive distributed data, and closed source code. However, Mahout provides many machine learning algorithms, and each algorithm has anywhere from a few to several dozen tunable parameters, so using Mahout for data mining on the Hadoop platform still carries a very high learning cost.
Summary of the invention
To solve the above problem, the present invention provides a Mahout machine learning service assembly method from the perspective of reuse.
The present invention adopts the following technical solution: a Mahout-based machine learning service assembly method, characterized in that it comprises the following steps. Step S1: preprocess data of different formats and convert them into the feature vectors used for model training. Step S2: train clustering models, classification models and collaborative filtering recommendation models. Step S3: evaluate the trained models. Step S4: steps S1, S2 and S3 each correspond to a series of operators in the Mahout algorithm library; encapsulate these operators in a unified way so that they become a series of services conforming to the calling specification of the Oozie workflow platform. Step S5: according to the machine learning method the user wants to use and the format of the data to be processed, as described by the user, assemble one or more machine learning workflow paths that meet the requirements. Step S6: after Oozie has finished running these machine learning workflows on the Hadoop platform, the model evaluation operator of each workflow reports that workflow's evaluation result, and the user selects a machine learning workflow based on these results.
In an embodiment of the present invention, the method further comprises step S7: store the machine learning workflow selected by the user in a knowledge base, so that the same user or users with similar requirements can reuse it later.
In an embodiment of the present invention, the data preprocessing in step S1 includes: SeqDirectory, Lucene2Seq, Seq2Sparse, ARFF.Vector, Split, SplitDataSet, Describe and Hive.
In an embodiment of the present invention, the clustering models in step S2 include clustering models that use five clustering algorithms: Canopy, K-Means, fuzzy K-Means, LDA and spectral clustering.
Further, the classification models in step S2 include classification models that use the naive Bayes algorithm and the random forest algorithm.
Further, the collaborative filtering recommendation models in step S2 are models that use the matrix factorization collaborative filtering algorithm and the item-based collaborative filtering recommendation algorithm.
In an embodiment of the present invention, evaluating the trained models in step S3 comprises the following steps. Step S31: clustering models are evaluated using inter-cluster distance and cluster output inspection. Step S32: classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix. Step S33: collaborative filtering recommendation models are evaluated using the accuracy of the model.
Compared with the prior art, the present invention makes it possible to quickly and effectively customize and tune reusable machine learning workflows, so that data mining can be carried out efficiently on the Hadoop platform.
Brief description of the drawings
Fig. 1 is the main flowchart of the invention.
Fig. 2 is the machine learning workflow diagram of the invention.
Fig. 3 is the workflow diagram generated according to the requirements in one embodiment of the invention.
Detailed description
The invention is further explained below with reference to the accompanying drawings and specific embodiments.
A Mahout-based machine learning service assembly method comprises the following steps. Step S1: preprocess data of different formats and convert them into the feature vectors used for model training. Step S2: train clustering models, classification models and collaborative filtering recommendation models. Step S3: evaluate the trained models. Step S4: steps S1, S2 and S3 each correspond to a series of operators in the Mahout algorithm library; encapsulate these operators in a unified way so that they become a series of services conforming to the calling specification of the Oozie workflow platform. Step S5: according to the machine learning method the user wants to use and the format of the data to be processed, as described by the user, assemble one or more machine learning workflow paths that meet the requirements. Step S6: after Oozie has finished running these machine learning workflows on the Hadoop platform, the model evaluation operator of each workflow reports that workflow's evaluation result, and the user selects a machine learning workflow based on these results. Oozie is a Java web application for scheduling Hadoop jobs. It combines multiple jobs in sequence into one logical unit of work. It is integrated with Hadoop and supports Hadoop jobs such as MapReduce, Pig, Hive and Sqoop.
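To make the idea of step S4 concrete, the following is a minimal sketch of how a chain of encapsulated operators might be rendered as an Oozie workflow definition. It is not the patent's implementation: the function name `build_workflow`, the action names, the schema version `uri:oozie:workflow:0.4`, the `${jobTracker}` and `${nameNode}` property names, and the choice of `org.apache.mahout.driver.MahoutDriver` as the main class are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_workflow(name, steps, main_class="org.apache.mahout.driver.MahoutDriver"):
    """Render a linear chain of java actions as Oozie workflow XML.

    steps is a list of (action_name, [args]); actions run in the given order.
    The main class and property names are illustrative assumptions.
    """
    if not steps:
        raise ValueError("at least one step is required")
    app = ET.Element("workflow-app", {"name": name, "xmlns": "uri:oozie:workflow:0.4"})
    ET.SubElement(app, "start", {"to": steps[0][0]})
    for i, (action_name, args) in enumerate(steps):
        # Each action hands over to the next one; the last goes to "end".
        next_node = steps[i + 1][0] if i + 1 < len(steps) else "end"
        action = ET.SubElement(app, "action", {"name": action_name})
        java = ET.SubElement(action, "java")
        ET.SubElement(java, "job-tracker").text = "${jobTracker}"
        ET.SubElement(java, "name-node").text = "${nameNode}"
        ET.SubElement(java, "main-class").text = main_class
        for arg in args:
            ET.SubElement(java, "arg").text = arg
        ET.SubElement(action, "ok", {"to": next_node})
        ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Workflow step failed"
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")


if __name__ == "__main__":
    print(build_workflow("demo", [
        ("seqdirectory", ["seqdirectory", "--input", "/data/txt", "--output", "/tmp/seq"]),
        ("seq2sparse", ["seq2sparse", "--input", "/tmp/seq", "--output", "/tmp/vec"]),
    ]))
```

Keeping every operator behind the same action shape is what lets a later assembly step chain them mechanically; only the argument list differs from one operator to the next.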
In an embodiment of the present invention, the method further comprises step S7: store the machine learning workflow selected by the user in a knowledge base, so that the same user or users with similar requirements can reuse it later.
In an embodiment of the present invention, the data preprocessing in step S1 includes: SeqDirectory, Lucene2Seq, Seq2Sparse, ARFF.Vector, Split, SplitDataSet, Describe and Hive.
(1)SeqDirectory
This operator converts a directory of text documents into files in SequenceFile format. A SequenceFile is a serialized file format that Hadoop designed for storing key-value pairs in binary form. Because Mahout is built on Hadoop, the machine learning algorithms it provides naturally also require input files in SequenceFile format. The Mahout command for the SeqDirectory operator is "seqdirectory". In our service encapsulation of SeqDirectory, two parameters are exposed to the user. Besides the output path, the "--input" parameter specifies the input path of the files to be processed. Because this operator is an entry operator, the user must specify this parameter.
(2)Lucene2Seq
Similar to SeqDirectory, this operator produces SequenceFile files, except that its input is the index files generated by Lucene. The Mahout command for the Lucene2Seq operator is "lucene2seq". After encapsulation, Lucene2Seq exposes two parameters to the user: "--input" specifies the input path of the Lucene index files to be processed and, because this is an entry operator, must also be specified; the other is the output path parameter, which has a default value.
(3)Seq2Sparse
The Seq2Sparse operator is an important tool for generating vectors from text documents. It reads the text data that the SeqDirectory operator has converted into SequenceFile format, first generates a dictionary file from the data, and then generates the feature vectors of the text based on the dictionary. The feature weighting of the vectors can be simple tf (term frequency) or the tf-idf weighting that is popular in industry; the weighting scheme is specified with "--weight". The Mahout command for the Seq2Sparse operator is "seq2sparse". The seq2sparse operator accepts dozens of parameters; after encapsulation, besides the output path, five parameters are exposed to the user: "--analyzerName" specifies the class name of the text analyzer to use, and defaults to Lucene's standard analyzer; "--weight" specifies the weighting scheme, where tf is term-frequency weighting and tfidf is TF-IDF weighting, and defaults to tfidf; "--minSupport" specifies the minimum frequency across the whole collection for a term to be included in the dictionary file, terms below this frequency are ignored, and the default is 2; "--minDF" specifies the minimum number of documents a term must appear in to be included in the dictionary file, terms below this number are ignored, and the default is 1; "--maxDFPercent" specifies the maximum percentage of documents a term may appear in to be included in the dictionary file, and defaults to 99.
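The encapsulation described above, a handful of exposed parameters with defaults in front of a command that accepts dozens, can be sketched as a small argument builder. This is a hypothetical wrapper, not the patent's code: the function name `build_seq2sparse_args` and the `mahout` launcher name are assumptions, and only the parameter names and defaults stated in the text are used.

```python
# Defaults as stated in the text for the parameters exposed to the user.
SEQ2SPARSE_DEFAULTS = {
    "weight": "tfidf",    # "tf" or "tfidf"
    "minSupport": 2,      # minimum collection-wide term frequency
    "minDF": 1,           # minimum number of documents containing the term
    "maxDFPercent": 99,   # maximum percentage of documents containing the term
}


def build_seq2sparse_args(input_path, output_path, **overrides):
    """Return an argv-style list for the seq2sparse operator (illustrative only).

    Only the exposed parameters are accepted; anything else is rejected so that
    the wrapper keeps the operator's surface small.
    """
    allowed = set(SEQ2SPARSE_DEFAULTS) | {"analyzerName"}
    unknown = set(overrides) - allowed
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    params = {**SEQ2SPARSE_DEFAULTS, **overrides}
    args = ["mahout", "seq2sparse", "--input", input_path, "--output", output_path]
    for name in sorted(params):
        args += [f"--{name}", str(params[name])]
    return args


if __name__ == "__main__":
    print(" ".join(build_seq2sparse_args("/tmp/seq", "/tmp/vec", weight="tf")))
```

In an assembled workflow the input path would come from the preceding operator's output rather than from the user, which is why it is a positional argument here instead of an override.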
(4)ARFF.Vector
ARFF is the default dataset storage format of Weka. Each ARFF file corresponds to a two-dimensional table: each row of the table is an instance of the dataset, and each column is an attribute of the dataset. This operator generates, from ARFF files, feature vectors that the model training operators can recognize. The Mahout command for the ARFF.Vector operator is "arff.vector". Three parameters of ARFF.Vector are exposed to the user: "--input" specifies the ARFF file input path, "--output" specifies the feature vector output path, and "--dicout" specifies the output path of the vector dictionary file. Of these, "--input" is required, while "--output" and "--dicout" are optional.
(5)CSV2Vector
This operator converts CSV files directly into feature vectors. It is not a command that the Mahout library provides for direct use; we implemented and encapsulated this function on the basis of the Mahout source code. The Java class used by the CSV2Vector operator is "services.encapsulation.csv2vector"; in the Oozie encapsulation, this class name is written between the <main-class> tags so that it is called. Similarly, this operator exposes two parameters to the user, the input file path "--input" and the output file path "--output"; the input path must be specified, and the output path is optional.
(6)SplitDataSet
For the collaborative filtering recommendation algorithms in Mahout, the input is usually a CSV file with three columns: the first column is the user ID, the second is the item ID, and the third is the user's preference value for the item. The SplitDataSet operator splits CSV input data into a training set and a test set. Its Mahout command is "splitdataset". The parameters the user can specify are: "--input", the file input path; "--output", the file output path; "--trainingPercentage", the percentage of the data used for training; and "--probePercentage", the percentage of the data used for testing. "--input" must be specified and the other three are optional; "--trainingPercentage" defaults to 0.9 and "--probePercentage" defaults to 0.1.
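The effect of the two percentage parameters can be illustrated with a small in-memory sketch. It is an assumption-laden illustration of a random training/probe split over (user, item, preference) rows, not Mahout's implementation; the function name `split_dataset` and the `seed` argument are hypothetical.

```python
import random


def split_dataset(rows, training_percentage=0.9, probe_percentage=0.1, seed=0):
    """Randomly assign each (user, item, preference) row to training or probe.

    Rows whose random draw falls outside both ranges are dropped, which only
    happens when the two percentages sum to less than 1.
    """
    if training_percentage + probe_percentage > 1.0 + 1e-9:
        raise ValueError("percentages must sum to at most 1")
    rng = random.Random(seed)
    training, probe = [], []
    for row in rows:
        draw = rng.random()
        if draw < training_percentage:
            training.append(row)
        elif draw < training_percentage + probe_percentage:
            probe.append(row)
    return training, probe


if __name__ == "__main__":
    ratings = [(user, item, 1.0) for user in range(100) for item in range(10)]
    train, probe = split_dataset(ratings)
    print(len(train), len(probe))
```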
(7)Split
The Split operator splits feature vectors into a training set and a test set. Its Mahout command is "split". The user can specify five parameters: "--input" specifies the input path; "--trainingOutput" specifies the training set output path; "--testOutput" specifies the test set output path; "--randomSelectionSize" specifies the number of items randomly selected as test data, with a default of 100; "--randomSelectionPct" specifies the percentage of items randomly selected as test data, with a default of 0.2. Apart from "--input", which must be specified, the other four parameters are optional.
(8)Describe
This operator annotates the fields and the target variable of an ARFF dataset so that the classification algorithms can recognize the dataset and carry out model training. Its Mahout command is "describe". There are three user-adjustable parameters: "--path" specifies the input data path, "--file" specifies the path of the generated descriptor file, and "--descriptor" describes the data type of each field and which field is chosen as the target variable. The output path is optional, the input data path is obtained from the preceding operator during the service assembly stage, and the "--descriptor" parameter must be specified by the user.
(9)HiveAction
Oozie supports Hive action nodes. By encapsulating Hive as a service, we greatly increase the data preprocessing capability of this method. In the Oozie definition of the Hive action node, ${preprocess.hql} specifies the file path of the HiveQL script, which contains the Hive SQL statements. When the user uses a Hive action node, the HiveQL script must be specified.
In an embodiment of the present invention, the clustering models in step S2 include clustering models that use five clustering algorithms: Canopy, K-Means, fuzzy K-Means, LDA and spectral clustering.
(1) Canopy clustering
Canopy clustering is an unsupervised pre-clustering algorithm. It is typically used as a preprocessing step for the K-Means or fuzzy K-Means algorithm and is intended to speed up clustering of large datasets. Because the number of clusters in a dataset cannot be determined in advance, and K-Means requires the number of clusters as input from the start, using K-Means directly is not always a good choice. The Mahout command for the Canopy operator is "canopy". Four parameters are exposed to the user: the output path "--output"; "--t1" and "--t2", which determine the clustering granularity; and the similarity distance measure "--distanceMeasure". The values of t1 and t2 are required; the distance measure is optional and defaults to the squared Euclidean distance measure.
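How the two thresholds control granularity can be seen in a minimal single-machine sketch of canopy clustering. This is the textbook procedure under stated assumptions, not Mahout's distributed implementation; the function names are hypothetical, and because the distance here is squared Euclidean, `t1` and `t2` are compared against squared distances.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def canopy(points, t1, t2, distance=squared_euclidean):
    """Return a list of (center, members) canopies; requires t1 > t2.

    Points closer than t1 to a center join its canopy; points closer than t2
    are also removed from further consideration, so they cannot seed or join
    any later canopy.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for point in remaining:
            d = distance(center, point)
            if d < t1:
                members.append(point)          # loosely bound to this canopy
            if d >= t2:
                still_remaining.append(point)  # may still seed or join others
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    data = [(0, 0), (0.5, 0), (10, 10), (10.5, 10)]
    for center, members in canopy(data, t1=4.0, t2=1.0):
        print(center, members)
```

The number of canopies found this way is one common choice for the K that K-Means needs, which is the role the text assigns to Canopy as a pre-clustering step.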
(2) K-Means clustering
K-Means is a popular clustering method in data mining and is widely used as a clustering algorithm in many scientific fields. Its goal is to partition n data points into k clusters, with each data point belonging to the cluster with the nearest mean. The Mahout command for K-Means is "kmeans". When Canopy pre-clustering is not used, the K-Means algorithm requires the final number of clusters, the value of K, to be specified from the start. K-Means therefore has two parameter schemes: when Canopy is not used for pre-clustering, the number of clusters in the result, "--clusters", must be specified; when Canopy pre-clustering is used, it need not be. In addition, the operator has four optional parameters: "--distanceMeasure", which defaults to the squared Euclidean distance measure; the maximum number of iterations "--maxIter", which defaults to 20; the convergence threshold "--convergenceDelta", which defaults to 0.5; and the output path parameter "--output".
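The roles of the iteration cap and the convergence threshold can be sketched with a plain in-memory K-Means loop. This is a minimal illustration, not Mahout's MapReduce implementation; the function names are hypothetical, the defaults mirror the values quoted in the text, and convergence is measured here as the largest squared shift of any center.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, max_iter=20, convergence_delta=0.5):
    """Run K-Means from the given initial centers.

    Returns (centers, clusters), where clusters holds the assignment made in
    the last pass. Iteration stops after max_iter passes or once no center
    moves by more than convergence_delta (squared distance).
    """
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_euclidean(point, centers[i]))
            clusters[nearest].append(point)
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        shift = max(squared_euclidean(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= convergence_delta:
            break
    return centers, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans(data, centers=[(0, 0), (10, 10)]))
```

The initial `centers` argument is where the two parameter schemes differ: it would come either from Canopy output or from K seeds chosen when "--clusters" is given.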
(3) Fuzzy K-Means clustering
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means. K-Means finds hard clusters (each point belongs to exactly one cluster), whereas fuzzy K-Means finds soft clusters in a more statistically formal way (a given point can belong to multiple clusters, each with a certain probability). The Mahout command for fuzzy K-Means is "fkmeans". Its parameter encapsulation is the same as for kmeans and is not repeated here.
(4) Spectral clustering
Spectral clustering uses the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The Mahout command for spectral clustering is "spectralkmeans". The number of clusters "--clusters" must be specified; the maximum number of iterations "--maxIter" (default 20), the output path "--output", "--distanceMeasure" (default: squared Euclidean distance measure) and the convergence threshold "--convergenceDelta" (default 0.5) are optional parameters.
(5) LDA clustering
Latent Dirichlet Allocation (LDA) is a powerful machine learning algorithm that clusters a set of words into topics and a set of documents into mixtures of topics. In natural language processing, LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups, which explain why some parts of the data are similar. The Mahout command for the LDA service is "lda". The number of topics in the model, "--num_topics", must be specified; the maximum number of iterations "--maxIter" (default 20) and the output path "--output" are optional parameters.
Further, the classification models in step S2 include classification models that use the naive Bayes algorithm and the random forest algorithm.
(1) Naive Bayes
Classification is needed in many everyday situations, and naive Bayes is a simple, effective and commonly used classification algorithm. Its basic idea is very simple: for a given item to be classified, compute the probability of each class given that item, and assign the item to the class with the highest probability. The Mahout command for naive Bayes is "trainnb". The model output path "--output" and the label index output path "--labelIndex", which defaults to "/mahout/trainnb/labelIndex", are optional parameters.
(2) Random forest
Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. The method builds many decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees (classification) or their mean prediction (regression). Random decision forests correct for the tendency of decision trees to overfit their training set. The Mahout command for the random forest algorithm is "buildforest". The optional parameters are "--output" and "--selection". The second parameter is the number of variables randomly selected at each tree node; it defaults to the square root of the number of explanatory variables.
Further, the collaborative filtering recommendation models in step S2 are models that use the matrix factorization collaborative filtering algorithm and the item-based collaborative filtering recommendation algorithm.
(1) Matrix factorization collaborative filtering
This algorithm is collaborative filtering based on ALS-WR matrix factorization. Current collaborative filtering algorithms fall mainly into two kinds: factorization-based and neighborhood-based. Because factorization considers the influence of user ratings globally, it performs better than neighborhood-based collaborative filtering both in theory and in practice. The Mahout command for this operator is "parallelALS". The feature dimension "--numFeatures" and the regularization parameter "--lambda" must be specified; the number of iterations "--numIterations" (default 10) and the output path "--output" are optional parameters.
(2) Item-based collaborative filtering recommendation
Item-based collaborative filtering uses the user's preference data for other items to find similar items that the user has not yet rated and recommends them to the user. It is a collaborative filtering recommendation algorithm that is currently widely used in industry. The Mahout command for this operator is "itemsimilarity". Besides the output path "--output", the optional parameters are the similarity measure "--similarityClassname", the maximum number of similarities per item "--maxSimilaritiesPerItem", the maximum number of preferences "--maxPrefs", the minimum number of preferences per user "--minPrefsPerUser", and "--booleanData", which indicates whether the input should be treated as data without preference values. Their default values are, in order: Similarity_euclidean_distance, 100, 500, 1 and false.
In an embodiment of the present invention, evaluating the trained models in step S3 comprises the following steps:
Step S31: clustering models are evaluated using inter-cluster distance and cluster output inspection.
(1) Inter-cluster distance evaluation
Inter-cluster distance reflects clustering quality well. In a good clustering result, the center points of different clusters are unlikely to be too close together: centers that are very close mean that the clustering process produced multiple groups with similar features, so the differences between clusters are not significant enough. We therefore do not want the distances between clusters to be too small. The inter-cluster distance evaluation operator is not among the commands that Mahout provides, so we implemented and encapsulated this operator as a service. The corresponding Java class is InterClusterDistances. One of its parameters is the file path of the clustering result to be evaluated; the user does not need to specify it, because during service assembly it can be obtained from the output parameter of the clustering model training service. The other, optional parameter is the output path of the evaluation result.
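The quantity being evaluated can be sketched in a few lines: compute the distance between every pair of cluster centers and summarize it. This is an illustrative stand-in, not the patent's InterClusterDistances class; the function name and the choice of minimum, maximum and average as summary statistics are assumptions.

```python
import math
from itertools import combinations


def inter_cluster_distances(centers):
    """Summarize pairwise Euclidean distances between cluster centers.

    A small minimum suggests that two clusters are nearly indistinguishable.
    """
    distances = [math.dist(a, b) for a, b in combinations(centers, 2)]
    if not distances:
        raise ValueError("need at least two cluster centers")
    return {
        "min": min(distances),
        "max": max(distances),
        "avg": sum(distances) / len(distances),
    }


if __name__ == "__main__":
    print(inter_cluster_distances([(0, 0), (3, 4), (6, 8)]))
```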
(2) Cluster output inspection
The main tool for inspecting cluster output in Mahout is ClusterDumper, which makes it very convenient to read the output of Mahout's clustering algorithms. From this output, the quality of the clustering can easily be assessed. Every cluster has a center vector, which is the mean of all points in that cluster. Among the algorithms considered here, this holds for Canopy clustering, K-Means clustering, fuzzy K-Means clustering and spectral clustering, but not for LDA clustering. In particular, when the objects being clustered are text documents, the features are words, so the highest-weighted words in the center vector reflect what the documents in that cluster are about.
The command for this operator is "clusterdumper". Its input is the set of clusters, which can be obtained during the service assembly stage from the output parameter of the clustering model training operator and does not need to be specified. The dictionary generated when the data was converted into vectors is an optional input. The optional parameter "--numWords" is the number of words to print and defaults to 10.
Step S32: classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix.
Two evaluation criteria are most commonly used for classification models: accuracy and the confusion matrix. Accuracy is easy to understand, but what is a confusion matrix? For a classifier that outputs class labels rather than scores, the confusion matrix is probably the most direct indicator. A confusion matrix is a cross-tabulation of the model's outputs against the true target values. Each row of the matrix corresponds to a true target value and each column to a model output value. The element in row i, column j is the number of test samples of class i that the model assigned to class j. For a good model, the large elements of the confusion matrix are all concentrated on the diagonal; each diagonal element is the number of samples of class i that were correctly assigned to class i.
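The row and column convention described above can be made concrete with a short sketch that builds a confusion matrix from true and predicted labels and reads accuracy off its diagonal. The function names are hypothetical and the example labels are made up for illustration.

```python
def confusion_matrix(actual, predicted, labels):
    """Build a matrix where entry [i][j] counts samples of true class
    labels[i] that the model assigned to class labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(actual, predicted):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    truth = ["a", "a", "b", "b", "b"]
    guess = ["a", "b", "b", "b", "a"]
    m = confusion_matrix(truth, guess, labels=["a", "b"])
    print(m, accuracy(m))
```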
Through a test run, the naive Bayes evaluation operator computes the accuracy and the confusion matrix of the naive Bayes model. Its command is "testnb", the counterpart of the "trainnb" training command. The evaluation operator needs as input the feature vectors used for training, together with the trained model and the label index. These paths are obtained directly from the parameters of preceding operators during service assembly and do not need to be specified by the user.
(1) Random forest evaluation
The random forest evaluation operator tests the accuracy of the random forest model. Its command is "testforest", and its input path parameters are obtained during service assembly.
Step S33: collaborative filtering recommendation models are evaluated using the accuracy of the model.
(1) Matrix factorization collaborative filtering evaluation
This operator derives the accuracy of the recommendation model by computing RMSE and MAE. Its command is "evaluateFactorization". The inputs of this operator are the user matrix model, the item matrix model and the test dataset, all of which can be obtained during the service assembly stage from the parameters of preceding operators.
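For reference, the two error measures named above can be computed from held-out preference values and the model's predictions as follows. This is a generic sketch of RMSE and MAE, not Mahout's evaluation job; the function name and the sample values are made up for illustration.

```python
import math


def rmse_mae(actual, predicted):
    """Return (RMSE, MAE) for paired actual and predicted preference values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


if __name__ == "__main__":
    print(rmse_mae([3.0, 4.0, 5.0], [2.0, 4.0, 7.0]))
```

Lower values of either measure mean the predicted preferences are closer to the held-out ones; RMSE penalizes large individual errors more heavily than MAE.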
(2) Item-based collaborative filtering recommendation evaluation
Using the item similarity model obtained from training, this operator tests against the test dataset to obtain the accuracy of the model. Its input paths are likewise obtained during the service assembly stage.
The present invention divides the machine learning process into three stages: data preprocessing, model training and model evaluation. For each stage, the Mahout algorithm library contains a series of operators that belong to that stage. An overview of the Mahout machine learning service assembly method from the perspective of reuse is shown in Fig. 1.
Once the above operators have been encapsulated as services with a unified calling specification, they can be used for service assembly to generate a complete machine learning workflow graph. Fig. 2 shows the full machine learning workflow that can be assembled from the operators we encapsulated. The circles in the figure represent operators that can each perform a task on their own; there are 25 operators in total. The lines represent the order and flow of operator execution. The squares represent intermediate results output by operators; they are included only to make the workflow diagram complete and do not represent runnable tasks. The circles on the left represent data preprocessing operator services, such as SeqDirectory, which converts text documents into SequenceFile files; the medium-blue circles in the middle represent model training operator services, such as the K-Means clustering operator service; the circles on the right represent model evaluation operator services, such as InterClusterDistances, which evaluates the distances between clusters in a clustering result.
Based on general data mining needs and on what the Hadoop ecosystem supports, we finally chose to support five input data formats: Hive table data, CSV files, ARFF files, Lucene index files and TXT text documents. These files are stored in HDFS, a distributed file storage system that can conveniently store massive amounts of data. Among the preprocessing operators, HiveAction processes Hive table input and outputs CSV files to HDFS; Lucene2Seq processes Lucene index files and converts them into SequenceFile files; ARFF.Vector processes ARFF files and generates the corresponding feature vectors; CSV2Vector and SplitDataSet process CSV input, with CSV2Vector generating feature vectors from the CSV file and SplitDataSet splitting the CSV data into a training set and a test set in a given proportion; SeqDirectory, described above, processes TXT text documents. Because these six operators are the entry points of the whole machine learning workflow, the input path of their data files must be specified by the user so that the system knows which data to process. The remaining 19 operators do not need an input file path to be specified, because their input is the output path of an earlier operator in the workflow; during service assembly it is simply obtained from the preceding operator and wired in as needed.
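The distinction between entry operators and the rest can be sketched as a lookup table plus a wiring step that passes each operator's output path to its successor. This is a hypothetical illustration: the dictionary keys, the function name `wire_paths` and the `/tmp/workflow` directory layout are assumptions, while the operator names and their pairing with input formats follow the text.

```python
# Entry operators per supported input format, as described in the text.
ENTRY_OPERATORS = {
    "hive": ["HiveAction"],
    "csv": ["CSV2Vector", "SplitDataSet"],
    "arff": ["ARFF.Vector"],
    "lucene": ["Lucene2Seq"],
    "txt": ["SeqDirectory"],
}


def wire_paths(chain, user_input, work_dir="/tmp/workflow"):
    """Assign an output directory to each operator and feed it to the next.

    Only the first operator in the chain reads the user-specified input path;
    every later operator reads the output of its predecessor.
    """
    wired, current_input = [], user_input
    for operator in chain:
        output = f"{work_dir}/{operator}"
        wired.append({"operator": operator, "input": current_input, "output": output})
        current_input = output
    return wired


if __name__ == "__main__":
    for step in wire_paths(["SeqDirectory", "Seq2Sparse", "K-Means"], "/data/txt"):
        print(step)
```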
The Reuters-21578 news data set is used in this specific embodiment of the invention. It is widely used in machine learning research. The data were originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system. The Reuters-21578 data set is divided into 22 files; except for the last file, which contains only 578 documents, each file contains 1000 documents. The files are in SGML format, which is similar to XML. Before this case study began, they had been preprocessed into TXT documents and stored in HDFS.
Because the input files are in TXT format, we naturally chose the SeqDirectory preprocessing operator as the entry point of the whole machine learning workflow. In addition, two clustering algorithms, K-Means and LDA, are used for text topic discovery, and the ClusterDumper and LDAPrintTopics operators are used to output the final topic discovery results.
Based on these requirements, and using the dependencies between operators described in the machine learning workflow overview graph, two machine learning workflow paths can be generated quickly from the encapsulated operators. Removing the irrelevant operators from the original overview graph yields the workflow shown in Fig. 3. As the figure shows, we obtain two workflow paths: SeqDirectory → Seq2Sparse → K-Means → ClusterDumper and SeqDirectory → Seq2Sparse → LDA → LDAPrintTopics.
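One possible way to derive such paths from the operator dependencies is a depth-first traversal that starts at the entry operator and keeps only the paths containing an algorithm the user asked for. The sketch below is an assumption about how this pruning could be done, not the method disclosed in the patent; the function name is hypothetical, and the Canopy branch is included only to show an irrelevant operator being removed (its position in Fig. 2 is assumed).

```python
# Successor lists for a fragment of the overview graph (illustrative).
DEPENDENCIES = {
    "SeqDirectory": ["Seq2Sparse"],
    "Seq2Sparse": ["K-Means", "LDA", "Canopy"],
    "K-Means": ["ClusterDumper"],
    "LDA": ["LDAPrintTopics"],
    "Canopy": [],
    "ClusterDumper": [],
    "LDAPrintTopics": [],
}


def enumerate_paths(graph, entry, required):
    """Return every entry-to-terminal path that uses a required operator."""
    paths = []

    def visit(node, trail):
        trail = trail + [node]
        successors = graph.get(node, [])
        if not successors:
            # Terminal operator reached: keep the path only if it is relevant.
            if any(op in required for op in trail):
                paths.append(trail)
            return
        for successor in successors:
            visit(successor, trail)

    visit(entry, [])
    return paths


paths = enumerate_paths(DEPENDENCIES, "SeqDirectory", {"K-Means", "LDA"})
```

For this fragment the traversal keeps the K-Means and LDA paths and drops the Canopy branch, matching the two paths listed in the text.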
From the workflow graph we learn that six operator services are needed in total, and from this we derive the lists of required and optional parameters for those operators. As shown in Table 1, only three parameters must be specified: the input path of the raw data, the K value of the K-Means algorithm, and the number of topics for the LDA algorithm. The remaining parameter details can be obtained automatically during the service assembly stage.
Table 1. Parameters that must be specified for the workflow operators
Other important optional parameters can of course also be specified. In this example, because it is not known whether the default Euclidean distance measure or the cosine distance measure gives the better clustering result for K-Means, we decided to use two K-Means parameter schemes at the same time. In particular, because the cosine distance measure was selected, the K-Means convergence threshold parameter --convergenceDelta is set to 0.1 rather than the default 0.5, since cosine distance ranges from 0 to 1. According to the specified parameter schemes, three clustering workflows are therefore generated: one implements LDA clustering, one implements K-Means clustering with the Euclidean distance measure, and one implements K-Means clustering with the cosine distance measure.
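The way two workflow paths and two K-Means parameter schemes give three workflows can be sketched as a Cartesian product over the schemes of each tunable operator. This is a hedged illustration: the function name and dictionary keys are hypothetical, and the values of K and the topic count are placeholders because Table 1 is not reproduced in this text. Only the convergence thresholds 0.5 and 0.1 come from the description.

```python
from itertools import product

PATHS = [
    ["SeqDirectory", "Seq2Sparse", "K-Means", "ClusterDumper"],
    ["SeqDirectory", "Seq2Sparse", "LDA", "LDAPrintTopics"],
]

K = 20           # placeholder value, not taken from the patent
NUM_TOPICS = 20  # placeholder value, not taken from the patent

# One entry per parameter scheme of each tunable operator.
SCHEMES = {
    "K-Means": [
        {"distance": "euclidean", "k": K, "convergenceDelta": 0.5},
        {"distance": "cosine", "k": K, "convergenceDelta": 0.1},
    ],
    "LDA": [{"topics": NUM_TOPICS}],
}


def expand(paths, schemes):
    """Create one concrete workflow per combination of parameter schemes."""
    workflows = []
    for path in paths:
        tunable = [op for op in path if op in schemes]
        for combo in product(*(schemes[op] for op in tunable)):
            workflows.append({"path": path, "params": dict(zip(tunable, combo))})
    return workflows


workflows = expand(PATHS, SCHEMES)
```

Adding a further scheme for any operator increases the number of generated workflows without changing the assembly logic.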
After the three workflow paths above were executed on Oozie, their final evaluation results were obtained; the output is listed in the tables below. For each result, 5 topics are listed, and for each topic the 5 words with the highest weights are listed.
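To make the Oozie step more concrete, the sketch below generates a linear workflow definition in which each operator becomes one action. It is an assumption about how an encapsulated operator might be expressed, not the encapsulation used in the patent: the use of Oozie's java action, the schema version, the choice of org.apache.mahout.driver.MahoutDriver as the main class, the action names and the HDFS paths are all illustrative.

```python
import xml.etree.ElementTree as ET

OOZIE_NS = "uri:oozie:workflow:0.4"  # assumed schema version


def build_workflow_xml(name, steps):
    """Build a linear Oozie workflow; steps is a list of (action_name, args)."""
    root = ET.Element("workflow-app", {"name": name, "xmlns": OOZIE_NS})
    ET.SubElement(root, "start", {"to": steps[0][0]})
    for index, (action_name, args) in enumerate(steps):
        # Each action hands over to the next one, the last one to "end".
        next_node = steps[index + 1][0] if index + 1 < len(steps) else "end"
        action = ET.SubElement(root, "action", {"name": action_name})
        java = ET.SubElement(action, "java")
        ET.SubElement(java, "job-tracker").text = "${jobTracker}"
        ET.SubElement(java, "name-node").text = "${nameNode}"
        ET.SubElement(java, "main-class").text = "org.apache.mahout.driver.MahoutDriver"
        for arg in args:
            ET.SubElement(java, "arg").text = arg
        ET.SubElement(action, "ok", {"to": next_node})
        ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Operator failed"
    ET.SubElement(root, "end", {"name": "end"})
    return ET.tostring(root, encoding="unicode")


workflow_xml = build_workflow_xml(
    "reuters-topics",  # hypothetical workflow name
    [
        ("seqdirectory", ["seqdirectory", "-i", "/data/reuters-txt", "-o", "/tmp/wf/seq"]),
        ("seq2sparse", ["seq2sparse", "-i", "/tmp/wf/seq", "-o", "/tmp/wf/vectors"]),
    ],
)
```

A real deployment would also need a job properties file and the Mahout libraries on the workflow's classpath, which are outside the scope of this sketch.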
As shown in Table 2, the topics identified by LDA clustering are good overall. The top 5 keywords largely reveal the topic discussed by the news documents in each cluster. For example, the keywords of the 1st cluster (wheat, agriculture, export and tones) show that it discusses agricultural topics related to wheat production. The keywords of the 2nd cluster (ibm, computers, att, i.e. American Telephone and Telegraph Company, and personal) show that it discusses topics involving computers and related companies. The other three clusters discuss banking and finance, oil and energy, and securities respectively, and are also fairly clear.
Table 2. LDA clustering results
As shown in Table 3, K-Means text clustering with the cosine distance measure is somewhat weaker than LDA clustering, but is still good overall. The keywords of the 1st, 2nd and 3rd clusters show that they discuss topics related to wheat production, stocks and rising crude oil prices respectively. The 4th cluster can roughly be seen to discuss something that increases year by year, but what exactly is increasing is unclear, which indicates that this cluster is of poor quality. Note also that the 5th topic consists entirely of numbers; although this cluster appears to be highly correlated internally, it is meaningless.
Table 3. K-Means clustering results with the cosine distance measure
Finally, Table 4 shows the results of K-Means text clustering with the Euclidean distance measure. Its clustering quality is much poorer than that of the other two. The keywords of the 1st and 2nd clusters are barely related, and for the 3rd, 4th and 5th clusters it is hard to see any relationship among the keywords, which even seem incoherent. In particular, words such as he, said, vs and would contribute nothing to topic identification.
Table 4. K-Means clustering results with the Euclidean distance measure
We therefore conclude that, of the three machine learning workflows obtained by service assembly, the LDA clustering workflow performs best for text topic discovery, the K-Means clustering workflow with the cosine distance measure comes second, and the K-Means clustering workflow with the Euclidean distance measure performs worst. If the user is satisfied with the LDA clustering result, that machine learning workflow can be saved so that it can be conveniently reused in the future. The user can also continue to tune the parameters of the operators in these workflows in order to achieve a better clustering result.
In addition, the evaluation results show that tokens such as 7-apr-1987, numbers, he and said appear in the results of all three clusterings, which suggests that tokenization during feature vector generation in the preprocessing stage may not have been handled well. To further improve text topic discovery, modifying and debugging the parameters of the Seq2Sparse preprocessing operator would therefore also be a good choice.
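The kind of token cleanup suggested here can be illustrated in plain Python. This sketch is not how Seq2Sparse works internally and does not use any Mahout API; the function name, the stop-word list (taken from the uninformative words named in the results) and the rule of dropping every token that starts with a digit are assumptions made for illustration.

```python
import re

# Uninformative words mentioned in the clustering results.
STOP_WORDS = {"he", "said", "vs", "would"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Lower-case, tokenize, and drop stop words, dates and bare numbers."""
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())
    return [
        token
        for token in tokens
        # Tokens such as "7-apr-1987" or "12" start with a digit.
        if token not in stop_words and not token[0].isdigit()
    ]


cleaned = clean_tokens("He said wheat exports rose 7-apr-1987 by 12 pct")
```

In a Mahout-based workflow the equivalent effect would be sought through the preprocessing operator's own analyzer and filtering parameters rather than through custom code.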
Experiments show that this method can quickly and effectively customize and tune reusable machine learning workflows, so that data mining can be carried out efficiently on the Hadoop platform.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (7)

1. A machine learning service assembly method based on Mahout, characterized in that it comprises the following steps:
Step S1: preprocessing data of different formats and converting it into feature vectors that can be used for model training;
Step S2: training clustering models, classification models and collaborative filtering recommendation models;
Step S3: evaluating the trained models;
Step S4: the operations of steps S1, S2 and S3 correspond to a series of operators in the Mahout algorithm library; these operators are uniformly encapsulated so that they become a series of services conforming to the invocation specification of the Oozie workflow platform;
Step S5: according to the machine learning method to be used and the format of the data to be processed, as described by the user, assembling one or more machine learning workflow paths that meet the requirements;
Step S6: after these machine learning workflows have finished running via Oozie on the Hadoop platform, the model evaluation operator of each workflow provides the evaluation result of that workflow; the user selects a machine learning workflow according to this evaluation result.
2. The machine learning service assembly method based on Mahout according to claim 1, characterized in that it further comprises step S7: storing the machine learning workflow selected by the user in a knowledge base, for later reuse by that user or by users with similar requirements.
3. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the data preprocessing in step S1 includes: SeqDirectory, Lucene2Seq, Seq2Sparse, Arff.Vector, Split, SplitDataSet, Describe and Hive.
4. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the clustering models in step S2 include clustering models using five clustering algorithms: Canopy, K-Means, fuzzy K-Means, LDA and spectral clustering.
5. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the classification models in step S2 include classification models using the naive Bayes algorithm and the random forest algorithm.
6. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the collaborative filtering recommendation models in step S2 include collaborative filtering recommendation models using a matrix factorization collaborative filtering algorithm and an item-based collaborative filtering recommendation algorithm.
7. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the evaluation of the trained models in step S3 comprises the following steps:
Step S31: clustering models are evaluated using inter-cluster distance and inspection of the cluster output;
Step S32: classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix;
Step S33: collaborative filtering recommendation models are evaluated using the accuracy of the model.
CN201611203680.2A 2016-12-23 2016-12-23 A kind of machine learning Service Assembly method based on Mahout Active CN107169572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611203680.2A CN107169572B (en) 2016-12-23 2016-12-23 A kind of machine learning Service Assembly method based on Mahout

Publications (2)

Publication Number Publication Date
CN107169572A true CN107169572A (en) 2017-09-15
CN107169572B CN107169572B (en) 2018-09-18

Family

ID=59848573

Country Status (1)

Country Link
CN (1) CN107169572B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant