CN107169572B - A kind of machine learning Service Assembly method based on Mahout - Google Patents
- Publication number
- CN107169572B CN107169572B CN201611203680.2A CN201611203680A CN107169572B CN 107169572 B CN107169572 B CN 107169572B CN 201611203680 A CN201611203680 A CN 201611203680A CN 107169572 B CN107169572 B CN 107169572B
- Authority
- CN
- China
- Prior art keywords
- model
- machine learning
- mahout
- data
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a Mahout-based machine learning service assembly method, characterized by comprising the following steps. Step S1: preprocess data of different formats. Step S2: train models. Step S3: evaluate the models. Step S4: encapsulate the operators uniformly. Step S5: assemble machine learning workflows according to the machine learning method the user describes and the format of the data to be processed. Step S6: after these machine learning workflows finish running on the Hadoop platform under Oozie, the model-evaluation operator of each workflow reports its evaluation result, and the user selects a machine learning workflow according to this result. Compared with the prior art, the present invention can quickly and effectively customize and tune reusable machine learning flows, and can thereby carry out data mining efficiently on the Hadoop platform.
Description
Technical field
The present invention relates to a Mahout-based machine learning service assembly method.
Background technology
The volume of data that human society generates and stores grows larger every day, and with it come increasingly varied demands for data analysis. How to build machine learning pipelines that can handle large-scale data simply, quickly, and efficiently has become a pressing need. Mahout is an open-source distributed machine learning algorithm library built on Hadoop. Its appearance addressed the shortcomings of traditional machine learning libraries: the lack of an active technical community, poor scalability, the inability to process distributed massive data, and closed source code. However, because Mahout provides numerous machine learning algorithms, each with anywhere from a few to dozens of tunable parameters, performing data mining with Mahout on the Hadoop platform still carries a very high learning cost.
Summary of the invention
To solve the above problems, the present invention provides a Mahout machine learning service assembly method from the perspective of reuse.
The present invention adopts the following technical scheme: a Mahout-based machine learning service assembly method, characterized by comprising the following steps. Step S1: preprocess data of different formats, converting them into the feature vectors used for model training. Step S2: train clustering models, classification models, and collaborative filtering recommendation models. Step S3: evaluate the trained models. Step S4: steps S1, S2, and S3 correspond to a series of operators in the Mahout algorithm library; encapsulate these operators uniformly so that they become a series of services conforming to the invocation specification of the Oozie workflow platform. Step S5: according to the machine learning method described by the user and the format of the data to be processed, assemble one or more machine learning workflows that meet the demand. Step S6: after these machine learning workflows finish running on the Hadoop platform under Oozie, the model-evaluation operator of each workflow reports the workflow's evaluation result; the user selects a machine learning workflow according to this result.
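The assembly idea of steps S1 through S6 — wrap operators as services with a uniform call convention, chain them into candidate workflows, run each, and let the user pick by evaluation score — can be sketched in miniature as below. This is an illustrative sketch only; none of these names are Mahout or Oozie APIs, and the toy operators stand in for real preprocessing, training, and evaluation services.

```python
# Illustrative sketch of steps S1-S6: uniformly wrapped operator services
# chained into candidate workflows and ranked by evaluation score.
# All names here are hypothetical, not Mahout/Oozie APIs.

def make_service(name, run):
    """Wrap an operator so every service has the same call convention."""
    return {"name": name, "run": run}

def run_workflow(workflow, data):
    """Feed each service's output into the next, as Oozie chains actions."""
    for service in workflow:
        data = service["run"](data)
    return data

# Toy operators standing in for preprocessing, training and evaluation.
preprocess = make_service("preprocess", lambda d: [x / max(d) for x in d])
train      = make_service("train",      lambda d: sum(d) / len(d))
evaluate   = make_service("evaluate",   lambda model: round(model, 2))

candidates = [[preprocess, train, evaluate]]
scores = [run_workflow(wf, [2.0, 4.0, 8.0]) for wf in candidates]
best = candidates[scores.index(max(scores))]
print(scores)  # [0.58]
```

The point of the sketch is the uniform call convention: because every service takes the previous service's output, candidate workflows can be assembled and compared mechanically.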
In an embodiment of the present invention, the method further includes step S7: store the machine learning workflow selected by the user in a knowledge base, so that it can later be reused by this user or by users with similar demands.
In an embodiment of the present invention, the data preprocessing in step S1 includes: SeqDirectory, Lucene2Seq, Seq2Sparse, Arff.Vector, Split, SplitDataSet, Describe, and Hive.
In an embodiment of the present invention, the clustering models in step S2 include models built with five clustering algorithms: Canopy, K-Means, fuzzy K-Means, LDA, and spectral clustering.
Further, the classification models in step S2 include models built with the naive Bayes algorithm and the random forest algorithm.
Further, the collaborative filtering recommendation models in step S2 use the matrix factorization collaborative filtering algorithm and the item-based collaborative filtering recommendation algorithm.
In an embodiment of the present invention, the evaluation of trained models in step S3 includes the following steps. Step S31: clustering models are evaluated using inter-cluster distances and inspection of the cluster output. Step S32: classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix. Step S33: collaborative filtering recommendation models are evaluated using the model's accuracy.
Compared with the prior art, the present invention can quickly and effectively customize and tune reusable machine learning flows, and can therefore carry out data mining efficiently on the Hadoop platform.
Description of the drawings
Fig. 1 is the overall flow diagram of the present invention.
Fig. 2 is the machine learning workflow diagram of the present invention.
Fig. 3 is the workflow diagram generated according to demand in one embodiment of the invention.
Detailed description of the embodiments
The present invention is further explained below with reference to the drawings and specific embodiments.
A Mahout-based machine learning service assembly method comprises the following steps. Step S1: preprocess data of different formats, converting them into the feature vectors used for model training. Step S2: train clustering models, classification models, and collaborative filtering recommendation models. Step S3: evaluate the trained models. Step S4: steps S1, S2, and S3 correspond to a series of operators in the Mahout algorithm library; encapsulate these operators uniformly so that they become a series of services conforming to the invocation specification of the Oozie workflow platform. Step S5: according to the machine learning method described by the user and the format of the data to be processed, assemble one or more machine learning workflows that meet the demand. Step S6: after these machine learning workflows finish running on the Hadoop platform under Oozie, the model-evaluation operator of each workflow reports the workflow's evaluation result; the user selects a machine learning workflow according to this result. Oozie is a Java web application for scheduling Hadoop jobs; it combines multiple sequences of operations into one logical unit of work, integrates with Hadoop, and supports Hadoop tasks such as MapReduce, Pig, Hive, and Sqoop.
In an embodiment of the present invention, the method further includes step S7: store the machine learning workflow selected by the user in a knowledge base, so that it can later be reused by this user or by users with similar demands.
In an embodiment of the present invention, the data preprocessing in step S1 includes: SeqDirectory, Lucene2Seq, Seq2Sparse, Arff.Vector, Split, SplitDataSet, Describe, and Hive.
(1) SeqDirectory
This operator converts a directory of text documents into a file in SequenceFile format. SequenceFile is a serialized file format that Hadoop designed for storing binary key-value pairs, and the machine learning algorithms of Mahout, built on Hadoop, naturally require SequenceFile as their input format. The command the SeqDirectory operator uses in Mahout is "seqdirectory". We encapsulate the SeqDirectory operator as a service and expose two parameters to the user: besides the output path, the "--input" parameter specifies the input path of the files to be processed; since this operator is an entry operator, the user must specify this parameter.
(2) Lucene2Seq
Similar to SeqDirectory, this operator converts Lucene index files into SequenceFile-format files; the difference is that its input is the index files generated by the Lucene analyzer. The command the Lucene2Seq operator uses in Mahout is "lucene2seq". After encapsulation, we expose two parameters of Lucene2Seq to the user: the "--input" parameter specifies the input path of the Lucene index files to be processed and, as with any entry operator, must be specified; the other is the default output path parameter.
(3) Seq2Sparse
The Seq2Sparse operator is an important tool for generating vectors from text documents. It reads text data that the SeqDirectory operator has converted into SequenceFile format, first generates a dictionary file from the data, and then generates the feature vectors of the texts based on the dictionary. The feature weighting of the vectors can be simple tf term-frequency vectors or the tf-idf vectors popular in industry; the weighting scheme is specified with "--weight". The command the Seq2Sparse operator uses in Mahout is "seq2sparse". The seq2sparse operator has dozens of assignable parameters; through our encapsulation, besides the output path, five parameters are exposed to the user: "--analyzerName" specifies the class name of the text analyzer to use, defaulting to the Lucene standard analyzer; "--weight" specifies the weighting mechanism, where tf is weighting by term frequency and tfidf is weighting by TF-IDF, the default being tfidf; "--minSupport" is the minimum frequency in the whole collection for a word to be put into the dictionary file, words below this frequency being ignored, with a default of 2; "--minDF" specifies the minimum number of documents a word must appear in to be put into the dictionary file, words below this being ignored, with a default of 1; "--maxDFPercent" specifies the maximum percentage of documents a word may appear in and still be put into the dictionary file, with a default of 99.
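For reference, the two weighting schemes that "--weight" selects between can be sketched in a few lines of Python. This is the textbook formulation of tf and tf-idf, not Mahout's implementation, whose exact weighting differs in details such as sublinear term frequency and idf smoothing:

```python
# Textbook tf and tf-idf weighting, illustrating the "--weight" choice.
# A sketch only; Mahout's seq2sparse computes these with its own variants.
import math

def tf(term, doc):
    """Raw term frequency of `term` in a tokenized document."""
    return doc.count(term)

def tfidf(term, doc, corpus):
    """Term frequency damped by how common the term is across the corpus."""
    df = sum(1 for d in corpus if term in d)   # document frequency
    idf = math.log(len(corpus) / df)           # plain (unsmoothed) idf
    return tf(term, doc) * idf

docs = [["hadoop", "mahout", "mahout"], ["hadoop", "oozie"], ["hive"]]
print(tf("mahout", docs[0]))                   # 2
print(round(tfidf("mahout", docs[0], docs), 3))  # 2.197
```

The example shows why tfidf is the default: "mahout" appears in only one of three documents, so its weight is boosted relative to its raw count, while a word present in every document would receive idf 0.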
(4) ARFF.Vector
ARFF files are Weka's default dataset storage format. Each ARFF file corresponds to a two-dimensional table: each row of the table is an instance of the dataset, and each column is an attribute of the dataset. This operator generates, from an ARFF file, feature vectors recognizable by the model-training operators. The command the ARFF.Vector operator uses in Mahout is "arff.vector". We expose three parameters of ARFF.Vector to the user: the "--input" parameter specifies the ARFF file input path, "--output" specifies the feature vector output path, and "--dicout" specifies the output path of the vector dictionary file. Of these, "--input" is mandatory, while "--output" and "--dicout" are optional.
(5) CSV2Vector
This operator converts a CSV file directly into feature vectors. It is not a directly usable command provided by the Mahout library; we implemented this function on top of the Mahout source code. The Java class the CSV2Vector operator uses is "services.encapsulation.csv2vector"; during Oozie encapsulation the class name can be written between the <main-class> tags to be invoked. Similarly, the operator exposes two parameters to the user, the input file path "--input" and the output file path "--output", where the input path must be specified and the output path is optional.
(6) SplitDataSet
For the collaborative filtering recommendation algorithms in Mahout, the common input is a CSV file with three columns: the first column is the user ID, the second the item ID, and the third the user's preference value for the item. The SplitDataSet operator cuts the CSV-format input data into two parts, a training set and a test set. The command it uses in Mahout is "splitdataset". The parameters the user can assign are: "--input" specifies the file input path, "--output" the file output path, "--trainingPercentage" the percentage of data used for training, and "--probePercentage" the percentage of data used for testing. Of these, "--input" must be specified and the other three are optional; "--trainingPercentage" defaults to 0.9 and "--probePercentage" to 0.1.
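The split that this operator performs amounts to shuffling the rating rows and cutting at the training percentage. A minimal sketch, assuming the 0.9/0.1 defaults above (the function name and seeding are illustrative, not part of Mahout):

```python
# Sketch of SplitDataSet's behavior: shuffle (user, item, preference) rows
# and cut at the training percentage. Illustrative only, not Mahout code.
import random

def split_dataset(rows, training_percentage=0.9, seed=42):
    """Return (training set, probe/test set) with the given proportions."""
    rows = rows[:]                       # leave the caller's list intact
    random.Random(seed).shuffle(rows)    # deterministic shuffle for the demo
    cut = int(len(rows) * training_percentage)
    return rows[:cut], rows[cut:]

# 100 toy (user ID, item ID, preference value) rows.
ratings = [(user, item, 1.0) for user in range(10) for item in range(10)]
train, probe = split_dataset(ratings)
print(len(train), len(probe))  # 90 10
```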
(7) Split
The Split operator cuts feature vectors into a training set and a test set. The command it uses in Mahout is "split". The user may specify five parameters: "--input" specifies the input path; "--trainingOutput" specifies the training set output path; "--testOutput" specifies the test set output path; "--randomSelectionSize" specifies the number of items randomly selected as test data, with a default of 100; "--randomSelectionPct" specifies the percentage randomly selected as test data, with a default of 0.2. Except for "--input", which must be specified, the other four parameters are optional.
(8) Describe
This operator labels the fields and the target variable in an ARFF dataset so that classification algorithms can recognize the dataset and proceed to model training. The command it uses in Mahout is "describe". The user can adjust three parameters: "--path" specifies the input data path, "--file" specifies the path of the generated descriptor file, and "--descriptor" describes the data types of the fields and which field is chosen as the target variable. Apart from the optional output path, the input data is obtained from the preceding operator during the service assembly stage, while the "--descriptor" parameter must be specified by the user.
(9) HiveAction
Oozie supports Hive task nodes, and by encapsulating Hive as a service we greatly increase the data preprocessing potential of this method. In the code of the Hive action node in Oozie, ${preprocess.hql} specifies the file path of the HiveQL script, in which the Hive SQL statements are written. When the user uses a Hive action node, the HiveQL command script must be specified.
In an embodiment of the present invention, the clustering models in step S2 include models built with five clustering algorithms: Canopy, K-Means, fuzzy K-Means, LDA, and spectral clustering.
(1) Canopy clustering
The Canopy clustering algorithm is an unsupervised pre-clustering algorithm, typically used as a preprocessing step for the K-Means or fuzzy K-Means algorithms and intended to speed up clustering on large datasets. Because the size of a dataset cannot be known in advance, while K-Means requires the number of clusters as input from the very start, using K-Means directly is not always a good choice. The command the Canopy operator uses in Mahout is "canopy", and four parameters are exposed to the user: the output path "--output", the thresholds "--t1" and "--t2" that determine the clustering granularity, and the similarity distance measure "--distanceMeasure". The values of t1 and t2 are mandatory; the distance measure is optional and defaults to the squared Euclidean distance measure.
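The roles of t1 and t2 can be made concrete with a small sketch: t1 is the loose radius that gathers points into a canopy, and t2 < t1 is the tight radius that removes points from further consideration as centers. This is a simplified one-dimensional illustration of the canopy idea, not Mahout's implementation:

```python
# Simplified single-pass canopy clustering on 1-D points, showing how
# "--t1" (loose radius) and "--t2" (tight radius) shape the canopies.
def canopy(points, t1, t2, dist=lambda a, b: abs(a - b)):
    assert t2 < t1
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Every point within t1 of the center joins this canopy
        # (points may belong to several canopies).
        members = [p for p in points if dist(center, p) < t1]
        canopies.append((center, members))
        # Points within t2 are "used up" and cannot seed new canopies.
        remaining = [p for p in remaining if dist(center, p) >= t2]
    return canopies

canopies = canopy([1.0, 1.2, 5.0, 5.1, 9.0], t1=2.0, t2=0.5)
print(len(canopies))  # 3
```

The resulting canopy count (here 3) is what makes Canopy useful as a preprocessing step: it supplies the K value that K-Means would otherwise demand up front.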
(2) K-Means clustering
K-Means clustering is a popular clustering method in data mining and is widely used as a clustering algorithm across many scientific fields. The purpose of K-Means clustering is to divide n data points into k clusters, with each data point belonging to the cluster with the nearest mean. The command K-Means uses in Mahout is "kmeans". When Canopy pre-clustering is not used, the K-Means algorithm requires the number of final clusters, i.e. the K value, to be specified from the start. The parameters of K-Means therefore come in two schemes: without Canopy pre-clustering, K-Means needs the number of result clusters "--clusters" to be specified. In addition, the operator has four optional parameters: "--distanceMeasure", defaulting to the squared Euclidean distance measure; the maximum number of iterations "--maxIter", defaulting to 20; the convergence threshold "--convergenceDelta", defaulting to 0.5; and the output path parameter "--output".
(3) Fuzzy K-Means clustering
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means. K-Means discovers hard clusters (a point belongs to only one cluster), whereas fuzzy K-Means, as a more statistically formalized method, discovers soft clusters (a particular point can belong to multiple clusters with certain probabilities). The command for fuzzy K-Means in Mahout is "fkmeans"; its parameter encapsulation scheme is identical to that of kmeans and is not repeated here.
(4) Spectral clustering
Spectral clustering uses the spectrum (eigenvalues) of the data's similarity matrix to perform dimensionality reduction before clustering in fewer dimensions. The command spectral clustering uses in Mahout is "spectralkmeans". The number of clusters "--clusters" is a mandatory parameter; the maximum number of iterations "--maxIter", defaulting to 20, the output path "--output", "--distanceMeasure", defaulting to the squared Euclidean distance measure, and the convergence threshold "--convergenceDelta", defaulting to 0.5, are optional parameters.
(5) LDA clustering
Latent Dirichlet allocation (LDA) is a powerful machine learning algorithm that can cluster a series of words into a topic and a series of documents into mixtures of topics. In natural language processing, LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups, explaining why some parts of the data are similar. The command the latent Dirichlet allocation service uses in Mahout is "lda". The number of topics of the model, "--num_topics", is a mandatory parameter; the maximum number of iterations "--maxIter", defaulting to 20, and the output path "--output" are optional parameters.
Further, the classification models in step S2 include models built with the naive Bayes algorithm and the random forest algorithm.
(1) Naive Bayes
Many occasions in life call for classification, and naive Bayes is a simple and effective common classification algorithm. Its basic idea is very simple: for the item to be classified, compute the probability of each class given the item's occurrence, and assign the item to the class whose probability is largest. The command naive Bayes uses in Mahout is "trainnb"; the model output path "--output" and the label index output path "--labelIndex", defaulting to "/mahout/trainnb/labelIndex", are optional parameters.
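The "largest class probability" idea stated above can be written out as a toy multinomial naive Bayes with Laplace smoothing. This is a from-scratch sketch of the principle, with invented sample data, and is unrelated to trainnb's actual Java implementation:

```python
# Toy multinomial naive Bayes: pick the class maximizing
# log P(class) + sum of log P(word | class), with Laplace smoothing.
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (word list, label). Returns a classifier function."""
    priors = Counter(label for _, label in samples)
    words = defaultdict(Counter)
    vocab = set()
    for ws, label in samples:
        words[label].update(ws)
        vocab.update(ws)

    def score(ws, label):
        s = math.log(priors[label] / len(samples))      # log prior
        total = sum(words[label].values()) + len(vocab)  # smoothed denominator
        for w in ws:
            s += math.log((words[label][w] + 1) / total)  # Laplace smoothing
        return s

    return lambda ws: max(priors, key=lambda label: score(ws, label))

classify = train_nb([(["cheap", "pills"], "spam"),
                     (["meeting", "notes"], "ham")])
print(classify(["cheap", "meeting", "pills"]))  # spam
```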
(2) Random forest
Random forest, or random decision forest, is an ensemble learning method for classification, regression, and other tasks that works by building multiple decision trees during training and outputting the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees. Random decision forests correct decision trees' tendency to overfit their training set. The command the random forest algorithm uses in Mahout is "buildforest"; the optional parameters are "--output" and "--selection". The second parameter indicates the number of variables randomly selected at each tree node, defaulting to the square root of the number of explanatory variables.
Further, the collaborative filtering recommendation models in step S2 use the matrix factorization collaborative filtering algorithm and the item-based collaborative filtering recommendation algorithm.
(1) Matrix factorization collaborative filtering
This algorithm is collaborative filtering based on ALS-WR matrix factorization. Current collaborative filtering algorithms can mainly be divided into factorization-based and neighborhood-based approaches. Because factorization can consider the influence of users and ratings globally, it performs better, in theory and in practice, than neighborhood-based collaborative filtering. The command the operator uses in Mahout is "parallelALS"; the feature dimension "--numFeatures" and the regularization parameter "--lambda" are mandatory parameters, while the number of iterations "--numIterations", defaulting to 10, and the output path "--output" are optional parameters.
(2) Item-based collaborative filtering recommendation
Item-based collaborative filtering recommends to the user, based on the user's preference data for other items, items that are similar but that the user has not yet rated. It is the collaborative filtering recommendation algorithm most widely used in industry today. The command the operator uses in Mahout is "itemsimilarity". Besides the output path "--output", the optional parameters include the similarity measure "--similarityClassname", the maximum number of similar items per item "--maxSimilaritiesPerItem", the maximum number of preferences "--maxPrefs", the minimum preferences per user "--minPrefsPerUser", and "--booleanData", which indicates whether to treat the input as data without preference values. Their default values are, in order: SIMILARITY_EUCLIDEAN_DISTANCE, 100, 500, 1, and false.
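The core intuition — recommend items that co-occur with what the user already likes — fits in a few lines. This is a deliberately crude co-occurrence scorer with made-up data, standing in for (but not reproducing) the itemsimilarity pipeline:

```python
# Toy item-based recommendation: score items the user has not rated by
# how strongly the users who rated them overlap with this user's items.
# Illustrative stand-in for Mahout's itemsimilarity, not its algorithm.
def cooccurrence_scores(preferences, user):
    seen = preferences[user]
    scores = {}
    for other, items in preferences.items():
        if other == user:
            continue
        overlap = len(seen & items)          # shared taste with `other`
        for item in items - seen:            # items this user hasn't rated
            scores[item] = scores.get(item, 0) + overlap
    return sorted(scores, key=scores.get, reverse=True)

prefs = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"b", "d"}}
print(cooccurrence_scores(prefs, "u1"))  # ['c', 'd']
```

Here "c" outranks "d" for user u1 because u2, who rated "c", shares two items with u1, while u3, who rated "d", shares only one.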
In an embodiment of the present invention, the evaluation of trained models in step S3 includes the following steps.
Step S31: clustering models are evaluated using inter-cluster distances and inspection of the cluster output.
(1) Inter-cluster distance evaluation
Inter-cluster distances reflect clustering quality well. In a good clustering result the centers of different clusters cannot be too close to each other: being too close means the clustering process produced multiple groups with similar features, leaving the differences between clusters insufficiently significant. We therefore do not want the distances between clusters to be too small; inter-cluster distance is closely related to the quality of the clustering result. The inter-cluster distance evaluation operator is not among the commands Mahout provides, so we coded and encapsulated this operator as a service; its corresponding Java class is InterClusterDistances. One parameter of the operator is the file path of the clustering result to be evaluated, which during service assembly can be obtained from the output parameter of the clustering model training service without user input. In addition there is one optional parameter, the output path of the evaluation result.
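The quantity this operator computes is straightforward: pairwise distances between cluster centers, with small minimum distances flagging insufficiently separated clusters. A minimal sketch (not the InterClusterDistances class itself):

```python
# Pairwise Euclidean distances between cluster centers; a small minimum
# suggests redundant clusters. Sketch of the idea behind the
# InterClusterDistances evaluation, not the actual Java class.
from itertools import combinations

def inter_cluster_distances(centers):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return {(i, j): dist(a, b)
            for (i, a), (j, b) in combinations(enumerate(centers), 2)}

d = inter_cluster_distances([(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)])
print(min(d.values()))  # 1.0
```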
(2) Cluster output inspection
The main tool in Mahout for inspecting cluster output is ClusterDumper. Reading the output of Mahout's clustering algorithms with ClusterDumper is very convenient, and from this output we can easily assess the quality of the clustering. Every cluster has a center vector, which is the average of all points in the cluster. In this project this holds for Canopy clustering, K-Means clustering, fuzzy K-Means clustering, and spectral clustering, with the exception of latent Dirichlet allocation clustering. In particular, when the clustering targets are text documents, the features are words, which means that the words with the highest weights at the center point reflect the meaning expressed by the documents in that cluster.
The operator's command is "clusterdump"; its input is the cluster set, which can be obtained in the service assembly stage from the output parameter of the clustering model training and need not be specified. The dictionary generated when the data was converted into vectors is also an optional input. The optional parameter "--numWords" indicates the number of words to print, with a default of 10.
Step S32: classification models are evaluated using accuracy; if the classification model uses the naive Bayes algorithm, the evaluation also includes a confusion matrix.
For classification model evaluation, the two most common criteria are accuracy and the confusion matrix. Accuracy is easy to understand; as for the confusion matrix, for a classifier that outputs non-scored results it may be the most direct indicator. A confusion matrix is a crosstab of the model's output against the true expected values: each row of the matrix corresponds to a true expected value, and each column corresponds to a model output value. The element in row i, column j is the number of test samples of class i that the model assigned to class j. For a good model, the large elements of the confusion matrix concentrate on the diagonal; these diagonal elements are the numbers of samples of class i correctly assigned to class i.
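The row/column convention described above, and how accuracy falls out of the diagonal, can be verified with a tiny example (hypothetical labels, not the testnb output format):

```python
# Build a confusion matrix with rows = expected class, columns = predicted
# class, matching the convention in the text; accuracy is the diagonal sum
# over the total. Illustrative data, not Mahout's testnb output.
def confusion_matrix(expected, predicted, labels):
    index = {label: i for i, label in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for e, p in zip(expected, predicted):
        m[index[e]][index[p]] += 1
    return m

expected  = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "cat"]
m = confusion_matrix(expected, predicted, ["cat", "dog"])
accuracy = sum(m[i][i] for i in range(2)) / len(expected)
print(m)         # [[1, 1], [1, 2]]
print(accuracy)  # 0.6
```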
The naive Bayes evaluation operator computes the accuracy and confusion matrix of a naive Bayes model through a testing process. Its command is "testnb", corresponding to the training command "trainnb". The evaluation operator needs as input the feature vectors used for training, together with the trained model and the label index; during service assembly these paths are obtained directly from the parameters of preceding operators, without user input.
(1) Random forest evaluation
The random forest evaluation operator can be used to test the accuracy of a random forest model. Its command is "testforest", and its input path parameters are obtained during service assembly.
Step S33: collaborative filtering recommendation models are evaluated using the model's accuracy.
(1) Matrix factorization collaborative filtering evaluation
This operator obtains the accuracy of the recommendation model by computing RMSE and MAE; the command it runs is "evaluateFactorization". The operator's inputs are the user matrix model, the item matrix model, and the test dataset, which can be obtained in the service assembly stage from the parameters of preceding operators.
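RMSE and MAE are both averages of prediction error against the held-out test ratings, the former penalizing large errors more heavily. Their textbook definitions, with invented sample ratings:

```python
# Textbook RMSE and MAE over (actual, predicted) rating pairs — the two
# error measures named for evaluateFactorization. Sample values invented.
def rmse(actual, predicted):
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

def mae(actual, predicted):
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

actual, predicted = [4.0, 3.0, 5.0], [3.5, 3.0, 4.0]
print(round(rmse(actual, predicted), 3))  # 0.645
print(round(mae(actual, predicted), 3))   # 0.5
```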
(2) Item-based collaborative filtering recommendation evaluation
This operator tests the trained item similarity model against the test dataset to obtain the model's accuracy. Its input paths are likewise obtained in the service assembly stage.
The present invention divides the machine learning process into three stages: data preprocessing, model training, and model evaluation. Each stage has a series of operators belonging to it in the Mahout algorithm library. An overview of the Mahout machine learning service assembly method from the perspective of reuse is shown in Fig. 1.
After the above operators are encapsulated into services with a unified invocation specification, they can be used for service assembly to generate a complete full graph of machine learning workflows. Fig. 2 shows the full picture of the machine learning workflows that our encapsulated operators can assemble. The circles in the figure indicate operators that can independently perform some task; there are 25 operators in total. The lines indicate the order and flow of operator execution. The squares indicate the intermediate results output by operators; these modules exist only for the overall readability of the workflow diagram and do not represent runnable tasks. We divide the machine learning process into three stages: data preprocessing, model training, and model evaluation. The circles on the left represent data preprocessing operator services, such as SeqDirectory, which converts text document files into SequenceFile-format files; the blue circles in the middle represent model training operator services, such as the K-Means clustering operator service; and the circles on the right represent model evaluation operator services, such as InterClusterDistances, which evaluates the inter-cluster distances of a clustering result.
Considering common data mining demands and the support range of the Hadoop ecosystem, we finally chose to support five input data format types: Hive table data, CSV files, ARFF files, Lucene index files, and TXT text document files. These files are stored in HDFS, the distributed file storage system, which can very conveniently store massive data. Among the preprocessing operators, HiveAction handles Hive table data input and outputs CSV-format files to HDFS; the Lucene2Seq operator handles Lucene index files and converts them into SequenceFile-format files; SeqDirectory handles TXT text document files and converts them into SequenceFile-format files; ARFF.Vector handles ARFF-format files and generates the corresponding feature vectors; CSV2Vector and SplitDataSet handle CSV-format file input, with CSV2Vector generating feature vectors from the CSV file and SplitDataSet cutting the CSV data into a training set and a test set by a given ratio. Because these 6 operators are the entry points of the whole machine learning workflow, their data file input paths must be specified by the user, so that the system knows which data to process. The remaining 19 operators need no specified file input path, because their input is exactly the output path of some preceding operator in the workflow; during service assembly it is obtained from the preceding operator and assembled on demand.
The specific embodiment of the invention uses the Reuters-21578 news dataset, which is widely used in machine learning research. The collection and labeling of these data were originally carried out by the Carnegie Group and Reuters in the course of developing the CONSTRUE text categorization system. The Reuters-21578 dataset is divided into 22 files; each file contains 1000 documents, except the last, which contains only 578. These files are in SGML format, similar to XML, and were preprocessed into TXT documents and stored in HDFS before the start of this case study.
Since the input files are in TXT format, the SeqDirectory preprocessing operator is the natural choice as the entrance of the whole machine learning workflow. In addition, the two clustering algorithms K-Means and LDA are selected for text topic discovery, and the ClusterDumper and LDAPrintTopics operators output the final topic discovery results. Given these requirements, and the operator dependencies described in the full machine learning workflow graph, two machine learning workflow paths can be rapidly generated from the packaged operators. Removing the irrelevant operators from the original full workflow graph yields the workflow shown in Figure 3. As the figure shows, two workflow paths are obtained: SeqDirectory → Seq2Sparse → K-Means → ClusterDumper and SeqDirectory → Seq2Sparse → LDA → LDAPrintTopics.
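The first of the two assembled paths could be expressed as an Oozie workflow definition along these lines. This is only a schematic fragment, not the definition actually generated by the system; the action bodies that invoke the Mahout drivers are elided:

```xml
<workflow-app name="kmeans-text-clustering" xmlns="uri:oozie:workflow:0.5">
  <start to="SeqDirectory"/>
  <action name="SeqDirectory">
    <!-- action invoking Mahout's seqdirectory driver (elided) -->
    <ok to="Seq2Sparse"/>
    <error to="fail"/>
  </action>
  <action name="Seq2Sparse">
    <!-- TF-IDF vectorization; its input is SeqDirectory's output path -->
    <ok to="KMeans"/>
    <error to="fail"/>
  </action>
  <action name="KMeans">
    <!-- k, the distance measure and convergenceDelta are parameters -->
    <ok to="ClusterDumper"/>
    <error to="fail"/>
  </action>
  <action name="ClusterDumper">
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```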
From the workflow diagram it can be seen that six operator services are needed in total, and from these the lists of required and optional parameters of the operators are obtained. As shown in Table 1, only three parameters must be specified: the raw data input path, the K value of the K-Means algorithm, and the topic number of the LDA algorithm; the remaining parameter details can be obtained automatically in the service assembly stage.
Table 1. Parameters that must be specified for the workflow operators
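The automatic acquisition of the remaining parameters in the service assembly stage can be pictured as chaining each operator's input path to its predecessor's output path, so that only the entrance operator needs a user-supplied path. The following is an illustrative sketch; the path layout and plan structure are invented for the example:

```python
def assemble_path(operators, input_path, base="/user/ml"):
    """Chain operators into a workflow path: only the first operator's
    input must be user-specified; every later input is resolved
    automatically as the previous operator's output path."""
    plan, current = [], input_path
    for op in operators:
        out = f"{base}/{op}/output"
        plan.append({"operator": op, "input": current, "output": out})
        current = out
    return plan

plan = assemble_path(["SeqDirectory", "Seq2Sparse", "K-Means", "ClusterDumper"],
                     input_path="/user/ml/reuters-txt")
print(plan[1]["input"])  # /user/ml/SeqDirectory/output
```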
Other optional but important parameters can of course also be specified. In this example, it was unclear whether the default Euclidean distance measure or the cosine distance measure would give the better clustering effect for K-Means, so two K-Means parameter schemes were used at the same time. In particular, since the cosine distance measure was selected, the convergence threshold parameter "--convergenceDelta" of the K-Means algorithm was set to 0.1 rather than the default 0.5, because cosine distance ranges from 0 to 1. The specified parameter schemes therefore finally generate three clustering workflows: one implementing the LDA clustering process, one implementing the K-Means clustering process with the Euclidean distance measure, and one implementing the K-Means clustering process with the cosine distance measure.
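The reasoning behind lowering --convergenceDelta for the cosine measure can be checked numerically: cosine distance between non-negative TF-IDF vectors always lies in [0, 1], whereas Euclidean distance is unbounded. The following is a small stand-alone check, not Mahout's implementation:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; in [0, 1] for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def euclidean_distance(a, b):
    """Unbounded: grows with the magnitude of the vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [3.0, 0.0, 4.0], [0.0, 5.0, 5.0]
print(cosine_distance(a, b))     # stays within [0, 1]
print(euclidean_distance(a, b))  # can far exceed 1
```

Because cluster centroids can move by at most 1 under the cosine measure, a convergence threshold of 0.5 would be half the whole range; 0.1 is a tighter, more meaningful stopping criterion.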
After the three workflow paths above were executed on Oozie, their final assessment results were obtained and are listed in the tables below. Five topics are listed for each result, and each topic lists the five words with the largest weights.
As shown in Table 2, the topic recognition effect of the LDA clustering is generally quite good. From the top five keywords, the topic content of each news document cluster can largely be read off. For example, from the keywords wheat, agriculture, export and tonnes of the first cluster, it can be seen that it discusses an agricultural topic related to wheat production. From the keywords ibm, computers, att (American Telephone and Telegraph Company) and personal of the second cluster, it can be seen that it discusses a topic involving computers and the related companies. The other three clusters discuss topics related to banking and finance, oil and energy, and securities respectively, and also show a fairly clear effect.
Table 2. LDA clustering results
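The way LDAPrintTopics surfaces each topic can be sketched as taking the five highest-weight words from a topic's word distribution. This is a toy illustration; the weights below are invented for the example, not the experiment's actual values:

```python
def top_words(topic_weights, n=5):
    """Return the n highest-weight words of a topic, as the
    LDAPrintTopics output in Table 2 lists them."""
    ranked = sorted(topic_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

topic = {"wheat": 0.042, "agriculture": 0.031, "export": 0.027,
         "tonnes": 0.022, "grain": 0.018, "the": 0.002}
print(top_words(topic))  # ['wheat', 'agriculture', 'export', 'tonnes', 'grain']
```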
As shown in Table 3, the K-Means text clustering with the cosine distance measure is somewhat weaker than the LDA clustering, but still generally good. The keywords of clusters 1, 2 and 3 show that they discuss topics related to wheat production, stocks, and crude oil price rises respectively. Cluster 4 can roughly be read as discussing something that increases every year, but exactly what increases is unclear, indicating a poor clustering effect. Note also that the keywords of cluster 5 are all numbers; although the internal correlation of this cluster looks very high, it is meaningless.
Table 3. K-Means cosine distance clustering results
Finally, as shown in Table 4, the K-Means text clustering with the Euclidean distance measure is much worse than the other two. The keyword relevance of clusters 1 and 2 is barely acceptable, but it is hard to see any correlation in the keywords of clusters 3, 4 and 5; they even feel somewhat "incoherent". Words such as he, said, vs and would in particular are truly meaningless for topic identification.
Table 4. K-Means Euclidean distance clustering results
It can therefore be concluded that, among the three machine learning workflows obtained by service assembly, the LDA clustering workflow discovers text topics best, the K-Means clustering workflow with the cosine distance measure comes second, and the K-Means clustering workflow with the Euclidean distance measure is worst. If the user is satisfied with the LDA clustering results, that machine learning workflow can be saved and conveniently reused in the future. The user can of course also continue to tune the operator parameters of these workflows in the hope of achieving a better clustering effect.
In addition, the assessment results show that words such as 7-apr-1987, number, he and said appear in all three clustering results, which suggests that text tokenization and feature vector generation in the preprocessing stage were not handled well enough. Therefore, to further improve the text topic discovery effect, adjusting and debugging the parameters of the preprocessing operator Seq2Sparse would also be a good choice.
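The kind of Seq2Sparse adjustment suggested above, pruning stop words and noise tokens before vector generation, can be sketched as follows. This is only a minimal illustration: Mahout's Seq2Sparse actually exposes options such as document-frequency pruning and a pluggable Lucene analyzer, and the stop list here is merely an example:

```python
import re

# Example stop list covering the noise words observed in Tables 3-4.
STOP_WORDS = {"he", "said", "vs", "would", "the", "a", "of", "to"}

def tokenize(text):
    """Lowercase, split on non-letters (dropping numbers and dates),
    then discard stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOP_WORDS]

print(tokenize("He said wheat exports would rise 7-apr-1987 vs 1986"))
# ['wheat', 'exports', 'rise', 'apr']
```

With such filtering, tokens like "he", "said" and "1986" never reach the feature vectors, so they cannot surface as cluster keywords.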
The experiments show that this method can quickly and effectively customize and tune reusable machine learning flows, so that data mining work can be carried out efficiently on the Hadoop platform.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (4)
1. A machine learning service assembly method based on Mahout, characterized by comprising the following steps:
Step S1: preprocessing data of different formats and converting them into the feature vectors used for model training;
Step S2: carrying out clustering model, classification model and collaborative filtering recommendation model training;
Step S3: assessing the models whose training is completed;
Step S4: uniformly encapsulating the three stages of step S1, step S2 and step S3 in the form of operators over the Mahout algorithm library, so that they become a series of services conforming to the invocation specification of the Oozie workflow platform;
Step S5: according to the user's description of the machine learning method to be used and the format of the data to be processed, assembling a plurality of machine learning workflow paths that meet the demand;
Step S6: after these machine learning workflows finish running on the Hadoop platform via Oozie, the model evaluation operator of each workflow gives the assessment result of that workflow; the user selects a machine learning workflow according to the assessment results;
Step S7: storing the machine learning workflow selected by the user into a knowledge base for later reuse by the user;
wherein the evaluation of the trained models in step S3 comprises the following steps:
Step S31: clustering model assessment, performed by inspecting inter-cluster and intra-cluster distances;
Step S32: classification model assessment, performed using accuracy; if the classification model uses the Naive Bayes algorithm, the assessment also includes a confusion matrix; the confusion matrix is a cross-tabulation of the classification model's output results against the real target values, each column corresponding to a real target value and each row corresponding to an output value of the classification model;
Step S33: collaborative filtering recommendation model assessment, performed using the accuracy rate of the model;
wherein the collaborative filtering recommendation models in step S2 comprise collaborative filtering recommendation models using the matrix factorization collaborative filtering algorithm and the item-based collaborative filtering recommendation algorithm.
2. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the data preprocessing in step S1 comprises: SeqDirectory, Lucene2Seq, Seq2Sparse, Arff.Vector, Split, SplitDataSet, Describe and Hive.
3. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the clustering models in step S2 comprise clustering models using the five clustering algorithms Canopy, K-Means, Fuzzy K-Means, LDA and spectral clustering.
4. The machine learning service assembly method based on Mahout according to claim 1, characterized in that the classification models in step S2 comprise classification models using the Naive Bayes algorithm and the random forest algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611203680.2A CN107169572B (en) | 2016-12-23 | 2016-12-23 | A kind of machine learning Service Assembly method based on Mahout |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107169572A CN107169572A (en) | 2017-09-15 |
CN107169572B true CN107169572B (en) | 2018-09-18 |
Family
ID=59848573
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611203680.2A Active CN107169572B (en) | 2016-12-23 | 2016-12-23 | A kind of machine learning Service Assembly method based on Mahout |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107169572B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697447A (en) * | 2017-10-20 | 2019-04-30 | 富士通株式会社 | Disaggregated model construction device, method and electronic equipment based on random forest |
CN108897587B (en) * | 2018-06-22 | 2021-11-12 | 北京优特捷信息技术有限公司 | Pluggable machine learning algorithm operation method and device and readable storage medium |
CN108875049A (en) * | 2018-06-27 | 2018-11-23 | 中国建设银行股份有限公司 | text clustering method and device |
CN109871809A (en) * | 2019-02-22 | 2019-06-11 | 福州大学 | A kind of machine learning process intelligence assemble method based on semantic net |
CN110276018A (en) * | 2019-05-29 | 2019-09-24 | 深圳技术大学 | Personalized recommendation method, terminal and storage medium for online education system |
CN110928529B (en) * | 2019-11-06 | 2021-10-26 | 第四范式(北京)技术有限公司 | Method and system for assisting operator development |
CN111104214B (en) * | 2019-12-26 | 2020-12-15 | 北京九章云极科技有限公司 | Workflow application method and device |
CN111459820B (en) * | 2020-03-31 | 2021-01-05 | 北京九章云极科技有限公司 | Model application method and device and data analysis processing system |
CN111611239B (en) * | 2020-04-17 | 2024-07-12 | 第四范式(北京)技术有限公司 | Method, device, equipment and storage medium for predicting business |
CN112130933A (en) * | 2020-08-04 | 2020-12-25 | 中科天玑数据科技股份有限公司 | Method and device for constructing and calling operator set |
CN112183768B (en) * | 2020-10-23 | 2022-07-08 | 福州大学 | Intelligent deep learning process assembling method based on semantic net |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9176949B2 (en) * | 2011-07-06 | 2015-11-03 | Altamira Technologies Corporation | Systems and methods for sentence comparison and sentence-based search |
CN103578007A (en) * | 2012-07-20 | 2014-02-12 | 三星电子(中国)研发中心 | Mixed recommendation system and method for intelligent device |
CN104462373A (en) * | 2014-12-09 | 2015-03-25 | 南京大学 | Personalized recommendation engine implementing method based on multiple Agents |
- 2016-12-23: CN201611203680.2A patent granted as CN107169572B/en, status Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107169572B (en) | A kind of machine learning Service Assembly method based on Mahout | |
Xiao et al. | Feature-selection-based dynamic transfer ensemble model for customer churn prediction | |
CN109471938A (en) | A kind of file classification method and terminal | |
AlQahtani | Product sentiment analysis for amazon reviews | |
CN112559900B (en) | Product recommendation method and device, computer equipment and storage medium | |
Syamala et al. | A Filter Based Improved Decision Tree Sentiment Classification Model for Real-Time Amazon Product Review Data. | |
Idris et al. | Ensemble based efficient churn prediction model for telecom | |
CN109766911A (en) | A kind of behavior prediction method | |
CN107169061A (en) | A kind of text multi-tag sorting technique for merging double information sources | |
CN109977225A (en) | Public opinion analysis method and device | |
Sana et al. | A novel customer churn prediction model for the telecommunication industry using data transformation methods and feature selection | |
Gabbay et al. | Isolation forests and landmarking-based representations for clustering algorithm recommendation using meta-learning | |
Zheng et al. | Joint learning of entity semantics and relation pattern for relation extraction | |
Jesus et al. | Dynamic feature selection based on pareto front optimization | |
Silva et al. | Developing and Assessing a Human-Understandable Metric for Evaluating Local Interpretable Model-Agnostic Explanations. | |
Raman et al. | Multigraph attention network for analyzing company relations | |
BURLĂCIOIU et al. | TEXT MINING IN BUSINESS. A STUDY OF ROMANIAN CLIENT’S PERCEPTION WITH RESPECT TO USING TELECOMMUNICATION AND ENERGY APPS. | |
Gupta et al. | Artificial intelligence based predictive analysis of customer churn | |
Sinaga et al. | Sentiment Analysis on Hotel Ratings Using Dynamic Convolution Neural Network | |
Kumar et al. | Opinion Mining on Amazon Musical Product Reviews using Supervised Machine Learning Techniques | |
Sunil et al. | Customer review classification using machine learning and deep learning techniques | |
Alloghani et al. | Sentiment analysis for decision-making using machine learning algorithms | |
Mollah et al. | Adapting Contextual Embedding to Identify Sentiment of E-commerce Consumer Reviews with Addressing Class Imbalance Issues | |
Kumar et al. | Shareable Representations for Search Query Understanding | |
Martino et al. | A Hybrid Score to Optimize Clustering Hyperparameters for Online Search Term Data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||