CN107103050A - A big data modeling platform and method

Info

Publication number: CN107103050A
Application number: CN201710211258.XA
Authority: CN
Inventors: 林伟豪; 李学辉; 李敬涛
Original assignee: Haitong Constant (dalian) Big Data Technology Co Ltd
Current assignee: Haitong Constant (dalian) Big Data Technology Co Ltd
Priority date: 2017-03-31
Filing date: 2017-03-31
Publication date: 2017-08-29

Abstract

A big data modeling platform and method. The platform includes: a data assets module, used to upload data sources, in which user data is updated to the cloud platform by manual upload or automatic update and users process the data they have uploaded by drag-and-drop modeling; a data cleansing module, used to perform ETL processing on data sources, finding and correcting identifiable errors in data files, including checking data consistency and handling invalid and missing values; a data check module, used to perform detection and basic statistics on the data; an algorithm module, which models massive data using classical classification or clustering algorithms from machine learning and then makes predictions with the model; and a front-end display module, used to present processed or unprocessed data graphically. The invention covers structured data modeling, data presentation, self-service intelligent analysis, and drag-and-drop data presentation and modeling.

Description

A big data modeling platform and method

"Technical field"

The invention belongs to technical fields such as electronic information and big data, and in particular relates to a big data modeling platform and method for the collection, storage, analysis, and presentation of big data.

"Background"

With the rapid development of the Internet, an enormous volume of data is produced every day. Before big data technology appeared, traditional data processing ran into many bottlenecks. First, with a traditional database, a very large data volume pushes storage to its upper limit; the remedy is to switch to larger-capacity hard disks, but this is very costly. Second, a single computer cannot process large data volumes quickly, so processing speed also becomes a bottleneck.

At present, big data technology can resolve many bottlenecks of traditional information technology infrastructure, such as poor scalability, poor fault tolerance, low performance, and difficult installation, deployment, and maintenance. Storing data in the Hadoop HDFS distributed file system gives good scalability and high fault tolerance. Computing over large data sets (larger than 1 TB) in parallel with Hadoop MapReduce increases computing speed and performance. The Sqoop component transfers data between traditional databases and Hadoop. However, existing big data technology is not easy for non-technical personnel to use.

"Summary of the invention"

The present invention aims to provide a big data modeling platform and method that cover structured data modeling, data presentation, self-service intelligent analysis, and drag-and-drop data presentation and modeling, and that can, within a very short time, produce for business decision-makers a management cockpit and an ad hoc query and analysis decision platform that provide a basis for decisions. The object of the invention is achieved by the following technical solution:

A big data modeling platform, including:

a data assets module, used to upload data sources; user data is updated to the cloud platform by manual upload or automatic update, and users process the data they have uploaded by drag-and-drop modeling;

a data cleansing module, used to perform ETL processing on data sources, finding and correcting identifiable errors in data files, including checking data consistency and handling invalid and missing values;

a data check module, used to perform detection and basic statistics on the data;

an algorithm module, which models massive data using classical classification or clustering algorithms from machine learning and then makes predictions with the model;

a front-end display module, used to present processed or unprocessed data graphically.

As a specific technical solution, the data assets module supports three ways of uploading data: local file upload, underlying data upload, and database upload, where database upload supports three databases: MySQL, Oracle, and SQL Server.

As a specific technical solution, the data cleansing module includes an SQL processing submodule, a sampling submodule, a classification summary submodule, a data merging submodule, a duplicate deletion submodule, a data partition submodule, a sorting submodule, a data discretization submodule, a data standardization submodule, a variable filtering submodule, a transposition submodule, a field rearrangement submodule, a missing value processing submodule, an outlier processing submodule, a search-and-replace submodule, a variable insertion submodule, a weighting submodule, a sample balancing submodule, and a word segmentation parsing submodule. The SQL processing submodule is used to edit and execute SQL statements directly; the sampling submodule samples the data using different sampling methods; the classification summary submodule computes the contents of field variables in a table by mean, count, or sum and generates the corresponding label variable columns, where the summary variable and the calculation variable are configurable; the data merging submodule appends the data of two tables either by row records or by column variables, and when row records are appended the column variable names must be kept consistent, otherwise new variable columns are added; the duplicate deletion submodule deletes duplicate contents in the selected variable; the data partition submodule specifies the number or proportion of sample data in the training partition and the test partition; the sorting submodule arranges the contents of the selected variable in ascending or descending order; the data discretization submodule discretizes and classifies the selected continuous variable columns by equal-width binning or equal-frequency binning; the data standardization submodule applies to the selected variable columns (numeric types only) either 0-1 standardization, whose results fall in the interval [0, 1], or Z standardization, after which the data have mean 0 and standard deviation 1, as in the standard normal distribution; the variable filtering submodule deletes the selected variable columns; the transposition submodule transposes all rows and columns of the data, that is, swaps rows and columns; the field rearrangement submodule rearranges the positions of the column variables in the data; the missing value processing submodule deletes row records in which the selected variable is empty; the outlier processing submodule deletes outliers at a set ratio according to an outlier identification rule, where the identification rules include standard deviation and quantile, that is, data lying beyond a certain multiple of the standard deviation from the mean, or beyond a certain quantile, are identified as abnormal; the search-and-replace submodule searches the contents of the selected variable according to a set condition and replaces matches with a target value; the variable insertion submodule performs arithmetic on the selected variables to generate a new variable column, for which the column name is entered manually and the arithmetic expression is edited in the formula box; the weighting submodule weights the selected variable by the weight value entered as the weight factor; the sample balancing submodule searches the selected numeric variable column for target data according to a set condition and weights the target data by the weight factor entered; the word segmentation parsing submodule parses the text content of the selected field and generates one row record per parsed term.

As a specific technical solution, the data check module includes a data audit submodule, a frequency analysis submodule, and a descriptive statistics submodule; the data audit submodule computes and checks the sample indicators of the selected variable, including valid values, invalid values, null values, and their percentages; the frequency analysis submodule counts how often each distinct content value occurs in the selected variable; the descriptive statistics submodule computes statistics such as the mean, mode, median, and sum for the specified variable columns.

As a specific technical solution, the algorithm model module includes an Apriori algorithm submodule, a K-means algorithm submodule, a naive Bayes algorithm submodule, a logistic regression algorithm submodule, a ridge regression algorithm submodule, a LASSO algorithm submodule, and a linear regression algorithm submodule; the Apriori algorithm submodule counts the frequency of combined associated field contents, performs the probability calculation for two-item frequent sets on the contents of the dimension fields, and derives analysis indicators such as support; the K-means algorithm submodule divides the data of the selected fields into n clusters, where the number of clusters, the number of iterations, and the random number parameter are configurable, so that the data converge; the naive Bayes, logistic regression, ridge regression, LASSO, and linear regression submodules are all used to fit a model that is then used to make predictions on new data.
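The two-item frequent set calculation mentioned for the Apriori submodule can be illustrated with a minimal sketch. It shows only the first two passes of Apriori (frequent single items, then pairs built from them) on in-memory lists; the function name `frequent_pairs`, the sample transactions, and the threshold are hypothetical, and the patent's own implementation runs on a Spark cluster rather than in plain Python.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs whose support >= min_support."""
    n = len(transactions)
    # Pass 1: count single items and keep only the frequent ones.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    # Pass 2: count pairs built only from frequent items (the Apriori pruning step).
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}

# Hypothetical sample data: each inner list is one record's item set.
baskets = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["bread", "eggs"],
    ["milk"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
```

Support here is the fraction of records containing both items; other indicators such as confidence could be derived from the same counts.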

As a specific technical solution, the front-end display module lets users run experiments by dragging and dropping visual operation components, performs data cleansing and algorithm analysis according to an established business model, and presents the result data set in multiple dimensions through the charts of the connected front-end display module.

As a specific technical solution, the multi-dimensional visual presentation includes: 1. presenting different data structures intuitively with different chart subtypes; 2. displaying multi-dimensional data by drill-down; 3. presenting special data in customized form through node-defined display forms; 4. presenting data more effectively through the linked display of multiple charts.

A modeling method based on the above big data modeling platform, with the following steps:

First, the data source to be processed is uploaded to the platform using the add-data-source node in data assets, for later use;

Then, according to the needs of the business scenario, the data is cleaned using the function nodes of the data cleansing module; for example, missing value processing deletes rows in which a field is empty, and variable filtering retains the fields the business needs and deletes the others; a series of such operations yields data in the desired format;

Next, if the business scenario does not require an algorithm, the final data can be presented directly using the graphics of the front-end display module; the front-end display function nodes offer many kinds of graphics, and different ones are selected as needed to present the data; if an algorithm is required, an algorithm node must be added;

Finally, once the nodes have been selected, the data source node, the data cleansing nodes, the algorithm node (if any), and the front-end display node are connected and saved; clicking Run executes the whole flow, and the final data is displayed graphically.

In summary, the invention is highly flexible across the whole data processing flow, and users can complete workflows that match their different needs: the data source upload stage offers several upload methods; the data cleansing stage offers several processing methods; the algorithm stage likewise includes several algorithms; and the data presentation stage includes several kinds of graphics. The invention supports a variety of mainstream core algorithm libraries out of the box, making big data analysis simpler and more accessible; users with very little knowledge of statistics and data mining can easily use the platform to perform data mining and modeling analysis on big data.

"Brief description of the drawings"

Fig. 1 is a schematic diagram of importing data from a database into HDFS using Sqoop in the big data modeling platform provided by an embodiment of the present invention.

Fig. 2 is a schematic diagram of uploading a file as a data stream in the big data modeling platform provided by an embodiment of the present invention.

"Detailed description"

Embodiments of the present invention are further described below with reference to the accompanying drawings:

The big data modeling platform provided by this embodiment includes: data assets, data cleansing, data check, algorithm model, and front-end display. Each module is described in detail below:

The data assets module is used to upload data sources. User data is updated to the cloud platform by manual upload or automatic update, and users can process the data they have uploaded by drag-and-drop modeling.

The data cleansing module is used to perform ETL processing on data sources, finding and correcting identifiable errors in data files, including checking data consistency and handling invalid and missing values.

The data check module is used to perform detection and basic statistics on the data.

The algorithm module models massive data using classical classification or clustering algorithms from machine learning and then makes predictions with the model.

The front-end display module presents processed or unprocessed data graphically, giving users a more intuitive view of the data.

The data assets module supports three ways of uploading data: local file upload, underlying data upload, and database upload, where database upload supports three databases: MySQL, Oracle, and SQL Server.

The data cleansing module includes modules for SQL processing, sampling, classification summary, data merging, duplicate deletion, data partition, sorting, data discretization, data standardization, variable filtering, transposition, field rearrangement, missing value processing, search-and-replace, variable insertion, weighting, sample balancing, and word segmentation parsing. SQL processing edits and executes SQL statements directly. Sampling samples the data using different sampling methods (take 1 in N, random %, and so on). Classification summary computes the contents of field variables in a table by mean, count (summary), or sum (total) and generates the corresponding label variable columns; the summary variable and the calculation variable are configurable. Data merging appends the data of two tables either by row records or by column variables; when row records are appended, the column variable names must be kept consistent, otherwise new variable columns are added. Duplicate deletion deletes duplicate contents in the selected variable. Data partition specifies the number or proportion of sample data in the training partition and the test partition. Sorting arranges the contents of the selected variable in ascending or descending order. Data discretization discretizes and classifies the selected continuous variable columns by equal-width binning (bin width) or equal-frequency binning (number of bins). Data standardization applies to the selected variable columns (numeric types only) either 0-1 standardization (results fall in the interval [0, 1]) or Z standardization (the data have mean 0 and standard deviation 1, as in the standard normal distribution). Variable filtering deletes the selected variable columns. Transposition transposes all rows and columns of the data, that is, swaps rows and columns. Field rearrangement rearranges the positions of the column variables in the data. Missing value processing deletes row records in which the selected variable is empty. Outlier processing deletes outliers at a set ratio according to an outlier identification rule (the identification rules include standard deviation and quantile, that is, data lying beyond a certain multiple of the standard deviation from the mean, or beyond a certain quantile, are identified as abnormal). Search-and-replace searches the contents of the selected variable according to a set condition and replaces matches with a target value. Variable insertion performs arithmetic on the selected variables to generate a new variable column (alias); the name of the variable column is entered manually and the arithmetic expression is edited in the formula box. Weighting weights the selected variable by the weight value entered as the weight factor. Sample balancing searches the selected numeric variable column for target data according to a set condition and weights the target data by the weight factor entered. Word segmentation parsing parses the text content of the selected field and generates one row record per parsed term.
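The standardization and discretization options described above can be written out as a short sketch. This is an illustration on plain Python lists, not the platform's Hive-based implementation; the function names are hypothetical, and the population standard deviation is an assumption, since the text does not say which variant is used.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """0-1 standardization: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Z standardization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def equal_width_bins(values, width):
    """Equal-width binning: the bin index grows by one every `width` units."""
    lo = min(values)
    return [int((v - lo) // width) for v in values]

def equal_frequency_bins(values, n_bins):
    """Equal-frequency binning: each bin receives about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins
```

Equal-width binning is parameterized by the bin width and equal-frequency binning by the number of bins, matching the two parameters named in the text.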

The data check module includes data audit, frequency distribution, and descriptive statistics modules. Data audit computes and checks the sample indicators of the selected variable (valid values, invalid values, null values, and their percentages). Frequency analysis counts how often each distinct content value occurs in the selected variable. Descriptive statistics computes statistics such as the mean, mode, median, and sum for the specified variable columns.

The algorithm model module includes modules for the Apriori algorithm, the K-means algorithm, the naive Bayes algorithm, the logistic regression algorithm, the ridge regression algorithm, the LASSO algorithm, and the linear regression algorithm. The Apriori algorithm counts the frequency of combined associated field contents, performs the probability calculation for two-item frequent sets on the contents of the dimension fields, and derives analysis indicators such as support. The K-means algorithm divides the data of the selected fields into n clusters; the number of clusters, the number of iterations, and the random number parameter can be configured so that the data converge. Naive Bayes, logistic regression, and the remaining algorithms are all treated as classification algorithms with similar basic ideas: each fits a model that is then used to make predictions on new data.
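As a rough illustration of the three configurable K-means parameters, the sketch below runs the algorithm on in-memory tuples. It assumes that the "random number parameter" acts as a random seed, which the text does not state; the function names and sample points are hypothetical, and the platform itself runs its algorithms on Spark.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10, seed=0):
    """Return (centers, assignment) after a fixed number of iterations."""
    rng = random.Random(seed)          # assumed meaning of the random number parameter
    centers = rng.sample(points, k)    # pick k distinct points as initial centers
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Move each center to the arithmetic mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignment = kmeans(points, k=2, iterations=10, seed=0)
```

A fixed iteration count keeps the sketch short; a production version would typically also stop early once the assignments no longer change.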

The front-end display module is a platform that integrates data analysis and presentation. Users run experiments by dragging and dropping visual operation components, so that engineers without a machine learning background can also get started with data mining easily. The platform performs data cleansing and algorithm analysis according to an established business model and presents the result data set in multiple dimensions through the charts of the connected front-end display module:

Different data structures are presented intuitively with different chart subtypes.

Multi-dimensional data is displayed by drill-down.

Special data is presented in customized form through node-defined display forms.

Data is presented more effectively through the linked display of multiple charts.

Data charts are converted into data queries, and the individual data items interact and link with one another, showing the trend, proportion, and relationship of the data from different angles under different dimensional indicators, which helps users identify trends and discover the knowledge and patterns behind the data. Besides conventional presentations such as pie charts, bar charts, heat maps, and geographic information maps, data can also be analyzed in a series of graphics through image color, brightness, size, shape, movement trend, and other means, helping users discover associations between data through interaction. Drill-up and drill-down of data and multi-dimensional parallel analysis are also supported, so that data drives decisions.

Visualization can give users an overall view and then, through zooming and filtering, provide the deeper details they need. The visualization process plays a key role in helping people use big data to obtain more complete customer information. Intricate relationships are an important part of many big data scenarios, and social networks are perhaps the most notable example: understanding the big data information in them through text or tables is extremely difficult, whereas visualization can show the trends and inherent patterns of these networks far more clearly. Cloud-based visualization methods are commonly used to depict the relationships between social network users. By describing the hierarchical relationships of user nodes in a social network with a correlation model, such a method can show users' social relationships intuitively. In addition, it can parallelize the visualization process using the cloud-based Hadoop software platform, accelerating the collection of social network big data.

Big data visualization can be realized in many ways, such as displaying data from multiple angles, focusing on dynamic changes in large volumes of data, and filtering information (including dynamic query filtering, starfield display, and tight coupling). The following visualization methods are analyzed and classified according to data type (large-volume data, changing data, and dynamic data):

Treemap: a space-filling visualization method for hierarchical data.

Circle packing: a direct alternative to the treemap. It uses circles as the primitive shape and can introduce more circles from higher levels of the hierarchy.

Sunburst: a treemap visualization converted to a polar coordinate system. The variable parameters change from width and height to radius and arc length.

Parallel coordinates: extends visual analysis to multiple data factors across different objects.

Streamgraph: a kind of stacked area chart in which the data is laid out around a central axis, giving a flowing, organic shape.

Circular network diagram: data is arranged around a circle and connected by curves according to the degree of correlation. The correlation between data objects is usually conveyed by line width or color saturation.

The main functional modules of the big data modeling platform have been described above. Besides these functions, the invention also discloses the overall workflow, with the following steps:

First, the data source to be processed is uploaded to the platform using the add-data-source node in data assets, for later use.

Then, according to the needs of the business scenario, the data is cleaned using the function nodes of the data cleansing module; for example, missing value processing deletes rows in which a field is empty, and variable filtering retains the fields the business needs and deletes the others. A series of such operations yields data in the desired format.

Next, if the business scenario does not require an algorithm, the final data can be presented directly using the graphics of the front-end display module. The front-end display function nodes offer many kinds of graphics, and different ones are selected as needed to present the data. If an algorithm is required at this point, an algorithm node must be added, such as naive Bayes or linear regression.

Finally, once the nodes have been selected, the data source node, the data cleansing nodes, the algorithm node (if any), and the front-end display node must be connected and saved; clicking Run executes the whole flow, and the final data is displayed graphically.
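The workflow above, in which connected nodes run in order from data source to display, can be sketched as a simple chain of functions. Everything in this example is hypothetical: the node functions, the sample rows, and the text bar chart stand in for the platform's drag-and-drop nodes and graphical output, and no algorithm node is included, which corresponds to the case where the business scenario does not need one.

```python
from typing import Any, Callable, List, Tuple

Node = Tuple[str, Callable[[Any], Any]]

def run_flow(nodes: List[Node]) -> Any:
    """Run connected nodes in order, feeding each node's output to the next."""
    data = None
    for _name, step in nodes:
        data = step(data)
    return data

def source(_):
    # Data source node: hypothetical uploaded rows.
    return [
        {"city": "Dalian", "sales": 12, "note": "a"},
        {"city": None, "sales": 7, "note": "b"},
        {"city": "Beijing", "sales": 30, "note": "c"},
    ]

def drop_empty_city(rows):
    # Missing value processing: delete rows whose field is empty.
    return [r for r in rows if r["city"] is not None]

def keep_fields(rows):
    # Variable filtering: retain only the fields the business needs.
    return [{"city": r["city"], "sales": r["sales"]} for r in rows]

def bar_chart(rows):
    # Front-end display node: a text bar chart stands in for real graphics.
    return "\n".join(f'{r["city"]:<8}{"#" * (r["sales"] // 3)}' for r in rows)

flow = [
    ("data source", source),
    ("missing values", drop_empty_city),
    ("filter variables", keep_fields),
    ("front-end display", bar_chart),
]
chart = run_flow(flow)
```

An algorithm node would simply be one more entry in `flow`, placed between the cleansing nodes and the display node.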

The function and implementation principle of each module are described in further detail below:

1. data assets

Because the data sources of the modeling platform are not uniform, different data sources must be converted into a unified data source. Data can come from relational databases such as Oracle, MySQL, and SQL Server, or from files in formats such as txt and csv, or new data sources can be formed by processing existing data sources. All of them are converted into unified Hive data sources, which supply data for the modeling platform's flow processing.

(1) Relational databases are handled with Sqoop. Sqoop imports a table from a database through a MapReduce job, which extracts records from the table row by row and writes them to HDFS, as shown in Fig. 1.

Before the import starts, Sqoop uses JDBC to examine the table to be imported, retrieving all of its columns and their SQL data types. These SQL types (VARCHAR, INTEGER, and so on) are mapped to Java data types (String, Integer, and so on), and the MapReduce application uses the corresponding Java types to hold the field values. Sqoop's code generator uses this information to create a class corresponding to the table, which holds the records extracted from it.

(2) Files are uploaded as a data stream and handled with Hadoop: the local file is uploaded to HDFS by MapReduce and is then stored in Hive under the specified table name and field names, as shown in Fig. 2.

(3) Existing data sources are handled with Hive: a Hive SELECT statement creates a new Hive table from the original data source, producing a new data source; an existing data source can also be used directly.
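A Sqoop import of the kind described is normally started from the command line. The sketch below only assembles such a command as an argument list and does not run it. The JDBC URL, user name, password file, table, and target directory are hypothetical placeholders, and the small type table repeats only the two mappings named in the text.

```python
# SQL-to-Java type mappings mentioned in the text; the real mapping covers more types.
SQL_TO_JAVA = {"VARCHAR": "String", "INTEGER": "Integer"}

def sqoop_import_args(jdbc_url, username, password_file, table, target_dir, mappers=1):
    """Build the argument list for a `sqoop import` of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string of the source database
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                  # table whose rows are extracted
        "--target-dir", target_dir,        # HDFS directory that receives the records
        "--num-mappers", str(mappers),     # number of parallel map tasks
    ]

# Hypothetical values, for illustration only.
args = sqoop_import_args(
    "jdbc:mysql://db.example.com/sales",
    "etl_user",
    "/user/etl/.password",
    "orders",
    "/data/orders",
)
```

The list could be passed to `subprocess.run(args, check=True)` on a machine where Sqoop is installed; the platform described here presumably wraps a similar call behind its data source node.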
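Creating a new Hive table from a SELECT over an existing one corresponds to a `CREATE TABLE ... AS SELECT` statement. The helper below merely builds such a statement as a string; the function name, table names, and columns are hypothetical, and the sketch does no identifier quoting or input validation, so it is not suitable for untrusted input.

```python
def create_table_as_select(new_table, source_table, columns, where=None):
    """Build a HiveQL CREATE TABLE AS SELECT statement as a plain string."""
    cols = ", ".join(columns)
    sql = f"CREATE TABLE {new_table} AS SELECT {cols} FROM {source_table}"
    if where:
        sql += f" WHERE {where}"  # optional row filter
    return sql

# Hypothetical table and column names.
stmt = create_table_as_select(
    "orders_2017", "orders", ["order_id", "amount"], where="year = 2017"
)
```

The resulting string would be submitted to Hive through whichever client the deployment uses.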

2. data cleansing

To meet the big data modeling platform's requirement for data consistency, a data cleansing function is provided, mainly including SQL processing, sampling, classification summary, data set merging, data partition, sorting, data discretization, data standardization, variable filtering, transposition, field rearrangement, weighting, and sample balancing.

(1) SQL processing creates a new data source from the original data source using a Hive SELECT statement.

(2) Sampling uses Hive to sample the original data source, producing a new data source.

(3) Classification summary groups the data by the summary variable and computes the mean, count, or sum of the calculation variable.

(4) Data set merging is either row record appending or column variable appending; column variable appending requires a merge variable to be selected. Hive performs the processing and produces the data set.

(5) Duplicate deletion uses Hive to filter out duplicate data according to the deduplication variable.

(6) Data partition partitions the data according to the specified training sample.

(7) Sorting orders the data source by the processing variable.

(8) Data discretization produces discrete data from the processing variable to form the result set.

(9) Data standardization produces a data set from the processing variable using the selected standardization method.

(10) Variable filtering removes the variables marked for deletion from the result set, producing a new data source.

(11) Transposition swaps all rows and columns. After transposition, the newly generated columns are named transposition_1, transposition_2, …, transposition_15.

(12) Field rearrangement changes the order of the specified fields.

(13) Missing value processing removes data according to the processing variable, producing a new result set.

(14) Search-and-replace checks the processing variable and replaces its value where the condition is met.

(15) Outlier processing handles the processing variable using the exclusion mode and the identification rule.

(16) Variable insertion inserts a new variable computed from the original columns.

(17) Weighting applies a weight factor to the processing variable.

(18) Sample balancing attaches a condition to the processing variable and, where the condition is met, converts the value according to the factor.

Data cleansing is mainly implemented with Hive SQL. The data removed as not meeting requirements are mainly incomplete data, erroneous data, and duplicate data; processing can also be carried out on the basis of the original data.
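Three of the cleansing operations listed above (missing value processing, duplicate deletion, and outlier processing by the standard deviation rule) can be sketched on in-memory rows as follows. The platform performs these steps in Hive SQL, so this is only an illustration of the logic: the function names, the sample rows, and the multiple `k` are hypothetical, and the sketch drops every row beyond `k` standard deviations instead of deleting a set ratio of outliers.

```python
from statistics import mean, pstdev

def drop_missing(rows, column):
    """Missing value processing: remove rows whose `column` is empty."""
    return [r for r in rows if r.get(column) not in (None, "")]

def drop_duplicates(rows, column):
    """Duplicate deletion: keep the first row for each distinct value of `column`."""
    seen, kept = set(), []
    for r in rows:
        if r[column] not in seen:
            seen.add(r[column])
            kept.append(r)
    return kept

def drop_outliers(rows, column, k=2.0):
    """Outlier processing: remove rows more than k standard deviations from the mean."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    return [r for r in rows if abs(r[column] - mu) <= k * sigma]

# Hypothetical sample rows.
rows = [
    {"id": 1, "x": 10}, {"id": 2, "x": None}, {"id": 1, "x": 10},
    {"id": 3, "x": 11}, {"id": 4, "x": 9}, {"id": 5, "x": 10},
    {"id": 6, "x": 100},
]
clean = drop_outliers(drop_duplicates(drop_missing(rows, "x"), "id"), "x", k=1.5)
```

Note that a single extreme value inflates the standard deviation itself, which is why a fairly small multiple is needed in this tiny sample; a quantile-based rule is less sensitive to that effect.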

3. data check

To meet the big data modeling platform's data processing requirements, a data check function is provided. It makes data processing more flexible, and data that do not meet the indicators can be designated as invalid. It mainly includes data audit, frequency distribution, and descriptive statistics.

(1) Data audit processes the data source with Hive SELECT statements. Invalid value detection falls into two classes: field type detection and numeric detection. The indicators are valid samples, valid sample %, null values, null value %, invalid values, and invalid value %; the data are grouped and summarized by the processing variable to produce the percentages and then the result set.

(2) Frequency distribution processes the data source with Hive SELECT statements, grouping by the processing variable and obtaining the count for each group.

(3) Descriptive statistics processes the data source with Hive SELECT statements. It first computes the mode, mean, median, sum, maximum, minimum, and range of the processing variable; the standard error of the mean can also be obtained. Using the percentile and percentile_approx functions provided by Hive, it obtains statistics such as quartiles and quintiles.
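The three data check functions can be mirrored on a single in-memory column as a small sketch. The platform computes them with Hive SELECT statements; here the function names and the validity rule passed to `audit` are hypothetical, and Python's `statistics.quantiles` may interpolate differently from Hive's percentile functions.

```python
from collections import Counter
from statistics import mean, median, multimode, quantiles

def audit(values, is_valid):
    """Data audit: counts and percentages of valid, null, and invalid values."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    invalid = sum(1 for v in values if v is not None and not is_valid(v))
    valid = total - nulls - invalid

    def pct(n):
        return 100.0 * n / total if total else 0.0

    return {
        "valid": valid, "valid_pct": pct(valid),
        "null": nulls, "null_pct": pct(nulls),
        "invalid": invalid, "invalid_pct": pct(invalid),
    }

def frequency(values):
    """Frequency distribution: how often each distinct value occurs."""
    return Counter(values)

def describe(values):
    """Descriptive statistics for a numeric column without nulls."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),          # list, since several values may tie
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "quartiles": quantiles(values, n=4),  # three cut points
    }
```

For example, `audit([1, None, -5, 7], lambda v: v >= 0)` treats negative numbers as invalid under that hypothetical rule and reports each class as both a count and a percentage.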

4. algorithm model

Logical perseverance big data Modeling Platform is a machine learning algorithm platform based on Distributed Calculation engine.User passes through The dilatory visual operating assembly of mode dragged is tested so that the engineer without machine learning background can also be easily Left-hand seat plays data mining well.Platform provide Apriori, K_means, naive Bayesian, logistic regression, ridge regression, LASSO, The abundant machine language such as linear regression.

(1) algorithm model is mainly realized using Spark technologies, and the data set and training data of preparation are submitted into Spark Cluster efficient process simultaneously obtains result set.

(2) algorithm is realized using Java voices, and the algorithm routine of realization is broken into jar bags first disposes respectively with platform, And then Modeling Platform and the degree of coupling of algorithm are reduced, in Deployment Algorithm, do not interfere with the use of platform.

(3) An algorithm is executed by submitting a task to the Spark cluster, whose distributed computation allows fast and effective iterative calculation.

The algorithm model module of the big data modeling platform is implemented mainly by programming against the Spark MLlib API. Spark's in-memory computing engine is fast, and the module covers many machine learning algorithms: Apriori, kmeans, naive Bayes, logistic regression, ridge regression, lasso, and others. The platform divides these algorithms into two classes, classification and clustering: kmeans belongs to clustering, and the other algorithms listed above are treated as classification. Because the two classes follow different logic in the code, the technical issues involved in the algorithm model module are explained below from these two aspects.

Clustering algorithm

Clustering (cluster analysis) has the following core task: divide a group of target objects into several clusters so that objects within a cluster are as similar as possible and objects in different clusters are as different as possible. Formally, the clustering problem is: given a set D of elements, each with n observable attributes, use some algorithm to divide D into k subsets such that the dissimilarity between elements within each subset is as low as possible and the dissimilarity between elements of different subsets is as high as possible. Each subset is called a cluster.

Kmeans is an iterative reassignment clustering algorithm based on squared error. Its core idea is very simple:

(1) Randomly select K center points.

(2) Compute the distance from every point to each of the K center points, and assign each point to the cluster of its nearest center point.

(3) Recompute the center of each of the K clusters as the arithmetic mean of its points.

(4) Repeat steps 2 and 3 until the clusters no longer change or the maximum number of iterations is reached.

(5) Output the result.
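The five steps above can be sketched in plain Python as follows. This is a minimal illustration, not the platform's Spark-based implementation: the function name `kmeans`, the fixed random `seed`, the use of squared Euclidean distance, and the rule of keeping the old center when a cluster becomes empty are all assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):                    # step 4: repeat up to max_iter times
        # step 2: assign each point to its nearest center (squared Euclidean distance)
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:         # clusters no longer change
            break
        assignment = new_assignment
        # step 3: recompute each center as the arithmetic mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                          # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment                   # step 5: output the result


data = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
centers, labels = kmeans(data, k=2)
```

With these two well-separated one-dimensional groups, the first three points end up in one cluster and the last three in the other, whichever two points are drawn as initial centers.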

The quality of a Kmeans result depends on the choice of initial cluster centers, and the algorithm easily falls into a local optimum. There is no established criterion for choosing K, the algorithm is sensitive to abnormal data, it can only handle numeric attributes, and the resulting clusters may be unbalanced.

The Kmeans algorithm flow and technical details are described below. First obtain the data source: add a data source in data assets and drag it onto the canvas. Then connect a data cleansing node to perform the necessary ETL processing so that the data source meets the requirements of the algorithm. After the data cleansing node runs, the data processed by this node is stored in the cluster's Hive data warehouse, where the algorithm can read it.

The algorithm part is described in detail here. Once the data source node and the data cleansing node are complete, they need to be connected to a kmeans algorithm node. Drag the kmeans algorithm node onto the canvas and double-click it to open the configuration page, which contains: 1. the columns on which to run kmeans, because actual business requirements do not necessarily use every column; 2. the number of clusters, that is, how many classes the current data source should finally be grouped into; 3. the maximum number of iterations the algorithm may perform; 4. the number of random runs. After configuration, click Save, then click Run to start the program.

In the back-end code, when the program determines that nodeType (the node type) is K_Means, it enters the stepKmeans method of KmeansServiceImpl. This method first obtains the parameters set on the configuration page and stores them through the set methods of a KmeansInfo instance, so that all parameters required by the kmeans algorithm are held in that KmeansInfo instance. It then calls the toKmeansString method to obtain the parameters as a space-separated string.
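The idea of collecting the node configuration in one object and flattening it into a space-separated argument string can be sketched as follows. The class `KmeansParams`, its field names, and the order of the arguments are hypothetical; the source only states that KmeansInfo holds the kmeans parameters and that toKmeansString returns them separated by spaces.

```python
from dataclasses import dataclass


@dataclass
class KmeansParams:
    """Hypothetical stand-in for the KmeansInfo object described above."""
    table: str             # Hive table written by the cleansing node (assumed field)
    columns: list          # columns selected for clustering
    k: int                 # number of clusters
    max_iterations: int    # maximum number of iterations
    runs: int              # number of random runs

    def to_arg_string(self):
        """Join the parameters with spaces so they can be passed to a job."""
        return " ".join([
            self.table,
            ",".join(self.columns),
            str(self.k),
            str(self.max_iterations),
            str(self.runs),
        ])


params = KmeansParams("clean_result", ["age", "income"], 3, 20, 5)
arg_string = params.to_arg_string()
```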

Next, the formatTableData method in DataRevert is executed. This method performs feature conversion: the data source inevitably contains strings, whereas the Spark kmeans algorithm requires data of type double, so whether this method succeeds is critical to the algorithm.
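One simple way to turn string fields into doubles is to give each distinct string in a column a numeric index. The sketch below illustrates that idea only; the source does not say how formatTableData performs the conversion, and the function name `encode_columns` and the index-encoding scheme are assumptions.

```python
def encode_columns(rows):
    """Convert every cell to float; string categories get one index per column."""
    n_cols = len(rows[0])
    mappings = [{} for _ in range(n_cols)]    # per-column mapping: category -> index
    encoded = []
    for row in rows:
        out = []
        for i, cell in enumerate(row):
            try:
                out.append(float(cell))       # numeric value or numeric string
            except (TypeError, ValueError):
                # assign the next unused index to a category not seen before
                index = mappings[i].setdefault(cell, len(mappings[i]))
                out.append(float(index))
        encoded.append(out)
    return encoded, mappings


rows = [["25", "male", "3.5"], ["31", "female", "4.0"], ["40", "male", "2.5"]]
vectors, mappings = encode_columns(rows)
```

Every resulting row is a list of floats, which is the shape a numeric clustering routine expects. Index encoding imposes an arbitrary order on categories, so it is a convenience rather than a recommendation.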

The algorithm jar is executed in Spark's yarn-client submission mode, whose benefit is that the jar can be run directly without writing a script. KMeansInfo in Co-Insight-mllib.jar is then executed, with its parameters passed in from the web side. The main idea is to read the selected fields from the specified Hive table and convert the data into the required vector format. A KMeansModel is generated from the training data with the KMeans.train API. This is the most important step of the whole algorithm, because only once the model has been generated can its predict method determine the cluster of each data record. Finally, the result is stored in HDFS, and a Hive table is created to read the HDFS data. To display the data, the result Hive table and the prediction data are merged and shown. This completes the basic kmeans (clustering) algorithm.

Classification algorithm

What is a classification algorithm? Put simply, it assigns an object with certain characteristics to one category in a known set of categories. From a mathematical point of view, it can be defined as follows:

Given the sets C = {y1, y2, ..., yn} and I = {x1, x2, ..., xm, ...}, determine a mapping rule y = f(x) such that for any xi ∈ I there is one and only one yj ∈ C for which yj = f(xi) holds.

Here C is the set of categories, I is the set of objects to be classified, and f is the classifier. The main task of a classification algorithm is to construct the classifier f.

Constructing a classifier usually requires a set with known categories for training, and a trained classifier generally cannot reach 100% accuracy. The quality of a classifier often depends on factors such as the training data, the validation data, and the training sample size.

For example, when we meet a stranger in daily life, the first thing we do is judge the person's sex, and that judgment is a classification process. From everyday experience, the three factors of hair length, clothing, and build are usually enough to judge a person's sex. This "everyday experience" is a trained model for judging sex, and its training data is the variety of people encountered in daily life. Suppose that one day a person walks up to you with long flowing hair and close-fitting clothes but a very masculine build. You are now unsure: your past experience, that is, the trained model, cannot judge this person's sex. You then learn to judge by the Adam's apple, and the quality of your trained model improves. Undeniably, though, there will always be someone whose sex you cannot judge. A model therefore can never be 100% accurate; it can only approach 100% accuracy as the training data keeps growing.

The classification algorithms differ in that their underlying implementations in Spark MLlib differ. When the API is called, only the parameters of the training method vary slightly between algorithms, and the rest of the program logic is largely the same. The naive Bayes algorithm is therefore used as an example and described in detail here.

Naive Bayes classification is also called the NB algorithm. Its core idea is very simple: for a given item to be predicted, compute the probability that the item belongs to each category, and choose the category with the highest probability as the predicted category. For instance, if you estimate that the person in the example above is a woman with probability 40% and a man with probability 41%, you judge the person to be a man.
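The "compute a probability per category, then pick the largest" idea can be sketched as a small categorical naive Bayes classifier in plain Python. This is an illustration under stated assumptions, not the Spark MLlib implementation used by the platform: the function names `train_nb` and `predict_nb`, the additive smoothing parameter `alpha`, and the toy weather data are all made up for the example.

```python
import math
from collections import Counter


def train_nb(rows, labels, alpha=1.0):
    """Fit a categorical naive Bayes model with additive smoothing."""
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # feature_counts[c][i][v] = number of class-c rows whose feature i equals v
    feature_counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    vocab = [set() for _ in range(n_features)]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[c][i][v] += 1
            vocab[i].add(v)
    return class_counts, feature_counts, vocab, alpha


def predict_nb(model, row):
    """Return the class with the highest score, plus the log score of every class."""
    class_counts, feature_counts, vocab, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)                   # log prior
        for i, v in enumerate(row):
            num = feature_counts[c][i][v] + alpha
            den = n_c + alpha * len(vocab[i])
            score += math.log(num / den)                # log likelihood of feature i
        scores[c] = score
    return max(scores, key=scores.get), scores


train_rows = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"),
              ("rain", "strong"), ("sunny", "weak")]
train_labels = ["yes", "yes", "no", "no", "yes"]
model = train_nb(train_rows, train_labels)
predicted, scores = predict_nb(model, ("rain", "weak"))
```

For the item ("rain", "weak"), the unnormalized score is 0.15 for "no" and 0.072 for "yes", so "no" is chosen, even though neither score is close to certainty.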

The naive Bayes algorithm flow and technical details are described below. The classification flow differs from the clustering flow: it requires two data sources. One data source carries a label column and serves as training data; the other has an empty label column and serves as prediction data. The two data sources must have identical field names and types. The merge data set function of data cleansing is then used to merge the two data sources into a single Hive table for subsequent processing; appending row records is normally used here.

ETL operations can be performed on the merged data set to clean the data. Then drag the naive Bayes node from the algorithm models onto the canvas and connect it. On the configuration page of the naive Bayes node you can select the label column, the columns to be used by the algorithm, the alpha attribute, and the training data ratio. After configuration, save and run; the execution logic on the web side is essentially the same as that of the kmeans algorithm.

The algorithm code is in NativeBayes in Co-Insight-mllib.jar. When selecting training data, it splits the previously merged Hive table by whether the label column is empty, yielding the training data set and the data set to be predicted. It then trains a NaiveBayesModel with the NaiveBayes.train method and uses the model's predict method to classify the prediction set. Finally, the result is stored in a Hive table for later front-end display.
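The split of the merged table by whether the label column is empty can be sketched as follows. The function name `split_by_label`, the column name `label`, and the representation of rows as dictionaries are assumptions made for the example; in the platform the data lives in a Hive table.

```python
def split_by_label(rows, label_key="label"):
    """Rows with a label form the training set; rows with an empty label are to be predicted."""
    training, to_predict = [], []
    for row in rows:
        if row.get(label_key) in (None, ""):
            to_predict.append(row)
        else:
            training.append(row)
    return training, to_predict


merged = [
    {"f1": 1.0, "f2": 0.0, "label": "A"},
    {"f1": 0.0, "f2": 1.0, "label": "B"},
    {"f1": 0.9, "f2": 0.1, "label": None},
]
training, to_predict = split_by_label(merged)
```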

In summary, the algorithm module relies mainly on calls to the Spark MLlib API. The algorithm implementations are highly reusable and quick to develop, model training is efficient and makes good use of cluster resources, and the module largely satisfies the need for general-purpose algorithms.

5. Front-end display

The platform provides a rich set of display tools. Using more vivid and appealing forms, it promptly reveals the insights hidden behind rapidly changing and complex business data. Whether in transportation, communications, or other fields, helping business personnel discover and diagnose business problems through interactive real-time data visualization has increasingly become a crucial part of a big data solution. The display tools mainly include tables, column charts, bar charts, line charts, scatter charts, bubble charts, wormhole charts, and geographic distribution maps.

(1) Table display presents data in tabular form.

(2) Column charts, bar charts, line charts, pie charts, area charts, scatter charts, and the like display data along an X axis and a Y axis.

(3) Ring charts are drawn from a category variable and a summary variable; the calculation method is count or sum.

(4) Radar charts are drawn from a category variable, comparison parameter 1, and comparison parameter 2; the calculation method is sum, maximum, or mean.

Front-end display is implemented mainly with the jQuery front-end technology.

The big data modeling platform of this application can be used easily by non-technical personnel without knowledge of the underlying big data technology. The platform uses the Gooflow workflow technology, so a big data processing pipeline can be built simply by dragging and connecting nodes for data sources, data processing, algorithms, and data display. The platform mainly uses the Hive data warehouse to store data, and the data processing part is implemented directly with Hive SQL statements, which copes well with very large data volumes and performs well. The algorithm part of the platform is implemented with Spark MLlib. Spark's advantage is that the intermediate output of a job can be kept in memory, so HDFS no longer needs to be read and written; computation is memory-based and runs efficiently. Spark's machine learning library includes a wide variety of algorithms, and its classification, clustering, and other algorithms can meet users' needs. By creating and connecting data source, data processing, algorithm, and data display nodes, a one-stop data analysis flow is realized and the business requirement is fulfilled.

The above embodiments are intended only to fully disclose the invention, not to limit it. Any replacement by equivalent technical features that is based on the inventive concept of the invention and requires no creative work shall be regarded as falling within the scope disclosed by this application.

Claims

1. A big data modeling platform, characterized by comprising:

a data assets module for uploading data sources, wherein user data is uploaded to the cloud platform by manual upload or automatic update, and the user processes the uploaded data by manual drag-and-drop modeling;

a data cleansing module for performing ETL processing on the data source, finding and correcting identifiable errors in the data file, including checking data consistency and handling invalid values and missing values;

an algorithm module for modeling massive data with classical classification or clustering algorithms from machine learning and then making predictions with the model;

2. The big data modeling platform according to claim 1, characterized in that the data assets module supports three ways of uploading data: local file upload, underlying data upload, and database upload, wherein database upload supports three databases: MySql, Oracle, and Sqlserver.

3. The big data modeling platform according to claim 2, characterized in that the data cleansing module comprises a Sql processing submodule, a sampling submodule, a classified summary submodule, a merge data submodule, a delete duplicates submodule, a data partition submodule, a sorting submodule, a data discretization submodule, a data standardization submodule, a filter variables submodule, a transpose submodule, a field reordering submodule, a missing value processing submodule, an outlier processing submodule, a find and replace submodule, an insert variable submodule, a weighting submodule, a sample balancing submodule, and a word segmentation analysis submodule; the Sql processing submodule is used to directly edit and execute Sql statements; the sampling submodule is used to sample data with different sampling methods; the classified summary submodule is used to aggregate the contents of field variables in a table by mean, count, or sum and generate corresponding label variable columns, wherein the summary variable and the calculation variable are configurable; the merge data submodule is used to merge the data of two tables by appending row records or appending column variables, wherein column variable names must be kept consistent when appending row records, otherwise new variable columns are added; the delete duplicates submodule is used to delete duplicate contents in the selected variables; the data partition submodule is used to specify the quantity or proportion of sample data in the training partition and the test partition; the sorting submodule sorts the contents of the selected variables in ascending or descending order; the data discretization submodule is used to discretize and grade selected continuous variable columns by equal-width or equal-frequency binning; the data standardization submodule, which supports only numeric variable columns, applies either 0-1 standardization, after which the results fall in the interval [0,1], or Z standardization, after which the data conforms to a standard normal distribution with mean 0 and standard deviation 1; the filter variables submodule is used to delete the selected variable columns; the transpose submodule is used to transpose all rows and columns of the data, that is, row-column conversion; the field reordering submodule is used to rearrange the positions of the column variables in the data; the missing value processing submodule is used to delete row records in which a selected variable is empty; the outlier processing submodule is used to delete outliers at a set ratio according to an outlier identification rule, the identification rules including standard deviation and quantile, that is, data beyond a certain multiple of the standard deviation from the mean, or beyond a quantile, is identified as abnormal; the find and replace submodule is used to look up the contents of the selected variables according to set conditions and replace them with a target value; the insert variable submodule is used to perform arithmetic on the selected variables and generate a new variable column, with the name of the variable column entered manually and the arithmetic formula edited in the algorithm box; the weighting submodule is used to weight the selected variable, the weight value being entered as the weight factor; the sample balancing submodule is used to look up target data in the selected numeric variable column according to set conditions and, given an input weight factor, weight the target data; the word segmentation analysis submodule is used to parse the text content of the selected segmentation field and generate row records from the parsed terms.

4. The big data modeling platform according to claim 3, characterized in that the data check module comprises a data audit submodule, a frequency analysis submodule, and a descriptive statistics submodule; the data audit submodule is used to compute and check sample indicators for the selected variables, the indicators including valid values, invalid values, null values, and their proportions; the frequency analysis submodule is used to count how often each content value occurs in the selected variables; the descriptive statistics submodule is used to compute the mean, mode, median, and sum of the specified variable columns.

5. The big data modeling platform according to claim 4, characterized in that the algorithm model module comprises an Apriori algorithm submodule, a Kmeans algorithm submodule, a naive Bayes algorithm submodule, a logistic regression algorithm submodule, a ridge regression algorithm submodule, a LASSO algorithm submodule, and a linear regression algorithm submodule; the Apriori algorithm submodule is used to count frequencies over the contents of the association field, calculate probabilities for frequent 2-item sets over the contents of the dimension field, and derive analysis indicators such as support; the Kmeans algorithm submodule is used to divide the data of the selected fields into n clusters, with the number of clusters, the number of iterations, and the number of random runs configurable, thereby grouping the data; the naive Bayes, logistic regression, ridge regression, LASSO, and linear regression algorithm submodules are all used to fit a classification model in order to make predictions on new data.

6. The big data modeling platform according to claim 5, characterized in that the front-end display module allows experiments to be run by dragging and dropping visual operation components, data cleansing and algorithm analysis are performed according to an established business model, and the result data set is presented as a multi-dimensional visualization through the charts of the connected front-end display module.

7. The big data modeling platform according to claim 6, characterized in that the multi-dimensional visualization comprises: 1. presenting different data structures intuitively with different chart types; 2. displaying multi-dimensional data by drill-down; 3. presenting special data in customized form with custom display nodes; 4. presenting data more effectively through linked display of multiple charts.

8. A modeling method based on the big data modeling platform of claim 1, comprising the following steps:

first, the data source to be processed is uploaded to the platform with the add data source node in data assets, for subsequent use;

then, according to the needs of the business scenario, the data is cleaned with the functional nodes of the data cleansing module, for example using missing value processing to delete rows in which a field is empty, or using filter variables to retain the fields the business needs and delete the other fields, and after this series of processing the data in the desired format is obtained;

next, if the business scenario does not require an algorithm, the charts of the front-end display module can be used directly for the final data display, wherein the front-end display functional nodes offer many chart types and different charts are selected for data presentation as required; if an algorithm is required, an algorithm node needs to be added;

finally, once the respective nodes have been selected, the data source node, the data cleansing node, the algorithm node (if any), and the front-end display node are connected and saved, the whole flow is run by clicking Run, and the final data is presented graphically.