CN107194490A - Predictive modeling optimization - Google Patents

Predictive modeling optimization

Info

Publication number
CN107194490A
CN107194490A (application CN201611262212.2A)
Authority
CN
China
Prior art keywords
data
client
data processing
processing operation
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611262212.2A
Other languages
Chinese (zh)
Other versions
CN107194490B (en)
Inventor
A.麦克沙恩
J.多恩胡
B.拉米
A.卡米
N.杜利安
A.阿卜杜勒拉赫曼
L.奥洛格姆
F.马利
M.凯雷斯
E.马凯德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Business Objects Software Ltd
Original Assignee
Business Objects Software Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 15/261,215 (granted as US10789547B2)
Application filed by Business Objects Software Ltd filed Critical Business Objects Software Ltd
Publication of CN107194490A publication Critical patent/CN107194490A/en
Application granted granted Critical
Publication of CN107194490B publication Critical patent/CN107194490B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0202Market predictions or forecasting for commercial activities

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Accounting & Taxation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Techniques are described for identifying an input training dataset stored on an underlying data platform, and sending instructions to the data platform that are operable by the data platform to train a predictive model based on the input training dataset by delegating one or more data processing operations to multiple nodes on the data platform.

Description

Predictive modeling optimization
Cross-reference to related applications
This application claims priority to U.S. Provisional Patent Application Serial No. 62/307,971, entitled "Predictive Modeling Optimization," and U.S. Provisional Patent Application Serial No. 62/307,671, entitled "Unified Client for Distributed Processing Platform," both of which were filed on March 14, 2016. The full contents of both provisional applications are incorporated into this application by reference. This application is also related to U.S. Patent Application No. ______, entitled "Unified Client for Distributed Processing Platform," the entire contents of which are incorporated herein by reference.
Technical field
This specification relates to optimizing predictive modeling.
Background
Predictive modeling is the process of using statistical and mathematical methods to analyze data, find patterns, and produce models that can help predict specific outcomes. For commercial purposes, a predictive model is typically built on a sample of historical data and can then be applied to different datasets, generally involving current data or events.
Summary
Innovative aspects of the subject matter described in this specification can be embodied in methods that include the following actions: identifying an input training dataset stored on an underlying data platform; and sending instructions to the data platform, the instructions operable by the data platform to train a predictive model based on the input training dataset by delegating one or more data processing operations to multiple nodes on the data platform. Other embodiments of these aspects include corresponding systems, apparatus, and computer programs, encoded on computer storage devices, that are configured to perform the actions of the methods.
These and other embodiments can each optionally include one or more of the following features. For example: applying the predictive model to a business dataset identifies one or more outcomes, each outcome associated with a probability of occurrence. The data platform includes an open-source cluster computing framework. The open-source cluster computing framework includes Apache Spark. The method is performed independently of any transfer of the input training dataset from the data platform. The one or more processing operations include computing one or more statistics associated with the input training dataset to reduce the number of variables used to generate the predictive model. The one or more processing operations include encoding the data of the input training dataset, including converting alphanumeric data into numeric data. The one or more processing operations include performing a covariance matrix calculation and a matrix inversion calculation on the input training dataset. The one or more processing operations include slicing the input training dataset and scoring the predictive model on the slices. The one or more processing operations include recomputing one or more statistics based on the one or more outcomes. The one or more processing operations include iteratively assessing the performance of the predictive model based on structural risk minimization.
Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. For example, compared to traditional learning techniques, the learning stage of predictive modeling can often be reduced to a tenth of its duration or less. The performance and scalability limitations that arise with traditional learning techniques can be mitigated by shifting work from a predictive server or desktop computer to a database server or a distributed processing platform (e.g., Apache Hadoop). Embodiments of the subject matter can be introduced into existing predictive modeling software without major changes. In contrast to traditional learning techniques, data transfer requirements can be reduced or eliminated; training can therefore be performed on larger datasets and the solution scales to big data. Optimizing the training process also enables scalability to wider datasets (e.g., resulting from the data preparation stage). For example, a training dataset with 50,000 columns can be used in embodiments to train a predictive model.
Moreover, model training is conventionally performed client-side, which requires large datasets to be communicated from data storage to the client and thus consumes substantial network bandwidth. In some embodiments, at least some of the processing is performed on a distributed processing platform (e.g., a Hadoop cluster) and some is performed by a client application (e.g., a modeler), reducing the network bandwidth that would otherwise be needed to transfer large datasets to the client application and perform the modeling operations solely client-side. In some examples, the more data-intensive and/or processing-intensive steps can be executed on the cluster to exploit the cluster's greater processing capability. Also, because the cluster can be closer to the data storage in the network topology, executing the more data-intensive operations on the cluster can avoid consuming the network bandwidth that, under conventional training techniques, would be consumed by round-trip communication of large amounts of data between the data storage and the modeler. Embodiments can also provide security advantages, given that in-database analysis (e.g., on the cluster) can avoid communicating data over potentially insecure channels. Moreover, sensitive and/or private data such as personally identifiable information (PII) can be processed more securely on the cluster than on other systems.
Embodiments also provide further advantages with respect to the machine learning used in predictive modeling. For example, at least some of the more complex and/or intensive internal steps of machine learning, such as encoding and/or other data preparation operations, can be performed without any user interaction, e.g., these steps may be hidden from the end user. Embodiments can also employ one or more optimizations, which can be applied lazily. Such optimizations can include reducing the dimensionality of the dataset being analyzed to provide high modeler performance. Given that a model may be over-fitted to the specific training set used to train it, a simpler model (e.g., with reduced dimensionality) is, according to the structural risk minimization (SRM) principle, generally more useful and more robust when handling new data.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Brief description of the drawings
Fig. 1 and Fig. 2 depict example environments for in-database modeling.
Figs. 3A-3D depict example process flows for in-database modeling.
Fig. 4 depicts an example process for in-database modeling.
Fig. 5 depicts an example computing system that can be used to implement the techniques described herein.
Fig. 6 depicts an example system including a unified client for a distributed processing platform, according to embodiments of the disclosure.
Fig. 7A depicts an example system including an application that uses the unified client, according to embodiments of the disclosure.
Fig. 7B depicts an example flowchart of a process for data processing using the unified client, according to embodiments of the disclosure.
Fig. 8 depicts an example class diagram according to embodiments of the disclosure.
Detailed description
There are many distinct approaches to predictive modeling. For example, regression modeling predicts values, while classification identifies hidden groups in the data. Moreover, machine learning algorithms, techniques, and implementations vary widely, ranging from off-the-shelf methods (e.g., the k-means algorithm in R) to proprietary methods. In particular, proprietary methods can use machine learning techniques such as Vapnik-Chervonenkis theory and Structural Risk Minimization to build models of better quality and wider applicability. The quality and robustness of a model can be analyzed in terms of: i) quality, e.g., how well the model describes the existing data, which is achieved by minimizing the empirical error; and ii) reliability or robustness, e.g., how well the model will predict when applied to new data, which is achieved by minimizing unreliability. With respect to predictive modeling, classical predictive modeling solutions connect to a relational database management system (RDBMS) through database connectivity such as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC), pull the data into memory, and post-process the data there.
Predictive modeling can therefore be data-intensive. In particular, the data preparation stage and the learning (training) stage may require many scans of the same data and many computations for each individual input parameter. For example, the cross-statistics step of an algorithm may need to compute statistics for each input variable against each target variable. As shown in the table below, for an input dataset with N input variables, T target variables, and R rows, the cross-statistics calculation is performed N x T x R times.
Row  Input variable 1  Input variable 2  Input variable 3  ...  Input variable N  Target variable 1  Target variable 2
1    A                 12                76.2              ...  Complete          99.67              Product D
2    R                 87                98.2              ...  Prepare           142.32             Product X
...
R    B                 4                 62.5              ...  Complete          150.1              Product A
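As an informal sketch (not part of the patent text; the column names are illustrative), the cross-statistics step can be expressed in a few lines of Python: for every (input variable, target variable) pair, the full table is scanned and value-pair counts are accumulated, which is where the N x T x R cost arises.

```python
from collections import Counter

# Toy dataset shaped like the table above (names are illustrative).
rows = [
    {"iv1": "A", "iv2": 12, "tv1": "Product D"},
    {"iv1": "R", "iv2": 87, "tv1": "Product X"},
    {"iv1": "B", "iv2": 4,  "tv1": "Product A"},
    {"iv1": "A", "iv2": 12, "tv1": "Product D"},
]

input_vars = ["iv1", "iv2"]   # N = 2
target_vars = ["tv1"]         # T = 1

def cross_statistics(rows, input_vars, target_vars):
    """Count co-occurrences of each (input value, target value) pair.

    One full pass over the R rows is made for every (input, target)
    combination, i.e. N x T x R cell visits in total.
    """
    stats = {}
    for iv in input_vars:
        for tv in target_vars:
            counts = Counter((row[iv], row[tv]) for row in rows)
            stats[(iv, tv)] = counts
    return stats

stats = cross_statistics(rows, input_vars, target_vars)
```

Delegating this step to the data platform avoids shipping all R rows to the client once per (input, target) combination.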
Conventional architecture designs employ a layered approach, where the data source sits in one layer and data processing in another architectural layer. This separation can also be depicted as a landscape in which the data resides in a database (a database server computer or server cluster) and data processing occurs on a separate machine (e.g., a server or desktop computer). In some examples, communication between the layers occurs via SQL, with connectivity enabled using technologies such as JDBC and ODBC. However, when this architecture is applied to predictive modeling software, it introduces performance and scalability limitations, because the entire training dataset must be transmitted across the network from the database to a different machine for processing. Accordingly, depending on the algorithm or method used, the performance penalty of transferring the complete training dataset may be incurred repeatedly during the learning (training) stage. Further, in some examples, performance and scalability may be limited by this data-transfer architecture when the data processing occurs on hardware (such as a user's desktop computer or a single server computer) that is weaker than the processing capability typically available on a database server/cluster or an Apache Hadoop cluster. In addition, this data transmission approach may not scale well with growing throughput demands (e.g., the number of models to be built in a day and the number of users of the system when modeling).
Embodiments provide automated in-database predictive modeling that overcomes, or at least mitigates, the shortcomings of conventional architecture designs. The modeling can be performed in a big data environment to overcome the performance and scalability limitations of modeling in conventional architectures, such as those described above. Traditional modeling may be performed client-side, which requires large datasets to be communicated from data storage to the client and thus consumes substantial network bandwidth. In some embodiments, at least some of the processing is performed on the cluster and some is performed by a client application (e.g., a modeler), reducing the network bandwidth needed to transfer large datasets to the client application and perform the modeling operations solely client-side. In some examples, the more data-intensive and/or processing-intensive steps can be executed on the cluster to exploit the cluster's greater processing capability. Moreover, because the cluster can be closer to the data storage in the network topology, performing the more data-intensive operations on the cluster can avoid the network bandwidth that would otherwise be consumed by round-trip communication of large amounts of data between the data storage and the modeler. As described herein, in-database modeling is modeling performed at least in part on the cluster (e.g., a distributed processing platform) that also stores the data being analyzed. In-database modeling can thus provide security advantages, given that in-database analysis can avoid communicating data over potentially insecure channels. Moreover, sensitive and/or private data such as personally identifiable information (PII) can be processed more securely on the cluster than on other systems.
In-database modeling
Fig. 1 shows an example environment 100 for in-database modeling. Specifically, the environment 100 includes a server computing system 102 and a data platform 104. The server computing system 102 can include one or more computing systems, including a cluster of computing systems. The data platform 104 can include one or more computing systems (e.g., nodes), including multiple user-facing computing systems. The server computing system 102 can include an automated modeler 106, which includes a modeling service 108. The data platform 104 can include an RDBMS 110, one or more structured query language (SQL) engines 112, and a data warehouse 114. The engines 112 can be described as big data SQL engines. In some examples, the engines 112 can include Apache Spark or Apache Hive. Although embodiments of the disclosure are discussed herein with reference to a data platform 104 that is an example distributed processing platform (e.g., the Hadoop framework developed by the Apache Software Foundation), it is contemplated that embodiments of the disclosure can be realized using any appropriate distributed processing platform. Although the server computing system 102 is described as a server, the system 102 and/or the modeling service 108 can act as a client when interacting with the data platform 104.
Fig. 2 shows an example environment 200 for in-database modeling, similar to the environment 100. The environment 200 includes an automated analysis module 202 and a cluster 204. The cluster 204 can include a distributed processing platform for data processing. In some embodiments, the cluster 204 is an Apache Hadoop cluster. The automated analysis module 202 includes a modeler 206. In some embodiments, the modeler 206 is a C++ modeler. The modeler can include a connectivity module 208 and a driver 210. In some embodiments, the connectivity module 208 is an ODBC connectivity module. In some embodiments, the driver 210 is a Spark Driver (JNI) module. In some examples, the cluster 204 includes a data warehouse 212, a cluster manager 214, a module 216 associated with native modeling jobs, and a distributed file system 218. In some embodiments, the data warehouse 212 is an Apache Hive data warehouse. The connectivity module 208 can establish a connection (e.g., an ODBC connection) to the data warehouse 212. In some embodiments, the cluster manager 214 is a YARN cluster manager. The driver 210 can create a (e.g., YARN) connection to the cluster manager 214. In some embodiments, the module 216 is an Apache Spark module and the associated modeling jobs are native Spark modeling jobs. In some embodiments, the file system is the Apache Hadoop Distributed File System (HDFS). In some embodiments, the automated analysis module 202 communicates with the cluster 204. Specifically, the connectivity module 208 communicates with the (e.g., Apache Hive) data warehouse 212, and the (e.g., Spark) driver 210 communicates with the (e.g., YARN) cluster manager 214. The input training dataset (e.g., a business dataset) can be transmitted via the connectivity module 208 and/or the driver 210, over one or both of the connections established by these modules. Further, the data warehouse 212 and the module 216 can communicate with the distributed file system 218, e.g., for in-database modeling. In some embodiments, the communication between the cluster 204 and the automated analysis module 202 can use a unified client, as described below.
The analysis module 202 can use the ODBC connection to interact with the (e.g., Hive) data warehouse 212 to fetch the result sets of processing performed on the cluster 204 by the native modeling job(s) (e.g., Spark job(s)). The YARN connection can be used to request jobs on the cluster 204, e.g., runs of the native modeling jobs. The results of the native modeling jobs (e.g., Spark job(s)) can be written to the file system 218 (e.g., HDFS). In some examples, the results can be copied from the file system 218 to the data warehouse 212 so that they are accessible to the (possibly unified-client-enabled) automated analysis module 202.
In some examples, the in-database modeling performed in the environment 100 is associated with approaches that perform data processing close to the data source. In some examples, the in-database modeling of the environment 100 is associated with the use of in-database processing for predictive modeling. Predictive modeling can include generating database-specific code (e.g., SQL or stored procedures) to delegate the modeling process (e.g., the modeling process in the environment 100) in a language optimized for the data platform 104.
In some examples, the in-database modeling associated with the environment 100 can include a data preparation stage, a learning (training) stage, a scoring stage, and/or a retraining stage. The data preparation stage is associated with cleaning the data and handling outliers in the data. The data preparation stage can also involve using data manipulation (e.g., SQL window functions) to increase the number of input variables to help find patterns in the data, e.g., discovering patterns of purchasing behavior over a month rather than minute-level patterns. The learning (training) stage is associated with the algorithms and techniques applied to the input training dataset. In some examples, the process of building a model can be iterative, in order to identify a suitable model. This can be performed by software, using business domain knowledge, or by manually changing the model inputs. In addition, the learning (training) stage can relate to concepts such as over-fitting and robustness. Further, the results of model building can include output that can be used in the scoring stage. The scoring stage is associated with the trained model. The model can be embedded in a business application or used as a microservice to predict outcomes for given inputs. The retraining stage is associated with ensuring that an existing model remains accurate and provides accurate predictions for new data, including model comparison and re-triggering the learning process in view of more recent data.
Performance characteristics of in-database modeling
In some embodiments, the data preparation stage of in-database modeling can increase the number of input variables to produce a statistically more robust model with better lift. For example, increasing the number of input variables (e.g., columns) of a data source tenfold, from 200 to 2,000 variables, can be used to discover patterns in time windows measured in minutes or days. In some examples, data manipulation functions can use SQL processing in the modeling software to produce the additional input variables. This correspondingly increases the data size and the performance requirements of the learning process.
In some embodiments, automated machine learning approaches that need minimal input from the user and minimal machine learning knowledge, such as structural risk minimization, are enabled for scalability toward higher throughput and an overall simpler process, allowing more roles within an enterprise to use predictive modeling. The results of the automated model building process can use quantitative measurements of model quality (error) and robustness (on new datasets) to help the user find the best model.
In some embodiments, the in-database modeling approach provides for delegating the data-intensive steps of the predictive modeling process to the underlying data platform, such as an Apache Hadoop cluster and/or a database. The data-intensive steps are primarily those that would otherwise require transfer of the complete training dataset. In some embodiments, the in-database modeling approach minimizes the number of processing steps, including by reusing, in the data source (e.g., the underlying data platform), results from the learning (training) stage. This reduces the processing cost of recomputation in subsequent steps. For example, the results of a processing step can be cached (stored in temporary tables) for later reuse.
In some embodiments, characteristics of the data source (e.g., the data platform 104 and/or the data warehouse 114) can be used to assist in-database modeling. In some examples, the database platform associated with the data platform 104 can include native low-level language libraries (e.g., in C++) whose functions can be used to support in-database modeling. For example, as described further below, the covariance matrix calculation step, when run on a (e.g., big data) data source, can be delegated to Apache Spark MLlib (machine learning library). Further, in some examples, an RDBMS 110 such as Teradata includes functions for optimized matrix calculations.
In some embodiments, the in-database modeling steps may be logged to enable performance tuning of the in-database modeling, where the logged characteristics include run time, CPU, and memory footprint. In some embodiments, in-database modeling can be transparent to end users of existing software, thus providing the same (or similar) user interface and use of database connectivity.
In some embodiments, configuration can be used to further tune the operation of each modeling step in the data source to further improve performance. For example, when a modeling step is delegated to Apache Spark, the number of allocated Spark executors, the number of cores, and the memory can typically be fine-tuned.
Process flows for in-database modeling
Linear or polynomial regression analysis can be used to predict relationships between variables and forms the basis for building regression and classification models. A linear regression model is expressed in the following form:
Y = b0 + b1X1 + b2X2 + b3X3 + …
where X1, X2, X3, … are the predictor variables (features) and Y is the target variable.
When the coefficients corresponding to each variable (b1, b2, b3, …) and the intercept (b0) are known, the linear regression model is defined.
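For illustration only (this sketch is not part of the patent text), once the intercept and coefficients are known, scoring a row is a dot product plus the intercept:

```python
def predict(b0, coeffs, features):
    """Apply a linear regression model Y = b0 + b1*X1 + b2*X2 + ..."""
    return b0 + sum(b * x for b, x in zip(coeffs, features))

# Hypothetical model: Y = 1.0 + 2.0*X1 + 3.0*X2, applied to X1=4, X2=5.
y = predict(1.0, [2.0, 3.0], [4.0, 5.0])  # 1 + 8 + 15 = 24.0
```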
Fig. 3A shows an example process flow 300 for in-database modeling, as performed by the environment 100 and/or the environment 200. In step 302, data preparation and the cross-statistics calculation of the data are performed. For example, data manipulation is applied to increase the number of input variables, commonly using SQL. Further, data manipulation can include combining input variables. For example, the variables "age" and "marital status" can be combined, because together they may have a more predictive influence on the target variable "salary."
Data preparation may further include slicing the data so that a model derived from one slice can be compared, as part of the learning (training) stage, with another slice to check robustness. Data preparation may further include handling data outliers, such as "null" values in the database. In some examples, such values can be kept and categorized. Data preparation may further include variable binning to reduce the number of discrete values associated with the data, putting close or correlated values into groups (e.g., bins). The cross-statistics calculation of the data can include computing statistics such as the count and distribution of a particular input variable's values against each target variable. This can be used to assist the variable reduction process, which reduces the number of input variables.
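The binning step can be sketched as follows (an informal example, not the patent's implementation; the bin edges and labels are invented):

```python
def bin_value(value, edges, labels):
    """Assign a numeric value to a labelled bin; `edges` are upper bounds.

    Values above the last edge fall into the final bin, so `labels` has
    one more entry than `edges`.
    """
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]

# Hypothetical age bins: <=25, 26-40, 41-60, >60.
edges = [25, 40, 60]
labels = ["young", "adult", "middle", "senior"]

binned = [bin_value(a, edges, labels) for a in [22, 35, 41, 70]]
```

Replacing many distinct ages with four bin labels reduces the number of discrete values the learning stage must consider.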
In step 304, data encoding is performed. Specifically, data encoding transforms alphanumeric data into numbers. For example, a sample SQL formula for encoding the "age" variable is (AGE - AVG(AGE)) / SQRT(VAR(AGE)).
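The same standardization can be sketched in Python (illustrative only; sample variance is used here, matching the usual behavior of SQL's VAR):

```python
def standardize(values):
    """Encode values as (x - mean) / sqrt(sample variance)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
    sd = var ** 0.5
    return [(x - mean) / sd for x in values]

encoded = standardize([20, 30, 40])  # mean 30, sample variance 100, sd 10
```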
In step 306, the covariance matrix calculation is performed. The covariance matrix is the matrix whose element at position i, j is the covariance between the i-th and j-th variables. For example, the covariance between variable X1 and variable X2 can be defined as:

cov(X1, X2) = Σ (x1,i - mean(X1)) (x2,i - mean(X2)) / (n - 1), for i = 1 … n
In addition, a matrix inversion calculation is performed. Specifically, the coefficients can be calculated using the following formula:

β^ = C^-1 Z'Y, with C = Z'Z

where C is the covariance matrix of all predictors (Z holding the centered predictor values), β^ is the vector of coefficients (b1, b2, …), and Z' denotes the transpose of the matrix Z. The constant term b0 is the difference between the mean of Y and the mean of the predictions estimated from Xβ^.
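As an illustrative sketch (pure Python, not the patent's implementation), the covariance and matrix inversion steps can be carried out on a small two-predictor dataset; the 2x2 system is solved with Cramer's rule for brevity:

```python
def column_means(X):
    """Mean of each column of the predictor matrix X."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]

def fit_linear(X, y):
    """Fit Y = b0 + b1*X1 + b2*X2 via the normal equations (two predictors)."""
    mx = column_means(X)
    my = sum(y) / len(y)
    Z = [[row[0] - mx[0], row[1] - mx[1]] for row in X]  # centered predictors
    # Z'Z (proportional to the covariance matrix C) and Z'Y.
    a = sum(z[0] * z[0] for z in Z)
    b = sum(z[0] * z[1] for z in Z)
    d = sum(z[1] * z[1] for z in Z)
    v0 = sum(z[0] * yi for z, yi in zip(Z, y))
    v1 = sum(z[1] * yi for z, yi in zip(Z, y))
    det = a * d - b * b                 # matrix inversion step (2x2)
    b1 = (v0 * d - b * v1) / det
    b2 = (a * v1 - b * v0) / det
    b0 = my - b1 * mx[0] - b2 * mx[1]   # intercept from the means
    return b0, b1, b2

# Data generated from Y = 1 + 2*X1 + 3*X2, so the fit should recover it.
X = [[0, 1], [1, 0], [2, 2], [3, 5]]
y = [4, 3, 11, 22]
b0, b1, b2 = fit_linear(X, y)
```

In the patent's setting, the Z'Z and Z'Y aggregations are what get delegated to the cluster (e.g., Spark MLlib), since they require a full pass over the training data.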
In step 308, scoring of the predictive model is performed on the data slice previously generated for checking the robustness of the predictive model. In step 310, recomputation of the cross statistics is performed using the predicted values. In step 312, a performance comparison is performed. Specifically, the performance of the predictive model is iteratively assessed based on structural risk minimization. In some embodiments, the results of processing steps can be cached (stored in temporary tables) for later reuse and/or for use by other steps. As shown in the example of Fig. 3A, a (e.g., custom) cache can enable results to be shared between the various processing steps. Although the example of Fig. 3A refers to the use of ODBC, JSON, SQL, and HDFS for, respectively, data connectivity, interchange format, query language, and file system, embodiments support the use of other technologies, protocols, and/or formats. Optionally, data processing steps can execute in parallel on the cluster, as in the example of steps 310 and 312 shown in Fig. 3A. For example, multiple Spark jobs can run concurrently through multiple Spark instances operating on the cluster.
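The robustness check in steps 308-312 can be illustrated informally (this is not the patent's algorithm) by comparing a model's error on the training slice with its error on a held-out slice; a large gap signals over-fitting:

```python
def mean_squared_error(model, rows):
    """Average squared prediction error of `model` over (feature, target) rows."""
    errs = [(model(x) - y) ** 2 for x, y in rows]
    return sum(errs) / len(errs)

train = [(1, 3.0), (2, 5.0), (3, 7.0)]      # y = 2x + 1 exactly
holdout = [(4, 9.1), (5, 10.8)]             # slightly noisy new data

simple = lambda x: 2 * x + 1
train_err = mean_squared_error(simple, train)
holdout_err = mean_squared_error(simple, holdout)
robust = abs(holdout_err - train_err) < 1.0  # small gap -> robust model
```

Under the SRM principle described above, a model whose holdout error stays close to its training error is preferred over a more complex model that fits the training slice alone.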
Figs. 3B to 3D show example process flows for in-database modeling. In these examples, at least a portion of the data processing is performed client-side (e.g., in an application or other client process separate from the cluster). For example, at least a portion of the processing can be performed by the automated analysis modeler 202. In the examples of Figs. 3B-3D, the automated analysis modeler 202 is a C++ modeler. In some embodiments, the modeler 202 can use the unified client to interact with the cluster (e.g., with a distributed processing platform such as a Hadoop cluster). Operation of the unified client with the cluster is described further below.
The modeler 202 can use the unified client to request various jobs to run on the cluster, serially or in parallel. In the examples of Figs. 3B to 3D, the jobs are Spark jobs. These jobs can be requested by the modeler 202 through the unified client, which includes a Spark client as a sub-client, as described below. Other types of jobs can also be run to perform the various data processing steps. In some embodiments, the results of the various steps can be stored in the data warehouse 212, and the modeler 202 can fetch the results from the data warehouse 212. In the examples of Figs. 3B to 3D, the data warehouse 212 is a Hive data warehouse. Embodiments also support the use of other types of data warehouses.
As shown in FIG. 3B, the modeler 202 can request (e.g., trigger) a Spark job through a (e.g., YARN) driver 210, and a Spark job 314 (e.g., cross statistics) can run on the cluster. The job results can be written to a (e.g., Hive) data warehouse 212, and the modeler 202 can read the results from the data warehouse 212. Further processing can be performed afterward.
As shown in FIG. 3C, the further processing can include any suitable number of job types running on the cluster. As shown in the example, the jobs can include a job 316 for encoding data, a job 318 for matrix processing (e.g., using MLlib), a job 320 for scoring a formula, another job 322 for cross statistics, and a job 324 for performance. Other types of jobs are also supported by embodiments. After each job, the results of the data processing step can be written to the data warehouse 212. The modeler 202 can retrieve results from the data warehouse 212, perform some local processing, and determine, based on the results of the local processing, another job to be executed on the cluster. In this way, the modeler 202 can perform local data processing while optionally using the cluster to perform certain data processing steps. In some embodiments, a (e.g., custom) cache can be used to share results among the jobs running on the cluster, as described with reference to FIG. 3A. In some embodiments, the cache is the workspace used by the unified client, as described below.
In some embodiments, a flexible configuration can be used to specify the jobs to be run on the cluster. FIG. 3D shows an example of JSON-formatted metadata that can be used to configure an example Spark job. Other file formats can also be used to configure jobs. In some embodiments, the format and/or schema of the metadata is flexible and/or spans multiple jobs, or is common to all jobs. Thus, a new job can reuse the same schema and/or a schema of the same type.
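As a rough illustration of such job metadata, the sketch below parses a small JSON descriptor into the fields a client might need to submit a job. The field names (`jobName`, `mainClass`, `sparkConf`, etc.) are illustrative assumptions, not the actual schema of FIG. 3D:

```python
import json

# Hypothetical job descriptor; all field names are illustrative, not the
# patent's actual schema.
JOB_METADATA = """
{
  "jobName": "cross_statistics",
  "mainClass": "com.example.jobs.CrossStatistics",
  "inputTable": "training_data",
  "outputTable": "cross_stats_result",
  "sparkConf": {"spark.executor.memory": "4g", "spark.executor.instances": "8"}
}
"""

def parse_job_descriptor(text):
    """Check that the shared schema's required fields are present and return
    the parsed metadata for job submission."""
    meta = json.loads(text)
    required = {"jobName", "mainClass", "inputTable", "outputTable"}
    missing = required - meta.keys()
    if missing:
        raise ValueError("descriptor missing fields: %s" % sorted(missing))
    return meta

job = parse_job_descriptor(JOB_METADATA)
```

Because every job descriptor reuses the same schema, a new job type only needs a new descriptor, not new parsing code.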
Process for in-database modeling
FIG. 4 shows an example process 400 for in-database modeling. Process 400 can be performed, for example, by environment 100 and/or environment 200, or by other data processing apparatus. Process 400 can also be implemented as instructions stored on a computer-readable storage medium, such that execution of the instructions by one or more data processing apparatus causes the one or more data processing apparatus to perform some or all of the operations of process 400.
An input training data set stored in an underlying data platform is identified (402). An instruction is sent to the data platform, the instruction being operable by the data platform to delegate one or more data processing operations to multiple nodes on the data platform, to train a predictive model based on the input training data set (404). In some embodiments, the instruction may specify the data processing operation(s) to be performed on the cluster 204 to train or otherwise determine the predictive model, such as in the examples of FIGS. 3A-3D. The result set(s) of the operation(s) can be retrieved from the data warehouse 212 (406). In some examples, local processing (e.g., on the client-side modeler) can be performed based at least in part on the retrieved result set(s) (408). A determination is made as to whether additional processing operations are to be executed to determine the predictive model (410). If so, the process can return to 404, and another instruction set can be sent to request that jobs run on the cluster 204 and/or additional local processing can be performed. If no additional processing is to be executed to determine the predictive model, the predictive model can be provided (412). The predictive model can be applied (414) to a data set (e.g., a business data set) to make predictions regarding the data, e.g., to identify result(s) associated with the probability of particular data values subsequently appearing in the data set.
Although FIG. 4 describes an example in which processing is performed sequentially and in a particular order (e.g., jobs run first on the cluster, with local processing performed afterward), embodiments are not so limited. Embodiments support modeling that includes any number of data processing steps (jobs), performed on the cluster 204 or executed locally by the automated analytics module 202, which can be performed sequentially or in parallel.
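The alternating delegate/fetch/decide loop of process 400 can be sketched as follows. This is a minimal stand-in: `run_cluster_job` substitutes for submitting a job to the cluster and reading its result set from the warehouse, and the halving-score convergence rule is a made-up placeholder for the real model-quality check at (410):

```python
def run_cluster_job(step, state):
    # Stand-in for (404)/(406): delegate a data processing operation to the
    # cluster and fetch its result set. The halving is purely illustrative.
    return {"step": step, "score": state["score"] * 0.5}

def train_model(initial_score=1.0, tolerance=0.1, max_steps=10):
    state = {"score": initial_score}
    for step in range(max_steps):
        result = run_cluster_job(step, state)      # delegate + fetch results
        state["score"] = result["score"]           # (408): local processing
        if state["score"] < tolerance:             # (410): more processing?
            return {"model": "ready", "steps": step + 1}  # (412): provide model
    return {"model": "ready", "steps": max_steps}

model = train_model()
```

The point of the sketch is the control flow: the client decides after each round, based on locally processed results, whether to send another instruction set to the cluster.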
Unified client
A distributed processing platform such as described herein, used for performing the modeling, can store and process large data sets in batch mode. In the Hadoop example, the Hadoop ecosystem initially included MapReduce and the Hadoop Distributed File System (HDFS), and over time progressively evolved to support other processing engines (e.g., Hive, Impala, Spark, Tez, etc.), other languages (e.g., Pig, HQL, HiveQL, SQL, etc.), and other storage schemas (e.g., Parquet, etc.). In particular, in contrast to previous versions that supported the MapReduce framework but not Spark, the addition of the Spark engine has significantly improved the efficiency of Hadoop distributed processing. The Spark engine can handle complex processing involving many underlying iterations, such as the iterations used in machine learning.
By supporting a technological "zoo" of many different processing engines, languages, and storage schemas, a distributed processing platform presents an engineering challenge when an organization attempts to integrate the platform into a particular organizational context and/or workflow. For example, an information technology group in a business may wish to produce an optimal data processing solution suited to the specific needs of the business, and to that end may utilize and/or combine the different technologies supported by the platform. The disparate technologies supported by the platform can be complementary to one another and/or can operate concurrently with one another. Traditionally, a large amount of ad hoc and/or specialized code would need to be written for an application to combine and/or coordinate operations across the multiple technologies supported by the platform. Such code is difficult to maintain across versions of the application as the application's design and/or logic changes. Embodiments provide a unified client that serves as a single interface for interacting with all the subsystems supported by a distributed processing platform, facilitating consumption of the various services offered by the distributed processing platform. By combining the different subsystems in a single session, the unified client also operates to overcome limitations that may be inherent in the individual subsystems and/or technologies of the distributed processing platform (e.g., performance limits, processing capacity, etc.).
Spark technology was designed to support the running of long-lived jobs operating in batch mode. Spark supports job submission through a shell script (e.g., spark-submit). The configuration of the shell script itself presents a challenge. For example, the shell script imposes many script arguments and prerequisites, such as the presence of client-side Hadoop XML configuration and of specific Hadoop environment variables.
Using Spark can be difficult from the perspective of a client application, for various reasons. For example, Spark is difficult to embed in an application's runtime landscape. The traditional way of submitting a Spark job includes creating a custom command line and running the custom command line in a separate process. Moreover, Spark jobs are traditionally independent and run in a one-shot fashion, and it is not possible to return to the client workflow (e.g., to carry out intermediate steps) and then continue the Spark job run from the point at which it was interrupted. Thus, on traditional platforms, Spark cannot readily be used in an interactive and/or stateful manner. Moreover, a Spark connection description traditionally does not exist as a separate concept. Instead, the Spark interface can handle its configuration with a Spark job submission that includes the connection-related information and other parameters. In addition, Spark traditionally may not provide a type of connection repository comparable to the connection repositories found in RDBMS contexts. For at least these reasons, in traditional solutions the Spark interface is difficult to embed, difficult to configure, and can handle job runs only in batch mode, thus preventing intermediate interactions with the client application.
To mitigate, and in some instances eliminate, the above-listed limitations regarding completely disparate interfaces in a distributed processing platform, embodiments provide enhanced service consumption in the distributed processing platform. Specifically, embodiments provide an embeddable, operational Spark client (e.g., driver), so that the Spark driver can be brought into an application process even when the application runs in a non-JVM process. In some embodiments, the Spark runtime bytecode and the Spark client can be configurable at runtime. The Spark driver can consume predefined Spark connection descriptors persisted in a dedicated repository to simplify connection configuration. A Spark job runtime can be specific to each application domain. The Spark job runtime can be stored in a dedicated repository and can be deployed at runtime to the (e.g., Hadoop) cluster. In some embodiments, the Spark client provides interactive and/or stateful connections. A Spark connection can be established to enable the submission of successive jobs with intermediate state maintained in a virtual workspace. Internally, a Spark connection can correspond to a SparkContext instance.
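A minimal sketch of the connection-descriptor idea, under stated assumptions: a named descriptor is persisted once in a repository, and a client later opens a connection by name instead of assembling shell-script arguments. The class names and descriptor fields (`master`, `hadoopConfFiles`) are illustrative, not the patent's actual structures:

```python
class ConnectionRepository:
    """Toy persistent store of named connection descriptors."""
    def __init__(self):
        self._descriptors = {}

    def save(self, name, descriptor):
        self._descriptors[name] = dict(descriptor)

    def load(self, name):
        return dict(self._descriptors[name])

class SparkConnection:
    """Stand-in for a connection built from a predefined descriptor."""
    def __init__(self, descriptor):
        self.master = descriptor["master"]
        self.hadoop_conf = descriptor.get("hadoopConfFiles", [])
        self.open = True

repo = ConnectionRepository()
repo.save("prod-cluster", {
    "master": "yarn",
    "hadoopConfFiles": ["core-site.xml", "yarn-site.xml"],
})
conn = SparkConnection(repo.load("prod-cluster"))
```

The design point is the separation: connection details live in one named, reusable descriptor rather than being scattered across command lines and environment variables.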
In some embodiments, at least some (or all) of the Hadoop-specific client interfaces can be merged into a single-point client component serving as the unified client. The unified client enables the seamless association of the various services (e.g., Hive, SparkSQL, Spark, MapReduce, etc.) to realize complex and/or heterogeneous data processing chains. Through the unified client, the Spark driver can be aligned at the same level of technical capability as the other drivers (e.g., Hive client, HDFS client, etc.).
FIG. 6 depicts an example system including a unified client for a distributed processing platform, according to embodiments of the present disclosure. As shown in the example of FIG. 6, the system can include one or more distributed systems 602 of a distributed processing platform. In some examples, the distributed system(s) 602 include Hadoop system(s). Embodiments also support other types of distributed system(s) 602. The distributed system(s) 602 can include subsystems and/or engines, such as MapReduce 606, a Hive engine 608, a Spark engine 610, SparkSQL 612, and storage 614 (e.g., HDFS).
The system can include a unified client 604. The unified client 604 can include sub-clients, such as a MapReduce client 616, a Hive client 618, a Spark client 620, a SparkSQL client 622, and/or a storage client 624. The unified client 604 can also include any other suitable type of sub-client, such as a Sqoop client. The sub-clients can also include an HDFS client. In some embodiments, the sub-clients can include one or more other (e.g., generic) SQL clients to support SQL implementation(s) in addition to Spark SQL, such as Cloudera Impala™. Each of the various sub-clients of the unified client 604 is configured to interface with a corresponding subsystem of the distributed system(s) 602. For example, the MapReduce client 616 can be configured to interface with MapReduce 606, the Hive client 618 can be configured to interface with the Hive engine 608, the Spark client 620 can be configured to interface with the Spark engine 610, the SparkSQL client 622 can be configured to interface with SparkSQL, and the storage client 624 can be configured to interface with the storage 614.
In some embodiments, the Spark client 620 can access a Spark job repository 626. The unified client 604 can access and use a data workspace 628 and/or unified metadata 630 (e.g., table, RDD, and/or file schemas). In some embodiments, the unified client 604 can access a unified connection repository 632. The unified connection repository 632 can include one or more of a Hive connection 634 (e.g., using ODBC and/or JDBC), a SparkSQL connection 636 (e.g., using ODBC and/or JDBC), a native Spark connection 638, and/or a native HDFS connection 640. In some examples, there can be a pairing between the SparkSQL connection 636 and the native Spark connection 638. In some examples, there can be a pairing between the native Spark connection 638 and the native HDFS connection 640.
The unified connection repository 632 can also be described as a connection metadata repository. The unified connection repository 632 can store metadata indicating the pairings between different connections (e.g., pairings of different connection types). The pairings can enable interfacing between different sub-clients (such as the MapReduce client 616, the Hive client 618, the Spark client 620, the SparkSQL client 622, or the storage client 624, etc.). During a particular unified session, an application can invoke multiple different sub-clients, and data can be received and/or sent through the various sub-clients. The connection pairings defined at the metadata level in the unified connection repository 632 enable the combination of sub-clients used in a particular unified session. Defining connection pairings at the metadata level also enables switching between the sub-clients used during a session. For example, a session can be initiated using one sub-client (e.g., the SparkSQL client), and, using the same unified session, the initial sub-client can be associated (e.g., chained) with one or more other sub-clients that are also used. Switching between sub-clients can be performed lazily, because each sub-client shares a minimal common interface and thus becomes interoperable. For example, the Spark sub-client can interoperate with the Hive SQL client or the HDFS client. The actual selection of sub-clients can be determined at runtime by the particular session configuration. Association (e.g., chaining) between sub-clients can be performed in a seamless fashion, without additional authorization or authentication of client credentials. Authentication can be handled by a "single sign on" method (e.g., using Kerberos), which can authenticate the unified client's session once so that it can be used across all sub-clients. In some embodiments, the metadata and/or data flowing out of a given step in a chain can be left unpersisted and instead be sent to the next sub-client in the processing chain. Embodiments enable different sub-client interfaces to be combined in a seamless manner for use during a unified session. Each sub-client can be attached to a common interface and can therefore provide interoperability between sub-clients. This is described further with reference to FIG. 8.
FIG. 8 depicts an example class diagram 800 according to embodiments of the present disclosure. In some embodiments, the unified client interface can be implemented according to class diagram 800. In the example, class diagram 800 includes a hierarchical arrangement of classes 802, 804, 806, 808, 810, 812, and 814. As the example shows, each class can include various member methods and member fields. For example, the UnifiedConnection class includes member methods subConnectionList() and createWorkspace(). In some examples, each job handles a particular sub-client, such as Spark SQL or HDFS. Each job, such as an instance of the HDFSJob class 808, the SQLJob class 810, the SparkJob class 812, and/or the MapReduceJob class 814, can implement the interface AbstractClient 806. The following is an example sequence of operations through such an embodiment. 1) A UnifiedConnection 802 can be instantiated. 2) A stateful instance of the Workspace class 804 can be created, in which staging data can reside. 3) Jobs can be added to the Workspace. In some examples, JSON can include input and output parameters that may refer to one another's results. 4) Job compilation can be triggered (e.g., building a flow graph based on topological dependencies). In some examples, the system can verify that the flow graph is well formed. 5) The produced plan can run in the context of the unified connection. Intermediate and/or temporary data can be stored in the workspace. In the example of FIG. 8, "subConnectionId," "ApplicationRuntimeId," and/or "MapReduceRuntimeId" may refer to the unified client repositories in which connections are predefined and/or in which the Spark or MapReduce job runtimes are stored.
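The class layout of FIG. 8 can be sketched roughly as below: a UnifiedConnection produces a stateful Workspace, and jobs implement a common AbstractClient interface so that heterogeneous jobs can be collected and executed through one code path. Method names follow the figure where given (createWorkspace); the job bodies and the `staging` field are illustrative assumptions:

```python
from abc import ABC, abstractmethod

class AbstractClient(ABC):
    """Common interface (806) that every job type implements."""
    @abstractmethod
    def run(self, workspace):
        ...

class HDFSJob(AbstractClient):            # cf. class 808
    def run(self, workspace):
        workspace.staging["hdfs"] = "uploaded"
        return "hdfs-done"

class SparkJob(AbstractClient):           # cf. class 812
    def run(self, workspace):
        workspace.staging["rdd_ref"] = "rdd://result-1"
        return "spark-done"

class Workspace:                          # cf. class 804: stateful, holds staging data
    def __init__(self):
        self.staging = {}
        self.jobs = []

    def add_job(self, job):
        self.jobs.append(job)

    def execute(self):
        # A real implementation would compile a dependency-ordered flow graph
        # first; here the jobs simply run in insertion order.
        return [job.run(self) for job in self.jobs]

class UnifiedConnection:                  # cf. class 802
    def create_workspace(self):
        return Workspace()

ws = UnifiedConnection().create_workspace()
ws.add_job(HDFSJob())
ws.add_job(SparkJob())
results = ws.execute()
```

Because every job type shares the AbstractClient interface, the workspace can mix HDFS, SQL, Spark, and MapReduce jobs in one plan without special-casing any of them.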
Returning to FIG. 6, the chaining of sub-clients can include, after a first sub-client receives data, the first sub-client providing the data to be processed by a second sub-client. Although the examples herein may describe two sub-clients being chained together during a unified session, embodiments enable any suitable number of sub-clients to be chained to process data in sequence. The chaining of sub-clients can be a serial chain, wherein data is passed from one sub-client to another sub-client, then on to yet another sub-client, and so forth. Chaining can also enable parallel processing, in which multiple sub-clients process the same data at least partly concurrently. Chaining can involve branching, wherein processing is performed in parallel by multiple sub-clients and/or multiple sub-client chains. Chaining can also include merging and/or rejoining of branched chains for further processing.
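An illustrative chain with a branch and a rejoin might look as follows; the sub-clients here are plain functions standing in for Hive/Spark/SparkSQL clients, and the specific operations (doubling, summing, counting) are arbitrary placeholders:

```python
def prepare(data):
    # Stand-in for a serial data-preparation step (e.g., a Hive step).
    return [x * 2 for x in data]

def score(data):
    # Branch 1: stand-in for a Spark scoring job.
    return sum(data)

def count(data):
    # Branch 2: stand-in for a SparkSQL aggregation.
    return len(data)

def run_chain(data):
    staged = prepare(data)                        # serial link
    branches = [score(staged), count(staged)]     # branch: run in parallel
    return {"score": branches[0], "count": branches[1]}  # rejoin/merge

result = run_chain([1, 2, 3])
```

The shape mirrors the text: one upstream step feeds two downstream steps, and their outputs are merged for further processing.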
The pairing of connections can occur at runtime and can be based on a first connection that points to a second (e.g., Hadoop) subsystem (such as from a sub-client different from the sub-client used for the first connection). Embodiments provide a unified client for combining different types of data processing technologies, e.g., corresponding to different sub-clients, to provide a more feature-rich data processing solution than traditional solutions. Through the unified client, embodiments also provide a solution that enables greater flexibility in data processing by leveraging the multiple capabilities of the (e.g., Hadoop) platform.
The unified connection repository 632 can store metadata for the particular connections of one or more interfaces. In some examples, connections can be paired with one another only if they point to the same subsystem of the distributed system(s) 602. In some examples, the native Spark connection description includes at least the Hadoop XML files which, in YARN mode, are deployed at runtime to the Spark runtime classpath, so as to properly configure the YARN and/or Hadoop components.
In some examples, the Spark client can store the Spark job runtime packages (e.g., JAR files) in a separate repository. If the Spark and/or Hadoop versions are compatible, a job artifact can be run using any Spark connection.
In some embodiments, the unified client 604 exposes the various individual interfaces it includes. A consumer of the unified client (e.g., an application) can initiate a given connection of a particular interface (e.g., the Hive client). Depending on the predetermined connection pairings, the unified client consumer can automatically gain access to the other service interface(s) to build a heterogeneous data processing graph, such as shown in the example of FIG. 7A. In some examples, credentials can be requested to enable access to the paired connections.
A unified connection (e.g., a set of paired connections) can be bound to a virtual data workspace 628, which can include status information for the unified session between the unified client 604 and the distributed system(s) 602. For example, the data workspace 628 can include status information such as one or more intermediate states maintained in the form of references and/or identifiers to Hive tables, in-memory resilient distributed datasets (RDDs), HDFS file names, and/or client resources. This information can enable a stateful connection to be maintained. Maintaining references to in-memory RDDs in the status information can enable different jobs (e.g., Spark or otherwise) to be chained to one another. For example, a first Spark job can return an RDD reference as a result, and another job can consume that result through an incoming parameter that is the RDD reference. Given that RDDs may be very large, jobs can pass in and/or return references to RDDs rather than the RDDs themselves. The presence of the status information in the data workspace 628 can also enable automatic cleanup to be performed at the end of the session. For example, at least some of the status information can be deleted at session end, such as references (e.g., Hive tables) created for returning general results back to the unified client 604 and/or the application. Embodiments enable data to be passed from one processing step to another along the data flow diagram shown in FIG. 7A.
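A session-scoped workspace holding references (not the data itself) and clearing them automatically at session end can be sketched with a context manager. The reference strings and class name are illustrative assumptions:

```python
class SessionWorkspace:
    """Toy workspace: holds references to intermediate results (e.g., RDDs,
    Hive tables) and clears them automatically when the session ends."""
    def __init__(self):
        self.references = {}

    def register(self, name, ref):
        # Store a reference/identifier, never the (potentially huge) data.
        self.references[name] = ref

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.references.clear()   # automatic cleanup at session end
        return False

with SessionWorkspace() as ws:
    ws.register("cross_stats", "rdd://job-314/output")
    refs_during_session = dict(ws.references)
refs_after_session = dict(ws.references)
```

The context-manager boundary plays the role of the unified session: while it is open, any job can pick up a registered reference; when it closes, the status information is deleted.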
FIG. 6 provides an example of a processing chain as shown in the unified connection repository 632. For example, a particular session of interaction between the unified client 604 and the distributed system(s) 602 can use the Spark engine and the Hive engine in a particular manner (e.g., using SparkSQL), and can utilize HDFS. Depending on the requirements to be met in a single session handled by the components of the unified client 604, step-by-step processing can include passing a data set produced by intermediate processing on the application side and pushing the data set to the distributed system(s) 602. This may be followed by Spark processing of the data set. The unified client 604 can enable an application to chain these various processing steps in a seamless manner. The steps can also include data preparation steps using the HiveQL language. The use of the unified client 604 eliminates the need to port such data preparation work to SparkSQL or another language. For example, the unified client 604 enables an application to perform data preparation using Hive, perform various modeling steps using the Spark engine, and retrieve the various results back into the application using Hive and/or Spark. The application can then perform intermediate processing of the result(s). Steps can alternate between the unified client side and/or the distributed system side. For processing on the distributed system side, embodiments enable the running of any number of jobs, in any order, including jobs in MapReduce, Spark, Hive, HDFS, and so forth.
Although the examples herein describe the use of a unified client with a single distributed processing platform (e.g., Hadoop), embodiments are not so limited. In some embodiments, the unified client can be used to facilitate data processing across multiple distributed processing platforms. In such examples, the unified connection repository 632 can include metadata describing a pairing between two HDFS connections, e.g., to facilitate the transfer and/or copying of data from one distributed processing platform to another. In such examples, the unified client 604 can include an HDFS client as a sub-client to handle such cross-platform data transfers.
In some embodiments, the coupling or pairing of connections can be user-specific, e.g., one or more particular associations between connections can be established and stored for a particular user. In one example, connection pairings and/or associations can be made among the following connections: ODBC connections to Hive, Spark SQL, etc.; a Spark connection (e.g., including configuration files and properties); and an HDFS connection. One unified client connection can include these three connections associated together. A unified client connection configuration can be identical for all users, or there can be user-specific values to provide flexibility. For example, an ODBC connection can be common to all users, with more specific ODBC connections for user 1 and user 2. For user 1, the specific ODBC connection can include information for the Spark configuration and the HDFS configuration, and likewise for user 2. As another example, a common (e.g., technical user) ODBC connection can be used, but with a customized Spark configuration for user 2. In that case, for user 1, the connection can be the common ODBC connection with the Spark configuration file and the HDFS configuration. For user 2, the connection can be the common ODBC connection with the Spark configuration file, an additional customized configuration for user 2, and the HDFS configuration.
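The per-user pairing described above amounts to shared defaults plus user-specific overrides, which can be sketched as follows. The connection names and fields are illustrative, not actual configuration keys:

```python
# Shared (technical-user) connection bundle: ODBC + Spark + HDFS.
DEFAULTS = {
    "odbc": "odbc://hive-generic",
    "spark": "spark-default.conf",
    "hdfs": "hdfs://cluster",
}

# Per-user overrides; e.g., user2 has a customized Spark configuration.
USER_OVERRIDES = {
    "user2": {"spark": "spark-user2-custom.conf"},
}

def resolve_unified_connection(user):
    """Build the unified connection for a user: defaults first, then any
    user-specific values layered on top."""
    config = dict(DEFAULTS)
    config.update(USER_OVERRIDES.get(user, {}))
    return config

user1 = resolve_unified_connection("user1")
user2 = resolve_unified_connection("user2")
```

One bundle can thus stay identical for most users while still allowing a single user's Spark (or ODBC, or HDFS) piece to be customized.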
FIG. 7A depicts an example system including an application 702 that uses the unified client 604, according to embodiments of the present disclosure. As shown in the example of FIG. 7A, the system can include the application 702. The application 702 can include the unified client 604 and a unified client workspace 704 (e.g., the data workspace 628). In some examples, the unified client 604 is embedded (e.g., in-process) into the application 702. For example, the unified client 604 can be loaded at runtime as a library, to provide the application 702 with the capability to interface with the various subsystems of the distributed system(s) 602.
In some examples, the unified client workspace 704 includes data structure metadata 706 and one or more references 708 to tables, HDFS, and/or RDDs. The unified client 604 can be configured to access and use the unified client workspace 704 to perform its various operations. The unified client 604 can run one or more HQL 710 queries (e.g., for data materialization). The unified client 604 can submit jobs such as a Spark job 712 (e.g., for data transformation), and receive an output RDD reference from the Spark job 712. The unified client 604 can run SQL such as SparkSQL 714 (e.g., for data retrieval), and receive result(s) from SparkSQL 714. The unified client 604 can run PUT commands through HDFS commands 716 (e.g., for uploading data). The unified client 604 can submit a job, along with RDD and/or HDFS reference(s), to a Spark job 718 (e.g., for data transformation).
In some examples, each data reference hosted by the workspace 704 has metadata describing its structure. The unified client 604 can be configured to manage multiple connections to the different subsystems of the (e.g., Hadoop) distributed system(s) 602. If a unified client consumer needs to construct a cross-subsystem data processing graph, the unified client 604 provides transit data in a staging area that is part of the data workspace. After the unified connection is closed, the contents of the temporary workspace are removed automatically by the unified client component.
The unified client 604 can provide an application or other consumer with a single point of access to the distributed system(s) 602. Each subsystem of the distributed system(s) 602 can provide different benefits, and the unified client 604 can enable an application to utilize and/or combine the different benefits of the various subsystems in a seamless, efficient manner, without having to perform a large amount of ad hoc, specialized coding.
The unified client 604 enables the creation of a unified session for interfacing the application 702 with the distributed system(s) 602. When a unified session is created from the unified client 604, the unified client 604 can create a unified connection that pairs and/or otherwise combines different individual connection types (e.g., to Hive, Spark, HDFS, MapReduce, etc.). To achieve this unified connection, embodiments can designate the native Spark connection description as a set of schemas.
Traditionally, Spark connections are facilitated through a shell script that does not separate establishing the connection from submitting the job. In some embodiments, the task of establishing a Spark connection can be separated from the task of job submission. Traditionally, Spark is configured to run jobs in batch mode and does not enable interactive sessions. In some embodiments, the unified client 604 enables an interactive Spark session between the application 702 and the distributed system(s) 602. For example, the unified client 604 can initiate a Spark job with the distributed system(s) 602, interrupt the job to perform some intermediate steps, and continue the Spark job after the intermediate step(s) have been performed.
Traditionally, the information describing a Spark connection may be inconveniently spread across multiple locations, such as XML files, Hadoop variables, and so forth. In some embodiments, a single Spark connection descriptor can include the various pieces of Spark connection information, providing clients with a more convenient way to access the Spark connection information. The Spark connection descriptor can reside in the Spark job repository 626. The unified client 604 can access the Spark job repository 626 to access the Spark connection descriptor, and can create and/or restore Spark connections based on the connection information therein. In this way, embodiments provide a unified client 604 that treats Spark as efficiently as the other engines supported by the distributed system(s) 602, thereby facilitating an application's use of Spark processing. Instead of requiring ad hoc and/or specialized code to be written to interact with each of the different subsystems, the unified client 604 provides a single interface that enables the application 702 to interact with each subsystem in a similar manner.
The particular chaining of sub-clients shown in FIG. 7A, e.g., HQL 710 to Spark job 712 to SparkSQL 714, and so forth, is provided as an example, and embodiments are not limited to that example. In general, any suitable number and type of sub-clients can be chained, serially and/or in parallel, in any order, to perform data processing. In the example of FIG. 7A, the Spark job 712 processes data and provides the results of the processing to both SparkSQL 714 and another Spark job 718, as an example of the branching for parallel processing described above. Particular sub-clients can be used to perform particular types of operations during an instance of chaining. For example, certain sub-clients can be used to retrieve data from storage, and other sub-clients can be used to transform data in some manner. After a processing step has been performed, certain metadata can be returned to the unified client 604 to indicate the results of the processing or to indicate that the processing has been performed. Such returned metadata can include a reference to the results, such as the output RDD reference shown in FIG. 7A as returned from the Spark job 712. The results of the various processing steps performed by the various sub-clients can be associated with one another through the use of references.
Fig. 7B depicts a flowchart of an example process for performing data processing using the unified client, according to embodiments of the present disclosure. The operations of the process can be performed by the application 702, the unified client 604, and/or other software modules executing on the client computing device, on the distributed processing platform, or elsewhere.
A request is received (720) indicating that data processing is to be performed in the distributed processing platform using the unified client 604. In some examples, the request can be received from the application 702 that calls the unified client 604.
A sub-client of the unified client 604 is determined (722) to perform a data processing step. In some examples, the data processing flow and chaining can be predetermined to solve a particular problem. In some examples, the data processing flow and chaining can be determined at runtime through flexible input configuration and/or based on the results of the data processing. For example, if processing a data set on one sub-client is determined to be trivial (e.g., lower cost) relative to other sub-clients, the selection of the lower-cost sub-client can be made at runtime. The data processing step is performed (724) using the determined sub-client, and the result is provided for further processing. In some embodiments, a reference pointing to the result can be provided (726), so that other sub-clients can perform further processing steps on the result data.
A determination is made (728) as to whether additional processing is required. If not, the result of the last processing step can be provided (730), e.g., to the application 702. If further processing is required, the process can return to 722 to determine another sub-client, which may be the same as or different from the sub-client used in the previous step. Processing steps can be performed in sequential order by (the same or different) sub-clients, and/or processing steps can be performed in parallel by multiple sub-clients of the same or different types.
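The loop of Fig. 7B (steps 720-730) can be sketched minimally as follows, assuming a predetermined chain of sub-clients; the fetch/transform/aggregate steps are hypothetical placeholders, not operations named in the disclosure.

```python
# Minimal sketch of the Fig. 7B flow: receive a request, repeatedly pick a
# sub-client for the next step, execute it, pass along a reference to the
# result, and provide the final result when no further processing remains.
def run_pipeline(steps, initial_input):
    result_ref = initial_input                 # (720) request received
    for sub_client in steps:                   # (722) determine next sub-client
        result_ref = sub_client(result_ref)    # (724/726) execute, reference result
    return result_ref                          # (728/730) done -> provide result

# Hypothetical sub-clients for illustration only.
fetch = lambda _: [3, 1, 2]        # retrieve data from storage
transform = lambda xs: sorted(xs)  # transform the data in some way
aggregate = lambda xs: sum(xs)     # reduce to a final result

print(run_pipeline([fetch, transform, aggregate], None))  # 6
```

A runtime-determined flow would simply replace the fixed `steps` list with a function choosing the next sub-client from the previous step's result, as described at 722.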
In some examples, at least some of the data processing can be performed client-side, e.g., external to the distributed processing platform. For example, results can be retrieved from the Hadoop processor via the Get Results flow shown in Fig. 7A. Local processing can be performed on the received results, and the results of the local processing can be sent back for further processing by another sub-client. Embodiments enable at least some of the processing steps to be performed outside the distributed processing platform (e.g., outside the Hadoop system).
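This interleaving of platform-side and client-side processing might be sketched as follows, with the distributed platform simulated by a plain dictionary; all names (the job identifier, the normalization step) are illustrative assumptions.

```python
# Sketch of interleaving client-side (local) processing with platform-side
# steps: results are pulled from the platform (cf. the Get Results flow),
# processed locally, and the local output is handed back for further
# processing by another sub-client.

platform_results = {"job-42": [10.0, 12.5, 9.5]}  # simulated platform state

def get_results(job_id):
    """Stand-in for retrieving a result set from the Hadoop processor."""
    return platform_results[job_id]

def local_normalize(values):
    """A local (client-side) step performed outside the platform."""
    total = sum(values)
    return [v / total for v in values]

def submit_next_step(values):
    """Stand-in for sending locally processed data back to the platform
    for further processing by another sub-client."""
    return {"submitted": values}

normalized = local_normalize(get_results("job-42"))
outcome = submit_next_step(normalized)
print(round(sum(outcome["submitted"]), 6))  # 1.0
```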
Example Computing Device
Fig. 5 shows an example of a computing device 500 and a mobile computing device 550 that can employ the techniques described herein. Computing device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Computing device 550 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. The components described herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit the embodiments described and/or claimed in this document. At least one computing device 500 and/or 550, or one or more components thereof, can be included in any of the computing devices, systems, and/or platforms described herein.
Computing device 500 includes a processor 502, a memory 504, a storage device 506, a high-speed interface 508 connecting to the memory 504 and high-speed expansion ports 510, and a low-speed interface 512 connecting to a low-speed bus 514 and the storage device 506. Each of the components 502, 504, 506, 508, 510, and 512 is interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 502 can process instructions for execution within the computing device 500, including instructions stored in the memory 504 or on the storage device 506, to display graphical information for a GUI on an external input/output device, such as a display 516 coupled to the high-speed interface 508. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 500 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
The memory 504 stores information within the computing device 500. In one embodiment, the memory 504 is one or more volatile memory units. In another embodiment, the memory 504 is one or more non-volatile memory units. The memory 504 may also be another form of computer-readable medium, such as a magnetic or optical disk.
The storage device 506 can provide mass storage for the computing device 500. In one embodiment, the storage device 506 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 504, the storage device 506, or memory on the processor 502.
The high-speed controller 508 manages bandwidth-intensive operations for the computing device 500, while the low-speed controller 512 manages lower-bandwidth-intensive operations. Such allocation of functions is exemplary only. In one embodiment, the high-speed controller 508 is coupled to the memory 504, to the display 516 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 510, which may accept various expansion cards (not shown). In the embodiment, the low-speed controller 512 is coupled to the storage device 506 and to the low-speed expansion port 514. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, or a scanner, or to a networking device such as a switch or router, e.g., through a network adapter.
The computing device 500 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 520, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 524. In addition, it may be implemented in a personal computer such as a laptop computer 522. Alternatively, components from the computing device 500 may be combined with other components (not shown) in a mobile device, such as device 550. Each of such devices may contain one or more of the computing devices 500, 550, and an entire system may be made up of multiple computing devices 500, 550 communicating with each other.
Computing device 550 includes a processor 552, a memory 564, an input/output device such as a display 554, a communication interface 566, and a transceiver 568, among other components. The device 550 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 550, 552, 564, 554, 566, and 568 is interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
The processor 552 can execute instructions within the computing device 550, including instructions stored in the memory 564. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 550, such as control of user interfaces, applications run by the device 550, and wireless communication by the device 550.
The processor 552 may communicate with a user through a control interface 558 and a display interface 556 coupled to the display 554. The display 554 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 556 may comprise appropriate circuitry for driving the display 554 to present graphical and other information to a user. The control interface 558 may receive commands from a user and convert them for submission to the processor 552. In addition, an external interface 562 may be provided in communication with the processor 552, so as to enable near-area communication of the device 550 with other devices. The external interface 562 may provide, for example, for wired communication in some embodiments, or for wireless communication in other embodiments, and multiple interfaces may also be used.
The memory 564 stores information within the computing device 550. The memory 564 can be implemented as one or more of the following: one or more computer-readable media, one or more volatile memory units, or one or more non-volatile memory units. An expansion memory 574 may also be provided and connected to the device 550 through an expansion interface 572, which may include, for example, a SIMM (Single In-Line Memory Module) card interface. Such expansion memory 574 may provide extra storage space for the device 550, or may also store applications or other information for the device 550. Specifically, the expansion memory 574 may include instructions to carry out or supplement the processes described above, and may also include secure information. Thus, for example, the expansion memory 574 may be provided as a security module for the device 550, and may be programmed with instructions that permit secure use of the device 550. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one embodiment, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 564, the expansion memory 574, memory on the processor 552, or a propagated signal that may be received, for example, over the transceiver 568 or the external interface 562.
The device 550 may communicate wirelessly through the communication interface 566, which may include digital signal processing circuitry where necessary. The communication interface 566 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through the radio-frequency transceiver 568. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, a GPS (Global Positioning System) receiver module may provide additional navigation- and location-related wireless data to the device 550, which may be used as appropriate by applications running on the device 550.
The device 550 may also communicate audibly using an audio codec 560, which may receive spoken information from a user and convert it to usable digital information. The audio codec 560 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the device 550. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.), and may also include sound generated by applications operating on the device 550.
The computing device 550 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 580. It may also be implemented as part of a smartphone 582, a personal digital assistant, or another similar mobile device.
Various embodiments of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described herein can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an embodiment of the systems and techniques described herein), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.
The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
While this disclosure includes some specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features of example embodiments of the disclosure. Certain features that are described in this disclosure in the context of separate embodiments can also be provided in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be provided in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the present disclosure have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. For example, various forms of the flows shown above may be used, with steps re-ordered, added, or removed. Accordingly, other embodiments are within the scope of the following claims.

Claims (20)

1. A computer-implemented method performed by at least one processor, the method comprising:
identifying, by the at least one processor, an input training data set stored in a distributed processing platform that includes multiple subsystems;
transmitting, by the at least one processor, instructions from a client application to the distributed processing platform to request that at least one of the multiple subsystems execute at least one data processing operation to determine a predictive model based on the input training data set; and
providing, by the at least one processor, the predictive model to determine one or more results, each result associated with a probability of occurrence of a value in a data set.
2. The computer-implemented method of claim 1, wherein the instructions are transmitted from the client application to the distributed processing platform through a unified client that includes a plurality of sub-clients, each sub-client configured to interface with a corresponding subsystem of the distributed processing platform.
3. The computer-implemented method of claim 1, further comprising:
executing, by the at least one processor, at least one local data processing operation in the client application to determine the predictive model;
wherein the at least one local data processing operation receives input that includes a result set obtained from the at least one data processing operation performed on the distributed processing platform.
4. The computer-implemented method of claim 1, wherein the distributed processing platform is a Hadoop platform.
5. The computer-implemented method of claim 1, wherein the at least one data processing operation includes a data processing operation performed by a Spark subsystem of the multiple subsystems.
6. The computer-implemented method of claim 1, wherein the method is performed without transferring the input training data set from the distributed processing platform.
7. The computer-implemented method of claim 1, wherein the at least one data processing operation includes computing one or more statistics associated with the input training data set to reduce a number of variables used to generate the predictive model.
8. The computer-implemented method of claim 7, wherein the at least one data processing operation further includes recomputing one or more of the one or more statistics based on one or more results.
9. The computer-implemented method of claim 1, wherein the at least one data processing operation includes encoding data of the input training data set, including converting alphanumeric data to numeric data.
10. The computer-implemented method of claim 1, wherein the at least one data processing operation includes performing a covariance matrix calculation and a matrix inversion calculation on the input training data set.
11. The computer-implemented method of claim 1, wherein the at least one data processing operation includes slicing the input training data set to determine one or more slices, and scoring the predictive model on the one or more slices.
12. The computer-implemented method of claim 1, wherein the at least one data processing operation includes iteratively evaluating a performance of the predictive model based on structural risk minimization.
13. A system, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, the memory storing instructions which, when executed by the at least one processor, cause the at least one processor to perform operations comprising:
identifying an input training data set stored in a distributed processing platform that includes multiple subsystems;
transmitting instructions from a client application to the distributed processing platform to request that at least one of the multiple subsystems execute at least one data processing operation to determine a predictive model based on the input training data set; and
providing the predictive model to determine one or more results, each result associated with a probability of occurrence of a value in a data set.
14. The system of claim 13, wherein the instructions are transmitted from the client application to the distributed processing platform through a unified client that includes a plurality of sub-clients, each sub-client configured to interface with a corresponding subsystem of the distributed processing platform.
15. The system of claim 13, the operations further comprising:
executing at least one local data processing operation in the client application to determine the predictive model;
wherein the at least one local data processing operation receives input that includes a result set obtained from the at least one data processing operation performed on the distributed processing platform.
16. The system of claim 13, wherein the distributed processing platform is a Hadoop platform.
17. One or more non-transitory computer-readable storage media storing instructions which, when executed by at least one processor, cause the at least one processor to perform operations comprising:
identifying an input training data set stored in a distributed processing platform that includes multiple subsystems;
transmitting instructions from a client application to the distributed processing platform to request that at least one of the multiple subsystems execute at least one data processing operation to determine a predictive model based on the input training data set; and
providing the predictive model to determine one or more results, each result associated with a probability of occurrence of a value in a data set.
18. The one or more non-transitory computer-readable storage media of claim 17, wherein the at least one data processing operation includes computing one or more statistics associated with the input training data set to reduce a number of variables used to generate the predictive model.
19. The one or more non-transitory computer-readable storage media of claim 18, wherein the at least one data processing operation further includes recomputing one or more of the one or more statistics based on one or more results.
20. The one or more non-transitory computer-readable storage media of claim 17, wherein the at least one data processing operation includes performing a covariance matrix calculation and a matrix inversion calculation on the input training data set.
CN201611262212.2A 2016-03-14 2016-12-30 Predictive modeling optimization Active CN107194490B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201662307971P 2016-03-14 2016-03-14
US62/307,971 2016-03-14
US15/261,215 US10789547B2 (en) 2016-03-14 2016-09-09 Predictive modeling optimization
US15/261,215 2016-09-09

Publications (2)

Publication Number Publication Date
CN107194490A true CN107194490A (en) 2017-09-22
CN107194490B CN107194490B (en) 2022-08-12

Family

ID=59870919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611262212.2A Active CN107194490B (en) 2016-03-14 2016-12-30 Predictive modeling optimization

Country Status (1)

Country Link
CN (1) CN107194490B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256924A (en) * 2018-02-26 2018-07-06 上海理工大学 A kind of product marketing forecast device
CN108305103A (en) * 2018-02-26 2018-07-20 上海理工大学 A kind of product marketing forecast method of the supporting vector machine model based on parameter optimization
CN108648072A (en) * 2018-05-18 2018-10-12 深圳灰猫科技有限公司 Internet finance lending risk evaluating system based on user credit dynamic grading
CN110427356A (en) * 2018-04-26 2019-11-08 中移(苏州)软件技术有限公司 One parameter configuration method and equipment

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1292118A (en) * 1998-02-26 2001-04-18 太阳微系统公司 Method and apparatus for dynamic distributed computing over network
US20030131252A1 (en) * 1999-10-20 2003-07-10 Barton James M. Electronic content distribution and exchange system
CN1549171A (en) * 2003-05-15 2004-11-24 季永萍 Apparatus for realizing high-new technology market fixed standard based on net computation
US20070009045A1 (en) * 2005-07-11 2007-01-11 Qosmetrics, Inc. Image complexity computation in packet based video broadcast systems
CN101203170A (en) * 2005-06-02 2008-06-18 美的派特恩公司 System and method of computer-aided detection
CN101408976A (en) * 2007-10-11 2009-04-15 通用电气公司 Enhanced system and method for volume based registration
CN101727389A (en) * 2009-11-23 2010-06-09 中兴通讯股份有限公司 Automatic test system and method of distributed integrated service
CN101751782A (en) * 2009-12-30 2010-06-23 北京大学深圳研究生院 Crossroad traffic event automatic detection system based on multi-source information fusion
US8145593B2 (en) * 2008-12-11 2012-03-27 Microsoft Corporation Framework for web services exposing line of business applications
CN102541821A (en) * 2010-12-30 2012-07-04 微软公司 Displaying method and system with notes for interactive multiple languages
US8489632B1 (en) * 2011-06-28 2013-07-16 Google Inc. Predictive model training management
CN103502899A (en) * 2011-01-26 2014-01-08 谷歌公司 Dynamic predictive modeling platform
CN103631815A (en) * 2012-08-27 2014-03-12 深圳市腾讯计算机系统有限公司 Method, device and system for obtaining check points in block synchronization parallel computing
CN103676794A (en) * 2012-09-04 2014-03-26 上海杰之能信息科技有限公司 Energy consumption monitoring method and energy consumption monitoring system
CN103794234A (en) * 2012-10-30 2014-05-14 北京航天长峰科技工业集团有限公司 Massive video-based event trace quick search platform
WO2015006170A1 (en) * 2013-07-12 2015-01-15 FREILICH, Arthur A computer system storing content into application independent objects
CN104408967A (en) * 2014-11-26 2015-03-11 浙江中南智能科技有限公司 Cloud-computing-based parking lot management system
CN104899561A (en) * 2015-05-27 2015-09-09 华南理工大学 Parallelized human body behavior identification method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAN Weihong et al., "YHSAS: A Large-Scale Network Security Situation Analysis and Prediction System," Netinfo Security (《信息网络安全》), no. 08, 10 August 2012 (2012-08-10), pages 21-24 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256924A (en) * 2018-02-26 2018-07-06 上海理工大学 A product marketing forecast device
CN108305103A (en) * 2018-02-26 2018-07-20 上海理工大学 A product marketing forecast method using a support vector machine model based on parameter optimization
CN110427356A (en) * 2018-04-26 2019-11-08 中移(苏州)软件技术有限公司 A parameter configuration method and equipment
CN110427356B (en) * 2018-04-26 2021-08-13 中移(苏州)软件技术有限公司 Parameter configuration method and equipment
CN108648072A (en) * 2018-05-18 2018-10-12 深圳灰猫科技有限公司 Internet finance lending risk evaluation system based on dynamic user credit rating

Also Published As

Publication number Publication date
CN107194490B (en) 2022-08-12

Similar Documents

Publication Publication Date Title
US10789547B2 (en) Predictive modeling optimization
US20230334368A1 (en) Machine learning platform
JP7009455B2 (en) Data serialization in a distributed event processing system
US11516068B2 (en) Software-defined network resource provisioning architecture
US8682876B2 (en) Techniques to perform in-database computational programming
US10176236B2 (en) Systems and methods for a distributed query execution engine
US20200125540A1 (en) Self-correcting pipeline flows for schema drift
US10394805B2 (en) Database management for mobile devices
CN107194490A (en) Predictive modeling optimization
US20200125942A1 (en) Synthesizing a singular ensemble machine learning model from an ensemble of models
CN108885641A (en) High-performance data query processing and data analytics
US20230342491A1 (en) Analytics Platform for Federated Private Data
US12001273B2 (en) Software validation framework
US20220414547A1 (en) Machine learning inferencing based on directed acyclic graphs
US20230222178A1 (en) Synthetic data generation for machine learning model simulation
US10936587B2 (en) Techniques and architectures for providing and operating an application-aware database environment
US20240176841A1 (en) Centralized dynamic portal for creating and hosting static and dynamic applications
US20180107763A1 (en) Prediction using fusion of heterogeneous unstructured data
US20220414548A1 (en) Multi-model scoring in a multi-tenant system
US20240112065A1 (en) Meta-learning operation research optimization
US20240020607A1 (en) System and method for a machine learning operations framework
CN105653334A (en) Rapid development framework for MIS systems based on the SaaS mode
CN115760013A (en) Operation and maintenance model construction method and device, electronic equipment and storage medium
US11521089B2 (en) In-database predictive pipeline incremental engine
US11651005B2 (en) Intelligent datastore determination for microservice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant