CN109446251A - System and method for distributed artificial intelligence application development - Google Patents


Publication number
CN109446251A
Authority
CN
China
Prior art keywords
model
data
module
training
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811024278.7A
Other languages
Chinese (zh)
Inventor
朱沐尧
丁小可
石江枫
赵洲洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruiqi Information Technology Co Ltd
Original Assignee
Beijing Ruiqi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruiqi Information Technology Co Ltd
Priority to CN201811024278.7A
Publication of CN109446251A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a system and method for distributed artificial intelligence application development. The system comprises a feature engineering module, for processing preprocessed data by feature transformation, and a machine learning module, for receiving the data processed by the feature engineering module and analyzing it with machine learning models. The machine learning module includes a model training module, a model prediction module, and a model evaluation module. The model training module checks the suitability of the raw data by training models and determines the data source and the model to use for model prediction; the model prediction module predicts on test data with the trained model; the model evaluation module evaluates, on test data, the effect of the model generated from the training set. The invention has the following advantages: users can acquire data across databases and across platforms, and can interactively analyze and process data on a Hadoop cluster from a web console in the browser.

Description

System and method for distributed artificial intelligence application development
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a system and method for distributed artificial intelligence application development.
Background
At present, conventional machine learning in the field of predictive learning still mainly uses linear models, and quantitative trading is still largely regression-based, for example predicting the price of a stock at some point in time. However, as technology advances, the volume of data held by industries such as finance keeps growing, and the financial industry's demand for machine learning to strengthen its own business capabilities keeps rising, in areas such as financial forecasting, anti-fraud, credit financing, investment decisions, decision support, insurance pricing, and robo-advisory. The financial industry is no longer satisfied with using machine learning only for prediction; it needs machine learning to deliver more specialized outputs such as countermeasures and recommendations, market analysis, rationality analysis, price analysis, situation analysis, and risk control assessment. Traditional machine learning approaches can no longer meet the development needs of industries such as finance.
No effective solution to these problems in the related art has yet been proposed.
Summary of the invention
In view of the above technical problems in the related art, the present invention proposes a system and method for distributed artificial intelligence application development, which can solve the technical problem that conventional machine learning, as used in existing industries, offers simple functions and a single structure and cannot meet the needs of industry informatization.
To achieve the above technical purpose, the technical solution of the present invention is implemented as follows:
A system for distributed artificial intelligence application development, comprising:
a data preprocessing module, for cleaning and integrating data after obtaining it from different data sources;
a feature engineering module, for processing the preprocessed data by feature transformation;
a machine learning module, for receiving the data processed by the feature engineering module and analyzing it with machine learning models, including a model training module, a model prediction module, and a model evaluation module, wherein
the model training module is for checking the suitability of the raw data by training models, and determining the data source and the model to use for model prediction;
the model prediction module is for predicting on test data with the trained model;
the model evaluation module is for evaluating, on test data, the effect of the model generated from the training set.
Further, the data preprocessing module includes:
a data conversion module, for converting data between different formats;
a HiveQL query module, for writing query statements in HiveQL syntax;
a data split module, a computing unit for splitting one dataset into two;
a data column selection module, for selecting the columns of the data to be processed;
a missing value completion module, for handling missing data by filling null values with a given value;
a model training input format conversion module, for converting feature data into the format used for model training;
a model prediction input format conversion module, for converting feature data into the format used for model prediction.
Further, the feature engineering module includes:
a data binning module, for converting continuous variables into discrete variables through binning;
a standard scaler module, for standardizing each feature column to unit standard deviation, zero mean, or zero mean and unit standard deviation;
a feature value conversion module, for converting data into a format suitable for the model;
a feature type conversion module, for performing basic type conversion on specified feature variables;
a feature normalization module, for normalizing data.
Further, the model training of the model training module includes, but is not limited to, LR classification model training, SVM classification model training, naive Bayes classification model training, gradient boosting decision tree classification model training, random forest classification model training, and the K-MEANS clustering model.
Further, the model prediction of the model prediction module includes, but is not limited to, LR classification model prediction, classification model prediction, SVM classification model prediction, naive Bayes classification model prediction, and random forest classification model prediction.
Further, the model evaluation of the model evaluation module includes, but is not limited to, binary classification model evaluation, SVM binary classification model evaluation, multi-class classification model evaluation, and model result comparison.
Another aspect of the present invention provides a method for distributed artificial intelligence application development, comprising the following steps:
S1: after obtaining data from different data sources, cleaning and integrating the data;
S2: processing the preprocessed data by feature transformation;
S3: receiving the data processed in step S2 and analyzing it with machine learning models, including model training, model prediction, and model evaluation, wherein
S31: through model training, checking the suitability of the raw data, and determining the data source and the model to use for model prediction;
S32: through model prediction, predicting on test data with the trained model;
S33: through model evaluation, evaluating on test data the effect of the model generated from the training set.
Further, step S1 includes:
S11: through data conversion, converting data between different formats;
S12: writing query statements in HiveQL syntax;
S13: through data splitting, a computing unit splits one dataset into two;
S14: through data column selection, selecting the columns of the data to be processed;
S15: through missing value completion, handling missing data by filling null values with a given value;
S16: through model training input format conversion, converting feature data into the format used for model training;
S17: through model prediction input format conversion, converting feature data into the format used for model prediction.
Further, step S2 includes:
S21: using data binning, converting continuous variables into discrete variables;
S22: using the standard scaler, standardizing each feature column to unit standard deviation, zero mean, or zero mean and unit standard deviation;
S23: using feature value conversion, converting data into a format suitable for the model;
S24: using feature type conversion, performing basic type conversion on specified feature variables;
S25: using feature normalization, normalizing the data.
Further, the model training includes, but is not limited to, LR classification model training, SVM classification model training, naive Bayes classification model training, gradient boosting decision tree classification model training, random forest classification model training, and the K-MEANS clustering model; the model prediction includes, but is not limited to, LR classification model prediction, classification model prediction, SVM classification model prediction, naive Bayes classification model prediction, and random forest classification model prediction; the model evaluation includes, but is not limited to, binary classification model evaluation, SVM binary classification model evaluation, multi-class classification model evaluation, and model result comparison.
Beneficial effects of the present invention:
1. By supporting multiple data sources, users can acquire data across databases and across platforms;
2. HUE functionality is integrated, so data can be analyzed and processed interactively on the Hadoop cluster from a web console in the browser;
3. Based on different computation models, the system provides users with model computation, prediction, evaluation, and standardization functions, meeting industry users' machine learning needs to the greatest extent;
4. Multiple kinds of computation and predictive analysis can be performed on the data, saving users the cost of purchasing different analysis platforms.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the system for distributed artificial intelligence application development according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the data preprocessing module according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the feature engineering module according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the data sources according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the model training module according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the model prediction module according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the model evaluation module according to an embodiment of the present invention;
Fig. 8 is a flowchart of the method for distributed artificial intelligence application development according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
As shown in Fig. 1, a system for distributed artificial intelligence application development according to an embodiment of the present invention comprises:
a data preprocessing module, for cleaning and integrating data after obtaining it from different data sources.
In the data preprocessing module, the system obtains data from different data sources, converts it, and then performs data processing operations. The data preprocessing module converts data of different formats from different data sources into data that the models can use. The detailed process is as follows:
As shown in Fig. 8, to support connections to multiple databases, the system integrates local databases such as Hive, MYSQL, SQL SERVER, and ORACLE, Hadoop-based databases such as IMPALA, and HDFS, the distributed file system under Hadoop. The system provides connection configuration parameters for each data source. Once a connection is configured, the system reads the database contents directly, imports the data from these databases and data systems in the form of data tables, converts it to PARQUET format, and stores it in HDFS for subsequent data processing, wherein:
HDFS: HDFS (Hadoop Distributed File System) is a highly fault-tolerant system that provides high-throughput data access and is well suited to applications on large-scale datasets.
Hive: Hive is a Hadoop-based data warehouse tool that can map structured data files to database tables and provides simple SQL query functionality; it can convert SQL statements into MapReduce tasks for execution.
Impala: provides SQL semantics and can query petabyte-scale data stored in Hadoop's HDFS and HBase, with fast query performance.
MySQL: one of the most popular relational databases. It is small, fast, and has a low total cost of ownership. Small and medium-sized websites commonly choose MySQL as their site database.
Oracle: the Oracle Database system is one of the most popular relational database management systems. It is portable, easy to use, and powerful, and is suitable for all kinds of large, medium, small, and microcomputer environments.
SQL Server: SQL Server is a scalable, high-performance database management system designed for distributed client/server computing.
Before feature transformation, the format and content of the imported data need to be processed; data processing operations bring the data to a state in which the models can recognize and compute on it. For different data processing needs, the system provides the following data functions. The data processing functions can be used individually or in combination, by connecting the corresponding function modules on the front-end page.
a feature engineering module, for processing the preprocessed data by feature transformation.
To make the results of model analysis more accurate, the preprocessed data can be processed further through feature transformations after data preprocessing. The feature transformation modules can be used independently or in combination to apply multiple feature transformations to the preprocessed data, and the transformed data is then used in the computations of the different models.
a machine learning module, for receiving the data processed by the feature engineering module and analyzing it with machine learning models, including a model training module, a model prediction module, and a model evaluation module, wherein
the model training module is for checking the suitability of the raw data by training models, and determining the data source and the model to use for model prediction;
Each model training module can independently take in the feature-engineered data and output its own training result. Training is mainly used to check the suitability of the raw data and to see which model yields better training results on the data, so as to determine the data source and the model to use in the subsequent model prediction step.
the model prediction module is for predicting on test data with the trained model;
A trained model can be used to predict on test data. The input of a prediction module has two parts: the trained model itself and the test data. The test data must contain the same feature set as the training dataset, and the raw test data must be converted by the "model prediction input format conversion module" into a format suitable for model prediction; the converted data is then used for model prediction.
the model evaluation module is for evaluating, on test data, the effect of the model generated from the training set.
During machine learning training, the training effect of a model usually needs to be evaluated. A common approach is to split the dataset into a training set and a validation set. The training set is used to train the model, and the trained model is then applied to the validation set. The predictions given by the model are compared with the original target column values of the validation set, and the trained models are compared by computing various metrics. In binary classification models, for example, metrics such as precision, f-score, and AUC are often used to evaluate model effect.
The system provides SVM binary classification model evaluation (applied to models trained with the SVM binary classification model), binary classification model evaluation, and multi-class classification model evaluation; they differ in the metrics and methods used to evaluate each kind of model.
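The binary classification metrics named above can be sketched in plain Python. This is a minimal illustration, not the system's evaluation module: the function names `binary_metrics` and `auc` are hypothetical, labels are assumed to be 0 or 1 with 1 as the positive class, and AUC is computed by the pairwise-ranking definition rather than by any particular library.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall and F1 for labels in {0, 1} (1 is the positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of examples, which is fine for a validation-set illustration but would be replaced by a sort-based computation on large data.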
As shown in Fig. 2, in a specific embodiment of the present invention, the system provides the following data functions for different data processing needs. The data processing functions can be used individually or in combination; function modules are connected on the front-end page, and the data is processed in order from top to bottom following the connections between modules. The data preprocessing module includes:
a data conversion module, for converting data between different formats;
Specifically, compared with CSV format, a columnar storage layout (such as PARQUET) can speed up queries, because a query examines only the columns it needs and computes on their values, reading only a small fraction of a data file or table. PARQUET also supports flexible compression options, which can substantially reduce storage on disk. To support conversion between formats, the system can convert CSV to PARQUET: after a CSV file is imported, it is converted to PARQUET so that it can be queried and processed with HiveQL. PARQUET-to-JSON conversion turns data that has been processed with HiveQL and stored in PARQUET format into a JSON file that is convenient to download and edit.
a HiveQL query module, for writing query statements in HiveQL syntax; the query statements run in the Hive database against the data in Hive tables.
a data split module, a computing unit for splitting one dataset into two; after the split, one part is used as the data for training the model and the other is used for model evaluation;
a data column selection module, for selecting the columns of the data to be processed;
To simplify computation, the large number of columns present in the original table needs to be filtered. The data column selection module supports both selection and inverse selection to determine which data columns go on to the next operation.
a missing value completion module, for handling missing data by filling null values with a given value so that the information table is complete. The module supports several completion strategies, including using a statistic such as the mean, median, maximum, or minimum as the fill value, using 0 as the fill value, and using a user-defined value as the fill value;
a model training input format conversion module, for converting feature data into the format used for model training. To meet the requirements of the machine learning models in SPARK MLLIB, the official SPARK library, and to simplify feature input for the subsequent models, this module converts the data of selected columns into feature vectors with uniform names, turning multiple data columns into a single vector column;
a model prediction input format conversion module, for converting feature data into the format used for model prediction. Like the "model training input format conversion module", it converts the features, in a specific order, into the required column format. The test dataset converted by this module is passed to the model prediction modules (all of them, including classification model prediction and SVM classification model prediction) for model prediction.
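Two of the preprocessing functions listed above, data splitting and missing value completion, can be sketched in plain Python. This is only an illustration under stated assumptions: the names `split_dataset` and `fill_missing`, the 80/20 default ratio, the random shuffle, and the use of `None` for null values are all hypothetical and are not taken from the patent.

```python
import random
from statistics import mean, median


def split_dataset(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fill_missing(values, strategy="mean", custom=None):
    """Replace None entries in one column using the chosen strategy.

    Strategies mirror the ones listed in the text: a statistic of the
    observed values (mean, median, max, min), zero, or a custom value.
    """
    present = [v for v in values if v is not None]
    statistics = {"mean": mean, "median": median, "max": max, "min": min}
    if strategy in statistics:
        fill = statistics[strategy](present)  # raises if the column is all None
    elif strategy == "zero":
        fill = 0
    elif strategy == "custom":
        fill = custom
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]
```

For example, `fill_missing([1, None, 3], "mean")` fills the gap with the mean of the observed values, and `split_dataset(rows)` returns one part for training and one for evaluation.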
As shown in Fig. 3, in a specific embodiment of the present invention, the feature engineering module includes:
a data binning module, for converting continuous variables into discrete variables through binning;
When building classification models, continuous variables generally need to be discretized. After feature discretization the model is more stable and the risk of overfitting is reduced. For example, when building an application scorecard model with logistic regression as the base model, continuous variables need to be discretized, and discretization usually uses binning. Data binning is in effect a feature-processing step that converts continuous variables into discrete variables, and the new binned features can serve as input for machine learning model training.
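Binning can be sketched as a lookup against sorted bin edges. The helper `assign_bin` and the age edges used in the usage note are hypothetical; the patent does not say how bin boundaries are chosen, so user-supplied edges are an assumption here.

```python
from bisect import bisect_right
from typing import Sequence


def assign_bin(value: float, edges: Sequence[float]) -> int:
    """Return the index of the bin that holds `value`.

    `edges` must be sorted ascending. With edges [30, 60] the bins are
    (-inf, 30), [30, 60) and [60, +inf), numbered 0, 1 and 2.
    """
    return bisect_right(edges, value)
```

With edges `[30, 60]`, an age of 25 falls in bin 0, 40 in bin 1, and 63 in bin 2; the resulting bin indices are the discrete feature passed on to model training.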
OneHot encoding: OneHot encoding, also known as one-of-N encoding, is a method of converting character-type variables into numeric variables. It uses an N-bit status register to encode N states; each state has its own register bit, and only one bit is active at any time. After ONEHOT encoding, a column of label indices is mapped to a column of binary arrays.
StringToIndex encoding: many models do not support character-type variables, so StringToIndex takes the character labels in the source data and encodes them in order of how frequently each label occurs, for example 0, 1, 2, ...; if the input label is numeric, StringToIndex encoding first converts it to a string and then indexes the string.
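The two encodings described above can be sketched together. The function names are hypothetical; ordering labels from most to least frequent, and breaking ties alphabetically, are assumptions made so that the result is deterministic.

```python
from collections import Counter


def string_to_index(labels):
    """Index labels by descending frequency (most frequent label gets 0).

    Numeric labels are converted to strings first, as the text describes.
    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    as_text = [str(label) for label in labels]
    counts = Counter(as_text)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: index for index, label in enumerate(ordered)}
    return [mapping[label] for label in as_text], mapping


def one_hot(indices, n_states):
    """Encode each index as an N-element 0/1 list with exactly one 1."""
    return [[1 if position == index else 0 for position in range(n_states)]
            for index in indices]
```

Chaining the two, `one_hot(string_to_index(labels)[0], n_states)` turns a column of text labels into one binary array per row.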
a standard scaler module, for standardizing each feature column to unit standard deviation, zero mean, or zero mean and unit standard deviation;
The standard scaler (StandardScaler) operates on each column, that is, on each feature dimension, standardizing the feature to unit standard deviation, zero mean, or zero mean and unit standard deviation. The module therefore needs to fit the data in advance to obtain the mean and standard deviation of each dimension, which it then uses to scale that dimension;
a feature value conversion module, for converting data into a format suitable for the model;
A model can only use feature values that are numeric, and some raw data variables usually need processing before they can be converted to numbers. This module converts data into a format suitable for the model. It supports CSV and PARQUET input data and can convert the data with OneHot encoding or StringToIndex encoding for output.
a feature type conversion module, for performing basic type conversion on specified feature variables;
Feature columns in data tables have definite data types. In some cases these types are fixed when the data is written or acquired and are not the most suitable types for model training. For example, in certain situations all feature columns are defined as character types when the data is written, including columns whose meaning is numeric. This module converts specified feature columns to specified data types (batch conversion is supported). It supports conversion among four types: String, Double, Integer, and Long.
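Batch type conversion of named columns can be sketched as follows. The row-as-dict representation, the `cast_columns` name, and the mapping of the four type names onto Python types are assumptions for illustration; Python's `int` is unbounded, so Integer and Long both map to it here.

```python
# Hypothetical mapping from the four type names in the text to Python casts.
CASTERS = {"String": str, "Double": float, "Integer": int, "Long": int}


def cast_columns(rows, target_types):
    """Cast the listed columns of every row; other columns pass through.

    rows:         list of dicts, one dict per table row
    target_types: {column name: one of the keys of CASTERS}
    """
    return [
        {column: (CASTERS[target_types[column]](value)
                  if column in target_types else value)
         for column, value in row.items()}
        for row in rows
    ]
```

A row such as `{"age": "42", "score": "3.5"}` written with all-character columns becomes numeric after `cast_columns(rows, {"age": "Integer", "score": "Double"})`.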
a feature normalization module, for normalizing data;
Data normalization is a very important step in machine learning training. Features that are not normalized may cause the model to fail or produce a very odd model. To make the machine learning model better fit real conditions, the data needs to be normalized by linear transformation, the standardization method, or the range method. Data processed by feature normalization can also speed up the convergence of model training (for example, gradient descent) and improve training precision.
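Column-wise standardization (as performed by the standard scaler module) and range-based normalization can be sketched as below. Reading the "range method" as min-max scaling is an assumption, as is the use of the population standard deviation; some libraries use the sample standard deviation instead. The function names and flags are hypothetical.

```python
from statistics import mean, pstdev


def standardize(column, with_mean=True, with_std=True):
    """Shift a column to zero mean and/or scale it to unit standard deviation."""
    mu = mean(column) if with_mean else 0.0
    sigma = pstdev(column) if with_std else 1.0
    if sigma == 0:  # constant column: leave the scale unchanged
        sigma = 1.0
    return [(x - mu) / sigma for x in column]


def min_max(column):
    """Rescale a column linearly so that its values lie in [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: map everything to 0.0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]
```

In both functions the statistics are computed from the column first and applied second, which corresponds to the "fit, then scale" order the text describes for the standard scaler.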
As shown in Fig. 5, in a specific embodiment of the present invention, the model training of the model training module includes, but is not limited to, LR classification model training, SVM classification model training, naive Bayes classification model training, gradient boosting decision tree classification model training, random forest classification model training, and the K-MEANS clustering model, wherein:
LR classification model training: LR is one of the simplest classification models. It applies a logistic function (usually the sigmoid function) on top of linear regression; this function helps model training converge faster to the neighborhood of a local optimum. The result of LR classification model training is a "formula" containing specific parameters, and the user can plug data into this formula to obtain a result.
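The "formula" produced by LR training can be illustrated as a weighted sum passed through the sigmoid function. The weights and bias below are placeholders rather than trained values, the 0.5 threshold is an assumption, and `lr_predict` is a hypothetical name.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lr_predict(weights, bias, features, threshold=0.5):
    """Apply a trained LR 'formula' to one sample.

    Returns (label, probability of the positive class).
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability
```

Here the learned parameters are just the weight list and the bias, so "plugging data into the formula" amounts to one dot product and one sigmoid evaluation per sample.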
SVM classification model training: the goal of an SVM is to construct a hyperplane, or a set of hyperplanes, in a high-dimensional space and use it to partition the space, thereby achieving classification. Given a set of training examples, each labeled as belonging to one of two classes, the SVM training algorithm builds a model that assigns new examples to one of the two classes, making it a non-probabilistic binary linear classifier. The training result of an SVM is the definition of this "hyperplane"; the user can apply the definition of the hyperplane to a prediction dataset to obtain the predicted classes.
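Applying a trained hyperplane to new data can be sketched as a sign test on the linear decision function w·x + b. The hyperplane parameters in the usage note are placeholders, the labels +1 and -1 are a common convention assumed here, and `svm_predict` is a hypothetical name; this sketch covers only the linear case and does not show training.

```python
def svm_predict(w, b, x):
    """Classify one sample by which side of the hyperplane w.x + b = 0 it lies on.

    Returns (label, score) where label is +1 or -1 and score is the signed
    value of the decision function (points on the hyperplane get +1).
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (1 if score >= 0 else -1), score
```

For the hyperplane `w = [1.0, -1.0]`, `b = 0.0`, the point `[2.0, 1.0]` lies on the positive side and `[0.0, 3.0]` on the negative side.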
Naive Bayes classification model training: naive Bayes is a simple technique for constructing classifiers, that is, models that assign class labels to problem instances represented by feature values, where the class labels are drawn from a finite set. It is not a single algorithm for training such classifiers but a family of algorithms based on a common principle: all naive Bayes classifiers assume that each feature of a sample is unrelated to every other feature.
Gradient boosting decision tree classification model training: this model uses gradient descent to iterate the construction of decision trees and solves the prediction problem through the joint decision of many decision trees rather than a single one. It is currently one of the most widely used and best-performing classification models. The training result is an ensemble of decision trees; the user can apply prediction data to this ensemble and obtain the classification result from the combined decision of the trees.
Random forest classification model training: a random forest is a classifier containing multiple decision trees, and its output class is determined by the mode of the classes output by the individual trees. Training a random forest is the process of generating the decision trees. A random forest model handles problems such as missing data, error balancing, and outliers well, so it is very widely used in practice and performs well. The random forest classification model outputs a forest of decision trees and predicts by majority voting.
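The majority-voting step of a random forest can be sketched on its own. The trees here are stand-in callables rather than trained decision trees, and breaking ties by taking the smallest label is an assumption made for determinism; `majority_vote` and `forest_predict` are hypothetical names.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common predicted class (the mode).

    Ties go to the smallest label, an arbitrary but deterministic choice.
    """
    counts = Counter(tree_predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def forest_predict(trees, sample):
    """Ask every tree for a class and combine the answers by majority vote."""
    return majority_vote([tree(sample) for tree in trees])
```

Each element of `trees` only needs to be callable on a sample and return a class label, so any trained decision tree wrapped in a function fits this interface.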
K-MEANS clustering model: K-Means is a clustering algorithm widely used in the field of data mining. It partitions multi-dimensional data representing N points into K clusters so that each point belongs to the cluster whose mean (the cluster center) is nearest to it, and uses this as the clustering criterion. It is an unsupervised clustering method, commonly used for unsupervised cluster analysis. The result of model training contains the K cluster centers and the cluster to which each point belongs.
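A minimal sketch of this procedure is given below, assuming the initial centers are supplied by the caller and a fixed number of iterations is run. The function name, parameters, and data are illustrative assumptions.

```python
def kmeans(points, initial_centers, iterations=10):
    """Alternate nearest-center assignment and mean update (Lloyd's algorithm)."""
    centers = list(initial_centers)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        assignment = [
            min(range(len(centers)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))
            for point in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment


# Toy data with two obvious groups, for illustration only.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignment = kmeans(points, [(0, 0), (10, 10)])
```

The returned pair corresponds to the two training results named above: the K cluster centers and the cluster of each point.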
As shown in Fig. 6, in a specific embodiment of the present invention, the model prediction of the model prediction module includes, but is not limited to, LR classification model prediction, classification model prediction, SVM classification model prediction, naive Bayes classification model prediction, and random forest classification model prediction, wherein:
LR classification model prediction: applies to prediction with a trained LR classification model; the output contains the predicted class given by the model (label) and the probability of each class (probability);
Classification model prediction: applies to general trained classification models; the output contains the predicted class given by the model (label) and the probability of each class (probability);
SVM classification model prediction: applies to prediction with an SVM classification model; the output contains the predicted class given by the model (label) and the probability of each class (probability);
Naive Bayes classification model prediction: applies to prediction with a naive Bayes classification model; the output contains the predicted class given by the model (label) and the probability of each class (probability);
Random forest classification model prediction: applies to prediction with a random forest classification model; the output contains the predicted class given by the model (label) and the probability of each class (probability).
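The common output shape of these prediction components, a label plus a probability per class, can be sketched for the binary LR case. The function name, the dictionary keys, the class names, and the weights are assumptions for illustration; the patent only states that the output contains label and probability.

```python
import math


def lr_predict(weights, bias, features, classes=("0", "1")):
    """Return the predicted label and the probability of each class for a binary LR model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p1 = 1.0 / (1.0 + math.exp(-z))  # sigmoid gives the probability of the second class
    probability = {classes[0]: 1.0 - p1, classes[1]: p1}
    label = max(probability, key=probability.get)
    return {"label": label, "probability": probability}


# Hypothetical trained parameters, for illustration only.
result = lr_predict([2.0, -1.0], 0.0, [1.0, 1.0])
```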
As shown in Fig. 7, in a specific embodiment of the present invention, the model evaluation of the model evaluation module includes, but is not limited to, binary classification model evaluation, SVM binary classification model evaluation, multi-class classification model evaluation, and model result comparison, wherein:
Binary classification model evaluation: supports evaluating the prediction data of four types of model, namely LR regression, random forest, gradient boosting decision tree, and naive Bayes; the output results include:
i: evaluation result text file output;
ii: ROC curve, with Sensitivity on the vertical axis and 1-Specificity on the horizontal axis; the ROC curve is the trajectory of Sensitivity against 1-Specificity under different thresholds;
iii: PR curve, i.e. the Precision/Recall curve, with the Recall of each calculation on the horizontal axis and the Precision of each calculation on the vertical axis. The PR curve reflects the trade-off between how accurately the classifier identifies positive examples and how completely it covers them;
iv: Precision curve. Precision is defined with respect to the prediction results: it indicates how many of the samples predicted as positive are truly positive. The Precision curve is plotted with Precision on the vertical axis and the threshold on the horizontal axis;
v: Recall curve. Recall is a measure of coverage: it measures how many of the positive examples are classified as positive. The Recall curve is plotted with Recall on the vertical axis and the threshold on the horizontal axis;
vi: F-Measure curve, i.e. the F-Measure composite evaluation index. The F-Measure curve is plotted with F-Measure on the vertical axis and the threshold on the horizontal axis; the highest point of the curve corresponds to the threshold at which the composite index is highest;
vii: prediction data text file output;
viii: AUC value. The AUC (Area Under the ROC Curve) is often used as the most important evaluation index in the model evaluation stage to measure the accuracy of a model; the closer the AUC value is to 1, the better the model performs.
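The indices listed above can be computed from predicted scores and true labels. The sketch below assumes labels encoded as 0 and 1 and a score threshold at or above which a sample is predicted positive; the function names and the data are illustrative, not the patent's implementation. Evaluating metrics_at over a range of thresholds yields the points of the ROC, PR, Precision, Recall, and F-Measure curves.

```python
def confusion(scores, labels, threshold):
    """Count true/false positives and negatives at a given score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp, fp, fn, tn


def metrics_at(scores, labels, threshold):
    """Precision, recall (sensitivity), false positive rate (1 - specificity), and F1."""
    tp, fp, fn, tn = confusion(scores, labels, threshold)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1}


def auc(scores, labels):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions, for illustration only.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
at_half = metrics_at(scores, labels, 0.5)
area = auc(scores, labels)
```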
SVM binary classification model evaluation: the indices of SVM binary classification model evaluation are similar to those of general binary classification model evaluation; it supports the prediction data output by SVM models. The output results include:
i: evaluation result text file output.
Multi-class classification model evaluation: supports evaluating multi-class data, covering two models, LR logistic regression and random forest. The indices of multi-class classification model evaluation include accuracy, recall, f-score, AUC, and so on. The output results include:
i: evaluation result text file output.
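As a hedged illustration of the multi-class indices, the following sketch computes accuracy, per-class precision, recall, and f-score, and a macro-averaged f-score. The report layout, the macro averaging, and the sample labels are assumptions made for the example; AUC is omitted for brevity.

```python
def multiclass_report(y_true, y_pred):
    """Compute accuracy plus per-class precision, recall, and f-score."""
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[c] = {"precision": precision, "recall": recall, "f_score": f_score}
    # Macro average: every class contributes equally, regardless of its size.
    macro_f = sum(m["f_score"] for m in per_class.values()) / len(classes)
    return {"accuracy": accuracy, "per_class": per_class, "macro_f_score": macro_f}


# Toy labels, for illustration only.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
report = multiclass_report(y_true, y_pred)
```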
Model result comparison: compares the results on different indices between different models evaluated with the same model evaluation module, and presents the comparison in the form of result charts, i.e. displays the result charts of the compared models separately.
As shown in Fig. 8, another aspect of the present invention provides a method for distributed artificial intelligence application development, comprising the following steps:
S1: after obtaining data from different data sources, clean and integrate the data;
Data are obtained from different data sources, converted, and then processed; the data preprocessing module converts data in different formats from different data sources into data that can be used by the model. The detailed process is as follows:
To support connections to multiple databases, the system integrates local databases such as Hive, MYSQL, SQL SERVER, and ORACLE, Hadoop-based databases such as IMPALA, and HDFS, the distributed file system under Hadoop. The system provides connection configuration parameters for the different data sources; once a connection is configured, the contents of the database are read directly, the data are imported from these databases and data systems in the form of data tables, and they are converted to PARQUET format and stored in HDFS for subsequent data processing.
S2: process the preprocessed data through feature data conversion;
S3: receive the data processed in step S2 and analyze it with machine learning models, including model training, model prediction, and model evaluation, wherein:
S31: through model training, check the accuracy of the raw data, and determine the data source and the model to be used for model prediction;
S32: through model prediction, use the trained model to predict on test data;
S33: through model evaluation, evaluate the performance of the model generated from the training set using test data.
In a specific embodiment of the present invention, step S1 includes:
S11: through data conversion, convert data between different formats;
S12: through data splitting, split one data set into two data sets;
S13: through data column selection, select the columns to be processed in the data;
S14: through missing-value completion, handle missing data by filling null values with a specified value;
S15: through model training input format conversion, convert the feature data into the input format for model training;
S16: through model prediction input format conversion, convert the feature data into the input format for model prediction.
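Three of the preprocessing steps above (column selection, missing-value completion, and data splitting) can be sketched on rows represented as Python dictionaries. The function names, the row layout, the fill value, and the 80/20 split fraction are assumptions for illustration only.

```python
import random


def select_columns(rows, columns):
    """Keep only the columns that will be processed."""
    return [{c: row[c] for c in columns} for row in rows]


def fill_missing(rows, column, value):
    """Replace null (None) entries in one column with a fixed value."""
    return [{**row, column: value if row[column] is None else row[column]} for row in rows]


def split(rows, fraction, seed=0):
    """Shuffle deterministically and split one data set into two parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy table with some missing ages, for illustration only.
rows = [{"id": i, "age": None if i % 4 == 0 else 20 + i, "note": "x"} for i in range(10)]
selected = select_columns(rows, ["id", "age"])
filled = fill_missing(selected, "age", 0)
train, test = split(filled, 0.8)
```

Each function returns a new list and leaves its input unchanged, so the steps can be chained in any order.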
In a specific embodiment of the present invention, step S2 includes:
S21: using data binning, convert the binned continuous variables into discrete variables;
S22: using the standard scaler, standardize each feature column to unit standard deviation, zero mean, or zero mean and unit standard deviation;
S23: using feature value transformation, convert the data into a format suitable for use by the model;
S24: using feature type conversion, perform basic type conversion on specified feature variables;
S25: using feature normalization, normalize the data.
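Binning, standardization, and normalization can be sketched for a single feature column as follows. The function names and flags are hypothetical; the sketch uses the population standard deviation and min-max normalization, which are assumptions, since the patent does not specify either choice.

```python
import bisect
import math


def bin_values(values, edges):
    """Map each continuous value to a bin index; sorted edges are the lower bounds of bins 1..n."""
    return [bisect.bisect_right(edges, v) for v in values]


def standardize(values, with_mean=True, with_std=True):
    """Scale a column to zero mean, unit standard deviation, or both."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    out = [v - mean for v in values] if with_mean else list(values)
    if with_std and std > 0:
        out = [v / std for v in out]
    return out


def min_max(values):
    """Normalize a column linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


# Toy columns, for illustration only.
age_bins = bin_values([10, 18, 30, 70], [18, 65])
column = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(column)
scaled = min_max(column)
```

The two flags of standardize correspond to the three options named in S22: unit standard deviation only, zero mean only, or both.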
In a specific embodiment of the present invention, the model training includes, but is not limited to, LR classification model training, SVM classification model training, naive Bayes classification model training, gradient boosting decision tree classification model training, random forest classification model training, and the K-MEANS clustering model; the model prediction includes, but is not limited to, LR classification model prediction, classification model prediction, SVM classification model prediction, naive Bayes classification model prediction, and random forest classification model prediction; the model evaluation includes, but is not limited to, binary classification model evaluation, SVM binary classification model evaluation, multi-class classification model evaluation, and model result comparison.
In conclusion, by supporting multiple data sources, realizing user's progress by means of above-mentioned technical proposal of the invention Integration across database and cross-platform data acquisition;Integrated HUE function, can on the web console of browser end with Hadoop cluster Interact to analyzing and processing data;Based on different computation models, model calculating, prediction, assessment and standard are provided for user Function is dissolved, utmostly meets the needs of industry user is based on to machine learning;A variety of calculating forecast analysis are carried out to data, To save the fund cost that user buys different analysis platforms.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A system for distributed artificial intelligence application development, characterized by comprising:
a data preprocessing module, configured to clean and integrate data after obtaining the data from different data sources;
a feature engineering module, configured to process the preprocessed data through feature data conversion;
a machine learning module, configured to receive the data processed by the feature engineering module and analyze it with machine learning models, comprising a model training module, a model prediction module, and a model evaluation module, wherein:
the model training module is configured to check the accuracy of the raw data by training the model, and to determine the data source and the model to be used for model prediction;
the model prediction module is configured to predict on test data using the trained model;
the model evaluation module is configured to evaluate the performance of the model generated from the training set using test data.
2. The system for distributed artificial intelligence application development according to claim 1, characterized in that the data preprocessing module comprises:
a data conversion module, configured to convert data between different formats;
a HiveQL query module, configured to write query statements using HiveSQL syntax;
a data splitting module, configured to split one data set into two data sets;
a data column selection module, configured to select the columns to be processed in the data;
a missing-value completion module, configured to handle missing data by filling null values with a specified value;
a model training input format conversion module, configured to convert the feature data into the input format for model training;
a model prediction input format conversion module, configured to convert the feature data into the input format for model prediction.
3. The system for distributed artificial intelligence application development according to claim 1, characterized in that the feature engineering module comprises:
a data binning module, configured to convert the binned continuous variables into discrete variables;
a standard scaler module, configured to standardize each feature column to unit standard deviation, zero mean, or zero mean and unit standard deviation;
a feature value transformation module, configured to convert the data into a format suitable for use by the model;
a feature type conversion module, configured to perform basic type conversion on specified feature variables;
a feature normalization module, configured to normalize the data.
4. The system for distributed artificial intelligence application development according to any one of claims 1-3, characterized in that the model training of the model training module includes, but is not limited to, LR classification model training, SVM classification model training, naive Bayes classification model training, gradient boosting decision tree classification model training, random forest classification model training, and the K-MEANS clustering model.
5. The system for distributed artificial intelligence application development according to any one of claims 1-3, characterized in that the model prediction of the model prediction module includes, but is not limited to, LR classification model prediction, classification model prediction, SVM classification model prediction, naive Bayes classification model prediction, and random forest classification model prediction.
6. The system for distributed artificial intelligence application development according to any one of claims 1-3, characterized in that the model evaluation of the model evaluation module includes, but is not limited to, binary classification model evaluation, SVM binary classification model evaluation, multi-class classification model evaluation, and model result comparison.
7. A method for distributed artificial intelligence application development, characterized by comprising the following steps:
S1: after obtaining data from different data sources, clean and integrate the data;
S2: process the preprocessed data through feature data conversion;
S3: receive the data processed in step S2 and analyze it with machine learning models, including model training, model prediction, and model evaluation, wherein:
S31: through model training, check the accuracy of the raw data, and determine the data source and the model to be used for model prediction;
S32: through model prediction, use the trained model to predict on test data;
S33: through model evaluation, evaluate the performance of the model generated from the training set using test data.
8. The method for distributed artificial intelligence application development according to claim 7, characterized in that step S1 comprises:
S11: through data conversion, convert data between different formats;
S12: write query statements using HiveSQL syntax;
S13: through data splitting, split one data set into two data sets;
S14: through data column selection, select the columns to be processed in the data;
S15: through missing-value completion, handle missing data by filling null values with a specified value;
S16: through model training input format conversion, convert the feature data into the input format for model training;
S17: through model prediction input format conversion, convert the feature data into the input format for model prediction.
9. The method for distributed artificial intelligence application development according to claim 7, characterized in that step S2 comprises:
S21: using data binning, convert the binned continuous variables into discrete variables;
S22: using the standard scaler, standardize each feature column to unit standard deviation, zero mean, or zero mean and unit standard deviation;
S23: using feature value transformation, convert the data into a format suitable for use by the model;
S24: using feature type conversion, perform basic type conversion on specified feature variables;
S25: using feature normalization, normalize the data.
10. The method for distributed artificial intelligence application development according to any one of claims 7-9, characterized in that the model training includes, but is not limited to, LR classification model training, SVM classification model training, naive Bayes classification model training, gradient boosting decision tree classification model training, random forest classification model training, and the K-MEANS clustering model; the model prediction includes, but is not limited to, LR classification model prediction, classification model prediction, SVM classification model prediction, naive Bayes classification model prediction, and random forest classification model prediction; the model evaluation includes, but is not limited to, binary classification model evaluation, SVM binary classification model evaluation, multi-class classification model evaluation, and model result comparison.
CN201811024278.7A 2018-09-04 2018-09-04 The system and method for distributed artificial intelligence application and development Pending CN109446251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811024278.7A CN109446251A (en) 2018-09-04 2018-09-04 The system and method for distributed artificial intelligence application and development

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811024278.7A CN109446251A (en) 2018-09-04 2018-09-04 The system and method for distributed artificial intelligence application and development

Publications (1)

Publication Number Publication Date
CN109446251A true CN109446251A (en) 2019-03-08

Family

ID=65533208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811024278.7A Pending CN109446251A (en) 2018-09-04 2018-09-04 The system and method for distributed artificial intelligence application and development

Country Status (1)

Country Link
CN (1) CN109446251A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573115A (en) * 2015-02-04 2015-04-29 新余兴邦信息产业有限公司 Method and system for achieving integration interface supporting operation of multi-type databases
US20150379423A1 (en) * 2014-06-30 2015-12-31 Amazon Technologies, Inc. Feature processing recipes for machine learning
CN106446217A (en) * 2016-09-30 2017-02-22 广州特道信息科技有限公司 High-speed big data integration system
CN107229976A (en) * 2017-06-08 2017-10-03 郑州云海信息技术有限公司 A kind of distributed machines learning system based on spark
CN107516135A (en) * 2017-07-14 2017-12-26 浙江大学 A kind of automation monitoring learning method for supporting multi-source data

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021017293A1 (en) * 2019-08-01 2021-02-04 平安科技(深圳)有限公司 Rule training method, apparatus, device, and storage medium
WO2021063171A1 (en) * 2019-09-30 2021-04-08 腾讯科技(深圳)有限公司 Decision tree model training method, system, storage medium, and prediction method
CN111125149A (en) * 2019-12-19 2020-05-08 广州品唯软件有限公司 Hive-based data acquisition method and device and storage medium
CN111125149B (en) * 2019-12-19 2024-01-26 广州品唯软件有限公司 Hive-based data acquisition method, hive-based data acquisition device and storage medium
CN111382787A (en) * 2020-03-06 2020-07-07 芯薇(上海)智能科技有限公司 Target detection method based on deep learning
CN111582498A (en) * 2020-04-30 2020-08-25 重庆富民银行股份有限公司 QA (quality assurance) assistant decision method and system based on machine learning
CN113381998A (en) * 2021-06-08 2021-09-10 上海天旦网络科技发展有限公司 Deep learning-based application protocol auxiliary analysis system and method
CN113381998B (en) * 2021-06-08 2022-11-22 上海天旦网络科技发展有限公司 Deep learning-based application protocol auxiliary analysis system and method
CN116991932A (en) * 2023-09-25 2023-11-03 济南卓鲁信息科技有限公司 Data analysis and management system and method based on artificial intelligence
CN116991932B (en) * 2023-09-25 2023-12-15 济南卓鲁信息科技有限公司 Data analysis and management system and method based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190308