Detailed Description of the Embodiments
To help those skilled in the art better understand the technical scheme of the present application, the technical scheme in the embodiments is described in detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments shall fall within the scope of protection of the present application.
To address the problem that existing model files have poor generality and are difficult to deploy, the present application provides a method for outputting trained-model information. As shown in Fig. 1, the method may comprise the following steps:
S101: according to the model training requirements, preprocess the training sample data to obtain a preprocessing result, and record the preprocessing logic;
S102: using the preprocessing result as model input data, obtain a trained model through training;
S103: write the characteristic information of the trained model and the preprocessing logic into a model file, and output the model file.
Data preprocessing is usually a processing scheme that a data mining engineer settles on after repeated attempts. Its basic purpose is to transform the raw data so that they are better suited to the model. Preprocessing of training sample data typically includes steps such as missing-value processing, feature discretization, feature combination and feature selection. The present application does not limit how these steps are implemented; those skilled in the art can choose a suitable approach according to the actual situation. Moreover, depending on the actual application, not every one of these steps is required. For example, when the training sample data are already discretized values, the discretization step can be skipped.
Compared with the prior art, when preprocessing data the scheme of the present application not only produces the preprocessing result used for subsequent model training but also records the preprocessing logic. The reason is as follows. In the model training stage, the model is actually trained with the preprocessed training sample data as input. After the model is deployed, however, the data directly available to it have the same form as the original training sample data, and such data cannot be fed into the model directly. To solve this problem, the scheme also records the processing logic used to obtain the preprocessing result and writes it into the model file. In the model deployment stage, reading the model file then yields both the preprocessing logic and the model information, and from these two parts the data preprocessing module and the model processing module can be deployed into the system automatically.
As an example, assume that in the training sample data the feature field x takes values in (0,100], and that the data mining engineer, after repeated attempts, concludes that discretizing (0,100] into 4 intervals gives good results. The discrete intervals are (0,25], (26,50], (51,75] and (76,100], and they are assigned the discrete values 0, 1, 2 and 3 respectively.
Assume that, using the above discretization result, the model finally obtained by training is y=2x+3. In the prior art, only y=2x+3 would be written into the model file. From the processing described above, however, the "x" fed into this model should actually be one of the discretized values 0, 1, 2, 3, whereas the data directly available after the model is deployed still lie in (0,100], the same range as the training sample data. To ensure that the model is used correctly, the discretization logic would then have to be rewritten by hand. Under the scheme of the present application, two parts of information can be written into the model file:
The first part is the characteristic information of the model, which in this example is y=2x+3;
The second part is the preprocessing logic, which in this example is:
(0,25]→0
(26,50]→1
(51,75]→2
(76,100]→3
Then, in the model deployment stage, the model processing module can be deployed into the system automatically by reading the first part of the model file, and the discretization module that works with this model can be deployed into the system automatically by reading the second part, so that the discretization module does not have to be rewritten by hand.
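The two-part model file of this example can be sketched in Python as follows. This is a minimal illustration rather than the format used by the application: the JSON layout, the key names and the functions `build_model_file` and `deploy` are all hypothetical. The four intervals are read here as contiguous left-open, right-closed bins with cut points 25, 50 and 75, which is an assumption.

```python
import json
from bisect import bisect_left


def build_model_file():
    """Return a model file holding both the model and the preprocessing logic."""
    return {
        # Part 1: characteristic information of the model, y = 2x + 3
        "model": {"coef": 2.0, "intercept": 3.0},
        # Part 2: preprocessing logic (hypothetical layout, assumed contiguous bins)
        "preprocessing": [
            {"field": "x", "op": "discretize",
             "low": 0, "high": 100, "cut_points": [25, 50, 75]}
        ],
    }


def deploy(model_text):
    """Rebuild the discretization step and the model from the file contents."""
    spec = json.loads(model_text)
    step = spec["preprocessing"][0]
    cuts = step["cut_points"]
    coef = spec["model"]["coef"]
    intercept = spec["model"]["intercept"]

    def predict(raw_x):
        if not (step["low"] < raw_x <= step["high"]):
            raise ValueError(f"x={raw_x} is outside ({step['low']}, {step['high']}]")
        # bisect_left counts the cut points strictly below raw_x,
        # which is the bin index for left-open, right-closed bins.
        x = bisect_left(cuts, raw_x)
        return coef * x + intercept

    return predict


model_text = json.dumps(build_model_file())
predict = deploy(model_text)
print(predict(60))  # raw value 60 falls in bin 2, so y = 2*2 + 3 = 7.0
```

Because `deploy` reads both parts from the same file, the caller passes raw values in (0,100] and never has to re-implement the discretization by hand.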
Of course, the above example is only schematic. An actual model file must be written according to a specific specification, which the present application does not limit.
The scheme of the present application is illustrated below with a more specific embodiment, in which the final model file is output in PMML format.
A common data mining methodology divides the modeling process into the following steps: missing-value processing, feature discretization, feature combination, feature selection, model training and model evaluation. Model evaluation is a test of the output model and is unrelated to the scheme of the present application, whereas the first 4 steps belong to the "preprocessing" of the present application. Based on this process, the present application provides the trained-model information output method shown in Fig. 2, in which S101a to S101d correspond to missing-value processing, feature discretization, feature combination and feature selection respectively. Each of these 4 preprocessing steps outputs two kinds of data: 1) the result obtained by applying the step to its input data; 2) the processing logic of the step.
Correspondingly, the overall processing flow also has two parts. On the one hand, the 4 steps are chained in series during preprocessing: the training sample data are first fed into S101a, the output of each step is the input of the next, and after the 4 steps have been executed in turn, S101d outputs the preprocessing result, which the subsequent step S102 uses for model training. On the other hand, each of the 4 steps outputs its processing logic, which is collected together with the model information obtained by training in S102 and written into the model file. That is, in addition to the information of the model itself, the final model file records the processing logic of each of the 4 preprocessing steps and the order in which the 4 steps are executed.
In practical applications, if a preprocessing step does not need to be performed, the corresponding preprocessing function can be switched off by modifying the module code.
To facilitate unified management and extension, a unified module design specification can be defined for the 4 modules of missing-value processing, feature discretization, feature combination and feature selection. Taking the YAML format as an example, the design specification is schematically as follows:
In the above design specification, each processing module contains 3 submodules: the input submodule inputs, the algorithm submodule algorithm and the output submodule outputs, of which the algorithm submodule is optional. The submodules are linked through schemas, datas, models and evaluations. In the outputs submodule, whether each of these four kinds of information is output can be configured separately. schemas is used to output the processing result of the current module; the next module can look up the data in the database according to the schemas output by the previous module and use them as its own input. datas can be used to output data to a local text file. models is used to output the processing logic of the current module. evaluations is used to output files describing, for example, model performance, and is generally used for visualization. It follows that, under the scheme of the present application, a preprocessing module should set at least schemas and models to true in outputs.
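How the four output switches might gate what a module emits can be sketched as follows. The key names inputs, algorithm, outputs, schemas, datas, models and evaluations come from the text; everything else (the dictionary contents, the payloads and the function `emit_outputs`) is hypothetical, and a Python dictionary stands in for the YAML specification. Here `schemas` and `models` are switched on.

```python
# Hypothetical module description with the three submodules named in the text.
discretize_module = {
    "inputs": {"schemas": ["user_table"]},   # placeholder content
    "algorithm": {"name": "discretize"},     # optional submodule
    "outputs": {
        "schemas": True,       # processing result, read by the next module
        "datas": False,        # dump of the data to a local text file
        "models": True,        # processing logic of this module
        "evaluations": False,  # effect reports used for visualization
    },
}


def emit_outputs(module, produced):
    """Keep only the kinds of output that the module's outputs flags enable."""
    flags = module["outputs"]
    return {kind: payload for kind, payload in produced.items()
            if flags.get(kind, False)}


# Placeholder payloads for each kind of output.
produced = {
    "schemas": "reference to the processed table",
    "datas": "rows for a local text file",
    "models": {"x1": [1, 5, 9]},
    "evaluations": "report for visualization",
}

emitted = emit_outputs(discretize_module, produced)
print(sorted(emitted))  # ['models', 'schemas']
```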
The processing performed by a module is illustrated below, taking feature discretization as an example.
Assume that the identifier (taskId) of the feature discretization module is 10003 and that this module depends on the missing-value filling module (taskId 10002). For convenience of description, assume that the input data required by the discretization module (that is, the output data of the missing-value filling module) are as follows, expressed in the form of a schema:
The meaning of these data is: use the data in partitions 20150301 and 20150302 of the table "user_table", selecting only columns x1, x2 and x3. Here, from indicates how the value of the current field was obtained, with the following possibilities:
"origin": the value of the current field is inherited from the original field
"fill": the value of the current field has undergone missing-value processing
"discrete": the value of the current field has undergone discretization
"combine": the value of the current field was obtained by feature combination
"dummy": the value of the current field was obtained by dummy coding
On the basis of the foregoing design specification, the feature discretization module is designed as follows:
The discretization module takes the above schema as input data and discretizes columns x1, x2 and x3. Column x1 is discretized with the given cut points 1, 5 and 9. Column x2 is discretized by equal frequency with 3 intervals. Column x3 is discretized by equal frequency with 5 samples per interval.
Note that in the outputs submodule the schemas and models fields are both true. This shows that the final output of the discretization module has two parts: the result of discretizing the input data (schemas) and the discretization logic (models). The latter can be output as a JSON file with the following content:
As can be seen, the JSON file expresses the discretization logic: the discretization intervals of column x1 are -Inf~1, 1~5, 5~9 and 9~+Inf; those of column x2 are -Inf~1, 1~7 and 7~+Inf; and those of column x3 are -Inf~2, 2~7 and 7~+Inf.
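The two discretization methods mentioned above, given cut points and equal frequency, can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the sample data are made up for the demonstration, and the choice of left-open, right-closed intervals is an assumption, since the text does not say which end of each interval is closed.

```python
import math
from bisect import bisect_left


def intervals_from_cuts(cuts):
    """Turn cut points into (left, right) pairs, e.g. 1,5,9 -> -Inf~1, 1~5, 5~9, 9~+Inf."""
    edges = [-math.inf, *cuts, math.inf]
    return list(zip(edges[:-1], edges[1:]))


def equal_frequency_cuts(values, n_bins):
    """Pick cut points so that each bin holds roughly the same number of samples.

    Each cut is the largest sample of its bin; duplicate cuts caused by
    repeated values are merged, which can yield fewer than n_bins bins.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[(i * n) // n_bins - 1] for i in range(1, n_bins)]
    return sorted(set(cuts))


def assign_bin(x, cuts):
    """Index of the left-open, right-closed interval that contains x."""
    return bisect_left(cuts, x)


# x1: given cut points 1, 5, 9, as in the text.
x1_cuts = [1, 5, 9]
print(intervals_from_cuts(x1_cuts))
# [(-inf, 1), (1, 5), (5, 9), (9, inf)]

# Equal frequency with 3 intervals on made-up data (not the patent's data).
x2_cuts = equal_frequency_cuts([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
print(x2_cuts)  # [3, 6]

# Only the cut points need to be recorded as the processing logic.
logic = {"x1": {"cut_points": x1_cuts}, "x2": {"cut_points": x2_cuts}}
```

Storing only the cut points keeps the recorded logic small and avoids serializing infinite bounds, which standard JSON cannot represent.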
After the model is obtained by training in S102, the JSON file holding this discretization logic is collected into the final PMML file. The content of the JSON file is written into the PMML file as follows:
As can be seen, the discretization logic is written into the LocalTransformations (LT, local transformation) section of the PMML file. LocalTransformations is a data transformation section defined in the PMML standard and is dedicated to holding the logic that processes data before they reach the model. It supports common functions such as data filling, format conversion and discretization, as well as user-defined data processing, and the LT section can be recognized by a PMML parser. In the subsequent model deployment stage, the system can therefore obtain the discretization logic by parsing the LocalTransformations section of the PMML model file and automatically reconstruct the corresponding discretization module in the system. Of course, besides discretization, the processing logic of other data preprocessing modules, such as the feature combination module and the feature selection module, can be written into the model file in a similar way; the embodiments of the present application do not enumerate them one by one.
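Writing cut points into a LocalTransformations fragment, and reading them back at deployment time, can be sketched as follows. The element and attribute names (LocalTransformations, DerivedField, Discretize, DiscretizeBin, Interval, closure, leftMargin, rightMargin) come from the PMML standard. The derived field name, the choice of `openClosed` closure and the helper functions are assumptions, and the fragment is shown on its own rather than inside a complete PMML model element.

```python
import xml.etree.ElementTree as ET


def discretize_to_pmml(field, cuts):
    """Build a LocalTransformations fragment that bins `field` at `cuts`."""
    lt = ET.Element("LocalTransformations")
    derived = ET.SubElement(lt, "DerivedField", name=f"{field}_bin",
                            optype="categorical", dataType="integer")
    disc = ET.SubElement(derived, "Discretize", field=field)
    edges = [None, *cuts, None]  # None marks an unbounded side
    for index, (left, right) in enumerate(zip(edges[:-1], edges[1:])):
        bin_el = ET.SubElement(disc, "DiscretizeBin", binValue=str(index))
        attrs = {"closure": "openClosed"}  # assumed: left-open, right-closed
        if left is not None:
            attrs["leftMargin"] = str(left)
        if right is not None:
            attrs["rightMargin"] = str(right)
        ET.SubElement(bin_el, "Interval", attrs)
    return lt


def cuts_from_pmml(lt):
    """Recover the cut points from the fragment, as a deployment step might."""
    return [float(iv.get("rightMargin")) for iv in lt.iter("Interval")
            if iv.get("rightMargin") is not None]


fragment = discretize_to_pmml("x1", [1, 5, 9])
xml_text = ET.tostring(fragment, encoding="unicode")

# Round trip: parse the text again and rebuild the discretization logic.
restored = cuts_from_pmml(ET.fromstring(xml_text))
print(restored)  # [1.0, 5.0, 9.0]
```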
Corresponding to the above method embodiment, the present application also provides an apparatus for outputting trained-model information. As shown in Fig. 3, the apparatus may comprise:
a preprocessing module 110, configured to preprocess the training sample data according to the model training requirements to obtain a preprocessing result;
a processing logic recording module 120, configured to record the preprocessing logic;
a training module 130, configured to obtain a trained model through training, using the preprocessing result as model input data;
an output module 140, configured to write the characteristic information of the trained model and the preprocessing logic into a model file and output the model file.
In one embodiment of the present application, the preprocessing module 110 may be specifically configured to preprocess the training sample data in one or more of the following ways:
missing-value processing, feature discretization, feature combination, feature selection.
In one embodiment of the present application, when the preprocessing module 110 preprocesses the training sample data in several ways, the processing logic recording module 120 may be specifically configured to record the processing logic of each way and the order in which the ways are executed.
In one embodiment of the present application, the output module 140 may specifically output the model file in the Predictive Model Markup Language (PMML) format.
Further, the output module 140 may be specifically configured to write the preprocessing logic into the local transformation section LocalTransformations of the PMML file.
From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware platform. On this understanding, the technical scheme of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, reference may be made from one embodiment to another, and each embodiment focuses on what distinguishes it from the others.
The apparatus embodiment in particular is described relatively briefly because it is substantially similar to the method embodiment; for the relevant parts, refer to the description of the method embodiment. The apparatus embodiment described above is only illustrative. The modules described as separate components may or may not be physically separate, and when the scheme of the present application is implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The above is only a specific embodiment of the present application. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the present application, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present application.