CN104778254A - Distributed system and method for automatic non-parametric topic labeling - Google Patents

Distributed system and method for automatic non-parametric topic labeling Download PDF

Info

Publication number
CN104778254A
Authority
CN
China
Prior art keywords
model
parameter
module
unit
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510186154.9A
Other languages
Chinese (zh)
Other versions
CN104778254B (en)
Inventor
李攀登 (Li Pandeng)
王炼 (Wang Lian)
姜军 (Jiang Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking Blue Coloured Light Mark Brand Management Consultant Inc Co
Original Assignee
Peking Blue Coloured Light Mark Brand Management Consultant Inc Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking Blue Coloured Light Mark Brand Management Consultant Inc Co filed Critical Peking Blue Coloured Light Mark Brand Management Consultant Inc Co
Priority to CN201510186154.9A priority Critical patent/CN104778254B/en
Publication of CN104778254A publication Critical patent/CN104778254A/en
Application granted granted Critical
Publication of CN104778254B publication Critical patent/CN104778254B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a distributed system for automatic non-parametric topic labeling. The distributed system comprises an R encapsulation calling layer and a distributed data processing layer. The R encapsulation calling layer comprises a parameter configuration module, a first communication parsing module and a main function module; the distributed data processing layer comprises a second communication parsing module, a task scheduling module, a model management module, an algorithm processing module and an enterprise application module. The invention also provides a method for automatic non-parametric topic labeling.

Description

A distributed system and labeling method for automatic non-parametric topic labeling
Technical field
The present invention relates to a big-data processing method in the field of statistical learning, and in particular to a system for automatic topic-labeling model building, distributed deployment and automated service application processing.
Background art
As internet technologies and products mature, online information expands rapidly, and people leave traces of themselves on various platforms and media through various carriers: they comment on articles on e-commerce platforms and post about topics of interest on microblogs. This directly produces a rapid accumulation of massive data, and in particular large volumes of text. How to mine the topics users express from these texts through semantic analysis, statistical learning and similar techniques has therefore become a technical problem of great interest and value to the industry, since a wide range of business applications can use the mined information for precision marketing and data products. Academia and industry have already conducted extensive research in this field.
However, in practice and research we find that the prior art has at least the following problems. Existing techniques all assume that the word distribution of the text, or its latent topic distribution, obeys some hypothesized distribution, and then carry out subsequent iterative parameter training of the model. The drawback of this approach is that when the words of the actual text and the true topics of users do not obey the hypothesized distribution, the model trained under that hypothesis is seriously biased. Some machine-learning algorithms, such as SVMs and neural networks, have strong predictive ability, but their high computational complexity limits their commercial application in the big-data era and hinders their adoption. Moreover, the prior art requires humans to update model parameters at periodic intervals and still lacks self-learning capability.
Summary of the invention
In order to overcome above defect, the invention provides a kind of distributed system of non-parametric topic automatic marking, comprising: R encapsulates calling layer and distributed data processing layer; Described R encapsulates calling layer and comprises parameter configuration module, the first communication analysis module and principal function module; Described distributed data processing layer comprises second communication parsing module, task scheduling modules, model management module, algorithm processing module and enterprise application modules; Wherein, described parameter configuration module is used for accepting configuration information; Described principal function module for accepting the personalize development carried out algorithmic dispatching, information feed back process and other process, and generates the executable configuration of described distributed data processing layer accordingly and needs the mission bit stream performed to send to described distributed data processing layer; Described first communication analysis module is used for communicating to connect with described second communication parsing module, for encapsulating foundation communication between calling layer and distributed data processing layer at R and resolving Content of Communication; Described task scheduling modules for receiving the executable configuration of described principal function module transmission and needing the mission bit stream of execution, and correspondingly controls and coordinates the work of described model management module, described algorithm processing module and described enterprise application modules; Described model management module for building model, algorithm processing module computation model parameter described in instruction, and the model parameter that described algorithm processing module returns being integrated, generation model Parameter File; The instruction that described algorithm processing module 
receives described model management module calculates model parameter, and returns results; Described enterprise application modules is used for the language material of reception to carry out pre-service, and marks according to the model parameter file generated topic of described model management CMOS macro cell.
Preferably, during pre-processing the enterprise application module segments the corpus into words and then builds a label-word TF-IDF matrix, where the label is the topic mark, denoted y, and TF-IDF is the transformed frequency with which each word (a predictive variable) occurs in each corpus document, denoted x.
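The label-word frequency matrix described above (commonly known as TF-IDF) can be sketched in plain Python; the toy pre-segmented documents and labels below are invented for illustration, and the patent itself builds this matrix inside its R/Hadoop pipeline:

```python
import math
from collections import Counter

docs = [
    ["battery", "lasts", "long"],        # pre-segmented corpus documents
    ["battery", "drains", "fast"],
    ["great", "camera", "screen"],
]
y = ["praise", "complaint", "praise"]    # topic labels (the y above)

vocab = sorted({w for d in docs for w in d})    # column order of the matrix
df = Counter(w for d in docs for w in set(d))   # document frequency per word

def tf_idf_row(doc):
    tf = Counter(doc)
    n = len(docs)
    # classic TF-IDF: term frequency times log-scaled inverse document frequency
    return [tf[w] / len(doc) * math.log(n / df[w]) for w in vocab]

X = [tf_idf_row(d) for d in docs]        # the x matrix of the text
print(len(X), len(vocab))                # 3 documents, 8 distinct words
```

Note that "battery", which appears in two documents, gets a lower TF-IDF score than equally frequent words appearing in only one, which is exactly why the matrix is useful as a predictive feature.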
Preferably, the algorithm processing module solves for at least one of the parameter estimate θ̂, the optimal bandwidth h and the support vectors x_s as follows: a) build the class predictive equation and its likelihood function, and generalize it with nonparametric techniques to obtain a more general class maximum likelihood function, where l is the order of the polynomial approximation of the likelihood function; b) apply a kernel feature mapping to the original features to obtain the nonparametric class maximum likelihood function, and maximize this likelihood function by gradient descent until convergence to obtain θ̂, where hat(·) denotes an estimated value; c) solve for the window width, where h_opt is the optimal value of the window width h solved from the corresponding equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardized kernel function, and the variance function is normalized between 0 and 1.
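The claim's exact likelihood formulas appear only as figures in the original publication; a generic stand-in consistent with the description — a kernel feature mapping plus gradient ascent on a logistic class likelihood — might look like the following plain-Python sketch (data, bandwidth and learning rate are all illustrative assumptions):

```python
import math

X = [[0.0], [0.2], [0.8], [1.0]]   # toy 1-d features
y = [0, 0, 1, 1]                   # class labels

h = 0.5                            # window width (bandwidth)

def K(a, b):
    # Gaussian kernel with bandwidth h: the "kernel feature mapping"
    return math.exp(-sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / (2 * h * h))

G = [[K(a, b) for b in X] for a in X]   # Gram matrix over the samples
alpha = [0.0] * len(X)                  # dual coefficients to estimate

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# gradient ascent on the logistic class likelihood, iterated to convergence
for _ in range(500):
    for i in range(len(X)):
        p = sigmoid(sum(a * g for a, g in zip(alpha, G[i])))
        err = y[i] - p                  # derivative of the log-likelihood
        for j in range(len(X)):
            alpha[j] += 0.5 * err * G[i][j]

pred = [sigmoid(sum(a * g for a, g in zip(alpha, G[i]))) for i in range(len(X))]
print([round(p) for p in pred])
```

Samples whose coefficients remain influential after convergence play the role of the support vectors x_s in the claim.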
Preferably, the enterprise application module comprises a prediction scoring unit for scoring the results of processing new corpus data with the existing model. If the score passes, the system outputs the results of processing the new corpus with the existing model; if the score fails, the system updates the existing model and processes the new corpus with the updated model.
Preferably, the model management module comprises an error-distribution update unit, which assigns each sample an error-update weight, adjusts the sample weights when the existing model is updated, and selects the samples whose weights meet a given condition as the training samples for the model update.
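The error-update weighting described above can be sketched with a boosting-style rule (an assumption; the patent does not spell out the update formula): mispredicted samples are up-weighted, correctly predicted samples are down-weighted, and samples above the uniform weight level are selected for the model update:

```python
import math

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]              # samples 1 and 3 were mispredicted
w = [1.0 / len(y_true)] * len(y_true)    # initial uniform error-update weights

err = sum(wi for wi, t, p in zip(w, y_true, y_pred) if t != p)
beta = 0.5 * math.log((1 - err) / err)   # update strength (AdaBoost-style)

w = [wi * math.exp(beta if t != p else -beta)   # up-weight the errors
     for wi, t, p in zip(w, y_true, y_pred)]
total = sum(w)
w = [wi / total for wi in w]             # renormalize to a distribution

# samples whose weight exceeds the uniform level become the training set
train_idx = [i for i, wi in enumerate(w) if wi > 1.0 / len(w)]
print(train_idx)                         # -> [1, 3]
```

Concentrating the next training round on indices 1 and 3 is what the text means by "selecting weight-qualifying samples as the training samples needed for the model update".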
Preferably, the algorithm processing module comprises a parameter dispatching unit and multiple core algorithm units. The core algorithm units perform algorithm iterations and generate model parameters; the parameter dispatching unit distributes initial parameters to each core algorithm unit, and sends the result computed by each core algorithm unit after each iteration to the other core algorithm units.
Preferably, if the model parameters have not converged after a predetermined number of iterations, the parameter dispatching unit redistributes the current parameters to each core algorithm unit as new initial parameters and automatically starts the next round of iteration; if the model converges, the parameter dispatching unit returns the model parameters to the model management module for parameter integration.
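The redistribute-and-iterate loop described above can be shown in miniature: each "node" refines a local parameter against its data shard, a dispatcher integrates the local results and broadcasts them back as the new initial value, and the loop stops once the round-over-round change converges (the single-parameter local step and shard data are purely illustrative):

```python
node_data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # data shards, one per node

def local_step(theta, shard):
    # one gradient-like step of a core algorithm unit toward its shard mean
    mean = sum(shard) / len(shard)
    return theta + 0.5 * (mean - theta)

theta, rounds = 0.0, 0
while rounds < 100:
    locals_ = [local_step(theta, s) for s in node_data]
    new_theta = sum(locals_) / len(locals_)        # dispatcher integrates
    rounds += 1
    if abs(new_theta - theta) < 1e-9:              # convergence check
        break
    theta = new_theta                              # redistribute and iterate

print(round(theta, 6))                             # settles at the global mean 3.5
```

The cap on `rounds` plays the role of the "predetermined number of iterations": if the tolerance were never reached, the loop would simply keep redistributing until the cap.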
Preferably, the model management module performs the parameter integration according to a formula in which Nodepred is the integrated prediction over nodes and samples, i is the node index, j is the sample index, T is the number of iterations, pred_{t,i,j} is the predicted value of the j-th sample on the i-th node from the classifier trained in the t-th iteration, θ_{t,i,j} is the predictive-variable parameter produced by each node, and the remaining term is the set of kernel-function estimates, i.e. the set of function values of the estimated support vectors and window width.
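The integration formula itself appears only as a figure in the original; on one plausible reading (an assumption), it averages the per-node, per-iteration classifier outputs pred_{t,i,j} into a single integrated prediction per sample, which the following toy sketch with invented numbers illustrates:

```python
T, nodes = 2, 2
# pred[t][i] = per-sample outputs of node i's classifier in iteration t
pred = [
    [[0.9, 0.1, 0.8], [0.7, 0.3, 0.6]],   # iteration t = 0
    [[0.8, 0.2, 0.9], [0.6, 0.4, 0.7]],   # iteration t = 1
]
samples = len(pred[0][0])

# Nodepred_j: average of pred[t][i][j] over all iterations and nodes
node_pred = [
    round(sum(pred[t][i][j] for t in range(T) for i in range(nodes)) / (T * nodes), 6)
    for j in range(samples)
]
print(node_pred)   # [0.75, 0.25, 0.75]
```

Averaging across nodes and iterations is the standard ensemble trick the description alludes to: it smooths out per-node noise and reduces over-fitting relative to any single node's classifier.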
The present invention also provides a method of automatic non-parametric topic labeling, comprising: a configuration step, which configures initial parameters, service selection and algorithm selection; a transmission step, which transmits the configured information; a parsing step, which resolves the configured information into the parameter configuration executable by the distributed data processing layer and the task information to be executed; a task scheduling step, which distributes the executable parameter configuration and the task information to the model management module, the algorithm processing module and the enterprise application module; a model management step, which builds the nonparametric model; an algorithm processing step, which computes the parameters needed by the model built in the model management step; and an enterprise application step, which pre-processes the received corpus and outputs the results of processing the corpus with the model.
Preferably, the method also comprises a pre-processing step that segments the corpus into words and builds a label-word TF-IDF matrix, where the label is the topic mark, denoted y, and TF-IDF is the transformed frequency with which each word (a predictive variable) occurs in each corpus document, denoted x.
Preferably, in the model management step, at least one of the parameter estimate θ̂, the optimal bandwidth h and the support vectors x_s is solved for as follows: a) build the class predictive equation and its likelihood function, and generalize it with nonparametric techniques to obtain a more general class maximum likelihood function, where l is the order of the polynomial approximation of the likelihood function; b) apply a kernel feature mapping to the original features to obtain the nonparametric class maximum likelihood function, and maximize this likelihood function by gradient descent until convergence to obtain θ̂, where hat(·) denotes an estimated value; c) solve for the window width, where h_opt is the optimal value of the window width h solved from the corresponding equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardized kernel function, and the variance function is normalized between 0 and 1.
Preferably, the method also comprises a prediction scoring step, in which a prediction scoring unit scores the results of processing new corpus data with the existing model; if the score passes, the results of processing the new corpus with the existing model are output, and if the score fails, the existing model is updated and the new corpus is processed with the updated model.
Preferably, in the model management step each sample is assigned an error-update weight, the sample weights are adjusted when the existing model is updated, and the samples whose weights meet a given condition are selected as the training samples for the model update.
Preferably, in the algorithm processing step multiple core algorithm units are provided to perform algorithm iterations and generate model parameters, and a parameter dispatching unit distributes initial parameters to each core algorithm unit and sends the result computed by each core algorithm unit after each iteration to the other core algorithm units.
Preferably, if the model parameters have not converged after a predetermined number of iterations, the current parameters are taken as new initial parameters and the next round of iteration is started automatically; if the model converges, parameter integration is performed on the model parameters.
Preferably, the parameter integration is performed according to a formula in which Nodepred is the integrated prediction over nodes and samples, i is the node index, j is the sample index, T is the number of iterations, pred_{t,i,j} is the predicted value of the j-th sample on the i-th node from the classifier trained in the t-th iteration, θ_{t,i,j} is the predictive-variable parameter produced by each node, and the remaining term is the set of kernel-function estimates, i.e. the set of function values of the estimated support vectors and window width.
The method and system for boosting-based automatic non-parametric topic labeling provided by the invention use a kernel inner-product technique to map features into a high-dimensional space without increasing computational complexity, adopt a dynamic adjustment mechanism based on the prediction-error distribution, and distribute the data in a boosting manner to each node of a Hadoop cluster to iterate model parameters and support vectors in a distributed way; topic labeling of internet corpora is finally completed efficiently by scheduling scripts in the encapsulated R environment. The user-interest model established by the invention is more accurate, and the burden and resource waste of server and client can be reduced. Compared with the prior art, the scheme provided by this embodiment has the following beneficial effects:
1. Being nonparametric, the method does not rely on the prior distribution of the data, and it builds the classifier in a high-dimensional feature space, making the topic-labeling model more accurate and flexible; model prediction relies on only a small number of support vectors (far fewer than SVM produces), and the computational complexity is low, only O(N^2 * Ns), where N is the data size and Ns is the number of support vectors;
2. The training-sample distribution is adjusted dynamically according to the prediction-error distribution: the weights of mispredicted samples are raised, and correctly predicted samples are down-weighted or excluded from model training. This further optimizes the model automatically and further lowers the computational complexity; the rate of decrease is O(1/T), where T is the number of iterations;
3. The model is deployed by distributing it to cluster nodes in a boosting manner, which reduces the burden and resource waste of servers and the system and improves model training efficiency; moreover, the parameter return from each node is an ensemble process, which further improves model accuracy and reduces the risk of over-fitting.
Brief description of the drawings
Fig. 1 is the architecture diagram of the distributed topic-labeling system according to an embodiment of the present invention;
Fig. 2 is the structural diagram of the module scheduling process of topic labeling according to an embodiment of the present invention;
Fig. 3 is the flow chart of automatic labeling of topic corpora according to an embodiment of the present invention;
Fig. 4 is the flow chart of building a new model for topic corpora according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of distributed nonparametric training and model integration according to an embodiment of the present invention;
Fig. 6 is the flow chart of topic-corpus model updating according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described below with reference to the accompanying drawings. The embodiments disclosed here are to be regarded in every respect as illustrative and not restrictive. The scope of the present invention is not limited by the following description of the embodiments, but only by the scope of the claims, and includes all variations that have the same meaning as, and fall within, the claims.
To solve the above technical problems, the invention provides a distributed system and method for quasi-boosting nonparametric automatic topic labeling. The system comprises two subsystems, an R encapsulation calling layer and a Hadoop big-data analysis layer, which intercommunicate via the HTTP protocol; JSON parsing in the background performs the mutual scheduling and data return of the two subsystems. Concretely, analysts perform scheduling in the R language by operating the R encapsulation calling layer, while the Hadoop environment provides an efficient and concise computation mode. The Hadoop big-data analysis layer encapsulates modules with different functions, mainly a communication module, a task scheduling module, a model management module, an algorithm processing module, an enterprise application module and a data storage module. Units with different functions are encapsulated within each module: the task scheduling module comprises a file distribution unit, a Job parsing unit, a Job execution unit and a configuration unit; the model management module comprises a parameter integration unit, a model processing unit, an error-distribution update unit and a configuration unit; the algorithm processing module comprises a parameter dispatching unit, core algorithm units, and a data-distribution update and storage unit; the enterprise application module comprises a data preparation unit, a prediction scoring unit, an application interface unit and a corpus labeling unit. These modules and units cooperate to realize distributed construction of the core algorithm, model management, algorithm scheduling and parsing, quasi-boosting data distribution and storage, optimal parameter integration, model training and deployment, automatic updating and optimization of model parameters, and the encapsulated workflow of enterprise model applications.
The distributed system for quasi-boosting nonparametric automatic topic labeling according to the present invention is described below with reference to Figs. 1-6.
Fig. 1 is the architecture diagram of the distributed topic-labeling system according to an embodiment of the present invention. The system comprises two subsystems, the R encapsulation calling layer 1 and the Hadoop big-data analysis layer 2, each with its own communication module. The communication module of the R encapsulation calling layer 1 is connected with the communication module of the Hadoop big-data analysis layer 2 by wired or wireless network, realizing the communication connection and data access of the two subsystems. The R encapsulation calling layer 1 works in the R language environment, and the Hadoop big-data analysis layer 2 works in Java.
The R encapsulation calling layer 1 comprises a parameter configuration module 11, a communication module 12 and a Main Fun module 13. The parameter configuration module 11 contains parameter configuration files; depending on the enterprise application, analysts can enter initialization information such as initial parameters, service selection and algorithm selection in the parameter configuration files, configuring different parameters and other initialization information for different applications. The communication module 12 comprises an HTTP protocol unit 121 and a JSON resolution unit 122. The HTTP protocol unit 121 passes information such as the R-language scripts of the parameter configuration files and the model configuration parameters of the R encapsulation calling layer 1 to the Hadoop big-data analysis layer 2, and receives information such as feedback and model algorithm results from the Hadoop big-data analysis layer 2. The JSON resolution unit 122 resolves the Java-language information sent by the Hadoop big-data analysis layer 2 to the HTTP protocol unit 121 into corresponding R-language scripts, so that the R encapsulation calling layer 1 can recognize the actions and feedback of each module of the Hadoop big-data analysis layer 2. The Main Fun module 13 stores simple function models written in R and functions that can call models in other languages; this module encapsulates the operations required by the enterprise application, together with information such as the feedback and algorithm scheduling from the Hadoop big-data analysis layer 2, into R-language UDFs (User-Defined Functions), completing the customized development, so that the R encapsulation calling layer 1 can execute other program packages and functions in the R environment and in turn execute the algorithms in the Hadoop environment according to the agreed protocol. The communication module 12 is connected with the parameter configuration module 11 and the Main Fun module 13 respectively.
The Hadoop big-data analysis layer 2 comprises a communication module 21, a task scheduling module 22, a model management module 23, an algorithm processing module 24, an enterprise application module 25 and a data storage module 26.
The communication module 21 comprises an HTTP protocol unit 211 and a JSON resolution unit 212. The HTTP protocol unit 211 is responsible for transmitting information between the task scheduling module 22, the model management module 23, the algorithm processing module 24, the enterprise application module 25 and the data storage module 26, feeding processing results back to the R encapsulation calling layer 1, and receiving the parameter configuration files, R-language scripts and other information sent by the R encapsulation calling layer 1. The JSON resolution unit 212 of the communication module 21 resolves the information and model configuration parameters of the R encapsulation calling layer 1 received by the HTTP protocol unit 211 into the parameter configuration executable by the Hadoop big-data analysis layer 2 and the Job information to be executed.
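The JSON resolution step can be sketched as follows; the field names (`algorithm`, `params`, `jobs`) are invented for illustration, since the patent does not fix a payload schema, only that a JSON config is split into executable parameters plus Job information:

```python
import json

# what the R encapsulation calling layer might POST over HTTP
payload = json.dumps({
    "algorithm": "nonparametric_topic",
    "params": {"bandwidth": 0.5, "max_iter": 100},
    "jobs": ["build_model", "label_corpus"],
})

def resolve(raw):
    """Split a raw JSON config into executable parameters and a Job list."""
    cfg = json.loads(raw)
    exec_params = cfg["params"]          # executable parameter configuration
    job_list = [{"name": j, "algorithm": cfg["algorithm"]}
                for j in cfg["jobs"]]    # Job information to be executed
    return exec_params, job_list

params, jobs = resolve(payload)
print(params["max_iter"], [j["name"] for j in jobs])
```

Keeping the parameter configuration and the Job list separate mirrors the division of labor in the text: parameters go to the configuration units, while the Job list drives the Job parsing and execution units.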
Fig. 2 is the structural diagram of the module scheduling process of topic labeling according to an embodiment of the present invention. The structures of the task scheduling module 22, the model management module 23, the algorithm processing module 24, the enterprise application module 25 and the data storage module 26 of the Hadoop big-data analysis layer 2 are described below with reference to Fig. 1 and Fig. 2.
The task scheduling module 22 comprises a file distribution unit 221, a Job execution unit 222, a Job parsing unit 223 and a configuration unit 224. This module drives the model management module 23, the algorithm processing module 24 and the enterprise application module 25, and is responsible for automatic driving functions such as configuring each module's parameters, returning parameters, connecting the workflow, and executing data-distribution commands. The Job parsing unit 223 determines the tasks that the Hadoop big-data analysis layer 2 needs to perform according to the executable configuration and the Job information resolved by the JSON resolution unit 212, prepares the data, generates a task list, and triggers the Job execution unit 222. The Job execution unit 222 may be several units in parallel or only one; each unit identifies the job tasks it needs to execute from the task list generated by the Job parsing unit 223 and triggers the work of the corresponding modules according to the task content. The configuration unit 224 stores the parameters and file information needed for the bottom-layer Hadoop connections, and configures the relevant parameters for the file distribution unit 221 according to the task information identified by the Job execution unit 222. The file distribution unit 221 assigns numbers and indexes to the data involved by the data preparation unit 251 of the enterprise application module 25, and, according to the specific task information of the Job execution unit 222, configures the parameter information and the related task data to the model management module, the algorithm processing module, and so on.
The model management module 23 comprises a parameter integration unit 231, a model processing unit 232, an error-distribution update unit 233 and a configuration unit 234. The model processing unit 232 encapsulates the core algorithm of nonparametric automatic topic labeling and can drive the algorithm processing module 24. The configuration unit 234 stores the parameters required when each model under the model management module 23 is built, and provides the required model parameters to the other units of the model management module 23. The parameter integration unit 231 integrates the model parameters converged at each Hadoop node within the given number of iterations, completes the Ensemble parameter-integration process, generates the model parameter file, and passes information such as the generated model parameter file to the R encapsulation calling layer 1; the integration method used by this process is given in the description below. The error-distribution update unit 233 assigns each sample an error-update weight, recalculates the sample weights when the existing model needs to be updated, and selects the samples with larger recalculated weights as the training samples for the model update.
The algorithm processing module 24 comprises a parameter dispatching unit 241, core algorithm units 242, and a data-distribution update and storage unit 243. There are multiple core algorithm units 242, each being a boosting node. A core algorithm unit 242 executes various algorithms according to the specific task information, such as solving the model and iterating the algorithm. During algorithm iteration, the parameters computed by each core algorithm unit are redistributed by the parameter dispatching unit 241 for the next round of iteration. A core algorithm unit can be triggered directly by the task scheduling module 22 to perform simple function calculations, or by the model processing unit 232 of the model management module 23 to perform complex calculations such as model solving or algorithm iteration. The parameter dispatching unit 241 distributes the initial parameter data to each server node; these parameters may be obtained randomly, or be the unconverged parameters of each node computed by the core algorithm units 242. The data-distribution update and storage unit 243 updates and stores the corrected data after error correction while the core algorithm units 242 iterate.
The enterprise application module 25 comprises a data preparation unit 251, a prediction scoring unit 252, an application interface unit 253 and a corpus labeling unit 254; this module is the main entrance and exit of the system. The application interface unit 253 docks with other enterprise application interfaces and manages them in real time to obtain new applications. The data preparation unit 251 mainly converts enterprise-level business problems into the data requirements of the model language, and prepares and pre-processes the raw data called and distributed for the other models. The prediction scoring unit 252 triggers the model parameter file in the model management module to score the predictions made on the input data, judging whether the model stored in the system is applicable to the newly input application data; if the score passes, the system outputs the results of processing the new corpus with the existing model, and if the score fails, the system updates the existing model and processes the new corpus with the updated model. The corpus labeling unit 254 performs LDA text clustering on the input corpus according to the model parameter file, forms finer-grained topic labels and probability dictionaries for each topic category, and stores information such as the generated topic labels in the data storage module 26.
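The pass/fail gate of the prediction scoring unit can be sketched as follows; the scoring rule (mean model confidence on the new documents) and the threshold are illustrative assumptions, since the patent only states that a failing score triggers a model update:

```python
THRESHOLD = 0.7   # assumed acceptance threshold for the scoring gate

def score(confidences):
    # e.g. mean model confidence over the newly arrived documents
    return sum(confidences) / len(confidences)

def handle_new_corpus(confidences):
    """Route a new corpus: label with the existing model, or update first."""
    if score(confidences) >= THRESHOLD:
        return "label_with_existing_model"
    return "update_model_then_label"

print(handle_new_corpus([0.9, 0.8, 0.85]))   # familiar topics: gate passes
print(handle_new_corpus([0.4, 0.5, 0.3]))    # drifted corpus: retrain first
```

The second branch is what connects this unit to the error-distribution update unit: a failing score is the signal that the stored model no longer fits the incoming data.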
The data storage module 26 stores the information processed by the four modules of the system, the task scheduling module 22, the model management module 23, the algorithm processing module 24 and the enterprise application module 25, such as the computed models, model parameters and topic labels.
The communication module 21 of the Hadoop big-data analysis layer 2 is connected with the task scheduling module 22, the model management module 23, the algorithm processing module 24, the enterprise application module 25 and the data storage module 26, and the modules are connected to one another by a bus.
Fig. 3 is the flowchart of the automatic topic corpus labeling involved in the embodiment of the present invention. The processing flow of corpus labeling is described below with reference to the accompanying drawings.
First, a new corpus is input through the application interface unit 253, and the data preparation unit 251 converts it into the new corpus data required by the model language. Concretely, the input corpus is segmented into words and converted into VSM numerical values, and a VSM numerical matrix is built, i.e. the label-word TF-IDF matrix, where the label is the topic tag, denoted Y, and TF-IDF is the frequency transform of the occurrences of each word and predictive variable, denoted X. The original corpus data (the new corpus data) called and distributed by the modules of the system is stored in the data storage module 26. The application interface unit 253 determines, according to the application of the input corpus, the processing to be carried out on it, and sends the relevant information to the R encapsulation calling layer 1 through the HTTP protocol unit 211 of the communication module 21; the JSON parsing unit 122 parses it into the corresponding R-language information (step S1). The Main Fun module 13 under the R encapsulation calling layer 1 analyses and matches the parsed R-language information, and judges whether a processing model Model(t-1) matching this application data is stored in the Main Fun module 13 (step S2).
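The label-word TF-IDF construction of step S1 can be sketched as follows. This is an illustrative pure-Python version: the function name and the smoothed-idf variant are our own choices, not taken from the patent.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build the word side (X) of the label-word VSM matrix.

    docs: list of token lists (the corpus after word segmentation).
    Returns (vocab, X) where X[i][j] is the TF-IDF weight of word
    vocab[j] in document i.
    """
    vocab = sorted({w for d in docs for w in d})
    index = {w: j for j, w in enumerate(vocab)}
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    X = []
    for d in docs:
        tf = Counter(d)
        row = [0.0] * len(vocab)
        for w, c in tf.items():
            # term frequency times smoothed inverse document frequency
            row[index[w]] = (c / len(d)) * (math.log((1 + n) / (1 + df[w])) + 1)
        X.append(row)
    return vocab, X
```

Each topic tag then becomes one entry of the label vector Y, aligned row-for-row with X.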
When it is judged that no model Model(t-1) matching this new corpus data is stored in the Main Fun module 13 (step S2: no), the analysis operator needs to operate the R encapsulation calling layer and input a modeling instruction and configuration information, so that the system re-models this application data (step S3).
Fig. 4 is the flowchart of building a new topic corpus model in step S3 involved in the embodiment of the present invention. First, the analysis operator calls the corpus stored in the data storage module 26 and sends it to the data preparation unit 251 for pre-processing: the corpus is segmented into words and converted into VSM numerical values, and a VSM numerical matrix is built (step S31). The VSM numerical matrix pre-processed in steps S1 and S31 is sent to the R encapsulation calling layer 1; at the same time, the parameter configuration module 11 initializes the parameters and triggers the R module scheduling script. The R encapsulation calling layer 1 triggers the Job parsing unit 223 of the task scheduling module 22 of the Hadoop big-data analysis layer 2 and sends to it the information needed to build the model and the various instruction task information; the Job parsing unit 223 generates a task list by parsing the instruction information and distributes the task contents to the Job execution unit 222 (step S32). The Job execution unit 222 triggers the model processing unit 232 of the model management module 23 according to the task contents. The model processing unit 232 triggers the data distribution unit 221 and the configuration unit 224 according to the task contents; at the same time, the data distribution unit 221 sends the task parameters retrieved from the configuration unit 224 according to the task contents to the model processing unit, and boosting data distribution to the boosting nodes 1 to n shown in Fig. 5 is entered (step S33). The model processing unit 232 then retrieves the model parameters in the configuration unit 234 according to the task information of the Job execution unit 222, and triggers the core algorithm unit 242 of the algorithm processing module 24 of each node, which carries out the computation of the converged model parameters (such as the parameter estimates, the window width and the support vectors) (step S34). In this step, the model parameters computed by the core algorithm unit 242 mainly comprise the parameter estimates, the window width and the support vectors, and the corresponding algorithm is:
1. Build the class prediction equation and likelihood function p(y=1|x) = g(f(x)). When the link function g is the sigmoid and f(x) is linear, this expresses the traditional parametric logistic regression model (LR); we generalize it with a nonparametric method to obtain the more general class maximum likelihood function L = Π_i p(x_i)^{y_i} (1 - p(x_i))^{1-y_i}, where x_i is the frequency with which each word occurs in the i-th sample, y_i is the topic label of the i-th sample, and l is the order of the polynomial approximation of the likelihood function;
2. Apply a kernel feature mapping to the original features, obtaining the nonparametric class maximum likelihood function as follows:
Gradient descent is adopted until convergence, which yields the parameter estimates, where hat(·) denotes the estimated value of a parameter;
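A minimal sketch of the kernelized class-likelihood maximization by gradient steps described above. The Gaussian kernel, dual parameterization, learning rate and stopping rule are our illustrative assumptions, not the patent's exact formulation:

```python
import math

def rbf_kernel(u, v, h):
    """Gaussian kernel with window width (bandwidth) h."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2 * h * h))

def fit_kernel_logistic(X, y, h=1.0, lr=0.1, iters=500, tol=1e-6):
    """Kernel logistic regression fitted by gradient ascent on the
    log-likelihood.  Returns one dual coefficient per training sample;
    samples with non-negligible coefficients play the role of the
    'support vectors' in the text."""
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], h) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(iters):
        # predicted probabilities p_i = sigmoid(sum_j alpha_j * K_ij)
        p = [1.0 / (1.0 + math.exp(-sum(alpha[j] * K[i][j] for j in range(n))))
             for i in range(n)]
        # gradient of the log-likelihood with respect to alpha
        grad = [sum((y[i] - p[i]) * K[i][j] for i in range(n)) for j in range(n)]
        alpha = [a + lr * g for a, g in zip(alpha, grad)]
        if max(abs(g) for g in grad) < tol:
            break
    return alpha

def predict(alpha, X_train, x, h=1.0):
    """Class-1 probability for a new point x."""
    s = sum(a * rbf_kernel(xi, x, h) for a, xi in zip(alpha, X_train))
    return 1.0 / (1.0 + math.exp(-s))
```

With a linear f(x) in place of the kernel expansion, the same loop reduces to ordinary parametric LR, matching the generalization described above.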
3. The window width is an important parameter of the feature-mapping smoothness constraint and of the local linear estimation; its solving function is as follows:
where h_opt is the optimal value of the window width h solved according to the equations below, K is the kernel function, and K_{0,1}(z) is the 0-1 standardized kernel function, representing the variance function normalized between 0 and 1.
The parameter estimates obtained above, the optimal bandwidth h and the support vectors (support points) x_s (denoting the s samples that best support the labeling) are the model parameters that the core algorithm unit 242 is required to compute. If the parameters obtained in one run do not converge, the system needs to iterate until the computed parameters converge. Fig. 5 is the schematic diagram of the nonparametric distributed training and model integration involved in the embodiment of the present invention. The acquisition of the converged model parameters in step S34 is described below with reference to Fig. 5. The data and parameters are distributed in step S33 to the boosting nodes 1 to n respectively, and the core algorithm unit 242 of each node computes its own model parameters. If, within the number of iterations specified by the system or the analysis operator, the computed model parameters do not converge, the parameter distribution unit 241 redistributes the current non-converged parameters to each boosting node as new initial parameters, and the next round of iterative computation is carried out, until the model parameters converge; if the model parameters still do not converge within the prescribed number of iterations, the system issues an error prompt, and the analysis operator takes further action according to the prompt.
After the computation of the converged model parameters is completed, the converged model parameters computed by each node are returned to the parameter integration unit 231 under the model management module 23, which integrates them, builds the model, and generates a new model parameter file used for prediction scoring of new data (step S35). The parameter integration method in step S35 is:
where Nodepred is the integrated prediction result over the nodes and samples, i is the node index, j is the sample index, T is the number of iterations, and pred_{t,i,j} is the predicted value obtained for the j-th sample on the i-th node by the classifier trained in the t-th iteration; the predictive-variable parameters produced by each node form the set of kernel-function estimates, i.e. the set of function values of the estimated support vectors and window width.
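Read as a simple average over nodes and iterations, the Nodepred aggregation could be sketched like this; the equal weighting is our illustrative assumption, since the exact formula is not reproduced above:

```python
def integrate_predictions(pred):
    """Integrate per-node, per-iteration predictions into one score per
    sample by averaging.

    pred[t][i][j]: prediction for sample j on node i at iteration t.
    Returns a list with one aggregated score per sample.
    """
    T = len(pred)
    n_nodes = len(pred[0])
    n_samples = len(pred[0][0])
    out = []
    for j in range(n_samples):
        s = sum(pred[t][i][j] for t in range(T) for i in range(n_nodes))
        out.append(s / (T * n_nodes))
    return out
```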
Finally, the integrated model, the model parameter file and related information are stored in the data storage module 26, and simultaneously fed back to the Main Fun module 13 under the R encapsulation calling layer 1 as the prediction configuration file (step S6) (Fig. 3).
Returning to Fig. 3, when it is judged that a model Model(t-1) matching this application is stored in the Main Fun module 13 (step S2: yes), the R encapsulation calling layer makes the prediction scoring unit 252 under the enterprise application module 25 score the predicted degree of match between Model(t-1) and this application data (step S4). The prediction scoring unit 252 calls the model parameter file in the parameter integration unit of the model management module to score predictions on the input application data; when the score is below a certain value, Model(t-1) is judged unsuitable for the application data input in step S1 (step S4: no), and Model(t-1) needs to be optimized and updated (step S5). The information that the model needs to be optimized is returned to the R encapsulation calling layer and displayed to the analysis operator.
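The accept-or-update decision of step S4 amounts to thresholding a fit score. A minimal sketch, in which the accuracy metric and the 0.8 default threshold are hypothetical choices of ours (the patent only says "below a certain value"):

```python
def score_model(predict, samples, labels, threshold=0.8):
    """Score how well an existing model Model(t-1) fits newly input
    application data: the fraction of new samples whose prediction
    matches the reference label.  Returns (score, passed)."""
    correct = sum(1 for x, y in zip(samples, labels) if predict(x) == y)
    score = correct / len(samples)
    return score, score >= threshold
```

When `passed` is False the flow proceeds to the optimization update of step S5; otherwise the stored model is reused directly.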
Fig. 6 is the flowchart of the topic corpus model optimization and update of step S5 involved in the embodiment of the present invention. By operating the R encapsulation calling layer, the analysis operator makes the data distribution and update storage unit 243 distribute, to the model processing unit 232, base data consisting of the stored data together with the new corpus data transformed by the data preparation unit 251 in step S1. The model processing unit 232 triggers the error distribution update unit 233: the base data is run through the model Model(t-1) and an error-correction operation is carried out, producing the error distribution of each new data sample and, at the same time, the error distribution of the training data of the previous period; the computed error distributions and related information are stored in the error distribution update unit 233 (step S51). The formulas in this step for obtaining the error distributions of the new corpus and the previous-period corpus when Model(t-1) is executed and the data are updated are as follows: let the new corpus samples be (x_{n+1}, ..., x_{n+m}) with initial weights w_i = 1/m, i.e. w_t = (1/m, ..., 1/m); let the samples accumulated in the previous period be (x_1, ..., x_n), with their own initial weights and error distribution. The correction factor of the new corpus samples is set from the error distribution, and the updated training-sample weights are obtained by updating the weights of the previous-period corpus samples and the current-period corpus samples, where h(x) is the algorithm or algorithm combination selected by the model management module, c(x) is the true class of sample x, the two error distributions are those obtained by model prediction for the new corpus samples and the previous-period corpus samples respectively, and the two correction factors are those of the new corpus samples and the previous-period corpus samples respectively. It can be seen from the above formulas that samples predicted wrongly are weighted more heavily in the subsequent training and learning; on the one hand this automatically optimizes the model, and on the other hand it reduces the chance that correctly predicted samples enter the new model training, saving storage space for the system and improving model-training efficiency. The previous-period accumulated samples and the new corpus samples with updated weights are integrated into a new training sample set, which is stored in the data storage unit 26 and sent to the R encapsulation calling layer.
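The reweighting behavior described above (wrongly predicted samples gain weight in the next training round) matches the standard AdaBoost update; the sketch below follows that reading, with the correction factor computed from the weighted error. The exact patent formulas are not reproduced, so this is an illustrative stand-in:

```python
import math

def update_weights(weights, correct):
    """AdaBoost-style reweighting: samples the current model got wrong
    gain weight, correctly handled ones lose it.

    weights: current sample weights (summing to 1)
    correct: booleans, True where the model prediction h(x) matched
             the true class c(x)
    """
    eps = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error
    eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard degenerate cases
    alpha = 0.5 * math.log((1 - eps) / eps)  # correction factor
    new = [w * math.exp(alpha if not ok else -alpha)
           for w, ok in zip(weights, correct)]
    z = sum(new)                             # renormalize
    return [w / z for w in new]
```

Samples whose updated weight stays below a chosen cutoff can then be dropped from the new training set, as the text describes.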
Then the analysis operator configures the initial parameters through the parameter configuration module 11 of the R encapsulation calling layer 1 and sends the relevant information to the Hadoop big-data analysis layer 2 through the HTTP protocol unit 121: the Job parsing unit 223 of the task scheduling module 22 of the Hadoop big-data analysis layer 2 is triggered, and the needed data information is sent to it; the Job parsing unit 223 generates a task list by parsing the data information and distributes the task contents to the Job execution unit 222 (step S52). The Job execution unit 222 triggers the model processing unit 232 of the model management module 23 according to the task. The model processing unit 232 triggers the data distribution unit 221 and the configuration unit 224 according to the task contents; at the same time, the data distribution unit 221 sends the task parameters retrieved from the configuration unit 224 according to the task contents to the model processing unit, and the boosting data distribution is entered (step S53). The model processing unit 232 then retrieves the model parameter information in the configuration unit 234 according to the task information of the Job execution unit 222, and triggers the core algorithm unit 242 of the algorithm processing module 24 of each node, which computes the converged model parameters (the parameter estimates, window width, support vectors and other parameters) (step S54; the computation of the model parameters in this step is the same as in step S34). The converged model parameters computed by each node are then returned to the parameter integration unit 231 under the model management module 23 for integration, a new model parameter file is generated, and a new model Model(t) is built (step S55; the integration method in this step is the same as in step S35). The parameter integration script is then run, the accuracy of this model Model(t) is computed, and whether Model(t) is better than Model(t-1) is judged (step S56). If the accuracy of Model(t) is below the preset threshold, Model(t-1) is better than Model(t) (step S56: no); Model(t) is then overwritten and saved as Model(t-1) (step S57), and step S51 and the subsequent operations are repeated until the new model Model(t) is better than Model(t-1). If the accuracy of Model(t) is above the system threshold, Model(t) is judged better than Model(t-1) (step S56: yes); Model(t), the model parameter file and related information are then returned to the Main Fun module under the R encapsulation calling layer as the prediction configuration file (step S6).
Returning to Fig. 3, when the score given by the prediction scoring unit 252 to this enterprise application exceeds a certain value, the enterprise application input in step S1 is judged suitable for the stored model Model(t-1) (step S4: yes); the model parameters of Model(t-1) are then directly recorded as Model(t) and stored in the data storage module 26, and simultaneously fed back to the Main Fun module 13 under the R encapsulation calling layer 1 as the prediction configuration file (step S6). Finally, in the corpus labeling unit 254 of the enterprise application module 25, LDA text clustering is carried out on the corpus with the model parameter file corresponding to Model(t), forming finer-grained topic labels and the probability dictionary of each topic class; the final topic labels, probability dictionaries and related information are stored in the data storage unit 26 for later application deployment (step S7); at the same time, the data storage unit 26 feeds back to the R encapsulation calling layer 1 the storage addresses of the newly stored topic labels, probability dictionaries and related information.
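The "probability dictionary of each topic class" and the fine-grained label assignment of step S7 can be sketched as follows. Turning a fitted topic-word count table into per-topic word distributions and labeling each document with its most likely topic is our illustrative reading, not the patent's exact LDA procedure:

```python
import math

def topic_dictionaries(topic_word_counts, vocab):
    """One dict per topic mapping word -> P(word | topic), built from
    per-topic word counts (e.g. taken from a fitted LDA model)."""
    dicts = []
    for counts in topic_word_counts:
        total = sum(counts)
        dicts.append({w: c / total for w, c in zip(vocab, counts) if c})
    return dicts

def label_document(doc_tokens, topic_dicts, smooth=1e-9):
    """Assign the topic whose word distribution gives the document the
    highest log-likelihood; unseen words get a small smoothing mass."""
    best, best_ll = None, float("-inf")
    for t, d in enumerate(topic_dicts):
        ll = sum(math.log(d.get(w, smooth)) for w in doc_tokens)
        if ll > best_ll:
            best, best_ll = t, ll
    return best
```

The dictionaries and the per-document labels are what would be persisted to the data storage unit for later application deployment.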
In the above embodiment, if the model parameters computed by the core algorithm unit 242 still do not converge within the prescribed number of iterations, the system issues a warning; however, the invention is not limited to this. Alternatively, when the model parameters computed by the core algorithm unit 242 do not converge within the prescribed number of iterations, the result of the last iteration may be saved and sent as the final computed result, although such a result may increase the error of subsequent predictions on the input.
In the above embodiment, the Hadoop big-data analysis layer is used to carry out the distributed computation, but the invention is not limited to this; a distributed processing system of another structure may be used instead of the Hadoop big-data analysis layer to carry out the above distributed computation.

Claims (16)

1. A distributed system for automatic nonparametric topic labeling, comprising: an R encapsulation calling layer and a distributed data processing layer;
The R encapsulation calling layer comprises a parameter configuration module, a first communication parsing module and a main function module;
The distributed data processing layer comprises a second communication parsing module, a task scheduling module, a model management module, an algorithm processing module and an enterprise application module; wherein,
The parameter configuration module is used to accept configuration information;
The main function module is used to accept personalized development of algorithm scheduling, information feedback processing and other processing, to generate accordingly the executable configuration of the distributed data processing layer and the task information to be executed, and to send them to the distributed data processing layer;
The first communication parsing module is used to communicate with the second communication parsing module, establishing communication between the R encapsulation calling layer and the distributed data processing layer and parsing the communication contents;
The task scheduling module is used to receive the executable configuration and the task information to be executed sent by the main function module, and accordingly to control and coordinate the work of the model management module, the algorithm processing module and the enterprise application module;
The model management module is used to build models, to instruct the algorithm processing module to compute model parameters, and to integrate the model parameters returned by the algorithm processing module to generate a model parameter file;
The algorithm processing module receives the instruction of the model management module, computes the model parameters, and returns the results;
The enterprise application module is used to pre-process the received corpus and to generate topic labels according to the model parameter file generated by the model management module.
2. The system according to claim 1, is characterized in that:
The enterprise application module builds the label-word TF-IDF matrix after word segmentation of the corpus during pre-processing;
wherein the label is the topic tag, denoted y, and TF-IDF is the frequency transform of the occurrences of each word and predictive variable in each corpus item, denoted x.
3. The system according to claim 2, is characterized in that:
The algorithm processing module solves at least one of the parameter estimates, the optimal bandwidth h and the support vectors x_s according to the following method:
A) building the class prediction equation and likelihood function, and generalizing it with a nonparametric method to obtain the more general class maximum likelihood function, where l is the order of the polynomial approximation of the likelihood function;
B) applying a kernel feature mapping to the original features to obtain the nonparametric class maximum likelihood function, and maximizing this likelihood function by gradient descent until convergence, where hat(·) denotes the estimated value of a parameter;
C) solving the window-width equation for h_opt, where h_opt is the optimal value of the window width h, K is the kernel function, and K_{0,1}(z) is the 0-1 standardized kernel function, representing the variance function normalized between 0 and 1.
4. The system according to any one of claims 1 to 3, is characterized in that:
The enterprise application module comprises a prediction scoring unit for scoring the result of processing a new corpus with an existing model;
if the score is qualified, the system outputs the result of processing the new corpus with the existing model; if the score is unqualified, the system updates the existing model and processes the new corpus with the updated model.
5. The system according to claim 4, is characterized in that:
The model management module comprises an error distribution update unit;
The error distribution update unit gives each sample an error update weight, adjusts the sample weights when the existing model is updated, and selects the samples whose weights meet a certain condition as the training samples needed to update the model.
6. The system according to claim 4, is characterized in that:
The algorithm processing module comprises a parameter distribution unit and a plurality of core algorithm units; wherein,
The core algorithm units are used to carry out algorithm iterations and produce model parameters;
The parameter distribution unit is used to distribute initial parameters to each core algorithm unit, and to send the computation result of each iteration of each core algorithm unit to the other core algorithm units.
7. The system according to claim 6, is characterized in that:
if the model parameters do not converge after a predetermined number of iterations, the parameter distribution unit redistributes the current parameters to each core algorithm unit as new initial parameters, and the next round of iterations is carried out automatically; if the model converges, the parameter distribution unit returns the model parameters to the model management module for parameter integration.
8. The system according to claim 7, is characterized in that:
The model management module carries out the parameter integration by a formula wherein Nodepred is the integrated prediction result over the nodes and samples, i is the node index, j is the sample index, T is the number of iterations, and pred_{t,i,j} is the predicted value obtained for the j-th sample on the i-th node by the classifier trained in the t-th iteration; the predictive-variable parameters produced by each node form the set of kernel-function estimates, i.e. the set of function values of the estimated support vectors and window width.
9. A method of automatic nonparametric topic labeling, comprising:
a configuration step of carrying out the information configuration of initial parameters, service selection and algorithm selection;
a transmission step of transmitting the configured information;
a parsing step of parsing the configured information into configuration parameters executable by the distributed data processing layer and the task information to be executed;
a task scheduling step of distributing the executable configuration parameters and the task information to be executed to a model management module, an algorithm processing module and an enterprise application module;
a model management step of building a nonparametric model;
an algorithm processing step of computing the parameters needed by the model built in the model management step;
an enterprise application step of pre-processing the received corpus and outputting the result of processing the corpus with the model.
10. The method according to claim 9, is characterized in that:
it further comprises a pre-processing step of building the label-word TF-IDF matrix after word segmentation of the corpus;
wherein the label is the topic tag, denoted y, and TF-IDF is the frequency transform of the occurrences of each word and predictive variable in each corpus item, denoted x.
11. The method according to claim 10, is characterized in that:
in the model management step, at least one of the parameter estimates, the optimal bandwidth h and the support vectors x_s is solved according to the following method:
A) building the class prediction equation and likelihood function, and generalizing it with a nonparametric method to obtain the more general class maximum likelihood function, where l is the order of the polynomial approximation of the likelihood function;
B) applying a kernel feature mapping to the original features to obtain the nonparametric class maximum likelihood function, and maximizing this likelihood function by gradient descent until convergence, where hat(·) denotes the estimated value of a parameter;
C) solving the window-width equation for h_opt, where h_opt is the optimal value of the window width h, K is the kernel function, and K_{0,1}(z) is the 0-1 standardized kernel function, representing the variance function normalized between 0 and 1.
12. The method according to any one of claims 9 to 11, is characterized in that:
it further comprises a prediction scoring step of scoring, with a prediction scoring unit, the result of processing a new corpus with an existing model;
if the score is qualified, the result of processing the new corpus with the existing model is output; if the score is unqualified, the existing model is updated and the new corpus is processed with the updated model.
13. The method according to claim 12, is characterized in that:
in the model management step, each sample is given an error update weight, the sample weights are adjusted when the existing model is updated, and the samples whose weights meet a certain condition are selected as the training samples needed to update the model.
14. The method according to claim 12, is characterized in that:
in the algorithm processing step, a plurality of core algorithm units are provided for carrying out algorithm iterations and producing model parameters;
a parameter distribution unit distributes initial parameters to each core algorithm unit, and sends the computation result of each iteration of each core algorithm unit to the other core algorithm units.
15. The method according to claim 14, is characterized in that:
if the model parameters do not converge after a predetermined number of iterations, the current parameters are taken as new initial parameters and the next round of iterations is carried out automatically; if the model converges, parameter integration is carried out with the model parameters.
16. The method according to claim 15, is characterized in that:
The parameter integration is carried out by a formula wherein Nodepred is the integrated prediction result over the nodes and samples, i is the node index, j is the sample index, T is the number of iterations, and pred_{t,i,j} is the predicted value obtained for the j-th sample on the i-th node by the classifier trained in the t-th iteration; the predictive-variable parameters produced by each node form the set of kernel-function estimates, i.e. the set of function values of the estimated support vectors and window width.
CN201510186154.9A 2015-04-20 2015-04-20 A distributed system and labeling method for automatic nonparametric topic labeling Active CN104778254B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510186154.9A CN104778254B (en) A distributed system and labeling method for automatic nonparametric topic labeling


Publications (2)

Publication Number Publication Date
CN104778254A true CN104778254A (en) 2015-07-15
CN104778254B CN104778254B (en) 2018-03-27

Family

ID=53619718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510186154.9A Active CN104778254B (en) A distributed system and labeling method for automatic nonparametric topic labeling

Country Status (1)

Country Link
CN (1) CN104778254B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090138416A1 (en) * 2006-05-30 2009-05-28 Yuan-Lung Chang Artificial intelligence for wireless network analysis
US20100152878A1 (en) * 2008-12-16 2010-06-17 Industrial Technology Research Institute System for maintaining and analyzing manufacturing equipment and method thereof
CN102591940A (en) * 2011-12-27 2012-07-18 厦门市美亚柏科信息股份有限公司 Map/Reduce-based quick support vector data description method and Map/Reduce-based quick support vector data description system
CN103559205A (en) * 2013-10-09 2014-02-05 山东省计算中心 Parallel feature selection method based on MapReduce
CN103886203A (en) * 2014-03-24 2014-06-25 美商天睿信息系统(北京)有限公司 Automatic modeling system and method based on index prediction


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李攀登 et al.: "Liquidity information and asset returns: an analysis based on nonparametric models", Statistics Education *
许敏: "Research on domain-adaptive learning algorithms and their applications", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017193769A1 (en) * 2016-05-11 2017-11-16 星环信息科技(上海)有限公司 Method and apparatus for conversion between machine learning models
CN109726392A (en) * 2018-12-13 2019-05-07 井冈山大学 A kind of intelligent language Cognitive Information Processing Based and method based on big data
CN109726392B (en) * 2018-12-13 2023-10-10 井冈山大学 Intelligent language cognition information processing system and method based on big data
CN110032714A (en) * 2019-02-25 2019-07-19 阿里巴巴集团控股有限公司 A kind of corpus labeling feedback method and device
CN116360752A (en) * 2023-06-02 2023-06-30 钱塘科技创新中心 Function programming method oriented to java, intelligent terminal and storage medium
CN116360752B (en) * 2023-06-02 2023-08-22 钱塘科技创新中心 Function programming method oriented to java, intelligent terminal and storage medium

Also Published As

Publication number Publication date
CN104778254B (en) 2018-03-27


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant