CN104778254B - Distributed system and annotation method for non-parametric automatic topic annotation - Google Patents

Distributed system and annotation method for non-parametric automatic topic annotation

Info

Publication number
CN104778254B
CN104778254B CN201510186154.9A CN201510186154A
Authority
CN
China
Prior art keywords
model
parameter
module
sample
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510186154.9A
Other languages
Chinese (zh)
Other versions
CN104778254A (en)
Inventor
李攀登
王炼
姜军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking Blue Coloured Light Mark Brand Management Consultant Inc Co
Original Assignee
Peking Blue Coloured Light Mark Brand Management Consultant Inc Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking Blue Coloured Light Mark Brand Management Consultant Inc Co filed Critical Peking Blue Coloured Light Mark Brand Management Consultant Inc Co
Priority to CN201510186154.9A priority Critical patent/CN104778254B/en
Publication of CN104778254A publication Critical patent/CN104778254A/en
Application granted granted Critical
Publication of CN104778254B publication Critical patent/CN104778254B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a distributed system for non-parametric automatic topic annotation, comprising: an R encapsulation/calling layer and a distributed data processing layer. The R calling layer includes a parameter configuration module, a first communication parsing module and a main-function module; the distributed data processing layer includes a second communication parsing module, a task scheduling module, a model management module, an algorithm processing module and an enterprise application module. The present invention also provides a method of non-parametric automatic topic annotation.

Description

Distributed system and annotation method for non-parametric automatic topic annotation
Technical field
The present invention relates to a big-data processing method in the field of statistical learning, and more particularly to a system for automatic modelling of topic annotation, distributed deployment, and automated service application.
Background technology
As Internet technology and products mature, information on the Internet expands rapidly, and people leave traces of themselves on various platforms and media through many carriers: they comment on goods on e-commerce platforms and post about topics of interest on microblogs. This is directly reflected in a rapid accumulation of massive data, producing a large volume of text. How to mine the topics users express from these texts, using techniques such as semantic analysis and statistical learning, has become a technical problem of great interest and value to the industry, because many service applications perform precision marketing and build data products on the mined information. Academia and industry have already carried out substantial research in this field.
In actual use and research, however, we find the prior art has at least the following problems. Existing techniques all assume that the word distribution of a text, or its latent topic distribution, obeys some hypothesised distribution, and carry out the subsequent parameter iteration and model training on that basis; a drawback of this approach is that when the words of real texts and the true topics of users do not obey the hypothesised distribution, the model trained under that hypothesis becomes seriously biased. Some machine-learning algorithms, such as SVMs and neural networks, have strong predictive power, but their high computational complexity limits their commercial application in the big-data era and hinders wider adoption. Moreover, the prior art requires model parameters to be updated manually and periodically; self-learning capability is still lacking.
Summary of the invention
To overcome the above drawbacks, the invention provides a distributed system for non-parametric automatic topic annotation, comprising: an R encapsulation/calling layer and a distributed data processing layer. The R calling layer includes a parameter configuration module, a first communication parsing module and a main-function module; the distributed data processing layer includes a second communication parsing module, a task scheduling module, a model management module, an algorithm processing module and an enterprise application module. The parameter configuration module is used to receive configuration information. The main-function module is used to receive algorithm scheduling, information feedback and other processing for personalised development, to generate accordingly the executable configuration of the distributed data processing layer, and to send the task information to be executed to the distributed data processing layer. The first communication parsing module is communicatively connected with the second communication parsing module, to establish communication between the R calling layer and the distributed data processing layer and to parse the communicated content. The task scheduling module receives the executable configuration and the task information sent by the main-function module, and correspondingly controls and coordinates the work of the model management module, the algorithm processing module and the enterprise application module. The model management module is used to build models, to instruct the algorithm processing module to compute model parameters, and to integrate the model parameters returned by the algorithm processing module into a generated model parameter file. The algorithm processing module receives the instructions of the model management module, computes the model parameters, and returns the results. The enterprise application module pre-processes the received corpus and generates topic annotations according to the model parameter file produced by the model management module.
Preferably, in pre-processing, the enterprise application module segments the corpus into words and then builds a label-word IF-IDF matrix, where a label is the mark of a topic, denoted Y, and IF-IDF is a frequency transformation of each word, used as a predictive variable, over its occurrences in each corpus document, denoted X.
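The label-word matrix described above can be sketched in a few lines. This is a minimal illustration, assuming the patent's "IF-IDF" transformation behaves like a standard TF-IDF weighting; the function and variable names are illustrative, not the patent's implementation.

```python
import math
from collections import Counter

def tfidf_matrix(docs, labels):
    """Build the matrix X of per-document word weights (a sketch of the
    patent's label-word 'IF-IDF' matrix) plus the label vector Y."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency
    X = []
    for d in docs:
        tf = Counter(d)
        total = len(d)
        X.append([(tf[w] / total) * math.log(n / df[w]) for w in vocab])
    return vocab, X, labels

docs = [["good", "phone"], ["bad", "phone", "battery"]]
vocab, X, Y = tfidf_matrix(docs, ["positive", "negative"])
```

A word that occurs in every document (here "phone") gets weight 0, so only discriminative words contribute to the predictive variables.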
Preferably, the algorithm processing module solves for at least one of the parameter estimates, the optimal bandwidth h, and the support vectors X_s, as follows. a) Build the class prediction equation and likelihood function, and generalise them with non-parametric methods to obtain a more general class maximum-likelihood function, where l is the order of the polynomial approximation of the likelihood function. b) Apply a kernel transformation mapping to the original features to obtain the non-parametric class maximum-likelihood function; maximise this likelihood by gradient descent until convergence, where hat(·) denotes the estimate of a parameter. c) Solve for the bandwidth, where h_opt is the optimal value of the bandwidth h solved from the equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardised kernel function, and the remaining term denotes the variance function normalised to between 0 and 1.
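The estimation procedure above (kernel feature mapping, then likelihood maximisation by gradient ascent/descent) can be illustrated with a toy stand-in. The patent's exact likelihood and bandwidth formulas are not reproduced in the source, so this sketch assumes a Gaussian kernel centred on a fixed set of support vectors X_s and a logistic class likelihood; it shows only the shape of step b), not the claimed algorithm itself.

```python
import math

def gauss_kernel(u, v, h):
    """Gaussian kernel with bandwidth h (an illustrative choice of K)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2 * h * h))

def fit_kernel_logistic(X, y, support, h=1.0, lr=0.5, iters=200):
    """Map features through the kernel at the support vectors, then
    maximise a logistic likelihood by gradient ascent until the step
    budget is exhausted (standing in for 'until convergence')."""
    K = [[gauss_kernel(x, s, h) for s in support] for x in X]
    w = [0.0] * len(support)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for ki, yi in zip(K, y):
            p = 1 / (1 + math.exp(-sum(wj * kj for wj, kj in zip(w, ki))))
            for j, kj in enumerate(ki):
                grad[j] += (yi - p) * kj          # d log-likelihood / d w_j
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w, K

X = [[0.0], [0.2], [1.0], [1.2]]
y = [0, 0, 1, 1]
w, K = fit_kernel_logistic(X, y, support=[[0.1], [1.1]])
```

Because prediction only touches the support vectors, the cost per sample is O(Ns) rather than O(N), which is the complexity argument the patent makes.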
Preferably, the enterprise application module includes a prediction-scoring unit for scoring the result of processing a new corpus with the existing model. If the score is qualified, the system outputs the result of processing the new corpus with the existing model; if the score is unqualified, the system updates the existing model and processes the new corpus with the updated model.
Preferably, the model management module includes an error-distribution update unit. The error-distribution update unit assigns each sample an error-update weight, adjusts the sample weights when updating the existing model, and selects the samples whose weights satisfy a given condition as the training samples needed to update the model.
Preferably, the algorithm processing module includes a parameter dispatching unit and multiple core-algorithm units. The core-algorithm units carry out the algorithm iterations and produce the model parameters; the parameter dispatching unit distributes initial parameters to each core-algorithm unit and, after each iteration of each core-algorithm unit, sends that unit's result to the other core-algorithm units.
Preferably, if the model parameters have not converged after a predetermined number of iterations, the parameter dispatching unit redistributes the current parameters to each core-algorithm unit as new initial parameters and automatically starts the next round of iteration; if the model converges, the parameter dispatching unit returns the model parameters to the model management module for parameter integration.
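The dispatch-iterate-redistribute loop of the last two paragraphs can be sketched as follows; `local_step` and `converged` are illustrative placeholders for a core-algorithm unit's iteration and the convergence test, and the whole function is a single-process stand-in for the dispatcher, not a distributed implementation.

```python
def distributed_iterate(init_params, local_step, converged, max_rounds=10):
    """Sketch of the parameter dispatching unit's loop: run one local
    iteration on every core-algorithm node; if the parameters have not
    converged, redistribute the current values as fresh initial
    parameters and start the next round automatically; on convergence,
    hand the parameters back for integration."""
    params = list(init_params)
    for round_no in range(1, max_rounds + 1):
        params = [local_step(p) for p in params]     # per-node iteration
        if converged(params):
            return params, round_no                  # back to model mgmt
    return params, max_rounds                        # budget exhausted
```

With a contracting toy update `p -> p/2 + 2`, every node's parameter converges to the fixed point 4 regardless of its initial value, which mimics nodes agreeing after enough redistribution rounds.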
Preferably, the model management module carries out the parameter integration by the following equation:
where Node_pred is the integrated prediction result for a node and a sample, i is the node index, j is the sample index, T is the number of iterations, pred_{t,i,j} is the predicted value obtained for sample j on node i at iteration t by the trained classifier, the coefficient subscripted t,i,j is the prediction-variable parameter produced by each node, and the final term is the set of estimated values of the kernel function, i.e. the set of function values of the estimated support vectors and bandwidth.
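A minimal sketch of the Ensemble integration step follows. The source renders the actual formula as an image that is not reproduced, so the weighted-average combination used here is an assumption about its general shape, chosen only to show how the per-iteration, per-node predictions and their coefficients could be folded into one Node_pred value.

```python
def integrate(preds, weights):
    """Combine pred[t][i][j] (sample j, node i, iteration t) with the
    matching coefficients into one integrated prediction per (node,
    sample) pair, as a normalised weighted average (an assumed form)."""
    T, I, J = len(preds), len(preds[0]), len(preds[0][0])
    out = [[0.0] * J for _ in range(I)]
    for i in range(I):
        for j in range(J):
            num = sum(weights[t][i][j] * preds[t][i][j] for t in range(T))
            den = sum(weights[t][i][j] for t in range(T))
            out[i][j] = num / den
    return out
```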
The present invention also provides a method of non-parametric automatic topic annotation, comprising: a configuration step, configuring the information for parameter initialisation, service selection and algorithm selection; a transmission step, transmitting the configured information; a parsing step, parsing the configured information into the executable configuration parameters of the distributed data processing layer and the task information to be executed; a task scheduling step, distributing the executable configuration parameters and the task information to the model management module, the algorithm processing module and the enterprise application module; a model management step, building the non-parametric model; an algorithm processing step, computing the parameters needed by the model built in the model management step; and an enterprise application step, pre-processing the received corpus and outputting the result of processing the corpus with the model.
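The sequence of steps in the claimed method can be sketched as one control flow with toy stand-ins for each module. Only the ordering of steps mirrors the claim; the JSON configuration format, the trivial word-to-label "model" and all names are illustrative assumptions.

```python
import json

def run_pipeline(config_json, train, new_docs):
    """End-to-end sketch: parse the configuration (parsing step),
    'train' a trivial word->label model (model-management + algorithm
    steps), then annotate new corpora (enterprise application step)."""
    cfg = json.loads(config_json)
    model = {}
    for words, label in train:
        for w in words:
            model.setdefault(w, label)      # first label wins (toy rule)
    default = cfg.get("default_label", "unknown")
    return [next((model[w] for w in doc if w in model), default)
            for doc in new_docs]
```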
Preferably, the method further includes a pre-processing step: after segmenting the corpus, building the label-word IF-IDF matrix, where a label is the mark of a topic, denoted Y, and IF-IDF is a frequency transformation of each word, used as a predictive variable, over its occurrences in each corpus document, denoted X.
Preferably, in the model management step at least one of the parameter estimates, the optimal bandwidth h and the support vectors X_s is solved for as follows. a) Build the class prediction equation and likelihood function, and generalise them with non-parametric methods to obtain a more general class maximum-likelihood function, where l is the order of the polynomial approximation of the likelihood function. b) Apply a kernel transformation mapping to the original features to obtain the non-parametric class maximum-likelihood function; maximise this likelihood by gradient descent until convergence, where hat(·) denotes the estimate of a parameter. c) Solve for the bandwidth, where h_opt is the optimal value of the bandwidth h solved from the equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardised kernel function, and the remaining term denotes the variance function normalised to between 0 and 1.
Preferably, the method further includes a prediction-scoring step: scoring, with the prediction-scoring unit, the result of processing a new corpus with the existing model. If the score is qualified, the result of processing the new corpus with the existing model is output; if the score is unqualified, the existing model is updated and the new corpus is processed with the updated model.
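The score-then-update decision above can be sketched as follows; `score_fn` and `retrain_fn` are hypothetical callables standing in for the prediction-scoring unit and the model-update path, and the threshold is an assumed stand-in for the patent's "qualified" criterion.

```python
def score_or_update(model, score_fn, retrain_fn, corpus, threshold=0.8):
    """Score the existing model on the new corpus; reuse it if the
    score passes the threshold, otherwise retrain and use the updated
    model. Returns (model_to_use, was_updated)."""
    if score_fn(model, corpus) >= threshold:
        return model, False                 # existing model still fits
    return retrain_fn(model, corpus), True  # updated model used instead
```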
Preferably, in the model management step each sample is assigned an error-update weight; the sample weights are adjusted when updating the existing model, and the samples whose weights satisfy a given condition are selected as the training samples needed to update the model.
Preferably, in the algorithm processing step, multiple core-algorithm units are provided to carry out the algorithm iterations and produce the model parameters; a parameter dispatching unit distributes initial parameters to each core-algorithm unit, and each core-algorithm unit's result after each iteration is sent to the other core-algorithm units.
Preferably, if the model parameters have not converged after a predetermined number of iterations, the current parameters are taken as the new initial parameters and the next round of iteration is carried out automatically; if the model converges, the model parameters are passed on for parameter integration.
Preferably, the parameter integration is carried out by the following equation:
where Node_pred is the integrated prediction result for a node and a sample, i is the node index, j is the sample index, T is the number of iterations, pred_{t,i,j} is the predicted value obtained for sample j on node i at iteration t by the trained classifier, the coefficient subscripted t,i,j is the prediction-variable parameter produced by each node, and the final term is the set of estimated values of the kernel function, i.e. the set of function values of the estimated support vectors and bandwidth.
The method and system for boosting-based non-parametric automatic topic annotation provided by the invention use the kernel inner-product technique to map features to a higher-dimensional space without increasing computational complexity, apply a dynamic adjustment mechanism based on the prediction-error distribution, and, in a boosting fashion, distribute the data to each node of the Hadoop cluster to iterate the model parameters and support vectors in a distributed manner, finally completing topic annotation of Internet corpora efficiently through scheduling scripts in the encapsulated R environment. The user-interest model established by the invention is more accurate, and the burden and resource waste of server and client can be reduced. Compared with the prior art, the scheme provided by the embodiments has the following beneficial effects:
1. Being non-parametric, the method does not depend on a prior distribution of the data, and the classifier is established in a high-dimensional feature space, so the topic annotation model is more accurate and flexible; model prediction relies on only a small number of support vectors (far fewer than those produced by SVM), and the computational complexity is only O(N^2 * Ns), where N is the data size and Ns is the number of support vectors;
2. The training-sample distribution is adjusted dynamically according to the prediction-error distribution: the weights of mis-predicted samples are increased, while correctly predicted samples have their weights reduced or are kept out of model training. This further optimises the model automatically and further reduces computational complexity; the rate of decrease is O(1/T), where T is the number of iterations;
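The dynamic adjustment based on the prediction-error distribution resembles a boosting re-weighting. The following sketch uses an AdaBoost-style exponential update as an assumption, since the patent does not state its exact rule: mis-predicted samples gain weight, correctly predicted ones lose it, and the distribution is renormalised.

```python
import math

def update_weights(weights, correct, lr=1.0):
    """Raise the weight of mis-predicted samples and lower the weight
    of correctly predicted ones (AdaBoost-style exponential update,
    used here as an illustrative assumption), then renormalise."""
    err = sum(w for w, c in zip(weights, correct) if not c)
    err = min(max(err, 1e-9), 1 - 1e-9)          # guard degenerate cases
    alpha = 0.5 * math.log((1 - err) / err) * lr
    new = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
    s = sum(new)
    return [w / s for w in new]
```

A classic property of this update is that after one round the mis-predicted samples carry exactly half the total weight, which forces the next round of training to focus on them.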
3. The model is deployed by distributing it to cluster nodes in a boosting fashion, which reduces the load and resource waste of the server and the system and improves model-training efficiency; moreover, the parameter pass-back of each node is an Ensemble process, which further improves the accuracy of the model and reduces the risk of over-fitting.
Brief description of the drawings
Fig. 1 is the architecture diagram of the distributed system for automatic topic annotation according to an embodiment of the invention;
Fig. 2 is the structure diagram of the multiprocessor scheduling process for automatic topic annotation according to an embodiment of the invention;
Fig. 3 is the flow chart of automatic annotation of topic corpora according to an embodiment of the invention;
Fig. 4 is the flow chart of building a new model for topic corpora according to an embodiment of the invention;
Fig. 5 is the schematic diagram of distributed non-parametric training and model integration according to an embodiment of the invention;
Fig. 6 is the flow chart of updating the topic-corpus model according to an embodiment of the invention.
Detailed description of the embodiments
The present invention is described below with reference to the embodiments illustrated in the accompanying drawings. The embodiments disclosed here are to be considered in all respects as illustrative, not restrictive. The scope of the present invention is not limited by the following description of the embodiments, but only by the scope of the claims, and includes all variations having the same meaning as, and falling within, the claims.
To solve the above technical problems, the invention provides a distributed system and method for quasi-boosting non-parametric automatic topic annotation. The system comprises two major subsystems, the R encapsulation/calling layer and the Hadoop big-data processing layer, which communicate with each other over the HTTP protocol; JSON parsing in the background carries out the mutual scheduling and result pass-back between the two subsystems. Specifically, analysts can, by operating the R calling layer, schedule computation in the Hadoop environment using R, in an efficient and concise pattern. The Hadoop big-data processing layer encapsulates modules of different functions, mainly including a communication module, a task scheduling module, a model management module, an algorithm processing module, an enterprise application module and a data storage module. Each module in turn encapsulates units of different functions: the task scheduling module includes a file distribution unit, a Job parsing unit, a Job execution unit and a configuration unit; the model management module includes a parameter integration unit, a model processing unit, an error-distribution update unit and a configuration unit; the algorithm processing module includes a parameter dispatching unit, core-algorithm units, and a data distributed-update and storage unit; the enterprise application module includes a data preparation unit, a prediction-scoring unit, an application interface unit and a corpus annotation unit. These modules and units cooperate to realise and encapsulate the distributed construction of the core algorithm, model management, algorithm scheduling and parsing, quasi-boosting data distribution and storage, optimal parameter integration, model training and deployment, automatic updating and optimisation of model parameters, and the enterprise application of the model.
The distributed system for quasi-boosting non-parametric automatic topic annotation according to the invention is described below with reference to Figs. 1-6.
Fig. 1 is the architecture diagram of the distributed system for automatic topic annotation according to an embodiment of the invention. The system comprises two major subsystems, the R encapsulation/calling layer 1 and the Hadoop big-data processing layer 2, each with its own communication module. The communication module of the R calling layer 1 and the communication module of the Hadoop big-data processing layer 2 are connected by wired or wireless network, realising the communication connection and data access between the two subsystems (the R calling layer 1 and the Hadoop big-data processing layer 2). The R calling layer 1 runs in an R language environment, and the Hadoop big-data processing layer 2 runs on Java.
The R calling layer 1 includes a parameter configuration module 11, a communication module 12 and a Main Fun module 13. The parameter configuration module 11 of the R calling layer 1 holds the parameter configuration files; according to the enterprise application, analysts can enter activation information such as different initial parameters, service selections and algorithm selections in the configuration files, configuring different parameters and other activation information for different applications. The communication module 12 of the R calling layer 1 consists of an HTTP protocol unit 121 and a JSON parsing unit 122. The HTTP protocol unit 121 is responsible for transmitting R language information such as the parameter configuration files, scripts and model configuration parameters of the R calling layer 1 to the Hadoop big-data processing layer 2, and for receiving information such as feedback and model/algorithm results from the Hadoop big-data processing layer 2. The JSON parsing unit 122 parses the Java language information sent by the Hadoop big-data processing layer 2 to the HTTP protocol unit 121 into the corresponding R language scripts, so that the R calling layer 1 can recognise the actions and feedback of each module of the Hadoop big-data processing layer 2. The Main Fun module 13 stores simple function models written in R as well as functions that can call models in other languages; it is responsible for integrating the operations to be performed for the enterprise application with the information feedback from the Hadoop big-data processing layer 2 and the algorithm scheduling, and for encapsulating them as R UDFs (User-Defined Functions) for personalised development, so that applications of the R calling layer 1 can execute other program packages and functions in the R environment, and can execute the algorithms in the Hadoop environment via the specified protocol. The communication module 12 is connected to the parameter configuration module 11 and the Main Fun module 13 respectively.
The Hadoop big-data processing layer 2 includes a communication module 21, a task scheduling module 22, a model management module 23, an algorithm processing module 24, an enterprise application module 25 and a data storage module 26.
The communication module 21 includes an HTTP protocol unit 211 and a JSON parsing unit 212. The HTTP protocol unit 211 is responsible for transferring information, such as transmissions and processing feedback among the task scheduling module 22, the model management module 23, the algorithm processing module 24, the enterprise application module 25 and the data storage module 26, to the R calling layer 1, and for receiving the parameter configuration files and R language scripts sent by the R calling layer 1. The JSON parsing unit 212 of the communication module 21 parses the information and model configuration parameters of the R calling layer 1 received by the HTTP protocol unit 211 into the executable parameter configuration information of the Hadoop big-data processing layer 2 and the Job information to be executed.
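The work of the JSON parsing unit 212, turning the R layer's payload into executable parameter configuration plus Job information, can be sketched as follows. The field names (`params`, `jobs`, `name`) are assumptions for illustration; the patent does not specify its JSON schema.

```python
import json

def parse_jobs(payload):
    """Parse an incoming JSON payload (a stand-in for what the R
    calling layer sends over HTTP) into the executable parameter
    configuration and a list of Jobs for the task scheduler."""
    msg = json.loads(payload)
    params = msg.get("params", {})
    jobs = [{"name": j["name"], "params": params}
            for j in msg.get("jobs", [])]
    return params, jobs
```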
Fig. 2 is the structure diagram of the multiprocessor scheduling process for automatic topic annotation according to an embodiment of the invention. The structures of the task scheduling module 22, the model management module 23, the algorithm processing module 24, the enterprise application module 25 and the data storage module 26 of the Hadoop big-data processing layer 2 are described below with reference to Figs. 1 and 2.
The task scheduling module 22 includes a file distribution unit 221, a Job execution unit 222, a Job parsing unit 223 and a configuration unit 224. This module drives the model management module 23, the algorithm processing module 24 and the enterprise application module 25, and is responsible for functions such as configuring the parameters of each module, passing parameters back, connecting the workflow, and executing data-distribution commands automatically. The Job parsing unit 223 determines, from the executable configuration and the Job information parsed by the JSON parsing unit 212, the tasks the Hadoop big-data processing layer 2 needs to perform, prepares the data, generates the task list, and triggers the Job execution unit 222. There may be several Job execution units 222 in parallel, or only one; each identifies, from the task list generated by the Job parsing unit 223, the job tasks to be executed, and triggers the corresponding modules according to the task content. The configuration unit 224 stores the parameters and file information needed by the underlying Hadoop connection, and configures the relevant parameters for the file distribution unit 221 according to the task information identified by the Job execution unit 222. The file distribution unit 221 numbers and indexes the data distribution involved in the data preparation unit 251 of the enterprise application module 25, and allocates the parameter information and the relevant task data to the model management module, the algorithm processing module, etc., according to the specific task information of the Job execution unit 222.
The model management module 23 includes a parameter integration unit 231, a model processing unit 232, an error-distribution update unit 233 and a configuration unit 234. The model processing unit 232 encapsulates the core algorithm for non-parametric automatic topic annotation and can drive the algorithm processing module 24. The configuration unit 234 stores the parameters required by the other units of the model management module 23 when modelling, and is responsible for providing the model parameters needed by those units. The parameter integration unit 231 integrates the converged model parameters of each Hadoop node computed within the given number of iterations, completes the Ensemble parameter-integration process, generates the model parameter file, and transfers information such as the generated model parameter file to the R calling layer 1; the integration method used in this process is given in the description below. The error-distribution update unit 233 assigns each sample an error-update weight, recalculates the sample weights when the existing model needs updating, and selects the samples with larger recalculated weights as the training samples needed to update the model.
The algorithm processing module 24 includes a parameter dispatching unit 241, core-algorithm units 242, and a data distributed-update and storage unit 243. There are multiple core-algorithm units 242, each being one boosting node. The core-algorithm units 242 execute various computations according to the specific task information, such as model solving and algorithm iteration; during algorithm iteration, the parameters computed by each core-algorithm unit are redistributed by the parameter dispatching unit 241 for the next round of iteration. A core-algorithm unit can be triggered directly by the task scheduling module 22 to carry out simple feature calculations, or by the model processing unit 232 of the model management module 23 to carry out complex computations such as model solving or algorithm iteration. The parameter dispatching unit 241 is responsible for distributing initial parameter data to each node of the server; the parameters may be obtained at random, or be the non-converged parameters of each node computed by the core-algorithm units 242. The data distributed-update and storage unit 243 is responsible for updating and storing the corrected data after error correction while the core-algorithm units 242 carry out algorithm iteration.
The enterprise application module 25 includes a data preparation unit 251, a prediction-scoring unit 252, an application interface unit 253 and a corpus annotation unit 254, and serves as the main entrance and exit of the system. The application interface unit 253 is responsible for docking with other enterprise application interfaces and managing them in real time, to obtain new applications. The data preparation unit 251 is mainly responsible for converting enterprise-level business problems into the data requirements involved in the model language, and for preparing and pre-processing the raw data called and distributed by the other model units. The prediction-scoring unit 252 predicts and scores the input data object via the model parameter file in the model management module, to judge whether a model already stored in the system is applicable to the newly input application data; if the score is qualified, the system outputs the result of processing the new corpus with the existing model, and if the score is unqualified, the system updates the existing model and processes the new corpus with the updated model. The corpus annotation unit 254 is responsible for performing LDA text clustering on the input corpus according to the model parameter file, forming the finer-grained topic annotations and probability dictionaries of each topic class, and storing the generated information such as topic annotations in the data storage module 26.
The data storage module 26 is responsible for storing the information processed by the four modules in the system (the task scheduling module 22, the model management module 23, the algorithm processing module 24 and the enterprise application module 25), such as the computed models, model parameters and topic annotations.
The communication module 21 of the Hadoop big-data processing layer 2 is connected to the task scheduling module 22, the model management module 23, the algorithm processing module 24, the enterprise application module 25 and the data storage module 26 respectively, and the modules are connected to each other by a bus.
Fig. 3 is the flow chart of automatic annotation of topic corpora according to an embodiment of the invention. The corpus annotation workflow is described in detail below with reference to the drawings.
First, a new corpus is input through the application interface unit 253, and the data preparation unit 251 converts the new corpus into the new corpus data required by the model language. Specifically, the input corpus is word-segmented and converted into VSM numerical values, and a VSM numerical matrix is built, i.e. a label–word TF-IDF matrix, where label is the topic mark, denoted Y, and TF-IDF is the frequency conversion of each word (the predictive variable) in each document, denoted X. This raw corpus data (the new corpus data), called and distributed by the modules of the system, is stored in the data storage module 26. The application interface unit 253 determines the processing to be performed on the corpus according to the input application, and sends the relevant information through the HTTP protocol unit 211 of the communication module 21 to the R encapsulation calling layer 1, where the JSON parsing unit 122 resolves it into the corresponding R language information (step S1). The Main Fun module 13 under the R encapsulation calling layer 1 analyzes and matches the parsed R language information, and judges whether a processing model Model(t-1) matching the application data is stored in the Main Fun module 13 (step S2).
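As a concrete illustration of the preprocessing described above, the following sketch builds a toy label–word matrix with TF-IDF weighting. The function name and toy corpus are hypothetical stand-ins, not the patented implementation (which performs segmentation and VSM conversion inside the data preparation unit 251):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a TF-IDF matrix from pre-segmented documents.

    Rows correspond to documents, columns to the sorted vocabulary:
    tfidf = (term frequency in doc) * log(n_docs / doc frequency)."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency per term
    rows = []
    for d in docs:
        tf = Counter(d)
        rows.append([(tf[w] / len(d)) * math.log(n / df[w]) for w in vocab])
    return vocab, rows

# Toy corpus: each document is already word-segmented; Y holds topic marks.
docs = [["price", "up"], ["price", "down"], ["goal", "match"]]
Y = [0, 0, 1]
vocab, X = tfidf_matrix(docs)
```

In the system the matrix X and the labels Y would then be stored in the data storage module and distributed to the modeling steps.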
When it is judged that no matching model Model(t-1) corresponding to the new corpus data is stored in the Main Fun module 13 (step S2: no), the analysis operator operates the R encapsulation calling layer, inputs modeling instructions and configuration information, and makes the system model the application data anew (step S3).
Fig. 4 is a flowchart of building a new model for the topic corpus in step S3 of the embodiment. First, the analysis operator calls the stored training corpus from the data storage module 26; the training corpus is sent to the data preparation unit 251 and preprocessed: the corpus is word-segmented and converted into VSM numerical values, and a VSM numerical matrix is built (step S31). The VSM numerical matrices preprocessed in steps S1 and S31 are sent to the R encapsulation calling layer 1, the parameters are initialized through the parameter configuration module 11, and the R module scheduling script is triggered. The R encapsulation calling layer 1 triggers the Job parsing unit 223 of the task scheduling module 22 of the Hadoop big-data processing layer 2, and sends the information of the model to be built and the various instruction and task information to the Job parsing unit 223; the Job parsing unit 223 generates a task list by parsing the instruction information and distributes the task content to the Job execution unit 222 (step S32). The Job execution unit 222 triggers the model processing unit 232 of the model management module 23 according to the task content. The model processing unit 232 triggers the data dispatching unit 221 and the configuration unit 224 according to the task content, while the file distribution unit 221 sends the task parameters retrieved from the configuration unit 224 to the model processing unit, and boosting data distribution to nodes 1~n proceeds as shown in Fig. 5 (step S33). Next, the model processing unit 232 retrieves the model parameters in the configuration unit 234 according to the task information of the Job execution unit 222, and triggers the core algorithm unit 242 of the algorithm processing module 24 of each node; the core algorithm unit 242 computes the convergent model parameters (such as the parameter estimates, the bandwidth and the support vectors) (step S34). In this step the model parameters computed by the core algorithm unit 242 are mainly the parameter estimates, the bandwidth and the support vectors, and the corresponding algorithm is as follows:
(1) Build the class predictive equation and likelihood function P(Y=1|X) = g(f(X)). More generally, when the link function g is the sigmoid and f(x) is linear, P(Y=1|X) = 1/(1 + e^(−f(X))) expresses the traditional parametric logistic regression model (LR). We generalize it with a nonparametric method and obtain the more general class maximum likelihood function
L(f) = Σ_{i=1}^{n} [ Y_i·log g(f(X_i)) + (1 − Y_i)·log(1 − g(f(X_i))) ],
where X_i is the word-frequency vector of the i-th sample, Y_i is the topic mark of the i-th sample, and l is the order of the polynomial approximation of the likelihood function;
(2) Apply a kernel feature mapping to the original features, which gives the nonparametric class maximum likelihood function:
L(α) = Σ_{i=1}^{n} [ Y_i·log g(f(X_i)) + (1 − Y_i)·log(1 − g(f(X_i))) ],  with f(X) = Σ_s α_s·K((X − X_s)/h);
it is maximized by gradient descent until convergence, where hat(α) denotes the parameter estimate;
(3) The bandwidth is the key parameter of the smoothing constraint of the feature mapping and of the local linear estimation; it is solved by:
h_opt = [ σ²_{0,1}(x)·∫K²_{0,1}(z)dz / ( n·f_X(x)·(m″(x))²·(∫z²K_{0,1}(z)dz)² ) ]^{1/5},
where h_opt is the optimal value of h solved from the equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardized kernel function, σ²_{0,1}(·) denotes the variance function normalized between 0 and 1, f_X is the design density of X, and m″ is the second derivative of the regression function.
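The nonparametric generalization of logistic regression sketched above can be illustrated, under simplifying assumptions (one-dimensional features, a Gaussian kernel, a fixed bandwidth h, plain gradient ascent on the log-likelihood), roughly as follows; the function names and data are illustrative, not the patented algorithm itself:

```python
import math

def gaussian_kernel(z):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def kernel_logistic_fit(X, Y, h, lr=0.5, iters=200):
    """Fit alpha in P(Y=1|x) = sigmoid(sum_i alpha_i * K((x - X_i)/h))
    by gradient ascent on the likelihood (1-D toy case)."""
    n = len(X)
    alpha = [0.0] * n

    def pred(x, a):
        s = sum(a[i] * gaussian_kernel((x - X[i]) / h) for i in range(n))
        return 1.0 / (1.0 + math.exp(-s))

    for _ in range(iters):
        grad = [0.0] * n
        for j in range(n):
            err = Y[j] - pred(X[j], alpha)       # residual drives the update
            for i in range(n):
                grad[i] += err * gaussian_kernel((X[j] - X[i]) / h)
        alpha = [alpha[i] + lr * grad[i] / n for i in range(n)]
    return alpha, pred

X = [0.0, 0.2, 0.8, 1.0]   # toy 1-D features
Y = [0, 0, 1, 1]           # toy topic marks
alpha, pred = kernel_logistic_fit(X, Y, h=0.3)
```

In this sketch the training points themselves play the role of support points, and h controls the smoothness of the fitted class boundary.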
The parameter estimates obtained above, the optimal bandwidth h and the support vectors (support points) X_s (the s samples that support the optimal labels) are the model parameters to be computed by the core algorithm unit 242. If the parameters obtained after one pass do not converge, the system iterates until the computed parameters converge. Fig. 5 is a schematic diagram of the nonparametric distributed training and model integration in the embodiment. The acquisition of the convergent model parameters in step S34 is described with reference to Fig. 5. In step S33 the data and parameters are distributed to each of the boosting nodes 1~n, and the core algorithm unit 242 of each node computes its own model parameters. If, within the number of iterations specified by the system or the analysis operator, the computed model parameters do not converge, the parameter dispatching unit 241 redistributes the current non-convergent parameters to each boosting node as new initial parameters and performs the next round of iteration, until the model parameters converge; if the model parameters still do not converge within the specified number of iterations, the system issues an error prompt, and the analysis operator takes further action according to the prompt.
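A minimal sketch of the dispatch-and-iterate loop described above — distribute the current parameters to the nodes, collect the local results, integrate them, and repeat until convergence or an iteration limit — using a toy one-parameter estimation problem as a stand-in for the per-node computation of core algorithm unit 242:

```python
def node_update(theta, shard, lr=0.5):
    """One local step toward the shard mean: a stand-in for the
    per-node parameter estimation done on each boosting node."""
    grad = sum(theta - x for x in shard) / len(shard)
    return theta - lr * grad

def distributed_fit(shards, tol=1e-6, max_iter=500):
    """Iterate: dispatch theta to all nodes, average their local
    results, and repeat until the parameter stops moving."""
    theta = 0.0
    for it in range(max_iter):
        locals_ = [node_update(theta, s) for s in shards]  # per-node results
        new_theta = sum(locals_) / len(locals_)            # integrate
        if abs(new_theta - theta) < tol:
            return new_theta, it
        theta = new_theta                                  # redistribute
    raise RuntimeError("parameters failed to converge within max_iter")

# Three data shards, one per node.
shards = [[1.0, 2.0], [3.0, 5.0], [4.0, 6.0]]
theta, iters = distributed_fit(shards)
```

The `RuntimeError` plays the role of the system's error prompt when the iteration limit is reached without convergence.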
After the convergent model parameters have been computed, the convergent model parameters computed by each node are returned to the parameter integration unit 231 under the model management module 23, which integrates them and builds the model, while generating a new model parameter file to be used for the prediction scoring of new data (step S35). The parameter integration method in step S35 is:
Nodepred_j = (1/(T·n))·Σ_{t=1}^{T} Σ_{i=1}^{n} pred_{t,i,j},
where Nodepred is the integrated prediction result over the nodes for each sample, i is the node index, j is the sample index, T is the number of iterations, and pred_{t,i,j} is the predicted value obtained for the j-th sample on the i-th node in the t-th iteration by the trained classifier; θ_{t,i,j} is the predictive-variable parameter produced by each node, and hat(Φ) is the set of estimates of the kernel function, i.e. the set of function values of the support-vector and bandwidth estimates.
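A direct transcription of the integration formula above, assuming the per-iteration, per-node predictions are available as a nested list (iteration × node × sample); the data below is illustrative:

```python
def integrate_predictions(preds):
    """Average the per-iteration, per-node predictions for each sample j:
    Nodepred_j = (1 / (T * n_nodes)) * sum_t sum_i preds[t][i][j]."""
    T = len(preds)
    n_nodes = len(preds[0])
    n_samples = len(preds[0][0])
    return [
        sum(preds[t][i][j] for t in range(T) for i in range(n_nodes)) / (T * n_nodes)
        for j in range(n_samples)
    ]

# 2 iterations x 2 nodes x 3 samples of classifier scores
preds = [
    [[0.9, 0.2, 0.6], [0.7, 0.4, 0.6]],
    [[0.8, 0.3, 0.6], [0.6, 0.1, 0.6]],
]
scores = integrate_predictions(preds)
```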
Finally, the integrated model and the model parameter file are stored in the data storage module 26, and at the same time fed back to the Main Fun module 13 under the R encapsulation calling layer 1 as the prediction configuration file (step S6) (Fig. 3).
Returning to Fig. 3, when it is judged that a model Model(t-1) matching the application is stored in the Main Fun module 13 (step S2: yes), the R encapsulation calling layer makes the prediction marking unit 252 under the enterprise application module 25 predict and score the degree of match between Model(t-1) and the application data (step S4). The prediction marking unit 252 calls the model parameter file in the parameter integration unit of the model management module to predict and score the input application data; when the score is below a certain value, Model(t-1) is considered unsuitable for the application data input in step S1 (step S4: no), and Model(t-1) must be optimized and updated (step S5). The information indicating that the model needs optimization is returned to the R encapsulation calling layer and displayed to the analysis operator.
Fig. 6 is a flowchart of the topic corpus model optimization and update of step S5 in the embodiment. Through the operation of the R encapsulation calling layer, the analysis operator makes the model processing unit 232 the master for data distribution: the updated data stored in the storage unit 243 in step S1 and the new corpus data converted by the data preparation unit 251 are distributed, and the model processing unit 232 triggers the error distribution update unit 233: the base data is applied to Model(t-1) and an error correction computation is performed, producing the error distribution of each new data sample as well as the error distribution of the previous period's training data, and the computed error distributions are stored in the error distribution update unit 233 (step S51). The error distributions of the new corpus and of the previous period's training corpus under Model(t-1), and the data update, are obtained in this step as follows. Let the new corpus samples be (X_{n+1}, …, X_{n+m}), with error distribution ε_new and initial weights w_i = 1/m, i.e. w_t = (1/m, …, 1/m); let the previous period's accumulated samples be (X_1, …, X_n), with initial weights w_i = 1/n and error distribution ε_old. Let the correction factors determined by the error distributions be β_new = (1 − ε_new)/ε_new and β_old = (1 − ε_old)/ε_old; the updated training sample weights are then
w_i ← w_i · β^{I(h(X_i) ≠ c(X_i))}, normalized so that Σ_i w_i = 1,
which updates the weights of both the previous period's corpus samples and the current period's corpus samples. Here h(X) is the algorithm or algorithm combination selected by the model management module, c(X) is the actual class of sample X, ε_new and ε_old are the error distributions obtained by model prediction for the new corpus samples and the previous period's corpus samples respectively, and β_new and β_old are the corresponding error correction factors. It can be seen from the above formulas that samples that are mispredicted receive more attention in subsequent training, which on the one hand automatically optimizes the model, and on the other hand reduces the chance that correctly predicted samples enter the new model training, saving storage space for the system and improving model training efficiency. The previous period's accumulated samples and the new corpus samples with updated weights are then merged into a new training sample set, which is stored in the data storage unit 26 and sent to the R encapsulation calling layer.
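The weight update above can be sketched in a boosting style: the correction factor β = (1 − ε)/ε is applied to the weights of mispredicted samples and the weights are renormalized. The error rate ε and the toy data below are illustrative:

```python
def update_weights(weights, errors, eps):
    """Boosting-style reweighting: samples the current model got wrong
    (errors[i] == 1) are boosted so later training attends to them."""
    beta = (1 - eps) / eps                                  # correction factor
    w = [wi * (beta if e else 1.0) for wi, e in zip(weights, errors)]
    z = sum(w)
    return [wi / z for wi in w]                             # renormalize

w0 = [0.25, 0.25, 0.25, 0.25]
errors = [1, 0, 0, 0]     # Model(t-1) mispredicted only the first sample
eps = 0.25                # weighted error rate of Model(t-1)
w1 = update_weights(w0, errors, eps)
```

After the update the mispredicted sample carries half of the total weight, so it dominates the next round of training.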
The analysis operator then configures the initial parameters in the parameter configuration module 11 of the R encapsulation calling layer 1, and sends the relevant information through the HTTP protocol unit 121 to the Hadoop big-data processing layer 2: the Job parsing unit 223 of the task scheduling module 22 of the Hadoop big-data processing layer 2 is triggered and the required data information is sent to it; the Job parsing unit 223 generates a task list by parsing the data information and distributes the task content to the Job execution unit 222 (step S52). The Job execution unit 222 triggers the model processing unit 232 of the model management module 23 according to the task. The model processing unit 232 triggers the data dispatching unit 221 and the configuration unit 224 according to the task content, while the file distribution unit 221 sends the task parameters retrieved from the configuration unit 224 to the model processing unit, and boosting data distribution proceeds (step S53). Next, the model processing unit 232 retrieves the model parameter information in the configuration unit 234 according to the task information of the Job execution unit 222 and triggers the core algorithm unit 242 of the algorithm processing module 24 of each node; the core algorithm unit 242 computes the convergent model parameters (the parameter estimates, the bandwidth, the support vectors, etc.) (step S54; the model parameter computation in this step is the same as in step S34). The convergent model parameters computed by each node are then returned to the parameter integration unit 231 under the model management module 23 for integration, a new model parameter file is generated, and a new model Model(t) is built (step S55; the integration method in this step is the same as in step S35). The parameter integration script is then run to compute the accuracy of Model(t) and to judge whether Model(t) is better than Model(t-1) (step S56). If the accuracy of Model(t) is below the system threshold, Model(t-1) is better than Model(t) (step S56: no); Model(t) is then saved as Model(t-1), overwriting it (step S57), and step S51 and the subsequent operations are repeated until the new model Model(t) is better than Model(t-1). If the accuracy of Model(t) is above the system threshold, Model(t) is judged better than Model(t-1) (step S56: yes), and Model(t), the model parameter file, etc. are returned to the Main Fun module under the R encapsulation calling layer (step S6) as the prediction configuration file.
Returning to Fig. 3, when the prediction marking unit 252 gives the enterprise application a prediction score above a certain value, the enterprise application input in step S1 is considered applicable to the stored model Model(t-1) (step S4: yes); the model Model(t-1) is therefore directly designated as Model(t) and stored in the data storage module 26, and is simultaneously fed back to the Main Fun module 13 under the R encapsulation calling layer 1 as the prediction configuration file (step S6). Finally, the topic labeling unit 254 of the enterprise application module 25 performs LDA text clustering on the corpus using the model parameter file corresponding to Model(t), forms finer-grained topic labels and a probabilistic dictionary for each topic category, stores the final topic labels and probabilistic dictionaries in the data storage unit 26, and deploys them for later application (step S7); at the same time, the data storage unit 26 feeds back to the R encapsulation calling layer 1 the storage addresses of the newly stored topic labels, probabilistic dictionaries and related information.
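For illustration of the LDA text clustering performed by the corpus labeling unit, the following is a minimal collapsed Gibbs sampler that produces per-topic word-probability dictionaries from a toy segmented corpus; it is a textbook LDA sketch, not the system's actual implementation:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA; returns, for each of the
    K topics, a word -> probability dictionary (a stand-in for the
    probabilistic dictionaries formed by the corpus labeling unit)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    z = [[rng.randrange(K) for _ in d] for d in docs]   # topic of each token
    ndk = [[0] * K for _ in docs]                       # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]          # topic-word counts
    nk = [0] * K                                        # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [{w: (nkw[k][w] + beta) / (nk[k] + V * beta) for w in vocab}
            for k in range(K)]

docs = [["price", "stock", "price"], ["stock", "price", "market"],
        ["goal", "match", "goal"], ["match", "goal", "team"]]
topics = lda_gibbs(docs, K=2)
```

Each returned dictionary is a topic's word distribution; the highest-probability words of each topic would serve as its finer-grained topic label.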
In the above embodiment, if the model parameters computed by the core algorithm unit 242 never converge within the specified number of iterations, the system issues a warning, but the invention is not limited to this. Alternatively, when the model parameters computed by the core algorithm unit 242 do not converge within the specified number of iterations, the result of the last iteration may be saved and sent as the final computed result, although such a result may increase the error of subsequent predictions.
In the above embodiment, distributed computation is performed by the Hadoop big-data processing layer, but the invention is not limited to this; a distributed processing system of another structure may be used instead of the Hadoop big-data processing layer to perform the above distributed computation.

Claims (16)

1. A distributed system for nonparametric automatic topic labeling, comprising: an R encapsulation calling layer and a distributed data processing layer;
the R encapsulation calling layer comprises a parameter configuration module, a first communication parsing module and a main function module;
the distributed data processing layer comprises a second communication parsing module, a task scheduling module, a model management module, an algorithm processing module and an enterprise application module; wherein,
the parameter configuration module is configured to receive configuration information;
the main function module is configured to receive algorithm scheduling and feedback information processing, to integrate and encapsulate the two, to complete personalized distribution, and accordingly to generate the executable configuration of the distributed data processing layer and send the task information to be executed to the distributed data processing layer;
the first communication parsing module is configured to communicate with the second communication parsing module, so as to establish communication between the R encapsulation calling layer and the distributed data processing layer and to parse the communication content;
the task scheduling module is configured to receive the executable configuration and the task information to be executed sent by the main function module, and to correspondingly control and coordinate the work of the model management module, the algorithm processing module and the enterprise application module;
the model management module is configured to build models, to instruct the algorithm processing module to compute model parameters, and to integrate the model parameters returned by the algorithm processing module to generate a model parameter file;
the algorithm processing module is configured to receive the instruction of the model management module, compute the model parameters and return the result;
the enterprise application module is configured to preprocess the received corpus and to generate topic labels according to the model parameter file generated by the model management module.
2. The system according to claim 1, characterized in that:
in the preprocessing, the enterprise application module builds a label–word TF-IDF matrix after the corpus is word-segmented;
wherein label is the topic mark, denoted Y, and TF-IDF is the frequency conversion of the occurrence of each word (the predictive variable) in each corpus, denoted X.
3. The system according to claim 2, characterized in that:
the algorithm processing module solves at least one of the parameter estimates, the optimal bandwidth h and the support vectors X_s as follows:
a) build the class predictive equation and likelihood function P(Y=1|X) = g(f(X)), and generalize it with a nonparametric method to obtain the more general class maximum likelihood function
L(f) = Σ_{i=1}^{n} [ Y_i·log g(f(X_i)) + (1 − Y_i)·log(1 − g(f(X_i))) ],
where l is the order of the polynomial approximation of the likelihood function, the link function g is the sigmoid, X_i is the frequency conversion of the occurrence of each word in the i-th sample, and Y_i is the topic mark of the i-th sample;
b) apply a kernel feature mapping to the original features to obtain the nonparametric class maximum likelihood function, with f(X) = Σ_s α_s·K((X − X_s)/h);
maximize the likelihood function by gradient descent until convergence, where hat(α) denotes the estimate of the parameter α and h denotes the bandwidth;
c) solve the bandwidth by:
h_opt = [ σ²_{0,1}(x)·∫K²_{0,1}(z)dz / ( n·f_X(x)·(m″(x))²·(∫z²K_{0,1}(z)dz)² ) ]^{1/5},
where h_opt is the optimal value of h solved from the equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardized kernel function, and σ²_{0,1}(·) denotes the variance function normalized between 0 and 1.
4. The system according to any one of claims 1 to 3, characterized in that:
the enterprise application module comprises a prediction marking unit for scoring the result of processing the new corpus with the existing model;
if the score is qualified, the system outputs the result of processing the new corpus with the existing model; if the score is unqualified, the system updates the existing model and processes the new corpus with the updated model.
5. The system according to claim 4, characterized in that:
the model management module comprises an error distribution update unit;
the error distribution update unit assigns each sample an error update weight, adjusts the sample weights when updating the existing model, and selects the samples whose weights meet a certain condition as the training samples needed for updating the model.
6. The system according to claim 4, characterized in that:
the algorithm processing module comprises a parameter dispatching unit and a plurality of core algorithm units; wherein,
the core algorithm units are configured to perform algorithm iterations and produce model parameters;
the parameter dispatching unit is configured to distribute initial parameters to each core algorithm unit, and to send the computation result of each core algorithm unit after each iteration to the other core algorithm units.
7. The system according to claim 6, characterized in that:
if the model parameters do not converge after a predetermined number of iterations, the parameter dispatching unit redistributes the current parameters to each core algorithm unit as new initial parameters and automatically performs the next round of iteration; if the model converges, the parameter dispatching unit returns the model parameters to the model management module for parameter integration.
8. The system according to claim 7, characterized in that:
the model management module performs the parameter integration by the following formula:
Nodepred_j = (1/(T·n))·Σ_{t=1}^{T} Σ_{i=1}^{n} pred_{t,i,j},
where Nodepred is the integrated prediction result over the nodes for each sample, i is the node index, j is the sample index, T is the number of iterations, and pred_{t,i,j} is the predicted value obtained for the j-th sample on the i-th node in the t-th iteration by the trained classifier; θ_{t,i,j} is the predictive-variable parameter produced by each node, and hat(Φ) is the set of estimates of the kernel function, i.e. the set of function values of the support-vector and bandwidth estimates.
9. A method for nonparametric automatic topic labeling, comprising:
a configuration step of configuring the initialization parameters, the service selection and the algorithm selection information;
a sending step of sending the configured information;
a parsing step of parsing the configured information into the executable configuration parameters of the distributed data processing layer and the task information to be executed;
a task scheduling step of distributing the executable configuration parameters and the task information to be executed to a model management module, an algorithm processing module and an enterprise application module;
a model management step of building a nonparametric model;
an algorithm processing step of computing the parameters needed by the model built in the model management step;
an enterprise application step of preprocessing the received corpus and outputting the result of processing the corpus with the model.
10. The method according to claim 9, characterized by:
further comprising a preprocessing step of building a label–word TF-IDF matrix after the corpus is word-segmented;
wherein label is the topic mark, denoted Y, and TF-IDF is the frequency conversion of the occurrence of each word (the predictive variable) in each corpus, denoted X.
11. The method according to claim 10, characterized in that:
in the model management step, at least one of the parameter estimates, the optimal bandwidth h and the support vectors X_s is solved as follows:
a) build the class predictive equation and likelihood function P(Y=1|X) = g(f(X)), and generalize it with a nonparametric method to obtain the more general class maximum likelihood function
L(f) = Σ_{i=1}^{n} [ Y_i·log g(f(X_i)) + (1 − Y_i)·log(1 − g(f(X_i))) ],
where l is the order of the polynomial approximation of the likelihood function, the link function g is the sigmoid, X_i is the frequency conversion of the occurrence of each word in the i-th sample, and Y_i is the topic mark of the i-th sample;
b) apply a kernel feature mapping to the original features to obtain the nonparametric class maximum likelihood function, with f(X) = Σ_s α_s·K((X − X_s)/h);
maximize the likelihood function by gradient descent until convergence, where hat(α) denotes the estimate of the parameter α and h denotes the bandwidth;
c) solve the bandwidth by:
h_opt = [ σ²_{0,1}(x)·∫K²_{0,1}(z)dz / ( n·f_X(x)·(m″(x))²·(∫z²K_{0,1}(z)dz)² ) ]^{1/5},
where h_opt is the optimal value of h solved from the equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardized kernel function, and σ²_{0,1}(·) denotes the variance function normalized between 0 and 1.
12. The method according to any one of claims 9 to 11, characterized by:
further comprising a prediction marking step of scoring, with a prediction marking unit, the result of processing the new corpus with the existing model;
if the score is qualified, outputting the result of processing the new corpus with the existing model; if the score is unqualified, updating the existing model and processing the new corpus with the updated model.
13. The method according to claim 12, characterized in that:
in the model management step, each sample is assigned an error update weight, the sample weights are adjusted when updating the existing model, and the samples whose weights meet a certain condition are selected as the training samples needed for updating the model.
14. The method according to claim 12, characterized in that:
in the algorithm processing step, a plurality of core algorithm units for performing algorithm iterations and producing model parameters are provided;
a parameter dispatching unit distributes initial parameters to each core algorithm unit, and the computation result of each core algorithm unit after each iteration is sent to the other core algorithm units.
15. The method according to claim 14, characterized in that:
if the model parameters do not converge after a predetermined number of iterations, the current parameters are used as new initial parameters and the next round of iteration is performed automatically; if the model converges, the model parameters are subjected to parameter integration.
16. The method according to claim 15, characterized in that:
the parameter integration is performed by the following formula:
Nodepred_j = (1/(T·n))·Σ_{t=1}^{T} Σ_{i=1}^{n} pred_{t,i,j},
where Nodepred is the integrated prediction result over the nodes for each sample, i is the node index, j is the sample index, T is the number of iterations, and pred_{t,i,j} is the predicted value obtained for the j-th sample on the i-th node in the t-th iteration by the trained classifier; θ_{t,i,j} is the predictive-variable parameter produced by each node, and hat(Φ) is the set of estimates of the kernel function, i.e. the set of function values of the support-vector and bandwidth estimates.
CN201510186154.9A 2015-04-20 2015-04-20 A kind of distributed system and mask method of non-parametric topic automatic marking Active CN104778254B (en)


Publications (2)

Publication Number Publication Date
CN104778254A CN104778254A (en) 2015-07-15
CN104778254B true CN104778254B (en) 2018-03-27
