CN104778254B - A distributed system and annotation method for non-parametric automatic topic annotation - Google Patents

- Publication number: CN104778254B (application CN201510186154.9A)
- Authority: CN (China)
- Prior art keywords: model, parameter, module, sample, unit
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Landscapes: Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention provides a distributed system for non-parametric automatic topic annotation, comprising an R encapsulation calling layer and a distributed data processing layer. The R encapsulation calling layer includes a parameter configuration module, a first communication parsing module, and a main function module. The distributed data processing layer includes a second communication parsing module, a task scheduling module, a model management module, an algorithm processing module, and an enterprise application module. The present invention also provides a method for non-parametric automatic topic annotation.
Description
Technical field
The present invention relates to a big-data processing method in the field of statistical learning, and more particularly to a system for automatic topic-annotation modeling, distributed deployment, and service application.
Background art

As Internet technology and products mature, information on the Internet is expanding rapidly, and people leave traces of themselves on various platforms and media through many carriers: they comment on goods on e-commerce platforms and post about topics of interest on microblogs. This directly results in the rapid accumulation of massive amounts of text data. How to mine from these texts the topics that users express, using techniques such as semantic analysis and statistical learning, has become a technical problem of great interest and value to industry, because a large number of service applications perform precision marketing and build data products based on the mined information. Considerable research already exists in this field in both academia and industry.

However, we have found that the prior art has at least the following problems in practical use and research. First, existing techniques all assume that the word distribution of a text, or its latent topic distribution, obeys some hypothesized distribution, and then carry out parameter iteration and model training on that basis; a drawback of this approach is that when the words of actual texts and the true topics of users do not obey the hypothesized distribution, the model trained under that hypothesis becomes seriously biased. Second, some machine learning algorithms, such as SVM and neural networks, possess strong predictive ability, but their high computational complexity limits their commercial application in the big-data era and hinders their adoption. Third, the prior art requires model parameters to be updated manually and periodically, and lacks self-learning capability.
Summary of the invention
To overcome the above disadvantages, the invention provides a distributed system for non-parametric automatic topic annotation, comprising: an R encapsulation calling layer and a distributed data processing layer. The R encapsulation calling layer includes a parameter configuration module, a first communication parsing module, and a main function module; the distributed data processing layer includes a second communication parsing module, a task scheduling module, a model management module, an algorithm processing module, and an enterprise application module. The parameter configuration module is used to receive configuration information. The main function module is used to carry out personalized development of algorithm scheduling, information feedback, and other processing, to generate accordingly a configuration executable by the distributed data processing layer, and to send the task information to be executed to the distributed data processing layer. The first communication parsing module is used to connect with the second communication parsing module, so as to establish communication between the R encapsulation calling layer and the distributed data processing layer and to parse the communication content. The task scheduling module is used to receive the executable configuration and the task information to be executed sent by the main function module, and accordingly to control and coordinate the work of the model management module, the algorithm processing module, and the enterprise application module. The model management module is used to build the model, to direct the algorithm processing module to compute the model parameters, to integrate the model parameters returned by the algorithm processing module, and to generate the model parameter file. The algorithm processing module receives the instructions of the model management module, computes the model parameters, and returns the results. The enterprise application module is used to preprocess the received corpus and to generate topic annotations from the model parameter file generated by the model management module.
Preferably, during preprocessing the enterprise application module segments the corpus into words and then builds a label-word TF-IDF matrix, where the label is the topic tag, denoted Y, and the TF-IDF value is the frequency transformation of each word (the predictor variable) in each corpus document, denoted X.
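The label-word TF-IDF construction described above can be sketched as follows. This is a minimal illustration assuming an unsmoothed IDF of log(N/df); the patent does not specify its exact TF-IDF variant.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a TF-IDF matrix X (rows: documents, columns: vocabulary).

    docs: list of token lists (the segmented corpus).
    Returns (vocab, matrix) where matrix[i][j] is the TF-IDF weight
    of vocab[j] in docs[i]. Words appearing in every document get
    weight 0 under this unsmoothed IDF (an assumption).
    """
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    matrix = []
    for d in docs:
        tf = Counter(d)
        row = [tf[w] / len(d) * math.log(n / df[w]) for w in vocab]
        matrix.append(row)
    return vocab, matrix
```

The resulting matrix is the predictor X; the topic tags of the documents form the response Y.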
Preferably, the algorithm processing module solves for at least one of the parameter estimates, the optimal bandwidth h, and the support vectors X_s, as follows: a) construct the class predictive equation and likelihood function, and generalize them by a non-parametric method to obtain a more general class maximum-likelihood function, where l is the order of the polynomial approximation of the likelihood function; b) apply a kernel feature mapping to the original features to obtain the non-parametric class maximum-likelihood function, and maximize it by gradient descent until convergence, where hat(·) denotes the estimate of a parameter; c) solve for the window width, where h_opt is the optimal value of the window width h solved from the equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardized kernel function, and the variance function is normalized to between 0 and 1.
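The patent's likelihood equations are published only as images, so the exact objective is not reproducible here. As a hedged stand-in for step b), the sketch below trains a standard kernel logistic model by gradient ascent on its log-likelihood, with a Gaussian kernel, a fixed window width h, and a given support set X_s; all of these specific choices are assumptions.

```python
import math

def gauss_kernel(x, s, h):
    # Gaussian kernel with window width (bandwidth) h
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, s)) / (2 * h * h))

def fit_kernel_logistic(X, y, support, h, lr=0.5, steps=200):
    """Gradient ascent on the log-likelihood of a kernel logistic model
    (a stand-in for the patent's unpublished class likelihood).

    support: the support vectors X_s; h: the window width.
    Returns one weight per support vector.
    """
    w = [0.0] * len(support)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, t in zip(X, y):
            k = [gauss_kernel(x, s, h) for s in support]
            p = 1.0 / (1.0 + math.exp(-sum(wi * ki for wi, ki in zip(w, k))))
            for j in range(len(w)):
                grad[j] += (t - p) * k[j]    # d log-likelihood / d w_j
        w = [wi + lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def predict(x, support, w, h):
    z = sum(wi * gauss_kernel(x, s, h) for wi, s in zip(w, support))
    return 1.0 / (1.0 + math.exp(-z))
```

Prediction depends only on the kernel evaluations against the support set, which matches the document's claim that prediction relies on a small number of support vectors.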
Preferably, the enterprise application module includes a prediction scoring unit that scores the result of processing a new corpus with the existing model. If the score is acceptable, the system outputs the result of processing the new corpus with the existing model; if the score is unacceptable, the system updates the existing model and processes the new corpus with the updated model.
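The score-then-annotate-or-update decision can be sketched as below; the numeric threshold and the callable stand-ins for the annotation and retraining paths are assumptions, since the patent does not give an acceptance criterion.

```python
def annotate_or_update(model_score, threshold, annotate, update_model):
    """Sketch of the prediction scoring unit's decision.

    model_score: quality score of the existing model on the new corpus.
    annotate: callable that runs annotation with a given model.
    update_model: callable that retrains and returns the updated model.
    """
    if model_score >= threshold:
        return annotate("existing")   # score acceptable: reuse the model
    updated = update_model()          # score unacceptable: retrain first
    return annotate(updated)
```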
Preferably, the model management module includes an error distribution update unit. The error distribution update unit assigns each sample an error update weight, adjusts the weights of the samples when updating the existing model, and selects the samples whose weights satisfy a given condition as the training samples for the model update.
Preferably, the algorithm processing module includes a parameter dispatching unit and multiple core algorithm units. Each core algorithm unit is used to carry out algorithm iterations and to produce model parameters; the parameter dispatching unit distributes the initial parameters to each core algorithm unit and sends the result computed by each core algorithm unit after each iteration to the other core algorithm units.
Preferably, if the model parameters have not converged after a predetermined number of iterations, the parameter dispatching unit redistributes the current parameters to each core algorithm unit as new initial parameters and automatically starts the next round of iteration; if the model has converged, the parameter dispatching unit returns the model parameters to the model management module for parameter integration.
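The redistribute-and-iterate control of the parameter dispatching unit can be sketched as follows. The sequential loop stands in for the parallel core algorithm units, and the convergence test (maximum parameter change below a tolerance) and the averaging of shared results are assumptions.

```python
def dispatch_until_converged(init_params, units, rounds=3,
                             iters_per_round=50, tol=1e-6):
    """Sketch of the parameter dispatching unit's control loop.

    units: list of callables; each takes (params, n_iters) and returns
    an updated parameter vector, standing in for one core algorithm
    unit running n_iters iterations.
    Returns (params, converged).
    """
    params = list(init_params)
    for _ in range(rounds):
        results = [unit(list(params), iters_per_round) for unit in units]
        # share every unit's result; here combined by averaging
        new_params = [sum(r[k] for r in results) / len(results)
                      for k in range(len(params))]
        if max(abs(a - b) for a, b in zip(new_params, params)) < tol:
            return new_params, True    # converged: hand off to integration
        params = new_params            # redistribute as new initial params
    return params, False
```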
Preferably, the model management module performs the parameter integration according to a formula in which Nodepred is the integrated prediction result over the nodes and samples, i is the node index, j is the sample index, T is the number of iterations, pred_{t,i,j} is the prediction obtained for the j-th sample on the i-th node at the t-th iteration by the trained classifier, a parameter indexed by t, i, j is the prediction variable parameter produced by each node, and the remaining term is the set of estimated function values of the kernel, that is, the estimates of the support vectors and the window width.
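The integration formula itself appears only as an image in the original publication. As an assumption about its form, the sketch below combines the per-node, per-iteration predictions pred[t][i][j] into one integrated prediction per sample by a weighted average, which is consistent with the described Ensemble step but is not guaranteed to be the patent's exact formula.

```python
def integrate_predictions(pred, weight=None):
    """Stand-in for the Nodepred integration.

    pred[t][i][j]: prediction at iteration t, node i, sample j.
    weight: optional weight[t][i][j] (the per-node prediction variable
    parameter); uniform weights if omitted.
    Returns one integrated prediction per sample j.
    """
    T, n_nodes, n_samples = len(pred), len(pred[0]), len(pred[0][0])
    node_pred = []
    for j in range(n_samples):
        num = den = 0.0
        for t in range(T):
            for i in range(n_nodes):
                w = weight[t][i][j] if weight else 1.0
                num += w * pred[t][i][j]
                den += w
        node_pred.append(num / den)
    return node_pred
```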
The present invention also provides a method for non-parametric automatic topic annotation, comprising: a configuration step, which configures the initialization parameters, the service selection, and the algorithm selection; a sending step, which sends the configured information; a parsing step, which parses the configured information into configuration parameters executable by the distributed data processing layer and the task information to be executed; a task scheduling step, which distributes the executable configuration parameters and the task information to be executed to the model management module, the algorithm processing module, and the enterprise application module; a model management step, which builds the non-parametric model; an algorithm processing step, which computes the parameters required by the model built in the model management step; and an enterprise application step, which preprocesses the received corpus and outputs the result of processing the corpus with the model.
Preferably, the method also includes a preprocessing step, which segments the corpus into words and then builds the label-word TF-IDF matrix, where the label is the topic tag, denoted Y, and the TF-IDF value is the frequency transformation of each word (the predictor variable) in each corpus document, denoted X.
Preferably, the model management step solves for at least one of the parameter estimates, the optimal bandwidth h, and the support vectors X_s, as follows: a) construct the class predictive equation and likelihood function, and generalize them by a non-parametric method to obtain a more general class maximum-likelihood function, where l is the order of the polynomial approximation of the likelihood function; b) apply a kernel feature mapping to the original features to obtain the non-parametric class maximum-likelihood function, and maximize it by gradient descent until convergence, where hat(·) denotes the estimate of a parameter; c) solve for the window width, where h_opt is the optimal value of the window width h solved from the equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardized kernel function, and the variance function is normalized to between 0 and 1.
Preferably, the method also includes a prediction scoring step, in which the prediction scoring unit scores the result of processing the new corpus with the existing model. If the score is acceptable, the result of processing the new corpus with the existing model is output; if the score is unacceptable, the existing model is updated and the new corpus is processed with the updated model.
Preferably, the model management step assigns each sample an error update weight, adjusts the weights of the samples when updating the existing model, and selects the samples whose weights satisfy a given condition as the training samples for the model update.
Preferably, the algorithm processing step sets up multiple core algorithm units that carry out the algorithm iterations and produce the model parameters; a parameter dispatching unit distributes the initial parameters to each core algorithm unit and sends the result computed by each core algorithm unit after each iteration to the other core algorithm units.
Preferably, if the model parameters have not converged after a predetermined number of iterations, the current parameters are taken as new initial parameters and the next round of iteration is started automatically; if the model has converged, the model parameters are passed to parameter integration.
Preferably, the parameter integration is performed according to a formula in which Nodepred is the integrated prediction result over the nodes and samples, i is the node index, j is the sample index, T is the number of iterations, pred_{t,i,j} is the prediction obtained for the j-th sample on the i-th node at the t-th iteration by the trained classifier, a parameter indexed by t, i, j is the prediction variable parameter produced by each node, and the remaining term is the set of estimated function values of the kernel, that is, the estimates of the support vectors and the window width.
The method and system for building non-parametric automatic topic annotation based on boosting provided by the invention use the kernel inner-product trick to map features to a higher-dimensional space without increasing computational complexity, use a dynamic adjustment mechanism driven by the prediction error distribution, distribute the data in boosting fashion to each node of the Hadoop cluster for distributed iteration of the model parameters and support vectors, and finally complete the topic annotation of the Internet corpus efficiently through scheduling scripts in the encapsulated R environment. The user interest model established by the invention is more accurate, and the burden and resource waste of server and client can be reduced. Compared with the prior art, this scheme has the following beneficial effects:

1. Because the non-parametric method does not depend on a prior distribution of the data, and the classifier is built in a high-dimensional feature space, the topic annotation model is more accurate and flexible; model prediction relies only on a small number of support vectors (far fewer than the vectors produced by SVM), and the computational complexity is only O(N^2 * Ns), where N is the data size and Ns is the number of support vectors.

2. Because the training-sample distribution is dynamically adjusted based on the prediction error distribution, the weights of mispredicted samples are increased and the weights of correctly predicted samples are decreased, or those samples are kept out of model training entirely; this further optimizes the model automatically and further reduces the computational complexity, with the error decreasing at a rate of O(1/T), where T is the number of iterations.

3. Because the model is deployed by distributing it to the cluster nodes in boosting fashion, the load and resource waste of server and system can be reduced and the training efficiency improved; moreover, the parameter passing back from each node is an Ensemble process, which further improves the accuracy of the model and reduces the risk of overfitting.
Brief description of the drawings

Fig. 1 is the architecture diagram of the distributed system for automatic topic annotation according to an embodiment of the present invention;

Fig. 2 is a structural diagram of the scheduling process of the modules for automatic topic annotation according to an embodiment of the present invention;

Fig. 3 is the flow chart of automatic annotation of a topic corpus according to an embodiment of the present invention;

Fig. 4 is the flow chart of building a new model for a topic corpus according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of the non-parametric distributed training and model integration according to an embodiment of the present invention;

Fig. 6 is the flow chart of updating the topic corpus model according to an embodiment of the present invention.
Embodiments

The present invention is described below with reference to the embodiments illustrated in the accompanying drawings. The embodiments disclosed here are to be considered in all respects as illustrative and not restrictive. The scope of the present invention is not limited by the following description of the embodiments, but only by the scope of the claims, and includes all variations having the same meaning as, and falling within the scope of, the claims.
To solve the above technical problems, the invention provides a distributed system and method for automatic topic annotation based on quasi-boosting non-parametric methods. The system comprises two major subsystems, the R encapsulation calling layer and the Hadoop big data processing layer, which intercommunicate over the HTTP protocol; JSON parsing in the background carries out the mutual scheduling and data passing between the two subsystems. Specifically, by operating the R encapsulation calling layer, analysts can use R to schedule computation in the Hadoop environment in an efficient and concise fashion. The Hadoop big data processing layer encapsulates modules of different functions, mainly including a communication module, a task scheduling module, a model management module, an algorithm processing module, an enterprise application module, and a data storage module. Each module in turn encapsulates units of different functions: the task scheduling module includes a file distribution unit, a Job parsing unit, a Job execution unit, and a configuration unit; the model management module includes a parameter integration unit, a model processing unit, an error distribution update unit, and a configuration unit; the algorithm processing module includes a parameter dispatching unit, core algorithm units, and a data distribution update and storage unit; the enterprise application module includes a data preparation unit, a prediction scoring unit, an application interface unit, and a corpus annotation unit. The described modules and units cooperate to realize the encapsulated functions of the distributed architecture of the core algorithm, model management, parsing of algorithm scheduling, quasi-boosting distribution and storage of the data, optimal integration of the parameters, model training and deployment, automatic update and optimization of the model parameters, and the enterprise application of the model.
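The two layers exchange scheduling information as JSON over HTTP. The field names below (`task`, `params`, `job`) are illustrative assumptions, not the patent's actual wire format; the sketch only shows building and parsing such a message with the standard `json` module, corresponding to the JSON parsing step that turns a request into an executable configuration and a list of jobs.

```python
import json

def build_dispatch_message(task, params):
    """Serialize a scheduling request from the R calling layer
    (field names are illustrative, not the patent's actual format)."""
    return json.dumps({"task": task, "params": params})

def parse_dispatch_message(payload):
    """Parse the message on the Hadoop side into an executable
    configuration and the list of jobs to run."""
    msg = json.loads(payload)
    config = msg["params"]
    job_list = [{"job": msg["task"], "config": config}]
    return config, job_list
```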
The distributed system for automatic topic annotation based on quasi-boosting non-parametric methods according to the present invention is described below with reference to Figs. 1-6.

Fig. 1 is the architecture diagram of the distributed system for automatic topic annotation according to an embodiment of the present invention. The system comprises two major subsystems, the R encapsulation calling layer 1 and the Hadoop big data processing layer 2, each with its own communication module. The communication module of the R encapsulation calling layer 1 and the communication module of the Hadoop big data processing layer 2 are connected by wired or wireless network, realizing the communication connection and data access between the two subsystems (the R encapsulation calling layer 1 and the Hadoop big data processing layer 2). The R encapsulation calling layer 1 runs in an R language environment, while the Hadoop big data processing layer 2 runs in Java.
The R encapsulation calling layer 1 includes a parameter configuration module 11, a communication module 12, and a Main Fun module 13. The parameter configuration module 11 of the R encapsulation calling layer 1 contains the parameter configuration files; according to the enterprise application at hand, analysts can enter startup information such as the initial parameters, service selection, and algorithm selection in the configuration files, configuring different parameters and startup information for different applications. The communication module 12 of the R encapsulation calling layer 1 consists of two parts, an HTTP protocol unit 121 and a JSON parsing unit 122. The HTTP protocol unit 121 is responsible for transmitting R language information, such as the parameter configuration files, R scripts, and model configuration parameters of the R encapsulation calling layer 1, to the Hadoop big data processing layer 2, and for receiving information such as feedback and model algorithm results from the Hadoop big data processing layer 2. The JSON parsing unit 122 is responsible for parsing the Java language information sent by the Hadoop big data processing layer 2 to the HTTP protocol unit 121 into corresponding R language scripts, so that the R encapsulation calling layer 1 can recognize the actions and feedback of each module of the Hadoop big data processing layer 2. The Main Fun module 13 stores simple function models written in R and functions that can call models in other languages; the module is responsible for integrating the operations required by the enterprise application, the information feedback from the Hadoop big data processing layer 2, and the algorithm scheduling into R language UDFs (User-Defined Functions), completing the personalized development, so that the R encapsulation calling layer 1 can execute other program packages and functions in the R environment and can execute the algorithms in the Hadoop environment through the specified protocol. The communication module 12 is connected with the parameter configuration module 11 and the Main Fun module 13 respectively.
The Hadoop big data processing layer 2 includes a communication module 21, a task scheduling module 22, a model management module 23, an algorithm processing module 24, an enterprise application module 25, and a data storage module 26.

The communication module 21 includes an HTTP protocol unit 211 and a JSON parsing unit 212. The HTTP protocol unit 211 is responsible for transmitting information such as processing feedback from the task scheduling module 22, the model management module 23, the algorithm processing module 24, the enterprise application module 25, and the data storage module 26 to the R encapsulation calling layer 1, and for receiving the parameter configuration files, R language scripts, and other information sent by the R encapsulation calling layer 1. The JSON parsing unit 212 of the communication module 21 is responsible for parsing the information and model configuration parameters of the R encapsulation calling layer 1 received by the HTTP protocol unit 211 into configuration parameter information executable by the Hadoop big data processing layer 2 and the Job information to be executed.
Fig. 2 is a structural diagram of the scheduling process of the modules for automatic topic annotation according to an embodiment of the present invention. The structures of the task scheduling module 22, the model management module 23, the algorithm processing module 24, the enterprise application module 25, and the data storage module 26 of the Hadoop big data processing layer 2 are described below with reference to Fig. 1 and Fig. 2.

The task scheduling module 22 includes a file distribution unit 221, a Job execution unit 222, a Job parsing unit 223, and a configuration unit 224. The module drives the model management module 23, the algorithm processing module 24, and the enterprise application module 25, and is responsible for automatic-running driver functions such as configuring the parameters of each module, passing parameters back, connecting the workflows, and executing data distribution commands. The Job parsing unit 223 determines the tasks that the Hadoop big data processing layer 2 needs to perform according to the executable configuration and the Job information to be executed parsed by the JSON parsing unit 212, carries out the data preparation, generates the task list, and triggers the Job execution unit 222. The Job execution unit 222 may be one unit or several units in parallel; it identifies each job task to be executed from the task list generated by the Job parsing unit 223, and triggers the work of the other modules accordingly according to the task content. The configuration unit 224 stores the parameters and file information required by the underlying Hadoop connection, and configures the relevant parameters for the file distribution unit 221 according to the task information identified by the Job execution unit 222. The file distribution unit 221 numbers and indexes the data distributed by the data preparation unit 251 of the enterprise application module 25, and allocates the parameter information and related task data to the model management module, the algorithm processing module, and so on according to the specific task information of the Job execution unit 222.
The model management module 23 includes a parameter integration unit 231, a model processing unit 232, an error distribution update unit 233, and a configuration unit 234. The model processing unit 232 encapsulates the core algorithm for non-parametric automatic topic annotation and can drive the work of the algorithm processing module 24. The configuration unit 234 stores the parameters required by the other units of the model management module 23 when modeling, and is responsible for providing the model parameters needed by those units. The parameter integration unit 231 integrates the converged model parameters computed by each Hadoop node within the given number of iterations, completes the Ensemble parameter integration process, generates the model parameter file, and transmits information such as the generated model parameter file to the R encapsulation calling layer 1; the integration method used in this process is given specifically in the later description. The error distribution update unit 233 assigns each sample an error update weight, recomputes the weights of the samples when the existing model needs to be updated, and selects the samples with larger recomputed weights as the training samples for the model update.
The algorithm processing module 24 includes a parameter dispatching unit 241, core algorithm units 242, and a data distribution update and storage unit 243. Multiple core algorithm units 242 are provided, and each core algorithm unit 242 is one boosting node. A core algorithm unit 242 executes various algorithms according to the specific task information, such as model solving and algorithm iteration; when carrying out algorithm iterations, the parameters computed by each core algorithm unit are redistributed by the parameter dispatching unit 241 for the next round of iteration. The unit can be triggered directly by the task scheduling module 22 to carry out simple feature computations, or triggered by the model processing unit 232 of the model management module 23 to carry out complex computations such as model solving or algorithm iteration. The parameter dispatching unit 241 is responsible for distributing the initial parameter data to each node of the server; the parameters can be obtained at random or be the non-converged parameters of each node computed by the core algorithm units 242. The data distribution update and storage unit 243 is responsible for updating and storing the corrected data after error correction when the core algorithm units 242 carry out algorithm iterations.
The enterprise application module 25 includes a data preparation unit 251, a prediction scoring unit 252, an application interface unit 253, and a corpus annotation unit 254; the module is the main entrance and exit of the system. The application interface unit 253 is responsible for interfacing with, and managing in real time, the other interfaces of the enterprise application, so as to obtain new applications. The data preparation unit 251 is mainly responsible for converting enterprise-level business problems into the data requirements of the model language, and for preparing and preprocessing the raw data called and distributed by the other models. The prediction scoring unit 252 predicts and scores the input data object by triggering the model parameter file in the model management module, to judge whether a model already stored in the system is applicable to the newly input application data; if the score is acceptable, the system outputs the result of processing the new corpus with the existing model, and if the score is unacceptable, the system updates the existing model and processes the new corpus with the updated model. The corpus annotation unit 254 is responsible for carrying out LDA text clustering on the input corpus according to the model parameter file, forming finer-grained topic annotations for each topic category and the probability dictionary of each topic, and storing information such as the generated topic annotations to the data storage module 26.

The data storage module 26 is responsible for storing the information processed by the four modules in the system, the task scheduling module 22, the model management module 23, the algorithm processing module 24, and the enterprise application module 25, such as the computed models, model parameters, and topic annotation information.
The communication module 21 of the Hadoop big data processing layer 2 is connected with the task scheduling module 22, the model management module 23, the algorithm processing module 24, the enterprise application module 25, and the data storage module 26 respectively, and the modules are connected to each other by a bus.
Fig. 3 is the flow chart for the topic language material automatic marking that embodiment of the present invention is related to.Below in conjunction with accompanying drawing specifically
The handling process of bright corpus labeling.
First, a new corpus is input through the application interface unit 253, and the data preparation unit 251 converts it into the corpus data required by the model language. Specifically, the input corpus is segmented and converted into VSM numerical form, and a VSM numerical matrix — a label–word TF-IDF matrix — is built, where label is the topic annotation, denoted Y, and TF-IDF is the frequency transform with which each word (the predictor variables) occurs, denoted X. This raw corpus data (the new corpus data) is stored in the data storage module 26 for the other modules of the system to call and distribute. The application interface unit 253 determines, from the input corpus application, the processing the corpus requires, and sends the relevant information through the HTTP protocol unit 211 of the communication module 21 to the R encapsulation calling layer 1, where the JSON parsing unit 122 parses it into the corresponding R-language information (step S1). The Main Fun module 13 under the R encapsulation calling layer 1 analyzes and matches the parsed R-language information, and judges whether a processing model Model(t-1) matching the application data is stored in the Main Fun module 13 (step S2).
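The preprocessing just described — segmentation followed by TF-IDF conversion into a predictor matrix X with topic labels Y — can be pictured with the minimal sketch below. This is an illustrative Python sketch, not the patent's R/Hadoop implementation, and the tiny pre-segmented corpus and its labels are invented for the example.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a TF-IDF matrix (one row per document) from pre-segmented docs."""
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    # document frequency of each word
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # term frequency times inverse document frequency
        rows.append([(tf[w] / total) * math.log(n / df[w]) if w in tf else 0.0
                     for w in vocab])
    return vocab, rows

# toy pre-segmented corpus with topic labels Y (hypothetical data)
docs = [["price", "cheap", "buy"],
        ["goal", "match", "buy"],
        ["match", "goal", "team"]]
Y = ["shopping", "sports", "sports"]
vocab, X = tfidf_matrix(docs)
```

Each row of X is the frequency transform of one document; words that occur in every document get weight zero, since they carry no topic signal.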
When it is judged that no matching model Model(t-1) corresponding to the new corpus data is stored in the Main Fun module 13 (step S2: no), the analyst operates the R encapsulation calling layer, inputting the modeling instruction and the configuration information, so that the system models the application data anew (step S3).
Fig. 4 is the flow chart of building a new topic-corpus model in step S3 according to an embodiment of the present invention. First, the analyst calls the training corpus stored in the data storage module 26; the training corpus is sent to the data preparation unit 251 and preprocessed: it is segmented and converted into VSM numerical form, and the VSM numerical matrix is built (step S31). The VSM numerical matrices preprocessed in steps S1 and S31 are sent to the R encapsulation calling layer 1, while the parameter configuration module 11 initializes the parameters and triggers the R module scheduling script. The R encapsulation calling layer 1 triggers the Job parsing unit 223 of the task scheduling module 22 of the Hadoop big-data processing layer 2, and sends the model-building information and the various task instructions to the Job parsing unit 223; the Job parsing unit 223 parses the instruction information to generate a task list and distributes the task content to the Job execution unit 222 (step S32). The Job execution unit 222 triggers the model processing unit 232 of the model management module 23 according to the task content. The model processing unit 232 triggers the data distribution unit 221 and the configuration unit 224 according to the task content; the data distribution unit 221 sends the task parameters retrieved from the configuration unit 224 according to the task content to the model processing unit, and the boosting data distribution to boosting nodes 1~n, shown in Fig. 5, begins (step S33). Next, the model processing unit 232 retrieves the model parameters in the configuration unit 234 according to the task information of the Job execution unit 222 and triggers the core algorithm unit 242 of the algorithm processing module 24 of each node; the core algorithm units 242 compute the convergent model parameters (such as the parameter estimates, the window width and the support vectors) (step S34). The model parameters computed by the core algorithm unit 242 in this step mainly comprise the parameter estimates, the window width and the support vectors, and the corresponding algorithm is:
1) Build the class predictive equation and likelihood function P(Y=1|X) = g(f(X)). More generally, when the link function g is the sigmoid and f(x) is linear, the model expressed is the traditional parametric logistic regression model (LR). Using nonparametric methods we generalize it and obtain the more general class maximum likelihood function
L(f) = Σ_i { Y_i·log g(f(X_i)) + (1 − Y_i)·log[1 − g(f(X_i))] },
where X_i is the frequency with which each word occurs in the i-th sample, Y_i is the topic annotation of the i-th sample, and l is the order of the polynomial approximation of the likelihood function;
2) Apply a kernel feature mapping to the original features, which gives the nonparametric class maximum likelihood function with f expressed in the kernel space, f(x) = Σ_s α_s·K_h(x, X_s); this likelihood is maximized by gradient descent until convergence, where hat(·) denotes the estimate of a parameter;
3) The window width is the key parameter of the smoothing constraint of the feature mapping and of the local linear estimation; h_opt denotes the optimal value of h obtained by solving the window-width equation, where K is the kernel function, K_{0,1}(z) is the 0-1 standardized kernel function, and the variance function is normalized to between 0 and 1.
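A minimal numerical illustration of the idea in steps 1–2 — a logistic likelihood whose predictor f lives in a kernel-mapped feature space, maximized by gradient ascent — is sketched below. The patent's exact equations are not reproduced in the source, so the kernel-logistic form, the Gaussian kernel, the learning rate and all data here are assumptions made for illustration only.

```python
import math

def gauss_kernel(x, z, h):
    """Gaussian kernel with window width h."""
    return math.exp(-0.5 * ((x - z) / h) ** 2)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_kernel_logistic(X, Y, h, lr=0.5, iters=2000):
    """Kernel logistic regression: f(x) = sum_j alpha_j * K_h(x, X_j).

    Maximizes the (nonparametric) class log-likelihood
      sum_i [ Y_i*log g(f(X_i)) + (1 - Y_i)*log(1 - g(f(X_i))) ]
    by gradient ascent on the coefficients alpha.
    """
    n = len(X)
    K = [[gauss_kernel(xi, xj, h) for xj in X] for xi in X]  # Gram matrix
    alpha = [0.0] * n
    for _ in range(iters):
        f = [sum(alpha[j] * K[i][j] for j in range(n)) for i in range(n)]
        resid = [Y[i] - sigmoid(f[i]) for i in range(n)]
        grad = [sum(resid[i] * K[i][j] for i in range(n)) for j in range(n)]
        alpha = [a + lr * g / n for a, g in zip(alpha, grad)]
    return alpha

def predict(x, X, alpha, h):
    f = sum(a * gauss_kernel(x, xj, h) for a, xj in zip(alpha, X))
    return sigmoid(f)

# toy 1-D data: the label switches from 0 to 1 around x = 0 (invented)
X = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
Y = [0, 0, 0, 0, 1, 1, 1, 1]
alpha = fit_kernel_logistic(X, Y, h=1.0)
p_pos = predict(1.5, X, alpha, h=1.0)   # probability of class 1 on the right
p_neg = predict(-1.5, X, alpha, h=1.0)  # probability of class 1 on the left
```

Shrinking the window width h makes the fitted probability surface wigglier, which is exactly why a bandwidth-selection step like 3) is needed.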
The parameter estimates obtained above, the optimal window width h and the support vectors (support points) X_s (the s samples that support the optimal annotation) are the model parameters required by the core algorithm unit 242. If the parameters obtained after one operation do not converge, the system needs to iterate until the computed parameters converge. Fig. 5 is a schematic diagram of the nonparametric distributed training and model integration according to an embodiment of the present invention. The acquisition of the convergent model parameters in step S34 is explained with reference to Fig. 5. In step S33 the data and parameters are distributed to each of the boosting nodes 1~n, and the core algorithm unit 242 of each node computes its own model parameters. If, within the number of iterations specified by the system or the analyst, the computed model parameters do not converge, the parameter distribution unit 241 redistributes the current non-convergent parameters to each boosting node as new initial parameters and carries out the next round of iterative computation, until the model parameters converge; if the model parameters never converge within the specified number of iterations, the system issues an error prompt, and the analyst operates further according to the error prompt.
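The redistribution loop described above — each boosting node computes its parameters, and if they have not converged within the allowed rounds the current values are redistributed as new initial parameters — can be sketched schematically as follows. The per-node update, the averaging step and the tolerance are placeholders for illustration, not the patent's actual core algorithm.

```python
def node_update(params, node_id):
    """Placeholder for the per-node core-algorithm step (unit 242).
    Here every node simply moves the parameters halfway toward a fixed point."""
    target = 1.0
    return [p + 0.5 * (target - p) for p in params]

def distributed_fit(n_nodes=4, max_rounds=50, tol=1e-6):
    params = [0.0, 0.0]  # initial parameters distributed by unit 241
    for round_no in range(max_rounds):
        # each node computes its own parameters from the current initial values
        node_results = [node_update(params, i) for i in range(n_nodes)]
        # combine the node results into the new candidate parameters
        new_params = [sum(r[k] for r in node_results) / n_nodes
                      for k in range(len(params))]
        if max(abs(a - b) for a, b in zip(new_params, params)) < tol:
            return new_params, round_no        # converged
        params = new_params                    # redistribute as new initial parameters
    raise RuntimeError("model parameters did not converge")  # error prompt

params, rounds = distributed_fit()
```

If `max_rounds` is exhausted the loop raises, mirroring the error prompt the system sends to the analyst.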
After the convergent model parameters have been computed, the convergent model parameters calculated by each node are returned to the parameter integration unit 231 under the model management module 23; the parameter integration unit 231 integrates them and builds the model, and at the same time generates a new model parameter file, which is used for predictive scoring of new data (step S35). The parameter integration method in step S35 is defined over the following quantities: Nodepred is the integrated prediction result over the participating nodes and samples, i is the node index, j is the sample index, T is the number of iterations, and pred_{t,i,j} is the predicted value obtained for the j-th sample on the i-th node in the t-th iteration by the trained classifier; each node also produces its predictive-variable parameters, and the estimate set of the kernel function is the set of function values of the support-vector and window-width estimates.
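The integration step can be pictured as collapsing the per-node, per-iteration predictions pred_{t,i,j} into a single score Nodepred per sample. The patent's exact integration formula is not reproduced in the source, so this sketch assumes plain averaging over iterations and nodes as a stand-in; all numbers are invented.

```python
def integrate_predictions(pred):
    """Integrate per-node, per-iteration predictions into one score per sample.

    pred[t][i][j] is the prediction of the classifier trained on node i at
    iteration t for sample j.  Plain averaging over t and i stands in for
    the parameter-integration formula of step S35.
    """
    T = len(pred)
    n_nodes = len(pred[0])
    n_samples = len(pred[0][0])
    return [sum(pred[t][i][j] for t in range(T) for i in range(n_nodes))
            / (T * n_nodes)
            for j in range(n_samples)]

# 2 iterations, 2 nodes, 3 samples (toy values)
pred = [
    [[0.9, 0.2, 0.6], [0.8, 0.4, 0.5]],
    [[1.0, 0.0, 0.7], [0.9, 0.2, 0.6]],
]
nodepred = integrate_predictions(pred)
```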
Finally, the integrated model, the model parameter file and related information are stored in the data storage module 26, and at the same time fed back to the Main Fun module 13 under the R encapsulation calling layer 1 as the prediction configuration file (step S6) (Fig. 3).
Returning to Fig. 3, when it is judged that a model Model(t-1) matching the application is stored in the Main Fun module 13 (step S2: yes), the R encapsulation calling layer makes the prediction scoring unit 252 under the enterprise application module 25 predictively score the matching degree between the model Model(t-1) and the application data (step S4). The prediction scoring unit 252 calls the model parameter file in the parameter integration unit of the model management module to predictively score the input application data. When the score is below a certain value, the model Model(t-1) is considered unsuitable for the application data input in step S1 (step S4: no), and Model(t-1) needs to be optimized and updated (step S5). The information that the model needs optimization is returned to the R encapsulation calling layer and displayed to the analyst.
Fig. 6 is the flow chart of the topic-corpus model optimization and update of step S5 according to an embodiment of the present invention. Through operations on the R encapsulation calling layer, the analyst has the updated data distribution stored by the storage unit 243 and the new corpus data converted in step S1 by the data preparation unit 251 distributed, as basic data, to the model processing unit 232; the model processing unit 232 triggers the error distribution update unit 233: the basic data is applied to the model Model(t-1) and error-correction computation is carried out, producing the error distribution of each new data sample and, at the same time, the error distribution of the previous period's training data, and the computed error distributions and related information are stored in the error distribution update unit 233 (step S51). The error distributions of the new corpus and of the previous period's training corpus under model(t-1), and the data-update formulas, obtained in this step are as follows. Let the new corpus samples be (X_{n+1}, …, X_{n+m}); their initial weights w_i are 1/m, i.e. w_t = (1/m, …, 1/m). Let the samples accumulated in the previous period be (X_1, …, X_n), with their own initial weights and error distribution. A correction factor for the new corpus samples is determined from the error distribution, and the updated training-sample weights are obtained; the previous period's corpus sample weights and the current period's corpus sample weights are both updated, where h(X) is the algorithm or algorithm combination selected by the model management module, c(X) is the actual class of sample X, the two error distributions are those obtained by model prediction for the new corpus samples and for the previous period's corpus samples respectively, and the two correction factors are the error-correction factors of the new corpus samples and of the previous period's corpus samples respectively. It can be seen from these formulas that samples predicted wrongly receive greater attention in subsequent training, which on the one hand automatically optimizes the model and on the other hand reduces the chance that correctly predicted samples enter the new model training, saving storage space for the system and improving model-training efficiency. At the same time, the previous period's accumulated samples and the new corpus samples with updated weights are integrated into a new training sample set, which is stored in the data storage module 26 and sent to the R encapsulation calling layer.
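The up-weighting of mispredicted samples described above can be sketched as follows. The patent's exact correction-factor formulas are not reproduced in the source, so the classic AdaBoost-style update is used here as a stand-in assumption: samples that the current model h gets wrong (h(X) ≠ c(X)) are up-weighted so that later training pays more attention to them.

```python
import math

def update_weights(weights, predictions, labels):
    """Increase the weight of mispredicted samples, AdaBoost-style."""
    # weighted error rate of the current model on these samples
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    err = min(max(err, 1e-10), 1 - 1e-10)     # keep the factor finite
    alpha = 0.5 * math.log((1 - err) / err)   # error-correction factor
    new_w = [w * math.exp(alpha if p != y else -alpha)
             for w, p, y in zip(weights, predictions, labels)]
    z = sum(new_w)                            # normalize back to a distribution
    return [w / z for w in new_w]

# four new-corpus samples with uniform initial weights 1/m; sample 2 is mispredicted
w = [0.25, 0.25, 0.25, 0.25]
w2 = update_weights(w, predictions=[1, 0, 1, 1], labels=[1, 0, 0, 1])
```

After the update the mispredicted sample dominates the distribution, which is the "greater attention in subsequent training" effect the description relies on.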
Then the analyst configures the initial parameters through the parameter configuration module 11 of the R encapsulation calling layer 1 and sends the relevant information to the Hadoop big-data processing layer 2 through the HTTP protocol unit 121: the Job parsing unit 223 of the task scheduling module 22 of the Hadoop big-data processing layer 2 is triggered and the required data information is sent to the Job parsing unit 223; the Job parsing unit 223 parses the data information to generate a task list and distributes the task content to the Job execution unit 222 (step S52). The Job execution unit 222 triggers the model processing unit 232 of the model management module 23 according to the task. The model processing unit 232 triggers the data distribution unit 221 and the configuration unit 224 according to the task content; the data distribution unit 221 sends the task parameters retrieved from the configuration unit 224 according to the task content to the model processing unit, and the boosting data distribution begins (step S53). Next, the model processing unit 232 retrieves the model parameter information in the configuration unit 234 according to the task information of the Job execution unit 222 and triggers the core algorithm unit 242 of the algorithm processing module 24 of each node; the core algorithm units 242 compute the convergent model parameters (the parameter estimates, the window width, the support vectors and so on) (step S54; the model parameter computation in this step is the same as in step S34). Then the convergent model parameters computed by each node are returned to the parameter integration unit 231 under the model management module 23 for integration, a new model parameter file is generated, and a new model Model(t) is built (step S55; the integration method in this step is the same as in step S35). The parameter-integration script is then run to compute the accuracy of the model Model(t) and judge whether Model(t) is better than Model(t-1) (step S56). If the accuracy of Model(t) is below the system threshold, Model(t-1) is better than Model(t) (step S56: no); Model(t) is then overwritten and saved as Model(t-1) (step S57), and step S51 and the subsequent operations are carried out again until the new model Model(t) is better than Model(t-1). If the accuracy of Model(t) is above the system threshold, Model(t) is judged better than Model(t-1) (step S56: yes), and Model(t), the model parameter file and so on are returned to the Main Fun module under the R encapsulation calling layer (step S6) as the prediction configuration file.
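The accept-or-retry decision of steps S56–S57 amounts to comparing the new model's accuracy against a threshold and falling back to the previous model when it fails. A minimal sketch under that assumption (the threshold value, the models and the data are all invented for illustration):

```python
def accuracy(model, samples, labels):
    """Fraction of samples the model labels correctly."""
    return sum(1 for x, y in zip(samples, labels) if model(x) == y) / len(labels)

def select_model(model_t, model_prev, samples, labels, threshold=0.8):
    """Adopt Model(t) only if its accuracy beats the system threshold (step S56);
    otherwise keep Model(t-1) (step S57).  The threshold value is illustrative."""
    if accuracy(model_t, samples, labels) >= threshold:
        return model_t       # step S56: yes — adopt the new model
    return model_prev        # step S56: no — fall back to the previous model

# toy 1-D threshold classifiers standing in for Model(t-1) and Model(t)
model_prev = lambda x: 1 if x > 0.5 else 0
model_t = lambda x: 1 if x > 0.0 else 0
samples = [-1.0, -0.2, 0.3, 0.8, 1.5]
labels = [0, 0, 1, 1, 1]
chosen = select_model(model_t, model_prev, samples, labels)
```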
Returning to Fig. 3, when the prediction score given by the prediction scoring unit 252 to the enterprise application exceeds a certain value, the enterprise application input in step S1 is considered applicable to the stored model Model(t-1) (step S4: yes), so the model Model(t-1) is directly recorded as Model(t) and stored in the data storage module 26, and at the same time fed back to the Main Fun module 13 under the R encapsulation calling layer 1 as the prediction configuration file (step S6). Finally, the topic annotation unit 254 of the enterprise application module 25 uses the model parameter file and other information corresponding to Model(t) to perform LDA text clustering on the corpus, forming finer-grained topic annotations and probability dictionaries for each topic category; the final topic annotations, probability dictionaries and so on are stored in the data storage module 26 for later application deployment (step S7); at the same time the data storage module 26 feeds the storage addresses of the new topic annotations, probability dictionaries and other information back to the R encapsulation calling layer 1.
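The probability dictionaries mentioned above can be pictured as per-topic word distributions. The sketch below derives such dictionaries from topic–word counts; the counts themselves would come from the LDA clustering step, which is not re-implemented here, and all numbers are illustrative.

```python
def probability_dictionaries(topic_word_counts):
    """Normalize per-topic word counts into probability dictionaries."""
    dicts = {}
    for topic, counts in topic_word_counts.items():
        total = sum(counts.values())
        dicts[topic] = {w: c / total for w, c in counts.items()}
    return dicts

# word counts per topic, e.g. as produced by LDA clustering (toy numbers)
counts = {
    "sports": {"goal": 6, "match": 3, "team": 1},
    "shopping": {"price": 4, "buy": 4, "cheap": 2},
}
prob_dicts = probability_dictionaries(counts)
```

Each dictionary sums to one, so downstream applications can read `prob_dicts[topic][word]` directly as the probability of a word under a topic.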
In the above embodiment, if the model parameters computed by the core algorithm unit 242 never converge within the specified number of iterations, the system issues a warning, but the invention is not limited to this. If the model parameters computed by the core algorithm unit 242 never converge within the specified number of iterations, the result of the last iteration may instead be saved and sent as the final computed result, although such a result may increase the error of subsequent predictions.
In the above embodiment, the Hadoop big-data processing layer performs the distributed computation, but the invention is not limited to this. A distributed processing system of another structure may also be used in place of the Hadoop big-data processing layer to carry out the above distributed computation.
Claims (16)
1. A distributed system for nonparametric topic automatic annotation, comprising: an R encapsulation calling layer and a distributed data processing layer;
the R encapsulation calling layer comprises a parameter configuration module, a first communication parsing module and a main function module;
the distributed data processing layer comprises a second communication parsing module, a task scheduling module, a model management module, an algorithm processing module and an enterprise application module; wherein,
the parameter configuration module is used to receive configuration information;
the main function module is used to perform algorithm scheduling, feedback-information processing and the integrated encapsulation of the two, completing customization, and accordingly generates the executable configuration of the distributed data processing layer and sends the task information to be executed to the distributed data processing layer;
the first communication parsing module is used to communicate with the second communication parsing module, to establish communication between the R encapsulation calling layer and the distributed data processing layer and to parse the communication content;
the task scheduling module is used to receive the executable configuration and the task information to be executed sent by the main function module, and correspondingly to control and coordinate the work of the model management module, the algorithm processing module and the enterprise application module;
the model management module is used to build the model, to instruct the algorithm processing module to compute the model parameters, and to integrate the model parameters returned by the algorithm processing module, generating the model parameter file;
the algorithm processing module computes the model parameters according to the instruction of the model management module and returns the result;
the enterprise application module is used to preprocess the received corpus and to generate topic annotations according to the model parameter file generated by the model management module.
2. The system according to claim 1, characterized in that:
the enterprise application module, in preprocessing, builds a label–word TF-IDF matrix after segmenting the corpus;
wherein label is the topic annotation, denoted Y; TF-IDF is the frequency transform with which each word (the predictor variables) occurs in each corpus item, denoted X.
3. The system according to claim 2, characterized in that:
the algorithm processing module solves at least one of the parameter estimates, the optimal window width h and the support vectors X_s according to the following method:
a) building the class predictive equation and likelihood function P(Y=1|X) = g(f(X)), and generalizing it with nonparametric methods to obtain the more general class maximum likelihood function L(f) = Σ_i { Y_i·log g(f(X_i)) + (1 − Y_i)·log[1 − g(f(X_i))] }, wherein l is the order of the polynomial approximation of the likelihood function, the link function g is the sigmoid, X_i is the frequency transform of the occurrence of each word in the i-th sample, and Y_i is the topic annotation of the i-th sample;
b) applying a kernel feature mapping to the original features to obtain the nonparametric class maximum likelihood function with f expressed in the kernel space, f(x) = Σ_s α_s·K_h(x, X_s), and maximizing that likelihood function by gradient descent until convergence, wherein hat(·) denotes the estimate of a parameter and h denotes the window width;
c) solving the window-width equation, wherein h_opt is the optimal value of h solved from the equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardized kernel function, and the variance function is normalized to between 0 and 1.
4. The system according to any one of claims 1 to 3, characterized in that:
the enterprise application module comprises a prediction scoring unit for scoring the result of processing a new corpus with the existing model;
if the score is acceptable, the system outputs the result of processing the new corpus with the existing model; if the score is unacceptable, the system updates the existing model and processes the new corpus with the updated model.
5. The system according to claim 4, characterized in that:
the model management module comprises an error distribution update unit;
the error distribution update unit assigns each sample an error-update weight, adjusts the sample weights when updating the existing model, and selects the samples whose weights meet a certain condition as the training samples needed to update the model.
6. The system according to claim 4, characterized in that:
the algorithm processing module comprises a parameter distribution unit and a plurality of core algorithm units; wherein,
the core algorithm units are used to carry out algorithm iterations and produce the model parameters;
the parameter distribution unit is used to distribute initial parameters to each core algorithm unit, and the result computed by each core algorithm unit after each iteration is sent to the other core algorithm units.
7. The system according to claim 6, characterized in that:
if the model parameters do not converge after a predetermined number of iterations, the parameter distribution unit redistributes the current parameters to each core algorithm unit as new initial parameters and automatically carries out the next round of iteration; if the model converges, the parameter distribution unit returns the model parameters to the model management module for parameter integration.
8. The system according to claim 7, characterized in that:
the model management module carries out the parameter integration according to a formula in which Nodepred is the integrated prediction result over the participating nodes and samples, i is the node index, j is the sample index, T is the number of iterations, pred_{t,i,j} is the predicted value obtained for the j-th sample on the i-th node in the t-th iteration by the trained classifier, each node produces its predictive-variable parameters, and the estimate set of the kernel function is the set of function values of the support-vector and window-width estimates.
9. A method of nonparametric topic automatic annotation, comprising:
a configuration step of configuring the initialization parameters, the service selection and the algorithm selection;
a sending step of sending the configured information;
a parsing step of parsing the configured information into the configuration parameters executable by the distributed data processing layer and the task information to be executed;
a task scheduling step of distributing the executable configuration parameters and the task information to be executed to a model management module, an algorithm processing module and an enterprise application module;
a model management step of building the nonparametric model;
an algorithm processing step of computing the parameters required by the model built in the model management step;
an enterprise application step of preprocessing the received corpus and outputting the result of processing the corpus with the model.
10. The method according to claim 9, characterized in that:
it further comprises a preprocessing step of building a label–word TF-IDF matrix after segmenting the corpus;
wherein label is the topic annotation, denoted Y; TF-IDF is the frequency transform with which each word (the predictor variables) occurs in each corpus item, denoted X.
11. The method according to claim 10, characterized in that:
in the model management step at least one of the parameter estimates, the optimal window width h and the support vectors X_s is solved according to the following method:
a) building the class predictive equation and likelihood function P(Y=1|X) = g(f(X)), and generalizing it with nonparametric methods to obtain the more general class maximum likelihood function L(f) = Σ_i { Y_i·log g(f(X_i)) + (1 − Y_i)·log[1 − g(f(X_i))] }, wherein l is the order of the polynomial approximation of the likelihood function, the link function g is the sigmoid, X_i is the frequency transform of the occurrence of each word in the i-th sample, and Y_i is the topic annotation of the i-th sample;
b) applying a kernel feature mapping to the original features to obtain the nonparametric class maximum likelihood function with f expressed in the kernel space, f(x) = Σ_s α_s·K_h(x, X_s), and maximizing that likelihood function by gradient descent until convergence, wherein hat(·) denotes the estimate of a parameter and h denotes the window width;
c) solving the window-width equation, wherein h_opt is the optimal value of h solved from the equation, K is the kernel function, K_{0,1}(z) is the 0-1 standardized kernel function, and the variance function is normalized to between 0 and 1.
12. The method according to any one of claims 9 to 11, characterized in that:
it further comprises a prediction scoring step of scoring, with a prediction scoring unit, the result of processing a new corpus with the existing model;
if the score is acceptable, the result of processing the new corpus with the existing model is output; if the score is unacceptable, the existing model is updated and the new corpus is processed with the updated model.
13. The method according to claim 12, characterized in that:
in the model management step each sample is assigned an error-update weight, the sample weights are adjusted when updating the existing model, and the samples whose weights meet a certain condition are selected as the training samples needed to update the model.
14. The method according to claim 12, characterized in that:
in the algorithm processing step, a plurality of core algorithm units for carrying out algorithm iterations and producing model parameters are provided;
a parameter distribution unit distributes initial parameters to each core algorithm unit, and the result computed by each core algorithm unit after each iteration is sent to the other core algorithm units.
15. The method according to claim 14, characterized in that:
if the model parameters do not converge after a predetermined number of iterations, the current parameters are used as new initial parameters and the next round of iteration is carried out automatically; if the model converges, the model parameters are subjected to parameter integration.
16. The method according to claim 15, characterized in that:
the parameter integration is carried out according to a formula in which Nodepred is the integrated prediction result over the participating nodes and samples, i is the node index, j is the sample index, T is the number of iterations, pred_{t,i,j} is the predicted value obtained for the j-th sample on the i-th node in the t-th iteration by the trained classifier, each node produces its predictive-variable parameters, and the estimate set of the kernel function is the set of function values of the support-vector and window-width estimates.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510186154.9A CN104778254B (en) | 2015-04-20 | 2015-04-20 | A kind of distributed system and mask method of non-parametric topic automatic marking |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104778254A CN104778254A (en) | 2015-07-15 |
CN104778254B true CN104778254B (en) | 2018-03-27 |
Family
ID=53619718
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510186154.9A Active CN104778254B (en) | 2015-04-20 | 2015-04-20 | A kind of distributed system and mask method of non-parametric topic automatic marking |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104778254B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106022483B (en) * | 2016-05-11 | 2019-06-14 | 星环信息科技(上海)有限公司 | The method and apparatus converted between machine learning model |
CN109726392B (en) * | 2018-12-13 | 2023-10-10 | 井冈山大学 | Intelligent language cognition information processing system and method based on big data |
CN110032714B (en) * | 2019-02-25 | 2023-04-28 | 创新先进技术有限公司 | Corpus labeling feedback method and device |
CN116360752B (en) * | 2023-06-02 | 2023-08-22 | 钱塘科技创新中心 | Function programming method oriented to java, intelligent terminal and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591940A (en) * | 2011-12-27 | 2012-07-18 | 厦门市美亚柏科信息股份有限公司 | Map/Reduce-based quick support vector data description method and Map/Reduce-based quick support vector data description system |
CN103559205A (en) * | 2013-10-09 | 2014-02-05 | 山东省计算中心 | Parallel feature selection method based on MapReduce |
CN103886203A (en) * | 2014-03-24 | 2014-06-25 | 美商天睿信息系统(北京)有限公司 | Automatic modeling system and method based on index prediction |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7512570B2 (en) * | 2006-05-30 | 2009-03-31 | Zaracom Technologies Inc. | Artificial intelligence analyzer and generator |
TWI385492B (en) * | 2008-12-16 | 2013-02-11 | Ind Tech Res Inst | A system for maintaining and analyzing manufacturing equipment and method therefor |
-
2015
- 2015-04-20 CN CN201510186154.9A patent/CN104778254B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591940A (en) * | 2011-12-27 | 2012-07-18 | 厦门市美亚柏科信息股份有限公司 | Map/Reduce-based quick support vector data description method and Map/Reduce-based quick support vector data description system |
CN103559205A (en) * | 2013-10-09 | 2014-02-05 | 山东省计算中心 | Parallel feature selection method based on MapReduce |
CN103886203A (en) * | 2014-03-24 | 2014-06-25 | 美商天睿信息系统(北京)有限公司 | Automatic modeling system and method based on index prediction |
Non-Patent Citations (2)
Title |
---|
Liquidity information and asset returns: an analysis based on nonparametric models; Li Pandeng et al.; Statistical Education; March 2010 (No. 3); 48-54 *
Research on domain-adaptive learning algorithms and their applications; Xu Min; China Doctoral Dissertations Full-text Database, Information Science and Technology Series; 2014-12-15 (No. 12); I140-25 *
Also Published As
Publication number | Publication date |
---|---|
CN104778254A (en) | 2015-07-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |