CN101449264B - Method and system for transductive data classification, and data classification method using machine learning techniques - Google Patents


Info

Publication number
CN101449264B
CN101449264B (application CN200780001197.9A)
Authority
CN
China
Prior art keywords
document
data
classification
label
unlabeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200780001197.9A
Other languages
Chinese (zh)
Other versions
CN101449264A (en)
Inventor
Mauritius A. R. Schmidtler
Christopher K. Harris
Roland Borrey
Anthony Sarah
Nicola Caruso
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kofax Inc
Original Assignee
Kofax Image Products Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kofax Image Products Inc filed Critical Kofax Image Products Inc
Priority to CN201610972541.XA priority Critical patent/CN107180264A/en
Publication of CN101449264A publication Critical patent/CN101449264A/en
Application granted Critical
Publication of CN101449264B publication Critical patent/CN101449264B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a system, method, data processing apparatus, and article of manufacture for classifying data. A data classification method using machine learning techniques is also disclosed.

Description

Method and system for transductive data classification, and data classification method using machine learning techniques
Technical field
The present invention relates generally to methods and apparatus for data classification. More specifically, the invention provides improved transductive machine learning methods. The invention further relates to new applications of machine learning methods.
Background art
With the information age and the recent enormous explosion of electronic data across all industries (including, in particular, scanned documents, web content, search engine data, text data, images, audio data files, and so on), how data is processed has become very important.
One field that is just beginning to be explored is the classification of data without human effort. In many classification techniques, a machine or computer must learn from manually entered and established rules and/or manually prepared training examples. In machine learning with training examples, the number of training examples is generally small compared to the number of parameters to be estimated; that is, many solutions satisfy the constraints given by the training examples. One challenge of machine learning is to find a solution that generalizes well despite this shortage of constraints. It is therefore desirable to overcome these and/or other problems of the prior art.
There is a further need for practical applications of the various types of machine learning methods.
Summary of the invention
In a computer-based system according to one embodiment of the present invention, a method for classifying data includes: receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for a data point being included in a designated category or a training example for a data point being excluded from a designated category; receiving unlabeled data points; receiving at least one predefined cost factor for the labeled data points and the unlabeled data points; iteratively training a transductive classifier with maximum entropy discrimination (MED), using the at least one cost factor and the labeled and unlabeled data points as training examples, wherein for each iteration the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of a data point label is adjusted according to an estimate of the class membership probabilities of the data points; classifying at least one of the unlabeled data points, the labeled data points, and input data points using the trained classifier; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for classifying data includes providing executable program code to a computer system for use and execution on the computer system, the program code comprising instructions for: accessing labeled data points stored in computer memory, each of the labeled data points having at least one label indicating whether the data point is a training example for a data point being included in a designated category or a training example for a data point being excluded from a designated category; accessing unlabeled data points from the computer memory; accessing at least one predefined cost factor for the labeled data points and the unlabeled data points from the computer memory; iteratively training a maximum entropy discrimination (MED) transductive classifier using the at least one cost factor and the stored labeled and unlabeled data points as training examples, wherein for each iteration the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of the data point labels is adjusted according to an estimate of the class membership probabilities of the data points; classifying at least one of the unlabeled data points, the labeled data points, and input data points using the trained classifier; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a data processing apparatus includes: at least one memory for storing (i) labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for a data point being included in a designated category or a training example for a data point being excluded from a designated category; (ii) unlabeled data points; and (iii) at least one predefined cost factor for the labeled data points and the unlabeled data points; and a transductive classifier trainer for iteratively training a transductive classifier using transductive maximum entropy discrimination (MED), with the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein for each MED iteration the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of the data point labels is adjusted according to an estimate of the class membership probabilities of the data points;
wherein the classifier trained by the transductive classifier trainer classifies at least one of the unlabeled data points, the labeled data points, and input data points;
wherein the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
According to another embodiment of the invention, an article of manufacture includes a program storage medium readable by a computer, the medium tangibly embodying one or more programs of instructions executable by the computer to perform a method for classifying data, including: receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for a data point being included in a designated category or a training example for a data point being excluded from a designated category; receiving unlabeled data points; receiving at least one predefined cost factor for the labeled data points and the unlabeled data points; training a transductive classifier by iterative maximum entropy discrimination (MED) calculation, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein in each MED iteration the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of a data point label is adjusted according to an estimate of the class membership probabilities of the data points; classifying at least one of the unlabeled data points, the labeled data points, and input data points using the trained classifier; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
In a computer-based system according to another embodiment of the invention, a method for classifying unlabeled data points includes: receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for a data point being included in a designated category or a training example for a data point being excluded from a designated category; receiving labeled and unlabeled data points; receiving prior label probability information for the labeled and unlabeled data points; receiving at least one predefined cost factor for the labeled and unlabeled data points; determining an expected label for each labeled and unlabeled data point according to the label prior probabilities of the data points; and repeating the following sub-steps until the data values have sufficiently converged (a sketch of the score-calibration sub-steps appears after this list):
● generating, for each unlabeled data point, an adjusted cost value proportional to the absolute value of the expected label of that data point;
● training a classifier, using the labeled and unlabeled data points as training examples, by determining a decision function that, given the training samples to be included and excluded and weighted according to their expected labels, minimizes the KL divergence to the prior probability distribution of the decision function parameters;
● determining the classification scores of the labeled and unlabeled data points using the trained classifier;
● calibrating the output of the trained classifier to class membership probabilities;
● updating the label prior probabilities of the unlabeled data points according to the determined class membership probabilities;
● determining the label and margin probability distributions using maximum entropy discrimination (MED), with the updated label prior probabilities and the previously determined classification scores;
● computing new expected labels using the previously determined label probability distribution; and
● updating the expected label of each data point by interpolating the expected label of the previous iteration into the new expected label.
A classification of an input data point, or a derivative thereof, is output to at least one of a user, another system, and another process.
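By way of illustration only, a minimal Python sketch of the calibration and prior-update sub-steps above follows. The disclosure does not fix a particular calibration; a Platt-style sigmoid fit is assumed here, and all scores and labels are toy values:

```python
import numpy as np

def fit_sigmoid_calibration(scores, labels, lr=0.5, iters=5000):
    """Fit P(y=+1 | s) = 1 / (1 + exp(a*s + b)) on the labeled points by
    gradient descent on the log loss. This Platt-style fit is an assumed
    stand-in; the method only requires some mapping of classification
    scores to class membership probabilities."""
    a, b = -1.0, 0.0
    t = (labels + 1) / 2.0                 # map labels {-1,+1} to {0,1}
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(a * scores + b))
        g = t - p                          # dLogLoss/d(a*s + b)
        a -= lr * np.mean(g * scores)
        b -= lr * np.mean(g)
    return a, b

# Toy classification scores of the labeled points from the trained classifier.
s_labeled = np.array([-2.1, -1.3, -0.4, 0.5, 1.2, 2.3])
y_labeled = np.array([-1, -1, -1, 1, 1, 1])
a, b = fit_sigmoid_calibration(s_labeled, y_labeled)

# Sub-step: the calibrated class membership probabilities of the unlabeled
# points become their updated label prior probabilities.
s_unlabeled = np.array([-1.5, 0.1, 1.8])
p0 = 1.0 / (1.0 + np.exp(a * s_unlabeled + b))
print("updated label priors p0(y=+1):", np.round(p0, 3))
```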
According to another embodiment of the invention, a method for classifying documents includes: receiving at least one labeled seed document having a known confidence level of its label assignment; receiving unlabeled documents; receiving at least one predefined cost factor; iteratively training a transductive classifier using the at least one predefined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration the cost factor is adjusted as a function of an expected label value; storing confidence scores for the unlabeled documents after at least some of the iterations; and outputting identifiers of the unlabeled documents having the highest confidence scores to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for analyzing documents associated with legal discovery includes: receiving documents associated with a legal matter; performing a document classification method on the documents; and outputting identifiers of at least some of the documents based on their classification.
According to another embodiment of the invention, a method for cleaning up data includes: receiving a plurality of labeled data items; selecting a subset of the data items for each of a plurality of categories; setting the deviation of the data items in each subset to about zero; setting the deviation of the data items not in the subsets to a predefined value other than about zero; training a transductive classifier by iterative computation, using the deviations and the data items in and not in the subsets as training examples; applying the trained classifier to each of the labeled data items to classify each of the data items; and outputting the classification of the input data items, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for verifying the association of invoices with an entity includes: training a classifier based on an invoice format associated with a first entity; accessing a plurality of invoices labeled as being associated with at least one of the first entity and other entities; performing a document classification method on the invoices using the classifier; and outputting an identifier of at least one invoice having a high probability of not being associated with the first entity.
According to another embodiment of the invention, a method for managing medical records includes: training a classifier based on a medical diagnosis; accessing a plurality of medical records; performing a document classification method on the medical records using the classifier; and outputting an identifier of at least one medical record having a low probability of being associated with the medical diagnosis.
According to another embodiment of the invention, a method for face recognition includes: receiving at least one labeled seed image of a face, the seed image having a known confidence level; receiving unlabeled images; receiving at least one predefined cost factor; iteratively training a transductive classifier using the at least one predefined cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration the cost factor is adjusted as a function of an expected label value; storing confidence scores for the unlabeled images after at least some of the iterations; and outputting identifiers of the unlabeled images having the highest confidence scores to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for analyzing prior art documents includes: training a classifier based on a search query; accessing a plurality of prior art documents; performing a document classification method on at least some of the prior art documents using the classifier; and outputting identifiers of at least some of the prior art documents based on their classification.
According to another embodiment of the invention, a method for adapting patent classifications to drifting document content includes: receiving at least one labeled seed document; receiving unlabeled documents; training a transductive classifier using the at least one seed document and the unlabeled documents; classifying, using the classifier, the unlabeled documents having a confidence level above a predetermined threshold into a plurality of existing categories; classifying, using the classifier, the other unlabeled documents having a confidence level below the predetermined threshold into at least one new category; reclassifying, using the classifier, at least some of the classified documents into the existing categories and the at least one new category; and outputting identifiers of the classified documents to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for matching documents with claims includes: training a classifier based on at least one claim of a patent or patent application; accessing a plurality of documents; performing a document classification method on at least some of the documents using the classifier; and outputting identifiers of at least some of the documents based on their classification.
According to another embodiment of the invention, a method for classifying a patent or patent application includes: training a classifier based on a plurality of documents known to belong to a particular patent classification; receiving at least a portion of a patent or patent application; performing a document classification method on the at least a portion of the patent or patent application using the classifier; and outputting the classification of the patent or patent application, wherein the document classification method is a yes/no classification method.
According to another embodiment of the invention, a method for adapting to drifting document content includes: receiving at least one labeled seed document; receiving unlabeled documents; receiving at least one predefined cost factor; training a transductive classifier using the at least one predefined cost factor, the at least one seed document, and the unlabeled documents; classifying, using the classifier, the unlabeled documents having a confidence level above a predetermined threshold into a plurality of categories; reclassifying, using the classifier, at least some of the classified documents into the plurality of categories; and outputting identifiers of the classified documents to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for separating documents includes: receiving labeled data; receiving a group of unlabeled documents; adapting probabilistic classification rules using transduction, based on the labeled data and the unlabeled documents; updating weights for document separation according to the probabilistic classification rules; determining the locations of separations in the group of documents; outputting indicators of the determined separation locations to at least one of a user, another system, and another process; and flagging the documents with codes, the codes being associated with the indicators.
According to another embodiment of the invention, a method for searching documents includes: receiving a search query; retrieving documents based on the search query; outputting the documents; receiving user-entered labels for at least some of the documents, the labels indicating the relevance of the documents to the search query; training a classifier based on the search query and the user-entered labels; performing a document classification method on the documents using the classifier to reclassify the documents; and outputting identifiers of at least some of the documents based on their classification.
Brief description of the drawings
Fig. 1 is a graph of the expected label as a function of the classification score, as obtained by MED discrimination learning using label induction.
Fig. 2 is a diagram of the iterative computation of a set of decision functions obtained by transductive MED learning.
Fig. 3 is a diagram of the iterative computation of a set of decision functions obtained by improved transductive MED learning, according to one embodiment of the invention.
Fig. 4 is a process control flow chart for classifying unlabeled data using an adjusted cost factor, according to one embodiment of the invention.
Fig. 5 is a process control flow chart for classifying unlabeled data using user-defined prior probability information, according to one embodiment of the invention.
Fig. 6 is a detailed control flow chart for classifying unlabeled data using maximum entropy discrimination with adjusted cost factors and prior probability information, according to one embodiment of the invention.
Fig. 7 shows a network in which the network architecture of the various embodiments described herein may be implemented.
Fig. 8 is a system block diagram of a representative hardware environment associated with a user device.
Fig. 9 is a block diagram of an apparatus according to one embodiment of the present invention.
Fig. 10 is a flow chart of a classification process performed according to one embodiment.
Fig. 11 is a flow chart of a classification process performed according to one embodiment.
Fig. 12 is a flow chart of a classification process performed according to one embodiment.
Fig. 13 is a flow chart of a classification process performed according to one embodiment.
Fig. 14 is a flow chart of a classification process performed according to one embodiment.
Fig. 15 is a flow chart of a classification process performed according to one embodiment.
Fig. 16 is a flow chart of a classification process performed according to one embodiment.
Fig. 17 is a flow chart of a classification process performed according to one embodiment.
Fig. 18 is a flow chart of a classification process performed according to one embodiment.
Fig. 19 is a flow chart of a classification process performed according to one embodiment.
Fig. 20 is a flow chart of a classification process performed according to one embodiment.
Fig. 21 is a flow chart of a classification process performed according to one embodiment.
Fig. 22 is a control flow chart of a method for a first document classification system, according to one embodiment of the invention.
Fig. 23 is a control flow chart of a method for a second document classification system, according to one embodiment of the invention.
Fig. 24 is a flow chart of a classification process performed according to one embodiment.
Fig. 25 is a flow chart of a classification process performed according to one embodiment.
Fig. 26 is a flow chart of a classification process performed according to one embodiment.
Fig. 27 is a flow chart of a classification process performed according to one embodiment.
Fig. 28 is a flow chart of a classification process performed according to one embodiment.
Fig. 29 is a flow chart of a classification process performed according to one embodiment.
Detailed description of the invention
The following description is of the best mode presently contemplated for carrying out the present invention. This description is made for the purpose of illustrating the general principles of the present invention and is not meant to limit the inventive concepts described herein. Moreover, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations.
Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation, including meanings implied from the specification, meanings understood by those skilled in the art, and meanings as defined in dictionaries, treatises, etc.
Text classification
The benefits of and demand for classifying text data are enormous, and a variety of classification techniques are already in use. The following discussion concerns classification techniques for text data.
To increase their utility and intelligence, machines such as computers are called upon to classify (or recognize) objects of an ever-expanding scope. For example, computers may use optical character recognition to classify scanned handwritten or printed characters, pattern recognition to classify images such as faces, fingerprints, fighter planes, etc., or speech recognition to classify sounds, speech, and so on.
Machines are also called upon to classify textual information objects, such as text computer files or documents. The applications of text classification are diverse and important. For example, text classification may be used to organize textual information objects into a hierarchy of predetermined categories or classes. In this way, finding (or locating) textual information objects related to a particular subject is simplified. Text classification may be used to route appropriate textual information objects to appropriate people or locations. In this way, an information service can route textual information objects covering diverse subjects (e.g., business, sports, the stock market, football, a particular company, a particular football team) to people having diverse interests. Text classification may also be used to filter textual information objects so that a person is not bothered by unwanted textual content (such as unwanted and unsolicited e-mail, also referred to as spam or "junk"). As can be appreciated from these examples, text classification has many exciting and important applications.
Rule-based classification
In some instances, it is necessary to classify document content with absolute certainty, based on certain accepted logic. A rule-based system may be used to effect such types of classification. Basically, a rule-based system uses production rules of the form:
IF condition, THEN fact.
The condition may include whether the textual information includes certain words or phrases, has a certain syntax, or has certain attributes. For example, if the text content has the word "close", the phrase "Nasdaq", and a number, then it is classified as "stock market" text.
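By way of illustration, a minimal Python sketch of such a production rule, using the "stock market" example above (the regular expressions are illustrative assumptions, not part of the disclosure):

```python
import re

def classify_stock_market(text: str) -> bool:
    """Production rule: IF the text contains the word 'close', the phrase
    'Nasdaq', and a number, THEN classify it as 'stock market' text."""
    has_close = re.search(r"\bclose\b", text, re.IGNORECASE) is not None
    has_nasdaq = re.search(r"\bnasdaq\b", text, re.IGNORECASE) is not None
    has_number = re.search(r"\d", text) is not None
    return has_close and has_nasdaq and has_number

print(classify_stock_market("The Nasdaq rose 1.2% at the close."))     # True
print(classify_stock_market("The game ended after a close contest."))  # False
```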
Over the past decade or so, other types of classifiers have been used increasingly. Although such classifiers do not use static, predefined logic as rule-based classifiers do, they have outperformed rule-based classifiers in many applications. Such classifiers typically include a learning element and a performance element, and include neural networks, Bayesian networks, and support vector machines. Although each of these classifiers is well known, they are briefly introduced below for the reader's convenience.
Classifiers having learning and performance elements
As mentioned at the end of the previous section, classifiers having learning and performance elements outperform rule-based classifiers in many applications. To reiterate, such classifiers may include neural networks, Bayesian networks, and support vector machines.
Neural networks
A neural network is basically a multi-layered, hierarchical arrangement of identical processing elements (also referred to as neurons). Each neuron may have one or more inputs, but only one output. The inputs of each neuron are weighted by a coefficient. The output of a neuron is typically a function of the sum of its weighted inputs and a bias value. This function, also referred to as an activation function, is commonly a sigmoid function. That is, the activation function may be S-shaped and monotonically increasing, and asymptotically approach fixed values (e.g., +1, 0, -1) as its input(s) respectively approach positive or negative infinity. The sigmoid function and the individual neuron weights and bias values determine the response or "excitability" of the neuron to input signals.
In the hierarchical arrangement of neurons, the output of a neuron in one layer may be distributed as an input to one or more neurons in a next layer. A typical neural network may include an input layer and two (2) distinct layers; namely, an input layer, an intermediate neuron layer, and an output neuron layer. Note that the nodes of the input layer are not neurons. Rather, the nodes of the input layer have only one input and basically provide the input, unprocessed, to the next layer. If, for example, the neural network were to be used for recognizing a numeric digit character in a 20 by 15 pixel array, the input layer could have 300 neurons (i.e., one for each pixel of the input), and the output array could have 10 neurons (i.e., one for each of the ten digits).
The use of neural networks generally involves two (2) successive steps. First, the neural network is initialized and trained on known inputs having known output values (or classifications). Once the neural network is trained, it can be used to classify unknown inputs. The neural network may be initialized by setting the weights and biases of the neurons to random values, typically generated from a Gaussian distribution. The neural network is then trained using a succession of inputs having known outputs (or classes). As the training inputs are fed to the neural network, the values of the neural weights and biases are adjusted (e.g., in accordance with the known back-propagation technique) such that the output of the neural network for each individual training pattern approaches or matches the known output. Basically, a gradient descent in weight space is used to minimize the output error. In this way, learning using successive training inputs converges towards a locally optimal solution for the weights and biases. That is, the weights and biases are adjusted to minimize an error.
In practice, the system is usually not trained to the point where it converges to the optimal solution. Rather, the system would become "over-trained", such that it would be too specialized to the training data and might not be good at classifying inputs that differ, in some way, from those in the training set. Thus, at various times during its training, the system is tested on a set of validation data. Training is halted when the system's performance on the validation set no longer improves.
Once training is complete, the neural network can be used to classify unknown inputs in accordance with the weights and biases determined during training. If the neural network can classify the unknown input with confidence, one of the outputs of the neurons in the output layer will be much higher than the others.
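For illustration, a minimal Python sketch of this train-then-classify procedure (Gaussian initialization, sigmoid activations, gradient descent on the output error); the network size, learning rate, and toy data are assumptions, not values from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 2-4-1 network: weights and biases initialized from a Gaussian.
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

# Toy training set with known outputs (XOR-like classes).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])

lr = 1.0
for epoch in range(5000):
    for x, y in zip(X, Y):
        # Forward pass through the intermediate and output layers.
        h = sigmoid(W1 @ x + b1)
        o = sigmoid(W2 @ h + b2)[0]
        # Backward pass: gradient descent on the squared output error.
        d_o = (o - y) * o * (1 - o)
        d_h = (W2[0] * d_o) * h * (1 - h)
        W2 -= lr * d_o * h[None, :]; b2 -= lr * d_o
        W1 -= lr * np.outer(d_h, x);  b1 -= lr * d_h

for x in X:
    print(x, float(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)))
```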
Bayesian network
Generally, Bayesian networks use hypotheses as intermediaries between data (e.g., input feature vectors) and predictions (e.g., classifications). The probability of each hypothesis, given the data ("P(hypothesis|data)"), may be estimated. A prediction is made from the hypotheses using the posterior probabilities of the hypotheses to weight the individual prediction of each hypothesis. Given data D, the probability of a prediction X may be expressed as:
$$P(X|D) = \sum_i P(X|D,H_i)\,P(H_i|D) = \sum_i P(X|H_i)\,P(H_i|D)$$
where $H_i$ is the i-th hypothesis. The hypothesis $H_i$ that maximizes the probability of $H_i$ given D ($P(H_i|D)$) is referred to as the maximum a posteriori hypothesis (or "$H_{MAP}$"), and the prediction may be expressed as:
$$P(X|D) \approx P(X|H_{MAP})$$
Using Bayes' rule, the probability of a hypothesis $H_i$ given data D may be expressed as:
$$P(H_i|D) = \frac{P(D|H_i)\,P(H_i)}{P(D)}$$
The probability of the data D remains constant. Therefore, to find $H_{MAP}$, the numerator must be maximized.
The first term of the numerator represents the probability that the data would have been observed given hypothesis i. The second term represents the prior probability assigned to hypothesis i.
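A small numerical sketch of selecting $H_{MAP}$ by maximizing that numerator; the hypotheses, priors, and likelihoods below are invented for illustration:

```python
# Hypothetical hypotheses with assumed priors P(H_i) and likelihoods P(D|H_i).
hypotheses = {
    "H1": {"prior": 0.5, "likelihood": 0.2},
    "H2": {"prior": 0.3, "likelihood": 0.6},
    "H3": {"prior": 0.2, "likelihood": 0.4},
}

# P(D) is constant across hypotheses, so H_MAP maximizes P(D|H_i) * P(H_i).
h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])

# Full posteriors P(H_i|D) via Bayes' rule, normalizing by P(D).
p_data = sum(v["likelihood"] * v["prior"] for v in hypotheses.values())
for h, v in hypotheses.items():
    print(h, round(v["likelihood"] * v["prior"] / p_data, 3))
print("H_MAP:", h_map)  # H2: numerator 0.18 beats H1's 0.10 and H3's 0.08
```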
A Bayesian network includes variables and directed edges between the variables, thereby defining a directed acyclic graph (or "DAG"). Each variable can assume any of a finite number of mutually exclusive states. For each variable A having parent variables B1,...,Bn, there is an attached probability table P(A|B1,...,Bn). The structure of the Bayesian network encodes the assumption that, given its parent variables, each variable is conditionally independent of its non-descendants.
Assuming that the structure of the Bayesian network is known and the variables are observable, only the set of conditional probability tables need be learned. These tables can be estimated directly using statistics from a set of learning examples. If the structure is known but some variables are hidden, learning is analogous to the neural network learning discussed above.
An example of a simple Bayesian network is described below. A variable "MML" may represent the "moisture of my lawn" and may have states "wet" and "dry". The MML variable may have "rain" and "my sprinkler on" parent variables, each having "Yes" and "No" states. Another variable, "MNL", may represent the "moisture of my neighbor's lawn" and may have states "wet" and "dry". The MNL variable may share the "rain" parent variable. In this example, a prediction may be whether my lawn will be "wet" or "dry". The prediction may be based on hypotheses (i) that if it rains, my lawn will be wet with probability (x1), and (ii) that if my sprinkler was on, my lawn will be wet with probability (x2). The probability that it has rained or that my sprinkler was on may depend on other variables. For example, if my neighbor's lawn is wet and they do not have a sprinkler, it is more likely that it has rained.
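The lawn example can be worked numerically. In the sketch below, every conditional probability table entry is an assumed value chosen for illustration; the enumeration relies on the conditional independencies encoded by the DAG (MNL depends on rain only):

```python
# Hypothetical priors and CPTs for the lawn example (all numbers assumed).
p_rain = 0.2
p_sprinkler = 0.3

def p_mml_wet(rain, sprinkler):
    """P(MML=wet | rain, sprinkler); x1 and x2 of the text appear here."""
    return {(True, True): 0.99, (True, False): 0.9,
            (False, True): 0.8, (False, False): 0.05}[(rain, sprinkler)]

def p_mnl_wet(rain):
    """P(MNL=wet | rain)."""
    return 0.9 if rain else 0.1

# P(rain | MNL=wet) by Bayes' rule over the DAG.
num = p_rain * p_mnl_wet(True)
den = num + (1 - p_rain) * p_mnl_wet(False)
p_rain_given = num / den
print("P(rain | neighbor's lawn wet) =", round(p_rain_given, 3))  # ~0.692

# Prediction P(MML=wet | MNL=wet), marginalizing rain and sprinkler.
total = 0.0
for rain, pr in [(True, p_rain_given), (False, 1 - p_rain_given)]:
    for spr, ps in [(True, p_sprinkler), (False, 1 - p_sprinkler)]:
        total += pr * ps * p_mml_wet(rain, spr)
print("P(my lawn wet | neighbor's lawn wet) =", round(total, 3))
```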
As discussed above in the example of neural networks, the conditional probability tables in Bayesian networks may be trained. Advantageously, by allowing prior knowledge to be provided, the learning process may be shortened. Unfortunately, the prior probabilities of the conditional probabilities are often unknown, in which case a uniform prior is used.
An embodiment of the present invention may perform at least two (2) basic functions, namely, generating parameters for a classifier, and classifying objects, such as textual information objects.
Basically, parameters are generated for a classifier based on a set of training examples. A set of feature vectors may be generated from the set of training examples. The features of the set of feature vectors may be reduced. The generated parameters may include a defined monotonic (e.g., sigmoid) function and a weight vector. The weight vector may be determined by means of SVM training (or by other known techniques). The monotonic (e.g., sigmoid) function may be determined by means of an optimization method.
A text classifier includes a weight vector and a defined monotonic (e.g., sigmoid) function. Basically, the output of the text classifier of the present invention may be expressed as:
$$O_c = \frac{1}{1 + e^{A(\vec{w}_c \cdot \vec{x}) + B}}$$
where:
$O_c$ = the classification output for class c;
$\vec{w}_c$ = the weight vector parameter associated with class c;
$\vec{x}$ = a (reduced) feature vector based on the unknown textual information object; and
A and B are adjustable parameters of the monotonic (e.g., sigmoid) function.
The output can be computed faster by expression (2) than by expression (1).
Depending on the form of the object to be classified, the classifier may (i) convert a textual information object into a feature vector, and (ii) reduce the feature vector to a reduced feature vector having fewer elements.
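For illustration, a Python sketch of this classification path: a text is converted to a (reduced) feature vector, scored against the class weight vector, and mapped through the sigmoid of the expression above. The vocabulary, weights, and sigmoid parameters A and B are assumed values, not trained ones:

```python
import math

# Hypothetical reduced vocabulary and weight vector for class c,
# e.g. as obtained beforehand via SVM training.
vocab = ["close", "nasdaq", "team", "goal"]
w_c = [1.3, 2.1, -1.7, -0.9]   # weight vector parameter for class c
A, B = -2.0, 0.5               # assumed fitted sigmoid parameters

def features(text: str) -> list[float]:
    """Convert a textual information object into a (reduced) feature vector."""
    words = text.lower().split()
    return [float(words.count(t)) for t in vocab]

def classify(text: str) -> float:
    """O_c = 1 / (1 + exp(A * (w_c . x) + B))"""
    x = features(text)
    score = sum(wi * xi for wi, xi in zip(w_c, x))
    return 1.0 / (1.0 + math.exp(A * score + B))

print(classify("nasdaq stocks rallied into the close"))  # high O_c
print(classify("the team scored a late goal"))           # low O_c
```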
Transductive machine learning
Commercially, the automatic classification systems currently used in the prior art are either rule-based or utilize inductive machine learning, i.e., they use manually labeled training examples. Compared to transductive methods, both approaches typically require a large amount of manual setup work. The solutions provided by rule-based systems or inductive methods are static solutions that cannot adapt to drifting classification concepts without manual work.
Inductive machine learning is of the type that ascribes properties or relations to objects based on characterization (i.e., based on one or a small number of observations or experiences), or that formulates rules based on patterns recurring in limited observations. Inductive machine learning involves reasoning from observed training cases to establish general rules, which are then applied to the test cases.
In contrast, the preferred embodiments use transductive machine learning methods. Transductive machine learning is an effective method that can avoid these drawbacks.
Transductive machine learning methods can learn from a very small set of labeled training examples, automatically adapt to drifting classification concepts, and automatically correct the labeled training examples. These advantages make transductive machine learning an interesting and valuable method, suitable for a wide variety of business applications.
Transduction learns patterns in data. Transduction extends the concept of inductive learning by learning not only from labeled data but also from unlabeled data. This enables transduction to learn patterns that cannot be captured, or are only partially captured, from labeled data. Accordingly, compared to rule-based systems or systems based on inductive learning, transduction can adapt to dynamically changing environments. This capability enables transduction to be used for document searching, data cleanup, addressing drifting classification concepts, and so on.
Embodiments of transductive classification utilizing support vector machine (SVM) classification and the maximum entropy discrimination (MED) framework are described below.
Support vector machine
The support vector machine (SVM) is a method that has been used for text classification. By using concepts from regularization theory to constrain the possible solutions, the method handles the problem of the large number of solutions and the resulting generalization problem. For example, a binary SVM classifier selects, from among all hyperplanes that accurately separate the training data, the hyperplane that maximizes the margin as the solution. Maximizing the margin, subject to the constraint that the training data are classified accurately, serves as the regularization and selects an appropriate balance in the learning problem between the aforementioned generalization and memorization: the constraints on the training data ensure memorization, whereas the regularization ensures sensible generalization. Inductive classification learns from training examples with known labels, i.e., the class membership of each training example is known. Whereas inductive classification learns from known labels, transductive classification determines the classification rule from labeled as well as unlabeled data. An example of transductive SVM classification is shown in Table 1.
Principle of transductive SVM classification
Require: Data matrix X of labeled training examples and their labels Y.
Require: Data matrix X′ of the unlabeled training examples.
Require: A list of all possible label assignments of the unlabeled training examples [Y′_1, ..., Y′_n].
1: MaximumMargin := 0
2: Ŷ := 0 {Induced label assignment of the unlabeled training examples.}
3: for all label assignments Y′_i, 1 ≤ i ≤ n, in the list of label assignments do
4:   CurrentMaximumMargin := MaximizeMargin(X, Y, X′, Y′_i)
5:   if CurrentMaximumMargin > MaximumMargin then
6:     MaximumMargin := CurrentMaximumMargin
7:     Ŷ := Y′_i
8:   end if
9: end for

Table 1
Table 1 illustrates the principle of transductive classification utilizing support vector machines. The solution is given by the hyperplane that yields the maximum margin over all possible label assignments of the unlabeled data. The number of possible label assignments grows exponentially with the number of unlabeled data points, and for a practically feasible method the algorithm of Table 1 must be approximated. An example of such an approximation is described in T. Joachims, Transductive inference for text classification using support vector machines, Technical report, Universitaet Dortmund, LS VIII, 1999 (Joachims).
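A direct Python sketch of the exhaustive procedure of Table 1 follows, with scikit-learn's linear SVC (using a large C) standing in for the hard-margin MaximizeMargin step; the data are toy values. Since the loop is exponential in the number of unlabeled points, it also illustrates why approximations such as Joachims' are needed in practice:

```python
from itertools import product
import numpy as np
from sklearn.svm import SVC

def maximize_margin(X, y):
    """Train a (nearly) hard-margin linear SVM and return its margin 1/||w||,
    or None if this label assignment is not separated correctly."""
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    if (clf.predict(X) != y).any():
        return None
    return 1.0 / np.linalg.norm(clf.coef_)

def transductive_svm(X_lab, y_lab, X_unl):
    best_margin, best_assignment = 0.0, None
    X = np.vstack([X_lab, X_unl])
    # Enumerate all 2^n label assignments of the unlabeled examples.
    for y_unl in product([-1, 1], repeat=len(X_unl)):
        y = np.concatenate([y_lab, y_unl])
        margin = maximize_margin(X, y)
        if margin is not None and margin > best_margin:
            best_margin, best_assignment = margin, y_unl
    return best_assignment, best_margin

X_lab = np.array([[-1.0, 0.0], [1.0, 0.0]])
y_lab = np.array([-1, 1])
X_unl = np.array([[-0.8, 0.1], [0.9, -0.2], [0.7, 0.3]])
print(transductive_svm(X_lab, y_lab, X_unl))
```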
The uniform distribution over label assignments in Table 1 expresses that an unlabeled data point has a probability of 1/2 of being a positive example of the class and a probability of 1/2 of being a negative example; i.e., the two possible label assignments y=+1 (positive example) and y=-1 (negative example) have equal probability, and the resulting expected label is 0. An expected label of 0 can follow either from a fixed class prior probability equal to 1/2, or from a class prior probability that is itself a random variable with a uniform prior distribution (i.e., an unknown class prior probability). Accordingly, in applications where the class prior probability is known to differ from 1/2, the algorithm could be improved by incorporating this additional information. For example, instead of the uniform distribution over label assignments used in Table 1, certain label assignments would be preferred over others according to the class prior probability. However, trading off solutions having a smaller label-assignment probability but a larger margin against solutions having a larger label-assignment probability but a smaller margin is difficult: the probability of a label assignment and the margin are on different scales.
Maximum entropy discrimination
Another classification method, maximum entropy discrimination (MED) (see, e.g., T. Jebara, Machine Learning: Discriminative and Generative, Kluwer Academic Publishers) (Jebara), does not encounter the problem associated with SVMs, because the regularization term of the decision function and the regularization term of the label assignments are both derived from prior probability distributions over the solutions, and are thus on the same probability scale. Accordingly, when the class prior, and thus the label prior, is known, transductive MED classification is superior to transductive SVM classification, since it allows prior label knowledge to be incorporated in a principled way.
Inductive MED classification assumes a prior distribution over the decision function parameters, a prior distribution over the bias term, and a prior distribution over the margin. It selects, as the final distribution over these parameters, the distribution that is closest to the prior distributions and that yields an expected decision function that accurately classifies the data points.
Formally, given for example a linear classifier, the problem is stated as follows: find the hyperplane parameter distribution p(Θ), the bias distribution p(b), and the data point margin distributions p(γ) whose joint probability distribution has the minimum Kullback-Leibler divergence KL to the combined respective prior distributions p_0, i.e.
$$\min_{p(\Theta),\,p(\gamma),\,p(b)} KL\big(p(\Theta)\,p(\gamma)\,p(b)\,\big\|\,p_0(\Theta)\,p_0(\gamma)\,p_0(b)\big), \qquad (1)$$
subject to the constraints
$$\forall t:\ \int d\Theta\, d\gamma\, db\ p(\Theta)\,p(\gamma)\,p(b)\,\big[y_t(\Theta X_t - b) - \gamma_t\big] \geq 0, \qquad (2)$$
where Θ·X_t is the dot product between the separating hyperplane weight vector and the feature vector of the t-th data point. Since the label assignments y_t are known and fixed, no prior distribution over binary label assignments is required. Accordingly, inductive MED classification generalizes in a straightforward way to transductive MED classification by treating the binary label assignments of the possible label assignments as additional parameters with a prior distribution. An example of transductive MED is shown in Table 2.
Transductive MED classification
Require: Data matrix X of labeled and unlabeled training examples.
Require: Label prior probabilities p_0(y) for labeled and unlabeled training examples.
1: ⟨Y⟩ := ExpectedLabel(p_0(y)) {Expected labels determined from the training examples' label prior probabilities.}
2: while not converged do
3:   W := MinimizeKLDivergence(X, ⟨Y⟩)
4:   Y′ := InduceLabels(W, X, p_0(y))
5:   ⟨Y⟩ := ε⟨Y⟩ + (1 - ε)Y′
6: end while

Table 2
For labeled data, the label prior distribution is a delta function, which effectively fixes the label to +1 or -1. For unlabeled data, a label prior probability p_0(y) is assumed, assigning to each unlabeled data point a probability of p_0(y) for the positive label y=+1 and a probability of 1-p_0(y) for the negative label y=-1. Assuming a non-informative label prior (p_0(y)=1/2) yields a transductive MED classification similar to the transductive SVM classification described above.
As in the case of transductive SVM classification, a practically feasible implementation of the above MED algorithm must approximate the search over all possible label assignments. The method described in T. Jaakkola, M. Meila, and T. Jebara, Maximum entropy discrimination, Technical Report AITR-1668, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 1999 (Jaakkola) chooses an approximation that decomposes the procedure into two steps, similar to an expectation maximization (EM) formulation. In this formulation, two problems must be solved. The first step, corresponding to the M step in the EM algorithm, approximates the maximization of the margin while classifying all data points accurately according to the current best guess of the label distribution. The second step, corresponding to the E step, uses the classification results determined in the M step and estimates new values for the class memberships of each example. We refer to this second step as label induction. The basic outline is shown in Table 2.
The particular implementation of the method of Jaakkola, incorporated herein, assumes a Gaussian with zero mean and unit variance over the hyperplane parameters, a Gaussian with zero mean and variance σ_b² over the bias parameter, a margin prior of the form exp[-c(1-γ)], where γ is the margin of a data point and c is the cost factor, and the binary label prior probability p_0(y) for unlabeled data described above. The transductive classification algorithm of Jaakkola, incorporated herein, is discussed below assuming, for simplicity and without loss of generality, a label prior probability of 1/2.
For a given fixed probability distribution over the hyperplane parameters, the label induction step determines the label probability distribution. Using the margin and label priors given above yields the following objective function of the label induction step (see Table 2):

$$J(\lambda) = \sum_t \left[ \lambda_t + \log\left(1 - \frac{\lambda_t}{c}\right) \right] - \sum_t \log\cosh(\lambda_t s_t), \qquad (3)$$

where $\lambda_t$ is the Lagrange multiplier of the t-th training example, $s_t$ is its classification score determined in the aforementioned M step, and c is the cost factor. The first two terms of the sum over the training examples follow from the margin prior distribution, and the third term is given by the label prior distribution. The Lagrange multipliers are determined by maximizing $J$, which in turn determines the label probability distribution of the unlabeled data. As can be seen from Equation 3, the data points contribute to the objective function independently of each other; hence, each Lagrange multiplier is determined independently of the other Lagrange multipliers. For example, to maximize the contribution to $J$ of an unlabeled data point whose classification score has a large absolute value $|s_t|$, a small Lagrange multiplier $\lambda_t$ is needed, whereas an unlabeled data point with a small value of $|s_t|$ requires a large Lagrange multiplier to maximize its contribution to $J$. On the other hand, the expected label ⟨y⟩ of an unlabeled data point, expressed as a function of its classification score s and its Lagrange multiplier λ, is:
$$\langle y \rangle = \tanh(\lambda s) \qquad (4)$$
Fig. 1 shows the expected label ⟨y⟩ as a function of the classification score s, using cost factors c=5 and c=1.5. The Lagrange multipliers used to produce Fig. 1 were determined by solving Equation 3 with cost factors c=5 and c=1.5. As can be seen in Fig. 1, unlabeled data points outside the margin, i.e., |s| > 1, have expected labels ⟨y⟩ close to 0; data points close to the margin, i.e., |s| ≈ 1, yield the highest absolute expected label values; and data points close to the hyperplane, i.e., |s| < ε, yield |⟨y⟩| < ε. The reason for the non-intuitive label assignment ⟨y⟩ → 0 as |s| → ∞ is the discriminative way the labels are determined: the method tries to stay as close as possible to the prior distribution as long as the classification constraints are satisfied. This is not an artifact of the approximation chosen by the known method of Table 2; i.e., an algorithm that exhaustively searches all possible label assignments, and is thus guaranteed to find the globally optimal solution, likewise assigns expected labels close or equal to zero to unlabeled data outside the margin. To reaffirm, as described above, this is what is desired from a discriminative point of view: data points outside the margin are unimportant for separating the examples, and accordingly the individual probability distributions of all these data points revert to their prior distributions.
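This behavior can be reproduced numerically. The sketch below assumes the per-point label-induction objective J(λ) = λ + log(1 - λ/c) - log cosh(λs) of Equation 3 and maximizes it by a simple grid search (an illustrative stand-in for the actual optimizer), then evaluates ⟨y⟩ = tanh(λs):

```python
import numpy as np

def induced_lambda(s, c, grid=4000):
    """Maximize the per-point label-induction objective of Eq. 3,
    J(lam) = lam + log(1 - lam/c) - log(cosh(lam*s)),
    over 0 <= lam < c by grid search (illustrative only)."""
    lam = np.linspace(0.0, c - 1e-6, grid)
    J = lam + np.log(1.0 - lam / c) - np.log(np.cosh(lam * s))
    return lam[np.argmax(J)]

for c in (5.0, 1.5):
    print(f"cost factor c={c}")
    for s in (0.05, 0.5, 1.0, 2.0, 5.0):
        lam = induced_lambda(s, c)
        # Small <y> near the hyperplane, a peak near |s| = 1,
        # and a decay back towards 0 outside the margin.
        print(f"  score s={s:>4}: <y> = {np.tanh(lam * s):+.3f}")
```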
The M step of the transductive classification algorithm of Jaakkola, incorporated herein, determines the probability distributions over the hyperplane parameters, the bias term, and the margins of the data points that are closest to their respective prior distributions, subject to the constraints
$$\forall t:\ s_t \langle y_t \rangle - \langle \gamma_t \rangle \geq 0 \qquad (5)$$
where s_t is the classification score of the t-th data point, ⟨y_t⟩ is its expected label, and ⟨γ_t⟩ is its expected margin. For labeled data, the expected labels are fixed at ⟨y⟩ = +1 or ⟨y⟩ = -1. The expected labels of unlabeled data lie within the interval (-1, +1) and are estimated in the label induction step. According to Equation 5, since the classification score is weighted by the expected label, unlabeled data must fulfill a more stringent classification constraint than labeled data. In addition, given the relationship of the expected label as a function of the classification score (see Fig. 1), unlabeled data close to the separating hyperplane have the most stringent classification constraints, since the absolute values of their scores and expected labels |⟨y_t⟩| are small. The complete objective function of the M step, given the prior distributions described above, is:
$$J(\lambda) = -\frac{1}{2}\Big\|\sum_t \lambda_t \langle y_t \rangle X_t\Big\|^2 + \sum_t \left[ \lambda_t + \log\left(1 - \frac{\lambda_t}{c}\right) \right] - \frac{\sigma_b^2}{2}\Big(\sum_t \lambda_t \langle y_t \rangle\Big)^2 \qquad (6)$$

The first term follows from the Gaussian prior over the hyperplane parameters, the second term is the margin prior regularization term, and the last term is the bias prior regularization term, which follows from the Gaussian prior with zero mean and variance σ_b². The prior distribution over the bias term can be understood as a prior distribution over the class prior probability. Accordingly, the regularization term corresponding to the bias prior regularizes the weights of the positive and negative examples. Referring to Equation 6, the contribution of the bias term is minimized when the collective pull of the positive examples on the hyperplane equals the collective pull of the negative examples. Owing to the bias prior, the collective constraint on the Lagrange multipliers is weighted by the expected labels of the data points, and unlabeled data are therefore less constrained than labeled data. Thus, unlabeled data have a stronger ability than labeled data to influence the final solution.
In summary, in the M step of the transductive classification algorithm of Jaakkola, incorporated herein, unlabeled data must fulfill more stringent classification constraints than labeled data, and their accumulated weight on the solution is less constrained than that of labeled data. In addition, unlabeled data with expected labels close to zero that lie within the margin of the current M step have a large influence on the solution. The net effect of the E and M step formulations can be illustrated by applying the algorithm to the data set shown in Fig. 2. The data set includes two labeled examples, a negative example (x) located at x = -1 and a positive example (+) at x = +1, and six unlabeled examples (o) between -1 and +1 along the x-axis. The cross (x) denotes the labeled negative example, the plus sign (+) denotes the labeled positive example, and the circles (o) denote the unlabeled data. The different graphs show the separating hyperplanes determined by different iterations of the M step. The final solution determined by the transductive MED classifier of Jaakkola, incorporated herein, misclassifies the positive labeled training example. Fig. 2 shows successive iterations of the M step. In the first iteration of the M step, the unlabeled data are not considered, and the separating hyperplane is located at x = 0. One unlabeled data point with a negative x value is closer to this separating hyperplane than any other unlabeled data point. In the subsequent label induction step, it is assigned the smallest |⟨y⟩|; accordingly, in the next M step it has the greatest authority to push the hyperplane towards the positive labeled example. The particular shape of the expected label as a function of the classification score, determined by the chosen cost factor (see Fig. 1), combined with the particular spacing of the unlabeled data points, creates a bridging effect in which, with each successive M step, the separating hyperplane moves closer and closer to the positive example. Intuitively, the M step suffers from a kind of myopia: the unlabeled data points closest to the current separating hyperplane mostly determine the final position of the plane, and data points farther away are less important. Finally, since the bias prior term constrains the collective pull of the unlabeled data less than that of the labeled data, the separating hyperplane moves beyond the positive labeled example, yielding a final solution, iteration 15 in Fig. 2, that misclassifies the positive labeled example. In Fig. 2, a bias variance of σ_b² = 1 and a cost factor of c = 10 were used. With σ_b² = 1, any cost factor within the range 9.8 < c < 13 yields a final hyperplane that misclassifies the positive labeled example, whereas all cost factors outside the interval 9.8 < c < 13 yield separating hyperplanes located somewhere between the two labeled examples.
The instability of this algorithm is not limited to the examples shown in Fig. 2; it has also been observed when applying the method of Jaakkola, incorporated herein, to real-world data sets, including the Reuters data set well known to those skilled in the art. The inherent instability of the embodiment described in Table 2 is a major drawback of the method and limits its versatility, although the method of Jaakkola may be implemented in certain embodiments of the present invention.
A preferred method of the present invention uses transductive classification within the framework of maximum entropy discrimination (MED). It should be readily understood that the various embodiments of the invention, while applicable to classification, apply equally to other learning problems using transductive MED, including, but not limited to, transductive MED regression and graphical models.
Maximum entropy discrimination constrains and reduces the possible solutions by assuming a prior probability distribution over the parameters. The final solution is the expected value over all possible solutions according to the probability distribution that is closest to the assumed prior probability distribution, subject to the constraint that the expected solution describes the training data accurately. The prior probability distribution over all solutions maps to a regularization term; i.e., choosing a specific prior distribution amounts to choosing a specific regularization.
Discriminative estimation, as implemented by support vector machines, is effective for learning from a small number of examples. The methods and apparatus of embodiments of the present invention share this property with support vector machines: they do not estimate more parameters than are necessary to solve the given problem, and they therefore produce a sparse solution. By contrast, generative model estimation, which attempts to explain the underlying process, commonly requires estimating higher-order statistics than discriminative estimation. On the other hand, generative models are more flexible and can therefore be applied to a wider variety of problems, and generative model estimation can incorporate prior knowledge directly. By using maximum entropy discrimination, the methods and apparatus of embodiments of the present invention bridge the gap between purely discriminative estimation (e.g., support vector machine learning) and generative model estimation.
The method of an embodiment of the present invention, shown in Table 3, is an improved transductive MED classification algorithm that does not suffer from the aforementioned instability present in the method of Jaakkola (incorporated herein by reference). The differences include, but are not limited to, the following: in this embodiment, each data point has its own cost factor, proportional to the absolute value of its expected label |⟨y⟩|. In addition, after each M step, the label prior probability of each data point is updated according to an estimated class membership probability computed as a function of the distance of the data point from the decision function. The method of this embodiment of the present invention is shown in Table 3 below:
The Improved Transductive MED Classification

Require: Data matrix X of labeled and unlabeled training examples.
Require: Label prior probabilities p0(y) for labeled and unlabeled training examples.
Require: Global cost factor c.
1: ⟨Y⟩ := ExpectedLabel(p0(y)) {Expected labels determined from the training examples' label prior probabilities.}
2: while not converged do
3:   C := |⟨Y⟩| · c {Scale each training example's cost factor by the absolute value of its expected label.}
4:   W := MinimizeKLDivergence(X, ⟨Y⟩, C)
5:   p0(y) := EstimateClassProbability(W, ⟨Y⟩)
6:   Y′ := InduceLabels(W, X, p0(y), C)
7:   ⟨Y⟩ := ε⟨Y⟩ + (1 − ε)Y′
8: end while

Table 3
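By way of illustration only, the loop of Table 3 might be sketched in Python roughly as follows; the helpers minimize_kl_divergence, estimate_class_probability, and induce_labels are assumed placeholders for the M step, the calibration of row 5, and the E step, and are not implementations of the claimed method.

    import numpy as np

    def train_transductive_med(X, label_prior, c, eps=0.5, tol=1e-4, max_iter=50):
        # X           : data matrix of labeled and unlabeled examples
        # label_prior : array of P(y = +1) per example (1.0 or 0.0 for labeled
        #               data, 0.5 or the class prior for unlabeled data)
        # c           : global cost factor
        # eps         : damping factor for the expected-label update (row 7)
        y_exp = 2.0 * label_prior - 1.0               # row 1: <Y> = P(+1) - P(-1)
        w = None
        for _ in range(max_iter):                     # row 2: while not converged
            cost = np.abs(y_exp) * c                  # row 3: per-example cost factor
            w = minimize_kl_divergence(X, y_exp, cost)             # row 4: M step
            label_prior = estimate_class_probability(w, X, y_exp)  # row 5
            y_new = induce_labels(w, X, label_prior, cost)         # row 6: E step
            y_next = eps * y_exp + (1.0 - eps) * y_new             # row 7
            if np.max(np.abs(y_next - y_exp)) < tol:  # convergence monitor
                return w
            y_exp = y_next
        return w

The damped update in row 7 is what prevents a single label induction step from flipping the expected labels abruptly, which is the source of the bridging effect described above.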
Scaling each data point's cost factor by |⟨y⟩| mitigates the problem that the collective pull of the unlabeled data on the hyperplane was weighted more heavily than that of the labeled data, because the cost factor of an unlabeled data point is now smaller than the cost factor of a labeled data point; that is, the individual contribution of each unlabeled data point to the final solution is always smaller than the individual contribution of a labeled data point. However, if the total number of unlabeled data points greatly exceeds the number of labeled data points, the unlabeled data can still influence the final solution more strongly than the labeled data. Furthermore, combining the cost-factor scaling with the updating of the label priors using the estimated class probabilities solves the bridging-effect problem described above. In the first M step, the unlabeled data have small cost factors, which yields an expected label, as a function of classification score, that is relatively flat (see Fig. 1); accordingly, all unlabeled data points are, to some extent, allowed to keep pulling on the hyperplane, although only with small weight. Moreover, owing to the label prior updates, an unlabeled data point far from the separating hyperplane is no longer assigned an expected label close to 0; instead, over the course of the iterations it is assigned a label approaching y = +1 or y = -1 and is thus gradually treated more and more like a labeled data point.
In a particular implementation of the method of this embodiment of the present invention, a Gaussian prior with zero mean and unit variance is assumed for the decision function parameters Θ:

    p_0(\Theta) = \frac{1}{\sqrt{(2\pi)^n}}\, e^{-\frac{1}{2}\Theta^T\Theta}    (7)
The prior distribution of the decision function parameters incorporates important prior knowledge about the specific classification problem at hand. Other prior distributions of the decision function parameters that are important for classification problems include the multinomial distribution, the Poisson distribution, the Cauchy (Breit-Wigner) distribution, the Maxwell-Boltzmann distribution, and the Bose-Einstein distribution.
The prior distribution of the decision function threshold b is given by a Gaussian distribution with mean μ_b and variance σ_b²:

    p_0(b) = \frac{1}{\sqrt{2\pi}\,\sigma_b}\, e^{-\frac{1}{2}\frac{(b-\mu_b)^2}{\sigma_b^2}}    (8)
As the prior distribution of the classification margin γ_t of the data points,

    p_0(\gamma_t) = c\, e^{-c\left(1 + \frac{1}{c} - \gamma_t\right)}    (9)

is chosen, where c is the cost factor. This prior distribution differs from the prior distribution used in Jaakkola (incorporated herein by reference), whose expression is exp[-c(1-γ)]. Preferably, the expression given by Equation 9 is used instead of the expression of Jaakkola (incorporated herein by reference), because Equation 9 yields a positive expected margin even when the cost factor is less than 1, whereas for c < 1 the expression exp[-c(1-γ)] yields a negative expected margin.
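This advantage can be checked directly. Assuming, as normalization requires, that the prior of Equation 9 is supported on γ_t ≤ 1 + 1/c, the substitution u = 1 + 1/c - γ_t gives the expected margin

    \langle \gamma_t \rangle = \int_{-\infty}^{1+\frac{1}{c}} \gamma_t \, c\, e^{-c\left(1+\frac{1}{c}-\gamma_t\right)}\, d\gamma_t = \left(1+\frac{1}{c}\right) - \frac{1}{c} = 1 \quad \text{for all } c > 0,

whereas the same computation for the Jaakkola prior c\,e^{-c(1-\gamma_t)} supported on γ_t ≤ 1 gives \langle\gamma_t\rangle = 1 - \frac{1}{c}, which is negative for c < 1.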
Given these prior distributions, the corresponding partition function Z can be determined directly (see, e.g., T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc.) (Cover), and the objective function

    J(\lambda) = -\log Z(\lambda)    (10)

follows. According to Jaakkola (incorporated herein by reference), the objective function of the M step is J_M(λ), given by Equation 11 and expanded below as Equation 21, and the objective function of the E step is J_E(λ), given by Equation 12 and expanded below as Equation 33, where s_t is the classification score of the t-th data point determined in the preceding M step, and p_{0,t}(y_t) is the binary label prior probability of the data point. For labeled data, the label prior is initialized as p_{0,t}(y_t) = 1; for unlabeled data, the label prior is initialized as the non-informative prior p_{0,t}(y_t) = 1/2, or as the category prior probability.
The section below entitled "M Step" describes an algorithm for solving the M-step objective function. Similarly, the section below entitled "E Step" describes the E-step algorithm.
In the Estimate Class Probability step in row 5 of Table 3, the training data are used to determine calibration parameters that convert classification scores into class membership probabilities, i.e., that assign to a score s the probability p(c|s). Related techniques for calibrating scores to probabilities are described in J. Platt, Probabilistic outputs for support vector machines and comparison to regularized likelihood methods, pages 61-74, 2000 (Platt), and B. Zadrozny and C. Elkan, Transforming classifier scores into accurate multi-class probability estimates, 2002 (Zadrozny).
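As a minimal sketch of such a calibration, in the spirit of the Platt reference, one might fit a two-parameter sigmoid to the training scores; the routine below is an illustrative assumption, not the calibration prescribed by the method, and it uses raw 0/1 targets rather than Platt's regularized targets.

    import numpy as np
    from scipy.optimize import minimize

    def fit_platt_calibration(scores, targets):
        # Fit p(c|s) = 1 / (1 + exp(A*s + B)) by maximizing the
        # log-likelihood of the 0/1 training targets (Platt, 2000).
        def neg_log_likelihood(params):
            A, B = params
            p = 1.0 / (1.0 + np.exp(A * scores + B))
            p = np.clip(p, 1e-12, 1.0 - 1e-12)
            return -np.sum(targets * np.log(p) + (1 - targets) * np.log(1 - p))
        A, B = minimize(neg_log_likelihood, x0=[-1.0, 0.0]).x
        # The fitted map can then play the role of EstimateClassProbability:
        # calibrate = fit_platt_calibration(train_scores, train_targets)
        # membership_prob = calibrate(classification_score)
        return lambda s: 1.0 / (1.0 + np.exp(A * s + B))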
Referring particularly to Fig. 3, the cross (x) denotes the labeled negative example, the plus sign (+) denotes the labeled positive example, and the circles (o) denote the unlabeled data. The different curves represent the separating hyperplanes determined at different iterations of the M step. The 20th iteration shows the final solution determined by the improved transductive MED classifier. Fig. 3 shows the improved transductive MED classification algorithm applied to the small data set described above. The parameters used are c = 10, σ_b² = 1, and μ_b = 0. Different values of c produce separating hyperplanes located between x ≈ -0.5 and x = 0: for c < 3.5 the hyperplane lies to the right of the unlabeled data point at x < 0, and for c ≥ 3.5 the hyperplane lies to the left of that unlabeled data point.
Referring particularly to Fig. 4, a control flow is illustrated showing a method of classifying unlabeled data according to an embodiment of the present invention. Method 100 begins at step 102, and in step 104 stored data 106 are accessed. These data, held in a storage element, include labeled data, unlabeled data, and at least one predetermined cost factor. The data 106 include data points with assigned labels. The assigned label identifies whether a labeled data point is to be included in a particular category or excluded from a particular category.
Once the data have been accessed in step 104, the method of this embodiment uses, in step 108, the label information of a data point to determine the label prior probability of that data point. Then, in step 110, the expected label of the data point is determined from the label prior probability. With the expected labels computed in step 110, together with the labeled data, the unlabeled data, and the cost factor, step 112 iteratively trains a transductive MED classifier while adjusting the cost factors of the unlabeled data points. In each iterative computation, the cost factor of each unlabeled data point is adjusted; in this way, the MED classifier learns from iteration to iteration. The trained classifier then accesses input data 114 in step 116. The trained classifier next performs the step of classifying the input data in step 118, and the method ends at step 120.
It will be readily understood that the unlabeled data of 106 and the input data 114 may be obtained from a single source. Thus, the input data/unlabeled data may be used in the iterative process of step 112, the same process used for classification in step 118. Moreover, embodiments of the present invention contemplate that the input data 114 may include a feedback mechanism that supplies the input data to the stored data 106, so that the MED classifier of step 112 dynamically learns from new input data.
Referring particularly to Fig. 5, a control flow chart is illustrated showing another method of classifying unlabeled data according to an embodiment of the present invention, which includes user-defined prior probability information. Method 200 begins at step 202, and in step 204 stored data 206 are accessed. The data 206 include labeled data, unlabeled data, a predetermined cost factor, and prior probability information supplied by the user. The labeled data of 206 include data points with assigned labels. The assigned label identifies whether a labeled data point is to be included in a particular category or excluded from a particular category.
In step 208, expected labels are computed from the data of 206. Then, in step 210, these expected labels are used together with the labeled data, the unlabeled data, and the cost factor to drive the iterative training of a transductive MED classifier. The iterative computation of step 210 adjusts the cost factors of the unlabeled data in each computation. The computation continues until the classifier has been properly trained.
The trained classifier then accesses the input data from input data 212 in step 214. The trained classifier can next perform the step of classifying the input data in step 216. As in the process and method described with reference to Fig. 4, the input data and the unlabeled data may be obtained from a single source and may enter the system at 206 and 212. In this way, the input data 212 can influence the training in step 210, so that the process can change dynamically over time with a continuous stream of input data.
In both methods shown in Figs. 4 and 5, a monitor may determine whether the system has reached convergence. Convergence may be determined when the change of the MED hyperplane between successive iterations of the computation falls below a predetermined threshold. In another embodiment of the present invention, convergence may be determined when the change of the determined expected labels falls below a predetermined threshold. If convergence is reached, the iterative training process may stop.
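Assuming, for illustration only, that the change is measured by a vector norm, such a monitor might be implemented as:

    import numpy as np

    def has_converged(previous, current, threshold=1e-4):
        # Monitor for Figs. 4 and 5: `previous` and `current` may be
        # successive hyperplane weight vectors or successive
        # expected-label vectors; training stops once the change
        # falls below the predetermined threshold.
        change = np.linalg.norm(np.asarray(current) - np.asarray(previous))
        return change < threshold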
Referring particularly to Fig. 6, a more detailed control flow chart of the iterative training process of at least one embodiment of the inventive method is shown. Process 300 begins at step 302, and in step 304 data from data store 306 are accessed; these data may include labeled data, unlabeled data, at least one predetermined cost factor, and prior probability information. Each labeled data point of 306 includes a label identifying whether the data point is a training example of a data point to be included in a specified category or a training example of a data point to be excluded from a specified category. The prior probability information of 306 includes probability information for the labeled data set and the unlabeled data set.
In step 308, expected labels are determined from the prior probability information of step 306. In step 310, the cost factor of each unlabeled data point is scaled in proportion to the absolute value of the expected label of the data point. Then, in step 312, an MED classifier is trained by determining a decision function that, using the labeled and unlabeled data as training examples according to their expected labels, maximizes the margin between the included and the excluded training examples. In step 314, the classifier trained in step 312 is used to determine classification scores. In step 316, the classification scores are calibrated to class membership probabilities. In step 318, the label prior probability information is updated according to the class membership probabilities. In step 320, an MED computation is performed to determine the label and margin probability distributions, wherein the previously determined classification scores are used in the MED computation. As a result, new expected labels are computed in step 322, and in step 324 the expected labels are updated using the computation from step 322. In step 326, the method determines whether convergence has been reached. If so, the method ends at step 328. If convergence has not been reached, another iteration of the method is performed, starting from step 310. The iterations continue until convergence is reached, thereby realizing the iterative training of the MED classifier. Convergence is reached when the change of the decision function between successive MED iterative computations falls below a predetermined value. In another embodiment, convergence is reached when the change of the determined expected label values falls below a predetermined threshold.
Fig. 7 shows a network architecture 700 according to one embodiment. As shown, a plurality of remote networks 702 are provided, including a first remote network 704 and a second remote network 706. A gateway 707 is coupled between the remote networks 702 and a proximate network 708. In the context of the present network architecture 700, each of the networks 704, 706 may take any form, including, but not limited to, a LAN, a wide area network such as the Internet, a public switched telephone network (PSTN), an internal telephone network, etc.
In use, the gateway 707 serves as an entrance point from the remote networks 702 to the proximate network 708. As such, the gateway 707 may function as a router, capable of directing a given packet of data that arrives at the gateway 707, and as a switch, which furnishes the actual path into and out of the gateway 707 for a given packet.
Further included is at least one data server 714 coupled to the proximate network 708, which is accessible from the remote networks 702 via the gateway 707. It should be noted that the data server 714 may include any type of computing device/component. Coupled to each data server 714 is a plurality of user devices 716. Such user devices 716 may include desktop computers, laptop computers, hand-held computers, printers, or any other type of logic device. It should be noted that, in one embodiment, a user device 717 may also be directly coupled to any of the networks.
A facsimile machine 720, or a series of facsimile machines 720, may be coupled to one or more of the networks 704, 706, 708.
It should be noted that databases and/or additional components may be used with, or integrated into, any type of network element coupled to the networks 704, 706, 708. In the context of the present description, a network element is preferably any component of a network.
According to one embodiment, Fig. 8 shows a representative hardware environment associated with the user devices 716 of Fig. 7. The figure illustrates a typical hardware configuration of a workstation having a central processing unit 810, such as a microprocessor, and a number of other units interconnected via a system bus 812.
The workstation shown in Fig. 8 includes a random access memory (RAM) 814; a read only memory (ROM) 816; an I/O adapter 818 for connecting peripheral devices, such as disk storage units 820, to the bus 812; a user interface adapter 822 for connecting a keyboard 824, a mouse 826, a speaker 828, a microphone 832, and/or other user interface devices, such as a touch screen and a digital camera (not shown), to the bus 812; a communication adapter 834 for connecting the workstation to a communication network 835 (e.g., a data processing network); and a display adapter 836 for connecting the bus 812 to a display device 838.
Referring particularly to Fig. 9, an apparatus 414 of one embodiment of the present invention is shown. One embodiment of the present invention includes a storage device 814 for storing labeled data 416. Each labeled data point 416 includes a label indicating whether the data point is a training example of a data point to be included in a specified category or a training example of a data point to be excluded from a specified category. The memory 814 also stores unlabeled data 418, prior probability data 420, and cost factors 422.
The processor 810 accesses the data from the memory 814 and uses a transductive MED computation to train a binary classifier capable of classifying the unlabeled data. Using the cost factors and the labeled and unlabeled training examples, the processor 810 performs an iterative transductive computation, adjusting the cost factors as a function of the expected label values and thereby modifying the cost factor data 422, which in turn are fed back into the processor 810. The cost factors 422 therefore change with each MED classification iteration of the processor 810. Once the processor 810 has sufficiently trained an MED classifier, the processor can then direct the classifier to assign the unlabeled data to the classified data 424.
The transductive SVM and MED formulations of the prior art give rise to an exponentially growing number of potential label assignments, and approximations must be developed for practical applications. In another embodiment of the present invention, a different formulation of transductive MED classification is described that does not suffer from an exponentially growing number of possible label assignments and that allows a conventional closed-form solution. For a linear classifier, the problem is formulated as follows: find the distribution over hyperplane parameters p(Θ), the bias distribution p(b), and the data point classification margin distribution p(γ) whose combined probability distribution has a minimal Kullback-Leibler divergence KL from the combination of the respective prior distributions p_0, i.e.,

    \min_{p(\Theta),\, p(\gamma),\, p(b)} KL\big( p(\Theta)\, p(\gamma)\, p(b) \,\|\, p_0(\Theta)\, p_0(\gamma)\, p_0(b) \big)    (13)
subject to the following constraints for the labeled data,

    \forall t: \int d\Theta\, d\gamma\, db\;\; p(\Theta)\, p(\gamma)\, p(b)\, \big( y_t(\Theta \cdot X_t - b) - \gamma_t \big) \ge 0,    (14)

and subject to the following constraints for the unlabeled data,

    \forall t': \int d\Theta\, d\gamma\, db\;\; p(\Theta)\, p(\gamma)\, p(b)\, \big( (\Theta \cdot X_{t'} - b)^2 - \gamma_{t'} \big) \ge 0,    (15)
where Θ·X_t is the dot product between the weight vector of the separating hyperplane and the feature vector of the t-th data point. No prior distribution over labels is needed. The labeled data are constrained to lie on the correct side of the separating hyperplane according to their known labels, and the only requirement for the unlabeled data is that the square of their distance to the hyperplane be greater than the margin. In summary, embodiments of the present invention find a separating hyperplane that represents a balance between being closest to the selected prior probabilities, separating the labeled data accurately, and having no unlabeled data within the margin. One advantage is that no prior distribution over labels has to be introduced; the problem of the exponentially growing number of potential label assignments is thereby avoided.
In a particular implementation of this other embodiment of the present invention, using the prior distributions of the hyperplane parameters, bias, and margins given in Equations 7, 8, and 9, the following partition function is obtained:

    Z(\lambda) = \frac{1}{\sqrt{(2\pi)^{n+1}}\,\sigma_b} \int d\Theta\, db\; e^{-\frac{1}{2}\Theta^T\Theta - \frac{1}{2}\left(\frac{b-\mu_b}{\sigma_b}\right)^2 + \sum_t \lambda_t y_t (\Theta^T X_t - b) + \sum_{t'} \lambda_{t'} (\Theta^T X_{t'} - b)^2}
               \cdot \left( \prod_t \int p_0(\gamma_t)\, e^{-\lambda_t \gamma_t}\, d\gamma_t \right) \left( \prod_{t'} \int p_0(\gamma_{t'})\, e^{-\lambda_{t'} \gamma_{t'}}\, d\gamma_{t'} \right),    (16)
where the index t runs over the labeled data and the index t′ runs over the unlabeled data.
Introducing the symbols:

    Z = \begin{pmatrix} \Theta \\ b - \mu_b \end{pmatrix}, \quad U = \begin{pmatrix} X \\ -1 \end{pmatrix},
    G_1 = \begin{pmatrix} \mathbf{1} & 0 \\ 0 & \sigma_b^{-2} \end{pmatrix}, \quad G_2 = \sum_{t'} \lambda_{t'} U_{t'} U_{t'}^T, \quad G_3 = G_1 - 2 G_2,
    \text{and} \quad W = \sum_t \lambda_t y_t U_t - 2 \mu_b \sum_{t'} \lambda_{t'} U_{t'},
Equation 16 can then be rewritten as a Gaussian integral in Z with quadratic form G_3 and linear term W. Carrying out the integration produces the partition function and, from it, the final objective function:

    J(\lambda) = -\frac{1}{2} W^T G_3^{-1} W + \frac{1}{2}\log\det G_3 + \mu_b \sum_t y_t \lambda_t - \mu_b^2 \sum_{t'} \lambda_{t'}
               + \sum_t \left[ \left(1+\frac{1}{c}\right)\lambda_t + \log\left(1-\frac{\lambda_t}{c}\right) \right]
               + \sum_{t'} \left[ \left(1+\frac{1}{c}\right)\lambda_{t'} + \log\left(1-\frac{\lambda_{t'}}{c}\right) \right]    (20)
As in the case of known labels discussed in the section herein entitled "M Step", the objective function J(λ) can be solved by applying similar methods. The difference is that the matrix G_3^{-1} in the quadratic form of the objective now has off-diagonal terms.
Beyond classification, the maximum entropy discrimination framework used by the present invention has several other applications. For example, MED can be used for the regression of data. In general, any kind of discriminant function and prior distribution can be used, as can regression and graphical models (T. Jebara, Machine Learning: Discriminative and Generative, Kluwer Academic Publishers) (Jebara).
The applications of embodiments of the present invention can be formulated as pure inductive learning problems with known labels, or as transductive learning problems with labeled and unlabeled training examples. In the embodiments below, the improvements of the transductive MED classification algorithm described in Table 3 apply equally to general transductive MED classification, transductive MED regression, and transductive MED learning of graphical models. Accordingly, for the purposes of this disclosure and the claims appended hereto, the word "classification" can encompass regression and graphical models.
M Step
According to Equation 11, the objective function of the M step is:

    J_M(\lambda) = -\frac{1}{2}\sum_{t,t'} \lambda_t \lambda_{t'} \langle y_t\rangle \langle y_{t'}\rangle K(X_t, X_{t'}) - \frac{\sigma_b^2}{2}\Big(\sum_t \lambda_t \langle y_t\rangle\Big)^2 - \mu_b \sum_t \lambda_t \langle y_t\rangle
                 + \sum_t \Big(1+\frac{1}{c}\Big)\lambda_t + \sum_t \log\Big(1-\frac{\lambda_t}{c}\Big),    (21)

subject to {λ_t | 0 ≤ λ_t ≤ c}, where the Lagrange multipliers λ_t are determined by maximizing J_M.
Ignoring the redundant constraint λ_t < c, the Lagrangian of the above optimization problem is:

    L(\lambda) = J_M(\lambda) + \sum_t \delta_t \lambda_t,
    \forall t: 0 \le \lambda_t \le c, \;\; \delta_t \ge 0, \;\; \delta_t \lambda_t = 0.    (22)
The KKT conditions, necessary and sufficient for optimality, are:
    \frac{\partial L(\lambda)}{\partial \lambda_t} = -\sum_{t'} \langle y_t\rangle \langle y_{t'}\rangle \lambda_{t'} K(X_t, X_{t'}) - \sigma_b^2 \langle y_t\rangle \sum_{t'} \lambda_{t'} \langle y_{t'}\rangle - \mu_b \langle y_t\rangle + \Big(1+\frac{1}{c}\Big) - \frac{1}{c-\lambda_t} + \delta_t
    = \langle y_t\rangle \Big( -\sum_{t'} \langle y_{t'}\rangle \lambda_{t'} K(X_t, X_{t'}) - \sigma_b^2 \sum_{t'} \lambda_{t'} \langle y_{t'}\rangle - \mu_b + \frac{1}{\langle y_t\rangle}\Big(1+\frac{1}{c}\Big) - \frac{1}{\langle y_t\rangle (c-\lambda_t)} \Big) + \delta_t
    = \langle y_t\rangle \big( -F_t - \sigma_b^2 \sum_{t'} \lambda_{t'} \langle y_{t'}\rangle - \mu_b \big) + \delta_t = 0
    \forall t: 0 \le \lambda_t \le c, \;\; \delta_t \ge 0, \;\; \delta_t \lambda_t = 0    (23)
where F_t is:

    F_t = \sum_{t'} \langle y_{t'}\rangle \lambda_{t'} K(X_t, X_{t'}) - \frac{1}{\langle y_t\rangle}\Big(1+\frac{1}{c}\Big) + \frac{1}{\langle y_t\rangle (c-\lambda_t)}    (24)
At the optimal solution, the bias equals the expected bias \langle b\rangle = \sigma_b^2 \sum_t \lambda_t \langle y_t\rangle + \mu_b, which yields:

    \langle y_t\rangle \big( -F_t - \langle b\rangle \big) + \delta_t = 0    (25)
These equations can be summarized by considering the two cases permitted by the constraint δ_t λ_t = 0: the first case with all λ_t = 0, and the second with all 0 < λ_t < c. A third case need not be considered, unlike in S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design, 1999 (Keerthi), where it applies to the SVM algorithm; in the present formulation, the potential function keeps λ_t ≠ c.
    \lambda_t = 0, \;\; \delta_t \ge 0 \;\Rightarrow\; (F_t + \langle b\rangle)\, \langle y_t\rangle \ge 0    (26)
    0 < \lambda_t < c, \;\; \delta_t = 0 \;\Rightarrow\; (F_t + \langle b\rangle) = 0    (27)
Some data points t may violate these conditions until the optimal solution is reached; that is, F_t ≠ -⟨b⟩ while λ_t is nonzero, or F_t⟨y_t⟩ < -⟨b⟩⟨y_t⟩ while λ_t is zero. Unfortunately, ⟨b⟩ cannot be computed without the optimal solution λ_t. A good solution to this problem borrows from the method of Keerthi (again incorporated herein by reference) by constructing the following three sets:
    I_0 = \{ t : 0 < \lambda_t < c \}    (28)
    I_1 = \{ t : \langle y_t\rangle > 0, \; \lambda_t = 0 \}    (29)
    I_4 = \{ t : \langle y_t\rangle < 0, \; \lambda_t = 0 \}    (30)
Using these sets, with the following definitions, the maximally violating instances of the optimality conditions can be bounded. Elements of I_0 are violators whenever they are not equal to -⟨b⟩; therefore, the minimum and maximum F_t from I_0 are candidate violators. Elements of I_1 are violators when F_t < -⟨b⟩; therefore, if it exists, the smallest element from I_1 is the maximally violating one. Finally, elements of I_4 are violators when F_t > -⟨b⟩, and the violator candidate from I_4 is its largest element. Accordingly, -⟨b⟩ is bounded by the following "minimum" and "maximum" values over these sets:
    -b_{up} = \min_t \{ F_t : t \in I_0 \cup I_1 \}    (31)
    -b_{low} = \max_t \{ F_t : t \in I_0 \cup I_4 \}    (32)
Since at the optimal solution -b_up and -b_low must be equal, namely to -⟨b⟩, reducing the gap between -b_up and -b_low drives the training algorithm toward convergence. In addition, the gap can be used as a measure for determining numerical convergence.
As stated above, the value of b = ⟨b⟩ is known only once convergence has been reached. A difference from the method of another embodiment is that only one example can be optimized at a time. Therefore, the training heuristic alternates between the examples in I_0 and all examples.
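A minimal sketch of this gap test, assuming per-example arrays F, λ, and ⟨y⟩ from the current M step (the array names are illustrative and the index sets are assumed non-empty):

    import numpy as np

    def bias_bounds(F, lam, y_exp, c):
        # Index sets per Equations 28-30
        I0 = (lam > 0) & (lam < c)
        I1 = (y_exp > 0) & (lam == 0)
        I4 = (y_exp < 0) & (lam == 0)
        b_up = -np.min(F[I0 | I1])     # Equation 31
        b_low = -np.max(F[I0 | I4])    # Equation 32
        return b_up, b_low

    def gap_converged(F, lam, y_exp, c, tol=1e-3):
        b_up, b_low = bias_bounds(F, lam, y_exp, c)
        # b_up >= b_low always holds; the gap closes at the optimum,
        # where both bounds equal <b>.
        return (b_up - b_low) <= tol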
E Step
According to Equation 12, the objective function of the E step is

    J_E(\lambda) = \sum_t \Big[ \Big(1-\frac{1}{c}\Big)\lambda_t + \log\Big(1-\frac{\lambda_t}{c}\Big) - \log\Big( P_{0,t}(+1)\, e^{\lambda_t s_t} + P_{0,t}(-1)\, e^{-\lambda_t s_t} \Big) \Big],    (33)

where s_t is the classification score of the t-th data point determined in the preceding M step. The Lagrange multipliers λ_t are determined by maximizing J_E.
Ignoring the redundant constraint λ_t < c, the Lagrangian of the above optimization problem is:

    L(\lambda) = J_E(\lambda) + \sum_t \delta_t \lambda_t,
    \forall t: 0 \le \lambda_t \le c, \;\; \delta_t \ge 0, \;\; \delta_t \lambda_t = 0    (34)
The KKT conditions, necessary and sufficient for optimality, are:

    \frac{\partial L(\lambda)}{\partial \lambda_t} = \Big(1-\frac{1}{c}\Big) - \frac{1}{c-\lambda_t} - s_t\, \frac{P_{0,t}(+1)\, e^{\lambda_t s_t} - P_{0,t}(-1)\, e^{-\lambda_t s_t}}{P_{0,t}(+1)\, e^{\lambda_t s_t} + P_{0,t}(-1)\, e^{-\lambda_t s_t}} + \delta_t = 0.    (35)
Since the objective factorizes over the examples, the solution can be completed by optimizing the KKT condition for each example's Lagrange multiplier in isolation.
For labeled examples, the expected label ⟨y_t⟩ has P_{0,t}(y_t) = 1 and P_{0,t}(-y_t) = 0, and the KKT condition simplifies to:

    \frac{\partial L_E(\lambda)}{\partial \lambda_t} = \Big(1-\frac{1}{c}\Big) - \frac{1}{c-\lambda_t} - s_t \langle y_t\rangle + \delta_t = 0    (36)

and yields the solution for the Lagrange multipliers of the labeled examples:

    \lambda_t = c - \frac{1}{\big(1-\frac{1}{c}\big) - \langle y_t\rangle s_t}    (37)
For unlabeled examples, Equation 35 does not admit a decomposed closed-form solution; instead, the Lagrange multiplier of each unlabeled example must be determined by a line search satisfying Equation 35.
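By way of illustration, the per-example E-step solve might look as follows, using the closed form of Equation 37 for labeled examples and a bracketed root search on the KKT condition of Equation 35 for unlabeled examples; the bisection routine and the clipping are assumptions of this sketch, not prescribed by the method.

    import numpy as np
    from scipy.optimize import brentq

    def e_step_multiplier(s, p_plus, c, labeled, y_exp=None):
        # s       : classification score from the preceding M step
        # p_plus  : label prior P_0(y = +1) of this example
        # labeled : True uses Equation 37; False solves Equation 35
        if labeled:
            lam = c - 1.0 / ((1.0 - 1.0 / c) - y_exp * s)   # Equation 37
            return min(max(lam, 0.0), c)                    # keep in [0, c]
        def kkt(lam):                  # left-hand side of Equation 35
            num = p_plus * np.exp(lam * s) - (1 - p_plus) * np.exp(-lam * s)
            den = p_plus * np.exp(lam * s) + (1 - p_plus) * np.exp(-lam * s)
            return (1.0 - 1.0 / c) - 1.0 / (c - lam) - s * num / den
        if kkt(0.0) <= 0.0:
            return 0.0                 # boundary solution, delta_t >= 0
        # kkt -> -inf as lam -> c, so a sign change is bracketed in (0, c)
        return brentq(kkt, 0.0, c * (1.0 - 1e-9))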
The following are a number of non-limiting examples, which may be realized by the methods enumerated above and derivations or variations thereof, as well as by other methods known in the art. Each example includes preferred operations, in conjunction with optional operations or parameters, which may be implemented on top of the basic preferred methodology.
In one embodiment, as shown in Fig. 10, labeled data points are received in step 1002, each data point having at least one label indicating whether the data point is a training example of a data point to be included in a particular category or a training example of a data point to be excluded from a particular category. In addition, unlabeled data points are received in step 1004, along with at least one predetermined cost factor for the labeled and unlabeled data points. The data points may comprise any medium, such as text, images, sounds, etc. Prior probability information of the labeled and unlabeled data points may also be received. Moreover, the label of an included training example may be mapped to a first numerical value, e.g., +1, and an excluded training example may be mapped to a second numerical value, e.g., -1. Furthermore, the labeled data points, the unlabeled data points, the input data points, and the at least one predetermined cost factor of the labeled and unlabeled data points may be stored in computer memory.
Further, in step 1006, a transductive MED classifier is trained by iterative computation, using the at least one cost factor and the labeled and unlabeled data points as training examples. For each iterative computation, the cost factor of each unlabeled data point is adjusted as a function of an expected label value, e.g., the absolute value of the expected label of the data point, and the data point label prior probabilities are adjusted according to estimates of the class membership probabilities of the data points, thereby ensuring stability. Moreover, the transductive classifier can learn using the prior probability information of the labeled and unlabeled data, which further improves stability. The iterative step of training the transductive classifier may be repeated until a convergence of data values is reached, e.g., when the change of the decision function of the transductive classifier falls below a predetermined threshold, when the change of the determined expected label values falls below a predetermined threshold, etc.
Additionally, in step 1008, the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points, and input data points. The input data points may be received before or after the classifier is trained, or not at all. Moreover, using the labeled and unlabeled data points as learning examples according to their expected labels, a decision function may be determined that, given the included and excluded training examples, minimizes the KL divergence to the prior probability distribution of the decision function parameters. In other words, the decision function may be determined using a multinomial distribution of the decision function parameters and minimal KL divergence.
In step 1010, the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process. The system may be remote or local. Examples of derivatives of the classification may be, but are not limited to, the classified data points themselves, a representation or identifier of the classified data points or of the host file/document, etc.
In another embodiment, a computer system executes computer executable program code. The program code includes instructions for accessing labeled data points stored in computer memory, each of the labeled data points having at least one label indicating whether the data point is a training example of a data point to be included in a specified category or a training example of a data point to be excluded from a specified category. In addition, the computer code includes instructions for accessing unlabeled data points from computer memory, and instructions for accessing from computer memory at least one predetermined cost factor of the labeled and unlabeled data points. Prior probability information of the labeled and unlabeled data points stored in computer memory may also be accessed. Moreover, the label of an included training example may be mapped to a first numerical value, e.g., +1, and an excluded training example may be mapped to a second numerical value, e.g., -1.
Further, the program code includes instructions that use the at least one stored cost factor and the stored labeled and unlabeled data points as training examples to train a transductive classifier by iterative computation. For each iterative computation, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of that data point, e.g., the absolute value of the expected label of the data point. Also, for each iteration, the prior probability information may be adjusted according to estimates of the class membership probabilities of the data points. The iterative step of training the transductive classifier may be repeated until a convergence of data values is reached, e.g., when the change of the decision function of the transductive classifier falls below a predetermined threshold, when the change of the determined expected label values falls below a predetermined threshold, etc.
In addition, the program code includes instructions for using the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points, and instructions for outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process. Moreover, using the labeled and unlabeled data points as learning examples according to their expected labels, a decision function may be determined that, given the included and excluded training examples, minimizes the KL divergence to the prior probability distribution of the decision function parameters.
In another embodiment, a data processing apparatus includes at least one memory for storing: (i) labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example of a data point to be included in a specified category or a training example of a data point to be excluded from a specified category; (ii) unlabeled data points; and (iii) at least one predetermined cost factor of the labeled and unlabeled data points. The memory may also store prior probability information of the labeled and unlabeled data points. Moreover, the label of an included training example may be mapped to a first numerical value, e.g., +1, and an excluded training example may be mapped to a second numerical value, e.g., -1.
In addition, the data processing apparatus includes a transductive classifier trainer that uses the at least one cost factor and the labeled and unlabeled data points as training examples to iteratively train the transductive classifier using transductive maximum entropy discrimination (MED). Additionally, in each MED iterative computation, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of that data point, e.g., the absolute value of the expected label of the data point. Also, in each MED iterative computation, the prior probability information may be adjusted according to estimates of the class membership probabilities of the data points. The apparatus may also include a device for determining convergence of data values, e.g., when the change of the decision function computed by the transductive classifier falls below a predetermined threshold, when the change of the determined expected label values falls below a predetermined threshold, etc., and for terminating the computation once convergence is determined.
In addition, the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points, and input data points. Moreover, using the labeled and unlabeled data points as learning examples according to their expected labels, a decision function may be determined that, given the included and excluded training examples, minimizes the KL divergence to the prior probability distribution of the decision function parameters. Also, the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
In another embodiment, an article of manufacture includes a computer-readable program storage medium tangibly embodying one or more programs of computer-executable instructions for performing a method of data classification. In use, labeled data points are received, each labeled data point having at least one label indicating whether the data point is a training example of a data point to be included in a specified category or a training example of a data point to be excluded from a specified category. In addition, unlabeled data points are received, along with at least one predetermined cost factor of the labeled and unlabeled data points. Prior probability information of the labeled and unlabeled data points may also be stored in computer memory. Moreover, the label of an included training example may be mapped to a first numerical value, e.g., +1, and an excluded training example may be mapped to a second numerical value, e.g., -1.
Further, a transductive classifier is trained by an iterative maximum entropy discrimination (MED) computation, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples. In each iteration of the MED computation, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of that data point, e.g., the absolute value of the expected label of the data point. Also, in each MED iterative computation, the prior probability information may be adjusted according to an estimate of the class membership probability of the data point. The iterative step of training the transductive classifier may be repeated until a convergence of data values is reached, e.g., when the change of the decision function of the transductive classifier falls below a predetermined threshold, when the change of the determined expected label values falls below a predetermined threshold, etc.
In addition, input data points are accessed from computer memory, and the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points, and the input data points. Moreover, using the labeled and unlabeled data points as learning examples according to their expected labels, a decision function may be determined that, given the included and excluded training examples, minimizes the KL divergence to the prior probability distribution of the decision function parameters. Also, the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
In another embodiment, a method for classifying unlabeled data in a computer-based system is provided. In use, labeled data points are received, each of the labeled data points having at least one label indicating whether the data point is a training example of a data point to be included in a specified category or a training example of a data point to be excluded from a specified category.
In addition, labeled and unlabeled data points are received, and label prior probability information of the labeled and unlabeled data points is also received. Moreover, at least one predetermined cost factor of the labeled and unlabeled data points is also received.
Moreover, the expected label of each labeled and unlabeled data point is determined according to the label prior probability of that data point. The following sub-steps are then repeated until the data values have sufficiently converged:
● for each unlabeled data point, generate an adjusted cost value proportional to the absolute value of the expected label of the data point;
● train a maximum entropy discrimination (MED) classifier by determining a decision function that, given the training examples to be included and the training examples to be excluded, uses the labeled and unlabeled data points as training examples according to their expected labels, the decision function minimizing the KL divergence to the prior probability distribution of the decision function parameters;
● using the trained classifier, determine the classification scores of the labeled and unlabeled data points;
● calibrate the outputs of the trained classifier to class membership probabilities;
● update the label prior probabilities of the unlabeled data points according to the determined class membership probabilities;
● using the updated label prior probabilities and the previously determined classification scores, determine the label and margin probability distributions using maximum entropy discrimination (MED);
● compute new expected labels using the previously determined label probability distribution; and
● update the expected label of each data point by interpolating the expected labels of the preceding iteration with the new expected labels.
Moreover, the classification of the input data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
Convergence is reached when the change of the decision function falls below a predetermined threshold. Additionally, convergence may also be reached when the change of the determined expected label values falls below a predetermined threshold. Moreover, the label of an included training example may have any value, e.g., +1, and an excluded training example may have any value, e.g., -1.
In one embodiment of the present invention, a method for classifying documents is shown in Fig. 11. In use, in step 1100, at least one seed document with a known confidence level is received, along with unlabeled documents and at least one predetermined cost factor. The seed document and the other items may be received from computer memory, from a user, over a network connection, etc., and may be received in response to a request from the system performing the method. The at least one seed document may carry an indicative label as to whether the document is to be included in a particular category, may contain a list of keywords, or may have any other feature that assists in classifying documents. Further, in step 1102, a transductive classifier is trained by iterative computation, using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein, for each iterative computation, the cost factor is adjusted as a function of an expected label value. Data point label prior probabilities of the labeled and unlabeled documents may also be received, wherein, for each iterative computation, the data point label prior probabilities may be adjusted according to estimates of the class membership probabilities of the data points.
In addition, after at least some of the iterations, confidence scores are stored for the unlabeled documents in step 1104, and in step 1106 the identifiers of the unlabeled documents with the highest confidence scores are output to at least one of a user, another system, and another process. An identifier may be an electronic copy of the document itself, a portion thereof, its title, its name, a pointer to the document, etc. Moreover, the confidence scores may be stored after each iteration, wherein, after each iteration, the identifiers of the unlabeled documents with the highest confidence scores are output.
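A minimal sketch of this flow, with train_iterations and report as assumed hooks standing in for the iterative transductive training of step 1102 and the output of step 1106:

    def query_documents(seed_docs, unlabeled_docs, c, train_iterations, report, top_k=10):
        # train_iterations : assumed generator yielding a {doc_id: confidence}
        #                    mapping after each label induction step
        # report           : assumed output hook (user, system, or process)
        history = []
        for step, confidences in enumerate(train_iterations(seed_docs, unlabeled_docs, c)):
            history.append(confidences)                                   # step 1104
            best = sorted(confidences, key=confidences.get, reverse=True)[:top_k]
            report(step, best)                                            # step 1106
        return history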
One embodiment of the present invention can query patterns linking an original document to the remaining documents. Areas where such pattern-based querying proves especially valuable include the following. For example, in pre-trial legal discovery, large numbers of documents must be studied for possible links to the lawsuit at hand, the ultimate purpose being to discover the "smoking gun". In another example, a task common to inventors, patent examiners, and patent attorneys is assessing the novelty of a technology through a search of the prior art. In particular, this task is to search all published patents and other publications and to find, within this collection, the documents likely to be relevant to the particular technology being examined for novelty.
The task of querying consists of finding a document or a group of documents within a collection of data. Given an original document or concept, a user may wish to find the documents related to that original document or concept. However, the notion of the relationship between the original document or concept and the target documents, i.e., the documents the query is to find, may be best understood only after the query has been performed. By learning from labeled and unlabeled documents, concepts, etc., the present invention can learn the patterns and relationships between one or more original documents and the target documents.
In another embodiment of the present invention, a method for analyzing documents associated with legal discovery is shown in Fig. 12. In use, documents associated with a legal matter are received in step 1200. The documents may include electronic copies of the documents themselves, portions thereof, their titles, their names, pointers to the documents, etc. In addition, in step 1202, a document classification method is performed on the documents. Further, in step 1204, identifiers of at least some of the documents are output based on their classification. Optionally, an indication of the links between the documents is also output.
The document classification method may include any type of process, such as a transductive process. For example, any of the aforementioned inductive or transductive methods may be used. In a preferred approach, a transductive classifier is trained by iterative computation using at least one predetermined cost factor, at least one seed document, and the documents associated with the legal matter. For each iterative computation, the cost factor is preferably adjusted as a function of an expected label value, and the trained classifier is used to classify the received documents. The process may also include receiving data point label prior probabilities of the labeled and unlabeled documents, wherein, for each iterative computation, the data point label prior probabilities are adjusted according to estimates of the class membership probabilities of the data points. In addition, the document classification method may include one or more support vector machine processes and maximum entropy discrimination processes.
In another embodiment, a method for analyzing prior art documents is shown in Fig. 13. In use, in step 1300, a classifier is trained based on a search query. In step 1302, a plurality of prior art documents is accessed. The prior art may include any information made available to the public in any form before a given date; it may also include information not available to the public in any form before a given date. Illustrative prior art documents may be documents of any kind, such as publications of a patent office, data retrieved from a database, a collection of prior art, web pages, etc. Further, in step 1304, a document classification method is performed on at least some of the prior art documents using the classifier, and in step 1306, identifiers of at least some of the prior art documents are output based on their classification. The document classification technique may include one or more processes, including a support vector machine process, a maximum entropy discrimination process, or any of the aforementioned inductive or transductive methods. Alternatively or additionally, an indication of the links between the documents may also be output. In another embodiment, relevance scores of at least some of the prior art documents are output based on their classification.
The search query may include at least a portion of a patent disclosure. Illustrative patent disclosures include an invention disclosure authored by the inventor, a provisional patent application, a non-provisional patent application, a foreign patent or patent application summarizing the invention, etc.
In a preferred approach, the search query includes at least a portion of a claim of a patent or patent application. In another approach, the search query includes at least a portion of the abstract of a patent or patent application. In yet another approach, the search query includes at least a portion of the summary of the invention of a patent or patent application.
Fig. 27 shows a method for matching documents with claims. In step 2700, a classifier is trained based on at least one claim of a patent or patent application. Accordingly, one or more claims, or portions thereof, may be used to train the classifier. In step 2702, a plurality of documents is accessed. The documents may include prior art documents, documents describing potentially infringing or anticipatory products, etc. In step 2704, a document classification method is performed on at least some of the documents using the classifier. In step 2706, identifiers of at least some of the documents are output based on their classification. Relevance scores of at least some of the documents may also be output based on their classification.
One embodiment of the present invention can be used for the classification of patent applications. In the United States, for example, patents and patent applications are currently classified according to their subject matter using the United States Patent Classification (USPC) system. This task is presently performed manually, and is therefore expensive and time-consuming; manual classification is also prone to error. A further complication of the task is that a patent or patent application may be assigned to multiple classes.
According to one embodiment, Fig. 28 shows a method for classifying a patent application. In step 2800, a classifier is trained based on a plurality of documents known to belong to a particular patent classification. The documents may typically be patents or patent applications (or portions thereof), but may also be summary documents describing the subject matter of the particular patent classification. In step 2802, at least a portion of a patent or patent application is received. The portion may include the claims, the summary of the invention, the abstract, the specification, the title, etc. In step 2804, a document classification method is performed on the at least a portion of the patent or patent application using the classifier. In step 2806, the classification of the patent or patent application is output. Optionally, a user may manually review the classification of some or all of the patent applications.
The document classification method is preferably a Yes/No classification method. In other words, if the probability that the document is in the correct class is higher than a threshold, the decision is Yes, and the document belongs to the class. If the probability that the document is in the correct class is lower than the threshold, the decision is No, and the document does not belong to the class.
Fig. 29 shows another method for classifying a patent application. In step 2900, a document classification method is performed on at least a portion of a patent or patent application using a classifier that has previously been trained based on at least one document associated with a particular patent classification. Again, the document classification method is preferably a Yes/No classification method. In step 2902, the classification of the patent or patent application is output.
In both methods shown in Figs. 28 and 29, the respective method may be repeated using a different classifier, the different classifier having previously been trained based on a plurality of documents known to belong to a different patent classification.
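For illustration only, repeating the Yes/No decision across per-class classifiers might be sketched as follows; the membership_probability interface is an assumption of this sketch, and a document may legitimately collect several classes, consistent with multi-class assignment in the USPC:

    def classify_patent(document, classifiers, threshold=0.5):
        # classifiers : maps a class name to an assumed object exposing
        #               membership_probability(document); one Yes/No
        #               decision per patent class (Figs. 28-29)
        return [name for name, clf in classifiers.items()
                if clf.membership_probability(document) >= threshold]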
Formally, the classification of a patent should be based on the claims. However, it may also be desirable to perform matching between any IP-related content and any other IP-related content. As an example, one approach trains on patent specifications and classifies a patent application according to its claims. Another approach trains on specifications and claims, and classifies based on abstracts. In a particularly preferred approach, whichever portion of a patent or application is used for training, the same type of content is used at classification time; i.e., if the system is trained on claims, classification is based on claims.
The document classification method may include any type of process, such as a transductive process. For example, any of the above-described inductive or transductive methods may be used. In a preferred approach, the classifier may be a transductive classifier trained by iterative computation using at least one predetermined cost factor, at least one seed document, and the prior art documents, wherein, for each iterative computation, the cost factor is adjusted as a function of an expected label value, and the trained classifier may be used to classify the prior art documents. Data point label prior probabilities of the seed documents and the prior art documents may also be received, wherein, for each iterative computation, the data point label prior probabilities may be adjusted according to estimates of the class membership probabilities of the data points. A seed document may be any document, such as a publication of a patent office, data retrieved from a database, a collection of prior art, a website, a patent disclosure, etc.
In one approach, Fig. 14 describes an embodiment of the present invention. In step 1401, a set of data is read in; within this set of data, the discovery of the documents relevant to the user is desired. In step 1402, one or more initial seed documents are labeled. The seed documents may be documents of any kind, such as publications of a patent office, data retrieved from a database, a collection of prior art, websites, etc. A list of different keywords, or documents provided by the user, may also prime the transductive process. In step 1406, a transductive classifier is trained using the labeled data and the set of unlabeled data within the given collection. In each label induction step of the iterative transductive process, the confidence scores determined in the label induction step are stored. In step 1408, once training is complete, the documents that obtained high confidence scores in the label induction steps are displayed to the user. These documents with high confidence scores represent the documents relevant to the user's query purpose. The display may follow the chronological order of the label induction steps, starting with the initial seed documents and continuing to the last group of documents found in the final label induction step.
Another embodiment of the invention relates to data cleanup and accurate classification, for example in combination with automated business processes. The cleanup and classification methods can include any type of process, such as a transductive process; for example, any of the transductive or inductive methods described above may be used. In a preferred approach, depending on the desired cleanliness of the database, the keys entering the database are used as labels with associated confidence levels. These labels, together with their associated confidence levels, i.e. the expected labels, are then used to train a transductive classifier, and the classifier corrects the labels (keys) to achieve more reliable management of the data in the database. For example, invoices must first be classified according to the issuing company or individual before automatic data extraction can be performed, such as determining the total amount, order number, product quantities, shipping address, etc. Typically, building an automatic classification system requires training examples. However, the training examples provided by customers often contain misclassified documents or other noise, such as fax cover pages; to achieve accurate classification, these documents must be identified and removed before the automatic classification system is trained. In another embodiment, in the medical records field, the approach helps detect inconsistencies between a report and the diagnosis report written by the physician.
In another embodiment, it is well known that patent offices must undergo continual reclassification processes in which they (1) assess an existing branch of their classification scheme that has become unbalanced, (2) restructure that branch to evenly distribute overcrowded nodes, and (3) reclassify the existing patents into the new structure. The transductive learning methods described here can be used by patent offices, and by the companies to which they outsource this work, to re-evaluate their classification schemes and to help them (1) build a new classification scheme for a given main classification and (2) reclassify the existing patents.
Transduction learns from both labeled and unlabeled data and thus offers a smooth transition from labeled to unlabeled. At one end of the spectrum is labeled data with perfect prior knowledge, e.g. every given label is correct. At the other end is unlabeled data with no prior knowledge at all. Data sets whose organization has been disturbed to some degree, i.e. that contain misclassified data, lie somewhere between the two extremes. The labels given by the organization of the data can to some extent be considered correct, but not completely. Therefore, transduction can be used to clean up existing data set compilations, by assuming a certain degree of error within the given data organization and interpreting those errors as uncertainty in the prior knowledge of the label distribution.
In one embodiment, a method for cleaning up data is shown in Figure 15. In use, in step 1500, a plurality of labeled data items is received, and in step 1502, a subset of the data items is selected for each of a plurality of categories. In step 1504, the uncertainty of the data items in each subset is set to about zero, and in step 1506, the uncertainty of the data items not in the subsets is set to a preset value that is not about zero. Further, in step 1508, a transductive classifier is trained by iterative calculation, using the uncertainties and, as training examples, the data items both in and not in the subsets, and in step 1510, the trained classifier classifies each of the labeled data items. In step 1512, the classification of the input data items, or a derivative thereof, is output to at least one of a user, another system, and another process.
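A sketch of the Figure 15 cleanup flow, with the stated uncertainties realized as per-sample training weights (an assumption; the confidence thresholds are likewise illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def clean_labels(X, y, trusted_idx, default_uncertainty=0.4, rounds=3):
    """Trusted subsets get uncertainty ~0; everything else gets a preset,
    non-zero uncertainty, realized here as a down-weight during training."""
    uncertainty = np.full(len(y), default_uncertainty)
    uncertainty[trusted_idx] = 1e-6                 # "about zero"
    trusted = np.zeros(len(y), dtype=bool)
    trusted[trusted_idx] = True
    y_work = np.array(y)
    for _ in range(rounds):
        clf = LogisticRegression().fit(X, y_work,
                                       sample_weight=1.0 - uncertainty)
        proba = clf.predict_proba(X)
        conf = proba.max(axis=1)
        relabel = (conf > 0.9) & ~trusted           # flip confident disagreements
        y_work[relabel] = clf.classes_[proba.argmax(axis=1)][relabel]
    review = np.where(conf < 0.6)[0]                # low confidence -> user review
    return y_work, review
```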
Further, the subsets can be selected at random, or can be selected and verified by a user. The labels of at least some of the data items may be changed based on their classification. Also, after classification, identifiers of the data items having confidence levels below a preset threshold are output to the user. The identifier can be an electronic copy of the document itself, a portion thereof, its title, its name, a pointer to the document, etc.
In one embodiment of the invention, as shown in Figure 16, in step 1600, two options for a cleanup process are presented to the user. In step 1602, one option is fully automatic cleanup, in which for each concept or category a certain number of documents is selected at random and assumed to be correctly organized. Alternatively, in step 1604, a certain number of documents can be labeled, with one or more label assignments manually inspected and verified to confirm whether each concept or category is accurately organized. In step 1606, an estimate of the noise level in the data is received. In step 1610, the transductive classifier is trained using the verified data from step 1608 (manually inspected or randomly selected) together with the unverified data. Once training is finished, the documents are reorganized according to the new labels. In step 1612, documents whose label assignments have confidence levels below a specific threshold are displayed to the user for manual inspection. In step 1614, documents whose label assignments have confidence levels above a specific threshold are automatically corrected according to the transduced label assignments.
In another embodiment, a method for managing medical records is shown in Figure 17. In use, in step 1700, a classifier is trained based on medical diagnoses, and in step 1702, a plurality of medical records is accessed. In step 1704, a document classification method is performed on the medical records using the classifier, and in step 1706, an identifier of at least one medical record having a low probability of correlation with its medical diagnosis is output. The document classification method includes any type of process, such as a transductive process, and can include any one or more of the inductive or transductive methods described above, including a support vector machine process, a maximum entropy discrimination process, etc.
In one embodiment, the classifier can be a transductive classifier, trained by iterative calculation using at least one preset cost factor, at least one seed document, and the medical records, where for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier can be used to classify the medical records. Data point label prior probabilities for the seed documents and medical records can also be received, where for each iteration the label prior probabilities are adjusted according to an estimate of the group membership probabilities of the data points.
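Assuming a classifier and vectorizer trained as above, the check of steps 1704-1706 might be sketched as follows; the record layout and the probability threshold are illustrative assumptions:

```python
def flag_inconsistent(clf, vec, records, p_min=0.2):
    """records: list of (record_id, text, recorded_diagnosis) tuples.
    Returns identifiers of records whose text the trained classifier
    considers unlikely to match the recorded diagnosis."""
    proba = clf.predict_proba(vec.transform([text for _, text, _ in records]))
    col = {c: j for j, c in enumerate(clf.classes_)}
    return [rid for (rid, _, diag), p in zip(records, proba)
            if col.get(diag) is None or p[col[diag]] < p_min]
```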
Another embodiment of the invention addresses dynamically drifting classification concepts. For example, in forms-processing applications, a classifier uses the layout information and/or content information of a document to classify the document for further processing. In many applications, documents are not fixed but change over time; for example, the content and/or layout of a document may change owing to new legislation. Transductive classification adapts to these changes automatically, producing the same or similar classification accuracy unaffected by the drifting classification concept. In contrast to rule-based systems or inductive classification methods, no manual adjustment is needed, and accuracy does not degrade from concept drift. One example of this is invoice processing, which traditionally involves inductive learning or rule-based systems that exploit the invoice layout. With these traditional systems, if the layout changes, the system must be manually reset by labeling new training data or defining new rules. The use of transduction, however, automatically adapts to small changes in the invoice layout, making manual resets unnecessary. In another embodiment, transductive classification can be used to analyze customer complaints in order to monitor changes in the nature of those complaints; for example, a company can automatically link product changes to customer complaints.
Transduction can also be used for classifying news articles. For example, news articles about war and terrorist attacks — beginning with the terrorist attacks of September 11, 2001 and the war in Afghanistan, up to news stories about the current situation in Iraq — can be identified automatically using transduction.
In another embodiment, biological classification (alpha taxonomy) can change over time as new species arise through evolution and other species become extinct. As the classification concepts change over time, the classification schema or the rules of the taxonomy can change dynamically.
By treating the input data to be classified as unlabeled data, transduction can identify drifting classification concepts and thus automatically adapt a changing classification schema. For example, Figure 18 shows an embodiment of the invention using transduction with a drifting classification concept. A group of documents Dt enters the system at time tt, as shown in step 1802. In step 1804, a transductive classifier Ct is trained using the labeled and unlabeled data accumulated so far, and in step 1806, the documents in group Dt are classified. In the manual mode, documents determined in step 1808 to have confidence levels below a user-provided threshold are presented to the user for manual inspection in step 1810. In the automatic mode, as shown in step 1812, a document with a confidence level below the threshold triggers the creation of a new category; the category is added to the system, and the document is assigned to it. In steps 1820A-B, documents with confidence levels above the selected threshold are classified into the current categories 1 to N. The documents of all current categories at time tt are reclassified by classifier Ct in step 1822, and in steps 1824 and 1826, any documents no longer classified into their assigned categories are moved to the new category.
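The automatic mode of steps 1812-1826 might be sketched as follows; routing all low-confidence documents to a single new category and the threshold value are illustrative assumptions:

```python
import numpy as np

def classify_batch(clf, X_batch, threshold=0.7, new_label="NEW-CLASS"):
    """Documents below the confidence threshold are routed to a new
    category; the rest go to the existing categories. Retraining on the
    grown corpus (step 1822) follows separately."""
    proba = clf.predict_proba(X_batch)
    conf = proba.max(axis=1)
    labels = clf.classes_[proba.argmax(axis=1)].astype(object)
    labels[conf < threshold] = new_label
    return labels, conf
```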
In another embodiment, a method for adapting to variations in document content is shown in Figure 19. Document content can include, but is not limited to, image content, text content, layout, numbering, etc. Examples of variation include changes over time, changes in style (one or more documents processed by two or more people), changes in the application process, variations in layout, etc. In step 1900, at least one labeled seed document, unlabeled documents, and at least one preset cost factor are received. The documents can include, but are not limited to, customer complaints, invoices, form documents, receipts, etc. In step 1902, a transductive classifier is trained using the at least one preset cost factor, the at least one seed document, and the unlabeled documents. In step 1904, unlabeled documents with confidence levels above a preset threshold are classified into a plurality of categories using the classifier, and in step 1906, at least some of the classified documents are reclassified into a plurality of categories using the classifier. Further, in step 1908, identifiers of the classified documents are output to at least one of a user, another system, and another process. The identifier can be an electronic copy of the document itself, a portion thereof, its title, its name, a pointer to the document, etc. Also, product variations can be linked to customer complaints, and so on.
In addition, unlabeled documents with confidence levels below a predetermined threshold can be moved into one or more new categories. A transductive classifier can be trained by iterative calculation using at least one preset cost factor, at least one seed document, and the unlabeled documents, where for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier is used to classify the unlabeled documents. Data point label prior probabilities for the seed documents and unlabeled documents can also be received, where for each iteration the label prior probabilities are adjusted according to an estimate of the group membership probabilities of the data points.
In another embodiment, a method for adapting patent classification to variations in document content is shown in Figure 20. In step 2000, at least one labeled seed document and unlabeled documents are received. The unlabeled documents can include any type of document, e.g. patent applications, legal documents, information disclosure statements, document amendments, etc. The seed documents can include patents, patent applications, and so forth. In step 2002, a transductive classifier is trained using the at least one seed document and the unlabeled documents, and the classifier is used to classify unlabeled documents with confidence levels above a predetermined threshold into a plurality of existing categories. The classifier can be of any type, such as a transductive classifier, and the document classification method can be any method, such as a support vector machine method or a maximum entropy discrimination method; for example, any of the inductive or transductive methods described above may be used.
In step 2004, the classifier is used to classify unlabeled documents with confidence levels below the predetermined threshold into at least one new category, and in step 2006, the classifier is used to reclassify at least some of the previously classified documents into the existing categories and the at least one new category. Further, in step 2008, identifiers of the classified documents are output to at least one of a user, another system, and another process. Moreover, the transductive classifier can be trained by iterative calculation using at least one preset cost factor, the search query, and the documents, where for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier can be used to classify the documents. Further, data point prior probabilities for the search query and the documents can be received, where for each iteration the prior probabilities are adjusted according to an estimate of the group membership probabilities of the data points.
In another embodiment of the invention, document drift in the field of document separation is described. One example application is the processing of mortgage documents. A series of different loan documents, such as loan applications, approvals, requests, and so on, is scanned, and before further processing, the distinct documents within the series of images must be identified. The documents in use are not fixed but can change over time; for example, the tax forms used in loan documents can change as laws and regulations change.
Document separation solves the problem of finding document or sub-document boundaries in a series of images. Typical producers of such image series are digital scanners and multi-function peripherals (MFPs). As in the classification embodiments, transduction can be used for document separation to handle documents and the drift of their boundaries over time. Static separation systems, such as rule-based systems or systems based on inductive learning, cannot automatically adapt to drifting separation concepts. Whenever drift occurs, the performance of these static separation systems degrades over time. To maintain the initial level of performance, either the rules must be manually adjusted (for rule-based systems) or new documents must be manually labeled and the system relearned (for inductive learning). Either way is time-consuming and costly. Applying transduction to document separation improves the system so that it automatically adapts to drift in the separation concept.
In one embodiment, a method for separating documents is shown in Figure 21. In step 2100, labeled data is received, and in step 2102, a group of unlabeled documents is received. The data and documents can include legal discovery documents, official notices, web data, attorney correspondence, etc. In step 2104, based on the labeled data and the unlabeled documents, the probabilistic classification rules are adjusted using transduction, and in step 2106, the weights used for document separation are updated according to the probabilistic classification rules. In step 2108, the positions of separations in the group of documents are determined, and in step 2110, indicators of the determined separation positions are output to at least one of a user, another system, and another process. The indicator can be an electronic copy of the document itself, a portion thereof, its title, its name, a pointer to the document, etc. Further, in step 2112, the documents are marked with codes, the codes being associated with the indicators.
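A minimal sketch of steps 2104-2108, under the assumption that a logistic model over page-transition features plays the role of the probabilistic classification rules:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def find_separations(X_boundaries, y_boundaries, X_new, threshold=0.5):
    """X_boundaries: features of labeled page transitions; y_boundaries: 1
    where a new document starts. Returns the predicted separation positions
    in the new page sequence together with per-transition separation weights."""
    clf = LogisticRegression().fit(X_boundaries, y_boundaries)
    weights = clf.predict_proba(X_new)[:, 1]   # P(separation) per transition
    return np.where(weights >= threshold)[0], weights
```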
Figure 22 shows an implementation of the classification method and apparatus used for document separation in the present invention. After digital scanning, automatic document separation is used to reduce the manual work involved in separating and identifying documents. Using an inference algorithm, the document separation method combines automatic separation with classification rules to organize multi-page documents, using the classification methods described here to deduce the most likely separation from all available information. As shown in Figure 22, in one example of the invention, the transductive MED classification method is used for document separation. Specifically, document pages 2200 are placed into a digital scanner 2202 or MFP and converted into a set of digital images 2204. The document pages can be pages of any type of document, such as Patent Office publications, data taken from a database, collections of prior art, websites, etc. In step 2206, the set of digital images is input to dynamically adapt the probabilistic classification rules using transduction; step 2206 uses the image set 2204 as unlabeled data together with labeled data 2208. In step 2210, the weights in the probabilistic network used for automatic document separation are updated based on the dynamically adapted classification rules. The output of step 2212 is the dynamically adapted automatic insertion of separator images: the set of digital pages 2214 is automatically interleaved with separator pages 2216, which are automatically inserted into the image sequence in step 2212. In one embodiment of the invention, the software-generated separator page 2216 can also indicate the type of the document that immediately follows it. The system described here automatically adapts to the drifting separation concepts that documents exhibit over time, without the loss of separation accuracy suffered by static rule-based systems or inductive machine-learning-based methods. In form processing applications, a common example of a drifting separation or classification concept is, as mentioned above, documents changing owing to new laws and regulations.
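The separator insertion of step 2212 might be sketched as follows, assuming the boundary flags come from a separation model like the one above:

```python
def insert_separators(images, boundary_flags, separator="SEPARATOR-PAGE"):
    """Interleave separator pages in front of each predicted document start;
    boundary_flags[i] is True if images[i] begins a new document."""
    out = []
    for img, starts_new in zip(images, boundary_flags):
        if starts_new and out:          # no separator before the first page
            out.append(separator)
        out.append(img)
    return out
```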
In addition, the system shown in Figure 22 can be modified into the system shown in Figure 23, in which pages 2300 placed into a digital scanner 2302 or MFP are converted into a set of digital images 2304. The set of digital images is input in step 2306 to dynamically adapt the probabilistic classification rules using transduction; step 2306 uses the image set 2304 as unlabeled data together with labeled data 2308. In step 2310, the weights in the probabilistic network for automatic document separation are updated according to the dynamically adapted classification rules. In step 2312, rather than inserting separator page images as described for Figure 22, the process dynamically adapts to automatically insert separation information and marks the document images with descriptive codes. The document page images can thus be input into an image processing database 2316, and the documents can be accessed by software identifiers.
Another embodiment of the invention can use transduction for face recognition. As described above, using transduction has many advantages; for example, only a relatively small number of training examples is needed, and unlabeled examples can be used in training. Given these advantages, transductive face recognition can be used in criminal investigations.
For example, the Department of Homeland Security must ensure that terrorists cannot board commercial airliners. Part of the airport screening process can be to capture a photograph of each passenger at the airport security checkpoint and attempt to identify the person. The system can initially be trained with a small number of examples drawn from the limited available photographs of suspected terrorists. Unlabeled photographs of terrorists in other law-enforcement databases can likewise be used for training. A transductive trainer can therefore not only build a functional face recognition system from very sparse data, but can also use unlabeled examples from other sources to boost performance. After processing the photographs captured at the security checkpoint, a transductive system can identify suspects more accurately than an inductive system.
In another embodiment, a method for face recognition is shown in Figure 24. In step 2400, at least one labeled seed image of a face is received, the seed image having a known confidence level. The at least one seed image can carry a label indicating whether the image belongs to a specified category. Also in step 2400, unlabeled images are received, e.g. from police departments, government agencies, missing children databases, airport security, or anywhere else, and at least one preset cost factor is received. In step 2402, a transductive classifier is trained by iterative calculation using the at least one preset cost factor, the at least one seed image, and the unlabeled images, where for each iteration the cost factor is adjusted as a function of an expected label value. After at least some of the iterations, in step 2404, confidence scores are stored for the unlabeled images.
Further, in step 2406, the identifier of the unlabeled image with the highest confidence score is output to at least one of a user, another system, and another process. The identifier can be an electronic copy of the file itself, a portion thereof, its title, its name, a pointer to the file, etc. Confidence scores can be stored after each iteration, where after each iteration the identifier of the unlabeled image with the highest confidence score is output. Furthermore, data point label prior probabilities for the labeled and unlabeled images can be received, where for each iteration the label prior probabilities can be adjusted according to an estimate of the group membership probabilities of the data points. Further, a third unlabeled image of a face, e.g. from the airport security checkpoint described above, can be received; the third unlabeled image can be compared with at least some of the images having the highest confidence scores, and if the face in the third unlabeled image is believed to be identical to the face in a seed image, an identifier of the third unlabeled image can be output.
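The comparison of the third unlabeled image with the stored high-confidence images might be sketched as follows; cosine similarity over face feature vectors is an assumed stand-in for the matching rule:

```python
import numpy as np

def match_probe(probe_vec, gallery_vecs, gallery_scores, top_k=10, sim_min=0.8):
    """Compare a new (third) face vector against the top_k gallery images
    ranked by stored confidence score; returns (matched?, best similarity)."""
    top = gallery_vecs[np.argsort(gallery_scores)[::-1][:top_k]]
    sims = top @ probe_vec / (
        np.linalg.norm(top, axis=1) * np.linalg.norm(probe_vec) + 1e-12)
    best = float(sims.max())
    return best >= sim_min, best
```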
Another embodiment of the invention enables users to improve their search results by providing feedback to a document retrieval system. For example, when a search is run on an internet search engine (a patent or patent application search product, etc.), the user may obtain a large number of results matching the search query. One embodiment of the invention allows the user to browse the suggested results from the search engine and tell the search engine how relevant one or more of the retrieved results are, e.g. "close, but not what I actually want", "absolutely not", etc. As the user provides feedback to the search engine, better results are prioritized for the user to browse.
In one embodiment, a method for document searching is shown in Figure 25. In step 2500, a search query is received. The search query can be of any type, including a case-sensitive query, a Boolean query, an approximate match query, a structured query, etc. In step 2502, documents are obtained based on the search query. In step 2504, the documents are output, and in step 2506, user-entered labels for at least some of the documents are received, the labels indicating the relevance between the documents and the search query. For example, the user may indicate whether particular results returned for the query are relevant or irrelevant. In step 2508, a classifier is trained based on the search query and the user-entered labels, and in step 2510, a document classification method is performed on the documents using the classifier in order to reclassify them. Further, in step 2512, identifiers of at least some of the documents are output based on their classification. The identifier can be an electronic copy of the document itself, a portion thereof, its title, its name, a pointer to the document, etc. The reclassified documents can also be output, with the documents having high confidence levels output first.
The document classification method can include any type of process, e.g. a transductive process, a support vector machine process, a maximum entropy discrimination process, etc. Any of the inductive or transductive methods described above can be used. In a preferred approach, the classifier can be a transductive classifier, trained by iterative calculation using at least one preset cost factor, the search query, and the documents, where for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier can be used to classify the documents. In addition, data point label prior probabilities for the search query and the documents can be received, where for each iteration the label prior probabilities can be adjusted according to an estimate of the group membership probabilities of the data points.
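A minimal sketch of the Figure 25 feedback loop, with TF-IDF and logistic regression as assumed stand-ins for the transductive classifier (the user feedback must contain both relevant and irrelevant marks for this stand-in to fit):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rerank(results, feedback):
    """results: list of result texts; feedback: {index: 1 relevant / 0 not}.
    Trains on the user-marked results and returns all results reordered by
    predicted relevance, highest confidence first."""
    vec = TfidfVectorizer().fit(results)
    X = vec.transform(results)
    marked = sorted(feedback)
    clf = LogisticRegression().fit(X[marked], [feedback[i] for i in marked])
    scores = clf.predict_proba(X)[:, 1]
    return sorted(zip(results, scores), key=lambda t: -t[1])
```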
Another embodiment of the invention can be used to improve ICR/OCR and speech recognition. For example, many speech recognition programs and systems require the operator to repeat many words to train the system. The present invention can first monitor a user's voice for a predetermined period of time to collect "unclassified" content, e.g. by monitoring telephone conversations. As a result, when the user begins to train the recognition system, the system uses transductive learning to build a model assisted by the monitored speech.
In another embodiment, a method for verifying the association between an invoice and an entity is shown in Figure 26. In step 2600, a classifier is trained based on an invoice format associated with a first entity. The invoice format may refer to the physical layout of markings on the invoice, or to features on the invoice, such as keywords, invoice numbers, customer names, etc. In step 2602, a plurality of invoices labeled as being associated with at least one of the first entity and other entities is accessed, and in step 2604, a document classification method is performed on the invoices using the classifier. For example, any of the inductive or transductive methods described above can serve as the document classification method; the document classification method can include a transductive process, a support vector machine process, a maximum entropy discrimination process, etc. In step 2606, an identifier of at least one of the invoices having a high probability of not being associated with the first entity is output.
Further, the classifier can be any type of classifier, such as a transductive classifier, and the transductive classifier can be trained by iterative calculation using at least one predetermined cost factor, at least one seed document, and the invoices, where for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier is used to classify the invoices. Data point label prior probabilities for the seed documents and invoices can also be received, where for each iteration the label prior probabilities are adjusted according to an estimate of the group membership probabilities of the data points.
One advantage of the embodiments described here is the stability of the transduction algorithm. This stability is achieved by regulating the cost factors and by regulating the label prior probabilities. For example, in one embodiment, a transductive classifier is trained by iterative classification using at least one cost factor and labeled and unlabeled data points as training examples. For each iteration, the cost factor is regulated as a function of an expected label value of the unlabeled data points. Additionally, for each iteration, the data point prior probabilities are regulated according to an estimate of the group membership probabilities of the data points.
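The two regulations might be combined in one iteration step as sketched below; both update rules are assumptions, shown only to make the roles of the cost factor and the label priors concrete:

```python
import numpy as np

def regulate(cost, expected_labels, priors, membership_est, lr=0.5):
    """One iteration's worth of the two stabilizing regulations: the cost
    factor follows the mean expected label value of the unlabeled points,
    and the label priors move toward the estimated membership probabilities."""
    new_cost = cost * (1.0 + expected_labels.mean())
    new_priors = (1.0 - lr) * priors + lr * membership_est.mean(axis=0)
    return new_cost, new_priors / new_priors.sum()
```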
A workstation can have an operating system resident in memory, such as the Microsoft Windows operating system (OS), the Mac OS, or the UNIX operating system. It should be appreciated that the preferred embodiments can also be implemented on platforms and operating systems other than those mentioned. A preferred embodiment can be written in Java, XML, C, and/or C++, or another programming language, in conjunction with object-oriented programming methodology. Object-oriented programming (OOP), which has been increasingly used to develop complex applications, can be employed.
The applications described above use transductive learning to overcome the problem of sparse data sets that plagues inductive face recognition systems. This aspect of transductive learning is not limited to that application; it can also be used to solve other machine learning problems caused by data set sparseness.
Those skilled in the art can devise various changes within the scope and spirit of the various embodiments of the invention disclosed herein. Moreover, the various features of the embodiments disclosed above can be used alone or in various combinations with each other, and are not confined to the particular combinations described above. Therefore, the scope of the claims is not limited to the described embodiments.

Claims (18)

1. A face recognition method, comprising:
receiving at least one labeled seed image of a face, the seed image having a known confidence level;
receiving unlabeled images;
receiving at least one preset cost factor;
training a transductive classifier by iterative calculation using the at least one preset cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration the cost factor is regulated as a function of an expected label value;
after at least some of the iterations, storing confidence scores for the unlabeled images; and
outputting the identifier of the unlabeled image with the highest confidence score to at least one of a user, another system, and another process.
2. The method according to claim 1, wherein the at least one seed image has a label indicating whether the image belongs to a specified category.
3. The method according to claim 1, wherein confidence scores are stored after each iteration, and wherein after each iteration the identifier of the unlabeled image with the highest confidence score is output.
4. The method according to claim 1, further comprising receiving data point label prior probabilities for the labeled and unlabeled images, wherein for each iteration the data point label prior probabilities are regulated according to an estimate of the group membership probabilities of the data points.
5. The method according to claim 1, further comprising receiving a third unlabeled image of a face, comparing the third unlabeled image with at least some of the images having the highest confidence scores, and, if the face in the third unlabeled image is believed to be identical to the face in the seed image, outputting an identifier of the third unlabeled image.
6. A method for adapting patent classification to changes in document content, comprising:
receiving at least one labeled seed document, wherein the at least one seed document is selected from the group consisting of patents and patent applications;
receiving unlabeled documents, the unlabeled documents being at least one of patents and patent applications;
training a transductive classifier using the at least one seed document and the unlabeled documents;
using a processor to classify, with the classifier, unlabeled documents having confidence levels above a predetermined threshold into a plurality of existing categories;
automatically creating at least one previously non-existing new category and using the classifier to classify unlabeled documents having confidence levels below a preset threshold into the at least one new category;
using the classifier to reclassify at least some of the previously classified documents into the existing categories and the at least one new category; and
outputting identifiers of the classified documents to at least one of a user, another system, and another process.
7. The method according to claim 6, wherein the classifier is a transductive classifier, the method further comprising training the transductive classifier by iterative calculation using at least one preset cost factor, a search query, and the documents, wherein for each iteration the cost factor is regulated as a function of an expected label value, and the trained classifier is used to classify the documents.
8. The method according to claim 6, further comprising receiving data point label prior probabilities for a search query and the documents, wherein for each iteration the data point label prior probabilities are regulated according to an estimate of the group membership probabilities of the data points.
9. The method according to claim 6, wherein the document classification method includes a support vector machine process.
10. The method according to claim 6, wherein the document classification method includes a maximum entropy discrimination process.
11. The method according to claim 6, wherein the unlabeled documents are patent applications.
12. The method according to claim 6, wherein the at least one seed document is selected from a patent and a patent application.
13. A method for adapting to changes in document content, comprising:
receiving at least one labeled seed document;
receiving unlabeled documents;
receiving at least one preset cost factor;
training a transductive classifier using the at least one preset cost factor, the at least one seed document, and the unlabeled documents;
using a processor to classify, with the classifier, unlabeled documents having confidence levels above a predetermined threshold into a plurality of categories;
using the classifier to reclassify documents previously classified by a different classifier into a plurality of categories, thereby adapting to a change in document content, wherein the document content includes at least one of image content, text content, layout, and numbering, and wherein the change is at least one of a change over time, a change in style, and a change in layout; and
outputting identifiers of the classified documents to at least one of a user, another system, and another process.
14. The method according to claim 13, further comprising moving unlabeled documents having confidence levels below a predetermined threshold into one or more new categories.
15. The method according to claim 13, further comprising training the transductive classifier by iterative calculation using the at least one preset cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration the cost factor is regulated as a function of an expected label value, and the trained classifier is used to classify the unlabeled documents.
16. The method according to claim 15, further comprising receiving data point label prior probabilities for the seed documents and the unlabeled documents, wherein for each iteration the data point label prior probabilities are regulated according to an estimate of the group membership probabilities of the data points.
17. The method according to claim 13, wherein the unlabeled documents are customer complaints, the method further comprising associating product changes with customer complaints.
18. The method according to claim 13, wherein the unlabeled documents are invoices.