CN101449264B - Transductive method and system for data classification, and data classification method using machine learning - Google Patents
- Publication number
- CN101449264B CN101449264B CN200780001197.9A CN200780001197A CN101449264B CN 101449264 B CN101449264 B CN 101449264B CN 200780001197 A CN200780001197 A CN 200780001197A CN 101449264 B CN101449264 B CN 101449264B
- Authority
- CN
- China
- Prior art keywords
- file
- data
- classification
- label
- unlabeled
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a system, method, data processing apparatus, and article of manufacture for classifying data. A data classification method using machine learning is also disclosed.
Description
Technical field
The invention relates generally to methods and apparatus for data classification. In particular, the invention provides an improved transductive machine learning method. The invention further relates to new applications of machine learning methods.
Background
With the information age and the recent explosion of electronic data across all trades and professions (including, in particular, scanned documents, web material, search engine data, text data, images, audio data files, etc.), how data is processed has become very important.
A field just beginning to be explored is the classification of data without human intervention. In many classification techniques, a machine or computer must learn according to manually entered and established rules and/or manually prepared training examples. In machine learning with training examples, the number of training samples is generally smaller than the number of parameters to be estimated; that is, many solutions satisfy the constraints given by the training examples. One challenge of machine learning is to find a solution that nevertheless generalizes well despite this shortcoming. It is therefore desirable to overcome these and/or other problems of the prior art.
It is further desirable to provide practical applications of various machine learning methods.
Summary of the invention
In a computer-based system, according to one embodiment of the invention, a method for classifying data includes: receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example to be included in a given category or a training example to be excluded from a given category; receiving unlabeled data points; receiving at least one predetermined cost factor for the labeled and unlabeled data points; training a transductive classifier by iterative calculations using maximum entropy discrimination (MED), with the at least one cost factor and the labeled and unlabeled data points as training examples, wherein for each iteration the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of a data point label is adjusted according to an estimate of the data point's class membership probability; classifying at least one of the unlabeled data points, labeled data points, and input data points with the trained classifier; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for classifying data includes providing executable program code to a computer system and executing the program code on the computer system, the program code including instructions for: accessing labeled data points stored in computer memory, each labeled data point having at least one label indicating whether the data point is a training example to be included in a given category or a training example to be excluded from a given category; accessing unlabeled data points from computer memory; accessing from computer memory at least one predetermined cost factor for the labeled and unlabeled data points; training a maximum entropy discrimination (MED) transductive classifier by iterative calculations, using the at least one cost factor and the stored labeled and unlabeled data points as training examples, wherein for each iteration the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of the data point labels is adjusted according to an estimate of the data points' class membership probabilities; classifying at least one of the unlabeled data points, labeled data points, and input data points with the trained classifier; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a data processing apparatus includes: at least one memory for storing (i) labeled data points, each labeled data point having at least one label indicating whether the data point is a training example to be included in a given category or a training example to be excluded from a given category, (ii) unlabeled data points, and (iii) at least one predetermined cost factor for the labeled and unlabeled data points; and a transductive classifier trainer for cyclically training a transductive classifier using transductive maximum entropy discrimination (MED), with the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein for each MED iteration the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of the data point labels is adjusted according to an estimate of the data points' class membership probabilities;
wherein the classifier trained by the transductive classifier trainer is used to classify at least one of the unlabeled data points, labeled data points, and input data points;
and wherein the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
According to another embodiment of the invention, an article of manufacture includes a computer-readable program storage medium tangibly embodying one or more programs of instructions executable by a computer to perform a method of data classification, the method including: receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example to be included in a given category or a training example to be excluded from a given category; receiving unlabeled data points; receiving at least one predetermined cost factor for the labeled and unlabeled data points; training a transductive classifier by iterative maximum entropy discrimination (MED) calculations, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein in each MED iteration the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of a data point label is adjusted according to an estimate of the data point's class membership probability; classifying at least one of the unlabeled data points, labeled data points, and input data points with the trained classifier; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
In a computer-based system, according to another embodiment of the invention, a method for classifying unlabeled data points includes: receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example to be included in a given category or a training example to be excluded from a given category; receiving labeled and unlabeled data points; receiving prior label probability information for the labeled and unlabeled data points; receiving at least one predetermined cost factor for the labeled and unlabeled data points; determining the expected label of each labeled and unlabeled data point according to the label prior probabilities of the data points; and repeating the following sub-steps until the data values have sufficiently converged:
● for each unlabeled data point, generating an adjusted cost value proportional to the absolute value of the data point's expected label;
● training a classifier by determining a decision function that, given the training examples to be included and excluded, uses the labeled and unlabeled data points as training examples according to their expected labels, and that minimizes the KL divergence to the prior probability distribution of the decision function parameters;
● determining the classification scores of the labeled and unlabeled data points using the trained classifier;
● calibrating the output of the trained classifier to class membership probabilities;
● updating the label prior probabilities of the unlabeled data points according to the determined class membership probabilities;
● determining the label and margin probability distributions using maximum entropy discrimination (MED), with the updated label prior probabilities and the previously determined classification scores;
● computing new expected labels using the previously determined label probability distribution; and
● updating the expected label of each data point by interpolating the expected labels of the previous iteration with the new expected labels.
A classification of the input data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
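The iterative sub-steps above can be sketched in simplified form. The sketch below is illustrative only: it substitutes a cost-weighted logistic regression for the patent's MED optimization, and the function names (`train_weighted`, `transductive_train`) and all numeric choices are assumptions, not part of the disclosure. What it preserves is the loop structure: unlabeled cost factors scaled by the absolute expected label, retraining, re-scoring, calibrating to membership probabilities, and updating the expected labels on each iteration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_weighted(X, y, costs, n_steps=200, lr=0.1):
    """Stand-in for the MED training step: logistic regression where each
    example's loss gradient is scaled by its cost factor."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = sigmoid(X @ w)
        grad = X.T @ (costs * (p - (y + 1) / 2)) / len(y)  # y in {-1,+1}
        w -= lr * grad
    return w

def transductive_train(X_lab, y_lab, X_unl, base_cost=1.0, n_iter=10):
    """Sketch of the iterative transductive loop: expected labels of the
    unlabeled points start at 0 and are re-estimated each iteration;
    each unlabeled point's cost factor is proportional to the absolute
    value of its expected label."""
    exp_lab = np.zeros(len(X_unl))          # expected labels in [-1, +1]
    w = None
    for _ in range(n_iter):
        # sub-step: cost proportional to |expected label|
        costs_unl = base_cost * np.abs(exp_lab)
        X = np.vstack([X_lab, X_unl])
        y = np.concatenate([y_lab, np.sign(exp_lab) + (exp_lab == 0)])
        costs = np.concatenate([np.full(len(y_lab), base_cost), costs_unl])
        w = train_weighted(X, y, costs)
        # sub-steps: score the unlabeled points, calibrate the scores to
        # class membership probabilities, and form new expected labels
        p = sigmoid(X_unl @ w)              # class-membership probability
        exp_lab = 2 * p - 1                 # new expected label
    return w, exp_lab

# toy usage: two labeled points and two unlabeled points near them
X_lab = np.array([[2.0, 0.0], [-2.0, 0.0]])
y_lab = np.array([1.0, -1.0])
X_unl = np.array([[1.5, 0.3], [-1.5, -0.2]])
w, exp_lab = transductive_train(X_lab, y_lab, X_unl)
```

In the first pass the unlabeled costs are zero, so only the labeled points shape the classifier; later passes pull in the unlabeled points in proportion to how confidently they are labeled.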
According to another embodiment of the invention, a document classification method includes: receiving at least one labeled seed document having a known confidence level of its label assignment; receiving unlabeled documents; receiving at least one predetermined cost factor; training a transductive classifier by iterative calculations, using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration the cost factor is adjusted as a function of an expected label value; after at least some of the iterations, storing confidence scores for the unlabeled documents; and outputting identifiers of the unlabeled documents having the highest confidence scores to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for analyzing documents associated with a legal inquiry includes: receiving documents associated with a legal matter; performing a document classification method on the documents; and outputting identifiers of at least some of the documents based on their classification.
According to another embodiment of the invention, a method of cleaning up data includes: receiving a plurality of labeled data items; selecting a subset of the data items for each of a plurality of categories; setting the bias of the data items in each subset to about zero; setting the bias of the data items not in the subsets to a predetermined value that is not about zero; training a transductive classifier by iterative calculations, using the biases and the data items both in and not in the subsets as training examples; applying the trained classifier to each of the labeled data items to classify each data item; and outputting the classification of the input data items, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for verifying the association of invoices with an entity includes: training a classifier based on an invoice format associated with a first entity; accessing a plurality of invoices labeled as associated with at least one of the first entity and other entities; performing a document classification method on the invoices using the classifier; and outputting an identifier of at least one invoice having a higher probability of not being associated with the first entity.
According to another embodiment of the invention, a method for managing medical records includes: training a classifier based on a medical diagnosis; accessing a plurality of medical records; performing a document classification method on the medical records using the classifier; and outputting an identifier of at least one medical record having a lower probability of being associated with the medical diagnosis.
According to another embodiment of the invention, a method for face recognition includes: receiving at least one labeled seed image of a face, the seed image having a known confidence level; receiving unlabeled images; receiving at least one predetermined cost factor; training a transductive classifier by iterative calculations, using the at least one predetermined cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration the cost factor is adjusted as a function of an expected label value; after at least some of the iterations, storing confidence scores for the unlabeled images; and outputting an identifier of the unlabeled image having the highest confidence score to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for analyzing prior art documents includes: training a classifier based on a search query; accessing a plurality of prior art documents; performing a document classification method on at least some of the prior art documents using the classifier; and outputting identifiers of at least some of the prior art documents based on their classification.
According to another embodiment of the invention, a method for adapting patent classification to drift in document content includes: receiving at least one labeled seed document; receiving unlabeled documents; training a transductive classifier using the at least one seed document and the unlabeled documents; using the classifier, assigning the unlabeled documents having a confidence level above a predetermined threshold to a plurality of existing categories; using the classifier, assigning the other unlabeled documents, having a confidence level below the predetermined threshold, to at least one new category; using the classifier, reassigning at least some of the classified documents to the existing categories and the at least one new category; and outputting identifiers of the classified documents to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for matching documents with claims includes: training a classifier based on at least one claim of a patent or patent application; accessing a plurality of documents; performing a document classification method on at least some of the documents using the classifier; and outputting identifiers of at least some of the documents based on their classification.
According to another embodiment of the invention, a method of classifying a patent or patent application includes: training a classifier based on a plurality of known documents belonging to a particular patent class; receiving at least a portion of a patent or patent application; performing a document classification method on the at least a portion of the patent or patent application using the classifier; and outputting the classification of the patent or patent application, wherein the document classification method is a yes/no classification method.
According to another embodiment of the invention, a method of adapting to drift in document content includes: receiving at least one labeled seed document; receiving unlabeled documents; receiving at least one predetermined cost factor; training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents; using the classifier, assigning the unlabeled documents having a confidence level above a predetermined threshold to a plurality of categories; using the classifier, reclassifying at least some of the classified documents into the plurality of categories; and outputting identifiers of the classified documents to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method of separating documents includes: receiving labeled data; receiving a group of unlabeled documents; using transduction to adapt probabilistic classification rules based on the labeled data and the unlabeled documents; updating weights used for document separation according to the probabilistic classification rules; determining the locations of separations in the group of documents; outputting indicators of the determined separation locations to at least one of a user, another system, and another process; and marking the documents with codes associated with the indicators.
According to another embodiment of the invention, a method of document searching includes: receiving a search query; retrieving documents based on the search query; outputting the documents; receiving user-entered labels for at least some of the documents, the labels indicating the relevance of the documents to the search query; training a classifier based on the search query and the user-entered labels; performing a document classification method on the documents using the classifier, thereby reclassifying the documents; and outputting identifiers of at least some of the documents based on their classification.
Brief description of the drawings
Fig. 1 is a graph of the expected label as a function of the classification score, as obtained by MED discrimination learning using label induction.
Fig. 2 is a schematic diagram of the iterative calculation of a set of decision functions obtained by transductive MED learning.
Fig. 3 is a schematic diagram of the iterative calculation of a set of decision functions obtained by improved transductive MED learning according to an embodiment of the invention.
Fig. 4 is a process control flow chart for classifying unlabeled data using an adjusted cost factor, according to an embodiment of the invention.
Fig. 5 is a process control flow chart for classifying unlabeled data using user-defined prior probability information, according to an embodiment of the invention.
Fig. 6 is a detailed control flow chart for classifying unlabeled data using maximum entropy discrimination with adjusted cost factors and prior probability information, according to an embodiment of the invention.
Fig. 7 shows a network in which the network architecture of the various embodiments described herein may be implemented.
Fig. 8 is a system block diagram of a representative hardware environment associated with a user device.
Fig. 9 is a block diagram of an apparatus representing an embodiment of the present invention.
Figure 10 is a flow chart of a classification process performed according to an embodiment.
Figure 11 is a flow chart of a classification process performed according to an embodiment.
Figure 12 is a flow chart of a classification process performed according to an embodiment.
Figure 13 is a flow chart of a classification process performed according to an embodiment.
Figure 14 is a flow chart of a classification process performed according to an embodiment.
Figure 15 is a flow chart of a classification process performed according to an embodiment.
Figure 16 is a flow chart of a classification process performed according to an embodiment.
Figure 17 is a flow chart of a classification process performed according to an embodiment.
Figure 18 is a flow chart of a classification process performed according to an embodiment.
Figure 19 is a flow chart of a classification process performed according to an embodiment.
Figure 20 is a flow chart of a classification process performed according to an embodiment.
Figure 21 is a flow chart of a classification process performed according to an embodiment.
Figure 22 is a control flow chart of a method according to an embodiment of the invention for a first document classification system.
Figure 23 is a control flow chart of a method according to an embodiment of the invention for a second document classification system.
Figure 24 is a flow chart of a classification process performed according to an embodiment.
Figure 25 is a flow chart of a classification process performed according to an embodiment.
Figure 26 is a flow chart of a classification process performed according to an embodiment.
Figure 27 is a flow chart of a classification process performed according to an embodiment.
Figure 28 is a flow chart of a classification process performed according to an embodiment.
Figure 29 is a flow chart of a classification process performed according to an embodiment.
Detailed description of the invention
The following description is of the best mode presently contemplated for carrying out the present invention. This description is made to illustrate the general principles of the invention and is not meant to limit the inventive concepts described herein. Moreover, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations.
Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation, including meanings implied from the specification as well as meanings understood by those skilled in the art and/or as defined in dictionaries, treatises, etc.
Text classification
The benefits of and need for text data classification are enormous, and a variety of classification techniques have been used. The following discussion concerns classification techniques for text data.
To increase their utility and intelligence, machines such as computers are desired to classify (or recognize) objects from an ever-expanding range. For example, computers can use optical character recognition to classify scanned or handwritten digits and characters, use pattern recognition to classify images such as faces, fingerprints, fighter planes, etc., or use speech recognition to classify sounds, speech, and so on.
Machines are also desired to be able to classify textual information objects, such as text computer files or documents. The applications of text classification are diverse and important. For example, text classification can be used to organize textual information objects into a hierarchy of predetermined classes or categories. In this way, finding (or locating) textual information objects related to a particular subject is simplified. Text classification can be used to route appropriate textual information objects to appropriate people or places. In this way, an information service can route textual information objects covering various subjects (e.g., business, sports, the stock market, football, a particular company, a particular football team) to people having different interests. Text classification can be used to filter textual information objects so that a person is not subjected to unwanted textual content (such as unwanted and unsolicited e-mail, also known as SPAM or "junk" e-mail). As can be appreciated from these examples, text classification has many exciting and important applications.
Rule-based classification
In some instances, it is necessary to classify document content with absolute certainty, based on some accepted logic. A rule-based system can be used to effect this type of classification. Basically, a rule-based system uses production rules of the form:
IF condition, THEN fact.
The conditions may include whether the textual information includes certain words or phrases, has a certain syntax, or has certain attributes. For example, if the text content has the word "close", the phrase "Nasdaq", and a number, then it is classified as "stock market" text.
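The production rule just described can be sketched as a small function. This is a toy illustration of the IF-condition-THEN-fact form, not part of the disclosure; the function name and return values are assumptions.

```python
import re

def classify_stock_market(text):
    """Toy production rule of the IF-condition-THEN-fact form:
    IF the text contains the word "close", the phrase "Nasdaq", and a
    number, THEN it is classified as "stock market" text."""
    t = text.lower()
    condition = ("close" in t) and ("nasdaq" in t) and bool(re.search(r"\d", t))
    return "stock market" if condition else "other"
```

The rule's logic is static and predefined: nothing is learned from examples, which is the contrast drawn with the classifiers discussed next.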
Over the past decade or so, other types of classifiers have increasingly been used. Although such classifiers do not use static, predefined logic like rule-based classifiers, they have outperformed rule-based classifiers in many applications. Such classifiers typically include a learning element and a performance element, and include neural networks, Bayesian networks, and support vector machines. Although each of these classifiers is well known, they are briefly introduced below for the reader's convenience.
Classifiers having learning and performance elements
As mentioned at the end of the previous section, classifiers having learning and performance elements outperform rule-based classifiers in many applications. To reiterate, such classifiers may include neural networks, Bayesian networks, and support vector machines.
Neural networks
A neural network is basically a hierarchical arrangement of multiple layers of identical processing elements (also referred to as neurons). Each neuron can have one or more inputs but only one output. Each input of a neuron is weighted by a coefficient. The output of a neuron is typically a function of the sum of its weighted inputs and a bias value. This function, also known as the activation function, is usually a sigmoid function: that is, the activation function is S-shaped and monotonically increasing, and asymptotically approaches fixed values (e.g., +1, 0, -1) as its input(s) respectively approach positive or negative infinity. The sigmoid function and the individual neuron's weight and bias values determine the neuron's response or "excitability" to input signals.
In the hierarchical arrangement of neurons, the output of a neuron in one layer can be distributed as an input to one or more neurons in a next layer. A typical neural network can include an input layer and two (2) distinct neuron layers; that is, an input layer, an intermediate (hidden) neuron layer, and an output neuron layer. Note that the nodes of the input layer are not neurons; rather, the nodes of the input layer each have only one input and essentially provide the input, unprocessed, to the next layer. If, for example, the neural network were to be used to recognize a numerical digit character in a 20 x 15 pixel array, the input layer could have 300 neurons (i.e., one for each pixel of the input), and the output layer could have 10 neurons (i.e., one for each of the 10 digits).
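The neuron and layer structure described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the hidden-layer size of 50, the weight scale, and the random seed are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    """S-shaped, monotonically increasing activation that asymptotically
    approaches 0 and 1 as its input goes to -inf / +inf."""
    return 1.0 / (1.0 + np.exp(-z))

class Layer:
    def __init__(self, n_in, n_out, rng):
        # weights and biases initialized to random (Gaussian) values,
        # as described for network initialization
        self.W = rng.normal(0.0, 0.1, (n_in, n_out))
        self.b = rng.normal(0.0, 0.1, n_out)

    def forward(self, x):
        # each neuron outputs the activation of its weighted input sum
        # plus its bias value
        return sigmoid(x @ self.W + self.b)

# Architecture from the digit example: a 20 x 15 pixel array (300 inputs)
# feeding a hidden ("intermediate") layer, then 10 output neurons
# (one per digit). Hidden size of 50 is an arbitrary assumption.
rng = np.random.default_rng(0)
hidden = Layer(300, 50, rng)
output = Layer(50, 10, rng)

x = rng.random(300)            # a dummy pixel vector
scores = output.forward(hidden.forward(x))
predicted_digit = int(np.argmax(scores))
```

With a confidently trained network, one of the ten output activations would stand well above the others; here the weights are random, so the prediction is meaningless and serves only to show the forward pass.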
Using a neural network generally involves two (2) successive steps. First, the neural network is initialized and trained on known inputs having known output values (or classifications). Once the neural network is trained, it can be used to classify unknown inputs. The neural network can be initialized by setting the weights and biases of the neurons to random values, typically generated from a Gaussian distribution. The neural network is then trained using a succession of inputs with known outputs (or classifications). As the training inputs are fed to the neural network, the values of the neural weights and biases are adjusted (e.g., in accordance with the known back-propagation technique) such that the output of the neural network for each individual training pattern approaches or matches the known output. Basically, a gradient descent in weight space is used to minimize the output error. In this way, learning using successive training inputs converges toward a locally optimal solution for the weights and biases; that is, the weights and biases are adjusted to minimize the error.
In practice, the system is usually not trained to the point at which it converges to the optimal solution. Otherwise, the system would be "over-trained", becoming too specialized to the training data and possibly poor at classifying inputs that differ, even slightly, from those in the training set. Therefore, at various times during its training, the system is tested on a set of validation data. Training is halted when the system's performance on the validation set no longer improves.
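The validation-based stopping rule can be sketched as follows. This is a generic early-stopping sketch, not the patent's procedure; `train_step` and `validate` are assumed caller-supplied callables, and the patience window of 5 epochs is an arbitrary choice.

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    """Train until performance on a held-out validation set stops
    improving, to avoid "over-training" on the training data.
    `train_step` runs one training epoch; `validate` returns a score on
    the validation set (higher is better)."""
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        train_step()
        score = validate()
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs: stop training
    return best_score, best_epoch
```

In a real setting one would also save the weights from the best epoch and restore them after stopping.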
Once trained, the neural network can be used to classify unknown inputs in accordance with the weights and biases determined during training. If the neural network can confidently classify the unknown input, one of the outputs of the neurons in the output layer will be far higher than the others.
Bayesian networks
Generally, Bayesian networks use hypotheses as intermediaries between data (e.g., input feature vectors) and predictions (e.g., classifications). The probability of each hypothesis given the data ("P(hypothesis | data)") may be estimated. A prediction is made from the hypotheses using their posterior probabilities to weight the individual predictions of each hypothesis. Given data D, the probability of a prediction X can be expressed as:
P(X | D) = Σi P(X | Hi) · P(Hi | D)
where Hi is the i-th hypothesis. The maximum-likelihood hypothesis Hi that maximizes the probability P(Hi | D) given D is referred to as the maximum a posteriori hypothesis (or "HMAP"), and the prediction may be approximated as:
P(X | D) ≈ P(X | HMAP)
Using Bayes' rule, the probability of a hypothesis Hi given data D can be expressed as:
P(Hi | D) = P(D | Hi) · P(Hi) / P(D)
The probability of the data D remains constant. Therefore, to find HMAP, the numerator must be maximized.
The first term of the numerator represents the probability of observing the data given hypothesis i. The second term represents the prior probability assigned to hypothesis i.
A Bayesian network includes variables and directed edges between the variables, thereby defining a directed acyclic graph (or "DAG"). Each variable can assume any of a finite number of mutually exclusive states. For each variable A with parent variables B1…Bn, there is an attached probability table P(A | B1…Bn). The structure of the Bayesian network encodes the assumption that, given its parent variables, each variable is conditionally independent of its non-descendants.
If the structure of the Bayesian network is known and the variables are observable, only the set of conditional probability tables needs to be learned. These tables can be estimated directly using statistics from a set of training examples. If the structure is known but some variables are hidden, learning is analogous to the neural network learning described above.
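As a brief sketch of the counting approach described above (the variable names and observations below are hypothetical, not taken from the specification), a conditional probability table P(A | B) for a variable A with a single parent B can be estimated directly from sample statistics:

```python
from collections import Counter

def estimate_cpt(samples):
    """Estimate P(A | B) from observed (a, b) pairs by counting."""
    joint = Counter(samples)                    # counts of each (a, b) pair
    parent = Counter(b for _, b in samples)     # counts of each parent state b
    return {(a, b): n / parent[b] for (a, b), n in joint.items()}

# Hypothetical observations of (A, B) pairs
samples = [("wet", "rain"), ("wet", "rain"), ("dry", "rain"),
           ("dry", "no_rain"), ("dry", "no_rain"), ("wet", "no_rain")]
cpt = estimate_cpt(samples)
```

With several parents the same counting applies to joint parent configurations (B1…Bn), at the cost of exponentially many table rows.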
An example of a simple Bayesian network is described below. A variable "MML" may represent "the moisture of my lawn" and may have the states "wet" and "dry". The MML variable may have the parent variables "it rained" and "my sprinkler was on", each having the states "yes" and "no". Another variable, "MNL", may represent "the moisture of my neighbor's lawn" and may also have the states "wet" and "dry". The MNL variable may share the "it rained" parent variable. In this example, the prediction may be whether my lawn is "wet" or "dry". This prediction may be based on the hypotheses (i) that if it rained, my lawn will be wet with a probability (x1), and (ii) that if my sprinkler was on, my lawn will be wet with a probability (x2). The probability that it rained, or that my sprinkler was on, may depend on other variables. For example, if my neighbor's lawn is wet and they do not have a sprinkler, it probably rained.
As with the neural network example described above, the conditional probability tables in a Bayesian network may be trained. An advantage is that this learning process can be shortened by allowing prior knowledge to be provided. Unfortunately, the prior probabilities of the conditional probabilities are usually unknown, in which case a uniform prior is used.
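A minimal sketch of inference in the lawn network described above, with assumed (hypothetical) probability tables, illustrates how the prediction is computed by summing over the parent variables:

```python
from itertools import product

# Assumed priors and conditional probability table (hypothetical numbers)
p_rain = 0.3
p_sprinkler = 0.4
p_mml_wet = {            # P(MML = wet | rain, sprinkler)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.85, (False, False): 0.05,
}

def prob_my_lawn_wet():
    """Marginalize P(MML = wet) over the two parent variables."""
    total = 0.0
    for rain, sprinkler in product([True, False], repeat=2):
        p_parents = (p_rain if rain else 1 - p_rain) * \
                    (p_sprinkler if sprinkler else 1 - p_sprinkler)
        total += p_parents * p_mml_wet[(rain, sprinkler)]
    return total
```

Conditioning on evidence such as the neighbor's wet lawn would be done the same way, renormalizing the joint probability over the states consistent with the evidence.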
An embodiment of the present invention may perform at least two (2) basic functions, namely (1) generating parameters for a classifier, and (2) classifying objects, such as text information objects.
Basically, the parameters are generated for the classifier from a set of training examples. A set of feature vectors may be generated from the set of training examples. The features of this set of feature vectors may be reduced. The generated parameters may include a defined monotonic (e.g., sigmoid) function and a weight vector. The weight vector may be determined by means of SVM training (or by other known techniques). The monotonic (e.g., sigmoid) function may be determined by an optimization method.
The text classifier includes a weight vector and a defined monotonic (e.g., sigmoid) function. Basically, the output of the text classifier of the present invention may be expressed as:
Oc = f(A (wc · x) + B)   (2)
where:
Oc = the classification output for category c;
wc = the weight vector parameter associated with category c;
x = the (reduced) feature vector based on the unknown text information object; and
A and B are adjustable parameters of the monotonic (e.g., sigmoid) function f.
The output computed by expression (2) is faster to calculate than the output computed by expression (1).
Depending on the form of the objects to be classified, the classifier may (i) convert a text information object into a feature vector, and (ii) reduce the feature vector to a reduced feature vector with fewer elements.
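The output computation can be sketched as follows. The weight values, feature vector, and the parameters A and B below are hypothetical, and a sigmoid is used as the monotonic function:

```python
import math

def classify(w_c, x, A, B):
    """Category output O_c: a sigmoid of the score A * (w_c . x) + B."""
    s = sum(wi * xi for wi, xi in zip(w_c, x))   # classification score w_c . x
    return 1.0 / (1.0 + math.exp(A * s + B))     # monotonic (sigmoid) mapping

w_c = [0.8, -0.2, 0.5]     # hypothetical weight vector for category c
x = [1.0, 0.0, 1.0]        # hypothetical reduced feature vector
o_c = classify(w_c, x, A=-2.0, B=0.0)  # A < 0: larger scores give larger outputs
```

The dot product over a reduced feature vector is what makes this per-category output cheap to evaluate at classification time.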
Transductive machine learning
Commercially, automatic classification systems currently used in the prior art are either rule-based or use inductive machine learning, i.e., learning from hand-labeled training examples. Compared to transductive methods, both approaches generally require a large amount of manual setup work. The solutions provided by rule-based systems or inductive methods are static: without manual intervention, they cannot adapt to drifting classification concepts.
Inductive machine learning attributes properties or relations to types based on a limited number of observations or experiences, i.e., it formulates rules by generalizing from a limited set of observed patterns. Inductive machine learning involves reasoning from observed training cases to establish general rules, which are then applied to test cases.
In contrast, preferred embodiments use transductive machine learning methods. Transductive machine learning is an effective approach that avoids these drawbacks.
Transductive machine learning methods can learn from a very small set of labeled training examples, automatically adapt to drifting classification concepts, and automatically correct the labels of labeled training examples. These advantages make transductive machine learning an interesting and valuable method suitable for a variety of business applications.
Transduction learns patterns in data. By learning not only from labeled data but also from unlabeled data, transduction extends the concept of inductive learning. This allows transduction to learn patterns that cannot be captured, or can only partially be captured, from the labeled data alone. Consequently, compared to rule-based systems or systems based on inductive learning, transduction can adapt to dynamically changing environments. This capability allows transduction to be used for document review, data cleanup, addressing drifting classification concepts, and so on.
Embodiments of transductive classification using support vector machine (SVM) classification and the maximum entropy discrimination (MED) framework are described below.
Support vector machines
The support vector machine (SVM) is a method used for text classification that addresses the problem of having a large number of possible solutions, and the resulting generalization problem, by using concepts from regularization theory to impose restrictions on the possible solutions. For example, a binary SVM classifier chooses, from among all hyperplanes that accurately separate the training data, the hyperplane that maximizes the margin as its solution. Maximizing the margin regularizes the solution, while the constraint that the training data be classified accurately strikes an appropriate learning balance between the aforementioned generalization and memorization: the constraint on the training data memorizes the data, while the regularization ensures adequate generalization. Inductive classification learns from training examples with known labels, i.e., the group membership of each training example is known. Whereas inductive classification learns from known labels, transductive classification determines the classification rule from both labeled and unlabeled data. An example of transductive SVM classification is shown in Table 1.
The principle of transductive SVM classification
Require: Data matrix X of labeled training examples and their labels Y.
Require: Data matrix X′ of the unlabeled training examples.
Require: A list of all possible label assignments of the unlabeled training examples [Y1′, …, Yn′].
1: MaximumMargin := 0
2: BestLabelAssignment := none
3: for all label assignments Yi′, 1 ≤ i ≤ n, in the list of label assignments do
4:   CurrentMaximumMargin := MaximizeMargin(X, Y, X′, Yi′)
5:   if CurrentMaximumMargin > MaximumMargin then
6:     MaximumMargin := CurrentMaximumMargin
7:     BestLabelAssignment := Yi′
8:   end if
9: end for
Table 1
Table 1 shows the principle of transductive classification using a support vector machine. The solution is given by the hyperplane that produces the maximum margin over all possible label assignments of the unlabeled data. The number of possible label assignments grows exponentially with the amount of unlabeled data, so for a practically usable method the algorithm of Table 1 must be approximated. An example of such an approximation is described in T. Joachims, Transductive inference for text classification using support vector machines, Technical report, Universitaet Dortmund, LS VIII, 1999 (Joachims).
Table 1 assumes a uniform distribution over label assignments, i.e., each unlabeled data point has a probability of 1/2 of being a positive example of the class and a probability of 1/2 of being a negative example. The two possible label assignments y = +1 (positive example) and y = -1 (negative example) are equally likely, and the resulting expected label is 0. An expected label of 0 can result either from a fixed class prior probability equal to 1/2, or from a class prior probability that is a random variable with a uniform prior distribution (i.e., an unknown class prior probability). Accordingly, in applications where the class prior probability is known to differ from 1/2, the algorithm could be improved by incorporating this additional information, for example by using, instead of the uniform distribution over label assignments of Table 1, a distribution according to the class prior probability that prefers some label assignments over others. However, trading off a solution with a larger margin but a less probable label assignment against a solution with a smaller margin but a more probable label assignment is difficult: the probability of a label assignment and the margin are on different scales.
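The exhaustive principle of Table 1 can be sketched on a one-dimensional toy problem, where the "hyperplane" is a threshold and the margin is half the gap between the two classes. The data values below are hypothetical:

```python
from itertools import product

def max_margin_1d(points, labels):
    """Margin of the best separating threshold, or None if not separable."""
    neg = [x for x, y in zip(points, labels) if y == -1]
    pos = [x for x, y in zip(points, labels) if y == +1]
    if not neg or not pos or max(neg) >= min(pos):
        return None
    return (min(pos) - max(neg)) / 2.0

labeled_x, labeled_y = [-2.0, 2.0], [-1, +1]   # labeled examples
unlabeled_x = [-0.6, 0.5]                      # unlabeled examples

# Enumerate all 2^n label assignments of the unlabeled data (Table 1, line 3)
best_margin, best_assignment = 0.0, None
for assignment in product([-1, +1], repeat=len(unlabeled_x)):
    margin = max_margin_1d(labeled_x + unlabeled_x,
                           labeled_y + list(assignment))
    if margin is not None and margin > best_margin:
        best_margin, best_assignment = margin, assignment
```

The enumeration over 2^n assignments is exactly why the exact algorithm is impractical and must be approximated for realistic amounts of unlabeled data.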
Maximum entropy discrimination
Another classification method, maximum entropy discrimination (MED) (see, e.g., T. Jebara, Machine Learning: Discriminative and Generative, Kluwer Academic Publishers) (Jebara), does not suffer from the problem associated with SVMs, because the regularization term of the decision function and the regularization term of the label assignments are both derived from prior probability distributions over solutions, and are therefore on the same probability scale. Thus, if the class prior, and hence the label prior, is known, transductive MED classification is superior to transductive SVM classification because it allows prior label knowledge to be incorporated in a principled fashion.
Inductive MED classification assumes a prior distribution over the decision function parameters, a prior distribution over the bias term, and a prior distribution over the margins. It selects as the final distribution over these parameters the distribution closest to the prior distribution that yields an expected decision function that accurately classifies the data points.
Formally, for example, given a linear classifier, the problem is stated as follows: find the hyperplane parameter distribution p(Θ), the bias distribution p(b), and the data point classification margin distribution p(γ) whose joint probability distribution has minimum Kullback-Leibler divergence (KL) from the combined prior distribution p0, i.e.,
min KL(p(Θ, b, γ) || p0(Θ, b, γ))
subject to the constraints
∫ p(Θ, b, γ) [yt (Θ · Xt + b) − γt] dΘ db dγ ≥ 0, for all t
where Θ · Xt is the dot product between the separating hyperplane weight vector and the feature vector of the t-th data point. Since the label assignments yt are known and fixed, no prior distribution over binary label assignments is needed. Accordingly, inductive MED classification generalizes straightforwardly to transductive MED classification by treating the distribution over binary label assignments as an additional prior distribution over the possible label assignments. An example of transductive MED is shown in Table 2.
Transductive MED classification
Require: Data matrix X of labeled and unlabeled training examples.
Require: Label prior probabilities p0(y) for labeled and unlabeled training examples.
1: <Y> := ExpectedLabel(p0(y)) {Expected label determined from the training examples' label prior probabilities.}
2: while not converged do
3:   W := MinimizeKLDivergence(X, <Y>)
4:   Y′ := InduceLabels(W, X, p0(y))
5:   <Y> := ∈<Y> + (1 − ∈)Y′
6: end while
Table 2
For labeled data, the label prior distribution is a delta function, effectively fixing the label to +1 or -1. For unlabeled data, a label prior probability p0(y) is assumed, assigning to each unlabeled data point a probability p0(y) of a positive label y = +1 and a probability 1 − p0(y) of a negative label y = -1. Assuming a non-informative label prior (p0(y) = 1/2) yields a transductive MED classification similar to the transductive SVM classification described above.
As in the case of transductive SVM classification, a practically applicable implementation of the above MED algorithm must approximate the search over all possible label assignments. The method described in T. Jaakkola, M. Meila, and T. Jebara, Maximum entropy discrimination, Technical Report AITR-1668, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 1999 (Jaakkola), chooses an approximation that decomposes the procedure into two steps, similar to an expectation maximization (EM) formulation. In this formulation, two problems must be solved. The first step, corresponding to the M step of the EM algorithm, approximately maximizes the margin when accurately classifying all data points according to the current best guess of the label distribution. The second step, corresponding to the E step, uses the classification results determined in the M step to estimate new values for the group memberships of the examples. We call this second step label induction. The basic outline is shown in Table 2.
The particular implementation of the method of Jaakkola, cited herein, assumes a Gaussian with zero mean and unit variance for the hyperplane parameters; a Gaussian with zero mean and variance σb² for the bias parameter; a margin prior of the form exp[-c(1 − γ)], where γ is the margin of a data point and c is the cost factor; and a binary label prior probability p0(y) for the unlabeled data as described above. The transductive classification algorithm of Jaakkola discussed below, incorporated herein, assumes, for reasons of simplicity and without loss of generality, a label prior probability of 1/2.
For a fixed probability distribution over the hyperplane parameters, the label induction step determines the label probability distribution. Using the margin and label priors described above, the objective function of the label induction step becomes (see Table 2):
F(λ) = Σt [λt + log(1 − λt/c) − log cosh(λt st)]   (3)
where λt is the Lagrange multiplier of the t-th training example, st is its classification score determined in the preceding M step, and c is the cost factor. The first two terms in the sum over training examples derive from the margin prior distribution, and the third term is given by the label prior distribution. The Lagrange multipliers are determined by maximizing F, which thereby determines the label probability distribution of the unlabeled data. As can be seen in formula 3, the data points contribute to the objective function independently, so each Lagrange multiplier is determined independently of the other Lagrange multipliers. For example, to maximize the contribution to F of an unlabeled data point with a classification score of high absolute value |st|, a small Lagrange multiplier λt is needed, whereas an unlabeled data point with a small value of |st| needs a large Lagrange multiplier to maximize its contribution to F. Furthermore, the expected label <y> of an unlabeled data point, expressed as a function of its classification score s and Lagrange multiplier λ, is:
<y> = tanh(λ s)   (4)
Fig. 1 shows the expected label <y> as a function of the classification score s, using the cost factors c = 5 and c = 1.5. The Lagrange multipliers used to produce Fig. 1 were determined by solving formula 3 with cost factors c = 5 and c = 1.5. As Fig. 1 shows, unlabeled data points outside the margin, i.e., |s| > 1, have expected labels <y> close to 0; data points close to the margin, i.e., |s| ≈ 1, produce the highest absolute expected label values; and data points close to the hyperplane, i.e., |s| < ∈, produce |<y>| < ∈. The counterintuitive label assignment <y> → 0 as |s| → ∞ results from the discriminative approach, which tries to stay as close as possible to the prior distribution as long as the classification constraints are met. It is not an artifact of the approximation chosen by the known method of Table 2: an algorithm that exhaustively searches all possible label assignments, and is therefore guaranteed to find the globally optimal solution, likewise assigns expected labels close or equal to zero to the unlabeled data outside the margin. To reaffirm, as stated above, this is what is desired from the discriminative point of view: data points outside the margin are unimportant for separating the examples, so the individual probability distributions of all these data points have reverted to their prior distributions.
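The dependence of the expected label on the classification score can be reproduced numerically by maximizing, for each data point, the per-point objective from formula 3 over λ on a grid and evaluating formula 4. This is a sketch under the stated priors; the cost factor value and grid resolution are illustrative:

```python
import math

def expected_label(s, c=5.0, grid=20000):
    """Maximize lam + log(1 - lam/c) - log(cosh(lam * s)) over lam in [0, c)."""
    best_lam, best_val = 0.0, float("-inf")
    for i in range(grid):
        lam = c * i / grid   # lam stays strictly below c
        val = lam + math.log(1.0 - lam / c) - math.log(math.cosh(lam * s))
        if val > best_val:
            best_lam, best_val = lam, val
    return math.tanh(best_lam * s)   # formula 4

# Scores far outside the margin get expected labels pulled toward 0,
# while a score at the margin yields a larger absolute expected label.
y_far = expected_label(3.0)
y_margin = expected_label(1.0)
y_center = expected_label(0.01)
```

This reproduces the qualitative shape discussed above: |<y>| shrinks both for scores near the hyperplane and for scores far outside the margin.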
The M step of the transductive classification algorithm of Jaakkola, incorporated herein, determines the probability distributions over the hyperplane parameters, bias term, and data point margins that are closest to the respective prior distributions, subject to the constraints
<yt> st ≥ <γt>   (5)
where st is the classification score of the t-th data point, <yt> is its expected label, and <γt> is its expected margin. For labeled data, the expected label is fixed at <y> = +1 or <y> = -1. The expected labels of unlabeled data lie within the interval (-1, +1) and are estimated in the label induction step. According to formula 5, since the classification score is scaled by the expected label, unlabeled data must meet tighter classification constraints than labeled data. In addition, given the relationship of the expected label as a function of the classification score (see Fig. 1), unlabeled data close to the separating hyperplane are subject to the most stringent classification constraints, because both their scores and the absolute values of their expected labels |<yt>| are small. The complete objective function of the M step, given the prior distributions described above, is:
J(λ) = −(1/2) Σt,t' <yt> <yt'> λt λt' (Xt · Xt') + Σt [λt + log(1 − λt/c)] − (σb²/2) (Σt λt <yt>)²   (6)
The first term derives from the Gaussian prior over the hyperplane parameters, the second term is the margin prior regularization term, and the last term is the bias prior regularization term, obtained from a Gaussian prior with zero mean and variance σb². The prior distribution of the bias term can be understood as a prior distribution over the class prior probability. Accordingly, the regularization term corresponding to the bias prior distribution limits the relative weights of the positive and negative examples. Referring to formula 6, the influence of the bias term is minimized when the collective pull of the positive examples on the hyperplane equals the collective pull of the negative examples. Owing to the bias prior, the collective constraint on the Lagrange multipliers is weighted by the expected labels of the data points, so unlabeled data are constrained less than labeled data. Consequently, unlabeled data have a stronger ability to influence the final solution than labeled data.
In summary, in the M step of the transductive classification algorithm of Jaakkola, incorporated herein, unlabeled data must meet tighter classification constraints than labeled data, yet their accumulated weight in the solution is constrained less than that of labeled data. Furthermore, unlabeled data with expected labels close to zero that lie within the margin of the current M step have a large influence on the solution. Thus, as shown in Fig. 2, the net effect of the formulated E and M steps can be illustrated by applying the algorithm to a data set. The data set includes two labeled examples, a negative example (x) located at position x = -1 and a positive example (+) at +1, together with six unlabeled examples (o) between -1 and +1 along the x axis. The cross (x) represents the labeled negative example, the plus sign (+) represents the labeled positive example, and the circles (o) represent the unlabeled data. The different graphs show the separating hyperplanes determined by different iterations of the M step. The final solution determined by the transductive MED classifier of Jaakkola, incorporated herein, misclassifies the labeled positive training example. Fig. 2 shows successive iterations of the M step. In the first iteration of the M step, the unlabeled data are not considered, and the separating hyperplane is located at x = 0. One unlabeled data point with a negative x value lies closer to this separating hyperplane than any other unlabeled data point. In the subsequent label induction step, it is assigned the smallest |<y>|; correspondingly, in the next M step it has the greatest authority and pushes the hyperplane toward the labeled positive example. The particular shape of the expected label <y> as a function of the classification score, determined by the chosen cost factor (see Fig. 1), combined with the particular spacing of the unlabeled data points, creates a bridging effect: with each successive M step, the separating hyperplane moves closer and closer to the positive example. Intuitively, the M step is afflicted by a kind of myopia: the unlabeled data points closest to the current separating hyperplane determine the final position of the plane the most, while data points far away are unimportant. Finally, since the bias prior term constrains the collective pull of the unlabeled data less than the collective pull of the labeled data, the separating hyperplane moves beyond the labeled positive example, yielding a final solution, the 15th iteration in Fig. 2, that misclassifies the labeled positive example. Fig. 2 uses a bias variance σb² and a cost factor of c = 10. Any cost factor within the range 9.8 < c < 13 produces a final hyperplane that misclassifies the labeled positive example, while all cost factors outside the interval 9.8 < c < 13 produce a separating hyperplane located somewhere between the two labeled examples.
The instability of this algorithm is not limited to the examples shown in Fig. 2; when the Jaakkola method, cited herein, is applied, it is also observed on real-world data sets, including the Reuters data set well known to those skilled in the art. The inherent instability of this embodiment of the method described in Table 2 is a major drawback and limits its versatility, although the Jaakkola method may be implemented in certain embodiments of the present invention.
A preferred method of the present invention uses transductive classification within the framework of maximum entropy discrimination (MED). It will be readily understood that the different embodiments of the invention, while applicable to classification, apply equally to other MED learning problems using transduction, including, but not limited to, transductive MED regression and graphical models.
Maximum entropy discrimination restricts and reduces the space of possible solutions by assuming a prior probability distribution over the parameters. The final solution is the expected value over all possible solutions, taken under the probability distribution that is closest to the assumed prior probability distribution subject to the constraint that the expected solution describes the training data accurately. The prior probability distribution over solutions maps to a regularization term: choosing a specific prior distribution amounts to choosing a specific regularization.
The discriminative estimation performed by support vector machines is effective when learning from a small number of examples. Like support vector machines, the methods and apparatus of embodiments of the present invention share this property: they do not estimate more parameters than are necessary to solve the given problem, and therefore produce a sparse solution. By contrast, generative model estimation, which attempts to explain the underlying process, generally requires estimating higher-order statistics than discriminative estimation. On the other hand, generative models are more flexible and can therefore be applied to a wide variety of problems. In addition, generative model estimation can incorporate prior knowledge directly. By using maximum entropy discrimination, the methods and apparatus of embodiments of the present invention bridge the gap between purely discriminative model estimation (e.g., support vector machine learning) and generative model estimation.
The method of embodiments of the invention shown in Table 3 is an improved transductive MED classification algorithm that does not exhibit the aforementioned instability present in the method of Jaakkola (incorporated herein). The differences include, but are not limited to, the following: in embodiments of the invention, each data point has its own cost factor, proportional to the absolute value of its expected label |<y>|. In addition, the label prior probability of each data point is updated after each M step according to the estimated group membership probability, expressed as a function of the distance of the data point from the decision function. The method of embodiments of the present invention is shown in Table 3 below:
Improved transductive MED classification
Require: Data matrix X of labeled and unlabeled training examples.
Require: Label prior probabilities p0(y) for labeled and unlabeled training examples.
Require: Global cost factor c.
1: <Y> := ExpectedLabel(p0(y)) {Expected label determined from the training examples' label prior probabilities.}
2: while not converged do
3:   C := |<Y>| c {Scale each training example's cost factor by the absolute value of its expected label.}
4:   W := MinimizeKLDivergence(X, <Y>, C)
5:   p0(y) := EstimateClassProbability(W, <Y>)
6:   Y′ := InduceLabels(W, X, p0(y), C)
7:   <Y> := ∈<Y> + (1 − ∈)Y′
8: end while
Table 3
Scaling the cost factor of each data point by |<y>| alleviates the problem that the collective pull of the unlabeled data on the hyperplane outweighs that of the labeled data, because the cost factors of unlabeled data are now smaller than the cost factors of labeled data; that is, the individual contribution of each unlabeled data point to the final solution is always smaller than the individual contribution of a labeled data point. However, if the total amount of unlabeled data is much larger than the amount of labeled data, the unlabeled data can still influence the final solution more than the labeled data. In addition, the cost factor scaling, combined with updating the label prior probabilities using the estimated class probabilities, solves the bridging effect problem described above. In the first M step, the unlabeled data have small cost factors, yielding an expected label as a function of the classification score that is relatively flat (see Fig. 1); correspondingly, to some extent, all unlabeled data are allowed to keep pulling on the hyperplane, though only with small weights. Furthermore, owing to the updating of the label prior probabilities, unlabeled data far from the separating hyperplane are no longer assigned expected labels close to 0; instead, over the course of the iterations, they are assigned labels close to y = +1 or y = -1 and are thus gradually treated like labeled data.
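The control structure of Table 3 can be sketched with simple stand-ins for the individual steps: a cost-weighted least-squares fit replaces MinimizeKLDivergence, a logistic mapping of the score replaces EstimateClassProbability, and a tanh update replaces InduceLabels. The data, constants, and stand-in functions are hypothetical, not the actual MED solver:

```python
import numpy as np

def fit_hyperplane(X, y_exp, cost):
    """Stand-in for MinimizeKLDivergence: cost-weighted ridge fit of (w, b)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    A = Xb.T @ (cost[:, None] * Xb) + 1e-3 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ (cost * y_exp))

def improved_transductive_med(X, y, labeled, c=10.0, eps=0.5, iters=25):
    p0 = np.where(labeled, (y + 1) / 2.0, 0.5)   # label prior P(y = +1)
    y_exp = 2.0 * p0 - 1.0                       # expected labels <Y>
    for _ in range(iters):
        cost = np.maximum(np.abs(y_exp) * c, 1e-6)        # step 3: scale costs
        wb = fit_hyperplane(X, y_exp, cost)               # step 4 ("M step")
        s = np.hstack([X, np.ones((len(X), 1))]) @ wb     # classification scores
        p0 = np.where(labeled, p0, 1.0 / (1.0 + np.exp(-2.0 * s)))  # step 5
        p0c = np.clip(p0, 1e-6, 1.0 - 1e-6)
        y_new = np.where(labeled, y,
                         np.tanh(s + 0.5 * np.log(p0c / (1.0 - p0c))))  # step 6
        y_exp = eps * y_exp + (1.0 - eps) * y_new         # step 7: damped update
    return y_exp

X = np.array([[-2.0], [2.0], [1.5], [1.8], [-1.5]])
y = np.array([-1.0, 1.0, 0.0, 0.0, 0.0])
labeled = np.array([True, True, False, False, False])
labels = improved_transductive_med(X, y, labeled)
```

The sketch shows the key dynamics: unlabeled points start with zero cost, acquire confident priors and growing cost factors over the iterations, and end with expected labels near +1 or -1 matching their side of the decision boundary.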
In a particular implementation of the method of embodiments of the invention, a Gaussian prior with zero mean and unit variance is assumed for the decision function parameters Θ:
P0(Θ) = N(Θ; 0, I)   (7)
The prior distribution of the decision function parameters incorporates important prior knowledge about the particular classification problem at hand. Other prior distributions of the decision function parameters important for classification problems include the multinomial distribution, the Poisson distribution, the Cauchy distribution (Breit-Wigner), the Maxwell-Boltzmann distribution, and the Bose-Einstein distribution.
The prior distribution of the decision function threshold b is given by a Gaussian distribution with mean μb and variance σb²:
P0(b) = N(b; μb, σb²)   (8)
The prior distribution of the classification margin γt of a data point is chosen as given by formula 9, where c is the cost factor. This prior distribution differs from the prior distribution used in Jaakkola (incorporated herein), whose expression is exp[-c(1 − γ)]. Preferably, the expression given by formula 9 is superior to the expression used in Jaakkola (incorporated herein), because formula 9 produces a positive expected margin even when the cost factor is less than 1, whereas for c < 1, exp[-c(1 − γ)] produces a negative expected margin.
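The claim about the Jaakkola margin prior can be checked numerically. Assuming the prior is normalized as p(γ) = c exp[-c(1 − γ)] on γ ≤ 1 (the normalization constant and support are inferred from the form of the prior), its expected margin works out to 1 − 1/c, which is negative precisely when c < 1:

```python
import numpy as np

def expected_margin(c, lo=-100.0, n=400001):
    """Numerically integrate E[gamma] under p(gamma) = c * exp(-c * (1 - gamma))."""
    g = np.linspace(lo, 1.0, n)          # support is gamma <= 1; lo truncates the tail
    p = c * np.exp(-c * (1.0 - g))
    dg = g[1] - g[0]
    return float(np.sum(g * p) * dg)     # simple Riemann sum

# E[gamma] = 1 - 1/c: negative for c < 1, positive for c > 1
e_low = expected_margin(0.5)
e_high = expected_margin(2.0)
```

This reproduces the stated drawback: with a small cost factor, the Jaakkola prior expects data points to lie on the wrong side of the margin on average.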
Given these prior distributions, the corresponding partition function Z, and from it the objective function, can be determined directly (see, for example, T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc.) (Cover). Following Jaakkola (incorporated herein), the objective function of the M step and the objective function of the E step are derived accordingly, where st is the classification score of the t-th data point determined in the preceding M step, and p0,1(yt) is the binary label prior probability of the data point. For labeled data, the label prior is initialized as p0,1(yt) = 1, and for unlabeled data the label prior is initialized as the non-informative prior p0,1(yt) = 1/2, or as the class prior probability.
The section below entitled M step describes the algorithm for solving the M step objective function. Similarly, the section below entitled E step describes the E step algorithm.
In the Estimate Class Probability step in line 5 of Table 3, the training data are used to determine calibration parameters that turn classification scores into group membership probabilities, i.e., a classification score s is assigned a probability p(c | s). Related techniques for calibrating scores into probability estimates are described in J. Platt, Probabilistic outputs for support vector machines and comparison to regularized likelihood methods, pages 61-74, 2000 (Platt), and B. Zadrozny and C. Elkan, Transforming classifier scores into accurate multi-class probability estimates, 2002 (Zadrozny).
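A minimal sketch of sigmoid score calibration in the spirit of Platt follows. The scores, labels, learning rate, and plain gradient descent are hypothetical simplifications; the cited methods include refinements (such as regularized targets) that are omitted here:

```python
import math

def platt_calibrate(scores, labels, lr=0.1, steps=2000):
    """Fit p(c=1 | s) = 1 / (1 + exp(A*s + B)) by gradient descent on the NLL."""
    A, B = -1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        gA = gB = 0.0
        for s, y in zip(scores, labels):      # labels are 0 or 1
            p = 1.0 / (1.0 + math.exp(A * s + B))
            gA += (p - y) * (-s)              # d(nll)/dA
            gB += (p - y) * (-1.0)            # d(nll)/dB
        A -= lr * gA / n
        B -= lr * gB / n
    return A, B

A, B = platt_calibrate([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
p_pos = 1.0 / (1.0 + math.exp(A * 2.0 + B))   # calibrated probability, high score
p_neg = 1.0 / (1.0 + math.exp(A * -2.0 + B))  # calibrated probability, low score
```

Once A and B are fitted, any raw classification score maps directly to a group membership probability, which is what the label prior update in Table 3 requires.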
Referring particularly to Fig. 3, the cross (x) represents the labeled negative example, the plus sign (+) represents the labeled positive example, and the circles (o) represent the unlabeled data. The different curves represent the separating hyperplanes determined by different iterations of the M step. The 20th iteration shows the final solution determined by the improved transductive MED classifier. Fig. 3 shows the improved transductive MED classification algorithm applied to the small data set described above. The parameters used are c = 10 and μb = 0. Different values of c produce separating hyperplanes located between x ≈ -0.5 and x = 0: for c < 3.5, the hyperplane lies to the right of the unlabeled data point at x < 0, and for c ≥ 3.5, the hyperplane lies to the left of this unlabeled data point.
Referring particularly to Fig. 4, a control flow diagram illustrates a method of classifying unlabeled data according to an embodiment of the present invention. Method 100 starts at step 102, and stored data 106 is accessed at step 104. The data stored in the memory element includes labeled data, unlabeled data, and at least one preset cost factor. Data 106 includes data points with assigned labels. The assigned label identifies whether a labeled data point is to be included in, or excluded from, a particular category.
Once the data has been accessed at step 104, the method of this embodiment uses the label information of each data point at step 108 to determine a label prior probability for that data point. Then, at step 110, an expected label for the data point is determined from the label prior probability. With the expected labels computed at step 110, together with the labeled data, the unlabeled data, and the cost factor, step 112 iteratively trains a transductive MED classifier by adjusting the cost factors of the unlabeled data points. In each iteration of the computation, the cost factors of the unlabeled data points are adjusted, so that the MED classifier learns from iteration to iteration. The trained classifier then accesses input data 114 at step 116, classifies the input data at step 118, and the method ends at step 120.
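The per-iteration cost-factor adjustment described for step 112 can be sketched as follows. This is a minimal illustration, not the patented algorithm itself: the helper names, and the soft-sign (tanh) stand-in for the MED retraining step, are assumptions.

```python
import math

def adjust_cost_factors(expected_labels, base_cost):
    # The cost factor of each unlabeled point is scaled by the
    # confidence |<y>| of its current expected label.
    return [base_cost * abs(y) for y in expected_labels]

def one_iteration(expected, scores, base_cost):
    costs = adjust_cost_factors(expected, base_cost)
    # Stand-in for retraining the MED classifier: expected labels
    # move toward a soft sign (tanh) of the classification score.
    new_expected = [math.tanh(s) for s in scores]
    return costs, new_expected

costs, expected = one_iteration([0.0, 0.8], [1.5, -2.0], base_cost=10.0)
```

An unlabeled point whose expected label is still near zero contributes almost no cost, so an early mislabeling cannot dominate training; as the label confidence grows, so does the penalty for violating that point's margin.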
It will be readily understood that the unlabeled data of 106 and the input data 114 may be obtained from a single source. The input data/unlabeled data may thus be used both in the iterative process of step 112 and in the classification of step 118. Moreover, embodiments of the present invention contemplate that the input data 114 may include a feedback mechanism supplying input data to the stored data 106, so that the MED classifier of step 112 dynamically learns from newly input data.
Referring particularly to Fig. 5, a control flow chart illustrates another method of classifying unlabeled data according to an embodiment of the present invention, including user-defined prior probability information. Method 200 starts at step 202, and stored data 206 is accessed at step 204. The data 206 includes labeled data, unlabeled data, a preset cost factor, and prior probability information supplied by a user. The labeled data of 206 includes data points with assigned labels, each assigned label identifying whether the labeled data point is to be included in a particular category or excluded from it.
At step 208, expected labels are computed from the data of 206. Then, at step 210, the expected labels are used together with the labeled data, the unlabeled data, and the cost factor to guide the iterative training of a transductive MED classifier. In each iteration of the computation of step 210, the cost factors of the unlabeled data are adjusted. The computation continues until the classifier is correctly trained.
The trained classifier then accesses input data 212 at step 214 and classifies the input data at step 216. As with the process and method described for Fig. 4, the input data and the unlabeled data may be obtained from a single source and may enter the system at 206 and 212. In this way, the input data 212 can influence the training at 210, so that the process can change dynamically over time with the continuing stream of input data.
In both methods shown in Figs. 4 and 5, a monitor may determine whether the system has reached convergence. Convergence may be determined when the change of the computed hyperplane between successive MED iterations drops below a preset threshold. In another embodiment of the present invention, convergence may be determined when the change of the determined expected labels drops below a preset threshold. If convergence is reached, the iterative training process may stop.
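A convergence monitor of the second kind (change in expected labels) can be sketched as below; the function name and tolerance are illustrative assumptions.

```python
def has_converged(prev_labels, new_labels, tol=1e-3):
    # Convergence when the largest change in any expected label
    # between successive iterations drops below a preset threshold.
    return max(abs(a - b) for a, b in zip(prev_labels, new_labels)) < tol
```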
Referring particularly to Fig. 6, a control flow chart shows the iterative training process of at least one embodiment of the inventive method in further detail. Process 300 starts at step 302, and data from data store 306 is accessed at step 304; this data may include labeled data, unlabeled data, at least one preset cost factor, and prior probability information. Each labeled data point of 306 includes a label identifying whether the data point is a training example to be included in a specified category or a training example to be excluded from a specified category. The prior probability information of 306 includes probability information for the labeled data set and the unlabeled data set.
At step 308, expected labels are determined from the prior probability information of step 306. At step 310, the cost factor of each unlabeled data point is adjusted in proportion to the absolute value of the expected label of that data point. An MED classifier is then trained at step 312 by determining a decision function: given the expected labels of the labeled and unlabeled data, and using the labeled and unlabeled data as training examples, the margin between the included and excluded training examples is maximized. At step 314, the classifier trained at step 312 is used to determine classification scores. At step 316, the classification scores are calibrated to class membership probabilities. At step 318, the label prior probabilities are updated according to the class membership probabilities. At step 320, an MED computation is performed to determine the label and margin probability distributions, the previously determined classification scores being used in the MED computation. As a result, new expected labels are computed at step 322, and at step 324 the expected labels are updated using the computation of step 322. At step 326, the method determines whether convergence has been reached. If so, the method ends at step 328. If convergence has not been reached, another iteration of the method is performed starting from step 310. The iterations continue until convergence is reached, thereby accomplishing the iterative training of the MED classifier. Convergence is reached when the change of the decision function between successive MED iterations drops below a preset value. In another embodiment, convergence is reached when the change of the determined expected label values drops below a preset threshold.
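Steps 316-322 form a calibration loop that maps raw scores to class membership probabilities and back to expected labels. A toy sketch under assumed conventions (a logistic calibration, and a two-class expected label ⟨y⟩ = 2·P(+1) − 1; both are illustrative choices, not taken from the patent):

```python
import math

def calibrate_to_membership(scores):
    # Step 316: map raw classification scores to class-membership
    # probabilities with a logistic (sigmoid) calibration.
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

def update_label_priors(membership):
    # Step 318: take the updated prior P(y=+1) for each unlabeled
    # point from its estimated class-membership probability.
    return list(membership)

def expected_labels(priors):
    # Step 322: <y> = (+1)*P(+1) + (-1)*P(-1) = 2*P(+1) - 1.
    return [2.0 * p - 1.0 for p in priors]
```

A prior of 0.5 (maximal uncertainty) maps to an expected label of 0, which in turn zeroes out that point's cost factor at step 310 of the next iteration.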
Fig. 7 shows a network architecture 700 according to one embodiment. As shown, a plurality of remote networks 702 are provided, including a first remote network 704 and a second remote network 706. A gateway 707 is coupled between the remote networks 702 and a proximate network 708. In the context of the present network architecture 700, each of the networks 704, 706 may take any form, including, but not limited to, a LAN; a WAN, such as the Internet; a public switched telephone network (PSTN); an internal telephone network; etc.
In use, the gateway 707 serves as an entrance point from the remote networks 702 to the proximate network 708. As such, the gateway 707 may function as a router, capable of directing a given packet of data that arrives at the gateway 707, and as a switch, furnishing the actual path in and out of the gateway 707 for a given packet.
Further included is at least one data server 714 coupled to the proximate network 708, which is accessible from the remote networks 702 via the gateway 707. It should be noted that the data server 714 may include any type of computing device/component. Coupled to each data server 714 is a plurality of user devices 716. Such user devices 716 may include desktop computers, laptop computers, hand-held computers, printers, or any other type of logic device. It should be noted that, in one embodiment, a user device 717 may also be directly coupled to any of the networks.
A facsimile machine 720 or series of facsimile machines 720 may be coupled to one or more of the networks 704, 706, 708. It should be noted that databases and/or additional components may be utilized with, or integrated into, any type of network element coupled to the networks 704, 706, 708. In the context of the present description, a network element may refer to any component of a network.
According to one embodiment, Fig. 8 shows a representative hardware environment associated with a user device 716 of Fig. 7. The figure illustrates a typical hardware configuration of a workstation having a central processing unit 810, such as a microprocessor, and a number of other units interconnected via a system bus 812.
The workstation shown in Fig. 8 includes a random access memory (RAM) 814; a read-only memory (ROM) 816; an I/O adapter 818 for connecting peripheral devices, such as a disk storage unit 820, to the bus 812; a user interface adapter 822 for connecting a keyboard 824, a mouse 826, a speaker 828, a microphone 832, and/or other user interface devices, such as a touch screen and a digital camera (not shown), to the bus 812; a communication adapter 834 for connecting the workstation to a communication network 835 (e.g., a data processing network); and a display adapter 836 for connecting the bus 812 to a display device 838.
Referring particularly to Fig. 9, a device 414 of one embodiment of the present invention is shown. One embodiment of the present invention includes a storage device 814 for storing labeled data 416. Each labeled data point 416 includes a label indicating whether the data point is a training example to be included in a specified category or a training example to be excluded from a specified category. The memory 814 also stores unlabeled data 418, prior probability data 420, and cost factor data 422.
The processor 810 accesses the data from the memory 814 and uses a transductive MED computation to train a binary classifier capable of classifying the unlabeled data. Using the cost factors and the labeled and unlabeled training examples, the processor 810 performs an iterative transductive computation, adjusting the cost factors as a function of the expected label values and thereby changing the cost factor data 422, which is in turn fed back into the processor 810. The cost factors 422 therefore change with each MED classification iteration of the processor 810. Once the processor 810 has sufficiently trained an MED classifier, the processor can direct the classifier to assign the unlabeled data to the classified data 424.
The transductive SVM and MED formulations of the prior art give rise to an exponentially growing number of potential label assignments, so that approximations must be developed for practical applications. In another embodiment of the present invention, a different formulation of transductive MED classification is described that does not suffer from the exponentially growing number of possible label assignments and that admits a conventional closed-form solution. For a linear classifier, the problem is formulated as follows: find the distribution of hyperplane parameters p(Θ), the bias distribution p(b), and the data point classification margins p(γ) whose combined probability distribution has a minimal Kullback-Leibler divergence KL from the combined respective prior distributions p0, i.e.
subject to the following constraints for the labeled data
and subject to the following constraints for the unlabeled data
where Θ·Xt is the dot product between the weight vector of the separating hyperplane and the feature vector of the t-th data point. No prior distribution over the labels is needed. The labeled data are constrained to lie on the correct side of the separating hyperplane according to their known labels, whereas the only requirement imposed on the unlabeled data is that their squared distance to the hyperplane exceeds the margin. In sum, embodiments of the present invention find a separating hyperplane that balances closeness to the chosen prior probabilities, accurate separation of the labeled data, and keeping unlabeled data outside the margin. An advantage is that no prior distribution over the labels need be introduced, thereby avoiding the problem of the exponentially growing number of potential label assignments.
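The equations themselves did not survive extraction. From the surrounding description, the optimization problem has roughly the following form; this is a reconstruction with the constraints written as MED-style expectation constraints, and the exact constants and margin priors in the patent may differ:

```latex
\min_{p(\Theta,b,\gamma)} \; KL\bigl(p(\Theta,b,\gamma)\,\|\,p_0(\Theta,b,\gamma)\bigr)
```

subject to, for each labeled point $t$,

```latex
\int p(\Theta,b,\gamma)\,\bigl[\,y_t\,(\Theta\cdot X_t + b) - \gamma_t\,\bigr]\;d\Theta\,db\,d\gamma \;\ge\; 0,
```

and, for each unlabeled point $t'$, only the squared-distance condition

```latex
\int p(\Theta,b,\gamma)\,\bigl[\,(\Theta\cdot X_{t'} + b)^2 - \gamma_{t'}\,\bigr]\;d\Theta\,db\,d\gamma \;\ge\; 0 .
```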
In a particular implementation of another embodiment of the present invention, using the prior distributions of the hyperplane parameters, the bias, and the margins given in formulas 7, 8, and 9 yields the following partition function:
where the subscript t runs over the labeled data and t' runs over the unlabeled data.
Introducing the notation:
and
formula 16 can be rewritten as follows:
After integration, the following partition function results:
That is, the final objective function is:
As in the case of the known labels discussed in the paragraphs herein referring to the M step, the objective function can be solved by applying similar methods. The difference is that the matrix G⁻¹ in the quadratic form of the objective function now has off-diagonal terms.
Beyond classification, the maximum entropy discrimination framework used by the present invention has multiple other applications. For example, MED can be used to solve regression problems. In general, it can be applied to any type of discriminant function and prior distribution, to regression, and to graphical models (T. Jebara, Machine Learning: Discriminative and Generative, Kluwer Academic Publishers) (Jebara).
The applications of the embodiments of the present invention can be formulated as pure inductive learning problems with known labels, as well as transductive learning problems with labeled and unlabeled training examples. In the embodiments below, the improvements of the transductive MED classification algorithm described in Table 3 apply equally to general transductive MED classification, transductive MED regression, and transductive MED learning of graphical models. Accordingly, for the purposes of this disclosure and the claims depending therefrom, the word "classification" may include regression or graphical modeling.
M step
According to formula 11, the objective function of the M step is:
{λt | 0 ≤ λt ≤ c},
where the Lagrange multipliers λt are determined by maximizing JM.
Neglecting the redundant constraint λt < c, the Lagrangian of the above problem is:
The KKT conditions, necessary and sufficient for optimality, are:
where Ft is:
At the optimal solution the bias equals the expected bias, yielding:
⟨yt⟩(−Ft − ⟨b⟩) + δt = 0 (25)
These formulas can be derived by considering the two cases of the constraint δtλt = 0: the first with all λt = 0, and the second with all 0 < λt < c. A third case, as in S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design, 1999 (Keerthi), where it applies to the SVM algorithm, need not be considered; in the present formulation, the potential function keeps λt ≠ c.
Some of these conditions may be violated at some data points t until the optimal solution is reached: that is, Ft ≠ −⟨b⟩ while λt is nonzero, or Ft⟨yt⟩ < −⟨b⟩⟨yt⟩ while λt is zero. Unfortunately, ⟨b⟩ cannot be computed without the optimal solution λt. A good solution to this problem borrows from the method of Keerthi (again incorporated herein by reference) by constructing the following three sets:
I0 = {t : 0 < λt < c} (28)
I1 = {t : ⟨yt⟩ > 0, λt = 0} (29)
I4 = {t : ⟨yt⟩ < 0, λt = 0} (30)
Using these sets and the following definitions, the worst violators of the optimality conditions can be identified. Elements of I0 are violators unless their Ft equals −⟨b⟩; hence, the minimum and maximum Ft from I0 are candidate violators. Elements of I1 are violators when Ft < −⟨b⟩; hence, the smallest element of I1, if it exists, is the candidate worst violator from I1. Finally, elements of I4 are violators when Ft > −⟨b⟩; the largest element of I4 is the candidate violator from I4. Accordingly, −⟨b⟩ is bounded by the following "minimum" and "maximum" values over these sets:
Since at the optimal solution −bup and −blow must be equal, namely to −⟨b⟩, reducing the gap between −bup and −blow drives the training algorithm toward convergence. The gap can also be used as a criterion of numerical convergence.
As noted above, the value of b = ⟨b⟩ is only known once convergence has been reached. The method of another embodiment differs in that only one example is optimized at a time. The training heuristic therefore alternates between the examples in I0 and all examples.
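The bup/blow gap check can be sketched as follows. This is an illustrative reading of the set definitions above, in the style of Keerthi's SMO: candidates for −bup are taken as the minimum Ft over I0 ∪ I1, and candidates for −blow as the maximum over I0 ∪ I4. The exact bookkeeping in the patent's equations did not survive extraction, so treat the bracketing directions as assumptions.

```python
def bias_gap(lams, F, ey, c, eps=1e-12):
    # Partition the examples into the sets I0, I1, I4 defined above.
    I0 = [t for t, lam in enumerate(lams) if eps < lam < c - eps]
    I1 = [t for t, lam in enumerate(lams) if lam <= eps and ey[t] > 0]
    I4 = [t for t, lam in enumerate(lams) if lam <= eps and ey[t] < 0]
    neg_b_up = min(F[t] for t in I0 + I1)   # tightest upper bracket on -<b>
    neg_b_low = max(F[t] for t in I0 + I4)  # tightest lower bracket on -<b>
    # Training has numerically converged once the two brackets coincide.
    return neg_b_low - neg_b_up
```

At the optimum the two brackets meet at −⟨b⟩ and the gap returned is zero (up to tolerance); a positive gap flags remaining violators to optimize.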
E step
The objective function of the E step, from formula 12, is
where st is the classification score of the t-th data point determined in the preceding M step. The Lagrange multipliers λt are determined by maximizing JE.
Neglecting the redundant constraint λt < c, the Lagrangian of the above problem is:
The KKT conditions, necessary and sufficient for optimality, are:
Since the problem factorizes over the examples, the solution can be completed by solving the KKT conditions for the Lagrange multiplier of each example independently.
For labeled examples, the expected label ⟨yt⟩ has P0,t(yt) = 1 and P0,t(−yt) = 0, simplifying the KKT condition to:
and yielding the closed-form solution for the Lagrange multiplier of a labeled example:
For unlabeled examples, formula 35 does not factor into a closed-form solution; instead, the Lagrange multiplier of each unlabeled example must be determined by a line search over values satisfying formula 35.
The following are several non-limiting examples that may be implemented using the methods enumerated above, their derivations or variations, and other methods known in the art. Each example includes preferred operations, as well as optional operations or parameters, that may be implemented with the basic preferred methodology.
In one embodiment, as shown in Fig. 10, labeled data points are received at step 1002, each data point having at least one label indicating whether the data point is a training example to be included in a particular category or a training example to be excluded from a particular category. In addition, unlabeled data points are received at step 1004, together with at least one preset cost factor for the labeled and unlabeled data points. The data points may comprise any medium, such as text, images, sound, etc. Prior probability information for the labeled and unlabeled data points may also be received. Moreover, the labels of included training examples may be mapped to a first numerical value, e.g., +1, and those of excluded training examples to a second numerical value, e.g., -1. In addition, the labeled data points, the unlabeled data points, the input data points, and the at least one preset cost factor for the labeled and unlabeled data points may be stored in a computer memory.
Further, at step 1006, a transductive MED classifier is trained by iterative computation, using the at least one cost factor and the labeled and unlabeled data points as training examples. For each iteration, the cost factor of each unlabeled data point is adjusted as a function of an expected label value, e.g., the absolute value of the expected label of the data point, and the label prior probabilities of the data points are adjusted according to the estimates of the class membership probabilities of the data points, thereby ensuring stability. Moreover, the transductive classifier can learn using the prior probability information of the labeled and unlabeled data, which further improves stability. The iterative steps of training the transductive classifier may be repeated until convergence of the data values is reached, e.g., when the change of the decision function of the transductive classifier drops below a preset threshold, when the change of the determined expected label values drops below a preset threshold, etc.
Additionally, at step 1008, the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points, and input data points. The input data points may be received before or after the classifier is trained, or not at all. Moreover, a decision function may be determined using the labeled and unlabeled data points as learning examples according to their expected labels: given the included and excluded training examples, the decision function minimizes the KL divergence of the prior probability distributions of the decision function parameters. In other words, the decision function may be determined using a multinomial distribution of the decision function parameters with minimal KL divergence.
At step 1010, the categories of the classified data points, or a derivative thereof, are output to at least one of a user, another system, and another process. The system may be remote or local. Examples of derivatives of the categories include, but are not limited to, the classified data points themselves, a representation or identifier of the classified data points or their master files/documents, etc.
In another embodiment, a computer system executes computer executable program code. The program code includes instructions for accessing labeled data points stored in a computer memory, each labeled data point having at least one label indicating whether the data point is a training example to be included in a specified category or a training example to be excluded from a specified category. In addition, the computer code includes instructions for accessing unlabeled data points from the computer memory, and instructions for accessing from the computer memory at least one preset cost factor for the labeled and unlabeled data points. Prior probability information for the labeled and unlabeled data points stored in the computer memory may also be accessed. Moreover, the labels of included training examples may be mapped to a first numerical value, e.g., +1, and those of excluded training examples to a second numerical value, e.g., -1.
Further, the program code comprises instructions that use the at least one stored cost factor and the stored labeled and unlabeled data points as training examples to train a transductive classifier by iterative computation. For each iteration, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of that data point, e.g., the absolute value of the expected label. Also, for each iteration, the prior probability information may be adjusted according to the estimates of the class membership probabilities of the data points. The iterative steps of training the transductive classifier may be repeated until convergence of the data values is reached, e.g., when the change of the decision function of the transductive classifier drops below a preset threshold, when the change of the determined expected label values drops below a preset threshold, etc.
In addition, the program code comprises instructions for using the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points, and instructions for outputting the categories of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process. Moreover, a decision function may be determined using the labeled and unlabeled data points as learning examples according to their expected labels: given the included and excluded training examples, the decision function minimizes the KL divergence of the prior probability distributions of the decision function parameters.
In another embodiment, a data processing apparatus includes at least one memory for storing: (i) labeled data points, each labeled data point having at least one label indicating whether the data point is a training example to be included in a specified category or a training example to be excluded from a specified category; (ii) unlabeled data points; and (iii) at least one preset cost factor for the labeled and unlabeled data points. The memory may also store prior probability information for the labeled and unlabeled data points. Moreover, the labels of included training examples may be mapped to a first numerical value, e.g., +1, and those of excluded training examples to a second numerical value, e.g., -1.
In addition, the data processing apparatus includes a transductive classifier trainer that iteratively trains the transductive classifier using transductive maximum entropy discrimination (MED), with the at least one cost factor and the labeled and unlabeled data points as training examples. Furthermore, in each MED iteration, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of that data point, e.g., the absolute value of the expected label of the data point. Also, in each MED iteration, the prior probability information may be adjusted according to the estimates of the class membership probabilities of the data points. The apparatus may also include a device for determining convergence of the data values, e.g., when the change of the decision function computed by the transductive classifier drops below a preset threshold, when the change of the determined expected label values drops below a preset threshold, etc., and for terminating the computation once convergence is determined.
In addition, the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points, and input data points. Moreover, a decision function may be determined using the labeled and unlabeled data points as learning examples according to their expected labels: given the included and excluded training examples, the decision function minimizes the KL divergence of the prior probability distributions of the decision function parameters. The categories of the classified data points, or a derivative thereof, are output to at least one of a user, another system, and another process.
In another embodiment, an article of manufacture comprises a computer readable program storage medium tangibly embodying one or more programs of computer executable instructions for performing a method of data classification. In use, labeled data points are received, each labeled data point having at least one label indicating whether the data point is a training example to be included in a specified category or a training example to be excluded from a specified category. In addition, unlabeled data points are received, together with at least one preset cost factor for the labeled and unlabeled data points. Prior probability information for the labeled and unlabeled data points may also be stored in the computer memory. Moreover, the labels of included training examples may be mapped to a first numerical value, e.g., +1, and those of excluded training examples to a second numerical value, e.g., -1.
Further, a transductive classifier is trained by iterative maximum entropy discrimination (MED) computation, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples. In each iteration of the MED computation, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of that data point, e.g., the absolute value of the expected label of the data point. Also, in each MED iteration, the prior probability information may be adjusted according to the estimate of the class membership probability of each data point. The iterative steps of training the transductive classifier may be repeated until convergence of the data values is reached, e.g., when the change of the decision function of the transductive classifier drops below a preset threshold, when the change of the determined expected label values drops below a preset threshold, etc.
In addition, input data points are accessed from the computer memory, and the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points, and the input data points. Moreover, a decision function may be determined using the labeled and unlabeled data points as learning examples according to their expected labels: given the included and excluded training examples, the decision function minimizes the KL divergence of the prior probability distributions of the decision function parameters. The categories of the classified data points, or a derivative thereof, are output to at least one of a user, another system, and another process.
In another embodiment, a method for classifying unlabeled data in a computer-based system is provided. In use, labeled data points are received, each labeled data point having at least one label indicating whether the data point is a training example to be included in a specified category or a training example to be excluded from a specified category.
In addition, labeled and unlabeled data points are received, and prior label probability information for the labeled and unlabeled data points is also received. Moreover, at least one preset cost factor for the labeled and unlabeled data points is also received.
Furthermore, the expected label of each labeled and unlabeled data point is determined according to the label prior probability of that data point. The following sub-steps are then repeated until the data values have sufficiently converged:
● generating an adjusted cost value for each unlabeled data point, proportional to the absolute value of the expected label of the data point;
● training a maximum entropy discrimination (MED) classifier by determining a decision function that, given the training examples included in and excluded from the training, uses the labeled and unlabeled data points as training examples according to their expected labels, the decision function minimizing the KL divergence of the prior probability distributions of the decision function parameters;
● determining the classification scores of the labeled and unlabeled data points using the trained classifier;
● calibrating the output of the trained classifier to class membership probabilities;
● updating the label prior probabilities of the unlabeled data points according to the determined class membership probabilities;
● determining the label and margin probability distributions using maximum entropy discrimination (MED), utilizing the updated label prior probabilities and the previously determined classification scores;
● computing new expected labels using the previously determined label probability distribution; and
● updating the expected label of each data point by replacing the expected label of the previous iteration with the new expected label.
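The sub-steps above can be strung together in a single loop. The sketch below uses toy stand-ins (a fixed score vector and a logistic calibration) in place of the MED training and probability computations, so the function and variable names are illustrative only.

```python
import math

def transductive_loop(scores, label_priors, base_cost, max_iters=50, tol=1e-6):
    # Initial expected labels <y> = 2*P(+1) - 1 from the label priors.
    expected = [2.0 * p - 1.0 for p in label_priors]
    for _ in range(max_iters):
        # Adjusted cost value proportional to |<y>| for each point.
        costs = [base_cost * abs(y) for y in expected]
        # Calibrate (fixed, toy) classifier scores to membership probabilities.
        membership = [1.0 / (1.0 + math.exp(-s)) for s in scores]
        # New expected labels from the updated label probabilities.
        new_expected = [2.0 * p - 1.0 for p in membership]
        delta = max(abs(a - b) for a, b in zip(expected, new_expected))
        expected = new_expected  # replace previous iteration's <y>
        if delta < tol:          # convergence of the expected label values
            break
    return expected, costs
```

Because the toy scores are fixed, the loop converges after the second pass; in the actual method the scores change each iteration as the MED classifier is retrained with the adjusted costs.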
Additionally, the categories of the classified input data points, or a derivative thereof, are output to at least one of a user, another system, and another process.
Convergence is reached when the change of the decision function drops below a preset threshold. Alternatively, convergence may also be reached when the change of the determined expected label values drops below a preset threshold. Moreover, the labels of included training examples may have any value, e.g., +1, and those of excluded training examples may have any value, e.g., -1.
In one embodiment of the present invention, a method for classifying documents proceeds as shown in Fig. 11. In use, at step 1100, at least one seed document with a known confidence level is received, together with unlabeled documents and at least one preset cost factor. The seed document and the other items may be received from a computer memory, a user, a network connection, etc., and may be received upon a request from the system performing the method. The at least one seed document may have an indicative label of whether the document is included in a specified category, may contain a list of keywords, or may have any other feature that assists in classifying documents. Further, at step 1102, a transductive classifier is trained by iterative computation using the at least one preset cost factor, the at least one seed document, and the unlabeled documents, wherein, for each iteration, the cost factor is adjusted as a function of an expected label value. Data point label prior probabilities for the labeled and unlabeled documents may also be received, wherein, for each iteration, the data point label prior probabilities may be adjusted according to the estimates of the class membership probabilities of the data points.
It addition, after at least part of iteration, be that unmarked file stores confidence score in step 1104, and in step
1106, the identifier of the unmarked file with the highest confidence score is exported to a user, another system and another process
In at least one.This identifier can be the electronic copies of this document itself, its part, its title, its title, point to file
Pointer, etc..And, confidence score can store after each iteration, wherein, after each iteration, has
The identifier of the unmarked file of the highest confidence score is output.
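The output step above (selecting the identifiers of the highest-confidence unlabeled files) can be sketched as follows; the mapping from document identifier to stored confidence score is an assumed data shape, not mandated by the text.

```python
def top_documents(confidence_by_doc, k=3):
    """Return the identifiers of the unlabeled files with the highest
    stored confidence scores, as output after an iteration.
    confidence_by_doc maps an identifier (title, path, pointer, ...)
    to its confidence score."""
    return sorted(confidence_by_doc, key=confidence_by_doc.get, reverse=True)[:k]
```

After each iteration, such a selection over the stored scores yields the files to surface to the user, another system, or another process.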
One embodiment of the invention can link an original query file with the remaining files. Such queries are targeted at regions where pattern-based querying proves especially valuable. For example, in pre-trial legal discovery, a large number of files must be reviewed for possible links to the lawsuit at hand; the ultimate goal is to discover the "smoking gun". In another example, a common task for inventors, patent examiners, and patent attorneys is to assess the novelty of a technology through a search of the prior art. In particular, this task involves searching all published patents and other publications, and finding within this collection the files likely to be relevant to the particular technology whose novelty is under examination.
The task of a query is to find a file, or a group of files, within a collection of data. Given an original file or concept, a user may wish to find the files related to that original file or concept. However, the nature of the relationship between the original file or concept and the target files, i.e., the files to be found, may only be fully understood after the query has been performed. By learning from labeled and unlabeled files, concepts, etc., the present invention can learn the patterns and relationships between one or more original files and the target files.
In another embodiment of the invention, a method for analyzing files related to a legal discovery is shown in Figure 12. In use, files related to a legal matter are received in step 1200. These files may include electronic copies of the files themselves, portions of them, their titles, their names, pointers to the files, etc. In step 1202, a file classification method is performed on the files. Further, in step 1204, identifiers of at least some of the files are output based on their classification. Optionally, indications of the links between the files are also output.
The file classification method may include any kind of process, such as a transductive process. For example, any of the foregoing inductive or transductive methods may be used. In a preferred approach, a transductive classifier is trained through iterative computation using at least one preset cost factor, at least one seed file, and the files related to the legal matter. For each iteration, the cost factors are preferably adjusted as a function of an expected label value, and the trained classifier is used to classify the received files. The process may also include receiving data point label prior probabilities for the labeled and unlabeled files, wherein for each iteration the data point label prior probabilities are adjusted according to an estimate of the class membership probability of the data points. In addition, the file classification method may include one or more support vector machine processes and maximum entropy discrimination processes.
In another embodiment, a method for analyzing prior art documents is shown in Figure 13. In use, in step 1300, a classifier is trained based on a search query. In step 1302, a plurality of prior art documents are accessed. The prior art may include any information made available to the public in any form before a given date. The prior art may also include information that was not available to the public in any form before a given date. The prior art documents may be files of any type, such as Patent Office publications, data taken from a database, a collection of prior art, web pages, etc. In step 1304, a file classification method is performed on at least some of the prior art documents using the classifier, and in step 1306 identifiers of at least some of the prior art documents are output based on their classification.
The document classification technique may include one or more processes, including a support vector machine process, a maximum entropy discrimination process, or any of the foregoing inductive or transductive methods. Optionally, indications of the links between the files may also be output. In another embodiment, relevance scores between at least some of the prior art documents are output based on their classification.
The search query may include at least a portion of a patent disclosure. Exemplary patent disclosures include invention disclosures by inventors, provisional patent applications, non-provisional patent applications, foreign patents or patent applications summarizing the invention, etc. In a preferred approach, the search query includes at least a portion of a claim of a patent or patent application. In another approach, the search query includes at least a portion of the abstract of a patent or patent application. In yet another approach, the search query includes at least a portion of the summary of the invention of a patent or patent application.
Figure 27 shows a method for matching files with claims. In step 2700, a classifier is trained based on at least one claim of a patent or patent application. Thus, one or more claims, or a portion thereof, may be used to train the classifier. In step 2702, a plurality of files are accessed. These files may include prior art documents, or files describing potentially infringing or prior-use products. In step 2704, a file classification method is performed on at least some of the files using the classifier. In step 2706, identifiers of at least some of the files are output based on their classification. Relevance scores of at least some of the files may also be output based on their classification.
One embodiment of the invention can be used for the classification of patent applications. In the United States, for example, patents and patent applications are today classified according to their subject matter using the US Patent Classification (USPC) system. This task is currently performed manually, and is therefore expensive and time-consuming. Such manual classification is also prone to error. A complication of this task is that a patent or patent application may be assigned to multiple classes.
According to one embodiment, Figure 28 shows a method for classifying patent applications. In step 2800, a classifier is trained based on a plurality of files known to belong to a particular patent classification. These files will typically be patents or patent applications (or portions thereof), but may also be summary files describing the target subject matter of the particular patent classification. In step 2802, at least a portion of a patent or patent application is received. The portion may include claims, a summary of the invention, an abstract, a specification, a title, etc. In step 2804, a file classification method is performed on the at least a portion of the patent or patent application using the classifier. In step 2806, the classification of the patent or patent application is output. Optionally, a user can manually check the classification of some or all of the patent applications.
The file classification method is preferably a yes/no classification method. In other words, if the probability that the file is in the correct category is above a threshold, a "yes" decision is made: the file belongs to the category. If the probability that the file is in the correct category is below the threshold, a "no" decision is made, and the file does not belong to the category.
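The yes/no rule above, applied independently per class, also accommodates the earlier observation that one application may fall into several classes. The following sketch assumes a per-class membership probability is already available; the class codes and the threshold value are illustrative.

```python
def classify_patent(probs_by_class, threshold=0.5):
    """Yes/no decision per patent class: a file is assigned to every
    class whose estimated membership probability exceeds the
    threshold, so a single application may receive multiple classes."""
    return [cls for cls, p in probs_by_class.items() if p > threshold]
```

A document whose probabilities clear the threshold for two classes is simply assigned to both.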
Figure 29 shows another method for classifying patent applications. In step 2900, a file classification method is performed on at least a portion of a patent or patent application using a classifier that has previously been trained based on at least one file associated with a particular patent classification. Again, the file classification method is preferably a yes/no classification method. In step 2902, the classification of the patent or patent application is output.
In both methods shown in Figures 28 and 29, the respective method may be repeated using different classifiers, each previously trained based on a plurality of files known to belong to a different patent classification.
Formally, the classification of a patent should be based on its claims. However, it may also be desirable to perform the matching between (any IP-related content) and (any other IP-related content). As an example, one method trains on patent specifications and classifies patent applications according to their claims. Another method trains on specifications and claims, and classifies based on abstracts. In a particularly preferred approach, whichever portion of the patent or application is used for training, the same type of content is used for classification; i.e., if the system is trained on claims, classification is also based on claims.
The file classification method may include any kind of process, such as a transductive process. For example, any of the foregoing inductive or transductive methods may be used. In a preferred approach, the classifier may be a transductive classifier trained through iterative computation using at least one preset cost factor, at least one seed file, and the prior art documents, wherein for each iteration the cost factors are adjusted as a function of an expected label value, and the trained classifier may be used to classify the prior art documents. Data point label prior probabilities for the seed files and the prior art documents may also be received, wherein for each iteration the data point label prior probabilities may be adjusted according to an estimate of the class membership probability of the data points. A seed file may be any file, such as a Patent Office publication, data taken from a database, a group of prior art, a website, a patent disclosure, etc.
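The two per-iteration adjustments just described, cost factors as a function of the expected label and label priors nudged toward estimated class membership, can be sketched as two small helpers. The linear cost form and the mixing rate are assumed illustrations; the text only requires that the cost depend on the expected label value and that the prior track the membership estimate.

```python
def adjust_costs(expected_labels, base_cost=1.0):
    """Per-iteration cost adjustment: each point's cost factor is
    taken proportional to the absolute value of its current expected
    label, so confidently-labeled points weigh more next round."""
    return [base_cost * abs(y) for y in expected_labels]

def update_label_priors(priors, membership_probs, rate=0.5):
    """Move each point's label prior toward its estimated class
    membership probability; the mixing rate is an assumption."""
    return [(1 - rate) * p + rate * m
            for p, m in zip(priors, membership_probs)]
```

In a training loop, both would be recomputed once per iteration before the classifier is retrained.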
In one approach, Figure 14 depicts an embodiment of the invention. In step 1401, a group of data is read. Within this group of data are the files the user needs to discover. In step 1402, one or more initial seed files are labeled. The files may be of any type, such as Patent Office publications, data taken from a database, a group of prior art, websites, etc. The transductive process may also be seeded with a string of different keywords or a file layout provided by the user. In step 1406, a transductive classifier is trained using the labeled data and the unlabeled data in a given collection. At each label induction step of the iterative transductive process, the confidence scores determined in the label induction procedure are stored. In step 1408, once training is complete, the files that obtained high confidence scores in the label induction steps are displayed to the user. These high-confidence files represent the files relevant to the user's query purpose. The display may follow the temporal order of the label induction steps, starting from the initial seed files and ending with the last group of files found in the final label induction step.
Another embodiment of the invention relates to data cleanup and accurate classification, for example in combination with automated business processes. The cleanup and classification technique may include any kind of process, such as a transductive process. For example, any of the foregoing transductive or inductive methods may be used. In a preferred approach, depending on the expected cleanliness of the database, the keys entered into the database are used as labels associated with confidence levels. These labels, together with their associated confidence levels, i.e., the expected labels, are then used to train a transductive classifier, and the classifier corrects the labels (keys), thereby achieving more reliable management of the data in the database. For example, invoices must first be classified according to the invoicing company or individual before automatic data extraction, such as determining the total amount, order number, product quantities, shipping address, etc., can take place. In general, building an automatic classification system requires training examples. However, the training examples provided by customers often contain misclassified files or other noise, such as fax cover pages; to obtain accurate classification, these files must be identified and removed before the automatic classification system is trained. In another embodiment, in the medical records field, the invention helps detect inconsistencies between a report written by a doctor and its diagnosis report.
In another embodiment, it is well known that Patent Offices must undergo continuous reclassification processes, wherein they (1) assess an existing, unbalanced branch of their classification scheme, (2) restructure the classification scheme to evenly distribute overcrowded nodes, and (3) reclassify existing patents into the new structure. The transductive learning methods described here can be used by Patent Offices, and by the companies to which they outsource this work, to reevaluate their classification schemes and to help them (1) build a new classification scheme for a given main classification, and (2) reclassify existing patents.
Transduction learns from labeled and unlabeled data, and thus spans a smooth transition from labeled to unlabeled. At one end of the spectrum is labeled data with perfect prior knowledge, e.g., all given labels are correct. At the other end is unlabeled data with no prior knowledge at all. Data whose grouping has been disturbed to some degree constitutes misclassified data, and lies somewhere between the two extremes of the spectrum. The labels given by the data organization can to some extent be considered correct, but not entirely. Therefore, transduction can be used to clean up an existing data collection by assuming a certain degree of error within a given data organization, and interpreting these errors as uncertainty in the prior knowledge of the label distribution.
In one embodiment, a method for cleaning up data is shown in Figure 15. In use, in step 1500, a plurality of labeled data items are received, and in step 1502 a subset of the data items is chosen for each of a plurality of categories. In addition, in step 1504, the uncertainty of the data items in each subset is set to about zero, and in step 1506 the uncertainty of the data items not in the subsets is set to a preset value other than about zero. Further, in step 1508, a transductive classifier is trained through iterative computation, using the uncertainties, the data items in the subsets, and the data items not in the subsets as training examples, and in step 1510 the trained classifier classifies each of the labeled data items. The classification of the input data items, or a derivative thereof, is output in step 1512 to at least one of a user, another system, and another process.
Further, the subsets may be chosen randomly, or may be chosen and verified by a user. The labels of at least some of the data items may be changed based on their classification. Moreover, after classification, the identifiers of the data items with a confidence level below a preset threshold are output to the user. An identifier may be an electronic copy of the file itself, a portion of it, its title, its name, a pointer to the file, etc.
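Steps 1502 through 1506 can be sketched as follows. The "first n items per class" subset choice is a stand-in for the random or user-verified selection the text allows, and the concrete uncertainty value for non-subset items is an assumption.

```python
def assign_uncertainties(items_by_class, trusted_per_class=2, high=0.9):
    """Pick a trusted subset per class, give it near-zero label
    uncertainty, and give every other item a nonzero preset
    uncertainty so transduction may relabel it."""
    uncertainty = {}
    for cls, items in items_by_class.items():
        for i, item in enumerate(items):
            # subset members are trusted; the rest carry uncertainty
            uncertainty[item] = 0.0 if i < trusted_per_class else high
    return uncertainty
```

The resulting per-item uncertainties would then feed the transductive training of step 1508.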
In one embodiment of the invention, as shown in Figure 16, two options for starting a cleanup process are presented to a user in step 1600. In step 1602, one option is fully automatic cleanup: for each concept or category, a certain number of files are chosen at random and assumed to be correctly organized. Alternatively, in step 1604, a certain number of files can be labeled, with the label assignments manually inspected and verified for each concept or category to confirm that the files are organized accurately. In step 1606, an estimate of the noise level in the data is received. In step 1610, the transductive classifier is trained using the verified data from step 1608 (manually checked or randomly chosen) together with the unverified data. Once training is complete, the files are reorganized according to the new labels. In step 1612, the files whose label assignments have a confidence level below a particular threshold are displayed to the user for manual inspection. In step 1614, the files whose label assignments have a confidence level above a particular threshold are corrected automatically according to the transductive label assignment.
In another embodiment, a method for managing medical records is shown in Figure 17. In use, in step 1700 a classifier is trained based on a medical diagnosis, and in step 1702 a plurality of medical records are accessed. In addition, in step 1704, a file classification method is performed on the medical records using the classifier, and in step 1706 the identifier of at least one medical record with a low probability of correlation with the medical diagnosis is output. The file classification method may include any kind of process, such as a transductive process, and may include one or more of the foregoing inductive or transductive methods, including a support vector machine process, a maximum entropy discrimination process, etc.
In one embodiment, the classifier may be a transductive classifier trained through iterative computation using at least one preset cost factor, at least one seed file, and the medical records, wherein for each iteration the cost factors are adjusted as a function of an expected label value, and the trained classifier may be used to classify the medical records. Data point label prior probabilities for the seed files and the medical records may also be received, wherein for each iteration the data point label prior probabilities may be adjusted according to an estimate of the class membership probability of a group of data points.
Another embodiment of the invention addresses dynamically drifting classification concepts. For example, in forms processing applications, files are classified using their layout information and/or content information, so that the classified files can be processed further. In many applications, the files are not fixed but change over time. For example, the content and/or layout of a file may change due to new legislation. Transductive classification adapts to these changes automatically, yielding the same or similar classification accuracy without being affected by the drifting classification concept. In contrast with rule-based systems or inductive classification methods, no manual adjustment is needed, and accuracy does not suffer from concept drift. One example of this approach is invoice processing, which traditionally involves inductive learning or rule-based systems that exploit the invoice layout. With these traditional systems, if the layout changes, the system must be manually reset by labeling new training data or defining new rules. The use of transduction, however, automatically adapts to minor changes in the invoice layout, making manual resets unnecessary. In another embodiment, transductive classification can be used to analyze customer complaints in order to monitor changes in the nature of those complaints. For example, a company can automatically link product changes with customer complaints.
Transduction can also be used for the classification of news articles. For example, news articles about wars and terrorist attacks, starting with the terrorist attacks of September 11, 2001 and the war in Afghanistan, through news stories about the then-current situation in Iraq, can be identified automatically using transduction.
In another embodiment, biological classification (alpha taxonomy) can change over time through evolution, as new species arise and other species become extinct. As classification concepts change over time, the classification outline, taxonomy, and other such rules can change dynamically.
By using the input data to be classified as unlabeled data, transduction can identify drifting classification concepts and thereby automatically adapt to a changing classification outline. For example, Figure 18 shows an embodiment of the invention that uses transduction given a drifting classification concept. A group of files Dt enters the system at time t, as shown in step 1802. In step 1804, a transductive classifier Ct is trained using the labeled and unlabeled data accumulated so far, and in step 1806 the files in group Dt are classified. In a manual mode, files determined in step 1808 to have a confidence level below a user-provided threshold are presented to the user for manual inspection in step 1810. In an automatic mode, as shown in step 1812, a file with a low confidence level triggers the creation of a new category; the category is added to the system, and the file is then assigned to that new category. In steps 1820A-B, files with a confidence level above the selected threshold are classified into the current categories 1 to N. The files of all current categories classified at time t are reclassified by classifier Ct in step 1822, and in steps 1824 and 1826 all files that are no longer classified into their previously assigned category are moved into the new category.
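The automatic-mode routing in Figure 18 can be sketched as follows; the category names, the threshold value, and the single "new class" target are illustrative assumptions.

```python
def route_document(scores_by_class, threshold=0.6):
    """Automatic-mode drift handling: if the best class score clears
    the confidence threshold, the file joins that class; otherwise a
    new class is created for it and the file is assigned there."""
    best = max(scores_by_class, key=scores_by_class.get)
    if scores_by_class[best] >= threshold:
        return best
    return "new_class"
```

Files routed to the new class would then participate in the reclassification pass of steps 1822 through 1826.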
In another embodiment, a method for adapting to variations in file content is shown in Figure 19. File content may include, but is not limited to, image content, text content, layout, numbering, etc. Examples of variation may include changes over time, changes in style (one or more files processed by two or more individuals), changes in the application process, variations in layout, etc. In step 1900, at least one labeled seed file, unlabeled files, and at least one preset cost factor are received. The files may include, but are not limited to, customer complaints, invoices, form documents, receipts, etc. In step 1902, a transductive classifier is trained using the at least one preset cost factor, the at least one seed file, and the unlabeled files. In step 1904, the unlabeled files with a confidence level above a preset threshold are classified into a plurality of categories using the classifier, and in step 1906 at least some of the classified files are reclassified into a plurality of categories using the classifier. Further, in step 1908, identifiers of the classified files are output to at least one of a customer, another system, and another process. An identifier may be an electronic copy of the file itself, a portion of it, its title, its name, a pointer to the file, etc. Moreover, product variations may be linked with customer complaints, etc.
In addition, unlabeled files with a confidence level below a predetermined threshold may be moved into one or more new categories. Moreover, a transductive classifier may be trained through iterative computation using at least one preset cost factor, at least one seed file, and the unlabeled files, wherein for each iteration the cost factors are adjusted as a function of an expected label value, and the trained classifier is used to classify the unlabeled files. Furthermore, data point label prior probabilities for the seed files and the unlabeled files may be received, wherein for each iteration the data point label prior probabilities are adjusted according to an estimate of the class membership probability of a group of data points.
In another embodiment, a method for adapting a patent classification to variations in file content is shown in Figure 20. In step 2000, at least one labeled seed file and unlabeled files are received. The unlabeled files may include files of any type, e.g., patent applications, legal documents, information disclosure statements, file amendments, etc. The seed files may include patents, patent applications, etc. In step 2002, a transductive classifier is trained using the at least one seed file and the unlabeled files, and the classifier is used to classify the unlabeled files with a confidence level above a predetermined threshold into a plurality of existing categories. The classifier may be any kind of classifier, such as a transductive classifier, and the file classification method may be any method, such as a support vector machine method, a maximum entropy discrimination method, etc. For example, any of the foregoing inductive or transductive methods may be used.
In step 2004, the classifier is used to classify the unlabeled files with a confidence level below the predetermined threshold into at least one new category, and in step 2006 the classifier is used to reclassify at least some of the classified files into the existing categories and the at least one new category. Further, in step 2008, identifiers of the classified files are output to at least one of a user, another system, and another process. Furthermore, the transductive classifier may be trained through iterative computation using at least one preset cost factor, a search query, and the files, wherein for each iteration the cost factors are adjusted as a function of an expected label value, and the trained classifier may be used to classify the files. Further, data point prior probabilities for the search query and the files may be received, wherein for each iteration the data point prior probabilities are adjusted according to an estimate of the class membership probability of the data points.
In another embodiment of the invention, file drift in the field of file separation is described. One application example is the processing of mortgage files, which comprise a series of different lending documents, such as loan applications, approvals, requests, amounts, etc. The lending documents are scanned, and before further processing, the different files within the series of images must be identified. The files used are not fixed, but can change over time. For example, the tax forms used among lending documents can change over time with changes in laws and regulations.
File separation solves the problem of finding file or subfile boundaries in a series of images. Typical sources of such image series are digital scanners and multi-function peripherals (MFPs). As in the classification embodiments, transduction can be used for file separation, in order to handle files and the drift of their boundaries over time. Static separation systems, such as rule-based systems or systems based on inductive learning, cannot automatically adapt to a drifting separation concept. Whenever drift occurs, the performance of these static separation systems degrades over time. To maintain their initial level of performance, either the rules must be adjusted manually (for rule-based systems), or new files must be manually labeled to retrain the learning system (for inductive learning). Either way is time-consuming and expensive. Applying transduction to file separation improves the system so that it can automatically adapt to drift in the separation concept.
In one embodiment, a method for separating files is shown in Figure 21. In step 2100, labeled data is received, and in step 2102 a group of unlabeled files is received. The data and files may include legal discovery files, official notices, web data, attorney correspondence, etc. In addition, in step 2104, the probabilistic classification rules are adapted using transduction, based on the labeled data and the unlabeled files, and in step 2106 the weights used for file separation are updated according to the probabilistic classification rules. In step 2108, the positions of the separations in the group of files are determined, and in step 2110 indicators of the determined separation positions in the group of files are output to at least one of a user, another system, and another process. An indicator may be an electronic copy of the file itself, a portion of it, its title, its name, a pointer to the file, etc. Further, in step 2112, the files are marked with codes, the codes being associated with the indicators.
Figure 22 shows an implementation of the classification method and apparatus of the present invention applied to file separation. After digital scanning, automatic file separation is used to reduce the manual work involved in separating and identifying files. Using an inference algorithm, the file separation method is combined with classification rules to automatically separate multi-page groups, applying the classification methods described here to deduce the most likely separations from all available information. As shown in Figure 22, one example of the present invention uses the transductive MED classification method for file separation. Specifically, file pages 2200 are placed into a digital scanner 2202 or MFP and converted into a set of digital images 2204. The file pages may come from pages of any type of file, such as publications of a patent office, data taken from a database, a collection of prior art, websites, etc. In step 2206, the set of digital images is input in order to dynamically adapt the probabilistic classification rules using transduction. Step 2206 uses the set of images 2204 as unlabeled data together with labeled data 2208. In step 2210, the weights in the probability network are updated and used for automatic file separation based on the dynamically adapted classification rules. Output step 2212 dynamically and adaptively inserts separator images: the set of digital pages 2214 is automatically interleaved with separator pages 2216, which are automatically inserted into the image sequence in step 2212. In one embodiment of the invention, the software-created separator page 2216 can also indicate the type of the file that immediately follows that separator page 2216. The system described herein automatically adapts to the drifting separation concept that files exhibit over time, without suffering the reduction in separation accuracy seen in rule-based static systems or methods based on inductive machine learning. In form processing applications, a common example of a drifting separation or classification concept is, as mentioned before, files that change due to new laws and regulations.
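The interleaving of output step 2212 can be sketched as below. This is an illustrative stand-in, not the patent's code: `insert_separators`, the `"SEPARATOR"` marker tuple, and the optional per-file type map are all assumed names, and the separation indices are taken as already decided by the adapted probability network.

```python
# Illustrative sketch of Figure 22, output step 2212: interleaving
# automatically generated separator pages into a scanned page stream.
# The separator may also carry the predicted type of the file that
# follows it, as described for separator page 2216.

def insert_separators(pages, separations, doc_types=None):
    """separations: boundary indices (boundary i lies between page i and i+1).
    doc_types: optional map from a file's first page index to its type."""
    out = []
    for i, page in enumerate(pages):
        if i - 1 in separations:  # page i begins a new file
            dtype = doc_types.get(i, "unknown") if doc_types else "unknown"
            out.append(("SEPARATOR", dtype))
        out.append(page)
    return out

pages = ["img0", "img1", "img2", "img3"]
stream = insert_separators(pages, {1}, {2: "invoice"})
print(stream)  # a separator tagged "invoice" appears before img2
```

The Figure 23 variant described next replaces the inserted separator objects with codes attached directly to each page image.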
In addition, the system shown in Figure 22 can be changed into the system shown in Figure 23, in which pages 2300 are placed into a digital scanner 2302 or MFP and converted into a set of digital images 2304. This group of digital images is input in step 2306 in order to dynamically adapt the probabilistic classification rules using transduction. Step 2306 uses this group of images 2304 as unlabeled data together with labeled data 2308. In step 2310, the weights in the probability network used for automatic file separation are updated according to the dynamically adapted classification rules. In step 2312, rather than inserting separator page images as described in Figure 18, the dynamically adapted separation information is automatically inserted, and the document images are marked with a descriptive code. Thus, the file page images can be input into an image processing database 2316, and the files can be accessed by software identifiers.
Another embodiment of the invention can use transduction for face recognition. As described above, using transduction has many advantages, such as requiring only a relatively small number of training examples and the ability to use unlabeled samples in training. Exploiting these advantages, transductive face recognition can be used in criminal investigation.
For example, the Department of Homeland Security must ensure that terrorists cannot board commercial airliners. One part of the airport screening process could be to capture a photograph of each passenger at airport security and attempt to identify that person. The system could initially be trained with a small number of examples drawn from the limited available photographs of suspected terrorists. Unlabeled photographs of the same terrorists in other law-enforcement databases could also be used for training. A transductive trainer can therefore not only build a functional face recognition system from very sparse data, but can also use unlabeled samples from other sources to enhance performance. After processing the photographs captured at airport security, the transductive system can identify suspects more accurately than an inductive system.
In another embodiment, a method for face recognition is shown in Figure 24. In step 2400, at least one labeled seed image of a face, having a known confidence level, is received. The at least one seed image may carry a label indicating whether the image belongs to a specified category. Also in step 2400, unlabeled images are received, e.g., from a police office, a government agency, a missing-children database, airport security, or anywhere else, and at least one preset cost factor is received. In step 2402, a transductive classifier is trained by iterative computation using the at least one preset cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration the cost factor is adjusted as a function of an expected label value. After at least some of the iterations, in step 2404, confidence scores are stored for the unlabeled images.
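The iterative training of steps 2400-2404 can be sketched in pure Python. This is a schematic toy, not the patent's probabilistic network: a nearest-centroid learner stands in for the classifier, and the specific update rule (scaling the cost factor by the mean confidence of the expected labels, then using it to weight how strongly unlabeled points pull the class centroids) is an illustrative assumption — the patent states only that the cost factor is adjusted each iteration as a function of an expected label value.

```python
# Schematic sketch of transductive training with a per-iteration cost-factor
# adjustment (steps 2400-2404). Toy nearest-centroid learner; the update
# rules are assumptions, not the patent's actual algorithm.

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_transductive(seed, unlabeled, c0=1.0, n_iter=3):
    """seed: dict label -> list of labeled points. Returns (centroids, confs)."""
    cents = {lab: centroid(pts) for lab, pts in seed.items()}
    for _ in range(n_iter):
        # assign expected labels and a crude confidence to unlabeled points
        assigned, confs = {lab: [] for lab in cents}, []
        for p in unlabeled:
            d = {lab: dist2(p, c) for lab, c in cents.items()}
            lab = min(d, key=d.get)
            total = sum(d.values()) or 1.0
            confs.append(1.0 - d[lab] / total)   # confidence score (step 2404)
            assigned[lab].append(p)
        cost = c0 * (sum(confs) / len(confs))    # adjust cost factor
        # re-estimate centroids, weighting unlabeled points by the cost factor
        for lab, pts in seed.items():
            if assigned[lab]:
                cl, cu = centroid(pts), centroid(assigned[lab])
                cents[lab] = tuple((l + cost * u) / (1 + cost)
                                   for l, u in zip(cl, cu))
    return cents, confs

seed = {0: [(0.0, 0.0), (0.0, 1.0)], 1: [(5.0, 5.0), (6.0, 5.0)]}
unlabeled = [(0.2, 0.4), (5.4, 4.8)]
cents, confs = train_transductive(seed, unlabeled)
```

In the patent's setting the learner is an MED-style probabilistic classifier rather than centroids, but the loop structure — predict expected labels, derive confidences, adjust the cost factor, retrain — is the part this sketch is meant to convey.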
Further, in step 2406, the identifier of the unlabeled file with the highest confidence score is output to at least one of a user, another system, and another process. The identifier may be an electronic copy of the file itself, a portion thereof, its title, its name, a pointer to the file, etc. Moreover, a confidence score may be stored after each iteration, in which case, after each iteration, the identifier of the unlabeled image with the highest confidence score is output. Furthermore, data point label prior probabilities for the labeled and unlabeled images may be received, wherein for each iteration the data point label prior probability may be adjusted according to an estimate of the data point group membership probabilities. Further, a third unlabeled image of a face, such as a sample from the airport security checkpoint mentioned above, may be received; the third unlabeled image may be compared with at least some of the images having the highest confidence scores, and if the face in the third unlabeled image is believed to be identical to the face in a seed image, the identifier of the third unlabeled image may be output.
Another embodiment of the invention enables users to improve their search results by providing feedback to a document retrieval system. For example, when a search is executed on an internet search engine (a patent or patent application search product, etc.), the user may obtain a large number of results corresponding to the search query. One embodiment of the present invention allows the user to browse the suggested results from the search engine and to inform the search engine of the relevance of one or more of the obtained results, e.g., "close, but not what I really want", "absolutely not", etc. As the user provides feedback to the search engine, better results are prioritized for the user to browse.
In one embodiment, a method for file search is shown in Figure 25. In step 2500, a search query is received. The search query may be of any kind, including a case-sensitive query, a Boolean query, an approximate-match query, a structured query, etc. In step 2502, files are obtained based on the search query. In step 2504, the files are output, and in step 2506, labels entered by the user for at least some of the files are received, the labels indicating the relevance between the files and the search query. For example, the user may indicate whether a particular result returned by the query is relevant or irrelevant. In step 2508, a classifier is trained based on the search query and the user-entered labels, and in step 2510, a file classification method is performed on the files using the classifier in order to reclassify them. Further, in step 2512, identifiers of at least some of the files are output based on their classification. The identifier may be an electronic copy of the file itself, a portion thereof, its title, its name, a pointer to the file, etc. The reclassified files may also be output, with the files having the highest confidence levels output first.
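The Figure 25 feedback loop can be sketched as a toy relevance-feedback re-ranker. All function names here are illustrative; the patent permits any inductive or transductive classifier in step 2508, whereas this sketch substitutes a trivial term-overlap scorer trained on the user's relevant/irrelevant labels.

```python
# Toy sketch of Figure 25 (steps 2506-2512): user labels a few results,
# a simple term-based "classifier" is trained on that feedback, and the
# result list is re-ranked so likely-relevant files come out first.

def score(doc, relevant_terms, irrelevant_terms):
    words = set(doc.lower().split())
    return len(words & relevant_terms) - len(words & irrelevant_terms)

def rerank(results, feedback):
    """feedback: dict doc -> True (relevant) / False (irrelevant)."""
    rel = {w for d, ok in feedback.items() if ok for w in d.lower().split()}
    irr = {w for d, ok in feedback.items() if not ok for w in d.lower().split()}
    irr -= rel  # terms also seen in relevant docs are not penalized
    return sorted(results, key=lambda d: score(d, rel, irr), reverse=True)

results = ["patent claim transduction", "cooking recipe", "transduction classifier"]
feedback = {"patent claim transduction": True, "cooking recipe": False}
print(rerank(results, feedback)[0])  # prints "patent claim transduction"
```

Note how the unlabeled third result benefits from the feedback on the other two — the same leverage the text attributes to transductive use of unlabeled data.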
The file classification method may include any kind of process, e.g., a transductive process, a support vector machine process, a maximum entropy discrimination process, etc. Any of the inductive or transductive methods described above may be used. In a preferred approach, the classifier may be a transductive classifier, trained by iterative computation using at least one preset cost factor, the search query, and the files, wherein for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier may be used to classify the files. In addition, data point label prior probabilities for the search query and the files may be received, wherein for each iteration the data point label prior probabilities may be adjusted according to an estimate of the data point group membership probabilities.
Another embodiment of the invention may be used to improve ICR/OCR and speech recognition. For example, many speech recognition programs and systems require the operator to repeat many words in order to train the system. The present invention can first monitor a user's voice for a predetermined period of time to collect "unclassified" content, e.g., by monitoring telephone conversations. As a result, when the user starts to train the recognition system, the system uses transductive learning to build a memory model aided by the monitored speech.
In another embodiment, a method for verifying the association between an invoice and an entity is shown in Figure 26. In step 2600, a classifier is trained based on an invoice format associated with a first entity. The invoice format may refer to the actual layout of markings on the invoice, or to features on the invoice such as keywords, the invoice number, the customer name, etc. In step 2602, multiple invoices labeled as being associated with at least one of the first entity and other entities are accessed, and in step 2604, a file classification method is performed on the invoices using the classifier. For example, any of the inductive or transductive methods described above can serve as the file classification method; it may include a transductive process, a support vector machine process, a maximum entropy discrimination process, etc. In step 2606, the identifier of at least one invoice having a high probability of being unrelated to the first entity is output.
Further, the classifier may be any kind of classifier, such as a transductive classifier, trained by iterative computation using at least one predetermined cost factor, at least one seed file, and the invoices, wherein for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier is used to classify the invoices. In addition, data point label prior probabilities for the seed files and the invoices may be received, wherein for each iteration the data point label prior probabilities are adjusted according to an estimate of the data point group membership probabilities.
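The Figure 26 check can be illustrated with a toy sketch. The keyword-overlap "format profile" below is a stand-in assumption for the trained classifier of step 2600, and all names (`train_format_profile`, `flag_mismatches`, the `min_overlap` threshold) are introduced for this example only.

```python
# Illustrative sketch of Figure 26 (steps 2600-2606): learn the recurring
# terms of the first entity's invoice format, then flag invoices whose
# format overlaps that profile too little as likely unrelated.

def train_format_profile(entity_invoices):
    """Collect terms common to all of the entity's invoices (step 2600)."""
    profile = set(entity_invoices[0].lower().split())
    for inv in entity_invoices[1:]:
        profile &= set(inv.lower().split())
    return profile

def flag_mismatches(invoices, profile, min_overlap=0.5):
    """Return ids of invoices with a high probability of being unrelated
    to the first entity (step 2606)."""
    flagged = []
    for inv_id, text in invoices.items():
        words = set(text.lower().split())
        if len(words & profile) / len(profile) < min_overlap:
            flagged.append(inv_id)
    return flagged

acme = ["acme corp invoice no 1 total due", "acme corp invoice no 2 total due"]
profile = train_format_profile(acme)
batch = {"a3": "acme corp invoice no 3 total due",
         "x9": "globex bill 9 amount payable"}
print(flag_mismatches(batch, profile))  # prints ['x9']
```

In practice the profile would cover layout features as well as keywords, as the text notes, and the classifier would be one of the transductive or SVM processes described above.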
One advantage of the embodiments described here is the stability of the transduction algorithm. This stability is achieved by regulating the cost factors and by regulating the label prior probabilities. For example, in one embodiment, a transductive classifier is trained by iterative classification using at least one cost factor and using labeled and unlabeled data points as training examples. For each iteration, the cost factor of the unlabeled data points is regulated as a function of an expected label value. In addition, for each iteration, a data point label prior probability is regulated according to an estimate of the data point group membership probabilities.
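The second stabilizing device — per-iteration regulation of the label prior — can be sketched as a damped update. The damping constant and the exact blending rule are illustrative assumptions; the patent says only that the prior is regulated according to the estimated group membership probabilities each iteration.

```python
# Sketch of per-iteration label-prior regulation: blend the old prior with
# the mean estimated membership probability of the unlabeled points, with
# damping so the prior cannot jump to 0 or 1 in a single step.

def update_label_prior(prior, membership_probs, damping=0.5):
    """damping in (0, 1] controls the step size of the prior update."""
    estimate = sum(membership_probs) / len(membership_probs)
    return (1 - damping) * prior + damping * estimate

prior = 0.5
for probs in [[0.9, 0.8, 0.7], [0.95, 0.85, 0.9]]:  # two iterations
    prior = update_label_prior(prior, probs)
print(round(prior, 3))  # prints 0.775
```

Damped updates of this shape are a common way to keep self-training loops from locking onto early mislabels, which is the stability property claimed above.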
A workstation may have an operating system resident in memory, such as the Microsoft Windows® operating system (OS), the MAC OS, or the UNIX operating system. It should be appreciated that the preferred embodiments may also be implemented on platforms and operating systems other than those mentioned. A preferred embodiment may be written in JAVA, XML, C and/or C++ or other programming languages, in conjunction with object-oriented programming methodology. Object-oriented programming (OOP), which has been increasingly used to develop complex applications, may be employed.
The applications above use transductive learning to overcome the problem of extremely sparse data sets that plagues inductive face recognition systems. This aspect of transductive learning is not limited to that application; it can also be used to solve other machine learning problems caused by data set sparseness.
Those skilled in the art can devise various changes within the scope and spirit of the various embodiments of the invention disclosed herein. Moreover, the various features of the embodiments disclosed above may be used alone or in various combinations with each other, and are not limited to the particular combinations described above. Accordingly, the scope of the claims is not limited to the described embodiments.
Claims (18)
1. A face recognition method, characterized by comprising:
receiving at least one labeled seed image of a face, the seed image having a known confidence level;
receiving unlabeled images;
receiving at least one preset cost factor;
training a transductive classifier by iterative computation, using the at least one preset cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration the cost factor is regulated as a function of an expected label value;
after at least some of the iterations, storing confidence scores for the unlabeled images; and
outputting the identifier of the unlabeled image with the highest confidence score to at least one of a user, another system, and another process.
2. The method according to claim 1, characterized in that: the at least one seed image has a label indicating whether the image belongs to a specified category.
3. The method according to claim 1, characterized in that: a confidence score is stored after each iteration, wherein after each iteration the identifier of the unlabeled image with the highest confidence score is output.
4. The method according to claim 1, characterized by further comprising receiving a data point label prior probability for the labeled and unlabeled images; wherein, for each iteration, the data point label prior probability is regulated according to an estimate of the data point group membership probabilities.
5. The method according to claim 1, characterized by further comprising receiving a third unlabeled image of a face, comparing the third unlabeled image with at least some of the images having the highest confidence scores, and, if the face in the third unlabeled image is believed to be identical to the face in the seed image, outputting the identifier of the third unlabeled image.
6. A method for adapting a patent classification to changes in file content, characterized by comprising:
receiving at least one labeled seed file, wherein the at least one seed file is selected from the group consisting of patents and patent applications;
receiving unlabeled files, the unlabeled files being at least one of patents and patent applications;
training a transductive classifier using the at least one seed file and the unlabeled files;
using a processor to classify, with the classifier, the unlabeled files having a confidence level above a predetermined threshold into multiple existing categories;
automatically creating at least one previously non-existing new category, and using the classifier to classify the unlabeled files having a confidence level below a preset threshold into the at least one new category;
using the classifier to reclassify at least some of the previously classified files into the existing categories and the at least one new category; and
outputting identifiers of the classified files to at least one of a user, another system, and another process.
7. The method according to claim 6, characterized in that: the classifier is a transductive classifier, and the method further comprises training the transductive classifier by iterative computation using at least one preset cost factor, a search query, and the files, wherein for each iteration the cost factor is regulated as a function of an expected label value, and the trained classifier is used to classify the files.
8. The method according to claim 6, characterized by further comprising receiving a data point label prior probability for the search query and the files; wherein, for each iteration, the data point label prior probability is regulated according to an estimate of the data point group membership probabilities.
9. The method according to claim 6, characterized in that: the file classification method includes a support vector machine process.
10. The method according to claim 6, characterized in that: the file classification method includes a maximum entropy discrimination process.
11. The method according to claim 6, characterized in that: the unlabeled files are patent applications.
12. The method according to claim 6, characterized in that: the at least one seed file is selected from a patent and a patent application.
13. A method for adapting to changes in file content, characterized by comprising:
receiving at least one labeled seed file;
receiving unlabeled files;
receiving at least one preset cost factor;
training a transductive classifier using the at least one preset cost factor, the at least one seed file, and the unlabeled files;
using a processor to classify, with the classifier, the unlabeled files having a confidence level above a predetermined threshold into multiple categories;
using the classifier to reclassify files previously classified by a different classifier into multiple categories, thereby adapting to variations in the file content, wherein the file content includes at least one of image content, text content, layout, and numbering, and wherein the variation is at least one of a variation over time, a variation in style, and a variation in layout; and
outputting identifiers of the classified files to at least one of a user, another system, and another process.
14. The method according to claim 13, characterized by further comprising moving the unlabeled files having a confidence level below a predetermined threshold into one or more new categories.
15. The method according to claim 13, characterized by further comprising training the transductive classifier by iterative computation using at least one preset cost factor, the at least one seed file, and the unlabeled files; wherein, for each iteration, the cost factor is regulated as a function of an expected label value, and the trained classifier is used to classify the unlabeled files.
16. The method according to claim 15, characterized by further comprising receiving a data point label prior probability for the seed files and the unlabeled files; wherein, for each iteration, the data point label prior probability is regulated according to an estimate of the data point group membership probabilities.
17. The method according to claim 13, characterized in that: the unlabeled files are customer complaints, and the method further comprises associating product variations with the customer complaints.
18. The method according to claim 13, characterized in that: the unlabeled files are invoices.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610972541.XA CN107180264A (en) | 2006-07-12 | 2007-06-07 | For the transductive classification method to document and data |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US83031106P | 2006-07-12 | 2006-07-12 | |
US60/830,311 | 2006-07-12 | ||
US11/752,673 | 2007-05-23 | ||
US11/752,634 | 2007-05-23 | ||
US11/752,719 | 2007-05-23 | ||
US11/752,691 | 2007-05-23 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610972541.XA Division CN107180264A (en) | 2006-07-12 | 2007-06-07 | For the transductive classification method to document and data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101449264A CN101449264A (en) | 2009-06-03 |
CN101449264B true CN101449264B (en) | 2016-11-30 |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2369698A (en) * | 2000-07-21 | 2002-06-05 | Ford Motor Co | Theme-based system and method for classifying patent documents |
Non-Patent Citations (2)
Title |
---|
Discriminative, Generative and Imitative Learning; Tony Jebara; MIT PhD Thesis; 2002-02-28; entire document *
Learning from Partially Labeled Data; Marcin Olof Szummer; MIT PhD Thesis; 2002-09-30; entire document *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |