CN107180264A - Transductive classification method for documents and data - Google Patents

Transductive classification method for documents and data

Info

Publication number
CN107180264A
Authority
CN
China
Prior art keywords
data
label
point
classification
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201610972541.XA
Other languages
Chinese (zh)
Inventor
Mauritius A. R. Schmidtler
Christopher K. Harris
Roland Borrey
Anthony Sarah
Nicola Caruso
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tungsten Automation Corp
Original Assignee
Kofax Image Products Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US11/752,634 (US7761391B2)
Priority claimed from US11/752,673 (US7958067B2)
Priority claimed from US11/752,691 (US20080086432A1)
Priority claimed from US11/752,719 (US7937345B2)
Application filed by Kofax Image Products Inc
Priority claimed from CN200780001197.9A (CN101449264B)
Publication of CN107180264A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning


Abstract

The invention discloses systems, methods, data processing apparatus, and articles of manufacture for classifying data. Data classification methods employing machine learning techniques are also disclosed.

Description

Transductive classification method for documents and data
This application is a divisional application. The international application number of the original application is PCT/US2007/013484, the international filing date is June 7, 2007, the Chinese national application number is 200780001197.9, the date of entry into the Chinese national phase is April 23, 2008, and the title of the invention is "Methods and systems for transductive data classification, and data classification methods using machine learning techniques".
Technical field
The present invention relates generally to methods and apparatus for data classification. More specifically, the invention provides improved transductive machine learning methods. The invention further relates to new applications of machine learning methods.
Background art
With the information age and the recent enormous explosion of electronic data across all trades and professions (including, in particular, scanned documents, web content, search engine data, text data, images, audio data files, etc.), how data is processed has become very important.
One field that is just beginning to be explored is non-manual data classification. In many classification techniques, a machine or computer must learn from rules established through manual input and/or from manually created training examples. In machine learning that uses training examples, the number of examples is generally smaller than the number of parameters that must be estimated; that is, the set of solutions satisfying the constraints given by the training examples is large. One challenge of machine learning is to find a solution that nevertheless generalizes well despite this limitation. There is therefore a need to overcome these and/or other problems of the prior art.
There is a further need for practical applications of various types of machine learning methods.
Summary of the invention
In a computer-based system, according to one embodiment of the present invention, a method for classifying data includes: receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of data points belonging to a particular category or a training example of data points excluded from a particular category; receiving unlabeled data points; receiving at least one predefined cost factor for the labeled and unlabeled data points; training a transductive classifier through iterative computation using maximum entropy discrimination (MED), with the at least one cost factor and with the labeled and unlabeled data points as training examples, wherein for each iteration the cost factors of the unlabeled data points are adjusted as a function of an expected label value, and the prior label probability of each data point is adjusted according to an estimate of its class-membership probability; classifying at least one of the unlabeled data points, the labeled data points, and input data points with the trained classifier; and outputting the classification of the data points, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for classifying data includes providing a computer system with executable program code and executing it on the computer system, the program code including instructions for: accessing labeled data points stored in computer memory, each labeled data point having at least one label indicating whether the data point is a training example of data points belonging to a particular category or a training example of data points excluded from a particular category; accessing unlabeled data points from computer memory; accessing from computer memory at least one predefined cost factor for the labeled and unlabeled data points; training a maximum entropy discrimination (MED) transductive classifier through iterative computation, using the at least one cost factor and the stored labeled and unlabeled data points as training examples, wherein for each iteration the cost factors of the unlabeled data points are adjusted as a function of an expected label value, and the prior probability of each data point's label is adjusted according to an estimate of its class-membership probability; classifying at least one of the unlabeled data points, the labeled data points, and input data points with the trained classifier; and outputting the classification of the data points, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a data processing apparatus includes: at least one memory for storing (i) labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of data points belonging to a particular category or a training example of data points excluded from a particular category, (ii) unlabeled data points, and (iii) at least one predefined cost factor for the labeled and unlabeled data points; and a transductive classifier trainer for cyclically training a transductive classifier using transductive maximum entropy discrimination (MED), with the at least one stored cost factor and with the stored labeled and unlabeled data points as training examples, wherein for each MED iteration the cost factors of the unlabeled data points are adjusted as a function of an expected label value, and the prior probability of each data point's label is adjusted according to an estimate of its class-membership probability;
wherein the classifier trained by the transductive classifier trainer is used to classify at least one of the unlabeled data points, the labeled data points, and input data points;
wherein the classification of the data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
According to another embodiment of the invention, an article of manufacture includes a computer-readable program storage medium tangibly embodying one or more programs of computer-executable instructions for performing a method of data classification including: receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of data points belonging to a particular category or a training example of data points excluded from a particular category; receiving unlabeled data points; receiving at least one predefined cost factor for the labeled and unlabeled data points; training a transductive classifier through iterative maximum entropy discrimination (MED) computation, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein in each MED iteration the cost factors of the unlabeled data points are adjusted as a function of an expected label value, and the prior probability of a data point's label is adjusted according to an estimate of its class-membership probability; classifying at least one of the unlabeled data points, the labeled data points, and input data points with the trained classifier; and outputting the classification of the data points, or a derivative thereof, to at least one of a user, another system, and another process.
In a computer-based system, according to another embodiment of the invention, a method for classifying unlabeled data includes: receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of data points belonging to a particular category or a training example of data points excluded from a particular category; receiving labeled and unlabeled data points; receiving prior label probability information for the labeled and unlabeled data points; receiving at least one predefined cost factor for the labeled and unlabeled data points; determining the expected label of each labeled and unlabeled data point according to the label prior probabilities of the data points; and repeating the following sub-steps until the data values have sufficiently converged (a minimal code sketch of this loop is given after the list):
● generating, for each unlabeled data point, an adjusted cost value proportional to the absolute value of the expected label of that data point;
● training a classifier using the labeled and unlabeled data points as training examples according to their expected labels, by determining a decision function given the training examples included in and excluded from the category, the decision function minimizing the KL divergence to the prior probability distribution over the decision function parameters;
● determining the classification scores of the labeled and unlabeled data points using the trained classifier;
● calibrating the output of the trained classifier to class-membership probabilities;
● updating the label prior probabilities of the unlabeled data points according to the determined class-membership probabilities;
● determining the label and margin probability distributions using maximum entropy discrimination (MED), the updated label prior probabilities, and the previously determined classification scores;
● computing new expected labels using the label probability distribution determined in the previous sub-step; and
● updating the expected label of each data point by interpolating the expected labels of the previous iteration with the new expected labels.
The classification of an input data point, or a derivative thereof, is output to at least one of a user, another system, and another process.
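For illustration only, the following is a minimal sketch of this loop, assuming a linear decision function, a cost-weighted ridge fit standing in for the full MED optimization of each M-step, and a logistic calibration of the classification scores; all function and variable names are invented for this sketch and are not from the original disclosure.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_transductive(X_lab, y_lab, X_unl, c=1.5, n_iter=20, tol=1e-4):
    """y_lab entries are -1 or +1; returns weights and expected labels."""
    y_lab = np.asarray(y_lab, dtype=float)
    X = np.vstack([X_lab, X_unl])
    n_lab = len(y_lab)
    # label prior: a delta function for labeled points, 1/2 for unlabeled
    p_pos = np.concatenate([(y_lab + 1) / 2.0, np.full(len(X_unl), 0.5)])
    y_exp = 2.0 * p_pos - 1.0                       # expected labels <y>
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        cost = c * np.abs(y_exp)                    # step 1: per-point cost
        # step 2 stand-in: cost-weighted ridge fit toward expected labels
        w = np.linalg.solve(X.T @ (cost[:, None] * X) + np.eye(X.shape[1]),
                            X.T @ (cost * y_exp))
        s = X @ w                                   # step 3: scores
        # steps 4-5: calibrate scores to class-membership probabilities
        # and update the label priors of the unlabeled points
        p_pos[n_lab:] = sigmoid(s[n_lab:])
        p = np.clip(p_pos, 1e-6, 1 - 1e-6)
        # steps 6-7: expected labels under the MED label distribution,
        # <y> = tanh(lambda*s + log-odds of the prior), with lambda ~ cost
        y_new = np.tanh(cost * s + 0.5 * np.log(p / (1 - p)))
        y_new[:n_lab] = y_lab                       # labeled points stay fixed
        y_new = 0.5 * (y_exp + y_new)               # step 8: interpolate
        done = np.max(np.abs(y_new - y_exp)) < tol  # convergence test
        y_exp = y_new
        if done:
            break
    return w, y_exp
```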
According to another embodiment of the invention, a document classification method includes: receiving at least one labeled seed document having a known confidence level of its label assignment; receiving unlabeled documents; receiving at least one predefined cost factor; training a transductive classifier through iterative computation using the at least one predefined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration the cost factor is adjusted as a function of an expected label value; storing confidence scores for the unlabeled documents after at least some of the iterations; and outputting the identifier of the unlabeled document with the highest confidence score to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for analyzing documents associated with legal discovery includes: receiving documents associated with a legal matter; performing a document classification method on the documents; and outputting identifiers of at least some of the documents based on their classification.
According to another embodiment of the invention, a method for cleaning up data includes: receiving a plurality of labeled data items; selecting a subset of the data items for each of a plurality of categories; setting the bias of the data items in each subset to about zero; setting the bias of the data items not in the subsets to a predefined value that is not about zero; training a transductive classifier through iterative computation using the biases, with the data items in the subsets and the data items not in the subsets as training examples; applying the trained classifier to each of the labeled data items in order to classify each data item; and outputting the classification of the input data, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for verifying an association of an invoice with an entity includes: training a classifier based on an invoice format associated with a first entity; accessing a plurality of invoices labeled as being associated with at least one of the first entity and other entities; performing a document classification method on the invoices using the classifier; and outputting the identifier of at least one invoice having a high probability of not being associated with the first entity.
According to another embodiment of the invention, a method for managing medical records includes: training a classifier based on a medical diagnosis; accessing a plurality of medical records; performing a document classification method on the medical records using the classifier; and outputting the identifier of at least one medical record having a relatively low probability of being associated with the medical diagnosis.
According to another embodiment of the invention, a method for face recognition includes: receiving at least one labeled seed image of a face, the seed image having a known confidence level; receiving unlabeled images; receiving at least one predefined cost factor; training a transductive classifier through iterative computation using the at least one predefined cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration the cost factor is adjusted as a function of an expected label value; storing confidence scores for the unlabeled images after at least some of the iterations; and outputting the identifier of the unlabeled image with the highest confidence score to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for analyzing prior art documents includes: training a classifier based on a search query; accessing a plurality of prior art documents; performing a document classification method on at least some of the prior art documents using the classifier; and outputting identifiers of at least some of the prior art documents based on their classification.
According to another embodiment of the invention, a method for adapting a patent classification to drifting document content includes: receiving at least one labeled seed document; receiving unlabeled documents; training a transductive classifier using the at least one seed document and the unlabeled documents; using the classifier, assigning unlabeled documents having a confidence level above a predetermined threshold to a plurality of existing categories; using the classifier, assigning other unlabeled documents having a confidence level below a predetermined threshold to at least one new category; using the classifier, reassigning at least some of the classified documents to the existing categories and the at least one new category; and outputting identifiers of the classified documents to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for matching documents to claims includes: training a classifier based on at least one claim of a patent or patent application; accessing a plurality of documents; performing a document classification method on at least some of the documents using the classifier; and outputting identifiers of at least some of the documents based on their classification.
According to another embodiment of the invention, a method for classifying a patent or patent application includes: training a classifier based on a plurality of documents known to belong to a particular patent category; receiving at least a portion of a patent or patent application; performing a document classification method on the at least a portion of the patent or patent application using the classifier; and outputting the classification of the patent or patent application, wherein the document classification method is a yes/no classification method.
According to another embodiment of the invention, a method for adapting to drifting document content includes: receiving at least one labeled seed document; receiving unlabeled documents; receiving at least one predefined cost factor; training a transductive classifier using the at least one predefined cost factor, the at least one seed document, and the unlabeled documents; using the classifier, assigning unlabeled documents having a confidence level above a predetermined threshold to a plurality of categories; using the classifier, reclassifying at least some of the classified documents into the plurality of categories; and outputting identifiers of the classified documents to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for document separation includes: receiving labeled data; receiving a group of unlabeled documents; updating probabilistic classification rules using transduction, based on the labeled data and the unlabeled documents; updating weights used for document separation according to the probabilistic classification rules; determining the locations of separations in the group of documents; outputting indicators of the determined separation locations to at least one of a user, another system, and another process; and marking the documents with codes associated with the indicators.
According to another embodiment of the invention, a method for document searching includes: receiving a search query; retrieving documents based on the search query; outputting the documents; receiving user-entered labels for at least some of the documents, the labels indicating the relevance between the documents and the search query; training a classifier based on the search query and the user-entered labels; performing a document classification method on the documents using the classifier in order to reclassify the documents; and outputting identifiers of at least some of the documents based on their classification.
Brief description of the drawings
Fig. 1 is a graph of the expected label as a function of the classification score, obtained by MED discriminative learning adapted for label induction.
Fig. 2 is a schematic diagram of the iterative computation of a set of decision functions obtained by transductive MED learning.
Fig. 3 is a schematic diagram of the iterative computation of a set of decision functions obtained by improved transductive MED learning according to an embodiment of the invention.
Fig. 4 is a process flow diagram for classifying unlabeled data using an adjusted cost factor, according to one embodiment of the invention.
Fig. 5 is a process flow diagram for classifying unlabeled data using user-defined prior probability information, according to one embodiment of the invention.
Fig. 6 is a detailed process flow diagram for classifying unlabeled data using maximum entropy discrimination with adjusted cost factors and prior probability information, according to one embodiment of the invention.
Fig. 7 is a network diagram of a network architecture in which the various embodiments described herein may be implemented.
Fig. 8 is a system block diagram of a representative hardware environment associated with a user device.
Fig. 9 is a block diagram of an apparatus according to one embodiment of the present invention.
Fig. 10 is a flow diagram of a classification process performed according to one embodiment.
Fig. 11 is a flow diagram of a classification process performed according to one embodiment.
Fig. 12 is a flow diagram of a classification process performed according to one embodiment.
Fig. 13 is a flow diagram of a classification process performed according to one embodiment.
Fig. 14 is a flow diagram of a classification process performed according to one embodiment.
Fig. 15 is a flow diagram of a classification process performed according to one embodiment.
Fig. 16 is a flow diagram of a classification process performed according to one embodiment.
Fig. 17 is a flow diagram of a classification process performed according to one embodiment.
Fig. 18 is a flow diagram of a classification process performed according to one embodiment.
Fig. 19 is a flow diagram of a classification process performed according to one embodiment.
Fig. 20 is a flow diagram of a classification process performed according to one embodiment.
Fig. 21 is a flow diagram of a classification process performed according to one embodiment.
Fig. 22 is a process flow diagram for a first document classification system, according to a method of one embodiment of the invention.
Fig. 23 is a process flow diagram for a second document classification system, according to a method of one embodiment of the invention.
Fig. 24 is a flow diagram of a classification process performed according to one embodiment.
Fig. 25 is a flow diagram of a classification process performed according to one embodiment.
Fig. 26 is a flow diagram of a classification process performed according to one embodiment.
Fig. 27 is a flow diagram of a classification process performed according to one embodiment.
Fig. 28 is a flow diagram of a classification process performed according to one embodiment.
Fig. 29 is a flow diagram of a classification process performed according to one embodiment.
Detailed description
The following description is of the best mode presently contemplated for carrying out the present invention. This description is made for the purpose of illustrating the general principles of the present invention and is not meant to limit the inventive concepts described herein. Moreover, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations.
Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation, including meanings implied from the specification as well as meanings understood by those skilled in the art and/or as defined in dictionaries, treatises, etc.
Text classification
The benefits of and need for classifying text data are enormous, and a variety of classification techniques have been used. Classification techniques for text data are discussed below.
To increase their utility and intelligence, machines such as computers are asked to classify (or recognize) objects from an ever-expanding range. For example, computers may use optical character recognition to classify handwritten or scanned numerals and letters, use pattern recognition to classify images such as faces, fingerprints, or fighter planes, or use speech recognition to classify sounds, speech, etc.
Machines are also asked to classify textual information objects, such as computer text files or documents. The applications of text classification are varied and important. For example, text classification can be used to organize textual information objects into a hierarchy of predetermined classes or categories. In this way, finding (or locating) textual information objects related to a particular subject is simplified. Text classification can be used to route appropriate textual information objects to appropriate people or places. In this way, an information service can route textual information objects covering diverse subjects (e.g., business, sports, the stock market, football, a particular company, a particular football team) to people having diverse interests. Text classification can be used to filter textual information objects so that a person is not subjected to unwanted textual content (such as unwanted and unsolicited e-mail, also referred to as SPAM or "junk" e-mail). As can be appreciated from these examples, text classification has many exciting and important applications.
Rule-based classification
In some instances, it is necessary to classify document content with absolute certainty, based on certain accepted logic. A rule-based system can be used to effect such types of classification. Basically, rule-based systems use production rules of the form:
IF condition, THEN fact.
The condition may include whether the textual information object includes certain words or phrases, has a certain syntax, or has certain attributes. For example, if the text content has the word "closing quotation", the phrase "Nasdaq", and a number, then it is classified as "stock market" text. A sketch of such a rule follows.
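As an illustration only, the production rule of this example might be hard-coded as follows (the function and category names merely restate the example above and are not from the original):

```python
def classify(text: str) -> str:
    # IF condition, THEN fact: static, predefined logic
    if ("closing quotation" in text
            and "Nasdaq" in text
            and any(ch.isdigit() for ch in text)):
        return "stock market"
    return "unclassified"

print(classify("Nasdaq rose 42 points by the closing quotation."))
```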
Over approximately the last decade, other types of classifiers have increasingly been used. Although these classifiers do not use static, predefined logic as rule-based classifiers do, they have outperformed rule-based classifiers in many applications. Such classifiers typically include a learning element and a performance element, and include neural networks, Bayesian networks, and support vector machines. Although each of these classifiers is known, they are briefly introduced below for the reader's convenience.
Classifiers having learning and performance elements
As noted at the end of the preceding section, classifiers having learning and performance elements outperform rule-based classifiers in many applications. To reiterate, these classifiers may include neural networks, Bayesian networks, and support vector machines.
Neural networks
A neural network is basically a multilayered, hierarchical arrangement of identical processing elements (also referred to as neurons). Each neuron can have one or more inputs but only one output. The inputs of each neuron are weighted by a coefficient. The output of a neuron is typically a function of the sum of its weighted inputs and a bias. This function, also known as the activation function, is typically a sigmoid function. That is, the activation function may be S-shaped and monotonically increasing, and asymptotically approach fixed values (e.g., +1, 0, -1) as its input(s) respectively approach positive or negative infinity. The sigmoid function and the individual neural weights and biases determine the response or "excitability" of the neuron to input signals.
In the hierarchical arrangement of neurons, the output of a neuron in one layer may be distributed as an input to one or more neurons in a next layer. A typical neural network may include an input layer and two (2) distinct neuron layers: an intermediate neuron layer and an output neuron layer. Note that the nodes of the input layer are not neurons. Rather, the nodes of the input layer have only one input and essentially provide the input, unprocessed, to the inputs of the next layer. If, for example, the neural network were to be used to recognize a numeric digit character in a 20 by 15 pixel array, the input layer could have 300 neurons (i.e., one for each pixel of the input), and the output array could have 10 neurons (i.e., one for each of the ten digits).
The use of neural networks generally involves two (2) successive steps. First, the neural network is initialized and trained on known inputs having known output values (or classifications). Once the neural network is trained, it can be used to classify unknown inputs. The neural network may be initialized by setting the weights and biases of the neurons to random values, typically generated from a Gaussian distribution. The neural network is then trained using a succession of inputs having known outputs (or classes). As the training inputs are fed to the neural network, the values of the neural weights and biases are adjusted (e.g., in accordance with the known back-propagation technique) such that the output of the neural network for each individual training pattern approaches or matches the known output. Basically, a gradient descent in weight space is used to minimize the output error. In this way, learning using successive training inputs converges towards a locally optimal solution for the weights and biases. That is, the weights and biases are adjusted to minimize the error.
In practice, the system is usually not trained to the point at which it converges to the optimal solution. Rather, the system would become "over-trained", such that it would be too specialized to the training data and might perform poorly at classifying inputs that differ in some way from the training set. Therefore, at various times during its training, the system is tested on a set of validation data. Training is halted when the system's performance on the validation set no longer improves.
Once training is complete, the neural network can be used to classify unknown inputs in accordance with the weights and biases determined during training. If the neural network can classify the unknown input with confidence, one of the outputs of the neurons in the output layer will be much higher than the others. A minimal sketch of such a network follows.
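The following is a minimal sketch of such a network, assuming one intermediate layer of sigmoid neurons; the sizes follow the 20 by 15 pixel example above, the weights and biases are initialized from a Gaussian as described, and the back-propagation training loop is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # S-shaped, monotonically increasing, asymptotically approaching
    # fixed values as the input goes to positive or negative infinity
    return 1.0 / (1.0 + np.exp(-z))

# a 20 x 15 pixel array flattened to 300 inputs, 10 output neurons
n_in, n_hidden, n_out = 300, 64, 10
# initialization: weights and biases drawn from a Gaussian distribution
W1, b1 = rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), rng.normal(size=n_out)

def forward(x):
    # each neuron outputs a function of its weighted inputs plus a bias
    h = sigmoid(W1 @ x + b1)        # intermediate neuron layer
    return sigmoid(W2 @ h + b2)     # output neuron layer

# an untrained network simply maps a pixel vector to 10 activations;
# after training, the largest activation indicates the classified digit
out = forward(rng.random(n_in))
print(int(np.argmax(out)))
```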
Bayesian networks
Generally, Bayesian networks use hypotheses as intermediaries between data (e.g., input feature vectors) and predictions (e.g., classifications). The probability of each hypothesis given the data ("P(hypothesis | data)") may be estimated. A prediction is made from the hypotheses using their posterior probabilities to weight the individual prediction of each hypothesis. Given data D, the probability of a prediction X may be expressed as:

$$P(X \mid D) \;=\; \sum_i P(X \mid D, H_i)\, P(H_i \mid D)$$

where H_i is the i-th hypothesis. The hypothesis H_i that maximizes the probability of H_i given D (P(H_i | D)) is referred to as the maximum a posteriori hypothesis (or "H_MAP"), and the prediction may be approximated as:

$$P(X \mid D) \;\approx\; P(X \mid H_{MAP})$$

Using Bayes' rule, the probability of a hypothesis H_i given data D may be expressed as:

$$P(H_i \mid D) \;=\; \frac{P(D \mid H_i)\, P(H_i)}{P(D)}$$

The probability of the data D remains constant. Therefore, to find H_MAP, the numerator must be maximized.
The first term of the numerator represents the probability that the data would have been observed given hypothesis i. The second term represents the prior probability assigned to hypothesis i.
A Bayesian network includes variables and directed edges between the variables, thereby defining a directed acyclic graph (or "DAG"). Each variable can assume any of a finite number of mutually exclusive states. For each variable A with parent variables B1, ..., Bn, there is an attached probability table P(A | B1, ..., Bn). The structure of the Bayesian network encodes the assumption that, given its parents, each variable is conditionally independent of its non-descendants.
Assuming that the structure of the Bayesian network is known and the variables are observable, only the set of conditional probability tables needs to be learned. These tables can be estimated directly using statistics from a set of learning examples. If the structure is known but some variables are hidden, learning is analogous to the neural network learning described above.
An example of a simple Bayesian network follows. A variable "MML" may represent "the moisture of my lawn" and may have the states "wet" and "dry". The MML variable may have the parent variables "rain" and "my sprinkler on", each having "Yes" and "No" states. Another variable, "MNL", may represent "the moisture of my neighbor's lawn" and may have the states "wet" and "dry". The MNL variable may share the parent variable "rain". In this example, a prediction may be whether my lawn is "wet" or "dry". The prediction may rely on the hypotheses (i) that if it rains, my lawn is wet with probability x1, and (ii) that if my sprinkler was on, my lawn is wet with probability x2. The probability that it has rained, or that my sprinkler was on, may depend on other variables. For example, if my neighbor's lawn is wet and they do not have a sprinkler, it is likely that it has rained. A small inference sketch follows.
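A minimal inference sketch of the lawn network follows; every probability value below is invented purely for illustration. MML is unobserved in this query, so it marginalizes out and its conditional probability table is not needed:

```python
# Enumerate the joint distribution over (Rain, Sprinkler, MNL) and ask:
# how probable is rain, given that my neighbor's lawn is wet
# and my sprinkler was off?
P_RAIN = {True: 0.2, False: 0.8}        # invented prior
P_SPRINKLER = {True: 0.3, False: 0.7}   # invented prior

def p_mnl(wet: bool, rain: bool) -> float:
    # P(MNL | Rain): the neighbor's lawn is usually wet when it rains
    p = 0.9 if rain else 0.15           # invented table entries
    return p if wet else 1.0 - p

def p_rain_given(mnl_wet: bool = True, sprinkler: bool = False) -> float:
    num = den = 0.0
    for rain in (True, False):
        joint = P_RAIN[rain] * P_SPRINKLER[sprinkler] * p_mnl(mnl_wet, rain)
        den += joint
        if rain:
            num += joint
    return num / den

print(p_rain_given())   # 0.6: far above the 0.2 prior probability of rain
```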
As with the neural network example described above, the conditional probability tables in Bayesian networks may be trained. An advantage is that the learning process may be shortened by allowing prior knowledge to be provided. Unfortunately, the prior probabilities of the conditional probabilities are often unknown, in which case a uniform prior probability is used.
An embodiment of the present invention may perform at least one (1) of two (2) basic functions, namely generating parameters for a classifier, and classifying objects, such as textual information objects.
Basically, parameters are generated for a classifier based on a set of training examples. A set of feature vectors may be generated from the set of training examples. The features of the set of feature vectors may be reduced. The generated parameters may include a defined monotonic (e.g., sigmoid) function and a weight vector. The weight vector may be determined by means of SVM training (or by another known technique). The monotonic (e.g., sigmoid) function may be determined by an optimization method.
The text classifier includes a weight vector and a defined monotonic (e.g., sigmoid) function. Basically, the output of the text classifier of the present invention may be expressed as:

$$O_c \;=\; \frac{1}{1 + e^{A\,(w_c \cdot x) + B}} \qquad (2)$$

where:
Oc = the classification output for category c;
wc = the weight vector parameter associated with category c;
x = a (reduced) feature vector based on the unknown textual information object; and
A and B are adjustable parameters of the monotonic (e.g., sigmoid) function.
Calculating the output via expression (2) is faster than calculating it via expression (1). A short evaluation sketch follows.
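A short sketch of evaluating expression (2), assuming the parameters A and B have already been fitted by the optimization mentioned above; all numeric values are placeholders:

```python
import numpy as np

def classifier_output(w_c, x, A, B):
    # expression (2): O_c = 1 / (1 + exp(A * (w_c . x) + B))
    return 1.0 / (1.0 + np.exp(A * np.dot(w_c, x) + B))

# placeholder values; A < 0 makes O_c increase with the score w_c . x
w_c = np.array([0.7, -1.2, 0.4])    # weight vector for category c
x = np.array([1.0, 0.0, 1.0])       # reduced feature vector
print(classifier_output(w_c, x, A=-2.0, B=0.0))
```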
Depending on the form of the object to be classified, the classifier may (i) convert the textual information object into a feature vector and (ii) reduce the feature vector to a simplified feature vector having fewer elements.
Transductive machine learning
Commercially, the automatic classification systems currently in use in the prior art are either rule based or employ inductive machine learning, i.e., learning from hand-labeled training examples. Compared to transductive methods, both approaches require a large amount of manual setup work. The solutions offered by rule-based systems or inductive methods are static solutions that cannot adapt to drifting classification concepts without manual intervention.
Inductive machine learning is used to ascribe properties or relations to types based on tokens (i.e., on one or a small number of observations or experiences), or to formulate laws based on limited observations of recurring patterns. Inductive machine learning involves reasoning from observed training cases to establish general rules, which are then applied to the test cases.
In particular, the preferred embodiments use transductive machine learning methods. Transductive machine learning is an effective method that avoids these shortcomings.
Transductive machine learning can learn from a very small set of labeled training examples, automatically adapt to drifting classification concepts, and automatically correct the labeled training examples. These advantages make transductive machine learning an interesting and valuable method suited to a wide variety of business applications.
Transduction learns patterns in data. By learning not only from labeled data but also from unlabeled data, transduction extends the concept of inductive learning, which enables it to learn patterns that cannot be captured from labeled data alone, or only partially so. Accordingly, compared to rule-based systems or systems based on inductive learning, transduction can adapt to dynamically changing environments. This capability makes transduction useful for document searching, data cleanup, addressing drifting classification concepts, and the like.
The description below presents embodiments of transductive classification using support vector machine (SVM) classification and the maximum entropy discrimination (MED) framework.
Support vector machines
The support vector machine (SVM) is a method used for text classification that addresses the problem posed by the large number of possible solutions, and the generalization problem that results from it, by restricting the allowed solutions using concepts from regularization theory. For example, a binary SVM classifier chooses as its solution, from among all hyperplanes that accurately separate the training data, the hyperplane that maximizes the margin. Margin maximization is a regularization that addresses the learning problem mentioned above of choosing an appropriate balance between memorization and generalization, under the constraint that the training data are classified accurately. The constraint on the training data provides the memorization of the data, while the regularization ensures appropriate generalization. Inductive classification learns from training examples with known labels, i.e., the class membership of each training example is known. Whereas inductive classification learns from known labels, transductive classification determines the classification rule from labeled as well as unlabeled data. An example of transductive SVM classification is shown in Table 1.
The principle of transductive SVM classification
Table 1
Table 1 shows the principle of transductive classification using support vector machines. The solution is given by the hyperplane that yields the maximum margin over all possible label assignments of the unlabeled data. The number of possible label assignments grows exponentially with the number of unlabeled data points, so for a practically usable method, the algorithm of Table 1 must be approximated. An example of such an approximation is described in T. Joachims, "Transductive inference for text classification using support vector machines", Technical Report, Universität Dortmund, LS VIII, 1999 (Joachims).
Given the uniform distribution over label assignments in Table 1, an unlabeled data point becomes a positive example of the class with probability 1/2 and a negative example with probability 1/2; i.e., the two possible label assignments y = +1 (positive example) and y = -1 (negative example) are equally likely, and the resulting expected label is 0. An expected label of 0 can arise from a fixed class prior probability equal to 1/2, or from a class prior probability that is a random variable with a uniform prior distribution (i.e., an unknown class prior probability). Accordingly, in applications where the class prior probability is known and is not equal to 1/2, the algorithm could be improved by incorporating this additional information. For example, instead of the uniform distribution over label assignments used in Table 1, certain label assignments would be preferred over others according to the class prior probability. However, trading off solutions that have higher label-assignment scores but smaller margins against solutions that have lower label-assignment scores but larger margins is difficult, because label-assignment probabilities and margins are on different scales. A brute-force sketch of the Table 1 principle follows.
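The following sketch restates the principle of Table 1 in code: enumerate every possible label assignment of the unlabeled points, fit a maximum-margin separator for each, and keep the assignment with the widest margin. The exponential enumeration is exactly why Joachims' approximation is needed in practice; the large-C SVC below is a stand-in for a hard-margin SVM:

```python
from itertools import product

import numpy as np
from sklearn.svm import SVC   # near-hard-margin stand-in (large C)

def tsvm_brute_force(X_lab, y_lab, X_unl):
    """Return the unlabeled labeling whose max-margin separator is widest."""
    X = np.vstack([X_lab, X_unl])
    best_margin, best_labels = -np.inf, None
    for y_unl in product([-1, +1], repeat=len(X_unl)):  # all labelings
        y = np.concatenate([y_lab, y_unl])
        if len(set(y)) < 2:          # SVC needs both classes present
            continue
        clf = SVC(kernel="linear", C=1e6).fit(X, y)
        margin = 1.0 / np.linalg.norm(clf.coef_)        # geometric margin
        if margin > best_margin:
            best_margin, best_labels = margin, np.array(y_unl)
    return best_labels, best_margin
```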
Maximum entropy discrimination
Another classification method, maximum entropy discrimination (MED) (see, e.g., T. Jebara, "Machine Learning: Discriminative and Generative", Kluwer Academic Publishers) (Jebara), does not encounter the problem faced by the SVM, because the decision function regularization term and the label-assignment regularization term are both derived from prior probability distributions over the solutions, and are therefore on the same probability scale. Accordingly, if the class prior, and thereby the label prior, is known a priori, transductive MED classification is superior to transductive SVM classification, since it allows prior label knowledge to be incorporated in a principled fashion.
Inductive MED classification assumes a prior distribution over the decision function parameters, a prior distribution over the bias term, and a prior distribution over the margin. It selects, as the final distribution over these parameters, the distribution closest to the prior distributions that yields an expected decision function that classifies the data points accurately.
Formally, given for example a linear classifier, the problem is stated as follows: find the distribution p(Θ) over the hyperplane parameters, the bias distribution p(b), and the distribution p(γ) over the data point classification margins whose joint probability distribution has the minimum Kullback-Leibler divergence KL to the combined respective prior distributions p0, i.e.,

$$\min_{p}\; KL\left(\,p(\Theta, b, \gamma)\;\big\|\;p_0(\Theta, b, \gamma)\,\right)$$

subject to the constraints

$$\int p(\Theta, b, \gamma)\,\left[\,y_t\,(\Theta \cdot X_t + b) - \gamma_t\,\right]\, d\Theta\, db\, d\gamma \;\ge\; 0 \qquad \forall t$$

where Θ·X_t is the dot product between the weight vector of the separating hyperplane and the feature vector of the t-th data point. Since the label assignments y_t are known and fixed, no prior distribution over binary label assignments is needed. Accordingly, a straightforward way of generalizing inductive MED classification to transductive MED classification is to treat the binary label assignments as parameters with a prior distribution that is constrained to the possible label assignments. An example of transductive MED is shown in Table 2.
Transductive MED classification
Table 2
For labeled data, the label prior distribution is a δ-function, which effectively fixes the label at +1 or -1. For unlabeled data, a label prior probability p0(y) is assumed, assigning to each unlabeled data point a probability p0(y) for the positive label y = +1 and a probability 1 - p0(y) for the negative label y = -1. Assuming a non-informative label prior (p0(y) = 1/2) yields a transductive MED classification similar to the transductive SVM classification described above.
As in the case of transductive SVM classification, a practically usable implementation of the above MED algorithm must approximate the search over all possible label assignments. One such method, described in T. Jaakkola, M. Meila, and T. Jebara, "Maximum entropy discrimination", Technical Report AITR-1668, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 1999 (Jaakkola), chooses an approximation that decomposes the procedure into two steps, similar to an expectation maximization (EM) formulation. In this formulation, two problems must be solved. The first step, corresponding to the M-step of the EM algorithm, resembles margin maximization in which all data points are classified accurately according to the current best guess of the label distribution. The second step, corresponding to the E-step, uses the classification results determined in the M-step to estimate new values for the class memberships of the examples. We call this second step label induction. The basic outline is shown in Table 2.
The particular implementation of the method of Jaakkola cited herein assumes a Gaussian with zero mean and unit variance for the hyperplane parameters, a Gaussian with zero mean and variance σ_b² for the bias parameter, a margin prior of the form exp[-c(1-γ)], where γ is the margin of a data point and c is the cost factor, and a binary label prior probability p0(y) for the unlabeled data as described above. The transductive classification algorithm of Jaakkola discussed below assumes, for simplicity and without loss of generality, a label prior probability of 1/2.
Given a fixed probability distribution over the hyperplane parameters, the label induction step determines the label probability distribution. Using the margin and label priors given above, the objective function of the label induction step becomes (see Table 2):

$$J(\lambda) \;=\; \sum_t \left[\,\lambda_t + \log\left(1 - \frac{\lambda_t}{c}\right) - \log \cosh(\lambda_t s_t)\,\right] \qquad (3)$$

where λ_t is the Lagrange multiplier of the t-th training example, s_t is its classification score determined in the preceding M-step, and c is the cost factor. The first two terms in the sum over the training examples derive from the margin prior distribution, while the third term is given by the label prior distribution. The Lagrange multipliers are determined by maximizing J(λ), which in turn determines the label probability distribution of the unlabeled data. As can be seen from formula 3, the data points contribute to the objective function individually, so each Lagrange multiplier is determined independently of the other Lagrange multipliers. For example, in order to maximize the contribution to J of an unlabeled data point with a high absolute classification score |s_t|, a small Lagrange multiplier λ_t is required, whereas an unlabeled data point with a small value of |s_t| requires a large Lagrange multiplier to maximize its contribution to J. Furthermore, the expected label ⟨y⟩ of an unlabeled data point, expressed as a function of its classification score s and its Lagrange multiplier λ, is:

$$\langle y \rangle \;=\; \tanh(\lambda\, s) \qquad (4)$$

Fig. 1 shows the expected label ⟨y⟩ as a function of the classification score s, using the cost factors c = 5 and c = 1.5. The Lagrange multipliers yielding Fig. 1 are determined by solving formula 3 with the cost factors c = 5 and c = 1.5. As Fig. 1 shows, unlabeled data points outside the margin, i.e., |s| > 1, have expected labels ⟨y⟩ close to 0; data points close to the margin, i.e., |s| ≈ 1, yield the highest absolute expected label values; and data points close to the hyperplane, i.e., |s| < ε, yield |⟨y⟩| < ε. The counterintuitive label assignment whereby ⟨y⟩ → 0 as |s| → ∞ is inherent to the discriminative approach, which attempts to stay as close as possible to the prior distribution as long as the classification constraints are fulfilled. It is not an artifact of the approximations chosen by the method of Table 2: an algorithm that exhaustively searches all possible label assignments, and is therefore guaranteed to find the globally optimal solution, likewise assigns expected labels close or equal to zero to unlabeled data outside the margin. To reiterate, from a discriminative point of view this is desired: data points outside the margin are irrelevant for separating the examples, so the individual probability distributions of all these data points revert to their prior distributions. A numerical sketch of the label induction step follows.
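The following numerical sketch maximizes expression (3) independently for each data point and then applies expression (4); the choice c = 5.0 matches one of the Fig. 1 settings, while the code itself is illustrative only. It reproduces the qualitative shape of Fig. 1, with expected labels approaching zero both near the hyperplane and far outside the margin:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def lagrange_multiplier(s, c):
    # maximize  lambda + log(1 - lambda/c) - log(cosh(lambda*s))
    # over 0 <= lambda < c, independently for each data point
    neg = lambda lam: -(lam + np.log(1.0 - lam / c)
                        - np.log(np.cosh(lam * s)))
    return minimize_scalar(neg, bounds=(0.0, c * (1 - 1e-9)),
                           method="bounded").x

for s in (0.1, 0.5, 1.0, 2.0, 5.0):           # classification scores
    lam = lagrange_multiplier(s, c=5.0)
    print(f"s={s:4.1f}  <y>={np.tanh(lam * s):+.3f}")   # expression (4)
```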
The M-step of the transductive classification algorithm of Jaakkola, incorporated herein by reference, determines the probability distributions over the hyperplane parameters, the bias term, and the data point margins that are closest to their respective prior distributions, subject to the constraints

$$s_t\,\langle y_t \rangle \;-\; \langle \gamma_t \rangle \;\ge\; 0 \qquad \forall t \qquad (5)$$

where s_t is the classification score of the t-th data point, ⟨y_t⟩ is its expected label, and ⟨γ_t⟩ is its expected margin. For labeled data the expected label is fixed, either ⟨y⟩ = +1 or ⟨y⟩ = -1. The expected labels of the unlabeled data lie within the interval (-1, +1) and are estimated in the label induction step. According to formula 5, since the classification score is scaled by the expected label, unlabeled data must fulfill more stringent classification constraints than labeled data. In addition, given the relationship of the expected label as a function of the classification score (see Fig. 1), unlabeled data close to the separating hyperplane are subject to the most stringent classification constraints, since their scores and the absolute values of their expected labels |⟨y_t⟩| are small. Given the prior distributions above, the complete objective function of the M-step is:

$$J(\lambda) \;=\; -\frac{1}{2}\sum_{t,t'} \lambda_t\,\lambda_{t'}\,\langle y_t \rangle \langle y_{t'} \rangle\,(X_t \cdot X_{t'}) \;+\; \sum_t \left[\,\lambda_t + \log\left(1 - \frac{\lambda_t}{c}\right)\right] \;-\; \frac{\sigma_b^2}{2}\left(\sum_t \lambda_t\,\langle y_t \rangle\right)^{\!2} \qquad (6)$$

The first term derives from the Gaussian prior over the hyperplane parameters, the second term is the margin prior regularization term, and the last term is the bias prior regularization term, obtained from a Gaussian prior with zero mean and variance σ_b². The prior distribution over the bias term can be understood as a prior distribution over the class prior probability. Accordingly, the regularization term of the bias prior distribution constrains the weights of the positive and negative examples: referring to formula 6, it penalizes solutions in which the collective pull exerted on the hyperplane by the positive examples differs from the collective pull exerted by the negative examples. Owing to the bias prior, the collective constraint on the Lagrange multipliers weights each data point by its expected label, so the unlabeled data are less constrained than the labeled data. Unlabeled data thereby have a stronger ability than labeled data to influence the final solution.
In summary, in the M-step of the transductive classification algorithm of Jaakkola, incorporated herein by reference, unlabeled data must fulfill more stringent classification constraints than labeled data, while their cumulative weight on the solution is less constrained than that of labeled data. In addition, unlabeled data with expected labels close to zero, i.e., lying within the margin of the current M-step, have the greatest influence on the solution. The net effect of this E- and M-step formulation can be illustrated, as shown in Fig. 2, by applying the algorithm to a data set. The data set consists of two labeled examples, a negative example (x) at position x = -1 and a positive example (+) at x = +1, together with six unlabeled examples (o) along the x-axis between -1 and +1. The cross (x) indicates the labeled negative example, the plus sign (+) the labeled positive example, and the circles (o) the unlabeled data. The different graphs show the separating hyperplanes determined by successive iterations of the M-step. The final solution determined by the transductive MED classifier of Jaakkola, incorporated herein by reference, misclassifies the positive labeled training example. Fig. 2 shows the successive iterations of the M-step. In the first iteration of the M-step, the unlabeled data are not considered and the separating hyperplane is located at x = 0. One unlabeled data point with a negative x-value lies closer to this separating hyperplane than any other unlabeled data point. In the subsequent label induction step it is assigned the smallest |⟨y⟩| and, accordingly, it has the greatest authority in the next M-step to pull the hyperplane towards the positive example. The particular shape of the expected label ⟨y⟩ as a function of the classification score, determined by the chosen cost factor (see Fig. 1), combined with the particular spacing of the unlabeled data points, creates a bridging effect in which, with each successive M-step, the separating hyperplane moves ever closer to the positive example. Intuitively, the M-step suffers from a kind of myopia: the unlabeled data points closest to the current separating hyperplane have the most say in determining the plane's final location, while data points farther away matter little. Eventually, since the bias prior constrains the collective pull of the unlabeled data to less than the collective pull of the labeled data, the separating hyperplane moves past the positive labeled example, producing the final solution, the 15th iteration in Fig. 2, which misclassifies the positive labeled example. Fig. 2 was generated using a bias variance σ_b² and a cost factor c = 10. With that bias variance, any cost factor within the range 9.8 < c < 13 produces a final hyperplane that misclassifies a positive labeled example, while all cost factors outside the interval 9.8 < c < 13 produce a separating hyperplane located somewhere between the two labeled examples.
The instability of the algorithm is not restricted to the example shown in Fig. 2; when the method of Jaakkola, incorporated herein by reference, is applied, it is also observed on real-world data sets, including the Reuters data set well known to those skilled in the art. The inherent instability of the method outlined in Table 2 is a serious drawback of that embodiment and limits its versatility, although the method of Jaakkola may be implemented in certain embodiments of the present invention.
One method for optimizing of the present invention uses the transductive classification of the framework using maximum entropy-discriminate (MED).It is readily appreciated that, this The not be the same as Example of invention, it is adaptable to classify, is applied equally to other MED problems concerning study using transduction, including, but do not limit In transduction MED restores and image model.
By assuming that the prior probability distribution of a parameter, maximum entropy-discriminate limits and reduces possible solution.According in the phase The solution of prestige is described under the limitation of training data exactly, closest to the probability distribution of the prior probability distribution of hypothesis, last solution To be possible to the desired value of solution.The prior probability distribution of all solutions is mapped to a formal phase of normalization, i.e. have selected one it is specific Prior distribution, just have selected for a specific normalization.
The differentiation estimation implemented by SVMs is being effective from the study of a small amount of sample.The embodiment of the present invention Method and apparatus all there is the feature as SVMs, it is and necessary the problem of will not estimate than solving given The more parameters of parameter, and therefore produce a sparse solution.Compared with generation mode is estimated, generation mode estimation attempts to explain base Plinth process, it usually needs the statistics higher than differentiating estimation.On the other hand, generation mode is more flexible, therefore available for various each The problem of sample.In addition, generation mode estimation can directly include priori.By using maximum entropy-discriminate, the embodiment of the present invention Method and apparatus shorten pure discrimination model estimation (e.g., SVMs learns) and generation mode estimate between gap.
The method of an embodiment of the invention shown in Table 3 is an improved transductive MED classification algorithm that does not exhibit the instability of the Jaakkola method (incorporated herein by reference) described above. The differences include, but are not limited to, the following: in this embodiment of the invention, each data point has its own cost factor, proportional to the absolute value of its expected label |⟨y⟩|. In addition, after each M step, the label prior probability of each data point is updated according to an estimate of its class membership probability, computed as a function of the distance of the data point to the decision function. The method of this embodiment of the invention is shown in Table 3 below:
Table 3: Improved transductive MED classification
Scaling each data point's cost factor by |⟨y⟩| alleviates the problem that the collective pull of the unlabeled data on the hyperplane is stronger than that of the labeled data, because the cost factor of an unlabeled point is now smaller than that of a labeled point; that is, each individual unlabeled data point always contributes less to the final solution than an individual labeled data point. However, if the amount of unlabeled data greatly exceeds the amount of labeled data, the unlabeled data can still influence the final solution more than the labeled data. In addition, combining the cost-factor scaling with the update of the label priors using the estimated class probabilities solves the bridging problem described above. In the first M step, the unlabeled data have small cost factors, producing an expected label that is relatively flat as a function of the classification score (see Fig. 1); accordingly, all unlabeled data, to some extent, are allowed to keep pulling on the hyperplane, although only with small weight. Furthermore, owing to the update of the label priors, unlabeled data far from the separating hyperplane are no longer assigned an expected label close to 0; instead, over many iterations, they are assigned a label close to y = +1 or y = -1 and are thus gradually treated as labeled data.
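As an illustration of this per-point cost-factor scaling, the following minimal sketch may help; it is not part of the original disclosure, and the function name, NumPy representation, and example values are assumptions:

```python
import numpy as np

def scale_cost_factors(base_cost, expected_labels, labeled_mask):
    """Scale each unlabeled point's cost factor by |<y>|; labeled points
    keep the full base cost, so an individual unlabeled point never
    outweighs an individual labeled one."""
    costs = base_cost * np.abs(expected_labels)
    costs[labeled_mask] = base_cost
    return costs

# Two labeled points (|<y>| = 1) and three unlabeled ones
expected = np.array([1.0, -1.0, 0.2, -0.6, 0.05])
labeled = np.array([True, True, False, False, False])
print(scale_cost_factors(10.0, expected, labeled))  # [10. 10. 2. 6. 0.5]
```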
In a particular implementation of the method of this embodiment of the invention, a Gaussian prior with zero mean and unit variance is assumed for the decision function parameters Θ:

$$P_0(\Theta) \propto \exp\!\left(-\tfrac{1}{2}\,\Theta^{T}\Theta\right) \qquad (7)$$
The prior distribution of the decision function parameters incorporates important prior knowledge about the classification problem at hand. Other prior distributions over the decision function parameters that are important for classification problems include the multinomial distribution, the Poisson distribution, the Cauchy (Breit-Wigner) distribution, the Maxwell-Boltzmann distribution, and the Bose-Einstein distribution.
Decision function threshold value b prior distribution is by with average value mubAnd varianceGaussian Profile give:
As the prior distribution of the classification margin γ_t of a data point, the expression of formula 9 is chosen, where c is the cost factor. This prior distribution differs from the prior used in Jaakkola (incorporated herein by reference), whose expression is exp[-c(1-γ)]. Preferably, the expression given by formula 9 is favored over the expression used by Jaakkola (incorporated herein by reference) because it yields a positive expected margin even when the cost factor is less than 1, whereas for c < 1 the prior exp[-c(1-γ)] produces a negative expected margin.
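For concreteness, a short added derivation (not in the original text; it assumes the normalized form $P_0(\gamma) = c\,e^{-c(1-\gamma)}$ on $\gamma \le 1$, the usual reading of Jaakkola's margin prior) verifies the claim about the expected margin:

$$\langle\gamma\rangle \;=\; \int_{-\infty}^{1}\gamma\,c\,e^{-c(1-\gamma)}\,d\gamma \;\overset{u\,=\,1-\gamma}{=}\; \int_{0}^{\infty}(1-u)\,c\,e^{-cu}\,du \;=\; 1-\frac{1}{c},$$

which is positive only for c > 1 and becomes negative whenever c < 1, exactly the defect that the prior of formula 9 avoids.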
Given these prior distributions, the corresponding partition function Z can be determined directly (see, e.g., T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc.) (Cover), and from it the objective function J(λ) of formula 10 follows.
According to Jaakkola (incorporated herein by reference), the objective function of the M step is given by formula 11, and the objective function of the E step by formula 12,
where s_t is the classification score of the t-th data point determined in the preceding M step, and p_{0,t}(y_t) is the binary label prior probability of the data point. For labeled data, the label prior is initialized as p_{0,t}(y_t) = 1; for unlabeled data, it is initialized to the non-informative prior p_{0,t}(y_t) = 1/2 or to the class prior probability.
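A minimal sketch of this label-prior initialization (an illustrative addition; the encoding of unlabeled points as None is an assumption, and the class-prior variant is omitted):

```python
import numpy as np

def init_label_priors(labels):
    """Return P(y = +1) per data point: 1 or 0 for labeled points,
    the non-informative 1/2 for unlabeled points."""
    return np.array([0.5 if y is None else (1.0 if y > 0 else 0.0)
                     for y in labels])

print(init_label_priors([+1, -1, None, None]))  # [1.  0.  0.5 0.5]
```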
The section entitled "M step" herein describes the algorithm for solving the M-step objective function; likewise, the section entitled "E step" describes the E-step algorithm.
In the Estimate Class Probability step in the fifth row of Table 3, the training data are used to determine calibration parameters that turn classification scores into class membership probabilities, i.e., the probability of the class given the score, p(c|s). Related techniques for calibrating scores into probability estimates are described in J. Platt, Probabilistic outputs for support vector machines and comparison to regularized likelihood methods, pages 61-74, 2000 (Platt), and B. Zadrozny and C. Elkan, Transforming classifier scores into accurate multi-class probability estimates, 2002 (Zadrozny).
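In the spirit of Platt (2000), score calibration can be sketched as follows; this is an illustrative stand-in, not the patent's prescribed implementation, and the sigmoid parameterization and SciPy-based fit are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    """Fit p(c|s) = 1 / (1 + exp(A*s + B)) by maximum likelihood on
    training scores, mapping classifier scores to class probabilities."""
    s = np.asarray(scores, dtype=float)
    y = (np.asarray(labels) > 0).astype(float)

    def nll(params):
        a, b = params
        p = np.clip(1.0 / (1.0 + np.exp(a * s + b)), 1e-12, 1 - 1e-12)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    a, b = minimize(nll, x0=[-1.0, 0.0]).x
    return lambda q: 1.0 / (1.0 + np.exp(a * np.asarray(q) + b))

calibrate = fit_platt([2.1, 1.0, -0.5, -2.0], [+1, +1, -1, -1])
print(calibrate(0.8))  # score -> estimated class membership probability
```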
Referring particularly to Fig. 3, the cross (x) denotes the labeled negative example, the plus sign (+) the labeled positive example, and the circles (o) the unlabeled data. The different curves represent the separating hyperplanes determined by different iterations of the M step. The 20th iteration shows the final solution determined by the improved transductive MED classifier. Fig. 3 shows the improved transductive MED classification algorithm applied to the small data set described above. The parameters used were c = 10 and μ_b = 0. Different values of c produce separating hyperplanes located between x ≈ -0.5 and x = 0: for c < 3.5, the hyperplane lies to the right of the unlabeled data point at x < 0, and for c ≥ 3.5, the hyperplane lies to the left of that unlabeled data point.
Referring particularly to Fig. 4, a control flow is illustrated showing a method of classifying unlabeled data according to an embodiment of the invention. The method 100 starts at step 102, and a data store 106 is accessed at step 104. The data store resides in a memory unit and contains labeled data, unlabeled data, and at least one predetermined cost factor. The data 106 include data points with assigned labels. The assigned label identifies whether a labeled data point is to be included in, or excluded from, a particular category.
Once the data have been accessed at step 104, the method of this embodiment of the invention uses the label information of the data points at step 108 to determine the label prior probabilities of the data points. Then, at step 110, the expected label of each data point is determined from its label prior probability. With the expected labels computed at step 110, together with the labeled data, the unlabeled data, and the cost factor, step 112 iteratively trains a transductive MED classifier by adjusting the cost factors of the unlabeled data points. In each iterative computation, the cost factor of each unlabeled data point is adjusted; in this way, the MED classifier learns from the iterative computation. The trained classifier then accesses input data 114 at step 116. The trained classifier next completes the step of classifying the input data at step 118, and the method ends at step 120.
It will be readily appreciated that the unlabeled data of store 106 and the input data 114 may be obtained from a single source. Thus, the input data/unlabeled data can be used in the iterative process of step 112, which is then used for classification at step 118. Moreover, embodiments of the invention contemplate that the input data 114 may include a feedback mechanism that supplies input data to the stored data 106, so that the MED classifier of step 112 learns dynamically from newly input data.
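The control flow of Fig. 4 can be sketched as below. This skeleton is an illustrative addition: the true M step is the MED optimization described above, for which a cost-weighted least-squares fit and a tanh squashing of the scores stand in here purely so the loop runs end to end:

```python
import numpy as np

def train_transductive(X, y, base_cost, n_iter=20, tol=1e-4):
    """Fig. 4 loop skeleton: y holds +1/-1 for labeled points and 0 for
    unlabeled ones; expected labels <y> are re-estimated each pass."""
    labeled = y != 0
    expected = y.astype(float)                 # expected labels <y>
    for _ in range(n_iter):
        costs = base_cost * np.abs(expected)   # scale by |<y>| ...
        costs[labeled] = base_cost             # ... except labeled points
        W = np.diag(costs + 1e-9)
        w = np.linalg.lstsq(W @ X, W @ expected, rcond=None)[0]  # "M step"
        scores = X @ w                         # classification scores s_t
        new_expected = np.where(labeled, y.astype(float), np.tanh(scores))
        if np.max(np.abs(new_expected - expected)) < tol:
            break                              # convergence monitor
        expected = new_expected
    return w

X = np.array([[1.0, 1.0], [-1.0, 1.0], [0.3, 1.0], [-0.4, 1.0]])
y = np.array([1, -1, 0, 0])
print(train_transductive(X, y, base_cost=10.0))
```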
Referring particularly to Fig. 5, a control flow chart is illustrated showing another method of classifying unlabeled data according to an embodiment of the invention, including user-defined prior probability information. The method 200 starts at step 202, and stored data 206 are accessed at step 204. The data 206 include labeled data, unlabeled data, a predetermined cost factor, and prior probability information supplied by the user. The labeled data of 206 include data points with assigned labels. The assigned label identifies whether the labeled data point is to be included in, or excluded from, a particular category.
At step 208, expected labels are computed from the data 206. Then, at step 210, the expected labels are used together with the labeled data, the unlabeled data, and the cost factor to guide the iterative training of a transductive MED classifier. The iterative computation of step 210 adjusts the cost factor of the unlabeled data in each computation. The computation continues until the classifier has been properly trained.
The trained classifier then accesses the input data 212 at step 214. The trained classifier can next complete the step of classifying the input data at step 216. As with the process and method described in Fig. 4, the input data and the unlabeled data may be obtained from a single source and may enter the system at 206 and 212. In this way, the input data 212 can influence the training at 210, so that the process can change dynamically over time with a continuous stream of input data.
In both methods shown in Figs. 4 and 5, a monitor can determine whether the system has reached convergence. Convergence may be determined when the change of the MED hyperplane between successive iterations of the computation drops below a predetermined threshold. In another embodiment of the invention, convergence may be determined when the change of the determined expected labels drops below a predetermined threshold. If convergence is reached, the iterative training process can stop.
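Either convergence test reduces to a threshold check on successive iterates, as in this illustrative sketch (the threshold value and the array representation are assumptions):

```python
import numpy as np

def has_converged(prev, curr, threshold=1e-3):
    """True when the change between iterations (of the decision function
    parameters, or of the expected labels) falls below the threshold."""
    return np.max(np.abs(np.asarray(curr) - np.asarray(prev))) < threshold

print(has_converged([0.50, -0.20], [0.5004, -0.2001]))  # True
```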
Referring particularly to Fig. 6, a control flow chart shows the iterative training process of at least one embodiment of the inventive method in greater detail. The process 300 starts at step 302; at step 304, the data from data store 306 are accessed, which may include labeled data, unlabeled data, at least one predetermined cost factor, and prior probability information. A labeled data point of 306 includes a label identifying whether the data point is a training example of a data point to be included in a designated category or a training example of a data point to be excluded from a designated category. The prior probability information of 306 includes probability information for the labeled data set and the unlabeled data set.
At step 308, the expected labels are determined from the data using the prior probability information of step 306. At step 310, the cost factor of each unlabeled data point is scaled in proportion to the absolute value of the expected label of that data point. Then, at step 312, an MED classifier is trained by determining a decision function that, using the labeled and unlabeled data as training examples according to their known or expected labels, maximizes the margin between the included and excluded training examples. At step 314, classification scores are determined using the classifier trained at step 312. At step 316, the classification scores are calibrated to class membership probabilities. At step 318, the label prior probabilities are updated according to the class membership probabilities. At step 320, an MED computation is performed to determine the label and margin probability distributions, with the previously determined classification scores used in the MED computation. As a result, new expected labels are computed at step 322, and at step 324 the expected labels are updated using the computation from step 322. At step 326, the method determines whether convergence has been reached. If so, the method ends at step 328. If convergence has not been reached, another iteration of the method is performed starting from step 310. The iteration continues until convergence is reached, thereby accomplishing the iterative training of the MED classifier. Convergence is reached when the change of the decision function between successive MED iterations drops below a preset value. In another embodiment, convergence is reached when the change of the determined expected label values drops below a predetermined threshold.
Fig. 7 shows a network architecture 700 according to one embodiment. As shown, a plurality of remote networks 702 are provided, including a first remote network 704 and a second remote network 706. A gateway 707 is coupled between the remote networks 702 and a proximate network 708. In the context of the present network architecture 700, each of the networks 704, 706 may take any form, including, but not limited to, a LAN, a WAN such as the Internet, a public switched telephone network (PSTN), an internal telephone network, etc.
In use, the gateway 707 serves as an entrance point from the remote networks 702 to the proximate network 708. As such, the gateway 707 may function as a router, capable of directing a given packet of data that arrives at the gateway 707, and as a switch, which furnishes the actual path in and out of the gateway 707 for a given packet.
Further included is at least one data server 714 coupled to the proximate network 708, which is accessible from the remote networks 702 via the gateway 707. It should be noted that the data server 714 may include any type of computing device/component. Coupled to each data server 714 is a plurality of user devices 716. Such user devices 716 may include a desktop computer, a laptop computer, a hand-held computer, a printer, or any other type of logic device. It should be noted that, in one embodiment, a user device 717 may also be directly coupled to any of the networks.
A facsimile machine 720, or a series of facsimile machines 720, may be coupled to one or more of the networks 704, 706, 708.
It should be noted that databases and/or additional components may be utilized with, or integrated into, any type of network element coupled to the networks 704, 706, 708. In the context of the present description, a network element preferably refers to any component of a network.
According to one embodiment, Fig. 8 shows a representative hardware environment associated with a user device 716 of Fig. 7. The figure illustrates a typical hardware configuration of a workstation having a central processing unit 810, such as a microprocessor, and a number of other units interconnected via a system bus 812.
The workstation shown in Fig. 8 includes a random access memory (RAM) 814; a read-only memory (ROM) 816; an I/O adapter 818 for connecting peripheral devices, such as disk storage units 820, to the bus 812; a user interface adapter 822 for connecting a keyboard 824, a mouse 826, a speaker 828, a microphone 832, and/or other user interface devices, such as a touch screen and a digital camera (not shown), to the bus 812; a communication adapter 834 for connecting the workstation to a communication network 835 (e.g., a data processing network); and a display adapter 836 for connecting the bus 812 to a display device 838.
Referring particularly to Fig. 9, an apparatus 414 of one embodiment of the invention is shown. One embodiment of the invention includes a storage device 814 for storing labeled data 416. Each labeled data point 416 includes a label indicating whether the data point is a training example of a data point to be included in a designated category or a training example of a data point to be excluded from a designated category. The memory 814 also stores unlabeled data 418, prior probability data 420, and cost factor data 422.
The processor 810 accesses the data from the memory 814 and, using transductive MED computations, trains a binary classifier capable of classifying the unlabeled data. Using the cost factors and the labeled and unlabeled training examples, the processor 810 performs iterative transduction computations and adjusts the cost factors as a function of the expected label values, thereby affecting the cost factor data 422, which in turn feed back into the processor 810. Thus, the cost factors 422 change with each iteration of the MED classification performed by the processor 810. Once the processor 810 has sufficiently trained an MED classifier, the processor can then direct the classifier to assign the unlabeled data to the classified data 424.
The prior-art formulations of transductive SVMs and transductive MED cause the number of potential label assignments to grow exponentially, so that approximations must be developed for practical applications. In another embodiment of the present invention, a different formulation of transductive MED classification is described that avoids the exponentially growing number of possible label assignments and admits a conventional closed-form solution. For a linear classifier, the problem is formulated as follows: find the distribution p(Θ) over the hyperplane parameters, the distribution p(b) over the bias, and the distributions p(γ) over the data point classification margins whose joint probability distribution minimizes the Kullback-Leibler divergence KL to the joint of the corresponding prior distributions p_0, i.e.,

$$\min_{p(\Theta,b,\gamma)} \; \mathrm{KL}\!\left(p(\Theta,b,\gamma)\,\big\|\,p_0(\Theta)\,p_0(b)\,p_0(\gamma)\right),$$
subject to the following constraints for the labeled data,

$$\int p(\Theta,b,\gamma)\,\bigl[\,y_t\,(\Theta^{T}X_t + b) - \gamma_t\,\bigr]\;d\Theta\,db\,d\gamma \;\ge\; 0 \qquad \forall\,t,$$

and to the following constraints for the unlabeled data,

$$\int p(\Theta,b,\gamma)\,\bigl[\,(\Theta^{T}X_{t'} + b)^2 - \gamma_{t'}\,\bigr]\;d\Theta\,db\,d\gamma \;\ge\; 0 \qquad \forall\,t',$$
where Θ^T X_t is the dot product between the weight vector of the separating hyperplane and the feature vector of the t-th data point. No prior over labels is needed: the labeled data are constrained to lie on the correct side of the separating hyperplane according to their known labels, while for the unlabeled data the only requirement is that their squared distance to the hyperplane exceed the margin. In short, embodiments of the invention find a separating hyperplane that stays as close as possible to the chosen prior probability while balancing the exact separation of the labeled data against the margin requirement on the unlabeled data. The advantage is that no prior distribution over labels has to be introduced, so the problem of an exponentially growing number of potential label assignments is avoided.
In a particular implementation of this further embodiment of the invention, using the prior distributions for the hyperplane parameters, the bias, and the margins given in formulas 7, 8, and 9, the partition function of formula 16 is obtained, where the subscript t ranges over the labeled data and t' over the unlabeled data.
Introducing the shorthand U_t for the corresponding data-dependent terms and

$$W = \sum_{t}\lambda_t\,\gamma_t\,U_t \;-\; 2\sum_{t'}\lambda_{t'}\,\gamma_{t'}\,U_{t'},$$
formula 16 can be rewritten in terms of these symbols; carrying out the integration then produces the partition function in closed form and, from it, the final objective function.
As in the case of known labels discussed in the section entitled "M step" herein, this objective function can be solved by similar methods. The difference is that the matrix appearing in the quadratic form of the objective now has off-diagonal terms.
Beyond classification, methods using the maximum entropy discrimination framework have a variety of other applications in the present invention. In general, MED can be used with any kind of discriminant function and prior distribution, including regression and graphical models (T. Jebara, Machine Learning: Discriminative and Generative, Kluwer Academic Publishers) (Jebara).
The applications of embodiments of the invention can be formulated both as pure inductive learning problems with known labels and as transductive learning problems with labeled and unlabeled training examples. In the embodiments below, the improvements to the transductive MED classification algorithm described in Table 3 apply equally to ordinary transductive MED classification, transductive MED regression, and transductive MED learning of graphical models. Accordingly, for the purposes of this disclosure and the claims that follow, the word "classification" may encompass regression and graphical models.
M step
According to formula 11, the objective function of the M step is J_M(λ), maximized subject to the constraints {λ_t | 0 ≤ λ_t ≤ c}, where the Lagrange multipliers λ_t are determined by maximizing J_M.
Ignoring the redundant constraint λ_t < c, the Lagrangian of the above problem follows directly.
The necessary and sufficient Karush-Kuhn-Tucker (KKT) conditions for optimality are stated in terms of a quantity F_t defined for each data point. At the optimal solution, the bias equals the expected bias ⟨b⟩, which yields:
$$\langle y_t\rangle\,\bigl(-F_t-\langle b\rangle\bigr)+\delta_t = 0 \qquad (25)$$
These formulas follow by considering the two cases of the constraint δ_t λ_t = 0: in the first case λ_t = 0, and in the second 0 < λ_t < c. A third case need not be considered, as described in S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design, 1999 (Keerthi), as applied to the SVM algorithm; in the present formulation, the potential function keeps λ_t ≠ c.
In these cases, a data point t can violate the optimality conditions until the optimal solution is reached; that is, either λ_t is nonzero while F_t ≠ -⟨b⟩, or λ_t is zero while F_t⟨y_t⟩ < -⟨b⟩⟨y_t⟩. Unfortunately, without the optimal solution λ_t, ⟨b⟩ cannot be computed. A good solution to this problem borrows from Keerthi (again incorporated herein by reference) and constructs the following three sets:
$$I_0=\{\,t : 0<\lambda_t<c\,\} \qquad (28)$$
$$I_1=\{\,t : \langle y_t\rangle>0,\;\lambda_t=0\,\} \qquad (29)$$
$$I_4=\{\,t : \langle y_t\rangle<0,\;\lambda_t=0\,\} \qquad (30)$$
Using these sets, with the following definitions, the most extreme violations of the optimality conditions can be bounded. Elements of I_0 are violators whenever their F_t does not equal -⟨b⟩, so the minimum and maximum F_t from I_0 are candidates for violations. Elements of I_1 are violators when F_t < -⟨b⟩, so the smallest element of I_1, if it exists, is a candidate for the most extreme violation. Finally, elements of I_4 are violators when F_t > -⟨b⟩, so the largest element of I_4 joins the violation candidates. Accordingly, -⟨b⟩ is bounded by the following "minimum" and "maximum" values over these sets:

$$-b_{\mathrm{up}} = \min\{\,F_t : t \in I_0 \cup I_1\,\}, \qquad -b_{\mathrm{low}} = \max\{\,F_t : t \in I_0 \cup I_4\,\}.$$
Since at the optimal solution -b_up and -b_low must be equal, i.e., both equal -⟨b⟩, reducing the gap between -b_up and -b_low drives the training algorithm toward convergence. In addition, the gap can serve as a measure for determining numerical convergence.
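The bookkeeping over I_0, I_1, and I_4 might look as follows; this sketch, like the bound reconstruction above, is an assumption in the style of Keerthi's SMO bookkeeping rather than the patent's literal formulas:

```python
import numpy as np

def bias_bounds(F, lam, y_exp, c):
    """Compute -b_up and -b_low from the index sets I0, I1, I4; the gap
    between them measures how far the solution is from optimality."""
    I0 = (lam > 0) & (lam < c)
    I1 = (y_exp > 0) & (lam == 0)
    I4 = (y_exp < 0) & (lam == 0)
    minus_b_up = np.min(F[I0 | I1])   # smallest violation candidate
    minus_b_low = np.max(F[I0 | I4])  # largest violation candidate
    return minus_b_up, minus_b_low

F = np.array([0.1, -0.3, 0.5, -0.8])
lam = np.array([0.5, 0.0, 0.2, 0.0])
y_exp = np.array([1.0, 1.0, -1.0, -1.0])
print(bias_bounds(F, lam, y_exp, c=1.0))  # gap shrinks toward convergence
```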
As stated earlier, the value of b = ⟨b⟩ becomes known only once convergence is reached. The method of another embodiment differs in that only one example is optimized at a time; accordingly, each pass of the training heuristic alternates between the examples in I_0 and all examples.
E step
The objective function of the E step is given in formula 12, where s_t is the classification score of the t-th data point determined in the preceding M step. The Lagrange multipliers λ_t are determined by maximizing this objective.
Ignoring the redundant constraint λ_t < c, the Lagrangian of the problem follows directly.
The necessary and sufficient KKT conditions for optimality follow.
Because the problem factorizes over the examples, the solution can be completed by optimizing the KKT conditions with respect to the Lagrange multiplier of one example at a time.
For labeled examples, with expected label ⟨y_t⟩ and label priors P_{0,t}(y_t) = 1 and P_{0,t}(-y_t) = 0, the KKT conditions simplify and yield a closed-form solution for the Lagrange multiplier of each labeled example.
For unlabeled examples, formula 35 cannot be solved in closed form; instead, the Lagrange multiplier of each unlabeled example satisfying formula 35 must be determined by a line search.
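A hedged sketch of such a line search (the patent does not specify the search method; the bisection and the monotone-residual assumption here are illustrative):

```python
def solve_multiplier(kkt_residual, lam_max, tol=1e-8):
    """Bisection search for the Lagrange multiplier of one unlabeled
    example, assuming kkt_residual is monotone on [0, lam_max] and
    changes sign there (an illustrative stand-in for formula 35)."""
    lo, hi = 0.0, lam_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kkt_residual(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example with a made-up monotone residual; the root is at 1/3
print(solve_multiplier(lambda lam: 1.0 - 3.0 * lam, lam_max=1.0))
```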
The following are several non-limiting examples, which can be implemented by the methods enumerated above, derivations or variations thereof, and other methods known in the art. Each example includes preferred operations, with reference to optional operations or parameters, that may be implemented within the basic preferred methodology.
In one embodiment, as shown in Fig. 10, labeled data points are received at step 1002, each data point having at least one label indicating whether the data point is a training example of a data point to be included in a particular category or a training example of a data point to be excluded from a particular category. In addition, unlabeled data points are received at step 1004, along with at least one predetermined cost factor for the labeled and unlabeled data points. The data points may comprise any medium, such as text, images, sounds, etc. Prior probability information for the labeled and unlabeled data points may also be received. Moreover, the labels of included training examples may be mapped to a first numeric value, such as +1, and excluded training examples may be mapped to a second numeric value, such as -1. In addition, the labeled data points, the unlabeled data points, the input data points, and the at least one predetermined cost factor of the labeled and unlabeled data points may be stored in a computer memory.
Further, at step 1006, a transductive MED classifier is trained through iterative computation using the at least one cost factor and the labeled and unlabeled data points as training examples. For each iterative computation, the cost factor of each unlabeled data point is adjusted as a function of an expected label value, such as the absolute value of the expected label of the data point, and the label prior probability of each data point is adjusted according to the estimated class membership probability of the data point, thereby ensuring stability. Moreover, the transductive classifier can learn using the prior probability information of the labeled and unlabeled data, which further improves stability. The iterative step of training the transductive classifier may be repeated until convergence of the data values is reached, for example, when the change of the decision function of the transductive classifier drops below a predetermined threshold, when the change of the determined expected label values drops below a predetermined threshold, etc.
In addition, at step 1008, the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points, and input data points. The input data points may be received before or after the classifier is trained, or not at all. Moreover, a decision function may be determined using the labeled and unlabeled data points as learning examples according to their expected labels; given the included and excluded training examples, the decision function may minimize the KL divergence to the prior probability distribution of the decision function parameters. In other words, the decision function may be determined by minimizing the KL divergence using, for example, a multinomial distribution of the decision function parameters.
At step 1010, the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process. The system may be remote or local. Examples of derivatives of the classification may include, but are not limited to, the classified data points themselves, a representation or identifier of the classified data points or of the host file/document, etc.
In another embodiment, a computer system uses and executes computer executable program code. The program code includes instructions, stored in computer memory, for accessing labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of a data point to be included in a designated category or a training example of a data point to be excluded from a designated category. In addition, the computer code includes instructions for accessing unlabeled data points from the computer memory, and instructions for accessing, from the computer memory, at least one predetermined cost factor of the labeled and unlabeled data points. Prior probability information of the labeled and unlabeled data points stored in the computer memory may also be accessed. Moreover, the labels of included training examples may be mapped to a first numeric value, such as +1, and excluded training examples may be mapped to a second numeric value, such as -1.
Further, the program code includes instructions for training a transductive classifier through iterative computation, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples. For each iterative computation, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of the data point, such as the absolute value of the expected label. Moreover, for each iteration, the prior probability information may be adjusted according to the estimated class membership probabilities of the data points. The iterative step of training the transductive classifier may be repeated until the data values converge, for example, when the change of the decision function of the transductive classifier drops below a predetermined threshold, when the change of the determined expected label values drops below a predetermined threshold, etc.
In addition, the program code includes instructions for using the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points, and instructions for outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process. Moreover, a decision function may be determined using the labeled and unlabeled data points as learning examples according to their expected labels; given the included and excluded training examples, the decision function may minimize the KL divergence to the prior probability distribution of the decision function parameters.
In another embodiment, a data processing apparatus includes at least one memory for storing: (i) labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of a data point to be included in a designated category or a training example of a data point to be excluded from a designated category; (ii) unlabeled data points; and (iii) at least one predetermined cost factor of the labeled and unlabeled data points. The memory may also store prior probability information of the labeled and unlabeled data points. Moreover, the labels of included training examples may be mapped to a first numeric value, such as +1, and excluded training examples may be mapped to a second numeric value, such as -1.
In addition, the data processing apparatus includes a transductive classifier trainer for iteratively training the transductive classifier using transductive maximum entropy discrimination (MED), with the at least one cost factor and the labeled and unlabeled data points as training examples. In each iterative MED computation, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of the data point, such as the absolute value of the expected label of the data point. Moreover, in each iterative MED computation, the prior probability information may be adjusted according to the estimated class membership probabilities of the data points. The apparatus may also include a device for determining convergence of the data values, e.g., when the change of the decision function computed by the transductive classifier drops below a predetermined threshold, when the change of the determined expected label values drops below a predetermined threshold, etc., and for terminating the computation once convergence is determined.
In addition, the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points, and input data points. Moreover, a decision function may be determined using the labeled and unlabeled data points as learning examples according to their expected labels; given the included and excluded training examples, the decision function may minimize the KL divergence to the prior probability distribution of the decision function parameters. Moreover, the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
In another embodiment, an article of manufacture includes a computer readable program storage medium tangibly embodying one or more programs of computer executable instructions to perform a method of data classification. In use, labeled data points are received, each having at least one label indicating whether the data point is a training example of a data point to be included in a designated category or a training example of a data point to be excluded from a designated category. In addition, unlabeled data points are received, along with at least one predetermined cost factor of the labeled and unlabeled data points. The prior probability information of the labeled and unlabeled data points may also be stored in computer memory. Moreover, the labels of included training examples may be mapped to a first numeric value, such as +1, and excluded training examples may be mapped to a second numeric value, such as -1, etc.
Further, a transductive classifier is trained using iterative maximum entropy discrimination (MED) computations, with the at least one stored cost factor and the stored labeled and unlabeled data points as training examples. In each iterative MED computation, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of the data point, such as the absolute value of the expected label of the data point. Moreover, in each iterative MED computation, the prior probability information may be adjusted according to the estimated class membership probabilities of the data points. The iterative step of training the transductive classifier may be repeated until convergence of the data values is reached, for example, when the change of the decision function of the transductive classifier drops below a predetermined threshold, when the change of the determined expected label values drops below a predetermined threshold, etc.
In addition, input data points are accessed from the computer memory, and the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points, and the input data points. Moreover, a decision function may be determined using the labeled and unlabeled data points as learning examples according to their expected labels; given the included and excluded training examples, the decision function may minimize the KL divergence to the prior probability distribution of the decision function parameters. Moreover, the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
In another embodiment, a method for classifying unlabeled data in a computer-based system is provided. In use, labeled data points are received, each labeled data point having at least one label indicating whether the data point is a training example of a data point to be included in a designated category or a training example of a data point to be excluded from a designated category.
In addition, labeled and unlabeled data points are received, and the prior label probability information of the labeled and unlabeled data points is also received. Moreover, at least one predetermined cost factor of the labeled and unlabeled data points is received.
Moreover, the expected label of each labeled and unlabeled data point is determined according to the label prior probability of that data point. The following sub-steps are then repeated until the data values have sufficiently converged (see the sketch after this list):
● generating, for each unlabeled data point, an adjusted cost value proportional to the absolute value of the expected label of the data point;

● training a maximum entropy discrimination (MED) classifier by determining a decision function, given the included and excluded training examples, using the labeled and unlabeled data points as training examples according to their expected labels, the decision function minimizing the KL divergence to the prior probability distribution of the decision function parameters;

● determining the classification scores of the labeled and unlabeled data points using the trained classifier;

● calibrating the output of the trained classifier to class membership probabilities;

● updating the label prior probabilities of the unlabeled data points according to the determined class membership probabilities;

● determining the label and margin probability distributions using maximum entropy discrimination (MED) with the updated label prior probabilities and the previously determined classification scores;

● computing new expected labels using the previously determined label probability distribution; and

● updating the expected label of each data point by replacing the expected label of the previous iteration with the new expected label.
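The calibrate-and-update sub-steps can be sketched as follows; the Bayesian blending of the calibrated probability with the current label prior is one plausible reading of the update, not necessarily the patent's exact rule, and all names are illustrative:

```python
import numpy as np

def update_expected_labels(scores, calibrate, prior_pos):
    """Turn classification scores into class membership probabilities,
    fold them into the current label priors, and return the updated
    priors together with the new expected labels <y> = p(+1) - p(-1)."""
    p_pos = calibrate(np.asarray(scores))
    odds = (p_pos * prior_pos) / np.clip((1 - p_pos) * (1 - prior_pos),
                                         1e-12, None)
    new_prior = odds / (1.0 + odds)
    return new_prior, 2.0 * new_prior - 1.0

prior, exp_label = update_expected_labels(
    [1.2, -0.7],                                # scores from the M step
    lambda s: 1.0 / (1.0 + np.exp(-2.0 * s)),   # a stand-in calibration map
    np.array([0.5, 0.5]))                       # non-informative priors
print(prior, exp_label)
```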
Moreover, the classification of the input data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
Convergence is reached when the change of the decision function drops below a predetermined threshold. In addition, convergence may also be reached when the change of the determined expected label values drops below a predetermined threshold. Moreover, the labels of included training examples may take any value, such as +1, and excluded training examples may take any value, such as -1.
In one embodiment of the invention, a method for classifying files is shown in Fig. 11. In use, at step 1100, at least one seed file with a known confidence level is received, along with unlabeled files and at least one predetermined cost factor. The seed file and the other items may be received from computer memory, from a user, over a network connection, etc., and may be received in response to a request from a system performing the method. The at least one seed file may carry a label indicating whether the file is to be included in a designated category, may contain a list of keywords, or may have any other feature that aids in classifying files. Moreover, at step 1102, a transductive classifier is trained through iterative computation using the at least one predetermined cost factor, the at least one seed file, and the unlabeled files, wherein, for each iterative computation, the cost factor is adjusted as a function of an expected label value. Data point label prior probabilities of the labeled and unlabeled files may also be received, wherein, for each iterative computation, the data point label prior probabilities may be adjusted according to the estimated class membership probabilities of the data points.
In addition, at step 1104, confidence scores are stored for the unlabeled files after at least some of the iterations, and at step 1106, identifiers of the unlabeled files with the highest confidence scores are output to at least one of a user, another system, and another process. An identifier may be the electronic copy of the file itself, a portion of it, its title, its name, a pointer to the file, etc. Moreover, confidence scores may be stored after each iteration, in which case the identifiers of the unlabeled files with the highest confidence scores after each iteration are output.
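Ranking and reporting the highest-confidence files reduces to a sort over the stored scores, as in this illustrative sketch (names and the cutoff k are assumptions):

```python
def top_documents(doc_ids, confidence, k=5):
    """Return identifiers of the k unlabeled files with the highest
    stored confidence scores."""
    ranked = sorted(zip(doc_ids, confidence), key=lambda p: p[1],
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(top_documents(['doc-a', 'doc-b', 'doc-c'], [0.42, 0.97, 0.71], k=2))
# -> ['doc-b', 'doc-c']
```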
One embodiment of the invention can search for patterns linking an original file to other files. The goal of the query is an area where such pattern queries prove especially valuable. For example, in pre-trial legal discovery, a large number of files must be researched for possible links to the lawsuit at hand; the ultimate purpose is to find the "smoking gun" evidence. In another example, a common task for inventors, patent examiners, and patent attorneys is to assess the novelty of a technology through a search of the prior art. In particular, the task is to search all published patents and other publications and to find, within this collection, the files likely to be relevant to the particular technology whose novelty is being examined.
The task of querying consists of finding a file or a group of files within a set of data. Given an original file or concept, the user may wish to find files related to that original file or concept. However, the relationship between the original file or concept and the target files, i.e., the files to be retrieved, can only be well understood after the query. By learning from labeled and unlabeled files, concepts, and the like, the present invention can learn the patterns and relationships between one or more original files and the target files.
In another embodiment of the invention, a method for analyzing files related to legal discovery is shown in Fig. 12. In use, files related to a legal matter are received at step 1200. These files may include the electronic copies of the files themselves, portions of them, their titles, their names, pointers to the files, etc. In addition, at step 1202, a file classification method is performed on the files. Further, at step 1204, identifiers of at least some of the files are output based on their classification. Optionally, indications of the links between these files are also output.
The file classification method may include any type of process, such as a transductive process. For example, any of the inductive or transductive methods described above may be used. In a preferred approach, a transductive classifier is trained through iterative computation using at least one predetermined cost factor, at least one seed file, and the files related to the legal matter. For each iterative computation, the cost factor is preferably adjusted as a function of an expected label value, and the trained classifier is used to classify the received files. The process may also include receiving data point label prior probabilities for the labeled and unlabeled files, wherein, for each iterative computation, the data point label prior probabilities are adjusted according to the estimated class membership probabilities of the data points. In addition, the file classification method may include one or more support vector machine processes and maximum entropy discrimination processes.
In another embodiment, a method for analyzing prior art documents is shown in Fig. 13. In use, at step 1300, a classifier is trained based on a search query. At step 1302, a plurality of prior art documents is accessed. Such prior art may include any information made available to the public in any form before a given date. The prior art may also include information that was not available to the public in any form before a given date. The prior art documents may be any type of file, such as publications of a patent office, data derived from a database, a collection of prior art, portions of web pages, etc. Moreover, at step 1304, a file classification method is performed on at least some of the prior art documents using the classifier, and at step 1306, identifiers of at least some of the prior art documents are output based on their classification. The file classification technique may include one or more processes, including a support vector machine process, a maximum entropy discrimination process, or any of the inductive or transductive methods described above. Alternatively or additionally, indications of the links between the documents may also be output. In another embodiment, relevance scores among at least some of the prior art documents are output based on their classification.
The search query may include at least a portion of a patent disclosure. Examples of patent disclosures include disclosures prepared by inventors summarizing their inventions, provisional patent applications, non-provisional patent applications, foreign patents or patent applications, etc.
In a preferred approach, the search query includes at least a portion of a claim of a patent or patent application. In another approach, the search query includes at least a portion of the abstract of a patent or patent application. In yet another approach, the search query includes at least a portion of the summary of the invention of a patent or patent application.
Fig. 27 shows a method for matching files to claims. At step 2700, a classifier is trained based on at least one claim of a patent or patent application. Accordingly, one or more of the claims, or a portion thereof, may be used to train the classifier. At step 2702, a plurality of files is accessed. These files may include prior art documents, files describing potentially infringing products, or files describing prior uses of products. At step 2704, a file classification method is performed on at least some of the files using the classifier. At step 2706, identifiers of at least some of the files are output based on their classification. Relevance scores of at least some of the files may also be output based on their classification.
One embodiment of the invention may be used for the classification of patent applications. In the United States, for example, patents and patent applications are nowadays classified according to their subject matter using the US Patent Classification (USPC) system. This task is currently done manually and is therefore costly and time-consuming. Such manual classification is also prone to error. An added complexity of the task is that a patent or patent application may be assigned to multiple classes.
According to one embodiment, Fig. 28 shows a method for classifying a patent application. At step 2800, a classifier is trained based on a plurality of files known to belong to a particular patent classification. These files may typically be the abstracts of patents or patent applications (or portions thereof), but may also be files describing the target subject matter of the particular patent classification. At step 2802, at least a portion of a patent or patent application is received. The portion may include the claims, the summary of the invention, the abstract, the specification, the title, etc. At step 2804, a file classification method is performed on the at least a portion of the patent or patent application using the classifier. At step 2806, the classification of the patent or patent application is output. Optionally, a user can manually review the classification of some or all of the patent applications.
The file classification method is preferably a yes/no classification method. In other words, if the probability that the file belongs to the correct class is above a threshold, the determination is yes, the file belongs to the category. If the probability that the file is in the correct class is below the threshold, the determination is no, the file does not belong to the category.
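The yes/no rule is a single threshold comparison, sketched below (illustrative; the 0.7 threshold is an arbitrary example value):

```python
def yes_no_classify(prob_in_class, threshold=0.5):
    """A file belongs to the category exactly when its class membership
    probability clears the threshold."""
    return prob_in_class > threshold

print(yes_no_classify(0.83, threshold=0.7))  # True: assign the category
print(yes_no_classify(0.41, threshold=0.7))  # False: withhold it
```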
Fig. 29 shows another method for classifying a patent application. At step 2900, a file classification method is performed on at least a portion of a patent or patent application using a classifier that has previously been trained based on at least one file related to a particular patent classification. Again, the file classification method is preferably a yes/no classification method. At step 2902, the classification of the patent or patent application is output.
In both methods shown in Figs. 28 and 29, the respective method may be repeated using a different classifier that has previously been trained based on a plurality of files known to belong to a different patent classification.
Formally, the classification of a patent should be based on the claims. However, it is also desirable to perform matching between any IP-related content and any other IP-related content. As an example, one approach trains using patent specifications and classifies patent applications according to their claims. Another approach trains using the specification and the claims, and classifies based on the abstract. In a particularly preferred approach, whichever portion of the patent or application is used for training, the same type of content is used for classification; that is, if the system is trained on claims, classification is based on claims.
The file classification method may include any type of process, such as a transductive process, etc. For example, any of the inductive or transductive methods described above may be used. In a preferred approach, the classifier may be a transductive classifier trained through iterative computation using at least one predetermined cost factor, at least one seed file, and the prior art documents, wherein, for each iterative computation, the cost factor is adjusted as a function of an expected label value, and the trained classifier may be used to classify the prior art documents. Data point label prior probabilities of the seed file and the prior art documents may also be received, wherein, for each iterative computation, the data point label prior probabilities may be adjusted according to the estimated class membership probabilities of the data points. The seed file may be any file, such as a publication of a patent office, data derived from a database, a collection of prior art, a website, or a patent disclosure.
In one approach, Fig. 14 describes an embodiment of the invention. At step 1401, a set of data is read; within this set of data, the files relevant to the user are to be discovered. At step 1402, one or more initial seed files are labeled. The files may be any kind of file, such as publications of a patent office, data derived from a database, a collection of prior art, websites, etc. The transductive process may also be seeded with a string of different keywords or with file layouts provided by the user. At step 1406, a transductive classifier is trained using the labeled data and the unlabeled data in the given set. In each label induction step of the iterative transductive process, the confidence scores determined in the label generalization procedure are stored. At step 1408, once training is complete, the files that obtained high confidence scores in the label induction steps are displayed to the user. These files with high confidence scores represent the files relevant to the purpose of the user's query. The display may follow the chronological order of the label induction steps, starting from the initial seed files and proceeding to the last group of files found in the final label induction step.
Another embodiment of the invention relates to data cleanup and accurate classification, for example in combination with automated business processes. The cleanup and classification methods may include any type of process, such as a transductive process; for example, any of the transductive or inductive methods described above may be used. In a preferred approach, according to the expected cleanliness of a database, the keys in the database are used as labels with associated confidence levels. The labels, together with their associated confidence levels as expected labels, are then used to train a transductive classifier, and the classifier corrects the labels (keys), thereby enabling more reliable management of the data in the database. For example, invoices must first be classified according to the company or individual issuing the invoice in order to enable automatic data extraction, e.g., determining the total amount, the order number, the product quantities, the shipping address, etc. Usually, setting up an automatic classification system requires training examples. However, the training examples provided by customers often contain misclassified files or other noise, such as fax cover pages; to achieve accurate classification, these files must be identified and removed before the automatic classification system is trained. In another embodiment, in the field of medical records, the invention helps detect inconsistencies among the diagnostic reports written by physicians.
In another embodiment, it is well known that patent offices must undergo continuous reclassification processes, wherein they (1) evaluate the existing branch points of their classification scheme, (2) restructure the classification so that overcrowded nodes are evenly redistributed, and (3) reclassify existing patents into the new structure. The transductive learning methods described herein can be used by patent offices, and by the companies to which this work is outsourced, to re-evaluate their classifications, helping them (1) establish new subclasses for a given main classification and (2) reclassify existing patents.
Transduction learns from labeled and unlabeled data and thus interpolates smoothly between the labeled and the unlabeled regimes. At one end of the spectrum is labeled data with perfect prior knowledge, e.g., every given label is correct without exception. At the other end is unlabeled data, for which no prior knowledge is given. Data organized with some degree of noise consist partly of misclassified items and lie somewhere between the two extremes of this spectrum. The labels provided by the data organization can be considered correct to some extent, but not entirely. Accordingly, transduction can be used to clean up existing data organizations by assuming a specific error level within a given data organization and interpreting it as uncertainty in the prior knowledge of the label assignments.
In one embodiment, a method for cleaning up data is shown in Fig. 15. In use, at step 1500, a plurality of labeled data items is received, and at step 1502, a subset of the data items is selected for each of a plurality of categories. In addition, at step 1504, the uncertainty of the data items in each subset is set to about zero, while at step 1506 the uncertainty of the data items not in the subsets is set to a predetermined value that is not about zero. Further, at step 1508, a transductive classifier is trained through iterative computation, using the uncertainties and the data items in the subsets and not in the subsets as training examples, and at step 1510, the trained classifier is used to classify each individual labeled data item in each category. Moreover, the classification of the input data, or a derivative thereof, is output at step 1512 to at least one of a user, another system, and another process.
Further, the subsets may be selected at random, or may be selected and verified by a user. The labels of at least some of the data items may be changed based on their classification. Moreover, after classification, identifiers of the data items having a confidence level below a predetermined threshold are output to the user. An identifier may be the electronic copy of the item itself, a portion of it, its title, its name, a pointer to the item, etc.
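The uncertainty assignment of steps 1504-1506 can be sketched as follows (illustrative; the 0.2 default stands in for the "not about zero" preset value, which the text leaves open):

```python
import numpy as np

def assign_uncertainties(n_items, trusted_idx, default_uncertainty=0.2):
    """Set the label uncertainty of the verified (or randomly chosen)
    subset to ~0 and give every other item a nonzero preset value."""
    u = np.full(n_items, default_uncertainty)
    u[np.asarray(trusted_idx)] = 0.0
    return u

print(assign_uncertainties(6, trusted_idx=[0, 3]))
# -> [0.  0.2 0.2 0.  0.2 0.2]
```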
In one embodiment of the invention, as shown in Fig. 16, at step 1600, two options of a cleanup process are presented to the user. At step 1602, one option is fully automatic cleanup: for each concept or category, a certain number of files is selected at random and assumed to be correctly organized. Alternatively, at step 1604, a certain number of files can be flagged for manual inspection and verification of whether the one or more label assignments of each concept or category are accurately organized. At step 1606, an estimate of the noise level in the data is received. At step 1610, the transductive classifier is trained using the verified data from step 1608 (manually checked or randomly selected) and the unverified data. Once training has finished, the files are reorganized according to their new labels. At step 1612, files whose label assignment has a confidence level below a specific threshold are displayed to the user for manual review. At step 1614, files whose label assignment has a confidence level above a specific threshold are corrected automatically according to the transductively assigned labels.
In another embodiment, a method for managing medical records is shown in Figure 17. In use, in step 1700, a classifier is trained based on a medical diagnosis, and in step 1702 a plurality of medical records is accessed. In addition, in step 1704, a document classification method is performed on the records using the classifier, and in step 1706 an identifier of at least one record having a low probability of being related to the medical diagnosis is output. The document classification method may include any kind of process, such as a transductive process, and may use any one or more of the inductive or transductive methods described above, including a support vector machine process, a maximum entropy discrimination process, etc.
In one embodiment, the classifier may be a transductive classifier, trained by iterative calculation using at least one predetermined cost factor, at least one seed document, and the medical records, wherein for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier is used to classify the records. Data point label prior probabilities for the seed documents and records may also be received, wherein for each iteration the data point label prior probabilities are adjusted according to an estimate of the data point group membership probabilities.
Another embodiment of the present invention addresses dynamic, drifting classification concepts. For example, in forms-processing applications, documents are classified using their layout information and/or content information, and the classification determines how a document is further processed. In many applications documents are not fixed but change over time; for example, the content and/or layout of a form may change because of new legislation. Transductive classification adapts to these changes automatically and yields the same or similar classification accuracy, unaffected by the drifting classification concept. In contrast to rule-based systems or inductive classification methods, no manual adjustment is needed and accuracy does not degrade under concept drift. One example of this is invoice processing, which traditionally relies on inductive learning or on rule-based systems built around the invoice layout. In such traditional systems, if the layout changes, the system must be reset manually by labeling new training data or by formulating new rules. With transduction, the system automatically adapts to small changes in the invoice layout, making manual resets unnecessary. In another embodiment, transductive classification can be used to analyze customer complaints in order to monitor changes in their nature; for example, a company can automatically link product changes to customer complaints.
Transduction can also be used for the classification of news articles. For example, news articles about wars and terrorist attacks, from the war in Afghanistan following the terrorist attacks of September 11, 2001 through news stories about the current situation in Iraq, can be identified automatically using transduction.
In another embodiment, biological classification (alpha taxonomy) changes over time as new species arise through evolution and other species become extinct. As the classification concepts change over time, the classification schema, the taxonomy, and other rules can change dynamically as well.
By treating the incoming data to be classified as unlabeled data, transduction can recognize drifting classification concepts and thereby adapt automatically to a changing classification schema. For example, Figure 18 shows an embodiment of the present invention that uses transduction given drifting classification concepts. A group of documents Di enters the system at time ti, as shown in step 1802. In step 1804, a transductive classifier Ci is trained using the labeled and unlabeled data accumulated so far, and in step 1806 the documents in group Di are classified. In manual mode, documents determined in step 1808 to have a confidence level below a user-provided threshold are presented to the user for manual inspection in step 1810. As shown in step 1812, in automatic mode a document with such a low confidence level triggers the creation of a new category: the category is added to the system and the document is assigned to it. In steps 1820A-B, documents with confidence levels above the selected threshold are classified into the current categories 1 to N. Documents that were classified into the current categories before time ti are reclassified by classifier Ci in step 1822, and in steps 1824 and 1826 all documents that are no longer classified into their previously assigned categories are moved into the new category.
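A minimal sketch of this Figure 18 loop follows, assuming a generic classifier object with scikit-learn-style fit/predict_proba methods; routing all low-confidence documents of a batch into a single new category is a simplification of the per-document behavior described above.

import numpy as np

def process_batch(clf, X_seen, y_seen, X_batch, threshold=0.7):
    # steps 1804/1806: train on all data accumulated so far, classify the new batch
    clf.fit(X_seen, y_seen)
    proba = clf.predict_proba(X_batch)
    labels = clf.classes_[proba.argmax(axis=1)]
    conf = proba.max(axis=1)
    # step 1812 (automatic mode): low-confidence documents open a new category
    labels = np.where(conf < threshold, y_seen.max() + 1, labels)
    # steps 1822-1826: refit under the enlarged taxonomy and reclassify everything,
    # letting earlier documents migrate into the new category
    X_all = np.vstack([X_seen, X_batch])
    y_all = np.concatenate([y_seen, labels])
    clf.fit(X_all, y_all)
    return X_all, clf.predict(X_all)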
In another embodiment, a method for adapting to variations in document content is shown in Figure 19. Document content may include, but is not limited to, image content, text content, layout, numbering, etc. Examples of variation include changes over time, changes of style (one or more documents handled by two or more people), changes in the application process, variations of layout, etc. In step 1900, at least one labeled seed document, unlabeled documents, and at least one predetermined cost factor are received. The documents may include, but are not limited to, customer complaints, invoices, form documents, receipts, etc. In addition, in step 1902, a transductive classifier is trained using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents. Moreover, in step 1904, unlabeled documents with confidence levels above a predetermined threshold are classified into a plurality of categories using the classifier, and in step 1906 at least a portion of the classified documents is reclassified into the plurality of categories using the classifier. Further, in step 1908, identifiers of the classified documents are output to at least one of a user, another system, and another process. An identifier may be an electronic copy of the document itself, a portion of it, its name, its title, a pointer to the document, etc. Moreover, product changes can be linked to customer complaints, etc.
In addition, unlabeled documents with confidence levels below a predetermined threshold may be moved into one or more new categories. Moreover, the transductive classifier may be trained by iterative calculation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier is used to classify the unlabeled documents. Moreover, data point label prior probabilities for the seed documents and the unlabeled documents may be received, wherein for each iteration the data point label prior probabilities are adjusted according to an estimate of the data point group membership probabilities.
In another embodiment, a method for adapting patent classification to variations in document content is shown in Figure 20. In step 2000, at least one labeled seed document and unlabeled documents are received. The unlabeled documents may include documents of any type, e.g., patent applications, legal documents, information disclosure statements, document amendments, etc. The seed documents may include patents, patent applications, etc. In step 2002, a transductive classifier is trained using the at least one seed document and the unlabeled documents, and the classifier is used to classify the unlabeled documents having confidence levels above a predetermined threshold into a plurality of existing categories. The classifier may be any kind of classifier, such as a transductive classifier, and the document classification method may be any method, for example a support vector machine method, a maximum entropy discrimination method, etc.; any of the inductive or transductive methods described above may be used.
Moreover, in step 2004, the classifier is used to classify the unlabeled documents having confidence levels below the predetermined threshold into at least one new category, and in step 2006 the classifier is used to reclassify at least a portion of the classified documents into the existing categories and the at least one new category. Further, in step 2008, identifiers of the classified documents are output to at least one of a user, another system, and another process. Furthermore, the transductive classifier may be trained by iterative calculation using at least one predetermined cost factor and the documents, wherein for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier may be used to classify the documents. Further, data point prior probabilities for the documents may be received, wherein for each iteration the data point prior probabilities are adjusted according to an estimate of the data point group membership probabilities.
In another embodiment of the present invention, document drift in the field of document separation is addressed. One example application is the processing of mortgage documents. A series of different lending documents, such as loan applications, approvals, and requests, is scanned, and before further processing the distinct documents within the resulting sequence of images must be determined. The documents in use are not fixed but can change over time; for example, the tax forms used in lending documents can change as laws and regulations change.
Document separation addresses the problem of finding document or subdocument boundaries in a sequence of images. Typical sources of such sequences are digital scanners and multi-function peripherals (MFPs). As in the classification embodiments, transduction can be applied to document separation to handle the drift of documents and their boundaries over time. Static separation systems, such as rule-based systems or systems based on inductive learning, cannot automatically adapt to drifting separation concepts: whenever drift occurs, the performance of these static systems degrades over time. To maintain the initial level of performance, either the rules must be adjusted manually (for rule-based systems) or new documents must be labeled by hand and the system relearned (for inductive learning); either way is time-consuming and costly. Applying transduction to document separation improves the system so that it adapts automatically to drift in the separation concept.
In one embodiment, a method for separating documents is shown in Figure 21. In step 2100, labeled data is received, and in step 2102 a group of unlabeled documents is received. The data and documents may include legal discovery documents, official notices, web data, attorney correspondence, etc. In addition, in step 2104, probabilistic classification rules are adapted by transduction, based on the labeled data and the unlabeled documents, and in step 2106 the weights used for document separation are updated according to the probabilistic classification rules. Moreover, in step 2108, the locations of separations in the group of documents are determined, and in step 2110 indicators of the determined separation locations are output to at least one of a user, another system, and another process. An indicator may be an electronic copy of the document itself, a portion of it, its name, its title, a pointer to the document, etc. Further, in step 2112, the documents are tagged with codes associated with the indicators.
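The separation decision itself can be pictured as thresholding a per-boundary probability. The sketch below assumes such a boundary-probability function is supplied by the transductively adapted rules of steps 2104-2106; the function and threshold are placeholders, not part of the claimed method.

from typing import Callable, List, Sequence

def separate(pages: Sequence[str],
             p_boundary: Callable[[str, str], float],
             threshold: float = 0.5) -> List[int]:
    # return the indices i at which a new document starts (step 2108)
    cuts = [0]                     # the first page always opens a document
    for i in range(1, len(pages)):
        if p_boundary(pages[i - 1], pages[i]) > threshold:
            cuts.append(i)
    return cuts

The resulting cut positions correspond to the indicators output in step 2110, and step 2112 would tag each page with a code derived from them.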
Figure 22 shows an embodiment of the classification method of the present invention applied to document separation, together with the associated equipment. After digital scanning, automatic document separation is used to reduce the manual work involved in separating and identifying documents. Using an inference algorithm, the document separation method combines classification rules to automatically separate groups of pages, applying the classification techniques described here to derive the most probable separation from all available information. As shown in Figure 22, an example of the invention uses the transductive MED classification method for document separation. Specifically, document pages 2200 are placed into a digital scanner 2202 or MFP and converted into a set of digital images 2204. The pages may come from documents of any type, such as patent office publications, data derived from databases, collections of prior art, websites, etc. In step 2206, the set of digital images is input in order to dynamically adapt the probabilistic classification rules using transduction. Step 2206 uses the group of images 2204 as unlabeled data together with labeled data 2208. In step 2210, the weights in the probabilistic network are updated and used for automatic document separation based on the dynamically adapted classification rules. Output step 2212 is the dynamically adaptive automatic insertion of separator images: the digital pages 2214 are interleaved with automatically generated separator pages 2216, which are inserted into the image sequence in step 2212. In one embodiment of the invention, the software-created separator page 2216 can also indicate the type of the document that follows it. The system described here automatically adapts to the drifting separation concepts that documents exhibit over time, without the loss of separation accuracy that occurs in rule-based static systems or in systems based on inductive machine learning. In forms-processing applications, a common example of a drifting separation or classification concept is, as mentioned before, documents changing because of new laws and regulations.
In addition, the system of Figure 22 can be modified into the system shown in Figure 23, in which pages 2300 placed into a digital scanner 2302 or MFP are converted into a set of digital images 2304. The group of digital images is input in step 2306 to dynamically adapt the probabilistic classification rules using transduction. Step 2306 uses the group of images 2304 as unlabeled data together with labeled data 2308. In step 2310, the weights in the probabilistic network used for automatic document separation are updated according to the dynamically adapted classification rules. Rather than inserting separator page images as in Figure 22, step 2312 dynamically adapts the automatically inserted separation information and tags the document images with descriptive codes. The document page images can thus be loaded into an image processing database 2316, and the documents can be accessed through software identifiers.
An alternative embodiment of the invention uses transduction for face recognition. As described above, transduction has many advantages, for example requiring only a relatively small number of training examples and being able to use unlabeled samples in training. Exploiting these advantages, transductive face recognition can be used in criminal investigations.
For example, the Department of Homeland Security must ensure that terrorists cannot board commercial airliners. Part of the airport screening process can be to photograph each passenger at airport security and attempt to identify the person. The system can initially be trained on a small number of examples drawn from the limited photographs available of suspected terrorists. Unlabeled photographs of the same terrorists in other law enforcement databases can also be used for training. The transductive trainer can therefore not only build a face recognition system from the initially sparse data, but also use unlabeled samples from other sources to improve performance. After the photographs collected at airport security have been processed, the transductive system can identify suspects more accurately than an inductive system.
In another embodiment, a method for face recognition is shown in Figure 24. In step 2400, at least one labeled seed image of a face, having a known confidence level, is received. Each seed image may carry a label indicating whether the image is included in a specified category. Also in step 2400, unlabeled images are received, e.g., from police departments, government agencies, missing children databases, airport security, or any other source, together with at least one predetermined cost factor. Moreover, in step 2402, a transductive classifier is trained by iterative calculation using the at least one predetermined cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration the cost factor is adjusted as a function of an expected label value. In step 2404, after at least some of the iterations, confidence scores are stored for the unlabeled images.
Further, in step 2406, the identifier of the unlabeled image with the highest confidence score is output to at least one of a user, another system, and another process. The identifier may be an electronic copy of the image itself, a portion of it, its name, its title, a pointer to it, etc. Moreover, confidence scores may be stored after each iteration, in which case the identifier of the unlabeled image with the highest confidence score is output after each iteration. Furthermore, data point label prior probabilities for the labeled and unlabeled images may be received, wherein for each iteration the data point label prior probabilities may be adjusted according to an estimate of the data point group membership probabilities. Further, a third, unlabeled image of a face, for example from the airport security scenario above, may be received and compared with at least some of the images having the highest confidence scores; if it is determined with confidence that the face in the third unlabeled image is the same as the face in the seed image, an identifier of the third unlabeled image can be output.
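The final comparison step might look like the following sketch, which matches a probe image against the highest-confidence gallery images by cosine similarity of feature vectors; the embedding, top_k, and sim_threshold are assumptions for illustration, not part of the patent's method.

import numpy as np

def match_probe(probe_vec, gallery_vecs, gallery_ids, gallery_conf,
                top_k=5, sim_threshold=0.8):
    # consider only the gallery images the classifier scored most confidently
    order = np.argsort(gallery_conf)[::-1][:top_k]
    sims = gallery_vecs[order] @ probe_vec
    sims = sims / (np.linalg.norm(gallery_vecs[order], axis=1)
                   * np.linalg.norm(probe_vec) + 1e-12)
    best = int(np.argmax(sims))
    if sims[best] >= sim_threshold:        # "same face" decision
        return gallery_ids[order[best]], float(sims[best])
    return None, float(sims[best])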
An alternative embodiment of the invention enables users to improve their search results by providing feedback to a document retrieval system. For example, when a search is performed on an internet search engine (a patent or patent application search product, etc.), the user may obtain a large number of results corresponding to the search query. One embodiment of the present invention allows the user to browse the results suggested by the search engine and to inform the engine of the relevance of one or more of them, e.g., "close, but not what I really wanted", "definitely not", etc. As the user provides feedback to the search engine, better results are presented to the user in order of priority.
In one embodiment, a method for document searching is shown in Figure 25. In step 2500, a search query is received. The query may be of any kind, including a case-sensitive query, a Boolean query, an approximate-match query, a structured query, etc. In step 2502, documents are retrieved based on the search query. In addition, in step 2504 the documents are output, and in step 2506 labels entered by the user for at least some of the documents are received, the labels indicating the relevance of the documents to the search query. For example, the user can indicate whether a particular result returned for the query is relevant or irrelevant. Moreover, in step 2508, a classifier is trained based on the search query and the user-entered labels, and in step 2510 a document classification method is performed on the documents using the classifier in order to reclassify them. Further, in step 2512, identifiers of at least some of the documents are output based on their classification. An identifier may be an electronic copy of the document itself, a portion of it, its name, its title, a pointer to the document, etc. The reclassified documents may also be output, with the documents having the highest confidence levels output first.
The document classification method may include any kind of process, e.g., a transductive process, a support vector machine process, a maximum entropy discrimination process, etc.; any of the inductive or transductive methods described above may be used. In a preferred approach, the classifier is a transductive classifier trained by iterative calculation using at least one predetermined cost factor, the search query, and the documents, wherein for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier is used to classify the documents. In addition, data point label prior probabilities for the search query and documents may be received, wherein for each iteration the data point label prior probabilities may be adjusted according to an estimate of the data point group membership probabilities.
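A toy version of this feedback loop is sketched below, with TF-IDF features and scikit-learn's LabelSpreading standing in for the transductive MED classifier; it assumes the user has marked at least one relevant and one irrelevant result.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

def rerank(docs, feedback):
    # docs: list of result texts; feedback: {index: 1 (relevant) or 0 (not)}
    X = TfidfVectorizer().fit_transform(docs).toarray()
    y = np.full(len(docs), -1)             # unrated results stay unlabeled
    for i, label in feedback.items():      # steps 2506/2508
        y[i] = label
    clf = LabelSpreading(kernel="knn", n_neighbors=5)
    clf.fit(X, y)                          # step 2510: reclassify all results
    relevance = clf.label_distributions_[:, list(clf.classes_).index(1)]
    return np.argsort(relevance)[::-1]     # step 2512: most relevant first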
An alternative embodiment of the invention can be used for improving ICR/OCR, and speech recognition.For example, many voices are known The embodiment of other program and system needs operator to repeat many words to train the system.The present invention can be first to a use The sound monitoring a predetermined time segment at family, to collect the content of " unfiled ", e.g., monitoring telephone is talked.As a result, working as user When starting to train the identifying system, the system is learnt using transduction, assists to build a note with the voice using the monitoring Recall model.
In another embodiment, a method for verifying the association of an invoice with an entity is shown in Figure 26. In step 2600, a classifier is trained based on an invoice format associated with a first entity. The invoice format may refer to the physical layout of markings on the invoice, or to features of the invoice such as keywords, invoice numbers, customer names, etc. In addition, in step 2602, a plurality of invoices labeled as being associated with at least one of the first entity and other entities is accessed, and in step 2604 a document classification method is performed on the invoices using the classifier. Any of the inductive or transductive methods described above may be used as the document classification method; for example, it may include a transductive process, a support vector machine process, a maximum entropy discrimination process, etc. Moreover, in step 2606, an identifier of at least one invoice having a high probability of not being associated with the first entity is output.
Further, the classifier may be any kind of classifier, for example a transductive classifier, and the transductive classifier may be trained by iterative calculation using at least one predetermined cost factor, at least one seed document, and the invoices, wherein for each iteration the cost factor is adjusted as a function of an expected label value, and the trained classifier is used to classify the invoices. Moreover, data point label prior probabilities for the seed documents and invoices may be received, wherein for each iteration the data point label prior probabilities are adjusted according to an estimate of the data point group membership probabilities.
One advantage of the embodiments described here is the stability of the transduction algorithm. This stability is achieved by regulating the cost factors and adjusting the label prior probabilities. For example, in one embodiment, a transductive classifier is trained by iterative calculation using at least one cost factor, with the labeled and unlabeled data points serving as training examples. For each iteration, the cost factor of the unlabeled data points is adjusted as a function of an expected label value. In addition, for each iteration, a data point prior probability is adjusted according to an estimate of the data point group membership probabilities.
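Read as control flow, this stabilized loop can be sketched as follows. A linear SVM with per-sample weights stands in for the MED solver, so this illustrates only the mechanics named above (cost factors scaled by the magnitude of the expected label, prior re-estimated from group membership probabilities), not the patented algorithm itself; y_lab is assumed to use -1/+1 labels and to contain both classes.

import numpy as np
from sklearn.svm import SVC

def transduce(X_lab, y_lab, X_unl, c0=1.0, n_iter=10, tol=1e-3):
    exp_label = np.zeros(len(X_unl))       # expected labels, in [-1, 1]
    p_pos = 0.5                            # label prior for the unlabeled points
    for _ in range(n_iter):
        # cost factor grows with |expected label|: confidently labeled points
        # constrain the classifier strongly, uncertain ones barely at all
        cost = c0 * np.abs(exp_label) + 1e-3
        clf = SVC(kernel="linear", probability=True)
        clf.fit(np.vstack([X_lab, X_unl]),
                np.concatenate([y_lab, np.where(exp_label >= 0, 1, -1)]),
                sample_weight=np.concatenate([np.ones(len(X_lab)), cost]))
        p = clf.predict_proba(X_unl)[:, list(clf.classes_).index(1)]
        p_pos = p.mean()                   # group-membership estimate of the prior;
                                           # the full method feeds this into the next step
        new_exp = 2.0 * p - 1.0
        done = np.abs(new_exp - exp_label).max() < tol  # convergence of expected labels
        exp_label = new_exp
        if done:
            break
    return clf, exp_label, p_pos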
A workstation may have an operating system resident in memory, such as a Microsoft Windows operating system (OS), a Mac OS, or a UNIX operating system. It should be appreciated that the preferred embodiments may also be implemented on platforms and operating systems other than those mentioned. A preferred embodiment may be written in JAVA, XML, C and/or C++, or other programming languages, in combination with object-oriented programming methodology. Object-oriented programming (OOP), which is increasingly used to develop complex applications, may be employed.
The applications described above use transductive learning to overcome the problem of very sparse data sets, a problem that plagues inductive face recognition systems. This aspect of transductive learning is not limited to that application; it can be used to solve other machine learning problems caused by sparse data sets.
Those skilled in the art can devise various modifications within the scope and spirit of the embodiments of the invention disclosed here. Moreover, the individual features of the embodiments disclosed above can be used alone or in various combinations with one another, and are not limited to the specific combinations described above. Accordingly, the scope of the claims is not limited to the described embodiments.

Claims (27)

1. A method for document classification, comprising:
receiving at least one labeled seed document having a known confidence level;
receiving unlabeled documents;
receiving at least one predetermined cost factor;
training a transductive classifier by iterative calculation, using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration the cost factor is adjusted as a function of an expected label value;
storing confidence scores for the unlabeled documents after at least some of the iterations; and
outputting the identifier of the unlabeled document having the highest confidence score to at least one of a user, another system, and another process.
2. The method according to claim 1, wherein each of the one or more seed documents has a label indicating whether the seed document is included in a specified category.
3. The method according to claim 1, wherein confidence scores are stored after each iteration, and wherein after each iteration the identifier of the unlabeled document with the highest confidence score is output.
4. The method according to claim 1, further comprising generating data point label prior probabilities for the labeled and unlabeled documents, wherein for each iteration the data point label prior probabilities are adjusted according to an estimate of data point group membership probabilities.
5. The method according to claim 1, further comprising:
receiving a third, unlabeled document;
comparing the third unlabeled document with at least some of the unlabeled documents having the highest confidence scores; and
outputting the identifier of the third unlabeled document in response to determining that:
(1) the confidence level of the third unlabeled document indicates that the third unlabeled document belongs to the same category as the seed document; and
(2) the confidence level of the third unlabeled document exceeds a predefined confidence threshold.
6. In a computer-based system, a method for classifying data, comprising:
receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of a data point included in a specified category or a training example of a data point excluded from a specified category;
receiving unlabeled data points;
receiving at least one predetermined cost factor for the labeled and unlabeled data points;
training a transductive classifier by iterative calculation using maximum entropy discrimination (MED), with the at least one cost factor and the labeled and unlabeled data points as training examples, wherein for each iteration the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and a data point label prior probability is adjusted according to an estimate of a data point group membership probability;
classifying at least one of the unlabeled data points, the labeled data points, and input data points using the trained classifier; and
outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
7. The method according to claim 6, wherein the function is the absolute value of the expected label of a data point.
8. The method according to claim 6, further comprising the step of receiving prior probability information for the labeled and unlabeled data points.
9. The method according to claim 8, wherein the transductive classifier learns using the prior probability information of the labeled and unlabeled data.
10. The method according to claim 6, further comprising the step of using a Gaussian prior over the decision function parameters to determine the decision function having minimum KL divergence, given the included and excluded training examples according to their expected labels, using the labeled and unlabeled data as training examples.
11. The method according to claim 6, further comprising the step of using a multinomial prior distribution over the decision function parameters to determine the decision function having minimum KL divergence.
12. The method according to claim 6, wherein the iterative step of training the transductive classifier is repeated until convergence of the data values is reached.
13. The method according to claim 12, wherein convergence is reached when the change in the decision function of the transductive classifier falls below a predetermined threshold.
14. The method according to claim 12, wherein convergence is reached when the change in the determined expected label values falls below a predetermined threshold.
15. The method according to claim 6, wherein the label value of the included training examples is +1 and the label value of the excluded training examples is -1.
16. The method according to claim 6, wherein the labels of the included examples are mapped to a first numerical value and the labels of the excluded examples are mapped to a second numerical value.
17. The method according to claim 6, further comprising:
storing the labeled data points in a computer memory;
storing the unlabeled data points in a computer memory;
storing the input data points in a computer memory; and
storing the at least one predetermined cost factor for the labeled and unlabeled data points in a computer memory.
18. A method for classifying data, comprising:
providing computer-executable program code for use and execution in a computer system, the program code comprising instructions for:
accessing labeled data points stored in a computer memory, each labeled data point having at least one label indicating whether the data point is a training example of a data point included in a specified category or a training example of a data point excluded from a specified category;
accessing unlabeled data points from the computer memory;
accessing at least one predetermined cost factor for the labeled and unlabeled data points from the computer memory;
training a maximum entropy discrimination (MED) transductive classifier by iterative calculation, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein for each iteration the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and a data point prior probability is adjusted according to an estimate of a data point group membership probability;
classifying at least one of the unlabeled data points, the labeled data points, and input data points using the trained classifier; and
outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
19. The method according to claim 18, wherein the function is the absolute value of the expected label of a data point.
20. The method according to claim 18, further comprising the step of accessing prior probability information for the labeled and unlabeled data points stored in the computer memory.
21. The method according to claim 20, wherein for each iteration the prior probability information is adjusted according to an estimate of a data point group membership probability.
22. The method according to claim 18, further comprising the step of determining, given the included and excluded training examples according to their expected labels and using the labeled and unlabeled data as training examples, the decision function having minimum KL divergence from the prior distribution over the decision function parameters.
23. The method according to claim 18, wherein the iterative step of training the transductive classifier is repeated until convergence of the data values is reached.
24. The method according to claim 23, wherein convergence is reached when the change in the decision function of the transductive classifier falls below a predetermined threshold.
25. The method according to claim 23, wherein convergence is reached when the change in the determined expected label values falls below a predetermined threshold.
26. The method according to claim 18, wherein the label value of the included training examples is +1 and the label value of the excluded training examples is -1.
27. The method according to claim 18, wherein the labels of the included examples are mapped to a first numerical value and the labels of the excluded examples are mapped to a second numerical value.
CN201610972541.XA 2006-07-12 2007-06-07 For the transductive classification method to document and data Withdrawn CN107180264A (en)

Applications Claiming Priority (11)

Application Number Priority Date Filing Date Title
US83031106P 2006-07-12 2006-07-12
US60/830,311 2006-07-12
US11/752,719 2007-05-23
US11/752,634 2007-05-23
US11/752,634 US7761391B2 (en) 2006-07-12 2007-05-23 Methods and systems for improved transductive maximum entropy discrimination classification
US11/752,673 US7958067B2 (en) 2006-07-12 2007-05-23 Data classification methods using machine learning techniques
US11/752,691 2007-05-23
US11/752,673 2007-05-23
US11/752,691 US20080086432A1 (en) 2006-07-12 2007-05-23 Data classification methods using machine learning techniques
US11/752,719 US7937345B2 (en) 2006-07-12 2007-05-23 Data classification methods using machine learning techniques
CN200780001197.9A CN101449264B (en) 2006-07-12 2007-06-07 Method and system and the data classification method of use machine learning method for data classification of transduceing

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN200780001197.9A Division CN101449264B (en) 2006-07-12 2007-06-07 Method and system and the data classification method of use machine learning method for data classification of transduceing

Publications (1)

Publication Number Publication Date
CN107180264A true CN107180264A (en) 2017-09-19

Family

ID=40743805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610972541.XA Withdrawn CN107180264A (en) 2006-07-12 2007-06-07 For the transductive classification method to document and data

Country Status (1)

Country Link
CN (1) CN107180264A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11436816B2 (en) * 2019-03-07 2022-09-06 Seiko Epson Corporation Information processing device, learning device, and storage medium storing learnt model


Also Published As

Publication number Publication date
CN101449264A (en) 2009-06-03

Similar Documents

Publication Publication Date Title
US7937345B2 (en) Data classification methods using machine learning techniques
US7761391B2 (en) Methods and systems for improved transductive maximum entropy discrimination classification
US7958067B2 (en) Data classification methods using machine learning techniques
WO2008008142A2 (en) Machine learning techniques and transductive data classification
CN107967575B (en) Artificial intelligence platform system for artificial intelligence insurance consultation service
Kanan et al. An improved feature selection method based on ant colony optimization (ACO) evaluated on face recognition system
US20080086432A1 (en) Data classification methods using machine learning techniques
Bazan et al. The rough set exploration system
Hu et al. Active learning with partial feedback
Zavvar et al. Email spam detection using combination of particle swarm optimization and artificial neural network and support vector machine
de la Iglesia et al. Developments on a multi-objective metaheuristic (MOMH) algorithm for finding interesting sets of classification rules
Al-Rasheed Identification of important features and data mining classification techniques in predicting employee absenteeism at work.
Wu Application of improved boosting algorithm for art image classification
Trivedi et al. A modified content-based evolutionary approach to identify unsolicited emails
CN107180264A (en) For the transductive classification method to document and data
CN101449264B (en) Method and system and the data classification method of use machine learning method for data classification of transduceing
Laishram Link prediction in dynamic weighted and directed social network using supervised learning
WO2002048911A1 (en) A system and method for multi-class multi-label hierachical categorization
Zelenko et al. Automatic competitor identification from public information sources
Kou Stacked graphical learning
Siersdorfer et al. Using restrictive classification and meta classification for junk elimination
Jordan et al. Content-Based Image Retrieval Using Deep Learning
Liu et al. Distribution embedding network for meta-learning with variable-length input
Rehill Distilling interpretable causal trees from causal forests
CN111949794A (en) Online active machine learning method for text multi-classification task

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 2017-09-19)