CN107180264A - Transductive classification method for documents and data - Google Patents
Transductive classification method for documents and data
- Publication number
- CN107180264A (application number CN201610972541.XA)
- Authority
- CN
- China
- Legal status: Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention discloses systems, methods, data processing apparatus, and articles of manufacture for classifying data. Data classification methods using machine learning are also disclosed.
Description
The application is divisional application, and the international application no of its original application is PCT/US2007/013484, and international filing date is
On June 7th, 2007, China national Application No. 200780001197.9, the date into China is on April 23rd, 2008, hair
Bright entitled " method and system and the data classification method using machine learning method for data classification of transduceing ".
Technical field
The present invention relates generally to methods and apparatus for data classification. In particular, the invention provides improved transductive machine learning methods. The invention further relates to new applications of machine learning methods.
Background
With the information age and the recent explosion of electronic data across virtually all industries (including, in particular, scanned documents, web material, search engine data, text data, images, audio data files, and so on), how data is processed has become very important.
One field that is only beginning to be explored is automatic data classification. In many classification techniques, a machine or computer must learn from manually entered and established rules and/or manually created training examples. In machine learning from training examples, the number of training examples is generally smaller than the number of parameters to be estimated; that is, the number of solutions that satisfy the constraints given by the training examples is large. One challenge of machine learning is to find, despite this shortcoming, a solution that still generalizes well. There is therefore a need to overcome these and/or other problems of the prior art.
There is also a need for practical applications of various types of machine learning methods.
Summary of the invention
In a computer-based system, according to one embodiment of the present invention, a method for classifying data includes: receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of a data point included in a given category, or a training example of a data point excluded from a given category; receiving unlabeled data points; receiving at least one predefined cost factor for the labeled and unlabeled data points; training a transductive classifier by iterative calculation using maximum entropy discrimination (MED), with the at least one cost factor and the labeled and unlabeled data points as training examples, wherein for each iteration the cost factor of each unlabeled data point is adjusted as a function of its expected label value, and the prior label probability of each data point is adjusted according to an estimate of its class membership probability; using the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for classifying data includes providing executable program code to a computer system and executing it on the computer system, the program code comprising a plurality of instructions for: accessing labeled data points stored in computer memory, each labeled data point having at least one label indicating whether the data point is a training example of a data point included in a given category, or a training example of a data point excluded from a given category; accessing unlabeled data points from computer memory; accessing from computer memory at least one predefined cost factor for the labeled and unlabeled data points; training a maximum entropy discrimination (MED) transductive classifier by iterative calculation, using the at least one cost factor and the stored labeled and unlabeled data points as training examples, wherein for each iteration the cost factor of each unlabeled data point is adjusted as a function of its expected label value, and the prior label probability of each data point is adjusted according to an estimate of its class membership probability; using the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a data processing apparatus includes: at least one memory for storing (i) labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of a data point included in a given category, or a training example of a data point excluded from a given category, (ii) unlabeled data points, and (iii) at least one predefined cost factor for the labeled and unlabeled data points; and a transductive classifier trainer that iteratively trains a transductive classifier using transductive maximum entropy discrimination (MED), with the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein for each MED iteration the cost factor of each unlabeled data point is adjusted as a function of its expected label value, and the prior label probability of each data point is adjusted according to an estimate of its class membership probability;
wherein the classifier trained by the transductive classifier trainer is used to classify at least one of the unlabeled data points, the labeled data points, and input data points;
wherein the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
According to another embodiment of the invention, an article of manufacture includes a computer-readable program storage medium tangibly embodying one or more programs of instructions executable by a computer to perform a method of data classification, the method including: receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of a data point included in a given category, or a training example of a data point excluded from a given category; receiving unlabeled data points; receiving at least one predefined cost factor for the labeled and unlabeled data points; training a transductive classifier by iterative maximum entropy discrimination (MED) calculation, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein in each MED iteration the cost factor of each unlabeled data point is adjusted as a function of its expected label value, and the prior label probability of each data point is adjusted according to an estimate of its class membership probability; using the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
In a computer-based system, according to another embodiment of the invention, a method for classifying unlabeled data includes: receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example of a data point included in a given category, or a training example of a data point excluded from a given category; receiving labeled and unlabeled data points; receiving prior label probability information for the labeled and unlabeled data points; receiving at least one predefined cost factor for the labeled and unlabeled data points; determining the expected label of each labeled and unlabeled data point from the label prior probabilities of the data points; and repeating the following sub-steps until the data values have sufficiently converged (a sketch of the resulting loop follows this list):
● generating, for each unlabeled data point, an adjusted cost value proportional to the absolute value of the expected label of that data point;
● training a classifier by determining a decision function, given the training examples to be included and excluded, using the labeled and unlabeled data points as training examples according to their expected labels, the decision function minimizing the KL divergence to the prior probability distribution over the decision function parameters;
● determining the classification scores of the labeled and unlabeled data points using the trained classifier;
● calibrating the output of the trained classifier to class membership probabilities;
● updating the label prior probabilities of the unlabeled data points according to the determined class membership probabilities;
● determining the label and margin probability distributions using maximum entropy discrimination (MED), with the updated label prior probabilities and the previously determined classification scores;
● computing new expected labels using the previously determined label probability distribution; and
● updating the expected label of each data point by interpolating the expected labels of the previous iteration with the new expected labels.
The classification of an input data point, or a derivative thereof, is output to at least one of a user, another system, and another process.
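By way of illustration only, the loop above might be sketched in Python as follows. This is a minimal sketch under stated assumptions: the weighted linear trainer stands in for the KL-regularized MED optimization described later in this document, the Lagrange-multiplier value is a crude placeholder, the score calibration and label-prior update are omitted, and all numeric choices are hypothetical rather than taken from the patent.

```python
import numpy as np

def train_weighted_linear(X, y_exp, sample_cost, l2=1.0, lr=0.1, epochs=200):
    # Stand-in for the decision-function step: a cost-weighted least-squares
    # fit with L2 regularization (an assumption, not the patent's
    # KL-divergence minimization).
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        g = sample_cost * (X @ w + b - y_exp)
        w -= lr * (X.T @ g / len(X) + l2 * w)
        b -= lr * g.mean()
    return w, b

def transductive_train(X_lab, y_lab, X_unlab, c=5.0, lam=2.0, alpha=0.5, iters=30):
    X = np.vstack([X_lab, X_unlab])
    n_lab = len(X_lab)
    # Expected labels: fixed +/-1 for labeled points; 2*p0 - 1 = 0 for
    # unlabeled points under a non-informative label prior p0 = 1/2.
    y_exp = np.concatenate([y_lab.astype(float), np.zeros(len(X_unlab))])
    for _ in range(iters):
        # Scale each unlabeled point's cost by |expected label|.
        cost = np.concatenate([np.full(n_lab, c),
                               c * np.abs(y_exp[n_lab:]) + 1e-3])
        w, b = train_weighted_linear(X, y_exp, cost)
        s = X @ w + b                      # classification scores
        y_new = np.tanh(lam * s[n_lab:])   # label induction, cf. formula (4) below
        # Interpolate old and new expected labels to damp oscillations.
        y_exp[n_lab:] = (1 - alpha) * y_exp[n_lab:] + alpha * y_new
    return w, b
```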
According to another embodiment of the invention, a document classification method includes: receiving at least one labeled seed document having a known confidence level of its label assignment; receiving unlabeled documents; receiving at least one predefined cost factor; training a transductive classifier by iterative calculation using the at least one predefined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration the cost factor is adjusted as a function of an expected label value; storing confidence scores for the unlabeled documents after at least some of the iterations; and outputting identifiers of the unlabeled documents having the highest confidence scores to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for analyzing documents related to a legal inquiry includes: receiving documents related to a legal matter; performing a document classification method on the documents; and outputting identifiers of at least some of the documents based on their classification.
According to another embodiment of the invention, a method for cleaning up data includes: receiving a plurality of labeled data items; selecting a subset of the data items for each of a plurality of categories; setting the bias of the data items in each subset to about zero; setting the bias of the data items not in the subsets to a predefined value that is not about zero; training a transductive classifier by iterative calculation using the biases, with the data items in the subsets and the data items not in the subsets as training examples; applying the trained classifier to each of the labeled data items in order to classify each data item; and outputting the classification of the input data, or a derivative thereof, to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for checking the association of invoices with an entity includes: training a classifier based on an invoice format related to a first entity; accessing a plurality of invoices labeled as related to at least one of the first entity and other entities; performing a document classification method on the invoices using the classifier; and outputting an identifier of at least one invoice having a high probability of not being associated with the first entity.
According to another embodiment of the invention, a method for managing medical records includes: training a classifier based on a medical diagnosis; accessing a plurality of medical records; performing a document classification method on the medical records using the classifier; and outputting an identifier of at least one medical record having a low probability of being related to the medical diagnosis.
According to another embodiment of the invention, a method for face recognition includes: receiving at least one labeled seed image of a face, the seed image having a known confidence level; receiving unlabeled images; receiving at least one predefined cost factor; training a transductive classifier by iterative calculation using the at least one predefined cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration the cost factor is adjusted as a function of an expected label value; storing a confidence score for each unlabeled image after at least some of the iterations; and outputting identifiers of the unlabeled images having the highest confidence scores to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for analyzing prior art documents includes: training a classifier based on a search query; accessing a plurality of prior art documents; performing a document classification method on at least some of the prior art documents using the classifier; and outputting identifiers of at least some of the prior art documents based on their classification.
According to another embodiment of the invention, a method for adapting patent classifications to changing document content includes: receiving at least one labeled seed document; receiving unlabeled documents; training a transductive classifier using the at least one seed document and the unlabeled documents; using the classifier, assigning unlabeled documents having a confidence level above a predetermined threshold to a plurality of existing categories; using the classifier, assigning other unlabeled documents having a confidence level below the predetermined threshold to at least one new category; using the classifier, reassigning at least some of the classified documents to the existing categories and the at least one new category; and outputting identifiers of the classified documents to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method for matching documents with claims includes: training a classifier based on at least one claim of a patent or patent application; accessing a plurality of documents; performing a document classification method on at least some of the documents using the classifier; and outputting identifiers of at least some of the documents based on their classification.
According to another embodiment of the invention, a method for classifying a patent or patent application includes: training a classifier based on a plurality of documents known to belong to a particular patent classification; receiving at least a portion of a patent or patent application; performing a document classification method on the at least a portion of the patent or patent application using the classifier; and outputting the classification of the patent or patent application, wherein the document classification method is a yes/no classification method.
According to another embodiment of the invention, a method for adapting to variations in document content includes: receiving at least one labeled seed document; receiving unlabeled documents; receiving at least one predefined cost factor; training a transductive classifier using the at least one predefined cost factor, the at least one seed document, and the unlabeled documents; using the classifier, assigning unlabeled documents having a confidence level above a predetermined threshold to a plurality of categories; using the classifier, reclassifying at least some of the classified documents into the plurality of categories; and outputting identifiers of the classified documents to at least one of a user, another system, and another process.
According to another embodiment of the invention, a method of document separation includes: receiving labeled data; receiving a group of unlabeled documents; adapting probabilistic classification rules using transduction, based on the labeled data and the unlabeled documents; updating the weights used for document separation according to the probabilistic classification rules; determining the locations of separations in the group of documents; outputting indicators of the determined separation locations to at least one of a user, another system, and another process; and assigning codes to the documents, the codes being associated with the indicators.
According to another embodiment of the invention, a method of document searching includes: receiving a search query; retrieving documents based on the search query; outputting the documents; receiving user-entered labels for at least some of the documents, the labels indicating the relevance of the documents to the search query; training a classifier based on the search query and the user-entered labels; performing a document classification method on the documents using the classifier in order to reclassify the documents; and outputting identifiers of at least some of the documents based on their classification.
Brief description of the drawings
Fig. 1 is a graph of the expected label as a function of the classification score, the classification score being obtained by MED discriminative learning adapted for label induction.
Fig. 2 is a schematic diagram of the iterative calculation of a series of decision functions obtained by transductive MED learning.
Fig. 3 is a schematic diagram of the iterative calculation of a series of decision functions obtained by improved transductive MED learning according to an embodiment of the invention.
Fig. 4 is a control flow diagram for classifying unlabeled data using an adjusted cost factor, according to one embodiment of the invention.
Fig. 5 is a control flow diagram for classifying unlabeled data using user-defined prior probability information, according to one embodiment of the invention.
Fig. 6 is a detailed control flow diagram for classifying unlabeled data using maximum entropy discrimination with an adjusted cost factor and prior probability information, according to one embodiment of the invention.
Fig. 7 shows a network diagram of a network architecture in which the various embodiments described herein may be implemented.
Fig. 8 is a system block diagram of a representative hardware environment associated with a user device.
Fig. 9 is a block diagram of an apparatus according to one embodiment of the present invention.
Figs. 10 through 21 are flow diagrams of classification processes performed according to various embodiments.
Fig. 22 is a control flow diagram of a method according to one embodiment of the invention for a first document classification system.
Fig. 23 is a control flow diagram of a method according to one embodiment of the invention for a second document classification system.
Figs. 24 through 29 are flow diagrams of classification processes performed according to various embodiments.
Detailed description
The following description is of the best mode presently contemplated for carrying out the invention. The description is made for the purpose of illustrating the general principles of the invention and is not meant to limit the inventive concepts described herein. Moreover, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations.
Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation, including meanings implied from the specification, meanings understood by those skilled in the art, and meanings as defined in dictionaries, treatises, and the like.
Text classification
The benefits of and demand for text data classification are very large, and a variety of classification techniques are already in use. Classification techniques for text data are discussed below.
To increase their usefulness and intelligence, machines such as computers are asked to classify (or recognize) objects from an ever-expanding range. For example, computers may use optical character recognition to classify handwritten or scanned digits and letters, pattern recognition to classify images such as faces, fingerprints, or fighter planes, or speech recognition to classify sounds, speech, and so on.
Machines are also asked to classify textual information objects, such as text computer files or documents. The applications of text classification are diverse and important. For example, text classification can be used to organize textual information objects into a hierarchy of predetermined classes or categories. In this way, finding (or locating) textual information objects related to a particular subject is simplified. Text classification can be used to route appropriate textual information objects to appropriate people or places. In this way, an information service can route textual information objects covering various subjects (e.g., business, sports, the stock market, football, a particular company, a particular football team) to people having different interests. Text classification can also be used to filter textual information objects so that a person is not subjected to unwanted textual content (such as unwanted and unsolicited e-mail, also referred to as spam or "junk mail"). As can be appreciated from these examples, text classification has a variety of exciting and important applications.
Rule-based classification
In some instances it is necessary to classify document content with absolute certainty based on certain accepted logic. A rule-based system can be used to effect this type of classification. Basically, rule-based systems use production rules of the form:
IF condition, THEN fact.
The condition may include whether the textual information contains certain words or phrases, has a certain syntax, or has certain attributes. For example, if the text content has the word "closing", the phrase "Nasdaq", and a numeral, it is classified as "stock market" text (a sketch of such a rule in code follows).
Over the last decade or so, other types of classifiers have increasingly been used. Although such classifiers do not use static, predefined logic as rule-based classifiers do, they have outperformed rule-based classifiers in many applications. Such classifiers typically include a learning element and a performance element, and include neural networks, Bayesian networks, and support vector machines. Although each of these classifiers is known, a brief introduction to each is provided below for the reader's convenience.
Classifiers having learning and performance elements
As mentioned at the end of the previous section, classifiers having learning and performance elements outperform rule-based classifiers in many applications. To reiterate, such classifiers may include neural networks, Bayesian networks, and support vector machines.
Neural networks
A neural network is basically a multilayered, hierarchical arrangement of identical processing elements, also referred to as neurons. Each neuron can have one or more inputs but only one output. Each input to a neuron is weighted by a coefficient. The output of a neuron is typically a function of the sum of its weighted inputs and a bias. This function, also referred to as an activation function, is typically a sigmoid function; that is, the activation function may be S-shaped and monotonically increasing, and asymptotically approach fixed values (e.g., +1, 0, -1) as its input(s) respectively approach positive or negative infinity. The sigmoid function and the individual neuron weights and biases determine the response or "excitability" of the neuron to input signals.
In the hierarchical arrangement of neurons, the output of a neuron in one layer may be distributed as an input to one or more neurons in a next layer. A typical neural network may include an input layer and two (2) distinct layers: an intermediate neuron layer and an output neuron layer. Note that the nodes of the input layer are not neurons; rather, the nodes of the input layer have only one input each and essentially provide the input, unprocessed, to the inputs of the next layer. If, for example, the neural network were to be used to recognize a numeric digit character in a 20 x 15 pixel array, the input layer could have 300 neurons (i.e., one for each pixel of the input) and the output array could have 10 neurons (i.e., one for each of the ten digits).
The use of a neural network generally involves two (2) successive steps. First, the neural network is initialized and trained on known inputs having known output values (or classifications). Once the neural network is trained, it can be used to classify unknown inputs. The network may be initialized by setting the weights and biases of the neurons to random values, typically generated from a Gaussian distribution. The neural network is then trained using a succession of inputs having known outputs (or classes). As the training inputs are fed to the neural network, the neuron weights and biases are adjusted (e.g., in accordance with the known back-propagation technique) such that the output of the neural network for each individual training pattern approaches or matches the known output. Basically, a gradient descent in weight space is used to minimize the output error. In this way, learning using successive training inputs converges towards a locally optimal solution for the weights and biases; that is, the weights and biases are adjusted to minimize an error.
In practice, the system is usually not trained to the point where it converges to the optimal solution. Otherwise the system would be "over-trained", becoming too specialized to the training data and thereby possibly poor at classifying inputs that differ somewhat from those in the training set. Thus, at various times during its training, the system is tested on a set of validation data. Training is halted when the system's performance on the validation set no longer improves.
Once training is complete, the neural network can be used to classify unknown inputs in accordance with the weights and biases determined during training. If the neural network can classify the unknown input with confidence, one of the outputs of the neurons in the output layer will be far higher than the others.
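A minimal sketch of such a network in Python follows, assuming the 300-input/10-output digit example above with a hypothetical 30-neuron hidden layer; the learning rate and initialization scale are likewise assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# 300 input nodes (the 20 x 15 pixel array), a hypothetical 30-neuron
# hidden layer, and 10 output neurons (one per digit).
W1, b1 = rng.normal(0.0, 0.1, (30, 300)), np.zeros(30)  # Gaussian initialization
W2, b2 = rng.normal(0.0, 0.1, (10, 30)), np.zeros(10)

def forward(x):
    h = sigmoid(W1 @ x + b1)          # hidden layer: weighted inputs plus bias
    return h, sigmoid(W2 @ h + b2)    # output layer

def train_step(x, target, lr=0.5):
    # One back-propagation step: gradient descent in weight space on the
    # squared output error (a sketch; real training iterates over many
    # examples and stops based on a validation set, as described above).
    global W1, b1, W2, b2
    h, o = forward(x)
    d_o = (o - target) * o * (1 - o)  # output-layer error signal
    d_h = (W2.T @ d_o) * h * (1 - h)  # back-propagated hidden-layer error
    W2 -= lr * np.outer(d_o, h); b2 -= lr * d_o
    W1 -= lr * np.outer(d_h, x); b1 -= lr * d_h

train_step(rng.random(300), np.eye(10)[3])  # one step on a dummy "digit 3"
```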
Bayesian networks
Generally, Bayesian networks use hypotheses as intermediaries between data (e.g., input feature vectors) and predictions (e.g., classifications). The probability of each hypothesis given the data ("P(hypothesis | data)") may be estimated, and a prediction is made from the hypotheses using the posterior probabilities of the hypotheses to weight the individual predictions of each hypothesis. Given data D, the probability of a prediction X can be expressed as:
P(X | D) = Σ_i P(X | H_i) · P(H_i | D)
where H_i is the i-th hypothesis. The maximum likelihood hypothesis H_i, i.e., the one maximizing the probability of H_i given D (P(H_i | D)), is referred to as the maximum a posteriori hypothesis (or "H_MAP"), and permits the approximation:
P(X | D) ≈ P(X | H_MAP)
Using Bayes' rule, the probability of hypothesis H_i given data D can be expressed as:
P(H_i | D) = P(D | H_i) · P(H_i) / P(D)
The probability of the data D remains constant. Therefore, to find H_MAP, the numerator must be maximized. The first term of the numerator represents the probability that the data would be observed given hypothesis i. The second term represents the prior probability assigned to hypothesis i.
A Bayesian network includes variables and directed edges between the variables, thereby defining a directed acyclic graph (or "DAG"). Each variable can assume any of a finite number of mutually exclusive states. For each variable A with parent variables B1, ..., Bn, there is an attached conditional probability table P(A | B1, ..., Bn). The structure of the Bayesian network encodes the assumption that, given its parent variables, each variable is conditionally independent of its non-descendant variables.
Assuming the structure of the Bayesian network is known and the variables are observable, only the set of conditional probability tables needs to be learned. These tables can be estimated directly using statistics from a set of learning examples. If the structure is known but some variables are hidden, learning is analogous to the neural network learning discussed above.
An example of a simple Bayesian network follows. A variable "MML" may represent "the moisture of my lawn" and may have the states "wet" and "dry". The MML variable may have the parent variables "rain" and "my sprinkler on", each having "yes" and "no" states. Another variable, "MNL", may represent "the moisture of my neighbor's lawn" and may have the states "wet" and "dry". The MNL variable may share the "rain" parent variable. In this example, a prediction may be whether my lawn is "wet" or "dry". The prediction may depend on the hypotheses (i) that if it rains, my lawn will be wet with probability (x1), and (ii) that if my sprinkler is on, my lawn will be wet with probability (x2). The probability that it has rained or that my sprinkler was on may depend on other variables. For example, if my neighbor's lawn is wet and they do not have a sprinkler, it is likely to have rained.
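For illustration, the lawn example can be written directly as conditional probability tables; every numeric value below is an arbitrary assumption for the sketch, not taken from the text:

```python
# Hypothetical CPTs for the lawn network: Rain -> {MML, MNL}, Sprinkler -> MML.
p_rain = 0.2
p_sprinkler = 0.3
p_mml_wet = {  # P(MML = wet | rain, sprinkler)
    (True, True): 0.99,
    (True, False): 0.9,    # x1 ~ 0.9: wet if it rained
    (False, True): 0.8,    # x2 ~ 0.8: wet if the sprinkler was on
    (False, False): 0.05,
}

# Marginal prediction P(MML = wet), summing over the parent states.
p_wet = sum(
    (p_rain if r else 1 - p_rain) * (p_sprinkler if s else 1 - p_sprinkler)
    * p_mml_wet[(r, s)]
    for r in (True, False) for s in (True, False)
)
print(round(p_wet, 3))
```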
As in the neural network example discussed above, the conditional probability tables in Bayesian networks may be trained. One advantage is that the learning process may be shortened by allowing prior knowledge to be provided. Unfortunately, the prior probabilities of the conditional probabilities are usually unknown, in which case a uniform prior probability is used.
One embodiment of the present invention can perform at least one (1) of two (2) basic functions, namely generating parameters for a classifier, and classifying objects, such as textual information objects.
Basically, parameters are generated for a classifier based on a set of training examples. A set of feature vectors may be generated from the set of training examples. The features of the set of feature vectors may be reduced. The generated parameters may include a defined monotonic (e.g., sigmoid) function and a weight vector. The weight vector may be determined by means of SVM training (or by another known technique). The monotonic (e.g., sigmoid) function may be determined by an optimization method.
The text classifier includes a weight vector and a defined monotonic (e.g., sigmoid) function. Basically, the output of the text classifier of the present invention may be expressed as:
O_c = 1 / (1 + e^(A·(w_c · x) + B))    (2)
where:
O_c = the classification output for category c;
w_c = the weight vector parameter associated with category c;
x = a (reduced) feature vector based on the unknown textual information object; and
A and B are adjustable parameters of the monotonic (e.g., sigmoid) function.
Computing the output via expression (2) is faster than computing it via expression (1).
Depending on the form of the object to be classified, the classifier may (i) convert the textual information object into a feature vector, and (ii) reduce the feature vector to a reduced feature vector having fewer elements.
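A minimal sketch of this output computation, assuming the sigmoid form of expression (2) as reconstructed above (the original formula is an image, so the exact sign convention of A and B is an assumption):

```python
import numpy as np

def classifier_output(w_c: np.ndarray, x: np.ndarray, A: float, B: float) -> float:
    # Monotonic (sigmoid) output O_c for category c, cf. expression (2).
    return 1.0 / (1.0 + np.exp(A * float(w_c @ x) + B))

# Toy usage with arbitrary illustrative values; w_c would come from SVM
# training, and A, B from fitting the sigmoid to the classifier's outputs.
w_c = np.array([0.7, -0.2, 0.1])
x = np.array([1.0, 0.0, 2.0])   # reduced feature vector
print(classifier_output(w_c, x, A=-2.0, B=0.0))
```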
Transductive machine learning
Commercially, the automatic classification systems currently used in the prior art are either rule-based or utilize inductive machine learning, i.e., learning that uses manually labeled training examples. Compared to transductive methods, both approaches typically require a substantial amount of manual setup work. The solutions provided by rule-based systems or inductive methods are static solutions that cannot adapt to drifting classification concepts without manual work.
Inductive machine learning is used to ascribe properties or relations to types based on tokens (i.e., based on one or a small number of observations or experiences), or to formulate rules based on limited observations of recurring patterns. Inductive machine learning involves reasoning from observed training cases to establish general rules, which are then applied to the test cases.
In contrast, the preferred embodiments use transductive machine learning methods. Transductive machine learning is a powerful method that avoids these drawbacks.
Transductive machine learning can learn from a very small set of labeled training examples, automatically adapt to drifting classification concepts, and automatically correct the labeled training examples. These advantages make transductive machine learning an interesting and valuable method suited to a variety of business applications.
Transduction learns patterns in data. By learning not only from labeled data but also from unlabeled data, transduction extends the concept of inductive learning. This enables transduction to learn patterns that are not, or are only partially, captured in the labeled data. Accordingly, compared to rule-based systems or systems based on inductive learning, transduction can adapt to dynamically changing environments. This capability allows transduction to be used for document searching, data cleanup, addressing drifting classification concepts, and so forth.
The description below illustrates embodiments of transductive classification utilizing support vector machine (SVM) classification and the maximum entropy discrimination (MED) framework.
Support vector machines
The support vector machine (SVM) is a method used for text classification that handles the problem of a large number of possible solutions, and the generalization problem that results, by limiting the possible solutions using concepts from regularization theory. For example, a binary SVM classifier chooses as its solution, from among all hyperplanes that accurately separate the training data, the hyperplane that maximizes the margin. Maximum margin regularization addresses the learning problem mentioned above of selecting a suitable balance between generalization and memorization: the constraint that the training data be classified accurately memorizes the data, while the regularization ensures suitable generalization. Inductive classification learns from training examples with known labels, i.e., the class membership of each training example is known. Whereas inductive classification learns from known labels, transductive classification determines the classification rule from both labeled and unlabeled data. An example of transductive SVM classification is shown in Table 1.
Table 1: The principle of transductive SVM classification [table not reproduced]
Table 1 shows the principle of transductive classification using support vector machines. The solution is given by the hyperplane that produces the maximum margin over all possible label assignments of the unlabeled data. The number of possible label assignments grows exponentially with the number of unlabeled data points, so for a practically usable method the algorithm of Table 1 must be approximated. An example of such an approximation is described in T. Joachims, "Transductive inference for text classification using support vector machines," Technical Report, Universitaet Dortmund, LS VIII, 1999 (Joachims).
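Since Table 1 itself is not reproduced in this text, the following is a sketch of the hard-margin transductive SVM program it describes, in notation standard for Joachims' formulation (the notation, not the principle, is an assumption here): jointly choose the labels of the unlabeled points and the hyperplane so as to maximize the margin,

```latex
\min_{\mathbf{w},\,b,\;y_{l+1},\dots,y_{l+u}\in\{\pm1\}} \; \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\qquad \text{s.t.} \qquad y_t\,(\mathbf{w}\cdot X_t + b) \ge 1 \quad \forall\, t=1,\dots,l+u,
```

where the labels y_1, ..., y_l of the l labeled points are fixed and the labels of the u unlabeled points are optimized over.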
For the uniform distribution over label assignments expressed in Table 1, an unlabeled data point has a probability of 1/2 of being a positive example of the class and a probability of 1/2 of being a negative example; i.e., the two possible label assignments y = +1 (positive example) and y = -1 (negative example) have even odds, and the resulting expected label is 0. An expected label of 0 can be obtained from a fixed class prior probability equal to 1/2, or from a class prior probability that is a random variable with a uniform prior distribution (i.e., an unknown class prior probability). Accordingly, in applications where the class prior probability is known and not equal to 1/2, the algorithm could be improved by incorporating this additional information. For example, instead of the uniform distribution over label assignments used in Table 1, certain label assignments could be preferred over others according to the class prior probability. However, it is difficult to trade off solutions having higher-scoring label assignments but smaller margins against solutions having lower-scoring label assignments but larger margins: label assignment probabilities and margins are on different scales.
Maximum entropy discrimination
Another classification method, maximum entropy discrimination (MED) (see, e.g., T. Jebara, Machine Learning: Discriminative and Generative, Kluwer Academic Publishers) (Jebara), does not run into the problem associated with transductive SVMs, because the decision function regularization term and the label assignment regularization term are both derived from prior probability distributions over solutions, and are therefore on the same probability scale. Thus, when the class prior, and thereby the label prior, is known, transductive MED classification is superior to transductive SVM classification, since it allows prior label knowledge to be incorporated in a principled fashion.
Inductive MED classification assumes a prior distribution over the decision function parameters, a prior distribution over the bias term, and a prior distribution over the margins. It selects, as the final distribution over these parameters, the distribution that is closest to the prior distributions and yields an expected decision function that classifies the data points accurately.
Formally, given for example a linear classifier, the problem is stated as follows: find the distribution p(Θ) over the hyperplane parameters, the distribution p(b) over the bias, and the distributions p(γ) over the data-point classification margins, such that the joint probability distribution has minimum Kullback-Leibler divergence KL to the assigned joint prior distribution p_0, i.e.,
min_p KL( p(Θ, b, γ) ‖ p_0(Θ, b, γ) )
subject to the constraints
∫ p(Θ, b, γ) [ y_t (Θ · X_t + b) − γ_t ] dΘ db dγ ≥ 0 for all t,
where Θ · X_t is the dot product between the weight vector of the separating hyperplane and the feature vector of the t-th data point. Since the labels y_t are known and fixed, no prior distribution over the binary labels is assigned. A straightforward way to generalize inductive MED classification to transductive MED classification is therefore to treat the binary labels to be assigned as parameters with a prior distribution restricted to the possible label assignments. An example of transductive MED is shown in Table 2.
Table 2: Transductive MED classification [table not reproduced]
For labeled data, the label prior distribution is a δ function, which effectively fixes the label to +1 or -1. For unlabeled data, a label prior probability p_0(y) is assumed, assigning to each unlabeled data point a probability p_0(y) of a positive label y = +1 and a probability 1 − p_0(y) of a negative label y = -1. Assuming a non-informative label prior (p_0(y) = 1/2) produces a transductive MED classification similar to the transductive SVM classification described above.
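Putting the pieces together, a sketch of the transductive MED program of Table 2 (a reconstruction under the definitions above, since the table itself is not reproduced) extends the KL minimization to a joint distribution that includes the binary labels:

```latex
\min_{p}\; \mathrm{KL}\big(p(\Theta, b, \gamma, y)\,\big\|\,p_0(\Theta, b, \gamma, y)\big)
\quad \text{s.t.} \quad
\sum_{y}\int p(\Theta, b, \gamma, y)\,\big[\,y_t\,(\Theta\cdot X_t + b) - \gamma_t\,\big]\;
d\Theta\, db\, d\gamma \;\ge\; 0 \quad \forall\, t,
```

where p_0 factorizes over the parameters and, for each unlabeled point, assigns prior probability p_0(y) to y = +1 and 1 − p_0(y) to y = −1.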
As in the case of transductive SVM classification, a practically applicable implementation of the above MED algorithm must approximate the search over all possible label assignments. Such a method is described in T. Jaakkola, M. Meila, and T. Jebara, "Maximum entropy discrimination," Technical Report AITR-1668, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 1999 (Jaakkola); it selects an approximation that decomposes the procedure into two steps, similar to an expectation maximization (EM) formulation. In this formulation, two problems must be solved. The first step, corresponding to the M step of the EM algorithm, is similar to margin maximization when all data points are classified accurately according to the current best guess of the label distribution. The second step, corresponding to the E step, uses the classification results determined in the M step and estimates new values for the class membership of each example. We call this second step label induction. The basic outline is shown in Table 2.
The particular implementation of the Jaakkola method cited herein assumes a Gaussian with zero mean and unit variance for the hyperplane parameters, a Gaussian with zero mean and variance σ² for the bias parameter, a margin prior of the form exp[−c(1 − γ)], where γ is the margin of a data point and c is the cost factor, and a binary label prior probability p_0(y) for the unlabeled data as described above. For reasons of simplicity and without loss of generality, the Jaakkola transductive classification algorithm discussed below, incorporated herein by reference, assumes a label prior probability of 1/2.
For a fixed probability distribution over the hyperplane parameters, the label induction step determines the label probability distribution. Using the margin and label priors given above yields the objective function J(λ) of the label induction step, formula 3 (see Table 2; the formula is not reproduced here, but a reconstruction is sketched below), in which λ_t is the Lagrange multiplier of the t-th training example, s_t is its classification score determined in the preceding M step, and c is the cost factor. The first two terms of the sum over training examples derive from the margin prior distribution, and the third term is given by the label prior distribution. The Lagrange multipliers are determined by maximizing J(λ), and these in turn determine the label probability distribution of the unlabeled data. As can be seen from formula 3, each data point contributes to the objective function independently, so each Lagrange multiplier is determined independently of the other Lagrange multipliers. For example, to maximize the contribution to J(λ) of an unlabeled data point whose classification score |s_t| has a large absolute value, a small Lagrange multiplier λ_t is required, whereas an unlabeled data point with a small value of |s_t| requires a large Lagrange multiplier to maximize its contribution to J(λ). Furthermore, the expected label <y> of an unlabeled data point, expressed as a function of its classification score s and Lagrange multiplier λ, is:
<y> = tanh(λ s)    (4)
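A plausible reconstruction of formula 3, following the Jaakkola et al. MED formulation cited above (the exact form is an assumption, since the original equation is an image), is the decoupled per-point objective

```latex
J(\lambda) \;=\; \sum_{t}\Big[\,\lambda_t \;+\; \log\big(1 - \lambda_t/c\big)\;-\;\log\cosh\big(\lambda_t\, s_t\big)\,\Big],
\qquad 0 \le \lambda_t < c,
```

where the first two terms come from the margin prior exp[−c(1 − γ)] and the last term is the log partition function of the label prior with p_0(y) = 1/2; differentiating that partition function recovers the expected label <y> = tanh(λ s) of formula (4).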
Fig. 1 shows the expected label <y> as a function of the classification score s, for cost factors c = 5 and c = 1.5. The Lagrange multipliers that produce Fig. 1 were determined by solving formula 3 with cost factors c = 5 and c = 1.5. As Fig. 1 shows, unlabeled data points outside the margin, i.e., |s| > 1, have expected labels <y> close to 0; data points close to the margin, i.e., |s| ≈ 1, produce the highest absolute expected label values; and data points close to the hyperplane, i.e., |s| < ε, produce |<y>| < ε. The reason for the non-intuitive label assignment <y> → 0 as |s| → ∞ is rooted in the discriminative approach, which attempts to stay close to the prior distribution as long as the classification constraints are met. It is not an artifact of the chosen approximation: an algorithm that exhaustively searches all possible label assignments by the known method of Table 2, and therefore is guaranteed to find the globally optimal solution, likewise assigns expected labels close or equal to zero to the unlabeled data outside the margin. To reiterate, this is exactly what is desired from the discriminative point of view: data points outside the margin are unimportant for separating the examples, so the label probability distributions of all these data points are returned to their prior distribution.
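As a numeric illustration of this behavior, the multiplier and expected label can be computed for a range of scores using the reconstructed per-point objective above (so the same assumption about formula 3 applies):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def expected_label(s: float, c: float) -> float:
    # Maximize the (reconstructed) per-point objective
    # J(lam) = lam + log(1 - lam/c) - log(cosh(lam*s)), 0 <= lam < c,
    # then return <y> = tanh(lam*s), cf. formula (4).
    obj = lambda lam: -(lam + np.log(1 - lam / c) - np.log(np.cosh(lam * s)))
    lam = minimize_scalar(obj, bounds=(0.0, c - 1e-9), method="bounded").x
    return float(np.tanh(lam * s))

for s in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"s={s:4.1f}  <y>={expected_label(s, c=5.0):+.3f}")
# Intermediate scores produce the largest |<y>|; scores far outside the
# margin (|s| >> 1) fall back towards the prior, <y> -> 0.
```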
The M step of the Jaakkola transductive classification algorithm, incorporated herein by reference, determines the probability distributions over the hyperplane parameters, the bias term, and the data-point margins that are closest to the respective prior distributions, under the constraints (formula 5)
s_t <y_t> − <γ_t> ≥ 0,
where s_t is the classification score of the t-th data point, <y_t> is its expected label, and <γ_t> is its expected margin. For labeled data, the expected label is fixed, being <y> = +1 or <y> = −1. The expected label of an unlabeled data point lies within the interval (−1, +1) and is estimated in the label induction step. According to formula 5, since the classification score is divided by the expected label, unlabeled data must meet a more stringent classification constraint than labeled data. Furthermore, given the relationship of the expected label as a function of the classification score (see Fig. 1), unlabeled data close to the separating hyperplane have the most stringent classification constraints, since their scores and the absolute values of their expected labels |<y_t>| are small. Given the prior distributions described above, the complete objective function of the M step is formula 6 (not reproduced here; a reconstruction is sketched below). Its first term is obtained from the Gaussian prior over the hyperplane parameters, its second term is the margin prior regularization term, and its last term is the bias prior regularization term, obtained from a Gaussian prior with zero mean and variance σ². The prior distribution of the bias term can be understood as a prior distribution over the class prior probability. Accordingly, the regularization term of the bias prior distribution constrains the relative weights of the positive and negative examples. Referring to formula 6, the effect of the bias term is minimized by preventing the collective pull of the positive examples on the hyperplane from being unequal to the collective pull of the negative examples. Owing to the bias prior, the collective constraint on the Lagrange multipliers is weighted by the expected labels of the data points, so that unlabeled data are constrained less than labeled data. Unlabeled data thus have a stronger ability than labeled data to influence the final solution.
In summary, in the M step of the transductive classification algorithm of Jaakkola (incorporated herein), unlabeled data must satisfy tighter classification constraints than labeled data, while their cumulative weight in the solution is constrained less than that of labeled data. In addition, an unlabeled data point with an expected label close to zero, i.e. a point within the margin of the current M step, has the greatest influence on the solution. The net effect of this formulation of the E and M steps can be illustrated by applying the algorithm to the data set shown in Fig. 2. The data set contains two labeled samples, a negative sample (x) located at position x=-1 and a positive sample (+) at +1, together with six unlabeled samples (o) along the x-axis between -1 and +1. The cross (x) represents the labeled negative sample, the plus sign (+) represents the labeled positive sample, and the circles (o) represent unlabeled data. The different curves represent the separating hyperplanes determined by different iterations of the M step. The final solution, determined by the transductive MED classifier of Jaakkola (incorporated herein), misclassifies the positive labeled training sample.

Fig. 2 shows successive iterations of the M step. In the first iteration of the M step, the unlabeled data are not considered, and the separating hyperplane is located at x=0. One unlabeled data point with a negative x value lies closer to this hyperplane than any other unlabeled data point. In the subsequent label induction step it is assigned the smallest |<y>| and, accordingly, in the next M step it has the greatest authority to push the hyperplane towards the positive sample. The particular shape of the expected label <y> as a function of the classification score, determined by the selected cost factor (see Fig. 1), combined with the particular spacing of the unlabeled data points, produces a bridging effect: with each successive M step, the separating hyperplane moves ever closer to the positive sample. Intuitively, the M step suffers from a kind of myopia: the unlabeled data points closest to the current separating hyperplane are the most decisive for the final position of the plane, while data points farther away matter little. Finally, since the bias prior constrains the collective pull of the unlabeled data less than the collective pull of the labeled data, the separating hyperplane moves past the positive labeled sample, producing the final solution at the 15th iteration in Fig. 2, which misclassifies the positive labeled sample. In Fig. 2, a given bias variance and a cost factor of c=10 were used. With the same bias variance, any cost factor within the range 9.8<c<13 produces a final hyperplane that misclassifies one positive labeled sample, whereas all cost factors outside the interval 9.8<c<13 produce a separating hyperplane somewhere between the two labeled samples.
The instability of the algorithm is not limited to the example shown in Fig. 2. When the method of Jaakkola (incorporated herein) is applied to real-world data sets, including the Reuters data set well known to those skilled in the art, the same behavior occurs. The inherent instability of the method described in Table 2 is a major drawback of that embodiment and limits its versatility, although the method of Jaakkola may be implemented in certain embodiments of the present invention.
A preferred method of the present invention uses transductive classification within the framework of maximum entropy discrimination (MED). It will be readily appreciated that the various embodiments of the invention are applicable not only to classification using transduction, but equally to other MED learning problems, including, but not limited to, transductive MED regression and transductive MED learning of graphical models.
By assuming a prior probability distribution over the parameters, maximum entropy discrimination constrains and thereby reduces the space of possible solutions. The desired solution is the probability distribution that, under the constraint of describing the training data correctly, is closest to the assumed prior probability distribution, and the final solution is the expectation over all possible solutions. The prior probability distribution over the solutions maps to a regularization term; that is, choosing a specific prior distribution amounts to choosing a specific regularization.
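For orientation, this standard MED formulation can be sketched as follows. This is a schematic rendering for a linear decision function with fixed margins, following the generic MED literature rather than the exact formulas of this disclosure:

```latex
\min_{p}\; KL\big(p(\Theta)\,\|\,p_{0}(\Theta)\big)
\quad\text{s.t.}\quad
\int p(\Theta)\,\big[\,y_{t}\,(\Theta\cdot X_{t})-\gamma_{t}\,\big]\,d\Theta \;\ge\; 0
\quad\forall t,
```

with the maximum entropy solution and resulting classifier

```latex
p(\Theta)=\frac{1}{Z(\lambda)}\,p_{0}(\Theta)\,
\exp\!\Big(\textstyle\sum_{t}\lambda_{t}\,\big[\,y_{t}\,(\Theta\cdot X_{t})-\gamma_{t}\,\big]\Big),
\qquad
\hat{y}(X)=\operatorname{sign}\!\int p(\Theta)\,(\Theta\cdot X)\,d\Theta,
```

so that the negative log partition function -log Z(λ) plays the role of the regularized objective referred to above.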
The discriminative estimation performed by support vector machines is effective for learning from a small number of samples. The methods and apparatus of embodiments of the present invention share this property of support vector machines: they do not estimate more parameters than are necessary for solving the given problem, and therefore produce a sparse solution. Generative model estimation, by comparison, attempts to explain the underlying process and usually requires more statistics than discriminative estimation. On the other hand, generative models are more flexible and can therefore be applied to a wide variety of problems; moreover, generative model estimation can directly incorporate prior knowledge. By using maximum entropy discrimination, the methods and apparatus of embodiments of the present invention close the gap between purely discriminative estimation (e.g., support vector machine learning) and generative model estimation.
The method of embodiments of the invention shown in Table 3 is an improved transductive MED classification algorithm that does not exhibit the aforementioned instability of the method of Jaakkola (incorporated herein). The differences include, but are not limited to, the following: in the embodiment of the present invention, each data point has its own cost factor, proportional to the absolute value of its expected label |<y>|; in addition, after each M step, the label prior probability of each data point is updated according to an estimate of its class membership probability as a function of the distance of the data point from the decision function. The method of the embodiment of the present invention is shown in Table 3 below:
Improved transductive MED classification
Table 3
Scaling the cost factor of each data point by |<y>| alleviates the problem that the collective pull of the unlabeled data on the hyperplane is stronger than that of the labeled data, because the cost factor of unlabeled data is now smaller than the cost factor of labeled data; that is, the individual contribution of each unlabeled data point to the final solution is always smaller than the individual contribution of a labeled data point. However, if the amount of unlabeled data greatly exceeds the amount of labeled data, the unlabeled data can still influence the final solution more than the labeled data. In addition, combining the cost factor scaling with the updating of the label prior probabilities by means of the estimated class probabilities solves the bridging-effect problem described above. In the first M step, the unlabeled data have small cost factors, producing an expected label that, as a function of the classification score, is relatively flat (see Fig. 1); accordingly, to some extent all unlabeled data are allowed to keep pulling on the hyperplane, although only with small weight. Furthermore, owing to the updating of the label prior probabilities, unlabeled data far from the separating hyperplane are no longer assigned an expected label close to 0; instead, after a number of iterations, they are assigned a label close to y=+1 or y=-1, and are thereby gradually treated like labeled data.
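A minimal sketch of these two stabilizing modifications, with illustrative names not taken from the disclosure, might look as follows:

```python
import numpy as np

def scaled_cost_factors(c, expected_labels):
    # Per-point cost factor proportional to |<y>|: unlabeled points
    # (|<y>| < 1) always receive a smaller cost factor than labeled
    # points (|<y>| = 1).
    return c * np.abs(expected_labels)

def updated_label_priors(class_probs):
    # After each M step, the binary label prior of each point is set
    # from its estimated class membership probability, so points far
    # from the hyperplane drift towards y = +1 or y = -1 over the
    # iterations instead of staying at <y> close to 0.
    return np.column_stack([class_probs, 1.0 - class_probs])
```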
In a particular implementation of the method of the embodiment of the invention, a Gaussian prior with zero mean and unit variance is assumed for the decision function parameters Θ:

The prior distribution of the decision function parameters incorporates important prior knowledge about the specific classification problem at hand. Other prior distributions suitable for decision function parameters of important classes of classification problems include the multinomial distribution, the Poisson distribution, the Cauchy (Breit-Wigner) distribution, the Maxwell-Boltzmann distribution and the Bose-Einstein distribution.
The prior distribution of the decision function threshold b is given by a Gaussian distribution with mean μ_b and variance σ_b²:
The prior distribution of the classification margin γ_t of a data point is chosen as given in formula 9, where c is the cost factor. This prior distribution differs from the one used in Jaakkola (incorporated herein), whose expression is exp[-c(1-γ)]. The expression given by formula 9 is preferable to the expression used by Jaakkola (incorporated herein), because it produces a positive expected margin even when the cost factor is less than 1, whereas for c<1 the expression exp[-c(1-γ)] produces a negative expected margin.
Given these prior distributions, the corresponding partition function Z can be determined directly (see, for example, T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc.) (Cover), and the objective function is:
According to Jaakkola (incorporated herein), the objective function of the M step is given by formula 11, and the objective function of the E step is given by formula 12, where s_t is the classification score of the t-th data point, determined in the preceding M step, and p_0,t(y_t) is the binary label prior probability of the data point. For labeled data, the label prior is initialized to p_0,t(y_t)=1; for unlabeled data, the label prior is initialized to the uninformative prior p_0,t(y_t)=1/2, or to the class prior probability. The section entitled M step below describes the algorithm for solving the M-step objective function; likewise, the section entitled E step below describes the E-step algorithm.
In the Estimate Class Probability step in the fifth row of Table 3, the training data are used to determine calibration parameters for turning classification scores into class membership probabilities, i.e. the probability of the class given the score, p(c|s). Related techniques for calibrating scores into probabilities are described in J. Platt, Probabilistic outputs for support vector machines and comparison to regularized likelihood methods, pages 61-74, 2000 (Platt), and in B. Zadrozny and C. Elkan, Transforming classifier scores into accurate multi-class probability estimates, 2002 (Zadrozny).
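As one possibility, a Platt-style sigmoid calibration can be sketched as follows. This is a minimal illustration of the idea; the disclosure does not prescribe this particular fitting procedure:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(scores, labels01):
    # Fit p(c|s) = 1 / (1 + exp(a*s + b)) by maximum likelihood on
    # held-out training scores; a and b are the calibration parameters.
    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(a * scores + b))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(labels01 * np.log(p) + (1 - labels01) * np.log(1 - p))
    return minimize(nll, x0=[-1.0, 0.0], method="Nelder-Mead").x

a, b = fit_sigmoid(np.array([-2.0, -0.5, 0.3, 1.8]), np.array([0, 0, 1, 1]))
print(1.0 / (1.0 + np.exp(a * 0.7 + b)))   # class membership probability at s=0.7
```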
Referring particularly to Fig. 3, the cross (x) represents a labeled negative sample, the plus sign (+) a labeled positive sample, and the circles (o) represent unlabeled data. The different curves represent the separating hyperplanes determined in different iterations of the M step. The 20th iteration shows the final solution determined by the improved transductive MED classifier. Fig. 3 thus shows the improved transductive MED classification algorithm applied to the small data set described above. The parameters used are c=10, a given bias variance, and μ_b=0. Different values of c produce separating hyperplanes located between x≈-0.5 and x=0: for c<3.5, the hyperplane lies to the right of the unlabeled data point with x<0, and for c≥3.5, the hyperplane lies to the left of that data point.
Referring particularly to Fig. 4, a control flow is illustrated showing the method of classifying unlabeled data of an embodiment of the invention. Method 100 starts at step 102, and a data store 106 is accessed at step 104. The data store resides in a memory unit and contains labeled data, unlabeled data and at least one default cost factor. The data 106 include data points with assigned labels; the assigned label identifies whether a labeled data point is to be included in a particular category or excluded from a particular category.

Once the data have been accessed at step 104, the method of the embodiment of the invention uses the label information of the data points at step 108 to determine the label prior probabilities of the data points. Then, at step 110, the expected labels of the data points are determined from the label prior probabilities. With the expected labels calculated at step 110, together with the labeled data, the unlabeled data and the cost factor, step 112 iteratively trains a transductive MED classifier, adjusting the cost factors of the unlabeled data points. In each iteration of the calculation, the cost factors of the unlabeled data points are adjusted; in this way, the MED classifier learns from the iterative calculation. The trained classifier then accesses input data 114 at step 116, completes the step of classifying the input data at step 118, and terminates at step 120.
It will be readily appreciated that the unlabeled data of 106 and the input data 114 can be obtained from a single source. Thus, the input data/unlabeled data can be used in the iterative process of step 112, and the process is then used for the classification at step 118. Moreover, the embodiment of the invention contemplates that the input data 114 may include a feedback mechanism that supplies the input data to the stored data 106, so that the MED classifier of step 112 dynamically learns from the newly input data.
Referring particularly to Fig. 5, a control flow chart is illustrated showing another method of classifying unlabeled data of an embodiment of the invention, including user-defined prior probability information. Method 200 starts at step 202, and stored data 206 are accessed at step 204. The data 206 include labeled data, unlabeled data, a default cost factor and prior probability information provided by the user. The labeled data of 206 include data points with assigned labels; the assigned label identifies whether a labeled data point is to be included in a particular category or excluded from a particular category.

At step 208, the expected labels are calculated from the data of 206. Then, at step 210, the expected labels are used together with the labeled data, the unlabeled data and the cost factor to guide the iterative training of a transductive MED classifier. The iterative calculation of step 210 adjusts the cost factors of the unlabeled data in each iteration. The calculation continues until the classifier is properly trained.
The trained classifier then accesses the input data from input data 212 at step 214. The trained classifier can next complete the step of classifying the input data at step 216. As with the process and method described in Fig. 4, the input data and the unlabeled data can be obtained from a single source and can enter the system at 206 and 212. In this way, the input data 212 can influence the training at step 210, so that the process can change dynamically over time with continuing input data.
In both of the methods shown in Figs. 4 and 5, a monitor can determine whether the system has reached convergence. Convergence may be determined when the change of the MED hyperplane between successive iterations of the calculation falls below a default threshold. In another embodiment of the invention, convergence may be determined when the change of the determined expected labels falls below a default threshold. If convergence is reached, the iterative training process can stop.
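Such a monitor might be sketched as follows, with hypothetical names; either criterion from the preceding paragraph can be plugged in:

```python
import numpy as np

def has_converged(previous, current, threshold=1e-4):
    # `previous` and `current` may be the hyperplane parameters of two
    # successive MED iterations, or the vectors of expected labels;
    # convergence is declared once the change falls below the threshold.
    return np.linalg.norm(np.asarray(current) - np.asarray(previous)) < threshold
```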
Referring particularly to Fig. 6, a more detailed control flow chart of the iterative training process of at least one embodiment of the method of the invention is shown. Process 300 starts at step 302. At step 304, data from data store 306 are accessed; the data can include labeled data, unlabeled data, at least one default cost factor, and prior probability information. Each labeled data point of 306 includes a label identifying whether the data point is a training example of a data point to be included in a specified category, or a training example of a data point to be excluded from a specified category. The prior probability information of 306 includes probability information for the labeled data set and the unlabeled data set.

At step 308, the expected labels are determined from the prior probability information of step 306. At step 310, the cost factor of each unlabeled data point is scaled in proportion to the absolute value of the expected label of the data point. An MED classifier is then trained at step 312 by determining a decision function: according to the expected labels of the labeled and unlabeled data, and using the labeled and unlabeled data as training examples, the margin between the included training examples and the excluded training examples is maximized. At step 314, classification scores are determined using the classifier trained at step 312. At step 316, the classification scores are calibrated to class membership probabilities. At step 318, the label prior probability information is updated according to the class membership probabilities. At step 320, an MED calculation is performed to determine the label and margin probability distributions, the classification scores determined above being used in the MED calculation. As a result, new expected labels are calculated at step 322, and at step 324 the expected labels are updated using the calculation from step 322. At step 326, the method determines whether convergence has been reached. If so, the method terminates at step 328. If convergence has not been reached, another iteration of the method is completed, starting from step 310. The iterations continue until convergence is reached, thereby accomplishing the iterative training of the MED classifier. Convergence is reached when the change of the decision function between successive MED iterations falls below a preset value. In another embodiment, convergence is reached when the change of the determined expected label values falls below a default threshold.
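A runnable toy sketch of this control flow is given below. The MED training of step 312 is replaced by a weighted least-squares stand-in and the calibration of step 316 by a fixed sigmoid, purely so that the loop is self-contained; only the control flow of steps 308 to 326 mirrors the method.

```python
import numpy as np

# 1-D toy data: 2 labeled points at x = -1 and x = +1, 4 unlabeled points.
x = np.array([-1.0, 1.0, -0.6, -0.2, 0.2, 0.6])
labeled = np.array([True, True, False, False, False, False])
y_exp = np.array([-1.0, 1.0, 0.0, 0.0, 0.0, 0.0])     # step 308

c, w, b = 10.0, 0.0, 0.0
for it in range(50):
    cost = c * np.maximum(np.abs(y_exp), 1e-3)         # step 310
    # Step 312 stand-in: weighted fit of w*x + b to the expected labels.
    W = np.diag(cost)
    A = np.stack([x, np.ones_like(x)], axis=1)
    w_new, b_new = np.linalg.solve(A.T @ W @ A, A.T @ W @ y_exp)
    scores = w_new * x + b_new                         # step 314
    probs = 1.0 / (1.0 + np.exp(-4.0 * scores))        # step 316 (fixed calibration)
    y_new = np.where(labeled, y_exp, 2.0 * probs - 1.0)  # steps 318-322
    if max(abs(w_new - w), abs(b_new - b)) < 1e-6:     # step 326
        break
    w, b, y_exp = w_new, b_new, y_new                  # step 324
print(f"decision function: f(x) = {w:.3f}*x + {b:.3f} after {it + 1} iterations")
```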
Fig. 7 shows a network architecture 700 according to one embodiment. As shown, a plurality of remote networks 702 is provided, including a first remote network 704 and a second remote network 706. A gateway 707 is coupled between the remote networks 702 and a proximate network 708. In the context of the present network architecture 700, each of the networks 704, 706 may take any form, including, but not limited to, a LAN, a WAN such as the Internet, a public switched telephone network (PSTN), an internal telephone network, etc.
In use, the gateway 707 serves as an entrance point from the remote networks 702 to the proximate network 708. As such, the gateway 707 may act as a router, capable of directing a given packet of data that arrives at the gateway 707, and as a switch, furnishing the actual path into and out of the gateway 707 for a given packet.
Further included is at least one data server 714 coupled to the proximate network 708, which is accessible from the remote networks 702 via the gateway 707. It should be noted that the data server 714 may include any type of computing device/groupware. Coupled to each data server 714 is a plurality of user devices 716. Such user devices 716 may include a desktop computer, a laptop computer, a hand-held computer, a printer or any other type of logic device. It should be noted that, in one embodiment, a user device 717 may also be directly coupled to any of the networks.
A facsimile machine 720, or a series of facsimile machines 720, may be coupled to one or more of the networks 704, 706, 708. It should be noted that databases and/or additional components may be utilized with, or integrated into, any type of network element coupled to the networks 704, 706, 708. In the context of the present description, a network element may refer to any component of a network.
According to one embodiment, Fig. 8 shows a representative hardware environment associated with a user device 716 of Fig. 7. The figure illustrates a typical hardware configuration of a workstation having a central processing unit 810, such as a microprocessor, and a number of other units interconnected via a system bus 812.

The workstation shown in Fig. 8 includes a random access memory (RAM) 814, a read-only memory (ROM) 816, an I/O adapter 818 for connecting peripheral devices (such as disk storage units 820) to the bus 812, a user interface adapter 822 for connecting a keyboard 824, a mouse 826, a speaker 828, a microphone 832, and/or other user interface devices such as a touch screen and a digital camera (not shown) to the bus 812, a communication adapter 834 for connecting the workstation to a communication network 835 (e.g., a data processing network), and a display adapter 836 for connecting the bus 812 to a display device 838.
Referring particularly to Fig. 9, a device 414 of one embodiment of the invention is shown. One embodiment of the invention includes a storage device 814 for storing labeled data 416. Each labeled data point 416 includes a label indicating whether the data point is a training example of a data point to be included in a specified category, or a training example of a data point to be excluded from a specified category. The memory 814 also stores unlabeled data 418, prior probability data 420 and cost factors 422.

The processor 810 accesses the data from the memory 814 and trains a binary classifier using transductive MED calculations, so that the unlabeled data can be classified. Using the cost factors and the labeled and unlabeled data as training examples, the processor 810 performs iterative transductive calculations and adjusts the cost factors as a function of the expected label values, thereby modifying the cost factor data 422, which in turn feed back into the processor 810. The cost factors 422 therefore change with each iteration of the MED classification performed by the processor 810. Once the processor 810 has sufficiently trained an MED classifier, the processor can then direct the classifier to assign the unlabeled data to the classified data 424.
The transductive SVM and MED formulations of the prior art cause the number of potential label assignments to grow exponentially, so that approximations must be developed for practical applications. In another embodiment of the present invention, a different formulation of transductive MED classification is described that is not subject to the exponentially growing number of possible label assignments and that permits a conventional closed form solution. For a linear classifier, the problem is expressed as follows: find the hyperplane parameter distribution p(Θ), the bias distribution p(b) and the data point margin distributions p(γ) whose combined probability distribution minimizes the Kullback-Leibler divergence KL to the combined respective prior distributions p_0, i.e.,

subject to the following constraints for labeled data

and subject to the following constraints for unlabeled data

where Θ·X_t is the dot product between the weight vector of the separating hyperplane and the feature vector of the t-th data point. No label prior is needed. Labeled data are constrained to lie on the correct side of the separating hyperplane according to their known labels, whereas for unlabeled data the only requirement is that the square of their distance to the hyperplane be greater than the margin. In summary, embodiments of the invention find a separating hyperplane as a trade-off between staying close to the chosen prior probabilities, separating the labeled data correctly, and maintaining the margin of the unlabeled data. The advantage is that no prior distribution over the labels is introduced, and the problem of the exponentially growing number of potential label assignments is thereby avoided.
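Based on the verbal description above, the missing constraints (formulas 14 and 15) can be sketched in reconstructed form as follows; this is an inference from the surrounding text, not the original typography:

```latex
\int p(\Theta)\,p(b)\,p(\gamma)\,
\big[\,y_{t}\,(\Theta\cdot X_{t} + b) - \gamma_{t}\,\big]\,
d\Theta\,db\,d\gamma \;\ge\; 0
\qquad\text{(labeled data)},
```

```latex
\int p(\Theta)\,p(b)\,p(\gamma)\,
\big[\,(\Theta\cdot X_{t'} + b)^{2} - \gamma_{t'}\,\big]\,
d\Theta\,db\,d\gamma \;\ge\; 0
\qquad\text{(unlabeled data)}.
```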
In a particular implementation of this further embodiment of the invention, using the prior distributions for the hyperplane parameters, the bias and the margins given in formulas 7, 8 and 9, the following partition function is obtained:

where the index t runs over the labeled data and the index t′ runs over the unlabeled data. With the notation

and W = Σ_t λ_t γ_t U_t − 2 Σ_t′ λ_t′ γ_t′ U_t′,

formula 16 can be rewritten as follows:

After integration, the following partition function results:

That is, the final objective function is:

As in the case of the known labels discussed in the paragraph entitled M step herein, the objective function J can be solved by a similar method. The difference is that the matrix in the quadratic form of the maximization now has off-diagonal terms.
Besides classification, the methods using the maximum entropy discrimination framework of the present invention have a variety of further applications. For example, MED can be used to solve the classification of data. In general, it is applicable to any kind of discriminant function and prior distribution, as well as to regression and graphical models (T. Jebara, Machine Learning: Discriminative and Generative, Kluwer Academic Publishers) (Jebara).
The applications of embodiments of the present invention can be formulated both as pure inductive learning problems with known labels and as transductive learning problems with labeled and unlabeled training examples. In the embodiments below, the improvements of the transductive MED classification algorithm described in Table 3 apply equally to general transductive MED classification, to transductive MED regression and to transductive MED learning of graphical models. Accordingly, for the purposes of this disclosure and the appended claims, the term "classification" may include regression or graphical models.
M step
According to formula 11, the objective function of the M step is:

{λ_t | 0 ≤ λ_t ≤ c},

where the Lagrange multipliers λ_t are determined by maximizing J_M.
Ignoring the redundant constraint λ_t < c, the Lagrangian of the above problem is:
The necessary and sufficient KKT conditions for optimality are:

where F_t is:

At the optimal solution, the bias equals the expected bias <b>, which yields:

<y_t>(−F_t − <b>) + δ_t = 0    (25)

These formulas can be derived by considering two cases of the constraint δ_t λ_t = 0: the first case, λ_t = 0, and the second, 0 < λ_t < c. A third case need not be considered because, as described in S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design, 1999 (Keerthi), in the context of the SVM algorithm, the potential function in this formulation keeps λ_t ≠ c.
Some data points t may violate these conditions before the optimal solution is reached; that is, F_t ≠ −<b> while λ_t is non-zero, or F_t<y_t> < −<b><y_t> while λ_t is zero. Unfortunately, without the optimal solution λ_t, <b> cannot be calculated. A good solution to this problem, borrowed from the method of Keerthi (incorporated herein), is to construct the following three sets:

I_0 = {t: 0 < λ_t < c}    (28)

I_1 = {t: <y_t> > 0, λ_t = 0}    (29)

I_4 = {t: <y_t> < 0, λ_t = 0}    (30)

Using these sets, and the definitions below, we can bound the largest violations of the optimality conditions. Elements in I_0 are violations whenever they are not equal to −<b>; hence the minimum and maximum F_t from I_0 are candidates for the largest violation. Elements in I_1 are violations when F_t < −<b>; hence the smallest element from I_1, if it exists, is a candidate for the largest violation. Finally, elements in I_4 are violations when F_t > −<b>; the largest element from I_4 is a candidate for the largest violation. Accordingly, −<b> is bounded by the following "minimum" and "maximum" values over these sets:
Since at the optimal solution −b_up and −b_low must be equal, namely to −<b>, reducing the gap between −b_up and −b_low drives the training algorithm towards convergence. In addition, the gap can be used as a measure for determining numerical convergence. As stated before, the value b = <b> is only known once convergence is reached. The method of another embodiment differs in that only one sample is optimized at a time; accordingly, the training heuristically alternates between the samples in I_0 and all samples, every other pass.
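A sketch of this bookkeeping, with illustrative names (F, lam and y_exp are the arrays of F_t, λ_t and <y_t>), might look as follows:

```python
import numpy as np

def bias_bounds(F, lam, y_exp, c, eps=1e-9):
    # Candidate sets of formulas 28-30.
    I0 = (lam > eps) & (lam < c - eps)
    I1 = (y_exp > 0) & (lam <= eps)
    I4 = (y_exp < 0) & (lam <= eps)
    # -<b> is bracketed by these two values; the gap between b_low and
    # b_up measures how far the current multipliers are from satisfying
    # the KKT conditions, and can serve as the convergence measure.
    up_candidates = F[I0 | I1]
    low_candidates = F[I0 | I4]
    b_up = up_candidates.min() if up_candidates.size else np.inf
    b_low = low_candidates.max() if low_candidates.size else -np.inf
    return b_up, b_low
```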
E step
The objective function of the E step is given in formula 12, where s_t is the classification score of the t-th data point determined in the preceding M step. The Lagrange multipliers λ_t are determined by maximizing J_E.
Ignoring the redundant constraint λ_t < c, the Lagrangian of the above problem is:

The necessary and sufficient KKT conditions for optimality are:

Since the problem factorizes over the samples, the solution can be completed by solving the KKT conditions for the Lagrange multiplier of each sample individually.
For labeled samples, the expected label <y_t>, with P_0(y_t)=1 and P_0(−y_t)=0, simplifies the KKT conditions to:

which yields the solution for the Lagrange multiplier of a labeled sample:

For unlabeled samples, formula 35 cannot be solved in closed form; instead, the Lagrange multiplier of each unlabeled sample satisfying formula 35 must be determined by a line search.
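Such a per-sample line search might be sketched as follows, with g a hypothetical closure returning the KKT residual of formula 35 as a function of λ (assumed monotone decreasing on [0, c]):

```python
def solve_multiplier(g, c, tol=1e-8):
    # Bisection on the KKT residual g(lambda) over the interval [0, c].
    lo, hi = 0.0, c
    if g(lo) <= 0.0:            # condition already satisfied at lambda = 0
        return lo
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

# Example with a toy residual: the root of 1 - lambda in [0, 10] is 1.
print(solve_multiplier(lambda lam: 1.0 - lam, c=10.0))
```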
The following are a number of non-limiting examples, which can be implemented through the methods enumerated above and their derivations or variations, as well as through other methods known in the art. Each example includes preferred operations, together with optional operations or parameters, which can be implemented within the basic preferred methodology.
In one embodiment, as shown in Fig. 10, labeled data points are received at step 1002, each data point having at least one label indicating whether the data point is a training example of a data point to be included in a particular category, or a training example of a data point to be excluded from a particular category. In addition, unlabeled data points are received at step 1004, together with at least one default cost factor for the labeled and unlabeled data points. The data points may comprise any medium, such as text, images, sound, etc. Prior probability information for the labeled and unlabeled data points may also be received. Moreover, the labels of the included training examples may be mapped to a first numerical value, e.g. +1, and the excluded training examples may be mapped to a second numerical value, e.g. −1. Furthermore, the labeled data points, the unlabeled data points, the input data points, and the at least one default cost factor for the labeled and unlabeled data points may be stored in a computer memory.
Further, at step 1006, a transductive MED classifier is trained by iterative calculation, using the at least one cost factor and the labeled and unlabeled data points as training examples. For each iteration of the calculation, the cost factor of each unlabeled data point is adjusted as a function of an expected label value, e.g. of the absolute value of the expected label of the data point, and the label prior probability of each data point is adjusted according to an estimate of the class membership probability of the data point, thereby ensuring stability. Moreover, the transductive classifier can learn using the prior probability information of the labeled and unlabeled data, which further improves stability. The iterative steps of training the transductive classifier may be repeated until convergence of the data values is reached, for example when the change of the decision function of the transductive classifier falls below a default threshold, when the change of the determined expected label values falls below a default threshold, etc.
In addition, at step 1008, the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points and input data points. The input data points can be received before or after the classifier is trained, or not at all. Moreover, using the labeled and unlabeled data points as learning examples according to their expected labels, a decision function may be determined which, given the included and excluded training examples, minimizes the KL divergence to the prior probability distribution of the decision function parameters. In other words, the decision function may be determined by minimizing the KL divergence using a multinomial distribution of the decision function parameters.

At step 1010, the category of the classified data points, or a derivative thereof, is output to at least one of a user, another system and another process. The system may be remote or local. Examples of derivatives of the category can be, but are not limited to, the classified data points themselves, a representation or identifier of the classified data points or of their host file/document, etc.
In another embodiment, a computer system uses and executes computer executable program code. The program code includes instructions for accessing labeled data points stored in a computer memory, each of the labeled data points having at least one label indicating whether the data point is a training example of a data point to be included in a specified category, or a training example of a data point to be excluded from a specified category. In addition, the computer code includes instructions for accessing unlabeled data points from the computer memory, and instructions for accessing at least one default cost factor for the labeled and unlabeled data points from the computer memory. Prior probability information for the labeled and unlabeled data points stored in the computer memory may also be accessed. Moreover, the labels of the included training examples may be mapped to a first numerical value, e.g. +1, and the excluded training examples may be mapped to a second numerical value, e.g. −1.
Further, the program code includes instructions for training a transductive classifier by iterative calculation, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples. For each iteration of the calculation, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of the data point, e.g. of the absolute value of the expected label of the data point. Moreover, for each iteration, the prior probability information can be adjusted according to an estimate of the class membership probability of the data point. The iterative steps of training the transductive classifier may be repeated until convergence of the data values is reached, for example when the change of the decision function of the transductive classifier falls below a default threshold, when the change of the determined expected label values falls below a default threshold, etc.
In addition, the program code includes instructions for using the trained classifier to classify at least one of the unlabeled data points, the labeled data points and input data points, and instructions for outputting the category of the classified data points, or a derivative thereof, to at least one of a user, another system and another process. Moreover, using the labeled and unlabeled data points as learning examples according to their expected labels, a decision function may be determined which, given the included and excluded training examples, minimizes the KL divergence to the prior probability distribution of the decision function parameters.
In another embodiment, a data processing apparatus includes at least one memory for storing: (i) labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example of a data point to be included in a specified category, or a training example of a data point to be excluded from a specified category; (ii) unlabeled data points; and (iii) at least one default cost factor for the labeled and unlabeled data points. The memory may also store prior probability information for the labeled and unlabeled data points. Moreover, the labels of the included training examples may be mapped to a first numerical value, e.g. +1, and the excluded training examples may be mapped to a second numerical value, e.g. −1.
In addition, the data processing apparatus includes a transductive classifier trainer for iteratively training the transductive classifier by means of transductive maximum entropy discrimination (MED), using the at least one cost factor and the labeled and unlabeled data points as training examples. Furthermore, in each iterative MED calculation, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of the data point, e.g. of the absolute value of the expected label of the data point. Moreover, in each iterative MED calculation, the prior probability information can be adjusted according to an estimate of the class membership probability of the data point. The apparatus can also include a device for determining convergence of the data values, e.g. when the change of the decision function calculated by the transductive classifier falls below a default threshold, when the change of the determined expected label values falls below a default threshold, etc., and for terminating the calculation once convergence is determined.
In addition, the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points and input data points. Moreover, using the labeled and unlabeled data points as learning examples according to their expected labels, a decision function may be determined which, given the included and excluded training examples, minimizes the KL divergence to the prior probability distribution of the decision function parameters. Furthermore, the category of the classified data points, or a derivative thereof, is output to at least one of a user, another system and another process.
In another embodiment, an article of manufacture includes a computer readable program storage medium tangibly embodying one or more programs of instructions executable by a computer to perform a method of data classification. In use, labeled data points are received, each labeled data point having at least one label indicating whether the data point is a training example of a data point to be included in a specified category, or a training example of a data point to be excluded from a specified category. In addition, unlabeled data points are received, together with at least one default cost factor for the labeled and unlabeled data points. Prior probability information for the labeled and unlabeled data points may also be stored in the computer memory. Moreover, the labels of the included training examples may be mapped to a first numerical value, e.g. +1, and the excluded training examples may be mapped to a second numerical value, e.g. −1, etc.
Further, a transductive classifier is trained by iterative maximum entropy discrimination (MED) calculation, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples. In each iterative MED calculation, the cost factor of each unlabeled data point is adjusted as a function of the expected label value of the data point, e.g. of the absolute value of the expected label of the data point. Moreover, in each iterative MED calculation, the prior probability information can be adjusted according to an estimate of the class membership probability of the data point. The iterative steps of training the transductive classifier may be repeated until convergence of the data values is reached, for example when the change of the decision function of the transductive classifier falls below a default threshold, when the change of the determined expected label values falls below a default threshold, etc.
In addition, input data points are accessed from the computer memory, and the trained classifier is used to classify at least one of the unlabeled data points, the labeled data points and the input data points. Moreover, using the labeled and unlabeled data points as learning examples according to their expected labels, a decision function may be determined which, given the included and excluded training examples, minimizes the KL divergence to the prior probability distribution of the decision function parameters. Furthermore, the category of the classified data points, or a derivative thereof, is output to at least one of a user, another system and another process.
In another embodiment, a method for classifying unlabeled data in a computer-based system is provided. In use, labeled data points are received, each of the labeled data points having at least one label indicating whether the data point is a training example of a data point to be included in a specified category, or a training example of a data point to be excluded from a specified category.

In addition, labeled and unlabeled data points are received, and prior label probability information for the labeled and unlabeled data points is also received. Moreover, at least one default cost factor for the labeled and unlabeled data points is also received.
Moreover, the expected label of each labeled and unlabeled data point is determined according to the label prior probability of the data point. The following substeps are then repeated until the data values sufficiently converge:
● generating, for each unlabeled data point, an adjusted cost value proportional to the absolute value of the expected label of the data point;

● training a maximum entropy discrimination (MED) classifier by determining a decision function, using the labeled and unlabeled data points as training examples according to their expected labels, wherein, given the training samples to be included and the training samples to be excluded, the decision function minimizes the KL divergence to the prior probability distribution of the decision function parameters;

● determining the classification scores of the labeled and unlabeled data points using the trained classifier;

● calibrating the output of the trained classifier to class membership probabilities;

● updating the label prior probabilities of the unlabeled data points according to the determined class membership probabilities;

● determining the label and margin probability distributions by maximum entropy discrimination (MED), using the updated label prior probabilities and the previously determined classification scores;

● calculating new expected labels using the previously determined label probability distributions; and

● updating the expected label of each data point by interpolating between the expected labels of the previous iteration and the new expected labels.
Moreover, the classification of the input data points, or a derivative thereof, is output to at least one of a user, another system and another process.
Convergence is reached when the change of the decision function falls below a default threshold. Alternatively, convergence can also be reached when the change of the determined expected label values falls below a default threshold. Moreover, the labels of the included training examples may have an arbitrary value, e.g. +1, and the excluded training examples may have an arbitrary value, e.g. −1.
In one embodiment of the invention, a method for classifying documents is shown in Fig. 11. In use, at step 1100, at least one seed document having a known confidence level is received, and unlabeled documents and at least one default cost factor are received. The seed document and the other items can be received from a computer memory, from a user, over a network connection, etc., and can be received upon a request of a system performing the method. The at least one seed document may have a label indicating whether the document is to be included in a specified category, may contain a list of keywords, or may have any other feature that assists in classifying documents. Moreover, at step 1102, a transductive classifier is trained by iterative calculation, using the at least one default cost factor, the at least one seed document and the unlabeled documents, wherein, for each iteration of the calculation, the cost factors are adjusted as a function of expected label values. Data point label prior probabilities for the labeled and unlabeled documents can also be received, wherein, for each iteration of the calculation, the data point label prior probabilities can be adjusted according to an estimate of the class membership probabilities of the data points.
In addition, at step 1104, confidence scores are stored for the unlabeled documents after at least some of the iterations, and, at step 1106, the identifiers of the unlabeled documents with the highest confidence scores are output to at least one of a user, another system and another process. The identifier can be an electronic copy of the document itself, a portion thereof, its title, its name, a pointer to the document, etc. Moreover, the confidence scores can be stored after each iteration, in which case the identifiers of the unlabeled documents with the highest confidence scores are output after each iteration.
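Steps 1104 and 1106 might be sketched as follows, with illustrative names; `score_history` holds one list of confidence scores per stored iteration:

```python
def top_documents(score_history, doc_ids, k=5):
    # Rank the unlabeled documents by the confidence scores of the most
    # recent stored iteration and return the k best identifiers.
    latest = score_history[-1]
    ranked = sorted(zip(doc_ids, latest), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

print(top_documents([[0.2, 0.9, 0.6]], ["doc-a", "doc-b", "doc-c"], k=2))
```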
One embodiment of the present invention can learn the patterns linking an original document to the remaining documents. One area in which such pattern queries prove especially valuable is legal discovery. For example, in pre-trial legal discovery, a large number of documents must be researched for possible links to the lawsuit at hand; the ultimate purpose is to find the "smoking gun" evidence. In another example, a common task for inventors, patent examiners and patent attorneys is to assess the novelty of a technology through a search of the prior art. In particular, the task is to search all published patents and other publications and to find, within this collection, the documents likely to be relevant to examining the novelty of the particular technology.

The task of the inquiry consists in finding a document or a group of documents within a set of data. Given an original document or concept, the user may wish to find the documents relevant to that original document or concept. However, the relationship between the original document or concept and the target documents, i.e. the documents to be found, is often only properly understood after the search. By learning from labeled and unlabeled documents, concepts, etc., the present invention can learn the patterns and relationships between one or more original documents and the target documents.
In another embodiment of the present invention, a method for analyzing documents related to legal discovery is shown in Fig. 12. In use, documents related to a legal matter are received at step 1200. These documents can include electronic copies of the documents themselves, portions thereof, their titles, their names, pointers to the documents, etc. In addition, at step 1202, a document classification method is performed on the documents. Further, at step 1204, identifiers of at least some of the documents are output based on their classification. Optionally, representations of the links between the documents are also output.

The document classification method can include any type of process, such as a transductive process. For example, any of the inductive or transductive methods described above can be used. In a preferred approach, a transductive classifier is trained by iterative calculation using at least one default cost factor, at least one seed document and the documents related to the legal matter. For each iteration of the calculation, the cost factors are preferably adjusted as a function of an expected label value, and the trained classifier is used to classify the received documents. The process can also include receiving data point label prior probabilities for the labeled and unlabeled documents, wherein, for each iteration of the calculation, the data point label prior probabilities are adjusted according to an estimate of the class membership probabilities of the data points. In addition, the document classification method can include one or more support vector machine processes and maximum entropy discrimination processes.
In another embodiment, a method for analyzing prior art documents is shown in Fig. 13. In use, at step 1300, a classifier is trained based on a search query. At step 1302, a plurality of prior art documents is accessed. The prior art includes any information that was available to the public in any form before a given date; the prior art may also include information that was not available to the public in any form before a given date. The prior art documents enumerated can be documents of any type, such as publications of a Patent Office, data derived from databases, collections of prior art, portions of web pages, etc. Moreover, at step 1304, a document classification method is performed on at least some of the prior art documents using the classifier, and, at step 1306, identifiers of at least some of the prior art documents are output based on their classification. The document classification technique can include one or more processes, including a support vector machine process, a maximum entropy discrimination process, or any of the inductive or transductive methods described above. Alternatively, representations of the links between the documents can also be output.
In another embodiment, relevance scores between at least some of the prior art documents are output based on their classification. The search query can include at least a portion of a patent disclosure. Enumerated patent disclosures include invention disclosures produced by inventors summarizing their inventions, provisional patent applications, non-provisional patent applications, foreign patents or patent applications, etc. In a preferred approach, the search query includes at least a portion of a claim of a patent or patent application. In another approach, the search query includes at least a portion of the abstract of a patent or patent application. In yet another approach, the search query includes at least a portion of the summary of the invention of a patent or patent application.
Fig. 27 shows a method for matching documents to claims. At step 2700, a classifier is trained based on at least one claim of a patent or patent application; accordingly, one or more of the claims, or a portion thereof, can be used to train the classifier. At step 2702, a plurality of documents is accessed. These documents may include prior art documents, documents describing potentially infringing products, or documents describing prior uses of products. At step 2704, a document classification method is performed on at least some of the documents using the classifier. At step 2706, identifiers of at least some of the documents are output based on their classification. Relevance scores of at least some of the documents can also be output based on their classification.
One embodiment of the present invention can be used for the classification of patent applications. In the United States, for example, patents and patent applications are presently classified according to their subject matter using the US Patent Classification (USPC) system. This task is currently done manually, and is therefore expensive and time-consuming; manual classification is also subject to error. An added complexity of the task is that a patent or patent application can be assigned to multiple classes.
According to one embodiment, Fig. 28 shows a method for classifying a patent application. At step 2800, a classifier is trained based on a plurality of documents known to belong to a particular patent classification. These documents will typically be patents or patent applications (or portions thereof), but can also be documents describing the target subject matter of the particular patent classification. At step 2802, at least a portion of a patent or patent application is received; the portion can include the claims, the summary of the invention, the abstract, the specification, the title, etc. At step 2804, a document classification method is performed on the at least a portion of the patent or patent application using the classifier. At step 2806, the classification of the patent or patent application is output. Optionally, a user can manually review the classification of some or all of the patent applications.
The document classification method is preferably a yes/no classification method. In other words, if the probability of the document being in the correct category is above a threshold, the decision is yes, the document belongs to the category; if the probability of the document being in the correct category is below the threshold, the decision is no, the document does not belong to the category.
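As a trivial sketch, the yes/no decision reduces to a threshold test on the calibrated class membership probability (the threshold value here is illustrative):

```python
def yes_no(class_probability, threshold=0.5):
    # True: the document is assigned to the category; False: it is not.
    return class_probability > threshold
```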
Fig. 29 shows another method for classifying a patent application. At step 2900, a document classification method is performed on at least a portion of a patent or patent application using a classifier that has previously been trained based on at least one document related to a particular patent classification. Again, the document classification method is preferably a yes/no classification method. At step 2902, the classification of the patent or patent application is output.
In both of the methods shown in Figs. 28 and 29, the respective method can be repeated using different classifiers, each of which has previously been trained based on a plurality of documents known to belong to a different patent classification.
Formally, the classification of a patent should be based on the claims. However, it is also desirable to perform matching between any IP-related content and any other IP-related content. As an example, one method trains using patent specifications and classifies patent applications according to the claims of the patent applications. Another method trains using specifications and claims, and classifies based on the abstract. In a particularly preferred approach, whichever portion of the patent or application is used for training, content of the same type is used in classification; that is, if the system is trained on claims, classification is based on claims.
The document classification method can include any type of process, such as a transductive process, etc. For example, any of the inductive or transductive methods described above can be used. In a preferred approach, the classifier can be a transductive classifier trained by iterative calculation using at least one default cost factor, at least one seed document and the prior art documents, wherein, for each iteration of the calculation, the cost factors are adjusted as a function of an expected label value, and the trained classifier can be used to classify the prior art documents. Data point label prior probabilities for the seed documents and the prior art documents can also be received, wherein, for each iteration of the calculation, the data point label prior probabilities can be adjusted according to an estimate of the class membership probabilities of the data points. The seed document can be any document, such as a publication of a Patent Office, data derived from databases, a collection of prior art, a website, or a patent disclosure.
In one approach, depicted in Figure 14 as an embodiment of the present invention, a group of data is read in step 1401. Within this group of data, documents relevant to the user are to be discovered. In step 1402, one or more initial seed documents are labeled. The documents may be of any type, such as publications of a Patent Office, data derived from a database, a collection of prior art, websites, etc. Alternatively, a set of keywords or a user-provided document may seed the transductive process. In step 1406, a transductive classifier is trained using the labeled data together with the unlabeled data in a given set. At each label induction step of the iterative transductive process, the confidence scores determined in that step are stored. In step 1408, once training is complete, the documents that received high confidence scores in the label induction steps are displayed to the user. These high-confidence documents represent the documents relevant to the purpose of the user's query. The display may follow the chronological order of the label induction steps, starting with the initial seed documents and ending with the last group of documents found in the final label induction step.
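The sketch below illustrates the Figure 14 workflow under toy assumptions that are not part of the patent: documents are 2-D points, the "classifier" is a nearest-centroid scorer, and confidence is an inverse-distance proxy. The names (seed_docs, label_induction_step, etc.) are illustrative only.

```python
# Minimal self-training sketch of the Figure 14 loop: score the unlabeled pool,
# store the per-step confidence scores, fold the highest-confidence documents
# into the model, and repeat.
import numpy as np

def label_induction_step(centroid, unlabeled):
    # Confidence here is an assumed proxy: inverse distance to the centroid.
    dists = np.linalg.norm(unlabeled - centroid, axis=1)
    return 1.0 / (1.0 + dists)

def transductive_search(seed_docs, unlabeled, steps=3, per_step=2):
    centroid = seed_docs.mean(axis=0)
    history = []                      # confidence scores stored at each step
    remaining = list(range(len(unlabeled)))
    accepted = []
    for _ in range(steps):
        conf = label_induction_step(centroid, unlabeled[remaining])
        order = np.argsort(conf)[::-1][:per_step]       # highest-confidence docs
        picked = [remaining[i] for i in order]
        history.append([(i, float(conf[j])) for j, i in zip(order, picked)])
        accepted.extend(picked)
        remaining = [i for i in remaining if i not in picked]
        # "Re-train": fold the newly labeled documents into the centroid, as
        # transduction folds unlabeled data into the model.
        centroid = np.vstack([seed_docs, unlabeled[accepted]]).mean(axis=0)
    return history

seeds = np.array([[0.0, 0.0], [0.2, 0.1]])
pool = np.array([[0.1, 0.1], [0.3, 0.0], [2.0, 2.0],
                 [0.15, 0.2], [3.0, 1.0], [0.25, 0.05]])
for step, found in enumerate(transductive_search(seeds, pool)):
    print(f"step {step}: documents {found}")   # displayed in step order
```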
Another embodiment of the present invention relates to data cleanup and accurate classification, e.g., in combination with automated business processes. The cleanup and classification methods may include any type of process, such as a transductive process. For example, any of the transductive or inductive methods described above may be used. In a preferred method, depending on the desired cleanliness of the database, the keys entering the database are used as labels associated with confidence levels. The labels, together with their associated confidence levels and expected labels, are then used to train a transductive classifier, and the classifier corrects the labels (keys) in order to achieve more reliable management of the data in the database. For example, invoices must first be classified according to the invoicing company or individual in order to enable automatic data extraction, e.g., determining the total amount, order number, product quantities, shipping address, etc. Typically, building an automatic classification system requires training examples. However, the training examples provided by customers often contain misclassified documents or other noise, such as fax cover pages; to obtain accurate classification, these documents must be identified and removed before the automatic classification system is trained. In another embodiment, in the field of medical records, the method helps detect inconsistencies between the diagnosis reports written by doctors.
In another embodiment, it is well known that Patent Offices must continuously undergo a reclassification process in which they (1) evaluate the existing branches of their classification scheme, (2) restructure the scheme to evenly distribute overcrowded nodes, and (3) reclassify existing patents into the new structure. The transductive learning methods described here can be used by Patent Offices, and by companies to which they outsource this work, to re-evaluate their classification schemes, helping them to (1) establish new classes for a given main classification and (2) reclassify existing patents.
Transduction learns from both labeled and unlabeled data, and thereby transitions smoothly from labeled to unlabeled. One end of the spectrum is labeled data with perfect prior knowledge, e.g., the given labels are all correct without exception. The other end is unlabeled data, for which no prior knowledge is given. Data collections whose labels have been disturbed to some degree contain misclassified data and lie somewhere between the two extremes of the spectrum. The labels provided by the organization of such data can be considered correct to some extent, but not entirely. Transduction can therefore be used to clean up an existing data collection by assuming a specific degree of error within the given organization of the data, and interpreting this as uncertainty in the prior knowledge of the label assignments.
In one embodiment, a method for cleaning up data proceeds as shown in Figure 15. In use, a plurality of labeled data items is received in step 1500, and in step 1502 a subset of the data items is selected for each of a plurality of categories. In addition, in step 1504, the uncertainty of the data items in each subset is set to about zero, while in step 1506 the uncertainty of the data items not in the subsets is set to a preset value that is not about zero. Further, in step 1508, a transductive classifier is trained through iterative calculation using the uncertainties and the data items in and not in the subsets as training examples, and in step 1510 the trained classifier is used to classify each of the labeled data items into one of the categories. Moreover, in step 1512, the classification of the input data, or a derivative thereof, is output to at least one of a user, another system, and another process.
Further, the subsets may be selected at random, or may be selected and verified by a user. The labels of at least some of the data items may be changed based on their classification. Moreover, after classification, identifiers of the data items whose confidence levels fall below a predetermined threshold are output to the user. An identifier may be an electronic copy of the document itself, a portion thereof, its title, its name, a pointer to the document, etc.
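The following sketch shows one way the uncertainty setup of steps 1500-1506 could be expressed; the subset size, preset uncertainty value, and the mapping from uncertainty to training weight are all assumptions, not values from the patent.

```python
# Minimal sketch of the Figure 15 cleanup setup: a trusted subset per class is
# treated as near-certain (uncertainty ~ 0), all remaining labeled items get a
# preset non-zero uncertainty, and the uncertainties translate into per-item
# training weights for the classifier.
import random

def assign_uncertainties(items_by_class, subset_size=5, preset_uncertainty=0.3):
    uncertainties = {}
    for cls, items in items_by_class.items():
        trusted = set(random.sample(items, min(subset_size, len(items))))
        for item in items:
            uncertainties[item] = 0.0 if item in trusted else preset_uncertainty
    return uncertainties

# Example: training weights could be 1.0 - uncertainty, so trusted items anchor
# the classifier while suspect labels may be overridden during cleanup.
data = {"invoice": [f"inv_{i}" for i in range(10)],
        "fax_cover": [f"fax_{i}" for i in range(6)]}
u = assign_uncertainties(data)
weights = {item: 1.0 - unc for item, unc in u.items()}
print(sorted(weights.items())[:4])
```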
In one embodiment of the invention, as shown in Figure 16, two options for a cleanup process are presented to the user at the start, in step 1600. In step 1602, one option is fully automatic cleanup: for each concept or category, a certain number of documents is selected at random and assumed to be correctly organized. Alternatively, in step 1604, a certain number of documents may be flagged for manual inspection, to verify whether the one or more label assignments of each concept or category are accurate. In step 1606, an estimate of the degree of disturbance in the data is received. In step 1610, the transductive classifier is trained using the verified data from step 1608 (randomly selected and manually checked) and the unverified data. Once training is finished, the documents are reorganized according to their new labels. In step 1612, documents whose label assignments have a confidence level below a specific threshold are displayed to the user for manual inspection. In step 1614, documents whose label assignments have a confidence level above a specific threshold are automatically corrected according to the transductively assigned labels.
In another embodiment, a method for managing medical records is shown in Figure 17. In use, a classifier is trained based on a medical diagnosis in step 1700, and a plurality of medical records is accessed in step 1702. In addition, in step 1704, a text classification method is performed on the medical records using the classifier, and in step 1706 an identifier of at least one medical record having a low probability of being related to the medical diagnosis is output. The text classification method may include any type of process, such as a transductive process, and may include any one or more of the inductive or transductive methods described above, including a support vector machine process, a maximum entropy discrimination process, etc.
In one embodiment, the classifier may be a transductive classifier trained through iterative calculation using at least one predetermined cost factor, at least one seed document, and the medical records, wherein, for each iteration, the cost factor is adjusted as a function of an expected label value, and the trained classifier may be used to classify the medical records. A data point label prior probability for the seed document and the medical records may also be received, wherein, for each iteration, the data point label prior probability may be adjusted according to an estimate of a group membership probability of the data points.
Another embodiment of the present invention addresses dynamic, drifting classification concepts. For example, in forms processing applications, documents are classified using their layout information and/or content information, and the classification determines the document's further processing. In many applications, documents are not fixed but change over time. For example, the content and/or layout of a document may change owing to new legislation. Transductive classification adapts to these changes automatically, yielding the same or similar classification accuracy without being affected by the drifting classification concept. In contrast to rule-based systems or inductive classification methods, no manual adjustment is needed, and accuracy does not suffer from concept drift. One example of this method is invoice processing, which traditionally involves inductive learning or rule-based systems that rely on the invoice layout. With these traditional systems, if the layout changes, the system must be manually reset by labeling new training data or by defining new rules. Using transduction, however, minor variations in the invoice layout are adapted to automatically, so that manual resets become unnecessary. In another embodiment, transductive classification can be used to analyze customer complaints in order to monitor changes in the nature of those complaints. For example, a company can automatically link product changes to customer complaints.
Transduction can also be used for the classification of news articles. For example, news articles about wars and terrorist attacks, from the war in Afghanistan following the terrorist attacks of September 11, 2001, through news stories about the current situation in Iraq, can be automatically identified using transduction.
In another embodiment, a biological classification (alpha taxonomy) may change over time as evolution produces new species and other species become extinct. As the classification concept changes over time, the classification scheme of this or any other taxonomy can change dynamically.
By treating the input data to be classified as unlabeled data, transduction can recognize drifting classification concepts and thus adapt automatically to a changing classification scheme. For example, Figure 18 shows an embodiment of the present invention that uses transduction given a drifting classification concept. A group of documents D_i enters the system at time t_i, as shown in step 1802. In step 1804, a transductive classifier C_i is trained using the labeled and unlabeled data accumulated so far, and in step 1806 the documents in group D_i are classified. In a manual mode, documents determined in step 1808 to have a confidence level below a user-provided threshold are presented to the user in step 1810 for manual inspection. As shown in step 1812, in an automatic mode, a document with such a confidence level triggers the creation of a new category; the category is added to the system, and the document is then assigned to this new category. In steps 1820A-B, documents with confidence levels above the selected threshold are classified into the current categories 1 to N. Documents that were classified into the current categories before time t_i are reclassified by classifier C_i in step 1822, and in steps 1824 and 1826 all documents that are no longer classified into their previously assigned category are moved to the new category.
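The sketch below illustrates the automatic mode of Figure 18 under stated assumptions: the per-class scores, the threshold, and the class-naming scheme are all hypothetical stand-ins for the trained classifier's output.

```python
# Hedged sketch of the Figure 18 automatic mode: score a document against the
# known classes; a best score below the threshold triggers creation of a new
# class, otherwise the document joins the best-matching existing class.
def route_document(doc_score_by_class, classes, threshold=0.5):
    best_class = max(classes, key=lambda c: doc_score_by_class.get(c, 0.0))
    best_score = doc_score_by_class.get(best_class, 0.0)
    if best_score < threshold:
        new_class = f"class_{len(classes) + 1}"   # hypothetical naming scheme
        classes.append(new_class)
        return new_class
    return best_class

classes = ["class_1", "class_2"]
print(route_document({"class_1": 0.9, "class_2": 0.2}, classes))  # class_1
print(route_document({"class_1": 0.1, "class_2": 0.2}, classes))  # class_3 (new)
```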
In another embodiment, a method for adapting to variations in document content is shown in Figure 19. Document content may include, but is not limited to, image content, text content, layout, numbering, etc. Examples of variation include changes over time, changes in style (e.g., one or more documents processed by two or more people), changes in the application process, variations in layout, etc. In step 1900, at least one labeled seed document, unlabeled documents, and at least one predetermined cost factor are received. The documents may include, but are not limited to, customer complaints, invoices, forms, receipts, etc. In addition, in step 1902, a transductive classifier is trained using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents. Moreover, in step 1904, the unlabeled documents with confidence levels above a predetermined threshold are classified into a plurality of categories using the classifier, and in step 1906 at least a portion of the classified documents is reclassified into the plurality of categories using the classifier. Further, in step 1908, identifiers of the classified documents are output to at least one of a customer, another system, and another process. An identifier may be an electronic copy of the document itself, a portion thereof, its title, its name, a pointer to the document, etc. Moreover, product variations can be linked to customer complaints, etc.

In addition, unlabeled documents with confidence levels below a predetermined threshold may be moved into one or more new categories. Moreover, the transductive classifier may be trained through iterative calculation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein, for each iteration, the cost factor is adjusted as a function of an expected label value, and the trained classifier is used to classify the unlabeled documents. Moreover, data point label prior probabilities for the seed documents and the unlabeled documents may be received, wherein, for each iteration, the data point label prior probability is adjusted according to an estimate of a data point group membership probability.
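The two per-iteration adjustments described above can be sketched as follows; the exact functional forms are assumptions, not the patent's formulas (the text only says the cost factor is "a function of" the expected label value and that the prior follows the group-membership estimate).

```python
# Illustrative sketch of the two per-iteration adjustments: the unlabeled-point
# cost factor scales with the magnitude of the expected label, and the label
# prior is nudged toward the current class-membership estimate.
def adjust_cost_factor(base_cost, expected_label):
    # |E[y]| near 1 => the point is confidently labeled => full cost applies.
    return base_cost * abs(expected_label)

def adjust_label_prior(prior_pos, membership_estimate, rate=0.5):
    # Move the prior probability of the positive label toward the estimated
    # probability that the point belongs to the positive group.
    return (1 - rate) * prior_pos + rate * membership_estimate

cost, prior = 1.0, 0.5
for expected_y, membership in [(0.1, 0.55), (0.6, 0.7), (0.9, 0.85)]:
    c = adjust_cost_factor(cost, expected_y)
    prior = adjust_label_prior(prior, membership)
    print(f"cost={c:.2f}  prior={prior:.3f}")
```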
In another embodiment, a method for adapting patent classification to variations in document content is shown in Figure 20. In step 2000, at least one labeled seed document and unlabeled documents are received. The unlabeled documents may include any type of document, e.g., patent applications, legal documents, information disclosure statements, file amendments, etc. Seed documents may include patents, patent applications, etc. In step 2002, a transductive classifier is trained using the at least one seed document and the unlabeled documents, and the classifier is used to classify the unlabeled documents with confidence levels above a predetermined threshold into a plurality of existing categories. The classifier may be any type of classifier, such as a transductive classifier, and the text classification method may be any method, e.g., a support vector machine method, a maximum entropy discrimination method, etc. For example, any of the inductive or transductive methods described above may be used.
Moreover, in step 2004, the unlabeled documents with confidence levels below the predetermined threshold are classified into at least one new category using the classifier, and in step 2006 at least a portion of the classified documents is reclassified into the existing categories and the at least one new category using the classifier. Further, in step 2008, identifiers of the classified documents are output to at least one of a user, another system, and another process. Furthermore, the transductive classifier may be trained through iterative calculation using at least one predetermined cost factor, the seed documents, and the unlabeled documents, wherein, for each iteration, the cost factor is adjusted as a function of an expected label value, and the trained classifier may be used to classify the documents. Further, data point prior probabilities for the seed documents and the unlabeled documents may be received, wherein, for each iteration, the data point prior probability is adjusted according to an estimate of a data point group membership probability.
In another embodiment of the present invention, document drift in the field of document separation is addressed. One application example is the processing of mortgage documents. Loan documents, comprising a series of different document types such as loan applications, approvals, and requests, are scanned, and before further processing the different documents within the resulting series of images must be determined. The documents used are not fixed but can change over time. For example, among loan documents, the tax forms used may change over time as laws and regulations change.

Document separation solves the problem of finding document or sub-document boundaries in a series of images. Typical sources of such a series of images are digital scanners or multi-function peripherals (MFPs). As in the classification embodiments, transduction can be used for document separation to handle the drift of documents and their boundaries over time. Static separation systems, such as rule-based systems or systems based on inductive learning, cannot adapt automatically to drifting separation concepts. Whenever drift occurs, the performance of these static separation systems degrades over time. To maintain the initial level of performance, either the rules must be adjusted manually (for rule-based systems), or new documents must be hand-labeled and the system retrained (for inductive learning); either approach is time-consuming and costly. Applying transduction to document separation improves the system so that it adapts automatically to drift in the separation concept.
In one embodiment, a method for separating documents is shown in Figure 21. In step 2100, labeled data is received, and in step 2102 a group of unlabeled documents is received. The data and documents may include legal discovery documents, official notices, web data, attorney correspondence, etc. In addition, in step 2104, probabilistic classification rules are adjusted using transduction based on the labeled data and the unlabeled documents, and in step 2106 the weights used for document separation are updated according to the probabilistic classification rules. Moreover, in step 2108, the locations of separations in the group of documents are determined, and in step 2110 an indicator of the locations of the separations in the group of documents is output to at least one of a user, another system, and another process. The indicator may be an electronic copy of the document itself, a portion thereof, its title, its name, a pointer to the document, etc. Further, in step 2112, the documents are tagged with codes related to the indicator.
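A toy sketch of the boundary-finding step follows; it assumes a per-page "first page" probability is available from the probabilistic classification rules, which is an assumption standing in for the weight-update machinery of steps 2104-2106.

```python
# Toy sketch of separation (Figure 21): a page whose first-page probability
# clears the threshold opens a new document.
def find_separations(first_page_probs, threshold=0.5):
    boundaries = [0]                      # a new document always starts at page 0
    for i, p in enumerate(first_page_probs[1:], start=1):
        if p >= threshold:
            boundaries.append(i)
        # In the transductive setting the threshold/weights would be continually
        # re-adjusted from labeled and unlabeled pages; held fixed here for brevity.
    return boundaries

probs = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7]   # six scanned pages
print(find_separations(probs))            # [0, 3, 5] -> three documents
```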
Figure 22 shows an implementation of the classification method used in the present invention for document separation, and the corresponding equipment. After digital scanning, automatic document separation is used to reduce the manual work involved in separating and identifying documents. Using an inference algorithm, the document separation method is combined with classification rules to automatically separate groups of pages, applying the classification methods described here to deduce the most probable separation from all available information. As shown in Figure 22, an example of the transductive MED classification method of the present invention is used for document separation. Specifically, document pages 2200 are placed into a digital scanner 2202 or MFP and converted into a set of digital images 2204. The document pages may be pages from any type of document, such as publications of a Patent Office, data derived from a database, collections of prior art, websites, etc. In step 2206, the set of digital images is input in order to dynamically adapt the probabilistic classification rules using transduction. Step 2206 uses the group of images 2204 as unlabeled data together with labeled data 2208. In step 2210, the weights in the probability network are updated and used for automatic document separation based on the dynamically adapted classification rules. Output step 2212 is the dynamically adaptive, automatic insertion of separator images: the set of digital pages 2214 is interleaved with automatically generated separator pages 2216, which are automatically inserted into the image sequence in step 2212. In one embodiment of the invention, the software-created separator page 2216 can also indicate the type of the document that follows the separator page 2216. The system described here adapts automatically to the drifting separation concepts that documents exhibit over time, without the degradation in separation accuracy that occurs in rule-based static systems or in machine learning systems based on inductive methods. In forms processing applications, a common example of a drifting separation or classification concept is, as mentioned before, documents changing owing to new laws and regulations.
In addition, the system shown in Figure 22 can be modified into the system shown in Figure 23, in which pages 2300 placed into a digital scanner 2302 or MFP are converted into a set of digital images 2304. The group of digital images is input in step 2306 in order to dynamically adapt the probabilistic classification rules using transduction. Step 2306 uses the group of images 2304 as unlabeled data together with labeled data 2308. In step 2310, the weights in the probability network used for automatic document separation are updated according to the dynamically adapted classification rules. Instead of inserting separator page images as described with respect to Figure 22, step 2312 dynamically adapts to automatically insert separation information and tags the document images with codes. The document page images can thus be input into an image processing database 2316, and the documents can be accessed via software identifiers.
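The Figure 23 variant can be sketched as follows; the record fields (page, doc_id) are assumptions chosen for illustration, not identifiers from the patent.

```python
# Illustrative sketch of the Figure 23 variant: instead of inserting separator
# pages, each page image is tagged with a document id so downstream software
# can retrieve documents by identifier.
def tag_pages(pages, boundaries):
    """boundaries: page indices where a new document starts (e.g. from the
    separation step sketched earlier)."""
    tagged, doc_id = [], 0
    for i, page in enumerate(pages):
        if i in boundaries and i != 0:
            doc_id += 1
        tagged.append({"page": page, "doc_id": doc_id})
    return tagged

records = tag_pages(["p0", "p1", "p2", "p3"], boundaries={0, 2})
print(records)  # pages p0-p1 -> doc 0, pages p2-p3 -> doc 1
```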
Another embodiment of the invention uses transduction for face recognition. As described above, transduction has many advantages: for example, only a relatively small number of training examples is required, and unlabeled samples can be used in training. Given these advantages, transductive face recognition can be used in criminal investigation.

For example, the Department of Homeland Security must ensure that terrorists do not board commercial airliners. One part of the airport screening process could be to photograph each passenger at the airport security checkpoint and attempt to identify the person. The system can initially be trained using a small number of examples drawn from the limited photographs available of possible terrorists. Unlabeled photographs of the same terrorists in other law enforcement databases can also be used for training. Thus, the transductive trainer can not only build a face recognition system from an initially sparse data set, but can also use unlabeled samples from other sources to enhance performance. After processing the photographs collected at airport security, a transductive system can identify suspects more accurately than an inductive system.
In another embodiment, a method for face recognition is shown in Figure 24. In step 2400, at least one labeled seed image of a face is received, the seed image having a known confidence level. Each of the at least one seed images may have a label indicating whether the image falls within a specified category. In addition, in step 2400, unlabeled images are received, e.g., from a police department, a government agency, a missing children database, airport security, or any other source, together with at least one predetermined cost factor. Moreover, in step 2402, a transductive classifier is trained through iterative calculation using the at least one predetermined cost factor, the at least one seed image, and the unlabeled images, wherein, for each iteration, the cost factor is adjusted as a function of an expected label value. In step 2404, after at least several iterations, a confidence score is stored for the unlabeled images.

Further, in step 2406, the identifier of the unlabeled image with the highest confidence score is output to at least one of a user, another system, and another process. The identifier may be an electronic copy of the image itself, a portion thereof, its title, its name, a pointer to the image, etc. Moreover, a confidence score may be stored after each iteration, in which case the identifier of the unlabeled image with the highest confidence score is output after each iteration. Furthermore, data point label prior probabilities for the labeled and unlabeled images may be received, wherein, for each iteration, the data point label prior probability may be adjusted according to an estimate of a data point group membership probability. Further, a third, unlabeled image of a face, e.g., from the airport security example above, may be received; the third unlabeled image can be compared with at least some of the images with the highest confidence scores, and if it is determined that the face in the third unlabeled image is the same as the face in the seed image, the identifier of the third unlabeled image may be output.
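The comparison in steps 2404-2406 can be sketched as follows; face embeddings and a cosine-similarity confidence are assumptions standing in for the trained transductive classifier's scores.

```python
# Hedged sketch of matching a probe photograph against the gallery of
# highest-confidence images.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_probe(probe, gallery, threshold=0.9):
    """gallery: {identifier: embedding} of the highest-confidence images.
    Returns the identifier if the probe is confidently the same face."""
    best_id, best_sim = max(((i, cosine(probe, e)) for i, e in gallery.items()),
                            key=lambda t: t[1])
    return best_id if best_sim >= threshold else None

gallery = {"suspect_17": [0.9, 0.1, 0.4], "suspect_42": [0.1, 0.8, 0.2]}
print(match_probe([0.88, 0.12, 0.41], gallery))  # suspect_17
print(match_probe([0.0, 0.1, 0.9], gallery))     # None: below threshold
```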
Another embodiment of the invention enables users to improve their search results by providing feedback to a document retrieval system. For example, when performing a search on an Internet search engine (or on a patent or patent application search product, etc.), the user may obtain a large number of results matching the search query. An embodiment of the present invention enables the user to browse the results suggested by the search engine and to inform the search engine of the relevance of one or more of the returned results, e.g., "close, but not what I really want", "definitely not", etc. As the user provides feedback to the search engine, better results are presented to the user in order of priority.
In one embodiment, a method for document search is shown in Figure 25. In step 2500, a search query is received. The search query may be of any type, including a case-sensitive query, a Boolean query, an approximate-match query, a structured query, etc. In step 2502, documents are retrieved based on the search query. In addition, in step 2504, the documents are output, and in step 2506 labels entered by the user for at least some of the documents are received, the labels indicating the relevance of the documents to the search query. For example, the user may indicate whether a particular result returned for the query is relevant or irrelevant. Moreover, in step 2508, a classifier is trained based on the search query and the user-entered labels, and in step 2510 a text classification method is performed on the documents using the classifier in order to reclassify the documents. Further, in step 2512, identifiers of at least some of the documents are output based on their classification. An identifier may be an electronic copy of the document itself, a portion thereof, its title, its name, a pointer to the document, etc. The reclassified documents may also be output, with the documents having the highest confidence levels output first.
The text classification method may include any type of process, e.g., a transductive process, a support vector machine process, a maximum entropy discrimination process, etc. Any of the inductive or transductive methods described above may be used. In a preferred method, the classifier may be a transductive classifier, and the transductive classifier may be trained through iterative calculation using at least one predetermined cost factor, the search query, and the documents, wherein, for each iteration, the cost factor is adjusted as a function of an expected label value, and the trained classifier may be used to classify the documents. In addition, a data point label prior probability for the search query and the documents may be received, wherein, for each iteration, the data point label prior probability may be adjusted according to an estimate of a data point group membership probability.
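The relevance-feedback loop of Figure 25 can be sketched as follows; a simple centroid scorer stands in for the transductive classifier, and all names are illustrative.

```python
# Minimal relevance-feedback sketch: retrain a scoring direction from the
# user's relevant/irrelevant labels, then re-rank so the highest-confidence
# documents are output first.
import numpy as np

def rerank(doc_vectors, feedback):
    """feedback: {doc_index: +1 (relevant) / -1 (not relevant)} from the user."""
    pos = [doc_vectors[i] for i, y in feedback.items() if y > 0]
    neg = [doc_vectors[i] for i, y in feedback.items() if y < 0]
    direction = np.mean(pos, axis=0) - (np.mean(neg, axis=0) if neg else 0.0)
    scores = doc_vectors @ direction          # higher = more like "relevant"
    return np.argsort(scores)[::-1]           # highest-confidence documents first

docs = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.2, 0.8]])
print(rerank(docs, {0: +1, 2: -1}))           # [0 1 3 2]
```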
Another embodiment of the invention can be used to improve ICR/OCR and speech recognition. For example, many speech recognition programs and systems require the operator to repeat many words to train the system. The present invention can first monitor a user's voice for a predetermined period of time to collect "unclassified" content, e.g., by monitoring telephone conversations. As a result, when the user begins to train the recognition system, the system learns using transduction, with the monitored speech assisting in building a memory model.
In another embodiment, a method for checking the relevance of an invoice to an entity is shown in Figure 26. In step 2600, a classifier is trained based on an invoice format related to a first entity. The invoice format may refer to the physical layout of markings on the invoice, or to features of the invoice such as keywords, invoice number, customer name, etc. In addition, in step 2602, a plurality of invoices labeled as being associated with at least one of the first entity and other entities is accessed, and in step 2604 a text classification method is performed on the invoices using the classifier. For example, any of the inductive or transductive methods described above may be used as the text classification method; the text classification method may include a transductive process, a support vector machine process, a maximum entropy discrimination process, etc. Moreover, in step 2606, an identifier of at least one invoice having a high probability of being unrelated to the first entity is output.
Further, the classifier may be any type of classifier, for example a transductive classifier, and the transductive classifier may be trained through iterative calculation using at least one predetermined cost factor, at least one seed document, and the invoices, wherein, for each iteration, the cost factor is adjusted as a function of an expected label value, and the trained classifier is used to classify the invoices. Moreover, a data point label prior probability for the seed document and the invoices may be received, wherein, for each iteration, the data point label prior probability is adjusted according to an estimate of a data point group membership probability.
An advantage of the embodiments described here is the stability of the transduction algorithm. This stability is achieved by regulating the cost factors and by adjusting the label prior probabilities. For example, in one embodiment, a transductive classifier is trained through iterative classification using at least one cost factor and the labeled and unlabeled data points as training examples. For each iteration, the cost factor of the unlabeled data points is adjusted as a function of an expected label value. In addition, for each iteration, a data point prior probability is adjusted according to an estimate of a data point group membership probability.
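The following compact sketch puts the pieces together under stated assumptions: a weighted logistic model stands in for MED, and the cost-factor and prior updates use the illustrative forms sketched earlier (|E[y]| scaling and a moving average toward the group-membership estimate), which are not the patent's exact formulas.

```python
# Assumption-laden sketch of the stabilized iterative loop: train on labeled
# plus currently-guessed unlabeled points, read off expected labels, then
# adjust each unlabeled point's cost factor and the label prior per iteration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def transductive_train(X_lab, y_lab, X_unl, base_cost=1.0, iters=10, lr=0.5):
    w = np.zeros(X_lab.shape[1])
    prior = 0.5                               # prior P(y=+1) for unlabeled data
    for _ in range(iters):
        p_unl = sigmoid(X_unl @ w)
        exp_y = 2 * p_unl - 1                 # expected label E[y] in [-1, 1]
        cost_unl = base_cost * np.abs(exp_y)  # cost as a function of E[y]
        prior = 0.5 * prior + 0.5 * p_unl.mean()  # prior <- membership estimate
        y_guess = np.where(p_unl >= prior, 1.0, -1.0)
        X = np.vstack([X_lab, X_unl])
        y = np.concatenate([y_lab, y_guess])
        c = np.concatenate([np.ones(len(X_lab)), cost_unl])
        grad = ((sigmoid(X @ w) - (y + 1) / 2) * c) @ X / len(X)
        w -= lr * grad                        # weighted logistic gradient step
    return w

X_lab = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_lab = np.array([1.0, -1.0])
X_unl = np.array([[0.8, 0.1], [-0.9, 0.2], [0.7, -0.1]])
w = transductive_train(X_lab, y_lab, X_unl)
print(np.sign(X_unl @ w))                     # transductive labels for the pool
```

Because unlabeled points start with zero cost (E[y] = 0 under the initial model), they cannot destabilize the first iterations; their influence grows only as the model becomes confident about them, which is one way to read the stability claim above.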
The workstation may have resident thereon an operating system, such as a Microsoft operating system (OS), a MAC OS, or a UNIX operating system. It will be appreciated that a preferred embodiment may also be implemented on platforms and operating systems other than those mentioned. A preferred embodiment may be written using JAVA, XML, C and/or C++ or other programming languages, together with object-oriented programming methodology. Object-oriented programming (OOP), which has become increasingly used to develop complex applications, may be employed.
The applications described above use transductive learning to overcome the problem of very sparse data sets, a problem that plagues inductive face recognition systems. This aspect of transductive learning is not limited to that application, and can be used to solve other machine learning problems caused by sparse data sets.
Within the scope and spirit of the various embodiments of the invention disclosed herein, those skilled in the art can devise various modifications. Moreover, the various features of the embodiments disclosed above can be used alone or in various combinations with one another, and are not limited to the specific combinations described above. Accordingly, the scope of the claims is not limited to the described embodiments.
Claims (27)
1. A method for document classification, characterized by comprising:
receiving at least one labeled seed document, the seed document having a known confidence level;
receiving unlabeled documents;
receiving at least one predetermined cost factor;
training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein, for each iteration, the cost factor is adjusted as a function of an expected label value;
storing a confidence score for the unlabeled documents after at least some of the iterations; and
outputting an identifier of the unlabeled document with the highest confidence score to at least one of a user, another system, and another process.
2. The method according to claim 1, characterized in that each of one or more of the seed documents has a label indicating whether the seed document falls within a specified category.
3. The method according to claim 1, characterized in that a confidence score is stored after each iteration, wherein, after each iteration, an identifier of the unlabeled document with the highest confidence score is output.
4. The method according to claim 1, characterized by further comprising generating a data point label prior probability for the labeled and unlabeled documents; wherein, for each iteration, the data point label prior probability is adjusted according to an estimate of a data point group membership probability.
5. The method according to claim 1, characterized by further comprising:
receiving a third unlabeled document;
comparing the third unlabeled document with at least some of the unlabeled documents having the highest confidence scores; and
outputting an identifier of the third unlabeled document in response to determining that:
(1) the confidence level of the third unlabeled document indicates that the third unlabeled document belongs to the same category as the seed document; and
(2) the confidence level of the third unlabeled document is greater than a predefined confidence threshold.
6. In a computer-based system, a method for classifying data, characterized by comprising:
receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example of a data point included in a specified category or a training example of a data point excluded from a specified category;
receiving unlabeled data points;
receiving at least one predetermined cost factor for the labeled data points and the unlabeled data points;
training a transductive classifier using maximum entropy discrimination (MED), through iterative calculation, using the at least one cost factor and the labeled and unlabeled data points as training examples, wherein, for each iteration, the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and a data point label prior probability is adjusted according to an estimate of a data point group membership probability;
classifying at least one of the unlabeled data points, the labeled data points, and input data points using the trained classifier; and
outputting the classifications of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
7. The method according to claim 6, characterized in that the function is the absolute value of the expected label of a data point.
8. The method according to claim 6, characterized by further comprising the step of receiving prior probability information for the labeled and unlabeled data points.
9. The method according to claim 8, characterized in that the transductive classifier learns using the prior probability information of the labeled and unlabeled data.
10. The method according to claim 6, characterized by further comprising the step of determining the decision function with minimum KL divergence, using a Gaussian prior over the decision function parameters, with the given included and excluded training examples labeled according to their expected labels and the labeled and unlabeled data used as training examples.
11. The method according to claim 6, characterized by further comprising the step of determining the decision function with minimum KL divergence using a multinomial prior distribution over the decision function parameters.
12. The method according to claim 6, characterized in that the iterative step of training the transductive classifier is repeated until convergence of the data values is reached.
13. The method according to claim 12, characterized in that convergence is reached when the change in the decision function of the transductive classifier drops below a predetermined threshold.
14. The method according to claim 12, characterized in that convergence is reached when the change in the determined expected label values drops below a predetermined threshold.
15. The method according to claim 6, characterized in that the label value of the included training examples is +1 and the label value of the excluded training examples is -1.
16. The method according to claim 6, characterized in that the labels of the included examples are mapped to a first numerical value and the labels of the excluded examples are mapped to a second numerical value.
17. The method according to claim 6, characterized by further comprising:
storing the labeled data points in a computer memory;
storing the unlabeled data points in a computer memory;
storing the input data points in a computer memory; and
storing the at least one predetermined cost factor for the labeled and unlabeled data points in a computer memory.
18. A method for classifying data, characterized by comprising:
providing computer-executable program code to be used and executed in a computer system, the program code comprising instructions for:
accessing labeled data points stored in a computer memory, each of the labeled data points having at least one label indicating whether the data point is a training example of a data point included in a specified category or a training example of a data point excluded from a specified category;
accessing unlabeled data points from a computer memory;
accessing, from a computer memory, at least one predetermined cost factor for the labeled data points and the unlabeled data points;
training a maximum entropy discrimination (MED) transductive classifier through iterative calculation, using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein, for each iteration, the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and a data point prior probability is adjusted according to an estimate of a data point group membership probability;
classifying at least one of the unlabeled data points, the labeled data points, and input data points using the trained classifier; and
outputting the classifications of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
19. The method according to claim 18, characterized in that the function is the absolute value of the expected label of a data point.
20. The method according to claim 18, characterized by further comprising the step of accessing prior probability information for the labeled and unlabeled data points stored in a computer memory.
21. The method according to claim 20, characterized in that, for each iteration, the prior probability information is adjusted according to an estimate of a data point group membership probability.
22. The method according to claim 18, characterized by further comprising the step of determining the prior distribution over the decision function parameters of the decision function with minimum KL divergence, using the labeled and unlabeled data as training examples, with the given included and excluded training examples labeled according to their expected labels.
23. The method according to claim 18, characterized in that the iterative step of training the transductive classifier is repeated until convergence of the data values is reached.
24. The method according to claim 23, characterized in that convergence is reached when the change in the decision function of the transductive classifier drops below a predetermined threshold.
25. The method according to claim 23, characterized in that convergence is reached when the change in the determined expected label values drops below a predetermined threshold.
26. The method according to claim 18, characterized in that the label value of the included training examples is +1 and the label value of the excluded training examples is -1.
27. The method according to claim 18, characterized in that the labels of the included examples are mapped to a first numerical value and the labels of the excluded examples are mapped to a second numerical value.
Applications Claiming Priority (11)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US83031106P | 2006-07-12 | 2006-07-12 | |
US60/830,311 | 2006-07-12 | ||
US11/752,719 | 2007-05-23 | ||
US11/752,634 | 2007-05-23 | ||
US11/752,634 US7761391B2 (en) | 2006-07-12 | 2007-05-23 | Methods and systems for improved transductive maximum entropy discrimination classification |
US11/752,673 US7958067B2 (en) | 2006-07-12 | 2007-05-23 | Data classification methods using machine learning techniques |
US11/752,691 | 2007-05-23 | ||
US11/752,673 | 2007-05-23 | ||
US11/752,691 US20080086432A1 (en) | 2006-07-12 | 2007-05-23 | Data classification methods using machine learning techniques |
US11/752,719 US7937345B2 (en) | 2006-07-12 | 2007-05-23 | Data classification methods using machine learning techniques |
CN200780001197.9A CN101449264B (en) | 2006-07-12 | 2007-06-07 | Method and system and the data classification method of use machine learning method for data classification of transduceing |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200780001197.9A Division CN101449264B (en) | 2006-07-12 | 2007-06-07 | Method and system and the data classification method of use machine learning method for data classification of transduceing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107180264A true CN107180264A (en) | 2017-09-19 |
Family
ID=40743805
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610972541.XA Withdrawn CN107180264A (en) | 2006-07-12 | 2007-06-07 | For the transductive classification method to document and data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107180264A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11436816B2 (en) * | 2019-03-07 | 2022-09-06 | Seiko Epson Corporation | Information processing device, learning device, and storage medium storing learnt model |
Also Published As
Publication number | Publication date |
---|---|
CN101449264A (en) | 2009-06-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7937345B2 (en) | Data classification methods using machine learning techniques | |
US7761391B2 (en) | Methods and systems for improved transductive maximum entropy discrimination classification | |
US7958067B2 (en) | Data classification methods using machine learning techniques | |
WO2008008142A2 (en) | Machine learning techniques and transductive data classification | |
CN107967575B (en) | Artificial intelligence platform system for artificial intelligence insurance consultation service | |
Kanan et al. | An improved feature selection method based on ant colony optimization (ACO) evaluated on face recognition system | |
US20080086432A1 (en) | Data classification methods using machine learning techniques | |
Bazan et al. | The rough set exploration system | |
Hu et al. | Active learning with partial feedback | |
Zavvar et al. | Email spam detection using combination of particle swarm optimization and artificial neural network and support vector machine | |
de la Iglesia et al. | Developments on a multi-objective metaheuristic (MOMH) algorithm for finding interesting sets of classification rules | |
Al-Rasheed | Identification of important features and data mining classification techniques in predicting employee absenteeism at work. | |
Wu | Application of improved boosting algorithm for art image classification | |
Trivedi et al. | A modified content-based evolutionary approach to identify unsolicited emails | |
CN107180264A (en) | For the transductive classification method to document and data | |
CN101449264B (en) | Method and system and the data classification method of use machine learning method for data classification of transduceing | |
Laishram | Link prediction in dynamic weighted and directed social network using supervised learning | |
WO2002048911A1 (en) | A system and method for multi-class multi-label hierachical categorization | |
Zelenko et al. | Automatic competitor identification from public information sources | |
Kou | Stacked graphical learning | |
Siersdorfer et al. | Using restrictive classification and meta classification for junk elimination | |
Jordan et al. | Content-Based Image Retrieval Using Deep Learning | |
Liu et al. | Distribution embedding network for meta-learning with variable-length input | |
Rehill | Distilling interpretable causal trees from causal forests | |
CN111949794A (en) | Online active machine learning method for text multi-classification task |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | WW01 | Invention patent application withdrawn after publication | Application publication date: 20170919