CN1669023A - Method and apparatus for automatically determining salient features for object classification - Google Patents

Method and apparatus for automatically determining salient features for object classification

Info

Publication number
CN1669023A
Authority
CN
China
Prior art keywords
unique features
list
data object
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA02829663XA
Other languages
Chinese (zh)
Other versions
CN100378713C (en)
Inventor
D·P·卢力奇
F·G·吉拉克
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Publication of CN1669023A
Application granted
Publication of CN100378713C
Anticipated expiration
Legal status: Expired - Fee Related (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/313 - Selection or weighting of terms for indexing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Processing (AREA)

Abstract

A method and apparatus for automatically determining salient features (308) for object classification is provided. In accordance with one embodiment, one or more unique features are extracted from a first content group of objects to form a first feature list, and one or more unique features are extracted from a second anti-content group of objects to form a second feature list. A ranked list of features is then created by applying statistical differentiation between unique features of the first feature list and unique features of the second feature list. A set of salient features (308) is then identified from the resulting ranked list of features.

Description

Method and Apparatus for Automatically Determining Salient Features for Object Classification
Background of the Invention
1. Field of the Invention
The present invention relates to the field of data processing. More specifically, the present invention relates to the automatic selection of salient features of objects for use in object classification.
2. Background Information
The World Wide Web provides an important source of information, with billions of pages of information estimated to be available for online reading or download. However, to make effective use of this information, practical methods are needed for navigating this mass of data.
In the early days of Web navigation, two basic approaches to online searching were developed. In the first approach, an index database is produced from web page content gathered by automated search engines that "crawl" the Web looking for new and unique pages. The database can then be searched using a variety of query techniques, and the results are typically ranked by their similarity to the form of the query. In the second approach, web pages are grouped into a hierarchy, often presented in the form of a tree. The user then descends the hierarchy by making a series of selections, choosing at each level between two or more alternatives whose underlying subtrees represent significantly different subjects, until finally reaching a leaf node containing a page of text and/or multimedia content.
For example, Fig. 1 illustrates a typical prior-art hierarchy 102 in which a number of decision nodes (hereinafter "nodes") 130-136 are hierarchically arranged into parent and child nodes, with each node associated with a unique subject classification. For example, node 130 is the parent of nodes 131 and 132, and nodes 131 and 132 are in turn children of node 130. Because nodes 131 and 132 are children of the same node (node 130), they are siblings of each other. Other sibling pairs in subject hierarchy 102 include nodes 133 and 134, and nodes 135 and 136. As can be seen from Fig. 1, node 130 forms the first level 137 of subject hierarchy 102, nodes 131-132 form the second level 138 of subject hierarchy 102, and nodes 133-136 form the third level 139 of subject hierarchy 102. In addition, node 130 is considered the root node of subject hierarchy 102 because it is not the child of any other node.
The process of hierarchically classifying web pages faces several challenges. First, the nature of the hierarchy must be defined. This is usually done manually by experts in specialized fields, somewhat like the Dewey decimal classification used for libraries. These classifications are then given captions or labels so that users and classifiers can make appropriate decisions when navigating the hierarchy. Content presented, for example, in the form of individual electronic documents can then be placed into the various categories of the classification system using manual search methods.
In recent years attention has turned to automating each stage of this process. Systems now exist that automatically classify documents from a batch of documents. For example, some systems apply word-association relationships among documents to automatically cluster similar documents into groups. These groups can in turn be repeatedly clustered into super-groups, thereby producing a hierarchy; however, such systems require manually supplied keywords, and what they produce is a hierarchy without a systematic structure. If such a hierarchy is to be used for manual searching, the child nodes or leaf documents must be inspected by hand to identify common features so that the nodes of the hierarchy can be labeled.
Many classification systems use word lists to classify documents. Typically, salient words may be defined in advance or may be selected from the documents being processed so as to characterize the documents more accurately. Such salient word lists are generally produced using the frequency of occurrence of every word in each document of a group of documents. Words are then removed from the word list according to one or more criteria. Often, words that occur too few times within a collection of documents are discarded, because they occur too rarely to distinguish classes reliably; words that occur too frequently are also discarded, because they appear in documents of every class.
Moreover, "stop words" and stems are often removed from feature lists to better facilitate the determination of salient features. Stop words include common words of the language, such as "a", "the", "his" and "and", which are felt to carry no semantic content, while stems refer to suffixes such as "ing", "is" and "able". Unfortunately, generating stop-word lists and stem lists is a language-specific task requiring expert knowledge of the grammar, the documents and the idiom, all of which may change over time. A more flexible method of determining salient features is therefore required.
Brief Description of the Drawings
The present invention will be described by way of exemplary embodiments, but not limitation, illustrated in the accompanying drawings in which like references denote similar elements, and in which:
Fig. 1 illustrates an exemplary prior-art hierarchy comprising a number of decision nodes;
Fig. 2 (A-C) illustrates the operational flow of the salient feature determination function in accordance with one embodiment of the present invention;
Fig. 3 illustrates an example application of the salient feature determination apparatus of the present invention, in accordance with one embodiment;
Fig. 4 is a functional block diagram illustrating the classifier training service of Fig. 3, in accordance with one embodiment of the present invention; and
Fig. 5 illustrates a computer system suitable for use in determining salient features, in accordance with one embodiment of the present invention.
Detailed Description of the Invention
Various aspects of the present invention will be described below. It will be apparent to those skilled in the art, however, that the present invention may be practiced with only some or all of its aspects. For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the present invention. It will also be apparent to those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known features are omitted or simplified in order not to obscure the present invention.
Parts of the description will be presented in terms of operations performed by a processor-based device, using terms such as data, storing, selecting, determining, calculating and the like, consistent with the manner commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. As is well understood by those skilled in the art, these quantities take the form of electrical, magnetic or optical signals capable of being stored, transferred and otherwise manipulated through the mechanical and electrical components of the processor-based device; and the term processor includes microprocessors, microcontrollers, digital signal processors and the like, whether standalone, adjunct or embedded.
Various operations will be described as multiple discrete steps in turn, in a manner that is most helpful in understanding the present invention; however, the order of description should not be construed as implying that these operations are necessarily order dependent. In particular, the operations need not be performed in the order of presentation. Further, the description repeatedly uses the phrase "in one embodiment", which does not necessarily refer to the same embodiment, although it may.
In accordance with one embodiment of the present invention, one or more unique features are extracted from a first group of objects to form a first feature list, and one or more unique features are extracted from a second group of objects to form a second feature list. A ranked list of features is then created by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list. A set of salient features can then be identified from the resulting ranked list of features.
In one embodiment, the determination of salient features facilitates the efficient classification of data objects, including, but not limited to, text documents, image documents, audio sequences and video sequences, whether organized in very-large-scale hierarchical classification trees or in non-hierarchical data structures such as flat documents, and whether in proprietary or non-proprietary form. In a text document, for example, the features may take the form of words, where the term "word" is generally understood to represent a group of letters in a given language having some semantic meaning. More generally, a feature may be an N-token gram, where a token is a small element of the language; examples include N-letter grams and N-word grams in English, and N-ideogram grams in Asian languages. Similarly, in an audio sequence, tone, tempo, sustain, pitch, volume and the like may all serve as features for classifying sound, while in video sequences and still images, pixel properties such as angle and intensity level may serve as features. In accordance with one embodiment of the present invention, once a group of features has been identified from a group of, for example, electronic documents, a subset of those features can be determined to be salient with respect to the classification of a given group of data objects. The term "electronic document" is used broadly herein to describe a family of data objects, such as those described above, containing one or more constituent features. Although an electronic document may contain text, it may equally contain audio and/or video content, whether in place of or in addition to text.
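As an illustration of the word and N-token-gram features described above (not part of the patent itself), a minimal Python sketch might extract the unique features of a text document as follows; the tokenization regex and helper names are assumptions made for illustration only:

```python
import re

def word_features(text):
    """Unique 1-word grams: lower-cased alphanumeric tokens of the document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def n_token_grams(text, n):
    """Unique N-token grams, each gram represented as a tuple of N tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

doc = "The classifier ranks salient features; salient features drive classification."
print(word_features(doc))     # unique 1-word grams
print(n_token_grams(doc, 2))  # unique 2-word grams, e.g. ('salient', 'features')
```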
Once the criteria for feature selection have been determined (that is, which distinct text/audio/video attributes are to be focused upon as the determining features of the data objects), the salient feature determination process of the present invention can be carried out. At the start of the salient feature determination process, the data objects under consideration are divided into two groups. An equation expressing the odds of membership (equation (1)) is then applied to the two groups of data objects, where O(d) represents the odds that a given data object is a member of the first group of data objects, P(R|d) represents the probability that the data object is a member of the first group, and P(R'|d) represents the probability that the data object is a member of the second group:
$$O(d) = \frac{P(R \mid d)}{P(R' \mid d)} \qquad (1)$$
Because the manual grouping of the data objects does not provide the probabilities needed to compute the odds directly, equation (1) must be estimated. Accordingly, taking the logarithm of both sides of equation (1) and applying Bayes' rule gives equation (2):
$$\log O(d) = \log P(d \mid R) - \log P(d \mid R') + \log P(R) - \log P(R') \qquad (2)$$
Assuming a data object is composed of a set of features {f_i}, and letting X_i be 1 or 0 according to whether a given feature f_i is or is not present in the data object, then
$$\log O(d) = \sum_i \left[\, \log P(X_i \mid R) - \log P(X_i \mid R') \,\right] + \log P(R) - \log P(R') \qquad (3)$$
Because log P(R) and log P(R') are constants, independent of which features are selected as salient for the data objects, a new quantity g(d) can be defined:
$$g(d) = \sum_i \left[\, \log P(X_i \mid R) - \log P(X_i \mid R') \,\right] \qquad (4)$$
Letting p_i = P(X_i = 1 | R) denote the probability that a given feature f_i appears in a data object of the first group of data objects, and q_i = P(X_i = 1 | R') denote the probability that the feature f_i appears in a data object of the second group of data objects, substitution and simplification yield equation (5):
$$g(d) = \sum_i X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \sum_i \log \frac{1 - p_i}{1 - q_i} \qquad (5)$$
Because the second summation does not depend on which features actually appear in a data object, it can be dropped, leaving as the contribution of each feature the quantity of equation (6):
$$\log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \qquad (6)$$
Because the logarithm is a monotonic function, maximizing the ratio of equation (7),
$$\frac{p_i (1 - q_i)}{q_i (1 - p_i)} \qquad (7)$$
is sufficient to maximize the corresponding logarithm. In accordance with one embodiment of the present invention, equation (7) is applied to each feature in the combined feature list of the two groups of data objects to facilitate the identification of salient features. To this end, p_i is calculated as the number of data objects in the first group of data objects that contain feature f_i at least once, divided by the total number of data objects in the first group; likewise, q_i is calculated as the number of data objects in the second group of data objects that contain feature f_i at least once, divided by the total number of data objects in the second group.
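As an illustration only, a minimal Python sketch of the p_i, q_i and equation (7) computation might look as follows, assuming each data object has already been reduced to its set of unique features; the epsilon clamping is an added safeguard against division by zero and is not specified in the patent:

```python
def salience_ratio(feature, content_docs, anti_content_docs, eps=1e-6):
    """Equation (7): p_i(1 - q_i) / (q_i(1 - p_i)).

    p_i = fraction of content-group objects containing the feature at least once,
    q_i = fraction of anti-content-group objects containing the feature at least once.
    Each document is represented as a set of unique features.
    """
    p = sum(feature in doc for doc in content_docs) / len(content_docs)
    q = sum(feature in doc for doc in anti_content_docs) / len(anti_content_docs)
    # Clamp away from 0 and 1 so the ratio stays finite (added safeguard).
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return (p * (1.0 - q)) / (q * (1.0 - p))
```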
Fig. 2 (A-C) illustrates the operational flow of the salient feature determination function in accordance with one embodiment of the present invention. To begin, a first group of data objects is examined to generate a feature list composed of the unique features appearing in at least one or more of the data objects of the first group of data objects, block 210. Equation (7) is then applied to each identified unique feature to generate a ranked feature list, block 220, and at least a subset of the ranked feature list is selected as the salient features, block 230. The salient features may comprise one or more adjacent or non-adjacent sets of elements selected from the ranked list of features. In one embodiment, the top N elements of the ranked list of features are selected as salient, where N may vary according to the needs of the system. In another embodiment, the last M elements of the ranked list of features are selected as salient, where M likewise varies according to the needs of the system.
In accordance with one embodiment of the present invention, as the feature list is generated (block 210), the total number of data objects contained in each group of data objects is determined, block 212, and, for at least each unique feature identified in the first group of data objects, the total number of data objects containing that feature is also determined, block 214. In addition, the list of unique features may be filtered according to various desired criteria, block 216. For example, the list of unique features may be pruned to remove features that are not found in at least a minimum number of data objects, features determined to be shorter than a certain minimum length, and/or features that occur fewer times than a set quota.
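A minimal Python sketch of blocks 212-216, again assuming each data object is represented as a set of unique features; the specific thresholds are illustrative placeholders, since the text leaves them system-dependent:

```python
from collections import Counter

def build_feature_list(docs, min_doc_count=2, min_length=3):
    """Count, for each unique feature, the number of documents that contain it
    (blocks 212-214), then filter out features found in too few documents or
    shorter than a minimum length (block 216)."""
    doc_counts = Counter()
    for features in docs:            # each doc is a set of unique features
        doc_counts.update(features)
    return {
        feature: count
        for feature, count in doc_counts.items()
        if count >= min_doc_count and len(feature) >= min_length
    }
```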
In accordance with one embodiment of the present invention, applying statistical differentiation to obtain the ranked list of features, described in block 220 of Fig. 2A, further includes the operations illustrated in Fig. 2C. That is, as statistical differentiation (i.e., equation (7)) is applied, a determination is made as to which of the unique features identified in the first set of data objects also appear in the second set of data objects, block 221, and, similarly, which of the unique features identified in the first set of data objects do not appear in the second set, block 222. In accordance with the illustrated embodiment, as the statistical differentiation (i.e., equation (7)) is applied, features determined to appear in only one set of data objects and not the other are assigned a relatively higher rank within the ranked feature list, block 223, whereas features determined to appear in both sets of data objects are assigned a relatively lower rank, block 224. In some embodiments, the features within the ranked list of features are further ranked according to the total number of data objects containing each respective feature.
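Putting the pieces together, the following sketch is one simplified reading of blocks 221-224: features exclusive to the content group are ranked above shared features, exclusive features are ordered by their content-group document counts, and shared features are ordered by the equation (7) ratio. It is an illustration, not a definitive implementation of the patented method:

```python
def ranked_feature_list(content_docs, anti_content_docs, eps=1e-6):
    """Rank unique features of the content group against an anti-content group."""
    def ratio(feature):
        # Equation (7) with the same clamping safeguard as above.
        p = sum(feature in doc for doc in content_docs) / len(content_docs)
        q = sum(feature in doc for doc in anti_content_docs) / len(anti_content_docs)
        p = min(max(p, eps), 1.0 - eps)
        q = min(max(q, eps), 1.0 - eps)
        return (p * (1.0 - q)) / (q * (1.0 - p))

    content_features = set().union(*content_docs)
    anti_features = set().union(*anti_content_docs)
    exclusive = content_features - anti_features   # block 222: appear only in content group
    common = content_features & anti_features      # block 221: appear in both groups

    ranked = sorted(exclusive,
                    key=lambda f: sum(f in doc for doc in content_docs),
                    reverse=True)                  # block 223: exclusive features ranked first
    ranked += sorted(common, key=ratio, reverse=True)  # block 224: shared features ranked lower
    return ranked

# Block 230: select the top N elements of the ranked list as the salient features.
# salient = ranked_feature_list(content, anti_content)[:N]
```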
Example Application
Referring now to Fig. 3, an apparatus incorporating the salient feature determination of the present invention is illustrated by way of example, in accordance with one embodiment. As shown, classifier 300 is used to efficiently classify data objects, such as electronic documents, within a broad range of data structures, from very-large-scale hierarchical classification trees to flat document collections, including, but not limited to, text documents, image documents, audio sequences and video sequences, in both proprietary and non-proprietary forms. Classifier 300 includes classifier training services 305 for training classifier 300 to classify new data objects according to classification rules extracted from a previously classified data hierarchy, and classifier classification services 315 for classifying new data objects presented to classifier 300.
The functions of classifier training services 305 include aggregation function 306, salient feature determination function 308 of the present invention, and node characterization function 309. In accordance with the illustrated embodiment, at each node of the hierarchy, content from the previously classified data hierarchy is aggregated by aggregation function 306 to form a content group and an anti-content group of data. Features are then extracted from each group of data, and salient feature determination function 308 is used to determine which subset of those features is salient. Node characterization function 309 is used to characterize each node of the previously classified data hierarchy in terms of the salient features, and to store these characterizations in data store 310, for example, for further use by classifier classification services 315.
Further details of classifier 300, including classifier training services 305 and classifier classification services 315, are described in the concurrently filed U.S. patent application identified by attorney docket number 51026.P004, entitled "Very-Large-Scale Automatic Categorizer For Web Content", commonly assigned to the assignee of the present application, which application is hereby fully incorporated by reference.
Classifier Training Services
Fig. 4 is a functional block diagram illustrating classifier training services 305 of Fig. 3 in accordance with one embodiment of the present invention. As shown in Fig. 4, a previously classified data hierarchy 402 is provided as input to classifier training services 305 of classifier 300. Previously classified data hierarchy 402 represents a group of data objects, such as audio, video and/or textual objects, that have previously been classified (typically by hand) into a subject hierarchy. Previously classified data hierarchy 402 may, for example, represent one or more previously classified collections of electronic documents of a web portal or search engine.
Continuing the example already described, aggregation function 406 aggregates content from previously classified data hierarchy 402 into content and anti-content groups, thereby accentuating the differences between sibling nodes at each level of the hierarchy. Salient feature determination function 408 operates to extract features (409) from the content and anti-content groups of data and to determine which of the extracted features should be deemed salient (409').
In addition, continuing the example, node characterization function 309 of Fig. 3 operates to characterize the content and anti-content groups of data. In one embodiment, the content and anti-content data are characterized in terms of the determined salient features. In one embodiment, the results of the characterization are stored in data store 310, which may be implemented as a data structure of any kind, such as a database, a directory structure, or a simple look-up list. In one embodiment of the invention, the classifier parameters for each node are stored in a hierarchical classification tree having a file structure similar to that of the previously classified data hierarchy.
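A hypothetical per-node training step tying aggregation, salient feature determination and node characterization together might look as follows; the node accessors (documents(), descendants(), siblings()) and the top-50 cutoff are assumptions made for illustration, and ranked_feature_list refers to the sketch given earlier:

```python
def train_node(node, top_n=50):
    """Form the content group from the selected node and its descendants, and the
    anti-content group from the node's siblings and their descendants (compare
    claim 12), then keep the top-ranked features as the node's characterization."""
    content = [doc.feature_set
               for n in [node, *node.descendants()]
               for doc in n.documents()]
    anti_content = [doc.feature_set
                    for sibling in node.siblings()
                    for n in [sibling, *sibling.descendants()]
                    for doc in n.documents()]
    node.salient_features = ranked_feature_list(content, anti_content)[:top_n]
    return node.salient_features
```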
Example Computer System
Fig. 5 illustrates an example computer system suitable for use in determining salient features in accordance with one embodiment of the present invention. As shown, computer system 500 includes one or more processors 502 and system memory 504. Additionally, computer system 500 includes mass storage devices 506 (such as diskette, hard drive, CDROM and so forth), input/output devices 508 (such as keyboard, cursor control and so forth) and communication interfaces 510 (such as network interface cards, modems and so forth). The elements are coupled to one another via system bus 512, which may represent one or more buses. Where multiple buses are present, they are bridged by one or more bus bridges (not shown).
Each of these elements performs its conventional function known in the art. In particular, system memory 504 and mass storage 506 are employed to store a working copy and a permanent copy of the programming instructions implementing the classification system of the present invention. The permanent copy of the programming instructions may be loaded into mass storage 506 in the factory or, in the field, as described earlier, through a distribution medium (not shown) or through communication interface 510 (from a distribution server, not shown). The constitution of these elements 502-512 is known, and accordingly will not be further described.
Conclusion and Epilogue
Thus, it can be seen from the above description that a novel method and apparatus for automatically determining salient features for object classification have been described. While the present invention has been described in terms of the foregoing embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The present invention can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative of, rather than restrictive on, the present invention.

Claims (36)

1. A method comprising:
extracting one or more unique features from a first content group of data objects to form a first feature list;
extracting one or more unique features from a second anti-content group of data objects to form a second feature list;
creating a ranked list of features by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list; and
identifying a set of salient features from the ranked list of features.
2. The method of claim 1, wherein each of the first content group of data objects and the second anti-content group of data objects comprises one or more electronic documents.
3. The method of claim 1, further comprising:
determining a first total number of data objects making up the first content group of data objects; and
determining a second total number of data objects making up the second anti-content group of data objects.
4. The method of claim 3, further comprising:
determining, for each of the one or more unique features forming the first feature list, a first number of data objects of the first content group of data objects containing at least one instance of each corresponding one of the one or more unique features of the first feature list; and
determining, for each of the one or more unique features forming the second feature list, a second number of data objects of the second anti-content group of data objects containing at least one instance of each corresponding one of the one or more unique features of the second feature list.
5. The method of claim 4, wherein creating the ranked list comprises:
identifying as exclusive features those unique features of the first feature list that do not appear in the second feature list;
identifying as common features those unique features of the first feature list that also appear in the second feature list; and
ordering the ranked list such that the exclusive features are ranked higher in the ranked list than the common features.
6. The method of claim 5, further comprising:
applying a probability function to each of the common features to obtain a result vector, wherein the probability function comprises a ratio of the first number of data objects divided by the first total number of data objects to the second number of data objects divided by the second total number of data objects; and
ordering the common features within the ranked list based at least in part on the result vector of the probability function.
7. The method of claim 5, wherein the exclusive features are further ranked based on the first number of data objects.
8. The method of claim 1, wherein identifying the set of salient features comprises selecting from the ranked list of features a top N adjacent features of the ranked list of features.
9. The method of claim 1, wherein identifying the set of salient features comprises selecting from the ranked list of features a last M adjacent features of the ranked list of features.
10. The method of claim 1, wherein each of the unique features comprises a grouping of one or more alphanumeric characters.
11. The method of claim 1, further comprising:
classifying a new data object as being most closely related to one of the first content group of data objects and the second anti-content group of data objects, based at least in part on the set of salient features.
12. The method of claim 1, wherein the first content group of data objects comprises those data objects corresponding to a selected node of a subject hierarchy having a plurality of nodes and to any child nodes associated with the selected node; and
wherein the second anti-content group of data objects comprises those data objects corresponding to any sibling nodes associated with the selected node and to any child nodes associated with the sibling nodes.
13. A method of identifying salient features, the method comprising:
identifying one or more unique features that are members of a first data class;
examining a second data class to identify those of the one or more unique features that are also members of the second data class and those of the one or more unique features that are not members of the second data class;
generating a ranked list of unique features having an order based upon membership of each of the one or more unique features in the second data class; and
identifying one or more of the ranked list of unique features as salient.
14. The method of claim 13, further comprising:
determining, for each of the ranked list of unique features, a number of objects of the first data class containing each corresponding unique feature.
15. The method of claim 14, wherein generating a ranked list further comprises ranking those of the unique features that are not members of the second data class higher in the ranked list than those of the unique features that are also members of the second data class.
16. The method of claim 15, wherein generating a ranked list further comprises ranking those of the unique features contained in a greater number of objects of the first data class higher in the ranked list than those of the unique features contained in a lesser number of objects of the first data class.
17. The method of claim 13, wherein identifying as salient comprises selecting a top N contiguous set of features from the ranked list of unique features.
18. The method of claim 13, wherein identifying as salient comprises selecting a last M contiguous features from the ranked list of unique features.
19. An apparatus comprising:
a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions of a class naming service for providing class names to data objects, including first one or more functions to
extract one or more unique features from a first content group of data objects to form a first feature list,
extract one or more unique features from a second anti-content group of data objects to form a second feature list,
create a ranked list of features by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list, and
identify a set of salient features from the ranked list of features; and
a processor coupled with the storage medium to execute the programming instructions.
20. The apparatus of claim 19, wherein each of the first content group of data objects and the second anti-content group of data objects comprises one or more data objects.
21. The apparatus of claim 19, wherein the plurality of instructions further include instructions to
determine a first total number of data objects making up the first content group of data objects, and
determine a second total number of data objects making up the second anti-content group of data objects.
22. The apparatus of claim 19, wherein the plurality of instructions further include instructions to
determine, for each of the one or more unique features forming the first feature list, a first number of data objects of the first content group of data objects containing at least one instance of each corresponding one of the one or more unique features of the first feature list, and
determine, for each of the one or more unique features forming the second feature list, a second number of data objects of the second anti-content group of data objects containing at least one instance of each corresponding one of the one or more unique features of the second feature list.
23. The apparatus of claim 20, wherein the plurality of instructions to create the ranked list include instructions to
identify as exclusive features those unique features of the first feature list that do not appear in the second feature list,
identify as common features those unique features of the first feature list that also appear in the second feature list, and
order the ranked list such that the exclusive features are ranked higher in the ranked list than the common features.
24. The apparatus of claim 23, wherein the plurality of instructions further include instructions to
apply a probability function to each of the common features to obtain a result vector, wherein the probability function comprises a ratio of the first number of data objects divided by the first total number of data objects to the second number of data objects divided by the second total number of data objects, and
order the common features within the ranked list based at least in part on the result vector of the probability function.
25. The apparatus of claim 23, wherein the exclusive features are further ranked based on the first number of data objects.
26. The apparatus of claim 19, wherein the plurality of instructions to identify the set of salient features further include instructions to select from the ranked list of features a top N adjacent features of the ranked list of features.
27. The apparatus of claim 19, wherein the plurality of instructions to identify the set of salient features further include instructions to select from the ranked list of features a last M adjacent features of the ranked list of features.
28. The apparatus of claim 19, wherein each of the unique features comprises a grouping of one or more alphanumeric characters.
29. The apparatus of claim 19, wherein the plurality of instructions further include instructions to
identify a new data object as being most closely related to one of the first content group of data objects and the second anti-content group of data objects, based at least in part on the set of salient features.
30. The apparatus of claim 19, wherein the first content group of data objects comprises those data objects corresponding to a selected node of a subject hierarchy having a plurality of nodes and to any child nodes associated with the selected node; and
wherein the second anti-content group of data objects comprises those data objects corresponding to any sibling nodes associated with the selected node and to any child nodes associated with the sibling nodes.
31. An apparatus comprising:
a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions, including first one or more functions to
identify one or more unique features that are members of a first data class,
examine a second data class to identify those of the one or more unique features that are also members of the second data class and those of the one or more unique features that are not members of the second data class,
generate a ranked list of unique features having an order based upon membership of each of the one or more unique features in the second data class, and
identify one or more of the ranked list of unique features as salient; and
a processor coupled with the storage medium to execute the programming instructions.
32. The apparatus of claim 31, wherein the plurality of instructions further include instructions to
determine, for each of the ranked list of unique features, a number of objects of the first data class containing each corresponding unique feature.
33. The apparatus of claim 32, wherein the plurality of instructions to generate a ranked list further include instructions to rank those of the unique features that are not members of the second data class higher in the ranked list than those of the unique features that are also members of the second data class.
34. The apparatus of claim 33, wherein the plurality of instructions to generate a ranked list further include instructions to rank those of the unique features contained in a greater number of objects of the first data class higher in the ranked list than those of the unique features contained in a lesser number of objects of the first data class.
35. The apparatus of claim 31, wherein the plurality of instructions to identify as salient further include instructions to select a top N contiguous set of unique features from the ranked list of unique features.
36. The apparatus of claim 31, wherein the plurality of instructions to identify as salient further include instructions to select a last M contiguous unique features from the ranked list of unique features.
CNB02829663XA 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification Expired - Fee Related CN100378713C (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2002/030457 WO2004029826A1 (en) 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification

Publications (2)

Publication Number Publication Date
CN1669023A true CN1669023A (en) 2005-09-14
CN100378713C CN100378713C (en) 2008-04-02

Family

ID=32041246

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB02829663XA Expired - Fee Related CN100378713C (en) 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification

Country Status (8)

Country Link
EP (1) EP1543437A4 (en)
JP (1) JP2006501545A (en)
CN (1) CN100378713C (en)
AU (1) AU2002334669A1 (en)
BR (1) BR0215899A (en)
CA (1) CA2500264A1 (en)
MX (1) MXPA05003249A (en)
WO (1) WO2004029826A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110035195A (en) * 2013-06-03 2019-07-19 柯达阿拉里斯股份有限公司 Classification through the hardcopy medium scanned

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7576755B2 (en) 2007-02-13 2009-08-18 Microsoft Corporation Picture collage systems and methods
US8832140B2 (en) 2007-06-26 2014-09-09 Oracle Otc Subsidiary Llc System and method for measuring the quality of document sets
US8935249B2 (en) 2007-06-26 2015-01-13 Oracle Otc Subsidiary Llc Visualization of concepts within a collection of information

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6006221A (en) * 1995-08-16 1999-12-21 Syracuse University Multilingual document retrieval system and method using semantic vector matching
US6539115B2 (en) * 1997-02-12 2003-03-25 Fujitsu Limited Pattern recognition device for performing classification using a candidate table and method thereof
US6233575B1 (en) * 1997-06-24 2001-05-15 International Business Machines Corporation Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values
US6018733A (en) * 1997-09-12 2000-01-25 Infoseek Corporation Methods for iteratively and interactively performing collection selection in full text searches
US6353825B1 (en) * 1999-07-30 2002-03-05 Verizon Laboratories Inc. Method and device for classification using iterative information retrieval techniques
WO2002006993A1 (en) * 2000-07-17 2002-01-24 Asymmetry, Inc. System and methods for web resource discovery

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110035195A (en) * 2013-06-03 2019-07-19 柯达阿拉里斯股份有限公司 Classification through the hardcopy medium scanned

Also Published As

Publication number Publication date
JP2006501545A (en) 2006-01-12
CN100378713C (en) 2008-04-02
MXPA05003249A (en) 2005-07-05
EP1543437A1 (en) 2005-06-22
CA2500264A1 (en) 2004-04-08
AU2002334669A1 (en) 2004-04-19
WO2004029826A1 (en) 2004-04-08
BR0215899A (en) 2005-07-26
EP1543437A4 (en) 2008-05-28

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
DE60315506T2 (en) IDENTIFICATION OF CRITICAL FEATURES IN A REGIONAL SCALE ROOM
Dadgar et al. A novel text mining approach based on TF-IDF and Support Vector Machine for news classification
US6938025B1 (en) Method and apparatus for automatically determining salient features for object classification
US6826576B2 (en) Very-large-scale automatic categorizer for web content
CN1240011C (en) File classifying management system and method for operation system
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
US8150822B2 (en) On-line iterative multistage search engine with text categorization and supervised learning
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
Noaman et al. Naive Bayes classifier based Arabic document categorization
CN110516074B (en) Website theme classification method and device based on deep learning
CN109446423B (en) System and method for judging sentiment of news and texts
CN115796181A (en) Text relation extraction method for chemical field
CN108595525A (en) A kind of lawyer's information processing method and system
CN1158460A (en) Multiple languages automatic classifying and searching method
CN108681548A (en) A kind of lawyer's information processing method and system
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN108614860A (en) A kind of lawyer's information processing method and system
CN100378713C (en) Method and apparatus for automatically determining salient features for object classification
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN111401056A (en) Method for extracting keywords from various texts
JPH06282587A (en) Automatic classifying method and device for document and dictionary preparing method and device for classification
Asirvatham et al. Web page categorization based on document structure
Li Research on an Enhanced Web Information Processing Technology based on AIS Text Mining

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: MICROSOFT TECHNOLOGY LICENSING LLC

Free format text: FORMER OWNER: MICROSOFT CORP.

Effective date: 20150429

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150429

Address after: Washington State

Patentee after: Microsoft Technology Licensing, LLC

Address before: Washington State

Patentee before: Microsoft Corp.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080402

Termination date: 20190925