CN100378713C - Method and apparatus for automatically determining salient features for object classification - Google Patents

Method and apparatus for automatically determining salient features for object classification

Info

Publication number
CN100378713C
CN100378713C CNB02829663XA CN02829663A
Authority
CN
China
Prior art keywords
unique features
list
data object
features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB02829663XA
Other languages
Chinese (zh)
Other versions
CN1669023A (en)
Inventor
D·P·卢力奇
F·G·吉拉克
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp filed Critical Microsoft Corp
Publication of CN1669023A publication Critical patent/CN1669023A/en
Application granted granted Critical
Publication of CN100378713C publication Critical patent/CN100378713C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Processing (AREA)

Abstract

The present invention provides a method and apparatus for automatically determining salient features for object classification. According to one embodiment, one or more unique features are extracted from a first, content group of data objects to form a first feature list; one or more unique features are extracted from a second, non-content group of data objects to form a second feature list; a ranked feature list is generated by applying a statistical differentiation method between the unique features of the first feature list and the unique features of the second feature list; and a set of salient features is identified from the resulting ranked feature list.

Description

Method and Apparatus for Automatically Determining Salient Features for Object Classification
Background of the Invention
1. Field of the Invention
The present invention relates to the field of data processing. More specifically, the present invention relates to the automatic selection of features of objects for use in object grouping.
2. Background Information
The World Wide Web provides an important source of information, with an estimated billions of pages of information available for online reading and download. In order to make effective use of this information, however, a practical method of navigating this mass of data is required.
In the early days of Internet browsing, two basic approaches to online searching were developed. In the first approach, index data are generated from web page content gathered by an automated search engine that "crawls" the web looking for new and unique pages. The resulting database can then be searched with various query techniques, and the results are typically ranked by their similarity to the form of the query. In the second approach, web pages are grouped into a hierarchy, often presented in the form of a tree. As the user descends the hierarchy, he or she makes a series of selections, choosing among two or more branches at each level wherever the subtrees below a decision point differ significantly, until finally reaching a leaf node containing a page of text and/or multimedia content.
For example, Fig. 1 illustrates a typical prior-art hierarchy 102 in which a plurality of decision nodes (hereinafter "nodes") 130-136 are hierarchically arranged as parent and/or child nodes, each node being associated with a unique subject classification. For example, node 130 is the parent node of nodes 131 and 132, while nodes 131 and 132 are child nodes of node 130. Because nodes 131 and 132 are both child nodes of the same node (node 130), nodes 131 and 132 are siblings of each other. Other sibling pairs in subject hierarchy 102 include nodes 133 and 134, as well as nodes 135 and 136. As can be seen from Fig. 1, node 130 forms the first level 137 of subject hierarchy 102, nodes 131-132 form the second level 138, and nodes 133-136 form the third level 139. In addition, node 130 is considered the root node of subject hierarchy 102, because it is not the child of any other node.
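By way of illustration only, the following Python sketch models a subject hierarchy such as hierarchy 102 of Fig. 1 as a simple tree. The Node class and its field names are assumptions introduced for the example and are not part of the described system.

    # Illustrative sketch of a subject hierarchy like hierarchy 102 of Fig. 1.
    # The Node class and its field names are assumptions, not part of the patent.
    class Node:
        def __init__(self, label, parent=None):
            self.label = label
            self.parent = parent
            self.children = []
            if parent is not None:
                parent.children.append(self)

        def siblings(self):
            """Nodes sharing this node's parent (e.g. nodes 131 and 132 under 130)."""
            if self.parent is None:
                return []
            return [c for c in self.parent.children if c is not self]

    # Rebuild the three-level hierarchy of Fig. 1.
    root = Node(130)                                  # level 137 (root node)
    n131, n132 = Node(131, root), Node(132, root)     # level 138
    n133, n134 = Node(133, n131), Node(134, n131)     # level 139
    n135, n136 = Node(135, n132), Node(136, n132)     # level 139

    print([s.label for s in n133.siblings()])         # -> [134]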
The process of hierarchically classifying web pages presents multiple challenges. First, the character of the hierarchy must be defined. Usually this is done by hand by experts in the relevant fields, somewhat like the Dewey decimal classification performed for libraries. These classifications are then given caption-like labels so that both users and classifiers can make appropriate decisions when navigating the hierarchy. Content, presented for example in the form of individual electronic documents, can then be placed into one or more of the classifications by manual search methods.
In recent years attention has turned to automating each stage of this process. Systems now exist that automatically classify documents from a batch of documents. For example, some systems use the association of related words within documents to automatically cluster similar documents. These clusters can in turn be repeatedly combined into super-clusters to produce a hierarchy; however, such systems require manually supplied keywords, and the result is a hierarchy without a systematic structure. If such a hierarchy is to be used for manual searching, the child nodes or leaf documents must be inspected by hand to identify common features so that the nodes of the hierarchy can be labeled.
Many classification systems use word lists to classify documents. Typically, the salient words may be defined in advance, or they may be selected from the documents being processed so as to characterize the documents more accurately. Such salient word lists are generally produced by counting the frequency of occurrence of every word across each document in a group of documents. Words are then removed from the word list according to one or more criteria. Often, words that occur too few times in a batch of documents are discarded because they are used too rarely to distinguish classes reliably, while words that occur too frequently are also discarded because they appear in documents of every class.
Moreover " useless words " also often rejected from feature list to be more conducive to determining of distinguishing feature with stem.Useless words comprises the common word in the language, such as " a ", " the ", and " his " and " and ", these words allow the people feel not have the semantics content, and stem then refers to such as " ing ", " is " and suffixes such as " able ".Unfortunately, generating the useless words tabulation is the professional task of a term language with the stem tabulation, require to have the professional knowledge of grammer, document and idiom aspect, and these is can be time dependent.Therefore, significantly specific with regard to requiring a more dexterous method to determine.
Brief Description of the Drawings
The present invention will be described by way of exemplary embodiments, but not limitation, illustrated in the accompanying drawings in which like references denote similar elements, and in which:
Fig. 1 illustrates an exemplary prior-art hierarchy comprising a plurality of decision nodes;
Fig. 2 (A-C) illustrates the operational flow of the salient feature determination function, in accordance with one embodiment of the present invention;
Fig. 3 illustrates an example application of the salient feature determination apparatus of the present invention, in accordance with one embodiment;
Fig. 4 is a functional block diagram illustrating the classifier training service of Fig. 3, in accordance with one embodiment of the present invention; and
Fig. 5 illustrates a computer system suitable for use in determining salient features, in accordance with one embodiment of the present invention.
Detailed Description of the Invention
In the following description, various aspects of the present invention will be described. However, it will be apparent to those skilled in the art that the present invention may be practiced with only some or all of its aspects. For ease of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the present invention. It will also be apparent to those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known features are omitted or simplified in order not to obscure the present invention.
Parts of the description are presented in terms of operations performed by a processor-based device, using terms such as data, storing, selecting, determining, calculating and the like, consistent with the manner commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. As is well understood by those skilled in the art, these quantities take the form of electrical, magnetic or optical signals capable of being stored, transferred and otherwise manipulated through the mechanical and electrical components of the processor-based device; and the term processor includes microprocessors, microcontrollers, digital signal processors and the like, whether standalone, adjunct or embedded.
The various operations are described as multiple discrete steps, in turn, in a manner that is most helpful in understanding the present invention; however, the order of description should not be construed as implying that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation. Further, the description repeatedly uses the phrase "in one embodiment", which does not necessarily refer to the same embodiment, although it may.
In accordance with one embodiment of the present invention, one or more unique features are extracted from a first group of objects to form a first feature set, and one or more unique features are extracted from a second group of objects to form a second feature set. A ranked feature list is then generated by applying a statistical differentiation method between the unique features of the first feature set and the unique features of the second feature set. A set of salient features can then be identified from the resulting ranked list.
In one embodiment, the determination of salient features facilitates the effective classification of data objects, including but not limited to text documents, image documents, audio sequences and video sequences, whether within very-large-scale hierarchical classification trees or within non-hierarchical data structures such as flat documents, and whether the data objects are in proprietary or non-proprietary forms. In a text document, for example, a feature may take the form of a word, where the term "word" is generally understood to denote a group of letters in a given language that carries some semantic meaning. More generally, a feature may be an N-token gram, a token being a small element of a language; examples include N-letter grams and N-word grams in English, as well as N-ideogram grams in Asian languages. In an audio sequence, properties such as tone, tempo, sustain, pitch, volume and the like may serve as features for classifying sounds, while in video sequences and still images, per-pixel properties such as angle and intensity values may serve as features. In accordance with one embodiment of the present invention, once a group of features has been identified from a group of, say, electronic documents, a subset of those features can then be determined to be salient with respect to the classification of the given group of data objects. The term "electronic document" is used broadly herein to describe a family of data objects, such as those described above, that contain one or more constituent features. Thus, although an electronic document may contain text, it may equally contain audio and/or video content, either instead of or in addition to text.
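As an illustration of feature extraction, the following Python sketch produces N-token grams from a text, treating whitespace-delimited words as the tokens; the function name and the tokenization are assumptions for the example.

    # Extract N-token-gram features from a text; here a token is a word,
    # but it could equally be a letter or an ideographic symbol.
    def ngram_features(text, n=2):
        tokens = text.lower().split()
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    print(ngram_features("the quick brown fox", n=2))
    # -> {'the quick', 'quick brown', 'brown fox'}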
Once the criteria for feature selection have been determined (in other words, once it is decided which attributes of the various text/audio/video data are to serve as determinative features for the data object sets), the salient feature determination process of the present invention can be carried out. At the start of the salient feature determination process, the data objects under consideration are divided into two groups. An equation representing the "odds" is then applied to the two groups of data objects (see equation (1)), where O(d) represents the odds that a given data object is a member of the first group of data objects, P(R|d) represents the probability that the data object is a member of the first group, and P(R′|d) represents the probability that the data object is a member of the second group.
O(d) = P(R|d) / P(R′|d)    (1)
Because a manual grouping of the data objects is not available for computing the probabilities of the odds, equation (1) can instead be used to estimate this value. Accordingly, a logarithm can be applied to both sides of equation (1), together with Bayes' rule, giving equation (2):
log O(d) = log P(d|R) - log P(d|R′) + log P(R) - log P(R′)    (2)
If a data object is assumed to be composed of a set of features {F_i}, and X_i equals 1 or 0 according to whether a given feature f_i does or does not appear in the data object, then
log O(d) = Σ_i [log P(X_i|R) - log P(X_i|R′)] + log P(R) - log P(R′)    (3)
Because log P(R) and log P(R′) are constants independent of which features are chosen as salient in the data objects, a new quantity g(d) can be defined:
g(d) = Σ_i [log P(X_i|R) - log P(X_i|R′)]    (4)
If p_i = P(X_i = 1|R) denotes the probability that a given feature f_i appears in a data object of the first data object group, and q_i = P(X_i = 1|R′) denotes the probability that the given feature f_i appears in a data object of the second data object group, then substitution and simplification yield equation (5):
g(d) = Σ_i X_i log[ p_i(1 - q_i) / (q_i(1 - p_i)) ] + Σ_i log[ (1 - p_i) / (1 - q_i) ]    (5)
Because the second summation does not depend on which features actually appear in a data object, it can be dropped, leaving the per-feature weight of equation (6):
log[ p_i(1 - q_i) / (q_i(1 - p_i)) ]    (6)
Because the logarithm is a monotonic function, maximizing the ratio of equation (7)
p_i(1 - q_i) / (q_i(1 - p_i))    (7)
is sufficient to maximize the corresponding logarithmic value. In accordance with one embodiment of the present invention, equation (7) is applied to each feature in the combined feature list of the two groups of data objects to facilitate the identification of salient features. To this end, p_i is computed as the number of data objects of the first data object group that contain feature f_i at least once, divided by the total number of data objects in the first data object group. Likewise, q_i is computed as the number of data objects of the second data object group that contain feature f_i at least once, divided by the total number of data objects in the second data object group.
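For illustration, the following Python sketch computes the equation (7) ratio for every feature of two groups of data objects, with p_i and q_i estimated as described above. The function name and the small smoothing constant (added only to keep the ratio finite for features absent from one group) are assumptions for the example; the embodiment described below instead handles such exclusive features by ranking them separately.

    # Score each feature by the equation (7) ratio p_i(1 - q_i) / (q_i(1 - p_i)),
    # where p_i and q_i are the fractions of data objects in the first and second
    # groups containing the feature at least once. Each document is a set of features.
    def salience_scores(group1_docs, group2_docs, eps=1e-9):
        n1, n2 = len(group1_docs), len(group2_docs)
        features = set().union(*group1_docs, *group2_docs)
        scores = {}
        for f in features:
            p = sum(f in d for d in group1_docs) / n1    # p_i
            q = sum(f in d for d in group2_docs) / n2    # q_i
            scores[f] = (p * (1 - q) + eps) / (q * (1 - p) + eps)
        return scores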
Fig. 2 (A-C) illustrates the operational flow of the salient feature determination function in accordance with one embodiment of the present invention. To begin with, the first set of data objects is examined to produce a feature list composed of the unique features that appear in at least one or more data objects of the first set of data objects, see block 210. A ranked feature list is then generated by applying equation (7) to each identified unique feature, see block 220, and at least a subset of this ranked feature list is selected as the salient features, see block 230. The salient features may comprise one or more adjacent or non-adjacent sets of elements selected from the ranked feature list. In one embodiment, the top N elements of the ranked feature list are selected as salient, where N may vary with the needs of the system. In another embodiment, the last M elements of the ranked feature list are selected as salient, where M likewise varies with the needs of the system.
In accordance with one embodiment of the present invention, when the feature list is produced (see block 210), the total number of data objects contained in each data object group is determined, see block 212, and for at least each unique feature identified in the first data object group, the total number of data objects containing that unique feature is also determined, see block 214. In addition, the list of unique features may be filtered according to various criteria as desired, see block 216. For example, the list of unique features may be pruned of those features that are not found in at least a minimum number of data objects, those features determined to be shorter than a certain minimum length, and/or those features that occur fewer times than a quota.
In accordance with one embodiment of the present invention, applying the statistical differentiation method to obtain the ranked feature list, described at block 220 of Fig. 2A, further includes the operations illustrated in Fig. 2C. That is, in applying the statistical differentiation method (i.e. equation (7)), a determination is made as to which of the identified unique features of the first set of data objects also appear in the second set of data objects, see block 221, and, similarly, which of the identified unique features of the first set of data objects do not appear in the second set of data objects, see block 222. In accordance with the illustrated embodiment, when this determination is made by the statistical differentiation method (i.e. equation (7)), those features determined to appear in only one set of data objects and not in the other are assigned a relatively higher rank in the ranked feature list, see block 223, while those features determined to appear in both sets of data objects are assigned a relatively lower rank, see block 224. Optionally, the features within the ranked feature list may be further ranked according to the total number of data objects containing each respective feature.
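The following Python sketch ties the blocks of Fig. 2 together under the same assumptions: it builds and filters the unique-feature list (blocks 210-216), ranks exclusive features above common ones, with the common features ordered by the equation (7) ratio from the previous sketch (blocks 220-224), and selects the top N as salient (block 230). The filtering criteria and parameter names are assumptions for the example.

    # End-to-end sketch of the Fig. 2 flow; reuses salience_scores() from above.
    def salient_features(group1_docs, group2_docs, n_salient=10,
                         min_doc_count=2, min_length=2):
        feats1 = set().union(*group1_docs)
        feats2 = set().union(*group2_docs)

        # Block 216: filter the unique-feature list (criteria are assumed values).
        doc_count = {f: sum(f in d for d in group1_docs) for f in feats1}
        feats1 = {f for f in feats1
                  if doc_count[f] >= min_doc_count and len(f) >= min_length}

        exclusive = feats1 - feats2          # block 222: absent from the second group
        common = feats1 & feats2             # block 221: present in both groups

        scores = salience_scores(group1_docs, group2_docs)
        ranked = (sorted(exclusive, key=lambda f: doc_count[f], reverse=True) +
                  sorted(common, key=lambda f: scores[f], reverse=True))
        return ranked[:n_salient]            # block 230: top-N selection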
Example Application
Referring now to Fig. 3, an example of an apparatus in which the salient feature determination of the present invention may be employed is shown, in accordance with one embodiment. As shown, classifier 300 is used to effectively classify data objects, such as electronic documents, within data structures that include both very-large-scale hierarchical classification trees and flat document formats; such data objects include, but are not limited to, text documents, image documents, audio sequences and video sequences, in both proprietary and non-proprietary forms. Classifier 300 includes a classifier training service 305 for training classifier 300 to classify new data objects according to classification rules extracted from a previously classified data hierarchy, and a classifier classification service 315 for classifying new data objects submitted to classifier 300.
The functions of classifier training service 305 include an aggregation function 306, the salient feature determination function 308 of the present invention, and a node characterization function 309. In accordance with the illustrated embodiment, at each node of the hierarchy, content from the previously classified data hierarchy is aggregated by aggregation function 306 to form a content group and a non-content group of data. Features are then extracted from each data group, and the salient feature determination function 308 is used to determine which subset of those features is salient. Node characterization function 309 is used to characterize each node of the previously classified data hierarchy in terms of the salient features, and to store these characterizations in data store 310, for example for further use by classifier classification service 315.
Additional details regarding classifier 300, including classifier training service 305 and classifier classification service 315, are described in the concurrently filed U.S. patent application numbered <<51026, P004>>, entitled "Very-Large-Scale Automatic Categorizer For Web Content", commonly assigned to the assignee of the present application, which application is hereby fully incorporated by reference.
Classifier Training Service
Fig. 4 is a functional block diagram of the classifier training service 305 of Fig. 3, drawn in accordance with one embodiment of the present invention. As shown in Fig. 4, a previously classified data hierarchy 402 serves as input to the classifier training service 305 of classifier 300. Previously classified data hierarchy 402 represents a set of data objects, such as audio, video and/or text objects, that have previously been classified and placed into a subject hierarchy (usually by hand). Previously classified data hierarchy 402 may, for example, represent one or more previously classified electronic document collections of a web portal or search engine.
In accordance with the example already illustrated, aggregation function 406 aggregates content from previously classified data hierarchy 402 into content and non-content groups, thereby heightening the differences between sibling nodes at each level of the hierarchy. The role of salient feature determination function 408 is to extract features from the content and non-content data groups and to determine which of the extracted features (409) are to be designated as salient (409′).
Further in accordance with the illustrated example, the role of node characterization function 309 of Fig. 3 is to characterize the content and non-content data groups. In one embodiment, the content and non-content data are characterized in terms of the determined salient features. In one embodiment, the results of the characterization are stored in data store 310, which may be implemented with any form of data structure, such as a database, a directory structure, or a simple lookup list. In one embodiment of the invention, the classifier parameters for each node are stored in a hierarchical classification tree whose file structure resembles that of the previously classified data hierarchy.
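For illustration, the following Python sketch shows one way the per-node training flow described for Figs. 3 and 4 could be arranged, reusing the Node class and salient_features() from the earlier sketches: each document is represented as a set of features, the content group gathers the documents under a node and its descendants, the non-content group gathers the documents under its siblings and their descendants, and the resulting salient features are stored per node. The names and the traversal are assumptions for the example and not the patent's implementation.

    # Characterize each node of a previously classified hierarchy by the salient
    # features that separate its content from that of its sibling nodes.
    def gather_docs(node, docs_by_node):
        """All documents (feature sets) filed under a node and its descendants."""
        collected = list(docs_by_node.get(node.label, []))
        for child in node.children:
            collected += gather_docs(child, docs_by_node)
        return collected

    def train_node_characterizations(root, docs_by_node, data_store):
        stack = [root]
        while stack:
            node = stack.pop()
            stack.extend(node.children)
            siblings = node.siblings()
            if not siblings:
                continue                                    # the root has no siblings
            content = gather_docs(node, docs_by_node)       # content group
            non_content = []                                # sibling (non-content) group
            for sib in siblings:
                non_content += gather_docs(sib, docs_by_node)
            if content and non_content:
                data_store[node.label] = salient_features(content, non_content)
        return data_store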
Example Computer System
Fig. 5 illustrates an example computer system suitable for use in determining salient features, in accordance with one embodiment of the present invention. As shown, computer system 500 includes one or more processors 502 and system memory 504. In addition, computer system 500 includes mass storage devices 506 (such as diskette, hard drive, CDROM and so forth), input/output devices 508 (such as keyboard, cursor control and so forth) and communication interfaces 510 (such as network interface cards, modems and so forth). The elements are coupled to each other via system bus 512, which may represent one or more buses. Where system bus 512 represents multiple buses, they are bridged by one or more bus bridges (not shown).
Each of these elements performs its conventional functions known in the art. In particular, system memory 504 and mass storage 506 are employed to store a working copy and a permanent copy of the programming instructions implementing the classification system of the present invention. The permanent copy of the programming instructions may be loaded into mass storage 506 in the factory, or in the field, as described earlier, through a distribution medium (not shown) or through communication interface 510 (from a distribution server (not shown)). The constitution of these elements 502-512 is known, and accordingly will not be further described.
Conclusion and Epilogue
Thus, it can be seen from the above description that a novel method and apparatus for automatically determining salient features for object classification have been described. While the present invention has been described in terms of the foregoing embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The present invention can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative of, rather than restrictive on, the present invention.

Claims (34)

1. A method, comprising:
extracting one or more unique features from a first, content group of data objects to form a first feature list;
extracting one or more unique features from a second, non-content group of data objects to form a second feature list;
generating a ranked feature list by applying a statistical differentiation method between the unique features of said first feature list and the unique features of said second feature list; and
identifying a set of salient features from said ranked feature list,
wherein generating said ranked list comprises:
identifying those unique features of said first feature list that do not appear in said second feature list as exclusive features;
identifying those unique features of said first feature list that also appear in said second feature list as common features; and
ordering said ranked list such that said exclusive features are ranked higher in said ranked list than said common features.
2. the method for claim 1 is characterized in that, each in the second non-content group of the first content group of described data object and described data object all comprises one or more electronic documents.
3. the method for claim 1 is characterized in that, also comprises:
Determine the first data object sum of the first content group of composition data object; And
Determine the second data object sum of the second non-content group of composition data object.
4. method as claimed in claim 3 is characterized in that, also comprises:
To in the described one or more unique features that form described first feature list each, contain the first data object number of at least one example of each corresponding described one or more unique features in described first feature list in the described first content group of specified data object; And
To in the described one or more unique features that form described second feature list each, contain the second data object number of at least one example of each corresponding described one or more unique features in described second feature list in the described second non-content group of specified data object.
5. method as claimed in claim 4 is characterized in that, also comprises:
To each described common feature applied probability function to obtain a result vector, wherein, described probabilistic function comprises the described first data object number divided by the result of the described first data object sum and the described second data object number ratio divided by the result of the described second data object sum; And
Based on the result vector of described probabilistic function, the described common feature in the described grading list is sorted at least in part.
6. method as claimed in claim 4 is characterized in that, based on the described first data object number to the further classification of described exclusive characteristics.
7. the method for claim 1, it is characterized in that, identifying the distinguishing feature collection from described ranked list of features comprises: select the continuous characteristics of top n in the described ranked list of features, wherein N is the natural number less than the characteristics number in the described ranked list of features.
8. the method for claim 1, it is characterized in that, identifying the distinguishing feature collection from described ranked list of features comprises: select last M continuous characteristics in the described ranked list of features, wherein M is the natural number less than the characteristics number in the described ranked list of features.
9. the method for claim 1 is characterized in that, each described unique features all comprises the group by one or more alphanumeric character set one-tenth.
10. the method for claim 1 is characterized in that, also comprises:
At least in part based on described distinguishing feature collection, become one relation in the described second non-content group with the described first content group of data object and data object the closest a new data-object classifications.
11. the method for claim 1 is characterized in that, the described first content group of data object comprises those data objects corresponding to the node of selecting in the theme hierarchy with a plurality of nodes and any child node of being associated with the node of selecting; And
Wherein, described second of the data object non-content group comprises those data objects corresponding to any brotgher of node that is associated with the node of selecting and any child node of being associated with the brotgher of node.
12. A method of identifying salient features, said method comprising:
identifying one or more unique features that are members of a first data class;
examining a second data class to identify those of said one or more unique features that are also members of said second data class, and those of said one or more unique features that are not members of said second data class;
generating a ranked list of unique features, the ranked list having an order based on the membership of each of said one or more unique features in said second data class; and
identifying one or more unique features of said ranked list of unique features as salient.
13. The method of claim 12, further comprising:
for each unique feature of said ranked list of unique features, determining a number of objects of said first data class that contain each corresponding unique feature.
14. The method of claim 13, wherein generating the ranked list further comprises: ranking those of said unique features that are not members of said second data class higher in said ranked list than those of said unique features that are also members of said second data class.
15. The method of claim 14, wherein generating the ranked list further comprises: ranking those of said unique features belonging to a greater number of objects of said first data class higher in said ranked list than those of said unique features belonging to a lesser number of objects of said first data class.
16. The method of claim 12, wherein identifying as salient comprises: selecting the top N contiguous features from said ranked list of unique features, wherein N is a natural number less than the number of features in said ranked list of unique features.
17. The method of claim 12, wherein identifying as salient comprises: selecting the last M contiguous features from said ranked list of unique features, wherein M is a natural number less than the number of features in said ranked list of unique features.
18. An apparatus, comprising:
means for extracting one or more unique features from a first, content group of data objects to form a first feature list;
means for extracting one or more unique features from a second, non-content group of data objects to form a second feature list;
means for generating a ranked feature list by applying a statistical differentiation method between the unique features of said first feature list and the unique features of said second feature list; and
means for identifying a set of salient features from said ranked feature list,
wherein the means for generating said ranked list comprises:
means for identifying those unique features of said first feature list that do not appear in said second feature list as exclusive features;
means for identifying those unique features of said first feature list that also appear in said second feature list as common features; and
means for ordering said ranked list such that said exclusive features are ranked higher in said ranked list than said common features.
19. The apparatus of claim 18, wherein each of the first, content group of data objects and the second, non-content group of data objects comprises one or more electronic documents.
20. The apparatus of claim 18, further comprising:
means for determining a first total number of data objects making up the first, content group of data objects; and
means for determining a second total number of data objects making up the second, non-content group of data objects.
21. The apparatus of claim 18, further comprising:
means for determining, for each of said one or more unique features forming said first feature list, a first number of data objects of said first, content group of data objects that contain at least one instance of each corresponding one of said one or more unique features of said first feature list; and
means for determining, for each of said one or more unique features forming said second feature list, a second number of data objects of said second, non-content group of data objects that contain at least one instance of each corresponding one of said one or more unique features of said second feature list.
22. The apparatus of claim 21, further comprising:
means for applying a probability function to each said common feature to obtain a resulting vector, wherein said probability function comprises a ratio of the result of dividing said first data object number by said first data object total to the result of dividing said second data object number by said second data object total; and
means for ordering said common features within said ranked list based at least in part on the resulting vector of said probability function.
23. The apparatus of claim 21, wherein said exclusive features are further ranked based on said first data object number.
24. The apparatus of claim 18, wherein said means for identifying a set of salient features from said ranked feature list comprises: means for selecting the top N contiguous features of said ranked feature list, wherein N is a natural number less than the number of features in said ranked feature list.
25. The apparatus of claim 18, wherein said means for identifying a set of salient features from said ranked feature list comprises: means for selecting the last M contiguous features of said ranked feature list, wherein M is a natural number less than the number of features in said ranked feature list.
26. The apparatus of claim 18, wherein each said unique feature comprises a group made up of one or more alphanumeric characters.
27. The apparatus of claim 18, further comprising:
means for classifying a new data object into whichever of the first, content group of data objects and the second, non-content group of data objects it is most closely related to, based at least in part on said set of salient features.
28. The apparatus of claim 18, wherein the first, content group of data objects comprises those data objects corresponding to a selected node of a subject hierarchy having a plurality of nodes and to any child nodes associated with the selected node; and
wherein the second, non-content group of data objects comprises those data objects corresponding to any sibling nodes associated with the selected node and to any child nodes associated with the sibling nodes.
29. An apparatus for identifying salient features, comprising:
means for identifying one or more unique features that are members of a first data class;
means for examining a second data class to identify those of said one or more unique features that are also members of said second data class, and those of said one or more unique features that are not members of said second data class;
means for generating a ranked list of unique features, the ranked list having an order based on the membership of each of said one or more unique features in said second data class; and
means for identifying one or more unique features of said ranked list of unique features as salient.
30. The apparatus of claim 29, further comprising:
means for determining, for each unique feature of said ranked list of unique features, a number of objects of said first data class that contain each corresponding unique feature.
31. The apparatus of claim 30, wherein said means for generating the ranked list further comprises: means for ranking those of said unique features that are not members of said second data class higher in said ranked list than those of said unique features that are also members of said second data class.
32. The apparatus of claim 31, wherein said means for generating the ranked list further comprises: means for ranking those of said unique features belonging to a greater number of objects of said first data class higher in said ranked list than those of said unique features belonging to a lesser number of objects of said first data class.
33. The apparatus of claim 29, wherein said means for identifying as salient comprises: means for selecting the top N contiguous features from said ranked list of unique features, wherein N is a natural number less than the number of features in said ranked list of unique features.
34. The apparatus of claim 29, wherein said means for identifying as salient comprises: means for selecting the last M contiguous features from said ranked list of unique features, wherein M is a natural number less than the number of features in said ranked list of unique features.
CNB02829663XA 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification Expired - Fee Related CN100378713C (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2002/030457 WO2004029826A1 (en) 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification

Publications (2)

Publication Number Publication Date
CN1669023A CN1669023A (en) 2005-09-14
CN100378713C true CN100378713C (en) 2008-04-02

Family

ID=32041246

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB02829663XA Expired - Fee Related CN100378713C (en) 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification

Country Status (8)

Country Link
EP (1) EP1543437A4 (en)
JP (1) JP2006501545A (en)
CN (1) CN100378713C (en)
AU (1) AU2002334669A1 (en)
BR (1) BR0215899A (en)
CA (1) CA2500264A1 (en)
MX (1) MXPA05003249A (en)
WO (1) WO2004029826A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7576755B2 (en) 2007-02-13 2009-08-18 Microsoft Corporation Picture collage systems and methods
US8005643B2 (en) * 2007-06-26 2011-08-23 Endeca Technologies, Inc. System and method for measuring the quality of document sets
US8935249B2 (en) 2007-06-26 2015-01-13 Oracle Otc Subsidiary Llc Visualization of concepts within a collection of information
US9307107B2 (en) * 2013-06-03 2016-04-05 Kodak Alaris Inc. Classification of scanned hardcopy media
US20220309384A1 (en) * 2021-03-25 2022-09-29 International Business Machines Corporation Selecting representative features for machine learning models

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1190764A (en) * 1997-02-12 1998-08-19 富士通株式会社 Model identification equipment using candidate table making classifying and method thereof
US6006221A (en) * 1995-08-16 1999-12-21 Syracuse University Multilingual document retrieval system and method using semantic vector matching
US6018733A (en) * 1997-09-12 2000-01-25 Infoseek Corporation Methods for iteratively and interactively performing collection selection in full text searches
US6353825B1 (en) * 1999-07-30 2002-03-05 Verizon Laboratories Inc. Method and device for classification using iterative information retrieval techniques

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6233575B1 (en) * 1997-06-24 2001-05-15 International Business Machines Corporation Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values
WO2002007010A1 (en) * 2000-07-17 2002-01-24 Asymmetry, Inc. System and method for storage and processing of business information

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6006221A (en) * 1995-08-16 1999-12-21 Syracuse University Multilingual document retrieval system and method using semantic vector matching
CN1190764A (en) * 1997-02-12 1998-08-19 富士通株式会社 Model identification equipment using candidate table making classifying and method thereof
US6018733A (en) * 1997-09-12 2000-01-25 Infoseek Corporation Methods for iteratively and interactively performing collection selection in full text searches
US6353825B1 (en) * 1999-07-30 2002-03-05 Verizon Laboratories Inc. Method and device for classification using iterative information retrieval techniques

Also Published As

Publication number Publication date
BR0215899A (en) 2005-07-26
EP1543437A4 (en) 2008-05-28
JP2006501545A (en) 2006-01-12
WO2004029826A1 (en) 2004-04-08
CN1669023A (en) 2005-09-14
AU2002334669A1 (en) 2004-04-19
CA2500264A1 (en) 2004-04-08
MXPA05003249A (en) 2005-07-05
EP1543437A1 (en) 2005-06-22

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
US7971150B2 (en) Document categorisation system
US6938025B1 (en) Method and apparatus for automatically determining salient features for object classification
DE60315506T2 (en) IDENTIFICATION OF CRITICAL FEATURES IN A REGIONAL SCALE ROOM
US20020174095A1 (en) Very-large-scale automatic categorizer for web content
Noaman et al. Naive Bayes classifier based Arabic document categorization
CN107506472B (en) Method for classifying browsed webpages of students
CN107193915A (en) A kind of company information sorting technique and device
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN112307336A (en) Hotspot information mining and previewing method and device, computer equipment and storage medium
CN103778206A (en) Method for providing network service resources
CN114997288A (en) Design resource association method
JP2016218512A (en) Information processing device and information processing program
CN107908649B (en) Text classification control method
CN100378713C (en) Method and apparatus for automatically determining salient features for object classification
Triwijoyo et al. Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms
CN103714051B (en) A kind of preprocess method of waiting for translating shelves
CN110750963A (en) Method, device and storage medium for removing duplicate of news document
KR102695536B1 (en) Irregular/bad food monitoring device and method
CN103729350B (en) The preprocess method of various dimensions waiting for translating shelves
CN118569254B (en) Method and system for collecting and analyzing document data based on NLP
Bozdogan et al. Comparison of Traditional and Modern Topic Model Algorithms in Terms of Topic Determination in Official Documents
CN117556140A (en) Patent recommendation system
Röder et al., DICE, Paderborn University, Paderborn, Germany (michael.roeder@uni-paderborn.de)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: MICROSOFT TECHNOLOGY LICENSING LLC

Free format text: FORMER OWNER: MICROSOFT CORP.

Effective date: 20150429

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150429

Address after: Washington State

Patentee after: Microsoft Technology Licensing, LLC

Address before: Washington State

Patentee before: Microsoft Corp.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080402

Termination date: 20190925