CN100378713C

CN100378713C - Method and apparatus for automatically determining salient features for object classification

Info

Publication number: CN100378713C
Application number: CNB02829663XA
Authority: CN
Inventors: D. P. Lulich; F. G. Guilak
Original assignee: Microsoft Corp
Current assignee: Microsoft Technology Licensing LLC
Priority date: 2002-09-25
Filing date: 2002-09-25
Publication date: 2008-04-02
Anticipated expiration: 2022-09-25
Also published as: BR0215899A; EP1543437A4; JP2006501545A; WO2004029826A1; CN1669023A; AU2002334669A1; CA2500264A1; MXPA05003249A; EP1543437A1

Abstract

The present invention provides a method and apparatus for automatically determining salient features for object classification. According to one embodiment, one or more unique features are extracted from a first content group of objects to form a first feature list; one or more unique features are extracted from a second non-content group of objects to form a second feature list; a ranked list of features is generated by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list; and a set of salient features is identified from the resulting ranked list of features.

Description

Method and apparatus for automatically determining salient features for object classification

Background of the invention

1. Field of the invention

The present invention relates to the field of data processing. More specifically, the present invention relates to the automatic selection of features of objects for use in classifying the objects into groups.

2. Background information

The World Wide Web provides an important information resource, with estimates of billions of pages of information available for online viewing and downloading. To make efficient use of this information, however, a practical method for navigating this vast collection of data is necessary.

In the early days of Internet surfing, two basic methods were developed for online searching. In the first method, an indexed database is created from the contents of web pages gathered by automated search engines, which "crawl" the web looking for new and unique pages. This database can then be searched using various query techniques, and the results are often ranked according to their similarity to the form of the query. In the second method, web pages are grouped into a categorical hierarchy, typically presented in the form of a tree. The user then makes a series of selections while descending the hierarchy, choosing between two or more options at each level that represent salient differences between the subtrees below the decision point, and ultimately reaches leaf nodes that contain pages of text and/or multimedia content.

For example, Fig. 1 illustrates an exemplary prior art subject hierarchy 102 in which multiple decision nodes (hereinafter "nodes") 130-136 are hierarchically arranged into multiple parent and/or child nodes, each of which is associated with a unique subject category. For example, node 130 is the parent node of nodes 131 and 132, while nodes 131 and 132 are child nodes of node 130. Because nodes 131 and 132 are both child nodes of the same node (node 130), nodes 131 and 132 are said to be siblings of one another. Additional sibling pairs in subject hierarchy 102 include nodes 133 and 134, as well as nodes 135 and 136. As can be seen from Fig. 1, node 130 forms the first level 137 of subject hierarchy 102, nodes 131-132 form the second level 138 of subject hierarchy 102, and nodes 133-136 form the third level 139 of subject hierarchy 102. Additionally, node 130 is referred to as the root node of subject hierarchy 102, because it is not a child of any other node.

The process of creating a hierarchical categorization of web pages presents multiple challenges. First, the nature of the hierarchy must be defined. This is typically done manually by experts in a particular subject area, in a manner similar to the creation of categories in the Dewey Decimal System for libraries. These categories are then given descriptive labels so that users and categorizers can make appropriate decisions while navigating the hierarchy. Content, in the form of individual electronic documents for example, is then placed into the categories by means of a manual search through the classification hierarchy.

In recent years, attention has turned to automating the various stages of this process. Systems exist for automatically categorizing documents from a corpus of documents. For example, some systems use key words associated with documents to automatically cluster similar documents. Such clusters can be iteratively grouped into super-clusters, thus creating a hierarchical structure. However, these systems require manual insertion of key words and produce a hierarchy with no systematic structure. If the hierarchy is to be used for manual search, labels must be affixed to its nodes by manually examining the child nodes or leaf documents to identify common features.

Many classification systems use lists of salient words to classify documents. Typically, salient words are either predefined or selected from the documents being processed, so as to characterize the documents more accurately. These salient word lists are commonly created by counting the frequency of occurrence of all words in each of a set of documents. Words are then removed from the word lists according to one or more criteria. Often, words that occur too few times within the corpus are eliminated, because they are too rare to distinguish reliably among categories, and words that occur too frequently are also eliminated, because they are assumed to occur in documents of every category.

Furthermore, "stop words" and word stems are often eliminated from feature lists to facilitate salient feature determination. Stop words comprise words that are common in the language, such as "a", "the", "his" and "and", which are felt to carry no semantic content, whereas word stems refer to suffixes such as "ing", "is" and "able". Unfortunately, creating stop word lists and word stem lists is a language-specific task that requires expert knowledge of syntax, grammar and usage, all of which may change over time. A more flexible way of determining salient features is therefore desirable.

Brief description of the drawings

The present invention will be described by way of exemplary embodiments, but not limitations, illustrated in the accompanying drawings, in which like reference numerals denote similar elements, and in which:

Fig. 1 illustrates an exemplary prior art hierarchy comprising a plurality of decision nodes;

Fig. 2 (A-C) illustrates the operational flow of a salient feature determination function, in accordance with one embodiment of the present invention;

Fig. 3 illustrates an example application of the salient feature determination facility of the present invention, in accordance with one embodiment;

Fig. 4 is a functional block diagram illustrating the classifier training services of Fig. 3, in accordance with one embodiment of the present invention; and

Fig. 5 illustrates an example computer system suitable for use in determining salient features, in accordance with one embodiment of the present invention.

Detailed description of the invention

In the following description, various aspects of the present invention will be described. However, it will be apparent to those skilled in the art that the present invention may be practiced with only some or all of its aspects. For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the present invention. However, it will also be apparent to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known features are omitted or simplified in order not to obscure the present invention.

Parts of the description will be presented in terms of operations performed by a processor-based device, using terms such as data, storing, selecting, determining, calculating and the like, consistent with the manner commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. As is well understood by those skilled in the art, the quantities take the form of electrical, magnetic or optical signals capable of being stored, transferred and otherwise manipulated through mechanical and electrical components of the processor-based device; and the term processor includes microprocessors, microcontrollers, digital signal processors and the like, whether standalone, adjunct or embedded.

Various operations will be described as multiple discrete steps in turn, in a manner that is most helpful in understanding the present invention. However, the order of description should not be construed to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation. Further, the description repeatedly uses the phrase "in one embodiment", which does not necessarily refer to the same embodiment each time, although it may.

In accordance with one embodiment of the present invention, one or more unique features are extracted from a first group of objects to form a first feature set, and one or more unique features are extracted from a second group of objects to form a second feature set. A ranked list of features is then created by applying statistical differentiation between the unique features of the first feature set and the unique features of the second feature set. A set of salient features is then identified from the resulting ranked list of features.

In one embodiment, salient feature determination facilitates efficient classification of data objects, including but not limited to text files, image files, audio sequences and video sequences, in both proprietary and non-proprietary formats, within very-large-scale hierarchical classification trees as well as within non-hierarchical data structures such as flat files. In a text file, for example, features may take the form of words, where the term "word" is commonly understood to represent a group of letters within a given language having some semantic meaning. More generally, a feature may be an N-token gram, where a token is one atomic element of a language; examples include N-letter grams and N-word grams in English, as well as N-ideogram grams in Asian languages. In audio sequences, for example, musical notes, intonation, tempo, duration, pitch, volume and the like may be used as features for classifying the audio, whereas in video sequences and still images, various pixel attributes such as chrominance and luminance levels may be used as features. In accordance with one embodiment of the present invention, once a group of features has been identified from a group of electronic documents, for example, a subset of those features is then determined to be salient for the purpose of classifying a given group of data objects. The term "electronic document" is used broadly herein to describe a family of data objects, such as those described above, that include one or more constituent features. Although an electronic document may include text, it may similarly include audio and/or video content in place of, or in addition to, text.
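The N-token gram notion above can be illustrated with a short sketch. It assumes lowercasing and whitespace tokenization, neither of which is specified in the description, and the function names `word_ngrams` and `letter_ngrams` are hypothetical.

```python
from typing import Set


def word_ngrams(text: str, n: int) -> Set[str]:
    """Return the unique N-word grams of `text` (tokens are whitespace-separated words)."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the token list.
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def letter_ngrams(text: str, n: int) -> Set[str]:
    """Return the unique N-letter grams of `text` (tokens are single characters)."""
    chars = text.lower()
    # Slide a window of n characters across the string.
    return {chars[i:i + n] for i in range(len(chars) - n + 1)}
```

Returning a set rather than a list reflects the description's emphasis on unique features: a feature is either present in a data object or absent from it.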

Once the feature selection criteria have been determined (that is, which of the various text/audio/video attributes will serve as determinative features within a group of data objects), the salient feature determination process of the present invention may be performed. To begin the salient feature determination process, the data objects in question are divided into two groups. An equation representing the "odds of relevance" (equation (1)) is then applied to these two groups of data objects, where O(d) represents the odds that a given data object d is a member of the first group of data objects, P(R|d) represents the probability that the data object is a member of the first group, and P(R'|d) represents the probability that the data object is a member of the second group.

O(d) = \frac{P(R \mid d)}{P(R' \mid d)} \qquad (1)

Because a manual grouping of data objects does not supply the probabilities needed to compute the odds of relevance, equation (1) can be maximized to approximate this value. Accordingly, the logarithm function, together with Bayes' formula, can be applied to both sides of equation (1) to yield equation (2):

\log O(d) = \log P(d \mid R) - \log P(d \mid R') + \log P(R) - \log P(R') \qquad (2)

If the data object is then assumed to consist of a set of features {f_i}, and X_i is 1 or 0 according to whether a given feature f_i is present in or absent from the data object, then:

\log O(d) = \sum_{i} [\log P(X_i \mid R) - \log P(X_i \mid R')] + \log P(R) - \log P(R') \qquad (3)

Because \log P(R) and \log P(R') are constants, independent of which features in the data object are selected as salient, a new quantity g(d) can be defined:

g(d) = \sum_{i} [\log P(X_i \mid R) - \log P(X_i \mid R')] \qquad (4)

If p_i = P(X_i = 1 | R) represents the probability that a given feature f_i appears in a data object of the first group of data objects, and q_i = P(X_i = 1 | R') represents the probability that the given feature f_i appears in a data object of the second group of data objects, then substitution and simplification yield equation (5):

g(d) = \sum_{i} X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \sum_{i} \log \frac{1 - p_i}{1 - q_i} \qquad (5)
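The substitution that leads to equation (5) is not written out in the description. One way to reconstruct it, assuming each X_i is a binary indicator so that its probability under R (or R') takes the Bernoulli form, is:

```latex
\begin{align*}
\log P(X_i \mid R)  &= X_i \log p_i + (1 - X_i) \log (1 - p_i) \\
\log P(X_i \mid R') &= X_i \log q_i + (1 - X_i) \log (1 - q_i) \\
\log P(X_i \mid R) - \log P(X_i \mid R')
  &= X_i \left[ \log \frac{p_i}{1 - p_i} - \log \frac{q_i}{1 - q_i} \right]
     + \log \frac{1 - p_i}{1 - q_i} \\
  &= X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \log \frac{1 - p_i}{1 - q_i}
\end{align*}
```

Summing the last line over i gives the two summations of equation (5).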

Because the second summation does not depend on whether the features occur in the data object, it can be dropped, which yields equation (6):

\log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \qquad (6)

Because the logarithm function is monotonic, maximizing the ratio of equation (7)

\frac{p_i (1 - q_i)}{q_i (1 - p_i)} \qquad (7)

suffices to maximize the corresponding logarithm. In accordance with one embodiment of the present invention, equation (7) is applied to each feature in the combined feature list of the two groups of data objects to facilitate identification of salient features. To do so, p_i is calculated as the number of data objects in the first group of data objects that contain feature f_i at least once, divided by the total number of data objects in the first group. Likewise, q_i is calculated as the number of data objects in the second group of data objects that contain feature f_i at least once, divided by the total number of data objects in the second group.
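The calculation of p_i, q_i and the ratio of equation (7) might be sketched as follows. Each data object is represented as a set of features. The names `presence_probability` and `odds_ratio` are hypothetical, and the handling of a zero denominator is an assumption of this sketch, since the description does not say how that case is evaluated numerically.

```python
import math
from typing import List, Set


def presence_probability(feature: str, group: List[Set[str]]) -> float:
    """Fraction of data objects in `group` that contain `feature` at least once."""
    if not group:
        raise ValueError("group must contain at least one data object")
    containing = sum(1 for features in group if feature in features)
    return containing / len(group)


def odds_ratio(p: float, q: float) -> float:
    """Ratio of equation (7): p(1 - q) / (q(1 - p))."""
    numerator = p * (1.0 - q)
    denominator = q * (1.0 - p)
    if denominator == 0.0:
        # Assumption: treat a positive numerator over zero as infinitely
        # discriminating, and the undefined 0/0 case as neutral (1.0).
        return math.inf if numerator > 0.0 else 1.0
    return numerator / denominator
```

For example, a feature found in three of four objects in the first group (p = 0.75) and one of four objects in the second group (q = 0.25) scores 9.0.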

Fig. 2 (A-C) illustrates the operational flow of a salient feature determination function, in accordance with one embodiment of the present invention. To begin, a first set of data objects is examined to create a feature list consisting of the unique features present in one or more data objects of at least the first set of data objects, block 210. For each unique feature identified, equation (7) is applied to generate a ranked list of features, block 220, and at least a subset of the ranked list of features is chosen as salient features, block 230. The salient features may comprise one or more contiguous or non-contiguous groups of elements selected from the ranked list of features. In one embodiment, the first N elements of the ranked list of features are chosen as salient, where N may vary depending upon the requirements of the system. In an alternative embodiment, the last M elements of the ranked list of features are chosen as salient, where M may likewise vary depending upon the requirements of the system.
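The two selection variants of block 230 amount to slicing the ranked list. A minimal sketch, with the hypothetical helper names `first_n` and `last_m`:

```python
from typing import List, Sequence


def first_n(ranked: Sequence[str], n: int) -> List[str]:
    """Choose the first N elements of the ranked list as salient."""
    return list(ranked[:n])


def last_m(ranked: Sequence[str], m: int) -> List[str]:
    """Choose the last M elements of the ranked list as salient."""
    # ranked[-0:] would return the whole list, so M = 0 is handled separately.
    return list(ranked[-m:]) if m > 0 else []
```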

In accordance with one embodiment of the present invention, in creating the feature list (block 210), the total number of data objects contained in each group of data objects is determined, block 212, and for each unique feature identified in at least the first group of data objects, the total number of data objects containing that unique feature is also determined, block 214. Additionally, the list of unique features may be filtered according to various criteria, block 216. For example, the list of unique features may be pruned to remove those features that are not found in at least some minimum number of data objects, those features that are shorter than some established minimum length, and/or those features that occur fewer than some threshold number of times.
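Blocks 212 to 216 could be sketched as below, again representing each data object as a set of features. The function names and the default thresholds (`min_objects=2`, `min_length=3`) are illustrative assumptions; the description leaves the actual criteria open.

```python
from collections import Counter
from typing import Dict, List, Set


def document_counts(group: List[Set[str]]) -> Counter:
    """For each unique feature, count the data objects that contain it (block 214)."""
    counts: Counter = Counter()
    for features in group:
        # Each object is a set, so a feature is counted once per object.
        counts.update(features)
    return counts


def filter_features(counts: Dict[str, int],
                    min_objects: int = 2,
                    min_length: int = 3) -> Dict[str, int]:
    """Prune features found in too few objects or shorter than a minimum length (block 216)."""
    return {
        feature: count
        for feature, count in counts.items()
        if count >= min_objects and len(feature) >= min_length
    }
```

The group total of block 212 is simply `len(group)` in this representation.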

In accordance with one embodiment of the present invention, applying statistical differentiation to obtain the ranked list of features, as described with respect to block 220 of Fig. 2A, further includes the processes illustrated in Fig. 2C. That is, in applying statistical differentiation (as represented by equation (7)), a determination is made as to which of the unique features identified in the first set of data objects are also present in the second set of data objects, block 221, and likewise which of the unique features identified in the first set of data objects are not present in the second set, block 222. In accordance with the illustrated embodiment, those features that are determined to be present in one set of data objects but not in the other are assigned a relatively higher rank in the ranked list of features, block 223, whereas those features that are determined to be present in both sets of data objects are assigned a relatively lower rank, as determined through statistical differentiation (equation (7)), block 224. Optionally, the features within the ranked list of features may be further ranked according to the total number of data objects that contain each respective feature.
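The ranking of blocks 221 to 224 might look like the following sketch. It takes per-feature object counts and group totals as input. Ordering exclusive features by object count implements the optional further ranking; breaking ties alphabetically and treating a zero denominator as described in the comments are assumptions of this sketch, and the name `rank_features` is hypothetical.

```python
import math
from typing import Dict, List, Tuple


def rank_features(content_counts: Dict[str, int], content_total: int,
                  other_counts: Dict[str, int], other_total: int) -> List[str]:
    """Rank first-group features: exclusive features first, then common ones by equation (7)."""
    exclusive: List[Tuple[float, str]] = []
    common: List[Tuple[float, str]] = []
    for feature, n_content in content_counts.items():
        n_other = other_counts.get(feature, 0)
        if n_other == 0:
            # Blocks 222/223: absent from the second group, so ranked higher.
            exclusive.append((float(n_content), feature))
            continue
        # Blocks 221/224: present in both groups, scored with equation (7).
        p = n_content / content_total
        q = n_other / other_total
        numerator = p * (1.0 - q)
        denominator = q * (1.0 - p)
        if denominator == 0.0:
            # Assumption: positive/0 is maximally discriminating, 0/0 is neutral.
            score = math.inf if numerator > 0.0 else 1.0
        else:
            score = numerator / denominator
        common.append((score, feature))
    # Higher score first; ties broken alphabetically for a deterministic order.
    exclusive.sort(key=lambda item: (-item[0], item[1]))
    common.sort(key=lambda item: (-item[0], item[1]))
    return [feature for _, feature in exclusive] + [feature for _, feature in common]
```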

Example application

Reference is now made to Fig. 3, which shows an example application of the salient feature determination facility of the present invention, in accordance with one embodiment. As illustrated, classifier 300 is provided to efficiently classify data objects, such as electronic documents including but not limited to text files, image files, audio sequences and video sequences, in both proprietary and non-proprietary formats, within very-large-scale hierarchical classification trees as well as within flat file formats. Classifier 300 includes classifier training services 305, for training classifier 300 to categorize new data objects according to classification rules extracted from a previously categorized data hierarchy, as well as classifier categorization services 315, for categorizing new data objects input into classifier 300.

Classifier training services 305 include aggregation function 306, salient feature determination function 308 of the present invention, and node characterization function 309. In accordance with the illustrated embodiment, content from the previously categorized data hierarchy is aggregated at each node of the hierarchy by aggregation function 306 to form both content and non-content groups of data. Features are then extracted from each group of data, and salient feature determination function 308 determines which subset of those features is salient. Node characterization function 309 characterizes each node of the previously categorized data hierarchy based upon the salient features, and stores these hierarchical characterizations in data store 310, for example, for further use by classifier categorization services 315.

Additional information regarding classifier 300, including classifier training services 305 and classifier categorization services 315, is described in the contemporaneously filed U.S. patent application numbered ＜＜51026, P004〉〉, entitled "Very-Large-Scale Automatic Categorizer For Web Content", which is commonly assigned to the assignee of the present application and is fully incorporated herein by reference.

Classifier training services

Fig. 4 illustrates a functional block diagram of classifier training services 305 of Fig. 3, in accordance with one embodiment of the present invention. As illustrated in Fig. 4, previously categorized data hierarchy 402 is provided as input to classifier training services 305 of classifier 300. Previously categorized data hierarchy 402 represents a set of data objects, such as audio, video and/or text objects, which have previously been classified and categorized into a subject hierarchy (typically through manual means). Previously categorized data hierarchy 402 may represent, for example, one or more sets of electronic documents previously categorized by a web portal or search engine.

In accordance with the illustrated example, aggregation function 406 aggregates content from previously categorized data hierarchy 402 into content and non-content groups, so as to increase the differentiation between sibling nodes at each level of the hierarchy. Salient feature determination function 408 extracts features from the content and non-content groups of data and determines which of the extracted features (409) are to be considered salient (409').
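The aggregation step can be sketched on a small tree. Following the grouping recited in claim 11, the content group of a selected node is taken to be the data objects of that node and its descendants, and the non-content group is taken to be the data objects of its sibling nodes and their descendants. The `Node` class and the function names are hypothetical and are not taken from the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Node:
    """One node of a subject hierarchy, holding the data objects filed under it."""
    name: str
    documents: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def subtree_documents(node: Node) -> List[str]:
    """Collect the data objects of a node and all of its descendants."""
    docs = list(node.documents)
    for child in node.children:
        docs.extend(subtree_documents(child))
    return docs


def aggregate(node: Node) -> Tuple[List[str], List[str]]:
    """Return (content group, non-content group) for the selected node."""
    content = subtree_documents(node)
    non_content: List[str] = []
    if node.parent is not None:
        for sibling in node.parent.children:
            if sibling is not node:
                non_content.extend(subtree_documents(sibling))
    return content, non_content
```

In this sketch the root node has no siblings, so its non-content group is empty.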

Additionally, in accordance with the illustrated example, node characterization function 309 of Fig. 3 operates to characterize the content and non-content groups of data. In one embodiment, the content and non-content groups of data are characterized based upon the determined salient features. In one embodiment, the characterizations are stored in data store 310, which can be implemented in the form of any of a number of data structures, such as a database, a directory structure, or a simple lookup table. In one embodiment of the invention, the parameters of the classifier for each node are stored in a hierarchical categorization tree whose file structure mirrors that of the previously categorized data hierarchy.

Example computer system

Fig. 5 illustrates an example computer system suitable for use in determining salient features, in accordance with one embodiment of the present invention. As shown, computer system 500 includes one or more processors 502 and system memory 504. Additionally, computer system 500 includes mass storage devices 506 (such as diskette, hard drive, CDROM and so forth), input/output devices 508 (such as keyboard, cursor control and so forth) and communication interfaces 510 (such as network interface cards, modems and so forth). The elements are coupled to each other via system bus 512, which represents one or more buses. Where system bus 512 represents multiple buses, they are bridged by one or more bus bridges (not shown).

Each of these elements performs its conventional functions known in the art. In particular, system memory 504 and mass storage 506 are employed to store a working copy and a permanent copy of the programming instructions implementing the categorization system of the present invention. The permanent copy of the programming instructions may be loaded into mass storage 506 in the factory, or in the field, as described earlier, through a distribution medium (not shown) or through communication interface 510 (from a distribution server (not shown)). The constitution of these elements 502-512 is known, and accordingly will not be further described.

Conclusion and epilogue

Thus, it can be seen from the above description that a novel method and apparatus for automatically determining salient features for object classification has been described. While the present invention has been described in terms of the above embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The present invention can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative of, rather than restrictive on, the present invention.

Claims

1. A method comprising:

extracting one or more unique features from a first content group of data objects to form a first feature list;

extracting one or more unique features from a second non-content group of data objects to form a second feature list;

generating a ranked list of features by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list; and

identifying a set of salient features from the ranked list of features,

wherein generating the ranked list comprises:

identifying those unique features of the first feature list that do not appear in the second feature list as exclusive features;

identifying those unique features of the first feature list that also appear in the second feature list as common features; and

ordering the ranked list such that the exclusive features are ranked higher in the ranked list than the common features.

2. the method for claim 1 is characterized in that, each in the second non-content group of the first content group of described data object and described data object all comprises one or more electronic documents.

3. the method for claim 1 is characterized in that, also comprises:

Determine the first data object sum of the first content group of composition data object; And

Determine the second data object sum of the second non-content group of composition data object.

4. The method of claim 3, characterized in that it further comprises:

determining, for each of the one or more unique features forming the first feature list, a first data object count of the data objects in the first content group of data objects that contain at least one instance of the respective unique feature of the first feature list; and

determining, for each of the one or more unique features forming the second feature list, a second data object count of the data objects in the second non-content group of data objects that contain at least one instance of the respective unique feature of the second feature list.

5. The method of claim 4, characterized in that it further comprises:

applying a probability function to each of the common features to obtain a result vector, wherein the probability function comprises the ratio of the first data object count divided by the first data object total to the second data object count divided by the second data object total; and

ordering the common features in the ranked list based at least in part on the result vector of the probability function.

6. The method of claim 4, characterized in that the exclusive features are further ranked based on the first data object count.

7. The method of claim 1, characterized in that identifying the set of salient features from the ranked list of features comprises selecting the first N consecutive features in the ranked list of features, where N is a natural number less than the number of features in the ranked list of features.

8. The method of claim 1, characterized in that identifying the set of salient features from the ranked list of features comprises selecting the last M consecutive features in the ranked list of features, where M is a natural number less than the number of features in the ranked list of features.

9. The method of claim 1, characterized in that each unique feature comprises a group of one or more alphanumeric characters.

10. the method for claim 1 is characterized in that, also comprises:

At least in part based on described distinguishing feature collection, become one relation in the described second non-content group with the described first content group of data object and data object the closest a new data-object classifications.

11. the method for claim 1 is characterized in that, the described first content group of data object comprises those data objects corresponding to the node of selecting in the theme hierarchy with a plurality of nodes and any child node of being associated with the node of selecting; And

Wherein, described second of the data object non-content group comprises those data objects corresponding to any brotgher of node that is associated with the node of selecting and any child node of being associated with the brotgher of node.

12. A method of identifying salient features, the method comprising:

identifying one or more unique features that are members of a first data class;

examining a second data class to identify those of the one or more unique features that are also members of the second data class, and those of the one or more unique features that are not members of the second data class;

generating a ranked list of unique features, the ranked list having an order based on the membership of each of the one or more unique features in the second data class; and

identifying one or more unique features in the ranked list of unique features as salient.

13. The method of claim 12, characterized in that it further comprises:

determining, for each unique feature in the ranked list of unique features, the number of objects in the first data class that contain the respective unique feature.

14. The method of claim 13, characterized in that generating the ranked list further comprises ranking those of the unique features that are not members of the second data class higher in the ranked list than those of the unique features that are also members of the second data class.

15. The method of claim 14, characterized in that generating the ranked list further comprises ranking those of the unique features that belong to a greater number of objects of the first data class higher in the ranked list than those of the unique features that belong to a lesser number of objects of the first data class.

16. method as claimed in claim 12 is characterized in that, is identified as significantly to comprise: select the continuous characteristics of top n from the grading list of described unique features, wherein N is the natural number of the characteristics number in the described ranked list of features.

17. The method as claimed in claim 12, characterized in that identifying as salient comprises: selecting the last M contiguous features from the ranked list of unique features, where M is a natural number not exceeding the number of features in the ranked list of features.
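
By way of illustration only, and not as part of the claims, the ordering recited in claims 12 to 17 might be sketched as follows. The function names, the representation of each object as a set of features, and the alphabetical tie-break are assumptions made for this sketch; the claims do not prescribe any of them.

```python
from collections import Counter


def rank_unique_features(first_class, second_class):
    """Rank the unique features of first_class (hypothetical helper).

    Each class is a list of objects; each object is a set of features.
    """
    # Claim 13: number of first-class objects containing each feature.
    counts = Counter(f for obj in first_class for f in set(obj))
    # Every feature that is a member of the second data class.
    in_second = set().union(*second_class)
    # Claim 14: non-members of the second class rank first (False < True).
    # Claim 15: within each group, more first-class objects ranks higher.
    # The trailing alphabetical key is an assumed tie-break.
    return sorted(counts, key=lambda f: (f in in_second, -counts[f], f))


def first_n(ranked, n):
    """Claim 16: the first N contiguous features."""
    return ranked[:n]


def last_m(ranked, m):
    """Claim 17: the last M contiguous features."""
    return ranked[-m:] if m > 0 else []
```

With such a sketch, a feature found only in the first data class always precedes a feature shared with the second data class, regardless of how many objects contain it.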

18. An apparatus, comprising:

Means for extracting one or more unique features from a first content group of data objects to form a first feature list;

Means for extracting one or more unique features from a second non-content group of data objects to form a second feature list;

Means for generating a ranked list of features by applying a statistical differentiation method between the unique features of the first feature list and the unique features of the second feature list; and

Means for identifying a set of salient features from the ranked list of features,

Wherein the means for generating the ranked list comprises:

Means for identifying as exclusive features those unique features of the first feature list that do not appear in the second feature list;

Means for identifying as common features those unique features of the first feature list that also appear in the second feature list; and

Means for ordering the ranked list so that the exclusive features rank higher in the ranked list than the common features.

19. The apparatus as claimed in claim 18, characterized in that each of the first content group of data objects and the second non-content group of data objects comprises one or more electronic documents.

20. The apparatus as claimed in claim 18, characterized in that it further comprises:

Means for determining a first total number of data objects making up the first content group of data objects; and

Means for determining a second total number of data objects making up the second non-content group of data objects.

21. The apparatus as claimed in claim 18, characterized in that it further comprises:

Means for determining, for each of the one or more unique features making up the first feature list, a first number of data objects in the first content group of data objects that contain at least one instance of the respective unique feature; and

Means for determining, for each of the one or more unique features making up the second feature list, a second number of data objects in the second non-content group of data objects that contain at least one instance of the respective unique feature.
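
By way of illustration only, the totals of claim 20 and the per-feature object counts of claim 21 might be computed as in the sketch below. The function name and the representation of each data object as a list of feature tokens are assumptions made for this sketch.

```python
from collections import Counter


def count_objects_per_feature(group):
    """Return (total number of objects, feature -> number of objects).

    `group` is a list of data objects; each object is a list of feature
    tokens (an assumed representation).
    """
    counts = Counter()
    for obj in group:
        # set() makes an object count once per feature, however many
        # instances of that feature it contains ("at least one instance").
        counts.update(set(obj))
    return len(group), counts
```

The same helper would be applied once to the content group and once to the non-content group to obtain the first and second totals and counts.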

22. The apparatus as claimed in claim 21, characterized in that it further comprises:

Means for applying a probability function to each of the common features to obtain a result vector, wherein the probability function comprises the ratio of the first number of data objects divided by the first total number of data objects to the second number of data objects divided by the second total number of data objects; and

Means for ordering the common features in the ranked list based at least in part on the result vector of the probability function.
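
By way of illustration only, the probability function of claim 22 might be sketched as below. The function names are hypothetical, the counts are assumed to be plain dictionaries mapping each feature to its number of objects, and sorting in descending order of the ratio is an assumption; the claim requires only that the ordering be based at least in part on the result vector.

```python
def probability_ratio(n1, total1, n2, total2):
    """(n1 / total1) / (n2 / total2), the ratio recited in claim 22."""
    return (n1 / total1) / (n2 / total2)


def order_common_features(counts1, total1, counts2, total2):
    """Order features present in both count dictionaries by their ratio."""
    # Common features appear in both lists, so counts2[f] is never zero.
    common = counts1.keys() & counts2.keys()
    scores = {
        f: probability_ratio(counts1[f], total1, counts2[f], total2)
        for f in common
    }
    # Assumed ordering: largest ratio first, alphabetical tie-break.
    ordered = sorted(common, key=lambda f: (-scores[f], f))
    return ordered, scores
```

A ratio above 1 indicates a feature that occurs in a larger share of the content group than of the non-content group.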

23. The apparatus as claimed in claim 21, characterized in that the exclusive features are further ranked based on the first number of data objects.

24. The apparatus as claimed in claim 18, characterized in that the means for identifying the set of salient features from the ranked list of features comprises: means for selecting the first N contiguous features in the ranked list of features, where N is a natural number not exceeding the number of features in the ranked list of features.

25. The apparatus as claimed in claim 18, characterized in that the means for identifying the set of salient features from the ranked list of features comprises: means for selecting the last M contiguous features in the ranked list of features, where M is a natural number not exceeding the number of features in the ranked list of features.

26. The apparatus as claimed in claim 18, characterized in that each unique feature comprises a group consisting of one or more alphanumeric characters.
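
By way of illustration only, unique features of the kind recited in claim 26 might be extracted from the text of an electronic document as in the sketch below. The function name, the restriction to ASCII letters and digits, and the lowercasing step are assumptions made for this sketch; the claim specifies only groups of one or more alphanumeric characters.

```python
import re


def extract_unique_features(text):
    """Return the set of alphanumeric groups found in `text`."""
    # Lowercasing is an assumed normalization, not recited in the claims.
    return set(re.findall(r"[A-Za-z0-9]+", text.lower()))
```

Returning a set means that each group of characters contributes one unique feature per document, however often it recurs.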

27. The apparatus as claimed in claim 18, characterized in that it further comprises:

Means for classifying a new data object, based at least in part on the set of salient features, into whichever of the first content group of data objects and the second non-content group of data objects it is most closely related to.

28. The apparatus as claimed in claim 18, characterized in that the first content group of data objects comprises those data objects corresponding to a selected node of a subject hierarchy having a plurality of nodes, and to any child nodes associated with the selected node; and

29. An apparatus for identifying salient features, comprising:

Means for identifying one or more unique features that are members of a first data class;

Means for examining a second data class to identify those of the one or more unique features that are also members of the second data class, and those of the one or more unique features that are not members of the second data class;

Means for generating a ranked list of unique features, the ranked list having an order based on the membership of each of the one or more unique features in the second data class; and

Means for identifying one or more unique features in the ranked list of unique features as salient.

30. The apparatus as claimed in claim 29, characterized in that it further comprises:

Means for determining, for each unique feature in the ranked list of unique features, the number of objects in the first data class that contain that unique feature.

31. The apparatus as claimed in claim 30, characterized in that the means for generating the ranked list further comprises: means for ranking those unique features that are not members of the second data class higher in the ranked list than those unique features that are also members of the second data class.

32. The apparatus as claimed in claim 31, characterized in that the means for generating the ranked list further comprises: means for ranking those unique features that belong to a greater number of objects of the first data class higher in the ranked list than those unique features that belong to a lesser number of objects of the first data class.

33. The apparatus as claimed in claim 29, characterized in that the means for identifying as salient comprises: means for selecting the first N contiguous features from the ranked list of unique features, where N is a natural number not exceeding the number of features in the ranked list of features.

34. The apparatus as claimed in claim 29, characterized in that the means for identifying as salient comprises: means for selecting the last M contiguous features from the ranked list of unique features, where M is a natural number not exceeding the number of features in the ranked list of features.