CN1669023A - Method and apparatus for automatically determining salient features for object classification - Google Patents

Method and apparatus for automatically determining salient features for object classification

Info

Publication number
CN1669023A
Authority
CN
China
Prior art keywords
unique features
list
data object
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA02829663XA
Other languages
Chinese (zh)
Other versions
CN100378713C (en)
Inventor
D·P·卢力奇
F·G·吉拉克
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Publication of CN1669023A
Application granted
Publication of CN100378713C
Anticipated expiration
Legal status: Expired - Fee Related (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/313 - Selection or weighting of terms for indexing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Processing (AREA)

Abstract

A method and apparatus for automatically determining salient features (308) for object classification is provided. In accordance with one embodiment, one or more unique features are extracted from a first content group of objects to form a first feature list, and one or more unique features are extracted from a second anti-content group of objects to form a second feature list. A ranked list of features is then created by applying statistical differentiation between unique features of the first feature list and unique features of the second feature list. A set of salient features (308) is then identified from the resulting ranked list of features.

Description

Method and Apparatus for Automatically Determining Salient Features for Object Classification
Background of the Invention
1. Field of the Invention
The present invention relates to the field of data processing. More specifically, the present invention relates to the automatic selection of salient features of objects for use in object classification.
2. Background Information
The World Wide Web provides an important source of information, with billions of pages of information estimated to be available for online reading or download. However, to make effective use of this information, practical methods are needed for navigating this mass of data.
In the early days of Web navigation, two basic approaches to online searching were developed. In the first approach, an index database is produced from web page content gathered by automated search engines that "crawl" the Web looking for new and unique pages. The database can then be searched using a variety of query techniques, and the results are typically ranked by their similarity to the form of the query. In the second approach, web pages are grouped into a hierarchy, often presented in the form of a tree. The user then descends the hierarchy by making a series of selections, choosing at each level between two or more alternatives whose underlying subtrees represent significantly different subjects, until finally reaching a leaf node containing a page of text and/or multimedia content.
For example, Fig. 1 illustrates a typical prior-art hierarchy 102 in which a number of decision nodes (hereinafter "nodes") 130-136 are hierarchically arranged into parent and child nodes, with each node associated with a unique subject classification. For example, node 130 is the parent of nodes 131 and 132, and nodes 131 and 132 are in turn children of node 130. Because nodes 131 and 132 are children of the same node (node 130), they are siblings of each other. Other sibling pairs in subject hierarchy 102 include nodes 133 and 134, and nodes 135 and 136. As can be seen from Fig. 1, node 130 forms the first level 137 of subject hierarchy 102, nodes 131-132 form the second level 138 of subject hierarchy 102, and nodes 133-136 form the third level 139 of subject hierarchy 102. In addition, node 130 is considered the root node of subject hierarchy 102 because it is not the child of any other node.
The process of hierarchically classifying web pages faces several challenges. First, the nature of the hierarchy must be defined. This is usually done manually by experts in specialized fields, somewhat like the Dewey decimal classification used for libraries. These classifications are then given captions or labels so that users and classifiers can make appropriate decisions when navigating the hierarchy. Content presented, for example, in the form of individual electronic documents can then be placed into the various categories of the classification system using manual search methods.
In recent years attention has turned to automating each stage of this process. Systems now exist that automatically classify documents from a batch of documents. For example, some systems apply word-association relationships among documents to automatically cluster similar documents into groups. These groups can in turn be repeatedly clustered into super-groups, thereby producing a hierarchy; however, such systems require manually supplied keywords, and what they produce is a hierarchy without a systematic structure. If such a hierarchy is to be used for manual searching, the child nodes or leaf documents must be inspected by hand to identify common features so that the nodes of the hierarchy can be labeled.
Many classification systems use word lists to classify documents. Typically, salient words may be defined in advance or may be selected from the documents being processed so as to characterize the documents more accurately. Such salient word lists are generally produced using the frequency of occurrence of every word in each document of a group of documents. Words are then removed from the word list according to one or more criteria. Often, words that occur too few times within a collection of documents are discarded, because they occur too rarely to distinguish classes reliably; words that occur too frequently are also discarded, because they appear in documents of every class.
Moreover, "stop words" and stems are often removed from feature lists to better facilitate the determination of salient features. Stop words include common words of the language, such as "a", "the", "his" and "and", which are felt to carry no semantic content, while stems refer to suffixes such as "ing", "is" and "able". Unfortunately, generating stop-word lists and stem lists is a language-specific task requiring expert knowledge of the grammar, the documents and the idiom, all of which may change over time. A more flexible method of determining salient features is therefore required.
Brief Description of the Drawings
The present invention will be described by way of exemplary embodiments, but not limitation, illustrated in the accompanying drawings in which like references denote similar elements, and in which:
Fig. 1 illustrates an exemplary prior-art hierarchy comprising a number of decision nodes;
Fig. 2 (A-C) illustrates the operational flow of the salient feature determination function in accordance with one embodiment of the present invention;
Fig. 3 illustrates an example application of the salient feature determination apparatus of the present invention, in accordance with one embodiment;
Fig. 4 is a functional block diagram illustrating the classifier training service of Fig. 3, in accordance with one embodiment of the present invention; and
Fig. 5 illustrates a computer system suitable for use in determining salient features, in accordance with one embodiment of the present invention.
Detailed Description of the Invention
Various aspects of the present invention will be described below. It will be apparent to those skilled in the art, however, that the present invention may be practiced with only some or all of its aspects. For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the present invention. It will also be apparent to those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known features are omitted or simplified in order not to obscure the present invention.
Parts of the description will be presented in terms of operations performed by a processor-based device, using terms such as data, storing, selecting, determining, calculating and the like, consistent with the manner commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. As is well understood by those skilled in the art, these quantities take the form of electrical, magnetic or optical signals capable of being stored, transferred and otherwise manipulated through the mechanical and electrical components of the processor-based device; and the term processor includes microprocessors, microcontrollers, digital signal processors and the like, whether standalone, adjunct or embedded.
Various operations will be described as multiple discrete steps in turn, in a manner that is most helpful in understanding the present invention; however, the order of description should not be construed as implying that these operations are necessarily order dependent. In particular, the operations need not be performed in the order of presentation. Further, the description repeatedly uses the phrase "in one embodiment", which does not necessarily refer to the same embodiment, although it may.
In accordance with one embodiment of the present invention, one or more unique features are extracted from a first group of objects to form a first feature list, and one or more unique features are extracted from a second group of objects to form a second feature list. A ranked list of features is then created by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list. A set of salient features can then be identified from the resulting ranked list of features.
In one embodiment, the determination of salient features facilitates the efficient classification of data objects, including, but not limited to, text documents, image documents, audio sequences and video sequences, whether organized in very-large-scale hierarchical classification trees or in non-hierarchical data structures such as flat documents, and whether in proprietary or non-proprietary form. In a text document, for example, the features may take the form of words, where the term "word" is generally understood to represent a group of letters in a given language having some semantic meaning. More generally, a feature may be an N-token gram, where a token is a small element of the language; examples include N-letter grams and N-word grams in English, and N-ideogram grams in Asian languages. Similarly, in an audio sequence, tone, tempo, sustain, pitch, volume and the like may all serve as features for classifying sound, while in video sequences and still images, pixel properties such as angle and intensity level may serve as features. In accordance with one embodiment of the present invention, once a group of features has been identified from a group of, for example, electronic documents, a subset of those features can be determined to be salient with respect to the classification of a given group of data objects. The term "electronic document" is used broadly herein to describe a family of data objects, such as those described above, containing one or more constituent features. Although an electronic document may contain text, it may equally contain audio and/or video content, whether in place of or in addition to text.
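As an illustration of the word and N-token-gram features described above (not part of the patent itself), a minimal Python sketch might extract the unique features of a text document as follows; the tokenization regex and helper names are assumptions made for illustration only:

```python
import re

def word_features(text):
    """Unique 1-word grams: lower-cased alphanumeric tokens of the document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def n_token_grams(text, n):
    """Unique N-token grams, each gram represented as a tuple of N tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

doc = "The classifier ranks salient features; salient features drive classification."
print(word_features(doc))     # unique 1-word grams
print(n_token_grams(doc, 2))  # unique 2-word grams, e.g. ('salient', 'features')
```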
Once the criteria for feature selection have been determined (that is, which distinct text/audio/video attributes are to be focused upon as the determining features of the data objects), the salient feature determination process of the present invention can be carried out. At the start of the salient feature determination process, the data objects under consideration are divided into two groups. An equation expressing the odds of membership (equation (1)) is then applied to the two groups of data objects, where O(d) represents the odds that a given data object is a member of the first group of data objects, P(R|d) represents the probability that the data object is a member of the first group, and P(R'|d) represents the probability that the data object is a member of the second group:
$$O(d) = \frac{P(R \mid d)}{P(R' \mid d)} \qquad (1)$$
Because the manual grouping of the data objects does not provide the probabilities needed to compute the odds directly, equation (1) must be estimated. Accordingly, taking the logarithm of both sides of equation (1) and applying Bayes' rule gives equation (2):
$$\log O(d) = \log P(d \mid R) - \log P(d \mid R') + \log P(R) - \log P(R') \qquad (2)$$
Assuming a data object is composed of a set of features {f_i}, and letting X_i be 1 or 0 according to whether a given feature f_i is or is not present in the data object, then
$$\log O(d) = \sum_i \left[\, \log P(X_i \mid R) - \log P(X_i \mid R') \,\right] + \log P(R) - \log P(R') \qquad (3)$$
Because log P(R) and log P(R') are constants, independent of which features are selected as salient for the data objects, a new quantity g(d) can be defined:
$$g(d) = \sum_i \left[\, \log P(X_i \mid R) - \log P(X_i \mid R') \,\right] \qquad (4)$$
Letting p_i = P(X_i = 1 | R) denote the probability that a given feature f_i appears in a data object of the first group of data objects, and q_i = P(X_i = 1 | R') denote the probability that the feature f_i appears in a data object of the second group of data objects, substitution and simplification yield equation (5):
$$g(d) = \sum_i X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \sum_i \log \frac{1 - p_i}{1 - q_i} \qquad (5)$$
Because the second summation does not depend on which features actually appear in a data object, it can be dropped, leaving as the contribution of each feature the quantity of equation (6):
$$\log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \qquad (6)$$
Because the logarithm is a monotonic function, maximizing the ratio of equation (7),
$$\frac{p_i (1 - q_i)}{q_i (1 - p_i)} \qquad (7)$$
is sufficient to maximize the corresponding logarithm. In accordance with one embodiment of the present invention, equation (7) is applied to each feature in the combined feature list of the two groups of data objects to facilitate the identification of salient features. To this end, p_i is calculated as the number of data objects in the first group of data objects that contain feature f_i at least once, divided by the total number of data objects in the first group; likewise, q_i is calculated as the number of data objects in the second group of data objects that contain feature f_i at least once, divided by the total number of data objects in the second group.
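As an illustration only, a minimal Python sketch of the p_i, q_i and equation (7) computation might look as follows, assuming each data object has already been reduced to its set of unique features; the epsilon clamping is an added safeguard against division by zero and is not specified in the patent:

```python
def salience_ratio(feature, content_docs, anti_content_docs, eps=1e-6):
    """Equation (7): p_i(1 - q_i) / (q_i(1 - p_i)).

    p_i = fraction of content-group objects containing the feature at least once,
    q_i = fraction of anti-content-group objects containing the feature at least once.
    Each document is represented as a set of unique features.
    """
    p = sum(feature in doc for doc in content_docs) / len(content_docs)
    q = sum(feature in doc for doc in anti_content_docs) / len(anti_content_docs)
    # Clamp away from 0 and 1 so the ratio stays finite (added safeguard).
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return (p * (1.0 - q)) / (q * (1.0 - p))
```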
Fig. 2 (A-C) illustrates the operational flow of the salient feature determination function in accordance with one embodiment of the present invention. To begin, a first group of data objects is examined to generate a feature list composed of the unique features appearing in at least one or more of the data objects of the first group of data objects, block 210. Equation (7) is then applied to each identified unique feature to generate a ranked feature list, block 220, and at least a subset of the ranked feature list is selected as the salient features, block 230. The salient features may comprise one or more adjacent or non-adjacent sets of elements selected from the ranked list of features. In one embodiment, the top N elements of the ranked list of features are selected as salient, where N may vary according to the needs of the system. In another embodiment, the last M elements of the ranked list of features are selected as salient, where M likewise varies according to the needs of the system.
In accordance with one embodiment of the present invention, as the feature list is generated (block 210), the total number of data objects contained in each group of data objects is determined, block 212, and, for at least each unique feature identified in the first group of data objects, the total number of data objects containing that feature is also determined, block 214. In addition, the list of unique features may be filtered according to various desired criteria, block 216. For example, the list of unique features may be pruned to remove features that are not found in at least a minimum number of data objects, features determined to be shorter than a certain minimum length, and/or features that occur fewer times than a set quota.
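A minimal Python sketch of blocks 212-216, again assuming each data object is represented as a set of unique features; the specific thresholds are illustrative placeholders, since the text leaves them system-dependent:

```python
from collections import Counter

def build_feature_list(docs, min_doc_count=2, min_length=3):
    """Count, for each unique feature, the number of documents that contain it
    (blocks 212-214), then filter out features found in too few documents or
    shorter than a minimum length (block 216)."""
    doc_counts = Counter()
    for features in docs:            # each doc is a set of unique features
        doc_counts.update(features)
    return {
        feature: count
        for feature, count in doc_counts.items()
        if count >= min_doc_count and len(feature) >= min_length
    }
```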
In accordance with one embodiment of the present invention, applying statistical differentiation to obtain the ranked list of features, described in block 220 of Fig. 2A, further includes the operations illustrated in Fig. 2C. That is, as statistical differentiation (i.e., equation (7)) is applied, a determination is made as to which of the unique features identified in the first set of data objects also appear in the second set of data objects, block 221, and, similarly, which of the unique features identified in the first set of data objects do not appear in the second set, block 222. In accordance with the illustrated embodiment, as the statistical differentiation (i.e., equation (7)) is applied, features determined to appear in only one set of data objects and not the other are assigned a relatively higher rank within the ranked feature list, block 223, whereas features determined to appear in both sets of data objects are assigned a relatively lower rank, block 224. In some embodiments, the features within the ranked list of features are further ranked according to the total number of data objects containing each respective feature.
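Putting the pieces together, the following sketch is one simplified reading of blocks 221-224: features exclusive to the content group are ranked above shared features, exclusive features are ordered by their content-group document counts, and shared features are ordered by the equation (7) ratio. It is an illustration, not a definitive implementation of the patented method:

```python
def ranked_feature_list(content_docs, anti_content_docs, eps=1e-6):
    """Rank unique features of the content group against an anti-content group."""
    def ratio(feature):
        # Equation (7) with the same clamping safeguard as above.
        p = sum(feature in doc for doc in content_docs) / len(content_docs)
        q = sum(feature in doc for doc in anti_content_docs) / len(anti_content_docs)
        p = min(max(p, eps), 1.0 - eps)
        q = min(max(q, eps), 1.0 - eps)
        return (p * (1.0 - q)) / (q * (1.0 - p))

    content_features = set().union(*content_docs)
    anti_features = set().union(*anti_content_docs)
    exclusive = content_features - anti_features   # block 222: appear only in content group
    common = content_features & anti_features      # block 221: appear in both groups

    ranked = sorted(exclusive,
                    key=lambda f: sum(f in doc for doc in content_docs),
                    reverse=True)                  # block 223: exclusive features ranked first
    ranked += sorted(common, key=ratio, reverse=True)  # block 224: shared features ranked lower
    return ranked

# Block 230: select the top N elements of the ranked list as the salient features.
# salient = ranked_feature_list(content, anti_content)[:N]
```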
Example Application
Referring now to Fig. 3, an apparatus incorporating the salient feature determination of the present invention is illustrated by way of example, in accordance with one embodiment. As shown, classifier 300 is used to efficiently classify data objects, such as electronic documents, within a broad range of data structures, from very-large-scale hierarchical classification trees to flat document collections, including, but not limited to, text documents, image documents, audio sequences and video sequences, in both proprietary and non-proprietary forms. Classifier 300 includes classifier training services 305 for training classifier 300 to classify new data objects according to classification rules extracted from a previously classified data hierarchy, and classifier classification services 315 for classifying new data objects presented to classifier 300.
The functions of classifier training services 305 include aggregation function 306, salient feature determination function 308 of the present invention, and node characterization function 309. In accordance with the illustrated embodiment, at each node of the hierarchy, content from the previously classified data hierarchy is aggregated by aggregation function 306 to form a content group and an anti-content group of data. Features are then extracted from each group of data, and salient feature determination function 308 is used to determine which subset of those features is salient. Node characterization function 309 is used to characterize each node of the previously classified data hierarchy in terms of the salient features, and to store these characterizations in data store 310, for example, for further use by classifier classification services 315.
Further details of classifier 300, including classifier training services 305 and classifier classification services 315, are described in the concurrently filed U.S. patent application identified by attorney docket number 51026.P004, entitled "Very-Large-Scale Automatic Categorizer For Web Content", commonly assigned to the assignee of the present application, which application is hereby fully incorporated by reference.
Classifier Training Services
Fig. 4 is a functional block diagram illustrating classifier training services 305 of Fig. 3 in accordance with one embodiment of the present invention. As shown in Fig. 4, a previously classified data hierarchy 402 is provided as input to classifier training services 305 of classifier 300. Previously classified data hierarchy 402 represents a group of data objects, such as audio, video and/or textual objects, that have previously been classified (typically by hand) into a subject hierarchy. Previously classified data hierarchy 402 may, for example, represent one or more previously classified collections of electronic documents of a web portal or search engine.
Continuing the example already described, aggregation function 406 aggregates content from previously classified data hierarchy 402 into content and anti-content groups, thereby accentuating the differences between sibling nodes at each level of the hierarchy. Salient feature determination function 408 operates to extract features (409) from the content and anti-content groups of data and to determine which of the extracted features should be deemed salient (409').
In addition, continuing the example, node characterization function 309 of Fig. 3 operates to characterize the content and anti-content groups of data. In one embodiment, the content and anti-content data are characterized in terms of the determined salient features. In one embodiment, the results of the characterization are stored in data store 310, which may be implemented as a data structure of any kind, such as a database, a directory structure, or a simple look-up list. In one embodiment of the invention, the classifier parameters for each node are stored in a hierarchical classification tree having a file structure similar to that of the previously classified data hierarchy.
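A hypothetical per-node training step tying aggregation, salient feature determination and node characterization together might look as follows; the node accessors (documents(), descendants(), siblings()) and the top-50 cutoff are assumptions made for illustration, and ranked_feature_list refers to the sketch given earlier:

```python
def train_node(node, top_n=50):
    """Form the content group from the selected node and its descendants, and the
    anti-content group from the node's siblings and their descendants (compare
    claim 12), then keep the top-ranked features as the node's characterization."""
    content = [doc.feature_set
               for n in [node, *node.descendants()]
               for doc in n.documents()]
    anti_content = [doc.feature_set
                    for sibling in node.siblings()
                    for n in [sibling, *sibling.descendants()]
                    for doc in n.documents()]
    node.salient_features = ranked_feature_list(content, anti_content)[:top_n]
    return node.salient_features
```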
Example Computer System
Fig. 5 illustrates an example computer system suitable for use in determining salient features in accordance with one embodiment of the present invention. As shown, computer system 500 includes one or more processors 502 and system memory 504. Additionally, computer system 500 includes mass storage devices 506 (such as diskette, hard drive, CDROM and so forth), input/output devices 508 (such as keyboard, cursor control and so forth) and communication interfaces 510 (such as network interface cards, modems and so forth). The elements are coupled to one another via system bus 512, which may represent one or more buses. Where multiple buses are present, they are bridged by one or more bus bridges (not shown).
Each of these elements performs its conventional function known in the art. In particular, system memory 504 and mass storage 506 are employed to store a working copy and a permanent copy of the programming instructions implementing the classification system of the present invention. The permanent copy of the programming instructions may be loaded into mass storage 506 in the factory or, in the field, as described earlier, through a distribution medium (not shown) or through communication interface 510 (from a distribution server, not shown). The constitution of these elements 502-512 is known, and accordingly will not be further described.
Conclusion and Epilogue
Thus, it can be seen from the above description that a novel method and apparatus for automatically determining salient features for object classification have been described. While the present invention has been described in terms of the foregoing embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The present invention can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative of, rather than restrictive on, the present invention.

Claims (36)

1. A method comprising:
extracting one or more unique features from a first content group of data objects to form a first feature list;
extracting one or more unique features from a second anti-content group of data objects to form a second feature list;
creating a ranked list of features by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list; and
identifying a set of salient features from the ranked list of features.
2. The method of claim 1, wherein each of the first content group of data objects and the second anti-content group of data objects comprises one or more electronic documents.
3. The method of claim 1, further comprising:
determining a first total number of data objects making up the first content group of data objects; and
determining a second total number of data objects making up the second anti-content group of data objects.
4. The method of claim 3, further comprising:
determining, for each of the one or more unique features forming the first feature list, a first number of data objects of the first content group of data objects containing at least one instance of each corresponding one of the one or more unique features of the first feature list; and
determining, for each of the one or more unique features forming the second feature list, a second number of data objects of the second anti-content group of data objects containing at least one instance of each corresponding one of the one or more unique features of the second feature list.
5. The method of claim 4, wherein creating the ranked list comprises:
identifying as exclusive features those unique features of the first feature list that do not appear in the second feature list;
identifying as common features those unique features of the first feature list that also appear in the second feature list; and
ordering the ranked list such that the exclusive features are ranked higher in the ranked list than the common features.
6. The method of claim 5, further comprising:
applying a probability function to each of the common features to obtain a result vector, wherein the probability function comprises a ratio of the first number of data objects divided by the first total number of data objects to the second number of data objects divided by the second total number of data objects; and
ordering the common features within the ranked list based at least in part on the result vector of the probability function.
7. The method of claim 5, wherein the exclusive features are further ranked based on the first number of data objects.
8. The method of claim 1, wherein identifying the set of salient features comprises selecting from the ranked list of features a top N adjacent features of the ranked list of features.
9. The method of claim 1, wherein identifying the set of salient features comprises selecting from the ranked list of features a last M adjacent features of the ranked list of features.
10. The method of claim 1, wherein each of the unique features comprises a grouping of one or more alphanumeric characters.
11. The method of claim 1, further comprising:
classifying a new data object as being most closely related to one of the first content group of data objects and the second anti-content group of data objects, based at least in part on the set of salient features.
12. The method of claim 1, wherein the first content group of data objects comprises those data objects corresponding to a selected node of a subject hierarchy having a plurality of nodes and to any child nodes associated with the selected node; and
wherein the second anti-content group of data objects comprises those data objects corresponding to any sibling nodes associated with the selected node and to any child nodes associated with the sibling nodes.
13. A method of identifying salient features, the method comprising:
identifying one or more unique features that are members of a first data class;
examining a second data class to identify those of the one or more unique features that are also members of the second data class and those of the one or more unique features that are not members of the second data class;
generating a ranked list of unique features having an order based upon membership of each of the one or more unique features in the second data class; and
identifying one or more of the ranked list of unique features as salient.
14. The method of claim 13, further comprising:
determining, for each of the ranked list of unique features, a number of objects of the first data class containing each corresponding unique feature.
15. The method of claim 14, wherein generating a ranked list further comprises ranking those of the unique features that are not members of the second data class higher in the ranked list than those of the unique features that are also members of the second data class.
16. The method of claim 15, wherein generating a ranked list further comprises ranking those of the unique features contained in a greater number of objects of the first data class higher in the ranked list than those of the unique features contained in a lesser number of objects of the first data class.
17. The method of claim 13, wherein identifying as salient comprises selecting a top N contiguous set of features from the ranked list of unique features.
18. The method of claim 13, wherein identifying as salient comprises selecting a last M contiguous features from the ranked list of unique features.
19. An apparatus comprising:
a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions of a class naming service for providing class names to data objects, including first one or more functions to
extract one or more unique features from a first content group of data objects to form a first feature list,
extract one or more unique features from a second anti-content group of data objects to form a second feature list,
create a ranked list of features by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list, and
identify a set of salient features from the ranked list of features; and
a processor coupled with the storage medium to execute the programming instructions.
20. The apparatus of claim 19, wherein each of the first content group of data objects and the second anti-content group of data objects comprises one or more data objects.
21. The apparatus of claim 19, wherein the plurality of instructions further include instructions to
determine a first total number of data objects making up the first content group of data objects, and
determine a second total number of data objects making up the second anti-content group of data objects.
22. The apparatus of claim 19, wherein the plurality of instructions further include instructions to
determine, for each of the one or more unique features forming the first feature list, a first number of data objects of the first content group of data objects containing at least one instance of each corresponding one of the one or more unique features of the first feature list, and
determine, for each of the one or more unique features forming the second feature list, a second number of data objects of the second anti-content group of data objects containing at least one instance of each corresponding one of the one or more unique features of the second feature list.
23. The apparatus of claim 20, wherein the plurality of instructions to create the ranked list include instructions to
identify as exclusive features those unique features of the first feature list that do not appear in the second feature list,
identify as common features those unique features of the first feature list that also appear in the second feature list, and
order the ranked list such that the exclusive features are ranked higher in the ranked list than the common features.
24. The apparatus of claim 23, wherein the plurality of instructions further include instructions to
apply a probability function to each of the common features to obtain a result vector, wherein the probability function comprises a ratio of the first number of data objects divided by the first total number of data objects to the second number of data objects divided by the second total number of data objects, and
order the common features within the ranked list based at least in part on the result vector of the probability function.
25. The apparatus of claim 23, wherein the exclusive features are further ranked based on the first number of data objects.
26. The apparatus of claim 19, wherein the plurality of instructions to identify the set of salient features further include instructions to select from the ranked list of features a top N adjacent features of the ranked list of features.
27. The apparatus of claim 19, wherein the plurality of instructions to identify the set of salient features further include instructions to select from the ranked list of features a last M adjacent features of the ranked list of features.
28. The apparatus of claim 19, wherein each of the unique features comprises a grouping of one or more alphanumeric characters.
29. The apparatus of claim 19, wherein the plurality of instructions further include instructions to
identify a new data object as being most closely related to one of the first content group of data objects and the second anti-content group of data objects, based at least in part on the set of salient features.
30. The apparatus of claim 19, wherein the first content group of data objects comprises those data objects corresponding to a selected node of a subject hierarchy having a plurality of nodes and to any child nodes associated with the selected node; and
wherein the second anti-content group of data objects comprises those data objects corresponding to any sibling nodes associated with the selected node and to any child nodes associated with the sibling nodes.
31. An apparatus comprising:
a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions, including first one or more functions to
identify one or more unique features that are members of a first data class,
examine a second data class to identify those of the one or more unique features that are also members of the second data class and those of the one or more unique features that are not members of the second data class,
generate a ranked list of unique features having an order based upon membership of each of the one or more unique features in the second data class, and
identify one or more of the ranked list of unique features as salient; and
a processor coupled with the storage medium to execute the programming instructions.
32. The apparatus of claim 31, wherein the plurality of instructions further include instructions to
determine, for each of the ranked list of unique features, a number of objects of the first data class containing each corresponding unique feature.
33. The apparatus of claim 32, wherein the plurality of instructions to generate a ranked list further include instructions to rank those of the unique features that are not members of the second data class higher in the ranked list than those of the unique features that are also members of the second data class.
34. The apparatus of claim 33, wherein the plurality of instructions to generate a ranked list further include instructions to rank those of the unique features contained in a greater number of objects of the first data class higher in the ranked list than those of the unique features contained in a lesser number of objects of the first data class.
35. The apparatus of claim 31, wherein the plurality of instructions to identify as salient further include instructions to select a top N contiguous set of unique features from the ranked list of unique features.
36. The apparatus of claim 31, wherein the plurality of instructions to identify as salient further include instructions to select a last M contiguous unique features from the ranked list of unique features.
CNB02829663XA 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification Expired - Fee Related CN100378713C (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2002/030457 WO2004029826A1 (en) 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification

Publications (2)

Publication Number Publication Date
CN1669023A true CN1669023A (en) 2005-09-14
CN100378713C CN100378713C (en) 2008-04-02

Family

ID=32041246

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB02829663XA Expired - Fee Related CN100378713C (en) 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification

Country Status (8)

Country Link
EP (1) EP1543437A4 (en)
JP (1) JP2006501545A (en)
CN (1) CN100378713C (en)
AU (1) AU2002334669A1 (en)
BR (1) BR0215899A (en)
CA (1) CA2500264A1 (en)
MX (1) MXPA05003249A (en)
WO (1) WO2004029826A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110035195A (en) * 2013-06-03 2019-07-19 柯达阿拉里斯股份有限公司 Classification through the hardcopy medium scanned

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7576755B2 (en) 2007-02-13 2009-08-18 Microsoft Corporation Picture collage systems and methods
US8832140B2 (en) 2007-06-26 2014-09-09 Oracle Otc Subsidiary Llc System and method for measuring the quality of document sets
US8935249B2 (en) 2007-06-26 2015-01-13 Oracle Otc Subsidiary Llc Visualization of concepts within a collection of information

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6006221A (en) * 1995-08-16 1999-12-21 Syracuse University Multilingual document retrieval system and method using semantic vector matching
US6539115B2 (en) * 1997-02-12 2003-03-25 Fujitsu Limited Pattern recognition device for performing classification using a candidate table and method thereof
US6233575B1 (en) * 1997-06-24 2001-05-15 International Business Machines Corporation Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values
US6018733A (en) * 1997-09-12 2000-01-25 Infoseek Corporation Methods for iteratively and interactively performing collection selection in full text searches
US6353825B1 (en) * 1999-07-30 2002-03-05 Verizon Laboratories Inc. Method and device for classification using iterative information retrieval techniques
WO2002006993A1 (en) * 2000-07-17 2002-01-24 Asymmetry, Inc. System and methods for web resource discovery

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110035195A (en) * 2013-06-03 2019-07-19 柯达阿拉里斯股份有限公司 Classification through the hardcopy medium scanned

Also Published As

Publication number Publication date
JP2006501545A (en) 2006-01-12
CN100378713C (en) 2008-04-02
MXPA05003249A (en) 2005-07-05
EP1543437A1 (en) 2005-06-22
CA2500264A1 (en) 2004-04-08
AU2002334669A1 (en) 2004-04-19
WO2004029826A1 (en) 2004-04-08
BR0215899A (en) 2005-07-26
EP1543437A4 (en) 2008-05-28

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
DE60315506T2 (en) IDENTIFICATION OF CRITICAL FEATURES IN A REGIONAL SCALE ROOM
Dadgar et al. A novel text mining approach based on TF-IDF and Support Vector Machine for news classification
US6938025B1 (en) Method and apparatus for automatically determining salient features for object classification
US6826576B2 (en) Very-large-scale automatic categorizer for web content
CN1240011C (en) File classifying management system and method for operation system
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
US8150822B2 (en) On-line iterative multistage search engine with text categorization and supervised learning
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
Noaman et al. Naive Bayes classifier based Arabic document categorization
CN110516074B (en) Website theme classification method and device based on deep learning
CN109446423B (en) System and method for judging sentiment of news and texts
CN115796181A (en) Text relation extraction method for chemical field
CN108595525A (en) A kind of lawyer's information processing method and system
CN1158460A (en) Multiple languages automatic classifying and searching method
CN108681548A (en) A kind of lawyer's information processing method and system
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN108614860A (en) A kind of lawyer's information processing method and system
CN100378713C (en) Method and apparatus for automatically determining salient features for object classification
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN111401056A (en) Method for extracting keywords from various texts
JPH06282587A (en) Automatic classifying method and device for document and dictionary preparing method and device for classification
Asirvatham et al. Web page categorization based on document structure
Li Research on an Enhanced Web Information Processing Technology based on AIS Text Mining

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: MICROSOFT TECHNOLOGY LICENSING LLC

Free format text: FORMER OWNER: MICROSOFT CORP.

Effective date: 20150429

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150429

Address after: Washington State

Patentee after: Microsoft Technology Licensing, LLC

Address before: Washington State

Patentee before: Microsoft Corp.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080402

Termination date: 20190925