2. Background information
The World Wide Web is an important source of information, with estimates of the content available to read or download online running to billions of pages. To use this information effectively, however, a practical method of navigating this mass of data is needed.
In the early days of Internet surfing, two basic approaches to online search were developed. In the first, index data is generated from the contents of web pages gathered by an automated search engine that "crawls" the web looking for new and unique pages. The resulting database can then be searched with various query techniques, and the results are usually ranked by their similarity to the form of the query. In the second, web pages are grouped into a categorical hierarchy, often presented as a tree. The user then makes a series of selections while descending the hierarchy, choosing among two or more options at each level that represent salient differences between the subtrees below the decision point, and finally reaches a leaf node containing pages of text and/or multimedia content.
For example, Fig. 1 illustrates a typical prior-art subject hierarchy 102 in which a plurality of decision nodes (hereinafter "nodes") 130-136 are arranged hierarchically as parent and/or child nodes, each associated with a unique subject category. Node 130, for instance, is the parent of nodes 131 and 132, and nodes 131 and 132 are children of node 130. Because nodes 131 and 132 are both children of the same node (node 130), they are siblings of each other. Other sibling pairs in subject hierarchy 102 are nodes 133 and 134 and nodes 135 and 136. As can be seen from Fig. 1, node 130 forms the first level 137 of subject hierarchy 102, nodes 131-132 form the second level 138, and nodes 133-136 form the third level 139. In addition, node 130 is regarded as the root node of subject hierarchy 102, because it is not a child of any other node.
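The parent, child, sibling and level relationships described for Fig. 1 can be sketched as a small tree structure. This is only an illustration: the `Node` class and its method names are hypothetical, and the text does not say which second-level node is the parent of nodes 133-136, so the assignment below (133 and 134 under 131, 135 and 136 under 132) is an assumption.

```python
class Node:
    """One decision node of a subject hierarchy (illustrative only)."""

    def __init__(self, node_id, parent=None):
        self.node_id = node_id
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def siblings(self):
        """Nodes that share this node's parent, excluding the node itself."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]

    def level(self):
        """1 for the root, 2 for its children, and so on."""
        return 1 if self.parent is None else self.parent.level() + 1


root = Node(130)
n131, n132 = Node(131, root), Node(132, root)
# Assumed parentage: the description only states which nodes are siblings.
n133, n134 = Node(133, n131), Node(134, n131)
n135, n136 = Node(135, n132), Node(136, n132)
```

Under these assumptions, `n131.siblings()` contains only node 132, and `n136.level()` is 3, matching the third level 139.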
The process of classifying web pages hierarchically poses several challenges. First, the nature of the hierarchy must be defined. This is usually done manually by experts in the particular subject area, somewhat as the Dewey Decimal Classification was devised for libraries. The categories are then given descriptive labels so that users and classifiers can make appropriate decisions when navigating the hierarchy. Content, for example in the form of individual electronic documents, can then be placed into the categories by a manual search through the classification system.
In recent years attention has turned to automating the various stages of this process. Systems now exist that automatically classify documents from a document collection. For example, some systems use keywords associated with the documents to cluster similar documents automatically. These clusters can be grouped iteratively into super-clusters, thereby producing a hierarchy. However, these systems require keywords to be inserted manually, and they produce a hierarchy that has no systematic structure. If such a hierarchy is to be used for manual search, the child nodes or leaf documents must be inspected by hand to identify common features before the nodes of the hierarchy can be labeled.
Many classification systems use word lists to classify documents. Typically, the salient words are either defined in advance or selected from the documents being processed so as to characterize the documents more accurately. These salient-word lists are generally produced by counting the frequency of occurrence of every word in each document of a collection. Words are then removed from the list according to one or more criteria. Often, words that occur too few times in the collection are eliminated because they are used too rarely to distinguish categories reliably, while words that occur too frequently are also eliminated because they appear in documents of every category.
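The frequency-based pruning described above can be sketched as follows. This is a minimal illustration, not the method of any particular prior-art system: it counts the number of documents containing each word (document frequency) rather than raw occurrences, and the function name and the thresholds `min_df` and `max_df_ratio` are assumptions introduced here.

```python
from collections import Counter


def prune_vocabulary(documents, min_df=2, max_df_ratio=0.9):
    """Keep words that are neither too rare nor too common.

    documents: list of strings, one per document.
    min_df: minimum number of documents a word must appear in (assumed).
    max_df_ratio: maximum fraction of documents a word may appear in (assumed).
    """
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document.
        doc_freq.update(set(text.lower().split()))
    max_df = max_df_ratio * len(documents)
    return {w for w, df in doc_freq.items() if min_df <= df <= max_df}


docs = ["the cat sat", "the dog sat", "the bird flew", "the fish swam"]
kept = prune_vocabulary(docs)
```

Here "the" is dropped because it appears in every document, and words appearing in a single document are dropped as too rare.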
Moreover, "stop words" and word stems are often removed from feature lists to make salient features easier to determine. Stop words are common words in a language, such as "a", "the", "his" and "and", that are felt to carry no semantic content, while stems refer to suffixes such as "ing", "is" and "able". Unfortunately, creating stop-word and stem lists is a language-specific task that requires expert knowledge of grammar, literature and idiom, all of which change over time. A more ingenious way of determining salient features is therefore needed.
Detailed description of the invention
Various aspects of the present invention are described below. However, it will be apparent to those skilled in the art that the invention may be practiced with only some or with all of its aspects. For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the invention. However, it will also be apparent to those skilled in the art that the invention may be practiced without these specific details. In other instances, well-known features are omitted or simplified so as not to obscure the invention.
Parts of the description are presented in terms of operations performed by a processor-based device, using terms such as data, storing, selecting, determining and calculating, consistent with the manner commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. As those skilled in the art will appreciate, the quantities take the form of electrical, magnetic or optical signals capable of being stored, transferred and otherwise manipulated through the mechanical and electrical components of the processor-based device. The term processor here includes microprocessors, microcontrollers, digital signal processors and the like, whether standalone, adjunct or embedded.
The operations are described in turn as multiple discrete steps, in the manner most helpful for understanding the invention. However, the order of description should not be construed as implying that these operations are necessarily order-dependent. In particular, the operations need not be performed in the order presented. Furthermore, the description repeatedly uses the phrase "in one embodiment", which does not necessarily refer to the same embodiment, although it may.
According to one embodiment of the invention, one or more unique features are extracted from a first group of objects to form a first feature set, and one or more unique features are extracted from a second group of objects to form a second feature set. A statistical differentiation method is then applied between the unique features of the first feature set and those of the second feature set to produce a ranked list of features. A set of salient features can then be identified from the resulting ranked list.
In one embodiment, the determination of salient features facilitates the efficient classification of data objects, including (but not limited to) text documents, image documents, audio sequences and video sequences, in both proprietary and non-proprietary formats, within very-large-scale hierarchical classification trees as well as within non-hierarchical data structures such as flat files. In a text document, for example, features may take the form of words, where the term "word" is generally understood to denote a group of letters in a given language that has some semantic meaning. More generally, a feature may be an N-token gram, where a token is one atomic element of a language. Examples include N-letter grams and N-word grams in English, and N-ideogram grams in Asian languages. In audio sequences, by contrast, tone, tempo, duration, pitch, volume and the like may all be used as features for classifying sound, while in video sequences and still images, pixel properties such as angle and intensity level may be used as features. According to one embodiment of the invention, once a set of features has been identified from a group of (for example) electronic documents, a subset of those features that is salient with respect to the classification of a given group of data objects can then be determined. The term "electronic document" is used broadly herein to describe a family of data objects, such as those described above, that comprise one or more constituent features. Although an electronic document may include text, it may equally include audio and/or video content, either in place of or in addition to text.
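The notion of an N-token gram can be illustrated with a short sketch. The function names below are hypothetical, and the tokenization (whitespace splitting for words, individual characters for letters) is an assumption chosen for simplicity, not something the description prescribes.

```python
def token_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_ngrams(text, n):
    """N-word grams, assuming words are separated by whitespace."""
    return token_ngrams(text.split(), n)


def letter_ngrams(text, n):
    """N-letter grams, treating each character as one token."""
    return token_ngrams(list(text), n)


bigrams = word_ngrams("very large scale", 2)
letters = letter_ngrams("web", 2)
```

The same `token_ngrams` helper would apply to any token sequence, for example a list of ideograms, since it makes no assumption about what a token is.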
Once the feature selection criteria have been determined (that is, which of the various text, audio or video attributes are to be used as the determinative features within a set of data objects), the salient-feature determination process of the invention can be carried out. To begin the process, the data objects under consideration are divided into two groups. An equation representing the "odds of relevance" (equation (1)) is then applied to these two groups of data objects, where $O(d)$ represents the odds that a given data object $d$ is a member of the first group of data objects, $P(R \mid d)$ represents the probability that the data object is a member of the first group, and $P(R' \mid d)$ represents the probability that the data object is a member of the second group:

$$O(d) = \frac{P(R \mid d)}{P(R' \mid d)} \tag{1}$$
Because the manual grouping of data objects does not itself supply the probabilities needed to calculate the odds of relevance, equation (1) can be used to approximate this value. Accordingly, the logarithm function, together with Bayes' formula, can be applied to both sides of equation (1) to yield equation (2):

$$\log O(d) = \log P(d \mid R) - \log P(d \mid R') + \log P(R) - \log P(R') \tag{2}$$
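Assume that a data object consists of a set of features $\{f_i\}$, and that $X_i$ is 1 or 0 according to whether a given feature $f_i$ is present in or absent from the data object, respectively. Treating the features as occurring independently of one another, it follows that:

$$\log O(d) = \sum_i \left[ \log P(X_i \mid R) - \log P(X_i \mid R') \right] + \log P(R) - \log P(R') \tag{3}$$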
Because $\log P(R)$ and $\log P(R')$ are constants, independent of which features in the data object are selected as salient, a new quantity $g(d)$ can be defined:

$$g(d) = \sum_i \left[ \log P(X_i \mid R) - \log P(X_i \mid R') \right] \tag{4}$$
Let $p_i = P(X_i = 1 \mid R)$ denote the probability that a given feature $f_i$ appears in a data object of the first group of data objects, and let $q_i = P(X_i = 1 \mid R')$ denote the probability that the feature $f_i$ appears in a data object of the second group of data objects. Substitution and simplification then yield equation (5):

$$g(d) = \sum_i X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \sum_i \log \frac{1 - p_i}{1 - q_i} \tag{5}$$
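The substitution that produces the two summations of equation (5) can be written out per feature. This is a sketch that assumes each $X_i$ is a 0/1 indicator with $p_i = P(X_i = 1 \mid R)$ and $q_i = P(X_i = 1 \mid R')$, so that its probability under either group has the Bernoulli form:

```latex
\begin{align*}
P(X_i \mid R) &= p_i^{X_i} (1 - p_i)^{1 - X_i}, \qquad
P(X_i \mid R') = q_i^{X_i} (1 - q_i)^{1 - X_i} \\
\log P(X_i \mid R) - \log P(X_i \mid R')
  &= X_i \log \frac{p_i}{q_i} + (1 - X_i) \log \frac{1 - p_i}{1 - q_i} \\
  &= X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \log \frac{1 - p_i}{1 - q_i}
\end{align*}
```

Summing the last line over all features $i$ gives one summation that depends on the indicators $X_i$ and a second summation that does not.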
Because the second summation does not depend on the occurrence of features in the data object, it can be dropped, giving equation (6):

$$g(d) = \sum_i X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \tag{6}$$
Because the logarithm is a monotonic function, maximizing the ratio in equation (7) is sufficient to maximize the corresponding logarithm:

$$\frac{p_i (1 - q_i)}{q_i (1 - p_i)} \tag{7}$$

According to one embodiment of the invention, equation (7) is applied to each feature in the combined feature list of the two groups of data objects to facilitate the identification of salient features. To this end, $p_i$ is calculated as the number of data objects in the first group that contain feature $f_i$ at least once, divided by the total number of data objects in the first group. Likewise, $q_i$ is calculated as the number of data objects in the second group that contain feature $f_i$ at least once, divided by the total number of data objects in the second group.
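The calculation of $p_i$, $q_i$ and the ratio of equation (7) might be sketched as below. The function name and the data layout (each data object represented as a set of features) are assumptions made for illustration. The clamping parameter `eps` is also an addition not found in the description: without it the ratio is undefined whenever a feature appears in none or in all of the objects of a group.

```python
def feature_odds_ratios(group1, group2, eps=1e-6):
    """Compute the ratio p(1 - q) / (q(1 - p)) for every feature.

    group1, group2: lists of feature sets, one set per data object.
    eps: clamp for probabilities of exactly 0 or 1 (an assumption,
         added here only to avoid division by zero).
    """
    vocabulary = set().union(*group1, *group2)
    ratios = {}
    for feature in vocabulary:
        # Fraction of objects in each group containing the feature at least once.
        p = sum(feature in obj for obj in group1) / len(group1)
        q = sum(feature in obj for obj in group2) / len(group2)
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        ratios[feature] = p * (1 - q) / (q * (1 - p))
    return ratios


first_group = [{"a", "b"}, {"a"}]
second_group = [{"b"}, {"c"}]
ratios = feature_odds_ratios(first_group, second_group)
ranked = sorted(ratios, key=ratios.get, reverse=True)
```

In this toy input, feature "a" occurs only in the first group and receives the largest ratio, "b" occurs equally often in both groups and receives a ratio of 1, and "c" occurs only in the second group and receives the smallest ratio.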
Fig. 2 (A-C) illustrates the operational flow of the salient-feature determination function according to one embodiment of the invention. First, a first set of data objects is examined to produce a feature list consisting of the unique features that appear in one or more data objects of at least the first set (block 210). Equation (7) is applied to each identified unique feature to produce a ranked list of features (block 220), and at least a subset of the ranked list is selected as the salient features (block 230). The salient features may comprise one or more contiguous or non-contiguous groups of elements selected from the ranked list of features. In one embodiment, the first N elements of the ranked list are selected as salient, where N may vary with the requirements of the system. In another embodiment, the last M elements of the ranked list are selected as salient, where M may likewise vary with the requirements of the system.
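The selection step of block 230 can be sketched as simple slicing of the ranked list. The function name and parameters are hypothetical. The description presents "first N" and "last M" as separate embodiments; combining them in one call is shown here only as one possible instance of a non-contiguous selection.

```python
def select_salient(ranked_features, top_n=0, last_m=0):
    """Pick the first top_n and/or the last last_m entries of a ranked list."""
    head = ranked_features[:top_n]
    # A slice of [-0:] would return the whole list, so guard the zero case.
    tail = ranked_features[-last_m:] if last_m > 0 else []
    # Avoid listing a feature twice when the two slices overlap.
    return head + [f for f in tail if f not in head]


ranked_list = ["a", "b", "c", "d", "e"]
first_two = select_salient(ranked_list, top_n=2)
last_two = select_salient(ranked_list, last_m=2)
both_ends = select_salient(ranked_list, top_n=2, last_m=2)
```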
According to one embodiment of the invention, producing the feature list (block 210) includes determining the total number of data objects in each group of data objects (block 212) and, for each unique feature identified in at least the first group, determining the total number of data objects that contain that feature (block 214). In addition, the list of unique features may be filtered according to various desired criteria (block 216). For example, the list may be pruned to remove features that are not found in at least some minimum number of data objects, features determined to be shorter than some minimum length, and/or features that occur fewer than a threshold number of times.
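Blocks 212 to 216 can be sketched as a single pass over one group of data objects. The function name, the token-list representation of a data object, and the default thresholds `min_objects` and `min_length` are assumptions for illustration; the description leaves the actual criteria open.

```python
from collections import Counter


def build_feature_list(group, min_objects=2, min_length=3):
    """Count and filter the unique features of one group of data objects.

    group: list of token lists, one list per data object.
    Returns (total number of objects, {feature: number of objects containing it}).
    """
    total_objects = len(group)                      # block 212
    object_counts = Counter()                       # block 214
    for tokens in group:
        object_counts.update(set(tokens))           # count once per object
    kept = {                                        # block 216
        feature: count
        for feature, count in object_counts.items()
        if count >= min_objects and len(feature) >= min_length
    }
    return total_objects, kept


sample_group = [["web", "page", "a"], ["web", "a"], ["page", "web"]]
total, features = build_feature_list(sample_group)
```

In this example the feature "a" is found in two objects but is removed for being shorter than the assumed minimum length.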
According to one embodiment of the invention, applying the statistical differentiation method to obtain the ranked list of features, as described in block 220 of Fig. 2A, further includes the processes illustrated in Fig. 2C. That is, in applying the statistical differentiation method (i.e., equation (7)), a determination is made as to which of the unique features identified in the first set of data objects also appear in the second set of data objects (block 221), and likewise which of the unique features identified in the first set do not appear in the second set (block 222). According to the illustrated embodiment, features determined by the statistical differentiation method to appear in only one set of data objects and not in the other are assigned a relatively higher rank in the ranked list of features (block 223), whereas features determined to appear in both sets are assigned a relatively lower rank (block 224). The features in the ranked list may additionally be ordered by the total number of data objects containing each respective feature.
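The ordering described for blocks 221 to 224 can be sketched with a two-part sort key. This is a simplification under stated assumptions: it ranks only the features found in the first set, it treats "appears in the second set" as a yes/no test rather than evaluating equation (7), and the function name and dictionary inputs are hypothetical.

```python
def rank_features(counts_first, counts_second):
    """Rank the features of the first set of data objects.

    counts_first, counts_second: {feature: number of objects containing it}.
    Features absent from the second set come first (blocks 222, 223),
    shared features come after them (blocks 221, 224), and within each
    band features are ordered by how many objects contain them.
    """
    def sort_key(feature):
        exclusive = feature not in counts_second
        return (exclusive, counts_first[feature])

    return sorted(counts_first, key=sort_key, reverse=True)


ranking = rank_features({"alpha": 3, "beta": 5, "gamma": 1}, {"beta": 2})
```

Although "beta" occurs in the most objects of the first set, it is ranked last because it also occurs in the second set.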
Example application
Referring now to Fig. 3, an example is shown of an apparatus in which the invention is used to determine salient features, according to one embodiment. As shown, classifier 300 is used to classify data objects efficiently, such as electronic documents, including (but not limited to) text documents, image documents, audio sequences and video sequences, in both proprietary and non-proprietary formats, within data structures that include very-large-scale hierarchical classification trees and flat file formats. Classifier 300 includes classifier training service 305, which trains classifier 300 to classify new data objects according to classification rules extracted from a previously classified data hierarchy, and classifier classification service 315, which classifies new data objects input to classifier 300.
Classifier training service 305 includes aggregation function 306, salient-feature determination function 308 of the invention, and node characterization function 309. According to the illustrated embodiment, content from the previously classified data hierarchy is aggregated by aggregation function 306 at each node of the hierarchy to form both a content group and a non-content group of data. Features are then extracted from each group of data, and salient-feature determination function 308 determines which subset of those features is salient. Node characterization function 309 characterizes each node of the previously classified data hierarchy according to the salient features and stores the resulting node characterizations in data store 310, for example for further use by classifier classification service 315.
Additional information about classifier 300, including classifier training service 305 and classifier classification service 315, is given in the concurrently filed U.S. patent application numbered <<51026, P004>>, entitled "Very-Large-Scale Automatic Categorizer For Web Content", which is commonly assigned to the assignee of the present application and is fully incorporated herein by reference.
Classifier training service
Fig. 4 is a functional block diagram of classifier training service 305 of Fig. 3, according to one embodiment of the invention. As shown in Fig. 4, previously classified data hierarchy 402 is provided as input to classifier training service 305 of classifier 300. Previously classified data hierarchy 402 represents a set of data objects, such as audio, video and/or text objects, that have previously been classified and assigned to a subject hierarchy (usually manually). Previously classified data hierarchy 402 may represent, for example, one or more sets of electronic documents previously classified by a web portal or search engine.
According to the illustrated example, aggregation function 406 aggregates content from previously classified data hierarchy 402 into content and non-content groups, thereby increasing the differentiation between sibling nodes at each level of the hierarchy. Salient-feature determination function 408 extracts features from the content and non-content groups of data and determines which of the extracted features (409) are to be regarded as salient (409′).
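One way the aggregation into content and non-content groups could work is sketched below. The description does not define how the two groups are formed, so the rule used here is purely an assumption: the content group of a node holds the documents filed at that node or beneath it, and the non-content group holds the documents filed under its sibling nodes. The function names and the dictionary-based tree are likewise hypothetical.

```python
def subtree_docs(tree, docs, node):
    """All documents filed at `node` or anywhere beneath it."""
    collected = list(docs.get(node, []))
    for child in tree.get(node, []):
        collected.extend(subtree_docs(tree, docs, child))
    return collected


def content_groups(tree, docs, parent):
    """For each child of `parent`, build (content, non_content) document lists.

    tree: {node: [child nodes]}; docs: {node: [documents filed at that node]}.
    Assumed rule: a node's non-content group is the content of its siblings.
    """
    groups = {}
    for child in tree.get(parent, []):
        content = subtree_docs(tree, docs, child)
        non_content = [
            doc
            for sibling in tree[parent] if sibling != child
            for doc in subtree_docs(tree, docs, sibling)
        ]
        groups[child] = (content, non_content)
    return groups


tree = {"root": ["sports", "news"], "sports": ["tennis"]}
docs = {"sports": ["d1"], "tennis": ["d2"], "news": ["d3"]}
groups = content_groups(tree, docs, "root")
```

Under this assumed rule, each pair of groups contrasts a node with its siblings, which is consistent with the stated goal of increasing the differentiation between sibling nodes.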
In addition, according to the illustrated example, node characterization function 309 of Fig. 3 characterizes the content and non-content groups of data. In one embodiment, the content and non-content data are characterized according to the determined salient features. In one embodiment, the results of the characterization are stored in data store 310, which may be implemented as any of a number of data structures, such as a database, a directory structure, or a simple lookup table. In one embodiment of the invention, the classifier parameters for each node are stored in a hierarchical classification tree whose file structure is similar to that of the previously classified data hierarchy.
Example computer system
Fig. 5 illustrates an example computer system suitable for determining salient features according to one embodiment of the invention. As shown, computer system 500 includes one or more processors 502 and system memory 504. In addition, computer system 500 includes mass storage devices 506 (such as diskettes, hard drives, CD-ROMs and so forth), input/output devices 508 (such as keyboards, cursor controls and so forth) and communication interfaces 510 (such as network interface cards, modems and so forth). The elements are coupled to one another via system bus 512, which may represent one or more buses. Where system bus 512 represents multiple buses, they are bridged by one or more bus bridges (not shown).
Each of these elements performs its conventional function known in the art. In particular, system memory 504 and mass storage 506 are used to store a working copy and a permanent copy of the programming instructions implementing the classification system of the invention. The permanent copy of the programming instructions may be loaded into mass storage 506 at the factory, or in the field, as described earlier, through a distribution medium (not shown) or through communication interface 510 (from a distribution server (not shown)). The constitution of elements 502-512 is known and need not be described further.
Conclusion and postscript
Thus, as can be seen from the above description, a novel method and apparatus for automatically determining salient features for object classification has been described. Although the invention has been described in terms of the above embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The invention can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is therefore to be regarded as illustrative of, rather than restrictive on, the invention.