CN101404031B

CN101404031B - Method and system for recognizing concept type web pages

Info

Publication number: CN101404031B
Application number: CN2008102257626A
Authority: CN
Inventors: 刘琳
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2008-11-12
Filing date: 2008-11-12
Publication date: 2012-05-30
Anticipated expiration: 2028-11-12
Also published as: CN101404031A

Abstract

The invention discloses a method for identifying a conceptual web page and a system thereof. The method comprises the following steps: a plurality of conceptual web pages are acquired from a web page database; the URI amount of the conceptual web pages under all levels of directories of each website domain name is compared with a first threshold, the directory under which the URI amount of the conceptual web pages is greater than the first threshold is determined as a conceptual directory; and the URI of the web page to be identified is matched with each conceptual directory, if matched, the web page to be identified is determined as the conceptual web page. The method can quickly and comprehensively distinguish whether the web page is a conceptual web page and the class thereof. The method increases the identification rate and obviously improves the coverage rate in regard to identifying a conceptual document from mass web page data.

Description

Method and system for recognizing concept-type web pages

Technical field

The present invention relates to the field of network information processing and, more specifically, to a method and system for recognizing concept-type web pages.

Background technology

With the rapid growth of text and multimedia content on the Internet and in other data networks and systems, the volume of network information has increased sharply. How to help users obtain the information they need from this mass of network information as quickly and as accurately as possible has therefore become a central problem in the field of network information processing.

A "concept" generally refers to a unit of knowledge (or a general semantic unit) formed by a unique combination of characteristics. A concept-type document usually takes the explanation of a concept as its subject and elaborates on the intension and extension of that single concept.

The prior art has proposed various technical solutions for analyzing and processing network information to meet users' information needs. Among them, the patent "A method and system for recognizing concept-type documents" (publication number CN101004753A, hereinafter "invention 1") analyzes user search behavior and points out that, among documents matching the same query keywords, a concept-type document is usually the best answer for the user to select. It is therefore necessary to analyze a set of network documents and identify documents of this type. At the same time, under traditional search ranking, concept-type documents usually appear at relatively low positions in the result list, which conflicts with their being the user's preferred answer and thus reduces the efficiency of obtaining information. It is therefore necessary to recognize such documents specifically and efficiently.

Invention 1 provides an independent, automatic and efficient means of recognizing concept-type documents, but it has the following problems in practical application. (1) The method of invention 1 has a certain probability of error. Rhetorical devices such as metaphor and personification in literary works can cause a document to be misidentified as a concept-type document. For example, compare "The people's army is our family, our Great Wall of steel." with "A sunspot is a kind of solar activity phenomenon." From the form of the text alone, it is difficult for a program to tell automatically whether a sentence is describing a concept: apart from the noun serving as the subject, the two sentences are phrased in exactly the same way. In theory, adjustments made only within the system of invention 1 itself cannot give it the ability to distinguish the two. (2) The method of invention 1 misses documents to some degree. That is, although invention 1 achieves relatively accurate and efficient recognition, it does not guarantee that all concept-type documents are covered. On the Internet in particular, the concept word itself may not appear in the body of some concept-type pages and may appear only as the page title; other concept-type documents are so brief that they do not give the method of invention 1 a sufficient basis for judgment. The method of invention 1 thus has inherent defects in both the accuracy and the recall of concept-type document recognition, arising either from the limitations of the method or from the diversity of documents. In addition, even when a concept-type document is recognized accurately, analysis of user behavior shows that the user's perception of a website's authority affects which search results the user selects. Users are more inclined to trust and select results from websites that provide a large number of concept-type documents, whereas websites with only a few concept-type documents are less well known to users and their results are less readily trusted. Therefore, although the method of invention 1 provides a technical means of recognizing concept-type documents quickly and accurately, it cannot fully satisfy users' search needs.

Therefore, a solution for recognizing concept-type web pages is needed to solve the above problems in the related art.

Summary of the invention

The present invention aims to provide a method and system for recognizing concept-type web pages that can improve search quality.

According to one aspect of the present invention, a method for recognizing concept-type web pages is provided, comprising the following steps: obtaining a plurality of concept-type web pages from a web page database; comparing the number of URIs of concept-type web pages under each level of directory of each website domain name with a first threshold, and determining as a concept-type directory each directory under which the number of URIs of concept-type web pages is greater than the first threshold; and matching the URI of a web page to be identified against each concept-type directory and, if it matches, determining the web page to be identified to be a concept-type web page.

After the step of determining the concept-type directory, the method further comprises: classifying all web pages under the concept-type directory; and determining as the category of the concept-type directory the category that is shared by the web pages under that directory and whose share of those pages reaches a predetermined ratio.

After the step of determining the web page to be identified to be a concept-type web page, the method further comprises: adding the web page determined to be a concept-type web page to that category.

After the step of determining the concept-type directory, the method further comprises: counting the number of URIs of non-concept-type web pages in each level of directory under the concept-type directory; and, where the number of URIs of non-concept-type web pages is greater than a second threshold, determining the directory containing those non-concept-type web pages to be a non-concept-type directory.

The step of determining the web page to be identified to be a concept-type web page further comprises: matching the web page to be identified against each non-concept-type directory; if it matches none of the non-concept-type directories, determining the web page to be identified to be a concept-type web page; otherwise, determining it to be a non-concept-type web page.

The step of determining the concept-type directory comprises: counting the number of URIs of concept-type web pages under each website domain name directory and comparing it with the first threshold, and, when the number is greater than the first threshold, determining the website domain name directory to be a concept-type directory; counting the number of URIs of concept-type web pages in each subdirectory of the website domain name and comparing it with the first threshold, and, when the number is greater than the first threshold, determining that subdirectory to be a concept-type directory; and repeating this operation until the number of URIs of concept-type web pages in the directory being counted is not greater than the first threshold or the directory being counted has no subdirectory.

The step of obtaining a plurality of concept-type web pages comprises: using a concept-type web page recognition algorithm to obtain the plurality of concept-type web pages.

According to another aspect of the present invention, a system for recognizing concept-type web pages is provided, comprising: an acquisition module for obtaining a plurality of concept-type web pages from a web page database; a concept-type directory determination module for comparing the number of URIs of concept-type web pages under each level of directory of each website domain name with a first threshold, and determining as a concept-type directory each directory under which the number of URIs of concept-type web pages is greater than the first threshold; and a matching and determination module for matching the URI of a web page to be identified against each concept-type directory and, if it matches, determining the web page to be identified to be a concept-type web page.

The system further comprises: a category determination module for classifying all web pages under the concept-type directory and determining as the category of the concept-type directory the category that is shared by the web pages under that directory and whose share of those pages reaches a predetermined ratio; and an adding module for adding the web page determined to be a concept-type web page to that category.

The system further comprises: a non-concept-type directory determination module for counting the number of URIs of non-concept-type web pages in each level of directory under the concept-type directory and, where the number of URIs of non-concept-type web pages is greater than a second threshold, determining the directory containing those non-concept-type web pages to be a non-concept-type directory.

The matching and determination module is further used to match the web page to be identified against the non-concept-type directories; if the page matches none of the non-concept-type directories, it is determined to be a concept-type web page; otherwise, it is determined to be a non-concept-type web page.

The acquisition module uses a concept-type web page recognition algorithm to obtain the plurality of concept-type web pages.

The concept-type directory determination module comprises: a first statistics module for counting the number of URIs of concept-type web pages under each level of directory of each website domain name; a first comparison module for comparing the number of URIs of concept-type web pages with the first threshold; and a first determination module for determining as a concept-type directory each directory under which the number of URIs of concept-type web pages is greater than the first threshold.

The non-concept-type directory determination module comprises: a second statistics module for counting the number of URIs of non-concept-type web pages in each level of directory under the concept-type directory; a second comparison module for comparing the number of URIs of non-concept-type web pages with the second threshold; and a second determination module for determining, where the number of URIs of non-concept-type web pages is greater than the second threshold, the directory containing those non-concept-type web pages to be a non-concept-type directory.

The present invention exploits the distribution characteristics of concept-type web pages: it filters out websites that contain very few concept-type pages, so that websites whose concept-type pages are concentrated and numerous are more readily selected as concept-type directories than websites containing few concept-type web pages. As a result, pages with lower credibility are not recognized as concept-type pages by the present invention, and better search results can be presented.

The present invention can quickly and comprehensively determine whether a web page is a concept-type web page and what its category is. For recognizing concept-type documents in large-scale web data, it not only increases recognition speed but also markedly improves coverage.

Other features and advantages of the present invention will be set forth in the description that follows and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention may be realized and obtained by the structure particularly pointed out in the written description, the claims and the accompanying drawings.

Description of drawings

The accompanying drawings described here provide a further understanding of the present invention and form a part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the method for recognizing concept-type web pages according to the first embodiment of the present invention;

Fig. 2 is a block diagram of the system for recognizing concept-type web pages according to the second embodiment of the present invention;

Fig. 3A, Fig. 3B and Fig. 3C are flowcharts of the steps of the method for recognizing concept-type web pages according to the third embodiment of the present invention; and

Fig. 4 is a flowchart of the method for obtaining categorized concept-type websites and same-category concept words using pre-classified concept words according to the fourth embodiment of the present invention.

Embodiment

The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. Throughout, the same reference numerals denote the same elements.

Fig. 1 is a flowchart of the method for recognizing concept-type web pages according to the first embodiment of the present invention.

With reference to Fig. 1, the method for recognizing concept-type web pages according to this embodiment comprises the following steps:

Step S102: obtain a plurality of concept-type web pages from a web page database;

Step S104: compare the number of URIs of concept-type web pages under each level of directory of each website domain name with a first threshold, and determine as a concept-type directory each directory under which the number of URIs of concept-type web pages is greater than the first threshold; and

Step S106: match the URI of a web page to be identified against each concept-type directory and, if it matches, determine the web page to be identified to be a concept-type web page.

After step S104, the method further comprises: determining the category of the web pages under the concept-type directory. For example, for concept-type web pages that explain heart disease, coronary heart disease, diabetes and other diseases, the category of the concept-type directory, based on the category of each page, should be "disease and health".

After step S106, the method further comprises: classifying all web pages under the concept-type directory, and determining as the category of the concept-type directory the category that is shared by the web pages under that directory and whose share of those pages reaches a predetermined ratio. If a web page about pneumonia is determined by this method to be a concept-type web page, that page belongs to the above "disease and health" category.
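
The category rule described above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not say how individual pages are classified, so the per-page category labels are taken as given, and the function name `directory_category` and the parameter `min_ratio` are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def directory_category(page_categories: List[str], min_ratio: float) -> Optional[str]:
    """Return the category shared by at least `min_ratio` of the pages, else None.

    `page_categories` holds one (assumed, pre-computed) category label per page
    under a single concept-type directory.
    """
    if not page_categories:
        return None
    # Most frequent category among the pages under this directory.
    category, count = Counter(page_categories).most_common(1)[0]
    # "Reaches the predetermined ratio" is read here as greater than or equal to.
    if count / len(page_categories) >= min_ratio:
        return category
    return None
```

With three pages labelled "disease and health" and one labelled "news", a `min_ratio` of 0.7 would assign "disease and health" to the directory, and a page about pneumonia later matched to that directory would be added to the same category.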

After step S104, the method further comprises: counting the number of URIs of non-concept-type web pages in each level of directory under the concept-type directory and, where the number of URIs of non-concept-type web pages is greater than a second threshold, determining the directory containing those non-concept-type web pages to be a non-concept-type directory.

Step S106 further comprises: matching the web page to be identified against each non-concept-type directory; if it matches none of the non-concept-type directories, determining the web page to be identified to be a concept-type web page; otherwise, determining it to be a non-concept-type web page.

Step S104 comprises: counting the number of URIs of concept-type web pages under each website domain name directory and comparing it with the first threshold, and, when the number is greater than the first threshold, determining the website domain name directory to be a concept-type directory; counting the number of URIs of concept-type web pages in each subdirectory of the website domain name and comparing it with the first threshold, and, when the number is greater than the first threshold, determining that subdirectory to be a concept-type directory; and repeating this operation until the number of URIs of concept-type web pages in the directory being counted is not greater than the first threshold or the directory being counted has no subdirectory.

Step S102 comprises: using a concept-type web page recognition algorithm to obtain the plurality of concept-type web pages.

Fig. 2 is a block diagram of the system for recognizing concept-type web pages according to the second embodiment of the present invention.

With reference to Fig. 2, the system 200 for recognizing concept-type web pages comprises: an acquisition module 202 for obtaining a plurality of concept-type web pages from a web page database; a concept-type directory determination module 204 for comparing the number of URIs of concept-type web pages under each level of directory of each website domain name with a first threshold, and determining as a concept-type directory each directory under which the number of URIs of concept-type web pages is greater than the first threshold; and a matching and determination module 206 for matching the URI of a web page to be identified against each concept-type directory and, if it matches, determining the web page to be identified to be a concept-type web page.

The system further comprises: a category determination module 208 for classifying all web pages under the concept-type directory and determining as the category of the concept-type directory the category that is shared by the web pages under that directory and whose share of those pages reaches a predetermined ratio; and an adding module 210 for adding the web page determined to be a concept-type web page to that category.

The system further comprises: a non-concept-type directory determination module 212 for counting the number of URIs of non-concept-type web pages in each level of directory under the concept-type directory and, where the number of URIs of non-concept-type web pages is greater than a second threshold, determining the directory containing those non-concept-type web pages to be a non-concept-type directory.

The matching and determination module 206 is further used to match the web page to be identified against the non-concept-type directories; if the page matches none of the non-concept-type directories, it is determined to be a concept-type web page; otherwise, it is determined to be a non-concept-type web page.

The acquisition module 202 uses a concept-type web page recognition algorithm to obtain the plurality of concept-type web pages.

The concept-type directory determination module 204 comprises: a first statistics module for counting the number of URIs of concept-type web pages under each level of directory of each website domain name; a first comparison module for comparing the number of URIs of concept-type web pages with the first threshold; and a first determination module for determining as a concept-type directory each directory under which the number of URIs of concept-type web pages is greater than the first threshold.

The non-concept-type directory determination module 212 comprises: a second statistics module for counting the number of URIs of non-concept-type web pages in each level of directory under the concept-type directory; a second comparison module for comparing the number of URIs of non-concept-type web pages with the second threshold; and a second determination module for determining, where the number of URIs of non-concept-type web pages is greater than the second threshold, the directory containing those non-concept-type web pages to be a non-concept-type directory.

Fig. 3A, Fig. 3B and Fig. 3C are flowcharts of the steps of the method for recognizing the distribution of concept-type websites according to the third embodiment of the present invention.

With reference to Fig. 3A, the method for finding concept-type websites comprises the following steps:

Step S302a: process a set of web pages with the concept-type document recognition algorithm of invention 1 to obtain a set of concept-type web pages; and

Step S304a: perform statistical processing on the URIs of the set of concept-type web pages.

In step S302a, each element of the web page set is a single web document. The set is identical or approximately identical to the set of web pages that user searches need to query. The data source of the web page set does not matter, but each web page must retain its original URI as its unique identifier.

Step S304a comprises: counting, for each website domain name in turn, the total number of concept-type web page URIs under it, and recording each website domain name (e.g. A.com) whose URI total exceeds a predetermined threshold N, together with the total number of concept-type web page URIs under that domain name; then, for each website so selected, counting the URI totals by subdirectory and, if the URI total under a directory still exceeds the predetermined threshold N, recording the URI of that directory (e.g. A.com/Z/) in place of the previously recorded parent domain name (A.com); and analyzing the directories level by level in this way until no subdirectory exceeds the threshold N or there is no next-level subdirectory to analyze.
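
A minimal sketch of step S304a follows, assuming URIs with a scheme that Python's `urllib.parse.urlsplit` can parse. The function names `directory_prefixes` and `concept_directories` and the parameter `threshold_n` are illustrative and do not come from the patent. Because the count under a subdirectory can never exceed the count under its parent, keeping only the deepest directories whose count exceeds N gives the same result as descending level by level and replacing each recorded parent.

```python
from collections import Counter
from typing import Iterable, List, Set
from urllib.parse import urlsplit


def directory_prefixes(url: str) -> List[str]:
    """Return the domain and every directory level above the page, shallowest first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if not parts.path.endswith("/"):
        segments = segments[:-1]  # the last segment is the page itself
    prefixes = [parts.netloc]
    for segment in segments:
        prefixes.append(prefixes[-1] + "/" + segment)
    return prefixes


def concept_directories(concept_urls: Iterable[str], threshold_n: int) -> Set[str]:
    """Directories whose concept-type URI count exceeds N, keeping only the deepest."""
    counts: Counter = Counter()
    for url in concept_urls:
        counts.update(directory_prefixes(url))
    over = {d for d, n in counts.items() if n > threshold_n}
    # A recorded parent is replaced by any child directory that still exceeds N.
    return {d for d in over if not any(o.startswith(d + "/") for o in over)}
```

Note that replacing a parent with its qualifying subdirectory drops sibling subdirectories that fall below N, which matches the document's later remark that data from some websites are lost.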

Non-concept-type directories within a concept-type website are found by continuing, once the concept-type websites have been found, to analyze the distribution of non-concept-type pages within the set of concept-type directories. With reference to Fig. 3B, the steps for finding non-concept-type directories within a concept-type website comprise:

Step S302b: using website matching, match the set of concept-type directories generated in step S304a against the URIs of the web pages in the web page set, and add each successfully matched page to "concept-type web page set*";

Step S304b: run the concept-type document recognition algorithm of invention 1 on "concept-type web page set*", and add each web page recognized as a non-concept-type document to a set of non-concept-type web pages; and

Step S306b: taking each URI in the set of concept-type directories (e.g. A.com/Z) as the base, count how the URIs of the web pages in the non-concept-type web page set are distributed across the directory levels below that concept-type directory; if the proportion of non-concept-type web pages under a directory at some level exceeds a predetermined threshold K, record that directory and stop analyzing it further.
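
Step S306b could be sketched roughly as below. The sketch makes several assumptions not stated in the patent: pages are given as scheme-less URI strings already labelled by some recognizer, the "proportion" is taken to be non-concept pages divided by all labelled pages under the directory, and the names `non_concept_directories`, `labelled_pages` and `threshold_k` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def non_concept_directories(
    concept_dir: str, labelled_pages: Dict[str, bool], threshold_k: float
) -> Set[str]:
    """Find subdirectories of `concept_dir` dominated by non-concept-type pages.

    `labelled_pages` maps a page URI (e.g. "A.com/Z/M/2.html") to True when the
    page was recognized as concept-type and False otherwise.
    """
    totals: Dict[str, int] = defaultdict(int)
    negatives: Dict[str, int] = defaultdict(int)
    for uri, is_concept in labelled_pages.items():
        if not uri.startswith(concept_dir + "/"):
            continue  # page is outside this concept-type directory
        segments = uri[len(concept_dir) + 1:].split("/")[:-1]  # drop the page name
        path = concept_dir
        for segment in segments:
            path += "/" + segment
            totals[path] += 1
            negatives[path] += 0 if is_concept else 1
    recorded: Set[str] = set()
    for directory in sorted(totals, key=lambda d: d.count("/")):  # shallow levels first
        if any(directory.startswith(r + "/") for r in recorded):
            continue  # an ancestor is already recorded, so this branch is not analyzed
        if negatives[directory] / totals[directory] > threshold_k:
            recorded.add(directory)
    return recorded
```

The summary and claims describe this test as a URI count compared with a second threshold, whereas this embodiment uses a proportion K; the sketch follows the embodiment.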

Fig. 3C shows a method that uses the set of concept-type directories generated by the steps of Fig. 3A and the set of non-concept-type directories generated by the steps of Fig. 3B, together with a simplified concept-type document analysis algorithm (hereinafter "method 2"), to analyze and recognize concept-type web pages.

Step S302c: match the URI of each input web page against the set of concept-type directories in turn. If the match against the set of concept-type directories fails, the page is not recognized as a concept-type web page. If it succeeds, a similar match is then made against the set of non-concept-type directories: if that match succeeds, the page is not recognized as a concept-type web page; if it fails, the page is recognized as a concept-type page and its concept word is extracted.

An example of step S302c follows:

The steps of Fig. 3A yield the set of concept-type directories {XXX://A.com/Z, XXX://A.com/Y, XXX://B.com/W}

The steps of Fig. 3B yield the set of non-concept-type directories {XXX://A.com/Z/M, XXX://A.com/Y/N/P, XXX://B.com/W/Q}

The steps of Fig. 3C then give the following results for these URIs:

XXX://A.com/Z/H/1.html: concept-type (matches XXX://A.com/Z)

XXX://A.com/Z/M/2.html: non-concept-type (matches XXX://A.com/Z and matches XXX://A.com/Z/M)

XXX://A.com/Y/N/R/3.html: concept-type (matches XXX://A.com/Y)

XXX://B.com/4.html: non-concept-type

XXX://C.com/5.html: non-concept-type
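
The matching in step S302c can be reproduced with plain prefix matching on whole path segments. The sketch below is one possible reading rather than the patented implementation; the helper names `under` and `is_concept_page` are hypothetical, and the placeholder scheme `XXX://` is kept exactly as it appears in the example.

```python
from typing import Iterable


def under(uri: str, directory: str) -> bool:
    """True when `uri` lies inside `directory` (prefix match on whole path segments)."""
    return uri == directory or uri.startswith(directory.rstrip("/") + "/")


def is_concept_page(
    uri: str, concept_dirs: Iterable[str], non_concept_dirs: Iterable[str]
) -> bool:
    """Concept-type only if some concept directory matches and no non-concept one does."""
    if not any(under(uri, d) for d in concept_dirs):
        return False  # no concept-type directory matched
    return not any(under(uri, d) for d in non_concept_dirs)


concept_dirs = {"XXX://A.com/Z", "XXX://A.com/Y", "XXX://B.com/W"}
non_concept_dirs = {"XXX://A.com/Z/M", "XXX://A.com/Y/N/P", "XXX://B.com/W/Q"}

for page in [
    "XXX://A.com/Z/H/1.html",
    "XXX://A.com/Z/M/2.html",
    "XXX://A.com/Y/N/R/3.html",
    "XXX://B.com/4.html",
    "XXX://C.com/5.html",
]:
    label = "concept-type" if is_concept_page(page, concept_dirs, non_concept_dirs) else "non-concept-type"
    print(page, label)
```

Matching on whole segments (appending "/" before comparing) avoids treating a directory such as A.com/Zoo as lying under A.com/Z, a detail the patent text does not address.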

In the three procedures of Fig. 3A, Fig. 3B and Fig. 3C, the steps of Fig. 3A and Fig. 3B analyze the distribution of concept-type websites on the basis of the method of invention 1. The method of invention 1 need not be used, however: the "concept-type recognition algorithm" can be replaced with any other effective concept-type document recognition algorithm. The steps of Fig. 3C apply the recognized distribution of concept-type websites: using the statistics of that distribution, only specific data are recognized as concept-type web pages. Because the distribution of concept-type websites is analyzed in advance, recognition is very efficient. Because concept-type web pages are densely distributed on the Internet, recognizing concept-type documents by the distribution of concept-type websites effectively compensates for the defects of the recognition algorithm: although data from some websites are lost, the total number of concept-type documents recognized increases appreciably. Moreover, because concept-type data are densely distributed within these websites, the credibility of the search results is also higher for the user, and the effect is better than recognizing concept-type documents individually.

Because concept-type documents tend to be concentrated on certain specialist websites, known concept words of particular categories can be used: after the steps of Fig. 3C have analyzed a concept-type document and extracted its concept word, that word is matched against the known concept words. When the match succeeds, the concept-type directory that the document hit during recognition is recorded in the set of concept-type directories corresponding to the category of the matched concept word. In the example of Fig. 3C, if the concept word extracted from XXX://A.com/Z/H/1.html matches a concept word in category A, the concept-type directory XXX://A.com/Z is recorded in the set of concept-type directories corresponding to category A.

With reference to Fig. 4, the method according to the fourth embodiment of the present invention for obtaining categorized concept-type websites and same-category concept words using pre-classified concept words comprises the following step:

Step S402: after the sets of concept-type directories for several categories have been obtained, compile the category information corresponding to each concept-type directory. Provided that the category concept words supplied in advance for matching are sufficiently comprehensive, if a concept-type directory corresponds to only one or a limited few categories, or if the words matching one or a limited few categories account for a proportion exceeding a predetermined threshold Q of all the pre-supplied category words matched by the directory, and if at the same time the total number of concept words matching that one or those few categories exceeds a predetermined threshold P, then the documents under the concept-type directory can all be considered to belong to that one or those few categories, and the concept words extracted from documents under the directory also essentially belong to that one or those few categories. In other words, the category distribution of the concept words under a concept-type directory can be used to identify likely categorized concept-type directories and to mine further concept words of the same category.
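
One simplified reading of step S402 is sketched below. It is an interpretation, not the patent's procedure: it handles only the single dominant category (the text also allows "a limited few" categories), it treats Q as a share of the matched words and P as a minimum count, and the names `classify_directory`, `word_category`, `threshold_q` and `threshold_p` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def classify_directory(
    extracted_words: Iterable[str],
    word_category: Dict[str, str],
    threshold_q: float,
    threshold_p: int,
) -> Optional[str]:
    """Assign a category to a concept-type directory from its extracted concept words.

    `word_category` is the pre-classified vocabulary (concept word -> category).
    Words not in the vocabulary are ignored when computing the share.
    """
    matched = Counter(word_category[w] for w in extracted_words if w in word_category)
    if not matched:
        return None
    category, hits = matched.most_common(1)[0]
    share = hits / sum(matched.values())
    # Both conditions must hold: share above Q and absolute count above P.
    if share > threshold_q and hits > threshold_p:
        return category
    return None
```

Under this reading, once a directory is assigned a category, the unmatched concept words extracted from its pages become candidates for new words of that category, which corresponds to the mining of same-category concept words mentioned above.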

The above are merely preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for recognizing concept-type web pages, characterized in that it comprises the following steps:

obtaining a plurality of concept-type web pages from a web page database to obtain a set of concept-type web pages, wherein a concept-type web page is a web page that takes the explanation of a concept as its subject and elaborates on the intension and extension of that single concept;

comparing the number of URIs of concept-type web pages under each level of directory of each website domain name in said set of concept-type web pages with a first threshold, and determining as a concept-type directory each directory under which the number of URIs of said concept-type web pages is greater than said first threshold; and

matching the URI of a web page to be identified against each said concept-type directory and, if it matches, adding the matched web page to be identified to said set of concept-type web pages.

2. The method according to claim 1, characterized in that, after said step of determining the concept-type directory, it further comprises:

classifying all web pages under said concept-type directory; and

determining as the category of said concept-type directory the category that is shared by the web pages under said concept-type directory and whose share of those pages reaches a predetermined ratio.

3. The method according to claim 2, characterized in that adding the matched web page to be identified to said set of concept-type web pages comprises:

adding the matched web page to be identified to said category.

4. The method according to claim 1, characterized in that, after the step of determining the concept-type directory, the method further comprises:

counting the number of URIs of non-concept-type web pages in each level of directory under the concept-type directory and, where the number of URIs of the non-concept-type web pages is greater than a second threshold, determining the directory in which the non-concept-type web pages are located as a non-concept-type directory.

5. The method according to claim 4, characterized in that the step of adding the matched web page to be identified to the concept-type web page set further comprises:

matching the web page to be identified against each non-concept-type directory; if it matches none of the non-concept-type directories, adding the web page to be identified to the concept-type web page set; otherwise, determining the web page to be identified to be a non-concept-type web page.

6. The method according to claim 1, characterized in that the step of determining the concept-type directory comprises:

counting the number of URIs of the concept-type web pages under the directory of each website domain name and comparing it with the first threshold, and, when the number of URIs of the concept-type web pages under the website domain name directory is greater than the first threshold, determining the website domain name directory as a concept-type directory; and

counting the number of URIs of the concept-type web pages in each next-level directory of the website domain name and comparing it with the first threshold, and, when that number is greater than the first threshold, determining the next-level directory as a concept-type directory; and repeating this operation until the number of URIs of the concept-type web pages in the directory being counted is not greater than the first threshold or the directory being counted has no next-level directory.

7. The method according to claim 6, characterized in that the step of obtaining a plurality of concept-type web pages comprises:

obtaining the plurality of concept-type web pages by using a concept-type web page recognizer.

8. A system for identifying concept-type web pages, characterized by comprising:

an acquisition module, configured to obtain a plurality of concept-type web pages from a web page database to obtain a concept-type web page set, wherein a concept-type web page is a web page that takes the explanation of a concept as its topic and describes the connotation and extension of that same concept;

a concept-type directory determination module, configured to compare the number of URIs of the concept-type web pages under each level of directory of each website domain name in the concept-type web page set with a first threshold, and to determine as a concept-type directory each directory under which the number of URIs of the concept-type web pages is greater than the first threshold; and

a matching and determination module, configured to match the URI of a web page to be identified against each concept-type directory and, if it matches, to add the matched web page to be identified to the concept-type web page set.

9. The system according to claim 8, characterized by further comprising: a category determination module, configured to classify all web pages under the concept-type directory and to determine, as the category of the concept-type directory, the category of those web pages under the concept-type directory that share the same category and reach a predetermined proportion of the total; and an adding module, configured to add the matched web page to be identified to said category.

10. The system according to claim 8, characterized by further comprising: a non-concept-type directory determination module, configured to count the number of URIs of non-concept-type web pages in each level of directory under the concept-type directory and, where the number of URIs of the non-concept-type web pages is greater than a second threshold, to determine the directory in which the non-concept-type web pages are located as a non-concept-type directory.

11. The system according to claim 10, characterized in that the matching and determination module is further configured to match the web page to be identified against the non-concept-type directories; if it matches none of the non-concept-type directories, to add the web page to be identified to the concept-type web page set; and otherwise to determine the web page to be identified to be a non-concept-type web page.

12. The system according to claim 8, characterized in that the acquisition module uses a concept-type web page recognizer to obtain the plurality of concept-type web pages.

13. The system according to claim 8, characterized in that the concept-type directory determination module comprises:

a first statistics module, configured to count the number of URIs of the concept-type web pages under each level of directory of each website domain name;

a first comparison module, configured to compare the number of URIs of the concept-type web pages with the first threshold; and

a first determination module, configured to determine as the concept-type directory each directory under which the number of URIs of the concept-type web pages is greater than the first threshold.

14. The system according to claim 10, characterized in that the non-concept-type directory determination module comprises:

a second statistics module, configured to count the number of URIs of the non-concept-type web pages in each level of directory under the concept-type directory;

a second comparison module, configured to compare the number of URIs of the non-concept-type web pages with the second threshold; and

a second determination module, configured to determine the directory in which the non-concept-type web pages are located as a non-concept-type directory where the number of URIs of the non-concept-type web pages is greater than the second threshold.
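
For illustration only, the directory determination and matching steps recited in claims 1 and 4 to 6 can be sketched as follows. This is a sketch under stated assumptions, not the claimed implementation: the function names are hypothetical, a directory is represented as a host-plus-path prefix ending in "/", the count for a directory includes pages in all of its subdirectories, and a directory that is itself a concept-type directory is never marked as a non-concept-type directory.

```python
from collections import Counter
from urllib.parse import urlsplit


def directory_prefixes(uri):
    """List every directory level containing the page, from domain root down."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    if not parts.path.endswith("/"):
        segments = segments[:-1]  # the last segment is the page itself
    prefixes = [parts.netloc + "/"]
    for segment in segments:
        prefixes.append(prefixes[-1] + segment + "/")
    return prefixes


def count_by_directory(uris):
    """Count URIs under each directory level (subdirectories included)."""
    counts = Counter()
    for uri in uris:
        counts.update(directory_prefixes(uri))
    return counts


def find_directories(concept_uris, non_concept_uris,
                     first_threshold, second_threshold):
    """Return (concept-type directories, non-concept-type directories)."""
    concept_counts = count_by_directory(concept_uris)
    concept_dirs = {d for d, n in concept_counts.items()
                    if n > first_threshold}
    non_concept_counts = count_by_directory(non_concept_uris)
    # Assumption: only directories strictly beneath a concept-type
    # directory, and not concept-type themselves, are considered.
    non_concept_dirs = {
        d for d, n in non_concept_counts.items()
        if n > second_threshold
        and d not in concept_dirs
        and any(d.startswith(c) for c in concept_dirs)
    }
    return concept_dirs, non_concept_dirs


def is_concept_page(uri, concept_dirs, non_concept_dirs):
    """Match a page to be identified against both directory sets."""
    prefixes = directory_prefixes(uri)
    if not any(p in concept_dirs for p in prefixes):
        return False
    return not any(p in non_concept_dirs for p in prefixes)
```

Because the counts are cumulative, a subdirectory can never hold more concept-type pages than its parent, so selecting every directory whose count exceeds the first threshold yields the same set as the level-by-level descent of claim 6, which stops once a count is not greater than the threshold.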