CN102346748A - Automatic identification method for network literature directory type web pages

Info

Publication number: CN102346748A
Application number: CN2010102458463A
Authority: CN
Inventors: 陈运文; 马飞涛; 宋海涛
Original assignee: Shengle Information Technology Shanghai Co Ltd
Current assignee: Shengle Information Technology Shanghai Co Ltd
Priority date: 2010-08-05
Filing date: 2010-08-05
Publication date: 2012-02-08

Abstract

The invention discloses an automatic identification method for network literature directory type web pages. The method comprises the following steps: acquiring the data body of the current web page; extracting the character strings corresponding to hyperlink tags that contain hyperlink addresses and combining them into character string array I; removing from character string array I the array elements that contain image hyperlink tags, to form character string array II; extracting the hyperlink text of each array element of character string array II to form character string array III; judging whether each array element of character string array III is a piece of directory text information, and counting the array elements that are directory text information to obtain numerical value I; dividing numerical value I by the total number of array elements of character string array III to obtain a confirmation ratio; and, when the confirmation ratio is greater than 0.7 or numerical value I is greater than 15, determining that the current web page is a literature directory page. With this method, the differing novel directory pages of different sites can be identified well.

Description

Automatic identification method for online literature directory web pages

Technical field

The present invention relates to web page processing, and in particular to an automatic identification method for online literature directory web pages.

Background technology

The online literature business is developing rapidly on the Internet, and Internet users increasingly rely on the network to read literary works. When literature pages are read online, the directory page is the most important page: it lists all the chapters of a work, so that the user can reach the desired chapter most conveniently.

In the prior art, a web page is an HTML (HyperText Markup Language) file. The structure of an HTML file comprises two major parts: the head, which is the data header of the web page, and the body, which is the data body of the web page. The data header is the part between the <head> and </head> tags, and the data body is the part between the <body> and </body> tags. For a search engine processing web pages, it is very necessary to identify novel directory pages: only after these pages have been correctly identified can the corresponding novel directory page be offered directly to a user who searches for the title of a literary work, which improves the quality of the search results.

In the prior art, identifying novel directory pages faces the following difficulties: 1. The HTML used by different websites differs from site to site, for example in page layout, CSS templates, fonts, font sizes and colors, so novel directory pages cannot be identified by simple template matching. 2. Neither the web page nor its address (URL) carries obvious novel directory information, so it is difficult to extract novel directory page information from the URL alone. In addition, keywords such as "directory" or "list" do not appear directly in the text content of the page, so the page type is also difficult to obtain directly from the text.

Summary of the invention

The technical problem to be solved by the present invention is to provide an automatic identification method for online literature directory web pages that overcomes the identification problem caused by the diversity of novel directory pages across different types of websites and identifies novel directory pages well.

To solve the above technical problem, the automatic identification method for online literature directory web pages provided by the invention comprises the following steps:

Step 1: obtain the data body of the current web page. The data body is the part of the HTML source file between the <body> and </body> tags.
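
As a non-limiting illustration, Step 1 might be sketched in Python as follows. The function name `extract_body` and the use of a regular expression are assumptions made for this sketch and are not taken from the disclosure; a production system would more likely rely on a full HTML parser.

```python
import re

# Hypothetical helper: the patent only says "the part between <body> and </body>".
# The pattern tolerates upper-case tags and attributes on <body>.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(html: str) -> str:
    """Return the data body of a page, or an empty string if no body is found."""
    match = BODY_RE.search(html)
    return match.group(1) if match else ""


page = (
    "<html><head><title>t</title></head>"
    "<BODY class='x'><a href='1.html'>第一章</a></BODY></html>"
)
print(extract_body(page))
```

Returning an empty string when no body is present is a design choice of the sketch; it lets the later steps run unchanged on malformed pages.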

Step 2: extract from the data body all character strings corresponding to hyperlink tags that contain a hyperlink address, and store the character string corresponding to each such hyperlink tag as one array element of character string array one. The hyperlink tag is the HTML <a> tag, and a hyperlink tag that contains a hyperlink address is an <a> tag that contains the "href=" parameter. The character strings are extracted as follows: judge whether the data body contains the "<a href=" mark; for each part that contains the "<a href=" mark, extract the whole character string from the "<a" mark to the closing "</a>" mark.
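
One possible way to realize Step 2 is sketched below in Python. The names `extract_hyperlinks`, `ANCHOR_RE` and `HREF_RE` are hypothetical, and the sketch assumes the hyperlink address parameter is the standard HTML `href` attribute, which may appear anywhere inside the opening tag rather than only directly after `<a`.

```python
import re
from typing import List

# From "<a" up to the first following "</a>", case-insensitively.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a\s*>", re.IGNORECASE | re.DOTALL)
# An opening <a ...> tag that carries an href attribute.
HREF_RE = re.compile(r"<a\b[^>]*\bhref\s*=", re.IGNORECASE)


def extract_hyperlinks(body: str) -> List[str]:
    """Build character string array one: every <a ...>...</a> span with an href."""
    return [m.group(0) for m in ANCHOR_RE.finditer(body) if HREF_RE.match(m.group(0))]


body = (
    '<a href="c1.html">第一章</a>'
    '<a name="top">Top</a>'
    '<A HREF="c2.html"><img src="x.png"></A>'
)
array_one = extract_hyperlinks(body)
print(array_one)
```

In this sample the named anchor without an address is skipped, while the image link is kept for now, since removing image links is the task of Step 3.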

Step 3: remove from character string array one the array elements that contain an image hyperlink tag, forming character string array two. The image hyperlink tag is the "<img" mark.
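
Step 3 amounts to a simple filter. A minimal Python sketch follows; the function name `remove_image_links` is hypothetical, and matching "<img" case-insensitively is an assumption of the sketch rather than a requirement stated in the text.

```python
from typing import List


def remove_image_links(array_one: List[str]) -> List[str]:
    """Build character string array two by dropping elements that contain "<img"."""
    return [element for element in array_one if "<img" not in element.lower()]


array_one = [
    '<a href="c1.html">第一章</a>',
    '<a href="cover.html"><IMG src="cover.png"></a>',
    '<a href="c2.html">第二章</a>',
]
array_two = remove_image_links(array_one)
print(array_two)
```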

Step 4: extract the hyperlink text of each array element of character string array two, and form character string array three with each hyperlink text as an array element. The hyperlink text of an array element of character string array two is extracted as follows. First, a stack is created. The array element is then scanned character by character from head to tail, and the current character is judged as follows: when the current character is "<", it is pushed onto the stack; when the current character is ">" and the top element of the stack is "<", the "<" is popped from the stack; when the current character is neither "<" nor ">" and the top element of the stack is "<", the current character is ignored, that is, it is neither pushed nor popped, and scanning continues along the character string; when the current character is neither "<" nor ">" and the top element of the stack is not "<", the current character is pushed onto the stack. After the array element has been scanned in this way, the text remaining in the stack is popped out, and the popped text forms one array element of character string array three.
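
The stack scan of Step 4 can be illustrated with the following Python sketch. The function name `extract_link_text` is hypothetical. Two points are assumptions of the sketch: the remaining stack content is read from bottom to top so that the text keeps its original order, and a ">" that arrives when the stack top is not "<" (a case the description does not address) is kept as ordinary text.

```python
def extract_link_text(element: str) -> str:
    """Strip tags from one <a ...>...</a> string using the stack scan of Step 4."""
    stack = []
    for ch in element:
        top = stack[-1] if stack else None
        if ch == "<":
            stack.append(ch)   # a tag opens: push "<"
        elif ch == ">" and top == "<":
            stack.pop()        # the tag closes: pop the pending "<"
        elif ch != ">" and top == "<":
            continue           # inside a tag: ignore the character
        else:
            stack.append(ch)   # outside any tag: keep the character as text
    # Assumption: read the stack bottom-to-top to preserve the text order.
    return "".join(stack)


array_two = ['<a href="c1.html">第一章</a>', '<a href="c2.html"><b>第2节</b> 开端</a>']
array_three = [extract_link_text(element) for element in array_two]
print(array_three)
```

Because a pending "<" sits on top of the stack for the whole duration of a tag, tag names and attributes are never pushed, and only the visible link text survives the scan.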

Step 5: judge whether each array element of character string array three is a piece of directory text information, and count the array elements that are directory text information to obtain numerical value one. An array element of character string array three is judged to be directory text information if it satisfies the following condition: its first character is "第" (the ordinal prefix) and its subsequent characters contain "章" (chapter), "节" (section), "回" (chapter) or "话" (episode).
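
The test of Step 5 might be sketched in Python as below. The sketch assumes that the marker characters are the Chinese chapter-heading characters 第, 章, 节, 回 and 话; the names `is_directory_text`, `count_directory_entries` and `CHAPTER_MARKERS` are hypothetical, and trimming surrounding whitespace before the test is an added assumption.

```python
from typing import Iterable

# Assumed marker characters of a chapter heading such as "第十二章".
ORDINAL_PREFIX = "第"
CHAPTER_MARKERS = ("章", "节", "回", "话")


def is_directory_text(text: str) -> bool:
    """True if the text starts with the prefix and a marker follows it."""
    text = text.strip()  # assumption: ignore surrounding whitespace
    return text.startswith(ORDINAL_PREFIX) and any(m in text[1:] for m in CHAPTER_MARKERS)


def count_directory_entries(array_three: Iterable[str]) -> int:
    """Return numerical value one: how many elements are directory text."""
    return sum(1 for text in array_three if is_directory_text(text))


array_three = ["首页", "第一章 风起", "第二章 云涌", "第三回", "下一页"]
value_one = count_directory_entries(array_three)
print(value_one)
```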

Step 6: divide numerical value one by the total number of array elements of character string array three to obtain a confirmation ratio.

Step 7: when the confirmation ratio is greater than 0.7, or numerical value one is greater than 15, determine that the current web page is a literature directory page.

By proposing a page type identification method based on a hyperlink text density algorithm, the method of the invention overcomes the identification problem caused by the diversity of novel directory pages across different types of websites and identifies novel directory pages well.
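
Steps 6 and 7 reduce to one division and two comparisons, as the following Python sketch shows. The function name `is_directory_page` and its parameter names are hypothetical; treating an empty character string array three as a ratio of zero is an assumption, since the description does not address that case.

```python
def is_directory_page(
    value_one: int,
    total: int,
    ratio_threshold: float = 0.7,
    count_threshold: int = 15,
) -> bool:
    """Apply Steps 6 and 7 to numerical value one and the size of array three."""
    # Step 6: confirmation ratio (assumed to be 0 when the array is empty).
    ratio = value_one / total if total else 0.0
    # Step 7: either condition alone is sufficient; both comparisons are strict.
    return ratio > ratio_threshold or value_one > count_threshold


print(is_directory_page(8, 10))    # ratio 0.8 exceeds 0.7
print(is_directory_page(16, 100))  # ratio is low, but more than 15 entries
print(is_directory_page(3, 40))    # neither condition holds
```

The count condition lets a page with many navigation links still qualify once it lists more than 15 chapter entries, even though those extra links pull the ratio below 0.7.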

Description of drawings

The present invention is described in further detail below with reference to the accompanying drawing and an embodiment:

Fig. 1 is a flowchart of the method of the invention.

Embodiment

As shown in Fig. 1, which is a flowchart of the method of the invention, the automatic identification method for online literature directory web pages provided by the embodiment of the invention comprises the following steps:

The present invention has been described in detail above through a specific embodiment, but this does not limit the invention. Without departing from the principle of the invention, those skilled in the art may also make many variations and improvements, which shall likewise be regarded as falling within the protection scope of the invention.

Claims

1. An automatic identification method for online literature directory web pages, characterized by comprising the following steps:

Step 1: obtaining the data body of the current web page;

Step 2: extracting from the data body all character strings corresponding to hyperlink tags that contain a hyperlink address, and storing the character string corresponding to each such hyperlink tag as one array element of character string array one;

Step 3: removing from character string array one the array elements that contain an image hyperlink tag, to form character string array two;

Step 4: extracting the hyperlink text of each array element of character string array two, and forming character string array three with each hyperlink text as an array element;

Step 5: judging whether each array element of character string array three is a piece of directory text information, and counting the array elements that are directory text information to obtain numerical value one;

Step 6: dividing numerical value one by the total number of array elements of character string array three to obtain a confirmation ratio;

2. The automatic identification method for online literature directory web pages according to claim 1, characterized in that: the data body is the part of the HTML source file between <body> and </body>; the hyperlink tag in Step 2 is the <a> tag, and a hyperlink tag that contains a hyperlink address is an <a> tag that contains the "href=" parameter; in Step 2, all character strings corresponding to hyperlink tags that contain a hyperlink address are extracted from the data body as follows: judging whether the data body contains the "<a href=" mark; and, for each part that contains the "<a href=" mark, extracting the whole character string from the "<a" mark to the closing "</a>" mark.

3. The automatic identification method for online literature directory web pages according to claim 2, characterized in that: the image hyperlink tag in Step 3 is the "<img" mark.

4. The automatic identification method for online literature directory web pages according to claim 2, characterized in that: in Step 4, the hyperlink text of each array element of character string array two is extracted as follows: first, a stack is created; the array element of character string array two is then scanned character by character from head to tail, and the current character is judged as follows: when the current character is "<", it is pushed onto the stack; when the current character is ">" and the top element of the stack is "<", the "<" is popped from the stack; when the current character is neither "<" nor ">" and the top element of the stack is "<", the current character is ignored and scanning continues along the character string; when the current character is neither "<" nor ">" and the top element of the stack is not "<", the current character is pushed onto the stack; after the array element of character string array two has been scanned in this way, the text in the stack is popped out to form an array element of character string array three.

5. The automatic identification method for online literature directory web pages according to claim 2, characterized in that: in Step 5, whether each array element of character string array three is a piece of directory text information is judged by determining whether the array element satisfies the following condition: its first character is "第" (the ordinal prefix) and its subsequent characters contain "章" (chapter), "节" (section), "回" (chapter) or "话" (episode); if the condition is satisfied, the array element of character string array three is a piece of directory text information.