CN102346748A - Automatic identification method for network literature directory type web pages - Google Patents
Automatic identification method for network literature directory type web pages
- Publication number
- CN102346748A
- Authority
- CN
- China
- Prior art keywords
- character
- character string
- array
- hyperlink
- array element
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic identification method for network literature directory type web pages. The method comprises the following steps: acquiring the data body of the current web page; extracting the character strings corresponding to hyperlink marks that contain hyperlink addresses and combining them into a character string array I; removing the array elements that contain image hyperlink marks from the character string array I to form a character string array II; extracting the hyperlink text information of the array elements of the character string array II to form a character string array III; judging whether each array element in the character string array III is a piece of directory text information, and counting the array elements that are directory text information to obtain a numerical value I; dividing the numerical value I by the total number of array elements of the character string array III to obtain a confirmation ratio; and, when the confirmation ratio is more than 0.7 or the numerical value I is more than 15, determining that the current web page is a literature directory page. With this method, the differing novel directory pages of different sites can be identified well.
Description
Technical field
The present invention relates to web page processing, and in particular to a method for automatically identifying online literature directory web pages.
Background technology
Online literature is growing rapidly on the Internet, and Internet users increasingly rely on the web to read literary works. When a literary work is read online, the directory page is the most important page: it lists all chapters and sections of the work, so the user can reach the required chapter most conveniently.
In the prior art, a web page is an HTML (HyperText Markup Language) file. An HTML file has two main parts: the head, which is the data head of the web page, and the body, which is the data body of the web page. The data head is the part between the <head> and </head> tags, and the data body is the part between the <body> and </body> tags. For a search engine, identifying novel directory pages during web page processing is essential: only after these pages are correctly identified can the search engine return the corresponding novel directory page directly when a user searches for the title of a literary work, which improves the quality of the search results.
Identifying novel directory pages in the prior art faces the following difficulties. 1. Different websites use different HTML formats: page layout, CSS templates, fonts, font sizes, colors and so on all vary, so simple template matching cannot be used to identify novel directory pages. 2. Neither the web page nor its address (URL) carries obvious novel directory information, so it is difficult to extract directory page information from the URL alone. In addition, keywords such as "directory" or "list" do not appear directly in the text content of the page, so the page type is also hard to obtain directly from the text.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for automatically identifying online literature directory web pages that overcomes the identification problem caused by the diversity of novel directory pages across different types of websites and identifies novel directory pages well.
To solve the above technical problem, the method for automatically identifying online literature directory web pages provided by the invention comprises the following steps:
Step 1: obtain the data body of the current web page. The data body is the part of the HTML source file between the <body> and </body> tags.
Step 2: extract from the data body the character string corresponding to every hyperlink tag that contains a hyperlink address, and store the character string of each such hyperlink tag as one array element in character string array one. The hyperlink tag is the HTML <a> tag, and a hyperlink tag that contains a hyperlink address is an <a> tag that carries the "href=" parameter. The extraction proceeds as follows: determine whether the data body contains the "<a href=" mark; for each part that contains the "<a href=" mark, extract the whole character string that runs from the "<a" mark to the closing "</a>" mark.
Step 3: remove from character string array one the array elements that contain an image hyperlink tag, forming character string array two. The image hyperlink tag is the "<img" mark.
Step 4: extract the hyperlink text information of each array element of character string array two, and form character string array three with each piece of hyperlink text information as an array element. The hyperlink text information of an array element is extracted as follows. First, create a stack. Then scan the array element character by character from head to tail and handle the current character as follows: when the current character is "<", push it onto the stack; when the current character is ">" and the top of the stack is "<", pop the "<" from the stack; when the current character is neither "<" nor ">" and the top of the stack is "<", ignore the current character (neither push nor pop) and continue scanning forward along the string; when the current character is neither "<" nor ">" and the top of the stack is not "<", push the current character onto the stack. After the array element has been scanned in this way, pop the text out of the stack; the popped text forms one array element of character string array three.
Step 5: judge whether each array element in character string array three is a piece of directory text information, and count the array elements that are directory text information to obtain numerical value one. An array element of character string array three is a piece of directory text information if it satisfies the following condition: its first character is "第" (the Chinese ordinal prefix) and its subsequent characters contain "章" (chapter), "节" (section), "回" (chapter of a serial novel) or "话" (episode).
Step 6: divide numerical value one by the total number of array elements of character string array three to obtain a confirmation ratio.
Step 7: when the confirmation ratio is greater than 0.7, or numerical value one is greater than 15, determine that the current web page is a literature directory page.
By proposing a page type identification method based on a hyperlink text density algorithm, the method of the invention overcomes the identification problem caused by the diversity of novel directory pages across different types of websites and identifies novel directory pages well.
Description of drawings
The invention is described in further detail below with reference to the accompanying drawing and an embodiment:
Fig. 1 is a flowchart of the method of the invention.
Embodiment
As shown in Fig. 1, which is a flowchart of the method of the invention, the method for automatically identifying online literature directory web pages provided by the embodiment of the invention comprises the following steps:
Step 1: obtain the data body of the current web page. The data body is the part of the HTML source file between the <body> and </body> tags.
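Step 1 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_body` and the regular expression are assumptions, and the sketch tolerates attributes on the opening tag and mixed-case tag names, which the text does not discuss.

```python
import re


def extract_body(html: str) -> str:
    """Return the part of the page between <body ...> and </body>, or "" if absent."""
    match = re.search(r"<body\b[^>]*>(.*?)</body\s*>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else ""


page = (
    "<html><head><title>t</title></head>"
    "<body class='x'><a href='1.html'>第一章</a></body></html>"
)
print(extract_body(page))
```

A regular expression is enough here because only the outermost body element is needed; a real crawler would more likely rely on an HTML parser.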
Step 2: extract from the data body the character string corresponding to every hyperlink tag that contains a hyperlink address, and store the character string of each such hyperlink tag as one array element in character string array one. The hyperlink tag is the HTML <a> tag, and a hyperlink tag that contains a hyperlink address is an <a> tag that carries the "href=" parameter. The extraction proceeds as follows: determine whether the data body contains the "<a href=" mark; for each part that contains the "<a href=" mark, extract the whole character string that runs from the "<a" mark to the closing "</a>" mark.
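A possible sketch of Step 2 is given below. The names `ANCHOR_RE` and `extract_anchors` are hypothetical. As an assumption that goes slightly beyond the literal "<a href=" match described above, the pattern also accepts an href attribute that is not the first attribute of the tag and ignores letter case.

```python
import re

# An <a ...> opening tag that carries an href attribute, up to the nearest </a>.
ANCHOR_RE = re.compile(
    r"<a\b[^>]*\bhref\s*=[^>]*>.*?</a\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_anchors(body: str) -> list[str]:
    """Return every hyperlink element with an address, from "<a" through "</a>"."""
    return ANCHOR_RE.findall(body)


body = (
    '<a name="top">x</a>'
    '<a href="1.html">第一章</a> '
    '<A HREF="2.html"><img src="a.png"></A>'
)
for element in extract_anchors(body):
    print(element)
```

The anchor without an href attribute is skipped, so only elements that actually link somewhere reach the later steps.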
Step 3: remove from character string array one the array elements that contain an image hyperlink tag, forming character string array two. The image hyperlink tag is the "<img" mark.
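Step 3 amounts to a simple filter. In the sketch below, the function name `drop_image_links` is hypothetical, and the case-insensitive comparison is an assumption; the text only names the "<img" mark.

```python
def drop_image_links(anchors: list[str]) -> list[str]:
    """Keep only the hyperlink elements that do not contain an <img tag."""
    return [a for a in anchors if "<img" not in a.lower()]


array_one = [
    '<a href="1.html">第一章</a>',
    '<a href="cover.html"><IMG src="cover.png"></a>',
]
array_two = drop_image_links(array_one)
print(array_two)
```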
Step 4: extract the hyperlink text information of each array element of character string array two, and form character string array three with each piece of hyperlink text information as an array element. The hyperlink text information of an array element is extracted as follows. First, create a stack. Then scan the array element character by character from head to tail and handle the current character as follows: when the current character is "<", push it onto the stack; when the current character is ">" and the top of the stack is "<", pop the "<" from the stack; when the current character is neither "<" nor ">" and the top of the stack is "<", ignore the current character (neither push nor pop) and continue scanning forward along the string; when the current character is neither "<" nor ">" and the top of the stack is not "<", push the current character onto the stack. After the array element has been scanned in this way, pop the text out of the stack; the popped text forms one array element of character string array three.
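The stack scan of Step 4 can be sketched as follows. The function name `link_text` is hypothetical. Two points are assumptions, because the text does not settle them: a ">" met while the top of the stack is not "<" is kept as ordinary text, and the stack contents are read from bottom to top so that the characters come out in their original order.

```python
def link_text(element: str) -> str:
    """Strip the tags from one hyperlink element using the stack scan of Step 4."""
    stack: list[str] = []
    for ch in element:
        if ch == "<":
            stack.append(ch)      # a tag opens
        elif ch == ">" and stack and stack[-1] == "<":
            stack.pop()           # the tag closes: discard its "<"
        elif stack and stack[-1] == "<":
            continue              # inside a tag: ignore the character
        else:
            stack.append(ch)      # visible text: keep it
    # Assumption: read the stack bottom-to-top to preserve character order.
    return "".join(stack)


print(link_text('<a href="1.html">第一章 <b>开端</b></a>'))
```

Because the scan treats every ">" as the end of a tag, an attribute value that itself contains ">" would be split incorrectly; the sketch inherits that limitation from the procedure as described.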
Step 5: judge whether each array element in character string array three is a piece of directory text information, and count the array elements that are directory text information to obtain numerical value one. An array element of character string array three is a piece of directory text information if it satisfies the following condition: its first character is "第" (the Chinese ordinal prefix) and its subsequent characters contain "章" (chapter), "节" (section), "回" (chapter of a serial novel) or "话" (episode).
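The rule of Step 5 and the count it produces can be sketched as below. The names `CHAPTER_MARKERS`, `is_directory_text` and `count_directory_texts` are hypothetical, and stripping surrounding whitespace before the check is an assumption not stated in the text.

```python
CHAPTER_MARKERS = ("章", "节", "回", "话")


def is_directory_text(text: str) -> bool:
    """True if the text starts with "第" and a chapter marker follows it."""
    text = text.strip()  # assumption: ignore surrounding whitespace
    return text.startswith("第") and any(m in text[1:] for m in CHAPTER_MARKERS)


def count_directory_texts(texts: list[str]) -> int:
    """Numerical value one: how many link texts look like chapter titles."""
    return sum(1 for t in texts if is_directory_text(t))


array_three = ["首页", "第一章 开端", "第12回", "章节目录", "第"]
print(count_directory_texts(array_three))
```

In this sample only "第一章 开端" and "第12回" satisfy the condition: "章节目录" contains a marker but does not begin with "第", and a lone "第" has no marker after it.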
Step 6: divide numerical value one by the total number of array elements of character string array three to obtain a confirmation ratio.
Step 7: when the confirmation ratio is greater than 0.7, or numerical value one is greater than 15, determine that the current web page is a literature directory page.
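Steps 6 and 7 reduce to one division and one comparison, sketched below. The function name `is_directory_page` is hypothetical, and returning a ratio of 0 for a page without any text links is an assumption; the text does not cover that case.

```python
def is_directory_page(chapter_count: int, total_links: int) -> bool:
    """Apply the thresholds of Step 7 to the counts from Steps 5 and 6."""
    # Assumption: a page with no text links gets a ratio of 0.
    ratio = chapter_count / total_links if total_links else 0.0
    return ratio > 0.7 or chapter_count > 15


print(is_directory_page(8, 10))    # ratio 0.8
print(is_directory_page(16, 100))  # ratio 0.16, but more than 15 chapter links
print(is_directory_page(3, 40))    # neither threshold is met
```

The two conditions complement each other: the ratio catches short directory pages made up almost entirely of chapter links, while the absolute count catches long chapter lists on pages that also carry many navigation or advertising links.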
The invention has been described in detail above through a specific embodiment, but this description does not limit the invention. Without departing from the principle of the invention, those skilled in the art can make many variations and improvements, and these shall also be regarded as falling within the scope of protection of the invention.
Claims (5)
1. A method for automatically identifying online literature directory web pages, characterized in that it comprises the following steps:
Step 1: obtain the data body of the current web page;
Step 2: extract from the data body the character string corresponding to every hyperlink tag that contains a hyperlink address, and store the character string of each such hyperlink tag as one array element in character string array one;
Step 3: remove from character string array one the array elements that contain an image hyperlink tag, forming character string array two;
Step 4: extract the hyperlink text information of each array element of character string array two, and form character string array three with each piece of hyperlink text information as an array element;
Step 5: judge whether each array element in character string array three is a piece of directory text information, and count the array elements that are directory text information to obtain numerical value one;
Step 6: divide numerical value one by the total number of array elements of character string array three to obtain a confirmation ratio;
Step 7: when the confirmation ratio is greater than 0.7, or numerical value one is greater than 15, determine that the current web page is a literature directory page.
2. The method for automatically identifying online literature directory web pages according to claim 1, characterized in that: the data body belongs to an HTML source file; the hyperlink tag in step 2 is <a>, and a hyperlink tag that contains a hyperlink address is an <a> tag that carries the "href=" parameter; in step 2, the character strings corresponding to all hyperlink tags that contain a hyperlink address are extracted from the data body as follows: determine whether the data body contains the "<a href=" mark; for each part that contains the "<a href=" mark, extract the whole character string that runs from the "<a" mark to the closing "</a>" mark.
3. The method for automatically identifying online literature directory web pages according to claim 2, characterized in that: the image hyperlink tag in step 3 is the "<img" mark.
4. The method for automatically identifying online literature directory web pages according to claim 2, characterized in that: in step 4, the hyperlink text information of each array element of character string array two is extracted as follows: first, create a stack; then scan the array element of character string array two character by character from head to tail and handle the current character as follows: when the current character is "<", push it onto the stack; when the current character is ">" and the top of the stack is "<", pop the "<" from the stack; when the current character is neither "<" nor ">" and the top of the stack is "<", ignore the current character and continue scanning forward along the string; when the current character is neither "<" nor ">" and the top of the stack is not "<", push the current character onto the stack; after the array element of character string array two has been scanned in this way, pop the text out of the stack to form an array element of character string array three.
5. The method for automatically identifying online literature directory web pages according to claim 2, characterized in that: in step 5, whether each array element in character string array three is a piece of directory text information is judged by checking whether the array element satisfies the following condition: its first character is "第" (the Chinese ordinal prefix) and its subsequent characters contain "章" (chapter), "节" (section), "回" (chapter of a serial novel) or "话" (episode); if the condition is satisfied, the array element of character string array three is a piece of directory text information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102458463A CN102346748A (en) | 2010-08-05 | 2010-08-05 | Automatic identification method for network literature directory type web pages |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102458463A CN102346748A (en) | 2010-08-05 | 2010-08-05 | Automatic identification method for network literature directory type web pages |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102346748A true CN102346748A (en) | 2012-02-08 |
Family
ID=45545432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2010102458463A Pending CN102346748A (en) | 2010-08-05 | 2010-08-05 | Automatic identification method for network literature directory type web pages |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102346748A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103714075A (en) * | 2012-09-29 | 2014-04-09 | 北京百度网讯科技有限公司 | Website contents page determination method and device |
CN103970755A (en) * | 2013-01-28 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Novel catalog entry identification method, device and system |
CN106407291A (en) * | 2016-08-29 | 2017-02-15 | 达而观信息科技(上海)有限公司 | Hyperlinked text density algorithm-based page type identification method |
CN106445967A (en) * | 2015-08-11 | 2017-02-22 | 腾讯科技(深圳)有限公司 | Resource directory management method and apparatus |
CN110750739A (en) * | 2018-07-04 | 2020-02-04 | 北京国双科技有限公司 | Page type determination method and device |
CN111831948A (en) * | 2019-04-18 | 2020-10-27 | 阿里巴巴集团控股有限公司 | Webpage type detection method and device and computer equipment |
CN113221031A (en) * | 2020-12-30 | 2021-08-06 | 江苏省未来网络创新研究院 | Method for automatically identifying website directory page |
CN113221031B (en) * | 2020-12-30 | 2024-05-31 | 江苏省未来网络创新研究院 | Method for automatically identifying website catalog page |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103714075A (en) * | 2012-09-29 | 2014-04-09 | 北京百度网讯科技有限公司 | Website contents page determination method and device |
CN103714075B (en) * | 2012-09-29 | 2018-07-13 | 北京百度网讯科技有限公司 | A kind of method and device of determining directory web site page |
CN103970755A (en) * | 2013-01-28 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Novel catalog entry identification method, device and system |
CN103970755B (en) * | 2013-01-28 | 2018-12-11 | 腾讯科技(深圳)有限公司 | A kind of recognition methods of listing of novel item, device and system |
CN106445967A (en) * | 2015-08-11 | 2017-02-22 | 腾讯科技(深圳)有限公司 | Resource directory management method and apparatus |
CN106407291A (en) * | 2016-08-29 | 2017-02-15 | 达而观信息科技(上海)有限公司 | Hyperlinked text density algorithm-based page type identification method |
CN110750739A (en) * | 2018-07-04 | 2020-02-04 | 北京国双科技有限公司 | Page type determination method and device |
CN110750739B (en) * | 2018-07-04 | 2022-07-05 | 北京国双科技有限公司 | Page type determination method and device |
CN111831948A (en) * | 2019-04-18 | 2020-10-27 | 阿里巴巴集团控股有限公司 | Webpage type detection method and device and computer equipment |
CN113221031A (en) * | 2020-12-30 | 2021-08-06 | 江苏省未来网络创新研究院 | Method for automatically identifying website directory page |
WO2022143192A1 (en) * | 2020-12-30 | 2022-07-07 | 江苏省未来网络创新研究院 | Method for automatically recognizing contents page of website |
CN113221031B (en) * | 2020-12-30 | 2024-05-31 | 江苏省未来网络创新研究院 | Method for automatically identifying website catalog page |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102346748A (en) | Automatic identification method for network literature directory type web pages | |
CN101061478B (en) | Method and system for identifying web document | |
US20110302486A1 (en) | Method and apparatus for obtaining the effective contents of web page | |
CN101673266B (en) | Method for searching audio and video contents | |
CN104063364A (en) | PDF document recognition method | |
CN105320734B (en) | A kind of web page core content extracting method | |
CN104598577A (en) | Extraction method for webpage text | |
CN111310750B (en) | Information processing method, device, computing equipment and medium | |
JP5724009B2 (en) | Search result ranking apparatus and method using reliability of representative | |
JP2014013534A (en) | Document processor, image processor, image processing method and document processing program | |
CN103778141A (en) | Mixed PDF book catalogue automatic extracting algorithm | |
CN104915422A (en) | Webpage collecting method and device based on browser | |
CN107145591B (en) | Title-based webpage effective metadata content extraction method | |
US20120185238A1 (en) | Auto Generation of Social Media Content from Existing Sources | |
CN101673263A (en) | Method for searching video content | |
CN110489514B (en) | System and method for improving event extraction labeling efficiency, event extraction method and system | |
CN110737855A (en) | Method for extracting words in non-replicable word web page | |
JP2015076698A (en) | Image processor and image formation apparatus, and image reader and image formation system | |
CN106407291A (en) | Hyperlinked text density algorithm-based page type identification method | |
CN105847122A (en) | Advertising mail recognition method and device | |
CN106815249B (en) | Vertical text advertisement filtering method and device | |
JP5448744B2 (en) | Sentence correction program, method, and sentence analysis server for correcting sentences containing unknown words | |
CN105320716A (en) | Automatic labeling method for digital publication | |
JP2008071040A (en) | Method and program for extracting company name | |
WO2019119030A1 (en) | Image analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20120208 |