CN102346748A - Automatic identification method for network literature directory type web pages - Google Patents

Automatic identification method for network literature directory type web pages

Info

Publication number
CN102346748A
Authority
CN
China
Prior art keywords
character
character string
array
hyperlink
array element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010102458463A
Other languages
Chinese (zh)
Inventor
陈运文
马飞涛
宋海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shengle Information Technolpogy Shanghai Co Ltd
Original Assignee
Shengle Information Technolpogy Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic identification method for network literature directory type web pages. The method comprises the following steps: acquiring the data body of the current web page; extracting the character strings corresponding to hyperlink tags that contain hyperlink addresses and combining them into character string array I; removing the array elements of character string array I that contain image hyperlink tags to form character string array II; extracting the hyperlink text of each array element of character string array II to form character string array III; judging whether each array element of character string array III is directory text, and counting the array elements that are directory text to obtain numerical value I; dividing numerical value I by the total number of array elements of character string array III to obtain a confirmation ratio; and, when the confirmation ratio is greater than 0.7 or numerical value I is greater than 15, determining that the current web page is a literature directory page. The method identifies novel directory pages well across different sites.

Description

Online literature directory webpage automatic identifying method
Technical field
The present invention relates to web page processing, and in particular to a method for automatically identifying online literature directory web pages.
Background technology
Online literature is growing rapidly on the Internet, and Internet users increasingly rely on the web to read literary works. When a work is read online, its directory page is the most important page: it lists all the chapters of the work and gives the user the most convenient access to the chapter they want.
In the prior art, a web page is an HTML (HyperText Markup Language) file. The structure of an HTML file has two main parts: the head, which is the data header of the web page, and the body, which is the data body of the web page. The data header is the part between the <head> and </head> tags, and the data body is the part between the <body> and </body> tags. For a search engine processing web pages, identifying novel directory pages is very necessary: only when these pages are correctly identified can the search engine, when a user searches for the title of a literary work, directly return the corresponding novel directory page and so improve the quality of the search results.
Identifying novel directory pages is difficult in the prior art for two reasons. 1. Different websites use different HTML formats: page layout, CSS templates, fonts, font sizes, colors and so on all vary, so novel directory pages cannot be identified by simple template matching. 2. Neither the web page nor its address (URL) carries obvious novel directory information, so it is difficult to extract directory page information from the URL alone. In addition, keywords such as "directory" or "list" do not appear directly in the text of the page, so the page type is also difficult to obtain directly from the page text.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for automatically identifying online literature directory web pages that overcomes the identification problem caused by the diversity of novel directory pages across different types of websites, and that identifies novel directory pages well.
To solve the above technical problem, the method for automatically identifying online literature directory web pages provided by the present invention comprises the following steps:
Step 1: obtain the data body of the current web page. The data body is the part of the HTML source file between the <body> and </body> tags.
Step 2: extract from the data body the character string corresponding to every hyperlink tag that contains a hyperlink address, and store the character string corresponding to each such hyperlink tag as one array element of character string array one. The hyperlink tag is the HTML <a> tag, and a hyperlink tag that contains a hyperlink address is an <a> tag that includes the "href=" parameter. The character strings are extracted as follows: determine whether the data body contains the marker "<a href="; for each part that contains this marker, extract the whole character string from the "<a" marker to the closing "</a>" marker.
Step 3: remove from character string array one the array elements that contain an image hyperlink tag, forming character string array two. The image hyperlink tag is the "<img" marker.
Step 4: extract the hyperlink text of each array element of character string array two, and form character string array three with each hyperlink text as an array element. The hyperlink text of an array element of character string array two is extracted as follows. First create a stack. Then scan the array element character by character from head to tail and handle the current character as follows: when the current character is "<", push it onto the stack; when the current character is ">" and the top of the stack is "<", pop the "<" from the stack; when the current character is neither "<" nor ">" and the top of the stack is "<", ignore the current character (neither push nor pop) and continue scanning forward along the string; when the current character is neither "<" nor ">" and the top of the stack is not "<", push the current character onto the stack. When the scan of the array element is finished, pop the text remaining in the stack; the popped text forms one array element of character string array three.
Step 5: judge whether each array element of character string array three is directory text, and count the array elements that are directory text to obtain numerical value one. An array element of character string array three is directory text if it satisfies the following condition: its first character is "第" (the ordinal prefix) and its subsequent characters contain "章" (chapter), "节" (section), "回" (chapter) or "话" (episode).
Step 6: divide numerical value one by the total number of array elements of character string array three to obtain a confirmation ratio.
Step 7: when the confirmation ratio is greater than 0.7, or numerical value one is greater than 15, determine that the current web page is a literature directory page.
By proposing a page type identification method based on a hyperlink text density algorithm, the method of the invention overcomes the identification problem caused by the diversity of novel directory pages across different types of websites, and identifies novel directory pages well.
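The extraction part of the method (Steps 1 to 4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind the patent. The function names are hypothetical. The regular expressions are more tolerant than the literal "<a href=" marker described in Step 2 (they allow other attributes before the address parameter). The source does not say what to do with a ">" that arrives when the top of the stack is not "<"; this sketch keeps it as text.

```python
import re
from typing import List


def extract_body(html: str) -> str:
    """Step 1: return the part between <body ...> and </body>.

    Falling back to the whole input when no body tag is found is an
    assumption of this sketch; the source does not cover that case.
    """
    match = re.search(r"<body\b[^>]*>(.*?)</body\s*>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else html


def extract_anchor_strings(body: str) -> List[str]:
    """Step 2: each '<a ... href=...> ... </a>' span is one element of array one."""
    pattern = re.compile(r"<a\b[^>]*\bhref\s*=[^>]*>.*?</a\s*>",
                         re.IGNORECASE | re.DOTALL)
    return pattern.findall(body)


def drop_image_links(array_one: List[str]) -> List[str]:
    """Step 3: remove elements that contain an '<img' marker (array two)."""
    return [s for s in array_one if "<img" not in s.lower()]


def link_text(element: str) -> str:
    """Step 4: stack scan that keeps only the characters outside tags."""
    stack: List[str] = []
    for ch in element:
        if ch == "<":
            stack.append(ch)            # a tag opens
        elif ch == ">" and stack and stack[-1] == "<":
            stack.pop()                 # the tag closes: discard its "<"
        elif stack and stack[-1] == "<":
            continue                    # inside a tag: ignore the character
        else:
            stack.append(ch)            # outside any tag: keep as text
    # Joining bottom to top equals popping everything and reversing it.
    return "".join(stack)


def hyperlink_texts(html: str) -> List[str]:
    """Steps 1 to 4 chained: HTML source in, character string array three out."""
    array_two = drop_image_links(extract_anchor_strings(extract_body(html)))
    return [link_text(s) for s in array_two]
```

Because characters inside a tag are never pushed, nested markup such as a bold tag inside a link leaves only the visible link text in the stack.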
Description of drawings
The present invention is described in further detail below with reference to the accompanying drawing and an embodiment:
Fig. 1 is a flowchart of the method of the invention.
Embodiment
Fig. 1 shows the flowchart of the method of the invention. The method for automatically identifying online literature directory web pages provided by the embodiment of the invention comprises the following steps:
Step 1: obtain the data body of the current web page. The data body is the part of the HTML source file between the <body> and </body> tags.
Step 2: extract from the data body the character string corresponding to every hyperlink tag that contains a hyperlink address, and store the character string corresponding to each such hyperlink tag as one array element of character string array one. The hyperlink tag is the HTML <a> tag, and a hyperlink tag that contains a hyperlink address is an <a> tag that includes the "href=" parameter. The character strings are extracted as follows: determine whether the data body contains the marker "<a href="; for each part that contains this marker, extract the whole character string from the "<a" marker to the closing "</a>" marker.
Step 3: remove from character string array one the array elements that contain an image hyperlink tag, forming character string array two. The image hyperlink tag is the "<img" marker.
Step 4: extract the hyperlink text of each array element of character string array two, and form character string array three with each hyperlink text as an array element. The hyperlink text of an array element of character string array two is extracted as follows. First create a stack. Then scan the array element character by character from head to tail and handle the current character as follows: when the current character is "<", push it onto the stack; when the current character is ">" and the top of the stack is "<", pop the "<" from the stack; when the current character is neither "<" nor ">" and the top of the stack is "<", ignore the current character (neither push nor pop) and continue scanning forward along the string; when the current character is neither "<" nor ">" and the top of the stack is not "<", push the current character onto the stack. When the scan of the array element is finished, pop the text remaining in the stack; the popped text forms one array element of character string array three.
Step 5: judge whether each array element of character string array three is directory text, and count the array elements that are directory text to obtain numerical value one. An array element of character string array three is directory text if it satisfies the following condition: its first character is "第" (the ordinal prefix) and its subsequent characters contain "章" (chapter), "节" (section), "回" (chapter) or "话" (episode).
Step 6: divide numerical value one by the total number of array elements of character string array three to obtain a confirmation ratio.
Step 7: when the confirmation ratio is greater than 0.7, or numerical value one is greater than 15, determine that the current web page is a literature directory page.
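Steps 5 to 7 reduce to a string test followed by a threshold check, sketched below in Python. The function and parameter names are hypothetical. The Chinese marker characters used here (第 as the ordinal prefix; 章, 节, 回 and 话 as chapter units) are an assumption of this sketch about the literal characters that the condition in Step 5 tests for. Stripping surrounding whitespace before the test and returning False for a page with no link text are also choices of this sketch, not statements from the source.

```python
from typing import List

ORDINAL_PREFIX = "第"                    # assumed first character of a chapter title
CHAPTER_UNITS = ("章", "节", "回", "话")  # assumed chapter-unit characters


def is_directory_text(text: str) -> bool:
    """Step 5 test: starts with the ordinal prefix and a later character is a unit."""
    text = text.strip()
    return text.startswith(ORDINAL_PREFIX) and any(
        unit in text[1:] for unit in CHAPTER_UNITS
    )


def is_directory_page(link_texts: List[str],
                      min_ratio: float = 0.7,
                      min_count: int = 15) -> bool:
    """Steps 5 to 7: count directory texts, form the ratio, apply the thresholds."""
    if not link_texts:
        return False                         # no links: avoid dividing by zero
    value_one = sum(1 for t in link_texts if is_directory_text(t))  # Step 5
    ratio = value_one / len(link_texts)                             # Step 6
    return ratio > min_ratio or value_one > min_count               # Step 7
```

For instance, a page with 4 link texts of which 3 pass the Step 5 test has a confirmation ratio of 0.75, which exceeds 0.7, so it is classified as a directory page. The second threshold covers long pages where navigation links dilute the ratio: 16 matching links are enough regardless of how many other links the page carries.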
The present invention has been described in detail above through a specific embodiment, but this does not limit the invention. Without departing from the principle of the invention, those skilled in the art may make many variations and improvements, which should also be regarded as falling within the scope of protection of the invention.

Claims (5)

1. A method for automatically identifying online literature directory web pages, characterized in that it comprises the following steps:
Step 1: obtain the data body of the current web page;
Step 2: extract from the data body the character string corresponding to every hyperlink tag that contains a hyperlink address, and store the character string corresponding to each such hyperlink tag as one array element of character string array one;
Step 3: remove from character string array one the array elements that contain an image hyperlink tag, forming character string array two;
Step 4: extract the hyperlink text of each array element of character string array two, and form character string array three with each hyperlink text as an array element;
Step 5: judge whether each array element of character string array three is directory text, and count the array elements that are directory text to obtain numerical value one;
Step 6: divide numerical value one by the total number of array elements of character string array three to obtain a confirmation ratio;
Step 7: when the confirmation ratio is greater than 0.7, or numerical value one is greater than 15, determine that the current web page is a literature directory page.
2. The method for automatically identifying online literature directory web pages according to claim 1, characterized in that: the data body is an HTML source file; the hyperlink tag in Step 2 is <a>, and a hyperlink tag that contains a hyperlink address is an <a> tag that includes the "href=" parameter; in Step 2, the character strings corresponding to the hyperlink tags that contain hyperlink addresses are extracted from the data body as follows: determine whether the data body contains the marker "<a href="; for each part that contains this marker, extract the whole character string from the "<a" marker to the closing "</a>" marker.
3. The method for automatically identifying online literature directory web pages according to claim 2, characterized in that: the image hyperlink tag in Step 3 is the "<img" marker.
4. The method for automatically identifying online literature directory web pages according to claim 2, characterized in that: in Step 4, the hyperlink text of each array element of character string array two is extracted as follows: first create a stack; then scan the array element of character string array two character by character from head to tail and handle the current character as follows: when the current character is "<", push it onto the stack; when the current character is ">" and the top of the stack is "<", pop the "<" from the stack; when the current character is neither "<" nor ">" and the top of the stack is "<", ignore the current character and continue scanning forward along the string; when the current character is neither "<" nor ">" and the top of the stack is not "<", push the current character onto the stack; when the scan of the array element of character string array two is finished, pop the text remaining in the stack to form the array element of character string array three.
5. The method for automatically identifying online literature directory web pages according to claim 2, characterized in that: in Step 5, whether each array element of character string array three is directory text is judged by checking whether the array element satisfies the following condition: its first character is "第" (the ordinal prefix) and its subsequent characters contain "章" (chapter), "节" (section), "回" (chapter) or "话" (episode); if the condition is satisfied, the array element of character string array three is directory text.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102458463A CN102346748A (en) 2010-08-05 2010-08-05 Automatic identification method for network literature directory type web pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102458463A CN102346748A (en) 2010-08-05 2010-08-05 Automatic identification method for network literature directory type web pages

Publications (1)

Publication Number Publication Date
CN102346748A true CN102346748A (en) 2012-02-08

Family

ID=45545432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102458463A Pending CN102346748A (en) 2010-08-05 2010-08-05 Automatic identification method for network literature directory type web pages

Country Status (1)

Country Link
CN (1) CN102346748A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714075A (en) * 2012-09-29 2014-04-09 北京百度网讯科技有限公司 Website contents page determination method and device
CN103714075B (en) * 2012-09-29 2018-07-13 北京百度网讯科技有限公司 A kind of method and device of determining directory web site page
CN103970755A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Novel catalog entry identification method, device and system
CN103970755B (en) * 2013-01-28 2018-12-11 腾讯科技(深圳)有限公司 A kind of recognition methods of listing of novel item, device and system
CN106445967A (en) * 2015-08-11 2017-02-22 腾讯科技(深圳)有限公司 Resource directory management method and apparatus
CN106407291A (en) * 2016-08-29 2017-02-15 达而观信息科技(上海)有限公司 Hyperlinked text density algorithm-based page type identification method
CN110750739A (en) * 2018-07-04 2020-02-04 北京国双科技有限公司 Page type determination method and device
CN110750739B (en) * 2018-07-04 2022-07-05 北京国双科技有限公司 Page type determination method and device
CN111831948A (en) * 2019-04-18 2020-10-27 阿里巴巴集团控股有限公司 Webpage type detection method and device and computer equipment
CN113221031A (en) * 2020-12-30 2021-08-06 江苏省未来网络创新研究院 Method for automatically identifying website directory page
WO2022143192A1 (en) * 2020-12-30 2022-07-07 江苏省未来网络创新研究院 Method for automatically recognizing contents page of website
CN113221031B (en) * 2020-12-30 2024-05-31 江苏省未来网络创新研究院 Method for automatically identifying website catalog page

Similar Documents

Publication Publication Date Title
CN102346748A (en) Automatic identification method for network literature directory type web pages
CN101061478B (en) Method and system for identifying web document
US20110302486A1 (en) Method and apparatus for obtaining the effective contents of web page
CN101673266B (en) Method for searching audio and video contents
CN104063364A (en) PDF document recognition method
CN105320734B (en) A kind of web page core content extracting method
CN104598577A (en) Extraction method for webpage text
CN111310750B (en) Information processing method, device, computing equipment and medium
JP5724009B2 (en) Search result ranking apparatus and method using reliability of representative
JP2014013534A (en) Document processor, image processor, image processing method and document processing program
CN103778141A (en) Mixed PDF book catalogue automatic extracting algorithm
CN104915422A (en) Webpage collecting method and device based on browser
CN107145591B (en) Title-based webpage effective metadata content extraction method
US20120185238A1 (en) Auto Generation of Social Media Content from Existing Sources
CN101673263A (en) Method for searching video content
CN110489514B (en) System and method for improving event extraction labeling efficiency, event extraction method and system
CN110737855A (en) Method for extracting words in non-replicable word web page
JP2015076698A (en) Image processor and image formation apparatus, and image reader and image formation system
CN106407291A (en) Hyperlinked text density algorithm-based page type identification method
CN105847122A (en) Advertising mail recognition method and device
CN106815249B (en) Vertical text advertisement filtering method and device
JP5448744B2 (en) Sentence correction program, method, and sentence analysis server for correcting sentences containing unknown words
CN105320716A (en) Automatic labeling method for digital publication
JP2008071040A (en) Method and program for extracting company name
WO2019119030A1 (en) Image analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120208
