WO2001052078A1 - Dead hyper link detection method and system

Publication number: WO2001052078A1
Authority: WO
Grant status: Application
Prior art keywords: hyperlinks, document, web, valid, server
Application number: PCT/US2001/001214
Other languages: French (fr)
Inventor: Brian Mcginty
Original Assignee: Screamingmedia Inc.
Priority date: 2000-01-14
Filing date: 2001-01-12
Publication date: 2001-07-19

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30861 Retrieval from the Internet, e.g. browsers
    • G06F17/30876 Retrieval from the Internet, e.g. browsers by using information identifiers, e.g. encoding URL in specific indicia, browsing history
    • G06F17/30887 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

A method and system for automatically checking the validity of hyperlinks embedded in web pages being served to clients by web servers. In response to a web page request, a server will parse the document (step 412) and separate the hyperlinks from the other elements in the document (steps 414, 416). The server will then test the hyperlinks to determine whether 'dead links' are present (step 418). The server will then either remove each dead link or strip away the tags that empower the link, thus making the link look like plain text. The server will then reconstruct the document, including the valid hypertext links and other elements, and send the document to the requestor (steps 422, 424).

Description

DEAD HYPER LINK DETECTION METHOD AND SYSTEM

Field of Invention

The present invention relates generally to the field of document retrieval and interaction on a distributed computer network. More specifically, the present invention relates to a system for post processing embedded hyperlinks.

Background of the Invention

The World Wide Web (WWW) may be broadly described as a virtual collection of documents with a user being able to access and retrieve these documents through existing telephone or data lines. Documents accessible on the WWW have the capability to direct users to other documents on the web using linking information embedded in the text itself. Typically, the documents are stored in hypertext markup language (HTML) format. Using hypertext linking, an author will integrate references directly into the text of a document which point to other related items of information. Uniform resource locators (URLs) provide a way of converting the integrated reference to a real location where the related information will be located on the Internet. It is possible that links that are valid when they are included in these pages may become defunct or "dead links" over time.

Summary of the Invention

An aspect of the present invention involves a method of testing embedded hyperlinks including receiving a document request from a client; parsing a first document to determine if elements in the first document contain hyperlinks; separating the elements into hyperlinks and all other non-hyperlink elements; testing the hyperlinks in a first document in parallel to determine if the hyperlinks are valid hyperlinks or invalid hyperlinks by comparing the hyperlinks to a predetermined rule set; adding the valid hyperlinks to a list including the other non-hyperlink elements; generating a second document from the list; and providing the second document to the client.

Another aspect of the present invention involves a system including a memory device which stores a first document; and a processor in communication with the memory device, said processor configured to: receive a document request from a client; parse the first document to determine if elements in the first document contain said hyperlinks; separate the elements into hyperlinks and all other non-hyperlink elements; test hyperlinks in said first document to determine if said hyperlinks are valid hyperlinks or invalid hyperlinks by comparing the hyperlinks to a predetermined rule set; add the valid hyperlinks to a list including the other non-hyperlink elements; generate a second document using the list; and provide said second document to said client.

Other and further aspects of the present invention will become apparent during the course of the following description and by reference to the attached drawings.

Brief Description of the Drawings

Figure 1 illustrates a block diagram of an internet client/server relationship;

Figure 2 illustrates a block diagram of the server of Figure 1;

Figure 3 illustrates an HTML document in an exploded view;

Figure 4 illustrates a flow chart of the process of link validation of an embodiment of the present invention;

Figure 5 illustrates a first subroutine of the flow chart of Figure 4 in which the hypertext links and other text are separated;

Figure 6 illustrates a second subroutine of the flow chart of Figure 4 in which the hypertext links are tested to determine if they are valid; and

Figure 7 illustrates an alternative embodiment of the present invention which includes a modification of the subroutine of Figure 6 so that invalid hypertext links are processed to strip away the HTML tags.

Detailed Description of the Preferred Embodiments

The ability of a web server application to ascertain the validity of embedded links in web pages at request time is critical for the credibility of a web site. With more and more web sites moving into the e-commerce arena, this question of web site credibility is becoming even more sensitive. The present invention is capable of detecting defunct hyperlinks as soon as they become accessible. Embodiments of the present invention disclosed herein relate to the serving of web pages or documents by Internet web servers. The pages or documents discussed in this application may be in Hyper Text Markup Language (HTML), Standard Generalized Markup Language (SGML), Extensible Markup Language (XML) or any other format which uses a tagging architecture. In the following discussion of this application, HTML will be used for example purposes only.

The embodiments disclosed herein include a method and system for checking the validity of HTML hyperlinks embedded in HTML web pages being served to clients by a server. This applies both to web servers that serve static (i.e., non-changing) HTML web pages and to application web servers that serve dynamic HTML web pages. Static web pages are HTML web pages that are written or "constructed" at some point in time and then remain unchanged until a web site administrator manually removes them, updates them, or replaces them with entirely new pages. Dynamic HTML web pages are web pages served through some type of application server utilizing HTML templates and some type of dynamic page generation mechanism. In both cases it is possible that links that are valid when they are included in these pages may become defunct or "dead links" over time.

With reference to the Figures, several embodiments of the present invention will now be shown and described. Referring to Figure 1, electronic content distribution system 100 includes a server 110 and a user computer/client 140 both of which are connected across network backbone 105. Network backbone 105 may include an internet backbone, an intranet backbone or any other conventional network backbone or a combination thereof.

Server 110 may be a conventional server which includes conventional computer hardware and functionality. Server 110 may be associated with a web site or a content provider, such as a publisher (e.g., a magazine publisher, book publisher, etc.), a news agency, or any distributor or provider of electronic content. Electronic content may correspond to any publications (e.g., a news or magazine article), reports, technical papers and so forth. Electronic content may include a content body including documents with text and/or images with associated metadata as well as traditional index fields generally provided in a header or trailer section of this electronic content. Server 110 is configured to perform automatic dead link checking of hyperlinks to determine if dead links appear in a content body of the electronic content.

Fig. 2 is a schematic block diagram illustrating the components of server 110 of Fig. 1. Conventional computer components are included, such as a processor 200, user input devices 205, e.g., keyboard, mouse, etc., for receiving user inputs, network interface 210 for interconnection to the network backbone 105, RAM 215, ROM 220, display 225 and storage device 230. Storage device 230 stores the software which implements the present invention.

Turning to Figure 1, a request is sent from user computer 140 onto the network backbone 105 for a particular document or other piece of information. The requested document 320, as shown in Figure 3, is stored on server 110. The document 320 may include highlighted text 322 which includes hidden embedded links to other related information as prepared by hypertext authoring tools. The present invention will automatically perform a dead link check on any hyperlinks in the document 320 before sending the document to the user computer 140.

Figure 4 illustrates a flow diagram of the elemental steps of a first embodiment of the present invention. In a first step 410, a user accesses an Internet resource, such as an HTML page, which is served by the server 110. In step 412, the server 110 will, before serving the page to the user, parse that page and isolate the HTML hyperlinks that are embedded in that page. Figure 5 illustrates step 412 in more detail. In step 412a, a comparison is performed between the HTML page and a predefined rule set. Since all HTML hyperlinks employ a defined syntax, the server 110 can work from this predetermined rule set for parsing and isolating these links. This predetermined rule set can optionally be augmented through the use of a web server configuration file. This configuration file may employ an HTML hyperlink meta language that will allow the server 110 to dynamically learn at initialization time the syntax and nature of the HTML hyperlinks that must be isolated. In step 412b, a decision is made whether the text is a hyperlink. If so, it is added to the list of "N" hyperlinks in step 412c (with N representing a number greater than or equal to 0). If the text is not a hyperlink, it is added to the list of all other HTML elements which are not hyperlinks in step 412d. In step 412e, the system determines whether all of the document has been checked and, if not, returns to step 412a to continue checking the document. If the entire document has been reviewed, then the hyperlink parsing is completed in step 412f and the program returns to the flowchart of Figure 4.
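The separation performed in steps 412a through 412f can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the class name LinkSeparator, the function separate, and the single hard-coded rule (an anchor tag carrying an href attribute) are assumptions made for illustration, and Python's standard html.parser module stands in for whatever parser a server would actually use.

```python
from html.parser import HTMLParser


class LinkSeparator(HTMLParser):
    """Hypothetical helper: splits a page into hyperlinks and other elements."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []      # href values: the list of "N" hyperlinks
        self.other_elements = []  # every element that is not a hyperlink
        self._open_links = 0      # <a href=...> tags still awaiting their </a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            # Assumed rule set: an anchor tag with an href is a hyperlink.
            self.hyperlinks.append(href)
            self._open_links += 1
        else:
            self.other_elements.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self._open_links > 0:
            self._open_links -= 1  # closing tag belongs to a hyperlink
        else:
            self.other_elements.append(f"</{tag}>")

    def handle_data(self, data):
        self.other_elements.append(data)


def separate(html):
    """Return (hyperlinks, other_elements) for one HTML page."""
    parser = LinkSeparator()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.hyperlinks, parser.other_elements
```

Calling separate on a page returns the two lists that steps 414 and 416 work with. The sketch ignores comments and declarations and does not read any configuration file; a fuller version would need both.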

Figure 4 shows that in steps 414 and 416 the hyperlink list of the "N" links and the list of other non-hyperlink HTML elements are separated. Once the server 110 has isolated the list of hyperlinks for a given web page, it may in step 418 employ a multi-threaded socket initiator to simultaneously create hypertext transfer protocol (HTTP) socket connections to all the hyperlinks in the hyperlink list and allow the hyperlinks to be tested in parallel. These socket connections will begin retrieving the specified web pages, looking in particular for web server error messages in the HTTP headers of the incoming pages. For example, a 404 return code signifies that the web page in question no longer exists at the specified location. Once the HTTP header is read, the socket connection may be terminated. It is then a matter of parsing and interpreting the headers for the various web pages.
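The parallel test of step 418 might be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the function names fetch_status, is_valid, and check_links are hypothetical, a thread pool from Python's standard library stands in for the "multi-threaded socket initiator", and the rule that 2xx and 3xx status codes count as valid is an assumption (the text above names only 404 as an example of a dead link).

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPConnection, HTTPException, HTTPSConnection
from urllib.parse import urlsplit


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it cannot be fetched."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None  # only absolute HTTP(S) links are tested in this sketch
    conn_cls = HTTPSConnection if parts.scheme == "https" else HTTPConnection
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn = conn_cls(parts.netloc, timeout=timeout)
        try:
            conn.request("GET", path)
            # getresponse() parses the status line and headers only.
            return conn.getresponse().status
        finally:
            conn.close()  # terminate the connection once the header is read
    except (OSError, HTTPException, ValueError):
        return None


def is_valid(status):
    """Assumed rule: 2xx and 3xx responses are VALID, everything else is not."""
    return status is not None and 200 <= status < 400


def check_links(urls, fetch=fetch_status, max_workers=16):
    """Test all links in parallel and map each URL to a Boolean validity."""
    if not urls:
        return {}
    with ThreadPoolExecutor(max_workers=min(max_workers, len(urls))) as pool:
        statuses = list(pool.map(fetch, urls))
    return {url: is_valid(status) for url, status in zip(urls, statuses)}
```

The fetch parameter exists so that the status lookup can be replaced, for example with a cache or a stub during testing. Because the check runs at request time, the timeout bounds how long a slow remote server can delay the page.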

Figure 6 discloses step 418 in more detail. In step 418a, hyperlinks 1 to N are tested. Each hyperlink found to be valid in steps 418a through 418c is given a Boolean value of VALID and added to the list of valid hypertext links in step 418d. A hyperlink that is not valid is given the Boolean value of NOT VALID and is not added to the list of valid hyperlinks. The program then returns to the flowchart of Figure 4.

At this point the server 110 has the HTML web page parsed into a dynamic data structure with the hyperlinks separated from the remaining page elements. The server 110 also has a dynamic data structure that holds a list of the page's internal links and, for each link, a Boolean value that represents that link's web status (i.e., VALID or NOT VALID). The server 110 will recombine the VALID hyperlinks with the other HTML elements in step 420 and omit any hyperlinks having a NOT VALID value. The server 110 will recompose the elements of the page in step 422. In this way the user will never see invalid or defunct links being served by the web site that employs a server 110 such as this. In an alternative embodiment disclosed in Figure 7, subroutine 418 will be modified so that server 110 will recompose the page with the non-valid link but will strip away the HTML tags that empower that link, thus making the link look like plain text. In this embodiment, the net result is the same: a user will never click on a hyperlink that takes them to a defunct page. Subroutine 418 will be modified to include steps 418e through 418g in which, if a hyperlink is found to be invalid, the tag will be stripped and the link will be made to look like text and added to the VALID hyperlink list.
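Both recomposition variants described above can be illustrated with one short Python sketch. The names PageRebuilder and rebuild, the validity dictionary, and the strip_tags flag are assumptions made for illustration. The sketch also assumes that "omitting" a dead link removes the anchor together with its text, that a link absent from the validity map is treated as NOT VALID, and that comments and declarations need not be preserved.

```python
from html.parser import HTMLParser


class PageRebuilder(HTMLParser):
    """Hypothetical helper: rebuilds a page without its dead links."""

    def __init__(self, validity, strip_tags=False):
        super().__init__(convert_charrefs=False)
        self.validity = validity      # maps href -> True (VALID) / False
        self.strip_tags = strip_tags  # True: keep link text as plain text
        self.out = []
        self._anchors = []            # one flag per open <a>: True if dead

    def _emit(self, text):
        # When omitting, everything inside a dead link is dropped as well.
        if self.strip_tags or not any(self._anchors):
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Links missing from the map are treated as NOT VALID (assumption).
            dead = href is not None and not self.validity.get(href, False)
            self._anchors.append(dead)
            if dead:
                return  # drop the tag that "empowers" the dead link
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "a" and self._anchors and self._anchors.pop():
            return  # closing tag of a dead link
        self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")


def rebuild(html, validity, strip_tags=False):
    """Return the second document generated from the first one."""
    builder = PageRebuilder(validity, strip_tags)
    builder.feed(html)
    builder.close()
    return "".join(builder.out)
```

With strip_tags left at False the sketch follows the first embodiment and removes the dead link entirely. With strip_tags set to True it follows the Figure 7 variant and leaves the former link text in place as plain text.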

The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the law. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention.

Claims

1. A method of testing embedded hyperlinks comprising: receiving a document request from a client; parsing a first document to determine if elements in the first document contain hyperlinks; separating the elements into hyperlinks and all other non-hyperlink elements; testing the hyperlinks in a first document in parallel to determine if said hyperlinks are valid hyperlinks or invalid hyperlinks by comparing the hyperlinks to a predetermined rule set; adding the valid hyperlinks to a list including the other non-hyperlink elements; generating a second document from said list; and providing said second document to said client.
2. A method comprising: automatically testing hyperlinks in a first document to determine if said hyperlinks are valid hyperlinks or invalid hyperlinks; and generating a second document using the valid hyperlinks.
3. The method of claim 2, further comprising: stripping tags from the invalid hyperlinks and adding the invalid hyperlinks to the second document.
4. The method of claim 2, wherein said testing of the hyperlinks is performed in parallel.
5. The method of claim 2, further comprising: receiving a document request from a client; and providing the second document to the client.
6. The method of claim 2 further comprising: parsing the first document to determine if elements in the first document contain said hyperlinks.
7. The method of claim 2 further comprising: separating the hyperlinks from other elements in the first document; and adding the valid hyperlinks to the other elements before generating said second document.
8. The method of claim 2, wherein said parsing step includes comparing said elements to a predetermined rule set.
9. The method of claim 2, wherein said first and second documents are static web pages.
10. The method of claim 2, wherein said first and second documents are dynamic web pages.
11. The method of claim 2, wherein said first and second documents are written in a format from one of the group consisting of HTML, SGML, and XML.
12. The method of claim 2, further comprising: stripping tags from the invalid hyperlinks and adding the invalid hyperlinks to the list.
13. A system comprising: a memory device which stores a first document; and a processor in communication with said memory device, said processor configured to: automatically test hyperlinks in said first document to determine if said hyperlinks are valid hyperlinks or invalid hyperlinks; and generate a second document using the valid hyperlinks.
14. The system of claim 13, said processor further configured to: strip tags from the invalid hyperlinks and add the invalid hyperlinks to the second document.
15. The system of claim 13, said processor further configured to: test said hyperlinks in parallel.
16. The system of claim 13, said processor further configured to: parse the first document to determine if elements in the first document contain said hyperlinks.
17. A system comprising: a memory device which stores a first document; and a processor in communication with said memory device, said processor configured to: receive a document request from a client; parse the first document to determine if elements in the first document contain said hyperlinks; separate the elements into hyperlinks and all other non-hyperlink elements; test hyperlinks in said first document to determine if said hyperlinks are valid hyperlinks or invalid hyperlinks by comparing the hyperlinks to a predetermined rule set; add the valid hyperlinks to a list including the other non-hyperlink elements; generate a second document using the list; and provide said second document to said client.
18. A system comprising: means for automatically testing hyperlinks in a first document to determine if said hyperlinks are valid hyperlinks or invalid hyperlinks; and means for generating a second document using the valid hyperlinks.
19. The system of claim 18, further comprising: means for stripping tags from the invalid hyperlinks and adding the invalid hyperlinks to the second document.
20. The system of claim 18, further comprising: a means for parsing the first document to determine if elements in the first document contain hyperlinks.

Priority Applications (2)

Application Number Priority Date Filing Date Title
US48343900 2000-01-14 2000-01-14
US09/483,439 2000-01-14

Publications (1)

Publication Number Publication Date
WO2001052078A1 (en) 2001-07-19

Family

ID=23920028

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/001214 WO2001052078A1 (en) 2000-01-14 2001-01-12 Dead hyper link detection method and system

Country Status (1)

Country Link
WO (1) WO2001052078A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5995099A (en) * 1996-06-10 1999-11-30 Horstmann; Jens U. Method for creating and maintaining page links
US6035330A (en) * 1996-03-29 2000-03-07 British Telecommunications World wide web navigational mapping system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEUNG: "A tool for testing hypermedia systems", EUROMICRO CONFERENCE, IEEE, vol. 2, 1999, pages 203, XP002939083 *
STOTTS: "Petri-net-based hypertext: document structure with browsing semantics", ACM TRANSACTIONS ON INFORMATION SYSTEMS, vol. 7, January 1989 (1989-01-01), pages 3 - 29, XP002939082 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6519616B1 (en) * 1999-12-31 2003-02-11 Qwest Communications International, Inc. Web site quality assurance system and method
US7222101B2 (en) * 2001-02-26 2007-05-22 American Express Travel Related Services Company, Inc. System and method for securing data through a PDA portal
US9842345B2 (en) 2001-03-29 2017-12-12 Gula Consulting Limited Liability Company System and method for networked loyalty program
WO2005015387A2 (en) * 2003-07-17 2005-02-17 International Business Machines Corporation Method and system for automatic adjustment of entitlements in a distributed data processing environment
WO2005015387A3 (en) * 2003-07-17 2005-06-16 Paul Anthony Ashley Method and system for automatic adjustment of entitlements in a distributed data processing environment
CN100424636C (en) 2003-07-17 2008-10-08 国际商业机器公司 Method and system for automatic adjustment of entitlements in a distributed data processing environment
EP1677215A1 (en) 2004-12-30 2006-07-05 Microsoft Corporation Methods and apparatus for the evalution of aspects of a web page
US7536389B1 (en) 2005-02-22 2009-05-19 Yahoo ! Inc. Techniques for crawling dynamic web content
US7610267B2 (en) * 2005-06-28 2009-10-27 Yahoo! Inc. Unsupervised, automated web host dynamicity detection, dead link detection and prerequisite page discovery for search indexed web pages
EP1739603A1 (en) * 2005-06-28 2007-01-03 Hurra Communications GmbH Client-server system, server and method for outputting at least one information concerning an online shop or a product offered by the online shop on a network page
US7590634B2 (en) 2005-12-09 2009-09-15 Microsoft Corporation Detection of inaccessible resources
US9704174B1 (en) 2006-05-25 2017-07-11 Sean I. Mcghie Conversion of loyalty program points to commerce partner points per terms of a mutual agreement
US8833650B1 (en) 2006-05-25 2014-09-16 Sean I. Mcghie Online shopping sites for redeeming loyalty points
US8944320B1 (en) 2006-05-25 2015-02-03 Sean I. Mcghie Conversion/transfer of non-negotiable credits to in-game funds for in-game purchases
US8950669B1 (en) 2006-05-25 2015-02-10 Sean I. Mcghie Conversion of non-negotiable credits to entity independent funds
US8973821B1 (en) 2006-05-25 2015-03-10 Sean I. Mcghie Conversion/transfer of non-negotiable credits to entity independent funds
US8209599B2 (en) * 2009-04-23 2012-06-26 Xerox Corporation Method and system for handling references in markup language documents
US20100275117A1 (en) * 2009-04-23 2010-10-28 Xerox Corporation Method and system for handling references in markup language documents
CN104572928A (en) * 2014-12-29 2015-04-29 北京奇虎科技有限公司 Dead link rule digging method, dead link rule digging device, searching method and searching device
CN104504097A (en) * 2014-12-29 2015-04-08 北京奇虎科技有限公司 Live link rule mining method and device, and searching method and device

Legal Events

Date Code Title Description
AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP