WO2012082114A1

WO2012082114A1 - Selecting content within a web page

Info

Publication number: WO2012082114A1
Application number: PCT/US2010/060322
Authority: WO
Inventors: Suk Hwan Lim; Eamonn O'brien-Strain
Original assignee: Hewlett-Packard Development Company, L.P.
Priority date: 2010-12-14
Filing date: 2010-12-14
Publication date: 2012-06-21

Abstract

A method of selecting content within a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) comprising accessing first web page data associated with at least one previously accessed web page, accessing second web page data associated with a currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507), comparing the first web page data with the second web page data, and presenting to a user, via an output device (Fig. 1, 150), equivalent web page data selected most often within the at least one previously accessed web page as selected content within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507).

Description

SELECTING CONTENT WITHIN A WEB PAGE

BACKGROUND

[0001] The Internet is providing many users throughout the world with the ability to access large amounts and varieties of information at previously unthinkable speeds. Indeed, with the advent of the Internet other means of communication such as newspapers, telephones, and mail are becoming obsolete and consumers are looking to the various web pages on the World Wide Web for information, services, and products. However, with the inclusion of multimedia content, embedded advertising, and other online services within them, these web pages have become substantially more complex. By way of example, a web page may include additional peripheral information such as background imagery, advertisements, navigational menus, headers, footers, as well as separate links to additional content located throughout the World Wide Web.

[0002] It is, therefore, often the case that users of a web page desire to view, utilize or adapt the main content within the web page. Selecting or otherwise using that desired portion of the content on the web page requires that the user carefully distinguish between the desirable and undesirable content and retrieve those desirable portions of the web page. Additionally, various web sites and web pages not only vary widely by content, but any one web page may not contain the same information at any given time. Still further, individual users' preferences vary from user to user and therefore the desirable content to be selected may also vary depending on any one user's preferences. Selection of those portions of the website the user desires could greatly increase productivity as well as improve the user's experience while accessing the web page.

BRIEF DESCRIPTION OF THE DRAWINGS

[0003] The accompanying drawings illustrate various examples of the principles described herein and are a part of the specification. The illustrated examples are given merely for illustration, and do not limit the scope of the claims.

[0004] Fig. 1 is a diagram of an illustrative system for selection of user desirable content in web pages, according to one example of principles described herein.

[0005] Fig. 2A is a Document Object Model (DOM) tree for an illustrative web page, according to one example of principles described herein.

[0006] Fig. 2B is a layout of an illustrative web page which corresponds to the Document Object Model (DOM) tree of Fig. 2A, according to one example of principles described herein.

[0007] Fig. 2C is a diagram of an illustrative web page showing the content of the web page of Figs. 2A and 2B, according to one example of principles described herein.

[0008] Fig. 3 is an illustrative flowchart depicting a method of extracting user desirable content from a web page, according to one example of the principles described herein.

[0009] Fig. 4 is an illustrative diagram of the web page of Fig. 2C, showing a selection of additional web page content, according to one example of principles described herein.

[0010] Fig. 5 is an illustrative diagram of the web page of Fig. 2C, showing a selection of additional web page content, according to one example of principles described herein.

[0011] Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.

DETAILED DESCRIPTION

[0012] The present specification discloses various methods, systems, and devices for determining the user desirable or main content of a web page using a user's previous markups of content selections in similar web pages. As discussed earlier, there exist various types of content on any given web page that a user of a web page may not necessarily want to utilize. Some of the potentially unwanted content may include background imagery, advertisements, navigational menus, headers, footers, as well as separate links to additional content located throughout the World Wide Web. Therefore, it is more advantageous for a user of a web page to be able to select those portions of the web page that he or she wants to edit, view, print, present or otherwise utilize. Additionally, it is also advantageous to save any data associated with a web page related to those portions previously selected by the user for utilization by the user. Therefore, when the user of the web page accesses the same or a similar web page, the user desirable content of a web page is selected based, at least partially, on the types of content previously selected for that web page or a similar web page.

[0013] As briefly discussed earlier, various challenges arise in attempting to manually select user desirable content from a web page. One challenge is the various types of web pages used. Specifically, many different templates are used to create the various types of web pages on the World Wide Web and this may add additional difficulty in trying to retrieve the pertinent content in a more convenient way. Another challenge is to select desirable content from web pages which may be arbitrary because the web page does not include a template.

[0014] It is further challenging to select the desirable content or at least the "main content" of the web page when most web pages on the World Wide Web include various types of unwanted content such as text, images, videos, and flash objects. Therefore, determining what is and is not wanted content can be difficult if all of these types of content are present in any given web page. To help with this, an algorithm may be used to not only determine a relative ordering of level of appeal of content but also to determine whether content can be categorized as "desirable" or "main" content.

[0015] As used herein, the term "includes" means includes but not limited to and the term "including" means including but not limited to. The term "based on" means based at least in part on.

[0016] As used in the present specification and in the appended claims, the term "web page" is meant to be understood broadly as any document that can be accessed by a Uniform Resource Locator (URL) on the World Wide Web. A web page may, therefore, be retrieved from a server over a network connection and viewed in a web browser application.

[0017] Additionally, as used in the present specification and in the appended claims, the term "user" is meant to be understood broadly as any person viewing or otherwise utilizing a web page. Therefore, an owner or administrator of a web page, a user of a computing system having accessed a web page, or any other person may be a user.

[0018] Still further, as used in the present specification and in the appended claims, the term "main content," "user desirable content," or "viewer desirable content" is meant to be understood broadly as that content on a web page which a user wishes to view, utilize or adapt for any purpose. Indeed, the present specification may refer to "desirable" content within a web page which is meant to be understood as those sections of text, images, or any other content on a web page which the user may generally wish to view, utilize or adapt, and which is separate from any other undesirable content within a web page. In one example of the present specification, the method of determining what content within the web page is to be selected, to determine the web page data selected most often, may utilize an algorithm that aggregates the statistical distribution of what parts of the web page have been selected previously.

[0019] Even further, as used in the present specification and in the appended claims, the term "web page data" is meant to be understood broadly as any data relating to a web page. For example, web page data may include at least one of the web page's Uniform Resource Locator (URL); the web page's Document Object Model (DOM); information relating to the structure and layout of a Document Object Model (DOM) tree of the web page; the layout and structure of any nodes within the Document Object Model (DOM) tree; content of a web page or nodes previously or currently selected by a user within a Document Object Model (DOM) tree; content of a web page or nodes not previously or currently selected by a user within a Document Object Model (DOM) tree; any data relating to the amount or characteristics of any type of content of the web page selected or not selected by an individual, entity; or combinations of these. Web page data may additionally include any metadata associated with or describing any of the above mentioned types of data. Still further, web page data may also include any data or metadata relating not only to the content of a web page an individual has selected from any one web page in the past, but may also include information relating to when and how often the user had previously viewed, utilized, or adapted a web page or content on a web page.

[0020] Further, as used in the present specification and in the appended claims, the term "sub-node" is meant to be understood broadly as any node within a Document Object Model (DOM) tree which has at least one node located on a higher level in the hierarchal order of the Document Object Model (DOM) tree. Therefore, a sub-node may be a sub-node of a node which itself is a sub-node. Additionally, a sub-node may also comprise or have associated with it a number of sub-nodes itself.

[0021] Still further, as used in the present specification and in the appended claims, the term "similar web page" is meant to be understood broadly as any web page having similar characteristics as compared to another web page. For example, a similar web page may be similar in the type of template used to arrange the text, images or other content displayed on the web page. A similar web page may also be similar because, although the web page address or Uniform Resource Locator (URL) is not entirely identical, the domain name within the Uniform Resource Locator (URL) is the same. Additionally, a similar web page may be similar in the content displayed on the web page. Similarly, as used in the present specification and in the appended claims, the terms "equivalent web page data" and "similar web page data" are meant to be understood broadly as any web page data having similar characteristics as compared to other web page data. For example, a number of web pages' Document Object Model (DOM) trees may contain certain nodes which are similar to each other because, for example, the content contained in those respective nodes is equivalent.
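As a minimal sketch of the domain-name criterion described above, the following Python function treats two web pages as candidates for "similar" when the host names in their URLs match. The function name `same_domain` and the example URLs are illustrative assumptions, not part of the specification, and using the full host name as the "domain name" is only one possible interpretation.

```python
from urllib.parse import urlsplit


def same_domain(url_a: str, url_b: str) -> bool:
    """Return True when both URLs carry the same host name.

    Assumption: the "domain name" is taken to be the full host name;
    the specification does not say how sub-domains should be treated.
    """
    host_a = urlsplit(url_a).hostname  # lower-cased by urllib, or None
    host_b = urlsplit(url_b).hostname
    return host_a is not None and host_a == host_b


# Hypothetical URLs: two recipe pages on one site, one page elsewhere.
print(same_domain("http://recipes.example.com/soup",
                  "http://Recipes.example.com/pie?id=2"))
print(same_domain("http://recipes.example.com/soup",
                  "http://news.example.org/today"))
```

Template or content similarity, which the paragraph also mentions, would need a comparison of the pages' DOM trees and is not covered by this sketch.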

[0022] Additionally, as used in the present specification and in the appended claims, the term "hash" is meant to be understood broadly as any number generated from a string of data. Indeed, a "hash function" is meant to be understood as any function that is used to convert data into small datum which may serve as an index. Specifically, a hash may be a conversion of web page data associated with a web page into smaller datum which may then be placed in a table or database for easy lookup.
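One way such a hash might be used is sketched below in Python. The choice of SHA-256, the 16-character truncation, the key layout (URL plus a node path), and the names `node_hash`, `record_selection`, and `selection_table` are all assumptions made for illustration; the specification does not prescribe a particular hash function or table format.

```python
import hashlib

# Lookup table mapping a hash of web page data to a selection count.
selection_table = {}


def node_hash(url: str, node_path: str) -> str:
    """Convert web page data (here a URL and a DOM node path) into a short key."""
    data = f"{url}\n{node_path}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:16]  # truncated for compactness


def record_selection(url: str, node_path: str) -> str:
    """Count one user selection of the given node and return its table key."""
    key = node_hash(url, node_path)
    selection_table[key] = selection_table.get(key, 0) + 1
    return key


# Hypothetical usage: the same node is selected on two visits.
key = record_selection("http://recipes.example.com/soup", "Content/MainCol/RightCol/Ingred")
record_selection("http://recipes.example.com/soup", "Content/MainCol/RightCol/Ingred")
print(key, selection_table[key])
```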

[0023] In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present systems and methods. It will be apparent, however, to one skilled in the art that the present apparatus, systems and methods may be practiced without these specific details. Reference in the specification to "an example" or similar language means that a particular feature, structure, or characteristic described in connection with the example is included in at least that one example, but not necessarily in other examples. The various instances of the phrase "in one example" or similar phrases in various places in the specification are not necessarily all referring to the same example.

[0024] Referring now to Fig. 1, an illustrative system (100) for selection of user desirable content in web pages (110) includes a computing device (105) that has access to a web page (110) stored by a web page server (115). In the present example, for the purposes of simplicity in illustration, the computing device (105) and the web page server (115) are separate computing devices communicatively coupled to each other through a mutual connection to a network (120). However, the principles set forth in the present specification extend equally to any alternative configuration in which a computing device (105) has complete access to a web page (110). As such, alternative examples within the scope of the principles of the present specification include, but are not limited to, examples in which the computing device (105) and the web page server (115) are implemented by the same computing device, examples in which the functionality of the computing device (105) is implemented by multiple interconnected computers (for example, a server in a data center and a user's client machine), examples in which the computing device (105) and the web page server (115) communicate directly through a bus without intermediary network devices, and examples in which the computing device (105) has a stored local copy of the web page (110) which is to be analyzed to select the desirable content from the web page (110).

[0025] Additionally, for purposes of simplicity, the web page of the present example is stored on a single web server. However, the principles set forth in the present specification may include web pages which are generated dynamically from pieces of web page content stored on a number of various types of storage devices. For example, a web page of the present specification may be generated by a cluster of individual communicating servers. Still further, a web page of the present specification may also be generated dynamically by data computed on the fly.

[0026] The computing device (105) of the present example is a computing device that retrieves the web page (110) hosted by the web page server (115) and determines the most user desirable content of the web page (110) based, at least partially, on the user's previous selections of text, images, and other content on other web pages. In the present example, this is accomplished by the computing device (105) requesting the web page (110) from the web page server (115) over the network (120) using the appropriate network protocol, for example, Internet Protocol (IP). Illustrative processes for identifying the most user desirable content of the web page (110) are set forth in more detail below.

[0027] To achieve its desired functionality, the computing device (105) includes various hardware components. Among these hardware components may be at least one processor (125), at least one data storage device (130), peripheral device adapters (135), and a network adapter (140). These hardware components may be interconnected through the use of one or more busses and/or network connections.

[0028] The processor (125) may include the hardware architecture to retrieve executable code from the data storage device (130) and execute the executable code. The executable code may, when executed by the processor (125), cause the processor (125) to implement at least the functionality of retrieving the web page (110) and analyzing a web page (110) in order to locate the most user desirable content of the web page (110) according to the methods of the present specification described below. In the course of executing code, the processor (125) may receive input from and provide output to one or more of the remaining hardware units.

[0029] The data storage device (130) may store data such as web page data which is processed and produced by the processor (125). As will be discussed, the data storage device (130) may specifically save web page data including, for example, a web page's Uniform Resource Locator (URL), Document Object Model (DOM) tree, and sections of content in a web page a user has selected. All of this data may further be stored in the form of a database for easy retrieval when the same or a similar web page is once again accessed by a user.

[0030] The data storage device (130) may include various types of memory modules, including volatile and nonvolatile memory. For example, the data storage device (130) of the present example includes Random Access Memory (RAM), Read Only Memory (ROM), and Hard Disk Drive (HDD) memory. Many other types of memory are available in the art, and the present specification contemplates the use of many varying type(s) of memory (130) in the data storage device (130) as may suit a particular application of the principles described herein. In certain examples, different types of memory in the data storage device (130) may be used for different data storage needs. For example, in certain examples the processor (125) may boot from Read Only Memory (ROM), maintain nonvolatile storage in the Hard Disk Drive (HDD) memory, and execute program code stored in Random Access Memory (RAM).

[0031] Generally, the data storage device (130) may comprise a computer readable storage medium. For example, the data storage device (130) may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium could include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

[0032] The hardware adapters (135, 140) in the computing device (105) enable the processor (125) to interface with various other hardware elements, external and internal to the computing device (105). For example, peripheral device adapters (135) may provide an interface to input/output devices to create a user interface and/or access external sources of memory storage. As will be discussed below, an output device (150) may be provided to allow a user to interact with and adjust the amount and type of content selected within a web page (110).

[0033] Peripheral device adapters (135) may also create an interface between the processor (125) and a printer (145) or other media output device. For example, where the computing device (105) selects the most user desirable content of the web page (110) and the user then wishes to print that content, the computing device (105) may instruct the printer (145) to create one or more physical copies of the document. A network adapter (140) may additionally provide an interface to the network (120), thereby enabling the transmission of data to and receipt of data from other devices on the network (120), including the web page server (115).

[0034] Referring now to Figs. 2A-2C, a Document Object Model (DOM) tree for an illustrative web page, the web page layout, and the visual elements in a web page are shown. As discussed earlier, various types of data associated with a web page may exist. This data may be saved in order to better select the user desirable content of a web page. However, for purposes of explanation, the present specification uses the illustrative example of saving a Uniform Resource Locator (URL), the web page associated with the Uniform Resource Locator (URL), the web page's Document Object Model (DOM) tree, the particular nodes selected by a user, or combinations thereof. Therefore, although the illustrative example in the present specification and specifically in Figs. 2A-2C may only refer to these types of data being saved in order to better select the appropriate user desirable content from a web page, it can be appreciated that any type of web page data may also be saved so as to achieve similar results. For example, the present system, method and device described in the present specification may save any representation of a web page Document Object Model (DOM) tree, any transformation of a web page Document Object Model (DOM) tree, any hash table created by the use of a hash function and meant to represent any selected content of a web page, any modifications of a previous Document Object Model (DOM) tree, or any other type of data representing any content on any web page which has been previously selected by a user.

[0035] In the example shown in Figs. 2A-2C, the web page is from a recipe website and includes an image of the dish which is described, a rating of the dish by users of the web page, a description of the dish, ingredients to make the dish, preparation instructions, and other elements. Fig. 2A is an illustrative Document Object Model (DOM) tree (200) showing the hierarchy of Document Object Model (DOM) nodes in the illustrative web page. A Document Object Model (DOM) is a cross-platform and language independent convention for representing and interacting with web page elements in HyperText Markup Language (HTML), eXtensible HyperText Markup Language (XHTML) and eXtensible Markup Language (XML). The root node in this illustrative web page is the Content (210) node which has six sub-nodes: the Banner (215) sub-node; Header (220) sub-node; MainCol (225) sub-node; AdCol (230) sub-node; Reviews (235) sub-node; and Footer (240) sub-node. For purposes of illustration, sub-nodes (250-285) are shown only for the MainCol (225) sub-node. Therefore, it can be appreciated that the Banner (215) sub-node, Header (220) sub-node, AdCol (230) sub-node, Reviews (235) sub-node, and Footer (240) sub-node may each include additional sub-nodes of their own. Dashed lines extending to the right of the other sub-nodes therefore show the continuation of the sub-nodes with nodes which are not illustrated in Fig. 2A.

[0036] The MainCol (225) sub-node also includes two sub-nodes itself, LeftCol (250) sub-node and RightCol (255) sub-node, at the next hierarchal level. LeftCol (250) sub-node has two sub-nodes at the lowest hierarchal level: MainImg (260) sub-node and SimRec (265) sub-node. The RightCol (255) sub-node has four sub-nodes at the lowest hierarchal level: Rating (270) sub-node, Descr (275) sub-node, Ingred (280) sub-node, and Prep (285) sub-node.
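The hierarchy described for Fig. 2A can be sketched as a small tree structure in Python. The `Node` class and its `sub_nodes` method are hypothetical names introduced only for illustration; the node names and reference numbers follow the description, with RightCol taking reference number 255 as in the discussion of Fig. 2B. The sub-nodes that Fig. 2A leaves unillustrated (under Banner, Header, AdCol, Reviews, and Footer) are omitted here as well.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    name: str
    ref: int  # reference number used in the figures
    children: List["Node"] = field(default_factory=list)

    def sub_nodes(self) -> Iterator["Node"]:
        """Yield every node below this one, at any depth (pre-order)."""
        for child in self.children:
            yield child
            yield from child.sub_nodes()


main_col = Node("MainCol", 225, [
    Node("LeftCol", 250, [Node("MainImg", 260), Node("SimRec", 265)]),
    Node("RightCol", 255, [Node("Rating", 270), Node("Descr", 275),
                           Node("Ingred", 280), Node("Prep", 285)]),
])

content = Node("Content", 210, [
    Node("Banner", 215), Node("Header", 220), main_col,
    Node("AdCol", 230), Node("Reviews", 235), Node("Footer", 240),
])

# A sub-node may itself have sub-nodes: list everything under MainCol.
print([n.name for n in main_col.sub_nodes()])
```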

[0037] Fig. 2B shows the layout (205) of the illustrative web page depicted by the Document Object Model (DOM) tree (Fig. 2A, 200) shown in Fig. 2A. The Banner (215) and AdCol (230) each hold a location within the layout (205) for a banner ad and other advertisements. The Header (220) may contain a number of elements including navigation tabs, search fields and other sub-elements. Similarly the Footer (240) may contain a number of elements including links to related sites, terms of use and privacy policies, copyright notices, and other elements. The Reviews (235) sub-tree may contain ratings and comments from various users of the site who have tried the recipe. However, as explained above, for simplicity these elements within the Banner (215), AdCol (230), Header (220), Footer (240) and Reviews (235) are not represented on the Document Object Model (DOM) tree of Fig. 2A and, therefore, also do not appear in the web page layout of Fig. 2B.

[0038] The MainCol (225) sub-node contains at least some of the user desirable content which a user may want to view, utilize or adapt. The MainCol (225) contains a left column (250) and a right column (255). In the left column (250), an image is shown in the MainImg (260) element; in this illustrative example the image is a dish. The right column (255) includes an overall rating for the dish (270), a description of the dish (275), ingredients of the dish (280), and preparation instructions (285). Similar recipes are shown below the MainCol (225) in the SimRec (265) element. These elements (260-285) may have a number of additional sub-elements.

[0039] Fig. 2C is a diagram of an illustrative web page (207) showing the content of the web page of Figs. 2A and 2B. The content has been simplified for purposes of illustration. There may be a variety of non-visual code and/or elements present in any of the elements (Fig. 2B, 215-285). However, according to one aspect of the present systems and methods this non-visual information is not presented to the user viewing the web page (207) as being part of the user desirable content. Consequently, during the analysis of the web page (207) to determine the user desirable content of the web page (207), non-visual information is not weighted heavily or is not considered at all. As discussed above, the user is typically interested in viewing, utilizing or adapting in some way the main content (290) of the web page (207). Banner ads, page navigation, reviews, and links typically contain information which is not directly relevant to the user's interest in the web page (207) and are not directly related to the content the user wishes to view, utilize or adapt.

[0040] Turning now to Fig. 3, an illustrative flowchart depicting a method of extracting user desirable content from a web page (Fig. 1, 110; Fig. 2C, 207) is shown. The method starts by accessing or downloading a web page (Fig. 1, 110; Fig. 2C, 207) to a computing device (Block 305) operated by a user of a website. Accessing a web page (Fig. 1, 110; Fig. 2C, 207) is typically accomplished with a web browser program stored on the computing device (Fig. 1, 105). As discussed earlier, this computing device (Fig. 1, 105) may retrieve the web page (Fig. 1, 110; Fig. 2C, 207) hosted by the web page server (Fig. 1, 115) and determine the most user desirable content of the web page (Fig. 1, 110; Fig. 2C, 207) based, at least partially, on the user's previous selections of text, images and other content on the same or similar web pages. In the present example, access to the web page (Fig. 1, 110; Fig. 2C, 207) is accomplished by the computing device (Fig. 1, 105) requesting the web page (Fig. 1, 110; Fig. 2C, 207) from the web page server (Fig. 1, 115) over the network (Fig. 1, 120) using the appropriate network protocol (for example, Internet Protocol ("IP")).

[0041] Next, it is determined whether any web page data has been saved on the computing device (Fig. 1, 105) which is similar to the web page data of the current web page (Fig. 1, 110; Fig. 2C, 207) being accessed (Block 310). The computing device (Fig. 1, 105) therefore accesses any saved data on the memory (Fig. 1, 130) to determine whether the web page data of the web page (Fig. 1, 110; Fig. 2C, 207) currently being accessed is equivalent to or is at least similar to any other previously accessed web page's web page data. As discussed previously, the web page data may come in the form of a Uniform Resource Locator (URL), a Document Object Model (DOM) tree, or any other type of web page data and may be stored and accessed in a way so as to be compared with any web page data associated with any currently accessed web pages. This is done so as to first determine if such web page data exists and then, if it does, equate the web page data of the current web page with data of any previously accessed web page to present to the user those sections of the currently accessed web page which the user desires to view, print, or otherwise utilize.

[0042] If, for example, the current web page (Fig. 1, 110; Fig. 2C, 207) being viewed had not been accessed by the user earlier, any web page data relating to that web page (Fig. 1, 110; Fig. 2C, 207) may not have been saved for access by the computing device (Fig. 1, 105). Similarly, if a web page similar to the current web page (Fig. 1, 110; Fig. 2C, 207) being viewed had not been accessed by the user previously, any web page data to which the web page data of the currently viewed web page (Fig. 1, 110; Fig. 2C, 207) could be compared may not have been saved. When this occurs (Determination NO, Block 310), the computing device (Fig. 1, 105) performs a content search of the currently accessed web page to present a preliminary selection of user desirable content (Block 315). Content selection may be performed via a number of methods; however, in one example an algorithm may be implemented by the computing device (Fig. 1, 105) to select the most user desirable portions of the web page (Fig. 1, 110; Fig. 2C, 207).

[0043] One method of selecting user desirable content from a web page (Fig. 1, 110; Fig. 2C, 207) may include, first, segmenting the web page (Fig. 1, 110; Fig. 2C, 207) into several coherent areas or blocks. For example, the computing device (Fig. 1, 105) may access the source code of the web page (Fig. 1, 110; Fig. 2C, 207) to determine or create a Document Object Model (DOM) tree (Fig. 2A, 200) for the web page (Fig. 1, 110; Fig. 2C, 207), gather information about each node on the Document Object Model (DOM) tree (Fig. 2A, 200), and segment the web page (Fig. 1, 110; Fig. 2C, 207) into coherent areas or blocks. The computing device (Fig. 1, 105) may also eliminate or filter out any invisible elements of the web page (Fig. 1, 110; Fig. 2C, 207) which may not need to be included with the main content of the web page (Fig. 1, 110; Fig. 2C, 207).

[0044] The computing device (Fig. 1, 105) may then calculate a score for each area or block based on many features of the web page (Fig. 1, 110; Fig. 2C, 207). For example, a score may be calculated based on the horizontal and vertical coverage of each block, the normalized text length within each block, the link-to-text ratio within each block, the ratio of non-highlighted text to highlighted text within each block, the normalized block area, and the normalized number of any child Document Object Model (DOM) nodes within each block. The horizontal coverage may be obtained by computing the horizontal extent of a segment over the total area of the page. The blocks covering near the horizontal center get higher scores. Similarly, the vertical coverage may be obtained by computing the vertical extent of a segment over the total area of the page. The blocks covering near the top of the web page (Fig. 1, 110; Fig. 2C, 207) have higher scores. The normalized text length may be obtained by computing the text length of the segment over the maximal text length of all segments. The link-to-text ratio may be obtained by computing the link text length of the segment over the text length of the segment. Segments with a higher density of anchor text are more likely to be a navigational bar or an advertisement. Similarly, the non-highlighted text to highlighted text ratio may be obtained by computing the highlighted text length of the segment over the text length of the segment and then multiplying by the highlight weight. For example, the weight of <H1> is larger than that of <H6>. The normalized block area may be obtained by computing the segment area over the maximal area of all segments. Next, the normalized number of child Document Object Model (DOM) nodes may be obtained by computing the number of child nodes in the segment over the maximal number of child nodes in all segments.
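Four of the features named above (normalized text length, link-to-text ratio, normalized block area, and normalized number of child nodes) can be sketched in Python as follows. The `Block` record, the function names, and above all the way the features are combined into one score (an unweighted sum with the link-to-text ratio subtracted) are assumptions made for illustration; the specification lists the features but does not state how they are weighted or combined. The coverage and highlight features are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    name: str
    text_len: int        # characters of text in the block
    link_text_len: int   # characters of anchor (link) text in the block
    area: float          # rendered area of the block
    child_nodes: int     # number of child DOM nodes


def block_features(block: Block, blocks: List[Block]) -> Dict[str, float]:
    """Compute per-block features normalized against the maxima over all blocks."""
    max_text = max(b.text_len for b in blocks) or 1
    max_area = max(b.area for b in blocks) or 1.0
    max_children = max(b.child_nodes for b in blocks) or 1
    return {
        "norm_text_len": block.text_len / max_text,
        "link_to_text": block.link_text_len / block.text_len if block.text_len else 0.0,
        "norm_area": block.area / max_area,
        "norm_children": block.child_nodes / max_children,
    }


def block_score(block: Block, blocks: List[Block]) -> float:
    """Assumed combination: plain sum, with link density counted against the block."""
    f = block_features(block, blocks)
    return f["norm_text_len"] + f["norm_area"] + f["norm_children"] - f["link_to_text"]


# Hypothetical measurements for two blocks of the recipe page.
blocks = [Block("MainCol", 1000, 50, 400.0, 8), Block("Header", 100, 90, 100.0, 4)]
for b in blocks:
    print(b.name, round(block_score(b, blocks), 3))
```

With these made-up numbers the link-heavy Header block scores well below MainCol, which matches the observation that a high density of anchor text suggests navigation or advertising.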

[0045] Then, the computing device (Fig. 1, 105) may determine which areas or blocks have received the highest scores and present to a user, via a user interface such as the output device (150) of Fig. 1, those areas whose score exceeds a predetermined threshold. The main content (Fig. 2C, 290) is thus selected without any user interaction. Therefore, the selection of these portions of the web page (Fig. 1, 110; Fig. 2C, 207) may be done in the background while the web page (Fig. 1, 110; Fig. 2C, 207) is being accessed by the user.

[0046] In another example, the selection of the most often selected portions of the web page (Fig. 1, 110; Fig. 2C, 207) may be performed using a threshold. In this example, portions of the web page (Fig. 1, 110; Fig. 2C, 207) associated with particular nodes within the Document Object Model (DOM) tree (Fig. 2A, 200) are presented as a selection when they have been selected at least a threshold number of times. Again, this threshold may be predetermined by the client device (Fig. 1, 105), or may be selected by the user. For example, if a portion of the web page (Fig. 1, 110; Fig. 2C, 207) associated with a particular node has been selected by other users at least ten times, then that portion of the web page is presented to the user as a selection.
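As a minimal sketch of the count threshold described above, assuming that each recorded visit is stored as a list of node identifiers (the identifiers and the log layout are hypothetical):

```python
from collections import Counter


def nodes_over_count(selection_log, min_count=10):
    """Return nodes selected in at least min_count recorded visits."""
    # set(visit) ensures a node is counted at most once per visit.
    counts = Counter(node for visit in selection_log for node in set(visit))
    return {node for node, n in counts.items() if n >= min_count}


log = [["/html/body/div[2]"]] * 10 + [["/html/body/div[3]"]] * 3
picked = nodes_over_count(log)
```

Here only the node selected ten times meets the example threshold of ten.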

[0047] In another example, the selection of the most often selected portions of the web page (Fig. 1, 110; Fig. 2C, 207) may be performed using a fraction of times a particular portion of the web page (Fig. 1, 110; Fig. 2C, 207) was selected. In this example, if a particular node or other portion of the web page has been selected a number of times more than other portions of the web page above a predetermined fraction, then that portion of the web page is presented to the user as a selection. In one example, the fraction may be higher than about 0.8. In another example, the fraction may be higher than about 0.6.
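The fraction test above may be sketched as follows. This sketch assumes one possible reading, namely that the fraction is the share of recorded visits in which a node was selected; the log layout and node names are hypothetical.

```python
from collections import Counter


def nodes_over_fraction(selection_log, min_fraction=0.8):
    """Return nodes selected in more than min_fraction of recorded visits."""
    total = len(selection_log)
    if total == 0:
        return set()
    counts = Counter(node for visit in selection_log for node in set(visit))
    return {node for node, n in counts.items() if n / total > min_fraction}


# Ten visits: "article" selected in nine of them, "comments" in seven.
log = [["article", "comments"]] * 7 + [["article"]] * 2 + [[]]
strict = nodes_over_fraction(log, 0.8)
loose = nodes_over_fraction(log, 0.6)
```

With the higher fraction only the article qualifies; lowering the fraction to 0.6 also admits the comments.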

[0048] Further, in yet another example, the selection of the most often selected portions of the web page (Fig. 1, 110; Fig. 2C, 207) may be performed using a variance of a selection of a portion of the web page (Fig. 1, 110; Fig. 2C, 207). In this example, it may first be determined how consistently a particular node or portion of the web page (Fig. 1, 110; Fig. 2C, 207) is selected. In still another example, the selection of the most often selected portions of the web page (Fig. 1, 110; Fig. 2C, 207) may be performed using correlations between how related nodes or portions of the web page (Fig. 1, 110; Fig. 2C, 207) are selected.
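One way the consistency of a selection might be measured is the variance of a 0/1 indicator over recorded visits. This is an assumption made for illustration; the specification does not define the variance calculation, and the log layout is hypothetical.

```python
from statistics import pvariance


def selection_variance(selection_log, node):
    """Population variance of the 0/1 'was selected' indicator for a node."""
    indicators = [1 if node in visit else 0 for visit in selection_log]
    return pvariance(indicators)


log = [["article", "sidebar"], ["article"], ["article", "sidebar"], ["article"]]
steady = selection_variance(log, "article")   # selected on every visit
erratic = selection_variance(log, "sidebar")  # selected on half the visits
```

A variance of zero indicates a node that is always (or never) selected, while a node selected on half the visits yields the maximum value of 0.25.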

[0049] Still further, in other examples, the selection of the most often selected portions of the web page (Fig. 1, 110; Fig. 2C, 207) may be determined by a weighted count of a selection by its type, as a median of certain types of selections, or by some other voting scheme. For example, more weight may be given to a specific node within the Document Object Model (DOM) tree (Fig. 2A, 200) based on the content contained or described in that node. Therefore, if a website generally contains news articles, for example, the main article may be given more weight than other articles listed on the web page and may, therefore, be presented to the user over other portions of the web page. In another example, the type of content contained within one node may also determine what weight to give a node and thereby may determine whether a node is included in the selected content or not.
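A weighted count by content type, as described above, may be sketched as follows. The content types and weight values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical weights per content type; unknown types default to 1.0.
TYPE_WEIGHTS = {"article": 3.0, "image": 1.5, "link_list": 0.5}


def weighted_votes(selections):
    """Sum a per-type weight for every recorded (node, content_type) selection."""
    totals = defaultdict(float)
    for node, content_type in selections:
        totals[node] += TYPE_WEIGHTS.get(content_type, 1.0)
    return dict(totals)


votes = weighted_votes([
    ("main", "article"),
    ("main", "article"),
    ("related", "link_list"),
    ("related", "link_list"),
    ("related", "link_list"),
])
winner = max(votes, key=votes.get)
```

Although the list of related links was selected more often, the main article wins the vote because its content type carries more weight.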

[0050] After the computing device (Fig. 1, 105) has performed a content search of the web page to present a preliminary selection of user desirable content, the user may then be allowed to adjust the amount of content to be selected (Block 320) within the web page (Fig. 1, 110; Fig. 2C, 207). Still looking at Fig. 3 and now turning to Fig. 4, an illustrative diagram of the web page of Fig. 2C showing a selection of additional web page content (405) is shown. In addition to the selected main content (290) of the web page (407), the user may select additional content (405) of the web page (407). Specifically, this may be done by clicking on and dragging a number of control points (410) located around or otherwise associated with the selected main content (290) shown on the user interface of the computing device (Fig. 1, 105). In this manner, the user may add content to the selected main content (290) of the web page (407) by dragging, for example, a corner or side control point (410) of the main content (290) over additional portions of the web page (407). Further, the user may restrict the amount of content included in a selected portion by dragging the control points (410) off of portions of the main content (290) of the web page (407). Still further, the user may be allowed to drag a cursor over additional portions of the web page (407) so as to further select a separate portion of the web page (407) which is not close to the selected portion (290). For example, expanding the selected main content (290) of the web page by dragging a control point (410) may sweep in intervening content which the user does not wish to include. In this case, the user may create a new block or section (405) within the content of the web page separate and distinct from the selected main content (290) while still excluding those undesirable sections positioned between those two sections of content. Therefore, this addition and subtraction of the selected portions within the web page provides for a more effective and user-friendly means of selecting those desirable portions of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407).

[0051] Looking now at Fig. 3 again, the method further includes saving any pertinent web page data (Block 325) to a data storage device (Fig. 1, 130), thereby allowing easy access to the web page data by a processor (Fig. 1, 125) when the user accesses the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) or a web page similar to the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407). As discussed above, the web page data may be any type of data associated with the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) which allows a computing device (Fig. 1, 105) to select those user desirable portions of a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407). For example, web page data may include the Uniform Resource Locator (URL) of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407); the Document Object Model (DOM) (Fig. 2A, 200) of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407); information relating to the structure and layout of a Document Object Model (DOM) tree (Fig. 2A, 200) of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407); the layout and structure of any nodes within the Document Object Model (DOM) tree (Fig. 2A, 200); content of a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) or nodes previously or currently selected by a user within a Document Object Model (DOM) tree (Fig. 2A, 200); content of a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) or nodes not previously or currently selected by a user within a Document Object Model (DOM) tree (Fig. 2A, 200); any data relating to the amount or characteristics of any type of content of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) selected or not selected by an individual or entity; or combinations of these. Web page data may additionally include any metadata associated with or describing any of the above mentioned types of data. Still further, web page data may include not only data or metadata relating to the content an individual has selected from any one web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) in the past, but also information relating to when and how often the user had previously viewed, utilized, or adapted a web page or content on a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407).
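For illustration, a saved record holding some of the kinds of web page data listed above might look like the following. The field names and the JSON encoding are assumptions made for this sketch and are not part of the specification.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class WebPageData:
    url: str
    dom_paths: list         # structure of the DOM tree, one path per node
    selected_paths: list    # nodes the user selected
    unselected_paths: list  # nodes the user did not select
    visit_times: list = field(default_factory=list)  # when the page was used


def to_json(record):
    """Serialize a record so it can be written to a data storage device."""
    return json.dumps(asdict(record))


def from_json(text):
    """Rebuild a record from its serialized form."""
    return WebPageData(**json.loads(text))


record = WebPageData(
    url="http://news.example.com/world/story-a.html",
    dom_paths=["/html/body/div[1]", "/html/body/div[2]"],
    selected_paths=["/html/body/div[2]"],
    unselected_paths=["/html/body/div[1]"],
    visit_times=["2010-12-14T09:00:00"],
)
restored = from_json(to_json(record))
```

Storing both the selected and the unselected nodes allows a later comparison to decide what to include as well as what to exclude.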

[0052] The web page data stored on the data storage device (Fig. 1, 130) may then be retrieved at a later time by the processor (Fig. 1, 125) located on the computing device (Fig. 1, 105) so as to better select the user desired content of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) based on those portions of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) previously selected by the user. Therefore, if the user had previously accessed the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) or a web page similar to the one being currently accessed, and web page data relating to that web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) or a similar web page does exist (Determination YES, Block 310), then the computing device (Fig. 1, 105) determines whether the web page data of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) being currently accessed is similar to any of the web page data of a previously saved web page (Block 330). This may be done by allowing the computing device (Fig. 1, 105) to access the database associated with the web page data and compare data relating to the accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) with data relating to any previously accessed web page. For example, the computing device (Fig. 1, 105) may compare the Uniform Resource Locator (URL) of the presently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) with any other saved Uniform Resource Locator (URL) related to or associated with a previously accessed web page. Any web page data saved on the database relating to that Uniform Resource Locator (URL) is then compared (Block 330) with the web page data of the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407).
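The URL comparison above may be sketched as follows. This sketch assumes that two pages count as related when they share a host and a leading path segment; that rule, the depth parameter, and the example URLs are illustrative assumptions only.

```python
from urllib.parse import urlsplit


def url_key(url, depth=1):
    """Reduce a URL to its host plus the first `depth` path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return (parts.hostname or "", tuple(segments[:depth]))


def find_saved(url, saved_by_key):
    """Look up web page data saved for the same or a related URL, if any."""
    return saved_by_key.get(url_key(url))


saved_by_key = {
    url_key("http://news.example.com/world/2010/story-a.html"): "saved data A",
}
hit = find_saved("http://news.example.com/world/2010/story-b.html", saved_by_key)
miss = find_saved("http://shop.example.com/world/item.html", saved_by_key)
```

A different story in the same section of the same site finds the saved data, while a page on another host does not.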

[0053] Often, the layout of the content within a web page or even a template used in creating a web page may change over a period of time. For instance, an operator or owner of a web page may want to adjust the look of a web page and in so doing may use a different template or at least adjust the placement of the content on the web page. As can be appreciated, when a user has accessed a web page before these changes were implemented, has saved the pertinent web page data for future use, and revisits the web page after it was altered or adjusted, the web page data may not be similar enough to once again effectively obtain the user desirable content from the web page. In one example, the computing device (Fig. 1, 105) may determine that a threshold of instances of similarities between the saved web page data and the web page data accessed has not been met, thereby determining that a new set of web page data associated with the web page currently being accessed is to be saved. In this case (Determination NO, Block 330), the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) is treated as if the user had never previously visited the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407), and the method described above in connection with Blocks 315 through 325 is repeated for this web page. Specifically, a content selection algorithm is run (Block 315) to obtain user desirable content from the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407), the user is allowed to adjust (Block 320) the selected content (Fig. 2C, 290) to his or her preferences, and the web page data is again saved and stored on the data storage device (Fig. 1, 130) in a database (Block 325).
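One possible form of the similarity determination of Block 330 is sketched below. It assumes that the saved and current web page data are each reduced to a set of DOM node paths and compared by Jaccard overlap; both the measure and the threshold value are assumptions, since the specification only requires that a threshold of similarities be met.

```python
def structural_similarity(saved_paths, current_paths):
    """Jaccard overlap between two sets of DOM node paths (0.0 to 1.0)."""
    a, b = set(saved_paths), set(current_paths)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_similar(saved_paths, current_paths, threshold=0.7):
    """True when the overlap meets the (assumed) similarity threshold."""
    return structural_similarity(saved_paths, current_paths) >= threshold


saved = {"/body/header", "/body/article", "/body/aside", "/body/footer"}
current = {"/body/header", "/body/article", "/body/aside", "/body/nav"}
overlap = structural_similarity(saved, current)
```

Three of the five distinct paths are shared, so the overlap is 0.6; with the assumed threshold of 0.7 the page would be treated as new and Blocks 315 through 325 would be repeated.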

[0054] If, however, the web page data of the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) is similar enough to the web page data previously stored in the database (Determination YES, Block 330), then the computing device (Fig. 1, 105) may compare (Block 335) the web page data associated with the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) with the content of the web page data associated with the previously accessed web page to see if there is any equivalent or similar web page data. After the computing device (Fig. 1, 105) has compared both sets of web page data, the computing device (Fig. 1, 105) may then present that equivalent or similar content to the user (Block 340) on an output device (Fig. 1, 150) such as a monitor for the user to store, print, or otherwise utilize.

[0055] In one alternative example of the present specification, the web page data stored on the computing device (Fig. 1, 105) may comprise, at least, web page data relating to the content of the web page which was previously selected by the user and saved earlier in response to the user accessing that web page. Therefore, the computing device (Fig. 1, 105) may compare that web page data to the web page data associated with the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) currently being accessed to determine which content of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) to include in and exclude from the content selection.

[0056] In another alternative example of the present specification, the web page data stored on the computing device (Fig. 1, 105) may comprise, at least, web page data relating to the content of the web page which was not previously selected by the user, that data also being saved earlier in response to the user accessing that web page. Therefore, the computing device (Fig. 1, 105) may compare that web page data to the web page data associated with the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) currently being accessed and determine which content of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407) to include in and exclude from the content selection.

[0057] Similar to Block 320 discussed above, after the equivalent portions of the web page have been presented to the user (Block 340), the user may further be allowed to adjust the content selection (Block 345). Again, still looking at Fig. 3 and now turning to Fig. 5, in addition to the content selected by the computing device based on previous selections made by the user (590), the user may select additional portions (505) of the web page (507). The user may further exclude portions of the web page (507) from being part of the user desirable content selection. Specifically, this may be done by clicking on and dragging a number of control points (510) located around or otherwise associated with the selected content shown on the user interface of the computing device (Fig. 1, 105). In this manner, the user may add portions of the web page (507) to the user desirable portion by dragging, for example, a corner or side control point (510) of the selected portion over additional portions of the web page (507). Further, the user may restrict the amount of content included in a selected portion by dragging the control points (510) off of portions of the selected content of the web page (507). Still further, the user may be allowed to drag a cursor over additional portions of the web page (507) so as to further select a separate portion of the web page (507) which is not close to the previously selected portion (590). For example, expanding the previously selected portion of the web page by dragging a control point (510) in order to include additional content may sweep in intervening content which the user does not wish to include. In this case, the user may create a new block or section (505) within the content of the web page separate and distinct from the previously selected portion (590) while still excluding those undesirable sections positioned between those two portions. Therefore, this addition and subtraction of the previously selected portions (590) within the web page provides for a more effective and user-friendly means of obtaining those desirable portions of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507).

[0058] Once the user has had the opportunity to adjust the selection of the content in the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507), the computing device determines (Block 350) if significant changes have been made by the user to the amount or type of content selected. These changes are compared to the initial content presented to the user after the computing device (Fig. 1, 105) had found, equated, and presented (Blocks 335 and 340) the web page data and content of the current web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507). Therefore, in one example, if the amount of content has been adjusted by any degree (Determination YES, Block 350), then the web page data representing the new amount and type of content selected by the user is stored on a database (Block 325) for future reference by the processor (Fig. 1, 125).

[0059] In another example, if the amount of content has been adjusted beyond a predetermined threshold (Determination YES, Block 350), then the web page data representing the new amount of content selected by the user is stored on a database (Block 325) for future access by the processor (Fig. 1, 125). However, if the changes to the content selected by the user do not meet the predetermined threshold (Determination NO, Block 350), then the process ends without the web page data representing those adjustments being stored (Block 325). In one example, the predetermined threshold may be determined by the number of nodes within the web page's DOM tree which have and have not been selected. Therefore, both the number of additional nodes selected as well as the number of nodes omitted as compared to the original number of nodes previously selected may be taken into account when determining if the predetermined threshold has or has not been met.
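The determination of Block 350 may be sketched by counting the nodes added to and omitted from the presented selection. The threshold value and the choice to sum the two counts are assumptions made for illustration.

```python
def change_size(presented, adjusted):
    """Number of nodes added plus number of nodes omitted by the user."""
    presented, adjusted = set(presented), set(adjusted)
    added = adjusted - presented
    omitted = presented - adjusted
    return len(added) + len(omitted)


def is_significant(presented, adjusted, threshold=2):
    """True when the adjustment meets the (assumed) node-count threshold."""
    return change_size(presented, adjusted) >= threshold


presented = {"headline", "article", "byline"}
adjusted = {"headline", "article", "comments"}  # one omitted, one added
significant = is_significant(presented, adjusted)
```

Swapping one node for another counts as two changes here, so the adjusted selection would be saved (Block 325) under the assumed threshold of two.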

[0060] When the changes to the content selection are significant enough (Determination YES, Block 350), the web page data and the web page data defining those changes are saved and stored once again for future use (Block 325). When the changes are not significant enough (Determination NO, Block 350), the user is taken to have chosen those selected portions of the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) which were presented to the user (Block 340), and those portions represent the most user desirable content on that web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507).

[0061] The methods described above can be accomplished by a computer program product comprising a computer readable storage medium having computer usable program code embodied therewith that when executed performs the above methods. Specifically, the computer usable program code may determine whether any web page data exists that relates to the current web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) being viewed by the user. The computer usable program code may further determine whether the web page data associated with the currently viewed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) is similar or equivalent to any web page data associated with any previously viewed web pages. Still further, the computer usable program code may present to a user, via an output device (Fig. 1, 150), the equivalent web page data selected most often as selected content within the second web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507). Further, the computer usable program code may interpret and store any changes made to the selected content within the web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) being accessed.
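The overall flow of Fig. 3 carried out by such program code may be sketched as follows. The function names, the in-memory dictionary standing in for the database, and the representation of a page as a set of node names are all assumptions; the individual steps are supplied as hypothetical callables.

```python
def handle_page(url, nodes, store, default_select, is_similar, adjust, changed):
    """Walk one page access through Blocks 310 to 350 of Fig. 3."""
    saved = store.get(url)                                      # Block 310
    if saved is None or not is_similar(saved["nodes"], nodes):  # Block 330
        selection = adjust(default_select(nodes))               # Blocks 315, 320
        store[url] = {"nodes": nodes, "selection": selection}   # Block 325
        return selection
    presented = saved["selection"] & nodes                      # Blocks 335, 340
    selection = adjust(presented)                               # Block 345
    if changed(presented, selection):                           # Block 350
        store[url] = {"nodes": nodes, "selection": selection}   # Block 325
    return selection


store = {}
page = {"header", "article", "comments", "footer"}
steps = dict(
    default_select=lambda n: {"article"},
    is_similar=lambda old, new: len(old & new) / len(old | new) >= 0.5,
    changed=lambda before, after: before != after,
)
# First visit: default selection, then the user adds the comments.
first = handle_page("http://example.com/a", page, store,
                    adjust=lambda s: s | {"comments"}, **steps)
# Second visit: the saved selection is presented and left unchanged.
second = handle_page("http://example.com/a", page, store,
                     adjust=lambda s: s, **steps)
```

On the second visit the adjusted selection from the first visit is presented directly, with no further interaction needed from the user.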

[0062] In conclusion, the specification and figures describe a method of selecting content within a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507). Specifically, the specification and figures describe a method of selecting content within a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) by equating equivalent web page data within a currently viewed web page with web page data associated with a previously accessed web page, and presenting, via a user interface, the equivalent content to a user. This method of selecting content within a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) may have a number of advantages, including: accuracy in the amount and type of user desirable content selected by the computing device; assimilation of user specific personal preferences as to the type and amount of content selected by the computing device; immediate accuracy in the amount and type of user desirable content selected by the computing device; selection of user desirable content based on the user's preferences without further interaction by the user; and, increase in privacy because the web page data saved by the computing device is saved locally or otherwise only obtainable by the user's computing device.

[0063] The preceding description has been presented only to illustrate and describe embodiments and examples of the principles described. This description is not intended to be exhaustive or to limit these principles to any precise form disclosed. Many modifications and variations are possible in light of the above teaching.

Claims

WHAT IS CLAIMED IS:

1. A method of selecting content within a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) comprising:

accessing first web page data associated with at least one previously accessed web page;

accessing second web page data associated with a currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507);

comparing the first web page data with the second web page data; and

presenting to a user, via an output device (Fig. 1, 150), equivalent web page data selected most often within the at least one previously accessed web page as selected content within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507).

2. The method of claim 1, further comprising determining if the first web page data exists;

in which, if the first web page data exists, then presenting to a user the equivalent web page data selected most often within the at least one previously accessed web page as selected content within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507), and

in which, if the first web page data does not exist, then running a default content selection algorithm to select main content within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507).

3. The method of claim 2, further comprising receiving input from a user relating to adjustments to the content selected within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) if the first web page data does not exist.

4. The method of claim 3, further comprising saving web page data associated with content selected within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) to a data storage device.

5. The method of claim 1, further comprising receiving input from a user relating to adjustments to the content selected within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507).

6. The method of claim 5, further comprising determining if changes have been made to the content selection within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507);

in which, if changes have been made to the content selection within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) within a predetermined threshold, then new web page data associated with the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) is saved to a data storage device (Fig. 1, 130).

7. The method of claim 1, in which the first web page data associated with the at least one previously accessed web page is saved to a data storage device (Fig. 1, 130).

8. The method of claim 1, in which accessing the first web page data and second web page data is performed by a processor (Fig. 1, 125) located on a computing device (Fig. 1, 105).

9. The method of claim 1, in which the first and second web page data comprises at least one of a Uniform Resource Locator (URL), a web page Document Object Model (DOM) (Fig. 2A, 200), data defining the structure and layout of a Document Object Model (DOM) tree (Fig. 2A, 200) of a web page, layout and structure of the nodes within a Document Object Model (DOM) tree (Fig. 2A, 200), content of a web page previously selected by a user within a Document Object Model (DOM) tree (Fig. 2A, 200), content of a web page currently selected by a user within a Document Object Model (DOM) tree (Fig. 2A, 200), content of nodes previously selected by a user within a Document Object Model (DOM) tree (Fig. 2A, 200), content of nodes currently selected by a user within a Document Object Model (DOM) tree (Fig. 2A, 200), data relating to the amount of content of a web page which had been previously selected by a user, data relating to the amount of content of a web page which had previously not been selected by a user, data relating to the characteristics of content of a web page which had been previously selected by a user, data relating to the characteristics of content of a web page which had previously not been selected by a user, metadata associated with any of the above mentioned types of data, metadata describing any of the above mentioned types of data, data relating to when and how often a user had previously adapted a web page, data relating to when and how often a user had previously adapted content on a web page, or combinations thereof.

10. A computer program product for selecting content within a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507), the computer program product comprising:

a computer readable storage medium having computer usable program code embodied therewith, the computer usable program code comprising:

computer usable program code that, when executed, accesses first web page data associated with at least one previously accessed web page;

computer usable program code that, when executed, accesses second web page data associated with a currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507);

computer usable program code that, when executed, compares the first web page data with the second web page data; and

computer usable program code that, when executed, presents to a user, via an output device (Fig. 1, 150), equivalent web page data selected most often within the at least one previously accessed web page as selected content within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507).

11. The computer program product of claim 10, further comprising:

computer usable program code that, when executed, determines if the first web page data exists;

computer usable program code that, when executed, presents equivalent web page data within the at least one previously accessed web page as selected content within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) to a user if the first web page data exists, and

computer usable program code that, when executed, runs a default content selection to select main content within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) if the first web page data does not exist.

12. The computer program product of claim 10, further comprising computer usable program code that, when executed, receives input from a user relating to adjustments to the content selected within the second web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507).

13. The computer program product of claim 12, further comprising:

computer usable program code that, when executed, determines if changes have been made to the content selection within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507); and

computer usable program code that, when executed, saves new data associated with the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) to a data storage device (Fig. 1, 130) if changes have been made to the content selection within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) within a predetermined threshold.

14. A system for selecting content within a web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) comprising:

a data storage device (Fig. 1, 130) that stores first web page data associated with at least one previously accessed web page and second web page data associated with a currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507); and

a processor (Fig. 1, 125), communicatively coupled to the storage medium (Fig. 1, 130), that accesses the first and second web page data, compares the first web page data with the second web page data, and presents to a user, via an output device (Fig. 1, 150), equivalent web page data selected most often within the at least one previously accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507) as selected content within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507).

15. The system of claim 14, in which the processor (Fig. 1, 125) further determines if the first web page data exists:

in which, if the first web page data exists, then the processor (Fig. 1, 125) presents to a user the equivalent web page data selected most often within the at least one previously accessed web page as selected content within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507), and

in which, if the first web page data does not exist, then the processor (Fig. 1, 125) runs a default content selection to select main content within the currently accessed web page (Fig. 1, 110; Fig. 2C, 207; Fig. 4, 407; Fig. 5, 507).