EP2567513A1 - System and method for monitoring web content - Google Patents

System and method for monitoring web content

Info

Publication number
EP2567513A1
EP2567513A1 EP10850926A EP10850926A EP2567513A1 EP 2567513 A1 EP2567513 A1 EP 2567513A1 EP 10850926 A EP10850926 A EP 10850926A EP 10850926 A EP10850926 A EP 10850926A EP 2567513 A1 EP2567513 A1 EP 2567513A1
Authority
EP
European Patent Office
Prior art keywords
location
feature
monitoring
content
attributes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP10850926A
Other languages
German (de)
French (fr)
Inventor
Hyun Chul Lee
Byron Bondling Ma
Kyu Lee
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rogers Communications Inc
Original Assignee
Rogers Communications Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rogers Communications Inc
Publication of EP2567513A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Definitions

  • the present disclosure relates generally to the monitoring of dynamic content. More specifically, it relates to a method and system for monitoring content, such as web-pages, which are stored at a plurality of locations in a location set.
  • Monitoring web-page content and fetching web-page content may be useful in systems which index or classify such content.
  • Search engines, news aggregation services, and other indexing and classification systems may re-visit web-pages from time to time in order to determine whether content associated with those web-pages has changed. Where content has changed, such systems may update indexing and classification data.
  • This approach to monitoring and fetching may be less effective when monitoring highly dynamic web-pages and web-content. For example, visiting web-pages in a predetermined fixed order may be inefficient for monitoring web-pages which are micro-blogs, such as Twitter™.
  • FIG. 1 shows a system diagram illustrating a possible environment in which embodiments of the present application may operate.
  • FIG. 2 shows a block diagram of a content monitoring system in accordance with an embodiment of the present disclosure.
  • FIG. 3 shows a block diagram of a content monitoring system in accordance with a further embodiment of the present disclosure.
  • FIG. 4 shows a flowchart of a process for monitoring content in accordance with an embodiment of the present disclosure.
  • FIG. 5 shows a flowchart of a process for recognizing monitoring content in accordance with a further embodiment of the present disclosure.
  • FIG. 6 shows a flowchart of a process for recognizing monitoring content in accordance with another embodiment of the present disclosure.
  • the present disclosure provides a method of monitoring content stored at a plurality of locations in a location set.
  • The method comprises: determining two or more historic attributes for a first feature associated with each location; for each location in the location set, determining a first predicted attribute for the first feature associated with that location based on the historic attributes for that first feature and that location; determining a monitoring schedule in accordance with the first predicted attribute; and monitoring the content at the locations in the location set according to the monitoring schedule.
  • the present application provides a content monitoring system for monitoring content stored at a plurality of locations in a location set.
  • the system comprises a prediction component.
  • the prediction component is configured to determine two or more historic attributes for a first feature associated with each location .
  • the prediction component is further configured to, for each location in the location set, determine a first predicted attribute for the first feature associated with that location based on the historic attributes for that first feature and that location .
  • the system further comprises a scheduling component configured to determine a monitoring schedule in accordance with the first predicted attribute.
  • the system further comprises a monitoring component configured to monitor the content at the locations in the location set according to the monitoring schedule.
  • FIG. 1 illustrates a system diagram of a possible operating environment in which embodiments of the present disclosure may operate.
  • a content monitoring system 160 is illustrated.
  • the content monitoring system 160 is configured to monitor content of electronic documents 120a, 120b located at a plurality of locations 182, 184, which may be identified in a location set 180. That is, the content monitoring system 160 is configured to monitor electronic documents 120a, 120b located at a set of locations 182, 184 defined by a location set 180.
  • The location set 180 is stored in a storage 190 which is accessible by the content monitoring system 160.
  • the storage 190 may, in some embodiments, be internal storage of the content monitoring system 160. In other embodiments, the storage 190 may be external storage of the content monitoring system 160, including, for example, network storage accessible through a network 104.
  • the electronic documents 120a, 120b may vary over time. That is, the content of an electronic document 120a, 120b located at any given location 182, 184 may vary over time.
  • The electronic documents 120a, 120b may, in various embodiments, be one or more of: Really Simple Syndication ("RSS") feeds or other cascaded feeds, blogs, micro-blogs such as Twitter™, on-line news sources, user-generated comments from web-pages, etc.
  • Other types of electronic documents 120a, 120b are also possible.
  • The electronic documents 120a, 120b may be formatted in a Hyper-Text Markup Language ("HTML") format, a plain-text format, or a portable document format ("PDF").
  • the electronic documents 120a, 120b may be an image, such as a JPEG or Bitmap image. Other document formats are also possible.
  • the electronic documents 120a, 120b may be located at associated locations 182, 184 on a plurality of document servers 114a, 114b, which may be accessible through a network 104, such as the Internet.
  • The document servers 114a, 114b may be publicly and/or privately accessible web-pages which may be identified by a unique Uniform Resource Locator ("URL").
  • the locations 182, 184 may be URLs.
  • the network 104 may be a public or private network, or a combination thereof.
  • The network 104 may be comprised of a Wireless Wide Area Network (WWAN), a Wireless Local Area Network (WLAN), the Internet, a Local Area Network (LAN), or any combination of these network types.
  • Other types of networks are also possible and are contemplated by the present disclosure.
  • the location set 180 which defines the locations 182, 184 of the electronic documents 120a, 120b which are to be monitored may be stored on the storage 190.
  • The storage 190 may include non-volatile memory such as, for example, a Hard Disk Drive (HDD), Flash Memory, or other types of memory. In some embodiments, the storage 190 may include a combination of different types of memory.
  • the content monitoring system 160 may include functionality in addition to the ability to monitor the content of electronic documents 120a, 120b located at locations 182, 184.
  • the content monitoring system 160 may be a document aggregation system 150.
  • the document aggregation system 150 may be configured to search document servers 114a, 114b to locate and/or group electronic documents 120a, 120b which are related to a common subject matter.
  • the electronic documents 120a, 120b may, in some embodiments, be news-related documents which contain information about recent, interesting, topical and/or important events. In such cases, the document aggregation system 150 may also be referred to as a news aggregation system .
  • the news aggregation system may be configured to locate and group electronic documents 120a, 120b which are related to a common event or story.
  • the locations 182, 184 in the location set 180 may be predefined fixed locations.
  • the locations 182, 184 may, in some embodiments, be specified, in whole or in part by a user of the content monitoring system 160, such as, for example, a system administrator.
  • the location set may be dynamic.
  • the content monitoring system 160 (which may be a document aggregation system 150) may include a document search subsystem (not shown) .
  • the document search subsystem (not shown) may be used by the document aggregation system 150 to locate documents accessible through the network 104, which may be located at locations which are not identified in the location set 180.
  • the document search subsystem may be configured to search document servers 114a, 114b based on a search algorithm in order to identify electronic documents 120a, 120b matching a search criteria .
  • the search algorithm may provide for searching of websites (or other document servers 114a, 114b) of a specific category using a search keyword or phrase.
  • the document search subsystem may be configured to search blogs, micro blogs, and/or online traditional news sources, etc.
  • the document search subsystem may, in some embodiments, rely on a third party search engine which may not be physically located within the document aggregation system 150.
  • A third party search engine such as Google™ may be used.
  • Where the document search subsystem identifies documents matching the search criteria, it may update the location set 180 to include the locations of those identified documents. For example, in some circumstances, the document search subsystem may search for electronic documents 120a, 120b which relate to a specific news item, such as a specific event. If any such documents are located, the location set 180 may be updated to include the location 182, 184 of those electronic documents 120a, 120b in order to cause the content monitoring system 160 to monitor the content of the documents 120a, 120b at those locations 182, 184.
  • the document aggregation system 150 also includes a document classification subsystem (not shown) which associates electronic documents 120a, 120b and/or the content therein with one or more labels.
  • The document classification subsystem may associate one or more documents 120a, 120b with a phrase contained in the one or more documents 120a, 120b.
  • the label which is associated with the electronic document 120a, 120b may be used to identify the subject matter of the electronic document 120a, 120b.
  • the document aggregation system 150 may include other subsystems not specifically described above.
  • the document aggregation system 150 may, in some embodiments, include a ranking subsystem which ranks documents 120a, 120b or the subject of documents 120a, 120b based on frequency of use or frequency of occurrence.
  • the subjects of a plurality of documents 120a, 120b may be ranked by determining the frequency of occurrence of each label (such as a phrase) associated with documents 120a, 120b.
  • the rank may indicate, in at least some embodiments, how topical the subject matter associated with that label is.
  • the document aggregation system 150 may include a web-interface subsystem (not shown) for automatically generating web pages which provide links for accessing the documents 120a, 120b on the document servers 114a, 114b and other information about the documents 120a, 120b.
  • the other information may include a machine-generated summary of the contents of the document, and the rank of the subject matter of the document as determined by the ranking subsystem (not shown) .
  • the web pages which are generated by the web-interface subsystem may group documents 120a, 120b by subject matter and/or by phrases which are used in the electronic documents 120a, 120b.
  • other subsystems of the document aggregation system 150 may also include a power subsystem for providing electrical power to electrical components of the document aggregation system 150 and a communication subsystem for communicating with the document servers 114a, 114b through the network 104.
  • The content monitoring system 160 may include more or fewer systems, modules, subsystems and/or functions than are discussed herein. It will also be appreciated that the functions provided by any set of systems or subsystems described above may be provided by a single system and that these functions are not, necessarily, logically or physically separated into different subsystems.
  • While FIG. 1 illustrates one possible embodiment in which the content monitoring system 160 may operate (i.e. where the content monitoring system 160 is a document aggregation system 150), it will be appreciated that the content monitoring system 160 may be employed in any system in which it may be useful to monitor the content of electronic documents 120a, 120b located at locations 182, 184 of a location set 180.
  • The term content monitoring system 160 is intended to include stand-alone content monitoring systems which are not, necessarily, part of a larger system, and also content monitoring sub-systems which are part of a larger system (which may be the same as or different from the document aggregation system 150 of FIG. 1).
  • the term content monitoring system 160 is, therefore, intended to include any systems in which the content monitoring methods described herein are included.
  • the content monitoring system 160, and/or the document aggregation system 150 may be implemented, in whole or in part, by way of a processor 240 which is configured to execute software modules 260 stored in memory 250.
  • A block diagram of one such example content monitoring system 160 is illustrated in FIG. 2.
  • The content monitoring system 160 includes a controller comprising one or more processors 240 which control the overall operation of the content monitoring system 160.
  • the content monitoring system 160 also includes memory 250 which is connected to the processor 240 for receiving and sending data to the processor 240. While the memory 250 is illustrated as a single component, it will typically be comprised of multiple memory components of various types.
  • the memory 250 may include Random Access Memory (RAM), Read Only Memory (ROM), a Hard Disk Drive (HDD), Flash Memory, or other types of memory. It will be appreciated that each of the various memory types will be best suited for different purposes and applications.
  • the processor 240 may operate under stored program control and may execute software modules 260 stored on the memory 250.
  • the software modules 260 may be comprised of, for example, a content monitoring module 280 which is configured to monitor the content of one or more electronic documents 120a, 120b (FIG. 1) located at locations 182, 184 identified in the location set 180.
  • the content monitoring module 280 may include a monitoring component 234 which is configured to monitor electronic documents 120a, 120b (FIG. 1) according to a monitoring schedule 202.
  • The monitoring schedule 202 specifies the order in which the content of electronic documents 120a, 120b at locations 182, 184 of the location set 180 is monitored.
  • The monitoring schedule 202 may be determined by a scheduling component 232 of the content monitoring module 280.
  • the monitoring schedule 202 may be stored in the storage 190 by the scheduling component 232 and retrieved by the monitoring component 234.
  • monitoring schedule 202 will be discussed in greater detail below.
  • the monitoring schedule 202 may, in at least some embodiments, act as a queue which lists locations 182, 184 in the order in which they are to be monitored.
  • the monitoring component 234 is configured to monitor the documents 120a, 120b at the locations 182, 184 in the monitoring schedule 202 in the order in which they are listed in the monitoring schedule.
  • Monitoring electronic documents 120a, 120b may, in various combinations
  • the monitoring component 234 may, in various embodiments, be configured to fetch the electronic documents 120a, 120b from their respective locations 182, 184 and to save the electronic documents 120a, 120b to the storage 190.
  • the electronic documents 120a, 120b may be saved in a fetched content 206 portion of the storage 190.
  • monitoring electronic documents 120a, 120b may include monitoring documents referred to and/or linked to in the electronic documents 120a, 120b located at the locations 182, 184 in the location set 180.
  • the document 120a, 120b located at a location 182, 184 in the location set 180 may, in some embodiments, be a cascaded data object such as an RSS feed.
  • the monitoring component 234 may be configured to visit locations referred to or linked in the document that is the RSS feed, when monitoring that document. That is, the monitoring component may visit locations referred to or linked to in an RSS document in order to retrieve and/or fetch content from other documents located at the referred-to or linked-to locations.
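  • As an illustration of visiting the locations referred to in a cascaded document, the following is a minimal sketch that extracts the linked locations from an RSS 2.0 feed. The function name `linked_locations` and the sample feed are hypothetical and do not come from the disclosure; only the Python standard library is used.

```python
import xml.etree.ElementTree as ET


def linked_locations(rss_text: str) -> list[str]:
    """Return the URL in the <link> child of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")  # None when the item has no <link>
        if link:
            links.append(link.strip())
    return links


# Hypothetical feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<link>http://example.com/</link>
<item><title>A</title><link>http://example.com/a</link></item>
<item><title>B</title><link>http://example.com/b</link></item>
</channel></rss>"""

locations_to_visit = linked_locations(SAMPLE_FEED)
```

  • A monitoring component could then fetch each returned location in turn; the channel-level link is deliberately skipped because it identifies the feed's own site rather than a linked document.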
  • the monitoring component 234 may be configured to perform a duplication checking analysis on fetched content 206 before saving the content to the storage 190.
  • The monitoring component 234 may compare the fetched content with fetched content already saved to the storage 190. If the fetched content has not already been saved, the monitoring component 234 may save the content to the storage 190. If the fetched content has already been saved, the monitoring component 234 may not re-save the content to the storage 190.
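  • One way to sketch the duplication check is to key saved content by a cryptographic digest, so that identical content is stored only once. The disclosure does not say how the comparison is performed; the use of SHA-256 and the class name `FetchedContentStore` are assumptions made for illustration.

```python
import hashlib


class FetchedContentStore:
    """In-memory stand-in for the fetched content portion of the storage."""

    def __init__(self) -> None:
        self._by_digest: dict[str, tuple[str, bytes]] = {}

    def save_if_new(self, location: str, content: bytes) -> bool:
        """Save content unless identical content is already stored.

        Returns True when the content was saved, False when it was a duplicate.
        """
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._by_digest:
            return False  # duplicate: do not re-save
        self._by_digest[digest] = (location, content)
        return True

    def __len__(self) -> int:
        return len(self._by_digest)
```

  • Comparing digests avoids a byte-by-byte comparison against every saved document, at the cost of treating any change, however small, as new content.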
  • the monitoring component 234 may be further configured to analyze electronic documents 120a, 120b located at the locations 182, 184 of the location set 180 to determine one or more attributes associated with features of the electronic documents 120a, 120b.
  • Each attribute may be related to a feature of the electronic documents 120a, 120b at a specific point in time.
  • the attribute may be related to a feature of the electronic document 120a, 120b at the point in time in which the electronic documents 120a, 120b are fetched from their respective locations 182, 184.
  • the time which is related to each attribute is, generally, a time which has already passed .
  • the attributes may, in at least some embodiments, be referred to as historic attributes. Since the attributes are each related to one or more features of the electronic document 120a, 120b, the attributes may also be referred to as feature attributes 204.
  • the feature attributes 204 may be a value, quantifier, or other attribute associated with a feature of an associated electronic document 120a, 120b at an associated point in time. That is, the feature attributes 204 serve to quantify features.
  • The features of the electronic documents 120a, 120b represent information about the electronic document 120a, 120b which may be used to determine how frequently the location 182, 184 associated with the document 120a, 120b will be monitored. That is, features are characteristics associated with the electronic document 120a, 120b which may be used in order to determine how often the location 182, 184 of the document 120a, 120b should be revisited for monitoring and/or how the monitoring of the document 120a, 120b should be prioritized relative to the monitoring of other documents 120a, 120b.
  • the features may include one or more of: an indicator of whether the document at a location was updated or not updated since a last visit to that same location, an indicator of the age of the document (for example, the elapsed time since the last change to the document), a quantifier of the number of comments associated with the electronic document 120a, 120b (for example, if the electronic document 120a, 120b is a web page which permits commenting, the comments may be a feature and the number of comments may be a feature attribute), and/or a quantifier of the number of inlinks associated with the electronic document 120a, 120b.
  • Inlinks are links, such as hyper-text links, which point to the electronic document 120a, 120b.
  • the number of inlinks is not determined from the document 120a, 120b itself, but rather, from examining other documents to determine whether they link to the document 120a, 120b.
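  • To illustrate that inlinks are counted by examining other documents, the following sketch counts how many documents in a collection contain a hyper-text link to a target URL. The function name `count_inlinks`, the exact-match comparison of URLs, and the sample data shape are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def count_inlinks(target_url: str, other_documents: dict[str, str]) -> int:
    """Count the documents (keyed by their own URL) that link to target_url."""
    count = 0
    for html in other_documents.values():
        collector = _LinkCollector()
        collector.feed(html)
        if target_url in collector.hrefs:
            count += 1  # each linking document counts once
    return count
```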
  • the feature may also include a feature which is a link analysis based ranking associated with the electronic document 120a, 120b.
  • A PageRank™ associated with an electronic document 120a, 120b may be a feature of that electronic document 120a, 120b.
  • the specific value or other quantifier of the feature for each document 120a, 120b at an associated time is the feature attribute 204 for that feature.
  • A specific PageRank™ value associated with a specific electronic document 120a, 120b at a specific point in time may be an attribute of a PageRank™ feature for that electronic document 120a, 120b.
  • The feature attributes 204 which are determined by the monitoring component 234 may be saved to storage 190 associated with the content monitoring system 160.
  • The feature attributes 204 may be saved in a features database in the storage 190. Each feature attribute 204 may be saved along with a time related to that feature attribute 204. That is, the feature attributes 204 may be saved in a time-series fashion.
  • The time may, in at least some embodiments, be the time at which the feature attributes 204 were observed or determined. In at least some embodiments, the time may be saved using the POSIX time convention. However, other time formats may also be used.
  • The monitoring component 234 may be configured to only record a finite number of values associated with each feature for each location 182, 184 in the location set 180. This finite number may be defined by a feature attribute threshold. Once the feature attribute threshold is met, older feature attributes 204 may be removed from storage 190 in order to make room for newer feature attributes 204. For example, in some embodiments, the monitoring component 234 may be configured to record only the last k feature attributes 204 associated with each feature for each location.
  • the storage 190 may, in some embodiments, be internal storage of the content monitoring system 160, such as internal memory of the content monitoring system 160. In other embodiments, the storage 190 may be external storage which is accessible by the content monitoring system 160. For example, the storage 190 may, in some embodiments, be network storage.
  • the content monitoring module 280 may also include a prediction component 230.
  • The prediction component 230 may be configured to, for each location 182, 184 in the location set 180, determine a first predicted attribute for the first feature associated with that location based on the historic feature attributes 204 for that first feature and that location 182, 184. That is, in at least some embodiments, the prediction component 230 may, for each location 182, 184 in the location set 180, determine a future attribute for a first feature associated with that location based on historic feature attributes 204 for that first feature and that location.
  • the prediction component 230 may attempt to determine future attributes of features based on previously observed attributes of that same feature.
  • the prediction component 230 may attempt to predict whether, at some future time, the document will be updated or not since the last visit.
  • the prediction component 230 may attempt to predict what the age of the document will be at some future time.
  • the prediction component 230 may attempt to predict the number of comments associated with the electronic document at some future time.
  • the prediction component 230 may attempt to predict the number of inlinks associated with the electronic document 120a, 120b at some future time.
  • the prediction component 230 may attempt to predict the link analysis based ranking associated with the electronic document 120a, 120b at some future time.
  • The prediction component 230 may, in at least some embodiments, include a regression computation module which performs a regression analysis on historic attributes (also known as feature attributes 204) associated with a feature and a location in order to determine predicted attributes for that same feature and location.
  • The historic attributes may be taken at times that are irregular. That is, since monitoring does not occur in a fixed order, the time period between successive feature attributes for any location may be variable. Accordingly, a regression analysis which does not require fixed time intervals may be utilized by the prediction component 230. For example, in at least some embodiments, Brown's double exponential smoothing method may be used. In such embodiments, a predicted attribute for a feature and a location may be determined according to the following formula: [formula not legible in the source], where:
  • $\hat{X}_n$ is the predicted attribute
  • $n$ is the number of historic attributes for the feature and location which are used to determine the predicted attribute
  • $t$ is the time associated with a historic attribute (i.e. $t_n$ is the time for the $n$-th historic attribute for that feature and that location)
  • $\hat{X}_{n-1}$ is the last predicted attribute
  • $X_n$ is a feature attribute.
  • the smoothing parameter is a value which is, in at least some embodiments, between the range of zero (0) to one (1) . In at least some embodiments, the smoothing parameter is approximately 0.1.
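  • For orientation, the following is a minimal sketch of the textbook form of Brown's double exponential smoothing. It is an assumption-laden stand-in, not the formula of the disclosure: the textbook form treats the observations as equally spaced and therefore ignores the times associated with the historic attributes. The function name `brown_forecast` and its parameters are hypothetical; the default smoothing parameter of 0.1 follows the value mentioned above.

```python
def brown_forecast(values, alpha=0.1, steps_ahead=1):
    """Forecast with textbook Brown's double exponential smoothing.

    values: historic attributes, oldest first (assumed equally spaced).
    alpha: smoothing parameter, strictly between 0 and 1.
    steps_ahead: how many intervals past the last observation to predict.
    """
    if not values:
        raise ValueError("at least one historic attribute is required")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")

    # Initialise both smoothed series with the first observation.
    s1 = s2 = float(values[0])
    for x in values[1:]:
        s1 = alpha * x + (1.0 - alpha) * s1   # singly smoothed series
        s2 = alpha * s1 + (1.0 - alpha) * s2  # doubly smoothed series

    level = 2.0 * s1 - s2
    trend = alpha / (1.0 - alpha) * (s1 - s2)
    return level + steps_ahead * trend
```

  • On a long, steadily growing series the forecast approaches the next value of the trend; on a constant series it returns that constant.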
  • an extended Holt's approach may be used to perform a regression analysis.
  • the predicted attribute can be determined by iterating through the following steps:
  • variable smoothing coefficients are given as :
  • the predicted attribute may be calculated as:
  • a linear regression method may be used to determine predicted attributes.
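  • A linear regression copes naturally with irregular observation times, because each historic attribute enters the fit together with its own time. The following is a minimal ordinary least squares sketch; the function name `predict_linear` and the (time, value) sample layout are assumptions made for illustration.

```python
def predict_linear(samples, future_time):
    """Fit value = intercept + slope * time by least squares and extrapolate.

    samples: iterable of (time, value) pairs; times need not be evenly spaced.
    future_time: the time at which to predict the attribute.
    """
    samples = list(samples)
    n = len(samples)
    if n < 2:
        raise ValueError("at least two historic attributes are required")

    mean_t = sum(t for t, _ in samples) / n
    mean_x = sum(x for _, x in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("historic attributes must span more than one time")

    slope = sum((t - mean_t) * (x - mean_x) for t, x in samples) / var_t
    # The fitted line passes through (mean_t, mean_x).
    return mean_x + slope * (future_time - mean_t)
```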
  • predicted attributes for more than one feature may be determined for each location 182, 184.
  • The prediction component 230 may, for each location 182, 184 in the location set 180, gather the predicted attributes for more than one feature and compute a performance metric value.
  • the prediction component 230 may apply a
  • each feature may have a weighting value associated with that feature.
  • the performance metric value may, in at least some embodiments, be calculated as the sum of the products of the predicted attribute of features and the weighting value associated with that feature.
  • The multiple features may include the number of comments associated with a document (i.e. the first feature) and the number of inlinks associated with the document (i.e. the second feature).
  • The performance metric value may be calculated based on both a predicted attribute related to the number of comments expected to be associated with the document at some future time and a predicted attribute related to the number of inlinks expected to link to the document at some future time.
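  • The weighted sum described above can be sketched as follows. The function name `performance_metric`, the feature names and the weighting values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def performance_metric(predicted, weights):
    """Sum of each predicted attribute multiplied by its feature's weight.

    predicted: mapping of feature name to predicted attribute.
    weights: mapping of feature name to weighting value.
    """
    return sum(weights[feature] * value for feature, value in predicted.items())


# Hypothetical values: 10 predicted comments weighted 0.5,
# 4 predicted inlinks weighted 2.0.
example_metric = performance_metric(
    {"comments": 10.0, "inlinks": 4.0},
    {"comments": 0.5, "inlinks": 2.0},
)
```

  • With these values the metric is 10 × 0.5 + 4 × 2.0 = 13.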
  • the content monitoring module 280 may also include a scheduling component 232.
  • the scheduling component 232 may determine a monitoring schedule 202 based on the predicted attributes and/or the performance metric values determined by the prediction component 230.
  • the scheduling component 232 may schedule the monitoring of the locations 182, 184 in the location set 180 based on the predicted attributes and/or the performance metric values; locations which have higher predicted attributes and/or higher performance metric values may be placed higher on the monitoring schedule 202 (and thus monitored sooner) than locations with relatively lower predicted attributes and/or lower performance metric values.
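  • Treating the monitoring schedule as an ordered list, the ordering rule above reduces to sorting locations by descending performance metric value. The function name `build_schedule` and the sample locations below are hypothetical.

```python
def build_schedule(metric_by_location):
    """Return locations ordered from highest to lowest performance metric."""
    return sorted(metric_by_location, key=metric_by_location.get, reverse=True)


# Hypothetical locations and metric values.
schedule = build_schedule({
    "http://example.com/a": 1.0,
    "http://example.com/b": 3.0,
    "http://example.com/c": 2.0,
})
```

  • Here the location with metric 3.0 is monitored first and the location with metric 1.0 last.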
  • The scheduling component 232 may be configured to increase the rank of a location in the monitoring schedule 202 if that location becomes stale. For example, the rank of a location may be increased based on the period of time which has elapsed since the location was last monitored.
  • the rank of a location in the monitoring schedule 202 may be increased by increasing the performance metric value associated with that location .
  • the predicted performance metric could be incremented by a predetermined amount for every thousand fetching operations. It will be appreciated however, that a thousand fetching operations is intended to be illustrative and that other thresholds may be used.
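  • The staleness adjustment can be sketched as a fixed increment applied once per block of fetching operations that have taken place since the location was last visited. The function name `boost_stale`, the default increment of 1.0 and the counting of fetches since the last visit are assumptions; the threshold of a thousand fetches follows the illustrative figure above.

```python
def boost_stale(metric, fetches_since_visit, boost=1.0, fetches_per_boost=1000):
    """Raise a location's performance metric the longer it goes unvisited.

    metric: the location's current performance metric value.
    fetches_since_visit: fetching operations performed since its last visit.
    boost: amount added per completed block of fetches (assumed value).
    fetches_per_boost: size of one block of fetches (illustrative threshold).
    """
    completed_blocks = fetches_since_visit // fetches_per_boost
    return metric + boost * completed_blocks
```

  • A location with metric 2.0 that has waited through 2500 fetches would be scheduled as if its metric were 4.0, so that rarely changing locations are still revisited eventually.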
  • any one or more of the components 230, 232, 234 or modules 280 may be logically or physically organized in a manner that is different from the manner illustrated in FIG. 2.
  • the location set 180 and the monitoring schedule 202 may be a single element.
  • a single list of locations may serve as both a location set 180 and a monitoring schedule 202.
  • the order of the listing of locations in the location set 180 may define the order of monitoring.
  • FIG. 3 illustrates a block diagram of a further example of content monitoring systems 160.
  • a first content monitoring system 360 and a second content monitoring system 362 are connected to a common storage 190.
  • the first content monitoring system 360 and the second content monitoring system 362 may retrieve and update data which is common to both content monitoring systems 360 and 362.
  • the first content monitoring system 360 and the second content monitoring system 362 may share fetched content 206, feature attributes 204, a monitoring schedule 202 and/or a location set 180. Due to the sharing of data, the capacity of the system to monitor documents may be increased simply by adding additional content monitoring systems 160.
  • FIG. 3 illustrates an example where two content monitoring systems 160 are used in order to provide additional capacity
  • additional content monitoring systems 160 could be used in order to provide greater capacity.
  • FIG. 4 illustrates, in flowchart form, a process 400 for monitoring content stored at a plurality of locations 182, 184 (FIG. 1) in a location set 180 (FIG. 1).
  • the process 400 includes steps or operations which may be performed by the content monitoring system 160 of FIGs. 1 to 3.
  • the content monitoring module 280 may be configured to perform the steps or operations of the process 400 of FIG. 4.
  • the steps or operations of the process 400 of FIG. 4 may be performed by one or more of the prediction component 230, the scheduling component 232 and/or the monitoring component 234 of FIG. 2.
  • the content monitoring module 280, the prediction component 230, the scheduling component 232 and/or the monitoring component 234 may contain instructions for causing the processor 240 to execute the process 400 of FIG. 4.
  • the monitoring component 234 of the content monitoring module 280 may retrieve a monitoring schedule 202 (FIG. 2) from storage 190 and may access a location 182, 184 in a location set 180 according to the monitoring schedule 202.
  • the monitoring schedule 202 specifies the order in which the content of electronic documents 120a, 120b at locations 182, 184 of the location set 180 are monitored.
  • the monitoring schedule 202 may, in at least some embodiments, act as a queue which lists locations 182, 184 in the order in which they are to be monitored.
  • the monitoring component 234 will monitor the documents at the locations 182, 184 in the monitoring schedule 202 in the order in which they are listed in the monitoring schedule 202.
  • the location accessed at step 410 may be the location at the top of the queue.
  • the monitoring schedule 202 may, at least initially, be randomly or arbitrarily determined. For example, all of the locations 182, 184 in the location set 180 may be added to the monitoring schedule 202 in a random or arbitrary manner. Other methods of initializing the monitoring schedule 202 are also possible. As will be explained in greater detail below, the monitoring schedule 202 will be updated in a manner which permits locations to be monitored in a dynamic manner. That is, the monitoring schedule 202 is not simply a fixed schedule in which locations are always monitored in the same predetermined order. The order of monitoring will vary as described below.
  • Step 410 includes a step of retrieving the electronic document 120a, 120b at the location 182, 184 specified by the monitoring schedule 202.
  • Step 410 may also include a step of saving the electronic documents 120a, 120b to the storage 190. That is, the monitoring component 234 may, in various embodiments, be configured to fetch the electronic documents 120a, 120b from their respective locations 182, 184 and to save the electronic documents 120a, 120b to the storage 190. For example, the electronic documents 120a, 120b may be saved in a fetched content 206 portion of the storage 190.
  • monitoring electronic documents 120a, 120b at step 410 may include monitoring documents referred to and/or linked to in the electronic documents 120a, 120b located at the locations 182, 184 in the location set 180.
  • the document 120a, 120b located at a location 182, 184 may, in some embodiments, be a cascaded data object such as an RSS feed .
  • the monitoring component 234 may be configured to visit locations referred to or linked to in the document that is an RSS feed, when monitoring that document. That is, the monitoring component may visit locations referred to or linked to in an RSS document in order to retrieve and/or fetch content from other documents located at the referred-to or linked-to locations.
  • the monitoring component 234 may be configured to perform a duplication checking analysis on fetched content 206 before saving the content to the storage 190.
  • the monitoring component 234 may compare the fetched content with fetched content already saved to the storage 190. If the monitoring component 234 determines that the content has not already been saved to the storage, it may save the content to the storage 190. Alternatively, if the monitoring component 234 determines that the content has already been saved to the storage, it may not re-save the content to the storage 190.
  • the monitoring component 234 may analyze the retrieved electronic documents 120a, 120b located at the location 182, 184 specified by the monitoring schedule 202 to determine one or more feature attributes 204 associated with features of the electronic documents 120a, 120b.
  • Each feature attribute 204 may be related to a feature of the electronic documents 120a, 120b at a specific point in time.
  • the feature attribute 204 may be related to a feature of the electronic document 120a, 120b at the point in time in which the electronic documents 120a, 120b are fetched from their respective locations 182, 184.
  • the time which is related to each feature attribute 204 is, generally, a time which has already passed.
  • the feature attributes 204 may, in at least some embodiments, be referred to as historic attributes.
  • Each feature attribute may be a value, quantifier, or other attribute associated with a feature of an associated electronic document 120a, 120b at an associated point in time. That is, the feature attributes 204 serve to quantify features. Each feature attribute 204 is associated with both a feature and a location.
  • the features of the electronic documents 120a, 120b represent information about the electronic document 120a, 120b which may be used to determine how frequently the location 182, 184 associated with the document 120a, 120b will be monitored. That is, features are characteristics associated with the electronic document 120a, 120b which may be used in order to determine how often the location 182, 184 of the document 120a, 120b should be revisited for monitoring and/or how the monitoring of the document 120a, 120b should be prioritized relative to the monitoring of other documents 120a, 120b.
  • the features may include one or more of: an indicator of whether the document was updated or not updated since a last visit, an indicator of the age of the document (for example, the elapsed time since the last change to the document), a quantifier of the number of comments associated with the electronic document 120a, 120b (for example, if the electronic document 120a, 120b is a web page which permits commenting, the comments may be a feature), and/or a quantifier of the number of inlinks associated with the electronic document 120a, 120b.
  • Inlinks are links, such as hyper-text links, which direct to the electronic document 120a, 120b.
  • the number of inlinks is not determined from the document itself, but rather, from examining other documents to determine whether they link to the document.
  • the features may also include a feature which is a link analysis based ranking associated with the electronic document.
  • a PageRank™ associated with an electronic document 120 may be a feature of that electronic document 120a, 120b.
  • the specific value or other quantifier of the feature for each document 120a, 120b at an associated time is the attribute for that feature.
  • a specific PageRank™ value associated with a specific electronic document 120a, 120b at a specific point in time may be a feature attribute of a PageRank™ feature for that electronic document 120a, 120b.
  • the feature attribute 204 which is determined by the monitoring component 234 may be saved to storage 190 associated with the content monitoring system 160.
  • the feature attributes 204 may be saved in a features database in the storage 190.
  • the feature attributes 204 may be saved along with a time related to the feature attributes 204. That is, the feature attributes 204 may be saved in a time-series fashion.
  • the time may, in at least some embodiments, be the time at which the feature attributes 204 were observed or determined.
  • the time may be saved using the POSIX time convention. However, other time formats may also be used.
  • the monitoring component 234 may be configured to only record a finite number of values associated with each feature for each location 182, 184 in the location set 180. This finite number may be defined by a feature attribute threshold. Once the feature attribute threshold is met, older feature attributes 204 may be removed from storage 190 in order to make room for the newer feature attributes. For example, in some embodiments, the monitoring component 234 may record only the last k feature attributes 204 associated with each feature for each location.
  • Next, at step 440, the prediction component 230 may determine a first predicted attribute for the first feature associated with the location based on the historic feature attributes 204 for that first feature and that location.
  • the prediction component 230 may determine a future attribute for a first feature associated with the location accessed in step 410 based on historic feature attributes 204 for that first feature and that location.
  • the prediction component 230 may attempt to determine future attributes of features based on previously observed attributes of that same feature.
  • the prediction component 230 may, in at least some embodiments, perform a regression analysis on historic attributes associated with a feature and the location accessed in step 410 in order to determine predicted attributes for that same feature and location.
  • Brown's double exponential smoothing method may be performed at step 440.
  • a predicted attribute for a feature and a location may be determined according to the following formula:
  • X tract is the predicted attribute
  • a is a smoothing parameter
  • n is the number of historic attributes for the feature and location which are used to determine the predicted attribute
  • t is the time associated with a historic attribute (i.e. $t_n$ is the time for the $n$-th historic attribute for that feature and that location)
  • X tract-i is a last predicted attribute
  • X n is a feature attribute
  • the smoothing parameter is a value which is, in at least some embodiments, between the range of zero (0) to one (1). In at least some embodiments, the smoothing parameter is approximately 0.1.
  • an extended Holt's approach may be used to perform a regression analysis.
  • the predicted attribute can be determined by iterating through the following steps:
  • a linear regression method may be used to determine predicted attributes.
  • the scheduling component 232 may update the monitoring schedule 202 based on the predicted attribute determined at step 440.
  • the scheduling component 232 may, at step 450, schedule the monitoring of the locations 182, 184 in the location set 180 based on the predicted attribute determined at step 440.
  • locations which have higher predicted attributes may be placed higher on the monitoring schedule 202 (and thus monitored sooner) than locations with relatively lower predicted attributes.
  • the process 400 may then repeat itself so that the scheduling and monitoring of locations proceeds indefinitely, or until some predetermined stop condition is satisfied.
  • FIG. 5 illustrates, in flowchart form, a further process 500 for monitoring content stored at a plurality of locations 182, 184 (FIG. 1) in a location set 180 (FIG. 1).
  • the process 500 includes steps or operations which may be performed by the content monitoring system 160 of FIGs. 1 to 3.
  • the content monitoring module 280 may be configured to perform the steps or operations of the process 500 of FIG. 5.
  • the steps or operations of the process 500 of FIG. 5 may be performed by one or more of the prediction component 230, the scheduling component 232 and/or the monitoring component 234 of FIG. 2. That is, the content monitoring module 280, the prediction component 230, the scheduling component 232 and/or the monitoring component 234 may contain instructions for causing the processor 240 to execute the process 500 of FIG. 5.
  • the process 500 of FIG. 5 is similar to the process 400 of FIG. 4, except in that, in the process 500 of FIG. 5, the scheduling is made based on historic feature attributes 204 for more than one feature.
  • Step 520 of FIG. 5 is similar to step 420 of FIG. 4, except in that, at step 520 of FIG. 5, feature attributes 204 for a plurality of features are determined. For example, in some embodiments, a feature attribute for a first feature and a feature attribute for a second feature may be determined.
  • the features may include one or more of: an indicator of whether the document was updated or not updated since a last visit, an indicator of the age of the document (for example, the elapsed time since the last change to the document), a quantifier of the number of comments associated with the electronic document 120a, 120b (for example, if the electronic document 120a, 120b is a web page which permits commenting, the comments may be a feature), and/or a quantifier of the number of inlinks associated with the electronic document 120a, 120b.
  • the features may also include a feature which is a link analysis based ranking associated with the electronic document 120a, 120b. For example, a PageRank™ associated with an electronic document 120 may be a feature of that electronic document 120a, 120b.
  • step 530 of FIG. 5 is similar to step 430 of FIG. 4 except in that, at step 530 of FIG. 5, feature attributes for multiple features associated with a location are stored.
  • step 540 of FIG. 5 is similar to step 440 of FIG. 4 except in that, at step 540 predicted attributes for multiple features are
  • the prediction component 230 may, for the location accessed at step 410, gather predicted attributes for more than one feature and compute a performance metric value based on those predicted attributes. For example, in at least some embodiments, the prediction component 230 may apply a predetermined function to the predicted attributes for multiple features in order to compute a performance metric value.
  • each feature may have a weighting value associated with that feature.
  • the performance metric value may, in at least some embodiments, be calculated as the sum of the products of the predicted attribute of features and the weighting value associated with that feature.
  • the scheduling component 232 may update the monitoring schedule 202 based on the performance metric values determined at step 550.
  • the scheduling component 232 may, at step 560, schedule the monitoring of the locations 182, 184 in the location set 180 based on the performance metric values determined at step 550.
  • locations which have higher performance metric values may be placed higher on the monitoring schedule 202 (and thus monitored sooner) than locations with relatively lower performance metric values.
  • the monitoring schedule 202 is determined in accordance with a plurality of predicted attributes.
  • the monitoring schedule is determined in accordance with a first predicted attribute associated with a first feature and a second predicted attribute associated with a second feature.
  • FIG. 6 illustrates, in flowchart form, a further process 600 for monitoring content stored at a plurality of locations 182, 184 (FIG. 1) in a location set 180 (FIG. 1).
  • the process 600 includes steps or operations which may be performed by the content monitoring system 160 of FIGs. 1 to 3.
  • the content monitoring module 280 may be configured to perform the steps or operations of the process 600 of FIG. 6.
  • the steps or operations of the process 600 of FIG. 6 may be performed by one or more of the prediction component 230, the scheduling component 232 and/or the monitoring component 234 of FIG. 2. That is, the content monitoring module 280, the prediction component 230, the scheduling component 232 and/or the monitoring component 234 may contain instructions for causing the processor 240 to execute the process 600 of FIG. 6.
  • the process 600 of FIG. 6 is similar to the process 500 of FIG. 5 except in that it includes a further step 660 of increasing the ranking of stale locations in the monitoring schedule 202.
  • the scheduling component 232 may increase the rank of a location in the monitoring schedule 202 if that location becomes stale. For example, the rank of a location may be increased based on the period of time which has elapsed since the location was last
  • the rank of a location in the monitoring schedule 202 may be increased by increasing the performance metric value associated with that location.
  • the predicted performance metric could be incremented by a predetermined amount for every thousand fetching operations. It will be appreciated however, that a thousand fetching operations is intended to be illustrative and that other thresholds may be used.
  • It will be appreciated that variations of the methods and systems described above are also possible. For example, various embodiments may omit or modify some of the steps of FIGs. 4 to 6.
  • an article of manufacture for use with the apparatus such as a prerecorded storage device or other similar computer readable medium including program instructions recorded thereon, or a computer data signal carrying computer readable program instructions may direct an apparatus to facilitate the practice of the described methods. It is understood that such apparatus, and articles of manufacture also come within the scope of the present disclosure.
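By way of non-limiting illustration, the prediction, performance metric and staleness adjustment described above may be sketched in Python as follows. The sketch uses the textbook form of Brown's double exponential smoothing, which is an assumption because the formula itself is not reproduced in this text. The function names, the feature names, the weights, and the reading of the staleness rule as one increment per thousand fetching operations elapsed since the location was last visited are likewise hypothetical.

```python
from typing import Dict, Sequence


def brown_forecast(history: Sequence[float], alpha: float = 0.1, horizon: int = 1) -> float:
    """Predict a feature attribute `horizon` steps ahead from historic attributes.

    Textbook Brown's double exponential smoothing (assumed form, not taken
    from the patent text). `alpha` is the smoothing parameter.
    """
    if not history:
        raise ValueError("need at least one historic attribute")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    s1 = s2 = float(history[0])  # start both smoothed series at the first value
    for x in history[1:]:
        s1 = alpha * x + (1.0 - alpha) * s1   # first smoothing pass over the data
        s2 = alpha * s1 + (1.0 - alpha) * s2  # second pass, applied to s1
    level = 2.0 * s1 - s2
    trend = alpha / (1.0 - alpha) * (s1 - s2)
    return level + horizon * trend


def performance_metric(predicted: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of (predicted attribute x weighting value) over all features."""
    return sum(weights[name] * value for name, value in predicted.items())


def boost_for_staleness(metric: float, fetches_since_visit: int,
                        increment: float = 1.0, every: int = 1000) -> float:
    """Raise the metric by `increment` for each `every` fetches since the last visit.

    The threshold of 1000 mirrors the illustrative figure in the text;
    the exact rule is an assumption of this sketch.
    """
    return metric + increment * (fetches_since_visit // every)


if __name__ == "__main__":
    # Hypothetical historic attributes for one location, oldest first.
    history = {"comments": [3.0, 5.0, 8.0, 12.0], "inlinks": [1.0, 1.0, 2.0, 2.0]}
    weights = {"comments": 0.5, "inlinks": 2.0}
    predicted = {name: brown_forecast(values) for name, values in history.items()}
    metric = boost_for_staleness(performance_metric(predicted, weights), 2500)
    print(predicted, metric)
```

A location's resulting metric could then be used as its sort key when the monitoring schedule 202 is rebuilt, with higher values placed earlier.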

Abstract

A system and method of monitoring content stored at a plurality of locations in a location set are provided. The method comprises: determining two or more historic attributes for a first feature associated with each location; for each location in the location set, determining a first predicted attribute for the first feature associated with that location based on the historic attributes for that first feature and that location; determining a monitoring schedule in accordance with the first predicted attribute; and monitoring the content at the locations in the location set according to the monitoring schedule.

Description

SYSTEM AND METHOD FOR MONITORING WEB CONTENT
TECHNICAL FIELD
[0001] The present disclosure relates generally to the monitoring of dynamic content. More specifically, it relates to a method and system for monitoring content, such as web-pages, which are stored at a plurality of locations in a location set.
BACKGROUND
[0002] Monitoring web-page content and fetching web-page content may be useful in systems which index or classify such content. For example, search engines, news aggregation services, and other indexing and classification systems may re-visit web-pages from time to time in order to determine whether content associated with those web-pages has changed. Where content has changed, such systems may update indexing and classification data.
[0003] Monitoring and fetching systems often visit web-pages in a predetermined fixed order. This approach to monitoring and fetching may be less effective when monitoring highly dynamic web-pages and web-content. For example, visiting web-pages in a predetermined fixed order may be inefficient for monitoring web-pages which are micro-blogs, such as Twitter™.
[0004] Thus there exists a need for improved systems and methods for monitoring content stored at a plurality of locations.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] Reference will now be made, by way of example, to the accompanying drawings which show an embodiment of the present application, and in which:
[0006] FIG. 1 shows a system diagram illustrating a possible environment in which embodiments of the present application may operate;
[0007] FIG. 2 shows a block diagram of a content monitoring system in accordance with an embodiment of the present disclosure;
[0008] FIG. 3 shows a block diagram of a content monitoring system in accordance with a further embodiment of the present disclosure;
[0009] FIG. 4 shows a flowchart of a process for monitoring content in accordance with an embodiment of the present disclosure;
[0010] FIG. 5 shows a flowchart of a process for monitoring content in accordance with a further embodiment of the present disclosure; and
[0011] FIG. 6 shows a flowchart of a process for monitoring content in accordance with another embodiment of the present disclosure.
[0012] Similar reference numerals are used in different figures to denote similar components.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0013] In one aspect the present disclosure provides a method of monitoring content stored at a plurality of locations in a location set. The method comprises: determining two or more historic attributes for a first feature associated with each location; for each location in the location set, determining a first predicted attribute for the first feature associated with that location based on the historic attributes for that first feature and that location; determining a monitoring schedule in accordance with the first predicted attribute; and monitoring the content at the locations in the location set according to the monitoring schedule.
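By way of non-limiting illustration, the method of this aspect may be sketched in Python as a loop over a priority queue. The predictor below is a simple linear extrapolation used only as a placeholder, and the names `predict_next`, `run_monitoring` and `observe` are hypothetical. Unvisited locations are given top priority so that each location acquires at least one historic attribute, which is a design assumption of the sketch rather than a requirement of the disclosure.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def predict_next(history: List[float]) -> float:
    """Placeholder predictor: extrapolate linearly from the last two values."""
    if not history:
        return 0.0
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])


def run_monitoring(locations: List[str],
                   observe: Callable[[str], float],
                   rounds: int) -> List[str]:
    """Visit `rounds` locations, always taking the highest predicted attribute.

    `observe(location)` stands in for fetching the document and measuring
    one feature attribute (for example, a comment count).
    """
    history: Dict[str, List[float]] = {loc: [] for loc in locations}
    # heapq is a min-heap, so priorities are stored negated; -inf puts
    # never-visited locations ahead of every predicted value.
    schedule: List[Tuple[float, str]] = [(float("-inf"), loc) for loc in locations]
    heapq.heapify(schedule)
    visited: List[str] = []
    for _ in range(rounds):
        _, loc = heapq.heappop(schedule)       # location at the top of the schedule
        history[loc].append(observe(loc))      # record a new historic attribute
        visited.append(loc)
        # Re-insert the location ranked by its newly predicted attribute.
        heapq.heappush(schedule, (-predict_next(history[loc]), loc))
    return visited
```

In this sketch a location whose predicted attribute stays low is revisited rarely, since ordering depends only on the prediction.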
[0014] In another aspect, the present application provides a content monitoring system for monitoring content stored at a plurality of locations in a location set. The system comprises a prediction component. The prediction component is configured to determine two or more historic attributes for a first feature associated with each location. The prediction component is further configured to, for each location in the location set, determine a first predicted attribute for the first feature associated with that location based on the historic attributes for that first feature and that location. The system further comprises a scheduling component configured to determine a monitoring schedule in accordance with the first predicted attribute. The system further comprises a monitoring component configured to monitor the content at the locations in the location set according to the monitoring schedule.
[0015] Other aspects and features of the present application will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the application in conjunction with the accompanying figures.
[0016] Reference is first made to FIG. 1, which illustrates a system diagram of a possible operating environment in which embodiments of the present disclosure may operate.
[0017] In the embodiment of FIG. 1, a content monitoring system 160 is illustrated. The content monitoring system 160 is configured to monitor content of electronic documents 120a, 120b located at a plurality of locations 182, 184, which may be identified in a location set 180. That is, the content monitoring system 160 is configured to monitor electronic documents 120a, 120b located at a set of locations 182, 184 defined by a location set 180. The location set 180 is stored in a storage 190 which is accessible by the content monitoring system 160. The storage 190 may, in some embodiments, be internal storage of the content monitoring system 160. In other embodiments, the storage 190 may be external storage of the content monitoring system 160, including, for example, network storage accessible through a network 104.
[0018] The electronic documents 120a, 120b may vary over time. That is, the content of an electronic document 120a, 120b located at any given location 182, 184 may vary over time.
[0019] The electronic documents 120a, 120b may, in various embodiments, be one or more of: Really Simple Syndication ("RSS") feeds or other cascaded feeds, blogs, micro-blogs such as Twitter™, on-line news sources, user-generated comments from web-pages, etc. Other types of electronic documents 120a, 120b are also possible. By way of example and not limitation, the electronic documents 120a, 120b may be formatted in a Hyper-Text Markup Language ("HTML") format, a plain-text format, or a portable document format ("PDF"). In some instances, the electronic documents 120a, 120b may be an image, such as a JPEG or Bitmap image. Other document formats are also possible.
[0020] The electronic documents 120a, 120b may be located at associated locations 182, 184 on a plurality of document servers 114a, 114b, which may be accessible through a network 104, such as the Internet. In some embodiments, the document servers 114 may be publicly and/or privately accessible web-pages which may be identified by a unique Uniform Resource Locator ("URL"). In such embodiments, the locations 182, 184 may be URLs.
[0021] The network 104 may be a public or private network, or a combination thereof. The network 104 may be comprised of a Wireless Wide Area Network (WWAN), a Wireless Local Area Network (WLAN), the Internet, a Local Area Network (LAN), or any combination of these network types. Other types of networks are also possible and are contemplated by the present disclosure.
[0022] The location set 180 which defines the locations 182, 184 of the electronic documents 120a, 120b which are to be monitored may be stored on the storage 190.
[0023] The storage 190 may include non-volatile memory such as, for example, a Hard Disk Drive (HDD), Flash Memory, or other types of memory. In some embodiments, the storage 190 may include a combination of different types of memory.
[0024] The content monitoring system 160 may include functionality in addition to the ability to monitor the content of electronic documents 120a, 120b located at locations 182, 184. For example, as illustrated in FIG. 1, in some embodiments, the content monitoring system 160 may be a document aggregation system 150. The document aggregation system 150 may be configured to search document servers 114a, 114b to locate and/or group electronic documents 120a, 120b which are related to a common subject matter.
[0025] The electronic documents 120a, 120b may, in some embodiments, be news-related documents which contain information about recent, interesting, topical and/or important events. In such cases, the document aggregation system 150 may also be referred to as a news aggregation system. The news aggregation system may be configured to locate and group electronic documents 120a, 120b which are related to a common event or story.
[0026] The locations 182, 184 in the location set 180 may be predefined fixed locations. The locations 182, 184 may, in some embodiments, be specified, in whole or in part by a user of the content monitoring system 160, such as, for example, a system administrator.
[0027] In other embodiments, the location set may be dynamic. In such embodiments, the content monitoring system 160 (which may be a document aggregation system 150) may include a document search subsystem (not shown). The document search subsystem (not shown) may be used by the document aggregation system 150 to locate documents accessible through the network 104, which may be located at locations which are not identified in the location set 180. The document search subsystem may be configured to search document servers 114a, 114b based on a search algorithm in order to identify electronic documents 120a, 120b matching a search criteria. By way of example, in some embodiments, the search algorithm may provide for searching of websites (or other document servers 114a, 114b) of a specific category using a search keyword or phrase. For example, the document search subsystem may be configured to search blogs, micro blogs, and/or online traditional news sources, etc.
[0028] The document search subsystem may, in some embodiments, rely on a third party search engine which may not be physically located within the document aggregation system 150. For example, a publicly accessible search engine, such as Google™ may be used.
[0029] If the document search subsystem identifies electronic documents 120a, 120b matching a search criteria, it may update the location set 180 to include the locations of those identified documents. For example, in some circumstances, the document search subsystem may search for electronic documents 120a, 120b which relate to a specific news item, such as a specific event. If any such documents are located, the location set 180 may be updated to include the location 182, 184 of those electronic documents 120a, 120b in order to cause the content monitoring system 160 to monitor the content of the documents 120a, 120b at those locations 182, 184.
[0030] In at least some embodiments, the document aggregation system 150 also includes a document classification subsystem (not shown) which associates electronic documents 120a, 120b and/or the content therein with one or more labels. For example, the document classification subsystem may associate one or more documents 120a, 120b with a phrase contained in the one or more documents 120a, 120b. The label which is associated with the electronic document 120a, 120b may be used to identify the subject matter of the electronic document 120a, 120b.
[0031] The document aggregation system 150 may include other subsystems not specifically described above. By way of example, the document aggregation system 150 may, in some embodiments, include a ranking subsystem which ranks documents 120a, 120b or the subject of documents 120a, 120b based on frequency of use or frequency of occurrence. For example, the subjects of a plurality of documents 120a, 120b may be ranked by determining the frequency of occurrence of each label (such as a phrase) associated with documents 120a, 120b. The rank may indicate, in at least some embodiments, how topical the subject matter associated with that label is.
[0032] In at least some embodiments, the document aggregation system 150 may include a web-interface subsystem (not shown) for automatically generating web pages which provide links for accessing the documents 120a, 120b on the document servers 114a, 114b and other information about the documents 120a, 120b. The other information may include a machine-generated summary of the contents of the document, and the rank of the subject matter of the document as determined by the ranking subsystem (not shown) . The web pages which are generated by the web-interface subsystem may group documents 120a, 120b by subject matter and/or by phrases which are used in the electronic documents 120a, 120b. [0033] By way of further example, other subsystems of the document aggregation system 150 may also include a power subsystem for providing electrical power to electrical components of the document aggregation system 150 and a communication subsystem for communicating with the document servers 114a, 114b through the network 104.
[0034] It will be appreciated that the content monitoring system 160 (and/or the document aggregation system 150) may include more or less systems, modules, subsystems and/or functions than are discussed herein . It will also be appreciated that the functions provided by any set of systems or subsystems described above may be provided by a single system and that these functions are not, necessarily, logically or physically separated into different subsystems.
[0035] Furthermore, while FIG. 1 illustrates one possible embodiment in which the content monitoring system 160 may operate (i.e. where the content monitoring system 160 is a document aggregation system 150), it will be appreciated that the content monitoring system 160 may be employed in any system in which it may be useful to monitor the content of electronic documents 120a, 120b located at locations 182, 184 of a location set 180.
[0036] Accordingly, the term content monitoring system 160, as used herein, is intended to include stand alone content monitoring systems which are not, necessarily, part of a larger system, and also content monitoring sub-systems which are part of a larger system (which may be the same or different than the document aggregation system 150 of FIG. 1) . The term content monitoring system 160 is, therefore, intended to include any systems in which the content monitoring methods described herein are included.
[0037] In at least some embodiments, the content monitoring system 160, and/or the document aggregation system 150 may be implemented, in whole or in part, by way of a processor 240 which is configured to execute software modules 260 stored in memory 250. A block diagram of one such example content monitoring system 160, is illustrated in FIG. 2.
[0038] In the embodiment of FIG. 2, the content monitoring system 160 includes a controller comprising one or more processors 240 which control the overall operation of the content monitoring system 160. The content monitoring system 160 also includes memory 250 which is connected to the processor 240 for receiving and sending data to the processor 240. While the memory 250 is illustrated as a single component, it will typically be comprised of multiple memory components of various types. For example, the memory 250 may include Random Access Memory (RAM), Read Only Memory (ROM), a Hard Disk Drive (HDD), Flash Memory, or other types of memory. It will be appreciated that each of the various memory types will be best suited for different purposes and applications.
[0039] The processor 240 may operate under stored program control and may execute software modules 260 stored on the memory 250. The software modules 260 may be comprised of, for example, a content monitoring module 280 which is configured to monitor the content of one or more electronic documents 120a, 120b (FIG. 1) located at locations 182, 184 identified in the location set 180.
[0040] The content monitoring module 280 may include a monitoring component 234 which is configured to monitor electronic documents 120a, 120b (FIG. 1) according to a monitoring schedule 202. The monitoring schedule 202 specifies the order in which the content of electronic documents 120a, 120b at locations 182, 184 of the location set 180 is monitored.
[0041] The monitoring schedule 202 may be determined by a scheduling component 232 of the content monitoring module 280. The monitoring schedule 202 may be stored in the storage 190 by the scheduling component 232 and retrieved by the monitoring component 234. Methods of determining the monitoring schedule 202 will be discussed in greater detail below.
[0042] The monitoring schedule 202 may, in at least some embodiments, act as a queue which lists locations 182, 184 in the order in which they are to be monitored. For example, in at least some embodiments, the monitoring component 234 is configured to monitor the documents 120a, 120b at the locations 182, 184 in the monitoring schedule 202 in the order in which they are listed in the monitoring schedule.
[0043] Monitoring electronic documents 120a, 120b may, in various embodiments, include retrieving the electronic documents 120a, 120b from their respective locations 182, 184 and may also include saving the electronic documents 120a, 120b to the storage 190. That is, the monitoring component 234 may, in various embodiments, be configured to fetch the electronic documents 120a, 120b from their respective locations 182, 184 and to save the electronic documents 120a, 120b to the storage 190. For example, the electronic documents 120a, 120b may be saved in a fetched content 206 portion of the storage 190.
[0044] In at least some embodiments, monitoring electronic documents 120a, 120b may include monitoring documents referred to and/or linked to in the electronic documents 120a, 120b located at the locations 182, 184 in the location set 180. For example, the document 120a, 120b located at a location 182, 184 in the location set 180 may, in some embodiments, be a cascaded data object such as an RSS feed. In such cases, the monitoring component 234 may be configured to visit locations referred to or linked in the document that is the RSS feed, when monitoring that document. That is, the monitoring component may visit locations referred to or linked to in an RSS document in order to retrieve and/or fetch content from other documents located at the referred-to or linked-to locations.
[0045] In at least some embodiments, the monitoring component 234 may be configured to perform a duplication checking analysis on fetched content 206 before saving the content to the storage 190. The monitoring component 234 may compare the fetched content with fetched content already saved to the storage 190. If the monitoring component 234 determines that the content has not already been saved to the storage 190, it may save the content to the storage 190. Alternatively, if the monitoring component 234 determines that the content has already been saved to the storage 190, it may not re-save the content to the storage 190.
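One possible way to perform such a duplication check is sketched below. It assumes that two pieces of content are duplicates only when they are byte-for-byte identical and compares SHA-256 digests rather than full content; both the class name `FetchedContentStore` and the digest-based comparison are illustrative assumptions, since the comparison method is not specified above.

```python
import hashlib

class FetchedContentStore:
    """In-memory stand-in for the fetched content portion of the storage."""

    def __init__(self):
        self._by_digest = {}  # SHA-256 hex digest -> content bytes

    def save_if_new(self, content: bytes) -> bool:
        """Save content unless identical content is already stored.

        Returns True if the content was saved, False if it was a duplicate.
        """
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._by_digest:
            return False  # already saved, so do not re-save
        self._by_digest[digest] = content
        return True

    def __len__(self):
        return len(self._by_digest)

store = FetchedContentStore()
first = store.save_if_new(b"<html>story</html>")
second = store.save_if_new(b"<html>story</html>")
```

A near-duplicate check (for example, ignoring timestamps embedded in a page) would need a different comparison and is outside this sketch.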
[0046] The monitoring component 234 may be further configured to analyze electronic documents 120a, 120b located at the locations 182, 184 of the location set 180 to determine one or more attributes associated with features of the electronic documents 120a, 120b. Each attribute may be related to a feature of the electronic documents 120a, 120b at a specific point in time. In at least some embodiments, the attribute may be related to a feature of the electronic document 120a, 120b at the point in time in which the electronic documents 120a, 120b are fetched from their respective locations 182, 184. The time which is related to each attribute is, generally, a time which has already passed . Thus, the attributes may, in at least some embodiments, be referred to as historic attributes. Since the attributes are each related to one or more features of the electronic document 120a, 120b, the attributes may also be referred to as feature attributes 204.
[0047] The feature attributes 204 may be a value, quantifier, or other attribute associated with a feature of an associated electronic document 120a, 120b at an associated point in time. That is, the feature attributes 204 serve to quantify features.
[0048] The features of the electronic documents 120a, 120b represent information about the electronic document 120a, 120b which may be used to determine how frequent the location 182, 184 associated with the document 120a, 120b will be monitored . That is, features are characteristics associated with the electronic document 120a, 120b which may be used in order to determine how often the location 182, 184 of the document 120a, 120b should be revisited for monitoring and/or how the monitoring of the document 120a, 120b should be prioritized relative to the monitoring of other documents 120a, 120b.
[0049] The features may include one or more of: an indicator of whether the document at a location was updated or not updated since a last visit to that same location, an indicator of the age of the document (for example, the elapsed time since the last change to the document), a quantifier of the number of comments associated with the electronic document 120a, 120b (for example, if the electronic document 120a, 120b is a web page which permits commenting, the comments may be a feature and the number of comments may be a feature attribute), and/or a quantifier of the number of inlinks associated with the electronic document 120a, 120b.
[0050] Inlinks are links, such as hyper-text links, which point to the electronic document 120a, 120b. The number of inlinks is not determined from the document 120a, 120b itself, but rather, from examining other documents to determine whether they link to the document 120a, 120b.
[0051] The features may also include a feature which is a link analysis based ranking associated with the electronic document 120a, 120b. For example, a PageRank™ associated with an electronic document 120a, 120b may be a feature of that electronic document 120a, 120b. The specific value or other quantifier of the feature for each document 120a, 120b at an associated time is the feature attribute 204 for that feature. For example, a specific PageRank™ value associated with a specific electronic document 120a, 120b at a specific point in time may be an attribute of a PageRank™ feature for that electronic document 120a, 120b.
[0052] Other features apart from those specifically discussed above are also possible.
[0053] The feature attributes 204 which are determined by the monitoring component 234 may be saved to storage 190 associated with the content monitoring system 160. In at least some embodiments, the feature attributes 204 may be saved in a features database in the storage 190. Each feature attribute 204 may be saved along with a time related to that feature attribute 204. That is, the feature attributes 204 may be saved in a time-series fashion. The time may, in at least some embodiments, be the time at which the feature attributes 204 were observed or determined. In at least some embodiments, the time may be saved using the POSIX time convention. However, other time formats may also be used.
[0054] In at least some embodiments, the monitoring component 234 may be configured to only record a finite number of values associated with each feature for each location 182, 184 in the location set 180. This finite number may be defined by a feature attribute threshold. Once the feature attribute threshold is met, older feature attributes 204 may be removed from storage 190 in order to make room for newer feature attributes 204. For example, in some embodiments, the monitoring component 234 may be configured to record only the last k feature attributes 204 associated with each feature for each location.
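The time-series storage with a feature attribute threshold may be sketched as follows. The class name `FeatureAttributeStore`, the in-memory dictionary and the sample values are assumptions made for illustration; a bounded `collections.deque` is used so that the oldest entry is discarded automatically once k entries are held.

```python
from collections import deque

class FeatureAttributeStore:
    """Keeps only the last k (time, value) pairs per (location, feature)."""

    def __init__(self, k):
        self._k = k
        self._series = {}  # (location, feature) -> deque of (posix_time, value)

    def record(self, location, feature, posix_time, value):
        key = (location, feature)
        if key not in self._series:
            # maxlen makes the deque drop its oldest item when full
            self._series[key] = deque(maxlen=self._k)
        self._series[key].append((posix_time, value))

    def history(self, location, feature):
        """Return the stored attributes, oldest first."""
        return list(self._series.get((location, feature), ()))

attrs = FeatureAttributeStore(k=3)
for t, n_comments in [(1000, 2), (1600, 5), (4000, 9), (4300, 9), (9000, 14)]:
    attrs.record("http://example.com/a", "comments", t, n_comments)
kept = attrs.history("http://example.com/a", "comments")
```

With k = 3, only the three most recent observations remain after five have been recorded.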
[0055] The storage 190 may, in some embodiments, be internal storage of the content monitoring system 160, such as internal memory of the content monitoring system 160. In other embodiments, the storage 190 may be external storage which is accessible by the content monitoring system 160. For example, the storage 190 may, in some embodiments, be network storage.
[0056] The content monitoring module 280 may also include a prediction component 230. As will be explained in greater detail below, the prediction component 230 may be configured to, for each location 182, 184 in the location set 180, determine a first predicted attribute for the first feature associated with that location based on the historic feature attributes 204 for that first feature and that location 182, 184. That is, in at least some embodiments, the prediction component 230 may, for each location 182, 184 in the location set 180, determine a future attribute for a first feature associated with that location based on historic feature attributes 204 for that first feature and that location. The prediction component 230 may attempt to determine future attributes of features based on previously observed attributes of that same feature.
[0057] For example, where the feature is an indicator of whether the document was updated or not updated since a last visit, the prediction component 230 may attempt to predict whether, at some future time, the document will be updated or not since the last visit. Similarly, where the feature is an indicator of the age of the document (for example, the elapsed time since the last change to the document), the prediction component 230 may attempt to predict what the age of the document will be at some future time. Similarly, where the feature is a quantifier of the number of comments associated with the electronic document 120a, 120b, the prediction component 230 may attempt to predict the number of comments associated with the electronic document at some future time. Similarly, where the feature is a quantifier of the number of inlinks associated with the electronic document 120a, 120b, the prediction component 230 may attempt to predict the number of inlinks associated with the electronic document 120a, 120b at some future time.
[0058] Similarly, where the feature is a link analysis based ranking associated with the electronic document 120a, 120b (such as PageRank™), the prediction component 230 may attempt to predict the link analysis based ranking associated with the electronic document 120a, 120b at some future time.
[0059] The prediction component 230 may, in at least some embodiments, include a regression computation module which performs a regression analysis on historic attributes (also known as feature attributes 204) associated with a feature and a location in order to determine predicted attributes for that same feature and location .
[0060] It will be appreciated that the historic attributes may be taken at times that are irregular. That is, since monitoring does not occur in a fixed order, the time period between successive feature attributes for any location may be variable. Accordingly, a regression analysis which does not require fixed time intervals may be utilized by the prediction component 230. For example, in at least some embodiments, Brown's double exponential smoothing method may be used. In such embodiments, a predicted attribute for a feature and a location may be determined according to the following formula:
$$\hat{X}_n = (1 - V_n) \cdot \hat{X}_{n-1} + V_n X_n$$
where:
$$b_n = (1 - \alpha)^{t_n - t_{n-1}}, \qquad V_0 = 1 - (1 - \alpha)^{[\ldots]}, \qquad \text{and} \qquad \frac{[\ldots]}{n + 1}$$
Where $\hat{X}_n$ is the predicted attribute, $\alpha$ is a smoothing parameter, n is the number of historic attributes for the feature and location which are used to determine the predicted attribute, and t is the time associated with a historic attribute (i.e. $t_n$ is the time for the nth historic attribute for that feature and that location). $\hat{X}_{n-1}$ is the last predicted attribute, and $X_n$ is a feature attribute. The smoothing parameter is a value which is, in at least some embodiments, in the range of zero (0) to one (1). In at least some embodiments, the smoothing parameter is approximately 0.1.
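A minimal sketch of smoothing over irregularly spaced observations is given below. It applies the recurrence in which each new predicted attribute blends the last predicted attribute with the newest feature attribute. Three points are assumptions that are not taken from the formula above: the weight update `v = v / (b + v)`, which follows the commonly used irregular-interval form of exponential smoothing; the initial weight, which defaults to the smoothing parameter; and the initial prediction, which is set to the first observation. The function name `smooth_irregular` is hypothetical.

```python
def smooth_irregular(observations, alpha=0.1, v0=None):
    """Exponentially smooth (time, value) pairs with uneven spacing.

    observations: list of (t, x) pairs sorted by increasing t.
    Returns the latest smoothed value, used here as the prediction.
    """
    if not observations:
        raise ValueError("at least one observation is required")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    v = alpha if v0 is None else v0          # assumed initial weight
    prev_t, estimate = observations[0]       # assumed initial prediction
    for t, x in observations[1:]:
        b = (1.0 - alpha) ** (t - prev_t)    # decay over the elapsed time
        v = v / (b + v)                      # assumed weight update
        estimate = (1.0 - v) * estimate + v * x
        prev_t = t
    return estimate

short_gap = smooth_irregular([(0, 0.0), (1, 1.0)])
long_gap = smooth_irregular([(0, 0.0), (100, 1.0)])
```

Under these assumptions, an observation that arrives after a long gap receives more weight than one that arrives shortly after the previous observation, which is the behaviour needed when visits to a location are unevenly spaced.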
[0061] In other embodiments, an extended Holt's approach may be used to perform a regression analysis. In such embodiments, a linear regression step may be performed to create a regression line using historic feature attributes 204. More particularly, let $S_0 = A$ and $T_0 = B$, where A is the intercept of the regression line at $t_0$ and B is the slope of the linear regression line. The predicted attribute can then be determined by iterating through the following steps:
$$S_{n+1} = (1 - \alpha_{n+1}) \left[ S_n + (t_{n+1} - t_n) \cdot T_n \right] + \alpha_{n+1} y_{n+1}$$
$$T_{n+1} = (1 - \gamma_{n+1}) \cdot T_n + \gamma_{n+1} \frac{S_{n+1} - S_n}{t_{n+1} - t_n}$$
Where variable smoothing coefficients are given as:
$$\alpha_{n+1} = \frac{\alpha_n}{\alpha_n + (1 - \alpha)^{t_{n+1} - t_n}}, \qquad \gamma_{n+1} = \frac{\gamma_n}{\gamma_n + (1 - \gamma)^{t_{n+1} - t_n}}$$
where $\alpha \in (0,1)$ is a smoothing constant for the level and $\gamma \in (0,1)$ is a smoothing constant for the slope.
[0062] The predicted attribute may be calculated as:
$$\hat{X}_{t+n}(t) = S_t + n \cdot T_t$$
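The level-and-slope iteration of the extended Holt's approach may be sketched as follows. The sketch assumes that the intercept and slope of an initial regression line are supplied by the caller, that the variable coefficients start at the smoothing constants themselves, and that each variable coefficient is updated as its previous value divided by the sum of that value and a time-decay term. The function names `holt_irregular` and `forecast` are hypothetical.

```python
def holt_irregular(observations, t0, intercept, slope, alpha=0.1, gamma=0.1):
    """Update level S and slope T over unevenly spaced (t, y) pairs.

    observations: list of (t, y) with strictly increasing t, all after t0.
    intercept, slope: regression line at t0, giving S_0 and T_0.
    """
    level, trend = intercept, slope
    a_n, g_n = alpha, gamma                  # assumed starting coefficients
    prev_t = t0
    for t, y in observations:
        dt = t - prev_t
        if dt <= 0:
            raise ValueError("observation times must be strictly increasing")
        a_n = a_n / (a_n + (1.0 - alpha) ** dt)   # variable level coefficient
        g_n = g_n / (g_n + (1.0 - gamma) ** dt)   # variable slope coefficient
        new_level = (1.0 - a_n) * (level + dt * trend) + a_n * y
        trend = (1.0 - g_n) * trend + g_n * (new_level - level) / dt
        level, prev_t = new_level, t
    return level, trend

def forecast(level, trend, horizon):
    """Predicted attribute a given horizon after the last observation."""
    return level + horizon * trend

# Observations lying exactly on y = 2 + 3t, with the matching initial line
S, T = holt_irregular([(1, 5.0), (4, 14.0), (6, 20.0)], t0=0, intercept=2.0, slope=3.0)
prediction = forecast(S, T, horizon=2)
```

When the observations lie exactly on the initial regression line, the level and slope track that line, so the forecast two time units ahead equals the value of the line at that time.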
[0063] In other embodiments, a linear regression method may be used to determine predicted attributes.
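As a simple illustration of the linear regression alternative, the sketch below fits an ordinary least-squares line to historic (time, value) pairs and extrapolates it to a future time. The function names `fit_line` and `predict_at` are hypothetical, and ordinary least squares is assumed because no particular regression variant is specified.

```python
def fit_line(points):
    """Least-squares fit of value = intercept + slope * time."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")
    mean_t = sum(t for t, _ in points) / n
    mean_x = sum(x for _, x in points) / n
    spread = sum((t - mean_t) ** 2 for t, _ in points)
    if spread == 0:
        raise ValueError("all observations share the same time")
    slope = sum((t - mean_t) * (x - mean_x) for t, x in points) / spread
    intercept = mean_x - slope * mean_t
    return intercept, slope

def predict_at(points, future_time):
    """Extrapolate the fitted line to future_time."""
    intercept, slope = fit_line(points)
    return intercept + slope * future_time

# Hypothetical comment counts growing by two per time unit
history = [(0, 1.0), (2, 5.0), (3, 7.0), (7, 15.0)]
predicted_comments = predict_at(history, 10)
```

The same fit could also supply the intercept A and slope B used to initialize the extended Holt's approach.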
[0064] In at least some embodiments, predicted attributes for more than one feature may be determined for each location 182, 184. In such embodiments, the prediction component 230 may, for each location 182, 184 in the location set 180, gather the predicted attributes for more than one feature and compute a performance metric value based on those predicted attributes. For example, in at least some embodiments, the prediction component 230 may apply a predetermined function to the predicted attributes for multiple features in order to compute a performance metric value. By way of example and not limitation, each feature may have a weighting value associated with that feature. The performance metric value may, in at least some embodiments, be calculated as the sum of the products of the predicted attribute of each feature and the weighting value associated with that feature. For example, in some embodiments, the multiple features may include the number of comments associated with a document (i.e. the first feature) and the number of inlinks associated with the document (i.e. the second feature). In such embodiments, the performance metric value may be calculated based on both a predicted attribute related to the number of comments expected to be associated with the document at some future time and a predicted attribute related to the number of inlinks expected to link to the document at some future time.
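The weighted-sum form of the performance metric value may be sketched as follows. The feature names, weights and predicted values shown are hypothetical examples only.

```python
def performance_metric(predicted, weights):
    """Sum of predicted attribute times weighting value, over all features."""
    return sum(weights[feature] * predicted[feature] for feature in weights)

# Hypothetical predicted attributes for one location
predicted = {"comments": 12.0, "inlinks": 3.0}
# Hypothetical weighting values, one per feature
weights = {"comments": 0.5, "inlinks": 2.0}

metric = performance_metric(predicted, weights)  # 0.5 * 12.0 + 2.0 * 3.0
```

Choosing the weights determines how strongly each feature influences the position of a location in the monitoring schedule.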
[0065] The content monitoring module 280 may also include a scheduling component 232. The scheduling component 232 may determine a monitoring schedule 202 based on the predicted attributes and/or the performance metric values determined by the prediction component 230.
[0066] For example, the scheduling component 232 may schedule the monitoring of the locations 182, 184 in the location set 180 based on the predicted attributes and/or the performance metric values; locations which have higher predicted attributes and/or higher performance metric values may be placed higher on the monitoring schedule 202 (and thus monitored sooner) than locations with relatively lower predicted attributes and/or lower performance metric values.
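One simple way to derive such an ordering is to sort locations by descending performance metric value, as in the sketch below. The function name `build_schedule` and the sample values are hypothetical.

```python
def build_schedule(metric_by_location):
    """Return locations ordered from highest to lowest metric value.

    sorted() is stable, so locations with equal values keep the order
    in which they appear in the input mapping.
    """
    return sorted(metric_by_location, key=metric_by_location.get, reverse=True)

metrics = {
    "http://example.com/a": 1.0,
    "http://example.com/b": 5.0,
    "http://example.com/c": 3.0,
}
schedule = build_schedule(metrics)
```

The first entry of the resulting list is the location to be monitored next.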
[0067] In at least some embodiments, the scheduling component 232 may be configured to increase the rank of a location in the monitoring schedule 202 if that location becomes stale. For example, the rank of a location may be increased based on the period of time which has elapsed since the location was last monitored. The period of time may be measured, for example, in terms of the number of fetching or monitoring operations which have been performed by the monitoring component 234 since the location was last monitored. In some embodiments, the rank of a location in the monitoring schedule 202 may be increased by increasing the performance metric value associated with that location. For example, the predicted performance metric could be incremented by a predetermined amount for every thousand fetching operations. It will be appreciated, however, that a thousand fetching operations is intended to be illustrative and that other thresholds may be used.
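The staleness adjustment may be sketched as an increment applied once per completed block of fetching operations. The function name `boosted_metric`, the increment of 1.0 and the default block size of one thousand are illustrative assumptions.

```python
def boosted_metric(base_metric, fetches_since_visit, boost=1.0, fetches_per_boost=1000):
    """Raise a location's metric for each full block of fetches since its last visit."""
    completed_blocks = fetches_since_visit // fetches_per_boost
    return base_metric + boost * completed_blocks

fresh = boosted_metric(2.0, fetches_since_visit=999)    # no full block yet
stale = boosted_metric(2.0, fetches_since_visit=2500)   # two full blocks
```

Because the boost keeps growing while a location goes unvisited, a location with a low predicted attribute eventually rises far enough in the schedule to be monitored again.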
[0068] It will be appreciated that the division of functions between components could, in some embodiments, be different than that specifically described above. That is, any functions provided by any one of the prediction component 230, scheduling component 232 and monitoring component 234 could be performed by another component, module, or system. For example, any one or more of the components 230, 232, 234 or modules 280 may be logically or physically organized in a manner that is different from the manner illustrated in FIG. 2.
[0069] It will also be appreciated that, while the location set 180 and the monitoring schedule 202 are depicted in FIG. 2 using separate blocks, in at least some embodiments, the location set 180 and the monitoring schedule 202 may be a single element. For example, a single list of locations may serve as both a location set 180 and a monitoring schedule 202. For example the order of the listing of locations in the location set 180 may define the order of monitoring .
[0070] Referring now to FIG. 3, a block diagram of a further example of content monitoring systems 160 is illustrated. In the example of FIG. 3, a first content monitoring system 360 and a second content monitoring system 362 are connected to a common storage 190. The first content monitoring system 360 and the second content monitoring system 362 may retrieve and update data which is common to both content monitoring systems 360 and 362. For example, the first content monitoring system 360 and the second content monitoring system 362 may share fetched content 206, feature attributes 204, a monitoring schedule 202 and/or a location set 180. Due to the sharing of data, the capacity of the system to monitor documents may be increased simply by adding additional content monitoring systems 160.
[0071] It will be appreciated that, while FIG. 3 illustrates an example where two content monitoring systems 160 are used in order to provide additional capacity, in other embodiments, additional content monitoring systems 160 could be used in order to provide greater capacity.
[0072] Referring now to FIG. 4, a process 400 for monitoring content stored at a plurality of locations 182, 184 (FIG. 1) in a location set 180 (FIG. 1) is illustrated in flowchart form. The process 400 includes steps or operations which may be performed by the content monitoring system 160 of FIGs. 1 to 3. In at least some embodiments, the content monitoring module 280 may be configured to perform the steps or operations of the process 400 of FIG. 4. The steps or operations of the process 400 of FIG. 4 may be performed by one or more of the prediction component 230, the scheduling component 232 and/or the monitoring component 234 of FIG. 2. That is, the content monitoring module 280, the prediction component 230, the scheduling component 232 and/or the monitoring component 234 may contain instructions for causing the processor 240 to execute the process 400 of FIG. 4.
[0073] First, at step 410, the monitoring component 234 of the content monitoring module 280 may retrieve a monitoring schedule 202 (FIG. 2) from storage 190 and may access a location 182, 184 in a location set 180 according to the monitoring schedule 202. The monitoring schedule 202 specifies the order in which the content of electronic documents 120a, 120b at locations 182, 184 of the location set 180 is monitored.
[0074] The monitoring schedule 202 may, in at least some embodiments, act as a queue which lists locations 182, 184 in the order in which they are to be monitored. For example, in at least some embodiments, the monitoring component 234 will monitor the documents at the locations 182, 184 in the monitoring schedule 202 in the order in which they are listed in the monitoring schedule 202. In such embodiments, the location accessed at step 410 may be the location at the top of the queue.
[0075] The monitoring schedule 202 may, at least initially, be randomly or arbitrarily determined. For example, all of the locations 182, 184 in the location set 180 may be added to the monitoring schedule 202 in a random or arbitrary manner. Other methods of initializing the monitoring schedule 202 are also possible. As will be explained in greater detail below, the monitoring schedule 202 will be updated in a manner which permits locations to be monitored in a dynamic manner. That is, the monitoring schedule 202 is not simply a fixed schedule in which locations are always monitored in the same predetermined order. The order of monitoring will vary as described below.
[0076] Step 410 includes a step of retrieving the electronic document 120a, 120b at the location 182, 184 specified by the monitoring schedule 202. Step 410 may also include a step of saving the electronic documents 120a, 120b to the storage 190. That is, the monitoring component 234 may, in various embodiments, be configured to fetch the electronic documents 120a, 120b from their respective locations 182, 184 and to save the electronic documents 120a, 120b to the storage 190. For example, the electronic documents 120a, 120b may be saved in a fetched content 206 portion of the storage 190.
[0077] In at least some embodiments, monitoring electronic documents 120a, 120b at step 410 may include monitoring documents referred to and/or linked to in the electronic documents 120a, 120b located at the locations 182, 184 in the location set 180. For example, the document 120a, 120b located at a location 182, 184 may, in some embodiments, be a cascaded data object such as an RSS feed . In such cases, the monitoring component 234 may be configured to visit locations referred to or linked to in the document that is an RSS feed, when monitoring that document. That is, the monitoring component may visit locations referred to or linked to in an RSS document in order to retrieve and/or fetch content from other documents located at the referred-to or linked-to locations.
[0078] In at least some embodiments, at step 410, the monitoring component 234 may be configured to perform a duplication checking analysis on fetched content 206 before saving the content to the storage 190. The monitoring component 234 may compare the fetched content with fetched content already saved to the storage 190. If the monitoring component 234 determines that the content has not already been saved to the storage, it may save the content to the storage 190. Alternatively, if the monitoring component 234 determines that the content has already been saved to the storage, it may not re-save the content to the storage 190.
[0079] Next, at step 420, the monitoring component 234 may analyze the retrieved electronic documents 120a, 120b located at the location 182, 184 specified by the monitoring schedule 202 to determine one or more feature attributes 204 associated with features of the electronic documents 120a, 120b. Each feature attribute 204 may be related to a feature of the electronic documents 120a, 120b at a specific point in time. In at least some embodiments, the feature attribute 204 may be related to a feature of the electronic document 120a, 120b at the point in time in which the electronic documents 120a, 120b are fetched from their respective locations 182, 184. The time which is related to each feature attribute 204 is, generally, a time which has already passed. Thus, the feature attributes 204 may, in at least some embodiments, be referred to as historic attributes.
[0080] Each feature attribute may be a value, quantifier, or other attribute associated with a feature of an associated electronic document 120a, 120b at an associated point in time. That is, the feature attributes 204 serve to quantify features. Each feature attribute 204 is associated with both a feature and a location .
[0081] The features of the electronic documents 120a, 120b represent information about the electronic document 120a, 120b which may be used to determine how frequent the location 182, 184 associated with the document 120a, 120b will be monitored . That is, features are characteristics associated with the electronic document 120a, 120b which may be used in order to determine how often the location 182, 184 of the document 120a, 120b should be revisited for monitoring and/or how the monitoring of the document 120a, 120b should be prioritized relative to the monitoring of other documents 120a, 120b.
[0082] The features may include one or more of: an indicator of whether the document was updated or not updated since a last visit, an indicator of the age of the document (for example, the elapsed time since the last change to the document), a quantifier of the number of comments associated with the electronic document 120a, 120b (for example, if the electronic document 120a, 120b is a web page which permits commenting, the comments may be a feature), and/or a quantifier of the number of inlinks associated with the electronic document 120a, 120b.
[0083] Inlinks are links, such as hyper-text links, which direct to the electronic document 120a, 120b. The number of inlinks is not determined from the document itself, but rather, from examining other documents to determine whether they link to the document.
[0084] The features may also include a feature which is a link analysis based ranking associated with the electronic document. For example, a PageRank™ associated with an electronic document 120a, 120b may be a feature of that electronic document 120a, 120b. The specific value or other quantifier of the feature for each document 120a, 120b at an associated time is the attribute for that feature. For example, a specific PageRank™ value associated with a specific electronic document 120a, 120b at a specific point in time may be a feature attribute of a PageRank™ feature for that electronic document 120a, 120b.
[0085] Other features apart from those specifically discussed above are also possible.
[0086] Next, at step 430, the feature attribute 204 which is determined by the monitoring component 234 may be saved to storage 190 associated with the content monitoring system 160. In at least some embodiments, the feature attributes 204 may be saved in a features database in the storage 190. The feature attributes 204 may be saved along with a time related to the feature attributes 204. That is, the feature attributes 204 may be saved in a time-series fashion . The time may, in at least some embodiments, be the time at which the feature attributes 204 were observed or determined . In at least some embodiments, the time may be saved using POSIX time convention . However, other time formats may also be used.
[0087] In at least some embodiments, the monitoring component 234 may be configured to only record a finite number of values associated with each feature for each location 182, 184 in the location set 180. This finite number may be defined by a feature attribute threshold. Once the feature attribute threshold is met, older feature attributes 204 may be removed from storage 190 in order to make room for the newer feature attributes. For example, in some embodiments, the monitoring component 234 may record only the last k feature attributes 204 associated with each feature for each location.
[0088] Next, at step 440, the prediction component 230 may determine a first predicted attribute for the first feature associated with the location based on the historic feature attributes 204 for that first feature and that location.
[0089] That is, in at least some embodiments, the prediction component 230 may determine a future attribute for a first feature associated with the location accessed in step 410 based on historic feature attributes 204 for that first feature and that location . The prediction component 230 may attempt to determine future attributes of features based on previously observed attributes of that same feature.
[0090] The prediction component 230 may, in at least some embodiments, perform a regression analysis on historic attributes associated with a feature and the location accessed in step 410 in order to determine predicted attributes for that same feature and location .
[0091] In at least one embodiment, at step 440, Brown's double exponential smoothing method may be performed. In such embodiments, a predicted attribute for a feature and a location may be determined according to the following formula:
$$\hat{X}_n = (1 - V_n) \cdot \hat{X}_{n-1} + V_n X_n$$
where:
$$b_n = (1 - \alpha)^{t_n - t_{n-1}}, \qquad V_0 = 1 - (1 - \alpha)^{[\ldots]}, \qquad \text{and} \qquad \frac{[\ldots]}{n + 1}$$
Where $\hat{X}_n$ is the predicted attribute, $\alpha$ is a smoothing parameter, n is the number of historic attributes for the feature and location which are used to determine the predicted attribute, and t is the time associated with a historic attribute (i.e. $t_n$ is the time for the nth historic attribute for that feature and that location). $\hat{X}_{n-1}$ is the last predicted attribute and $X_n$ is a feature attribute. The smoothing parameter is a value which is, in at least some embodiments, in the range of zero (0) to one (1). In at least some embodiments, the smoothing parameter is approximately 0.1.
[0092] In other embodiments, an extended Holt's approach may be used to perform a regression analysis. In such embodiments, a linear regression step may be performed to create a regression line using historic feature attributes 204. More particularly, let $S_0 = A$ and $T_0 = B$, where $A$ is the intercept of the regression line at $t_0$ and $B$ is the slope of the regression line. The predicted attribute can be determined by iterating through the following steps:
$$S_{n+1} = (1 - \alpha_{n+1})\,[\,S_n + (t_{n+1} - t_n)\,T_n\,] + \alpha_{n+1}\,y_{n+1}$$
$$T_{n+1} = (1 - \gamma_{n+1})\,T_n + \gamma_{n+1}\,\frac{S_{n+1} - S_n}{t_{n+1} - t_n}$$
where the variable smoothing coefficients are given as:
$$\alpha_{n+1} = \frac{\alpha_n}{\alpha_n + (1 - \alpha)^{t_{n+1} - t_n}}, \qquad \gamma_{n+1} = \frac{\gamma_n}{\gamma_n + (1 - \gamma)^{t_{n+1} - t_n}}$$
where $\alpha \in (0,1)$ is a smoothing constant for the level and $\gamma \in (0,1)$ is a smoothing constant for the slope.
[0093] The predicted attribute may be calculated as: $\hat{X}_{t+n}(t) = S_t + n \cdot T_t$
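The level-and-slope iteration of paragraphs [0092] and [0093] can be sketched in Python as follows. This is a hedged illustration, assuming the usual form of Holt's method for irregularly spaced observations; the function names `holt_irregular` and `predict_ahead`, the default constants, and the choice to start the variable coefficients at alpha and gamma are assumptions and do not come from the source.

```python
def holt_irregular(times, values, level0, slope0, alpha=0.2, gamma=0.1):
    """Iterate level S and slope T over irregularly spaced observations.

    level0, slope0 -- S_0 = A and T_0 = B from an initial regression line,
                      taken at times[0] (values[0] is therefore not re-used)
    alpha, gamma   -- smoothing constants in (0, 1); defaults are ASSUMED
    """
    if len(times) != len(values) or not values:
        raise ValueError("times and values must be non-empty and equal length")
    level, slope = float(level0), float(slope0)
    a_n, g_n = alpha, gamma  # ASSUMPTION: coefficients start at the constants
    for i in range(1, len(values)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("times must be strictly increasing")
        # Variable smoothing coefficients grow when the gap dt is longer.
        a_n = a_n / (a_n + (1.0 - alpha) ** dt)
        g_n = g_n / (g_n + (1.0 - gamma) ** dt)
        # Level: blend the extrapolated level with the new observation.
        new_level = (1.0 - a_n) * (level + dt * slope) + a_n * values[i]
        # Slope: blend the old slope with the observed change per unit time.
        slope = (1.0 - g_n) * slope + g_n * (new_level - level) / dt
        level = new_level
    return level, slope


def predict_ahead(level, slope, horizon):
    """Extrapolate the final level along the final slope."""
    return level + horizon * slope
```

For perfectly linear history the iteration leaves the regression line unchanged, so the prediction is simply the line extended by the chosen horizon.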
[0094] In other embodiments, a linear regression method may be used to determine predicted attributes.
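A linear regression predictor of the kind mentioned in paragraph [0094] might look like the following Python sketch. It assumes ordinary least squares over (time, attribute) pairs; the function names `fit_line` and `predict_at` are hypothetical and the source does not specify how the line is fitted.

```python
def fit_line(times, values):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need two or more (time, value) pairs")
    mean_t = sum(times) / n
    mean_x = sum(values) / n
    # Sum of squared deviations of the times from their mean.
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct times")
    # Sum of cross-deviations between times and values.
    sxy = sum((t - mean_t) * (x - mean_x) for t, x in zip(times, values))
    slope = sxy / sxx
    intercept = mean_x - slope * mean_t
    return intercept, slope


def predict_at(times, values, future_time):
    """Predicted attribute at future_time from the fitted line."""
    intercept, slope = fit_line(times, values)
    return intercept + slope * future_time
```

The same fit could also supply the starting intercept and slope for a level-and-slope smoothing method.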
[0095] Next, at step 450, the scheduling component 232 may update the monitoring schedule 202 based on the predicted attribute determined at step 440. For example, the scheduling component 232 may, at step 450, schedule the monitoring of the locations 182, 184 in the location set 180 based on the predicted attribute determined at step 440. In at least some embodiments, locations which have higher predicted attributes may be placed higher on the monitoring schedule 202 (and thus monitored sooner) than locations with relatively lower predicted attributes.
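The ordering described in paragraph [0095] can be sketched in Python as a sort by predicted attribute. This is only an illustration: the function name `build_schedule`, the dictionary representation, and the example URLs are assumptions, and the source does not say how ties are broken.

```python
def build_schedule(predicted_by_location):
    """Return locations ordered from highest to lowest predicted attribute.

    predicted_by_location -- mapping of location (e.g. a URL) to its
                             predicted attribute (hypothetical structure)
    """
    # sorted() is stable, so locations with equal predictions keep
    # their original insertion order.
    return sorted(
        predicted_by_location,
        key=lambda location: predicted_by_location[location],
        reverse=True,
    )


schedule = build_schedule({
    "http://example.com/a": 0.2,
    "http://example.com/b": 0.9,
    "http://example.com/c": 0.5,
})
```

Locations at the front of the returned list would be monitored first.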
[0096] The process 400 may then repeat itself so that the scheduling and monitoring of locations proceeds indefinitely, or until some predetermined stop condition is satisfied.
[0097] Referring now to FIG. 5, a further process 500 for monitoring content stored at a plurality of locations 182, 184 (FIG. 1) in a location set 180 (FIG. 1) is illustrated in flowchart form. The process 500 includes steps or operations which may be performed by the content monitoring system 160 of FIGs. 1 to 3. In at least some embodiments, the content monitoring module 280 may be configured to perform the steps or operations of the process 500 of FIG. 5. The steps or operations of the process 500 of FIG. 5 may be performed by one or more of the prediction component 230, the scheduling component 232 and/or the monitoring component 234 of FIG. 2. That is, the content monitoring module 280, the prediction component 230, the scheduling component 232 and/or the monitoring component 234 may contain instructions for causing the processor 240 to execute the process 500 of FIG. 5.
[0098] The process 500 of FIG. 5 is similar to the process 400 of FIG. 4, except that, in the process 500 of FIG. 5, the scheduling is based on historic feature attributes 204 for more than one feature. Step 520 of FIG. 5 is similar to step 420 of FIG. 4, except that, at step 520 of FIG. 5, feature attributes 204 for a plurality of features are determined. For example, in some embodiments, a feature attribute for a first feature and a feature attribute for a second feature may be determined.
[0099] The features may include one or more of: an indicator of whether the document was updated or not updated since a last visit, an indicator of the age of the document (for example, the elapsed time since the last change to the document), a quantifier of the number of comments associated with the electronic document 120a, 120b (for example, if the electronic document 120a, 120b is a web page which permits commenting, the comments may be a feature), and/or a quantifier of the number of inlinks associated with the electronic document 120a, 120b. The features may also include a link-analysis-based ranking associated with the electronic document 120a, 120b. For example, a PageRank™ associated with an electronic document 120 may be a feature of that electronic document 120a, 120b.
[00100] Other features apart from those specifically discussed above are also possible.
[00101] Similarly, step 530 of FIG. 5 is similar to step 430 of FIG. 4 except that, at step 530 of FIG. 5, feature attributes for multiple features associated with a location are stored. Likewise, step 540 of FIG. 5 is similar to step 440 of FIG. 4 except that, at step 540, predicted attributes for multiple features are determined.
[00102] Next, at step 550, the prediction component 230 may, for the location accessed at step 410, gather predicted attributes for more than one feature and compute a performance metric value based on those predicted attributes. For example, in at least some embodiments, the prediction component 230 may apply a predetermined function to the predicted attributes for multiple features in order to compute a performance metric value. By way of example and not limitation, each feature may have a weighting value associated with that feature. The performance metric value may, in at least some embodiments, be calculated as the sum, over the features, of the product of the predicted attribute for each feature and the weighting value associated with that feature.
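The weighted-sum form of the performance metric in paragraph [00102] can be illustrated with a brief Python sketch. The function name `performance_metric`, the feature names and the weighting values below are hypothetical; the source only states that each feature may have an associated weighting value.

```python
def performance_metric(predicted, weights):
    """Sum of weight * predicted attribute over all weighted features.

    predicted -- mapping of feature name to predicted attribute
    weights   -- mapping of feature name to weighting value
    Raises KeyError if a weighted feature has no predicted attribute.
    """
    return sum(weight * predicted[feature] for feature, weight in weights.items())


metric = performance_metric(
    predicted={"comments": 10.0, "inlinks": 4.0},  # illustrative values
    weights={"comments": 0.5, "inlinks": 2.0},     # illustrative weights
)
```

Here the metric is 0.5 × 10.0 + 2.0 × 4.0 = 13.0; a location with a larger value would be scheduled sooner.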
[00103] Next, at step 560, the scheduling component 232 may update the monitoring schedule 202 based on the performance metric values determined at step 550. For example, the scheduling component 232 may, at step 560, schedule the monitoring of the locations 182, 184 in the location set 180 based on the performance metric values determined at step 550. In at least some embodiments, locations which have higher performance metric values may be placed higher on the monitoring schedule 202 (and thus monitored sooner) than locations with relatively lower performance metric values.
[00104] Thus, in the embodiment of FIG. 5, the monitoring schedule 202 is determined in accordance with a plurality of predicted attributes. For example, in some embodiments, the monitoring schedule is determined in accordance with a first predicted attribute associated with a first feature and a second predicted attribute associated with a second feature.
[00105] Referring now to FIG. 6, a further process 600 for monitoring content stored at a plurality of locations 182, 184 (FIG. 1) in a location set 180 (FIG. 1) is illustrated in flowchart form. The process 600 includes steps or operations which may be performed by the content monitoring system 160 of FIGs. 1 to 3. In at least some embodiments, the content monitoring module 280 may be configured to perform the steps or operations of the process 600 of FIG. 6. The steps or operations of the process 600 of FIG. 6 may be performed by one or more of the prediction component 230, the scheduling component 232 and/or the monitoring component 234 of FIG. 2. That is, the content monitoring module 280, the prediction component 230, the scheduling component 232 and/or the monitoring component 234 may contain instructions for causing the processor 240 to execute the process 600 of FIG. 6.
[00106] The process 600 of FIG. 6 is similar to the process 500 of FIG. 5 except that it includes a further step 660 of increasing the ranking of stale locations in the monitoring schedule 202. At step 660, the scheduling component 232 may increase the rank of a location in the monitoring schedule 202 if that location becomes stale. For example, the rank of a location may be increased based on the period of time which has elapsed since the location was last monitored. The period of time may be measured, for example, in terms of the number of fetching or monitoring operations which have been performed by the monitoring component 234 since the location was last monitored. In some embodiments, the rank of a location in the monitoring schedule 202 may be increased by increasing the performance metric value associated with that location. For example, the performance metric value could be incremented by a predetermined amount for every thousand fetching operations. It will be appreciated, however, that a thousand fetching operations is intended to be illustrative and that other thresholds may be used.
[00107] It will be appreciated that variations of the methods and systems described above are also possible. For example, various embodiments may omit or modify some of the steps of FIGs. 4 to 6.
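The staleness adjustment of step 660, described in paragraph [00106], could be sketched in Python as shown below. The function name `boosted_metric` and the default increment of 1.0 are assumptions; the threshold of one thousand fetching operations is the illustrative figure given in the text.

```python
def boosted_metric(metric, fetches_since_last_monitored,
                   increment=1.0, fetches_per_increment=1000):
    """Raise a location's performance metric value as it grows stale.

    metric                       -- current performance metric value
    fetches_since_last_monitored -- fetching operations performed by the
                                    monitoring component since the last visit
    increment                    -- predetermined amount (ASSUMED value)
    fetches_per_increment        -- threshold; 1000 is illustrative only
    """
    # Count only fully completed blocks of fetching operations.
    completed_blocks = fetches_since_last_monitored // fetches_per_increment
    return metric + increment * completed_blocks
```

For instance, under these assumed defaults a location with a metric of 5.0 that has gone 2,500 fetching operations without a visit would be raised to 7.0, while one that has gone 999 operations would be left at 5.0.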
[00108] While the present disclosure is primarily described in terms of methods, a person of ordinary skill in the art will understand that the present disclosure is also directed to various apparatus, such as a server and/or a document processing system, including components for performing at least some of the aspects and features of the described methods, be it by way of hardware components, software or any combination of the two, or in any other manner. Moreover, an article of manufacture for use with the apparatus, such as a prerecorded storage device or other similar computer readable medium including program instructions recorded thereon, or a computer data signal carrying computer readable program instructions may direct an apparatus to facilitate the practice of the described methods. It is understood that such apparatus and articles of manufacture also come within the scope of the present disclosure.
[00109] While the processes 400, 500, 600 of FIGs. 4 to 6 have been described as occurring in a particular order, it will be appreciated by persons skilled in the art that some of the steps may be performed in a different order provided that the result of the changed order of any given step will not prevent or impair the occurrence of subsequent steps. Furthermore, some of the steps described above may be combined in other embodiments, and some of the steps described above may be separated into a number of sub-steps in other embodiments.
[00110] The various embodiments presented above are merely examples. Variations of the embodiments described herein will be apparent to persons of ordinary skill in the art, such variations being within the intended scope of the present disclosure. In particular, features from one or more of the above-described embodiments may be selected to create alternative embodiments comprised of a sub-combination of features which may not be explicitly described above. In addition, features from one or more of the above-described embodiments may be selected and combined to create alternative embodiments comprised of a combination of features which may not be explicitly described above. Features suitable for such combinations and sub-combinations would be readily apparent to persons skilled in the art upon review of the present disclosure as a whole. The subject matter described herein intends to cover and embrace all suitable changes in technology.

Claims

What is claimed is:
1. A method of monitoring content stored at a plurality of locations in a location set, the method comprising:
determining two or more historic attributes for a first feature associated with each location;
for each location in the location set, determining a first predicted attribute for the first feature associated with that location based on the historic attributes for that first feature and that location;
determining a monitoring schedule in accordance with the first predicted attribute; and
monitoring the content at the locations in the location set according to the monitoring schedule.
2. The method of claim 1, further comprising:
determining two or more historic attributes for a second feature associated with each location; and
for each location in the location set, determining a second predicted attribute for the second feature associated with that location based on the historic attributes for the second feature and that location,
and wherein the monitoring schedule is also determined in accordance with the second predicted attribute.
3. The method of claim 1, wherein the location references a web page and wherein at least some of the locations in the location set are universal resource locators.
4. The method of claim 1, wherein the first feature is the number of in-links referencing the location, and wherein each historic attribute for the first feature is the number of in-links referencing the location at an associated time.
5. The method of claim 1, wherein the first feature is a quantity of comments associated with the content at the location, and wherein each historic attribute for the first feature is the quantity of comments associated with the content at an associated time.
6. The method of claim 1, wherein each historic attribute has an associated time.
7. The method of claim 1, wherein monitoring the content at the locations comprises:
retrieving the content according to the monitoring schedule; and
saving the retrieved content to a memory.
8. The method of claim 1, wherein determining a first predicted attribute for the first feature associated with that location based on the historic attributes for that first feature and that location comprises: performing regression analysis using the historic attributes for the first feature of that location.
9. The method of claim 8, wherein the regression analysis is a Brown's double exponential smoothing regression analysis.
10. The method of claim 8, wherein the regression analysis is an extended Holt's approach regression analysis.
11. The method of claim 1, wherein the time duration between successive historic attributes is variable.
12. A content monitoring system for monitoring content stored at a plurality of locations in a location set, the system comprising:
a prediction component configured to:
determine two or more historic attributes for a first feature associated with each location;
for each location in the location set, determine a first predicted attribute for the first feature associated with that location based on the historic attributes for that first feature and that location;
a scheduling component configured to determine a monitoring schedule in accordance with the first predicted attribute; and
a monitoring component configured to monitor the content at the locations in the location set according to the monitoring schedule.
13. The content monitoring system of claim 12 further comprising one or more processors associated with the prediction component, the scheduling component and the monitoring component.
14. The content monitoring system of claim 12, wherein the prediction component is further configured to:
determine two or more historic attributes for a second feature associated with each location; and
for each location in the location set, determine a second predicted attribute for the second feature associated with that location based on the historic attributes for the second feature and that location,
and wherein the scheduling component is further configured to determine the monitoring schedule in accordance with the second predicted attribute.
15. The content monitoring system of claim 12, wherein the location references a web page and wherein at least some of the locations in the location set are universal resource locators.
16. The content monitoring system of claim 12, wherein the first feature is the number of in-links referencing the location, and wherein each historic attribute for the first feature is the number of in-links referencing the location at an associated time.
17. The content monitoring system of claim 12, wherein the first feature is a quantity of comments associated with the content at the location, and wherein each historic attribute for the first feature is the quantity of comments associated with the content at an associated time.
18. The content monitoring system of claim 12, wherein each historic attribute has an associated time.
19. The content monitoring system of claim 12, wherein the monitoring component is further configured to:
retrieve the content according to the monitoring schedule; and
save the retrieved content to a memory.
20. The content monitoring system of claim 12, wherein determining a first predicted attribute for the first feature associated with that location based on the historic attributes for that first feature and that location comprises:
performing regression analysis using the historic attributes for the first feature of that location.
21. The content monitoring system of claim 20, wherein the regression analysis is a Brown's double exponential smoothing regression analysis.
22. The content monitoring system of claim 20, wherein the regression analysis is an extended Holt's approach regression analysis.
23. The content monitoring system of claim 12, wherein the time duration between successive historic attributes is variable.
EP10850926A 2010-05-07 2010-05-07 System and method for monitoring web content Withdrawn EP2567513A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CA2010/000667 WO2011137505A1 (en) 2010-05-07 2010-05-07 System and method for monitoring web content

Publications (1)

Publication Number Publication Date
EP2567513A1 true EP2567513A1 (en) 2013-03-13

Family

ID=44903529

Family Applications (1)

Application Number Title Priority Date Filing Date
EP10850926A Withdrawn EP2567513A1 (en) 2010-05-07 2010-05-07 System and method for monitoring web content

Country Status (3)

Country Link
EP (1) EP2567513A1 (en)
CA (1) CA2799134C (en)
WO (1) WO2011137505A1 (en)


Also Published As

Publication number Publication date
CA2799134A1 (en) 2011-11-10
WO2011137505A1 (en) 2011-11-10
CA2799134C (en) 2017-07-04


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20121126

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20161201