AU2011101565A4 - Harvesting and Information Management System (HIMS) - Google Patents


Info

Publication number
AU2011101565A4
Authority
AU
Australia
Prior art keywords
source
data elements
data
sources
plug
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2011101565A
Inventor
Jonathan Bentley
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WHIM IT
Original Assignee
WHIM IT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2010904862A external-priority patent/AU2010904862A0/en
Application filed by WHIM IT filed Critical WHIM IT
Priority to AU2011101565A priority Critical patent/AU2011101565A4/en
Application granted granted Critical
Publication of AU2011101565A4 publication Critical patent/AU2011101565A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

System and methods for the retrieval, processing, delivery and management of textual and binary information from sets of files and/or databases, wherein retrieved and archived textual and binary information can form the basis of searches and reports, or be exported to external programs.

[Figure 4: flow diagram of the retrieval cycle (retrieve the next scheduled source, check whether the source has changed, discard duplicates, extract data elements, apply processing, archive the processed data elements). Figure 5: flow diagram of data element extraction using the source's Source Specification Filter (SSF).]

Description

Innovation Patent
Harvesting & Information Management System (HIMS)
October 2011
Whim IT, www.whimit.com.au, ph: 02 6255 2865, abn: 67 993 306 944

Contents

1 Description 1
  1.1 Title of Invention 1
  1.2 Technical Field 1
  1.3 Background Art 1
  1.4 Summary of Invention 2
      Technical Problems 2
      Solution to Problems 3
      Advantageous Effects of Invention 4
      Brief Description of Drawings 5
      Brief Description of Algorithms 6
  1.5 Description of Embodiments 7
      Terminology 7
      Overview 7
      Deployment 9
      Source Retriever Detail 9
      Element Processor Detail 12
      Element Archives 14
      Search and Reports 15
      Examples 16
      Reference Signs List
21

2 Claims 23
3 Abstract 25
4 Drawings 26

1 Description

1.1 Title of Invention

[0001] Harvesting and Information Management System (HIMS)

1.2 Technical Field

[0002] This disclosure relates to methods of monitoring and managing information presented on Internet sites, networked servers and systems, and in files and databases. This invention is also concerned with automatic and manual methods of targeting, extracting and consolidating key information across files that potentially also contain unwanted information, such as layout or formatting information and/or advertisements.

1.3 Background Art

[0003] Data harvesting (also known as data scraping) is the process of programmatically locating and extracting data from information intended for human display. Data is harvested primarily for the purpose of automating or simplifying processes and decisions. Historically, before the growth of the Internet, harvesting was a relatively easy task. It typically consisted of computer scientists harvesting data from sources generated by other computer scientists. Such sources were almost always guaranteed to be plain-text ASCII, clearly delimited, and hence easily parsed by simple harvesting programs, or "scripts".

[0004] However, the rise of the Internet has made vast amounts of information available to humans, and it has become increasingly difficult to automate its analysis. Gone are the days of plain-text, 128-character ASCII files, replaced instead by Unicode files containing any combination of over 230,000 characters covering every existing language. And Unicode is the best case: useful data for harvesting can now also be found in images, audio and video clips, though extracting it is very difficult. More often than not, information is no longer clearly delimited.

[0005] Most web pages on the Internet mix the important information for harvesting with unimportant layout information, graphics and advertising.
Automatically and accurately distinguishing information for harvesting from the other unimportant "noise" on a web page is very difficult. Even if it is achievable for one site, another site containing similar information will probably be presented differently, and will require an entirely new solution.
1.4 Summary of Invention

[0006] HIMS is a highly adaptable harvesting and information management system for accurately and automatically capturing, consolidating and archiving any text, image, audio and video information across web sites, networks, databases and file systems. HIMS can be adapted for use with virtually any data retrieval or monitoring scenario, has full archiving and reporting components, and includes data exporting capabilities.

Technical Problems

[0007] Harvesting information from Internet sites and files presents a number of challenges, most stemming from the fact that the Internet and web pages are designed primarily for viewing by humans. A given web page or file may contain a wealth of useful content; however, more often than not, the important information is mixed with irrelevant content and/or layout and presentation information such as images and advertising. Relying on the automatic identification and separation of important information from the unimportant is a "best guess" problem, and is usually considered too hit-and-miss to be relied upon. Inevitably, some unimportant information will be classified as important (false positives), and other important information will be missed (false negatives).

[0008] To complicate the problem further, one important application of harvesting is the monitoring of changes to web pages or files over time. Consider a news page such as CNN. It contains titles, links and summaries for news articles which, although potentially important information, will all change several times a day. Accurately and automatically locating and extracting the important data from a file whose contents are constantly changing is a daunting problem. Even when a solution is devised, it probably will not work for any other site.

[0009] One common solution that has surfaced is the use of RSS feeds.
An RSS feed is a file, separate from the web page itself, containing only a well-structured (XML) view of the data considered most important on the web page. It is designed to make harvesting (usually by "feed aggregators") very easy, as each important data element is clearly delimited, and RSS feeds from different sites all share the same common structure. Relying on RSS feeds, however, has its own drawbacks. Most significantly, a site has to offer an RSS feed of its information before it can be harvested. Even when a site offers RSS feeds, there is no guarantee there will be a feed for the specific information desired, only for the information the provider decides should be harvested. Worse, many web sites do not offer RSS feeds at all, and others are actively designed to hinder harvesting. Relying on RSS feeds for harvesting therefore rules out harvesting data from much of the Internet.

[0010] Harvesting from Internet sources presents another problem: the presence of similar information in files, but in differing formats. For example, the price for a product on one site might be in Euros, but it might be listed in US Dollars on another. Blindly harvesting the data from both will cause problems unless some sort of source-dependent consolidation first takes place to convert one currency to the other. Worse still, not all data can be harvested. The price for the product on a third source might be displayed as an image, such as an advertisement that reads: "Now only NZD $9.95!". Harvesting and consolidating the price from such a source would require writing a complicated program that combines Optical Character Recognition (OCR) functions with currency conversion functions. Almost certainly it is not worth doing.

[0011] As the complications surrounding any given harvesting case are so varied and potentially complex, getting around them has led to the development of as many harvesting solutions as there are harvesting scenarios, if not more.
This has led to harvesting systems that are either automatic, inflexible and unreliable, or manual, slow and costly. In both cases the user is limited in the type and amount of information they can harvest. Furthermore, once the information is harvested, it is typically left to the user to work out how to manage it.

Solution to Problems

[0012] The invention takes a novel abstracted approach to solving the above problems, such that the HIMS solution can be used to solve virtually any harvesting scenario. It is an approach that significantly increases the capabilities of information harvesting, particularly in the area of non-textual data extraction and manipulation.

[0013] Key to this is the invention's simple but elegant method for data identification and extraction, in which important information is automatically, accurately and reliably identified and harvested regardless of the type of file being examined. This makes RSS feeds unnecessary, and means all pages on the Internet can potentially be harvested. Furthermore, the method is not limited to harvesting from web pages: information in Microsoft Word documents, Excel spreadsheets and any other file, even binary information in images, audio and video files, can all be harvested using exactly the same method.

[0014] Furthermore, once harvested, data can easily be assigned processing queues to automatically perform any number of sequences of transformations on the data. These transformations are performed by programs external to HIMS, and applying such a transformation is as simple as selecting it from a list. The ability to perform unlimited transformations makes it very easy to consolidate data across all types of sources, and ensures data accuracy is always maintained.

[0015] HIMS also helps the user manage the data consolidated from sources by providing a means of archiving and classifying data from sources so that a history of data element changes over time is preserved.
This data can then be exported in a variety of formats to be used by other programs. The data can also be searched, or form the basis of reports using the reporting component of HIMS.

Advantageous Effects of Invention

[0016] The extensible and adaptable design of HIMS means the one system can be used to solve all manner of problems automatically, regardless of the type of data being harvested and regardless of the format of the data sources. This removes the costs involved in having multiple systems and the overhead of manual data validation and verification. Additionally, as the data transformations are performed by external programs, HIMS can always harness the latest available technologies, and the quality of the data improves as a result. This also means the applications and capabilities of HIMS are virtually unlimited. Some sample capabilities follow, though many more exist. Moreover, the transformed data can easily be exported to other systems, such that HIMS simply becomes one core component among many in a larger system.

* Media monitoring (news sites, journals, blogs, chat sites, social networking sites, newsgroups etc.)
  - International multi-lingual sources
  - Automated on-the-fly translations and keyword alerts
  - Automated text, audio and video transcripts
  - Archived articles provide a searchable media history for a subject

* Competitor product monitoring
  - Instant notification as competitors' products and prices change
  - User-definable reports (percentage of market share etc.)

* Cross-site service comparisons
  - Real estate, car rentals, banking, flight bookings, health insurance etc.

* Stock market analysis
  - High-frequency scanning can retrieve stock prices multiple times per second
  - Automatically issue alerts when thresholds are reached
  - Graph trends and future predictions using historical data from archives
  - Automatic currency conversions

* Map / image processing
  - Identify and archive objects, shapes, structures and colours across sets of maps (eg: satellite, aerial, infra-red, Google, GIS)
  - Replay shape movements over time from archived data
  - Overlay layers from existing map data, or create composite maps

* Audio processing
  - Find and archive audio containing particular voices or sounds
  - Create text transcripts from English and foreign-language audio recordings

* Video processing
  - Flag videos containing particular shapes, faces and/or audio
  - Generate automated transcripts of foreign-language videos
  - Real-time alerts from surveillance cameras when an object enters the field of view

Brief Description of Drawings

* Figure 1 illustrates the relationship between the terms "data element", "source" and "collection".
* Figure 2 depicts the four major components of the HIMS system, as well as a sample collection of sources being monitored.
* Figure 3 is a context diagram representation of Figure 2, illustrating a typical HIMS installation on a server behind a firewall. Also depicted is the optional high-volume HIMS configuration, comprising multiple retriever and processing components operating in parallel.
" Figure 4 is a flow diagram presenting an overview of the source retrieval, data extraction and processing cycle for a collection of sources. " Figure 5 is an expansion of figure 4-404, illustrating the procedure for extracting data elements from a source using the source's Source Specification Filter (SSF). " Figure 6 is an expansion of figure 4-405 and illustrates the procedure for optionally pro cessing a data element using one or more plug-ins . " Figure 7 is an expansion of figure 6-602 and shows a sample single plug-in configuration. * Figure 8 is an expansion of figure 6-602 and shows a sample chained plug-in configuration. * Figure 9 is an expansion of figure 6-602 and shows a sample branched, chained plugin configuration. * Figure 10 is an expansion of figure 6-604 and depicts the processing of a data element using an external plug-in . In this instance, the translator plug-in is passed parameters to convert the data element in a foreign language (Indonesian) to its English equivalent. " Figure 11 details the templates and components of the reporting system. * Figure 12 depicts the archiving of three data elements from the CNN source where no processing is required. * Figure 13 depicts the use of a plug-in translator to translate Indonesian title and summary data elements to English prior to archiving. " Figure 14 depicts the use of a plug-in translator to translate the Indonesian title, and a plug-in chain to convert the video summary to English text prior to archiving. " Figure 15 show the arrangement of a branched, chained plugin to issue alerts about property developments prior to archiving. Brief Description of Algorithms * Algorithm 8 is the algorithm for determining when scheduled sources are to be retrieved. * Algorithm 8 is the algorithm for automatically determining retrieval frequencies for a source.
* Algorithm 3 is the algorithm for automatically determining the Area of Interest Fixed Points of Reference (AOIFPRs) within source files.
* Algorithm 4 is the algorithm for extracting Data Element Fixed Points of Reference (DEFPRs) from between Area of Interest Fixed Points of Reference (AOIFPRs) for use with Algorithm 1.
* Algorithm 1 is the algorithm for auto-generating Sparse Parse Language (SPL) instructions for locating and extracting data elements from an Area of Interest (AOI) using only AOIFPRs and DEFPRs.

1.5 Description of Embodiments

[0017] Preferred embodiments of the invention will now be described with reference to various examples of how the invention can best be built and used. References to drawings, tables and algorithms are used throughout this description.

Terminology

[0018] For the purposes of describing the embodiments of the invention, the following invention-specific terminology applies:

[0019] A DATA ELEMENT is an item of interest within a source, and is defined by the user when the collection is first created. A data element can be any text (including multilingual text) or binary data.

[0020] A SOURCE is a text file, binary file, web page, or database that contains one or more data elements of interest. A source is usually available over the internet or network, though a source may also be available locally to the server. A source has an address, usually the full path to the file or the URL of the web page.

[0021] A COLLECTION is a set of sources that contain a set of common data elements.

[0022] Figure 1 illustrates the relationship between these terms, and Table 1 gives some sample collections, with their sample sources and sample data elements.

Overview

[0023] HIMS is a highly adaptable system for managing the extraction, consolidation, archiving and reporting of data elements across sources within a collection.
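The terminology above maps naturally onto a small data model. The following Python sketch (hypothetical class and field names, not taken from the patent) illustrates the relationship: a collection names the common data elements and owns a set of sources.

```python
from dataclasses import dataclass, field

@dataclass
class Source:
    """A file, web page or database holding data elements of interest."""
    address: str                      # full path or URL of the source
    retrieval_interval_s: int = 3600  # how often to retrieve (illustrative)

@dataclass
class Collection:
    """A set of sources sharing a set of common data elements."""
    name: str
    element_names: list[str]
    sources: list[Source] = field(default_factory=list)

# Sample collection, mirroring the "News Media" row of Table 1
news = Collection(
    name="News Media",
    element_names=["Date", "Title", "Link", "Summary"],
    sources=[Source("http://www.cnn.com"), Source("http://www.reuters.com")],
)
print(len(news.sources))  # 2
```

The point of the model is that data elements are declared once per collection, while each source carries only its own address and retrieval schedule.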
Collection    Sources                            Data Elements
News Media    CNN.com, MSNBC.com, Reuters.com    Date, Title, Link, Summary
Real Estate   Realestate.com.au, Domain.com.au   Price, NumRooms, FPlan*, Images*
Map Data      maps.google.com.au                 MapImage*, Comment, Scale, Coords
Audio Data    Network folder of audio files      Date, Comment, Sample*, Length

Table 1: Sample collections, sources and data elements. * denotes binary data elements.

[0024] Figure 2 shows the components of HIMS (200) and how information is passed between them. It shows a set of sources (201) that are retrieved from their locations by the Source Retriever (202), using each source's file path, URL or database location. Retrieval frequencies are automatically and/or manually calculated and assigned to each source, and these retrieval rates vary depending on the source. For example, a time-critical stock price ticker source should be retrieved as often as possible (perhaps every second); a real estate source, less often (perhaps once a day); and a periodical or journal less often still (say, weekly or monthly). Each time a source is retrieved, the Source Retriever locates and extracts data elements within the source and hands them over to the Element Processor (203).

[0025] The Element Processor component performs any processing assigned to data elements on a per-source basis. As the sources in a collection will all be different, not all data elements arriving at the Element Processor will be in the appropriate format. For example, given a collection of stock market sources, a SharePrice data element from a U.S. trading source might be in USD, but the SharePrice data element from a Japanese trading source might be in JPY. Both sources have the same data element in common, but one data element will need to be processed to consolidate the data.
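This per-source consolidation can be sketched as a processing queue keyed by source and element name. The plug-in function, source names and exchange rate below are illustrative assumptions, not part of the patent:

```python
def usd_from_jpy(value, rate=0.0067):
    # Hypothetical conversion plug-in: JPY -> USD at an illustrative rate.
    return round(value * rate, 2)

# Per-source processing queues: element name -> list of plug-ins to apply.
PIPELINES = {
    "tokyo-exchange": {"SharePrice": [usd_from_jpy]},
    "nyse":           {"SharePrice": []},  # already in USD, no processing
}

def process(source, element, value):
    """Apply the source's queue of plug-ins to one data element value."""
    for plugin in PIPELINES[source].get(element, []):
        value = plugin(value)
    return value

print(process("tokyo-exchange", "SharePrice", 1500))  # 1500 JPY as USD
print(process("nyse", "SharePrice", 10.05))           # unchanged
```

Because each queue is just a list of callables, chaining several transformations for one element is the same mechanism as applying one.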
In this instance, the Element Processor will apply a Japanese Yen to US Dollar conversion plug-in process to retrieve the exchange rate and adjust the SharePrice data element from the Japanese source.

[0026] After any processing has completed, the processed data elements are passed to the Element Archives (204) for storage. The Element Archives maintain a record of all data elements retrieved for a collection over time. The archives also maintain a record of any classifiers assigned to entries in the collection's archives. Classifiers are discussed later in the Element Archives section.

[0027] The Element Archives are queried using the Search + Report (205) component. Users (206) access the Search + Report component to perform searches across a collection's data elements.

[0028] Users of the system (206) are able to perform actions including searching, viewing and exporting report and search data to other programs in a variety of formats (XML, RSS, CSV etc.).

Deployment

[0029] Figure 3 shows how the software components from Figure 2 are typically deployed on a server in a client/server model. Access to the Search + Report component (205) is provided through a web server (301) that users (206) can connect to using a computer with connectivity and a web client such as Firefox.

[0030] An administrative interface (302) is also available for authorised users. In addition to providing the basic data view described above, after logging into this interface, users are also able to create and manage collections, assign sources and data elements to collections, adjust source retrieval frequencies, build and assign process sequences to data elements, edit or remove archived entries, assign classifiers to archived elements and build reports across collections using said classifiers.

[0031] Also depicted is the optional high-volume HIMS configuration, comprising multiple retriever (202) and processing (203) components operating in a highly scalable parallel configuration.
These retriever/processing pairs can be installed on separate servers, divide the source lists among themselves, and send processed data elements back to the HIMS server many times more quickly than in the single-server configuration.

Source Retriever Detail

[0032] Figure 4-401 finds any sources that are due to be retrieved. The source retriever records the time when it last retrieved each source, and it also records how long to wait between source retrievals (the interval is either automatically calculated using Algorithm 8, or specified manually by the user). It is therefore a simple matter to calculate which sources need to be retrieved at any point in time using Algorithm 8.

[0033] (402/403) determines whether the contents of the source have changed since the last time it was retrieved. At collection creation time, the user can nominate whether a data element should ignore and skip successive duplicate element values, or process and store duplicate values in the archives as usual. If the user is harvesting news headlines, duplicate headlines appearing in the archives would be a problem. However, if the user is monitoring a data element over time for future graphing or a similar purpose, such as harvesting a share price, then successive duplicate share prices would be considered valid data points on the graph.

[0034] (404) locates and extracts data elements from the source. It does this by maintaining an automatically (though sometimes manually) generated Source Specification Filter (SSF) for each source. The SSF is generated from an initial analysis of the source when it is first added to the collection. This generated SSF tells the Source Retriever where each data element is located within the source. However, locating data elements in sources is rarely simple, because often much of the content of a source will change between retrievals. Consider a news source where the headlines, links and summaries change several times a day.
This makes it very difficult to locate and extract data elements reliably and accurately.

[0035] HIMS simplifies the problem by first identifying the most likely important part(s) of the source file, and stripping out everything else from the source. Algorithm 3 returns the Fixed Points of Reference (FPRs) delimiting the most useful-looking Areas of Interest (AOIs) within the source file. The FPRs are parts of the source file that do not change during successive source retrievals. Once the FPRs have been identified, everything else in the source can be removed.

[0036] For non-web sources, such as binary files, the user can manually trim the file by dragging a rectangle around a graphical representation of the file with their mouse, thereby marking left and right offsets for trimming within the file.

[0037] Once trimmed, the remainder of the file is scanned for Fixed Points of Reference (FPRs). These are points in the remaining source file that do not change, despite the content around them changing. FPRs are automatically located by examining source retrievals over two or more intervals and recording the regions that do not change. For text files, these regions are tokens. Data elements can then be extracted accurately by describing their location relative to the FPRs. The remaining source content is then examined by Algorithm 4, which locates the FPRs delimiting the data elements within the AOIs. Once the data element delimiters are located, the Source Retriever automatically generates a Sparse Parse Language (SPL) set of instructions for locating and extracting the data elements using the FPRs (ie: sparsely-located tokens).

[0038] SPL is a type-less programming language, unique to, and developed specifically for, HIMS. It is the behind-the-scenes engine of the Source Retriever and Element Processor components.
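The core of the FPR idea, recording the regions of a source that stay unchanged across two retrievals, can be approximated in a few lines with Python's standard difflib. This is a crude stand-in sketch, not the patent's Algorithm 3:

```python
from difflib import SequenceMatcher

def find_fprs(old, new, min_len=4):
    """Return substrings common to two retrievals of a source, in order.

    Anything that survives unchanged between retrievals is a candidate
    Fixed Point of Reference; the changing text between FPRs is the data.
    """
    sm = SequenceMatcher(None, old, new, autojunk=False)
    return [old[m.a:m.a + m.size]
            for m in sm.get_matching_blocks()
            if m.size >= min_len]

# Two retrievals of the same (hypothetical) headline markup: the HTML
# scaffolding is stable, the headline text changes.
old = '<li class="headline">Nations pledge 200M</li>'
new = '<li class="headline">Taliban agreement</li>'
print(find_fprs(old, new))
```

Here the stable markup before and after the headline is recovered as the two FPRs, and a data element can then be described as "the text between them", regardless of what that text currently says.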
[0039] Given a set of simple instructions and a source file to operate on, the SPL interpreter moves a "virtual cursor" (VC) between sparsely-located text or binary tokens located within the file, marking, extracting and transforming the data between tokens according to the instructions. Consider the analogy of using a keyboard to move the cursor around a Microsoft Word document, holding down SHIFT to select important text.

[0040] Moreover, by referencing only the FPRs within a file, the remaining data in the file can change entirely, yet still remain locatable and extractable by the Source Retriever.

[0041] By design, SPL accepts a very small set of instructions or keywords. This is because the Source Retriever has to be able to write its own SPL instructions to parse a source. Writing a program to write a program is a daunting task, one that is simplified with fewer instructions. Users of the system will be able to modify a source's SPL, so it is important that the available instructions or keywords are few, simple and intuitive. The most common keywords are described in Table 2.

Keyword                Description
mark on                Turns on marking at the current position of the "Virtual Cursor" (VC)
mark off               Turns off marking at the current position of the VC
move-after (A)         Moves the VC immediately after the next occurrence of A in the file
move-before (A)        Moves the VC immediately before the next occurrence of A in the file
store (A)              Assigns marked data to variable A
move (N)               Moves the VC the specified number of bytes (positive or negative)
isin (B,A)             Determines if A occurs in B
plugin (A,N)           Applies plug-in N to variable A
for (cond) {...}       For loop statement
if (cond) {...}        Conditional statement
while (cond) {...}     While statement
do {...} while (cond)  Do/while statement

Table 2: Sparse Parse Language (SPL) common keywords

[0042] As an example, consider a source file that contains only the words:

The quick brown fox jumps over the lazy dog

This source can be manipulated in any number of ways using SPL.
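The virtual-cursor semantics of Table 2 can be mimicked with a small Python stand-in (hypothetical class and method names; the real SPL interpreter is not reproduced here). It walks the same sample sentence and extracts "quiche":

```python
class VC:
    """Minimal sketch of an SPL-style virtual cursor over a text source."""
    def __init__(self, text):
        self.text = text
        self.pos = 0            # cursor offset into the source
        self.mark_start = None  # offset where "mark on" was issued
        self.marked = ""        # data captured by the last mark on/off pair

    def mark_on(self):
        self.mark_start = self.pos

    def mark_off(self):
        self.marked = self.text[self.mark_start:self.pos]
        self.mark_start = None

    def move_before(self, token):
        self.pos = self.text.index(token, self.pos)

    def move_after(self, token):
        self.pos = self.text.index(token, self.pos) + len(token)

    def move(self, n):
        self.pos += n

    def store(self):
        return self.marked

vc = VC("The quick brown fox jumps over the lazy dog")
vc.move_before("q"); vc.mark_on()
vc.move_before("k"); vc.mark_off()
a = vc.store()                       # "quic"
vc.move_before("the"); vc.move(1)    # just past the "t" in "the"
vc.mark_on(); vc.move_after("e"); vc.mark_off()
c = vc.store()                       # "he"
print(a + c)                         # prints "quiche"
```

The same mark/move/store vocabulary is all the Source Retriever needs to emit when it writes extraction programs for itself.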
Listing 1 shows example SPL code to extract the word "quiche" from this source text, and illustrates the use of many of the SPL keywords.

Listing 1: Sample SPL program for extracting "quiche" from the source

 1  move-before (''q'');
 2  mark on;
 3  move-before (''k'');
 4  mark off;
 5  store (A);
 6  B = ''the'';
 7  move-before (B);
 8  move (1);
 9  mark on;
10  move-after (''e'');
11  mark off;
12  store (C);
13  print (A+C);

Line 1: The "Virtual Cursor" (VC) is moved from offset 0 to before the "q" in "quick" (offset 3)
Lines 2-5: Store the data "quic" in buffer A
Lines 6-8: Move the VC immediately after the "t" in "the"
Lines 9-12: Store the data "he" in buffer C
Line 13: The + operator concatenates buffers A and C, printing "quiche"

Element Processor Detail

[0043] Figure 6 shows the operation of the Element Processor. The Element Processor takes a retrieved data element, examines the processing sequence assigned to that data element by the user when the collection is first established, and then applies the processing sequence to the data element to transform its contents prior to archiving. The processing sequence for a data element consists of one or more plug-ins arranged in sequence.

[0044] A plug-in is any external process that can be applied to any data element (or source). A plug-in can be any locally installed application or process, or an application available remotely over a network, such as Google Translate. It is important to note that a plug-in has no knowledge that it is being used by HIMS. Conversely, HIMS does not require plug-ins to be specifically written for it, and instead can harness products across the entire software industry and its emerging new technologies. This means that as a plug-in improves, so too does the information in the collection. Similarly, if a plug-in is under-performing, it can easily be swapped out for a better technology. Table 3 shows some sample plug-ins from thousands of possibilities.
Plug-in   Name                Description
a         Text translator     Translates a text data element from one language to another
b         Shape recognition   Determines if an image data element contains a particular shape
c         Audio extractor     Splits the audio stream from a video data element
d         Speech-to-text      Converts audio data elements to text
e         Map-to-image        Converts map data elements to images
f         Voice recognition   Determines if an audio data element contains a particular voice
g         Image extractor     Extracts still images (frames) from video footage
h         Many others

Table 3: Sample external plug-ins

[0045] Figure 7 shows a plug-in (702) being applied to a data element (701) to transform the element's contents (703). If the plug-in listed (702) were an Indonesian text translator, and the data element was retrieved from an Indonesian newspaper, then the data element before processing (701) might contain the words "Polda Papua amankan". If so, the data element after processing (703) would contain its translation, the words "The Regional police of Papua".

[0046] Each plug-in in use by HIMS has a Plug-in Specification, defined by the user when the plug-in is first added to HIMS. The Plug-in Specification details the input parameters for a plug-in, if any, the data element-to-parameter mapping, and the output format of the plug-in. A plug-in is specified only once; it is then available for use with any data element (or source) across collections.

[0047] Figure 10 is the detailed view of how a plug-in is applied internally, and is an expansion of 6-604. It shows what happens when the example Indonesian text translator above is applied to the first data element (DE1). First, the contents of DE1 before processing (1001) and the Source Specification Filter (SSF) (1002) for the source are sent to the Process Controller (1003). The SSF tells the processor to use the Indonesian translator plug-in to process DE1.
The Process Controller then tells the Plug-in Manager (1004) to prepare the Indonesian translator plug-in and sends DE1 to the Plug-in Manager.

[0048] The Plug-in Manager loads the Plug-in Specification for the translator (1005), which tells it three things: the plug-in takes one parameter as input, the supplied data element (from the Process Controller) maps to that input parameter, and the output from the plug-in maps back to the supplied data element. The Plug-in Manager then makes a call to the translator (1006) with DE1 as the input parameter. The output from the translator is returned to the Plug-in Manager (1007), which tells the Process Controller to assign the value back to DE1 (1008), as specified by the Plug-in Specification. The Process Controller then updates the value of DE1 (1009), which now contains the translated text ready for archiving.

[0049] The real power of HIMS lies in its ability to chain multiple plug-ins together into a processing sequence, such that the output from one plug-in is used as the input for the next. Such a sequence is called a plug-in chain, and some example chains and their applications are listed in Table 4.

Plug-in chain sequence  Description
e + b                   Determines if a map contains an object of a particular shape
g + b                   Determines if a video contains footage of a particular person
c + f                   Determines if the voice of a particular person is speaking in a video
c + d + a               Generates automatic English transcripts of foreign language videos
Many others
Table 4: Sample plug-in chains using plug-ins from Table 3

[0050] Figure 8 shows three plug-ins joined in a sample chain to produce translated transcripts of foreign-language videos (say, Arabic). If the data element before processing (801) contained video footage of someone speaking in Arabic, and the first plug-in was an audio extractor plug-in, then the data element after applying the first plug-in (803) would contain only the audio component of the video, in Arabic.
This audio data can then be fed into the second plug-in (804), say an Arabic voice-to-text plug-in, after which the data element contains the text component of the audio in Arabic (805). Lastly, the Arabic text can be fed into a third plug-in, say an Arabic-to-English text translator (806), and the data element after processing (807) contains the English text translated from the Arabic.

[0051] As a variation, chained plug-ins can also have conditions, in which a plug-in or plug-in chain is applied only if a data element meets a specified condition. Such an arrangement is called a branched plug-in chain. Figure 9 shows a sample branched chain added to the end of the chain in Figure 8. Here the translated English text from (805) is subjected to a condition (901), say a keyword list. The plug-in (902) will only execute if the text from (805) contains a word in the list of keywords, say "Osama Bin Laden". If plug-in (902) is a mobile phone SMS issuing application, then the recipient will receive a text message alert whenever an Arabic video mentions "Osama Bin Laden".

Element Archives

[0052] The Element Archives provide a historical record of a collection's data elements over time. Users with sufficient privileges can sort, modify and vet data elements in the archives as part of the audit process. Archived data can also be searched, reported on and exported to other applications. Table 5 shows the internal representation of the archives for a small collection of news articles with just three sources (CNN, Antara, and Aljazeera) and three data elements (Title, Link and Summary). Depicted are the contents of the data elements after their first retrieval.

Source     Date           Title                       Link                 Summary
CNN        20110901-0900  Nations pledge 200M..       www.cnn.com/a        Donor countries have..
Antara     20110901-0905  Regional police of Papua..  www.antara.co.id/b   The regional police..
Aljazeera  20110901-0910  Taliban agreement..         www.aljazeera.com/c  The Taliban in Afghani..
Table 5: View of Element Archives after first retrieval

[0053] After retrieving the data elements for a second time, the Element Archives will look like Table 6.

Source     Date           Title                       Link                 Summary
CNN        20110901-0900  Nations pledge 200M..       www.cnn.com/a        Donor countries have..
CNN        20110902-0900  Crime on rise in..          www.cnn.com/d        Burglaries have..
Antara     20110901-0905  Regional police of Papua..  www.antara.co.id/b   The regional police..
Antara     20110902-0905  Opinions divided on..       www.antara.co.id/e   The new development..
Aljazeera  20110901-0910  Taliban agreement..         www.aljazeera.com/c  The Taliban in Afghani..
Aljazeera  20110902-0910  Political maneuvers..       www.aljazeera.com/f  It is reported that..
Table 6: View of Element Archives after second retrieval

[0054] Binary data elements are handled only slightly differently to text data elements; their content can be saved to disk and the corresponding file name stored in the archives.

[0055] Classifiers are used to attach additional information to archived data, which can later be used as parameters in searches and reports. A classifier is a set of classifications created by the user when a collection is created, and assigned to rows of data in the archives. This assignment can be automatic, after processing is complete, or performed manually by a user when vetting archived data. Each row of data in the archives can be assigned one or more classifications from each classifier set. For example, a collection of news items might be assigned the classifiers in Table 7.

Classifier  Classifications
Topic       Current Events, Economy, Science, Technology...
Importance  Very, Somewhat, Not Important
Vetted      Yes, No
Table 7: Classifier groups and classifications

[0056] Note that a news item can be assigned many classifications from the "Topic" classifier, but only one from the "Importance" and "Vetted" classifiers; hence classifiers have a cardinality.
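The plug-in chains and branched chains described above reduce to a simple composition pattern. The Python sketch below is illustrative only — the function names and the string-returning stand-in plug-ins are assumptions, not HIMS internals — but it shows the output of one plug-in feeding the next, and a branch that runs only when its condition is met:

```python
def apply_chain(element, chain):
    # A plug-in chain: each plug-in's output becomes the next one's input.
    for plug_in in chain:
        element = plug_in(element)
    return element

def branched(condition, chain):
    # A branched chain: the sub-chain runs only if the element meets the condition;
    # otherwise the element passes through unchanged.
    def plug_in(element):
        return apply_chain(element, chain) if condition(element) else element
    return plug_in

# Hypothetical stand-ins for plug-ins c, d and a from Table 3:
audio_extractor = lambda video: f"audio({video})"
speech_to_text  = lambda audio: f"text({audio})"
translator      = lambda text:  f"english({text})"

alerts = []
sms_alerter = lambda text: (alerts.append(text), text)[1]  # stand-in for plug-in 902

# The chain "c + d + a" from Table 4, with a keyword-conditioned alert branch:
chain = [audio_extractor, speech_to_text, translator,
         branched(lambda text: "Osama Bin Laden" in text, [sms_alerter])]

apply_chain("arabic-video mentioning Osama Bin Laden", chain)
apply_chain("arabic-video about the weather", chain)
print(len(alerts))  # 1: only the transcript containing the keyword triggered an alert
```

Because a branched chain is itself just another plug-in here, branches nest inside chains to arbitrary depth, which matches the drag-and-drop arrangement described for Figures 8 and 9.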
[0057] Once classifiers have been assigned to entries in the archives, the classifiers can be used as criteria for searches and reports on archived data elements, in addition to the data elements themselves.

Search and Reports

[0058] Authorised users can search and view the contents of the Element Archives using combinations of boolean search terms. Combinations of criteria can be specified for data elements and any assigned classifiers. As, conceptually, a search is really just a single type of report with a fixed layout, attention will now focus on the reports system.

[0059] Figure 11 shows the reporting process. In it a user interacts with the Data Specifier Tool (DST) (1101) to retrieve a slice of data (1102) from the Element Archives (1103), which is sent to the Templating System (1104). The Templating System builds a report from the user-created layout template (1105) and style template (1106). The user is then able to view the generated report (1107), and can then choose to export the contents of the report (1108). The reporting process will now be described in detail.

[0060] The Data Specifier Tool (DST) (1101) is a graphical tool for selecting parameters and ranges to apply to searches and reports. It supports boolean operators such as 'AND', 'OR' and 'NOT', and other functions such as 'NEAR'. As an example, a user could use the DST to specify the following query for a collection of news articles: "Show me the link and title for all articles published in September 2011 that contain the word 'police' near the word 'crime', or the word 'arrest' near the word 'trial', in the article summary, and that have an Importance classification of 'Very'". The data that matches the DST criteria (1102) is sent to the Templating System (1104) for display. The DST can also take advantage of plug-ins and plug-in chains to perform additional operations on report elements prior to forwarding them to the Templating System. The Templating System uses two templates to prepare the report.
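As a rough illustration of how the DST query quoted above might evaluate against archive rows, the sketch below implements a naive NEAR operator. The five-word window, the function names and the sample rows are all assumptions made for illustration; they are not HIMS internals:

```python
import re

def near(text, a, b, window=5):
    """True when words a and b occur within `window` words of each other
    (the window size is an assumed default, not specified by HIMS)."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

def matches(row):
    # "'police' NEAR 'crime' OR 'arrest' NEAR 'trial' in the summary,
    #  AND Importance classification is 'Very'"
    s = row["Summary"]
    return (near(s, "police", "crime") or near(s, "arrest", "trial")) \
        and row["Importance"] == "Very"

rows = [
    {"Summary": "Police arrest suspects as crime rises", "Importance": "Very"},
    {"Summary": "The new development project",           "Importance": "Very"},
]
print([matches(r) for r in rows])  # [True, False]
```

The matching rows would then be the "slice of data" (1102) handed to the Templating System.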
The Layout Template (1105) provides the shell structure into which the report data is inserted. It specifies where to position report elements in the report, whereas the Style Template (1106) specifies how to display the report data, including colors, fonts, borders and the like. The combination of the two produces the finished report (1107) for viewing by the user, or for saving or exporting to other programs in a variety of formats (1108).

Examples

[0061] EXAMPLE 1

[0062] This example will show how HIMS works to extract and consolidate data elements across a small collection of Internet sources, with each source having a different layout and data elements in different formats. Typically HIMS would be working across many thousands of sources in many collections.

[0063] In this scenario a user wants to set up a collection to retrieve and monitor news articles from a few news sources. To simplify the example the user has opted to monitor three news sources, each with varying content:

* Cable News Network (CNN) - www.cnn.com - Roman characters - English text
* ANTARA News Agency - www.antara.co.id - Roman characters - Indonesian text
* Aljazeera - www.ajazeera.net/portal - Arabic characters - Arabic video

[0064] The first step in setting up a collection is to identify the data elements of interest. Each source in this collection of news sites has a number of common data elements that could be added to the collection, including: article title, link, summary, article text, author, and date published. However, in the interests of simplifying the example, the user has opted to add only three data elements to their collection: "title", "link" and "summary". The user then identifies the data types of these data elements as they are to be stored in the archives. In this instance, "title" is of type text, "link" is of type URL, and "summary" is also of type text.

[0065] The second step in setting up a collection is to identify and set up sources for the collection.
The user proceeds to add the three sources to the collection and specifies either how often each source should be retrieved, or lets the system determine the retrieval rate automatically. In this example the user specifies a retrieval rate of once every hour.

[0066] When a source is first added (or if it has changed significantly since it was last retrieved), the areas where data elements are likely to occur (areas of interest, or AOIs) within each source are determined automatically by HIMS and the Source Specification Filter (SSF) is created accordingly. AOIs can also be adjusted manually by the user by selecting areas of the source with the mouse. Lists of sources and data elements can also be imported from external files, though this won't be covered in this example.

[0067] As each source is added, any additional processing for the source's data elements is determined and appropriate plug-ins are assigned where required (automatically where possible, manually otherwise) to consolidate the data into the format specified for the archives. This processing information is also added to the SSF.

[0068] Figure 12 shows how the CNN source is handled by HIMS. The CNN web page (1201) is retrieved by the Source Retriever (202), which locates and extracts the data elements from the source. Looking at the CNN web page, it is clear that the title (1203), link (1204) and summary (1205) data elements are already in English text, and hence are already in the format required for archiving (assuming the archives for this collection are to be in English - though any language is possible). This means the CNN data elements require no processing by the Element Processor (203), so the data can go straight from the Element Retriever (202) into the Element Archives (204).

[0069] Looking at the ANTARA web page, some processing will be needed. Figure 13 shows how this processing is handled.
The ANTARA web page (1301) is retrieved by the Source Retriever (202), which extracts the three data elements. In this instance, the title (1302) and summary (1304) are in Indonesian, while the link (1303) is in the correct format. The Element Processor (203) knows from the SSF to apply an Indonesian-to-English plug-in (1305, 1306) to the title (1302) and summary (1304) elements to transform them to English prior to sending them to the Element Archiver (204). Similarly, the SSF tells the Element Processor that nothing needs to be done to the link element, which is sent directly to the archives. To be clear, the title data element in the archives (1307) contains the English translation of the title element before processing (in Indonesian) (1302). The link element in the archives (1308) remains unchanged from when it was first retrieved (1303). And lastly, the summary element in the archives (1309) contains the English version of the summary element before processing (1304). This ensures English text is maintained in the archives.

[0070] The Aljazeera source is more complicated still. While the title is in Arabic text, the article summary is read, in Arabic, by a newsreader in a video. The URL, however, is again in the correct format for the archives. The user arranges a combination of plug-ins and plug-in chains in the sequence shown in Figure 14 to consolidate the Arabic title and Arabic video into English text ready for archiving. This is done by simply dragging and dropping the plug-ins into the processing queue. Working from left to right, once again the Aljazeera source (1401) is retrieved by the Source Retriever (202) and has the data elements extracted. At this stage the title data element (1402) is in Arabic text, the link (1403) is in the correct format, and the summary data element (1404) contains a video of a newsreader speaking in Arabic.
The Element Processor (203) applies an Arabic-to-English translator plug-in (1405) to the title data element (1402) and sends it to the archives. It then sends the URL directly to the archives (1408). Lastly, the Element Processor (203) applies a plug-in chain to the summary data element (1406) before sending the result to the archives (1409).

[0071] The plug-in chain for processing the summary consists of three plug-ins. Figure 8 shows an expanded view of this chain. Again from left to right, the video element (801) is sent to an audio extractor plug-in (802) to strip the video from the data element, leaving only the Arabic audio of the newsreader. The audio is then passed to a voice-to-text plug-in (803), resulting in an Arabic text transcript of the original video. Lastly, the text transcript is passed to an Arabic-to-English text translator plug-in (804), which converts the data element to English (805). The English summary is then ready for archiving.

[0072] Table 8 shows the contents of the archives after three hours (and three retrievals) have occurred. Note that despite the varying languages of the original sources, the final archives are consolidated and contain only English text.

Source     Time   Title                  Link                 Summary
CNN        09:00  Nations pledge 200M..  www.cnn.com/a        Donor countries have..
CNN        10:00  Eurozone moment of..   www.cnn.com/b        European nations have..
CNN        11:00  Iran may review..      www.cnn.com/c        Iranian lawmakers want..
ANTARA     09:00  Producers asked..      www.antara.co.id/d   Music industry..
ANTARA     10:00  RI must raise..        www.antara.co.id/e   Indonesia should..
ANTARA     11:00  Indonesia hails..      www.antara.co.id/f   After high-profile efforts..
ALJAZEERA  09:00  Arab League susp..     www.aljazeera.com/g  Iran has strongly..
ALJAZEERA  10:00  Protecting nature's..  www.aljazeera.com/h  The endangered..
ALJAZEERA  11:00  Monti seeks to..       www.aljazeera.com/i  Berlusconi's successor..
Table 8: Example collection archive view after three retrievals

[0073] EXAMPLE 2

[0074] This example illustrates the versatility of HIMS by explaining how it can be applied to a completely different scenario. It is more a conceptual example, so some of the detail is omitted. In this scenario, the user is not concerned with news, but with property. Specifically, the user is a property mogul with a large number of properties around the world, and would like to know when any buildings are built or new construction occurs near any of their properties.

[0075] The user needs only one data element for this scenario: a map of the neighbourhood around a property, which they can get directly from Google Maps.

[0076] Next the user adds sources to the collection. The source locations in this example are the URLs of the Google maps for each property's neighbourhood at a certain zoom level (say, zoomed enough to give a 500x500 metre view of the neighbourhood around the property).

[0077] The user determines that three plug-ins are required: a map-to-image converter, an image differencer and a mobile SMS or email alerter plug-in. The user arranges them as shown in Figure 15. Describing Figure 15 from left to right, the Google map source (1501) is retrieved by the Source Retriever (202), which extracts the map element (1502) from the source. The map is then sent to the map-to-image converter (1503), which retrieves a corresponding satellite image for the map region.

[0078] The satellite image is then passed to the image differencer (1504), which compares it to the last retrieved satellite image of the region in the archives. If the latest satellite image is different, the SMS or email alerter plug-in is activated, notifying the user that a neighbourhood map has changed, and hence that possible construction or development is occurring in the neighbourhood.

[0079] The satellite image (1506) is then sent to the Element Archives (204).
In this way, a record of images of each property's neighbourhood is built up over time in the archives. This image data is ideal, for example, for building a future time-lapse animation using an animation plug-in within a report. Such an animation would clearly show the neighbourhood changes over time for each property.
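The Example 2 pipeline can be sketched under some simplifying assumptions: a content-hash comparison stands in for a real image differencer (which would tolerate noise and compare pixel regions), and plain lists stand in for the Element Archives and the SMS/email alerter:

```python
import hashlib

def image_differs(new_image: bytes, archive: list) -> bool:
    """Sketch of the image-differencer step (1504): compare the latest
    satellite image against the last archived one, then archive it."""
    new_digest = hashlib.sha256(new_image).hexdigest()
    changed = bool(archive) and archive[-1] != new_digest
    archive.append(new_digest)
    return changed

alerts = []   # stand-in for the SMS/email alerter plug-in (1505)
archive = []  # stand-in for the Element Archives (204)

# Three retrievals of the neighbourhood image; only the third differs.
for image in [b"week1-image", b"week1-image", b"week2-image"]:
    if image_differs(image, archive):
        alerts.append("neighbourhood changed")

print(alerts)  # ['neighbourhood changed'] -- only the real change raised an alert
```

An unchanged retrieval adds to the archival record without alerting, which matches the behaviour described in paragraphs [0078] and [0079].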
Reference Signs List

200  The HIMS Harvesting & Information Management System
201  A set of sources in a collection
202  Element Retriever component
203  Element Processor component
204  Element Archives component
205  Search + Report component
206  HIMS user
301  A web server
302  HIMS administrative interface
303  A router / firewall
304  Network or internet
401  Process: Retrieve next scheduled source
402  Process: Test if source has changed
403  Process: Check if source can have duplicates
404  Process: Extract data elements from source
405  Process: Apply processing to data elements
406  Process: Archive processed elements
407  Process: Wait until next source is scheduled
501  Process: Check if source has been retrieved before
502  Process: Get SSF for source
503  Process: Generate SSF for source
504  Process: Extract data elements using SSF
601  Process: Get next data element
602  Process: Get processing list for element
603  Process: Determine if processing required for data element
604  Process: Apply process to element
605  Process: Determine if more elements to process
606  Process: Archive processed elements
701  Indonesian data element before processing
702  Plug-in Indonesian-English translator
703  Data element translated to English
801  A data element containing an Arabic newsreader video
802  An audio extractor plug-in
803  A voice-to-text plug-in
804  An Arabic-to-English translator
805  English transcript of Arabic news report
901  A pattern matching plug-in
902  An alert-issuing plug-in
903  A mobile device
904  Unchanged data element
1001 Indonesian title data element
1002 Source Specification Filter
1003 Indonesian-to-English translator plug-in
1004 Data element-to-parameter alignment
1005 Plug-in output is captured by plug-in manager
1006 Plug-in manager passes new output back to Process Controller
1007 Data element is now in English
1100 Reports user
1101 Data Specifier Tool (DST)
1102 Slice of data from Element Archives
1103 Element Archives
1104 Templating System
1105 Layout Template
1106 Style Template
1107 Finished Report
1108 Exporting System
1201 CNN Source
1203 CNN Title data element
1204 CNN Link data element
1205 CNN Summary data element
1208 Unchanged data element from 1203
1209 Unchanged data element from 1204
1210 Unchanged data element from 1205
1301 ANTARA Source
1302 ANTARA Title data element in Indonesian
1303 ANTARA Link data element
1304 ANTARA Summary data element in Indonesian
1305 Plug-in Indonesian-English translator
1306 Plug-in Indonesian-English translator
1307 English translation of Title data element (1302)
1308 Unchanged Link data element from 1303
1309 English translation of Summary data element (1304)
1401 Aljazeera Source
1402 Aljazeera Title data element in Arabic
1403 Aljazeera Link data element
1404 Aljazeera Summary data element (Arabic video)
1405 Plug-in Arabic-English translator
1406 Chained plug-ins for transcript generation
1407 English translation of Title data element (1402)
1408 Unchanged Link data element from 1403
1409 English transcript of Arabic video (1404)
1501 Map source
1502 Map data element of neighbourhood
1503 Plug-in map-to-satellite image converter
1504 Plug-in image differencer
1505 Plug-in SMS/Email alerter
1506 Satellite image of map for Element Archives
Table 9: Reference signs list

Claims (8)

1. A computer implemented method for managing the retrieval, extraction, processing and reporting on text or binary data elements across web sites, files, databases and other sources, said sources potentially containing varying content, layout and / or form, the method comprising: providing at the computer a client-based interface for users to manage sets of sources and data elements, said management comprising the creation, editing and deletion of sources and / or sets of sources, the retrieval, processing and / or classification of data elements from said sources, and / or the reporting, searching and / or exporting of data elements from archives of processed elements and the management of said archives; providing at the server(s) a module for the retrieval, and the determination of optimal retrieval rates, of a given source file within a set of source files; providing at the server(s) a module for determining automatically, or specifying manually, the location of data elements of interest within a source, and extracting said data elements from said retrieved source file; providing at the server(s) a module for optionally applying additional processing to extracted data elements; providing at the server(s) a module for storing binary and / or textual data elements into database archives containing representations of source data elements over time.
2. The method of Claim 1, wherein the client-based interface further comprises the operation of optionally assigning programmatic instructions, either as textual or as graphical representations of programmatic instructions, to control the locating and processing of data elements within a source.
3. The method of Claim 1, wherein the operation of the module for the retrieval and determination of optimal retrieval rates further comprises the automatic calculation of an optimal retrieval rate for a given source file within a set of sources by determining the average interval of time elapsed between successive source changes.
4. The method of Claim 1, wherein the operation of the module for the locating and extracting of data elements from a source further comprises a module for automatic or manual analysis of said source, to automatically or manually generate programmatic instructions for the extraction and optional processing of data elements of interest from said source.
5. The method of Claim 4, wherein the extraction of relevant data elements further comprises a method of extracting parts of a file using programmatic compiler or interpreter instructions to move one or more "virtual cursors" within said file and between determined delimiters, relative byte offsets, tokens and / or determined fixed points of reference, and marking areas of said file for extraction using said virtual cursors.
6. The method of Claim 4, wherein the operation of the module for the automatic locating of data elements from a source further comprises the means of automatically identifying the area of a text source most likely to contain data elements of interest by: stripping text not in the set of delimiters from the source, locating the first occurring, non-recurring delimiter before, and the first occurring, non-recurring delimiter after, the area of the file containing the nth longest recurring substring, then reinserting the stripped text that fell within the area bounded by the two non-recurring fixed point of reference delimiters, such that much of the content within the bounds can change without affecting the ability to locate and extract data elements of interest.
7. The method of Claim 1, wherein the module for processing data elements of interest further comprises a method for sequentially, iteratively and / or conditionally, or combinations thereof, applying processing to data elements of interest using external processes and / or programs known but external to the invention, thereby transforming said data element and / or performing some additional function.
8. The method of Claim 5, wherein the data elements of interest are automatically extracted from the determined most likely Area Of Interest (AOI) within a source by a method of subdivision of the AOI around the set of delimiters within the AOI, and the assignment of subdivided, delimited text to likely data elements of interest based on one or more specified relative data element properties.
AU2011101565A 2010-11-01 2011-11-30 Harvesting and Information Management System (HIMS) Ceased AU2011101565A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2011101565A AU2011101565A4 (en) 2010-11-01 2011-11-30 Harvesting and Information Management System (HIMS)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
AU2010904862 2010-11-01
AU2010904862A AU2010904862A0 (en) 2010-11-01 Whim Reaper
AU2011101565A AU2011101565A4 (en) 2010-11-01 2011-11-30 Harvesting and Information Management System (HIMS)

Publications (1)

Publication Number Publication Date
AU2011101565A4 true AU2011101565A4 (en) 2012-04-12

Family

ID=46605523

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2011101565A Ceased AU2011101565A4 (en) 2010-11-01 2011-11-30 Harvesting and Information Management System (HIMS)

Country Status (1)

Country Link
AU (1) AU2011101565A4 (en)


Legal Events

Date Code Title Description
NB Applications allowed - extensions of time section 223(2)

Free format text: THE TIME IN WHICH TO ASSOCIATE WITH A COMPLETE APPLICATION HAS BEEN EXTENDED TO 01 DEC 2011.

FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry