EP1461725A1

EP1461725A1 - Method and apparatus for information retrieval

Info

Publication number: EP1461725A1
Application number: EP02779016A
Authority: EP
Inventors: Kathleen Phelan
Original assignee: Web-Track Media Pty Ltd
Current assignee: Web-Track Media Pty Ltd
Priority date: 2001-11-27
Filing date: 2002-11-27
Publication date: 2004-09-29
Also published as: EP1461725A4; WO2003046755A9; NZ533730A; WO2003046755A1; CA2507279A1; AUPR914601A0

Abstract

A method for automated search and retrieval of information available on a networked database, the method including the steps of providing search topic information, providing a target information resource location, spidering or dividing the target information resource location for further resource locations that are likely to lead to relevant information, and retrieving information from the target information resource location or from a relevant one of the further resource locations.

Description

METHOD AND APPARATUS FOR INFORMATION RETRIEVAL

FIELD OF THE INVENTION

This invention relates to information retrieval, and is directed primarily but not solely to automated retrieval and analysis of information available on the Internet or similar sources such as databases, internal networks and intranets.

BACKGROUND OF THE INVENTION

Computer databases, internal networks, intranets, networks and, in particular, the network of networks commonly referred to as the Internet have resulted in vast amounts of information being publicly available on those sources. However, there is, for example, no single organised and completely up-to-date repository or index of all information on the Internet.

To be useful, information must be relevant and timely. The Internet makes information easy to access, but it can be a very difficult task to fully canvass the Internet to find all information that is relevant to a particular topic or range of topics. Also, with information being accumulated and changed so rapidly in the Internet environment, even if extensive searching is performed manually, the time taken to search in this manner means that the results are quite likely not to be fully up to date.

There are a number of Internet search engines, such as "Yahoo™" for example, which attempt to provide a user friendly search facility for information on the Internet or similar databases. However, these search engines try to cover a full range of topics from many disparate sources and are therefore not continually up to date. They also index on a frequency of only 4 to 12 weeks.

OBJECT OF THE INVENTION

It is an object of the present invention to provide methods or apparatus for information retrieval and/or analysis and/or user information alerts which will at least go some way toward overcoming disadvantages of known apparatus and methods, or which will at least provide the public with a useful choice.

Throughout this specification, where there is a description with reference to the Internet, it should be appreciated that the invention is applicable also to databases, internal networks, intranets and the like.

SUMMARY OF THE INVENTION

In one broad aspect the invention provides a method for automated search and retrieval of information available on a networked database, the method including the steps of

providing search topic information,

providing a target information resource location,

spidering or dividing the target information resource location for further resource locations that are likely to lead to relevant information, and

retrieving information from the target information resource location or from a relevant one of the further resource locations.

Preferably the network is the Internet.

Preferably the retrieved information is analysed.

Preferably an alert is provided to an entity as a result of the analysis.

In another broad aspect the invention provides an automated information search and retrieval system in which real time selection and retrieval of the information occurs.

Preferably the system includes provision for archiving the retrieved information in a readily accessible manner.

It is preferred that the information is searched and retrieved from the Internet.

In a further aspect the invention provides a method for automated searching and retrieval of information, performing real time selection and retrieval of the information.

Preferably the information is archived for subsequent analysis.

The method preferably includes the step of establishing one or more target resource locations from which information is to be searched and retrieved.

Furthermore, the target location preferably includes a URL which is spidered by the system to identify underlying links.

Preferably the spidering step is performed in a plurality of passes, each pass being targeted toward certain links, and each pass ignoring links that are unlikely to be relevant.

Preferably the method includes the step of retrieving information from links that appear relevant.

Preferably the method includes the step of assigning or attaching metadata to each item of information to create a database record.

Preferably the database records are archived.

Preferably retrieved information which is not in a textual format is converted to an editable raw-text data type.

Preferably data can be provided from other sources, for example hard copies which may be converted to text using optical character recognition processors, or from an audio format using speech recognition applications.

Preferably the method includes the step of analysing text retrieved by the method against predetermined rules. The predetermined rules may include literal string (key word) matches, regular expression matches, string patterns or occurrences of text, or other linguistically defined criteria. The predetermined rules may additionally involve other text analysis technology to recognise desired matches. The rules may be used to implement a criterion against which retrieved items of information are compared to determine their relevance to various topics and therefore the manner in which the information should be indexed, or possibly discarded.

Preferably the method includes the step of discarding or stripping all extraneous information from the information that is retrieved. Such extraneous information may include HTML tags, images and the like.

Preferably relevant information which is the subject of a new record created for immediate analysis or for archiving is stored with associated metadata (for example source URL, date retrieved, string length, HTML headers and the like). Furthermore, preferably each record is a distinct and unique item in the database or archive and is assigned a unique identifier.

The unique identifier may be a thirty two character UUID (universally unique identifier).

The invention also includes apparatus to implement the system or method of one or more of the preceding statements of invention.

The invention includes a computing machine operable to implement the system or method of one or more of the preceding statements of invention.

To those skilled in the art to which the invention relates, many changes in construction and widely different embodiments and applications of the invention will suggest themselves without departing from the scope of the invention as defined in the appended claims. The disclosures and descriptions herein are purely illustrative and are not intended to be in any sense limiting.

The invention consists of the foregoing and also envisages constructions of which the following gives examples only.

DRAWINGS DESCRIPTION

One presently preferred embodiment of the invention will now be described with reference to the accompanying drawings, wherein:

Figure 1 is an overview diagram of an information retrieval and archiving system according to the invention,

Figure 2 is a diagrammatic time line of internet information search functions according to the invention,

Figure 3 is a flow diagram of an internet search and retrieval function according to the invention,

Figures 4a & 4b constitute a single flow diagram showing the search and retrieval function of Figure 3 in greater detail, and

Figure 5 is a diagram showing the action of an agent or bot spidering a target server in accordance with the invention.

DESCRIPTION OF PREFERRED EMBODIMENTS

Referring to Figure 1, an overview of a method or system and associated apparatus according to the present invention is shown. Raw data is shown at a first level referenced 1. It is this data that the present invention searches, selects and then organises or indexes to arrive at relevant timely information. As can be seen from the diagram, this raw data can include a diverse range of data formats such as hard copy documents 10, Internet data 12, audio data 14 and video data 16.

Sources of hard copy documents include sources such as newspapers and magazine articles or other paper records.

Internet or other network data can include data contained in or generated by HTML documents, XML documents/feeds, dynamic pages (CGI, ASP, CFM, PHP) and WAP data sources, amongst others.

Audio data can include radio broadcasts, tape recordings/interviews and streaming audio (for example provided on the Internet).

Video data can include television broadcasts, tape recordings or streaming video (for example provided on the Internet).

At level 2 in Figure 1, a data processing level is shown. For hard copy documents 10, the preferred processing is performed by an optical character recognition (OCR) application. This is indicated with reference number 18 in Figure 1. OCR uses high definition scanners to capture an image of a hard copy document and convert it to a raw text format. To facilitate OCR, a computer or series of computers is provided, to which one or more high-resolution scanning devices (with a bulk feeder mechanism into which many pages of documents can be loaded) are attached.

The application automatically scans each page, converts the document into a raw text form using OCR (optical character recognition), and saves it into the central database.

The documents may be newspaper articles, magazine journals, printed PDF files, or other hard-copy material.

To process Internet data, HTTP (and similar or subsequent methods and protocols) requests are used to supply the required HTML, or other, documents and these can then be stripped of extraneous information such as HTML tags and the like to arrive at a text document. This processing is generally indicated using reference numeral 20 in Figure 1.
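By way of illustration only, the stripping of HTML tags described above might be sketched in Python as follows. The names `TextExtractor` and `strip_html` are assumptions made for this example and do not appear in the specification, and the sketch deals only with the stripping step, not with the HTTP request itself.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips tags, scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text content of an HTML document with whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.parts).split())
```

Images and other non-text elements carry no character data, so they simply drop out of the result.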

Audio data and video data are processed using speech recognition components to transform the audio information into a textual format. This process is generally indicated using reference numeral 22 in Figure 1. To facilitate speech recognition/transcription, a computer or series of computers runs an application which processes audio from TV broadcasts, video, and other media (streaming, CDROM, etc). The audio/video data may be stored digitally on a storage device connected to the computer or captured from an analogue source such as a bank of VCRs or similar playback devices.

The "audio signal" can be derived from either an audio or video source. Provision is made for additional metadata with video sources that analyses and classifies video and image information.

The application running on the computer analyses the broadcast using speech recognition software to convert it to a raw text form where it is saved into the central database.

The result of the processing step in level 2 is a text document, referenced 24, which is provided in electronic form. Each text item 24 then has metadata added to it (as will be described further below) so as to create a database record in step 26, and each record is then stored on a database 28. The database can then be accessed to review information of interest that has been gathered using the process. Furthermore, the information on the database can be archived in a number of convenient formats for use to track changes and patterns over time or to review historical information.

Although the system may be used with a wide variety of sources of raw data, as described with reference to Figure 1, an immediate application of the invention is to Internet data, and this is indicated in Figure 2 and will be described further by way of example with reference to the remaining figures.

Turning now to Figure 2, a time line having an axis 30 representing time advancing in linear intervals in a direction to the right hand side of the figure shows examples of agents or bots which automatically search target data sources on the Internet.

Agents or bots (or similar kinds of automated agents) are used in the preferred embodiment to automatically search target data sources on the Internet. The agents are released periodically.

By way of example, at 7:00am, a first agent 32 which has the task of extracting information from a specific URL, e.g. theage.com, may be released. Each agent is attached to a specific site and is profiled with information specific to that site. The information determines the method and depth of spidering (this will be explained further below) and how information is extracted. Each agent is released at predetermined intervals and they begin harvesting information through a process as will be described further below. Once each agent has finished its automated process, it returns to a "wait" state until it is next triggered.

Therefore, to continue with the example, another agent 34 may be attached to another URL, e.g. SMH.com, and be released at 8:00am. The agent 36 may be attached to a URL, e.g. news.com.au, and be released at 9:00am. The agent 38 may be attached to yet another URL, e.g. ordermail.com.au, and be released at 10:00am.
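The release schedule of Figure 2 can be pictured with a small Python sketch. The `AgentProfile` fields, the `max_depth` default and the `agents_due` helper are hypothetical names chosen for illustration; the specification does not define a data structure for agent profiles, and the sketch assumes both timestamps fall on the same day.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class AgentProfile:
    """Hypothetical per-site profile; field names are illustrative only."""
    agent_id: int
    target_url: str
    release_time: time
    max_depth: int = 2  # assumed default spidering depth


# The four agents of the example, keyed by their reference numerals.
AGENTS = [
    AgentProfile(32, "theage.com", time(7, 0)),
    AgentProfile(34, "SMH.com", time(8, 0)),
    AgentProfile(36, "news.com.au", time(9, 0)),
    AgentProfile(38, "ordermail.com.au", time(10, 0)),
]


def agents_due(agents, last_check: datetime, now: datetime):
    """Return agents whose release time falls in (last_check, now] on the same day."""
    return [a for a in agents if last_check.time() < a.release_time <= now.time()]
```

A scheduler loop could call `agents_due` at each tick, run the returned agents, and leave the remainder in their "wait" state.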

Turning now to Figure 3, a general process flow is described beginning at step 40 when the agent begins operation. Firstly, the agent makes an HTTP GET request to retrieve the HTML document from its target URL. This is performed in step 42. In the example given in Figure 2, if the agent in step 40 is agent 32, then the URL that the request is sent to would be theage.com.au.

Almost invariably, the document that the agent receives from the target URL will include a number of links. These links will typically consist of links to other URLs. These links are filtered according to certain criteria and information the agent is loaded with, and are stored on a system server in a "spider list". Certain types of resource are filtered out, and links are also compared against an "exclusion list" on the server. Any URL which is listed on the exclusion list is ignored by the agent. In this way, from a general known website structure, links which are known to be valueless in terms of their information can be readily excluded by the system.

This step of filtering the relevant links is carried out in step 44 and is generally performed by a parsing process whereby the text and the link are analysed by the agent to look for key words, known words or word patterns such as linguistically defined criteria or "themes" which are likely to indicate a relevant link to the information which is sought. The method includes the step of analysing text retrieved by the method against predetermined rules. The predetermined rules may include literal string (key word) matches, regular expression matches, string patterns or occurrences of text, or other linguistically defined criteria. The predetermined rules may additionally involve other text analysis technology to recognise desired matches. The rules may be used to implement a criterion against which retrieved items of information are compared to determine their relevance to various topics and therefore the manner in which the information should be indexed, or possibly discarded.

The term "spidering" refers to the process of navigating through a series of on line resources and gathering information. Therefore, the spider list which is established by the agent sets forth a pattern of links at the target site which is subsequently visited by the agent to retrieve information as is described further below.
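A minimal Python sketch of the link filtering of step 44 is given below, assuming that themes are expressed as regular expressions and that the exclusion list holds absolute URLs. `LinkCollector` and `build_spider_list` are illustrative names, not part of the specification.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (absolute URL, anchor text) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_spider_list(html, base_url, exclusion_list, theme_patterns):
    """Keep links that are not excluded and whose URL or text matches a theme."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    spider_list = []
    for url, text in collector.links:
        if url in exclusion_list:
            continue  # known valueless link, ignored by the agent
        candidate = url + " " + text
        if any(re.search(p, candidate, re.IGNORECASE) for p in theme_patterns):
            spider_list.append(url)
    return spider_list
```

Matching against both the URL and the anchor text is a design choice of this sketch; the specification states only that the text and the link are analysed.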

In step 46 the agent then proceeds to process each parsed URL from step 44 individually until all further links (of which there may be many) are checked in this manner. Again, links which are on the exclusion list are ignored by the agent.

As each URL is parsed, the agent inserts the relevant URL (or link) into a URL stream table. This occurs in step 48.

Once the spidering process has been completed, the agent then performs a query in step 50 to retrieve all the URLs from the URL stream table.

The next general step is for the agent to loop through a document retrieval process until all the URLs or links from the URL stream table have been accessed, i.e. spidered. Therefore, in step 52 the process begins by the agent making an HTTP GET request to retrieve a document from the first URL. The agent then retrieves a profile for the base URL. This occurs in step 54 and the purpose is to obtain further information about any known document structure or structures at the website of interest. Therefore, profiles tend to be specific toward each target URL. If the profile is known, then this can make the content of the HTML document much easier to accurately retrieve in a desired form. If the structure of the HTML document retrieved does not match the profile then the agent defaults to retrieving the entire text of the HTML document with the HTML tags stripped out.

Therefore, in step 56, the agent executes the profile and in step 58 retrieves the relevant material (for example) in text with extraneous content stripped out.

The next step 60 is for an analysis to be performed of the retrieved document. The agent analyses the text retrieved against predetermined rules which may be called "themes" stored on the system server. The themes may consist of actual literal string (i.e. key word) matches, regular expression matches, string patterns or occurrences of text or other linguistically defined criteria as determined.

In practice, themes are defined by system users in consultation with analysts and may consist of any of the foregoing, and additionally may involve other text analysis technology to recognise desired matches. The word "themes" is broadly used in this document to describe a scheme of criteria against which retrieved items are compared to ascertain or discover documents of relevance to the user.
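As a hedged illustration of how such themes could be represented, the following Python sketch combines literal key words with regular expressions. The `Theme` class and `matching_themes` function are assumptions for this example; other linguistically defined criteria mentioned above are not modelled.

```python
import re
from dataclasses import dataclass


@dataclass
class Theme:
    """Hypothetical theme: a name plus literal key words and regular expressions."""
    name: str
    keywords: tuple = ()
    patterns: tuple = ()

    def matches(self, text: str) -> bool:
        lowered = text.lower()
        # Literal string (key word) matches, case-insensitive.
        if any(k.lower() in lowered for k in self.keywords):
            return True
        # Regular expression matches.
        return any(re.search(p, text, re.IGNORECASE) for p in self.patterns)


def matching_themes(text, themes):
    """Return the names of all themes that the text satisfies."""
    return [t.name for t in themes if t.matches(text)]
```

A document for which `matching_themes` returns an empty list corresponds to the "no match" branch of step 60.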

Returning to Figure 3, should the query performed in step 60 result in a match, then the agent inserts the text document that has been retrieved into the system database. This occurs in step 62. If a match is not achieved, then the document is discarded.

Having retrieved one document, the agent then returns to the next URL in the URL stream table in step 64 so that the process begins to repeat from step 52 until all URLs have been examined.

Once the spidering process is complete, the agent "returns" to the system server until the next cycle is due to begin. This is represented as step 66 in Figure 3.

As described with reference to Figure 1, as each text item is added to the database, additional metadata is added to the item so that the data is organised or indexed for subsequent retrieval or for further analysis for identification purposes. Therefore, as each new record is created on the system database, the text is stored and any associated metadata (such as source URL, date retrieved, string length, HTML headers etc) is stored with the text. Each record created is thus a distinct and unique item in the database and is assigned a unique identifier. This identifier preferably takes the form of a 32 character UUID.
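The creation of a record with its metadata might look like the following Python sketch. The dictionary keys and the `make_record` name are illustrative assumptions; the hexadecimal form of a version 4 UUID is used here simply because it is 32 characters long, and the specification does not say which UUID variant is intended.

```python
import uuid
from datetime import datetime, timezone


def make_record(text, source_url, http_headers):
    """Wrap retrieved text with metadata to form a database record (a plain dict here)."""
    return {
        "id": uuid.uuid4().hex,  # 32 hexadecimal characters, no hyphens
        "source_url": source_url,
        "date_retrieved": datetime.now(timezone.utc).isoformat(),
        "string_length": len(text),
        "headers": dict(http_headers),
        "text": text,
    }
```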

The system envisages storing text documents regardless of whether a theme is matched or not so that recursive searches may be made.

Turning now to Figures 4a and 4b, a further example of spidering a target base URL is provided, using the methodology similar to that described with reference to Figure 2, but incorporating some more detail. Thus in Figure 4a, the agent executes in step 70 and an initial query occurs in step 72 which is an HTTP request to get the base URL. In step 74 a check is performed on the document returned as a result of the request. This check is to review the header data from the HTML document that is returned to ascertain the last time that the document was updated or modified. A comparison occurs in step 76, and if there is no change, then the agent returns to step 70. However, if a change has occurred, then the document is received in step 78 and is parsed in step 80 to ascertain relevant links. It is desired (but not absolutely necessary) that only links which relate to text documents are parsed and that the agent ignores links from any exclusion list as described above.
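The comparison of steps 74 and 76 could be sketched in Python as below, on the assumption that the "header data" is the HTTP `Last-Modified` response header and that the value seen on the previous run has been stored. The function name `has_changed` is hypothetical.

```python
from email.utils import parsedate_to_datetime


def has_changed(last_modified_header, previous_header):
    """Compare this fetch's Last-Modified value with the one stored from the last run."""
    if not last_modified_header or not previous_header:
        # No basis for comparison, so treat the document as changed.
        return True
    current = parsedate_to_datetime(last_modified_header)
    previous = parsedate_to_datetime(previous_header)
    return current > previous
```

Treating a missing header as "changed" is a conservative choice made for this sketch, so that documents are never skipped for lack of header data.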

In step 82 the parsed URL is processed and in step 84 the agent performs a query to check whether the processed URL is present in the URL stream table. If it is not, then in step 86 a further query is performed to check whether the URL is in the URL archive table. If the URL is not present in that table either, then the agent inserts the URL into the URL stream table together with further parameters such as the base URL, the date and time of last modification of the document to which the URL relates and a depth variable.

If the URL is identified in steps 84 or 86, then the agent continues to process the next URL in step 82 and the process continues until all the URLs have been parsed.
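The two lookups of steps 84 and 86 can be illustrated with an in-memory SQLite database in Python. The table and column names (`url_stream`, `url_archive`, and so on) are assumptions for this sketch; the specification names the tables but does not give a schema.

```python
import sqlite3


def open_tables():
    """Create in-memory stand-ins for the URL stream table and URL archive table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE url_stream (url TEXT PRIMARY KEY, base_url TEXT, "
               "last_modified TEXT, depth INTEGER)")
    db.execute("CREATE TABLE url_archive (url TEXT PRIMARY KEY)")
    return db


def queue_url(db, url, base_url, last_modified, depth):
    """Insert the URL into the stream table unless either table already holds it."""
    for table in ("url_stream", "url_archive"):  # fixed names, safe to interpolate
        if db.execute(f"SELECT 1 FROM {table} WHERE url = ?", (url,)).fetchone():
            return False
    db.execute("INSERT INTO url_stream VALUES (?, ?, ?, ?)",
               (url, base_url, last_modified, depth))
    return True
```

A return value of `False` corresponds to the agent moving on to the next URL in step 82.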

The process continues in step 90 when the agent retrieves all the URLs that have been parsed from the URL stream table. A GET request is then performed in step 92 for the first URL from the URL stream table. A check is then performed in step 94 to see whether the depth variable is greater than 1, i.e. whether there are further links in the document that is retrieved from that URL. If there are, then these links are parsed and the process is performed again beginning at step 80 until all the subsidiary links are parsed and then the agent returns to step 96 where a query is performed to retrieve the profile for the relevant base URL.

The process flow continues in Figure 4b where in step 98 the agent attempts to execute the retrieved profile. If there is a profile match failure, as shown in step 100, then the full text of the HTML document is simply retrieved and all the HTML tags are simply stripped from the document. If there is a profile match success as shown in step 102, then the text from the document is easily retrieved with extraneous content removed from it. The resultant text document is then compared with the themes referred to above to see whether a match occurs in step 104. A query is then performed in step 106 to see whether the URL to which the document relates already exists. If it does, then the URL is discarded and the agent turns to the next URL in the URL stream table at step 108. However, if the URL does not already exist, then the agent inserts the full text into the content items table (i.e. into the database) together with further metadata such as the base URL and further information for identification and search purposes. This occurs in step 110.

If for some reason an article cannot be extracted, then an email is generated in step 112. The agent then continues to repeat the process for subsequent URLs in the URL stream table at step 114.
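The profile success and failure branches of steps 98 to 102 might be sketched as follows in Python. The profile format used here, a pair of `start` and `end` marker strings, is purely an assumption; the specification does not say how a profile describes a site's document structure. The regular-expression tag removal is deliberately crude and stands in for a proper HTML parser.

```python
import re


def strip_tags(html):
    """Crude tag removal used for illustration; a real parser is more robust."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def extract_article(html, profile):
    """Apply a site profile; fall back to the full stripped text if it does not match.

    The profile is a hypothetical dict with 'start' and 'end' marker strings.
    Returns a (text, profile_matched) pair.
    """
    start = html.find(profile["start"])
    end = -1
    if start != -1:
        end = html.find(profile["end"], start + len(profile["start"]))
    if start == -1 or end == -1:
        # Profile match failure: retrieve the full text with tags stripped.
        return strip_tags(html), False
    body = html[start + len(profile["start"]):end]
    return strip_tags(body), True
```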

Step 106 has the purpose of preventing information being retrieved and stored twice.

In Figure 5, a simplified diagrammatic illustration of the spidering process described above in Figures 3, 4a and 4b is shown. The system server is referenced 150 and a target server on which the target URL (i.e. the base URL referred to above) is located is referenced 152. An agent 154 begins by making a first pass of the base URL of the target server 152. That agent then returns data to the server as shown by arrow 156. If the information returned indicates that there are links to further URLs on the target server, then the agent makes a further pass, i.e. a second pass 158. Information from the second pass is returned to the server in step 160. Again, if the second pass shows that further links are present on the server, then a third pass 162 may be made, which will again return further information to the server. Of course, a large number of passes may be made if required. The method provides a logical and straightforward way of spidering a target server for relevant information.

As can further be seen from Figure 5, information on a target server may be represented in a pie chart form. The information in an initial state of the server 170 may show that no information has been spidered. After the first pass, a certain amount of information will have been retrieved as indicated in diagram 172. After a second pass further information will have been retrieved as shown by diagram 174. Finally, after the third pass, yet more information has been retrieved as shown by diagram 176. The spidered information from the server is shown in the shaded portions of each diagram. As can be seen, a certain amount of information is ignored and this information relates to links that have been parsed by the agent but which have been ignored because they have been determined to be a) irrelevant, b) on a list of URLs to be ignored, or c) not in the required data form (for example do not comprise a text document).
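The pass-by-pass behaviour of Figure 5 can be modelled in Python with a dictionary standing in for the target server. `spider_in_passes`, its parameters and the `is_relevant` callback are illustrative assumptions; in this sketch each pass follows only the links discovered by the previous pass.

```python
def spider_in_passes(site_links, base_url, max_passes, is_relevant, exclusion_list=()):
    """Walk a site pass by pass; each pass follows the links found by the previous one.

    site_links maps a URL to the list of URLs it links to, standing in for
    documents fetched from the target server. Returns one list of newly
    accepted URLs per pass.
    """
    visited = {base_url}
    frontier = [base_url]
    passes = []
    for _ in range(max_passes):
        found = []
        for url in frontier:
            for link in site_links.get(url, []):
                # Ignore links already seen, excluded, or judged irrelevant.
                if link in visited or link in exclusion_list or not is_relevant(link):
                    continue
                visited.add(link)
                found.append(link)
        if not found:
            break  # no further links, so no further pass is needed
        passes.append(found)
        frontier = found
    return passes
```

The links rejected inside the inner loop correspond to the unshaded, ignored portions of diagrams 172 to 176.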

After a content item has been stored in the database, an "alert" will be generated. The alert configuration is definable by the client, and may take the form of an email, an SMS message, the remote updating of a web page, or remote communication with another database system or application.

The alert may be sent in "real-time" (as soon as the content item is retrieved) or after it has been analysed (after the analyst has processed the content item).

The alerts may be received singly or in digest form on a different frequency, for example daily, weekly, or even monthly if desired.
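One way to picture the client-definable alert configuration is the Python sketch below, which separates items to be sent singly from those held for a digest. The `AlertConfig` fields and the frequency labels are assumptions for illustration; actual delivery by email, SMS or other channels is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AlertConfig:
    """Hypothetical per-client alert settings."""
    client: str
    channel: str    # e.g. "email" or "sms"
    frequency: str  # "immediate", "daily", "weekly" or "monthly"


def plan_alerts(configs, new_items):
    """Split new content items into immediate alerts and per-frequency digests.

    new_items is a list of (client, item_title) pairs.
    """
    by_client = {c.client: c for c in configs}
    immediate = []
    digests = defaultdict(list)
    for client, title in new_items:
        cfg = by_client[client]
        if cfg.frequency == "immediate":
            immediate.append((cfg.channel, client, title))
        else:
            digests[(cfg.frequency, cfg.channel, client)].append(title)
    return immediate, dict(digests)
```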

The client may view "real-time" reports showing visually the retrieval, processing and analysis of items that match their keyword themes. These reports consist of dynamic graphs, pie graphs, and other types of chart which display information and metadata pertaining to these content items. The client may further manipulate these charts and graphs with different ranges and criteria to produce different results.

The analysis may be performed by a human analyst or by a software component on the server. The analysis metadata is compiled from the client perspective and stored on a per user client basis, so one content item may have many analyses for different clients.

The analysis allows the user to select many database cross-sections for different reports showing the analysis metadata which is linked to retrieved content items. The analysis may also be displayed real-time to the client, so that as items are updated and analysed the on-screen information is updated with no intervention from the client.

The analysis enables the user to quickly gain an understanding of the skew of a large volume of content at a glance; instead of perusing each item they are able to view a dissected overview in graphical format, which provides a powerful tool in determining real-time trends as they appear.

From the foregoing it will be seen that a system for retrieving relevant and timely information and archiving information in a form which is readily searchable and may be analysed, is provided. In particular, a methodical and efficient method of spidering target websites is provided. Also, a method of discarding irrelevant information to arrive at a document in text format is provided, together with a method of indexing or organising and identifying retrieved documents for subsequent analysis. Finally, a system of conveniently and timely alerting users to the presence of information relevant to them is provided.

Claims

1. A method for automated search and retrieval of information available on a networked database, the method including the steps of

providing search topic information,

providing a target information resource location,

spidering or dividing the target information resource location for further resource locations that are likely to lead to relevant information, and

retrieving information from the target information resource location or from a relevant one of the further resource locations.

2. A method according to claim 1 in which the networked database is the Internet.

3. A method according to claim 2 in which the retrieved information is analysed.

4. A method according to claim 3 in which an alert is provided to an entity as a result of the analysis.

5. A method for automated searching and retrieval of information, performing real time selection and retrieval of the information.

6. A method according to claim 5 in which the information is archived for subsequent analysis.

7. A method according to claim 6 including the step of establishing one or more target resource locations from which information is to be searched and retrieved.

8. A method according to claim 7 in which the target location includes a URL which is spidered by the system to identify underlying links.

9. A method according to claim 8 in which the spidering step is performed in a plurality of passes, each pass being targeted toward certain links, and each pass ignoring links that are unlikely to be relevant.

10. A method according to claim 9 including the step of retrieving information from links that appear relevant.

11. A method according to claim 10 including the step of assigning or attaching metadata to each item of information to create a database record.

12. A method according to claim 11 in which the database records are archived.

13. A method according to claim 12 in which retrieved information which is not in a textual format is converted to an editable raw-text data type.

14. A method according to claim 13 including the step of analysing retrieved text against predetermined rules to recognise desired matches.

15. A method according to claim 14 in which the rules are used to implement a criterion against which retrieved items of information are compared to determine their relevance to various topics and therefore the manner in which the information should be indexed, or possibly discarded.

16. A method according to claim 15 in which the rules include one or more of literal string (key word) matches, regular expression matches, string patterns or occurrences of text, or other linguistically defined criteria to recognise desired matches.

17. A method according to claim 16 including the step of discarding or stripping all extraneous information from the information that is retrieved including HTML tags, images and the like.

18. A method according to claim 17 in which relevant information which is the subject of a new record is stored with associated metadata.

19. A method according to claim 18 in which each record is a distinct and unique item in the database or archive and is assigned a unique identifier.

20. An automated information search and retrieval system in which real time selection and retrieval of the information occurs.

21. A system according to claim 20 including provision for archiving the retrieved information in a readily accessible manner.

22. A system according to claim 21 in which the information is searched and retrieved from the Internet.

23. A system according to claim 22 including means for establishing one or more target resource locations from which information is to be searched and retrieved.

24. A system according to claim 23 including means for spidering a target resource location to identify underlying links.

25. A system according to claim 24 including means for retrieving information from links.

26. A system according to claim 25 including means for assigning or attaching metadata to each item of information to create a database record.

27. A system according to claim 26 including means for archiving retrieved information for later analysis.

28. A system according to claim 27 including means for converting retrieved information which is not in a textual format to an editable raw-text data type.

29. A system according to claim 28 including means for providing text data from non text sources including hard copies by conversion to text using optical character recognition processors and audio format using speech recognition applications.

30. Apparatus to implement the system or method of any one of the preceding claims.

31. A computing machine operable to implement the system or method or apparatus of any one of the preceding claims.