EP1208470A1 - Method and system for creating a topical data structure - Google Patents

Method and system for creating a topical data structure

Info

Publication number
EP1208470A1
EP1208470A1 EP00935871A EP00935871A EP1208470A1 EP 1208470 A1 EP1208470 A1 EP 1208470A1 EP 00935871 A EP00935871 A EP 00935871A EP 00935871 A EP00935871 A EP 00935871A EP 1208470 A1 EP1208470 A1 EP 1208470A1
Authority
EP
European Patent Office
Prior art keywords
documents
document
topical
spider
analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP00935871A
Other languages
German (de)
English (en)
Inventor
Timothy W. Starzl
Ravi S. Starzl
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SearchLogic com Corp
Original Assignee
SearchLogic com Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SearchLogic com Corp
Publication of EP1208470A1
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Definitions

  • the present invention relates to processes for discovering and collecting information located in an inter-linked environment such as the Internet and the World Wide Web ("Web") or in other archived, repository, database or stored information environment where the information is in a digital format, and is accessible electronically. More specifically, the present invention relates to improving both the topical or class relevancy of the information collected and the amount of relevant information collected from these environments.
  • Web: World Wide Web
  • the World Wide Web is an extremely large, inter-networked data system connecting hundreds of millions of informational sites and documents and is growing daily.
  • the inter-linked relationships between these sites create a dynamic system of enormous complexity.
  • the existing Internet addressing system does not locate or identify sites based on their information content.
  • Indeed, while the rich, decentralized, dynamic and diverse nature of the Web can make casual Web surfing enjoyable, it has made serious navigation aimed at finding specific information extremely difficult.
  • In response to this problem, several types of Internet/Web navigation, location, finding or searching resources have evolved to facilitate the finding of sites based on content.
  • One such resource relates to an automated information retrieval system, often referred to as an Internet or Web "search engine."
  • Typical Web search engine systems use automated collection agents, software programs generally called "spiders", to automatically traverse the Web to discover and collect any accessible information sources.
  • a spider automatically traverses the Web's hypertext link structure, recursively retrieving documents, pages, or resources that are discovered. These spiders return Web documents or document addresses (URLs) to a confined data structure or Information Retrieval System. Spiders may retrieve all or only a portion of a document such as the headers or metatags, or may only collect the page address.
  • the term spider is understood here to include automated user agents, call utilities, Web robots, bots, autonomous and mobile agents dedicated to this function, and other like utilities
  • the resources collected by the spider are typically stored in a database as part of an Information Retrieval System.
  • Information Retrieval System or "IR system" refers to the data structure-based functions of storage, ordering, and presenting of previously discovered and collected information, as distinct from the processes of discovery and collection of data from the Web.
  • the IR systems sort these previously collected documents, or representations of documents, and associate them with their Internet, archive, or other address for presentation to the user.
  • in some IR systems, all of the web pages that the spider discovers and collects are searched and sorted in an undifferentiated manner. Other such IR systems differentiate content within the IR data structure itself for more efficient ordering, storage and quicker access.
  • In response to a user-supplied query, Web search engine IR systems typically analyze the collected Web documents using filters to perform a calculation and produce a relevance score. This score may be based on a number of factors, such as the number of search or query terms that appear in the document and where and how often they appear. Some systems use other scoring criteria, such as number of links or frequency of use. Usually a low score indicates the document is not relevant to the user query, and a high score indicates that it is likely to be relevant.
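  • As an illustration of the kind of relevance score described above, the following is a minimal sketch and not the scoring used by any particular search engine. The function name `relevance_score` and the formula (fraction of query terms present plus their relative frequency) are assumptions chosen for clarity.

```python
import re
from collections import Counter


def relevance_score(query: str, document: str) -> float:
    """Score a document by how many query terms appear and how often.

    Hypothetical formula: coverage of the query terms plus the share of
    document tokens that are query terms. Higher means more relevant.
    """
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    if not tokens or not terms:
        return 0.0
    counts = Counter(tokens)
    matched = [t for t in terms if counts[t] > 0]
    coverage = len(matched) / len(terms)                        # which terms appear
    frequency = sum(counts[t] for t in matched) / len(tokens)   # how often they appear
    return coverage + frequency
```

A document containing none of the query terms scores 0.0, which matches the convention that a low score indicates the document is unlikely to be relevant.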
  • Web directories are manually created by people who examine each page or resource and determine whether the resource should be included in the directory.
  • Web directories are distinguished from search engines in that they only collect or accept content that is relevant to a topic or category within the directory.
  • although each directory typically has highly relevant resources, the limited throughput of manual processing creates directories that are unsatisfactorily small, both on the scale of the total Web and when compared to the size of Web search engines.
  • because people must manually perform the task of accepting or rejecting each and every resource, the cost of maintaining and updating the directories is high.
  • the present invention relates to an automated system and method for creating a topical data structure, which can then be searched using conventional IR means.
  • topical relates to the concepts of human-derived topic, class, category, grouping, natural grouping, taxonomic grouping, taxon, theme, cluster, or subject, and which may be identified through measures of relatedness, similarity, likeness, clustering, nearness, or other like measures. Since the data structure is topical, i.e., primarily restricted to topically related information, the results from the search show substantially improved query relevancy. Additionally, since the discovery and collection system is automated many more documents can be incorporated into the data structure, and the cost of generating and updating the data structure is relatively low.
  • the present invention relates to a system or method for discovering and collecting information from an inter-linked system of documents, such as the Web and/or the Internet.
  • the system or method recursively traverses the system of inter-linked documents, analyzes each document traversed to extract a signature for each document, wherein the signature is related to the content of the document, and then compares the signature for each document to predetermined signature criteria related to that topic to determine the relevancy of each document to that topic. Once the relevancy of the document is determined, the method adds or combines relevant documents to create the topical data structure.
  • the analysis and comparison are done by a filter system that may be external to an information retrieval system where the topical data structure resides.
  • the system or method utilizes a spider to traverse the Web, wherein the spider feeds document information to the filter system.
  • the spider may further be combined with a filter system to deliver topically relevant documents to the data structure and to confine the traversal paths taken by the spider.
  • the spider may receive relevancy information about document signatures so the spider may determine whether paths are relevant or conforming.
  • the spider may further elect to traverse only relevant paths based on this determination.
  • the spider may further be configured to jump a predetermined number of irrelevant documents in determining whether paths are relevant.
  • At least one filter may determine relevancy based on a predetermined scale and provide relevancy information according to the predetermined scale. Additionally, more than one filter may be used to determine the relevancy of each document. This information can then be further evaluated to determine whether additional analysis is necessary in determining whether to include or reject a document from the topical data.
  • the predetermined criteria are derived from a collection of sample documents, which is analyzed to determine topical signatures, preferably using some form of analysis such as lexical, relational, statistical, linguistic, or inferential content analysis.
  • the constrained results produced may subsequently be used in any IR system, such as a document search engine, a hierarchical directory, a vector space construct, any clustering algorithm driven data structure, array or construct, or any data storage and query format.
  • the invention may be implemented as a computer process, a computing system or as an article of manufacture such as a computer program product.
  • the computer program product may be a computer storage medium readable by a computer system and encoding a computer program of instructions for executing a computer process.
  • the computer program product may also be a propagated signal on a carrier readable by a computing system and encoding a computer program of instructions for executing a computer process.
  • Fig. 1 is a block diagram of the computer system shown in Fig. 2 connected to server computers through a computer network.
  • Fig. 2 is a block diagram of a computer system that may be used to implement a method and apparatus embodying the improved spider of the present invention.
  • Fig. 3 illustrates the functional components of a prior art Web discovery and collection system.
  • Fig. 4 illustrates the functional components of a Web discovery and collection system of the present invention.
  • Fig. 5 illustrates the functional components of a Web discovery and collection system of an alternative embodiment of the present invention.
  • Fig. 6 is a flowchart illustrating the operational characteristics of an embodiment of the invention.
  • An interconnected computer system 100 that may incorporate the present invention is shown in Fig. 1.
  • the client computer system 102 operates a traditional browser application 104.
  • the browser application 104 communicates with an information retrieval system 106, which is located on either computer system 102 or on another server computer system (not shown).
  • the retrieval system 106 comprises a suitable query server 108 and a data structure 110, preferably a database or text base.
  • the information retrieval system 106 communicates with a collection agent 112 for collecting information from the Web 114 and storing those collected resources in the data structure 110.
  • the data structure stores the various resources, and may be configured to index or otherwise sort the information for future reference.
  • the query server 108 receives a query from browser 104 and uses the query to search the data structure for relevant information. Once the relevant information is retrieved, that information is then presented to a user of computer 102 through the interface that is displayed through the browser 104.
  • the collection agent 112 traverses the Web 114, which generally has a network of informational sites that are linked via the hypertext transfer protocol (HTTP).
  • HTTP: hypertext transfer protocol
  • Each of the sites resides on a server computer system, such as server computer systems 116 as shown in Fig. 1.
  • the collection agent 112 in various embodiments of the present invention is capable of differentiating the various web resources during traversal so that the resulting data structure comprises mostly resources that are relevant to a particular topic.
  • the query server 108 generally returns only relevant information.
  • the query server is better able to perform advanced algorithms on the data structure to retrieve highly relevant information and can present that information accordingly.
  • since the data structure is constrained to containing only topical information, the collection agent must accept and collect far fewer documents than prior art collection agents, and thus is able to traverse a significantly larger portion of the Web before the IR system reaches capacity, further improving the results provided by the query server.
  • the computer 102 is a desktop computer system.
  • the invention is used in combination with any number of other computer systems or environments, such as in handheld computer environments, laptop or notebook computer systems, multiprocessor systems, micro-processor based or programmable consumer electronics, network PCs, mini computers, main frame computers and the like.
  • the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, programs may be located in both local and remote memory storage devices.
  • the computer 102 incorporates a system of resources for implementing an embodiment of the invention, such as the system 200 shown in Fig. 2.
  • the system 200 incorporates a computer 202 having at least one central processing unit (CPU) 204, a memory system 206, an input device 208, and an output device 210. These elements are coupled by at least one system bus 212.
  • CPU: central processing unit
  • the CPU 204 is of familiar design and includes an Arithmetic Logic Unit (ALU) 214 for performing computations, a collection of registers 216 for temporary storage of data and instructions, and a control unit 218 for controlling operation of the system 200.
  • ALU: Arithmetic Logic Unit
  • the CPU 204 may be a microprocessor having any of a variety of architectures including, but not limited to those architectures currently produced by Intel, Cyrix, AMD, IBM and Motorola.
  • the system memory 206 comprises a main memory 220, in the form of media such as random access memory (RAM) and read only memory (ROM), and may incorporate or be adapted to connect to secondary storage 222 in the form of long term storage mediums such as hard disks, floppy disks, tape, compact disks (CDs), flash memory, etc. and other devices that store data using electrical, magnetic, optical or other recording media.
  • the main memory 220 may also comprise video display memory for displaying images through the output device 210.
  • the memory can comprise a variety of alternative components having a variety of storage capacities. Magnetic cassettes, memory cards, digital video disks, Bernoulli cartridges, random access memories, read only memories and the like may also be used in the exemplary operating environment.
  • Memory devices within the memory system and their associated computer readable media provide non-volatile storage of computer readable instructions, data structures, programs and other data for the computer system.
  • the system bus 212 may be any of several types of bus structures such as a memory bus, a peripheral bus or a local bus using any of a variety of bus architectures.
  • the input and output devices are also familiar.
  • the input device can comprise a small keyboard, a mouse, a microphone, a touch pad, a touch screen, etc.
  • the output device can comprise a display, a printer, a speaker, a touch screen, etc. Some devices, such as a network interface or a modem can be used as input and/or output devices.
  • the input and output devices are connected to the computer through system buses 212.
  • the computer system 200 further comprises an operating system and usually one or more application programs.
  • the operating system comprises a set of programs that control the operation of the system 200, control the allocation of resources, provide a graphical user interface to the user, facilitate access to local or remote information, and may also include certain utility programs such as the email system.
  • An application program is software that runs on top of the operating system software and uses computer resources made available through the operating system to perform application specific tasks desired by the user. In general, applications are responsible for generating displays in accordance with the present invention, but the invention may be integrated into the operating system.
  • the information retrieval system 302, which is similar to information retrieval system 106 (Fig. 1), communicates with a spider 304.
  • the term "spider" is understood here to include prior art automated user agents, call utilities, Web robots, bots, autonomous and mobile agents dedicated to the function, and other like utilities.
  • Spider 304 automatically traverses the Web's hypertext structure, recursively retrieving all web documents 306 that are found.
  • the documents 306 may be actual textual documents, images, pages, or other resources found on the Web, as well as their addresses, and are referred to hereinafter as either documents, pages or resources.
  • the web documents 306 are stored in data structure 312, hereinafter referred to as the database or data structure, of the information retrieval system 302.
  • the database 312 may have a filter so that specific predetermined types, structures or formats of pages are not accepted in the database (e.g. duplicates, spam pages, by character set, by domain).
  • the database may not have a filter, but in any case may create an index of the information stored in the database 312.
  • the information is available to the user through user interface 314, which may comprise a browser and query server, such as the browser 104 and query server 108 shown in Fig. 1.
  • the prior art spider 304 does not differentiate documents 306 based on topical content. Instead, each document that is traversed is returned to the database 312, creating a large, undifferentiated collection of items.
  • the embodiment uses a discovery and collection system 400 having a spider 402 and a topical content filter 404.
  • the spider 402 is any software program or system capable of traversing the Web, and which, in this case, must also be capable of interfacing with and using a linguistic, lexical or other text filter such as filter 404.
  • the content filter 404 analyzes the documents 306 returned by the spider 402 and accepts or rejects each document based on predetermined topical content criteria.
  • the criteria may be a lexical or linguistic signature or some other basis.
  • Based on the acceptance or rejection of these documents by the content filter 404, the database 312 comprises topical information.
  • the filter 404 is integrated with the information retrieval system 302 in a manner that pre-filters the content accepted by the information retrieval system 302.
  • the information returned to the database 312 has been differentiated, based on topical content, from other information on the Web through the use of the system 400.
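  • The arrangement of Fig. 4, in which a spider traverses the Web and a separate content filter decides what reaches the database, can be sketched as follows. This is an illustrative sketch only: the in-memory `TOY_WEB` dictionary, the function `crawl_with_filter`, and the trivial `mentions_golf` filter are hypothetical stand-ins for real network access and a real topical filter.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory "Web": URL -> (document text, outgoing links).
Web = Dict[str, Tuple[str, List[str]]]


def crawl_with_filter(web: Web, seed: str, accept: Callable[[str], bool]) -> List[str]:
    """Traverse every reachable page, but store only pages the filter accepts."""
    database: List[str] = []      # stands in for the topical data structure
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        if accept(text):          # the content filter pre-filters what is stored
            database.append(url)
        for link in links:        # traversal itself is not restricted here
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return database


TOY_WEB: Web = {
    "a": ("golf swing tips", ["b", "c"]),
    "b": ("stock market report", ["d"]),
    "c": ("golf course reviews", []),
    "d": ("putting and golf drills", []),
}


def mentions_golf(text: str) -> bool:
    """Deliberately simple placeholder for a topical content filter."""
    return "golf" in text.lower().split()
```

In this sketch the spider still visits page "b" in order to reach page "d"; only the stored collection is constrained by the filter.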
  • Another embodiment of the present invention is shown in Fig. 5. This embodiment may operate in conjunction with information retrieval system 302, having the same type of database 312 and user interface 314 as described above in conjunction with Figs. 3 and 4.
  • the information retrieved and stored in the database 312 is topical, i.e., content based as described above in conjunction with Fig. 4.
  • the embodiment shown in Fig. 5 comprises a unique filtering spider 500.
  • the filtering spider 500 traverses the Web and performs its own filtering analysis on each document or site that the filtering spider 500 encounters.
  • an embodiment of the filtering spider 500 can be configured to elect different paths of traversal, thereby only traversing selected web documents 306.
  • the spider 500 may avoid paths that are less likely to produce relevant information and concentrate on paths that are more likely to produce relevant information.
  • This process is referred to herein as "link-tunneling."
  • the goal of the link-tunneling approach is to limit the content presented for incorporation into a topical data structure to only material that displays pre-specified characteristics associated with that topic.
  • targeted "link-tunneling" methods of traversal can capture the topic knowledge of site authors, as expressed in their linking decisions. The effect of this system is the selection of a constrained population of resources for inclusion in a data structure, around which a topical or subject oriented information retrieval system can be defined.
  • a linguistic or lexical signature relates to any extractable attribute or representation of content, i.e., subject matter, that provides a basis for document or subject recognition or differentiation and usually beyond that provided by the simple presence or absence of a keyword, a group of keywords, or Boolean expression. Designed constructs of keywords representing a subject or topic may be extracted or generated that reflect this equivalent function.
  • differentiation of discovered material by comparison to a linguistic signature or template may be topically or categorically related by predefined linguistic, lexical, textual, semantic, syntactic, mythographic, semiotic, pictographic, hieroglyphic, graphic, structural, hybrid or other content-related attributes.
  • topical signature data for differentiation.
  • This signature refers to any of a class of processes for the mathematical, logical, or linguistic extraction and characterization of document, atomic, molecular or elemental components (words, lexies, associative patterns, frequencies, word clusters, word class relationships, etc.) to produce a set of differentiating representations or characteristics. These representations are referred to as "linguistic signatures" in this disclosure.
  • the methods referenced here include: lexical analysis, semantic analysis, syntactical analysis, textual analysis, clustering analysis, auto-categorization, vector analysis, statistical analysis, heuristics, pragmatic methods and/or any models, algorithms or relationships using these methods.
  • a linguistic signature, derived or extracted by any means, is used by the filter 404 or filtering spider 500 as a conformity test for unknown, heterogeneous documents.
  • Differentiation by "linguistic signature" according to subject matter of a web document 306 is to be understood as the automated assignment of document membership or the identification of non-membership within a pre-defined subject, category, class, or topic area. Acceptance, differentiation or rejection may be into, or in reference to, any topical, subject, categorical, hierarchical, relational or other organizational system, scheme, ontology, taxonomy, or concept hierarchy, using any relatedness-based classification measure or method.
  • a class, category, subject or topic may be identified by human judgment or agency, or may be identified as a measure of relatedness, similarity, or clustering of a group of documents.
  • a class, category, subject or topic "linguistic signature" may be determined in substantially the same manner as described above for the determination of document "linguistic signature" as applied over a sufficiently large group of documents judged to be members of the class, category, subject or topic so as to allow for the creation of a representative signature.
  • the method includes any method for the development or identification of lists, strings, arrays, files, algorithms, expressions, collections or groupings of such elements that are characteristic of the subject class, category, subject or topic.
  • Fig. 6 illustrates the operational flow process 600 which relates to an embodiment of the invention.
  • Process 600 begins with traverse operation 602, which traverses the Web, or another inter-linked data structure, using a provided link.
  • a first link is provided to operation 602, indicating the first site to visit.
  • the spider 304 or the filtering spider 500 carries out the traversal.
  • Operation 602 may mark the link in some manner so that the process can recognize, at a later time, that the link has been analyzed at some earlier time. Similarly, this step may analyze the link to determine if the link has been marked in the past. If the link has been analyzed, the process may elect to either re-analyze the document or recursively determine the next link to analyze.
  • Other embodiments utilize tables of links.
  • a first table stores potential links.
  • a second table is used to store traversed links, and another to store topically rejected links. By comparing a link in the first table to those in the second table, the process can determine if it has been traversed.
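  • The link tables described above can be sketched as a small bookkeeping class. This is an assumed design, not the patent's implementation: the class name `LinkTables`, its method names, and the choice of a queue for potential links and sets for the other two tables are all hypothetical.

```python
from collections import deque
from typing import Optional


class LinkTables:
    """Three tables: potential links, traversed links, topically rejected links."""

    def __init__(self, seed: str) -> None:
        self.potential = deque([seed])   # first table: links waiting to be visited
        self.traversed = set()           # second table: links already visited
        self.rejected = set()            # third table: visited and topically rejected

    def add_potential(self, link: str) -> None:
        """Queue a link unless it is already traversed or already queued."""
        if link not in self.traversed and link not in self.potential:
            self.potential.append(link)

    def next_link(self) -> Optional[str]:
        """Return the next link that has not been traversed, or None if none remain."""
        while self.potential:
            link = self.potential.popleft()
            if link not in self.traversed:   # compare first table against second
                return link
        return None

    def mark(self, link: str, conforming: bool) -> None:
        """Record a visited link; non-conforming links also go to the rejected table."""
        self.traversed.add(link)
        if not conforming:
            self.rejected.add(link)
```

Keeping rejected links in their own table allows a later pass to re-analyze them, for example after the topical signature has been revised, without re-traversing accepted pages.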
  • page capture and decomposition operation 604 retrieves the document located at the site and parses the information. This operation may involve an in-depth lexical analysis, or other analysis of the document to extract a "signature" for the document.
  • the signature is reflective of the subject matter or content of the document.
  • operation 606 performs a comparison on the signature that has been generated by operation 604.
  • the filtering operation 606 may be any method suitable for the comparison of the document "linguistic signature" to a pre-determined class, category, subject or topic "linguistic signature", so as to determine within some specified level of precision, the membership of the subject document within the subject class.
  • the method references any means suitable to allow a determination of whether a document falls within, or out of, a particular pre-specified class, topic, subject or category.
  • the filtering operation 606 utilizes a linguistic signature to determine conformity of collected data sets to preexisting human-derived topic, category, class or subject cognitive criteria. For example, one use for this system is the automated production of an information resource similar to a content-based Web Directory.
  • the filtering step 606 may compare the document signature with a predefined signature to produce a weighted score related to the probable degree of relevance for the document.
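  • One possible way to produce such a weighted score is cosine similarity between term-weight vectors. The disclosure does not prescribe this measure; it is substituted here purely as a familiar example, and the names `term_vector` and `cosine_score` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Mapping


def term_vector(text: str) -> Counter:
    """A very simple document signature: lower-cased word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_score(doc_sig: Mapping[str, float], topic_sig: Mapping[str, float]) -> float:
    """Cosine similarity between a document signature and a topic signature.

    Returns a value in [0, 1] for non-negative weights; 0.0 if either is empty.
    """
    shared = set(doc_sig) & set(topic_sig)
    dot = sum(doc_sig[t] * topic_sig[t] for t in shared)
    norm_doc = math.sqrt(sum(v * v for v in doc_sig.values()))
    norm_topic = math.sqrt(sum(v * v for v in topic_sig.values()))
    if norm_doc == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_doc * norm_topic)
```

For example, a topic signature might be given as assumed weights such as `{"golf": 3.0, "fairway": 2.0, "putt": 1.0}` and compared against `term_vector(document_text)` for each retrieved document.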
  • personnel responsible for the data structure may decide what topic(s) the data structure should include and what untargeted topic(s) may use language similar to that of the target topic(s).
  • a definition of the goals for the inclusion filters and exclusion filters for the topical data structure is generated.
  • for example, a topical database for the topic of golf, i.e., the game, may require the inclusion of documents having the word "golf" in them, unless they refer to the cars named GOLF that are made by Volkswagen.
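  • The golf example can be expressed as a pair of inclusion and exclusion term sets. The sketch below is illustrative only; the specific exclusion vocabulary (`volkswagen`, `vw`, `hatchback`) and the function name `accept_for_golf_topic` are assumptions, and a practical filter would use a richer signature than single words.

```python
import re

INCLUDE_TERMS = {"golf"}
# Assumed exclusion vocabulary for the untargeted topic (the car).
EXCLUDE_TERMS = {"volkswagen", "vw", "hatchback"}


def accept_for_golf_topic(text: str) -> bool:
    """Accept documents that mention golf unless exclusion terms also appear."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_topic_term = bool(words & INCLUDE_TERMS)
    has_excluded_term = bool(words & EXCLUDE_TERMS)
    return has_topic_term and not has_excluded_term
```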
  • This process may involve the selection by the database collection personnel of one or more electronic texts as representative of the topic selected.
  • These documents may be manually selected or automatically selected from a web directory or other search resource that can provide topically representative documents.
  • a class, category, subject or topic may be identified by human judgment or agency, or may be identified as a measure of relatedness, similarity, or clustering of a group of documents.
  • it may be important to select documents representative of the exclusions that are identified by the database personnel and to place these into separate corpora for analysis.
  • Such topics and documents may use overlapping terminology but are not targeted by the topical database. Generally, more than one document will be required to form a corpus of documents for analysis.
  • topical document collections are then analyzed for a lexical signature.
  • the ability to differentiate, select or reject a document based on its content requires the use of such signature data for differentiation.
  • this signature refers to any of a class of processes for the mathematical, logical, or linguistic extraction and characterization of document, atomic, molecular or elemental components (words, lexes, associative patterns, frequencies, word clusters, word class relationships, etc) to produce a set of differentiating representations or characteristics.
  • the sample documents are analyzed using some form of quantitative or semi-quantitative analysis beyond that provided by the simple presence or absence of a keyword, a group of keywords, or Boolean expressions that are derived by qualitative analysis of the topic by the database collection personnel.
  • the relationships between words and non-lexical features of the document may also be analyzed for features of a signature.
  • a simple signature may be expressed as a simple list of keywords extracted from the representative document(s). In this case, it is preferable that a minimum of three keywords be used to provide the most basic data for a Boolean-logic-based filter for the presence or absence of keywords in any given document.
  • the previously mentioned quantitative and semi-quantitative methods should be employed to extract or assist in the extraction of meaningful lexical features of the signature.
  • the signature extraction process produces a series of features of the document. These features can then be applied within the topical filter.
  • the filter process may involve application of the feature extraction process in reverse.
  • the filter process does not have to use the same analysis as that used to extract the signature. For example, a keyword frequency analysis could be employed to extract the lexical signature and then those keywords could be employed in a Boolean filter, a co-association matrix, or may be extended using a semantic nearness function. Not every type of extracted feature in a signature will be able to be employed in every type of possible topical filter. Therefore, if a particular type of topical filter is to be used, it is important to make sure the feature extraction method used will produce features that are compatible with the filter and vice versa.
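  • The combination mentioned above, a keyword frequency analysis for extraction followed by a Boolean presence filter, might look like the following sketch. The stopword list, the `min_hits` parameter, and the function names are assumptions; the lower bound of three keywords follows the preference stated earlier for a simple keyword signature.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for", "with"}


def tokenize(text: str) -> List[str]:
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def extract_signature(corpus: Iterable[str], size: int = 3) -> List[str]:
    """Keyword frequency analysis: the `size` most frequent non-stopwords."""
    if size < 3:
        raise ValueError("a keyword signature should contain at least three keywords")
    counts: Counter = Counter()
    for document in corpus:
        counts.update(tokenize(document))
    return [word for word, _ in counts.most_common(size)]


def boolean_filter(text: str, signature: List[str], min_hits: int = 2) -> bool:
    """Boolean presence test: accept if at least `min_hits` keywords occur."""
    words = set(tokenize(text))
    return sum(1 for keyword in signature if keyword in words) >= min_hits
```

Here the extraction step is statistical while the filter step is purely Boolean, which illustrates that the two analyses need only share compatible features, not the same method.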
  • more than one filter may be employed in this step of the process.
  • An array of topical filters may be employed for document analysis for both the inclusion and exclusion of pages into the topical database. Additional topical filters may also generate lexical metrics about the pages at this step in the process to be associated with the document into the topical database. These additional topical filters need not necessarily be part of the acceptance/rejection of the document into the topical database.
  • the process determines, at step 608, whether the document meets the requisite criteria to be accepted (included) or rejected (excluded).
  • the filtering step produces a topical relevancy score and operation 608 compares the topical relevancy score against a minimum threshold value. If the score for the document is above the minimum threshold value, the document is determined to meet the criteria. In such a case, flow branches YES and the document is added to the conforming list at add operation 610.
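The threshold test at operation 608 can be sketched as follows. The scoring function here (the fraction of signature keywords found in the document) is only an assumed stand-in for whatever topical filter produces the relevancy score, and all names and the default threshold are hypothetical.

```python
def relevancy_score(document_words, signature):
    """Assumed metric: fraction of signature keywords present in the document."""
    if not signature:
        return 0.0
    words = set(document_words)
    return sum(1 for keyword in signature if keyword in words) / len(signature)


def meets_criteria(score, min_threshold):
    """Operation 608: branch YES only when the score is above the threshold."""
    return score > min_threshold


def filter_document(url, document_words, signature, conforming, min_threshold=0.5):
    """Score one document and, on YES, add its link to the conforming list (610)."""
    score = relevancy_score(document_words, signature)
    if meets_criteria(score, min_threshold):
        conforming.append(url)
        return True
    return False
```

Because the comparison is strict, a document whose score equals the threshold is rejected; raising `min_threshold` makes acceptance more stringent.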
  • identify link act 612 identifies the next link, typically a link on the conforming page. This link is provided to operation 602 and the process begins again. If there are no links on the conforming page that have not been analyzed, then the identify link act 612 recursively determines the next link to analyze.
  • Determination step 614 determines whether links on the non-conforming page should be analyzed. This determination involves a comparison of the depth level for non-conforming documents to a predetermined number of levels to be searched. For example, the process may be configured to not analyze any sub-links on a non-conforming page, and therefore the predetermined number of levels would equal one. In such a case, determination step 614 would always branch YES since the current document is, by definition, at level one. However, if the predetermined level were set to two, then the sub-pages of a first non-conforming page would be analyzed.
  • three-level (and deeper) traversal matching, which allows for conditional or transient acceptance of non-conformal links, can also be specified for the spider. In such a case, more links may be discovered. However, each level retained requires additional processing and memory capacity, and contributes to the growth of the link-validation burden. Decisions as to the number of levels to be traversed will depend upon the density of information sources and the degree of completeness desired for the topical information space being developed. Such a system allows for the retention of link trails or threads through three or more non-conformal layers. It is important to note that continued traversal of the link thread does not imply the retention or recording of the non-conformal pages in the information retrieval data structure.
  • mark level operation 616 provides for the identification of the link as a sub-link of a non-conforming page, thereby allowing later analysis of link levels relative to a non-conforming parent page.
  • identify next link 612 identifies the next link to be analyzed and passes it to operation 602 and the process 600 is repeated.
  • Determination module 618 determines whether all the links have been analyzed. This module may be configured to stop after a predetermined number of documents have been analyzed or collected. Otherwise, this module may be configured to quit only once each and every document in the system has been analyzed. If there are more documents to be analyzed, which is typically the case in large-scale information systems such as the Web, then flow branches NO to operation 620. Operation 620 recursively determines the next link and passes it to operation 602, and the process 600 is repeated. If determination module 618 determines that all the potential links have been analyzed, then flow branches YES to end step 622 and process 600 ends.
  • the conforming list created at operation 610 comprises the link, or URL, for all the items that are added to the topical database 312, Figs. 4 and 5.
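The traversal of process 600 can be illustrated with a small sketch. It makes several assumptions that are not in the specification: the Web is modelled as an in-memory dictionary mapping each link to its text and outgoing links, the recursive link selection is replaced by a breadth-first queue, and every function and parameter name is hypothetical. Non-conforming pages are traversed up to `max_levels` deep but are never added to the conforming list.

```python
from collections import deque


def crawl(web, seed, is_conforming, max_levels=1, max_documents=None):
    """Collect links to conforming pages, starting from `seed`.

    web           -- dict: link -> (text, list of outgoing links); stands in for the Web
    is_conforming -- callable(text) -> bool; stands in for the topical filter (608)
    max_levels    -- predetermined number of non-conforming levels (step 614);
                     1 means sub-links of a non-conforming page are not analyzed
    max_documents -- optional stop condition on documents analyzed (module 618)
    """
    conforming = []             # conforming list of operation 610
    seen = {seed}               # each link is analyzed at most once
    queue = deque([(seed, 0)])  # (link, consecutive non-conforming levels above it)
    analyzed = 0
    while queue:
        if max_documents is not None and analyzed >= max_documents:
            break
        url, parent_level = queue.popleft()
        page = web.get(url)
        if page is None:        # dead link: nothing to analyze
            continue
        text, links = page
        analyzed += 1
        if is_conforming(text):
            conforming.append(url)
            child_level = 0     # a conforming page resets the level count
        else:
            level = parent_level + 1   # mark level (616)
            if level >= max_levels:    # step 614: stop following this thread
                continue
            child_level = level
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, child_level))
    return conforming
```

One simplification to note: because each link is analyzed only once, its level is fixed by the first path that reaches it. Raising `max_levels` lets the spider pass through more consecutive non-conforming pages to reach topical regions that are not directly inter-linked, at the cost of analyzing more documents.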
  • This module performs a more intensive analysis on the document, as opposed to merely comparing a signature for the document to a template.
  • the full analysis may comprise lexie identification, grouping, correlation, pattern recognition, pattern matching, fitting and other analysis techniques.
  • the page is either determined to be in or out of topic. If it is out of topic, the page is rejected as described above at step 608 and flow branches to operation 614. If it is determined to be in topic, then the page is forwarded to the topical database. Additionally, the page may be forwarded to a topical hierarchy directory interface and potentially a learning engine of strategy level modeling or a neural network for pattern recognition.
  • the information retrieval system may operate in the conventional manner. However, since only topically related information exists in the database, the system is more likely to produce relevant information. Also, since the database is not filled with a significantly large amount of irrelevant data, the results of query searches are more complete as well. That is, since the invention allows for the discovery and inclusion of defined subsets of resources, differentiated from other unrelated resources, in an automated or semi-automated manner, a high relevancy resource is generated.
  • the depth or completeness achieved by this system can be as great as or greater than that provided by a typical, prior-art Web directory approach.
  • the sources discovered and collected by this process may be incorporated into any conventional information retrieval system, may be subject to further processing, ordering, characterization, or organization, and may be presented as either a directory hierarchy or as a searchable data structure.
  • a significant benefit derived from the present invention relates to the fact that the constrained content approach removes a very large portion of the processing burden from the information retrieval internal system, placing it instead on an exogenous filter system. Additionally the reduced number of entries, and the tighter linguistic and topical focus of the entries, allows for specialized and more efficient processing functions.
  • topical differentiation also has important advantages in the areas of information organization, refinement, and presentation.
  • the system may take advantage of "natural" or common usage methods for organizing collected information derived from the topic area itself.
  • the specialized uses of language often associated with specific topics can be used by this system as guides and markers to refine and differentiate topical groupings.
  • this specialized usage is a significant contributor to the noise and imprecision within the process.
  • the use of a topical format lends itself readily to thematic graphical and design expression for display and presentation within the context of the specific topic.
  • the invention disclosed here is distinct from prior teaching within this field in that it parses or segments the processing of information into separate pre-acceptance and information retrieval system stages, resulting in a substantial and useful change in the processing profile and capabilities for large scale Web or Internet search resources.
  • Another aspect of this system is the ability to control the degree of precision used to select or reject links or documents. This is accomplished by selecting the degree of precision of the linguistic signature applied, and by the stringency of conformity required for acceptance. Additionally this system allows for the ability to specify immediate rejection of a link thread on the basis of page non-conformity or to allow the link thread to be explored despite page non-conformity. Links may be followed despite non-conformal page status for any specified number of steps or layers, or indefinitely, without the collection of non-conformal pages, so as to discover discontinuous regions (non-topically inter-linked) of a topical information space. This method allows the system to "jump" over intervening or blocking pages to any prescribed depth.
  • the purpose of this approach is to insulate and protect the system from the burden of undifferentiated data sets. This method reduces the number of instances that the information retrieval system must process, prior to its being exposed to them. This approach also narrows and focuses the range of operations required of the information retrieval system through the imposition of a topic, class, category or subject limitation. These modifications from standard search practice serve to substantially reduce the processing overhead and burden, allowing for substantial improvement in performance.
  • the present invention is the method, apparatus, computer storage medium or propagated signal containing a computer program for providing a discovery and collection system for creating a topical database as recited within the claims attached hereto.
  • the present invention is presently embodied as a method, apparatus, computer storage medium or propagated signal containing a computer program for traversing the Web, analyzing sites and/or documents and delivering only relevant documents to a database. Additionally, the system may restrict or confine the paths that are traversed in the Web using relevancy information. While the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various other changes in the form and details may be made therein without departing from the spirit and scope of the invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention relates to an automated system (200) and method for creating a topical data structure for documents and other media from an interconnected system of documentation such as the Web (114) and/or the Internet. This data structure (110) can be searched using conventional information and yields highly relevant results. The system (200) automatically locates and collects information related to the topic by analyzing each traversed document and determining whether or not it is related to that topic before adding it to the data structure (110). The topic-related information can also be used to identify traversal paths that are more likely to be related to the topic.
EP00935871A 1999-05-07 2000-05-05 Method and system for creating a topical data structure Withdrawn EP1208470A1 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US13320199P 1999-05-07 1999-05-07
US133201P 1999-05-07
PCT/US2000/012396 WO2000068837A1 (fr) 1999-05-07 2000-05-05 Method and system for creating a topical data structure

Publications (1)

Publication Number Publication Date
EP1208470A1 (fr) 2002-05-29

Family

ID=22457466

Family Applications (1)

Application Number Title Priority Date Filing Date
EP00935871A Withdrawn EP1208470A1 (fr) 1999-05-07 2000-05-05 Method and system for creating a topical data structure

Country Status (4)

Country Link
EP (1) EP1208470A1 (fr)
AU (1) AU5126700A (fr)
CA (1) CA2373457A1 (fr)
WO (1) WO2000068837A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080256065A1 (en) * 2005-10-14 2008-10-16 Jonathan Baxter Information Extraction System
WO2018150434A1 (fr) * 2017-02-14 2018-08-23 Bhalerao Mrunmayee Milind Table des matières et moteur de recherche à deux niveaux à base indexée

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5408655A (en) * 1989-02-27 1995-04-18 Apple Computer, Inc. User interface system and method for traversing a database
US5717913A (en) * 1995-01-03 1998-02-10 University Of Central Florida Method for detecting and extracting text data using database schemas
US6067552A (en) * 1995-08-21 2000-05-23 Cnet, Inc. User interface system and method for browsing a hypertext database
US5867799A (en) * 1996-04-04 1999-02-02 Lang; Andrew K. Information system and method for filtering a massive flow of information entities to meet user information classification needs

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0068837A1 *

Also Published As

Publication number Publication date
CA2373457A1 (fr) 2000-11-16
WO2000068837A1 (fr) 2000-11-16
AU5126700A (en) 2000-11-21

Similar Documents

Publication Publication Date Title
US20020103809A1 (en) Combinatorial query generating system and method
Tao et al. A personalized ontology model for web information gathering
Chen et al. Trailblazing the literature of hypertext: author co-citation analysis (1989–1998)
US20060026152A1 (en) Query-based snippet clustering for search result grouping
WO2010105218A2 (fr) Système et procédé de recherche de connaissances
Ding et al. Bibliometric information retrieval system (BIRS): A web search interface utilizing bibliometric research results
Kennedy et al. Query-adaptive fusion for multimodal search
Wolfram The symbiotic relationship between information retrieval and informetrics
Syn et al. Finding subject terms for classificatory metadata from user‐generated social tags
Agosti et al. Information retrieval on the web
Barifah et al. Exploring usage patterns of a large-scale digital library
Liu et al. Visualizing document classification: A search aid for the digital library
White et al. Media monitoring using social networks
WO2001039008A1 (fr) Procede et systeme de collecte de ressources par sujet
El Wakil Introducing text mining
EP1208470A1 (fr) Method and system for creating a topical data structure
Syn et al. Tags as keywords–comparison of the relative quality of tags and keywords
Sommaruga et al. “Tagsonomy”: Easy Access to Web Sites through a Combination of Taxonomy and Folksonomy
Cui et al. Hierarchical structural approach to improving the browsability of web search engine results
Nowick et al. A model search engine based on cluster analysis of user search terms
Diederich et al. The semantic growbag demonstrator for automatically organizing topic facets
Ahmed et al. Web Content Mining: A Solution to Consumer’s Product Hunt
Bergholz et al. Using query probing to identify query language features on the Web
Lafia et al. Exploratory and directed search strategies at a social science data archive
Tang et al. A visual exploratory search engine solution based on cloud computing

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20011204

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20031202