EP1269355A1 - Hypermedia resource search engine and related indexing method - Google Patents
Info
- Publication number
- EP1269355A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- resources
- resource
- main
- indexing
- dependent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
- G06F16/94—Hypermedia
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9532—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Definitions
- the present invention relates to a search engine comprising on the one hand a module for indexing resources accessible on a computer network for the creation and updating of an indexing base, on the other hand a module for searching resources on the network adapted to interrogate the indexing base from a request made by a user and to provide, in response, the universal URL address of the resources corresponding to the request, the indexing module comprising means for collection of main resources, means of extracting dependent resources from the main resources and means of indexing the resources to extract descriptors therefrom.
- the indexing module automatically collects the resources accessible at these addresses;
- the indexing means extract from each of these resources an index by associating with it a set of words characterizing its content;
- the extraction means extract from each resource previously indexed the set of universal URL addresses from the hypertext links which they contain, thus making it possible to add new URL addresses to the initial list.
- the process can be repeated to obtain a very large number of indexed resources in the end.
- this loop is executed periodically in order to update the indexing base according to the evolution of the content of the resources of the initial list, as well as the appearance of new links.
- In response to a request made by a user, the search engine returns the universal URL addresses of the resources corresponding to the request, ordering them by means of a word-counting system in the indexing base. In most cases, it returns thousands of responses for a single request. In addition, the order in which these responses are presented does not always solve the problem of searching through so many resources.
- the invention aims to remedy the drawbacks of conventional search engines by creating a search engine giving access to numerous resources while improving the quality of the responses provided, in particular according to the needs of the user.
- the subject of the invention is therefore a search engine of the aforementioned type, characterized in that the indexing module also comprises means for associating each dependent resource with at most one main resource as a function of the hypertext type links between these dependent resources and the main resource
- main resources of a first information base are collected and indexed. This is supplemented by a large number of resources identified from the hypertext links present in the main resources.
- the indexing module includes means for transferring a copy of the descriptors from the main resources to the dependent resources associated with them,
- the search module comprises means for filtering a resource indexed by the indexing module, by combined processing of the descriptors extracted from this resource and of the descriptors transferred to this resource;
- the search module is adapted to provide, in response to a request, the universal URL address of a dependent resource corresponding to the request, together with the hypertext link of the main resource associated with this dependent resource;
- the association means comprise means for selecting at most one main resource from a set of main resources capable of being associated with a dependent resource, by minimizing a distance calculated between the dependent resource and each main resource;
- the distance between two resources is a decreasing function of the number of common directories between the universal URL addresses of the two resources
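The text only requires the distance to be a decreasing function of the number of directories shared by two URLs. A minimal sketch of one such function follows. The names `directories`, `common_directories` and `distance`, the choice of `1 / (1 + n)`, and the decision to count no common directories across different hosts are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def directories(url):
    """Return (host, directory segments) for a URL, ignoring a trailing file name."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: when the path does not end with "/", the last segment is a file name.
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]
    return parts.netloc.lower(), segments


def common_directories(url_a, url_b):
    """Count the leading directories shared by two URLs on the same host."""
    host_a, dirs_a = directories(url_a)
    host_b, dirs_b = directories(url_b)
    if host_a != host_b:
        return 0  # assumption: different hosts share no directories
    count = 0
    for a, b in zip(dirs_a, dirs_b):
        if a != b:
            break
        count += 1
    return count


def distance(url_a, url_b):
    """One possible decreasing function of the number of common directories."""
    return 1.0 / (1 + common_directories(url_a, url_b))
```

Any other strictly decreasing function would satisfy the same description; only the ordering of the candidate main resources matters for the selection.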
- the invention also relates to a method of indexing resources accessible on a computer network for the creation and updating of an indexing base comprising the following steps
- the indexing method according to the invention may also include a step of excluding, from the indexing base, any dependent resource not associated with a main resource.
- FIG. 1 is a diagram illustrating the general structure of a search engine according to the invention
- FIG. 2 is a diagram illustrating the operation of a search engine according to the invention
- FIG. 3 is a flowchart detailing the operation of means for associating a dependent resource with at most one main resource, of a search engine according to the invention
- a search engine according to the invention, represented in FIG. 1, comprises a server 2 connected, via the Internet, on the one hand to a database 4 constituted by the World Wide Web, conventionally called the Web, and on the other hand to an access terminal 6 of a user searching for resources available on the Web
- the server 2 comprises a database 8 of directories
- a directory comprises a restricted set of universal URL addresses of main resources, each corresponding to the first page of a multimedia document
- These main resources are associated with external descriptors, for example recorded manually by librarians possibly assisted by computer tools
- These external descriptors correspond to a classification in a nomenclature of themes, to a title, to a textual presentation of the main resources, more generally to information specifying the context of the documents considered
- the server 2 also includes an indexing base 10, comprising all of the descriptors of the resources accessible by the search engine. It notably includes the external descriptors of the main resources as described above.
- the server 2 also includes an indexing module 12, comprising means for automatic indexing of resources. These are capable of extracting external descriptors by analyzing the content of the resources, in a conventional manner.
- This module also includes means for associating dependent resources with a main resource and for transferring the external descriptors of a main resource to its dependent resources. The operation of this module is detailed below, in the description of FIG. 2.
- the indexing module is therefore connected as input to the directory database 8 as well as to the Web 4, in order to access resources and, at output, to the indexing base 10, for the supply of descriptors.
- the server 2 finally comprises a search module 14 connected on the one hand to the indexing base 10 and on the other hand to the access terminal 6, in order to supply a user with relevant resources in response to a request from that user.
- the indexing module 12 proceeds to register descriptors in the indexing base 10, in several stages.
- the indexing module 12 accesses the main resources accessible on the Web 4, by receiving as input their universal addresses URL, stored in the database 8 of directories.
- the extraction means extract from each main resource all of the universal URL addresses of the hypertext links that they contain. New, dependent resources are thus recovered, from which we can again extract the universal URL addresses from the hypertext links that they themselves contain.
- This recursive method of extracting dependent resources from a first set of main resources is known from the state of the art. Said first set, conventionally called seed, is here extracted from the directory database 8.
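The recursive extraction from a seed can be sketched as a breadth-first traversal. This is only an illustration of the conventional technique the text refers to: `LinkParser`, `extract_links`, `crawl`, the `fetch` callable and the `max_depth` limit are hypothetical names and parameters, and pages are supplied by the caller so that the sketch does no network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of the hypertext links found in a page."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(seed_urls, fetch, max_depth=2):
    """Follow links from the seed; return {url: number of links followed to reach it}.

    fetch(url) must return the page HTML, or None when the page is unavailable.
    max_depth is an illustrative cut-off, not a value taken from the text.
    """
    depth = {url: 0 for url in seed_urls}
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        if depth[url] >= max_depth:
            continue
        html = fetch(url)
        if html is None:
            continue
        for link in extract_links(url, html):
            if link not in depth:  # each resource is recorded once
                depth[link] = depth[url] + 1
                queue.append(link)
    return depth
```

With an in-memory dictionary of pages, `crawl(["http://example.org/"], pages.get)` returns the seed at depth 0 and every dependent resource reachable within `max_depth` links.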
- extraction means associate each dependent resource with at most one main resource. This association is a function of the number, type or any attribute of the hypertext links that must be followed to reach the dependent resource from the universal URL address of the main resource. At the end of this step, the dependent resources not associated with a main resource are eliminated. The process will be detailed during the description of FIG. 3.
- transfer means copy the external descriptors of each main resource and transfer them to all the dependent resources associated with it.
- the indexing means extract descriptors automatically for each resource.
- the indexing module 12 stores in the indexing base 10 the descriptors relating to each resource, these comprising the descriptors extracted automatically as well as the external descriptors transferred by copy to a dependent resource from the main resource associated with this dependent resource, or directly extracted from the directory database 8 for a main resource.
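The storage stage described above can be sketched as follows, assuming the association between dependent and main resources has already been computed. The function `build_index` and the dictionary layout (`extracted`, `external`, `main`) are hypothetical choices for illustration, not structures named in the text.

```python
def build_index(main_descriptors, association, extracted):
    """Build an indexing base combining extracted and external descriptors.

    main_descriptors: {main_url: {descriptor_name: value}} from the directory database
    association:      {dependent_url: main_url} for associated dependent resources
    extracted:        {url: iterable of words extracted automatically from the resource}
    """
    index = {}
    for url, words in extracted.items():
        if url in main_descriptors:
            # Main resource: external descriptors come straight from the directory.
            external = dict(main_descriptors[url])
        elif url in association:
            # Dependent resource: receives a copy of its main resource's descriptors.
            external = dict(main_descriptors[association[url]])
        else:
            # Dependent resource with no associated main resource: excluded.
            continue
        index[url] = {
            "extracted": set(words),
            "external": external,
            "main": association.get(url),  # None for a main resource
        }
    return index
```

Copying the descriptors, rather than sharing one dictionary, mirrors the "transfer by copy" wording: each indexed resource carries its own set of external descriptors.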
- This request form takes the form of an HTML presentation page. It allows the user to enter at least one keyword and to specify the context of his search by selecting the values of a certain number of descriptors from a list offered.
- the descriptors of the proposed list correspond to at least some of the external descriptors stored in the database 8 of directories and describing the main resources. They make it possible, for example, to specify a search domain, the age range of the user, etc. These details allow the search module to filter the resources corresponding to the keywords of the query.
- the responses therefore consist of the main and dependent resources having extracted descriptors corresponding to the keywords and external descriptor values corresponding to those selected by the user.
- each dependent resource returned by the search module to the user, is accompanied by a hypertext link to the main resource associated with this dependent resource.
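A minimal sketch of this filtering is shown below. It assumes an indexing base shaped as `{url: {"extracted": set of words, "external": dict of descriptor values, "main": main URL or None}}`; that layout, the function name `search`, and the choice to require every keyword to match are assumptions made for illustration.

```python
def search(index, keywords, context):
    """Return (url, main_url) pairs matching the keywords and the selected context.

    keywords: words the extracted descriptors must contain (assumed: all of them)
    context:  {descriptor_name: value} selected by the user on the request form
    """
    wanted = {k.lower() for k in keywords}
    results = []
    for url, entry in index.items():
        # Filter on the descriptors extracted automatically from the resource.
        if not wanted <= entry["extracted"]:
            continue
        # Filter on the external descriptors, own or transferred from the main resource.
        if any(entry["external"].get(name) != value for name, value in context.items()):
            continue
        # A dependent resource is returned with the link to its main resource.
        results.append((url, entry["main"]))
    return results
```

For a main resource the second element of the pair is `None`, since it has no main resource of its own to link to.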
- the method of associating a dependent resource with at most one main resource, among a set of N main resources, is in accordance with the flow diagram represented in FIG. 3
- An initialization step 100 initializes an index i to 1 and a counter L to zero
- an analysis step 102 identifies a path, that is to say a series of hypertext links, which must be followed in order to reach the dependent resource from the universal address URL of the i-th main resource.
- a series of p steps 104_1, ..., 104_p applies a set of rules relating to the path identified in step 102, and more particularly to the number of links, their type and their attributes.
- If one of these rules is not satisfied, the method proceeds to a step 108. If all the rules are satisfied, the i-th main resource is temporarily associated with the dependent resource and the method proceeds to a step 106.
- a rule is for example “the number of links is less than or equal to 4”, “no link is of external type”, etc.
- Step 106 increments the value of the counter L by one unit, so that L gives the number of main resources associated with the dependent resource, and passes the method on to step 108.
- the looping step 108 tests the value of the index i. If this index is strictly less than N, the method goes to step 110; otherwise, that is to say if i is equal to N, the method goes to step 112.
- Step 110 increments the value of the index i by one unit and returns the method to step 102.
- Step 112 tests the value of the counter L. If L is equal to 0, the method proceeds to a step 114. Otherwise, the method proceeds to a subsequent step 116.
- Exclusion step 114 removes the dependent resource from the indexing base and ends the association process for the dependent resource considered.
- Step 116 is a test step on the value of L. If L is strictly greater than 1, the method proceeds to step 118; otherwise it proceeds to a step 120.
- Step 118 selects, from the main resources temporarily associated with the dependent resource, the one that minimizes a distance relative to the dependent resource. This distance is a decreasing function of the number of common directories between the universal URL addresses of the two resources.
- the method then proceeds to step 120 if a single main resource is selected. If several main resources minimize the distance, the method proceeds to step 114.
- the end of process step 120 validates the association between the dependent resource and the single main resource selected.
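The flowchart of FIG. 3 can be sketched as a single function, with comments mapping each part to the step numbers in the text. The function `associate`, the `find_path` and `distance` callables, the representation of a path as a list of link dictionaries, and the two example rules are hypothetical. Treating a missing path as a failed rule is also an assumption.

```python
def associate(dependent_url, main_urls, find_path, rules, distance):
    """Return the single main resource to associate, or None to exclude the resource.

    find_path(main_url, dependent_url) returns the list of links to follow, or None.
    rules is a list of predicates applied to that path.
    distance(main_url, dependent_url) returns a number to minimize.
    """
    candidates = []  # step 100: the counter L corresponds to len(candidates)
    for main_url in main_urls:  # steps 108 and 110: loop over i = 1..N
        path = find_path(main_url, dependent_url)  # step 102
        if path is None:
            continue  # assumption: no path means the resource cannot be associated
        if all(rule(path) for rule in rules):  # steps 104_1 .. 104_p
            candidates.append(main_url)  # step 106: temporary association
    if not candidates:  # step 112: L == 0
        return None  # step 114: exclusion
    if len(candidates) == 1:  # step 116: L == 1
        return candidates[0]  # step 120: association validated
    # Step 118: keep the candidate that minimizes the distance.
    scored = [(distance(m, dependent_url), m) for m in candidates]
    best = min(d for d, _ in scored)
    winners = [m for d, m in scored if d == best]
    # A tie between several main resources leads to exclusion (step 114).
    return winners[0] if len(winners) == 1 else None


# Example rules modelled on the ones quoted in the text.
def at_most_four_links(path):
    return len(path) <= 4


def no_external_link(path):
    return not any(link["external"] for link in path)
```

Returning `None` for both the "no candidate" and the "tie" cases keeps the guarantee that a dependent resource is associated with at most one main resource.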
- a search engine according to the invention overcomes the drawbacks of conventional search engines. Indeed, an intelligent indexing of main resources, adapted to take into account the context of a request launched by a user, allows them to be classified into major categories and allows quality filtering of the responses to the request. In addition, this indexing is accompanied by the association of a very large number of dependent resources with each of these main resources, which increases the quantity of the responses provided while retaining their quality.
- Another advantage of this search engine is the possibility that it offers to present to a user a resource meeting the criteria of his request, accompanied by a more general main resource, explaining its context.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR0004419A FR2807537B1 (en) | 2000-04-06 | 2000-04-06 | HYPERMEDIA RESOURCE SEARCH ENGINE AND INDEXING METHOD THEREOF |
FR0004419 | 2000-04-06 | ||
PCT/FR2001/000998 WO2001077890A1 (en) | 2000-04-06 | 2001-04-03 | Hypermedia resource search engine and related indexing method |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1269355A1 | 2003-01-02 |
Family
ID=8848953
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP01921462A Withdrawn EP1269355A1 (en) | 2000-04-06 | 2001-04-03 | Hypermedia resource search engine and related indexing method |
Country Status (6)
Country | Link |
---|---|
US (1) | US20030187833A1 (en) |
EP (1) | EP1269355A1 (en) |
AU (1) | AU2001248451A1 (en) |
FR (1) | FR2807537B1 (en) |
PL (1) | PL359716A1 (en) |
WO (1) | WO2001077890A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8296304B2 (en) | 2004-01-26 | 2012-10-23 | International Business Machines Corporation | Method, system, and program for handling redirects in a search engine |
US7293005B2 (en) | 2004-01-26 | 2007-11-06 | International Business Machines Corporation | Pipelined architecture for global analysis and index building |
US7499913B2 (en) | 2004-01-26 | 2009-03-03 | International Business Machines Corporation | Method for handling anchor text |
US7424467B2 (en) | 2004-01-26 | 2008-09-09 | International Business Machines Corporation | Architecture for an indexer with fixed width sort and variable width sort |
US7461064B2 (en) | 2004-09-24 | 2008-12-02 | International Business Machines Corporation | Method for searching documents for ranges of numeric values |
US8417693B2 (en) | 2005-07-14 | 2013-04-09 | International Business Machines Corporation | Enforcing native access control to indexed documents |
CN103164435B (en) * | 2011-12-13 | 2016-03-09 | 北大方正集团有限公司 | A kind of acquisition method of network data and system |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AUPQ131399A0 (en) * | 1999-06-30 | 1999-07-22 | Silverbrook Research Pty Ltd | A method and apparatus (NPAGE02) |
US5841978A (en) * | 1993-11-18 | 1998-11-24 | Digimarc Corporation | Network linking method using steganographically embedded data objects |
US5761436A (en) * | 1996-07-01 | 1998-06-02 | Sun Microsystems, Inc. | Method and apparatus for combining truncated hyperlinks to form a hyperlink aggregate |
GB2328297B (en) * | 1997-08-13 | 2002-04-24 | Ibm | Text in anchor tag of hyperlink adjustable according to context |
US6336116B1 (en) * | 1998-08-06 | 2002-01-01 | Ryan Brown | Search and index hosting system |
US6772139B1 (en) * | 1998-10-05 | 2004-08-03 | Smith, Iii Julius O. | Method and apparatus for facilitating use of hypertext links on the world wide web |
US6490577B1 (en) * | 1999-04-01 | 2002-12-03 | Polyvista, Inc. | Search engine with user activity memory |
US7099898B1 (en) * | 1999-08-12 | 2006-08-29 | International Business Machines Corporation | Data access system |
-
2000
- 2000-04-06 FR FR0004419A patent/FR2807537B1/en not_active Expired - Fee Related
-
2001
- 2001-04-03 AU AU2001248451A patent/AU2001248451A1/en not_active Abandoned
- 2001-04-03 WO PCT/FR2001/000998 patent/WO2001077890A1/en active Application Filing
- 2001-04-03 PL PL35971601A patent/PL359716A1/en not_active Application Discontinuation
- 2001-04-03 US US10/240,720 patent/US20030187833A1/en not_active Abandoned
- 2001-04-03 EP EP01921462A patent/EP1269355A1/en not_active Withdrawn
Non-Patent Citations (1)
Title |
---|
See references of WO0177890A1 * |
Also Published As
Publication number | Publication date |
---|---|
AU2001248451A1 (en) | 2001-10-23 |
FR2807537B1 (en) | 2003-10-17 |
FR2807537A1 (en) | 2001-10-12 |
PL359716A1 (en) | 2004-09-06 |
US20030187833A1 (en) | 2003-10-02 |
WO2001077890A1 (en) | 2001-10-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4722051B2 (en) | System and method for search query processing using trend analysis | |
JP3673487B2 (en) | Hierarchical statistical analysis system and method | |
JP3936243B2 (en) | Method and system for segmenting and identifying events in an image using voice annotation | |
US6904560B1 (en) | Identifying key images in a document in correspondence to document text | |
CN100405371C (en) | Method and system for abstracting new word | |
US20110238694A1 (en) | System and Method for Matching Entities | |
US9785707B2 (en) | Method and system for converting audio text files originating from audio files to searchable text and for processing the searchable text | |
US20080189591A1 (en) | Method and system for generating a media presentation | |
EP1364316A2 (en) | Device for retrieving data from a knowledge-based text | |
JP5066963B2 (en) | Database construction device | |
US8751494B2 (en) | Constructing album data using discrete track data from multiple sources | |
Pramana et al. | Systematic literature review of stemming and lemmatization performance for sentence similarity | |
US20150294005A1 (en) | Method and device for acquiring information | |
CN100458788C (en) | Clustering method, searching method and system for interconnection network audio file | |
US20060253433A1 (en) | Method and apparatus for knowledge-based music searching and method and apparatus for managing music file | |
EP1269355A1 (en) | Hypermedia resource search engine and related indexing method | |
KR20040017824A (en) | Information search system which it follows in the Pattern-Forecast-Analysis to use the pattern of the web document and list | |
CN116401434A (en) | Intelligent network data information extraction system | |
KR20010105983A (en) | method of service providing on internet | |
EP1334444A1 (en) | Method for searching, selecting and mapping web pages | |
TWI290684B (en) | Incremental thesaurus construction method | |
WO2013117872A1 (en) | Method for identifying a set of sentences in a digital document, method for generating a digital document, and associated device | |
WO2004088542A1 (en) | A method of managing registered web sites in search engine and a system thereof | |
Khan | Structuring and querying personalized audio using ontologies | |
KR20240001769U (en) | User-customized keyword data analysis and information provision system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20021008 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR |
|
AX | Request for extension of the european patent |
Free format text: AL;LT;LV;MK;RO;SI |
|
17Q | First examination report despatched |
Effective date: 20030429 |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20080703 |