EP1269355A1 - Hypermedia resource search engine and related indexing method - Google Patents

Hypermedia resource search engine and related indexing method

Info

Publication number
EP1269355A1
EP1269355A1 EP01921462A EP01921462A EP1269355A1 EP 1269355 A1 EP1269355 A1 EP 1269355A1 EP 01921462 A EP01921462 A EP 01921462A EP 01921462 A EP01921462 A EP 01921462A EP 1269355 A1 EP1269355 A1 EP 1269355A1
Authority
EP
European Patent Office
Prior art keywords
resources
resource
main
indexing
dependent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP01921462A
Other languages
German (de)
French (fr)
Inventor
Michel Plu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orange SA
Original Assignee
France Telecom SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by France Telecom SA
Publication of EP1269355A1
Current legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G06F16/94 Hypermedia
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Definitions

  • the present invention relates to a search engine comprising on the one hand a module for indexing resources accessible on a computer network for the creation and updating of an indexing base, on the other hand a module for searching resources on the network adapted to interrogate the indexing base from a request made by a user and to provide, in response, the universal URL address of the resources corresponding to the request, the indexing module comprising means for collection of main resources, means of extracting dependent resources from the main resources and means of indexing the resources to extract descriptors therefrom.
  • the indexing module automatically collects the resources accessible at these addresses;
  • the indexing means extract from each of these resources an index by associating with it a set of words characterizing its content;
  • the extraction means extract from each resource previously indexed the set of universal URL addresses from the hypertext links which they contain, thus making it possible to add new URL addresses to the initial list.
  • the process can be repeated to obtain a very large number of indexed resources in the end.
  • this loop is executed periodically in order to update the indexing base according to the evolution of the content of the resources of the initial list, as well as the appearance of new links.
  • In response to a request made by a user, the search engine returns the universal URL addresses of the resources corresponding to the request, ordering them by means of a word-counting system in the indexing base. It then most often returns thousands of responses for a single request. In addition, the order in which these responses are presented does not always solve the problem of searching among these overly numerous resources.
  • the invention aims to remedy the drawbacks of conventional search engines by creating a search engine giving access to numerous resources while improving the quality of the responses provided, in particular according to the needs of the user.
  • the subject of the invention is therefore a search engine of the aforementioned type, characterized in that the indexing module also comprises means for associating each dependent resource with at most one main resource as a function of the hypertext type links between these dependent resources and the main resource
  • main resources of a first information base are collected and indexed. This is supplemented by a large number of resources identified from the hypertext links present in the main resources.
  • the indexing module includes means for transferring a copy of the descriptors from the main resources to the dependent resources associated with them,
  • the search module comprises means for filtering a resource indexed by the indexing module, by combined processing of the descriptors extracted from this resource and of the descriptors transferred to this resource; the search module is adapted to provide, in response to a request, the universal URL address of a dependent resource corresponding to the request, together with the hypertext link of the main resource associated with this dependent resource,
  • the association means comprise means for selecting at most one main resource from a set of main resources capable of being associated with a dependent resource, by minimizing a distance calculated between the dependent resource and each main resource;
  • the distance between two resources is a decreasing function of the number of common directories between the universal URL addresses of the two resources
  • the invention also relates to a method of indexing resources accessible on a computer network for the creation and updating of an indexing base comprising the following steps
  • the indexing method according to the invention may also include a step of excluding, from the indexing base, any dependent resource not associated with a main resource.
  • FIG. 1 is a diagram illustrating the general structure of a search engine according to the invention
  • FIG. 2 is a diagram illustrating the operation of a search engine according to the invention
  • FIG. 3 is a flowchart detailing the operation of means for associating a dependent resource with at most one main resource, of a search engine according to the invention
  • a search engine according to the invention, represented in FIG. 1, comprises a server 2 connected, by the Internet network, on the one hand to a database 4 constituted by the World Wide Web, conventionally called the Web, and on the other hand to an access terminal 6 of a user in search of resources available on the Web
  • the server 2 comprises a database 8 of directories
  • a directory comprises a restricted set of universal URL addresses of main resources, each corresponding to the first page of a multimedia document
  • These main resources are associated with external descriptors, for example recorded manually by librarians possibly assisted by computer tools
  • These external descriptors correspond to a classification in a nomenclature of themes, to a title, to a textual presentation of the main resources, more generally to information specifying the context of the documents considered
  • the server 2 also includes an indexing base 10, comprising all of the descriptors of the resources accessible by the search engine. It notably includes the external descriptors of the main resources as described above.
  • the server 2 also includes an indexing module 12, comprising means for the automatic indexing of resources. These are capable of extracting external descriptors by analyzing the content of the resources, in a conventional manner.
  • This module also includes a method for associating dependent resources with a main resource and for transferring the external descriptors of a main resource to its dependent resources. The operation of this module will be detailed below, in the description of FIG. 2.
  • the indexing module is therefore connected as input to the directory database 8 as well as to the Web 4, in order to access resources and, at output, to the indexing base 10, for the supply of descriptors.
  • the server 2 finally comprises a search module 14 connected on the one hand to the indexing base 10 and on the other hand to the access terminal 6, in order to supply a user with relevant resources in response to a request from that user.
  • the indexing module 12 proceeds to register descriptors in the indexing base 10, in several stages.
  • the indexing module 12 accesses the main resources accessible on the Web 4, by receiving as input their universal URL addresses, stored in the database 8 of directories.
  • the extraction means extract from each main resource all of the universal URL addresses of the hypertext links that they contain. New, dependent resources are thus recovered, from which we can again extract the universal URL addresses from the hypertext links that they themselves contain.
  • This recursive method of extracting dependent resources from a first set of main resources is known from the state of the art. Said first set, conventionally called seed, is here extracted from the directory database 8.
  • extraction means associate each dependent resource with at most one main resource. This association is a function of the number, type or any attribute of the hypertext links that must be followed to reach the dependent resource from the universal URL address of the main resource. At the end of this step, the dependent resources not associated with a main resource are eliminated. The process will be detailed during the description of FIG. 3.
  • transfer means copy the external descriptors of each main resource and transfer them to all the dependent resources associated with it.
  • the indexing means extract descriptors automatically for each resource.
  • the indexing module 12 stores in the indexing base 10 the descriptors relating to each resource, these comprising the descriptors extracted automatically as well as the external descriptors transferred by copy to a dependent resource from the main resource associated with this dependent resource, or directly extracted from the directory database 8 for a main resource.
  • This request form takes the form of an HTML presentation page. It allows the user to enter at least one keyword and to specify the context of his search by selecting the values of a certain number of descriptors from a list offered.
  • the descriptors of the proposed list correspond to at least some of the external descriptors stored in the database 8 of directories and describing the main resources. They allow for example to specify a research area, the age range of the user, etc. These details allow the search module to filter the resources corresponding to the keywords of the query.
  • the responses therefore consist of the main and dependent resources having extracted descriptors corresponding to the keywords and external descriptor values corresponding to those selected by the user.
  • each dependent resource returned by the search module to the user, is accompanied by a hypertext link to the main resource associated with this dependent resource.
  • the method of associating a dependent resource with at most one main resource, among a set of N main resources, is in accordance with the flow diagram represented in FIG. 3
  • An initialization step 100 initializes an index i to 1 and a counter L to zero.
  • an analysis step 102 identifies a path, that is to say a series of hypertext links, which must be followed in order to reach the dependent resource from the universal URL address of the i-th main resource.
  • a series of p steps, 104_1, ..., 104_p, constitutes a set of rules relating to the paths identified in step 102, and more particularly to the number of links, their type and their attributes.
  • If at least one rule is not satisfied, the method proceeds to a step 108. If all the rules are satisfied, the i-th main resource is temporarily associated with the dependent resource and the method proceeds to a step 106.
  • a rule is for example “the number of links is less than or equal to 4”, “no link is of external type”, etc.
  • Step 106 increments the value of the counter L by one unit, so that L gives the number of main resources associated with the dependent resource, and passes the method on to step 108.
  • the looping step 108 tests the value of the index i. If this index is strictly less than N, then the method goes to step 110; otherwise, that is to say if i is equal to N, the method goes to step 112.
  • Step 110 increments the value of the index i by one unit and returns the method to step 102.
  • Step 112 tests the value of the counter L. If L is equal to 0, then the method proceeds to a step 114. Otherwise, the method proceeds to a subsequent step 116.
  • Exclusion step 114 removes the dependent resource from the indexing base and ends the association process for the dependent resource considered.
  • Step 116 is a test step on the value of L. If L is strictly greater than 1, then the method proceeds to step 118; otherwise it proceeds to a step 120.
  • the step 118 selects, from the main resources temporarily associated with the dependent resource, the one that minimizes a distance relative to the dependent resource. This distance is a decreasing function of the number of common directories between the universal URL addresses of two resources.
  • the method then proceeds to step 120 if a single main resource is selected. If several main resources minimize the distance, the method proceeds to step 114.
  • the end of process step 120 validates the association between the dependent resource and the single main resource selected.
  • a search engine overcomes the drawbacks of conventional search engines. Indeed, an intelligent indexing of main resources, adapted to take into account the context of a request launched by a user, allows them to be classified into major categories and quality filtering of the responses to the request. In addition, this indexing is accompanied by the association of a very large number of dependent resources with each of these main resources, which improves the quantity while retaining the quality of the responses provided.
  • Another advantage of this search engine is the possibility that it offers to present to a user a resource meeting the criteria of his request, accompanied by a more general main resource, explaining its context.
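
The association flow of steps 100 to 120 above can be sketched in Python as follows. This is only an illustrative reading of the flowchart, not the applicant's implementation: the names `associate`, `path_satisfies_rules`, `find_path` and `distance` are hypothetical, a path is assumed to be a list of `(link_type, target_url)` pairs, and the two rules shown are the examples quoted in the text.

```python
MAX_LINKS = 4  # example rule from the text: at most 4 links on the path


def path_satisfies_rules(path):
    """Steps 104_1..104_p: check every rule against one path.

    `path` is assumed to be a list of (link_type, target_url) pairs,
    or None when no path from the main resource exists.
    """
    if path is None:
        return False
    no_external = all(kind != "external" for kind, _ in path)
    return len(path) <= MAX_LINKS and no_external


def associate(dependent_url, main_urls, find_path, distance):
    """Return the single associated main resource, or None (exclusion).

    `find_path(main_url, dependent_url)` and `distance(main_url,
    dependent_url)` are hypothetical callables supplied by the caller.
    """
    candidates = []                        # len(candidates) plays the role of L
    for main_url in main_urls:             # steps 100, 108, 110: i = 1..N
        path = find_path(main_url, dependent_url)    # step 102
        if path_satisfies_rules(path):     # steps 104_1..104_p
            candidates.append(main_url)    # step 106: temporary association
    if not candidates:                     # step 112: L == 0
        return None                        # step 114: exclude the resource
    if len(candidates) == 1:               # step 116: L == 1
        return candidates[0]               # step 120: validate
    # Step 118: keep the candidate that minimizes the distance.
    scored = sorted((distance(m, dependent_url), m) for m in candidates)
    if scored[0][0] == scored[1][0]:       # several minima: go to step 114
        return None
    return scored[0][1]                    # step 120: validate
```

Collecting the candidates in a list replaces the explicit index i and counter L of the flowchart; the outcome for each branch (exclusion, single candidate, distance tie) is intended to match the steps described.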

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention concerns a search engine comprising a module for indexing accessible resources on a computer network for creating and updating an indexing base, a module for searching resources on the network adapted to interrogate an indexing base from a request presented by a user and for supplying, in response, the Web address (URL) corresponding to the request, the indexing module comprising means for collecting main resources, means for retrieving dependent resources from the main resources and means for indexing resources to retrieve therefrom descriptors. Furthermore, the indexing module comprises means associating each dependent resource to at most one main resource depending on the hypermedia links between said dependent resources and the main resource.

Description

Hypermedia resource search engine and associated indexing method
The present invention relates to a search engine comprising, on the one hand, a module for indexing resources accessible on a computer network in order to create and update an indexing base and, on the other hand, a module for searching for resources on the network, adapted to query the indexing base from a request made by a user and to provide, in response, the universal URL address of the resources corresponding to the request, the indexing module comprising means for collecting main resources, means for extracting dependent resources from the main resources, and means for indexing the resources in order to extract descriptors from them.
Such search engines exist today. Among them, full-page search engines operate on the following principle:
- from an initial list of universal URL addresses, for example defined manually, the indexing module automatically collects the resources accessible at these addresses;
- the indexing means extract an index from each of these resources by associating with it a set of words characterizing its content; and
- the extraction means extract, from each previously indexed resource, all the universal URL addresses of the hypertext links it contains, which makes it possible to add new URL addresses to the initial list.
The process can thus be repeated to obtain, in the end, a very large number of indexed resources.
In addition, this loop is executed periodically in order to update the indexing base as the content of the resources in the initial list evolves and as new links appear.
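The collect, index and extract loop of a conventional engine, as described above, might look like the following minimal sketch. It assumes a hypothetical `fetch(url)` callable that returns a `(text, links)` pair or `None`, so that no real network access is implied; the word-set index and the `max_resources` cap are illustrative choices, not details taken from the text.

```python
import re
from collections import deque


def crawl_and_index(seed_urls, fetch, max_resources=1000):
    """Breadth-first collect/index/extract loop.

    `fetch(url)` is a hypothetical callable returning (text, link_urls),
    or None when the resource cannot be collected.
    """
    index = {}                  # url -> set of words characterizing the content
    queue = deque(seed_urls)    # the "initial list", extended as links are found
    seen = set(seed_urls)
    while queue and len(index) < max_resources:
        url = queue.popleft()
        fetched = fetch(url)                 # collection
        if fetched is None:
            continue
        text, links = fetched
        index[url] = set(re.findall(r"\w+", text.lower()))   # indexing
        for link in links:                   # extraction of new URL addresses
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

Running the same function again at intervals over the same seed list would correspond to the periodic update mentioned in the text.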
In response to a request made by a user, the search engine returns the universal URL addresses of the resources corresponding to the request, ordering them by means of a word-counting system in the indexing base. It then most often returns thousands of responses for a single request. Moreover, the order in which these responses are presented does not always solve the problem of searching among these overly numerous resources.
Indeed, this order does not correspond to the needs of the user, such as the intended use of the resources sought, the quality of information desired or any other personal criterion of the user.
Another problem with this type of search engine is that the responses provided give direct access to the content of resources whose appreciation by the user sometimes depends on having previously read other resources.
The invention aims to remedy the drawbacks of conventional search engines by creating a search engine that gives access to numerous resources while improving the quality of the responses provided, in particular according to the needs of the user.
The subject of the invention is therefore a search engine of the aforementioned type, characterized in that the indexing module further comprises means for associating each dependent resource with at most one main resource as a function of the hypertext links between these dependent resources and the main resource.
In this way, main resources of a first information base are collected and indexed. This base is supplemented by a large number of resources identified from the hypertext links present in the main resources.
The search engine according to the invention may also include one or more of the following characteristics:
- the indexing module comprises means for transferring a copy of the descriptors of the main resources to the dependent resources associated with them;
- the search module comprises means for filtering a resource indexed by the indexing module, by combined processing of the descriptors extracted from this resource and of the descriptors transferred to this resource;
- the search module is adapted to provide, in response to a request, the universal URL address of a dependent resource corresponding to the request, together with the hypertext link of the main resource associated with this dependent resource;
- the association means comprise means for selecting at most one main resource from among a set of main resources capable of being associated with a dependent resource, by minimizing a distance calculated between the dependent resource and each main resource; and
- the distance between two resources is a decreasing function of the number of common directories between the universal URL addresses of the two resources.
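The text only requires the distance to decrease as the number of common directories grows; it does not give a formula. The sketch below therefore makes two labeled assumptions: directories are compared from the left and only when both URLs share the same host, and the distance is taken as 1 / (1 + n), one possible decreasing function among many. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def common_directories(url_a, url_b):
    """Count the leading path directories shared by two URLs.

    Assumption: URLs on different hosts share no directory.
    """
    a, b = urlsplit(url_a), urlsplit(url_b)
    if a.netloc.lower() != b.netloc.lower():
        return 0
    dirs_a = a.path.split("/")[1:-1]   # drop the leading "" and the file name
    dirs_b = b.path.split("/")[1:-1]
    count = 0
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        count += 1
    return count


def url_distance(url_a, url_b):
    """Assumed distance: 1 / (1 + n), decreasing in the shared count n."""
    return 1.0 / (1 + common_directories(url_a, url_b))
```

Any other strictly decreasing function of the shared-directory count would satisfy the stated characteristic equally well.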
The invention also relates to a method of indexing resources accessible on a computer network for creating and updating an indexing base, comprising the following steps:
- collection of main resources;
- indexing of the main resources;
- extraction of dependent resources from the main resources;
characterized in that it further comprises the following steps:
- association of each dependent resource with at most one main resource as a function of the hypertext links between these dependent resources and the main resource; and
- transfer of a copy of the descriptors of the main resources to the dependent resources associated with them.
The indexing method according to the invention may also include a step of excluding, from the indexing base, any dependent resource not associated with a main resource.
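The transfer and exclusion steps just listed can be illustrated with plain dictionaries. This is a sketch under assumptions: descriptors are modeled as a `dict` per main resource, the association is modeled as a mapping from each dependent URL to its main URL (or `None`), and the name `transfer_descriptors` is hypothetical.

```python
def transfer_descriptors(external, association):
    """Copy each main resource's descriptors to its dependent resources.

    external:    main_url -> dict of external descriptors
    association: dependent_url -> main_url, or None when unassociated
    Returns dependent_url -> copied descriptors; dependent resources with
    no associated main resource are left out (exclusion step).
    """
    transferred = {}
    for dependent_url, main_url in association.items():
        if main_url is None or main_url not in external:
            continue                       # excluded from the indexing base
        # A copy is transferred, so later edits do not touch the original.
        transferred[dependent_url] = dict(external[main_url])
    return transferred
```

Copying rather than sharing the descriptor dictionary mirrors the wording "transfer of a copy" in the method steps.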
The invention will be better understood from the following description, given solely by way of example and made with reference to the appended drawings, in which:
- FIG. 1 is a diagram illustrating the general structure of a search engine according to the invention;
- FIG. 2 is a diagram illustrating the operation of a search engine according to the invention; and
- FIG. 3 is a flowchart detailing the operation of the means for associating a dependent resource with at most one main resource, in a search engine according to the invention.
A search engine according to the invention, shown in FIG. 1, comprises a server 2 connected via the Internet, on the one hand, to a database 4 constituted by the World Wide Web, conventionally called the Web, and, on the other hand, to an access terminal 6 of a user searching for resources available on the Web.
The server 2 comprises a database 8 of directories. A directory comprises a restricted set of universal URL addresses of main resources, each corresponding to the first page of a multimedia document. These main resources are associated with external descriptors, for example recorded manually by librarians, possibly assisted by computer tools. These external descriptors correspond to a classification in a nomenclature of themes, to a title, to a textual presentation of the main resources and, more generally, to information specifying the context of the documents considered.
The server 2 also comprises an indexing base 10, containing all the descriptors of the resources accessible through the search engine. It includes in particular the external descriptors of the main resources as described above.
The server 2 also comprises an indexing module 12, comprising means for the automatic indexing of resources. These means are capable of extracting external descriptors by analyzing the content of the resources, in a conventional manner. This module also includes a method for associating dependent resources with a main resource and for transferring the external descriptors of a main resource to its dependent resources. The operation of this module will be detailed below, in the description of FIG. 2.
The indexing module is therefore connected at its input to the directory database 8 and to the Web 4, in order to access resources, and at its output to the indexing base 10, in order to supply descriptors.
Finally, the server 2 comprises a search module 14 connected, on the one hand, to the indexing base 10 and, on the other hand, to the access terminal 6, in order to supply a user with relevant resources in response to a request from that user.
The operation of the search engine, whose structure has been described above, is shown in Figure 2. The indexing module 12 records descriptors in the indexing base 10 in several steps.
During a first step 16, collection, the indexing module 12 accesses the main resources available on the Web 4, receiving as input their URLs stored in the directory database 8.
During a second step 18, extraction, the extraction means extract from each main resource the URLs of all the hypertext links it contains. New, dependent resources are thus retrieved, from which the URLs of the hypertext links they themselves contain can be extracted in turn. This recursive method of extracting dependent resources from a first set of main resources is known in the prior art. This first set, conventionally called the seed, is here extracted from the directory database 8.
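The recursive extraction of step 18 can be illustrated with a short Python sketch. It is not the patented implementation: the link graph is a hypothetical in-memory dictionary standing in for fetched pages, and the `max_depth` bound is an assumption added so that the traversal terminates.

```python
from collections import deque
from typing import Dict, Iterable, List


def extract_dependents(seed: Iterable[str],
                       links: Dict[str, List[str]],
                       max_depth: int) -> List[str]:
    """Breadth-first traversal of hypertext links starting from the seed.

    `links` maps a URL to the URLs of the links it contains (hypothetical
    stand-in for downloading and parsing each resource).
    """
    seed_urls = list(seed)
    seen = set(seed_urls)              # main resources are not dependents
    queue = deque((url, 0) for url in seed_urls)
    dependents: List[str] = []
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:         # assumed bound, not stated in the text
            continue
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                dependents.append(target)
                queue.append((target, depth + 1))
    return dependents


# Hypothetical link graph used only for illustration.
LINKS = {
    "http://site.example/": ["http://site.example/a.html",
                             "http://site.example/b.html"],
    "http://site.example/a.html": ["http://site.example/c.html",
                                   "http://site.example/"],
    "http://site.example/c.html": ["http://site.example/d.html"],
}
found = extract_dependents(["http://site.example/"], LINKS, max_depth=2)
```

With a depth bound of 2, the page `d.html`, which is three links away from the seed, is not retrieved.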
During a third step 20, association, extraction means associate each dependent resource with at most one main resource. This association depends on the number, the type, or any attribute of the hypertext links that must be followed to reach the dependent resource from the URL of the main resource. At the end of this step, dependent resources not associated with any main resource are discarded. The method is detailed in the description of Figure 3. During a fourth step 22, transfer, transfer means copy the external descriptors of each main resource and transfer them to all the dependent resources associated with it.
Finally, during a fifth step 24, indexing, the indexing means automatically extract descriptors for each resource. During this step, the indexing module 12 records in the indexing base 10 the descriptors relating to each resource. These comprise the automatically extracted descriptors together with the external descriptors, which are either copied to a dependent resource from its associated main resource or, for a main resource, taken directly from the directory database 8.
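The transfer and indexing steps 22 and 24 could be sketched as follows in Python. The data layout (dictionaries keyed by URL, an `"extracted"` set and an `"external"` mapping per entry) and the function name `build_index` are assumptions chosen for illustration, not structures described in the text.

```python
from typing import Dict, Optional, Set


def build_index(external: Dict[str, Dict[str, str]],
                association: Dict[str, Optional[str]],
                extracted: Dict[str, Set[str]]) -> Dict[str, dict]:
    """Combine extracted and external descriptors for every resource.

    external:    main-resource URL -> external descriptors (directory data)
    association: dependent URL -> associated main URL, or None
    extracted:   URL -> descriptors extracted automatically from the content
    """
    index: Dict[str, dict] = {}
    for url, auto in extracted.items():
        if url in external:                    # main resource
            main, ext = None, external[url]
        else:                                  # dependent resource
            main = association.get(url)
            if main is None:
                continue                       # unassociated: not indexed
            ext = external[main]
        index[url] = {
            "extracted": set(auto),
            "external": dict(ext),             # a copy, as in step 22
            "main": main,
        }
    return index


# Hypothetical sample data.
ROOT = "http://site.example/"
STARS = "http://site.example/stars.html"
STRAY = "http://other.example/x.html"
index = build_index(
    external={ROOT: {"theme": "astronomy", "title": "Sky guide"}},
    association={STARS: ROOT, STRAY: None},
    extracted={ROOT: {"sky"}, STARS: {"stars"}, STRAY: {"ads"}},
)
```

Copying the external descriptors rather than sharing them keeps each entry independent, so a later update of the directory does not silently change entries already indexed.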
The method described above, from the first to the fifth step, is repeated regularly in order to keep the indexing base up to date as the main resources of the directory database, and the hypertext links they contain, evolve.
Once the indexing base is up to date, the user accesses a query form defined by the search module 14. This query form takes the form of an HTML page. It lets the user enter at least one keyword and specify the context of the search by selecting values for a number of descriptors from a proposed list. The descriptors in the proposed list correspond to at least some of the external descriptors stored in the directory database 8 and describing the main resources. They make it possible, for example, to specify a search domain, the user's age range, and so on. These details allow the search module to filter the resources matching the keywords of the query.
The results therefore consist of the main and dependent resources whose extracted descriptors match the keywords and whose external descriptor values match those selected by the user. Among the results, each dependent resource returned to the user by the search module is accompanied by a hypertext link to the main resource associated with it.
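The filtering described above can be illustrated with a minimal Python sketch. The index layout and the choice to require every keyword (rather than any keyword) are assumptions; the text only says that results match the keywords and the selected external descriptor values.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def search(index: Dict[str, dict],
           keywords: Iterable[str],
           selected: Dict[str, str]) -> List[Tuple[str, Optional[str]]]:
    """Return (url, main_url) pairs; main_url is None for a main resource."""
    wanted = list(keywords)
    results = []
    for url, entry in index.items():
        # Keywords are matched against automatically extracted descriptors.
        if not all(word in entry["extracted"] for word in wanted):
            continue
        # Context values are matched against external descriptors.
        if any(entry["external"].get(name) != value
               for name, value in selected.items()):
            continue
        results.append((url, entry["main"]))
    return results


# Hypothetical index content.
SAMPLE_INDEX = {
    "http://kids.example/": {
        "extracted": {"planets"}, "main": None,
        "external": {"theme": "astronomy", "age": "8-12"}},
    "http://kids.example/mars.html": {
        "extracted": {"mars", "planets"}, "main": "http://kids.example/",
        "external": {"theme": "astronomy", "age": "8-12"}},
    "http://lab.example/mars.html": {
        "extracted": {"mars", "planets"}, "main": "http://lab.example/",
        "external": {"theme": "astronomy", "age": "adult"}},
}
hits = search(SAMPLE_INDEX, ["mars"], {"age": "8-12"})
```

Each dependent result carries the URL of its main resource, which is what allows the link to the more general page to be shown next to it.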
The method of associating a dependent resource with at most one main resource, among a set of N main resources, follows the flowchart shown in Figure 3.
An initialization step 100 sets an index i to 1 and a counter L to zero.
Next, an analysis step 102 identifies a path, that is, a sequence of hypertext links that must be followed to reach the dependent resource from the URL of the i-th main resource.
Next, a series of p steps, 104_1, ..., 104_p, applies a set of rules to the path identified in step 102, and more particularly to the number of links, their type and their attributes.
Seven types of links are conventionally defined:
- presentation structure links, such as frames, tables or embedded elements;
- transverse links, between two files in the same directory;
- parallel links, between files located in different directories that are themselves located in the same directory;
- external links, between files located on different sites;
- deeper links, when the file of the dependent resource is located in a subdirectory of the directory containing the file of the main resource;
- upper links, when the file of the main resource is located in a subdirectory of the directory containing the file of the dependent resource; and
- menu links, for links included in a resource for which the number of included links divided by the size of the resource in bytes exceeds a predetermined threshold.
The attributes are conventionally associated with the anchors of the links and are known in the prior art.
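Five of the link types above depend only on the two URLs involved, and the menu type depends on a link density. The following Python sketch shows one possible reading of those definitions. The function names, the `"other"` fallback and the threshold value are assumptions; presentation structure links are left out because they depend on the markup rather than on the URLs.

```python
from posixpath import dirname
from urllib.parse import urlsplit


def classify_link(source_url: str, target_url: str) -> str:
    """Classify a link from the location of its two ends."""
    src, dst = urlsplit(source_url), urlsplit(target_url)
    if src.netloc != dst.netloc:
        return "external"                 # files on different sites
    src_dir = dirname(src.path).rstrip("/")
    dst_dir = dirname(dst.path).rstrip("/")
    if src_dir == dst_dir:
        return "transverse"               # same directory
    if dst_dir.startswith(src_dir + "/"):
        return "deeper"                   # target below the source directory
    if src_dir.startswith(dst_dir + "/"):
        return "upper"                    # source below the target directory
    if dirname(src_dir) == dirname(dst_dir):
        return "parallel"                 # sibling directories
    return "other"                        # not covered by the listed types


def is_menu_resource(link_count: int, size_bytes: int,
                     threshold: float = 0.01) -> bool:
    """True when links per byte exceed the threshold (value is assumed)."""
    return size_bytes > 0 and link_count / size_bytes > threshold


BASE = "http://site.example/docs/guide/index.html"
```

For example, `classify_link(BASE, "http://site.example/docs/api/list.html")` yields `"parallel"`, because `guide` and `api` are both located in `docs`.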
If at least one of the rules is not satisfied, the method proceeds to a step 108. If all the rules are satisfied, the i-th main resource is temporarily associated with the dependent resource and the method proceeds to a step 106. Examples of rules are "the number of links is less than or equal to 4" and "no link is of the external type".
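The two example rules can be written as predicates over a path, as in this Python sketch. Representing a path as the list of its link types is an assumption made for brevity; an actual implementation would probably also carry the anchors and their attributes.

```python
from typing import Callable, List, Sequence

Path = Sequence[str]            # link types along the path, in order
Rule = Callable[[Path], bool]


def at_most_four_links(path: Path) -> bool:
    """Rule: the number of links is less than or equal to 4."""
    return len(path) <= 4


def no_external_link(path: Path) -> bool:
    """Rule: no link is of the external type."""
    return "external" not in path


RULES: List[Rule] = [at_most_four_links, no_external_link]


def path_satisfies_rules(path: Path, rules: Sequence[Rule] = RULES) -> bool:
    """All rules must hold for the temporary association to be made."""
    return all(rule(path) for rule in rules)
```

A path such as `["deeper", "transverse"]` passes both rules, whereas a path containing an `"external"` link or more than four links does not.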
Step 106 increments the counter L by one, so that L gives the number of main resources associated with the dependent resource, and passes to step 108.
The looping step 108 tests the value of the index i. If this index is strictly less than N, the method proceeds to a step 110; otherwise, that is, if i is equal to N, the method proceeds to a step 112.
Step 110 increments the index i by one and returns the method to step 102.
Step 112 tests the value of the counter L. If L is equal to 0, the method proceeds to a step 114. Otherwise, it proceeds to a step 116.
The exclusion step 114 removes the dependent resource from the indexing base and ends the association method for the dependent resource considered.
Step 116 is also a test on the value of L. If L is strictly greater than 1, the method proceeds to a step 118; otherwise it proceeds to a step 120.
Step 118 selects, among the main resources temporarily associated with the dependent resource, the one that minimizes a distance to the dependent resource. This distance is a decreasing function of the number of directories common to the URLs of the two resources. The method then proceeds to step 120 if a single main resource is selected. If several main resources minimize the distance, the method proceeds to step 114.
Step 120, the end of the method, validates the association between the dependent resource and the single main resource selected.
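The flowchart of Figure 3 can be condensed into the following Python sketch. Several points are assumptions: the rule checks of steps 102 to 104 are passed in as a hypothetical predicate `passes_rules`, the distance is taken as 1 / (1 + number of common leading directories), which is only one of many decreasing functions, and URLs on different hosts are treated as sharing no directory.

```python
from typing import Callable, List, Optional, Sequence, Tuple
from urllib.parse import urlsplit


def directories(url: str) -> Tuple[str, List[str]]:
    """Return the host and the list of directory names of a URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]          # drop the file name
    return parts.netloc, segments


def common_directories(url_a: str, url_b: str) -> int:
    host_a, dirs_a = directories(url_a)
    host_b, dirs_b = directories(url_b)
    if host_a != host_b:
        return 0
    count = 0
    for a, b in zip(dirs_a, dirs_b):
        if a != b:
            break
        count += 1
    return count


def distance(url_a: str, url_b: str) -> float:
    """Assumed decreasing function of the number of common directories."""
    return 1.0 / (1 + common_directories(url_a, url_b))


def associate(dependent_url: str,
              main_urls: Sequence[str],
              passes_rules: Callable[[str, str], bool]) -> Optional[str]:
    """Return the single associated main URL, or None (exclusion)."""
    # Steps 100 to 110: loop over the N main resources; L = len(candidates).
    candidates = [m for m in main_urls if passes_rules(m, dependent_url)]
    if not candidates:                    # step 112, L == 0 -> step 114
        return None
    if len(candidates) == 1:              # step 116, L == 1 -> step 120
        return candidates[0]
    # Step 118: keep the main resource at minimal distance, if unique.
    distances = [distance(dependent_url, m) for m in candidates]
    best = min(distances)
    closest = [m for m, d in zip(candidates, distances) if d == best]
    return closest[0] if len(closest) == 1 else None


def same_site(main_url: str, dependent_url: str) -> bool:
    """Hypothetical stand-in for the rule checks of steps 102 to 104."""
    return urlsplit(main_url).netloc == urlsplit(dependent_url).netloc


PAGE = "http://s.example/a/b/c/page.html"
chosen = associate(PAGE,
                   ["http://s.example/a/index.html",
                    "http://s.example/a/b/index.html",
                    "http://t.example/a/b/c/index.html"],
                   same_site)
```

Here the second main resource shares two directories with the page against one for the first, so it is the one retained; a tie between two equally close main resources leads to exclusion, as in step 114.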
It is clear that a search engine according to the invention overcomes the drawbacks of conventional search engines. Intelligent indexing of main resources, adapted to take into account the context of a user's query, allows them to be classified into broad categories and the results of the query to be filtered with high quality. In addition, this indexing is accompanied by the association of a very large number of dependent resources with each of these main resources, which increases the quantity of results while preserving their quality.
Another advantage of this search engine is that it can present to a user a resource meeting the criteria of the query, accompanied by a more general main resource that explains its context.

Claims

1. Search engine comprising, on the one hand, a module for indexing resources accessible on a computer network, for creating and updating an indexing base, and, on the other hand, a module for searching for resources on the network, adapted to query the indexing base on the basis of a query formulated by a user and to supply, in response, the URLs of the resources matching the query, the indexing module comprising means for collecting main resources, means for extracting dependent resources from the main resources and means for indexing the resources in order to extract descriptors from them, characterized in that the indexing module further comprises means for associating each dependent resource with at most one main resource according to the hypertext links between these dependent resources and the main resource.
2. Search engine according to claim 1, characterized in that the indexing module comprises means for transferring a copy of the descriptors of the main resources to the dependent resources associated with them.
3. Search engine according to claim 2, characterized in that the search module comprises means for filtering a resource indexed by the indexing module, by combined processing of the descriptors extracted from this resource and of the descriptors transferred to this resource.
4. Search engine according to one of claims 1 to 3, characterized in that the search module is adapted to supply, in response to a query, the URL of a dependent resource matching the query, together with the hypertext link of the main resource associated with this dependent resource.
5. Search engine according to one of claims 1 to 4, characterized in that the association means comprise means for selecting at most one main resource from a set of main resources capable of being associated with a dependent resource, by minimizing a distance calculated between the dependent resource and each main resource.
6. Search engine according to claim 5, characterized in that the distance between two resources is a decreasing function of the number of directories common to the URLs of the two resources.
7. Method for indexing resources accessible on a computer network, for creating and updating an indexing base, comprising the following steps:
- collecting main resources;
- indexing the main resources;
- extracting dependent resources from the main resources;
characterized in that it further comprises the following steps:
- associating each dependent resource with at most one main resource according to the hypertext links between these dependent resources and the main resource; and
- transferring a copy of the descriptors of the main resources to the dependent resources associated with them.
8. Indexing method according to claim 7, characterized in that it further comprises a step of excluding from the indexing base any dependent resource not associated with a main resource.
EP01921462A 2000-04-06 2001-04-03 Hypermedia resource search engine and related indexing method Withdrawn EP1269355A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
FR0004419A FR2807537B1 (en) 2000-04-06 2000-04-06 HYPERMEDIA RESOURCE SEARCH ENGINE AND INDEXING METHOD THEREOF
FR0004419 2000-04-06
PCT/FR2001/000998 WO2001077890A1 (en) 2000-04-06 2001-04-03 Hypermedia resource search engine and related indexing method

Publications (1)

Publication Number Publication Date
EP1269355A1 (en) 2003-01-02

Family

ID=8848953

Family Applications (1)

Application Number Title Priority Date Filing Date
EP01921462A Withdrawn EP1269355A1 (en) 2000-04-06 2001-04-03 Hypermedia resource search engine and related indexing method

Country Status (6)

Country Link
US (1) US20030187833A1 (en)
EP (1) EP1269355A1 (en)
AU (1) AU2001248451A1 (en)
FR (1) FR2807537B1 (en)
PL (1) PL359716A1 (en)
WO (1) WO2001077890A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8296304B2 (en) 2004-01-26 2012-10-23 International Business Machines Corporation Method, system, and program for handling redirects in a search engine
US7293005B2 (en) 2004-01-26 2007-11-06 International Business Machines Corporation Pipelined architecture for global analysis and index building
US7499913B2 (en) 2004-01-26 2009-03-03 International Business Machines Corporation Method for handling anchor text
US7424467B2 (en) 2004-01-26 2008-09-09 International Business Machines Corporation Architecture for an indexer with fixed width sort and variable width sort
US7461064B2 (en) 2004-09-24 2008-12-02 International Buiness Machines Corporation Method for searching documents for ranges of numeric values
US8417693B2 (en) 2005-07-14 2013-04-09 International Business Machines Corporation Enforcing native access control to indexed documents
CN103164435B (en) * 2011-12-13 2016-03-09 北大方正集团有限公司 A kind of acquisition method of network data and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPQ131399A0 (en) * 1999-06-30 1999-07-22 Silverbrook Research Pty Ltd A method and apparatus (NPAGE02)
US5841978A (en) * 1993-11-18 1998-11-24 Digimarc Corporation Network linking method using steganographically embedded data objects
US5761436A (en) * 1996-07-01 1998-06-02 Sun Microsystems, Inc. Method and apparatus for combining truncated hyperlinks to form a hyperlink aggregate
GB2328297B (en) * 1997-08-13 2002-04-24 Ibm Text in anchor tag of hyperlink adjustable according to context
US6336116B1 (en) * 1998-08-06 2002-01-01 Ryan Brown Search and index hosting system
US6772139B1 (en) * 1998-10-05 2004-08-03 Smith, Iii Julius O. Method and apparatus for facilitating use of hypertext links on the world wide web
US6490577B1 (en) * 1999-04-01 2002-12-03 Polyvista, Inc. Search engine with user activity memory
US7099898B1 (en) * 1999-08-12 2006-08-29 International Business Machines Corporation Data access system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0177890A1 *

Also Published As

Publication number Publication date
AU2001248451A1 (en) 2001-10-23
FR2807537B1 (en) 2003-10-17
FR2807537A1 (en) 2001-10-12
PL359716A1 (en) 2004-09-06
US20030187833A1 (en) 2003-10-02
WO2001077890A1 (en) 2001-10-18

Similar Documents

Publication Publication Date Title
JP4722051B2 (en) System and method for search query processing using trend analysis
JP3673487B2 (en) Hierarchical statistical analysis system and method
JP3936243B2 (en) Method and system for segmenting and identifying events in an image using voice annotation
US6904560B1 (en) Identifying key images in a document in correspondence to document text
CN100405371C (en) Method and system for abstracting new word
US20110238694A1 (en) System and Method for Matching Entities
US9785707B2 (en) Method and system for converting audio text files originating from audio files to searchable text and for processing the searchable text
US20080189591A1 (en) Method and system for generating a media presentation
EP1364316A2 (en) Device for retrieving data from a knowledge-based text
JP5066963B2 (en) Database construction device
US8751494B2 (en) Constructing album data using discrete track data from multiple sources
Pramana et al. Systematic literature review of stemming and lemmatization performance for sentence similarity
US20150294005A1 (en) Method and device for acquiring information
CN100458788C (en) Clustering method, searching method and system for interconnection network audio file
US20060253433A1 (en) Method and apparatus for knowledge-based music searching and method and apparatus for managing music file
EP1269355A1 (en) Hypermedia resource search engine and related indexing method
KR20040017824A (en) Information search system which it follows in the Pattern-Forecast-Analysis to use the pattern of the web document and list
CN116401434A (en) Intelligent network data information extraction system
KR20010105983A (en) method of service providing on internet
EP1334444A1 (en) Method for searching, selecting and mapping web pages
TWI290684B (en) Incremental thesaurus construction method
WO2013117872A1 (en) Method for identifying a set of sentences in a digital document, method for generating a digital document, and associated device
WO2004088542A1 (en) A method of managing registered web sites in search engine and a system thereof
Khan Structuring and querying personalized audio using ontologies
KR20240001769U (en) User-customized keyword data analysis and information provision system

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20021008

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

17Q First examination report despatched

Effective date: 20030429

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20080703