WO2004066163A1 - Searching apparatus and methods - Google Patents

Searching apparatus and methods

Info

Publication number
WO2004066163A1
Authority
WO
WIPO (PCT)
Prior art keywords
documents
keywords
user
relatedness
updating
Prior art date
Application number
PCT/GB2004/000310
Other languages
French (fr)
Inventor
Gery Michel Ducatel
Behnam Azvine
Original Assignee
British Telecommunications Public Limited Company
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB0301721A (GB0301721D0)
Priority claimed from GB0309460A (GB0309460D0)
Application filed by British Telecommunications Public Limited Company
Priority to CA002513490A (CA2513490A1)
Priority to US10/543,096 (US20060136405A1)
Priority to EP04704667A (EP1586058A1)
Publication of WO2004066163A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Definitions

  • the present invention relates in general to the use of search engines that access databases.
  • the invention relates to apparatus and methods which allow for the improved use of search engines by creating, maintaining and using user profiles.
  • Embodiments of the present invention may be used in conjunction with existing standard search engines or with specifically configured search engines, and it should therefore be noted that the technical field of the invention relates to the manner in which a user may interact with a system such as a personal computer, and not to the software by which any chosen search engine functions.
  • An example of an application of the invention is in relation to intranet search engines that access large databases such as large corporate repositories holding legal or medical data sets. It also applies to renewed data repositories such as news sources.
  • Embodiments of the invention would typically be integrated with a search platform utilised by users who wish to access and search large unstructured databases such as intranets or the Internet. Such platforms may have several thousand users.
  • users may receive a personalised newspaper every day using a search engine that has access to an information source such as "Intellact", disclosed in B Crabtree & SJ Soltysiak: "Automatic Learning of User Profiles - Towards Personalisation of Agent Services" (BT Technology Journal, 16(3):110-117, 1998).
  • In order to adapt user profiles to changes in interests there are two main approaches: the window frame and the ageing mechanism. Maintaining interests in a window frame is beneficial for discovering and maintaining a list of recently introduced interests, because they appear fast and distinctively as shown in Crabtree (1998) above.
  • the drawback of the window frame approach is that it is difficult to retrieve past interests. Typically, if an interest changes or disappears, it is discarded. This has led to experiments with optimised "interest forgetting functions" as disclosed in I Koychev: "Gradual Forgetting for Adaptation to Concept Drift" (ECAI 2000 Workshop, Current Issues in Spatio-Temporal Reasoning, pages 101-106, 2000).
  • This method is a function that decreases the influence of an interest in time; old interests gradually disappear as their importance is reduced linearly over a period of time.
  • the classification of the interests is a crisp set that discards interests when the linear function of the "gradual forgetting" process comes to term.
  • apparatus for creating and maintaining a user profile for a user for improving database searching by the user, said apparatus comprising: means for accessing a predetermined set of documents containing a plurality of keywords during a learning phase; analysing means arranged to analyse said documents and to identify, according to predetermined rules, groups of related keywords therein; attribute assigning means arranged to assign attributes indicative of relatedness to said groups of keywords; and user profile storing means arranged to store said relatedness attributes as a user profile; said apparatus further comprising: document updating means arranged to update the set of documents by adding documents to or subtracting documents from the set during an updating phase; identifying means arranged to analyse the updated set of documents and to identify existing and additional groups of related keywords therein, according to predetermined rules; means arranged to assign attributes indicative of relatedness to said additional groups of keywords; relatedness attribute updating means for updating the relatedness attributes of said existing groups of keywords; and user profile updating means arranged to update the user profile in accordance with the relatedness attributes of said
  • a method for creating and maintaining a user profile for a user for improving database searching by the user comprising a learning phase and an updating phase, wherein said learning phase comprises the steps of: accessing a predetermined set of documents containing a plurality of keywords; analysing said documents and identifying, according to predetermined rules, groups of related keywords therein; assigning attributes indicative of relatedness to said groups of keywords; and storing said relatedness attributes as a user profile; and wherein said updating phase comprises the steps of: updating the set of documents by adding documents to or subtracting documents from the set; analysing the updated set of documents and identifying existing and additional groups of related keywords therein, according to predetermined rules; assigning attributes indicative of relatedness to said additional groups of keywords; updating the relatedness attributes of said existing groups of keywords; and updating the user profile in accordance with the relatedness attributes of said existing and additional groups of keywords.
  • the predetermined set of documents is preferably a set of documents expected to reflect the interests of a specific user, such as a sub-set of documents derived from a set of documents previously viewed by a specific user.
  • the complete content of the documents may be stored in a local memory, or access to the full content may be by means of a set of links to internet or intranet locations where the full content is available.
  • the identification of related keywords from the set of documents may be achieved by means of a self-organising map algorithm, or may use other techniques to identify groups of related keywords.
  • the groups may comprise pairs of words or may be larger groups.
  • the types of attributes assigned to groups of keywords include an importance value indicating the statistical significance of related keywords in the set of documents, and a life-span value indicating the expected remaining period of time of relatedness between keywords in the set of documents.
  • Such life-span values may be systematically or automatically decreased over time until such time as the life-span values reach zero, indicating that the respective keywords are not considered to be related anymore.
  • the user may however be given the opportunity to manage the profile manually by adjusting the attributes, for example, or the apparatus may require confirmation before allowing the life-span values in relation to certain keyword groups to reach zero.
  • Embodiments of the invention in which the user is not required to provide input in order for the user profile to be updated allow for what may be termed "unsupervised learning". This is advantageous particularly where users are reluctant to provide feedback, regardless of how valuable it is to their future requests in the system.
  • the document updating means may be arranged to update the set of documents in response to user input confirming, for example, that new documents are of interest to the user.
  • the updating may be carried out on the basis of documents viewed by the user following receipt of a response from a search engine to a search query. It may also be done without the need for any further input from the user, however.
  • the user profile storing means is arranged to store relatedness attributes in the form of fuzzy sets.
  • apparatus for improving database searching comprising: user profile means, having access to a predetermined set of documents, arranged to provide data indicative of relatedness criteria between keywords from the set of documents; means for receiving a search query comprising one or more search keywords from a user; means arranged to access said user profile means and to identify therefrom, for the or each search keyword, potentially-related keywords according to predetermined criteria; means arranged to provide said potentially-related keywords to the user; means for receiving information from the user confirming that any potentially-related keywords are considered to be related keywords; means arranged to incorporate such potentially-related keywords as keywords in an improved search query in the event that they are confirmed by the user to be related keywords; and means for submitting the improved search query to a search engine.
  • a method for improving database searching comprising the steps of: receiving a search query comprising one or more search keywords from a user; accessing a user profile means arranged to provide data indicative of relatedness criteria between keywords from a set of documents, and identifying from said user profile means, for the or each search keyword, potentially-related keywords according to predetermined criteria; providing said potentially-related keywords to the user; receiving information from the user confirming that any potentially-related keywords are considered to be related keywords; in the event that any potentially-related keywords are confirmed by the user to be related keywords, incorporating such potentially-related keywords as keywords in an improved search query; and submitting the improved search query to a search engine.
  • the predetermined set of documents is a set of documents expected to reflect the interests of a specific user, such as a sub-set of documents derived from a set of documents previously viewed by the user.
  • such embodiments allow personalisation of the system.
  • assigned attributes such as an importance value indicating the statistical significance of related keywords in the set of documents, and a life-span value indicating an expected period of time of relatedness between keywords in the set of documents.
  • the user profile means preferably comprises means for identifying related keywords from the set of documents by means of a self-organising map algorithm.
  • the user profile means is arranged to provide data indicative of relatedness criteria in the form of fuzzy sets.
  • the set of documents is updated on the basis of documents viewed by the user following receipt of a response from a search engine to a search query.
  • the updating may be carried out on the basis of documents viewed by the user following receipt of a response from a search engine to a search query, or may be done without the need for further input from the user.
  • Preferred embodiments of the invention thus aim to improve the performance of an on-line search engine by gathering and maintaining user profiles obtained by analysing the documents that are relevant to the users.
  • the system may build and maintain user profiles in a two-fold process.
  • the system uses an algorithm as disclosed in the A Nürnberger article: "Interactive Text Retrieval Supported by Self-Organising Maps" (Technical report, BTexact Technologies, IS Lab, 2002), to extract contextually related keywords from a set of documents.
  • the keywords in the concepts are given attributes: a "life span" and a "relevance value". The life span indicates to the system when some words within a concept have not been found relevant for some time and therefore should be reduced in importance or removed altogether.
  • the relevance value is a link between two keywords of a concept; this value reflects the strength of the relationship between the two keywords. Users may have control over these parameters. They can decide if words should have a long or a short life span, and if the strength of the relationship between keywords should be strong or weak before they can start appearing in their profiles.
  • the solution proposed here also offers the users the facility to rebuild a query that is more valuable based on their initial query and their profile. At least a part of the interaction with the system may be performed before the documents are retrieved, when users are more receptive to further interaction with the system.
  • This application helps users maintain a profile of temporary interests.
  • the system also provides the analysis required to extract keywords that are relevant to help the users build an efficient profile.
  • the analysis is based on personal data and therefore the keywords suggested to the users are all adapted to their profiles.
  • the system helps in maintaining profiles, allowing the users to have an informed control over their profile.
  • the system is able to identify which are the keywords and concepts that the users need to improve their search.
  • the profile obtained can be used for query expansion.
  • the users can decide if a keyword is negative or positive to their search.
  • Figure 1 is a schematic diagram representing the hardware architecture of an embodiment of the invention
  • Figures 2a and 2b are screen shots of the user interface of an embodiment of the invention, showing the embodiment in use;
  • Figure 3 is a schematic illustration of the operation of an embodiment of the invention in response to a user input
  • Figure 4 is a schematic diagram of the functional elements of the system
  • Figure 5 is a flow chart illustrating the embodiment of the invention processing data to produce or maintain a list of user interests
  • Figure 6 is a schematic representation of the processing of the list of interests of Figure 5 into a plurality of fuzzy sets.
  • a conventional personal computer (PC) 101 is connected to a network 103 such as a wide area network (WAN) or, more specifically, the Internet.
  • Another computer 105 is connected to the WAN 103 and acts as a server computer.
  • the computers 101, 105 may be connected to the WAN 103 via a Local Area Network (LAN) 107 coupled with access to a gateway server computer (not shown) that enables the computers 101, 105 to access the WAN 103.
  • the connection 107 may be provided via home Internet access such as broadband and telephone line based access.
  • the PC 101, also referred to as the client machine, is arranged to access the server computer 105.
  • the client machine 101 has software to be able to access the WAN 103.
  • the computer 101 has an operating system (e.g. Microsoft Windows™, Unix, or Linux) and a web browser (e.g. Microsoft Internet Explorer™, or Netscape Navigator™).
  • On initiation of the system via a web browser the user is presented with a start page 201 as shown in figure 2a.
  • the user can enter a query into the system from a "Search for" box 203 provided.
  • the user enters the acronym for the British Broadcasting Corporation "BBC".
  • a "Search" button 205 instructs the search engine to execute the entered query.
  • the system returns a list 207 of alternative keywords as shown in figure 2b.
  • the list of keywords 207 comprises the names of some alternative television companies "Granada" and "ITV" as well as the original entry of "BBC".
  • the list of keywords 207 is provided to help the users perform a better search.
  • the user can select one or more of the keywords from the list 207 to refine their query and then use the "Refine" button 209 to submit the query.
  • the selection can be either positive or negative, i.e. the keywords can be included in the query or specifically excluded via alternative selection indicators 211.
  • the system returns the list 207 of alternative keywords prior to retrieving the search results.
  • the system may be arranged to return the results as would be expected from a conventional search engine. Along with the set of results, the application would return the list 207 of alternative keywords.
  • the process described above with reference to figures 2a & 2b is summarised in figure 3.
  • the user 301 enters the query into the system 303 at step 305 and system 303 then accesses the user profile 307 for that user at step 309.
  • the system then generates a list of keywords from the profile 307 at step 311 and returns them to the user 301 at step 313 as described above with reference to figure 2b.
  • the user makes their choice of refining the search using the list 207 of keywords and the system executes the query or search at step 315, taking into account the user's refinements, using the search engine 317 and the database 319.
  • the results are then displayed to the user at step 321 via the system front end.
  • the core of the system is a profile manager 401 that operates in two phases.
  • the first phase uses a word group extraction system 403 to identify related keywords from a repository of documents 405.
  • the repository 405 is a set of documents that are expected to reflect the users' interests.
  • the extracted groups of related keywords are representative of those interests of a given user.
  • Each user of the system has a document repository 405 which can be maintained either by the user or an automatic document retriever (not shown).
  • the processing of the contents of the repository 405 to extract the related keywords may be performed offline.
  • the operation of the word group extraction system 403 will be described further below.
  • the second phase is the classification of the related keywords or interests extracted using an interest classifier 407.
  • the interest classifier 407 uses a set of rules 409 to classify interests by their statistical significance (importance) in the corpus of text in the repository 405 and by their age (life span). The operation of the interest classifier 407 will be described further below.
  • the output of the profile manager 401 is a set of interests 411 classified by their importance in the repository 405 and life span.
  • the profile manager 401 uses the set of interests 411 in response to the input of a query 413 (203, 205 in figure 2a) to provide the user with a list of keywords (207 in figure 2b).
  • the management and maintenance of the interests is carried out by the profile manager in accordance with a set of rules which will be described below.
  • the management includes updating the interests from time to time and removing old or outdated interests.
  • the interests 411 are used to refine the search as described above.
  • the set of interests 411 may also be referred to as the user profile. In some situations the profile may include other data describing the user's interests and/or preferences.
  • the profile manager 401 requires a set of interests 41 1 before it can provide a list of key words in response to a user query. As a result, the system needs to go through a learning process while the set of interests is initially set up.
  • the profile manager 401 uses the word group extraction system 403 to identify contextually related keywords within bodies of text in the repository 405.
  • the word group extraction system 403 uses a Self-Organising Map (SOM) algorithm disclosed in T Kohonen: "Self-Organising and Associative Memory" (Springer-Verlag, 1984).
  • the input to the SOM is word triples (represented in a numerical format).
  • the SOM produces a representation of the input words in clusters on a conceptual two-dimensional map where strongly related keywords appear close to one another.
  • a, b, x and y are words that can be found in a text corpus T
  • if the following two word arrangements are frequent across T: a x b, and a y b, then a and b are contextually related keywords.
  • the output of the SOM algorithm is extracted as a list of contextually related keywords.
  • the list is represented by a number N of items made of keywords A (a,b,c), B (d,e,f) ... N (x,y,z), where the upper case letters represent sets of related keywords or interests and lower case letters simply represent keywords.
  • the set of interests can be seen as a personalised ontology. Every keyword is associated with the keywords that are statistically related to it.
  • the profile manager 401 assigns each interest an initial importance value and a life span value.
  • the importance value is initially set up as the average Inverse Document Frequency (IDF) value of every keyword of the interest as disclosed in K Sparck Jones: "Index Term Weighting" (Information Storage and Retrieval, (9):313-316, 1973).
  • the IDF value of a given keyword reflects its statistical importance in a given text corpus (in this case the user document repository 405). This importance value is normalised so that the weight can be expressed as a percentage value.
  • At step 507 the interest classifier 407 takes each interest in turn and determines whether it is a new interest or an existing interest. If the interest is a new interest processing moves to step 509.
  • the profile manager 401 creates a new set and the interest is added to it. If the interest is an addition to an existing set 411 then it is simply added to the set 411.
  • If at step 507 the new interest is identified as an existing interest in the set 411 then processing moves to step 513.
  • each keyword of the new interest is taken in turn, and if the keyword is part of the existing interest then its weight is increased by a factor x. In the present embodiment the increase is linear and the factor is set to 1.3. If a keyword in the new interest is not present in the existing interest then it is given a weight of 1. Once each keyword in the new interest has been processed in this way the weights are normalised and the system is able to express the weights as a value between 0 and 1.
  • the profile manager 401 gives each interest a life span expressed in days. In the present embodiment this is set to 60 days. A renewed interest is automatically reclassified with a 60 day or full life span. The new or updated interests are then added to the set of interests 411. The existing interest is then replaced with the new or updated interest in the set of interests 411.
  • the profile manager 401 uses the interest classifier 407 to process the interests 41 1 further.
  • the input into the interest classifier is the set of interests 41 1 and the set of rules 409.
  • the interest classifier 407 outputs the set of interests classified into two fuzzy sets 501, 503. Every interest is classified into one of the three life span fuzzy sets 503a, 503b, 503c and into one of the three importance weight fuzzy sets 501a, 501b, 501c.
  • the classification of each interest depends on the life span and importance weights assigned to each interest in steps 505, 509, 511 and/or 513 of figure 5 as described above.
  • an interest is given an initial life span (step 511 in figure 5) and is classified into one of three fuzzy sets by the interest classifier 407. If the initial classification is "long" the interest will be sustained in the system for at least as long as the system is initially set up to (sixty days in the current implementation). This classification is reviewed on a regular basis by the fuzzy engine such as when concepts are updated or added. If the interest is not renewed its lifespan will result in a gradual downgrading to the "average" set, then to the "short" set and finally will be removed from the set of interests 411. In other words, the classification of an interest into a life span fuzzy set is an indication of its life span expectancy in the system.
  • the users may have access to the fuzzy sets configuration through an interface to enable them to control the classification process.
  • the users can modify the size of the life span sets 503a, 503b, 503c and thus modify the life span of concepts.
  • the fuzzy set of recent concepts 503a can be increased and the sizes of one or more of the sets of older concepts 503b, 503c reduced.
  • the importance fuzzy sets 501a, 501b, 501c are used in the selection of keywords that will be suggested to a user in response to the entry of a query.
  • the system may be arranged to suggest only strong interests, strong and medium interests or all interests. Again the users can decide on the size of these data sets so that they have control over the selection process.
  • the system 401 is arranged so that if the system is about to discard a concept with strong relevance (because its life span has expired) the system can require confirmation from the user. This gives the user the facility to renew the lifespan of the interest if they choose. Interests that have had their importance value renewed (step 513 of figure 5) may well remain in the same fuzzy set or they may be upgraded.
  • Others that have not been renewed may either be sustained a little longer in the same set or they may be downgraded.
  • An interest with an updated importance value is not automatically reclassified in the "high" fuzzy set; others are gradually downgraded to the "medium" and the "low" sets.
  • the system is designed to help the users manage their profile efficiently. Yet, the system can run without requiring the users to maintain anything. Users are also allowed to add, change, and remove concepts. They can thoroughly control their sets of interests 411, repositories 405 and rules 409.
  • the system provides a non-obtrusive software application.
  • the application gradually builds fuzzy sets of keywords and is able to make helpful suggestions to the users. By giving control to the users with regard to the size of the fuzzy sets they can manage the maintenance of the profiles and they can build more efficient queries.
  • the apparatus that embodies the invention could be a general purpose device having software arranged to provide an embodiment of the invention.
  • the device could be a single device or a group of devices and the software could be a single program or a set of programs.
  • any or all of the software used to implement the invention can be contained on various transmission and/or storage mediums such as a floppy disc, CD-ROM, or magnetic tape so that the program can be loaded onto one or more general purpose devices or could be downloaded over a network using a suitable transmission medium.
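
The interest update rules listed above (an existing keyword's weight multiplied by 1.3, a new keyword given a weight of 1, weights then normalised, and a renewed interest given the full 60 day life span) might be sketched roughly as follows. This is only an illustrative sketch: the `Interest` class, the function names, the choice to normalise by dividing by the largest weight, and the day-by-day ageing step are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass

BOOST_FACTOR = 1.3     # factor stated for the described embodiment
FULL_LIFE_SPAN = 60    # days, as in the described embodiment


@dataclass
class Interest:
    """Hypothetical container for one group of related keywords."""
    weights: dict[str, float]          # keyword -> weight in [0, 1]
    life_span: int = FULL_LIFE_SPAN    # remaining days


def normalise(weights: dict[str, float]) -> dict[str, float]:
    """Scale weights so the largest is 1.0 (an assumed normalisation scheme)."""
    top = max(weights.values(), default=0.0)
    if top == 0.0:
        return dict(weights)
    return {kw: w / top for kw, w in weights.items()}


def renew(existing: Interest, observed_keywords: list[str]) -> Interest:
    """Update an existing interest with keywords from a newly extracted one."""
    weights = dict(existing.weights)
    for kw in observed_keywords:
        if kw in weights:
            weights[kw] *= BOOST_FACTOR   # keyword already known: boost it
        else:
            weights[kw] = 1.0             # keyword not seen before: weight of 1
    # A renewed interest gets the full life span again.
    return Interest(normalise(weights), FULL_LIFE_SPAN)


def age(interests: dict[str, Interest], days: int) -> dict[str, Interest]:
    """Reduce every life span by `days`; drop interests that reach zero."""
    aged = {}
    for name, interest in interests.items():
        remaining = interest.life_span - days
        if remaining > 0:
            aged[name] = Interest(dict(interest.weights), remaining)
    return aged


existing = Interest({"bbc": 1.0, "itv": 0.5}, life_span=10)
renewed = renew(existing, ["bbc", "granada"])
```

Keywords of the existing interest that do not reappear are left unchanged before normalisation here; the text does not say how they should be treated, so that is a further assumption. The fuzzy classification into "long", "average" and "short" sets is not modelled.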

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An apparatus and method are provided for improving database searching, the method comprising the steps of: receiving a search query comprising one or more search keywords from a user; accessing a user profile means arranged to provide data indicative of relatedness criteria between keywords from a set of documents, and identifying from said user profile means, for the or each search keyword, potentially-related keywords according to predetermined criteria; providing said potentially-related keywords to the user; receiving information from the user confirming that any potentially-related keywords are considered to be related keywords; in the event that any potentially-related keywords are confirmed by the user to be related keywords, incorporating such potentially-related keywords as keywords in an improved search query; and submitting the improved search query to a search engine. Also provided are an apparatus and method for creating and maintaining user profiles for use in the above searching apparatus and method.
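
The query refinement steps in the abstract (look up potentially-related keywords in the user profile, offer them to the user, and incorporate the confirmed ones into an improved query) could be sketched along these lines. The `UserProfile` layout, the importance values, the function names and the `-keyword` exclusion syntax are all assumptions made for illustration; the keywords are those of the "BBC" example used in the definitions.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    # keyword -> {related keyword: importance in [0, 1]} (hypothetical layout)
    related: dict[str, dict[str, float]] = field(default_factory=dict)

    def suggest(self, query_keywords: list[str], min_importance: float = 0.0) -> list[str]:
        """Return potentially-related keywords, most important first."""
        asked = {kw.lower() for kw in query_keywords}
        scores: dict[str, float] = {}
        for kw in asked:
            for other, importance in self.related.get(kw, {}).items():
                if other not in asked and importance >= min_importance:
                    scores[other] = max(importance, scores.get(other, 0.0))
        # Sort by descending importance, then alphabetically for stable output.
        return sorted(scores, key=lambda k: (-scores[k], k))


def build_refined_query(query_keywords, confirmed, excluded=()):
    """Append confirmed keywords; mark excluded ones with an assumed '-' prefix."""
    terms = list(query_keywords)
    terms += [kw for kw in confirmed if kw not in terms]
    terms += [f"-{kw}" for kw in excluded]
    return " ".join(terms)


# Made-up importance values, purely for illustration.
profile = UserProfile({"bbc": {"itv": 0.8, "granada": 0.6}})
suggestions = profile.suggest(["BBC"])
refined = build_refined_query(["BBC"], confirmed=["itv"], excluded=["granada"])
```

In this sketch the user's confirmation is simply the `confirmed` and `excluded` arguments; the refined string would then be handed to whichever search engine is in use.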

Description

Searching Apparatus and Methods
Technical Field
The present invention relates in general to the use of search engines that access databases. In particular, the invention relates to apparatus and methods which allow for the improved use of search engines by creating, maintaining and using user profiles. Embodiments of the present invention may be used in conjunction with existing standard search engines or with specifically configured search engines, and it should therefore be noted that the technical field of the invention relates to the manner in which a user may interact with a system such as a personal computer, and not to the software by which any chosen search engine functions.
An example of an application of the invention is in relation to intranet search engines that access large databases such as large corporate repositories holding legal or medical data sets. It also applies to renewed data repositories such as news sources. Embodiments of the invention would typically be integrated with a search platform utilised by users who wish to access and search large unstructured databases such as intranets or the Internet. Such platforms may have several thousand users.
Background to the Invention
A system providing an "Intelligent Personalised Agent Framework", formerly known as "Idioms" is disclosed in MP Thint, B Crabtree & SJ Soltysiak: "Adaptive Personal Agents" (Personal Technologies Journal, 2(3):141-151, 1998); and B Crabtree & SJ Soltysiak: "Knowing Me, Knowing You: Practical Issues in the Personalisation of Agent Technology", (PAAM'98 Third International Conference on the Practical Application of Intelligent Agents and Multi-Agent Technology, March 23-25 1998). This system acts as a host to a community of users and provides them with on-line services including news sources or corporate databases. The system offers to the users a personalised experience. With such a system, users may receive a personalised newspaper every day using a search engine that has access to an information source such as "Intellact", disclosed in B Crabtree & SJ Soltysiak: "Automatic Learning of User Profiles - Towards Personalisation of Agent Services" (BT Technology Journal, 16(3):110-117, 1998).
I Koychev: "Tracking Changing User Interests Through Prior-Learning of Context" (AH'2002, 2nd International Conference on Adaptive Hypermedia and Adaptive Web Based Systems, 2002); and T Mitchell, R Caruana, D Freitag, J McDermott & D Zabowski: "Experience with a Learning Personal Assistant" (Communications of the ACM, 7(37):81-91, 1994), disclose profile creation systems that are based on decision tree algorithms that have input vectors with a number of features below thirty. In Koychev's approach the application does not only rely on a window based approach but the algorithm attempts to freeze an interest in time and save it for future use. When a new interest is found it is checked against "past interests" to see if it corresponds to an old interest, and if it does, the application merges the old interest into the new one; this augments the new interest with information that is relevant to it. The system enables advantageous learning capabilities.
The number of features in a vector may however be orders of magnitude larger; every keyword that has any relevance must be taken into account and consequently the size of a vector rapidly reaches thousands of features.
In order to adapt user profiles to changes in interests there are two main approaches: the window frame and the ageing mechanism. Maintaining interests in a window frame is a solution that is beneficial for discovering and maintaining a list of recently introduced interests, because they appear fast and distinctively, as shown in Crabtree (1998) above. However, the drawback of the window frame approach is that it is difficult to retrieve past interests. Typically, if an interest changes or disappears, it is discarded. This has led to experiments with optimised "interest forgetting functions" as disclosed in I Koychev: "Gradual Forgetting for Adaptation to Concept Drift" (ECAI 2000 Workshop, Current Issues in Spatio-Temporal Reasoning, pages 101-106, 2000). This method uses a function that decreases the influence of an interest over time; old interests gradually disappear as their importance is reduced linearly over a period of time. The classification of the interests is a crisp set that discards interests when the linear function of the "gradual forgetting" process comes to term.
In order to compensate for the large dimensionality of information retrieval it is known to use user feedback in various forms such as the relevance feedback system disclosed in JJ Rocchio: "Performance Indices for Information Retrieval" (Prentice Hall, 1971, Soft Computing and Information Organisation, 11), or user rating as disclosed in D Billsus & M Pazzani: "Learning and Revising User Profiles: The Identification of Interesting Web Sites" (Machine Learning, 27:313-331, 1997). One problem related to requiring feedback from users is that in practice users are reluctant to provide any feedback regardless of how valuable it is to their future requests in the system. It seems that users do not want to interact with the search engine once it has returned the results since it is perceived as an annoyance rather than a benefit.
Summary of the Invention
According to a first aspect of the invention, there is provided apparatus for creating and maintaining a user profile for a user for improving database searching by the user, said apparatus comprising: means for accessing a predetermined set of documents containing a plurality of keywords during a learning phase; analysing means arranged to analyse said documents and to identify, according to predetermined rules, groups of related keywords therein; attribute assigning means arranged to assign attributes indicative of relatedness to said groups of keywords; and user profile storing means arranged to store said relatedness attributes as a user profile; said apparatus further comprising: document updating means arranged to update the set of documents by adding documents to or subtracting documents from the set during an updating phase; identifying means arranged to analyse the updated set of documents and to identify existing and additional groups of related keywords therein, according to predetermined rules; means arranged to assign attributes indicative of relatedness to said additional groups of keywords; relatedness attribute updating means for updating the relatedness attributes of said existing groups of keywords; and user profile updating means arranged to update the user profile in accordance with the relatedness attributes of said existing and additional groups of keywords.
There is also provided a method for creating and maintaining a user profile for a user for improving database searching by the user, said method comprising a learning phase and an updating phase, wherein said learning phase comprises the steps of: accessing a predetermined set of documents containing a plurality of keywords; analysing said documents and identifying, according to predetermined rules, groups of related keywords therein; assigning attributes indicative of relatedness to said groups of keywords; and storing said relatedness attributes as a user profile; and wherein said updating phase comprises the steps of: updating the set of documents by adding documents to or subtracting documents from the set; analysing the updated set of documents and identifying existing and additional groups of related keywords therein, according to predetermined rules; assigning attributes indicative of relatedness to said additional groups of keywords; updating the relatedness attributes of said existing groups of keywords; and updating the user profile in accordance with the relatedness attributes of said existing and additional groups of keywords.
The predetermined set of documents is preferably a set of documents expected to reflect the interests of a specific user, such as a sub-set of documents derived from a set of documents previously viewed by a specific user. The complete content of the documents may be stored in a local memory, or access to the full content may be by means of a set of links to internet or intranet locations where the full content is available.
The identification of related keywords from the set of documents may be achieved by means of a self-organising map algorithm, or may use other techniques to identify groups of related keywords. The groups may comprise pairs of words or may be larger groups.
Preferably the types of attributes assigned to groups of keywords include an importance value indicating the statistical significance of related keywords in the set of documents, and a life-span value indicating the expected remaining period of time of relatedness between keywords in the set of documents. Such life-span values may be systematically or automatically decreased over time until such time as the life-span values reach zero, indicating that the respective keywords are not considered to be related any more. The user may however be given the opportunity to manage the profile manually by adjusting the attributes, for example, or the apparatus may require confirmation before allowing the life-span values in relation to certain keyword groups to reach zero.
Embodiments of the invention in which the user is not required to provide input in order for the user profile to be updated allow for what may be termed "unsupervised learning". This is advantageous particularly where users are reluctant to provide feedback, regardless of how valuable it is to their future requests in the system.
According to preferred embodiments of the apparatus, the document updating means may be arranged to update the set of documents in response to user input confirming, for example, that new documents are of interest to the user. The updating may be carried out on the basis of documents viewed by the user following receipt of a response from a search engine to a search query. It may also be done without the need for any further input from the user, however.
Preferably, the user profile storing means is arranged to store relatedness attributes in the form of fuzzy sets.
According to a second aspect of the invention, there is provided apparatus for improving database searching, comprising: user profile means, having access to a predetermined set of documents, arranged to provide data indicative of relatedness criteria between keywords from the set of documents; means for receiving a search query comprising one or more search keywords from a user; means arranged to access said user profile means and to identify therefrom, for the or each search keyword, potentially-related keywords according to predetermined criteria; means arranged to provide said potentially-related keywords to the user; means for receiving information from the user confirming that any potentially-related keywords are considered to be related keywords; means arranged to incorporate such potentially-related keywords as keywords in an improved search query in the event that they are confirmed by the user to be related keywords; and means for submitting the improved search query to a search engine.
There is further provided a method for improving database searching, comprising the steps of: receiving a search query comprising one or more search keywords from a user; accessing a user profile means arranged to provide data indicative of relatedness criteria between keywords from a set of documents, and identifying from said user profile means, for the or each search keyword, potentially-related keywords according to predetermined criteria; providing said potentially-related keywords to the user; receiving information from the user confirming that any potentially-related keywords are considered to be related keywords; in the event that any potentially-related keywords are confirmed by the user to be related keywords, incorporating such potentially-related keywords as keywords in an improved search query; and submitting the improved search query to a search engine. According to preferred embodiments of the second aspect of the invention, the predetermined set of documents is a set of documents expected to reflect the interests of a specific user, such as a sub-set of documents derived from a set of documents previously viewed by the user. By virtue of this, such embodiments allow personalisation of the system. By use of assigned attributes such as an importance value indicating the statistical significance of related keywords in the set of documents, and a life-span value indicating an expected period of time of relatedness between keywords in the set of documents, personalisation is possible, such that the changing interests of the individual user are reflected.
The user profile means preferably comprises means for identifying related keywords from the set of documents by means of a self-organising map algorithm. Preferably the user profile means is arranged to provide data indicative of relatedness criteria in the form of fuzzy sets.
According to preferred embodiments, the set of documents is updated on the basis of documents viewed by the user following receipt of a response from a search engine to a search query. Alternatively, the updating may be done without the need for further input from the user.
Preferred embodiments of the invention thus aim to improve the performance of an on-line search engine by gathering and maintaining user profiles obtained by analysing the documents that are relevant to the users. Looking at a preferred embodiment in more detail, the system may build and maintain user profiles in a two-fold process. First the system uses an algorithm as disclosed in A Nürnberger: "Interactive Text Retrieval Supported by Self-Organising Maps" (Technical report, BTexact Technologies, IS Lab, 2002), to extract contextually related keywords from a set of documents. Secondly, the keywords in the concepts are given attributes: a "life span" and a "relevance value". The life span indicates to the system when some words within a concept have not been found relevant for some time and therefore should be reduced in importance or removed altogether. The relevance value is a link between two keywords of a concept; this value reflects the strength of the relationship between the two keywords. Users may have control over these parameters. They can decide if words should have a long or a short life span, and if the strength of the relationship between keywords should be strong or weak before they can start appearing in their profiles.
The solution proposed here also offers the users the facility to rebuild a query that is more valuable based on their initial query and their profile. At least a part of the interaction with the system may be performed before the documents are retrieved, when users are more receptive to further interaction with the system.
This application helps users maintain a profile of temporary interests. The system also provides the analysis required to extract keywords that are relevant to help the users build an efficient profile. The analysis is based on personal data and therefore the keywords suggested to the users are all adapted to their profiles.
The system helps in maintaining profiles, allowing the users to have an informed control over their profile. The system is able to identify which are the keywords and concepts that the users need to improve their search. The profile obtained can be used for query expansion. The users can decide if a keyword is negative or positive to their search.
Brief Description of the Drawings
Embodiments of the invention will now be described with reference to the accompanying figures in which:
Figure 1 is a schematic diagram representing the hardware architecture of an embodiment of the invention;
Figures 2a and 2b are screen shots of the user interface of an embodiment of the invention, showing the embodiment in use;
Figure 3 is a schematic illustration of the operation of an embodiment of the invention in response to a user input;
Figure 4 is a schematic diagram of the functional elements of the system;
Figure 5 is a flow chart illustrating the embodiment of the invention processing data to produce or maintain a list of user interests;
Figure 6 is a schematic representation of the processing of the list of interests of Figure 5 into a plurality of fuzzy sets.
Description of the Embodiments
With reference to figure 1, a conventional personal computer (PC) 101 is connected to a network 103 such as a wide area network (WAN) or, more specifically, the Internet. Another computer 105 is connected to the WAN 103 and acts as a server computer. The computers 101, 105 may be connected to the WAN 103 via a Local Area Network (LAN) 107 coupled with access to a gateway server computer (not shown) that enables the computers 101, 105 to access the WAN 103. Alternatively, the connection 107 may be provided via home Internet access such as broadband and telephone line based access. The PC 101, also referred to as the client machine, is arranged to access the server computer 105. The client machine 101 has software to be able to access the WAN 103. The computer 101 has an operating system (e.g. Microsoft Windows™, Unix, or Linux) and a web browser (e.g. Microsoft Internet Explorer™, or Netscape Navigator™).
An overview of the user interaction with the system will now be described with reference to figures 2a & 2b. On initiation of the system via a web browser the user is presented with a start page 201 as shown in figure 2a. The user can enter a query into the system from a "Search for" box 203 provided. In this example the user enters the acronym for the British Broadcasting Corporation, "BBC". A "Search" button 205 instructs the search engine to execute the entered query. In response to this the system returns a list 207 of alternative keywords as shown in figure 2b. In this example the list of keywords 207 comprises the names of some alternative television companies, "Granada" and "ITV", as well as the original entry of "BBC". The list of keywords 207 is provided to assist the users in performing a better search. The user can select one or more of the keywords from the list 207 to refine their query and then use the "Refine" button 209 to submit the query. The selection can be either positive or negative, i.e. the keywords can be included in the query or specifically excluded via alternative selection indicators 211. As described above, the system returns the list 207 of alternative keywords prior to retrieving the search results. Alternatively, the system may be arranged to return the results as would be expected from a conventional search engine. Along with the set of results, the application would return the list 207 of alternative keywords.
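The positive and negative selection described above can be illustrated with a minimal Python sketch. The function name `build_refined_query`, the "include"/"exclude" labels and the use of a leading "-" to exclude a keyword are all assumptions made for illustration; the description does not specify how the refined query is encoded.

```python
def build_refined_query(original, selections):
    """Combine the original query with the user's keyword selections.

    `selections` maps each suggested keyword to "include", "exclude",
    or None (not selected). The "-" exclusion prefix is an assumed
    syntax, not one taken from the description.
    """
    terms = [original]
    for keyword, choice in selections.items():
        if keyword.lower() == original.lower():
            continue  # the original entry is already part of the query
        if choice == "include":
            terms.append(keyword)
        elif choice == "exclude":
            terms.append("-" + keyword)
    return " ".join(terms)


# The user searched for "BBC", kept "ITV" and excluded "Granada".
print(build_refined_query("BBC", {"BBC": "include", "Granada": "exclude", "ITV": "include"}))
```

Keywords left unselected are simply omitted, so a user who ignores the list submits the original query unchanged.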
The process described above with reference to figures 2a & 2b is summarised in figure 3. The user 301 enters the query into the system 303 at step 305 and the system 303 then accesses the user profile 307 for that user at step 309. The system then generates a list of keywords from the profile 307 at step 311 and returns them to the user 301 at step 313 as described above with reference to figure 2b. The user chooses how to refine the search using the list 207 of keywords, and at step 315 the system executes the query or search, taking into account the user's refinements, using the search engine 317 and the database 319. The results are then displayed to the user at step 321 via the system front end.
With reference to figure 4, the core of the system is a profile manager 401 that operates in two phases. The first phase uses a word group extraction system 403 to identify related keywords from a repository of documents 405. The repository 405 is a set of documents that are expected to reflect the users' interests. The extracted groups of related keywords are representative of those interests of a given user. Each user of the system has a document repository 405 which can be maintained either by the user or an automatic document retriever (not shown). The processing of the contents of the repository 405 to extract the related keywords may be performed offline. The operation of the word group extraction system 403 will be described further below. The second phase is the classification of the related keywords or interests extracted using an interest classifier 407. The interest classifier 407 uses a set of rules 409 to classify interests by their statistical significance (importance) in the corpus of text in the repository 405 and by their age (life span). The operation of the interest classifier 407 will be described further below.
The output of the profile manager 401 is a set of interests 411 classified by their importance in the repository 405 and life span. The profile manager 401 then uses the set of interests 411 in response to the input of a query 413 (203, 205 in figure 2a) to provide the user with a list of keywords (207 in figure 2b). The management and maintenance of the interests is carried out by the profile manager in accordance with a set of rules which will be described below. The management includes updating the interests from time to time and removing old or outdated interests. The interests 411 are used to refine the search as described above. The set of interests 411 may also be referred to as the user profile. In some situations the profile may include other data describing the user's interests and/or preferences. The profile manager 401 requires a set of interests 411 before it can provide a list of keywords in response to a user query. As a result, the system needs to go through a learning process while the set of interests is initially set up.
The process carried out by the profile manager 401 described above will now be described in further detail with reference to the flow chart of figure 5. At step 501 the profile manager 401 uses the word group extraction system 403 to identify contextually related keywords within bodies of text in the repository 405. The word group extraction system 403 uses a Self-Organising Map (SOM) algorithm disclosed in T Kohonen: "Self-Organising and Associative Memory" (Springer-Verlag, 1984). The input to the SOM is word triples (represented in a numerical format). The SOM produces a representation of the input words in clusters on a conceptual two-dimensional map where strongly related keywords appear close to one another. For example, if a, b, x and y are words that can be found in a text corpus T, if the following two word arrangements are frequent across T: a x b, and a y b, then a and b are contextually related keywords.
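The "a x b, a y b" example can be illustrated without a SOM. The following Python sketch is a simple counting stand-in, not the SOM algorithm of the embodiment: it collects word triples and reports an outer pair (a, b) as related when it has been seen around at least two distinct middle words. The function names and the `min_middles` threshold are assumptions, and the notion of "frequent" is reduced here to a count of distinct middle words.

```python
from collections import defaultdict


def word_triples(tokens):
    """Yield every run of three consecutive words."""
    for i in range(len(tokens) - 2):
        yield tokens[i], tokens[i + 1], tokens[i + 2]


def related_pairs(tokens, min_middles=2):
    """Return outer pairs (a, b) seen around at least `min_middles`
    distinct middle words, i.e. both "a x b" and "a y b" occur."""
    middles = defaultdict(set)
    for a, x, b in word_triples(tokens):
        middles[(a, b)].add(x)
    return sorted(pair for pair, mids in middles.items() if len(mids) >= min_middles)


text = "the bbc news channel and the bbc sport channel"
print(related_pairs(text.split()))
```

In this toy corpus "bbc" and "channel" surround both "news" and "sport", so they form the only related pair.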
At step 503 the output of the SOM algorithm is extracted as a list of contextually related keywords. The list is represented by a number N of items made of keywords A (a,b,c), B (d,e,f) ... N (x,y,z), where the upper case letters represent sets of related keywords or interests and lower case letters simply represent keywords. The set of interests can be seen as a personalised ontology. Every keyword is associated with the keywords that are statistically related to it.
Processing then moves to step 505 at which the profile manager 401 assigns each interest an initial importance value and a life span value. The importance value is initially set up as the average Inverse Document Frequency (IDF) value of every keyword of the interest as disclosed in K Sparck Jones: "Index Term Weighting" (Information Storage and Retrieval, (9):313-316, 1973). The IDF value of a given keyword reflects its statistical importance in a given text corpus (in this case the user document repository 405). This importance value is normalised so that the weight can be expressed as a percentage value.
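A minimal Python sketch of the initial importance value follows. The description does not give the exact IDF formula or the normalisation, so two assumptions are made here: IDF is taken as log(N / df), and the average is scaled by the largest possible IDF, log(N), to give a percentage. The function names and the toy repository are hypothetical.

```python
import math


def idf(keyword, documents):
    """Assumed IDF: log(N / df), where df is the number of documents
    containing the keyword. Returns 0.0 for unseen keywords."""
    df = sum(1 for doc in documents if keyword in doc)
    return math.log(len(documents) / df) if df else 0.0


def importance(interest, documents):
    """Average IDF of the interest's keywords, as a percentage of the
    maximum possible IDF log(N) (an assumed normalisation)."""
    max_idf = math.log(len(documents))
    if max_idf == 0 or not interest:
        return 0.0
    average = sum(idf(k, documents) for k in interest) / len(interest)
    return 100.0 * average / max_idf


# Each document is reduced to its set of keywords.
repository = [{"bbc", "news"}, {"bbc", "itv"}, {"granada", "itv"}, {"weather"}]
print(round(importance(("bbc", "itv", "granada"), repository), 2))
```

Rare keywords such as "granada" raise the average, while keywords spread across many documents lower it.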
Processing then moves to step 507 where the interest classifier 407 takes each interest in turn and determines whether it is a new interest or an existing interest. If the interest is a new interest processing moves to step 509.
At step 509, if the interest is the first interest for a new set of interests 411 then the profile manager 401 creates a new set and the interest is added to it. If the interest is an addition to an existing set 411 then it is simply added to the set 411.
If at step 507 the new interest is identified as an existing interest in the set 411 then processing moves to step 513. At step 513 each keyword of the new interest is taken in turn, and if the keyword is part of the existing interest then its weight is increased by a factor x. In the present embodiment the increase is linear and the factor is set to 1.3. If a keyword in the new interest is not present in the existing interest then it is given a weight of 1. Once each keyword in the new interest has been processed in this way the weights are normalised and the system is able to express the weights as a value between 0 and 1.
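The weight update of step 513 might be sketched in Python as follows. The factor 1.3 and the weight of 1 for unseen keywords come from the description; the choice to normalise by dividing by the largest weight is an assumption (the description only says the result lies between 0 and 1), and `merge_interest` is a hypothetical name. The sketch covers only the keywords of the new interest, as the description does.

```python
BOOST = 1.3  # factor stated for the present embodiment


def merge_interest(existing, new_keywords, boost=BOOST):
    """Return normalised weights for the keywords of a renewed interest.

    `existing` maps keyword -> current weight. Keywords already present
    are multiplied by `boost`; keywords not present start at 1.
    Normalisation by the maximum weight is an assumed choice.
    """
    weights = {}
    for keyword in new_keywords:
        if keyword in existing:
            weights[keyword] = existing[keyword] * boost
        else:
            weights[keyword] = 1.0
    if not weights:
        return {}
    top = max(weights.values())
    return {keyword: w / top for keyword, w in weights.items()}


updated = merge_interest({"bbc": 1.0, "itv": 0.5}, ["bbc", "itv", "granada"])
print(updated)
```

Here "bbc" and "itv" are boosted before normalisation, while the newly seen "granada" enters with the starting weight of 1.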
At step 511 the profile manager 401 gives each interest a life span expressed in days. In the present embodiment this is set to 60 days. A renewed interest is automatically reclassified with a 60 day or full life span. The new or updated interests are then added to the set of interests 411. The existing interest is then replaced with the new or updated interest in the set of interests 411.
Once the profile manager 401 has produced or updated a set of interests 411 it then utilises the interest classifier 407 to process the interests 411 further. With reference to figure 6, the input into the interest classifier is the set of interests 411 and the set of rules 409. The interest classifier 407 outputs the set of interests classified into two fuzzy sets 501, 503. Every interest is classified into one of the three life span fuzzy sets 503a, 503b, 503c and into one of the three importance weight fuzzy sets 501a, 501b, 501c. The classification of each interest depends on the life span and importance weights assigned to each interest in steps 505, 509, 511 and/or 513 of figure 5 as described above.
As noted above, an interest is given an initial life span (step 511 in figure 5) and is classified into one of three fuzzy sets by the interest classifier 407. If the initial classification is "long" the interest will be sustained in the system for at least as long as the system is initially set up to sustain it (sixty days in the current implementation). This classification is reviewed on a regular basis by the fuzzy engine, such as when concepts are updated or added. If the interest is not renewed, its shrinking life span will result in a gradual downgrading to the "average" set, then to the "short" set, and finally the interest will be removed from the set of interests 411. In other words, the classification of an interest into a life span fuzzy set is an indication of its life span expectancy in the system.
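The gradual downgrading from "long" to "average" to "short" can be illustrated with a small Python sketch. The sixty-day full life span comes from the description, but the shapes of the membership functions are not given there: the triangular shapes below, the rule of picking the set with the highest membership, and the function names are all assumptions.

```python
FULL_LIFE_SPAN = 60  # days, as in the present embodiment


def life_span_memberships(days_left, full=FULL_LIFE_SPAN):
    """Degree of membership in each life span fuzzy set.
    The triangular shapes are assumed for illustration."""
    x = max(0.0, min(1.0, days_left / full))
    return {
        "short": max(0.0, 1.0 - 2.0 * x),
        "average": 1.0 - abs(2.0 * x - 1.0),
        "long": max(0.0, 2.0 * x - 1.0),
    }


def classify(days_left):
    """Pick the fuzzy set in which the interest has the highest membership."""
    memberships = life_span_memberships(days_left)
    return max(memberships, key=memberships.get)


def age_one_day(interests):
    """Decrease every life span by one day and drop expired interests.
    `interests` maps an interest name to its remaining days."""
    return {name: days - 1 for name, days in interests.items() if days - 1 > 0}


print(classify(60), classify(30), classify(5))
print(age_one_day({"television": 1, "news": 60}))
```

Changing the break points of the membership functions corresponds to the user resizing the life span sets, which lengthens or shortens the time an interest stays in each set.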
The users may have access to the fuzzy sets configuration through an interface to enable them to control the classification process. The users can modify the size of the life span sets 503a, 503b, 503c and thus modify the life span of concepts. To keep concepts longer the fuzzy set of recent concepts 503a can be increased and the sizes of one or more of the sets of older concepts 503b, 503c reduced.
The importance fuzzy sets 501a, 501b, 501c are used in the selection of keywords that will be suggested to a user in response to the entry of a query. For example, the system may be arranged to suggest only strong interests, strong and medium interests, or all interests. Again the users can decide on the size of these data sets so that they have control over the selection process. Similarly, the profile manager 401 is arranged so that if the system is about to discard a concept with strong relevance (because its life span has expired) the system can require confirmation from the user. This gives the user the facility to renew the life span of the interest if they choose. Interests that have had their importance value renewed (step 513 of figure 5) may well remain in the same fuzzy set or they may be upgraded. Others that have not been renewed may either be sustained a little longer in the same set or they may be downgraded. An interest with an updated importance value is not automatically reclassified in the "high" fuzzy set; others are gradually downgraded to the "medium" and the "low" sets.
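The selection of suggestions by importance set might look like the following Python sketch. For simplicity each interest carries a single crisp label ("high", "medium" or "low") rather than fuzzy membership degrees, which is a simplification of the embodiment; the `suggest` function, the `RANK` table and the sample profile are hypothetical.

```python
RANK = {"low": 0, "medium": 1, "high": 2}


def suggest(query_keyword, interests, minimum="high"):
    """Return keywords related to `query_keyword`, taken only from
    interests whose importance label is at least `minimum`.

    `interests` is a list of (keywords, label) pairs, where `keywords`
    is a list and `label` is "high", "medium" or "low".
    """
    suggestions = []
    for keywords, label in interests:
        if query_keyword not in keywords or RANK[label] < RANK[minimum]:
            continue
        for keyword in keywords:
            if keyword != query_keyword and keyword not in suggestions:
                suggestions.append(keyword)
    return suggestions


profile = [(["BBC", "ITV", "Granada"], "high"), (["BBC", "radio"], "low")]
print(suggest("BBC", profile))                  # strong interests only
print(suggest("BBC", profile, minimum="low"))   # all interests
```

Lowering `minimum` corresponds to the user choosing to see strong and medium interests, or all interests, in the suggestion list.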
The system is designed to help the users manage their profile efficiently. Yet, the system can run without requiring the users to maintain anything. Users are also allowed to add, change, and remove concepts. They can thoroughly control their sets of interests 411, repositories 405 and rules 409. The system provides a non-obtrusive software application. The application gradually builds fuzzy sets of keywords and is able to make helpful suggestions to the users. By giving control to the users with regard to the size of the fuzzy sets they can manage the maintenance of the profiles and they can build more efficient queries.
Self-organising maps are discussed further in T Kohonen: "Self-Organized Formation of Topologically Correct Feature Maps" (Biological Cybernetics, 43:59-69, 1982); and H Ritter & T Kohonen: "Self-Organising Semantic Maps" (Biological Cybernetics, 61(4):241-254, 1989).
It will be understood by those skilled in the art that the apparatus that embodies the invention could be a general purpose device having software arranged to provide an embodiment of the invention. The device could be a single device or a group of devices and the software could be a single program or a set of programs. Furthermore, any or all of the software used to implement the invention can be contained on various transmission and/or storage media such as a floppy disc, CD-ROM, or magnetic tape so that the program can be loaded onto one or more general purpose devices or could be downloaded over a network using a suitable transmission medium.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising" and the like are to be construed in an inclusive as opposed to an exclusive or exhaustive sense; that is to say, in the sense of "including, but not limited to".

Claims

1. Apparatus for creating and maintaining a user profile for a user for improving database searching by the user, said apparatus comprising: means for accessing a predetermined set of documents containing a plurality of keywords during a learning phase; analysing means arranged to analyse said documents and to identify, according to predetermined rules, groups of related keywords therein; attribute assigning means arranged to assign attributes indicative of relatedness to said groups of keywords; and user profile storing means arranged to store said relatedness attributes as a user profile; said apparatus further comprising: document updating means arranged to update the set of documents by adding documents to or subtracting documents from the set during an updating phase; identifying means arranged to analyse the updated set of documents and to identify existing and additional groups of related keywords therein, according to predetermined rules; means arranged to assign attributes indicative of relatedness to said additional groups of keywords; relatedness attribute updating means for updating the relatedness attributes of said existing groups of keywords; and user profile updating means arranged to update the user profile in accordance with the relatedness attributes of said existing and additional groups of keywords.
2. Apparatus according to claim 1, wherein the predetermined set of documents is a set of documents expected to reflect the interests of a specific user.
3. Apparatus according to claim 1 or 2, wherein the predetermined set of documents is a set of documents derived from a set of documents previously viewed by a specific user.
4. Apparatus according to claim 1, 2 or 3, wherein the analysing means comprises means for identifying groups containing pairs of related keywords.
5. Apparatus according to claim 1, 2, 3 or 4, wherein the analysing means comprises means for identifying related keywords from the set of documents by means of a self-organising map algorithm.
6. Apparatus according to any of claims 1 to 5, wherein the attribute assigning means comprises importance value assigning means for assigning importance values indicating the statistical significance of related keywords in the set of documents.
7. Apparatus according to any of claims 1 to 6, wherein the attribute assigning means comprises means for assigning life-span values indicating the expected remaining period of time of relatedness between keywords in the set of documents.
8. Apparatus according to claim 7, wherein said relatedness attribute updating means comprises means for systematically decreasing the life-span values over time.
9. Apparatus according to any of claims 1 to 8, wherein the document updating means is arranged to update the set of documents in response to user input.
10. Apparatus according to claim 9, wherein the document updating means is arranged to add new documents to the set of documents in the event of user input confirming that said new documents are of interest to the user.
11. Apparatus according to any of claims 1 to 10, wherein the user profile storing means is arranged to store said relatedness attributes in the form of fuzzy sets.
12. A method for creating and maintaining a user profile for a user for improving database searching by the user, said method comprising a learning phase and an updating phase, wherein said learning phase comprises the steps of: accessing a predetermined set of documents containing a plurality of keywords; analysing said documents and identifying, according to predetermined rules, groups of related keywords therein; assigning attributes indicative of relatedness to said groups of keywords; and storing said relatedness attributes as a user profile; and wherein said updating phase comprises the steps of: updating the set of documents by adding documents to or subtracting documents from the set; analysing the updated set of documents and identifying existing and additional groups of related keywords therein, according to predetermined rules; assigning attributes indicative of relatedness to said additional groups of keywords; updating the relatedness attributes of said existing groups of keywords; and updating the user profile in accordance with the relatedness attributes of said existing and additional groups of keywords.
13. A method according to claim 12, wherein groups containing pairs of related keywords are identified.
14. A method according to claim 12 or 13, wherein related keywords are identified from the set of documents by means of a self-organising map algorithm.
15. A method according to claim 12, 13 or 14, wherein the step of assigning attributes comprises assigning importance values indicating the statistical significance of related keywords in the set of documents.
16. A method according to any of claims 12 to 15, wherein the step of assigning attributes comprises assigning life-span values indicating the expected remaining period of time of relatedness between keywords in the set of documents.
17. A method according to claim 16, wherein the step of updating the relatedness attributes comprises a step of systematically decreasing the life-span values over time.
18. A method according to any of claims 12 to 17, wherein the step of updating the set of documents comprises updating the set of documents in response to user input.
19. A method according to claim 18, wherein the step of updating the set of documents comprises adding new documents to the set of documents in the event of user input confirming that said new documents are of interest to the user.
20. A method according to any of claims 12 to 19, further comprising a step of updating the set of documents on the basis of documents viewed by the user following receipt of a response from a search engine to a search query.
21. A method according to any of claims 12 to 20, wherein said relatedness attributes are stored in the form of fuzzy sets.
22. Apparatus for improving database searching, comprising: user profile means, having access to a predetermined set of documents, arranged to provide data indicative of relatedness criteria between keywords from the set of documents; means for receiving a search query comprising one or more search keywords from a user; means arranged to access said user profile means and to identify therefrom, for the or each search keyword, potentially-related keywords according to predetermined criteria; means arranged to provide said potentially-related keywords to the user; means for receiving information from the user confirming that any potentially-related keywords are considered to be related keywords; means arranged to incorporate such potentially-related keywords as keywords in an improved search query in the event that they are confirmed by the user to be related keywords; and means for submitting the improved search query to a search engine.
23. Apparatus according to claim 22, wherein the predetermined set of documents is a set of documents expected to reflect the interests of a specific user.
24. Apparatus according to claim 22 or 23, wherein the predetermined set of documents is a set of documents derived from a set of documents previously viewed by the user.
25. Apparatus according to claim 22, 23 or 24, wherein the user profile means comprises means for identifying related keywords from the set of documents by means of a self-organising map algorithm.
26. Apparatus according to claim 22, 23, 24 or 25, wherein the user profile means comprises importance value deriving means for deriving importance values indicating the statistical significance of related keywords in the set of documents.
27. Apparatus according to any of claims 22 to 26, wherein the user profile means comprises means for assigning life-span values indicating an expected period of time of relatedness between keywords in the set of documents.
28. Apparatus according to any of claims 22 to 27, wherein the user profile means is arranged to provide said data indicative of relatedness criteria in the form of fuzzy sets.
29. Apparatus according to any of claims 22 to 28, further comprising means for updating the set of documents on the basis of documents viewed by the user following receipt of a response from a search engine to a search query.
30. Apparatus according to any of claims 22 to 29, wherein the user profile means further comprises means for updating the data indicative of relatedness criteria on the basis of information received from the user.
31. A method for improving database searching, comprising the steps of: receiving a search query comprising one or more search keywords from a user; accessing a user profile means arranged to provide data indicative of relatedness criteria between keywords from a set of documents, and identifying from said user profile means, for the or each search keyword, potentially-related keywords according to predetermined criteria; providing said potentially-related keywords to the user; receiving information from the user confirming that any potentially-related keywords are considered to be related keywords; in the event that any potentially-related keywords are confirmed by the user to be related keywords, incorporating such potentially-related keywords as keywords in an improved search query; and submitting the improved search query to a search engine.
32. A method according to claim 31, wherein the user profile means is arranged to identify said data indicative of relatedness criteria by means of a self-organising map algorithm.
33. A method according to claim 31 or 32, wherein the user profile means is arranged to provide importance values indicating the statistical significance of related keywords in the set of documents.
34. A method according to claim 31, 32 or 33, wherein the user profile means is arranged to provide life-span values indicating an expected period of time of relatedness between keywords in the set of documents.
35. A method according to any of claims 31 to 34, wherein the user profile means is arranged to provide said data indicative of relatedness criteria in the form of fuzzy sets.
36. A method according to any of claims 31 to 35, further comprising the step of updating the set of documents on the basis of documents viewed by the user following receipt of a response from a search engine to a search query.
37. A method according to any of claims 31 to 36, further comprising the step of updating the data indicative of relatedness criteria on the basis of information received from the user.
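
By way of illustration only, the learning and updating phases of claims 12 to 17 and the query improvement of claim 31 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: plain co-occurrence counting stands in for the self-organising map algorithm of claim 14, the importance value is taken to be the share of documents in which a keyword pair co-occurs, and the life-span is counted in update cycles. All names (Relatedness, learn_profile, update_profile, suggest_keywords, improve_query) and the default thresholds are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Relatedness:
    importance: float  # assumed: share of documents in which the pair co-occurs
    life_span: int     # assumed: update cycles left before the pair is dropped


def keyword_pairs(documents):
    """Count, once per document, each unordered pair of co-occurring keywords."""
    counts = Counter()
    for keywords in documents:
        counts.update(combinations(sorted(set(keywords)), 2))
    return counts


def learn_profile(documents, initial_life_span=3):
    """Learning phase: derive relatedness attributes from the document set."""
    counts = keyword_pairs(documents)
    return {pair: Relatedness(n / len(documents), initial_life_span)
            for pair, n in counts.items()}


def update_profile(profile, documents, initial_life_span=3):
    """Updating phase: re-analyse the updated document set."""
    counts = keyword_pairs(documents)
    # Existing and additional pairs found in the updated set are (re)assigned
    # an importance value and a fresh life-span (an assumption of this sketch).
    updated = {pair: Relatedness(n / len(documents), initial_life_span)
               for pair, n in counts.items()}
    # Existing pairs no longer found lose one cycle of life-span and are
    # dropped once it would reach zero.
    for pair, rel in profile.items():
        if pair not in counts and rel.life_span > 1:
            updated[pair] = Relatedness(rel.importance, rel.life_span - 1)
    return updated


def suggest_keywords(profile, query_keywords, min_importance=0.5):
    """Identify potentially-related keywords for each search keyword."""
    suggestions = set()
    for (a, b), rel in profile.items():
        if rel.importance < min_importance:
            continue
        if a in query_keywords and b not in query_keywords:
            suggestions.add(b)
        elif b in query_keywords and a not in query_keywords:
            suggestions.add(a)
    return sorted(suggestions)


def improve_query(query_keywords, suggestions, confirmed):
    """Incorporate only the suggestions the user confirmed as related."""
    return list(query_keywords) + [k for k in suggestions if k in confirmed]


if __name__ == "__main__":
    docs = [["jaguar", "car", "engine"], ["jaguar", "car"], ["jaguar", "cat"]]
    profile = learn_profile(docs)
    offered = suggest_keywords(profile, ["jaguar"])
    print(offered)                                    # ['car']
    print(improve_query(["jaguar"], offered, {"car"}))  # ['jaguar', 'car']
```

In this sketch the improved query is only built from keywords the user has confirmed, mirroring the confirmation step of claim 31; submitting the query to a search engine is left out.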
PCT/GB2004/000310 2003-01-24 2004-01-23 Searching apparatus and methods WO2004066163A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CA002513490A CA2513490A1 (en) 2003-01-24 2004-01-23 Searching apparatus and methods
US10/543,096 US20060136405A1 (en) 2003-01-24 2004-01-23 Searching apparatus and methods
EP04704667A EP1586058A1 (en) 2003-01-24 2004-01-23 Searching apparatus and methods

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB0301721A GB0301721D0 (en) 2003-01-24 2003-01-24 Search method and apparatus
GB0301721.7 2003-01-24
GB0309460.4 2003-04-25
GB0309460A GB0309460D0 (en) 2003-04-25 2003-04-25 Searching apparatus and methods

Publications (1)

Publication Number Publication Date
WO2004066163A1 (en) 2004-08-05

Family

ID=32773977

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2004/000310 WO2004066163A1 (en) 2003-01-24 2004-01-23 Searching apparatus and methods

Country Status (4)

Country Link
US (1) US20060136405A1 (en)
EP (1) EP1586058A1 (en)
CA (1) CA2513490A1 (en)
WO (1) WO2004066163A1 (en)

Families Citing this family (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8041713B2 (en) * 2004-03-31 2011-10-18 Google Inc. Systems and methods for analyzing boilerplate
US8631001B2 (en) * 2004-03-31 2014-01-14 Google Inc. Systems and methods for weighting a search query result
US7272601B1 (en) 2004-03-31 2007-09-18 Google Inc. Systems and methods for associating a keyword with a user interface area
US7664734B2 (en) * 2004-03-31 2010-02-16 Google Inc. Systems and methods for generating multiple implicit search queries
US7707142B1 (en) 2004-03-31 2010-04-27 Google Inc. Methods and systems for performing an offline search
US20080040315A1 (en) * 2004-03-31 2008-02-14 Auerbach David B Systems and methods for generating a user interface
US9009153B2 (en) 2004-03-31 2015-04-14 Google Inc. Systems and methods for identifying a named entity
US20070276801A1 (en) * 2004-03-31 2007-11-29 Lawrence Stephen R Systems and methods for constructing and using a user profile
US7693825B2 (en) * 2004-03-31 2010-04-06 Google Inc. Systems and methods for ranking implicit search results
US7788274B1 (en) 2004-06-30 2010-08-31 Google Inc. Systems and methods for category-based search
US8131754B1 (en) 2004-06-30 2012-03-06 Google Inc. Systems and methods for determining an article association measure
US7865495B1 (en) * 2004-10-06 2011-01-04 Shopzilla, Inc. Word deletion for searches
US20090049127A1 (en) * 2007-08-16 2009-02-19 Yun-Fang Juan System and method for invitation targeting in a web-based social network
US8027982B2 (en) * 2006-03-01 2011-09-27 Oracle International Corporation Self-service sources for secure search
US8433712B2 (en) * 2006-03-01 2013-04-30 Oracle International Corporation Link analysis for enterprise environment
US8214394B2 (en) 2006-03-01 2012-07-03 Oracle International Corporation Propagating user identities in a secure federated search system
US9177124B2 (en) 2006-03-01 2015-11-03 Oracle International Corporation Flexible authentication framework
US8332430B2 (en) * 2006-03-01 2012-12-11 Oracle International Corporation Secure search performance improvement
US8707451B2 (en) 2006-03-01 2014-04-22 Oracle International Corporation Search hit URL modification for secure application integration
US7941419B2 (en) 2006-03-01 2011-05-10 Oracle International Corporation Suggested content with attribute parameterization
US8868540B2 (en) * 2006-03-01 2014-10-21 Oracle International Corporation Method for suggesting web links and alternate terms for matching search queries
US8005816B2 (en) * 2006-03-01 2011-08-23 Oracle International Corporation Auto generation of suggested links in a search system
US8875249B2 (en) * 2006-03-01 2014-10-28 Oracle International Corporation Minimum lifespan credentials for crawling data repositories
US20070214129A1 (en) * 2006-03-01 2007-09-13 Oracle International Corporation Flexible Authorization Model for Secure Search
US20070271255A1 (en) * 2006-05-17 2007-11-22 Nicky Pappo Reverse search-engine
US7577718B2 (en) * 2006-07-31 2009-08-18 Microsoft Corporation Adaptive dissemination of personalized and contextually relevant information
US7849079B2 (en) * 2006-07-31 2010-12-07 Microsoft Corporation Temporal ranking of search results
US7685199B2 (en) * 2006-07-31 2010-03-23 Microsoft Corporation Presenting information related to topics extracted from event classes
KR20080096005A (en) * 2007-04-26 2008-10-30 엔에이치엔(주) Method for providing keyword depending on a range of providing keyword and system thereof
US20080288328A1 (en) * 2007-05-17 2008-11-20 Bryan Michael Minor Content advertising performance optimization system and method
US7996392B2 (en) 2007-06-27 2011-08-09 Oracle International Corporation Changing ranking algorithms based on customer settings
US8316007B2 (en) * 2007-06-28 2012-11-20 Oracle International Corporation Automatically finding acronyms and synonyms in a corpus
JP4510109B2 (en) * 2008-03-24 2010-07-21 富士通株式会社 Target content search support program, target content search support method, and target content search support device
WO2009117830A1 (en) * 2008-03-27 2009-10-01 Hotgrinds Canada System and method for query expansion using tooltips
US9092517B2 (en) 2008-09-23 2015-07-28 Microsoft Technology Licensing, Llc Generating synonyms based on query log data
US20100208984A1 (en) * 2009-02-13 2010-08-19 Microsoft Corporation Evaluating related phrases
US8868538B2 (en) 2010-04-22 2014-10-21 Microsoft Corporation Information presentation system
US9785987B2 (en) 2010-04-22 2017-10-10 Microsoft Technology Licensing, Llc User interface for information presentation system
US9600566B2 (en) 2010-05-14 2017-03-21 Microsoft Technology Licensing, Llc Identifying entity synonyms
US9043296B2 (en) 2010-07-30 2015-05-26 Microsoft Technology Licensing, Llc System of providing suggestions based on accessible and contextual information
TW201235867A (en) * 2011-02-18 2012-09-01 Hon Hai Prec Ind Co Ltd System and method for searching related terms
US9280535B2 (en) 2011-03-31 2016-03-08 Infosys Limited Natural language querying with cascaded conditional random fields
US9026519B2 (en) * 2011-08-09 2015-05-05 Microsoft Technology Licensing, Llc Clustering web pages on a search engine results page
US10739938B2 (en) * 2012-01-05 2020-08-11 International Business Machines Corporation Customizing a tag cloud
US20130332451A1 (en) * 2012-06-06 2013-12-12 Fliptop, Inc. System and method for correlating personal identifiers with corresponding online presence
US10032131B2 (en) 2012-06-20 2018-07-24 Microsoft Technology Licensing, Llc Data services for enterprises leveraging search system data assets
US9594831B2 (en) * 2012-06-22 2017-03-14 Microsoft Technology Licensing, Llc Targeted disambiguation of named entities
US9229924B2 (en) 2012-08-24 2016-01-05 Microsoft Technology Licensing, Llc Word detection and domain dictionary recommendation
WO2015175100A1 (en) * 2014-05-16 2015-11-19 Linkedin Corporation Suggested keywords
US9727654B2 (en) 2014-05-16 2017-08-08 Linkedin Corporation Suggested keywords
US10162820B2 (en) * 2014-05-16 2018-12-25 Microsoft Technology Licensing, Llc Suggested keywords
US20200341977A1 (en) * 2019-04-25 2020-10-29 Mycelebs Co., Ltd. Method and apparatus for managing attribute language

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000077689A1 (en) * 1999-06-16 2000-12-21 Triogo, Inc. A process for improving search engine efficiency using user feedback
US6256633B1 (en) * 1998-06-25 2001-07-03 U.S. Philips Corporation Context-based and user-profile driven information retrieval
US20020042793A1 (en) * 2000-08-23 2002-04-11 Jun-Hyeog Choi Method of order-ranking document clusters using entropy data and bayesian self-organizing feature maps

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5642502A (en) * 1994-12-06 1997-06-24 University Of Central Florida Method and system for searching for relevant documents from a text database collection, using statistical ranking, relevancy feedback and small pieces of text
US7263659B2 (en) * 1998-09-09 2007-08-28 Ricoh Company, Ltd. Paper-based interface for multimedia information
US6363377B1 (en) * 1998-07-30 2002-03-26 Sarnoff Corporation Search data processor
US6539375B2 (en) * 1998-08-04 2003-03-25 Microsoft Corporation Method and system for generating and using a computer user's personal interest profile
US6327590B1 (en) * 1999-05-05 2001-12-04 Xerox Corporation System and method for collaborative ranking of search results employing user and group profiles derived from document collection content analysis
US6895406B2 (en) * 2000-08-25 2005-05-17 Seaseer R&D, Llc Dynamic personalization method of creating personalized user profiles for searching a database of information
KR100516289B1 (en) * 2000-11-02 2005-09-21 주식회사 케이티 Content based image reference apparatus and method for relevance feedback using fussy integral
JP4259861B2 (en) * 2000-11-20 2009-04-30 ブリティッシュ・テレコミュニケーションズ・パブリック・リミテッド・カンパニー Information provider
US20020104088A1 (en) * 2001-01-29 2002-08-01 Philips Electronics North Americas Corp. Method for searching for television programs
US7836010B2 (en) * 2003-07-30 2010-11-16 Northwestern University Method and system for assessing relevant properties of work contexts for use by information services

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LAGUS K: "Text retrieval using self-organized document maps", NEURAL PROCESS. LETT. (NETHERLANDS), NEURAL PROCESSING LETTERS, FEB. 2002, KLUWER ACADEMIC PUBLISHERS, NETHERLANDS, vol. 15, no. 1, February 2002 (2002-02-01), pages 21 - 29, XP002286398, ISSN: 1370-4621 *
RONG JIN ET AL: "Meta-scoring: automatically evaluating term weighting schemes in IR without precision-recall", SIGIR FORUM (USA), SIGIR FORUM, 2001, ACM, USA, vol. spec. issue., 2001, pages 83 - 89, XP002286399, ISSN: 0163-5840 *
See also references of EP1586058A1 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006112856A1 (en) * 2005-04-15 2006-10-26 Kjn Partners, L.P. Method, system and software for centralized generation and storage of individualized requests and results
WO2007106269A1 (en) * 2006-03-02 2007-09-20 Microsoft Corporation Mining web search user behavior to enhance web search relevance
KR101366408B1 (en) 2006-03-02 2014-03-03 마이크로소프트 코포레이션 Mining web search user behavior to enhance web search relevance
WO2008059515A2 (en) * 2006-08-01 2008-05-22 Divyank Turakhia A system and method of generating related words and word concepts
WO2008059515A3 (en) * 2006-08-01 2009-09-24 Divyank Turakhia A system and method of generating related words and word concepts
EP2704080A1 (en) * 2007-05-25 2014-03-05 KIT Digital Inc. Recommendation systems and methods
US8402081B2 (en) 2008-02-25 2013-03-19 Atigeo, LLC Platform for data aggregation, communication, rule evaluation, and combinations thereof, using templated auto-generation
US8255396B2 (en) 2008-02-25 2012-08-28 Atigeo Llc Electronic profile development, storage, use, and systems therefor
EP2354982A1 (en) * 2008-02-25 2011-08-10 Atigeo LLC Electronic profile development, storage, use and systems for taking action based thereon
CN102067119A (en) * 2008-02-25 2011-05-18 水宙责任有限公司 Electronic profile development, storage, use and systems for taking action based thereon
WO2011109516A3 (en) * 2010-03-03 2012-01-05 Ebay Inc. Document processing using retrieval path data
WO2011109516A2 (en) * 2010-03-03 2011-09-09 Ebay Inc. Document processing using retrieval path data
US8984647B2 (en) 2010-05-06 2015-03-17 Atigeo Llc Systems, methods, and computer readable media for security in profile utilizing systems
WO2016176379A1 (en) * 2015-04-30 2016-11-03 Microsoft Technology Licensing, Llc Extracting and surfacing user work attributes from data sources
US10860956B2 (en) 2015-04-30 2020-12-08 Microsoft Technology Licensing, Llc Extracting and surfacing user work attributes from data sources

Also Published As

Publication number Publication date
EP1586058A1 (en) 2005-10-19
CA2513490A1 (en) 2004-08-05
US20060136405A1 (en) 2006-06-22

Similar Documents

Publication Publication Date Title
US20060136405A1 (en) Searching apparatus and methods
US5960422A (en) System and method for optimized source selection in an information retrieval system
US6029161A (en) Multi-level mindpool system especially adapted to provide collaborative filter data for a large scale information filtering system
US6687696B2 (en) System and method for personalized search, information filtering, and for generating recommendations utilizing statistical latent class models
RU2435212C2 (en) Collecting data on user behaviour during web search to increase web search relevance
US6182063B1 (en) Method and apparatus for cascaded indexing and retrieval
CN1192320C (en) Cooperative topical servers with automatic prefiltering and routing
EP1050830A2 (en) System and method for collaborative ranking of search results employing user and group profiles
US20060167896A1 (en) Systems and methods for managing and using multiple concept networks for assisted search processing
US20070156622A1 (en) Method and system to compose software applications by combining planning with semantic reasoning
EP1386250A1 (en) Very-large-scale automatic categorizer for web content
US20040107221A1 (en) Information storage and retrieval
Ding et al. User modeling for personalized Web search with self‐organizing map
Shinde et al. A new approach for on line recommender system in web usage mining
Aas A survey on personalized information filtering systems for the world wide web
Singh et al. Web semantics for personalized information retrieval
Amati et al. A Framework for Filtering News and Managing Distributed Data.
US20160085760A1 (en) Method for in-loop human validation of disambiguated features
Albana et al. Intelligent web objects prediction approach in web proxy cache using supervised machine learning and feature selection
WO2002037328A2 (en) Integrating search, classification, scoring and ranking
Abass et al. Information retrieval models, techniques and applications
Felden et al. Recommender systems based on an active data warehouse with text documents
Bottraud et al. An adaptive information research personal assistant
JP2002215674A (en) Web page browsing support system, method and program
Melguizo et al. What a proactive recommendation system needs-relevance, non-intrusiveness, and a new long-term memory

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2513490

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 2004704667

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2006136405

Country of ref document: US

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 10543096

Country of ref document: US

WWP Wipo information: published in national office

Ref document number: 2004704667

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 10543096

Country of ref document: US