EP0941515A1 - System for customized electronic identification of desirable objects - Google Patents

System for customized electronic identification of desirable objects

Info

Publication number
EP0941515A1
EP0941515A1 EP96939616A EP96939616A EP0941515A1 EP 0941515 A1 EP0941515 A1 EP 0941515A1 EP 96939616 A EP96939616 A EP 96939616A EP 96939616 A EP96939616 A EP 96939616A EP 0941515 A1 EP0941515 A1 EP 0941515A1
Authority
EP
European Patent Office
Prior art keywords
user
target
target object
sets
target objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP96939616A
Other languages
English (en)
French (fr)
Inventor
Frederick S.M. Herz
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pinpoint Inc
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Publication of EP0941515A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3326 Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/337 Profile generation, learning or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9038 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising

Definitions

  • This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment, and in particular to a system that automatically constructs both a "target profile" for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types of target objects.
  • the system evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects that are profiled on the electronic media.
  • Users' target profile interest summaries can be used to efficiently organize the distribution of information in a large scale system consisting of many users interconnected by means of a communication network.
  • a cryptographically based proxy server is provided to ensure the privacy of a user's target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.
  • PROBLEM It is a problem in the field of electronic media to enable a user to access information of relevance and interest to the user without requiring the user to expend an excessive amount of time and energy searching for the information.
  • Electronic media such as on-line information sources, provide a vast amount of information to users, typically in the form of "articles," each of which comprises a publication item or document that relates to a specific topic.
  • the difficulty with electronic media is that the amount of information available to the user is overwhelming and the article repository systems that are connected on-line are not organized in a manner that sufficiently simplifies access to only the articles of interest to the user.
  • a user either fails to access relevant articles because they are not easily identified or expends a significant amount of time and energy to conduct an exhaustive search of all articles to identify those most likely to be of interest to the user. Furthermore, even if the user conducts an exhaustive search, present information searching techniques do not necessarily accurately extract only the most relevant articles, but also present articles of marginal relevance due to the functional limitations of the information searching techniques. There is also no existing system which automatically estimates the inherent quality of an article or other target object to distinguish among a number of articles or target objects identified as of possible interest to a user.
  • Users may receive information on a computer network either by actively retrieving the information or by passively receiving information that is sent to them. Just as users of information retrieval systems face the problem of too much information, so do users who are targeted with electronic junk mail by individuals and organizations. An ideal system would protect the user from unsolicited advertising, both by automatically extracting only the most relevant messages received by electronic mail, and by preserving the confidentiality of the user's preferences, which should not be freely available to others on the network.
  • the Ringo system maintains a complete list of users' ratings of music selections and makes recommendations by finding which selections were liked by multiple people.
  • the Ringo system does not take advantage of any available descriptions of the music, such as structured descriptions in a data base, or free text, such as that contained in music reviews.
  • users must define news categories and the users actively indicate their opinion of the selected articles.
  • Their system uses a list of keywords to represent sets of articles and the records of users' interests are updated using genetic algorithms.
  • the cluster "summaries" are generated by picking those words which appear most frequently in the cluster and the titles of those articles closest to the center of the cluster. However, no feedback from users is collected or stored, so no performance improvement occurs over time.
  • Apple's Advanced Technology Group has developed an interface based on the concept of a "pile of articles". This interface is described in an article titled “A 'pile' metaphor for supporting casual organization of information in Human factors in computer systems” published in CHI '92 Conf. Proc. 627-634 by Mander, R. G. Salomon and Y. Wong. 1992. Another article titled “Content awareness in a file system interface: implementing the 'pile' metaphor for organizing information” was published in 16 Ann. Int'l SIGIR '93, ACM 260-269 by Rose E. D. et al. The Apple interface uses word frequencies to automatically file articles by picking the pile most similar to the article being filed. This system functions to cluster articles into subpiles, determine key words for indexing by picking the words with the largest TF/IDF (where TF is term (word) frequency and IDF is the inverse document frequency) and label piles by using the determined key words.
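The key-word selection attributed to the Apple interface above can be illustrated with a short sketch. The function name `tfidf_keywords`, the whitespace tokenization, and the logarithmic form of IDF are assumptions made for illustration; the cited articles may weight terms differently.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_n=3):
    """Return, for each document, the top_n words with the largest TF/IDF.

    Hypothetical helper: TF is the word's share of the document, IDF is
    log(number of documents / number of documents containing the word).
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(docs)
    keywords = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.append(ranked[:top_n])
    return keywords


piles = ["stock market falls", "stock market rises", "aardvark eats ants"]
labels = tfidf_keywords(piles, top_n=1)
```

Words shared by several piles ("stock", "market") score lower than words unique to one pile, which is why the unique words end up as labels.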
  • U.S. Patent No. 5,331,554 issued to Graham et al. discloses a method for retrieving segments of a manual by comparing a query with nodes in a decision tree.
  • U.S. Patent No. 5,331,556 addresses techniques for deriving morphological part-of-speech information and thus to make use of the similarities of different forms of the same word (e.g. "article" and "articles"). Therefore, there presently is no information retrieval and delivery system operable in an electronic media environment that enables a user to access information of relevance and interest to the user without requiring the user to expend an excessive amount of time and energy.
  • target object: an object available for access by the user, which may be either physical or electronic in nature, is termed a "target object"
  • target profile: a digitally represented profile indicating that target object's attributes
  • user profile: a profile holding that user's attributes, including age/zip code/etc.
  • (e.) a summary of digital profiles of target objects that a user likes and/or dislikes is termed the "target profile interest summary" of that user,
  • (f.) a profile consisting of a collection of attributes, such that a user likes target objects whose profiles are similar to this collection of attributes, is termed a "search profile" or in some contexts a "query" or "query profile,"
  • (g.) a specific embodiment of the target profile interest summary which comprises a set of search profiles is termed the "search profile set" of a user,
  • (h.) a collection of target objects with similar profiles is termed a "cluster,"
  • (i.) an aggregate profile formed by averaging the attributes of all target objects in a cluster is termed a "cluster profile,"
  • (j.) a real number determined by calculating the statistical variance of the profiles of all target objects in a cluster is termed a "cluster variance,"
  • (k.) a real number determined by calculating the maximum distance
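The "cluster profile" and "cluster variance" definitions above can be made concrete with a minimal sketch. It assumes each target profile is already a list of numbers of equal length, and it interprets "statistical variance of the profiles" as the mean squared Euclidean distance to the cluster profile; that interpretation and the function names are assumptions, not taken from the source.

```python
def cluster_profile(profiles):
    """Average each attribute over all target profiles in the cluster."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def cluster_variance(profiles):
    """Mean squared Euclidean distance from each profile to the cluster profile.

    Assumed reading of "statistical variance"; other definitions are possible.
    """
    center = cluster_profile(profiles)
    total = sum(
        sum((x - c) ** 2 for x, c in zip(profile, center))
        for profile in profiles
    )
    return total / len(profiles)


cluster = [[0.0, 0.0], [2.0, 2.0]]
center = cluster_profile(cluster)      # [1.0, 1.0]
spread = cluster_variance(cluster)     # 2.0
```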
  • the system for electronic identification of desirable objects of the present invention automatically constructs both a target profile for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types of target objects.
  • the system evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects available on the electronic media.
  • a target profile interest summary for a single user must represent multiple areas of interest, for example, by consisting of a set of individual search profiles, each of which identifies one of the user's areas of interest.
  • Each user is presented with those target objects whose profiles most closely match the user's interests as described by the user's target profile interest summary.
  • Users' target profile interest summaries are automatically updated on a continuing basis to reflect each user's changing interests.
  • target objects can be grouped into clusters based on their similarity to each other, for example, based on similarity of their topics in the case where the target objects are published articles, and menus automatically generated for each cluster of target objects to allow users to navigate throughout the clusters and manually locate target objects of interest.
  • a particular user may not wish to make public all of the interests recorded in the user's target profile interest summary, particularly when these interests are determined by the user's purchasing patterns.
  • the user may desire that all or part of the target profile interest summary be kept confidential, such as information relating to the user's political, religious, financial or purchasing behavior; indeed, confidentiality with respect to purchasing behavior is the user's legal right in many states. It is therefore necessary that data in a user's target profile interest summary be protected from unwanted disclosure except with the user's agreement.
  • the user's target profile interest summaries must be accessible to the relevant servers that perform the matching of target objects to the users, if the benefit of this matching is desired by both providers and consumers of the target objects.
  • the disclosed system provides a solution to the privacy problem by using a proxy server which acts as an intermediary between the information provider and the user.
  • the proxy server dissociates the user's true identity from the pseudonym by the use of cryptographic techniques.
  • the proxy server also permits users to control access to their target profile interest summaries and/or user profiles, including provision of this information to marketers and advertisers if they so desire, possibly in exchange for cash or other considerations. Marketers may purchase these profiles in order to target advertisements to particular users, or they may purchase partial user profiles, which do not include enough information to identify the individual users in question, in order to carry out standard kinds of demographic analysis and market research on the resulting database of partial user profiles.
  • the system for customized electronic identification of desirable objects uses a fundamental methodology for accurately and efficiently matching users and target objects by automatically calculating, using and updating profile information that describes both the users' interests and the target objects' characteristics.
  • the target objects may be published articles, purchasable items, or even other people, and their properties are stored, and/or represented and/or denoted on the electronic media as (digital) data. Examples of target objects can include, but are not limited to: a newspaper story of potential interest, a movie to watch, an item to buy, e-mail to receive, or another person to correspond with.
  • the information delivery process in the preferred embodiment is based on determining the similarity between a profile for the target object and the profiles of target objects for which the user (or a similar user) has provided positive feedback in the past.
  • the individual data that describe a target object and constitute the target object's profile are herein termed "attributes" of the target object.
  • Attributes may include, but are not limited to, the following: (1) long pieces of text (a newspaper story, a movie review, a product description or an advertisement), (2) short pieces of text (name of a movie's director, name of town from which an advertisement was placed, name of the language in which an article was written), (3) numeric measurements (price of a product, rating given to a movie, reading level of a book), (4) associations with other types of objects (list of actors in a movie, list of persons who have read a document). Any of these attributes, but especially the numeric ones, may correlate with the quality of the target object, such as measures of its popularity (how often it is accessed) or of user satisfaction (number of complaints received).
  • the preferred embodiment of the system for customized electronic identification of desirable objects operates in an electronic media environment for accessing these target objects, which may be news, electronic mail, other published documents, or product descriptions.
  • the system in its broadest construction comprises three conceptual modules, which may be separate entities distributed across many implementing systems, or combined into a lesser subset of physical entities.
  • the specific embodiment of this system disclosed herein illustrates the use of a first module which automatically constructs a "target profile" for each target object in the electronic media based on various descriptive attributes of the target object.
  • a second module uses interest feedback from users to construct a "target profile interest summary" for each user, for example in the form of a "search profile set" consisting of a plurality of search profiles, each of which corresponds to a single topic of high interest for the user.
  • the system further includes a profile processing module which estimates each user's interest in various target objects by reference to the users' target profile interest summaries, for example by comparing the target profiles of these target objects against the search profiles in users' search profile sets, and generates for each user a customized rank-ordered listing of target objects most likely to be of interest to that user.
  • Each user's target profile interest summary is automatically updated on a continuing basis to reflect the user's changing interests.
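The profile processing module described above, which compares target profiles against a user's search profile set and returns a rank-ordered listing, might be sketched as follows. Cosine similarity and the best-match (maximum) rule are assumptions chosen for illustration; this excerpt does not specify the similarity measure, and `rank_targets` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two numeric profiles (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(target_profiles, search_profile_set):
    """Order target names by their best match against any search profile.

    target_profiles: dict mapping a target object's name to its numeric profile.
    search_profile_set: list of numeric search profiles for one user.
    """
    def interest(profile):
        return max(
            (cosine_similarity(profile, s) for s in search_profile_set),
            default=0.0,
        )

    return sorted(
        target_profiles,
        key=lambda name: (-interest(target_profiles[name]), name),
    )


targets = {"sports": [1, 0, 0], "finance": [0, 1, 0], "cooking": [0, 0, 1]}
user_search_profiles = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
ranking = rank_targets(targets, user_search_profiles)
```

Taking the maximum over the search profile set reflects the idea that each search profile stands for one separate area of interest, so a target object only needs to match one of them well.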
  • Target objects may be of various sorts, and it is sometimes advantageous to use a single system that delivers and/or clusters target objects of several distinct sorts at once, in a unified framework. For example, users who exhibit a strong interest in certain novels may also show an interest in certain movies, presumably of a similar nature.
  • a system in which some target objects are novels and other target objects are movies can discover such a correlation and exploit it in order to group particular novels with particular movies, e.g., for clustering purposes, or to recommend the movies to a user who has demonstrated interest in the novels.
  • the system can match the products with the sites and thereby recommend to the marketers of those products that they place advertisements at those sites, e.g., in the form of hypertext links to their own sites.
  • the ability to measure the similarity of profiles describing target objects and a user's interests can be applied in two basic ways: filtering and browsing. Filtering is useful when large numbers of target objects are described in the electronic media space. These target objects can for example be articles that are received or potentially received by a user, who only has time to read a small fraction of them. For example, one might potentially receive all items on the AP news wire service, all items posted to a number of news groups, all advertisements in a set of newspapers, or all unsolicited electronic mail, but few people have the time or inclination to read so many articles.
  • a filtering system in the system for customized electronic identification of desirable objects automatically selects a set of articles that the user is likely to wish to read.
  • This filtering system improves over time by noting which articles the user reads and by generating a measurement of the depth to which the user reads each article. This information is then used to update the user's target profile interest summary. Browsing provides an alternate method of selecting a small subset of a large number of target objects, such as articles. Articles are organized so that users can actively navigate among groups of articles by moving from one group to a larger, more general group, to a smaller, more specific group, or to a closely related group. Each individual article forms a one-member group of its own, so that the user can navigate to and from individual articles as well as larger groups.
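The excerpt above says that reading depth is used to update the user's target profile interest summary but does not say how. One plausible rule, shown purely as an assumption, moves a search profile toward the profile of an article in proportion to how deeply it was read. The name `update_search_profile`, the `learning_rate` parameter, and the 0-to-1 depth scale are all hypothetical.

```python
def update_search_profile(search_profile, article_profile, read_depth,
                          learning_rate=0.2):
    """Shift a search profile toward an article the user read.

    read_depth is assumed to lie in [0, 1]: 0 means the article was skipped,
    1 means it was read to the end. This update rule is an illustration only.
    """
    if not 0.0 <= read_depth <= 1.0:
        raise ValueError("read_depth must be between 0 and 1")
    step = learning_rate * read_depth
    return [s + step * (a - s) for s, a in zip(search_profile, article_profile)]


before = [0.0, 1.0]
article = [1.0, 1.0]
fully_read = update_search_profile(before, article, read_depth=1.0,
                                   learning_rate=0.5)
skipped = update_search_profile(before, article, read_depth=0.0)
```

An article that was skipped leaves the profile unchanged, while one read to the end pulls the profile toward it by the full learning rate.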
  • the methods used by the system for customized electronic identification of desirable objects allow articles to be grouped into clusters and the clusters to be grouped and merged into larger and larger clusters. These hierarchies of clusters then form the basis for menuing and navigational systems to allow the rapid searching of large numbers of articles. This same clustering technique is applicable to any type of target objects that can be profiled on the electronic media.
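The grouping of articles into clusters, and of clusters into larger clusters, can be sketched with a simple bottom-up procedure. The merge rule used here (repeatedly join the two clusters whose average profiles are closest) is an assumption for illustration and is not necessarily the clustering process the patent's figures describe; `build_cluster_tree` is a hypothetical name.

```python
import math


def centroid(vectors):
    """Average profile of a list of equal-length numeric vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_cluster_tree(articles):
    """Merge the two closest clusters until a single tree remains.

    articles: dict mapping an article name to its numeric profile.
    Returns a nested tuple; each leaf is an article name, so every article
    starts as a one-member group of its own.
    """
    if not articles:
        raise ValueError("at least one article is required")
    # Each entry is (subtree, list of member profiles).
    clusters = [(name, [profile]) for name, profile in articles.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i][1]), centroid(clusters[j][1]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


tree = build_cluster_tree({"a": [0.0, 0.0], "b": [0.0, 1.0], "c": [10.0, 10.0]})
```

Each level of the resulting tree could back one level of a navigation menu, with the nested groups offering the "larger, more general" and "smaller, more specific" moves.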
  • the detailed, comprehensive target profiles and user-specific target profile interest summaries enable the system to provide responsive routing of specific queries for user information access.
  • the information maps so produced and the application of users' target profile interest summaries to predict the information consumption patterns of a user allows for pre-caching of data at locations on the data communication network and at times that minimize the traffic flow in the communication network to thereby efficiently provide the desired information to the user and/or conserve valuable storage space by only storing those target objects (or segments thereof) which are relevant to the user's interests.
  • FIG. 1 illustrates in block diagram form a typical architecture of an electronic media system in which the system for customized electronic identification of desirable objects of the present invention can be implemented as part of a user server system;
  • FIG. 2 illustrates in block diagram form one embodiment of the system for customized electronic identification of desirable objects
  • Figures 3 and 4 illustrate typical network trees
  • Figure 5 illustrates in flow diagram form a method for automatically generating article profiles and an associated hierarchical menu system
  • Figures 6-9 illustrate examples of the menu generating process
  • Figure 10 illustrates in flow diagram form the operational steps taken by the system for customized electronic identification of desirable objects to screen articles for a user
  • Figure 11 illustrates a hierarchical cluster tree example
  • Figure 12 illustrates in flow diagram form the process for determination of likelihood of interest by a specific user in a selected target object
  • Figures 13A-B illustrate in flow diagram form the automatic clustering process
  • Figure 14 illustrates in flow diagram form the use of the pseudonymous server
  • Figure 15 illustrates in flow diagram form the use of the system for accessing information in response to a user query
  • Figure 16 illustrates in flow diagram form the use of the system for accessing information in response to a user query when the system is a distributed network implementation.
  • Target objects being compared can be, as an example but not limited to: textual documents, human beings, movies, or mutual funds.
  • the computed similarity measurements serve as input to additional processes, which function to enable human users to locate desired target objects using a large computer system.
  • additional processes estimate a human user's interest in various target objects, or else cluster a plurality of target objects into logically coherent groups.
  • the methods used by these additional processes might in principle be implemented on either a single computer or on a computer network. Jointly or separately, they form the underpinning for various sorts of database systems and information retrieval systems.
  • IR Information Retrieval
  • the user is a literate human and the target objects in question are textual documents stored on data storage devices interconnected to the user via a computer network. That is, the target objects consist entirely of text, and so are digitally stored on the data storage devices within the computer network.
  • target object domains that present related retrieval problems that are not capable of being solved by present information retrieval technology: (a.) the user is a film buff and the target objects are movies available on videotape.
  • the user is a consumer and the target objects are products being sold through promotional deals.
  • the user is an investor and the target objects are publicly traded stocks, mutual funds and/or real estate properties.
  • the user is a student and the target objects are classes being offered.
  • the user is an activist and the target objects are Congressional bills of potential concern
  • the user is a direct-mail marketer and the target objects are potential customers.
  • the user is a net-surfer and the target objects are pages, servers, or newsgroups available on the World Wide Web.
  • the user is a philanthropist and the target objects are charities.
  • the user is ill and the target objects are medical specialists.
  • the user is an employee and the target objects are potential employers.
  • the user is an employer and the target objects are potential employees.
  • the user is a beleaguered executive and the target objects are electronic mail messages addressed to the user.
  • the user is a lonely heart and the target objects are potential conversation partners.
  • the user is in search of an expert and the target objects are users, with known retrieval habits, of a document retrieval system.
  • the user is a social worker and the target objects are families that may need extra visits.
  • the user is an auto insurance company and the target objects are potential customers
  • the user wishes to locate some small subset of the target objects
  • the task is to help the user identify the most interesting target objects, where the user's interest in a target object is defined to be a numerical measurement of the user's relative desire to locate that object rather than others.
  • the generality of this problem motivates a general approach to solving the information retrieval problems noted above. It is assumed that many target objects are known to the system for customized electronic identification of desirable objects, and that specifically, the system stores (or has the ability to reconstruct) several pieces of information about each target object.
  • attributes collectively, they are said to form a profile of the target object, or a "target profile.”
  • attributes such as these: (a.) title of movie,
  • Attributes c-g are numeric attributes, of the sort that might be found in a database record. It is evident that they can be used to help the user identify target objects (movies) of interest. For example, the user might previously have rented many Parental Guidance (PG) films, and many films made in the 1970's. This generalization is useful: new films with values for one or both attributes that are numerically similar to these (such as MPAA rating of 1, release date of 1975) are judged similar to the films the user already likes, and therefore of probable interest. Attributes a-b and h are textual attributes. They too are important for helping the user locate desired films.
  • Attribute i is an associative attribute. It records associations between the target objects in this domain, namely movies, and ancillary target objects of an entirely different sort, namely humans.
  • a good indication that the user wants to rent a particular movie is that the user has previously rented other movies with similar attribute values, and this holds for attribute i just as it does for attributes a-h.
  • Attribute j is another example of an associative attribute, recording associations between target objects and actors. Notice that any of these attributes can be made subject to authentication when the profile is constructed, through the use of digital signatures; for example, the target object could be accompanied by a digitally signed note from the MPAA, which note names the target object and specifies its authentic value for attribute c.
  • Three kinds of attributes are common: numeric, textual, and associative.
  • the system might only consider a single, textual attribute when measuring similarity: the full text of the target object.
  • numeric and associative attributes are common:
  • a great many attributes might be used to characterize each corporation, including but not limited to the following: (a.) type of business (textual), (b.) corporate mission statement (textual), (c.) number of employees during each of the last 10 years (ten separate numeric attributes),
  • a target object's popularity can be usefully measured as a numeric attribute specifying the number of users who have retrieved that object.
  • Related measurable numeric attributes that also indicate a kind of popularity include the number of replies to a target object, in the domain where target objects are messages posted to an electronic community such as a computer bulletin board or newsgroup, and the number of links leading to a target object, in the domain where target objects are interlinked hypertext documents on the World Wide Web or a similar system.
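The "number of links leading to a target object" can be computed directly from a table of outgoing links. The sketch below assumes the link structure is available as a dictionary; the function name and the choices to ignore self-links and repeated links from the same page are illustrative assumptions.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count, for each page, how many other pages link to it.

    links: dict mapping a page to the list of pages it links to.
    Self-links and duplicate links from one page are ignored (an assumption).
    """
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


web = {"a": ["b", "c"], "b": ["c"], "c": ["c", "a"]}
popularity = inbound_link_counts(web)
```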
  • a target object may also receive explicit numeric evaluations (another kind of numeric attribute) from various groups, such as the Motion Picture Association of America (MPAA), as above, which rates movies' appropriateness for children, or the American Medical Association, which might rate the accuracy and novelty of medical research papers, or a random survey sample of users (chosen from all users or a selected set of experts), who could be asked to rate nearly anything.
  • Certain other types of evaluation, which also yield numeric attributes, may be carried out mechanically.
  • the difficulty of reading a text can be assessed by standard procedures that count word and sentence lengths, while the vulgarity of a text could be defined as (say) the number of vulgar words it contains, and the expertise of a text could be crudely assessed by counting the number of similar texts its author had previously retrieved and read using the invention, perhaps confining this count to texts that have high approval ratings from critics.
  • although textual and associative attributes are large and complex pieces of data, for information retrieval purposes they can be decomposed into smaller, simpler numeric attributes.
  • any set of attributes can be replaced by a (usually larger) set of numeric attributes, and hence that any profile can be represented as a vector of numbers denoting the values of these numeric attributes.
  • a textual attribute, such as the full text of a movie review, can be replaced by a collection of numeric attributes that represent scores to denote the presence and significance of the words "aardvark," "aback," "abacus," and so on through "zymurgy" in that text.
  • the score of a word in a text may be defined in numerous ways.
  • the score is the rate of the word in the text, which is computed by counting the number of times the word occurs in the text, and dividing this number by the total number of words in the text.
  • This sort of score is often called the "term frequency" (TF) of the word.
  • the definition of term frequency may optionally be modified to weight different portions of the text unequally: for example, any occurrence of a word in the text's title might be counted as a 3-fold or more generally k-fold occurrence (as if the title had been repeated k times within the text), in order to reflect a heuristic assumption that the words in the title are particularly important indicators of the text's content or topic.
  • the score of a word is typically defined to be not merely its term frequency, but its term frequency multiplied by the negated logarithm of the word's "global frequency," as measured with respect to the textual attribute in question.
  • the global frequency of a word, which effectively measures the word's uninformativeness, is a fraction between 0 and 1, defined to be the fraction of all target objects for which the textual attribute in question contains this word.
  • This adjusted score is often known in the art as TF/IDF ("term frequency times inverse document frequency").
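The term-frequency and adjusted scores described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, texts are assumed to be pre-tokenized lists of words, and the scored text is assumed to be part of the collection (so that every word has a nonzero global frequency).

```python
import math
from collections import Counter


def term_frequency(words):
    """Rate of each word: occurrences divided by the total number of words."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def global_frequency(word, all_texts):
    """Fraction of all texts (target objects) that contain the word."""
    containing = sum(1 for text in all_texts if word in text)
    return containing / len(all_texts)


def tf_idf(words, all_texts):
    """Term frequency multiplied by the negated log of global frequency."""
    tf = term_frequency(words)
    return {
        word: rate * -math.log(global_frequency(word, all_texts))
        for word, rate in tf.items()
    }
```

A word that appears in every text has a global frequency of 1, so its adjusted score is zero no matter how often it occurs, which is the sense in which global frequency measures uninformativeness.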
  • this sentence contains a sequence of overlapping character 5-grams which starts “for e”, “or ex”, “r exa”, “ exam”, “examp”, etc.
  • the sentence may be characterized, imprecisely but usefully, by the score of each possible character 5-gram ("aaaaa", "aaaab", ... "zzzzz") in the sentence.
  • most of these numeric attributes have values of 0, since most 5-grams do not appear in the target object attributes.
  • These zero values need not be stored anywhere.
  • the value of a textual attribute could be characterized by storing the set of character 5-grams that actually do appear in the text, together with the nonzero score of each one. Any 5-gram that is not included in the set can be assumed to have a score of zero.
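The sparse 5-gram representation can be sketched as a dictionary that holds only the 5-grams actually present. The function names are hypothetical, and scoring each 5-gram by its rate (by analogy with term frequency) is an assumption; the text leaves the exact score open.

```python
from collections import Counter


def char_ngram_scores(text, n=5):
    """Map each overlapping character n-gram in the text to its rate."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def score_of(scores, gram):
    """Any n-gram that is not stored is taken to have a score of zero."""
    return scores.get(gram, 0.0)
```

Only the nonzero entries are stored, so storage grows with the length of the text rather than with the number of possible 5-grams.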
  • the decomposition of textual attributes is not limited to attributes whose values are expected to be long texts.
  • a simple, one-term textual attribute can be replaced by a collection of numeric attributes in exactly the same way.
  • the "name of director" attribute, which is textual, can be replaced by numeric attributes giving the scores for "Federico-Fellini," "Woody-Allen," "Terence-Davies," and so forth, in that attribute.
  • the score of a word is usually defined to be its rate in the text, without any consideration of global frequency. Note that under these conditions, one of the scores is 1, while the other scores are 0 and need not be stored.
  • an associative attribute may be decomposed into a number of component associations.
  • a typical associative attribute used in profiling a movie would be a list of customers who have rented that movie. This list can be replaced by a collection of numeric attributes, which give the "association scores" between the movie and each of the customers known to the system. For example, the 165th such numeric attribute would be the association score between the movie and customer #165, where the association score is defined to be 1 if customer #165 has previously rented the movie, and 0 otherwise.
  • this association score could be defined to be the degree of interest, possibly zero, that customer #165 exhibited in the movie, as determined by relevance feedback (as described below).
  • an associative attribute indicating the major shareholders of the company would be decomposed into a collection of association scores, each of which would indicate the percentage of the company (possibly zero) owned by some particular individual or corporate body.
  • each association score may optionally be adjusted by a multiplicative factor: for example, the association score between a movie and customer #165 might be multiplied by the negated logarithm of the "global frequency" of customer #165, i.e., the fraction of all movies that have been rented by customer #165.
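A short sketch of the adjusted association scores for one movie follows. The data layout (a dictionary from customer number to the set of movies rented) and the function name are assumptions made for illustration only.

```python
import math


def association_scores(movie, rentals, all_movies):
    """Adjusted association scores between one movie and each customer.

    rentals: dict mapping customer id -> set of movies that customer rented.
    The base score is 1 if the customer rented the movie and 0 otherwise;
    it is multiplied by the negated log of the customer's global frequency.
    Only nonzero scores are kept.
    """
    scores = {}
    for customer, rented in rentals.items():
        if movie not in rented:
            continue  # base score 0, nothing to store
        global_freq = len(rented) / len(all_movies)
        score = 1.0 * -math.log(global_freq)
        if score > 0:
            scores[customer] = score
    return scores
```

A customer who has rented every movie has a global frequency of 1 and therefore contributes nothing, just as a word found in every text contributes nothing to a textual attribute.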
  • as with the scores used in decomposing textual attributes, most association scores found when decomposing a particular value of an associative attribute are zero, and a similar economy of storage may be gained in exactly the same manner by storing a list of only those ancillary objects with which the target object has a nonzero association score, together with their respective association scores.
  • Similarity Measures. What does it mean for two target objects to be similar?
  • the distance between two values of a given attribute is defined according to whether the attribute is a numeric, associative, or textual attribute. If the attribute is numeric, then the distance between two values of the attribute is the absolute value of the difference between the two values. (Other definitions are also possible: for example, the distance between prices p1 and p2 might be defined by
  • V may therefore be regarded as a vector with components $V_1$, $V_2$, $V_3$, etc., representing the association scores between the object and ancillary objects 1, 2, 3, etc., respectively.
  • the distance between two vector values V and U of an associative attribute is then computed using the angle distance measure, $\arccos\left(V U^t / \sqrt{(V V^t)(U U^t)}\right)$.
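The angle distance measure can be sketched over sparse vectors stored as dictionaries of nonzero scores. The function name and the choice to report the angle in degrees are assumptions; the clamp guards only against floating-point rounding.

```python
import math


def angle_distance(v, u):
    """Angle, in degrees, between two sparse score vectors (dicts)."""
    dot = sum(score * u.get(key, 0.0) for key, score in v.items())
    vv = sum(score * score for score in v.values())
    uu = sum(score * score for score in u.values())
    if vv == 0 or uu == 0:
        raise ValueError("angle is undefined for an all-zero vector")
    cosine = dot / math.sqrt(vv * uu)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return math.degrees(math.acos(cosine))
```

Vectors that point in the same direction are at distance 0 regardless of their lengths, and vectors that share no nonzero component are at distance 90 degrees.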
  • One technique is to augment the set of words actually found in the article with a set of synonyms or other words which tend to co-occur with the words in the article, so that "Kennedy" could be added to every article that mentions "JFK."
  • words found in the article may be wholly replaced by synonyms, so that "JFK” might be replaced by "Kennedy” or by "John F. Kennedy” wherever it appears.
  • the synonym dictionary may be sensitive to the topic of the document as a whole; for example, it may recognize that "crane” is likely to have a different synonym in a document that mentions birds than in a document that mentions construction.
  • a related technique is to replace each word by its morphological stem, so that "staple", "stapler", and "staples" are all replaced by "staple."
  • Common function words ("a", "and", "the", ...) can influence the calculated similarity of texts without regard to their topics, and so are typically removed from the text before the scores of terms in the text are computed.
  • a more general approach to recognizing synonyms is to use a revised measure of the distance between textual attribute vectors V and U, namely $\arccos\left((AV)(AU)^t / \sqrt{(AV)(AV)^t \, (AU)(AU)^t}\right)$, where the matrix A is the dimensionality-reducing linear transformation (or an approximation thereto) determined by collecting the vector values of the textual attribute, for all target objects known to the system, and applying singular value decomposition to the resulting collection.
  • the same approach can be applied to the vector values of associative attributes.
  • the above definitions allow us to determine how close together two target objects are with respect to a single attribute, whether numeric, associative, or textual.
  • the distance between two target objects X and Y with respect to their entire multi-attribute profiles $P_X$ and $P_Y$ is then denoted d(X,Y) or $d(P_X, P_Y)$ and defined as:
  • k is a fixed positive real number, typically 2, and the weights are non-negative real numbers indicating the relative importance of the various attributes. For example, if the target objects are consumer goods, and the weight of the "color" attribute is comparatively very small, then color is not a consideration in determining similarity: a user who likes a brown massage cushion is predicted to show equal interest in the same cushion manufactured in blue, and vice-versa.
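The defining formula is not reproduced in the text above. The sketch below assumes one combination that is consistent with the stated roles of k and the weights: each per-attribute distance is scaled by its weight, raised to the power k, summed, and the k-th root is taken. Both the exact form and the function name are assumptions.

```python
def profile_distance(attribute_distances, weights, k=2.0):
    """Combine per-attribute distances into one profile distance.

    attribute_distances: dict attribute name -> distance for that attribute.
    weights: dict attribute name -> non-negative importance weight.
    Assumed form: (sum over a of (weight_a * distance_a) ** k) ** (1 / k).
    """
    total = sum(
        (weights[name] * distance) ** k
        for name, distance in attribute_distances.items()
    )
    return total ** (1.0 / k)
```

With the weight of "color" set to zero, two cushions that differ only in color are at distance zero, which matches the massage-cushion example.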
  • Target objects may be of various sorts, and it is sometimes advantageous to use a single system that is able to compare target objects of distinct sorts. For example, in a system where some target objects are novels while other target objects are movies, it is desirable to judge a novel and a movie similar if their profiles show that similar users like them (an associative attribute).
  • let Max(a) be an upper bound on the distance between two values of attribute a; notice that if attribute a is an associative or textual attribute, this distance is an angle determined by arccos, so that Max(a) may be chosen to be 180 degrees, while if attribute a is a numeric attribute, a sufficiently large number must be selected by the system designers.
  • the distance between two values of attribute a is given as before in the case where both values are defined; the distance between two undefined values is taken to be zero; finally, the distance between a defined value and an undefined value is always taken to be Max(a)/2.
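The rule for undefined values can be sketched as a thin wrapper around any per-attribute distance. Representing an undefined value as `None`, and the function and parameter names, are assumptions for illustration.

```python
def attribute_distance(x, y, max_a, distance):
    """Distance between two attribute values, either of which may be undefined.

    max_a: upper bound Max(a) on the distance for this attribute.
    distance: function used when both values are defined.
    """
    if x is None and y is None:
        return 0.0            # two undefined values are at distance zero
    if x is None or y is None:
        return max_a / 2.0    # defined versus undefined: half the upper bound
    return distance(x, y)
```

This lets a novel and a movie be compared on the attributes they share, while attributes that only one of them has contribute a fixed, moderate distance.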
  • a simple application of the similarity measurement is a system to match buyers with sellers in small-volume markets, such as used cars and other used goods, artwork, or employment.
  • Sellers submit profiles of the goods (target objects) they want to sell, and buyers submit profiles of the goods (target objects) they want to buy. Participants may submit or withdraw these profiles at any time.
  • the system for customized electronic identification of desirable objects computes the similarities between seller-submitted profiles and buyer-submitted profiles, and when two profiles match closely (i.e., the similarity is above a threshold), the corresponding seller and buyer are notified of each other's identities. To prevent users from being flooded with responses, it may be desirable to limit the number of notifications each user receives to a fixed number, such as ten per day.
  • Filtering: Relevance Feedback
  • a filtering system is a device that can search through many target objects and estimate a given user's interest in each target object, so as to identify those that are of greatest interest to the user.
  • the filtering system uses relevance feedback to refine its knowledge of the user's interests: whenever the filtering system identifies a target object as potentially interesting to a user, the user (if an on-line user) provides feedback as to whether or not that target object really is of interest.
  • Such feedback is stored long-term in summarized form, as part of a database of user feedback information, and may be provided either actively or passively.
  • in active feedback, the user explicitly indicates his or her interest, for instance, on a scale of -2 (active distaste) through 0 (no special interest) to 10 (great interest).
  • in passive feedback, the system infers the user's interest from the user's behavior. For example, if target objects are textual documents, the system might monitor which documents the user chooses to read, or not to read, and how much time the user spends reading them.
  • a typical formula for assessing interest in a document via passive feedback, in this domain, on a scale of 0 to 10, might be: + 2 if the second page is viewed,
  • target objects are electronic mail messages
  • interest points might also be added in the case of a particularly lengthy or particularly prompt reply.
  • if target objects are purchasable goods, interest points might be added for target objects that the user actually purchases, with further points in the case of a large-quantity or high-price purchase. In any domain, further points might be added for target objects that the user accesses early in a session, on the grounds that users access the objects that most interest them first.
  • Other potential sources of passive feedback include an electronic measurement of the extent to which the user's pupils dilate while the user views the target object or a description of the target object. It is possible to combine active and passive feedback.
  • One option is to take a weighted average of the two ratings.
  • Another option is to use passive feedback by default, but to allow the user to examine and actively modify the passive feedback score. In the scenario above, for instance, an uninteresting article may sometimes remain on the display device for a long period while the user is engaged in unrelated business; the passive feedback score is then inappropriately high, and the user may wish to correct it before continuing.
  • a visual indicator, such as a sliding bar or indicator needle on the user's screen, can be used to continuously display the passive feedback score estimated by the system for the target object being viewed, unless the user has manually adjusted the indicator by a mouse operation or other means in order to reflect a different score for this target object, after which the indicator displays the active feedback score selected by the user, and this active feedback score is used by the system instead of the passive feedback score.
  • the user cannot see or adjust the indicator until just after the user has finished viewing the target object. Regardless of how a user's feedback is computed, it is stored long-term as part of that user's target profile interest summary.
  • For target objects that the user has not yet seen, the filtering system must estimate the user's interest. This estimation task is the heart of the filtering problem, and the reason that the similarity measurement is important. More concretely, the preferred embodiment of the filtering system is a news clipping service that periodically presents the user with news articles of potential interest. The user provides active and/or passive feedback to the system relating to these presented articles. However, the system does not have feedback information from the user for articles that have never been presented to the user, such as new articles that have just been added to the database, or old articles that the system chose not to present to the user.
  • the system has only received feedback on old flames, not on prospective new loves.
  • the evaluation of the likelihood of interest in a particular target object for a specific user can automatically be computed.
  • the interest that a given target object X holds for a user U is assumed to be a sum of two quantities: q(U, X), the intrinsic "quality" of X, plus f(U, X), the "topical interest” that users like U have in target objects like X.
  • the intrinsic quality measure q(U, X) is easily estimated at steps 1201-1203 directly from numeric attributes of the target object X.
  • the computation process begins at step 1201, where certain designated numeric attributes of target object X are specifically selected, which attributes by their very nature should be positively or negatively correlated with users' interest. Such attributes, termed "quality attributes," have the normative property that the higher (or in some cases lower) their value, the more interesting a user is expected to find them. Quality attributes of target object X may include, but are not limited to, target object X's popularity among users in general, the rating a particular reviewer has given target object X, the age (time since authorship, also known as outdatedness) of target object X, the number of vulgar words used in target object X, the price of target object X, and the amount of money that the company selling target object X has donated to the user's favorite charity.
  • each of the selected attributes is multiplied by a positive or negative weight indicative of the strength of user U's preference for those target objects that have high values for this attribute, which weight must be retrieved from a data file storing quality attribute weights for the selected user.
  • a weighted sum of the identified weighted selected attributes is computed to determine the intrinsic quality measure q(U, X).
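The quality computation of steps 1201-1203 can be sketched as a weighted sum. The attribute names and weight values in the usage note are hypothetical, and the function name is an assumption.

```python
def intrinsic_quality(quality_attributes, user_weights):
    """Weighted sum of a target object's quality attributes for one user.

    quality_attributes: dict attribute name -> numeric value for object X.
    user_weights: dict attribute name -> positive or negative weight for
    user U; attributes without a weight for this user are ignored.
    """
    return sum(
        user_weights[name] * value
        for name, value in quality_attributes.items()
        if name in user_weights
    )
```

For instance, an object with popularity 120, age 30, and 2 vulgar words, scored with weights 0.01, -0.02, and -0.5 respectively, receives 1.2 - 0.6 - 1.0 = -0.4.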
  • the summarized weighted relevance feedback data is retrieved, wherein some relevance feedback points are weighted more heavily than others and the stored relevance data can be summarized to some degree, for example by the use of search profile sets.
  • the more difficult part of determining user U's interest in target object X is to find or compute at step 1205 the value of f(U, X), which denotes the topical interest that users like U generally have in target objects like X.
  • the method of determining a user's interest relies on the following heuristic: when X and Y are similar target objects (have similar attributes), and U and V are similar users (have similar attributes), then topical interest f(U, X) is predicted to have a similar value to the value of topical interest f(V, Y).
  • the problem of estimating topical interest at all points becomes a problem of interpolating among these estimates of topical interest at selected points, such as the feedback estimate of f(V, Y) as r(V, Y) - q(V, Y).
  • This interpolation can be accomplished with any standard smoothing technique, using as input the known point estimates of the value of the topical interest function f(*, *), and determining as output a function that approximates the entire topical interest function f(*, *).
  • not all point estimates of the topical interest function f(*, *) should be given equal weight as inputs to the smoothing algorithm. Since passive relevance feedback is less reliable than active relevance feedback, point estimates made from passive relevance feedback should be weighted less heavily than point estimates made from active relevance feedback, or even not used at all. In most domains, a user's interests may change over time and, therefore, estimates of topical interest that derive from more recent feedback should also be weighted more heavily. A user's interests may vary according to mood, so estimates of topical interest that derive from the current session should be weighted more heavily for the duration of the current session, and past estimates of topical interest made at approximately the current time of day or on the current weekday should be weighted more heavily.
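One possible smoothing technique is a kernel-weighted average of the known point estimates. The text allows any standard smoothing method, so the Gaussian kernel, the `bandwidth` parameter, and the function name below are all assumptions chosen for illustration.

```python
import math


def estimate_topical_interest(query, point_estimates, distance, bandwidth=1.0):
    """Kernel-weighted average of known topical-interest estimates.

    query: profile of the (user, target object) pair to estimate.
    point_estimates: list of (pair_profile, f_value, reliability), where
    f_value is r(V, Y) - q(V, Y) and reliability is larger for active,
    recent, or same-session feedback.
    distance: function giving the distance between two pair profiles.
    """
    numerator = 0.0
    denominator = 0.0
    for profile, f_value, reliability in point_estimates:
        kernel = math.exp(-((distance(query, profile) / bandwidth) ** 2))
        numerator += reliability * kernel * f_value
        denominator += reliability * kernel
    return numerator / denominator if denominator > 0 else 0.0
```

Nearby point estimates dominate the result, and a reliability of zero removes a point entirely, which corresponds to not using unreliable passive feedback at all.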
  • target profile Y was created in 1990 to describe a particular investment that was available in 1990, and that was purchased in 1990 by user V
  • the system solicits relevance feedback from user V in the years 1990, 1991, 1992, 1993, 1994, 1995, etc., and treats these as successively stronger indications of user V's true interest in target profile Y, and thus as indications of user V's likely interest in new investments whose current profiles resemble the original 1990 investment profile Y.
  • the system tends to recommend additional investments when they have profiles like target profile Y, on the grounds that they too will turn out to be satisfactory in 4 to 5 years.
  • the method must therefore associate a weight with each attribute used in the profile of (user, target object) pairs, that is, with each attribute used to profile either users or target objects.
  • These weights specify the relative importance of the attributes in establishing similarity or difference, and therefore, in determining how topical interest is generalized from one (user, target object) pair to another. Additional weights determine which attributes of a target object contribute to the quality function q, and by how much. It is possible and often desirable for a filtering system to store a different set of weights for each user.
  • a user who thinks of two-star films as having materially different topic and style from four-star films wants to assign a high weight to "number of stars" for purposes of the similarity distance measure d(*, *); this means that interest in a two-star film does not necessarily signal interest in an otherwise similar four-star film, or vice-versa. If the user also agrees with the critics, and actually prefers four-star films, the user also wants to assign "number of stars" a high positive weight in the determination of the quality function q.
  • Attribute weights may be set or adjusted by the system administrator or the individual user, on either a temporary or a permanent basis. However, it is often desirable for the filtering system to learn attribute weights automatically, based on relevance feedback.
  • the optimal attribute weights for a user U are those that allow the most accurate prediction of user U's interests.
  • the difference $a_j - b_j$ is herein termed the "residue feedback $r_{res}(U, X_j)$ of user U on target object $X_j$."
  • (iii) Compute user U's error measure, $(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2 + \dots + (a_n - b_n)^2$.
  • a gradient-descent or other numerical optimization method may be used to adjust user U's attribute weights so that this error measure reaches a (local) minimum.
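The weight-learning loop can be sketched with a finite-difference gradient. Steps (i)-(ii) are not reproduced above, so this sketch makes a simplifying assumption: one quantity in each squared difference is an observed value and the other is a prediction produced by a caller-supplied function of the weights. The function names, learning rate, and step count are hypothetical.

```python
def error_measure(weights, predict, feedback):
    """Sum of squared differences between observed and predicted values.

    feedback: list of (target, observed) pairs; predict(weights, target)
    returns the value predicted under the given attribute weights.
    """
    return sum(
        (observed - predict(weights, target)) ** 2
        for target, observed in feedback
    )


def learn_weights(weights, predict, feedback, rate=0.01, steps=200, eps=1e-6):
    """Plain gradient descent using forward-difference gradients."""
    current = list(weights)
    for _ in range(steps):
        base = error_measure(current, predict, feedback)
        gradient = []
        for i in range(len(current)):
            bumped = current[:i] + [current[i] + eps] + current[i + 1:]
            bumped_error = error_measure(bumped, predict, feedback)
            gradient.append((bumped_error - base) / eps)
        current = [w - rate * g for w, g in zip(current, gradient)]
    return current
```

Because the error surface may have several minima, a run like this finds only a local minimum, and the result depends on the starting weights.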
  • This approach tends to work best if the smoothing technique used in estimation is such that the value of f(V, Y) is strongly affected by the point estimate r(V, Y) - q(V, Y) when the latter value is provided as input. Otherwise, the presence or absence of the single input feedback rating $r(U, X_j)$ in steps (i)-(ii) may not make $a_j$ and $b_j$ very different from each other.
  • a slight variation of this learning technique adjusts a single global set of attribute weights for all users, by adjusting the weights so as to minimize not a particular user's error measure but rather the total error measure of all users.
  • These global weights are used as a default initial setting for a new user who has not yet provided any feedback.
  • Gradient descent can then be employed to adjust this user's individual weights over time.
  • a useful quality attribute for a target object X is the average amount of residue feedback $r_{res}(V, X)$ from users on that target object, averaged over all users V who have provided relevance feedback on the target object.
  • residue feedback is never averaged indiscriminately over all users to form a new attribute, but instead is smoothed to consider users' similarity to each other.
  • the quality measure q(U, X) depends on the user U as well as the target object X, so that a given target object X may be perceived by different users to have different quality.
  • q(U, X) is calculated as a weighted sum of various quality attributes that are dependent only on X, but then an additional term is added, namely an estimate of $r_{res}(U, X)$ found by applying a smoothing algorithm to known values of $r_{res}(V, X)$.
  • V ranges over all users who have provided relevance feedback on target object X, and the smoothing algorithm is sensitive to the distances d(U, V) from each such user V to user U.
  • a method for defining the distance between any pair of target objects was disclosed above. Given this distance measure, it is simple to apply a standard clustering algorithm, such as k-means, to group the target objects into a number of clusters, in such a way that similar target objects tend to be grouped in the same cluster. It is clear that the resulting clusters can be used to improve the efficiency of matching buyers and sellers in the application described in section "Matching Buyers and Sellers" above: it is not necessary to compare every buy profile to every sell profile, but only to compare buy profiles and sell profiles that are similar enough to appear in the same cluster. As explained below, the results of the clustering procedure can also be used to make filtering more efficient, and in the service of querying and browsing tasks.
  • the k-means clustering method is familiar to those skilled in the art. Briefly put, it finds a grouping of points (target profiles, in this case, whose numeric coordinates are given by numeric decomposition of their attributes as described above) to minimize the distance between points in the clusters and the centers of the clusters in which they are located. This is done by alternating between assigning each point to the cluster which has the nearest center and then, once the points have been assigned, computing the (new) center of each cluster by averaging the coordinates of the points (target profiles) located in this cluster.
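The alternation described above can be sketched as follows. Euclidean distance stands in for the profile distance, the initial centers are supplied by the caller, and a fixed iteration count replaces a convergence test; all three are simplifying assumptions, as is the function name.

```python
import math


def kmeans(points, initial_centers, iterations=20):
    """Alternate between nearest-center assignment and center averaging.

    points: list of numeric vectors (decomposed target profiles).
    Returns (assignment, centers), where assignment[i] is the index of the
    cluster that holds points[i].
    """
    centers = [list(center) for center in initial_centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        assignment = [
            min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))
            for point in points
        ]
        # Update step: each center becomes the average of its member points.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers
```

The resulting centers are the averaged coordinates of the profiles in each cluster.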
  • Other clustering methods can be used, such as "soft" or "fuzzy" k-means clustering, in which objects are allowed to belong to more than one cluster. This can be cast as a clustering problem similar to the k-means problem, but now the criterion being optimized is a little different:
  • C ranges over cluster numbers
  • i ranges over target objects
  • $x_i$ is the numeric vector corresponding to the profile of target object number i
  • $\bar{x}_C$ is the mean of all the numeric vectors corresponding to target profiles of target objects in cluster number C, termed the "cluster profile" of cluster C
  • d(*, *) is the metric used to measure distance between two target profiles
  • $i_{iC}$ is a value between 0 and 1 that indicates how much target object number i is associated with cluster number C
  • in ordinary (hard) k-means clustering, $i_{iC}$ is either 0 or 1.
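The criterion itself is not reproduced in the text above. Assuming, from the symbol definitions, that it is the membership-weighted sum of distances from each target object to each cluster profile, it can be evaluated as in this sketch. The assumed form and the function name are not confirmed by the source.

```python
def clustering_criterion(points, cluster_profiles, membership, distance):
    """Membership-weighted sum of point-to-cluster-profile distances.

    membership[i][c] is the degree (0 to 1) to which target object i is
    associated with cluster c; in hard k-means each row holds a single 1.
    """
    return sum(
        membership[i][c] * distance(points[i], cluster_profiles[c])
        for i in range(len(points))
        for c in range(len(cluster_profiles))
    )
```

With hard memberships this reduces to the total distance from each object to the profile of its own cluster, which is the quantity that ordinary k-means drives down.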
  • 1) Association-based clustering, in which profiles contain only associative attributes, and thus distance is defined entirely by associations. This kind of clustering generally (a) clusters target objects based on the similarity of the users who like them or (b) clusters users based on the similarity of the target objects they like. In this approach, the system does not need any information about target objects or users, except for their history of interaction with each other.
  • 2) Content-based clustering, in which profiles contain only non-associative attributes. This kind of clustering (a) clusters target objects based on the similarity of their non-associative attributes (such as word frequencies) or (b) clusters users based on the similarity of their non-associative attributes (such as demographics and psychographics). In this approach, the system does not need to record any information about users' historical patterns of information access, but it does need information about the intrinsic properties of users and/or target objects.
  • Sequential hybrid method. First apply the k-means procedure to do 1a, so that articles are labeled by cluster based on which user read them, then use supervised clustering (maximum likelihood discriminant methods) using the word frequencies to do the process of method 2a described above. This tries to use knowledge of who read what to do a better job of clustering based on word frequencies.
  • Hierarchical clustering of target objects is often useful.
  • Hierarchical clustering produces a tree which divides the target objects first into two large clusters of roughly similar objects; each of these clusters is in turn divided into two or more smaller clusters, which in turn are each divided into yet smaller clusters until the collection of target objects has been entirely divided into "clusters" consisting of a single object each, as diagrammed in Figure 8.
  • the node d denotes a particular target object d, or equivalently, a single-member cluster consisting of this target object.
  • Target object d is a member of the cluster (a, b, d), which is a subset of the cluster (a, b, c, d, e, f), which in turn is a subset of all target objects.
  • the tree shown in Figure 8 would be produced from a set of target objects such as those shown geometrically in Figure 7. In Figure 7, each letter represents a target object, and axes x1 and x2 represent two of the many numeric attributes on which the target objects differ.
  • Such a cluster tree may be created by hand, using human judgment to form clusters and subclusters of similar objects, or may be created automatically in either of two standard ways: top-down or bottom-up.
  • in the top-down approach, the set of all target objects in Figure 7 would be divided into the clusters (a, b, c, d, e, f) and (g, h, i, j, k).
  • the clustering algorithm would then be reapplied to the target objects in each cluster, so that the cluster (g, h, i, j, k) is subpartitioned into the clusters (g, k) and (h, i, j), and so on to arrive at the tree shown in Figure 8.
  • in the bottom-up approach, the set of all target objects in Figure 7 would be grouped into numerous small clusters, namely (a, b), d, (c, f), e, (g, k), (h, i), and j. These clusters would then themselves be grouped into the larger clusters (a, b, d), (c, e, f), (g, k), and (h, i, j), according to their cluster profiles. These larger clusters would themselves be grouped into (a, b, c, d, e, f) and (g, k, h, i, j), and so on until all target objects had been grouped together, resulting in the tree of Figure 8.
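Bottom-up construction can be sketched by repeatedly merging the two clusters whose cluster profiles (means) are closest. Merging exactly two clusters per step, and using Euclidean distance, are simplifications of the grouping described above; the function names and the nested-tuple tree representation are assumptions.

```python
import math


def mean_vector(vectors):
    """Cluster profile: coordinate-wise average of the member vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def bottom_up_tree(objects):
    """Build a cluster tree as nested tuples of object names.

    objects: dict mapping object name -> numeric vector.
    """
    clusters = [(name, [vector]) for name, vector in objects.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(mean_vector(clusters[i][1]),
                              mean_vector(clusters[j][1]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for index, c in enumerate(clusters)
                    if index not in (i, j)] + [merged]
    return clusters[0][0]
```

This always yields a binary tree; a tree with wider branching, like the one in Figure 8, would need a step that merges several nearby clusters at once.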
  • the target profile of a single Woody Allen film would assign "Woody-Allen” a score of 1 in the "name-of-director” field, while giving "Federico-Fellini” and "Terence-Davies” scores of 0.
  • a cluster that consisted of 20 films directed by Allen and 5 directed by Fellini would be profiled with scores of 0.8, 0.2, and 0 respectively, because, for example, 0.8 is the average of 20 ones and 5 zeros.
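The averaging in this example can be sketched over sparse profiles, where a missing score counts as zero. The function name is hypothetical.

```python
def cluster_profile(member_profiles):
    """Average the sparse score dictionaries of a cluster's members."""
    names = set().union(*member_profiles)
    count = len(member_profiles)
    return {
        name: sum(profile.get(name, 0.0) for profile in member_profiles) / count
        for name in names
    }
```

For 20 profiles that score "Woody-Allen" as 1 and 5 that score "Federico-Fellini" as 1, this gives 0.8 and 0.2; "Terence-Davies" never appears, so its score of 0 is implied rather than stored.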
  • a hierarchical cluster tree of target objects makes it possible for the system to search efficiently for target objects with target profiles similar to P. It is only necessary to navigate through the tree, automatically, in search of such target profiles.
  • the system for customized electronic identification of desirable objects begins by considering the largest, top-level clusters, and selects the cluster whose profile is most similar to target profile P. In the event of a near-tie, multiple clusters may be selected. Next, the system considers all subclusters of the selected clusters, and this time selects the subcluster or subclusters whose profiles are closest to target profile P.
  • step 13A03 the list of target objects is returned.
  • step 13B00 the variable I is set to 1 and, for each child subtree $T_i$ of the root of tree T, the cluster profile $p_i$ is retrieved.
  • step 13B02 calculate $d(P, p_i)$, the similarity distance between P and $p_i$,
  • step 13B03 if $d(P, p_i) < t$, a threshold, branch to one of two options
  • step 13B04 add that target object to list of identified target objects at step 13B05 and advance to step 13B07.
  • step 13B04 scan the ith child subtree for target objects similar to P by invoking the steps of the process of Figure 13B recursively and then recurse to step 3 (step 13A01 in Figure 13 A) with T bound for the duration ofthe recursion to tree Ti, in order to search in tree Ti for target objects with profiles similar to P.
  • In step 5 of this pseudo-code, smaller thresholds are typically used at lower levels of the tree, for example by making the threshold an affine function or other function of the cluster variance or cluster diameter of the cluster p_i.
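The recursive search described in these steps can be sketched roughly as below. This is an illustrative sketch under several assumptions: profiles are plain numeric tuples, the similarity distance d is taken to be Euclidean distance, and the `Node` class, the `threshold` coefficients `a` and `b`, and the sample tree are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    profile: tuple                      # cluster profile (or target profile for a leaf)
    diameter: float = 0.0               # cluster diameter; 0 for a single target object
    children: list = field(default_factory=list)
    target: Optional[str] = None        # set only when the node is a single target object

def threshold(node, a=1.0, b=0.5):
    # Assumed affine function of the cluster diameter, so that smaller
    # clusters lower in the tree get smaller thresholds.
    return a * node.diameter + b

def search(tree, P, found=None):
    """Collect target objects whose profiles are within the threshold of P."""
    if found is None:
        found = []
    for child in tree.children:
        if math.dist(P, child.profile) < threshold(child):
            if child.target is not None:
                found.append(child.target)      # single target object: add to list
            else:
                search(child, P, found)         # multiple objects: recurse into subtree
    return found

# Hypothetical two-cluster tree.
ab = Node((0.5, 0.0), diameter=1.0,
          children=[Node((0.0, 0.0), target="a"), Node((1.0, 0.0), target="b")])
gk = Node((5.2, 5.0), diameter=0.4,
          children=[Node((5.0, 5.0), target="g"), Node((5.4, 5.0), target="k")])
root = Node((2.85, 2.5), diameter=7.4, children=[ab, gk])

matches = search(root, (0.1, 0.0))
print(matches)
```

Only the (a, b) cluster passes the threshold test for this query, so the (g, k) subtree is never visited; that pruning is what makes the tree search cheaper than comparing P against every target profile.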
  • this process may be executed in distributed fashion as follows: steps 3-7 are executed by the server that stores the root node of hierarchical cluster tree T, and the recursion in step 7 to a subcluster tree T_i involves the transmission of a search request to the server that stores the root node of tree T_i, which server carries out the recursive step upon receipt of this request.
  • Steps 1-2 are carried out by the processor that initiates the search, and the server that executes step 6 must send a message identifying the target object to this initiating processor, which adds it to the list.
  • a standard back-propagation neural net is one such method: it should be trained to take the attributes of a target object as input, and produce as output a unique pattern that can be used to identify the appropriate low-level cluster. For maximum accuracy, low-level clusters that are similar to each other (close together in the cluster tree) should be given similar identifying patterns.
  • Another approach is a standard decision tree that considers the attributes of target profile P one at a time until it can identify the appropriate cluster. If profiles are large, this may be more rapid than considering all attributes.
  • a hybrid approach to searching uses distance measurements as described above to navigate through the top few levels of the hierarchical cluster tree, until it reaches a cluster of intermediate size whose profile is similar to target profile P, and then continues by using a decision tree specialized to search for low-level subclusters of that intermediate cluster.
  • One use of these searching techniques is to search for target objects that match a search profile from a user's search profile set. This form of searching is used repeatedly in the news clipping service, active navigation, and Virtual Community Service applications, described below.
  • Another use is to add a new target object quickly to the cluster tree. An existing cluster that is similar to the new target object can be located rapidly, and the new target object can be added to this cluster. If the object is beyond a certain threshold distance from the cluster center, then it is advisable to start a new cluster.
  • This incremental clustering scheme can be built using variants of subroutines available in advanced statistical packages. Note that various methods can be used to locate the new target objects that must be added to the cluster tree, depending on the architecture used.
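A minimal sketch of the incremental step follows, assuming numeric profiles, Euclidean distance, and a flat list of clusters scanned linearly (in the scheme described above, the nearest cluster would instead be located by searching the cluster tree). The function name `add_target`, the dictionary layout, and the `max_dist` value are hypothetical.

```python
import math

def add_target(clusters, name, profile, max_dist):
    """Add a target object to the nearest cluster, or start a new cluster
    if the object lies beyond max_dist from every existing cluster center."""
    best = min(clusters, key=lambda c: math.dist(c["center"], profile), default=None)
    if best is None or math.dist(best["center"], profile) > max_dist:
        new_cluster = {"center": tuple(profile), "members": [name]}
        clusters.append(new_cluster)
        return new_cluster
    n = len(best["members"])
    # Update the cluster center as the running mean of its members' profiles.
    best["center"] = tuple((c * n + x) / (n + 1) for c, x in zip(best["center"], profile))
    best["members"].append(name)
    return best

clusters = []
add_target(clusters, "a", (0.0, 0.0), max_dist=1.0)   # first object starts a cluster
add_target(clusters, "b", (0.2, 0.0), max_dist=1.0)   # close to "a": joins its cluster
add_target(clusters, "g", (5.0, 5.0), max_dist=1.0)   # too far away: starts a new cluster
print([c["members"] for c in clusters])
```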
  • a "webcrawler" program running on a central computer periodically scans all servers in search of new target objects, calculates the target profiles of these objects, and adds them to the hierarchical cluster tree by the above method.
  • a software "agent" at that server calculates the target profile and adds it to the hierarchical cluster tree by the above method.
Rapid Profiling
  • target objects are wallpaper patterns
  • an attribute such as “genre” (a single textual term such as “Art-Deco,” “Children's,” “Rustic,” etc.) may be a matter of judgment and opinion, difficult to determine except by consulting a human. More significantly, if each wallpaper pattern has an associative attribute that records the positive or negative relevance feedback to that pattern from various human users (consumers), then all the association scores of any newly introduced pattern are initially zero, so that it is initially unclear what other patterns are similar to the new pattern with respect to the users who like them.
  • the system can in principle determine the genre of a wallpaper pattern by consulting one or more randomly chosen individuals from a set of known human experts, while to determine the numeric association score between a new wallpaper pattern and a particular user, it can in principle show the pattern to that user and obtain relevance feedback.
  • Rapid profiling is a method for selecting those numeric attributes that are most important to determine. (Recall that all attributes can be decomposed into numeric attributes, such as association scores or term scores.)
  • a set of existing target objects that already have complete or largely complete profiles are clustered using a k-means algorithm.
  • each of the resulting clusters is assigned a unique identifying number, and each clustered target object is labeled with the identifying number of its cluster.
  • Standard methods then allow construction of a single decision tree that can determine any target object's cluster number, with substantial accuracy, by considering the attributes of the target object, one at a time. Only attributes that can if necessary be determined for any new target object are used in the construction of this decision tree.
  • the decision tree is traversed downward from its root as far as is desired. The root of the decision tree considers some attribute of the target object.
  • If this attribute is not yet known, it is determined by a method appropriate to that attribute; for example, if the attribute is the association score of the target object with user #4589, then relevance feedback (to be used as the value of this attribute) is solicited from user #4589, perhaps by the ruse of adding the possibly uninteresting target object to a set of objects that the system recommends to the user's attention, in order to find out what the user thinks of it.
  • the rapid profiling method descends the decision tree by one level, choosing one of the decision subtrees of the root in accordance with the determined value of the root attribute.
  • the root of this chosen subtree considers another attribute of the target object, whose value is likewise determined by an appropriate method.
  • the process can be repeated to determine as many attributes as desired, by whatever methods are available, although it is ordinarily stopped after a small number of attributes, to avoid the burden of determining too many attributes.
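The descent through the decision tree, with attributes determined only when the tree asks for them, might look like the sketch below. It assumes a binary tree over numeric attributes that has already been built from the k-means cluster labels; the nested-dictionary tree layout, the attribute names, and the `determine` callback are hypothetical stand-ins for whatever method (expert consultation, relevance feedback) is appropriate to each attribute.

```python
def rapid_profile(tree, known, determine, max_questions=3):
    """Walk the decision tree from its root, determining unknown attributes on demand.

    `known` maps attribute names to already-determined numeric values and is
    extended in place. `determine` is called only for attributes the tree
    actually consults. Traversal stops at a leaf or after `max_questions`
    newly determined attributes.
    """
    node, asked = tree, 0
    while "cluster" not in node and asked < max_questions:
        attr = node["attribute"]
        if attr not in known:
            known[attr] = determine(attr)   # e.g. solicit feedback from a user
            asked += 1
        node = node["high"] if known[attr] >= node["threshold"] else node["low"]
    return node.get("cluster"), known       # cluster is None if stopped early

# Hypothetical tree over two numeric attributes of a wallpaper pattern.
tree = {
    "attribute": "feedback_user_4589", "threshold": 0.5,
    "low": {"cluster": 1},
    "high": {
        "attribute": "genre_is_art_deco", "threshold": 0.5,
        "low": {"cluster": 2},
        "high": {"cluster": 3},
    },
}
answers = {"feedback_user_4589": 1.0, "genre_is_art_deco": 0.0, "never_asked": 9.9}
cluster, known = rapid_profile(tree, {}, answers.__getitem__)
print(cluster, sorted(known))
```

Only the attributes on the path actually taken are determined, which is the point of the method: the third attribute in `answers` is never requested.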
  • the rapid profiling method can be used to identify important attributes in any sort of profile, and not just profiles of target objects.
  • the disclosed method for determining topical interest through similarity requires users as well as target objects to have profiles.
  • New users may be profiled or partially profiled through the rapid profiling process.
  • the rapid profiling procedure can rapidly form a rough characterization of a new user's interests by soliciting the user's feedback on a small number of significant target objects, and perhaps also by determining a small number of other key attributes of the new user, by on-line queries, telephone surveys, or other means.
  • each user's user profile is subdivided into a set of long-term attributes, such as demographic characteristics, and a set of short-term attributes that help to identify the user's temporary desires and emotional state, such as the user's textual or multiple-choice answers to questions whose answers reflect the user's mood.
  • a subset of the user's long-term attributes are determined when the user first registers with the system, through the use of a rapid profiling tree of long-term attributes.
  • a subset of the user's short-term attributes are additionally determined, through the use of a separate rapid profiling tree that asks about short-term attributes.
Market Research
  • A technique similar to rapid profiling is of interest in market research (or voter research).
  • the target objects are consumers.
  • a particular attribute in each target profile indicates whether the consumer described by that target profile has purchased product X.
  • a decision tree can be built that attempts to determine what value a consumer has for this attribute, by consideration of the other attributes in the consumer's profile. This decision tree may be traversed to determine whether additional users are likely to purchase product X.
  • the top few levels of the decision tree provide information, valuable to advertisers who are planning mass-market or direct-mail campaigns, about the most significant characteristics of consumers of product X. Similar information can alternatively be extracted from a collection of consumer profiles without recourse to a decision tree, by considering attributes one at a time, and identifying those attributes on which product X's consumers differ significantly from its non-consumers. These techniques serve to characterize consumers of a particular product; they can be equally well applied to voter research or other survey research, where the objective is to characterize those individuals from a given set of surveyed individuals who favor a particular candidate, hold a particular opinion, belong to a particular demographic group, or have some other set of distinguishing attributes.
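The attribute-at-a-time alternative to a decision tree can be illustrated as follows. This is a rough sketch: it ranks attributes by the difference in mean value between consumers and non-consumers, which is only a simple proxy for "differ significantly" (no statistical test is applied). The function name, attribute names, and sample profiles are invented for the example.

```python
from statistics import mean

def distinguishing_attributes(profiles, target_attr):
    """Rank attributes by how much their mean differs between profiles
    with target_attr == 1 (consumers) and target_attr == 0 (non-consumers)."""
    buyers = [p for p in profiles if p[target_attr] == 1]
    others = [p for p in profiles if p[target_attr] == 0]
    gaps = {}
    for attr in profiles[0]:
        if attr == target_attr:
            continue
        gaps[attr] = mean([p[attr] for p in buyers]) - mean([p[attr] for p in others])
    # Largest absolute gap first.
    return sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical consumer profiles with 0/1 attributes.
profiles = [
    {"bought_X": 1, "owns_pet": 1, "urban": 1},
    {"bought_X": 1, "owns_pet": 1, "urban": 0},
    {"bought_X": 0, "owns_pet": 0, "urban": 1},
    {"bought_X": 0, "owns_pet": 0, "urban": 0},
]
ranked = distinguishing_attributes(profiles, "bought_X")
print(ranked)
```

In this toy data set, pet ownership separates buyers from non-buyers completely while urban residence does not separate them at all, so the former is ranked first.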
  • researchers may wish to purchase batches of analyzed or unanalyzed user profiles from which personal identifying information has been removed.
  • statistical conclusions can be drawn, and relationships between attributes can be elucidated using knowledge discovery techniques which are well known in the art.
  • Figure 1 illustrates in block diagram form the overall architecture of an electronic media system, known in the art, in which the system for customized electronic identification of desirable objects of the present invention can be used to provide user customized access to target objects that are available via the electronic media system.
  • the electronic media system comprises a data communication facility that interconnects a plurality of users with a number of information servers.
  • the users are typically individuals, whose personal computers (terminals) T1-Tn are connected via a data communications link, such as a modem and a telephone connection established in well-known fashion, to a telecommunication network N.
  • User information access software is resident on the user's personal computer and serves to communicate over the data communications link and the telecommunication network N with one of the plurality of network vendors V1-Vk (America Online, Prodigy, CompuServe, other private companies or even universities) who provide data interconnection service with selected ones of the information servers I1-Im.
  • the user can, by use of the user information access software, interact with the information servers I1-Im to request and obtain access to data that resides on mass storage systems SS1-SSm that are part of the information server apparatus. New data is input to this system by users via their personal computers T1-Tn and by commercial information services by populating their mass storage systems SS1-SSm with commercial data.
  • Each user terminal T1-Tn and the information servers I1-Im have phone numbers or IP addresses on the network N which enable a data communication link to be established between a particular user terminal T1-Tn and the selected information server I1-Im.
  • a user's electronic mail address also uniquely identifies the user and the user's network vendor V1-Vk in an industry-standard format such as: username@aol.com or username@netcom.com.
  • the network vendors V1-Vk provide access passwords for their subscribers (selected users), through which the users can access the information servers I1-Im.
  • the subscribers pay the network vendors V1-Vk for the access services on a fee schedule that typically includes a monthly subscription fee and usage based charges.
  • the information is comprised of individual "files," which can contain audio data, video data, graphics data, text data, structured database data and combinations thereof.
  • each target object is associated with a unique file: for target objects that are informational in nature and can be digitally represented, the file directly stores the informational content of the target object, while for target objects that are not stored electronically, such as purchasable goods, the file contains an identifying description of the target object.
  • Target objects stored electronically as text files can include commercially provided news articles, published documents, letters, user-generated documents, descriptions of physical objects, or combinations of these classes of data.
  • the organization of the files containing the information and the native format of the data contained in files of the same conceptual type may vary by information server I1-Im.
  • a user can have difficulty in locating files that contain the desired information, because the information may be contained in files whose information server cataloging may not enable the user to locate them.
  • a user therefore does not have simple access to information but must expend a significant amount of time and energy to excerpt a segment of the information that may be relevant to the user from the plethora of information that is generated and populated on this system. Even if the user commits the necessary resources to this task, existing information retrieval processes lack the accuracy and efficiency to ensure that the user obtains the desired information.
  • the three modules of the system for customized electronic identification of desirable objects can be implemented in a distributed manner, even with various modules being implemented on and/or by different vendors within the electronic media system.
  • the information servers I1-Im can include the target profile generation module while the network vendors V1-Vk may implement the user profile generation module, the target profile interest summary generation module, and/or the profile processing module.
  • a module can itself be implemented in a distributed manner, with numerous nodes being present in the network N, each node serving a population of users in a particular geographic area. The totality of these nodes comprises the functionality of the particular module.
  • the vendors V1-Vk may be augmented with some number of proxy servers, which provide a mechanism for ongoing pseudonymous access and profile building through the method described herein. At least one trusted validation server must be in place to administer the creation of pseudonyms in the system.
  • the various processors interconnected by the data communication network N as shown in Figure 1 can be divided into two classes and grouped as illustrated in Figure 2: clients and servers.
  • the clients C1-Cn are individual users' computer systems which are connected to servers S1-S5 at various times via data communications links. Each of the clients Ci is typically associated with a single server Sj, but these associations can change over time.
  • the clients C1-Cn both interface with users and produce and retrieve files to and from servers.
  • the clients C1-Cn are not necessarily continuously on-line, since they typically serve a single user and can be movable systems, such as laptop computers, which can be connected to the data communications network N at any of a number of locations.
  • a server Si is a computer system that is presumed to be continuously on-line and functions both to collect files from various sources on the data communication network N for access by local clients C1-Cn and to collect files from local clients C1-Cn for access by remote clients.
  • the server Si is equipped with persistent storage, such as a magnetic disk data storage medium, and is interconnected with other servers via data communications links.
  • the data communications links can be of arbitrary topology and architecture, and are described herein for the purpose of simplicity as point-to-point links or, more precisely, as virtual point-to-point links.
  • the servers S1-S5 comprise the network vendors V1-Vk as well as the information servers I1-Im of Figure 1, and the functions performed by these two classes of modules can be merged to a greater or lesser extent in a single server Si or distributed over a number of servers in the data communication network N.
  • Figure 3 illustrates in block diagram form a representation of an arbitrarily selected network topology for a plurality of servers A-D, each of which is interconnected to at least one other server and typically also to a plurality of clients p-s.
  • Servers A-D are interconnected by a collection of point to point data communications links, and server A is connected to client r, server B is connected to clients p-q, while server D is connected to client s.
  • Servers transmit encrypted or unencrypted messages amongst themselves: a message typically contains the textual and/or graphic information stored in a particular file, and also contains data which describe the type and origin of this file, the name of the server that is supposed to receive the message, and the purpose for which the file contents are being transmitted. Some messages are not associated with any file, but are sent by one server to other servers for control reasons, for example to request transmission of a file or to announce the availability of a new file.
  • Messages can be forwarded by a server to another server, as in the case where server A transmits a message to server D via a relay node of either server C or servers B, C. It is generally preferable to have multiple paths through the network, with each path being characterized by its performance capability and cost to enable the network N to optimize traffic routing.
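To illustrate how per-path cost could drive routing among the servers of Figure 3, here is a small sketch using Dijkstra's shortest-path algorithm. The text does not prescribe a routing algorithm, so this choice is an assumption, as are the link costs; the link set A-B, A-C, B-C, C-D is inferred only from the relay example above (A reaches D via C, or via B and C).

```python
import heapq

def cheapest_path(links, src, dst):
    """Return (total_cost, path) for the lowest-cost route, or None if unreachable."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, link_cost in links.get(node, {}).items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical symmetric link costs between servers A-D.
links = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"C": 1},
}
route = cheapest_path(links, "A", "D")
print(route)
```

With these made-up costs the two-relay route through B and C is cheaper than the direct relay through C, which shows why having multiple paths with known costs gives the network room to optimize.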
  • a buyer may desire to be targeted for certain mailings that describe products that are related to his or her interests, and a seller may desire to target users who are predicted to be interested in the goods and services that the seller provides.
  • the usefulness of the technology described herein is contingent upon the ability of the system to collect and compare data about many users and many target objects.
  • a compromise between total user anonymity and total public disclosure of the user's search profiles or target profile interest summary is a pseudonym.
  • a pseudonym is an artifact that allows a service provider to communicate with users and build and accumulate records of their preferences over time, while at the same time remaining unaware of the users' true identities, so that users can keep their purchases or preferences private.
  • a second and equally important requirement of a pseudonym system is that it provide for digital credentials, which are used to guarantee that the user represented by a particular pseudonym has certain properties. These credentials may be granted on the basis of the results of activities and transactions conducted by means of the system for customized electronic identification of desirable objects, on the basis of other activities and transactions conducted on the network N of the present system, or on the basis of users' activities outside of network N.
  • a service provider may require proof that the purchaser has sufficient funds on deposit at his/her bank, which might possibly not be on a network, before agreeing to transact business with that user. The user, therefore, must provide the service provider with proof of funds (a credential) from the bank, while still not disclosing the user's true identity to the service provider.
  • Any server in the network N may be configured to act as a proxy server in addition to its other functions.
  • Each proxy server provides service to a set of users, which set is termed the "user base" of that proxy server.
  • a given proxy server provides three sorts of service to each user U in its user base, as follows:
  • the first function of the proxy server is to bidirectionally transfer communications between user U and other entities such as information servers
  • the proxy server communicates with server S (and thence with user U), either through anonymizing mix paths that obscure the identity of server S and user U, in which case the proxy server knows user U only through a secure pseudonym, or else through a conventional virtual point-to-point connection, in which case the proxy server knows user U by user U's address at server S, which address may be regarded as a non-secure pseudonym for user U.
  • a second function of the proxy server is to record user-specific information associated with user U.
  • This user-specific information includes a user profile and target profile interest summary for user U, as well as a list of access control instructions specified by user U, as described below, and a set of one-time return addresses provided by user U that can be used to send messages to user U without knowing user U's true identity. All of this user-specific information is stored in a database that is keyed by user U's pseudonym (whether secure or non-secure) on the proxy server.
  • a third function of the proxy server is to act as a selective forwarding agent for unsolicited communications that are addressed to user U: the proxy server forwards some such communications to user U and rejects others, in accordance with the access control instructions specified by user U.
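The second and third proxy-server functions (a per-pseudonym record store and selective forwarding of unsolicited messages) can be sketched as below. This is a deliberately simplified illustration: the access control instructions are reduced to a set of blocked senders, which is an assumption, and the class names, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class UserRecord:
    user_profile: dict = field(default_factory=dict)
    interest_summary: dict = field(default_factory=dict)
    blocked_senders: set = field(default_factory=set)      # stand-in for access control instructions
    return_addresses: list = field(default_factory=list)   # one-time return addresses

class ProxyServer:
    def __init__(self):
        self.db = {}   # user-specific information, keyed by pseudonym

    def register(self, pseudonym, record):
        self.db[pseudonym] = record

    def forward_unsolicited(self, pseudonym, sender, message):
        """Forward a message to the pseudonym holder, or reject it (return None)."""
        record = self.db.get(pseudonym)
        if record is None or sender in record.blocked_senders:
            return None
        if not record.return_addresses:
            return None                            # no way left to reach the user
        address = record.return_addresses.pop(0)   # each return address is used once
        return address, message

proxy = ProxyServer()
proxy.register("P1", UserRecord(blocked_senders={"spammer"},
                                return_addresses=["addr-1", "addr-2"]))
delivered = proxy.forward_unsolicited("P1", "seller", "new wallpaper catalog")
rejected = proxy.forward_unsolicited("P1", "spammer", "junk")
print(delivered, rejected)
```

The proxy never needs the user's true identity: everything it stores and every delivery it makes is addressed through the pseudonym and the one-time return addresses.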
  • Our combined method allows a given user to use either a single pseudonym in all transactions where he or she wishes to remain pseudonymous, or else different pseudonyms for different types of transactions. In the latter case, each service provider might transact with the user under a different pseudonym for the user.
  • a coalition of service providers, all of whom match users with the same genre of target objects, might agree to transact with the user using a common pseudonym, so that the target profile interest summary associated with that pseudonym would be complete with respect to said genre of target objects.
  • the user may freely choose a proxy server to service each pseudonym; these proxy servers may be the same or different. From the service provider's perspective, our system provides security, in that it can guarantee that users of a service are legitimately entitled to the services used and that no user is using multiple pseudonyms to communicate with the same provider.
  • This uniqueness of pseudonyms is important for the purposes of this application, since the transaction information gathered for a given individual must represent a complete and consistent picture of a single user's activities with respect to a given service provider or coalition of service providers; otherwise, a user's target profile interest summary and user profile would not be able to represent the user's interests to other parties as completely and accurately as possible.
  • the service provider must have a means of protection from users who violate previously agreed upon terms of service. For example, if a user that uses a given pseudonym engages in activities that violate the terms of service, then the service provider should be able to take action against the user, such as denying the user service and blacklisting the user from transactions with other parties that the user might be tempted to defraud.
  • If the issuer of a resolution credential refuses to grant this resolution credential to the user, then the refusal may be appealed to an adjudicating third party.
  • the integrity of the user profiles and target profile interest summaries stored on proxy servers is important: if a seller relies on such user-specific information to deliver promotional offers or other material to a particular class of users, but not to other users, then the user-specific information must be accurate and untampered with in any way.
  • the user may likewise wish to ensure that other parties not tamper with the user's user profile and target profile interest summary, since such modification could degrade the system's ability to match the user with the most appropriate target objects.
  • Each pseudonym is paired with a public cryptographic key and a private cryptographic key, where the private key is known only to the user who holds that pseudonym; when the user sends a control message to a proxy server under a given pseudonym, the proxy server uses the pseudonym's public key to verify that the message has been digitally signed by someone who knows the pseudonym's private key. This prevents other parties from masquerading as the user.
  • Our approach, as disclosed in this application, provides an improvement over the prior art in privacy-protected pseudonymy for network subscribers, such as that taught in U.S. Patent 5,245,656, which provides for a name translator station to act as an intermediary between a service provider and the user.
  • Although U.S. Patent 5,245,656 provides that the information transmitted between the end user U and the service provider be doubly encrypted, the fact that a relationship exists between user U and the service provider is known to the name translator, and this fact could be used to compromise user U, for example if the service provider specializes in the provision of content that is not deemed acceptable by user U's peers.
  • U.S. Patent 5,245,656 also omits a method for the convenient updating of pseudonymous user profile information, such as is provided in this application, and does not provide for assurance of unique and credentialed registration of pseudonyms from a credentialing agent as is also provided in this application, and does not provide a means of access control to the user based on profile information and conditional access as will be subsequently described.
  • the method described by Loeb et al. also does not describe any provision for credentials, such as might be used for authenticating a user's right to access particular target objects, such as target objects that are intended to be available only upon payment of a subscription fee, or target objects that are intended to be unavailable to younger users.
  • the user employs as an intermediary any one of a number of proxy servers available on the data communication network N of Figure 2 (for example, server S2).
  • the proxy servers function to disguise the true identity of the user from other parties on the data communication network N.
  • the proxy server represents a given user to either single network vendors and information servers or coalitions thereof.
  • a proxy server (e.g. S2) is a server computer with CPU, main memory, secondary disk storage and network communication function and with a database function which retrieves the target profile interest summary and access control instructions associated with a particular pseudonym P, which represents a particular user U, and performs bi-directional routing of commands, target objects and billing information between the user at a given client (e.g. C3) and other network entities such as network vendors V1-Vk and information servers I1-Im.
  • Each proxy server maintains an encrypted target profile interest summary associated with each allocated pseudonym in its pseudonym database D.
  • the actual user-specific information and the associated pseudonyms need not be stored locally on the proxy server, but may alternatively be stored in a distributed fashion and be remotely addressable from the proxy server via point-to-point connections.
  • the proxy server supports two types of bi-directional connections: point-to-point connections and pseudonymous connections through mix paths, as taught by D. Chaum in the paper titled "Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms", Communications of the ACM, Volume 24, Number 2, February 1981.
  • the normal connections between the proxy server and information servers, for example a connection between proxy server S2 and information server S4 in Figure 2, are accomplished through the point-to-point connection protocols provided by network N as described in the "Electronic Media System Architecture" section of this application.
  • the normal type of point-to-point connections may be used between S2-S4, for example, since the dissociation of the user and the pseudonym need only occur between the client C3 and the proxy server S2, where the pseudonym used by the user is available. Knowing that an information provider such as S4 communicates with a given pseudonym P on proxy server S2 does not compromise the true identity of user U.
  • the bidirectional connection between the user and the proxy server S2 can also be a normal point-to-point connection, but it may instead be made anonymous and secure, if the user desires, through the consistent use of an anonymizing mix protocol as taught by D.
  • credentials, which represent facts about a pseudonym that an organization is willing to certify, can be granted to a particular pseudonym, and transferred to other pseudonyms that the same user employs.
  • the user can use different pseudonyms with different organizations (or disjoint sets of organizations), yet still present credentials that were granted by one organization, under one pseudonym, in order to transact with another organization under another pseudonym, without revealing that the two pseudonyms correspond to the same user.
  • Credentials may be granted to provide assurances regarding the pseudonym bearer's age, financial status, legal status, and the like.
  • credentials signifying "legal adult" may be issued to a pseudonym based on information known about the corresponding user by the given issuing organization. Then, when the credential is transferred to another pseudonym that represents the user to another disjoint organization, presentation of this credential on the other pseudonym can be taken as proof of legal adulthood, which might satisfy a condition of terms of service.
  • Credential-issuing organizations may also certify particular facts about a user's demographic profile or target profile interest summary, for example by granting a credential that asserts "the bearer of this pseudonym is either well-read or is middle-aged and works for a large company"; by presenting this credential to another entity, the user can prove eligibility for (say) a discount without revealing the user's personal data to that entity.
  • the method taught by Chaum provides for assurances that no individual may correspond with a given organization or coalition of organizations using more than one pseudonym; that credentials may not be feasibly forged by the user; and that credentials may not be transferred from one user's pseudonym to a different user's pseudonym.
  • the method provides for expiration of credentials and for the issuance of "black marks" against individuals who do not act according to the terms of service that they are extended. This is done through the resolution credential mechanism as described in Chaum's work, in which resolutions are issued periodically by organizations to pseudonyms that are in good standing.
  • a pseudonym is a data record consisting of two fields.
  • the first field specifies the address of the proxy server at which the pseudonym is registered.
  • the second field contains a unique string of bits (e.g., a random binary number) that is associated with a particular user; credentials take the form of public-key digital signatures computed on this number, and the number itself is issued by a pseudonym administering server Z, as depicted in Figure 2, and detailed in a generic form in the paper by D. Chaum and J.H. Evertse titled "A secure and privacy-protecting protocol for transmitting personal information between organizations".
  • It is possible to send information to the user holding a given pseudonym, by enveloping the information in a control message that specifies the pseudonym and is addressed to the proxy server that is named in the first field of the pseudonym; the proxy server may forward the information to the user upon receipt of the control message.
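The two-field pseudonym record and the enveloping of information in a control message can be pictured with the following sketch. The class and field names, the proxy address string, and the dictionary layout of the control message are all hypothetical; in particular, the random bits are generated locally here only as a stand-in for the number that, per the text, is issued by the pseudonym administering server Z.

```python
import secrets
from dataclasses import dataclass

@dataclass(frozen=True)
class Pseudonym:
    proxy_address: str   # first field: proxy server where the pseudonym is registered
    bits: bytes          # second field: unique bit string associated with the user

def make_control_message(pseudonym, payload):
    """Envelope a payload so that it is addressed to the pseudonym's proxy server."""
    return {
        "to": pseudonym.proxy_address,       # taken from the pseudonym's first field
        "pseudonym": pseudonym.bits.hex(),   # lets the proxy look up the user record
        "payload": payload,
    }

# Stand-in for a number issued by server Z (assumption for the example).
pseudonym = Pseudonym(proxy_address="proxy-S2.example", bits=secrets.token_bytes(16))
message = make_control_message(pseudonym, "promotional offer")
print(message["to"])
```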
  • All of the user's transactions with a given coalition can be linked by virtue of the fact that they are conducted under the same pseudonym, and therefore can be combined to define a unified picture, in the form of a user profile and a target profile interest summary, of the user's interests vis-a-vis the service or services provided by said coalition.
  • a pseudonym may be useful, and the present description is in no way intended to limit the scope of the claimed invention. For example, the previously described rapid profiling tree could be used to pseudonymously acquire information about the user that the user considers sensitive, such as information of interest to insurance companies, medical specialists, family counselors or dating services.
  • The organizations that the user U interacts with are the servers S1-Sn on the network N.
  • The user employs a proxy server, e.g. S2, as an intermediary between the local server of the user's own client and the information provider or network vendor.
  • Let S(M,K) represent the digital signing of message M by modular exponentiation with key K, as detailed in the paper by Rivest, R.L., Shamir, A., and Adleman, L., titled "A method for obtaining digital signatures and public-key cryptosystems", Comm. ACM 21, 2 (Feb. 1978), 120-126.
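Signing by modular exponentiation can be illustrated with a toy example. The sketch below is textbook RSA with deliberately tiny primes and no hashing or padding; the key values and function names are assumptions chosen for readability and are not suitable for real use.

```python
def make_toy_keypair():
    """Return (public, private) toy RSA keys as (exponent, modulus) pairs.

    The primes are deliberately tiny so the arithmetic is easy to follow;
    real keys use primes hundreds of digits long.
    """
    p, q = 61, 53
    n = p * q                    # modulus, 3233
    phi = (p - 1) * (q - 1)      # 3120
    e = 17                       # public exponent, coprime to phi
    d = pow(e, -1, phi)          # private exponent (needs Python 3.8+)
    return (e, n), (d, n)


def sign(message: int, private_key) -> int:
    """S(M, K): raise M to the private exponent modulo n."""
    d, n = private_key
    return pow(message, d, n)


def verify(message: int, signature: int, public_key) -> bool:
    """Check a signature by raising it to the public exponent modulo n."""
    e, n = public_key
    return pow(signature, e, n) == message % n


public_key, private_key = make_toy_keypair()
M = 1234                         # message encoded as an integer below n
signature = sign(M, private_key)
```

Anyone holding the public key can confirm that `signature` was produced with the matching private key, which is the property the pseudonym and credential mechanisms rely on.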
  • The user applies to server Z for a pseudonym P and is granted a signed pseudonym signed with the private key SK Z of server Z
  • The following protocol takes place to establish an entry for the user U in the proxy server S2's database D.
  • 1. The user now sends proxy server S2 the pseudonym, which has been signed by Z to indicate the authenticity and uniqueness of the pseudonym.
  • The user also generates a PK P , SK P key pair for use with the granted pseudonym, where SK P is the private key associated with the pseudonym and PK P is the public key associated with the pseudonym.
  • The user forms a request to establish pseudonym P on proxy server S2, by sending the signed pseudonym S(P, SK Z ) to the proxy server S2 along with a request to create a new database entry, indexed by P, and the public key PK P . The user envelops the message and transmits it to proxy server S2 through an anonymizing mix path, along with an anonymous return envelope header.
  • 2. The proxy server S2 receives the database creation entry request and associated certified pseudonym message.
  • The proxy server S2 checks to ensure that the requested pseudonym P is signed by server Z and if so grants the request and creates a database entry for the pseudonym, as well as storing the user's public key PK P to ensure that only the user U can make requests in the future using pseudonym P.
  • 3. The structure of the user's database entry consists of a user profile as detailed herein, a target profile interest summary as detailed herein, and a Boolean combination of access control criteria as detailed below, along with the associated public key for the pseudonym P.
  • 4. The user U may provide proxy server S2 with credentials on that pseudonym, provided by third parties, which credentials make certain assertions about that pseudonym.
  • The proxy server may verify those credentials and make appropriate modifications to the user's profile as required by these credentials, such as recording the user's new demographic status as an adult. It may also store those credentials, so that it can present them to service providers on the user's behalf.
  • the above steps may be repeated, with either the same or a different proxy server, each time user U requires a new pseudonym for use with a new and disjoint coalition of providers.
  • A given pseudonym may have already been allocated, owing to the random nature of the pseudonym generation process carried out by Z. If this highly unlikely event occurs, then the proxy server S2 may reply to the user with a signed message indicating that the generated pseudonym has already been allocated, and asking for a new pseudonym to be generated.
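The registration exchange can be sketched from the proxy server's side as follows. This is a minimal illustration under stated assumptions: server Z's key is a toy RSA key, the pseudonym is a small integer, the record layout and return strings are hypothetical, and the anonymizing mix path and return envelope header are omitted entirely.

```python
# Toy RSA parameters for server Z (assumed values; far too small for real use).
Z_N = 61 * 53                              # modulus 3233
Z_E = 17                                   # public exponent, PK_Z
Z_D = pow(Z_E, -1, 60 * 52)                # private exponent, SK_Z


def z_sign(pseudonym: int) -> int:
    """Server Z certifies a pseudonym: S(P, SK_Z)."""
    return pow(pseudonym, Z_D, Z_N)


class ProxyServer:
    def __init__(self):
        self.database = {}                 # records indexed by pseudonym P

    def register(self, pseudonym: int, signed_pseudonym: int, user_public_key):
        """Handle a request to create a database entry for pseudonym P."""
        # Accept only pseudonyms that carry Z's signature.
        if pow(signed_pseudonym, Z_E, Z_N) != pseudonym % Z_N:
            return "rejected: pseudonym not signed by Z"
        # Collision case: the user must obtain a fresh pseudonym.
        if pseudonym in self.database:
            return "rejected: pseudonym already allocated"
        # Record layout named in the text, plus a slot for credentials.
        self.database[pseudonym] = {
            "public_key": user_public_key,
            "user_profile": {},
            "target_profile_interest_summary": {},
            "access_control": None,
            "credentials": [],
        }
        return "registered"


proxy = ProxyServer()
P = 2024                                   # pseudonym issued by Z (toy value)
first = proxy.register(P, z_sign(P), user_public_key=(7, 143))
again = proxy.register(P, z_sign(P), user_public_key=(7, 143))
forged = proxy.register(99, (z_sign(99) + 1) % Z_N, user_public_key=(7, 143))
```

Storing the user's public key with the record is what later lets the proxy server reject requests that were not signed by the pseudonym's holder.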
  • Once a proxy server S2 has authenticated and registered a user's pseudonym, the user may begin to use the services of the proxy server S2, in interacting with other network entities such as service providers, as exemplified by server S4 in Figure 2, an information service provider node connected to the network.
  • the user controls the proxy server S2 by forming digitally encoded requests that the user subsequently transmits to the proxy server S2 over the network N.
  • The nature and format of these requests will vary, since the proxy server may be used for any of the services described in this application, such as the browsing, querying, and other navigational functions described below.
  • the user wishes to communicate under pseudonym P with a particular information provider or user at address A, where P is a pseudonym allocated to the user and A is either a public network address at a server such as S4, or another pseudonym that is registered on a proxy server such as S4.
  • address A is the address of an information provider, and the user is requesting that the information provider send target objects of interest.
  • the user must form a request R to proxy server S2, that requests proxy server S2 to send a message to address A and to forward the response back to the user.
  • the user may thereby communicate with other parties, either non-pseudonymous parties, in the case where address A is a public network address, or pseudonymous parties, in the case where address A is a pseudonym held by, for example, a business or another user who prefers to operate pseudonymously.
  • the request R to proxy server S2 formed by the user may have different content.
  • request R may instruct proxy server S2 to use the methods described later in this description to retrieve from the most convenient server a particular piece of information that has been multicast to many servers, and to send this information to the user.
  • request R may instruct proxy server S2 to multicast to many servers a file associated with a new target object provided by the user, as described below. If the user is a subscriber to the news clipping service described below, request R may instruct proxy server S2 to forward to the user all target objects that the news clipping service has sent to proxy server S2 for the user's attention.
  • request R may instruct proxy server S2 to select a particular cluster from the hierarchical cluster tree and provide a menu of its subclusters to the user, or to activate a query that temporarily affects proxy server S2's record of the user's target profile interest summary. If the user is a member of a virtual community as described below, request R may instruct proxy server S2 to forward to the user all messages that have been sent to the virtual community.
  • Regardless of the content of request R, the user, at client C3, initiates a connection to the user's local server S1, and instructs server S1 to send the request R along a secure mix path to the proxy server S2, initiating the following sequence of actions:
  • 1. The user's client processor C3 forms a signed message S(R, SK P ), which is paired with the user's pseudonym P and (if the request R requires a response) a secure one-time set of return envelopes, to form a message M. It protects the message M with a multiply enveloped route for the outgoing path.
  • The enveloped routes provide for secure communication between S1 and the proxy server S2.
  • The message M is enveloped in the most deeply nested message and is therefore difficult to recover should the message be intercepted by an eavesdropper.
  • 2. The message M is sent by client C3 to its local server S1, and is then routed by the data communication network N from server S1 through a set of mixes as dictated by the outgoing envelope set and arrives at the selected proxy server S2.
  • the proxy server S2 separates the received message M into the request message R, the pseudonym P, and (if included) the set of envelopes for the return path.
  • the proxy server S2 uses pseudonym P to index and retrieve the corresponding record in proxy server S2's database, which record is stored in local storage at the proxy server S2 or on other distributed storage media accessible to proxy server S2 via the network N.
  • This record contains a public key PK P , user-specific information, and credentials associated with pseudonym P.
  • The proxy server S2 uses the public key PK P to check that the signed version S(R, SK P ) of request message R is valid.
  • proxy server S2 acts on the request R.
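The proxy server's handling of an incoming message M (separate it into its parts, look up the record for P, check the signature, then act) might be sketched as below. The toy RSA key, the hash-then-sign step, the dictionary layout and the return tuples are all assumptions introduced for illustration; the description itself only requires that S(R, SK P ) be checked against PK P .

```python
import hashlib

# Toy RSA key pair for pseudonym P (assumed values; not secure).
N = 61 * 53
PK_P = 17
SK_P = pow(PK_P, -1, 60 * 52)


def digest(request: bytes) -> int:
    """Map request bytes to an integer below N so it can be exponentiated."""
    return int.from_bytes(hashlib.sha256(request).digest(), "big") % N


def sign_request(request: bytes) -> int:
    """Client side: S(R, SK_P), computed here on a digest of R."""
    return pow(digest(request), SK_P, N)


def handle_message(database: dict, message: dict):
    """Proxy side: split M, look up the record for P, verify, then act."""
    request = message["request"]
    pseudonym = message["pseudonym"]
    return_envelopes = message.get("return_envelopes", [])

    record = database.get(pseudonym)
    if record is None:
        return ("rejected", "unknown pseudonym")

    exponent, modulus = record["public_key"]
    if pow(message["signature"], exponent, modulus) != digest(request):
        return ("rejected", "bad signature")

    # Acting on R is application specific; this sketch just echoes the
    # request with the number of return envelopes available for a reply.
    return ("accepted", request, len(return_envelopes))


database = {"P": {"public_key": (PK_P, N), "credentials": []}}
R = b"send message M1 to address A"
M = {"request": R, "pseudonym": "P", "signature": sign_request(R),
     "return_envelopes": ["envelope-1"]}
ok = handle_message(database, M)
tampered = handle_message(database, {**M, "signature": (M["signature"] + 1) % N})
```

Because the record is indexed by pseudonym rather than by identity, the proxy server can authenticate the request without learning who sent it.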
  • request message R includes an embedded message Ml and an address A to whom message Ml should be sent; in this case, proxy server S2 sends message Ml to the server named in address A, such as server S4.
  • the communication is done using signed and optionally encrypted messages over the normal point to point connections provided by the data communication network N.
  • server S4 may exchange or be caused to exchange further signed and optionally encrypted messages with proxy server S2, still over normal point to point connections, in order to negotiate the release of user-specific information and credentials from proxy server S2.
  • server S4 may require server S2 to supply credentials proving that the user is entitled to the information requested, for example, proving that the user is a subscriber in good standing to a particular information service, that the user is old enough to legally receive adult material, and that the user has been offered a particular discount (by means of a special discount credential issued to the user's pseudonym).
  • proxy server S2 has sent a message to a server S4 and server S4 has created a response M2 to message Ml to be sent to the user, then server S4 transmits the response M2 to the proxy server S2 using normal network point-to-point connections.
  • The proxy server S2, upon receipt of the response M2, creates a return message Mr comprising the response M2 embedded in the return envelope set that was earlier transmitted to proxy server S2 by the user in the original message M. It transmits the return message Mr along the pseudonymous mix path specified by this return envelope set, so that the response M2 reaches the user at the user's client processor C3.
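The multiply enveloped routes used for the outgoing message M and for the return message Mr can be pictured as nested layers, each of which only one hop can open. The sketch below is an assumption-laden illustration: the XOR keystream is a toy stand-in for real encryption and is not secure, and the hop names, keys and JSON layer format are hypothetical.

```python
import hashlib
import json


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR data with a SHA-256 counter keystream (NOT secure)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, pad))
    return bytes(out)


def wrap(message: bytes, route):
    """Build nested envelopes; route is [(hop_name, hop_key), ...] in travel order.

    The innermost layer holds the message; each outer layer names only
    the next hop, so a mix learns its successor and nothing more.
    """
    packet = message
    next_hop = None
    for hop_name, hop_key in reversed(route):
        layer = json.dumps({"next": next_hop, "body": packet.hex()}).encode()
        packet = keystream_xor(hop_key, layer)
        next_hop = hop_name
    return next_hop, packet          # first hop and the outermost envelope


def peel(packet: bytes, hop_key: bytes):
    """One hop removes its own layer and learns where to forward the rest."""
    layer = json.loads(keystream_xor(hop_key, packet))
    return layer["next"], bytes.fromhex(layer["body"])


keys = {"mix1": b"k1", "mix2": b"k2", "S2": b"k3"}
route = [(name, keys[name]) for name in ("mix1", "mix2", "S2")]
hop, packet = wrap(b"request M", route)

path = []
while hop is not None:               # each hop peels exactly one layer
    path.append(hop)
    hop, packet = peel(packet, keys[hop])
```

After the loop, `path` lists the hops in travel order and `packet` holds the original message, which was the most deeply nested item throughout the journey.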
  • the response M2 may contain a request for electronic payment to the information server S4.
  • the user may then respond by means of a message M3 transmitted by the same means as described for message Ml above, which message M3 encloses some form of anonymous payment.
  • the proxy server may respond automatically with such a payment, which is debited from an account maintained by the proxy server for this user.
  • Either the response message M2 from the information server S4 to the user, or a subsequent message sent by the proxy server S2 to the user, may contain advertising material that is related to the user's request and/or is targeted to the user.
  • (a) proxy server S2 determines a weighted set of advertisements that are "associated with" target object X, (b) a subset of this set is chosen randomly, where the weight of an advertisement is proportional to the probability that it is included in the subset, and (c) proxy server S2 selects from this subset just those advertisements that the user is most likely to be interested in.
  • this set typically consists of all advertisements that the proxy server's owner has been paid to disseminate and whose target profiles are within a threshold similarity distance of the target profile of target object X.
  • proxy server S4 determines the set of advertisements associated with target object X
  • advertisers typically purchase the right to include advertisements in this set.
  • the weight of an advertisement is determined by the amount that an advertiser is willing to pay.
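The three-stage advertisement selection described above can be sketched as follows. The function name, the scaling that gives the heaviest advertisement an inclusion probability of 1, and the sample data are assumptions for illustration; the text only requires that inclusion probability be proportional to weight.

```python
import random


def select_ads(ads, predicted_interest, max_shown, rng=random):
    """Pick advertisements for one target object in three stages.

    ads: list of (ad_id, weight) pairs already judged "associated with"
         the target object (stage a). Weights are non-negative.
    predicted_interest: function mapping ad_id to an interest score.
    """
    if not ads:
        return []
    top_weight = max(weight for _, weight in ads)
    if top_weight <= 0:
        return []

    # Stage b: include each ad independently with probability
    # proportional to its weight (scaled so the heaviest ad has p = 1).
    subset = [ad_id for ad_id, weight in ads
              if rng.random() < weight / top_weight]

    # Stage c: keep the ads the user is most likely to care about.
    subset.sort(key=predicted_interest, reverse=True)
    return subset[:max_shown]


ads = [("bikes", 10.0), ("boats", 10.0), ("blenders", 0.0)]
interest = {"bikes": 0.9, "boats": 0.4, "blenders": 0.99}
chosen = select_ads(ads, interest.get, max_shown=1, rng=random.Random(0))
```

In this example the zero-weight advertisement can never enter the random subset, so it is not shown even though its predicted interest is highest; payment determines eligibility and interest determines the final ranking.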
  • proxy server S2 retrieves the selected advertising material and transmits it to the user's client processor C3, where it will be displayed to the user, within a specified length of time after it is received, by a trusted process running on the user's client processor C3.
  • proxy server S2 transmits an advertisement, it sends a message to the advertiser, indicating that the advertisement has been transmitted to a user with a particular predicted level of interest.
  • the message may also indicate the identity of target object X.
  • the advertiser may transmit an electronic payment to proxy server S2; proxy server S2 retains a service fee for itself, optionally forwards a service fee to information server S4, and the balance is forwarded to the user or used to credit the user's account on the proxy server.
  • the passive and/or active relevance feedback that the user provides on this object is tabulated by a process on the user's client processor C3.
  • A summary of this feedback, signed with the private key SK C3 , is periodically transmitted through a secure mix path to the proxy server S2, whereupon the search profile generation module 202 resident on server S2 updates the appropriate target profile interest summary associated with pseudonym P, provided that the signature on the summary message can be authenticated with the corresponding public key PK C3 , which is available to all tabulating processes that are ensured to have integrity.
  • a particular pseudonym may be extended for the consumer with respect to the given provider as detailed in the previous section.
  • the consumer and the service provider agree to certain terms.
  • the service provider may decline to provide service to the pseudonym under which it transacts with the user.
  • the service provider has the recourse of refusing to provide resolution credentials to the pseudonym, and may choose to do so until the pseudonym bearer returns to good standing.
  • a user may request access in sequence to many files, which are stored on one or more information servers. This behavior is common when navigating a hypertext system such as the World Wide Web, or when using the target object browsing system described below.
  • the user requests access to a particular target object or menu of target objects; once the corresponding file has been transmitted to the user's client processor, the user views its contents and makes another such request, and so on.
  • Each request may take many seconds to satisfy, due to retrieval and transmission delays.
  • The system for customized electronic identification of desirable objects can respond more quickly to each request, by retrieving or starting to retrieve the appropriate files even before the user requests them. This early retrieval is termed "pre-fetching of files."
  • Successful pre-fetching depends on the ability of the system to predict the next action or actions of the user.
  • the system may further collect and utilize a similar set of statistics that describes the aggregate behavior of all users; in cases where the system cannot confidently make a prediction as to what a particular user will do, because the relevant statistics concerning that user's user cluster are derived from only a small amount of data, the system may instead make its predictions based on the aggregate statistics for all users, which are derived from a larger amount of data.
  • a pre-fetching system that both employs these insights and that makes its pre-fetching decisions through accurate measurement of the expected cost and benefit of each potential pre-fetch.
  • Pre-fetching exhibits a cost-benefit tradeoff. Let t denote the approximate number of minutes that pre-fetched files are retained in local storage (before they are deleted to make room for other pre-fetched files). If the system elects to pre-fetch a file corresponding to a target object X, then the user benefits from a fast response at no extra cost, provided that the user explicitly requests target object X soon thereafter. However, if the user does not request target object X within t minutes of the pre-fetch, then the pre-fetch was worthless, and its cost is an added cost that must be borne (directly or indirectly) by the user. The first scenario therefore provides benefit at no cost, while the second scenario incurs a cost at no benefit.
  • the system tries to favor the first scenario by pre-fetching only those files that the user will access anyway.
  • the system may pre-fetch either conservatively, where it controls costs by pre-fetching only files that the user is extremely likely to request explicitly (and that are relatively cheap to retrieve), or more aggressively, where it also pre-fetches files that the user is only moderately likely to request explicitly, thereby increasing both the total cost and (to a lesser degree) the total benefit to the user.
  • pre-fetching for a user U is accomplished by the user's proxy server S.
  • When proxy server S retrieves a user-requested file F from an information server, it uses the identity of this file F and the characteristics of the user, as described below, to identify a group of other files G1...Gk that the user is likely to access soon.
  • The user's request for file F is said to "trigger" files G1...Gk.
  • Proxy server S pre-fetches each of these triggered files Gi as follows:
  • 1. Proxy server S retrieves file Gi from an appropriate information server and stores it locally.
  • 2. Proxy server S timestamps its local copy of file Gi as having just been pre-fetched, so that file Gi will be retained in local storage for a minimum of approximately t minutes before being deleted.
  • Whenever user U (or, in principle, any other user registered with proxy server S) requests proxy server S to retrieve a file that has been pre-fetched and not yet deleted, proxy server S can then retrieve the file from local storage rather than from another server.
  • In a variation on steps 1-2 above, proxy server S pre-fetches a file Gi somewhat differently, so that pre-fetched files are stored on the user's client processor q rather than on server S:
  • If proxy server S has not pre-fetched file Gi in the past t minutes, it retrieves file Gi and transmits it to user U's client processor q.
  • Proxy server S notifies client q that client q should timestamp its local copy of file Gi; this notification may be combined with the message transmitted in step 1, if any.
  • Upon receipt of the message sent in step 3, client q timestamps its local copy of file Gi as having just been pre-fetched, so that file Gi will be retained in local storage for a minimum of approximately t minutes before being deleted.
  • client q can respond to any request for file Gi (by user U or, in principle, any other user of client q) immediately and without the assistance of proxy server S.
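The timestamp-and-retain behavior is the same whether the pre-fetched copy lives on proxy server S or on client q, and can be sketched as a small cache. The class and method names are hypothetical, and time is passed in explicitly (in minutes) so that the example stays deterministic.

```python
class PrefetchCache:
    """Local store that keeps pre-fetched files for at least t minutes."""

    def __init__(self, retention_minutes: float):
        self.retention = retention_minutes
        self.entries = {}                  # file id -> (contents, timestamp)

    def store(self, file_id, contents, now: float):
        """Save a file and stamp it as having just been pre-fetched."""
        self.entries[file_id] = (contents, now)

    def lookup(self, file_id):
        """Return cached contents, or None if the file must be fetched."""
        entry = self.entries.get(file_id)
        return entry[0] if entry else None

    def evict_expired(self, now: float):
        """Delete files whose retention period has elapsed."""
        expired = [f for f, (_, stamp) in self.entries.items()
                   if now - stamp >= self.retention]
        for file_id in expired:
            del self.entries[file_id]
        return expired


cache = PrefetchCache(retention_minutes=10)
cache.store("G1", b"contents of G1", now=0)
cache.store("G2", b"contents of G2", now=8)
hit = cache.lookup("G1")                   # served locally, no network trip
removed = cache.evict_expired(now=12)      # G1 has been held for 12 minutes
```

Re-storing a file refreshes its timestamp, which matches the effect of a later pre-fetch of the same file restarting its retention period.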
  • Proxy server S each time it retrieves a file F in response to a request, to identify the files G1...Gk that should be triggered by the request for file F and pre-fetched immediately.
  • Proxy server S employs a cost-benefit analysis, performing each pre-fetch whose benefit exceeds a user-determined multiple of its cost; the user may set the multiplier low for aggressive prefetching or high for conservative prefetching. These pre-fetches may be performed in parallel.
  • The benefit of pre-fetching file Gi immediately is defined to be the expected number of seconds saved by such a pre-fetch, as compared to a situation where Gi is left to be retrieved later (either by a later pre-fetch, or by the user's request) if at all.
  • the cost of pre-fetching file Gi immediately is defined to be the expected cost for proxy server S to retrieve file Gi, as determined for example by the network locations of server S and file Gi and by information provider charges, times 1 minus the probability that proxy server S will have to retrieve file Gi within t minutes (to satisfy either a later pre-fetch or the user's explicit request) if it is not pre-fetched now.
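The benefit and cost definitions above reduce to a one-line decision rule. The sketch below assumes the benefit is supplied in seconds and the retrieval cost in whatever cost unit the user's multiplier is expressed against; the function and parameter names are illustrative.

```python
def prefetch_cost(retrieval_cost: float, p_needed_within_t: float) -> float:
    """Expected wasted cost: retrieval cost times (1 - P(needed within t))."""
    return retrieval_cost * (1.0 - p_needed_within_t)


def should_prefetch(expected_seconds_saved: float,
                    retrieval_cost: float,
                    p_needed_within_t: float,
                    multiplier: float) -> bool:
    """Pre-fetch when the benefit exceeds the user's multiple of the cost."""
    cost = prefetch_cost(retrieval_cost, p_needed_within_t)
    return expected_seconds_saved > multiplier * cost


likely = should_prefetch(4.0, retrieval_cost=2.0,
                         p_needed_within_t=0.75, multiplier=3.0)
unlikely = should_prefetch(4.0, retrieval_cost=2.0,
                           p_needed_within_t=0.25, multiplier=3.0)
```

With the same file and the same multiplier, the pre-fetch is approved when the file is likely to be needed anyway (cost 0.5) and declined when it is not (cost 1.5); lowering the multiplier makes the policy more aggressive.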
  • the proxy server S may estimate the necessary costs and benefits by adhering to the following discipline:
  • 1. Proxy server S maintains a set of disjoint clusters of the users in its user base, clustered according to their user profiles.
  • 2. Proxy server S maintains an initially empty set PFT of "pre-fetch triples"
  • proxy server S takes the following actions: a. For C being the user cluster containing user U, and then again for C being the set of all users: b.
  • request R2 was a request for file G
  • the total benefit of triple <C,F,G> is increased either by the time elapsed between request R0 and request R2, or by the expected time to retrieve file G, whichever is less.
  • request R2 was a request for file G, and G was triggered or explicitly retrieved by one or more requests that user U made strictly in between requests R0 and R2, with R1 denoting the earliest such request
  • the total benefit of triple <C,F,G> is decreased either by the time elapsed between request R1 and request R2, or by the expected time to retrieve file G, whichever is less.
  • the trigger-count is incremented by one for each triple currently in the set PFT such that the triple has form <C,F,G>, where user U is in the set or cluster identified by C.
  • the "age" of a triple <C,F,G> is defined to be the number of days elapsed between its timestamp and the current date and time. If the age of any triple <C,F,G> exceeds a fixed constant number of days, and also exceeds a fixed constant multiple of the triple's count, then the triple may be deleted from the set PFT.
  • Proxy server S can therefore decide rapidly which files G should be triggered by a request for a given file F from a given user U, as follows.
  • 1. Let C0 be the user cluster containing user U, and let C1 be the set of all users.
  • 2. Server S constructs a list L of all triples <C0,F,G> such that <C0,F,G> appears in set PFT with a count exceeding a fixed threshold.
  • 3. Server S adds to list L all triples <C1,F,G> such that <C0,F,G> does not appear on list L and <C1,F,G> appears in set PFT with a count exceeding another fixed threshold.
  • 4. For each triple <C,F,G> on list L:
  • Server S computes the cost of triggering file G to be the expected cost of retrieving file G, times 1 minus the quotient of the target-count of <C,F,G> by the trigger-count of <C,F,G>.
  • Server S computes the benefit of triggering file G to be the total benefit of <C,F,G> divided by the count of <C,F,G>.
  • Proxy server S uses the computed cost and benefit, as described earlier, to decide whether file G should be triggered.
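The triggering decision can be sketched end to end: prefer statistics from the user's own cluster, fall back to the all-users statistics when the cluster has too little data, then apply the cost-benefit test. The data layout, the threshold values, and the reading of a triple's "count" as its trigger-count are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class TripleStats:
    trigger_count: int      # times a request for F by a user in C was seen
    target_count: int       # times G was then needed within t minutes
    total_benefit: float    # accumulated seconds saved (or lost)


def files_to_trigger(pft, c0, c1, f, retrieval_cost, multiplier,
                     threshold0=5, threshold1=20):
    """Return the files G that a request for F should trigger.

    pft maps (C, F, G) to TripleStats. retrieval_cost maps G to the
    expected cost of fetching it. Thresholds are illustrative values.
    """
    # Prefer the user's own cluster, then fall back to all users.
    chosen = {}
    for (c, f2, g), stats in pft.items():
        if f2 == f and c == c0 and stats.trigger_count > threshold0:
            chosen[g] = stats
    for (c, f2, g), stats in pft.items():
        if (f2 == f and c == c1 and g not in chosen
                and stats.trigger_count > threshold1):
            chosen[g] = stats

    # Cost-benefit test for each candidate file.
    triggered = []
    for g, stats in chosen.items():
        p_needed = stats.target_count / stats.trigger_count
        cost = retrieval_cost[g] * (1.0 - p_needed)
        benefit = stats.total_benefit / stats.trigger_count
        if benefit > multiplier * cost:
            triggered.append(g)
    return sorted(triggered)


pft = {
    ("sports", "F", "G1"): TripleStats(10, 8, 30.0),    # enough cluster data
    ("sports", "F", "G2"): TripleStats(2, 2, 6.0),      # too little cluster data
    ("all", "F", "G2"): TripleStats(100, 10, 50.0),     # rarely needed after F
    ("all", "F", "G3"): TripleStats(40, 30, 120.0),     # often needed after F
}
retrieval_cost = {"G1": 1.0, "G2": 1.0, "G3": 4.0}
result = files_to_trigger(pft, "sports", "all", "F", retrieval_cost, multiplier=2.0)
```

Here G1 is triggered on the strength of the cluster statistics, G3 on the aggregate statistics, and G2 is skipped because the aggregate data say it is rarely requested after F.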
  • the approach to pre-fetching just described has the advantage that all data storage and manipulation concerning pre-fetching decisions by proxy server S is handled locally at proxy server S.
  • This "user-based" approach does lead to duplicated storage and effort across proxy servers, as well as incomplete data at each individual proxy server. That is, the information indicating what files are frequently retrieved after file F is scattered in an uncoordinated way across numerous proxy servers.
  • An alternative, "file-based" approach is to store all such information with file F itself. The difference is as follows.
  • In the user-based approach, a pre-fetch triple <C,F,G> in server S's set PFT may mention any file F and any file G on the network, but is restricted to clusters C that are subsets of the user base of server S.
  • In the file-based approach, a pre-fetch triple <C,F,G> in server S's set PFT may mention any user cluster C and any file G on the network, but is restricted to files F that are stored on server S.
  • proxy server S2 sends a request to server S to retrieve file F for a user U
  • server S2 indicates in this message the user U's user cluster C0, as well as the user U's value for the user-determined multiplier that is used in cost-benefit analysis.
  • Server S can use this information, together with all its triples in its set PFT of the form <C0,F,G> and <C1,F,G>, where C1 is the set of all users everywhere on the network, to determine (exactly as in the user-based approach) which files G1...Gk are triggered by the request for file F.
  • When server S sends file F back to proxy server S2, it also sends this list of files G1...Gk, so that proxy server S2 can proceed to pre-fetch files G1...Gk.
  • In the user-based approach, server S must execute steps 3c-3g above for any ordered pair of requests R0 and R2 made within t minutes of each other by a user who employs server S as a proxy server.
  • In the file-based approach, server S must execute steps 3c-3g above for any ordered pair of requests R0 and R2 made within t minutes of each other, by any user on the network, such that R0 requests a file stored on server S. Therefore, when a user makes a request R2, the user's proxy server must send a notification of request R2 to all servers S such that, during the preceding t minutes (where the variable t may now depend on server S), the user has made a request R0 for a file stored on server S. This notification need not be sent immediately, and it is generally more efficient for each proxy server to buffer up such notifications and send them periodically in groups to the appropriate servers.
Access And Reachability Control of Users and User-Specific Information
  • proxy server S2 can act as a representative on behalf of each user in its user base, permitting access to the user and the user's private data only in accordance with criteria that have been set by the user. Proxy server S2 can restrict access in two ways:
  • the proxy server S2 may restrict access by third parties to server S2's pseudonymous database of user-specific information.
  • When a third party such as an advertiser sends a message to server S2 requesting the release of user-specific information for a pseudonym P, server S2 refuses to honor the request unless the message includes credentials for the accessor adequate to prove that the accessor is entitled to this information.
  • The user associated with pseudonym P may at any time send signed control messages to proxy server S2, specifying the credentials or Boolean combinations of credentials that proxy server S2 should thenceforth consider to be adequate grounds for releasing a specified subset of the information associated with pseudonym P.
  • Proxy server S2 stores these access criteria with its database record for pseudonym P.
  • For example, a user might wish proxy server S2 to release purchasing information only to selected information providers, to charitable organizations (that is, organizations that can provide a government-issued credential that is issued only to registered charities), and to market researchers who have paid user U for the right to study user U's purchasing habits.
  • the proxy server S2 may restrict the ability of third parties to send electronic messages to the user.
  • a third party such as an advertiser attempts to send information (such as a textual message or a request to enter into spoken or written real-time communication) to pseudonym P
  • proxy server S2 will refuse to honor the request, unless the message includes credentials for the accessor adequate to meet the requirements the user has chosen to impose, as above, on third parties who wish to send information to the user.
  • Proxy server S2 removes a single-use pseudonymous return address envelope from its database record for pseudonym P, and uses the envelope to send a message containing the specified information along a secure mix path to the user of pseudonym P. If the envelope being used is the only envelope stored for pseudonym P, or more generally if the supply of such envelopes is low, proxy server S2 adds a notation to this message before sending it, which notation indicates to the user's local server that it should send additional envelopes to proxy server S2 for future use.
  • In a more general variation, the user may instruct the proxy server S2 to impose more complex requirements on the granting of requests by third parties, not simply Boolean combinations of required credentials. The user may impose any Boolean combination of simple requirements that may include, but are not limited to, the following:
  • (a.) the accessor (third party) is a particular party
  • satisfying the request would involve disclosure to the accessor of statistical summary data, which data are computed from the user's user profile or target profile interest summary together with the user profiles and target profile interest summaries of at least n other users in the user base of the proxy server
  • (f.) the content of the request is to send the user a target object, and this target object has a particular attribute (such as high reading level, or low vulgarity, or an authenticated Parental Guidance rating from the MPAA)
  • the content of the request is to send the user a target object, and this target object has been digitally signed with a particular private key (such as the private key used by the National Pharmaceutical Association to certify approved documents)
  • the content of the request is to send the user a target object, and the target profile has been digitally signed by a profile authentication agency, guaranteeing that the target profile is a true and accurate profile of the target object it claims to describe, with all attributes authenticated.
  • the content of the request is to send the user a target object, and the target profile of this target object is within a specified distance of a particular search profile specified by the user
  • the content of the request is to send the user a target object, and the proxy server S2, by using the user's stored target profile interest summary, estimates the user's likely interest in the target object to be above a specified threshold
  • The user composes a Boolean combination of predicates that apply to requests; the resulting complex predicate should be true when applied to a request that the user wants proxy server S2 to honor, and false otherwise.
  • the complex predicate may be encoded in another form, for efficiency.
  • the complex predicate is signed with SK P , and transmitted from the user's client processor C3 to the proxy server S2 through the mix path enclosed in a packet that also contains the user's pseudonym P.
  • the proxy server S2 receives the packet, verifies its authenticity using PK P and stores the access control instructions specified in the packet as part of its database record for pseudonym P.
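A complex predicate of this kind can be composed from small building blocks. The sketch below is illustrative only: the request layout, the combinator names, and the credential labels (including the hypothetical "blacklisted" mark) are assumptions, and the signing of the predicate with SK P and its transport over the mix path are omitted.

```python
def has_credential(name):
    """Predicate: the accessor presents the named credential."""
    return lambda request: name in request["credentials"]


def accessor_is(party):
    """Predicate: the accessor is one particular party."""
    return lambda request: request["accessor"] == party


def any_of(*predicates):
    """Boolean OR over predicates."""
    return lambda request: any(p(request) for p in predicates)


def all_of(*predicates):
    """Boolean AND over predicates."""
    return lambda request: all(p(request) for p in predicates)


def negate(predicate):
    """Boolean NOT of a predicate."""
    return lambda request: not predicate(request)


# Policy modeled on the purchasing-information example in the text:
# a selected provider, OR a registered charity, OR a market researcher
# who has paid the user; the "blacklisted" exclusion is a hypothetical extra.
policy = all_of(
    any_of(
        accessor_is("provider-S4"),
        has_credential("registered-charity"),
        all_of(has_credential("market-researcher"),
               has_credential("paid-user-U")),
    ),
    negate(has_credential("blacklisted")),
)

charity = {"accessor": "org-1", "credentials": {"registered-charity"}}
unpaid = {"accessor": "org-2", "credentials": {"market-researcher"}}
allowed = policy(charity)
refused = policy(unpaid)
```

The proxy server would evaluate the stored predicate against each incoming request and honor only those for which it returns true; a more compact encoding could be substituted for efficiency without changing the logic.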
  • the proxy server S2 enforces access control as follows:
  • The third party transmits a request to proxy server S2 using the normal point-to-point connections provided by the network N.
  • The request may be to access the target profile interest summaries associated with a set of pseudonyms P1...Pn, or to access the user profiles associated with a set of pseudonyms P1...Pn, or to forward a message to the users associated with pseudonyms P1...Pn.
  • The accessor may explicitly specify the pseudonyms P1...Pn, or may ask that P1...Pn be chosen to be the set of all pseudonyms registered with proxy server S2 that meet specified conditions.
  • If the request can be satisfied but only upon payment of a fee, the proxy server S2 transmits a payment request to the accessor, and waits for the accessor to send the payment to the proxy server S2. Proxy server S2 retains a service fee and forwards the balance of the payment to the user associated with pseudonym Pi, via an anonymous return packet that this user has provided.
  • The proxy server S2 transmits a credential request to the accessor, and waits for the accessor to send the credential to the proxy server S2.
  • 3c. The proxy server S2 satisfies the request by disclosing user-specific information to the accessor, by providing the accessor with a set of single-use envelopes to communicate directly with the user, or by forwarding a message to the user, as requested.
  • 4. Proxy server S2 optionally sends a message to the accessor, indicating why each of the denied requests for P1...Pn was denied, and/or indicating how many requests were satisfied.
  • the active and/or passive relevance feedback provided by any user U with respect to any target object sent by any path from the accessor is tabulated by the above-described tabulating process resident on user U's client processor C3. As described above, a summary of such information is periodically transmitted to the proxy server S2 to enable the proxy server S2 to update that user's target profile interest summary and user profile.
  • the access control criteria can be applied to solicited as well as unsolicited transmissions. That is, the proxy server can be used to protect the user from inappropriate or misrepresented target objects that the user may request. If the user requests a target object from an information server, but the target object turns out not to meet the access control criteria, then the proxy server will not permit the information server to transmit the target object to the user, or to charge the user for such transmission. For example, to guard against target objects whose profiles have been tampered with, the user may specify an access control criterion that requires the provider to prove the target profile's accuracy by means of a digital signature from a profile authentication agency.
  • the parents of a child user may instruct the proxy server that only target objects that have been digitally signed by a recognized child protection organization may be transmitted to the user; thus, the proxy server will not let the user retrieve pornography, even from a rogue information server that is willing to provide pornography to users who have not supplied an adulthood credential.
  • the graphical representation of the network N presented in Figure 3 shows that at least one of the data communications links can be eliminated, as shown in Figure 4, while still enabling the network N to transmit messages among all the servers A-D.
  • By elimination we mean that the link is unused in the logical design of the network, rather than a physical disconnection of the link.
  • the graphs that result when all redundant data communications links are eliminated are termed "trees" or "connected acyclic graphs."
  • a graph where a message could be transmitted by a server through other servers and then return to the transmitting server over a different originating data communications link is termed a "cycle."
  • a tree is thus an acyclic graph whose edges (links) connect a set of graph "nodes" (servers). The tree can be used to efficiently broadcast any data file to selected servers in a set of interconnected servers.
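The definition of a tree as a connected acyclic graph can be checked mechanically. The following is a minimal sketch in Python, assuming servers are given as a list of names and links as unordered pairs; the function name is_tree and the sample topologies are illustrative only and are not taken from Figures 3 or 4.

```python
from collections import deque

def is_tree(nodes, links):
    """Return True if the undirected graph is connected and acyclic."""
    nodes = list(nodes)
    # A tree on n nodes has exactly n - 1 links.
    if len(links) != len(nodes) - 1:
        return False
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # With n - 1 links, the graph is acyclic exactly when it is connected.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)

# Hypothetical four-server network: a chain is a tree, a chain plus a
# redundant link contains a cycle.
chain = [("A", "B"), ("B", "C"), ("C", "D")]
with_cycle = chain + [("C", "A")]
print(is_tree("ABCD", chain), is_tree("ABCD", with_cycle))
```

Removing the redundant link from the second topology yields the first, which is the sense of "elimination" used above.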
  • the tree structure is attractive in a communications network because much information distribution is multicast in nature; that is, a piece of information available at a single source must be distributed to a multiplicity of points where the information can be accessed.
  • This technique is widely known: for example, "FAX trees" are in common use in political organizations, and multicast trees are widely used in distribution of multimedia data in the Internet; for example, see "Scalable Feedback Control for Multicast Video Distribution in the Internet," (Jean-Chrysostome Bolot, Thierry Turletti, & Ian Wakeman, Computer Communication Review, Vol. 24, # 4, Oct. '94, Proceedings of SIGCOMM'94, pp.
  • One of the most difficult problems in practical network design is the construction of "good" multicast trees, that is, tree choices which exhibit low cost (due to data not traversing links unnecessarily) and good performance (due to data frequently being close to where it is needed).
Constructing a Multicast Tree
  • Algorithms for constructing multicast trees have either been ad hoc, as in the case of the Deering et al. Internet multicast tree, which adds clients as they request service by grafting them into the existing tree, or have relied on construction of a minimum-cost spanning tree.
  • a distributed algorithm for creating a spanning tree (defined as a tree that connects, or "spans," all nodes of the graph) on a set of Ethernet bridges was developed by Radia Perlman ("Interconnections: Bridges and Routers," Radia Perlman, Addison-Wesley, 1992). Creating a minimal-cost spanning tree for a graph depends on having a cost model for the arcs of the graph (corresponding to communications links in the communications network). In the case of Ethernet bridges, the default cost (more complicated costing models for path costs are discussed on pp.
  • the spanning tree minimizes the cost to the root by first electing a unique root and then constructing a spanning tree based on the distances from the root.
  • the root is elected by recourse to a numeric ID contained in "configuration messages": the server whose ID has minimum numeric value is chosen as the root.
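The root election and distance-based construction just described can be illustrated as follows. This is a centralized sketch only, not the distributed bridge protocol itself: it assumes every link has unit cost (so distance is hop count), and the function name and sample IDs are hypothetical.

```python
from collections import deque

def spanning_tree_from_root(server_ids, links):
    """Elect the minimum-ID server as root and build a shortest-hop tree.

    Returns (root, parent), where parent maps each reachable server to the
    neighbor through which it reaches the root (None for the root itself).
    """
    root = min(server_ids)  # lowest numeric ID wins the election
    adjacency = {s: [] for s in server_ids}
    for a, b in links:
        adjacency[a].append(b)
        adjacency[b].append(a)
    parent = {root: None}
    queue = deque([root])
    # Breadth-first search visits servers in order of hop distance from root.
    while queue:
        current = queue.popleft()
        for neighbor in sorted(adjacency[current]):
            if neighbor not in parent:
                parent[neighbor] = current
                queue.append(neighbor)
    return root, parent

root, parent = spanning_tree_from_root([3, 1, 2, 4], [(1, 2), (2, 3), (3, 1), (3, 4)])
print(root, parent)
```

In this example the link between servers 2 and 3 is left unused, since both reach the root directly.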
  • We first show how to use the similarity-based methods described above to select the servers most interested in a group of target objects, herein termed "core servers" for that group. Next we show how to construct an unrooted multicast tree that can be used to broadcast files to these core servers. Finally, we show how files corresponding to target objects are actually broadcast through the multicast tree at the initiative of a client, and how these files are later retrieved from the core servers when clients request them.
  • a separate set of core servers and hence a separate multicast tree may be used for each topical group of target objects.
  • servers may communicate among themselves through any path over which messages can travel; the goal of each multicast tree is to optimize the multicast distribution of files corresponding to target objects of the corresponding topic. Note that this problem is completely distinct from selecting a multiplicity of spanning trees for the complete set of interconnected nodes as disclosed by Sincoskie in U.S. Patent No.
  • a set of topical multicast trees for a set of homogenous target objects may be constructed or reconstructed at any time, as follows.
  • the set of target objects is grouped into a fixed number of topical clusters C1...Cp with the methods described above, for example, by choosing C1...Cp to be the result of a k-means clustering of the set of target objects, or alternatively a covering set of low-level clusters from a hierarchical cluster tree of these target objects.
  • a multicast tree MT(C) is then constructed from each cluster C in C1...Cp, by the following procedure:
  • Each pair (Si, C) is associated with a weight, w(Si, C), which is intended to covary with the expected number of users in the user base of proxy server Si who will subsequently access a target object from cluster C.
  • This weight is computed by proxy server Si in any of several ways, all of which make use of the similarity measurement computation described herein.
  • (a) Proxy server Si randomly selects a target object T from cluster C.
  • (b) For each user U in its user base, proxy server Si applies the techniques disclosed above to user U's stored user profile and target profile interest summary in order to estimate the interest w(U, T) that user U has in the selected target object T.
  • the aggregate interest w(Si, T) that the user base of proxy server Si has in the target object T is defined to be the sum of these interest values w(U, T).
  • Alternatively, w(Si, T) may be defined to be the sum of values s(w(U, T)) over all U in the user base.
  • Here s(*) is a sigmoidal function that is close to 0 for small arguments and close to a constant p_max for large arguments; thus s(w(U, T)) estimates the probability that user U will access target object T, which probability is assumed to be independent of the probability that any other user will access target object T.
  • In a further variation, w(Si, T) is made to estimate the probability that at least one user from the user base of Si will access target object T: then w(Si, T) may be defined as the maximum of the values w(U, T), or as 1 minus the product over the users U of the quantity (1 - s(w(U, T))).
  • (c) Proxy server Si repeats steps (a)-(b) for several target objects T selected randomly from cluster C, and averages the several values of w(Si, T) thereby computed in step (b) to determine the desired quantity w(Si, C), which quantity represents the expected aggregate interest by the user base of proxy server Si in the target objects of cluster C.
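The alternative definitions of the aggregate interest w(Si, T), and the averaging over randomly sampled target objects that yields w(Si, C), can be sketched as below. The logistic form of the sigmoid, its midpoint and steepness parameters, and all function names are assumptions made for illustration; the text only requires a sigmoidal function bounded by a constant.

```python
import math
import random

def sigmoid(w, p_max=1.0, midpoint=0.5, steepness=10.0):
    """Assumed logistic sigmoid: near 0 for small w, near p_max for large w."""
    return p_max / (1.0 + math.exp(-steepness * (w - midpoint)))

def interest_sum(interests):
    """w(Si, T) as the plain sum of the per-user interests w(U, T)."""
    return sum(interests)

def expected_accessors(interests, **params):
    """w(Si, T) as the sum of s(w(U, T)): expected number of accessing users."""
    return sum(sigmoid(w, **params) for w in interests)

def prob_at_least_one(interests, **params):
    """w(Si, T) as 1 minus the product of (1 - s(w(U, T))), assuming independence."""
    prob_none = 1.0
    for w in interests:
        prob_none *= 1.0 - sigmoid(w, **params)
    return 1.0 - prob_none

def cluster_weight(user_interest, users, cluster, samples=5,
                   aggregate=interest_sum, rng=None):
    """Estimate w(Si, C) by averaging w(Si, T) over randomly chosen T in C."""
    rng = rng or random.Random()
    total = 0.0
    for _ in range(samples):
        target = rng.choice(cluster)                             # step (a)
        total += aggregate([user_interest(u, target) for u in users])  # step (b)
    return total / samples                                       # step (c)
```

Passing aggregate=max gives the "maximum of the values w(U, T)" variant; user_interest stands in for the similarity-based interest estimate described earlier in the document.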
  • Alternatively, w(Si, C) is computed as follows: (a) For each search profile P_s in the locally stored search profile set of any user in the user base of proxy server Si, proxy server Si computes the distance d(P_s, P_c) between the search profile and the cluster profile P_c of cluster C. (b) w(Si, C) is chosen to be the maximum value of (-d(P_s, P_c)/r) across all such search profiles P_s, where r is computed as an affine function of the cluster diameter of cluster C.
  • The parameters of this affine function are chosen to be smaller (thereby increasing w(Si, C)) for servers Si for which the target object provider wishes to improve performance, as may be the case if the users in the user base of proxy server Si pay a premium for improved performance, or if performance at Si will otherwise be unacceptably low due to slow network connections.
  • In another variation, the proxy server Si is modified so that it maintains not only target profile interest summaries for each user in its user base, but also a single aggregate target profile interest summary for the entire user base.
  • This aggregate target profile interest summary is determined in the usual way from relevance feedback, but the relevance feedback on a target object, in this case, is considered to be the frequency with which users in the user base retrieved the target object when it was new.
  • the aggregate target profile interest summary for proxy server Si is updated.
  • w(Si, C) is estimated by the following steps:
  • (a) Proxy server Si randomly selects a target object T from cluster C.
  • (b) Proxy server Si applies the techniques disclosed above to its stored aggregate target profile interest summary in order to estimate the aggregate interest w(Si, T) that its aggregated user base had in the selected target object T, when new; this may be interpreted as an estimate of the likelihood that at least one member of the user base will retrieve a new target object similar to T.
  • (c) Proxy server Si repeats steps (a)-(b) for several target objects T selected randomly from cluster C, and averages the several values of w(Si, T) thereby computed in step (b) to determine the desired quantity w(Si, C), which quantity represents the expected aggregate interest by the user base of proxy server Si in the target objects of cluster C.
  • Those servers Si from among S1...Sn with the greatest weights w(Si, C) are designated "core servers" for cluster C.
  • those servers Si with the greatest values of w(Si, C) are selected.
  • the value of w(Si, C) for each server Si is compared against a fixed threshold w_min, and those servers Si such that w(Si, C) equals or exceeds w_min are selected as core servers.
  • When cluster C represents a narrow and specialized set of target objects, as often happens when the clusters C1...Cp are numerous, it is usually adequate to select only a small number of core servers for cluster C, thereby obtaining substantial advantages in computational efficiency in steps 4-5 below.
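The two selection rules for core servers (a fixed number of highest-weight servers, or all servers at or above a threshold) can be sketched as follows. The function name, the parameter names top_k and threshold, and the sample weights are hypothetical.

```python
def select_core_servers(weights, top_k=None, threshold=None):
    """Choose core servers for a cluster from a mapping server -> w(Si, C).

    Exactly one of top_k (keep the k highest-weight servers) or threshold
    (keep every server whose weight equals or exceeds it) should be given.
    """
    if top_k is not None:
        # Sort by descending weight, breaking ties by server name.
        ranked = sorted(weights, key=lambda s: (-weights[s], s))
        return ranked[:top_k]
    if threshold is not None:
        return sorted(s for s, w in weights.items() if w >= threshold)
    raise ValueError("specify top_k or threshold")

weights = {"A": 0.9, "B": 0.7, "C": 0.2, "D": 0.1}
print(select_core_servers(weights, top_k=2))        # ['A', 'B']
print(select_core_servers(weights, threshold=0.2))  # ['A', 'B', 'C']
```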
  • a complete graph G(C) is constructed whose vertices are the designated core servers for cluster C. For each pair of core servers, the cost of transmitting a message between those core servers along the cheapest path is estimated, and the weight of the edge connecting those core servers is taken to be this cost. The cost is determined as a suitable function of average transmission charges, average transmission delay, and worst-case or near-worst-case transmission delay.
  • the multicast tree MT(C) is computed by standard methods to be the minimum spanning tree (or a near-minimum spanning tree) for G(C), where the weight of an edge between two core servers is taken to be the cost of transmitting a message between those two core servers. Note that MT(C) does not contain as vertices all proxy servers S1...Sn, but only the core servers for cluster C.
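One standard method for computing the minimum spanning tree of the complete graph G(C) is Kruskal's algorithm, sketched below. The text does not prescribe a particular algorithm; the cost callable, which stands in for the estimated cheapest-path transmission cost between two core servers, and the sample costs are assumptions for illustration.

```python
from itertools import combinations

def minimum_spanning_tree(core_servers, cost):
    """Kruskal's algorithm on the complete graph over core_servers.

    cost(a, b) returns the estimated cost of sending a message between
    core servers a and b. Returns the list of tree edges.
    """
    parent = {s: s for s in core_servers}

    def find(s):
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    edges = sorted(combinations(sorted(core_servers), 2), key=lambda e: cost(*e))
    tree = []
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so keep it
            parent[root_a] = root_b
            tree.append((a, b))
    return tree

costs = {frozenset("AB"): 1, frozenset("AC"): 4, frozenset("BC"): 2,
         frozenset("AD"): 7, frozenset("BD"): 5, frozenset("CD"): 3}
print(minimum_spanning_tree("ABCD", lambda a, b: costs[frozenset((a, b))]))
```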
  • a message M is formed describing the cluster profile for cluster C, the core servers for cluster C and the topology of the multicast tree MT(C) constructed on those core servers.
  • Message M is broadcast to all proxy servers S1...Sn by means of the general multicast tree MT m .
  • Each proxy server Si, upon receipt of message M, extracts the cluster profile of cluster C, and stores it on a local storage device, together with certain other information that it determines from message M, as follows.
  • If proxy server Si is named in message M as a core server for cluster C, then proxy server Si extracts and stores the subtree of MT(C) induced by all core servers whose path distance from Si in the graph MT(C) is less than or equal to d, where d is a constant positive integer (usually from 1 to 3). If message M does not name proxy server Si as a core server for MT(C), then proxy server Si extracts and stores a list of one or more nearby core servers that can be inexpensively contacted by proxy server Si over virtual point-to-point links.
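The subtree that a core server stores, namely the part of MT(C) induced by core servers within path distance d, can be extracted with a bounded breadth-first search. This is a minimal sketch; the edge-list representation and the function name are assumptions.

```python
from collections import deque

def local_subtree(tree_edges, server, d):
    """Return the edges of the tree induced by servers within distance d of server."""
    adjacency = {}
    for a, b in tree_edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    distance = {server: 0}
    queue = deque([server])
    while queue:
        current = queue.popleft()
        if distance[current] == d:
            continue  # do not explore beyond path distance d
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    # Keep only edges whose two endpoints both lie within distance d.
    return [(a, b) for a, b in tree_edges if a in distance and b in distance]

edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
print(local_subtree(edges, "C", 1))  # [('B', 'C'), ('C', 'D')]
```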
  • client r provides on-line information for the network, such as an electronic newspaper.
  • This information can be structured by client r into a prearranged form, comprising a number of files, each of which is associated with a different target object.
  • the files can contain textual representations of stock prices, weather forecasts, editorials, etc.
  • the system determines likely demand for the target objects associated with these files in order to optimize the distribution of the files through the network N of interconnected clients p-s and proxy servers A-D.
  • Assume that cluster C consists of text articles relating to the aerospace industry; further assume that the target profile interest summaries stored at proxy servers A and B for the users at clients p and r indicate that these users are strongly interested in such articles. Then the proxy servers A and B are selected as core servers for the multicast tree MT(C). The multicast tree MT(C) is then computed to consist of the core servers, A and B, connected by an edge that represents the least costly virtual point-to-point link between A and B (either the direct path A-B or the indirect path A-C-B, depending on the cost).
  • a proxy server S is termed a
  • Such a message M triggers the broadcast of an embedded request R to all core servers in a multicast tree MT(C).
  • the content of request R and the identity of cluster C are included in the message M, as is a field indicating that message M is a global request message.
  • the message M contains a field S_last, which is unspecified except under certain circumstances described below, when it names a specific core server.
  • a global request message M may be transmitted to proxy server S by a user registered with proxy server S, which transmission may take place along a pseudonymous mix path, or it may be transmitted to proxy server S from another proxy server, along a virtual point-to-point connection.
  • When a proxy server S receives a message M that is marked as a global request message, it acts as follows: 1. If proxy server S is not a core server for topic C, it retrieves its locally stored list of nearby core servers for topic C, selects from this list a nearby core server S', and transmits a copy of message M over a virtual point-to-point connection to core server S'. If this transmission fails, proxy server S repeats the procedure with other core servers on its list. 2. If proxy server S is a core server for topic C, it executes the following steps: (a) Act on the request R that is embedded in message M.
  • list L may be empty before this step, or may become empty as a result of this step. (e) For each server Si in list L, transmit a copy of message M from server S to server Si over a virtual point-to-point connection, where the S_last field of the copy of message M has been altered to S_curr. If Si cannot be reached in a reasonable amount of time by any virtual point-to-point connection (for example, server Si is broken), recurse to step (c) above with S_last bound to S_curr and S_curr bound to Si for the duration of the recursion.
  • When server S' in step 1 or a server Si in step 2(e) receives a copy of the global request message M, it acts according to exactly the same steps. As a result, all core servers eventually receive a copy of global request message M and act on the embedded request R, unless some core servers cannot be reached. Even if a core server is unreachable, step (e) ensures that the broadcast can continue to other core servers in most circumstances, provided that d > 1; higher values of d provide additional insurance against unreachable core servers.
Multicasting Files
  • the system for customized electronic identification of desirable objects executes the following steps in order to introduce a new target object into the system. These steps are initiated by an entity E, which may be either a user entering commands via a keyboard at a client processor q, as illustrated in Figure 3, or an automatic software process resident on a client or server processor q.
  • Processor q forms a signed request R, which asks the receiver to store a copy of a file F on its local storage device.
  • File F, which is maintained by client q on storage at client q or on storage accessible by client q over the network, contains the informational content of or an identifying description of a target object, as described above.
  • the request R also includes an address at which entity E may be contacted (possibly a pseudonymous address at some proxy server D), and asks the receiver to store the fact that file F is maintained by an entity at said address.
  • Processor q embeds request R in a message M1, which it pseudonymously transmits to the entity E's proxy server D as described above. Message M1 instructs proxy server D to broadcast request R along an appropriate multicast tree.
  • Upon receipt of message M1, proxy server D examines the doubly embedded file F and computes a target profile P for the corresponding target object.
  • Proxy server D sends itself a global request message M instructing itself to broadcast request R along the topical multicast tree MT(Ck).
  • Proxy server D notifies entity E through a pseudonymous communication that file F has been multicast along the topical multicast tree for cluster Ck.
  • step 4 eventually causes all core servers for topic Ck to act on request R and therefore store a local copy of file F.
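The global request broadcast on which this step relies can be simulated as follows. This is a simplified sketch under stated assumptions: every core server is assumed to know the whole tree (rather than only the subtree within distance d), the tree is a dictionary mapping each server to the set of its neighbors, and reachability is modeled by a set of working servers. The names broadcast, _forward and reachable are hypothetical.

```python
def broadcast(tree, start, reachable, s_last=None, delivered=None):
    """Deliver a request to every reachable core server of an undirected tree.

    Returns the servers that acted on the request, in delivery order.
    """
    if delivered is None:
        delivered = []
    delivered.append(start)  # this server acts on the embedded request R
    _forward(tree, start, s_last, reachable, delivered)
    return delivered

def _forward(tree, s_curr, s_last, reachable, delivered):
    # List L: neighbors of s_curr in the tree, excluding the sender s_last.
    for s_i in sorted(tree[s_curr] - {s_last}):
        if s_i in reachable:
            # The copy sent to s_i names s_curr as its S_last field.
            broadcast(tree, s_i, reachable, s_last=s_curr, delivered=delivered)
        else:
            # s_i is broken: skip over it and contact its neighbors directly,
            # with S_last bound to s_curr and S_curr bound to s_i.
            _forward(tree, s_i, s_curr, reachable, delivered)

tree = {"A": {"B"}, "B": {"A", "C", "E"}, "C": {"B", "D"}, "D": {"C"}, "E": {"B"}}
print(broadcast(tree, "A", reachable={"A", "B", "D", "E"}))  # C is down
```

In the example, server D still receives the request even though its only tree neighbor C is unreachable, which corresponds to the case d > 1 discussed above.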
  • a core server Si may have to delete a less useful file.
  • Si may choose to delete the least recently accessed file.
  • Si deletes a file that it believes few users will access.
  • the weight w(Si, C_F), where C_F is a cluster consisting of the single target object associated with file F.
  • when server Si needs to delete a file, it chooses to delete the file F with the lowest weight w(Si, C_F). To reflect the fact that files are accessed less as they age, server Si periodically multiplies its stored value of w(Si, C_F) by a decay factor, such as 0.95, for each file F that it then stores. Alternatively, instead of using a decay factor, server Si may periodically recompute aggregate interest w(Si, C_F) for each file F that it stores; the aggregate interest changes over time because target objects typically have an age attribute that the system considers in estimating user interest, as described above.
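The decay-factor variant of this deletion policy can be sketched as a small cache. The class name, the capacity expressed as a file count, and the choice to always admit a new file are assumptions made for illustration.

```python
class FileCache:
    """Stores per-file weights w(Si, C_F) and evicts the lowest-weight file."""

    def __init__(self, capacity, decay=0.95):
        self.capacity = capacity  # maximum number of stored files (assumed unit)
        self.decay = decay        # periodic decay factor, e.g. 0.95
        self.weights = {}         # file name -> current weight

    def store(self, name, weight):
        """Store a file; return the name of the evicted file, if any."""
        evicted = None
        if name not in self.weights and len(self.weights) >= self.capacity:
            evicted = min(self.weights, key=self.weights.get)
            del self.weights[evicted]
        self.weights[name] = weight
        return evicted

    def age(self):
        """Periodic step: multiply every stored weight by the decay factor."""
        for name in self.weights:
            self.weights[name] *= self.decay

cache = FileCache(capacity=2)
cache.store("old", 1.0)
cache.age()                       # weight of "old" decays to 0.95
cache.store("new", 0.97)
print(cache.store("third", 0.5))  # evicts "old", now the lowest-weight file
```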
  • If entity E later wishes to remove file F from the network, for example because it has just multicast an updated version, it pseudonymously transmits a digitally signed global request message to proxy server D, requesting all proxy servers in the multicast tree MT(Ck) to delete any local copy of file F that they may be storing.
Queries to Multicast Trees
  • In addition to global request messages, another type of message that may be transmitted to any proxy server S is termed a "query message."
  • a query message causes a reply to be sent to the originator of the message; this reply will contain an answer to a given query Q if any of the servers in a given multicast tree MT(C) are able to answer it, and will otherwise indicate that no answer is available.
  • the query and the cluster C are named in the query message.
  • the query message contains a field S_last, which is unspecified except under certain circumstances described below, when it names a specific core server.
  • When a proxy server S receives a message M that is marked as a query message, it acts as follows: 1. Proxy server S sets A to be the return address for the client or server that transmitted message M to server S. A may be either a network address or a pseudonymous address. 2. If proxy server S is not a core server for cluster C, it retrieves its locally stored list of nearby core servers for topic C, selects from this list a nearby core server S', and transmits a copy of the query message M over a virtual point-to-point connection to core server S'.
  • If this transmission fails, proxy server S repeats the procedure with other core servers on its list. Upon receiving a reply, it forwards this reply to address A. 3. If proxy server S is a core server for cluster C, and it is able to answer query Q using locally stored information, then it transmits a "positive" reply to A, containing the answer. 4. If proxy server S is a core server for topic C, but it is unable to answer query Q using locally stored information, then it carries out a parallel depth-first search by executing the following steps: (a) Set L to be the empty list. (b) Retrieve the locally stored subtree of MT(C).
  • For each server Si directly linked to S_curr in this subtree, other than S_last (if specified), add the ordered pair (Si, S) to the list L. (c) If L is empty, transmit a "negative" reply to address A, saying that server S cannot locate an answer to query Q, and terminate the execution of step 4; otherwise proceed to step (d). (d) Select a list L1 of one or more server pairs (Ai, Bi) from the list L.
  • When a processor q in the network wishes to retrieve the file associated with a given target object, it executes the following steps. These steps are initiated by an entity E, which may be either a user entering commands via a keyboard at a client q, as illustrated in Figure 3, or an automatic software process resident on a client or server processor q.
  • Processor q forms a query Q that asks whether the recipient (a core server for cluster C) still stores a file F that was previously multicast to the multicast tree MT(C); if so, the recipient server should reply with its own server name.
  • processor q must already know the name of file F and the identity of cluster C; typically, this information is provided to entity E by a service such as the news clipping service or browsing system described below, which must identify files to the user by (name, multicast topic) pair.
  • Processor q forms a query message M that poses query Q to the multicast tree MT(C).
  • Processor q pseudonymously transmits message M to the user's proxy server D, as described above.
  • Processor q receives a response M2 to message M.
  • If the response M2 is "positive," that is, it names a server S that still stores file F, then processor q pseudonymously instructs the user's proxy server D to retrieve file F from server S. If the retrieval fails because server S has deleted file F since it answered the query, then client q returns to step 1.
  • If the response M2 is "negative," that is, it indicates that no server in MT(C) still stores file F, then processor q forms a query Q that asks the recipient for the address A of the entity that maintains file F; this entity will ordinarily maintain a copy of file F indefinitely. All core servers in MT(C) ordinarily retain this information (unless instructed to delete it by the maintaining entity), even if they delete file F for space reasons. Therefore, processor q should receive a response providing address A, whereupon processor q pseudonymously instructs the user's proxy server D to retrieve file F from address A.
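The retrieval procedure above (ask the multicast tree which core server still stores the file, and fall back to the maintaining entity's address otherwise) can be sketched as follows. The in-memory dictionaries stand in for core servers and maintaining entities, and all names are hypothetical; pseudonymous transport and the retry after a failed retrieval are omitted.

```python
def query_holder(core_servers, file_name):
    """Query Q: name of a core server that still stores the file, else None."""
    for server_name, server in core_servers.items():
        if file_name in server["files"]:
            return server_name
    return None

def query_maintainer(core_servers, file_name):
    """Fallback query: address of the entity that maintains the file, else None."""
    for server in core_servers.values():
        if file_name in server["maintainers"]:
            return server["maintainers"][file_name]
    return None

def retrieve(core_servers, maintaining_entities, file_name):
    """Return (source, content), preferring a core server over the maintainer."""
    holder = query_holder(core_servers, file_name)
    if holder is not None:  # "positive" response
        return holder, core_servers[holder]["files"][file_name]
    address = query_maintainer(core_servers, file_name)  # "negative" response
    if address is None:
        raise LookupError(file_name)
    return address, maintaining_entities[address][file_name]

core = {
    "A": {"files": {"f1": "v1"}, "maintainers": {"f1": "addr-E", "f2": "addr-E"}},
    "B": {"files": {}, "maintainers": {"f2": "addr-E"}},
}
entities = {"addr-E": {"f1": "v1", "f2": "v2"}}
print(retrieve(core, entities, "f1"), retrieve(core, entities, "f2"))
```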
  • the system for customized electronic identification of desirable objects of the present invention can be used in the electronic media system of Figure 1 to implement an automatic news clipping service which learns to select (filter) news articles to match a user's interests, based solely on which articles the user chooses to read.
  • the system for customized electronic identification of desirable objects generates a target profile for each article that enters the electronic media system, based on the relative frequency of occurrence of the words contained in the article.
  • the system for customized electronic identification of desirable objects also generates a search profile set for each user, as a function of the target profiles of the articles the user has accessed and the relevance feedback the user has provided on these articles.
  • As new articles are received for storage on the mass storage systems SS1-SSm of the information servers I1-Im, the system for customized electronic identification of desirable objects generates their target profiles.
  • the generated target profiles are later compared to the search profiles in the users' search profile sets, and those new articles whose target profiles are closest (most similar) to the closest search profile in a user's search profile set are identified to that user for possible reading.
  • the computer program providing the articles to the user monitors how much the user reads (the number of screens of data and the number of minutes spent reading), and adjusts the search profiles in the user's search profile set to more closely match what the user apparently prefers to read. The details of the method used by this system are disclosed in flow diagram form in Figure 5.
  • This method requires selecting a specific method of calculating user-specific search profile sets, of measuring similarity between two profiles, and of updating a user's search profile set (or more generally target profile interest summary) based on what the user read, and the examples disclosed herein are examples of the many possible implementations that can be used and should not be construed to limit the scope of the system.
  • the news clipping service instantiates target profile interest summaries as search profile sets, so that a set of high-interest search profiles is stored for each user.
  • the search profiles associated with a given user change over time.
  • they can be initially determined for a new user (or explicitly altered by an existing user) by any of a number of procedures, including the following preferred methods: (1) asking the user to specify search profiles directly by giving keywords and/or numeric attributes, (2) using copies of the profiles of target objects or target clusters that the user indicates are representative of his or her interest, (3) using a standard set of search profiles copied or otherwise determined from the search profile sets of people who are demographically similar to the user.
  • Articles are available on-line from a wide variety of sources. In the preferred embodiment, one would use the current day's news as supplied by a news source, such as the AP or Reuters news wire. These news articles are input to the electronic media system by being loaded into the mass storage system SS4 of an information server S4.
  • the article profile module 201 of the system for customized electronic identification of desirable objects can reside on the information server S4 and operates pursuant to the steps illustrated in the flow diagram of Figure 5, where, as each article is received at step 501 by the information server S4, the article profile module 201 at step 502 generates a target profile for the article and stores the target profile in an article indexing memory (typically part of mass storage system SS4) for later use in selectively delivering articles to users.
  • This method is equally useful for selecting which articles to read from electronic news groups and electronic bulletin boards, and can be used as part of a system for screening and organizing electronic mail ("e-mail").
  • a target profile is computed for each new article, as described earlier.
  • the most important attribute of the target profile is a textual attribute that stands for the entire text of the article.
  • This textual attribute is represented as described earlier, as a vector of numbers, which numbers in the preferred embodiment include the relative frequencies (TF/IDF scores) of word occurrences in this article relative to other comparable articles.
  • the server must count the frequency of occurrence of each word in the article in order to compute the TF/IDF scores.
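The word counting needed for such scores can be illustrated with one common TF/IDF weighting, in which a word's frequency within an article is multiplied by the logarithm of the inverse fraction of articles containing it. The exact scoring formula used by the system is defined earlier in the document and may differ; the whitespace tokenization and the function name below are simplifying assumptions.

```python
import math
from collections import Counter

def tf_idf_profiles(articles):
    """Return one {word: score} dictionary per article.

    score = (count of word in article / words in article)
            * log(number of articles / number of articles containing word)
    """
    tokenized = [article.lower().split() for article in articles]
    n = len(tokenized)
    # Document frequency: in how many articles each word appears at least once.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    profiles = []
    for words in tokenized:
        counts = Counter(words)
        profiles.append({
            word: (count / len(words)) * math.log(n / doc_freq[word])
            for word, count in counts.items()
        })
    return profiles

profiles = tf_idf_profiles([
    "stock prices rise",
    "stock prices fall",
    "weather forecast rain",
])
print(profiles[0])
```

Words that occur in every article receive a score of zero under this weighting, so distinctive words dominate the textual attribute.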
  • These news articles are then hierarchically clustered in a hierarchical cluster tree at step 503, which serves as a decision tree for determining which news articles are closest to the user's interest.
  • the resulting clusters can be viewed as a tree in which the top of the tree includes all target objects and branches further down the tree represent divisions of the set of target objects into successively smaller subclusters of target objects.
  • Each cluster has a cluster profile, so that at each node of the tree, the average target profile (centroid) of all target objects stored in the subtree rooted at that node is stored.
  • This average of target profiles is computed over the representation of target profiles as vectors of numeric attributes, as described above.
Compare Current Articles' Target Profiles to a User's Search Profiles
  • the process by which a user employs this apparatus to retrieve news articles of interest is illustrated in flow diagram form in Figure 11.
  • the user logs into the data communication network N via their client processor and activates the news reading program. This is accomplished by the user establishing a pseudonymous data communications connection as described above to a proxy server S2, which provides front-end access to the data communication network N.
  • the proxy server S2 maintains a list of authorized pseudonyms and their corresponding public keys and provides access and billing control.
  • the user has a search profile set stored in the local data storage medium on the proxy server S2.
  • the profile matching module 203 resident on proxy server S2 sequentially considers each search profile p_k from the user's search profile set to determine which news articles are most likely of interest to the user.
  • the news articles were automatically clustered into a hierarchical cluster tree at an earlier step so that the determination can be made rapidly for each user.
  • the hierarchical cluster tree serves as a decision tree for determining which articles' target profiles are most similar to search profile p_k: the search for relevant articles begins at the top of the tree, and at each level of the tree the branch or branches are selected which have cluster profiles closest to p_k. This process is recursively executed until the leaves of the tree are reached, identifying individual articles of interest to the user, as described in the section "Searching for Target Objects" above.
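The use of cluster centroids to steer a top-down search can be sketched as follows. Profiles are assumed to be plain numeric vectors compared by Euclidean distance, and the beam parameter (how many closest branches to follow at each level) is a hypothetical knob; the document's own similarity measure and branch selection rule are described in the section referred to above.

```python
import math

def leaf(article, profile):
    """A leaf node holding one article and its target profile."""
    return {"article": article, "profile": list(profile), "children": [], "size": 1}

def cluster(children):
    """An internal node whose profile is the centroid of all articles below it."""
    size = sum(c["size"] for c in children)
    dims = len(children[0]["profile"])
    # The size-weighted mean of child centroids equals the mean over all leaves.
    profile = [sum(c["profile"][i] * c["size"] for c in children) / size
               for i in range(dims)]
    return {"article": None, "profile": profile, "children": children, "size": size}

def search(node, search_profile, beam=1):
    """Descend toward the closest cluster profiles; return the articles reached."""
    if not node["children"]:
        return [node["article"]]
    ranked = sorted(node["children"],
                    key=lambda c: math.dist(c["profile"], search_profile))
    found = []
    for child in ranked[:beam]:
        found.extend(search(child, search_profile, beam))
    return found

aero = cluster([leaf("aero-1", [1.0, 0.0]), leaf("aero-2", [0.8, 0.2])])
root = cluster([aero, leaf("weather-1", [0.0, 1.0])])
print(search(root, [1.0, 0.0]))  # ['aero-1']
```

Because only the closest branches are followed, the number of distance computations grows with the depth of the tree rather than with the number of articles.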
  • the system begins by non-hierarchically clustering all the search profiles in the search profile sets of a large number of users. For each cluster k of search profiles, with cluster profile p_k, it uses the method described in the section "Searching for Target Objects" to locate articles with target profiles similar to p_k. Each located article is then identified as of interest to each user who has a search profile represented in cluster k of search profiles.
  • the above variation attempts to match clusters of search profiles with similar clusters of articles. Since this is a symmetrical problem, it may instead be given a symmetrical solution, as the following more general variation shows.
  • the search profiles of all users to be considered are clustered into a second hierarchical tree, termed the "search profile cluster tree."
  • The following steps serve to find all matches between individual target profiles from any target profile cluster tree and individual search profiles from any search profile cluster tree: 1. For each child subtree S of the root of the search profile cluster tree (or, let S be the entire search profile cluster tree if it contains only one search profile): 2.
  • step 6 is typically an affine function or other function of the greater of the cluster variances (or cluster diameters) of S and T.
  • the process can be applied even when the set of users to be considered or the set of target objects to be considered is very small.
  • the process reduces to the method given for identifying articles of interest to a single user.
  • the process constitutes a method for identifying users to whom that target object is of interest.
  • the profile processing module 203 stores a list of the identified articles for presentation to each user.
  • the profile processing system 203 retrieves the generated list of relevant articles and presents this list of titles of the selected articles to the user, who can then select at step 1105 any article for viewing. (If no titles are available, then the first sentence(s) of each article can be used.)
  • the list of article titles is sorted according to the degree of similarity of the article's target profile to the most similar search profile in the user's search profile set.
  • the resulting sorted list is either transmitted in real time to the user client processor C1 if the user is present at their client processor C1, or can be transmitted to a user's mailbox, resident on the user's client processor C1 or stored within the server S2 for later retrieval by the user; other methods of transmission include facsimile transmission of the printed list or telephone transmission by means of a text-to-speech system.
  • the user can then transmit a request by computer, facsimile, or telephone to indicate which of the identified articles the user wishes to review, if any.
  • the user can still access all articles in any information server S4 to which the user has authorized access; however, those lower on the generated list are simply further from the user's interests, as determined by the user's search profile set.
  • the server S2 retrieves the article from the local data storage medium or from an information server S4 and presents the article one screen at a time to the user's client processor C1. The user can at any time select another article for reading or exit the process.
Monitor Which Articles Are Read
  • the computed measure of article attractiveness can then be used as a weighting function to adjust the user's search profile set to thereby more accurately reflect the user's dynamically changing interests.
  • Updating of a user's generated search profile set can be done at step 1108 using the method described in copending U.S. Patent Application Serial No. 08/346,425.
  • the server S2 shifts each search profile in the set slightly in the direction of the target profiles of those nearby articles for which the computed measure of article attractiveness was high. Given a search profile with attributes u_I from a user's search profile set, and a set of J articles available with attributes d_j.
  • the user search profile set generation module should try to adjust u_I and/or d_j to more accurately predict the articles the user selected.
  • u_I and/or d_j should be shifted to increase their similarity if user I was predicted not to select article j but did select it, and perhaps also to decrease their similarity if user I was predicted to select article j but did not.
  • u_I is chosen to be the search profile from user I's search profile set that is closest to the target profile. If e is positive, this adjustment increases the match between user I's search profile set and the target profiles of the articles user I actually selects, by making u_I closer to d_j for the case where the algorithm failed to predict an article that the viewer selected.
  • the size of e determines how many example articles one must see to change the search profile substantially. If e is too large, the algorithm becomes unstable, but for sufficiently small e, it drives u_I to its correct value. In general, e should be proportional to the measure of article attractiveness; for example, it should be relatively high if user I spends a long time reading article j.
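The adjustment described above can be sketched as a small step of the closest search profile toward the target profile of a selected article. The update form u + e(d - u), the Euclidean choice of "closest", and the `rate` parameter are assumptions of this sketch; the exact formula is not reproduced in the text.

```python
import math


def shift_profile(u, d, e):
    """Move search profile u a fraction e of the way toward target profile d."""
    return [ui + e * (di - ui) for ui, di in zip(u, d)]


def update_profile_set(profiles, target, attractiveness, rate=0.1):
    """Shift only the search profile closest to `target`.

    The step size e = rate * attractiveness is an assumed form of
    "e proportional to the measure of article attractiveness".
    """
    closest = min(range(len(profiles)), key=lambda i: math.dist(profiles[i], target))
    e = rate * attractiveness
    updated = [list(p) for p in profiles]
    updated[closest] = shift_profile(profiles[closest], target, e)
    return updated


profiles = [[0.0, 0.0], [10.0, 10.0]]
print(update_profile_set(profiles, [8.0, 10.0], attractiveness=1.0, rate=0.5))
```

A small `rate` makes each article nudge the profile only slightly, which matches the stability remark: many consistent selections are then needed before the profile moves substantially.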
  • the news clipping service may deliver news articles (or advertisements and coupons for purchasables) to off-line users as well as to users who are on-line.
  • the off-line users may have no way of providing relevance feedback.
  • the user profile of an off-line user U may be similar to the profiles of on-line users, for example because user U is demographically similar to these other users, and the level of user U's interest in particular target objects can therefore be estimated via the general interest-estimation methods described earlier.
  • the news clipping service chooses a set of news articles (respectively, advertisements and coupons) that are predicted to be of interest to user U, thereby determining the content of a customized newspaper (respectively, advertising coupon circular) that may be printed and physically sent to user U via other methods.
  • the target objects included in the printed document delivered to user U are those with the highest median predicted interest among a group G of users, where group G consists of either the single off-line user U, a set of off-line users who are demographically similar to user U, or a set of off-line users who are in the same geographic area and thus on the same newspaper delivery route.
  • user group G is clustered into several subgroups G_1...G_k; an average user profile P_i is created from each subgroup G_i; for each article T and each user profile P_i, the interest in T by a hypothetical user with user profile P_i is predicted, and the interest of article T to group G is taken to be the maximum interest in article T by any of these k hypothetical users; finally, the customized newspaper for user group G is constructed from those articles of greatest interest to group G.
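The group-level selection can be sketched as follows, assuming the subgroups have already been formed. The `predicted_interest` function is a stand-in (interest falling off with profile distance) for the interest-estimation methods described earlier; it and all names below are hypothetical.

```python
import math


def centroid(profiles):
    """Average user profile P_i of one subgroup G_i."""
    n = len(profiles)
    return [sum(p[j] for p in profiles) / n for j in range(len(profiles[0]))]


def predicted_interest(user_profile, article_profile):
    """Stand-in interest estimate: decreases with profile distance (assumption)."""
    return 1.0 / (1.0 + math.dist(user_profile, article_profile))


def group_interest(article_profile, subgroups):
    """Interest of group G in an article: the maximum over subgroup averages."""
    return max(predicted_interest(centroid(g), article_profile) for g in subgroups)


def customized_newspaper(articles, subgroups, size):
    """Names of the `size` articles of greatest interest to the group."""
    ranked = sorted(
        articles,
        key=lambda name: group_interest(articles[name], subgroups),
        reverse=True,
    )
    return ranked[:size]


subgroups = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0]]]
articles = {"local": [0.0, 1.0], "finance": [10.0, 10.0], "misc": [5.0, 5.0]}
print(customized_newspaper(articles, subgroups, size=2))
```

Taking the maximum rather than the mean over subgroups means an article that strongly appeals to one subgroup is kept even if the rest of the group is indifferent to it.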
  • the filtering technology of the news clipping service is not limited to news articles provided by a single source, but may be extended to articles or target objects collected from any number of sources. For example, rather than identifying new news articles of interest, the technology may identify new or updated World Wide Web pages of interest.
  • In "broadcast clipping," where individual users desire to broadcast messages to all interested users, the pool of news articles is replaced by a pool of messages to be broadcast, and these messages are sent to the broadcast-clipping-service subscribers most interested in them.
  • the system scans the transcripts of all real-time spoken or written discussions on the network that are currently in progress and designated as public, and employs the news-clipping technology to rapidly identify discussions that the user may be interested in joining, or to rapidly identify and notify users who may be interested in joining an ongoing discussion.
  • the method is used as a post-process that filters and ranks in order of interest the many target objects found by a conventional database search, such as a search for all homes selling for under $200,000 in a given area, for all 1994 news articles about Marcia Clark, or for all Italian-language films.
  • the method is used to filter and rank the links in a hypertext document by estimating the user's interest in the document or other object associated with each link.
  • paying advertisers, who may be companies or individuals, are the source of advertisements or other messages, which take the place of the news articles in the news clipping service.
  • a consumer who buys a product is deemed to have provided positive relevance feedback on advertisements for that product, and a consumer who buys a product apparently because of a particular advertisement (for example, by using a coupon clipped from that advertisement) is deemed to have provided particularly high relevance feedback on that advertisement.
  • Such feedback may be communicated to a proxy server by the consumer's client processor (if the consumer is making the purchase electronically), by the retail vendor, or by the credit-card reader (at the vendor's establishment) that the consumer uses to pay for the purchase.
  • the disclosed technology is then used to match advertisements with those users who are most interested in them; advertisements selected for a user are presented to that user by any one of several means, including electronic mail, automatic display on the user's screen, or printing them on a printer at a retail establishment where the consumer is paying for a purchase.
  • the threshold distance used to identify interest may be increased for a particular advertisement, causing the system to present that advertisement to more users, in accordance with the amount that the advertiser is willing to pay.
  • a further use of the capabilities of this system is to manage a user's investment portfolio. Instead of recommending articles to the user, the system recommends target objects that are investments. As illustrated above by the example of stock market investments, many different attributes can be used together to profile each investment. The user's past investment behavior is characterized in the user's search profile set or target profile interest summary, and this information is used to match the user with stock opportunities (target objects) similar in nature to past investments. The rapid profiling method described above may be used to determine a rough set of preferences for new users. Quality attributes used in this system can include negatively weighted attributes, such as a measurement of fluctuations in dividends historically paid by the investment, a quality attribute that would have a strongly negative weight for a conservative investor dependent on a regular flow of investment income.
  • the user can set filter parameters so that the system can monitor stock prices and automatically take certain actions, such as placing buy or sell orders, or e-mailing or paging the user with a notification, when certain stock performance characteristics are met.
  • the system can immediately notify the user when a selected stock reaches a predetermined price, without the user having to monitor the stock market activity.
  • the user's investments can be profiled in part by a "type of investment" attribute (to be used in conjunction with other attributes), which distinguishes among bonds, mutual funds, growth stocks, income stocks, etc., to thereby segment the user's portfolio according to investment type.
  • Each investment type can then be managed to identify investment opportunities and the user can identify the desired ratio of investment capital for each type.
  • the system for customized electronic identification of desirable objects functions in an e-mail environment in a similar but slightly different manner.
  • the news clipping service selects and retrieves news information that would not otherwise reach its subscribers. But at the same time, large numbers of e-mail messages do reach users, having been generated and sent by humans or automatic programs. These users need an e-mail filter, which automatically processes the messages received.
  • the necessary processing includes a determination of the action to be taken with each message, including, but not limited to: filing the message, notifying the user of receipt of a high priority message, automatically responding to a message.
  • the e-mail filter system must not require too great an investment on the part of the user to learn and use, and the user must have confidence in the appropriateness of the actions automatically taken by the system.
  • the same filter may be applied to voice mail messages or facsimile messages that have been converted into electronically stored text, whether automatically or at the user's request, via the use of well-known techniques for speech recognition or optical character recognition.
  • a message processing function MPF(*) maps from a received message (document) to one or more of a set of actions.
  • the actions, which may be quite specific, may be either predefined or customized by the user.
  • Each action A has an appropriateness function F_A(*,*) such that F_A(U,D) returns a real number, representing the appropriateness of selecting action A on behalf of user U when user U is in receipt of message D.
  • the function MPF(D) is used to automatically select the appropriate action or actions.
  • the following set of actions might be useful:
  • actions 8 and 9 in the sample list above are designed to filter out messages that are undesirable to the user or that are received from undesirable sources, such as pesky salespersons, by deleting the unwanted message and/or sending a reply that indicates that messages of this type will not be read.
  • the appropriateness functions must be tailored to describe the appropriateness of carrying out each action given the target profile for a particular document, and then a message processing function MPF can be found which is in some sense optimal with respect to the appropriateness function.
  • One reasonable choice of MPF always picks the action with highest appropriateness, and in cases where multiple actions are highly appropriate and are also compatible with each other, selects more than one action: for example, it may automatically reply to a message and also file the same message in directory X, so that the value of MPF(D) is the set {reply, file in directory X}.
  • the system asks the user for confirmation of the action(s) selected by MPF.
  • the system also asks the user for confirmation: for example, mail should not be deleted if it is nearly as appropriate to let the user see it.
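One way to picture this choice of MPF is the sketch below. It takes precomputed appropriateness values F_A(U,D) for a single user and message, selects the best action plus any other highly appropriate actions compatible with those already chosen, and flags the case where a rejected action is nearly as appropriate. The `high` and `margin` cutoffs and the pairwise `compatible` set are assumptions introduced for illustration.

```python
def mpf(scores, compatible, high=0.8):
    """Select the most appropriate action plus compatible, highly appropriate ones.

    scores: action -> appropriateness F_A(U, D) for one user and message.
    compatible: set of frozenset pairs of actions that may be taken together.
    high: assumed cutoff for "highly appropriate".
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = [ranked[0]]
    for action in ranked[1:]:
        if scores[action] >= high and all(
            frozenset((action, c)) in compatible for c in chosen
        ):
            chosen.append(action)
    return set(chosen)


def needs_confirmation(scores, chosen, margin=0.1):
    """True when a rejected action is nearly as appropriate as the chosen ones."""
    best = max(scores[a] for a in chosen)
    rejected = [s for a, s in scores.items() if a not in chosen]
    return bool(rejected) and best - max(rejected) < margin


scores = {"reply": 0.9, "file in directory X": 0.85, "delete": 0.1}
compatible = {frozenset(("reply", "file in directory X"))}
print(mpf(scores, compatible))
```

In the deletion example from the text, a message scoring 0.55 for "delete" and 0.5 for "show to user" would be flagged by `needs_confirmation`, so the user is asked before the mail is removed.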
  • Each received document is viewed as a target object whose profile includes such attributes as the entire text of the document (represented as TF/IDF scores), document sender, date sent, document length, date of last document received from this sender, key words, list of other addressees, etc. It was disclosed above how to estimate an interest function on profiled target objects, using relevance feedback together with measured similarities among target objects and among users. In the context of the e-mail filter, the task is to estimate several appropriateness functions F_A(*,*), one per action.
  • the e-mail filter learns to take particular actions on e-mail messages that have certain attributes or combinations of attributes. For example, messages from John Doe that originate in the (212) area code may prompt the system to forward a copy by fax transmission to a given fax number, or to file the message in directory X on the user's client processor.
  • a variation allows active requests of this form from the user, such as a request that any message from John Doe be forwarded to a desired fax number until further notice.
  • This active user input requires the use of a natural language or form-based interface for which specific commands are associated with particular attributes and combinations of attributes.
Update Notification
  • a very important and novel characteristic of the architecture is the ability to identify new or updated target objects that are relevant to the user, as determined by the user's search profile set or target profile interest summary.
  • Updated target objects include revised versions of documents and new models of purchasable goods.
  • the system may notify the user of these relevant target objects by an electronic notification such as an e-mail message or facsimile transmission.
  • the user's e-mail filter can then respond appropriately to the notification, for instance, by bringing the notification immediately to the user's personal attention, or by automatically submitting an electronic request to purchase the target object named in the notification.
  • a simple example of the latter response is for the e-mail filter to retrieve an on-line document at a nominal or zero charge, or request to buy a purchasable of limited quantity such as a used product or an auctionable.
  • a hierarchical cluster tree imposes a useful organization on a collection of target objects.
  • the tree is of direct use to a user who wishes to browse through all the target objects in the tree. Such a user may be exploring the collection with or without a well-specified goal.
  • the tree's division of target objects into coherent clusters provides an efficient method whereby the user can locate a target object of interest. The user first chooses one of the highest level (largest) clusters from a menu, and is presented with a menu listing the subclusters of said cluster, whereupon the user may select one of these subclusters.
  • the system locates the subcluster, via the appropriate pointer that was stored with the larger cluster, and allows the user to select one of its subclusters from another menu. This process is repeated until the user comes to a leaf of the tree, which yields the details of an actual target object.
  • the user may also make selections over the telephone, with a voice synthesizer reading the menus and the user selecting subclusters via the telephone's touch-tone keypad.
  • the user simultaneously maintains two connections to the server, a telephone voice connection and a fax connection; the server sends successive menus to the user by fax, while the user selects choices via the telephone's touch-tone keypad.
  • Since user profiles commonly include an associative attribute indicating the user's degree of interest in each target object, it is useful to augment user profiles with an additional associative attribute indicating the user's degree of interest in each cluster in the hierarchical cluster tree.
  • This degree of interest may be estimated numerically as the number of subclusters or target objects the user has selected from menus associated with the given cluster or its subclusters, expressed as a proportion of the total number of subclusters or target objects the user has selected.
  • This associative attribute is particularly valuable if the hierarchical tree was built using "soft” or “fuzzy” clustering, which allows a subcluster or target object to appear in multiple clusters: if a target document appears in both the "sports" and the "humor” clusters, and the user selects it from a menu associated with the "humor” cluster, then the system increases its association between the user and the "humor” cluster but not its association between the user and the "sports” cluster.
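The proportion described above can be computed from a log of menu selections, as in this sketch. The log format (one entry per selection, naming the cluster whose menu was used) and the `parent` map are hypothetical; a selection is credited to the menu's cluster and to each of its ancestors, so that choices made in subcluster menus count toward the enclosing cluster.

```python
from collections import Counter


def cluster_interest(selections, parent):
    """Fraction of the user's selections made within each cluster.

    selections: list naming, for each selection, the cluster whose menu was used.
    parent: cluster -> parent cluster (None for the root).
    """
    counts = Counter()
    for menu_cluster in selections:
        node = menu_cluster
        while node is not None:  # credit the menu's cluster and all its ancestors
            counts[node] += 1
            node = parent.get(node)
    total = len(selections)
    return {c: n / total for c, n in counts.items()} if total else {}


parent = {"root": None, "humor": "root", "sports": "root", "satire": "humor"}
print(cluster_interest(["humor", "humor", "satire", "sports"], parent))
```

Because credit follows the menu actually used, a document that sits in both "sports" and "humor" under soft clustering raises only the "humor" association when it is picked from the "humor" menu.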
  • the basic automatic technique is simply to display the cluster's "characteristic value" for each of a few highly weighted attributes. With numeric attributes, this may be taken to mean the cluster's average value for that attribute: thus, if the "year of release" attribute is highly weighted in predicting which movies a user will like, then it is useful to display average year of release as part of each cluster's label. Thus the user sees that one cluster consists of movies that were released around 1962, while another consists of movies from around 1982.
  • the system can display the attribute's value for the cluster member (target object) whose profile is most similar to the cluster's profile (the mean profile for all members of the cluster), for example, the title of the most typical movie in the cluster.
  • a useful technique is to select those terms for which the amount by which the term's average TF/IDF score across members of the cluster exceeds the term's average TF/IDF score across all target objects is greatest, either in absolute terms or else as a fraction of the standard deviation of the term's TF/IDF score across all target objects.
  • the selected terms are replaced with their morphological stems, eliminating duplicates (so that if both "slept" and "sleeping" were selected, they would be replaced by the single term "sleep") and optionally eliminating close synonyms or collocates (so that if both "nurse" and "medical" were selected, they might both be replaced by a single term such as "nurse," "medical," "medicine," or "hospital").
  • the resulting set of terms is displayed as part of the label.
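The term-selection rule for labels can be sketched as below. Documents are represented as dictionaries of precomputed TF/IDF scores, and the stem table passed to `dedupe_by_stem` is a hand-written stand-in for a real morphological stemmer; both are assumptions of this example.

```python
import statistics


def label_terms(cluster_docs, all_docs, n=3, normalize=False):
    """Terms whose mean TF/IDF in the cluster most exceeds the global mean.

    With normalize=True the excess is divided by the term's standard
    deviation across all target objects.
    """
    vocab = sorted({term for doc in all_docs for term in doc})

    def mean(docs, term):
        return sum(doc.get(term, 0.0) for doc in docs) / len(docs)

    scores = {}
    for term in vocab:
        excess = mean(cluster_docs, term) - mean(all_docs, term)
        if normalize:
            sd = statistics.pstdev([doc.get(term, 0.0) for doc in all_docs])
            excess = excess / sd if sd else 0.0
        scores[term] = excess
    return sorted(vocab, key=lambda term: (-scores[term], term))[:n]


def dedupe_by_stem(terms, stem):
    """Replace terms by their stems, keeping the first occurrence of each stem."""
    seen, out = set(), []
    for term in terms:
        s = stem.get(term, term)
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out


docs = [
    {"goal": 0.9, "news": 0.5},
    {"goal": 0.7, "news": 0.5},
    {"vote": 0.8, "news": 0.5},
    {"vote": 0.6, "news": 0.5},
]
print(label_terms(docs[:2], docs, n=2))
```

A term such as "news" that is equally common everywhere gets an excess of zero, so it is outranked by terms that actually distinguish the cluster.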
  • the system can display as part of the label the image or images whose associated target objects have target profiles most similar to the cluster profile.
  • Users' navigational patterns may provide some useful feedback as to the quality of the labels.
  • this may signal that the first cluster's label is misleading.
  • If other terms and attributes can provide "next-best" alternative labels for the first cluster, such "next-best" labels can be automatically substituted for the misleading label.
  • any user can locally relabel a cluster for his or her own convenience.
  • Although a cluster label provided by a user is in general visible only to that user, it is possible to make global use of these labels via a "user labels" textual attribute for target objects, which attribute is defined for a given target object to be the concatenation of all labels provided by any user for any cluster containing that target object.
  • This attribute influences similarity judgments: for example, it may induce the system to regard target articles in a cluster often labeled "Sports News" by users as being mildly similar to articles in an otherwise dissimilar cluster often labeled "International News" by users, precisely because the "user labels" attribute in each cluster profile is strongly associated with the term "News."
  • the "user label” attribute is also used in the automatic generation of labels, just as other textual attributes are, so that if the user-generated labels for a cluster often include "Sports,” the term “Sports” may be included in the automatically generated label as well.
  • it is not necessary for menus to be displayed as simple lists of labeled options; it is possible to display or print a menu in a form that shows in more detail the relation of the different menu options to each other.
  • the menu options are visually laid out in two dimensions or in a perspective drawing of three dimensions. Each option is displayed or printed as a textual or graphical label.
  • the physical coordinates at which the options are displayed or printed are generated by the following sequence of steps: (1) construct for each option the cluster profile of the cluster it represents, (2) construct from each cluster profile its decomposition into a numeric vector, as described above, (3) apply singular value decomposition (SVD) to determine the set of two or three orthogonal linear axes along which these numeric vectors are most greatly differentiated, and (4) take the coordinates of each option to be the projected coordinates of that option's numeric vector along said axes.
  • Step (3) may be varied to determine a set of, say, 6 axes, so that step (4) lays out the options in a 6-dimensional space; in this case the user may view the geometric projection of the 6-dimensional layout onto any plane passing through the origin, and may rotate this viewing plane in order to see differing configurations of the options, which emphasize similarity with respect to differing attributes in the profiles of the associated clusters.
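Steps (3) and (4) of the layout procedure can be sketched in plain Python. Instead of calling a library SVD routine, this example substitutes power iteration with deflation on the Gram matrix of the vectors, which yields the same leading right singular vectors; centering the vectors first is an additional assumption, as are the function names and the iteration count.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def layout(vectors, dims=2, iters=200):
    """Project numeric cluster-profile vectors onto their `dims` leading axes.

    Power iteration with deflation on the Gram matrix of the centered vectors
    stands in for a library SVD (a substitution made for this sketch).
    """
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    gram = [[sum(r[a] * r[b] for r in centered) for b in range(d)] for a in range(d)]
    axes = []
    for _axis in range(dims):
        x = [1.0 + 0.01 * j for j in range(d)]  # arbitrary fixed start vector
        for _step in range(iters):
            for ax in axes:  # remove components along axes already found
                p = dot(x, ax)
                x = [xi - p * ai for xi, ai in zip(x, ax)]
            y = [dot(row, x) for row in gram]
            norm = math.sqrt(dot(y, y))
            if norm < 1e-12:  # degenerate: no variation left in any direction
                break
            x = [yi / norm for yi in y]
        axes.append(x)
    coords = [[dot(row, ax) for ax in axes] for row in centered]
    return axes, coords


vectors = [[0.0, 0.0, 0.0], [10.0, 1.0, 0.0], [20.0, 0.0, 0.0], [30.0, 1.0, 0.0]]
axes, coords = layout(vectors)
print(coords)
```

Passing a larger `dims` gives the higher-dimensional layout of the variation; the sign of each axis is arbitrary, so a display may be mirrored without changing the distances between options.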
  • the sizes of the cluster labels can be varied according to the number of objects contained in the corresponding clusters.
  • all options from the parent menu are displayed in some number of dimensions, as just described, but with the option corresponding to the current menu replaced by a more prominent subdisplay of the options on the current menu; optionally, the scale of this composite display may be gradually increased over time, thereby increasing the area of the screen devoted to showing the options on the current menu, and giving the visual impression that the user is regarding the parent cluster and "zooming in" on the current cluster and its subclusters.
  • a hierarchical cluster-tree may be configured with multiple cluster selections branching from each node or the same labeled clusters presented in the form of single branches for multiple nodes ordered in a hierarchy.
  • the user is able to perform lateral navigation between neighboring clusters as well, by requesting that the system search for a cluster whose cluster profile resembles the cluster profile of the currently selected cluster. If this type of navigation is performed at the level of individual objects (leaf ends), then automatic hyperlinks may be then created as navigation occurs. This is one way that nearest neighbor clustering navigation may be performed. For example, in a domain where target objects are home pages on the World Wide Web, a collection of such pages could be laterally linked to create a "virtual mall."
  • the simplest way to use the automatic menuing system described above is for the user to begin browsing at the top of the tree and moving to more specific subclusters.
  • the user optionally provides a query consisting of textual and/or other attributes, from which query the system constructs a profile in the manner described herein, optionally altering textual attributes as described herein before decomposing them into numeric attributes.
  • Query profiles are similar to the search profiles in a user's search profile set, except that their attributes are explicitly specified by a user, most often for one-time usage, and unlike search profiles, they are not automatically updated to reflect changing interests.
  • a typical query in the domain of text articles might have "Tell me about the relation between Galileo and the Medici family" as the value of its "text of article” attribute, and 8 as the value of its "reading difficulty” attribute (that is, 8th-grade level).
  • the system uses the method of section "Searching for Target Objects" above to automatically locate a small set of one or more clusters with profiles similar to the query profile; for example, clusters whose articles are written at roughly an 8th-grade level and tend to mention Galileo and the Medicis. The user may start browsing at any of these clusters, and can move from it to subclusters, superclusters, and other nearby clusters.
  • each question-answer pair may be profiled with two separate textual attributes, one for the question and one for the answer.
  • a query might then locate a cluster by specifying only the question attribute, or for completeness, both the question attribute and the (lower-weighted) answer attribute, to be the text "Tell me about the relation between Galileo and the Medici family.”
  • the filtering technology described earlier can also aid the user in navigating among the target objects.
  • When the system presents the user with a menu of subclusters of a cluster C of target objects, it can simultaneously present an additional menu of the most interesting target objects in cluster C, so that the user has the choice of accessing a subcluster or directly accessing one of the target objects.
  • If this additional menu lists n target objects, then for each i between 1 and n inclusive, in increasing order, the i-th most prominent choice on this additional menu, which choice is denoted Top(C,i), is found by considering all target objects in cluster C that are further than a threshold distance t from all of Top(C,1), Top(C,2), ... Top(C,i-1), and selecting the one in which the user's interest is estimated to be highest. If the threshold distance t is 0, then the menu resulting from this procedure simply displays the n most interesting objects in cluster C, but the threshold distance may be increased to achieve more variety in the target objects displayed. Generally the threshold distance t is chosen to be an affine function or other function of the cluster variance or cluster diameter of the cluster C.
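The Top(C,i) selection is a greedy procedure and can be sketched directly. Interest estimates are taken as given, and Euclidean distance between target profiles is an assumption; the names below are hypothetical.

```python
import math


def top_choices(objects, interest, n, t):
    """Greedy Top(C,1..n): highest-interest objects, each further than t from earlier picks.

    objects: name -> numeric target profile for the objects in cluster C.
    interest: name -> estimated user interest.
    """
    chosen = []
    for _rank in range(n):
        candidates = [
            name
            for name in objects
            if name not in chosen
            and all(math.dist(objects[name], objects[c]) > t for c in chosen)
        ]
        if not candidates:  # fewer than n sufficiently distinct objects exist
            break
        chosen.append(max(candidates, key=lambda name: interest[name]))
    return chosen


objects = {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [5.0, 5.0], "d": [9.0, 9.0]}
interest = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4}
print(top_choices(objects, interest, n=2, t=1.0))
```

With t = 0 and distinct profiles the result is simply the n most interesting objects; raising t drops near-duplicates such as "b" in favor of more varied choices.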
  • the user U can "masquerade" as another user V, such as a prominent intellectual or a celebrity supermodel; as long as user U is masquerading as user V, the filtering technology will recommend articles not according to user U's preferences, but rather according to user V's preferences.
  • If user U has access to the user-specific data of user V, for example because user V has leased these data to user U for a financial consideration, then user U can masquerade as user V by instructing user U's proxy server S to temporarily substitute user V's user profile and target profile interest summary for user U's.
  • user U has access to an average user profile and a composite target profile interest summary for a group G of users; by instructing proxy server S to substitute these for user U's user-specific data, user U can masquerade as a typical member of group G, as is useful in exploring group preferences for sociological, political, or market research. More generally, user U may "partially masquerade" as another user V or group G, by instructing proxy server S to temporarily replace user U's user-specific data with a weighted average of user U's user-specific data and the user-specific data for user V and group G.
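A partial masquerade amounts to a weighted average of profiles. The sketch below assumes the user-specific data have already been reduced to numeric vectors of equal length; the function name and the normalization of weights are choices made for this illustration.

```python
def blend_profiles(profiles, weights):
    """Weighted average of numeric profile vectors; weights need not sum to 1."""
    total = sum(weights)
    return [
        sum(w * p[j] for p, w in zip(profiles, weights)) / total
        for j in range(len(profiles[0]))
    ]


user_u = [0.0, 0.0]
user_v = [10.0, 20.0]
# Keep 75% of user U's own profile and borrow 25% from user V.
print(blend_profiles([user_u, user_v], [0.75, 0.25]))
```

Setting the weights to 0 and 1 reproduces a full masquerade as user V, and adding a third vector for a group G's average profile covers the combined case mentioned in the text.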
  • the hierarchical menu presented to the user for the user's navigation need not be exactly isomorphic to the cluster tree.
  • the menu is typically a somewhat modified version of the cluster tree, reorganized manually or automatically so that the clusters most interesting to a user are easily accessible by the user.
  • the system first attempts automatically to identify existing clusters that are of interest to the user.
  • the system may identify a cluster as interesting because the user often accesses target objects in that cluster — or, in a more sophisticated variation, because the user is predicted to have high interest in the cluster's profile, using the methods disclosed herein for estimating interest from relevance feedback.
  • the system can at the user's request or at all times display a special list of the most interesting clusters, or the most interesting subclusters of the current cluster, so that the user can select one of these clusters based on its label and jump directly to it.
  • the I-th most prominent choice on the list, which choice is denoted Top(I), is found by considering all appropriate clusters C that are further than a threshold distance t from all of Top(1), Top(2), ... Top(I-1), and selecting the one in which the user's interest is estimated to be highest.
  • the threshold distance t is optionally dependent on the computed cluster variance or cluster diameter of the profiles in the latter cluster.
  • menus can be reorganized so that the most interesting subcluster choices appear earliest on the menu, or are visually marked as interesting; for example, their labels are displayed in a special color or type face, or are displayed together with a number or graphical image indicating the likely level of interest.
  • interesting clusters can be moved to menus higher in the tree, i.e., closer to the root of the tree, so that they are easier to access if the user starts browsing at the root of the tree.
  • uninteresting clusters can be moved to menus lower in the tree, to make room for interesting clusters that are being moved higher.
  • clusters with an especially low interest score can simply be suppressed from the menus; thus, a user with children may assign an extremely negative weight to the "vulgarity" attribute in the determination of q, so that vulgar clusters and documents will not be available at all.
  • a customized tree develops that can be more efficiently navigated by the particular user. If menus are chosen so that each menu item is chosen with approximately equal probability, then the expected number of choices the user has to make is minimized. If, for example, a user frequently accessed target objects whose profiles resembled the cluster profile of cluster (a, b, d) in Figure 8 then the menu in Figure 9 could be modified to show the structure illustrated in Figure 10.
  • In the variation where the general techniques disclosed herein for estimating a user's interest from relevance feedback are used to identify interesting clusters, it is possible for a user U to supply "temporary relevance feedback" to indicate a temporary interest that is added to his or her usual interests. This is done by entering a query as described above, i.e., a set of textual and other attributes that closely match the user's interests of the moment. This query becomes "active," and affects the system's determination of interest in either of two ways. In one approach, an active query is treated as if it were any other target object, and by virtue of being a query, it is taken to have received relevance feedback that indicates especially high interest.
  • target objects X whose target profiles are similar to an active query's profile are simply considered to have higher quality q(U, X), in that q(U, X) is incremented by a term that increases with target object X's similarity to the query profile.
  • Either strategy affects the usual interest estimates: clusters that match user U's usual interests (and have high quality q(*)) are still considered to be of interest, and clusters whose profiles are similar to an active query are adjudged to have especially high interest. Clusters that are similar to both the query and the user's usual interests are most interesting of all. The user may modify or deactivate an active query at any time while browsing.
  • the user may replace or augment the original (perhaps vague) query profile with the target profile of target object or cluster X, thereby amplifying or refining the original query to indicate a particular interest in objects similar to X.
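The second strategy, incrementing q(U, X) by a term that grows with similarity to the active query, could look roughly like the following. Cosine similarity and the `boost` weight are assumptions chosen for illustration; the text only requires that the increment increase with similarity.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two profile vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def boosted_quality(base_quality, target_profile, query_profile=None, boost=1.0):
    """Return q(U, X) plus a non-negative term that grows with similarity
    to the active query. With no active query, quality is unchanged."""
    if query_profile is None:
        return base_quality
    similarity = max(0.0, cosine_similarity(target_profile, query_profile))
    return base_quality + boost * similarity
```

Deactivating the query corresponds to passing `query_profile=None`, which restores the user's usual quality estimate.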
  • the user is browsing through documents, and specifies an initial query containing the word "Lloyd's,” so that the system predicts documents containing the word "Lloyd's" to be more interesting and makes them more easily accessible, even to the point of listing such documents or clusters of such documents, as described above.
  • the association score of target object X with a particular query term T is defined to be the mean relevance feedback on target object X, averaged over just those accesses of target object X that were made while a query containing term T was active, multiplied by the negated logarithm of term T's global frequency in all queries.
  • the effect of this associative attribute is to increase the measured similarity of two documents if they are good responses to queries that contain the same terms.
  • a further maneuver can be used to improve the accuracy of responses to a query: in the summation used to determine the quality q(U, X) of a target object X, a term is included that is proportional to the sum of association scores between target object X and each term in the active query, if any, so that target objects that are closely associated with terms in an active query are determined to have higher quality and therefore higher interest for the user.
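The association score defined above can be computed directly from an access log. This sketch assumes each log entry is a tuple of (target id, set of active query terms, relevance feedback) and that a term's "global frequency" is the fraction of all queries containing it; both representations, and the function names, are assumptions.

```python
import math


def association_score(accesses, target, term, term_query_frequency):
    """Mean feedback on `target` over accesses made while a query containing
    `term` was active, times the negated log of the term's query frequency."""
    if not 0.0 < term_query_frequency <= 1.0:
        raise ValueError("frequency must lie in (0, 1]")
    feedback = [fb for tid, terms, fb in accesses if tid == target and term in terms]
    if not feedback:
        return 0.0  # no evidence linking this target to this term
    mean_feedback = sum(feedback) / len(feedback)
    return mean_feedback * -math.log(term_query_frequency)


def query_association_bonus(accesses, target, query_terms, frequencies, weight=1.0):
    """Term added to q(U, X): proportional to the sum of association scores
    between the target and each term of the active query."""
    return weight * sum(
        association_score(accesses, target, t, frequencies[t]) for t in query_terms
    )


# Hypothetical access log.
log = [
    ("X", {"lloyd's", "insurance"}, 0.8),
    ("X", {"lloyd's"}, 0.6),
    ("X", {"bank"}, 0.1),
    ("Y", {"lloyd's"}, 0.9),
]
score = association_score(log, "X", "lloyd's", 0.01)
```

The negated logarithm plays the same role as inverse document frequency: rare query terms contribute more to the score than common ones.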
  • the user can be given the ability to reorganize the tree manually, as he or she sees fit. Any changes are optionally saved on the user's local storage device so that they will affect the presentation of the tree in future sessions.
  • the user can choose to move or copy menu options to other menus, so that useful clusters can thereafter be chosen directly from the root menu of the tree or from other easily accessed or topically appropriate menus.
  • the user can select clusters C1, C2, ... Ck listed on a particular menu M and choose to remove these clusters from the menu, replacing them on the menu with a single aggregate cluster M' containing all the target objects from clusters C1, C2, ... Ck.
  • the immediate subclusters of new cluster M' are either taken to be clusters C1, C2, ... Ck themselves, or else, in a variation similar to the "scatter-gather" method, are automatically computed by clustering the set of all the subclusters of clusters C1, C2, ... Ck according to the similarity of the cluster profiles of these subclusters.
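The simpler of the two aggregation variants, in which the selected clusters themselves become the immediate subclusters of M', might be sketched as follows. The `Cluster` class and the `aggregate` function are hypothetical; the scatter-gather variant, which would re-cluster the subclusters, is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    label: str
    objects: set = field(default_factory=set)
    subclusters: list = field(default_factory=list)


def aggregate(menu, selected_labels, new_label):
    """Replace the selected menu entries with one aggregate cluster whose
    immediate subclusters are the selected clusters themselves."""
    picked = [c for c in menu if c.label in selected_labels]
    merged = Cluster(
        label=new_label,
        objects=set().union(*(c.objects for c in picked)),
        subclusters=picked,
    )
    remaining = [c for c in menu if c.label not in selected_labels]
    return remaining + [merged]


# Hypothetical menu M with three clusters.
menu = [
    Cluster("soaps", {"ivory"}),
    Cluster("shampoos", {"breck"}),
    Cluster("news", {"n1"}),
]
new_menu = aggregate(menu, {"soaps", "shampoos"}, "toiletries")
```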
  • the browsing techniques described above may be applied to a domain where the target objects are purchasable goods.
  • the target objects are purchasable goods.
  • the current practice is to use hand-crafted menus and sub-menus in which similar items are grouped together. It is possible to use the automated clustering and browsing methods described above to more effectively group and present the items.
  • Purchasable items can be hierarchically clustered using a plurality of different criteria.
  • Useful attributes for a purchasable item include but are not limited to a textual description and predefined category labels (if available), the unit price of the item, and an associative attribute listing the users who have bought this item in the past. Also useful is an associative attribute indicating which other items are often bought on the same shopping "trip" as this item; items that are often bought on the same trip will be judged similar with respect to this attribute, and so tend to be grouped together. Retailers may be interested in utilizing a similar technique for purposes of predicting both the nature and relative quantity of items which are likely to be popular with their particular clientele. This prediction may be made by using aggregate purchasing records as the search profile set from which a collection of target objects is recommended. Estimated customer demand, which is indicative of (relative) inventory quantity for each target object item, is determined by measuring the cluster variance of that item compared to another target object item (which is in stock).
  • hierarchically clustering the purchasable target objects results in a hierarchical menu system, in which the target objects or clusters of target objects that appear on each menu can be labeled by names or icons and displayed in a two-dimensional or three-dimensional menu in which similar items are displayed physically near each other or on the same graphically represented "shelf.”
  • this grouping occurs both at the level of specific items (such as standard size Ivory soap or large Breck shampoo) and at the level of classes of items (such as soaps and shampoos).
  • Non-purchasable objects such as artwork, advertisements, and free samples may also be added to a display of purchasable objects, if they are associated with (liked by) substantially the same users as are the purchasable objects in the display.
  • the files associated with target objects are typically distributed across a large number of different servers S1-So and clients C1-Cn. Each file has been entered into the data storage medium at some server or client in any one of a number of ways, including, but not limited to: scanning, keyboard input, e-mail, FTP transmission, and automatic synthesis from another file under the control of another computer program. While a system to enable users to efficiently locate target objects may store its hierarchical cluster tree on a single centralized machine, greater efficiency can be achieved if the storage of the hierarchical cluster tree is distributed across many machines in the network.
  • Each cluster C, including single-member clusters (target objects), is digitally represented by a file F, which is multicast to a topical multicast tree MT(C1); here cluster C1 is either cluster C itself or some supercluster of cluster C.
  • file F is stored at multiple servers, for redundancy.
  • the file F that represents cluster C contains at least the following data: 1. The cluster profile for cluster C, or data sufficient to reconstruct this cluster profile. 2. The number of target objects contained in cluster C. 3. A human-readable label for cluster C, as described in section "Labeling Clusters" above. 4. If the cluster is divided into subclusters, a list of pointers to files representing the subclusters.
  • Each pointer is an ordered pair naming, first, a file, and second, a multicast tree or a specific server where that file is stored. 5. If the cluster consists of a single target object, a pointer to the file corresponding to that target object.
  • a client machine can retrieve the file F from the multicast tree MT(C1).
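The five items stored in file F map naturally onto a small record type. The sketch below is one possible in-memory layout; the class and field names are assumptions, and the cluster profile is shown as a plain dictionary for simplicity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Pointer:
    """Ordered pair: a file name, then the multicast tree or server holding it."""
    file_name: str
    location: str


@dataclass
class ClusterFile:
    profile: dict                  # item 1: cluster profile (or data to rebuild it)
    object_count: int              # item 2: number of target objects in the cluster
    label: str                     # item 3: human-readable label
    subcluster_pointers: list = field(default_factory=list)  # item 4
    target_object_pointer: Optional[Pointer] = None          # item 5

    def is_single_object(self):
        """True when the cluster consists of a single target object."""
        return self.target_object_pointer is not None


# Hypothetical leaf cluster and its parent.
leaf = ClusterFile(
    profile={"price": 1.0},
    object_count=1,
    label="Ivory soap",
    target_object_pointer=Pointer("ivory.obj", "server-1"),
)
parent = ClusterFile(
    profile={"price": 1.5},
    object_count=2,
    label="soaps",
    subcluster_pointers=[Pointer("ivory.cluster", "MT(soaps)")],
)
```

A client that retrieves such a record has what it needs to display a labeled menu of subclusters and to fetch whichever one the user selects next.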
  • the client can perform further tasks pertaining to this cluster, such as displaying a labeled menu of subclusters, from which the user may select subclusters for the client to retrieve next.
  • the advantage of this distributed implementation is threefold.
  • the system can be scaled to larger cluster sizes and numbers of target objects, since much more searching and data retrieval can be carried out concurrently.
  • the system is fault-tolerant in that partial matching can be achieved even if portions of the system are temporarily unavailable.
  • the distributed hierarchical cluster tree can be created in a distributed fashion, that is, with the participation of many processors. Indeed, in most applications it should be recreated from time to time, because as users interact with target objects, the associative attributes in the target profiles of the target objects change to reflect these interactions; the system's similarity measurements can therefore take these interactions into account when judging similarity, which allows a more perspicuous cluster tree to be built.
  • the key technique is the following procedure for merging n disjoint cluster trees, represented respectively by files F1...Fn in distributed fashion as described above, into a combined cluster tree that contains all the target objects from all these trees.
  • the files F1...Fn are as described above, except that the cluster labels are not included in the representation.
  • the following steps are executed by a server S1, in response to a request message from another server S0, which request message includes pointers to the files F1...Fn. 1. Retrieve files F1...Fn. 2. Let L and M be empty lists. 3. For each file Fi from among F1...Fn: 4. If file Fi contains pointers to subcluster files, add these pointers to list L. 5. If file Fi represents a single target object, add a pointer to file Fi to list L. 6. For each pointer X on list L, retrieve the file that pointer X points to and extract the cluster profile P(X) that this file stores. 7. Apply a clustering algorithm to group the pointers X on list L according to the distances between their respective cluster profiles P(X). 8.
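Steps 1 through 7 of the merging procedure can be simulated in memory. The sketch below is only an illustration under several assumptions: files live in a dictionary `store` keyed by name, profiles are numeric vectors, and the clustering algorithm of step 7 is replaced by a simple greedy threshold grouping. List M and the steps after step 7 are not modeled, since they are not given above.

```python
import math


def distance(p, q):
    """Euclidean distance between two cluster profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def gather_and_group(store, file_names, threshold):
    # Steps 1-5: collect pointers to subcluster files, or to the file
    # itself when it represents a single target object.
    pointers = []
    for name in file_names:
        f = store[name]
        if f["subclusters"]:
            pointers.extend(f["subclusters"])
        else:
            pointers.append(name)
    # Step 6: retrieve each pointed-to file and extract its profile P(X).
    profiles = {x: store[x]["profile"] for x in pointers}
    # Step 7 (stand-in): put each pointer into the first group that already
    # holds a profile within `threshold`, else start a new group.
    groups = []
    for x in pointers:
        for g in groups:
            if any(distance(profiles[x], profiles[y]) <= threshold for y in g):
                g.append(x)
                break
        else:
            groups.append([x])
    return groups


# Hypothetical store: F1 has two subclusters, F2 is a single target object.
store = {
    "F1": {"profile": [0.25, 0.0], "subclusters": ["a", "b"]},
    "F2": {"profile": [5.0, 5.0], "subclusters": []},
    "a": {"profile": [0.0, 0.0], "subclusters": []},
    "b": {"profile": [0.5, 0.0], "subclusters": []},
}
groups = gather_and_group(store, ["F1", "F2"], threshold=1.0)
```

Any standard clustering method could be substituted for the grouping loop; the greedy pass is used here only to keep the example short.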
  • the distributed hierarchical cluster tree for a particular domain of target objects is constructed by merging many local hierarchical cluster trees, as follows. 1.
  • One server S (preferably one with good connectivity) is elected from the tree.
  • Server S sends itself a global request message that causes each proxy server in MT ⁇ (that is, each proxy server in the network) to ask its clients for files for the cluster tree.
  • the clients of each proxy server transmit to the proxy server any files that they maintain, which files represent target objects from the appropriate domain that should be added to the cluster tree. 4.
  • Server S forms a request R1 that, upon receipt, will cause the recipient server S1 to take the following actions: (a) Build a hierarchical cluster tree of all the files stored on server S1 that are maintained by users in the user base of S1. These files correspond to target objects from the appropriate domain. This cluster tree is typically stored entirely on S1, but may in principle be stored in a distributed fashion.
  • This reply includes a pointer to a file F that represents the completed hierarchical cluster tree.
  • Server S multicasts file F to all proxy servers in MT ⁇ .
  • server S can send additional messages through the cluster tree, to arrange that multicast trees MT(C) are created for sufficiently large clusters C, and that each file F is multicast to the tree MT(C), where C is the smallest cluster containing file F.
  • Internet bulletin boards (also termed newsgroups)
  • Internet mailing lists and private bulletin board services (BBS's)
  • the system for customized electronic identification of desirable objects described herein can of course function as a browser for bulletin boards, where target objects are taken to be bulletin boards, or subtopics of bulletin boards, and each target profile is the cluster profile for a cluster of documents posted on some bulletin board.
  • a user can locate bulletin boards of interest by all the navigational techniques described above, including browsing and querying.
  • this method only serves to locate existing virtual communities. Because people have varied and varying complex interests, it is desirable to automatically locate groups of people with common interests in order to form virtual communities.
  • VCS Virtual Community Service
  • the Virtual Community Service described below is a network-based agent that seeks out users of a network with common interests, dynamically creates bulletin boards or electronic mailing lists for those users, and introduces them to each other electronically via e-mail. It is useful to note that once virtual communities have been created by VCS, the other browsing and filtering technologies described above can subsequently be used to help a user locate particular virtual communities (whether pre-existing or automatically generated by VCS); similarly, since the messages sent to a given virtual community may vary in interest and urgency for a user who has joined that community, these browsing and filtering technologies (such as the e-mail filter) can also be used to alert the user to urgent messages and to screen out uninteresting ones.
  • the functions of the Virtual Community Service are general functions that could be implemented on any network ranging from an office network in a small company to the World Wide Web or the Internet.
  • the four main steps in the procedure are: 1. Scan postings to existing virtual communities. 2. Identify groups of users with common interests. 3. Match users with virtual communities, creating new virtual communities when necessary. 4. Continue to enroll additional users in the existing virtual communities.
  • users may post messages to virtual communities pseudonymously, even employing different pseudonyms for different virtual communities.
  • Posts not employing a pseudonymous mix path may, as usual, be considered to be posts employing a non-secure pseudonym, namely the user's true network address.
  • the above steps may be expressed more generally as follows 1. Scan pseudonymous postings to existing virtual communities. 2. Identify groups of pseudonyms whose associated users have common interests. 3. Match pseudonymous users with virtual communities, creating new virtual communities when necessary. 4. Continue to enroll additional pseudonymous users in the existing virtual communities.
  • Virtual Community Service constantly scans all the messages posted to all the newsgroups and electronic mailing lists on a given network, and constructs a target profile for each message found.
  • the network can be the Intemet, or a set of bulletin boards maintained by America Online, Prodigy, or CompuServe, or a smaller set of bulletin boards that might be local to a single organization, for example a large company, a law firm, or a university.
  • the scanning activity need not be confined to bulletin boards and mailing lists that were created by Virtual Community Service, but may also be used to scan the activity of communities that predate Virtual Community Service or are otherwise created by means outside the Virtual Community Service system, provided that these communities are public or otherwise grant their permission.
  • the target profile of each message includes textual attributes specifying the title and body text of the message. In the case of a spoken rather than written message, the latter attribute may be computed from the acoustic speech data by using a speech recognition system.
  • the target profile also includes an associative attribute listing the author(s) and designated recipient(s) of the message, where the recipients may be individuals and/or entire virtual communities; if this attribute is highly weighted, then the system tends to regard messages among the same set of people as being similar or related, even if the topical similarity of the messages is not clear from their content, as may happen when some of the messages are very short.
  • Other important attributes include the fraction of the message that consists of quoted material from previous messages, as well as attributes that are generally useful in characterizing documents, such as the message's date, length, and reading level.
  • Virtual Community Service attempts to identify groups of pseudonymous users with common interests. These groups, herein termed "pre-communities,” are represented as sets of pseudonyms. Whenever Virtual Community Service identifies a pre-community, it will subsequently attempt to put the users in said pre-community in contact with each other, as described below. Each pre-community is said to be "determined” by a cluster of messages, pseudonymous users, search profiles, or target objects. In the usual method for determining pre-communities, Virtual Community Service clusters the messages that were scanned and profiled in the above step, based on the similarity of those messages' computed target profiles, thus automatically finding threads of discussion that show common interests among the users.
  • Each cluster of messages that is found by Virtual Community Service and that is of sufficient size determines a pre-community whose members are the pseudonymous authors and recipients of the messages in the cluster. More precisely, the pre-community consists of the various pseudonyms under which the messages in the cluster were sent and received.
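Once messages have been clustered, deriving the pre-communities is a matter of collecting pseudonyms. A minimal sketch, assuming each message is a dictionary with `author` and `recipients` fields and that "sufficient size" means a minimum message count (both assumptions):

```python
def pre_communities(message_clusters, min_messages):
    """Return one set of pseudonyms per sufficiently large message cluster."""
    result = []
    for cluster in message_clusters:
        if len(cluster) < min_messages:
            continue  # cluster too small to determine a pre-community
        members = set()
        for message in cluster:
            members.add(message["author"])
            members.update(message["recipients"])
        result.append(members)
    return result


# Hypothetical clusters of already-profiled messages.
clusters = [
    [
        {"author": "p1", "recipients": ["p2"]},
        {"author": "p2", "recipients": ["p1", "p3"]},
    ],
    [{"author": "p9", "recipients": []}],
]
communities = pre_communities(clusters, min_messages=2)
```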
  • Pre-communities can be generated by grouping together users who have similar interests of any sort, not merely individuals who have already written or received messages about similar topics. If the user profile associated with each pseudonym indicates the user's interests, for example through an associative attribute that indicates the documents or Web sites a user likes, then pseudonyms can be clustered based on the similarity of their associated user profiles, and each of the resulting clusters of pseudonyms determines a pre-community comprising the pseudonyms in the cluster.
  • 2. If each pseudonym has an associated search profile set formed through participation in the news clipping service described above, then all search profiles of all pseudonymous users can be clustered based on their similarity, and each cluster of search profiles determines a pre-community whose members are the pseudonyms from whose search profile sets the search profiles in the cluster are drawn. Such groups of people have been reading about the same topic (or, more generally, accessing similar target objects) and so presumably share an interest. 3. If users participate in a news clipping service or any other filtering or browsing system for target objects, then an individual user can pseudonymously request the formation of a virtual community to discuss a particular cluster of one or more target objects known to that system.
  • This cluster of target objects determines a pre-community consisting of the pseudonyms of users determined to be most interested in that cluster (for example, users who have search profiles similar to the cluster profile), together with the pseudonym of the user who requested formation of the virtual community.
  • Matching Users with Communities
  • Once Virtual Community Service identifies a cluster C of messages, users, search profiles, or target objects that determines a pre-community M, it attempts to arrange for the members of this pre-community to have the chance to participate in a common virtual community V.
  • an existing virtual community V may suit the needs of the pre-community M. Virtual Community Service first attempts to find such an existing community V.
  • in the case where cluster C is a cluster of messages, V may be chosen to be any existing virtual community such that the cluster profile of cluster C is within a threshold distance of the mean profile of the set of messages recently posted to virtual community V; in the case where cluster C is a cluster of users, V may be chosen to be any existing virtual community such that the cluster profile of cluster C is within a threshold distance of the mean user profile of the active members of virtual community V; in the case where cluster C is a cluster of search profiles, V may be chosen to be any existing virtual community such that the cluster profile of cluster C is within a threshold distance of the cluster profile of the largest cluster resulting from clustering all the search profiles of active members of virtual community V; and in the case where cluster C is a cluster of one or more target objects chosen from a separate browsing or filtering system, V may be chosen to be any existing virtual community initiated in the same way from a cluster whose cluster profile in that other system is within a threshold distance of the cluster profile of cluster C.
  • the threshold distance used in each case is optionally dependent on the cluster variance or cluster
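For the message-cluster case, the matching test compares the cluster profile with the mean profile of recent messages in each community. The sketch below assumes numeric profile vectors and Euclidean distance, and it returns the nearest qualifying community, although the text permits any community within the threshold; the function names are hypothetical.

```python
import math


def mean_profile(profiles):
    """Component-wise mean of a non-empty list of profile vectors."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def find_community(cluster_profile, recent_messages_by_community, threshold):
    """Name of the nearest community whose mean recent-message profile lies
    within `threshold` of the cluster profile, or None if there is none."""
    best_name, best_distance = None, None
    for name, profiles in recent_messages_by_community.items():
        if not profiles:
            continue  # no recent messages to compare against
        d = math.dist(cluster_profile, mean_profile(profiles))
        if d <= threshold and (best_distance is None or d < best_distance):
            best_name, best_distance = name, d
    return best_name


# Hypothetical communities with profiles of their recent messages.
recent = {
    "sailing": [[1.0, 0.0], [0.8, 0.2]],
    "chess": [[0.0, 1.0]],
}
match = find_community([0.9, 0.1], recent, threshold=0.5)
```

A result of `None` corresponds to the situation in which no existing community suits the pre-community.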
  • If no suitable existing community is found, Virtual Community Service attempts to create a new virtual community V. Regardless of whether virtual community V is an existing community or a newly created community, Virtual Community Service sends an e-mail message to each pseudonym P in pre-community M whose associated user U does not already belong to virtual community V (under pseudonym P) and has not previously turned down a request to join virtual community V.
  • the e-mail message informs user U of the existence of virtual community V, and provides instructions which user U may follow in order to join virtual community V if desired; these instructions vary depending on whether virtual community V is an existing community or a new community.
  • the message includes a credential, granted to pseudonym P, which credential must be presented by user U upon joining the virtual community V, as proof that user U was actually invited to join. If user U wishes to join virtual community V under a different pseudonym Q, user U may first transfer the credential from pseudonym P to pseudonym Q, as described above.
  • the e-mail message further provides an indication of the common interests of the community, for example by including a list of titles of messages recently sent to the community, or a charter or introductory message provided by the community (if available), or a label generated by the methods described above that identifies the content of the cluster of messages, user profiles, search profiles, or target objects that was used to identify the pre-community M. If Virtual Community Service must create a new community V, several methods are available for enabling the members of the new community to communicate with each other.
  • If the pre-community M is large, Virtual Community Service typically establishes either a multicast tree, as described below, or a widely-distributed bulletin board, assigning a name to the new bulletin board. If the pre-community M has fewer members, for example 2-50, Virtual Community Service typically establishes either a multicast tree, as described below, or an e-mail mailing list. If the new virtual community V was determined by a cluster of messages, then Virtual Community Service kicks off the discussion by distributing these messages to all members of virtual community V.
  • Virtual Community Service continues to scan other virtual communities for new messages whose target profiles are similar to the community's cluster profile (average message profile). Copies of any such messages are sent to the new virtual community, and the pseudonymous authors of these messages, as well as users who show high interest in reading such messages, are informed by Virtual Community Service (as for pre-community members, above) that they may want to join the community. Each such user can then decide whether or not to join the community.
  • Virtual Community Service provides automatic creation of new virtual communities in any local or wide- area network, as well as maintenance of all virtual communities on the network, including those not created by Virtual Community Service.
  • the core technology underlying Virtual Community Service is creating a search and clustering mechanism that can find articles that are "similar" in that the users share interests. This is precisely what was described above. One must be sure that Virtual Community Service does not bombard users with notices about communities in which they have no real interest. On a very small network a human could be "in the loop", scanning proposed virtual communities and perhaps even giving them names. But on larger networks Virtual Community Service has to run in fully automatic mode, since it is likely to find a large number of virtual communities.
  • One method is for Virtual Community Service to establish a mailing list so that any member of the virtual community may distribute e-mail to all other members.
  • Another method of distribution is to use a conventional network bulletin board or newsgroup to distribute the messages to all servers in the network, where they can be accessed by any member of the virtual community.
  • these simple methods do not take into account cost and performance advantages which accrue from optimizing the construction of a multicast tree to carry messages to the virtual community.
  • a multicast tree distributes messages to only a selected set of servers, and unlike an e-mail mailing list, it does so efficiently.
  • a separate multicast tree MT(V) is maintained for each virtual community V, by use of the following four procedures. 1. To construct or reconstruct this multicast tree, the core servers for virtual community V are taken to be those proxy servers that serve at least one pseudonymous member of virtual community V. Then the multicast tree MT(V) is established via steps 4-6 in the section "Multicast Tree Construction Procedure" above. 2. When a new user joins virtual community V, which is an existing virtual community, the user sends a message to the user's proxy server S. If the user's proxy server S is not already a core server for V, then it is designated as a core server and is added to the multicast tree MT(V), as follows.
  • server S retrieves its locally stored list of nearby core servers for V, and chooses a server S1.
  • Server S sends a control message to S1, indicating that it would like to be added to the multicast tree MT(V).
  • server S1 retrieves its locally stored subtree G1 of MT(V), and forms a new graph G from G1 by removing all degree-1 vertices other than S1 itself.
  • Server S1 transmits graph G to server S, which stores it as its locally stored subtree of MT(V). Finally, server S sends a message to itself and to all servers that are vertices of graph G, instructing these servers to modify their locally stored subtrees of MT(V) by adding S as a vertex and adding an edge between S1 and S. 3.
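The procedure for adding server S to MT(V) can be simulated with adjacency sets. In this sketch each server's locally stored subtree is a dictionary mapping a vertex to the set of its neighbors, and all servers' subtrees are held in one dictionary `local_subtrees` so the message exchange can be shown as plain function calls; these representations and names are assumptions made for illustration.

```python
def prune_leaves(subtree, keep):
    """Copy of `subtree` without its degree-1 vertices, except `keep`."""
    leaves = {v for v, nbrs in subtree.items() if len(nbrs) == 1 and v != keep}
    return {v: nbrs - leaves for v, nbrs in subtree.items() if v not in leaves}


def add_server(subtree, new_server, attach_to):
    """Copy of `subtree` with `new_server` added and joined to `attach_to`."""
    updated = {v: set(nbrs) for v, nbrs in subtree.items()}
    updated.setdefault(new_server, set()).add(attach_to)
    updated.setdefault(attach_to, set()).add(new_server)
    return updated


def join(local_subtrees, s, s1):
    """Add server `s` to the multicast tree through nearby core server `s1`."""
    g = prune_leaves(local_subtrees[s1], keep=s1)  # S1 forms graph G from G1
    local_subtrees[s] = g                          # S stores G as its subtree
    for server in set(g) | {s}:                    # S and every vertex of G
        local_subtrees[server] = add_server(local_subtrees.get(server, {}), s, s1)
    return local_subtrees


# Hypothetical state: S1 knows the path S1-A-B and the edge S1-C.
local = {
    "S1": {"S1": {"A", "C"}, "A": {"S1", "B"}, "B": {"A"}, "C": {"S1"}},
    "A": {"A": {"S1", "B"}, "S1": {"A"}, "B": {"A"}},
}
join(local, "S", "S1")
```

Pruning the degree-1 vertices means the new server starts with a smaller view of the tree than S1 holds, so each server keeps only a local neighborhood rather than the whole of MT(V).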
  • client q embeds message F in a request R instructing the recipient to store message F locally, for a limited time, for access by members of virtual community V.
  • Request R includes a credential proving that the user is a member of virtual community V or is otherwise entitled to post messages to virtual community V (for example is not "black marked” by that or other virtual community members).
  • Client q then broadcasts request R to all core servers in the multicast tree MT(V), by means of a global request message transmitted to the user's proxy server as described above.
  • the core servers satisfy request R, provided that they can verify the included credential. 4.
  • a user U at client q initiates the steps described in section "Retrieving Files from a Multicast Tree," above.
  • If user U does not want to retrieve a particular message, but rather wants to retrieve all new messages sent to virtual community V, then user U pseudonymously instructs its proxy server (which is a core server for V) to send it all messages that were multicast to MT(V) after a certain date. In either case, user U must provide a credential proving user U to be a member of virtual community V, or otherwise entitled to access messages on virtual community V.
  • a method has been presented for automatically selecting articles of interest to a user.
  • the method generates sets of search profiles for the users based on such attributes as the relative frequency of occurrence of words in the articles read by the users, and uses these search profiles to efficiently identify future articles of interest.
  • the method is characterized by passive monitoring (users do not need to explicitly rate the articles), multiple search profiles per user (reflecting interest in multiple topics) and use of elements of the search profiles which are automatically determined from the data (notably, the TF/IDF measure based on word frequencies and descriptions of purchasable items).
  • a method has also been presented for automatically generating menus to allow users to locate and retrieve articles on topics of interest. This method clusters articles based on their similarity, as measured by the relative frequency of word occurrences.
  • Clusters are labeled either with article titles or with key words extracted from the article.
  • the method can be applied to large sets of articles distributed over many machines. It has been further shown how to extend the above methods from articles to any class of target objects for which profiles can be generated, including news articles, reference or work articles, electronic mail, product or service descriptions, people (based on the articles they read, demographic data, or the products they buy), and electronic bulletin boards (based on the articles posted to them).
  • a particular consequence of being able to group people by their interests is that one can form virtual communities of people of common interest, who can then correspond with one another via electronic mail.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Mathematical Physics (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
EP96939616A 1995-10-31 1996-10-29 System for customized electronic identification of desirable objects Ceased EP0941515A1 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US55119895A 1995-10-31 1995-10-31
US551198 1995-10-31
PCT/US1996/017981 WO1997016796A1 (en) 1995-10-31 1996-10-29 System for customized electronic identification of desirable objects

Publications (1)

Publication Number Publication Date
EP0941515A1 (de) 1999-09-15

Family

ID=24200252

Family Applications (1)

Application Number Title Priority Date Filing Date
EP96939616A Ceased EP0941515A1 (de) 1995-10-31 1996-10-29 System for customized electronic identification of desirable objects

Country Status (5)

Country Link
EP (1) EP0941515A1 (de)
JP (1) JPH11514764A (de)
AU (1) AU7674996A (de)
MX (1) MX9803418A (de)
WO (1) WO1997016796A1 (de)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7162434B1 (en) * 1997-05-05 2007-01-09 Walker Digital, Llc Method and apparatus for facilitating the sale of subscriptions to periodicals
JP3798114B2 (ja) * 1997-05-23 2006-07-19 富士通株式会社 端末、移動端末、サーバ、端末通信方法およびサーバ通信方法
US6157905A (en) * 1997-12-11 2000-12-05 Microsoft Corporation Identifying language and character set of data representing text
DE69907425T2 (de) * 1998-02-27 2004-03-11 Engage Technologies, Andover System und Verfahren zum Aufbau von Benutzerprofilen
US6327574B1 (en) * 1998-07-07 2001-12-04 Encirq Corporation Hierarchical models of consumer attributes for targeting content in a privacy-preserving manner
JP2002528819A (ja) * 1998-10-28 2002-09-03 バーティカルワン コーポレイション 自動集合の装置および方法、電子パーソナルインフォメーションあるいはデータを送達する装置および方法、ならびに電子パーソナルインフォメーションあるいはデータを含むトランザクション
US6351747B1 (en) * 1999-04-12 2002-02-26 Multex.Com, Inc. Method and system for providing data to a user based on a user's query
US6571234B1 (en) 1999-05-11 2003-05-27 Prophet Financial Systems, Inc. System and method for managing online message board
IL133489A0 (en) 1999-12-13 2001-04-30 Almondnet Inc A descriptive-profile mercantile method
WO2001069452A2 (en) * 2000-03-14 2001-09-20 Blue Dolphin Group, Inc. Method of selecting content for a user
EP1186164A1 (de) * 2000-03-17 2002-03-13 Koninklijke Philips Electronics N.V. Verfahren und vorrichtung zur bewertung von datenbankobjekten
JP2003085081A (ja) * 2000-07-25 2003-03-20 Nosu:Kk 情報配信サービスシステム
JP2002150147A (ja) * 2000-08-29 2002-05-24 Yutaka Nishimura 情報提供システム及び方法並びに情報提供用プログラムを記録した記録媒体
JP2002109183A (ja) * 2000-10-04 2002-04-12 Dentsu Inc ブランドおよびビークルの近縁性の評価方法、システム、および記録媒体
JP2002133271A (ja) * 2000-10-25 2002-05-10 Nec Corp 広告自動配信システム
JP2002170035A (ja) * 2000-11-30 2002-06-14 Hitachi Ltd 情報提供方法及びその実施装置並びにそのデータを記録した記録媒体
US7930362B2 (en) 2000-12-18 2011-04-19 Shaw Parsing, Llc Techniques for delivering personalized content with a real-time routing network
US8505024B2 (en) 2000-12-18 2013-08-06 Shaw Parsing Llc Storing state in a dynamic content routing network
US7051070B2 (en) 2000-12-18 2006-05-23 Timothy Tuttle Asynchronous messaging using a node specialization architecture in the dynamic routing network
US7680859B2 (en) * 2001-12-21 2010-03-16 Location Inc. Group Corporation a Massachusetts corporation Method for analyzing demographic data
US7434167B2 (en) 2002-09-30 2008-10-07 Microsoft Corporation Accessibility system and method
US8127252B2 (en) 2003-11-07 2012-02-28 Microsoft Corporation Method and system for presenting user interface (UI) information
US7644367B2 (en) 2003-05-16 2010-01-05 Microsoft Corporation User interface automation framework classes and interfaces
US8397237B2 (en) 2004-08-17 2013-03-12 Shaw Parsing, L.L.C. Dynamically allocating threads from a thread pool to thread boundaries configured to perform a service for an event
EP1779636B1 (de) 2004-08-17 2015-05-27 Shaw Parsing LLC Techniken zur detektion von aufwärtsstromfehlern und fehlerbehebung
CA2604030A1 (fr) * 2005-04-13 2006-10-19 Inria Institut National De Recherche En Informatique Et En Automatique Installation pour la diffusion contextuelle d'informations en mode a la fois collectif et personnel
KR20100051767A (ko) * 2006-12-22 2010-05-18 폼 유케이, 인코포레이티드 클라이언트 네트워크 활동 채널링 시스템 및 방법
CA2985910C (en) * 2009-09-08 2018-11-27 Primal Fusion Inc. Synthesizing messaging using context provided by consumers
US9269273B1 (en) 2012-07-30 2016-02-23 Weongozi Inc. Systems, methods and computer program products for building a database associating n-grams with cognitive motivation orientations
US20220092138A1 (en) * 2018-09-16 2022-03-24 Cameron Price System and method for delivering information to a user
CN111337931B (zh) * 2020-03-19 2022-11-15 哈尔滨工程大学 一种auv目标搜索方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO9716796A1 *

Also Published As

Publication number Publication date
JPH11514764A (ja) 1999-12-14
WO1997016796A1 (en) 1997-05-09
MX9803418A (es) 1998-11-30
AU7674996A (en) 1997-05-22

Similar Documents

Publication Publication Date Title
US5835087A (en) System for generation of object profiles for a system for customized electronic identification of desirable objects
US8171032B2 (en) Providing customized electronic information
US6029195A (en) System for customized electronic identification of desirable objects
EP0941515A1 (de) System for customized electronic identification of desirable objects
US7092914B1 (en) Methods for matching, selecting, narrowcasting, and/or classifying based on rights management and/or other information
Aïmeur et al. Alambic: a privacy-preserving recommender system for electronic commerce
EP1770626A2 (de) Systeme und Methoden zur Übereinstimmung, Auswahl, Verteilung an eine begrenzte Anzahl von Empfängern und/oder Klassifikation basierend auf Verwaltung von Rechten und/oder anderer Information
EP2486532A1 (de) Kontextabhängige telefonnachrichtenverwaltung
WO2001037193A1 (en) System, method, and article of manufacture for recommending items to users based on user preferences
AU2008261113A1 (en) System for Customized Electronic Identification of Desirable Objects
AU1562402A (en) System for customized electronic identification of desirable objects
AU2012216241A1 (en) System for Customized Electronic Identification of Desirable Objects
Miller Toward a personal recommender system
CA2236015A1 (en) System for customized electronic identification of desirable objects
Schafer MetaLens: A framework for multi-source recommendations
UA et al. Recommender Systems: A Survey
Miller This is to certify that I have examined this copy of a doctoral thesis
Albers Collaborative Filtering Recommender Systems And New Approaches For Recommender Systems

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19980515

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

17Q First examination report despatched

Effective date: 20010129

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: PINPOINT INCORPORATED

RIN1 Information on inventor provided before grant (corrected)

Inventor name: PINPOINT INCORPORATED

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20061205