WO2008032037A1 - Method and system for filtering and searching data using word frequencies - Google Patents

Method and system for filtering and searching data using word frequencies

Info

Publication number
WO2008032037A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
user
web
profile
words
Prior art date
Application number
PCT/GB2007/003418
Other languages
English (en)
Inventor
Richard J. Stevens
Original Assignee
Stevens Richard J
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Stevens Richard J
Publication of WO2008032037A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/337 Profile generation, learning or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Definitions

  • the invention can make the user interaction with the Internet, also known as the Web, and other media more effective.
  • a user-controlled profile is generated automatically (though the user can add to it).
  • the user controls the use and display of the profile, and can reveal parts of it for specific uses.
  • Irrelevant information is suppressed, while the user's interests can be emphasized and displayed and used efficiently.
  • the system, called KnowingMe, acts as the intermediary between the user's interests and the mass of information and systems of the World Wide Web, filtering out information irrelevant to the user, finding useful information and maximizing the return for the user.
  • the system then actively matches the user's needs with the Web information and programs, acting in the user's interests, revealing only what the user wants to reveal about himself/herself.
  • Figure 1 shows the relative use of words, or word frequency, as a graph.
  • Figure 2 shows a schematic of the WordCone data structure.
  • Figure 3 shows the creation of a word frequency curve as a filter.
  • Figure 4 shows the application applied to an Internet search.
  • the user interests and context are determined by examining the words that the user writes and views relative to average usage.
  • the size of a user's vocabulary is estimated by examining the sample of words that the user has written or actively viewed while using the Web. This is demonstrated by Hunter Diack in the article "Test your own Wordpower," (1975) Paladin 586 08233 6, which is incorporated herein by reference. This can cover both viewing (e.g. looking at Web pages) and writing (e.g. Word documents or searches in search engines). The larger the vocabulary in the pages read, the larger the user's 'vocabulary' is assumed to be. The word 'vocabulary' as used in this discussion is related to, but different from, traditional definitions of vocabulary.
  • the user vocabulary is derived from the information written, the level of interest in Web sites, the location of the user's cursor on the Web page, and the time spent on sites by the user. If the user writes text on the computer (e.g. in Word), this text is analyzed and weighted differently.
  • a person's vocabulary is a good measure of key marketing indicators such as age and education, and is a better predictor of income than IQ.
  • the user's words are sorted by the average frequency of word occurrence as a whole across the whole Web.
  • a person with a larger vocabulary will use more unusual words.
  • the average frequency of word usage for all users across the Web is determined by scanning a wide sample of Web pages and gathering data from many users. For example, the word "the" is typically the most common word, perhaps 6-8% of the total number of words. Other words may occur at a rate of 1 in 1,000 or 1 in 100,000. This provides a list of words ordered by frequency of occurrence across the Web.
  • the size of the user vocabulary can then be estimated by comparison with stereotypical vocabulary of similar size. For example a user with a vocabulary of 20,000 - -
  • An example of a distribution profile is shown in Fig. 1. This approach allows the vocabulary size to be estimated from a relatively small sample from a user. Typical vocabulary size is measured in "stem" words: for example, 'exercise', 'exercises', 'exercising', and so on are counted as only one word, because the stem of all of these words is 'exercise'. Usually nouns carry the most significance.
  • the user's interests are then determined by identifying which words in the user's lexicon are used preferentially, compared to the average word frequency for a user with a vocabulary of similar size. Both overuse and underuse of words are considered.
  • the result is an automatically generated set of words which indicate the user's interests and "anti-interests" compared to stereotypical vocabularies. For example, a user may use the words 'requirements', 'history' and 'football' 50 times more than an average user who has the same size vocabulary. The same person may use the words 'rap' and 'McDonalds' 4 times less than the typical user.
  • the net result for an individual user may be two groups of words, perhaps 50-3000 in total, which the person uses in a significantly different way from the average user.
  • Preferential use of phrases can be found by examining combinations of the words that are used preferentially. Phrases are even more specific to user interests than words. Phrases may be represented by a string of keywords to avoid having minor differences count as two different phrases.
  • Additional keywords can be generated later by seeing clusters of keywords, once they are positioned on the thesaurus. For example, gaps can be filled in, or headings derived that summarize a cluster of words.
  • Such individual user characteristics can be differentiated through keywords selected to optimally differentiate different groups.
  • the keywords are derived by comparing the vocabulary of a stereotypical group (e.g. young) with - -
  • the words "baseball", "elevator", "Congress", and "Seattle" may (statistically) be used many times more by a US citizen than by a UK citizen. If we check the usage of those keywords, we can estimate whether the user is American or British. Of course, an individual person may have a vocabulary reflecting interests in both cultures. Hence the result is not binary, and a user may be 75% American and 25% British in his/her interests.
  • "pension" and "retirement" may be used many times more by an older person than by a young one.
  • a person may be heavily interested in sport, but specifically not interested in ice-hockey or baseball.
  • a relatively small number of keywords can derive a parameter value for the particular characteristic. For example, the usage of a group of 10-30 keywords compared to average use may be used to estimate the level of interest in golf, or the age of a user.
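One way to picture the keyword-group estimate is the sketch below. It is only an illustration under stated assumptions: the function name `characteristic_mix`, the idea of normalising summed relative usage into shares, and the UK marker words are hypothetical and do not come from the source, which does not specify the calculation.

```python
def characteristic_mix(relative_usage, groups):
    """Estimate a non-binary mix of characteristics from marker keywords.

    relative_usage: word -> the user's usage relative to average (1.0 = average).
    groups: label -> list of marker keywords for that stereotype (hypothetical).
    Returns label -> share of the summed marker usage (shares add up to 1).
    """
    raw = {
        label: sum(relative_usage.get(word, 0.0) for word in words)
        for label, words in groups.items()
    }
    total = sum(raw.values())
    return {label: (value / total if total else 0.0) for label, value in raw.items()}


# Toy data: the US markers come from the text, the UK markers are invented.
usage = {"baseball": 3.0, "elevator": 3.0, "lift": 1.0, "parliament": 1.0}
groups = {"US": ["baseball", "elevator"], "UK": ["lift", "parliament"]}
mix = characteristic_mix(usage, groups)
```

With this toy data the user comes out as three quarters "US" and one quarter "UK", which mirrors the non-binary result described above.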
  • Actions made by the user can add to the profile.
  • the knowledge that the user buys golf equipment can be used for the user profile.
  • Processing can take place as a background task while the user is working, while the compilation tasks are performed at the server side.
  • a user profile is constructed from the user's interests, characteristics, and context.
  • the system performs two steps:
  • the words used in web pages, documents or any other data object on the Web are organized into a structure so that closely related words are, generally speaking, statistically close together.
  • An initial data structure called a WordCone can be created for the whole Web, which represents a hierarchy.
  • the WordCone is a layered hierarchy, like a layered tree structure but wrapped around itself so that it does not have a beginning or end. However, it can be sliced through from top to bottom to produce a standard layered hierarchy.
  • the hierarchy for the WordCone word index is generated by a recursive technique of looking for words which summarize the group beneath them in the hierarchy. For example, “sport” would summarize “baseball”, “football”, “rugby” etc at the level above.
  • the structure is actually a hierarchical cone (hence WordCone), rather than a tree structure.
  • the WordCone structure is not a thesaurus or a hierarchy, but can easily be converted into these when needed.
  • the cone can be cut vertically at any point to convert it to a tree structure at a convenient boundary.
  • the leaf level of the WordCone is a word index of closely related subject areas. That is, the proximate words at the leaf nodes are related by subject matter.
  • the WordCone data structure can be constructed by assembling data elements, each containing a word and its associated pointers to other elements. One pointer points to the parent node, one or more pointers point to the child nodes and two pointers point to neighboring nodes. At the leaves, the child node pointers are null.
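A minimal sketch of such a data element follows. The class name `ConeNode`, the helper `add_level`, and the choice to link only the children of one parent into a ring are assumptions made for illustration; in the described structure the whole level wraps around, and an empty child list stands in for the null child pointers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class ConeNode:
    word: str
    parent: Optional["ConeNode"] = None                       # pointer to the parent node
    children: List["ConeNode"] = field(default_factory=list)  # empty at the leaves
    left: Optional["ConeNode"] = None                         # neighbouring node on one side
    right: Optional["ConeNode"] = None                        # neighbouring node on the other


def add_level(parent, words):
    """Create child nodes for `words` and link them to their neighbours in a ring."""
    nodes = [ConeNode(word, parent=parent) for word in words]
    for i, node in enumerate(nodes):
        node.left = nodes[i - 1]                  # index -1 wraps to the last node
        node.right = nodes[(i + 1) % len(nodes)]  # last node wraps to the first
    if parent is not None:
        parent.children.extend(nodes)
    return nodes


sport = ConeNode("sport")
leaves = add_level(sport, ["baseball", "football", "rugby"])
```

Because the neighbour pointers wrap around, walking `right` from any node eventually returns to it, which is what lets the structure be "cut" at any convenient boundary.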
  • the initial structure is passed to each user's computer, to allow it to populate their personal WordCone(s). This gives each user a compatible structure and hence allows them to be added easily.
  • a structure for a word index can be created manually, with only the leaf level or the lower levels generated automatically. This allows existing headings such as "sports", "business" or "entertainment" to be used.
  • the same word can appear more than once and in different places in the WordCone, because the same word can have multiple meanings.
  • the WordCone can use a link or pointer in the data structure to show the relationships between the same word with different meanings. Each instance of a word will be close to other words that are used in the same context.
  • the WordCone can be stretched and/or compressed, for example to make a thesaurus in which each unit along the leaf level represents uniform usage of a word on the Web (for the Web's WordCone) or by a user. Each linear section of the thesaurus then represents a uniform amount of interest, instead of the sheer number of words.
  • Fig. 3 shows the process of adjusting the overall WordCone, firstly to the overall frequency of word usage, and then to the interests of the user.
  • the unitary WordCone constitutes the thesaurus for the World Wide Web: it is the set of usage frequencies (or probabilities) for the words in the WordCone.
  • the unitary WordCone is created from the base thesaurus by, for each word in the cone structure, determining the average frequency of usage for that word.
  • the user's WordCone can be used to create a filter to emphasize a user's interests for Web information by suppressing information irrelevant to the user.
  • the filtering process consists of the user's WordCone applied over the overall Web WordCone. Changes in a WordCone over time can for example illustrate the recent changes in interests or recent trends on the Web.
  • the initial WordCone is then passed to individual users who map their own level of interest by populating their word usage into the data structure.
  • the system can scan the user's documents on their computer or scan web pages bookmarked by the user in order to calculate word frequencies associated with the user for the user's word preference profile.
  • This WordCone stores the intensity of the user's interests. The intensities of words may vary with time, reflecting the user's changing interests or, for example, the latest events in current affairs.
  • the relative usage of words in the WordCone can be used as a filter to find similar WordCone structures and so emphasize the user's interests.
  • Each word that occupies a node (that is, categorical words and leaf nodes) has a frequency associated with it.
  • the data structure can have an additional data item comprising the data element that houses the frequency for that word associated with that node.
  • the leaf nodes are used, and each leaf node is mapped to a position on the number line for ease of computation.
  • the parent nodes which have the category definitions in them
  • each node in the WordCone is mapped to a unique position in a number-line by well known tree ordering techniques used in computer programming.
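The source does not say which tree ordering is used, so the following is only a sketch using a pre-order walk as one common choice. The function name `number_line` and the dictionary representation of the hierarchy are assumptions, and node labels are assumed to be unique here (a word with several meanings would need a distinct identifier per instance).

```python
def number_line(tree, root):
    """Assign each node a unique integer position by a pre-order walk.

    tree: node label -> ordered list of child labels (leaves are absent or empty).
    """
    positions = {}
    stack = [root]
    while stack:
        node = stack.pop()
        positions[node] = len(positions)             # next free position on the line
        stack.extend(reversed(tree.get(node, [])))   # keep left-to-right child order
    return positions


tree = {"root": ["sport", "music"], "sport": ["baseball", "football"]}
positions = number_line(tree, "root")
```

Children follow their parent immediately on the line, so words in the same subject area occupy a contiguous range of positions.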
  • the leaf node words and category description words each have word-use frequencies associated with them.
  • the WordCone represents a word preference profile for the data object it is derived from, and in the case of data objects associated with a user, the user's word preference profile.
  • for the i-th word in the vector, w(i), there is a usage frequency P(i).
  • the parent node P(i) would equal the sum of the P(i) associated with its child nodes.
  • by usage frequency is meant the percentage difference between the frequency of use of the word in that domain and the overall average frequency of use.
  • the score can be compared to a scoring threshold in order to determine whether to present the website to the user as a result of a search.
  • search results can be ranked by the dot product score and presented to the user from the highest score down.
  • Presentation can include displaying a web page with hyperlinks to the search results, where the top of the page shows the highest relevant result, with decreasing relevance as one moves down the display page.
  • the user can adjust the scoring threshold to broaden or tighten the effect of the filter typically by making changes through a graphical user interface.
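The scoring, thresholding and ranking steps can be sketched as below. This is a minimal illustration, assuming each word preference profile is held as a mapping from word to usage frequency; the function names `relevance` and `rank_sites` and the sample numbers are hypothetical.

```python
def relevance(profile_a, profile_b):
    """Dot product of two word preference profiles (word -> usage frequency)."""
    return sum(value * profile_b.get(word, 0.0) for word, value in profile_a.items())


def rank_sites(user, sites, threshold=0.0):
    """Return (site, score) pairs at or above the threshold, highest score first."""
    scored = [(relevance(user, profile), name) for name, profile in sites.items()]
    return [(name, score) for score, name in sorted(scored, reverse=True)
            if score >= threshold]


user = {"music": 0.5, "golf": 0.25}
sites = {"a": {"music": 0.5}, "b": {"golf": 0.5}, "c": {"wine": 1.0}}
results = rank_sites(user, sites, threshold=0.1)
```

Raising `threshold` tightens the filter and lowering it broadens the result list, which corresponds to the user-adjustable scoring threshold.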
  • a Web site can be classified on the WordCone as well as an individual. This need only be done once per user and sent to a central database to then be available to any user without processing. This allows the web site to be matched to a particular user's interests by comparing the same node for the user and the web site. For example, consider a website specialising in music downloads.
  • the website would show high specific values for the nodes of "entertainment", then sequentially higher values for subnodes of (say) "arts", then "music" - because the user is more interested in "music" than its parent node "arts" or "entertainment" in general.
  • a user who is interested in classical music downloads will also have high values for the nodes of "entertainment", its child node "arts", and then "music".
  • the match with the music download database derived from the product of the two nodes for the user and the music download site will therefore be a high value.
  • the match could be for example, a product of the level of interest for the same node for the user and website.
  • the match level would be 81%. If the specific interest of the user is at a lower node beneath music (e.g. classical music), then that lower node will have a yet higher value of interest. Provided that the music download site also has a significant value for classical music, then the match for this node will be high for classical music and much lower for, say, "jazz" if the user has little interest there. If the user has a 5% interest in "jazz" and the website is 90% for the same node, the match would be 4.5%.
  • the Wordcone for the user and websites are then classified as to the level of interest for particular subjects. Matching the most important web sites for the user's interests then becomes possible.
  • a simple measure of a user's interests against a web site consists of the sum of the products of the two leaf nodes of the user and a Web site, i.e. $\sum N_u N_{ws}$ over all leaf nodes. The highest value will be obtained by the web site with the largest amount of information relevant to the user, regardless of the amount of irrelevant information.
  • Another measure of interest to the user is a smaller web site or a section of a larger web site that has a higher matching ratio but for a smaller area of the web site, i.e. $\max \sum N_u N_{ws}$ over the leaf nodes of a user and a Web site that are below a node of specific interest for the user. This maximum value shows the best match to a special area for the user.
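The two measures can be contrasted with a small sketch. It follows one reading of the text: the second measure restricts the sum to the leaves below a chosen node and then picks the site with the largest restricted sum. The function names and the toy interest values are assumptions.

```python
def total_match(user, site):
    """Sum of products of user and site values over all leaf nodes."""
    return sum(value * site.get(leaf, 0.0) for leaf, value in user.items())


def subtree_match(user, site, leaves_below):
    """Same sum, restricted to the leaves below one node of interest."""
    return sum(user.get(leaf, 0.0) * site.get(leaf, 0.0) for leaf in leaves_below)


def best_site_for_node(user, sites, leaves_below):
    """Site (or site section) with the highest restricted match."""
    return max(sites, key=lambda name: subtree_match(user, sites[name], leaves_below))


user = {"classical": 0.5, "jazz": 0.0625, "golf": 0.25}
sites = {
    "generic": {"classical": 0.25, "jazz": 0.25, "golf": 0.5},
    "specialist": {"classical": 0.5},
}
music_leaves = ["classical", "jazz"]
overall_best = max(sites, key=lambda name: total_match(user, sites[name]))
music_best = best_site_for_node(user, sites, music_leaves)
```

In this toy data the generic site wins on the overall sum while the specialist site wins for the leaves below "music", which is the distinction the two measures are meant to draw.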
  • Both measures can be presented to a user on a single graph by showing the Wordcone with each node populated by the highest Web site matches for the user.
  • the generic large Web sites will occupy the top node, and the lowest specialized areas will occur at successively lower nodes. They will show web sites or sections of web sites of interest to the subject matter at that node. For example, the "early music" node will be occupied by specialist early music web sites, but also the early music sections of more generic web sites that cover the subject area well.
  • the user can select the relevant web site by clicking on a node of interest and seeing a list of web sites that have the highest matching values for that node.
  • the calculation of S(j)(k) can be limited to the sum $S(j)(k) = \sum_m P(m)(j)\,P(m)(k)$, where $m$ ranges over the indices associated with the words in the category of interest specified by the user; in the above example, it would be "music".
  • the unitary WordCone can be used to determine what relative share of usage the entire selected category represents. The reciprocal of that can be used to scale the dot product prior to comparison to the scoring threshold. For example, if the category of "music" is selected, and the unitary WordCone has 5% of usage being words in that category, then the relevance score dot product can be scaled by a factor of 20 prior to comparison to the scoring threshold.
  • the scoring threshold can be reduced by a factor of 20.
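The category-limited score and its scaling can be sketched as follows, assuming profiles and the unitary WordCone are held as word-to-frequency mappings. The function name `category_score` and the sample values are hypothetical; only the "5% share gives a factor of 20" relationship comes from the text.

```python
def category_score(user, site, category_words, unitary):
    """Dot product limited to one category, scaled by the reciprocal of the
    category's share of overall usage in the unitary WordCone."""
    share = sum(unitary.get(word, 0.0) for word in category_words)
    raw = sum(user.get(word, 0.0) * site.get(word, 0.0) for word in category_words)
    return raw / share if share else 0.0


# Toy unitary WordCone in which the "music" words make up about 5% of usage.
unitary = {"jazz": 0.02, "classical": 0.03}
user = {"classical": 0.5, "golf": 0.5}
site = {"classical": 0.5}
score = category_score(user, site, ["jazz", "classical"], unitary)
```

Dividing by a 5% share is the same as multiplying the raw dot product by 20, so the scaled score can be compared against the unchanged scoring threshold.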
  • Other heuristic adjustments to the scoring threshold may be used. Comparison of two word preference profiles can be made by means of the dot product as well. In this case the result of the dot product can be compared to a threshold that indicates some level of confidence of the match.
  • the user can filter information to his/her interests in real time. Where there is a cluster of interests along a line, small gaps show potential areas of interest in which the user has not been active (see Figure 3).
  • the term can be used to locate a section of the WordCone, corresponding to words for similar interests/meaning.
  • the user may well have preferentially used words in the user profile in that part of the WordCone. These words can be used to programmably reinforce the search term in the Search Engine. The resulting search is more relevant to the user interest.
  • An extension to this technique is to determine the area in the Wordcone that is most relevant to the search and then to select user keywords that are specific to that subject area.
  • Applications include the ability to filter or prioritize a Web site of text, media or products from a Web site or an e-mail sent to a user e.g. DVDs, wines, news snippets.
  • a Web newspaper (text, audio, video, MP3) could be filtered through a user filter to emphasize the user interests. Similarly it could reorder the information according to user interests.
  • This filter could reside at the origin of the media information or at the user site.
  • the system performs the steps of:
  • Synthetic view(s) of the overall Web information usage can be produced by combining individual users' WordCones.
  • One or more user WordCones are sent to the server and combined to provide generic information about Web usage.
  • Each user WordCone represents the interests of the individual user and the intensity of that interest, and is ordered by similarity of subject. Because the underlying WordCone structures used by each user are compatible, they can be added together straightforwardly.
  • the summation is a map of the whole Web word and word usage in different areas by the totality of users, i.e. an overall WordCone for the Web. Subsets can be chosen for different tasks e.g. to define a USA stereotype.
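Because the cones share one structure, combining them reduces to adding values node by node. The sketch below assumes each user WordCone is represented as a mapping from a node label to a usage value; the function name `combine_cones` and that representation are assumptions.

```python
from collections import Counter


def combine_cones(cones):
    """Add compatible WordCones together, node by node."""
    total = Counter()
    for cone in cones:
        total.update(cone)   # Counter.update adds values rather than replacing them
    return total


overall = combine_cones([
    {"golf": 2, "music": 1},
    {"music": 3},
])
```

A subset of users (for example, one stereotype group) can be combined the same way by passing only that group's cones.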
  • the overall WordCone ordering and word usage can also be adjusted in the light of users' inputs and then re-distributed.
  • the process of accumulating WordCones is anonymous, with the user identity being unknown, but they can be encrypted to ensure they are from unique individuals (to prevent automatic systems supplying information in bulk to distort the system). The system performs these steps:
  • the user profile can now be anonymously used to scan Web information.
  • a profile agent combines the user profile, software to view the Web, and encryption.
  • the profile is owned by the user, who can read the profile and can alter parts of the profile.
  • the user profile is partially or completely encrypted to preserve the user's anonymity. It permits different levels of exposure to allow the user to apply the profile for different tasks while retaining sufficient anonymity and prevent tampering.
  • the profile agent can use the WordCone to filter the Web content for example to find other users with similar interests, advertisements, jobs etc. anonymously.
  • the user can have multiple profiles or get a specialist profile from others for a particular task, for example a profile could be specific about a holiday destination or a pop star.
  • WordCones are additive so multiple cones can be combined to complete complex tasks.
  • the profile agent can be used to receive information from a Web site anonymously by having the information passed indirectly e.g. through a KnowingMe site.
  • click-through information will be able to be passed to third parties such as advertisers or KnowingMe (without compromising anonymity).
  • the profile agents belong to the user, based on a user profile, and anonymously trawl the Web and retrieve items of interest to the user.
  • the profile, profile agent and/or WordCone can be used on other devices to allow autonomous selection of information specific to a user. It may be passed to a mobile phone, a TV storage device to store or display programs, PDA and other mobile devices and link to those devices, to be able to detect items of interests to the user. For example it may combine user interests and details of the current location known to the phone and tell the user that an event of interest is currently taking place locally. Alternatively the user-controlled profile may extract information on the TV programs that are watched from a TV or TV recorder system and add them to the overall
  • a part of the profile can represent what the user (or site) has done in a neutral and unalterable fashion, i.e. the user can choose whether to display parts of this information, but not alter the information itself. This leads to a range of applications which verify the user or the site.
  • the profile is still user-controlled (for display and use) but not for content of what is displayed.
  • the user can send (e.g. by attaching to an e-mail), an encrypted e-mail which summarizes the user's profile and the way it has been produced.
  • the user is unable to alter this summary, which therefore forms a basis for independent verification.
  • the recipient would be able to see that the profile was generated over a period of 18 months and has changed only steadily during that time.
  • This approach enables a user to be independently verified to reduce spam, verify user interests and stability.
  • some details from the profile could be used to prevent and detect clickthrough fraud, where a computer is used to repeatedly click on ads to generate cash without any intention to purchase the product.
  • the user may have a system which allows themselves to be rated by other users.
  • the user can choose whether or not to display the overall rating, but cannot alter the overall rating or the information supplied by others.
  • This is a generalized version of the rating systems which are specific to one site (Amazon, ebay) but in this case the information is held by the end user or Web site. However the end user is not able to alter the statistics of this rating.
  • the encrypted user profile can be used by others to verify the user is unique and active, without necessarily supplying an e-mail address.
  • the agents will be able to assure click-through companies that the request is real and relevant to the user's interests.
  • the profile agents will be able to supply the KnowingMe supplier with details of click-throughs and/or transactions, even though it may not know who the user is.
  • a user profile is unique, a valuable item built up over time, with elements not alterable by the user, yet hides personal details of the user. As a result, a user could exclude or suppress e-mails from recently created e-mail addresses.
  • the application could be used to help reject Spam, by indicating factors such as length of usage, interests etc. It can also be used to detect false clickthroughs (clickthrough repeatedly performed specifically to earn clickthrough income) or to prevent a single person ordering masses of tickets for re-sale.
  • WordCone may be encrypted to control the supply of these third-party elements, and to direct part of the click-through revenue back to the originator. Practitioners of ordinary skill will recognize that access to the WordCone can be controlled by means of the encryption.
  • the code that can decrypt and use it to create search queries or to filter search results or send to search engines specific profile information can also include an identifier for the search engine itself to recognize that the source of the query was WordCone or the source of the WordCone application. In this manner, a credit for the search referral can be accrued to the WordCone account associated with the identifier.
  • the user profile can be used to automatically classify information that is received and stored by the user, such as e-mails.
  • the system has these capabilities:
  • the user-controlled profile interacts with other media systems (TV, radio, ads) to select appropriate material based on the user interests.
  • TV television, radio, ads
  • non-Web media such as programs or ads
  • KnowingMe can select an ad from several possibilities using the user profile, or select broadcast or stored programs or music based on the user profile.
  • a user could have one or more sequences of TV programs selected from broadcast options based on the user profile. The user choices could then be added to improve the user profile.
  • the single-dimension WordCone can be wrapped along the curve to fill a two-dimensional display.
  • the single dimension line then has the whole 2-D area available for display.
  • Related information on the WordCone will then appear even closer on a 2-D display.
  • Real-time filtering on the 1-D WordCone is then shown in 2-D through the curve.
  • the single dimensional WordCone is wrapped onto the two dimensions following the fractal curve that fills in the plane with a single line. The curve ensures that subjects that are close in one dimension are statistically even closer in two dimensions. This is demonstrated by RJ. Stevens, A.F. Lehar and F.H.
  • the display can show (e.g. by coloration or 3-D relief map) the intensity of usage of the words across the WordCone, and the WordCone can be stretched to emphasize the heavily used elements.
  • Filtering consists of overlaying the user WordCone usage levels over the generic WordCone in 1-D. Display consists of showing the result in 2-D through the fractal curve. In this case a Hilbert curve is used, but other fractal curves will produce satisfactory results.
  • the user WordCone acts as a 1-D filter of the user interests to enable relevant Web sites to be selected, and to select out relevant information in a Web site.
  • the filters also allow Web sites to tailor their content to each user, providing information that is more relevant to the user's needs.
  • the user can receive a standard set of information and filter that locally e.g. for a Web newspaper or audio blog with tags that enable filtering.
  • the one-dimensional WordCone is mapped along a two-dimensional fractal curve (e.g. a Hilbert curve) that fills the 2-D screen.
  • Fig. TBD shows how a continuous fractal curve can track through every pixel of a 2-D display, following the path of a Hilbert curve. This has two great advantages: a long, detailed curve can be mapped onto the whole of the 2-D screen, and elements close on the line will be even closer in 2-D because of the fractal curve.
  • the effective length of the curve that can be displayed is $n^2$. So a screen of 1024 x 1024 elements can display an index 1,048,576 elements long, yet retain the coherence of the one-dimensional index.
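For illustration, the standard iterative conversion from a position along a Hilbert curve to 2-D grid coordinates is sketched below. The function name `d2xy` and the use of this particular textbook algorithm are assumptions; the source names the curve but not the procedure. The grid side `n` must be a power of two.

```python
def d2xy(n, d):
    """Map position d (0 <= d < n*n) along a Hilbert curve to (x, y) on an n x n grid."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:              # rotate the quadrant so sub-curves join end to end
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Lay out a 64-element index on an 8 x 8 display.
points = [d2xy(8, d) for d in range(64)]
```

Every cell is visited exactly once, and consecutive index positions always land in adjacent cells, which is why neighbouring subjects on the 1-D WordCone stay close together on the 2-D display.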
  • the system's capabilities include:
  • $P_s(w)$ is the Standard Word Frequency: the probability of occurrence of a word $w$ in a large diverse sample of documents.
  • $P_a(w)$ is the Actual Word Frequency: the probability of the occurrence of a word $w$ in a specific document or portion of text.
  • the actual frequency $P_a(w)$ of a word $w$ is the number of occurrences of the word, $n_w$, divided by the total number of words, written here as $n_{\text{total}}$: $P_a(w) = n_w / n_{\text{total}}$.
  • the relative frequency $P_r(w)$ of a word $w$ in a specific document or portion of text is defined by:
  • $P_r(w) = P_a(w) / P_s(w)$.
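The actual and relative frequencies defined above (Pa(w) and Pr(w)) can be computed directly. In this sketch the function names and the sample standard frequencies are hypothetical, and words with no standard frequency are simply skipped, which is an assumption rather than something the source specifies.

```python
from collections import Counter


def actual_frequency(words):
    """Pa(w): occurrences of each word divided by the total number of words."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def relative_frequency(words, standard):
    """Pr(w) = Pa(w) / Ps(w), for words that have a non-zero standard frequency."""
    pa = actual_frequency(words)
    return {word: pa[word] / standard[word] for word in pa if standard.get(word)}


text = ["the", "the", "golf", "the"]
standard = {"the": 0.07, "golf": 0.001}   # illustrative Ps(w) values, not measured
pr = relative_frequency(text, standard)
```

A value well above 1 (as for "golf" here) marks a word used far more than average in this text, while a common word such as "the" stays much closer to its standard rate.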
  • a single word can be ambiguous without some context given to it by the words around it.
  • We measure the context of a specific word by recording the distance (that is, counting the number of intervening words) between it and each of the surrounding words.
  • a threshold must be placed on the maximum distance between words that we regard as related e.g. the first and last words in a book are unlikely to be related. Empirically, distances of less than 10 words have proven to be the best choice.
  • let the distance between a word $w_i$ and a word $w_j$ be $D_{ij}$, and let $I(i,j)$ be the number of instances of the pair of words $w_i, w_j$ occurring within some predetermined number of words of each other in a given portion of text. If the pairs of words are ranked in descending order of their number of instances, then the words at the top of the list are contextually significant.
  • the predetermined number is about 10 words, and the portion of text over which $I(i,j)$ is counted is typically an entire document. However, portions of a document may be used, for example article abstracts and the like.
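The pair counting can be sketched as follows, taking "distance" to be the number of intervening words and counting a pair when that distance is below the limit. The function name `contextual_pairs`, the treatment of pairs as unordered, and the skipping of a word paired with itself are assumptions.

```python
from collections import Counter


def contextual_pairs(words, max_distance=10):
    """Count word pairs separated by fewer than max_distance intervening words,
    returned in descending order of the number of instances."""
    pairs = Counter()
    for i, first in enumerate(words):
        # j - i - 1 intervening words must be < max_distance, so j <= i + max_distance
        last = min(i + max_distance, len(words) - 1)
        for j in range(i + 1, last + 1):
            if words[j] != first:
                pairs[tuple(sorted((first, words[j])))] += 1
    return pairs.most_common()


sample = "early music concert early music festival".split()
ranked = contextual_pairs(sample, max_distance=1)   # adjacent words only
```

The pairs at the head of `ranked` are the candidates for contextually significant words in the sample.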
  • the resulting subset of words defines the likely subject of the document being examined. Each part of the information is not sufficient by itself. A very unusual single word must not be able to dominate the subject of a document but similarly neither should a very common set of close word pairs. Unusual words that are close together are significant and relevant to the subject of the document.
  • the combination of the statistically significant words and the contextually significant words is a logical union. In some embodiments, set concatenation can occur so that the size of the union set of words is within practical bounds. Practitioners of ordinary skill will recognize that identifying the subject of a document is not limited to this method, and that a variety of techniques may be used.
  • the User Profile is generated by aggregating instances of the subjects of documents examined over the course of time by the user. Also, the system will apply frequency analysis to the subject result words to determine word usage for the user. After a large enough collection of documents has been analyzed, the subject areas of interest to the user will rise to the top of the subject rankings. By aggregating the instances of each subject area over a large number of users we can calculate the relative interest in a subject for an individual user.
  • Ir(s) I(s) / Iav(s), where I(s) is the number of instances of the subject s for a user and Iav(s) is the average number of instances of the subject s over a large number N of users.
  • the WordCone can be visualised as a 3-dimensional hierarchical structure consisting of subject related words at each level that summarise the words in hierarchically lower levels, e.g. the word sport has the words football, hockey and baseball beneath it. At each level the subject related words wrap around to form a "circle" of subject words. Each higher level in the hierarchy is a smaller diameter circle of (summarising) parent words with a larger diameter circle of more detailed words underneath. The structure naturally forms a cone. However, it can also be thought of as a tree data structure, where each word is represented by an element in the data structure, called a node. The subject categories are branch nodes and the finely granular normative words are leaf nodes.
  • a data structure can be used where each element of the data structure is combined with at least one pointer, pointing to a parent node (representing a category) or to a child node, possibly a leaf node, which has no further categorical distinctions.
  • the ancestor nodes (the parent node, that parent node's parent, and so on up the tree) are incremented in value. The point is that if a document has the word "softball", for example, then the node representing the category "sport" should also register a change in order to facilitate word cone matching and filtering.
  • the 3-dimensional nature of the structure is necessary to account for documents that cross subject areas.
  • History and Science could be nodes in the hierarchy.
  • Some documents will be about the History of Science. In such a document there will be subject words that are children of the History node and children of the Science node.
  • Grouping related subjects can be done with a suitable classification algorithm, such as Bayesian classification or fuzzy logic, both of which are common in areas such as spam recognition.
  • An entire group of users can be considered to have one Word Cone data structure, which can be determined by aggregating each user's individual Word Cone.
  • documents, websites or other sources of words can be analyzed in the same manner so that they too have a Word Cone calculated.
  • most of the nodes in the Word Cone are associated with a zero word frequency: very few documents use every word in a language.
  • Documents can be matched with users by comparing the nodes in the document's word cone to the user's word cone. Those comparisons that show a sufficiently close match indicate that the user is likely to find the document interesting. In the same manner, a user can adjust their word cone so that they can pre-focus search efforts.
  • the user can specify that their immediate interest is "sports", and hence matching is conducted in the "sports" sub-tree of the user's word cone and not across all subjects; therefore not all of the nodes need to be matched.
  • the comparison of word cones is done by comparing the nodes.
  • Each node in the data structure has at least two elements: the word and its frequency. Additionally, there are one or two structure pointers that point to the parent or leaf nodes or both.
  • Two word cones are likely different because a document and a user profile will not have the same word set. Therefore, nodes not appearing in both word cone structures are not further compared. Of the nodes appearing in both word cones, the frequency value is compared and, if within a predetermined threshold, considered a match. The result of the comparison is a certain number of words in common and, for each word, a match or no match. Any kind of analysis can be used to determine if a sufficient number of words in common have matching frequencies of usage. In addition, heuristics can be used to weight the comparisons. For example, branch matches may be considered to be more significant than merely leaf matches. In any case, when the two word cones are found to be sufficiently matched, the results of the match can be used, for example, for indexing or searching.
  • User-controlled means a characteristic of the user profile in which the user is able to know the profile is being created, and control the use and display of the information in the display. While the user may enter some of the information, other parts (e.g. interests and usage) are automatically derived and not controlled by the user.
  • WordCone means a structured organization of a set of words organized by similarity of words at the leaf and all levels of the structure. Also the name of the overall system. The nodes above the leaf level partition the word set into related groups. The structure forms a hierarchical cone - at all levels of the cone, similar words are close together and organized along a circular path. The structure can be cut vertically at any point to form a hierarchy with a thesaurus at the leaf level. The leaf level of the thesaurus is what is mapped along the fractal curve.
  • the WordCone can be an initial variety, generated by analysis of the Web, a user WordCone which reflects the user's interests and the current WordCone which is the summary of all user WordCones.
  • Subject: a subject is an identifiable topic that an Information Unit relates to.
  • an Information Unit is a collection of words about a subject.
  • an Information Unit may take the form of written or spoken words in electronic or physical form.
  • an Information Unit may also be a subset of a larger Information Unit, e.g. a chapter in a book.
  • the common structure is essentially a tree structure with each node representing one or more subject words that summarise the subject words in their child nodes. So a node that represents
  • Another mechanism for filling the structure would be to determine the subjects (we describe how later) of a large set of Information Units and look for the most frequent hypernyms of those subjects (x is a hypernym of y if the statement "y is a kind of x" is true, so for example sport is a hypernym of football). Each of the most frequent hypernyms becomes the parent category for the Information Units where that hypernym is found most often. Finding hypernyms can be done using common tools such as WordNet (http://wordnet.princeton.edu).
  • This global structure is created and updated by continuously determining the subjects examined by all of the users. Updated versions of the structure are periodically distributed to all users in order to ensure that subjects are up to date and that every user is measured against the most current structure.
  • a user profile by associating a numerical value with each node in the common structure that is a "measurement" of the user's level of interest in the subject matter associated with the node.
  • a value of 0 indicates that a user has no interest in a subject and the larger the number, the greater the interest.
  • the measurement should reflect such factors as the number of Information Units that a user examines in a subject area and the frequency that they are examined since this should give a good indication of interest. Other possible factors could be included.
  • the parent flag would be set to TRUE if any child node were set to TRUE and FALSE if all child nodes were FALSE. In practice the user will examine many separate instances in an area, and an analog value is created to express the overall level of interest across time.
  • Another possible method would be to calculate a number for each category between (0,1) where 0 implied that the user had no interest (i.e. had examined no Information Units) in a category and 1 implied that the user had read every Information Unit relating to the category.
  • the method of calculating this would be to divide the total words in a category that the user had examined by the global number of words in the same category i.e. the sum total of all the words in that category examined by all users.
  • Each user examines a number of documents and the above analysis is carried out for each document.
  • a local copy is kept for each user that only contains the analysis (word counts by word and total words by category) for the documents examined by that user.
  • Another global analysis is updated that contains the results for the documents examined by every user.
  • the global analysis is the master structure that forms the standard that every user populates with their profile (dealt with under scoring).
  • the results of the analysis could either be carried out at the user end and transmitted to the central resource or the document (or its location) could be transmitted to the central resource and the analysis carried out there.
  • the hierarchical structure is essentially a tree structure with each node representing one or more words that summarise the content of the words in their child nodes. So a node that represents "sport" would have child nodes that represent "football", "basketball", "athletics" etc.
  • Another mechanism for filling the structure would be to analyse the words in a document as described above and look for the most frequent hypernyms (x is a hypernym of y if the statement "y is a kind of x" is true, so for example sport is a hypernym of football).
  • Each of the most frequent hypernyms becomes the parent category for the documents where that hypernym is found most often.
  • Finding hypernyms can be done using common tools such as WordNet, a research program at Princeton University (http://wordnet.princeton.edu).
  • One possible method would be to calculate a number for each category between (0,1) where 0 implied that the user had no interest (i.e. had examined no documents) in a category and 1 implied that the user had read every document relating to the category.
  • the method of calculating this would be to divide the total words in a category that the user had examined by the global number of words in the same category (both numbers being taken from the analysis above).
  • the same calculation could be applied at higher levels in the hierarchy by summing the user's word count for all of the subcategories and dividing by the sum of the global word counts for the subcategories.
  • Another more crude mechanism would be a simple Boolean flag set to TRUE if there were any word count entries for the user in a given category and FALSE if there were no entries.
  • the parent flag would be set to TRUE if any child node were set to TRUE and FALSE if all child nodes were FALSE.
  • Searching a set of documents where each document already has a word cone associated with it can be implemented.
  • the user can specify some keywords.
  • Those words in turn can be expanded to find related words, as demonstrated by the prior art or simply by using the word cone to include those words constituting parent nodes to the specified keywords and, if the keyword is a leaf-node, those words within a predetermined distance of the leaf-node, or a combination of the two.
  • the expanded set of keywords can be used to expand the results of the search. Practitioners of ordinary skill will recognize that expanding search results is often not the goal, but rather focusing search results on the few documents that are truly relevant. The invention facilitates this by generating a word cone for each recovered document.
  • the word cone can be matched against all or part of a user's word cone, where the match context is limited to a subtree, that is, a subject area that encompasses only part of the word cone. Those documents whose word cones sufficiently match the subtree would be considered relevant search results.
  • Another way that the invention facilitates filtering search results is by making the query more limited: if the user specifies keywords and selects a subject area, then the invention can determine which words are the most likely to be found in documents related to the same subject area. In that case, the query can be automatically recomposed to require that those additional words are present, and, as described above, exist at approximately the same word frequencies as expected for that search subject. In another embodiment, the user does not explicitly select a subject area for the search.
  • the user's keywords are used and the word cone for the user is examined to determine which subject areas the keywords are mostly coming from. This is accomplished by examining the parent-leaf node structure. As an example, a query for "basketball, knee and arthroscopic" is more likely a search under the category of "medicine" than "sports". That is because all three would be found under "medicine" but not under "sports". In this way, incoming search query results can be filtered by determining which documents have approximately the same expected word frequencies for all three words within the document.
  • the KnowingMe invention has applications beyond just searching and filtering of documents or adverts.
  • this invention can be used to create a distributed search engine (as opposed to the centralised, company controlled model used now). This is accomplished by use of peer to peer distribution systems.
  • KnowingMe application then a search consists of broadcasting to all users a desired word cone set. Each user makes a comparison locally and then sends keyword or excerpts back to the requesting user. The requesting user then can request specific documents directly from the peer computer. In this case, there is no central indexing of document data.
  • the invention is not limited to just searching a contained set of (structured) information (e.g. a database). It can search information wherever it can be found, including the internet.
  • the invention may be a stand-alone application or a tool that runs in conjunction or "piggyback" on existing search engines by constructing appropriate search terms on behalf of the user or creating its own index of web pages.
  • KnowingMe users can have their classification or word cone results uploaded as an aggregate data form so that a large user-base word cone is created.
  • This data classification is dynamically updated by a large number of users. Both the classification and the data itself can be updated with time.
  • the word cone classification is not only about documents.
  • the text that a user reads can be used to classify the user themselves, or it could classify a collection of text (a whole web site), or it could classify a collection of people (a company). It is therefore also a means of aggregating knowledge.
  • the collective information gathered is anonymous. Each individual user chooses how much information they reveal to whom at any given time. No advertising agency purchases space. The user can reveal as much information as they please in order to receive _relevant_ adverts.
  • KnowingMe finds the area in WordCone to which the search is related, and extracts heavily used words and phrases from that subject area. KnowingMe then adds those to the search, and thereby makes the search more precise.
  • the user's vocabulary size is calculated from his/her usage of the Web and words written by the user.
  • the size of the user vocabulary can be estimated in terms of stem words e.g. 25,000 stem words. Although the user may not have looked at all these words, the profile of word usage will drop off steeply at around 25,000 words i.e. the user will use a few words in the
  • the Science words would be extracted from the WordCone section on Science as if it were a thesaurus section. Each word is classified according to frequency of usage on the web, allowing a selection to be made of words corresponding to the size of the user's vocabulary.
  • a typical user with a vocabulary of 25,000 stem words will therefore have 50% of these words in his/her vocabulary.
  • In examining the user's vocabulary we might find 93% of these Science words, indicating a heavy user interest in the area.
  • KnowingMe allows the application of personal clickthrough management. Currently the money for clickthrough goes to the Search Engine and the site on which the advert appears.
  • KnowingMe can maximize this opportunity by increasing the value of the clickthrough because of known user interests. KnowingMe would also negotiate the clickthrough "fee" for the individual user and route it to a payment mechanism e.g. Paypal.
  • by data object it is meant any kind of file that contains text or data that can be converted into text, whether HTML, Java, digital video stream, word processing document, email, mobile SMS or MMS, text derived from voice data conversion or any other kind of format.
  • a hyperlink: a link in a data object to information within that data object or another data object. These links are usually represented by highlighted words or images.
  • the computer display switches to the document or portion of the document referenced by the hyperlink.
  • Retrieving by Internet search may be accomplished three ways. First, a user who uses an internet search website, can transmit to the search website one or more keywords. The website will respond by sending back a message with links to retrieved web pages. In this case, the invention may be run on the user's computer, and the user's computer can process the retrieved hyperlinks to rank them according to the relevancy score. This can be done by fetching the cited webpages and processing them to calculate their relevancy.
  • the search website itself can execute the invention, in which case the internet search retrieval is simply the operation of its search engine database lookup system.
  • the website operator can process cached copies of the webpages in its database to come up with relevancy scores.
  • the word preference profile of a website can be determined during the web-crawling phase of search engine operation and the profile stored.
  • an intermediate website can receive keyword search requests from the user, transmit the search request to a search engine, and then check the results for relevancy before sending the most relevant search results back to the user.
  • the invention can be implemented on a computer by means of software, such that coded instructions comprising software that are executed by the computer cause the computer to execute the method.
  • the computer is typically a central processing unit (CPU), a random access addressable memory (RAM) operatively coupled to the CPU, external mass storage, typically a hard drive, and some kind of input-output device (I/O).
  • I/O is typically a keyboard operatively connected to the computer and a visual display also operatively connected to the computer.
  • the computer has an operative connection to a data communications network, typically the Internet, and executes typical network protocols in order that data packets be transmitted and received between one or more computers embodying the invention.
  • Source code may include a series of computer program instructions implemented in any of various programming languages (e.g., an object code, an assembly language, or a high-level language such as FORTRAN, C, C++, JAVA, or HTML) for use with various operating systems or operating environments.
  • the source code may define and use various data structures and communication messages.
  • the source code may be in a computer executable form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator, assembler, or compiler) into a computer executable form.
  • the computer program embodying the invention may be fixed in any form (e.g., source code form, computer executable form, or an intermediate form) either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical memory device (e.g., a CD-ROM), a PC card (e.g., PCMCIA card), or other memory device.
  • the computer program may be fixed in any form in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies, networking technologies, and internetworking technologies.
  • the computer program may be distributed in any form as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink wrapped software or a magnetic tape), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the Internet or World Wide Web).
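
The relative-frequency definition given in the description, Pr(w) = Pa(w) / Ps(w), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `relative_frequencies`, the whitespace tokenizer, and the `standard_freq` table standing in for Ps(w) are all hypothetical.

```python
from collections import Counter

def relative_frequencies(text, standard_freq):
    """Return {word: Pr(w)} with Pr(w) = Pa(w) / Ps(w).

    standard_freq maps a word to an assumed standard frequency Ps(w).
    Words with no (or zero) standard frequency are skipped, which is
    an assumption of this sketch.
    """
    words = text.lower().split()      # naive tokenizer, illustration only
    total = len(words)
    counts = Counter(words)           # nw for each distinct word
    result = {}
    for word, n_w in counts.items():
        ps = standard_freq.get(word)
        if not ps:
            continue
        pa = n_w / total              # actual frequency Pa(w)
        result[word] = pa / ps        # relative frequency Pr(w)
    return result
```

A word with Pr(w) well above 1 is used more often in the text than the assumed standard frequency would predict, which is what makes it a candidate for a statistically significant word.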
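
The contextual analysis described above counts pairs of words that occur within a maximum distance of each other and ranks the pairs by number of instances. A minimal Python sketch follows. The function name `rank_word_pairs` is hypothetical, pairs are treated as unordered, and a pair of identical words is ignored; those choices are assumptions, since the description does not settle them.

```python
from collections import Counter

def rank_word_pairs(words, max_distance=10):
    """Rank unordered word pairs by number of co-occurrences.

    Distance is the number of intervening words, so two adjacent
    words have distance 0. Only pairs with distance < max_distance
    are counted.
    """
    counts = Counter()
    for i, wi in enumerate(words):
        # j - i - 1 intervening words must be < max_distance
        last = min(len(words), i + max_distance + 1)
        for j in range(i + 1, last):
            wj = words[j]
            if wi != wj:
                counts[tuple(sorted((wi, wj)))] += 1
    # Descending order of instances: contextually significant pairs first
    return counts.most_common()
```

The default of 10 reflects the empirical threshold mentioned in the description; it is exposed as a parameter because the text calls it "about 10 words".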
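
The word cone described above is a tree in which each node holds a word, a frequency value and a pointer to its parent, and in which recording a word also increments its ancestors. The sketch below illustrates that idea together with the threshold-based node comparison. The names `Node`, `record`, `frequencies` and `match_fraction`, the normalisation by the root count, and the assumption that each word appears in only one node are all choices made for this illustration.

```python
class Node:
    """One word cone node: a word, a count, and parent/child pointers."""
    def __init__(self, word, parent=None):
        self.word = word
        self.count = 0
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def record(node, n=1):
    """Increment a node and every ancestor up to the root."""
    while node is not None:
        node.count += n
        node = node.parent

def frequencies(root):
    """Map word -> count / root count, for nodes with a non-zero count."""
    total = root.count or 1
    out, stack = {}, [root]
    while stack:
        node = stack.pop()
        if node.count:
            out[node.word] = node.count / total
        stack.extend(node.children)
    return out

def match_fraction(freq_a, freq_b, tolerance=0.1):
    """Fraction of shared words whose frequencies agree within tolerance.

    Words present in only one cone are not compared.
    """
    common = freq_a.keys() & freq_b.keys()
    if not common:
        return 0.0
    matched = sum(1 for w in common
                  if abs(freq_a[w] - freq_b[w]) <= tolerance)
    return matched / len(common)
```

Restricting the match to a subject area, as the description suggests for "sports", would amount to calling `frequencies` on the sub-tree root instead of the cone root. How large a `match_fraction` counts as "sufficiently matched" is left open by the description and is not fixed here.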
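
The scoring methods described above (a value between 0 and 1 obtained by dividing the user's word count in a category by the global word count, the same ratio summed over subcategories for a parent, and the cruder Boolean flag) can be sketched as below. The function names and the dictionary layout, which maps a category name to a word count, are assumptions made for illustration.

```python
def interest_score(user_words, global_words, categories):
    """User word count over global word count, summed across categories.

    Pass a single category for a leaf score, or all subcategories of a
    parent to get the parent's score.
    """
    user_total = sum(user_words.get(c, 0) for c in categories)
    global_total = sum(global_words.get(c, 0) for c in categories)
    return user_total / global_total if global_total else 0.0

def interest_flag(user_words, categories):
    """Boolean variant: TRUE if any listed category has user entries."""
    return any(user_words.get(c, 0) > 0 for c in categories)
```

For example, with a user count of 50 words under "football" against a global count of 200, the leaf score is 0.25; adding a "hockey" subcategory with 0 user words and 300 global words gives a parent score of 50 / 500 = 0.1.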
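
The description's example of inferring a search subject from keywords ("basketball, knee and arthroscopic" pointing to "medicine" rather than "sports") can be illustrated with a simple coverage count. This is only a sketch: the function name `likely_subject` and the flat mapping from subject to word set are hypothetical stand-ins for walking the sub-trees of a real word cone.

```python
def likely_subject(keywords, subject_words):
    """Return the subject whose word set covers the most keywords.

    subject_words maps a subject name to the set of words found in
    that subject's sub-tree. Returns None when nothing matches.
    """
    scores = {
        subject: sum(1 for k in keywords if k in words)
        for subject, words in subject_words.items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A fuller version would also weigh the expected word frequencies within each subject, as the description suggests, rather than counting membership alone.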

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method of indexing documents and characterizing a computer user by detecting the frequency of use of words in a language, and of comparing the relevance of documents to a search query by comparing the frequency of use of words in the document with words in the search query.
PCT/GB2007/003418 2006-09-11 2007-09-11 Method and system for filtering and searching data using word frequencies WO2008032037A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US82520106P 2006-09-11 2006-09-11
US60/825,201 2006-09-11

Publications (1)

Publication Number Publication Date
WO2008032037A1 true WO2008032037A1 (fr) 2008-03-20

Family

ID=38621708

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/003418 WO2008032037A1 (fr) 2006-09-11 2007-09-11 Method and system for filtering and searching data using word frequencies

Country Status (1)

Country Link
WO (1) WO2008032037A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7949647B2 (en) 2008-11-26 2011-05-24 Yahoo! Inc. Navigation assistance for search engines
US20140289389A1 (en) * 2012-02-29 2014-09-25 William Brandon George Systems And Methods For Analysis of Content Items
EP2840515A4 (fr) * 2012-04-17 2015-09-23 Tencent Tech Shenzhen Co Ltd Method, device and computer storage media for collecting user preference information
CN110457917A (zh) * 2019-01-09 2019-11-15 腾讯科技(深圳)有限公司 Method for filtering illegal content from blockchain data and related apparatus

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6981040B1 (en) * 1999-12-28 2005-12-27 Utopy, Inc. Automatic, personalized online information and product services

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6981040B1 (en) * 1999-12-28 2005-12-27 Utopy, Inc. Automatic, personalized online information and product services

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AHU SIEG, BAMSHAD MOBASHER, AND ROBIN BURKE: "Inferring User's Information Context: Integrating User Profiles and Concept Hierarchies", IN PROC. OF THE 2004 MEETING OF THE INTERNATIONAL FEDERATION OF CLASSIFICATION SOCIETIES, CHICAGO, IL, JULY 2004, 15 July 2004 (2004-07-15) - 18 July 2004 (2004-07-18), pages 1 - 12, XP002457374, Retrieved from the Internet <URL:http://maya.cs.depaul.edu/~mobasher/papers/arch-ifcs2004.pdf> [retrieved on 20071031] *
J.C. BOTTRAUD, G. BISSON, M.F. BRUANDET: "An Adaptive Information Research Personal Assistant", IN PROC. OF WORKSHOP ARTIFICIAL INTELLIGENCE, INFORMATION ACCESS AND MOBILE COMPUTING, ACAPULCO, MEXICO, 2003, 11 August 2003 (2003-08-11), pages 48 - 58, XP002457373, Retrieved from the Internet <URL:http://users.dimi.uniud.it/workshop/ai2ia/cameraready/bottraud.pdf> [retrieved on 20071030] *
NORIHIDE SHINAGAWA ET AL: "Dynamic Generation and Browsing of Virtual WWW Space Based on User Profiles", INTERNET APPLICATIONS LECTURE NOTES IN COMPUTER SCIENCE;;LNCS, SPRINGER-VERLAG, BE, vol. 1749, 2004, pages 93 - 108, XP019001155, ISBN: 3-540-66903-5 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7949647B2 (en) 2008-11-26 2011-05-24 Yahoo! Inc. Navigation assistance for search engines
US8484184B2 (en) 2008-11-26 2013-07-09 Yahoo! Inc. Navigation assistance for search engines
US20140289389A1 (en) * 2012-02-29 2014-09-25 William Brandon George Systems And Methods For Analysis of Content Items
US9514461B2 (en) * 2012-02-29 2016-12-06 Adobe Systems Incorporated Systems and methods for analysis of content items
EP2840515A4 (fr) * 2012-04-17 2015-09-23 Tencent Tech Shenzhen Co Ltd Method, device and computer storage media for collecting user preference information
CN110457917A (zh) * 2019-01-09 2019-11-15 腾讯科技(深圳)有限公司 Method for filtering illegal content from blockchain data and related apparatus
CN110457917B (zh) * 2019-01-09 2022-12-09 腾讯科技(深圳)有限公司 Method for filtering illegal content from blockchain data and related apparatus

Similar Documents

Publication Publication Date Title
US9355185B2 (en) Infinite browse
KR101171405B1 (ko) Customization of placed content ordering in search results
JP5329900B2 (ja) Method for disclosing digital information in a subject area
US9165060B2 (en) Content creation and management system
US10102307B2 (en) Method and system for multi-phase ranking for content personalization
CA2833359C (fr) Analyzing content to determine context and serving relevant content based on the context
KR101793222B1 (ko) Updating a search index used to facilitate application searches
US8321278B2 (en) Targeted advertisements based on user profiles and page profile
US7451135B2 (en) System and method for retrieving and displaying information relating to electronic documents available from an informational network
US20070214133A1 (en) Methods for filtering data and filling in missing data using nonlinear inference
US20060155751A1 (en) System and method for document analysis, processing and information extraction
WO2018040069A1 (fr) Information recommendation system and method
WO2001025947A1 (fr) Method for dynamically recommending web sites and answering user queries based on affinity groups
KR20050049750A (ko) Online advertising system and method
Liu et al. Real‐time user interest modeling for real‐time ranking
WO2010087882A1 (fr) Personalization engine for building a user profile
WO2008032037A1 (fr) Method and system for filtering and searching data using word frequencies
US20140278983A1 (en) Using entity repository to enhance advertisement display
Wen Development of personalized online systems for web search, recommendations, and e-commerce
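JP2017182746A (ja) Information providing server device, program, and information providing method
JP6228425B2 (ja) Advertisement generation device and advertisement generation method
KR101083669B1 (ko) Expert website search system using the Internet and method thereof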
Ashkan et al. Location-and Query-Aware Modeling of Browsing and Click Behavior in Sponsored Search
Giuliani Studying, developing, and experimenting contextual advertising systems
EP1797499A2 (fr) System and method for document analysis, processing and information extraction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07804216

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07804216

Country of ref document: EP

Kind code of ref document: A1