WO2006034222A2 - System and method for document analysis, processing and information extraction

Info

Publication number
WO2006034222A2
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
request
data elements
information
web pages
Prior art date
Application number
PCT/US2005/033526
Other languages
English (en)
Other versions
WO2006034222A3 (fr)
Inventor
Frank Geshwind
Andreas C. Coppi
William G. Fateley
Nicholas Black
Zydrunas Gimbutas
Marya R. Doery
Original Assignee
Plain Sight Systems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US11/165,633 (US20060004753A1)
Application filed by Plain Sight Systems, Inc.
Priority to EP05800792A (EP1797499A2)
Publication of WO2006034222A2
Publication of WO2006034222A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80: Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83: Querying
    • G06F16/832: Query formulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80: Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83: Querying
    • G06F16/835: Query processing
    • G06F16/8365: Query optimisation

Definitions

  • the present invention relates generally to database searching, data organization, information extraction, and data features extraction. More particularly, the present invention relates to personalized search of databases including intranets and the Internet, and to mathematically motivated techniques for efficiently empirically discovering useful metric structures in high-dimensional data, and for the computationally efficient exploitation of such structures.
  • the methods disclosed relate as well to improvement of information retrieval processes generally, by providing methods of augmenting these processes with additional information that refines the scope of the information to be retrieved.
  • Search terms have different meanings in different contexts.
  • Prior art search engines, such as Google, typically use a single method of interpretation and scoring of search results.
  • the most popular meaning of a particular search term will end up being prioritized over alternate, less popular, meanings.
  • the search query term "gates" may mean "logic gates", "Bill Gates", "wrought-iron gates", etc.
  • the addition of extra keywords could serve to disambiguate the search query.
  • a user does not realize that these extra terms are needed, or otherwise does not wish to put in the time or effort perfecting the search query.
  • data mining as used herein broadly refers to the methods of data organization and subset and feature extraction. Furthermore, the kinds of data described or used in data mining are referred to as (sets of) "digital documents." Note that this phrase is used for conceptual illustration only, can refer to any type of data, and is not meant to imply that the data in question are necessarily formally documents, nor that the data in question are necessarily digital data. "Digital documents" in the traditional sense of the phrase are certainly interesting examples of the kinds of data that are addressed herein.
  • the search term "gates" could be rewritten for a CMOS technologist as "logic gates OR CMOS gates", while it could be rewritten as "Bill Gates" for an operating system software business pundit, and "iron gates" for a wrought-iron specialist. For users with multiple interests, several forms could be used.
  • This augmentation can then be used to construct a second search query: the augmented query.
  • a corpus of documents may be used that consists of baseball news articles, baseball encyclopedia entries, baseball website content & blogs, and the like.
  • an embodiment of the present invention comprises a search query rewriting system which takes as input a first query.
  • the first query is used to run a first search on a first corpus of documents, returning a first subset of documents in response to the first search.
  • Word frequency statistics are computed for the first subset of documents. These statistics are compared with the corresponding word frequency statistics for the corpus as a whole, or for the language as a whole.
  • Resultant words are identified for which the difference between the word's frequency in the first subset of documents, as compared with the corresponding whole-corpus or whole-language frequencies, is largest (e.g. above a given threshold, or, say, the 5 largest).
  • a second query is formed consisting of the first query, Boolean connectors, and the resultant words (e.g. <first query> AND word1 OR word2 OR ... OR word5).
  • a second search is then run on a second one or more corpora of documents, for example on the Internet. The second search is a search for documents that match the second query. The results of the second search are returned to the user.
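The query rewriting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `mean_word_frequencies` and `rewrite_query`, the whitespace tokenizer, and the naive "contains every query term" first search are all hypothetical choices. The resultant words are grouped in parentheses, as in the "nail AND (polish OR beauty OR manicure)" example given later in the text.

```python
from collections import Counter

def mean_word_frequencies(docs):
    """Average, over documents, of each word's relative frequency within a document."""
    totals = Counter()
    for text in docs:
        words = text.lower().split()
        if not words:
            continue
        for word, count in Counter(words).items():
            totals[word] += count / len(words)
    n = max(len(docs), 1)
    return {word: total / n for word, total in totals.items()}

def rewrite_query(first_query, corpus, k=5):
    """Return '<first query> AND (w1 OR ... OR wk)' based on frequency contrast."""
    terms = first_query.lower().split()
    # First search (assumption): keep documents that contain every query term.
    subset = [d for d in corpus if all(t in d.lower().split() for t in terms)]
    if not subset:
        return first_query
    f_subset = mean_word_frequencies(subset)
    f_corpus = mean_word_frequencies(corpus)
    # Difference between frequency in the first subset and in the whole corpus.
    diff = {w: f - f_corpus.get(w, 0.0) for w, f in f_subset.items() if w not in terms}
    top = sorted(diff, key=lambda w: (-diff[w], w))[:k]
    top = [w for w in top if diff[w] > 0]
    if not top:
        return first_query
    return f"{first_query} AND ({' OR '.join(top)})"
```

In practice the first and second searches would be delegated to real search engines; the in-memory list here only stands in for the first corpus.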
  • we sometimes refer broadly to the class of embodiments described in this paragraph as fr_matr_bin-type. This name comes from the name of a particular set of algorithms within the broad class, but the term "fr_matr_bin-type" is meant to refer to this general class of embodiments just described.
  • an embodiment of the present invention comprises a search by example system.
  • a search engine is disposed to search through a corpus of digital music files.
  • the system has pre-computed a set of numerical coordinates that characterize various standard aspects of the file.
  • the embodiment can treat the corpus of data as a set of points in a high dimensional space.
  • Such characteristic numerical coordinates are known to those of skill in the art, and include, but are not limited to, timbral Fourier, MERL and cepstral coefficients, Hidden Markov Model parameters, dynamic range vs. time parameters, etc.
  • a user specifies a few music files from the corpus of digital music files.
  • the embodiment then characterizes the coordinates of the subset of points associated with the specified few music files, and selects a region or set of directions in the high dimensional space that are characteristic of the contrast between the subset of points, and the full set of points corresponding to the whole corpus.
  • the embodiment selects those other points that are also within or near the region, or are also disposed along the directions in the high dimensional space, and the music files (or, e.g., a list of pointers or indexes thereto) corresponding to the data points are returned as the results of the improved "query by example".
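One simple way to realize the "query by example" idea is sketched below. It is an illustration only: the contrast direction is taken to be the standardized difference between the mean of the example points and the mean of the whole corpus, which is an assumption and not necessarily the region or direction selection the text has in mind. The name `query_by_example` and the feature vectors are hypothetical.

```python
import math

def query_by_example(points, example_ids, n_results=3):
    """Rank non-example points along the direction contrasting the examples with the corpus.

    points: dict mapping an item id to a list of numerical coordinates.
    example_ids: ids of the few items the user specified.
    """
    ids = list(points)
    dim = len(points[ids[0]])
    mean = [sum(points[i][j] for i in ids) / len(ids) for j in range(dim)]
    # Per-coordinate standard deviation; fall back to 1.0 for constant coordinates.
    std = [math.sqrt(sum((points[i][j] - mean[j]) ** 2 for i in ids) / len(ids)) or 1.0
           for j in range(dim)]
    ex_mean = [sum(points[i][j] for i in example_ids) / len(example_ids) for j in range(dim)]
    # Contrast direction, expressed in standardized coordinates.
    direction = [(ex_mean[j] - mean[j]) / std[j] for j in range(dim)]

    def score(i):
        return sum(direction[j] * (points[i][j] - mean[j]) / std[j] for j in range(dim))

    candidates = [i for i in ids if i not in example_ids]
    return sorted(candidates, key=lambda i: (-score(i), i))[:n_results]
```

The returned ids play the role of the "list of pointers or indexes" to the matching music files.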
  • fr_matr_bin-type embodiments relate in part to methods for finding objects that have similarity or affinity to some other target objects or search query results.
  • diffusion geometries also relate in part to methods for finding similarity or affinity between objects. In this regard, elements disclosed herein relating to the use of fr_matr_bin-type embodiments on the one hand, and on the other hand elements disclosed herein relating to the use of diffusion geometry, can be interchanged.
  • corpora (5) and (9) of data are used to add meaning to the query.
  • corpora (5) and (9) be a "rich enough" statistical sample of the full set of documents (i.e., music files). It is appreciated that this "rich enough" statistical sample can be accomplished in a number of ways standard in the art. For example, the statistical sample can be obtained iteratively by trying a small subset, collecting and storing the results of a number of typical/popular queries, and then adding more documents at random and performing the same typical/popular queries. If the results are roughly the same, then stop adding more documents.
  • results are not roughly the same, then add more documents at random until the process stabilizes, i.e., results are roughly the same.
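The iterative sampling procedure can be sketched as follows. The stopping rule is an assumption: "roughly the same" is interpreted here as the fraction of matching documents per query changing by no more than `tol` between rounds. The names `build_stable_sample`, `search`, `start`, `step` and `tol` are hypothetical.

```python
import random

def build_stable_sample(all_docs, queries, search, start=20, step=20, tol=0.1, seed=0):
    """Grow a random sample until the per-query match fraction stops changing.

    search(query, docs) must return the list of documents in docs matching query.
    queries must be a non-empty list of typical/popular queries.
    """
    rng = random.Random(seed)
    pool = list(all_docs)
    rng.shuffle(pool)
    sample, rest = pool[:start], pool[start:]

    def profile(docs):
        return [len(search(q, docs)) / max(len(docs), 1) for q in queries]

    previous = profile(sample)
    while rest:
        # Add more documents at random, then repeat the same queries.
        sample, rest = sample + rest[:step], rest[step:]
        current = profile(sample)
        if max(abs(a - b) for a, b in zip(previous, current)) <= tol:
            break  # results are roughly the same: stop adding documents
        previous = current
    return sample
```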
  • the present invention characterizes the music files with "extra features" to compute music affinity (or generally, music "meaning") or obtain a "rich enough" statistical sample (i.e., in the corpora (5) and (9)).
  • the corpus (13) of music files necessary to perform information retrieval needs to be a full set of all available documents (i.e., music files), but the present invention, at least in certain embodiments, does not need to characterize these music files with "extra features" as with the corpora (5) and (9).
  • the systems and methods described herein are applicable to diffusion geometry and document analysis, processing and information extraction. These methods and systems are applicable at least in the case in which, as is typical, the given data to be analyzed can be thought of as a collection of data objects, and for which there is some at least rudimentary notion of what it means for two data objects to be similar, close to each other, or nearby.
  • the present invention relates to the fact that certain notions of similarity or nearness of data objects (including but not limited to conventional Euclidean metrics or similarity measures such as correlation, and many others described below) are not a priori very useful inference tools for sorting high dimensional data.
  • data mining and information extraction from digital documents can be considerably enhanced by using the techniques described herein.
  • the techniques relate to augmenting given similarity or nearness concepts or measures with empirically derived diffusion geometries, as further defined and described herein.
  • An aspect of the present invention relates to the fact that, without the present invention, it is not practical to compute or use diffusion distances on high dimensional data. This is because standard computations of the diffusion metric require d*n^2 or even d*n^3 computations, where d is the dimension of the data, and n is the number of data points. This would be expected because there are O(n^2) pairs of points, so one might believe that it is necessary to perform at least n^2 operations to compute all pairwise distances.
  • an embodiment of the present invention provides a method for computing a dataset that is often in linear time O(n), from which approximations to these distances, to within any desired precision, can be computed in fixed time.
  • An embodiment of the present invention provides a data driven self-induced multiscale organization of data in which different time/scale parameters correspond to different representations of the data structure at different levels of granularity, while preserving microscopic similarity relations.
  • Examples of digital documents in this broad sense could be, but are not limited to, an almost unlimited variety of possibilities such as sets of object-oriented data objects on a computer, sets of web pages on the world wide web, sets of document files on a computer, sets of vectors in a vector space, sets of points in a metric space, sets of digital or analog signals or functions, sets of financial histories of various kinds (e.g. stock prices over time), sets of readouts from a scientific instrument, sets of images, sets of videos, sets of audio clips or streams, one or more graphs (i.e. collections of nodes and links), consumer data, relational databases, to name just a few.
  • a vector could be represented, but is not limited to being represented, as an ordered n-tuple of floating point numbers, stored in a computer.
  • a function could be represented, but is not limited to be represented, as a sequence of samples of the function, or coefficients of the function in some given basis, or as symbolic expressions given by algebraic, trigonometric, transcendental and other standard or well defined function expressions.
  • Such digital documents typically exceed 100 dimensions.
  • the present invention initially restricts the use of given metrics (i.e. notions of similarity, etc) only to the case of very strong similarity between documents, a similarity for which inference is self evident and robust.
  • Such similarity relations are then extended to documents that are not directly and obviously related by analyzing all possible chains of links or similarities connecting them.
  • This is achieved through the use of diffusion processes (processes that are analogous to heat-flow in a mathematical sense that will be described herein), and this leads to a very simple and robust quantity that can be measured as an ordinary Euclidean distance in a low dimensional embedding of the data.
  • this embedding is referred to as a "diffusion map" and the distance thereby defined as a "diffusion metric."
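To make the idea concrete, the sketch below computes diffusion distances directly from powers of a Markov matrix, using the commonly cited definition D_t(x, y)^2 = sum over z of (p_t(x, z) - p_t(y, z))^2 / pi(z), where p_t is the t-step transition probability and pi the stationary distribution. Several choices are assumptions made for illustration: the Gaussian kernel, the `radius` cutoff that keeps only strong similarities, and the number of steps `t`. This brute-force version costs on the order of n^3 operations per step, so it shows the quantity being measured rather than the efficient computation, and it skips the low dimensional embedding.

```python
import math

def diffusion_distances(points, radius, t=3):
    """Pairwise diffusion distances after t steps of a random walk on the data."""
    n = len(points)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Affinity kernel restricted to very similar points: zero beyond `radius`.
    k = [[math.exp(-sqdist(points[i], points[j]) / radius ** 2)
          if sqdist(points[i], points[j]) <= radius ** 2 else 0.0
          for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in k]          # always >= 1 because k[i][i] == 1
    total = sum(degree)
    p = [[k[i][j] / degree[i] for j in range(n)] for i in range(n)]  # Markov matrix

    def matmul(a, b):
        return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
                for i in range(n)]

    pt = p
    for _ in range(t - 1):                    # pt = p^t: chains of t links
        pt = matmul(pt, p)
    pi = [d / total for d in degree]          # stationary distribution
    return [[math.sqrt(sum((pt[i][z] - pt[j][z]) ** 2 / pi[z] for z in range(n)))
             for j in range(n)] for i in range(n)]
```

Points that are not directly similar still end up close in this metric when many short chains of strong similarities connect them.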
  • the present invention relates in part to influencing the position or presence on a search result list generated by a computer network search engine and for influencing a position or presence or placement within an advertising section of document or rendering of a document or meta-document on a computer network.
  • systems and methods are disclosed for enabling information providers using a computer network such as the Internet to influence a position for a search listing within a search result list generated by a computer network search engine and for influencing a position or presence or placement of a listing within a document or rendering of a document or meta-document on a computer network.
  • listing refers to any digital document content that a provider wishes to have listed, rendered, displayed, or otherwise delivered using a computer network, by one practicing the present invention. Such a listing can be, but is not limited to banner advertisements, text advertisements, video clips and other media, and can be as simple as a link to another web page or web site.
  • advertising opportunity refers to any instance where there is an opportunity to position a search listing, or position, place or present a listing within an advertising or other section within a document or rendering of a document or meta-document on a computer network.
  • advertising refers to any act of listing, rendering, displaying, or otherwise delivering a listing or other content using a computer network, in exchange for compensation or other value.
  • the present invention relates to the strategic matching of online content for optimization of collaborative opportunities for one web page or web site to display content related to another web page or web site. Examples of such use include, but are not limited to:
  • the system and method provides a database having accounts for the listing providers.
  • Each account contains contact and billing information for a listing provider.
  • each account contains at least one search listing having at least two components: 1. at least one digital document describing the product, service or other listing to be positioned, placed, or presented; and 2. a bid amount, which is preferably a money amount, for a listing.
  • the listing provider may add, delete, or modify a search listing after logging into his or her account via an authentication process.
  • the present invention includes methods for determining the eligibility of any listing for any given advertising opportunity. During an advertising opportunity, the selection of, or positioning of a listing is influenced by a continuous online competitive bidding process.
  • the bidding process occurs whenever an advertising opportunity arises.
  • the system and method of the present invention compares all bid amounts for those listings eligible for the advertising opportunity in question, and generates a rank value for all eligible listings.
  • the rank value generated by the bidding process determines where the network information provider's listing will appear in the context determined by the advertising opportunity. A higher bid by a network information provider will result in a higher rank value and a more advantageous placement.
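The bidding step can be illustrated with a small sketch. The `Listing` fields mirror the two components named above (a document and a bid amount); the class name, the `is_eligible` callback, the tie-break on provider name, and the convention that rank 1 is the most advantageous placement are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Listing:
    provider: str    # account holder (hypothetical field name)
    document: str    # digital document describing the product or service
    bid: float       # bid amount, assumed to be a money amount

def rank_listings(listings, is_eligible):
    """Return (rank, listing) pairs for eligible listings, highest bid first."""
    eligible = [item for item in listings if is_eligible(item)]
    # Ties are broken by provider name so the ordering is deterministic.
    ordered = sorted(eligible, key=lambda item: (-item.bid, item.provider))
    return list(enumerate(ordered, start=1))
```

For example, `rank_listings(listings, lambda item: "nail" in item.document)` would rank only the listings whose document mentions "nail" for a given advertising opportunity.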
  • advertisements are placed by a method that uses keywords, but keywords can be ambiguous.
  • the keyword "nails" might bring up advertisements for hardware stores in these prior art systems, even when searched from a website about women's beauty, where results about nail polish, etc, are more appropriate as top advertisements.
  • methods and systems as disclosed herein which, in part, are able to resolve such ambiguities.
  • the diffusion geometric techniques and other techniques disclosed herein provide a new and novel means of displaying advertisements that are related to content and for which preferential positioning of the advertisements displayed can be determined by relevance to the context, as well as influenced by a bidding process or other economic considerations. Algorithms for preferential positioning of advertisements, etc, are disclosed herein.
  • An aspect of the present invention relates to the application of the above algorithm and related ones, to the problem of automatically designing or augmenting the links within a single company's web site.
  • Web companies often wish to increase the amount of traffic on their web sites, and the amount of time and volume of data viewed by customers of their sites.
  • Offering links from pages on the site to related pages on the site provides a proactive replacement for an outside search engine. Users will be able to find what they need (e.g. if they enter a site from the result of a search engine), and then find related information, and thus be motivated to "explore" the site. This is true for sites in general, and also specifically when the site in question is one that contains catalog-like or other listings of products and services. In a store, customers often begin shopping by looking at one product but end up buying another product. By having tight links between related products, online sites can achieve this same "emotional buying" phenomenon.
  • An aspect of the present invention relates to the application of the above algorithm and related ones, to the problem of automatically designing or augmenting the links between two or more companies' web sites.
  • Web companies often wish to increase the amount of traffic that they receive from or provide to affiliated sites.
  • the present invention provides a method to design or augment the links between these sites, thereby linking related content, and organically increasing this traffic.
  • One skilled in the art will see how to do this, and how it results in economic benefit to the parties in question, each in a way analogous to the case described in the previous paragraph.
  • the request is modified based on the additional information to refine the scope of information to be retrieved from a second corpus of data elements.
  • the information is retrieved from the second corpus of data elements based on the modified request.
  • a method of influencing traffic between predetermined web pages comprises the steps of: determining diffusion geometry coordinates of a set of web pages, the set of web pages comprising at least one of the predetermined web pages; and determining links between the web pages based on the diffusion geometry coordinates.
  • a computer readable medium comprises code for retrieving information in response to an information retrieval request, the code comprising instructions for: extracting additional information from a first corpus of data elements based on the request; modifying the request based on the additional information to refine the scope of information to be retrieved from a second corpus of data elements; and retrieving information from the second corpus of data elements based on the modified request.
  • a computer readable medium comprises code for influencing traffic between predetermined web pages, the code comprising instructions for: determining diffusion geometry coordinates of a set of web pages, the set of web pages comprising at least one of the predetermined web pages; and determining links between the web pages based on the diffusion geometry coordinates.
  • a system for retrieving information in response to an information retrieval request comprises: an extracting module for extracting additional information from a first corpus of data elements based on the request; a processing module for modifying the request based on the additional information to refine the scope of information to be retrieved from a second corpus of data elements; and a retrieving module for retrieving information from the second corpus of data elements based on the modified request.
  • a system for influencing traffic between predetermined web pages comprises a processing module for determining diffusion geometry coordinates of a set of web pages, the set of web pages comprising at least one of the predetermined web pages; and determining links between the web pages based on the diffusion geometry coordinates.
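The link-determination step can be sketched once diffusion geometry coordinates are available. In the example below the coordinates are simply given as input, and each page is linked to its `k` nearest pages by Euclidean distance in those coordinates. The nearest-neighbor rule, the name `suggest_links`, and the page identifiers are assumptions for illustration.

```python
import math

def suggest_links(coords, k=2):
    """Map each page to its k nearest pages in the given coordinate space.

    coords: dict mapping a page identifier to a tuple of coordinates.
    """
    links = {}
    for page, xy in coords.items():
        others = [(math.dist(xy, other_xy), other)
                  for other, other_xy in coords.items() if other != page]
        # Sorting (distance, page) pairs gives nearest pages first.
        links[page] = [other for _, other in sorted(others)[:k]]
    return links
```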
  • Fig. 1 shows a block diagram of a contextualized search engine in accordance with an embodiment of the present invention
  • Fig. 2 shows a schematic representation of an imagined forest, with trees and shrubs, presumed to burn at different rates
  • Fig. 3 shows an exemplary flow chart for computing multiscale diffusion geometry in accordance with an embodiment of the present invention
  • Fig. 4 illustrates a Public Find Similar Document Internet Utility in accordance with an embodiment of the present invention.
  • Step 110 A user (1) enters a first search query (2) into a search query user interface (3).
  • Step 120 The query (2) is sent to a first search engine (4).
  • Step 130 The first search engine (4) performs a search on a first one or more corpora of documents (5) using the query (2).
  • Step 140 Mean word frequencies f0 (6) are computed on the set of documents returned by the first search engine (4).
  • Step 150 Mean word frequencies f1 (10) are computed for a second one or more corpora of documents (9). (It is appreciated that this step can be done once at initialization.)
  • Step 170 The set of words (8) is identified corresponding to those top K words for which d (7) is greatest (for some fixed parameter K), or e.g., to those words for which d is greater than some threshold t (for some fixed parameter t).
  • Step 180 A new search query (11) is defined by combining the first query
  • the new search query (11) could be "nail AND (polish OR beauty OR manicure)".
  • Other algorithms for this combination are disclosed herein.
  • Step 190 The new query is sent to a second search engine (12) disposed to search a third one or more corpora of documents (13).
  • Step 200 The results returned by the second search engine (12) are displayed on a search result user interface (14).
  • the corpora (9) represent the language as a whole. For example, if the target searches are conducted in English, then corpora (9) can be a random sample of documents in the English language.
  • the corpora (5) are used to define the subject(s) of interest to the user of the search. For example, if the subject of interest is Major League Baseball, then the documents in question can be a web crawl of www.mlb.com, as well as news articles, encyclopedia articles, etc, on the subject of baseball.
  • the algorithm of the present invention acts to find those words which are much more likely to occur in documents that meet the first search query criteria, within the subject(s) of interest to the user of the search, as compared with the generic occurrence of the words within the target search language as a whole.
  • the corpora (9) can be taken to be the same as (5).
  • the algorithm of the present invention acts to find those words which are much more likely to occur in documents that meet the first search query criteria, within the subject(s) of interest to the user of the search, as compared with the generic occurrence of the words within the subject(s) of interest to the user of the search.
  • the corpora (13) can be, in certain embodiments, the entire Internet, or the set of documents indexed by a public or private search engine. Since, in certain embodiments, the algorithm of the present invention takes a first search query, and produces a second search query, each suitable for full text search, these queries can be passed to search engines via techniques standard in the art, including but not limited to HTTP requests and/or network interfaces such as SOAP. The results returned by these search engines can be displayed as is standard in the art, including but not limited to display in a browser by rendering results encoded with HTML, XML, Java, JavaScript, Python, Perl, PHP, etc.
  • provisions are made to correct spelling errors. This can be done, for example, by using SOUNDEX scores to identify words that are misspelled but are most likely meant to be other given words.
  • One can also employ other techniques, such as a list of commonly misspelled words, phrases and queries. In the present context, statistics and other information, including but not limited to information from the corpora and/or the search logs, can be used to identify misspellings and likely suggested replacements for input queries. Spelling errors in the corpora can also be flagged and automatically, semi-automatically, partially-assisted or manually corrected.
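A minimal sketch of Soundex-based correction follows. The `soundex` function implements the standard American Soundex code (first letter plus three digits). The `suggest_spelling` helper, and its rule of picking the most frequent vocabulary word with the same code, are assumptions for illustration; the vocabulary counts would come from the corpora or search logs.

```python
CODES = {c: d for d, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                 "4": "l", "5": "mn", "6": "r"}.items() for c in letters}

def soundex(word):
    """Standard four-character American Soundex code, e.g. 'Robert' gives 'R163'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue              # h and w do not separate letters with equal codes
        code = CODES.get(c, "")   # vowels (and y) have no code
        if code and code != prev:
            digits.append(code)
        prev = code               # a vowel resets prev, so a repeated code counts again
    return (first.upper() + "".join(digits) + "000")[:4]

def suggest_spelling(word, vocabulary):
    """Return the most frequent vocabulary word sharing the Soundex code, else the word."""
    if word in vocabulary:
        return word
    matches = [w for w in vocabulary if soundex(w) == soundex(word)]
    return max(matches, key=lambda w: (vocabulary[w], w)) if matches else word
```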
  • certain word frequency coefficients, or differences between word frequencies, are set to zero when they are below a given threshold. In this way, "noise" is removed from the process.
  • For example, in the case where documents are being tested for the presence of a set of words or phrases as in the search in step 130 of Fig. 1, one can take only those documents that contain the phrase more than a certain number of times. This number can be fixed, or it can be some fraction of the average number, where the average is taken, for example, over the set of documents for which the value is at least 1.
  • a corresponding type of threshold can also be applied in one or more of steps, for example to steps 170, 180 or 190.
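Both kinds of thresholding can be sketched briefly. The function names, the default `fraction`, and the use of plain substring counting are assumptions made for the example.

```python
def zero_small(values, threshold):
    """Set coefficients below `threshold` to zero, leaving the others unchanged."""
    return {word: (v if v >= threshold else 0.0) for word, v in values.items()}

def filter_by_phrase_count(docs, phrase, fraction=0.5):
    """Keep documents containing the phrase more than fraction * average times.

    The average is taken over the documents in which the phrase occurs at least once.
    Counting is by substring, so 'nail' also matches inside 'nails'.
    """
    counts = [d.lower().count(phrase.lower()) for d in docs]
    hits = [c for c in counts if c >= 1]
    if not hits:
        return []
    threshold = fraction * sum(hits) / len(hits)
    return [d for d, c in zip(docs, counts) if c > threshold]
```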
  • searches are implemented in part using sparse matrix representations. For example, given the matrix W(i,j) as described herein, for a first one or more corpora, and an initial search query based on the presence of all of the words w_1, w_2, ..., w_n, and the absence of all of the words x_1, ..., x_m, one can perform the search in step 130 by finding those rows of W that have non-zero values in all of the columns corresponding to the indices of the words w_1, ..., w_n, and have only zero values in all of the columns corresponding to the words x_1, ..., x_m.
  • Steps 140 and 150 correspond to summing a matrix over all columns. In the case of step 140, the sum is over the sub matrix of rows selected as described in this paragraph. In the case of step 150, it is, for example, a sum over a whole matrix.
  • the matrix W is sparse, and sparse matrix math is used in certain embodiments, to carry out the steps described.
  • the former is useful at least when one wants to find the words J_i that occur in a given document i.
  • the latter is useful at least when one wants to find the documents I_j that contain a particular word j. Both of these kinds of finding are used in certain embodiments as described herein.
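A dictionary-based sketch of such a sparse word-document matrix is shown below. It stores only non-zero entries and keeps both a row-wise view (the words of each document) and a column-wise view (the documents containing each word). The class name `SparseWordDocMatrix` and its methods are hypothetical; a production system would more likely use a dedicated sparse matrix library.

```python
from collections import defaultdict

class SparseWordDocMatrix:
    """W(i, j) = count of word j in document i; only non-zero entries are stored."""

    def __init__(self, docs):
        self.rows = []                 # rows[i] maps word -> count (the words J_i of document i)
        self.cols = defaultdict(set)   # cols[word] is the set I_j of documents containing it
        for i, text in enumerate(docs):
            row = defaultdict(int)
            for word in text.lower().split():
                row[word] += 1
                self.cols[word].add(i)
            self.rows.append(dict(row))

    def search(self, required, excluded=()):
        """Rows that are non-zero in every required column and zero in every excluded one."""
        hits = set(range(len(self.rows)))
        for word in required:
            hits &= self.cols.get(word, set())
        for word in excluded:
            hits -= self.cols.get(word, set())
        return sorted(hits)

    def word_totals(self, row_ids=None):
        """Per-word totals over the selected rows (all rows when row_ids is None)."""
        ids = range(len(self.rows)) if row_ids is None else row_ids
        totals = defaultdict(int)
        for i in ids:
            for word, count in self.rows[i].items():
                totals[word] += count
        return dict(totals)
```

Here `search` plays the role of step 130, `word_totals(selected_rows)` the sum used in step 140, and `word_totals()` the whole-matrix sum used in step 150.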
  • step 180 defines the new query (11) by taking the logical conjunction of the original query (2) with the logical disjunction of the set of new search terms (8). That is, if the original query (2) were represented by x, and the new search terms (8) by the set {a, b, c, ..., z} (with no assumption about the size of the set), then the new query (11) would, in the one exemplary embodiment, be (x AND a OR b OR c OR ... OR z).
  • x itself may be a compound or complex query. For example, it can be, using the notation of the Google search engine, "nails -hardware" (which means: find those documents that contain the word "nails" and do not contain the word "hardware").
  • a more varied set of output logical structures can be used. In such embodiments, the elements (6) and (8) in Fig. 1 can be replaced by elements (6') and (8') respectively as follows: (6') is collectively the word frequencies of, and a word-document matrix or similar structure that allows one to compute at least the frequency of occurrence of each word in each document. Similarly, the element (8') is collectively both the set of words corresponding to those top K words for which d (7) is greatest, together with the word-document sub-matrix (e.g. an L x K matrix, m1(i,j)) (collectively element 8').
  • the new query (11) has the form of a logical conjunction of a set of logical parts.
  • the first part is the original query x and the whole of (11) has the form (x AND A_1 OR A_2 OR ... OR A_K).
  • each of the A_i is a conjunction of those words corresponding to columns of m1 which are well correlated to column i. That is, A_1 is the set of words that are highly correlated to the word corresponding to column 1 of m1, all "AND'ed" together.
  • Similarly, A_2 is defined for the word corresponding to column 2, etc.
  • words that are highly correlated with each other when used in documents that satisfy the original search query, are required to appear together to satisfy the advanced rewritten query.
  • the absolute requirement of appearing together is relaxed to a statistical favoring of those documents for which at least some of the words appear together.
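The grouped form of the rewritten query can be sketched as follows, for the strict variant in which correlated words must appear together. Pearson correlation between columns and the cutoff `min_corr` are assumptions standing in for "well correlated"; the names `pearson` and `grouped_query` are hypothetical, and duplicate groups are emitted only once.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0

def grouped_query(x, words, m1, min_corr=0.8):
    """Build 'x AND (A_1 OR A_2 OR ...)' where each A_i ANDs the words correlated with word i.

    m1 is an L x K list of rows: m1[doc][j] is the count of words[j] in that document.
    """
    columns = [[row[j] for row in m1] for j in range(len(words))]
    groups = []
    for i in range(len(words)):
        group = tuple(w for j, w in enumerate(words)
                      if j == i or pearson(columns[i], columns[j]) >= min_corr)
        if group not in groups:
            groups.append(group)
    parts = ["(" + " AND ".join(g) + ")" if len(g) > 1 else g[0] for g in groups]
    return f"{x} AND ({' OR '.join(parts)})"
```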
  • contextualized search engines can be generated for almost any topic given the methods and systems of the present invention described herein.
  • public web directories such as DMOZ (see www.dmoz.org), that give pointers to web pages and web sites, arranged by topics and sub-topics.
  • one or more corpora of documents are obtained, at least in part, automatically or semi-automatically, by web crawling from a topic or sub topic within DMOZ, or the Google directory, or Yahoo directory, or some other directory of documents.
  • Certain embodiments of the present invention can be used, for example, to discover similarity or affinity between songs, and/or between artists, in the domain of music affinity.
  • the corpora can consist, at least in part, of set of playlists (lists of song titles).
  • individual songs take the place of individual words.
  • the playlists take the place of documents discussed herein.
  • an embodiment would select those certain playlists that contain one or many of the songs s_, and then find those songs that are more likely to occur in certain playlists, as compared with their occurrence in a generic playlist.
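For illustration only, the playlist embodiment can be sketched as a frequency-ratio computation. The function name, the `top_n` parameter, and the choice of a simple ratio as the comparison against a "generic playlist" are assumptions of this sketch.

```python
from collections import Counter

def song_affinity(playlists, seed_songs, top_n=5):
    """Rank songs by how much more often they occur in playlists that contain
    a seed song than in playlists overall (a simple frequency ratio)."""
    seeds = set(seed_songs)
    selected = [p for p in playlists if seeds & set(p)]
    if not selected:
        return []
    overall = Counter(s for p in playlists for s in set(p))
    local = Counter(s for p in selected for s in set(p))
    scores = {}
    for song, count in local.items():
        if song in seeds:
            continue
        p_local = count / len(selected)            # frequency among selected playlists
        p_overall = overall[song] / len(playlists)  # frequency among all playlists
        scores[song] = p_local / p_overall
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

Here songs play the role of words and playlists the role of documents, as stated above.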
  • a method and system for automatically discovering one or more genres associated with a target is as follows. Create one or more corpora of documents from music reviews, music enthusiasts' web pages, music liner notes, and the like. Use the one or more corpora as the element (5) in Fig. 1. Perform the first search, etc. From the resulting set of words (8), extract a subset corresponding to words that are the names of genres. Replace steps 170 - 190 by a step that filters away all words other than genre terms, and replace step 200 with a step that returns the remaining genre terms as the result to the user. These results, together with their numerical scores from the algorithm, give a weighted genre description associated with the target. For example, one can automatically find the genre(s) associated with any music artist in this way.
  • the columns of the matrix in the algorithm can be restricted to only genre words. Additionally, one can use full-text searching techniques so that multi-word genres are recognized. As a short cut in this embodiment, since there is a small finite list of genres and sub-genres, one could convert each genre "phrase" into a token using techniques standard in the art.
  • genre can be replaced with any other concept, i.e. band name, country of origin, artist, mood, etc, or any combination.
  • this algorithm applies quite generally as a means for creating an automatic ontological classifier and ontological affinity engine, and applies to all subjects, not just music.
  • the present invention relates to multiscale mathematics and harmonic analysis.
  • Such mathematics e.g., a paper by Coifman and Maggioni entitled “Multiresolution Analysis Associated to Diffusion Semigroups: Construction And Fast Algorithms” (hereinafter referred to as the "Coifman & Maggioni” reference) disclosed in the U.S.
  • structural multiscale geometric harmonic analysis refers to multiscale harmonic analysis on sets of digital documents in which empirical methods are used to create or enhance knowledge and information about metric and geometric structures on the given sets of digital documents.
  • the present invention also relates to the mathematics of linear algebra, and Markov processes, as known to one skilled in the art. See, e.g., the Coifman & Maggioni reference.
  • the techniques disclosed herein provide a framework for structural multiscale geometric harmonic analysis on digital documents (viewed, for illustration and not limiting purposes, as points in R^n or as nodes of a graph).
  • Diffusion maps are used to generate multiscale geometries in order to organize and represent complex structures.
  • Appropriately selected eigenfunctions of Markov matrices (describing local transitions, inferences, or affinities in the system) lead to macroscopic organization of the data at different scales.
  • the top such eigenfunctions are the coordinates of the diffusion map embedding.
  • a diffusion map is constructed given any measure space of points X and any appropriate kernel k(x,y) describing a relationship between points x and y lying in X.
  • the article provides anyone skilled in the art the means and methods to calculate the diffusion map, diffusion distance, etc.
  • These means and methods include, but are not limited to the following: 1) construction and computation of diffusion coordinates on a data set, and 2) construction and computation of multiscale diffusion geometry (including scaling functions and wavelets) on a data set.
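As a minimal sketch, and not the construction of the referenced papers, a diffusion distance can be computed directly from a kernel by row-normalizing it into a Markov matrix T and comparing rows of a power of T. The Gaussian kernel, the `scale` parameter, and the weighting by the stationary distribution are assumptions of this sketch; a practical implementation would use the eigenfunction or diffusion wavelet methods rather than dense matrix powers.

```python
from math import exp, sqrt

def gaussian(x, y, scale=1.0):
    """Example symmetric kernel k(x, y) on points given as tuples."""
    return exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / scale)

def markov_matrix(points, kernel):
    """Row-normalize a symmetric kernel into a Markov matrix T; also return
    the stationary distribution pi (degree divided by total degree)."""
    k = [[kernel(x, y) for y in points] for x in points]
    deg = [sum(row) for row in k]
    t = [[v / d for v in row] for row, d in zip(k, deg)]
    total = sum(deg)
    return t, [d / total for d in deg]

def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def diffusion_distance(t, pi, steps, i, j):
    """Distance between rows i and j of T**steps, weighted by 1/pi."""
    p = t
    for _ in range(steps - 1):
        p = matmul(p, t)
    return sqrt(sum((a - b) ** 2 / w for a, b, w in zip(p[i], p[j], pi)))
```

Points in the same tightly connected cluster end up close in this distance even when no single kernel value links them strongly.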
  • An optional threshold parameter ε with a default of 0: used to "denoise" T by, e.g., setting to 0 those values of T that are less than ε.
  • the thresholding step can be more sophisticated. For example, one could perform a smooth operation that sets to 0 those values less than ε_1 and preserves those values greater than ε_2, for some pair of input parameters ε_1 < ε_2. Multi-parameter smoothing and thresholding are also of use.
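The two thresholding variants described above can be sketched as follows, for illustration only. The function names and the smoothstep ramp used between the two parameters are assumptions; the disclosure states only that values below the lower parameter are set to 0 and values above the upper parameter are preserved.

```python
def hard_threshold(t, eps=0.0):
    """Set to 0 every entry of T that is less than eps.
    With the default eps=0, a non-negative T is returned unchanged."""
    return [[v if v >= eps else 0.0 for v in row] for row in t]

def smooth_threshold(t, eps1, eps2):
    """Zero entries at or below eps1, keep entries at or above eps2, and
    scale entries in between by a smoothstep ramp (assumed shape)."""
    if not eps1 < eps2:
        raise ValueError("expected eps1 < eps2")

    def ramp(v):
        if v <= eps1:
            return 0.0
        if v >= eps2:
            return v
        s = (v - eps1) / (eps2 - eps1)
        return v * s * s * (3.0 - 2.0 * s)

    return [[ramp(v) for v in row] for row in t]
```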
  • the matrix T can come from a variety of sources. One is for T to be derived from a kernel K(x,y) as described in the Coifman & Maggioni and Coifman et al. papers referenced herein. K(x,y) (and T) can be derived from a metric d(x,y), also as described in the Coifman & Maggioni and Coifman et al. papers referenced herein.
  • T can denote the connectivity matrix of a finite graph.
  • An optional threshold parameter ε with a default of 0: used to "denoise" T by, e.g., setting to 0 those values of T that are less than ε.
  • LocalGS_ε( ) is the local Gram-Schmidt algorithm described in the Coifman & Maggioni and Coifman et al. papers referenced herein (an embodiment of which is described below), but in various embodiments it can be replaced by other algorithms as described in those papers. In particular, a modified Gram-Schmidt can be used. See the Coifman & Maggioni and Coifman et al. papers referenced herein for details. Note as before that the thresholding step can be more sophisticated, and the matrix T can come from a variety of sources. See the discussion relating to the preceding algorithm described herein. A person skilled in the art will readily understand several variations and generalizations of the algorithm above, including those that are suggested and presented in the Coifman & Maggioni and Coifman et al. papers referenced herein.
  • Fig. 3 depicts the above algorithm for computing multiscale diffusion geometry as a flowchart in accordance with an embodiment of the present invention.
  • the system reads the inputs into the algorithm.
  • Various variables utilized in the algorithm are initialized in steps 1010, 1020, 1030, and 1040.
  • the system computes the local Gram-Schmidt orthonormalization in step 1060.
  • the system sets Xj to be the index set of Pj in step 1070.
  • the system computes the next power of the matrix T, restricted to and written as a matrix on the appropriate set in step 1080.
  • the system increments the loop index i in step 1090.
  • in step 1100 the system performs a loop-control test: if the stopping conditions are met, the loop exits; otherwise the system returns to step 1050.
  • the system outputs the results of the algorithm in step 1110.
  • // Q is n x m and orthogonal
  • MultiscaleDyadicOrthogonalization // ⁇: a family of functions to be orthonormalized, as in Proposition 21 // Q: a family of dyadic cubes on X // J: finest dyadic scale // 1: precision
  • the construction of the wavelets at each scale includes an orthogonalization step to find an orthonormal basis of functions for the orthogonal complement of the scaling function space at the scale into the scaling function space at the previous scale.
  • the construction of the scaling functions and wavelets allows the analysis of functions on the original graph or manifold in a multiscale fashion, generalizing the classical Euclidean, low-dimensional wavelet transform and related algorithms. In particular, the wavelet transform generalizes to a diffusion wavelet transform, allowing one to efficiently encode functions on the graph in terms of their diffusion wavelet and scaling function coefficients. In certain embodiments of the present invention, the wavelet algorithms known to those skilled in the art are practiced with diffusion wavelets as described herein.
  • functions on the graph or manifold can be compressed and denoised, for example by generalizing in the obvious way the standard algorithms (e.g. hard or soft wavelet thresholding) for these tasks based on classical wavelets.
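For reference, the standard hard and soft thresholding rules mentioned above act on a list of wavelet coefficients as sketched below. The function names are hypothetical, and the sketch assumes that the diffusion wavelet coefficients of a function have already been computed by other means.

```python
def hard_shrink(coeffs, t):
    """Hard thresholding: keep coefficients with magnitude above t, zero the rest."""
    return [c if abs(c) > t else 0.0 for c in coeffs]

def soft_shrink(coeffs, t):
    """Soft thresholding: zero small coefficients and pull the remaining ones
    toward zero by t."""
    return [(abs(c) - t) * (1.0 if c > 0 else -1.0) if abs(c) > t else 0.0
            for c in coeffs]
```

A denoised function is then obtained by applying the inverse transform to the thresholded coefficients.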
  • nodes of the graph represent a body of documents or web pages
  • user's preferences for example single-user or multi-user
  • each coordinate is a function on the graph that can be compressed and denoised, and a denoised graph, where each node has as coordinates the denoised or compressed coordinates, is obtained.
  • This allows a nonlinear structural multiscale denoising of the whole data set. For example, when applied to a noisy mesh or cloud of points, this results in a denoised mesh or cloud of points.
  • diffusion wavelets and scaling functions can be used for regression and learning tasks, for functions on the graph, this task being essentially equivalent to the tasks of compressing and denoising discussed herein.
  • standard regression algorithms known for classical wavelets can be generalized in an obvious way to algorithms working with diffusion wavelets.
  • a space or graph can be organized in a multiscale fashion as follows:
  • the method and system relate to searching web pages on the Internet and intranets, and to indexing such web pages and the web.
  • the points of the space X represent documents on the Web
  • the kernel k will be some measure of distance between documents or relevance of one document to another.
  • Such a kernel can make use of many attributes, including but not limited to those known to practitioners in the art of web searching and indexing, such as text within documents, link structures, known statistics, and affinity information to name a few.
  • Google's PageRank as described, for example, in US Patent 6,285,999, which is incorporated herein by reference in its entirety.
  • PageRank reduces the web to one dimension. It is very good for what it does, but it throws away a lot of information.
  • With the present invention, one can work at least as efficiently as PageRank, but keep the critical higher-dimensional properties of the web. These dimensions embody the multiple contexts and interdependencies that are lost when the web is distilled to a ranking system. Accordingly, the present invention opens the door to a huge number of novel web information extraction techniques.
  • the present invention is ideal for affinity-based searching, indexing and interactive searches.
  • the algorithms of the present invention go beyond the traditional interactive search, allowing more interactivity to capture the intent of the user.
  • the core algorithm is adapted to searching or indexing based on intrinsic and extrinsic information including items such as content keywords, frequencies, link popularity and other link geometry/topology factors, etc., as well as external forces such as the special interests of consumers and providers.
  • the present invention is ideally suited for addressing the problem of re-parameterizing the Internet for special interest groups, with the ability to modulate the filtering of the raw structure of the WWW to take into account the interests of paid advertisers or a group of users with common definable preferences.
  • a computer system periodically maps the multiscale geometric harmonic diffusion metric structure of the Internet, and stores this information, as well as possibly other information such as cached versions of pages, hash functions and key word indexes, in a database (hereinafter the database), analogous to the way in which contemporary search engines pre-compute page ranking and other indexing and hashing information.
  • the initial notion of proximity used to elucidate the geometric harmonic structure can be any mathematical combination of factors, including but not limited to content keywords, frequencies, link popularity and other link geometry/topology factors, etc., as well as external forces such as the special interests of consumers and providers.
  • an interface is presented to users for searching the web.
  • Web pages are found by searching the database for the key words, phrases, and other constraints given by the user's query.
  • An aspect of the present invention is that, as seen from this disclosure by one skilled in the art, the search can be accelerated by using partial results to rapidly find other hits. This can be accomplished, for example, by an algorithm that searches in a space filling path spiraling out from early search hits to find others, or, similarly, that uses diffusion techniques as discussed herein to expand on early search hits.
  • results can be presented in ways that relate to the geometry of the returned set of web pages.
  • Popularity of any particular site can be used, as is done in common practice, but this can now be augmented by any other function of the geometric harmonic data.
  • results can be presented in a variety of evident non-linear ways by representing the higher-dimensional graph of results in graphical ways standard in the art of graphic representation of metric spaces and graphs. The latter can be enhanced and augmented by the multiscale nature of the data by applying these graphical methods at multiple scales corresponding to the multiscale structures described herein, with the user controlling the choice of scale.
  • This presentation of results can also include other interactive and interface elements such as sound.
  • web search results, web indexes, and many other kinds of data can be presented in a graphical interface wherein collections of digital documents are rendered in graphical ways standard in the art of graphic representation of such documents, and combined with or using graphical ways standard in the art of graphic representation of metric spaces and graphs, and at the same time the user is presented with an interface for navigation of this graph of representations.
  • this would be analogous to database fly-through animation as is common in the art of flight simulators and other interactive rendering systems.
  • a web browser can be provided in accordance with an embodiment of the present invention, with which the user can view web pages and traverse links in these pages, in the usual way that contemporary browsers allow.
  • users can be presented with the option of jumping to another web page that is close to the current web page in diffusion distance, whether or not there is an explicit link between the pages.
  • the navigation can be accomplished in a graphical way.
  • web pages near the current web page can be clustered using standard art clustering techniques applied to the database and the diffusion distance.
  • each cluster or navigation direction can be labeled with the most popular word, words, phrases or other features common among documents in that cluster or direction.
  • certain common words such as (often) pronouns, definite and indefinite articles could be excluded from this labeling/voting.
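By way of example, the labeling/voting step can be sketched as a document-frequency count over the pages of one cluster. The stop-word list, the tokenization, and the `n_labels` parameter are illustrative assumptions; the clustering itself is taken as given.

```python
from collections import Counter

# Illustrative stop-word list (an assumption; any standard list could be used).
STOP_WORDS = {"a", "an", "the", "he", "she", "it", "they", "his", "hers", "and", "of"}

def label_cluster(documents, n_labels=3, stop_words=STOP_WORDS):
    """Label a cluster with the words that appear in the most documents.
    Each document casts at most one vote per word."""
    votes = Counter()
    for text in documents:
        words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
        votes.update(w for w in words if w and w not in stop_words)
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n_labels]]
```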
  • the present invention can be used to automatically produce a synopsis of a web page (hereinafter a contextual synopsis).
  • This can be done, for example, as follows.
  • cluster a scale-appropriate neighborhood of the web page in question. Compute the most popular text phrases among pages within the neighborhood, weighting according to diffusion distance from current location.
  • throw out generically common words unless they are especially relevant; for example, words like 'his' and 'hers' are generally less relevant, but in the colloquial phrase "his & hers fashions" these become more relevant.
  • the top N results (where N is fixed a priori, or from the numerical rank of the data), give a description of the web page.
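The synopsis steps above can be sketched as follows, for illustration and not limitation. The input format (pairs of diffusion distance and phrase list for each page in the neighborhood), the exponential weighting, and the `scale` parameter are assumptions; the disclosure requires only that phrases be weighted according to diffusion distance from the current location.

```python
from collections import defaultdict
from math import exp

def contextual_synopsis(neighbors, top_n=3, scale=1.0, stop_words=frozenset()):
    """neighbors: list of (diffusion_distance, phrases) for pages in a
    scale-appropriate neighborhood of the target page. Each page votes for
    its phrases with weight exp(-distance / scale); the top_n phrases are
    returned as the synopsis."""
    scores = defaultdict(float)
    for dist, phrases in neighbors:
        weight = exp(-dist / scale)
        for phrase in set(phrases):
            if phrase not in stop_words:
                scores[phrase] += weight
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [p for p, _ in ranked[:top_n]]
```

A phrase-level exception list (for example, keeping "his & hers fashions" while dropping "his") could be handled by extracting multi-word phrases before the stop-word filter is applied.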
  • this concept of contextual synopsis applies to all kinds of digital documents, and not just web pages.
  • the method of the present invention can be used to generate automatic reviews of new pieces of music.
  • the contextual synopsis concept allows one to compare a web page textually to its own contextual synopsis.
  • a page can be scored by computing its distance to its own contextual synopsis.
  • the resulting numerical score can be thought of as a measure analogous to the curvature of the Internet at the particular web page (hereinafter contextual curvature).
  • This information could be collected and sold as a valuable marketing analysis of the Internet.
  • Sub-manifolds given by locally extremal values of contextual curvature determine "contextual edges" on the Internet, in the sense that this is analogous to a numerical Laplacian (difference between a function at a point, and the average in a neighborhood of the point).
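The numerical Laplacian referred to above (the difference between a function at a point and its average over a neighborhood) can be sketched on a graph as follows. The dictionary-based graph representation and the function name are assumptions of this sketch; the per-page scores would be supplied by whatever text-distance measure is chosen for comparing a page to its contextual synopsis.

```python
def local_laplacian(values, neighbors):
    """values: dict mapping node -> score; neighbors: dict mapping node ->
    list of adjacent nodes. Returns node -> (value at the node minus the
    mean value over its neighbors); isolated nodes get 0.0."""
    out = {}
    for node, v in values.items():
        adj = neighbors.get(node, [])
        out[node] = (v - sum(values[n] for n in adj) / len(adj)) if adj else 0.0
    return out
```

Nodes where this quantity is locally extremal are candidates for the "contextual edges" described above.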
  • the system and method analyze the effect of proposed modifications or additions to the World Wide Web, prior to such modifications or additions being made. In its simplest form, this amounts to computing the database of diffusion metric data as already described herein, and then computing the changes in diffusion metric information that would result, were a certain set of changes to be made. Using this, one can do things including, but not limited to, computing the solution to an optimization problem stated in terms of diffusion distances. In this way, the present invention yields methods for optimizing web-site deployment.
  • the system and method incorporate information collected by web servers that gather statistics on links followed and pages visited, perhaps augmented by so-called cookies, or other means, so as to track which users have viewed which web pages, and in what order, and at what time. In its simplest form, this information is exploited by simply weighting the metric links according to their probability of being followed when constructing the initial notion of similarity from which the diffusion data are derived.
  • the system and method can be used to discover models of Internet users' surfing patterns, obviating the need for server-acquired statistics.
  • the contextual synopsis information, applied to web pages and clusters of pages, presents a model of user profiles.
  • the present invention yields a new mode of interactive web searches: hyper-interactive web searches. In accordance with an embodiment of the present invention, a method for such searches comprises presenting the user with a first diffusion geometry based web search as described herein, and then allowing the user to characterize the results from the first search as being near or far from what the user seeks.
  • the underlying distance data is then updated by adding this information as one or more additional coordinates in the n-tuples describing each web page, and using diffusion to propagate these values away from the explicit examples given by the user.
  • contextual synopsis data of the indicated web pages can be used to augment the search criteria. In this way, by using the new metric and/or the new search criteria, another modified search can be conducted. The process can be iterated until the user is satisfied.
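As one possible sketch of the feedback step, the user's "near" and "far" judgments can be encoded as +1 and -1 on the judged pages and propagated to the remaining pages by repeated application of a row-stochastic matrix T. The encoding, the number of steps, and the clamping of the user's explicit examples after each step are assumptions of this sketch, not requirements of the method.

```python
def propagate_feedback(t, feedback, steps=3):
    """t: row-stochastic matrix as a list of lists; feedback: dict mapping
    page index -> value (e.g. +1.0 for 'near', -1.0 for 'far').
    Returns one extra coordinate per page."""
    f = [feedback.get(i, 0.0) for i in range(len(t))]
    for _ in range(steps):
        f = [sum(w * v for w, v in zip(row, f)) for row in t]
        for i, v in feedback.items():  # keep the user's explicit examples fixed
            f[i] = v
    return f
```

The returned values can be appended as an additional coordinate to the n-tuple describing each web page before the distances are recomputed.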
  • a database of any sort can be analyzed in ways that are similar to the analysis of the Internet and World Wide Web described herein.
  • a static database or file system may play the role of X, with each point of X corresponding to a file.
  • the kernel in this case might be any measure useful for an organizational task - for example, similarity measures based on file size, date of creation, type, field values, data contents, keywords, similarity of values, or any mixture of known attributes may be used.
  • X can be comprised of a library of music recordings, and the kernel can be comprised of features of the music recordings such as but not limited to those described herein.
  • an embodiment of the present invention comprises a music recommendation engine with user steerable interface.
  • the set of files on a user's computer, hard drive, or on a network may be automatically organized into contextual clusters at multiple scales, by the means and methods disclosed herein.
  • This process can be augmented by user interaction, in which the process described herein for contextual information is carried out, and the user is provided with the analysis. The user can then select which automatically derived contexts are of interest, which need to be further divided, which need to be combined, and which need to be eliminated. Based on this, the process can be iterated across scales until the user is satisfied with the result.
  • the method and system can be used in collaborative filtering.
  • the customers of some business or organization might play the role of X, and the kernel would be some measure of similarity of purchasing patterns.
  • interesting patterns among the customers and predictions of future behavior may be derived via the diffusion map. This observation can also be applied to similar databases such as survey results, databases of user ratings, etc.
  • an embodiment of the present invention can proceed as detailed herein using an example wherein a business has n customers and sells m products.
  • M(x,y) the number of times that customer #x has purchased product #y.
  • the system computes a sparse n x n matrix T such that T(x1,x2) is the correlation between normalized vectors of purchases between customers x1 and x2 (i.e., correlate normalized versions of the rows x1 and x2 of the matrix M when the correlation is expected to be high, and take 0 otherwise).
  • normalized can mean, for example, converting counts to fractions of the total, i.e., dividing each row by its sum prior to the inner product. Note that correlation is used simply as an example. One could also use, for example, a matrix with the value 1 for any pair of customers that have some fixed number of purchases in common, and 0 otherwise.
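A minimal sketch of the construction of T from the purchase matrix M follows. It uses the inner product of row-normalized purchase fractions as the "correlation", stores only entries at or above a cutoff, and represents the sparse matrix as a dictionary of dictionaries; the cutoff value and the data layout are assumptions of this sketch.

```python
def customer_similarity(m, threshold=0.3):
    """m[x][y] = number of times customer x purchased product y.
    Returns a sparse matrix T as {x1: {x2: score}}, where score is the inner
    product of the two customers' purchase-fraction rows; scores below
    threshold are treated as 0 and not stored."""
    fractions = []
    for row in m:
        total = sum(row)
        fractions.append([v / total for v in row] if total else [0.0] * len(row))
    t = {}
    for x1, r1 in enumerate(fractions):
        for x2, r2 in enumerate(fractions):
            score = sum(a * b for a, b in zip(r1, r2))
            if score >= threshold:
                t.setdefault(x1, {})[x2] = score
    return t
```

For large n, the all-pairs loop would be replaced by a candidate search restricted to customers who share at least one product, which is what keeps T sparse in practice.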
  • the system obtains a low dimensional representation of the set of customers, and the set of products, such that the customers are close in the map when the preponderance of similarities between their purchase habits is close, as viewed from the context of inference from similarity of behavior of the population.
  • the system obtains a low dimensional map of the products, in which products are close in the map when the preponderance of similarities between their purchase histories is close, as viewed from the context of inference from similarity of behavior of the population.
  • the matrices T and S can be formed, and compatible multiscale organizations of artists and playlists generated.
  • the resulting multiscale structure on sets of songs will constitute a kind of automatically generated classification into genres and sub-genres.
  • for the playlists, one gets a kind of multiscale classification of playlists by "mood" and "sub-mood".
  • Yet another example of a similar embodiment consists of one in which the files on a computer are automatically organized into a hierarchy of "folders" by taking a matrix M(x,y) where x indexes, say, keywords, and y indexes documents.
  • the multiscale structure is then an automatically generated filesystem/folder structure on the set of files.
  • x could be some data other than keywords, as described elsewhere in this disclosure.
  • stop words are simply words that are so common that they are usually ignored in standard/state of the art search systems for indexing and information retrieval.
  • the method and system disclosed herein can be used in network routing applications.
  • Nodes on a general network can play the role of points in the space X and the kernel may be determined by traffic levels on the network.
  • the diffusion map in this case can be used to guide routing of traffic on the network.
  • the matrix T can be taken to be any of the standard network similarity matrices. For example, node connectivity, weighted by traffic levels.
  • the embodiment proceeds as above, and the result is a low-dimensional embedding of the network for which ordinary Euclidean distance corresponds to diffusion distance on the graph. Standard algorithms for traffic routing, network enhancement, etc, can then be applied to the diffusion mapped graph in addition to or instead of the original graph, so that results will similarly be mapped to results relevant for diffuse flow of events, resources, etc, within the graph.
  • the method and system can be used in imaging and hyperspectral imaging applications.
  • each spatial (x-y) point in the scene will be a point of X and the kernel could be a distance measure computed from local spatial information (in the imaging case) or from the spectral vectors at each point.
  • the diffusion map can be used to explore the existence of sub-manifolds within the data.
  • the method and system can be used in automatic learning of diagnostic or classification applications.
  • the set X consists of a set of training data
  • the kernel is any kernel that measures similarity of diagnosis or classification in the training data.
  • the diffusion map then gives a means to classify later test data. This example is of particular interest in a hyper-interactive mode.
  • the method and system can be used in measured (sensor) data applications.
  • the (continuous) data vectors which are the result of measurements by physical devices (e.g. medical instruments) or sensors can be thought of as points in a high dimensional space and that space can play the role of X as described herein.
  • the diffusion map can be used to identify structure within the data, and such structure can be used to address statistical learning tasks such as regression.
  • the present invention employs a geographic map (or graph) in which each site is connected to its immediate neighbors by a weighted link measuring the rate (risk) of propagation of fire between the sites.
  • the remapping by the diffusion map reorganizes the geography so that the usual Euclidean distance between the remapped sites represents the risk of fire propagation between them.
  • the system of present invention takes the possible dynamic information about local fire propagation risk as input and computes the multiscale diffusion metric.
  • the system displays a caricaturized map of the region, wherein distance in the display corresponds to risk of fire spreading. In accordance with an aspect of the present invention, information about the fire, such as where it is currently burning, can be superimposed on the display.
  • the system of the present invention provides situational awareness information about the fire in real time, which can change dynamically with time, so that the user can assess in real time where the fire is likely to spread next.
  • the present system can compute this situational awareness information in real time and can be updated on the fly as conditions change (wind, temperature, fuel, etc.).
  • the points affected by a fire source can be immediately identified by their physical (Euclidean) proximity in the diffusion map.
  • the system also can be useful for simulating the effects of contemplated countermeasures, thus allowing for a new and valuable means for allocating fire fighting resources.
  • Given census data about places of abode and places of employment, as well as other data on travel patterns of the citizens of a region, one can define a diffusion metric from initial data relating to the probability of a person traveling from one location to another. Roads, as well as public transportation routes and schedules, can then all be planned so that the capacity of transport between locations is equal to the diffusion distance.
  • the sites can be viewed as digital documents which are tightly related to their immediate neighbors, the links representing the strengths of inference (or relationship) between them.
  • the multiplicity of paths connecting a given pair of documents represents the various chains of inference, each of which carries some particular weight with the sum ranking the relation between them.
  • each customer can be viewed as a "site", with the corresponding list of customer attributes being the digital document.
  • the system and method only links customers whose attributes are similar, preferably very similar, in order to map out the relational structure of the customer base. Good customers are then identified by their natural proximity to known customers, and a risk level can be identified by the preponderance of links (or distance in the map) from a given customer to "dead beats".
  • the methods and algorithms of the present invention have application in the area of automatic organization or assembly of systems. For example, consider the task of having an automated system assemble a jigsaw puzzle. This can be accomplished by digitizing the pieces, using information about the images and the shapes of the pieces to form coordinates in any of many standard ways, using typical diffusion kernels, possibly adapted to reflection symmetries, etc., and computing diffusion distances. Then, pieces that are close in diffusion distance will be much more likely to fit together, so a search for pieces that fit can be greatly enhanced in this way. Of course, this technique is applicable to many practical automated assembly and organization tasks.
  • the characterization allows for the grouping of active elements into similarity classes at different scales of resolution, which finds many applications in the organization of these active elements, as they can be "paired up" or grouped according to behavior, when such is desirable, or allocated as resources when such is desirable.
  • this ability to group together active elements in any context, with the grouping corresponding to similarity of behavior, together with the ability to automatically represent and use this information at a range of resolutions, as disclosed herein, can be used as the basis for automated learning and knowledge extraction in a myriad of contexts.
  • An embodiment of the present invention relates to finding good coordinate systems and projections for surfaces and higher dimensional manifolds and related objects.
  • An embodiment of the present invention relates to the analysis of a linear operator given as a matrix. If the columns of the matrix are viewed as vectors in R^N, and any standard diffusion kernel is used, then the matrix can be compressed in the diffusion embedding, allowing for rapid computation with the matrix.
  • An aspect of the present invention relates to the automated or assisted discovery of mappings between different sets of digital documents. This is useful, for example, when one has a specific set of digital documents for which there is some amount of analytical knowledge, and one or more sets of digital documents for which there is less knowledge, but for which knowledge is sought.
  • [00130] In an embodiment, consider two sets of digital documents, A and B. Begin by organizing A and B using any appropriate diffusion metric. Now, build two new sets of digital documents A' and B'. For each document D in A, let S be the set of nearest neighbors of D in the diffusion embedding within some fixed radius (this radius is a parameter in the method), translated to the origin by subtracting the coordinates of D in the diffusion embedding. Now replace S with the corresponding member from an a priori fixed coset under the action of the unitary group, thus capturing just the local geometry around S. Now place a point D 1 in A', with coordinates equal to this reduced S. Alternatively, the coordinates of D' can be taken to be the reduced S coordinates at a few different multi-scale resolutions.
  • The original problem can be stated as that of finding a natural function mapping between A and B, but with the added complexity that either A or B or both might be incomplete, so that one really seeks a partial mapping. It is natural to require that this mapping, where defined, be a quasi-isometry, or at least a homeomorphism. In any case, since A and B are finite, a brute-force search would theoretically yield an optimal mapping, although it would be intractable to carry out such a search directly.
  • The procedure in the previous paragraph pre-processes the data so as to greatly reduce the cost of such a search. In practical problems for which it is possible to make progress from partial information, such as the Rosetta stone example, the process can be iterated, adjusting the metric with the partial progress information.
  • The method and system relate to organizing and sorting, for example in the style of the "3D" demonstration in the Coifman et al. paper.
  • The input to the algorithm was simply a randomized collection of views of the letters "3D".
  • The output was a representation in the top two diffusion coordinates. These coordinates sorted the data into the relevant two parameters of pitch and yaw. Since, in general, the diffusion metric techniques disclosed herein have the power to piece together smooth objects from multi-scale patch information, they are the right tool for automated discovery of smooth morphisms (using "smooth" in a weak sense).
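  • As a rough illustration of how a diffusion coordinate can sort shuffled data, the following sketch handles a one-parameter analogue of the two-parameter "3D" demonstration. The Gaussian kernel, the `eps` bandwidth, the power iteration, and the function name `first_diffusion_coordinate` are all assumptions made for illustration, not details taken from the disclosure or from the cited paper.

```python
import math


def first_diffusion_coordinate(points, eps=1.0, iters=2000):
    """Return the first non-trivial diffusion coordinate of 1-D points."""
    n = len(points)
    # Gaussian affinity kernel between all pairs of points (assumed kernel).
    k = [[math.exp(-((a - b) ** 2) / eps) for b in points] for a in points]
    deg = [sum(row) for row in k]
    # Symmetric normalisation S = D^-1/2 K D^-1/2 has the same eigenvalues
    # as the Markov matrix P = D^-1 K.
    s = [[k[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # The top eigenvector of S is proportional to sqrt(deg) and is trivial.
    top = [math.sqrt(d) for d in deg]
    norm = math.sqrt(sum(v * v for v in top))
    top = [v / norm for v in top]

    def deflate(v):
        # Remove the trivial component, then renormalise.
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        length = math.sqrt(sum(a * a for a in v))
        return [a / length for a in v]

    v = deflate([float(i) for i in range(n)])  # deterministic start vector
    for _ in range(iters):
        v = deflate([sum(s[i][j] * v[j] for j in range(n)) for i in range(n)])
    # Convert the eigenvector of S back into an eigenvector of P.
    return [v[i] / math.sqrt(deg[i]) for i in range(n)]
```

  • Sorting the input by the returned coordinate recovers the underlying parameter up to orientation; the sign of an eigenvector is arbitrary, so either direction may result.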
  • The present methods are also applicable to non-symmetric diffusions, as discussed in the Coifman & Maggioni reference.
  • The point is that many transitions or inferences occurring in various applications (e.g., in web searches) are not necessarily symmetric. In general, this lack of symmetry invalidates the eigenfunction method as well as the diffusion map method.
  • The present invention overcomes these problems by building diffusion wavelets to achieve the same efficiencies in computing diffusion distances, as well as the Euclidean embedding, as described herein for the symmetric case. For this reason, the use of the term "diffusion map" and other similar terms herein should be taken as illustrative and not limiting, in the sense that the corresponding techniques with diffusion wavelets are more generally applicable.
  • The fr_matr_bin-type embodiments described herein are also interchangeable with diffusion geometry and diffusion wavelet embodiments; each can be substituted for any of the others.
  • the methods of the present invention herein provide a means for steering the diffusion processes in order to filter or avoid irrelevant data as defined by some criterion.
  • Such steering can be implemented interactively using the display of diffusion distances provided by the embedding. This can be implemented exactly as described in the section on hyper-interactive web site searching. This method is particularly preferred in the case of expert assisted machine learning of diagnosis or classification.
  • An embodiment of such techniques to steer diffusion analysis comprises the following steps:
  • 240: For each class in the classification problem, or for the classes "correct" and "incorrect": 240a: Use the diffusion process to propagate these user-defined labelings from the specific data elements selected in step 230 and corresponding to the current class, for a time t, so that the labels are spread over a substantial amount of the initial dataset;
  • 250: Collect the data vector of diffused class information (scores); and 260: Use the data vector from step 250 as additional coordinates and go to step 210.
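  • Steps 240 through 250 can be sketched as repeated application of a Markov matrix to class indicator vectors. This is a minimal sketch under stated assumptions: the affinity matrix is given as a list of lists with positive row sums, and the names `diffuse_labels`, `weights`, and `seeds` are hypothetical.

```python
def diffuse_labels(weights, seeds, classes, t=3):
    """Spread user labels over the dataset by t diffusion steps.

    weights: affinity matrix as a list of lists (assumed input).
    seeds:   {element index: class} for the user-labelled elements.
    Returns {class: list of diffused scores, one per element}.
    """
    n = len(weights)
    # Row-normalise the affinities into a Markov transition matrix.
    p = [[w / sum(row) for w in row] for row in weights]
    scores = {}
    for c in classes:
        # Indicator vector of the elements labelled with class c (step 240).
        v = [1.0 if seeds.get(i) == c else 0.0 for i in range(n)]
        for _ in range(t):
            v = [sum(p[i][j] * v[j] for j in range(n)) for i in range(n)]
        scores[c] = v  # diffused class scores (step 250)
    return scores
```

  • The returned score vectors can then be appended to each element's coordinates, as in step 260.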
  • The present techniques to steer diffusion analysis can comprise the following additional steps:
  • 261: Use the data vector from step 250 to change the initial metric from which the initial diffusion process was conducted. Do this as follows:
  • 261.1: Label each element in the initial dataset with a "guess classification" equal to the class for which its diffused class score is the highest.
  • 261.2: Modify the initial metric so that connections between data elements of the same guess class are enhanced, at least slightly, for at least some elements, and/or so that connections between data elements of different guess classes are reduced, at least slightly, for at least some elements.
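  • One possible reading of steps 261.1 and 261.2 is sketched here. The multiplicative `boost` and `damp` factors are hypothetical parameters chosen for illustration; the disclosure only requires that same-class connections be enhanced and cross-class connections reduced, at least slightly.

```python
def guess_classes(scores):
    """Step 261.1: per element, pick the class with the highest diffused score.

    scores: {class: list of scores, one per element}.
    """
    classes = list(scores)
    n = len(scores[classes[0]])
    return [max(classes, key=lambda c: scores[c][i]) for i in range(n)]


def reweight(weights, guesses, boost=1.2, damp=0.8):
    """Step 261.2: enhance same-class links and reduce cross-class links."""
    n = len(weights)
    return [[weights[i][j] * (boost if guesses[i] == guesses[j] else damp)
             for j in range(n)]
            for i in range(n)]
```

  • The reweighted matrix can serve as the initial metric for the next round of diffusion.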
  • Steps 210 through 230 can be replaced by any means for allowing the user, or any other process or factor, including a priori knowledge, to label certain data elements in the initial dataset, with respect to class membership in a classification problem, or with respect to being "good" or "bad", "hot" or "cold", etc., with respect to some search or some desired outcome.
  • The rest of the algorithm (steps 230 - 260 (or 230 - 261.2)) remains the same.
  • the above algorithm can be used in other aspects of the present invention described herein, modified as one skilled in the art would see fit.
  • the technique can be used for regression instead of classification, by simply labeling selected components with numerical values instead of classification data.
  • When the different values are propagated forward by diffusion, they can be combined by averaging, or in any standard mathematical way.
  • items of inventory are arranged according to diffusion geometry, or are indexed by a search engine as in Fig. 1, so that when potential sales arise (e.g. advertising opportunities), elements of the inventory can be presented to the potential customer(s) according to customer profiles, context, and/or search queries.
  • Examples include but are not limited to arrangement of inventory of visual content such as images, photos and videos, music content, text content, advertising inventory, as well as tangible inventory such as books, clothing, toys, or any merchandise.
  • Step 310: Compute diffusion geometry for a corpus of documents with an appropriate choice of initial metric data that can relate to document interlinking, latent semantic index, mutual information and other methods including those standard in the art.
  • An illustrative but non-limiting example of such a corpus would be one that has the text of a collection of web pages from one or more web sites, from one or more collaborating businesses, as well as, optionally, the text of a number of product advertisements that one seeks to advertise on at least some of the web pages in the corpus via banner ads or other links.
  • Step 320: Pre-store a data-structure that allows for the diffusion distance between any pair of documents in the corpus to be computed rapidly (e.g., the top several coordinates in the diffusion geometry).
  • Step 330: Optionally, pre-store a data-structure that allows one to compute the diffusion nearest neighbor documents to any document in the corpus.
  • Step 340: Optionally adjust the results that would be returned by steps 320 and/or 330 to favor certain listings which are economically favorable (i.e., weight by bids or by other perceived economic numerical value of the listing).
  • A method to do this for advertisements and other similar listings would be to break the favored listings into a separate sub-corpus, and arrange the data-structure so that one can find the top nearest neighbors to any document, the neighbors being from within the whole corpus, and also find the top nearest neighbors to any document, the neighbors being from within the selected sub-corpus.
  • Step 350: When an advertising opportunity arises (i.e., either when one wishes to decide which ads to display, or which pages to interlink for some combination of the reasons that the content is inter-related, and/or that there is some economic motivation for linking, such as a paid advertisement), compute the nearest neighbor documents and provide listings of those documents. The present invention provides preferential placement to those listings that have the most favorable numerical scores of nearness, as modified in step 340.
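  • Steps 320 through 350 might be sketched as a nearest-neighbor query over stored diffusion coordinates, restricted to the advertisement sub-corpus. The bid adjustment used here (dividing distance by bid) is only one assumed way to "weight by bids"; the names `rank_listings`, `coords`, and `bids` are hypothetical.

```python
import math


def diffusion_distance(a, b):
    """Distance between two documents' truncated diffusion coordinates."""
    return math.dist(a, b)


def rank_listings(page, coords, ads, bids, k=2):
    """Return the k ads with the best bid-adjusted nearness to `page`.

    coords: {document id: tuple of pre-stored diffusion coordinates}.
    ads:    ids of the favored sub-corpus (the advertisements).
    bids:   {ad id: positive bid}; a higher bid pulls a listing closer.
    """
    def score(ad):
        # Assumed adjustment for step 340: divide the distance by the bid.
        return diffusion_distance(coords[page], coords[ad]) / bids[ad]

    return sorted(ads, key=score)[:k]
```

  • With this choice, an ad that is twice as far away but carries four times the bid outranks a nearer, lower-bid ad; other weightings would give different trade-offs.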
  • An embodiment of the present invention in this aspect comprises a method for influencing a position or presence or placement of a listing within an advertising section of a rendering of a document or meta-document on a computer network, wherein text documents relating to the listing are used to characterize the listing, and the content of the document or meta-document are then matched against this text for the listing by methods further disclosed herein, in order to decide where the listing should be placed.
  • This can incorporate the other elements described herein, such as bidding and other economic influencing of listing placement, etc.
  • An embodiment of the present invention consists of a system for strategic content co-management (SCcMS).
  • the present means and methods allow for the calculation of an optimal preferential ranking of the related items.
  • the resulting conglomeration of web-pages, products and service listings can be rendered for display. It is one method of practice of the present invention to provide up to 3 different preferential rankings of the related content, as well as methods for, e.g., generating html or other web renderings, that allow for three different customized views of the same content, wherein the views are branded coA, coB, and coC, respectively, and wherein the rendering optionally uses the preferential ranking to decide on preferential positioning of the related items.
  • Another aspect of the present invention relates to steerable searching, as disclosed herein. Further details of such searches include the idea of a meta-search engine which uses ordinary search engines to return initial results of an initial query.
  • The initial results can be given a diffusion geometry as disclosed. Users can then rate pages as being "good" or "bad" and the diffusion geometry can be used to re-order the returned results.
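  • A minimal sketch of such re-ordering, assuming each returned page already has diffusion coordinates. The scoring rule (distance to the nearest "good" page minus distance to the nearest "bad" page) is an illustrative assumption, and `rerank` is a hypothetical name.

```python
import math


def rerank(results, coords, good, bad):
    """Re-order search results using pages the user rated good or bad.

    results: list of page ids returned by the underlying search engines.
    coords:  {page id: tuple of diffusion coordinates}.
    good, bad: lists of page ids rated by the user.
    """
    def score(doc):
        near_good = min((math.dist(coords[doc], coords[g]) for g in good),
                        default=0.0)
        near_bad = min((math.dist(coords[doc], coords[b]) for b in bad),
                       default=0.0)
        # Lower is better: close to good pages and far from bad pages.
        return near_good - near_bad

    return sorted(results, key=score)
```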
  • The method for performing a meta-search comprises the following steps:
  • 410: Pre-compute the diffusion geometry of a first corpus of documents; 420: Provide one or more search engines to one or more users (i.e., this invention works in the context where search engines are provided; such provisioning is not necessarily part of the invention, although it can be); 430: Take the results of search queries and post-process them as follows: 431: Take at least some documents from the set of documents returned by a search query as a second corpus;
  • Use the diffusion map corresponding to the diffusion coordinates in step 410 to project the documents in corpus 2 (or at least an excerpt from at least some of the documents) into the "space" of corpus 1 (i.e., compute the coordinates of each document/excerpt taken from corpus 2, with respect to the diffusion mapping for corpus 1);
  • An example of the above algorithm comprises the following. Take corpus 1 to be at least some of the documents from a special-interest web site (e.g., mlb.com for Major League Baseball). In this way, the corpus, and its diffusion geometry, "defines" the special interest (i.e., in the example given, the corpus defines the web for Major League Baseball, in the sense that diffusion proximity to documents in the corpus implies relevance to/for Baseball fans). Compute the diffusion geometry of this corpus, using, e.g.
  • search engine such as Google
  • a search result from Google (corpus 2).
  • Yet another aspect of the present invention relates to distributed calculation of the diffusion vectors and PageRank.
  • PageRank and diffusion geometry computations (hereafter features) were both originally disclosed within systems for which the relevant quantities are computed on a server or cluster of servers. This can be a lengthy process, and can require a cluster of a large number of servers for the computation to be done in a reasonable amount of time. Such clusters are expensive. Hence there is a need for a method to perform these computations and related computations without requiring a specialized server.
  • the present invention solves this problem in the context of networked databases and document delivery systems such as the Internet, World Wide Web, and Internet email.
  • the documents for which the features are to be computed are each handled by at least one server. As described herein, one can augment the protocols and processing in such a way that the server which is already serving the document computes the feature.
  • the model can be empty at first, and will be dynamically updated by this algorithm.
  • the rank number can be random at first, and is dynamically updated by this algorithm.
  • The server receiving the request has a dynamic update of the estimate of the rank of the pages that link to it. From this, it can regularly update its internal model of the pages that link to it, and it can compute, via the usual formula or any number of related formulas, its rank.
  • Another useful formula would be sum_i frac_i * rank_i, where frac_i is the fraction of the time that a referral comes from page i, rank_i is the rank of page i, and the sum runs over i from 1 to N, where again N is the total number of distinct pages known to link to the current page.
  • 540: Whenever a link is "clicked on" within the current page, the HTTP request to follow that link shall forward the revised current estimate of the current page's rank, so that the receiving page can implement this algorithm.
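  • The traffic-weighted formula sum_i frac_i * rank_i can be sketched as a small per-page estimator that is updated on every incoming request. The class and method names are hypothetical, and how the rank is actually carried in the HTTP request is left open here, as it is in the text above.

```python
from collections import Counter


class PageRankEstimator:
    """Per-page running estimate of rank from incoming referrals."""

    def __init__(self, default_rank=1.0):
        self.hits = Counter()      # referral counts per linking page
        self.reported = {}         # latest rank reported by each linking page
        self.default_rank = default_rank  # assumed initial or fallback value

    def record_referral(self, referrer, reported_rank=None):
        """Update the model when a request arrives from `referrer`."""
        self.hits[referrer] += 1
        if reported_rank is None:
            # No rank supplied: keep the last known value or use the default.
            reported_rank = self.reported.get(referrer, self.default_rank)
        self.reported[referrer] = reported_rank

    def rank(self):
        """Return sum_i frac_i * rank_i over all known linking pages."""
        total = sum(self.hits.values())
        if total == 0:
            return self.default_rank
        return sum(self.hits[p] / total * self.reported[p] for p in self.hits)
```

  • For instance, three referrals from a page of rank 2.0 and one from a page of rank 6.0 give 0.75 * 2.0 + 0.25 * 6.0 = 3.0, and this value is what the page would forward when one of its own links is followed.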
  • Whereas PageRank as defined by Page and Brin (see "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page, <http://www-db.stanford.edu/~backrub/google.html>) weighs all links into a page with the same weight, conditioned only by the page rank of the page, the above process has enough information to weigh the links according to the amount of traffic that flows through the link at any given time, in addition to the rank of each page. Hence a more relevant ranking of pages is computed: one that factors in not only link popularity, but usage popularity.
  • the above algorithm computes essentially the top non-trivial eigenvector of a certain linear map (as is standard in the art, and it is intended that the above algorithm be modified with all of the usual techniques standard in the art).
  • An embodiment of the present invention also comprises the following modification to the above algorithm: instead of computing one eigenvector, compute several (a fixed number) diffusion geometry eigenvectors, using standard iterative methods from linear algebra, augmented with the present disclosure and those items incorporated by reference. The computation can factor in not only link geometry and traffic weights, but also semantic and text processing such as standard in the art and as described herein. In this way, each web server carries at all times an estimate of the diffusion geometry coordinates of each page on the server.
  • This algorithm need not be implemented on all servers, in that the algorithm can be restricted simply to "participating" servers. In that case, if and when a referral comes from a non-participating server, the page's rank can be updated using a default value for the referring page's rank, or by looking up some other proxy for the referring page's rank, or by ignoring the page, as if the link did not exist.
  • a further aspect of the present invention as it relates to distributed computation is that methods standard in the art can be used for authentication and validation of reported ranks.
  • secure protocols, with signed certificates, etc can be used, to detect that the servers in question have not been tampered with, either by the administrator of the server or other outside parties. It is seen that the disclosed algorithm would be otherwise potentially subject to falsification of data, which could artificially inflate a perceived rank of a page.
  • One specific method for authentication comprises the step of randomly or systematically asking a page to not only report its rank, but report how it computed its rank (by listing those pages that linked to it, and their respective ranks).
  • a querying application can then randomly or systematically perform a "spot check" that all or many of the reported data are correct or approximately correct (the latter since the numbers are dynamic).
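  • Such a spot check might look as follows, assuming the page's report lists each linking page with its traffic fraction and claimed rank, and that the checker can look up a linking page's rank independently. The report layout, the relative `tolerance`, and the `lookup_rank` callable are all hypothetical.

```python
def spot_check(report, lookup_rank, tolerance=0.1):
    """Check a page's reported rank against its own stated derivation.

    report: {"rank": float, "inlinks": {page: (fraction, claimed_rank)}}.
    lookup_rank: callable returning the independently observed rank of a page.
    Values only need to agree approximately, since ranks are dynamic.
    """
    recomputed = sum(frac * rank for frac, rank in report["inlinks"].values())
    if abs(recomputed - report["rank"]) > tolerance * max(abs(recomputed), 1e-12):
        return False  # the reported rank does not follow from the listed data
    for page, (_, claimed) in report["inlinks"].items():
        actual = lookup_rank(page)
        if abs(actual - claimed) > tolerance * max(abs(actual), 1e-12):
            return False  # a linking page's rank was misreported
    return True
```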
  • Servers can keep a log of reports of rank, and of the rank of pages that they link to, not just pages that link to them. In this way, such spot checks can be made even more tamper resistant. Exploits to defeat the described authentication of the present invention require a conspiracy between a server and those servers that link to it, which is possible, but the conspiracy would have to propagate to all servers that connect to the latter servers, and so on.
  • Each server can keep a record of any "cheating" and report it as part of a protocol, or even refuse to follow links to cheaters. In addition, servers could report a "cheating index" to those servers connected to them, and the servers could cache an "honesty diffusion geometry" in addition to the above, the latter being a "relatedness diffusion geometry".
  • the system can be made self-policing and tamper-proof.
  • Yet another use for the present invention relates to applying the above technique as a means for optimizing email paths for solicited email and a means for stopping email spam (i.e. unsolicited commercial email).
  • Each email server can keep a "traffic diffusion geometry" and a "spam diffusion geometry" for itself and for those servers from which it receives frequent email.
  • These diffusion geometries can propagate over the Internet in a way analogous to the "honesty" and "relatedness" geometries as disclosed herein.
  • the disclosed means of traffic, interlinking and index propagation are obviously augmented by all of the methods for the same that are standard in the art.
  • An embodiment of the present invention can be practiced to assign diffusion coordinates to a new digital document, i.e. one that was not used to compute the diffusion geometry.
  • the diffusion coordinates of a digital document are, in practice, accessed by looking up the document in a pre-computed data-structure.
  • This pre-computed structure contains information on how to map document attributes such as link structure, word frequency, mutual information, latent semantic index coordinates, and any number of other factors, into coordinates. If one encounters a new document, one can apply the map given by the data-structure, to the new document, in order to instantiate diffusion coordinates for it.
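  • One standard way to apply a stored diffusion map to an unseen document is a Nyström-style extension, sketched below. This is an assumption about how the pre-computed structure could be used, not a statement of the disclosed data-structure; the affinities of the new document to the corpus documents are taken as given, and `extend_coordinates` is a hypothetical name.

```python
def extend_coordinates(affinities, eigvecs, eigvals):
    """Assign diffusion coordinates to a document outside the corpus.

    affinities: affinity of the new document to each corpus document.
    eigvecs:    stored eigenvectors of the corpus Markov matrix, one list
                per coordinate, each with one entry per corpus document.
    eigvals:    the matching non-zero eigenvalues.
    """
    total = sum(affinities)
    # Transition probabilities from the new document into the corpus.
    probs = [a / total for a in affinities]
    # psi_k(new) = (1 / lambda_k) * sum_j p(new, j) * psi_k(j)
    return [sum(p * v for p, v in zip(probs, vec)) / lam
            for vec, lam in zip(eigvecs, eigvals)]
```

  • A document whose affinities coincide with those of an existing corpus document receives that document's coordinates, which is a useful sanity check on the stored map.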
  • Applications of the present invention include but are not limited to: deciding where within a web site to place new content; dynamically updating diffusion data; decreasing the complexity of diffusion calculations by lessening the requirements on corpus size for the pre-processing step; merging two pre-analyzed corpuses into one; and others, as will be readily seen by one skilled in the art.
  • An embodiment of the present invention comprises a browser, or browser toolbar, or server, or proxy server disposed as in the following example that illustrates assisted content viewing, etc, in the context of web browsing:
  • Step 610 provide a view of web pages, or practice the system as an improvement of an existing web browser, e.g. as a toolbar, server, or proxy server; and
  • Step 620 provide, as part of the view, either in another panel, a menu, a popup, or other comparable means, one or more lists of links to "related documents". These can come from diffusion coordinates or other lists of one or more of the following types: from the user's personal preferences, from knowledge of the user's profile, from strategic content analysis as disclosed herein.
  • The algorithm can be embodied in a form that exploits the observation of the preceding paragraph, in which coordinates can be put on new documents. That is, one can build a few sets of diffusion geometry databases, and then, for example, browse the World Wide Web. If a document is encountered that is in the databases, then the related links shown are the diffusion nearest neighbors, modified by any relevant filtering (e.g., the economic factors described hereinabove) (referred to herein as "generalized nearest neighbors"). In the more likely case, where a viewed document is not in the databases, the coordinates of the document are computed, and the generalized nearest neighbors to the computed point are shown as the related links.
  • the application of the system and method can include automatically advertising within web pages, serving advertisements that are optimally, or nearly optimally related to the user's profile and to what the user is currently doing, and as usual conditioned by bids and other economic factors, as well as automatically assisting the user with a "super browser" that actively monitors the user's likes, dislikes, browsing history, etc, and uses diffusion mathematics or other standard methods to associate content that will improve the user's experience.
  • the system and method comprises the following algorithm:
  • Step 710: Compute a measure of similarity, based on keywords, for a corpus of documents, using methods including those standard in the art.
  • An illustrative but non-limiting example of such a corpus would be one that has the text of a collection of web pages from one or more web sites, from one or more collaborating businesses, as well as, optionally, the text of a number of product advertisements that one seeks to advertise on at least some of the web pages in the corpus via banner ads or other links.
  • Step 720: Pre-store a data-structure that allows for the similarity between any pair of documents in the corpus to be computed rapidly.
  • Step 730: Optionally pre-store a data-structure that allows one to compute the nearest neighbor documents to any document in the corpus.
  • Step 740: Optionally adjust the results that would be returned by steps 720 and/or 730 to favor certain listings which are economically favorable (i.e., weight by bids or by other perceived economic numerical value of the listing).
  • A system and method of the present invention can break the favored listings into a separate sub-corpus, and arrange the data-structure so that one can find the top nearest neighbors to any document, the neighbors being from within the whole corpus, and also find the top nearest neighbors to any document, the neighbors being from within the selected sub-corpus.
  • Step 750: When an advertising opportunity arises (i.e., either when one wishes to decide which ads to display, or which pages to interlink for some combination of the reasons that the content is inter-related, and/or that there is some economic motivation for linking, such as a paid advertisement), the method and system of the present invention computes the nearest neighbor documents and provides listings of those documents. The present system and method can provide preferential placement to those listings that have the most favorable numerical scores of nearness, as modified in step 740.
  • The following description gives some further details of an embodiment of the present invention; it is meant to be illustrative and not limiting.
  • a system for computing the diffusion geometry of a corpus of documents comprises the following components (Part A):
  • The system can be used in an application, for example as follows (Part C):
  • Step A1 can be a collection of web pages from a content management database or from a web crawler or web spider, as is standard in the art.
  • Step A2 could consist of a set of Perl scripts, lexical analysis code in the C "lex" extension, and other tools standard in the art or otherwise, for canonicalizing the input web pages (e.g., deleting web tags, JavaScript, CSS, comments, etc., correcting spelling errors, stemming, removal of stop words, etc.), as is standard in the art or otherwise.
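  • The canonicalization of step A2 is described in terms of Perl and lex tooling; the following is only a rough Python stand-in using the standard library. The stop-word list and the one-rule plural stripper are deliberately tiny, hypothetical placeholders for real stop-word removal and stemming, and spelling correction is omitted.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping tags, comments, scripts and styles."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def canonicalize(html):
    """Return the list of canonical word tokens for one web page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    # Crude stand-in for stemming: strip a trailing plural "s".
    stemmed = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return [w for w in stemmed if w not in STOP_WORDS]
```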
  • Step A3 can be based on the computation of word frequencies for each document in the corpus (i.e., the words in the language, or at least those that occur in the corpus, index the coordinate axes, and the coordinates of each document are the frequencies of occurrence of each word in the language).
  • Steps A4 and A5 can comprise estimating the nearest neighbors by techniques standard in the art, and then computing correlations between vectors, thresholded if below some cutoff. In this way, a sparse matrix W results.
  • F = D*W*D
  • A = (F+F')/2 (where prime denotes matrix transpose).
  • This matrix A is an example of a matrix for step A5 above.
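  • Steps A3 through A5 can be sketched end to end on token lists. Two assumptions are made loudly here: the "correlation" is taken to be the cosine of the word-frequency vectors, and D is taken to be the diagonal matrix of inverse square roots of the row sums of W, since the definition of D does not appear in this passage. The nearest-neighbor pre-selection is skipped and all pairs are compared, which is only viable for small corpora.

```python
import math
from collections import Counter


def word_frequencies(tokens):
    """Step A3: relative frequency of each word in one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def correlation(u, v):
    """Cosine of two sparse frequency vectors (assumed correlation measure)."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def diffusion_matrix(docs, cutoff=0.1):
    """Steps A4-A5: thresholded W, then F = D*W*D and A = (F+F')/2."""
    vecs = [word_frequencies(d) for d in docs]
    n = len(vecs)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c = correlation(vecs[i], vecs[j])
            w[i][j] = c if c >= cutoff else 0.0  # drop weak correlations
    # Assumed normalisation: D = diag(1 / sqrt(row sum of W)).
    d = [1.0 / math.sqrt(sum(row)) for row in w]
    f = [[d[i] * w[i][j] * d[j] for j in range(n)] for i in range(n)]
    return [[(f[i][j] + f[j][i]) / 2 for j in range(n)] for i in range(n)]
```

  • With a symmetric W the symmetrization in the last line changes nothing; it matters when W comes from an approximate, non-symmetric nearest-neighbor search.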
  • Another illustrative embodiment of an aspect of the present invention, shown in FIG. 4, is found in the Public Find Similar Document Internet Utility, which enables people to find documents on the World Wide Web that are similar to a particular document appearing in their web browser.
  • A web page about 18th century French Literature would have a hyperlink on the bottom of the page that says "Find Similar Documents". This hyperlink forwards the user's web browser to the Public Find Similar Document Internet Utility and it, in turn, displays a summary list of documents, available on the web, that are similar to the one about 18th century French Literature. The title of each document on the list would be a hyperlink that forwards the user to the document itself.
  • PF2 Document Comparison Indexer
  • PF3 Document and Comparison Information Database
  • PF4 Document Comparison Search Engine
  • PF5 Search Request Handler and Results Displayer
  • The first step is for the Public Find Similar Document Internet Utility to acquire documents from the World Wide Web. This is done by using the World Wide Web Document Acquisition Engine (PF1) to acquire documents (PFA).
  • the documents are communicated (PFB) to the Document Comparison Indexer (PF2).
  • the Document Comparison Indexer (PF2) analyses the documents in such a manner to enable document comparison at a later point.
  • The information resulting from the analysis and any other required data from the document, such as the document's title and source location, also known as the URI, is communicated (PFC) to the Document and Comparison Information Database (PF3).
  • the Public Find Similar Document Internet Utility can now respond to "ad hoc" requests for finding similar documents.
  • This process is initiated by a computer user clicking on a hyperlink on a web page that forwards the user's web browser to the Public Find Similar Document Internet Utility.
  • the user's web browser communicates (PFD) to the Search Request Handler and Results Displayer (PF5) that the user would like to see similar documents to the one the user was just viewing.
  • The URI of the document the user was just viewing is conveyed with the request; this information is called the "referrer" (the Referer header described in section 14.36 of RFC 2616, HTTP/1.1).
  • the Search Request Handler and Results Displayer retrieves the document the user was just viewing (PFE and F) by use of the received URI, and communicates (PFG) that document to the Document Comparison Search Engine (PF4).
  • the Document Comparison Search Engine reads data (PFH) from the Document and Comparison Information Database (PF3) and finds similar documents to the document the user was just viewing.
  • the Document Comparison Search Engine (PF4) communicates (PFI) data regarding the list of similar documents to the Search Request Handler and Results Displayer (PF5).
  • The Search Request Handler and Results Displayer formats the data such that it can be easily viewed and understood by the user.
  • the Search Request Handler and Results Displayer then communicates (PFJ) the list of similar documents to the user.
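  • The request flow PFD through PFJ can be sketched as a single handler function. The `fetch` and `find_similar` callables stand in for document retrieval and for the Document Comparison Search Engine (PF4); both, along with the function name and the result layout, are hypothetical. The header name is spelled "Referer", as in the HTTP specification.

```python
def handle_find_similar(headers, fetch, find_similar, limit=10):
    """Handle one "Find Similar Documents" request.

    headers:      request headers as a dict (exact-case keys assumed).
    fetch:        callable returning the document text for a URI (PFE/PFF).
    find_similar: callable returning (title, uri) pairs for a document,
                  most similar first (PFG through PFI).
    """
    referer = headers.get("Referer")
    if not referer:
        return []  # nothing to compare against
    document = fetch(referer)
    matches = find_similar(document)[:limit]
    # Shape the results for display as hyperlinked titles (PFJ).
    return [{"title": title, "uri": uri} for title, uri in matches]
```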
  • The Public Find Similar Document Internet Utility can also count the number and frequency of requests by users to retrieve documents similar to the particular documents they were viewing. This information can be used for similar document list ranking or general statistical purposes.
  • the Public Find Similar Document Internet Utility can retrieve documents based on the comparison of entire documents instead of a small set of keywords.
  • The Public Find Similar Document Internet Utility also requires only one click of a computer mouse for users to find documents similar to the one they are viewing, as opposed to current World Wide Web search engines, which would require the user to pick out a few relevant keywords from the document and type or cut and paste them into the search box.

Abstract

The invention relates to a method and system for information retrieval in response to an information retrieval request, comprising retrieving additional information from a first corpus of data elements based on the request. The request is modified based on the additional information in order to refine the amount of information to be retrieved from a second corpus of data elements. The information is retrieved from the second corpus of data elements based on the modified request.
PCT/US2005/033526 2004-09-17 2005-09-19 System and method for document analysis, processing and information extraction WO2006034222A2 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP05800792A EP1797499A2 (fr) 2004-09-17 2005-09-19 System and method for document analysis, processing and information extraction

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US61084104P 2004-09-17 2004-09-17
US60/610,841 2004-09-17
US11/165,633 2005-06-23
US11/165,633 US20060004753A1 (en) 2004-06-23 2005-06-23 System and method for document analysis, processing and information extraction
US69706905P 2005-07-05 2005-07-05
US60/697,069 2005-07-05

Publications (2)

Publication Number Publication Date
WO2006034222A2 true WO2006034222A2 (fr) 2006-03-30
WO2006034222A3 WO2006034222A3 (fr) 2009-04-09

Family

ID=36090593

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2005/033526 WO2006034222A2 (fr) 2004-09-17 2005-09-19 System and method for document analysis, processing and information extraction

Country Status (2)

Country Link
EP (1) EP1797499A2 (fr)
WO (1) WO2006034222A2 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6490577B1 (en) * 1999-04-01 2002-12-03 Polyvista, Inc. Search engine with user activity memory
US7089237B2 (en) * 2001-01-26 2006-08-08 Google, Inc. Interface and system for providing persistent contextual relevance for commerce activities in a networked environment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170155571A1 (en) * 2015-11-30 2017-06-01 International Business Machines Corporation System and method for discovering ad-hoc communities over large-scale implicit networks by wave relaxation
US11979309B2 (en) * 2015-11-30 2024-05-07 International Business Machines Corporation System and method for discovering ad-hoc communities over large-scale implicit networks by wave relaxation

Also Published As

Publication number Publication date
WO2006034222A3 (fr) 2009-04-09
EP1797499A2 (fr) 2007-06-20

Similar Documents

Publication Publication Date Title
US20060155751A1 (en) System and method for document analysis, processing and information extraction
US20070214133A1 (en) Methods for filtering data and filling in missing data using nonlinear inference
Lu et al. BizSeeker: a hybrid semantic recommendation system for personalized government‐to‐business e‐services
US20120047123A1 (en) System and method for document analysis, processing and information extraction
KR101793222B1 (ko) Updating a search index used to facilitate application searches
US8019752B2 (en) System and method for information retrieval from object collections with complex interrelationships
CA2897886C (fr) Methods and apparatus for identifying concepts corresponding to input information
US20120278321A1 (en) Visualization of concepts within a collection of information
US20070174255A1 (en) Analyzing content to determine context and serving relevant content based on the context
US8930822B2 (en) Method for human-centric information access and presentation
Kim et al. A framework for tag-aware recommender systems
Serrano Neural networks in big data and Web search
Gasparetti Modeling user interests from web browsing activities
Xu et al. Improving contextual advertising matching by using Wikipedia thesaurus knowledge
Li et al. A feature-free search query classification approach using semantic distance
Wang et al. Query ranking model for search engine query recommendation
Liu et al. Visualizing document classification: A search aid for the digital library
Alghamdi et al. Extended user preference based weighted page ranking algorithm
Xu Web mining techniques for recommendation and personalization
Rana et al. Analysis of web mining technology and their impact on semantic web
Veningston et al. Semantic association ranking schemes for information retrieval applications using term association graph representation
WO2006034222A2 (fr) System and method for document analysis, processing and information extraction
Munilatha et al. A study on issues and techniques of web mining
Siddiqui et al. Qualitative approaches in content mining-a review
Wu et al. Automatic topics discovery from hyperlinked documents

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

WWE WIPO information: entry into national phase

Ref document number: 2005800792

Country of ref document: EP

WWP WIPO information: published in national office

Ref document number: 2005800792

Country of ref document: EP