US8103686B2 - Extracting similar entities from lists/tables - Google Patents

Extracting similar entities from lists/tables

Info

Publication number
US8103686B2
US8103686B2
Authority
US
United States
Prior art keywords
lists
list
corpus
elements
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related, expires
Application number
US11/954,218
Other versions
US20090157644A1 (en)
Inventor
Sreenivas Gollapudi
Alan Halverson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Priority to US11/954,218
Assigned to MICROSOFT CORPORATION. Assignors: GOLLAPUDI, SREENIVAS; HALVERSON, ALAN (assignment of assignors' interest; see document for details)
Publication of US20090157644A1
Application granted
Publication of US8103686B2
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC. Assignor: MICROSOFT CORPORATION (assignment of assignors' interest; see document for details)
Status: Expired - Fee Related
Adjusted expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries


Abstract

Large numbers of lists of entities may be mined for similar entities to support related searches. A representation for each list may be determined to provide for a comparison between lists and to support membership checks. A score for an element in a list may be computed that represents the validity of an item in the corpus of lists. Thus, a spurious element would receive a very low score, whereas a valid element would receive a higher score. A list weight is then computed using the constituent element weights, and the element and list weights are used to compute the nearest neighbors of a given query element.

Description

BACKGROUND
A term frequency-inverse document frequency (TF-IDF) weight may be used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. Lists of entities contain information grouped according to some criterion. As such, lists are a good source of information to determine relevant information responsive to a query. However, entities may occur in different lists and may be associated with different members in each list. In addition, there are a large number of lists on the web and assigning weights to such a large number of lists creates hurdles in mining such lists for information.
SUMMARY
Lists of entities may be mined for similar entities to support related searches. A representation for each list may be determined to provide for a comparison between lists and to support membership checks. A score for an element in a list may be computed that represents the validity of an item in the corpus of lists. Thus, a spurious element would receive a very low score, whereas a valid element would receive a higher score. A list weight is then computed using the constituent element weights, and the element and list weights are used to compute the nearest neighbors of a given query element.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing summary, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the embodiments, there is shown in the drawings example constructions of the embodiments; however, the embodiments are not limited to the specific methods and instrumentalities disclosed. In the drawings:
FIG. 1 illustrates an exemplary network environment;
FIG. 2 shows a document parsed into a sequence of tokens and overlapping shingles;
FIG. 3 is an operational flow of an implementation of a process to determine similar entities from lists;
FIG. 4 illustrates exemplary HTML tables; and
FIG. 5 shows an exemplary computing environment.
DETAILED DESCRIPTION
FIG. 1 illustrates an exemplary network environment 100. In the network 100, a client 120 may communicate through a network 140 (e.g., Internet, WAN, LAN, 3G, or other communication network) with a plurality of servers 150 1 to 150 N. The client 120 may communicate with a search engine 160. The client 120 may be configured to communicate with any of the servers 150 1 to 150 N and the search engine 160, to access, receive, retrieve and display media content and other information such as web pages 155 and web sites.
In some implementations, the client 120 may include a desktop personal computer, workstation, laptop, PDA, cell phone, or any WAP-enabled device or any other computing device capable of interfacing directly or indirectly with the network 140. The client 120 may run an HTTP client, e.g., a browsing program, such as MICROSOFT INTERNET EXPLORER or other browser, or a WAP-enabled browser in the case of a cell phone, PDA or other wireless device, or the like, allowing a user of the client 120 to access, process and view information and pages available to it from the servers 150 1 to 150 N.
The client 120 may also include one or more user interface devices 122, such as a keyboard, a mouse, touch-screen, pen or the like, for interacting with a graphical user interface (GUI) provided by the browser on a display (e.g., monitor screen, LCD display, etc.), in conjunction with pages, forms and other information provided by the servers 150 1 to 150 N or other servers. Implementations described herein are suitable for use with the Internet, which refers to a specific global internetwork of networks. However, it should be understood that other networks can be used instead of or in addition to the Internet, such as an intranet, an extranet, a virtual private network (VPN), a non-TCP/IP based network, any LAN or WAN or the like.
According to an implementation, a client application 125 executing on the client 120 may include instructions for controlling the client 120 and its components to communicate with the servers 150 1 to 150 N and the search engine 160 and to process and display data content received therefrom. Additionally, the client application 125 may include various software modules for processing data and media content. For example, the client application 125 may include one or more of a search module 126 for processing search requests and search result data, a user interface module 127 for rendering data and media content in text and data frames and active windows, e.g., browser windows and dialog boxes, and an application interface module 128 for interfacing and communicating with various applications executing on the client 120. Further, the interface module 127 may include a browser, such as a default browser configured on the client 120 or a different browser.
According to an implementation, the search engine 160 is configured to provide search result data and media content to the client 120, and the servers 150 1 to 150 N are configured to provide data and media content such as web pages to the client 120, for example, in response to links selected in search result pages provided by the search engine 160. The search engine 160 may reference various collection technologies for collecting information from the World Wide Web and for populating one or more indexes with, for example, pages, links to pages, etc. Such collection technologies include automatic web crawlers, spiders, etc., as well as manual or semi-automatic classification algorithms and interfaces for classifying and ranking web pages within a hierarchical structure. In certain aspects, the search engine 160 may also be configured with search-related algorithms, including a list gathering engine 161 that gathers and maintains the lists, a comparison engine 162 that determines a representation of each list and compares lists to each other, a weighting engine 163 that determines weights of lists and elements within lists, and a ranking engine 164 that determines nearest neighbors to a query element from the lists.
In an implementation, the search engine 160 may be configured to provide data responsive to a search query 170 received from the client 120, via the search module 126. The servers 150 1 to 150 N and 160 may be part of a single organization, e.g., a distributed server system such as that provided to users by a search provider, or they may be part of disparate organizations. The servers 150 1 to 150 N and the search engine 160 each may include at least one server and an associated database system, and may include multiple servers and associated database systems, and although shown as a single block, may be geographically distributed.
According to an implementation, the search engine 160 may include algorithms that provide search results 190 to users in response to the search query 170 received from the client 120. The search engine 160 may be configured to increase the relevance of results returned for search queries received from the client 120 by mining lists for similar entities to support related searches, as discussed in detail below. The search query 170 may be transmitted to the search engine 160 to initiate an Internet search (e.g., a web search). The search engine 160 locates content matching the search query 170 from a search corpus 180. The search corpus 180 represents content that is accessible via the World Wide Web, the Internet, intranets, local networks, and wide area networks.
The search engine 160 may retrieve content from the search corpus 180 that matches the search query 170 and transmit the matching content (i.e., search results 190) to the client 120 in the form of a web page to be displayed in the user interface module 127. In some implementations, the most relevant search results are displayed to a user in the user interface module 127.
As shown in FIG. 2, any data object, for example, a web page 155 may be viewed as a linear sequence of tokens 200. The tokens 200 may be arbitrary document features, for example, characters, words, or lines. It should be understood that in multimedia documents the tokens 200 are not necessarily human readable. Tokens may represent parts of graphic images, videos, audio, or for that matter, any digitally encoded data that may be decomposed into a canonical sequence of tokens.
The tokens may be grouped into overlapping fixed size sequences of k contiguous tokens called shingles 202. For example, for k=3, {This, is, a} is a shingle of the web page 155, as is {is, a, document}. The tokens 200 of a particular document may be grouped into shingles 202 in many different ways, but for any shingling, the number of tokens in any particular shingle should be the same. The general method may be applied to any data object from which discernable features can be extracted and stored as a canonical sequence or set.
In an implementation, each web page 155 to be compared for resemblance is parsed to produce a canonical sequence of tokens 200. In the specific case of web pages 155, canonical may mean that any formatting, case, and other minor feature differences, such as HTML commands, spacing, punctuation, etc., are ignored. The tokens 200 may be grouped into shingles 202, where the "k-shingling" of a web page 155 is the identification of a multi-set of all shingles of size k contained in the document. This multi-set is denoted as S(D, k).
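As an illustration, the following is a minimal Python sketch of k-shingling; the whitespace tokenization and the choice k = 3 are assumptions of this example rather than requirements of the description.

```python
from collections import Counter

def tokenize(text):
    # Canonical sequence of tokens: markup and punctuation are assumed to have
    # been stripped already; here we simply lowercase and split on whitespace.
    return text.lower().split()

def k_shingles(tokens, k=3):
    # S(D, k): the multi-set of all overlapping shingles of k contiguous tokens.
    return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))

print(k_shingles(tokenize("This is a document"), 3))
# Counter({('this', 'is', 'a'): 1, ('is', 'a', 'document'): 1})
```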
The Jaccard similarity coefficient is a measure used for comparing the similarity of sample sets. The Jaccard similarity coefficient is defined as the size of the intersection divided by the size of the union of the sample sets. In other words, the fraction of elements that are common to both sets approximates the similarity between the two sets. In the scenario when a document (such as a web page) is represented as a set of words, this measure is useful for determining similarity between documents. Specifically, it is useful for determining near-duplicates of web pages. Accordingly, the resemblance R of two documents A and B according to the Jaccard similarity coefficient may be defined as the ratio:
R_k(A, B) = |S(A, k) ∩ S(B, k)| / |S(A, k) ∪ S(B, k)|
Thus, two documents will have a high resemblance when the documents have many common shingles. The resemblance may be expressed as some number in the interval 0 to 1, and for any shingling, R_k(A, A) = 1. In other words, document A always resembles itself 100%. A strong resemblance, that is, close to 1, will capture the notion of two documents being “roughly” the same.
When document A resembles document B by 100% for a shingle size of 1, this may mean that B is some arbitrary permutation of A. For larger sized shingles, this is still true, but now fewer permutations are possible. For example, if A={a, b, a, c, a} and B={a, c, a, b, a}, then A resembles B 100% for a size of two. Increasing the size of shingles makes the resemblance checking algorithm more sensitive to permutation changes, but also more sensitive to insertion and deletion changes.
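A minimal sketch of the resemblance computation as the Jaccard coefficient over shingle sets, checked against the permutation example above; reducing the shingle multi-sets to plain sets is a simplification of this illustration.

```python
def jaccard(a, b):
    # Jaccard similarity coefficient: size of the intersection over
    # size of the union of the two shingle sets.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def shingle_set(tokens, k):
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

# For shingles of size two, the two token sequences from the example above
# yield the same shingle set, so the resemblance is 1.
A = ["a", "b", "a", "c", "a"]
B = ["a", "c", "a", "b", "a"]
print(jaccard(shingle_set(A, 2), shingle_set(B, 2)))  # 1.0
```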
A “sketch” which provides an estimate of the resemblance distance between any two documents may be produced. First, a unique identification g(w) is assigned to each distinct shingle w using fingerprinting. Then, a random permutation of the set of all possible fingerprints is computed to produce a plurality of random images of the unique identifications. The permutation makes it possible to compute numbers δ(g(w)) for the shingles S(A, k) in each document. A predetermined number s of the smallest elements of δ(S(A, k)) is selected, and these s smallest elements are stored as a sorted list to create the sketch of the document. Given the sketches s(A) and s(B) of two documents, their resemblance may be determined by the Jaccard similarity coefficient.
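A sketch of this sketch construction follows; the SHA-1-based fingerprint, the multiply-add emulation of the random permutation, and the choice s = 100 are assumptions of this illustration.

```python
import hashlib
import random

def fingerprint(shingle):
    # g(w): a unique identification for each distinct shingle.
    return int(hashlib.sha1(" ".join(shingle).encode()).hexdigest()[:16], 16)

def make_permutation(seed=0):
    # A random permutation of the 64-bit fingerprint space, emulated by a
    # multiply-add map with an odd multiplier (a bijection mod 2**64).
    rng = random.Random(seed)
    a, b = rng.getrandbits(64) | 1, rng.getrandbits(64)
    return lambda x: (a * x + b) & ((1 << 64) - 1)

def sketch(shingles, s=100, seed=0):
    # Keep the s smallest permuted fingerprints, sorted, as the document sketch.
    perm = make_permutation(seed)
    return sorted(perm(fingerprint(w)) for w in set(shingles))[:s]

def estimated_resemblance(sketch_a, sketch_b, s=100):
    # Estimate the Jaccard coefficient: among the s smallest values of the
    # union of the two sketches, count how many appear in both.
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 1.0
    common = set(sketch_a) & set(sketch_b)
    return sum(1 for x in merged if x in common) / len(merged)
```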
In some implementations, it may not be necessary to determine the precise resemblance; it may only be necessary to determine that the resemblance is above a certain threshold, such as a predetermined percentage, for example 97% or 99%. This “filtering” may provide sharper bounds on errors of both types, i.e., false positives (claiming that two documents resemble each other when they do not) and false negatives (claiming that two documents do not resemble each other when they do).
According to an implementation, lists that group entities according to some criterion may be mined for similar entities. Entities may occur in different lists and often may be associated with different members in each list. Mining such lists for similar entities supports related searches, list completion in document processing, and other applications.
FIG. 3 is an operational flow 300 of an implementation of a process to determine similar entities responsive to a query from a corpus of lists. At stage 302, a repository of lists is maintained. Lists may be gathered using a mechanism such as high static rank, and the lists may be stored in the search corpus 180. In some implementations, an online real-time representation of the lists may be provided. At stage 304, a representation of each list is determined. The representation may provide for a comparison between any two lists and also support efficient membership checks.
Many well known techniques may be used to determine whether documents are near-duplicates, and many of these techniques use randomness. Min-hashing is a technique for sampling an element from a set of elements in a way that is uniformly random and consistent. As noted above, the similarity between two elements may be defined as the overlap between their item sets, as given by the Jaccard similarity coefficient. In techniques that use min-hashing, each document may be mapped to an arbitrarily long string of 0s and 1s. The largest number is used as the result of a query. If there is a tie, more bits may be evaluated.
The comparison may be performed using sketches, as described above. The sketch of a list may be computed by hashing each element into a bitvector of size m and then sampling log n+c bits from this bitvector. Here n is the number of elements in the list. In some implementations, to minimize the number of distinct lengths a sketch can take on, the size of a sketch is rounded up to the nearest power of two. In some implementations, a bloom filter may be used to perform membership checks by treating the bloom filter as the sketch of the list. The number of hash functions and the length of the filter should be the same when comparing two filters.
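The description leaves the hash construction open; the following is a minimal sketch of a Bloom-filter list representation supporting membership checks and a rough similarity comparison. The filter length m = 256, the k = 4 hash functions, and the digest-slicing scheme are all assumptions of this example; two filters being compared must share m and k.

```python
import hashlib

class BloomSketch:
    def __init__(self, m=256, k=4):
        self.m, self.k = m, k          # filter length and number of hash functions
        self.bits = 0                  # the filter, stored as an integer bitvector

    def _positions(self, element):
        digest = hashlib.sha256(str(element).encode()).digest()
        # Derive k bit positions from slices of one digest (an assumption).
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, element):
        for pos in self._positions(element):
            self.bits |= 1 << pos

    def __contains__(self, element):
        return all((self.bits >> pos) & 1 for pos in self._positions(element))

    def similarity(self, other):
        # Compare two filters of identical m and k by the overlap of set bits.
        assert self.m == other.m and self.k == other.k
        union = bin(self.bits | other.bits).count("1")
        return bin(self.bits & other.bits).count("1") / union if union else 1.0
```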
At stage 306, a score for an element in a list is determined. This score represents the validity of an item in the corpus of lists. Thus, a spurious element may receive a very low score, while a valid element would receive a higher score.
At stage 308, a list weight is determined using the constituent element weights. Stage 308 may implement the following model to compute the weight of an element:
A. The weight of an element in similar lists is greater than the weight of the same element in dissimilar lists.
B. An element that occurs in a smaller number of similar lists has a greater weight than the same element occurring in a larger number of similar lists.
C. The weight of an element in a short list is greater than the weight of the same element in a longer list.
D. If the likelihood of an element A being similar to other good elements is larger than the likelihood for element B, then the weight of A is greater than the weight of B.
According to implementations based on the above, a weight w_i of an element i may be computed as
w_i = log(N / (1 + f_i)) · Σ_{l ∈ L_i} S_i(l) / log(|g − length(l)|)
where L_i is the set of lists containing element i,
f_i is equal to |L_i|,
g is the average length of a list, and
N is the total number of lists.
Next, the weight of a list j may be determined as:
lw_j = (1 / length(j)) · Σ_{i ∈ j} w_i
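A sketch of the two weight computations above. The similarity term S_i(l) is not pinned down in this text, so it is taken here as a caller-supplied function (for example, the average sketch similarity of list l to the other lists containing i); that choice, and the small constant added to keep the logarithm in the denominator positive, are assumptions of this example.

```python
import math

def element_weight(i, lists, list_similarity):
    # w_i = log(N / (1 + f_i)) * sum over l in L_i of S_i(l) / log(|g - length(l)|)
    N = len(lists)
    L_i = [l for l in lists if i in l]           # lists containing element i
    f_i = len(L_i)
    g = sum(len(l) for l in lists) / N           # average list length
    total = 0.0
    for l in L_i:
        denom = math.log(abs(g - len(l)) + 2)    # +2 keeps the log positive (assumption)
        total += list_similarity(i, l) / denom
    return math.log(N / (1 + f_i)) * total

def list_weight(j, element_weights):
    # lw_j = (1 / length(j)) * sum of w_i over the elements i in list j
    return sum(element_weights[i] for i in j) / len(j)
```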
At stage 310, the element and list weights are used to determine the nearest neighbors of a given query element. For an element i, L_i may be determined, and the set sorted based on the list weight, lw. In some implementations, the top 200 lists from this sorted set may be selected, and then a weight for each element in this set of lists is determined by:
ŵ_ki = (lw_i / argmax_m lw_m) · (w_ki / Σ_j w_ji)
Thus, elements in the top p lists may be sorted according to their weights, and the top k elements may be selected and returned as the nearest neighbors of element i.
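A sketch of this nearest-neighbor step, assuming the element and list weights above have already been computed. The per-element score here is a simplified reading of the ŵ formula (the element's weight scaled by the containing list's weight normalized by the maximum), and lists are assumed to be hashable (e.g., frozensets) so they can key the list_weights mapping.

```python
def nearest_neighbors(query, lists, element_weights, list_weights,
                      top_lists=200, top_k=10):
    # Lists containing the query element, sorted by list weight (descending),
    # truncated to the top lists.
    candidates = sorted((l for l in lists if query in l),
                        key=lambda l: list_weights[l], reverse=True)[:top_lists]
    if not candidates:
        return []
    max_lw = max(list_weights[l] for l in candidates)
    scores = {}
    for l in candidates:
        for e in l:
            if e == query:
                continue
            # Simplified reading of w-hat: element weight scaled by the
            # normalized weight of the list it occurs in.
            scores[e] = scores.get(e, 0.0) + (list_weights[l] / max_lw) * element_weights[e]
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```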
In some implementations, the lists may be mapped to the items they contain, and in turn each item is mapped to the lists that contain it. Lists and items may be assigned a 64-bit unique ID number, upon which lookups of the list sketches/bloom filters, list/item computed weights, and list/item link sets are keyed.
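A minimal sketch of such a keyed index; deriving the 64-bit IDs from a hash of the name is an assumption of this example, and any other assignment scheme would serve.

```python
import hashlib

def make_id(name):
    # A 64-bit unique ID (hash-derived here purely for illustration).
    return int(hashlib.blake2b(name.encode(), digest_size=8).hexdigest(), 16)

lists_by_id = {}   # list ID -> set of item IDs (sketches/weights keyed elsewhere by the same ID)
items_by_id = {}   # item ID -> set of IDs of lists containing that item

def index_list(list_name, item_names):
    lid = make_id(list_name)
    item_ids = {make_id(n) for n in item_names}
    lists_by_id[lid] = item_ids
    for iid in item_ids:
        items_by_id.setdefault(iid, set()).add(lid)
```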
In some implementations, the above may be used to perform sketch-based comparisons of HTML tables. This may be performed where tables of the same type, i.e., schema, are compared. Where only the information contained in a table is responsive to a query, then the tables need not be of the same type. FIG. 4 illustrates exemplary HTML tables containing stock quotes. Table 400 shows the ticker symbol in column 1, whereas table 402 shows the full company name. The column sketch for table 400 and table 402 may be determined (e.g., 100111 and 101001) and the similarity of the sketches compared.
Thus, in some implementations a row sketch may be determined, where in other implementations a column sketch is determined. The row sketch may be determined as noted above, where each row is analyzed as a list of column values. As such, an HTML table may be characterized as a list of sketches. A similar list of entities (e.g., rows) may be determined for a given row. Further, similar entities may be queried even when the input row does not contain values of all the columns. Similarly, rows that contain elements in a given column may be extracted in response to a query. In this instance, the input query (e.g., a column) may be encoded using a sketch.
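A brief sketch of a column-level comparison, reusing the BloomSketch class from the earlier example; the table contents below are hypothetical stand-ins for tables 400 and 402, not values from the figures.

```python
def column_sketch(rows, col, m=256, k=4):
    # Treat one column of an HTML table as a list of values and sketch it.
    s = BloomSketch(m, k)
    for row in rows:
        s.add(row[col])
    return s

# Hypothetical contents: first column differs (ticker symbol vs. company name),
# second column holds the same quote values in both tables.
table_400 = [["MSFT", "28.10"], ["GOOG", "650.00"]]
table_402 = [["Microsoft", "28.10"], ["Google", "650.00"]]
print(column_sketch(table_400, 0).similarity(column_sketch(table_402, 0)))  # low
print(column_sketch(table_400, 1).similarity(column_sketch(table_402, 1)))  # high
```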
Exemplary Computing Arrangement
FIG. 5 shows an exemplary computing environment in which example embodiments and aspects may be implemented. The computing system environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality.
Numerous other general purpose or special purpose computing system environments or configurations may be used. Examples of well known computing systems, environments, and/or configurations that may be suitable for use include, but are not limited to, personal computers, server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, network personal computers (PCs), minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.
Computer-executable instructions, such as program modules, being executed by a computer may be used. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.
With reference to FIG. 5, an exemplary system for implementing aspects described herein includes a computing device, such as computing device 500. In its most basic configuration, computing device 500 typically includes at least one processing unit 502 and memory 504. Depending on the exact configuration and type of computing device, memory 504 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. This most basic configuration is illustrated in FIG. 5 by dashed line 506.
Computing device 500 may have additional features/functionality. For example, computing device 500 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in FIG. 5 by removable storage 508 and non-removable storage 510.
Computing device 500 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by device 500 and includes both volatile and non-volatile media, removable and non-removable media.
Computer storage media include volatile and non-volatile, and removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Memory 504, removable storage 508, and non-removable storage 510 are all examples of computer storage media. Computer storage media include, but are not limited to, RAM, ROM, electrically erasable program read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 500. Any such computer storage media may be part of computing device 500.
Computing device 500 may contain communications connection(s) 512 that allow the device to communicate with other devices. Computing device 500 may also have input device(s) 514 such as a keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 516 such as a display, speakers, printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. Thus, the methods and apparatus of the presently disclosed subject matter, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium where, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the presently disclosed subject matter.
Although exemplary implementations may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network or distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices might include personal computers, network servers, and handheld devices, for example.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (20)

1. A method for determining similar entities to a query element from a corpus of lists, comprising:
storing a corpus of lists at a computing device configured to provide data responsive to search queries;
determining a representation for each list in the corpus;
determining similarity between lists in the corpus that contain common elements by comparing representations of the lists in the corpus;
determining a score for each element in each list in the corpus that represents validity of each element in the corpus;
determining, for each list in the corpus, an element weight for each element occurring in each list in the corpus, wherein:
the element weight of each common element that occurs in different lists in the corpus is based on the similarity between the different lists in the corpus that contain the common element, and
the element weight of each common element that occurs in different lists is greater in similar lists in the corpus than the element weight of the same common element in dissimilar lists in the corpus;
determining a list weight for each list in the corpus using constituent element weights of the elements within each list;
receiving a search query that includes the query element at the computing device;
determining nearest neighbors of the query element by:
selecting a predetermined number of top lists from a set of lists that contain the query element based on the list weight for each list in the set of lists that contain the query element,
determining an element weight for each of the elements in the top lists, and
selecting a predetermined number of top elements from the elements in the top lists as the nearest neighbors of the query element based on the element weight for each of the elements in the top lists; and
providing data responsive to the search query comprising the nearest neighbors of the query element.
2. The method of claim 1, wherein the representation for each list in the corpus comprises a sketch computed by hashing elements of each list into a bitvector and sampling a predetermined number of bits from the bitvector, the method further comprising:
comparing a first list to a second list using the sketch of the first list and the sketch of the second list.
3. The method of claim 2, further comprising:
performing membership checks using a bloom filter having a length equal to a number of hash functions used to determine the sketch.
4. The method of claim 1, wherein the element weight for each common element that occurs in the different lists is based further on:
number of lists in the corpus that contain the common element;
an average length of the lists that contain the common element; and
a total number of lists in the corpus.
5. The method of claim 4, further comprising:
determining an average weight of the lists that contain the common element.
6. The method of claim 4, further comprising:
assigning a greater element weight to each common element that occurs in similar lists when the common element appears in a smaller number of similar lists than when the common element appears in a larger number of similar lists.
7. The method of claim 4, further comprising:
assigning a greater element weight to each common element that occurs in the different lists when the common element appears in a short list than when the element appears in a longer list.
8. The method of claim 1, wherein determining nearest neighbors of the query element further comprises:
sorting the set of lists that contain the query element based on the list weight for each list in the set of lists that contain the query element.
9. The method of claim 8, further comprising:
selecting the predetermined number of top lists from the sorted set of lists that contain the query element;
sorting the elements in the top lists based on the element weight for each of the elements in the top lists to determine a sorted list of elements; and
selecting the predetermined number of top elements from the sorted list of elements as the nearest neighbors of the query element.
10. The method of claim 1, wherein the representations of the lists in the corpus comprise sketches of HTML tables.
11. A system of determining nearest neighbors of a query element, the system including a processing unit executing computer-executable program modules located in computer storage media comprising:
a search engine that provides data responsive to search queries;
a list gathering engine that stores a corpus of lists;
a comparison engine that determines a representation of each list in the corpus and compares representations of lists in the corpus to determine similarity between the lists in the corpus that contain common elements;
a weighting engine that determines, for each list in the corpus, a score for each element in each list in the corpus that represents validity of each element in the corpus, an element weight for each element occurring in each list in the corpus, and a list weight for each list in the corpus using constituent element weights, wherein:
the element weight of each common element that occurs in different lists in the corpus is based on the similarity between the different lists in the corpus that contain the common element, and
the element weight of each common element that occurs in different lists is greater in similar lists in the corpus than the element weight of the same common element in dissimilar lists in the corpus; and
a ranking engine that, in response to the search engine receiving a search query comprising the query element, determines the nearest neighbors to the query element by:
selecting a predetermined number of top lists from a set of lists that contain the query element based on the list weight for each list in the set of lists that contain the query element,
determining an element weight for each of the elements in the top lists, and
selecting a predetermined number of top elements from the elements in the top lists that contain the query element based on the element weight for each of the elements in the top lists.
12. The system of claim 11, wherein the comparison engine:
determines a sketch of each list in the corpus, and
determines membership checks using a bloom filter having a length equal to a number of hash functions used to determine the sketch.
13. The system of claim 11, wherein the weighting engine determines the element weight for each common element that occurs in the different lists based on number of lists in the corpus that contain the common element, total number of lists in the corpus, and an average length of the lists that contain the common element.
14. The system of claim 13, wherein the weighting engine determines an average weight of lists that contain the common element.
15. The system of claim 11, wherein the ranking engine sorts the set of lists that contain the query element based on the list weight for each list in the set of lists that contain the query element.
16. The system of claim 15, wherein the ranking engine:
selects the predetermined number of top lists from the sorted set of lists that contain the query element;
sorts the elements in the top lists based on the element weight for each of the elements in the top lists to determine a sorted list of elements; and
selects the predetermined number of top elements from the sorted list of elements as the nearest neighbors of the query element.
17. A computer-readable storage medium comprising computer-executable program instructions stored thereon that, when executed, cause a computing device to:
store a corpus of lists at the computing device, wherein the computing device is configured to provide data responsive to search queries;
determine a representation for each list in the corpus;
determine similarity between lists in the corpus that contain common elements by comparing representations of the lists in the corpus;
determine a score for each element in each list in the corpus that represents validity of each element in the corpus;
determine, for each list in the corpus, an element weight for each element occurring in each list in the corpus, wherein:
the element weight of each common element that occurs in different lists in the corpus is based on the similarity between the different lists in the corpus that contain the common element, and
each common element is assigned a greater weight when the common element occurs in similar lists in the corpus than when the common element occurs in dissimilar lists in the corpus;
determine a list weight for each list in the corpus using constituent element weights of the elements within each list;
receive a search query that includes the query element at the computing device;
determine nearest neighbors of the query element by:
selecting a predetermined number of top lists from a set of lists that contain the query element based on the list weight for each list in the set of lists that contain the query element,
determining an element weight for each of the elements in the top lists, and
selecting a predetermined number of top elements from the elements in the top lists as the nearest neighbors of the query element based on the element weight for each of the elements in the top lists; and
provide data responsive to the search query comprising the nearest neighbors of the query element.
18. The computer-readable storage medium of claim 17 wherein the representation for each list in the corpus comprises a sketch computed by hashing elements of each list into a bitvector and sampling a predetermined number of bits from the bitvector.
19. The computer-readable storage medium of claim 17 further comprising computer-executable program instructions for causing the computing device to:
sort the set of lists that contain the query element based on the list weight for each list in the set of lists that contain the query element.
20. The computer-readable storage medium of claim 19 further comprising computer-executable program instructions for causing the computing device to:
select the predetermined number of top lists from the sorted set of lists that contain the query element;
sort the elements in the top lists based on the element weight for each of the elements in the top lists to determine a sorted list of elements; and
select the predetermined number of top elements from the sorted list of elements as the nearest neighbors of the query element.
US11/954,218 2007-12-12 2007-12-12 Extracting similar entities from lists/tables Expired - Fee Related US8103686B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/954,218 US8103686B2 (en) 2007-12-12 2007-12-12 Extracting similar entities from lists/tables

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/954,218 US8103686B2 (en) 2007-12-12 2007-12-12 Extracting similar entities from lists/tables

Publications (2)

Publication Number Publication Date
US20090157644A1 US20090157644A1 (en) 2009-06-18
US8103686B2 true US8103686B2 (en) 2012-01-24

Family

ID=40754577

Family Applications (1)

Application Number Title Priority Date Filing Date
US11/954,218 Expired - Fee Related US8103686B2 (en) 2007-12-12 2007-12-12 Extracting similar entities from lists/tables

Country Status (1)

Country Link
US (1) US8103686B2 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7676465B2 (en) * 2006-07-05 2010-03-09 Yahoo! Inc. Techniques for clustering structurally similar web pages based on page features
US7941420B2 (en) * 2007-08-14 2011-05-10 Yahoo! Inc. Method for organizing structurally similar web pages from a web site
US20100169311A1 (en) * 2008-12-30 2010-07-01 Ashwin Tengli Approaches for the unsupervised creation of structural templates for electronic documents
US20120051657A1 (en) * 2010-08-30 2012-03-01 Microsoft Corporation Containment coefficient for identifying textual subsets
US9020835B2 (en) * 2012-07-13 2015-04-28 Facebook, Inc. Search-powered connection targeting
US9152714B1 (en) * 2012-10-01 2015-10-06 Google Inc. Selecting score improvements
US11341138B2 (en) * 2017-12-06 2022-05-24 International Business Machines Corporation Method and system for query performance prediction

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US5873082A (en) 1994-09-01 1999-02-16 Fujitsu Limited List process system for managing and processing lists of data
US5909677A (en) * 1996-06-18 1999-06-01 Digital Equipment Corporation Method for determining the resemblance of documents
US6996572B1 (en) 1997-10-08 2006-02-07 International Business Machines Corporation Method and system for filtering of information entities
US6338060B1 (en) 1998-01-30 2002-01-08 Canon Kabushiki Kaisha Data processing apparatus and method for outputting data on the basis of similarity
US6374209B1 (en) * 1998-03-19 2002-04-16 Sharp Kabushiki Kaisha Text structure analyzing apparatus, abstracting apparatus, and program recording medium
US6240409B1 (en) * 1998-07-31 2001-05-29 The Regents Of The University Of California Method and apparatus for detecting and summarizing document similarity within large document sets
US6493709B1 (en) 1998-07-31 2002-12-10 The Regents Of The University Of California Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment
US6446068B1 (en) * 1999-11-15 2002-09-03 Chris Alan Kortge System and method of finding near neighbors in large metric space databases
US7139756B2 (en) 2002-01-22 2006-11-21 International Business Machines Corporation System and method for detecting duplicate and similar documents
US7398200B2 (en) * 2002-10-16 2008-07-08 Adobe Systems Incorporated Token stream differencing with moved-block detection
US20040107189A1 (en) 2002-12-03 2004-06-03 Lockheed Martin Corporation System for identifying similarities in record fields
US20040107205A1 (en) 2002-12-03 2004-06-03 Lockheed Martin Corporation Boolean rule-based system for clustering similar records
US20040117357A1 (en) 2002-12-17 2004-06-17 International Business Machines Corporation Method, system and program product for identifying similar user profiles in a collection
US20060122978A1 (en) 2004-12-07 2006-06-08 Microsoft Corporation Entity-specific tuned searching
US20070005589A1 (en) 2005-07-01 2007-01-04 Sreenivas Gollapudi Method and apparatus for document clustering and document sketching
US20070083511A1 (en) 2005-10-11 2007-04-12 Microsoft Corporation Finding similarities in data records
US20080256143A1 (en) * 2007-04-11 2008-10-16 Data Domain, Inc. Cluster storage using subsegmenting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Similarity Functions", http://www.hcrc.ed.ac.uk/ilex/ilex3/Programmers/ProgGuide/node8.html.
Liu, et al., "Measuring Semantic Similarity between Named Entities by Searching the Web Directory", pp. 1-5.

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110231405A1 (en) * 2010-03-17 2011-09-22 Microsoft Corporation Data Structures for Collaborative Filtering Systems
US8560528B2 (en) * 2010-03-17 2013-10-15 Microsoft Corporation Data structures for collaborative filtering systems
US20120005207A1 (en) * 2010-07-01 2012-01-05 Yahoo! Inc. Method and system for web extraction
US9934311B2 (en) 2014-04-24 2018-04-03 Microsoft Technology Licensing, Llc Generating unweighted samples from weighted features

Also Published As

Publication number Publication date
US20090157644A1 (en) 2009-06-18

Similar Documents

Publication Publication Date Title
US8103686B2 (en) Extracting similar entities from lists/tables
Chakrabarti et al. Page-level template detection via isotonic smoothing
US8812493B2 (en) Search results ranking using editing distance and document information
US8099417B2 (en) Semi-supervised part-of-speech tagging
US7747600B2 (en) Multi-level search
US7610282B1 (en) Rank-adjusted content items
US7987417B2 (en) System and method for detecting a web page template
US8010545B2 (en) System and method for providing a topic-directed search
US7840569B2 (en) Enterprise relevancy ranking using a neural network
CA2625493C (en) System, method & computer program product for concept based searching & analysis
US7870474B2 (en) System and method for smoothing hierarchical data using isotonic regression
US7822752B2 (en) Efficient retrieval algorithm by query term discrimination
US20120323877A1 (en) Enriched Search Features Based In Part On Discovering People-Centric Search Intent
US20090144240A1 (en) Method and systems for using community bookmark data to supplement internet search results
US8812508B2 (en) Systems and methods for extracting phrases from text
US20110307432A1 (en) Relevance for name segment searches
Selvan et al. Survey on web page ranking algorithms
Ohta et al. Related paper recommendation to support online-browsing of research papers
US20110307479A1 (en) Automatic Extraction of Structured Web Content
CN113297457B (en) High-precision intelligent information resource pushing system and pushing method
Cheng et al. Fuzzy matching of web queries to structured data
Makhabel et al. R: Mining spatial, text, web, and social media data
Kuzomin et al. Applying The Hits Algorithm On Web Archives
US8161065B2 (en) Facilitating advertisement selection using advertisable units
Ortega et al. Polarityspam propagating content-based information through a web-graph to detect web-spam

Legal Events

Date Code Title Description
AS Assignment

Owner name: MICROSOFT CORPORATION, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GOLLAPUDI, SREENIVAS;HALVERSON, ALAN;REEL/FRAME:020290/0970

Effective date: 20071208

STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034542/0001

Effective date: 20141014

FPAY Fee payment

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY