US8849787B2 - Two stage search - Google Patents

Two stage search

Info

Publication number: US8849787B2
Application number: US13/343,160
Other versions: US20120109949A1 (en)
Authority: US (United States)
Inventors: Yunbo Cao, Hang Li
Original Assignee: Microsoft Corp
Current Assignee: Microsoft Technology Licensing LLC (assigned from Microsoft Corporation)
Legal status: Expired - Fee Related

Classifications

    • G06F17/30684
    • G06F16/3344 Query execution using natural language analysis

Definitions

  • Ranked experts 240 are thus listed in order of their relationship to the subject matter of query 230.
  • The list of ranked experts 240 can then be re-ranked based on the relationships of the experts with one another. The relationship between the experts lies in two categories: candidates that co-occur with common topics, and candidates that co-occur with one another.
  • FIG. 6 is a flow diagram illustrating the operation of one embodiment of re-ranking component 228 in order to accommodate these two categories. (A code sketch of this step follows this list.)
  • First, a plurality of context vectors are constructed to capture the topics (or subject matter of the query) that co-occur with the different candidates in expert list 240. This is indicated by block 300 in FIG. 6.
  • Context vectors are also constructed to capture candidates that co-occur with other candidates. This is indicated by block 302 in FIG. 6.
  • The two context vectors are concatenated into a single context vector for each given candidate, so that the combined context vector has both categories of information represented in it. This is indicated by block 304 in FIG. 6.
  • Component 228 then clusters the candidates based on the combined context vectors. This is indicated by block 306 in FIG. 6. In one embodiment, K-Means clustering is used as the mechanism for clustering the candidates, and new scores are calculated for re-ranking each of the candidates, illustratively of the form

    $$s'(e) = \alpha\, s(e) + (1-\alpha)\,\frac{1}{|E|}\sum_{e' \in E} s(e') \tag{Eq. 5}$$

    where s(e) is the candidate's score from the two stage model, E represents the cluster that the person e′ belongs to, and |E| represents the volume of the cluster (the number of people in the cluster).
  • The value of the parameter α affects the cluster-based re-ranking. However, the performance of re-ranking component 228 is relatively insensitive to α within a relatively wide range, such as 0.05-0.9. In one illustrative embodiment, α is thus set to 0.5, but other values can be used, as desired.
  • Based on the new scores, the list of experts is re-ranked to provide the re-ranked list of experts 350. In one embodiment, the top 10 candidates in ranked list 240 are not changed, while the lower ranked candidates are re-ranked. Of course, all candidates could be re-ranked, or fewer or more candidates could be left as originally ranked by component 226, as desired. Computing the new scores for the candidates based on the clusters, and re-ranking some or all candidates based on the new scores, is indicated by blocks 308 and 310 in FIG. 6.
  • Alternatively, the cluster-based scores can be taken into account in co-occurrence model 206. The results of applying component 228 provide an indication as to the relationships among candidate persons, as described above, and this indication can be combined into the other submodels in co-occurrence model 206. For instance, in submodel 208, given a document 234 relevant to a query 230, if some given candidate person does not occur in the document 234, then submodel 208 can assign that candidate person the average score of all candidate persons that do appear in that document 234 and that are clustered with the given candidate person.
  • In that embodiment, the clustering score is simply taken into account with the scores of the other submodels by ranking component 226, and the ranked list of experts 240 is the final ranked list of experts.
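A sketch of this re-ranking step, assuming Eq. 5 as reconstructed above and using scikit-learn's K-Means (the text names K-Means but does not prescribe a library; all names, parameters, and the data layout here are illustrative):

import numpy as np
from sklearn.cluster import KMeans

def cluster_rerank(candidates, context_vectors, scores,
                   alpha=0.5, n_clusters=5, keep_top=10):
    """Re-rank candidates by blending each score with the mean score of
    its K-Means cluster (Eq. 5 as reconstructed above), leaving the
    top-ranked candidates unchanged."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        np.asarray(context_vectors))
    scores = np.asarray(scores, dtype=float)

    new_scores = np.empty_like(scores)
    for i, lab in enumerate(labels):
        cluster_mean = scores[labels == lab].mean()     # (1/|E|) * sum of s(e')
        new_scores[i] = alpha * scores[i] + (1 - alpha) * cluster_mean

    order = np.argsort(-scores)                 # original ranking
    head = list(order[:keep_top])               # top candidates kept as-is
    tail = sorted(order[keep_top:], key=lambda i: -new_scores[i])
    return [candidates[i] for i in head + tail]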


Abstract

A two stage model identifies individuals having knowledge in a subject matter area relevant to a query. A relevance model receives a query and identifies documents, or other information, relevant to the query. A co-occurrence model identifies individuals, in the retrieved documents, related to the subject matter of the query. Individuals identified can be scored by combining scores from the relevance model and the co-occurrence model and output in a rank ordered list.

Description

CROSS-REFERENCE TO RELATED APPLICATION
The present application is a continuation of and claims priority of U.S. patent application Ser. No. 11/273,314, filed Nov. 14, 2005, now issued as U.S. Pat. No. 8,156,097, the content of which is hereby incorporated by reference in its entirety.
BACKGROUND
Currently, people are using computer systems in an attempt to locate people who are experts, or highly knowledgeable, in a specific subject matter area. This type of expert search is of growing importance, particularly in relatively large companies. For instance, in companies that have several hundred to many thousands of employees, it can be very important to find people within the company who are knowledgeable about certain subject matter areas. This greatly assists in leveraging the knowledge of individuals in the company in order to obtain efficiencies.
Some current searching systems attempt to accomplish this task by generating a model indicative of how often a person authored a type of document. While this type of system does identify authors, it is not highly effective in identifying experts in subject matter areas that are relevant to a given query.
Of course, current information retrieval systems receive a query and attempt to retrieve documents relevant to that query. These types of information retrieval systems are, as with the ones mentioned previously, not highly helpful in identifying persons having a great deal of knowledge in a subject matter area relevant to a query.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
SUMMARY
A two stage model identifies individuals having knowledge in a subject matter area relevant to a query. A relevance model receives a query and identifies documents, or other information, relevant to the query. A co-occurrence model identifies individuals, in the retrieved documents, related to the subject matter of the query. Individuals identified can be scored by combining scores from the relevance model and the co-occurrence model and output in a rank ordered list.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of one computing environment in which some embodiments may be practiced.
FIGS. 2A and 2B (collectively FIG. 2) are a more detailed block diagram of one embodiment of an expert search system.
FIG. 3 is a flow diagram illustrating one embodiment of the overall operation of the system shown in FIG. 2.
FIG. 4 is a flow diagram illustrating one embodiment of the operation of the co-occurrence model shown in FIG. 2.
FIG. 5 is a block diagram illustrating a hierarchical tree section.
FIG. 6 is a flow diagram illustrating one embodiment of the operation of the cluster based re-ranking component shown in FIG. 2.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
The present subject matter deals with identifying individuals having knowledge in certain subject matter areas. However, before describing the system in more detail, one illustrative environment in which it can be practiced will be described.
FIG. 1 illustrates an example of a suitable computing system environment 100 on which embodiments may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the claimed subject matter. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.
Embodiments are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with various embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to FIG. 1, an exemplary system for implementing some embodiments includes a general-purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
The drives and their associated computer storage media discussed above and illustrated in FIG. 1, provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on remote computer 180. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
FIGS. 2A and 2B (collectively FIG. 2) show a more detailed block diagram of an expert search system 200 in accordance with one embodiment. System 200 includes a relevance model 202 that accesses a data corpus 204. System 200 also includes a co-occurrence model 206 that includes a plurality of co-occurrence submodels which are described in detail below. The illustrative embodiment shown in FIG. 2 illustrates exemplary co-occurrence submodels 208, 210, 212, 214, 216, 218, 220, and 222. The submodels shown by way of example in FIG. 2 are described in greater detail below. FIG. 2 also shows that system 200 includes ranking component 226 and cluster based re-ranking component 228.
FIG. 3 is a flow diagram better illustrating the operation of system 200 shown in FIG. 2. Relevance model 202 first receives a query 230 from a user. The query is illustratively a textual input indicative of a subject matter area for which the user desires to find an expert, or an individual having significant knowledge of that subject matter area. Receiving the query is indicated by block 232 in FIG. 3.
Relevance model 202 retrieves from data corpus 204 a plurality of documents 234 that are relevant to query 230. Retrieving the documents is indicated by block 236 in FIG. 3. Relevance model 202 can illustratively comprise any known information retrieval system. One example is a language model that receives the input query 230 and performs information retrieval on data corpus 204 to identify relevant documents 234. In one illustrative embodiment, as described in more detail below, documents 234 also include related metadata that describes such things as the author, anchor text, title, body, and sections of each of the retrieved, relevant documents 234.
The relevance model 202 also illustratively generates a score or confidence measure, or other metric, corresponding to each of the documents 234 that indicates how relevant those documents are to the received query 230. The score or metric is illustratively provided, along with documents 234, to co-occurrence model 206. Therefore, it can be seen that relevance model 202 identifies documents from corpus 204 that are relevant to the query 230.
Co-occurrence model 206 receives query 230 (or a parsed representation, or other representation, of query 230 generated by relevance model 202) along with documents 234. Model 206 also receives, in one embodiment, the metadata and scores corresponding to documents 234. Co-occurrence model 206 determines whether the subject matter of query 230 is associated with any individual persons, and identifies those persons. Co-occurrence model 206 does this by utilizing the submodels in considering both the query 230 and the documents 234 provided by relevance model 202. Determining the experts, given the documents 234 and the query 230, is indicated by block 238 in FIG. 3.
Based on the results of applying the submodels in co-occurrence model 206, and based on the scores from relevance model 202, ranking component 226 ranks a list of individuals as experts in the subject matter of the query 230. The identified experts are ranked according to a combined score provided by relevance model 202 and co-occurrence model 206. Ranking the experts and outputting the ranked list 240 is indicated by block 242 in FIG. 3.
In one specific example, relevance model 202 and co-occurrence model 206 are combined together and used by ranking component 226 in ranking the experts. This two stage model is illustratively defined as follows:
$$P(e \mid q) = \sum_{d} P(e, d \mid q) = \sum_{d} P(d \mid q)\, P(e \mid d, q) \tag{Eq. 1}$$
While any models can be used, in the embodiment illustrated, both models 202 and 206 are statistical models, and in one illustrative embodiment, they are both statistical language models. Therefore, P(d|q) denotes the probability of a document, given the query 230, and is indicative of the score provided by relevance model 202. P(e|d,q) denotes the probability of an expert (or person, e) given the document d and the query q.
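As a concrete illustration of Eq. 1, the following minimal Python sketch combines per-document relevance scores with per-document co-occurrence scores; the function and variable names are illustrative and not prescribed by the patent.

def two_stage_score(candidate, docs, p_d_given_q, p_e_given_dq):
    """Eq. 1: P(e|q) = sum over retrieved documents d of P(d|q) * P(e|d,q).

    p_d_given_q: dict mapping a document id to its relevance-model score P(d|q).
    p_e_given_dq: function (candidate, doc_id) -> co-occurrence score P(e|d,q).
    """
    return sum(p_d_given_q[d] * p_e_given_dq(candidate, d) for d in docs)

# Illustrative usage with made-up scores for two retrieved documents:
relevance = {"doc1": 0.7, "doc2": 0.3}
cooccur = lambda e, d: {("alice", "doc1"): 0.4, ("alice", "doc2"): 0.1}.get((e, d), 0.0)
print(two_stage_score("alice", ["doc1", "doc2"], relevance, cooccur))  # 0.31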
The co-occurrence model score is estimated as follows:
$$p(e \mid d, q) = \mu\,\frac{pf(e,d)}{|d|} + (1-\mu)\,\frac{1}{df_e}\sum_{d' : e \in d'} \frac{pf(e,d')}{|d'|} \tag{Eq. 2}$$
where pf(e,d) represents the count of the occurrences of the person e in the document d; |d| represents the length of the document d in terms of the count of the occurrences of persons in that document; and df_e represents the count of the documents in which the person e occurs. Also, in one embodiment, the known technique of Dirichlet prior smoothing is used in determining the linear parameter μ as follows:
$$\mu = \frac{|d|}{|d| + \kappa} \tag{Eq. 3}$$
where κ is a parameter that is illustratively set to the average length of the documents in the collection of relevant documents 234, again measured in terms of the number of occurrences of persons in those documents.
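A sketch of Eqs. 2 and 3 under this reading (documents represented as lists of person mentions, with κ defaulting to the collection's average person-occurrence length); all names are illustrative:

def cooccurrence_score(person, doc_persons, corpus_person_lists, kappa=None):
    """Eq. 2 with the Dirichlet prior of Eq. 3.

    doc_persons: the list of person mentions in the current document d,
        so len(doc_persons) is |d| measured in person occurrences.
    corpus_person_lists: person-mention lists for all retrieved documents.
    """
    if kappa is None:
        # Eq. 3's kappa: average document length in person occurrences.
        kappa = sum(len(p) for p in corpus_person_lists) / len(corpus_person_lists)

    d_len = len(doc_persons)
    mu = d_len / (d_len + kappa)                 # Eq. 3

    pf_e_d = doc_persons.count(person)           # pf(e, d)

    # Background term: average relative frequency of e over the df_e
    # documents that contain e.
    containing = [p for p in corpus_person_lists if person in p]
    df_e = len(containing)
    background = (sum(p.count(person) / len(p) for p in containing) / df_e
                  if df_e else 0.0)

    return mu * (pf_e_d / d_len if d_len else 0.0) + (1 - mu) * background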
Ranking component 226 uses these scores to generate the list of ranked experts 240. In one illustrative embodiment, however, the list of ranked experts 240 can be provided to a cluster based re-ranking component 228 which re-ranks the ranked experts 240 (or at least some of them) based on their relation to one another. This is indicated by block 244 in FIG. 3 and is described in greater detail below with respect to FIG. 6.
FIG. 4 is a flow diagram better illustrating the step of determining the experts given the documents and the query indicated by numeral 238 in FIG. 3. First, as discussed above, relevance model 202 retrieves documents 234 and assigns a weight or score to each of those documents. This is indicated by block 250 in FIG. 4.
Co-occurrence model 206 applies one or more of the co-occurrence submodels to the documents 234 and the query 230. This is indicated by block 252 in FIG. 4. Where a plurality of submodels are used in co-occurrence model 206, the contribution of each of those models can be weighted such that more important models are weighted more heavily. The weights can be determined empirically, or otherwise. Combining the weighted scores from the co-occurrence submodels to obtain the list of ranked experts 240, along with the scores associated with those ranked experts 240 is indicated by block 254 in FIG. 4.
A number of the co-occurrence submodels for co-occurrence model 206 will now be described in greater detail. In a relatively simple form, co-occurrence model 206 does not use any metadata information corresponding to documents 234. In that case, co-occurrence submodel 208 is used. Submodel 208 simply identifies whether the subject matter of the query and a candidate person (who is a possible expert for the subject matter of the query) co-occur in the body of the text of the retrieved documents 234. Co-occurrence submodel 208 can thus simply be referred to as a document-based co-occurrence submodel, or a body/body co-occurrence submodel. The body/body notation simply indicates that the submodel determines whether the subject matter of the query and the candidate person both occur in the body of the text of the relevant documents 234. The subject matter of the query can be represented in any desirable way, such as a logical form, another dependency structure, a parse tree, or simply the content words in the query. It has been found that even this relatively simple two stage model provides significant advantages in average precision and recall over a co-occurrence model by itself.
In another embodiment, body/body submodel 208 can be revised such that it takes into account proximity as a metric in identifying an expert. In the revised body/body submodel 208, a portion of text is defined according to a certain size in words. The submodel then considers the subject matter of the query and the candidate person to co-occur in the body of the document only if they co-occur within the same defined portion of the document.
For instance, assume that a relevant document 234 is divided up into multiple portions of 200 words each. The body/body submodel 208 will then determine that the candidate person and the subject matter of the query co-occur in the body of the document only if they co-occur within one of the predefined 200-word portions. Of course, in another embodiment, the 200-word portion is a sliding window, such that the divisions of the window are not predefined. Instead, the candidate person and the subject matter of the query will be found to co-occur in the body of the document if they co-occur within 200 words of each other. Any portion size can be used, and the particular size can be determined empirically. In one illustrative embodiment, however, it was found that a size of 200 words works well.
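A minimal sketch of the sliding-window variant, treating a match on any of the query's content words as an occurrence of the topic (a simplifying assumption; names are illustrative):

def window_cooccurs(tokens, topic_terms, person, window=200):
    """True if the candidate person and a topic term appear within
    `window` tokens of each other in the document body (sliding-window
    reading of the body/body submodel)."""
    topic_pos = [i for i, t in enumerate(tokens) if t in topic_terms]
    person_pos = [i for i, t in enumerate(tokens) if t == person]
    return any(abs(i - j) <= window for i in topic_pos for j in person_pos)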
In accordance with another embodiment, the metadata corresponding to the retrieved relevant documents 234 is used by various co-occurrence submodels 208-222. In the illustrative embodiment, the metadata that is used by the co-occurrence submodels 208-222 is metadata that identifies the body, title, author, anchor text, and sections of the retrieved documents 234. The body metadata identifies the body of the document, as opposed to the title, author, abstract, and bibliography, for instance. The title metadata identifies the title of the document. The author metadata identifies the author of the document. The section metadata can be used to identify a plurality of different text blocks as separate and unique sections of the document; these can be based on HTML tags, for example, and are described in greater detail below. The anchor text metadata is information used in labeling the text of links in a document.
Based on this metadata, the submodels 208-222 can be generated. As briefly described above, the body/body submodel 208 is a model which identifies the co-occurrence of the topic or subject matter of query 230 and the candidate person in the body of the text.
The title/body co-occurrence submodel 210 identifies a co-occurrence if the topic or subject matter of the query 230 is found in the title and the candidate person is found in the body of the document 234. The body/author co-occurrence submodel 212 identifies a co-occurrence if the topic or subject matter of the query 230 is found in the body of the document 234 and the candidate person is found in the author section of the document 234. The anchor text/body submodel 214 identifies a co-occurrence if the subject matter of query 230 appears in the anchor text and the candidate person appears in the body of document 234. The anchor text/author co-occurrence submodel 216 identifies a co-occurrence if the topic or subject matter of the query 230 is found in anchor text and the candidate person is found in the author section of the document 234. The title/author co-occurrence submodel 218 identifies a co-occurrence if the topic or subject matter of the query 230 is found in the title and the candidate person is found in the author section of the document 234. Section based submodel 220 and section-tree based submodel 222 are described in greater detail below.
In one illustrative embodiment, each of the submodels in co-occurrence model 206 can be represented as shown in Eq. 2 above. The different co-occurrence submodels can be linearly combined together to form a single co-occurrence submodel score for each person, as follows:
$$p(e \mid d, q) = \sum_{m} \lambda_m\, p_m(e \mid d, q) \tag{Eq. 4}$$
where p_m(e|d,q) denotes the score for the m-th co-occurrence submodel. For instance, with the title/body co-occurrence submodel 210, if the subject matter of the query 230 occurs in the title of the document 234, the submodel 210 estimates Eq. 2 by counting persons in the body of the corresponding document.
It might be found that different ones of the co-occurrence submodels 208-222 are more or less important than others in providing an accurate expert list. In that case, different ones of the individual co-occurrence submodels, or groups of them, can be weighted differently. Therefore, the weights λ_m can be set empirically. In one illustrative embodiment, the title/author co-occurrence submodel 218 can be weighted most heavily.
In another illustrative embodiment, different combinations of the co-occurrence submodels can be used with varying weights, as desired. In one illustrative embodiment, the body/body submodel 208 can be used with the title/author submodel 218, and they can be weighted as desired. In one illustrative embodiment, they are weighted 1 and 45, respectively, although a wide variety of other submodel combinations and weights can be used, and those combinations and weights can be determined empirically or otherwise.
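A sketch of the linear combination of Eq. 4, using the illustrative 1/45 weighting of the body/body and title/author submodels mentioned above (the scores themselves are made up):

def combined_score(submodel_scores, weights):
    """Eq. 4: p(e|d,q) = sum over submodels m of lambda_m * p_m(e|d,q)."""
    return sum(weights[m] * s for m, s in submodel_scores.items())

# Hypothetical submodel scores for one candidate and document:
scores = {"body/body": 0.012, "title/author": 0.004}
weights = {"body/body": 1.0, "title/author": 45.0}
print(combined_score(scores, weights))  # 0.192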
The two co-occurrence submodels 220 and 222 which are based on section metadata will now be described in more detail. In one illustrative embodiment, section metadata divides the relevant documents 234 into relatively fine-grained sections, and the co-occurrence submodels can be configured to define a co-occurrence based on whether the subject matter of the query 230 and the candidate person co-occur within those fine-grained sections. For instance, a section might be defined as the text block enclosed by a set of any of the HTML tags listed in Table 1 below.
TABLE 1
HTML tags for identifying <section>
<table>, <ol>, <p>, <ul>, <pre>, <li>, <dl>, <dt>, <tr>, <hr>
Of course, sections can be defined using any other tags, or other markers that are not necessarily tags, as desired. In any case, in the example in which Table 1 defines the HTML tags that identify sections, the section-based submodels search for co-occurrences of subject matter and persons within the span of a section, not to exceed a given larger portion. For instance, if the document 234 is also divided into portions (as described above with respect to a 200-word portion), then the narrower (more granular) section defined by the HTML tag is subordinate to the larger portion. In that case, if either the left or the right boundary of the narrower section defined by the HTML tag exceeds a boundary of the given larger portion, the boundary of the larger portion is used by default. Thus, the section-based submodel 220 is configured to identify co-occurrences of the subject matter of the query 230 and the candidate person within these more narrowly defined, fine-grained sections of a relevant document 234.
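The following sketch illustrates counting co-occurrences within tag-delimited sections. The regular-expression parsing, the omission of self-closing tags such as <hr> (which enclose no text), and the omission of the 200-word boundary clipping are simplifying assumptions:

import re

# Paired tags from Table 1 that delimit fine-grained sections (illustrative).
SECTION_RE = re.compile(
    r"<(table|pre|ol|ul|li|dl|dt|tr|p)[^>]*>(.*?)</\1>",
    re.IGNORECASE | re.DOTALL,
)

def section_co_occurrences(html, query_terms, person):
    """Count sections in which all query terms and the candidate's
    name co-occur (clipping to a larger 200-word portion omitted)."""
    count = 0
    for match in SECTION_RE.finditer(html):
        text = match.group(2).lower()
        if person.lower() in text and all(t.lower() in text for t in query_terms):
            count += 1
    return count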
Section-tree based co-occurrence submodel 222 can use section information in a different way. Model 222 accommodates the realization that some sections are subordinate to other sections. For example, in technical literature it is very common to have parent sections identified by numerals (such as section 1, section 2, section 3, etc.), and equally common for each of those numbered sections to include descendant subsections (such as sections 1.1, 1.2, 1.2.1, etc.). In that case, it may be desirable to recognize a co-occurrence if it occurs anywhere within the overall section, as opposed to only within a subsection. Therefore, section-tree based co-occurrence submodel 222 determines co-occurrences on the basis of a section-tree structure such as that shown in FIG. 5. Of course, FIG. 5 is only an exemplary tree structure, and any other structure could be used as well. In FIG. 5, the HTML tags <H1>, <H2>, <H3>, <H4>, <H5>, and <H6> are used in defining the <section>s for section-tree 400. In addition, a priority is defined for the tags as follows:
<H1> ≻ <H2> ≻ <H3> ≻ <H4> ≻ <H5> ≻ <H6>.
Then a tree structure, such as structure 400 shown in FIG. 5, can be constructed, and submodel 222 can be generated based on that tree structure 400. Table 2 illustrates one exemplary embodiment of the operation of submodel 222.
TABLE 2
1) For a given query q
2) Foreach Text Block Ti {
3)   Get the ancestor path AncestorPath(Ti)
4)   if (match(AncestorPath(Ti), q)) {
5)     Foreach candidate expert e appearing in the Text Block Ti {
6)       count the co-occurrence of e and the topic q; }
7)   }
8) }
It can be seen that, in the embodiment shown in Table 2, submodel 222 exploits the association of the subject matter of queries and experts within the HTML heading fields (<H1>-<H6>) and their associated text blocks. In Table 2, the function AncestorPath(Ti) concatenates all of the text in the path from the HTML root to the text block Ti. For instance, in FIG. 5, the AncestorPath for Text Block 3 is "HTML Root--><H1>Head1--><H2>Head2". In line 4 of Table 2, the function match(AncestorPath(Ti), q) embodies the rules that match the query q to AncestorPath(Ti). These rules return "true" when all of the content words in the query q appear in the ancestor path, and "false" otherwise.
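A compact sketch of the Table 2 procedure follows. The nested-tuple tree representation and the whole-word matching rule are assumptions for illustration; a real implementation would build the tree from parsed <H1>-<H6> tags:

# A section tree is modeled as (heading_text, children), where each child
# is either another (heading, children) node or a leaf text-block string.

def iter_blocks(node, path=""):
    """Yield (AncestorPath, text_block) pairs, concatenating heading
    text from the root down, as in the AncestorPath function of Table 2."""
    heading, children = node
    path = (path + " " + heading).strip()
    for child in children:
        if isinstance(child, str):
            yield path, child
        else:
            yield from iter_blocks(child, path)

def match(path, query):
    """True when all content words of the query appear in the ancestor
    path (the rule embodied in line 4 of Table 2)."""
    words = set(path.lower().split())
    return all(w in words for w in query.lower().split())

def count_tree_co_occurrences(root, query, person):
    """Count text blocks whose ancestor path matches the query and that
    mention the candidate person (lines 2-6 of Table 2)."""
    return sum(
        1
        for path, block in iter_blocks(root)
        if match(path, query) and person.lower() in block.lower()
    )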
Regardless of which co-occurrence submodels are used, the scores for those submodels are combined by ranking component 226, and a list of ranked experts 240 is output. The experts in list 240 are ranked in order of their relationship to the subject matter of query 230.
In one illustrative embodiment, the list of ranked experts 240 can be re-ranked based on the relationships of the experts with one another. In an exemplary embodiment, those relationships fall into two categories:
    • 1. Candidates appear in similar contexts; and
    • 2. Candidates co-occur in some local contexts.
FIG. 6 is a flow diagram illustrating the operation of one embodiment of re-ranking component 228 in order to accommodate these two categories. In order to accommodate category 1, a plurality of context vectors are constructed to capture the topics (or subject matter of the query) that co-occur with different candidates in expert list 240. This is indicated by block 300 in FIG. 6.
In order to accommodate the information in category 2, context vectors are also constructed to capture candidates that co-occur with other candidates. This is indicated by block 302 in FIG. 6.
Next, the two context vectors are concatenated into a single context vector for each given candidate. This is indicated by block 304 in FIG. 6. Thus, the combined context vector has both categories of information represented in it.
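As a rough sketch, the combined context vector for a candidate might be constructed as follows; the section-level counting and the substring membership test are illustrative assumptions:

def combined_context_vector(candidate, sections, topics, candidates):
    """Concatenate a topic-co-occurrence vector (category 1) with a
    candidate-co-occurrence vector (category 2) for one candidate.
    `sections` is a list of local-context strings."""
    topic_part = [
        sum(1 for s in sections if topic in s and candidate in s) for topic in topics
    ]
    candidate_part = [
        sum(1 for s in sections if other in s and candidate in s) if other != candidate else 0
        for other in candidates
    ]
    return topic_part + candidate_part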
Next, component 228 clusters the candidates based on the combined context vectors. This is indicated by block 306 in FIG. 6. In one illustrative embodiment, K-means clustering is used as the mechanism for clustering the candidates. New scores are then calculated for re-ranking each of the candidates as follows:
$p_{\mathrm{new}}(e \mid q) = (1 - \lambda)\, p(e \mid q) + \lambda \sum_{e' \in E} p(e' \mid q) \,/\, |E|$    Eq. 5
where E represents the cluster to which the person e belongs, e′ ranges over the people in E, and |E| represents the size of the cluster (the number of people in it). The value of the parameter λ affects the cluster-based re-ranking, but the performance of re-ranking component 228 is relatively insensitive to λ within a relatively wide range, such as 0.05-0.9. In one illustrative embodiment, λ is therefore set to 0.5, although other values can be used, as desired.
Using the new scores calculated as described with respect to Eq. 5, the list of experts is re-ranked to provide the re-ranked list of experts 350. In one illustrative embodiment, the top 10 candidates in ranked list 240 are not changed, while the lower ranked candidates are re-ranked. Of course, all candidates could be re-ranked, or fewer or more candidates could be left as originally ranked by component 226, as desired. Computing the new scores for the candidates based on the clusters, and re-ranking some or all candidates based on the new scores is indicated by blocks 308 and 310 in FIG. 6.
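A minimal sketch of the cluster-based re-scoring of Eq. 5, assuming each candidate already has a combined context vector and an initial score p(e|q); scikit-learn's KMeans is used here as one possible clustering mechanism:

import numpy as np
from sklearn.cluster import KMeans

def rerank(names, vectors, scores, n_clusters=5, lam=0.5, keep_top=10):
    """Re-score candidates per Eq. 5 and re-rank all but the top `keep_top`,
    mirroring the embodiment in which the top 10 are left unchanged."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.asarray(vectors))
    scores = np.asarray(scores, dtype=float)
    new_scores = np.empty_like(scores)
    for i, label in enumerate(labels):
        cluster_scores = scores[labels == label]   # p(e'|q) for all e' in cluster E
        new_scores[i] = (1 - lam) * scores[i] + lam * cluster_scores.mean()
    order = np.argsort(-scores)                    # original ranking by p(e|q)
    head = list(order[:keep_top])                  # kept as originally ranked
    tail = sorted(order[keep_top:], key=lambda i: -new_scores[i])
    return [names[i] for i in head + tail]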
It should also be noted that, while re-ranking is shown in FIG. 2 as being performed by component 228 after ranking component 226 provides an initial ranked list of experts 240, the cluster-based scores can instead be taken into account within co-occurrence model 206. The results of applying component 228 provide an indication of the relationships among candidate persons, as described above, and this can be combined into other submodels in co-occurrence model 206. For example, in submodel 208, given a document 234 relevant to a query 230, if a given candidate person does not occur in the document 234, then for that candidate person submodel 208 can provide a score equal to the average score of all candidate persons who do appear in that document 234 and who are clustered with the given candidate person. In that embodiment, the clustering score is simply taken into account with the scores of the other submodels by ranking component 226, and the ranked list of experts 240 is the final ranked list of experts.
Although the present invention has been described with reference to preferred embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

Claims (20)

What is claimed is:
1. A search system, comprising:
a relevance model stored on a computer readable medium;
a computer processor that applies the relevance model to a query so as to identify a query result and generate a relevance score indicative of relevancy of the query result to the query;
a co-occurrence model, stored on the computer readable medium, for determining textual co-occurrences within a predefined portion of the query result, that comprises less than all of the query result, wherein the computer processor:
identifies, from the query result, a plurality of persons related to a subject matter of the query; and
applies the co-occurrence model to the query result to generate a co-occurrence score based on a number of textual co-occurrences of both the subject matter of the query and a name of the plurality of persons within the predefined portion of the query result; and
a ranking component that generates a ranked list of the plurality of persons based at least in part on the relevance score for the query result and the co-occurrence score.
2. The system of claim 1, wherein the relevance score is indicative of a probability of a document given the query, and the co-occurrence score is indicative of a probability of an expert given the document and the query.
3. The system of claim 1, and further comprising a re-ranking component that computes a score for each of the plurality of persons based on a relationship with another of the plurality of persons, and re-ranks the plurality of persons based on said computed scores.
4. The system of claim 1, wherein the ranking component uses a smoothing coefficient that is based at least in part on a length of the document.
5. The system of claim 1, wherein the co-occurrence model identifies the co-occurrences only if terms from the query and the name of the person occur together within the predefined portion of the query result.
6. The system of claim 1, wherein the predefined portion of the query result comprises one of:
a body portion of the query result;
a body portion and a title of the query result;
a body portion and an author section of the query result;
an anchor text portion and a body portion of the query result; and
a title and author section of the query result.
7. The system of claim 1, wherein the predefined portion of the query result comprises a document and related metadata, the related metadata comprising one or more of document body metadata, title metadata, author metadata, anchor text metadata, and metadata indicative of a hierarchy of document sections.
8. The system of claim 7, wherein the title metadata identifies a title of the document, and wherein the co-occurrence model determines co-occurrence of a topic of the query in the title.
9. The system of claim 7, wherein the author metadata identifies an author section of the document, and wherein the co-occurrence model determines co-occurrence of a candidate person in the author section.
10. The search system of claim 1, wherein the predefined portion of the query result comprises a predefined number of words, and wherein co-occurrences of the subject matter of the query and a name of the plurality of persons are determined to exist only if the subject matter of the query and the name occur within the predefined number of words from each other.
11. The system of claim 1, wherein the co-occurrence model comprises a plurality of co-occurrence submodels, each submodel determining textual co-occurrences within a predefined portion of the query result, and wherein the computer processor generates the co-occurrence score by individually applying each co-occurrence submodel to the query result to generate a corresponding submodel score, and combining the submodel scores to obtain the co-occurrence score, wherein in combining the submodel scores at least some of the submodel scores are weighted differently from one another.
12. A computer implemented method of identifying persons related to a query, comprising:
processing a query to obtain a query result, wherein the query result comprises at least one document;
identifying a set of people, not specifically named in the query itself, based on a relationship of the set of people to the query, the relationship being determined based at least in part on information derived from the query result;
generating a co-occurrence score for each person in the set of people by:
applying a plurality of co-occurrence submodels to the information derived from the query result, each co-occurrence submodel analyzing different portions of the information compared to another co-occurrence submodel to determine whether a subject matter of the query and a name of the person, not specifically named in the query itself, co-occur in the analyzed portions, and generating a score as a function of location of both the subject matter of the query and the name of the person in the document, wherein the plurality of co-occurrence submodels comprise at least two of:
a body/body co-occurrence submodel generating a score based on a determination that the subject matter of the query and the name of the person both occur in a same body portion of the document;
a title/body co-occurrence submodel generating a score based on a determination that the subject matter of the query occurs in a title of the document and the name of the person occurs in a body portion of the document;
a body/author co-occurrence submodel generating a score based on a determination that the subject matter of the query occurs in a body portion of the document and the name of the person occurs in an author section of the document;
an anchor text/body co-occurrence submodel generating a score based on a determination that the subject matter of the query occurs in anchor text of the document and the name of the person occurs in a body portion of the document;
an anchor text/author co-occurrence submodel generating a score based on a determination that the subject matter of the query occurs in anchor text of the document and the name of the person occurs in an author section of the document; and
a title/author co-occurrence submodel generating a score based on a determination that the subject matter of the query occurs in a title of the document and the name of the person occurs in an author section of the document; and
combining scores from the plurality of co-occurrence submodels to obtain the co-occurrence score;
ranking the set of people based at least in part on the co-occurrence scores; and
outputting the set of people in a ranked order.
13. The method of claim 12, and further comprising linearly combining scores for the plurality of co-occurrence submodels to form a single score for each person in the set of people.
14. The method of claim 13, wherein linearly combining comprises applying multiple different weighting factors.
15. The method of claim 12, wherein ranking comprises ranking based at least in part on a proximity of a person in the set of people relative to a portion of information in the query result, and wherein the plurality of co-occurrence submodels are weighted empirically.
16. A computer implemented method of identifying persons related to a query, comprising:
processing a query to obtain a query result;
dividing the query result into a number of different sections;
identifying a set of people based on a number of co-occurrences of a name identifying a person and a subject matter of the query within one of the different sections;
ranking the set of people based at least in part on the number of co-occurrences;
identifying relationships, from the query result, between at least some of the people in the set of people, by:
identifying topics that co-occur in the query result with different people in the set of people; and
identifying people in the set of people, that co-occur with each other in a given section of the query result;
re-ranking the set of people based on the identified relationships; and
generating an output based on the re-ranking.
17. The method of claim 16, wherein the query result includes documents and related metadata, wherein the identification of the set of people is based at least in part on the related metadata, and wherein the query result is divided into sections having a fixed number of words.
18. The method of claim 17, wherein the sections having the fixed number of words are defined by at least one of:
a sliding window having the fixed number of words; and
one or more predefined non-moving sections having the fixed number of words.
19. The computer implemented method of claim 16, wherein re-ranking comprises:
for each person of the set of people, computing a score for the person based on a relationship of the person with another person in the set of people; and
ranking the set of people based on the computed scores.
20. The computer implemented method of claim 16, wherein re-ranking comprises at least one of:
determining that names of first and second people in the set of people co-occur in a similar subject matter context in the query result; and
determining that the names of the first and second people co-occur in at least one local context in the query result.
US13/343,160 2005-11-14 2012-01-04 Two stage search Expired - Fee Related US8849787B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/343,160 US8849787B2 (en) 2005-11-14 2012-01-04 Two stage search

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/273,314 US8156097B2 (en) 2005-11-14 2005-11-14 Two stage search
US13/343,160 US8849787B2 (en) 2005-11-14 2012-01-04 Two stage search

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US11/273,314 Continuation US8156097B2 (en) 2005-11-14 2005-11-14 Two stage search

Publications (2)

Publication Number Publication Date
US20120109949A1 US20120109949A1 (en) 2012-05-03
US8849787B2 true US8849787B2 (en) 2014-09-30

Family

ID=38042092

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/273,314 Expired - Fee Related US8156097B2 (en) 2005-11-14 2005-11-14 Two stage search
US13/343,160 Expired - Fee Related US8849787B2 (en) 2005-11-14 2012-01-04 Two stage search

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US11/273,314 Expired - Fee Related US8156097B2 (en) 2005-11-14 2005-11-14 Two stage search

Country Status (1)

Country Link
US (2) US8156097B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240248902A1 (en) * 2017-11-09 2024-07-25 Microsoft Technology Licensing, Llc Machine reading comprehension system for answering queries related to a document

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7689615B2 (en) * 2005-02-25 2010-03-30 Microsoft Corporation Ranking results using multiple nested ranking
JP2007140973A (en) * 2005-11-18 2007-06-07 National Institute Of Information & Communication Technology Page reranking device, and page reranking program
US7707204B2 (en) * 2005-12-13 2010-04-27 Microsoft Corporation Factoid-based searching
US20080005067A1 (en) * 2006-06-28 2008-01-03 Microsoft Corporation Context-based search, retrieval, and awareness
US8195660B2 (en) * 2007-06-29 2012-06-05 Intel Corporation Method and apparatus to reorder search results in view of identified information of interest
US20090182723A1 (en) * 2008-01-10 2009-07-16 Microsoft Corporation Ranking search results using author extraction
US8417694B2 (en) * 2008-03-31 2013-04-09 International Business Machines Corporation System and method for constructing targeted ranking from multiple information sources
US8364693B2 (en) * 2008-06-13 2013-01-29 News Distribution Network, Inc. Searching, sorting, and displaying video clips and sound files by relevance
US20100082652A1 (en) * 2008-09-29 2010-04-01 Chacha Search, Inc. Method and system for managing user interaction
US8661030B2 (en) 2009-04-09 2014-02-25 Microsoft Corporation Re-ranking top search results
US9830379B2 (en) * 2010-11-29 2017-11-28 Google Inc. Name disambiguation using context terms
US9589072B2 (en) 2011-06-01 2017-03-07 Microsoft Technology Licensing, Llc Discovering expertise using document metadata in part to rank authors
US8510322B2 (en) 2011-06-17 2013-08-13 Microsoft Corporation Enriched search features based in part on discovering people-centric search intent
US10593003B2 (en) 2013-03-14 2020-03-17 Securiport Llc Systems, methods and apparatuses for identifying person of interest
US9378010B1 (en) * 2014-12-17 2016-06-28 International Business Machines Corporation Calculating confidence values for source code based on availability of experts
CN110309300B (en) * 2018-08-23 2021-05-11 北京慧经知行信息技术有限公司 Method for identifying knowledge points of physical examination questions
US11822588B2 (en) 2018-10-24 2023-11-21 International Business Machines Corporation Supporting passage ranking in question answering (QA) system
US11275749B2 (en) * 2018-12-31 2022-03-15 International Business Machines Corporation Enhanced query performance prediction for information retrieval systems
US12026967B2 (en) 2020-12-31 2024-07-02 Securiport Llc Travel document validation using artificial intelligence and unsupervised learning

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6223165B1 (en) * 1999-03-22 2001-04-24 Keen.Com, Incorporated Method and apparatus to connect consumer to expert
US20020059201A1 (en) * 2000-05-09 2002-05-16 Work James Duncan Method and apparatus for internet-based human network brokering
US20020198875A1 (en) * 2001-06-20 2002-12-26 Masters Graham S. System and method for optimizing search results
US20030115187A1 (en) * 2001-12-17 2003-06-19 Andreas Bode Text search ordered along one or more dimensions
US20030158855A1 (en) 2002-02-20 2003-08-21 Farnham Shelly D. Computer system architecture for automatic context associations
US20040133557A1 (en) 2003-01-06 2004-07-08 Ji-Rong Wen Retrieval of structured documents
US20050234904A1 (en) * 2004-04-08 2005-10-20 Microsoft Corporation Systems and methods that rank search results
US20060059147A1 (en) * 2004-09-15 2006-03-16 Yahoo! Inc. System and method of adaptive personalization of search results for online dating services
US20060074883A1 (en) 2004-10-05 2006-04-06 Microsoft Corporation Systems, methods, and interfaces for providing personalized search and information access
US7031961B2 (en) 1999-05-05 2006-04-18 Google, Inc. System and method for searching and recommending objects from a categorically organized information repository
US20070038601A1 (en) 2005-08-10 2007-02-15 Guha Ramanathan V Aggregating context data for programmable search engines
US7181438B1 (en) 1999-07-21 2007-02-20 Alberti Anemometer, Llc Database access system
US20070219976A1 (en) 2006-03-20 2007-09-20 Microsoft Corporation Extensible query language with support for rich data types
US20070255702A1 (en) 2005-11-29 2007-11-01 Orme Gregory M Search Engine
US20080104032A1 (en) 2004-09-29 2008-05-01 Sarkar Pte Ltd. Method and System for Organizing Items

Non-Patent Citations (14)

* Cited by examiner, † Cited by third party
Title
Advisory Action dated Jul. 1, 2009 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 3 pgs.
Amendment After Final response filed Jun. 15, 2009 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 6 pgs.
Amendment filed Apr. 9, 2010 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 14 pgs.
Amendment filed Dec. 31, 2008 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 9 pgs.
Amendment filed May 23, 2008 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 12 pgs.
Amendment filed Oct. 6, 2010 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 10 pgs.
Final Office Action dated Apr. 15, 2009 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 17 pgs.
Final Office Action dated Apr. 20, 2011 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 19 pgs.
Notice of Allowance dated Oct. 6, 2011 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 12 pgs.
Office Action dated Feb. 6, 2008 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 13 pgs.
Office Action dated Jul. 9, 2010 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 16 pgs.
Office Action dated Oct. 1, 2008 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 15 pgs.
RCE Amendment filed Jul. 27, 2011 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 14 pgs.
RCE Amendment filed Oct. 15, 2009 in related U.S. Appl. No. 11/273,314, filed Nov. 14, 2005. 14 pgs.

Also Published As

Publication number Publication date
US8156097B2 (en) 2012-04-10
US20120109949A1 (en) 2012-05-03
US20070112720A1 (en) 2007-05-17

Legal Events

Date Code Title Description
STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034544/0541

Effective date: 20141014

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551)

Year of fee payment: 4

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20220930