WO2016206044A1 - Extracting enterprise project information - Google Patents

Info

Publication number
WO2016206044A1
WO2016206044A1 PCT/CN2015/082341 CN2015082341W WO2016206044A1 WO 2016206044 A1 WO2016206044 A1 WO 2016206044A1 CN 2015082341 W CN2015082341 W CN 2015082341W WO 2016206044 A1 WO2016206044 A1 WO 2016206044A1
Authority
WO
WIPO (PCT)
Prior art keywords
project
enterprise
person
name
names
Prior art date
Application number
PCT/CN2015/082341
Other languages
French (fr)
Inventor
Manish Gupta
Avishek DAN
Victor DAS
Pallavi MATANI
Rupesh Kumar MEHTA
Zhongyuan WANG
Zheng Chen
Jun Yan
Lei Ji
Original Assignee
Microsoft Technology Licensing, Llc
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc
Priority to CN201580077811.4A (CN107430607A)
Priority to PCT/CN2015/082341 (WO2016206044A1)
Publication of WO2016206044A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Definitions

  • An enterprise can generally be defined as an organizational entity, and more particularly as referring to the entirety of the organization including its various units and locations.
  • An enterprise can amass vast quantities of different types of data relating to its operations. For example, this data includes information about the enterprise's various projects including the people that are working on a project and the project-related items generated and collected during the course of the project. This project information is often scattered across a multitude of enterprise data sources.
  • the project information extraction implementations described herein generally extract project information and generate a project information database for an enterprise. In one implementation, this is accomplished using a computing device to perform the following process actions. First, enterprise project names are extracted from information sources associated with an enterprise. People associated with the project corresponding to each extracted enterprise project name are identified also using information sources associated with an enterprise. A project information database is then generated for the enterprise. This database has an entry for each project which includes the extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project.
  • FIG. 1 is a diagram illustrating one implementation, in simplified form, of a project information database system for realizing the project information extraction implementations described herein.
  • FIG. 2 is a flow diagram illustrating one implementation of a process for extracting project information and generating a project information database for an enterprise.
  • FIG. 3 shows exemplary pseudocode for performing a Hearst pattern analysis used to identify text strings in enterprise documents that are potential project names.
  • FIG. 4 shows exemplary pseudocode for performing a seed-based splitting procedure for identifying enterprise project names and modifiers in distribution group (DG) titles.
  • FIG. 5 shows exemplary pseudocode for performing a suffix frequency splitting procedure for identifying enterprise project names and modifiers in DG titles.
  • FIGS. 6A-B are a flow diagram illustrating one implementation of a process for identifying people associated with a project corresponding to each extracted enterprise project name using one or more information sources containing enterprise documents.
  • FIGS. 7A-B are a flow diagram illustrating one implementation of a process for identifying people associated with a project corresponding to each extracted enterprise project name using one or more information sources containing enterprise distribution groups and meeting information.
  • FIGS. 8A-B are a flow diagram illustrating one implementation of a process for ranking people associated with a project by their role designations.
  • FIG. 9 is a diagram depicting a general purpose computing device constituting an exemplary system for use with the project information extraction implementations described herein.
  • a component can be a process running on a processor, an object, an executable, a program, a function, a library, a subroutine, a computer, or a combination of software and hardware.
  • an application running on a server and the server can be a component.
  • One or more components can reside within a process and a component can be localized on one computer and/or distributed between two or more computers.
  • processor is generally understood to refer to a hardware component, such as a processing unit of a computer system.
  • the project information extraction implementations described herein extract structured information related to project entities in an enterprise.
  • this information can include, without limitation: a list of projects; the people associated with each project (optionally grouped by roles) ; related meetings; time line information for the projects; distribution groups related to the projects; related projects; documents associated with the projects; definitions, acronyms, project descriptions; concept terms associated with the projects; program code check-ins; emails; social networking messages; among others.
  • Such information is used for a large variety of applications such as search and recommendations.
  • data sources within the enterprise can include, without limitation: documents (along with information about who modified, viewed a document and when) ; an active directory which provides information about employees and the organization hierarchy, and also distribution groups along with the employees which are part of them; meeting information such as attendees, organizer, title, description, time duration, whether recurring or not, and meeting notes if available; social network information for any enterprise wide social network; emails within the enterprise; program code check-ins with metadata such as user, code files, comments, time of checkin, repository location, codebase directory; among others.
  • the data sources can also include external sources where users interact with the enterprise, such as, without limitation, news articles, blog articles, non-enterprise documents, related public projects, related people outside the enterprise, other related entities, external communications of the company, and so on.
  • Fig. 1 shows an exemplary project information database system that can be used to realize the project information extraction implementations described herein.
  • project information extraction implementations described herein can vary in the information that is extracted and used to generate the database. Some implementations extract enterprise project names and the people associated with each project for the database, while others extract the project names and people as well as one or more of the project-related items mentioned previously (e.g., related meetings, time line information, distribution groups, related projects, documents, definitions, acronyms, project descriptions, concept terms, program code check-ins, emails, social networking messages, and so on) .
  • the exemplary database system of Fig. 1 shows the extraction and databasing of the aforementioned project-related items collectively and as an option.
  • One or more computing devices 100 each including a processor, communication interface and memory host various extraction and database generating modules. Whenever there is more than one computing device involved, the computing devices can be in communication with each other via a computer network.
  • the computing devices 100 host a project name extraction module 102, a related-people extraction module 104, and an optional project-related item extraction module 106. It is noted that the optional nature of the project-related item extraction module 106 is indicated by the use of a dashed lined box.
  • Various data sources are in communication with the computing devices 100, and are searchable by the extraction modules 102/104/106.
  • these data sources include an enterprise document data source 108, an active directory 110 (which includes information about people associated with the enterprise and distribution group lists, among other things) , an enterprise meeting data source 112 and a project-related items data source 114.
  • the project-related items data source 114 can include, without limitation, social network information for any enterprise wide social network, emails generated within the enterprise, program code check-ins with metadata such as user, code files, comments, time of check-in, repository location, codebase directory, as well as external sources such as employee log sheets, external communications of the company, and so on.
  • the extraction modules 102/104/106 are in communication with an enterprise project information database generation module 116.
  • the database generation module 116 generates the enterprise project information database 118 from the information extracted by the extraction modules 102/104/106.
  • the one or more computing devices 100 execute a computer program having various program modules which direct the extraction modules 102/104/106 and enterprise project information database generation module 116 to perform the following process actions.
  • the computer program directs the aforementioned modules to extract enterprise project names from the information sources (process action 200) , and identify people associated with the project corresponding to each extracted enterprise project name using the information sources (process action 202) .
  • a project information database is then generated for the enterprise that includes an entry for each project having an extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project (process action 204) .
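  • The three process actions above (200, 202, 204) end in a database with one entry per project. A minimal sketch of such an entry follows; the names ProjectEntry and build_project_database are hypothetical and the in-memory dictionary stands in for whatever storage an implementation actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class ProjectEntry:
    """One database entry: a project name plus at least a list of associated people."""
    name: str
    people: list = field(default_factory=list)
    related_items: dict = field(default_factory=dict)  # optional project-related items

def build_project_database(project_names, people_by_project):
    """Assemble one entry per extracted project name (process action 204)."""
    return {
        name: ProjectEntry(name, sorted(people_by_project.get(name, ())))
        for name in project_names
    }
```

  • For example, build_project_database(["Falcon"], {"Falcon": {"bob", "ann"}}) yields a single entry whose people list is ["ann", "bob"].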
  • Extracting project information from enterprise-related information sources is not simply a matter of looking in the right place. Rather, in the context of the project information extraction implementations described herein, extracting enterprise project information involves finding previously hidden relationships among various information sources and transforming existing information into a form that exposes these hidden relationships. For example, project names, the people that work on a project, and the aforementioned various items generated for a project are often scattered across a variety of information sources with no apparent connection between them. Thus, the project information extraction implementations described herein find these names and items, and the relationships between them, and in one implementation create an enterprise project database that collects the extracted project information for each enterprise project discovered.
  • Extracting project information from enterprise data has many advantages. It allows the nature and scope of a project to be better understood when viewed in isolation from information associated with other projects. This in turn makes it easier to recommend a project to a new employee. For example, a new employee often faces an information overload problem. A structured view of the projects going on in the company can help him or her get an organized picture of the new environment. Employees working on a new project in a large company also need to identify whether a sub-problem (or a dependency) has been solved by some other team in the company. Knowing the various projects that are going on in an enterprise can help workers and managers identify duplicate work efforts across multiple project teams. In addition, the project information can assist employees in identifying points of contact within the enterprise.
  • A structured view of the projects also facilitates a more efficient semantic search capability to find information related to a project.
  • extracting project information from enterprise data rather than exclusively from outside sources overcomes a circumstance where the project names within an enterprise are different from publicly known projects with the same concept. Also internal project names could mean something completely different in the external world.
  • One source for enterprise project names is in digital documents archived in various electronic memories within the enterprise. These memories were collectively referred to in Fig. 1 as the document data source 108.
  • Various procedures can be employed alone or in any combination to extract these project names from the enterprise's documents.
  • a conventional pattern recognition procedure is employed to identify project names within the enterprise's documents.
  • a Hearst pattern analysis can be employed to identify text strings in the enterprise documents that are potential project names.
  • Fig. 3 outlines such a procedure where the variable NP refers to a noun phrase.
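  • The pseudocode of Fig. 3 is not reproduced in this text. The following is a minimal sketch of a Hearst-style pattern match, assuming lexical cues such as "projects such as X, Y and Z" and approximating a noun phrase (NP) as a run of capitalized words; a real implementation would likely use a proper NP chunker. The function name hearst_candidates and the specific cue words are assumptions.

```python
import re

# Crude NP stand-in: one or more capitalized words (assumption, not from Fig. 3).
NP = r"[A-Z][\w-]*(?: [A-Z][\w-]*)*"
# Assumed cue phrases followed by a comma/and/or separated list of NPs.
HEARST = re.compile(
    rf"projects (?:such as|like|including) "
    rf"((?:{NP})(?:, (?:{NP}))*(?:,? (?:and|or) (?:{NP}))?)"
)

def hearst_candidates(text):
    """Return the NPs listed after a Hearst cue as potential project names."""
    names = []
    for match in HEARST.finditer(text):
        for part in re.split(r",? (?:and|or) |, ", match.group(1)):
            if part:
                names.append(part)
    return names
```

  • On the sentence "We fund projects such as Falcon, Blue Heron and Orion this year." this sketch returns Falcon, Blue Heron and Orion as candidates.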
  • Table expansion is another procedure that can be employed. More particularly, in one implementation, tables are identified in the enterprise documents that have a column or row which includes at least two previously-known enterprise project names. For example, these previously-known names could have been identified using the aforementioned Hearst pattern analysis. The other names listed in the same column or row are then deemed to be potential project names. For example, let P1 and P2 be two previously-known project names. Now, if a particular table in a document contains a column with P1 and P2 and 10 other strings in the same column, those 10 strings are considered to be potential project names as well.
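  • Table expansion can be sketched as follows, assuming each table has already been parsed into a rectangular list of rows of cell strings. The function name expand_from_tables and the min_known parameter are hypothetical.

```python
def expand_from_tables(tables, known_names, min_known=2):
    """Collect cells that share a row or column with at least `min_known` known names."""
    known = set(known_names)
    candidates = set()
    for table in tables:
        columns = [list(col) for col in zip(*table)]  # assumes rectangular tables
        for line in list(table) + columns:            # every row, then every column
            cells = {c.strip() for c in line if c and c.strip()}
            if len(cells & known) >= min_known:
                candidates |= cells - known
    return candidates
```

  • With the table [["Falcon", "Ann"], ["Orion", "Bob"], ["Heron", "Cy"]] and known names Falcon and Orion, only the first column qualifies, so Heron becomes a potential project name while the people in the second column do not.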
  • potential project names are identified in the titles of the enterprise documents.
  • a variation of a project name-modifier analysis that eliminates modifier words from the titles and deems the remaining words to be a project name can be employed. This project name-modifier analysis will be described in more detail in sections to follow.
  • potential project names that do not appear in the enterprise documents more than a prescribed number of times (a threshold) are eliminated as candidates.
  • In one version, the threshold is set to 10.
  • In another version, the threshold is set to a percentage (e.g., 5%) of the average number of documents per project in the enterprise.
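  • A minimal sketch of this frequency filter follows. It assumes that "number of times" means the number of documents containing the name as a substring; the text does not say whether documents or individual occurrences are counted, and the function name filter_by_frequency is hypothetical.

```python
from collections import Counter

def filter_by_frequency(candidates, documents, threshold=10):
    """Keep candidate names found in more than `threshold` documents."""
    counts = Counter()
    for doc in documents:
        for name in candidates:
            if name in doc:          # assumption: plain substring match
                counts[name] += 1
    return [name for name in candidates if counts[name] > threshold]
```

  • For the percentage-based version, the caller would pass threshold = 0.05 * (average number of documents per project) instead of a fixed value.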
  • Another source for enterprise project names is distribution group (DG) and meeting titles.
  • Distribution group titles can be found in the active directory 110 and meeting titles can be found in the meeting data source 112 referred to in Fig. 1.
  • DG and meeting titles are relatively less noisy in comparison to documents.
  • DGs are often exhaustive in nature because enterprises tend to have a distribution group to link all employees working on a project.
  • DGs are timely because whenever a new project starts, a new DG is usually created.
  • previously identified meetings and distribution groups that have a title which includes a person's name or a term indicative of a person's name (e.g., first name, last name, full name, pseudonym, nickname, email alias and so on) are eliminated from consideration.
  • Conventional methods are employed to identify these meetings and distribution group titles containing people's names.
  • DG and meeting titles can be split into two parts—namely a project name, and modifier words. For example, “Project A dev team” consists of the project name “Project A” and modifier words “dev team” . Based on these observations, it is possible to extract project names from DG and meeting titles using a project name-modifier analysis, as well as to generate a project name corpus and a modifier word corpus.
  • DG titles contain a project name optionally followed by one or more modifier words.
  • procedures are proposed to obtain an initial (project, modifier) split for the enterprise’s DG titles.
  • This involves starting with a seed set of project names and extracting modifier words related to those projects.
  • the seed set can be a list of project names extracted from documents, or one could also start with a list of known enterprise project names (e.g., 5-10 names) .
  • the procedure leverages these modifiers to extract more project names from the enterprise’s DG titles. Thus, over multiple iterations more projects and modifiers are identified.
  • the modifier list is trimmed by removing low-frequency words. Using this trimmed modifier list, the DG titles are processed again to obtain the project name corresponding to each DG title.
  • Fig. 4 outlines an exemplary seed-based splitting procedure for identifying enterprise project names and modifiers in DG titles.
  • This procedure begins with a project name seed list as described previously.
  • a list of enterprise DG titles is input.
  • the list of DG titles comes from the aforementioned Active Directory.
  • the first part of the procedure is iterative. In one version (shown in Fig. 4) , this part of the procedure is iterated a prescribed number of times, before moving on to the remainder of the procedure. However, it is noted that in another version, the iterations are repeated until the new project names and/or modifiers discovered in the last iteration fall below a minimum threshold (i.e., have converged) . Still further, in another version, the iterations are repeated the prescribed number of times, or until the new project names and/or modifiers discovered in the last iteration fall below the minimum threshold, whichever occurs first.
  • a list of candidate splits between a potential project name part and a potential modifier part of each unprocessed DG title is generated.
  • In the first iteration, all the DG titles are considered unprocessed, while in subsequent iterations only those DG titles that have not had either a project name or modifier discovered therein are considered unprocessed.
  • the list of split candidates for each unprocessed DG title can be generated in a variety of ways, such as the procedure that will be described shortly in connection with the description of project name list refinement.
  • In one version, the modifier corpus initially includes one or more pre-established modifiers (e.g., known project name modifier words and/or phrases used in DG titles) .
  • In another version, the modifier corpus is initially empty. In either case, it is built up with new modifiers over the course of the iterations as will be apparent shortly.
  • If the modifier part of the unprocessed DG title candidate split under consideration matches a modifier in the modifier corpus, then the project name part of the split is added to a project name corpus (which includes the project name seed list) , and an occurrence frequency value of the modifier is incremented by one in the modifier corpus.
  • Additionally, the unprocessed DG title associated with the candidate split under consideration is re-categorized as a processed DG title.
  • If the modifier part of the unprocessed DG title candidate split under consideration does not match a modifier in the modifier corpus, then it is determined if the project name part of the candidate split matches a project name in the project name corpus. If it does, then the modifier part of the unprocessed DG title candidate split under consideration is added to the modifier corpus and its occurrence frequency value is set to one. Additionally, the unprocessed DG title associated with the candidate split under consideration is re-categorized as a processed DG title.
  • Once the iterations are complete, the modifiers in the modifier corpus with an occurrence frequency count that is less than a prescribed count threshold are removed from the corpus.
  • the second part of the exemplary seed-based splitting procedure is then commenced.
  • the occurrence frequency counts of the modifiers in the modifier corpus are zeroed, and the project name corpus is emptied except for the aforementioned seed project names.
  • all the DG titles are returned to their initial unprocessed categorization.
  • a list of candidate splits between a potential project name part and a potential modifier part of each unprocessed DG title is generated once again. Then, for each unprocessed DG title, one of its candidate splits is chosen for processing, and it is determined if the modifier part of the split matches a modifier in the modifier corpus.
  • If it does, the project name part of the candidate split under consideration is added to the project name corpus and the occurrence frequency count for the matching modifier in the modifier corpus is incremented by one.
  • Additionally, the unprocessed DG title associated with the candidate split under consideration is re-categorized as a processed DG title.
  • the modifier occurrence frequency counts are used in one implementation in a statistical analysis that will be described shortly. It is further noted that at this point in the procedure almost all of the DG titles having a modifier will have that modifier included in the modifier corpus, and its project name will be included in the project name corpus. However, if a DG title is the project name only without any modifiers then the foregoing second part of the procedure will not put its project name into the project name corpus. Thus, in one implementation, for the DG titles that are still categorized as unprocessed, the title is added in its entirety to the project name corpus.
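  • The two-part seed-based splitting procedure described above can be sketched as follows. This is not the pseudocode of Fig. 4, which is not reproduced in this text. It assumes word-level splits, a fixed number of iterations, and that candidate splits are tried from the longest modifier to the shortest; the function names and the default parameter values are hypothetical.

```python
from collections import Counter

def candidate_splits(title):
    """All (project part, modifier part) splits with both parts non-empty."""
    words = title.split()
    return [(" ".join(words[:k]), " ".join(words[k:])) for k in range(1, len(words))]

def seed_based_split(titles, seeds, iterations=3, min_count=2):
    projects = set(seeds)      # project name corpus, initialised with the seed list
    modifiers = Counter()      # modifier corpus with occurrence frequencies
    unprocessed = set(titles)

    # Part 1: known modifiers reveal new project names and vice versa.
    for _ in range(iterations):
        for title in sorted(unprocessed):
            for project, modifier in candidate_splits(title):
                if modifier in modifiers:
                    projects.add(project)
                    modifiers[modifier] += 1
                elif project in projects:
                    modifiers[modifier] = 1
                else:
                    continue
                unprocessed.discard(title)  # title is now processed
                break

    # Trim modifiers whose count is below the threshold.
    kept = {m for m, count in modifiers.items() if count >= min_count}

    # Part 2: reset, then reprocess every title with the trimmed modifier list.
    projects, counts = set(seeds), Counter()
    for title in titles:
        for project, modifier in candidate_splits(title):
            if modifier in kept:
                projects.add(project)
                counts[modifier] += 1
                break
        else:
            projects.add(title)  # still unprocessed: whole title is the name
    return projects, counts
```

  • Starting from the single seed Falcon and the titles "Falcon dev team", "Orion dev team", "Heron dev team", "Orion test" and "Zephyr", the sketch learns the modifier "dev team", discards the low-frequency modifier "test", and ends with Orion, Heron, "Orion test" and Zephyr added to the project name corpus.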
  • the frequencies of all suffixes of length up to L are computed from the enterprise’s DG titles.
  • L is set to 6.
  • Those suffixes which occur with a frequency greater than a prescribed threshold value (e.g., 5) are considered as modifiers and added to a modifiers corpus.
  • the modifier part of the DG is computed as the largest suffix of the title that is present in the modifiers list.
  • the remaining part of the DG title is deemed its project name and added to a project name corpus.
  • Fig. 5 shows an exemplary suffix frequency splitting procedure for identifying project names and modifiers in enterprise DG titles.
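  • A minimal sketch of suffix frequency splitting follows; it is not the pseudocode of Fig. 5, which is not reproduced in this text. It assumes suffixes are measured in words and that a suffix never covers the whole title, so that at least one word remains for the project name. The function name and parameter names are hypothetical, with defaults taken from the values given above (L = 6, threshold = 5).

```python
from collections import Counter

def suffix_frequency_split(titles, max_len=6, min_freq=5):
    """Split each title into (project name, modifier) using frequent word suffixes."""
    tokenized = [t.split() for t in titles]
    freq = Counter()
    for words in tokenized:
        # Count proper suffixes of up to max_len words.
        for n in range(1, min(max_len, len(words) - 1) + 1):
            freq[tuple(words[-n:])] += 1
    modifiers = {suffix for suffix, count in freq.items() if count > min_freq}

    splits = {}
    for title, words in zip(titles, tokenized):
        project, modifier = title, ""          # default: no modifier found
        for n in range(min(max_len, len(words) - 1), 0, -1):  # longest suffix first
            if tuple(words[-n:]) in modifiers:
                project = " ".join(words[:-n])
                modifier = " ".join(words[-n:])
                break
        splits[title] = (project, modifier)
    return splits
```

  • Because the longest matching suffix wins, "Falcon dev team" is split into "Falcon" and "dev team" rather than "Falcon dev" and "team" whenever both suffixes are frequent enough.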
  • a project name list can be readily generated from the project name corpus.
  • the project name-modifier splits output from the foregoing procedures can be further refined by considering statistical frequencies and point-wise mutual information of unigrams and bigrams prior to generating the project name list.
  • This refinement procedure will now be described.
  • Let dg be a DG title with words w_1, w_2, ..., w_N, such that a candidate split has w_1, ..., w_K as the project name and w_{K+1} to w_N as the modifier.
  • a DG title with N words has N such candidates.
  • the refinement procedures use the project and modifier corpus statistics captured in the initial procedures to compute scores for these candidates and then choose the highest scoring one as the winning split.
  • the project name list is then generated from the winning splits.
  • four different refinement procedures have been developed and are each described in the sections to follow. It is noted that although smoothing terms will not be shown in the following equations for the sake of clarity, in one implementation all counts are smoothed using conventional methods.
  • a unigrams refinement procedure (Uni) is employed. Let p_P(w_i) be the probability of w_i in the project name corpus, and p_M(w_i) be the probability of w_i in the modifiers corpus. In the Uni procedure, the score is computed as follows:
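  • The Uni equation itself does not appear in this text. The sketch below shows one plausible form, assuming the score of a split at position K is the product of p_P(w_i) over the project words and p_M(w_i) over the modifier words, computed in log space with add-one smoothing. Both the formula and the smoothing choice are assumptions, as are the function names.

```python
import math
from collections import Counter

def unigram_probs(phrases, vocab_size, alpha=1.0):
    """Return a smoothed unigram probability function estimated from a corpus."""
    counts = Counter(word for phrase in phrases for word in phrase.split())
    total = sum(counts.values())
    return lambda word: (counts[word] + alpha) / (total + alpha * vocab_size)

def best_split_uni(title, p_project, p_modifier):
    """Score all N candidate splits (K = 1..N) and return the winning one."""
    words = title.split()
    best_k, best_score = None, -math.inf
    for k in range(1, len(words) + 1):       # K = N means "no modifier"
        score = sum(math.log(p_project(w)) for w in words[:k])
        score += sum(math.log(p_modifier(w)) for w in words[k:])
        if score > best_score:
            best_k, best_score = k, score
    return " ".join(words[:best_k]), " ".join(words[best_k:])
```

  • With a project name corpus of Falcon, Orion and Blue Heron and a modifier corpus of "dev team", "test team" and "dev", the title "Blue Heron dev team" scores highest at K = 2, giving the project name "Blue Heron" and the modifier "dev team".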
  • a unigrams+bigrams refinement procedure (UniBi) is employed.
  • the UniBi procedure includes a score for both unigrams and consecutive pairs of words.
  • Let p_P(w_i, w_{i+1}) be the probability of the pair (w_i, w_{i+1}) in the project name corpus,
  • and let p_M(w_i, w_{i+1}) be the probability of the pair (w_i, w_{i+1}) in the modifiers corpus.
  • the word pair (w_K, w_{K+1}) denotes the bridge bigram pair (bigram with first word from the project part and the second word from the modifier part) .
  • Let p_B(w_K, w_{K+1}) denote the probability of the bridge bigram in the DG titles, then:
  • a unigrams+unordered bigrams refinement procedure (UniBiU) considers all pairs of words. Let up denote the probability for unordered bigrams. Note that compared to a single bridge bigram in the UniBi procedure, the UniBiU procedure considers multiple bridge bigrams as follows:
  • a point-wise mutual information refinement procedure (PMI) is employed.
  • the project name list generated using the foregoing procedures, with or without refinement is subjected to a cleanup procedure.
  • This cleanup procedure involves identifying potential projects that have the same project name. If a pair of projects having identical project names have no common meeting attendees or DG members, then each project name is designated as identifying a separate project. If, however, a pair of projects having identical project names have common meeting attendees or DG members, then each project name is designated as identifying the same project.
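  • The cleanup rule can be sketched as follows, assuming each DG or meeting has already been reduced to a (project name, member set) pair. The function name merge_same_name_projects is hypothetical.

```python
def merge_same_name_projects(groups):
    """Merge same-named entries that share at least one member; keep others separate.

    groups: list of (project_name, members) pairs, one per DG or meeting.
    """
    merged = []  # list of [name, member set]
    for name, members in groups:
        members = set(members)
        overlapping = [m for m in merged if m[0] == name and m[1] & members]
        for m in overlapping:
            members |= m[1]      # same project: pool the members
            merged.remove(m)
        merged.append([name, members])
    return [(name, members) for name, members in merged]
```

  • Two "Falcon" groups that share a member collapse into one project, while a third "Falcon" group with no common members is kept as a separate project of the same name.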
  • a project name is not considered valid until a project name classifier has classified it as a valid project name.
  • a project name classifier is employed that has been trained to recognize enterprise project names. For each potential project name, the classifier indicates whether the name is a valid enterprise project name or not. The potential project names classified as valid are then designated as enterprise project names. Any conventional yes/no type classifier can be employed for this purpose and trained using a set of features that will now be described.
  • NLP Natural language processing
  • POS part of speech
  • Pattern features: Phrases satisfying specific patterns are more likely to be project names, e.g., “... is a project that aims at ...”.
  • a project name is generally written with a leading capital letter on each word in the name.
  • Projects generally have properties like “deliverable” , “milestone” , and so on associated with them. Thus, if these properties appear in the same document as the potential project name it is more likely the potential project name is an enterprise project name.
  • Modifier pattern features: The project name appears with a modifier pattern that is indicative of an enterprise project name.
  • Project richness feature: A potential project name that is found in many different data sources is more likely to be an enterprise project name than a potential project name found in one or very few sources.
  • the project information extraction implementations described herein can also identify the names of people associated with the enterprise projects that correspond to the extracted project names. This is done using the aforementioned information sources, and the names of the identified people are included in the project information database.
  • people associated with the project corresponding to each extracted enterprise project name are identified using one or more information sources containing enterprise documents. More particularly, referring to the process outlined in Figs. 6A-B, a previously unselected one of the extracted enterprise project names is selected (process action 600) . Enterprise documents that include the selected enterprise project name are then identified (process action 602) . A previously unselected one of the identified documents is selected (process action 604) , and the person or persons who authored the selected document are identified (process action 606) . In addition, each person named in the selected document who did not author the document is identified (process action 608) . It is then determined if there are any of the identified documents that include the selected enterprise project name which have not yet been considered (process action 610) .
  • If so, process actions 604 through 610 are repeated.
  • Once all the identified documents have been considered, the identified person or persons are designated as a candidate member or members of the project corresponding to the currently selected enterprise project name (process action 612) . It is then determined if there are any extracted enterprise project names that have not been considered (process action 614) . If so, process actions 600 through 614 are repeated. Once all the extracted enterprise project names have been considered, the process ends.
  • the action (i.e., process action 608) of identifying each non-authoring person named in the currently selected document involves identifying only the person or persons whose name is closer (e.g., as measured by the number of words before or after) to the currently selected project name than it is to any other enterprise project name found in the document.
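  • A minimal sketch of this proximity rule follows. It assumes the document has been tokenized and that each person and project name is a single token, which is a simplification, and it selects a person if any one mention is strictly closer to the selected project name than to every other project name. The function name is hypothetical.

```python
def people_nearest_to_project(tokens, project, other_projects, people):
    """Return people whose mention is closer to `project` than to any other project."""
    def positions(term):
        return [i for i, token in enumerate(tokens) if token == term]

    def distance(index, term):
        # Word distance to the nearest mention of `term`; infinite if absent.
        return min((abs(index - p) for p in positions(term)), default=float("inf"))

    selected = set()
    for person in people:
        for i in positions(person):
            d_target = distance(i, project)
            d_others = min((distance(i, o) for o in other_projects),
                           default=float("inf"))
            if d_target < d_others:
                selected.add(person)
    return selected
```

  • In the sentence "Ann leads Falcon while Bob and Cy run Orion", Ann and Bob are nearer to Falcon and Cy is nearer to Orion, so only Ann and Bob are identified for Falcon.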
  • the action (i.e., process action 608) of identifying each non-authoring person named in the selected document involves, when the selected project name appears in a table, identifying a person or persons named in the same column or row of that table as the selected project name.
  • people associated with the project corresponding to each extracted enterprise project name are identified using one or more information sources containing enterprise distribution group and meeting information. More particularly, referring to the process outlined in Figs. 7A-B, a previously unselected one of the extracted enterprise project names is selected (process action 700) . Enterprise distribution group and meeting information that include the selected enterprise project name are then identified (process action 702) . A previously unselected one of the identified distribution groups or meetings associated with the identified meeting information is selected (process action 704) . A person or persons who are distribution group members of a currently selected distribution group or meeting attendees of a currently selected meeting are then identified (process action 706) .
  • It is then determined if there are any of the identified distribution groups or meeting information that include the selected enterprise project name but which have not yet been considered (process action 708) . If there is such a distribution group or meeting information, process actions 704 through 708 are repeated. When all the identified distribution groups and meeting information have been considered, the identified person or persons are designated as a candidate member or members of the project corresponding to the currently selected enterprise project name (process action 710) . It is then determined if there are any extracted enterprise project names that have not been considered (process action 712) . If so, process actions 700 through 712 are repeated. Once all the extracted enterprise project names have been considered, the process ends.
  • While the foregoing procedures identify people associated with the projects corresponding to the extracted enterprise project names, they do not address the extent to which a person is involved in a project. For example, some people identified using the foregoing procedure may be only peripherally involved with a project. It is advantageous to know which people associated with a project are principal participants. In view of this, in one implementation, the people identified as being associated with a project corresponding to an extracted project name are ranked based on the degree of their participation in the project.
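  • The loop of process actions 700 through 712 can be sketched as follows, assuming distribution groups and meetings are available as title-to-people mappings and that a title "includes" a project name when it contains it as a case-insensitive substring. The function name and data shapes are hypothetical.

```python
def candidate_members(project_names, groups, meetings):
    """Map each project name to the people in DGs and meetings whose title includes it.

    groups: {dg_title: members}; meetings: {meeting_title: attendees}.
    """
    result = {}
    for name in project_names:                                   # actions 700, 712
        people = set()
        sources = list(groups.items()) + list(meetings.items())
        for title, members in sources:                           # actions 702-708
            if name.lower() in title.lower():
                people.update(members)                           # action 706
        result[name] = people                                    # action 710
    return result
```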
  • each person designated as a member of a project corresponding to an extracted enterprise project name is ranked based on a score derived from various attributes and contributions to the project gleaned from the data sources that referred to that person.
  • a component score is derived from the attributes and contributions contained in each data source that includes a reference to the person being ranked.
  • a component score is derived for each of the following attributes and contributions, or any subset thereof.
  • a component score based on the number of documents authored by the person that includes the name of the project under consideration. In one version, each document contributes equally to the component score. In another version, each document's contribution to the component score is weighted in accordance with how recently the document was created with more recent documents contributing more.
  • An active directory has a hierarchical structure where the internal nodes are distribution group names and the leaves are people.
  • a distribution group g could contain sub-groups g1 and g2, and persons p1, p2, ... p10.
  • sub-groups g1 and g2 could contain persons (some of whom could also be members of the parent distribution group) or further sub-groups, and so on.
  • the active directory can be used as a source to determine if a person is a member of a sub-group of a distribution group associated with the project.
  • a component score based on the number of emails sent by the person to a distribution group associated with the project. In one version, each email contributes equally to the component score. In another version, each email's contribution to the component score is weighted in accordance with how recently the email was sent with more recent emails contributing more.
  • a component score based on the number of check-ins of program code associated with the project that the person made. In one version, each check-in contributes equally to the component score. In another version, each check-in's contribution to the component score is weighted in accordance with how recently the check-in was made with more recent check-ins contributing more.
  • a component score based on the number of meetings associated with the project that the person organized or attended. In one version, each meeting contributes equally to the component score. In another version, each meeting's contribution to the component score is weighted in accordance with how recently the meeting was held with more recent meetings contributing more.
  • a component score based on the number of emails and enterprise social network communications associated with the project sent by the person. In one version, each email or communication contributes equally to the component score. In another version, each email's or communication's contribution to the component score is weighted in accordance with how recently the email or communication was sent with more recent emails or communications contributing more.
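  • The two versions of each count-based component score (equal contribution versus recency weighting) can be sketched as follows. The text does not prescribe a particular recency weighting, so the exponential decay and the `half_life_days` parameter used here are assumptions chosen only for illustration.

```python
from datetime import date


def component_score(item_dates, today, half_life_days=None):
    """Score a list of item dates (documents, emails, check-ins or meetings).

    With half_life_days=None every item contributes 1 (the equal-contribution
    version). Otherwise an item's weight halves every `half_life_days` days,
    so more recent items contribute more (an assumed decay function).
    """
    if half_life_days is None:
        return float(len(item_dates))
    return sum(0.5 ** ((today - d).days / half_life_days) for d in item_dates)
```

  • For example, with a 30-day half-life, an item dated today contributes 1.0 and an item dated 30 days ago contributes 0.5.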
  • the component scores are combined to produce an overall score for each person associated with the project whose name is under consideration.
  • a higher overall score indicates a larger degree of participation in the project and so a higher ranking.
  • Combining the scores can be done in a variety of ways. For example, in one version the raw scores are simply added. In another version, the attributes and contributions involving counting the number of an item are normalized based on the total number of that item before the contribution scores are summed. In yet another version, the contribution scores are normalized among themselves using conventional methods so that the maximum contribution score associated with any one attribute or contribution is no greater than that of any other attribute or contribution.
  • each component score (regardless of how it is computed) is assigned a weight indicative of the probability that the person is a principal participant in a project.
  • a linear weighted combination of a person's component scores is then computed to produce an overall score for that person. More particularly, in one version, the various attributes and contributions associated with a project are each assigned a weight.
  • a person identified as associated with the project is then ranked based on the component scores derived from the various attributes and contributions gleaned from the data sources that referred to the person. More particularly, each component score associated with an attribute or contribution is multiplied by the weight assigned thereto, and the resulting products are summed to produce an overall score for the person. The overall score indicates the person's ranking when compared to the other people associated with the project.
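  • The linear weighted combination described above can be sketched as follows. The attribute names and weight values are hypothetical; the text does not specify them.

```python
def overall_score(component_scores, weights):
    """Multiply each component score by its weight and sum the products."""
    return sum(weights.get(attr, 0.0) * value
               for attr, value in component_scores.items())


def rank_people(scores_by_person, weights):
    """Return people ordered from highest to lowest overall score.

    Ties are broken alphabetically, which is an arbitrary choice.
    """
    totals = {person: overall_score(scores, weights)
              for person, scores in scores_by_person.items()}
    return sorted(totals, key=lambda person: (-totals[person], person))
```

  • An attribute with no assigned weight contributes nothing in this sketch; an implementation could equally treat that as an error.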
  • the aforementioned enterprise data sources (e.g., the Active Directory) often include designations as to the role of a person (such as developer, tester, program manager, scientist, and so on). Knowing the role of a person associated with a project is advantageous. Thus, in one implementation, these role designations are assigned to a person and included in the project information database.
  • ranking people associated with a project by their role designations involves identifying the role of each person found to be associated with a project, and then ranking them in the manner described previously except this time doing it separately for the people within each role.
  • the action of designating the identified person or persons as a candidate member or members of the project corresponding to an enterprise project name includes first selecting a previously unselected one of the people identified as being associated with the project under consideration (process action 800) .
  • the role designation of the selected person is then identified from the aforementioned data sources (process action 802). It is then determined if there are any remaining unselected people identified as being associated with the project under consideration (process action 804). If so, then process actions 800 through 804 are repeated.
  • a previously unselected role is selected (process action 806). Then, each person associated with the project that is assigned the selected role is ranked based on a score derived from various attributes and contributions associated with the project gleaned from the data sources that referred to the person (process action 808), and the people assigned the role under consideration are ordered based on their rankings (process action 810). It is then determined if there are any remaining roles that have not been considered (process action 812). If so, process actions 806 through 812 are repeated. Once the people assigned to each role have been ranked, the process ends.
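  • The per-role ranking of process actions 800 through 812 can be sketched as follows, assuming an overall score has already been computed for each person. The `roles` mapping and the "unknown" fallback role are assumptions made for this illustration.

```python
from collections import defaultdict


def rank_within_roles(overall_scores, roles):
    """Group people by role, then order each group by descending score.

    overall_scores: dict person -> overall score.
    roles: dict person -> role designation; people without one are placed
    under the hypothetical role "unknown".
    """
    by_role = defaultdict(list)
    for person, score in overall_scores.items():      # process actions 800 to 804
        by_role[roles.get(person, "unknown")].append((person, score))
    return {                                          # process actions 806 to 812
        role: [p for p, _ in sorted(members, key=lambda m: (-m[1], m[0]))]
        for role, members in by_role.items()
    }
```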
  • the project information extraction implementations described herein can also extract project-related items and include them in the project information database.
  • One such project-related item is the names of projects that are related to a project under consideration. This is done using the aforementioned information sources.
  • an extracted enterprise project name or names associated with a project or projects that are related to the project under consideration are identified.
  • the identified related project name or names are then added to the project information database entry associated with the project under consideration.
  • Identifying the related project or project names is accomplished, in one version, using enterprise DGs, and in another version, using enterprise meeting information. In yet another version, both DGs and meeting information are used to identify related projects. The following sections will describe first finding related projects using DGs and then finding related projects using meeting information.
  • identifying an extracted enterprise project name or names associated with a project or projects that are related to the project under consideration involves associating with the project under consideration each project having a sub-super distribution group relationship with the distribution group or groups of the project under consideration.
  • identifying an extracted enterprise project name or names associated with a project or projects that are related to the project under consideration involves first identifying meetings having less than a prescribed number of attendees (e.g., less than 20 attendees). It is believed that larger meetings are more likely to be general in nature and not specific to a particular project.
  • a weighted graph is built with nodes representing attendees of the identified meetings and edges connecting each node with the other nodes that each have a weight representing the number of meetings the attendees associated with the nodes connected by the edge have attended together.
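  • The co-attendance graph described above can be sketched as follows. Representing the graph as a counter keyed by alphabetically ordered name pairs is an assumption of this illustration, as is the default cut-off of 20 attendees (taken from the example in the text).

```python
from collections import Counter
from itertools import combinations


def co_attendance_graph(meetings, max_attendees=20):
    """Build edge weights counting the meetings two people attended together.

    meetings: iterable of attendee collections. Only meetings with fewer
    than `max_attendees` attendees are used. Each edge key is a pair
    (a, b) with a < b.
    """
    weights = Counter()
    for attendees in meetings:
        people = sorted(set(attendees))
        if len(people) >= max_attendees:
            continue  # large meetings are assumed to be general in nature
        for a, b in combinations(people, 2):
            weights[(a, b)] += 1
    return weights
```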
  • Next, for each meeting it is determined if the meeting is more likely a project-related meeting or a collaborative meeting. There are several ways to classify a meeting as a project meeting or a collaborative meeting.
  • In one version, it is determined if all the attendees form a clique after edge weight thresholding (for example, the threshold is set to 5, or the threshold is set to a percentage (e.g., 20%) of the average edge weight). If so, the meeting is deemed to be a project meeting. In one version, it is determined if terms indicative of a project meeting (such as “sync”, “daily”, “weekly”, “stand up”, “scrum” to name a few) are found in the meeting title. If so, the meeting is deemed to be a project meeting. In one version, it is determined if terms indicative of the presence of remote attendees are found in places such as the location designation of the meetings.
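  • The clique test and the title-term test for project meetings can be sketched as follows. The sketch assumes edge weights are supplied as a dictionary keyed by alphabetically ordered name pairs, and that an edge survives thresholding when its weight is at least the threshold; the text does not say whether the comparison is strict.

```python
from itertools import combinations

# Terms taken from the examples in the text; the list is not exhaustive.
PROJECT_TERMS = ("sync", "daily", "weekly", "stand up", "scrum")


def forms_clique(attendees, edge_weights, threshold=5):
    """True if every pair of attendees is joined by an edge that survives
    thresholding, i.e. has weight >= threshold (assumed comparison)."""
    people = sorted(set(attendees))
    if len(people) < 2:
        return False
    return all(edge_weights.get((a, b), 0) >= threshold
               for a, b in combinations(people, 2))


def has_project_title(title):
    """True if the meeting title contains a term indicative of a project
    meeting. Plain substring matching is used, so 'sync' also matches
    'asynchronous'; a real system would likely tokenize first."""
    lowered = title.lower()
    return any(term in lowered for term in PROJECT_TERMS)
```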
  • a more formula-based method of identifying a collaborative meeting involves letting a least common ancestor (LCA) in an organization hierarchical tree for all the meeting attendees be a person x levels from the root. If the attendees can be clustered into 2-3 clusters such that the LCA is y levels from the root, then if x-y is greater than a threshold (e.g., 3 or 4) the meeting is deemed to be collaborative.
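  • The least common ancestor computation underlying this method can be sketched as follows, assuming the organization hierarchy is available as a hypothetical `manager_of` mapping from each person to his or her direct manager, with the root having no entry. The sketch only computes the LCA and its number of levels from the root; how the attendees are clustered and how the two depths are compared against the threshold is left as stated in the text.

```python
def chain_to_root(person, manager_of):
    """Return [person, manager, ..., root], assuming the hierarchy has no cycles."""
    chain = [person]
    while chain[-1] in manager_of:
        chain.append(manager_of[chain[-1]])
    return chain


def lca_depth(people, manager_of):
    """Return (lca, depth) for a group of people, with the root at depth 0.

    Returns (None, -1) if the people do not share a common root.
    """
    paths = [chain_to_root(p, manager_of)[::-1] for p in people]  # root first
    lca, depth = None, -1
    for level in zip(*paths):
        if len(set(level)) != 1:
            break  # the paths diverge below the previous level
        lca, depth = level[0], depth + 1
    return lca, depth
```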
  • attendee subgroups are identified using the weighted graph and conventional clustering methods.
  • Each subgroup that has more than one member is then mapped to an extracted project name. This is done by finding common projects of the members of the subgroup and mapping the subgroup to the most tightly fitting of these projects (optionally with >x% of project members in the subgroup).
  • the project name associated with the project mapped to a subgroup is deemed to be a related project.
  • the project information extraction implementations described herein can also find documents related to projects. More particularly, in one implementation, for each project associated with an extracted enterprise project name, a document or documents associated with the project are identified. The identified related document or documents, or links thereto, are then added to the project information database entry associated with the project under consideration.
  • identifying a document or documents associated with the project under consideration involves identifying a document or documents from which the enterprise project name associated with the project under consideration were extracted.
  • identifying a document or documents associated with the project under consideration involves indexing documents found in the information sources associated with the enterprise, searching the index documents with the enterprise project name associated with the project under consideration and associating at least some of the documents returned as search results (e.g., top 10 results) with the project under consideration.
  • the project information extraction implementations described herein can also generate a project timeline. More particularly, in one implementation, for each project associated with an extracted enterprise project name, a timeline for the project is established. The project timeline is then added to the project information database entry associated with the project under consideration.
  • establishing a timeline for a project involves first estimating a start date for the project, where the start date is estimated as the earliest of the creation date of a distribution group associated with the project, the date of the earliest meeting associated with the project and the date of the earliest program code check-in associated with the project. An end date for the project is then estimated if the project has concluded. The end date is estimated as the latest of the date of the last meeting associated with the project, the date of the last program code check-in associated with the project and the latest date a document associated with the project was modified.
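  • The start and end date estimates described above can be sketched as follows. The function signature is hypothetical; each argument is assumed to be a list of dates already collected from the corresponding data source.

```python
from datetime import date


def estimate_timeline(dg_created_dates, meeting_dates, checkin_dates,
                      doc_modified_dates, concluded=False):
    """Return (start, end) date estimates for a project.

    start: earliest of the DG creation, meeting and check-in dates.
    end:   latest of the meeting, check-in and document modification dates,
           estimated only if the project has concluded (otherwise None).
    """
    start_candidates = (list(dg_created_dates) + list(meeting_dates)
                        + list(checkin_dates))
    start = min(start_candidates) if start_candidates else None
    end = None
    if concluded:
        end_candidates = (list(meeting_dates) + list(checkin_dates)
                          + list(doc_modified_dates))
        end = max(end_candidates) if end_candidates else None
    return start, end
```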
  • the aforementioned data sources are then employed to find events associated with the project and the dates they occurred. For example, comments associated with the code check-ins, the meeting titles and meeting notes, the content of related documents and the email content of the emails sent to related distribution groups, among other things, can be used to carve out these events and their respective dates.
  • project-related items can be found in the enterprise data sources and added to the project information database. More particularly, in one implementation, for each project associated with an extracted enterprise project name, project-related items are identified in the enterprise data sources using conventional methods, where these project-related items include at least one of meetings, distribution groups, program code check-ins, emails, enterprise social networking messages, definitions, acronyms, home page, slides, a project description and concept terms associated with the project. The identified project-related items, or links thereto, are then added to the project information database entry associated with the project under consideration.
  • FIG. 9 illustrates a simplified example of a general-purpose computer system with which various aspects and elements of project information extraction, as described herein, may be implemented. It is noted that any boxes that are represented by broken or dashed lines in the simplified computing device 10 shown in FIG. 9 represent alternate implementations of the simplified computing device. As described below, any or all of these alternate implementations may be used in combination with other alternate implementations that are described throughout this document.
  • the simplified computing device 10 is typically found in devices having at least some minimum computational capability such as personal computers (PCs) , server computers, handheld computing devices, laptop or mobile computers, communications devices such as cell phones and personal digital assistants (PDAs) , multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and audio or video media players.
  • the device should have a sufficient computational capability and system memory to enable basic computational operations.
  • the computational capability of the simplified computing device 10 shown in FIG. 9 is generally illustrated by one or more processing unit (s) 12, and may also include one or more graphics processing units (GPUs) 14, either or both in communication with system memory 16.
  • the processing unit (s) 12 of the simplified computing device 10 may be specialized microprocessors (such as a digital signal processor (DSP) , a very long instruction word (VLIW) processor, a field-programmable gate array (FPGA) , or other micro-controller) or can be conventional central processing units (CPUs) having one or more processing cores.
  • the simplified computing device 10 may also include other components, such as, for example, a communications interface 18.
  • the simplified computing device 10 may also include one or more conventional computer input devices 20 (e.g., touchscreens, touch-sensitive surfaces, pointing devices, keyboards, audio input devices, voice or speech-based input and control devices, video input devices, haptic input devices, devices for receiving wired or wireless data transmissions, and the like) or any combination of such devices.
  • the Natural User Interface (NUI) techniques and scenarios enabled by project information extraction include, but are not limited to, interface technologies that allow one or more users to interact in a “natural” manner, free from artificial constraints imposed by input devices such as mice, keyboards, remote controls, and the like.
  • NUI implementations are enabled by the use of various techniques including, but not limited to, using NUI information derived from user speech or vocalizations captured via microphones or other sensors.
  • NUI implementations are also enabled by the use of various techniques including, but not limited to, information derived from a user’s facial expressions and from the positions, motions, or orientations of a user’s hands, fingers, wrists, arms, legs, body, head, eyes, and the like, where such information may be captured using various types of 2D or depth imaging devices such as stereoscopic or time-of-flight camera systems, infrared camera systems, RGB (red, green and blue) camera systems, and the like, or any combination of such devices.
  • NUI implementations include, but are not limited to, NUI information derived from touch and stylus recognition, gesture recognition (both onscreen and adjacent to the screen or display surface) , air or contact-based gestures, user touch (on various surfaces, objects or other users) , hover-based inputs or actions, and the like.
  • NUI implementations may also include, but are not limited to, the use of various predictive machine intelligence processes that evaluate current or past user behaviors, inputs, actions, etc., either alone or in combination with other NUI information, to predict information such as user intentions, desires, and/or goals. Regardless of the type or source of the NUI-based information, such information may then be used to initiate, terminate, or otherwise control or interact with one or more inputs, outputs, actions, or functional features of the project information extraction implementations described herein.
  • NUI scenarios may be further augmented by combining the use of artificial constraints or additional signals with any combination of NUI inputs.
  • Such artificial constraints or additional signals may be imposed or generated by input devices such as mice, keyboards, and remote controls, or by a variety of remote or user worn devices such as accelerometers, electromyography (EMG) sensors for receiving myoelectric signals representative of electrical signals generated by user’s muscles, heart-rate monitors, galvanic skin conduction sensors for measuring user perspiration, wearable or remote biosensors for measuring or otherwise sensing user brain activity or electric fields, wearable or remote biosensors for measuring user body temperature changes or differentials, and the like. Any such information derived from these types of artificial constraints or additional signals may be combined with any one or more NUI inputs to initiate, terminate, or otherwise control or interact with one or more inputs, outputs, actions, or functional features of the project information extraction implementations described herein.
  • the simplified computing device 10 may also include other optional components such as one or more conventional computer output devices 22 (e.g., display device (s) 24, audio output devices, video output devices, devices for transmitting wired or wireless data transmissions, and the like) .
  • typical communications interfaces 18, input devices 20, output devices 22, and storage devices 26 for general-purpose computers are well known to those skilled in the art, and will not be described in detail herein.
  • the simplified computing device 10 shown in FIG. 9 may also include a variety of computer-readable media.
  • Computer-readable media can be any available media that can be accessed by the computer 10 via storage devices 26, and can include both volatile and nonvolatile media that is either removable 28 and/or non-removable 30, for storage of information such as computer-readable or computer-executable instructions, data structures, program modules, or other data.
  • Computer-readable media includes computer storage media and communication media.
  • Computer storage media refers to tangible computer-readable or machine-readable media or storage devices such as digital versatile disks (DVDs) , blu-ray discs (BD) , compact discs (CDs) , floppy disks, tape drives, hard drives, optical drives, solid state memory devices, random access memory (RAM) , read-only memory (ROM) , electrically erasable programmable read-only memory (EEPROM) , CD-ROM or other optical disk storage, smart cards, flash memory (e.g., card, stick, and key drive) , magnetic cassettes, magnetic tapes, magnetic disk storage, magnetic strips, or other magnetic storage devices. Further, a propagated signal is not included within the scope of computer-readable storage media.
  • Retention of information such as computer-readable or computer-executable instructions, data structures, program modules, and the like, can also be accomplished by using any of a variety of the aforementioned communication media (as opposed to computer storage media) to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and can include any wired or wireless information delivery mechanism.
  • The terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media can include wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, radio frequency (RF) , infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves.
  • software, programs, and/or computer program products embodying some or all of the various project information extraction implementations described herein, or portions thereof may be stored, received, transmitted, or read from any desired combination of computer-readable or machine-readable media or storage devices and communication media in the form of computer-executable instructions or other data structures.
  • the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter.
  • The term “article of manufacture” as used herein is intended to encompass a computer program accessible from any computer-readable device, or media.
  • the project information extraction implementations described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device.
  • program modules include routines, programs, objects, components, data structures, and the like, that perform particular tasks or implement particular abstract data types.
  • the project information extraction implementations described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks.
  • program modules may be located in both local and remote computer storage media including media storage devices.
  • the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
  • the functionality described herein can be performed, at least in part, by one or more hardware logic components.
  • illustrative types of hardware logic components include field-programmable gate arrays (FPGAs) , application-specific integrated circuits (ASICs) , application-specific standard products (ASSPs) , system-on-a-chip systems (SOCs) , complex programmable logic devices (CPLDs) , and so on.
  • one or more components may be combined into a single component providing aggregate functionality or divided into several separate sub-components, and any one or more middle layers, such as a management layer, may be provided to communicatively couple to such sub-components in order to provide integrated functionality.
  • Any components described herein may also interact with one or more other components not specifically described herein but generally known by those of skill in the art.
  • a computer-implemented process is employed for generating a project information database for an enterprise that uses a computing device to perform the following process actions.
  • enterprise project names are extracted from information sources associated with an enterprise; then people associated with the project corresponding to each extracted enterprise project name are identified using information sources associated with an enterprise; and a project information database is generated for the enterprise including an entry for each project, where each of the entries includes the extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project.
  • the process action of extracting enterprise project names from information sources associated with the enterprise includes extracting candidate enterprise project names from one or more information sources including enterprise documents, where the extraction includes at least one of, employing a Hearst pattern analysis to identify text strings in the enterprise documents that are potential project names, identifying tables in the enterprise documents having a column or row that includes at least two known enterprise project names, and deeming other names listed in the same column or row as potential project names, and identifying potential project names in document titles; eliminating potential project names that do not appear in the enterprise documents more than a prescribed number of times; employing a project name classifier trained to recognize enterprise project names to classify which of the remaining potential project names are valid enterprise project names; and designating the potential project names classified as valid to be enterprise project names.
  • the process action of extracting enterprise project names from information sources associated with the enterprise includes extracting candidate enterprise project names from one or more information sources including meeting information and distribution group information, where the extraction includes, identifying meetings having less than a prescribed maximum number of attendees and more than one attendee, identifying distribution groups having less than a prescribed maximum number of members and more than one member, eliminating from the identified meetings and distribution groups those meetings or groups having a title which includes a person's name or a term indicative of a person's name, identifying as potential project names those names in the remaining identified meetings and distribution groups that precede or follow a project name modifier term or phrase, identifying projects which have the same identified potential project name, whenever a pair of projects having identical project names have no common meeting attendees or DG members, designating each project name as identifying a separate project, and whenever a pair of projects having identical project names have common meeting attendees or DG members, designating each project name of the pair as identifying the same project; employing a project name classifier trained to recognize enterprise project names to classify which of the potential
  • the process action of identifying people associated with the project corresponding to each extracted enterprise project name using information sources associated with the enterprise includes, for each extracted enterprise project name, identifying people associated with the project corresponding to the enterprise project name from one or more information sources including enterprise documents, where the identification includes, identifying enterprise documents that include the enterprise project name, for each document identified identifying the person or persons who authored the document and identifying each person named in the document who did not author the document, and designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name.
  • the document under consideration names one or more other enterprise project names in addition to the enterprise project name under consideration, and the aforementioned identification of each non-authoring person named in the document, includes identifying a person or persons if the person's name is closer as measured by the number of words before or after to the enterprise project name under consideration than it is to any other enterprise project name found in the document.
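  • The word-distance test described above can be sketched as follows. For simplicity the sketch assumes the document has already been split into words and that person and project names are single words; both are assumptions of this illustration rather than requirements of the described implementations.

```python
def closest_project(words, person, project_names):
    """Return the project name nearest to any mention of `person`.

    Distance is the number of word positions between the two names.
    Returns None if the person or the projects are not mentioned. On a
    tie, the project listed first in `project_names` is returned, which
    is an arbitrary choice.
    """
    person_positions = [i for i, w in enumerate(words) if w == person]
    best = None  # (distance, project name)
    for name in project_names:
        for j, w in enumerate(words):
            if w != name:
                continue
            for i in person_positions:
                distance = abs(i - j)
                if best is None or distance < best[0]:
                    best = (distance, name)
    return best[1] if best else None
```

  • A person would then be designated a candidate member of the project under consideration only if that project is the one returned.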
  • the enterprise project name under consideration is found in a table included in the document under consideration, and the aforementioned identification of each non-authoring person named in the document, includes identifying a person or persons named in the same column or row of the table as the enterprise project name under consideration.
  • the process action of identifying people associated with the project corresponding to each extracted enterprise project name using information sources associated with the enterprise includes, for each extracted enterprise project name, identifying people associated with the project corresponding to the enterprise project name from one or more information sources including distribution groups or meeting information, where the identification includes identifying a distribution group or groups whose information includes the enterprise project name, identifying a meeting or meetings whose meeting information includes the enterprise project name, identifying each person who is a member of the identified distribution group or groups, and identifying each person who is an attendee of the identified meeting or meetings; and designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name.
  • designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name includes, for each identified person, ranking the person based on a score derived from various attributes and contributions associated with the project gleaned from the data sources that referred to the person, and ordering each identified person in the list of the people identified as being associated with the project found in the project information database based on that person's ranking.
  • ranking the person based on the score derived from various attributes and contributions to the project gleaned from the data sources that referred to the person includes assigning a component score based on each of at least one of: a number of documents authored by the person that include the name of the project; or a degree of proximity of the person's name to the project's name in each document that includes both the person's name and the project's name; or the person's name being in the same column or row of a table in a document, or a same list in the document, as the project's name; or the person being a member of a distribution group associated with the project; or the person being a member of a sub-group of a distribution group associated with the project; or the person being a member of a distribution group associated with the project wherein a majority of the members of that distribution group are supervised by the person; or a number of emails sent by the person to a distribution group associated with the project; or a number of check-ins of program code associated with the
  • the various attributes and contributions to the project are each assigned a weight, and ranking the person based on the score derived from various attributes and contributions to the project gleaned from the data sources that referred to the person, includes multiplying each component score associated with an attribute or contribution by the weight assigned to the attribute or contribution, and summing the resulting products to produce the overall score for the person.
  • designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name includes, for each identified person, identifying the person's role within the project, and for each identified role and each person assigned that role in the project ranking the person based on a score derived from various attributes and contributions associated with the project gleaned from the data sources that referred to the person and ordering each person assigned the role under consideration based on that person's ranking.
  • the aforementioned computer-implemented process for generating a project information database for an enterprise further includes a process action of, for each project associated with an extracted enterprise project name, identifying an extracted enterprise project name or names associated with another project or projects that are related to the project under consideration, and the process action of generating a project information database for the enterprise, further includes adding the extracted enterprise project name or names of project or projects that are related to the project under consideration to the project information database entry associated with the project under consideration.
  • the process action of identifying the extracted enterprise project name or names associated with another project or projects that are related to the project under consideration includes associating with the project under consideration each project having a sub-super distribution group relationship with the distribution group or groups of the project under consideration.
  • the process action of identifying the extracted enterprise project name or names associated with another project or projects that are related to the project under consideration includes identifying meetings having less than a prescribed number of attendees, building a weighted graph with nodes representing attendees of the identified meetings and edges connecting each node with the other nodes that each have a weight representing the number of meetings the attendees associated with the nodes connected by the edge have attended together, for each meeting determining if the meeting is a project-related meeting or a collaborative meeting, for each meeting determined to be a collaborative meeting identifying attendee subgroups using the weighted graph and a clustering method, and for each subgroup that has more than one member mapping the subgroup to an extracted project name and deeming the project corresponding to the extracted project name mapped to the subgroup to be a project related to the project under consideration.
  • the aforementioned computer-implemented process for generating a project information database for an enterprise further includes a process action of, for each project associated with an extracted enterprise project name, identifying a document or documents associated with the project under consideration, and the process action of generating a project information database for the enterprise, further includes adding the identified document or documents, or links thereto, to the project information database entry associated with the project under consideration.
  • the process action of identifying a document or documents associated with the project under consideration includes identifying a document or documents from which the enterprise project name associated with the project under consideration was extracted.
  • the process action of identifying a document or documents associated with the project under consideration includes indexing documents found in the information sources associated with the enterprise, searching the indexed documents with the enterprise project name associated with the project under consideration, and associating at least some of the documents returned as search results with the project under consideration.
  • the aforementioned computer-implemented process for generating a project information database for an enterprise further includes a process action of, for each project associated with an extracted enterprise project name, establishing a timeline for the project, and the process action of generating a project information database for the enterprise, further includes adding the timeline established for the project under consideration to the project information database entry associated with the project under consideration.
  • the process action of first establishing a timeline for the project includes estimating a start date for the project, where the start date is estimated as the earliest of, the creation date of a distribution group associated with the project, the date of the earliest meeting associated with the project, and the date of the earliest program code check-in associated with the project.
  • an end date is estimated for the project if the project has concluded, where the end date is estimated as the latest of, the date of the last meeting associated with the project, the date of the last program code check-in associated with the project, and the latest date a document associated with the project was modified.
  • the information sources associated with an enterprise are searched to find events associated with the project and the dates they occurred.
  • the aforementioned computer-implemented process for generating a project information database for an enterprise further includes a process action of, for each project associated with an extracted enterprise project name, identifying project-related items including at least one of meetings, or distribution groups, or program code check-ins, or emails, or enterprise social networking messages, or definitions, or acronyms, or home page, or slides, or a project description, or concept terms associated with the project, and the process action of generating a project information database for the enterprise, further includes adding the identified project-related items, or links thereto, to the project information database entry associated with the project under consideration.
  • a project information database system for an enterprise includes one or more computing devices each including a processor, communication interface and memory. If there are multiple computing devices, they are in communication with each other via a computer network.
  • the system also includes a computer program having program modules executable by the one or more computing devices.
  • the one or more computing devices are directed by the program modules of the computer program to, access information sources associated with an enterprise, extract enterprise project names from the information sources, identify people associated with the project corresponding to each extracted enterprise project name using the information sources, and generate a project information database for the enterprise including an entry for each project, where each of the entries includes the extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project.
  • the computer program includes program modules for receiving a query from a user including terms representing an enterprise project name, or a person associated with an enterprise project, or both, searching the project information database for information corresponding to the queried enterprise project name, or person associated with an enterprise project, or both, and providing the results of the searching to the user.
  • project information extraction involves a step for generating a project information database for an enterprise.
  • project information extraction includes using a computing device to perform the following process actions: an extracting step for extracting enterprise project names from information sources associated with an enterprise; an identifying step for identifying people associated with the project corresponding to each extracted enterprise project name using information sources associated with an enterprise; and a generating step for generating a project information database for the enterprise including an entry for each project, where each of the entries includes the extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project.
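The weighted ranking of candidate project members described in the list above (multiply each component score by its assigned weight, sum the products, and order people by the result) can be illustrated with a minimal Python sketch. The attribute names, weights, and the function names `overall_score` and `rank_people` are hypothetical and chosen only for illustration; the source does not prescribe them.

```python
from typing import Dict, List, Tuple


def overall_score(component_scores: Dict[str, float],
                  weights: Dict[str, float]) -> float:
    """Weighted sum of component scores; attributes with no weight count as 0."""
    return sum(score * weights.get(attribute, 0.0)
               for attribute, score in component_scores.items())


def rank_people(people: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (person, score) pairs ordered from highest to lowest score."""
    scored = [(person, overall_score(scores, weights))
              for person, scores in people.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical attributes and weights, for illustration only.
example_weights = {"docs_authored": 2.0, "dg_member": 1.0}
example_people = {
    "bob": {"docs_authored": 1, "dg_member": 1},
    "alice": {"docs_authored": 3, "dg_member": 1},
}
ranking = rank_people(example_people, example_weights)
```

In this sketch an attribute that has no assigned weight contributes nothing to the score; a real implementation might instead treat that as an error.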

Abstract

Project information extraction implementations are presented that generally extract project information and generate a project information database for an enterprise. This is accomplished by extracting enterprise project names from information sources associated with an enterprise. People associated with the project corresponding to each extracted enterprise project name are identified using information sources associated with the enterprise. In addition, project-related items generated and collected during the course of the project can be identified in the information sources. A project information database is then generated for the enterprise. This database has an entry for each project which includes the extracted enterprise project name associated with the project, a list of the people identified as being associated with the project, and the project-related items or links thereto.

Description

EXTRACTING ENTERPRISE PROJECT INFORMATION
BACKGROUND
An enterprise can generally be defined as an organizational entity, and more particularly as referring to the entirety of the organization including its various units and locations. An enterprise can amass vast quantities of different types of data relating to its operations. For example, this data includes information about the enterprise's various projects including the people that are working on a project and the project-related items generated and collected during the course of the project. This project information is often scattered across a multitude of enterprise data sources.
SUMMARY
The project information extraction implementations described herein generally extract project information and generate a project information database for an enterprise. In one implementation, this is accomplished using a computing device to perform the following process actions. First, enterprise project names are extracted from information sources associated with an enterprise. People associated with the project corresponding to each extracted enterprise project name are identified also using information sources associated with an enterprise. A project information database is then generated for the enterprise. This database has an entry for each project which includes the extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project.
It should be noted that the foregoing Summary is provided to introduce a selection of concepts, in a simplified form, that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Its sole purpose is to present some concepts of the claimed subject matter in a simplified form as a prelude to the more detailed description that is presented below.
DESCRIPTION OF THE DRAWINGS
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
FIG. 1 is a diagram illustrating one implementation, in simplified form, of a project information database system for realizing the project information extraction implementations described herein.
FIG. 2 is a flow diagram illustrating one implementation of a process for extracting project information and generating a project information database for an enterprise.
FIG. 3 shows exemplary pseudocode for performing a Hearst pattern analysis used to identify text strings in enterprise documents that are potential project names.
FIG. 4 shows exemplary pseudocode for performing a seed-based splitting procedure for identifying enterprise project names and modifiers in distribution group (DG) titles.
FIG. 5 shows exemplary pseudocode for performing a suffix frequency splitting procedure for identifying enterprise project names and modifiers in DG titles.
FIGS. 6A-B are a flow diagram illustrating one implementation of a process for identifying people associated with a project corresponding to each extracted enterprise project name using one or more information sources containing enterprise documents.
FIGS. 7A-B are a flow diagram illustrating one implementation of a process for identifying people associated with a project corresponding to each extracted enterprise project name using one or more information sources containing enterprise distribution groups and meeting information.
FIGS. 8A-B are a flow diagram illustrating one implementation of a process for ranking people associated with a project by their role designations.
FIG. 9 is a diagram depicting a general purpose computing device constituting an exemplary system for use with the project information extraction implementations described herein.
DETAILED DESCRIPTION
In the following description reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific versions in which project information extraction implementations can be practiced. It is understood that other implementations can be utilized and structural changes can be made without departing from the scope thereof.
It is also noted that for the sake of clarity specific terminology will be resorted to in describing the project information extraction implementations and it is not intended for these implementations to be limited to the specific terms so chosen. Furthermore, it is to be understood that each specific term includes all its technical equivalents that operate in a broadly similar manner to achieve a similar purpose. Reference herein to “one implementation”, or “another implementation”, or an “exemplary implementation”, or an “alternate implementation” means that a particular feature, a particular structure, or particular characteristics described in connection with the implementation can be included in at least one version of the project information extraction. The appearances of the phrases “in one implementation”, “in another implementation”, “in an exemplary implementation”, and “in an alternate implementation” in various places in the specification are not necessarily all referring to the same implementation, nor are separate or alternative implementations mutually exclusive of other implementations. Yet furthermore, the order of process flow representing one or more implementations of the project information extraction does not inherently indicate any particular order or imply any limitations thereof.
As utilized herein, the terms “component,” “system,” “client” and the like are intended to refer to a computer-related entity, either hardware, software (e.g., in execution), firmware, or a combination thereof. For example, a component can be a process running on a processor, an object, an executable, a program, a function, a library, a subroutine, a computer, or a combination of software and hardware. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and a component can be localized on one computer and/or distributed between two or more computers. The term “processor” is generally understood to refer to a hardware component, such as a processing unit of a computer system.
Furthermore, to the extent that the terms “includes,” “including,” “has,” “contains,” and variants thereof, and other similar words are used in either this detailed description or the claims, these terms are intended to be inclusive in a manner similar to the term “comprising” as an open transition word without precluding any additional or other elements.
1.0 Extracting Project Information From An Enterprise
In general, the project information extraction implementations described herein extract structured information related to project entities in an enterprise. In one exemplary implementation, this information can include, without limitation: a list of projects; the people associated with each project (optionally grouped by roles); related meetings; time line information for the projects; distribution groups related to the projects; related projects; documents associated with the projects; definitions, acronyms, project descriptions; concept terms associated with the projects; program code check-ins; emails; social networking messages; among others. Such information is used for a large variety of applications such as search and recommendations.
The project information extraction implementations described herein can employ many different data sources. Multiple data sources are exploited because each data source can have project information that is unique to that data source. In one exemplary implementation, data sources within the enterprise can include, without limitation: documents (along with information about who modified or viewed a document, and when); an active directory which provides information about employees and the organization hierarchy, and also distribution groups along with the employees which are part of them; meeting information such as attendees, organizer, title, description, time duration, whether recurring or not, and meeting notes if available; social network information for any enterprise-wide social network; emails within the enterprise; program code check-ins with metadata such as user, code files, comments, time of check-in, repository location, codebase directory; among others. In addition, in one implementation, the data sources can also include external sources where users interact with the enterprise, such as, without limitation, news articles, blog articles, non-enterprise documents, related public projects, related people outside the enterprise, other related entities, external communications of the company, and so on.
Fig. 1 shows an exemplary project information database system that can be used to realize the project information extraction implementations described herein. It is noted that project information extraction implementations described herein can vary in the information that is extracted and used to generate the database. Some implementations extract enterprise project names and the people associated with each project for the database, while others extract the project names and people as well as one or more of the project-related items mentioned previously (e.g., related meetings, time line information, distribution groups, related projects, documents, definitions, acronyms, project descriptions, concept terms, program code check-ins, emails, social networking messages, and so on). For the sake of simplicity, the exemplary database system of Fig. 1 shows the extraction and databasing of the aforementioned project-related items collectively and as an option.
One or more computing devices 100 each including a processor, communication interface and memory host various extraction and database generating modules. Whenever there is more than one computing device involved, the computing devices can be in communication with each other via a computer network. In one implementation, the computing devices 100 host a project name extraction module 102, a related-people extraction module 104, and an optional project-related item extraction module 106. It is noted that the optional nature of the project-related item extraction module 106 is indicated by the use of a dashed-line box. Various data sources are in communication with the computing devices 100, and are searchable by the extraction modules 102/104/106. More particularly, these data sources include an enterprise document data source 108, an active directory 110 (which includes information about people associated with the enterprise and distribution group lists, among other things), an enterprise meeting data source 112 and a project-related items data source 114. It is noted that the project-related items data source 114 can include, without limitation, social network information for any enterprise-wide social network, emails generated within the enterprise, program code check-ins with metadata such as user, code files, comments, time of check-in, repository location, codebase directory, as well as external sources such as employee log sheets, external communications of the company, and so on. The extraction modules 102/104/106 are in communication with an enterprise project information database generation module 116. The database generation module 116 generates the enterprise project information database 118 from the information extracted by the extraction modules 102/104/106.
In one implementation, the one or more computing devices 100 execute a computer program having various program modules which direct the extraction modules 102/104/106 and enterprise project information database generation module 116 to perform the following process actions. Referring now to Fig. 2, the computer program directs the aforementioned modules to extract enterprise project names from the information sources (process action 200), and identify people associated with the project corresponding to each extracted enterprise project name using the information sources (process action 202). A project information database is then generated for the enterprise that includes an entry for each project having an extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project (process action 204).
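As a rough illustration of how process actions 200, 202 and 204 fit together, the following Python sketch builds a database keyed by project name. It is only a sketch under stated assumptions: the `ProjectEntry` class, the `build_database` function, and the two callables passed in are hypothetical names, and the actual extraction logic is supplied by the caller rather than shown here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Sequence


@dataclass
class ProjectEntry:
    """One database entry: a project name plus the people associated with it."""
    name: str
    people: List[str] = field(default_factory=list)
    related_items: List[str] = field(default_factory=list)  # optional items or links


def build_database(
    sources: Sequence[str],
    extract_names: Callable[[Sequence[str]], Iterable[str]],
    identify_people: Callable[[str, Sequence[str]], Iterable[str]],
) -> Dict[str, ProjectEntry]:
    """Run the three process actions in order and return the database."""
    database: Dict[str, ProjectEntry] = {}
    for name in extract_names(sources):            # process action 200
        people = list(identify_people(name, sources))  # process action 202
        database[name] = ProjectEntry(name=name, people=people)  # process action 204
    return database


# Stub extractors, for illustration only.
demo_db = build_database(
    ["Project A kickoff notes by alice"],
    extract_names=lambda srcs: ["Project A"],
    identify_people=lambda name, srcs: ["alice"],
)
```

Passing the extractors in as callables keeps the sketch independent of any particular name-extraction or people-identification procedure.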
It is noted that extracting project information from enterprise-related information sources is not simply a matter of looking in the right place. Rather, in the context of the project information extraction implementations described herein, extracting enterprise project information involves finding previously hidden relationships among various information sources and transforming existing information into a form that exposes these hidden relationships. For example, project names, the people that work on a project, and the aforementioned various items generated for a project are often scattered across a variety of information sources with no apparent connection between them. Thus, the project information extraction implementations described herein find these names and items, and the relationships between them, and in one implementation create an enterprise project database that collects the extracted project information for each enterprise project discovered.
Extracting project information from enterprise data has many advantages. It allows the nature and scope of a project to be better understood when viewed in isolation from information associated with other projects. This in turn makes it easier to recommend a project to a new employee. For example, a new employee often faces an information overload problem. A structured view of the projects going on in the company can help him or her get an organized picture of the new environment. Employees working on a new project in a large company also need to identify whether a sub-problem (or a dependency) has been solved by some other team in the company. Knowing the various projects that are going on in an enterprise can help workers and managers identify duplicate work efforts across multiple project teams. In addition, the project information can assist employees in identifying points of contact within the enterprise.
A structured view of the projects also facilitates a more efficient semantic search capability to find information related to a project. In addition, extracting project information from enterprise data rather than exclusively from outside sources overcomes a circumstance where the project names within an enterprise are different from publicly known projects with the same concept. Also, internal project names could mean something completely different in the external world.
The sections to follow will now describe the aforementioned information extraction and database generation in more detail.
1.1 Extracting Potential Project Names From Documents
One source for enterprise project names is in digital documents archived in various electronic memories within the enterprise. These memories were collectively referred to in Fig. 1 as the document data source 108. Various procedures can be employed alone or in any combination to extract these project names from the enterprise's documents. In one implementation, a conventional pattern recognition procedure is employed to identify project names within the enterprise's documents. For example, a Hearst pattern analysis can be employed to identify text strings in the enterprise documents that are potential project names. Fig. 3 outlines such a procedure where the variable NP refers to a noun phrase.
Table expansion is another procedure that can be employed. More particularly, in one implementation, tables are identified in the enterprise documents that have a column or row which includes at least two previously-known enterprise project names. For example, these previously-known names could have been identified using the aforementioned Hearst pattern analysis. The other names listed in the same column or row are then deemed to be potential project names. For example, let P1 and P2 be two previously-known project names. Now, if a particular table in a document contains a column with P1 and P2 and 10 other strings in the same column, those 10 strings are considered to be potential project names as well.
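The table expansion idea can be sketched in Python as follows. This is a minimal illustration, not the implementation described in the figures: the function name `expand_from_table`, the representation of a table as a list of rows of strings, and the `has_header` switch are all assumptions introduced here.

```python
from typing import List, Sequence, Set


def expand_from_table(table: Sequence[Sequence[str]],
                      known: Set[str],
                      min_known: int = 2,
                      has_header: bool = False) -> Set[str]:
    """Return cells that share a row or column with at least `min_known`
    distinct previously-known project names.

    Assumes a rectangular table; zip() truncates ragged rows to the shortest.
    """
    rows: List[List[str]] = [list(row) for row in table]
    if has_header:
        rows = rows[1:]  # do not treat header labels as candidate names
    columns = [list(col) for col in zip(*rows)] if rows else []

    candidates: Set[str] = set()
    for line in rows + columns:
        cells = [cell.strip() for cell in line if cell and cell.strip()]
        known_here = {cell for cell in cells if cell in known}
        if len(known_here) >= min_known:
            candidates.update(cell for cell in cells if cell not in known)
    return candidates


# Hypothetical table: the first column holds project names.
demo_table = [
    ["Project", "Owner"],
    ["P1", "ann"],
    ["P2", "bob"],
    ["Falcon", "cy"],
]
new_names = expand_from_table(demo_table, known={"P1", "P2"}, has_header=True)
```

Here the "Owner" column contributes nothing because it contains no known project names, while "Falcon" is picked up from the column it shares with P1 and P2.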
Still further, in one implementation, potential project names are identified in the titles of the enterprise documents. For example, a variation of a project name-modifier analysis that eliminates modifier words from the titles and deems the remaining words to be a project name can be employed. This project name-modifier analysis will be described in more detail in sections to follow.
Once potential project names have been identified using one or any combination of the foregoing procedures, in one implementation, potential project names that do not appear in the enterprise documents more than a prescribed number of times are eliminated as candidates. In one version, the threshold is set to 10. In another version, the threshold is set to a percentage (e.g., 5%) of the average number of documents per project in the enterprise.
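The frequency-based pruning step can be sketched as below. The text does not say whether "appear" counts occurrences or documents; this sketch assumes it counts the number of documents containing the name, using a case-insensitive substring match. The function name `filter_by_frequency` is hypothetical.

```python
from typing import Iterable, List, Sequence


def filter_by_frequency(candidates: Iterable[str],
                        documents: Sequence[str],
                        threshold: int = 10) -> List[str]:
    """Keep candidates found in MORE than `threshold` documents.

    Assumption: a candidate "appears" in a document if its lowercase form
    is a substring of the lowercase document text.
    """
    lowered_docs = [doc.lower() for doc in documents]
    kept: List[str] = []
    for name in candidates:
        needle = name.lower()
        doc_count = sum(1 for doc in lowered_docs if needle in doc)
        if doc_count > threshold:
            kept.append(name)
    return kept


# Tiny illustration with a threshold of 1 instead of the 10 mentioned above.
demo_docs = ["Falcon design", "falcon plan", "Orion notes"]
survivors = filter_by_frequency(["Falcon", "Orion"], demo_docs, threshold=1)
```

For the percentage-based version, the caller could compute `threshold` from the average number of documents per project before calling the function.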
1.2 Extracting Potential Project Names from Distribution Groups and Meeting Data
Another source for enterprise project names is in distribution group (DG) and meeting titles. Distribution group titles can be found in the active directory 110 and meeting titles can be found in the meeting data source 112 referred to in Fig. 1.
There are several advantages to extracting project names from enterprise DG and meeting titles. For example, DG and meeting titles are relatively less noisy in comparison to documents. In addition, DGs are often exhaustive in nature because enterprises tend to have a distribution group to link all employees working on a project. Still further, DGs are timely because whenever a new project starts, a new DG is usually created.
Various procedures are employed to extract these project names from the enterprise's DG and meeting titles. However, as a preliminary matter the number of titles analyzed can be reduced to decrease processing expenses and speed up the process. In one implementation, this involves identifying meetings having less than a prescribed maximum number of attendees and more than one attendee. The meetings falling outside this attendee range are not considered as they are likely to be more general meetings and not project-specific meetings. In addition, in one implementation, distribution groups having less than a prescribed maximum number of members and more than one member are identified. Here again, distribution groups falling outside this member range are not considered as they are likely not to be project-specific groups. Still further, in one implementation, previously identified meetings and distribution groups that have a title which includes a person's name or a term indicative of a person's name (e.g., first name, last name, full name, pseudonym, nickname, email alias and so on) are eliminated from consideration. Conventional methods are employed to identify these meetings and distribution group titles containing people's names.
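The pre-filtering just described can be sketched in Python. This is an illustration under assumptions: each meeting or distribution group is represented as a (title, size) pair, the set of person-name terms is supplied by the caller, and a simple whole-word match stands in for the conventional name-detection methods the text refers to. The function name `prefilter_titles` is hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def prefilter_titles(groups: Iterable[Tuple[str, int]],
                     max_size: int,
                     person_terms: Set[str]) -> List[str]:
    """Keep titles of groups/meetings with more than one and fewer than
    `max_size` members, whose title contains no person-name term."""
    terms = {term.lower() for term in person_terms}
    kept: List[str] = []
    for title, size in groups:
        if not (1 < size < max_size):
            continue  # too small or too large to be project-specific
        title_words = {word.lower() for word in title.split()}
        if title_words & terms:
            continue  # title mentions a person, e.g. a one-on-one meeting
        kept.append(title)
    return kept


# Hypothetical data for illustration.
demo_groups = [
    ("Project A dev team", 12),
    ("All Hands", 500),
    ("Alice 1:1", 2),
    ("Solo notes", 1),
]
kept_titles = prefilter_titles(demo_groups, max_size=50, person_terms={"Alice"})
```

Whole-word matching will miss aliases embedded in longer tokens, so a production system would likely need a more robust name detector.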
Once a list of meeting and DG titles that potentially include project names has been identified, in one implementation potential project names in the meeting and distribution group titles that are followed by a project name modifier term or phrase are identified. More particularly, distribution group and meeting titles often contain a project name followed by modifier terms. As such, DG and meeting titles can be split into two parts, namely a project name and modifier words. For example, “Project A dev team” consists of the project name “Project A” and modifier words “dev team”. Based on these observations, it is possible to extract project names from DG and meeting titles using a project name-modifier analysis, as well as to generate a project name corpus and a modifier word corpus. In addition, by considering statistical frequencies and point-wise mutual information of unigrams and bigrams, the list of extracted project names can be refined. These procedures will now be described in more detail. However, it is noted that for convenience the procedures described next focus on processing DG titles. The same procedures can be employed using meeting titles in combination with DG titles, or just meeting titles. This latter alternative results in a DG title modifier corpus and a separate meeting title modifier corpus.
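The basic (project name, modifier) split can be illustrated with a short Python sketch. It assumes a modifier corpus is already available as a set of lowercase words and that modifiers only occur as a trailing run of words; the function name `split_title` is hypothetical.

```python
from typing import Set, Tuple


def split_title(title: str, modifiers: Set[str]) -> Tuple[str, str]:
    """Split a title into (project name, trailing modifier words).

    Walks backwards from the end of the title while the words are known
    modifiers; everything before that run is treated as the project name.
    """
    words = title.split()
    cut = len(words)
    while cut > 0 and words[cut - 1].lower() in modifiers:
        cut -= 1
    return " ".join(words[:cut]), " ".join(words[cut:])


# Uses the example title from the text with an assumed modifier corpus.
project, modifier = split_title("Project A dev team", {"dev", "team"})
```

If every word of a title is a known modifier, the returned project name is an empty string, which a caller would normally discard.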
1.2.1 Seed-Based Splitting
Given the assumption that DG titles contain a project name optionally followed by one or more modifier words, procedures are proposed to obtain an initial (project, modifier) split for the enterprise’s DG titles. In one seed-based implementation, this involves starting with a seed set of project names and extracting modifier words related to those projects. The seed set can be a list of project names extracted from documents, or one could also start with a list of known enterprise project names (e.g., 5-10 names). Also, as the modifier corpus gets populated, the procedure leverages these modifiers to extract more project names from the enterprise’s DG titles. Thus, over multiple iterations more projects and modifiers are identified. On convergence or after a fixed number of iterations, the modifier list is trimmed by removing low-frequency words. Using this trimmed modifier list, the DG titles are processed again to obtain the project name corresponding to each DG title.
Fig. 4 outlines an exemplary seed-based splitting procedure for identifying enterprise project names and modifiers in DG titles. This procedure begins with a project name seed list as described previously. In addition, a list of enterprise DG titles is input. For example, in one implementation the list of DG titles comes from the aforementioned Active Directory. The first part of the procedure is iterative. In one version (shown in Fig. 4), this part of the procedure is iterated a prescribed number of times, before moving on to the remainder of the procedure. However, it is noted that in another version, the iterations are repeated until the new project names and/or modifiers discovered in the last iteration fall below a minimum threshold (i.e., have converged). Still further, in another version, the iterations are repeated the prescribed number of times, or until the new project names and/or modifiers discovered in the last iteration fall below the minimum threshold, whichever occurs first.
Regardless of which iteration scheme is used, in each iteration, a list of candidate splits between a potential project name part and a potential modifier part of each unprocessed DG title is generated. In the first iteration, all the DG titles are considered unprocessed, while in subsequent iterations only those DG titles that have not had either a project name or modifier discovered therein are considered unprocessed. The list of split candidates for each unprocessed DG title can be generated in a variety of ways, such as the procedure that will be described shortly in connection with the description of project name list refinement.
For each unprocessed DG title, one of its candidate splits is chosen for processing, and it is determined if the modifier part of the currently chosen split matches a modifier in a modifier corpus. Note that in one version, the modifier corpus initially includes one or more pre-established modifiers (e.g., known project name modifier words and/or phrases used in DG titles). However, in another version, the modifier corpus is initially empty. In either case, it is built up with new modifiers over the course of the iterations as will be apparent shortly. If the modifier part of the unprocessed DG title candidate split under consideration matches a modifier in the modifier corpus, then the project name part of the split is added to a project name corpus (which includes the project name seed list), and an occurrence frequency value of the modifier is incremented by one in the modifier corpus. In addition, the unprocessed DG title associated with the candidate split under consideration is re-categorized as a processed DG title.
If, however, the modifier part of the unprocessed DG title candidate split under consideration does not match a modifier in the modifier corpus, then it is determined if the project name part of the candidate split matches a project name in the project name corpus. If it does, then the modifier part of the unprocessed DG title candidate split under consideration is added to the modifier corpus and its occurrence frequency value is set to one. Additionally, the unprocessed DG title associated with the candidate split under consideration is re-categorized as a processed DG title.
Once all the unprocessed DG titles have been processed as described above, the process is repeated in each subsequent iteration for each DG title that is still categorized as unprocessed, albeit with more project names and modifiers in the corpora.
When the number of iterations reaches the prescribed number, or the iterations converge as described previously, the modifiers in the modifier corpus with an occurrence frequency count that is less than a prescribed count threshold are removed from the corpus. The second part of the exemplary seed-based splitting procedure is then commenced.
To begin the second part of the procedure, the occurrence frequency counts of the modifiers in the modifier corpus are zeroed, and the project name corpus is emptied except for the aforementioned seed project names. In addition, all the DG titles are returned to their initial unprocessed categorization. A list of candidate splits between a potential project name part and a potential modifier part of each unprocessed DG title is generated once again. Then, for each unprocessed DG title, one of its candidate splits is chosen for processing, and it is determined if the modifier part of the split matches a modifier in the modifier corpus. If it does, then the project name part of the candidate split under consideration is added to the project name corpus and the occurrence frequency count for the matching modifier in the modifier corpus is incremented by one. In addition, the unprocessed DG title associated with the candidate split under consideration is re-categorized as a processed DG title.
It is noted that the modifier occurrence frequency counts are used in one implementation in a statistical analysis that will be described shortly. It is further noted that at this point in the procedure almost all of the DG titles having a modifier will have that modifier included in the modifier corpus, and its project name will be included in the project name corpus. However, if a DG title is the project name only without any modifiers then the foregoing second part of the procedure will not put its project name into the project name corpus. Thus, in one implementation, for the DG titles that are still categorized as unprocessed, the title is added in its entirety to the project name corpus.
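The two-part seed-based procedure described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the procedure of Fig. 4 itself: titles are lowercased and split on whitespace, candidate splits are tried from the shortest project part to the longest, only splits with a non-empty modifier are considered in the matching steps, and the function and parameter names are hypothetical.

```python
from collections import Counter

def candidate_splits(title):
    """All (project, modifier) splits with non-empty parts (assumption)."""
    words = title.lower().split()
    return [(" ".join(words[:k]), " ".join(words[k:])) for k in range(1, len(words))]

def seed_based_split(titles, seed_projects, iterations=3, min_count=2):
    projects = {p.lower() for p in seed_projects}
    modifiers = Counter()          # modifier -> occurrence frequency
    unprocessed = list(titles)

    # Part 1: iteratively grow the project name and modifier corpora.
    for _ in range(iterations):
        still_unprocessed = []
        for title in unprocessed:
            for proj, mod in candidate_splits(title):
                if mod in modifiers:        # known modifier: learn the project name
                    projects.add(proj)
                    modifiers[mod] += 1
                    break
                if proj in projects:        # known project: learn the modifier
                    modifiers[mod] = 1
                    break
            else:
                still_unprocessed.append(title)
        if len(still_unprocessed) == len(unprocessed):
            break                           # nothing new discovered: converged
        unprocessed = still_unprocessed

    # Trim modifiers whose count is below the prescribed threshold.
    kept = {m for m, c in modifiers.items() if c >= min_count}

    # Part 2: re-split every title using only the trimmed modifier list.
    result = {}
    for title in titles:
        for proj, mod in candidate_splits(title):
            if mod in kept:
                result[title] = (proj, mod)
                break
        else:
            # No modifier found: the whole title is taken as the project name.
            result[title] = (title.lower(), "")
    return result
```

For brevity the sketch returns only the final split per title; re-accumulating the modifier frequency counts in the second part, as the text describes, would be a small addition.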
1.2.2 Suffix Frequency Splitting
In an alternate procedure to obtain a project name-modifier split for the enterprise’s DG titles, the frequencies of all suffixes of length up to L are computed from the enterprise’s DG titles. In one version, L is set to 6. Those suffixes which occur with a frequency greater than a prescribed threshold value (e.g., 5) are considered as modifiers and added to a modifier corpus. Then, for every DG title, the modifier part of the DG title is computed as the longest suffix of the title that is present in the modifier corpus. The remaining part of the DG title is deemed its project name and added to a project name corpus. Fig. 5 shows an exemplary suffix frequency splitting procedure for identifying project names and modifiers in enterprise DG titles.
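A minimal sketch of the suffix frequency splitting just described is given below. It assumes that suffix length is measured in whitespace-separated words (the text does not say whether L counts words or characters), that only proper suffixes are counted so the project part is never empty, and that titles are lowercased; the names are hypothetical.

```python
from collections import Counter

def suffix_frequency_split(titles, max_len=6, min_freq=5):
    tokenized = [t.lower().split() for t in titles]

    # Count every proper suffix of up to max_len words.
    counts = Counter()
    for words in tokenized:
        for n in range(1, min(max_len, len(words) - 1) + 1):
            counts[" ".join(words[-n:])] += 1

    # Suffixes occurring more often than the threshold are treated as modifiers.
    modifiers = {s for s, c in counts.items() if c > min_freq}

    splits = {}
    for title, words in zip(titles, tokenized):
        project, modifier = " ".join(words), ""
        # Try the longest suffix first.
        for n in range(min(max_len, len(words) - 1), 0, -1):
            suffix = " ".join(words[-n:])
            if suffix in modifiers:
                project, modifier = " ".join(words[:-n]), suffix
                break
        splits[title] = (project, modifier)
    return splits
```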
1.2.3 Project Name List Refinement
As the output of the foregoing procedures is a project name-modifier split for each enterprise DG title considered, a project name list can be readily generated from the project name corpus. However, as indicated previously, in other implementations, the project name-modifier splits output from the foregoing procedures can be further refined by considering statistical frequencies and point-wise mutual information of unigrams and bigrams prior to generating the project name list. One exemplary implementation of this refinement procedure will now be described.
Let dg be a DG title with words w1, w2, … , wN, such that a candidate split has w1, … , wK as the project name and wK+1 to wN as the modifier. Clearly, a DG title with N words has N such candidates. The refinement procedures use the project and modifier corpus statistics captured in the initial procedures to compute scores for these candidates and then choose the highest scoring one as the winning split. The project name list is then generated from the winning splits. In particular, four different refinement procedures have been developed and are each described in the sections to follow. It is noted that although smoothing terms will not be shown in the following equations for the sake of clarity, in one implementation all counts are smoothed using conventional methods.
1.2.3.1 Unigrams Refinement Procedure
In one implementation, a unigrams refinement procedure (Uni) is employed. Let pP (wi) be the probability of wi in the project name corpus, and pM (wi) be the probability of wi in the modifiers corpus. In the Uni procedure, the score is computed as follows:
[Equation image PCTCN2015082341-appb-000001 (Uni score); not reproduced in this text]
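The Uni score equation is an image that is not reproduced in this text. The sketch below therefore uses an assumed form, not the published formula: the score of a candidate split is taken as the product of pP(wi) over the project name words and pM(wi) over the modifier words, computed in log space. Add-one smoothing stands in for the conventional smoothing the text mentions, and all names are hypothetical.

```python
import math
from collections import Counter

def unigram_probs(phrases, vocab_size, alpha=1.0):
    """Smoothed unigram probability function for a corpus of phrases."""
    counts = Counter(w for p in phrases for w in p.split())
    total = sum(counts.values())
    return lambda w: (counts[w] + alpha) / (total + alpha * vocab_size)

def best_split_uni(title, p_proj, p_mod):
    """Score all N candidate splits of a title and return the highest scoring one."""
    words = title.lower().split()
    best = None
    for k in range(1, len(words) + 1):      # w1..wK project, wK+1..wN modifier
        score = (sum(math.log(p_proj(w)) for w in words[:k])
                 + sum(math.log(p_mod(w)) for w in words[k:]))
        if best is None or score > best[0]:
            best = (score, " ".join(words[:k]), " ".join(words[k:]))
    return best[1], best[2]
```

The probability functions would be built from the project name corpus and the modifier corpus produced by one of the initial splitting procedures.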
1.2.3.2 Unigrams+Bigrams Refinement Procedure
In one implementation, a unigrams+bigrams refinement procedure (UniBi) is employed. The UniBi procedure includes a score for both unigrams and consecutive pairs of words. Let pP (wi, wi+1) be the probability of the pair (wi, wi+1) in the project name corpus, and pM (wi, wi+1) be the probability of the pair (wi, wi+1) in the modifiers corpus. The word pair (wK, wK+1) denotes the bridge bigram pair (bigram with first word from the project part and the second word from the modifier part). Let pB (wK, wK+1) denote the probability of the bridge bigram in the DG titles, then:
[Equation image PCTCN2015082341-appb-000002 (UniBi score); not reproduced in this text]
1.2.3.3 Unigrams+Unordered Bigrams Refinement Procedure
Rather than considering just consecutive word pairs, in one implementation a unigrams+unordered bigrams refinement procedure (UniBiU) considers all pairs of words. Let up denote the probability for unordered bigrams. Note that compared to a single bridge bigram in the UniBi procedure, the UniBiU procedure considers multiple bridge bigrams as follows:
[Equation image PCTCN2015082341-appb-000003 (UniBiU score); not reproduced in this text]
1.2.3.4 Point-Wise Mutual Information Refinement Procedure
In one implementation, a point-wise mutual information refinement procedure (PMI) is employed. In this procedure, the score of a DG title is computed as the sum of the average PMI for its project name words and the average PMI for its modifier words minus the average PMI for the bridge word pairs. Thus:
[Equation image PCTCN2015082341-appb-000004 (PMI score); not reproduced in this text]
1.2.4 Project Name List Cleanup
In one implementation, the project name list generated using the foregoing procedures, with or without refinement, is subjected to a cleanup procedure. This cleanup procedure involves identifying potential projects that have the same project name. If a pair of projects having identical project names have no common meeting attendees or DG members, then each project name is designated as identifying a separate project. If, however, a pair of projects having identical project names have common meeting attendees or DG members, then each project name is designated as identifying the same project.
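The cleanup rule above can be illustrated with the following sketch, which merges same-named entries that share at least one meeting attendee or DG member and keeps the others separate. The (name, people) input layout and the function name are assumptions.

```python
def resolve_same_name(entries):
    """entries: list of (project_name, people) pairs, people being an iterable of ids.

    Entries with identical names and at least one person in common are treated
    as the same project; entries with identical names but disjoint people stay
    separate projects.
    """
    merged = []
    for name, people in entries:
        group = set(people)
        rest = []
        for other_name, other_people in merged:
            if other_name == name and other_people & group:
                group |= other_people       # same project: pool the people
            else:
                rest.append((other_name, other_people))
        merged = rest + [(name, group)]
    return merged
```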
1.2.5 Project Name Classifier
The foregoing procedures result, among other things, in a project name list. However, in one implementation, a project name is not considered valid until a project name classifier has classified it as a valid project name. More particularly, in one implementation, a project name classifier is employed that has been trained to recognize enterprise project names. For each potential project name, the classifier indicates whether the name is a valid enterprise project name or not. The potential project names classified as valid are then designated as enterprise project names. Any conventional yes/no type classifier can be employed for this purpose and trained using a set of features that will now be described.
Any one or any combination of the following features can be used to train the aforementioned project name classifier.
a) Natural language processing (NLP) features such as part of speech (POS) : project names are generally noun phrases.
b) Pattern features: Phrases satisfying some specific patterns are more likely to be projects, e.g. “… is a project that aims at …”.
c) Data source features: Noun phrases, general document contents, email contents, and so on where each source is assigned a probability score based on the likelihood a project name would be found therein.
d) Project Attribute Features: number of meetings, number of DGs, number of related people, number of sub-projects associated with the potential project name.
e) Dictionary features: Phrases that are seldom found on the Web (such as in a query log), but are found frequently in enterprise documents of a team, have a higher probability of being enterprise project names.
f) Structure features: If there are other known project names appearing in the same list or the same table as the potential project name, there is a higher probability the potential project name is an enterprise project name.
g) Statistical features: Frequency of the potential project name, e.g., how many times it appears in document or email titles associated with the same team normalized by the total number of documents or emails.
h) Symbol features: A project name is generally written with a capitalized leading letter for each word in the name.
i) Conceptualization features: Enterprises often have area priors (such as computer science, academic institutions like universities, pharma institutions, auto manufacturers, insurance companies, and so on), so a close concept distance between a potential project name and an enterprise prior indicates the name is an enterprise project name.
j) Embedding features: Conventional word embeddings, in which words are represented by vectors and similar words exhibit closer vector distances, are used to compare a potential project name and known enterprise project names.
k) Keyword features: Projects generally have properties like “deliverable”, “milestone”, and so on associated with them. Thus, if these properties appear in the same document as the potential project name it is more likely the potential project name is an enterprise project name.
l) Modifier pattern features: The project name appears with a modifier pattern that is indicative of an enterprise project name.
m) Project Richness feature: A potential project name that is found in many different data sources is more likely to be an enterprise project name than a potential project name found in one or very few sources.
1.3 Linking Projects With People Using Documents
The project information extraction implementations described herein can also identify the names of people associated with the enterprise projects that correspond to the extracted project names. This is done using the aforementioned information sources, and the names of the identified people are included in the project information database.
In one implementation, people associated with the project corresponding to each extracted enterprise project name are identified using one or more information sources containing enterprise documents. More particularly, referring to the process outlined in Figs. 6A-B, a previously unselected one of the extracted enterprise project names is selected (process action 600). Enterprise documents that include the selected enterprise project name are then identified (process action 602). A previously unselected one of the identified documents is selected (process action 604), and the person or persons who authored the selected document are identified (process action 606). In addition, each person named in the selected document who did not author the document is identified (process action 608). It is then determined if there are any of the identified documents that include the selected enterprise project name which have not yet been considered (process action 610). If there are such documents, process actions 604 through 610 are repeated. When all the identified documents have been considered, the identified person or persons are designated as a candidate member or members of the project corresponding to the currently selected enterprise project name (process action 612). It is determined if there are any extracted enterprise project names that have not been considered (process action 614). If so, process actions 600 through 614 are repeated. Once all the extracted enterprise project names have been considered, the process ends.
In one version of the foregoing process, if a selected document names one or more of the other currently unselected extracted enterprise project names in addition to the selected project name, then the action (i.e., process action 608) of identifying each non-authoring person named in the currently selected document involves identifying only the person or persons whose name is closer (e.g., as measured by the number of words before or after) to the currently selected project name than it is to any other enterprise project name found in the document.
Further, in one version of the foregoing process, if the currently selected project name is found in a table included in the currently selected document, then the action (i.e., process action 608) of identifying each non-authoring person named in the selected document involves identifying a person or persons named in the same column or row of the table as the selected project name.
1.4 Linking Projects With People Using Distribution Groups and Meeting Information
In one implementation, people associated with the project corresponding to each extracted enterprise project name are identified using one or more information sources containing enterprise distribution group and meeting information. More particularly, referring to the process outlined in Figs. 7A-B, a previously unselected one of the extracted enterprise project names is selected (process action 700). Enterprise distribution group and meeting information that include the selected enterprise project name are then identified (process action 702). A previously unselected one of the identified distribution groups or meetings associated with the identified meeting information is selected (process action 704). A person or persons who are distribution group members of a currently selected distribution group or meeting attendees of a currently selected meeting are then identified (process action 706). It is then determined if there are any of the identified distribution groups or meeting information that include the selected enterprise project name but which have not yet been considered (process action 708). If there is such a distribution group or meeting information, process actions 704 through 708 are repeated. When all the identified distribution groups and meeting information have been considered, the identified person or persons are designated as a candidate member or members of the project corresponding to the currently selected enterprise project name (process action 710). It is then determined if there are any extracted enterprise project names that have not been considered (process action 712). If so, process actions 700 through 712 are repeated. Once all the extracted enterprise project names have been considered, the process ends.
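The word-distance version of process action 608 described above can be sketched as follows. For simplicity the sketch assumes the document has already been tokenized and that each person name and project name occupies a single token; a real implementation would need multi-word name matching. All names are hypothetical.

```python
def people_nearest_to(project, tokens, people, projects):
    """Return the people whose name is closer to `project` than to any other project.

    tokens   : the document as a list of word tokens
    people   : set of person-name tokens (single-token simplification)
    projects : set of project-name tokens appearing in the enterprise list
    """
    positions = {p: [i for i, t in enumerate(tokens) if t == p] for p in projects}
    if not positions.get(project):
        return set()

    def distance(i, p):
        # Number of words between position i and the nearest mention of p.
        return min((abs(i - j) for j in positions[p]), default=float("inf"))

    result = set()
    for i, token in enumerate(tokens):
        if token in people:
            d = distance(i, project)
            if all(d < distance(i, q) for q in projects if q != project):
                result.add(token)
    return result
```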
1.5 Ranking People Associated With A Project
While the foregoing procedures identify people associated with the projects corresponding to the extracted enterprise project names, they do not address the extent to which a person is involved in a project. For example, some people identified using the foregoing procedure may be only peripherally involved with a project. It is advantageous to know which people associated with a project are principal participants. In view of this, in one implementation, the people identified as being associated with a project corresponding to an extracted project name are ranked based on the degree of their participation in the project.
In general, the previously described data sources are used to derive scores for each person identified as being involved with a project corresponding to an extracted project name. In one implementation, each person designated as a member of a project corresponding to an extracted enterprise project name is ranked based on a score derived from various attributes and contributions to the project gleaned from the data sources that referred to that person. In one version, a component score is derived from the attributes and contributions contained in each data source that includes a reference to the person being ranked.
In one implementation, a component score is derived for each of the following attributes and contributions, or any subset thereof.
a) A component score based on the number of documents authored by the person that includes the name of the project under consideration. In one version, each document contributes equally to the component score. In another version, each document's contribution to the component score is weighted in accordance with how recently the document was created with more recent documents contributing more.
b) A component score based on the degree of proximity of the person's name to the project's name (e.g., measured by the number of words between the two) in each document that includes both the person's name and the project's name.
c) A component score based on the person's name being in the same column or row of a table in a document, or a same list in the document, as the project's name.
d) A component score based on the person being a member of a distribution group associated with the project.
e) A component score based on the person being a member of a sub-group of a distribution group associated with the project. It is noted that the enterprise's aforementioned active directory can be used to identify distribution groups and sub-groups, and the people associated with them. An active directory has a hierarchical structure where the internal nodes are distribution group names and the leaves are people. For example, a distribution group g could contain sub-groups g1 and g2, and persons p1, p2, … p10. In turn, sub-groups g1 and g2 could contain persons (some of whom could also be members of the parent distribution group) or further sub-groups, and so on. Thus, the active directory can be used as a source to determine if a person is a member of a sub-group of a distribution group associated with the project.
f) A component score based on the person being a member of a distribution group associated with the project wherein a majority of the members of that distribution group are supervised by the person.
g) A component score based on the number of emails sent by the person to a distribution group associated with the project. In one version, each email contributes equally to the component score. In another version, each email's contribution to the component score is weighted in accordance with how recently the email was sent with more recent emails contributing more.
h) A component score based on the number of check-ins of program code associated with the project that the person made. In one version, each check-in contributes equally to the component score. In another version, each check-in's contribution to the component score is weighted in accordance with how recently the check-in was made with more recent check-ins contributing more.
i) A component score based on the number of meetings associated with the project that the person organized or attended. In one version, each meeting contributes equally to the component score. In another version, each meeting's contribution to the component score is weighted in accordance with how recently the meeting was held with more recent meetings contributing more.
j) A component score based on, for each meeting associated with the project, the number of sentences attributed to the person in meeting notes.
k) A component score based on the number of emails and enterprise social network communications associated with the project sent by the person. In one version, each email or communication contributes equally to the component score. In another version, each email's or communication's contribution to the component score is weighted in accordance with how recently the email or communication was sent with more recent emails or communications contributing more.
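Several of the component scores above have a version in which more recent items contribute more. The text does not specify the weighting function; the sketch below assumes an exponential decay with a configurable half-life purely as an example, and its names are hypothetical.

```python
from datetime import date

def recency_weighted_count(event_dates, today, half_life_days=90.0):
    """Sum of per-item weights, where an item loses half its weight every
    half_life_days (assumed decay; the source only requires that more recent
    items contribute more)."""
    return sum(0.5 ** ((today - d).days / half_life_days) for d in event_dates)
```

With equal weighting, the same component score would simply be the number of items.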
The component scores are combined to produce an overall score for each person associated with the project whose name is under consideration. When compared with the other people identified as being associated with a project corresponding to an extracted enterprise project name, a higher overall score indicates a larger degree of participation in the project and so a higher ranking. Combining the scores can be done in a variety of ways. For example, in one version the raw scores are simply added. In another version, the attributes and contributions involving counting the number of an item are normalized based on the total number of that item before the component scores are summed. In yet another version, the component scores are normalized among themselves using conventional methods so that the maximum component score associated with any one attribute or contribution is no more than that of any other.
However, the foregoing combination schemes do not take into consideration that some attributes and contributions are more indicative of a person being a principal participant in a project, than others. For example, a member of a DG associated with a project is more likely to be a principal participant than a member of a sub-group of that DG. Thus, in one version, each component score (regardless of how it is computed) is assigned a weight indicative of the probability that the person is a principal participant in a project. A linear weighted combination of a person's component scores is then computed to produce an overall score for that person. More particularly, in one version, the various attributes and contributions associated with a project are each assigned a weight. A person identified as associated with the project is then ranked based on the component scores derived from the various attributes and contributions gleaned from the data sources that referred to the person. More particularly, each component score associated with an attribute or contribution is multiplied by the weight assigned thereto, and the resulting products are summed to produce an overall score for the person. The overall score indicates the person's ranking when compared to the other people associated with the project.
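The linear weighted combination described above can be sketched as follows. The attribute keys and weight values in the usage example are invented for illustration; the text does not give specific weights.

```python
def rank_people(component_scores, weights):
    """component_scores: person -> {attribute: component score}
    weights          : attribute -> weight
    Returns (person, overall score) pairs, highest score first."""
    overall = {
        person: sum(weights.get(attr, 0.0) * score for attr, score in comps.items())
        for person, comps in component_scores.items()
    }
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)
```

For example, giving DG membership a larger weight than sub-group membership reflects the observation that a DG member is more likely to be a principal participant than a member of one of its sub-groups.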
1.5.1 Ranking People Associated With A Project According To Their Role Assignment
The aforementioned enterprise data sources (e.g., the Active Directory) often include designations as to the role of a person (such as developer, tester, program manager, scientist, and so on) . Knowing the role of a person associated with a project is advantageous. Thus, in one implementation, these role designations are assigned to a person and included in the project information database.
It is also advantageous to rank people associated with a project by their role designations. Thus, for example, developers associated with a project would be ranked based on their degree of participation. Knowing this, a user can consult the database to find the top developers for a project.
In general, ranking people associated with a project by their role designations involves identifying the role of each person found to be associated with a project, and then ranking them in the manner described previously except this time doing it separately for the people within each role.
More particularly, referring to Figs. 8A-B, in one implementation the action of designating the identified person or persons as a candidate member or members of the project corresponding to an enterprise project name includes first selecting a previously unselected one of the people identified as being associated with the project under consideration (process action 800). The role designation of the selected person is then identified from the aforementioned data sources (process action 802). It is then determined if there are any remaining unselected people identified as being associated with the project under consideration (process action 804). If so, then process actions 800 through 804 are repeated. Once roles have been identified for the people associated with the project under consideration, a previously unselected role is selected (process action 806). Then, each person associated with the project that is assigned the selected role is ranked based on a score derived from various attributes and contributions associated with the project gleaned from the data sources that referred to the person (process action 808), and the people assigned the role under consideration are ordered based on their rankings (process action 810). It is then determined if there are any remaining roles that have not been considered (process action 812). If so, process actions 806 through 812 are repeated. Once the people assigned to each role have been ranked, the process ends.
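A compact sketch of the per-role ranking follows. It assumes overall scores have already been computed for each person as described previously and that role designations are available as a simple mapping; the "unknown" fallback role and the names are assumptions.

```python
from collections import defaultdict

def rank_by_role(overall_scores, roles):
    """overall_scores: person -> overall score for the project
    roles          : person -> role designation
    Returns role -> list of people ordered from highest to lowest score."""
    by_role = defaultdict(list)
    for person, score in overall_scores.items():
        by_role[roles.get(person, "unknown")].append((person, score))
    return {
        role: [p for p, _ in sorted(members, key=lambda ps: ps[1], reverse=True)]
        for role, members in by_role.items()
    }
```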
1.6 Finding The Names Of Related Projects
As indicated previously, the project information extraction implementations described herein can also extract project-related items and include them in the project information database. One such project-related item is the names of projects that are related to a project under consideration. This is done using the aforementioned information sources.
More particularly, in one implementation, for each project associated with an extracted enterprise project name, an extracted enterprise project name or names associated with a project or projects that are related to the project under consideration are identified. The identified related project name or names are then added to the project information database entry associated with the project under consideration.
Identifying the related project or project names is accomplished, in one version, using enterprise DGs, and in another version, using enterprise meeting information. In yet another version, both DGs and meeting information are used to identify related projects. The following sections will describe first finding related projects using DGs and then finding related projects using meeting information.
1.6.1 Finding The Names Of Related Projects Using Enterprise Distribution Groups
Two projects are considered related if their corresponding distribution lists have a sub-super distribution group relationship. Thus, in one version, identifying an extracted enterprise project name or names associated with a project or projects that are related to the project under consideration involves associating with the project under consideration each project having a sub-super distribution group relationship with the distribution group or groups of the project under consideration.
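The sub-super relationship test can be sketched as below. The sketch assumes a direct parent/child relationship between distribution groups is what is meant (the text does not say whether deeper nesting also counts) and that the directory hierarchy is available as a mapping from each group to its direct sub-groups; all names are hypothetical.

```python
def related_projects(project_dgs, subgroups_of):
    """project_dgs : project name -> set of DG names associated with it
    subgroups_of: DG name -> set of its direct sub-group DG names
    Returns project name -> set of related project names."""
    related = {p: set() for p in project_dgs}
    for p, p_groups in project_dgs.items():
        for q, q_groups in project_dgs.items():
            if p == q:
                continue
            # q is related to p if one of q's DGs is a sub-group of one of p's DGs.
            if any(subgroups_of.get(g, set()) & q_groups for g in p_groups):
                related[p].add(q)
                related[q].add(p)
    return related
```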
1.6.2 Finding The Names Of Related Projects Using Enterprise Meeting Information
In one version, identifying an extracted enterprise project name or names associated with a project or projects that are related to the project under consideration involves first identifying meetings having less than a prescribed number of attendees (e.g., less than 20 attendees). It is believed that larger meetings are more likely to be general in nature and not specific to a particular project. Once the meetings have been identified, a weighted graph is built with nodes representing attendees of the identified meetings and edges connecting the nodes, where each edge has a weight representing the number of meetings that the attendees associated with the nodes it connects have attended together. Next, for each meeting it is determined if the meeting is more likely a project-related meeting or a collaborative meeting. There are several ways to classify a meeting as a project meeting or a collaborative meeting. In one version, it is determined if all the attendees form a clique after edge weight thresholding (for example, the threshold is set to 5, or the threshold is set to a percentage (e.g., 20%) of the average edge weight). If so, the meeting is deemed to be a project meeting. In one version, it is determined if terms indicative of a project meeting (such as “sync”, “daily”, “weekly”, “stand up”, “scrum” to name a few) are found in the meeting title. If so, the meeting is deemed to be a project meeting. In one version, it is determined if terms indicative of the presence of remote attendees are found in places such as the location designation of the meeting. If so, the meeting is deemed to be a collaborative meeting. Other examples of indicators of a collaborative meeting include the location of the meeting not being specified or being designated as a conference call; the attendees being known to reside in places that are far apart; and the meeting time being outside of normal office working hours for a majority of the attendees.
A more formula-based method of identifying a collaborative meeting involves letting a least common ancestor (LCA) in an organization hierarchical tree for all the meeting attendees be a person x levels from the root. If the attendees can be clustered into 2-3 clusters such that the LCA is y levels from the root, then if x-y is greater than a threshold (e.g., 3 or 4) the meeting is deemed to be collaborative. In a graph-based approach, if the density of a graph such as described previously is less than a threshold (e.g., 80%) then the meeting can be deemed to be collaborative.
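The two measurements used in this paragraph, the depth of a least common ancestor and the density of the attendee graph, can be sketched as below. The organization hierarchy is assumed to be available as a dictionary mapping each person to his or her manager (with the root mapped to None), and the edge weights are assumed to be a dictionary keyed by frozenset pairs; both formats, and all function names, are hypothetical. The sketch counts a person as his or her own ancestor, and it leaves the comparison of the two depths against the threshold to the caller.

```python
from itertools import combinations


def path_from_root(person, manager_of):
    """Return the chain of people from the root down to `person`."""
    path = [person]
    while manager_of.get(person) is not None:
        person = manager_of[person]
        path.append(person)
    return path[::-1]


def lca_depth(people, manager_of):
    """Depth (levels from the root, root = 0) of the least common ancestor."""
    paths = [path_from_root(p, manager_of) for p in people]
    depth = -1
    for level in zip(*paths):
        if len(set(level)) != 1:
            break  # the paths diverge below the previous level
        depth += 1
    return depth


def meeting_density(attendees, weights, threshold=1):
    """Fraction of attendee pairs joined by an edge of weight >= `threshold`."""
    pairs = list(combinations(sorted(set(attendees)), 2))
    if not pairs:
        return 1.0
    present = sum(1 for p in pairs if weights.get(frozenset(p), 0) >= threshold)
    return present / len(pairs)


manager_of = {"ceo": None, "vp1": "ceo", "vp2": "ceo", "m1": "vp1",
              "m2": "vp2", "a": "m1", "b": "m1", "c": "m2", "d": "m2"}
print(lca_depth(["a", "b", "c", "d"], manager_of))  # LCA of all attendees
print(lca_depth(["a", "b"], manager_of))            # LCA of one cluster
```

Under the graph-based test, a meeting whose `meeting_density` value falls below the chosen threshold (e.g., 0.8) would be deemed collaborative.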
For all the meetings deemed to be collaborative, attendee subgroups (cliques) are identified using the weighted graph and conventional clustering methods. Each subgroup that has more than one member is then mapped to an extracted project name. This is done by finding common projects of the members of the subgroup and mapping the subgroup to the most tightly fitting of these projects (optionally with more than x% of the project members in the subgroup). The project name associated with the project mapped to a subgroup is deemed to be the name of a related project.
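One way to read the mapping step is sketched below. It assumes that a "common project" is one whose member list contains every member of the subgroup, and that the "most tightly fitting" project is the one for which the subgroup makes up the largest fraction of the project's members. Both readings, as well as the function name and the `project_members` dictionary, are assumptions of this example rather than details stated in the description.

```python
def map_subgroup_to_project(subgroup, project_members, min_fraction=0.0):
    """Return the name of the best-fitting common project, or None.

    `project_members` maps project name -> iterable of member names.
    `min_fraction` plays the role of the optional x% cutoff, as a 0-1 value.
    """
    subgroup = set(subgroup)
    best_name, best_fit = None, 0.0
    for name, members in project_members.items():
        members = set(members)
        if not members or not subgroup <= members:
            continue  # keep only projects shared by every subgroup member
        fit = len(subgroup) / len(members)  # share of the project in the subgroup
        if fit > min_fraction and fit > best_fit:
            best_name, best_fit = name, fit
    return best_name


projects = {
    "Apollo": {"a", "b", "c", "d", "e", "f"},
    "Hermes": {"a", "b", "c"},
    "Zeus": {"a", "x"},
}
print(map_subgroup_to_project(["a", "b"], projects))
```

When two projects fit equally well, this sketch keeps the first one encountered; a production system would need an explicit tie-breaking rule.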
1.7 Finding Documents Related To Projects
The project information extraction implementations described herein can also find documents related to projects. More particularly, in one implementation, for each project associated with an extracted enterprise project name, a document or documents associated with the project are identified. The identified related document or documents, or links thereto, are then added to the project information database entry associated with the project under consideration.
In one version, identifying a document or documents associated with the project under consideration involves identifying a document or documents from which the enterprise project name associated with the project under consideration was extracted. In another version, identifying a document or documents associated with the project under consideration involves indexing documents found in the information sources associated with the enterprise, searching the indexed documents with the enterprise project name associated with the project under consideration, and associating at least some of the documents returned as search results (e.g., top 10 results) with the project under consideration. In yet another version, both of the foregoing procedures are used to find a document or documents associated with the project. In this version, the document or documents from which the enterprise project name associated with the project under consideration was extracted would be identified first, and then the search procedure would be employed to find a document or documents associated with the project not found in the initial procedure.
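The combined version can be illustrated with a small in-memory inverted index. This is only a sketch under stated assumptions: documents are assumed to be a dictionary of identifier to text, the search requires every word of the project name to appear in a document, and results are ranked by how often the full name occurs. The description does not specify a search engine or ranking function, so these choices and all names below are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


def find_project_documents(project_name, source_doc_ids, documents, index, top_k=10):
    """Source documents first, then up to `top_k` search hits not already listed."""
    found = list(dict.fromkeys(source_doc_ids))  # de-duplicate, keep order
    tokens = re.findall(r"\w+", project_name.lower())
    if not tokens:
        return found
    candidates = set.intersection(*(index.get(t, set()) for t in tokens))
    name = project_name.lower()
    # Rank by occurrence count of the full name (an assumed ranking).
    ranked = sorted(candidates, key=lambda d: (-documents[d].lower().count(name), d))
    for doc_id in ranked[:top_k]:
        if doc_id not in found:
            found.append(doc_id)
    return found


docs = {
    "d1": "Project Falcon kickoff notes. Falcon scope.",
    "d2": "Falcon budget",
    "d3": "Unrelated memo",
    "d4": "falcon falcon falcon",
}
print(find_project_documents("Falcon", ["d2"], docs, build_index(docs)))
```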
1.8 Generating A Timeline For A Project
The project information extraction implementations described herein can also generate a project timeline. More particularly, in one implementation, for each project associated with an extracted enterprise project name, a timeline for the project is established. The project timeline is then added to the project information database entry associated with the project under consideration.
In one version, establishing a timeline for a project involves first estimating a start date for the project, where the start date is estimated as the earliest of the creation date of a distribution group associated with the project, the date of the earliest meeting associated with the project and the date of the earliest program code check-in associated with the project. An end date for the project is then estimated if the project has concluded. The end date is estimated as the latest of the date of the last meeting associated with the project, the date of the last program code check-in associated with the project and the latest date a document associated with the project was modified. The aforementioned data sources are then employed to find events associated with the project and the dates they occurred. For example, comments associated with the code check-ins, the meeting titles and meeting notes, the content of related documents and the content of the emails sent to related distribution groups, among other things, can be used to carve out these events and their respective dates.
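The start and end date estimates reduce to a minimum and a maximum over the dated items. The following sketch assumes the dates have already been collected into plain lists; the function name, parameter names and the `concluded` flag are hypothetical, and how a project is determined to have concluded is outside the scope of the example.

```python
from datetime import date


def estimate_timeline(dg_created, meeting_dates, checkin_dates,
                      doc_modified_dates, concluded):
    """Return (start, end) estimates; `end` is None for ongoing projects."""
    start_candidates = list(meeting_dates) + list(checkin_dates)
    if dg_created is not None:
        start_candidates.append(dg_created)
    start = min(start_candidates) if start_candidates else None

    end = None
    if concluded:
        end_candidates = (list(meeting_dates) + list(checkin_dates)
                          + list(doc_modified_dates))
        end = max(end_candidates) if end_candidates else None
    return start, end


start, end = estimate_timeline(
    dg_created=date(2015, 3, 1),
    meeting_dates=[date(2015, 2, 10), date(2015, 6, 1)],
    checkin_dates=[date(2015, 2, 20)],
    doc_modified_dates=[date(2015, 7, 4)],
    concluded=True,
)
print(start, end)
```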
1.9 Finding Other Project-Related Items
In addition to the items described above, other project-related items can be found in the enterprise data sources and added to the project information database. More particularly, in one implementation, for each project associated with an extracted enterprise project name, project-related items are identified in the enterprise data sources using conventional methods, where these project-related items include at least one of meetings, distribution groups, program code check-ins, emails, enterprise social networking messages, definitions, acronyms, a home page, slides, a project description and concept terms associated with the project. The identified project-related items, or links thereto, are then added to the project information database entry associated with the project under consideration.
2.0 Exemplary Operating Environments
The project information extraction implementations described herein are operational using numerous types of general purpose or special purpose computing system environments or configurations. FIG. 9 illustrates a simplified example of a general-purpose computer system with which various aspects and elements of project information extraction, as described herein, may be implemented. It is noted that any boxes that are represented by broken or dashed lines in the simplified computing device 10 shown in FIG. 9 represent alternate implementations of the simplified computing device. As described below, any or all of these alternate implementations may be used in combination with other alternate implementations that are described throughout this document. The simplified computing device 10 is typically found in devices having at least some minimum computational capability such as personal computers (PCs) , server computers, handheld computing devices, laptop or mobile computers, communications devices such as cell phones and personal digital assistants (PDAs) , multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and audio or video media players.
To realize the project information extraction implementations described herein, the device should have a sufficient computational capability and system memory to enable basic computational operations. In particular, the computational capability of the simplified computing device 10 shown in FIG. 9 is generally illustrated by one or more processing unit(s) 12, and may also include one or more graphics processing units (GPUs) 14, either or both in communication with system memory 16. Note that the processing unit(s) 12 of the simplified computing device 10 may be specialized microprocessors (such as a digital signal processor (DSP), a very long instruction word (VLIW) processor, a field-programmable gate array (FPGA), or other micro-controller) or can be conventional central processing units (CPUs) having one or more processing cores.
In addition, the simplified computing device 10 may also include other components, such as, for example, a communications interface 18. The simplified computing device 10 may also include one or more conventional computer input devices 20 (e.g., touchscreens, touch-sensitive surfaces, pointing devices, keyboards, audio input devices, voice or speech-based input and control devices, video input devices, haptic input devices, devices for receiving wired or wireless data transmissions, and the like) or any combination of such devices.
Similarly, various interactions with the simplified computing device 10 and with any other component or feature of project information extraction, including input, output, control, feedback, and response to one or more users or other devices or systems associated with project information extraction, are enabled by a variety of Natural User Interface (NUI) scenarios. The NUI techniques and scenarios enabled by project information extraction include, but are not limited to, interface technologies that allow one or more users to interact in a “natural” manner, free from artificial constraints imposed by input devices such as mice, keyboards, remote controls, and the like.
Such NUI implementations are enabled by the use of various techniques including, but not limited to, using NUI information derived from user speech or vocalizations captured via microphones or other sensors. Such NUI implementations are also enabled by the use of various techniques including, but not limited to, information derived from a user’s facial expressions and from the positions, motions, or orientations of a user’s hands, fingers, wrists, arms, legs, body, head, eyes, and the like, where such information may be captured using various types of 2D or depth imaging devices such as stereoscopic or time-of-flight camera systems, infrared camera systems, RGB (red, green and blue) camera systems, and the like, or any combination of such devices. Further examples of such NUI implementations include, but are not limited to, NUI information derived from touch and stylus recognition, gesture recognition (both onscreen and adjacent to the screen or display surface), air or contact-based gestures, user touch (on various surfaces, objects or other users), hover-based inputs or actions, and the like. Such NUI implementations may also include, but are not limited to, the use of various predictive machine intelligence processes that evaluate current or past user behaviors, inputs, actions, etc., either alone or in combination with other NUI information, to predict information such as user intentions, desires, and/or goals. Regardless of the type or source of the NUI-based information, such information may then be used to initiate, terminate, or otherwise control or interact with one or more inputs, outputs, actions, or functional features of the project information extraction implementations described herein.
However, it should be understood that the aforementioned exemplary NUI scenarios may be further augmented by combining the use of artificial constraints or additional signals with any combination of NUI inputs. Such artificial constraints or additional signals may be imposed or generated by input devices such as mice, keyboards, and remote controls, or by a variety of remote or user worn devices such as accelerometers, electromyography (EMG) sensors for receiving myoelectric signals representative of electrical signals generated by user’s muscles, heart-rate monitors, galvanic skin conduction sensors for measuring user perspiration, wearable or remote biosensors for measuring or otherwise sensing user brain activity or electric fields, wearable or remote biosensors for measuring user body temperature changes or differentials, and the like. Any such information derived from these types of artificial constraints or additional signals may be combined with any one or more NUI inputs to initiate, terminate, or otherwise control or interact with one or more inputs, outputs, actions, or functional features of the project information extraction implementations described herein.
The simplified computing device 10 may also include other optional components such as one or more conventional computer output devices 22 (e.g., display device (s) 24, audio output devices, video output devices, devices for transmitting wired or wireless data transmissions, and the like) . Note that typical communications interfaces 18, input devices 20, output devices 22, and storage devices 26 for general-purpose computers are well known to those skilled in the art, and will not be described in detail herein.
The simplified computing device 10 shown in FIG. 9 may also include a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 10 via storage devices 26, and can include both volatile and nonvolatile media that is either removable 28  and/or non-removable 30, for storage of information such as computer-readable or computer-executable instructions, data structures, program modules, or other data. Computer-readable media includes computer storage media and communication media. Computer storage media refers to tangible computer-readable or machine-readable media or storage devices such as digital versatile disks (DVDs) , blu-ray discs (BD) , compact discs (CDs) , floppy disks, tape drives, hard drives, optical drives, solid state memory devices, random access memory (RAM) , read-only memory (ROM) , electrically erasable programmable read-only memory (EEPROM) , CD-ROM or other optical disk storage, smart cards, flash memory (e.g., card, stick, and key drive) , magnetic cassettes, magnetic tapes, magnetic disk storage, magnetic strips, or other magnetic storage devices. Further, a propagated signal is not included within the scope of computer-readable storage media.
Retention of information such as computer-readable or computer-executable instructions, data structures, program modules, and the like, can also be accomplished by using any of a variety of the aforementioned communication media (as opposed to computer storage media) to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and can include any wired or wireless information delivery mechanism. Note that the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media can include wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, radio frequency (RF) , infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves.
Furthermore, software, programs, and/or computer program products embodying some or all of the various project information extraction implementations described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer-readable or machine-readable media or storage devices and communication media in the form of computer-executable instructions or other data structures. Additionally, the  claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term “article of manufacture” as used herein is intended to encompass a computer program accessible from any computer-readable device, or media.
The project information extraction implementations described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and the like, that perform particular tasks or implement particular abstract data types. The project information extraction implementations described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Additionally, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include field-programmable gate arrays (FPGAs) , application-specific integrated circuits (ASICs) , application-specific standard products (ASSPs) , system-on-a-chip systems (SOCs) , complex programmable logic devices (CPLDs) , and so on.
3.0 Other Implementations
It is noted that any or all of the aforementioned implementations throughout the description may be used in any combination desired to form additional hybrid implementations. In addition, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
What has been described above includes example implementations. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but one of ordinary skill in the art may recognize that many further combinations and permutations are possible. Accordingly, the claimed subject matter is intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
In regard to the various functions performed by the above described components, devices, circuits, systems and the like, the terms (including a reference to a “means” ) used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., a functional equivalent) , even though not structurally equivalent to the disclosed structure, which performs the function in the herein illustrated exemplary aspects of the claimed subject matter. In this regard, it will also be recognized that the foregoing implementations include a system as well as a computer-readable storage media having computer-executable instructions for performing the acts and/or events of the various methods of the claimed subject matter.
There are multiple ways of realizing the foregoing implementations (such as an appropriate application programming interface (API), tool kit, driver code, operating system, control, standalone or downloadable software object, or the like), which enable applications and services to use the implementations described herein. The claimed subject matter contemplates this use from the standpoint of an API (or other software object), as well as from the standpoint of a software or hardware object that operates according to the implementations set forth herein. Thus, various implementations described herein may have aspects that are wholly in hardware, or partly in hardware and partly in software, or wholly in software.
The aforementioned systems have been described with respect to interaction between several components. It will be appreciated that such systems and components can include those components or specified sub-components, some of the specified components or sub-components, and/or additional components, and according to various permutations and combinations of the foregoing. Sub-components can also be implemented as components communicatively coupled to other components rather than included within parent components (e.g., hierarchical components) .
Additionally, it is noted that one or more components may be combined into a single component providing aggregate functionality or divided into several separate sub-components, and any one or more middle layers, such as a management layer, may be provided to communicatively couple to such sub-components in order to provide integrated functionality. Any components described herein may also interact with one or more other components not specifically described herein but generally known by those of skill in the art.
4.0 Claim Support And Further Implementations
The following paragraphs summarize various examples of implementations of project information extraction which may be claimed in the present document. However, it should be understood that the implementations summarized below are not intended to limit the subject matter which may be claimed in view of the foregoing descriptions. Further, any or all of the implementations summarized below may be claimed in any desired combination with some or all of the implementations described throughout the foregoing description and any implementations illustrated in one or more of the figures, and any other implementations described below. In addition, it should be noted that the following implementations are intended to be understood in view of the foregoing description and figures described throughout this document. 
In one implementation, a computer-implemented process is employed for generating a project information database for an enterprise that uses a computing device to perform the following process actions. First, enterprise project names are extracted from information sources associated with an enterprise; then people associated with the project corresponding to each extracted enterprise project name are identified using information sources associated with an enterprise; and a project information database is generated for the enterprise including an entry for each project, where each of the entries includes the extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project.
In one implementation, the process action of extracting enterprise project names from information sources associated with the enterprise, includes extracting candidate enterprise project names from one or more information sources including enterprise documents, where the extraction includes at least one of, employing a Hearst pattern analysis to identify text strings in the enterprise documents that are potential project names, identifying tables in the enterprise documents having a column or row that includes at least two known enterprise project names, and deeming other names listed in the same column or row as potential project names, and identifying potential project names in document titles; eliminating potential project names that do not appear in the enterprise documents more than a prescribed number of times; employing a project name classifier trained to recognize enterprise project names to classify which of the remaining potential project names are valid enterprise project names; and designating the potential project names classified as valid to be enterprise project names.
In one implementation, the process action of extracting enterprise project names from information sources associated with the enterprise, includes extracting candidate enterprise project names from one or more information sources including meeting information and distribution group information, where the extraction includes, identifying meetings having less than a prescribed maximum number of attendees and more than one attendee, identifying distribution groups having less than a prescribed maximum number of members and more than one member, eliminating from the identified meetings and  distribution groups those meetings or groups having a title which includes a person's name or a term indicative of a person's name, identifying as potential project names those names in the remaining identified meetings and distribution groups that precede or follow a project name modifier term or phrase, identifying projects which have the same identified potential project name, whenever a pair of projects having identical project names have no common meeting attendees or DG members, designating each project name as identifying a separate project, and whenever a pair of projects having identical project names have common meeting attendees or DG members, designating each project name of the pair as identifying the same project; employing a project name classifier trained to recognize enterprise project names to classify which of the potential project names are valid enterprise project names; and designating the potential project names classified as valid to be enterprise project names.
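The handling of identical project names can be sketched as a grouping problem: meetings and distribution groups that carry the same potential project name are merged when they share at least one attendee or member, and kept apart otherwise. The description states the rule for pairs; applying it transitively, as done below, is an assumption of this sketch, as are the function name and the representation of each meeting or distribution group as a set of member names.

```python
def split_same_named_projects(sources):
    """Group member sets that share people; each resulting group is one project.

    `sources` is a list of member sets, one per meeting or distribution
    group, all carrying the same potential project name.
    """
    groups = []  # pairwise-disjoint merged member sets
    for members in sources:
        merged = set(members)
        overlapping = [g for g in groups if g & merged]
        for g in overlapping:
            merged |= g       # common members: treat as the same project
            groups.remove(g)
        groups.append(merged)
    return groups


# Two sources named "Phoenix" with no shared people yield two projects.
print(len(split_same_named_projects([{"a", "b"}, {"c", "d"}])))
# A third source sharing a member with the first is folded into it.
print(len(split_same_named_projects([{"a", "b"}, {"c", "d"}, {"b", "e"}])))
```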
In one implementation, the process action of identifying people associated with the project corresponding to each extracted enterprise project name using information sources associated with the enterprise, includes, for each extracted enterprise project name, identifying people associated with the project corresponding to the enterprise project name from one or more information sources including enterprise documents, where the identification includes, identifying enterprise documents that include the enterprise project name, for each document identified identifying the person or persons who authored the document and identifying each person named in the document who did not author the document, and designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name. Further, in one implementation, the document under consideration names one or more other enterprise project names in addition to the enterprise project name under consideration, and the aforementioned identification of each non-authoring person named in the document, includes identifying a person or persons if the person's name is closer as measured by the number of words before or after to the enterprise project name under consideration than it is to any other enterprise project name found in the document. Still further, in one implementation, the enterprise project name under consideration is found in a table included in the document under consideration, and the aforementioned identification of each non-authoring person named in the document, includes identifying a person or persons named in the same column or row of the table as the enterprise project name under consideration.
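The word-distance test for non-authoring persons can be sketched as follows, assuming the document has already been tokenized and that the word positions of the person's name and of each project name are known. The function name and the position dictionary are hypothetical, and returning no project on an exact tie is a choice made for this example, since a tied name is not closer to one project than to another.

```python
def closest_project(person_pos, project_positions):
    """Return the project whose name occurs nearest to the person's name.

    `person_pos` is the word index of the person's name; `project_positions`
    maps project name -> list of word indices where that name occurs.
    Returns None when there is no project or when the nearest distance is tied.
    """
    distances = {
        name: min(abs(person_pos - p) for p in positions)
        for name, positions in project_positions.items()
        if positions
    }
    if not distances:
        return None
    best = min(distances.values())
    winners = [name for name, d in distances.items() if d == best]
    return winners[0] if len(winners) == 1 else None


print(closest_project(10, {"Falcon": [8, 40], "Hermes": [15]}))
```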
In one implementation, the process action of identifying people associated with the project corresponding to each extracted enterprise project name using information sources associated with the enterprise, includes, for each extracted enterprise project name, identifying people associated with the project corresponding to the enterprise project name from one or more information sources including distribution groups or meeting information, where the identification includes identifying a distribution group or groups whose information includes the enterprise project name, identifying a meeting or meetings whose meeting information includes the enterprise project name, identifying each person who is a member of the identified distribution group or groups, and identifying each person who is an attendee of the identified meeting or meetings; and designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name. Further, in one implementation designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name, includes, for each identified person, ranking the person based on a score derived from various attributes and contributions associated with the project gleaned from the data sources that referred to the person, and ordering each identified person in the list of the people identified as being associated with the project found in the project information database based on that person's ranking.
In one implementation, ranking the person based on the score derived from various attributes and contributions to the project gleaned from the data sources that referred to the person, includes assigning a component score based on each of at least one of: a number of documents authored by the person that include the name of the project; or a degree of proximity of the person's name to the project's name in each document that includes both the person's name and the project's name; or the person's name being in the same column or row of a table in a document, or a same list in the document, as the project's name; or the person being a member of a distribution group associated with the project; or the person being a member of a sub-group of a distribution group associated with the project;  or the person being a member of a distribution group associated with the project wherein a majority of the members of that distribution group are supervised by the person; or a number of emails sent by the person to a distribution group associated with the project; or a number of check-ins of program code associated with the project that the person made; or a number of meetings associated with the project that the person organized or attended; or for each meeting associated with the project, a number of sentences attributed to the person in meeting notes of the meeting associated with the project; or a number of emails and enterprise social network communications associated with the project sent by the person. 
Further, in one implementation, the various attributes and contributions to the project are each assigned a weight, and ranking the person based on the score derived from various attributes and contributions to the project gleaned from the data sources that referred to the person, includes multiplying each component score associated with an attribute or contribution by the weight assigned to the attribute or contribution, and summing the resulting products to produce the overall score for the person. In addition, in one implementation, designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name, includes, for each identified person, identifying the person's role within the project, and for each identified role and each person assigned that role in the project ranking the person based on a score derived from various attributes and contributions associated with the project gleaned from the data sources that referred to the person and ordering each person assigned the role under consideration based on that person's ranking.
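The weighted scoring described above is a weighted sum of component scores followed by a sort. A minimal sketch is given below; the attribute names, the example weights and the function name are illustrative assumptions, and the description does not state how the weights are chosen.

```python
def rank_people(component_scores, weights):
    """Return (person, overall_score) pairs, highest score first.

    `component_scores` maps person -> {attribute: component score};
    `weights` maps attribute -> weight. Attributes without a weight count as 0.
    """
    totals = {
        person: sum(weights.get(attr, 0.0) * value
                    for attr, value in scores.items())
        for person, scores in component_scores.items()
    }
    # Sort by descending score, then by name so ties are ordered deterministically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


scores = {
    "ann": {"docs_authored": 3, "meetings": 10},
    "bob": {"docs_authored": 5, "checkins": 20},
}
weights = {"docs_authored": 2.0, "meetings": 0.5, "checkins": 0.25}
print(rank_people(scores, weights))
```

For role-based ordering, the same function could be applied separately to the people assigned to each role.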
In one implementation, the aforementioned computer-implemented process for generating a project information database for an enterprise further includes a process action of, for each project associated with an extracted enterprise project name, identifying an extracted enterprise project name or names associated with another project or projects that are related to the project under consideration, and the process action of generating a project information database for the enterprise, further includes adding the extracted enterprise project name or names of the project or projects that are related to the project under consideration to the project information database entry associated with the project under consideration. In one implementation, the process action of identifying the extracted enterprise project name or names associated with another project or projects that are related to the project under consideration, includes associating with the project under consideration each project having a sub-super distribution group relationship with the distribution group or groups of the project under consideration.
Further, in one implementation, the process action of identifying the extracted enterprise project name or names associated with another project or projects that are related to the project under consideration, includes identifying meetings having less than a prescribed number of attendees, building a weighted graph with nodes representing attendees of the identified meetings and edges connecting each node with the other nodes that each have a weight representing the number of meetings the attendees associated with the nodes connected by the edge have attended together, for each meeting determining if the meeting is a project-related meeting or a collaborative meeting, for each meeting determined to be a collaborative meeting identifying attendee subgroups using the weighted graph and a clustering method, and for each subgroup that has more than one member mapping the subgroup to an extracted project name and deeming the project corresponding to the extracted project name mapped to the subgroup to be a project related to the project under consideration.
In one implementation, the aforementioned computer-implemented process for generating a project information database for an enterprise further includes a process action of, for each project associated with an extracted enterprise project name, identifying a document or documents associated with the project under consideration, and the process action of generating a project information database for the enterprise further includes adding the identified document or documents, or links thereto, to the project information database entry associated with the project under consideration. In one implementation, the process action of identifying a document or documents associated with the project under consideration includes identifying a document or documents from which the enterprise project name associated with the project under consideration was extracted. Further, in one implementation, the process action of identifying a document or documents associated with the project under consideration includes indexing documents found in the information sources associated with the enterprise, searching the indexed documents with the enterprise project name associated with the project under consideration, and associating at least some of the documents returned as search results with the project under consideration.
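The index-then-search approach can be illustrated with a small inverted index. This is a minimal sketch under stated assumptions: the tokenizer, the all-tokens-must-match rule, and the sample documents are hypothetical, and a production system would more likely rely on an existing enterprise search index.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """documents: dict of doc id -> text. Returns token -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, project_name):
    """Return ids of documents containing every token of the project name."""
    tokens = tokenize(project_name)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical documents.
documents = {
    "spec.docx": "Project Falcon design spec",
    "notes.txt": "Osprey kickoff notes",
    "plan.pptx": "Falcon UI rollout plan",
}
index = build_index(documents)
falcon_docs = search(index, "Falcon")
falcon_ui_docs = search(index, "Falcon UI")
```

The match here requires all tokens of the name to be present but does not check that they are adjacent; phrase matching and result ranking would be needed to select "at least some" of the returned documents.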
In one implementation, the aforementioned computer-implemented process for generating a project information database for an enterprise further includes a process action of, for each project associated with an extracted enterprise project name, establishing a timeline for the project, and the process action of generating a project information database for the enterprise further includes adding the timeline established for the project under consideration to the project information database entry associated with the project under consideration. In one implementation, the process action of establishing a timeline for the project includes first estimating a start date for the project, where the start date is estimated as the earliest of the creation date of a distribution group associated with the project, the date of the earliest meeting associated with the project, and the date of the earliest program code check-in associated with the project. Then, an end date is estimated for the project if the project has concluded, where the end date is estimated as the latest of the date of the last meeting associated with the project, the date of the last program code check-in associated with the project, and the latest date a document associated with the project was modified. Next, the information sources associated with an enterprise are searched to find events associated with the project and the dates they occurred.
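The start and end date estimates reduce to a minimum and a maximum over the available dates. The following sketch assumes the relevant dates have already been collected into lists; the function names, the `concluded` flag, and the sample dates are hypothetical.

```python
from datetime import date

def estimate_start(group_created, meeting_dates, checkin_dates):
    """Earliest of: group creation dates, meeting dates, code check-in dates."""
    candidates = [*group_created, *meeting_dates, *checkin_dates]
    return min(candidates) if candidates else None

def estimate_end(meeting_dates, checkin_dates, doc_modified_dates, concluded):
    """Latest of: meeting, check-in, and document modification dates.

    Returns None for a project that has not concluded.
    """
    if not concluded:
        return None
    candidates = [*meeting_dates, *checkin_dates, *doc_modified_dates]
    return max(candidates) if candidates else None

# Hypothetical dates for one project.
group_created = [date(2014, 3, 10)]
meeting_dates = [date(2014, 2, 20), date(2015, 1, 15)]
checkin_dates = [date(2014, 4, 1), date(2015, 2, 2)]
doc_edits = [date(2015, 3, 30)]

start = estimate_start(group_created, meeting_dates, checkin_dates)
end = estimate_end(meeting_dates, checkin_dates, doc_edits, concluded=True)
```

How a project is determined to have concluded, and how dated events are found to fill in the timeline between the two estimates, is not covered by this sketch.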
In one implementation, the aforementioned computer-implemented process for generating a project information database for an enterprise further includes a process action of, for each project associated with an extracted enterprise project name, identifying project-related items including at least one of meetings, or distribution groups, or program code check-ins, or emails, or enterprise social networking messages, or definitions, or acronyms, or home page, or slides, or a project description, or concept terms associated with the project, and the process action of generating a project information database for the enterprise, further includes adding the identified project-related items, or links thereto, to the project information database entry associated with the project under consideration.
In one implementation, a project information database system for an enterprise is employed. The system includes one or more computing devices each including a processor, communication interface and memory. If there are multiple computing devices, they are in communication with each other via a computer network. The system also includes a computer program having program modules executable by the one or more computing devices. The one or more computing devices are directed by the program modules of the computer program to access information sources associated with an enterprise, extract enterprise project names from the information sources, identify people associated with the project corresponding to each extracted enterprise project name using the information sources, and generate a project information database for the enterprise including an entry for each project, where each of the entries includes the extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project. Further, in one implementation, the computer program includes program modules for receiving a query from a user including terms representing an enterprise project name, or a person associated with an enterprise project, or both, searching the project information database for information corresponding to the queried enterprise project name, or person associated with an enterprise project, or both, and providing the results of the searching to the user.
In various implementations, project information extraction involves a step for generating a project information database for an enterprise. For example, in one implementation, project information extraction includes using a computing device to perform the following process actions: an extracting step for extracting enterprise project names from information sources associated with an enterprise; an identifying step for identifying people associated with the project corresponding to each extracted enterprise project name using information sources associated with an enterprise; and a generating step for generating a project information database for the enterprise including an entry for each project, where each of the entries includes the extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project.
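The generating step and the query handling described for the system can be sketched with an in-memory dictionary standing in for the database. The entry layout, the function names, and the exact-match query semantics are assumptions for illustration; the extraction and people-identification steps are taken as already done.

```python
def generate_database(project_names, people_by_project):
    """Create one entry per project holding its name and a list of people."""
    database = {}
    for name in project_names:
        database[name] = {
            "name": name,
            "people": list(people_by_project.get(name, [])),
        }
    return database

def query(database, project=None, person=None):
    """Return entries matching a project name, a person, or both."""
    results = []
    for entry in database.values():
        if project is not None and project.lower() != entry["name"].lower():
            continue
        if person is not None and person not in entry["people"]:
            continue
        results.append(entry)
    return results

# Hypothetical extracted names and identified people.
database = generate_database(
    ["Falcon", "Osprey"],
    {"Falcon": ["ann", "bob"], "Osprey": ["bob", "cid"]},
)
by_person = [e["name"] for e in query(database, person="bob")]
both = [e["name"] for e in query(database, project="falcon", person="ann")]
```

A query that supplies both a project name and a person returns only entries satisfying both conditions, which is one possible reading of "or both" in the text.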

Claims (15)

  1. A computer-implemented process for generating a project information database for an enterprise, comprising the actions of:
    using a computing device to perform the following process actions:
    extracting enterprise project names from information sources associated with an enterprise;
    identifying people associated with the project corresponding to each extracted enterprise project name using information sources associated with an enterprise; and
    generating a project information database for the enterprise comprising an entry for each project, each of said entries comprising the extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project.
  2. The process of Claim 1, wherein the process action of extracting enterprise project names from information sources associated with the enterprise, comprises the actions of:
    extracting candidate enterprise project names from one or more information sources comprising enterprise documents, said extraction comprising at least one of,
    employing a Hearst pattern analysis to identify text strings in the enterprise documents that are potential project names,
    identifying tables in the enterprise documents having a column or row that includes at least two known enterprise project names, and deeming other names listed in the same column or row as potential project names, and
    identifying potential project names in document titles;
    eliminating potential project names that do not appear in the enterprise documents more than a prescribed number of times;
    employing a project name classifier trained to recognize enterprise project names to classify which of the remaining potential project names are valid enterprise project names; and
    designating the potential project names classified as valid to be enterprise project names.
  3. The process of Claim 1, wherein the process action of extracting enterprise project names from information sources associated with the enterprise, comprises the actions of:
    extracting candidate enterprise project names from one or more information sources comprising meeting information and distribution group information, said extraction comprising,
    identifying meetings having less than a prescribed maximum number of attendees and more than one attendee,
    identifying distribution groups having less than a prescribed maximum number of members and more than one member,
    eliminating from the identified meetings and distribution groups those meetings or groups having a title which includes a person's name or a term indicative of a person's name,
    identifying as potential project names those names in the remaining identified meetings and distribution groups that precede or follow a project name modifier term or phrase,
    identifying projects which have the same identified potential project name,
    whenever a pair of projects having identical project names have no common meeting attendees or DG members, designating each project name as identifying a separate project, and
    whenever a pair of projects having identical project names have common meeting attendees or DG members, designating each project name of the pair as identifying the same project;
    employing a project name classifier trained to recognize enterprise project names to classify which of the potential project names are valid enterprise project names; and
    designating the potential project names classified as valid to be enterprise project names.
  4. The process of Claims 1, 2 or 3, wherein the process action of identifying people associated with the project corresponding to each extracted enterprise project name using information sources associated with the enterprise, comprises the actions of:
    for each extracted enterprise project name, identifying people associated with the project corresponding to the enterprise project name from one or more information sources comprising enterprise documents, said identification comprising,
    identifying enterprise documents that include the enterprise project name,
    for each document identified,
    identifying the person or persons who authored the document, and
    identifying each person named in the document who did not author the document, and
    designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name.
  5. The process of Claim 4, wherein the document under consideration names one or more other enterprise project names in addition to the enterprise project name under consideration, and wherein the process action of identifying each non-authoring person named in the document, comprises an action of identifying a person or persons if the person's name is closer as measured by the number of words before or after to the enterprise project name under consideration than it is to any other enterprise project name found in the document.
  6. The process of Claim 4, wherein the enterprise project name under consideration is found in a table included in the document under consideration, and wherein the process action of identifying each non-authoring person named in the document, comprises an action of identifying a person or persons named in the same column or row of the table as the enterprise project name under consideration.
  7. The process of Claims 1, 2 or 3, wherein the process action of identifying people associated with the project corresponding to each extracted enterprise project name using information sources associated with the enterprise, comprises the actions of:
    for each extracted enterprise project name, identifying people associated with the project corresponding to the enterprise project name from one or more information sources comprising distribution groups or meeting information, said identification comprising,
    identifying a distribution group or groups whose information includes the enterprise project name,
    identifying a meeting or meetings whose meeting information includes the enterprise project name,
    identifying each person who is a member of the identified distribution group or groups, and
    identifying each person who is an attendee of the identified meeting or meetings, and
    designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name.
  8. The process of Claims 4 or 7, wherein the process action of designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name, comprises the actions of:
    for each identified person, ranking the person based on a score derived from various attributes and contributions associated with the project gleaned from the data sources that referred to the person; and
    ordering each identified person in said list of the people identified as being associated with the project found in the project information database based on that person's ranking.
  9. The process of Claim 8, wherein the process action of ranking the person based on the score derived from various attributes and contributions to the project gleaned from the data sources that referred to the person, comprises the actions of assigning a component score based on each of at least one of: 
    a number of documents authored by the person that include the name of the project; or
    a degree of proximity of the person's name to the project's name in each document that includes both the person's name and the project's name; or
    the person's name being in the same column or row of a table in a document, or a same list in the document, as the project's name; or
    the person being a member of a distribution group associated with the project; or
    the person being a member of a sub-group of a distribution group associated with the project; or
    the person being a member of a distribution group associated with the project wherein a majority of the members of that distribution group are supervised by the person; or
    a number of emails sent by the person to a distribution group associated with the project; or
    a number of check-ins of program code associated with the project that the person made; or
    a number of meetings associated with the project that the person organized or attended; or
    for each meeting associated with the project, a number of sentences attributed to the person in meeting notes of the meeting associated with the project; or
    a number of emails and enterprise social network communications associated with the project sent by the person.
  10. The process of Claim 9, wherein said various attributes and contributions to the project are each assigned a weight, and wherein the process action of ranking the person based on the component scores derived from various attributes and contributions associated with the project gleaned from the data sources that referred to the person, comprises the actions of:
    multiplying each component score associated with an attribute or contribution by the weight assigned to the attribute or contribution; and
    summing the resulting products to produce the overall score for the person.
  11. The process of Claims 4 or 7, wherein the process action of designating the identified person or persons as a candidate member or members of the project corresponding to the enterprise project name, comprises the actions of:
    for each identified person, identifying the person's role within the project; and
    for each identified role and each person assigned that role in the project,
    ranking the person based on a score derived from various attributes and contributions associated with the project gleaned from the data sources that referred to the person, and
    ordering each person assigned the role under consideration based on that person's ranking.
  12. The process of Claim 1, further comprising, for each project associated with an extracted enterprise project name, identifying an extracted enterprise project name or names associated with another project or projects that are related to the project under consideration, and wherein the process action of generating a project information database for the enterprise, further comprises adding the extracted enterprise project name or names of project or projects that are related to the project under consideration to the project information database entry associated with the project under consideration.
  13. The process of Claim 1, further comprising, for each project associated with an extracted enterprise project name, identifying a document or documents associated with the project under consideration, and wherein the process action of generating a project information database for the enterprise, further comprises adding the identified document or documents, or links thereto, to the project information database entry associated with the project under consideration.
  14. The process of Claim 1, further comprising, for each project associated with an extracted enterprise project name, establishing a timeline for the project, and wherein the process action of generating a project information database for the enterprise, further comprises adding the timeline established for the project under consideration to the project information database entry associated with the project under consideration.
  15. A project information database system for an enterprise, comprising:
    one or more computing devices each comprising a processor, communication interface and memory, wherein said computing devices are in communication with each other via a computer network whenever there are multiple computing devices; and
    a computer program having program modules executable by the one or more computing devices, the one or more computing devices being directed by the program modules of the computer program to,
    access information sources associated with an enterprise,
    extract enterprise project names from the information sources,
    identify people associated with the project corresponding to each extracted enterprise project name using the information sources, and
    generate a project information database for the enterprise comprising an entry for each project, each of said entries comprising the extracted enterprise project name associated with the project and at least a list of the people identified as being associated with the project.
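The weighted scoring recited in Claims 9 and 10 (multiply each component score by the weight of its attribute, then sum the products) can be illustrated with a short sketch. The attribute names, weight values, and component scores below are hypothetical; the claims do not specify them.

```python
def overall_score(component_scores, weights):
    """Sum of component score times attribute weight (missing weight counts as 0)."""
    return sum(
        weights.get(attribute, 0.0) * score
        for attribute, score in component_scores.items()
    )

def rank_people(scores_by_person, weights):
    """Order people by descending overall score, breaking ties by name."""
    totals = {
        person: overall_score(scores, weights)
        for person, scores in scores_by_person.items()
    }
    return sorted(totals, key=lambda person: (-totals[person], person))

# Hypothetical weights and component scores.
weights = {"docs_authored": 2.0, "meetings_attended": 1.0, "checkins": 0.5}
scores_by_person = {
    "ann": {"docs_authored": 3, "meetings_attended": 4},
    "bob": {"docs_authored": 1, "checkins": 10},
    "cid": {"meetings_attended": 2},
}
ranking = rank_people(scores_by_person, weights)
```

With these sample values the overall scores are 10.0 for ann, 7.0 for bob, and 2.0 for cid, so the list of people in the database entry would be ordered ann, bob, cid.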
PCT/CN2015/082341 2015-06-25 2015-06-25 Extracting enterprise project information WO2016206044A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580077811.4A CN107430607A (en) 2015-06-25 2015-06-25 Extract Enterprise Project information
PCT/CN2015/082341 WO2016206044A1 (en) 2015-06-25 2015-06-25 Extracting enterprise project information

Publications (1)

Publication Number Publication Date
WO2016206044A1 (en) 2016-12-29

Family

ID=57584489

Country Status (2)

Country Link
CN (1) CN107430607A (en)
WO (1) WO2016206044A1 (en)

Also Published As

Publication number Publication date
CN107430607A (en) 2017-12-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15895944; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15895944; Country of ref document: EP; Kind code of ref document: A1)