WO2016200667A1 - Identifying relationships using information extracted from documents - Google Patents

Identifying relationships using information extracted from documents

Info

Publication number
WO2016200667A1
WO2016200667A1 PCT/US2016/035412 US2016035412W WO2016200667A1 WO 2016200667 A1 WO2016200667 A1 WO 2016200667A1 US 2016035412 W US2016035412 W US 2016035412W WO 2016200667 A1 WO2016200667 A1 WO 2016200667A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
structured data
relationships
relationship
determining
Prior art date
Application number
PCT/US2016/035412
Other languages
English (en)
French (fr)
Inventor
Lei Ji
Zheng Chen
Zhongyuan WANG
Jun Yan
Welly Lee
Dmitriy Meyerzon
Original Assignee
Microsoft Technology Licensing, Llc
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc
Publication of WO2016200667A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion

Definitions

  • the people in the company may desire to identify certain types of relationships. For example, a person in the company may desire to determine roles, projects, clients, technologies, etc. with which employees are associated.
  • a product manager may desire to identify those employees in the company that have familiarity with technologies X, Y, and Z.
  • the product manager may send an email to at least a large portion of the company asking for the names of employees that are familiar with the technologies X, Y, or Z.
  • the product manager may then go through the responses to the email request to identify people for inclusion in the product team.
  • such a process is time consuming both for the person asking for more information about the relationship between employees and technologies and for those who respond to such email requests.
  • some employees may not respond to the email request, resulting in the requestor determining relationships based on incomplete information.
  • structured data that includes a table may be received.
  • a determination may be made that a first column of the table includes a first type of data and that a second column of the table includes a second type of data.
  • a relationship between first contents of the first column of the table and second contents of the second column of the table may be determined.
  • the relationship between the first contents of the first column of the table and the second contents of the second column of the table may be stored to create stored relationships.
  • the stored relationships may be searched based on one or more search terms. Search results based on searching the stored relationships may be displayed.
  • the search results may identify which project(s) a particular person or set of people are likely to be working on.
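
The table-based flow summarized above can be sketched as follows. This is a minimal illustration, not the claimed method: the dictionaries `PEOPLE` and `PROJECTS`, the majority-match rule in `column_type`, and the helper names are all assumptions introduced here. The sketch labels each column, pairs the person and project cells of each row, and searches the stored pairs.

```python
# Hypothetical dictionaries used to guess what a column holds.
PEOPLE = {"Ann Lee", "Bo Chen"}
PROJECTS = {"Atlas", "Borealis"}


def column_type(values):
    """Label a column by the dictionary that most of its non-empty cells match."""
    cells = [v for v in values if v]
    if not cells:
        return "unknown"
    if sum(v in PEOPLE for v in cells) > len(cells) / 2:
        return "person"
    if sum(v in PROJECTS for v in cells) > len(cells) / 2:
        return "project"
    return "unknown"


def extract_relationships(table):
    """Pair the person cell with the project cell of each row."""
    types = [column_type(col) for col in zip(*table)]
    if "person" not in types or "project" not in types:
        return []
    p, q = types.index("person"), types.index("project")
    return [(row[p], row[q]) for row in table if row[p] and row[q]]


def search(stored, term):
    """Return stored relationships in which any field contains the search term."""
    term = term.lower()
    return [rel for rel in stored if any(term in field.lower() for field in rel)]


table = [
    ("Ann Lee", "Atlas", "2015-06-01"),
    ("Bo Chen", "Borealis", "2015-06-02"),
    ("Ann Lee", "Borealis", "2015-06-03"),
]
stored = extract_relationships(table)
matches = search(stored, "ann")
```

Here the date column is labeled "unknown" and ignored; a real system may instead rely on trained classifiers rather than dictionary lookups.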
  • FIG. 1 illustrates an example framework for mining relationships according to some implementations.
  • FIG. 2 is a flow diagram of an example process that includes processing structured and semi-structured data according to some embodiments.
  • FIG. 3 is a flow diagram of an example process to extract relationships from structured data according to some embodiments.
  • FIG. 4 is a flow diagram of an example process that includes receiving structured data and one or more dictionaries according to some embodiments.
  • FIG. 5 is a flow diagram of an example process that includes receiving structured data that includes a table according to some embodiments.
  • FIG. 6 is a flow diagram of an example process that includes receiving structured data extracted from documents according to some embodiments.
  • FIG. 7 is a block diagram of an example computing device and environment according to some implementations.
  • the system and techniques described herein may be used to extract relationship information from a document repository.
  • Many companies use a document repository that is accessible to multiple employees to enable documents to be (i) shared, (ii) modified to be re-used or re-purposed, (iii) archived, etc.
  • the document repository may be stored on local servers, on remote servers, such as cloud-based storage facilities, or a combination of both (e.g., local storage with cloud backup).
  • the document repository may provide various features, such as version control, multi-user collaboration in real-time, security controls (e.g., selective access based on user permissions, document permissions, or both), etc.
  • the documents stored in the repository may include multiple types of documents, such as, for example, plain text, Microsoft® Word® compatible documents, Microsoft® Excel® compatible documents, Microsoft® PowerPoint® compatible documents, other types of Microsoft® Office® compatible documents (e.g., Visio®, Rich Text Format (RTF), etc.), portable document format (PDF) compatible documents, hypertext markup language (HTML) documents, extensible markup language (XML) documents, documents in another type of document format, or any combination thereof.
  • the document repository may be implemented using a database or a type of collaborative document management system, such as IBM® Collaboration Solutions or Microsoft® SharePoint®.
  • the document repository may integrate intranet, content management, and document management.
  • the document repository may include a multipurpose set of technologies using a common technical infrastructure closely integrated with a productivity suite such as Microsoft® Office.
  • the document repository may provide intranet portals, document and file management, collaboration, social networks, extranets, websites, enterprise search, and business intelligence, in addition to system integration, process integration, and workflow automation capabilities.
  • the document repository may be integrated with enterprise application software, such as enterprise resource planning (ERP) and customer relationship management (CRM) software.
  • Each type of document may have a corresponding parser.
  • a first parser may parse a first type of document (e.g., HTML), a second parser may parse a second type of document (e.g., XML), etc.
  • Each parser may parse a document to identify and extract data for which a relationship is to be identified. For example, in the case of identifying projects associated with employees of a company, the parser may look for and extract information that identifies an employee name and information that identifies a project on which they are working, a role (e.g., software designer, team lead, manager, etc.) associated with the employee, etc.
  • a crawler may identify new or modified documents in the repository, identify a type of each of the documents, and send each of the new or modified documents to the corresponding parser.
  • the crawler may be a software application that automatically (e.g., without human interaction) and periodically (e.g., at pre-determined intervals) scans the documents stored in the repository and identifies documents that are new, modified, or flagged for inclusion.
  • the documents may include one or more of structured data (e.g., tables), semi-structured data (e.g., XML, email header, JavaScript Object Notation (JSON) metadata, etc.), or unstructured data (e.g., body of an email, etc.).
  • the parsers may extract and convert the information related to a particular type of relationship (e.g., which projects an employee is working on) into a particular type of data structure (e.g., a table).
  • the extracted data may be analyzed by various software modules to identify information associated with particular types of relationships, classify the relationships, filter out noise (e.g., unrelated information, etc.), rank the relationships, and store the relationships (e.g., in a database).
  • One or more of the software modules may include machine learning algorithms, such as support vector machines, neural networks, Bayesian networks, etc. The machine learning algorithms may be used to identify the columns in a table that include relationship related information (e.g., project information).
  • parsers may be used to extract information from documents in a repository.
  • the information that is extracted may be relevant to identifying a specific type of relationship (e.g., a relationship between an employee and one or more projects on which the employee is working).
  • Various modules may be used to identify the relationships, filter out any noise, rank the relationships, and store the relationships in a database.
  • a company may use the database to identify which employees have expertise in a particular technology, experience with a particular client, or other relevant work experience.
  • a software company may identify software designers with expertise in machine learning or telecom protocols.
  • a law firm specializing in intellectual property may land a client that is performing research in a particular technology area and may desire to identify patent attorneys with experience writing applications for the particular technology area (e.g., telecom software, cloud-based services, semiconductors, processors, memory storage, etc.). Such information may be retrieved without having to resort to emailing multiple employees to ask them which employees have a particular expertise.
  • FIG. 1 illustrates an example framework 100 for mining relationships according to some implementations.
  • the framework 100 may be executed by one or more computing devices or other machines configured with specific processor-executable instructions.
  • the framework 100 is described below using examples of how company documents (e.g., enterprise documents) may be mined to identify specific types of relationships that answer the question "What is employee ABC currently working on?", e.g., names of projects that employees are currently working on or roles that employees are currently performing.
  • framework 100 may be applied to mine other types of relationship information as well.
  • the relationship information may include a name of the project, technologies involved in the project, one or more roles (e.g., manager, architect, lead developer, technical writer, software engineer, etc.) associated with the employee during the time that employee is working on the project, and other information related to the relationship between the employee and the project.
  • the framework 100 may extract the relationship information and store the relationship information in a data storage mechanism that enables a user to perform various operations, including searching, retrieving, and sorting the relationship information.
  • the modules and data flow shown in FIG. 1 illustrate an exemplary embodiment. However, other embodiments may omit one or more of the modules, combine functions of multiple modules, split a particular module into two or more additional modules, change the data flow, make other variations to the modules or data flow in FIG. 1, or any combination thereof, while retaining the functionality of mining relationships from documents.
  • the framework 100 may include a document repository 102, one or more parsers 104, and a relationship mining module 106.
  • the document repository 102 may be implemented using a database or a type of collaborative document management system, such as IBM® Collaboration Solutions or Microsoft® SharePoint®.
  • the document repository 102 may include a multipurpose set of technologies using a common technical infrastructure integrated with a productivity suite, such as Microsoft® Office.
  • the document repository may provide document and file management, collaboration, and other functions.
  • the document repository 102 may include documents 108, address book 110, and a crawler 112.
  • the documents 108 stored in the document repository 102 may include multiple types of documents, such as plain text, Microsoft® Office® compatible documents (e.g., Word®, Excel®, PowerPoint®, Visio®, RTF, etc.), PDF compatible documents, HTML documents, XML documents, documents in another type of document format, or any combination thereof.
  • the documents 108 may include emails. However, in other cases, due to privacy concerns, the documents 108 may exclude emails.
  • the techniques and systems are described herein as techniques and systems for mining documents, excluding emails. However, embodiments include techniques and systems to mine documents, including emails, for relationship information.
  • the address book 110 may include contact information, such as employee names, employee aliases (e.g., nicknames), employee titles, employee addresses (e.g., email addresses, phone numbers, instant messaging addresses, etc.), other employee-related information, or any combination thereof.
  • the crawler 112 may be a software application that automatically and periodically scans the documents 108 to identify the documents 108 that are to be mined for relationship information, such as new, modified, or flagged documents. For example, a user may flag a document for inclusion in, or exclusion from, relationship mining. The crawler 112 may select for relationship mining a document of the documents 108 that has been flagged for inclusion while excluding another document that has been flagged for exclusion from relationship mining.
  • the crawler 112 may be provided by a creator of the document repository 102 to create a search index for the documents in the document repository 102. In such cases, the crawler 112 may be modified to send new and modified documents to the parsers 104.
  • the crawler 112 may send at least a portion of the documents 108 to the parsers 104.
  • the parsers 104 may include a first parser 114 to an Nth parser 116 (where N>1). Each of the parsers 104 may process a particular type of document. For example, the first parser 114 may parse Word® compatible documents, a second parser may parse Excel® compatible documents, a third parser may parse PowerPoint® compatible documents, a fourth parser may parse PDF compatible documents, a fifth parser may parse HTML documents, and so on.
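
The crawler and per-type parsers described above can be sketched as follows. This is a minimal sketch under stated assumptions: the `Doc` record, the `flag` values, the timestamp comparison against `last_crawl`, and the extension-keyed `PARSERS` registry are hypothetical and are not taken from any document repository's API. The parser functions are placeholders that only tag the document name.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    name: str
    modified: float               # last-modified timestamp
    flag: Optional[str] = None    # "include", "exclude", or None


def select_documents(docs, last_crawl):
    """Pick new or modified documents, honoring include/exclude flags."""
    selected = []
    for doc in docs:
        if doc.flag == "exclude":
            continue
        if doc.flag == "include" or doc.modified > last_crawl:
            selected.append(doc)
    return selected


def parse_html(doc):
    return f"html:{doc.name}"     # placeholder for a real HTML parser


def parse_xml(doc):
    return f"xml:{doc.name}"      # placeholder for a real XML parser


# Hypothetical registry: one parser per document type, keyed by extension.
PARSERS = {".html": parse_html, ".xml": parse_xml}


def dispatch(docs):
    """Send each document to the parser for its type; skip unknown types."""
    results = []
    for doc in docs:
        ext = os.path.splitext(doc.name)[1].lower()
        parser = PARSERS.get(ext)
        if parser is not None:
            results.append(parser(doc))
    return results


docs = [
    Doc("a.html", 10.0),
    Doc("b.xml", 1.0),
    Doc("c.xml", 1.0, "include"),
    Doc("d.html", 20.0, "exclude"),
    Doc("e.pdf", 30.0),
]
selected = select_documents(docs, last_crawl=5.0)
parsed = dispatch(selected)
```

Keeping the registry as a plain mapping makes it easy to add an Nth parser without changing the dispatch logic.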
  • the parsers 104 may produce extracted data 120 that is used as the input to mine relationships using the relationship mining module 106.
  • the extracted data 120 may include structured data (e.g., tables), semi-structured data (e.g., lists, XML, JSON, etc.), unstructured data (e.g., data that does not have a pre-defined data model or is not organized in a pre-defined manner), or any combination thereof.
  • certain types of relationships may be found predominantly in certain types of data and the parsers 104 may identify certain types of data (e.g., structured data and semi-structured data) while ignoring other types of data (e.g., unstructured data).
  • the projects that employees are currently working on may be found primarily in structured data and semi-structured data.
  • the parsers 104 may be configured to ignore unstructured data.
  • the extracted data 120 may include tables, lists, metadata (e.g., properties associated with a document such as author, title, date modified, etc.), and context information based on a sequence of the data.
  • context information (e.g., the first page of a PowerPoint® presentation may include a title of the presentation, one or more authors of the presentation, the titles of the authors, etc.)
  • the parsers 104 may look for special formatting characters to identify structured data, such as an indent level, special formatting instructions, etc.
  • the parsers 104 may convert semi-structured data (e.g., lists and other similar data structures) into structured data (e.g., tables).
  • the parsers 104 may first extract various dictionaries, such as a first dictionary 122 to an Mth dictionary 124 (where M>1, M not necessarily equal to N) from the address book 110.
  • the dictionaries 122 to 124 may include the names of people in a company and their corresponding roles.
  • the dictionaries 122 to 124 may be determined based on the address book 110 before the parsers 104 extract structured data and semi-structured data. For example, a dictionary of people names may be compiled using active directory data, and a dictionary of possible project names may be populated by a separate algorithm that extracts acronyms from the documents 108.
  • the dictionaries 122 to 124 may include a people dictionary (e.g., names of employees), a project name dictionary, and a role dictionary (e.g., a current role, such as software designer, technical writer, etc., associated with individual employees).
  • the dictionaries 122 to 124 may be extracted from information in the address book 110.
  • the address book 110 may include the names of employees and their current title (e.g., role).
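
A minimal sketch of deriving a people dictionary and a role dictionary from address book entries is shown below. The entry layout (`name`, `aliases`, `title`) and the function name `build_dictionaries` are assumptions made for illustration; an actual address book may expose different fields.

```python
# Hypothetical address book entries; field names are illustrative only.
ADDRESS_BOOK = [
    {"name": "Robert Smith", "aliases": ["Rob Smith"], "title": "Software Designer"},
    {"name": "Dinesh Patel", "aliases": [], "title": "Technical Writer"},
]


def build_dictionaries(address_book):
    """Return (people, roles) lookup tables built from address book entries."""
    people = {}   # any known spelling, lowercased -> canonical employee name
    roles = {}    # canonical employee name -> current title (role)
    for entry in address_book:
        canonical = entry["name"]
        for spelling in [canonical, *entry.get("aliases", [])]:
            people[spelling.lower()] = canonical
        roles[canonical] = entry.get("title", "")
    return people, roles


people, roles = build_dictionaries(ADDRESS_BOOK)
```

Mapping every alias to one canonical name lets later stages treat "Rob Smith" and "Robert Smith" as the same employee when the address book says so.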
  • the extracted data 120 and the extracted dictionaries 122 to 124 may be used as the input data 118 to the relationship mining module 106.
  • a feature extraction module 126 may extract features from the input data 118.
  • the features extracted by the feature extraction module 126 may include schema names, a ratio of empty cells to non-empty cells in a particular table, a ratio of distinct cells to indistinct cells in a particular table (e.g., determine whether the values in one column are the same or different, e.g., if all cells in one column are distinct, the ratio is 1 (largest value) and if all cells in one column are the same, the ratio is 1/n (n is the row number, this is the smallest value)), a number of lines in each cell in a particular table, a column index, a ratio of digits to characters (e.g., cells with predominately digits may include dates, prices, or other numeric quantities), a ratio of words that start with capital letters to words that start with lowercase letters (e.g., project names may be capitalized), a ratio of words to numbers (e.g., cells with numbers may include
  • the features extracted by the feature extraction module 126 may be used as input to one or more classifiers 128 to determine (e.g., predict) if a column includes project names, roles, people names, etc.
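
Several of the per-column features named above can be computed as in the following sketch. The function name `column_features` and the feature keys are hypothetical. As a deliberate simplification, the sketch reports fractions (for example, empty cells over all cells) rather than the ratios in the text, which avoids division by zero when a column has no non-empty cells or no lowercase words.

```python
def column_features(cells, column_index):
    """Compute simple numeric features for one table column (a list of strings)."""
    n = len(cells)
    non_empty = [c for c in cells if c.strip()]
    text = " ".join(non_empty)
    words = text.split()
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    capitalized = sum(w[0].isupper() for w in words)
    return {
        "column_index": column_index,
        # Share of cells that are empty.
        "empty_fraction": (n - len(non_empty)) / n if n else 0.0,
        # 1.0 when every cell is distinct, 1/n when all cells are identical.
        "distinct_ratio": len(set(cells)) / n if n else 0.0,
        # Largest number of lines found in any single cell.
        "max_lines_per_cell": max((c.count("\n") + 1 for c in non_empty), default=0),
        # Share of digits among digits and letters (dates, prices, quantities).
        "digit_fraction": digits / (digits + letters) if digits + letters else 0.0,
        # Share of words starting with a capital letter (project names).
        "capitalized_fraction": capitalized / len(words) if words else 0.0,
    }


project_column = column_features(["Atlas", "Borealis", "", "Atlas"], column_index=1)
year_column = column_features(["2015", "2016"], column_index=2)
```

Each dictionary can be flattened into a numeric vector before being passed to a classifier.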
  • the classifiers 128 may classify whether a column of a table includes employee names, role names, project names, dates, descriptions, or the like.
  • the classifiers 128 may use a machine-learning algorithm, such as logistic regression (LR), support vector machine, neural network, Bayesian network, or other machine-learning algorithm.
  • the classifiers 128 may be trained during offline training 130 and then perform classification in real-time.
  • training data 132 may be used to perform training 134.
  • the training 134 may include logistic regression (LR) training.
  • in LR training, probabilities describing the possible outcomes are modeled, as a function of the explanatory (predictor) variables, using a logistic function.
  • Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by estimating probabilities.
  • one column may include project names while multiple other columns may include other information (e.g., names of team members, roles of team members, contact information for team members, etc.).
  • the classifiers 128 may include a cost sensitive LR classifier in which wrongly predicted positive results may be given a larger penalty.
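
One way to make logistic regression cost sensitive is to weight the log loss of negative examples, so that predicting a positive for them (a false positive) is penalized more heavily. The sketch below assumes that interpretation; the parameter `fp_cost`, the learning rate, the epoch count, and the toy data are illustrative choices, and the text does not specify the actual training procedure.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train(X, y, fp_cost=1.0, lr=0.1, epochs=2000):
    """Batch gradient descent on a weighted log loss.

    Negative examples (y == 0) are weighted by fp_cost, so a false
    positive costs more than a false negative when fp_cost > 1.
    """
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            weight = fp_cost if yi == 0 else 1.0
            err = weight * (predict_proba(xi, w, b) - yi)
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data: one feature per column; label 1 means "project-name column".
X = [[0.9], [0.8], [0.2], [0.1]]
y = [1, 1, 0, 0]
plain = train(X, y)
cautious = train(X, y, fp_cost=5.0)
```

With an intercept-only model, minimizing this weighted loss gives a positive probability of n_pos / (n_pos + fp_cost * n_neg), which shows how a larger `fp_cost` pushes predictions away from the positive class.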
  • the training 134 may include other types of training instead of LR training.
  • the result of the training 134 using the training data 132 may be the creation of one or more models, such as a named entity recognition (NER) model 136.
  • the NER model 136 is merely used as an example of one type of model. Depending on the implementation, other types of models may be used instead of the NER model 136.
  • One or more filters 138 may filter out noise from the features that are classified by the classifiers 128.
  • the filters 138 may include rule-based filters and include the use of black-lists (e.g., to exclude specified data), white-lists (e.g., to include data specified in the white-list while excluding other data not included in the white-list), or other types of rule-based filters.
  • An example of rules for filtering noise may include (i) a rule to remove any relationships that include date information or time information and (ii) a rule to remove the words in a cell if the words are included in a black-list (e.g., the cell includes only black-listed words).
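
The two example rules can be sketched as a small rule-based filter. The black-list contents, the date and time pattern, and the function names are assumptions; a production filter would likely recognize many more date formats.

```python
import re

# Hypothetical black-list of placeholder words that carry no relationship information.
BLACK_LIST = {"tbd", "n/a", "none", "todo"}

# Simple patterns for dates such as 2015-06-01 or 6/1/2015 and times such as 10:30.
DATE_OR_TIME = re.compile(r"\b\d{1,4}[-/]\d{1,2}[-/]\d{1,4}\b|\b\d{1,2}:\d{2}\b")


def is_noise(cell):
    """Rule (i): the cell contains a date or time. Rule (ii): only black-listed words."""
    if DATE_OR_TIME.search(cell):
        return True
    words = re.findall(r"[\w/]+", cell.lower())
    return bool(words) and all(word in BLACK_LIST for word in words)


def filter_relationships(relationships):
    """Drop any relationship in which at least one field is noise."""
    return [rel for rel in relationships if not any(is_noise(field) for field in rel)]


candidates = [
    ("Ann Lee", "Atlas"),
    ("Ann Lee", "2015-06-01"),
    ("Bo Chen", "TBD"),
    ("Bo Chen", "N/A"),
    ("Cy Diaz", "Meeting 10:30"),
]
kept = filter_relationships(candidates)
```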
  • a disambiguation module 140 may resolve the ambiguities.
  • the names of employees in a large company may include employees with similar names.
  • the similarities may be due to an employee using a nickname or shortening a given name, where the nickname or shortened name is similar or identical to another employee's name.
  • an author of a document may misspell another employee's name in a table or a list that identifies employees working on a particular project where the misspelling is similar or identical to another employee's name.
  • the disambiguation module 140 may resolve ambiguities by looking at one or more relationships, such as the relationships of other employees (e.g., a manager/supervisor, co-workers, etc.) to the ambiguous employee names, the roles associated with the ambiguous employee names, projects associated with the ambiguous employee names, etc.
  • name ambiguity may be resolved by identifying a project associated with each ambiguous name.
  • John Smith may be identified as working on a search engine project while Jon Smith may be identified as working on a productivity suite project.
  • name ambiguity may be resolved by identifying a manager (or supervisor) associated with each ambiguous name.
  • John Smith may be identified as having manager Chris Jones while Jon Smith may be identified as having manager Steve Wilson.
  • name ambiguity may be resolved by identifying a co-worker (e.g., teammate) associated with each ambiguous name.
  • Robert Smith may be identified as having a co-worker Sam Adams working in the same department while Rob Smith may be identified as having a co-worker Dinesh Patel.
  • name ambiguity may be resolved by identifying a role associated with each ambiguous name.
  • John Smith may be identified as having the role of software designer while Jon Smith may be identified as having the role of technical writer.
  • the disambiguation module 140 may use various techniques to determine the identity behind ambiguous names and resolve the ambiguity. Similar techniques may be used to resolve ambiguity for other types of relationships that are being mined.
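
The context-based disambiguation described above can be sketched as a simple scoring scheme: each candidate employee earns a point for every piece of context (manager, role, project) that matches a known profile. The profile layout, the equal weighting, and the rule of returning `None` on ties are assumptions for illustration only.

```python
# Hypothetical profiles for two similarly named employees.
CANDIDATES = {
    "John Smith": {
        "manager": "Chris Jones",
        "role": "software designer",
        "projects": {"search engine"},
    },
    "Jon Smith": {
        "manager": "Steve Wilson",
        "role": "technical writer",
        "projects": {"productivity suite"},
    },
}


def resolve(candidates, context):
    """Return the candidate whose profile best matches the context, or None."""

    def score(profile):
        points = 0
        if context.get("manager") and context["manager"] == profile["manager"]:
            points += 1
        if context.get("role") and context["role"] == profile["role"]:
            points += 1
        if context.get("project") in profile["projects"]:
            points += 1
        return points

    scored = sorted(((score(p), name) for name, p in candidates.items()), reverse=True)
    best_score, best_name = scored[0]
    # Refuse to guess when nothing matched or the top two candidates tie.
    if best_score == 0 or (len(scored) > 1 and scored[1][0] == best_score):
        return None
    return best_name


by_project = resolve(CANDIDATES, {"project": "productivity suite"})
by_manager = resolve(CANDIDATES, {"manager": "Chris Jones"})
```

Returning `None` rather than an arbitrary winner leaves unresolved names available for a later pass or manual review.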
  • a ranking module 142 may rank the relationships that have been identified based on one or more criteria.
  • the ranking module 142 may be implemented as an aggregation algorithm that selects project names from a set of project name candidates (e.g., potential project names).
  • the set of project name candidates may be extracted from the documents 108 before ranking is performed.
  • the ranking module 142 may be implemented as a map/reduce algorithm. For example, an employee may be identified as having relationships with multiple projects. The relationships may be ranked based on a date where a more recent relationship results in a higher ranking (e.g., indicating a relatively current project) while a relationship that has a date in the past may have a lower ranking based on how long ago the employee worked on the project.
  • a date associated with a relationship between an employee and a project that the employee was working on may be determined based on the creation date of a document, the last modified date of a document, another date related to a document from which a relationship between an employee and a project was extracted, or any combination thereof.
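
Date-based ranking can be sketched in a map/reduce style: group the extracted relationships by (employee, project), keep the most recent document date for each group, and sort so that more recent relationships rank higher. The tuple layout and the function name `rank_by_recency` are assumptions; the sketch uses a single date per relationship.

```python
from datetime import date


def rank_by_recency(relationships, today):
    """Rank (person, project) pairs by the most recent date they were seen."""
    latest = {}
    # "Reduce" step: keep the latest date per (person, project) key.
    for person, project, seen in relationships:
        key = (person, project)
        if key not in latest or seen > latest[key]:
            latest[key] = seen
    ranked = sorted(latest.items(), key=lambda item: item[1], reverse=True)
    # Report the age in days so older relationships are visibly lower ranked.
    return [(person, project, (today - seen).days) for (person, project), seen in ranked]


extracted = [
    ("Ann Lee", "Atlas", date(2015, 1, 10)),
    ("Ann Lee", "Borealis", date(2015, 5, 1)),
    ("Ann Lee", "Atlas", date(2015, 3, 1)),
]
ranking = rank_by_recency(extracted, today=date(2015, 6, 1))
```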
  • the ranked relationships 144 may be stored in data storage 146, such as a database or other type of data organizer.
  • the data storage 146 may enable the relationships 144 to be searched, sorted, etc. For example, a manager assembling a team to work on a new project may search the data storage 146 to identify employees with an expertise in particular technology areas and may use the rankings to identify employees with recent experience in the particular technology areas.
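
As one possible realization of a searchable data storage, the sketch below keeps relationships in an in-memory SQLite database and searches them by project keyword, newest first. The table schema, column names, and sample rows are hypothetical; the text does not prescribe a particular database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relationships (person TEXT, project TEXT, role TEXT, last_seen TEXT)"
)
conn.executemany(
    "INSERT INTO relationships VALUES (?, ?, ?, ?)",
    [
        ("John Smith", "search engine", "software designer", "2015-05-01"),
        ("Jon Smith", "productivity suite", "technical writer", "2015-04-01"),
        ("Robert Smith", "search engine", "software designer", "2014-01-15"),
    ],
)


def find_people(connection, term):
    """Return (person, project, last_seen) rows whose project matches the term."""
    cursor = connection.execute(
        "SELECT person, project, last_seen FROM relationships "
        "WHERE project LIKE ? ORDER BY last_seen DESC",
        (f"%{term}%",),
    )
    return cursor.fetchall()


results = find_people(conn, "search")
```

Storing dates as ISO 8601 strings keeps the `ORDER BY` clause consistent with chronological order.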
  • the crawler 112 may identify new and modified documents in the document repository 102.
  • the identified documents may be parsed based on a type of each document, resulting in structured data to be used for relationship mining.
  • the parsers 104 may convert semi-structured data into structured data.
  • Features e.g., relationships
  • the features may be filtered to remove noise.
  • Ambiguous portions of the data may be disambiguated.
  • the relationships may be ranked based on specified criteria and then stored in the data storage 146. In this way, relationships between different entities may be mined from data in documents. For example, enterprise documents may be mined to identify which projects an employee has worked on, including past projects and current projects.
  • each block represents one or more operations that can be implemented in hardware, software, or a combination thereof.
  • the blocks represent computer-executable instructions that, when executed by one or more processors, cause the processors to perform the recited operations.
  • computer-executable instructions include routines, programs, objects, modules, components, data structures, and the like that perform particular functions or implement particular abstract data types.
  • the order in which the blocks are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.
  • the processes 200, 300, 400, 500, and 600 are described with reference to the framework 100, as described above, although other models, frameworks, systems, architectures, and environments may implement these processes.
  • FIG. 2 is a flow diagram of an example process 200 that includes processing structured and semi-structured data according to some embodiments.
  • the process 200 may be performed by the parsers 104, by modules in the relationship mining module 106, or both.
  • the process 200 therefore extracts relationship information from the metadata, the semi-structured data, and the structured data because in the majority of documents, most of the relationship information may be included in the metadata, the semi-structured data, and the structured data.
  • Metadata may include properties associated with a document, such as an author name, date of creation, last modified date, a title of the document, etc. Metadata may also include a first page of a presentation that includes a title and an author of the presentation.
  • While metadata is a form of structured data, metadata is usually not found in the body of the document. Metadata is usually included in the properties (or other embedded data) of the document or in a title page of the document, and therefore may be processed differently from structured data that is found in the body of the document.
  • one or more documents may be received.
  • metadata associated with the document may be processed.
  • the metadata may include (i) properties associated with a document, (ii) the first slide of a PowerPoint® presentation, and (iii) other locations that include information associated with the document, such as a title of the document, an author of the document, a creation date of the document, a last modified date of the document, other information associated with the document, or any combination thereof.
  • the metadata may be processed by extracting an author of the document and a title of the document from the metadata to identify a relationship between the author and a title of a document.
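
A minimal sketch of turning document properties into an author-to-title relationship follows. The property keys (`author`, `title`, `last_modified`, `created`) and the output layout are assumptions; real document formats expose metadata under different names.

```python
def relationship_from_metadata(properties):
    """Build an author/title relationship from a document's properties, if possible."""
    author = (properties.get("author") or "").strip()
    title = (properties.get("title") or "").strip()
    if not author or not title:
        return None  # not enough metadata to form a relationship
    return {
        "person": author,
        "document_title": title,
        # Prefer the last modified date; fall back to the creation date.
        "date": properties.get("last_modified") or properties.get("created"),
    }


example = relationship_from_metadata(
    {"author": " Ann Lee ", "title": "Atlas design review", "created": "2015-06-01"}
)
```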
  • the document may be parsed to identify semi-structured data (e.g., lists) and structured data (e.g., tables).
  • the semi-structured data may include lists, such as distribution lists.
  • an email distribution list for a project may identify a name of the project, the members of the project, the roles of each member of the project, other project-related information, or any combination thereof.
  • Semi-structured data may proceed to 208, where the semi-structured data is converted to structured data.
  • lists may be converted into tables or other structured data.
  • Structured data identified at 206 may proceed to 210. For example, in FIG.
  • the parsers 104 may receive the documents 108 stored in the document repository 102 and parse the documents 108 to identify and extract metadata, semi-structured data, and structured data.
  • the parsers 104 may convert the semi-structured data into structured data.
  • a first parser may parse the document to identify the metadata (e.g., the properties of the document and a first page of the document) and extract an author name, a document title, and other information, at 204.
  • a second parser may parse the document to identify semi-structured data (e.g., lists, etc.) and structured data (e.g., tables, etc.). The second parser may convert the semi-structured data into structured data.
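  • The conversion of a list into a table can be sketched as follows. This is a minimal illustration, not the parsers 104 themselves: the function name `list_to_table`, the "Name - Role" item layout, and the separator characters are all assumptions made for the example.

```python
import re


def list_to_table(items, header=("name", "role")):
    """Convert 'Name - Role' style list items into table rows.

    Hypothetical helper: the item layout and the separators are assumptions.
    """
    rows = [list(header)]
    for item in items:
        # Split once on the first dash, colon, or comma between the two fields.
        parts = re.split(r"\s*[-:,]\s*", item.strip(), maxsplit=1)
        # Keep only items that yield two non-empty fields.
        if len(parts) == 2 and all(parts):
            rows.append(parts)
    return rows


distribution_list = [
    "Sam Smith - lead software developer",
    "Ana Lee: tester",
    "malformed entry",
]
table = list_to_table(distribution_list)
```

  • Items that do not split into two fields are skipped rather than guessed at. Hyphenated names would be split incorrectly by this simple separator rule, so a real parser would need a more careful grammar.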
  • the structured data (e.g., from 206 and 208) is processed to mine (e.g., identify and extract) relationship information.
  • the process of mining relationship information from structured data is described in more detail in FIG. 3.
  • the feature extraction module 126 may extract features (e.g., ratio of words to numbers in each cell of table) and the classifiers 128 may use the features as input to determine which column is predicted to include project names, which column is predicted to include people names, which column is predicted to include role names, etc.
  • the relationship information extracted from the structured data (e.g., from 210) and the metadata (e.g., from 204) may be filtered to remove noise.
  • the relationships may be stored.
  • the filters 138 may be used to filter the identified relationships to remove noise and the filtered relationships stored in the data storage 146.
  • parsers may identify and extract the metadata, the semi-structured data, and the structured data from documents.
  • the semi-structured data may be converted into structured data.
  • the structured data may be processed (e.g., by identifying and classifying relationships) to extract relationship information.
  • the relationship information extracted from the metadata and from the structured data may be filtered and stored to enable the relationship information to be searched, sorted, etc.
  • FIG. 3 is a flow diagram of an example process 300 to extract relationships from structured data according to some embodiments.
  • the process 300 may be performed by modules of the relationship mining module 106, such as, for example, the feature extraction module 126, the classifiers 128, or both.
  • structured data (e.g., a table) may be received.
  • the schema of templates used for structured data may be identified (e.g., by the parsers 104 of FIG. 1) and stored in a template dictionary 306 (e.g., one of the dictionaries 122 to 124).
  • a determination is made, using the template dictionary 306, whether the structured data 302 is based on a template (e.g., the template may be used to create the structure of the structured data 302).
  • the template-based structured data is processed at 308, and the relationships may be stored, at 214.
  • the relationships may be filtered and disambiguation of terms (e.g., proper names) may be performed. For example, if a schema of the structured data 302 matches a previously extracted schema, then the structured data 302 may be determined to have been created based on a template.
  • the schema of the structured data 302 may correspond to a previously extracted schema in which a first column includes people names, a second column includes role names, and a third column includes project names.
  • the people names and their corresponding roles and projects may be extracted from columns one, two, and three, respectively, of the structured data 302, and the relationships "<person's name> has the role of <role name>" and "<person's name> has worked on project <project name>" may be stored.
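  • Template-based extraction can be sketched as a lookup of the table's heading row in a dictionary of known schemas. The names `TEMPLATE_DICTIONARY` and `extract_template_relationships`, and the idea of keying templates by lowercase headings, are assumptions for illustration only.

```python
# Hypothetical template dictionary: heading row -> kind of data in each column.
TEMPLATE_DICTIONARY = {
    ("name", "role", "project"): ("person", "role", "project"),
}


def extract_template_relationships(table, templates=TEMPLATE_DICTIONARY):
    """Return relationship strings if the table matches a known template."""
    header = tuple(heading.strip().lower() for heading in table[0])
    if header not in templates:
        return None  # Not template-based; other steps would handle the table.
    kinds = templates[header]
    relationships = []
    for row in table[1:]:
        record = dict(zip(kinds, row))
        relationships.append(
            f"{record['person']} has the role of {record['role']}"
        )
        relationships.append(
            f"{record['person']} has worked on project {record['project']}"
        )
    return relationships
```

  • Returning `None` for a non-matching schema keeps the template path separate from the dictionary-based and classifier-based paths described for the process 300.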
  • the people dictionary 312 may be created by the parsers 104 based on parsing the address book 110. For example, the contents of a cell of a table may be compared with contents of the people dictionary 312. If the contents of the cell of the table include a name that is included in the people dictionary 312, then a column of the table that includes the cell may include the names of people (e.g., employees). In this way, columns of a table that include the names of people may be determined using the people dictionary 312. Similar principles apply to identifying other types of relationships.
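  • A dictionary lookup of this kind might look like the sketch below. The function name `find_people_columns` and the `min_fraction` threshold are assumptions; the text only says that a cell matching the people dictionary suggests that its column holds people names.

```python
def find_people_columns(table, people_dictionary, min_fraction=0.5):
    """Return indexes of columns whose cells mostly match known people names.

    Assumes the first row holds headings; min_fraction is a hypothetical knob.
    """
    people = {name.lower() for name in people_dictionary}
    data_rows = table[1:]
    columns = []
    for index in range(len(table[0])):
        cells = [
            row[index].strip().lower() for row in data_rows if index < len(row)
        ]
        if not cells:
            continue
        hits = sum(cell in people for cell in cells)
        if hits / len(cells) >= min_fraction:
            columns.append(index)
    return columns
```

  • The same shape of lookup could be reused with a role dictionary to find role columns.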
  • the process 300 may end. If the structured data 302 includes the names of people, then the structured data 302 may include relationship information, such as the roles of the people or the projects that the people are working on.
  • the process 300 proceeds to 314, where a determination is made, using a role dictionary 316, whether the structured data 302 includes roles of people.
  • the parsers 104 may extract the role dictionary from the address book 110. The contents of a cell of a table may be compared to the contents of the role dictionary to determine if the cell includes a role name.
  • the process 300 proceeds to 318, where the structured data that includes the roles is processed and the resulting relationship information stored, at 214.
  • the relationship between an employee and a role of the employee (e.g., Sam Smith is lead software developer) may describe what the employee is currently working on, resulting in the relationship being identified and stored.
  • 314 may be omitted, e.g., in response to determining, at 310, that the structured data 302 includes the names of people, the process 300 may proceed to 320 to determine whether the structured data 302 includes project names.
  • the process 300 proceeds to 320, where a determination is made whether the structured data 302 includes project names.
  • features may be extracted from each cell of a table and the features (e.g., ratio of acronyms to non-acronyms, ratio of words to numbers, etc.) may be used as input to a classifier that has been trained to predict which column (or row) of the table includes project names.
  • the classifier may determine (e.g., predict) that a particular column (or row) includes project names based on the features, e.g., the column includes more acronyms than non-acronyms, the column includes more letters than numbers, etc.
  • the classifier may determine (e.g., predict) that a particular column (or row) does not include project names when the features indicate that each cell includes more numbers (e.g., dates of project milestones) than letters, etc. If a determination is made, at 320, that the structured data includes project names, then the process 300 proceeds to 322, where the structured data 302 that includes the names of people and project names is processed, and the resulting relationship information stored, at 214.
  • the relationship between an employee and a project (e.g., Sam Smith is a team member working on the image-based search engine project) may describe what the employee is currently working on, resulting in the relationship being identified and stored.
  • the contents of a cell of a table are included in the people dictionary 312, then the contents are determined to be a person's name.
  • Using the people dictionary 312 (extracted from the address book 110 in FIG. 1) enables the feature extraction module 126 and the classifiers 128 to identify names of people relatively quickly and easily in the structured data 302. Identifying the project names at 320 may be comparatively harder. To identify which portions of the structured data include project names, determining a schema of the structured data may be useful. For example, the first row of a table usually identifies a schema of the table, as the first row may include headings describing the contents of each column. Thus, the schema may be used to identify which columns of the table include people names, which columns include roles, and which columns include project names.
  • the features extracted by the feature extraction module 126 to determine whether structured data 302 includes project names (or other project-related information) may include schemas, schema names, a ratio of empty cells to non-empty cells in a particular table, a ratio of distinct cells to indistinct cells in a particular table, a number of lines in each cell in a particular table, a column index, a ratio of digits to characters (e.g., cells with predominately digits may include dates, prices, or other numeric quantities), a ratio of words that start with capital letters to words that start with lowercase letters (e.g., project names may be capitalized), a ratio of words to numbers (e.g., cells with numbers may include dates, prices, or other numeric quantities rather than names, roles, project names, etc.), a ratio of acronyms to non-acronyms (e.g., acronyms are often used to abbreviate projects that employees are working on), a ratio of universal resource locators (URLs) to non-URLs
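  • Several of the per-cell ratios listed above can be computed with a few lines of code. The sketch below is an assumption-laden illustration: the function names, the token rules for "number" and "acronym", and the choice to treat a zero denominator as 1 are not taken from the source, and "digits to characters" is approximated as digits to letters.

```python
import re


def safe_ratio(a, b):
    # Assumption: treat a zero denominator as 1 so every feature stays finite.
    return a / max(b, 1)


def cell_features(cell):
    """Compute a few ratio features for the text of one table cell."""
    tokens = cell.split()
    digits = sum(ch.isdigit() for ch in cell)
    letters = sum(ch.isalpha() for ch in cell)
    # A "word" is any token with a letter; a "number" is digits plus . , / -
    words = [t for t in tokens if any(ch.isalpha() for ch in t)]
    numbers = [
        t for t in tokens
        if re.fullmatch(r"[\d.,/-]+", t) and any(ch.isdigit() for ch in t)
    ]
    # An "acronym" is assumed to be two or more uppercase letters.
    acronyms = [t for t in words if re.fullmatch(r"[A-Z]{2,}", t)]
    capitalized = [t for t in words if t[0].isupper()]
    lowercase = [t for t in words if t[0].islower()]
    urls = [t for t in tokens if re.match(r"https?://", t)]
    return {
        "lines": cell.count("\n") + 1,
        "digits_to_letters": safe_ratio(digits, letters),
        "capitalized_to_lowercase": safe_ratio(len(capitalized), len(lowercase)),
        "words_to_numbers": safe_ratio(len(words), len(numbers)),
        "acronyms_to_non_acronyms": safe_ratio(
            len(acronyms), len(words) - len(acronyms)
        ),
        "urls_to_non_urls": safe_ratio(len(urls), len(tokens) - len(urls)),
    }
```

  • A feature dictionary like this could be flattened into a vector and passed to whatever trained classifier predicts the column type; the classifier itself is outside the scope of this sketch.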
  • the process 300 illustrates how the relationship mining module 106 of FIG. 1 may identify specific relationships, such as roles associated with employees or projects associated with employees.
  • the process 300 may be applied to identify other types of relationships, such as a relationship between X (e.g., employee) and Y (e.g., role) or a relationship between X (e.g., employee) and Z (e.g., project).
  • a determination may be made whether the structured data 302 includes X. If the structured data includes X, then at 314, a determination may be made whether the structured data 302 includes Y. If the structured data 302 includes X and Y, then the relationship between X and Y may be stored. If the structured data includes X, then at 320, a determination may be made whether the structured data 302 includes Z. If the structured data 302 includes X and Z, then the relationship between X and Z may be stored.
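  • The generalized X, Y, Z flow can be sketched with pluggable detectors. The names `mine_relationships` and `header_detector` are hypothetical, and detecting a column from its heading alone is a simplification; the detectors could equally be dictionary lookups or classifiers.

```python
def header_detector(*names):
    """Build a detector that finds a column by its heading (an assumption)."""
    wanted = {name.lower() for name in names}

    def detect(table):
        for index, heading in enumerate(table[0]):
            if heading.strip().lower() in wanted:
                return index
        return None

    return detect


def mine_relationships(table, detect_x, other_detectors):
    """Return (x, label, value) triples for every detected X-to-other pairing."""
    x_column = detect_x(table)
    if x_column is None:
        return []  # Without X there is no relationship to mine.
    relationships = []
    for label, detect in other_detectors.items():
        column = detect(table)
        if column is None:
            continue
        for row in table[1:]:
            relationships.append((row[x_column], label, row[column]))
    return relationships
```

  • For example, passing `{"role": header_detector("role"), "project": header_detector("project")}` as `other_detectors` mirrors the checks at 314 and 320.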
  • documents may be analyzed to identify relationships by extracting features from structured data and classifying the features using one or more classifiers.
  • Semi-structured data may be converted into structured data prior to being processed.
  • Parsers may create multiple dictionaries that are used to identify which portions of the structured data include particular types of information.
  • the relationships that are identified may be stored to enable the information to be searched, sorted, etc. Identifying projects which employees in a company are each working on is an example of the type of relationships that may be mined from documents in a document repository. Of course, other types of relationships may be mined using the techniques and systems described herein.
  • FIG. 4 is a flow diagram of an example process 400 that includes receiving structured data and one or more dictionaries according to some embodiments.
  • the process 400 may be performed by the relationship mining module 106 of FIG. 1.
  • structured data and one or more dictionaries may be received.
  • the structured data and the one or more dictionaries may be extracted from one or more documents.
  • the relationship mining module 106 may receive the input data 118 that includes the extracted data 120 (e.g., structured data) and the dictionaries 122 to 124.
  • the feature extraction module 126 may determine that a first column of a table includes names of people (e.g., by comparing contents of a cell of the table with names in a people dictionary) and that the second column of the table includes the names of projects that the people are working on (e.g., a classifier may use features extracted from a cell of the table to predict that the second column includes project names), thereby determining a relationship, e.g., that a person named X (e.g., John Smith) is working on a project named Y (e.g., Search Engine for Images).
  • disambiguation is performed for at least one of the first data or the second data.
  • the disambiguation module 140 may be used to distinguish between people names that are similar or identical in the structured data. To illustrate, disambiguation may be used to differentiate between people names "John Smith,” “Jon Smith,” and "Johnny Smith.”
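  • The source does not say how the disambiguation module 140 works. One possible first step, sketched below, is to flag names that are similar enough to need disambiguation, here using string similarity from Python's standard `difflib` module. The function name `group_similar_names` and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def group_similar_names(names, threshold=0.8):
    """Group names whose similarity to a group's first name meets the threshold.

    The groups are only candidates for disambiguation; deciding whether two
    similar names refer to the same person would need additional context.
    """
    groups = []
    for name in names:
        for group in groups:
            similarity = SequenceMatcher(
                None, name.lower(), group[0].lower()
            ).ratio()
            if similarity >= threshold:
                group.append(name)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([name])
    return groups
```

  • Each resulting group could then be resolved using other signals, such as the address book 110 or the document in which the name appeared.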
  • a rank is associated with the relationship based on when the relationship occurred.
  • the ranking module 142 may be used to rank the relationships based on when each relationship occurred.
  • current relationships may be more relevant than older relationships and therefore a current relationship may be ranked higher than previous relationships.
  • a current relationship may have a rank of 10
  • a year old relationship may have a rank of 9, and so on, with relationships that are 9 or more years old having a rank of 1.
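  • The age-based ranking in this example reduces to a small formula: the rank is 10 minus the age in whole years, with a floor of 1. The function name `rank_relationship` and the use of calendar dates are assumptions for the sketch.

```python
from datetime import date


def rank_relationship(relationship_date, today):
    """Rank 10 for a current relationship, down to 1 at nine or more years old."""
    # Whole years elapsed, subtracting one if the anniversary has not occurred.
    years_old = today.year - relationship_date.year - (
        (today.month, today.day)
        < (relationship_date.month, relationship_date.day)
    )
    return max(1, 10 - max(0, years_old))
```

  • Clamping with `max` keeps future-dated relationships at rank 10 and very old relationships at rank 1.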
  • the relationship may be stored in a database that includes additional relationships.
  • the relationships 144 may be stored in the data storage 146.
  • a search of the database is performed using one or more search terms.
  • the search results are displayed.
  • the search engine 720 may be used to search the relationships 144 and display the search results 722.
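  • A search over stored relationships could be as simple as the sketch below, which matches every search term against the fields of each relationship and orders the results by rank. The record layout (`person`, `relation`, `target`, `rank`) and the function name `search_relationships` are assumptions, not the design of the search engine 720.

```python
def search_relationships(relationships, *terms):
    """Return relationships matching all terms, highest rank first."""
    wanted = [term.lower() for term in terms]

    def matches(record):
        fields = [
            str(record[key]).lower() for key in ("person", "relation", "target")
        ]
        # Every search term must appear in at least one field.
        return all(any(term in field for field in fields) for term in wanted)

    results = [record for record in relationships if matches(record)]
    return sorted(results, key=lambda record: record["rank"], reverse=True)
```

  • Sorting by rank places current relationships ahead of older ones, in line with the ranking described above.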
  • parsers may extract structured data and convert semi-structured data to structured data and send the structured data to a relationship mining module.
  • Features may be extracted and classified using a classifier. For example, features of the contents of each cell of a table may be classified to identify which column includes the names of people and which column includes project names (or role names). Relationships, such as which people are working on which projects, may be determined. The relationships may be filtered, disambiguated for data types where ambiguity is possible, ranked according to when each relationship occurred, and stored in a searchable database.
  • FIG. 5 is a flow diagram of an example process 500 that includes receiving structured data that includes a table according to some embodiments.
  • the process 500 may be performed by the relationship mining module 106 of FIG. 1.
  • the process 500 assumes that the table is arranged such that columns are categories, and being in the same row connotes some kind of relationship. However, it should be understood that the process 500 may be applied to tables in which rows identify categories, and columns indicate relationships, by changing "row" to "column” and "column” to “row” in the process 500.
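  • Rather than rewriting the process for row-oriented tables, a table can be transposed so that categories always end up in columns. A minimal sketch follows, assuming the table is a list of equal-length rows; the function name `transpose` is illustrative.

```python
def transpose(table):
    """Swap rows and columns of a rectangular table (a list of rows)."""
    # zip(*table) pairs up the i-th cell of every row into one new row.
    return [list(column) for column in zip(*table)]
```

  • Note that `zip` stops at the shortest row, so ragged tables would need padding before being transposed.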
  • structured data that includes a table may be received from one or more document parsers.
  • the relationship mining module 106 may receive the input data 118 that includes the extracted data 120 (e.g., structured data) and the dictionaries 122 to 124.
  • a relationship between first contents of the first column of the table and second contents of the second column of the table is determined.
  • the feature extraction module 126 and the classifiers 128 may determine that a first column of a table includes names of people (e.g., by determining that the contents of a cell include a name included in a people dictionary) and that the second column of the table includes the names of projects that the people are working on (e.g., a classifier predicts, based on features extracted from a cell of the table, that the column includes project names), thereby determining the relationship between a person named X (e.g., John Smith) and a project named Y (e.g., Search Engine for Images) that the person is working on, e.g., the relationship "X is working on Y."
  • the relationship between the first contents of the first column and the second contents of the second column may be stored in a database.
  • the relationships 144 may be stored in the data storage 146.
  • a search of the database is performed using one or more search terms.
  • the search results are displayed.
  • the search engine 720 may be used to search the relationships 144 and display the search results 722.
  • parsers may extract structured data and convert semi-structured data to structured data and send the structured data to a relationship mining module.
  • Features may be extracted and classified using a classifier. For example, features of the contents of each cell of a table may be classified to identify which column includes the names of people and which column includes project names (or role names). Relationships, such as which people are working on which projects, may be determined. The relationships may be filtered, disambiguated for data types where ambiguity is possible, ranked according to when each relationship occurred, and stored in a searchable database.
  • FIG. 6 is a flow diagram of an example process 600 that includes receiving structured data extracted from documents according to some embodiments.
  • the process 600 may be performed by the relationship mining module 106 of FIG. 1.
  • structured data extracted from documents stored in a shared document repository may be received.
  • the relationship mining module 106 may receive the input data 118 that includes the extracted data 120 (e.g., structured data) and the dictionaries 122 to 124.
  • the input data 118 may be extracted by the parsers 104 from the documents 108 in the document repository 102.
  • a plurality of relationships between the first data and the second data are determined. For example, in FIG.
  • the feature extraction module 126 and the classifiers 128 may determine that a first column of a table includes names of people (e.g., by determining that the contents of a cell include a name included in a people dictionary) and that the second column of the table includes the names of projects that the people are working on (e.g., a classifier predicts, based on features extracted from a cell of the table, that the column includes project names), thereby determining a relationship, e.g., that a person named X (e.g., John Smith) is working on a project named Y (e.g., Search Engine for Images).
  • the plurality of relationships are filtered by removing noise to create filtered relationships.
  • the filters 138 may be used to remove noise from the classified features (e.g., predicting which column of a table includes project names).
  • the filtered relationships are ranked based on a date associated with individual relationships of the filtered relationships.
  • the ranking module 142 may be used to rank the relationships based on when each relationship occurred. To illustrate, current relationships may be more relevant than older relationships and therefore a current relationship may be ranked higher than previous relationships.
  • the filtered and ranked relationships may be stored in a database.
  • the relationships 144 may be stored in the data storage 146 (e.g., in the form of a graph index that includes information connecting a name of a person with the document from which the relationship was extracted).
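  • A graph index of this kind can be approximated in memory with an adjacency list keyed by person name, where each edge also records its source document. The class name `RelationshipIndex` and its methods are hypothetical and do not describe the data storage 146 itself.

```python
from collections import defaultdict


class RelationshipIndex:
    """Tiny in-memory graph index: person -> edges with source documents."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, person, relation, target, document):
        # Each edge keeps the document from which the relationship came.
        self._edges[person].append(
            {"relation": relation, "target": target, "document": document}
        )

    def relationships_for(self, person):
        return list(self._edges.get(person, []))

    def documents_for(self, person):
        edges = self._edges.get(person, [])
        return sorted({edge["document"] for edge in edges})
```

  • Keeping the document on each edge lets a search result link back to the evidence for a relationship.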
  • a search of the database is performed using one or more search terms.
  • the search results are displayed.
  • the search engine 720 may be used to search the relationships 144 and display the search results 722.
  • the extracted relationship information may be displayed in a user interface (UI) to enable individual employees to confirm that a set of relationships (e.g., projects with which the employee has been involved) are to be associated with the employee's name.
  • a manager or other employee may select the expertise areas for individual employees using a standardized set of expertise areas.
  • a software company may standardize the expertise area "software designer" for all employees who have written software code to enable consistent search results. Without standardization, search results for the term "software designer” may not include "software engineer,” "computer programmer,” "software developer,” etc.
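  • Standardizing expertise areas amounts to mapping each variant term onto one canonical term before storing or searching. In the sketch below, the mapping table `STANDARD_EXPERTISE` and the function name `standardize_expertise` are assumptions; the variant terms are the ones named in the example above.

```python
# Hypothetical mapping from variant terms to the standardized expertise area.
STANDARD_EXPERTISE = {
    "software engineer": "software designer",
    "computer programmer": "software designer",
    "software developer": "software designer",
    "software designer": "software designer",
}


def standardize_expertise(term):
    """Map a term to its standardized expertise area, if one is defined."""
    normalized = term.strip().lower()
    # Unknown terms pass through unchanged (in normalized form).
    return STANDARD_EXPERTISE.get(normalized, normalized)
```

  • Applying the same function to stored expertise areas and to incoming search terms would let a search for "software designer" also match employees recorded as "software engineer".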
  • parsers may extract structured data and convert semi-structured data to structured data and send the structured data to a relationship mining module.
  • Features may be extracted and classified using a classifier. For example, features of the contents of each cell of a table may be classified to identify which column includes the names of people and which column includes project names (or role names). Relationships, such as which people are working on which projects, may be determined. The relationships may be filtered, disambiguated for data types where ambiguity is possible, ranked according to when each relationship occurred, and stored in a searchable database.
  • FIG. 7 illustrates an example configuration of a computing device 700 and environment that can be used to implement the modules and functions described herein.
  • the computing device 700 may include at least one processor 702, a memory 704, communication interfaces 706, a display device 708, other input/output (I/O) devices 710, and one or more mass storage devices 712, able to communicate with each other, such as via a system bus 714 or other suitable connection.
  • the processor 702 may be a single processing unit or a number of processing units, all of which may include single or multiple computing units or multiple cores.
  • the processor 702 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions.
  • the processor 702 can be configured to fetch and execute computer-readable instructions stored in the memory 704, mass storage devices 712, or other computer-readable media.
  • Memory 704 and mass storage devices 712 are examples of computer storage media for storing instructions which are executed by the processor 702 to perform the various functions described above.
  • memory 704 may generally include both volatile memory and non-volatile memory (e.g., RAM, ROM, or the like).
  • mass storage devices 712 may generally include hard disk drives, solid-state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), a storage array, a network attached storage, a storage area network, or the like.
  • Both memory 704 and mass storage devices 712 may be collectively referred to as memory or computer storage media herein, and may be media capable of storing computer-readable, processor-executable program instructions as computer program code that can be executed by the processor 702 as a particular machine configured for carrying out the operations and functions described in the implementations herein.
  • the computing device 700 may also include one or more communication interfaces 706 for exchanging data with other devices, such as via a network, direct connection, or the like, as discussed above.
  • the communication interfaces 706 can facilitate communications within a wide variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet and the like.
  • Communication interfaces 706 can also provide communication with external storage (not shown), such as in a storage array, network attached storage, storage area network, or the like.
  • a display device 708, such as a monitor, may be included in some implementations for displaying information and images to users.
  • Other I/O devices 710 may be devices that receive various inputs from a user and provide various outputs to the user, and may include a keyboard, a remote controller, a mouse, a printer, audio input/output devices, and so forth.
  • Memory 704 may include modules and components for context-based object retrieval according to the implementations herein.
  • memory 704 includes the document repository 102 including the documents 108 that are parsed by the parsers 104.
  • the metadata, semi-structured data, and structured data extracted by the parsers 104 may be processed by the relationship mining module 106 to identify the relationships 144.
  • Memory 704 may further include one or more other modules 716, such as an operating system, drivers, communication software, or the like. Memory 704 may also include other data 718, such as data stored while performing the functions described above and data used by the other modules 716.
  • the memory 704 may include a search engine 720 that may be used to enter search terms to search the stored relationships 144 and provide search results 722.
  • module can represent program code (and/or declarative-type instructions) that performs specified tasks or operations when executed on a processing device or devices (e.g., CPUs or processors).
  • the program code can be stored in one or more computer-readable memory devices or other computer storage devices.
  • computer-readable media includes, at least, two types of computer-readable media, namely computer storage media and communications media.
  • Computer storage media includes volatile and non-volatile, removable and nonremovable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.
  • communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism.
  • computer storage media does not include communication media.
  • this disclosure provides various example implementations, as described and as illustrated in the drawings. However, this disclosure is not limited to the implementations described and illustrated herein, but can extend to other implementations, as would be known or as would become known to those skilled in the art. Reference in the specification to "one implementation,” “this implementation,” “these implementations” or “some implementations” means that a particular feature, structure, or characteristic described is included in at least one implementation, and the appearances of these phrases in various places in the specification are not necessarily all referring to the same implementation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
PCT/US2016/035412 2015-06-12 2016-06-02 Identifying relationships using information extracted from documents WO2016200667A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510328707.X 2015-06-12
CN201510328707.XA CN106294520B (zh) 2015-06-12 2015-06-12 使用从文档提取的信息来标识关系

Publications (1)

Publication Number Publication Date
WO2016200667A1 true WO2016200667A1 (en) 2016-12-15

Family

ID=56118084

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2016/035412 WO2016200667A1 (en) 2015-06-12 2016-06-02 Identifying relationships using information extracted from documents

Country Status (2)

Country Link
CN (1) CN106294520B (zh)
WO (1) WO2016200667A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170109333A1 (en) * 2015-10-15 2017-04-20 International Business Machines Corporation Criteria modification to improve analysis
WO2021028855A1 (en) * 2019-08-15 2021-02-18 Collibra Nv Classification of data using aggregated information from multiple classification modules

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133208B (zh) * 2017-03-24 2021-08-24 南京柯基数据科技有限公司 一种实体抽取的方法及装置
CN107491530B (zh) * 2017-08-18 2021-05-04 四川神琥科技有限公司 一种基于文件自动标记信息的社会关系挖掘分析方法
CN109739858B (zh) * 2018-12-29 2021-08-17 华立科技股份有限公司 基于ansi c12.19的数据分类存储方法、装置和电子设备
CN109933692B (zh) * 2019-04-01 2022-04-08 北京百度网讯科技有限公司 建立映射关系的方法和装置、信息推荐的方法和装置
CN110472209B (zh) * 2019-07-04 2024-02-06 深圳同奈信息科技有限公司 基于深度学习的表格生成方法、装置和计算机设备
US11495038B2 (en) * 2020-03-06 2022-11-08 International Business Machines Corporation Digital image processing
CN111461537A (zh) * 2020-03-31 2020-07-28 山东胜软科技股份有限公司 一种基于油气生产数据的分类的量数方法及控制系统
CN112882993A (zh) * 2021-03-22 2021-06-01 申建常 一种资料查找方法及查找系统

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070011183A1 (en) * 2005-07-05 2007-01-11 Justin Langseth Analysis and transformation tools for structured and unstructured data
WO2009086312A1 (en) * 2007-12-21 2009-07-09 Kondadadi, Ravi, Kumar Entity, event, and relationship extraction
US20090300043A1 (en) * 2008-05-27 2009-12-03 Microsoft Corporation Text based schema discovery and information extraction

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8423579B2 (en) * 2008-10-29 2013-04-16 International Business Machines Corporation Disambiguation of tabular date
US20150006415A1 (en) * 2013-06-27 2015-01-01 Successfactors, Inc. Systems and Methods for Displaying and Analyzing Employee History Data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANTHONY FADER ET AL: "Identifying Relations for Open Information Extraction", PROCEEDINGS OF THE 2011 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, 27 July 2011 (2011-07-27), pages 1535 - 1545, XP055292735, ISBN: 978-1-937284-11-4 *
GERHARD WEIKUM ET AL: "Database and information-retrieval methods for knowledge discovery", COMMUNICATIONS OF THE ACM, vol. 52, no. 4, 1 April 2009 (2009-04-01), pages 56 - 64, XP058033726, ISSN: 0001-0782, DOI: 10.1145/1498765.1498784 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170109333A1 (en) * 2015-10-15 2017-04-20 International Business Machines Corporation Criteria modification to improve analysis
US10083161B2 (en) * 2015-10-15 2018-09-25 International Business Machines Corporation Criteria modification to improve analysis
WO2021028855A1 (en) * 2019-08-15 2021-02-18 Collibra Nv Classification of data using aggregated information from multiple classification modules
US11138477B2 (en) 2019-08-15 2021-10-05 Collibra Nv Classification of data using aggregated information from multiple classification modules
JP2022535165A (ja) * 2019-08-15 2022-08-04 Collibra NV Data classification using information aggregated from multiple classification modules
AU2020327704B2 (en) * 2019-08-15 2022-11-10 Collibra Belgium Bv Classification of data using aggregated information from multiple classification modules

Also Published As

Publication number Publication date
CN106294520A (zh) 2017-01-04
CN106294520B (zh) 2019-11-12

Similar Documents

Publication Publication Date Title
US10489454B1 (en) Indexing a dataset based on dataset tags and an ontology
US10839021B2 (en) Knowledge operating system
Johann et al. Safe: A simple approach for feature extraction from app descriptions and app reviews
WO2016200667A1 (en) Identifying relationships using information extracted from documents
US20210149980A1 (en) Systems and method for investigating relationships among entities
US20200401983A1 (en) Extracting and surfacing user work attributes from data sources
US10853574B2 (en) Navigating electronic documents using domain discourse trees
US10146878B2 (en) Method and system for creating filters for social data topic creation
CN107787491B (zh) Document storage for reusing content in documents
US20200097601A1 (en) Identification of an entity representation in unstructured data
US11403457B2 (en) Processing referral objects to add to annotated corpora of a machine learning engine
US20160306798A1 (en) Context-sensitive content recommendation using enterprise search and public search
US11354501B2 (en) Definition retrieval and display
US20210342541A1 (en) Stable identification of entity mentions
US8775423B2 (en) Data mining across multiple social platforms
KR102485129B1 (ko) Information push method, apparatus, device, and storage medium
Geiß et al. Neckar: A named entity classifier for wikidata
US11620453B2 (en) System and method for artificial intelligence driven document analysis, including searching, indexing, comparing or associating datasets based on learned representations
CA2956627A1 (en) System and engine for seeded clustering of news events
El Abdouli et al. Sentiment analysis of moroccan tweets using naive bayes algorithm
Han et al. Understanding and modeling behavior patterns in cross‐device web search
Tao et al. Building ontology for different emotional contexts and multilingual environment in opinion mining
EP4002152A1 (en) Data tagging and synchronisation system
US11586662B2 (en) Extracting and surfacing topic descriptions from regionally separated data stores
Korayem et al. Query sense disambiguation leveraging large scale user behavioral data

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 16728560; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 16728560; Country of ref document: EP; Kind code of ref document: A1