WO2018076348A1

WO2018076348A1 - Building and updating a connected segment graph

Info

Publication number: WO2018076348A1
Application number: PCT/CN2016/104045
Authority: WO
Inventors: Ning Wen; Dafan Liu; Hui Shen; Liang Chen; Dianfei Han; Jiazhang HU; Jinglun LI; Pu Li; Zhenyu Zhao; Mao YANG; Zhenyu Guo; Xiong Zhang
Original assignee: Microsoft Technology Licensing, Llc
Priority date: 2016-10-31
Filing date: 2016-10-31
Publication date: 2018-05-03
Also published as: CN108463818A

Abstract

The present disclosure provides a method for building a connected segment graph specific for a domain. The method may comprise: collecting entity data from a source associated with the domain to form an entity dataset for the domain; processing the entity dataset; and building the connected segment graph with the processed entity dataset, wherein the building comprises enriching the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.

Description

BUILDING AND UPDATING A CONNECTED SEGMENT GRAPH

BACKGROUND

A knowledge graph is a knowledge base used to enhance a search engine’s search results with semantic-search information gathered from a wide variety of sources. The traditional knowledge graph is a monolithic graph containing knowledge about all types of entities from a variety of domains. The issue with a monolithic knowledge graph is that its quality is hard to control, and high precision is especially hard to maintain.

SUMMARY

The following summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In one aspect, the present disclosure provides a method for building a connected segment graph (CSG) specific for a domain. The method may comprise collecting entity data from a source associated with the domain to form an entity dataset for the domain. The method may further comprise processing the entity dataset via cleaning, de-duplicating and mapping processes. The method may further comprise building the connected segment graph with the processed entity dataset. The building may comprise enriching the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.

In another aspect, the present disclosure provides an apparatus for building a connected segment graph (CSG) specific for a domain. The apparatus may comprise a collecting module configured to collect entity data from a source associated with the domain to form an entity dataset for the domain. The apparatus may further comprise a processing module configured to process the entity dataset via cleaning, de-duplicating and mapping processes. The apparatus may further comprise a building module configured to build the connected segment graph with the processed entity dataset. The building module may be further configured to enrich the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.

In another aspect, the present disclosure provides a system for building a connected segment graph (CSG) specific for a domain. The system may comprise one or more processors and a memory. The memory may store computer-executable instructions that, when executed, cause the one or more processors to perform any steps of the method for building a connected segment graph specific for a domain according to various aspects of the present disclosure.

It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of a few of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.

FIG. 1 illustrates an environment in an example implementation according to an embodiment of the present disclosure.

FIG. 2 illustrates a flow chart of a method for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure.

FIG. 3 illustrates an exemplary distributed table service system according to an embodiment of the present disclosure.

FIG. 4 illustrates an exemplary apparatus for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure.

FIG. 5 illustrates an exemplary system for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only to enable those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than to suggest any limitations on the scope of the present disclosure.

A knowledge graph aims at describing all kinds of entities or concepts in the real world. The knowledge graph is made up of entities, facts describing entities, and relationships between entities. Based on the knowledge graph, a search engine’s search results can be enhanced with semantic-search information gathered from a wide variety of sources.

The traditional monolithic knowledge graph and its associated ontology pose a huge challenge for improving graph data quality, agility and freshness. For example, data updates for the monolithic knowledge graph may take a long time due to the expensive and complex graph operations and the interconnection of entities. Thus a user’s freshness requirement for a specific domain cannot be satisfied. Furthermore, it may be hard to introduce new ontologies since a single schema is used, and it may be hard to introduce new data sources since a single graph is used.

The present disclosure may introduce a connected segment graph (CSG) specific for a domain, which may be built individually and connected with and enriched by a knowledge graph containing knowledge on a plurality of domains. Each CSG may be associated with one scenario and application and thus the scenario and application level isolation and policy settings can be introduced. Each CSG may have its own schema which may be different from other CSGs, and thus it may be easy to introduce new ontologies. Furthermore, there may be a lot of CSGs specific for different domains rather than only one graph, so it may be easy to introduce new data sources. In the present disclosure, the proposed CSG may be updated individually based on the freshness requirement for the domain associated with the CSG. Thus the freshness requirement for a specific domain can be satisfied.

In the following discussion, an example environment is first described that is operable to employ the techniques described herein. Example illustrations of the various embodiments are then described, which may be employed in the example environment, as well as in other environments. Accordingly, the example environment is not limited to performing the described embodiments and the described embodiments are not limited to implementation in the example environment.

FIG. 1 illustrates an environment 100 in an example implementation that is operable to employ the techniques described in the present disclosure. It should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory.

The illustrated environment 100 may include a storage device 110, a search engine server 120 and a user device 130. It should be understood that any number of user devices, search engine servers, and storage devices may be employed within the environment 100 within the scope of the present disclosure. Each may comprise a single device or multiple devices cooperating in a distributed environment. For instance, the search engine server 120 may comprise multiple devices arranged in a distributed environment that collectively provide the functionality of the search engine server 120 described herein. Additionally, other components not shown may also be included within the environment 100.

The user device 130 may be any type of computing device, such as a desktop computer, a laptop computer, a smart phone and so on. The user device 130 may communicate with the search engine server 120 via a network 140, which may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs).

The storage device 110 may store a knowledge graph containing knowledge on a plurality of domains, such as the Microsoft Satori knowledge graph containing all types of entities, facts and relationships covering various domains. The storage device 110 may also store a plurality of connected segment graphs (CSGs) specific for different domains, such as CSG 1 specific for Product & Service, CSG 2 specific for Real Estate, …, CSG N specific for Entertainment. The CSGs may be built and updated for individual scenarios. Each CSG may be connected with and enriched by the knowledge graph through entity identification and linking services. The knowledge graph and the CSGs may be stored in a flat table format. Although only one storage device 110 is shown in FIG. 1, there may be a plurality of storage devices that store the knowledge graph and the CSGs in a distributed way.

Since a CSG specific for a domain is much smaller in scale than the traditional monolithic knowledge graph containing all types of entities, and is isolated from other CSGs, it may take much less time to update such a CSG than to update the traditional knowledge graph.

The search engine server 120 may operate to receive search queries associated with a specific domain from user devices, such as the user device 130, and to provide search results in response to the search queries based on the corresponding CSG stored in the storage device 110. For example, a user may be interested in real estate and may frequently submit search queries for the latest price information about houses on sale. The search engine server 120 may perform a search operation based on the CSG 2 specific for real estate, which may be updated, for example, every 4 hours based on the freshness requirement for real estate, and return the latest information to the user.

Having described an example operating environment in which the techniques described herein may be employed, consider now a discussion of various embodiments.

FIG. 2 illustrates a flow chart of a method 200 for building a connected segment graph (CSG) specific for a domain.

In step 210, the method 200 may collect entity data from one or more sources associated with the domain to form an entity dataset for the domain. Specifically, the collecting may comprise retrieving information from the one or more sources, extracting entity data from the information with a pre-defined extraction model and storing the entity data to a system performing the method 200. For example, for a CSG specific for Product, the method 200 may retrieve information from Wikipedia webpages, Amazon webpages, and Walmart webpages and so on. Then the method 200 may extract entity data associated with products from the information with a pre-defined extraction model specific for the CSG, which may be trained by a training data set specific for the Product domain. Thereafter the method 200 may store the extracted entity data to form an entity dataset specific for products.
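The collecting of step 210 (retrieve, extract, store) might be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `EntityDataItem`, `toy_product_extractor` and `collect` are hypothetical, the "extraction model" is a simple regular expression standing in for a trained model, and the source pages are passed in as already-retrieved text so that the sketch runs offline.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EntityDataItem:
    """One extracted record; the field names are illustrative assumptions."""
    source: str
    attributes: Dict[str, str] = field(default_factory=dict)


def toy_product_extractor(text: str) -> List[Dict[str, str]]:
    """Stand-in for a pre-defined extraction model (assumption: a regex,
    not a trained model). Matches text like 'Product: Foo | Price: 9.99'."""
    pattern = re.compile(
        r"Product:\s*(?P<name>[^|]+?)\s*\|\s*Price:\s*(?P<price>[\d.]+)"
    )
    return [m.groupdict() for m in pattern.finditer(text)]


def collect(sources: Dict[str, str],
            extractor: Callable[[str], List[Dict[str, str]]]) -> List[EntityDataItem]:
    """Walk over each source, extract entity data, and store it in a dataset.

    `sources` maps a source name to page text that was retrieved beforehand;
    a real collector would fetch the pages itself.
    """
    dataset: List[EntityDataItem] = []
    for name, text in sources.items():
        for attrs in extractor(text):
            dataset.append(EntityDataItem(source=name, attributes=attrs))
    return dataset


# Toy pages from two hypothetical sources.
pages = {
    "shop_a": "Product: Kettle | Price: 19.99\nProduct: Toaster | Price: 24.50",
    "shop_b": "Product: Kettle | Price: 18.75",
}
entity_dataset = collect(pages, toy_product_extractor)
```

Keeping the extractor as a parameter mirrors the idea that each CSG may have its own extraction model.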

In step 220, the method 200 may process the entity dataset. For example, the processing may comprise cleaning the entity dataset to remove noise from the dataset. The processing may further comprise de-duplicating the entity dataset. The processing may further comprise normalizing the entity data items from different sources in the dataset to the same format. The processing may further comprise mapping the entity data in the dataset to a schema specific for the CSG.
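The four processing operations of step 220 might be sketched as small functions over a list of records. Everything here is an assumption made for illustration: the schema map, the field names, the price format, and the order in which the operations are chained (the disclosure does not fix an order).

```python
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical mapping from source-specific field names to the CSG schema.
SCHEMA_MAP = {"title": "name", "name": "name", "cost": "price", "price": "price"}


def clean(items: List[Record]) -> List[Record]:
    """Remove noise: strip whitespace, drop empty values and empty records."""
    cleaned = []
    for item in items:
        record = {k.strip(): v.strip() for k, v in item.items() if v and v.strip()}
        if record:
            cleaned.append(record)
    return cleaned


def map_to_schema(items: List[Record]) -> List[Record]:
    """Rename source fields to schema fields; unknown fields are dropped."""
    return [{SCHEMA_MAP[k]: v for k, v in item.items() if k in SCHEMA_MAP}
            for item in items]


def normalize(items: List[Record]) -> List[Record]:
    """Bring values into one format: lower-case names, two-decimal prices."""
    out = []
    for item in items:
        record = dict(item)
        if "name" in record:
            record["name"] = " ".join(record["name"].lower().split())
        if "price" in record:
            record["price"] = f"{float(record['price'].lstrip('$')):.2f}"
        out.append(record)
    return out


def deduplicate(items: List[Record]) -> List[Record]:
    """Drop exact duplicates while preserving first-seen order."""
    seen, unique = set(), []
    for item in items:
        key = tuple(sorted(item.items()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def process(items: List[Record]) -> List[Record]:
    # De-duplicating last lets records that differ only in formatting collapse.
    return deduplicate(normalize(map_to_schema(clean(items))))


raw = [
    {"title": " Kettle ", "cost": "$19.9"},
    {"name": "kettle", "price": "19.90"},
    {"name": "Toaster", "price": "24.5", "junk": ""},
    {"name": "  ", "price": ""},
]
processed = process(raw)
```

In this toy input the first two records come from differently formatted sources and collapse into one after normalization, while the last record is pure noise and is removed.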

In step 230, the method 200 may build the CSG with the processed entity dataset. Specifically, the building may comprise performing entity matching on the processed entity dataset. The entity matching may comprise assigning an entity ID to each entity data item in the dataset based on entity similarity. The same entity ID may be assigned to two or more entity data items if they are associated with the same entity. The building may further comprise compositing two or more entity data items in the dataset based on pre-defined CSG composition rules. For example, for the CSG specific for People, the rules may include compositing two or more entity data items if they have the same name and birthday. The building may further comprise enriching the CSG with a knowledge graph containing knowledge on a plurality of domains. For example, data associated with an entity in the knowledge graph may be added into the corresponding entity in the CSG.
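The three building operations of step 230 might look like the following sketch for a People CSG. It is a simplification under explicit assumptions: entity "similarity" is reduced to an exact match on name and birthday (the example composition rule from the text), the knowledge graph is a toy dictionary keyed by the same pair, conflicts are resolved by keeping the first value seen, and all function names and sample people are invented for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
KEY_FIELDS = ("name", "birthday")  # example rule: same name and birthday


def match_entities(items: List[Record]) -> List[Tuple[str, Record]]:
    """Assign an entity ID to each item; items with equal key fields share an ID."""
    ids: Dict[Tuple[str, ...], str] = {}
    matched = []
    for item in items:
        key = tuple(item.get(f, "") for f in KEY_FIELDS)
        if key not in ids:
            ids[key] = f"e{len(ids) + 1}"
        matched.append((ids[key], item))
    return matched


def composite(matched: List[Tuple[str, Record]]) -> Dict[str, Record]:
    """Merge items with the same entity ID; earlier values win on conflict."""
    entities: Dict[str, Record] = {}
    for entity_id, item in matched:
        merged = entities.setdefault(entity_id, {})
        for k, v in item.items():
            merged.setdefault(k, v)
    return entities


def enrich(entities: Dict[str, Record],
           knowledge_graph: Dict[Tuple[str, ...], Record]) -> Dict[str, Record]:
    """Add knowledge-graph data to the corresponding CSG entity without
    overwriting attributes the CSG already has."""
    for entity in entities.values():
        key = tuple(entity.get(f, "") for f in KEY_FIELDS)
        for k, v in knowledge_graph.get(key, {}).items():
            entity.setdefault(k, v)
    return entities


# Invented sample data.
people = [
    {"name": "Jane Doe", "birthday": "1980-01-02", "occupation": "engineer"},
    {"name": "Jane Doe", "birthday": "1980-01-02", "city": "Springfield"},
    {"name": "John Roe", "birthday": "1975-03-04"},
]
# Toy stand-in for the multi-domain knowledge graph.
kg = {("Jane Doe", "1980-01-02"): {"employer": "Example Corp"}}

csg = enrich(composite(match_entities(people)), kg)
```

A production matcher would likely use a similarity score rather than exact equality; the exact-match rule is used here only because it is the one example the text gives.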

A plurality of CSGs, with each specific for one domain, may be built by using the above described method 200. Once such a CSG is built, it may be updated, based on a freshness requirement for a domain associated with it, by using the changed information from the associated sources. The updating process is similar to the building process as described above. There may be different freshness requirements for different domains. For example, for the real estate domain, the freshness requirement may be that the houses on sale must be refreshed every 4 hours. For the real-time news domain, the freshness requirement may be that the news must be refreshed every 5 minutes. In an embodiment of the present disclosure, each CSG may be updated based on its freshness requirement. Thus, the freshness of each CSG may satisfy the respective user’s requirement.
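As a rough sketch of per-domain updating, each CSG could carry its own refresh interval, and a scheduler could pick the CSGs whose last update is older than that interval. The two intervals below are the examples from the text; the function name, the domain keys and the timestamps are assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Freshness requirements taken from the examples in the text.
FRESHNESS = {
    "real_estate": timedelta(hours=4),
    "real_time_news": timedelta(minutes=5),
}


def due_for_update(last_updated: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the domains whose CSG is at least as old as its freshness requirement."""
    return sorted(
        domain
        for domain, updated_at in last_updated.items()
        if now - updated_at >= FRESHNESS[domain]
    )


now = datetime(2016, 10, 31, 12, 0)
last_updated = {
    "real_estate": datetime(2016, 10, 31, 9, 0),       # 3 hours before `now`
    "real_time_news": datetime(2016, 10, 31, 11, 50),  # 10 minutes before `now`
}
stale = due_for_update(last_updated, now)
```

Here only the news CSG is due, because the real estate CSG is still within its 4-hour window.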

After a CSG is built and as it is updated, entity data from the CSG may be used to update the knowledge graph via mapping, conflation and selection processes when the CSG meets pre-defined criteria. In an embodiment of the present disclosure, the pre-defined criteria may be associated with at least one of freshness, correctness and attribute coverage of the CSG. The freshness may be associated with latency per user requirement for freshness. The correctness may be associated with an attribute value variance and an attribute distribution. For example, the values of some attributes of an entity in the CSG, such as birthday and name, should not change. The values of some attributes of an entity in the CSG should be in a pre-defined range. For example, the values of latitude and longitude should be in the ranges -90 to 90 and -180 to 180, respectively. The attribute distribution in the CSG should be consistent with common sense. For example, one person should have only two parents (mother and father), one company should not have more than 1 million employees, and so on. The attribute value variance of the most often queried entities should be below a pre-defined percentage, such as 5%. In an embodiment of the present disclosure, the attribute coverage of the CSG may be considered as a factor for evaluating the CSG. For example, the coverage of some important attributes of the CSG should be above a pre-defined threshold. For example, for an organization in the CSG, the coverage of attributes that are critical for describing the organization, such as name, location, website and the like, should be above a first threshold. The coverage of attributes such as phone number, email address, description and the like in the CSG should be above a second threshold. In an embodiment of the present disclosure, the first threshold may be greater than the second threshold.
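Two of the criteria above, the value-range check and the two-level attribute coverage check, might be sketched as follows. The threshold values (0.9 and 0.5), the function names and the sample organizations are hypothetical; the text only states that the first threshold may be greater than the second. Freshness and attribute value variance are left out of this sketch.

```python
from typing import Any, Dict, List

Entity = Dict[str, Any]


def coverage(entities: List[Entity], attribute: str) -> float:
    """Fraction of entities that have a non-empty value for `attribute`."""
    if not entities:
        return 0.0
    filled = sum(1 for e in entities if e.get(attribute) not in (None, ""))
    return filled / len(entities)


def in_range(entity: Entity) -> bool:
    """Latitude must lie in [-90, 90] and longitude in [-180, 180], if present."""
    lat, lon = entity.get("latitude"), entity.get("longitude")
    return (lat is None or -90 <= lat <= 90) and (lon is None or -180 <= lon <= 180)


def meets_criteria(entities: List[Entity],
                   critical: List[str],
                   secondary: List[str],
                   first_threshold: float = 0.9,    # hypothetical value
                   second_threshold: float = 0.5,   # hypothetical value
                   ) -> bool:
    """True if all values are in range and both coverage thresholds are met."""
    if not all(in_range(e) for e in entities):
        return False
    if any(coverage(entities, a) < first_threshold for a in critical):
        return False
    return all(coverage(entities, a) >= second_threshold for a in secondary)


# Invented sample organizations.
orgs = [
    {"name": "Org A", "location": "X", "website": "a.example", "phone": "1",
     "latitude": 10.0, "longitude": 20.0},
    {"name": "Org B", "location": "Y", "website": "b.example",
     "latitude": -45.0, "longitude": 170.0},
]
critical = ["name", "location", "website"]
secondary = ["phone"]

ok = meets_criteria(orgs, critical, secondary)
bad_org = {"name": "Org C", "location": "Z", "website": "c.example", "latitude": 95.0}
rejected = meets_criteria(orgs + [bad_org], critical, secondary)
```

In this toy data the first set passes (phone coverage is exactly 0.5), while adding an organization with latitude 95 makes the CSG fail the range check.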

In the present disclosure, a CSG may be connected with and enriched by the knowledge graph containing knowledge on a plurality of domains. Conversely, entity data from a CSG may be used to update the knowledge graph.

FIG. 3 illustrates a distributed table service system 300 according to an embodiment of the present disclosure. The distributed table service system 300 may be configured to store and process a knowledge graph containing knowledge on a plurality of domains and a connected segment graph (CSG) specific for a domain which may be connected with and enriched by the knowledge graph. The system 300 may include a distributed table store service 310 and a computing engine 320. The system 300 may further include a plurality of storage servers which are not shown in FIG. 3.

The distributed table store service 310 may store entity data from the knowledge graph and the CSG in a flat table format. The distributed table store service 310 may include a coordinator component 312, a replication component 314, and a local store component 316. In an embodiment of the present disclosure, the knowledge graph and the CSG may be represented as a table. The table may be divided into a plurality of partitions by vertical splitting and horizontal partitioning. The plurality of storage servers may store these partitions in a distributed way.
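One way to picture vertical splitting combined with horizontal partitioning is the sketch below. It assumes column groups for the vertical split and a stable hash of the row key for the horizontal partition; the text names neither scheme, so both choices, along with the function names and sample table, are illustrative only.

```python
import zlib
from typing import Dict, List, Tuple

Row = Dict[str, str]


def vertical_split(row: Row, column_groups: Dict[str, List[str]]) -> Dict[str, Row]:
    """Split one row into column groups (vertical splitting)."""
    return {group: {c: row[c] for c in cols if c in row}
            for group, cols in column_groups.items()}


def horizontal_partition(row_key: str, num_partitions: int) -> int:
    """Pick a row partition from a stable hash of the row key (assumed scheme)."""
    return zlib.crc32(row_key.encode("utf-8")) % num_partitions


def place(table: Dict[str, Row],
          column_groups: Dict[str, List[str]],
          num_partitions: int) -> Dict[Tuple[str, int], Dict[str, Row]]:
    """Return {(column group, row partition): {row key: columns}}."""
    partitions: Dict[Tuple[str, int], Dict[str, Row]] = {}
    for row_key, row in table.items():
        p = horizontal_partition(row_key, num_partitions)
        for group, cols in vertical_split(row, column_groups).items():
            partitions.setdefault((group, p), {})[row_key] = cols
    return partitions


# Invented flat table: one row per entity, one column per attribute.
table = {
    "e1": {"name": "Kettle", "price": "19.90", "description": "1.7 litre"},
    "e2": {"name": "Toaster", "price": "24.50", "description": "2 slice"},
    "e3": {"name": "Blender", "price": "39.00", "description": "600 watt"},
}
groups = {"core": ["name", "price"], "text": ["description"]}
partitions = place(table, groups, num_partitions=2)
```

Each `(column group, row partition)` pair could then be assigned to a storage server; reassembling a row means reading its key from every column group in its row partition.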

The coordinator component 312 may be configured to host table level metadata such as the schema of the table, partition distribution of the table, the state of each storage server and so on.

In order to ensure the security of the data, the data may be stored in three or more storage servers. The replication component 314 may be configured to keep the data reliable across a variable number of replicas and to maintain consistency between replicas. Furthermore, the replication component 314 may be further configured to migrate data from one storage server to another storage server to ensure uniform data distribution.

The local store component 316 may be configured to store the data in a local box and process the operations for the table such as reading, writing, updating, modifying, deleting and so on. The local store component 316 may also be configured to map the data from a complex data structure to a simple Key-Value storage to make storage efficient.
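Mapping a complex data structure to simple key-value storage could, for example, be done by flattening nested values into path-like keys. The sketch below assumes a `/`-separated key layout and a hypothetical `flatten` helper; the actual encoding used by the local store is not described in the text.

```python
from typing import Any, Dict, Iterator, Tuple


def flatten(entity_id: str, value: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (key, value) pairs for a nested dict/list structure.

    Keys have the assumed form '<entity id>/<attribute>/<index>/...'.
    Empty dicts and lists yield nothing, and attribute names containing '/'
    would make keys ambiguous; both are limitations of this sketch.
    """
    if isinstance(value, dict):
        for k, v in value.items():
            yield from flatten(entity_id, v, f"{prefix}/{k}")
    elif isinstance(value, list):
        for i, v in enumerate(value):
            yield from flatten(entity_id, v, f"{prefix}/{i}")
    else:
        yield f"{entity_id}{prefix}", value


# Invented nested entity.
entity = {
    "name": "Kettle",
    "offers": [
        {"seller": "shop_a", "price": 19.99},
        {"seller": "shop_b", "price": 18.75},
    ],
}
kv: Dict[str, Any] = dict(flatten("e1", entity))
```

With this layout, all keys of one entity share a common prefix, so a store that keeps keys sorted can read an entity with a single prefix scan.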

The computing engine 320 may be configured to build a CSG specific for a domain. For example, the computing engine 320 may be configured to collect entity data from one or more sources associated with the domain to form an entity dataset for the domain. Specifically, the collecting may comprise retrieving information from the one or more sources, extracting entity data from the information with a pre-defined extraction model specific for the CSG and storing the entity data to the distributed table service system 300.

The computing engine 320 may be further configured to process the entity dataset. For example, the processing may comprise cleaning the entity dataset to remove noise from the dataset. The processing may further comprise de-duplicating the entity dataset. The processing may further comprise normalizing the entity data items from different sources in the dataset to the same format. The processing may further comprise mapping the entity data in the dataset to a schema specific for the CSG.

The computing engine 320 may be further configured to build the CSG with the processed entity dataset. Specifically, the building may comprise performing entity matching on the processed entity dataset. The entity matching may comprise assigning an entity ID to each entity data item in the processed entity dataset based on entity similarity. The same entity ID may be assigned to two or more entity data items if they are associated with the same entity. The building may further comprise compositing two or more entity data items in the dataset based on pre-defined CSG composition rules. For example, for the CSG associated with People, the rules may include compositing two or more entity data items if they have the same name and birthday. The building may further comprise enriching the CSG with the knowledge graph containing knowledge on a plurality of domains. For example, data associated with an entity in the knowledge graph may be added into the corresponding entity in the CSG.

The computing engine 320 may be configured to build a plurality of CSGs, each specific for a different domain. Once such a CSG is built and stored in the system 300, the computing engine 320 may be further configured to update the CSG based on a freshness requirement for a domain associated with it by using the changed information from the associated sources.

The computing engine 320 may be further configured to update the knowledge graph with entity data from the CSG via mapping, conflation and selection processes when the CSG meets pre-defined criteria, with the CSG treated as a source for the knowledge graph. In an embodiment of the present disclosure, the pre-defined criteria may be associated with at least one of freshness, correctness and attribute coverage of the CSG.

FIG. 4 illustrates an exemplary apparatus 400 for building a connected segment graph (CSG) specific for a domain.

The apparatus 400 may comprise: a collecting module 410 configured to collect entity data from one or more sources associated with the domain to form an entity dataset for the domain; a processing module 420 configured to process the entity dataset; and a building module 430 configured to build the CSG with the processed entity dataset, wherein the building module is further configured to enrich the CSG with a knowledge graph containing knowledge on a plurality of domains.

In an embodiment of the present disclosure, the apparatus 400 may further comprise an updating module configured to update the knowledge graph with the CSG if the CSG meets pre-defined criteria. The pre-defined criteria may be associated with at least one of freshness, correctness and attribute coverage of the CSG.

In an embodiment of the present disclosure, the collecting module 410 may be further configured to retrieve information from the sources associated with the domain; extract entity data from the retrieved information; and store the entity data to the apparatus 400.

In an embodiment of the present disclosure, the processing module 420 may be further configured to perform at least one of cleaning the entity dataset to remove noises, de-duplicating the entity dataset, normalizing entity data in the entity dataset, and mapping entity data in the entity dataset to a schema specific for the CSG.

In an embodiment of the present disclosure, the building module 430 may be further configured to perform entity matching on the entity dataset and composite two or more entity data items based on a pre-defined CSG composition rule. The entity matching may comprise assigning an entity ID to each entity data item based on entity similarity.

In an embodiment of the present disclosure, the CSG may be updated based on a freshness requirement for the domain. There may be different freshness requirements for different knowledge domains. For example, for the real estate domain, the freshness requirement may be that the houses on sale must be refreshed every 4 hours. For the real-time news domain, the freshness requirement may be that the news must be refreshed every 5 minutes. In an embodiment of the present disclosure, each CSG may be updated based on its freshness requirement. Thus, the freshness of each CSG may satisfy the respective user’s requirement.

FIG. 5 illustrates an exemplary system 500 for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure. The CSG may be connected with and enriched by a knowledge graph containing knowledge on a plurality of domains. The system 500 may comprise one or more processors 510. The system 500 may further comprise a memory 520 that is connected with the one or more processors. The memory 520 may store computer-executable instructions that, when executed, cause the one or more processors to perform any steps of the method for building a connected segment graph (CSG) specific for a domain according to the present disclosure.

The solution of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any steps of the method for building a connected segment graph (CSG) specific for a domain according to the present disclosure.

Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.

Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although memory is shown separate from the processors in the various aspects presented throughout the present disclosure, the memory may be internal to the processors (e.g., cache or register).

It is to be understood that the order of steps in the methods disclosed is an illustration of exemplary processes. Based upon design preferences, it is understood that the order of steps in the methods may be rearranged.

The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims

1. A method for building a connected segment graph specific for a domain, the method comprising:

collecting entity data from a source associated with the domain to form an entity dataset for the domain;

processing the entity dataset; and

building the connected segment graph with the processed entity dataset,

wherein the building comprises enriching the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.

2. The method of claim 1, further comprising:

updating the knowledge graph with entity data from the connected segment graph if the connected segment graph meets pre-defined criteria.

3. The method of claim 1, wherein the collecting comprises:

retrieving information from the source;

extracting entity data from the retrieved information; and

storing the entity data.

4. The method of claim 1, wherein the processing comprises at least one of cleaning the entity dataset to remove noise, de-duplicating the entity dataset, normalizing entity data in the entity dataset, and mapping entity data in the entity dataset to a schema specific for the connected segment graph.

5. The method of claim 1, wherein the building further comprises:

performing entity matching on the entity dataset; and

compositing two or more entity data items in the entity dataset based on a predefined composition rule associated with the connected segment graph.

6. The method of claim 1, wherein the connected segment graph is updated based on a freshness requirement for the domain.

7. The method of claim 2, wherein the pre-defined criteria are associated with at least one of freshness, correctness and attribute coverage of the connected segment graph.

8. The method of claim 1, wherein the knowledge graph and the connected segment graph are stored in a flat table format.

9. The method of claim 1, wherein the knowledge graph and the connected segment graph are searched by using an inverted index.

10. The method of claim 5, wherein the entity matching is to assign an entity ID for each entity data item in the entity dataset.

11. An apparatus for building a connected segment graph specific for a domain, the apparatus comprising:

a collecting module configured to collect entity data from a source associated with the domain to form an entity dataset for the domain;

a processing module configured to process the entity dataset; and

a building module configured to build the connected segment graph with the processed entity dataset,

wherein the building module is further configured to enrich the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.

12. The apparatus of claim 11, further comprising:

an updating module configured to update the knowledge graph with entity data from the connected segment graph if the connected segment graph meets pre-defined criteria.

13. The apparatus of claim 11, wherein the collecting module is further configured to:

retrieve information from the source;

extract entity data from the retrieved information; and

store the entity data.

14. The apparatus of claim 11, wherein the processing module is further configured to perform at least one of cleaning the entity dataset to remove noise, de-duplicating the entity dataset, normalizing entity data in the entity dataset, and mapping entity data in the entity dataset to a schema specific for the connected segment graph.

15. The apparatus of claim 11, wherein the building module is further configured to:

perform entity matching on the entity dataset; and

composite two or more entity data items in the entity dataset based on a predefined composition rule associated with the connected segment graph.

16. The apparatus of claim 11, wherein the connected segment graph is updated based on a freshness requirement for the domain.

17. The apparatus of claim 12, wherein the pre-defined criteria are associated with at least one of freshness, correctness and attribute coverage of the connected segment graph.

18. The apparatus of claim 11, wherein the knowledge graph and the connected segment graph are stored in a flat table format.

19. The apparatus of claim 11, wherein the knowledge graph and the connected segment graph are searched by using an inverted index.

20. A system for building a connected segment graph specific for a domain, the system comprising:

one or more processors; and

a memory, storing computer-executable instructions that, when executed, cause the one or more processors to perform the method according to any one of claims 1-10.