WO2018076348A1 - Building and updating a connected segment graph - Google Patents

Building and updating a connected segment graph

Info

Publication number
WO2018076348A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
graph
connected segment
dataset
domain
Prior art date
Application number
PCT/CN2016/104045
Other languages
French (fr)
Inventor
Ning Wen
Dafan Liu
Hui Shen
Liang Chen
Dianfei Han
Jiazhang HU
Jinglun LI
Pu Li
Zhenyu Zhao
Mao YANG
Zhenyu Guo
Xiong Zhang
Original Assignee
Microsoft Technology Licensing, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc
Priority to CN201680078539.6A (CN108463818A)
Priority to PCT/CN2016/104045 (WO2018076348A1)
Publication of WO2018076348A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • a knowledge graph is a knowledge base used to enhance a search engine’s search results with semantic-search information gathered from a wide variety of sources.
  • the traditional knowledge graph is a monolithic graph containing knowledge about all types of entities from a variety of domains.
  • the issue with a monolithic knowledge graph is that the quality of the knowledge graph is hard to control, especially for maintaining a high precision graph.
  • the present disclosure provides a method for building a connected segment graph (CSG) specific for a domain.
  • the method may comprise collecting entity data from a source associated with the domain to form an entity dataset for the domain.
  • the method may further comprise processing the entity dataset via cleaning, de-duplicating and mapping processes.
  • the method may further comprise building the connected segment graph with the processed entity dataset.
  • the building may comprise enriching the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.
  • the present disclosure provides an apparatus for building a connected segment graph (CSG) specific for a domain.
  • the apparatus may comprise a collecting module configured to collect entity data from a source associated with the domain to form an entity dataset for the domain.
  • the apparatus may further comprise a processing module configured to process the entity dataset via cleaning, de-duplicating and mapping processes.
  • the apparatus may further comprise a building module configured to build the connected segment graph with the processed entity dataset.
  • the building module may be further configured to enrich the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.
  • the present disclosure provides a system for building a connected segment graph (CSG) specific for a domain.
  • the system may comprise one or more processors and a memory.
  • the memory may store computer-executable instructions that, when executed, cause the one or more processors to perform any steps of the method for building a connected segment graph specific for a domain according to various aspects of the present disclosure.
  • FIG. 1 illustrates an environment in an example implementation according to an embodiment of the present disclosure.
  • FIG. 2 illustrates a flow chart of a method for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure.
  • CSG connected segment graph
  • FIG. 3 illustrates an exemplary distributed table service system according to an embodiment of the present disclosure.
  • FIG. 4 illustrates an exemplary apparatus for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure.
  • FIG. 5 illustrates an exemplary system for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure.
  • a knowledge graph aims at describing all kinds of entities or concepts in the real world.
  • the knowledge graph is made up of entities, facts describing entities, and relationships between entities.
  • Based on the knowledge graph, a search engine’s search results can be enhanced with semantic-search information gathered from a wide variety of sources.
  • the traditional monolithic knowledge graph and associated ontology impose a huge challenge for improving graph data quality, agility and freshness.
  • data updates for the monolithic knowledge graph may take a long time due to the expensive and complex graph operations and the interconnection of entities.
  • a user’s freshness requirement for a specific domain cannot be satisfied.
  • it may be hard to introduce new ontologies since a single schema is used and it may be hard to introduce new data sources since a single graph is used.
  • the present disclosure may introduce a connected segment graph (CSG) specific for a domain, which may be built individually and connected with and enriched by a knowledge graph containing knowledge on a plurality of domains.
  • Each CSG may be associated with one scenario and application and thus the scenario and application level isolation and policy settings can be introduced.
  • Each CSG may have its own schema which may be different from other CSGs, and thus it may be easy to introduce new ontologies.
  • the proposed CSG may be updated individually based on the freshness requirement for the domain associated with the CSG. Thus the freshness requirement for a specific domain can be satisfied.
  • an example environment is first described that is operable to employ the techniques described herein.
  • Example illustrations of the various embodiments are then described, which may be employed in the example environment, as well as in other environments. Accordingly, the example environment is not limited to performing the described embodiments and the described embodiments are not limited to implementation in the example environment.
  • FIG. 1 illustrates an environment 100 in an example implementation that is operable to employ the techniques described in the present disclosure. It should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory.
  • the illustrated environment 100 may include a storage device 110, a search engine server 120 and a user device 130. It should be understood that any number of user devices, search engine servers, and storage devices may be employed within the environment 100 within the scope of the present disclosure. Each may comprise a single device or multiple devices cooperating in a distributed environment.
  • the search engine server 120 may comprise multiple devices arranged in a distributed environment that collectively provide the functionality of the search engine server 120 described herein. Additionally, other components not shown may also be included within the environment 100.
  • the user device 130 may be any type of computing device, such as a desktop computer, a laptop computer, a smart phone and so on.
  • the user device 130 may communicate with the search engine server 120 via a network 140, which may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs).
  • LANs local area networks
  • WANs wide area networks
  • the storage device 110 may store a knowledge graph containing knowledge on a plurality of domains, such as the Microsoft Satori Knowledge Graph, which contains all types of entities, facts and relationships covering various domains.
  • the storage device 110 may also store a plurality of connected segment graphs (CSGs) specific for different domains, such as CSG 1 specific for Product & Service, CSG 2 specific for Real Estate, ... CSG N specific for Entertainment.
  • CSGs may be built and updated for individual scenarios.
  • Each CSG may be connected with and enriched by the knowledge graph through entity identification and linking services.
  • the knowledge graph and the CSGs may be stored in a flat table format. Although only one storage device 110 is shown in FIG. 1, there may be a plurality of storage devices that store the knowledge graph and the CSGs in a distributed way.
  • the search engine server 120 may operate to receive search queries associated with a specific domain from user devices, such as the user device 130, and to provide search results in response to the search queries based on the corresponding CSG stored in the storage device 110. For example, a user may be interested in real estate and may frequently submit search queries for the latest price information about houses on sale.
  • the search engine server 120 may perform a search operation based on the CSG 2 specific for real estates, which may be updated, for example, every 4 hours based on the freshness requirement for real estate, and return the latest information to the user.
  • FIG. 2 illustrates a flow chart of a method 200 for building a connected segment graph (CSG) specific for a domain.
  • the method 200 may collect entity data from one or more sources associated with the domain to form an entity dataset for the domain.
  • the collecting may comprise retrieving information from the one or more sources, extracting entity data from the information with a pre-defined extraction model and storing the entity data to a system performing the method 200.
  • the method 200 may retrieve information from Wikipedia webpages, Amazon webpages, and Walmart webpages and so on.
  • the method 200 may extract entity data associated with products from the information with a pre-defined extraction model specific for the CSG, which may be trained by a training data set specific for the Product domain. Thereafter the method 200 may store the extracted entity data to form an entity dataset specific for products.
  • the method 200 may process the entity dataset.
  • the processing may comprise cleaning the entity dataset to remove noise from the dataset.
  • the processing may further comprise de-duplicating the entity dataset.
  • the processing may further comprise normalizing the entity data items from different sources in the dataset to the same format.
  • the processing may further comprise mapping the entity data in the dataset to a schema specific for the CSG.
  • the method 200 may build the CSG with the processed entity dataset.
  • the building may comprise performing entity matching on the processed entity dataset.
  • the entity matching may comprise assigning an entity ID to each entity data item in the dataset based on entity similarity.
  • the same entity ID may be assigned to two or more entity data items if they are associated with the same entity.
  • the building may further comprise compositing two or more entity data items in the dataset based on pre-defined CSG composition rules. For example, for the CSG specific for People, the rules may include compositing two or more entity data items if they have the same name and birthday.
  • the building may further comprise enriching the CSG with a knowledge graph containing knowledge on a plurality of domains. For example, data associated with an entity in the knowledge graph may be added into the corresponding entity in the CSG.
  • a plurality of CSGs, with each specific for one domain, may be built by using the above described method 200. Once such a CSG is built, it may be updated, based on a freshness requirement for a domain associated with it, by using the changed information from the associated sources.
  • the updating process is similar to the building process as described above.
  • each CSG may be updated based on its freshness requirement. Thus, the freshness of each CSG may satisfy the respective user’s requirement.
  • entity data from the CSG may be used to update the knowledge graph via mapping, conflation and selection processes when the CSG meets predefined criteria.
  • the pre-defined criteria may be associated with at least one of freshness, correctness and attribute coverage of the CSG.
  • the freshness may be associated with the update latency relative to the user’s freshness requirement.
  • the correctness may be associated with an attribute value variance and an attribute distribution.
  • the values of some attributes, such as birthday and name, of an entity in the CSG should not be changed.
  • the values of some attributes of an entity in the CSG should be in a predefined range. For example, the values of latitude and longitude should be in the ranges -90 to 90 and -180 to 180, respectively.
  • the attribute distribution in the CSG should be consistent with common sense. For example, one person should have only two parents (mother and father), one company should not have more than 1 million employees, and so on.
  • the attribute value variance of most often queried entities should be below a pre-defined percentage, such as 5%.
  • the attribute coverage of the CSG may be considered as a factor for evaluating the CSG. For example, the coverage of some important attributes of the CSG should be above a pre-defined threshold. For example, for an organization, in the CSG, the coverage of some attributes such as name, location, website and the like that are critical for describing the organization should be above a first threshold. The coverage of some attributes such as phone number, email address, description and the like in the CSG should be above a second threshold. In an embodiment of the present disclosure, the first threshold may be greater than the second threshold.
  • a CSG may be connected with and enriched by the knowledge graph containing knowledge on a plurality of domains. Reversely, entity data from a CSG may be used to update the knowledge graph.
  • FIG. 3 illustrates a distributed table service system 300 according to an embodiment of the present disclosure.
  • the distributed table service system 300 may be configured to store and process a knowledge graph containing knowledge on a plurality of domains and a connected segment graph (CSG) specific for a domain which may be connected with and enriched by the knowledge graph.
  • the system 300 may include a distributed table store service 310 and a computing engine 320.
  • the system 300 may further include a plurality of storage servers which are not shown in FIG. 3.
  • the distributed table store service 310 may store entity data from the knowledge graph and the CSG in a flat table format.
  • the distributed table store service 310 may include a coordinator component 312, a replication component 314, and a local store component 316.
  • the knowledge graph and the CSG may be represented as a table.
  • the table may be divided into a plurality of partitions by vertical splitting and horizontal partitioning. The plurality of storage servers may store these partitions in a distributed way.
  • the coordinator component 312 may be configured to host table level metadata such as the schema of the table, partition distribution of the table, the state of each storage server and so on.
  • the data may be stored in three or more storage servers.
  • the replication component 314 may be configured to keep the data reliable with a variable replica count and to keep the replicas consistent. Furthermore, the replication component 314 may be configured to migrate data from one storage server to another storage server to ensure a uniform data distribution.
  • the local store component 316 may be configured to store the data in a local box and process operations on the table such as reading, writing, updating, modifying, deleting and so on.
  • the local store component 316 may also be configured to map the data from a complex data structure to a simple key-value storage to make the storage efficient.
  • the computing engine 320 may be configured to build a CSG specific for a domain.
  • the computing engine 320 may be configured to collect entity data from one or more sources associated with the domain to form an entity dataset for the domain.
  • the collecting may comprise retrieving information from the one or more sources, extracting entity data from the information with a pre-defined extraction model specific for the CSG and storing the entity data to the distributed table service system 300.
  • the computing engine 320 may be further configured to process the entity dataset.
  • the processing may comprise cleaning the entity dataset to remove noise from the dataset.
  • the processing may further comprise de-duplicating the entity dataset.
  • the processing may further comprise normalizing the entity data items from different sources in the dataset to the same format.
  • the processing may further comprise mapping the entity data in the dataset to a schema specific for the CSG.
  • the computing engine 320 may be further configured to build the CSG with the processed entity dataset.
  • the building may comprise performing entity matching on the processed entity dataset.
  • the entity matching may comprise assigning an entity ID to each entity data item in the processed entity dataset based on entity similarity.
  • the same entity ID may be assigned to two or more entity data items if they are associated with the same entity.
  • the building may further comprise compositing two or more entity data items in the dataset based on pre-defined CSG composition rules. For example, for the CSG associated with People, the rules may include compositing two or more entity data items if they have the same name and birthday.
  • the building may further comprise enriching the CSG with the knowledge graph containing knowledge on a plurality of domains. For example, data associated with an entity in the knowledge graph may be added into the corresponding entity in the CSG.
  • the computing engine 320 may be configured to build a plurality of CSGs, each specific for a different domain. Once such a CSG is built and stored in the system 300, the computing engine 320 may be further configured to update the CSG, based on a freshness requirement for the domain associated with it, by using the changed information from the associated sources.
  • the computing engine 320 may be further configured to update the knowledge graph with entity data from the CSG via mapping, conflation and selection processes when the CSG meets predefined criteria, with the CSG treated as a source for the knowledge graph.
  • the pre-defined criteria may be associated with at least one of freshness, correctness and attribute coverage of the CSG.
  • FIG. 4 illustrates an exemplary apparatus 400 for building a connected segment graph (CSG) specific for a domain.
  • the apparatus 400 may comprise: a collecting module 410 configured to collect entity data from one or more sources associated with the domain to form an entity dataset for the domain; a processing module 420 configured to process the entity data; and a building module 430 configured to build the CSG with the processed entity dataset, wherein the building module is further configured to enrich the CSG with a knowledge graph containing knowledge on a plurality of domains.
  • the apparatus 400 may further comprise an updating module configured to update the knowledge graph with the CSG if the CSG meets pre-defined criteria.
  • the pre-defined criteria may be associated with at least one of freshness, correctness and attribute coverage of the CSG.
  • the collecting module 410 may be further configured to retrieve information from the sources associated with the domain; extract entity data from the retrieved information; and store the entity data to the apparatus 400.
  • the processing module 420 may be further configured to perform at least one of cleaning the entity dataset to remove noises, de-duplicating the entity dataset, normalizing entity data in the entity dataset, and mapping entity data in the entity dataset to a schema specific for the CSG.
  • the building module 430 may be further configured to perform entity matching on the entity dataset and composite two or more entity data items based on a predefined CSG composition rule.
  • the entity matching may comprise assigning an entity ID to each entity data item based on entity similarity.
  • the CSG may be updated based on a freshness requirement for the domain.
  • different knowledge domains may have different freshness requirements.
  • the freshness requirement may be that the houses on sale must be refreshed every 4 hours.
  • the freshness requirement may be that the news must be refreshed every 5 minutes.
  • each CSG may be updated based on its freshness requirement. Thus, the freshness of each CSG may satisfy the respective user’s requirement.
  • FIG. 5 illustrates an exemplary system 500 for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure.
  • the CSG may be connected with and enriched by a knowledge graph containing knowledge on a plurality of domains.
  • the system 500 may comprise one or more processors 510.
  • the system 500 may further comprise a memory 520 that is connected with the one or more processors.
  • the memory 520 may store computer-executable instructions that, when executed, cause the one or more processors to perform any steps of the method for building a connected segment graph (CSG) specific for a domain according to the present disclosure.
  • the solution of the present disclosure may be embodied in a non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any steps of the method for building a connected segment graph (CSG) specific for a domain according to the present disclosure.
  • processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • DSP digital signal processor
  • FPGA field-programmable gate array
  • PLD programmable logic device
  • the functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be
  • a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk.
  • RAM random access memory
  • ROM read only memory
  • PROM programmable ROM
  • EPROM erasable PROM
  • EEPROM electrically erasable PROM
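The pre-defined criteria listed above for updating the knowledge graph from a CSG (freshness, correctness and attribute coverage) can be illustrated with a minimal Python sketch. The function names, the attribute names and the threshold values (0.9 and 0.5) are hypothetical and chosen only for illustration; the disclosure states only that the first threshold may be greater than the second and that latitude and longitude should fall in the -90 to 90 and -180 to 180 ranges.

```python
from typing import Any, Dict, List, Sequence

Entity = Dict[str, Any]


def coverage(entities: List[Entity], attribute: str) -> float:
    """Fraction of entities that have a non-empty value for the attribute."""
    if not entities:
        return 0.0
    filled = sum(1 for e in entities if e.get(attribute) not in (None, ""))
    return filled / len(entities)


def in_range(entities: List[Entity]) -> bool:
    """Correctness check: latitude/longitude, when present, lie in valid ranges."""
    for e in entities:
        lat, lon = e.get("latitude"), e.get("longitude")
        if lat is not None and not -90 <= lat <= 90:
            return False
        if lon is not None and not -180 <= lon <= 180:
            return False
    return True


def meets_criteria(
    entities: List[Entity],
    age_hours: float,
    max_age_hours: float,
    critical: Sequence[str] = ("name", "location"),   # hypothetical attribute sets
    secondary: Sequence[str] = ("phone",),
    first_threshold: float = 0.9,                      # hypothetical thresholds
    second_threshold: float = 0.5,
) -> bool:
    """Return True only if the CSG is fresh, in range, and sufficiently covered."""
    fresh = age_hours <= max_age_hours
    covered = all(coverage(entities, a) >= first_threshold for a in critical) and all(
        coverage(entities, a) >= second_threshold for a in secondary
    )
    return fresh and in_range(entities) and covered


orgs = [
    {"name": "A", "location": "X", "phone": "1", "latitude": 47.6, "longitude": -122.3},
    {"name": "B", "location": "Y", "phone": None, "latitude": 39.9, "longitude": 116.4},
]
print(meets_criteria(orgs, age_hours=2, max_age_hours=4))
```

Only the range part of the correctness criterion is shown here; checks on attribute value variance and attribute distribution would be added as further predicates in the same way.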

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method for building a connected segment graph specific for a domain. The method may comprise: collecting entity data from a source associated with the domain to form an entity dataset for the domain; processing the entity dataset; and building the connected segment graph with the processed entity dataset, wherein the building comprises enriching the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.

Description

BUILDING AND UPDATING A CONNECTED SEGMENT GRAPH
BACKGROUND
A knowledge graph is a knowledge base used to enhance a search engine’s search results with semantic-search information gathered from a wide variety of sources. The traditional knowledge graph is a monolithic graph containing knowledge about all types of entities from a variety of domains. The issue with a monolithic knowledge graph is that the quality of the knowledge graph is hard to control, especially for maintaining a high precision graph.
SUMMARY
The following summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In one aspect, the present disclosure provides a method for building a connected segment graph (CSG) specific for a domain. The method may comprise collecting entity data from a source associated with the domain to form an entity dataset for the domain. The method may further comprise processing the entity dataset via cleaning, de-duplicating and mapping processes. The method may further comprise building the connected segment graph with the processed entity dataset. The building may comprise enriching the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.
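The steps of this method (processing via cleaning, de-duplicating and mapping, then building and enriching) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the field names, the `FIELD_MAP` table, keying entities by name, and the dictionary-based knowledge graph are all hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical mapping from source field names to the CSG schema.
FIELD_MAP = {"title": "name", "cost": "price"}


def process(records: List[Record]) -> List[Record]:
    """Clean, map to the CSG schema, and de-duplicate an entity dataset."""
    processed: List[Record] = []
    seen = set()
    for raw in records:
        # Mapping: rename source fields to schema fields and trim whitespace.
        item = {FIELD_MAP.get(k, k): v.strip() for k, v in raw.items()}
        # Cleaning: drop noisy records that lack a usable name.
        if not item.get("name"):
            continue
        # De-duplicating: skip records identical to one already kept.
        key = tuple(sorted(item.items()))
        if key in seen:
            continue
        seen.add(key)
        processed.append(item)
    return processed


def build(records: List[Record], knowledge_graph: Dict[str, Record]) -> Dict[str, Record]:
    """Build the CSG keyed by entity name, then enrich it from the knowledge graph."""
    csg: Dict[str, Record] = {}
    for item in records:
        csg.setdefault(item["name"], {}).update(item)
    for name, entity in csg.items():
        # Enriching: copy knowledge-graph attributes that the CSG entity lacks.
        for attr, value in knowledge_graph.get(name, {}).items():
            entity.setdefault(attr, value)
    return csg


raw_dataset = [
    {"title": " Widget ", "cost": "9.99"},
    {"name": "Widget", "price": "9.99"},  # duplicate after mapping
    {"title": "", "cost": "1.00"},        # noise: empty name
]
kg = {"Widget": {"category": "Hardware", "price": "10.49"}}
csg = build(process(raw_dataset), kg)
print(csg)
```

In this sketch the enrichment step never overwrites a value the CSG already holds, so the domain-specific source stays authoritative; that policy is a design assumption rather than something the disclosure specifies.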
In another aspect, the present disclosure provides an apparatus for building a connected segment graph (CSG) specific for a domain. The apparatus may comprise a collecting module configured to collect entity data from a source associated with the domain to form an entity dataset for the domain. The apparatus may further comprise a processing module configured to process the entity dataset via cleaning, de-duplicating and mapping processes. The apparatus may further comprise a building module configured to build the connected segment graph with the processed entity dataset. The building module may be further configured to enrich the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.
In another aspect, the present disclosure provides a system for building a connected segment graph (CSG) specific for a domain. The system may comprise one or more processors and a memory. The memory may store computer-executable instructions that, when executed, cause the one or more processors to perform any steps of the method for building a connected segment graph specific for a domain according to various aspects of the present disclosure.
It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of a few of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.
BRIEF DESCRIPTION OF THE DRAWINGS
The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects. 
FIG. 1 illustrates an environment in an example implementation according to an embodiment of the present disclosure.
FIG. 2 illustrates a flow chart of a method for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure.
FIG. 3 illustrates an exemplary distributed table service system according to an embodiment of the present disclosure.
FIG. 4 illustrates an exemplary apparatus for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure.
FIG. 5 illustrates an exemplary system for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only to enable persons skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than to suggest any limitations on the scope of the present disclosure.
A knowledge graph aims at describing all kinds of entities or concepts in the real world. The knowledge graph is made up of entities, facts describing entities, and relationships between entities. Based on the knowledge graph, a search engine’s search results can be enhanced with semantic-search information gathered from a wide variety of sources.
The traditional monolithic knowledge graph and associated ontology impose a huge challenge for improving graph data quality, agility and freshness. For example, data updates for the monolithic knowledge graph may take a long time due to the expensive and complex graph operations and the interconnection of entities. Thus a user’s freshness requirement for a specific domain cannot be satisfied. Furthermore, it may be hard to introduce new ontologies since a single schema is used and it may be hard to introduce new data sources since a single graph is used.
The present disclosure may introduce a connected segment graph (CSG) specific for a domain, which may be built individually and connected with and enriched by a knowledge graph containing knowledge on a plurality of domains. Each CSG may be associated with one scenario and application and thus the scenario and application level isolation and policy settings can be introduced. Each CSG may have its own schema which may be different from other CSGs, and thus it may be easy to introduce new ontologies. Furthermore, there may be a lot of CSGs specific for different domains rather than only one graph, so it may be easy to introduce new data sources. In the present disclosure, the proposed CSG may be updated individually based on the freshness requirement for the domain associated with the CSG. Thus the freshness requirement for a specific domain can be satisfied.
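The idea that each CSG carries its own schema and its own freshness requirement can be sketched as a small per-domain configuration record. The following Python sketch is illustrative only: the `CsgConfig` class, its field names and the schema attributes are hypothetical, while the refresh intervals mirror the examples given elsewhere in this disclosure (every 4 hours for houses on sale, every 5 minutes for news).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass
class CsgConfig:
    domain: str
    schema: Tuple[str, ...]   # attribute names; each CSG may define its own
    freshness: timedelta      # maximum allowed age before a refresh is due
    last_updated: datetime

    def is_due(self, now: datetime) -> bool:
        """True when the CSG is at least as old as its freshness requirement."""
        return now - self.last_updated >= self.freshness


now = datetime(2016, 10, 31, 12, 0)
configs = [
    CsgConfig("Real Estate", ("address", "price"), timedelta(hours=4),
              datetime(2016, 10, 31, 9, 0)),
    CsgConfig("News", ("headline", "published"), timedelta(minutes=5),
              datetime(2016, 10, 31, 11, 50)),
]

# Each CSG is checked independently, so one domain's refresh never waits on another.
due = [c.domain for c in configs if c.is_due(now)]
print(due)
```

Because every CSG is evaluated on its own schedule, a slow-changing domain does not force frequent rebuilds and a fast-changing one is not held back by the rest of the graph.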
In the following discussion, an example environment is first described that is operable to employ the techniques described herein. Example illustrations of the various embodiments are then described, which may be employed in the example environment, as well as in other environments. Accordingly, the example environment is not limited to performing the described embodiments and the described embodiments are not limited to implementation in the example environment.
FIG. 1 illustrates an environment 100 in an example implementation that is operable to employ the techniques described in the present disclosure. It should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory.
The illustrated environment 100 may include a storage device 110, a search engine server 120 and a user device 130. It should be understood that any number of user devices, search engine servers, and storage devices may be employed within the environment 100 within the scope of the present disclosure. Each may comprise a single device or multiple devices cooperating in a distributed environment. For instance, the search engine server 120 may comprise multiple devices arranged in a distributed environment that collectively provide the functionality of the search engine server 120 described herein. Additionally, other components not shown may also be included within the environment 100.
The user device 130 may be any type of computing device, such as a desktop computer, a laptop computer, a smart phone and so on. The user device 130 may communicate with the search engine server 120 via a network 140, which may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs) .
The storage device 110 may store a knowledge graph containing knowledge on a plurality of domains, such as the Microsoft Satori Knowledge Graph containing all types of entities, facts and relationships covering various domains. The storage device 110 may also store a plurality of connected segment graphs (CSGs) specific for different domains, such as CSG 1 specific for Product & Service, CSG 2 specific for Real Estate, …, CSG N specific for Entertainment. The CSGs may be built and updated for individual scenarios. Each CSG may be connected with and enriched by the knowledge graph through entity identification and linking services. The knowledge graph and the CSGs may be stored in a flat table format. Although only one storage device 110 is shown in FIG. 1, there may be a plurality of storage devices to store the knowledge graph and the CSGs in a distributed way.
Since a CSG specific for a domain is much smaller in scale than the traditional monolithic knowledge graph containing all types of entities, and is isolated from other CSGs, it may take much less time to update such a CSG than to update the traditional knowledge graph.
The search engine server 120 may operate to receive search queries associated with a specific domain from user devices, such as the user device 130, and to provide search results in response to the search queries based on the corresponding CSG stored in the storage device 110. For example, a user may be interested in real estate and may frequently submit search queries for the latest price information about houses on sale. The search engine server 120 may perform a search operation based on the CSG 2 specific for real estate, which may be updated, for example, every 4 hours based on the freshness requirement for real estate, and return the latest information to the user.
Having described an example operating environment in which the techniques described herein may be employed, consider now a discussion of various embodiments.
FIG. 2 illustrates a flow chart of a method 200 for building a connected segment graph (CSG) specific for a domain.
In step 210, the method 200 may collect entity data from one or more sources associated with the domain to form an entity dataset for the domain. Specifically, the collecting may comprise retrieving information from the one or more sources, extracting entity data from the information with a pre-defined extraction model and storing the entity data to a system performing the method 200. For example, for a CSG specific for Product, the method 200 may retrieve information from Wikipedia webpages, Amazon webpages, Walmart webpages and so on. Then the method 200 may extract entity data associated with products from the information with a pre-defined extraction model specific for the CSG, which may be trained with a training data set specific for the Product domain. Thereafter the method 200 may store the extracted entity data to form an entity dataset specific for products.
In step 220, the method 200 may process the entity dataset. For example, the processing may comprise cleaning the entity dataset to remove noise from the dataset. The processing may further comprise de-duplicating the entity dataset. The processing may further comprise normalizing the entity data items from different sources in the dataset to the same format. The processing may further comprise mapping the entity data in the dataset to a schema specific for the CSG.
In step 230, the method 200 may build the CSG with the processed entity dataset. Specifically, the building may comprise performing entity matching on the processed entity dataset. The entity matching may comprise assigning an entity ID to each entity data item in the dataset based on entity similarity. The same entity ID may be assigned to two or more entity data items if they are associated with the same entity. The building may further comprise compositing two or more entity data items in the dataset based on pre-defined CSG composition rules. For example, for the CSG specific for People, the rules may include compositing two or more entity data items if they have the same name and birthday. The building may further comprise enriching the CSG with a knowledge graph containing knowledge on a plurality of domains. For example, data associated with an entity in the knowledge graph may be added into the corresponding entity in the CSG.
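The processing operations of step 220 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the disclosure: the `PRODUCT_SCHEMA` alias table, the attribute names, the price format, and the order in which the operations run (the disclosure lists the operations without fixing an order).

```python
# Hypothetical schema for a Product CSG: canonical attribute -> source aliases.
PRODUCT_SCHEMA = {
    "name": ("name", "title", "product_name"),
    "price": ("price", "list_price"),
    "brand": ("brand", "manufacturer"),
}


def clean(record):
    """Remove noise: strip whitespace from strings and drop empty values."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value not in ("", None):
            cleaned[key] = value
    return cleaned


def map_to_schema(record, schema):
    """Keep only schema attributes, renaming source aliases to canonical names."""
    mapped = {}
    for canonical, aliases in schema.items():
        for alias in aliases:
            if alias in record:
                mapped[canonical] = record[alias]
                break
    return mapped


def normalize(record):
    """Bring values from different sources to one format (assumed formats)."""
    out = dict(record)
    if "name" in out:
        out["name"] = " ".join(str(out["name"]).split())
    if "price" in out:
        out["price"] = float(str(out["price"]).replace("$", "").replace(",", ""))
    return out


def deduplicate(records):
    """Drop records that are identical after normalization."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def process(raw_records, schema=PRODUCT_SCHEMA):
    """Clean, map, normalize, then de-duplicate the entity dataset."""
    processed = [normalize(map_to_schema(clean(r), schema)) for r in raw_records]
    return deduplicate([r for r in processed if r])
```

In this sketch de-duplication runs last so that records which differ only in formatting collapse into one item; a real pipeline may order or implement these operations differently.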
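The building operations of step 230 can be sketched as follows. This is a simplified illustration under stated assumptions: entity matching is reduced to exact agreement on normalized key fields (the disclosure speaks of entity similarity, which may be far richer), the composition rule is the same-name-and-birthday rule from the People example, conflicts are resolved by keeping the first value seen, and the `link` table from CSG entity IDs to knowledge-graph IDs is hypothetical. Items missing all key fields would be grouped together here, which a real matcher would need to avoid.

```python
from collections import defaultdict


def assign_entity_ids(items, key_fields=("name", "birthday")):
    """Give the same entity ID to items that agree on all key fields."""
    ids = {}
    labelled = []
    for item in items:
        key = tuple(str(item.get(f, "")).strip().lower() for f in key_fields)
        entity_id = ids.setdefault(key, f"e{len(ids) + 1}")
        labelled.append({**item, "entity_id": entity_id})
    return labelled


def composite(labelled):
    """Merge items that share an entity ID; the first value seen wins."""
    merged = defaultdict(dict)
    for item in labelled:
        target = merged[item["entity_id"]]
        for attr, value in item.items():
            target.setdefault(attr, value)
    return dict(merged)


def enrich(csg, knowledge_graph, link):
    """Add knowledge-graph attributes that the CSG entity does not yet have."""
    for entity_id, entity in csg.items():
        kg_entity = knowledge_graph.get(link.get(entity_id), {})
        for attr, value in kg_entity.items():
            entity.setdefault(attr, value)
    return csg
```

Enrichment in this sketch never overwrites a value already present in the CSG, reflecting the idea that the domain-specific graph is the fresher source; that policy is a design assumption, not something the disclosure prescribes.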
A plurality of CSGs, with each specific for one domain, may be built by using the above described method 200. Once such a CSG is built, it may be updated, based on a freshness requirement for a domain associated with it, by using the changed information from the associated sources. The updating process is similar to the building process as described above. There may be different freshness requirements for different domains. For example, for the real estate domain, the freshness requirement may be that the houses on sale must be refreshed every 4 hours. For the real-time news domain, the freshness requirement may be that the news must be refreshed every 5 minutes. In an embodiment of the present disclosure, each CSG may be updated based on its freshness requirement. Thus, the freshness of each CSG may satisfy the respective user’s requirement.
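Per-domain freshness requirements could drive updates along the following lines. The sketch is illustrative only: the `FRESHNESS` table reuses the 4-hour and 5-minute examples given above, while the domain keys and the function name are hypothetical.

```python
from datetime import datetime, timedelta

# Freshness requirement per domain, using the examples from the description.
FRESHNESS = {
    "real_estate": timedelta(hours=4),
    "real_time_news": timedelta(minutes=5),
}


def csgs_due_for_update(last_updated, now, freshness=FRESHNESS):
    """Return the domains whose CSG is older than its freshness requirement."""
    return sorted(
        domain
        for domain, timestamp in last_updated.items()
        if now - timestamp >= freshness[domain]
    )
```

Because each CSG is checked against its own interval, a frequently changing domain can be refreshed without touching the others.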
After a CSG is built and as it is updated, entity data from the CSG may be used to update the knowledge graph via mapping, conflation and selection processes when the CSG meets pre-defined criteria. In an embodiment of the present disclosure, the pre-defined criteria may be associated with at least one of freshness, correctness and attribute coverage of the CSG. The freshness may be associated with latency per user requirement for freshness. The correctness may be associated with an attribute value variance and an attribute distribution. For example, the values of some attributes of an entity in the CSG, such as birthday and name, should not change. The values of some attributes of an entity in the CSG should be in a pre-defined range. For example, the values of latitude and longitude should be within the ranges -90 to 90 and -180 to 180, respectively. The attribute distribution in the CSG should be consistent with common sense. For example, one person should have only two parents (mother and father), one company should not have more than 1 million employees, and so on. The attribute value variance of the most often queried entities should be below a pre-defined percentage, such as 5%. In an embodiment of the present disclosure, the attribute coverage of the CSG may be considered as a factor for evaluating the CSG. For example, the coverage of some important attributes of the CSG should be above a pre-defined threshold. For example, for an organization in the CSG, the coverage of attributes that are critical for describing the organization, such as name, location, website and the like, should be above a first threshold. The coverage of attributes such as phone number, email address, description and the like should be above a second threshold. In an embodiment of the present disclosure, the first threshold may be greater than the second threshold.
In the present disclosure, a CSG may be connected with and enriched by the knowledge graph containing knowledge on a plurality of domains. Conversely, entity data from a CSG may be used to update the knowledge graph.
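A gate of this kind could be sketched as below. The sketch makes several assumptions that are not in the disclosure: it requires all three criteria at once (the disclosure says at least one may be used), the threshold values 0.9 and 0.5 are placeholders, freshness is expressed as an age in hours, and the correctness check covers only the latitude/longitude ranges, leaving out the immutable-attribute, distribution and variance checks.

```python
def attribute_coverage(entities, attribute):
    """Fraction of entities that carry a non-empty value for the attribute."""
    if not entities:
        return 0.0
    filled = sum(1 for e in entities if e.get(attribute) not in (None, ""))
    return filled / len(entities)


def in_valid_range(entity):
    """Check latitude in [-90, 90] and longitude in [-180, 180] when present."""
    lat = entity.get("latitude")
    lon = entity.get("longitude")
    lat_ok = lat is None or -90 <= lat <= 90
    lon_ok = lon is None or -180 <= lon <= 180
    return lat_ok and lon_ok


def meets_criteria(entities, age_hours, max_age_hours, critical, secondary,
                   first_threshold=0.9, second_threshold=0.5):
    """Decide whether the CSG may be used to update the knowledge graph."""
    fresh = age_hours <= max_age_hours
    correct = all(in_valid_range(e) for e in entities)
    covered = (
        all(attribute_coverage(entities, a) >= first_threshold for a in critical)
        and all(attribute_coverage(entities, a) >= second_threshold for a in secondary)
    )
    return fresh and correct and covered
```

Keeping the first threshold higher than the second mirrors the distinction between attributes that are critical for describing an entity and attributes that are merely useful.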
FIG. 3 illustrates a distributed table service system 300 according to an embodiment of the present disclosure. The distributed table service system 300 may be configured to store and process a knowledge graph containing knowledge on a plurality of domains and a connected segment graph (CSG) specific for a domain which may be connected with and enriched by the knowledge graph. The system 300 may include a distributed table store service 310 and a computing engine 320. The system 300 may further include a plurality of storage servers which are not shown in FIG. 3.
The distributed table store service 310 may store entity data from the knowledge graph and the CSG in a flat table format. The distributed table store service 310 may include a coordinator component 312, a replication component 314 and a local store component 316. In an embodiment of the present disclosure, the knowledge graph and the CSG may each be represented as a table. The table may be divided into a plurality of partitions by vertical splitting and horizontal partitioning. The plurality of storage servers may store these partitions in a distributed way.
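One way to picture vertical splitting combined with horizontal partitioning is the sketch below. The disclosure does not say how partitions are chosen, so the hash-based row placement, the `COLUMN_GROUPS` layout and all names are assumptions made for illustration.

```python
import hashlib

# Hypothetical vertical split: each column group is stored separately.
COLUMN_GROUPS = {
    "core": ("name", "type"),
    "details": ("description", "website"),
}


def horizontal_partition(entity_id, num_partitions):
    """Pick a row partition from a stable hash of the entity ID."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def split_row(entity_id, row, num_partitions, column_groups=COLUMN_GROUPS):
    """Split one table row into (column group, row partition) pieces."""
    shard = horizontal_partition(entity_id, num_partitions)
    parts = {}
    for group, columns in column_groups.items():
        cells = {c: row[c] for c in columns if c in row}
        if cells:
            parts[(group, shard)] = cells
    return parts
```

Each resulting piece is addressed by a column group and a row partition, so different storage servers can hold different pieces of the same table.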
The coordinator component 312 may be configured to host table level metadata such as the schema of the table, partition distribution of the table, the state of each storage server and so on.
In order to ensure the security of the data, the data may be stored in three or more storage servers. The replication component 314 may be configured to keep the data reliable with a variable replica count and to maintain consistency between replicas. Furthermore, the replication component 314 may be configured to migrate data from one storage server to another storage server to ensure uniform data distribution.
The local store component 316 may be configured to store the data in a local box and process operations on the table such as reading, writing, updating, modifying, deleting and so on. The local store component 316 may also be configured to map the data from a complex data structure to a simple key-value storage to make storage efficient.
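Mapping a complex data structure to a simple key-value store can be illustrated by flattening a nested entity into path-like keys. The key format (entity ID and attribute path joined by `/`) is an assumption for this sketch; the disclosure does not specify the encoding, and attribute names that themselves contain `/` would need escaping.

```python
def flatten(entity_id, value, prefix=()):
    """Yield (key, value) pairs for every leaf of a nested entity."""
    if isinstance(value, dict):
        for name, child in value.items():
            yield from flatten(entity_id, child, prefix + (str(name),))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(entity_id, child, prefix + (str(index),))
    else:
        yield "/".join((entity_id,) + prefix), value
```

Note that empty dictionaries and lists produce no pairs in this sketch, so they are not preserved by the mapping.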
The computing engine 320 may be configured to build a CSG specific for a domain. For example, the computing engine 320 may be configured to collect entity data from one or more sources associated with the domain to form an entity dataset for the domain. Specifically, the collecting may comprise retrieving information from the one or more sources, extracting entity data from the information with a pre-defined extraction model specific for the CSG and storing the entity data to the distributed table service system 300.
The computing engine 320 may be further configured to process the entity dataset. For example, the processing may comprise cleaning the entity dataset to remove noise from the dataset. The processing may further comprise de-duplicating the entity dataset. The processing may further comprise normalizing the entity data items from different sources in the dataset to the same format. The processing may further comprise mapping the entity data in the dataset to a schema specific for the CSG.
The computing engine 320 may be further configured to build the CSG with the processed entity dataset. Specifically, the building may comprise performing entity matching on the processed entity dataset. The entity matching may comprise assigning an entity ID to each entity data item in the processed entity dataset based on entity similarity. The same entity ID may be assigned to two or more entity data items if they are associated with the same entity. The building may further comprise compositing two or more entity data items in the dataset based on pre-defined CSG composition rules. For example, for the CSG associated with People, the rules may include compositing two or more entity data items if they have the same name and birthday. The building may further comprise enriching the CSG with the knowledge graph containing knowledge on a plurality of domains. For example, data associated with an entity in the knowledge graph may be added into the corresponding entity in the CSG.
The computing engine 320 may be configured to build a plurality of CSGs, each specific for a different domain. Once such a CSG is built and stored in the system 300, the computing engine 320 may be further configured to update the CSG based on a freshness requirement for the domain associated with it by using the changed information from the associated sources.
The computing engine 320 may be further configured to update the knowledge graph with entity data from the CSG via mapping, conflation and selection processes when the CSG meets pre-defined criteria, with the CSG treated as a source for the knowledge graph. In an embodiment of the present disclosure, the pre-defined criteria may be associated with at least one of freshness, correctness and attribute coverage of the CSG.
FIG. 4 illustrates an exemplary apparatus 400 for building a connected segment graph (CSG) specific for a domain.
The apparatus 400 may comprise: a collecting module 410 configured to collect entity data from one or more sources associated with the domain to form an entity dataset for the domain; a processing module 420 configured to process the entity dataset; and a building module 430 configured to build the CSG with the processed entity dataset, wherein the building module is further configured to enrich the CSG with a knowledge graph containing knowledge on a plurality of domains.
In an embodiment of the present disclosure, the apparatus 400 may further comprise an updating module configured to update the knowledge graph with the CSG if the CSG meets pre-defined criteria. The pre-defined criteria may be associated with at least one of freshness, correctness and attribute coverage of the CSG.
In an embodiment of the present disclosure, the collecting module 410 may be further configured to retrieve information from the sources associated with the domain; extract entity data from the retrieved information; and store the entity data to the apparatus 400.
In an embodiment of the present disclosure, the processing module 420 may be further configured to perform at least one of cleaning the entity dataset to remove noise, de-duplicating the entity dataset, normalizing entity data in the entity dataset, and mapping entity data in the entity dataset to a schema specific for the CSG.
In an embodiment of the present disclosure, the building module 430 may be further configured to perform entity matching on the entity dataset and composite two or more entity data items based on a pre-defined CSG composition rule. The entity matching may comprise assigning an entity ID to each entity data item based on entity similarity.
In an embodiment of the present disclosure, the CSG may be updated based on a freshness requirement for the domain. There may be different freshness requirements for different knowledge domains. For example, for the real estate domain, the freshness requirement may be that the houses on sale must be refreshed every 4 hours. For the real-time news domain, the freshness requirement may be that the news must be refreshed every 5 minutes. In an embodiment of the present disclosure, each CSG may be updated based on its freshness requirement. Thus, the freshness of each CSG may satisfy the respective user’s requirement.
FIG. 5 illustrates an exemplary system 500 for building a connected segment graph (CSG) specific for a domain according to an embodiment of the present disclosure. The CSG may be connected with and enriched by a knowledge graph containing knowledge on a plurality of domains. The system 500 may comprise one or more processors 510. The system 500 may further comprise a memory 520 that is connected with the one or more processors. The memory 520 may store computer-executable instructions that, when executed, cause the one or more processors to perform any steps of the method for building a connected segment graph (CSG) specific for a domain according to the present disclosure.
The solution of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any steps of the method for building a connected segment graph (CSG) specific for a domain according to the present disclosure.
Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.
Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip) , an optical disk, a smart card, a flash memory device, random access memory (RAM) , read only memory (ROM) , programmable ROM (PROM) , erasable PROM (EPROM) , electrically erasable PROM (EEPROM) , a register, or a removable disk. Although memory is shown separate from the processors in the various aspects presented throughout the present disclosure, the memory may be internal to the processors (e.g., cache or register) .
It is to be understood that the order of steps in the methods disclosed is an illustration of exemplary processes. Based upon design preferences, it is understood that the order of steps in the methods may be rearranged.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims (20)

  1. A method for building a connected segment graph specific for a domain, the method comprising:
    collecting entity data from a source associated with the domain to form an entity dataset for the domain;
    processing the entity dataset; and
    building the connected segment graph with the processed entity dataset,
    wherein the building comprises enriching the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.
  2. The method of claim 1, further comprising:
    updating the knowledge graph with entity data from the connected segment graph if the connected segment graph meets pre-defined criteria.
  3. The method of claim 1, wherein the collecting comprises:
    retrieving information from the source;
    extracting entity data from the retrieved information; and
    storing the entity data.
  4. The method of claim 1, wherein the processing comprises at least one of cleaning the entity dataset to remove noise, de-duplicating the entity dataset, normalizing entity data in the entity dataset, and mapping entity data in the entity dataset to a schema specific for the connected segment graph.
  5. The method of claim 1, wherein the building further comprises:
    performing entity matching on the entity dataset; and
    compositing two or more entity data items in the entity dataset based on a predefined composition rule associated with the connected segment graph.
  6. The method of claim 1, wherein the connected segment graph is updated based on a freshness requirement for the domain.
  7. The method of claim 2, wherein the predefined criteria are associated with at least one of freshness, correctness and attribute coverage of the connected segment graph.
  8. The method of claim 1, wherein the knowledge graph and the connected segment graph are stored in a flat table format.
  9. The method of claim 1, wherein the knowledge graph and the connected segment graph are searched by using an inverted index.
  10. The method of claim 5, wherein the entity matching is to assign an entity ID for each entity data item in the entity dataset.
  11. An apparatus for building a connected segment graph specific for a domain, the apparatus comprising:
    a collecting module configured to collect entity data from a source associated with the domain to form an entity dataset for the domain;
    a processing module configured to process the entity dataset; and
    a building module configured to build the connected segment graph with the processed entity dataset,
    wherein the building module is further configured to enrich the connected segment graph with a knowledge graph containing knowledge on a plurality of domains.
  12. The apparatus of claim 11, further comprising:
    an updating module configured to update the knowledge graph with entity data from the connected segment graph if the connected segment graph meets pre-defined criteria.
  13. The apparatus of claim 11, wherein the collecting module is further configured to:
    retrieve information from the source;
    extract entity data from the retrieved information; and
    store the entity data.
  14. The apparatus of claim 11, wherein the processing module is further configured to perform at least one of cleaning the entity dataset to remove noises, de-duplicating the entity dataset, normalizing entity data in the entity dataset, and mapping entity data in the entity dataset to a schema specific for the connected segment graph.
  15. The apparatus of claim 11, wherein the building module is further configured to:
    perform entity matching on the entity dataset; and
    composite two or more entity data items in the entity dataset based on a predefined composition rule associated with the connected segment graph.
  16. The apparatus of claim 11, wherein the connected segment graph is updated based on a freshness requirement for the domain.
  17. The apparatus of claim 12, wherein the predefined criteria are associated with at least one of freshness, correctness and attribute coverage of the connected segment graph.
  18. The apparatus of claim 11, wherein the knowledge graph and the connected segment graph are stored in a flat table format.
  19. The apparatus of claim 11, wherein the knowledge graph and the connected segment graph are searched by using an inverted index.
  20. A system for building a connected segment graph specific for a domain, the system comprising:
    one or more processors; and
    a memory, storing computer-executable instructions that, when executed, cause the one or more processors to perform the method according to claims 1-10.
PCT/CN2016/104045 2016-10-31 2016-10-31 Building and updating a connected segment graph WO2018076348A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201680078539.6A CN108463818A (en) 2016-10-31 2016-10-31 Establish and update connection segment collection of illustrative plates
PCT/CN2016/104045 WO2018076348A1 (en) 2016-10-31 2016-10-31 Building and updating a connected segment graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/104045 WO2018076348A1 (en) 2016-10-31 2016-10-31 Building and updating a connected segment graph

Publications (1)

Publication Number Publication Date
WO2018076348A1 true WO2018076348A1 (en) 2018-05-03

Family

ID=62024224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/104045 WO2018076348A1 (en) 2016-10-31 2016-10-31 Building and updating a connected segment graph

Country Status (2)

Country Link
CN (1) CN108463818A (en)
WO (1) WO2018076348A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10997231B2 (en) 2019-01-17 2021-05-04 International Business Machines Corporation Image-based ontology refinement using clusters
CN114691896A (en) * 2022-05-31 2022-07-01 浙江大学 Knowledge graph data cleaning method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015175936A1 (en) * 2014-05-16 2015-11-19 Microsoft Technology Licensing, Llc Knowledge source personalization to improve language models
CN105574098A (en) * 2015-12-11 2016-05-11 百度在线网络技术(北京)有限公司 Knowledge graph generation method and device and entity comparing method and device
US20160239745A1 (en) * 2015-02-13 2016-08-18 International Business Machines Corporation Leveraging an External Ontology for Graph Expansion in Inference Systems
CN106021281A (en) * 2016-04-29 2016-10-12 京东方科技集团股份有限公司 Method for establishing medical knowledge graph, device for same and query method for same

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488724B (en) * 2013-09-16 2016-09-28 复旦大学 A kind of reading domain knowledge map construction method towards books
CN104462227A (en) * 2014-11-13 2015-03-25 中国测绘科学研究院 Automatic construction method of graphic knowledge genealogy
CN104462506A (en) * 2014-12-19 2015-03-25 北京奇虎科技有限公司 Method and device for establishing knowledge graph based on user annotation information

Also Published As

Publication number Publication date
CN108463818A (en) 2018-08-28

Similar Documents

Publication Publication Date Title
US11580168B2 (en) Method and system for providing context based query suggestions
US8862566B2 (en) Systems and methods for intelligent parallel searching
US20130311487A1 (en) Semantic search using a single-source semantic model
US9710468B2 (en) Topic profile query creation
US11907659B2 (en) Item recall method and system, electronic device and readable storage medium
US20240029086A1 (en) Discovery of new business openings using web content analysis
JP2013531289A (en) Use of model information group in search
CN104915426B (en) Information sorting method, the method and device for generating information sorting model
JP2015204105A (en) Method and device for providing recommendation information
US20140136527A1 (en) Apparatus, system, and method for searching for power user in social media
US12008047B2 (en) Providing an object-based response to a natural language query
US20150081690A1 (en) Network sourced enrichment and categorization of media content
CN106033455B (en) Method and equipment for processing user operation information
US20170116314A1 (en) Integrating real-time news with historic events
CN106991090A (en) The analysis method and device of public sentiment event entity
CN113641707B (en) Knowledge graph disambiguation method, device, equipment and storage medium
WO2018076348A1 (en) Building and updating a connected segment graph
JP7213890B2 (en) Accelerated large-scale similarity computation
US20120284224A1 (en) Build of website knowledge tables
TWI547888B (en) A method of recording user information and a search method and a server
CN110019783B (en) Attribute word clustering method and device
CN108170693B (en) Hot word pushing method and device
US20180113918A1 (en) Micro product specification update based on results to a search query
CN111797620B (en) System and method for identifying proper nouns
CN113076396A (en) Entity relationship processing method and system oriented to man-machine cooperation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16920133

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16920133

Country of ref document: EP

Kind code of ref document: A1