CN111368097A - Knowledge graph extraction method and device - Google Patents


Info

Publication number
CN111368097A
CN111368097A (application number CN202010234933.2A)
Authority
CN
China
Prior art keywords
data
extraction
temporary table
knowledge
fields
Prior art date
Legal status
Pending
Application number
CN202010234933.2A
Other languages
Chinese (zh)
Inventor
郭涵
李斌
游屹
谢鸣晓
陈凯
Current Assignee
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN202010234933.2A priority Critical patent/CN111368097A/en
Publication of CN111368097A publication Critical patent/CN111368097A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/182 Distributed file systems
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2282 Tablespace storage structures; Management thereof
    • G06F 16/23 Updating
    • G06F 16/24 Querying
    • G06F 16/242 Query formulation
    • G06F 16/2433 Query languages
    • G06F 16/245 Query processing
    • G06F 16/2455 Query execution

Abstract

The invention provides a knowledge graph extraction method and device, comprising the following steps: determining the type of an accessed data source, and registering the data source as a temporary table according to its type; determining a corresponding data extraction mode according to the number of data sources, and extracting data fields from the temporary table; and verifying the data corresponding to the extracted fields against the requirements of the knowledge graph ontology, and updating the original data according to the data corresponding to the fields that pass verification to obtain updated data. The application provides a general data access interface and data output interface for knowledge graph extraction and is preset with several commonly used data source processing modules that can be flexibly adjusted for different data sources in order to acquire data; all acquired data are converted into DataFrame objects in Spark, which shields the various non-uniform formats and names of the underlying data and allows the data to be processed uniformly.

Description

Knowledge graph extraction method and device
Technical Field
The application belongs to the technical field of big data processing, and particularly relates to a knowledge graph extraction method and device.
Background
With the rapidly growing strategic importance of new technologies such as big data and artificial intelligence in national internet construction, more and more scenarios and applications are emerging. The knowledge graph, as a powerful tool for capturing data relationships, can quickly and efficiently construct relationships among large numbers of nodes on top of a mature big data architecture, and therefore offers outstanding value in relationship identification, multi-hop query, path search and similar tasks. Although the design and construction of a knowledge graph are implemented differently for different application requirements, the data access, extraction and storage interfaces share an inherent commonality. At present, most knowledge graphs on the market are released as monolithically packaged products or rely on an integrated graph database, which is costly and cannot provide sufficient flexibility, customizability or light weight.
Disclosure of Invention
The application provides a knowledge graph extraction method and a knowledge graph extraction device, which are intended to at least solve the problems of high cost and complexity caused by building knowledge graphs as monolithically packaged products or by relying on an integrated graph database.
According to an aspect of the present application, there is provided a knowledge-graph extraction method, including:
judging the type of the accessed data source, and registering the data source into a temporary table according to the type of the data source;
determining a corresponding data extraction mode according to the number of the data sources, and extracting data fields from the temporary table;
and verifying the data corresponding to the extracted fields according to the requirements of the knowledge graph ontology, and updating the original data according to the data corresponding to the fields passing the verification to obtain the updated data.
In one embodiment, registering the data source as a temporary table according to the type of the data source includes: and constructing a DataFrame object in a corresponding mode according to the type of the data source, and registering the DataFrame object as a temporary table.
In one embodiment, the types of data sources include: HDFS; constructing a DataFrame object in a corresponding mode according to the type of a data source, and registering the DataFrame object into a temporary table, wherein the method comprises the following steps:
packaging the acquired HDFS data into an elastic distributed data set;
generating Schema according to the elastic distributed data set and constructing a DataFrame object;
the DataFrame object is registered as a temporary table.
In one embodiment, the types of data sources include: hbase; constructing a DataFrame object in a corresponding mode according to the type of a data source, and registering the DataFrame object into a temporary table, wherein the method comprises the following steps:
positioning and filtering the acquired Hbase data;
packaging the filtered Hbase data into an elastic distributed data set;
generating Schema according to the elastic distributed data set and constructing a DataFrame object;
the DataFrame object is registered as a temporary table.
In one embodiment, the types of data sources include: hive; constructing a DataFrame object in a corresponding mode according to the type of a data source, and registering the DataFrame object into a temporary table, wherein the method comprises the following steps:
constructing a first SQL query statement and acquiring a DataFrame object from the acquired Hive data by using the first SQL query statement;
the DataFrame object is registered as a temporary table.
In an embodiment, determining a corresponding data extraction manner according to the number of data sources, and extracting data fields from the temporary table includes:
when the number of the data sources is single, extracting the fields of the data from the temporary table according to preset extraction fields;
when the number of the data sources is multiple, determining the data sources to which the extraction fields belong and establishing an association mode between temporary tables corresponding to the data sources;
and generating a second SQL query statement by using the preset associated field and extracting the field of the data from the temporary table according to the second SQL query statement.
In an embodiment, updating the original data according to the data corresponding to the field meeting the requirement includes:
and acquiring original data, comparing whether the original data is consistent with data corresponding to fields meeting requirements, and if not, covering the original data or supplementing newly added data according to the preset configuration file.
In one embodiment, the method further comprises: and exporting the updated data in a corresponding mode according to the type of the data source.
According to another aspect of the present application, there is also provided a knowledge-graph extraction apparatus, including:
the temporary table registration unit is used for judging the type of the accessed data source and registering the data source into a temporary table according to the type of the data source;
the extraction unit is used for determining a corresponding data extraction mode according to the number of the data sources and extracting data fields from the temporary table;
and the verification updating unit is used for verifying the data corresponding to the extracted fields according to the requirements of the knowledge graph ontology, and updating the original data according to the data corresponding to the fields passing the verification to obtain the updated data.
In one embodiment, the temporary table registration unit includes: and the DataFrame object constructing module is used for constructing the DataFrame object in a corresponding mode according to the type of the data source and registering the DataFrame object into a temporary table.
In one embodiment, the types of data sources include: HDFS; the DataFrame object constructing module comprises:
the first RDD packaging module is used for packaging the acquired HDFS data into an elastic distributed data set;
the first construction module is used for generating Schema according to the elastic distributed data set and constructing a DataFrame object;
the first registering module is used for registering the DataFrame object into a temporary table.
In one embodiment, the types of data sources include: hbase; the DataFrame object constructing module comprises:
the positioning module is used for positioning and filtering the acquired Hbase data;
the second RDD packaging module is used for packaging the filtered Hbase data into an elastic distributed data set;
the second construction module is used for generating Schema according to the elastic distributed data set and constructing a DataFrame object;
and the second registration module is used for registering the DataFrame object into a temporary table.
In one embodiment, the types of data sources include: hive; the DataFrame object constructing module comprises:
the first query statement construction module is used for constructing a first SQL query statement and acquiring a DataFrame object from the acquired Hive data by using the first SQL query statement;
and the third registering module is used for registering the DataFrame object into a temporary table.
In one embodiment, the extraction unit comprises:
the single-source extraction module is used for extracting the fields of the data from the temporary table according to preset extraction fields when the number of the data sources is single;
the multi-source extraction module is used for determining the data sources to which the extraction fields belong and establishing the association mode among the temporary tables corresponding to the data sources when the number of the data sources is multiple; and generating a second SQL query statement by using the preset associated field and extracting the field of the data from the temporary table according to the second SQL query statement.
In one embodiment, the verification update unit includes:
and the covering/adding module is used for acquiring the original data, comparing whether the original data is consistent with the data corresponding to the fields meeting the requirements or not, and if not, covering the original data or supplementing the added data according to the preset configuration file.
In one embodiment, the apparatus further comprises: an export unit, used for exporting the updated data in a corresponding mode according to the type of the data source.
The application provides a lighter-weight, general knowledge graph extraction method that can complete the access, extraction and storage of massive data of different types. It is suitable for knowledge graph construction projects that need to be brought online quickly, is more efficient and lighter than existing knowledge graph extraction and construction methods, and eliminates unnecessary components.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for extracting a knowledge graph according to the present application.
FIG. 2 is a flow chart of the extraction configuration of the universal knowledge-graph data nodes and edges in the present application.
Fig. 3 is a schematic diagram of data verification in a temporary table in the embodiment of the present application.
FIG. 4 is a flowchart illustrating a method for registering a temporary table when the data source type is HDFS according to an embodiment of the present application.
FIG. 5 is a flowchart of a method for registering a temporary table when the data source type is Hbase in the embodiment of the present application.
Fig. 6 is a flowchart of a method for registering a temporary table when the type of the data source is Hive in the embodiment of the present application.
Fig. 7 is a flowchart of a method for determining a corresponding data extraction manner according to the number of data sources to perform extraction in the embodiment of the present application.
Fig. 8 is a schematic diagram of a knowledge-graph data extraction method in the embodiment of the present application.
Fig. 9 is a flowchart illustrating updating original data according to data corresponding to fields meeting requirements in the embodiment of the present application.
Fig. 10 is a flowchart of data export in an embodiment of the present application.
Fig. 11 is a block diagram of a knowledge-graph extraction apparatus according to the present application.
Fig. 12 is a block diagram of a temporary table registration unit in the embodiment of the present application.
Fig. 13 is a block diagram illustrating a structure of a DataFrame object constructing module in the embodiment of the present application.
Fig. 14 is a block diagram illustrating a structure of a DataFrame object constructing module in the embodiment of the present application.
Fig. 15 is a block diagram illustrating a structure of a DataFrame object constructing module in the embodiment of the present application.
Fig. 16 is a block diagram of the structure of the extraction unit in the embodiment of the present application.
Fig. 17 is a specific implementation of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The concept of the Knowledge Graph was originally proposed to optimize the results returned by search engines and to improve users' search quality and experience. A knowledge graph is essentially a semantic network in which "nodes" represent entities or concepts and "edges" represent the various semantic relationships between entities or concepts; in other words, a knowledge graph reveals the relationships between entities.
Logically, a knowledge graph can be divided into a schema layer and a data layer. The data layer mainly consists of a series of facts, and knowledge is stored with the fact as its unit. If facts are expressed as triples such as (entity 1, relationship, entity 2) or (entity, attribute, attribute value), a graph database may be chosen as the storage medium. The schema layer is built on top of the data layer and governs the fact expressions of the data layer, mainly through an ontology library. An ontology is a conceptual template for a structured knowledge base; a knowledge base built on an ontology library has a strong hierarchical structure and little redundancy. The construction and application of a large-scale knowledge base require the support of various intelligent information processing technologies. Knowledge elements such as entities, relationships and attributes can be extracted from published semi-structured and unstructured data through knowledge extraction technology. Building a knowledge graph involves not only knowledge extraction but also knowledge fusion, knowledge reasoning and so on. Knowledge fusion eliminates the ambiguity between referring expressions such as entities, relationships and attributes and the factual objects they refer to, forming a high-quality knowledge base. Knowledge reasoning further mines implicit knowledge on the basis of the existing knowledge base, enriching and expanding it. The comprehensive vectors formed by distributed knowledge representation are of great significance for the construction, reasoning, fusion and application of the knowledge base.
Given these characteristics, knowledge graphs are widely used in financial institutions (such as banks) to support application scenarios such as intelligent response for government-affairs clouds and intelligent response for data centers. In financial institutions such as banks, a private knowledge graph can be applied to information recommendation scenarios: a private knowledge graph is built from data such as product transaction data, customer information and news, and is then introduced into information recommendation and search. On the one hand, a customer graph is built; by establishing graph relationships among family members, the reading preferences of people related to a customer can be recommended, making recommendations more accurate and more likely to resonate. On the other hand, an information label graph is built; when relevant content is recommended to a customer, the recommendation is not limited to a single label of interest but can cover articles corresponding to a series of labels logically associated with that label, enriching the recommended content.
However, most existing knowledge graphs are built as monolithically packaged products or rely on an integrated graph database, which leads to high cost and poor flexibility. To solve this problem, the application provides a method and device for extracting the entities and relationships of a knowledge graph based on a Hadoop big data platform; data access, verification and extraction are configurable and can be flexibly combined for different data sources so as to extract and output valid graph data.
As shown in fig. 1, a method for extracting a knowledge graph provided by the present application includes:
s101: and judging the type of the accessed data source, and registering the data into a temporary table according to the type of the data source.
In a specific embodiment, the file configuration step runs through the whole data extraction process and covers the following four aspects (as shown in fig. 2):
1. Data source configuration: multiple kinds of data sources are supported. The Hadoop-based big data system supports loading data from HDFS (the Hadoop distributed file system), Hive (a Hadoop-based data warehouse tool) and Hbase (a distributed, column-oriented open-source database).
2. Extraction content configuration: mainly the configuration of fields, including the field name, the data source the field comes from, the field type, the ontology concept the field belongs to, and so on; fields can be loaded from a single table or extracted through multi-table association.
3. Auxiliary rule configuration: including the PrimaryID generation rule, the HDFS data file configuration and the Schema mapping of the DataFrame.
4. Output configuration: the output location of the processed DataFrame object is specified as required.
The configuration is customized as needed. The configuration file supports multiple configuration formats (including mainstream formats such as XML, Properties, JSON and YAML). After being parsed, the configuration is kept in memory, and refreshing, updating and closing of the configuration are supported. A rich configuration reading API is provided, which supports quickly obtaining any configuration subtree, obtaining list data, obtaining projection fields, and so on, and can satisfy general configuration loading, querying and modification.
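By way of illustration only, the following minimal Scala sketch shows how such a configuration might be loaded in the Properties format; the file name and the keys "source.type", "extract.fields" and "output.path" are hypothetical and are not taken from the patent.

import java.io.FileInputStream
import java.util.Properties

// Hypothetical extraction configuration loaded from a Properties file.
// All key names and the file name below are illustrative placeholders.
object ConfigExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    val in = new FileInputStream("graph-extract.properties")
    try props.load(in) finally in.close()

    val sourceType = props.getProperty("source.type", "HDFS")           // HDFS / Hive / Hbase
    val fields     = props.getProperty("extract.fields", "").split(",").map(_.trim)
    val outputPath = props.getProperty("output.path", "/tmp/graph_out")

    println(s"source=$sourceType, fields=${fields.mkString("|")}, out=$outputPath")
  }
}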
After the configuration link is completed, the type of the data source needs to be judged when the data is acquired from the data source, and different data processing modes are adopted for different data sources.
S102: and determining a corresponding data extraction mode according to the number of the data sources, and extracting data fields from the temporary table according to the data extraction mode.
In an embodiment, step S101 processes the data from different sources into a unified format and registers them as temporary tables in Spark SQL, thereby unifying the formats of the data from different sources. Different data extraction modes are then determined according to the number of the data sources, and the data screening work is completed from each temporary table.
S103: and verifying the data corresponding to the extracted fields according to the requirements of the knowledge graph ontology, and updating the original data according to the data corresponding to the fields passing the verification to obtain the updated data.
A knowledge graph ontology is the data schema of a knowledge graph; it contains the definitions of, and constraint rules for, the concepts, relationships, entity attributes and relationship attributes involved in the graph. The ontology data serve as the basis and foundation for knowledge graph construction, knowledge fusion and knowledge reasoning.
The method and device introduce the concept of the knowledge graph ontology and use it to verify whether the data meet the requirements and are well-formed; unlike the prior art, the ontology concept is introduced into the extraction step of the knowledge graph so that data compliance can be verified there.
In a specific embodiment, a target-state data set is obtained in the extraction step S102; the data in this data set are then filtered according to the requirements of the knowledge graph ontology, and data that do not meet the ontology requirements are removed, leaving valid data. As shown in fig. 3, the content of the knowledge graph ontology is stored in Hbase; it is read from Hbase and mapped into an in-memory data structure that encapsulates metadata such as the name, category and length of each ontology attribute. The extracted data set is verified against this ontology metadata, which ensures the validity and compliance of the data.
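A minimal sketch of this ontology-based verification is given below, assuming the ontology metadata has already been read from Hbase into a simple in-memory structure; the case class, its fields and the length rule are hypothetical stand-ins for the metadata described above.

import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical in-memory form of the ontology attribute metadata (name, category,
// maximum length) that is read from Hbase as described for fig. 3.
case class OntologyAttr(name: String, category: String, maxLength: Int)

object OntologyCheck {
  // Keep only the rows whose fields are defined in the ontology and satisfy its length constraint.
  def verify(df: DataFrame, attrs: Seq[OntologyAttr]): DataFrame = {
    val byName     = attrs.map(a => a.name -> a).toMap
    val fieldNames = df.schema.fieldNames
    df.filter { row: Row =>
      fieldNames.forall { field =>
        byName.get(field).exists { attr =>
          val v = row.getAs[Any](field)
          v == null || v.toString.length <= attr.maxLength
        }
      }
    }
  }
}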
In one embodiment, registering the data source as a temporary table according to the type of the data source includes: and constructing a DataFrame object in a corresponding mode according to the type of the data source, and registering the DataFrame object as a temporary table.
In a specific embodiment, the specific process of registering the DataFrame object as the temporary table is as follows:
val df = sparkSession.createDataFrame(rowRdd, schema)  // build the DataFrame (the RDD and Schema are prepared as described above)
df.createOrReplaceTempView("tmp_table_name")  // register it as a temporary table in the current Spark session
sparkSession.sql("select * from tmp_table_name where ...")  // query it with standard SQL
In these three lines of code, the first line creates a DataFrame object, and the second line registers it as a temporary table in the current Spark session; the temporary table is valid within the current session and is destroyed when the Spark session is closed.
The purpose of registering the temporary table is to allow the required data to be queried from it with standard SQL statements; otherwise the data could only be processed through the DataFrame's own API, i.e. methods called directly on the df object.
In an embodiment, when the type of the data source is HDFS, a DataFrame object is constructed in a corresponding manner according to the type of the data source, and the DataFrame object is registered as a temporary table, as shown in fig. 4, including:
s401: and encapsulating the acquired HDFS data into an elastic distributed data set.
In one embodiment, the HDFS data is read into Spark and automatically encapsulated into an RDD (Resilient Distributed Dataset).
S402: and generating Schema according to the elastic distributed data set and constructing a DataFrame object.
In a specific embodiment, the HDFS data includes two types of files, CSV and DAT. For a DAT file, the schema configuration file corresponding to its fields must first be parsed before the Schema is generated and the DataFrame object constructed; for a CSV file, the Schema is generated and the DataFrame object constructed directly.
S403: the DataFrame object is registered as a temporary table.
In a specific embodiment, the DataFrame object constructed in S402 is registered in SparkSQL as a temporary table 1 (as shown in fig. 8), and the data is stored in the temporary table.
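A compact Scala sketch of this HDFS path (S401 to S403) might look as follows; the file path, the two-column Schema and the table name are illustrative placeholders, since in the patent the Schema comes from the configuration file.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object HdfsToTempTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-access").getOrCreate()

    // S401: read the HDFS file; Spark wraps the lines as an RDD.
    val lines = spark.sparkContext.textFile("hdfs:///data/customer.csv")   // placeholder path

    // S402: build Rows and a Schema (hard-coded here; derived from the schema config in the patent).
    val rowRdd = lines.map(_.split(",")).map(a => Row(a(0).trim, a(1).trim))
    val schema = StructType(Seq(
      StructField("customer_id", StringType, nullable = false),
      StructField("customer_name", StringType, nullable = true)))
    val df = spark.createDataFrame(rowRdd, schema)

    // S403: register the DataFrame as temporary table 1 in Spark SQL and query it.
    df.createOrReplaceTempView("tmp_table_1")
    spark.sql("select customer_id, customer_name from tmp_table_1").show(5)
  }
}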
In one embodiment, the types of data sources include: hbase; constructing a DataFrame object in a corresponding manner according to the type of the data source, and registering the DataFrame object as a temporary table, as shown in fig. 5, including:
s501: and positioning and filtering the acquired Hbase data.
In one embodiment, for data whose source is Hbase, the specific data are located through the table, column and qualifier mechanisms provided by Hbase, and the data are then filtered on the Region side using the filters provided by Hbase.
S502: and encapsulating the filtered Hbase data into an elastic distributed data set.
In one embodiment, the Hbase data filtered in S501 are read into Spark and packaged into an RDD.
S503: and generating Schema according to the elastic distributed data set and constructing a DataFrame object.
S504: the DataFrame object is registered as a temporary table.
In a specific embodiment, the constructed DataFrame object is registered as temporary table 2 in SparkSQL (as shown in fig. 8).
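The Hbase path (S501 to S504) could be sketched roughly as below; the table name, column family, qualifiers and the value filter are placeholders, and the SingleColumnValueFilter is only one example of the Region-side filtering mentioned above.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object HbaseToTempTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-access").getOrCreate()

    // S501: locate the data by table, column and qualifier, and filter on the Region side.
    val scan = new Scan()
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"))      // placeholder family/qualifier
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("status"))
    scan.setFilter(new SingleColumnValueFilter(
      Bytes.toBytes("cf"), Bytes.toBytes("status"),
      CompareFilter.CompareOp.EQUAL, Bytes.toBytes("valid")))       // illustrative filter

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "customer_table")        // placeholder table name
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    // S502: read the filtered results into Spark as an RDD.
    val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(
      conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

    // S503: turn each Result into a Row and attach a Schema to build the DataFrame.
    val rowRdd = hbaseRdd.map { case (key, result) =>
      Row(Bytes.toString(key.get()),
          Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))))
    }
    val schema = StructType(Seq(
      StructField("row_key", StringType, nullable = false),
      StructField("customer_name", StringType, nullable = true)))
    val df = spark.createDataFrame(rowRdd, schema)

    // S504: register it as temporary table 2.
    df.createOrReplaceTempView("tmp_table_2")
  }
}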
In one embodiment, the types of data sources include: hive; constructing a DataFrame object in a corresponding manner according to the type of the data source, and registering the DataFrame object as a temporary table, as shown in fig. 6, including:
s601: and constructing a first SQL query statement and acquiring a DataFrame object from the acquired Hive data by using the first SQL query statement.
In a specific embodiment, if the data source is Hive, Hive support needs to be enabled in Spark first, and the Hive data are then read by constructing an SQL query statement (the first SQL query statement); in another specific embodiment, the SQL query statement (the first SQL query statement) may also be spliced together from the configured field names so that the query runs directly against Hive, with the query result returned in the form of a DataFrame object.
S602: the DataFrame object is registered as a temporary table.
In a specific embodiment, the DataFrame object obtained in step S601 is registered in SparkSQL as a temporary table 3 (as shown in fig. 8).
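For Hive (S601 to S602), a sketch along the following lines would apply; the database, table and field names are placeholders, and Hive support is enabled on the SparkSession so that Hive tables are visible to Spark SQL.

import org.apache.spark.sql.SparkSession

object HiveToTempTable {
  def main(args: Array[String]): Unit = {
    // Hive access requires a SparkSession with Hive support enabled.
    val spark = SparkSession.builder()
      .appName("hive-access")
      .enableHiveSupport()
      .getOrCreate()

    // S601: splice the first SQL query from the configured field names and run it
    // against Hive; the result is returned as a DataFrame.
    val fields   = Seq("customer_id", "product_id", "amount")               // placeholder fields
    val firstSql = s"select ${fields.mkString(", ")} from ods.transaction"  // placeholder db.table
    val df = spark.sql(firstSql)

    // S602: register the result as temporary table 3.
    df.createOrReplaceTempView("tmp_table_3")
  }
}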
In an embodiment, as shown in fig. 7, determining a corresponding data extraction manner according to the number of data sources, and extracting a field of data from the temporary table includes:
s701: when the number of the data sources is single, extracting the fields of the data from the temporary table according to preset extraction fields.
In a specific embodiment, for a node or an edge of a knowledge graph extracted from a single data source, an extraction field corresponding to extraction is set (the name of the set extraction field must be consistent with the name of a data item defined in a schema used for data access), and extraction is performed from a temporary table corresponding to the data source.
S702: and when the number of the data sources is multiple, determining the data sources to which the extraction fields belong and establishing the association mode between the temporary tables corresponding to the data sources.
In an embodiment, as shown in fig. 8, when there are multiple data sources for data access, extraction needs to be performed from the multiple data sources, and at this time, before the extraction, it is necessary to determine the data source to which each extraction field belongs, and specify an association manner (e.g., the multiple table association manner shown in the figure) between the temporary tables (temporary table 1, temporary table 2, and temporary table 3) corresponding to the data sources (data source 1, data source 2, and data source 3).
S703: and generating a second SQL query statement by using the preset associated field and extracting the field of the data from the temporary table by using the second SQL query statement.
In a specific embodiment, since all the data is already in the temporary table, the SQL query statement (the second SQL query statement) can be constructed through the preset associated fields and directly queried from the temporary table, and the required data is screened out.
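A sketch of the extraction step (S701 to S703) is shown below, assuming the three temporary tables registered above already exist; the join keys and selected columns are illustrative stand-ins for the preset extraction and association fields.

import org.apache.spark.sql.{DataFrame, SparkSession}

object FieldExtraction {
  // Single data source: select the preset extraction fields from its temporary table.
  def extractSingle(spark: SparkSession, fields: Seq[String]): DataFrame =
    spark.sql(s"select ${fields.mkString(", ")} from tmp_table_1")

  // Multiple data sources: build the second SQL query from the configured association
  // fields and join the corresponding temporary tables.
  def extractMulti(spark: SparkSession): DataFrame =
    spark.sql(
      """select t1.customer_id, t1.customer_name, t2.customer_name as hbase_name, t3.amount
        |from tmp_table_1 t1
        |join tmp_table_2 t2 on t1.customer_id = t2.row_key
        |join tmp_table_3 t3 on t1.customer_id = t3.customer_id""".stripMargin)
}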
In an embodiment, as shown in fig. 9, updating the original data in the knowledge graph according to the data corresponding to the fields meeting the requirement includes:
s901: and acquiring original data in the knowledge graph, comparing whether the original data is consistent with data corresponding to fields meeting requirements or not, and if not, designating to cover the original data or supplement newly added data according to a pre-configured configuration file.
In a specific embodiment, the original data are obtained and compared with the data screened out in S703. If they are inconsistent, the data have changed; the original data are then either overwritten or appended to, as specified by the pre-configured file. In overwrite mode, the changed new data replace the corresponding original data to form the updated data; in append mode, the new data are added to the original data as new content to form the updated data. In a specific case, the new data and the original data may also be combined to generate the updated data, for example by reading data from specified columns according to the ID management configuration file, combining them into a primary key (PrimaryID) for the record, and adding the primary key as a new column of data.
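The comparison and update step could be sketched as follows; the key column, the overwrite/append switch and the PrimaryID columns are hypothetical configuration values, and both DataFrames are assumed to share the same schema.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.concat_ws

object GraphDataUpdate {
  // Overwrite mode: rows in the new extraction replace original rows with the same key.
  // Append mode: only rows whose key is not yet present are added to the original data.
  def update(original: DataFrame, extracted: DataFrame, key: String, overwrite: Boolean): DataFrame =
    if (overwrite)
      original.join(extracted, Seq(key), "left_anti").unionByName(extracted)
    else
      original.unionByName(extracted.join(original, Seq(key), "left_anti"))

  // Auxiliary rule example: combine the configured columns into a PrimaryID
  // and attach it as a new column of data.
  def withPrimaryId(df: DataFrame, idColumns: Seq[String]): DataFrame =
    df.withColumn("PrimaryID", concat_ws("|", idColumns.map(df.col): _*))
}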
In one embodiment, as shown in fig. 10, the method further includes: and exporting the updated data in a corresponding mode according to the type of the data source.
In one embodiment, data output mirrors data access: several different output targets can be configured (including HDFS, Hive, Hbase, ES and the like). For an HDFS file, the output end writes directly with saveAsTextFile in Spark; for a Hive table, insert statements are written directly at the output end using Spark SQL; for Hbase, sorted HFiles are written directly to the output end through the bulk-load mechanism it provides; and for ES, the data are imported in batches through the Spark plug-in provided by Elasticsearch.
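A rough sketch of the export step for three of the output targets follows; the paths, table and index names are placeholders, the ES branch assumes the elasticsearch-spark connector is on the classpath, and the Hbase bulk-load branch is only indicated by a comment because preparing sorted HFiles requires considerably more setup.

import org.apache.spark.sql.{DataFrame, SaveMode}
import org.elasticsearch.spark.sql._   // elasticsearch-spark connector (assumed dependency)

object GraphDataExport {
  def export(df: DataFrame, target: String): Unit = target match {
    case "HDFS" =>
      // Write the rows back to HDFS as text via saveAsTextFile.
      df.rdd.map(_.mkString(",")).saveAsTextFile("hdfs:///graph/out/nodes")  // placeholder path
    case "Hive" =>
      // Insert into an existing Hive table through Spark SQL.
      df.write.mode(SaveMode.Append).insertInto("graph.node_table")          // placeholder table
    case "ES" =>
      // Batch import through the Spark plug-in provided by Elasticsearch.
      df.saveToEs("graph_nodes")                                             // placeholder index
    case "Hbase" =>
      // Bulk load is omitted here: it requires generating sorted HFiles
      // (HFileOutputFormat2) and handing them to the Hbase bulk-load tool.
      ()
  }
}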
An advantage of the method is that it supports extraction from and output to multiple data sources, so it adapts well to a variety of upstream and downstream storage components. In addition, the method packages data from different sources into a uniform RDD form, so that the packaged reading interface can read data of different origins and formats, thereby shielding the various non-uniform formats and names of the underlying data and achieving the technical effect of processing the data uniformly.
Based on the same inventive concept, the embodiment of the present application further provides a knowledge graph extraction apparatus, which can be used to implement the methods described in the above embodiments, as described in the following embodiments. The problem solving principle of the knowledge graph extracting device is similar to that of the knowledge graph extracting method, so the implementation of the knowledge graph extracting device can refer to the implementation of the knowledge graph extracting method, and repeated parts are not repeated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or a combination of software and hardware are also possible and contemplated.
As shown in fig. 11, a knowledge-map extracting apparatus includes:
a temporary table registration unit 1101, configured to determine a type of an accessed data source, and register data as a temporary table in Spark SQL according to the type of the data source;
an extracting unit 1102, configured to determine a corresponding data extraction manner according to the number of data sources, and extract a field of data from the temporary table according to the data extraction manner;
and the verification updating unit 1103 is configured to verify the data corresponding to the extracted fields according to the requirements of the knowledge graph ontology, and update the original data in the knowledge graph according to the data corresponding to the fields that pass the verification, so as to obtain updated data.
In one embodiment, as shown in fig. 12, the temporary table registration unit 1101 includes: a DataFrame object constructing module 1201, configured to construct a DataFrame object in a corresponding manner according to the type of the data source, and register the DataFrame object as a temporary table.
In one embodiment, as shown in FIG. 13, the types of data sources include: HDFS; the DataFrame object constructing module comprises:
the first RDD packaging module 1301 is configured to package the acquired HDFS data into an elastic distributed data set;
a first constructing module 1302, configured to generate Schema according to the elastic distributed data set and construct a DataFrame object;
a first registration module 1303, configured to register the DataFrame object as a temporary table.
In one embodiment, as shown in FIG. 14, the types of data sources include: hbase; the DataFrame object constructing module comprises:
a positioning module 1401, configured to position and filter the acquired Hbase data;
a second RDD packaging module 1402, configured to package the filtered Hbase data into an elastic distributed data set;
a second constructing module 1403, configured to generate Schema according to the elastic distributed data set and construct a DataFrame object;
a second registration module 1404 configured to register the DataFrame object as a temporary table.
In one embodiment, as shown in FIG. 15, the types of data sources include: hive; the DataFrame object constructing module comprises:
a first query statement constructing module 1501, configured to construct a first SQL query statement and acquire a DataFrame object from the obtained Hive data by using the first SQL query statement;
a third registering module 1502 is configured to register the DataFrame object as a temporary table.
In one embodiment, as shown in fig. 16, the extracting unit 1102 includes:
a single-source extraction module 1601, configured to extract, when the number of data sources is single, fields of data from the temporary table according to preset extraction fields;
a multi-source extraction module 1602, configured to configure a data source to which the extraction field belongs and specify an association manner between temporary tables corresponding to the data sources when the number of the data sources is multiple; and constructing a second SQL query statement by using the preset associated fields and extracting the fields of the data from the temporary table.
In one embodiment, the verification update unit 1103 includes:
and the covering/adding module is used for acquiring the original data, comparing whether the original data is consistent with the data corresponding to the fields meeting the requirements or not, and if not, covering the original data or supplementing the added data according to the preset configuration file.
In one embodiment, the apparatus further comprises: an export unit, used for exporting the updated data in a corresponding mode according to the type of the data source.
The knowledge graph extraction device provided by the application offers a general data access interface and data output interface, and is preset with several commonly used data source processing modules, so data acquisition can be flexibly adjusted for different data sources. In addition, all acquired data are converted into DataFrame objects in Spark, which shields the various non-uniform formats and names of the underlying data and achieves the technical effect of uniform processing.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
An embodiment of the present application further provides a specific implementation manner of an electronic device, which is capable of implementing all steps in the method in the foregoing embodiment, and referring to fig. 17, the electronic device specifically includes the following contents:
a processor 1701, a memory 1702, a communication interface 1703, a bus 1704, and a nonvolatile memory 1705;
the processor 1701, the memory 1702 and the communication interface 1703 complete mutual communication through the bus 1704;
the processor 1701 is configured to call the computer programs in the memory 1702 and the nonvolatile memory 1705, and the processor implements all the steps of the method in the foregoing embodiments when executing the computer programs, for example, the processor implements the following steps when executing the computer programs:
s101: and judging the type of the accessed data source, and registering the data into a temporary table in spark SQL according to the type of the data source.
S102: and determining a corresponding data extraction mode according to the number of the data sources, and extracting data fields from the temporary table according to the data extraction mode.
S103: and verifying the data corresponding to the extracted fields according to the requirements of the knowledge graph ontology, and updating the original data in the knowledge graph according to the data corresponding to the fields passing the verification to obtain the updated data.
Embodiments of the present application also provide a computer-readable storage medium capable of implementing all the steps of the method in the above embodiments, where the computer-readable storage medium stores thereon a computer program, and the computer program when executed by a processor implements all the steps of the method in the above embodiments, for example, the processor implements the following steps when executing the computer program:
s101: and judging the type of the accessed data source, and registering the data into a temporary table in spark SQL according to the type of the data source.
S102: and determining a corresponding data extraction mode according to the number of the data sources, and extracting data fields from the temporary table according to the data extraction mode.
S103: and verifying the data corresponding to the extracted fields according to the requirements of the knowledge graph ontology, and updating the original data in the knowledge graph according to the data corresponding to the fields passing the verification to obtain the updated data.
The embodiments in the present specification are described in a progressive manner, the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the hardware + program class embodiment, since it is substantially similar to the method embodiment, the description is simple, and the relevant points can be found in the partial description of the method embodiment.

Although embodiments of the present description provide method steps as described in embodiments or flowcharts, more or fewer steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one manner of performing the steps among a multitude of orders and does not represent the only order of execution. When an actual apparatus or end product executes, it may execute sequentially or in parallel (e.g., in parallel-processor or multi-threaded environments, or even distributed data processing environments) according to the method shown in the embodiment or the figures.

The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded.

For convenience of description, the above devices are described as being divided into various modules by function, and are described separately. Of course, in implementing the embodiments of the present description, the functions of each module may be implemented in one or more pieces of software and/or hardware, or a module implementing the same function may be implemented by a combination of multiple sub-modules or sub-units, and the like.

The above-described embodiments of the apparatus are merely illustrative; for example, the division of the units is only one logical division, and other divisions may be used in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions.
These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein. The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction. The above description is only an example of the embodiments of the present disclosure, and is not intended to limit the embodiments of the present disclosure. Various modifications and variations to the embodiments described herein will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the embodiments of the present specification should be included in the scope of the claims of the embodiments of the present specification.

Claims (18)

1. A method for extracting a knowledge graph is characterized by comprising the following steps:
judging the type of an accessed data source, and registering data into a temporary table in Spark SQL according to the type of the data source;
determining a corresponding data extraction mode according to the number of the data sources, and extracting data fields from the temporary table according to the data extraction mode;
and verifying the data corresponding to the extracted fields according to the requirements of the knowledge graph ontology, and updating the original data in the knowledge graph according to the data corresponding to the fields passing the verification to obtain the updated data.
2. The method of knowledge-graph extraction according to claim 1, wherein said registering data into a temporary table in Spark SQL according to the type of the data source comprises: and constructing a DataFrame object in a corresponding mode according to the type of the data source, and registering the DataFrame object as a temporary table.
3. The method of knowledge-graph extraction according to claim 2, wherein the types of data sources include: HDFS; constructing a DataFrame object in a corresponding mode according to the type of the data source, and registering the DataFrame object as a temporary table, wherein the method comprises the following steps:
packaging the acquired HDFS data into an elastic distributed data set;
generating Schema according to the elastic distributed data set and constructing a DataFrame object;
and registering the DataFrame object as a temporary table.
4. The method of knowledge-graph extraction according to claim 2, wherein the types of data sources include: hbase; constructing a DataFrame object in a corresponding mode according to the type of the data source, and registering the DataFrame object as a temporary table, wherein the method comprises the following steps:
positioning and filtering the acquired Hbase data;
packaging the filtered Hbase data into an elastic distributed data set;
generating Schema according to the elastic distributed data set and constructing a DataFrame object;
and registering the DataFrame object as a temporary table.
5. The method of knowledge-graph extraction according to claim 2, wherein the types of data sources include: hive; constructing a DataFrame object in a corresponding mode according to the type of the data source, and registering the DataFrame object as a temporary table, wherein the method comprises the following steps:
constructing a first SQL query statement and acquiring a DataFrame object from the acquired Hive data by using the first SQL query statement;
and registering the DataFrame object as a temporary table.
6. The method for extracting a knowledge graph according to claim 1, wherein the determining a corresponding data extraction manner according to the number of the data sources and extracting data fields from the temporary table according to the data extraction manner comprises:
when the number of the data sources is single, extracting data fields from the temporary table according to preset extraction fields;
when the number of the data sources is multiple, determining the data sources to which the extraction fields belong and establishing the association mode between the temporary tables corresponding to the data sources;
and generating a second SQL query statement by using a preset associated field and extracting a field of data from the temporary table by using the second SQL query statement.
7. The method for extracting a knowledge graph according to claim 1, wherein the updating original data in the knowledge graph according to the data corresponding to the verified field includes:
and acquiring original data, comparing whether the original data is consistent with data corresponding to fields meeting requirements, and if not, covering the original data or supplementing newly added data according to the preset configuration file.
8. The method of knowledge-graph extraction according to claim 1, further comprising: and exporting the updated data in a corresponding mode according to the type of the data source.
9. A knowledge-graph extraction apparatus, comprising:
the temporary table registration unit is used for judging the type of an accessed data source and registering data into a temporary table in Spark SQL according to the type of the data source;
the extraction unit is used for determining a corresponding data extraction mode according to the number of the data sources and extracting data fields from the temporary table according to the data extraction mode;
and the verification updating unit is used for verifying the data corresponding to the extracted fields against the requirements of the knowledge graph ontology, and updating the original data in the knowledge graph according to the data corresponding to the fields that pass the verification, so as to obtain updated data.
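Read as software structure, the apparatus of claim 9 decomposes into three cooperating units. A minimal Scala sketch of that decomposition (interfaces only, with assumed names and signatures) might look like:

    import org.apache.spark.sql.DataFrame

    // Registers each configured data source as a Spark temporary table.
    trait TemporaryTableRegistrationUnit {
      def register(sourceType: String, sourceConfig: Map[String, String]): Unit
    }

    // Chooses the single- or multi-source extraction strategy and pulls the data fields.
    trait ExtractionUnit {
      def extract(sourceCount: Int, extractionFields: Seq[String]): DataFrame
    }

    // Verifies extracted fields against the knowledge graph ontology and applies updates.
    trait VerificationUpdateUnit {
      def verifyAndUpdate(extracted: DataFrame): DataFrame
    }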
10. The knowledge-graph extraction apparatus according to claim 9, wherein the temporary-table registration unit includes: and the DataFrame object constructing module is used for constructing a DataFrame object in a corresponding mode according to the type of the data source and registering the DataFrame object into a temporary table.
11. The knowledge-graph extraction apparatus according to claim 10, wherein the types of data sources comprise: HDFS; the DataFrame object constructing module comprises:
the first RDD packaging module is used for packaging the acquired HDFS data into a resilient distributed dataset (RDD);
the first construction module is used for generating a Schema from the resilient distributed dataset and constructing a DataFrame object;
and the first registration module is used for registering the DataFrame object into a temporary table.
12. The knowledge-graph extraction apparatus according to claim 10, wherein the types of data sources comprise: hbase; the DataFrame object constructing module comprises:
the positioning module is used for positioning and filtering the acquired Hbase data;
the second RDD packaging module is used for packaging the filtered Hbase data into a resilient distributed dataset (RDD);
the second construction module is used for generating a Schema from the resilient distributed dataset and constructing a DataFrame object;
and the second registration module is used for registering the DataFrame object into a temporary table.
13. The knowledge-graph extraction apparatus according to claim 10, wherein the types of data sources comprise: hive; the DataFrame object constructing module comprises:
the first query statement construction module is used for constructing a first SQL query statement and acquiring a DataFrame object from the acquired Hive data by using the first SQL query statement;
and the third registering module is used for registering the DataFrame object into a temporary table.
14. The knowledge-graph extraction apparatus according to claim 9, wherein the extraction unit includes:
the single-source extraction module is used for extracting data fields from the temporary table according to preset extraction fields when the number of the data sources is single;
the multi-source extraction module is used for, when the number of the data sources is multiple, determining the data sources to which the extraction fields belong, establishing the association manner between the temporary tables corresponding to the data sources, generating a second SQL query statement by using a preset associated field, and extracting the data fields from the associated temporary tables according to the second SQL query statement.
15. The knowledge-graph extraction apparatus according to claim 9, wherein the verification update unit includes:
and the overwriting/appending module is used for acquiring the original data, comparing whether the original data is consistent with the data corresponding to the fields meeting the requirements, and if not, overwriting the original data or supplementing the newly added data according to a preset configuration file.
16. The knowledge-graph extraction apparatus of claim 9, further comprising: and the export unit is used for exporting the updated data in a corresponding mode according to the type of the data source.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of knowledge-graph extraction of any one of claims 1 to 8 when executing the program.
18. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of knowledge-graph extraction according to any one of claims 1 to 8.
CN202010234933.2A 2020-03-30 2020-03-30 Knowledge graph extraction method and device Pending CN111368097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010234933.2A CN111368097A (en) 2020-03-30 2020-03-30 Knowledge graph extraction method and device

Publications (1)

Publication Number Publication Date
CN111368097A true CN111368097A (en) 2020-07-03

Family

ID=71209255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010234933.2A Pending CN111368097A (en) 2020-03-30 2020-03-30 Knowledge graph extraction method and device

Country Status (1)

Country Link
CN (1) CN111368097A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018072563A1 (en) * 2016-10-18 2018-04-26 中兴通讯股份有限公司 Knowledge graph creation method, device, and system
CN108984547A (en) * 2017-05-31 2018-12-11 北京京东尚科信息技术有限公司 The method and apparatus of data processing
CN109086347A (en) * 2018-07-13 2018-12-25 武汉尼维智能科技有限公司 A kind of construction method, device and the storage medium of international ocean shipping dangerous cargo knowledge mapping system
CN109284394A (en) * 2018-09-12 2019-01-29 青岛大学 A method of Company Knowledge map is constructed from multi-source data integration visual angle
CN109508383A (en) * 2018-10-30 2019-03-22 北京国双科技有限公司 The construction method and device of knowledge mapping
CN109885691A (en) * 2019-01-08 2019-06-14 平安科技(深圳)有限公司 Knowledge mapping complementing method, device, computer equipment and storage medium
CN110489561A (en) * 2019-07-12 2019-11-22 平安科技(深圳)有限公司 Knowledge mapping construction method, device, computer equipment and storage medium
CN110569405A (en) * 2019-08-26 2019-12-13 中电科大数据研究院有限公司 method for extracting government affair official document ontology concept based on BERT
CN110929042A (en) * 2019-11-26 2020-03-27 昆明能讯科技有限责任公司 Knowledge graph construction and query method based on power enterprise

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722337A (en) * 2021-11-03 2021-11-30 深圳市信润富联数字科技有限公司 Service data determination method, device, equipment and storage medium
CN114168608A (en) * 2021-12-16 2022-03-11 中科雨辰科技有限公司 Data processing system for updating knowledge graph
CN114168608B (en) * 2021-12-16 2022-07-15 中科雨辰科技有限公司 Data processing system for updating knowledge graph
CN114691882A (en) * 2022-03-03 2022-07-01 北京海致星图科技有限公司 Multi-source data real-time calculation method and device, storage medium and equipment

Similar Documents

Publication Publication Date Title
US11567997B2 (en) Query language interoperabtility in a graph database
US11157560B2 (en) System and method for managing graph data
US8577927B2 (en) Producing a virtual database from data sources exhibiting heterogeneous schemas
KR102157925B1 (en) Data query method and apparatus
US20160246838A1 (en) System and method for generating an effective test data set for testing big data applications
CN111368097A (en) Knowledge graph extraction method and device
US8341191B2 (en) Methods and structures for utilizing reusable custom-defined nestable compound data types to permit product variations within an existing taxonomy
US9495475B2 (en) Method of representing an XML schema definition and data within a relational database management system using a reusable custom-defined nestable compound data type
CN106557486A (en) A kind of storage method and device of data
CN107798017B (en) Method and system for generating execution plan information in distributed database
GB2537873A (en) Data constraints for polyglot data tiers
US20100131565A1 (en) Method for creating a self-configuring database system using a reusable custom-defined nestable compound data type
CN110414259A (en) A kind of method and apparatus for constructing data element, realizing data sharing
US11775517B2 (en) Query content-based data generation
CN115017182A (en) Visual data analysis method and equipment
US10983997B2 (en) Path query evaluation in graph databases
CN108241624A (en) The generation method and device of a kind of query script
Lawrence Automatic Conflict Resolution to Integrate Relational Schema
US11995124B2 (en) Query language interoperability in a graph database
US8443004B2 (en) System and method for storing and computing business data and logic
Wu et al. DMOMVP: a business-process-oriented data model optimization method based on database vertical partition
Qiu et al. Efficient Regular Path Query Evaluation with Structural Path Constraints
Vortmeier et al. The dynamic complexity of acyclic hypergraph homomorphisms
CN105589960A (en) Multiple database clusters-based data request processing method and device
Belgundi et al. Analysis of Native Multi-model Database Using ArangoDB

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221010

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.