WO2023078104A1 - Knowledge graph construction method, platform and computer storage medium - Google Patents

Knowledge graph construction method, platform and computer storage medium

Info

Publication number
WO2023078104A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
attribute
similarity
knowledge graph
knowledge
Prior art date
Application number
PCT/CN2022/126759
Other languages
English (en)
French (fr)
Inventor
鞠泱
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Publication of WO2023078104A1 publication Critical patent/WO2023078104A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of data processing, and in particular to a knowledge graph construction method, platform and computer storage medium.
  • with the rapid development of 5G technology, in order to combine people, processes, data and things so that network connections become more relevant, knowledge graphs, as an effective tool for integrating and governing data, can use graph analysis as a technical means of mining association relationships, providing insight into the relationships and logic between data and supporting decision-making.
  • in the process of building a domain knowledge base, the knowledge graph realizes the modeling, extraction, fusion, storage and application of knowledge, and at the same time associates related knowledge to reach the level of intelligent knowledge application, making it one of the important technical means for enterprises to advance the deployment of artificial intelligence applications.
  • knowledge graph technology has been adopted by more and more industries. Since building a knowledge graph requires complex information extraction and data processing, the knowledge graphs in the related art can only target a single specific field, so the scope and role of current knowledge graph applications are relatively limited.
  • Embodiments of the present application provide a knowledge graph construction method, platform, device, and computer storage medium.
  • in a first aspect, an embodiment of the present application provides a knowledge graph construction method applied to a knowledge graph construction platform, where the knowledge graph construction platform includes heterogeneous data from a first platform and first knowledge graph data from a second platform.
  • the method includes: acquiring the heterogeneous data from the first platform and the first knowledge graph data from the second platform; performing information extraction processing on the heterogeneous data to obtain first conversion data; performing similarity comparison processing on the first conversion data and the first knowledge graph data to obtain similarity data; and, according to the similarity data and preset threshold conditions, performing fusion construction processing on the first conversion data and the first knowledge graph data to obtain second knowledge graph data.
  • in a second aspect, an embodiment of the present application provides a knowledge graph construction platform, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the knowledge graph construction method of the first aspect is implemented.
  • in a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the knowledge graph construction method of the first aspect is implemented.
  • FIG. 1 is a knowledge graph construction system for implementing the knowledge graph construction method provided by an embodiment of the present application;
  • FIG. 2 is a flow chart of the knowledge graph construction method provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of the implementation process of step S200 in FIG. 2;
  • FIG. 4 is a schematic diagram of the implementation process of step S300 in FIG. 2;
  • FIG. 5 is a schematic diagram of the implementation process of step S400 in FIG. 2;
  • FIG. 6 is a schematic diagram of the implementation process of step S410 in FIG. 5;
  • FIG. 7 is a schematic diagram of the implementation process of step S440 in FIG. 6;
  • FIG. 8 is a schematic diagram of the implementation process of step S430 in FIG. 6;
  • FIG. 9 is a schematic flowchart of the formation of the first knowledge graph data provided by an embodiment of the present application;
  • FIG. 10 is a schematic diagram of the implementation process of step S600 in FIG. 9;
  • FIG. 11 is a schematic structural diagram of a knowledge graph construction platform provided by an embodiment of the present application.
  • the present application provides a knowledge graph construction method: by acquiring heterogeneous data and first knowledge graph data from different platforms, similarity comparison processing is automatically performed on the heterogeneous data and the first knowledge graph data to obtain similarity data, so that the similarity between the heterogeneous data and the first knowledge graph data can be judged using the similarity data and preset threshold conditions.
  • based on the similarity between the heterogeneous data and the first knowledge graph data, different fusion construction processing is performed on the heterogeneous data and the first knowledge graph data, so that data from different platforms and different fields can be combined and the corresponding fusion construction processing can be carried out automatically according to the similarity between data from different platforms, improving the accuracy and efficiency of knowledge graph construction.
  • FIG. 1 shows a knowledge graph construction system 100 for implementing a knowledge graph construction method.
  • the knowledge graph construction system 100 includes: an information collection module 110 , an information extraction module 120 , a knowledge mapping module 130 and a knowledge fusion module 140 .
  • the information collection module 110 can acquire the basic data used to construct the knowledge graph, including the heterogeneous data from the first platform and the first knowledge graph data from the second platform; the information collection module 110 can also acquire local data of the knowledge graph construction system.
  • the information extraction module 120 may extract the data acquired by the information collection module 110, and extract entity information and relationship information.
  • the knowledge mapping module 130 is used to establish the mapping relationship between the structured information extracted from the basic data and the knowledge graph ontology; it can create, open, query and delete databases through a Python interface, and can also add, delete, modify and query nodes, edges, clusters and records through the Python interface.
  • the knowledge fusion module 140 is used to associate and fuse the heterogeneous data from different platforms with the first knowledge graph data, and to fuse and update the knowledge graph.
  • the information extraction module 120 may include a general extraction module and a model extraction module, and the general extraction module is used for cleaning structured data and information extraction processing to obtain transformed data.
  • the model extraction module is used to use the training model to extract information from unstructured data and convert it into structured conversion data.
  • the knowledge fusion module 140 may also be provided with a general data preprocessing module, a fusion identifier configuration module, a mutually exclusive attribute configuration module and a similarity judgment module.
  • the general-purpose data preprocessing module can prune and transform heterogeneous data, clean the structured data in the heterogeneous data, and perform model-based extraction on the unstructured data in the heterogeneous data.
  • the fusion identifier configuration module is used to determine the key attributes used by the data in the process of knowledge fusion.
  • the mutually exclusive attribute configuration module is used to determine the mutually exclusive attributes used in the process of data fusion.
  • the similarity judgment module includes a plurality of threshold control gates, which are used to set different thresholds to meet the fusion needs of data types in different fields, and perform corresponding similarity judgments on heterogeneous data.
  • the knowledge graph construction system 100 described in the embodiments of the present application for executing the knowledge graph construction method is intended to illustrate the technical solutions of the embodiments of the present application more clearly, and does not constitute a limitation on the technical solutions provided by the embodiments of the present application.
  • those skilled in the art will appreciate that, with the evolution of the knowledge graph construction system 100 and the emergence of new application scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
  • the structure of the knowledge graph construction system 100 shown in FIG. 1 does not constitute a limitation on the embodiments of the present application, and may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
  • FIG. 2 shows a flow chart of the knowledge graph construction method provided by an embodiment of the present application.
  • the knowledge graph construction method can be applied to a knowledge graph construction platform.
  • the knowledge graph construction platform includes heterogeneous data from a first platform and first knowledge graph data from a second platform; the knowledge graph construction method includes, but is not limited to, the following steps:
  • Step S100, acquiring heterogeneous data from the first platform and first knowledge graph data from the second platform;
  • Step S200, performing information extraction processing on the heterogeneous data to obtain first conversion data;
  • Step S300, performing similarity comparison processing on the first conversion data and the first knowledge graph data to obtain similarity data;
  • Step S400, according to the similarity data and preset threshold conditions, performing fusion construction processing on the first conversion data and the first knowledge graph data to obtain second knowledge graph data.
  • the knowledge graph construction platform includes heterogeneous data from the first platform and first knowledge graph data from the second platform, that is, data from different platforms.
  • the field of the heterogeneous data and the field of the first knowledge graph data may be the same or different, so that data from different fields and different platforms can be fused to build the knowledge graph and improve its accuracy.
  • the second platform may itself be the knowledge graph construction platform, that is, the first knowledge graph data may be local data of the knowledge graph construction platform, which speeds up data acquisition and improves the construction efficiency of the knowledge graph.
  • since the information contained in the data is complex and disordered, information extraction processing is performed on the heterogeneous data to obtain the effective information for fusing and constructing the knowledge graph, that is, factual information such as entities, relations and events of specified types is extracted from natural language text, and this factual information is converted into structured first conversion data.
  • structured data refers to data that can be expressed and stored in a relational database and represented in two-dimensional form; its general characteristic is that data is organized in units of rows, one row of data represents the information of one entity, and the attributes of every row are the same. The storage and arrangement of structured data is therefore very regular, which greatly helps operations such as querying and modification, facilitates the subsequent similarity comparison processing, and improves the construction efficiency of the knowledge graph.
  • the fusion construction of the knowledge graph needs to associate the same entity from different sources and, at the same time, fuse attributes: entities and attributes of different data need to be analyzed for fusion, and nodes of the same entity or attribute need to be merged to avoid repeatedly creating entity nodes or attribute nodes. Therefore, similarity comparison processing is performed on the first knowledge graph data and the first conversion data, that is, the similarity between the first knowledge graph data and the first conversion data is calculated to obtain similarity data, and the similarity data is used to judge whether the entity nodes and attribute nodes in the first knowledge graph data should be merged with the entity nodes and attribute nodes in the first conversion data, so as to complete the fusion construction of the knowledge graph.
  • since data from different fields and platforms have different characteristics, different threshold conditions are preset to meet the fusion needs of data in different fields; by judging whether the similarity data satisfies the preset threshold conditions, the first conversion data and the first knowledge graph data are fused and constructed accordingly to obtain the second knowledge graph data and complete the fusion update of the knowledge graph, thereby combining data from different fields and platforms, automatically using heterogeneous data to fuse and construct the knowledge graph, and improving its accuracy and construction efficiency.
  • step S200 in the embodiment shown in FIG. 2 also includes but is not limited to the following steps:
  • Step S210, performing information extraction on the first structure data according to a training model to obtain the first conversion data, wherein the training model includes a Transformer-based bidirectional encoder representation (BERT), a dilated gated convolutional neural network (DGCNN) and a pointer network.
  • the heterogeneous data from the first platform may include structured data and unstructured data, and the first structure data may be the unstructured data.
  • Structured data refers to data that can be expressed and stored using a relational database, in two-dimensional form. The general characteristics are: data is in units of rows, a row of data represents the information of an entity, and the attributes of each row of data are the same. Therefore, the storage and arrangement of structured data is very regular, which is very helpful for operations such as query and modification. Structured data usually has a fixed format and already meets the structured conditions for information extraction. Therefore, only regularization and cleaning of structured data is required to perform information extraction.
  • regularization and cleaning of structured data includes replacing units in the data text with a uniform format, replacing acronyms with complete words, removing punctuation marks from the data text, replacing abbreviations with their full spelling, replacing Arabic numerals with number words, replacing plurals with singulars, and so on, thereby simplifying the data text, facilitating text recognition and improving the accuracy of information extraction.
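  • as an illustrative sketch only (the replacement tables below are assumptions, not rules taken from the present application), such regularization and cleaning can be implemented in Python with a few dictionary lookups and regular expressions:

```python
import re

# Hypothetical replacement tables; a real deployment would supply its own
# domain dictionaries for units, acronyms and number words.
UNIT_MAP = {"kilometres": "km", "kilometers": "km", "metres": "m"}
ACRONYM_MAP = {"corp.": "corporation", "dept.": "department"}
DIGIT_MAP = {"1": "one", "2": "two", "3": "three"}

def normalize(text: str) -> str:
    text = text.lower()
    for src, dst in UNIT_MAP.items():        # unify units into one format
        text = re.sub(rf"\b{re.escape(src)}\b", dst, text)
    for src, dst in ACRONYM_MAP.items():     # expand abbreviations to full words
        text = text.replace(src, dst)
    for src, dst in DIGIT_MAP.items():       # Arabic numerals -> number words
        text = re.sub(rf"\b{re.escape(src)}\b", dst, text)
    text = re.sub(r"[^\w\s]", " ", text)     # remove punctuation marks
    return re.sub(r"\s+", " ", text).strip()

print(normalize("The Corp. shipped 2 kilometres of cable!"))
# -> "the corporation shipped two km of cable"
```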
  • Unstructured data is data that has an irregular or incomplete data structure, has no predefined data model, and is inconvenient to be represented by two-dimensional logical tables of the database.
  • unstructured data can be office documents in all formats, text, pictures, various reports, images, audio and video information, and so on.
  • unstructured data comes in many formats and standards, and technically unstructured information is more difficult to standardize and understand than structured information. Unstructured data is pruned and transformed to avoid contamination caused by differing data forms, which would affect the accuracy of information extraction. Therefore, in order to improve the accuracy of information extraction from unstructured data, model extraction processing needs to be performed on the unstructured first structure data.
  • the model extraction process needs to train the training model according to the training data provided by the user, and then use the trained training model to perform information extraction on the first structure data.
  • the training model includes a Bidirectional Encoder Representation from Transformers (BERT), a Dilated Gated Convolutional Neural Network (DGCNN) and a pointer network.
  • the extraction model in related technologies usually consists of BERT, Bi-directional Long Short-Term Memory (BiLSTM) and Conditional Random Field (CRF).
  • although the extraction model in the related art, which uses BERT and BiLSTM, achieves high accuracy for sequence labeling, it imposes no conditional constraints on state transitions and is a serial detection method, so the extraction model can easily output a completely wrong label sequence.
  • the training model, through BERT, DGCNN and the pointer network together with a gating mechanism, reduces the risk of vanishing gradients, uses residual connections so that information can be transmitted along multiple paths, and can capture longer-range context in the text without increasing the number of model parameters.
  • in addition, the pointer network can capture the start and end positions of a sequence in the text, improving the extraction accuracy of the training model, reducing the number of extraction steps, and improving extraction efficiency.
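  • a minimal PyTorch sketch of one dilated, gated convolution block with a residual connection of the kind described above is shown below; the kernel size and channel layout are illustrative assumptions rather than the actual model configuration of the present application:

```python
import torch
import torch.nn as nn

class DilatedGatedConv1d(nn.Module):
    """One residual, gated, dilated 1-D convolution block (illustrative)."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # A single convolution produces both the candidate values and the gate.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=3,
                              padding=dilation, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, sequence_length)
        h, g = self.conv(x).chunk(2, dim=1)
        gate = torch.sigmoid(g)
        # Residual, gated mixing: information can flow through multiple paths,
        # which reduces the risk of vanishing gradients, while the dilation
        # enlarges the receptive field without adding extra parameters.
        return x * (1 - gate) + h * gate

block = DilatedGatedConv1d(channels=128, dilation=2)
print(block(torch.randn(4, 128, 50)).shape)   # -> torch.Size([4, 128, 50])
```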
  • it should be noted that, for a set of labeled data, B indicates the beginning of a named entity, I indicates that the current word is a later part of a named entity, and O indicates that the word is not part of a named entity.
  • for example, a test sentence is "the(B) wall(I) street(I) journal(I) reported(O) today(O) that(O) apple(B) corporation(I) made(O) money(O)", where "the wall street journal" and "apple corporation" are named entities.
  • for BiLSTM and DGCNN: when "the wall street journal" from the test sentence is fed into BiLSTM, the sequence labels "B→I→I→I" are output one after another in the order "the→wall→street→journal", whereas when "the wall street journal" is fed into DGCNN, DGCNN takes the whole phrase as a single input and directly obtains "B I I I". It can be seen that BiLSTM is a serial method that predicts labels one by one, while DGCNN predicts all labels at once in parallel, improving the accuracy and efficiency of prediction.
  • for the entity "the wall street journal", the CRF may decode only the partial entity "the wall street", whereas the pointer network captures the first word "the" and the last word "journal" of the entity, so the entire entity can be recognized. Using the pointer network for entity recognition therefore improves the recognition effect and recognition accuracy.
  • in the process of relation extraction, the training model can also reduce the number of extraction steps and improve the efficiency of information extraction. For example, to extract a triplet from the text "person A is from region B", the extraction model in the related art first performs named entity extraction to obtain the entities "person A" and "region B", and then feeds these two named entities into text classification to derive the relation "from".
  • the training model, by contrast, can capture the head and tail words of the named entities "person A" and "region B" and of the relation "from", so the entire triplet can be extracted directly, reducing the relation extraction steps and improving extraction efficiency.
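  • the following sketch illustrates, under simplifying assumptions, how a pair of start/end (head/tail) pointers can recover a complete span from per-token scores; the scores are hard-coded stand-ins for what the BERT + DGCNN encoder would actually produce:

```python
import numpy as np

tokens = ["the", "wall", "street", "journal", "reported", "today"]

# Hypothetical per-token probabilities that a token starts / ends an entity
# (a real model would compute these from the encoded text).
start_scores = np.array([0.92, 0.03, 0.02, 0.01, 0.01, 0.01])
end_scores   = np.array([0.02, 0.05, 0.10, 0.88, 0.03, 0.02])

start = int(np.argmax(start_scores))
# Search for the end pointer at or after the start pointer, so the two
# pointers always delimit one contiguous span.
end = start + int(np.argmax(end_scores[start:]))

print(" ".join(tokens[start:end + 1]))   # -> "the wall street journal"
```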
  • step S300 in the embodiment shown in FIG. 2 also includes but is not limited to the following steps:
  • Step S310 performing similarity comparison processing on the first entity data and the second entity data to obtain entity similarity data.
  • wherein the first conversion data includes first entity data, and the first knowledge graph data includes second entity data serving as entity nodes.
  • by performing information extraction on the heterogeneous data, the entity information in the heterogeneous data is extracted and converted into the first entity data in the first conversion data. In order to associate the same entity from different sources while avoiding the repeated creation of entity nodes, similarity comparison processing is performed on the first entity data in the first conversion data and the second entity data in the first knowledge graph data to obtain entity similarity data, which is then used to judge whether the first entity data and the second entity data are the same entity.
  • if the first entity data and the second entity data are considered to be the same entity, the first entity data is fused into the second entity data; if they are considered not to be the same entity, the first conversion data corresponding to the first entity data is added to the first knowledge graph data and the first knowledge graph data is updated, thereby fusing and constructing the knowledge graph with data from different platforms and different fields and improving the accuracy of the knowledge graph.
  • the entity similarity data is the similarity between the first entity data and the second entity data.
  • the similarity between the first entity data and the second entity data can be calculated using the minimum edit distance algorithm, that is, the minimum number of edit operations needed to convert a string in the first entity data into a string in the second entity data, where N is the length of the string to be converted in the first entity data and M is the length of the target string in the second entity data.
  • the minimum edit distance is used to calculate the similarity between the first entity data and the second entity data: the smaller the minimum edit distance, the higher the repetition rate between the first entity data and the second entity data, and the higher their similarity.
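  • the formula itself is reproduced only as an image in the published application; what the surrounding text describes is the standard edit-distance (Levenshtein) recurrence, which can be sketched in Python as follows:

```python
def min_edit_distance(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions (each of
    cost 1) needed to turn `source` into `target`."""
    n, m = len(source), len(target)        # N and M in the description above
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                       # delete every character of source[:i]
    for j in range(m + 1):
        dp[0][j] = j                       # insert every character of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[n][m]

print(min_edit_distance("ZTE Corp", "ZTE Corporation"))   # -> 7
```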
  • step S400 in the embodiment shown in FIG. 2 also includes but is not limited to the following steps:
  • Step S410 when the entity similarity data does not meet the preset entity threshold condition, perform fusion processing on the first converted data and the first knowledge graph data;
  • Step S420 when the entity similarity data satisfies the preset entity threshold condition, add the first conversion data on the basis of the first knowledge graph data to obtain the second knowledge graph data.
  • if the minimum edit distance is used to calculate the entity similarity data, then when the entity similarity data does not satisfy the preset entity threshold condition, that is, the minimum edit distance in the entity similarity data is less than or equal to the entity distance threshold in the preset entity threshold condition, the first entity data and the second entity data can be considered highly similar and to be the same entity, so the first conversion data and the first knowledge graph data need to be fused, associating the first conversion data corresponding to the first entity data with the first knowledge graph data corresponding to the second entity data.
  • when the entity similarity data satisfies the preset entity threshold condition, that is, the minimum edit distance in the entity similarity data is greater than the entity distance threshold in the preset entity threshold condition, the first entity data and the second entity data can be considered to have low similarity and not to be the same entity; the first conversion data corresponding to the first entity data is then used to create an entity node, and the first conversion data is added on the basis of the first knowledge graph data to complete the fusion update of the knowledge graph and obtain the second knowledge graph data, forming a new knowledge graph. Therefore, by calculating the entity similarity between the first conversion data and the first knowledge graph data and performing different fusion construction processing according to different entity similarities, excessive invalid calculation during knowledge fusion can be avoided and the efficiency of knowledge fusion improved.
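  • a compact sketch of this threshold gate is given below; the threshold value is an assumed example, and `min_edit_distance` refers to the sketch above:

```python
ENTITY_DISTANCE_THRESHOLD = 2   # hypothetical preset entity distance threshold

def entity_fusion_decision(first_entity: str, second_entity: str) -> str:
    """Decide how the first conversion data enters the knowledge graph."""
    distance = min_edit_distance(first_entity, second_entity)
    if distance <= ENTITY_DISTANCE_THRESHOLD:
        # High similarity: treat the records as the same entity and associate
        # the first conversion data with the existing entity node.
        return "merge_into_existing_node"
    # Low similarity: create a new entity node from the first conversion data.
    return "create_new_entity_node"

print(entity_fusion_decision("ZTE Corp.", "ZTE Corp"))       # -> merge_into_existing_node
print(entity_fusion_decision("ZTE Corp.", "Example Ltd."))   # -> create_new_entity_node
```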
  • the first conversion data includes a plurality of first comparison data
  • the first knowledge graph data includes a plurality of second comparison data.
  • in order to improve the accuracy of knowledge fusion, before knowledge fusion is performed, a fusion identifier can be specified for the first conversion data and the first knowledge graph data, selecting the group of key attributes that characterize the essence of an entity for the entity similarity judgment; that is, the fusion identifier is used to determine the first entity data from the first comparison data and the second entity data from the second comparison data, so that whether the first entity data and the second entity data are the same entity can be judged by comparing only these data, avoiding the invalid calculation caused by a full comparison and, to a certain extent, improving the accuracy of the similarity calculation during knowledge fusion.
  • at the same time, the user needs to specify mutually exclusive attributes. For example, a person's gender attribute is a mutually exclusive attribute: two entities with different genders are necessarily different entities. By judging mutually exclusive attributes, the person's gender data can be incorporated into the fusion identifier, improving the efficiency and accuracy of knowledge fusion.
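  • the pre-filtering provided by the fusion identifier and the mutually exclusive attributes can be sketched as follows; the attribute names ("name", "gender") and the record layout are hypothetical:

```python
FUSION_IDENTIFIER = ["name"]        # key attributes used for similarity comparison
MUTUALLY_EXCLUSIVE = ["gender"]     # attributes that must match exactly

def may_be_same_entity(record_a: dict, record_b: dict) -> bool:
    """Cheap pre-check performed before any similarity computation."""
    # Entities differing on a mutually exclusive attribute can never be merged.
    for attr in MUTUALLY_EXCLUSIVE:
        if attr in record_a and attr in record_b and record_a[attr] != record_b[attr]:
            return False
    # Only the key attributes named by the fusion identifier take part in the
    # entity similarity judgment, avoiding a full attribute-by-attribute scan.
    return all(attr in record_a and attr in record_b for attr in FUSION_IDENTIFIER)

print(may_be_same_entity({"name": "person A", "gender": "female"},
                         {"name": "person A", "gender": "male"}))   # -> False
```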
  • step S410 in the embodiment shown in FIG. 5 also includes but is not limited to the following steps:
  • Step S430 performing similarity comparison processing on the first attribute data and the second attribute data to obtain attribute similarity data
  • Step S440 performing fusion processing on the first conversion data and the first knowledge graph data according to the attribute similarity data and preset attribute threshold conditions.
  • wherein the first conversion data further includes first attribute data, and the first knowledge graph data further includes second attribute data. To realize knowledge fusion, the same entity from different sources must be associated, and the attribute data must also be fused: when the first entity data in the first conversion data and the second entity data in the first knowledge graph data are the same entity, similarity comparison processing is further performed on the first attribute data and the second attribute data to calculate attribute similarity data, and the attribute similarity data and the preset attribute threshold condition are used to judge whether the first attribute data is the same as the second attribute data, so that the first attribute data of the first conversion data can be fused and updated into the first knowledge graph data.
  • step S440 in the embodiment shown in FIG. 6 also includes but is not limited to the following steps:
  • Step S450 in the case that the attribute similarity data does not meet the preset attribute threshold condition, adding the first attribute data on the basis of the first knowledge graph data to obtain the second knowledge graph data;
  • Step S460 if the attribute similarity data satisfies the preset attribute threshold condition, maintain the first knowledge graph data or replace the first attribute data with the second attribute data to obtain the second knowledge graph data.
  • since the similarity between the first entity data and the second entity data is high, in order to avoid repeatedly creating attribute nodes, when the attribute similarity data does not satisfy the preset attribute threshold condition, the similarity between the first attribute data and the second attribute data can be considered low and the two attributes considered different; therefore, the first attribute data is added on the basis of the first knowledge graph data, that is, the first knowledge graph data then includes the second entity data, the first attribute data and the second attribute data, where the second entity data is associated with the first attribute data and the second attribute data respectively.
  • when the attribute similarity data satisfies the preset attribute threshold condition, the similarity between the first attribute data and the second attribute data can be considered high; in order to avoid repeatedly creating attribute nodes, the first knowledge graph data can be maintained, or the first attribute data can be used to update and replace the second attribute data in the first knowledge graph data, thereby completing the fusion update of the knowledge graph and obtaining the second knowledge graph data.
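  • the attribute-level decision can be sketched as below; the 0.8 threshold and the strategy of storing non-matching values as alternatives are assumptions for illustration:

```python
ATTRIBUTE_SIMILARITY_THRESHOLD = 0.8   # hypothetical preset attribute threshold

def fuse_attribute(graph_node: dict, attr_name: str, new_value: str,
                   similarity: float, prefer_new: bool = False) -> dict:
    """Merge one attribute of the first conversion data into an existing node."""
    if similarity < ATTRIBUTE_SIMILARITY_THRESHOLD:
        # Low similarity: the values are considered different, so the new value
        # is added alongside the existing attribute instead of replacing it.
        graph_node.setdefault(attr_name + "_alternatives", []).append(new_value)
    elif prefer_new:
        # High similarity: optionally replace the old value with the new one.
        graph_node[attr_name] = new_value
    # Otherwise (high similarity, prefer_new=False): keep the node unchanged.
    return graph_node

print(fuse_attribute({"name": "person A", "address": "region B"},
                     "address", "region B, district C", similarity=0.4))
```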
  • step S430 in the embodiment shown in FIG. 6 also includes but is not limited to the following steps:
  • Step S470, performing similarity comparison processing on the first attribute data and the second attribute data based on the Jaccard coefficient and/or term frequency–inverse document frequency to obtain attribute similarity data.
  • the text in the first attribute data and the second attribute data is short and highly similar, and in short text a single-character difference can seriously affect the similarity judgment, so the Jaccard coefficient can be used to calculate the similarity between the first attribute data and the second attribute data. The Jaccard coefficient is mainly used to calculate the similarity between individuals described by symbolic or Boolean measures; since such characteristic attributes are identified by symbols or Boolean values, the magnitude of a specific difference cannot be measured, and only whether the features shared by the individuals are consistent matters. The Jaccard coefficient can therefore be used to judge whether the first attribute data and the second attribute data are the same, so that the corresponding attribute fusion processing can be performed.
  • the attribute similarity data based on the Jaccard coefficient can be calculated as J(S, T) = |S ∩ T| / |S ∪ T|, where S is the character string representing the first attribute data and T is the character string representing the second attribute data. Using the Jaccard coefficient to judge the similarity between the first attribute data and the second attribute data avoids misjudging the similarity of short text because of a single-character difference, and improves the accuracy of the similarity comparison.
  • the importance of a word increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to its frequency in the corpus; the more often a word appears in one document and the less often it appears in all documents, the more representative it is of that document. Therefore, term frequency–inverse document frequency (TF-IDF) can be used to characterize the semantic features of the short text and to calculate the similarity between the first attribute data and the second attribute data, which reduces the workload of the similarity calculation while improving its accuracy.
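  • an illustrative Python sketch of both measures is given below, using a character-set Jaccard coefficient on the raw strings and cosine similarity over character n-gram TF-IDF vectors via scikit-learn; the tokenization granularity is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(s: str, t: str) -> float:
    """|S ∩ T| / |S ∪ T| over the character sets of the two attribute strings."""
    a, b = set(s), set(t)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def tfidf_similarity(s: str, t: str) -> float:
    """Cosine similarity between TF-IDF vectors of the two attribute strings."""
    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform([s, t])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

print(jaccard("Nanjing Road 100", "Nanjing Rd 100"))
print(tfidf_similarity("Nanjing Road 100", "Nanjing Rd 100"))
```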
  • FIG. 9 shows that the first knowledge graph data is obtained by the following steps:
  • Step S500 acquiring data to be processed from the second platform
  • Step S600 performing information extraction processing on the data to be processed to obtain second converted data
  • Step S700, importing the second conversion data into the OrientDB database to obtain the first knowledge graph data.
  • the data to be processed comes from the second platform or the knowledge graph construction platform, that is, the data to be processed can be local data and can be uploaded by the user.
  • the data to be processed may include structured data and unstructured data, for example, structured data in JSON file format and unstructured data in text file format.
  • structured data information extraction can be performed after regularization processing to improve the accuracy of knowledge map construction.
  • for unstructured data, the training model is used to extract information and convert the unstructured data into structured data; after information extraction processing is performed on the data to be processed, the second conversion data in a structured data format is obtained.
  • the second conversion data is imported into the OrientDB database, which manages the second conversion data and establishes the mapping relationship between the second conversion data extracted from the data to be processed and the knowledge graph ontology, thereby obtaining the first knowledge graph data and constructing the knowledge graph. OrientDB is an open-source database management system that includes the functions of a traditional database management system as well as document support; the knowledge graph is built for the second conversion data based on Python and OrientDB, and OrientDB's high processing performance and speed improve the construction efficiency of the knowledge graph.
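  • a minimal import sketch using the pyorient Python client is shown below; the connection details, class names and property names are illustrative assumptions (and a fresh database is assumed), not the configuration of the present application:

```python
import pyorient

client = pyorient.OrientDB("localhost", 2424)        # host and binary port
client.connect("root", "root_password")

db_name = "knowledge_graph"
if not client.db_exists(db_name, pyorient.STORAGE_TYPE_PLOCAL):
    client.db_create(db_name, pyorient.DB_TYPE_GRAPH, pyorient.STORAGE_TYPE_PLOCAL)
client.db_open(db_name, "root", "root_password")

# Map the structured second conversion data onto the graph ontology:
# vertex classes for entities, an edge class for the relation between them.
client.command("CREATE CLASS Person EXTENDS V")
client.command("CREATE CLASS Region EXTENDS V")
client.command("CREATE CLASS comes_from EXTENDS E")

client.command("CREATE VERTEX Person SET name = 'person A'")
client.command("CREATE VERTEX Region SET name = 'region B'")
client.command("CREATE EDGE comes_from FROM (SELECT FROM Person WHERE name = 'person A') "
               "TO (SELECT FROM Region WHERE name = 'region B')")
```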
  • step S600 in the embodiment shown in FIG. 9 also includes but is not limited to the following steps:
  • Step S610 performing information extraction on the second structure data according to the training model to obtain the second converted data, wherein the training model includes BERT, DGCNN and pointer network.
  • the data to be processed includes second structure data, and the second structure data is unstructured data.
  • Unstructured data means that the data structure is irregular or incomplete, there is no predefined data model, and it is inconvenient to use the database two-dimensional Logical table to represent the data.
  • in order to avoid contamination caused by differing data forms, which would affect the accuracy of information extraction, the training model is used to prune and transform the second structure data to obtain the second conversion data.
  • the training model includes BERT, DGCNN and a pointer network: the pointer network captures the start and end positions of a sequence in the text, reducing the number of extraction steps and improving extraction efficiency, while DGCNN predicts all labels at once, improving the accuracy and efficiency of prediction.
  • FIG. 11 shows a knowledge graph construction platform 1100 provided by an embodiment of the present application.
  • the knowledge graph construction platform 1100 includes a memory 1110, a processor 1120, and a computer program stored in the memory 1110 and operable on the processor 1120.
  • the processor 1120 executes the computer program, it realizes the knowledge graph construction method in the above-mentioned embodiments.
  • the memory 1110, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs, such as the knowledge graph construction method in the above embodiments of the present application.
  • the processor 1120 executes the non-transitory software programs and instructions stored in the memory 1110 to implement the knowledge graph construction method in the above embodiments of the present application.
  • the memory 1110 may include a program storage area and a data storage area, where the program storage area may store the operating system and the application program required by at least one function, and the data storage area may store the data required to execute the knowledge graph construction method in the above embodiments, and the like.
  • the memory 1110 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. It should be noted that the memory 1110 may include memory set remotely relative to the processor 1120, and such remote memory may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the non-transitory software programs and instructions required to implement the knowledge graph construction method in the above embodiments are stored in the memory, and when executed by one or more processors, they carry out the knowledge graph construction method in the above embodiments, for example, method steps S100 to S400 in FIG. 2, step S210 in FIG. 3, step S310 in FIG. 4, steps S410 to S420 in FIG. 5, steps S430 to S440 in FIG. 6, steps S450 to S460 in FIG. 7, step S470 in FIG. 8, steps S500 to S700 in FIG. 9, and step S610 in FIG. 10 described above.
  • the present application also provides a computer-readable storage medium storing computer-executable instructions that cause a computer to execute the knowledge graph construction method in the above embodiments, for example, method steps S100 to S400 in FIG. 2, step S210 in FIG. 3, step S310 in FIG. 4, steps S410 to S420 in FIG. 5, steps S430 to S440 in FIG. 6, steps S450 to S460 in FIG. 7, step S470 in FIG. 8, steps S500 to S700 in FIG. 9, and step S610 in FIG. 10 described above.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the embodiments of the present application include: acquiring heterogeneous data from a first platform and first knowledge graph data from a second platform; performing information extraction processing on the heterogeneous data to obtain first conversion data; performing similarity comparison processing on the first conversion data and the first knowledge graph data to obtain similarity data; and, according to the similarity data and preset threshold conditions, performing fusion construction processing on the first conversion data and the first knowledge graph data to obtain second knowledge graph data.
  • according to the solutions provided by the embodiments of the present application, the knowledge graph construction method is applied to a knowledge graph construction platform and can acquire data from different platforms, including heterogeneous data from the first platform and first knowledge graph data from the second platform, so that data from different fields on different platforms can be used to fuse and update the knowledge graph.
  • in order to improve the accuracy of the knowledge graph and facilitate subsequent processing, information extraction is performed on the heterogeneous data from the different platforms to extract the first conversion data carrying key attributes; similarity comparison processing is then performed on the first knowledge graph data and the first conversion data to obtain similarity data, and according to whether the similarity data satisfies the preset threshold conditions, the heterogeneous data is either fused into the first knowledge graph data or added on the basis of the first knowledge graph data to obtain the second knowledge graph data, so that the knowledge graph is automatically updated and constructed with data from different fields and platforms, improving its accuracy and efficiency.
  • computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides a knowledge graph construction method, platform and computer storage medium. The method is applied to a knowledge graph construction platform, and the knowledge graph construction platform includes heterogeneous data from a first platform and first knowledge graph data from a second platform. The method includes: acquiring the heterogeneous data from the first platform and the first knowledge graph data from the second platform (S100); performing information extraction processing on the heterogeneous data to obtain first conversion data (S200); performing similarity comparison processing on the first conversion data and the first knowledge graph data to obtain similarity data (S300); and, according to the similarity data and preset threshold conditions, performing fusion construction processing on the first conversion data and the first knowledge graph data to obtain second knowledge graph data (S400).

Description

知识图谱构建方法、平台及计算机存储介质
相关申请的交叉引用
本申请基于申请号为202111308484.2、申请日为2021年11月05日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及数据处理技术领域,尤其涉及一种知识图谱构建方法、平台及计算机存储介质。
背景技术
随着5G技术的快速发展,为了能够将人、流程、数据和事物结合一起使得网络连接变得更加相关,而知识图谱作为一种整合数据和治理数据的有效工具,能够利用图谱分析进行关联关系挖掘的技术手段,洞察数据之间的关系和逻辑,为决策提供支持。此外,在搭建领域知识库的过程中,知识图谱实现了知识的建模、抽取、融合、存储、应用,同时将相关知识进行关联,达到智能化的知识应用水平,成为了企业推进人工智能应用部署的重要技术手段之一,当前,知识图谱技术已被越来越多的行业所采纳。由于构建一套知识图谱需要复杂的信息抽取和数据处理流程,相关技术中的知识图谱仅能针对于一个特定领域,因此目前知识图谱所应用的范围和作用较小。
发明内容
以下是对本文详细描述的主题的概述。本概述并非是为了限制权利要求的保护范围。
本申请实施例提供了一种知识图谱构建方法、平台、设备及计算机存储介质。
第一方面,本申请实施例提供了一种知识图谱构建方法,应用于知识图谱构建平台,所述知识图谱构建平台包括来自于第一平台的异构数据和来自于第二平台的第一知识图谱数据,所述方法包括:获取来自于第一平台的异构数据和来自于第二平台的第一知识图谱数据;对所述异构数据进行信息抽取处理,得到第一转化数据;对所述第一转化数据和所述第一知识图谱数据进行相似度比较处理,得到相似度数据;根据所述相似度数据和预设阈值条件,对所述第一转化数据和所述第一知识图谱数据进行融合构建处理,得到第二知识图谱数据。
第二方面,本申请实施例提供一种知识图谱构建平台,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如上述第一方面的知识图谱构建方法。
第三方面,本申请实施例提供一种计算机可读存储介质,存储有计算机程序,所述计算机程序被处理器执行时,实现如上述第一方面的知识图谱构建方法。
本申请的其它特征和优点将在随后的说明书中阐述,并且,部分地从说明书中变得显而易见,或者通过实施本申请而了解。本申请的目的和其他优点可通过在说明书、权利要求书以及附图中所特别指出的结构来实现和获得。
附图说明
附图用来提供对本申请技术方案的进一步理解,并且构成说明书的一部分,与本申请的实施例一起用于解释本申请的技术方案,并不构成对本申请技术方案的限制。
图1是本申请实施例提供的用于执行知识图谱构建方法的知识图谱构建系统;
图2是本申请实施例提供的知识图谱构建方法的流程图;
图3是图2中步骤S200的实现过程示意图;
图4是图2中步骤S300的实现过程示意图;
图5是图2中步骤S400的实现过程示意图;
图6是图5中步骤S410的实现过程示意图;
图7是图6中步骤S440的实现过程示意图;
图8是图6中步骤S430的实现过程示意图;
图9是本申请实施例提供的第一知识图谱数据形成的流程示意图;
图10是图9中步骤S600的实现过程示意图;
图11是本申请实施例提供的一种知识图谱构建平台的结构示意图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅用以解释本申请,并不用于限定本申请。
需要说明的是,虽然在模块示意图中进行了功能模块划分,在流程图中示出了逻辑顺序,但是在某些情况下,可以以不同于模块中的模块划分,或流程图中的顺序执行所示出或描述的步骤。说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。
本申请提供了一种知识图谱构建方法,通过获取来自于不同平台的异构数据和第一知识图谱数据,自动对异构数据和第一知识图谱数据进行相似度比较处理,得到相似度数据,从而能够利用相似度数据和预设阈值条件进行判断异构数据与第一知识图谱数据之间的相似度。基于异构数据与第一知识图谱数据之间的相似度,对异构数据和第一知识图谱数据进行不同的融合构建处理,从而能够结合不同平台、不同领域的数据,并基于不同平台的数据之间的相似度,自动进行相应的融合构建处理,提高知识图谱构建的准确性和效率。
为便于理解,下面结合附图对本申请实施例提供的知识图谱构建方法的应用场景进行介绍。
图1示出了一种用于执行知识图谱构建方法的知识图谱构建系统100,该知识图谱构建系统100包括:信息采集模块110,信息抽取模块120,知识映射模块130和知识融合模块140。其中,信息采集模块110能够获取用于构建知识图谱的基础数据,包括来自于第一平台的异构数据和来自于第二平台的第一知识图谱数据,信息采集模块110也可以获取知识图谱构建系统的本地数据。信息抽取模块120可以对信息采集模块110所获取的数据进行抽取,抽取出实体信息和关系信息。知识映射模块130用于建立从基础数据抽取出的结构化信息与知识图谱本体的映射关系,能够通过Python接口创建、打开、查询和删除数据库,还能够通 过Python接口对节点、边、集群和记录的增加、删除、修改和查找。知识融合模块140用于将来自于不同平台的异构数据和第一知识图谱数据进行关联融合,对知识图谱进行融合更新。
需要说明的是,信息抽取模块120可以包括有普通抽取模块和模型抽取模块,普通抽取模块用于对结构化数据进行清洗以及信息抽取处理,得到转化数据。模型抽取模块用于利用训练模型对非结构化数据进行信息抽取,转化为结构化的转化数据。
需要说明的是,知识融合模块140还可以设置有泛用性数据预处理模块,融合标识符配置模块、互斥属性配置模块和相似度判断模块。泛用性数据预处理模块能够对异构数据进行修剪和转化,对异构数据中的结构化数据进行清洗,对异构数据中的非结构化数据进行模型抽取。融合标识符配置模块用于确定数据在知识融合过程中所使用的关键属性。互斥属性配置模块用于确定数据在知识融合过程中所使用的互斥属性。相似度判断模块包括多个阈值控制门,用于设定不同阈值来满足不同领域数据类型的融合需要,对异构数据进行相应的相似度判断。
本申请实施例描述的用于执行知识图谱构建方法的知识图谱构建系统100是为了更加清楚的说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定,本领域技术人员可知,随着知识图谱构建系统100的演变和新应用场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本领域技术人员可以理解的是,图1中示出的知识图谱构建系统100的结构并不构成对本申请实施例的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
基于上述知识图谱构建系统100的结构,提出本申请的知识图谱构建方法的各个实施例。
参照图2,图2示出了本申请实施例提供的知识图谱构建方法的流程图,该知识图谱构建方法可以应用于知识图谱构建平台,该知识图谱构建-平台包括来自于第一平台的异构数据和来自于第二平台的第一知识图谱数据,该知识图谱构建方法包括但不限于有以下步骤:
步骤S100,获取来自于第一平台的异构数据和来自于第二平台的第一知识图谱数据;
步骤S200,对异构数据进行信息抽取处理,得到第一转化数据;
步骤S300,对第一转化数据和第一知识图谱数据进行相似度比较处理,得到相似度数据;
步骤S400,根据相似度数据和预设阈值条件,对第一转化数据和第一知识图谱数据进行融合构建处理,得到第二知识图谱数据。
可以理解的是,知识图谱构建平台中包括有来自于第一平台的异构数据和来自于第二平台的第一知识图谱数据,即来自于不同平台的数据,其中,异构数据的领域与第一知识图谱数据的领域可以相同,也可以不相同,从而能够利用不同领域和不同平台的数据进行融合构建知识图谱,提高知识图谱的准确性。而第二平台可以为知识图谱构建平台,即第一知识图谱数据可以为知识图谱构建平台的本地数据,提高数据获取速度,提高知识图谱的构建效率。
由于数据中所包含的信息是复杂无序,通过对异构数据进行信息抽取处理,从异构数据中抽取得到融合构建知识图谱的有效信息,即从自然语言文本中抽取指定类型的实体、关系、事件等事实信息,并将这些事实信息转化为结构化的第一转化数据。结构化的数据指可以使用关系型数据库表示和存储,表现为二维形式的数据,一般特点是数据以行为单位,一行数据表示一个实体的信息,每一行数据的属性是相同的。所以,结构化的数据的存储和排列是很有规律的,这对查询和修改等操作很有帮助,以便于进行后续的相似度比较处理,提高知 识图谱的构建效率。
知识图谱的融合构建是需要将不同来源的同一实体关联起来,同时还需要进行属性融合。需要对不同数据的实体以及属性进行融合分析,将同一实体的结点或属性的结点进行合并,避免重复创建实体结点或属性结点。因此,对第一知识图谱数据与第一转化数据进行相似度比较处理,即计算第一知识图谱数据与第一转化数据之间的相似度,得到相似度数据,利用相似度数据判断第一知识图谱数据中的实体结点以及属性结点是否与第一转化数据中的实体结点以及属性结点进行合并,以完成知识图谱的融合构建处理。
由于来自于不同领域和平台的数据特点不尽相同,为了满足不同领域的数据的融合需求,通过预先设置不同的阈值条件,判断相似度数据是否满足预设阈值条件,对第一转化数据和第一知识图谱数据进行相应的融合构建处理,得到第二知识图谱数据,完成知识图谱的融合更新,从而实现将不同领域、不同平台的数据进行结合,自动利用异构数据进行知识图谱的融合构建,提高知识图谱的准确性,提高构建效率。
参照图3,图2所示实施例中的步骤S200还包括但不限于有以下步骤:
步骤S210,根据训练模型对第一结构数据进行信息抽取得到第一转化数据,其中,训练模型包括基于转换器的双向编码表征BERT、膨胀门卷积神经网络DGCNN和指针网络。
可以理解的是,来自于第一平台的异构数据中会包括结构化数据和非结构化数据,其中,第一结构数据可以是非结构化数据。结构化的数据是指可以使用关系型数据库表示和存储,表现为二维形式的数据。一般特点是:数据以行为单位,一行数据表示一个实体的信息,每一行数据的属性是相同的。所以,结构化的数据的存储和排列是很有规律的,这对查询和修改等操作很有帮助。结构化数据通常具有固定的格式,已经满足信息抽取的结构化条件,因此,只需要对结构化数据进行正则化清洗,即能够进行信息抽取。对结构化数据进行正则化清洗包括将数据文本中的单位替换为统一格式,或者将文本中首字母略缩词替换为完整单词,或者去掉数据文本中的标点符号,或者将缩写替换为全拼、将阿拉伯数字替换为英文数字、将美元复数替换为单数等,从而能够简化数据文本,便于文本识别,提高信息抽取的准确性。
而非结构化数据是数据结构不规则或不完整,没有预定义的数据模型,不方便用数据库二维逻辑表来表现的数据。例如,非结构化数据可以是所有格式的办公文档、文本、图片、各类报表、图像、音频和视频信息等等。非结构化数据其格式非常多样,标准也是多样性的,而且在技术上非结构化信息比结构化信息更难标准化和理解。对非结构化数据进行修剪和转化,避免因数据形式不同而产生污染,影响信息提取的准确性。因此,为了提高非结构化数据信息抽取的准确性,需要对非结构化的第一结构数据进行模型抽取处理。模型抽取处理需要根据用户提供的训练数据对训练模型进行训练,再利用训练好的训练模型对第一结构数据进行信息抽取。其中,训练模型包括基于转换器的双向编码表征(Bidirectional Encoder Representation from Transformers,BERT)、膨胀门卷积神经网络(Dilate Gated Convolutional Neural Network,DGCNN)和指针网络。相关技术中的抽取模型通常由BERT、双向长短期记忆(Bi-directional Long Short-Term Memory,BiLSTM)和条件随机场(Conditional Random Field,CRF)组成。虽然相关技术中的抽取模型采用BERT和BiLSTM,对序列标注的精度较高,但没有状态转移的条件约束,并且是串行检测方法,抽取模型容易输出一个完全错误的标注序列。而训练模型通过BERT、DGCNN和指针网络,以及门控机制降低梯度消失风险,并且运用残差方法使得信息能够在多通道传输,还能够在不增加模型参数 的基础上,捕捉文本中更远的距离,另外,通过指针网络能够在文本中捕捉序列中的首尾位置,提高训练模型抽取的准确性以及减少抽取步骤,提高抽取效率。
需要说明的是,对于一组已标注的数据,B表示一个命名实体的开头,I表示当前词为命名实体的后面部分,O表示不是命名实体。例如,一个测试句子为“the(B) wall(I) street(I) journal(I) reported(O) today(O) that(O) apple(B) corporation(I) made(O) money(O)”。其中,“the wall street journal”(华尔街日报)、“apple corporation”(苹果公司)为命名实体。
对于BiLSTM和DGCNN,当将测试句子的“the wall street journal”进入BiLSTM时,该命名实体是按照“the→wall→street→journal”的先后顺序依次输出的序列标注“B→I→I→I”。而当将测试句子的“the wall street journal”进入DGCNN,DGCNN将“the wall street journal”一次输入直接得到“BIII”。可以看出,BiLSTM是一种串行方法,其处理方式为逐个预测标签。而DGCNN采用并行方式,一次预测所有标签,提高预测的准确性和效率。
对于实体“the wall street journal”,CRF可能只解码出“the wall street”(华尔街)这个实体。而指针网络捕捉到实体的首个单词“the”和最后一个单词“journal”,从而可将整个实体识别,因此,采用指针网络进行实体识别,能够提高识别效果,提高识别准确率。
另外,训练模型在针对关系抽取的过程,能够减少关系抽取的步骤,提高信息抽取的效率。例如,对文本“人物A来自地区B”抽取三元组信息,相关技术中的抽取模型会先进性命名实体抽取出实体“人物A”和“地区B”,再将这两个命名实体输入通过文本分类得出关系“来自”。而训练模型则可以捕捉到命名实体“人物A”“地区B”以及“来自”的头尾两个字词,从而能够直接抽取整个三元组信息,减少关系抽取步骤,提高抽取效率。
参照图4,图2所示实施例中的步骤S300还包括但不限于有以下步骤:
步骤S310,对第一实体数据和第二实体数据进行相似度比较处理,得到实体相似度数据。
其中,第一转化数据包括第一实体数据,第一知识图谱数据包括第二实体数据。
可以理解的是,通过对异构数据进行信息抽取,将异构数据中的实体信息进行抽取,并转化为第一转化数据中的第一实体数据。而第一知识图谱数据中包括有作为实体结点的第二实体数据。为了将不同来源的同一实体进行关联,同时避免重复创建实体结点,因此需要对第一转化数据中的第一实体数据和第一知识图谱数据中的第二实体数据进行相似度比较处理,得到实体相似度数据,从而利用实体相似度数据判断第一实体数据与第二实体数据是否为同一实体。若认为第一实体数据与第二实体数据为同一实体,则对第一实体数据融合至第二实体数据中;若认为第一实体数据与第二实体数据不是同一实体,则在第一知识图谱数据中添加第一实体数据所对应的第一转化数据,更新第一知识图谱数据,从而实现利用不同平台、不同领域的数据对知识图谱进行融合构建,提高知识图谱的准确性。
需要说明的是,实体相似度数据为第一实体数据与第二实体数据之间的相似度。而第一实体数据与第二实体数据的相似度比较,可以采用最小编辑距离算法进行计算,即采用最小的编辑操作将第一实体数据中的字符串转换为第二实体数据中的字符串,其中,最小编辑距离可以通过如下公式进行计算:
Figure PCTCN2022126759-appb-000001
Figure PCTCN2022126759-appb-000002
其中,N为第一实体数据中待转换字符串的长度,M为第二实体数据中目标字符串的长度。
因此,采用最小编辑距离对第一实体数据与第二实体数据之间的相似度进行计算,最小编辑距离越小,则说明第一实体数据与第二实体数据之间的重复率越高,相似度越高。
参照图5,图2所示实施例中的步骤S400还包括但不限于有以下步骤:
步骤S410,在实体相似度数据不满足预设实体阈值条件的情况下,对第一转化数据和第一知识图谱数据进行融合处理;
或者,
步骤S420,在实体相似度数据满足预设实体阈值条件的情况下,在第一知识图谱数据的基础上添加第一转化数据,得到第二知识图谱数据。
可以理解的是,若采用最小编辑距离计算实体相似度数据,而在实体相似度数据不满足预设实体阈值条件的情况下,即,实体相似度数据中的最小编辑距离小于或等于预设实体阈值条件中的实体距离阈值,可以认为第一实数据与第二实体数据的相似度高,第一实体数据与第二实体数据为同一实体,需要对第一转化数据和第一知识图谱数据进行融合处理,将第一实体数据所对应的第一转化数据和第二实体数据所对应的第一知识图谱数据关联起来。
在实体相似度数据满足预设实体阈值条件的情况下,即实体相似度数据中的最小编辑距离大于预设实体阈值条件中的实体距离阈值,可以认为第一实体数据与第二实体数据的相似度低,第一实体数据与第二实体数据不是同一实体,则利用第一实体数据所对应的第一转化数据创建实体结点,在第一知识图谱数据的基础上添加第一转化数据,完成知识图谱的融合更新,得到第二知识图谱数据,形成新的知识图谱。因此,通过对第一转化数据和第一知识图谱数据计算实体相似度,根据不同的实体相似度进行不同的融合构建处理,能够避免在知识融合过程中无效计算过多,提高知识融合的效率。
需要说明的是,第一转化数据中包含有多个第一比较数据,第一知识图谱数据中包含有多个第二比较数据。为了提高知识融合的准确性,在进行知识融合之前,可以为第一转化数据和第一知识图谱数据进行指定融合标识符,选择在知识融合时可用于表征实体本质的关键属性组以用于进行实体相似度判断,即利用融合标识符从第一比较数据中确定出第一实体数据,并且从第二比较数据中确定第二实体数据,从而仅通过比较第一实体数据和第二实体数据能够判断出第一实体数据与第二实体数据是否为同一实体,避免全量比对时所造成的无效计算,且可以一定程度上提升知识融合时相似度计算的准确性。同时,用户需要指定互斥属性。例如,人的性别属性即为互斥属性,性别不同的两个实体必然是不同的实体。因此,可以通过对互斥属性的判断,对人的性别数据融入融合标识符,从而提高知识融合效率和准确率。
参照图6,图5所示实施例中的步骤S410还包括但不限于有以下步骤:
步骤S430,对第一属性数据和第二属性数据进行相似度比较处理,得到属性相似度数据;
步骤S440,根据属性相似度数据和预设属性阈值条件,对第一转化数据和第一知识图谱数据进行融合处理。
其中,第一转化数据还包括第一属性数据,第一知识图谱数据还包括第二属性数据。
可以理解的是,实现知识融合,需要将不同来源的同一实体关联起来,同时还需要对数据中的属性数据进行融合。在第一转化数据中的第一实体数据与第一知识图谱数据的第二实体数据为同一实体的情况下,还需要对第一转化数据中的第一属性数据与第一知识图谱数据中的第二属性数据进行相似度比较处理,计算得到属性相似度数据。通过属性相似度数据和预设属性阈值条件判断第一属性数据是否与第二属性数据相同,从而进行融合处理,将第一转化数据的第一属性数据融合更新至第一知识图谱数据中。
参照图7,图6所示实施例中的步骤S440还包括但不限于有以下步骤:
步骤S450,在属性相似度数据不满足预设属性阈值条件的情况下,在第一知识图谱数据的基础上添加第一属性数据,得到第二知识图谱数据;
或者,
步骤S460,在属性相似度数据满足预设属性阈值条件的情况下,维持第一知识图谱数据或将第一属性数据对第二属性数据进行替换,得到第二知识图谱数据。
可以理解的是,第一实体数据与第二实体数据之间的相似度高,为了避免重复创建属性结点,在属性相似度数据不满足预设属性阈值条件的情况下,可以认为,第一属性数据与第二属性数据之间的相似度低,认为第一属性数据与第二属性数据不相同,因此,在第一知识图谱数据的基础上添加第一属性数据,即第一知识图谱数据包括第二实体数据、第一属性数据和第二属性数据,其中,第二实体数据分别与第一属性数据和第二属性数据相关联。在属性的相似度数据满足预设属性阈值条件的情况下,可以认为第一属性数据与第二属性数据之间的相似度高,为了避免重复创建属性结点,可以维持第一知识图谱数据,或者采用第一属性数据对第一知识图谱数据中的第二属性数据进行更新替换,从而完成知识图谱的融合更新,得到第二知识图谱数据。
参照图8,图6所示实施例中的步骤S430还包括但不限于有以下步骤:
步骤S470,基于杰卡德系数和/或词频-逆向文件频率至对第一属性数据和第二属性数据进行相似度比较处理,得到属性相似度数据。
可以理解的是,第一属性数据以及第二属性数据中的文本长度较短,且具有较高的相似性,而且在短文本中的一字之差会严重影响相似度的判断,因此可以利用杰卡德系数,即Jaccard系数,对第一属性数据与第二属性数据之间的相似度进行计算。由于Jaccard系数主要用于计算符号度量或布尔值度量的个体间的相似度,因为个体的特征属性都是由符号度量或者布尔值标识,因此无法衡量差异具体值的大小,仅关系个体间共同具有的特征是否一致,从而能够利用Jaccard系数来判断第一属性数据和第二属性数据是否相同,以进行相应的属性融合处理。
其中,基于Jaccard系数的属性相似度数据可以通过如下公式进行计算:
Figure PCTCN2022126759-appb-000003
其中,S为表示第一属性数据的字符串,T为表示第二属性数据的字符串。利用Jaccard系数对第一属性数据和第二属性数据的相似度进行判断,能够避免短文本中因一字之差而导 致相似度的误判断,提高相似度比较的准确性。
可以理解的是,一个词语的重要性随着该词语在文件中出现的次数成正比增加,但同时会随着该词语在语料库中出现的频率成反比下降。所以,一个词语在一篇文章中出现次数越多,同时在所有文档中出现次数越少,越能够代表该文章。因此,通过词频-逆向文件频率(Term Frequency–Inverse Document Frequency,TF-IDF)表征短文本的语义特征,计算第一属性数据与第二属性数据之间的相似度,能够降低相似度计算的工作量,同时提高相似度计算的准确率。
参照图9,图9示出了第一知识图谱数据由以下步骤得到:
步骤S500,获取来自于第二平台的待处理数据;
步骤S600,对待处理数据进行信息抽取处理,得到第二转化数据;
步骤S700,将第二转化数据导入OrientDB数据库,得到第一知识图谱数据。
可以理解的是,获取用于构建知识图谱的待处理数据,待处理数据来自于第二平台,也可以来自于知识图谱构建平台,即待处理数据可以为本地数据,可以通过用户上传得到。待处理数据可以包含有结构化数据和非结构化数据,例如,JSON文件格式的结构化数据和文本文件格式的非结构化数据。对于结构化数据,可以进行正则化处理后进行信息抽取处理,提高知识图谱构建的准确性。对于非结构化数据,则利用训练模型进行信息抽取,将非结构化数据转化为结构化数据。对待处理数据进行信息抽取处理后,得到结构化数据格式的第二转化数据。将第二转化数据导入OrientDB数据库中,利用OrientDB数据库对第二转化数据进行管理处理,建立从待处理数据中抽取出的第二转化数据与知识图谱本体的映射关系,得到第一知识图谱数据,构建知识图谱。OrientDB数据库是一个开源数据库管理系统,包含有传统数据库管理系统的功能以及文档,基于Python和OrientDB对第二转化数据进行知识图谱构建,由于OrientDB的处理性能高,处理速度快,能够提高知识图谱的构建效率。
参照图10,图9所示实施例中的步骤S600还包括但不限于有以下步骤:
步骤S610,根据训练模型对第二结构数据进行信息抽取得到第二转化数据,其中,训练模型包括BERT、DGCNN和指针网络。
可以理解的是,待处理数据包括第二转化数据,第二转化数据为非结构化数据,非结构化数据是数据结构不规则或不完整,没有预定义的数据模型,不方便用数据库二维逻辑表来表现的数据。为了避免因数据形式不同而产生污染,影响信息提取的准确性,利用训练模型对第二结构数据进行修剪和转化,得到第二转化数据。其中,训练模型包括BERT、DGCNN和指针网络。利用指针网络在文本中捕捉序列的首尾位置,减少抽取步骤,提高抽取效率,同时利用DGCNN一次预测所有标签,提高预测的准确性和效率。
参照图11,图11示出了本申请实施例提供的知识图谱构建平台1100。该知识图谱构建平台1100包括存储器1110、处理器1120及存储在存储器1110上并可在处理器1120上运行的计算机程序,处理器1120执行计算机程序时实现如上述实施例中的知识图谱构建方法。
存储器1110作为一种非暂态计算机可读存储介质,可用于存储非暂态软件程序以及非暂态性计算机可执行程序,如本申请上述实施例中的知识图谱构建方法。处理器1120通过运行存储在存储器1110中的非暂态软件程序以及指令,从而实现上述本申请上述实施例中的知识图谱构建方法。
存储器1110可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至 少一个功能所需要的应用程序;存储数据区可存储执行上述实施例中的知识图谱构建方法所需的数据等。此外,存储器1110可以包括高速随机存取存储器1110,还可以包括非暂态存储器1110,例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。需要说明的是,存储器1110可包括相对于处理器1120远程设置的存储器1110,这些远程存储器1110可以通过网络连接至该终端。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
实现上述实施例中的知识图谱构建方法所需的非暂态软件程序以及指令存储在存储器中,当被一个或者多个处理器执行时,执行上述实施例中的知识图谱构建方法,例如,执行以上描述的图2中的方法步骤S100至步骤S400、图3中的方法步骤S210、图4中的方法步骤S310、图5中的方法步骤S410至步骤S420、图6中的方法步骤S430至步骤S440、图7中的方法步骤S450至步骤S460、图8中的方法步骤S470、图9中的方法步骤S500至步骤S700和图10中的方法步骤S610。
本申请还提供了一种计算机可读存储介质,计算机可读存储介质存储有计算机可执行指令,计算机可执行指令用于使计算机执行如上述实施例中的知识图谱构建方法,例如,执行以上描述的图2中的方法步骤S100至步骤S400、图3中的方法步骤S210、图4中的方法步骤S310、图5中的方法步骤S410至步骤S420、图6中的方法步骤S430至步骤S440、图7中的方法步骤S450至步骤S460、图8中的方法步骤S470、图9中的方法步骤S500至步骤S700和图10中的方法步骤S610。
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。
本申请实施例包括:获取来自于第一平台的异构数据和来自于第二平台的第一知识图谱数据;对异构数据进行信息抽取处理,得到第一转化数据;对第一转化数据和第一知识图谱数据进行相似度比较处理,得到相似度数据;根据相似度数据和预设阈值条件,对第一转化数据和第一知识图谱数据进行融合构建处理,得到第二知识图谱数据。根据本申请实施例提供的方案,知识图谱构建方法应用于知识图谱构建平台,能够获取来自于不同平台的数据,包括来自于第一平台的异构数据和来自于第二平台的第一知识图谱数据,从而能够利用不同平台的不同领域的数据对知识图谱进行融合更新。为了提高知识图谱的准确性以及便于后续步骤处理,对来自于不同平台的异构数据进行信息抽取处理,抽取出带有关键属性的第一转化数据。对第一知识图谱数据与第一转化数据进行相似度比较处理,得到相似度数据,从而判断相似度数据是否满足预设阈值条件,利用异构数据对第一知识图谱数据进行融合更新,或者将异构数据在第一知识图谱数据的基础上进行添加更新,得到第二知识图谱数据,实现利用不同领域、不同平台的数据自动对知识图谱进行更新构建处理,提高知识图谱的准确性和效率。
本领域普通技术人员可以理解,上文中所公开方法中的全部或某些步骤、系统可以被实施为软件、固件、硬件及其适当的组合。某些物理组件或所有物理组件可以被实施为由处理 器,如中央处理器、数字信号处理器或微处理器执行的软件,或者被实施为硬件,或者被实施为集成电路,如专用集成电路。这样的软件可以分布在计算机可读介质上,计算机可读介质可以包括计算机存储介质(或非暂时性介质)和通信介质(或暂时性介质)。如本领域普通技术人员公知的,术语计算机存储介质包括在用于存储信息(诸如计算机可读指令、数据结构、程序模块或其他数据)的任何方法或技术中实施的易失性和非易失性、可移除和不可移除介质。计算机存储介质包括但不限于RAM、ROM、EEPROM、闪存或其他存储器技术、CD-ROM、数字多功能盘(DVD)或其他光盘存储、磁盒、磁带、磁盘存储或其他磁存储装置、或者可以用于存储期望的信息并且可以被计算机访问的任何其他的介质。此外,本领域普通技术人员公知的是,通信介质通常包含计算机可读指令、数据结构、程序模块或者诸如载波或其他传输机制之类的调制数据信号中的其他数据,并且可包括任何信息递送介质。
上面结合附图对本申请实施例作了详细说明,但是本申请不限于上述实施例,在技术领域普通技术人员所具备的知识范围内,还可以在不脱离本申请宗旨的前提下作出各种变化。

Claims (11)

  1. 一种知识图谱构建方法,应用于知识图谱构建平台,所述知识图谱构建平台包括来自于第一平台的异构数据和来自于第二平台的第一知识图谱数据,所述方法包括:
    获取来自于第一平台的异构数据和来自于第二平台的第一知识图谱数据;
    对所述异构数据进行信息抽取处理,得到第一转化数据;
    对所述第一转化数据和所述第一知识图谱数据进行相似度比较处理,得到相似度数据;
    根据所述相似度数据和预设阈值条件,对所述第一转化数据和所述第一知识图谱数据进行融合构建处理,得到第二知识图谱数据。
  2. 根据权利要求1所述的知识图谱构建方法,其中,所述异构数据包括第一结构数据;
    所述对所述异构数据进行信息抽取处理,得到第一转化数据,包括:
    根据训练模型对所述第一结构数据进行信息抽取得到第一转化数据,其中,所述训练模型包括基于转换器的双向编码表征BERT、膨胀门卷积神经网络DGCNN和指针网络。
  3. 根据权利要求1所述的知识图谱构建方法,其中,所述第一转化数据包括第一实体数据,所述第一知识图谱数据包括第二实体数据;
    所述对所述第一转化数据和所述第一知识图谱数据进行相似度比较处理,得到相似度数据,包括:
    对所述第一实体数据和所述第二实体数据进行相似度比较处理,得到实体相似度数据。
  4. 根据权利要求3所述的知识图谱构建方法,其中,所述根据所述相似度数据和预设阈值条件,对所述第一转化数据和所述第一知识图谱数据进行融合构建处理,得到第二知识图谱数据,包括:
    在所述实体相似度数据不满足预设实体阈值条件的情况下,对所述第一转化数据和所述第一知识图谱数据进行融合处理;
    或者,
    在所述实体相似度数据满足预设实体阈值条件的情况下,在所述第一知识图谱数据的基础上添加所述第一转化数据,得到第二知识图谱数据。
  5. 根据权利要求4所述的知识图谱构建方法,其中,所述异构数据还包括第一属性数据,所述第一知识图谱数据还包括第二属性数据;
    所述对所述第一转化数据和所述第一知识图谱数据进行融合处理,包括:
    对所述第一属性数据和所述第二属性数据进行相似度比较处理,得到属性相似度数据;
    根据所述属性相似度数据和预设属性阈值条件,对所述第一转化数据和所述第一知识图谱数据进行融合处理。
  6. 根据权利要求5所述的知识图谱构建方法,其中,所述根据所述属性相似度数据和预设属性阈值条件,对所述第一转化数据和所述第一知识图谱数据进行融合处理,包括:
    在所述属性相似度数据不满足预设属性阈值条件的情况下,在所述第一知识图谱数据的基础上添加所述第一属性数据,得到第二知识图谱数据;
    或者,
    在所述属性相似度数据满足预设属性阈值条件的情况下,维持所述第一知识图谱数据或将所述第一属性数据对所述第二属性数据进行替换,得到第二知识图谱数据。
  7. 根据权利要求5所述的知识图谱构建方法,其中,所述对所述第一属性数据和所述第二属性数据进行相似度比较处理,得到属性相似度数据,包括:
    基于杰卡德系数和/或词频-逆向文件频率至对所述第一属性数据和所述第二属性数据进行相似度比较处理,得到属性相似度数据。
  8. 根据权利要求1至7任意一项所述的知识图谱构建方法,其中,所述第一知识图谱数据由以下步骤得到:
    获取来自于所述第二平台的待处理数据;
    对所述待处理数据进行信息抽取处理,得到第二转化数据;
    将所述第二转化数据导入OrientDB数据库,得到第一知识图谱数据。
  9. 根据权利要求8所述的知识图谱构建方法,其中,所述待处理数据包括第二结构数据;
    所述对所述待处理数据进行信息抽取处理,得到第二转化数据,包括:
    根据训练模型对所述第二结构数据进行信息抽取得到第二转化数据,其中,所述训练模型包括BERT、DGCNN和指针网络。
  10. 一种知识图谱构建平台,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,所述处理器执行所述计算机程序时实现如权利要求1至9中任意一项所述的知识图谱构建方法。
  11. 一种计算机可读存储介质,存储有计算机程序,所述计算机程序被处理器执行时,实现如权利要求1至9中任意一项所述的知识图谱构建方法。
PCT/CN2022/126759 2021-11-05 2022-10-21 知识图谱构建方法、平台及计算机存储介质 WO2023078104A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111308484.2 2021-11-05
CN202111308484.2A CN116089623A (zh) 2021-11-05 2021-11-05 知识图谱构建方法、平台及计算机存储介质

Publications (1)

Publication Number Publication Date
WO2023078104A1 true WO2023078104A1 (zh) 2023-05-11

Family

ID=86187358

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/126759 WO2023078104A1 (zh) 2021-11-05 2022-10-21 知识图谱构建方法、平台及计算机存储介质

Country Status (2)

Country Link
CN (1) CN116089623A (zh)
WO (1) WO2023078104A1 (zh)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920588A (zh) * 2018-06-26 2018-11-30 北京光年无限科技有限公司 一种用于人机交互的知识图谱更新方法及系统
CN110489561A (zh) * 2019-07-12 2019-11-22 平安科技(深圳)有限公司 知识图谱构建方法、装置、计算机设备和存储介质
CN111708893A (zh) * 2020-05-15 2020-09-25 北京邮电大学 基于知识图谱的科技资源整合方法及系统
CN113157930A (zh) * 2020-12-30 2021-07-23 上海科技发展有限公司 基于多源异构数据的知识图谱构建方法、系统以及终端

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116720786A (zh) * 2023-08-01 2023-09-08 中国科学院工程热物理研究所 一种融合kg和plm的装配质量稳定性预测方法、系统及介质
CN116720786B (zh) * 2023-08-01 2023-10-03 中国科学院工程热物理研究所 一种融合kg和plm的装配质量稳定性预测方法、系统及介质
CN118296976A (zh) * 2024-06-06 2024-07-05 浙江大学 微带滤波器设计迭代方法、系统、介质、产品及终端

Also Published As

Publication number Publication date
CN116089623A (zh) 2023-05-09

Similar Documents

Publication Publication Date Title
US10650188B2 (en) Constructing a narrative based on a collection of images
CN109635171B (zh) 一种新闻节目智能标签的融合推理系统和方法
WO2023078104A1 (zh) 知识图谱构建方法、平台及计算机存储介质
CN108804521B (zh) 一种基于知识图谱的问答方法及农业百科问答系统
WO2020001373A1 (zh) 一种本体构建方法及装置
CN106776711B (zh) 一种基于深度学习的中文医学知识图谱构建方法
CN112699246B (zh) 基于知识图谱的领域知识推送方法
CN107562772B (zh) 事件抽取方法、装置、系统和存储介质
CN103678684A (zh) 一种基于导航信息检索的中文分词方法
CN111460153A (zh) 热点话题提取方法、装置、终端设备及存储介质
CN109408578B (zh) 一种针对异构环境监测数据融合方法
CN109522396B (zh) 一种面向国防科技领域的知识处理方法及系统
CN113239111B (zh) 一种基于知识图谱的网络舆情可视化分析方法及系统
CN112989827B (zh) 一种基于多源异构特征的文本数据集质量评估方法
CN113971210B (zh) 一种数据字典生成方法、装置、电子设备及存储介质
CN106874397B (zh) 一种面向物联网设备的自动语义标注方法
CN116628173A (zh) 一种基于关键字提取的智能客服信息生成系统及生成方法
WO2024078105A1 (zh) 专利文献中的技术问题抽取方法及相关设备
CN114997288A (zh) 一种设计资源关联方法
KR20220074576A (ko) 마케팅 지식 그래프 구축을 위한 딥러닝 기반 신조어 추출 방법 및 그 장치
CN116821376B (zh) 煤矿安全生产领域的知识图谱构建方法及系统
CN117874247A (zh) 一种基于知识图谱的全媒体坐席检索方法
CN116186067A (zh) 一种工业数据表存储查询方法及设备
CN114443904B (zh) 视频查询方法、装置、计算机设备及计算机可读存储介质
CN116090450A (zh) 一种文本处理方法及计算设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22889129

Country of ref document: EP

Kind code of ref document: A1