CN115146080A

CN115146080A - Method and device for constructing knowledge graph

Info

Publication number: CN115146080A
Application number: CN202211005380.9A
Authority: CN
Inventors: 黄安付; 彭鹏; 曹一丁; 杨雷
Original assignee: Baiyang Times Beijing Technology Co ltd
Current assignee: Baiyang Times Beijing Technology Co ltd
Priority date: 2022-08-22
Filing date: 2022-08-22
Publication date: 2022-10-04

Abstract

The application discloses a method and a device for constructing a knowledge graph. The method comprises: acquiring a structured data set, a semi-structured data set and an unstructured data set of a target field; training a named entity recognition model by using the structured data set, a remote supervision data set and a sample selector; converting the unstructured data set into a first converted structured data set by using the trained named entity recognition model; and constructing a knowledge graph of the target field according to the structured data set, the semi-structured data set and the first converted structured data set. By introducing the remote supervision data set and using the sample selector together with the named entity recognition model, the method and the device enable the named entity recognition model to learn from a large amount of high-quality training data. The trained model then converts the unstructured data set into a structured data set, which greatly enriches the data for constructing the knowledge graph, so that a knowledge graph containing rich information can be constructed.

Description

Method and device for constructing knowledge graph

Technical Field

The present application relates to the field of knowledge graph technology, and in particular, to a method and an apparatus for constructing a knowledge graph.

Background

With the development of science and technology, the knowledge graph is playing an increasingly important role in the aspects of analyzing and processing large-scale data and information mining as a novel knowledge structural representation method. A knowledge graph is a structured semantic knowledge base that describes concepts and their interrelationships in the physical world in symbolic form. Meanwhile, the knowledge graph is also a basic technology of downstream services such as intelligent question answering, relation prediction, semantic search, intelligent recommendation and the like.

In some highly vertical, specialized fields, little standardized structured data exists, so only a general knowledge graph containing a small amount of information can be constructed. For example, the sea battlefield field requires a wide range of data and information on weaponry, weather conditions, the battlefield situation and the like. Because standardized structured data for this field is scarce, existing knowledge graphs suffer from missing information and low precision: most contain only weapon names and brief descriptions, and lack further required information such as production dates and manufacturers. How to construct a knowledge graph containing rich information has therefore become an urgent problem.

Disclosure of Invention

Based on the above problems, the application provides a method and a device for constructing a knowledge graph, which can be used for constructing a domain-specific knowledge graph containing rich information and having strong verticality.

The embodiment of the application discloses the following technical scheme:

in one aspect, an embodiment of the present application provides a method for constructing a knowledge graph, including:

acquiring a structured data set, a semi-structured data set and an unstructured data set of a target field;

training a named entity recognition model using the structured data set, a remote supervision data set, and a sample selector;

converting the unstructured dataset into a first converted structured dataset by using the trained named entity recognition model;

constructing a knowledge graph of a target domain from the structured dataset, the semi-structured dataset, and the first transformed structured dataset.

Optionally, the training the named entity recognition model by using the structured data set, the remote supervision data set, and the sample selector includes:

carrying out preliminary training on a named entity recognition model by utilizing the structured data set;

mixing the structured data set and the remote supervision data set to obtain a mixed data set;

and training the preliminarily trained named entity recognition model by utilizing the mixed data set and the sample selector.

Optionally, the training, by using the mixed data set and the sample selector, the named entity recognition model after the preliminary training includes:

randomly extracting a group of target data in the mixed data set, and judging whether the target data can be used as a training sample by using a sample selector;

and if the target data can be used as a training sample, training the named entity recognition model by using the target data.

Optionally, the method further includes:

feeding back the model performance of the named entity recognition model to the sample selector;

updating the sample selector based on the model performance.

Optionally, the constructing a knowledge graph of a target domain according to the structured data set, the semi-structured data set, and the first transformed structured data set includes:

processing the semi-structured data to obtain a second conversion structured data set;

constructing a knowledge graph from the structured data set, the first transformed structured data set, and the second transformed structured data set.

Optionally, the constructing a knowledge graph according to the structured data set, the first transformed structured data set, and the second transformed structured data set includes:

performing entity alignment and entity disambiguation on the structured data set, the first transformed structured data set, and the second transformed structured data set to obtain a total structured data set;

and constructing a knowledge graph of the target field according to the total structured data set.

Optionally, the named entity recognition model includes:

a BERT model to which a Bi-LSTM network and a CRF layer are added.

On the other hand, the embodiment of the present application further provides an apparatus for constructing a knowledge graph, including:

the acquisition module is used for acquiring a structured data set, a semi-structured data set and an unstructured data set of a target field;

a training module for training a named entity recognition model using the structured data set, a remote supervision data set, and a sample selector;

a conversion module for converting the unstructured dataset into a first converted structured dataset using the trained named entity recognition model;

a construction module to construct a knowledge-graph of a target domain from the structured dataset, the semi-structured dataset, and the first transformed structured dataset.

Optionally, the training module includes:

the preliminary training module is used for carrying out preliminary training on the named entity recognition model by utilizing the structured data set;

the mixing module is used for mixing the structured data set and the remote supervision data set to obtain a mixed data set;

and the subsequent training module is used for training the primarily trained named entity recognition model by utilizing the mixed data set and the sample selector.

Optionally, the subsequent training module includes:

the judging module is used for randomly extracting a group of target data in the mixed data set and judging whether the target data can be used as a training sample or not by using the sample selector;

and the target training module is used for training the named entity recognition model by utilizing the target data.

Compared with the prior art, the method has the following beneficial effects:

The application provides a method for constructing a knowledge graph, which comprises the following steps: acquiring a structured data set, a semi-structured data set and an unstructured data set of a target field; training a named entity recognition model using the structured data set, a remote supervision data set, and a sample selector; converting the unstructured data set into a first converted structured data set by using the trained named entity recognition model; and constructing a knowledge graph of the target field from the structured data set, the semi-structured data set, and the first converted structured data set. By introducing the remote supervision data set, a large amount of data becomes available for training the named entity recognition model, and by using the sample selector together with the named entity recognition model, the model can autonomously learn from high-quality training data. The trained named entity recognition model can then convert the unstructured data set into a structured data set, which enriches the data for constructing the knowledge graph, so that a highly vertical, domain-specific knowledge graph containing rich information can be constructed.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a method for constructing a knowledge graph according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for training a named entity recognition model according to an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of an apparatus for constructing a knowledge graph according to an embodiment of the present application.

Detailed Description

As described above, the currently constructed knowledge graph has problems of information loss and the like for the field with strong verticality.

Through research, the inventors devised a method and a device for constructing a knowledge graph, which can construct a highly vertical, domain-specific knowledge graph containing rich information.

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Method embodiment

Referring to fig. 1, the figure is a flowchart of a method for constructing a knowledge graph according to an embodiment of the present application, where the method for constructing a knowledge graph includes:

S101, acquiring a structured data set, a semi-structured data set and an unstructured data set of a target field.

It should be noted that the target field may be a specific, highly vertical field, such as the sea battlefield field. The standardized structured data set can be used to build the training set, validation set and test set required for model training. Manual labeling can be added while obtaining the structured data set to improve the quality of model training, and the structured data set can be obtained from officially released open-source databases of desensitized data. The semi-structured data set is a data set with a certain uniform structure; it can be converted into a structured data set through suitable processing, such as regular-expression extraction, and can be obtained by crawling web pages such as encyclopedias. The unstructured data set consists of data with little or no uniform structure; it contains massive amounts of text and can be obtained from websites and books in the target field.
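As an illustration of the extraction step for semi-structured data, the following minimal sketch turns "key: value" lines of a crawled encyclopedia-style page into (entity, attribute, value) triples. The page layout, the field names and the `extract_triples` helper are hypothetical; the application does not specify a concrete format.

```python
import re

# Hypothetical semi-structured text, e.g. an infobox crawled from an encyclopedia page.
PAGE = """Name: Example Destroyer
Manufacturer: Example Shipyard
Production date: 2001-05-17
"""

# One "key: value" pair per line; this layout is an assumption.
FIELD = re.compile(r"^\s*([^:\n]+?)\s*:\s*(.+?)\s*$", re.MULTILINE)


def extract_triples(page: str) -> list[tuple[str, str, str]]:
    """Return (entity, attribute, value) triples, using the Name field as the entity."""
    fields = dict(FIELD.findall(page))
    entity = fields.pop("Name", None)
    if entity is None:
        return []
    return [(entity, attr, value) for attr, value in fields.items()]


triples = extract_triples(PAGE)
```

Each triple produced this way can be merged directly with the rows of the structured data set.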

S102, training a named entity recognition model by utilizing the structured data set, the remote supervision data set and the sample selector.

In the embodiments provided in the present application, the named entity recognition model may be a BERT (Bidirectional Encoder Representations from Transformers) model to which a bidirectional long short-term memory (Bi-LSTM) network and a conditional random field (CRF) layer are added, and the remote supervision data set may include data labeled in bulk using heuristic rules. Because data labeled in bulk with heuristic rules may contain a large amount of noise, the embodiment of the application uses a purpose-built sample selector, which can autonomously learn to select high-quality training data.
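The application does not say which heuristic rules produce the remote supervision data set. One common choice, assumed here purely for illustration, is longest-match lookup in an entity dictionary; the `GAZETTEER` contents and the entity types are hypothetical.

```python
# Hypothetical gazetteer mapping entity surface forms (as token tuples) to types.
GAZETTEER = {
    ("Example", "Shipyard"): "ORG",
    ("destroyer",): "WEAPON",
}


def label_with_gazetteer(tokens: list[str]) -> list[str]:
    """Assign BIO tags by longest-match lookup in the gazetteer."""
    tags = ["O"] * len(tokens)
    max_len = max(len(key) for key in GAZETTEER)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            etype = GAZETTEER.get(tuple(tokens[i:i + n]))
            if etype is not None:
                tags[i] = f"B-{etype}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{etype}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


tokens = "The destroyer was built by Example Shipyard".split()
tags = label_with_gazetteer(tokens)
```

Labels produced this way are cheap but noisy (a dictionary cannot tell apart different senses of the same string), which is the motivation for the sample selector.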

S103, converting the unstructured data set into a first conversion structured data set by using the trained named entity recognition model.

In the embodiment of the present application, the training in step S102 enables the named entity recognition model to convert unstructured data into structured data. The trained named entity recognition model can therefore be used to convert the unstructured data set into a structured data set, which is referred to as the first converted structured data set. The conversion may be performed by using the unstructured data set as the input data of the named entity recognition model for extraction.
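A minimal sketch of the last part of this conversion, assuming the model emits one BIO tag per token (the labeling scheme this embodiment mentions for training). The `bio_to_entities` helper and the sample tags are illustrative and stand in for real model output.

```python
def bio_to_entities(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Collapse a BIO-tagged token sequence into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush the previous one if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag closes the current entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = "The destroyer was built by Example Shipyard".split()
tags = ["O", "B-WEAPON", "O", "O", "O", "B-ORG", "I-ORG"]  # stand-in for model output
entities = bio_to_entities(tokens, tags)
```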

S104, constructing a knowledge graph of the target field according to the structured data set, the semi-structured data set and the first converted structured data set.

Specifically, the semi-structured data set may be processed, for example by regular-expression extraction, to convert it into a structured data set. Entity alignment and entity disambiguation are then performed on the original structured data set, the structured data set converted from the semi-structured data set, and the structured data set converted from the unstructured data set to obtain a total structured data set, and a knowledge graph of the target field is constructed from the total structured data set.
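The merging step can be sketched as follows, under the simplifying assumption that entity alignment is done with a hand-made alias table. The `ALIASES` table, the triples and the `build_graph` helper are hypothetical, and real entity disambiguation would also need the context of each mention, which this sketch omits.

```python
from collections import defaultdict

# Hypothetical alias table used for entity alignment (lower-cased alias -> canonical name).
ALIASES = {
    "ex. destroyer": "Example Destroyer",
    "example destroyer": "Example Destroyer",
}


def canonical(name: str) -> str:
    """Map a mention to its canonical entity name, if an alias is known."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def build_graph(*triple_sets):
    """Merge (head, relation, tail) triples from several sources into one graph."""
    graph = defaultdict(lambda: defaultdict(set))
    for triples in triple_sets:
        for head, relation, tail in triples:
            graph[canonical(head)][relation].add(tail)
    return graph


structured = [("Example Destroyer", "type", "destroyer")]
from_semi_structured = [("example destroyer", "manufacturer", "Example Shipyard")]
from_unstructured = [("Ex. Destroyer", "production date", "2001-05-17")]
graph = build_graph(structured, from_semi_structured, from_unstructured)
```

After alignment, the three sources contribute attributes to a single node instead of three separate ones.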

According to the method for constructing the knowledge graph, a structured data set, a semi-structured data set and an unstructured data set of a target field are acquired; a named entity recognition model is trained using the structured data set, a remote supervision data set, and a sample selector; the unstructured data set is converted into a first converted structured data set by using the trained named entity recognition model; and a knowledge graph of the target field is constructed from the structured data set, the semi-structured data set, and the first converted structured data set. By introducing the remote supervision data set, a large amount of data becomes available for training the named entity recognition model, and by using the sample selector together with the named entity recognition model, the model can autonomously learn from high-quality training data. The trained named entity recognition model can then convert the unstructured data set into a structured data set, which enriches the data for constructing the knowledge graph, so that a highly vertical, domain-specific knowledge graph containing rich information can be constructed.

Referring to fig. 2, which is a flowchart of a method for training a named entity recognition model according to an embodiment of the present application, the method for training the named entity recognition model includes:

S201, utilizing the structured data set to carry out preliminary training on a named entity recognition model.

It should be noted that, because the data in the structured data set is normalized data, the preliminary training of the named entity recognition model by using the structured data set can make the named entity recognition model have a good foundation, and can effectively improve the training efficiency.

S202, mixing the structured data set and the remote supervision data set to obtain a mixed data set.

S203, randomly extracting a group of target data in the mixed data set, and judging whether the target data can be used as a training sample by using a sample selector.

It should be noted that, in the embodiments provided in the present application, the sample selector may, as an example, use a policy-based reinforcement learning method. First, a subset of the data set is selected as a bag; the sample selector then picks the correctly labeled data from the labeled data in the bag, so that the selected data can be used as high-quality data to train the named entity recognition model.

S204, if the target data can be used as a training sample, training the named entity recognition model by using the target data.

It should be noted that the named entity recognition model may adopt a BERT-plus-CRF structure. In training, a pre-trained BERT model encodes the labeled data to obtain accurate semantic representations of the characters, and the CRF layer imposes state-transition constraints on the encoded result.

Specifically, in the embodiments provided in the present application, BIO labeling may be used. Encodings are obtained through the BERT layer and then fed into the Bi-LSTM layer, which comprises a forward LSTM and a backward LSTM and mainly learns the context information of sentences; the output of the Bi-LSTM layer is the probability that each word in the data belongs to each named entity class. A CRF layer is introduced to learn the transition rules between adjacent entity labels in the data. The CRF layer receives an input sequence X and outputs a target sequence Y, where adjacent nodes satisfy the following formula:

$$P(Y_i \mid X, Y_1, Y_2, \ldots, Y_n) = P(Y_i \mid X, Y_{i-1}, Y_{i+1})$$

where the subscript of Y is its index in the sequence. Solving for a given input sequence X with the CRF yields the probability formula of Y as follows:

Because adjacent states in the sequence are related in a defined way and depend on the observed sequence data, the state feature s and the transition feature t are substituted into the following formula:

where $Z(X)$ is the normalization function, $t_k$ is a transition feature function, $\lambda_k$ is the weight of $t_k$, $s_l$ is a state feature function, $u_l$ is the weight of $s_l$, K is the number of transition feature functions, and L is the number of state feature functions. $t_k$ and $s_l$ take the value 0 or 1; taking $t_k$ as an example, the formula is as follows:
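The equations referred to in this passage are not reproduced in this text. For orientation only, the textbook linear-chain CRF is consistent with the symbols defined here. The argument lists of $t_k$ and $s_l$ and the form of the indicator condition are assumptions, not taken from the application:

```latex
P(Y \mid X) = \frac{1}{Z(X)} \exp\Bigl(
    \sum_{i} \sum_{k=1}^{K} \lambda_k \, t_k(Y_{i-1}, Y_i, X, i)
  + \sum_{i} \sum_{l=1}^{L} u_l \, s_l(Y_i, X, i) \Bigr)

Z(X) = \sum_{Y'} \exp\Bigl(
    \sum_{i} \sum_{k=1}^{K} \lambda_k \, t_k(Y'_{i-1}, Y'_i, X, i)
  + \sum_{i} \sum_{l=1}^{L} u_l \, s_l(Y'_i, X, i) \Bigr)

t_k(Y_{i-1}, Y_i, X, i) =
\begin{cases}
  1, & \text{if the condition of feature } k \text{ holds at position } i \\
  0, & \text{otherwise}
\end{cases}
```

In this form, $Z(X)$ sums over all candidate tag sequences $Y'$ so that the probabilities of all sequences add up to 1.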

S205, feeding back the model performance of the named entity recognition model to the sample selector.

Specifically, in the embodiments provided in the present application, as an example, a standard structured data set and a remote supervision data set are combined into a mixed data set, and a subset is randomly selected from the mixed data set. For each data item in the subset, the sample selector makes a decision on whether to select it. After a decision has been made for every data item in the subset, the reward of the sample selector's current policy on that subset is returned and used to update the sample selector.

A state $S_t$ is used to represent the current data and its tag sequence; the state is expressed as a vector $S_t$, and the vector $S_t$ may consist of a policy network and rewards.

Policy network: the agent obtains a decision value $a_t$ according to the current policy, where $a_t \in \{0, 1\}$; 0 indicates that the data will not be selected, and 1 indicates that the data will be selected. The policy network can be represented by the following formula:

$$A_\Theta(S_t, a_t) = a_t\,\sigma(W S_t + b) + (1 - a_t)\bigl(1 - \sigma(W S_t + b)\bigr)$$

where $A_\Theta(S_t, a_t)$ denotes the policy network, $a_t$ indicates whether the selector selects the t-th sample, $S_t$ is the state vector at step t, and W and b are the parameters of the multilayer perceptron (MLP) that processes the state vector before the sigmoid function $\sigma$ is applied.
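The policy formula above can be sketched in a few lines. This is a minimal illustration that assumes a single linear layer followed by a sigmoid; the state features, `weights` and `bias` are hypothetical values, and a real selector would derive the state vector from the model's encoding of the data and its tag sequence.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def policy_prob(state, action, weights, bias):
    """A(s_t, a_t) = a_t * sigma(W.s_t + b) + (1 - a_t) * (1 - sigma(W.s_t + b))."""
    p_select = sigmoid(sum(w * s for w, s in zip(weights, state)) + bias)
    return action * p_select + (1 - action) * (1.0 - p_select)


def sample_action(state, weights, bias, rng):
    """Draw a_t in {0, 1}: 1 selects the sample, 0 rejects it."""
    return 1 if rng.random() < policy_prob(state, 1, weights, bias) else 0


state = [1.0, 2.0]        # hypothetical state vector
weights = [0.0, 0.0]      # untrained selector: every sample is kept with probability 0.5
bias = 0.0
p_keep = policy_prob(state, 1, weights, bias)
action = sample_action(state, weights, bias, random.Random(0))
```

Because the two action probabilities always sum to 1, the selector defines a Bernoulli distribution over "keep" and "reject" for each sample.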

The reward is used to evaluate the ability of the current sample selector to predict the tags of each data item; when the model has completed all selections in the current subset, an average reward for that subset is obtained. The reward can be calculated by the following formula.

where r is the reward; the two data-set symbols in the formula denote the structured data set and the remote supervision data set, respectively; $p(z \mid x_j)$ and $p(y \mid x_k)$ are the conditional probabilities of the tag sequences calculated by the CRF layer; $x_j$ is a candidate data item in the remote supervision data set; $x_k$ is a candidate data item in the structured data set; z is specific data in the remote supervision data set; and y is specific data in the structured data set.
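The reward formula itself does not appear in this text. One form that is consistent with the symbols defined here, assumed purely for illustration, averages the log conditional probabilities that the CRF layer assigns to the selected remote supervision items and to the structured items. The `average_reward` helper and the sample probabilities are hypothetical.

```python
import math


def average_reward(remote_probs, structured_probs):
    """Mean log-probability over selected remote-supervision items, p(z|x_j),
    and structured items, p(y|x_k). Assumed form, not taken from the source."""
    probs = list(remote_probs) + list(structured_probs)
    if not probs:
        return 0.0
    return sum(math.log(p) for p in probs) / len(probs)


# Hypothetical CRF probabilities for one subset.
reward_confident = average_reward([0.9, 0.8], [0.95])
reward_noisy = average_reward([0.2, 0.1], [0.95])
```

Under this assumed form, a subset whose tag sequences the model finds likely earns a reward closer to 0, while a subset full of implausible labels earns a strongly negative one.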

In the embodiment provided by the present application, the policy parameters of the sample selector may be updated using the reward value and the gradient, for example by the following formula:

where $\Theta_1$ is the set of all parameters of the sample selector after the update, $\Theta$ is the set of all parameters before the update, $\alpha$ is the learning rate, $r(a_t)$ is the average reward value, the data-set symbol in the formula denotes the remote supervision data set, $A_\Theta(S_t, a_t)$ denotes the policy network, and $\nabla_\Theta$ denotes the gradient operator of the policy network with respect to the parameter $\Theta$.
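The update formula is likewise not reproduced in this text. A standard policy-gradient (REINFORCE) step matches the quantities listed, and the sketch below assumes that form together with a single-layer logistic policy: Θ ← Θ + α Σ_t r ∇_Θ log A_Θ(S_t, a_t). The `reinforce_update` helper is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_update(weights, bias, decisions, reward, lr):
    """One policy-gradient step over a subset.

    decisions is a list of (state, action) pairs; reward is the average
    reward of the subset. For a logistic policy, the gradient of
    log A(s, a) with respect to the pre-activation is (a - sigma(W.s + b)).
    """
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for state, action in decisions:
        p_select = sigmoid(sum(w * s for w, s in zip(weights, state)) + bias)
        err = action - p_select
        for i, s in enumerate(state):
            grad_w[i] += reward * err * s
        grad_b += reward * err
    new_weights = [w + lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias + lr * grad_b


# The selector kept one sample (action 1) and the subset earned a positive reward.
new_w, new_b = reinforce_update([0.0], 0.0, [([1.0], 1)], reward=1.0, lr=0.1)
```

A positive reward raises the probability of repeating the decisions that were made, and a negative reward lowers it.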

S206, updating the sample selector based on the model performance.

In the embodiment provided by the present application, as an example, joint training may be used to iteratively update the parameters of the sample selector and the named entity recognition model. Specifically, part of the data is selected from the remote supervision data set and mixed with the structured data to obtain a mixed data set; the parameters of the named entity recognition model are learned from the mixed data set, and feedback is provided to the sample selector to improve its selections.
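The control flow of this joint training can be sketched as a loop over randomly drawn subsets. Everything below is an illustrative skeleton: `selector`, `train_step`, `reward_fn` and `update_selector` are hypothetical callables standing in for the sample selector, the model's training step, the reward computation and the selector update, and the toy data only tags each item with its origin.

```python
import random


def joint_training(mixed, selector, train_step, reward_fn, update_selector,
                   rounds, bag_size, rng):
    """Alternate between training the model on selected data and updating the selector."""
    for _ in range(rounds):
        bag = rng.sample(mixed, min(bag_size, len(mixed)))
        decisions = [(item, selector(item)) for item in bag]
        chosen = [item for item, keep in decisions if keep]
        for item in chosen:
            train_step(item)                          # model learns from selected data only
        update_selector(decisions, reward_fn(chosen))  # feedback to the selector


# Toy stand-ins: keep only items from the structured ("gold") set.
trained, rewards = [], []
mixed = [("gold", i) for i in range(4)] + [("remote", i) for i in range(4)]
joint_training(
    mixed,
    selector=lambda item: 1 if item[0] == "gold" else 0,
    train_step=trained.append,
    reward_fn=lambda chosen: float(len(chosen)),
    update_selector=lambda decisions, r: rewards.append(r),
    rounds=3,
    bag_size=4,
    rng=random.Random(0),
)
```

In a real system the selector would be the learned policy and the reward would come from the model's performance, so the two components improve each other over the rounds.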

According to the embodiment provided by the application, the named entity recognition model is preliminarily trained on the structured data set, and the sample selector judges and selects data from the mixed data set obtained by mixing the structured data set and the remote supervision data set, so that the named entity recognition model can be trained on a large amount of high-quality data. The sample selector is updated using the model performance of the trained named entity recognition model, so that it can adjust its policy accordingly. In addition, because the tags in a named entity recognition tag sequence depend on one another, using the CRF effectively reduces the probability of illegal sequences in tag sequence prediction and improves the performance of the named entity recognition model.
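To make "illegal sequence" concrete: under BIO labeling, an `I-X` tag is only valid directly after `B-X` or `I-X` of the same type. The check below is a small illustrative helper, not part of the application; a CRF layer learns transition scores that make such sequences unlikely instead of checking them explicitly.

```python
def is_legal_bio(tags: list[str]) -> bool:
    """Return True if every I-X tag directly follows B-X or I-X of the same type."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            etype = tag[2:]
            if prev not in (f"B-{etype}", f"I-{etype}"):
                return False
        prev = tag
    return True


legal = is_legal_bio(["O", "B-ORG", "I-ORG", "O"])
illegal_start = is_legal_bio(["O", "I-ORG"])          # I- without a preceding B-
illegal_switch = is_legal_bio(["B-ORG", "I-WEAPON"])  # type changes mid-entity
```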

Device embodiment

Referring to fig. 3, the diagram is a schematic structural diagram of an apparatus for constructing a knowledge graph according to an embodiment of the present application, where the apparatus for constructing a knowledge graph includes: the device comprises an acquisition module 301, a training module 302, a conversion module 303 and a construction module 304.

The obtaining module 301 is configured to obtain a structured data set, a semi-structured data set, and an unstructured data set in a target field.

A training module 302 for training a named entity recognition model using the structured data set, the remote supervision data set, and the sample selector.

A converting module 303, configured to convert the unstructured dataset into a first converted structured dataset by using the trained named entity recognition model.

A construction module 304 for constructing a knowledge graph of the target domain from the structured data set, the semi-structured data set, and the first transformed structured data set.

Optionally, the training module 302 includes:

the preliminary training module is used for carrying out preliminary training on the named entity recognition model by utilizing the structured data set;

the mixing module is used for mixing the structured data set and the remote supervision data set to obtain a mixed data set;

Optionally, the subsequent training module includes:

the judging module is used for randomly extracting a group of target data in the mixed data set and judging whether the target data can be used as a training sample by using a sample selector;

Optionally, the apparatus further comprises:

a feedback module for feeding back the model performance of the named entity recognition model to the sample selector;

an update module for updating the sample selector based on the model performance.

Optionally, the construction module 304 includes:

the processing module is used for processing the semi-structured data set to obtain a second converted structured data set;

a conversion construction module for constructing a knowledge graph from the structured data set, the first converted structured data set, and the second converted structured data set.

Optionally, the conversion construction module includes:

an alignment disambiguation module for performing entity alignment and entity disambiguation on the structured data set, the first converted structured data set, and the second converted structured data set to obtain a total structured data set;

and the total construction module is used for constructing the knowledge graph of the target field according to the total structured data set.

The device for constructing the knowledge graph provided by the embodiment of the application uses the acquisition module to acquire a structured data set, a semi-structured data set and an unstructured data set of a target field; uses the training module to train a named entity recognition model with the structured data set, a remote supervision data set and a sample selector; uses the conversion module to convert the unstructured data set into a first converted structured data set by means of the trained named entity recognition model; and uses the construction module to construct the knowledge graph of the target field from the structured data set, the semi-structured data set and the first converted structured data set. By introducing the remote supervision data set, a large amount of data becomes available for training the named entity recognition model, and by using the sample selector together with the named entity recognition model, the model can autonomously learn from high-quality training data. The trained named entity recognition model can then convert the unstructured data set into a structured data set, which enriches the data for constructing the knowledge graph, so that a highly vertical, domain-specific knowledge graph containing rich information can be constructed.

It should be noted that the embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the method embodiments may be consulted. The apparatus and system embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.

The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of constructing a knowledge graph, comprising:

2. The method of claim 1, wherein training a named entity recognition model using the structured dataset, a remote supervised dataset, and a sample selector comprises:

mixing the structured data set and the remote supervision data set to obtain a mixed data set;

3. The method of claim 2, wherein training the preliminarily trained named entity recognition model using the mixed dataset and sample selector comprises:

4. The method of claim 3, further comprising:

feeding back the model performance of the named entity recognition model to the sample selector;

updating the sample selector based on the model performance.

5. The method of claim 1, wherein constructing a knowledge graph of a target domain from the structured dataset, the semi-structured dataset, and the first transformed structured dataset comprises:

6. The method of claim 5, wherein constructing a knowledge graph from the structured data set, the first transformed structured data set, and the second transformed structured data set comprises:

7. The method of claim 1, wherein the named entity recognition model comprises:

a BERT model to which a Bi-LSTM network and a CRF layer are added.

8. An apparatus for constructing a knowledge graph, comprising:

a construction module to construct a knowledge graph of a target domain from the structured dataset, the semi-structured dataset, and the first transformed structured dataset.

9. The apparatus of claim 8, wherein the training module comprises:

10. The apparatus of claim 9, wherein the subsequent training module comprises: