CN117648979A - Knowledge graph data construction method and device and computer equipment


Info

Publication number: CN117648979A
Authority: CN (China)
Prior art keywords: text, sample, sample text, prediction, module
Legal status: Pending
Application number: CN202311744459.8A
Other languages: Chinese (zh)
Inventor: 王国迪 (Wang Guodi)
Assignee: China Life Insurance Co ltd
Application filed by China Life Insurance Co ltd
Priority: CN202311744459.8A
Publication: CN117648979A

Classification: Information Retrieval, DB Structures and FS Structures Therefor

Abstract

The application relates to a method, an apparatus, a computer device, a storage medium and a computer program product for constructing knowledge-graph data. The method comprises: acquiring training data, wherein the training data comprises sample texts and sample labels corresponding to the sample texts; performing feature vector conversion processing on each sample text to obtain a corresponding text feature vector sequence; predicting, through a prediction model, the text feature vectors corresponding to the characters contained in the sample text, and determining a prediction label corresponding to the sample text; and determining target sample texts from among the sample texts based on an original text data set and an annotated text data set, and obtaining updated training data based on the target sample texts and their prediction labels. In this way, text features can be extracted rapidly, the efficiency of converting unstructured data into structured data is improved, and the efficiency of constructing knowledge graphs in the financial-insurance field is improved.

Description

Knowledge graph data construction method and device and computer equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for constructing knowledge graph data, a computer device, a storage medium, and a computer program product.
Background
A Knowledge Graph is an important branch of artificial intelligence: a structured semantic knowledge base that describes concepts in the physical world and their interrelationships in symbolic form. Its basic constituent unit is the entity-relation-entity triple; entities and their related attribute-value pairs are linked to one another through relations to form a mesh-like knowledge structure.
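For illustration, a minimal sketch of how such a triple and its attribute-value pairs might be represented in code; the entity, relation and attribute names here are hypothetical:

```python
# Hypothetical insurance-domain triple: (entity, relation, entity/value).
triple = ("CriticalIllnessInsurance", "coverage_period", "1 year")

# Attribute-value pairs attached to an entity node of the graph.
entity_attributes = {
    "CriticalIllnessInsurance": {
        "insured_amount": "100,000 CNY",
        "waiting_period": "90 days",
    },
}
```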
In the related art, building a knowledge graph starts with acquiring data. The data is the knowledge source and may be tables, texts, databases, and the like, and can be classified by type into structured, unstructured and semi-structured data. Structured data is data represented in a fixed format, such as tables and databases, and can be used directly to construct a knowledge graph. Unstructured data includes text, audio, video, pictures and the like, and requires information extraction before a knowledge graph can be established. Semi-structured data lies between the two and likewise requires information extraction. How to extract structured data from multi-source heterogeneous data such as insurance clauses in order to construct a knowledge graph is an urgent technical problem.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product for constructing knowledge-graph data that can improve data extraction efficiency.
In a first aspect, the present application provides a method for constructing knowledge-graph data. The method comprises the following steps:
acquiring training data, wherein the training data comprises sample texts and sample labels corresponding to the sample texts;
performing feature vector conversion processing on the sample text to obtain a text feature vector sequence corresponding to the sample text, wherein the sample text comprises a plurality of characters, and the text feature vector sequence comprises text feature vectors respectively corresponding to the characters;
predicting text feature vectors corresponding to the characters contained in the sample text respectively through a prediction model, and determining a prediction tag corresponding to the sample text;
and determining a target sample text from among the sample texts based on an original text data set and an annotated text data set, and obtaining updated training data based on the target sample text and the prediction label.
In one embodiment, the prediction model includes a position tag determining layer and a position tag predicting layer, and the predicting, by the prediction model, the text feature vector corresponding to each character included in the sample text, to determine a prediction tag corresponding to the sample text includes:
processing text feature vectors corresponding to the characters respectively through the position label determining layer to obtain prediction labels corresponding to the characters;
and carrying out maximum likelihood processing on the predictive label corresponding to each character through the position label predictive layer to obtain the predictive label of the sample text.
In one embodiment, the position tag determining layer includes a bidirectional long short-term memory layer, the bidirectional long short-term memory layer includes a plurality of forward modules and a plurality of backward modules, the forward modules are connected end to end, the backward modules are connected end to end, and the processing, by the position tag determining layer, of the text feature vectors corresponding to the characters to obtain the prediction tags corresponding to the characters includes:
for each forward module, based on the character and the output result of the previous forward module of the forward module, obtaining the output result of the forward module;
for each backward module, based on the character and the output result of the next backward module of the backward module, obtaining the output result of the backward module;
and splicing based on the output result of each forward module and the output result of each backward module to obtain an implicit state sequence, and processing the implicit state sequence through a linear layer to obtain a prediction label corresponding to each character.
In one embodiment, the determining the target sample text from among the sample texts based on the original text data set and the annotated text data set includes:
calculating a first average similarity between the sample text and the original text data set, and calculating a second average similarity between the sample text and the annotated text data set;
and taking the sample text of which the first average similarity does not meet a preset matching condition and the second average similarity meets the preset matching condition as a target sample text.
In one embodiment, the step of taking, as the target sample text, the sample text in which the first average similarity does not satisfy a preset matching condition and the second average similarity satisfies the preset matching condition includes:
carrying out normalization processing on each sample text to obtain each sample text after normalization processing, and respectively calculating the confidence of each sample text;
and taking, as the target sample text, the sample text whose first average similarity does not meet the preset matching condition, whose second average similarity meets the preset matching condition, and whose confidence meets a minimum confidence condition.
In one embodiment, the performing, by the position tag prediction layer, maximum likelihood processing on a prediction tag corresponding to each character to obtain a prediction tag of the sample text includes:
performing maximum likelihood processing on the predicted label corresponding to each character through the position label predicting layer to obtain probability values of sample labels to which each character belongs respectively;
and determining a predictive label of the sample text based on the probability value of each sample label to which each character belongs.
In a second aspect, the present application further provides a device for constructing knowledge graph data. The device comprises:
the first acquisition module is used for acquiring training data, wherein the training data comprises sample texts and sample labels corresponding to the sample texts;
the conversion module is used for carrying out feature vector conversion processing on the sample text to obtain a text feature vector sequence corresponding to the sample text, wherein the sample text comprises a plurality of characters, and the text feature vector sequence comprises text feature vectors respectively corresponding to the characters;
the prediction module is used for predicting text feature vectors corresponding to the characters contained in the sample text respectively through a prediction model and determining a prediction label corresponding to the sample text;
and the updating module is used for determining a target sample text from among the sample texts based on the original text data set and the annotated text data set, and obtaining updated training data based on the target sample text and the prediction label.
In one embodiment, the prediction model includes a position tag determination layer and a position tag prediction layer, and the prediction module is specifically configured to:
processing text feature vectors corresponding to the characters respectively through the position label determining layer to obtain prediction labels corresponding to the characters;
and carrying out maximum likelihood processing on the predictive label corresponding to each character through the position label predictive layer to obtain the predictive label of the sample text.
In one embodiment, the location tag determining layer includes a bidirectional long short-term memory layer, where the bidirectional long short-term memory layer includes a plurality of forward modules and a plurality of backward modules, each of the forward modules is connected end to end, each of the backward modules is connected end to end, and the prediction module is specifically configured to:
for each forward module, based on the character and the output result of the previous forward module of the forward module, obtaining the output result of the forward module;
for each backward module, based on the character and the output result of the next backward module of the backward module, obtaining the output result of the backward module;
and splicing based on the output result of each forward module and the output result of each backward module to obtain an implicit state sequence, and processing the implicit state sequence through a linear layer to obtain a prediction label corresponding to each character.
In one embodiment, the updating module is specifically configured to:
calculating a first average similarity between the sample text and the original text data set, and calculating a second average similarity between the sample text and the annotated text data set;
and taking the sample text of which the first average similarity does not meet a preset matching condition and the second average similarity meets the preset matching condition as a target sample text.
In one embodiment, the updating module is specifically configured to: carry out normalization processing on each sample text to obtain each sample text after normalization processing, and respectively calculate the confidence of each sample text;
and take, as the target sample text, the sample text whose first average similarity does not meet the preset matching condition, whose second average similarity meets the preset matching condition, and whose confidence meets a minimum confidence condition.
In one embodiment, the prediction module is specifically configured to: performing maximum likelihood processing on the predicted label corresponding to each character through the position label predicting layer to obtain probability values of sample labels to which each character belongs respectively;
and determining a predictive label of the sample text based on the probability value of each sample label to which each character belongs.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring training data, wherein the training data comprises sample texts and sample labels corresponding to the sample texts;
performing feature vector conversion processing on the sample text to obtain a text feature vector sequence corresponding to the sample text, wherein the sample text comprises a plurality of characters, and the text feature vector sequence comprises text feature vectors respectively corresponding to the characters;
predicting text feature vectors corresponding to the characters contained in the sample text respectively through a prediction model, and determining a prediction tag corresponding to the sample text;
and determining a target sample text from among the sample texts based on an original text data set and an annotated text data set, and obtaining updated training data based on the target sample text and the prediction label.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring training data, wherein the training data comprises sample texts and sample labels corresponding to the sample texts;
performing feature vector conversion processing on the sample text to obtain a text feature vector sequence corresponding to the sample text, wherein the sample text comprises a plurality of characters, and the text feature vector sequence comprises text feature vectors respectively corresponding to the characters;
predicting text feature vectors corresponding to the characters contained in the sample text respectively through a prediction model, and determining a prediction tag corresponding to the sample text;
and determining a target sample text from among the sample texts based on an original text data set and an annotated text data set, and obtaining updated training data based on the target sample text and the prediction label.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring training data, wherein the training data comprises sample texts and sample labels corresponding to the sample texts;
performing feature vector conversion processing on the sample text to obtain a text feature vector sequence corresponding to the sample text, wherein the sample text comprises a plurality of characters, and the text feature vector sequence comprises text feature vectors respectively corresponding to the characters;
predicting text feature vectors corresponding to the characters contained in the sample text respectively through a prediction model, and determining a prediction tag corresponding to the sample text;
and determining a target sample text from among the sample texts based on an original text data set and an annotated text data set, and obtaining updated training data based on the target sample text and the prediction label.
The above method, apparatus, computer device, storage medium and computer program product for constructing knowledge-graph data operate as follows: training data is acquired, the training data comprising sample texts and the sample labels corresponding to them; feature vector conversion processing is performed on each sample text to obtain a corresponding text feature vector sequence, the sample text comprising a plurality of characters and the sequence comprising the text feature vectors respectively corresponding to those characters; the text feature vectors are predicted through a prediction model to determine the prediction label corresponding to the sample text; and target sample texts are determined from among the sample texts based on an original text data set and an annotated text data set, with updated training data obtained based on the target sample texts and their prediction labels. By adopting the method, text features can be extracted rapidly, the efficiency of converting unstructured data into structured data is improved, the efficiency of constructing knowledge graphs in the financial-insurance field is improved, and the fusion of multi-source heterogeneous data is simplified and made more convenient.
Drawings
FIG. 1 is a flow chart of a method for constructing knowledge-graph data in one embodiment;
FIG. 2 is a flowchart illustrating a step of determining a predictive label corresponding to text in one embodiment;
FIG. 3 is a flowchart illustrating a step of determining a predicted tag corresponding to a character in one embodiment;
FIG. 4 is a flow diagram of a process for determining a target sample text in one embodiment;
FIG. 5 is a flow diagram of a process for determining a target sample text in one embodiment;
FIG. 6 is a flowchart illustrating a step of determining a predictive label corresponding to text in one embodiment;
FIG. 7 is a schematic diagram of a model in one embodiment;
FIG. 8 is a block diagram of a knowledge-graph data construction apparatus in one embodiment;
fig. 9 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a method for constructing knowledge graph data is provided. The method is described here as applied to a terminal; it can be understood that the method can also be applied to a server, or to a system including the terminal and the server and implemented through interaction between the two. The terminal can be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, an internet-of-things device or a portable wearable device; the internet-of-things device can be a smart speaker, smart television, smart air conditioner, smart vehicle device and the like, and the portable wearable device can be a smart watch, a smart bracelet, a head-mounted device and the like. The server can be implemented as a stand-alone server or as a server cluster formed by a plurality of servers. In this embodiment, the method for constructing knowledge graph data includes the following steps:
Step 102, obtaining training data.
The training data comprises sample texts and the sample labels corresponding to them. A sample text may be text used for training the model, such as insurance-clause text or other unstructured text; the sample label corresponding to a sample text may be a pre-annotated label, such as the insurance rule characterized by the text. Each sample text has a corresponding sample label.
Specifically, in the process of constructing the knowledge graph, the terminal can construct the knowledge graph through the data output by the prediction model, and in the process of training the prediction model, a plurality of sample texts and sample labels corresponding to the sample texts can be obtained in advance.
Step 104, performing feature vector conversion processing on the sample text to obtain a text feature vector sequence corresponding to the sample text.
The sample text comprises a plurality of characters, the text feature vector sequence comprises text feature vectors corresponding to the characters respectively, the characters can be words, letters, punctuation marks, numbers and the like, and the text feature vectors corresponding to the characters can be obtained by encoding the characters through a multi-layer bi-directional encoder.
Specifically, the terminal inputs a plurality of characters contained in the sample text into the multi-layer bi-directional encoder to obtain text feature vectors corresponding to the characters respectively, and combines the text feature vectors based on the text feature vectors corresponding to the characters respectively to obtain a text feature vector sequence.
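A minimal sketch of this conversion step, assuming a pretrained multi-layer bidirectional encoder from the Hugging Face transformers library; bert-base-chinese is an assumed stand-in, since the embodiment does not name a specific encoder:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed encoder choice
encoder = AutoModel.from_pretrained("bert-base-chinese")

sample_text = "本保险合同的保险期间为一年"  # hypothetical insurance-clause text
batch = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # (1, seq_len, 768)

# hidden[0, i] is the text feature vector of the i-th token (character), so the
# whole tensor is the text feature vector sequence for this sample text
# (positions 0 and -1 hold the [CLS]/[SEP] special tokens).
```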
Step 106, predicting, through a prediction model, the text feature vectors corresponding to the characters contained in the sample text, and determining the prediction label corresponding to the sample text.
The prediction model may be a deep learning model or a neural network model, and the prediction label corresponding to the sample text may be obtained by processing the sample text based on the prediction model.
Specifically, the terminal can process a text feature vector sequence corresponding to a plurality of characters contained in the sample text through a prediction model to obtain a prediction tag corresponding to the sample text. In one example, the terminal may input a text feature vector sequence to a prediction model, and perform prediction processing on the text feature vector sequence in the prediction model to obtain a prediction tag corresponding to the sample text.
Step 108, determining target sample texts from among the sample texts based on the original text data set and the annotated text data set, and obtaining updated training data based on the target sample texts and the prediction labels.
Wherein the original text data set may be text data comprising a plurality of unlabeled texts, and the annotated text data set may be text data comprising a plurality of texts that have already been annotated with corresponding label data.
Specifically, the terminal screens the plurality of sample texts based on the similarity between each sample text and the text data contained in the original text data set and the similarity between each sample text and the text data contained in the annotated text data set, extracts one or more target sample texts, and adds each target sample text, together with its corresponding prediction label, to the training data to obtain the updated training data.
In the above knowledge graph data construction method, training data is acquired, the training data comprising sample texts and the sample labels corresponding to them. Feature vector conversion processing is performed on each sample text to obtain the corresponding text feature vector sequence, where the sample text comprises a plurality of characters and the sequence comprises the text feature vectors respectively corresponding to those characters. The text feature vectors are then predicted through the prediction model to determine the prediction label corresponding to the sample text. Finally, target sample texts are determined from among the sample texts based on the original text data set and the annotated text data set, and updated training data is obtained based on the target sample texts and their prediction labels. By adopting the method, text features can be extracted rapidly, the efficiency of converting unstructured data into structured data is improved, the efficiency of constructing knowledge graphs in the financial-insurance field is improved, and the fusion of multi-source heterogeneous data is simplified and made more convenient.
In one embodiment, the prediction model includes a position tag determination layer and a position tag prediction layer. The position tag determination layer may be, for example, a bidirectional long short-term memory (BiLSTM) layer, which may include a plurality of forward modules and a plurality of backward modules; the position tag prediction layer may be a conditional random field (CRF) layer for determining the prediction label corresponding to the sample text.
Accordingly, as shown in fig. 2, the specific processing procedure of "predicting, by using a prediction model, the text feature vectors corresponding to each character included in the sample text, and determining the prediction label corresponding to the sample text" includes:
step 202, processing text feature vectors corresponding to the characters respectively through a position label determining layer to obtain a prediction label corresponding to each character.
Specifically, the terminal may input text feature vectors corresponding to each character included in the sample text to the position tag determining layer, and process each text feature vector by the position tag determining layer to obtain an output result of the position tag determining layer, for example, may obtain output results of a plurality of forward modules and a plurality of backward modules included in the position tag determining layer, and use each output result as a prediction tag corresponding to each character.
And 204, performing maximum likelihood processing on the predictive labels corresponding to the characters through the position label predictive layer to obtain predictive labels of the sample text.
Specifically, the terminal may obtain the predicted tag of each character output by the BiLSTM layer, that is, the tag to which the character ultimately belongs. The terminal can input the prediction labels corresponding to the characters into the CRF layer, which corrects and constrains them; in the CRF layer, the transition matrix Z is regarded as a parameter of the layer. The terminal can then process the prediction labels through the maximum likelihood function to obtain, for each character, the probability values of the sample labels contained in the training data, and determine the prediction label corresponding to the sample text based on these probability values.
In this embodiment, label prediction is performed on the sample text by the bidirectional long short-term memory layer and the CRF layer, so that the accuracy of label prediction can be improved.
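A minimal sketch of the CRF step, assuming the third-party pytorch-crf package (the embodiment does not name a library); the emission scores stand in for the per-character prediction labels output by the BiLSTM layer:

```python
import torch
from torchcrf import CRF  # assumes the pytorch-crf package

num_tags = 7                              # hypothetical number of sample labels
crf = CRF(num_tags, batch_first=True)     # transition matrix is a parameter of the layer

emissions = torch.randn(2, 10, num_tags)  # per-character tag scores from the BiLSTM
tags = torch.randint(0, num_tags, (2, 10))

loss = -crf(emissions, tags)              # negative log-likelihood for training
best_paths = crf.decode(emissions)        # Viterbi decoding: one tag sequence per text
```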
In one embodiment, the location tag determination layer comprises a bidirectional long short-term memory layer comprising a plurality of forward modules and a plurality of backward modules, each forward module being connected end to end, each backward module being connected end to end.
Accordingly, as shown in fig. 3, the specific processing procedure of the step of processing, by the position tag determining layer, the text feature vectors corresponding to each character to obtain the prediction tags corresponding to each character includes:
step 302, for each forward module, based on the character, the output result of the previous forward module of the forward module, the output result of the forward module is obtained.
Specifically, the terminal may input each character to the plurality of forward modules in the arrangement order of the characters, and input each character to the plurality of backward modules in the arrangement order of the characters, for example, the first character is input to the first forward module and the first backward module, respectively.
In one example, for each of the plurality of forward modules included in the LSTM layer, the terminal may input the character corresponding to that forward module's position, together with the output result of the previous forward module, to the forward module, obtain the forward module's output result, and pass that output result both to the next forward module and to the position tag prediction layer.
Step 304, for each backward module, based on the character and the output result of the next backward module of the backward module, obtaining the output result of the backward module.
Specifically, the terminal may input each character to the plurality of backward modules according to the arrangement order of the characters; for example, the first character is input to the first forward module and the first backward module, respectively.
In one example, for each of the plurality of backward modules included in the LSTM layer, the terminal may input the character corresponding to that backward module's position, together with the output result of the next backward module, to the backward module, obtain the backward module's output result, and pass that output result both to the previous backward module and to the position tag prediction layer.
And 306, splicing based on the output result of each forward module and the output result of each backward module to obtain an implicit state sequence, and processing the implicit state sequence through a linear layer to obtain a prediction label corresponding to each character.
Specifically, the terminal may splice the output results of the forward modules with the output results of the backward modules to obtain the complete implicit state sequence. The terminal may then input the implicit state sequence to the linear layer, which maps it to n labels to obtain, for each character, a mapping score value for every label. Through the Softmax layer, n-class classification is performed at each position based on these mapping score values, and the label with the largest mapping score value is determined as the prediction label corresponding to the character. Here, n may be the number of sample labels contained in the training data.
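A minimal PyTorch sketch of this step: a bidirectional LSTM whose spliced forward and backward states are mapped by a linear layer to per-character scores over n labels (all dimensions are assumptions):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, emb_dim=100, hidden=128, n_tags=7):  # sizes are assumptions
        super().__init__()
        # bidirectional=True provides the chained forward and backward modules
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, n_tags)  # spliced states -> n labels

    def forward(self, x):      # x: (batch, seq_len, emb_dim) text feature vectors
        h, _ = self.lstm(x)    # implicit state sequence, (batch, seq_len, 2*hidden)
        return self.linear(h)  # mapping score value of every label for each character

scores = BiLSTMTagger()(torch.randn(1, 12, 100))
pred = scores.argmax(dim=-1)   # label with the largest mapping score at each position
```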
In this embodiment, label prediction is performed on the sample text by the bidirectional long short-term memory layer and the CRF layer, so that the accuracy of label prediction can be improved.
In one embodiment, as shown in fig. 4, the specific process of determining the target sample text from among the sample texts based on the original text data set and the annotated text data set includes:
step 402, a first average similarity between the sample text and the original text data set is calculated, and a second average similarity between the sample text and the annotated text data set is calculated.
Specifically, the terminal can calculate, through a preset similarity algorithm, the similarity between the sample text and each of the original texts contained in the original text data set, and average these similarities to obtain the first average similarity corresponding to the sample text. Correspondingly, the terminal can calculate, through the preset similarity algorithm, the similarity between the sample text and each of the annotated texts contained in the annotated text data set, and average these to obtain the second average similarity corresponding to the sample text. The preset similarity algorithm may be, for example, a Euclidean distance algorithm, and the similarity may be a Euclidean distance.
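A sketch of the two average similarities under the stated example that the preset similarity algorithm is Euclidean distance (a distance, so smaller values mean more similar):

```python
import numpy as np

def avg_distance(x, dataset):
    # mean Euclidean distance between sample vector x and every text vector in dataset
    return float(np.mean([np.linalg.norm(x - d) for d in dataset]))

sample = np.random.rand(100)                      # hypothetical sample-text vector
original_set = [np.random.rand(100) for _ in range(50)]
annotated_set = [np.random.rand(100) for _ in range(20)]

first_avg = avg_distance(sample, original_set)    # vs. the original text data set
second_avg = avg_distance(sample, annotated_set)  # vs. the annotated text data set
```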
In step 404, the sample text whose first average similarity does not satisfy the preset matching condition and whose second average similarity satisfies the preset matching condition is used as the target sample text.
The preset matching condition may be, for example, that the similarity is greater than a preset similarity threshold, or that it is the maximum similarity; failing to satisfy the preset matching condition may mean that the similarity is less than the preset similarity threshold, or that it is the minimum similarity.
Specifically, the terminal screens the plurality of sample texts based on the first average similarity and the second average similarity corresponding to each sample text and the preset matching condition, so as to obtain the target sample text. In one example, the terminal may first screen out a plurality of first sample texts whose first average similarity is smaller than a preset similarity threshold, and take as the target sample text the first sample text whose second average similarity is greater than the preset similarity threshold; alternatively, among the first sample texts screened out in this way, the first sample text with the largest second average similarity may be taken as the target sample text. In another example, the terminal may screen based on the second average similarity, first screening out a plurality of first sample texts whose second average similarity is greater than the preset similarity threshold and taking, among them, the first sample text with the smallest first average similarity as the target sample text.
In this embodiment, the similarity between the unlabeled sample and the target sample can be considered through the similarity measurement, so as to avoid selecting an outlier sample, and meanwhile avoid selecting an unlabeled sample similar to a sample which has been fully trained, so that the comprehensiveness of data selection is ensured.
In one embodiment, as shown in fig. 5, the specific processing procedure of "the sample text in which the first average similarity does not satisfy the preset matching condition and the second average similarity satisfies the preset matching condition as the target sample text" includes:
step 502, performing normalization processing on each sample text to obtain normalized sample text, and calculating the confidence coefficient of each sample text respectively.
Specifically, the terminal may take the product of the probabilities of the tag sequence corresponding to each sample text and normalize it, so that sample texts of different lengths become directly comparable, and then calculate the confidence of each sample text respectively.
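A sketch of the length-normalized confidence, assuming the per-character probabilities of the predicted tag sequence are available; the geometric-mean normalization is what makes sample texts of different lengths comparable:

```python
import numpy as np

def sequence_confidence(tag_probs):
    # tag_probs: probability of the predicted tag at each character position;
    # the product of probabilities is normalized by sequence length (geometric mean)
    return float(np.prod(tag_probs) ** (1.0 / len(tag_probs)))

print(sequence_confidence([0.9, 0.8, 0.95]))  # ~0.88, independent of sequence length
```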
And 504, taking the sample text of which the first average similarity does not meet the preset matching condition and the second average similarity meets the preset matching condition and the confidence degree meets the minimum confidence degree condition as the target sample text.
Specifically, among the plurality of sample texts, the terminal may first determine the confidence of each sample text and treat the sample texts whose confidence is less than or equal to a preset confidence threshold as the sample texts satisfying the minimum confidence condition. Then, among the sample texts satisfying the minimum confidence condition, the terminal can calculate the first average similarity and the second average similarity of each sample text, and screen the sample texts based on these similarities and the preset matching condition to obtain the target sample text.
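A sketch of the combined two-step screening, reusing sequence_confidence and avg_distance from the sketches above; the thresholds and the exact reading of the preset matching condition are assumptions:

```python
def select_target_samples(samples, original_set, annotated_set,
                          conf_threshold=0.6, sim_threshold=5.0):  # assumed thresholds
    targets = []
    for vec, tag_probs in samples:          # samples: (vector, predicted tag probs) pairs
        if sequence_confidence(tag_probs) > conf_threshold:
            continue                        # keep only low-confidence (informative) texts
        first_avg = avg_distance(vec, original_set)     # first average similarity
        second_avg = avg_distance(vec, annotated_set)   # second average similarity
        # one reading of the condition: close to the unlabeled pool (not an outlier)
        # and far from the already-annotated set (not redundant)
        if first_avg < sim_threshold and second_avg >= sim_threshold:
            targets.append((vec, tag_probs))
    return targets
```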
In this embodiment, the confidence calculation ensures that the selected sample texts are diverse and carry more information, while the similarity measurement considers the similarity between unlabeled samples and target samples so as to avoid selecting outlier samples and, at the same time, avoid selecting unlabeled samples similar to samples on which the model has already been sufficiently trained, thereby ensuring the comprehensiveness of data selection.
In one embodiment, as shown in fig. 6, the specific processing procedure of the step of performing maximum likelihood processing on the prediction label corresponding to each character through the position label prediction layer to obtain the prediction label of the sample text includes:
step 602, performing maximum likelihood processing on the predicted label corresponding to each character through the position label prediction layer to obtain probability values of the sample labels to which each character belongs.
Specifically, the terminal may splice the output results of the forward modules and the backward modules to obtain the complete implicit state sequence, input the implicit state sequence to the linear layer, and map it to n labels to obtain the mapping score value of every label for each character. The terminal may also perform n-class classification at each position through the Softmax layer using these mapping score values, with the label having the largest mapping score value determining the prediction label corresponding to the character. For each character, the terminal may then input the prediction label into the CRF layer, which corrects and constrains it; the transition matrix Z is regarded as a parameter of the CRF layer. Finally, for each character, the terminal can process the prediction label through the maximum likelihood function to obtain the probability value of each sample label to which the character belongs.
Step 604, determining a predictive label of the sample text based on the probability value of each sample label to which each character belongs.
Specifically, for each character, the terminal may perform processing based on the probability value of each sample label to which the character belongs, and determine the sample label with the highest probability value as the prediction label corresponding to the sample text.
In this embodiment, label prediction may be performed on the sample text by the CRF layer, so that accuracy of label prediction may be improved.
The following describes in detail, in connection with a specific embodiment, a specific implementation procedure of the method for constructing knowledge-graph data:
The Knowledge Graph is an important branch of artificial intelligence, proposed by Google in 2012. It is a structured semantic knowledge base that describes concepts in the physical world and their interrelationships in symbolic form; its basic constituent unit is the 'entity-relation-entity' triple, and entities together with their related attribute-value pairs are linked to one another through relations to form a mesh-like knowledge structure. Building a knowledge graph starts with obtaining data, which is the source of knowledge and may be tables, texts, databases, and the like. Data can be classified by type into structured, unstructured and semi-structured data. Structured data is data represented in a fixed format, such as tables and databases, and can be used directly to construct a knowledge graph. Unstructured data includes text, audio, video, pictures and the like, and requires information extraction before a knowledge graph can be established. Semi-structured data lies between the two and likewise requires information extraction.
The method for constructing knowledge graph data is based on named entity recognition, natural language processing, neural networks and related techniques. Aimed at the named-entity-recognition problem of unstructured insurance data sets, an entity recognition method for the insurance field based on active-learning BiLSTM-CRF is provided, and a two-step active-learning sampling algorithm combining an uncertainty metric with similarity is adopted to annotate the data set for entity recognition. By researching domain knowledge-graph construction techniques and processes, a construction scheme for the insurance knowledge graph is proposed, mainly comprising the processes of ontology construction, insurance knowledge extraction, knowledge mapping and fusion, and knowledge storage. Finally, a reusable knowledge graph of the insurance domain is constructed.
The knowledge graph data construction method provided by this embodiment can train a text vectorizer using the CBOW model in Word2vec, converting texts such as insurance clauses into vectors that serve as the input for named entity recognition. The model adopts a multi-layer bidirectional encoder to extract and train text features. Specifically, this embodiment trains a feature extraction model combining bidirectional long short-term memory (BiLSTM) and a conditional random field (CRF); an insurance entity extraction model based on the BiLSTM-CRF framework is designed herein. The BiLSTM module fully captures context information and effectively models long-range dependencies, while the CRF model effectively supplements the BiLSTM model, which on its own cannot handle the strong dependencies between output labels.
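A sketch of training the text vectorizer with the CBOW variant of Word2vec via gensim (sg=0 selects CBOW); the character-level corpus of insurance clauses here is hypothetical:

```python
from gensim.models import Word2Vec

corpus = [list("本保险合同的保险期间为一年"),   # hypothetical clause, one character per token
          list("被保险人身故的给付保险金")]

w2v = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)  # sg=0 -> CBOW
vector = w2v.wv["保"]                         # 100-dimensional vector for one character
sequence = [w2v.wv[ch] for ch in corpus[0]]   # text feature vector sequence for a clause
```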
Considering that the scale of manually annotated data is relatively small, model accuracy is greatly affected. Performance can be improved and annotation cost reduced through an active learning strategy. The idea of active learning is to improve machine-learning efficiency by selectively learning from the data in a corpus: unlabeled data is first fed to the trained model, a portion of the samples is then selected for manual annotation according to the model's labeling results and the sampling strategy, and the newly annotated data is added to the labeled sample set to continue participating in training.
In the entity recognition flow, the preprocessed and vectorized word embedding sequence is first taken as the input of the BiLSTM layer, and the hidden state sequence output by the forward LSTM is spliced with the hidden states output by the backward LSTM to form the complete hidden state sequence, where m is the number of units in the hidden layer. After splicing, a linear layer is attached, which maps the hidden states into n dimensions, n being the number of labels of the annotated data set. The automatically extracted features are thereby obtained and recorded as a matrix K, where K_{ij} represents the score value of the i-th character being classified to the j-th label. A Softmax layer is then attached to perform n-class classification at each position, and the BiLSTM layer outputs the label to which each character finally belongs. However, the BiLSTM only labels each position independently and cannot make use of labels that have already been assigned, so a CRF layer is attached for correction and constraint. The transition matrix Z is added to the CRF layer as its parameter, where Z_{ij} represents the probability of transitioning from the i-th tag to the j-th tag. Maximum likelihood estimation is adopted as the cost function, and the Viterbi algorithm is used for decoding; the probability the model assigns to the label sequence Y of sentence X is

$$P(Y \mid X) = \frac{\exp\big(\mathrm{score}(X, Y)\big)}{\sum_{Y'} \exp\big(\mathrm{score}(X, Y')\big)}, \qquad \mathrm{score}(X, Y) = \sum_{t} K_{t, y_t} + \sum_{t} Z_{y_t, y_{t+1}}.$$
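A minimal NumPy sketch of the Viterbi decoding implied by this formula, with K as the emission score matrix and Z as the transition matrix (both randomly filled here purely for illustration):

```python
import numpy as np

def viterbi(K, Z):
    # K: (seq_len, n) per-character label scores; Z: (n, n) transition scores
    n_steps, n_tags = K.shape
    score = K[0].copy()
    back = np.zeros((n_steps, n_tags), dtype=int)
    for i in range(1, n_steps):
        cand = score[:, None] + Z + K[i]  # cand[p, q]: best path ending with p -> q
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n_steps - 1, 0, -1):   # backtrace the highest-scoring tag sequence
        path.append(int(back[i, path[-1]]))
    return path[::-1]

print(viterbi(np.random.rand(6, 4), np.random.rand(4, 4)))  # e.g. [2, 0, 3, 1, 1, 0]
```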
A sampling strategy is defined to select, for annotation, the samples whose predicted labels are least certain under the current model; the uncertainty of a sample is quantified by calculating the confidence of its unlabeled sequence. Because an uncertainty metric alone cannot handle outlier data points, a two-step sampling strategy combining the uncertainty metric with similarity is used. First, unlabeled samples carrying high information content are screened out based on the uncertainty metric; then, based on the similarity metric, the similarity between unlabeled samples and target samples is considered in order to avoid selecting outlier samples, while also avoiding the selection of unlabeled samples similar to samples on which the model has already been sufficiently trained. The remaining samples are thus ranked with the similarity measure to find the least similar sample.
A minimum confidence (least confidence) algorithm is adopted, and the product of the probabilities of the labeled sequence is normalized by the sequence length, so that the minimum confidence is not influenced by the length of the sequence:

$$\phi^{LC}(X_{ij}) = 1 - \max_{y_1, \ldots, y_n} P\left(y_1, \ldots, y_n \mid X_{ij}\right)^{\frac{1}{n}}$$

where X_{ij} represents a sequence sample and y_1, y_2, …, y_n is the model output sequence.
Using the chosen similarity measure to calculate the similarity between samples, the first average similarity d(x, U) between a sample x and the unlabeled set U can be calculated as

$$d(x, U) = \frac{1}{|U|} \sum_{u \in U} \mathrm{sim}(x, u)$$

and the second average similarity d(x, L) between x and the labeled set L as

$$d(x, L) = \frac{1}{|L|} \sum_{l \in L} \mathrm{sim}(x, l).$$
As shown in fig. 7, the structure of the prediction model may be as follows. The training data (input) may include the text feature vectors (word embeddings) corresponding to a sample text. The prediction model may include a position tag determination layer realized as a bidirectional long short-term memory layer (BiLSTM) and a position tag prediction layer realized as a conditional random field (CRF). The BiLSTM layer consists of a forward long short-term memory layer (forward) containing a plurality of forward modules (LSTM) and a backward long short-term memory layer (backward) containing a plurality of backward modules (LSTM), the forward modules being connected end to end and the backward modules being connected end to end. For the first forward module, the input may be a text feature vector, and its output may be fed to the second forward module and to the CRF layer. For the second forward module, the input may include a text feature vector together with the output of the first forward module, and its output may be fed to the third forward module and to the CRF layer; the other forward modules behave similarly. For the first backward module, the input may include the same text feature vector as the first forward module as well as the output of the second backward module, and its output may be fed to the CRF layer. For the second backward module, the input may include the corresponding text feature vector as well as the output of the third backward module, and its output may be fed to the CRF layer and to the first backward module; the other backward modules behave similarly.
Specifically, the CRF layer processes the implicit state sequence output by the position tag determination layer to obtain an output result, which is passed to a sampling module (sampling) and a tagging module (tagging) to obtain annotated sample texts; the annotated sample texts are then fed back into the training data.
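Putting the pieces together, a compact end-to-end sketch of the fig. 7 architecture (BiLSTM followed by a CRF), again assuming the pytorch-crf package:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # assumes the pytorch-crf package

class BiLSTMCRF(nn.Module):
    def __init__(self, emb_dim=100, hidden=128, n_tags=7):  # sizes are assumptions
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, n_tags)  # emission scores per character
        self.crf = CRF(n_tags, batch_first=True)

    def loss(self, x, tags):               # x: word-embedding sequences, tags: gold labels
        emissions = self.linear(self.lstm(x)[0])
        return -self.crf(emissions, tags)  # negative log-likelihood

    def predict(self, x):                  # Viterbi-decoded tag sequences
        emissions = self.linear(self.lstm(x)[0])
        return self.crf.decode(emissions)

model = BiLSTMCRF()
x = torch.randn(2, 12, 100)
tags = torch.randint(0, 7, (2, 12))
print(model.loss(x, tags).item(), model.predict(x)[0])
```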
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is likewise not necessarily sequential, and they may be performed in turn or alternately with other steps or sub-steps.
Based on the same inventive concept, the embodiment of the application also provides a knowledge graph data construction device for realizing the knowledge graph data construction method. The implementation scheme of the solution provided by the device is similar to the implementation scheme described in the above method, so the specific limitation in the embodiment of the device for constructing one or more knowledge-graph data provided below may refer to the limitation of the method for constructing the knowledge-graph data hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 8, there is provided a knowledge-graph data construction apparatus 800, including: a first acquisition module 802, a conversion module 804, and a prediction module 806, an update module 808, wherein:
the first obtaining module 802 is configured to obtain training data, where the training data includes sample text and a sample label corresponding to the sample text.
The conversion module 804 is configured to perform feature vector conversion processing on the sample text to obtain a text feature vector sequence corresponding to the sample text, where the sample text includes a plurality of characters, and the text feature vector sequence includes text feature vectors corresponding to the characters respectively.
And a prediction module 806, configured to predict, by using a prediction model, text feature vectors corresponding to the characters included in the sample text, and determine a prediction label corresponding to the sample text.
And an updating module 808, configured to determine a target sample text from each of the sample texts based on the original text data set and the labeled text data set, and obtain updated training data based on the target sample text and the prediction labels.
In one embodiment, the prediction model includes a position tag determination layer and a position tag prediction layer, and the prediction module is specifically configured to:
processing the text feature vectors corresponding to the characters respectively through the position label determining layer to obtain the predictive labels corresponding to the characters.
And carrying out maximum likelihood processing on the predictive label corresponding to each character through the position label predictive layer to obtain the predictive label of the sample text.
In one embodiment, the location tag determining layer includes a bidirectional long short-term memory layer, where the bidirectional long short-term memory layer includes a plurality of forward modules and a plurality of backward modules, each of the forward modules is connected end to end, each of the backward modules is connected end to end, and the prediction module is specifically configured to:
aiming at each forward module, obtaining the output result of the forward module based on the character and the output result of the previous forward module of the forward module.
aiming at each backward module, obtaining the output result of the backward module based on the character and the output result of the next backward module of the backward module.
And splicing based on the output result of each forward module and the output result of each backward module to obtain an implicit state sequence, and processing the implicit state sequence through a linear layer to obtain a prediction label corresponding to each character.
In one embodiment, the updating module is specifically configured to:
a first average similarity between the sample text and the original text data set is calculated, and a second average similarity between the sample text and the annotated text data set is calculated.
And taking the sample text of which the first average similarity does not meet a preset matching condition and the second average similarity meets the preset matching condition as a target sample text.
In one embodiment, the updating module is specifically configured to: carry out normalization processing on each sample text to obtain each sample text after normalization processing, and respectively calculate the confidence of each sample text.
and take, as the target sample text, the sample text whose first average similarity does not meet the preset matching condition, whose second average similarity meets the preset matching condition, and whose confidence meets a minimum confidence condition.
In one embodiment, the prediction module is specifically configured to: carry out maximum likelihood processing on the predicted label corresponding to each character through the position label prediction layer to obtain the probability value of each sample label to which each character belongs.
And determining a predictive label of the sample text based on the probability value of each sample label to which each character belongs.
The modules in the knowledge graph data constructing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing training data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method of constructing knowledge-graph data.
It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that the user information (including, but not limited to, user equipment information, user personal information, and the like) and the data (including, but not limited to, data for analysis, stored data, displayed data, and the like) referred to in the present application are information and data authorized by the user or fully authorized by all parties, and the collection, use, and processing of the related data must comply with relevant regulations.
Those skilled in the art will appreciate that all or part of the processes in the above method embodiments may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational databases and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum-computing-based data processing logic units, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments merely express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent application. It should be noted that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method for constructing knowledge graph data, characterized in that the method comprises:
acquiring training data, wherein the training data comprises sample texts and sample labels corresponding to the sample texts;
performing feature vector conversion processing on the sample text to obtain a text feature vector sequence corresponding to the sample text, wherein the sample text comprises a plurality of characters, and the text feature vector sequence comprises text feature vectors respectively corresponding to the characters;
predicting, through a prediction model, the text feature vectors respectively corresponding to the characters contained in the sample text, and determining a prediction label corresponding to the sample text;
and determining a target sample text among the sample texts based on an original text data set and an annotated text data set, and obtaining updated training data based on the target sample text and the prediction label.
2. The method according to claim 1, wherein the prediction model comprises a position label determination layer and a position label prediction layer, and the predicting, through the prediction model, the text feature vectors respectively corresponding to the characters contained in the sample text and determining the prediction label corresponding to the sample text comprises:
processing the text feature vectors respectively corresponding to the characters through the position label determination layer to obtain a prediction label corresponding to each character;
and performing, through the position label prediction layer, maximum likelihood processing on the prediction label corresponding to each character to obtain the prediction label of the sample text.
3. The method according to claim 2, wherein the position label determination layer comprises a bidirectional long short-term memory layer, the bidirectional long short-term memory layer comprises a plurality of forward modules connected end to end and a plurality of backward modules connected end to end, and the processing the text feature vectors respectively corresponding to the characters through the position label determination layer to obtain the prediction label corresponding to each character comprises:
for each forward module, obtaining an output result of the forward module based on the character and an output result of the previous forward module;
for each backward module, obtaining an output result of the backward module based on the character and an output result of the previous backward module;
and splicing the output results of the forward modules and the output results of the backward modules to obtain a hidden state sequence, and processing the hidden state sequence through a linear layer to obtain the prediction label corresponding to each character.
4. The method according to claim 3, wherein the determining a target sample text among the sample texts based on the original text data set and the annotated text data set comprises:
calculating a first average similarity between the sample text and the original text data set, and calculating a second average similarity between the sample text and the annotated text data set;
and taking, as the target sample text, a sample text whose first average similarity does not satisfy a preset matching condition and whose second average similarity satisfies the preset matching condition.
5. The method according to claim 4, wherein the taking, as the target sample text, a sample text whose first average similarity does not satisfy the preset matching condition and whose second average similarity satisfies the preset matching condition comprises:
performing normalization processing on each sample text to obtain normalized sample texts, and calculating a confidence level of each sample text;
and taking, as target sample texts, the sample texts whose first average similarity does not satisfy the preset matching condition and whose second average similarity satisfies the preset matching condition, as well as the sample texts whose confidence level satisfies a minimum confidence condition.
6. The method according to claim 3, wherein the performing, through the position label prediction layer, maximum likelihood processing on the prediction label corresponding to each character to obtain the prediction label of the sample text comprises:
performing, through the position label prediction layer, maximum likelihood processing on the prediction label corresponding to each character to obtain, for each character, a probability value of belonging to each sample label;
and determining the prediction label of the sample text based on the probability values of the sample labels to which the characters belong.
7. A knowledge graph data construction apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring training data, wherein the training data comprises sample texts and sample labels corresponding to the sample texts;
the conversion module is used for carrying out feature vector conversion processing on the sample text to obtain a text feature vector sequence corresponding to the sample text, wherein the sample text comprises a plurality of characters, and the text feature vector sequence comprises text feature vectors respectively corresponding to the characters;
the prediction module is used for predicting text feature vectors corresponding to the characters contained in the sample text respectively through a prediction model and determining a prediction label corresponding to the sample text;
and the updating module is used for determining a target sample text among the sample texts based on the original text data set and the annotated text data set, and obtaining updated training data based on the target sample text and the prediction label.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
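As an aid to reading claim 3, the bidirectional long short-term memory structure it recites can be sketched as follows; this uses PyTorch with illustrative dimensions and is one possible realization, not the claimed implementation itself:

```python
import torch
import torch.nn as nn

class PositionLabelDeterminationLayer(nn.Module):
    # Sketch of claim 3: a bidirectional LSTM over the per-character text feature
    # vectors; forward and backward hidden states are spliced and passed through
    # a linear layer to score the prediction label of each character.
    def __init__(self, feature_dim=128, hidden_dim=64, num_labels=3):
        super().__init__()
        self.bilstm = nn.LSTM(feature_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, feature_seq):
        # feature_seq: (batch, num_chars, feature_dim) text feature vector sequence
        hidden_seq, _ = self.bilstm(feature_seq)  # (batch, num_chars, 2 * hidden_dim)
        return self.linear(hidden_seq)            # per-character label scores

# Example: one sample text of 10 characters with 128-dimensional feature vectors.
layer = PositionLabelDeterminationLayer()
scores = layer(torch.randn(1, 10, 128))  # shape: (1, 10, 3)
```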
CN202311744459.8A 2023-12-18 2023-12-18 Knowledge graph data construction method and device and computer equipment Pending CN117648979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311744459.8A CN117648979A (en) 2023-12-18 2023-12-18 Knowledge graph data construction method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN117648979A

Family

ID=90043319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311744459.8A Pending CN117648979A (en) 2023-12-18 2023-12-18 Knowledge graph data construction method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN117648979A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination