CN108182295B

CN108182295B - Enterprise knowledge graph attribute extraction method and system

Info

Publication number: CN108182295B
Application number: CN201810136568.4A
Authority: CN
Inventors: 孙世通; 刘德彬; 严开; 陈玮
Original assignee: Chongqing Socialcredits Big Data Technology Co ltd; Chongqing Telecommunication System Integration Co ltd
Current assignee: China Telecom Yijin Technology Co.,Ltd.; Chongqing Yucun Technology Co ltd
Priority date: 2018-02-09
Filing date: 2018-02-09
Publication date: 2021-09-10
Anticipated expiration: 2038-02-09
Also published as: CN108182295A

Abstract

The invention provides an enterprise knowledge graph attribute extraction method, which comprises the following steps: defining an entity category and an event category; defining attribute structures for each type of entity; preparing and marking the corpus; extracting entity attributes; and fusing entity attributes. The invention = objectivity and high efficiency of text content extraction and classification by combining knowledge of experts on entity attributes in specific fields and machine learning, and is applied to Chinese corpora of full enterprise data; and various target attributes can be identified by a small amount of labels. The problem of extraction of node entity attributes in a knowledge graph and fusion of multi-source attributes is solved.

Description

Enterprise knowledge graph attribute extraction method and system

Technical Field

The invention relates to an information processing method and an information processing system, in particular to an enterprise knowledge graph attribute extraction method and an enterprise knowledge graph attribute extraction system.

Background

The knowledge graph is a semantic network based on a graph data structure, and the basic units of the knowledge graph are nodes (nodes) and edges (edges). In the enterprise knowledge graph, nodes represent event entities and enterprise entities; edges characterize relationships between entities. In the whole enterprise knowledge graph, if an enterprise is focused, basic information of the enterprise, a development process formed by connecting event nodes in series, and contents such as enterprise clusters associated with each layer (the association includes but is not limited to equity investment, cooperation, upstream and downstream, subordination and the like) can be found.

The knowledge graph is applied to the fields of enterprise information and enterprise risk discovery, and has the core value that enterprise information of various categories is organically connected in series, so that a risk model is facilitated to identify hidden associated risks, group risks and the like. In the step of structuring node data, two major problems are mainly faced: 1) extracting different attributes from different data sources, and 2) reasonably fusing the attributes from different sources in the same entity.

In terms of technical aspect, to construct such an enterprise knowledge graph, the following two difficulties need to be overcome:

entity attribute extraction, multi-source attribute fusion and establishment of relationships among different entities.

In the prior art, attribute extraction and fusion based on an industry experience rule and a dictionary and attribute extraction and fusion based on supervised learning and pattern matching are adopted.

The defects of the prior art are that the attribute extraction and fusion based on the industry experience rule and the dictionary: for entities in different industries, the determination of the industrial attributes of the entities needs the intervention of professionally qualified industry experts, but the problems of low marking efficiency, inconsistent marking standards and the like cannot be solved all the time depending on manpower. While the dictionary depending on the unified standard can recognize the relationship of the verb as the central word in the text, the extraction of the relationship such as the noun colloquial word is easy to be misjudged. In addition, this method cannot effectively process and judge unknown words.

In the prior art, attribute extraction and fusion based on supervised learning and pattern matching are adopted: the classifier is constructed on the corpus labeled manually, but the main bottleneck of the classifier is that more labels are needed and the requirement on data quality is high.

In the prior art, the extraction of enterprise knowledge graph attributes is mainly based on text data, but certain restrictions exist when an image, an audio/video and a text appear simultaneously and cross-source processing is required. The condition of extracting entities and relations with different levels and granularities is not considered in the modeling process.

In the prior art, the enterprise knowledge map attribute extraction adopts manual marking for processing the target text, so that the efficiency is low, the cost is high, and the massive text cannot be quickly processed.

In the prior art, correlation analysis and reasoning between texts cannot be realized by extracting enterprise knowledge graph attributes, and end-to-end adaptive learning and relationship establishment are realized.

Disclosure of Invention

The invention provides a method for extracting enterprise knowledge graph attributes efficiently, automatically and accurately, which comprises the following steps:

defining entity types, event types and entity attribute structures of training samples;

preparing and marking a training sample corpus;

training an entity attribute extraction model;

inputting the target text into an entity attribute extraction model to obtain target text entity attributes;

and performing entity attribute fusion on the target text.

Further, the entity category, the event category and the entity attribute structure defining the training sample comprise,

defining entity categories as enterprise factors or/and personal factors;

defining event categories as one or more of official documents, court announcements, tenders, equity, strategies, personnel, finance, debt, products, marketing, branding, accidents;

defining the fields of the attributes as a plurality of or one of type fields, time fields, mark fields and body fields;

the preparation and marking of the training sample corpus comprises marking the event category and the entity attribute structure of each text of the training sample library.

Further, the training of the entity attribute extraction model comprises the following steps:

s1: marking according to characters, inputting an N x K dimensional character vector matrix as a first bidirectional long-short time memory recurrent neural network to obtain an N x T dimensional marking class probability distribution matrix of each character, wherein N is a batch size numerical value, K is a character embedding vector length, T is a character marking class number, the position of the maximum value corresponds to a label of the current character, and character embedding data of each character are obtained;

s2: determining training sample subject information;

s3, defining the event vector according to the following formula, wherein evenEmbedding is the event vector, w_jThe vector of the jth word in the sentence is represented, and n represents the sentence within the front-back distance n of the main body;

and according to event labels, taking an N-K dimensional event vector matrix as the initial input of a second bidirectional long-time and short-time memory cyclic neural network, wherein N is a batch size numerical value, K is a word embedded vector length, L is the category number of the event labels, and the position of the maximum value corresponds to the label of the current event.

The bayesian network is defined as:

P(A,B,C,D)＝P(D|A,B)*P(C|A)*P(B|A)P(A)

a is the probability of whether the text describes an event of some kind,

b is the probability of the event extraction being successful,

c is the probability of containing time information,

d is the probability of containing the vocabulary of the specific field,

wherein the value of B is determined by whether the label output by the N-L dimension labeling class probability distribution matrix is the same as the labeling of the training sample, if the label is the same, the value of B is 1, if the label is not the same, the value of B is 0,

acquiring a first N x L dimensional matrix from a second bidirectional long-short-term memory recurrent neural network, inputting the first N x L dimensional matrix into a Bayesian network, performing feature fusion on the second N x L dimensional matrix and the first N x L dimensional matrix output by the Bayesian network, and feeding a feature fusion result back to the second bidirectional long-short-term memory recurrent neural network;

s4: and defining a loss function as the mean square error of the output of each time node of the bidirectional long-time memory recurrent neural network and the marking data of the training sample, and repeating the step S3 until the loss function is converged.

Further, the entity attribute extraction model includes,

acquiring a first N x L dimensional matrix from a forward hidden layer of a second bidirectional long-short time memory cyclic neural network, inputting the first N x L dimensional matrix into a Bayesian network, performing feature fusion on the second N x L dimensional matrix and the first N x L dimensional matrix output by the Bayesian network, and taking a feature fusion result as an input of the second bidirectional long-short time memory cyclic neural network to the backward hidden layer;

alternatively, the first and second electrodes may be,

acquiring a first N x L dimensional matrix from a second bidirectional long-short-term memory recurrent neural network output layer, inputting the first N x L dimensional matrix into a Bayesian network, performing feature fusion on the second N x L dimensional matrix and the first N x L dimensional matrix output by the Bayesian network, and taking a feature fusion result as the input of the second bidirectional long-short-term memory recurrent neural network input layer;

further, performing entity attribute fusion on the target text comprises the following steps:

a, selecting a basic structure of event entity data as a base value according to the similarity with a structure template;

b, traversing the candidate set events, and matching the tree types according to the depth priority of the tree type structure;

c when two events are compared, the following rules are followed:

if the node attribute value in the basic structure is missing, directly supplementing;

if the attribute values of the corresponding nodes conflict in the basic structure, if the attribute values of the candidate set obtained by the quality evaluation function are better, replacing the non-null value of the substrate;

if the base attribute is in a list format, adding unique non-repetitive elements in the candidate set to the table of the base;

and D, repeating the step B and the step C until the attribute can not be perfected continuously.

In order to ensure the implementation of the method, the invention also provides an enterprise knowledge graph attribute extraction system, which comprises the following units:

the defining unit is used for defining entity types, event types and entity attribute structures of the training samples;

the marking unit is used for training the sample corpus preparation and marking;

the training unit is used for training the entity attribute extraction model;

the entity attribute extraction unit is used for inputting the target text into the entity attribute extraction model to obtain the entity attribute of the target text;

and the attribute fusion unit is used for executing entity attribute fusion on the target text.

Further, the definition unit defines entity category, event category and entity attribute structure of the training sample including,

defining entity categories as enterprise factors or/and personal factors;

the training sample corpus preparation and marking comprises labeling the event category and the entity attribute structure of each text of the training sample library.

Further, the training unit trains the entity attribute extraction model by adopting the following steps:

s2: determining training sample subject information;

s3, defining the event vector according to the following formula, wherein evenEmbedding is the event vector, w_jA vector representing the jth word in the sentence, n representing the sentence within a distance n before and after the subject；

The bayesian network is defined as:

P(A,B,C,D)＝P(D|A,B)*P(C|A)*P(B|A)P(A)

a is the probability of whether the text describes an event of some kind,

b is the probability of the event extraction being successful,

c is the probability of containing time information,

d is the probability of containing the vocabulary of the specific field,

Further, the entity attribute extraction model includes,

acquiring a first N x L dimensional matrix from a forward hidden layer of a second bidirectional long-time and short-time memory recurrent neural network, inputting the first N x L dimensional matrix into a Bayesian network, performing feature fusion on the second N x L dimensional matrix output by the Bayesian network and the first N x L dimensional matrix, and taking a feature fusion result as the input of the second bidirectional long-time and short-time memory recurrent neural network to the backward hidden layer;

alternatively, the first and second electrodes may be,

further, the attribute fusion unit performs entity attribute fusion on the target text by adopting the following steps:

b, traversing the candidate set events, and matching attributes in pairs according to the depth priority of the tree structure;

c when two events are compared, the following rules are followed:

The invention has the beneficial effects that:

the method 1 realizes the acquisition of knowledge in the multi-source heterogeneous data and reduces the dependency degree of an algorithm model on the label.

2, the entity attribute extraction, the multi-source attribute fusion and the establishment of the relationship among different entities are realized.

Combining the objectivity and high efficiency of the expert on the entity attribute of the specific field and the machine learning on the extraction and classification of the text content, and applying the objectivity and high efficiency to Chinese corpora of the total enterprise data; and various target attributes can be identified by a small amount of labels.

And 4, after the attribute extraction model is trained through the sample data, automatic entity attribute extraction and knowledge graph construction are realized on the mass target text data, the efficiency is improved, and the labor cost is reduced.

5 the invention combines the advantages of Bayesian network and LSTM to provide Bayesian recurrent neural network. The Bayesian network feeds back the BilSTM recurrent neural network, so that the BilSTM recurrent neural network is transversely used for capturing long-time long-range space-time correlation between entities, and the Bayesian network is longitudinally used for analyzing and reasoning the correlation. Meanwhile, the BilSTM is updated by feeding back the inference result of the Bayesian network, thereby realizing end-to-end adaptive learning and relationship establishment.

Drawings

FIG. 1 is a flowchart of an enterprise knowledge graph attribute extraction method according to an embodiment of the present invention.

FIG. 2 is a block diagram of an enterprise knowledge graph attribute extraction system according to an embodiment of the present invention.

FIG. 3 is a diagram of a prior art long term memory network.

FIG. 4 is a diagram of a prior art BilSTM neural network model.

Fig. 5 is a schematic diagram of a bayesian recurrent neural network model according to an embodiment of the present invention.

Fig. 6 is a schematic diagram of a bayesian network according to an embodiment of the present invention.

FIG. 7 is a diagram of a prior art LSTM memory module according to the present invention.

FIG. 8 is a schematic diagram of feature fusion according to an embodiment of the present invention.

FIG. 9 is a schematic diagram of feature fusion according to an embodiment of the present invention.

Detailed Description

One of the ideas of the present invention for solving the description problem of the background art is as follows: and the Bayesian recurrent neural network is adopted as an entity attribute extraction model to realize the extraction of the enterprise knowledge map attributes. The Bayesian network is used as a network layer to be stacked on the upper layer of the BilSTM recurrent neural network, so that the temporal and spatial correlation between entities in a long time and a long range can be captured by using the BilSTM recurrent neural network in the transverse direction, and the correlation analysis and reasoning can be realized by using the Bayesian network in the longitudinal direction. Meanwhile, the BilSTM is updated by feeding back the inference result of the Bayesian network, thereby realizing end-to-end adaptive learning and relationship establishment. And constructing an accurate and efficient entity attribute extraction model, and realizing the automation of entity attribute extraction.

As shown in FIG. 1, the method for extracting enterprise knowledge graph attributes of the invention comprises the following steps:

preparing and marking a training sample corpus;

training an entity attribute extraction model;

and performing entity attribute fusion on the target text.

Wherein, in the step of defining the entity category and the event category,

the entity category may be business or personal.

The event category can be official documents, court announcements, tenders, equities, strategies, personnel, finance, debt, products, marketing, branding, accidents, etc

For each type of entity, a standardized attribute structure is defined, and taking accident as an example, in an embodiment of the present invention, the attribute structure defining an event is as follows:

taking the stock right as an example, the attribute structure of the defined event in an embodiment of the present invention is:

in the steps of preparing and marking the corpus, the word notation specification and meaning in one embodiment of the present invention are as follows:

B-ORG stands for entity start bit tag

I-ORG stands for entity composition tag

X represents placeholders such as punctuation

O represents other characters

After the corpus marking is finished, the follow-up program can understand the meaning of the entity in the text, and the machine can conveniently process the text.

According to the above specifications, marking of each character of the training text is completed in one embodiment of the invention.

The event tag specification and meaning in one embodiment of the invention is as follows:

judge represents the official document;

NOTESE stands for court announcements;

COURT stands for open bulletin;

bid stands for bid;

STOCK stands for STOCK right;

STRATGY stands for STRATEGY;

HR stands for human;

FINANCE stands for FINANCE;

debt represents debt;

PROD stands for product;

marker stands for marketing;

BRAND stands for BRAND;

ACCIDENT represents an ACCIDENT;

it should be noted that the labels and specifications of the events can be flexibly selected according to specific items, and are not limited to the events listed by the present invention.

The event labels are expressed in English to facilitate subsequent procedures to process the text.

And according to the specifications, marking each text of the training texts.

In one embodiment of the invention, the marking of the training text is performed manually, and the marking result is used as a reference for model training in the subsequent steps.

The following describes the steps of training the entity attribute extraction model with reference to the embodiment,

in view of the problems (mentioned in the background) of the current mainstream methods in processing entity attribute extraction, the deep neural network is to deal with the difficulties. The invention provides an end-to-end semi-supervised and unsupervised method applied to the attribute extraction problem of an event entity taking an enterprise as a main body, so that the acquisition of knowledge in multi-source heterogeneous data is realized and the dependence degree of an algorithm model on a label is reduced.

Long-Short Term Memory Network (LSTM) is a special recurrent neural Network used to learn the Long-Term dependence of time series data. Since its introduction, it has been widely used in handwriting, speech recognition, machine translation, etc., and has achieved unusual results. The method can realize long-term data memory and has obvious effect in text semantic analysis. The LSTM is expanded in the time dimension, a chain LSTM neural network can be obtained, and the relationship between entities with uncertain length and the entities can be modeled so as to further characterize the respective characteristics of the entities. The LSTM memory module is shown in fig. 7.

The cell of the LSTM can be characterized by the following equation:

i_t＝g(W_xix_t+W_hih_t-1+b_i)

f_t＝g(W_xfx_t+W_hfh_t-1+b_f)

o_t＝g(W_xox_t+W_hoh_t-1+b_o)

the input variation can be characterized by the following equation:

c_in_t＝tanh(W_xcx_t+W_hch_t-1+b_{c_in})

the state change can be characterized by the following equation:

c_t＝f_t·c_t-1+i_t·c_in_t

h_t＝o_t·tanh(c_t)

the Bidirectional long-short term memory network (BilSTM) comprises a forward hidden layer and a backward hidden layer, can acquire the long-time long-range associated dependency relationship of a context, captures the characteristics of a contextual entity, acquires the space-time correlation among more entities, can eliminate the influence of noise such as interference entities on a neural network model from two directions, greatly assists in mining the long-time dependency relationship, and extracts the high-level semantic characteristics which are vital to information extraction, entity relationship identification and the like. The advantages of LSTM and its variants over bayesian networks are the ability to capture long sequence relationships between entities, but their reasoning ability and interpretability is poor. The BilSTM neural network model is shown in FIG. 4.

Bayesian Networks (BN), also known as Belief networks (Belief networks), are a probabilistic graphical model. The method simulates uncertainty of causal relationship in human reasoning process to realize relationship establishment and reasoning, and has good knowledge expression and capability of processing uncertainty knowledge. In addition, the bayesian network can encode and interpret knowledge from a probability perspective, and has been widely used in many fields including computer intelligence science, medical diagnosis, information retrieval, and the like. The Bayesian network has the advantages of strong reasoning capability and the disadvantages of poor modeling capability on long sequences and incapability of capturing indirect relationships between entities.

The invention combines the advantages of the Bayesian network and the BilSTM to provide the Bayesian recurrent neural network. The Bayesian network is used as a network layer to be stacked on the upper layer of the BilSTM recurrent neural network, so that the temporal and spatial correlation between entities in a long time and a long range can be captured by using the BilSTM recurrent neural network in the transverse direction, and the correlation analysis and reasoning can be realized by using the Bayesian network in the longitudinal direction. Meanwhile, the BilSTM is updated by feeding back the inference result of the Bayesian network, thereby realizing end-to-end adaptive learning and relationship establishment.

Fig. 5 shows a bayesian recurrent neural network model according to an embodiment of the present invention.

The embodiment of the invention trains an entity attribute extraction model by adopting the following steps:

s1 marks the words, and inputs the word vector matrix (N × K) as the BiLSTM to obtain the mark class probability distribution (N × 4 matrix) of each word. Wherein N is the length of each batch, K is the Embedding vector length, 4 is the number of classes of word labeling, and the position of the maximum value corresponds to the label of the current word. At the same time, the word embedding of each word is obtained.

Embedding can be viewed as a mathematical spatial Mapping (Mapping): map (lambda y: f (x)), which is characterized by: more precisely, when the function f is called a simple shot, for Y in each value domain, there is at most one X in the definition domain so that f (X) is Y), and the structure before and after mapping is unchanged, corresponding to the word embedding concept, it can be understood to find a function or mapping, generate a new spatial expression, and map the X spatial information expressed by the word one-hot to the multidimensional spatial vector of Y.

Batch Size: batch size. In an embodiment of the present invention, the parameter updating method includes three methods:

(1) and (3) the Batch Gradient Descent decreases in Batch Gradient, a loss function is calculated by traversing all data sets, and a parameter is updated once, so that the obtained direction can more accurately point to the direction of the extreme value.

(2) And the Stochastic Gradient Descent reduces the random Gradient, calculates a loss function for each sample once, and updates the parameters once, thereby having the advantage of high speed.

(3) And (3) dividing sample data into a plurality of batches to calculate loss functions and update parameters in batches in the compromise of the Mini-batch Gradient Decent and the previous two methods, wherein the directions are stable. S2 obtains subject candidates for the event from the text according to the result of the sequence annotation,

s2: determining the subject by syntactic and part-of-speech analysis (dependency syntactic analysis, common general knowledge to those skilled in the art is not expanded here);

s3, defining an event vector according to the following formula, wherein evenEmbedding is the event vector, wj represents a vector of the jth word in a sentence, and n represents a sentence within the front-back distance n of a main body;

through the steps, the event vector matrix of the text can be obtained from the label class probability distribution of each word in the training text or the target text.

And according to event labels, inputting an event vector matrix (N x K) as a BilSTM to obtain the labeled event probability distribution (N x L matrix) of each event in the training sample. Wherein N is the length of each batch, K is the Embedding vector length, L is the number of categories of event labeling (which will not be described in detail later), and the position of the maximum value corresponds to the label of the current event.

The position of the maximum value corresponds to the label of the current event, namely the event with the maximum probability in the probability distribution is judged as the result of entity attribute extraction.

In an embodiment of the present invention, the event label refers to a text set labeled as the same event type in the training sample.

In an embodiment of the present invention, as shown in fig. 6, according to the actual dependency relationship, a DAG (Directed Acyclic Graph) defining a joint probability that a bayesian network existing text describes a certain class of events and a joint probability that a text describes the certain class of events is defined as:

P(A,B,C,D)＝P(D|A,B)*P(C|A)*P(B|A)P(A)

a is the probability of whether the text describes an event of some kind,

b is the probability of the event extraction being successful,

d is the probability of containing the vocabulary of the specific field,

c is the probability of containing time information,

and B events (the probability of successful extraction) can be obtained by calculating whether the labels obtained by calculation are the same as the marking of the training sample in all the events of the corpus, and if the labels are the same, B is assigned to 1, and if the labels are not the same, B is assigned to 0.

If the second bidirectional long-and-short time memory label event output by the recurrent neural network is the same as the label event marked manually, the event extraction is successful, otherwise, the event extraction is unsuccessful.

In an embodiment of the present invention, a training sample is input to the BiLSTM to obtain an event class distribution of the sample, where the sample event has a maximum probability of being an accident, that is, the sample is extracted as an accident event, if the marking of the sample is an accident, the event extraction success B is 1, and if the marking of the sample is not an accident, the event extraction failure B is 0

In an embodiment of the present invention, the probability that the accident event contains the domain-specific vocabulary is that the number of samples containing the domain-specific vocabulary in all the samples manually labeled as the accident in the sample library is divided by the total number of samples manually labeled as the accident.

In an embodiment of the present invention, the probability that the accident event contains the time information is obtained by dividing the number of samples containing the time information in all the samples manually labeled as the accident event in the sample library by the total number of samples manually labeled as the accident event.

The matrix output by the Bayesian network is a probability distribution matrix of whether the text describes a certain event or not;

specifically, the above process may include two embodiments,

the first embodiment: acquiring a first N x L dimensional matrix from a forward hidden layer of a second bidirectional long-time and short-time memory recurrent neural network, inputting the first N x L dimensional matrix into a Bayesian network, performing feature fusion on the second N x L dimensional matrix output by the Bayesian network and the first N x L dimensional matrix, and taking a feature fusion result as the input of the second bidirectional long-time and short-time memory recurrent neural network to the backward hidden layer;

specifically, the first embodiment includes, in the following,

as shown in fig. 8, a first N × L dimensional matrix is obtained from a time t of the second bidirectional long-short time memory cyclic neural network to the front hidden layer, the first N × L dimensional matrix is input to the bayesian network, feature fusion is performed on the second N × L dimensional matrix output by the bayesian network and the first N × L dimensional matrix, and a feature fusion result is used as an input of the second bidirectional long-short time memory cyclic neural network to the rear hidden layer at the time t;

it will be appreciated by those skilled in the art that in the present invention, time t refers to the input sequence t, and the recurrent neural network will have one input Xt at each time.

In other embodiments, a first N x L dimensional matrix is obtained from a time t1 when the recurrent neural network is recalled from a second bidirectional long-short term, the first N x L dimensional matrix is input into the bayesian network, feature fusion is performed on the second N x L dimensional matrix and the first N x L dimensional matrix output by the bayesian network, a feature fusion result is used as an input at a time t2 when the recurrent neural network is recalled from the second bidirectional long-short term, and t1 and t2 are different input sequences;

second embodiment: as shown in fig. 9, a first N × L dimensional matrix is obtained from the second bidirectional long-and-short-term memory recurrent neural network output layer, the first N × L dimensional matrix is input to the bayesian network, feature fusion is performed on the second N × L dimensional matrix output by the bayesian network and the first N × L dimensional matrix, and a feature fusion result is used as an input of the second bidirectional long-and-short-term memory recurrent neural network input layer;

in the invention, the Bayesian network is used as a network layer to be stacked on the upper layer of the BilSTM recurrent neural network, so that the temporal and spatial correlation in a long time and a long range between entities is captured by using the BilSTM recurrent neural network in the transverse direction, and the correlation analysis and reasoning are realized by using the Bayesian network in the longitudinal direction. Meanwhile, the BilSTM is updated by feeding back the inference result of the Bayesian network, thereby realizing end-to-end adaptive learning and relationship establishment.

It should be noted that, taking the arithmetic mean of the bidirectional long-and-short-term memory recurrent neural network output matrix and the bayesian network output matrix is only one way of matrix feature fusion, and the invention is not limited to this, and the way of matrix feature fusion may also include geometric mean, square mean (root mean square mean, rms), harmonic mean, weighted mean, and the like.

S4 defines the loss function (loss function) as the mean square error (mean square error) of the output of each time node of the BilSTM and label, iterates the model until the loss function converges, i.e., repeats the step S3 until the loss function converges.

The following describes the step of performing entity attribute fusion on the target text with reference to an embodiment.

Inputting the target text into the entity attribute extraction model to obtain the entity attribute of the target text, obtaining the main bodies and attribute structures of all the target texts, and obtaining the distribution of the event categories to which the target text belongs:

Distribution＝[p1,p2,…,pL]

however, events obtained from different data sources may describe the same event, but the attribute extraction results are missing or conflicting. Therefore, the invention introduces a fusion strategy and solves the problem on the basis of event extraction.

The invention defines the category similarity of two events and can be characterized by the similarity of the event distribution (cosine similarity, etc.). When too many events are extracted, the similarity between two traversal events will cause a large calculation overhead. Therefore, an event candidate set is obtained, and an event set to be fused is selected from the candidate set.

The basic rule for selecting the candidate set is as follows:

subject of event is the same

Event class distribution has high Similarity (Cosine Similarity)

The events are close in time

For the event candidate set, complementary fusion of attributes is also required to be realized, and the step mainly depends on the matching degree of attributes such as time, subject, category and the like to achieve entity alignment of the same event. The attribute fusion steps are as follows:

a, selecting an event entity data basic structure as a base value according to the similarity with the structure template

B traversing candidate set events, matching attributes in pairs according to tree structure depth priority

C when two events are compared, the following rules are followed:

d, repeating B-C until the attribute can not be improved continuously

In one embodiment of the invention, two events are extracted from two target texts

The structural template in the embodiment is

In the present embodiment the basic structure is

The attribute values are eventType, tags, subject, time, tags in the embodiment;

in another embodiment of the present invention, a plurality of target table texts get two events after passing through an attribute extraction model:

event 1:

event 2:

because the two events have the same subject and the same time, namely the two events have the same structure template, and the event 3 is obtained by fusing the event 1 and the time 2

In another embodiment of the present invention, two events are obtained after a plurality of target texts pass through an attribute extraction model:

event 4:

event 5:

in the embodiment, the two events have the same base structure, but the time attribute conflicts, and the time attribute value of the event 5 obtained by the quality evaluation function is better, so that the time attribute of the event 4 is replaced by time 2017-05-0800:00: 00.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments are still modified, or some or all of the technical features are equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims

1. An enterprise knowledge graph attribute extraction method is characterized by comprising the following steps:

preparing and marking a training sample corpus;

training an entity attribute extraction model; the method for training the entity attribute extraction model comprises the following steps:

s2: determining training sample subject information;

s3, defining an event vector according to the following formula, wherein evenEmbedding is the event vector, w j represents a vector of the jth word in a sentence, and n represents a sentence within a distance n between the front and the back of a subject;

according to event marking, taking an N-X-K dimensional event vector matrix as the initial input of a second bidirectional long-and-short time memory cyclic neural network, wherein N is a batch size numerical value, K is a word embedded vector length, L is the category number of the event marking, and the position of the maximum value corresponds to the label of the current event;

the bayesian network is defined as:

P(A,B,C,D)＝P(D|A,B)*P(C|A)*P(B|A)P(A),

a is the probability of whether the text describes an event of some kind,

b is the probability of the event extraction being successful,

c is the probability of containing time information,

d is the probability of containing the vocabulary of the specific field,

s4: defining a loss function as the mean square error of the output of each time node of the bidirectional long-time memory recurrent neural network and the marking data of the training sample, and repeating the step S3 until the loss function is converged;

and performing entity attribute fusion on the target text.

2. The method of extracting enterprise knowledge-graph attributes of claim 1,

the entity category, the event category and the entity attribute structure of the defined training sample comprise,

defining entity categories as enterprise factors or/and personal factors;

3. The method of extracting enterprise knowledge-graph attributes of claim 1,

an entity attribute extraction model, comprising,

alternatively, the first and second electrodes may be,

and acquiring a first N x L dimensional matrix from the second bidirectional long-short-term memory recurrent neural network output layer, inputting the first N x L dimensional matrix into the Bayesian network, performing feature fusion on the second N x L dimensional matrix output by the Bayesian network and the first N x L dimensional matrix, and taking a feature fusion result as the input of the second bidirectional long-short-term memory recurrent neural network input layer.

4. The method of extracting enterprise knowledge-graph attributes of claim 1,

performing entity attribute fusion on the target text comprises the following steps:

1) selecting a basic structure of the event entity data as a base value according to the similarity with the structure template;

2) traversing the candidate set events, and matching attributes according to the depth priority of the tree structure;

3) when two events are compared, the following rules are followed:

4) and repeating the step 2) and the step 3) until the attribute can not be improved continuously.

5. An enterprise knowledge graph attribute extraction system is characterized by comprising the following units:

the training unit is used for training the entity attribute extraction model; the training unit trains the entity attribute extraction model by adopting the following steps:

s2: determining training sample subject information;

s3: defining an event vector according to the following formula, wherein evenEmbedding is the event vector, w j represents a vector of the jth word in a sentence, and n represents a sentence within a distance n between the front and the back of a subject;

the bayesian network is defined as:

P(A,B,C,D)＝P(D|A,B)*P(C|A)*P(B|A)P(A),

a is the probability of whether the text describes an event of some kind,

b is the probability of the event extraction being successful,

c is the probability of containing time information,

d is the probability of containing the vocabulary of the specific field,

6. The enterprise knowledge-graph attribute extraction system of claim 5,

the definition unit defines entity category, event category and entity attribute structure of the training sample,

defining entity categories as enterprise factors or/and personal factors;

7. The enterprise knowledge-graph attribute extraction system of claim 5,

an entity attribute extraction model, comprising,

alternatively, the first and second electrodes may be,

8. The enterprise knowledge-graph attribute extraction system of claim 5,

the attribute fusion unit performs entity attribute fusion on the target text by adopting the following steps:

2) traversing the candidate set events, and matching attributes in pairs according to the depth priority of the tree structure;

3) when two events are compared, the following rules are followed: