CN112613306A - Method, device, electronic equipment and storage medium for extracting entity relationship - Google Patents


Info

Publication number
CN112613306A
CN112613306A
Authority
CN
China
Prior art keywords
relation
sentence
relationship
label
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011635963.0A
Other languages
Chinese (zh)
Inventor
王坤
梁彧
田野
傅强
王杰
杨满智
蔡琳
金红
陈晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Eversec Beijing Technology Co Ltd
Original Assignee
Eversec Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Eversec Beijing Technology Co Ltd filed Critical Eversec Beijing Technology Co Ltd
Priority to CN202011635963.0A priority Critical patent/CN112613306A/en
Publication of CN112613306A publication Critical patent/CN112613306A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, an apparatus, an electronic device and a storage medium for extracting entity relationships, wherein the method comprises the following steps: inputting a sentence into a pre-trained relation recognition model, and obtaining a relation probability array according to the result information output by the relation recognition model, wherein the i-th element of the relation probability array represents the probability that the i-th sentence relation exists in the sentence, and i is a natural number; obtaining the relation labels of the sentence relations corresponding to the elements in the relation probability array that are greater than a predetermined probability threshold, so as to obtain a relation label set; and inputting each relation label in the relation label set, together with the sentence, into a pre-trained sequence labeling model, and obtaining the triple of the sentence relation corresponding to each relation label according to the result information output by the sequence labeling model.

Description

Method, device, electronic equipment and storage medium for extracting entity relationship
Technical Field
The embodiments of the present invention relate to the technical fields of machine learning and natural language processing, and in particular to a method, an apparatus, an electronic device and a storage medium for extracting entity relationships.
Background
Current entity relationship extraction techniques fall into three major categories.

The first is rule-based pattern matching, which uses predefined relation templates: when text matches the current template, the relation and entity information are extracted, with entity recognition assisting relation extraction. This approach has many limitations: each relation can be expressed in many ways and cannot be exhaustively defined, and enormous manual effort is needed to craft the template rules.

The second is semi-supervised entity relation extraction, which induces entity relation sequence patterns from contexts containing relation seeds, and then uses those patterns to find more relation seed instances and form a new relation seed set. In Bootstrapping-based entity relation extraction, a key problem is how to filter the acquired patterns so as to avoid introducing excessive noise into the iterative process and causing semantic drift. To address this, a co-training (co-learning) method has been proposed that uses two conditionally independent feature sets to provide different and complementary information, thereby reducing labeling errors. Although only a small seed set is needed to continuously acquire labeled data for a few entity relations, in actual use many wrong labels occur.

The third category, supervised entity relation extraction, divides into pipeline learning methods and multi-task learning methods. Pipeline methods extract the relations between entities directly on the basis of completed entity recognition; when entity recognition accuracy is low, errors propagate severely and degrade the overall result. Multi-task learning methods are mainly end-to-end neural network models that perform entity recognition and relation extraction simultaneously.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, an electronic device, and a storage medium for extracting an entity relationship, so as to improve accuracy of relationship identification and accuracy of entity extraction.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of embodiments of the invention.
In a first aspect of the present disclosure, an embodiment of the present invention provides a method for extracting an entity relationship, including:
inputting sentences into a pre-trained relation recognition model, and obtaining a relation probability array according to result information output by the relation recognition model, wherein the ith element of the relation probability array represents the probability that the ith sentence relation exists in the sentences, and i is a natural number;
obtaining a relation label of a sentence relation corresponding to an element which is greater than a preset probability threshold value in the relation probability array to obtain a relation label set;
and respectively inputting each relationship label in the relationship label set together with the sentence into a pre-trained sequence labeling model, and respectively obtaining a triplet of the sentence relationship corresponding to each relationship label according to the output result information of the sequence labeling model, wherein the triplet of the sentence relationship comprises a relationship name, a relationship subject and a relationship object.
In one embodiment, the relationship recognition model is obtained by training through the following steps:
acquiring a training sample set, wherein the training sample comprises a sample sentence and label information of a triple used for representing at least one sentence relation contained in the sample sentence;
determining an initialized relationship recognition model, wherein the initialized relationship recognition model comprises a target layer for outputting a relationship probability array formed by probabilities of each preset sentence relationship contained in the sentence;
and training to obtain the relation recognition model by using a machine learning method and taking the sample sentences in the training samples in the training sample set as the input of the initialized relation recognition model and taking the label information corresponding to the input sample sentences as the expected output of the initialized relation recognition model.
In one embodiment, the initialized relationship identification model is a multi-label classification model.
In an embodiment, the annotation information further includes, for the triple of each sentence relation, the starting position of the relation subject in the sentence and the starting position of the relation object in the sentence.
In an embodiment, the relationship recognition model is obtained by training in a GPU.
In one embodiment, the sequence labeling model is obtained by training through the following steps:
acquiring a training sample set, wherein the training sample comprises a relation label, a sentence and label information used for representing a triple including a sentence relation corresponding to the relation label in the sentence;
determining an initialized sequence annotation model, wherein the initialized sequence annotation model comprises a target layer for outputting a triplet containing a sentence relation corresponding to a relation tag in a sentence;
and training to obtain the sequence labeling model by using a machine learning method and taking the relation labels and sentences in the training samples in the training sample set as the input of the initialized sequence labeling model and taking the labeling information corresponding to the input relation labels and sentences as the expected output of the initialized sequence labeling model.
In an embodiment, the sequence annotation model is obtained by training in a GPU.
In a second aspect of the present disclosure, an embodiment of the present invention further provides an apparatus for extracting an entity relationship, including:
a relation recognition unit, configured to input sentences into a pre-trained relation recognition model and obtain a relation probability array according to the result information output by the relation recognition model, wherein the i-th element of the relation probability array represents the probability that the i-th sentence relation exists in the sentence, and i is a natural number;
a relation label obtaining unit, configured to obtain the relation labels of the sentence relations corresponding to the elements in the relation probability array that are greater than a predetermined probability threshold, so as to obtain a relation label set;
and the triple obtaining unit is used for respectively inputting each relation label in the relation label set and the sentence into a pre-trained sequence labeling model together, and respectively obtaining a triple of the sentence relation corresponding to each relation label according to the output result information of the sequence labeling model, wherein the triple of the sentence relation comprises a relation name, a relation subject and a relation object.
In one embodiment, the relationship recognition model is obtained by training through the following modules:
a first sample acquisition module, configured to acquire a training sample set, wherein each training sample comprises a sample sentence and annotation information of the triples representing at least one sentence relation contained in the sample sentence;
a first model determination module for determining an initialized relationship recognition model, wherein the initialized relationship recognition model comprises a target layer for outputting a relationship probability array formed by probabilities of each predetermined sentence relationship included in a sentence;
and the first model training module is used for training to obtain the relationship recognition model by using a machine learning method and taking the sample sentences in the training samples in the training sample set as the input of the initialized relationship recognition model and taking the marking information corresponding to the input sample sentences as the expected output of the initialized relationship recognition model.
In one embodiment, the initialized relationship identification model is a multi-label classification model.
In an embodiment, the annotation information further includes, for the triple of each sentence relation, the starting position of the relation subject in the sentence and the starting position of the relation object in the sentence.
In an embodiment, the relationship recognition model is obtained by training in a GPU.
In one embodiment, the sequence labeling model is obtained by training through the following modules:
a second sample acquisition module, configured to acquire a training sample set, wherein each training sample comprises a relation label, a sentence, and annotation information representing the triple of the sentence relation corresponding to the relation label in the sentence;
the second model determining module is used for determining an initialized sequence labeling model, wherein the initialized sequence labeling model comprises a target layer which is used for outputting a triple containing a sentence relation corresponding to a relation label in a sentence;
and the second model training module is used for training to obtain the sequence labeling model by taking the relation labels and sentences in the training samples in the training sample set as the input of the initialized sequence labeling model and taking the labeling information corresponding to the input relation labels and sentences as the expected output of the initialized sequence labeling model by using a machine learning method.
In an embodiment, the sequence annotation model is obtained by training in a GPU.
In a third aspect of the disclosure, an electronic device is provided. The electronic device includes: a processor; and a memory for storing executable instructions that, when executed by the processor, cause the electronic device to perform the method of the first aspect.
In a fourth aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the method in the first aspect.
The technical scheme provided by the embodiment of the invention has the beneficial technical effects that:
the method comprises the steps of inputting sentences into a pre-trained relation recognition model, and obtaining a relation probability array according to result information output by the relation recognition model, wherein the ith element of the relation probability array represents the probability that the ith sentence relation exists in the sentences, and i is a natural number; obtaining a relation label of a sentence relation corresponding to an element which is greater than a preset probability threshold value in the relation probability array to obtain a relation label set; and respectively inputting each relationship label in the relationship label set together with the sentence into a pre-trained sequence labeling model, and respectively obtaining a triplet of the sentence relationship corresponding to each relationship label according to the output result information of the sequence labeling model, wherein the triplet of the sentence relationship comprises a relationship name, a relationship subject and a relationship object, so that the relationship identification accuracy and the entity extraction accuracy can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly described below, and it is obvious that the drawings in the following description are only a part of the embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the contents of the embodiments of the present invention and the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a method for extracting entity relationships according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a training method of a relationship recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a relationship recognition model;
FIG. 4 is a flowchart illustrating a training method of a sequence annotation model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a sequence annotation model;
FIG. 6 is a schematic structural diagram of an apparatus for extracting entity relationships according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a training module of the relationship recognition model provided in accordance with an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a training module of the sequence annotation model provided in the embodiment of the present invention;
FIG. 9 shows a schematic diagram of an electronic device suitable for use in implementing embodiments of the present invention.
Detailed Description
In order to make the technical problems solved, the technical solutions adopted and the technical effects achieved by the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments, but not all embodiments, of the embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, belong to the scope of protection of the embodiments of the present invention.
It should be noted that the terms "system" and "network" are often used interchangeably herein in embodiments of the present invention. Reference to "and/or" in embodiments of the invention is intended to include any and all combinations of one or more of the associated listed items. The terms "first", "second", and the like in the description and claims of the present disclosure and in the drawings are used for distinguishing between different objects and not for limiting a particular order.
It should be further noted that, in the embodiments of the present invention, each of the following embodiments may be executed alone, or may be executed in combination with each other, and the embodiments of the present invention are not limited in this respect.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The technical solutions of the embodiments of the present invention are further described by the following detailed description with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a method for extracting entity relationships according to an embodiment of the present invention, where the embodiment is applicable to a case where a sentence relationship triple (a relationship name, a relationship subject, and a relationship object) is extracted from a given sentence, and the method can be executed by an apparatus for extracting entity relationships configured in an electronic device, as shown in fig. 1, the method for extracting entity relationships according to the embodiment includes:
in step S110, a sentence is input into a pre-trained relationship recognition model, and a relationship probability array is obtained according to result information output by the relationship recognition model.
Wherein the ith element of the relationship probability array represents the probability that the ith sentence relationship exists in the sentence, wherein i is a natural number.
It should be noted that the relationship recognition model described in this embodiment may be obtained by training in various ways as long as it can obtain the relationship probability array after a sentence is input. As an example, fig. 2 is a flowchart illustrating a training method of a relationship recognition model according to an embodiment of the present invention, and as shown in fig. 2, the relationship recognition model can be obtained by training according to the following method:
in step S210, a training sample set is obtained, where the training sample includes a sample sentence and annotation information of a triplet used for representing at least one sentence relationship included in the sample sentence.
For example, the training samples may take the form:
[The sample example is shown as an image (BDA0002881114280000081) in the original publication; its fields are described below.]
The above sample example is organized as an XML file, where the text field gives the sample sentence and rso_list is a list whose elements each describe one sentence relation: relation gives the current relation name, subject the subject of the sentence relation, subjectIdx the starting position of the subject in the sentence, object the object of the sentence relation, and objectIdx the starting position of the object in the sentence.
The starting position subjectIdx of the relation subject and the starting position objectIdx of the relation object are optional. However, an entity with the same name as the relation subject or the relation object may appear more than once in a sentence; if the position is not specified, the exact positions of the relation subject and relation object cannot be determined unambiguously during training, and the model may err or require further analysis. Explicitly specifying the starting positions therefore improves training effectiveness and efficiency.
According to the format of the sample example, the following steps can be adopted for making the sample:
it is first determined that a relationship class needs to be extracted, which limits the scope of relationship entity triples that can be identified. For a given sentence, acquiring specified relations existing in the sentence, and for each relation in the sentence, specifying a relation subject (subject) and a relation object (object), and after extracting the relation entity triple, storing the relation entity triple as a format specified by the sample example.
It should be noted that, to obtain a better training result, the number of samples for each relation in the annotations of the training set should be kept as balanced as possible; otherwise training is hindered. Each relation appearing in a sentence should have its triple information extracted as completely as possible, and the lengths of the sentences should not differ too much. In addition, no sentence should be too long, for example no more than 512 tokens, because the subsequent model does not support longer inputs; overly long sentences should be split where possible.
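A training sample in the format described above can be sketched as follows. The field names (text, rso_list, relation, subject, subjectIdx, object, objectIdx) follow the description in this embodiment, but the concrete sentence, relation name and dictionary representation are illustrative assumptions (the original stores samples as XML):

```python
# Hypothetical reconstruction of one relation-recognition training sample.
sample = {
    "text": "Li Ming was born in Beijing.",
    "rso_list": [
        {
            "relation": "birthplace",   # relation name
            "subject": "Li Ming",       # relation subject
            "subjectIdx": 0,            # start of subject in text
            "object": "Beijing",        # relation object
            "objectIdx": 20,            # start of object in text
        }
    ],
}

def check_sample(s):
    """Verify every annotated span occurs at its stated offset in the text."""
    for rso in s["rso_list"]:
        subj, obj = rso["subject"], rso["object"]
        if s["text"][rso["subjectIdx"]:rso["subjectIdx"] + len(subj)] != subj:
            return False
        if s["text"][rso["objectIdx"]:rso["objectIdx"] + len(obj)] != obj:
            return False
    return True
```

Such an offset check is a cheap way to catch the ambiguity discussed above (entities whose surface form repeats in the sentence) before training.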
In step S220, an initialized relationship recognition model is determined, wherein the initialized relationship recognition model comprises a target layer for outputting a relationship probability array formed by probabilities of the sentences containing the predetermined sentence relationships.
Since the relations contained in a sentence must be identified and a sentence may contain multiple relations, a plain multi-class classification approach cannot be used directly, because a multi-class model can only assign one class at a time. This embodiment therefore identifies relations with multi-label classification, which is exactly suited to the case where several categories appear in one sentence simultaneously.
To further improve relation recognition performance, this embodiment can build the relation recognition model on top of the pre-trained model BERT; the overall processing flow is as follows:
1) Data preprocessing: convert the original data set into the format required for multi-label classification, and clean the text to remove useless characters.
2) Text vectorization: turn the text data into numerical data so that the model can process it.
3) Sentence-level embedding: semantically encode the sentence into a one-dimensional vector.
4) Multi-label classification: compute, on the basis of the sentence vector, the probability of each category appearing in the current sentence.
5) Obtain the list of present relations: apply a threshold to the probability data output by the model to determine which relations exist in the sentence.
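The first two stages above can be sketched as follows. The cleaning regex and the toy vocabulary are assumptions for illustration; a real system would use the BERT vocabulary and tokenizer:

```python
import re

def clean_text(text):
    """Stage 1: strip characters that carry no meaning for the model
    (here: keep letters, digits, CJK characters and spaces)."""
    return re.sub(r"[^0-9A-Za-z\u4e00-\u9fff ]", "", text).strip()

def to_ids(text, vocab, unk_id=1):
    """Stage 2: split into tokens and map each token to a numerical ID
    via a dictionary lookup; unknown tokens map to unk_id."""
    return [vocab.get(tok, unk_id) for tok in text.split()]

vocab = {"alice": 2, "works": 3, "at": 4, "acme": 5}
ids = to_ids(clean_text("Alice works at Acme!!!").lower(), vocab)
```

The resulting ID sequence is what the sentence encoding layer consumes in stage 3.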
As an exemplary illustration, this embodiment organizes the relation recognition model with the structure shown in Fig. 3, comprising five layers from bottom to top:
the first layer is the input layer, i.e. the text data after text washing.
The second layer is the numericalization layer, which converts the input text into numerical values. The text is first tokenized into individual tokens, and each token is converted into its corresponding numerical value by a dictionary lookup. Each token also carries a mask flag M_i; since masking is not needed in the current task, M_i is 0 for every token. Finally, S_i marks which sentence each token belongs to; since the current task has only one sentence, S_i is 0 for every token.
The third layer is the sentence encoding layer, which feature-encodes the sentence and outputs a one-dimensional vector. It takes the output of the numericalization layer as input, encodes each token, and, after multi-layer bidirectional Transformer computation, outputs the encoding of the whole sentence at the position corresponding to [CLS].
The fourth layer is the multi-label classification layer, consisting of two parts: Dropout and Dense + sigmoid. Dropout mainly prevents overfitting and improves the model's generalization. The Dense (fully connected) layer reduces the feature dimension so that the output dimension matches the number of relations, and the sigmoid then computes the probability of each relation.
The fifth layer is the output layer, which takes the probability value of each relation output by the model and identifies the high-probability relations by applying a threshold.
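The Dense + sigmoid computation of the fourth layer can be sketched as follows. The weights here are random stand-ins for the learned parameters, and the dimensions are illustrative:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_head(sentence_vec, weights, biases):
    """One logit per relation (Dense), then an independent sigmoid per
    logit, so several relations can receive high probability at once."""
    return [
        sigmoid(sum(w * v for w, v in zip(row, sentence_vec)) + b)
        for row, b in zip(weights, biases)
    ]

random.seed(0)
dim, n_relations = 8, 3
W = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_relations)]
b = [0.0] * n_relations
probs = multilabel_head([0.1] * dim, W, b)
```

Using independent sigmoids rather than a softmax is what makes the model multi-label: the probabilities need not sum to 1, so any number of relations can exceed the threshold.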
In step S230, a machine learning method is used to train the relationship recognition model by using the sample sentences in the training samples in the training sample set as the input of the initialized relationship recognition model and using the label information corresponding to the input sample sentences as the expected output of the initialized relationship recognition model.
Wherein the initialized relationship recognition model can be a multi-label classification model.
Further, the annotation information may further include a starting position of the relationship subject in the sentence in the triplet of each sentence relationship, and a starting position of the relationship object in the triplet of each sentence relationship in the sentence.
The relationship recognition model is obtained by training in a GPU.
Before training the model, the loss function must first be defined. For each sample, the model outputs the probabilities of n relations (n a natural number), and each training sample is annotated with whether each relation appears: a relation that appears is labeled 1, and one that does not is labeled 0. The aim of training is to push the probabilities of relations that should appear toward 1, and the probabilities of those that should not appear toward 0. The loss function of the model is defined below; it represents the loss value over n_sample samples, where the loss function value of the i-th sample is loss_i:
loss = Σ_{i=1}^{n_sample} loss_i

wherein loss is the loss function value over all the samples, n_sample is the number of samples, and loss_i is the loss function value of the i-th sample;
loss_i = (1/n_label) · Σ_{j=1}^{n_label} loss_ij
where n_label is the number of sentence-relation labels;
loss_ij is the loss function value of the jth label of the ith sample:
loss_ij = -y_ij·log(p_ij) - (1 - y_ij)·log(1 - p_ij)
where p_ij is the predicted probability of the jth label of the ith sample;
y_ij is the true value of the jth label of the ith sample.
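The per-label and per-sample losses above can be sketched directly in Python. Averaging over labels and over samples follows the reconstruction of the formulas here and is an assumption where the original equation images are unreadable.

```python
import math

def bce_loss(y_true, y_prob):
    """Multi-label binary cross-entropy matching the loss definition above.

    y_true: y_true[i][j] in {0, 1} -- whether relation j appears in sample i.
    y_prob: y_prob[i][j] -- the model's probability for relation j in sample i.
    """
    n_sample = len(y_true)
    total = 0.0
    for yi, pi in zip(y_true, y_prob):
        n_label = len(yi)
        # loss_ij = -y*log(p) - (1-y)*log(1-p), averaged over labels
        loss_i = sum(-y * math.log(p) - (1 - y) * math.log(1 - p)
                     for y, p in zip(yi, pi)) / n_label
        total += loss_i
    return total / n_sample  # averaged over samples
```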
Because the model has a large number of parameters, the training speed on an ordinary CPU is very low, so a GPU is required for training; the optimizer selected during training is Adam, and the training hyper-parameters include the learning rate and the batch size.
In step S120, a relationship label set is obtained by obtaining a relationship label of a sentence relationship corresponding to an element in the relationship probability array, where the element is greater than a predetermined probability threshold.
In step S130, the relationship labels in the relationship label set and the sentences are input to a pre-trained sequence labeling model, and triples of sentence relationships corresponding to the relationship labels are obtained according to output result information of the sequence labeling model.
The triplet of the sentence relation comprises a relation name, a relation subject and a relation object.
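Steps S110 to S130 amount to a two-stage pipeline. The sketch below shows the control flow with the two models stubbed out as plain callables; the function names and the 0.5 threshold are illustrative assumptions, not part of the embodiment.

```python
def extract_triples(sentence, relation_model, tagging_model, relation_names,
                    threshold=0.5):
    """Two-stage extraction sketch for steps S110-S130.

    relation_model(sentence) -> list of per-relation probabilities (S110)
    tagging_model(label, sentence) -> (subject, object) entity pair (S130)
    """
    probs = relation_model(sentence)
    # S120: relation labels whose probability exceeds the threshold
    labels = [relation_names[i] for i, p in enumerate(probs) if p > threshold]
    triples = []
    for label in labels:
        # S130: run the sequence tagging model once per recognized relation
        subj, obj = tagging_model(label, sentence)
        triples.append((label, subj, obj))  # (relation name, subject, object)
    return triples
```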
It should be noted that the sequence tagging model described in this embodiment can be obtained with various training methods, as long as the model can output the triples of the relationships in a sentence after the original sentence and the relationship labels of the relationships it contains are input. As an example, fig. 4 is a flowchart illustrating a training method of a sequence annotation model according to an embodiment of the present invention; as shown in fig. 4, the sequence annotation model can be obtained by training as follows:
in step S410, a training sample set is obtained, where the training sample includes a relationship label, a sentence, and label information for representing a triple including a sentence relationship corresponding to the relationship label in the sentence.
This step needs to extract the entity pair corresponding to each relationship in the sentence, namely the relationship subject (subject) and the relationship object (object), using the output result of the relationship recognition model in step S110. In this embodiment, a sequence labeling manner is adopted to extract the entity pair of each relationship: each token in the sentence is labeled according to whether it belongs to the relationship subject, the relationship object, or another category, and the relationship subject and relationship object are then obtained from the labeling result.
For an example Chinese sentence and a relationship found at the time of relationship identification, the output tags are assigned character by character; for the example in the original figure, the tag sequence is:

O B-SUB I-SUB I-SUB I-SUB I-SUB I-SUB O O O B-OBJ I-OBJ I-OBJ CATE CATE

where the characters tagged B-SUB/I-SUB form the relationship subject, the characters tagged B-OBJ/I-OBJ form the relationship object, and the CATE tags mark the relationship name appended to the sentence.
Before entity pair extraction is performed using sequence labeling, the tagging categories need to be defined; through a number of experiments, the following tagging scheme is designed in this embodiment:
B-SUB,I-SUB,B-OBJ,I-OBJ,O,PAD,CATE,CLS,SEP。
SUB: corresponds to the relationship subject (subject); B-SUB represents the starting position of the subject, and I-SUB represents the middle positions of the subject.
OBJ: corresponds to the relationship object (object); B-OBJ represents the starting position of the object, and I-OBJ represents the middle positions of the object.
O: marks characters that belong to neither the subject nor the object.
PAD: when the sentence length is less than the designated length, the sentence is padded, and PAD is used to mark the padded part of the sentence.
CATE: since it is necessary to specify which relationship's entity pair is to be extracted, the relationship name is appended after the sentence, and this part needs a special mark.
CLS: in Bert, a [CLS] symbol is inserted at the beginning of the sentence, which also needs a special mark.
SEP: in Bert, a [SEP] symbol is inserted at the end of the sentence, which also needs a special mark.
After the label categories are defined, the extraction process of the entity pairs can be entered next, and the overall process is as follows.
1) In order to extract the entity pair of a specified relationship, the model needs to sense which relationship's entity pair is currently being extracted, so the relationship text needs to be included with the sentence; in this embodiment, the relationship text is directly concatenated after the sentence. After concatenation, the text is cleaned; text shorter than the specified length is then padded, and text longer than the specified length is truncated, taking care not to remove the relationship text during truncation.
2) Text vectorization, which aims to turn text data into numerical data: the text is first segmented into a token list, and the token list is then converted into a numerical list by looking up a dictionary.
3) Relation and sentence joint coding, which mainly performs semantic coding on the sentence after the relation has been concatenated, encoding the whole sentence into a one-dimensional vector through the pre-trained model Bert.
4) Multi-classification and sequence labeling: two tasks are executed in this step, namely relationship classification and sequence labeling. The purpose of the relationship classification is to make the model notice which relationship's entity pair is currently being extracted; if the result of the relationship classification is inconsistent with the relationship text concatenated after the sentence, the extraction of the entity pair has failed. The sequence labeling task assigns a category to each token, judging which of the 9 tagging categories specified above it belongs to.
5) Entity pair extraction: the relationship subject and relationship object are extracted from the text according to the sequence labeling result.
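Step 5 can be sketched as follows, assuming the B-SUB/I-SUB/B-OBJ/I-OBJ/O scheme defined above; character-level tokens are joined without spaces, as for Chinese text.

```python
def extract_pair(tokens, tags):
    """Recover the relationship subject and object from the sequence-labeling
    output, using the B-SUB/I-SUB/B-OBJ/I-OBJ/O scheme defined above.
    """
    def span(prefix):
        words, active = [], False
        for tok, tag in zip(tokens, tags):
            if tag == "B-" + prefix:          # span starts at a B- tag
                words, active = [tok], True
            elif tag == "I-" + prefix and active:
                words.append(tok)             # continue the current span
            else:
                active = False                # any other tag closes the span
        return "".join(words)                 # join characters without spaces
    return span("SUB"), span("OBJ")
```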
In step S420, an initialized sequence annotation model is determined, where the initialized sequence annotation model includes a target layer for outputting a triplet including a sentence relationship corresponding to a relationship tag in a sentence.
Illustratively, the present embodiment organizes the sequence notation model with the structure shown in fig. 5, which includes five layers from bottom to top:
The first layer is an input layer, which refers to the sentence after the relational text has been concatenated and after data cleaning and other operations have been performed.
The second layer is a numericalization layer, which performs numerical operations on the input sentence and relation. First the whole text is tokenized, i.e. cut into tokens; note that a [SEP] needs to be inserted between the sentence and the relation to distinguish them, a [CLS] is inserted before the whole, and a [SEP] is appended after it. The tokens are then converted into corresponding numerical values by looking up a dictionary. Next, M_i marks whether each token is masked; since no token needs to be masked in the current task, all M_i values are 0. Finally, S_i marks whether each token belongs to the sentence or to the relational text: S_i is 0 for the sentence part and 1 for the relational-text part.
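A minimal sketch of the numericalization layer, assuming a plain dictionary vocabulary; the 0/1 values of M_i and S_i follow the description above, while the assignment of the [CLS] and first [SEP] symbols to segment 0 is an assumption.

```python
def numericalize(sentence_tokens, relation_tokens, vocab):
    """Second-layer sketch: [CLS] sentence [SEP] relation [SEP], dictionary
    lookup, mask values M_i (all 0 here) and segment values S_i (0/1)."""
    tokens = ["[CLS]"] + sentence_tokens + ["[SEP]"] + relation_tokens + ["[SEP]"]
    ids = [vocab.get(t, vocab.get("[UNK]", 0)) for t in tokens]  # dictionary lookup
    m = [0] * len(tokens)  # M_i: nothing is masked in this task
    # S_i: sentence part (with [CLS] and first [SEP]) is 0, relation part is 1
    s = [0] * (len(sentence_tokens) + 2) + [1] * (len(relation_tokens) + 1)
    return ids, m, s
```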
The third layer is a joint coding layer, which mainly performs feature coding on the sentence and the relational text and outputs a one-dimensional vector. This layer takes the output of the numericalization layer as input and encodes each token; after multi-layer bidirectional Transformer calculation, the encoded information of the entire sentence is output at the position corresponding to [CLS], and each token also outputs its corresponding encoded information.
The fourth layer is a multitask layer, composed of two tasks: relationship classification and sequence labeling. The relationship classification uses the encoding output at the [CLS] position, i.e. the encoding of the first token, as the representation of the whole sentence and relation; it then passes through a Dropout layer and a Dense + softmax layer whose output dimension is consistent with the number of relations, representing the probability of each relation, from which the currently concerned relation category is determined. Softmax is used as the activation function here because this is a multi-class task in which only one category is most probable. For sequence labeling, the encoded information of each token is acquired and the category of each token is judged: each token's encoding passes through Dropout and Dense + softmax, and the output dimension is the number of tagging categories, representing the probability of each tag.
The fifth layer is the output layer, which contains two parts: the relation category and the tag of each token. The relation category is judged according to the probability values output by the relationship classification, taking the relation with the highest probability as the result. The tag of each token is likewise taken as the tag with the highest probability; after the tags are determined, the texts tagged as relationship subject and relationship object are extracted respectively.
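The output layer's two decisions are both argmax selections, which can be sketched as:

```python
def decode_outputs(relation_probs, token_tag_probs, relation_names, tag_names):
    """Fifth-layer decoding sketch: pick the single most probable relation
    (softmax head) and the most probable tag for every token."""
    # relation category: highest probability from the classification head
    relation = relation_names[max(range(len(relation_probs)),
                                  key=relation_probs.__getitem__)]
    # per-token tag: highest probability from the sequence-labeling head
    tags = [tag_names[max(range(len(p)), key=p.__getitem__)]
            for p in token_tag_probs]
    return relation, tags
```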
In step S430, using a machine learning method, using the relationship labels and sentences in the training samples in the training sample set as inputs of the initialized sequence tagging model, and using tagging information corresponding to the input relationship labels and sentences as expected outputs of the initialized sequence tagging model, and training to obtain the sequence tagging model.
Because entity extraction comprises the two tasks of relationship classification and sequence labeling, the loss function also comprises two parts, namely a loss value part for the relationship classification and a loss value part for the sequence labeling.
The loss value loss_class of the relationship classification is defined as follows:
loss_class = -Σ_{i=1}^{n_relation} y_i·log(p_i)
where loss_class refers to the loss value of the relationship classification;
n_relation refers to the number of sentence relationships;
y_i refers to whether the ith relationship is present, 1 represents present, 0 represents absent;
p_i refers to the probability that the ith relationship appears in the current sentence.
The loss function loss_seq of the sequence labeling is defined as
loss_seq = -Σ_{i=1}^{n_token} Σ_{j=1}^{n_label} y_ij·log(p_ij)
where n_token refers to the number of tokens in the sentence, n_label refers to the number of tagging categories, y_ij refers to whether the jth tag of the ith token appears in the sentence, 1 represents appearance, 0 represents non-appearance, and p_ij refers to the probability that the ith token has the jth tag.
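Under the loss definitions above, the combined training loss can be sketched as follows; summing the two parts without weighting is an assumption, as the embodiment does not state how they are combined.

```python
import math

def total_loss(y_rel, p_rel, y_tags, p_tags):
    """Combined loss sketch: cross-entropy for the relation-classification head
    plus cross-entropy summed over tokens for the sequence-labeling head.

    y_rel/p_rel:   per-relation one-hot labels and softmax probabilities.
    y_tags/p_tags: per-token one-hot tag labels and softmax probabilities.
    """
    loss_class = -sum(y * math.log(p) for y, p in zip(y_rel, p_rel) if y)
    loss_seq = -sum(y * math.log(p)
                    for yi, pi in zip(y_tags, p_tags)
                    for y, p in zip(yi, pi) if y)
    return loss_class + loss_seq  # unweighted sum of the two parts (assumption)
```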
Because the model has a large number of parameters, the training speed on an ordinary CPU is very low, so a GPU is required for training; the optimizer selected during training is preferably Adam, and the training hyper-parameters include the learning rate, the batch size, the maximum number of training steps, an automatic training-stop strategy, and the like.
This embodiment is also supervised entity relationship extraction, but differs from existing pipeline learning and multitask learning methods: it uses two deep learning models to extract entity relationships, first using a relationship recognition model to recognize the relationships existing in a sentence, and then applying a sequence labeling model to each relationship to extract its relationship subject and object. The first advantage of this method is good performance: according to experimental statistics, the accuracy of relationship recognition in extracting sentence triples with the method of this embodiment can reach 96%, and the accuracy of entity pair extraction can reach 90%. The second advantage is that both models used are easy to train and can learn the information in the data set quickly.
As an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for extracting entity relationships, and fig. 6 illustrates a schematic structural diagram of the apparatus for extracting entity relationships provided in this embodiment, where the embodiment of the apparatus corresponds to the embodiments of the methods shown in fig. 1 to fig. 5, and the apparatus may be specifically applied to various electronic devices. As shown in fig. 6, the apparatus for extracting entity relationships in this embodiment includes a relationship identifying unit 610, a relationship tag obtaining unit 620, and a triple obtaining unit 630.
The relation recognition unit 610 is configured to input a sentence into a pre-trained relation recognition model, and obtain a relation probability array according to result information output by the relation recognition model, where an ith element of the relation probability array represents a probability that an ith sentence relation exists in the sentence, where i is a natural number.
The relationship label obtaining unit 620 is configured to obtain a relationship label set of the sentence relationship corresponding to the element in the relationship probability array, which is greater than the predetermined probability threshold.
The triplet obtaining unit 630 is configured to input each relationship tag in the relationship tag set together with the sentence into a pre-trained sequence tagging model, and obtain a triplet of the sentence relationship corresponding to each relationship tag according to output result information of the sequence tagging model, where the triplet of the sentence relationship includes a relationship name, a relationship subject, and a relationship object.
According to one or more embodiments of the present disclosure, the relationship recognition model is obtained by training through the following modules: a first sample acquisition module 710, a first model determination module 720, and a first model training module 730.
The first sample obtaining module 710 is configured to obtain a training sample set, where a training sample includes a sample sentence and annotation information of a triplet used for representing at least one sentence relationship included in the sample sentence.
The first model determination module 720 is configured for determining an initialized relationship recognition model, wherein the initialized relationship recognition model comprises a target layer for outputting a relationship probability array formed by probabilities of each predetermined sentence relationship included in the sentence.
The first model training module 730 is configured to train, by using a machine learning method, a sample sentence in a training sample in the training sample set as an input of the initialized relationship recognition model, and label information corresponding to the input sample sentence as an expected output of the initialized relationship recognition model, so as to obtain the relationship recognition model.
According to one or more embodiments of the present disclosure, the initialized relationship recognition model is a multi-label classification model.
According to one or more embodiments of the present disclosure, the annotation information further includes a starting position of the relationship subject in the sentence in the triplet of each sentence relationship, and a starting position of the relationship object in the triplet of each sentence relationship in the sentence.
According to one or more embodiments of the present disclosure, the relationship recognition model is obtained by training in a GPU.
According to one or more embodiments of the present disclosure, the sequence labeling model is obtained by training through the following modules: a second sample acquisition module 810, a second model determination module 820, and a second model training module 830.
The second sample obtaining module 810 is configured to obtain a training sample set, where a training sample includes a relationship label, a sentence, and label information for representing a triple including a sentence relationship corresponding to the relationship label in the sentence.
The second model determining module 820 is configured for determining an initialized sequence annotation model, wherein the initialized sequence annotation model comprises a target layer for outputting a triplet containing the sentence relation corresponding to a relation tag in the sentence.
The second model training module 830 is configured to, by using a machine learning method, take the relationship labels and sentences in the training samples in the training sample set as inputs of the initialized sequence tagging model and the tagging information corresponding to the input relationship labels and sentences as expected outputs of the initialized sequence tagging model, and train to obtain the sequence tagging model.
According to one or more embodiments of the present disclosure, the sequence annotation model is obtained by training in a GPU.
The apparatus for extracting entity relationships provided in this embodiment can execute the method for extracting entity relationships provided in the method embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the execution method.
Referring now to FIG. 9, shown is a schematic diagram of an electronic device 900 suitable for use in implementing embodiments of the present invention. The terminal device in the embodiments of the present invention may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 9, the electronic device 900 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 901 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage means 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are also stored. The processing apparatus 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication device 909 may allow the electronic apparatus 900 to perform wireless or wired communication with other apparatuses to exchange data. While fig. 9 illustrates an electronic device 900 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as a computer software program. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 909, or installed from the storage device 908, or installed from the ROM 902. The computer program, when executed by the processing apparatus 901, performs the above-described functions defined in the methods of the embodiments of the present invention.
It should be noted that the computer readable medium mentioned above can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In yet another embodiment of the invention, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: inputting sentences into a pre-trained relation recognition model, and obtaining a relation probability array according to result information output by the relation recognition model, wherein the ith element of the relation probability array represents the probability that the ith sentence relation exists in the sentences, and i is a natural number; obtaining a relation label of a sentence relation corresponding to an element which is greater than a preset probability threshold value in the relation probability array to obtain a relation label set; and respectively inputting each relationship label in the relationship label set together with the sentence into a pre-trained sequence labeling model, and respectively obtaining a triplet of the sentence relationship corresponding to each relationship label according to the output result information of the sequence labeling model, wherein the triplet of the sentence relationship comprises a relationship name, a relationship subject and a relationship object.
Computer program code for carrying out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The foregoing description is only a preferred embodiment of the invention and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure in the embodiments of the present invention is not limited to the specific combinations of the above-described features, but also encompasses other embodiments in which any combination of the above-described features or their equivalents is possible without departing from the spirit of the disclosure. For example, the above features and (but not limited to) the features with similar functions disclosed in the embodiments of the present invention are mutually replaced to form the technical solution.

Claims (10)

1. A method for extracting entity relationships, comprising:
inputting sentences into a pre-trained relation recognition model, and obtaining a relation probability array according to result information output by the relation recognition model, wherein the ith element of the relation probability array represents the probability that the ith sentence relation exists in the sentences, and i is a natural number;
obtaining a relation label of a sentence relation corresponding to an element which is greater than a preset probability threshold value in the relation probability array to obtain a relation label set;
and respectively inputting each relationship label in the relationship label set together with the sentence into a pre-trained sequence labeling model, and respectively obtaining a triplet of the sentence relationship corresponding to each relationship label according to the output result information of the sequence labeling model, wherein the triplet of the sentence relationship comprises a relationship name, a relationship subject and a relationship object.
2. The method of claim 1, wherein the relationship recognition model is trained by:
acquiring a training sample set, wherein the training sample comprises a sample sentence and label information of a triple used for representing at least one sentence relation contained in the sample sentence;
determining an initialized relationship recognition model, wherein the initialized relationship recognition model comprises a target layer for outputting a relationship probability array formed by probabilities of each preset sentence relationship contained in the sentence;
and training to obtain the relation recognition model by using a machine learning method and taking the sample sentences in the training samples in the training sample set as the input of the initialized relation recognition model and taking the label information corresponding to the input sample sentences as the expected output of the initialized relation recognition model.
3. The method of claim 2, wherein the initialized relationship recognition model is a multi-label classification model.
4. The method of claim 2, wherein the annotation information further comprises a starting position of the relationship subject in the sentence in the triplet of each sentence relationship and a starting position of the relationship object in the triplet of each sentence relationship in the sentence.
5. The method of claim 2, wherein the relationship recognition model is trained in a GPU.
6. The method of claim 1, wherein the sequence annotation model is trained by:
acquiring a training sample set, wherein the training sample comprises a relation label, a sentence and label information used for representing a triple including a sentence relation corresponding to the relation label in the sentence;
determining an initialized sequence annotation model, wherein the initialized sequence annotation model comprises a target layer for outputting a triplet containing a sentence relation corresponding to a relation tag in a sentence;
and training to obtain the sequence labeling model by using a machine learning method and taking the relation labels and sentences in the training samples in the training sample set as the input of the initialized sequence labeling model and taking the labeling information corresponding to the input relation labels and sentences as the expected output of the initialized sequence labeling model.
7. The method of claim 6, wherein the sequence annotation model is obtained by training in a GPU.
8. An apparatus for extracting entity relationships, comprising:
the system comprises a relation recognition unit, a relation recognition unit and a relation recognition unit, wherein the relation recognition unit is used for inputting sentences into a pre-trained relation recognition model and obtaining a relation probability array according to result information output by the relation recognition model, the ith element of the relation probability array represents the probability that the ith sentence relation exists in the sentences, and i is a natural number;
a relation label obtaining unit, configured to obtain a relation label set of a sentence relation corresponding to an element in the relation probability array that is greater than a predetermined probability threshold;
and the triple obtaining unit is used for respectively inputting each relation label in the relation label set and the sentence into a pre-trained sequence labeling model together, and respectively obtaining a triple of the sentence relation corresponding to each relation label according to the output result information of the sequence labeling model, wherein the triple of the sentence relation comprises a relation name, a relation subject and a relation object.
9. An electronic device, comprising:
one or more processors; and
a memory storing executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any one of claims 1-7.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
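The two-stage pipeline claimed above (relation recognition, probability thresholding, then per-relation sequence labeling) can be sketched as follows. This is an illustrative outline only: the function and parameter names are hypothetical, and the two models are stubbed as callables so that only the control flow described in the claims is shown.

```python
# Hypothetical sketch of the claimed extraction pipeline: a relation
# recognition model yields one probability per candidate sentence relation;
# relations whose probability exceeds a threshold are each paired with the
# sentence and fed to a sequence labeling model that emits a
# (relation name, relation subject, relation object) triple.
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (relation name, relation subject, relation object)

def extract_triples(
    sentence: str,
    relation_labels: List[str],
    relation_model: Callable[[str], List[float]],
    labeling_model: Callable[[str, str], Triple],
    threshold: float = 0.5,
) -> List[Triple]:
    # i-th element: probability that the i-th sentence relation exists
    probs = relation_model(sentence)
    kept = [label for label, p in zip(relation_labels, probs) if p > threshold]
    # each retained relation label is input together with the sentence
    return [labeling_model(label, sentence) for label in kept]
```

With stub models this reduces the array-thresholding step of claims 2 and 8 to a single list comprehension; real deployments would wrap trained neural models behind the two callables.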
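The training samples of claim 6 pair a (relation label, sentence) input with annotation information marking the triple. One common way to encode such annotations is BIO tagging of the subject and object spans; the tag scheme below is an assumption for illustration, since the claim only requires that the annotation represent the triple for the given relation label.

```python
# Hypothetical example of a claim-6 training sample and of decoding its
# annotation back into a (relation name, subject, object) triple.
sentence = "Alice founded Acme in 2001".split()
relation_label = "founder"
# B-SUB marks the relation subject span, B-OBJ the relation object span
tags = ["B-SUB", "O", "B-OBJ", "O", "O"]

def decode_triple(relation, tokens, tags):
    """Recover (relation name, relation subject, relation object) from BIO tags."""
    def span(suffix):
        return " ".join(t for t, g in zip(tokens, tags) if g.endswith(suffix))
    return (relation, span("SUB"), span("OBJ"))
```

During training, the target layer of the sequence labeling model would be fit to emit exactly these tags for the given relation label and sentence.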
CN202011635963.0A 2020-12-31 2020-12-31 Method, device, electronic equipment and storage medium for extracting entity relationship Pending CN112613306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011635963.0A CN112613306A (en) 2020-12-31 2020-12-31 Method, device, electronic equipment and storage medium for extracting entity relationship

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011635963.0A CN112613306A (en) 2020-12-31 2020-12-31 Method, device, electronic equipment and storage medium for extracting entity relationship

Publications (1)

Publication Number Publication Date
CN112613306A true CN112613306A (en) 2021-04-06

Family

ID=75253109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011635963.0A Pending CN112613306A (en) 2020-12-31 2020-12-31 Method, device, electronic equipment and storage medium for extracting entity relationship

Country Status (1)

Country Link
CN (1) CN112613306A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472033A (en) * 2018-11-19 2019-03-15 华南师范大学 Entity relation extraction method and system in text, storage medium, electronic equipment
CN110334355A (en) * 2019-07-15 2019-10-15 苏州大学 A kind of Relation extraction method, system and associated component
CN110427623A (en) * 2019-07-24 2019-11-08 深圳追一科技有限公司 Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN110502749A (en) * 2019-08-02 2019-11-26 中国电子科技集团公司第二十八研究所 A kind of text Relation extraction method based on the double-deck attention mechanism Yu two-way GRU
CN110781683A (en) * 2019-11-04 2020-02-11 河海大学 Entity relation joint extraction method
CN111144120A (en) * 2019-12-27 2020-05-12 北京知道创宇信息技术股份有限公司 Training sentence acquisition method and device, storage medium and electronic equipment
CN111191461A (en) * 2019-06-06 2020-05-22 北京理工大学 Remote supervision relation extraction method based on course learning
CN111695335A (en) * 2020-05-28 2020-09-22 平安科技(深圳)有限公司 Intelligent interviewing method and device and terminal equipment

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157866A (en) * 2021-04-27 2021-07-23 平安科技(深圳)有限公司 Data analysis method and device, computer equipment and storage medium
CN113157866B (en) * 2021-04-27 2024-05-14 平安科技(深圳)有限公司 Data analysis method, device, computer equipment and storage medium
CN113254628A (en) * 2021-05-18 2021-08-13 北京中科智加科技有限公司 Event relation determining method and device
CN113298160A (en) * 2021-05-28 2021-08-24 深圳数联天下智能科技有限公司 Triple verification method, apparatus, device and medium
CN113468330A (en) * 2021-07-06 2021-10-01 北京有竹居网络技术有限公司 Information acquisition method, device, equipment and medium
WO2023280106A1 (en) * 2021-07-06 2023-01-12 北京有竹居网络技术有限公司 Information acquisition method and apparatus, device, and medium
CN113468330B (en) * 2021-07-06 2023-04-28 北京有竹居网络技术有限公司 Information acquisition method, device, equipment and medium
CN113761893A (en) * 2021-11-11 2021-12-07 深圳航天科创实业有限公司 Relation extraction method based on mode pre-training

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN110287278B (en) Comment generation method, comment generation device, server and storage medium
CN107679039B (en) Method and device for determining statement intention
CN108984683B (en) Method, system, equipment and storage medium for extracting structured data
CN112613306A (en) Method, device, electronic equipment and storage medium for extracting entity relationship
CN109034203B (en) Method, device, equipment and medium for training expression recommendation model and recommending expression
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN113064964A (en) Text classification method, model training method, device, equipment and storage medium
CN114385780B (en) Program interface information recommendation method and device, electronic equipment and readable medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN112188311A (en) Method and apparatus for determining video material of news
CN113486178A (en) Text recognition model training method, text recognition device and medium
CN113297525B (en) Webpage classification method, device, electronic equipment and storage medium
CN112599211A (en) Medical entity relationship extraction method and device
CN116306663B (en) Semantic role labeling method, device, equipment and medium
CN113111167A (en) Method and device for extracting vehicle model of alarm receiving and processing text based on deep learning model
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN110852103A (en) Named entity identification method and device
CN115115432B (en) Product information recommendation method and device based on artificial intelligence
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN115129885A (en) Entity chain pointing method, device, equipment and storage medium
CN113434695A (en) Financial event extraction method and device, electronic equipment and storage medium
CN113535946A (en) Text identification method, device and equipment based on deep learning and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination