CN113505229A - Entity relationship extraction model training method and device - Google Patents
- Publication number
- CN113505229A (application number CN202111057292.9A)
- Authority
- CN
- China
- Prior art keywords
- entity
- entity relationship
- corpus
- relationship
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
Abstract
The embodiments of the disclosure provide an entity relationship extraction model training method and device. The method comprises the following steps: acquiring the entities and entity labels of a training corpus; jointly extracting the entity relationships in the training corpus through multiple entity relationship extraction modes; generating one or more joint labels for the training corpus according to the entity labels and the entity relationships; and training an initial entity relationship extraction model according to the training corpus and the one or more joint labels to obtain a target entity relationship extraction model. In this way, joint labels built from entity relationships that were extracted by different extraction modes, and that can verify and supplement one another, allow the entity relationship extraction model to be trained more accurately and comprehensively, and the trained model can then extract the entity relationships in a corpus more efficiently and accurately.
Description
Technical Field
The present disclosure relates to the field of information technology, and in particular to entity relationship extraction model training methods, apparatuses, devices, and computer-readable storage media.
Background
At present, extracting effective information from a corpus often requires extracting the entity relationships it contains. Current entity relationship extraction methods fall mainly into three types:
The first is mainly unsupervised automatic extraction (Auto Extraction), which, without predetermined relationship labels, automatically extracts words or phrases that can describe the corresponding relationship from text according to its syntactic or semantic structure. This extraction method depends heavily on the quality of the initial seeds and the corpus, and low-frequency entity pairs must be screened manually, which is very laborious, so it is rarely used;
The second is mainly supervised relationship classification, which treats relationship extraction as a classification task: a limited number of relationship labels are predefined, the corpora are labeled manually, and a classification model is then trained to extract the relationships. This method depends heavily on the quality and quantity of the labeled corpora, yet labeled corpora remain scarce in reality, and for large bodies of data such as military intelligence the relationships are difficult to obtain, so the entity relationships such a classification model can classify are naturally very limited;
the method is mainly characterized in that a large number of unmarked corpora are aligned with a knowledge base formed by a large number of entity pairs and entity relations to determine the entity relations in the unmarked corpora, but the knowledge base is largely lost at the present stage, so that the number of corpora capable of realizing entity alignment is too small, the relation extraction training of the entity pairs is insufficient, and the performance of the whole entity relation extraction model is influenced.
Therefore, how to combine the strengths and weaknesses of these different entity relationship extraction methods to obtain a more effective entity relationship extraction model, so as to extract the entity relationships in a corpus more efficiently and accurately, has become an urgent problem to be solved.
Disclosure of Invention
The disclosure provides an entity relationship extraction model training method, an entity relationship extraction model training device, entity relationship extraction equipment and a storage medium.
According to a first aspect of the disclosure, an entity relationship extraction model training method is provided. The method comprises the following steps:
acquiring an entity and an entity label of a training corpus;
extracting entity relations in the training corpus jointly through a plurality of entity relation extraction modes;
generating one or more joint labels of the training corpus according to the entity labels and the entity relation;
and training an initial entity relationship extraction model according to the training corpus and the one or more joint labels to obtain a target entity relationship extraction model.
The above-mentioned aspects and any possible implementation manners further provide an implementation manner, where the jointly extracting entity relationships in the corpus by multiple entity relationship extraction manners includes:
aligning the entity of the corpus with a corpus knowledge base to determine a first entity relationship in the corpus;
classifying the entity relationship in the corpus by using an entity relationship classifier to determine a second entity relationship in the corpus;
and if the first entity relationship is matched with the second entity relationship, determining the entity relationship in the training corpus according to the first entity relationship and the second entity relationship.
The foregoing aspect and any possible implementation manner further provide an implementation manner, where the classifying, by an entity relationship classifier, an entity relationship in the corpus to determine a second entity relationship in the corpus includes:
vectorizing the entities in the training corpus to obtain the feature vectors of the entities in the training corpus;
and inputting the characteristic vectors of the entities in the training corpus into the entity relationship classifier to perform entity relationship classification so as to determine the second entity relationship.
The above-described aspects and any possible implementations further provide an implementation in which the entity relationship classifier includes a plurality of SVM classifiers;
inputting the feature vectors of the entities in the corpus into the entity relationship classifier for entity relationship classification to determine the second entity relationship, including:
sequentially and respectively inputting the feature vectors of the entities in the training corpus into a plurality of SVM classifiers for entity relationship classification, stopping classification until the probability of the classified entity relationship is greater than a preset probability, and determining the entity relationship greater than the preset probability as the second entity relationship;
or
Determining the character category to which the entity in the training corpus belongs;
selecting a corresponding classifier from the plurality of SVM classifiers according to the character category;
and inputting the characteristic vectors of the entities in the training corpus into the corresponding classifier to perform entity relationship classification so as to determine the second entity relationship.
As to the foregoing aspect and any possible implementation manner, there is further provided an implementation manner, where if the first entity relationship matches the second entity relationship, determining an entity relationship in the corpus according to the first entity relationship and the second entity relationship, including:
and if the approximation degree of the first entity relationship and the second entity relationship reaches a preset approximation degree, determining at least one of the first entity relationship and the second entity relationship as the entity relationship in the training corpus.
The above-mentioned aspects and any possible implementation manners further provide an implementation manner, where the jointly extracting entity relationships in the corpus by multiple entity relationship extraction manners further includes:
and if the entity of the corpus cannot be aligned with the corpus knowledge base, determining the second entity relationship as the entity relationship in the corpus.
The foregoing aspect and any possible implementation manner further provide an implementation manner, where generating one or more joint tags of the corpus according to the entity tag and the entity relationship includes:
determining the relative position of the entity in the corpus when the entity relationship in the corpus is extracted;
and generating the joint label according to the entity label, the entity relation and the relative position.
The foregoing aspect and any possible implementation manner further provide an implementation manner, where the obtaining an entity label of a corpus includes:
acquiring word vectors or character vectors of the training corpus;
and inputting the word vector or the character vector into a pre-trained sequence labeling model to determine a target label sequence of the training corpus, wherein the target label sequence is formed by labels of all entities in the training corpus.
The foregoing aspects and any possible implementations further provide an implementation, where the inputting the word vector or the character vector into a pre-trained sequence tagging model to determine a target tag sequence of the corpus includes:
inputting the word vector or the character vector to a BiLSTM layer of a sequence labeling model to obtain the scores of all labels assigned to each word in the training corpus;
inputting the scores of all labels assigned to each word in the training corpus into a CRF layer of the sequence labeling model to obtain at least one label sequence in the training corpus and its corresponding probability;
and outputting the corresponding label sequence with the highest probability in the at least one label sequence as the target label sequence.
According to a second aspect of the present disclosure, an entity relationship extraction model training apparatus is provided. The device includes:
the acquisition module is used for acquiring the entity and the entity label of the training corpus;
the extraction module is used for jointly extracting the entity relations in the training corpus through a plurality of entity relation extraction modes;
a generating module, configured to generate one or more joint tags of the corpus according to the entity tag and the entity relationship;
and the training module is used for training the initial entity relationship extraction model according to the training corpus and the one or more joint labels to obtain a target entity relationship extraction model.
According to a third aspect of the present disclosure, an electronic device is provided. The electronic device includes: a memory having a computer program stored thereon and a processor implementing the method as described above when executing the program.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as according to the first and/or second aspects of the present disclosure.
It should be understood that the statements herein reciting aspects are not intended to limit the critical or essential features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. The accompanying drawings are included to provide a further understanding of the present disclosure, and are not intended to limit the disclosure thereto, and the same or similar reference numerals will be used to indicate the same or similar elements, where:
FIG. 1 shows a flow diagram of a method of entity relationship extraction model training in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow diagram of another entity relationship extraction model training method in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of the working principle of obtaining an entity tag according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating entity relationships extracted by entity-aligning an entity of a corpus with a corpus knowledge base, according to an embodiment of the present disclosure;
FIG. 5 is a diagram illustrating extraction of entity relationships in a corpus using SVM models according to an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of the results of federated tags and entity relationship extraction model output, in accordance with an embodiment of the present disclosure;
FIG. 7 shows a block diagram of an entity relationship extraction model training apparatus, according to an embodiment of the present disclosure;
FIG. 8 shows a block diagram of an electronic device for implementing the entity relationship extraction model training method of the embodiments of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
In addition, the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
In the disclosure, the entity relationships in a corpus can be extracted jointly using different entity relationship extraction modes, so that the relationships extracted by the different modes verify and supplement one another. The joint labels formed from these mutually verified and supplemented entity relationships train the entity relationship extraction model more accurately and comprehensively, and the trained model can then extract the entity relationships in the corpus more efficiently and accurately.
FIG. 1 shows a flow diagram of an entity relationship extraction model training method 100 according to an embodiment of the disclosure. The method 100 includes:
The entity label may be a BIESO label for an entity, where S indicates that the entity contains only one word, B, I, and E indicate the beginning, interior, and end of a multi-word entity, respectively, and O indicates a non-entity word.
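The BIESO scheme can be sketched as follows (a minimal illustration with a made-up English example sentence, not the disclosure's implementation):

```python
# BIESO entity labeling sketch: each token gets S (single-token entity),
# B/I/E (begin/inside/end of a multi-token entity), or O (non-entity).
def bieso_tags(tokens, entities):
    """entities: list of (start, end) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in entities:
        if end - start == 1:
            tags[start] = "S"          # entity contains only one word
        else:
            tags[start] = "B"          # beginning of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "I"          # interior of the entity
            tags[end - 1] = "E"        # end of the entity
    return tags

tokens = ["Beijing", "is", "the", "capital", "of", "China"]
# "Beijing" and "China" are single-token entities in this example.
print(bieso_tags(tokens, [(0, 1), (5, 6)]))  # ['S', 'O', 'O', 'O', 'O', 'S']
```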
The multiple entity relationship extraction modes may be the unsupervised automatic extraction mode, the supervised relationship classification mode, and the remote-supervision-based entity relationship extraction method, in any combination of two of them or of all three.
Entity relationships are the relationships between entities in a corpus. For example, if the entities are persons, the entity relationship may be any of various interpersonal relationships, such as a teacher-student relationship or a sister relationship; likewise, if the entities are place names, the entity relationship may be any of various positional relationships.
Extracting the entity relationships in the corpus jointly through multiple extraction modes lets the extracted relationships verify and supplement one another, which ensures their accuracy while exploiting the strengths and offsetting the weaknesses of the different modes. Entity relationships can therefore be extracted from more corpora, and gaps in extraction are avoided as far as possible, i.e., the situation where some entity relationships cannot be extracted because a single extraction mode is insufficiently trained. This makes the training of the entity relationship extraction model more comprehensive and accurate, and helps the trained model extract the entity relationships in a corpus more efficiently and accurately.
In the actual testing (application) stage of entity relationship extraction, the test corpus is input directly into the target entity relationship extraction model, and the entity relationship triples in the test corpus can be obtained efficiently and accurately.
In some embodiments, the jointly extracting entity relationships in the corpus by multiple entity relationship extraction methods includes:
aligning the entity of the corpus with a corpus knowledge base to determine a first entity relationship in the corpus;
classifying the entity relationship in the corpus by using an entity relationship classifier to determine a second entity relationship in the corpus;
That is, the multiple entity relationship extraction manners of this embodiment may be the remote-supervision-based entity relationship extraction method (for extracting the first entity relationship) and the mainly supervised relationship classification manner (for extracting the second entity relationship).
And if the first entity relationship is matched with the second entity relationship, determining the entity relationship in the training corpus according to the first entity relationship and the second entity relationship.
After the first entity relationship and the second entity relationship for the training corpus are obtained through the different extraction modes, the two can be matched, so that the relationships extracted by different modes verify each other. If the first entity relationship matches the second entity relationship, the two extraction modes agree, and the entity relationship in the training corpus can be determined accurately from the first and second entity relationships; the joint extraction thus ensures the accuracy of the extracted entity relationships.
In addition, semantic and context analysis can be performed on the entities in the training corpus, i.e., the mainly unsupervised automatic extraction mode can also extract the entity relationship in the training corpus as a third entity relationship. The third entity relationship is then compared for similarity with at least one of the first and second entity relationships, so that entity relationships extracted from the same training corpus by still more extraction modes verify and supplement one another, ensuring the accuracy and comprehensiveness of the extracted entity relationships and improving the extraction effect.
In some embodiments, the classifying, with an entity-relationship classifier, the entity relationship in the corpus to determine a second entity relationship in the corpus includes:
vectorizing the entities in the training corpus to obtain the feature vectors of the entities in the training corpus;
and inputting the characteristic vectors of the entities in the training corpus into the entity relationship classifier to perform entity relationship classification so as to determine the second entity relationship.
When the entity relationship classifier is used to classify the training corpus, the entities in the training corpus can be vectorized to obtain their feature vectors, and the feature vectors are then input into the entity relationship classifier, so that the entity relationships in the training corpus are classified automatically and the second entity relationship is extracted automatically through the mainly supervised relationship classification manner.
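As a toy illustration of the vectorization step (purely hypothetical; a real system would typically use word or character embeddings rather than this bag-of-characters sketch):

```python
# Toy entity-pair vectorization (hypothetical, for illustration only):
# represent each entity as a bag-of-characters count vector over a fixed
# vocabulary, then concatenate the two vectors as the classifier input.
def featurize(entity_a, entity_b, vocab):
    def bag(entity):
        vec = [0.0] * len(vocab)
        for ch in entity:
            if ch in vocab:
                vec[vocab.index(ch)] += 1.0
        return vec
    return bag(entity_a) + bag(entity_b)

print(featurize("ab", "b", ["a", "b"]))  # [1.0, 1.0, 0.0, 1.0]
```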
In some embodiments, the entity relationship classifier comprises a plurality of SVM (support vector machines) classifiers;
inputting the feature vectors of the entities in the corpus into the entity relationship classifier for entity relationship classification to determine the second entity relationship, including:
and sequentially and respectively inputting the feature vectors of the entities in the training corpus into a plurality of SVM classifiers for entity relationship classification, stopping classification until the probability of the classified entity relationship is greater than a preset probability, and determining the entity relationship greater than the preset probability as the second entity relationship.
Since one SVM classifier can classify only one entity relationship, while different corpora may contain different entity relationships and the same corpus may contain multiple entity relationships, the entity relationship classifier requires multiple SVM classifiers, i.e., different SVM classifiers classify different entity relationships. When performing entity relationship classification, the feature vectors of the entities in the training corpus can therefore be input sequentially and respectively into the multiple SVM classifiers, that is, different SVM classifiers are tried in turn until the probability of a classified entity relationship exceeds the preset probability, indicating that the entity relationship is sufficiently accurate. Classification can then stop, and the entity relationship whose probability exceeds the preset probability is determined as the second entity relationship, ensuring the accuracy of the second entity relationship.
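The try-until-confident procedure can be sketched as follows (the scorer interface is an assumption for illustration; with scikit-learn, each scorer could wrap the `predict_proba` output of a binary `SVC(probability=True)`):

```python
# Sequential one-vs-rest relation classification sketch (hypothetical
# interface, not the patent's implementation): try binary relation
# classifiers in turn and stop at the first confident one.
def classify_relation(feature_vec, classifiers, preset_prob=0.9):
    """classifiers: list of (relation_label, scorer) pairs, where scorer
    maps a feature vector to the probability that the relation holds.
    Returns the first relation whose probability exceeds preset_prob."""
    for relation, scorer in classifiers:
        if scorer(feature_vec) > preset_prob:
            return relation  # confident enough: stop trying further SVMs
    return None              # no classifier reached the preset probability

classifiers = [
    ("teacher_student", lambda v: 0.30),  # made-up scores for illustration
    ("location",        lambda v: 0.95),
]
print(classify_relation([1.0, 0.0], classifiers))  # location
```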
Of course, an SVM classifier may also output, instead of a probability for the entity relationship, a yes-or-no decision for a certain entity relationship. In this case, after the feature vectors of the entities in the training corpus are sequentially and respectively input into the multiple SVM classifiers for entity relationship classification, when a classifier outputs yes (or other characters expressing yes), the entity relationship that that classifier classifies is determined as the second entity relationship.
For example: an SVM classifier for classifying teacher-student relationships may output only "yes" or "no" for that relationship, respectively confirming or denying that the entities stand in a teacher-student relationship. In this case, after the feature vectors of the two entities of a training corpus are sequentially input into the multiple SVM classifiers for entity relationship classification, if the classifier for teacher-student relationships outputs "yes", the entity relationship of the input entities is the teacher-student relationship.
And/or
In some embodiments, the entity relationship classifier comprises a plurality of SVM (support vector machines) classifiers;
inputting the feature vectors of the entities in the corpus into the entity relationship classifier for entity relationship classification to determine the second entity relationship, including:
determining the character category to which the entity in the training corpus belongs; the character category is used for characterizing personalized features of the entity, such as whether the entity is a place name, a person name, a scenic spot name and the like.
The classifier corresponding to the place name can be a position relation classifier, the classifier corresponding to the person name can be a character relation classifier, and the classifier corresponding to the scenic spot can be a scenic spot administration relation classifier.
Selecting a corresponding classifier from the plurality of SVM classifiers according to the character category;
and inputting the characteristic vectors of the entities in the training corpus into the corresponding classifier to perform entity relationship classification so as to determine the second entity relationship.
Although sequentially and respectively inputting the feature vectors of the entities in the training corpus into the SVM classifiers can ensure that the second entity relationship is more accurate, this method classifies entity relationships slowly and inefficiently. To speed up entity relationship classification, a corresponding classifier can instead be selected from the multiple SVM classifiers according to the character category to which the entity belongs, and that classifier then performs the classification, so that the second entity relationship is determined quickly and efficiently and blind trials across the SVM classifiers are avoided.
For example: if the character category is a place name, an SVM classifier for classifying positional relationships is selected from the multiple SVM classifiers, and that classifier can then quickly confirm whether the entity relationship is a certain positional relationship.
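The category-based selection can be sketched as a simple dispatch table (all classifier names below are hypothetical, invented for illustration):

```python
# Sketch of category-based classifier selection: choosing SVM classifiers by
# the entity's character category avoids blindly trying every classifier.
# The category and classifier names are made up for this example.
CLASSIFIERS_BY_CATEGORY = {
    "place_name":  ["location_relation_svm"],
    "person_name": ["person_relation_svm"],
    "scenic_spot": ["jurisdiction_relation_svm"],
}

def select_classifiers(category):
    """Return the SVM classifiers to try for an entity category; an unknown
    category falls back to trying every classifier sequentially."""
    all_classifiers = [c for group in CLASSIFIERS_BY_CATEGORY.values() for c in group]
    return CLASSIFIERS_BY_CATEGORY.get(category, all_classifiers)

print(select_classifiers("place_name"))  # ['location_relation_svm']
```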
Of course, if there are still multiple corresponding classifiers, the second entity relationship may be further confirmed by combining the above embodiment of "sequentially and respectively inputting the feature vectors of the entities in the corpus into multiple SVM classifiers for entity relationship classification", so as to ensure the accuracy of the second entity relationship.
In some embodiments, if the first entity relationship matches the second entity relationship, determining the entity relationship in the corpus according to the first entity relationship and the second entity relationship includes:
and if the approximation degree of the first entity relationship and the second entity relationship reaches a preset approximation degree, determining at least one of the first entity relationship and the second entity relationship as the entity relationship in the training corpus.
If the approximation degree of the first and second entity relationships reaches the preset approximation degree, the relationships extracted by the different extraction methods are consistent and accurate, so at least one of the first and second entity relationships can be determined as the entity relationship in the training corpus, ensuring the accuracy of the entity relationships jointly extracted by the different extraction methods.
The approximation degree of the first and second entity relationships can be calculated with the Euclidean distance formula, the Manhattan distance formula, the cosine similarity formula, or the like; alternatively, the semantic similarity obtained through semantic analysis can be used as the approximation degree of the two entity relationships.
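Assuming the relation labels are represented as vectors (e.g. embeddings, an assumption not detailed in the disclosure), the cosine variant of this check can be sketched as:

```python
import math

# Cosine-similarity approximation check between two relation label vectors.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def relations_match(rel_vec_a, rel_vec_b, preset=0.8):
    """True if the two relation vectors reach the preset approximation degree."""
    return cosine_similarity(rel_vec_a, rel_vec_b) >= preset

# Identical directions give similarity 1.0; orthogonal directions give 0.0.
print(relations_match([1.0, 0.0], [1.0, 0.0]))  # True
print(relations_match([1.0, 0.0], [0.0, 1.0]))  # False
```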
For example: for the two entities "Beijing" and "China", the first entity relationship obtained by alignment with the corpus knowledge base is "membership", and the second entity relationship given by the SVM classifier is "location". A cosine-similarity analysis of "membership" and "location" finds their approximation degree greater than the preset approximation degree, showing that the two extraction modes agree on the relationship of this entity pair, so either "membership" or "location" can be used directly as the relationship of the entity pair.
Another example is: the first entity relationship obtained by aligning the two entities of the company and the employee with the corpus knowledge base is 'employment', the second entity relationship classified by using the SVM classifier is 'employment', semantic similarity analysis is carried out on the 'employment' and the 'employment', the two entities are found to be similar in semantics, the entity relationships of the entity pair extracted by the two entity relationship extraction modes are consistent, and the 'employment' or the 'employment' can be directly used as the relationship of the entity pair.
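As a minimal illustration of the approximation-degree check described above, the cosine-similarity comparison can be sketched as follows; the embedding vectors for the relation labels and the threshold value are hypothetical, stand-in values, not part of the disclosure:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embedding vectors for the relation labels "membership" and "location".
rel_vectors = {
    "membership": [0.8, 0.5, 0.1],
    "location":   [0.7, 0.6, 0.2],
}

PRESET_APPROXIMATION = 0.9  # hypothetical preset approximation degree

sim = cosine_similarity(rel_vectors["membership"], rel_vectors["location"])
relations_match = sim >= PRESET_APPROXIMATION  # True for these stand-in vectors
```

In practice the vectors would come from a trained relation or word embedding model rather than hand-written values.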
In some embodiments, the jointly extracting entity relationships in the corpus by multiple entity relationship extraction methods further includes:
and if the entity of the corpus cannot be aligned with the corpus knowledge base, determining the second entity relationship as the entity relationship in the corpus.
Because the number of corpora that can be entity-aligned in the corpus knowledge base is small and its information is incomplete, if the entities of a training corpus cannot be aligned with the corpus knowledge base, the second entity relationship produced by the entity relationship classifier can be determined as the entity relationship of those entities. In this way, the supervised relation-classification method compensates for entity relationships missing from the remote-supervision extraction method's knowledge base; the two methods make up for each other's shortcomings, the comprehensiveness of the extracted entity relationships is ensured, and the trained target entity relationship extraction model is more comprehensive and accurate.
Conversely, if the entity relationship classifier cannot classify the second entity relationship, the first entity relationship may be determined as the entity relationship in the training corpus, so that the remote-supervision extraction method makes up for the limited classification model and insufficient entity relationships of the supervised relation-classification method.
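The fallback behaviour just described — use the classifier result when alignment fails, use the alignment result when classification fails, otherwise check agreement — can be sketched as a small decision function. The threshold and similarity function here are illustrative assumptions, not the disclosure's actual values:

```python
def combine_relations(first_rel, second_rel, approx_fn, preset=0.9):
    """Combine the KB-alignment relation (first_rel) and the classifier
    relation (second_rel) into the final corpus relation.

    first_rel is None when the entity pair cannot be aligned with the
    knowledge base; second_rel is None when the classifier fails.
    """
    if first_rel is None:          # no KB alignment: fall back to classifier
        return second_rel
    if second_rel is None:         # classifier failed: fall back to KB
        return first_rel
    if approx_fn(first_rel, second_rel) >= preset:
        return first_rel           # the two methods agree; either label works
    return None                    # conflicting extractions: discard sample

# Simplest possible approximation function: exact match counts as fully similar.
exact = lambda a, b: 1.0 if a == b else 0.0
```

A semantic similarity function (e.g. cosine similarity over label embeddings) would replace `exact` in a real pipeline.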
In some embodiments, the generating one or more joint labels of the corpus according to the entity labels and the entity relationships includes:
determining the relative position of the entity in the corpus when the entity relationship in the corpus is extracted;
and generating the joint label according to the entity label, the entity relation and the relative position.
Combining the entity labels, the entity relationships, and the relative positions of the entities in the training corpus yields one or more joint labels, so that the trained target entity relationship extraction model integrates the advantages of the different entity relationship extraction methods and can recognize more entity relationships, more accurately and more comprehensively.
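A sketch of how entity labels, relation type, and relative position (head vs. tail role) might be combined into one joint tag per token. The tag format follows the BIESO + relation-type + role scheme described later in the disclosure; the helper function itself is an illustrative assumption:

```python
def make_joint_tags(tokens, head_span, tail_span, relation):
    """Build joint labels of the form <BIESO>-<relation>-<role>, where role 1
    marks the head entity and role 2 the tail entity.

    head_span / tail_span are (start, end) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)  # O = non-entity word
    for (start, end), role in ((head_span, 1), (tail_span, 2)):
        length = end - start
        for i in range(start, end):
            if length == 1:
                pos = "S"        # single-word entity
            elif i == start:
                pos = "B"        # beginning of multi-word entity
            elif i == end - 1:
                pos = "E"        # end of multi-word entity
            else:
                pos = "I"        # interior of multi-word entity
            tags[i] = f"{pos}-{relation}-{role}"
    return tags

tags = make_joint_tags(["Hefei", "is", "located", "in", "Anhui"],
                       (0, 1), (4, 5), "Located-in")
# tags == ['S-Located-in-1', 'O', 'O', 'O', 'S-Located-in-2']
```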
In an embodiment, the obtaining the entity label of the corpus includes:
acquiring word vectors or character vectors of the training corpus;
and inputting the word vector or the character vector into a pre-trained sequence labeling model to determine a target label sequence of the training corpus, wherein the target label sequence is formed by the labels of all entities in the training corpus. The sequence labeling model may be a BiLSTM-CRF model.
By inputting the word vector or the character vector of the training corpus into the trained sequence labeling model, a more accurate target label sequence of the training corpus can be obtained.
In one embodiment, the inputting the word vector or the character vector into a pre-trained sequence labeling model to determine a target label sequence of the training corpus includes:
inputting the word vector or the character vector into the BiLSTM layer of the sequence labeling model to obtain the label scores of all labels assigned to each word in the training corpus;
inputting the label scores of all labels assigned to each word in the training corpus into the CRF layer of the sequence labeling model to obtain at least one label sequence of the training corpus and its corresponding probability;
and outputting the corresponding label sequence with the highest probability in the at least one label sequence as the target label sequence.
Each word in the training corpus may be assigned several candidate labels; for example, under the BIESO labeling scheme, the word 'China' in 'Beijing, China' could be assigned either I or O, and no single label is guaranteed to be accurate, so each is measured by a score. The label scores of all labels assigned to each word are therefore input into the CRF layer of the sequence labeling model to obtain at least one label sequence of the training corpus together with the probability of each sequence. The sequences are sorted from highest to lowest probability, and the sequence with the highest probability is automatically determined as the target label sequence, ensuring the accuracy of the obtained target label sequence, i.e., the accuracy of the entity labels.
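The selection of the highest-probability label sequence described above is what Viterbi decoding in the CRF layer computes. A pure-Python sketch with made-up emission and transition scores (not trained values):

```python
def viterbi(emissions, transitions, tags):
    """Pick the highest-scoring tag sequence given per-token tag scores
    (emissions) and tag-to-tag transition scores, as the CRF layer does.

    emissions: list of {tag: score} dicts, one per token.
    transitions: {(prev_tag, tag): score} dict.
    """
    n = len(emissions)
    # best[i][t] = (score of best path ending in tag t at position i, backpointer)
    best = [{t: (emissions[0][t], None) for t in tags}]
    for i in range(1, n):
        row = {}
        for t in tags:
            prev_tag, prev_score = max(
                ((p, best[i - 1][p][0] + transitions[(p, t)]) for p in tags),
                key=lambda x: x[1])
            row[t] = (prev_score + emissions[i][t], prev_tag)
        best.append(row)
    # trace the best path back from the highest-scoring final tag
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(n - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(reversed(path))

best_path = viterbi(
    [{"I": 2.0, "O": 0.5}, {"I": 1.5, "O": 0.3}],
    {("I", "I"): 0.5, ("I", "O"): -0.5, ("O", "I"): -0.5, ("O", "O"): 0.5},
    ["I", "O"])
# best_path == ["I", "I"]
```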
The technical solution of the present disclosure will be further explained below:
the method comprises three steps, namely named entity sequence labeling, supervised-based relationship classification and remote supervision combined extraction of entity relationships, labeling of the entity relationships and entity relationship extraction model training, wherein the overall structure flow is shown in figure 2, and the three steps are discussed in detail below.
Step one, named entity sequence labeling
As shown in FIG. 3, the popular BiLSTM-CRF model is adopted for the named entity sequence labeling problem. First, a word2vec embedding layer represents each sentence (i.e., each training corpus) as word vectors and character vectors. These vectors are then input into the BiLSTM (Bi-directional Long Short-Term Memory) layer of the model, whose output is the score of every label for each word of the sentence. The output of the BiLSTM layer is then input into a CRF layer; the per-word label scores (the emission probability matrix) and the transition probability matrix serve as the parameters of the original CRF model, which finally yields the probability of each label sequence formed by the labels of the words.
The specific process is as follows. If the sequence input to the CRF is X and a predicted tag sequence is y, the score S(X, y) of the predicted sequence y is calculated as:

$$S(X, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=0}^{n} A_{y_i, y_{i+1}}$$

where $P_{i, y_i}$ is the label score of the $i$-th token (i.e., the $i$-th word) for label $y_i$, and $A_{y_i, y_{i+1}}$ is the transition score from the label of the $i$-th token to the label of the $(i+1)$-th token.

Each score corresponds to one complete path. A Softmax function defines a probability value for each correct sequence y over the set $Y$ of all predicted sequences, and training maximizes the likelihood $p(y \mid X)$:

$$p(y \mid X) = \frac{e^{S(X, y)}}{\sum_{\tilde{y} \in Y} e^{S(X, \tilde{y})}}$$

Taking the log-likelihood, the loss function is defined as $-\log(p(y \mid X))$:

$$-\log(p(y \mid X)) = \log \sum_{\tilde{y} \in Y} e^{S(X, \tilde{y})} - S(X, y)$$
During training, the model parameters are learned by minimizing this loss function. During prediction, the Viterbi algorithm is applied to obtain the entity label sequence with the highest probability, and each label in that sequence is used as the entity label of the corresponding entity in the training corpus; the labels follow the BIESO scheme.
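The score $S(X, y)$ and the loss $-\log(p(y \mid X))$ above can be checked numerically with a brute-force sketch. A real CRF computes the normaliser with the forward algorithm rather than enumerating all sequences; the scores below are made-up illustrative values:

```python
import math
from itertools import product

def path_score(emissions, transitions, y):
    """S(X, y): per-token emission scores plus tag-to-tag transition scores."""
    score = sum(emissions[i][tag] for i, tag in enumerate(y))
    score += sum(transitions[(y[i], y[i + 1])] for i in range(len(y) - 1))
    return score

def crf_loss(emissions, transitions, tags, y):
    """-log p(y|X): log of the normaliser over all candidate sequences,
    minus the score of the gold sequence y (brute force for illustration)."""
    log_z = math.log(sum(
        math.exp(path_score(emissions, transitions, seq))
        for seq in product(tags, repeat=len(emissions))))
    return log_z - path_score(emissions, transitions, y)

emissions = [{"I": 2.0, "O": 0.5}, {"I": 1.5, "O": 0.3}]
transitions = {("I", "I"): 0.5, ("I", "O"): -0.5,
               ("O", "I"): -0.5, ("O", "O"): 0.5}
loss = crf_loss(emissions, transitions, ["I", "O"], ("I", "I"))
```

Because $p(y \mid X)$ is a Softmax over all candidate sequences, the probabilities $e^{-\text{loss}}$ of every sequence sum to 1, which is a convenient sanity check.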
Step two, jointly extracting and labeling entity relationships through supervised relation classification (i.e., SVM classifier classification) and remote supervision
The results of the remote-supervision relation labeling and of the SVM classifier's entity relationship labeling serve as the training data for the subsequent joint entity relationship extraction, i.e., for training the initial entity relationship extraction model.
First, remote-supervision relation labeling rapidly constructs an information extraction training set. Given a knowledge base $K = (E, R, F)$, where $E$ denotes the set of entities, $R$ the set of relationships between the entities, and $F$ the set of all triples, and given a text $S$ (i.e., a training corpus) and two target entities $e_1$ and $e_2$: if $e_1 \in E$, $e_2 \in E$, and $(e_1, r, e_2) \in F$ for some $r \in R$, then the text $S$ is considered to describe the relationship $r$ between $e_1$ and $e_2$, i.e., the triple $(e_1, r, e_2)$ aligns with $S$ and is recorded as a training instance. An example is shown in figure 4.
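The alignment heuristic just described can be sketched with a toy knowledge base; the entities, relation names, and sentences here are illustrative stand-ins, not the disclosure's actual data:

```python
# Knowledge base F as a set of (head, relation, tail) triples.
kb_triples = {
    ("Beijing", "capital-of", "China"),
    ("Hefei", "Located-in", "Anhui"),
}

def distant_supervision_label(sentence_tokens, e1, e2):
    """If both target entities occur in the sentence and (e1, r, e2) is in
    the knowledge base, assume the sentence expresses relation r (the
    remote-supervision heuristic). Returns the relation label or None."""
    if e1 not in sentence_tokens or e2 not in sentence_tokens:
        return None
    for head, rel, tail in kb_triples:
        if head == e1 and tail == e2:
            return rel
    return None

label = distant_supervision_label(
    ["Beijing", "is", "the", "capital", "of", "China"], "Beijing", "China")
# label == "capital-of"
```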
Meanwhile, SVM classification labeling converts the relation-extraction problem for entity pairs into a text classification problem. For example, for labeling N types of relations in a text S, the relation of an entity pair is classified directly into one of the N types or 'no relation', so the problem becomes an (N+1)-class classification problem. The classification algorithm used for relation extraction in this disclosure is the Support Vector Machine (SVM) algorithm. The SVM first performs text feature selection and text feature representation on the training corpus, and then inputs the text feature representation, as the feature vector of the entities in the corpus, into the SVM model for classification, as shown in fig. 5. Text feature selection adopts the CHI (chi-square) method, and text feature representation adopts tf-idf. The learning process of the SVM essentially finds a separating hyperplane in the feature space that divides the training samples by category. Using the kernel trick, this linear classification method can be applied to nonlinear classification problems such as text classification: the inner product in the dual form of the linear support vector machine is simply replaced by a kernel function.
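This tf-idf → CHI selection → SVM pipeline can be sketched with scikit-learn; the toy corpus, labels, and parameter values below are illustrative assumptions, not the disclosure's training data:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

texts = [
    "Beijing is located in China",
    "Hefei is located in Anhui",
    "Alice works for Acme",
    "Bob works for Globex",
]
labels = ["Located-in", "Located-in", "Employed-by", "Employed-by"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),      # text feature representation: tf-idf
    ("chi2", SelectKBest(chi2, k=6)),  # text feature selection: CHI method
    ("svm", SVC(kernel="linear")),     # separating hyperplane / kernel function
])
pipeline.fit(texts, labels)

predicted = pipeline.predict(["Carol works for Initech"])[0]
# predicted == "Employed-by"
```

Swapping `kernel="linear"` for `kernel="rbf"` (or another kernel) is exactly the dual-form inner-product replacement mentioned above.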
Then, if the entity relationship produced by the remote-supervision relation labeling is consistent with the entity relationship produced by the SVM classification labeling, that relationship is determined as the final entity relationship.
Step three, training an entity relation extraction model
Step three trains the entity relationship extraction model. The method designs a special labeling strategy: the entity-relationship triples obtained in steps one and two are combined into a joint label used to train the initial entity relationship extraction model. The joint label comprises three parts of information:
(1) the position of the word in the entity from step one, labeled with BIESO, where S represents an entity containing only one word, B, I, and E represent the beginning, interior, and end of a multi-word entity respectively, and O represents a non-entity word;
(2) the entity relationship type from step two, e.g., 'Li' represents the Located-in relationship;
(3) the semantic roles of the entities obtained in step two, where 1 and 2 represent the head entity and the tail entity of the semantic relationship respectively.
The resulting tags (e.g., 'Located-in') are shown in figure 6. Through this labeling strategy, the training corpus 'Hefei is located in Anhui' is input into the initial entity relationship extraction model together with the named entity label sequence from step one, the entity relationship from step two, and the relative positions of the entities, for training and learning of the model. The jointly obtained target entity relationship extraction model then extracts the entity-relationship triples of a training corpus; for example, the triple extracted by the entity relationship extraction model in figure 6 is (Hefei, Located-in, Anhui). The initial entity relationship extraction model still adopts the BiLSTM-CRF model structure.
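Given the joint tags, the final entity-relationship triples can be recovered with a decoding step like the following sketch (tag format `<BIESO>-<relation>-<role>`, role 1 = head entity, role 2 = tail entity; the helper itself is an illustrative assumption):

```python
def decode_triples(tokens, tags):
    """Group tagged tokens by (relation, role) and pair role 1 (head) with
    role 2 (tail) to form (head, relation, tail) triples."""
    entities = {}
    for token, tag in zip(tokens, tags):
        if tag == "O":          # skip non-entity words
            continue
        parts = tag.split("-")
        relation, role = "-".join(parts[1:-1]), parts[-1]
        entities.setdefault((relation, role), []).append(token)
    triples = set()
    for relation, _role in entities:
        head = entities.get((relation, "1"))
        tail = entities.get((relation, "2"))
        if head and tail:
            triples.add((" ".join(head), relation, " ".join(tail)))
    return sorted(triples)

triples = decode_triples(
    ["Hefei", "is", "located", "in", "Anhui"],
    ["S-Located-in-1", "O", "O", "O", "S-Located-in-2"])
# triples == [("Hefei", "Located-in", "Anhui")]
```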
Finally, experiments were performed. The evaluation metrics are the precision-recall (PR) curve and average precision, and the results show that the performance of the model of the method proposed in this disclosure is significantly better than that of conventional remote-supervision relation extraction methods.
It is noted that, while for simplicity of explanation the foregoing method embodiments have been described as a series of acts or a combination of acts, those skilled in the art will appreciate that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required by the disclosure.
The above is a description of embodiments of the method, and the embodiments of the apparatus are further described below.
FIG. 7 shows a block diagram of an entity relationship extraction model training apparatus 700 according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus 700 includes:
an obtaining module 710, configured to obtain an entity and an entity tag of a corpus;
an extraction module 720, configured to jointly extract entity relationships in the corpus through multiple entity relationship extraction manners;
a generating module 730, configured to generate one or more joint tags of the corpus according to the entity tag and the entity relationship;
the training module 740 is configured to train the initial entity relationship extraction model according to the training corpus and the one or more joint labels, so as to obtain a target entity relationship extraction model.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the described module may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
The present disclosure also provides an electronic device and a readable storage medium according to an embodiment of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
The device 800 includes a computing unit 801 that can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 802 or loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computer system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (10)
1. An entity relationship extraction model training method is characterized by comprising the following steps:
acquiring an entity and an entity label of a training corpus;
extracting entity relations in the training corpus jointly through a plurality of entity relation extraction modes;
generating one or more joint labels of the training corpus according to the entity labels and the entity relation;
and training an initial entity relationship extraction model according to the training corpus and the one or more joint labels to obtain a target entity relationship extraction model.
2. The method according to claim 1, wherein the jointly extracting entity relationships in the corpus through multiple entity relationship extraction methods comprises:
aligning the entity of the corpus with a corpus knowledge base to determine a first entity relationship in the corpus;
classifying the entity relationship in the corpus by using an entity relationship classifier to determine a second entity relationship in the corpus;
and if the first entity relationship is matched with the second entity relationship, determining the entity relationship in the training corpus according to the first entity relationship and the second entity relationship.
3. The method of claim 2,
the classifying the entity relationship in the corpus by using the entity relationship classifier to determine a second entity relationship in the corpus includes:
vectorizing the entities in the training corpus to obtain the feature vectors of the entities in the training corpus;
and inputting the characteristic vectors of the entities in the training corpus into the entity relationship classifier to perform entity relationship classification so as to determine the second entity relationship.
4. The method of claim 3,
the entity relationship classifier comprises a plurality of SVM classifiers;
inputting the feature vectors of the entities in the corpus into the entity relationship classifier for entity relationship classification to determine the second entity relationship, including:
sequentially and respectively inputting the feature vectors of the entities in the training corpus into a plurality of SVM classifiers for entity relationship classification, stopping classification until the probability of the classified entity relationship is greater than a preset probability, and determining the entity relationship greater than the preset probability as the second entity relationship;
or
Determining the character category to which the entity in the training corpus belongs;
selecting a corresponding classifier from the plurality of SVM classifiers according to the character category;
and inputting the characteristic vectors of the entities in the training corpus into the corresponding classifier to perform entity relationship classification so as to determine the second entity relationship.
5. The method of claim 2,
if the first entity relationship matches the second entity relationship, determining the entity relationship in the corpus according to the first entity relationship and the second entity relationship, including:
and if the approximation degree of the first entity relationship and the second entity relationship reaches a preset approximation degree, determining at least one of the first entity relationship and the second entity relationship as the entity relationship in the training corpus.
6. The method according to claim 2, wherein the extracting entity relationships in the corpus jointly by a plurality of entity relationship extraction methods further comprises:
and if the entity of the corpus cannot be aligned with the corpus knowledge base, determining the second entity relationship as the entity relationship in the corpus.
7. The method according to any one of claims 1 to 6,
the generating one or more joint labels of the corpus according to the entity label and the entity relationship comprises:
determining the relative position of the entity in the corpus when the entity relationship in the corpus is extracted;
and generating the joint label according to the entity label, the entity relation and the relative position.
8. An entity relationship extraction model training device, comprising:
the acquisition module is used for acquiring the entity and the entity label of the training corpus;
the extraction module is used for jointly extracting the entity relations in the training corpus through a plurality of entity relation extraction modes;
a generating module, configured to generate one or more joint tags of the corpus according to the entity tag and the entity relationship;
and the training module is used for training the initial entity relationship extraction model according to the training corpus and the one or more joint labels to obtain a target entity relationship extraction model.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111057292.9A CN113505229B (en) | 2021-09-09 | 2021-09-09 | Entity relationship extraction model training method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113505229A true CN113505229A (en) | 2021-10-15 |
CN113505229B CN113505229B (en) | 2021-12-24 |
Family
ID=78017001
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111057292.9A Active CN113505229B (en) | 2021-09-09 | 2021-09-09 | Entity relationship extraction model training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113505229B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111144120A (en) * | 2019-12-27 | 2020-05-12 | 北京知道创宇信息技术股份有限公司 | Training sentence acquisition method and device, storage medium and electronic equipment |
CN111160035A (en) * | 2019-12-31 | 2020-05-15 | 北京明朝万达科技股份有限公司 | Text corpus processing method and device |
CN111274394A (en) * | 2020-01-16 | 2020-06-12 | 重庆邮电大学 | Method, device and equipment for extracting entity relationship and storage medium |
CN111367986A (en) * | 2020-03-12 | 2020-07-03 | 北京工商大学 | Joint information extraction method based on weak supervised learning |
CN113128203A (en) * | 2021-03-30 | 2021-07-16 | 北京工业大学 | Attention mechanism-based relationship extraction method, system, equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
ZHENG S ET AL.: "Joint extraction of entities and relations based on a novel tagging scheme", 《HTTPS://ARXIV.ORG/ABS/1706.05075》 * |
E Haihong et al.: "Survey of Entity Relationship Extraction Based on Deep Learning", Journal of Software *
Also Published As
Publication number | Publication date |
---|---|
CN113505229B (en) | 2021-12-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112084790B (en) | Relation extraction method and system based on pre-training convolutional neural network | |
CN113326764B (en) | Method and device for training image recognition model and image recognition | |
CN108416370B (en) | Image classification method and device based on semi-supervised deep learning and storage medium | |
CN110717039B (en) | Text classification method and apparatus, electronic device, and computer-readable storage medium | |
CN111967262B (en) | Determination method and device for entity tag | |
CN113360700B (en) | Training of image-text retrieval model, image-text retrieval method, device, equipment and medium | |
CN113780098A (en) | Character recognition method, character recognition device, electronic equipment and storage medium | |
CN113657395A (en) | Text recognition method, and training method and device of visual feature extraction model | |
CN112926308A (en) | Method, apparatus, device, storage medium and program product for matching text | |
CN114090601B (en) | Data screening method, device, equipment and storage medium | |
CN111666771A (en) | Semantic label extraction device, electronic equipment and readable storage medium of document | |
CN113901214B (en) | Method and device for extracting form information, electronic equipment and storage medium | |
CN113408273B (en) | Training method and device of text entity recognition model and text entity recognition method and device | |
CN112699237B (en) | Label determination method, device and storage medium | |
CN114495113A (en) | Text classification method and training method and device of text classification model | |
CN114020904A (en) | Test question file screening method, model training method, device, equipment and medium | |
CN114037059A (en) | Pre-training model, model generation method, data processing method and data processing device | |
CN115248890A (en) | User interest portrait generation method and device, electronic equipment and storage medium | |
CN117556005A (en) | Training method of quality evaluation model, multi-round dialogue quality evaluation method and device | |
CN113505229B (en) | Entity relationship extraction model training method and device | |
CN115909376A (en) | Text recognition method, text recognition model training device and storage medium | |
US20220207286A1 (en) | Logo picture processing method, apparatus, device and medium | |
CN115048523A (en) | Text classification method, device, equipment and storage medium | |
CN115600592A (en) | Method, device, equipment and medium for extracting key information of text content | |
CN112395873B (en) | Method and device for generating white character labeling model and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder | ||
Address after: 100085 room 703, 7 / F, block C, 8 malianwa North Road, Haidian District, Beijing Patentee after: Beijing daoda Tianji Technology Co.,Ltd. Address before: 100085 room 703, 7 / F, block C, 8 malianwa North Road, Haidian District, Beijing Patentee before: Beijing daoda Tianji Technology Co.,Ltd. |