CN112270196B - Entity relationship identification method and device and electronic equipment - Google Patents

Entity relationship identification method and device and electronic equipment

Info

Publication number
CN112270196B
Authority
CN
China
Prior art keywords
entity
sentence
text
sample
subject
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011461566.6A
Other languages
Chinese (zh)
Other versions
CN112270196A (en
Inventor
张浩静
刘炎
覃建策
陈邦忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Perfect World Beijing Software Technology Development Co Ltd
Original Assignee
Perfect World Beijing Software Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Perfect World Beijing Software Technology Development Co Ltd
Priority to CN202011461566.6A
Publication of CN112270196A
Application granted
Publication of CN112270196B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Abstract

The application discloses an entity relationship identification method and device, and electronic equipment, and relates to the technical field of data identification. The method comprises: first, performing subject supplement processing on sentences that lack a subject in the text to be recognized; obtaining the sentences containing entity pairs in the text after subject supplement processing; acquiring the entity information features corresponding to the entities in each entity pair; then inputting the entity information features, the entity pairs and the sentences containing the entity pairs into a preset recognition model for deep learning; and finally determining the entity relationships in the text to be recognized according to the classification result output by the preset recognition model. The method removes redundant and irrelevant text data, so that noisy text is filtered out as far as possible and the proportion of effective text data increases. This improves the accuracy of the recognition model and speeds up model training, and therefore improves the accuracy and efficiency of entity relationship identification.

Description

Entity relationship identification method and device and electronic equipment
Technical Field
The present application relates to the field of data identification technologies, and in particular, to a method and an apparatus for identifying an entity relationship, and an electronic device.
Background
In the internet era, people can communicate quickly and conveniently, and social software of all kinds generates a large amount of text data every moment. To improve people's experience and quality of life, this text data needs to be fully utilized: natural language processing technology can perform optimal, fast intelligent matching, which saves time and improves efficiency. This requires the generated text data to be structured, for example by generating a knowledge graph. One of the key steps in generating a knowledge graph is extracting entity relationships to produce triples of the form {head entity, relationship, tail entity}, so that the specific latent content in complex relationships can be analyzed more effectively and better serve people's daily life.
At present, conventional schemes use a remote supervision (distant supervision) algorithm to identify and extract entity relationships. Such schemes require a small knowledge graph as the initial triples at an early stage, and the strong assumption underlying remote supervision introduces a large amount of noise. As a result, the extracted entity relationships are not very accurate, which in turn lowers the accuracy of entity relationship identification.
Disclosure of Invention
In view of this, the present application provides a method and an apparatus for identifying an entity relationship, and an electronic device, mainly aiming to solve the technical problem in the prior art that the accuracy of entity relationship identification is reduced.
According to an aspect of the present application, there is provided a method for identifying entity relationships, the method including:
carrying out subject supplement processing on sentences lacking subjects in the text to be recognized;
obtaining sentences containing entity pairs in the text to be recognized after subject supplement processing;
acquiring entity information characteristics corresponding to the entities in the entity pair;
inputting the entity information features, the entity pairs and the sentences containing the entity pairs into a preset recognition model for deep learning;
and determining the entity relationship in the text to be recognized according to the classification result output by the preset recognition model.
According to another aspect of the present application, there is provided an apparatus for identifying entity relationships, the apparatus including:
the processing module is used for carrying out subject supplement processing on the sentences which lack the subjects in the text to be recognized;
the acquisition module is used for acquiring sentences containing entity pairs in the text to be recognized after subject supplement processing;
the acquisition module is further used for acquiring entity information features corresponding to the entities in the entity pair;
the input module is used for inputting the entity information characteristics, the entity pairs and the sentences containing the entity pairs into a preset recognition model for deep learning;
and the determining module is used for determining the entity relationship in the text to be recognized according to the classification result output by the preset recognition model.
According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the method of identifying entity relationships described above.
According to yet another aspect of the present application, there is provided an electronic device, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the method for identifying the entity relationship when executing the computer program.
With the above technical solution, the method, apparatus and electronic device for identifying entity relationships provided by the present application take into account that the strong assumption of the remote supervision algorithm introduces a large amount of noise. Therefore, under the strong assumption of remote supervision, subject supplement processing is first performed on the sentences that lack a subject in the text to be recognized, and the sentences containing entity pairs in the text after subject supplement processing are then obtained; next, the entity information features corresponding to the entities in each entity pair are acquired; finally, the entity information features, the entity pairs and the sentences containing the entity pairs are input into a preset recognition model for deep learning, and the entity relationships in the text to be recognized are determined according to the classification result output by the preset recognition model. Compared with the prior art, redundant and irrelevant text data can be removed, so that noisy text is filtered out as far as possible and the proportion of effective text data increases. This effectively improves the accuracy of the recognition model and speeds up its training, and therefore improves the accuracy and efficiency of entity relationship identification.
The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features and advantages of the present application are easier to understand, specific embodiments of the present application are described below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flowchart of an entity relationship identification method provided in an embodiment of the present application;
FIG. 2 is a schematic flowchart of another entity relationship identification method provided in an embodiment of the present application;
FIG. 3 is a schematic flowchart of missing-subject completion provided in an embodiment of the present application;
FIG. 4 is a schematic flowchart of dependency parsing provided in an embodiment of the present application;
FIG. 5 is a schematic flowchart of entity information description processing provided in an embodiment of the present application;
FIG. 6 is a schematic flowchart of the PCNN part of a PCNN + MIL remote supervision model provided in an embodiment of the present application;
FIG. 7 is a schematic flowchart of a PCNN + ATT remote supervision model provided in an embodiment of the present application;
FIG. 8 is a schematic flowchart of a BiLSTM + ATT + MIL remote supervision model provided in an embodiment of the present application;
FIG. 9 is a schematic flowchart of model fusion provided in an embodiment of the present application;
FIG. 10 is an example overall flowchart provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an entity relationship identification apparatus provided in an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
This embodiment aims to solve the technical problem in conventional schemes that, when a remote supervision algorithm is used to identify and extract entity relationships, its strong assumption introduces a large amount of noise and thereby reduces the accuracy of entity relationship identification. The embodiment provides an entity relationship identification method which, as shown in FIG. 1, includes:
Step 101, performing subject supplement processing on the sentences that lack a subject in the text to be recognized.
The execution subject of the embodiment may be an apparatus or device for entity relationship identification, and may be configured on the client side or the server side, and may be used for identification of entity relationships in the text data.
If the text to be recognized contains sentences with a missing subject, this interferes with entity relationship recognition and reduces its accuracy. Therefore, this embodiment first performs subject supplement processing on the sentences that lack a subject. For example, in the text "Zhang San has a good character, has been engaged to Li Si for years", the second clause lacks a subject. The subject can be supplemented from the subject "Zhang San" of the preceding clause "Zhang San has a good character", giving "Zhang San has a good character, Zhang San has been engaged to Li Si for years".
Step 102, obtaining the sentences containing entity pairs in the text to be recognized after subject supplement processing.
An entity pair may be two entities in a sentence (such as person names, place names, organization names or proper nouns) that have a certain relationship (such as a parent-subsidiary company relationship, a kinship relationship, a superior-subordinate relationship or a cooperative relationship). It should be noted that a sentence containing entity pairs contains at least one entity pair.
Step 103, acquiring the entity information features corresponding to the entities in the entity pair.
The entity information features are features that describe information related to an entity and are obtained from text data other than the text to be recognized. For example, information describing the entity can be obtained from encyclopedia websites, and the content that accurately expresses the characteristics of the entity is then extracted from it as the entity information features of that entity.
Step 104, inputting the acquired entity information features, the entity pairs and the sentences containing the entity pairs into a preset recognition model for deep learning.
The preset recognition model can be obtained by training in advance by using a deep learning algorithm, and the deep learning algorithm can be preset according to actual requirements.
The preset recognition model performs feature matching on the input entity information features, entity pairs and sentences containing the entity pairs to obtain, for similar features, the probability of each entity relationship label. The label with the maximum probability, or the labels whose probability exceeds a preset threshold, can then be taken as the output classification result.
Step 105, determining the entity relationships in the text to be recognized according to the classification result output by the preset recognition model.
For example, according to a classification result output by a preset recognition model, a label of the entity relationship with the maximum probability value is obtained, and according to the content of the label, the entity relationship in the text to be recognized is determined.
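The label selection described in steps 104 and 105 can be sketched as follows. This is only an illustration: it assumes the model outputs one probability per relationship label, and the function name `pick_relations`, the label strings and the threshold value are hypothetical, not taken from the application.

```python
from typing import Dict, List, Optional


def pick_relations(probs: Dict[str, float],
                   threshold: Optional[float] = None) -> List[str]:
    """Select relationship labels from a {label: probability} result.

    Without a threshold, only the label with the maximum probability is
    returned. With a threshold, every label whose probability exceeds it
    is returned.
    """
    if not probs:
        return []
    if threshold is None:
        return [max(probs, key=probs.get)]
    return [label for label, p in probs.items() if p > threshold]


# Hypothetical model output for one sentence containing an entity pair.
scores = {"parent-subsidiary": 0.81, "cooperation": 0.15, "no-relation": 0.04}
print(pick_relations(scores))        # maximum-probability label only
print(pick_relations(scores, 0.1))   # all labels above the threshold
```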
It should be noted that the method of this embodiment can be applied to entity relationship extraction in the recruitment field and is equally applicable to other fields. Taking the recruitment field as an example, the main entity types are companies, subsidiaries, legal persons and the like. Extracting entity relationships in this field yields triples of the form {head entity, relationship, tail entity}, from which a knowledge graph is built. The knowledge graph makes it possible to understand the recruitment field quickly, and the structured text data makes it possible to explore more effectively the specific latent content in complex relationships.
The method for identifying an entity relationship provided by this embodiment takes into account that the strong assumption of the remote supervision algorithm introduces a large amount of noise. Therefore, under the strong assumption of remote supervision, subject supplement processing is first performed on the sentences that lack a subject in the text to be recognized, and the sentences containing entity pairs in the text after subject supplement processing are then obtained; next, the entity information features corresponding to the entities in each entity pair are acquired; finally, the entity information features, the entity pairs and the sentences containing the entity pairs are input into a preset recognition model for deep learning, and the entity relationships in the text to be recognized are determined according to the classification result output by the preset recognition model. Compared with the prior art, redundant and irrelevant text data can be removed, so that noisy text is filtered out as far as possible and the proportion of effective text data increases. This effectively improves the accuracy of the recognition model and speeds up its training, and therefore improves the accuracy and efficiency of entity relationship identification.
Further, as a refinement and an extension of the specific implementation of the foregoing embodiment, in order to fully describe the implementation of this embodiment, this embodiment further provides another method for identifying an entity relationship, as shown in fig. 2, where the method includes:
Step 201, performing corpus cleaning on the input sample text.
The input sample text may be raw data for a preset recognition model training.
For example, the corpus of the input sample text can be cleaned with regular expressions or similar means: invalid garbled characters, symbols and the like are removed; English punctuation mixed into Chinese punctuation and mixed full-width and half-width characters are corrected; repeated punctuation marks are checked; and finally duplicate sentences are removed from the corpus. This corpus cleaning provides an initial cleaning of the input text, removing useless information and unifying the format to obtain normalized text data.
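A minimal sketch of such a cleaning pass, using only the Python standard library. The application does not specify the rules, so the set of "garbled" characters, the punctuation mapping and the function names `clean_sentence` and `clean_corpus` are assumptions made for illustration.

```python
import re
import unicodedata
from typing import Iterable, List

# Control characters and the Unicode replacement character, treated here as
# "invalid garbled characters" (an assumption; the source does not list them).
_GARBLED = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]")
# A run of the same punctuation mark, e.g. "。。" or "！！".
_REPEATED_PUNCT = re.compile(r"([，。！？；：,.!?;:])\1+")
# Half-width punctuation mapped to the full-width form used in Chinese text.
_TO_FULL_WIDTH = str.maketrans(
    {",": "，", "!": "！", "?": "？", ";": "；", ":": "："})


def clean_sentence(text: str) -> str:
    text = _GARBLED.sub("", text)
    # NFKC folds full-width letters and digits to half-width. It also folds
    # full-width punctuation, which the next step widens again.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_TO_FULL_WIDTH)
    # Collapse repeated punctuation to a single mark.
    text = _REPEATED_PUNCT.sub(r"\1", text)
    return text.strip()


def clean_corpus(sentences: Iterable[str]) -> List[str]:
    """Clean every sentence, then drop empty and duplicate sentences."""
    seen = set()
    cleaned = []
    for raw in sentences:
        s = clean_sentence(raw)
        if s and s not in seen:
            seen.add(s)
            cleaned.append(s)
    return cleaned


print(clean_corpus(["A公司成立了B公司。", "A公司成立了B公司。。", ""]))
```

This sketch widens every half-width comma or colon, including those in numbers or URLs; a production rule set would need to be more selective.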
Step 202, performing subject supplement processing on the sentences that lack a subject in the sample text after corpus cleaning.
During preliminary data processing, it was found that many sample texts lack a subject: statistics show that roughly one third of them do. Within a piece of text, the subject is generally missing in the second or a later sentence. A missing subject often means that one entity of a candidate entity pair is missing, so that much text data containing useful information cannot be learned, which seriously affects the accuracy and recall of the output.
Therefore, this embodiment provides an automatic subject supplement method, which determines whether a subject is missing and, if so, supplements it automatically. Accordingly, step 202 may optionally include: after word segmentation, part-of-speech tagging and named entity recognition (NER) are performed on each sample sentence, judging the sample sentences one by one; if the first participle after skipping the leading word in the current sample sentence is an NER entity, determining whether a pronoun exists within a preset distance threshold of the NER entity in the text, and if so, replacing the pronoun with the NER entity as the subject; if the first participle after skipping the leading word is not an NER entity, determining whether the first participle is a pronoun, and if so, replacing the first participle with the subject; and if the first participle is not a pronoun, determining whether the subject to be added already appears in the current sample sentence, and if not, adding the subject to the current sample sentence.
For example, FIG. 3 shows the flow of completing a missing subject. A natural language processing tool such as the LTP toolkit first performs word segmentation, part-of-speech tagging and NER in preparation for detecting missing subjects. For each sentence in each paragraph, after the leading word is skipped, it is first determined whether the first token (participle) of the sentence is an NER entity. If it is, a suitably chosen threshold is used to determine whether a pronoun exists within the threshold range; if so, the subject is added at the pronoun position, and if not, no subject is added. If the first token is not an NER entity, it is determined whether the first token is a pronoun, and if it is, the subject is added. If it is not a pronoun, it is determined whether the subject to be added already appears in the current sentence: if it does, no subject is added; if it does not, the subject is added to the current sentence.
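The decision flow above can be sketched as below. The sketch works on tokens that have already been segmented and tagged, so no toolkit is called. Several points are assumptions rather than details from the application: the `Token` structure, the `LEADING_WORDS` list, measuring the threshold as a number of tokens after the entity, and taking the most recent sentence-initial entity as the subject to be added.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str
    is_pronoun: bool = False   # from part-of-speech tagging
    is_entity: bool = False    # from named entity recognition


LEADING_WORDS = {"Moreover", "Meanwhile"}  # hypothetical leading words


def complete_subjects(sentences: List[List[Token]],
                      window: int = 5) -> List[List[str]]:
    subject: Optional[str] = None  # most recent sentence-initial entity
    result = []
    for sent in sentences:
        words = [t.word for t in sent]
        # Skip leading words to find the first real token.
        start = 0
        while start < len(sent) and sent[start].word in LEADING_WORDS:
            start += 1
        if start == len(sent):
            result.append(words)
            continue
        first = sent[start]
        if first.is_entity:
            # The sentence has its own subject; resolve nearby pronouns to it.
            subject = first.word
            for i in range(start + 1, min(len(sent), start + 1 + window)):
                if sent[i].is_pronoun:
                    words[i] = subject
        elif subject is not None:
            if first.is_pronoun:
                words[start] = subject          # replace the pronoun
            elif subject not in words:
                words.insert(start, subject)    # add the missing subject
        result.append(words)
    return result


paragraph = [
    [Token("ZhangSan", is_entity=True), Token("is"), Token("kind")],
    [Token("Moreover"), Token("engaged"), Token("to"),
     Token("LiSi", is_entity=True)],
    [Token("he", is_pronoun=True), Token("founded"),
     Token("CompanyB", is_entity=True)],
]
for words in complete_subjects(paragraph):
    print(" ".join(words))
```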
Specific examples are shown in table 1:
[Table 1 appears as an image in the original publication and is not reproduced here.]
By automatically supplementing the subject, fewer sample texts containing useful information are lost to learning because of a missing subject, and the accuracy and recall of the output are improved.
Step 203, performing sentence splitting on the sample text after subject supplement processing.
For example, each paragraph is divided into separate sentences using a toolkit, such as the LTP toolkit.
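Where a toolkit is not available, sentence splitting can be approximated with a regular expression. The sketch below is a stand-in, not the LTP sentence splitter; the set of sentence-ending marks and the function name `split_sentences` are assumptions.

```python
import re
from typing import List

# A sentence is a run of non-terminator characters followed by an optional
# sentence-ending mark (Chinese or English).
_SENTENCE = re.compile(r"[^。！？；!?;]+[。！？；!?;]?")


def split_sentences(paragraph: str) -> List[str]:
    return [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]


print(split_sentences("A公司创建B公司。B公司位于北京！详情待定"))
```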
Step 204, parsing the initial triple of each sample sentence obtained by sentence splitting, based on dependency syntax.
For convenience of description, take a target sentence as an example (one of the sample sentences obtained by sentence splitting; the other sentences are processed in the same way). Optionally, step 204 may include: first, performing word segmentation on the target sentence; then performing part-of-speech tagging on the resulting participles; then performing named entity recognition on the participles and their part-of-speech tags to obtain the entity annotation of the target sentence; and finally performing dependency syntax analysis based on the entity annotation to obtain the initial triple of the target sentence.
For example, the process of dependency parsing is shown in FIG. 4. This embodiment may use the LTP toolkit to segment a sentence into words and then tag the words with their parts of speech. Named entity recognition is then performed on the segmentation and tagging results to annotate the entity pairs. Finally, dependency syntax analysis is performed on this basis. The analysis mainly covers subject-predicate-object relations, verb-object relations with a postposed attributive, subject-predicate-complement relations containing a preposition-object relation, and the like. For the sentence "Company A created Company B", for instance, dependency parsing yields the initial triple from its subject-predicate-object structure.
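As an illustration of the subject-predicate-object case only, the sketch below reads a triple off a dependency parse that is supplied by hand. The arc format (one `(head_index, relation)` pair per word, with `-1` for the root) and the function name `extract_svo` are assumptions; the relation names `SBV`, `VOB` and `HED` follow the LTP dependency tag set, but no LTP call is made here.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (index of the head word, dependency relation)


def extract_svo(words: List[str], arcs: List[Arc]) -> List[Tuple[str, str, str]]:
    """Return (subject, predicate, object) triples found in one sentence."""
    triples = []
    for verb_idx, verb in enumerate(words):
        # Words attached to this verb as subject (SBV) or object (VOB).
        subjects = [words[i] for i, (head, rel) in enumerate(arcs)
                    if head == verb_idx and rel == "SBV"]
        objects = [words[i] for i, (head, rel) in enumerate(arcs)
                   if head == verb_idx and rel == "VOB"]
        if subjects and objects:
            triples.append((subjects[0], verb, objects[0]))
    return triples


# "Company A created Company B", parsed by hand for this example.
words = ["A公司", "创建", "B公司"]
arcs = [(1, "SBV"), (-1, "HED"), (1, "VOB")]
print(extract_svo(words, arcs))
```

The other relation types named in the text would need additional patterns over the same arc list.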
After the initial triple of each sample sentence is obtained through analysis, the sample sentence containing the entity pair in the sample text can be obtained by using the initial triple, and the process shown in step 205 can be specifically executed.
Step 205, mapping the single sentences obtained by sentence splitting with the initial triples, so as to screen out the sample sentences containing entity pairs.
The strong assumption of the remote supervision algorithm can introduce a lot of noise. For example, an entity pair in a sentence may have a parent-subsidiary company relationship, yet the current sentence expresses not that relationship but some other meaning. Therefore, this embodiment provides a method of filtering by relationship-related keywords, so as to remove noisy text data as far as possible. This may be performed as shown in steps 205 to 207.
For example, based on the strong assumption of the remote supervision algorithm, the initial triples obtained in step 204 are mapped onto the single sentences obtained by sentence splitting, the sentences containing an entity pair are screened out, and the two entities are marked, for example: <e1>Company A<\e1> created <e2>Company B<\e2>, which marks the two entities Company A and Company B. Sentences that do not contain an entity pair are discarded.
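The screening and marking step can be sketched as follows, assuming plain substring matching of the entity names. The function names `mark_entity_pair` and `screen_sentences` are hypothetical, and the sketch does not handle entities whose names overlap.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (head entity, relationship, tail entity)


def mark_entity_pair(sentence: str, head: str, tail: str) -> Optional[str]:
    """Wrap head in <e1>...<\\e1> and tail in <e2>...<\\e2>, or return None."""
    h = sentence.find(head)
    t = sentence.find(tail)
    if h < 0 or t < 0 or h == t:
        return None
    # Insert tags from right to left so earlier offsets stay valid.
    for pos, ent, tag in sorted([(h, head, "e1"), (t, tail, "e2")], reverse=True):
        marked = "<" + tag + ">" + ent + "<\\" + tag + ">"
        sentence = sentence[:pos] + marked + sentence[pos + len(ent):]
    return sentence


def screen_sentences(sentences: List[str], triples: List[Triple]) -> List[str]:
    """Keep and mark sentences that contain an entity pair; drop the rest."""
    kept = []
    for s in sentences:
        for head, _relation, tail in triples:
            marked = mark_entity_pair(s, head, tail)
            if marked is not None:
                kept.append(marked)
                break
    return kept


print(screen_sentences(["A公司创建B公司", "C公司上市"],
                       [("A公司", "创建", "B公司")]))
```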
Step 206, labeling the sample sentences containing entity pairs with relationship labels to obtain a first sentence set corresponding to a target relationship label.
The sentences in the first sentence set all carry the same relationship label, namely the target relationship label. The content of a relationship label can be set according to the specific relationship. For example, if sentence A contains an entity pair in a parent-subsidiary company relationship, sentence A can be labeled with the parent-subsidiary company relationship label; if sentence B contains an entity pair in a cooperative relationship, sentence B can be labeled with the cooperative relationship label.
After labeling the corresponding relationship tags for each statement, the statements with the same relationship tags may be summarized to obtain a statement set corresponding to the target relationship tag, that is, a first statement set. Each first statement set has a corresponding relationship label.
For example, a relationship label is added to each sentence in which an entity pair has been marked. The basis for adding the label is the assumption of remote supervision: if a sentence contains an entity pair involved in a relationship, the sentence is considered to describe that relationship. That is, every sentence in the corpus that contains both Company A and Company B is assumed to state that Company A is the parent company of Company B and Company B is a subsidiary of Company A, i.e. that the two are in a parent-subsidiary company relationship. Suppose all sentences regarded as expressing the parent-subsidiary company relationship form a set K (other relationships are handled similarly).
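A minimal sketch of labeling under this assumption, grouping sentences by relationship label. The mapping `known_pairs` and the function name are hypothetical. The second example sentence ("Company A and Company B sign an agreement") shows the kind of noise the assumption introduces: it receives the parent-subsidiary label although it does not express that relationship.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def label_by_remote_supervision(
        sentences: List[str],
        known_pairs: Dict[Tuple[str, str], str]) -> Dict[str, List[str]]:
    """Assign each sentence the label of every known entity pair it contains."""
    sets: Dict[str, List[str]] = defaultdict(list)
    for s in sentences:
        for (head, tail), relation in known_pairs.items():
            if head in s and tail in s:
                sets[relation].append(s)
    return dict(sets)


corpus = ["A公司创建B公司", "A公司与B公司签署协议", "C公司上市"]
pairs = {("A公司", "B公司"): "parent-subsidiary"}
sentence_sets = label_by_remote_supervision(corpus, pairs)
print(sentence_sets["parent-subsidiary"])  # plays the role of set K
```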
Step 207, obtaining, according to the entity pairs contained in the sample sentences of the first sentence set, the participles that appear before and after the head entity and the tail entity with a frequency meeting a preset condition.
For the entity pair contained in a sentence, the entity that appears first may be taken as the head entity and the entity that appears later as the tail entity. The preset condition can be set according to actual requirements. For example, each sentence in the sentence set is obtained, and the participles within a specific range before and after the head entity and the tail entity are determined according to the entity pair it contains, such as the a participles before and the b participles after the head entity in sentence 1, and the c participles before and the d participles after the tail entity; the corresponding participles of sentence 2 can be determined in the same way. The participles determined for all sentences in the first sentence set are then pooled and their occurrence frequencies counted, and the participles to be obtained are selected according to whether their frequency meets the preset condition. For example, the participles are ranked by occurrence frequency and the top n, or a certain top proportion, are taken as the participles meeting the preset condition; alternatively, a frequency threshold is set and the participles whose occurrence frequency exceeds it are taken.
The participles whose occurrence frequency meets the preset condition may still include some interfering or useless words, so the participles that meet the requirements can be obtained by further screening.
Optionally, step 207 may include: first, performing word frequency statistics on a preset number of participles before and after the head entity and the tail entity in each sample sentence of the first sentence set; then sorting by the word frequency statistics and taking a preset number of top-ranked participles; and finally performing anomaly filtering on the top-ranked participles and taking those not filtered out as the participles meeting the preset condition.
For example, with a window size m, the m tokens (participles) before and after the head entity and the m tokens before and after the tail entity are taken from all sentences in the set K to form a set A. After word frequency statistics and sorting of all tokens in the set A, the top n tokens most relevant to the parent-subsidiary company relationship are taken out and quickly screened manually to obtain a set B (the manual step takes little time because the n tokens are already the result of preliminary screening).
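The window-based frequency count that produces the candidate tokens can be sketched as below. It assumes each sentence is already segmented and that each entity is a single token; the function names are hypothetical, and excluding the entities themselves from the count is a choice made for this sketch. The manual screening that turns the top n tokens into the set B is not modeled.

```python
from collections import Counter
from typing import List, Tuple

Sample = Tuple[List[str], str, str]  # (tokens, head entity, tail entity)


def window_tokens(tokens: List[str], entity: str, m: int) -> List[str]:
    """Return up to m tokens before and m tokens after the entity."""
    idx = tokens.index(entity)
    return tokens[max(0, idx - m):idx] + tokens[idx + 1:idx + 1 + m]


def top_context_tokens(samples: List[Sample], m: int, n: int) -> List[str]:
    counts: Counter = Counter()
    for tokens, head, tail in samples:
        for entity in (head, tail):
            # Pool the window tokens (set A), leaving out the entities.
            counts.update(t for t in window_tokens(tokens, entity, m)
                          if t not in (head, tail))
    return [token for token, _ in counts.most_common(n)]


set_k = [
    (["A公司", "创建", "了", "B公司"], "A公司", "B公司"),
    (["A公司", "全资", "创建", "B公司"], "A公司", "B公司"),
]
print(top_context_tokens(set_k, m=1, n=1))
```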
Step 208, training with a deep learning algorithm to obtain the preset recognition model, according to a second sentence set consisting of the sentences in the first sentence set that contain participles meeting the preset condition.
The second sentence set consists of the sentences in the first sentence set that contain participles meeting the preset condition. Filtering by relationship-related keywords in this way removes redundant and irrelevant text data, so that noisy text is removed as far as possible and the proportion of effective text data increases; this effectively improves the accuracy of the recognition model and speeds up its training. After the second sentence set corresponding to each first sentence set has been obtained, the recognition model can be trained on the second sentence sets and their relationship labels (taken from the corresponding first sentence sets).
In this embodiment, model training may be based on any of several optional deep learning models to obtain a recognition model for recognizing entity relationships. For example, after model training is completed, the model can be evaluated on a test set; once its output meets the required metrics, training is considered to have reached the standard and the trained recognition model is obtained. If the test output of the trained model does not meet the required metrics, training continues until the model reaches the standard.
For example, the set K (a set of sentences) is filtered through the set B (a set of tokens): the sentences in the set K that include a token in the set B are kept, and the rest are discarded. This yields a new sentence set H from the set K. The sentences in the set H are used to train the subsequent recognition model. Filtering by relation-related keywords removes redundant and irrelevant text data, increases the proportion of effective text data, improves the accuracy of the model, and makes model training faster.
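A minimal sketch of this keyword filter is given below. The function name and the toy data are assumptions; sentences are represented as token lists, which presumes that word segmentation has already been performed.

```python
def filter_sentences(set_k, set_b):
    """Keep the sentences of set K that contain at least one token of set B."""
    keywords = set(set_b)
    return [tokens for tokens in set_k if keywords.intersection(tokens)]

# Toy data: only the first sentence contains a keyword from set B.
set_k = [
    ["CompanyC", "owns", "CompanyD"],
    ["CompanyC", "met", "CompanyD", "yesterday"],
]
set_b = ["owns", "subsidiary"]
set_h = filter_sentences(set_k, set_b)
```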
In the process of training the model, this embodiment adds an entity information description to the input of the recognition model. Entity information features are extracted through a convolutional neural network (CNN), and the entity information features and the text information features are used together as the input of the model, so that description information is added to the entity and the accuracy of the model is improved. Correspondingly, optionally, step 208 may specifically include: first, acquiring entity description information corresponding to the entities in each sample sentence in the second sentence set; then, performing corpus cleaning on the entity description information; performing word segmentation on the cleaned entity description information; performing word embedding representation on the resulting participles, inputting them into a convolutional neural network, and applying a max pooling layer to obtain the entity information features corresponding to the entity description information; then, creating a training set according to the entity information features corresponding to each sample sentence in the second sentence set, the sentence data, and the relationship label corresponding to the second sentence set; and finally, training with a deep learning algorithm based on the training set to obtain the preset recognition model.
For example, the flow of the entity information description is shown in fig. 5. Entity description corpora are extracted from web pages such as Wikipedia and cleaned in the same way as the text corpus; word segmentation is then performed with a word segmentation tool, the resulting tokens are given a word embedding representation and input into a convolutional neural network (CNN), and a max pooling layer is applied. In this embodiment, in addition to the basic word embedding (the word embedding representation of the words in a sentence) and position embedding (the representation of the relative position of each word to the head entity and to the tail entity in the sentence), entity description information is extracted from web pages such as Wikipedia and used together with the word embedding and the position embedding as the input of the model, so that description information is added to the entity and the accuracy of the model is improved.
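The convolution and max pooling step applied to the entity description can be illustrated with a small dependency-free sketch. The function name, the flattened kernel layout and the toy numbers are assumptions; a real implementation would use learned embeddings and kernels in a deep learning framework, and the source does not specify the window size or the number of kernels.

```python
def encode_description(token_vectors, kernels, window):
    """Slide each kernel over `window` consecutive token vectors, then max-pool.

    token_vectors: one embedding (list of floats) per description token.
    kernels: each kernel is a flat list of window * dim weights (hypothetical layout).
    Returns one max-pooled value per kernel, i.e. the entity information feature.
    """
    if len(token_vectors) < window:
        raise ValueError("description is shorter than the convolution window")
    features = []
    for kernel in kernels:
        activations = []
        for start in range(len(token_vectors) - window + 1):
            # Flatten the current window of embeddings into one patch.
            patch = [v for vec in token_vectors[start:start + window] for v in vec]
            activations.append(sum(p * k for p, k in zip(patch, kernel)))
        features.append(max(activations))  # max pooling over all window positions
    return features

description = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # 3 tokens, dim = 2
kernels = [[1.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]  # 2 kernels, window = 2
entity_feature = encode_description(description, kernels, window=2)
```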
In this embodiment, in order to improve the recognition accuracy of the recognition model, optionally, based on the training set, a PCNN + MIL (Piecewise Convolutional Neural Networks + Multi-Instance Learning) remote supervision model, a PCNN + ATT (Piecewise Convolutional Neural Networks + Attention) remote supervision model, and a BiLSTM + ATT + MIL (Bi-directional Long Short-Term Memory + Attention + Multi-Instance Learning) remote supervision model are each trained by using a stacking method; model fusion is then performed on the trained PCNN + MIL, PCNN + ATT and BiLSTM + ATT + MIL remote supervision models to obtain the preset recognition model.
Performing model fusion on the three remote supervision models PCNN + MIL, PCNN + ATT and BiLSTM + ATT + MIL to obtain the recognition model effectively combines the advantages of the individual models and improves the accuracy and recall of the final output.
Optionally, the input layers of the PCNN + MIL remote supervision model, the PCNN + ATT remote supervision model and the BiLSTM + ATT + MIL remote supervision model are each a concatenation of the word embedding, the position embedding and the entity information description, where the word embedding is the word embedding representation of the words in a sentence, the position embedding is the representation of the relative position of each word to the head entity and of each word to the tail entity in the sentence, and the entity information description is the entity information feature corresponding to the entity description information; the convolution layers of the PCNN + MIL remote supervision model and the PCNN + ATT remote supervision model comprise a preset number of convolution kernels of different sizes, and the size of the convolution kernels is determined according to the total length of the concatenation of the word embedding, the position embedding and the entity information description; the pooling layers of the PCNN + MIL remote supervision model and the PCNN + ATT remote supervision model use max pooling; and the classification layers of the PCNN + MIL remote supervision model and the PCNN + ATT remote supervision model use softmax for classification.
Each remote supervision model is described in detail below:
1. For the PCNN + MIL remote supervision model: for example, a flow chart of the PCNN part of the PCNN + MIL remote supervision model is shown in FIG. 6. Entity pairs are mapped and relationship labels are added to the sentences obtained by filtering based on the relation-related keywords. The sentences are then segmented with a word segmentation tool, and the sentences having the same entity pair are grouped into one bag and input to the PCNN + MIL remote supervision model. The input layer of the PCNN + MIL remote supervision model is the concatenation of the word embedding, the position embedding and the entity information description. The word embedding is the word embedding representation of a word, which may be obtained with word2vec or BERT. The position embedding is the representation of the relative position of each word to the head entity and to the tail entity in a sentence.
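The concatenated input layer can be sketched per token as follows. The helper names, the clipping distance, the lookup-table layout and the choice to append the same entity feature to every token are assumptions made for illustration; the source only states that the three parts are concatenated.

```python
def clip(offset, max_distance):
    """Limit a relative offset to the range [-max_distance, max_distance]."""
    return max(-max_distance, min(max_distance, offset))

def build_input(word_vectors, head_index, tail_index, position_table,
                entity_feature, max_distance):
    """Concatenate word embedding, two position embeddings and the entity feature."""
    rows = []
    for i, word_vector in enumerate(word_vectors):
        head_offset = clip(i - head_index, max_distance)  # position relative to head entity
        tail_offset = clip(i - tail_index, max_distance)  # position relative to tail entity
        rows.append(word_vector
                    + position_table[head_offset]
                    + position_table[tail_offset]
                    + entity_feature)
    return rows

# Toy values: 1-dimensional embeddings so the result is easy to read.
position_table = {d: [float(d)] for d in range(-2, 3)}  # hypothetical lookup table
word_vectors = [[0.1], [0.2], [0.3]]
rows = build_input(word_vectors, head_index=0, tail_index=2,
                   position_table=position_table,
                   entity_feature=[9.0], max_distance=2)
dim = len(rows[0])  # total concatenated length per token
```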
The convolution layer of the PCNN + MIL remote supervision model may include convolution kernels of 3 different sizes, 2 × dim, 3 × dim and 4 × dim, where dim is the total length after concatenation of the word embedding, the position embedding and the entity information description. The convolution operation extracts the correlation between adjacent tokens in the sentence. The piecewise pooling layer uses max pooling: the output of the convolutional layer is divided into three parts with the two entities as boundaries, max pooling is performed on each of the three parts, and the three results are concatenated. The classification layer uses softmax for classification: the piecewise pooling outputs of the different convolution kernels are concatenated and used as the input of softmax for the classification operation.
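Piecewise max pooling over the output of one convolution kernel can be sketched as below. Which segment each entity position belongs to, and the value used for an empty segment, are assumptions of this sketch rather than details given in the source.

```python
def piecewise_max_pool(feature_map, head_pos, tail_pos):
    """Split one kernel's output at the two entity positions and max-pool each part."""
    first, second = sorted((head_pos, tail_pos))
    segments = [
        feature_map[:first + 1],            # start of sentence up to the first entity
        feature_map[first + 1:second + 1],  # between the entities, up to the second one
        feature_map[second + 1:],           # after the second entity
    ]
    # An empty segment (entity at the sentence end) contributes 0.0 in this sketch.
    return [max(segment) if segment else 0.0 for segment in segments]

feature_map = [0.1, 0.7, 0.3, 0.9, 0.2, 0.5]
pooled = piecewise_max_pool(feature_map, head_pos=1, tail_pos=3)
```

The pooled triples of all kernels would then be concatenated to form the input of the softmax layer.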
Correspondingly, training the PCNN + MIL remote supervision model by using the training set may specifically include: performing model input in a bag form, wherein sentences containing the same entity pair are divided into a bag, each sentence in the bag passes through a classification layer of the PCNN to obtain first probability distribution, and the first probability distribution comprises a relation label of the sentence and a probability value of the sentence as the relation label; and taking a statement with the maximum probability value and a relation label predicted by the statement from the bag as a prediction result of the bag based on the first probability distribution.
For example, for the MIL part of the PCNN + MIL remote supervision model, which is input in the form of bag (a sentence containing the same entity pair is a bag), each sentence in a bag will obtain a probability distribution after passing through the classification layer in the PCNN, and then obtain the relationship label of the sentence, and the probability value of the relationship label. In order to eliminate the influence of error labels, the remote supervision model takes a sentence with the maximum probability value and a relation label obtained by prediction of the sentence from the bag as the prediction result of the bag and calculates a cross entropy loss function. And finally, performing backward propagation. The MIL may effectively eliminate the effect of a false label.
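The bag-level selection can be sketched as follows. The function names are hypothetical, and computing the cross-entropy on the selected sentence against the bag's relation label is an assumption about how the loss is formed; the source only states that the most confident sentence represents the bag and that a cross-entropy loss is calculated.

```python
import math

def predict_bag(bag_probs):
    """Pick the sentence whose top relation probability is highest in the bag.

    bag_probs: one probability list (over relation labels) per sentence.
    Returns (sentence_index, label_index, probability).
    """
    best = None
    for sentence_index, probs in enumerate(bag_probs):
        label_index = max(range(len(probs)), key=probs.__getitem__)
        if best is None or probs[label_index] > best[2]:
            best = (sentence_index, label_index, probs[label_index])
    return best

def bag_cross_entropy(bag_probs, bag_label):
    """Cross-entropy of the selected sentence against the bag's relation label."""
    sentence_index, _, _ = predict_bag(bag_probs)
    return -math.log(bag_probs[sentence_index][bag_label])

bag = [[0.6, 0.4], [0.1, 0.9], [0.5, 0.5]]  # three sentences, two relation labels
prediction = predict_bag(bag)
loss = bag_cross_entropy(bag, bag_label=1)
```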
2. For the PCNN + ATT remote supervision model: for example, a flow chart of the PCNN + ATT remote supervision model is shown in fig. 7, where the PCNN part is the upper diagram of fig. 7 and the ATT part is the lower diagram of fig. 7. Entity pairs are mapped and relationship labels are added to the sentences obtained by filtering based on the relation-related keywords. The sentences are then segmented with a word segmentation tool, and the sentences having the same entity pair are input as one bag to the PCNN + ATT remote supervision model. The input layer of the PCNN + ATT remote supervision model is the concatenation of the word embedding, the position embedding and the entity information description. The word embedding is the word embedding representation of a word, which may be obtained with word2vec or BERT. The position embedding is the representation of the relative position of each word to the head entity and to the tail entity in a sentence.
The convolution layer may use convolution kernels of 3 different sizes, 2 × dim, 3 × dim and 4 × dim, where dim is the total length after concatenation of the word embedding, the position embedding and the entity information description. The convolution operation extracts the correlation between adjacent tokens in the sentence. The piecewise pooling layer uses max pooling: the output of the convolutional layer is divided into three parts with the two entities as boundaries, max pooling is performed on each of the three parts, and the three results are concatenated. The nonlinear layer of the PCNN + ATT remote supervision model performs nonlinear processing with the tanh function.
Correspondingly, the training of the PCNN + ATT remote supervision model by using the training set may specifically include: inputting a model in a bag form, wherein each sentence in one bag can obtain a feature vector after passing through a nonlinear layer of the PCNN; calculating an attention score according to the matching degree between the feature vector and the relation label; then, according to the calculated attention score, obtaining a second probability distribution of the entity pair related to all the relations by using a classification layer of the PCNN; and finally, taking a statement with the maximum probability value and a relation label obtained by predicting the statement from the bag as a prediction result of the bag based on the second probability distribution.
For example, when the input is in bag form, each sentence in a bag yields a feature vector after passing through the nonlinear layer of the PCNN. As shown in fig. 7, assuming that there are n sentences containing a given entity pair, n feature vectors x_1, x_2, …, x_n are obtained after the nonlinear layer of the PCNN. The attention layer (ATT) is then applied, with the following processing steps:
Formula (1) calculates the matching degree e_j between the feature vector x_j of a sentence and the relation label r, and formula (2) calculates the attention score a_j. In the formulas, r is the vector representation of the relation label.
[Formulas (1) and (2): equation image not reproduced in this text]
Then, the feature vector b of the entity pair is calculated based on the attention mechanism, as shown in formula (3).
[Formula (3): equation image not reproduced in this text]
Finally, the probability distribution of the entity pair over all relations is obtained through a softmax layer, and the relation with the maximum probability value is the predicted relationship label. The PCNN + ATT remote supervision model takes the sentence with the maximum probability value in the bag, together with the relationship label predicted for that sentence, as the prediction result of the bag, and calculates a cross-entropy loss function. Finally, back propagation is performed. Taking bags as the input unit can effectively reduce the influence of wrong labels.
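Because the formula images are not available in this text, the following is only a hedged sketch of one common form of sentence-level attention, not necessarily the exact formulas (1) to (3) of this embodiment. It assumes that the matching degree e_j is the dot product of x_j and r, that the attention score a_j is the softmax of the matching degrees, and that b is the attention-weighted sum of the sentence vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(sentence_vectors, relation_vector):
    """Return attention scores a_j and the bag vector b (assumed dot-product matching)."""
    # Assumed matching degree: e_j = x_j . r
    matches = [sum(x * r for x, r in zip(vector, relation_vector))
               for vector in sentence_vectors]
    scores = softmax(matches)  # assumed a_j = exp(e_j) / sum_k exp(e_k)
    dim = len(relation_vector)
    # Assumed b = sum_j a_j * x_j
    bag_vector = [sum(a * vector[d] for a, vector in zip(scores, sentence_vectors))
                  for d in range(dim)]
    return scores, bag_vector

sentence_vectors = [[1.0, 0.0], [0.0, 1.0]]  # x_1, x_2 from the nonlinear layer
relation_vector = [1.0, 0.0]                 # r, the relation label representation
scores, bag_vector = attend(sentence_vectors, relation_vector)
```

Under these assumptions, the sentence whose feature vector matches the relation vector better receives the larger weight, so noisy sentences in the bag contribute less to b.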
3. For the BiLSTM + ATT + MIL remote supervision model: for example, a flow chart of the BiLSTM + ATT part of the BiLSTM + ATT + MIL remote supervision model is shown in FIG. 8. Entity pairs are mapped and relationship labels are added to the sentences obtained by filtering based on the relation-related keywords. The sentences are then segmented with a word segmentation tool, and the sentences having the same entity pair are input as one bag to the BiLSTM + ATT + MIL remote supervision model. The input layer of the BiLSTM + ATT + MIL remote supervision model is the concatenation of the word embedding, the position embedding and the entity information description. The word embedding is the word embedding representation of a word, which may be obtained with word2vec or BERT. The position embedding is the representation of the relative position of each word to the head entity and to the tail entity in a sentence.
The BiLSTM layer of the BiLSTM + ATT + MIL remote supervision model is a bidirectional recurrent neural network layer. The internal structure of the LSTM cell is shown in fig. 8. Compared with a plain RNN, an LSTM can handle the long-term dependency problem to some extent, mainly because of its three added gates: the forget gate, the input gate and the output gate. The LSTM removes information from or adds information to the cell state through these carefully designed structures called "gates". Abstract features of longer text containing numeric strings can therefore be extracted well. However, an LSTM only considers the preceding context; in order to also take the following context into account, a bidirectional LSTM, that is, BiLSTM, is used here, and the outputs of the forward LSTM and the backward LSTM are concatenated.
The attention mechanism layer of the BiLSTM + ATT + MIL remote supervision model is an Attention layer. Assume that the set of vectors output by the BiLSTM layer is H = {h_1, h_2, …, h_T}, which is the input of the Attention layer.
[Equation image not reproduced in this text]
In the above formula, w^T is the transpose of a parameter vector obtained through training.
[Equation image not reproduced in this text]
Finally, the representation vector h_senc of the sentence is obtained.
The Attention mechanism simulates the state of a person when looking at things, i.e. although the human eye sees a wide range, the Attention distribution at every spatial position in the scene is different. Therefore, the method has a great promotion effect on the sequence learning task.
The linear layer of the BiLSTM + ATT + MIL remote supervision model is a fully connected layer. It consists of a linear transformation and a ReLU activation function, and is used for dimensional transformation, mapping the features to a dimension equal to the number of classes. For the classification layer, relation extraction is a multi-class classification problem, so the softmax activation function is used.
Correspondingly, the training of the BilSTM + ATT + MIL remote supervision model by using the training set may specifically include: inputting a model in a bag form, and sequentially passing through a BilSTM layer, an attention mechanism layer, a linear layer and a classification layer of a BilSTM + ATT + MIL remote supervision model to obtain third probability distribution, wherein the third probability distribution comprises a relation label of a statement and a probability value of the statement to the relation label; and taking a statement with the maximum probability value and a relation label predicted by the statement from the bag as a prediction result of the bag based on the third probability distribution.
For example, the model is input in the bag form, and each sentence in a bag is subjected to a classification layer to obtain a probability distribution, that is, a relationship tag of the sentence, and a probability value of the relationship tag are obtained. In order to eliminate the influence of error labels, the remote supervision model takes a sentence with the maximum probability value and a relation label obtained by prediction of the sentence from the bag as the prediction result of the bag and calculates a cross entropy loss function. And finally, performing backward propagation. The MIL may effectively eliminate the effect of a false label.
Based on the training set, the PCNN + MIL, PCNN + ATT and BiLSTM + ATT + MIL remote supervision models are each trained with a stacking method, that is, multi-fold cross-validation; the average of the results output by the three models is then taken as the recognition result of the preset recognition model.
For example, as shown in FIG. 9, 3-fold cross-validation is used, so each remote supervision model is trained and used for prediction 3 times. Each time, a model is trained on two thirds of the training data and then predicts the remaining third of the training set as well as the test set. One third of the training data is predicted each time, so that after 3 rounds every training sample has been predicted exactly once. The test set is predicted in every round, that is, 3 times in total, and the final result is the average of the 3 probability values. Model fusion helps to prevent overfitting and effectively combines the advantages of the three remote supervision models PCNN + MIL, PCNN + ATT and BiLSTM + ATT + MIL, further improving the accuracy of the result.
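The 3-fold procedure and the final averaging can be sketched without any framework as follows. The contiguous fold layout, the `fit` callback interface and the toy mean-predicting model are assumptions made for illustration; in the embodiment, `fit` would train one of the three remote supervision models.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, held_out_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        held_out = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, held_out
        start += size

def cross_validated_predictions(train_x, train_y, test_x, fit, k=3):
    """Predict every training sample once (out of fold) and average test predictions."""
    out_of_fold = [None] * len(train_x)
    test_sum = [0.0] * len(test_x)
    for train_idx, held_idx in k_fold_splits(len(train_x), k):
        predict = fit([train_x[i] for i in train_idx], [train_y[i] for i in train_idx])
        for i in held_idx:
            out_of_fold[i] = predict(train_x[i])
        for j, x in enumerate(test_x):
            test_sum[j] += predict(x)
    return out_of_fold, [total / k for total in test_sum]

def fuse(per_model_probs):
    """Average the probabilities produced by several models for the same samples."""
    return [sum(values) / len(values) for values in zip(*per_model_probs)]

def fit_mean(xs, ys):
    """Toy stand-in for a model: always predicts the mean training label."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

out_of_fold, test_avg = cross_validated_predictions(
    list(range(6)), [0, 0, 1, 1, 1, 1], [10, 11], fit_mean, k=3)
fused = fuse([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]])  # three models, two samples
```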
Considering that the strong assumption of the remote supervision algorithm introduces a large amount of noise, this embodiment labels the sentences containing entity pairs in the sample text with relationship labels to obtain a first sentence set corresponding to a target relationship label; obtains, according to the entity pairs contained in the sentences in the first sentence set, the participles whose occurrence frequency before and after the head entity and the tail entity meets a preset condition; then trains a deep learning recognition model according to a second sentence set, which consists of the sentences in the first sentence set that contain the participles meeting the preset condition; and finally recognizes entity relationships with the recognition model once its training reaches the standard. Compared with the prior art, filtering by relation-related keywords removes redundant and irrelevant text data, so that noisy text data is removed as far as possible and the proportion of effective text data increases; the accuracy of the recognition model is improved and the model is trained faster. The accuracy and efficiency of entity relationship recognition can therefore be improved.
Step 209, performing subject supplement processing on the sentences lacking a subject in the text to be recognized, acquiring the sentences containing entity pairs in the text to be recognized after the subject supplement processing, acquiring the entity information features corresponding to the entities in the entity pairs, and inputting the entity information features, the entity pairs and the sentences containing the entity pairs into the preset recognition model for deep learning.
Optionally, in order to improve the accuracy of entity relationship recognition, before the subject supplement processing is performed on the sentences lacking a subject in the text to be recognized, the method in this embodiment may further include: performing corpus cleaning on the text to be recognized. Correspondingly, performing subject supplement processing on the sentences lacking a subject in the text to be recognized may specifically include: performing subject supplement processing on the sentences lacking a subject in the text to be recognized after corpus cleaning. For example, the corpus of the text to be recognized may be cleaned with methods such as regular expressions: invalid garbled characters, symbols and the like are removed from the corpus, English punctuation mixed into Chinese punctuation and mixed full-width and half-width characters are corrected, repeated punctuation is checked, and finally duplicate sentences are removed from the corpus. The corpus cleaning operation performs an initial cleaning of the text to be recognized, removes useless information, and unifies the format, so that normalized text data is obtained; this makes it easier to provide the recognition model with input in a standard format and improves the accuracy of entity relationship recognition.
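A minimal sketch of such regular-expression-based cleaning is shown below. The exact rules are assumptions: here full-width forms are folded to half-width with Unicode NFKC normalization, which is one possible way to unify the format and may not match the rules actually used in the embodiment (for Chinese text one might prefer to normalize toward Chinese punctuation instead).

```python
import re
import unicodedata

def clean_corpus(sentences):
    """Normalize, strip debris, collapse repeated punctuation, and deduplicate."""
    cleaned, seen = [], set()
    for text in sentences:
        # Fold full-width letters, digits and punctuation to their half-width forms.
        text = unicodedata.normalize("NFKC", text)
        # Remove non-whitespace control characters and the Unicode replacement character.
        text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]", "", text)
        # Collapse immediately repeated punctuation marks such as "!!" into one.
        text = re.sub(r"([,.!?;:])\1+", r"\1", text)
        # Collapse runs of whitespace and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
        if text and text not in seen:  # sentence-level deduplication
            seen.add(text)
            cleaned.append(text)
    return cleaned

raw = [
    "Company C owns Company D!!",
    "Company C owns Company D\uff01\uff01",  # same sentence with full-width marks
    "  \ufffd ",                             # only debris, dropped after cleaning
]
result = clean_corpus(raw)
```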
Similar to the processing procedure in step 202, if there is a sentence with a missing subject in the text to be recognized, the subject may be automatically supplemented, and correspondingly, optionally, the subject supplementing processing is performed on the sentence with the missing subject in the text to be recognized, which may specifically include: after word segmentation processing, part of speech tagging and named entity recognition are respectively carried out on each sentence in a text to be recognized, sentence-by-sentence judgment is carried out on each sentence; if the first participle after skipping the leading word in the current sentence is the NER entity, judging whether the NER entity has pronouns within a preset distance threshold range in the text or not, and if the pronouns exist within the preset distance threshold range, performing subject addition replacement on the pronouns according to the NER entity; if the first participle after skipping the leading word in the current sentence is not the NER entity, judging whether the first participle is a pronoun or not, and if the first participle is a pronoun, performing subject addition replacement on the position of the first participle; and if the first participle is not a pronoun, judging whether the subject needing to be added appears in the current sentence, and if the subject needing to be added does not appear in the current sentence, adding the subject for the current sentence. The specific subject completion process may refer to the corresponding description in step 202, and is not described herein again. Through the processing, the accuracy of the subsequent entity relationship identification can be improved.
Optionally, acquiring the entity information features corresponding to the entities in the entity pair specifically includes: acquiring, from outside the corpus of the text to be recognized, the entity description information corpus corresponding to the entities in the entity pair; performing corpus cleaning on the entity description information corpus; performing word segmentation on the cleaned entity description information corpus; and performing word embedding representation on the resulting participles, inputting them into a convolutional neural network, and applying a max pooling layer to obtain the entity information features corresponding to the entities in the entity pair. For the specific processing procedure, refer to the description of the example in fig. 5, which is not repeated here. By acquiring the entity description information corpus corresponding to the entities in the entity pair from outside the corpus of the text to be recognized and using it as one of the model inputs, description information is added to the entities and more related content that helps to recognize the meaning of the entities is taken into account, which can improve the accuracy with which the recognition model recognizes entity relationships.
Step 210, determining the entity relationship in the text to be recognized according to the classification result output by the preset recognition model.
For example, the features extracted from the text to be recognized are input into the recognition model; if, according to the output result, the probability that company C and company D are in a parent-subsidiary company relationship is 0.98, a triple {company C, parent-subsidiary company relationship, company D} is generated. The entity relationship in the text is thereby accurately identified and extracted.
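Turning the classification result into a triple can be sketched as follows. The function name, the "NA" label for "no relation" and the 0.5 threshold are assumptions introduced for illustration; the source only gives the example probability 0.98.

```python
def to_triple(head, tail, probabilities, labels, threshold=0.5):
    """Return (head, relation, tail) for the most probable relation, or None.

    `threshold` and the "NA" (no relation) label are hypothetical choices.
    """
    best = max(range(len(labels)), key=probabilities.__getitem__)
    if labels[best] == "NA" or probabilities[best] < threshold:
        return None
    return (head, labels[best], tail)

labels = ["NA", "parent-subsidiary company relationship"]
triple = to_triple("company C", "company D", [0.02, 0.98], labels)
```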
In order to fully describe the overall implementation of the method of this embodiment, the overall flow of the solution is given, as shown in fig. 10. The flow includes the following. The corpus is input and then cleaned, mainly with regular expressions. The subject of each sentence in each paragraph is then automatically supplemented, and the text is split into sentences. Next, filtering based on relation-related keywords removes redundant and irrelevant text data. Dependency syntax analysis is then performed on the sentences with the LTP toolkit to obtain initial triples, which serve as a small knowledge graph for the remote supervision models. Entity description information is then extracted from web pages such as Wikipedia, and entity information features are extracted through a CNN. The extracted entity information features, the word embedding and the position embedding are then used together as the model input for each of the three remote supervision models PCNN + MIL, PCNN + ATT and BiLSTM + ATT + MIL. Finally, based on the stacking method, the training and testing data are divided into 3 equal parts, and the three remote supervision models are trained and tested respectively to perform model fusion.
The scheme in this embodiment provides a method for automatically supplementing the subject, which avoids the situation in which a large amount of text data containing useful information cannot be learned, and improves the overall accuracy and recall. A filtering method based on relation-related keywords is also provided: redundant and irrelevant text data is removed, the proportion of effective text data is increased, the accuracy of the model is improved, and model training is faster. In addition, an entity information description is added to the input of the remote supervision models and used together with the word embedding and the position embedding as the model input, so that description information is added to the entity and the accuracy of the model is improved. Model fusion is also performed, which helps to prevent overfitting and effectively combines the advantages of the three remote supervision models PCNN + MIL, PCNN + ATT and BiLSTM + ATT + MIL, further improving the accuracy of the result.
Further, as a specific implementation of the method shown in fig. 1 and fig. 2, this embodiment provides an apparatus for identifying an entity relationship, as shown in fig. 11, the apparatus includes: a processing module 31, an obtaining module 32, a calculating module 33, and a determining module 34.
The processing module 31 is configured to perform subject supplement processing on a sentence lacking a subject in the text to be recognized;
the obtaining module 32 is configured to obtain a sentence that contains an entity pair in the text to be recognized after the subject supplement processing;
the obtaining module 32 is further configured to obtain an entity information feature corresponding to the entity in the entity pair;
the calculation module 33 is configured to input the entity information features, the entity pairs, and the sentences containing the entity pairs into a preset recognition model for deep learning;
and the determining module 34 is configured to determine an entity relationship in the text to be recognized according to the classification result output by the preset recognition model.
In a specific application scenario, the processing module 31 is specifically configured to perform sentence-by-sentence judgment on each sentence after performing word segmentation processing, part-of-speech tagging and named entity identification on each sentence in the text to be identified; if the first participle after skipping the leading word in the current sentence is an NER entity, judging whether pronouns exist in a preset distance threshold range of the NER entity in the text or not, and if the pronouns exist in the preset distance threshold range, performing subject addition replacement on the pronouns according to the NER entity; if the first participle after skipping the leading word in the current sentence is not the NER entity, judging whether the first participle is a pronoun or not, and if the first participle is a pronoun, performing subject addition replacement on the position of the first participle; and if the first participle is not a pronoun, judging whether the subject needing to be added appears in the current sentence, and if the subject needing to be added does not appear in the current sentence, adding the subject for the current sentence.
In a specific application scenario, the obtaining module 32 is specifically configured to obtain, from outside the corpus of the text to be recognized, an entity description information corpus corresponding to an entity in the entity pair; performing corpus cleaning on the entity description information corpus; performing word segmentation processing on the entity description information corpus after corpus cleaning; and performing word embedding expression on the processed word segmentation, inputting the word segmentation into a convolutional neural network, and performing maximum pooling layer processing to obtain entity information characteristics corresponding to the entity in the entity pair.
In a specific application scenario, the apparatus further comprises: a labeling module and a training module;
the obtaining module 32 is further configured to obtain a sample sentence in the sample text, where the sample sentence contains an entity pair;
the labeling module is used for labeling the relation labels on the sample sentences containing the entity pairs to obtain a first sentence set corresponding to the target relation labels;
the obtaining module 32 is further configured to obtain, according to the entity pair contained in each sample sentence in the first sentence set, the segmented words whose occurrence frequency within a predetermined range around the head entity and the tail entity meets a preset condition;
and the training module is configured to train the preset recognition model with a deep learning algorithm on a second sentence set, which consists of the sentences in the first sentence set that contain the segmented words meeting the preset condition.
In a specific application scenario, the obtaining module 32 is further configured to perform word frequency statistics on a preset number of segmented words before and after the head entity and a preset number of segmented words before and after the tail entity in each sample sentence of the first sentence set; sort the segmented words according to the word frequency statistics to obtain a preset number of top-ranked segmented words; and perform anomaly filtering on the top-ranked segmented words, taking those that are not filtered out as the segmented words that meet the preset condition.
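The keyword selection and the resulting second sentence set could be sketched as below. This is only an illustration under stated assumptions: the function names, the window size, and the sample data are hypothetical, anomaly filtering is modelled as removal of a small `noise` set (the text does not define the filter here), and entity names themselves are excluded from the counts.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple

Sample = Tuple[List[str], str, str]  # (segmented sentence, head entity, tail entity)


def context_words(tokens: List[str], entity: str, window: int) -> List[str]:
    # Collect up to `window` words before and after each occurrence of the entity.
    words = []
    for i, tok in enumerate(tokens):
        if tok == entity:
            words.extend(tokens[max(0, i - window):i])
            words.extend(tokens[i + 1:i + 1 + window])
    return words


def relation_keywords(samples: List[Sample], window: int = 1, top_n: int = 3,
                      noise: FrozenSet[str] = frozenset()) -> List[str]:
    counts = Counter()
    for tokens, head, tail in samples:
        nearby = context_words(tokens, head, window) + context_words(tokens, tail, window)
        # Assumption: the entities themselves are not counted as keywords.
        counts.update(w for w in nearby if w not in (head, tail))
    # Rank by frequency first, then drop anomalous words from the top-ranked list.
    top = [word for word, _ in counts.most_common(top_n)]
    return [word for word in top if word not in noise]


def second_sentence_set(samples: List[Sample], keywords: List[str]) -> List[Sample]:
    # Keep only the sample sentences that contain at least one keyword.
    wanted = set(keywords)
    return [s for s in samples if wanted.intersection(s[0])]


# Toy first sentence set for one relation label (hypothetical data).
first_set = [
    (["Alice", "founded", "Acme", "."], "Alice", "Acme"),
    (["Bob", "founded", "Initech", "."], "Bob", "Initech"),
    (["Carol", "founded", "Globex", "."], "Carol", "Globex"),
    (["Dave", "established", "Hooli", "today"], "Dave", "Hooli"),
    (["Eve", "established", "Umbrella", "yesterday"], "Eve", "Umbrella"),
    (["Frank", "visited", "Stark", "once"], "Frank", "Stark"),
]
keywords = relation_keywords(first_set, window=1, top_n=3, noise=frozenset({"."}))
print(keywords)
print(len(second_sentence_set(first_set, keywords)))
```

In this toy data the sentence about "Frank" contains no selected keyword, so it is dropped from the second sentence set as likely noise for the relation.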
In a specific application scenario, the processing module 31 is further configured to perform corpus cleaning on an input sample text; perform subject supplement processing on the sentences that lack a subject in the cleaned sample text; split the sample text into sentences after the subject supplement processing; and extract, based on dependency syntax analysis, an initial triple from each sample sentence obtained by the sentence splitting;
the obtaining module 32 is further specifically configured to map the initial triples onto the single sentences obtained by the sentence splitting, so as to screen out the sample sentences that contain an entity pair.
In a specific application scenario, the processing module 31 is specifically configured to perform word segmentation on a target sentence; perform part-of-speech tagging on the segmented words; perform named entity recognition on the segmented words and their part-of-speech tags to obtain the entity annotations of the target sentence; and perform dependency syntax analysis based on the entity annotations to obtain an initial triple of the target sentence.
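The triple extraction and sentence screening described above could look roughly like the following sketch. Note the simplification: instead of a real dependency parser, it uses a nearest-entity-around-a-verb heuristic as a stand-in, and it assumes the sentences have already been segmented, tagged and entity-annotated. The tag `VERB_TAG` and the function names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str, bool]    # (word, part-of-speech tag, is NER entity)
Triple = Tuple[str, str, str]    # (head entity, relation word, tail entity)

VERB_TAG = "v"  # hypothetical tag for verbs


def initial_triples(tagged: List[Token]) -> List[Triple]:
    # Heuristic stand-in for dependency parsing: for each verb, pair the
    # nearest entity on its left with the nearest entity on its right.
    triples = []
    for i, (word, pos, _) in enumerate(tagged):
        if pos != VERB_TAG:
            continue
        left = [w for w, _, is_ent in tagged[:i] if is_ent]
        right = [w for w, _, is_ent in tagged[i + 1:] if is_ent]
        if left and right:
            triples.append((left[-1], word, right[0]))
    return triples


def screen_sentences(tagged_sentences: List[List[Token]]) -> List[Tuple[str, List[Triple]]]:
    # Keep only sentences for which at least one triple (entity pair) was found.
    kept = []
    for tagged in tagged_sentences:
        triples = initial_triples(tagged)
        if triples:
            sentence = " ".join(word for word, _, _ in tagged)
            kept.append((sentence, triples))
    return kept


tagged_sentences = [
    [("Alice", "n", True), ("founded", "v", False), ("Acme", "n", True)],
    [("It", "r", False), ("rained", "v", False), ("yesterday", "t", False)],
]
print(screen_sentences(tagged_sentences))
```

A production pipeline would presumably replace `initial_triples` with the output of an actual dependency parser; the screening step that maps triples back to their source sentences would stay the same.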
In a specific application scenario, the training module is specifically configured to obtain entity description information corresponding to the entities in each sample sentence in the second sentence set; perform corpus cleaning on the entity description information; perform word segmentation on the cleaned entity description information; represent the resulting segmented words as word embeddings, input the embeddings into a convolutional neural network, and apply a max-pooling layer to obtain the entity information features corresponding to the entity description information; create a training set from the entity information features corresponding to the sample sentences in the second sentence set, the sentence data, and the relation labels corresponding to the second sentence set; and train the preset recognition model with a deep learning algorithm based on the training set.
In a specific application scenario, the training module is further specifically configured to train, using a stacking algorithm based on the training set, a PCNN + MIL distant supervision model, a PCNN + ATT distant supervision model, and a BiLSTM + ATT + MIL distant supervision model respectively; and to perform model fusion on the three trained models to obtain the preset recognition model.
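To make the fusion step concrete, here is a small sketch that combines the class probabilities of three already-trained models by weighted averaging. This is a deliberately simplified stand-in: a full stacking setup would train a meta-learner on the base models' outputs, which is not shown. The label names, probabilities, and function names are hypothetical.

```python
from typing import List, Optional, Tuple

LABELS = ["founder_of", "employee_of", "no_relation"]  # hypothetical relation labels


def fuse(prob_lists: List[List[float]], weights: Optional[List[float]] = None) -> List[float]:
    # Weighted average of per-class probabilities across models.
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total = sum(weights)
    num_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(num_classes)
    ]


def predict(prob_lists: List[List[float]], labels: List[str],
            weights: Optional[List[float]] = None) -> Tuple[str, float]:
    fused = fuse(prob_lists, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best], fused[best]


# Hypothetical outputs of the three base models for one sentence.
model_outputs = [
    [0.6, 0.3, 0.1],  # PCNN + MIL
    [0.5, 0.4, 0.1],  # PCNN + ATT
    [0.2, 0.7, 0.1],  # BiLSTM + ATT + MIL
]
print(predict(model_outputs, LABELS))
print(predict(model_outputs, LABELS, weights=[0.5, 0.4, 0.1]))
```

With equal weights the fused prediction here follows the third model's strong vote, while weighting the first two models more heavily flips the result, which illustrates why the fusion weights (or a learned meta-model) matter.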
In a specific application scenario, the processing module 31 is further configured to perform corpus cleaning on the text to be recognized; correspondingly, the processing module 31 is further specifically configured to perform subject supplement processing on the sentences that lack a subject in the cleaned text to be recognized.
It should be noted that other corresponding descriptions of the functional units related to the apparatus for identifying an entity relationship provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not described herein again.
Based on the above methods shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the above method for identifying entity relationships shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and which includes several instructions for enabling a computer device (such as a personal computer, a server, or a network device) to execute the method of the embodiments of the present application.
Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 11, in order to achieve the above object, an embodiment of the present application further provides an electronic device, which may be a personal computer, a notebook computer, a smart phone, a server, or other network devices, and the device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the method for identifying entity relationships as shown in fig. 1 and 2.
Optionally, the physical device may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.
It will be understood by those skilled in the art that the physical device structure provided in the present embodiment does not limit the physical device, which may include more or fewer components, combine some components, or arrange the components differently.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the above-described physical device and supports the operation of the information processing program as well as other software and/or programs. The network communication module is used for realizing communication among the components in the storage medium and communication with other hardware and software in the physical device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, or by hardware. Compared with the prior art, the scheme of this embodiment filters by relation-related keywords to remove redundant and irrelevant text data, which removes noisy text data as far as possible and increases the proportion of effective text data. This can effectively improve the accuracy of the recognition model and speed up its training, so the accuracy and efficiency of entity relationship identification can both be improved.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The serial numbers used in the present application are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure covers only a few specific implementation scenarios of the present application; the present application is not limited thereto, and any variations that can be conceived by those skilled in the art are intended to fall within the scope of the present application.

Claims (9)

1. A method for identifying entity relationships, comprising:
carrying out subject supplement processing on sentences lacking subjects in the text to be recognized;
obtaining sentences containing entity pairs in the text to be recognized after subject supplement processing;
acquiring, from outside the corpus of the text to be recognized, an entity description information corpus corresponding to the entities in the entity pair; performing corpus cleaning on the entity description information corpus; performing word segmentation on the cleaned entity description information corpus; representing the resulting segmented words as word embeddings, inputting the embeddings into a convolutional neural network, and applying a max-pooling layer to obtain the entity information features corresponding to the entities in the entity pair;
inputting the entity information features, the entity pairs and the sentences containing the entity pairs into a preset recognition model for deep learning;
determining an entity relationship in the text to be recognized according to a classification result output by the preset recognition model;
wherein, the training process of the preset recognition model comprises the following steps:
obtaining a sample sentence containing an entity pair in a sample text;
labeling a relation label to the sample statement containing the entity pair to obtain a first statement set corresponding to a target relation label, wherein the sample statements in the first statement set correspond to the same relation label;
performing word frequency statistics on a preset number of segmented words before and after the head entity and a preset number of segmented words before and after the tail entity in each sample sentence of the first sentence set;
sorting according to the word frequency statistics to obtain a preset number of top-ranked segmented words;
performing anomaly filtering on the top-ranked segmented words, and taking the segmented words that are not filtered out as the segmented words that meet the preset condition;
training the preset recognition model with a deep learning algorithm according to a second sentence set, the second sentence set consisting of the sample sentences in the first sentence set that contain the segmented words meeting the preset condition, which specifically comprises: acquiring, from outside the sample sentences in the second sentence set, entity description information corresponding to the entities in those sample sentences; performing corpus cleaning on the entity description information; performing word segmentation on the cleaned entity description information; representing the resulting segmented words as word embeddings, inputting the embeddings into a convolutional neural network, and applying a max-pooling layer to obtain the entity information features corresponding to the entity description information; creating a training set from the entity information features corresponding to the sample sentences in the second sentence set, the sentence data, and the relation labels corresponding to the second sentence set; and training with a deep learning algorithm based on the training set to obtain the preset recognition model.
2. The method according to claim 1, wherein the subject supplement processing of the sentences lacking a subject in the text to be recognized specifically includes:
performing word segmentation, part-of-speech tagging and named entity recognition (NER) on each sentence in the text to be recognized, and then judging the sentences one by one;
if the first segmented word after skipping the leading word in the current sentence is an NER entity, judging whether a pronoun exists within a preset distance threshold of the NER entity in the text, and if so, replacing the pronoun with the NER entity as the subject;
if the first segmented word after skipping the leading word in the current sentence is not an NER entity, judging whether the first segmented word is a pronoun, and if it is, replacing it at its position with the subject to be added; and if the first segmented word is not a pronoun, judging whether the subject to be added already appears in the current sentence, and if it does not, adding the subject to the current sentence.
3. The method of claim 1, wherein prior to obtaining the sample sentence having the entity pair in the sample text, the method further comprises:
performing corpus cleaning on an input sample text;
carrying out subject supplement processing on sentences which lack subjects in the sample text after the corpus is cleaned;
splitting the sample text into sentences after the subject supplement processing;
extracting, based on dependency syntax analysis, an initial triple from each sample sentence obtained by the sentence splitting;
the obtaining of the sample sentence containing the entity pair in the sample text specifically includes:
mapping the initial triples onto the single sentences obtained by the sentence splitting, so as to screen out the sample sentences containing an entity pair.
4. The method according to claim 3, wherein the extracting, based on dependency syntax analysis, of an initial triple from each sample sentence obtained by the sentence splitting specifically includes:
performing word segmentation on the target sentence;
performing part-of-speech tagging on the segmented words;
performing named entity recognition on the segmented words and their part-of-speech tags to obtain the entity annotations of the target sentence;
and performing dependency syntax analysis based on the entity annotations to obtain an initial triple of the target sentence.
5. The method according to claim 1, wherein the training with the deep learning algorithm based on the training set to obtain the preset recognition model specifically comprises:
training, by using a stacking algorithm based on the training set, a PCNN + MIL distant supervision model, a PCNN + ATT distant supervision model and a BiLSTM + ATT + MIL distant supervision model respectively;
and performing model fusion on the trained PCNN + MIL distant supervision model, PCNN + ATT distant supervision model and BiLSTM + ATT + MIL distant supervision model to obtain the preset recognition model.
6. The method according to claim 1, wherein before the subject supplement processing is performed on the sentences lacking a subject in the text to be recognized, the method further comprises:
performing corpus cleaning on the text to be recognized;
the subject supplement processing of the sentence lacking the subject in the text to be recognized specifically includes:
and performing subject supplement processing on the sentences which lack subjects in the text to be recognized after the corpus is cleaned.
7. An apparatus for identifying entity relationships, comprising:
the processing module is used for carrying out subject supplement processing on the sentences which lack subjects in the text to be recognized;
the acquisition module is used for acquiring sentences containing entity pairs in the text to be recognized after subject supplement processing;
the obtaining module is further configured to obtain, from outside the corpus of the text to be recognized, an entity description information corpus corresponding to the entities in the entity pair; perform corpus cleaning on the entity description information corpus; perform word segmentation on the cleaned entity description information corpus; and represent the resulting segmented words as word embeddings, input the embeddings into a convolutional neural network, and apply a max-pooling layer to obtain the entity information features corresponding to the entities in the entity pair;
the calculation module is used for inputting the entity information characteristics, the entity pairs and the sentences containing the entity pairs into a preset recognition model for deep learning;
the determining module is used for determining the entity relationship in the text to be recognized according to the classification result output by the preset recognition model;
the acquisition module is further used for acquiring sample sentences containing entity pairs in the sample texts;
the labeling module is used for labeling the relation labels on the sample sentences containing the entity pairs to obtain a first sentence set corresponding to the target relation labels, wherein the sample sentences in the first sentence set correspond to the same relation labels;
the obtaining module is further configured to perform word frequency statistics on a preset number of segmented words before and after the head entity and a preset number of segmented words before and after the tail entity in each sample sentence of the first sentence set; sort according to the word frequency statistics to obtain a preset number of top-ranked segmented words; and perform anomaly filtering on the top-ranked segmented words, taking the segmented words that are not filtered out as the segmented words that meet the preset condition;
the training module is used for acquiring, from outside the sample sentences in a second sentence set, entity description information corresponding to the entities in those sample sentences, the second sentence set consisting of the sample sentences in the first sentence set that contain the segmented words meeting the preset condition; performing corpus cleaning on the entity description information; performing word segmentation on the cleaned entity description information; representing the resulting segmented words as word embeddings, inputting the embeddings into a convolutional neural network, and applying a max-pooling layer to obtain the entity information features corresponding to the entity description information; creating a training set from the entity information features corresponding to the sample sentences in the second sentence set, the sentence data, and the relation labels corresponding to the second sentence set; and training with a deep learning algorithm based on the training set to obtain the preset recognition model.
8. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any of claims 1 to 6.
9. An electronic device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the method of any one of claims 1 to 6 when executing the computer program.
CN202011461566.6A 2020-12-14 2020-12-14 Entity relationship identification method and device and electronic equipment Active CN112270196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011461566.6A CN112270196B (en) 2020-12-14 2020-12-14 Entity relationship identification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011461566.6A CN112270196B (en) 2020-12-14 2020-12-14 Entity relationship identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112270196A CN112270196A (en) 2021-01-26
CN112270196B true CN112270196B (en) 2022-04-29

Family

ID=74350055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011461566.6A Active CN112270196B (en) 2020-12-14 2020-12-14 Entity relationship identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112270196B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160917B (en) * 2021-05-18 2022-11-01 山东浪潮智慧医疗科技有限公司 Electronic medical record entity relation extraction method
CN113268575B (en) * 2021-05-31 2022-08-23 厦门快商通科技股份有限公司 Entity relationship identification method and device and readable medium
CN113420120A (en) * 2021-06-24 2021-09-21 平安科技(深圳)有限公司 Training method, extracting method, device and medium of key information extracting model
CN113569046B (en) * 2021-07-19 2022-10-21 北京华宇元典信息服务有限公司 Judgment document character relation identification method and device and electronic equipment
CN113571052A (en) * 2021-07-22 2021-10-29 湖北亿咖通科技有限公司 Noise extraction and instruction identification method and electronic equipment
CN114266253B (en) * 2021-12-21 2024-01-23 武汉百智诚远科技有限公司 Method for identifying semi-supervised named entity without marked data
CN113987090B (en) * 2021-12-28 2022-03-25 北京泷汇信息技术有限公司 Sentence-in-sentence entity relationship model training method and sentence-in-sentence entity relationship identification method
CN114021572B (en) * 2022-01-05 2022-03-22 苏州浪潮智能科技有限公司 Natural language processing method, device, equipment and readable storage medium
CN114925694A (en) * 2022-05-11 2022-08-19 厦门大学 Method for improving biomedical named body recognition by utilizing entity discrimination information
CN115358341B (en) * 2022-08-30 2023-04-28 北京睿企信息科技有限公司 Training method and system for instruction disambiguation based on relational model

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798136B (en) * 2017-11-23 2020-12-01 北京百度网讯科技有限公司 Entity relation extraction method and device based on deep learning and server
CN108874878B (en) * 2018-05-03 2021-02-26 众安信息技术服务有限公司 Knowledge graph construction system and method
CN109165385B (en) * 2018-08-29 2022-08-09 中国人民解放军国防科技大学 Multi-triple extraction method based on entity relationship joint extraction model
CN110245239A (en) * 2019-05-13 2019-09-17 吉林大学 A kind of construction method and system towards automotive field knowledge mapping
CN110263324B (en) * 2019-05-16 2021-02-12 华为技术有限公司 Text processing method, model training method and device
CN110347894A (en) * 2019-05-31 2019-10-18 平安科技(深圳)有限公司 Knowledge mapping processing method, device, computer equipment and storage medium based on crawler
CN110704576B (en) * 2019-09-30 2022-07-01 北京邮电大学 Text-based entity relationship extraction method and device
CN111241295B (en) * 2020-01-03 2022-05-03 浙江大学 Knowledge graph relation data extraction method based on semantic syntax interactive network
CN111950269A (en) * 2020-08-21 2020-11-17 清华大学 Text statement processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112270196A (en) 2021-01-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant