CN113901827B - Entity identification and relation extraction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113901827B
Authority
CN
China
Prior art keywords
entity
word
tail
head
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111504146.6A
Other languages
Chinese (zh)
Other versions
CN113901827A (en)
Inventor
李征仁
张晓航
杜瑜
韩华伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202111504146.6A priority Critical patent/CN113901827B/en
Publication of CN113901827A publication Critical patent/CN113901827A/en
Application granted granted Critical
Publication of CN113901827B publication Critical patent/CN113901827B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Abstract

The application provides an entity identification and relationship extraction method and apparatus, an electronic device and a storage medium. The method includes: constructing an entity data set containing domain nouns; taking the entity data set as the training corpus and performing masked training on a pre-trained BERT model to obtain a domain language model; and recognizing the head entity and tail entity of each tagged word in the domain text data to be processed through the domain language model, and extracting the relationship between the head entities and the tail entities. Because the method automatically constructs the entity data set containing domain nouns, it needs only a small amount of manual labeling work, which improves the efficiency of entity identification and relationship extraction in the field. Meanwhile, since the domain language model is trained on an entity data set containing domain nouns, the entity recognition and relation extraction tasks can be accurately completed in each domain through the domain language model.

Description

Entity identification and relation extraction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the technical field of natural language processing and knowledge extraction, and in particular, to a method and an apparatus for entity identification and relationship extraction, an electronic device, and a storage medium.
Background
In entity identification and relationship extraction tasks, the most common approach combines a natural language processing model with manual annotation: data are first labeled manually using the BIOES tagging scheme, and the natural language processing model is then trained on the manually labeled data to learn the relationships between the words in a sentence, thereby accomplishing entity identification and relationship extraction.
This approach therefore requires a thorough prior understanding of the knowledge system of the corresponding industry and a reasonable classification of all possible relationships among entities; in addition, the entities and relationships in the sample corpus must be labeled correctly by the manual BIOES method, and the richer the manually labeled training data, the better the subsequent model trains.
In summary, because the conventional method obtains training data through manual labeling, it consumes a large amount of labor and time and reduces the efficiency of the entity identification and relationship extraction task. Moreover, owing to the particularity of domain text, high demands are placed on the professional level and carefulness of the annotators, and the subjective factors involved in manual labeling lead to inaccurate annotations, which in turn reduce the accuracy of entity identification and relationship extraction.
Disclosure of Invention
The application provides an entity identification and relationship extraction method, an entity identification and relationship extraction device, an electronic device and a storage medium, and aims to solve the problems and defects in the prior art.
In a first aspect, the present application provides an entity identification and relationship extraction method, including:
constructing an entity data set containing domain nouns;
determining the entity data set as a training set corpus, and performing masking training on a pre-training BERT model to obtain a domain language model;
and identifying a head entity and a tail entity of each marked word in the field text data to be processed through the field language model, and extracting an entity relationship between the head entity and the tail entity of each marked word.
In one embodiment, the identifying, by the domain language model, a head entity and a tail entity of each tagged word in the domain text data to be processed, and extracting an entity relationship between the head entity and the tail entity of each tagged word, includes:
identifying a head entity and a tail entity in each tagged word through the domain language model;
calculating attention weight between each mark word and the corresponding head entity and tail entity;
extracting the relationship between each head entity and each tail entity based on the attention weight of each said tagged word to its head entity and tail entity.
Extracting the relationship between each head entity and each tail entity based on the attention weight of each tagged word and the head entity and the tail entity thereof, including:
normalizing the attention weights of the tagged words and the head entities and the tail entities thereof to obtain a first normalized association degree between the tagged words and the head entities thereof and a second normalized association degree between the tagged words and the tail entities thereof;
calculating the joint association degree between each tagged word and the corresponding head entity and tail entity according to the first normalized association degree and the second normalized association degree of each tagged word;
and determining the final relation between the head entity and the tail entity according to the joint association degree of each tagged word.
Determining a final relationship between a head entity and a tail entity according to the joint association degree of each tagged word, including:
determining a relation word with the closest relation between a head entity and a tail entity in each mark word according to the joint association degree of each mark word;
and obtaining the final relation between the head entity and the tail entity in each marked word according to the relation word of each marked word.
The calculating the attention weight between each mark word and the corresponding head entity and tail entity comprises the following steps:
determining the weighted association degree obtained after each marked word is extracted by the domain language model;
determining a first number of Transformer layers in the domain language model and a second number of attention heads in each Transformer layer;
and calculating attention weight between each mark word and the corresponding head entity and tail entity of the mark word by combining a preset calculation formula and based on the weighted association degree of each mark word and the first quantity and the second quantity.
The constructing of the entity data set containing the domain nouns comprises:
cutting the original text data through a preset word cutting tool to obtain each entity data to be processed;
and fusing each to-be-processed entity data with a domain noun set to construct the entity data set containing the domain nouns.
After identifying the head entity and the tail entity of each tagged word in the to-be-processed domain text data and extracting the entity relationship between the head entity and the tail entity of each tagged word through the domain language model, the method further comprises the following steps:
and constructing an entity triple of each tagged word according to the entity relationship between the head entity and the tail entity in each tagged word and the corresponding head entity and tail entity.
In a second aspect, the present application further provides an entity identification and relationship extraction apparatus, including:
the construction module is used for constructing an entity data set containing domain nouns;
the training module is used for determining the entity data set as a training set corpus and conducting masking training on the pre-training BERT model to obtain a domain language model;
and the extraction module is used for identifying the head entity and the tail entity of each marked word in the field text data to be processed through the field language model and extracting the entity relationship between the head entity and the tail entity of each marked word.
In a third aspect, the present application further provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the entity identification and relationship extraction method according to the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium comprising a computer program which, when executed by a processor, implements the steps of the entity identification and relationship extraction method of the first aspect.
In a fifth aspect, the present application further provides a computer program product comprising a computer program which, when executed by the processor, implements the steps of the entity identification and relationship extraction method of the first aspect.
According to the entity identification and relationship extraction method and device, the electronic equipment and the storage medium, in the entity identification and relationship extraction process, the entity data set containing the field nouns is automatically constructed, a small amount of manual labeling work is needed, and therefore the efficiency of the entity identification and relationship extraction in the field is improved. Meanwhile, a domain language model is trained according to an entity data set containing domain nouns, so that the tasks of entity identification and relation extraction in the professional domain can be accurately finished through the domain language model.
Drawings
In order to illustrate the technical solutions of the present application or the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show some embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of an entity identification and relationship extraction method provided herein;
FIG. 2 is a second flowchart of the entity identification and relationship extraction method provided in the present application;
FIG. 3 is a third flowchart of the entity identification and relationship extraction method provided in the present application;
FIG. 4 is a schematic structural diagram of an entity identification and relationship extraction apparatus provided in the present application;
fig. 5 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions of the present application are described below clearly and completely with reference to the drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application.
The entity identification and relationship extraction method, apparatus, electronic device and storage medium provided by the present application are described below with reference to fig. 1 to 5.
The present application provides an entity identification and relationship extraction method, which is described below with reference to figs. 1 to 5.
The embodiments of the present application provide an entity identification and relationship extraction method. It should be noted that although a logical order is shown in the flowcharts, in some cases the steps may be performed in an order different from that shown or described here.
The embodiments of the present application take an electronic device as the execution subject by way of example, and take a data management system as one form of the electronic device; this does not limit the electronic device.
Specifically, referring to fig. 1, fig. 1 is a schematic flow diagram of an entity identification and relationship extraction method provided in the present application, and the entity identification and relationship extraction method provided in the embodiment of the present application includes:
in step S10, an entity data set containing domain nouns is constructed.
It should be noted that the entity identification and relationship extraction method provided by the present application can be applied to a plurality of fields including, but not limited to, the coal mine field, the oil field and the natural gas field, and is exemplified in the coal mine field in order to clearly illustrate various embodiments of the present application.
The text data in the coal mine field has distinct industry characteristics and therefore differs markedly from ordinary text data, most importantly in the following respects:
(1) Coal mine field text data contains many professional industry terms. For example, when naming a working area of a mining district, a coal mine enterprise often uses the form "number + working area category name" or "elevation + height unit + working area category name", as in the sentences "110505 coal face is arranged under a coal mine" and "-50 m horizontal two-rock gate auxiliary descending mountain and tetra-rock gate 11 track descending driving face have each been removed". In these sentences, "110505 coal face" and "-50 m horizontal two-rock gate" are the names of two mining working areas, and expressions of this form do not appear in general natural language.
(2) Coal mine field text data contains a large number of professional vocabulary items. The professional words involved in coal mine production work cover, among other things, instruments, job titles and working methods, such as "methane sensors", "paravanes" and "construction drilling".
(3) Some special sentence structures occur in coal mine field text data. Texts describing the course of coal mine safety accidents often depict the accident scene in as much detail as possible with long sentences, for example: "when the coal mining machine cut coal from the lower outlet direction of the working face to No. 52 fully-mechanized mining hydraulic support (the upper roller being at No. 64 fully-mechanized mining hydraulic support), the coal wall between the two ends of the upper and lower rollers of the fully-mechanized mining hydraulic support folded by the side protection plate was suddenly extruded as a whole". The accident rectification sections, in turn, often contain many sentences that lack a subject and are phrased as commands or suggestions, such as "carry out the 'non-repudiation' special action and seriously fight illegal and unlawful behaviors. First, hold one against three, take powerful measures, and strictly prevent covert engineering".
Therefore, the data management system needs to construct an entity data set containing coal mine field nouns, as detailed in steps S101 to S102 below.
Further, the description of steps S101 to S102 is as follows:
step S101, cutting original text data through a preset word cutting tool to obtain entity data to be processed;
and S102, fusing each entity data to be processed with a domain noun set to construct the entity data set containing the domain nouns.
It should be noted that the data management system is installed with a preset word cutting tool, and the preset word cutting tool includes, but is not limited to, a panning word cutting tool and an IK word cutting tool.
Specifically, the data management system obtains the original text data. It should be noted that the original text data does not carry coal mine field nouns, so such nouns need to be added. The data management system then segments the original text data with the preset word cutting tool to obtain each entity datum to be processed in the original text data, and finally fuses each entity datum to be processed with the coal mine field noun set to construct an entity data set containing coal mine field nouns.
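Steps S101 to S102 can be sketched as follows. The regex tokenizer stands in for the patent's unnamed preset word cutting tool, and the sample strings are illustrative, not taken from the patent's corpus:

```python
import re

def build_entity_dataset(raw_text, domain_nouns):
    """Steps S101-S102: cut the original text into candidate
    entity tokens, then fuse them with the domain-noun set."""
    # S101: naive word cutting (stand-in for the preset tool)
    candidates = re.findall(r"[\w\-]+", raw_text)
    # S102: fuse candidates with the curated domain nouns;
    # the set union keeps one copy of each entity string
    return sorted(set(candidates) | set(domain_nouns))

raw = "the methane sensor tripped near the 110505 coal face"
nouns = {"methane sensor", "110505 coal face", "hydraulic support"}
dataset = build_entity_dataset(raw, nouns)
```

A real implementation would use a Chinese segmenter rather than a regex, but the fusion step (a union of segmented tokens and curated domain nouns) is the same.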
Further, the data management system also needs to extract the character-level expression of each entity datum in the entity data set of coal mine field nouns. Specifically, it does so through a CNN (Convolutional Neural Network) model; that is, the data management system trains the model to learn the rules of domain term naming, thereby obtaining the character-level expressions of the entity words. The CNN parameters that may be set in this embodiment are: filter size 1 × 3, a single channel, stride 1, and max-pooling.
It can be further understood that in the process of constructing the entity data set containing the nouns in the coal mine field, a character-level expression of the entity data set containing the nouns in the coal mine field needs to be constructed at the same time.
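A minimal sketch of the character-level extractor with the parameters stated above (a width-3 window, single channel, stride 1, max-pooling over positions). The embedding dimension, initialisation and single-filter setup are illustrative assumptions, not the patent's:

```python
import numpy as np

def char_level_expression(char_embeddings, conv_filter):
    """Slide a width-3 filter (stride 1, single channel) over the
    character embeddings of one entity word and max-pool the
    resulting feature map into a single score."""
    n_chars = char_embeddings.shape[0]
    # stride-1 windows of 3 consecutive character embeddings
    windows = [char_embeddings[i:i + 3] for i in range(n_chars - 2)]
    feature_map = np.array([float(np.sum(w * conv_filter)) for w in windows])
    return feature_map.max()   # max-pooling over positions

rng = np.random.default_rng(0)
chars = rng.standard_normal((6, 8))   # a 6-character word, 8-dim embeddings
filt = rng.standard_normal((3, 8))    # one width-3 filter
feat = char_level_expression(chars, filt)
```

In practice many such filters run in parallel, giving one pooled feature per filter as the word's character-level expression.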
And step S20, determining the entity data set as a training set corpus, and performing masking training on the pre-trained BERT model to obtain a domain language model.
After constructing the entity data set containing coal mine field nouns and its character-level expression, the data management system needs to train a coal mine field language model. Specifically, the data management system takes the entity data set as the training corpus and performs masked language model (MLM) training on the pre-trained model, building a language model with the characteristics of the coal mine safety production field. Further, since the pre-trained language model is a BERT model, it can be understood that the data management system performs MLM training on the pre-trained BERT model through the entity data set, obtaining the coal mine field Coal-BERT model.
It should be noted that the input of the model is the concatenation of the word vector, the character vector and the position vector, so that the overall characteristics of the text are fully covered. In the MLM task, only the words in the entity data set are masked; other words are not masked, and the masked entity word vectors are output. In this way, the coal mine field Coal-BERT model learns the textual expression patterns of nouns in the coal mine safety production field well.
In this embodiment, for example, the input coal mine text data is "a gas extraction team leader ginger certain organization holding a president". The text is split and tagged to obtain "gas extraction team", "team leader", "ginger party", "organization", "holding" and "president". The entity data set is then used for masking, i.e. "ginger party" and "president" are masked, giving "gas extraction team", "team leader", "MASK", "organization", "holding" and "MASK". These tokens are used to train the pre-trained BERT model, and the expected prediction outputs are "ginger party" and "president".
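The masking scheme described here, mask only tokens found in the entity data set and leave everything else intact, can be sketched as follows. The token strings are illustrative stand-ins, not the patent's original Chinese example:

```python
def mask_entities(tokens, entity_set, mask_token="MASK"):
    """Mask only the tokens that appear in the entity data set;
    all other tokens pass through unchanged.  `labels` records
    the prediction target at each masked position (None where
    the MLM loss is not scored)."""
    masked, labels = [], []
    for tok in tokens:
        if tok in entity_set:
            masked.append(mask_token)
            labels.append(tok)       # target the model must predict
        else:
            masked.append(tok)
            labels.append(None)      # not scored by the MLM loss
    return masked, labels

# illustrative tokens standing in for the split coal mine sentence
tokens = ["gas extraction team", "team leader", "Jiang",
          "organization", "holding", "meeting"]
entities = {"Jiang", "meeting"}
masked, labels = mask_entities(tokens, entities)
# masked → ["gas extraction team", "team leader", "MASK",
#           "organization", "holding", "MASK"]
```

This differs from standard BERT pre-training, which masks 15% of tokens at random; restricting masking to entity words forces the model to learn domain-noun usage specifically.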
And step S30, identifying the head entity and the tail entity of each tagged word in the text data of the field to be processed through the field language model, and extracting the entity relationship between the head entity and the tail entity of each tagged word.
The data management system determines input coal mine field text data to be processed and each mark word in the coal mine field text data to be processed. And then, the data management system identifies a head entity and a tail entity of each marking word in the text data of the Coal mine field to be processed through a Coal mine field Coal-BERT model. Finally, the data management system extracts the entity relationship between the head entity and the tail entity in each tagged term according to the relationship between each tagged term and its corresponding head entity and tail entity, as described in step S301 to step S303.
Further, after extracting the entity relationship between the head entity and the tail entity in each tagged word, the data management system needs to construct an entity triple for each tagged word from that entity relationship and the corresponding head and tail entities. The structure of the entity triple is "head entity - entity relationship - tail entity", i.e. the entity triple may be expressed as (Entity_head, R_word, Entity_tail), where Entity_head denotes the head entity, R_word denotes the entity relationship, and Entity_tail denotes the tail entity.
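The "head entity - entity relationship - tail entity" structure can be represented as a simple named tuple; the sample strings are illustrative, not from the patent:

```python
from collections import namedtuple

# Entity triple (Entity_head, R_word, Entity_tail)
Triple = namedtuple("Triple", ["head", "relation", "tail"])

# illustrative instance
t = Triple(head="methane sensor", relation="installed", tail="110505 coal face")
```

Triples in this form can be loaded directly into a knowledge graph store, which is the usual downstream consumer of relation extraction output.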
According to the embodiment of the application, the entity relationship, head entity and tail entity of each tagged word are assembled into an entity triple in the coal mine field, so that the head entity and the tail entity are accurately associated.
Further, the description of steps S301 to S303 is as follows:
step S301, identifying a head entity and a tail entity in each tagged word through the domain language model;
step S302, calculating attention weights between each mark word and the corresponding head entity and tail entity;
step S303, extracting the relation between each head entity and each tail entity based on the attention weight of each mark word and the head entity and the tail entity of the mark word.
Specifically, the data management system identifies the head entity and tail entity in each tagged word through the coal mine field Coal-BERT model and assembles them into an HRT (head-relation-tail) structure. Next, the data management system calculates, through the Attention computing mechanism and the HRT structure, the attention weights between each tagged word and its corresponding head and tail entities, that is, the weights of the words related to the head and tail entities in each tagged word, as shown in steps S3021 to S3023. Finally, the data management system extracts the entity relationship between the head entity and tail entity in each tagged word according to the attention weight of each tagged word with its head and tail entities, as specifically shown in steps S3031 to S3033.
According to the embodiment of the application, the head entity and the tail entity in each tagged word are extracted through the domain language model, and the entity relationship between the head entity and the tail entity in each tagged word is extracted by combining the Attention computing mechanism and the HRT structure, so that the entity relationship of each tagged word is accurately extracted.
The embodiment provides an entity identification and relationship extraction method, in the process of entity identification and relationship extraction, an entity data set containing domain nouns is automatically constructed, and only a small amount of manual labeling work is needed, so that the efficiency of entity identification and relationship extraction in each domain is improved. Meanwhile, a domain language model is trained according to an entity data set containing domain nouns, so that entity recognition and relation extraction tasks can be accurately completed in the professional domain through the domain language model. In addition, in the whole process of entity identification and relationship extraction, manual operation is not required, so that the tasks of entity identification and relationship extraction are automatically completed.
Further, referring to fig. 2, fig. 2 is a second schematic flowchart of the entity identification and relationship extraction method provided in the present application, and the entity identification and relationship extraction method provided in the embodiment of the present application includes:
step S3021, determining a weighted association degree obtained after each tagged word is extracted by the domain language model;
step S3022, determining a first number of Transformer layers in the domain language model and a second number of attention heads in each Transformer layer;
step S3023, calculating the attention weight between each tagged word and its corresponding head and tail entities based on the weighted association degree of each tagged word together with the first number and the second number, in combination with a preset calculation formula.
It should be noted that the Coal-BERT model includes multiple Transformer layers, and each Transformer layer includes multiple attention heads for learning the attention weights of the relations between the words of a sentence.
Specifically, while extracting each tagged word through the coal mine field Coal-BERT model, the data management system determines the weighted association degree Cor_{m,n} obtained for each tagged word after extraction, where m denotes the head entity and n denotes the tail entity. Next, the data management system determines the first number A of Transformer layers in the Coal-BERT model and the second number H of attention heads in each Transformer layer. Finally, the data management system applies the preset calculation formula, which averages the weighted association degree over all layers and heads:

Atte_{m,n} = (1 / (A · H)) · Σ_{a=1..A} Σ_{h=1..H} Cor_{m,n}^{(a,h)}

Combining the weighted association degrees Cor_{m,n} with the first number A and the second number H, the attention weight Atte_{m,n} between each tagged word and its corresponding head and tail entities can be calculated, since Cor_{m,n}, A and H are all known.
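A hedged sketch of the attention-weight computation: the original equation image is not reproduced in the source, so the exact formula is an assumption here — namely that the preset formula averages the per-layer, per-head weighted association degrees over all A Transformer layers and H heads:

```python
def attention_weight(cor):
    """Average the per-layer, per-head weighted association
    degrees cor[a][h] over all A layers and H heads (assumed
    form of the patent's preset formula)."""
    A = len(cor)      # first number: Transformer layers
    H = len(cor[0])   # second number: attention heads per layer
    total = sum(cor[a][h] for a in range(A) for h in range(H))
    return total / (A * H)

# toy association scores for 2 layers x 3 heads
cor = [[0.2, 0.4, 0.6],
       [0.1, 0.3, 0.8]]
atte = attention_weight(cor)
```

In a real pipeline `cor[a][h]` would be the attention score that head h of layer a assigns between the tagged word and the head/tail entity positions.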
According to the embodiment of the application, the attention weight between each tagged word and its corresponding head and tail entities is accurately calculated from the weighted association degree of each tagged word, the first number of Transformer layers in the domain language model, and the second number of attention heads in each Transformer layer.
Further, referring to fig. 3, fig. 3 is a third schematic flowchart of the entity identification and relationship extraction method provided in the present application, and the entity identification and relationship extraction method provided in the embodiment of the present application includes:
step S3031, normalizing the attention weights of each tagged word and the head entity and the tail entity thereof to obtain a first normalized association degree between each tagged word and the head entity thereof and a second normalized association degree between each tagged word and the tail entity thereof;
step S3032, calculating the joint association degree between each tagged word and the corresponding head entity and tail entity according to the first normalized association degree and the second normalized association degree of each tagged word;
step S3033, determining the final relationship between the head entity and the tail entity according to the joint association degree of each tagged word.
Specifically, the data management system normalizes the attention weight between each tagged word and its head entity to obtain a first normalized association degree Cor_h-e(w_i) between each tagged word and its head entity, where w_i is the tagged word, i = 1, …, N, and N is the number of tagged words. Likewise, the data management system normalizes the attention weight between each tagged word and its tail entity to obtain a second normalized association degree Cor_t-e(w_i). The data management system then multiplies the first and second normalized association degrees and takes the negative logarithm of the product, obtaining the joint association degree R(w_i) between each tagged word and its corresponding head and tail entities, i.e. R(w_i) = -log{Cor_h-e(w_i) · Cor_t-e(w_i)}. Finally, the data management system extracts the final entity relationship between the head entity and tail entity in each tagged word according to the joint association degree of each tagged word, as specified in steps S30331 to S30332.
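The joint association degree R(w_i) = -log{Cor_h-e(w_i) · Cor_t-e(w_i)} is a one-line computation; a sketch with toy normalized association degrees:

```python
import math

def joint_association(cor_head, cor_tail):
    """R(w_i) = -log(Cor_h-e(w_i) * Cor_t-e(w_i)).  Smaller values
    mean the word is more strongly tied to both entities, since
    the normalized degrees lie in (0, 1]."""
    return -math.log(cor_head * cor_tail)

r = joint_association(0.5, 0.25)   # -log(0.125)
```

Note the sign convention: because of the negative logarithm, the most relevant word minimizes R(w_i), which is why the relation word is chosen by arg min in the next step.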
According to this embodiment of the application, the joint association degree between each tagged word and its corresponding head entity and tail entity is calculated from the first and second normalized association degrees of that word, and the entity relationship is extracted according to the joint association degree of each tagged word, so that the entity relationship of each tagged word can be extracted accurately.
Further, the description of steps S30331 to S30332 is as follows:
step S30331, determining the relation word with the closest relationship between the head entity and the tail entity in each tagged word according to the joint association degree of each tagged word;
step S30332, obtaining the final relationship between the head entity and the tail entity in each tagged word according to the relation word of each tagged word.
Specifically, the data management system determines the relation word with the closest relationship between the head entity and the tail entity in each tagged word according to a preset formula R_word = arg min_i R(w_i), combined with the joint association degree of each tagged word; in this embodiment, the head entity, the tail entity, punctuation marks, and special sentence marks cannot serve as the relation word. Then, the data management system extracts the final entity relationship between the head entity and the tail entity in each tagged word according to the relation word of each tagged word.
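The arg min selection above can be sketched as follows; representing the disallowed tokens (head entity, tail entity, punctuation, special sentence marks) as a set of excluded indices is illustrative, not prescribed by the embodiment:

```python
def pick_relation_word(words, R, excluded):
    """R_word = arg min_i R(w_i), restricted to candidate tagged words.

    words:    list of tagged words in the sentence.
    R:        joint association degree R(w_i) per word (smaller = closer).
    excluded: indices that may not serve as the relation word
              (head entity, tail entity, punctuation, special marks).
    """
    best = min((i for i in range(len(words)) if i not in excluded),
               key=lambda i: R[i])
    return words[best]
```

The candidate with the smallest joint association degree is returned as the relation word linking the head and tail entities.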
According to this embodiment of the application, the relation word within each tagged word is determined according to the joint association degree of the tagged words, and the entity relationship between the head entity and the tail entity in each tagged word is then extracted accurately according to that relation word.
Further, the entity identification and relationship extraction apparatus provided by the present application is described below; the apparatus described below and the entity identification and relationship extraction method described above may be referred to in correspondence with each other.
As shown in fig. 4, fig. 4 is a schematic structural diagram of an entity identification and relationship extraction apparatus provided in the present application, and the entity identification and relationship extraction apparatus includes:
a construction module 401, configured to construct an entity data set including domain nouns;
a training module 402, configured to determine the entity data set as a training set corpus, and perform masking training on a pre-training BERT model to obtain a domain language model;
an extracting module 403, configured to identify, through the domain language model, a head entity and a tail entity of each tagged word in the domain text data to be processed, and extract an entity relationship between the head entity and the tail entity of each tagged word.
Further, the extracting module 403 is further configured to:
identifying a head entity and a tail entity in each tagged word through the domain language model;
calculating attention weight between each mark word and the corresponding head entity and tail entity;
extracting the relationship between each head entity and each tail entity based on the attention weight of each said tagged word to its head entity and tail entity.
Further, the extracting module 403 is further configured to:
normalizing the attention weights of the tagged words and the head entities and the tail entities thereof to obtain a first normalized association degree between the tagged words and the head entities thereof and a second normalized association degree between the tagged words and the tail entities thereof;
calculating the joint association degree between each tagged word and the corresponding head entity and tail entity according to the first normalized association degree and the second normalized association degree of each tagged word;
and determining the final relation between the head entity and the tail entity according to the joint association degree of each tagged word.
Further, the extracting module 403 is further configured to:
determining a relation word with the closest relation between a head entity and a tail entity in each mark word according to the joint association degree of each mark word;
and obtaining the final relation between the head entity and the tail entity in each marked word according to the relation word of each marked word.
Further, the extraction module 403 further includes a computing unit, configured to:
determining the weighted association degree obtained after each marked word is extracted by the domain language model;
determining a first number of transform layers in the domain language model and a second number of heads in each transform layer;
and calculating attention weight between each mark word and the corresponding head entity and tail entity of the mark word by combining a preset calculation formula and based on the weighted association degree of each mark word and the first quantity and the second quantity.
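The preset calculation formula itself is not disclosed in this passage, so the sketch below assumes the simplest reading: the per-word attention weight is the mean of the weighted association degrees over all Transformer layers (the first number) and all heads per layer (the second number):

```python
def average_attention(assoc, num_layers, num_heads):
    """Average weighted association degrees over layers and heads.

    assoc[l][h][i]: weighted association degree of tagged word i
                    at Transformer layer l, attention head h.
    Returns one attention weight per tagged word. The plain mean over
    num_layers * num_heads is an assumed stand-in for the preset formula.
    """
    n_words = len(assoc[0][0])
    total = num_layers * num_heads
    return [sum(assoc[l][h][i]
                for l in range(num_layers)
                for h in range(num_heads)) / total
            for i in range(n_words)]
```

With a real BERT-style model, the per-layer, per-head attention tensors would typically be obtained from the model's attention outputs rather than built by hand.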
Further, the building module 401 is further configured to:
cutting the original text data through a preset word cutting tool to obtain each entity data to be processed;
and fusing each to-be-processed entity data with a domain noun set to construct the entity data set containing the domain nouns.
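A minimal sketch of this fusion step is shown below. The patent names neither the word-cutting tool (a segmenter such as jieba would be a typical choice for Chinese text) nor the fusion policy, so the deduplicating merge here is an assumption:

```python
def build_entity_dataset(cut_tokens, domain_nouns):
    """Fuse word-cut entity data with a domain-noun set into one
    entity data set containing domain nouns.

    cut_tokens:   tokens produced by the preset word-cutting tool.
    domain_nouns: curated set of domain nouns to merge in.
    Deduplicates while keeping first-seen order (assumed policy).
    """
    seen, dataset = set(), []
    for tok in list(cut_tokens) + list(domain_nouns):
        if tok not in seen:
            seen.add(tok)
            dataset.append(tok)
    return dataset
```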
Further, the building module 401 is further configured to:
and constructing an entity triple of each tagged word according to the entity relationship between the head entity and the tail entity in each tagged word and the corresponding head entity and tail entity.
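Triple construction can be sketched as follows; the dict field names (`head`, `relation`, `tail`) are illustrative, since the patent only specifies the head entity-entity relationship-tail entity structure:

```python
def build_triples(tagged_word_records):
    """Build one (head entity, entity relationship, tail entity) triple
    per tagged-word record. Field names are hypothetical.
    """
    return [(r["head"], r["relation"], r["tail"]) for r in tagged_word_records]
```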
The specific embodiment of the entity identification and relationship extraction apparatus provided in the present application is substantially the same as the embodiments of the entity identification and relationship extraction method, and is not described herein again.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor) 510, a communication Interface (Communications Interface) 520, a memory (memory) 530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform an entity identification and relationship extraction method comprising:
constructing an entity data set containing domain nouns;
determining the entity data set as a training set corpus, and performing masking training on a pre-training BERT model to obtain a domain language model;
and identifying a head entity and a tail entity of each marked word in the field text data to be processed through the field language model, and extracting an entity relationship between the head entity and the tail entity of each marked word.
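The masking step used to adapt the pre-trained BERT model to the domain corpus can be sketched in plain Python. The 15% masking rate follows standard BERT practice; the patent does not state its exact masking scheme, so treat this as an assumed illustration:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with [MASK] for masked
    language model training; the model is then trained to recover them.

    Returns (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere (unmasked positions incur no loss).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

In practice a training framework's masked-LM data collator would perform this step over the entity data set corpus; the function above only shows the core idea.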
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as independent products. Based on such understanding, the technical solution of the present application, or the portion thereof that substantially contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, the present application also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the entity identification and relationship extraction methods provided by the above methods, the method comprising:
constructing an entity data set containing domain nouns;
determining the entity data set as a training set corpus, and performing masking training on a pre-training BERT model to obtain a domain language model;
and identifying a head entity and a tail entity of each marked word in the field text data to be processed through the field language model, and extracting an entity relationship between the head entity and the tail entity of each marked word.
In yet another aspect, the present application also provides a non-transitory computer readable storage medium having stored thereon a computer program that, when executed by a processor, is implemented to perform the entity identification and relationship extraction methods provided above, the method comprising:
constructing an entity data set containing domain nouns;
determining the entity data set as a training set corpus, and performing masking training on a pre-training BERT model to obtain a domain language model;
and identifying a head entity and a tail entity of each marked word in the field text data to be processed through the field language model, and extracting an entity relationship between the head entity and the tail entity of each marked word.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (7)

1. An entity identification and relationship extraction method, comprising:
constructing an entity data set containing domain nouns;
determining the entity data set as a training set corpus, and performing masking training on a pre-training BERT model to obtain a domain language model, wherein the domain language model is a Coal mine domain Coal-BERT model;
identifying a head entity and a tail entity of each marked word in the to-be-processed field text data through the field language model, and extracting an entity relationship between the head entity and the tail entity of each marked word;
wherein, the identifying, by the domain language model, the head entity and the tail entity of each tagged word in the domain text data to be processed, and extracting the entity relationship between the head entity and the tail entity of each tagged word, include:
identifying a head entity and a tail entity in each marking term through the Coal mine field Coal-BERT model, and generating an HRT structure by the head entity and the tail entity in each marking term, wherein the HRT structure is as follows: head entity-entity relationship-tail entity;
calculating Attention weight between each mark word and a head entity and a tail entity corresponding to the mark word through an Attention calculation mechanism and the HRT structure, wherein the Attention weight is the weight of a word which is related to the head entity and the tail entity in each mark word;
normalizing the attention weights between each said tagged word and its head entity and tail entity to obtain a first normalized association degree Cor_{h-e}(w_i) between each said tagged word and its head entity and a second normalized association degree Cor_{t-e}(w_i) between each said tagged word and its tail entity, wherein w_i denotes a tagged word and i is the index of the tagged word;
multiplying the first normalized association degree Cor_{h-e}(w_i) by the second normalized association degree Cor_{t-e}(w_i) and taking the negative logarithm of the product to obtain the joint association degree R(w_i) between each tagged word and its corresponding head entity and tail entity, wherein the expression of the joint association degree is R(w_i) = -log{Cor_{h-e}(w_i) * Cor_{t-e}(w_i)};
determining, according to a preset formula R_word = arg min_i R(w_i) and the joint association degree R(w_i) of each said tagged word, the relation word with the closest relationship between the head entity and the tail entity in each said tagged word, wherein R_word is the most closely related relation word;
and obtaining the final relation between the head entity and the tail entity in each marked word according to the relation word of each marked word.
2. The entity identification and relationship extraction method according to claim 1, wherein the calculating the attention weight between each of the tagged terms and its corresponding head and tail entities comprises:
determining the weighted association degree obtained after each marked word is extracted by the domain language model;
determining a first number of transform layers in the domain language model and a second number of heads in each transform layer;
and calculating attention weight between each mark word and the corresponding head entity and tail entity of the mark word by combining a preset calculation formula and based on the weighted association degree of each mark word and the first quantity and the second quantity.
3. The entity identification and relationship extraction method of claim 1, wherein the constructing of the entity data set containing domain nouns comprises:
cutting the original text data through a preset word cutting tool to obtain each entity data to be processed;
and fusing each to-be-processed entity data with a domain noun set to construct the entity data set containing the domain nouns.
4. The entity identifying and relationship extracting method according to any one of claims 1 to 3, wherein after identifying, by the domain language model, the head entity and the tail entity of each tagged word in the domain text data to be processed and extracting the entity relationship between the head entity and the tail entity of each tagged word, further comprising:
and constructing an entity triple of each tagged word according to the entity relationship between the head entity and the tail entity in each tagged word and the corresponding head entity and tail entity.
5. An entity identification and relationship extraction apparatus, comprising:
the construction module is used for constructing an entity data set containing domain nouns;
the training module is used for determining the entity data set as a training set corpus and conducting masking training on the pre-training BERT model to obtain a domain language model, wherein the domain language model is a Coal mine domain Coal-BERT model;
the extraction module is used for identifying a head entity and a tail entity of each marked word in the field text data to be processed through the field language model and extracting an entity relationship between the head entity and the tail entity of each marked word;
the extraction module is further configured to:
identifying a head entity and a tail entity in each marking term through the Coal mine field Coal-BERT model, and generating an HRT structure by the head entity and the tail entity in each marking term, wherein the HRT structure is as follows: head entity-entity relationship-tail entity;
calculating Attention weight between each mark word and a head entity and a tail entity corresponding to the mark word through an Attention calculation mechanism and the HRT structure, wherein the Attention weight is the weight of a word which is related to the head entity and the tail entity in each mark word;
normalizing the attention weights between each said tagged word and its head entity and tail entity to obtain a first normalized association degree Cor_{h-e}(w_i) between each said tagged word and its head entity and a second normalized association degree Cor_{t-e}(w_i) between each said tagged word and its tail entity, wherein w_i denotes a tagged word and i is the index of the tagged word;
multiplying the first normalized association degree Cor_{h-e}(w_i) by the second normalized association degree Cor_{t-e}(w_i) and taking the negative logarithm of the product to obtain the joint association degree R(w_i) between each tagged word and its corresponding head entity and tail entity, wherein the expression of the joint association degree is R(w_i) = -log{Cor_{h-e}(w_i) * Cor_{t-e}(w_i)};
determining, according to a preset formula R_word = arg min_i R(w_i) and the joint association degree R(w_i) of each said tagged word, the relation word with the closest relationship between the head entity and the tail entity in each said tagged word, wherein R_word is the most closely related relation word;
and obtaining the final relation between the head entity and the tail entity in each marked word according to the relation word of each marked word.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the entity identification and relationship extraction method of any one of claims 1 to 4 are implemented when the computer program is executed by the processor.
7. A computer-readable storage medium comprising a computer program, wherein the computer program when executed by a processor implements the steps of the entity identification and relationship extraction method of any of claims 1 to 4.
CN202111504146.6A 2021-12-10 2021-12-10 Entity identification and relation extraction method and device, electronic equipment and storage medium Active CN113901827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111504146.6A CN113901827B (en) 2021-12-10 2021-12-10 Entity identification and relation extraction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113901827A CN113901827A (en) 2022-01-07
CN113901827B true CN113901827B (en) 2022-03-18

Family

ID=79025543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111504146.6A Active CN113901827B (en) 2021-12-10 2021-12-10 Entity identification and relation extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113901827B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377690A (en) * 2019-06-27 2019-10-25 北京信息科技大学 A kind of information acquisition method and system based on long-range Relation extraction
CN113468888A (en) * 2021-06-25 2021-10-01 浙江华巽科技有限公司 Entity relation joint extraction method and device based on neural network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933789B (en) * 2019-02-27 2021-04-13 中国地质大学(武汉) Neural network-based judicial domain relation extraction method and system
CN112329465A (en) * 2019-07-18 2021-02-05 株式会社理光 Named entity identification method and device and computer readable storage medium
CN113011161A (en) * 2020-12-29 2021-06-22 中国航天科工集团第二研究院 Method for extracting human and pattern association relation based on deep learning and pattern matching
CN113033203A (en) * 2021-02-05 2021-06-25 浙江大学 Structured information extraction method oriented to medical instruction book text
CN113157936B (en) * 2021-03-16 2024-03-12 云知声智能科技股份有限公司 Entity relationship joint extraction method, device, electronic equipment and storage medium
CN113158653B (en) * 2021-04-25 2021-09-07 北京智源人工智能研究院 Training method, application method, device and equipment for pre-training language model
CN113378513B (en) * 2021-06-11 2022-12-23 电子科技大学 Method for generating labeling corpus extracted towards domain relation


Also Published As

Publication number Publication date
CN113901827A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
DE102008040739B4 (en) Method and system for calculating or determining trust or confidence evaluations for syntax trees at all levels
US8147250B2 (en) Cooccurrence and constructions
CN111104789B (en) Text scoring method, device and system
CN109299865B (en) Psychological evaluation system and method based on semantic analysis and information data processing terminal
Mareček et al. Extracting syntactic trees from transformer encoder self-attentions
CN102043774A (en) Machine translation evaluation device and method
CN115357719B (en) Power audit text classification method and device based on improved BERT model
Rikters et al. Confidence through attention
CN115238685B (en) Combined extraction method for building engineering change events based on position perception
CN111832281A (en) Composition scoring method and device, computer equipment and computer readable storage medium
CN112671985A (en) Agent quality inspection method, device, equipment and storage medium based on deep learning
CN115098634A (en) Semantic dependency relationship fusion feature-based public opinion text sentiment analysis method
CN103186658A (en) Method and device for reference grammar generation for automatic grading of spoken English test
Atapattu et al. Acquisition of triples of knowledge from lecture notes: A natural langauge processing approach
Pilán et al. Coursebook texts as a helping hand for classifying linguistic complexity in language learners’ writings
CN112579794B (en) Method and system for predicting semantic tree for Chinese and English word pairs
CN113901827B (en) Entity identification and relation extraction method and device, electronic equipment and storage medium
Rintyarna et al. Automatic ranking system of university based on technology readiness level using LDA-Adaboost. MH
Rahman et al. An Automated Approach for Answer Script Evaluation Using Natural Language Processing
CN110826329A (en) Automatic composition scoring method based on confusion degree
TW201502812A (en) Text abstract editing system, text abstract scoring system and method thereof
Islam et al. Readability classification of bangla texts
CN113793611A (en) Scoring method, scoring device, computer equipment and storage medium
Riccardi et al. Motivational feedback in crowdsourcing: a case study in speech transcription.
Miyata et al. Evaluating the suitability of human-oriented text simplification for machine translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant