CN115730017A - Device and method for generating entity relation extraction model - Google Patents

Device and method for generating entity relation extraction model

Info

Publication number
CN115730017A
Authority
CN
China
Prior art keywords
entity
information
relationship
labeled
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110996377.7A
Other languages
Chinese (zh)
Inventor
曾俋颖
张琼之
邱德旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Delta Electronics Inc
Original Assignee
Delta Electronics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Delta Electronics Inc
Priority to CN202110996377.7A
Publication of CN115730017A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a device and a method for generating an entity relationship extraction model. The device receives a text to be labeled and, based on a plurality of fields in the text to be labeled and on the entity information and the relationship information in an entity relationship database, generates at least one piece of entity information to be labeled and at least one piece of relationship information to be labeled corresponding to each field. The device labels the at least one piece of entity information to be labeled and the at least one piece of relationship information to be labeled of each field according to an improved labeling format. The device generates a plurality of combinations from the at least one piece of labeled entity information and the at least one piece of labeled relationship information and stores the combinations in the entity relationship database. The device inputs the combinations into a pre-trained language model to generate an entity relationship extraction model.

Description

Device and method for generating entity relation extraction model
Technical Field
The invention relates to a device and a method for generating an entity relationship extraction model. More particularly, the present invention relates to an apparatus and a method that execute a pre-labeling program and a training model program to generate an entity relationship extraction model.
Background
Knowledge extraction is the first and most important step in knowledge management: it extracts useful knowledge, including entities and relationships, from a large number of documents. With this knowledge, application services can make decisions quickly and accurately when a scenario calls for a judgment, and thereby complete the task at hand. Many applications and solutions rely on knowledge in the form of structured textual information to perform specific functions, for example search engines, automatic navigation, knowledge question answering, recommendation systems, and dialogue robots. Raising the level of such knowledge further requires knowledge graphs and semantic knowledge bases, so entity relationship extraction is one of the key technologies for constructing knowledge bases.
Existing entity relationship extraction methods rely mainly on manual rule templates and syntactic structure analysis. A manual rule template performs matching with template rules designed by a domain expert; a new template must be designed whenever a new domain or new data is encountered, so besides being time-consuming to design, each template applies only to a narrow domain. Syntactic structure analysis relies on syntactic rules and structures that linguists derive for a single language; an input sentence is decomposed according to its syntactic structure, and the relationships between nouns and verbs are identified. Whichever of these methods is adopted, the intervention of experts or scholars is required, considerable manual labeling cost and time are consumed, and the method cannot be adapted to different domains quickly and flexibly.
In view of the above, there is an urgent need in the art for a way to generate entity relationship extraction models efficiently and automatically.
Disclosure of Invention
An object of the invention is to provide a device for generating an entity relationship extraction model. The device comprises a memory and a processor, and the processor is electrically connected to the memory. The memory stores an entity relationship database, which comprises at least a plurality of entity information and a plurality of relationship information. The processor executes a pre-labeling program and a training model program. The pre-labeling program comprises the following steps. The processor receives a text to be labeled. Based on a plurality of fields in the text to be labeled and on the entity information and the relationship information in the entity relationship database, the processor generates at least one piece of entity information to be labeled and at least one piece of relationship information to be labeled corresponding to each field. The processor labels the at least one piece of entity information to be labeled and the at least one piece of relationship information to be labeled of each field according to an improved labeling format, to generate at least one piece of labeled entity information and at least one piece of labeled relationship information. The processor generates a plurality of combinations from the at least one piece of labeled entity information and the at least one piece of labeled relationship information and stores the combinations in the entity relationship database. The training model program comprises the following step: the processor inputs the combinations into a pre-trained language model to generate an entity relationship extraction model.
Another object of the present invention is to provide a method for generating an entity relationship extraction model. The method is used by a device for generating the entity relationship extraction model. The device comprises a memory and a processor; the memory stores an entity relationship database, which comprises at least a plurality of entity information and a plurality of relationship information. The method is executed by the processor and comprises executing a pre-labeling program and a training model program. The pre-labeling program comprises the following steps: receiving a text to be labeled; generating, based on a plurality of fields in the text to be labeled and on the entity information and the relationship information in the entity relationship database, at least one piece of entity information to be labeled and at least one piece of relationship information to be labeled corresponding to each field; labeling the at least one piece of entity information to be labeled and the at least one piece of relationship information to be labeled of each field according to an improved labeling format, to generate at least one piece of labeled entity information and at least one piece of labeled relationship information; and generating a plurality of combinations from the at least one piece of labeled entity information and the at least one piece of labeled relationship information and storing the combinations in the entity relationship database. The training model program comprises the following step: inputting the combinations into a pre-trained language model to generate an entity relationship extraction model.
As described above, training a conventional entity relationship extraction model usually requires repeated training and a large amount of input data produced by manual labeling or intervention before it becomes effective. In contrast, the technology for generating an entity relationship extraction model provided by the invention (comprising at least the device and the method) builds on a pre-trained model. Through the pre-labeling program, input data are labeled quickly and the entity relationship database is augmented, so a large amount of data can be generated automatically without human intervention and the entity relationship extraction model can be trained quickly. In addition, the information carried by the improved labeling format further speeds up training of the entity relationship extraction model. The invention thereby overcomes the drawbacks of the prior art, in which entity relationship extraction requires the intervention of experts or scholars, consumes considerable manual labeling cost and time, and cannot be adapted to different domains quickly and flexibly.
The detailed technology and embodiments of the present invention will be described below with reference to the accompanying drawings so that those skilled in the art can understand the technical features of the claimed invention.
Drawings
FIG. 1 is a block diagram illustrating an apparatus for generating an entity-relationship extraction model according to an embodiment of the present invention;
FIG. 2 depicts a schematic diagram of an entity relationship database in a first embodiment;
FIG. 3 is a schematic diagram illustrating an augmented entity-relationship database according to a first embodiment;
FIG. 4 depicts a schematic diagram of an architecture for training an entity-relationship extraction model in a first embodiment; and
FIG. 5 depicts a flow diagram of a method of generating an entity relationship extraction model of a second embodiment.
The reference numbers illustrate:
1: device for generating entity relation extraction model
11: memory
13: transceiver interface
15: processor
133: text to be labeled
400: entity relationship database
409: neural network
411: input layer
413: pre-trained language model
415: sequence layer
S501-S509: steps
Detailed Description
The following explains the apparatus and method for generating entity relationship extraction model provided by the present invention by means of embodiments. However, these embodiments are not intended to limit the present invention to any specific environment, application, or manner of implementing the embodiments described herein. Therefore, the description of the embodiments is for the purpose of illustration only, and is not intended to limit the scope of the invention. It should be understood that in the following embodiments and the accompanying drawings, elements not directly related to the present invention have been omitted and not shown, and the sizes of the elements and the size ratios between the elements are merely illustrative and are not intended to limit the scope of the present invention.
A first embodiment of the present invention is an apparatus 1 for generating an entity relationship extraction model, whose architecture is depicted schematically in FIG. 1. In the present embodiment, the apparatus 1 includes a memory 11, a transceiver interface 13 and a processor 15, and the processor 15 is electrically connected to the memory 11 and the transceiver interface 13. The memory 11 may be a memory device, a Universal Serial Bus (USB) disk, a hard disk, an optical disk, a portable disk, or any other storage medium or circuit with the same function known to those skilled in the art. The transceiver interface 13 is an interface capable of receiving and transmitting data, or any other such interface known to those skilled in the art; it can receive data from sources such as external devices, external web pages, and external applications. The processor 15 may be any of various processing units, a central processing unit (CPU), a microprocessor, or another computing device known to those skilled in the art. In some embodiments, the apparatus 1 may be, but is not limited to, an electronic device such as a mobile electronic device, a desktop computer, or a portable computer.
In the present embodiment, the memory 11 stores an entity relationship database 400, and the entity relationship database 400 includes at least a plurality of entity information and a plurality of relationship information. For ease of understanding, FIG. 2 illustrates one embodiment of the entity relationship database 400. As shown in FIG. 2, the entity relationship database 400 records the fields input data, entity 1, relationship, entity 2, and confidence score. Taking the first entry of the entity relationship database 400 in FIG. 2 as an example, the input data is "Tom was born in Honolulu, Hawaii", entity 1 is "Tom", the relationship is "was born in", entity 2 is "Honolulu", and the confidence score is "1.0".
In some embodiments, the entity relationship database 400 is generated by the processor 15 executing a crawler program and an entity relationship database construction program. The crawler program comprises the following steps. The processor 15 collects a plurality of knowledge base data contents, each of which includes a plurality of item names and an item text corresponding to each of the item names. The processor 15 performs sentence-breaking processing on the item texts to generate input data. The entity relationship database construction program comprises the following steps. The processor 15 inputs the input data into an entity relationship extraction system to generate output data, wherein the output data includes a plurality of triples, and each triple includes a plurality of entity information, at least one piece of relationship information and a confidence score. Based on the confidence scores, the processor 15 stores the triples in the output data whose confidence score exceeds a preset value in the entity relationship database.
For example, in the crawler program, the processor 15 may run a web crawler to crawl item names (e.g., databases related to a category) and item texts (e.g., articles related to a category) from data sources such as general knowledge bases (e.g., DBpedia, YAGO, Freebase, Wikipedia, etc.), domain knowledge bases (e.g., patent knowledge bases, manufacturing phrase knowledge bases, etc.), and standard entity relationship data sets (e.g., OPECI, OIE 2016). Next, the processor 15 performs sentence-breaking processing on each item text according to sentence-breaking punctuation rules and generates a plurality of input data, each consisting of a single sentence. It should be noted that after the crawler captures the item names and item texts of each knowledge base, the processor 15 may further preprocess the item texts with data cleaning operations such as extracting text paragraphs, removing HTML tags, removing repeated sentences, and removing garbled characters.
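The cleaning and sentence-breaking steps described above can be sketched as follows. This is a minimal illustration and not the implementation of the embodiment: the function names `clean_text` and `split_sentences`, the regular expressions, and the choice of sentence-ending punctuation are all assumptions, and the naive splitter will mis-split abbreviations such as "Dr.".

```python
import html
import re

# Assumption: anything between angle brackets is an HTML tag to discard.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a sentence ends at one of these punctuation marks.
SENT_RE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def clean_text(raw: str) -> str:
    """Remove HTML tags, decode entities, drop U+FFFD debris, collapse spaces."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    text = text.replace("\ufffd", "")
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split into single sentences and drop repeated sentences, keeping order."""
    seen = set()
    sentences = []
    for match in SENT_RE.finditer(text):
        sentence = match.group().strip()
        if sentence and sentence not in seen:
            seen.add(sentence)
            sentences.append(sentence)
    return sentences


# Hypothetical crawled item text containing markup and a repeated sentence.
raw = "<p>Tom was born in Honolulu.</p> <p>Tom was born in Honolulu. He moved later!</p>"
input_data = split_sentences(clean_text(raw))
print(input_data)
```

Each element of `input_data` then corresponds to one unit of input data in the sense used above, i.e. a single sentence.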
Also for example, in the construction program of the entity relationship database 400, the processor 15 inputs the input data into an entity relationship extraction system, which may be an open-source entity relationship extraction tool trained by machine learning, such as OpenIE5 or RnnOIE. Through the entity relationship extraction system, the processor 15 then extracts from the sentence-broken item texts (i.e., the input data) a plurality of triples, each including a plurality of entity information, at least one piece of relationship information and a confidence score. As shown in FIG. 2, each triple includes entity 1, a relationship and entity 2, together with a confidence score. The confidence score represents the degree of confidence in the extraction result of the triple and can be generated automatically by the entity relationship extraction system. Finally, by setting the preset value of the confidence score to 0.85, the processor 15 stores the triples whose confidence score is greater than 0.85 in the entity relationship database 400.
In some embodiments, the entity relationship database 400 may also be generated by an external device; the processor 15 receives the entity relationship database 400 through the transceiver interface 13 and stores it in the memory 11. It should be noted that FIG. 2 is provided only for convenience of illustration and does not limit the scope of the invention; in actual operation the entity relationship database 400 may also include other fields (e.g., data source).
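The confidence filter described above can be sketched as follows. The `Triple` record and the `filter_triples` function are hypothetical names introduced for illustration; only the first sample row comes from FIG. 2 as described in the text, and the other two rows are invented to exercise the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One row of the entity relationship database (field names are assumed)."""
    input_data: str
    entity1: str
    relation: str
    entity2: str
    confidence: float


def filter_triples(triples, threshold=0.85):
    """Keep only triples whose confidence score is strictly greater than the preset value."""
    return [t for t in triples if t.confidence > threshold]


sentence = "Tom was born in Honolulu, Hawaii"
extracted = [
    Triple(sentence, "Tom", "was born in", "Honolulu", 1.0),   # row described for FIG. 2
    Triple(sentence, "Tom", "was born in", "Hawaii", 0.85),    # hypothetical, equals threshold
    Triple(sentence, "Honolulu", "born", "Tom", 0.40),         # hypothetical low-confidence row
]
database = filter_triples(extracted)
print(database)
```

Because the text speaks of scores "greater than 0.85", the sketch uses a strict comparison, so a triple scoring exactly 0.85 is not stored.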
Next, the specific operation of the apparatus 1 for generating the entity relationship extraction model is described with reference to FIG. 1. In this embodiment, the processor 15 executes a pre-labeling program and a training model program. First, in the pre-labeling program, the processor 15 receives a text to be labeled 133 through the transceiver interface 13. It should be noted that the text to be labeled 133 is an article in which entities and relationships have not yet been labeled; it may be, for example, an article of a certain category or an article related to the domain of the model to be trained, and it is subsequently used to augment the data of the entity relationship database 400.
In some embodiments, the processor 15 performs sentence-breaking processing on the text to be labeled 133 and generates a plurality of fields, each consisting of a single sentence. In some embodiments, the processor 15 performs text preprocessing on the text to be labeled 133, with data cleaning operations such as extracting text paragraphs, removing HTML tags, removing repeated sentences, and removing garbled characters.
Then, based on the plurality of fields in the text to be labeled 133 and on the entity information and the relationship information in the entity relationship database 400, the processor 15 generates at least one piece of entity information to be labeled and at least one piece of relationship information to be labeled corresponding to each of the fields. Specifically, this may include the following steps. First, the processor 15 compares the fields in the text to be labeled 133 with the entity information in the entity relationship database 400 to generate the at least one piece of entity information to be labeled corresponding to each of the fields. Then, the processor 15 compares each field that contains at least two pieces of entity information to be labeled with the relationship information in the entity relationship database 400 to generate the at least one piece of relationship information to be labeled corresponding to that field.
Then, the processor 15 labels the at least one piece of entity information to be labeled and the at least one piece of relationship information to be labeled of each field according to an improved labeling format, to generate at least one piece of labeled entity information and at least one piece of labeled relationship information. The processor 15 generates a plurality of combinations from the at least one piece of labeled entity information and the at least one piece of labeled relationship information and stores the combinations in the entity relationship database 400. In some embodiments, the processor 15 generates the combinations according to the order in which the at least one piece of labeled entity information and the at least one piece of labeled relationship information appear in the fields. In some embodiments, the improved labeling format consists of a conventional sequence labeling format (e.g., BMES, BIO, BIOES, etc.) together with an entity tag and a relation tag attached to the conventional sequence labeling format.
For ease of understanding, the flow of the pre-labeling program is illustrated below by way of example with reference to FIG. 1, FIG. 2 and FIG. 3; the example should not be construed as limiting the scope of the invention. In this example, the text to be labeled 133 includes a field A containing the sentence "Wang was born in Taiwan, Tainan, Zhongshan street". First, the processor 15 compares field A with each piece of entity information in the entity relationship database 400 in FIG. 2 (i.e., the entity 1 and entity 2 fields) to determine which words or phrases in field A are entities. In this example, because "Wang", "Taiwan", "Tainan" and "Zhongshan street" in field A have already been recorded as entities in the 4th, 5th and 6th entries of the entity relationship database 400, the comparison yields the entity information to be labeled for field A as "Wang", "Taiwan", "Tainan" and "Zhongshan street" (in the order in which they appear in field A).
The processor 15 then determines which fields contain at least two pieces of entity information to be labeled (i.e., fields in which two entities and a relationship may form a combination; without two entities, no combination can be formed even if a relationship is present). In this example, because field A has at least two pieces of entity information to be labeled, the processor 15 compares field A with each piece of relationship information in the entity relationship database 400 in FIG. 2 (i.e., the relationship field) to determine which words or phrases in field A are relationships. In this example, because "was born in" has been recorded as a relationship in the 1st and 2nd entries of the entity relationship database 400, the comparison yields the relationship information to be labeled for field A as "was born in".
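The two comparison steps in this example can be sketched as a simple string match against the database contents. This is an illustrative sketch only: `find_mentions` and `pre_label_candidates` are hypothetical names, whole-word case-insensitive matching is an assumption about how the comparison is performed, and the sample sets stand in for the entity and relationship fields of FIG. 2.

```python
import re


def find_mentions(field, known_phrases):
    """Return the known phrases that occur in the field, ordered by position."""
    hits = []
    for phrase in set(known_phrases):
        # Whole-word match so that "Taiwan" is not found inside another word.
        match = re.search(r"\b" + re.escape(phrase) + r"\b", field, flags=re.IGNORECASE)
        if match:
            hits.append((match.start(), phrase))
    return [phrase for _, phrase in sorted(hits)]


def pre_label_candidates(field, entities, relations):
    """Match entities first; look for relations only if at least two entities are found."""
    found_entities = find_mentions(field, entities)
    if len(found_entities) < 2:
        return found_entities, []
    return found_entities, find_mentions(field, relations)


# Stand-ins for the entity and relationship columns of the database.
entities = {"Tom", "Honolulu", "Wang", "Taiwan", "Tainan", "Zhongshan street"}
relations = {"was born in", "lives in"}

field_a = "Wang was born in Taiwan, Tainan, Zhongshan street"
to_label_entities, to_label_relations = pre_label_candidates(field_a, entities, relations)
print(to_label_entities)
print(to_label_relations)
```

A field with fewer than two matched entities returns an empty relation list, mirroring the rule that no combination can be formed without two entities.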
Then, the processor 15 labels the at least one piece of entity information to be labeled and the at least one piece of relationship information to be labeled of each field according to the improved labeling format. This example uses BMES labeling (B marks the beginning of a term, M the middle, E the end, and S a single-word term). For example, the processor 15 labels the entity information to be labeled in field A by adding the prefix Entity before the conventional sequence label: "Wang" becomes "Wang [Entity-S]", "Taiwan" becomes "Taiwan [Entity-S]", "Tainan" becomes "Tainan [Entity-S]", and "Zhongshan street" becomes "Zhongshan [Entity-B] street [Entity-E]". The results "Wang [Entity-S]", "Taiwan [Entity-S]", "Tainan [Entity-S]" and "Zhongshan [Entity-B] street [Entity-E]" are the labeled entity information. The processor 15 likewise labels the relationship information to be labeled "was born in" by adding the prefix Relation before the conventional sequence label B, M or E, forming "was [Relation-B] born [Relation-M] in [Relation-E]", which is the labeled relationship information.
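The labeling rule in this example can be sketched as follows. The function names `bmes_tags` and `render` are hypothetical, and splitting a phrase on whitespace is an assumption that suits the English example; the bracketed output format follows the strings quoted in the text.

```python
def bmes_tags(phrase, prefix):
    """Tag each word of a phrase with prefix-B/M/E, or prefix-S for a single word."""
    words = phrase.split()
    if len(words) == 1:
        return [(words[0], f"{prefix}-S")]
    positions = ["B"] + ["M"] * (len(words) - 2) + ["E"]
    return [(word, f"{prefix}-{pos}") for word, pos in zip(words, positions)]


def render(tagged):
    """Format tagged words in the bracketed style used in the example."""
    return " ".join(f"{word} [{tag}]" for word, tag in tagged)


labeled_entities = [render(bmes_tags(e, "Entity"))
                    for e in ["Wang", "Taiwan", "Tainan", "Zhongshan street"]]
labeled_relation = render(bmes_tags("was born in", "Relation"))
print(labeled_entities)
print(labeled_relation)
```

Encoding the position of each word inside its entity or relation is what gives the stored data the positional information discussed later in the description.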
Then, the processor 15 generates a plurality of combinations from the labeled entity information and the labeled relationship information in each field, according to the order in which they appear in field A, and stores the combinations in the entity relationship database 400. Referring to FIG. 3, based on the labeled entity information "Wang [Entity-S]", "Taiwan [Entity-S]", "Tainan [Entity-S]" and "Zhongshan [Entity-B] street [Entity-E]" and the labeled relationship information "was [Relation-B] born [Relation-M] in [Relation-E]", the processor 15 generates the combinations "Wang was born in Taiwan", "Wang was born in Tainan" and "Wang was born in Zhongshan street", each arranged as entity 1, relationship and entity 2, and stores them as the 7th, 8th and 9th entries of the entity relationship database 400 in FIG. 3 (the confidence score of the generated entries is set to 1 in this example).
Accordingly, the processor 15 can perform the same operations on all fields contained in the text to be labeled 133. The processor 15 labels automatically by string comparison against the entity relationship database 400 and can generate various combinations from the original fields to expand the data content of the entity relationship database 400. In addition, because the processor 15 labels the entities and relationships in each field with the improved labeling format, the data content of the entity relationship database 400 carries positional feature information in addition to the entity and relationship information, which improves the efficiency of subsequent model training and shortens the time it requires.
In some embodiments, the processor 15 can generate combinations using other arrangements, and those skilled in the art will understand from the foregoing description how to do so. It should be noted that, for brevity, FIG. 3 shows only part of the content in the improved labeling format; this is not intended to limit the scope of the present invention, and those skilled in the art will understand the operation from the foregoing description.
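The combination step in this example can be sketched as follows. The arrangement used here, in which the first entity of the field serves as entity 1 and every later entity serves as entity 2, is an assumption inferred from the three combinations listed for field A; the function name `build_combinations` and the dictionary keys are hypothetical.

```python
def build_combinations(field, entities, relations, confidence=1.0):
    """Pair the first entity with each later entity for every matched relation."""
    rows = []
    if len(entities) < 2:
        return rows  # no combination without two entities
    subject, objects = entities[0], entities[1:]
    for relation in relations:
        for obj in objects:
            rows.append({
                "input_data": field,
                "entity1": subject,
                "relation": relation,
                "entity2": obj,
                "confidence": confidence,  # set to 1 for generated rows, as in the example
            })
    return rows


field_a = "Wang was born in Taiwan, Tainan, Zhongshan street"
rows = build_combinations(
    field_a,
    ["Wang", "Taiwan", "Tainan", "Zhongshan street"],
    ["was born in"],
)
for row in rows:
    print(row["entity1"], row["relation"], row["entity2"])
```

Other arrangements would only change how `subject` and `objects` are chosen from the ordered entity list.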
The steps of the training model program executed by the processor 15 are described below with reference to FIG. 4. In this embodiment, the processor 15 inputs the combinations into a pre-trained language model 413 to generate an entity relationship extraction model, which is used to identify the entity information and the relationship information in a text paragraph. It should be noted that the pre-trained language model 413 comprises at least a trained language layer model. Because its multi-layer network structure has been trained on a large amount of text, it already contains a plurality of trained weight parameters. One example is BERT (Bidirectional Encoder Representations from Transformers), the pre-trained language model proposed by Google, in which each Transformer is a model that uses a self-attention mechanism to model the interrelationships within a sequence.
Specifically, the training model program may include the following steps. First, as shown in FIG. 4, the processor 15 connects an input layer 411 and a sequence layer 415 in series with the pre-trained language model 413, which effectively reduces the complexity of model training. The input layer 411 splits the fields into words as input to the pre-trained language model 413, and the sequence layer 415 performs an analysis operation based on the improved labeling format to generate the entity information and the relationship information in the text paragraph. Next, the processor 15 inputs the combinations in the entity relationship database 400 into the input layer 411 and generates the entity relationship extraction model in cooperation with the pre-trained language model 413 and the sequence layer 415.
It should be noted that the input layer 411 receives a plurality of word sequences (i.e., the input data in the entity relationship database 400), breaks them into a plurality of token sequences, and feeds the token sequences into the pre-trained language model 413 (i.e., the BERT layer). The sequence layer 415 receives the output of the pre-trained language model 413 and finally generates, for the word sequences, labeling results that combine the entity and relationship tags with a conventional sequence labeling format (e.g., BMES, BIO, BIOES, etc.). Because the sequence layer (CRF layer) can impose constraints on the label sequence (i.e., it regulates the probability of the label generated for the next word), the predicted labels are kept valid and the complexity of model training is effectively reduced. Therefore, connecting the sequence layer after the language layer (i.e., the BERT layer) strengthens the sequence analysis. It should be noted that, for simplicity, FIG. 4 shows only part of the architecture, and those skilled in the art will understand from the foregoing description how machine learning training is performed with the serially connected neural network.
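The kind of constraint the sequence layer places on successive labels can be illustrated with a small validity check. This is only a sketch of what a "valid" label sequence means: a CRF layer learns transition scores from data rather than using a fixed table, and the `ALLOWED_NEXT` table, the extra "O" label for unlabeled tokens, and the function names are assumptions introduced for illustration.

```python
# Assumed transition table for BMES positions; "O" marks an unlabeled token.
ALLOWED_NEXT = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S", "O"},
    "S": {"B", "S", "O"},
    "O": {"B", "S", "O"},
}


def split_label(label):
    """Split 'Entity-B' into ('Entity', 'B'); 'O' has no kind."""
    if label == "O":
        return None, "O"
    kind, pos = label.rsplit("-", 1)
    return kind, pos


def is_valid_sequence(labels):
    """Check that every transition is allowed and spans do not switch kind midway."""
    prev_kind, prev_pos = None, "O"
    for label in labels:
        kind, pos = split_label(label)
        if pos not in ALLOWED_NEXT[prev_pos]:
            return False
        if prev_pos in ("B", "M") and kind != prev_kind:
            return False
        prev_kind, prev_pos = kind, pos
    return prev_pos in ("E", "S", "O")


# Tokens: Wang was born in Taiwan , Tainan , Zhongshan street
valid = ["Entity-S", "Relation-B", "Relation-M", "Relation-E", "Entity-S",
         "O", "Entity-S", "O", "Entity-B", "Entity-E"]
print(is_valid_sequence(valid))
print(is_valid_sequence(["Entity-B", "Relation-E"]))
```

A sequence such as an Entity-B followed directly by a Relation-E is rejected, which is the sort of invalid prediction the sequence layer is meant to suppress.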
In some embodiments, as shown in fig. 4, machine learning can be performed through a Neural Network (Neural Network) 409 serially connected by three networks, i.e., an input layer 411, a pre-training language model 413 and a sequence layer 415, model fine-tuning (fine-tuning) is performed on the pre-training language model 413 based on data of the entity-relationship database 400, so as to train an entity-relationship extraction model, which is input as a segment of text sequence and label information, and which words in the new text sequence are entities and relationships can be predicted by the trained model.
As is clear from the above description, the apparatus 1 for generating the entity-relationship extraction model executes a pre-labeling procedure and a training model procedure. In the pre-labeling process, the processor 15 generates at least one piece of entity information to be labeled corresponding to each field and at least one piece of relation information to be labeled corresponding to each field based on a plurality of fields in the text 133 to be labeled and the entity information and the relation information in the entity relation database 400, labels the at least one piece of entity information to be labeled and the at least one piece of relation information to be labeled of each field according to the improved labeling format to generate at least one piece of entity information to be labeled and at least one piece of relation information to be labeled, and generates a plurality of combinations of the at least one piece of entity information to be labeled and the at least one piece of relation information to be labeled and stores the combinations in the entity relation database 400. In the training model program, the combinations are input to the pre-trained language model by the processor 15 based on the pre-trained language model to generate a entity relationship extraction model.
As can be seen from the above description, the training of the traditional entity-relationship extraction model usually requires the training to be repeated and requires a large amount of input data generated by manual labeling/intervention to achieve the effect. The device for generating the entity relationship extraction model is constructed on a pre-training model, quickly marks input data and amplifies an entity relationship database through a mechanism of a pre-marking program, automatically generates a large amount of data without human intervention, and therefore the entity relationship extraction model can be quickly trained. In addition, the invention further accelerates the training speed of the entity relation extraction model through the information of the improved labeling format. Therefore, the defects that in the prior art, the entity relationship extraction model needs the intervention of experts or scholars, a large amount of manual labeling cost and time are consumed, and the conversion aiming at different fields cannot be performed quickly and flexibly are overcome.
A second embodiment of the present invention is a method for generating an entity relationship extraction model, and a flowchart thereof is depicted in FIG. 5. The method is used in a device for generating an entity relationship extraction model (hereinafter referred to as the device), such as the apparatus 1 for generating an entity relationship extraction model according to the first embodiment. The device comprises a memory, a transceiver interface and a processor. The memory stores an entity relationship database, such as the entity relationship database 400 according to the first embodiment, which comprises at least a plurality of entity information and a plurality of relationship information. The method generates the entity relationship extraction model through steps S501 to S507 of the pre-labeling program and step S509 of the training model program.
In some embodiments, the entity relationship database is generated by a crawler program and an entity relationship database construction program. The crawler program comprises the following steps: collecting a plurality of knowledge base data contents, wherein each knowledge base data content comprises a plurality of item names and an item text corresponding to each item name; and performing sentence segmentation on the item texts to generate input data. The entity relationship database construction program comprises the following steps: inputting the input data into an entity relationship extraction system to generate output data, wherein the output data comprises a plurality of triple data, and each triple data comprises a plurality of entity information, at least one relationship information and a confidence score; and, based on the confidence scores, storing in the entity relationship database the triple data in the output data whose confidence scores exceed a preset value.
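The two programs above can be illustrated with a minimal sketch. It is not the patented implementation: the `Triple` type, the punctuation-based sentence splitter, the function names and the example threshold are all assumptions made for illustration, and the entity relationship extraction system that would produce the triples is left out.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    """One triple data item (hypothetical layout)."""
    head: str          # first entity information
    relation: str      # relationship information
    tail: str          # second entity information
    confidence: float  # confidence score from the extraction system


def split_sentences(item_text: str) -> list[str]:
    """Sentence segmentation on common end-of-sentence punctuation (assumed rule)."""
    parts = re.split(r"(?<=[.!?。！？])\s*", item_text)
    return [p.strip() for p in parts if p.strip()]


def filter_triples(triples: list[Triple], preset_value: float) -> list[Triple]:
    """Keep only the triples whose confidence score exceeds the preset value."""
    return [t for t in triples if t.confidence > preset_value]


sentences = split_sentences("Alice founded Acme. Acme is based in Paris.")
candidates = [
    Triple("Alice", "founded", "Acme", 0.92),
    Triple("Acme", "founded", "Paris", 0.31),
]
database = filter_triples(candidates, preset_value=0.8)
```

Only the first candidate would be stored here, because the second falls below the assumed preset value of 0.8.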
The following describes steps S501 to S507 of the pre-labeling program. First, in step S501, the device receives a text to be labeled.
Next, in step S503, the device generates at least one piece of entity information to be labeled corresponding to each field and at least one piece of relationship information to be labeled corresponding to each field, based on a plurality of fields in the text to be labeled and on the entity information and the relationship information in the entity relationship database. In some embodiments, this generation comprises: comparing the fields in the text to be labeled with the entity information in the entity relationship database to generate the at least one piece of entity information to be labeled corresponding to each field; and comparing each field that contains at least two pieces of entity information to be labeled with the relationship information in the entity relationship database to generate the at least one piece of relationship information to be labeled corresponding to each field.
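The comparison in step S503 can be sketched as a dictionary lookup. This is only an illustration under stated assumptions: the database is modeled as two sets of strings, matching is plain substring matching, and the name `find_candidates` is hypothetical. The text does not specify how the comparison is actually carried out.

```python
def find_candidates(
    field: str, entities: set[str], relations: set[str]
) -> tuple[list[str], list[str]]:
    """Return the entity and relationship information to be labeled for one field."""
    # Compare the field with the entity information in the database.
    found_entities = sorted((e for e in entities if e in field), key=field.index)

    # Only fields containing at least two entities are compared with the
    # relationship information.
    found_relations: list[str] = []
    if len(found_entities) >= 2:
        found_relations = sorted((r for r in relations if r in field), key=field.index)
    return found_entities, found_relations


known_entities = {"Alice", "Acme", "Bob"}
known_relations = {"founded", "acquired"}

two_entities = find_candidates("Alice founded Acme in Paris", known_entities, known_relations)
one_entity = find_candidates("Alice founded a company", known_entities, known_relations)
```

In this sketch the second field yields no relationship candidates even though it contains the word "founded", because it contains only one known entity.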
Then, in step S505, the device labels the at least one piece of entity information to be labeled and the at least one piece of relationship information to be labeled of each field according to an improved labeling format to generate at least one piece of labeled entity information and at least one piece of labeled relationship information. In some embodiments, the improved labeling format is composed of a conventional sequence labeling format together with an entity tag and a relationship tag corresponding to the conventional sequence labeling format.
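One possible reading of such a format is sketched below. It assumes that the conventional sequence labeling format is BIO (begin/inside/outside) and that the entity tag and relationship tag are the suffixes `ENT` and `REL`; neither the scheme nor the tag names are given in this passage, and `label_field` is a hypothetical helper.

```python
def label_field(
    tokens: list[str],
    entity_spans: list[tuple[int, int]],
    relation_spans: list[tuple[int, int]],
) -> list[str]:
    """Assign one label per token: BIO prefix plus an entity or relationship tag.

    Spans are (start, end) token offsets with an exclusive end.
    """
    labels = ["O"] * len(tokens)
    for tag, spans in (("ENT", entity_spans), ("REL", relation_spans)):
        for start, end in spans:
            labels[start] = f"B-{tag}"          # first token of the span
            for i in range(start + 1, end):
                labels[i] = f"I-{tag}"          # remaining tokens of the span
    return labels


tokens = ["Alice", "founded", "Acme", "Corp"]
labels = label_field(tokens, entity_spans=[(0, 1), (2, 4)], relation_spans=[(1, 2)])
```

Carrying both tag types in a single label sequence lets one sequence labeler predict entities and relationships together, which is one plausible motivation for combining them.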
Next, in step S507, the device generates a plurality of combinations from the at least one piece of labeled entity information and the at least one piece of labeled relationship information and stores the combinations in the entity relationship database. In some embodiments, generating the combinations comprises: for each field, generating the combinations of the at least one piece of labeled entity information and the at least one piece of labeled relationship information according to the sequence in which the labeled entity information and the labeled relationship information appear in that field.
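A sketch of sequence-based combination follows. The pairing rule used here (each labeled relationship is combined with every labeled entity before it and every labeled entity after it in the field) is an assumption chosen for illustration; the text only states that combinations follow the sequence within the field. `Labeled` and `make_combinations` are hypothetical names.

```python
from typing import NamedTuple


class Labeled(NamedTuple):
    text: str      # labeled span
    kind: str      # "ENT" for entity information, "REL" for relationship information
    position: int  # offset of the span within the field


def make_combinations(items: list[Labeled]) -> list[tuple[str, str, str]]:
    """Build (entity, relationship, entity) combinations in field order."""
    ordered = sorted(items, key=lambda item: item.position)
    combos: list[tuple[str, str, str]] = []
    for i, rel in enumerate(ordered):
        if rel.kind != "REL":
            continue
        heads = [x.text for x in ordered[:i] if x.kind == "ENT"]       # entities before
        tails = [x.text for x in ordered[i + 1:] if x.kind == "ENT"]   # entities after
        combos.extend((h, rel.text, t) for h in heads for t in tails)
    return combos


combos = make_combinations([
    Labeled("Acme", "ENT", 2),
    Labeled("Alice", "ENT", 0),
    Labeled("founded", "REL", 1),
])
```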
Step S509 of the training model program is described next. In step S509, the device inputs the combinations into a pre-trained language model to generate an entity relationship extraction model.
In some embodiments, the training model program further comprises: connecting an input layer and a sequence layer in series with the pre-trained language model to reduce the complexity of model training, wherein the input layer divides the fields into a plurality of words that serve as the input of the pre-trained language model, and the sequence layer performs an analysis operation based on the improved labeling format to generate the entity information and the relationship information in the text; and inputting the combinations in the entity relationship database, which include the improved labeling format, into the input layer, and generating the entity relationship extraction model by means of the pre-trained language model together with the sequence layer.
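The roles of the input layer and the sequence layer can be sketched without the language model itself. The sketch below assumes whitespace word splitting for the input layer and BIO labels with hypothetical `ENT` and `REL` tags for the analysis operation. The pre-trained language model that would predict the labels is omitted, so the labels are supplied by hand.

```python
def tokenize(field: str) -> list[str]:
    """Input layer stand-in: divide a field into words (whitespace split assumed)."""
    return field.split()


def decode_labels(tokens: list[str], labels: list[str]) -> tuple[list[str], list[str]]:
    """Sequence layer stand-in: parse per-word labels back into entities and relationships."""
    spans: list[tuple[str, str]] = []
    current_tag, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, tag = label.partition("-")
        if prefix == "B" or (prefix == "I" and tag != current_tag):
            # A new span starts; close the previous one if any.
            if current_tokens:
                spans.append((current_tag, " ".join(current_tokens)))
            current_tag, current_tokens = tag, [token]
        elif prefix == "I":
            current_tokens.append(token)  # continue the current span
        else:
            # "O": outside any span; close the previous one if any.
            if current_tokens:
                spans.append((current_tag, " ".join(current_tokens)))
            current_tag, current_tokens = None, []
    if current_tokens:
        spans.append((current_tag, " ".join(current_tokens)))
    entities = [text for tag, text in spans if tag == "ENT"]
    relations = [text for tag, text in spans if tag == "REL"]
    return entities, relations


words = tokenize("Alice founded Acme Corp today")
predicted = ["B-ENT", "B-REL", "B-ENT", "I-ENT", "O"]  # would come from the model
entities, relations = decode_labels(words, predicted)
```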
In addition to the above steps, the second embodiment can also perform all the operations and steps of the apparatus 1 for generating the entity relationship extraction model described in the first embodiment, and has the same functions and achieves the same technical effects. Those skilled in the art can directly understand, based on the first embodiment, how the second embodiment performs these operations and steps, so the details are not repeated here.
In summary, training a conventional entity relationship extraction model usually requires a long training time and a large amount of input data generated by manual labeling/intervention to be effective. The method for generating the entity relationship extraction model is built on a pre-trained model. Through the pre-labeling program, it quickly labels input data and expands the entity relationship database, automatically generating a large amount of data without human intervention, so that the entity relationship extraction model can be trained quickly. In addition, the invention further speeds up the training of the entity relationship extraction model through the information carried by the improved labeling format. The invention thereby overcomes the drawbacks of the prior art, in which the entity relationship extraction model requires the intervention of experts or scholars, consumes a large amount of manual labeling cost and time, and cannot be adapted quickly and flexibly to different domains.
The above embodiments merely illustrate some implementations of the present invention and explain its technical features; they are not intended to limit the scope of protection of the present invention. Any modification or equivalent arrangement that can be readily made by a person skilled in the art falls within the scope of the present invention, and the scope of protection of the present invention is defined by the claims.

Claims (12)

1. An apparatus for generating an entity relationship extraction model, comprising:
a memory for storing an entity relationship database, wherein the entity relationship database at least comprises a plurality of entity information and a plurality of relationship information; and
a processor electrically connected to the memory for executing a pre-labeling program and a training model program, wherein the pre-labeling program comprises the following steps:
receiving a text to be labeled;
generating at least one piece of entity information to be labeled corresponding to each field and at least one piece of relationship information to be labeled corresponding to each field based on a plurality of fields in the text to be labeled and the plurality of entity information and the plurality of relationship information in the entity relationship database;
labeling the at least one entity information to be labeled and the at least one relationship information to be labeled of each field according to an improved labeling format to generate at least one labeled entity information and at least one labeled relationship information; and
generating a plurality of combinations from the at least one labeled entity information and the at least one labeled relationship information and storing the combinations in the entity relationship database;
wherein the training model program comprises the following steps:
based on a pre-trained language model, inputting the combinations into the pre-trained language model to generate an entity relationship extraction model.
2. The apparatus for generating an entity relationship extraction model according to claim 1, wherein the entity relationship database is generated by a crawler program and an entity relationship database construction program, wherein the crawler program comprises the following steps:
collecting a plurality of knowledge base data contents, wherein each knowledge base data content comprises a plurality of item names and an item text corresponding to each item name; and
performing sentence segmentation on the item texts to generate input data;
wherein the entity relationship database construction program comprises the following steps:
inputting the input data into an entity relationship extraction system to generate output data, wherein the output data comprises a plurality of triple data, and each triple data comprises a plurality of entity information, at least one relationship information and a confidence score; and
based on the confidence scores, storing in the entity relationship database the triple data in the output data whose confidence scores exceed a preset value.
3. The apparatus for generating an entity relationship extraction model according to claim 1, wherein generating the at least one piece of entity information to be labeled corresponding to each field and the at least one piece of relationship information to be labeled corresponding to each field comprises:
comparing the plurality of fields in the text to be labeled with the plurality of entity information in the entity relationship database to generate the at least one piece of entity information to be labeled corresponding to each field; and
comparing each field containing at least two pieces of entity information to be labeled with the plurality of relationship information in the entity relationship database to generate the at least one piece of relationship information to be labeled corresponding to each field.
4. The apparatus for generating an entity-relationship extraction model as claimed in claim 1, wherein generating the plurality of combinations from the at least one labeled entity information and the at least one labeled relationship information comprises:
for each field, generating the plurality of combinations of the at least one labeled entity information and the at least one labeled relationship information according to the sequence of the at least one labeled entity information and the at least one labeled relationship information within that field.
5. The apparatus for generating an entity relationship extraction model according to claim 1, wherein the training model program further comprises:
connecting an input layer and a sequence layer in series with the pre-trained language model; and
inputting the plurality of combinations in the entity relationship database, which include the improved labeling format, into the input layer, and generating the entity relationship extraction model by means of the pre-trained language model together with the sequence layer.
6. The apparatus for generating an entity relationship extraction model according to claim 1, wherein the improved labeling format is composed of a conventional sequence labeling format together with an entity tag and a relationship tag corresponding to the conventional sequence labeling format.
7. A method for generating an entity relationship extraction model, for use in a device for generating an entity relationship extraction model, the device comprising a memory and a processor, the memory storing an entity relationship database, wherein the entity relationship database comprises a plurality of entity information and a plurality of relationship information, the method being performed by the processor and comprising the steps of:
executing a pre-labeling program and a training model program, wherein the pre-labeling program comprises the following steps:
receiving a text to be labeled;
generating at least one piece of entity information to be labeled corresponding to each field and at least one piece of relationship information to be labeled corresponding to each field based on a plurality of fields in the text to be labeled and the plurality of entity information and the plurality of relationship information in the entity relationship database;
labeling the at least one entity information to be labeled and the at least one relationship information to be labeled of each field according to an improved labeling format to generate at least one labeled entity information and at least one labeled relationship information; and
generating a plurality of combinations from the at least one labeled entity information and the at least one labeled relationship information and storing the combinations in the entity relationship database;
wherein the training model program comprises the following steps:
based on a pre-trained language model, inputting the combinations into the pre-trained language model to generate an entity relationship extraction model.
8. The method for generating an entity relationship extraction model according to claim 7, wherein the entity relationship database is generated by a crawler program and an entity relationship database construction program, wherein the crawler program comprises the following steps:
collecting a plurality of knowledge base data contents, wherein each knowledge base data content comprises a plurality of item names and an item text corresponding to each item name; and
performing sentence segmentation on the item texts to generate input data;
wherein the entity relationship database construction program comprises the following steps:
inputting the input data into an entity relationship extraction system to generate output data, wherein the output data comprises a plurality of triple data, and each triple data comprises a plurality of entity information, at least one relationship information and a confidence score; and
based on the confidence scores, storing in the entity relationship database the triple data in the output data whose confidence scores exceed a preset value.
9. The method for generating an entity relationship extraction model according to claim 7, wherein generating the at least one piece of entity information to be labeled corresponding to each field and the at least one piece of relationship information to be labeled corresponding to each field comprises:
comparing the plurality of fields in the text to be labeled with the plurality of entity information in the entity relationship database to generate the at least one piece of entity information to be labeled corresponding to each field; and
comparing each field containing at least two pieces of entity information to be labeled with the plurality of relationship information in the entity relationship database to generate the at least one piece of relationship information to be labeled corresponding to each field.
10. The method of generating an entity relationship extraction model of claim 7, wherein generating the plurality of combinations from the at least one labeled entity information and the at least one labeled relationship information comprises:
for each field, generating the plurality of combinations of the at least one labeled entity information and the at least one labeled relationship information according to the sequence of the at least one labeled entity information and the at least one labeled relationship information within that field.
11. The method for generating an entity relationship extraction model according to claim 7, wherein the training model program further comprises:
connecting an input layer and a sequence layer in series with the pre-trained language model; and
inputting the plurality of combinations in the entity relationship database, which include the improved labeling format, into the input layer, and generating the entity relationship extraction model by means of the pre-trained language model together with the sequence layer.
12. The method for generating an entity relationship extraction model according to claim 7, wherein the improved labeling format is composed of a conventional sequence labeling format together with an entity tag and a relationship tag corresponding to the conventional sequence labeling format.
CN202110996377.7A 2021-08-27 2021-08-27 Device and method for generating entity relation extraction model Pending CN115730017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110996377.7A CN115730017A (en) 2021-08-27 2021-08-27 Device and method for generating entity relation extraction model

Publications (1)

Publication Number Publication Date
CN115730017A true CN115730017A (en) 2023-03-03

Family

ID=85290450

Country Status (1)

Country Link
CN (1) CN115730017A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination