WO2023116561A1

WO2023116561A1 - Entity extraction method and apparatus, and electronic device and storage medium

Info

Publication number: WO2023116561A1
Application number: PCT/CN2022/139496
Authority: WO
Inventors: 刘钰; 贾梦妮; 黄鹏; 邱杰; 刘德安
Original assignee: 中电信数智科技有限公司
Priority date: 2021-12-24
Filing date: 2022-12-16
Publication date: 2023-06-29
Also published as: CN114265919A

Abstract

An entity extraction method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring first input data, wherein the first input data comprises text to be subjected to extraction and various preset a priori knowledge (S101); acquiring second input data, wherein the second input data comprises text to be subjected to extraction, and an entity and an entity type which have been identified from the text to be subjected to extraction (S102); inputting the first input data into a preset first extraction model to acquire a first extraction result, wherein the first extraction result comprises each extracted entity and the type, position and probability of the entity (S103); inputting the second input data into a preset second extraction model to acquire a second extraction result, wherein the second extraction result comprises the type, position, subject-object type and probability of each extracted entity (S104); and acquiring an optimal extraction result according to the first extraction result and the second extraction result (S105). In this way, the problem of the poor accuracy of entity extraction in the prior art can be alleviated.

Description

Entity extraction method, apparatus, electronic device and storage medium

Technical field

The present application relates to the technical field of data processing, and in particular to an entity extraction method, apparatus, electronic device and storage medium.

Background art

At present, knowledge graphs are typically built from plain text data by using relatively static entity dictionaries and rule templates to obtain entities and entity relationships. This approach has poor timeliness. If an entity in the threat intelligence has not been added to the entity dictionary, the entity will be missed; if the rule templates fail to cover a type that appears in the threat intelligence, labels will likewise be missed, and accuracy decreases. Moreover, obtaining entities and entity relationships through rule templates requires a large number of professionals to maintain the templates, which consumes considerable manpower. In addition, in recent years knowledge graph construction has borrowed ideas from deep learning and used BiLSTM+CRF for knowledge extraction. Although accuracy has improved, it still cannot meet the requirements of practical use.

Summary of the invention

The purpose of the embodiments of the present application is to provide an entity extraction method, apparatus, electronic device and storage medium, so as to alleviate the problem of poor accuracy of entity extraction in the prior art.

The present invention is implemented as follows:

In a first aspect, an embodiment of the present application provides an entity extraction method. The method includes: acquiring first input data, where the first input data includes the text to be extracted and preset prior knowledge, and the prior knowledge includes entity definitions or vocabulary characterizing each entity type; acquiring second input data, where the second input data includes the text to be extracted and the entities and entity types already identified in the text to be extracted; inputting the first input data into a preset first extraction model to obtain a first extraction result, where the first extraction result includes each extracted entity and the type, position and probability of the entity; inputting the second input data into a preset second extraction model to obtain a second extraction result, where the second extraction result includes the type, position, subject-object type and probability of each extracted entity; and obtaining an optimal extraction result according to the first extraction result and the second extraction result, where the optimal extraction result includes each entity corresponding to the text to be extracted and the type, subject-object type and position of the entity.

In the embodiments of the present application, the first input data is input into the preset first extraction model to obtain the first extraction result, and the second input data is input into the preset second extraction model to obtain the second extraction result. The first extraction model can thus use the prior knowledge in the first input data, and the second extraction model can use the entities and entity types marked in the second input data, to understand the semantics of the text to be extracted faster and more accurately, which improves the accuracy of extracting entities from the text to be extracted. Moreover, an optimal extraction result is obtained from the first extraction result and the second extraction result, which further improves the accuracy of entity extraction.

In combination with the technical solution provided in the first aspect, in some possible implementations, acquiring the first input data includes: splicing the text to be extracted with each definition text in a preset entity definition library to obtain the first input data; or splicing the text to be extracted with preset vocabulary characterizing each entity type to obtain the first input data.

In the embodiment of the present application, in this manner, the first input data can be acquired quickly and accurately.

In combination with the technical solution provided in the first aspect, in some possible implementations, acquiring the second input data includes: matching the text to be extracted by means of an AC automaton to obtain the entities corresponding to the text to be extracted and the type corresponding to each entity; and splicing the text to be extracted with the matched entities and their corresponding types to obtain the second input data.

In the embodiment of the present application, in this manner, the second input data can be acquired quickly and accurately.

In combination with the technical solution provided in the first aspect, in some possible implementations, the first extraction model is obtained by the following steps: acquiring a first training set, where the first training set includes first training samples and first labels, each first training sample is the text obtained by splicing a preset text with one item of prior knowledge, and the first label is the preset text after entity labeling; and training an initial first extraction model with the first training set to obtain the first extraction model.

In combination with the technical solution provided in the first aspect, in some possible implementations, the second extraction model is obtained by the following steps: acquiring a second training set, where the second training set includes second training samples and second labels, each second training sample is the text obtained by matching a preset text with an AC automaton to find the corresponding entities and entity types and then splicing them with the preset text, and the second label is the preset text after entity labeling; and training an initial second extraction model with the second training set to obtain the second extraction model.

In combination with the technical solution provided in the first aspect, in some possible implementations, obtaining the optimal extraction result according to the first extraction result and the second extraction result includes: combining, according to a preset weight ratio between the first extraction model and the second extraction model, the probability of each entity in the first extraction result with the probability of that entity in the second extraction result to obtain a total probability for each entity; and comparing the total probability with a preset threshold to obtain the entities whose total probability is greater than the preset threshold, together with the type, subject-object type and position of each such entity. The optimal extraction result includes all entities whose total probability is greater than the preset threshold and their corresponding types, subject-object types and positions.

In the embodiments of the present application, in this manner, the outputs of the first extraction model and the second extraction model (that is, the first extraction result and the second extraction result) can be fused according to the entity extraction accuracy of each model, which improves the accuracy of the optimal extraction result.

In combination with the technical solution provided in the first aspect, in some possible implementations, the method further includes: acquiring entity pairs according to the entities in the optimal extraction result; marking the text to be extracted according to the entity pairs and the optimal extraction result; inputting the marked text to be extracted into a preset RoBERTa model to obtain the encoding of each entity in the entity pair; and splicing the encodings corresponding to the first character of each entity in the entity pair and classifying the spliced encoding by relationship according to a preset classification algorithm to obtain the entity relationship corresponding to the entity pair. In the embodiments of the present application, because the optimal extraction result is highly accurate and includes each entity corresponding to the text to be extracted together with its type, subject-object type and position, the entity pairs derived from the entities in the optimal extraction result are themselves more accurate. In addition, because the text to be extracted is marked according to the optimal extraction result and the entity pairs before it is encoded by the preset RoBERTa model, and the encodings of the first characters of the two entities are spliced and classified, the entity relationship corresponding to each entity pair can be obtained accurately from the marked text.

In a second aspect, an embodiment of the present application provides an entity extraction apparatus. The apparatus includes: an acquisition module, configured to acquire first input data, where the first input data includes the text to be extracted and preset prior knowledge, and the prior knowledge includes entity definitions or vocabulary characterizing each entity type, and to acquire second input data, where the second input data includes the text to be extracted and the entities and entity types already identified in the text to be extracted; an extraction module, configured to input the first input data into a preset first extraction model to obtain a first extraction result, where the first extraction result includes each extracted entity and the type, position and probability of the entity, and to input the second input data into a preset second extraction model to obtain a second extraction result, where the second extraction result includes the type, position, subject-object type and probability of each extracted entity; and a processing module, configured to obtain an optimal extraction result according to the first extraction result and the second extraction result, where the optimal extraction result includes each entity corresponding to the text to be extracted and the type, subject-object type and position of the entity.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the processor is connected to the memory; the memory is configured to store a program; and the processor is configured to call the program stored in the memory to execute the method provided by the embodiment of the first aspect and/or by any possible implementation of the embodiment of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it performs the method provided by the embodiment of the first aspect and/or by any possible implementation of the embodiment of the first aspect.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without creative effort.

FIG. 1 is a flow chart of steps of an entity extraction method provided by an embodiment of the present application.

FIG. 2 is a block diagram of the modules of an entity extraction apparatus provided by an embodiment of the present application.

FIG. 3 is a block diagram of modules of an electronic device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

In view of the poor accuracy of entity extraction in the prior art, the inventors of the present application proposed the following embodiments after research and exploration to solve the above problems.

An embodiment of the present application provides a method for entity extraction, which is used for entity extraction of plain text data. The entity extraction method includes a preset first extraction model and a second extraction model. In order to facilitate the subsequent description of the entity extraction method, the acquisition of the above-mentioned first extraction model and the second extraction model will be described below first.

The acquisition steps of the first extraction model are as follows:

Acquire a first training set. The first training set includes first training samples and first labels; each first training sample is the text obtained by splicing a preset text with one item of prior knowledge, and the first label is the preset text after entity labeling. Train an initial first extraction model with the first training set to obtain the first extraction model.

For example, the preset text is: The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system.

When the prior knowledge consists of the definition of computer virus, the definition of organization and the definition of operating system, the first training samples are, respectively:

"[CLS] A computer virus is a set of computer instructions or program code that is inserted into a computer program by its author, destroys computer functions or data, affects the normal use of the computer, and is capable of self-replication [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]" (the computer virus definition);

"[CLS] An organization is an organic whole formed by two or more individuals in order to achieve a common goal [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]" (the organization definition); and

"[CLS] An operating system is a computer program that manages computer hardware and software resources [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]" (the operating system definition).

When the prior knowledge is "computer virus", "organization" and "operating system" (that is, the vocabulary characterizing each entity type), the first training samples are, respectively:

"[CLS] computer virus [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]";

"[CLS] organization [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]"; and

"[CLS] operating system [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]".

It should be noted that each item of prior knowledge may be spliced at the beginning or at the end of the preset text, which is not limited here. Moreover, when prior knowledge is spliced with the preset text, the corresponding separators must be added, namely [CLS] and [SEP] in the above examples: [CLS] is placed at the very beginning of the spliced text, and [SEP] separates the prior knowledge from the preset text.
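The splicing described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the applicant's implementation: the function name `splice_prior_knowledge` is hypothetical, the separators are written as literal strings (a real tokenizer would normally insert its own special tokens), and the prior knowledge is placed before the text although the description allows either end.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


def splice_prior_knowledge(text: str, prior_knowledge: List[str]) -> List[str]:
    """Build one spliced sample per item of prior knowledge.

    Each item is either an entity definition or a word naming an entity type.
    Assumption: the prior knowledge is placed in front of the text.
    """
    return [f"{CLS} {item} {SEP} {text} {SEP}" for item in prior_knowledge]


preset_text = (
    "The relevant research team has recently discovered a new Dacls remote "
    "access Trojan horse variant, which is associated with the Lazarues Group "
    "in a certain country and is specially designed for the A operating system."
)
samples = splice_prior_knowledge(
    preset_text, ["computer virus", "organization", "operating system"]
)
for sample in samples:
    print(sample)
```

One sample is produced per entity type, so a text is passed through the first extraction model once for each item of prior knowledge.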

Further, the first label is the preset text after entity labeling; that is, the entities in the preset text and the type of each entity are marked in the preset text. From the marked text, the position of each entity, namely its start position and end position in the preset text, can also be obtained. Continuing with the above preset text, each Chinese character counts as one position and each English letter counts as one position (positions are counted on the original Chinese text). The position of Dacls in the preset text is then "16 20" (start position 16, end position 20), and the position of the Trojan horse variant is "25 31". In addition, the entity type of Dacls and of the Trojan horse variant in the preset text is computer virus, the entity type of Lazarues is organization, and the entity type of A is operating system. It should be noted that the BIESO labeling scheme can be used when setting the first label. The BIESO labeling scheme is well known to those skilled in the art and is not described here.
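As an illustration of how start and end positions relate to BIESO tags, the following Python sketch converts entity spans into one tag per character position. It assumes the usual reading of the scheme (B = begin, I = inside, E = end, S = single-character entity, O = outside) and 1-based inclusive positions as in the example above; the function name `bieso_tags` and the toy spans are hypothetical.

```python
from typing import List, Tuple


def bieso_tags(length: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Return one BIESO tag per character position.

    `spans` holds (start, end, entity_type) with 1-based inclusive positions.
    """
    tags = ["O"] * length
    for start, end, etype in spans:
        if start == end:
            # A one-character entity gets the single tag.
            tags[start - 1] = f"S-{etype}"
            continue
        tags[start - 1] = f"B-{etype}"
        for i in range(start, end - 1):  # 0-based indices strictly inside the span
            tags[i] = f"I-{etype}"
        tags[end - 1] = f"E-{etype}"
    return tags


# Toy example: an 8-character text with a "vul" entity at positions 2..4
# and a one-character "sys" entity at position 6.
tags = bieso_tags(8, [(2, 4, "vul"), (6, 6, "sys")])
print(tags)
```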

It should be noted that the first extraction model consists of the RoBERTa pre-trained model and two classifiers, where the two classifiers form the task layer. Specifically, the task layer trains the two classifiers, which predict the start position and the end position of an entity span, respectively. The loss value LossA of the initial first extraction model is obtained from the prediction results as the sum of the losses of the two classifiers. For the classifier that predicts the start of an entity span, loss(Start) = CE(predicted start, label start); for the classifier that predicts the end of an entity span, loss(End) = CE(predicted end, label end); and LossA = loss(Start) + loss(End), where CE denotes the cross-entropy loss. The initial first extraction model is trained with the first training set to obtain the first extraction model. The structure of the RoBERTa pre-trained model, the training method and the classifiers can adopt technical means commonly used in this field and are not described here.
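The loss LossA = loss(Start) + loss(End) can be illustrated with a small dependency-free Python sketch. The assumptions here are not taken from the application: each classifier is treated as a per-position binary classifier (class 1 means "this position is a span start" or "span end"), the cross-entropy is averaged over positions, and the names `cross_entropy` and `loss_a` are hypothetical.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over positions; logits[i] holds one score per class."""
    total = 0.0
    for scores, label in zip(logits, labels):
        peak = max(scores)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
    return total / len(labels)


def loss_a(start_logits: List[List[float]], start_labels: List[int],
           end_logits: List[List[float]], end_labels: List[int]) -> float:
    """LossA = loss(Start) + loss(End)."""
    return (cross_entropy(start_logits, start_labels)
            + cross_entropy(end_logits, end_labels))


# Three positions; the entity span starts at position 1 and ends at position 3.
start_logits = [[0.2, 2.0], [1.5, -1.0], [1.0, -0.5]]
end_logits = [[1.2, -0.8], [0.9, 0.1], [-0.3, 1.7]]
print(loss_a(start_logits, [1, 0, 0], end_logits, [0, 0, 1]))
```

In practice the same computation would normally be delegated to the loss function of a deep learning framework; the sketch only makes the sum of the two terms explicit.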

The acquisition steps of the second extraction model are:

Acquire a second training set. The second training set includes second training samples and second labels; each second training sample is the text obtained by matching a preset text with an AC automaton to find the corresponding entities and entity types and then splicing them with the preset text, and the second label is the preset text after entity labeling. Train an initial second extraction model with the second training set to obtain the second extraction model. It should be noted that the AC automaton matches against a preset entity library in which multiple entities are preset; when the preset text is matched by the AC automaton, matching is performed against each entity in the library.

For example, the preset text is: The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. Matching the preset text with the AC automaton yields four entities and their entity types: Dacls (computer virus, vul), Trojan horse variant (computer virus, vul), Lazarues (organization, org) and A (operating system, sys). The second training sample is then "The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. <vul>Dacls</vul><vul>Trojan horse variant</vul><org>Lazarues</org><sys>A</sys>".

It should be noted that each matched entity and the entity type of the entity can be spliced at the beginning of the preset text or at the end of the preset text, which is not limited here.
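The AC automaton matching and the subsequent splicing can be sketched as follows. This is a minimal Aho-Corasick implementation written for illustration, under several assumptions: the entity library is a plain dictionary from entity string to type code, the matched entities are appended at the end of the text, and the names `ACAutomaton` and `build_second_sample` are hypothetical. A production system would more likely use an existing Aho-Corasick library.

```python
from collections import deque
from typing import Dict, List, Tuple


class ACAutomaton:
    """Minimal Aho-Corasick automaton over an entity library {entity: type}."""

    def __init__(self, entity_library: Dict[str, str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # trie transitions per node
        self.fail: List[int] = [0]              # failure link per node
        self.out: List[List[str]] = [[]]        # entities recognised at each node
        self.types = dict(entity_library)
        for entity in entity_library:
            self._add(entity)
        self._build_failure_links()

    def _add(self, word: str) -> None:
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(word)

    def _build_failure_links(self) -> None:
        queue = deque(self.goto[0].values())    # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit the entities recognised along the failure link.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def match(self, text: str) -> List[Tuple[str, str, int, int]]:
        """Return (entity, type, start, end) with 1-based inclusive positions."""
        node, found = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for entity in self.out[node]:
                found.append((entity, self.types[entity], i - len(entity) + 2, i + 1))
        return found


def build_second_sample(text: str, automaton: ACAutomaton) -> str:
    """Append every matched entity, wrapped in its type tag, to the text."""
    tagged = "".join(f"<{t}>{e}</{t}>" for e, t, _, _ in automaton.match(text))
    return f"{text} {tagged}"


library = {"Dacls": "vul", "Trojan horse variant": "vul", "Lazarues": "org"}
automaton = ACAutomaton(library)
text = "A new Dacls remote access Trojan horse variant is associated with the Lazarues group."
print(build_second_sample(text, automaton))
```

The single-letter entity "A" from the example is left out of the toy library because it would also match ordinary words in an English sentence.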

Further, the second label is the same as the above-mentioned first label, which is the text after the entity labeling of the preset text, that is, the entity corresponding to the preset text and the type corresponding to the entity need to be marked in the preset text, and according to The marked text can also obtain the position of the entity in the preset text, that is, the start position and end position of the entity in the preset text. In order to avoid redundant description, no further description is given here, and for the second label, please refer to the description of the aforementioned first label.

The initial second extraction model is trained through the second training set to obtain the second extraction model. The above-mentioned training method can adopt a training method well-known to those skilled in the art, and will not be described here again.

It should be noted that the aforementioned preset text may be multiple different texts, and the aforementioned prior knowledge may include multiple different entity definitions or multiple different vocabularies characterizing each entity type, and the number is not limited here. It should also be noted that the above-mentioned second extraction model adopts the RoBERTa pre-training model and the CRF model, wherein the CRF model is the task layer. The RoBERTa pre-training model structure and the CRF model structure are well-known to those skilled in the art, and will not be described here.

After the first extraction model and the second extraction model are acquired, the first extraction model and the second extraction model can be used to perform entity extraction on the plain text data. The specific flow and steps of the above entity extraction method are described below with reference to FIG. 1 .

It should be noted that the entity extraction method provided in the embodiments of the present application is not limited to the order shown in FIG. 1 and described below.

Step S101: Obtain first input data, the first input data includes the text to be extracted and preset prior knowledge.

The prior knowledge includes entity definitions or vocabulary characterizing each entity type. The text to be extracted is preprocessed text; the preprocessing includes case conversion, removal of special symbols, and conversion of uncommon nouns in the text to be extracted into common nouns.
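A possible shape for this preprocessing is sketched below in Python. The application does not specify the direction of the case conversion, which symbols count as special, or how uncommon nouns are mapped, so everything here is an assumption: the text is lowercased, every character other than word characters, whitespace, periods, commas and hyphens is removed, and uncommon nouns are replaced using a hypothetical lookup table `synonyms`.

```python
import re
from typing import Dict


def preprocess(text: str, synonyms: Dict[str, str]) -> str:
    """Normalise case, strip special symbols, and map uncommon nouns to common ones."""
    text = text.lower()                        # assumed direction of case conversion
    text = re.sub(r"[^\w\s.,-]", "", text)     # assumed definition of "special symbols"
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    for uncommon, common in synonyms.items():
        # Whole-word replacement so that substrings of other words are untouched.
        text = re.sub(rf"\b{re.escape(uncommon)}\b", common, text)
    return text


print(preprocess("The  RAT## variant!", {"rat": "remote access trojan"}))
```

Word boundaries behave differently for Chinese text, where words are not separated by spaces, so the replacement step would need a different matching strategy there.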

Specifically, the text to be extracted is spliced with each definition text in the preset entity definition library to obtain the first input data; or the text to be extracted is spliced with the preset vocabulary characterizing each entity type to obtain the first input data.

For example: the text to be extracted is: It was recently discovered that the Mec variant is associated with the Lazzar organization.

When the prior knowledge consists of entity definitions, and the definition texts in the entity definition library include a security vulnerability definition text and an organization definition text, the first input data is "[CLS] A security vulnerability is a defect or error in the logical design of system software that can be exploited by criminals to attack or control the entire computer [SEP] It was recently discovered that the Mec variant is associated with the Lazzar organization. [SEP]" (the security vulnerability definition) and "[CLS] An organization is an organic whole formed by two or more individuals in order to achieve a common goal [SEP] It was recently discovered that the Mec variant is associated with the Lazzar organization. [SEP]" (the organization definition).

When the prior knowledge is the vocabulary characterizing each entity type, and the preset vocabulary is "security vulnerability" and "organization", the first input data is "[CLS] security vulnerability [SEP] It was recently discovered that the Mec variant is associated with the Lazzar organization. [SEP]" and "[CLS] organization [SEP] It was recently discovered that the Mec variant is associated with the Lazzar organization. [SEP]".

It should be noted that each definition text, or each word characterizing an entity type, may be spliced at the beginning or at the end of the text to be extracted, which is not limited here. Moreover, when prior knowledge is spliced with the text to be extracted, the corresponding separators must be added, namely [CLS] and [SEP] in the above examples: [CLS] is placed at the very beginning of the spliced text, and [SEP] separates the prior knowledge from the text to be extracted.

Through the above manner, the first input data can be acquired quickly and accurately.

Step S102: Acquiring second input data, the second input data includes the text to be extracted and the identified entities and entity types in the text to be extracted.

Specifically, the text to be extracted is matched by the AC automaton to obtain the entities corresponding to the text to be extracted and the type corresponding to each entity; the text to be extracted is then spliced with the matched entities and their types to obtain the second input data. It should be noted that the AC automaton matches against a preset entity library in which multiple entities are preset; when the text to be extracted is matched by the AC automaton, matching is performed against each entity in the library.

For example, the text to be extracted is: It was recently discovered that the Mec variant is associated with the Lazzar organization. Matching the text to be extracted with the AC automaton yields two entities and their entity types, Mec variant (virus, vul) and Lazzar (organization, org), and the second input data is then "It was recently discovered that the Mec variant is associated with the Lazzar organization. <vul>Mec variant</vul><org>Lazzar organization</org>".

It should be noted that each matched entity and the entity type of the entity may be spliced at the beginning of the text to be extracted, or may be spliced at the end of the text to be extracted, which is not limited here.

Through the above manner, the second input data can be acquired quickly and accurately.

It should be noted that the first input data and the second input data are constructed in the same way as the first training samples and the second training samples used in the model training described above; the details are not repeated here, and the corresponding parts may be referred to each other.

It should also be noted that step S101 and step S102 may be performed simultaneously, or may be performed sequentially, that is, step S101 is performed first and then step S102 is performed, or step S102 is performed first and then step S101 is performed, which is not limited here.

After the first input data and the second input data are acquired, the method continues to execute step S103.

Step S103: Input the first input data into a preset first extraction model, and obtain a first extraction result.

The first extraction result includes each extracted entity and the type, position and probability of the entity. The position is the start position and end position of the entity in the text to be extracted. For example, if the text to be extracted is "It was recently discovered that the Mec variant is associated with the Lazzar organization.", the position of the Mec variant is "5 9": 5 indicates that the M of "Mec variant" is at the fifth position in the text to be extracted, and 9 indicates that the last character of "Mec variant" is at the ninth position (positions are counted on the original Chinese text). The entity type corresponding to the Mec variant is virus. The probability of an entity is the probability, predicted by the first extraction model, that the entity appears in the text to be extracted; for example, a probability of 60% for the Mec variant means that the first extraction model predicts a 60% probability that the Mec variant appears in the text to be extracted.
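The position convention (1-based, inclusive, one position per character) can be checked with a short Python sketch. The Chinese sentence used below is an assumed reconstruction of the original example and is not quoted from the application; the function name `locate` is likewise hypothetical.

```python
from typing import Optional, Tuple


def locate(text: str, entity: str) -> Optional[Tuple[int, int]]:
    """Return the 1-based inclusive (start, end) of the first occurrence, or None."""
    index = text.find(entity)
    if index < 0:
        return None
    return index + 1, index + len(entity)


# Assumed reconstruction of the original example sentence.
text = "近期发现Mec变种关联到Lazzar组织"
print(locate(text, "Mec变种"))  # (5, 9)
```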

Step S104: Input the second input data into the preset second extraction model, and obtain the second extraction result.

The second extraction result includes the type, position, subject-object type and probability of each extracted entity. The probability of an entity is the probability, predicted by the second extraction model, that the entity appears in the text to be extracted.

In this embodiment of the application, for the type, position and probability of each entity, refer to the description of the entity type, position and probability in step S103 above, which is not repeated here. The subject-object type indicates whether the entity is in the subject position or the object position in the text to be extracted; that is, the subject-object type is determined according to the position of the entity.

It should be noted that step S103 and step S104 can be performed at the same time, or can be performed sequentially, that is, step S103 is performed first and then step S104 is performed, or step S104 is performed first and then step S103 is performed, which is not limited here.

After the first extraction result and the second extraction result are obtained, the method continues to execute step S105.

Step S105: Obtain an optimal extraction result according to the first extraction result and the second extraction result.

The optimal extraction result includes each entity corresponding to the text to be extracted and the type, subject-object type and position of the entity.

Specifically, according to the preset weight ratio between the first extraction model and the second extraction model, the probability of each entity in the first extraction result is combined with the probability of that entity in the second extraction result to obtain the total probability corresponding to each entity. The total probability is compared with the preset threshold to obtain the entities whose total probability is greater than the preset threshold, together with the type, subject-object type and position of each such entity. The optimal extraction result includes all entities whose total probability is greater than the preset threshold and their corresponding types, subject-object types and positions.

For example: the entity at the "59" position in the first extraction result is a Mec variant, and its entity type is a virus, and its probability is 60%. The entity at the "59" position in the second extraction result is a Mec variant, and its entity type is The virus and its subject-object type are the main subject, and its probability is 70%. The weight of the above-mentioned first extraction model preset is a, and the weight of the second extraction model is b. Then the total probability of the above-mentioned Mec variant is 60% a+70 %b, compare the total probability with the preset threshold, if the total probability is greater than the preset threshold, then the probability of the Mec variant appearing at the position "59" in the text to be extracted is relatively high, at this time, the Mec variant and The type, position and position of the guest of honor corresponding to the Mec variant are taken as the optimal extraction results.

It should be noted that the values of a and b may be, for example, a = 0.5 and b = 0.5, or a = 0.4 and b = 0.6, and that the preset threshold may be, for example, 0.7, 0.8, 0.85 or 0.9. In addition, the weight of the first extraction model, the weight of the second extraction model, and the preset threshold can all be set according to actual conditions.
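The weighted combination and threshold comparison of step S105 can be sketched as follows, assuming Python. The function name `fuse_results`, the dictionary layout, and the treatment of an entity found by only one model (its missing probability is counted as 0) are assumptions made for illustration; the application does not specify them.

```python
from typing import Dict, List, Tuple

# An entity is identified by its text and its (start, end) position.
Key = Tuple[str, Tuple[int, int]]


def fuse_results(first: Dict[Key, dict], second: Dict[Key, dict],
                 a: float, b: float, threshold: float) -> List[dict]:
    """Keep entities whose total probability a*p1 + b*p2 exceeds the threshold."""
    fused = []
    for key in sorted(set(first) | set(second)):
        f, s = first.get(key), second.get(key)
        # Assumption: an entity missing from one result contributes 0 there.
        p1 = f["probability"] if f else 0.0
        p2 = s["probability"] if s else 0.0
        total = a * p1 + b * p2
        if total > threshold:
            entity, position = key
            fused.append({
                "entity": entity,
                "position": position,
                "type": (s or f)["type"],
                # The subject-object type is only present in the second result.
                "role": s["role"] if s else None,
                "total_probability": total,
            })
    return fused


key = ("Mec variant", (5, 9))
first_result = {key: {"type": "virus", "probability": 0.60}}
second_result = {key: {"type": "virus", "role": "subject", "probability": 0.70}}

kept = fuse_results(first_result, second_result, a=0.5, b=0.5, threshold=0.6)
dropped = fuse_results(first_result, second_result, a=0.5, b=0.5, threshold=0.7)
```

With a = b = 0.5 the total probability of the example entity is about 0.65 (0.5 × 0.6 + 0.5 × 0.7), so it passes an illustrative threshold of 0.6 but not a threshold of 0.7. With a = 0.4 and b = 0.6 the total is about 0.66.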

In the embodiment of the present application, the first input data is input into the preset first extraction model to obtain the first extraction result, and the second input data is input into the preset second extraction model to obtain the second extraction result. This enables the first extraction model to understand the semantics of the text to be extracted faster and more accurately based on the prior knowledge in the first input data, and enables the second extraction model to do so based on the entities and entity types marked in the second input data, thereby improving the accuracy of extracting entities from the text to be extracted. An optimal extraction result can then be obtained according to the first extraction result and the second extraction result, which further improves the accuracy of extracting entities from the text to be extracted. In addition, the outputs of the two models (that is, the first extraction result and the second extraction result) are fused according to the accuracy of the entities extracted by the first extraction model and the second extraction model, thereby improving the accuracy of the optimal extraction result.

After the optimal extraction result is obtained, the entity pair and the relationship between the entity pair can also be obtained according to the optimal extraction result.

Specifically, entity pairs are obtained according to each entity in the optimal extraction result; the text to be extracted is marked according to the entity pair and the optimal extraction result; the marked text to be extracted is input into the preset RoBERTa model to obtain the encoding of each entity in the entity pair; and the encodings corresponding to the first character of each entity in the entity pair are spliced, and relationship classification is performed on the spliced encoding according to the preset classification algorithm, to obtain the entity relationship corresponding to the entity pair.

After the optimal extraction result is obtained from the text to be extracted, each entity in the optimal extraction result can be paired into an entity pair. For example, the text to be extracted is: the recently discovered Mec variant is associated with the Lazzar organization. The extracted entities in the text to be extracted are: Mec variant and Lazzar organization, then according to the above two entities, the obtained entity pair is (Mec variant, Lazzar organization).

According to the entity pair and the optimal extraction result (that is, each entity and the type, subject-object type and position of the entity), the text to be extracted is marked for each entity pair.

Taking (Mec variant, Lazzar organization) as an example, the marked text for this entity pair is: recently found that <S:vul>Mec variant<S:vul> is associated with <O:sys>Lazzar organization<O:sys>. Here, S indicates that the entity is a subject, O indicates that the entity is an object, vul indicates that the entity type is a virus, and org indicates that the entity type is an organization.
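The marking step can be sketched as follows, assuming Python. The function name `mark_text`, the use of 0-based inclusive character positions, and the type labels passed in by the caller are illustrative assumptions; the sketch only reproduces the tag layout of the example, in which the same tag is written before and after the entity.

```python
from typing import Tuple

# (start position, end position inclusive, type label), e.g. (20, 30, "vul")
Span = Tuple[int, int, str]


def mark_text(text: str, subject: Span, obj: Span) -> str:
    """Wrap the subject in <S:type> tags and the object in <O:type> tags."""
    spans = [(subject, "S"), (obj, "O")]
    # Insert from right to left so that earlier positions stay valid.
    for (start, end, label), role in sorted(spans, key=lambda s: s[0][0], reverse=True):
        tag = f"<{role}:{label}>"
        text = text[:start] + tag + text[start:end + 1] + tag + text[end + 1:]
    return text


def span_of(text: str, entity: str, label: str) -> Span:
    # Helper for the example only: locate the entity by exact string search.
    start = text.index(entity)
    return (start, start + len(entity) - 1, label)


sentence = "recently found that Mec variant is associated with Lazzar organization"
marked = mark_text(
    sentence,
    span_of(sentence, "Mec variant", "vul"),
    span_of(sentence, "Lazzar organization", "org"),
)
```

In the described method the positions would come from the optimal extraction result rather than from a string search.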

It should be noted that if more than two entities are extracted from the text to be extracted, every two entities need to be formed into an entity pair. For example, if the entities extracted from the text to be extracted are A, B, C and D, six entity pairs can be obtained from these four entities, namely: (A, B), (A, C), (A, D), (B, C), (B, D) and (C, D). Moreover, the text to be extracted needs to be marked separately for each entity pair according to that entity pair and the optimal extraction result.
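The pairing of entities can be sketched with the standard library, assuming Python; the function name `entity_pairs` is an illustrative choice.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def entity_pairs(entities: Sequence[str]) -> List[Tuple[str, str]]:
    """Form every unordered pair of extracted entities."""
    return list(combinations(entities, 2))


pairs = entity_pairs(["A", "B", "C", "D"])
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
```

For n entities this yields n × (n − 1) / 2 pairs, so the number of marked texts to encode grows quadratically with the number of extracted entities.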

The marked text corresponding to each entity pair is input into the preset RoBERTa model to obtain the encoding of each entity in the entity pair, where an encoding is generated separately for each character of each entity in the entity pair. The encodings corresponding to the first character of each entity in the entity pair are spliced, and the spliced data is then passed to a classification algorithm (such as softmax) for relationship classification, so that the entity relationship corresponding to the entity pair is obtained. The RoBERTa model and the softmax algorithm are well known to those skilled in the art and are not described here.
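The splicing and classification step can be sketched as follows, assuming Python. The encodings here are small hand-written vectors standing in for RoBERTa outputs, and the linear layer (`weights`, `bias`) and the relation labels are hypothetical; only the splice-then-softmax structure follows the description.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(subject_enc: Sequence[float], object_enc: Sequence[float],
                      weights: Sequence[Sequence[float]], bias: Sequence[float],
                      labels: Sequence[str]) -> Tuple[str, List[float]]:
    """Splice the two first-character encodings and classify the relation."""
    features = list(subject_enc) + list(object_enc)  # splicing (concatenation)
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy 2-dimensional encodings of the first character of each entity.
subject_first_char = [1.0, 0.0]
object_first_char = [0.0, 1.0]
labels = ["associated_with", "no_relation"]  # hypothetical relation labels
weights = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]  # hypothetical parameters
bias = [0.0, 0.0]

relation, probabilities = classify_relation(
    subject_first_char, object_first_char, weights, bias, labels)
```

In practice the linear layer would be trained together with the encoder, and the encoding dimension would be that of the RoBERTa hidden state rather than 2.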

In the embodiment of the present application, because an optimal extraction result with high accuracy has been obtained, forming the entity pairs from the entities in the optimal extraction result improves the accuracy of the obtained entity pairs. In addition, the text to be extracted is marked according to the optimal extraction result and the extracted entity pairs, and the marked text to be extracted is input into the preset RoBERTa model to obtain the encoding of each entity in the entity pair; the encodings corresponding to the first character of each entity in the entity pair are spliced, and relationship classification is performed on the spliced encoding according to the preset classification algorithm. In this way, the entity relationship corresponding to the entity pair can be accurately obtained from the marked text to be extracted.

Referring to FIG. 2 , based on the same inventive concept, the embodiment of the present application also provides an entity extraction device 100 , which includes: an acquisition module 101 , an extraction module 102 and a processing module 103 .

The acquisition module 101 is used to acquire the first input data, where the first input data includes the text to be extracted and preset prior knowledge, and the prior knowledge includes each entity definition or the vocabulary that characterizes each entity type; and to acquire the second input data, where the second input data includes the text to be extracted and the entities and entity types already identified in the text to be extracted.

The extraction module 102 is used to input the first input data into the preset first extraction model to obtain the first extraction result, where the first extraction result includes each extracted entity and the type, position and probability of the entity; and to input the second input data into the preset second extraction model to obtain the second extraction result, where the second extraction result includes the type, position, subject-object type and probability of each extracted entity.

The processing module 103 is configured to obtain an optimal extraction result according to the first extraction result and the second extraction result. The optimal extraction result includes each entity corresponding to the text to be extracted, and the type, subject-object type and position of the entity.

Optionally, the acquisition module 101 is specifically configured to splice the text to be extracted with each definition text in the preset entity definition library respectively to obtain the first input data; or to splice the text to be extracted with the preset vocabulary characterizing each entity type respectively to obtain the first input data.

Optionally, the acquisition module 101 is specifically configured to match the text to be extracted by means of an AC automaton to obtain the entities corresponding to the text to be extracted and the types corresponding to the entities; and to splice the text to be extracted with the matched entities and the types corresponding to the entities to obtain the second input data.
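The construction of the first input data can be sketched as follows, assuming Python. The definition texts, the `[SEP]` separator and the order of the two parts are assumptions for illustration; the application only states that the text is spliced with each item of prior knowledge respectively.

```python
from typing import Dict, List, Tuple


def build_first_inputs(text: str, prior_knowledge: Dict[str, str],
                       sep: str = " [SEP] ") -> List[Tuple[str, str]]:
    """Splice the text with each item of prior knowledge, one input per entity type."""
    return [(entity_type, f"{text}{sep}{knowledge}")
            for entity_type, knowledge in prior_knowledge.items()]


# Hypothetical entity definition library (entity type -> definition text).
definitions = {
    "virus": "a malicious program that can replicate and spread",
    "organization": "a group that carries out or is affected by attacks",
}

first_inputs = build_first_inputs(
    "the recently discovered Mec variant is associated with the Lazzar organization",
    definitions,
)
```

Each spliced string is then a separate input to the first extraction model, so one text produces as many inputs as there are entity types in the prior knowledge.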
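The dictionary matching and splicing can be sketched as follows, assuming Python. This is a compact Aho-Corasick (AC) automaton written for illustration; the class and function names, the dictionary contents, and the `[SEP]` and `entity:type` splice format are assumptions rather than details taken from the application.

```python
from collections import deque
from typing import Dict, List, Tuple

Match = Tuple[str, str, int, int]  # (entity, type, start, end inclusive)


class ACAutomaton:
    def __init__(self, dictionary: Dict[str, str]):
        # dictionary maps an entity surface form to its entity type
        self.goto: List[Dict[str, int]] = [{}]
        self.fail: List[int] = [0]
        self.out: List[List[Tuple[str, str]]] = [[]]
        for word, entity_type in dictionary.items():
            self._insert(word, entity_type)
        self._build_failure_links()

    def _insert(self, word: str, entity_type: str) -> None:
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append((word, entity_type))

    def _build_failure_links(self) -> None:
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text: str) -> List[Match]:
        node, matches = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word, entity_type in self.out[node]:
                matches.append((word, entity_type, i - len(word) + 1, i))
        return matches


def build_second_input(text: str, matches: List[Match], sep: str = " [SEP] ") -> str:
    # Splice the text with the matched entities and their types.
    hints = "; ".join(f"{word}:{entity_type}" for word, entity_type, _, _ in matches)
    return f"{text}{sep}{hints}"


automaton = ACAutomaton({"Mec variant": "virus", "Lazzar organization": "organization"})
text = "the recently discovered Mec variant is associated with the Lazzar organization"
found = automaton.search(text)
second_input = build_second_input(text, found)
```

The automaton scans the text once regardless of how many dictionary entries there are, which is the usual reason for preferring it over matching each entry separately.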

Optionally, the entity extraction device 100 also includes a construction module 104, which is used to obtain a first training set, where the first training set includes a first training sample and a first label, the first training sample is the text obtained by splicing a preset text with each item of prior knowledge respectively, and the first label is the text obtained by entity labeling of the preset text; and to train the initial first extraction model using the first training set to obtain the first extraction model.

Optionally, the construction module 104 is also used to obtain a second training set, where the second training set includes a second training sample and a second label, the second training sample is the text obtained by matching a preset text by means of the AC automaton to obtain the corresponding entities and the types corresponding to the entities and then splicing them with the preset text, and the second label is the text obtained by entity labeling of the preset text; and to train the initial second extraction model using the second training set to obtain the second extraction model.

Optionally, the processing module 103 is specifically configured to combine the probability of each entity in the first extraction result and the probability of each entity in the second extraction result according to the preset weight ratio between the first extraction model and the second extraction model to obtain the total probability corresponding to each entity; and to compare the total probability with the preset threshold to obtain each entity whose total probability is greater than the preset threshold and the type, subject-object type and position corresponding to the entity. The optimal extraction result includes all entities whose total probability is greater than the preset threshold and the type, subject-object type and position corresponding to each of these entities.

Optionally, the processing module 103 is also configured to obtain entity pairs according to each entity in the optimal extraction result; mark the text to be extracted according to the entity pair and the optimal extraction result; input the marked text to be extracted into the preset RoBERTa model to obtain the encoding of each entity in the entity pair; and splice the encodings corresponding to the first character of each entity in the entity pair and perform relationship classification on the spliced encoding according to the preset classification algorithm to obtain the entity relationship corresponding to the entity pair.

Please refer to FIG. 3 , which is a schematic structural block diagram of an electronic device 200 provided by an embodiment of the present application based on the same inventive concept, and the electronic device 200 is used for the above-mentioned entity extraction method. In the embodiment of the present application, the electronic device 200 may be, but not limited to, a personal computer (Personal Computer, PC), a smart phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a mobile Internet device (Mobile Internet Device, MID) and the like. Structurally, the electronic device 200 may include a processor 210 and a memory 220 .

The processor 210 and the memory 220 are electrically connected, directly or indirectly, to realize data transmission or interaction. For example, these components may be electrically connected to each other through one or more communication buses or signal lines. The processor 210 may be an integrated circuit chip with signal processing capabilities. The processor 210 may also be a general-purpose processor, for example, a central processing unit (Central Processing Unit, CPU), a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application. In addition, a general-purpose processor may be a microprocessor or any conventional processor or the like.

The memory 220 may be, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), and electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM). The memory 220 is used to store a program, and the processor 210 executes the program after receiving an execution instruction.

It should be understood that the structure shown in FIG. 3 is only for illustration, and the electronic device 200 provided in the embodiment of the present application may also have fewer or more components than that shown in FIG. 3 , or have a configuration different from that shown in FIG. 3 . In addition, each component shown in FIG. 3 may be realized by software, hardware or a combination thereof.

It should be noted that, as those skilled in the art can clearly understand, for convenience and brevity of description, the specific working process of the above-described system, device and unit can be found in the corresponding process of the foregoing method embodiment and is not repeated here.

Based on the same inventive concept, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed, the method provided in the above-mentioned embodiments is executed.

The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, DVD), or a semiconductor medium (for example, a Solid State Disk (SSD)).

In the embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and there may be other division methods in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some communication interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

In addition, the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Furthermore, each functional module in each embodiment of the present application may be integrated to form an independent part, each module may exist independently, or two or more modules may be integrated to form an independent part.

In this document, relational terms such as first and second are used only to distinguish one entity or operation from another, without necessarily requiring or implying any such actual relationship or sequence between these entities or operations.

The above descriptions are only examples of the present application, and are not intended to limit the scope of protection of the present application. For those skilled in the art, various modifications and changes may be made to the present application. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of this application shall be included within the protection scope of this application.

Claims

A method for entity extraction, characterized in that the method comprises:

Acquiring first input data, the first input data includes the text to be extracted and preset prior knowledge, the prior knowledge includes each entity definition or vocabulary that characterizes each entity type;

Obtaining second input data, the second input data including the text to be extracted and the identified entities and entity types in the text to be extracted;

Inputting the first input data into a preset first extraction model to obtain a first extraction result, the first extraction result includes each entity extracted and the type, location and probability of the entity;

Inputting the second input data into a preset second extraction model to obtain a second extraction result, the second extraction result includes the type, location, subject-object type and probability of each extracted entity;

According to the first extraction result and the second extraction result, an optimal extraction result is obtained, and the optimal extraction result includes each entity corresponding to the text to be extracted, and the type, subject-object type and position of the entity.
The method according to claim 1, wherein said obtaining the first input data comprises:

splicing the text to be extracted with each definition text in the preset entity definition library to obtain the first input data;

Alternatively, the to-be-extracted text is spliced with preset vocabulary representing each entity type to obtain the first input data.
The method according to claim 1, wherein said obtaining the second input data comprises:

Matching the text to be extracted by an AC automaton to obtain an entity corresponding to the text to be extracted and a type corresponding to the entity;

The text to be extracted is spliced with the matched entity and the type corresponding to the entity to obtain the second input data.
The method according to claim 1, wherein the first extraction model is obtained according to the following steps: obtaining a first training set, the first training set includes a first training sample and a first label, the first training sample is the text obtained by splicing a preset text with each item of the prior knowledge respectively, and the first label is the text obtained by entity labeling of the preset text;

Using the first training set to train an initial first extraction model to obtain the first extraction model.
The method according to claim 1, wherein the second extraction model is obtained according to the following steps: obtaining a second training set, the second training set includes a second training sample and a second label, the second training sample is the text obtained by matching a preset text through the AC automaton to obtain the corresponding entity and the type corresponding to the entity and splicing them with the preset text, and the second label is the text obtained by entity labeling of the preset text;

Using the second training set to train the initial second extraction model to obtain the second extraction model.
The method according to claim 1, wherein said obtaining an optimal extraction result according to said first extraction result and said second extraction result comprises:

Calculating on the probability of each entity in the first extraction result and the probability of each entity in the second extraction result according to the preset weight ratio between the first extraction model and the second extraction model, to obtain the total probability corresponding to each entity;

Comparing the total probability with a preset threshold, obtaining the entity whose total probability is greater than the preset threshold and the type, subject-object type and location corresponding to the entity, and the optimal extraction result includes all the entities whose total probability is greater than the preset threshold and the type, subject-object type and location corresponding to each of the entities.
The method according to claim 1, further comprising:

Obtain entity pairs according to each entity in the optimal extraction result;

Marking the text to be extracted according to the entity pair and the optimal extraction result;

Input the text to be extracted after marking into the preset RoBERTa model to obtain the encoding of each entity in the entity pair;

Splicing the codes corresponding to the first character of each entity in the entity pair, and performing relationship classification on the spliced codes according to a preset classification algorithm, to obtain the entity relationship corresponding to the entity pair.
An entity extraction device, characterized in that the device comprises:

The acquisition module is used to acquire the first input data, the first input data includes the text to be extracted and the preset prior knowledge, the prior knowledge includes each entity definition or vocabulary that characterizes each entity type; acquires the second input data, the second input data includes the text to be extracted and the identified entities and entity types in the text to be extracted;

An extraction module, configured to input the first input data into a preset first extraction model to obtain a first extraction result, the first extraction result including each extracted entity and the type, location and probability of the entity; and to input the second input data into a preset second extraction model to obtain a second extraction result, the second extraction result including the type, location, subject-object type and probability of each extracted entity;

A processing module, configured to obtain an optimal extraction result according to the first extraction result and the second extraction result, where the optimal extraction result includes each entity corresponding to the text to be extracted, and the type, subject-object type and location of the entity.
An electronic device, characterized in that it includes: a processor and a memory, the processor is connected to the memory; the memory is used to store a program; the processor is used to run the program stored in the memory, and execute the method according to any one of claims 1-7.
A computer-readable storage medium, characterized in that a computer program is stored thereon, and the computer program executes the method according to any one of claims 1-7 when executed by a computer.