CN114265919A - Entity extraction method and device, electronic equipment and storage medium

Info

Publication number
CN114265919A
Authority
CN
China
Prior art keywords
entity
text
extracted
preset
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111598982.5A
Other languages
Chinese (zh)
Inventor
刘钰
贾梦妮
黄鹏
邱杰
刘德安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Digital Intelligence Technology Co Ltd
Original Assignee
China Telecom Digital Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Digital Intelligence Technology Co Ltd
Priority to CN202111598982.5A
Publication of CN114265919A
Priority to PCT/CN2022/139496 (WO2023116561A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/35 Clustering; Classification
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The application provides an entity extraction method, an entity extraction device, electronic equipment and a storage medium. The method comprises the following steps: acquiring first input data, wherein the first input data comprises a text to be extracted and preset prior knowledge; acquiring second input data, wherein the second input data comprises the text to be extracted together with the entities and entity types identified in the text to be extracted; inputting the first input data into a preset first extraction model to obtain a first extraction result, wherein the first extraction result comprises each extracted entity and the type, position and probability of the entity; inputting the second input data into a preset second extraction model to obtain a second extraction result, wherein the second extraction result comprises each extracted entity and the type, position, subject-object type and probability of the entity; and obtaining an optimal extraction result according to the first extraction result and the second extraction result. The method can improve on the poor accuracy of entity extraction in the prior art.

Description

Entity extraction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to an entity extraction method, an entity extraction device, an electronic device, and a storage medium.
Background
At present, one method for building a knowledge graph from plain text data obtains entities and entity relationships by using a relatively static entity dictionary and rule templates. This method has poor timeliness: if an entity appearing in threat intelligence has not been added to the entity dictionary, the entity goes unlabeled; likewise, if the rule templates fail to cover a type appearing in the threat intelligence, labels are also missed, which reduces accuracy. In addition, obtaining entities and entity relationships through rule templates requires a large number of professionals to maintain the templates, which consumes considerable labor. In recent years, knowledge graph construction has also drawn on deep learning, using a BiLSTM + CRF approach to extract knowledge; although accuracy has improved, it still cannot meet practical requirements.
Disclosure of Invention
An object of the embodiments of the present application is to provide an entity extraction method, an entity extraction device, an electronic device, and a storage medium, so as to solve the problem of poor accuracy of extracting an entity in the prior art.
The invention is realized as follows:
In a first aspect, an embodiment of the present application provides an entity extraction method, where the method includes: acquiring first input data, wherein the first input data comprises a text to be extracted and preset prior knowledge, and the prior knowledge comprises definitions of entity types or vocabularies representing entity types; acquiring second input data, wherein the second input data comprises the text to be extracted together with the entities and entity types identified in the text to be extracted; inputting the first input data into a preset first extraction model to obtain a first extraction result, wherein the first extraction result comprises each extracted entity and the type, position and probability of the entity; inputting the second input data into a preset second extraction model to obtain a second extraction result, wherein the second extraction result comprises each extracted entity and the type, position, subject-object type and probability of the entity; and obtaining an optimal extraction result according to the first extraction result and the second extraction result, wherein the optimal extraction result comprises each entity corresponding to the text to be extracted and the type, subject-object type and position of the entity.
In the embodiment of the application, the first extraction result is obtained by inputting the first input data into the preset first extraction model, and the second extraction result is obtained by inputting the second input data into the preset second extraction model. The first extraction model and the second extraction model can therefore understand the semantics of the text to be extracted more quickly and accurately, based on the prior knowledge in the first input data and on the entities and entity types marked in the second input data respectively, which improves the accuracy of extracting entities from the text to be extracted. Obtaining the optimal extraction result from the first extraction result and the second extraction result further improves the accuracy of extracting entities from the text to be extracted.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, the acquiring first input data includes: splicing the text to be extracted and each definition text in a preset entity definition library respectively to obtain the first input data; or, respectively splicing the text to be extracted and preset vocabularies representing various entity types to obtain the first input data.
In the embodiment of the application, the first input data can be quickly and accurately acquired through the method.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, the acquiring the second input data includes: matching the text to be extracted through an AC automaton to obtain an entity corresponding to the text to be extracted and a type corresponding to the entity; and splicing the text to be extracted, the matched entity and the type corresponding to the entity to obtain second input data.
In the embodiment of the application, the second input data can be quickly and accurately acquired through the method.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, the first extraction model is obtained according to the following steps: acquiring a first training set, wherein the first training set comprises first training samples and first labels, each first training sample is a text formed by splicing a preset text with one item of prior knowledge, and the first label is a text obtained by performing entity labeling on the preset text; and training an initial first extraction model by using the first training set to obtain the first extraction model.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, the second extraction model is obtained according to the following steps: acquiring a second training set, wherein the second training set comprises second training samples and second labels, each second training sample is a text obtained by matching a preset text with an AC automaton to obtain the corresponding entities and then splicing the entities and their corresponding types with the preset text, and the second label is a text obtained by performing entity labeling on the preset text; and training an initial second extraction model by using the second training set to obtain the second extraction model.
With reference to the technical solution provided by the first aspect, in some possible implementation manners, the obtaining an optimal extraction result according to the first extraction result and the second extraction result includes: calculating, according to a preset weight ratio of the first extraction model to the second extraction model, the probability of each entity in the first extraction result and the probability of each entity in the second extraction result, and acquiring the total probability corresponding to each entity; and comparing the total probability with a preset threshold, and acquiring each entity whose total probability is greater than the preset threshold together with the type, subject-object type and position corresponding to the entity, wherein the optimal extraction result comprises all entities whose total probability is greater than the preset threshold and the type, subject-object type and position corresponding to each of them.
In the embodiment of the application, by the above manner, the output results (i.e. the first extraction result and the second extraction result) of the first extraction model and the second extraction model can be fused according to the accuracy of extracting the entity by the first extraction model and the second extraction model, so that the accuracy of the obtained optimal extraction result is improved.
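The fusion step described above can be illustrated with a minimal Python sketch. The text only states that the probabilities are combined according to a preset weight ratio and compared with a preset threshold; the weighted sum used here, the dictionary key names, the weights 0.4/0.6, and the threshold 0.5 are all assumptions made for illustration.

```python
def fuse_results(first, second, w_first=0.5, w_second=0.5, threshold=0.5):
    """Fuse two extraction results by a weighted sum of entity probabilities.

    Each result is a list of dicts with the (hypothetical) keys "entity",
    "type", "start", "end" and "prob"; items from the second model may also
    carry "role", standing in for the subject-object type.
    """
    merged = {}
    for weight, result in ((w_first, first), (w_second, second)):
        for item in result:
            key = (item["entity"], item["type"], item["start"], item["end"])
            slot = merged.setdefault(key, {"total": 0.0, "role": None})
            slot["total"] += weight * item["prob"]
            if item.get("role") is not None:
                slot["role"] = item["role"]

    best = []
    for (entity, etype, start, end), slot in merged.items():
        if slot["total"] > threshold:  # keep only entities above the threshold
            best.append({"entity": entity, "type": etype, "role": slot["role"],
                         "start": start, "end": end, "total": slot["total"]})
    return sorted(best, key=lambda e: e["start"])


# Illustrative values only; they are not taken from the application.
first = [
    {"entity": "Mec variant", "type": "virus", "start": 5, "end": 9, "prob": 0.6},
    {"entity": "Lazzar", "type": "organization", "start": 11, "end": 16, "prob": 0.3},
]
second = [
    {"entity": "Mec variant", "type": "virus", "start": 5, "end": 9,
     "prob": 0.9, "role": "subject"},
    {"entity": "Lazzar", "type": "organization", "start": 11, "end": 16,
     "prob": 0.5, "role": "object"},
]
best = fuse_results(first, second, w_first=0.4, w_second=0.6, threshold=0.5)
```

In this sketch an entity found by only one model contributes only that model's weighted probability, which is one possible design choice rather than something the text specifies.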
With reference to the technical solution provided by the first aspect, in some possible implementations, the method further includes: acquiring entity pairs according to each entity in the optimal extraction result; marking the text to be extracted according to the entity pair and the optimal extraction result; inputting the marked text to be extracted into a preset RoBERTa model, and acquiring the encoding of each entity in the entity pair; and splicing the encodings corresponding to the first character of each entity in the entity pair, and classifying the spliced encoding according to a preset classification algorithm to acquire the entity relationship corresponding to the entity pair.
In the embodiment of the application, because the optimal extraction result has higher accuracy and includes each entity corresponding to the text to be extracted together with the type, subject-object type and position of the entity, the entity pairs acquired from the entities in the optimal extraction result are also more accurate. In addition, the text to be extracted is marked according to the optimal extraction result and the extracted entity pair, and the marked text is input into a preset RoBERTa model to obtain the encoding of each entity in the entity pair; the encodings corresponding to the first character of each entity in the entity pair are then spliced, and the spliced encoding is classified according to a preset classification algorithm, so that the entity relationship corresponding to the entity pair can be accurately obtained from the marked text to be extracted.
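As an illustration of the first two steps above (forming entity pairs and marking the text), a minimal Python sketch follows. The text does not say how the marking is done, so the `<e1>`/`<e2>` marker tokens, the dictionary keys, and the example sentence are hypothetical; the RoBERTa encoding and the classification step are not shown.

```python
from itertools import combinations


def make_entity_pairs(entities):
    """Return every unordered pair of extracted entities."""
    return list(combinations(entities, 2))


def mark_text(text, head, tail):
    """Wrap two non-overlapping entity spans with marker tokens.

    Positions are 1-indexed and inclusive. The <e1>/<e2> markers are a
    hypothetical choice; the source text does not name any marker format.
    """
    spans = [(head["start"], head["end"], "<e1>", "</e1>"),
             (tail["start"], tail["end"], "<e2>", "</e2>")]
    # Insert from right to left so that earlier offsets stay valid.
    for start, end, open_tag, close_tag in sorted(spans, reverse=True):
        text = (text[:start - 1] + open_tag + text[start - 1:end]
                + close_tag + text[end:])
    return text


# Hypothetical example sentence and positions.
text = "Mec variant is linked to Lazzar."
entities = [
    {"entity": "Mec variant", "start": 1, "end": 11},
    {"entity": "Lazzar", "start": 26, "end": 31},
]
pairs = make_entity_pairs(entities)
marked = mark_text(text, *pairs[0])
```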
In a second aspect, an embodiment of the present application provides an entity extraction apparatus, where the apparatus includes: an acquisition module, used for acquiring first input data, wherein the first input data comprises a text to be extracted and preset prior knowledge, and the prior knowledge comprises definitions of entity types or vocabularies representing entity types, and for acquiring second input data, wherein the second input data comprises the text to be extracted together with the entities and entity types identified in the text to be extracted; an extraction module, used for inputting the first input data into a preset first extraction model to obtain a first extraction result, wherein the first extraction result comprises each extracted entity and the type, position and probability of the entity, and for inputting the second input data into a preset second extraction model to obtain a second extraction result, wherein the second extraction result comprises each extracted entity and the type, position, subject-object type and probability of the entity; and a processing module, used for obtaining an optimal extraction result according to the first extraction result and the second extraction result, wherein the optimal extraction result comprises each entity corresponding to the text to be extracted and the type, subject-object type and position of the entity.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory, the processor and the memory connected; the memory is used for storing programs; the processor is configured to invoke a program stored in the memory to perform a method as provided in the above-described first aspect embodiment and/or in combination with some possible implementations of the above-described first aspect embodiment.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the method as set forth in the above first aspect embodiment and/or in combination with some possible implementations of the above first aspect embodiment.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart illustrating steps of an entity extraction method according to an embodiment of the present disclosure.
Fig. 2 is a block diagram of an entity extraction apparatus according to an embodiment of the present disclosure.
Fig. 3 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
In view of the poor accuracy of entity extraction in the prior art, the inventors of the present application have, through research, proposed the following embodiments to solve the above problem.
The embodiment of the application provides an entity extraction method, which is used for carrying out entity extraction on plain text data. The entity extraction method uses a preset first extraction model and a preset second extraction model; for convenience of the subsequent description of the method, how the first extraction model and the second extraction model are obtained is described first.
The first extraction model is obtained by the following steps:
Acquiring a first training set, wherein the first training set comprises first training samples and first labels, each first training sample is a text formed by splicing a preset text with one item of prior knowledge, and the first label is a text obtained by performing entity labeling on the preset text; and training the initial first extraction model by using the first training set to obtain the first extraction model.
For example, the preset text is: "A research team has recently discovered a new Dacls remote access Trojan horse variant, which is related to the Lazarus group of a certain country and is specifically designed for the A operating system."
When the prior knowledge comprises a computer virus definition, an organization definition and an operating system definition, the first training samples are respectively:
"[CLS] A computer virus is a set of computer instructions or program code that is created or inserted into a computer program to destroy computer functions or data, affect the normal use of the computer, and replicate itself (i.e., the computer virus definition) [SEP] A research team has recently discovered a new Dacls remote access Trojan horse variant, which is related to the Lazarus group of a certain country and is specifically designed for the A operating system. [SEP]";
"[CLS] An organization is an organic whole composed of two or more individuals combined to achieve a common goal (i.e., the organization definition) [SEP] A research team has recently discovered a new Dacls remote access Trojan horse variant, which is related to the Lazarus group of a certain country and is specifically designed for the A operating system. [SEP]"; and
"[CLS] An operating system is a computer program that manages computer hardware and software resources (i.e., the operating system definition) [SEP] A research team has recently discovered a new Dacls remote access Trojan horse variant, which is related to the Lazarus group of a certain country and is specifically designed for the A operating system. [SEP]".
When the prior knowledge is "computer virus", "organization" and "operating system" (i.e., vocabulary representing each entity type), the first training samples are respectively:
"[CLS] computer virus [SEP] A research team has recently discovered a new Dacls remote access Trojan horse variant, which is related to the Lazarus group of a certain country and is specifically designed for the A operating system. [SEP]";
"[CLS] organization [SEP] A research team has recently discovered a new Dacls remote access Trojan horse variant, which is related to the Lazarus group of a certain country and is specifically designed for the A operating system. [SEP]"; and
"[CLS] operating system [SEP] A research team has recently discovered a new Dacls remote access Trojan horse variant, which is related to the Lazarus group of a certain country and is specifically designed for the A operating system. [SEP]".
It should be noted that the prior knowledge may be spliced at the beginning of the preset text or at the end of the preset text, which is not limited herein. When the prior knowledge is spliced to the preset text, corresponding separators, namely [CLS] and [SEP] in the above examples, need to be added to the spliced text, where [CLS] is placed at the head of the spliced text and [SEP] separates the prior knowledge from the preset text.
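The splicing described above can be sketched in a few lines of Python. The function name `build_first_samples` and its `prepend` parameter are hypothetical, and the special tokens are written out as literal strings only to mirror the examples; in practice a tokenizer for a BERT-style model usually inserts such tokens itself.

```python
def build_first_samples(text, prior_knowledge, prepend=True):
    """Splice the text with each item of prior knowledge, one sample per item.

    `prior_knowledge` is a list of definition texts or type words. With
    prepend=True the prior knowledge comes first, otherwise it comes last.
    """
    samples = []
    for knowledge in prior_knowledge:
        if prepend:
            samples.append(f"[CLS]{knowledge}[SEP]{text}[SEP]")
        else:
            samples.append(f"[CLS]{text}[SEP]{knowledge}[SEP]")
    return samples


samples = build_first_samples(
    "A new Dacls remote access Trojan horse variant has been discovered.",
    ["computer virus", "organization", "operating system"],
)
```

One sample is produced per entity type, so the first extraction model sees the same text once for each type it is asked to look for.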
Further, the first label is a text obtained by performing entity labeling on the preset text; that is, the entities in the preset text and the type corresponding to each entity need to be marked in the preset text, and the position of each entity in the preset text, namely its starting position and ending position, can also be obtained from the labeled text. Continuing with the above preset text as an example, if each Chinese character counts as one position and each English letter counts as one position, the position of Dacls in the preset text is "1620" (starting position 16, ending position 20), and the position of the Trojan horse variant in the preset text is "2531" (starting position 25, ending position 31). In addition, the entity type of both Dacls and the Trojan horse variant in the preset text is computer virus, the entity type of Lazarus is organization, and the entity type of A is operating system. Note that the first label may be annotated using the BIESO labeling scheme. The BIESO labeling scheme is well known to those skilled in the art and will not be described herein.
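For readers unfamiliar with the BIESO scheme, the following Python sketch converts entity positions into per-character tags (B = begin, I = inside, E = end, S = single-character entity, O = outside). Character-level tagging and the `B-vul` style tag names are assumptions for illustration; the text does not specify the tag format.

```python
def bieso_tags(text, entities):
    """Build one BIESO tag per character.

    `entities` is a list of (start, end, type) tuples whose positions are
    1-indexed and inclusive, matching the position convention in the text.
    """
    tags = ["O"] * len(text)
    for start, end, etype in entities:
        if start == end:
            tags[start - 1] = f"S-{etype}"  # single-character entity
            continue
        tags[start - 1] = f"B-{etype}"
        for i in range(start, end - 1):
            tags[i] = f"I-{etype}"
        tags[end - 1] = f"E-{etype}"
    return tags


# Hypothetical short text: "Dacls" occupies positions 1-5, "A" position 10.
tags = bieso_tags("Dacls on A", [(1, 5, "vul"), (10, 10, "sys")])
```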
It should be noted that the first extraction model adopts a RoBERTa pre-trained model and two classifiers, where the two classifiers form the task layer. Specifically, the task layer trains the two classifiers, which predict the start position and the end position of the entity span respectively, and the loss value LossA of the initial first extraction model is obtained from the prediction results, where LossA is the sum of the loss values of the two classifiers. For example, for the classifier that predicts the start position of the entity span, loss(start) = CE(predicted start, labeled start); for the classifier that predicts the end position of the entity span, loss(end) = CE(predicted end, labeled end); and LossA = loss(start) + loss(end), where CE denotes the cross-entropy loss. The initial first extraction model is trained on the first training set to obtain the first extraction model. The RoBERTa pre-trained model structure, the training method and the classifiers can adopt technical means commonly used in the field and are not described here.
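The loss described above can be worked through numerically with a small pure-Python sketch. It assumes that each classifier outputs one score per character position and that the cross-entropy is taken over a softmax of those scores; the text does not state the classifier form, so this is only one plausible reading, and the function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against the index `target`."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def loss_a(start_logits, end_logits, start_label, end_label):
    """LossA = loss(start) + loss(end), as described in the text."""
    return (cross_entropy(start_logits, start_label)
            + cross_entropy(end_logits, end_label))


# With uniform scores over four positions, each term equals log(4).
uniform = loss_a([0.0] * 4, [0.0] * 4, start_label=1, end_label=2)
# Scores that favor the labeled positions give a smaller loss.
confident = loss_a([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], 1, 2)
```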
The second extraction model is obtained by the following steps:
Acquiring a second training set, wherein the second training set comprises second training samples and second labels, each second training sample is a text obtained by matching a preset text with an AC automaton to obtain the corresponding entities and then splicing the entities and their corresponding types with the preset text, and the second label is a text obtained by performing entity labeling on the preset text; and training the initial second extraction model by using the second training set to obtain the second extraction model. It should be noted that the AC automaton matches against a preset entity library in which a plurality of entities are preset; when matching is performed through the AC automaton, the preset text is matched against each entity in the entity library.
For example, the preset text is: "A research team has recently discovered a new Dacls remote access Trojan horse variant, which is related to the Lazarus group of a certain country and is specifically designed for the A operating system." Through the AC automaton, four entities and the entity type corresponding to each entity can be obtained from the preset text: Dacls (computer virus, vul), Trojan horse variant (computer virus, vul), Lazarus (organization, org) and A (operating system, sys). The second training sample is then: "A research team has recently discovered a new Dacls remote access Trojan horse variant, which is related to the Lazarus group of a certain country and is specifically designed for the A operating system.<vul>Dacls</vul><vul>Trojan horse variant</vul><org>Lazarus</org><sys>A</sys>".
It should be noted that each matched entity and the entity type of the entity may be spliced at the beginning of the preset text, or may be spliced at the end of the preset text, which is not limited herein.
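To make the matching-and-splicing step concrete, here is a compact pure-Python sketch of an Aho-Corasick (AC) automaton together with a helper that appends the matched entities and their type tags to the end of the text. The class and function names, and the representation of the entity library as a dictionary from entity string to type tag, are assumptions for illustration; this is not the implementation used in the application.

```python
from collections import deque


class ACAutomaton:
    """Minimal Aho-Corasick automaton over an entity library."""

    def __init__(self, entity_library):
        # entity_library maps entity string -> type tag (hypothetical layout).
        self.goto, self.fail, self.out = [{}], [0], [[]]
        for word, etype in entity_library.items():
            self._add(word, etype)
        self._build()

    def _add(self, word, etype):
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append((word, etype))

    def _build(self):
        # Breadth-first pass that fills in failure links and merged outputs.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def search(self, text):
        """Return (start_index, entity, type) for every match, 0-indexed."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word, etype in self.out[node]:
                hits.append((i - len(word) + 1, word, etype))
        return hits


def build_second_sample(text, automaton):
    """Append each matched entity, wrapped in its type tag, to the text."""
    hits = sorted(automaton.search(text))
    return text + "".join(f"<{t}>{w}</{t}>" for _, w, t in hits)


library = {"Mec variant": "vul", "Lazzar": "org"}  # hypothetical entity library
sentence = "Recently, the Mec variant was found to be associated with Lazzar."
sample = build_second_sample(sentence, ACAutomaton(library))
```

The automaton finds all library entities in a single pass over the text, including overlapping ones, which is why it suits dictionary-style matching against a large entity library.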
Further, the second label is the same as the first label: it is a text obtained by performing entity labeling on the preset text; that is, the entities in the preset text and the type corresponding to each entity need to be marked in the preset text, and the position of each entity in the preset text, namely its starting position and ending position, can also be obtained from the labeled text. To avoid repetition, refer to the description of the first label for details of the second label.
And training the initial second extraction model through a second training set to obtain a second extraction model. The training method may be a training method known to those skilled in the art, and will not be described here.
It should be noted that there may be a plurality of different preset texts, and the prior knowledge may include a plurality of different entity definitions or a plurality of different words representing entity types; their number is not limited herein. It should be further noted that the second extraction model adopts a RoBERTa pre-trained model and a CRF model, where the CRF model is the task layer. The RoBERTa pre-trained model structure and the CRF model structure are well known to those skilled in the art and will not be described herein.
After the first extraction model and the second extraction model are obtained, entity extraction can be carried out on the plain text data by utilizing the first extraction model and the second extraction model. The following describes specific processes and steps of the entity extraction method with reference to fig. 1.
It should be noted that the entity extraction method provided in the embodiment of the present application is not limited by the order shown in fig. 1 and the following.
Step S101: acquiring first input data, wherein the first input data comprises a text to be extracted and preset prior knowledge.
The prior knowledge comprises a definition of each entity type or vocabulary representing each entity type. The text to be extracted is a preprocessed text, where the preprocessing comprises conversion of capital letters, removal of special symbols, and conversion of uncommon nouns in the text to be extracted into common nouns.
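The three preprocessing operations can be sketched as follows. The text does not give details, so several choices here are assumptions: capital letters are converted to lowercase (one possible reading of the capital letter conversion), "special symbols" are taken to be anything other than word characters, whitespace and basic punctuation, and the `ALIASES` mapping from uncommon to common nouns is a hypothetical placeholder.

```python
import re

# Hypothetical mapping from uncommon nouns to their common equivalents.
ALIASES = {"rat": "remote access trojan"}


def preprocess(text, aliases=ALIASES):
    """Apply case conversion, symbol removal and noun normalization."""
    text = text.lower()                           # capital letter conversion (assumed: lowercase)
    text = re.sub(r"[^\w\s.,;:()\-]", "", text)   # drop special symbols
    text = re.sub(r"\s+", " ", text).strip()      # tidy leftover whitespace
    for rare, common in aliases.items():          # uncommon noun -> common noun
        text = re.sub(rf"\b{re.escape(rare)}\b", common, text)
    return text


cleaned = preprocess("New Dacls RAT ★ found!")
```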
Specifically, a text to be extracted is spliced with each definition text in a preset entity definition library respectively to obtain first input data; or, respectively splicing the text to be extracted and preset vocabularies representing various entity types to obtain first input data.
For example, the text to be extracted is: "Recently, the Mec variant was found to be associated with the Lazzar organization."
When the prior knowledge is the definition of each entity type, and the definition texts in the entity definition library include a security vulnerability definition text and an organization definition text, the first input data is: "[CLS] A security vulnerability is a defect or error in the logic design of system software that can be exploited by an attacker to attack or control the whole computer (i.e., the security vulnerability definition) [SEP] Recently, the Mec variant was found to be associated with the Lazzar organization. [SEP]" and "[CLS] An organization is an organic whole composed of two or more individuals combined to achieve a common goal (i.e., the organization definition) [SEP] Recently, the Mec variant was found to be associated with the Lazzar organization. [SEP]".
When the prior knowledge is vocabulary representing each entity type and the preset vocabulary is "security vulnerability" and "organization", the first input data is: "[CLS] security vulnerability [SEP] Recently, the Mec variant was found to be associated with the Lazzar organization. [SEP]" and "[CLS] organization [SEP] Recently, the Mec variant was found to be associated with the Lazzar organization. [SEP]".
It should be noted that each definition text or each vocabulary item representing an entity type may be spliced at the beginning of the text to be extracted or at the end of the text to be extracted, which is not limited herein. Moreover, when the prior knowledge is spliced to the text to be extracted, corresponding separators, namely [CLS] and [SEP] in the above examples, need to be added to the spliced text, where [CLS] is placed at the head of the spliced text and [SEP] separates the prior knowledge from the text to be extracted.
By the method, the first input data can be quickly and accurately acquired.
Step S102: acquiring second input data, wherein the second input data comprises the text to be extracted together with the entities and entity types identified in the text to be extracted.
Specifically, the text to be extracted is matched through an AC automaton to obtain the entities in the text to be extracted and the type corresponding to each entity; the text to be extracted, the matched entities and the types corresponding to the entities are then spliced to obtain the second input data. It should be noted that the AC automaton matches against a preset entity library in which a plurality of entities are preset; when the text to be extracted is matched through the AC automaton, the matching is performed against each entity in the entity library.
For example, the text to be extracted is: "Recently, the Mec variant was found to be associated with the Lazzar organization." Through the AC automaton, the entities Mec variant (virus, vul) and Lazzar (organization, org) and the entity type corresponding to each entity can be obtained from the text to be extracted, and the second input data is then: "Recently, the Mec variant was found to be associated with the Lazzar organization.<vul>Mec variant</vul><org>Lazzar organization</org>".
It should be noted that each matched entity and the entity type of the entity may be spliced at the beginning of the text to be extracted, or may be spliced at the end of the text to be extracted, which is not limited herein.
By the method, the second input data can be quickly and accurately acquired.
It should be noted that the first input data and the second input data are constructed in the same way as the first training sample and the second training sample used for model training; to avoid repetition, the details are not described again here, and the corresponding descriptions may be referred to.
It should be further noted that step S101 and step S102 may be performed simultaneously, or may be performed sequentially, that is, step S101 is performed first and then step S102 is performed, or step S102 is performed first and then step S101 is performed, which is not limited herein.
After the first input data and the second input data are acquired, the method continues to step S103.
Step S103: inputting the first input data into a preset first extraction model to obtain a first extraction result.
The first extraction result comprises each extracted entity and the type, position and probability of the entity. The position is the starting position and ending position of the entity in the text to be extracted. For example, when the text to be extracted is "Recently, the Mec variant was found to be associated with the Lazzar organization.", the position of the Mec variant in the text to be extracted is "59": 5 indicates that the M of the Mec variant is at the 5th position in the text to be extracted, and 9 indicates that the last character of the Mec variant is at the 9th position in the text to be extracted (positions are counted on the original text, with each Chinese character and each English letter occupying one position). The entity type corresponding to the Mec variant is virus. The probability of an entity is the probability, as predicted by the first extraction model, that the entity occurs in the text to be extracted; for example, a probability of 60% for the Mec variant means that the first extraction model predicts a 60% probability that the Mec variant occurs in the text to be extracted.
Step S104: and inputting the second input data into a preset second extraction model to obtain a second extraction result.
And the second extraction result comprises the type, the position, the type of the host guest and the probability of each extracted entity. The probability of the entity represents the probability that the second extraction model predicts the occurrence of the entity in the text to be extracted.
In the embodiment of the present application, please refer to the entity type, the location, and the probability in step S103 for the type and the location of each entity, so as to avoid repeated description, and the description thereof is omitted here. The type of the host guest is the position of the entity in the subject or the position of the entity in the text to be extracted, namely the type of the host guest is judged according to the position of the entity.
It should be noted that step S103 and step S104 may be performed simultaneously, or may be performed sequentially, that is, step S103 is performed first, and then step S104 is performed, or step S104 is performed first, and then step S103 is performed, which is not limited herein.
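Since steps S103 and S104 do not depend on each other, running them simultaneously can be sketched with a thread pool. The sketch below is illustrative only: `run_extraction` and the two stand-in model functions are hypothetical names, and the stand-ins simply return fixed results in place of trained extraction models.

```python
from concurrent.futures import ThreadPoolExecutor


def run_extraction(first_model, first_input, second_model, second_input):
    """Run steps S103 and S104 concurrently and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_future = pool.submit(first_model, first_input)
        second_future = pool.submit(second_model, second_input)
        return first_future.result(), second_future.result()


# Hypothetical stand-ins for the two preset extraction models.
def fake_first_model(data):
    return [{"entity": "Mec variant", "type": "virus",
             "position": (5, 9), "probability": 0.60}]


def fake_second_model(data):
    return [{"entity": "Mec variant", "type": "virus", "position": (5, 9),
             "role": "subject", "probability": 0.70}]


first_result, second_result = run_extraction(
    fake_first_model, "first input data",
    fake_second_model, "second input data",
)
```

Calling the two models one after the other, in either order, yields the same pair of results; the concurrent form only changes the elapsed time.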
After the first extraction result and the second extraction result are obtained, the method continues with step S105.
Step S105: obtaining an optimal extraction result according to the first extraction result and the second extraction result.
The optimal extraction result comprises each entity corresponding to the text to be extracted and the type, subject-object type, and position of the entity.
Specifically, the probability of each entity in the first extraction result and the probability of each entity in the second extraction result are combined according to a preset weight ratio of the first extraction model to the second extraction model, giving a total probability for each entity. The total probability is compared with a preset threshold, and each entity whose total probability is greater than the preset threshold is obtained together with its type, subject-object type, and position. The optimal extraction result comprises all entities whose total probability is greater than the preset threshold, together with the type, subject-object type, and position of each.
For example, suppose the entity at position "59" in the first extraction result is the Mec variant, with entity type virus and probability 60%, and the entity at position "59" in the second extraction result is the Mec variant, with entity type virus, subject-object type subject, and probability 70%. If the preset weight of the first extraction model is a and the preset weight of the second extraction model is b, the total probability of the Mec variant is 60% × a + 70% × b. The total probability is compared with the preset threshold. If it is greater than the threshold, the Mec variant is considered likely to occur at position "59" in the text to be extracted, and the Mec variant together with its type, subject-object type, and position is included in the optimal extraction result.
It should be noted that the values of a and b may be: a is 0.5 and b is 0.5; alternatively, a is 0.4 and b is 0.6; the value of the preset threshold may be: 0.7, 0.8, 0.85, 0.9. In addition, it should be noted that the weight of the first extraction model, the weight of the second extraction model, and the preset threshold may all be set according to actual situations.
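The weighted fusion of step S105 can be sketched as follows. This is a minimal sketch under stated assumptions: the `Extraction` record and the `fuse` function are hypothetical names, entities are matched across the two results by text, type, and position, and an entity missing from one result is assumed to contribute a probability of 0 for that model (the description does not specify this case).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Extraction:
    entity: str
    entity_type: str
    position: Tuple[int, int]      # (start, end), e.g. (5, 9) for "59"
    probability: float
    role: Optional[str] = None     # subject-object type; second model only


def fuse(first, second, a, b, threshold):
    """Keep entities whose weighted total probability exceeds the threshold."""
    merged = {}
    for index, results in enumerate((first, second)):
        for item in results:
            key = (item.entity, item.entity_type, item.position)
            merged.setdefault(key, [None, None])[index] = item
    optimal = []
    for (entity, entity_type, position), (r1, r2) in merged.items():
        # Assumption: a missing prediction counts as probability 0.
        p1 = r1.probability if r1 else 0.0
        p2 = r2.probability if r2 else 0.0
        total = a * p1 + b * p2
        if total > threshold:
            optimal.append({
                "entity": entity,
                "type": entity_type,
                "position": position,
                "role": r2.role if r2 else None,
                "total_probability": total,
            })
    return optimal


first = [Extraction("Mec variant", "virus", (5, 9), 0.60)]
second = [Extraction("Mec variant", "virus", (5, 9), 0.70, role="subject")]
print(fuse(first, second, a=0.4, b=0.6, threshold=0.6))
```

With a = 0.4 and b = 0.6 the total probability of the Mec variant is 0.4 × 0.60 + 0.6 × 0.70 = 0.66. The threshold of 0.6 used here is a hypothetical value chosen so that the example entity is kept; with the example thresholds listed above (0.7 and higher) the same entity would be filtered out.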
In the embodiment of the application, the first extraction result is obtained by inputting the first input data into the preset first extraction model, and the second extraction result is obtained by inputting the second input data into the preset second extraction model. The prior knowledge in the first input data and the entities and entity types marked in the second input data allow the extraction models to understand the semantics of the text to be extracted more quickly and accurately, which improves the accuracy of extracting entities from the text to be extracted. The optimal extraction result is then obtained from the first extraction result and the second extraction result, which further improves that accuracy. In addition, the output results of the first extraction model and the second extraction model (namely the first extraction result and the second extraction result) are fused according to the accuracy with which each model extracts entities, which improves the accuracy of the optimal extraction result.
After the optimal extraction result is obtained, the entity pair and the relationship of the entity pair can be obtained according to the optimal extraction result.
Specifically, entity pairs are obtained from the entities in the optimal extraction result; the text to be extracted is marked according to each entity pair and the optimal extraction result; the marked text to be extracted is input into a preset RoBERTa model to obtain the encoding of each entity in the entity pair; and the encodings corresponding to the first character of each entity in the entity pair are spliced, and the spliced encoding is classified by a preset classification algorithm to obtain the entity relationship corresponding to the entity pair.
After the optimal extraction result is obtained from the text to be extracted, the entities in the optimal extraction result may be combined pairwise into entity pairs. For example, if the text to be extracted is "Recently, the Mec variant was found to be associated with the Lazzar organization." and the entities extracted from it are the Mec variant and the Lazzar organization, the entity pair obtained from the two entities is (Mec variant, Lazzar organization).
The text to be extracted is then marked for each entity pair according to the entity pair and the optimal extraction result (namely each entity and its type, subject-object type, and position).
Taking (Mec variant, Lazzar organization) as an example, the marked text for this entity pair is: "Recently, the <S:vul>Mec variant</S:vul> was found to be associated with the <O:org>Lazzar organization</O:org>." Here S indicates that the entity is a subject, O indicates that the entity is an object, vul indicates that the entity type is virus, and org indicates that the entity type is organization.
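The marking step can be sketched as below. The names `mark_pair` and `span_of` are hypothetical, the sketch uses Python's 0-based, end-exclusive character offsets rather than the 1-based positions of the example, and the closing-tag form (`</S:vul>`) is an assumption about how the end of each entity is delimited.

```python
def span_of(text, entity, tag):
    """Locate an entity in the text and return (start, end, type_tag)."""
    start = text.index(entity)
    return (start, start + len(entity), tag)


def mark_pair(text, subject, obj):
    """Wrap the subject and object spans in role-and-type markers."""
    spans = [(subject, "S"), (obj, "O")]
    # Insert from right to left so that earlier offsets remain valid.
    for (start, end, tag), role in sorted(
            spans, key=lambda pair: pair[0][0], reverse=True):
        label = f"{role}:{tag}"
        text = (text[:start] + f"<{label}>" + text[start:end]
                + f"</{label}>" + text[end:])
    return text


sentence = ("Recently, the Mec variant was found to be associated "
            "with the Lazzar organization.")
marked = mark_pair(
    sentence,
    span_of(sentence, "Mec variant", "vul"),
    span_of(sentence, "Lazzar organization", "org"),
)
print(marked)
```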
It should be noted that, if more than two entities are extracted from the text to be extracted, the entities need to be combined pairwise into entity pairs. For example, if the entities extracted from the text to be extracted are A, B, C and D, six entity pairs can be obtained from these four entities: (A, B), (A, C), (A, D), (B, C), (B, D) and (C, D). The text to be extracted then needs to be marked for each entity pair according to the entity pair and the optimal extraction result.
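The pairwise combination can be illustrated with the standard library. The function name `entity_pairs` is hypothetical; the sketch assumes unordered pairs, which matches the six pairs listed for four entities (in general, n entities give n(n-1)/2 pairs).

```python
from itertools import combinations


def entity_pairs(entities):
    """Return every unordered pair of distinct entities."""
    return list(combinations(entities, 2))


pairs = entity_pairs(["A", "B", "C", "D"])
print(pairs)
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
```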
The marked text corresponding to each entity pair is input into a preset RoBERTa model to obtain the encoding of each entity in the entity pair, where an encoding is generated for each character of each entity. The encodings corresponding to the first character of each entity in the entity pair are spliced, and the spliced encoding is passed to a classification algorithm (for example, softmax) for relation classification, yielding the entity relationship corresponding to the entity pair. The RoBERTa model and the softmax algorithm are well known to those skilled in the art and are not described here.
In the embodiment of the application, because the optimal extraction result has higher accuracy, obtaining the entity pairs from the entities in the optimal extraction result improves the accuracy of the entity pairs. In addition, the text to be extracted is marked according to the optimal extraction result and the extracted entity pair, and the marked text is input into the preset RoBERTa model to obtain the encoding of each entity in the entity pair; the encodings corresponding to the first character of each entity are then spliced and classified by the preset classification algorithm, so that the entity relationship corresponding to the entity pair can be accurately obtained from the marked text.
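The splice-and-classify step can be sketched without a trained encoder. In the sketch below the two input vectors stand in for the RoBERTa encodings of the first character of each entity, and the weight matrix, bias, and relation labels are hypothetical placeholders for a trained classification layer; only the concatenation and the softmax follow the description above.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(subject_vec, object_vec, weights, bias, labels):
    """Concatenate the two first-character encodings and classify them."""
    features = subject_vec + object_vec     # list concatenation (splicing)
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=probabilities.__getitem__)
    return labels[best], probabilities


# Placeholder encodings and classifier parameters (not model outputs).
subject_first = [2.0, 0.0]
object_first = [0.0, 1.0]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]
labels = ["associated_with", "no_relation"]

relation, probs = classify_relation(
    subject_first, object_first, weights, bias, labels)
print(relation, probs)
```

In a real system the encodings would come from the encoder's hidden states at the first character of each marked entity, and the weights would be learned during training.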
Referring to fig. 2, based on the same inventive concept, an embodiment of the present application further provides an entity extraction apparatus 100, where the apparatus 100 includes: an acquisition module 101, an extraction module 102 and a processing module 103.
The acquisition module 101 is configured to acquire first input data, where the first input data includes a text to be extracted and preset prior knowledge, and the prior knowledge includes vocabulary defining or representing entity types; and to acquire second input data, where the second input data includes the text to be extracted and the entities and entity types identified in the text to be extracted.
The extraction module 102 is configured to input the first input data into a preset first extraction model to obtain a first extraction result, where the first extraction result includes each extracted entity and the type, position, and probability of the entity; and to input the second input data into a preset second extraction model to obtain a second extraction result, where the second extraction result includes each extracted entity and the type, position, subject-object type, and probability of the entity.
The processing module 103 is configured to obtain an optimal extraction result according to the first extraction result and the second extraction result, where the optimal extraction result includes each entity corresponding to the text to be extracted and the type, subject-object type, and position of the entity.
Optionally, the acquisition module 101 is specifically configured to splice the text to be extracted with each definition text in a preset entity definition library, respectively, to obtain the first input data; or to splice the text to be extracted with preset vocabulary representing each entity type, respectively, to obtain the first input data.
Optionally, the acquisition module 101 is specifically configured to match the text to be extracted through an AC automaton to obtain the entities corresponding to the text to be extracted and the type corresponding to each entity; and to splice the text to be extracted, the matched entities, and the types corresponding to the entities to obtain the second input data.
Optionally, the entity extraction apparatus 100 further includes a construction module 104, where the construction module 104 is configured to obtain a first training set, where the first training set includes a first training sample and a first label, the first training sample is a text obtained by splicing a preset text with each item of prior knowledge, and the first label is a text obtained by performing entity labeling on the preset text; and to train an initial first extraction model with the first training set to obtain the first extraction model.
Optionally, the construction module 104 is further configured to obtain a second training set, where the second training set includes a second training sample and a second label, the second training sample is a text obtained by matching a preset text through an AC automaton to obtain the corresponding entities and splicing the entities with the preset text, and the second label is a text obtained by performing entity labeling on the preset text; and to train an initial second extraction model with the second training set to obtain the second extraction model.
Optionally, the processing module 103 is specifically configured to combine the probability of each entity in the first extraction result and the probability of each entity in the second extraction result according to a preset weight ratio of the first extraction model to the second extraction model, to obtain a total probability corresponding to each entity; and to compare the total probability with a preset threshold and obtain each entity whose total probability is greater than the preset threshold together with its type, subject-object type, and position, where the optimal extraction result includes all entities whose total probability is greater than the preset threshold and the type, subject-object type, and position corresponding to each entity.
Optionally, the processing module 103 is further configured to obtain entity pairs from the entities in the optimal extraction result; mark the text to be extracted according to each entity pair and the optimal extraction result; input the marked text to be extracted into a preset RoBERTa model to obtain the encoding of each entity in the entity pair; and splice the encodings corresponding to the first character of each entity in the entity pair and classify the spliced encoding by a preset classification algorithm to obtain the entity relationship corresponding to the entity pair.
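The first-input splicing performed by the acquisition module can be sketched as follows. The function name `build_first_inputs`, the `[SEP]` separator, and the order in which the text and the prior knowledge are joined are all assumptions made for illustration; the description only states that the text is spliced with each item of prior knowledge respectively.

```python
def build_first_inputs(text, prior_knowledge, separator="[SEP]"):
    """Splice the text with each item of prior knowledge, one input per item."""
    return [f"{text}{separator}{item}" for item in prior_knowledge]


sentence = ("Recently, the Mec variant was found to be associated "
            "with the Lazzar organization.")
# Vocabulary representing entity types, taken from the running example.
type_words = ["virus", "organization"]

for first_input in build_first_inputs(sentence, type_words):
    print(first_input)
```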
Referring to fig. 3, based on the same inventive concept, an exemplary structural block diagram of an electronic device 200 is provided in the embodiment of the present application, and the electronic device 200 is used in the entity extraction method. In the embodiment of the present application, the electronic Device 200 may be, but is not limited to, a Personal Computer (PC), a smart phone, a tablet Computer, a Personal Digital Assistant (PDA), a Mobile Internet Device (MID), and the like. Structurally, electronic device 200 may include a processor 210 and a memory 220.
The processor 210 and the memory 220 are electrically connected, directly or indirectly, to enable data transmission or interaction, for example, the components may be electrically connected to each other via one or more communication buses or signal lines. The processor 210 may be an integrated circuit chip having signal processing capabilities. The Processor 210 may also be a general-purpose Processor, for example, a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, which can implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present Application. Further, a general purpose processor may be a microprocessor or any conventional processor or the like.
The Memory 220 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), and an electrically Erasable Programmable Read-Only Memory (EEPROM). The memory 220 is used for storing a program, and the processor 210 executes the program after receiving the execution instruction.
It should be understood that the structure shown in fig. 3 is merely an illustration, and the electronic device 200 provided in the embodiment of the present application may have fewer or more components than those shown in fig. 3, or may have a different configuration than that shown in fig. 3. Further, the components shown in fig. 3 may be implemented by software, hardware, or a combination thereof.
It should be noted that, as those skilled in the art can clearly understand, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Based on the same inventive concept, embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed, the computer program performs the methods provided in the above embodiments.
The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method of entity extraction, the method comprising:
acquiring first input data, wherein the first input data comprises a text to be extracted and preset prior knowledge, and the prior knowledge comprises vocabularies for defining or representing entity types;
acquiring second input data, wherein the second input data comprises the text to be extracted and the entities and entity types identified in the text to be extracted;
inputting the first input data into a preset first extraction model to obtain a first extraction result, wherein the first extraction result comprises each extracted entity and the type, position and probability of the entity;
inputting the second input data into a preset second extraction model to obtain a second extraction result, wherein the second extraction result comprises each extracted entity and the type, position, subject-object type and probability of the entity;
and obtaining an optimal extraction result according to the first extraction result and the second extraction result, wherein the optimal extraction result comprises each entity corresponding to the text to be extracted and the type, subject-object type and position of the entity.
2. The method of claim 1, wherein the obtaining first input data comprises:
splicing the text to be extracted and each definition text in a preset entity definition library respectively to obtain the first input data;
or, respectively splicing the text to be extracted and preset vocabularies representing various entity types to obtain the first input data.
3. The method of claim 1, wherein the obtaining second input data comprises:
matching the text to be extracted through an AC automaton to obtain an entity corresponding to the text to be extracted and a type corresponding to the entity;
and splicing the text to be extracted, the matched entity and the type corresponding to the entity to obtain second input data.
4. The method of claim 1, wherein the first extraction model is obtained according to the following steps:
acquiring a first training set, wherein the first training set comprises a first training sample and a first label, the first training sample is a text formed by splicing a preset text and each priori knowledge, and the first label is a text formed by performing entity labeling on the preset text;
and training an initial first extraction model by using the first training set to obtain the first extraction model.
5. The method of claim 1, wherein the second extraction model is obtained according to the following steps:
acquiring a second training set, wherein the second training set comprises a second training sample and a second label, the second training sample is a text obtained by matching a preset text by an AC automaton to obtain a corresponding entity and splicing the type corresponding to the entity with the preset text, and the second label is a text obtained by performing entity labeling on the preset text;
and training an initial second extraction model by using the second training set to obtain the second extraction model.
6. The method of claim 1, wherein obtaining an optimal extraction result according to the first extraction result and the second extraction result comprises:
calculating the probability of each entity in the first extraction result and the probability of each entity in the second extraction result according to a preset weight ratio of the first extraction model to the second extraction model, and acquiring the total probability corresponding to each entity;
and comparing the total probability with a preset threshold, and acquiring each entity whose total probability is greater than the preset threshold and the type, subject-object type and position corresponding to the entity, wherein the optimal extraction result comprises all the entities whose total probability is greater than the preset threshold and the type, subject-object type and position corresponding to each entity.
7. The method of claim 1, further comprising:
acquiring entity pairs according to each entity in the optimal extraction result;
marking the text to be extracted according to the entity pair and the optimal extraction result;
inputting the marked text to be extracted into a preset RoBERTa model, and acquiring the code of each entity in the entity pair;
and splicing the codes corresponding to the first character of each entity in the entity pair, and classifying the spliced codes according to a preset classification algorithm to acquire the entity relationship corresponding to the entity pair.
8. An entity extraction apparatus, the apparatus comprising:
the acquisition module is used for acquiring first input data, wherein the first input data comprises a text to be extracted and preset prior knowledge, and the prior knowledge comprises vocabularies for defining or representing entity types; and acquiring second input data, wherein the second input data comprises the text to be extracted and the entities and entity types identified in the text to be extracted;
the extraction module is used for inputting the first input data into a preset first extraction model to obtain a first extraction result, wherein the first extraction result comprises each extracted entity and the type, position and probability of the entity; and inputting the second input data into a preset second extraction model to obtain a second extraction result, wherein the second extraction result comprises each extracted entity and the type, position, subject-object type and probability of the entity;
and the processing module is used for obtaining an optimal extraction result according to the first extraction result and the second extraction result, wherein the optimal extraction result comprises each entity corresponding to the text to be extracted and the type, subject-object type and position of the entity.
9. An electronic device, comprising: a processor and a memory, the processor and the memory connected;
the memory is used for storing programs;
the processor is configured to execute a program stored in the memory to perform the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when executed by a computer, performs the method of any one of claims 1-7.
CN202111598982.5A 2021-12-24 2021-12-24 Entity extraction method and device, electronic equipment and storage medium Pending CN114265919A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111598982.5A CN114265919A (en) 2021-12-24 2021-12-24 Entity extraction method and device, electronic equipment and storage medium
PCT/CN2022/139496 WO2023116561A1 (en) 2021-12-24 2022-12-16 Entity extraction method and apparatus, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111598982.5A CN114265919A (en) 2021-12-24 2021-12-24 Entity extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114265919A true CN114265919A (en) 2022-04-01

Family

ID=80829773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111598982.5A Pending CN114265919A (en) 2021-12-24 2021-12-24 Entity extraction method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114265919A (en)
WO (1) WO2023116561A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620722A (en) * 2022-12-15 2023-01-17 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
WO2023116561A1 (en) * 2021-12-24 2023-06-29 中电信数智科技有限公司 Entity extraction method and apparatus, and electronic device and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702787A (en) * 2023-08-07 2023-09-05 四川隧唐科技股份有限公司 Long text entity identification method, device, computer equipment and medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059320B (en) * 2019-04-23 2021-03-16 腾讯科技(深圳)有限公司 Entity relationship extraction method and device, computer equipment and storage medium
CN110502749B (en) * 2019-08-02 2023-10-03 中国电子科技集团公司第二十八研究所 Text relation extraction method based on double-layer attention mechanism and bidirectional GRU
CA3092332A1 (en) * 2019-09-06 2021-03-06 Royal Bank Of Canada System and method for machine learning architecture for interdependence detection
CN110569366B (en) * 2019-09-09 2023-05-23 腾讯科技(深圳)有限公司 Text entity relation extraction method, device and storage medium
CN111539209B (en) * 2020-04-15 2023-09-15 北京百度网讯科技有限公司 Method and apparatus for entity classification
KR20210147368A (en) * 2020-05-28 2021-12-07 삼성에스디에스 주식회사 Method and apparatus for generating training data for named entity recognition
CN112559770A (en) * 2020-12-15 2021-03-26 北京邮电大学 Text data relation extraction method, device and equipment and readable storage medium
CN112988979B (en) * 2021-04-29 2021-10-08 腾讯科技(深圳)有限公司 Entity identification method, entity identification device, computer readable medium and electronic equipment
CN113761190A (en) * 2021-05-06 2021-12-07 腾讯科技(深圳)有限公司 Text recognition method and device, computer readable medium and electronic equipment
CN114265919A (en) * 2021-12-24 2022-04-01 中电信数智科技有限公司 Entity extraction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023116561A1 (en) 2023-06-29

Similar Documents

Publication Publication Date Title
WO2020232861A1 (en) Named entity recognition method, electronic device and storage medium
CN110020422B (en) Feature word determining method and device and server
CN114265919A (en) Entity extraction method and device, electronic equipment and storage medium
CN111581976A (en) Method and apparatus for standardizing medical terms, computer device and storage medium
CN108804423B (en) Medical text feature extraction and automatic matching method and system
CN112084381A (en) Event extraction method, system, storage medium and equipment
CN111046142A (en) Text examination method and device, electronic equipment and computer storage medium
CN111177532A (en) Vertical search method, device, computer system and readable storage medium
CN111177367B (en) Case classification method, classification model training method and related products
WO2021208727A1 (en) Text error detection method and apparatus based on artificial intelligence, and computer device
CN113051356B (en) Open relation extraction method and device, electronic equipment and storage medium
AU2020200232A1 (en) Method and system for determining risk score for a contract document
CN112002323A (en) Voice data processing method and device, computer equipment and storage medium
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
CN109522338A (en) Clinical term method for digging, device, electronic equipment and computer-readable medium
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN111695337A (en) Method, device, equipment and medium for extracting professional terms in intelligent interview
Nguyen et al. A hybrid approach to Vietnamese word segmentation
CN112287069A (en) Information retrieval method and device based on voice semantics and computer equipment
CN111177375A (en) Electronic document classification method and device
CN104252446A (en) Computing device, and verification system and method for consistency of contents of files
CN116663536B (en) Matching method and device for clinical diagnosis standard words
CN113887202A (en) Text error correction method and device, computer equipment and storage medium
CN113205814A (en) Voice data labeling method and device, electronic equipment and storage medium
CN110705258A (en) Text entity identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination