WO2023116561A1 - Entity extraction method and apparatus, and electronic device and storage medium - Google Patents

Entity extraction method and apparatus, and electronic device and storage medium

Info

Publication number
WO2023116561A1
WO2023116561A1 PCT/CN2022/139496 CN2022139496W WO2023116561A1 WO 2023116561 A1 WO2023116561 A1 WO 2023116561A1 CN 2022139496 W CN2022139496 W CN 2022139496W WO 2023116561 A1 WO2023116561 A1 WO 2023116561A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
text
extraction
extracted
preset
Prior art date
Application number
PCT/CN2022/139496
Other languages
French (fr)
Chinese (zh)
Inventor
刘钰
贾梦妮
黄鹏
邱杰
刘德安
Original Assignee
中电信数智科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中电信数智科技有限公司
Publication of WO2023116561A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Definitions

  • The present application relates to the technical field of data processing, and in particular to an entity extraction method and device, an electronic device, and a storage medium.
  • The existing method of building a knowledge graph from plain text data uses relatively static entity dictionaries and rule templates to obtain entities and entity relationships.
  • This method has poor timeliness. If an entity in the threat intelligence has not been added to the entity dictionary, the entity will be missed. If the rule templates fail to cover a type that appears in the threat intelligence, labels will likewise be missed, which reduces accuracy.
  • Obtaining entities and entity relationships through rule templates also requires a large number of professionals to maintain the rule templates, which is labor-intensive.
  • Knowledge graph construction has also borrowed ideas from deep learning and used BiLSTM+CRF to extract knowledge. Although accuracy has improved, it still cannot meet practical requirements.
  • The purpose of the embodiments of the present application is to provide an entity extraction method and device, an electronic device, and a storage medium, so as to address the problem of "poor accuracy of entity extraction in the prior art".
  • An embodiment of the present application provides an entity extraction method, the method including: acquiring first input data, the first input data including the text to be extracted and preset prior knowledge, the prior knowledge including each entity definition or the vocabulary characterizing each entity type; acquiring second input data, the second input data including the text to be extracted and the entities and entity types identified in the text to be extracted; inputting the first input data into a preset first extraction model to obtain a first extraction result, the first extraction result including each extracted entity and the type, position and probability of the entity; inputting the second input data into a preset second extraction model to obtain a second extraction result, the second extraction result including the type, position, subject-object type and probability of each extracted entity; and obtaining an optimal extraction result according to the first extraction result and the second extraction result, the optimal extraction result including each entity corresponding to the text to be extracted and the type, subject-object type and position of the entity.
  • The first extraction model understands the semantics of the text to be extracted faster and more accurately based on the prior knowledge in the first input data, and the second extraction model does so based on the entities and entity types marked in the second input data, thereby improving the accuracy of extracting entities from the text to be extracted.
  • Moreover, an optimal extraction result can be obtained from the first extraction result and the second extraction result, thereby further improving the accuracy of extracting entities from the text to be extracted.
  • Acquiring the first input data includes: splicing the text to be extracted with each definition text in a preset entity definition library to obtain the first input data; or splicing the text to be extracted with each preset vocabulary item characterizing an entity type to obtain the first input data.
  • In this way, the first input data can be acquired quickly and accurately.
  • Acquiring the second input data includes: matching the text to be extracted with an AC automaton to obtain the entities corresponding to the text to be extracted and the type corresponding to each entity; and splicing the text to be extracted with the matched entities and their corresponding types to obtain the second input data.
  • In this way, the second input data can be acquired quickly and accurately.
  • The first extraction model is obtained according to the following steps: obtaining a first training set, the first training set including first training samples and first labels, where a first training sample is the text obtained by splicing a preset text with each item of prior knowledge respectively, and a first label is the text obtained by entity-labeling the preset text; and training an initial first extraction model with the first training set to obtain the first extraction model.
  • The second extraction model is obtained according to the following steps: obtaining a second training set, the second training set including second training samples and second labels, where a second training sample is the text obtained by matching the preset text with the AC automaton to obtain the corresponding entities and entity types and splicing them with the preset text, and a second label is the text obtained by entity-labeling the preset text; and training an initial second extraction model with the second training set to obtain the second extraction model.
  • Obtaining the optimal extraction result according to the first extraction result and the second extraction result includes: according to a preset weight ratio between the first extraction model and the second extraction model, combining the probability of each entity in the first extraction result with the probability of each entity in the second extraction result to obtain the total probability corresponding to each entity; and comparing the total probability with a preset threshold to obtain each entity whose total probability is greater than the preset threshold, together with the type, subject-object type and position corresponding to the entity. The optimal extraction result includes all entities whose total probability is greater than the preset threshold and their corresponding types, subject-object types and positions.
  • In this way, the output results of the first extraction model and the second extraction model (that is, the first extraction result and the second extraction result) can be fused according to the entity extraction accuracy of each model, thereby improving the accuracy of the optimal extraction result obtained.
  • The method further includes: acquiring entity pairs according to the entities in the optimal extraction result; marking the text to be extracted according to the entity pairs and the optimal extraction result; inputting the marked text to be extracted into a preset RoBERTa model to obtain the encoding of each entity in the entity pair; and splicing the encodings corresponding to the first character of each entity in the entity pair and performing relationship classification on the spliced encoding according to a preset classification algorithm, to obtain the entity relationship corresponding to the entity pair.
  • Because the optimal extraction result is highly accurate and includes each entity corresponding to the text to be extracted together with the type, subject-object type and position of the entity, obtaining entity pairs from the entities in the optimal extraction result improves the accuracy of the entity pairs obtained.
  • The text to be extracted is then marked, and the marked text is input into the preset RoBERTa model to obtain the encoding of each entity in the entity pair; the encodings corresponding to the first character of each entity are spliced, and relationship classification is performed on the spliced encoding according to the preset classification algorithm, so that the entity relationship corresponding to the entity pair can be obtained accurately from the marked text to be extracted.
  • An embodiment of the present application provides an entity extraction device, the device including: an acquisition module, configured to acquire first input data, the first input data including the text to be extracted and preset prior knowledge, the prior knowledge including each entity definition or the vocabulary characterizing each entity type, and to acquire second input data, the second input data including the text to be extracted and the entities and entity types identified in the text to be extracted; an extraction module, configured to input the first input data into a preset first extraction model to obtain a first extraction result, the first extraction result including each extracted entity and the type, position and probability of the entity, and to input the second input data into a preset second extraction model to obtain a second extraction result, the second extraction result including the type, position, subject-object type and probability of each extracted entity; and a processing module, configured to obtain an optimal extraction result according to the first extraction result and the second extraction result, the optimal extraction result including each entity corresponding to the text to be extracted and the type, subject-object type and position of the entity.
  • An embodiment of the present application provides an electronic device, including a processor and a memory, the processor being connected to the memory; the memory is used to store a program; the processor is used to call the program stored in the memory to execute the method provided by the above-mentioned embodiment of the first aspect and/or by some possible implementations in combination with the above-mentioned embodiment of the first aspect.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it performs the method provided by the above-mentioned embodiment of the first aspect and/or by some possible implementations in combination with the above-mentioned embodiment of the first aspect.
  • FIG. 1 is a flow chart of steps of an entity extraction method provided by an embodiment of the present application.
  • FIG. 2 is a block diagram of modules of an entity extraction device provided by an embodiment of the present application.
  • FIG. 3 is a block diagram of modules of an electronic device provided by an embodiment of the present application.
  • An embodiment of the present application provides a method for entity extraction, which is used for entity extraction of plain text data.
  • the entity extraction method includes a preset first extraction model and a second extraction model.
  • the acquisition of the above-mentioned first extraction model and the second extraction model will be described below first.
  • the acquisition steps of the first extraction model are as follows:
  • A first training set is obtained. The first training set includes first training samples and first labels; a first training sample is the text obtained by splicing the preset text with each item of prior knowledge, and a first label is the text obtained by entity-labeling the preset text.
  • For example, the preset text is: "The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system."
  • When the prior knowledge is the entity definitions, the first training samples are:
  • "[CLS] A computer virus is a set of computer instructions or program code that its author inserts into a computer program to destroy computer functions or data, that affects the normal use of the computer, and that is capable of self-replication (that is, the computer virus definition) [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]",
  • "[CLS] An organization is an organic whole composed of two or more individuals in order to achieve a common goal (that is, the organization definition) [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]", and
  • "[CLS] An operating system is a computer program that manages computer hardware and software resources (that is, the operating system definition) [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]".
  • When the prior knowledge is the vocabulary characterizing each entity type, the first training samples are: "[CLS] computer virus [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]" and "[CLS] organization [SEP] The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system. [SEP]".
  • each prior knowledge can be spliced at the beginning of the preset text or at the end of the preset text, which is not limited here.
  • The first label is the text obtained by entity-labeling the preset text; that is, each entity in the preset text and the type corresponding to the entity must be marked in the preset text. The position of each entity in the preset text, that is, its start position and end position, can also be obtained from the marked text.
  • For example, in the preset text the entity type of Dacls and of Trojan horse variant is computer virus, the entity type of Lazarues is organization, and the entity type of A is operating system.
  • the BIESO label system can be used for labeling.
  • the BIESO labeling system is a labeling method well known to those skilled in the art, and will not be described here.
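  • A minimal sketch of BIESO labeling is given below for illustration. It assumes token-level labeling, inclusive token index spans, and tag names of the form B-vul; the function name `bieso_tags`, the tokenization and the tag spelling are hypothetical choices and are not specified in the application.

```python
def bieso_tags(tokens, entities):
    """Assign BIESO tags to tokens.

    entities: list of (start, end, entity_type) with inclusive token indices.
    B = begin, I = inside, E = end, S = single-token entity, O = outside.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        if start == end:
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"
            tags[end] = f"E-{etype}"
    return tags


# Hypothetical tokenization of a fragment of the preset text.
tokens = ["new", "Dacls", "remote", "access", "Trojan", "horse", "variant"]
entities = [(1, 1, "vul"), (4, 6, "vul")]
tags = bieso_tags(tokens, entities)
```

  • The start and end positions of each entity can be read back from the B/E (or S) tags, which is how the labeled text also yields entity positions.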
  • the initial first extraction model is trained through the first training set to obtain the first extraction model.
  • the above-mentioned RoBERTa pre-training model structure, the above-mentioned training method and the above-mentioned classifier can adopt the technical means commonly used in this field, and will not be described here.
  • the acquisition steps of the second extraction model are:
  • A second training set is obtained. The second training set includes second training samples and second labels; a second training sample is the text obtained by matching the preset text with the AC automaton to obtain the corresponding entities and entity types and splicing them with the preset text, and a second label is the text obtained by entity-labeling the preset text.
  • The second training set is used to train the initial second extraction model to obtain the second extraction model.
  • A preset entity library is provided for matching by the AC automaton, and multiple entities are preset in the entity library. When the preset text is matched by the AC automaton, the matching is performed against each entity in the entity library.
  • For example, the preset text is: "The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system."
  • Matching the preset text with the AC automaton yields four entities and the entity type corresponding to each: Dacls (computer virus, vul), Trojan horse variant (computer virus, vul), Lazarues (organization, org) and A (operating system, sys). The second training sample is then: "The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues Group in a certain country and is specially designed for the A operating system.<vul>Dacls</vul><vul>Trojan horse variant</vul><org>Lazarues</org><sys>A</sys>".
  • each matched entity and the entity type of the entity can be spliced at the beginning of the preset text or at the end of the preset text, which is not limited here.
  • the second label is the same as the above-mentioned first label, which is the text after the entity labeling of the preset text, that is, the entity corresponding to the preset text and the type corresponding to the entity need to be marked in the preset text, and according to The marked text can also obtain the position of the entity in the preset text, that is, the start position and end position of the entity in the preset text.
  • For details of the second label, refer to the description of the aforementioned first label.
  • the initial second extraction model is trained through the second training set to obtain the second extraction model.
  • the above-mentioned training method can adopt a training method well-known to those skilled in the art, and will not be described here again.
  • the aforementioned preset text may be multiple different texts, and the aforementioned prior knowledge may include multiple different entity definitions or multiple different vocabularies characterizing each entity type, and the number is not limited here.
  • the above-mentioned second extraction model adopts the RoBERTa pre-training model and the CRF model, wherein the CRF model is the task layer.
  • the RoBERTa pre-training model structure and the CRF model structure are well-known to those skilled in the art, and will not be described here.
  • the first extraction model and the second extraction model can be used to perform entity extraction on the plain text data.
  • the specific flow and steps of the above entity extraction method are described below with reference to FIG. 1 .
  • Step S101: Obtain first input data. The first input data includes the text to be extracted and preset prior knowledge.
  • The prior knowledge includes each entity definition or the vocabulary characterizing each entity type.
  • The text to be extracted is preprocessed text; the preprocessing includes converting between uppercase and lowercase letters, removing special symbols, and converting uncommon nouns in the text to be extracted into common nouns.
  • The text to be extracted is spliced with each definition text in the preset entity definition library to obtain the first input data; or the text to be extracted is spliced with each preset vocabulary item characterizing an entity type to obtain the first input data.
  • For example, the text to be extracted is: "It was recently discovered that the Mec variant is associated with the Lazzar organization."
  • If the definition texts in the entity definition library include a security vulnerability definition text and an organization definition text, the first input data is "[CLS] A security vulnerability is a defect or error in the logical design of system software that criminals can exploit to attack or control the entire computer (that is, the security vulnerability definition) [SEP] It was recently discovered that the Mec variant is associated with the Lazzar organization. [SEP]" and "[CLS] An organization is an organic whole composed of two or more individuals in order to achieve a common goal (that is, the organization definition) [SEP] It was recently discovered that the Mec variant is associated with the Lazzar organization. [SEP]".
  • If the vocabulary characterizing the entity types is used instead, the first input data is "[CLS] security vulnerability [SEP] It was recently discovered that the Mec variant is associated with the Lazzar organization. [SEP]" and "[CLS] organization [SEP] It was recently discovered that the Mec variant is associated with the Lazzar organization. [SEP]".
  • each definition text or each vocabulary representing each entity type can be spliced at the beginning of the text to be extracted, or at the end of the text to be extracted, which is not limited here.
  • the first input data can be acquired quickly and accurately.
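  • The splicing in step S101 can be sketched as follows. This is only an illustration: the function name `build_first_inputs` is hypothetical, the prior knowledge is spliced at the beginning of the text (the application allows either end), and the [CLS] and [SEP] markers are written as literal strings, whereas a real tokenizer would normally insert its own special tokens.

```python
def build_first_inputs(text, prior_knowledge):
    """Return one spliced input string per prior-knowledge item.

    prior_knowledge may hold entity definition texts or the vocabulary
    characterizing each entity type.
    """
    return [f"[CLS] {item} [SEP] {text} [SEP]" for item in prior_knowledge]


text = ("It was recently discovered that the Mec variant "
        "is associated with the Lazzar organization.")
vocabulary = ["security vulnerability", "organization"]
first_inputs = build_first_inputs(text, vocabulary)
```

  • One input is produced per entity type, so the first extraction model sees the same text once for each item of prior knowledge.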
  • Step S102: Acquire second input data. The second input data includes the text to be extracted and the entities and entity types identified in the text to be extracted.
  • The AC automaton is used to match the text to be extracted to obtain the entities corresponding to the text to be extracted and the type corresponding to each entity; the text to be extracted is then spliced with the matched entities and their corresponding types to obtain the second input data.
  • A preset entity library is provided for matching by the AC automaton, and multiple entities are preset in the entity library. When the text to be extracted is matched by the AC automaton, the matching is performed against each entity in the entity library.
  • For example, the text to be extracted is: "It was recently discovered that the Mec variant is associated with the Lazzar organization."
  • Matching the text to be extracted with the AC automaton yields two entities and the entity type corresponding to each: Mec variant (virus, vul) and Lazzar organization (organization, org). The second input data is then "It was recently discovered that the Mec variant is associated with the Lazzar organization.<vul>Mec variant</vul><org>Lazzar organization</org>".
  • each matched entity and the entity type of the entity may be spliced at the beginning of the text to be extracted, or may be spliced at the end of the text to be extracted, which is not limited here.
  • the second input data can be acquired quickly and accurately.
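  • The matching and splicing in step S102 can be sketched as follows. This is a compact Aho-Corasick (AC automaton) written for illustration only; the function names, the dictionary form of the entity library, and the choice to append the tagged entities at the end of the text are assumptions, not details taken from the application.

```python
from collections import deque


def build_automaton(entity_library):
    """Build goto/fail/output tables from {entity: type_tag}."""
    goto, fail, out = [{}], [0], [[]]
    for entity, tag in entity_library.items():
        state = 0
        for ch in entity:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((entity, tag))
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def match_entities(text, automaton):
    """Return (start_index, entity, type_tag) for every library hit."""
    goto, fail, out = automaton
    state, found = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for entity, tag in out[state]:
            found.append((i - len(entity) + 1, entity, tag))
    return found


def build_second_input(text, entity_library):
    """Splice the text with its matched entities wrapped in type tags."""
    hits = match_entities(text, build_automaton(entity_library))
    return text + "".join(f"<{tag}>{e}</{tag}>" for _, e, tag in hits)


text = ("It was recently discovered that the Mec variant "
        "is associated with the Lazzar organization.")
library = {"Mec variant": "vul", "Lazzar organization": "org"}
second_input = build_second_input(text, library)
```

  • The automaton scans the text once regardless of how many entities the library holds, which is the usual reason for choosing it over repeated substring searches.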
  • The first input data and the second input data are constructed in the same way as the first training samples and the second training samples used in the aforementioned model training; the details are not repeated here, and the corresponding descriptions may be referred to each other.
  • step S101 and step S102 may be performed simultaneously, or may be performed sequentially, that is, step S101 is performed first and then step S102 is performed, or step S102 is performed first and then step S101 is performed, which is not limited here.
  • After the first input data and the second input data are acquired, the method proceeds to step S103.
  • Step S103: Input the first input data into the preset first extraction model to obtain a first extraction result.
  • The first extraction result includes each extracted entity and the type, position and probability of the entity.
  • The position is the start position and end position of the entity in the text to be extracted. For example, if the text to be extracted is "It was recently discovered that the Mec variant is associated with the Lazzar organization." and the position of the Mec variant in the text to be extracted is "59", then 5 indicates that the "M" of "Mec variant" is at the fifth position in the text to be extracted, and 9 indicates that the last character of "Mec variant" is at the ninth position (positions are counted by character in the original Chinese text). The entity type corresponding to the Mec variant is virus. The probability of the entity represents the probability, as predicted by the first extraction model, that the entity appears in the text to be extracted; for example, a probability of 60% for the Mec variant means that the first extraction model predicts that the probability of the Mec variant appearing in the text to be extracted is 60%.
  • Step S104: Input the second input data into the preset second extraction model to obtain a second extraction result.
  • The second extraction result includes the type, position, subject-object type and probability of each extracted entity.
  • The probability of the entity represents the probability, as predicted by the second extraction model, that the entity appears in the text to be extracted.
  • For the type, position and probability of each entity, refer to the description of the entity type, position and probability in the aforementioned step S103; the details are not repeated here.
  • the above-mentioned subject-object type refers to whether the entity is in the subject position or the object position in the text to be extracted, that is, the subject-object type is judged according to the position of the entity.
  • step S103 and step S104 can be performed at the same time, or can be performed sequentially, that is, step S103 is performed first and then step S104 is performed, or step S104 is performed first and then step S103 is performed, which is not limited here.
  • Step S105: Obtain an optimal extraction result according to the first extraction result and the second extraction result.
  • The optimal extraction result includes each entity corresponding to the text to be extracted and the type, subject-object type and position of the entity.
  • According to the preset weight ratio between the first extraction model and the second extraction model, the probability of each entity in the first extraction result is combined with the probability of each entity in the second extraction result to obtain the total probability corresponding to each entity; the total probability is compared with the preset threshold to obtain each entity whose total probability is greater than the preset threshold, together with the type, subject-object type and position corresponding to the entity.
  • The optimal extraction result includes all entities whose total probability is greater than the preset threshold, and the corresponding types, subject-object types and positions of those entities.
  • For example, the entity at position "59" in the first extraction result is the Mec variant, its entity type is virus, and its probability is 60%; the entity at position "59" in the second extraction result is the Mec variant, its entity type is virus, its subject-object type is subject, and its probability is 70%.
  • The preset weight of the first extraction model is a, and the preset weight of the second extraction model is b.
  • The total probability of the Mec variant is then 60% × a + 70% × b. The total probability is compared with the preset threshold; if the total probability is greater than the preset threshold, the probability that the Mec variant appears at position "59" in the text to be extracted is relatively high, and the Mec variant together with its type, subject-object type and position is taken as part of the optimal extraction result.
  • For example, a may be 0.5 and b may be 0.5, or a may be 0.4 and b may be 0.6; the preset threshold may be 0.7, 0.8, 0.85 or 0.9.
  • the weight of the first extraction model, the weight of the second extraction model, and the preset threshold can all be set according to actual conditions.
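  • The fusion in step S105 can be sketched as follows. The data layout (results keyed by a (start, end) position), the field names and the function name `fuse_results` are hypothetical. The application does not say how to treat an entity that appears in only one extraction result; this sketch assumes its missing probability counts as 0. The threshold of 0.6 in the example is likewise chosen only for illustration.

```python
def fuse_results(first, second, a, b, threshold):
    """Keep entities whose weighted total probability exceeds the threshold.

    first / second: {(start, end): {"entity", "type", "prob"[, "role"]}}
    a, b: preset weights of the first and second extraction models.
    """
    fused = []
    for position in sorted(set(first) | set(second)):
        r1, r2 = first.get(position), second.get(position)
        # Assumption: an entity missing from one result contributes 0.
        p1 = r1["prob"] if r1 else 0.0
        p2 = r2["prob"] if r2 else 0.0
        total = a * p1 + b * p2
        if total > threshold:
            base = r2 or r1
            fused.append({
                "position": position,
                "entity": base["entity"],
                "type": base["type"],
                # Only the second model predicts the subject-object type.
                "role": r2["role"] if r2 else None,
                "total_prob": total,
            })
    return fused


first = {(5, 9): {"entity": "Mec variant", "type": "virus", "prob": 0.6}}
second = {(5, 9): {"entity": "Mec variant", "type": "virus",
                   "role": "subject", "prob": 0.7}}
optimal = fuse_results(first, second, a=0.5, b=0.5, threshold=0.6)
```

  • With a = b = 0.5 the total probability of the Mec variant is 0.65, so it is kept at a threshold of 0.6 and would be discarded at a threshold of 0.7.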
  • The first extraction model understands the semantics of the text to be extracted faster and more accurately based on the prior knowledge in the first input data, and the second extraction model does so based on the entities and entity types marked in the second input data, thereby improving the accuracy of extracting entities from the text to be extracted.
  • Moreover, an optimal extraction result can be obtained from the first extraction result and the second extraction result, thereby further improving the accuracy of extracting entities from the text to be extracted.
  • In addition, the output results of the two models (i.e., the first extraction result and the second extraction result) are fused, thereby improving the accuracy of the optimal extraction result obtained.
  • Entity pairs and the relationship between the entities of each pair can also be obtained according to the optimal extraction result.
  • Specifically, entity pairs are obtained according to the entities in the optimal extraction result; the text to be extracted is marked according to the entity pairs and the optimal extraction result; the marked text to be extracted is input into the preset RoBERTa model to obtain the encoding of each entity in the entity pair; the encodings corresponding to the first character of each entity in the entity pair are spliced, and relationship classification is performed on the spliced encoding according to the preset classification algorithm, to obtain the entity relationship corresponding to the entity pair.
  • each entity in the optimal extraction result can be paired into an entity pair.
  • the text to be extracted is: the recently discovered Mec variant is associated with the Lazzar organization.
  • the extracted entities in the text to be extracted are: Mec variant and Lazzar organization, then according to the above two entities, the obtained entity pair is (Mec variant, Lazzar organization).
  • the text to be extracted is marked for each entity pair.
  • For example, the marked text for the entity pair (Mec variant, Lazzar organization) is: recently found that <S:vul>Mec variant</S:vul> is associated with <O:org>Lazzar organization</O:org>.
  • S indicates that the entity is a subject
  • O indicates that the entity is an object
  • vul indicates that the entity type is a virus
  • org indicates that the entity type is an organization.
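  • The marking of an entity pair can be sketched as follows. The function name `mark_entity_pair`, the use of 0-based character spans with an exclusive end, and the assumption that the two spans do not overlap are illustrative choices only.

```python
def mark_entity_pair(text, subject, obj):
    """Wrap the subject and object spans in <S:type> and <O:type> tags.

    subject / obj: (start, end, type_tag), 0-based, end exclusive.
    Spans are assumed not to overlap.
    """
    spans = [(subject, "S"), (obj, "O")]
    # Insert from right to left so earlier offsets stay valid.
    for (start, end, tag), role in sorted(
            spans, key=lambda s: s[0][0], reverse=True):
        label = f"{role}:{tag}"
        text = f"{text[:start]}<{label}>{text[start:end]}</{label}>{text[end:]}"
    return text


text = "recently found that Mec variant is associated with Lazzar organization."
s = text.index("Mec variant")
o = text.index("Lazzar organization")
marked = mark_entity_pair(
    text,
    subject=(s, s + len("Mec variant"), "vul"),
    obj=(o, o + len("Lazzar organization"), "org"),
)
```

  • The positions and subject-object types needed here are exactly the fields carried by the optimal extraction result.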
  • When more than two entities are extracted, every two entities need to be formed into an entity pair.
  • For example, the entities extracted from the text to be extracted are A, B, C, and D.
  • six entity pairs can be obtained, namely: (A, B), (A, C), (A, D), (B, C), (B, D) and (C, D).
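  • The pairing can be sketched with the standard library as follows; the function name `build_entity_pairs` is hypothetical, and the pairs are treated as unordered, with the subject and object roles assumed to come from the subject-object type in the optimal extraction result.

```python
from itertools import combinations


def build_entity_pairs(entities):
    """Return every unordered pair of distinct entities."""
    return list(combinations(entities, 2))


pairs = build_entity_pairs(["A", "B", "C", "D"])
```

  • For n entities this yields n(n-1)/2 pairs, which gives the six pairs listed above for four entities.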
  • The encoding is generated separately for the characters of each entity in the entity pair.
  • The preset classification algorithm may be, for example, softmax.
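  • The splice-and-classify step can be sketched as follows. The encodings would come from the RoBERTa model; here they are replaced by small hand-written vectors, and the linear layer weights, the relation labels and the function names are all hypothetical values chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(subj_vec, obj_vec, weights, bias, labels):
    """Splice two first-character encodings and classify the relation."""
    spliced = subj_vec + obj_vec  # list concatenation = splicing
    scores = [sum(w * x for w, x in zip(row, spliced)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Stand-ins for the encodings of the first character of each entity.
subj_vec, obj_vec = [1.0, 0.0], [0.0, 1.0]
weights = [[1.0, 0.0, 0.0, 1.0],   # hypothetical row for "associated_with"
           [0.0, 1.0, 1.0, 0.0]]   # hypothetical row for "no_relation"
bias = [0.0, 0.0]
labels = ["associated_with", "no_relation"]
relation, probs = classify_relation(subj_vec, obj_vec, weights, bias, labels)
```

  • In a trained system the weights and bias would be learned together with the encoder rather than fixed by hand.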
  • Obtaining the entity pairs from the entities in the optimal extraction result improves the accuracy of the entity pairs obtained; that is, each entity pair can be obtained more accurately according to the optimal extraction result.
  • The text to be extracted is then marked, and the marked text is input into the preset RoBERTa model to obtain the encoding of each entity in the entity pair; the encodings corresponding to the first character of each entity are spliced, and relationship classification is performed on the spliced encoding according to the preset classification algorithm, so that the entity relationship corresponding to the entity pair can be obtained accurately from the marked text to be extracted.
  • the embodiment of the present application also provides an entity extraction device 100 , which includes: an acquisition module 101 , an extraction module 102 and a processing module 103 .
  • the acquisition module 101 is used to obtain the first input data, where the first input data includes the text to be extracted and the preset prior knowledge, and the prior knowledge includes each entity definition or the vocabulary that characterizes each entity type; and to obtain the second input data, where the second input data includes the text to be extracted and the entities and entity types already recognized in the text to be extracted.
  • the extraction module 102 is used to input the first input data into the preset first extraction model to obtain the first extraction result, where the first extraction result includes each extracted entity and the type, position and probability of the entity; and to input the second input data into the preset second extraction model to obtain the second extraction result, where the second extraction result includes the type, position, subject-object type and probability of each extracted entity.
  • the processing module 103 is configured to obtain an optimal extraction result according to the first extraction result and the second extraction result.
  • the optimal extraction result includes each entity corresponding to the text to be extracted, together with the entity's type, subject-object type and location.
  • the acquisition module 101 is specifically configured to splice the text to be extracted with each definition text in the preset entity definition library to obtain the first input data; or to splice the text to be extracted with each preset word that characterizes an entity type to obtain the first input data.
  • the acquisition module 101 is specifically configured to match the text to be extracted with the AC automaton to obtain the entities corresponding to the text to be extracted and the type corresponding to each entity; and to splice the text to be extracted with the matched entities and their types to obtain the second input data.
  • the entity extraction device 100 also includes a construction module 104, which is used to obtain a first training set, where the first training set includes a first training sample and a first label, the first training sample is the text obtained by splicing a preset text with each item of prior knowledge, and the first label is the preset text after entity labeling; the first training set is used to train the initial first extraction model to obtain the first extraction model.
  • the construction module 104 is also used to obtain a second training set, where the second training set includes a second training sample and a second label, the second training sample is the text obtained by matching the preset text with the AC automaton and splicing the matched entities and their types with the preset text, and the second label is the preset text after entity labeling; the second training set is used to train the initial second extraction model to obtain the second extraction model.
  • the processing module 103 is specifically configured to calculate, according to the preset weight ratio between the first extraction model and the second extraction model, the probability of each entity in the first extraction result and the probability of each entity in the second extraction result, to obtain the total probability corresponding to each entity; and to compare the total probability with the preset threshold to obtain the entities whose total probability is greater than the preset threshold together with each entity's type, subject-object type and location. The optimal extraction result includes all entities whose total probability is greater than the preset threshold and each entity's corresponding type, subject-object type and location.
  • the processing module 103 is also configured to obtain entity pairs from the entities in the optimal extraction result; mark the text to be extracted according to the entity pair and the optimal extraction result; input the marked text to be extracted into a preset RoBERTa model to obtain the encoding of each entity in the entity pair; and splice the encodings corresponding to the first character of each entity in the entity pair and perform relationship classification on the spliced encoding according to the preset classification algorithm, to obtain the entity relationship corresponding to the entity pair.
  • FIG. 3 is a schematic structural block diagram of an electronic device 200 provided by an embodiment of the present application based on the same inventive concept, and the electronic device 200 is used for the above-mentioned entity extraction method.
  • the electronic device 200 may be, but is not limited to, a personal computer (PC), a smart phone, a tablet computer, a personal digital assistant (PDA), a mobile Internet device (MID) and the like.
  • the electronic device 200 may include a processor 210 and a memory 220 .
  • the processor 210 and the memory 220 are electrically connected directly or indirectly to realize data transmission or interaction.
  • these components may be electrically connected to each other through one or more communication buses or signal lines.
  • the processor 210 may be an integrated circuit chip with signal processing capabilities.
  • the processor 210 may also be a general-purpose processor, for example a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application.
  • a general-purpose processor may be a microprocessor or any conventional processor or the like.
  • the memory 220 can be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), and electrically erasable programmable read-only memory (EEPROM).
  • FIG. 3 is only for illustration, and the electronic device 200 provided in the embodiment of the present application may also have fewer or more components than that shown in FIG. 3 , or have a configuration different from that shown in FIG. 3 .
  • each component shown in FIG. 3 may be realized by software, hardware or a combination thereof.
  • an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed, the method provided in the above-mentioned embodiments is executed.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, DVD), or a semiconductor medium (for example, a Solid State Disk (SSD)).
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some communication interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated to form an independent part, each module may exist independently, or two or more modules may be integrated to form an independent part.

Abstract

An entity extraction method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring first input data, wherein the first input data comprises text to be subjected to extraction and various preset a priori knowledge (S101); acquiring second input data, wherein the second input data comprises text to be subjected to extraction, and an entity and an entity type which have been identified from the text to be subjected to extraction (S102); inputting the first input data into a preset first extraction model to acquire a first extraction result, wherein the first extraction result comprises each extracted entity and the type, position and probability of the entity (S103); inputting the second input data into a preset second extraction model to acquire a second extraction result, wherein the second extraction result comprises the type, position, subject-object type and probability of each extracted entity (S104); and acquiring an optimal extraction result according to the first extraction result and the second extraction result (S105). In this way, the problem of the poor accuracy of entity extraction in the prior art can be alleviated.

Description

Entity extraction method and apparatus, electronic device, and storage medium
Technical field
The present application relates to the technical field of data processing, and in particular to an entity extraction method, apparatus, electronic device and storage medium.
Background
At present, knowledge graphs are built from plain-text data by using relatively static entity dictionaries and rule templates to obtain entities and entity relationships. This approach has poor timeliness: if an entity in the threat intelligence has not been added to the entity dictionary, the entity goes unlabeled; if the threat intelligence contains a type that the rule templates do not cover, labels are likewise missed, which lowers accuracy. Moreover, obtaining entities and entity relationships through rule templates requires many specialists to maintain the templates, which is labor-intensive. In addition, knowledge graph construction has in recent years borrowed ideas from deep learning and used BiLSTM+CRF for knowledge extraction; although accuracy has improved, it still does not meet practical requirements.
Summary of the invention
The purpose of the embodiments of the present application is to provide an entity extraction method, apparatus, electronic device and storage medium, so as to alleviate the problem that entity extraction in the prior art has poor accuracy.
The present application is implemented as follows:
In a first aspect, an embodiment of the present application provides an entity extraction method. The method includes: acquiring first input data, where the first input data includes the text to be extracted and each item of preset prior knowledge, and the prior knowledge includes each entity definition or the vocabulary that characterizes each entity type; acquiring second input data, where the second input data includes the text to be extracted and the entities and entity types already identified in the text to be extracted; inputting the first input data into a preset first extraction model to obtain a first extraction result, where the first extraction result includes each extracted entity and the type, position and probability of the entity; inputting the second input data into a preset second extraction model to obtain a second extraction result, where the second extraction result includes the type, position, subject-object type and probability of each extracted entity; and obtaining an optimal extraction result according to the first extraction result and the second extraction result, where the optimal extraction result includes each entity corresponding to the text to be extracted and the entity's type, subject-object type and position.
In the embodiments of the present application, the first input data is input into the preset first extraction model to obtain the first extraction result, and the second input data is input into the preset second extraction model to obtain the second extraction result. This allows the first extraction model, using the prior knowledge in the first input data, and the second extraction model, using the entities and entity types marked in the second input data, to understand the semantics of the text to be extracted faster and more accurately, which improves the accuracy of extracting entities from that text. Moreover, an optimal extraction result can be obtained from the first extraction result and the second extraction result, which further improves the accuracy of entity extraction on the text to be extracted.
With reference to the technical solution provided in the first aspect, in some possible implementations, acquiring the first input data includes: splicing the text to be extracted with each definition text in a preset entity definition library to obtain the first input data; or splicing the text to be extracted with each preset word that characterizes an entity type to obtain the first input data.
In the embodiments of the present application, the first input data can be acquired quickly and accurately in this way.
With reference to the technical solution provided in the first aspect, in some possible implementations, acquiring the second input data includes: matching the text to be extracted with an AC automaton to obtain the entities corresponding to the text to be extracted and the type corresponding to each entity; and splicing the text to be extracted with the matched entities and their types to obtain the second input data.
In the embodiments of the present application, the second input data can be acquired quickly and accurately in this way.
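The matching step described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it builds a small Aho-Corasick (AC) automaton over a hypothetical entity library (a dictionary from entity string to type) and then splices the hits onto the text. The function names, the `entity[type]` hint syntax and the `[CLS]`/`[SEP]` layout of the second input data are all assumptions, since the text does not fix a splice format for the second input.

```python
from collections import deque


def build_automaton(entity_library):
    """Build goto/fail/output tables for the keys of entity_library."""
    goto, fail, out = [{}], [0], [[]]
    for word in entity_library:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass to fill in failure links (depth-1 states keep fail = 0).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def match_entities(text, entity_library):
    """Return (entity, type, 0-based start index) for every library hit in text."""
    goto, fail, out = build_automaton(entity_library)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((word, entity_library[word], i - len(word) + 1))
    return hits


def build_second_input(text, entity_library):
    """Splice the text with the matched entities and types (assumed layout)."""
    hints = "".join(f"{w}[{t}]" for w, t, _ in match_entities(text, entity_library))
    return f"[CLS]{text}[SEP]{hints}[SEP]"
```

A production system would more likely rely on an existing AC library; the hand-written automaton here only keeps the sketch self-contained.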
With reference to the technical solution provided in the first aspect, in some possible implementations, the first extraction model is obtained as follows: obtaining a first training set, where the first training set includes a first training sample and a first label, the first training sample is the text obtained by splicing a preset text with each item of prior knowledge, and the first label is the preset text after entity labeling; and training an initial first extraction model with the first training set to obtain the first extraction model.
With reference to the technical solution provided in the first aspect, in some possible implementations, the second extraction model is obtained as follows: obtaining a second training set, where the second training set includes a second training sample and a second label, the second training sample is the text obtained by matching the preset text with an AC automaton and splicing the matched entities and their types with the preset text, and the second label is the preset text after entity labeling; and training an initial second extraction model with the second training set to obtain the second extraction model.
With reference to the technical solution provided in the first aspect, in some possible implementations, obtaining the optimal extraction result according to the first extraction result and the second extraction result includes: calculating, according to a preset weight ratio between the first extraction model and the second extraction model, the probability of each entity in the first extraction result and the probability of each entity in the second extraction result, to obtain the total probability corresponding to each entity; and comparing the total probability with a preset threshold to obtain the entities whose total probability is greater than the preset threshold together with each entity's type, subject-object type and position. The optimal extraction result includes all entities whose total probability is greater than the preset threshold and each entity's corresponding type, subject-object type and position.
In the embodiments of the present application, the outputs of the two models (that is, the first extraction result and the second extraction result) can in this way be fused according to the entity extraction accuracy of the first extraction model and the second extraction model, which improves the accuracy of the optimal extraction result.
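The fusion step can be illustrated with a short sketch. It assumes that the "weight ratio" is applied as a weighted sum of the two models' probabilities and that an entity missing from one result contributes probability 0 there; the text does not spell out either point. The dictionary layout, the weights `w1`/`w2` and the key `(entity, type, start, end)` are hypothetical.

```python
def fuse_results(first, second, w1=0.4, w2=0.6, threshold=0.5):
    """Fuse two extraction results into an 'optimal' result.

    first / second map (entity, type, start, end) -> {"prob": float, ...};
    only the second result carries the subject-object role under "role".
    """
    fused = {}
    # dict.fromkeys keeps first-seen order while removing duplicate keys.
    for key in dict.fromkeys(list(first) + list(second)):
        p1 = first.get(key, {}).get("prob", 0.0)
        p2 = second.get(key, {}).get("prob", 0.0)
        total = w1 * p1 + w2 * p2          # assumed form of the weighted total
        if total > threshold:              # keep entities above the threshold
            fused[key] = {"prob": total, "role": second.get(key, {}).get("role")}
    return fused
```

With weights that sum to 1 the total stays in [0, 1], so the threshold can be read on the same scale as the individual model probabilities.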
With reference to the technical solution provided in the first aspect, in some possible implementations, the method further includes: obtaining entity pairs from the entities in the optimal extraction result; marking the text to be extracted according to the entity pair and the optimal extraction result; inputting the marked text to be extracted into a preset RoBERTa model to obtain the encoding of each entity in the entity pair; and splicing the encodings corresponding to the first character of each entity in the entity pair and performing relationship classification on the spliced encoding according to a preset classification algorithm, to obtain the entity relationship corresponding to the entity pair. In the embodiments of the present application, an optimal extraction result with high accuracy has already been obtained, and it includes each entity corresponding to the text to be extracted and the entity's type, subject-object type and position; obtaining entity pairs from the entities in the optimal extraction result therefore improves the accuracy of the entity pairs. In addition, the text to be extracted is marked according to the optimal extraction result and the extracted entity pair, the marked text is input into the preset RoBERTa model to obtain the encoding of each entity in the entity pair, the encodings corresponding to the first character of each entity are spliced, and relationship classification is performed on the spliced encoding according to the preset classification algorithm, so that the entity relationship corresponding to the entity pair can be accurately obtained from the marked text.
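The pairing and marking steps that precede relation classification can be sketched as below. This covers only the construction of entity pairs and the insertion of role/type markers; the RoBERTa encoding and the classifier are omitted. The marker syntax `<S:type>...</S:type>` and `<O:type>...</O:type>`, the entity dictionary fields and the function names are assumptions made for illustration.

```python
from itertools import combinations


def entity_pairs(entities):
    """Pair every entity with every later entity, e.g. 4 entities give 6 pairs."""
    return list(combinations(entities, 2))


def mark_text(text, subject, obj):
    """Wrap the subject and object spans in role:type markers.

    Each entity is a dict with 'type', 'start' and 'end' (0-based, end exclusive).
    Spans are processed right to left so earlier offsets stay valid.
    """
    spans = sorted([(subject, "S"), (obj, "O")],
                   key=lambda item: item[0]["start"], reverse=True)
    for ent, role in spans:
        tag = f"{role}:{ent['type']}"
        s, e = ent["start"], ent["end"]
        text = f"{text[:s]}<{tag}>{text[s:e]}</{tag}>{text[e:]}"
    return text
```

Marking one pair at a time means a text with several entities yields one marked copy per pair, each of which would be encoded and classified separately.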
In a second aspect, an embodiment of the present application provides an entity extraction apparatus. The apparatus includes: an acquisition module, configured to acquire first input data, where the first input data includes the text to be extracted and each item of preset prior knowledge, and the prior knowledge includes each entity definition or the vocabulary that characterizes each entity type, and to acquire second input data, where the second input data includes the text to be extracted and the entities and entity types already identified in the text to be extracted; an extraction module, configured to input the first input data into a preset first extraction model to obtain a first extraction result, where the first extraction result includes each extracted entity and the type, position and probability of the entity, and to input the second input data into a preset second extraction model to obtain a second extraction result, where the second extraction result includes the type, position, subject-object type and probability of each extracted entity; and a processing module, configured to obtain an optimal extraction result according to the first extraction result and the second extraction result, where the optimal extraction result includes each entity corresponding to the text to be extracted and the entity's type, subject-object type and position.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory that are connected to each other. The memory is used to store a program, and the processor is used to call the program stored in the memory to execute the method provided by the embodiment of the first aspect and/or by some possible implementations in combination with the embodiment of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When run by a processor, the computer program executes the method provided by the embodiment of the first aspect and/or by some possible implementations in combination with the embodiment of the first aspect.
Description of drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
FIG. 1 is a flow chart of the steps of an entity extraction method provided by an embodiment of the present application.
FIG. 2 is a module block diagram of an entity extraction apparatus provided by an embodiment of the present application.
FIG. 3 is a module block diagram of an electronic device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application.
In view of the poor accuracy of entity extraction in the prior art, the inventors of the present application, after research and exploration, propose the following embodiments to address this problem.
An embodiment of the present application provides an entity extraction method for extracting entities from plain-text data. The method uses a preset first extraction model and a preset second extraction model. To simplify the later description of the method, how these two models are obtained is described first.
The first extraction model is obtained as follows:
Obtain a first training set, where the first training set includes a first training sample and a first label, the first training sample is the text obtained by splicing a preset text with each item of prior knowledge, and the first label is the preset text after entity labeling; then train an initial first extraction model with the first training set to obtain the first extraction model.
For example, the preset text is: The research team recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues group of a certain country and is designed specifically for operating system A.
When the prior knowledge consists of the computer virus definition, the organization definition and the operating system definition, the first training samples are:
"[CLS]A computer virus is a set of computer instructions or program code that its author inserts into a computer program to damage computer functions or destroy data, that interferes with normal use of the computer, and that can replicate itself (i.e., the computer virus definition)[SEP]The research team recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues group of a certain country and is designed specifically for operating system A.[SEP]",
"[CLS]An organization is an organic whole formed by two or more individuals in order to achieve a common goal (i.e., the organization definition)[SEP]The research team recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues group of a certain country and is designed specifically for operating system A.[SEP]", and
"[CLS]An operating system is a computer program that manages computer hardware and software resources (i.e., the operating system definition)[SEP]The research team recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues group of a certain country and is designed specifically for operating system A.[SEP]".
When the prior knowledge consists of the words computer virus, organization and operating system (i.e., the vocabulary that characterizes each entity type), the first training samples are:
"[CLS]computer virus[SEP]The research team recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues group of a certain country and is designed specifically for operating system A.[SEP]",
"[CLS]organization[SEP]The research team recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues group of a certain country and is designed specifically for operating system A.[SEP]", and
"[CLS]operating system[SEP]The research team recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues group of a certain country and is designed specifically for operating system A.[SEP]".
It should be noted that each item of prior knowledge may be spliced either at the beginning or at the end of the preset text; this is not limited here. In addition, when splicing prior knowledge onto the preset text, the corresponding separators must be added to the spliced prior knowledge and the preset text, namely [CLS] and [SEP] in the examples above, where [CLS] is placed at the very start of the spliced text and [SEP] separates the prior knowledge from the preset text.
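The splicing rule described above can be sketched in a few lines. The function name `splice_with_prior` and its `prior_first` flag are hypothetical; the sketch only reproduces the string layout of the examples, with the prior knowledge either before or after the text.

```python
def splice_with_prior(text, prior_items, prior_first=True):
    """Return one spliced sample per item of prior knowledge.

    [CLS] opens each sample and [SEP] separates and closes the two segments,
    following the layout of the examples in the description.
    """
    samples = []
    for prior in prior_items:
        if prior_first:
            samples.append(f"[CLS]{prior}[SEP]{text}[SEP]")
        else:
            samples.append(f"[CLS]{text}[SEP]{prior}[SEP]")
    return samples


# One sample per entity-type word, as in the second group of examples.
samples = splice_with_prior(
    "The research team recently discovered a new Dacls remote access Trojan horse variant.",
    ["computer virus", "organization", "operating system"],
)
```

In practice, tokenizers for BERT-style models usually add their own special tokens when given a sentence pair, so the literal strings above serve mainly to show the layout.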
Further, the first label is the preset text after entity labeling; that is, the entities in the preset text and the type of each entity must be marked in the preset text. The position of each entity in the preset text, namely its start position and end position, can also be obtained from the labeled text. Continuing with the preset text above, and counting one position per Chinese character and one position per English letter, Dacls occupies positions 16 to 20 of the original Chinese preset text, and the Trojan horse variant (特洛伊木马变种) occupies positions 25 to 31. In addition, the entity type of Dacls and of the Trojan horse variant in the preset text is computer virus, the entity type of Lazarues is organization, and the entity type of A is operating system. It should be noted that the BIESO labeling scheme can be used when setting the first label. The BIESO scheme is a labeling method well known to those skilled in the art and is not described further here.
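The position counting and BIESO labeling can be made concrete with a small sketch. It uses the opening of the original Chinese preset text, counts positions from 1 with the end position inclusive, and uses the usual reading of BIESO (B = begin, I = inside, E = end, S = single-character entity, O = outside). The helper names and the type label `vul` are hypothetical.

```python
def find_span(text, entity):
    """Return the 1-based (start, end) position of entity in text, end inclusive."""
    start0 = text.find(entity)
    if start0 < 0:
        return None
    return start0 + 1, start0 + len(entity)


def bieso_tags(text, spans):
    """Produce one BIESO tag per character; spans are (start, end, type), 1-based."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        if start == end:
            tags[start - 1] = f"S-{etype}"
        else:
            tags[start - 1] = f"B-{etype}"
            for i in range(start, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"E-{etype}"
    return tags


preset = "有关研究团队最新发现了一种新的Dacls远程访问特洛伊木马变种"
dacls = find_span(preset, "Dacls")
trojan = find_span(preset, "特洛伊木马变种")
tags = bieso_tags(preset, [(*dacls, "vul"), (*trojan, "vul")])
```

Because each Chinese character and each English letter counts as one position, plain string indexing is enough here; a subword tokenizer would need an extra mapping from characters to tokens.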
It should be noted that the first extraction model uses a RoBERTa pre-trained model and two classifiers, where the two classifiers form the task layer. Specifically, the task layer trains the two classifiers to predict the start position and the end position of an entity span respectively, and the loss value LossA of the initial first extraction model is obtained from the prediction results. LossA is the sum of the loss values of the two classifiers. For example, the loss of the classifier that predicts the span start position is loss(Start) = CE(predicted start, label start), the loss of the classifier that predicts the span end position is loss(End) = CE(predicted end, label end), and LossA = loss(Start) + loss(End), where CE denotes cross-entropy. The initial first extraction model is trained with the first training set to obtain the first extraction model. The RoBERTa pre-trained model structure, the training method and the classifiers may use technical means commonly used in the art and are not described here.
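The loss computation above can be sketched as follows. This is a minimal illustration that assumes each classifier emits one score per token position and that a softmax is taken over positions, which is one common formulation of span prediction; the application does not state the exact output shape of the classifiers, and the function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against the gold index `label`."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def loss_a(start_logits, end_logits, gold_start, gold_end):
    """LossA = loss(Start) + loss(End)."""
    loss_start = cross_entropy(start_logits, gold_start)
    loss_end = cross_entropy(end_logits, gold_end)
    return loss_start + loss_end


# Hypothetical scores over a 4-token input; the gold span covers tokens 1..2.
print(loss_a([0.1, 2.0, 0.3, -1.0], [0.0, 0.2, 1.5, 0.1], gold_start=1, gold_end=2))
```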
The second extraction model is obtained through the following steps:
A second training set is obtained, which includes second training samples and second labels. A second training sample is the text obtained by matching the preset text with an AC automaton to find the corresponding entities and their types, and then splicing those entities and types with the preset text. A second label is the preset text after entity annotation. The initial second extraction model is trained with the second training set to obtain the second extraction model. It should be noted that the AC automaton matches against a preset entity library in which multiple entities are defined in advance; when the preset text is matched by the AC automaton, the matching is performed against the entities in this library.
For example, the preset text is: "The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues group of a certain country and is designed specifically for the A operating system." Passing this preset text through the AC automaton yields four entities and the entity type of each: Dacls (computer virus, vul), Trojan horse variant (computer virus, vul), Lazarues (organization, org) and A (operating system, sys). The second training sample is then "The relevant research team has recently discovered a new Dacls remote access Trojan horse variant, which is associated with the Lazarues group of a certain country and is designed specifically for the A operating system<vul>Dacls</vul><vul>Trojan horse variant</vul><org>Lazarues</org><sys>A</sys>.".
It should be noted that each matched entity and its entity type may be spliced at the beginning or at the end of the preset text, which is not limited here.
Further, the second label is the same as the first label: it is the preset text after entity annotation, in which the entities in the preset text and the type of each entity are marked, and from which the start position and end position of each entity in the preset text can be obtained. To avoid repetition, refer to the description of the first label above.
The initial second extraction model is trained with the second training set to obtain the second extraction model. The training method may be one well known to those skilled in the art and is not described here.
It should be noted that the preset text may be multiple different texts, and the prior knowledge may include multiple different entity definitions or multiple different words characterizing each entity type; the number is not limited here. It should also be noted that the second extraction model uses a RoBERTa pre-trained model and a CRF model, where the CRF model is the task layer. The structures of the RoBERTa pre-trained model and the CRF model are well known to those skilled in the art and are not described here.
After the first extraction model and the second extraction model are obtained, they can be used to perform entity extraction on plain-text data. The specific flow and steps of the entity extraction method are described below with reference to FIG. 1.
It should be noted that the entity extraction method provided in the embodiments of the present application is not limited to the order shown in FIG. 1 and described below.
Step S101: Obtain first input data, where the first input data includes the text to be extracted and each piece of preset prior knowledge.
The prior knowledge includes the definition of each entity or words characterizing each entity type. The text to be extracted is preprocessed text; the preprocessing includes letter-case conversion, removal of special symbols, and conversion of uncommon nouns in the text to be extracted into common nouns.
Specifically, the text to be extracted is spliced with each definition text in a preset entity definition library to obtain the first input data; or, the text to be extracted is spliced with each preset word characterizing an entity type to obtain the first input data.
For example, the text to be extracted is: "最近发现Mec变种关联到Lazzar组织。" (It was recently discovered that the Mec variant is associated with the Lazzar organization.)
When the prior knowledge is the definition of each entity, and the definition texts in the entity definition library include a security vulnerability definition text and an organization definition text, the first input data is "[CLS]A security vulnerability is a defect or error in the logical design of system software that can be exploited by an attacker to attack or take control of the entire computer[SEP]It was recently discovered that the Mec variant is associated with the Lazzar organization.[SEP]" (the text after [CLS] being the security vulnerability definition) and "[CLS]An organization is an organic whole formed by two or more individuals in order to achieve a common goal[SEP]It was recently discovered that the Mec variant is associated with the Lazzar organization.[SEP]" (the text after [CLS] being the organization definition).
When the prior knowledge is words characterizing each entity type, and the preset words are "security vulnerability" and "organization", the first input data is "[CLS]security vulnerability[SEP]It was recently discovered that the Mec variant is associated with the Lazzar organization.[SEP]" and "[CLS]organization[SEP]It was recently discovered that the Mec variant is associated with the Lazzar organization.[SEP]".
It should be noted that each definition text or each word characterizing an entity type may be spliced at the beginning or at the end of the text to be extracted, which is not limited here. In addition, when prior knowledge is spliced with the text to be extracted, corresponding separators must be added, namely [CLS] and [SEP] in the above examples: [CLS] is placed at the very beginning of the spliced text, and [SEP] separates the prior knowledge from the text to be extracted.
In this way, the first input data can be obtained quickly and accurately.
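The splicing in step S101 can be sketched as below: one input string is produced per piece of prior knowledge, with the prior knowledge placed before the text. The function name `build_first_inputs` is hypothetical, and writing the separators as literal strings is a simplification; in practice a RoBERTa/BERT-style tokenizer would normally insert its own special tokens.

```python
def build_first_inputs(text, prior_knowledge):
    """Splice each piece of prior knowledge with the text to be extracted.

    Returns one string per prior-knowledge item, laid out as
    [CLS] prior knowledge [SEP] text [SEP].
    """
    return [f"[CLS]{prior}[SEP]{text}[SEP]" for prior in prior_knowledge]


text = "最近发现Mec变种关联到Lazzar组织。"
# Words characterizing each entity type: security vulnerability, organization.
for item in build_first_inputs(text, ["安全漏洞", "组织"]):
    print(item)
```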
Step S102: Obtain second input data, where the second input data includes the text to be extracted and the entities and entity types already identified in the text to be extracted.
Specifically, the text to be extracted is matched by an AC automaton to obtain the entities corresponding to the text to be extracted and the type of each entity; the text to be extracted is then spliced with the matched entities and their types to obtain the second input data. It should be noted that the AC automaton matches against a preset entity library in which multiple entities are defined in advance; when the text to be extracted is matched by the AC automaton, the matching is performed against the entities in this library.
For example, the text to be extracted is: "It was recently discovered that the Mec variant is associated with the Lazzar organization." Passing this text through the AC automaton yields two entities and the entity type of each: Mec variant (virus, vul) and Lazzar (organization, org). The second input data is then "It was recently discovered that the Mec variant is associated with the Lazzar organization<vul>Mec variant</vul><org>Lazzar organization</org>.".
It should be noted that each matched entity and its entity type may be spliced at the beginning or at the end of the text to be extracted, which is not limited here.
In this way, the second input data can be obtained quickly and accurately.
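For illustration, step S102 can be sketched with a small pure-Python Aho-Corasick (AC) automaton built from an entity library. The library contents, the function names and the choice to append the tagged entities after the full text are assumptions made for this example; a production system would more likely rely on an existing AC implementation.

```python
from collections import deque


def build_automaton(entity_library):
    """Build an AC automaton from a {entity_text: entity_type} library."""
    goto, fail, out = [{}], [0], [[]]
    for word, etype in entity_library.items():
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append((word, etype))
    # Breadth-first pass that fills in failure links and merges outputs.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def match_entities(text, automaton):
    """Return (entity, type, start, end) tuples with 1-based inclusive positions."""
    goto, fail, out = automaton
    node, found = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word, etype in out[node]:
            found.append((word, etype, i - len(word) + 2, i + 1))
    return found


def build_second_input(text, matches):
    """Append each matched entity, wrapped in its type tag, to the text."""
    tagged = "".join(f"<{etype}>{word}</{etype}>" for word, etype, _, _ in matches)
    return text + tagged


library = {"Mec变种": "vul", "Lazzar": "org"}   # hypothetical entity library
text = "最近发现Mec变种关联到Lazzar组织。"
matches = match_entities(text, build_automaton(library))
print(matches)
print(build_second_input(text, matches))
```

Because the AC automaton scans the text once regardless of how many entities the library holds, the cost of building the second input data grows with the text length rather than with the size of the entity library.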
It should be noted that the first input data and the second input data are constructed in the same way as the first training samples and the second training samples used in the model training described above. To avoid repetition, the description is not repeated here; the corresponding parts may be cross-referenced.
It should also be noted that step S101 and step S102 may be performed simultaneously or one after the other, that is, step S101 followed by step S102, or step S102 followed by step S101, which is not limited here.
After the first input data and the second input data are obtained, the method proceeds to step S103.
Step S103: Input the first input data into the preset first extraction model to obtain a first extraction result.
The first extraction result includes each extracted entity and the type, position and probability of the entity. The position is the start position and end position of the entity in the text to be extracted. For example, if the text to be extracted is "最近发现Mec变种关联到Lazzar组织" (It was recently discovered that the Mec variant is associated with the Lazzar organization), the position of "Mec变种" (Mec variant) in the text is "5-9": 5 indicates that the character "M" is at the 5th position of the text to be extracted, and 9 indicates that the final character "种" is at the 9th position. The entity type of the Mec variant is virus. The probability of an entity is the probability, as predicted by the first extraction model, that the entity appears in the text to be extracted; for example, a probability of 60% for the Mec variant means that the first extraction model predicts a 60% probability that the Mec variant appears in the text to be extracted.
Step S104: Input the second input data into the preset second extraction model to obtain a second extraction result.
The second extraction result includes the type, position, subject-object type and probability of each extracted entity. The probability of an entity is the probability, as predicted by the second extraction model, that the entity appears in the text to be extracted.
In this embodiment of the application, for the type, position and probability of each entity, refer to the description in step S103 above; to avoid repetition, it is not repeated here. The subject-object type indicates whether the entity is in the subject position or the object position in the text to be extracted; that is, the subject-object type is determined from where the entity is located.
It should be noted that step S103 and step S104 may be performed simultaneously or one after the other, that is, step S103 followed by step S104, or step S104 followed by step S103, which is not limited here.
After the first extraction result and the second extraction result are obtained, the method proceeds to step S105.
Step S105: Obtain an optimal extraction result according to the first extraction result and the second extraction result.
The optimal extraction result includes each entity corresponding to the text to be extracted, together with the type, subject-object type and position of the entity.
Specifically, according to the preset weight ratio between the first extraction model and the second extraction model, the probability of each entity in the first extraction result and the probability of each entity in the second extraction result are combined to obtain the total probability of each entity. The total probability is compared with a preset threshold, and the entities whose total probability is greater than the preset threshold are obtained together with their types, subject-object types and positions. The optimal extraction result includes all entities whose total probability is greater than the preset threshold, together with the type, subject-object type and position of each such entity.
For example, in the first extraction result the entity at position "5-9" is the Mec variant, with entity type virus and probability 60%; in the second extraction result the entity at position "5-9" is the Mec variant, with entity type virus, subject-object type subject, and probability 70%. If the preset weight of the first extraction model is a and the preset weight of the second extraction model is b, the total probability of the Mec variant is 60%·a + 70%·b. This total probability is compared with the preset threshold. If it is greater than the preset threshold, the Mec variant is likely to appear at position "5-9" of the text to be extracted, and the Mec variant together with its type, subject-object type and position can be taken as part of the optimal extraction result.
It should be noted that the values of a and b may be, for example, a = 0.5 and b = 0.5, or a = 0.4 and b = 0.6, and the preset threshold may be any one of 0.7, 0.8, 0.85 and 0.9. It should also be noted that the weight of the first extraction model, the weight of the second extraction model and the preset threshold can all be set according to actual conditions.
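The fusion in step S105 can be sketched as a weighted sum followed by a threshold test. This is a minimal illustration: keying each result by (start, end, type) and treating an entity that is missing from one result as having probability 0 are assumptions of this example, not details stated in the application, and the function name `fuse` is hypothetical.

```python
def fuse(first, second, a=0.5, b=0.5, threshold=0.7):
    """Combine two extraction results and keep entities above the threshold.

    `first` and `second` map (start, end, type) to the probability assigned
    by the first and second extraction model respectively.
    """
    kept = {}
    for key in first.keys() | second.keys():
        total = a * first.get(key, 0.0) + b * second.get(key, 0.0)
        if total > threshold:
            kept[key] = total
    return kept


mec = (5, 9, "vul")
# Probabilities from the example: 60% and 70%, with equal weights.
print(fuse({mec: 0.6}, {mec: 0.7}))   # total 0.65 is not above 0.7, so nothing is kept
print(fuse({mec: 0.9}, {mec: 0.8}))   # total is about 0.85, so the entity is kept
```

With the example probabilities of 60% and 70% and equal weights, the total probability is 0.65, so the entity would only enter the optimal extraction result under a threshold lower than the listed values or under different weights; this shows how strongly the outcome depends on the chosen settings.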
In this embodiment of the application, the first input data is input into the preset first extraction model to obtain the first extraction result, and the second input data is input into the preset second extraction model to obtain the second extraction result. This enables the first extraction model, using the prior knowledge in the first input data, and the second extraction model, using the entities and entity types marked in the second input data, to understand the semantics of the text to be extracted faster and more accurately, thereby improving the accuracy of extracting entities from the text to be extracted. An optimal extraction result can be obtained from the first extraction result and the second extraction result, which further improves the accuracy of entity extraction. In addition, the outputs of the two models (that is, the first extraction result and the second extraction result) are fused according to the entity extraction accuracy of the first extraction model and the second extraction model, which improves the accuracy of the optimal extraction result.
After the optimal extraction result is obtained, entity pairs and the relationship of each entity pair can also be obtained from the optimal extraction result.
Specifically, entity pairs are obtained from the entities in the optimal extraction result; the text to be extracted is marked according to the entity pairs and the optimal extraction result; the marked text is input into a preset RoBERTa model to obtain the encoding of each entity in the entity pair; the encodings corresponding to the first character of each entity in the entity pair are concatenated, and relation classification is performed on the concatenated encoding according to a preset classification algorithm to obtain the entity relationship of the entity pair.
After the optimal extraction result has been obtained for the text to be extracted, the entities in the optimal extraction result can be combined two by two into entity pairs. For example, the text to be extracted is: "It was recently discovered that the Mec variant is associated with the Lazzar organization." The entities extracted from this text are the Mec variant and the Lazzar organization, so the entity pair obtained from these two entities is (Mec variant, Lazzar organization).
According to the entity pairs and the optimal extraction result (that is, each entity and its type, subject-object type and position), the text to be extracted is marked for each entity pair.
Taking (Mec variant, Lazzar organization) as an example, the marked text for this entity pair is: "It was recently discovered that <S:vul>Mec variant<S:vul> is associated with <O:org>Lazzar organization<O:org>." Here, S indicates that the entity is the subject, O indicates that the entity is the object, vul indicates that the entity type is virus, and org indicates that the entity type is organization.
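The marking step can be sketched as follows, using the original Chinese text and 1-based inclusive positions. The function name `mark_pair` is hypothetical, and the example uses the org type for the organization entity, in line with the tag legend above.

```python
def mark_pair(text, subject, obj):
    """Wrap the subject and object of an entity pair in role/type markers.

    `subject` and `obj` are (start, end, type) tuples with 1-based
    inclusive positions in `text`.
    """
    spans = [(subject, "S"), (obj, "O")]
    # Insert markers from right to left so that earlier offsets stay valid.
    for (start, end, etype), role in sorted(spans, key=lambda s: s[0][0], reverse=True):
        tag = f"<{role}:{etype}>"
        text = text[:start - 1] + tag + text[start - 1:end] + tag + text[end:]
    return text


text = "最近发现Mec变种关联到Lazzar组织。"
print(mark_pair(text, subject=(5, 9, "vul"), obj=(13, 20, "org")))
```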
It should be noted that if more than two entities are extracted from the text to be extracted, the entities are combined two by two into entity pairs. For example, if the entities extracted from the text are A, B, C and D, six entity pairs can be obtained from these four entities: (A, B), (A, C), (A, D), (B, C), (B, D) and (C, D). The text to be extracted is then marked separately for each entity pair according to the entity pair and the optimal extraction result.
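The pairwise combination described above corresponds to taking all 2-element combinations of the extracted entities, which can be sketched with the Python standard library. The function name `entity_pairs` is hypothetical.

```python
from itertools import combinations


def entity_pairs(entities):
    """Return every unordered pair of extracted entities."""
    return list(combinations(entities, 2))


pairs = entity_pairs(["A", "B", "C", "D"])
print(len(pairs), pairs)
```

For n extracted entities this yields n(n-1)/2 pairs, so the number of marked texts to encode grows quadratically with the number of entities.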
The marked text corresponding to each entity pair is input into the preset RoBERTa model to obtain the encoding of each entity in the entity pair, where an encoding is generated for each character of each entity in the entity pair. The encodings corresponding to the first character of each entity in the entity pair are concatenated, and the concatenated data is fed into a classification algorithm (for example, softmax) for relation classification, which yields the entity relationship of the entity pair. The RoBERTa model and the softmax algorithm are well known to those skilled in the art and are not described here.
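The concatenation and classification step can be sketched as below. The example assumes a single linear layer in front of the softmax, tiny hand-written vectors in place of real RoBERTa encodings, and made-up relation labels; none of these specifics come from the application.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(subj_vec, obj_vec, weights, bias, labels):
    """Concatenate the first-character encodings and classify the relation."""
    features = subj_vec + obj_vec      # list concatenation of the two encodings
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical 2-dimensional encodings and a 2-class linear layer.
label, probs = classify_relation(
    subj_vec=[1.0, 0.0],
    obj_vec=[0.0, 1.0],
    weights=[[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
    bias=[0.0, 0.0],
    labels=["associated_with", "no_relation"],
)
print(label, probs)
```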
In this embodiment of the application, because an optimal extraction result with high accuracy has been obtained, obtaining entity pairs from the entities in the optimal extraction result improves the accuracy of the entity pairs. In addition, the text to be extracted is marked according to the optimal extraction result and the extracted entity pairs, and the marked text is input into the preset RoBERTa model to obtain the encoding of each entity in the entity pair; the encodings corresponding to the first character of each entity in the entity pair are then concatenated, and relation classification is performed on the concatenated encoding according to the preset classification algorithm. In this way, the entity relationship of each entity pair can be obtained accurately from the marked text.
Referring to FIG. 2, based on the same inventive concept, an embodiment of the present application further provides an entity extraction apparatus 100, which includes an acquisition module 101, an extraction module 102 and a processing module 103.
The acquisition module 101 is configured to obtain first input data, where the first input data includes the text to be extracted and each piece of preset prior knowledge, and the prior knowledge includes the definition of each entity or words characterizing each entity type; and to obtain second input data, where the second input data includes the text to be extracted and the entities and entity types already identified in the text to be extracted.
The extraction module 102 is configured to input the first input data into the preset first extraction model to obtain a first extraction result, where the first extraction result includes each extracted entity and the type, position and probability of the entity; and to input the second input data into the preset second extraction model to obtain a second extraction result, where the second extraction result includes the type, position, subject-object type and probability of each extracted entity.
The processing module 103 is configured to obtain an optimal extraction result according to the first extraction result and the second extraction result, where the optimal extraction result includes each entity corresponding to the text to be extracted, together with the type, subject-object type and position of the entity.
Optionally, the acquisition module 101 is specifically configured to splice the text to be extracted with each definition text in the preset entity definition library to obtain the first input data; or to splice the text to be extracted with each preset word characterizing an entity type to obtain the first input data.
Optionally, the acquisition module 101 is specifically configured to match the text to be extracted with an AC automaton to obtain the entities corresponding to the text to be extracted and the type of each entity; and to splice the text to be extracted with the matched entities and their types to obtain the second input data.
Optionally, the entity extraction apparatus 100 further includes a construction module 104. The construction module 104 is configured to obtain a first training set, where the first training set includes first training samples and first labels, a first training sample is the text obtained by splicing the preset text with each piece of prior knowledge, and a first label is the preset text after entity annotation; and to train the initial first extraction model with the first training set to obtain the first extraction model.
Optionally, the construction module 104 is further configured to obtain a second training set, where the second training set includes second training samples and second labels, a second training sample is the text obtained by matching the preset text with an AC automaton to find the corresponding entities and their types and splicing them with the preset text, and a second label is the preset text after entity annotation; and to train the initial second extraction model with the second training set to obtain the second extraction model.
Optionally, the processing module 103 is specifically configured to combine, according to the preset weight ratio between the first extraction model and the second extraction model, the probability of each entity in the first extraction result and the probability of each entity in the second extraction result to obtain the total probability of each entity; and to compare the total probability with a preset threshold and obtain the entities whose total probability is greater than the preset threshold together with their types, subject-object types and positions. The optimal extraction result includes all entities whose total probability is greater than the preset threshold, together with the type, subject-object type and position of each such entity.
Optionally, the processing module 103 is further configured to obtain entity pairs from the entities in the optimal extraction result; mark the text to be extracted according to the entity pairs and the optimal extraction result; input the marked text into the preset RoBERTa model to obtain the encoding of each entity in the entity pair; and concatenate the encodings corresponding to the first character of each entity in the entity pair and perform relation classification on the concatenated encoding according to the preset classification algorithm to obtain the entity relationship of the entity pair.
Referring to FIG. 3, based on the same inventive concept, an embodiment of the present application provides a schematic structural block diagram of an electronic device 200, which is used to carry out the entity extraction method described above. In the embodiments of the present application, the electronic device 200 may be, but is not limited to, a personal computer (PC), a smartphone, a tablet computer, a personal digital assistant (PDA), a mobile Internet device (MID), or the like. Structurally, the electronic device 200 may include a processor 210 and a memory 220.
The processor 210 and the memory 220 are electrically connected, directly or indirectly, to enable data transmission or interaction; for example, these components may be electrically connected to each other through one or more communication buses or signal lines. The processor 210 may be an integrated circuit chip with signal processing capability. The processor 210 may also be a general-purpose processor, for example a central processing unit (CPU), or a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application. In addition, the general-purpose processor may be a microprocessor or any conventional processor.
存储器220可以是,但不限于,随机存取存储器(RandomAccessMemory,RAM)、只读存储器(ReadOnlyMemory,ROM)、可编程只读存储器(ProgrammableRead-OnlyMemory,PROM)、可擦可编程序只读存储器(ErasableProgrammableRead-OnlyMemory,EPROM),以及电可擦编程只读存储器(ElectricErasableProgrammableRead-OnlyMemory,EEPROM)。存储器220用于存储程序,处理器210在接收到执行指令后,执行该程序。The memory 220 can be, but not limited to, random access memory (RandomAccessMemory, RAM), read-only memory (ReadOnlyMemory, ROM), programmable read-only memory (ProgrammableRead-OnlyMemory, PROM), erasable programmable read-only memory ( Erasable Programmable Read-Only Memory, EPROM), and Electric Erasable Programmable Read-Only Memory (EEPROM). The memory 220 is used to store a program, and the processor 210 executes the program after receiving an execution instruction.
It should be understood that the structure shown in FIG. 3 is merely illustrative. The electronic device 200 provided by the embodiments of the present application may have fewer or more components than shown in FIG. 3, or a configuration different from that shown in FIG. 3. In addition, each component shown in FIG. 3 may be implemented in software, hardware, or a combination thereof.
It should be noted that, as those skilled in the art will clearly understand, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments, and are not repeated here.
Based on the same inventive concept, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when run, performs the method provided in the above embodiments.
The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid-state disk (SSD)), or the like.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation; as another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in another form.
In addition, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form one independent part, each module may exist separately, or two or more modules may be integrated to form one independent part.
In this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations.
The above are merely embodiments of the present application and are not intended to limit its scope of protection. Those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (10)

  1. An entity extraction method, characterized in that the method comprises:
    acquiring first input data, the first input data comprising text to be extracted and preset items of prior knowledge, the prior knowledge comprising entity definitions or vocabulary characterizing each entity type;
    acquiring second input data, the second input data comprising the text to be extracted and the entities and entity types already identified in the text to be extracted;
    inputting the first input data into a preset first extraction model to obtain a first extraction result, the first extraction result comprising each extracted entity and the type, position, and probability of that entity;
    inputting the second input data into a preset second extraction model to obtain a second extraction result, the second extraction result comprising the type, position, subject-object type, and probability of each extracted entity;
    obtaining an optimal extraction result according to the first extraction result and the second extraction result, the optimal extraction result comprising each entity corresponding to the text to be extracted and the type, subject-object type, and position of that entity.
  2. The method according to claim 1, characterized in that acquiring the first input data comprises:
    concatenating the text to be extracted with each definition text in a preset entity definition library, respectively, to obtain the first input data;
    or concatenating the text to be extracted with each preset vocabulary item characterizing an entity type, respectively, to obtain the first input data.
  3. The method according to claim 1, characterized in that acquiring the second input data comprises:
    matching the text to be extracted with an AC automaton to obtain the entities corresponding to the text to be extracted and the type corresponding to each entity;
    concatenating the text to be extracted with the matched entities and the type corresponding to each entity to obtain the second input data.
  4. The method according to claim 1, characterized in that the first extraction model is obtained by the following steps: acquiring a first training set, the first training set comprising first training samples and first labels, each first training sample being a text obtained by concatenating a preset text with one of the items of prior knowledge, and each first label being the preset text after entity annotation;
    training an initial first extraction model with the first training set to obtain the first extraction model.
  5. The method according to claim 1, characterized in that the second extraction model is obtained by the following steps: acquiring a second training set, the second training set comprising second training samples and second labels, each second training sample being a text obtained by concatenating a preset text with the entities, and the types of those entities, matched from the preset text by an AC automaton, and each second label being the preset text after entity annotation;
    training an initial second extraction model with the second training set to obtain the second extraction model.
  6. The method according to claim 1, characterized in that obtaining the optimal extraction result according to the first extraction result and the second extraction result comprises:
    performing a calculation, according to a preset weight ratio between the first extraction model and the second extraction model, on the probability of each entity in the first extraction result and the probability of each entity in the second extraction result, to obtain a total probability for each entity;
    comparing the total probability with a preset threshold to obtain each entity whose total probability is greater than the preset threshold, together with its type, subject-object type, and position, the optimal extraction result comprising all entities whose total probability is greater than the preset threshold and the type, subject-object type, and position of each such entity.
  7. The method according to claim 1, characterized in that the method further comprises:
    acquiring entity pairs according to the entities in the optimal extraction result;
    marking the text to be extracted according to the entity pairs and the optimal extraction result;
    inputting the marked text into a preset RoBERTa model to obtain an encoding of each entity in the entity pair;
    concatenating the encodings corresponding to the first character of each entity in the entity pair, and performing relation classification on the concatenated encoding according to a preset classification algorithm to obtain the entity relation corresponding to the entity pair.
  8. An entity extraction apparatus, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire first input data, the first input data comprising text to be extracted and preset items of prior knowledge, the prior knowledge comprising entity definitions or vocabulary characterizing each entity type; and to acquire second input data, the second input data comprising the text to be extracted and the entities and entity types already identified in the text to be extracted;
    an extraction module, configured to input the first input data into a preset first extraction model to obtain a first extraction result, the first extraction result comprising each extracted entity and the type, position, and probability of that entity; and to input the second input data into a preset second extraction model to obtain a second extraction result, the second extraction result comprising the type, position, subject-object type, and probability of each extracted entity;
    a processing module, configured to obtain an optimal extraction result according to the first extraction result and the second extraction result, the optimal extraction result comprising each entity corresponding to the text to be extracted and the type, subject-object type, and position of that entity.
  9. An electronic device, characterized by comprising a processor and a memory connected to each other, the memory being configured to store a program, and the processor being configured to run the program stored in the memory to perform the method according to any one of claims 1-7.
  10. A computer-readable storage medium, characterized in that a computer program is stored thereon, and the computer program, when run by a computer, performs the method according to any one of claims 1-7.
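
For illustration only, the AC-automaton matching and concatenation recited in claim 3 might be sketched as follows. This is a minimal Aho-Corasick implementation written for this example, not the applicant's code. The lexicon contents, the `ACAutomaton` and `build_second_input` names, the `[SEP]` separator, and the `entity:type` hint format are all assumptions; the claims do not specify how the matched entities are joined to the text.

```python
from collections import deque


class ACAutomaton:
    """Minimal Aho-Corasick automaton over a {surface form: entity type} lexicon."""

    def __init__(self, lexicon):
        self.goto = [{}]   # per-state character transitions (state 0 is the root)
        self.fail = [0]    # per-state failure links
        self.out = [[]]    # per-state (term, type) pairs that end at this state
        for term, etype in lexicon.items():
            self._add(term, etype)
        self._build_failure_links()

    def _add(self, term, etype):
        state = 0
        for ch in term:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append((term, etype))

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit terms that end at the failure target (proper suffixes).
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def match(self, text):
        """Return (start offset, term, entity type) for every lexicon hit."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for term, etype in self.out[state]:
                hits.append((i - len(term) + 1, term, etype))
        return hits


def build_second_input(text, automaton, sep="[SEP]"):
    """Concatenate the text with its matched entities and types (format assumed)."""
    hints = [f"{term}:{etype}" for _, term, etype in automaton.match(text)]
    return f" {sep} ".join([text] + hints)


# Hypothetical lexicon and sentence, used only to exercise the sketch.
lexicon = {"China Telecom": "ORG", "Telecom": "INDUSTRY", "Beijing": "LOC"}
automaton = ACAutomaton(lexicon)
sentence = "China Telecom opened an office in Beijing"
second_input = build_second_input(sentence, automaton)
```

The failure links let the automaton report overlapping dictionary terms in a single left-to-right pass, so the cost of matching does not grow with the number of lexicon entries.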
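
The weighting and thresholding recited in claim 6 can likewise be sketched under stated assumptions. The claim only says that a total probability is computed from the two models' probabilities according to a preset weight ratio; the weighted sum below, the default weights and threshold, the `Candidate` record, and the rule that an entity missing from one result contributes zero from that model are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    text: str             # entity surface form
    etype: str            # entity type
    start: int            # character offset in the text to be extracted
    role: Optional[str]   # subject-object type; None when the model does not report it
    prob: float           # model probability, or fused total probability


def fuse(first, second, w1=0.4, w2=0.6, threshold=0.5):
    """Fuse two extraction results by an assumed weighted sum, then threshold."""
    totals, roles = {}, {}
    for c in first:
        key = (c.text, c.etype, c.start)
        totals[key] = totals.get(key, 0.0) + w1 * c.prob
    for c in second:
        key = (c.text, c.etype, c.start)
        totals[key] = totals.get(key, 0.0) + w2 * c.prob
        roles[key] = c.role  # only the second result carries the subject-object type
    ordered = sorted(totals.items(), key=lambda kv: kv[0][2])
    return [
        Candidate(text, etype, start, roles.get((text, etype, start)), total)
        for (text, etype, start), total in ordered
        if total > threshold
    ]


# Hypothetical model outputs, used only to exercise the sketch.
first_result = [
    Candidate("Beijing", "LOC", 34, None, 0.9),
    Candidate("office", "ORG", 24, None, 0.3),
]
second_result = [Candidate("Beijing", "LOC", 34, "object", 0.8)]
best = fuse(first_result, second_result)
```

Keying candidates on surface form, type, and position means two models must agree on all three before their probabilities are combined; a looser key (for example, position only) would be a different design choice.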
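
The first two steps of claim 7, forming entity pairs and marking the text, might look like the sketch below. The marker tokens `<e1>`, `</e1>`, `<e2>`, `</e2>`, the function names, and the choice of ordered pairs are hypothetical; the claim does not say how the text is marked. The RoBERTa encoding and the classifier are left out, since they depend on a trained model.

```python
from itertools import permutations


def entity_pairs(entities):
    """All ordered (head, tail) pairs of distinct entities."""
    return list(permutations(entities, 2))


def mark_pair(text, head, tail):
    """Wrap the head and tail spans in marker tokens.

    Each entity is a (start, end) character span with `end` exclusive.
    The spans are assumed not to overlap or touch.
    """
    inserts = [
        (head[0], "<e1>"), (head[1], "</e1>"),
        (tail[0], "<e2>"), (tail[1], "</e2>"),
    ]
    # Insert from right to left so earlier offsets stay valid.
    for pos, marker in sorted(inserts, key=lambda item: item[0], reverse=True):
        text = text[:pos] + marker + text[pos:]
    return text


# Hypothetical sentence and spans, used only to exercise the sketch.
sentence = "Alice works at Acme"
spans = [(0, 5), (15, 19)]
marked = [mark_pair(sentence, head, tail) for head, tail in entity_pairs(spans)]
```

With markers in place, the position of the first character of each entity is easy to locate in the encoder output, which is where the claim takes the encodings that are concatenated for relation classification.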
PCT/CN2022/139496 2021-12-24 2022-12-16 Entity extraction method and apparatus, and electronic device and storage medium WO2023116561A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111598982.5 2021-12-24
CN202111598982.5A CN114265919A (en) 2021-12-24 2021-12-24 Entity extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023116561A1 true WO2023116561A1 (en) 2023-06-29

Family

ID=80829773

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/139496 WO2023116561A1 (en) 2021-12-24 2022-12-16 Entity extraction method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN114265919A (en)
WO (1) WO2023116561A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702787A (en) * 2023-08-07 2023-09-05 四川隧唐科技股份有限公司 Long text entity identification method, device, computer equipment and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114265919A (en) * 2021-12-24 2022-04-01 中电信数智科技有限公司 Entity extraction method and device, electronic equipment and storage medium
CN115620722B (en) * 2022-12-15 2023-03-31 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059320A (en) * 2019-04-23 2019-07-26 腾讯科技(深圳)有限公司 Entity relation extraction method, apparatus, computer equipment and storage medium
CN110502749A (en) * 2019-08-02 2019-11-26 中国电子科技集团公司第二十八研究所 A kind of text Relation extraction method based on the double-deck attention mechanism Yu two-way GRU
CN110569366A (en) * 2019-09-09 2019-12-13 腾讯科技(深圳)有限公司 text entity relation extraction method and device and storage medium
CN111539209A (en) * 2020-04-15 2020-08-14 北京百度网讯科技有限公司 Method and apparatus for entity classification
US20210073247A1 (en) * 2019-09-06 2021-03-11 Royal Bank Of Canada System and method for machine learning architecture for interdependence detection
CN112559770A (en) * 2020-12-15 2021-03-26 北京邮电大学 Text data relation extraction method, device and equipment and readable storage medium
CN112988979A (en) * 2021-04-29 2021-06-18 腾讯科技(深圳)有限公司 Entity identification method, entity identification device, computer readable medium and electronic equipment
KR20210147368A (en) * 2020-05-28 2021-12-07 삼성에스디에스 주식회사 Method and apparatus for generating training data for named entity recognition
CN113761190A (en) * 2021-05-06 2021-12-07 腾讯科技(深圳)有限公司 Text recognition method and device, computer readable medium and electronic equipment
CN114265919A (en) * 2021-12-24 2022-04-01 中电信数智科技有限公司 Entity extraction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114265919A (en) 2022-04-01

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22909875; Country of ref document: EP; Kind code of ref document: A1