CN113095083A - Entity extraction method and device - Google Patents


Info

Publication number
CN113095083A
Authority
CN
China
Prior art keywords
entity
text
extracted
vector representation
unknown
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110632223.XA
Other languages
Chinese (zh)
Inventor
钱佳佳
陈立力
周明伟
刘伟棠
范鹏召
郑燕玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an entity extraction method and device. The entity extraction method comprises the following steps: determining a natural question for an unknown entity, wherein the natural question is constructed from a known entity and an entity relationship, and the entity relationship is the relationship between the known entity and the unknown entity; obtaining a vector representation of the natural question and a text to be extracted; and determining the answer to the natural question based on the vector representation to obtain the unknown entity. With this method and device, an entity relationship can be extracted in a single operation, without performing relationship extraction after a separate entity extraction operation, which improves text analysis efficiency and allows a single model to complete the extraction of entity relationships from the text to be extracted.

Description

Entity extraction method and device
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a method and an apparatus for extracting entities.
Background
Automatic analysis of text information requires semantic understanding of the text content, including the entities in the text and the relationships between them; entity extraction and entity association are therefore the foundation of such analysis.
At present, text information is generally analyzed automatically in two steps: the words in the text are first classified to extract the entities in the text information, and the relationships between the entities are then extracted from the text information based on the extracted entities. This approach requires two operations, entity extraction and relationship extraction, so its steps are cumbersome and its text analysis efficiency is low.
Disclosure of Invention
The application provides an entity extraction method and device, which do not need to perform entity extraction operation before relationship extraction, improve text analysis efficiency, and can complete extraction of entity relationships in a text to be extracted by a single model.
To solve the above problem, the present application provides an entity extraction method, including:
determining a natural question for an unknown entity, wherein the natural question is constructed from a known entity and an entity relationship, and the entity relationship is the relationship between the known entity and the unknown entity;
obtaining a vector representation of the natural question and a text to be extracted;
and determining the answer to the natural question based on the vector representation to obtain the unknown entity.
In order to solve the above problem, the present application further provides an electronic device, which includes a processor; the processor is used for executing instructions to realize the method.
To solve the above problems, the present application also provides a computer-readable storage medium for storing instructions/program data that can be executed to implement the above-described method.
The method and the device extract the corresponding unknown entity from the text information directly, based on the natural question for the unknown entity. As long as the text information contains content expressing the entity relationship, stated in the natural question, between the known entity and the unknown entity, the unknown entity can be extracted from it. Text analysis therefore takes only one operation, with no relationship extraction performed after a separate entity extraction operation, which improves text analysis efficiency. Moreover, the text does not need to be analyzed with two models, an entity recognition model and a relationship extraction model; a single model completes the extraction of entity relationships from the text to be extracted. Because relationship extraction no longer relies on a separate entity recognition model, erroneous entity recognition results are not propagated to the relationship extraction module, which improves the accuracy of relationship extraction.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of an entity extraction method of the present application;
FIG. 2 is a schematic structural diagram of an embodiment of an electronic device of the present application;
FIG. 3 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present application.
Detailed Description
The description and drawings illustrate the principles of the application. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the application and are included within its scope. Moreover, all examples herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the application and the concepts provided by the inventors and thus further the art, and are not to be construed as being limited to such specifically recited examples and conditions. Additionally, the term "or" as used herein refers to a non-exclusive "or" (i.e., "and/or") unless otherwise indicated (e.g., "or otherwise" or in the alternative). Moreover, the various embodiments described herein are not necessarily mutually exclusive, as some embodiments may be combined with one or more other embodiments to form new embodiments.
To solve the problem that existing automatic text analysis methods require two steps, entity extraction and relationship extraction, the present application extracts the corresponding unknown entity from the text information directly, based on the natural question for the unknown entity. As long as the text information contains content expressing the entity relationship, stated in the natural question, between the known entity and the unknown entity, the unknown entity can be extracted from it. Text analysis therefore takes only one operation, with no relationship extraction performed after a separate entity extraction operation, which improves text analysis efficiency. The text does not need to be analyzed with two models, an entity recognition model and a relationship extraction model; a single model completes the extraction of entity relationships from the text to be extracted. Because relationship extraction no longer relies on a separate entity recognition model, erroneous entity recognition results are not propagated to the relationship extraction module, which improves the accuracy of relationship extraction.
The following describes the entity extraction method in detail, wherein a flow diagram of an embodiment of the entity extraction method is specifically shown in fig. 1, and the entity extraction method of the embodiment includes the following steps. The application field of the entity extraction method is not limited, and the method can be applied to the telecommunication fraud field or the cross-border trade field. It should be noted that the following step numbers are only used for simplifying the description, and are not intended to limit the execution order of the steps, and the execution order of the steps in the present embodiment may be arbitrarily changed without departing from the technical idea of the present application.
S101: a natural question of the unknown entity is determined.
The natural question of the unknown entity can be determined firstly, wherein the natural question is constructed by the known entity and the entity relation corresponding to the unknown entity, so that the unknown entity can be extracted from the text to be extracted subsequently based on the natural question.
In one implementation, the known entities, and the entity relationships between the known entities and the unknown entities to be extracted, can be defined based on the field of the text to be extracted; a natural question is then constructed from the known entity and the entity relationship corresponding to each unknown entity.
The entities of interest in the field of the text to be extracted can be taken as known entities. For example, in a telecom fraud case the entities of interest are generally the victim, the suspect and the case, so these can be taken as the known entities. The entity relationships of interest for each known entity are then determined; for example, the entity relationships of interest for the known entity "victim" include name, gender, ID card number, QQ number, bank card number, WeChat ID, registered household address, temporary residence, and so on.
In another implementation mode, at least one slot template is defined according to the field of the text to be extracted, and then a natural question corresponding to each slot template is generated. Wherein, the format of the slot template is as follows: entity relationships (known entities, unknown entities).
For example, a telecom fraud case generally involves three core entities: the victim, the suspect, and the case itself. The entity relationships of the victim include name, gender, ID card number, QQ number, bank card number, WeChat ID, registered household address, temporary residence, and so on; the entity relationships of the suspect include name, gender, ID card number, QQ number, bank card number, WeChat ID, registered household address, temporary residence, and so on; the entity relationships of the case include the time of the incident, the location of the incident, the amount involved, and so on. All relationships in the case can be extracted by slot filling: the slot templates in Table 1 are defined according to the characteristics of telecom fraud cases, where a slot template has the format R(e, a), R being the entity relationship, e the known entity, and a the unknown entity. The corresponding natural questions are then generated from the slot templates, and each slot template may correspond to one or more natural questions. The answers to the natural questions are subsequently extracted from the original case text by a natural language understanding method, which completes the extraction of all predefined entity relationships for the case.
TABLE 1 Slot templates and their corresponding Natural question sentences
[Table 1 is an image in the original publication and is not reproduced in this text.]
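As a minimal sketch of how slot templates of the form R(e, a) might be turned into natural questions: since the contents of Table 1 are not available here, the class name `SlotTemplate`, the function `to_question`, the example slots, and the English question wording below are all illustrative assumptions rather than the templates defined in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotTemplate:
    """Hypothetical slot template R(e, a); the unknown entity a is what gets extracted."""
    relation: str      # R: the entity relationship, e.g. "name"
    known_entity: str  # e: the known entity, e.g. "victim"


def to_question(slot: SlotTemplate) -> str:
    """Render one natural question for a slot (the wording is an assumption)."""
    return f"What is the {slot.relation} of the {slot.known_entity}?"


# Example slots for a telecom fraud case, following the relationships listed above.
SLOTS = [
    SlotTemplate("name", "victim"),
    SlotTemplate("bank card number", "victim"),
    SlotTemplate("address", "suspect"),
    SlotTemplate("amount involved", "case"),
]

# One question per slot; a slot could also map to several paraphrased questions.
questions = {slot: to_question(slot) for slot in SLOTS}
```

Supporting a new relationship then only requires adding one more `SlotTemplate` entry, which matches the idea that no new model is needed for a new relationship.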
S102: a vector representation of the natural question and the text to be extracted is obtained.
After the natural question of the unknown entity is determined based on step S101, the natural question and the text to be extracted may be processed to obtain a vector representation of the natural question and the text to be extracted.
Optionally, the natural question and the text to be extracted may be input into the word vector model together to obtain a vector representation of the natural question and the text to be extracted.
Furthermore, the natural question and the text to be extracted can be spliced, and the spliced text is input into the word vector model to obtain the word vector representation of the spliced text. The order of the natural question and the text to be extracted in the spliced text is not limited: the natural question may precede the text to be extracted, or the text to be extracted may precede the natural question. In addition, the natural question and the text to be extracted can be separated by a separator. For example, the spliced text may be represented as [CLS] natural question [SEP] text to be extracted [SEP].
If a plurality of unknown entities need to be extracted from the text to be extracted, the natural question for each unknown entity can be spliced with the text to be extracted to obtain a spliced text for each unknown entity. The spliced texts are then input into the word vector model one by one to determine the vector representation of each spliced text, and all unknown entities of each kind can be extracted from the text to be extracted based on step S103 and these vector representations. Because entity relationships are extracted by constructing relationship template questions, when several unknown entities in the text to be extracted satisfy the same entity relationship and the same known entity, they can all be extracted at once. For example, a victim in a single fraud case may have made several transfers, so that the case has several amounts involved; with the present application, all of the amounts involved can be identified from a single input.
For example, suppose two kinds of unknown entities, the victim's name and the suspect's address, need to be extracted from a telecom fraud case. The natural question for each unknown entity is spliced with the text to be extracted to obtain a spliced text for the victim's name ([CLS] What is the name of the victim? [SEP] text to be extracted [SEP]) and a spliced text for the suspect's address ([CLS] What is the address of the suspect? [SEP] text to be extracted [SEP]), and a vector representation of each spliced text is then obtained. All of the victims' names, for example Zhang San, Li Si and Wang Wu, are then extracted from the text to be extracted based on the vector representation of the first spliced text, and all of the suspects' addresses, for example XX district of XX city, are extracted based on the vector representation of the second spliced text.
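The splicing described above can be sketched at the string level as follows. The function names `splice` and `splice_all` and the sample case text are hypothetical; in practice a BERT tokenizer would normally insert the [CLS] and [SEP] tokens itself, so this only illustrates the layout of the spliced text.

```python
CLS, SEP = "[CLS]", "[SEP]"


def splice(question: str, text: str, question_first: bool = True) -> str:
    """Join a natural question and the text to be extracted with separators.

    The order is not fixed by the method, so it is exposed as a flag.
    """
    first, second = (question, text) if question_first else (text, question)
    return f"{CLS} {first} {SEP} {second} {SEP}"


def splice_all(questions: list[str], text: str) -> list[str]:
    """Build one spliced text per unknown entity; each is encoded one by one."""
    return [splice(q, text) for q in questions]


case_text = "Zhang San transferred 5000 yuan to the suspect."  # made-up sample text
spliced = splice_all(
    ["What is the name of the victim?", "What is the address of the suspect?"],
    case_text,
)
```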
The natural question and the text to be extracted may be processed with a word vector model such as TF-IDF, Word2Vec, or BERT.
Preferably, the BERT model is used to process the natural question and the text to be extracted. BERT is a general pre-trained model with good generalization and feature-capture capability, so for the same text to be extracted, the word vector representation of each word changes as the natural question changes. Consequently, when a relationship that was not previously defined needs to be extracted from the text, it is only necessary to add one natural question template, and the model can adapt to it well, so that the answer to the natural question (namely the unknown entity) can be extracted from the text to be extracted more accurately.
Furthermore, the natural question and the text to be extracted can first be input into the Word2Vec model to obtain their initial vector representations, so that the Word2Vec model captures the spatial semantic relationships between words. These initial vector representations are then input into the BERT model to obtain the vector representations output by the BERT model. The BERT model thus deeply encodes the natural question and the text to be extracted, capturing the deep semantic information of the text and the contextual semantic information of each word, so that the vector representations output by the BERT model better express the semantics of each word in the text to be extracted.
The Word2Vec model is trained on data samples from the same field as the text to be extracted, so that it contains a large amount of information about that field and can analyze texts in that field well.
In addition, BERT involves two steps: pre-training and fine-tuning. During pre-training, the model is trained on different pre-training tasks using unlabeled data, so that the BERT model contains a large amount of general knowledge. During fine-tuning, the model is initialized with the pre-trained parameters, and all parameters are then fine-tuned using labeled data from the specific downstream task, so that the BERT model learns knowledge of the same field as the text to be extracted and can learn more complex context-dependent embedded representations.
In addition, Word2Vec includes two important models, the CBOW model and the Skip-gram model. The CBOW model obtains word vectors by training to predict a target word from its context and is suitable for scenarios with a small text corpus; the Skip-gram model obtains word vectors by training to predict the surrounding words from a target word and performs well on large corpora. The model may be selected based on the field of the text to be extracted; for example, if the text to be extracted belongs to the telecom fraud field, the CBOW model may be adopted.
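To make the difference between the two training objectives concrete, the following sketch only generates the training pairs each model would learn from; it does not train any vectors. The function names and the window size are illustrative assumptions.

```python
def cbow_pairs(tokens: list[str], window: int = 2) -> list[tuple[list[str], str]]:
    """CBOW: (context words, target word); the context predicts the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Skip-gram: (target word, one neighbour); the target predicts each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["the", "victim", "transferred", "money"]
cbow = cbow_pairs(tokens, window=1)
skipgram = skipgram_pairs(tokens, window=1)
```

CBOW yields one training example per position, with the context words pooled, whereas Skip-gram yields one example per (target, neighbour) pair.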
Further, to facilitate steps S102 and S103, the text to be extracted may be preprocessed first: the main part of the text, that is, its detailed description, may be retained, and synonym replacement may be performed on the main words in the text so that words having the same meaning as a known entity are uniformly replaced with that known entity. For example, if the text to be extracted is the text of a telecom fraud case, the different terms that refer to the victim can be uniformly replaced with "victim", and the different terms that refer to the suspect can be uniformly replaced with "suspect".
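The synonym replacement step could look roughly like the sketch below. The synonym lists are invented English stand-ins (the original examples are Chinese terms that did not survive translation), and `normalize_entities` is a hypothetical helper name.

```python
import re

# Hypothetical synonym table: canonical known entity -> terms to replace.
SYNONYMS = {
    "victim": ["injured party", "aggrieved party"],
    "suspect": ["fraudster", "perpetrator"],
}


def normalize_entities(text: str, synonyms: dict[str, list[str]] = SYNONYMS) -> str:
    """Replace every synonym of a known entity with the known entity itself."""
    for canonical, variants in synonyms.items():
        # Replace longer variants first so a longer phrase is never partially replaced.
        for variant in sorted(variants, key=len, reverse=True):
            pattern = rf"\b{re.escape(variant)}\b"
            text = re.sub(pattern, canonical, text, flags=re.IGNORECASE)
    return text


cleaned = normalize_entities("The fraudster called the injured party.")
```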
S103: the answer to the natural question is determined based on the vector representation to obtain the unknown entity.
After the vector representation of the natural question and the text to be extracted is obtained based on step S102, the vector representation may be processed to determine the answer to the natural question, that is, the unknown entity is extracted from the text to be extracted.
In a first implementation, whether each word is an answer to the natural question may be determined based on the vector representation of each word in the text to be extracted. Specifically, the confidence that each word in the text to be extracted is an answer to the natural question is determined based on the vector representation of that word; whether each word is an answer to the natural question, that is, whether each word is part of the unknown entity, is then determined based on that confidence. The vector representations of the natural question and the text to be extracted can be input into a linear layer, and the confidence that each word in the text to be extracted is an answer to the natural question is obtained from the output of the linear layer. In addition, the output of the linear layer can be processed by a function such as sigmoid to obtain this confidence.
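A minimal sketch of the first implementation in plain Python, assuming toy two-dimensional word vectors and hand-picked weights in place of a trained linear layer. The 0.5 decision threshold is also an assumption here, since a threshold is only stated for the second implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear(vec: list[float], weights: list[float], bias: float) -> float:
    """A single-output linear layer: dot product plus bias."""
    return sum(v * w for v, w in zip(vec, weights)) + bias


def answer_words(words, vectors, weights, bias, threshold=0.5):
    """Keep each word whose confidence of being an answer exceeds the threshold."""
    picked = []
    for word, vec in zip(words, vectors):
        confidence = sigmoid(linear(vec, weights, bias))
        if confidence > threshold:
            picked.append((word, confidence))
    return picked


# Toy inputs: the vectors and weights are made up purely for illustration.
words = ["Zhang", "San", "transferred", "money"]
vectors = [[2.0, 0.0], [1.5, 0.0], [-1.0, 0.0], [-2.0, 0.0]]
result = answer_words(words, vectors, weights=[1.0, 0.0], bias=0.0)
```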
In a second implementation, the starting position and the ending position of the answer to the natural question in the text to be extracted are determined based on the vector representation of each word in the text to be extracted, and the answer is then extracted from the text based on the starting position and the ending position. Because the starting position and the ending position are determined separately, the two determinations can be performed in parallel, which improves the efficiency of answer determination. Moreover, for an unknown entity composed of several words, the first implementation extracts the whole unknown entity only if every word in it is determined to be an answer to the natural question; otherwise it extracts only some of the words. The second implementation, by contrast, attends only to the starting position and the ending position of the answer, so the answer can be extracted accurately as long as these two positions are determined accurately, and variation in the words in the middle of the unknown entity has little influence on the answer.
Specifically, the vector representations of the natural question and the text to be extracted can be input into a first linear layer, the confidence that each word in the text to be extracted is the starting word of an answer is calculated based on the output of the first linear layer, and whether the position of each word is the starting position of an answer is determined based on that confidence. The first linear layer thus only needs to attend to whether each word is the starting position of an answer and, unlike the linear layer in the first implementation, does not need to attend to more information; this makes it easier to train the first linear layer to convergence, and the trained first linear layer can judge more easily and accurately whether each word is the starting position of an answer. In addition, the output of the first linear layer can be processed by a function such as sigmoid to obtain the confidence that each word in the text to be extracted is the starting word of an answer. A word whose confidence of being the starting word exceeds a first threshold may be taken as a starting word of an answer; the first threshold is not limited and is set according to the actual situation, for example 0.5.
In addition, after the confidences of all words in the text to be extracted are determined, the confidence of each word can be normalized so that it can subsequently be confirmed whether each word is a starting word of an answer. For example, the confidence of a word whose starting-word confidence is greater than the first threshold may be set to 1, and the confidence of a word whose starting-word confidence is less than or equal to the first threshold may be set to 0; a word whose normalized confidence is 1 is a starting word of an answer.
Similarly, the vector representations of the natural question and the text to be extracted can be input into a second linear layer, the confidence that each word in the text to be extracted is the ending word of an answer is calculated based on the output of the second linear layer, and whether the position of each word is the ending position of an answer is determined based on that confidence. The second linear layer thus only needs to attend to whether each word is the ending position of an answer and, unlike the linear layer in the first implementation, does not need to attend to more information; this makes it easier to train the second linear layer to convergence, and the trained second linear layer can judge more easily and accurately whether each word is the ending position of an answer. In addition, the output of the second linear layer can be processed by a function such as sigmoid to obtain the confidence that each word in the text to be extracted is the ending word of an answer. A word whose confidence of being the ending word exceeds a second threshold may be taken as an ending word of an answer; the second threshold is not limited and is set according to the actual situation, for example 0.5.
In addition, after the confidences of all words in the text to be extracted are determined, the confidence of each word can be normalized so that it can subsequently be confirmed whether each word is an ending word of an answer. For example, the confidence of a word whose ending-word confidence is greater than the second threshold may be set to 1, and the confidence of a word whose ending-word confidence is less than or equal to the second threshold may be set to 0; a word whose normalized confidence is 1 is an ending word of an answer.
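The thresholding and normalization of start and end confidences can be sketched as below. The text does not say how a starting word is matched to an ending word when there are several answers, so pairing each start with the nearest end at or after it is an assumption of this sketch, as are the function names and the sample confidences.

```python
def binarize(confidences: list[float], threshold: float = 0.5) -> list[int]:
    """Normalize confidences: 1 if strictly above the threshold, else 0."""
    return [1 if c > threshold else 0 for c in confidences]


def extract_spans(words, start_conf, end_conf,
                  first_threshold=0.5, second_threshold=0.5):
    """Extract every answer span delimited by a starting word and an ending word."""
    starts = binarize(start_conf, first_threshold)
    ends = binarize(end_conf, second_threshold)
    spans = []
    for i, is_start in enumerate(starts):
        if not is_start:
            continue
        # Assumption: pair each start with the nearest end at or after it.
        for j in range(i, len(words)):
            if ends[j]:
                spans.append(" ".join(words[i:j + 1]))
                break
    return spans


# Made-up confidences for a case with two transfers, hence two amounts involved.
words = ["victim", "transferred", "5000", "yuan", "then", "3000", "yuan"]
start_conf = [0.1, 0.2, 0.9, 0.1, 0.0, 0.8, 0.1]
end_conf = [0.0, 0.1, 0.2, 0.95, 0.1, 0.3, 0.7]
amounts = extract_spans(words, start_conf, end_conf)
```

Because the start and end confidences come from two independent linear layers, the two lists can be computed in parallel before this pairing step runs.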
The sigmoid function is expressed as follows:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where σ(z) is the confidence of each word in the text to be extracted, with values in the range (0, 1); z is the value output for that word by the linear layer, the first linear layer, or the second linear layer, and may range over (−∞, +∞); and e is the base of the natural logarithm.
In this embodiment, the corresponding unknown entity is extracted from the text information directly, based on the natural question for the unknown entity. As long as the text information contains content expressing the entity relationship, stated in the natural question, between the known entity and the unknown entity, the unknown entity can be extracted from it. Text analysis therefore takes only one operation, with no relationship extraction performed after a separate entity extraction operation, which improves text analysis efficiency. The text does not need to be analyzed with two models, an entity recognition model and a relationship extraction model; a single model completes the extraction of entity relationships from the text to be extracted. Because relationship extraction no longer relies on a separate entity recognition model, erroneous entity recognition results are not propagated to the relationship extraction module, which improves the accuracy of relationship extraction.
In addition, after the unknown entity is extracted from the text to be extracted in step S103, it can be filled into the slot template formed by the known entity and the entity relationship to complete slot filling. In this way, a series of relationship slots is formulated according to the characteristics of the field of the text to be extracted, the head node (namely the known entity) and the relationship predicate (namely the entity relationship) of each slot are predefined, and entity and relationship extraction is converted into a slot-filling problem, which provides a new approach to extracting entity relationships from text information.
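As a rough illustration of the slot-filling step, the sketch below fills extracted answers into predefined slots to produce R(e, a) triples. The data layout, the function names `fill_slots` and `render`, and the sample answers are hypothetical.

```python
def fill_slots(slots, answers):
    """Fill extracted unknown entities into predefined (relation, known entity) slots.

    slots:   list of (relation, known_entity) pairs, i.e. predicate and head node
    answers: dict mapping a slot to the list of unknown entities extracted for it
    """
    triples = []
    for relation, known in slots:
        # A slot may receive several values, e.g. several amounts involved.
        for unknown in answers.get((relation, known), []):
            triples.append((relation, known, unknown))
    return triples


def render(triple) -> str:
    """Format a filled slot in the R(e, a) notation of the slot templates."""
    relation, known, unknown = triple
    return f"{relation}({known}, {unknown})"


slots = [("name", "victim"), ("amount involved", "case"), ("address", "suspect")]
answers = {
    ("name", "victim"): ["Zhang San"],
    ("amount involved", "case"): ["5000 yuan", "3000 yuan"],
}
filled = [render(t) for t in fill_slots(slots, answers)]
```

Slots with no extracted answer, such as the suspect's address here, are simply left unfilled.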
Referring to fig. 2, fig. 2 is a schematic structural diagram of an embodiment of an electronic device 20 according to the present application. The electronic device 20 includes a processor 22, and the processor 22 is configured to execute instructions to implement the method provided by any one of the above embodiments of the entity extraction method of the present application, or any non-conflicting combination thereof.
The electronic device 20 may be a terminal such as a mobile phone or a notebook computer, or may be a server.
The processor 22 may also be referred to as a CPU (Central Processing Unit). The processor 22 may be an integrated circuit chip having signal processing capabilities. The processor 22 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor 22 may be any conventional processor or the like.
The electronic device 20 may further include a memory 21 for storing instructions and data required for operation of the processor 22.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application. The computer-readable storage medium 30 of the embodiment of the present application stores instructions/program data 31 that, when executed, implement the method provided by any one of the above method embodiments of the present application, or any non-conflicting combination thereof. The instructions/program data 31 may form a program file stored in the storage medium 30 in the form of a software product, so as to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium 30 includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code, or a device such as a computer, a server, a mobile phone, or a tablet.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative. The division into units is merely a logical division, and an actual implementation may use another division; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above embodiments are merely examples and are not intended to limit the scope of the present disclosure. All equivalent structures or equivalent process transformations made using the contents of the specification and drawings of the present disclosure, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present disclosure.

Claims (10)

1. A method of entity extraction, the method comprising:
determining a natural question of an unknown entity, wherein the natural question is constructed by a known entity and an entity relationship, and the entity relationship is the relationship between the known entity and the unknown entity;
obtaining vector representation of the natural question sentence and the text to be extracted;
and determining an answer of the natural question sentence based on the vector representation, namely obtaining the unknown entity.
2. The entity extraction method according to claim 1, wherein the step of obtaining the vector representation of the natural question sentence and the text to be extracted comprises:
splicing the natural question sentence and the text to be extracted to obtain a spliced text;
and processing the spliced text to obtain the vector representation of the spliced text.
3. The entity extraction method according to claim 2, wherein the step of processing the spliced text to obtain the vector representation of the spliced text comprises:
determining the vector representation of the spliced text based on a BERT model.
4. The entity extraction method according to claim 3, wherein the step of determining the vector representation of the spliced text based on the BERT model comprises:
inputting the spliced text into a Word2vec model to obtain an initial vector representation of the spliced text output by the Word2vec model;
and inputting the initial vector representation of the spliced text into the BERT model to obtain the vector representation of the spliced text output by the BERT model.
5. The entity extraction method according to claim 1, wherein the step of determining an answer to the natural question based on the vector representation comprises:
determining a starting position and an ending position of the unknown entity in the text to be extracted based on the vector representation;
and extracting all the unknown entities from the text to be extracted based on the starting position and the ending position.
6. The entity extraction method according to claim 5, wherein the step of determining the starting position and the ending position of the unknown entity in the text to be extracted based on the vector representation comprises:
inputting the vector representation into a first linear layer, calculating the confidence of each word in the text to be extracted as the starting word of the unknown entity based on the output of the first linear layer, and determining whether the position of each word is the starting position of the unknown entity based on the confidence of each word as the starting word of the unknown entity;
inputting the vector representation into a second linear layer, calculating the confidence of each word in the text to be extracted as the end word of the unknown entity based on the output of the second linear layer, and determining whether the position of each word is the end position of the unknown entity based on the confidence of each word as the end word of the unknown entity.
7. The entity extraction method according to claim 1, wherein the step of determining a natural question sentence of an unknown entity comprises:
determining at least one slot template corresponding to the field of the text to be extracted;
constructing a natural question corresponding to each slot template based on the known entity and the entity relation in each slot template;
the step of obtaining the vector representation of the natural question sentence and the text to be extracted comprises the following steps: comprehensively processing the natural question sentence corresponding to each slot template and the text to be extracted to obtain the vector representation corresponding to each slot template;
the step of determining an answer to the natural question based on the vector representation, i.e. to the unknown entity, comprises: determining unknown entities in each slot template based on the vector representation corresponding to each slot template.
8. The entity extraction method according to claim 7, wherein the step of determining an answer to the natural question based on the vector representation, that is to say obtaining the unknown entity, is followed by:
and filling the unknown entity into a preset slot template, wherein the slot template is formed by the known entity and the entity relation.
9. An electronic device, characterized in that the electronic device comprises a processor; the processor is configured to execute instructions to implement the method of any one of claims 1-8.
10. A computer-readable storage medium, characterized in that a program file capable of implementing the method of any one of claims 1-8 is stored in the computer-readable storage medium.
CN202110632223.XA 2021-06-07 2021-06-07 Entity extraction method and device Pending CN113095083A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110632223.XA CN113095083A (en) 2021-06-07 2021-06-07 Entity extraction method and device

Publications (1)

Publication Number Publication Date
CN113095083A true CN113095083A (en) 2021-07-09

Family

ID=76666041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110632223.XA Pending CN113095083A (en) 2021-06-07 2021-06-07 Entity extraction method and device

Country Status (1)

Country Link
CN (1) CN113095083A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628174A (en) * 2023-02-17 2023-08-22 广东技术师范大学 End-to-end relation extraction method and system for fusing entity and relation information

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977428A (en) * 2019-03-29 2019-07-05 北京金山数字娱乐科技有限公司 A kind of method and device that answer obtains
CN110795543A (en) * 2019-09-03 2020-02-14 腾讯科技(深圳)有限公司 Unstructured data extraction method and device based on deep learning and storage medium
CN111160032A (en) * 2019-12-17 2020-05-15 浙江大华技术股份有限公司 Named entity extraction method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109522553B (en) Named entity identification method and device
CN111428504B (en) Event extraction method and device
CN111274797A (en) Intention recognition method, device and equipment for terminal and storage medium
CN112784066B (en) Knowledge graph-based information feedback method, device, terminal and storage medium
CN111339277A (en) Question-answer interaction method and device based on machine learning
CN112487827A (en) Question answering method, electronic equipment and storage device
KR20190072823A (en) Domain specific dialogue acts classification for customer counseling of banking services using rnn sentence embedding and elm algorithm
CN112417855A (en) Text intention recognition method and device and related equipment
CN111967264A (en) Named entity identification method
CN114218945A (en) Entity identification method, device, server and storage medium
CN113627194B (en) Information extraction method and device, and communication message classification method and device
CN113051384B (en) User portrait extraction method based on dialogue and related device
CN110969005B (en) Method and device for determining similarity between entity corpora
CN114281996A (en) Long text classification method, device, equipment and storage medium
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN113095083A (en) Entity extraction method and device
CN111783425B (en) Intention identification method based on syntactic analysis model and related device
CN113436614A (en) Speech recognition method, apparatus, device, system and storage medium
CN115952854B (en) Training method of text desensitization model, text desensitization method and application
US20230244878A1 (en) Extracting conversational relationships based on speaker prediction and trigger word prediction
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN114722821A (en) Text matching method and device, storage medium and electronic equipment
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN112149389A (en) Resume information structured processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210709