CN116757213A - Entity extraction method, device and storage medium - Google Patents

Entity extraction method, device and storage medium

Info

Publication number
CN116757213A
CN116757213A (application CN202310720860.1A)
Authority
CN
China
Prior art keywords
character
characters
target
character sequence
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310720860.1A
Other languages
Chinese (zh)
Inventor
曹菁
张旭
李扬
贾海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd filed Critical China United Network Communications Group Co Ltd
Priority to CN202310720860.1A
Publication of CN116757213A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The application provides a method, a device and a storage medium for entity extraction, which relate to the technical field of communication and can be used for entity extraction. The method comprises the following steps: acquiring a plurality of first character sequences; each first character sequence comprises n first characters and a plurality of label types of the first characters; n is a positive integer; sequentially inputting a plurality of first character sequences into a first training model, and determining the target label type of each first character from a plurality of label types of each first character; generating a second character sequence corresponding to each first character sequence based on the target label type of each first character; the second character sequence comprises n first characters and target label types corresponding to the first characters; inputting the second character sequence into a second training model, and extracting a first character of which the target label type meets a first preset condition in the second character sequence as a target extraction result. The application is used for entity extraction.

Description

Entity extraction method, device and storage medium
Technical Field
The present application relates to the field of communications technologies, and in particular, to a method and apparatus for entity extraction, and a storage medium.
Background
Currently, entity extraction methods are generally based on rules or on learning. The rule-based entity extraction method suffers from insufficient computing power, heavy reliance on customized templates, weak model generalization capability, and poor robustness. In the learning-based entity extraction method, shallow models have weak learning ability while deep models have high computing-power requirements. In addition, during entity extraction, the effect of training the entity extraction model is greatly affected by factors such as the purity and quantity of the data, which makes the entity extraction method difficult to implement in practical applications. Therefore, how to improve the accuracy of entity extraction is a technical problem to be solved.
Disclosure of Invention
The application provides a method, a device and a storage medium for entity extraction, which can perform entity extraction.
In order to achieve the above purpose, the application adopts the following technical scheme:
in a first aspect, the present application provides a method for entity extraction, the method comprising: acquiring a plurality of first character sequences; each first character sequence comprises n first characters and a plurality of label types of the first characters; n is a positive integer; sequentially inputting a plurality of first character sequences into a first training model, and determining the target label type of each first character from a plurality of label types of each first character; generating a second character sequence corresponding to each first character sequence based on the target label type of each first character; the second character sequence comprises n first characters and target label types corresponding to the first characters; inputting the second character sequence into a second training model, and extracting a first character of which the target label type meets a first preset condition in the second character sequence as a target extraction result.
With reference to the first aspect, in one possible implementation manner, the first training model includes: a first sub-model, a second sub-model, and a third sub-model. The first sub-model is used for converting the first character into a character embedding vector; the second sub-model is used for determining the probability value of each tag type corresponding to the first character according to the character embedding vector and the tag type; the third sub-model is used for determining the target tag type of the first character according to the probability value of the tag type.
With reference to the first aspect, in one possible implementation manner, sequentially inputting a plurality of first character sequences into the first training model, determining a target tag type of each first character from a plurality of tag types of each first character includes: inputting the first character sequence into the first submodel, and obtaining a character embedding vector corresponding to each first character in the first character sequence; generating a third character sequence according to the character embedding vector corresponding to each first character; the third character sequence comprises n character embedding vectors corresponding to the first characters and a plurality of label types corresponding to the first characters; inputting each first character in the third character sequence and a plurality of label types corresponding to the plurality of first characters into the second submodel according to a preset sequence, and obtaining probability values of the plurality of label types corresponding to each first character; and inputting the probability values of the plurality of label types into a third sub-model, and determining that the label type with the probability value meeting the second preset condition is the target label type corresponding to the first character.
With reference to the first aspect, in one possible implementation manner, after inputting probability values of a plurality of tag types into the third sub-model and determining that the tag type whose probability value meets the second preset condition is the target tag type corresponding to the first character, the method further includes: based on the target label type of the first character, sending first feedback information to the second sub-model; the first feedback information includes: the target label type and the probability value corresponding to the target label type; based on the first feedback information, updating the probability value of the target label type corresponding to each first character in the second sub-model.
With reference to the first aspect, in one possible implementation manner, inputting the second character sequence into the second training model, extracting, from the second character sequence, a first character whose target label type meets a preset condition as a target extraction result includes: updating the second character sequence according to a preset rule; the preset rule is used for checking the second character sequence; and extracting the first character of which the target label type meets the preset condition from the updated second character sequence as a target extraction result.
With reference to the first aspect, in one possible implementation manner, before acquiring the plurality of first character sequences, the method further includes: acquiring a plurality of fourth character sequences; the fourth character sequence comprises at least one second character; determining the number of second characters contained in each fourth character sequence; determining the number n of the first characters contained in the fifth character sequence based on the number of the second characters of the plurality of fourth character sequences; labeling a plurality of label types corresponding to the n first characters based on a preset labeling algorithm; a first character sequence is generated based on the n first characters and a plurality of tag types corresponding to the first characters.
In a second aspect, the present application provides an entity extraction apparatus, the apparatus comprising: a processing unit and a communication unit; the processing unit is used for acquiring a plurality of first character sequences; each first character sequence comprises n first characters and a plurality of label types of the first characters; n is a positive integer; the processing unit is further used for sequentially inputting a plurality of first character sequences into the first training model, and determining the target label type of each first character from a plurality of label types of each first character; the processing unit is further used for generating a second character sequence corresponding to each first character sequence based on the target label type of each first character; the second character sequence comprises n first characters and target label types corresponding to the first characters; the processing unit is further configured to input a second character sequence into the second training model, and extract, as a target extraction result, a first character in the second character sequence, where the target label type meets a first preset condition.
With reference to the second aspect, in one possible implementation manner, the first training model includes: a first sub-model, a second sub-model, and a third sub-model. The first sub-model is used for converting the first character into a character embedding vector; the second sub-model is used for determining the probability value of each tag type corresponding to the first character according to the character embedding vector and the tag type; the third sub-model is used for determining the target tag type of the first character according to the probability value of the tag type.
With reference to the second aspect, in one possible implementation manner, the processing unit is specifically configured to: inputting the first character sequence into the first submodel, and obtaining a character embedding vector corresponding to each first character in the first character sequence; generating a third character sequence according to the character embedding vector corresponding to each first character; the third character sequence comprises n character embedding vectors corresponding to the first characters and a plurality of label types corresponding to the first characters; inputting each first character in the third character sequence and a plurality of label types corresponding to the plurality of first characters into the second submodel according to a preset sequence, and obtaining probability values of the plurality of label types corresponding to each first character; and inputting the probability values of the plurality of label types into a third sub-model, and determining that the label type with the probability value meeting the second preset condition is the target label type corresponding to the first character.
With reference to the second aspect, in a possible implementation manner, the processing unit is further configured to: based on the target label type of the first character, indicating the communication unit to send first feedback information to the second sub-model; the first feedback information includes: the target label type and the probability value corresponding to the target label type; based on the first feedback information, updating the probability value of the target label type corresponding to each first character in the second sub-model.
With reference to the second aspect, in a possible implementation manner, the processing unit is further specifically configured to: updating the second character sequence according to a preset rule; the preset rule is used for checking the second character sequence; and extracting the first character of which the target label type meets the preset condition from the updated second character sequence as a target extraction result.
With reference to the second aspect, in a possible implementation manner, the processing unit is further configured to: acquiring a plurality of fourth character sequences; the fourth character sequence comprises at least one second character; determining the number of second characters contained in each fourth character sequence; determining the number n of the first characters contained in the fifth character sequence based on the number of the second characters of the plurality of fourth character sequences; labeling a plurality of label types corresponding to the n first characters based on a preset labeling algorithm; a first character sequence is generated based on the n first characters and a plurality of tag types corresponding to the first characters.
In a third aspect, the present application provides an entity extraction apparatus, comprising: a processor and a memory; wherein the memory is configured to store computer-executable instructions that, when executed by the entity extraction device, cause the entity extraction device to perform the entity extraction method as described in any one of the possible implementations of the first aspect and the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium having instructions stored therein which, when executed by a processor of an entity extraction apparatus, enable the entity extraction apparatus to perform the entity extraction method as described in any one of the possible implementations of the first aspect and the first aspect.
In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on an entity extraction device, cause the entity extraction device to perform the entity extraction method as described in any one of the possible implementations of the first aspect and the first aspect.
In a sixth aspect, the present application provides a chip comprising a processor and a communication interface, the communication interface and the processor being coupled, the processor being for running a computer program or instructions to implement the entity extraction method as described in any one of the possible implementations of the first aspect and the first aspect.
Specifically, the chip provided in the embodiment of the application further includes a memory, which is used for storing a computer program or instructions.
In the present application, the names of the entity extraction means are not limited to the devices or function modules themselves, and in actual implementation, these devices or function modules may appear under other names. Insofar as the function of each device or function module is similar to that of the present application, it falls within the scope of the claims of the present application and the equivalents thereof.
These and other aspects of the application will be more readily apparent from the following description.
The technical scheme provided by the application has at least the following beneficial effects: the entity extraction device obtains a plurality of first character sequences comprising n first characters and a plurality of tag types corresponding to the first characters. The entity extraction device sequentially inputs a plurality of first character sequences into the first training model, and determines the target label type of each first character from a plurality of label types of each first character. The entity extraction device generates a second character sequence corresponding to the first character sequence according to the target label type of each first character. The second character sequence comprises n first characters and target label types corresponding to the first characters. The entity extraction device inputs the second character sequence into the second training model, and extracts a first character of which the target label type in the second character sequence meets a first preset condition as a target extraction result. In this way, the entity extraction device further determines whether the target label type of each character in the character sequence meets the preset condition or not after determining the target label type of the character, thereby determining the character of which the target label type meets the preset condition as the target extraction result and improving the accuracy of the entity extraction result.
Drawings
FIG. 1 is a schematic diagram of an entity extraction system according to an embodiment of the present application;
fig. 2 is a schematic hardware structure of an entity extraction device according to an embodiment of the present application;
FIG. 3 is a flow chart of a method for entity extraction according to an embodiment of the present application;
FIG. 4 is a flow chart of a method for entity extraction according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a first training model of an entity extraction device according to an embodiment of the present application;
FIG. 6 is a flow chart of a method for entity extraction according to an embodiment of the present application;
FIG. 7 is a flow chart of a method for entity extraction according to an embodiment of the present application;
FIG. 8 is a flow chart of a method for entity extraction according to an embodiment of the present application;
fig. 9 is a schematic diagram of an entity extraction device according to an embodiment of the present application.
Detailed Description
The entity extraction method, device and storage medium provided by the embodiment of the application are described in detail below with reference to the accompanying drawings.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone.
The terms "first" and "second" and the like in the description and in the drawings are used for distinguishing between different objects or between different processes of the same object and not for describing a particular order of objects.
Furthermore, references to the terms "comprising" and "having" and any variations thereof in the description of the present application are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed but may optionally include other steps or elements not listed or inherent to such process, method, article, or apparatus.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "such as" should not be construed as being preferred over, or more advantageous than, other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
With the development of the Internet, in order to efficiently process and utilize data such as image, language, speech, and video information on the Internet and to improve the reliability of human-computer interaction, natural language processing research methods have been proposed in the related art. These methods enable a computer to recognize and learn the languages used in daily human communication, enriching the forms of human-computer interaction and improving its reliability. Entity extraction belongs to the field of natural language processing and is used to extract key elements from text files to form core information. For example, an entity extraction model extracts information such as "person name", "organization name", "time", and "place" from a text file based on rules and a dictionary.
In the related art, entity extraction methods generally extract entities based on rules or on learning. The rule-based entity extraction method marks target information through linguistic templates so that the main content of a text can be extracted quickly, for example by locating special punctuation marks, using demonstrative pronouns, collocating typical parts of speech, and making rational use of position words, head words, and the like. Learning-based entity extraction models generally include shallow models such as the support vector machine (Support Vector Machine, SVM), the hidden Markov model (Hidden Markov Model, HMM), and the conditional random field (Conditional Random Field, CRF), as well as deep models such as deep neural networks (Deep Neural Networks, DNN), long short-term memory (Long Short-Term Memory, LSTM) networks, and gated recurrent units (Gated Recurrent Unit, GRU). The learning-based entity extraction method inputs text data with specific labels into a machine learning model and determines the position information of the subject to be extracted, the contextual dependencies of the text, the internal relations of keywords, and the like by learning text patterns, thereby achieving accurate positioning and capture for entity extraction.
However, the rule-based entity extraction method suffers from insufficient computing power, heavy reliance on customized templates, weak model generalization capability, and poor robustness. In the learning-based entity extraction method, shallow models have weak learning ability while deep models have high computing-power requirements. In addition, during entity extraction, the effect of training the entity extraction model is greatly affected by factors such as the purity and quantity of the data, which makes the entity extraction method difficult to implement in practical applications. Meanwhile, the accuracy and quantity of the text data affect the accuracy of entity extraction. Therefore, how to improve the accuracy of entity extraction is a technical problem to be solved.
In order to solve the above technical problems, the present application provides a method for extracting entities, in which an entity extracting device obtains a plurality of first character sequences including n first characters and a plurality of tag types corresponding to the first characters. The entity extraction device sequentially inputs a plurality of first character sequences into the first training model, and determines the target label type of each first character from a plurality of label types of each first character. The entity extraction device generates a second character sequence corresponding to the first character sequence according to the target label type of each first character. The second character sequence comprises n first characters and target label types corresponding to the first characters. The entity extraction device inputs the second character sequence into the second training model, and extracts a first character of which the target label type in the second character sequence meets a first preset condition as a target extraction result. In this way, the entity extraction device further determines whether the target label type of each character in the character sequence meets the preset condition or not after determining the target label type of the character, thereby determining the character of which the target label type meets the preset condition as the target extraction result and improving the accuracy of the entity extraction result.
The present application provides an entity extraction method, which can be applied to an entity extraction system 10 as shown in fig. 1. The system 10 comprises: a text preprocessing module 101, a model training module 102, and a text post-processing module 103. The model training module 102 comprises: a Transformer-based bidirectional encoder (Bidirectional Encoder Representations from Transformers, BERT) model 1011, a bidirectional long short-term memory (Bi-directional Long Short-Term Memory, Bi-LSTM) model 1012, and a CRF model 1013. Specifically, the text preprocessing module is used for cutting the input text into single characters; the BERT model is used for converting the single characters generated by the text preprocessing module into character embedding vectors and inputting the character embedding vectors into the Bi-LSTM model bidirectionally; the Bi-LSTM model extracts the forward and backward dependency relationships of the bidirectionally input character embedding vectors through its gating units and determines the target state of each character embedding vector. Further, the Bi-LSTM model inputs the target state of each character embedding vector into the CRF model; the CRF model is used for further confirming the target state of each character embedding vector and determining its final state; and the text post-processing module is used for further optimizing the output result of the CRF model.
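As an illustrative, non-limiting sketch (not part of the original disclosure), the pipeline described above could be assembled roughly as follows, assuming PyTorch, the HuggingFace transformers library, and the third-party pytorch-crf package; all class and variable names are hypothetical:

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf


class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.bilstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # BERT turns each character into a contextual embedding vector.
        embeddings = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # The Bi-LSTM captures forward/backward dependencies between characters.
        lstm_out, _ = self.bilstm(embeddings)
        # A linear layer produces one score (emission) per tag for each character.
        emissions = self.emission(lstm_out)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: return the negative log-likelihood under the CRF.
            return -self.crf(emissions, tags, mask=mask)
        # Inference: Viterbi-decode the most likely tag sequence.
        return self.crf.decode(emissions, mask=mask)
```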
Fig. 2 is a schematic structural diagram of an entity extraction device according to an embodiment of the present application, which can be applied to the entity extraction system 10 shown in fig. 1, where the entity extraction device 200 includes at least one processor 201, a communication line 202, at least one communication interface 204, and a memory 203. The processor 201, the memory 203, and the communication interface 204 may be connected through a communication line 202.
The processor 201 may be a central processing unit (central processing unit, CPU), an application specific integrated circuit (application specific integrated circuit, ASIC), or one or more integrated circuits configured to implement embodiments of the present application, such as: one or more digital signal processors (digital signal processor, DSP), or one or more field programmable gate arrays (field programmable gate array, FPGA).
Communication line 202 may include a path for communicating information between the above-described components.
The communication interface 204, for communicating with other devices or communication networks, may use any transceiver-like device, such as ethernet, radio access network (radio access network, RAN), wireless local area network (wireless local area networks, WLAN), etc.
The memory 203 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (electrically erasable programmable read-only memory, EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
In a possible design, the memory 203 may exist separately from the processor 201, i.e. the memory 203 may be a memory external to the processor 201; in this case, the memory 203 may be connected to the processor 201 through the communication line 202 and used for storing execution instructions or application program code, the execution of which is controlled by the processor 201 to implement the entity extraction method provided by the embodiments of the present application. In yet another possible design, the memory 203 may be integrated with the processor 201, i.e., the memory 203 may be an internal memory of the processor 201; for example, the memory 203 may be a cache used to temporarily store some data, instruction information, and the like.
As one possible implementation, processor 201 may include one or more CPUs, such as CPU0 and CPU1 in fig. 2. As another implementation, entity extraction device 200 may include multiple processors, such as processor 201 and processor 207 in fig. 2. As yet another implementation, entity extraction apparatus 200 may further include an output device 205 and an input device 206.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of functional modules is illustrated, and in practical application, the above-described functional allocation may be implemented by different functional modules according to needs, i.e. the internal structure of the network node is divided into different functional modules to implement all or part of the functions described above. The specific working processes of the above-described system, module and network node may refer to the corresponding processes in the foregoing method embodiments, which are not described herein.
Fig. 3 is a schematic diagram of an entity extraction method according to an embodiment of the present application, and as shown in fig. 3, the entity extraction method according to the embodiment of the present application may be implemented by the following steps 301 to 304.
In step 301, the entity extraction device obtains a plurality of first character sequences.
Each first character sequence comprises n first characters and a plurality of label types of the first characters, wherein n is a positive integer.
In one possible implementation, the entity extraction device obtains a plurality of first character sequences including n characters. Wherein each character corresponds to a plurality of tag types.
For example, the first character sequences are "A₁A₂A₃…Aₙ", "B₁B₂B₃…Bₙ", and so on. Taking character A₁ as an example, the label types corresponding to A₁ are B-X, I-X, and O, where B indicates that character A₁ is located at the beginning of an entity in the first character sequence, I indicates that character A₁ is located in the middle of an entity, O indicates that character A₁ does not belong to any type, and X indicates the type of character A₁ in the first character sequence.
Alternatively, the label types corresponding to character A₁ may also be B-X, I-X, O, E-X, and S, where B indicates that character A₁ is located at the beginning of an entity in the first character sequence, I indicates that character A₁ is located in the middle of an entity, O indicates that character A₁ does not belong to any type, E indicates that character A₁ is located at the end of an entity, S indicates that character A₁ is a single character that is itself an entity, and X indicates the type of character A₁ in the first character sequence.
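For illustration only, the two labeling schemes described above (commonly known as BIO and BIOES) could be written out as follows; the example sentence and entity assignment are hypothetical:

```python
# Hypothetical illustration of the two tag inventories described above,
# for a single entity type X.
BIO_TAGS = ["B-X", "I-X", "O"]                   # begin / inside / outside
BIOES_TAGS = ["B-X", "I-X", "O", "E-X", "S"]     # + end / single-character entity

# Example: tagging the hypothetical character sequence "北京市天气",
# assuming "北京市" is an entity of type X.
chars        = ["北",  "京",  "市",  "天", "气"]
bio_labels   = ["B-X", "I-X", "I-X", "O",  "O"]
bioes_labels = ["B-X", "I-X", "E-X", "O",  "O"]
```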
Step 302, the entity extraction device sequentially inputs a plurality of first character sequences into the first training model, and determines the target label type of each first character from a plurality of label types of each first character.
In a possible implementation manner, the entity extraction device divides each first character sequence carrying the label types into n characters, converts the n characters into vector form, and inputs the vectors into the first training model for training, thereby determining the target label type of each first character from the plurality of label types of that character.
For example, the first character sequence "A₁A₂A₃…Aₙ" is split into n individual characters "A₁", "A₂", "A₃", …, "Aₙ"; each single character is converted into vector form and input into the first training model for training, and the target label type of each first character is determined from the plurality of label types of that character.
Step 303, the entity extraction device generates a second character sequence corresponding to each first character sequence based on the target tag type of each first character.
The second character sequence comprises n first characters and target label types corresponding to the first characters.
In a possible implementation manner, after determining the target tag type of each first character through the first training model, the entity extraction device generates a second character sequence corresponding to the first character sequence based on the target tag type of each first character. The second character sequence also comprises n first characters and target label types corresponding to the first characters.
Step 304, the entity extraction device inputs the second character sequence into the second training model, and extracts the first character of the second character sequence, the target label type of which meets the first preset condition, as the target extraction result.
In a possible implementation manner, the entity extraction device inputs the second character sequence into the second training model, optimizes the result of the second character sequence based on the preset rules of the second training model, and extracts a first character whose target label type in the second character sequence meets the first preset condition as the target extraction result.
For example, in the second character sequence, the target label type of A₁ is "B-label₁" and the target label type of A₄ is "I-label₁"; the first characters whose target label types meet the first preset condition are therefore "A₁" and "A₄", that is, the target extraction result of the second character sequence is "A₁" and "A₄".
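A minimal sketch of this extraction step, assuming the first preset condition is simply that the target label belongs to a chosen entity type (the function and names are hypothetical, not from the patent):

```python
def extract_target_characters(chars, tags, entity_type="label1"):
    """Return the first characters whose target label is B-/I- of the given entity type."""
    wanted = {f"B-{entity_type}", f"I-{entity_type}"}
    return [ch for ch, tag in zip(chars, tags) if tag in wanted]

# e.g. chars = ["A1", "A2", "A3", "A4"], tags = ["B-label1", "O", "O", "I-label1"]
# -> ["A1", "A4"], matching the example above.
```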
The scheme at least brings the following beneficial effects: the entity extraction device obtains a plurality of first character sequences comprising n first characters and a plurality of tag types corresponding to the first characters. The entity extraction device sequentially inputs a plurality of first character sequences into the first training model, and determines the target label type of each first character from a plurality of label types of each first character. The entity extraction device generates a second character sequence corresponding to the first character sequence according to the target label type of each first character. The second character sequence comprises n first characters and target label types corresponding to the first characters. The entity extraction device inputs the second character sequence into the second training model, and extracts a first character of which the target label type in the second character sequence meets a first preset condition as a target extraction result. In this way, the entity extraction device further determines whether the target label type of each character in the character sequence meets the preset condition or not after determining the target label type of the character, thereby determining the character of which the target label type meets the preset condition as the target extraction result and improving the accuracy of the entity extraction result.
Referring to fig. 3, as shown in fig. 4, the step 302, that is, the entity extraction device sequentially inputs a plurality of first character sequences into the first training model, and determines the target label type of each first character from a plurality of label types of each first character, which may be specifically implemented by the following steps 401 to 404:
in step 401, the entity extraction device inputs the first character sequence into the first submodel, and obtains a character embedding vector corresponding to each first character in the first character sequence.
In a possible implementation, the entity extraction device cuts the first character sequence into n characters and converts the n characters into word-vector form for input into the first sub-model. The entity extraction device then converts the n character vectors into embedding vectors and outputs them through the self-attention encoder.
For example, as shown in FIG. 1, the first character sequence "A₁A₂…Aₚ…Aₙ" is cut into individual characters "A₁", "A₂", …, "Aₚ", …, "Aₙ", and the single characters are converted into word-vector form, e.g. "E₁", "E₂", …, "Eₚ", …, "Eₙ". Through the self-attention encoder, the first sub-model calculates the correlation weights between the vectors, adjusts the word-vector weights, and extracts the character features of the first character sequence. Based on the first sub-model, the entity extraction device obtains the character embedding vector corresponding to each first character in the first character sequence, namely "T₁", "T₂", …, "Tₚ", …, "Tₙ".
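A hedged sketch of this first-sub-model step using the HuggingFace transformers API; the model name and example text are assumptions, not taken from the patent:

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

chars = list("示例文本")                              # stands in for A1, A2, ..., An
inputs = tokenizer(chars, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)                         # self-attention over all characters
char_embeddings = outputs.last_hidden_state          # the T1 ... Tn vectors
# shape: (1, n + 2, 768) for bert-base, including the added [CLS]/[SEP] positions
```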
Step 402, the entity extraction device generates a third character sequence according to the character embedding vector corresponding to each first character.
The third character sequence comprises n character embedding vectors corresponding to the first characters and a plurality of label types corresponding to the first characters.
In one example, the entity extraction device generates the third character sequence based on the character embedding vectors generated by the first sub-model, for example "T₁T₂…TₚTₙ", where the label types corresponding to T₁ are "B-label₁", "I-label₁", and O; the label types corresponding to T₂ are "B-label₁", "I-label₁", and O; the label types corresponding to Tₚ are "B-label₂", "I-label₂", and O; and the label types corresponding to Tₙ are "B-label₃", "I-label₃", and O.
Step 403, the entity extraction device inputs each first character in the third character sequence and a plurality of label types corresponding to the plurality of first characters into the second submodel according to a preset sequence, and obtains probability values of the plurality of label types corresponding to each first character.
In a possible implementation manner, the entity extraction device sequentially inputs each first character in the third character sequence into the second sub-model in forward order and in reverse order, and obtains the probability values of the plurality of label types corresponding to each first character.
For example, based on the third character sequence "T₁T₂…TₚTₙ", the third character sequence is input into the second sub-model sequentially in forward order and in reverse order. That is, the inputs of the second sub-model are "T₁", "T₂", …, "Tₚ", "Tₙ" and "Tₙ", "Tₚ", …, "T₂", "T₁", respectively.
Further, for the third character sequence, the second sub-model obtains the forward and backward dependency relationships of the character embedding vectors through the forget gate, input gate, and output gate units, and determines the probability values of the plurality of label types corresponding to each character. As shown in FIG. 5, the probabilities of the label types corresponding to "T₁" are "B" = 0.9, "I" = 0.2, and "O" = 1.8; the probabilities of the label types corresponding to "T₂" are "B" = 1.5, "I" = 0.2, and "O" = 1.1; the probabilities of the label types corresponding to "Tₚ" are "B" = 0.5, "I" = 0.8, and "O" = 0.2; and the probabilities of the label types corresponding to "Tₙ" are "B" = 0.9, "I" = 0.1, and "O" = 1.4.
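A minimal sketch of this second-sub-model step; the dimensions and tag count are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_tags, hidden = 3, 256                        # e.g. tags B / I / O
bilstm = nn.LSTM(768, hidden, batch_first=True, bidirectional=True)
to_tag_scores = nn.Linear(2 * hidden, num_tags)

char_embeddings = torch.randn(1, 512, 768)       # stand-in for the BERT output T1..Tn
lstm_out, _ = bilstm(char_embeddings)            # concatenated forward + backward states
tag_scores = to_tag_scores(lstm_out)             # one score per tag per character, (1, 512, 3)
```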
And 404, the entity extraction device inputs the probability values of the tag types into the third sub-model, and determines that the tag type with the probability value meeting the second preset condition is the target tag type corresponding to the first character.
In a possible implementation manner, the entity extraction device inputs the probability values of the plurality of label types corresponding to each character, as determined by the second sub-model, into the third sub-model. Based on the input probability values predicted by the second sub-model, the state features of the first characters, and the like, the third sub-model determines the label type whose probability value meets the condition as the target label type.
For example, as shown in FIG. 5, the third sub-model determines that the target label type corresponding to "T₁" is "O", the target label type corresponding to "T₂" is "B", …, the target label type corresponding to "Tₚ" is "B", and the target label type corresponding to "Tₙ" is "O".
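A sketch of this third-sub-model step, assuming the third-party pytorch-crf package is used to combine the Bi-LSTM scores with learned tag-transition features and Viterbi-decode the best tag for each character:

```python
import torch
from torchcrf import CRF

crf = CRF(num_tags=3, batch_first=True)          # tags: B / I / O (illustrative)
tag_scores = torch.randn(1, 512, 3)              # emission scores from the Bi-LSTM
best_tag_ids = crf.decode(tag_scores)            # e.g. [[2, 0, ..., 0, 2]] -> O, B, ..., B, O
```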
The scheme at least brings the following beneficial effects: the entity extraction device inputs the first character sequence into the first sub-model and obtains the character embedding vector corresponding to each first character in the first character sequence. The entity extraction device generates a third character sequence based on the character embedding vectors. The entity extraction device inputs each first character in the third character sequence and the plurality of label types corresponding to the first characters into the second sub-model in forward and reverse order, and obtains the probability values of the plurality of label types corresponding to each first character. The entity extraction device inputs the probability values of the plurality of label types predicted by the second sub-model into the third sub-model, and the third sub-model determines, according to the predicted probability values and features such as the state of the first characters, the label type whose probability value meets the condition as the target label type. In this way, the entity extraction device trains the first character sequence from which entities are to be extracted step by step through the first training model to determine the target label type of each first character, thereby improving the accuracy of entity extraction.
Referring to fig. 4, as shown in fig. 6, in step 404, the entity extraction device inputs probability values of a plurality of tag types into the third sub-model, and determines that the tag type whose probability value satisfies the second preset condition is the target tag type corresponding to the first character. Thereafter, the following steps 601-602 are included:
in step 601, the entity extraction device sends first feedback information to the second sub-model based on the target tag type of the first character.
Wherein the first feedback information includes: the target tag type and the probability value corresponding to the target tag type.
In one possible implementation manner, the entity extraction device generates the first feedback information based on the target tag type determined by the third sub-model and the probability value of the target tag type. And the entity extraction device sends the first feedback information to a second sub-model for updating and iterating parameters of the second sub-model.
Step 602, the entity extraction device updates a probability value of the target tag type corresponding to each first character in the second sub-model based on the first feedback information.
In a possible implementation manner, the entity extraction device sends the first feedback information to the second sub-model. After receiving the first feedback information, the second sub-model updates the probability value of the target label type corresponding to the first character. Meanwhile, the second sub-model updates and iterates its own parameters based on the backpropagation-through-time algorithm.
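A hedged sketch of this feedback and update step, reusing the hypothetical model, input_ids, attention_mask, and gold_tags objects from the earlier pipeline sketch (illustration only):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss = model(input_ids, attention_mask, tags=gold_tags)   # CRF negative log-likelihood
optimizer.zero_grad()
loss.backward()   # gradients flow back through CRF -> Bi-LSTM -> BERT (BPTT inside the LSTM)
optimizer.step()
```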
The scheme at least brings the following beneficial effects: the entity extraction device generates first feedback information based on the target tag type of the first character. And the entity extraction device sends the first feedback information to the second sub-model to update the parameters of the entity extraction device. In this way, the accuracy of the training of each sub-model in the first training model is improved.
Referring to fig. 3, as shown in fig. 7, in step 304, the entity extraction device inputs the second character sequence into the second training model, and extracts the first character of the second character sequence, where the target label type satisfies the first preset condition, as the target extraction result. Specifically, the method can be realized through the following steps 701-702:
step 701, the entity extraction device updates the second character sequence according to a preset rule; the preset rules are used to check the second character sequence.
In one possible implementation, the entity extraction device checks the second character sequence according to the preset rules and updates the second character sequence based on the check result.
For example, the preset rules include, but are not limited to, the following. Rule 1: the entity extraction device checks whether meaningless special characters exist in the second character sequence, and locates and deletes them through a regular expression. Rule 2: the entity extraction device checks whether mixed Chinese and English usage or inconsistent upper- and lower-case English characters appear in the second character sequence. Rule 3: the entity extraction device checks whether characters with incomplete content exist in the second character sequence, for example, a case where only one bracket of a pair appears in the second character sequence. Rule 4: the entity extraction device checks whether non-standard characters exist in the second character sequence and, in that case, corrects and updates them based on a knowledge base.
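An illustrative sketch of such checks; the concrete regular expressions are assumptions, and the knowledge-base correction of rule 4 is omitted:

```python
import re

def check_and_update(text: str) -> str:
    # Rule 1: locate and delete meaningless special characters via a regular expression.
    text = re.sub(r"[★▲◆]+", "", text)
    # Rule 2: make English characters uniform in case (one possible remediation).
    text = re.sub(r"[A-Za-z]+", lambda m: m.group(0).lower(), text)
    # Rule 3: handle incomplete content such as a single unmatched bracket.
    if text.count("(") != text.count(")"):
        text = text.replace("(", "").replace(")", "")
    return text
```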
Step 702, the entity extraction device extracts a first character, whose target tag type satisfies a preset condition, in the updated second character sequence as a target extraction result.
In one possible implementation manner, the entity extraction device extracts, from the updated second character sequence, a first character whose target tag type meets a preset condition as a target extraction result.
The scheme at least brings the following beneficial effects: the entity extraction device checks the second character sequence according to the preset rules, determines whether each character in the second character sequence and its corresponding target label meet the preset rules, and updates the characters that do not. Further, the entity extraction device extracts the first characters whose target label types meet the preset condition in the updated second character sequence as the target extraction result. In this way, the accuracy of the entity extraction result is improved by further checking and optimizing the second character sequence.
Referring to fig. 3, as shown in fig. 8, the above step 301, that is, before the entity extraction device obtains the plurality of first character sequences, further includes the following steps 801 to 805:
step 801, the entity extraction device obtains a plurality of fourth character sequences.
Wherein the fourth character sequence comprises at least one second character.
In one possible implementation, the entity extraction device obtains a plurality of initial character sequences from which entities are to be extracted.
In one example, the entity extraction device obtains a plurality of initial texts from which entities are to be extracted, for example work-order texts from customer complaints, where each work-order text contains 50 to 1000 characters.
Step 802, the entity extraction device determines the number of second characters contained in each fourth character sequence.
In a possible implementation, the entity extraction means obtains the number of characters contained in each fourth character sequence.
In one example, the plurality of fourth character sequences includes the following characters: 500. 862, 52, 64, etc.
It should be noted that the number of characters in the fourth character sequence is the number of characters in the actual text from which entities are to be extracted; the numbers used in the present application are only for illustration and are not limiting.
In step 803, the entity extraction device determines the number n of the first characters included in the fifth character sequence based on the number of the second characters of the plurality of fourth character sequences.
In a possible implementation manner, the entity extraction device divides the plurality of fourth character sequences into fifth character sequences with the same number of characters according to the number of characters contained in the plurality of acquired fourth character sequences.
In one example, the plurality of fourth character sequences are divided into fifth character sequences of 512 characters in length.
Note that when the length of a fourth character sequence is less than 512, the missing characters are padded with "0".
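A minimal sketch of this length-normalization step, assuming "0" is used as a padding placeholder character (function and names are hypothetical):

```python
def to_fixed_length(chars, length=512, pad="0"):
    """Split a character list into chunks of `length`, padding the last chunk with `pad`."""
    chunks = [chars[i:i + length] for i in range(0, len(chars), length)] or [[]]
    return [chunk + [pad] * (length - len(chunk)) for chunk in chunks]

# e.g. to_fixed_length(list("abc"), length=5) -> [["a", "b", "c", "0", "0"]]
```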
In step 804, the entity extraction device marks a plurality of label types corresponding to the n first characters based on a preset marking algorithm.
In a possible implementation manner, the entity extraction device marks each character in the fifth character sequence by using a marking algorithm. Each character corresponds to a plurality of tag types.
For example, the entity extraction device may label each character using a "BIO" labeling algorithm or a "BIOES" labeling algorithm, or the like.
In step 805, the entity extraction device generates a first character sequence based on the n first characters and a plurality of tag types corresponding to the first characters.
In one possible implementation manner, the entity extraction device uses the n marked characters as a first character sequence to generate an initial entity extraction training sequence.
The scheme at least brings the following beneficial effects: the entity extraction device generates text sequences with a uniform format and character count from the initial texts through segmentation and integration. The entity extraction device then labels the uniformly formatted text sequences and generates sequence data that is convenient for entity extraction training. In this way, the usability of the text data from which entities are to be extracted is improved.
The entity extraction device according to the embodiment of the present application and the functions of the respective devices of the entity extraction device are described in detail above.
It can be seen that the technical solution provided by the embodiment of the present application is mainly described from the method perspective. To achieve the above functions, it includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application can divide the functional modules of the entity extraction device according to the method example, for example, each functional module can be divided corresponding to each function, or two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. Optionally, the division of the modules in the embodiment of the present application is schematic, which is merely a logic function division, and other division manners may be implemented in practice.
The embodiment of the application provides an entity extraction device, which is used for executing the method to be executed by any device in the entity extraction system. The entity extraction device may be the entity extraction device involved in the present application, or a module in the entity extraction device, or a chip in the entity extraction device, or another device for executing the entity extraction method, which is not limited in the present application.
Fig. 9 is a schematic structural diagram of an entity extraction device according to an embodiment of the present application. The entity extraction device comprises: a processing unit 901 and a communication unit 902.
A processing unit 901, configured to obtain a plurality of first character sequences; each first character sequence comprises n first characters and a plurality of label types of the first characters; n is a positive integer; the processing unit 901 is further configured to sequentially input a plurality of first character sequences into the first training model, and determine a target tag type of each first character from a plurality of tag types of each first character; the processing unit 901 is further configured to generate a second character sequence corresponding to each first character sequence based on the target tag type of each first character; the second character sequence comprises n first characters and target label types corresponding to the first characters; the processing unit 901 is further configured to input a second character sequence into the second training model, and extract, as a target extraction result, a first character in the second character sequence, where the target label type meets a first preset condition.
Optionally, the first training model includes: a first sub-model, a second sub-model, and a third sub-model. The first sub-model is used for converting the first character into a character embedding vector; the second sub-model is used for determining the probability value of each tag type corresponding to the first character according to the character embedding vector and the tag type; the third sub-model is used for determining the target tag type of the first character according to the probability value of the tag type.
Optionally, the processing unit 901 is specifically configured to: inputting the first character sequence into the first sub-model, and obtaining a character embedding vector corresponding to each first character in the first character sequence; generating a third character sequence according to the character embedding vector corresponding to each first character; the third character sequence comprises n character embedding vectors corresponding to the first characters and a plurality of label types corresponding to the first characters; inputting each first character in the third character sequence and the plurality of label types corresponding to the first characters into the second sub-model according to a preset sequence, and obtaining probability values of the plurality of label types corresponding to each first character; and inputting the probability values of the plurality of label types into the third sub-model, and determining the label type whose probability value meets a second preset condition as the target label type corresponding to the first character.
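For example, assuming that the second preset condition is simply "the highest probability value" (an assumption; the embodiment does not define the condition), the selection performed by the third sub-model for a single first character can be illustrated as follows, with invented label types and probability values:

```python
# Illustrative selection step of the third sub-model, assuming the second
# preset condition is "highest probability value"; labels and values invented.
label_probs = {"B-ORG": 0.72, "I-ORG": 0.05, "B-PER": 0.08, "O": 0.15}

target_label = max(label_probs, key=label_probs.get)   # -> "B-ORG"
print(target_label, label_probs[target_label])
```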
Optionally, the processing unit 901 is further configured to: based on the target tag type of the first character, instruct the communication unit 902 to send first feedback information to the second sub-model; the first feedback information includes: the target label type and the probability value corresponding to the target label type; based on the first feedback information, updating the probability value of the target label type corresponding to each first character in the second sub-model.
Optionally, the processing unit 901 is further specifically configured to: updating the second character sequence according to a preset rule; the preset rule is used for checking the second character sequence; and extracting, from the updated second character sequence, the first character whose target label type meets the first preset condition as a target extraction result.
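As one hedged example of such a preset rule, assuming BIO-style label types (which this embodiment does not mandate), an "I-" label that does not continue a matching entity could be corrected before extraction:

```python
# Hedged example: a BIO consistency check as the "preset rule", followed by
# extraction of the characters whose target label type is an entity label.
def check_and_extract(second_sequence):
    """second_sequence: list of (first_character, target_label_type) pairs."""
    fixed = []
    prev_type = None
    for char, label in second_sequence:
        if label.startswith("I-") and prev_type != label[2:]:
            label = "B-" + label[2:]          # orphan "I-" promoted to "B-"
        prev_type = label[2:] if label != "O" else None
        fixed.append((char, label))

    # Extraction: keep the characters whose label meets the assumed condition.
    return [(char, label) for char, label in fixed if label != "O"]
```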
Optionally, the processing unit 901 is further configured to: acquiring a plurality of fourth character sequences; the fourth character sequence comprises at least one second character; determining the number of second characters contained in each fourth character sequence; determining the number n of the first characters contained in the fifth character sequence based on the number of the second characters of the plurality of fourth character sequences; labeling a plurality of label types corresponding to the n first characters based on a preset labeling algorithm; a first character sequence is generated based on the n first characters and a plurality of tag types corresponding to the first characters.
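The embodiment does not state how n is derived from the numbers of second characters, nor which labeling algorithm is preset; the sketch below assumes that n is the maximum observed length and uses a caller-supplied `bio_labeler` as a stand-in for the preset labeling algorithm (both are assumptions):

```python
# Assumed preprocessing sketch: n taken as the longest raw sequence, sequences
# padded to n, and a stand-in labeler playing the "preset labeling algorithm".
PAD = "[PAD]"

def build_first_sequences(fourth_sequences, bio_labeler):
    """fourth_sequences: list of raw character lists; bio_labeler maps a
    character list to the candidate label types of each character."""
    n = max(len(seq) for seq in fourth_sequences)   # number n of first characters
    first_sequences = []
    for seq in fourth_sequences:
        chars = list(seq) + [PAD] * (n - len(seq))  # pad to n first characters
        labels = bio_labeler(chars)                 # candidate label types per character
        first_sequences.append((chars, labels))
    return first_sequences
```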
The embodiment of the present application also provides a computer-readable storage medium, wherein the computer-readable storage medium stores instructions that, when executed by a computer, cause the computer to perform each step in the method flow shown in the above method embodiment.
Embodiments of the present application provide a computer program product comprising instructions which, when executed on a computer, cause the computer to perform the entity extraction method of the above method embodiments.
Embodiments of the present application provide a chip comprising a processor and a communication interface, the communication interface and the processor being coupled, the processor being configured to execute a computer program or instructions to implement the entity extraction method as in the method embodiments described above.
The computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), an erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), a register, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any other form of computer-readable storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC). In the embodiments of the present application, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Since the apparatus, the device, the computer-readable storage medium, and the computer program product in the embodiments of the present application can be applied to the above method, for the technical effects that they can obtain, reference may also be made to the above method embodiments; details are not described herein again.
The foregoing is merely illustrative of specific embodiments of the present application, and the scope of the present application is not limited thereto, but any changes or substitutions within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (10)

1. A method of entity extraction, the method comprising:
acquiring a plurality of first character sequences; each first character sequence comprises n first characters and a plurality of label types of the first characters; n is a positive integer;
sequentially inputting the plurality of first character sequences into a first training model, and determining the target label type of each first character from a plurality of label types of each first character;
generating a second character sequence corresponding to each first character sequence based on the target label type of each first character; the second character sequence comprises n first characters and target label types corresponding to the first characters;
inputting the second character sequence into a second training model, and extracting the first character of which the target label type meets a first preset condition in the second character sequence as a target extraction result.
2. The method of claim 1, wherein the first training model comprises: a first sub-model, a second sub-model, and a third sub-model;
the first sub-model is used for converting the first character into a character embedding vector;
the second sub-model is used for determining a probability value of each label type corresponding to the first character according to the character embedding vector and the label type;
the third sub-model is used for determining the target label type of the first character according to the probability value of the label type.
3. The method of claim 2, wherein the sequentially inputting the plurality of first character sequences into a first training model, and determining a target label type of each first character from a plurality of label types of each first character, comprises:
inputting the first character sequence into the first sub-model, and obtaining the character embedding vector corresponding to each first character in the first character sequence;
generating a third character sequence according to the character embedding vector corresponding to each first character; the third character sequence comprises n character embedding vectors corresponding to the first characters and a plurality of label types corresponding to the first characters;
inputting each first character in the third character sequence and the plurality of label types corresponding to the first characters into the second sub-model according to a preset sequence, and obtaining probability values of the plurality of label types corresponding to each first character;
and inputting the probability values of the plurality of label types into the third sub-model, and determining the label type whose probability value meets a second preset condition as the target label type corresponding to the first character.
4. The method of claim 3, wherein after the inputting the probability values of the plurality of label types into the third sub-model and determining the label type whose probability value meets the second preset condition as the target label type corresponding to the first character, the method further comprises:
transmitting first feedback information to the second sub-model based on the target label type of the first character; the first feedback information includes: the target label type and the probability value corresponding to the target label type;
and updating the probability value of the target label type corresponding to each first character in the second sub-model based on the first feedback information.
5. The method according to claim 1, wherein the inputting the second character sequence into a second training model, and extracting the first character of which the target label type meets a preset condition in the second character sequence as a target extraction result, comprises:
updating the second character sequence according to a preset rule; the preset rule is used for checking the second character sequence;
and extracting the first character of which the target label type meets a preset condition in the updated second character sequence as a target extraction result.
6. The method of claim 1, wherein before the acquiring the plurality of first character sequences, the method further comprises:
acquiring a plurality of fourth character sequences; the fourth character sequence comprises at least one second character;
determining the number of the second characters contained in each fourth character sequence;
determining the number n of the first characters contained in a fifth character sequence based on the numbers of the second characters of the plurality of fourth character sequences;
labeling a plurality of label types corresponding to the n first characters based on a preset labeling algorithm;
generating the first character sequence based on n first characters and a plurality of label types corresponding to the first characters.
7. An entity extraction device, the device comprising: a processing unit and a communication unit;
the processing unit is used for acquiring a plurality of first character sequences; each first character sequence comprises n first characters and a plurality of label types of the first characters; n is a positive integer;
the processing unit is further configured to sequentially input the plurality of first character sequences into a first training model, and determine a target label type of each first character from a plurality of label types of each first character;
the processing unit is further configured to generate a second character sequence corresponding to each first character sequence based on the target tag type of each first character; the second character sequence comprises n first characters and target label types corresponding to the first characters;
the processing unit is further configured to input the second character sequence into a second training model, and extract, as a target extraction result, the first character in the second character sequence whose target label type meets a first preset condition.
8. The apparatus according to claim 7, wherein the processing unit is specifically configured to:
inputting the first character sequence into the first sub-model, and obtaining the character embedding vector corresponding to each first character in the first character sequence;
generating a third character sequence according to the character embedding vector corresponding to each first character; the third character sequence comprises n character embedding vectors corresponding to the first characters and a plurality of label types corresponding to the first characters;
inputting each first character in the third character sequence and the plurality of label types corresponding to the first characters into the second sub-model according to a preset sequence, and obtaining probability values of the plurality of label types corresponding to each first character;
and inputting the probability values of the plurality of label types into the third sub-model, and determining the label type whose probability value meets a second preset condition as the target label type corresponding to the first character.
9. An entity extraction device, comprising: a processor and a communication interface; the communication interface is coupled to the processor, and the processor is configured to run a computer program or instructions to implement the entity extraction method according to any one of claims 1-6.
10. A computer-readable storage medium having instructions stored therein, wherein the instructions, when executed by a computer, cause the computer to perform the entity extraction method according to any one of claims 1-6.
CN202310720860.1A 2023-06-16 2023-06-16 Entity extraction method, device and storage medium Pending CN116757213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310720860.1A CN116757213A (en) 2023-06-16 2023-06-16 Entity extraction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310720860.1A CN116757213A (en) 2023-06-16 2023-06-16 Entity extraction method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116757213A true CN116757213A (en) 2023-09-15

Family

ID=87960356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310720860.1A Pending CN116757213A (en) 2023-06-16 2023-06-16 Entity extraction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116757213A (en)

Similar Documents

Publication Publication Date Title
JP5901001B1 (en) Method and device for acoustic language model training
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
US10262062B2 (en) Natural language system question classifier, semantic representations, and logical form templates
KR102316063B1 (en) Method and apparatus for identifying key phrase in audio data, device and medium
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
US10019439B2 (en) Temporal translation grammar for language translation
US9454962B2 (en) Sentence simplification for spoken language understanding
CN110737768B (en) Text abstract automatic generation method and device based on deep learning and storage medium
WO2021139229A1 (en) Text rhetorical sentence generation method, apparatus and device, and readable storage medium
CN102439540A (en) Input method editor
CN114840327B (en) Multi-mode multi-task processing method, device and system
CN112328761B (en) Method and device for setting intention label, computer equipment and storage medium
Veiga et al. Generating a pronunciation dictionary for European Portuguese using a joint-sequence model with embedded stress assignment
JPWO2007138875A1 (en) Word dictionary / language model creation system, method, program, and speech recognition system for speech recognition
WO2018057427A1 (en) Syntactic re-ranking of potential transcriptions during automatic speech recognition
EP3614297A1 (en) Hybrid natural language understanding
CN111368037A (en) Text similarity calculation method and device based on Bert model
CN111508502A (en) Transcription correction using multi-tag constructs
CN112084769A (en) Dependency syntax model optimization method, device, equipment and readable storage medium
CN112579733A (en) Rule matching method, rule matching device, storage medium and electronic equipment
CN112528654A (en) Natural language processing method and device and electronic equipment
CN116628186A (en) Text abstract generation method and system
KR102608867B1 (en) Method for industry text increment, apparatus thereof, and computer program stored in medium
CN112633007B (en) Semantic understanding model construction method and device and semantic understanding method and device
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination