CN109299458B - Entity identification method, device, equipment and storage medium - Google Patents

Entity identification method, device, equipment and storage medium

Info

Publication number
CN109299458B
CN109299458B (application CN201811061626.8A)
Authority
CN
China
Prior art keywords
lstm
entity recognition
entity
probability
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811061626.8A
Other languages
Chinese (zh)
Other versions
CN109299458A (en)
Inventor
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Duoyi Network Co ltd
GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD
Guangzhou Duoyi Network Co ltd
Original Assignee
Duoyi Network Co ltd
GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD
Guangzhou Duoyi Network Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Duoyi Network Co ltd, GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD, Guangzhou Duoyi Network Co ltd filed Critical Duoyi Network Co ltd
Priority to CN201811061626.8A priority Critical patent/CN109299458B/en
Publication of CN109299458A publication Critical patent/CN109299458A/en
Application granted granted Critical
Publication of CN109299458B publication Critical patent/CN109299458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses an entity recognition method, which comprises: obtaining a trained LSTM-based entity recognition model, where the LSTM-based entity recognition model is trained using a labeled training corpus; inputting a text to be recognized into the trained LSTM-based entity recognition model and obtaining, for each character in the text, the probability that it belongs to each label; and inputting the probabilities into a CRF model to obtain the label of each character. An LSTM network depends heavily on data, and the size and quality of the training data affect the model training result. By combining the LSTM model with the CRF model, the LSTM model solves the problem of extracting sequence features while the CRF model makes effective use of sentence-level label information. The LSTM + CRF model improves the execution efficiency of a dialogue system, performs entity recognition and word segmentation simultaneously, and improves the accuracy and efficiency of entity recognition.

Description

Entity identification method, device, equipment and storage medium
Technical Field
The present invention relates to the field of information technology, and in particular, to a method, an apparatus, a device, and a storage medium for entity identification.
Background
In the field of artificial intelligence, attempts to mimic human conversational ability date back to the early days of the field. In recent years, messaging applications have grown rapidly: WeChat in China, and WhatsApp and Facebook Messenger abroad, occupy almost all of users' fragmented time. With hundreds of millions of active users, they have effectively become the browser-like entry point of the mobile-internet era, since a user can obtain most information with a single application. As the traffic dividend from downloading new mobile applications gradually disappears, the advantages of dialogue systems stand out: development cost is low, and a dialogue system can be attached to an existing software platform.
In a dialog system, the entity words in a sentence input by a user often need to be recognized (entity recognition), and the sentence needs to be segmented into words for subsequent analysis. In existing dialogue systems, however, entity recognition and word segmentation are handled as two separate tasks.
When implementing the invention, the inventor found the following defect in prior-art entity recognition: entity recognition identifies entity words, such as person names, place names and organization names, at the sentence level. It is closely related to word segmentation, and performing the two tasks in isolation reduces the accuracy of both entity word recognition and word segmentation. Take the sentence "Nanjing city Changjiang River Bridge" as an example: if the entity word "Changjiang River Bridge" is not recognized, the sentence is likely to be cut as "Nanjing / city Chang / river bridge"; conversely, if the entity word "Changjiang River Bridge" is taken into account, the sentence is correctly segmented as "Nanjing city / Changjiang River Bridge".
Disclosure of Invention
In view of this, embodiments of the present invention provide an entity recognition method, apparatus, device and storage medium, which can improve the execution efficiency of a dialog system and improve the accuracy of entity recognition and word segmentation by combining entity recognition with word segmentation tasks.
In a first aspect, an embodiment of the present invention provides an entity identification method, including the following steps:
acquiring an LSTM-based entity recognition model after training is finished, wherein the LSTM-based entity recognition model is trained by using labeled training corpora;
inputting a text to be recognized into the trained entity recognition model based on the LSTM, and acquiring the probability that each character in the text to be recognized belongs to a label;
and inputting the probability into a CRF model to obtain the mark of each character.
In a first possible implementation manner of the first aspect, the obtaining a trained LSTM-based entity recognition model, where the LSTM-based entity recognition model is trained using labeled corpus, includes:
acquiring the labeled training corpus;
converting the words and characters in the labeled training corpus into vectors;
and inputting the vectors of the words and the characters into the LSTM-based entity recognition model, and training parameters in the LSTM-based entity recognition model by using a back propagation method to obtain the LSTM-based entity recognition model after training.
With reference to the first aspect and the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the obtaining the labeled corpus includes:
and labeling the training corpus in IB (Inside, Begin) mode to obtain the labeled training corpus.
In a third possible implementation manner of the first aspect, the inputting the text to be entity-recognized into the trained LSTM-based entity recognition model, and obtaining the probability that each character in the text to be entity-recognized belongs to a label includes:
and sequentially inputting the characters of the text to be recognized by the entity into the trained entity recognition model based on the LSTM, and acquiring the probability that each character in the text to be recognized by the entity belongs to the label.
In a fourth possible implementation manner of the first aspect, the inputting the probability into the CRF model to obtain the label of each character includes:
inputting the probability into a prediction formula, and solving the maximum value of the prediction formula to obtain the optimal output label sequence, wherein the prediction formula is

    s(X, y) = \sum_{i=1}^{n} p_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}

wherein y is the label sequence to be predicted for the text to be recognized, y = (y_1, y_2, …, y_n); X = (p_{i, y_i}) collects the probability of each label for each character in the text, where p_{i, y_i} is the probability that the i-th character is marked with the y_i-th label; and A_{y_i, y_{i+1}} is the probability of transitioning from the y_i-th label to the y_{i+1}-th label;
and labeling according to the optimal output label sequence to further obtain the label of each character.
In a second aspect, an embodiment of the present invention further provides an entity identification apparatus, including:
the entity recognition model acquisition module is used for acquiring a trained entity recognition model based on the LSTM, wherein the entity recognition model based on the LSTM is trained by using the labeled training corpus;
the probability acquisition module is used for inputting the text to be recognized into the entity recognition model based on the LSTM after the training is finished, and acquiring the probability that each character in the text to be recognized belongs to the label;
and the mark acquisition module is used for inputting the probability into a CRF model to obtain the mark of each character.
In a first possible implementation manner of the second aspect, the entity identification model obtaining module includes:
acquiring the labeled training corpus;
converting the words and characters in the labeled training corpus into vectors;
and inputting the vectors of the words and the characters into the LSTM-based entity recognition model, and training parameters in the LSTM-based entity recognition model by using a back propagation method to obtain the LSTM-based entity recognition model after training.
In a second possible implementation manner of the second aspect, the tag obtaining module includes:
inputting the probability into a prediction formula, and solving the maximum value of the prediction formula to obtain the optimal output label sequence, wherein the prediction formula is

    s(X, y) = \sum_{i=1}^{n} p_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}

wherein y is the label sequence to be predicted for the text to be recognized, y = (y_1, y_2, …, y_n); X = (p_{i, y_i}) collects the probability of each label for each character in the text, where p_{i, y_i} is the probability that the i-th character is marked with the y_i-th label; and A_{y_i, y_{i+1}} is the probability of transitioning from the y_i-th label to the y_{i+1}-th label;
and labeling according to the optimal output label sequence to further obtain the label of each character.
In a third aspect, an embodiment of the present invention further provides an entity identification device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the entity identification method described above when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where when the computer program runs, the apparatus where the computer-readable storage medium is located is controlled to execute the entity identification method described above.
The embodiment of the invention has the following beneficial effects:
An LSTM-based entity recognition model is obtained after training, the model having been trained using a labeled training corpus; a text to be recognized is input into the trained LSTM-based entity recognition model, and the probability that each character in the text belongs to each label is obtained; the probabilities are input into a CRF model to obtain the label of each character. By combining the LSTM-based entity recognition model with the CRF model, entity recognition and word segmentation can be carried out simultaneously, reducing the time consumed by model prediction; and because word segmentation uses the information of the entity words obtained by entity recognition, segmentation accuracy and efficiency are improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
fig. 1 is a schematic diagram of an entity identification device according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating an entity identification method according to a second embodiment of the present invention;
fig. 3 is a schematic diagram illustrating the result of LSTM entity identification provided in the second embodiment of the present invention;
FIG. 4 is a diagram illustrating the result of LSTM + CRF entity identification provided in the second embodiment of the present invention;
FIG. 5 is a diagram illustrating an entity identification display result according to a second embodiment of the present invention;
fig. 6 is a schematic structural diagram of an entity identifying device according to a third embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As will be appreciated by one skilled in the art, the present invention may be embodied as an apparatus, method or computer program product. Accordingly, the present disclosure may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.) or in a combination of hardware and software, which may be referred to herein generally as a "circuit," "module" or "system." Furthermore, in some embodiments, the invention may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Example one
Referring to fig. 1, fig. 1 is a schematic diagram of an entity identification device according to an embodiment of the present invention, configured to execute an entity identification method according to an embodiment of the present invention. As shown in fig. 1, the entity identification device includes: at least one processor 11, such as a CPU, at least one network interface 14 or other user interface 13, a memory 15, and at least one communication bus 12, the communication bus 12 being used to enable connection and communication between these components. The user interface 13 may optionally include a USB interface, a wired interface, and other standard interfaces. The network interface 14 may optionally include a Wi-Fi interface as well as other wireless interfaces. The memory 15 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, such as at least one disk memory. The memory 15 may optionally comprise at least one memory device located remotely from the aforementioned processor 11.
In some embodiments, memory 15 stores the following elements, executable modules or data structures, or a subset thereof, or an expanded set thereof:
an operating system 151, which contains various system programs for implementing various basic services and for processing hardware-based tasks;
and a program 152.
Specifically, the processor 11 is configured to call the program 152 stored in the memory 15 to execute the entity identification method according to the following embodiments.
The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor, or the processor may be any conventional processor; the processor is the control center of the entity identification device, with various interfaces and lines connecting the various parts of the overall device.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the entity identification device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or a text conversion function), and the data storage area may store data created according to the use of the device (such as audio data or text message data). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid state storage device.
If the integrated entity recognition module is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments can be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
A method of entity identification of an embodiment of the present invention will be described below with reference to the accompanying drawings.
Example two
Fig. 2 is a flowchart illustrating an entity identification method according to a second embodiment of the present invention.
An entity identification method, comprising the steps of:
s11, obtaining an LSTM-based entity recognition model after training is completed, wherein the LSTM-based entity recognition model is trained by using labeled training corpora;
s12, inputting a text to be recognized into the entity recognition model based on the LSTM after the training is finished, and acquiring the probability that each character in the text to be recognized belongs to a label;
and S13, inputting the probability into a CRF model to obtain the mark of each character.
In the embodiment of the invention, in order to improve the accuracy and efficiency of entity recognition, the LSTM model and the CRF model are combined, so that entity recognition and word segmentation can be performed simultaneously.
Preferably, the obtaining of the trained entity recognition model based on LSTM includes:
acquiring the labeled training corpus;
converting the words and characters in the labeled training corpus into vectors;
and inputting the vectors of the words and the characters into the LSTM-based entity recognition model, and training parameters in the LSTM-based entity recognition model by using a back propagation method to obtain the LSTM-based entity recognition model after training.
Further, the acquiring the labeled corpus includes:
and labeling the training corpus in IB (Inside, Begin) mode to obtain the labeled training corpus.
In the embodiment of the present invention, a labeled training corpus is first obtained. Labeling is a manual process, and the corpus is labeled according to the IB (Inside, Begin) scheme (or labeled in other ways, for example with 0, 1, 2 instead). Begin: the first character of an entity word is marked B with the current suffix appended. Inside: a character that belongs to an entity word but is not its first character is marked I with the current suffix appended. The suffixes are as follows: the suffix for a person name is P, the suffix for an organization name is C, and the suffix for a place name is L. If an entity recognition unit is the start of an entity, it is marked (tag B-…); if an entity recognition unit is in the middle of an entity, it is marked (tag I-…). Taking the most common entity types, person names (PER), location names (LOC) and organization names (ORG), as an example, each character of each sentence in the corpus is labeled. For example, the sentence "Ma Huateng is the CEO of Tencent" can be labeled as follows: "Ma" (马) is marked B-P; "Hua" (化) is marked I-P; "Teng" (腾) is marked I-P; "is" (是) is marked B; the "Teng" (腾) of Tencent is marked B-C; "Xun" (讯) is marked I-C; "of" (的) is marked B; "C" is marked B; "E" is marked I; "O" is marked I.
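As an illustrative sketch only (not part of the patent text), the IB labeling scheme described above can be expressed in a few lines of Python; the segmentation and entity suffixes are supplied by hand here, mirroring the example sentence:

```python
def ib_label(words):
    """IB (Inside, Begin) labeling as described above.

    words: list of (word, suffix) pairs, where suffix is 'P' (person),
    'C' (organization) or 'L' (place) for entity words, and None for
    ordinary words. The first character of each word is tagged B
    (B-<suffix> for entities); every later character is tagged I (I-<suffix>).
    """
    tagged = []
    for word, suffix in words:
        for i, ch in enumerate(word):
            base = "B" if i == 0 else "I"
            tagged.append((ch, f"{base}-{suffix}" if suffix else base))
    return tagged

# "Ma Huateng is the CEO of Tencent" with a hand-supplied segmentation
sentence = [("马化腾", "P"), ("是", None), ("腾讯", "C"), ("的", None), ("CEO", None)]
for ch, tag in ib_label(sentence):
    print(ch, tag)   # 马 B-P, 化 I-P, 腾 I-P, 是 B, 腾 B-C, 讯 I-C, 的 B, C B, E I, O I
```

This reproduces exactly the labels listed in the example above; a real corpus would carry such labels for every sentence.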
In the embodiment of the present invention, the words and characters in the labeled training corpus are converted into vectors, because a computer can only operate on numeric values: the input word x is of character type and cannot be computed on directly, so vector conversion is required. The converted vectors are called word vectors, also known as word embeddings. First, a vocabulary of all words that need to be predicted and trained is obtained by statistics. If the vocabulary size is k, each word in the vocabulary is given a unique id in the range 0 to k-1, and a matrix of size [k, dim] is randomly initialized, where dim is a preset threshold; the id corresponding to each character is then looked up to obtain the corresponding word vector. When constructing word vectors (word embeddings), the first step in processing a text corpus with a mathematical model is to convert the text into a mathematical representation, for which there are two methods. The first method represents a word with a one-hot matrix, that is, a matrix in which each row has exactly one element equal to 1 and all other elements equal to 0. Each word in the dictionary is assigned a number, and when a sentence is encoded, each word is converted into a one-hot row whose 1 sits at the position of that word's number in the dictionary. For example, the sentence "I love chips" can be expressed as a matrix with one row per word, each row one-hot at the corresponding word id.

A word embedding matrix can also be used, which assigns each word a vector representation of fixed length; the length can be chosen freely, for example 300, and in practice is much smaller than the dictionary length (for example 10000). The angle between two word vectors can then be used as a measure of their relationship, for example via the cosine

    cos θ = (a · b) / (|a| |b|)
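A minimal NumPy sketch of the two representations described above, using a toy vocabulary (the four-word vocabulary, dim = 4 and the random seed are illustrative choices, not values from the patent):

```python
import numpy as np

VOCAB = ["I", "love", "chips", "china"]   # toy dictionary; size k
k = len(VOCAB)

def one_hot(word):
    """One-hot row: 1 at the word's id, 0 elsewhere."""
    v = np.zeros(k)
    v[VOCAB.index(word)] = 1.0
    return v

# Randomly initialized embedding matrix of size [k, dim]; in the text
# dim is a preset threshold such as 300, shrunk to 4 here for brevity.
dim = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(k, dim))

def embed(word):
    """Look up the word's unique id and return the matching row of E."""
    return E[VOCAB.index(word)]

def cosine(a, b):
    """Angle between two word vectors as a measure of their relationship."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sentence = np.stack([one_hot(w) for w in ["I", "love", "chips"]])  # 3 x k matrix
```

In training, the rows of E are updated by back propagation along with the other model parameters.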
In the embodiment of the present invention, the vectors of the words and characters are input into the LSTM-based entity recognition model, and the parameters of the model are trained with a back propagation method to obtain the trained LSTM-based entity recognition model. The LSTM-based entity recognition model computes, at each time step t:

    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
    i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
    \tilde{c}_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ \tilde{c}_t
    o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
    h_t = o_t ⊙ tanh(c_t)

wherein σ is an element-wise sigmoid operation, ⊙ denotes the element-wise (pointwise) product, x_t is the input, and h_t is the output; all W, h, c and b in the formulas are initialized randomly, and inputting the corresponding vectors into the formulas yields the corresponding probabilities. For example, "I love China" is input character by character into the first-layer LSTM units of the LSTM-based entity recognition model, with the output of the i-th LSTM unit of the first layer also serving as the input of the (i+1)-th LSTM unit of the first layer; each LSTM unit then outputs the probability that its character belongs to each label.
In this embodiment, after the probability that each character belongs to each label is obtained, the parameters of the LSTM-based entity recognition model are trained with a back propagation method to obtain the trained LSTM-based entity recognition model. Back propagation updates the LSTM parameters from the LSTM output using the chain rule of derivation: the derivative of a composite function, obtained by nesting one function inside another, equals the derivative of the outer function evaluated at the inner function, multiplied by the derivative of the inner function. For example, if f(x) = x² and g(x) = 2x + 1, then {f[g(x)]}' = 2[g(x)] · g'(x) = 2(2x + 1) × 2 = 8x + 4. The parameters in the computation formulas of the LSTM-based entity recognition model are updated accordingly.
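The LSTM computation described above can be sketched as a single NumPy time step (the weight shapes and random initialization are illustrative; a real model would add the label-probability output layer, batching and the back propagation update):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above.

    Each W[g] maps the concatenation [h_{t-1}, x_t] to one gate's
    pre-activation; '*' is the element-wise (pointwise) product.
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat        # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # output
    return h_t, c_t

dim_x, dim_h = 4, 3                          # toy sizes
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(dim_h, dim_h + dim_x)) for g in "fico"}
b = {g: np.zeros(dim_h) for g in "fico"}

h, c = np.zeros(dim_h), np.zeros(dim_h)
for _ in range(3):                           # feed three character vectors in turn
    h, c = lstm_step(rng.normal(size=dim_x), h, c, W, b)
```

Feeding the characters of a sentence one per step, as here, is exactly the sequential reading the model performs at prediction time.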
Preferably, after obtaining the trained LSTM-based entity recognition model, the inputting the text to be entity recognized into the trained LSTM-based entity recognition model, and obtaining the probability that each character in the text to be entity recognized belongs to the label includes:
and sequentially inputting the characters of the text to be recognized into the entity recognition model based on the LSTM after the training is finished, and acquiring the probability that each character in the text to be recognized belongs to a label.
In this embodiment, according to the above calculation formula of the LSTM-based entity recognition model:
$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i)\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f)\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
the LSTM-based entity recognition model reads in one character of the text to be entity-recognized at each step, and the probability that the character belongs to each IOB mark is obtained by calculation in the LSTM-based entity recognition model. Referring to fig. 3, for the sentence "Ma Huateng is the CEO of Tencent", after a character is input at each step, the probability that the character corresponds to each label is obtained. For example, the character "Ma" has a probability of 0.5 of belonging to label B, a probability of 0.9 of belonging to label B-P, a probability of 0.8 of belonging to label B-L, a probability of 0.2 of belonging to label B-C, a probability of 0.4 of belonging to label I, a probability of 0.5 of belonging to label I-P, a probability of 0.1 of belonging to label I-L, and a probability of 0.5 of belonging to label I-C.
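A minimal sketch of this per-character reading, assuming a placeholder hidden state in place of a trained LSTM (the tag set follows the example above; the projection weights and probability values are random stand-ins, not the figures in fig. 3):

```python
import numpy as np

TAGS = ["B", "B-P", "B-L", "B-C", "I", "I-P", "I-L", "I-C"]  # tag set from the example

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
H = 16
W_out = rng.normal(size=(len(TAGS), H))  # hidden-state -> tag-score projection (illustrative)

sentence = "马化腾是腾讯的CEO"           # "Ma Huateng is the CEO of Tencent"
probs = []
for ch in sentence:                       # one character read in per step
    h = np.tanh(rng.normal(size=H))       # stand-in for the LSTM hidden state at this step
    probs.append(softmax(W_out @ h))      # probability of ch belonging to each tag
probs = np.vstack(probs)                  # shape: (num_chars, num_tags)
print(probs.shape)                        # (10, 8)
```

Each row of `probs` plays the role of the per-character label probabilities that are later fed to the CRF model.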
Preferably, the inputting the probability into the CRF model to obtain the mark of each character includes:
inputting the probability into a prediction formula, and solving the maximum value of the prediction formula to obtain the optimal output label sequence, wherein the prediction formula is
$$
s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} p_{i, y_i}
$$
wherein $y$ is the label sequence to be predicted for the text to be entity-recognized, $y = (y_1, y_2, \ldots, y_n)$; $X$ is the text to be entity-recognized; $p_{i,y_i}$ is the probability that each character in the text to be entity-recognized belongs to a label, namely the probability that the $i$-th character is marked as the $y_i$-th label; $A_{y_i,y_{i+1}}$ denotes the probability of transferring from the $y_i$-th label to the $y_{i+1}$-th label;
and labeling according to the optimal output label sequence to further obtain the label of each character.
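The prediction score above can be sketched as follows; the emission and transition values below are toy numbers (not the ones in the figures), and tags are represented as integer indices:

```python
import numpy as np

def sequence_score(emissions, transitions, tags):
    """s(X, y) = sum of emission scores p[i, y_i] plus transition scores A[y_i, y_{i+1}]."""
    score = sum(emissions[i, t] for i, t in enumerate(tags))
    score += sum(transitions[a, b] for a, b in zip(tags[:-1], tags[1:]))
    return score

emissions = np.array([[0.9, 0.1],      # p_{i, y_i}: per-character tag probabilities
                      [0.2, 0.8],
                      [0.3, 0.7]])
transitions = np.array([[0.5, 0.5],    # A_{y_i, y_{i+1}}: tag-bigram scores
                        [0.1, 0.9]])
# emissions 0.9+0.8+0.7 plus transitions A[0,1]+A[1,1] = 0.5+0.9, i.e. ≈ 3.8
print(sequence_score(emissions, transitions, [0, 1, 1]))
```

Maximizing this score over all tag sequences yields the optimal output label sequence.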
In this example, referring to FIG. 4, the schematic structural diagram of LSTM + CRF: for each input $X = (x_1, x_2, \ldots, x_n)$, a predicted label sequence $y = (y_1, y_2, \ldots, y_n)$ is obtained, and the score of the prediction is defined as
$$
S(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} p_{i, y_i}
$$
wherein $p_{i,y_i}$ is the probability that the softmax output at the $i$-th position is $y_i$, and $A_{y_i,y_{i+1}}$ is the transition probability from $y_i$ to $y_{i+1}$. When the number of labels (B-person, B-location, …) is n, the transition probability matrix is (n+2) × (n+2), because a start position and an end position are additionally added. The scoring function S compensates well for the deficiency of the conventional BiLSTM: for a predicted sequence to obtain a high score, it is not enough to take, at each position, the label corresponding to the maximum probability output by softmax; the sum of the transition probabilities must also be maximized, i.e. the output rules must be complied with (B cannot be followed by B). For example, if the most likely sequence output by the BiLSTM is BBIBIOOO, then because the transition B->B has a small or even negative probability, such a sequence will not obtain the highest score under S, i.e. it is not the desired one. Taking "Ma Huateng is the CEO of Tencent" as an example, the maximum-score sequence obtained after passing through the CRF model is as follows:
S("Ma Huateng is the CEO of Tencent", (B-P, I-P, I-P, B, B-C, I-C, B, B, I, I)) = A(B-P, I-P) + A(I-P, I-P) + A(I-P, B) + A(B, B-C) + A(B-C, I-C) + A(I-C, B) + A(B, B) + A(B, I) + A(I, I) + 0.9 + 0.9 + 0.9 + 0.8 + 0.8 + 0.9 + 0.8 + 0.9 + 0.9 + 0.9, wherein $A_{y_i,y_{i+1}}$ is the transition probability value from $y_i$ to $y_{i+1}$, obtained by statistics over the labeled data. The word segmentation result is therefore: Ma Huateng / is / Tencent / 's / CEO.
It should be noted that the introduced CRF model is obtained by modeling bigrams of the output labels, then computing with dynamic programming, and finally labeling according to the resulting optimal path.
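The dynamic-programming computation of the optimal path mentioned above is commonly realized as Viterbi decoding; a minimal sketch under toy emission/transition scores (illustrative only, not the claimed implementation):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the tag-index sequence maximizing s(X, y) by dynamic programming."""
    n, k = emissions.shape
    score = emissions[0].copy()          # best score of a path ending in each tag
    back = np.zeros((n, k), dtype=int)   # backpointers for path recovery
    for i in range(1, n):
        # cand[a, b]: best score ending in tag a, then transitioning a -> b
        cand = score[:, None] + transitions + emissions[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]         # best final tag
    for i in range(n - 1, 0, -1):        # follow backpointers
        path.append(int(back[i, path[-1]]))
    return path[::-1]

emissions = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.3, 0.7]])
transitions = np.array([[0.5, 0.5],
                        [0.1, 0.9]])
print(viterbi_decode(emissions, transitions))  # [0, 1, 1]
```

For k tags and n characters this runs in O(n·k²), which is what makes exact maximization of the CRF score tractable.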
In this embodiment, the mark of each character may be displayed on the text to be entity-recognized. For example, referring to fig. 5, the label of each character is displayed at a preset position of that character, for example above or below the character, or as a subscript or superscript.
The embodiment has the following beneficial effects:
A trained LSTM-based entity recognition model is acquired, wherein the LSTM-based entity recognition model is trained using a labeled training corpus; the text to be entity-recognized is input into the trained LSTM-based entity recognition model, and the probability that each character in the text belongs to each label is acquired; the probability is input into a CRF model to obtain the mark of each character. An LSTM network depends heavily on data: the size and quality of the data affect the training result. By combining the LSTM model with the CRF model, the LSTM model solves the problem of extracting sequence features, while the CRF model effectively uses sentence-level labeling information. The LSTM + CRF model improves the execution efficiency of a dialogue system, realizes entity recognition and word segmentation at the same time, and improves entity recognition accuracy and efficiency.
EXAMPLE III
Referring to fig. 6, a schematic structural diagram of an entity identification apparatus according to a third embodiment of the present invention is provided;
an entity identification apparatus comprising:
an entity recognition model obtaining module 31, configured to obtain a trained LSTM-based entity recognition model, where the LSTM-based entity recognition model is trained using labeled training corpora;
a probability obtaining module 32, configured to input the text to be entity-recognized into the trained LSTM-based entity recognition model, and obtain a probability that each character in the text to be entity-recognized belongs to a label;
and a mark acquiring module 33, configured to input the probability into the CRF model to obtain a mark of each character.
Preferably, the entity recognition model obtaining module 31 includes:
the corpus acquiring unit is used for acquiring the labeled corpus;
a vector obtaining unit, configured to convert words and characters in the labeled training corpus into vectors;
and the parameter training unit is used for inputting the vectors of the words and the characters into the LSTM-based entity recognition model, and training parameters in the LSTM-based entity recognition model by using a back propagation method so as to obtain the LSTM-based entity recognition model after training.
Preferably, the corpus acquiring unit includes:
and labeling the training corpus in IOB mode to obtain the labeled training corpus.
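As an illustrative sketch of such IOB labeling (the tag names follow the B-P/person, B-C/company convention used in the description; not part of the claimed apparatus):

```python
# One IOB tag per character of the example sentence "Ma Huateng is the CEO of Tencent".
sentence = "马化腾是腾讯的CEO"
tags     = ["B-P", "I-P", "I-P", "B", "B-C", "I-C", "B", "B", "I", "I"]
assert len(sentence) == len(tags)   # character-level labeling: one tag per character
pairs = list(zip(sentence, tags))
print(pairs[0])                     # ('马', 'B-P')
```

B-x marks the first character of an entity of type x, I-x its continuation, and plain B/I mark non-entity word boundaries for segmentation.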
Preferably, the probability obtaining module 32 includes:
and sequentially inputting the characters of the text to be recognized by the entity into the trained entity recognition model based on the LSTM, and acquiring the probability that each character in the text to be recognized by the entity belongs to the label.
Preferably, the mark acquiring module 33 includes:
inputting the probability into a prediction formula, and solving the maximum value of the prediction formula to obtain the optimal output label sequence, wherein the prediction formula is
$$
s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} p_{i, y_i}
$$
wherein $y$ is the label sequence to be predicted for the text to be entity-recognized, $y = (y_1, y_2, \ldots, y_n)$; $X$ is the text to be entity-recognized; $p_{i,y_i}$ is the probability that each character in the text to be entity-recognized belongs to a label, namely the probability that the $i$-th character is marked as the $y_i$-th label; $A_{y_i,y_{i+1}}$ denotes the probability of transferring from the $y_i$-th label to the $y_{i+1}$-th label;
and labeling according to the optimal output label sequence to further obtain the label of each character.
The embodiment has the following beneficial effects:
A trained LSTM-based entity recognition model is acquired, wherein the LSTM-based entity recognition model is trained using a labeled training corpus; the text to be entity-recognized is input into the trained LSTM-based entity recognition model, and the probability that each character in the text belongs to each label is acquired; the probability is input into a CRF model to obtain the mark of each character. An LSTM network depends heavily on data: the size and quality of the data affect the training result. By combining the LSTM model with the CRF model, the LSTM model solves the problem of extracting sequence features, while the CRF model effectively uses sentence-level labeling information. The LSTM + CRF model improves the execution efficiency of a dialogue system, realizes entity recognition and word segmentation at the same time, and improves entity recognition accuracy and efficiency.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.
It should be noted that, in the foregoing embodiments, the description of each embodiment has its own emphasis, and for a part that is not described in detail in a certain embodiment, reference may be made to the related description of other embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.

Claims (8)

1. An entity identification method, comprising:
acquiring an LSTM-based entity recognition model after training is finished, wherein the LSTM-based entity recognition model is trained by using labeled training corpora;
inputting a text to be recognized into the trained entity recognition model based on the LSTM, and acquiring the probability that each character in the text to be recognized belongs to a label;
inputting the probability into a CRF model to obtain the mark of each character, wherein the method specifically comprises the following steps: inputting the probability into a prediction formula, and solving the maximum value of the prediction formula to obtain the optimal output label sequence, wherein the prediction formula is
$$
s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} p_{i, y_i}
$$
wherein $y$ is the tag sequence to be predicted for the text to be entity-recognized, $y = (y_1, y_2, \ldots, y_n)$; $X$ is the text to be entity-recognized; $p_{i,y_i}$ is the probability that each character in the text to be entity-recognized belongs to a label, namely the probability that the $i$-th character is marked as the $y_i$-th label; $A_{y_i,y_{i+1}}$ denotes the probability of transferring from the $y_i$-th label to the $y_{i+1}$-th label; and labeling according to the optimal output label sequence to further obtain the label of each character.
2. The entity recognition method according to claim 1, wherein the obtaining of the trained LSTM-based entity recognition model, wherein the LSTM-based entity recognition model is trained using labeled corpus, comprises:
acquiring the labeled training corpus;
converting the words and characters in the labeled training corpus into vectors;
and inputting the vectors of the words and the characters into the LSTM-based entity recognition model, and training parameters in the LSTM-based entity recognition model by using a back propagation method to obtain the LSTM-based entity recognition model after training.
3. The entity recognition method according to claim 2, wherein the obtaining the labeled corpus comprises:
and labeling the training corpus in IOB mode to obtain the labeled training corpus.
4. The entity recognition method according to claim 1, wherein the inputting the text to be recognized into the trained LSTM-based entity recognition model, and the obtaining the probability that each character in the text to be recognized belongs to the label comprises:
and sequentially inputting the characters of the text to be recognized by the entity into the trained entity recognition model based on the LSTM, and acquiring the probability that each character in the text to be recognized by the entity belongs to the label.
5. An entity identification apparatus, comprising:
the entity recognition model acquisition module is used for acquiring a trained entity recognition model based on the LSTM, wherein the entity recognition model based on the LSTM is trained by using the labeled training corpus;
the probability acquisition module is used for inputting the text to be recognized into the entity recognition model based on the LSTM after the training is finished, and acquiring the probability that each character in the text to be recognized belongs to the label;
the mark acquisition module is used for inputting the probability into a CRF model to obtain marks of each character, and specifically comprises the following steps: inputting the probability into a prediction formula, and solving the maximum value of the prediction formula to obtain the optimal output label sequence, wherein the prediction formula is
$$
s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} p_{i, y_i}
$$
wherein $y$ is the label sequence to be predicted for the text to be entity-recognized, $y = (y_1, y_2, \ldots, y_n)$; $X$ is the text to be entity-recognized; $p_{i,y_i}$ is the probability that each character in the text to be entity-recognized belongs to a label, namely the probability that the $i$-th character is marked as the $y_i$-th label; $A_{y_i,y_{i+1}}$ denotes the probability of transferring from the $y_i$-th label to the $y_{i+1}$-th label; and labeling according to the optimal output label sequence to further obtain the label of each character.
6. The entity recognition apparatus of claim 5, wherein the entity recognition model obtaining module comprises:
acquiring the labeled training corpus;
converting the words and characters in the labeled training corpus into vectors;
and inputting the vectors of the words and the characters into the LSTM-based entity recognition model, and training parameters in the LSTM-based entity recognition model by using a back propagation method to obtain the LSTM-based entity recognition model after training.
7. An entity identification device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor when executing the computer program implementing the entity identification method of any one of claims 1 to 4.
8. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the entity identification method according to any one of claims 1 to 4.
CN201811061626.8A 2018-09-12 2018-09-12 Entity identification method, device, equipment and storage medium Active CN109299458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811061626.8A CN109299458B (en) 2018-09-12 2018-09-12 Entity identification method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN109299458A CN109299458A (en) 2019-02-01
CN109299458B true CN109299458B (en) 2023-03-28

Family

ID=65166558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811061626.8A Active CN109299458B (en) 2018-09-12 2018-09-12 Entity identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109299458B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681670B (en) * 2019-02-25 2023-05-12 北京嘀嘀无限科技发展有限公司 Information identification method, device, electronic equipment and storage medium
CN109902303B (en) * 2019-03-01 2023-05-26 腾讯科技(深圳)有限公司 Entity identification method and related equipment
CN110287283B (en) * 2019-05-22 2023-08-01 中国平安财产保险股份有限公司 Intention model training method, intention recognition method, device, equipment and medium
CN110516251B (en) * 2019-08-29 2023-11-03 秒针信息技术有限公司 Method, device, equipment and medium for constructing electronic commerce entity identification model
CN110598210B (en) * 2019-08-29 2023-08-04 深圳市优必选科技股份有限公司 Entity recognition model training, entity recognition method, entity recognition device, entity recognition equipment and medium
CN110705211A (en) * 2019-09-06 2020-01-17 中国平安财产保险股份有限公司 Text key content marking method and device, computer equipment and storage medium
CN110555102A (en) * 2019-09-16 2019-12-10 青岛聚看云科技有限公司 media title recognition method, device and storage medium
CN110826330B (en) * 2019-10-12 2023-11-07 上海数禾信息科技有限公司 Name recognition method and device, computer equipment and readable storage medium
CN110738054B (en) * 2019-10-14 2023-07-07 携程计算机技术(上海)有限公司 Method, system, electronic equipment and storage medium for identifying hotel information in mail
CN110738182A (en) * 2019-10-21 2020-01-31 四川隧唐科技股份有限公司 LSTM model unit training method and device for high-precision identification of bid amount
CN110738055A (en) * 2019-10-23 2020-01-31 北京字节跳动网络技术有限公司 Text entity identification method, text entity identification equipment and storage medium
CN112733869A (en) * 2019-10-28 2021-04-30 中移信息技术有限公司 Method, device and equipment for training text recognition model and storage medium
CN110738319A (en) * 2019-11-11 2020-01-31 四川隧唐科技股份有限公司 LSTM model unit training method and device for recognizing bid-winning units based on CRF
CN110825827B (en) * 2019-11-13 2022-10-25 北京明略软件系统有限公司 Entity relationship recognition model training method and device and entity relationship recognition method and device
CN111079405A (en) * 2019-11-29 2020-04-28 微民保险代理有限公司 Text information identification method and device, storage medium and computer equipment
CN111209396A (en) * 2019-12-27 2020-05-29 深圳市优必选科技股份有限公司 Entity recognition model training method, entity recognition method and related device
CN111476022B (en) * 2020-05-15 2023-07-07 湖南工商大学 Character embedding and mixed LSTM entity identification method, system and medium for entity characteristics
CN111914561B (en) * 2020-07-31 2023-06-30 建信金融科技有限责任公司 Entity recognition model training method, entity recognition device and terminal equipment
CN112214987B (en) * 2020-09-08 2023-02-03 深圳价值在线信息科技股份有限公司 Information extraction method, extraction device, terminal equipment and readable storage medium
CN112182157B (en) * 2020-09-29 2023-09-22 中国平安人寿保险股份有限公司 Training method of online sequence labeling model, online labeling method and related equipment
CN112733911B (en) * 2020-12-31 2023-05-30 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of entity recognition model
CN113268673B (en) * 2021-04-23 2023-06-02 国家计算机网络与信息安全管理中心 Method and system for analyzing internet action type information clue
CN113821592A (en) * 2021-06-23 2021-12-21 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113486178B (en) * 2021-07-12 2023-12-01 恒安嘉新(北京)科技股份公司 Text recognition model training method, text recognition method, device and medium
CN116384515B (en) * 2023-06-06 2023-09-01 之江实验室 Model training method and device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 A kind of network text name entity recognition method based on neutral net probability disambiguation
CN108038103A (en) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 A kind of method, apparatus segmented to text sequence and electronic equipment


Also Published As

Publication number Publication date
CN109299458A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299458B (en) Entity identification method, device, equipment and storage medium
CN109271631B (en) Word segmentation method, device, equipment and storage medium
CN108877782B (en) Speech recognition method and device
CN111190600B (en) Method and system for automatically generating front-end codes based on GRU attention model
CN112633003A (en) Address recognition method and device, computer equipment and storage medium
CN116127020A (en) Method for training generated large language model and searching method based on model
CN113096242A (en) Virtual anchor generation method and device, electronic equipment and storage medium
CN113450759A (en) Voice generation method, device, electronic equipment and storage medium
CN114626380A (en) Entity identification method and device, electronic equipment and storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN116244416A (en) Training method for generating large language model and man-machine voice interaction method based on model
CN113434642B (en) Text abstract generation method and device and electronic equipment
US11036996B2 (en) Method and apparatus for determining (raw) video materials for news
CN116821306A (en) Dialogue reply generation method and device, electronic equipment and storage medium
CN114970666B (en) Spoken language processing method and device, electronic equipment and storage medium
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN115620726A (en) Voice text generation method, and training method and device of voice text generation model
CN115292467A (en) Information processing and model training method, apparatus, device, medium, and program product
CN114490969A (en) Question and answer method and device based on table and electronic equipment
CN108038230B (en) Information generation method and device based on artificial intelligence
CN115510203B (en) Method, device, equipment, storage medium and program product for determining answers to questions
CN114706942B (en) Text conversion model training method, text conversion device and electronic equipment
CN115965018B (en) Training method of information generation model, information generation method and device
US11461399B2 (en) Method and apparatus for responding to question, and storage medium
US20220261554A1 (en) Electronic device and controlling method of electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant