CN114528841A - Entity identification method and device, electronic equipment and storage medium - Google Patents

Entity identification method and device, electronic equipment and storage medium

Info

Publication number
CN114528841A
Authority
CN
China
Prior art keywords
label
character
text
recognized
sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210148608.3A
Other languages
Chinese (zh)
Inventor
刘欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd filed Critical Ping An Puhui Enterprise Management Co Ltd
Priority to CN202210148608.3A priority Critical patent/CN114528841A/en
Publication of CN114528841A publication Critical patent/CN114528841A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention relates to the field of artificial intelligence, and discloses an entity identification method, which comprises the following steps: when the number of samples carrying label information corresponding to the target field is smaller than a number threshold, obtaining samples carrying label information corresponding to a plurality of fields to obtain a sample set; determining a label transfer matrix corresponding to the label category set based on the label information of the sample set; performing coding processing on the text to be recognized and the sample set to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set; determining a label distribution matrix corresponding to the text to be recognized based on the first feature vector, the second feature vector and the label information of the sample set; and inputting the label distribution matrix and the label transfer matrix into the first entity recognition model to obtain an entity recognition result. The invention also provides an entity identification device, electronic equipment and a storage medium. The invention improves the entity identification accuracy.

Description

Entity identification method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method and an apparatus for entity identification, an electronic device, and a storage medium.
Background
The named entity recognition task is also called a sequence labeling task, is an important task in the field of natural language processing, and can be used for multiple application scenarios such as information extraction and text classification.
Currently, entity recognition is usually performed on text by using an entity recognition model obtained through supervised training. However, in some fields the number of labeled samples is small, which results in low entity recognition accuracy for the trained model. An entity recognition method is therefore urgently needed to improve entity recognition accuracy in fields with few labeled samples (small-sample fields).
Disclosure of Invention
In view of the above, there is a need to provide an entity identification method, aiming at improving the entity identification accuracy in a small sample field.
The entity identification method provided by the invention comprises the following steps:
receiving a text to be recognized, and determining a target field corresponding to the text to be recognized;
when the number of samples carrying the label information corresponding to the target field is smaller than a number threshold, obtaining a sample carrying the label information corresponding to each field in a plurality of fields from a preset database to obtain a sample set;
determining a label category set corresponding to the sample set based on label information of the sample set, calculating transition probabilities among label categories in the label category set, and determining a label transition matrix corresponding to the label category set based on the transition probabilities;
performing encoding processing on the text to be recognized and the sample set to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set;
determining a label distribution array corresponding to each character in the text to be recognized based on the first feature vector, the second feature vector and label information of the sample set, and determining a label distribution matrix corresponding to the text to be recognized based on the label distribution array;
and inputting the label distribution matrix and the label transfer matrix into a first entity recognition model to obtain an entity recognition result.
Optionally, the determining a target field corresponding to the text to be recognized includes:
performing word segmentation processing on the text to be recognized to obtain a word set;
matching each word in the word set with a word library corresponding to each field respectively to obtain a matched word set corresponding to each field;
and taking the field corresponding to the matching word set with the maximum number of matching words as the target field corresponding to the text to be recognized.
Optionally, the performing encoding processing on the text to be recognized and the sample set to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set includes:
combining the text to be recognized with each sample in the sample set respectively to obtain a plurality of sample pairs;
inputting each sample pair into a coding model respectively to perform coding processing to obtain a coding vector of each character in each sample pair;
and calculating the average value of the coding vector of each character to obtain a first characteristic vector of each character in the text to be recognized and a second characteristic vector of each character in the character set corresponding to the sample set.
Optionally, the determining a label distribution array corresponding to each character in the text to be recognized based on the first feature vector, the second feature vector, and the label information of the sample set includes:
sequentially calculating a probability value of each character in the text to be recognized in each label category in the label category set based on the first feature vector, the second feature vector and the label information of the sample set;
summarizing the probability value to obtain a label distribution array corresponding to each character in the text to be recognized.
Optionally, the probability value is calculated by the following formula:
$$\mathrm{Sim}(e_i, e_k) = \frac{e_i \cdot e_k}{\lVert e_i \rVert\,\lVert e_k \rVert}$$
$$f_{ij} = \frac{\sum_{k=1}^{N} \mathrm{Sim}(e_i, e_k)\, I(C_k = Y_j)}{\sum_{k=1}^{N} \mathrm{Sim}(e_i, e_k)}$$
wherein f_{ij} is the probability value of the ith character in the text to be recognized for the jth label category in the label category set, C_k is the label category of the kth character in the character set corresponding to the sample set, Y_j is the jth label category in the label category set, N is the total number of characters in the character set corresponding to the sample set, e_i is the first feature vector of the ith character in the text to be recognized, e_k is the second feature vector of the kth character in the character set corresponding to the sample set, Sim(e_i, e_k) is the similarity value between the ith character in the text to be recognized and the kth character in the character set corresponding to the sample set, and I(C_k = Y_j) is an indicator function that equals 1 if the label category of the kth character in the character set corresponding to the sample set is the same as the jth label category in the label category set and equals 0 otherwise.
Optionally, if the number of samples carrying the tag information corresponding to the target field is greater than or equal to a number threshold, the method includes:
training a second entity recognition model by adopting a sample carrying label information corresponding to the target field to obtain a trained second entity recognition model;
and executing entity recognition processing on the text to be recognized based on the trained second entity recognition model to obtain an entity recognition result.
Optionally, the calculation formula of the transition probability is:
$$T_{i-j} = \frac{p(c_i, c_j)}{p(c_i)}$$
wherein T_{i-j} is the transition probability of transitioning from the ith label category to the jth label category in the label category set, p(c_i, c_j) is the number of samples in the sample set that contain both the ith label category and the jth label category, and p(c_i) is the number of samples in the sample set that contain the ith label category.
In order to solve the above problem, the present invention further provides an entity identification apparatus, including:
the receiving module is used for receiving a text to be recognized and determining a target field corresponding to the text to be recognized;
the acquisition module is used for acquiring a sample carrying the label information corresponding to each field in a plurality of fields from a preset database to obtain a sample set when the number of the samples carrying the label information corresponding to the target field is smaller than a number threshold;
the calculation module is used for determining a label category set corresponding to the sample set based on the label information of the sample set, calculating transition probabilities among label categories in the label category set, and determining a label transition matrix corresponding to the label category set based on the transition probabilities;
the encoding module is used for performing encoding processing on the text to be recognized and the sample set to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set;
the determining module is used for determining a label distribution array corresponding to each character in the text to be recognized based on the first feature vector, the second feature vector and label information of the sample set, and determining a label distribution matrix corresponding to the text to be recognized based on the label distribution array;
and the identification module is used for inputting the label distribution matrix and the label transfer matrix into a first entity identification model to obtain an entity identification result.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores an entity identification program executable by the at least one processor, and the entity identification program, when executed by the at least one processor, enables the at least one processor to perform the entity identification method described above.
In order to solve the above problems, the present invention also provides a computer-readable storage medium having an entity identification program stored thereon, the entity identification program being executable by one or more processors to implement the above entity identification method.
Compared with the prior art, in the present invention, first, when the number of labeled samples corresponding to a target field is smaller than a number threshold, samples carrying label information corresponding to a plurality of fields are obtained to obtain a sample set, and a label transfer matrix corresponding to the label category set is calculated based on the label information of the sample set; then, the text to be recognized and the sample set are encoded to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set; next, a label distribution matrix corresponding to the text to be recognized is determined based on the first feature vector, the second feature vector, and the label information of the sample set; and finally, the label distribution matrix and the label transfer matrix are input into the first entity recognition model to obtain an entity recognition result. By introducing labeled samples from multiple fields, learning the relationship information among the labels of these samples, and performing entity recognition on the text to be recognized according to the learned relationship information, the present invention improves entity recognition accuracy in small-sample fields.
Drawings
Fig. 1 is a schematic flowchart of an entity identification method according to an embodiment of the present invention;
fig. 2 is a schematic block diagram of an entity identification apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device implementing an entity identification method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions relating to "first", "second", and the like in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that such combinations can be realized by a person skilled in the art; when technical solutions are contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
The invention provides an entity identification method. Fig. 1 is a schematic flow chart of an entity identification method according to an embodiment of the present invention. The method may be performed by an electronic device, which may be implemented by software and/or hardware.
In this embodiment, the entity identification method includes:
and S1, receiving the text to be recognized, and determining the target field corresponding to the text to be recognized.
Determining the target field makes it possible to determine the number of labeled samples in the target field and to select, according to that number, the entity recognition approach to be applied to the text to be recognized.
The determining the target field corresponding to the text to be recognized includes:
a11, performing word segmentation processing on the text to be recognized to obtain a word set;
in this embodiment, the word segmentation process may be performed on the text to be recognized by using a forward maximum matching method, a reverse maximum matching method, or a least segmentation method.
A12, matching each word in the word set with a word library corresponding to each field respectively to obtain a matching word set corresponding to each field;
in this embodiment, a corresponding word library is configured for each field in advance.
And A13, taking the field corresponding to the matching word set with the maximum number of matching words as the target field corresponding to the text to be recognized.
For example, if the number of the matching words in the matching word set corresponding to the travel field is the largest, the travel is taken as the target field corresponding to the text to be recognized.
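As an illustration of steps A11 to A13, the following sketch shows one way the target field could be looked up from per-field word libraries. The jieba tokenizer, the example fields, and the word libraries are assumptions for illustration and are not prescribed by this embodiment.

```python
# Hypothetical sketch of target-field determination (A11-A13).
import jieba  # any of the segmentation methods mentioned above could be substituted

WORD_LIBRARIES = {                      # one word library preconfigured per field
    "travel":  {"酒店", "机票", "景点", "航班"},
    "finance": {"贷款", "利率", "还款", "额度"},
}

def determine_target_field(text_to_recognize: str) -> str:
    words = set(jieba.lcut(text_to_recognize))          # A11: word segmentation
    match_counts = {field: len(words & library)         # A12: matching word set per field
                    for field, library in WORD_LIBRARIES.items()}
    return max(match_counts, key=match_counts.get)      # A13: field with the most matches
```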
And S2, when the number of the samples carrying the label information corresponding to the target field is smaller than the number threshold, obtaining the sample carrying the label information corresponding to each field in a plurality of fields from a preset database to obtain a sample set.
When the number of labeled samples in the target field is small, an entity recognition model with high accuracy cannot be trained from the samples of that field alone. To improve entity recognition accuracy, this embodiment introduces labeled samples from multiple fields, learns the relationship information among the labels of these samples, and then performs entity recognition on the text to be recognized according to the learned relationship information.
The number threshold may be, for example, ten thousand: when the number of labeled samples corresponding to the target field is less than ten thousand, labeled samples corresponding to each of a plurality of fields are acquired across fields to obtain a sample set.
In this embodiment, if the number of samples carrying tag information corresponding to the target field is greater than or equal to a number threshold, the method includes:
b11, training a second entity recognition model by adopting the sample carrying the label information corresponding to the target field to obtain a trained second entity recognition model;
if the number of the labeled samples in the target field is large, the samples can be used for training to obtain a trained second entity recognition model with high accuracy.
In this embodiment, the second entity recognition model may be a deep neural network model, or may be a CRF (conditional random field) model.
And B12, executing entity recognition processing on the text to be recognized based on the trained second entity recognition model to obtain an entity recognition result.
The entities in the text to be recognized can then be accurately recognized by using the trained second entity recognition model.
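As a rough sketch of this sufficient-data branch (B11 and B12), the snippet below trains a CRF-based recognizer directly on the target field's labeled samples. The sklearn-crfsuite library and the simple character features are assumptions for illustration, not the specific second entity recognition model mandated by this embodiment.

```python
# Hypothetical sketch of the large-sample path: train and apply a CRF model.
import sklearn_crfsuite

def char_features(chars, i):
    # minimal per-character features; richer features could be used
    return {"char": chars[i],
            "prev": chars[i - 1] if i > 0 else "<BOS>",
            "next": chars[i + 1] if i < len(chars) - 1 else "<EOS>"}

def train_second_model(samples):
    # samples: list of (characters, labels) pairs labeled in the target field
    X = [[char_features(chars, i) for i in range(len(chars))] for chars, _ in samples]
    y = [labels for _, labels in samples]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X, y)                                   # B11: supervised training
    return crf

def recognize(crf, text_to_recognize):
    chars = list(text_to_recognize)
    features = [char_features(chars, i) for i in range(len(chars))]
    return crf.predict_single(features)             # B12: entity recognition result
```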
S3, determining a label category set corresponding to the sample set based on the label information of the sample set, calculating transition probabilities among label categories in the label category set, and determining a label transition matrix corresponding to the label category set based on the transition probabilities.
The named entity recognition task is also a sequence labeling task, and the label information corresponding to the sample set includes a label category for each character of each sample in the sample set. For example, sample 1 in the sample set may be labeled character by character as: 我(o) 爱(o) 上(B-loc) 海(I-loc) 的(o) 东(B-arch) 方(I-arch) 明(I-arch) 珠(I-arch), i.e., "I love Shanghai's Oriental Pearl"; sample 2 may be labeled as: 北(B-loc) 京(I-loc) 的(o) 烤(o) 鸭(o) 好(o) 吃(o), i.e., "Beijing roast duck is delicious".
Wherein the label o represents a non-entity, the label B-loc represents the beginning of the place name entity, the I-loc represents the middle of the place name entity, the B-arch represents the beginning of the building entity, and the I-arch represents the middle of the building entity.
Transition probabilities between label classes, i.e., the probability of one label class transitioning to another label class, e.g., the probability of label o transitioning to label B-loc, the probability of label B-loc transitioning to label I-loc, and the probability of label I-loc transitioning to label I-loc.
In this embodiment, the calculation formula of the transition probability is:
$$T_{i-j} = \frac{p(c_i, c_j)}{p(c_i)}$$
wherein T_{i-j} is the transition probability of transitioning from the ith label category to the jth label category in the label category set, p(c_i, c_j) is the number of samples in the sample set that contain both the ith label category and the jth label category, and p(c_i) is the number of samples in the sample set that contain the ith label category.
If j = i, the probability of a label category transitioning to the same label category is calculated (for example, the probability of the label I-loc transitioning to the label I-loc). In this case, in the transition probability formula, the numerator is the number of samples in the sample set that contain the ith label category twice (i.e., at least two characters carrying that label), and the denominator is the number of samples in the sample set that contain the ith label category.
If there are 5 label categories in the label category set, the label transfer matrix is a 5 x 5 matrix.
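A minimal sketch of this step is given below: following the formula above, the transition probability from the ith to the jth label category is the number of samples containing both categories divided by the number containing the ith category, with the i = j case counted as described. The data layout is an assumption for illustration.

```python
# Hypothetical sketch of building the label transition matrix (S3).
import numpy as np

def build_label_transition_matrix(sample_labels, label_categories):
    # sample_labels: one per-character label sequence for each sample in the sample set
    label_sets = [set(labels) for labels in sample_labels]
    n = len(label_categories)
    T = np.zeros((n, n))
    for i, ci in enumerate(label_categories):
        p_i = sum(ci in s for s in label_sets)                       # p(c_i)
        for j, cj in enumerate(label_categories):
            if i == j:                                               # same-category case
                p_ij = sum(list(labels).count(ci) >= 2 for labels in sample_labels)
            else:
                p_ij = sum(ci in s and cj in s for s in label_sets)  # p(c_i, c_j)
            T[i, j] = p_ij / p_i if p_i else 0.0
    return T  # with 5 label categories this is a 5 x 5 matrix
```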
S4, performing encoding processing on the text to be recognized and the sample set to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set.
In this embodiment, each character in the text to be recognized and each character of each sample in the sample set are encoded by an encoding model. The encoding model may be a RoBERTa model, which can learn the semantic information, position information, and label information of each character of the input text as well as the associations between characters, so that the feature vectors obtained by encoding represent relatively rich features.
The performing encoding processing on the text to be recognized and the sample set to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set includes:
c11, combining the text to be recognized with each sample in the sample set respectively to obtain a plurality of sample pairs;
for example, if there are F labeled samples in the sample set, F sample pairs are obtained after combination.
C12, inputting each sample pair into a coding model respectively to perform coding processing, and obtaining a coding vector of each character in each sample pair;
in this embodiment, two samples in each sample pair are spliced, and the middle of the two samples is connected by a separator, and then the two samples are input to the coding model to perform coding processing, where the separator may be [ sep ].
In the encoding process, each character in each sample pair can learn the semantic information, the label information, the position information and the association relationship among the characters of other characters in the sample pair to obtain an encoding vector.
And C13, calculating the average value of the encoding vector of each character to obtain a first feature vector of each character in the text to be recognized and a second feature vector of each character in the character set corresponding to the sample set.
If the F sample pairs all contain the text to be recognized, each character in the text to be recognized is encoded at least F times (when a sample in the sample set contains the same character as a certain character in the text to be recognized, the character is encoded more than F times).
In this embodiment, according to the number of times that a character is encoded, an average value of the encoding vector of each character is calculated to obtain a feature vector of each character, which includes a first feature vector of each character in a text to be recognized and a second feature vector of each character in a character set corresponding to a sample set.
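The sketch below illustrates steps C11 to C13 for the first feature vectors; the second feature vectors of the sample-set characters would be obtained in the same way from their positions in each pair. The Hugging Face transformers API, the hfl/chinese-roberta-wwm-ext checkpoint, and the one-token-per-character alignment are assumptions for illustration.

```python
# Hypothetical sketch of pair-wise encoding and averaging (C11-C13).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

def first_feature_vectors(text_to_recognize, sample_set):
    n = len(text_to_recognize)
    total = torch.zeros(n, encoder.config.hidden_size)
    for sample in sample_set:                                    # C11: one pair per sample
        inputs = tokenizer(list(text_to_recognize), list(sample),
                           is_split_into_words=True,
                           return_tensors="pt")                  # spliced with [SEP]
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state[0]      # C12: per-token coding vectors
        total += hidden[1:n + 1]    # positions 1..n hold the text's characters (after [CLS])
    return total / len(sample_set)  # C13: average over the F encodings of each character
```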
S5, determining a label distribution array corresponding to each character in the text to be recognized based on the first feature vector, the second feature vector and label information of the sample set, and determining a label distribution matrix corresponding to the text to be recognized based on the label distribution array.
According to the label information of the sample set, the dependency relationship between characters and labels can be learned. Based on this dependency relationship, the label distribution array corresponding to each character in the text to be recognized can be calculated, and the label distribution arrays of all characters are aggregated to obtain the label distribution matrix corresponding to the text to be recognized.
The determining a label distribution array corresponding to each character in the text to be recognized based on the first feature vector, the second feature vector and the label information of the sample set includes:
d11, sequentially calculating the probability value of each character in the text to be recognized in each label category in the label category set based on the first feature vector, the second feature vector and the label information of the sample set;
the probability value is calculated by the formula:
$$\mathrm{Sim}(e_i, e_k) = \frac{e_i \cdot e_k}{\lVert e_i \rVert\,\lVert e_k \rVert}$$
$$f_{ij} = \frac{\sum_{k=1}^{N} \mathrm{Sim}(e_i, e_k)\, I(C_k = Y_j)}{\sum_{k=1}^{N} \mathrm{Sim}(e_i, e_k)}$$
wherein f_{ij} is the probability value of the ith character in the text to be recognized for the jth label category in the label category set, C_k is the label category of the kth character in the character set corresponding to the sample set, Y_j is the jth label category in the label category set, N is the total number of characters in the character set corresponding to the sample set, e_i is the first feature vector of the ith character in the text to be recognized, e_k is the second feature vector of the kth character in the character set corresponding to the sample set, Sim(e_i, e_k) is the similarity value between the ith character in the text to be recognized and the kth character in the character set corresponding to the sample set, and I(C_k = Y_j) is an indicator function that equals 1 if the label category of the kth character in the character set corresponding to the sample set is the same as the jth label category in the label category set and equals 0 otherwise.
D12, summarizing the probability value to obtain a label distribution array corresponding to each character in the text to be recognized.
If there are 5 label categories in the label category set, the label distribution array corresponding to each character in the text to be recognized is a 1 × 5 array, and if the text to be recognized has 10 characters, the label distribution matrix corresponding to the text to be recognized is a 10 × 5 matrix.
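A sketch of steps D11 and D12 follows, consistent with the probability formula given above; cosine similarity and the row normalization are assumptions carried over from that reconstruction rather than details fixed by this embodiment.

```python
# Hypothetical sketch of the label distribution matrix (S5, D11-D12).
import numpy as np

def label_distribution_matrix(first_vecs, second_vecs, support_labels, label_categories):
    # first_vecs:  (M, d) first feature vectors of the text's M characters
    # second_vecs: (N, d) second feature vectors of the sample set's N characters
    # support_labels: the N label categories of those sample-set characters
    def sim(a, b):                                               # Sim(e_i, e_k), here cosine
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    dist = np.zeros((len(first_vecs), len(label_categories)))
    for i, e_i in enumerate(first_vecs):
        sims = np.array([sim(e_i, e_k) for e_k in second_vecs])
        for j, y_j in enumerate(label_categories):               # D11: probability per category
            mask = np.array([c == y_j for c in support_labels], dtype=float)
            dist[i, j] = float(sims @ mask) / (sims.sum() + 1e-12)   # f_ij
    return dist   # D12: one 1 x |labels| array per character, stacked into a matrix
```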
And S6, inputting the label distribution matrix and the label transfer matrix into a first entity recognition model to obtain an entity recognition result.
In this embodiment, the label distribution matrix and the label transfer matrix are input into the first entity recognition model, and the entity recognition result corresponding to the text to be recognized can be output.
The first entity recognition model is a CRF model. The CRF model combines the characteristics of the maximum entropy model and the hidden Markov model: it outputs the label sequence with the highest overall score for the whole text and takes context information into account, so that the entity recognition result is more accurate.
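A minimal decoding sketch for S6 is shown below: with the label distribution matrix as emission scores and the label transition matrix as transition scores, a Viterbi search returns the most probable label sequence, which is how a CRF-style first entity recognition model produces the entity recognition result. Working in log space and the small epsilon are implementation choices assumed for illustration.

```python
# Hypothetical Viterbi decoding over the label distribution and transition matrices (S6).
import numpy as np

def viterbi_decode(label_dist, transition, label_categories):
    eps = 1e-12
    emit = np.log(label_dist + eps)      # (num_chars, num_labels) emission scores
    trans = np.log(transition + eps)     # (num_labels, num_labels) transition scores
    num_chars, num_labels = emit.shape
    score = emit[0].copy()
    backptr = np.zeros((num_chars, num_labels), dtype=int)
    for t in range(1, num_chars):
        cand = score[:, None] + trans + emit[t][None, :]   # previous label -> current label
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(num_chars - 1, 0, -1):                  # trace the best path backwards
        path.append(int(backptr[t][path[-1]]))
    return [label_categories[j] for j in reversed(path)]   # entity recognition result
```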
As can be seen from the foregoing embodiments, in the entity recognition method provided by the present invention, first, when the number of labeled samples corresponding to a target field is smaller than a number threshold, samples carrying label information corresponding to a plurality of fields are obtained to obtain a sample set, and a label transfer matrix corresponding to the label category set is calculated based on the label information of the sample set; then, the text to be recognized and the sample set are encoded to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set; next, a label distribution matrix corresponding to the text to be recognized is determined based on the first feature vector, the second feature vector, and the label information of the sample set; and finally, the label distribution matrix and the label transfer matrix are input into the first entity recognition model to obtain an entity recognition result. By introducing labeled samples from multiple fields, learning the relationship information among the labels of these samples, and performing entity recognition on the text to be recognized according to the learned relationship information, the present invention improves entity recognition accuracy in small-sample fields.
Fig. 2 is a schematic block diagram of an entity identification apparatus according to an embodiment of the present invention.
The entity identifying device 100 of the present invention may be installed in an electronic device. According to the implemented functions, the entity identifying apparatus 100 may include a receiving module 110, an obtaining module 120, a calculating module 130, an encoding module 140, a determining module 150, and an identifying module 160. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the receiving module 110 is configured to receive a text to be recognized and determine a target field corresponding to the text to be recognized;
the determining the target field corresponding to the text to be recognized includes:
a21, performing word segmentation processing on the text to be recognized to obtain a word set;
a22, matching each word in the word set with a word library corresponding to each field respectively to obtain a matching word set corresponding to each field;
and A23, taking the field corresponding to the matching word set with the maximum number of matching words as the target field corresponding to the text to be recognized.
An obtaining module 120, configured to obtain, from a preset database, a sample carrying tag information corresponding to each of multiple fields to obtain a sample set when the number of samples carrying tag information corresponding to the target field is smaller than a number threshold.
If the number of samples carrying tag information corresponding to the target field is greater than or equal to a number threshold, the obtaining module 120 is further configured to:
b21, training a second entity recognition model by adopting the sample carrying the label information corresponding to the target field to obtain a trained second entity recognition model;
and B22, executing entity recognition processing on the text to be recognized based on the trained second entity recognition model to obtain an entity recognition result.
A calculating module 130, configured to determine a label category set corresponding to the sample set based on the label information of the sample set, calculate transition probabilities among label categories in the label category set, and determine a label transition matrix corresponding to the label category set based on the transition probabilities.
The calculation formula of the transition probability is as follows:
$$T_{i-j} = \frac{p(c_i, c_j)}{p(c_i)}$$
wherein T_{i-j} is the transition probability of transitioning from the ith label category to the jth label category in the label category set, p(c_i, c_j) is the number of samples in the sample set that contain both the ith label category and the jth label category, and p(c_i) is the number of samples in the sample set that contain the ith label category.
The encoding module 140 is configured to perform encoding processing on the text to be recognized and the sample set to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set.
The encoding processing is executed on the text to be recognized and the sample set to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set, and the encoding processing includes:
c21, combining the text to be recognized with each sample in the sample set respectively to obtain a plurality of sample pairs;
c22, inputting each sample pair into a coding model respectively to carry out coding processing, and obtaining a coding vector of each character in each sample pair;
and C23, calculating the average value of the encoding vector of each character to obtain a first feature vector of each character in the text to be recognized and a second feature vector of each character in the character set corresponding to the sample set.
A determining module 150, configured to determine a label distribution array corresponding to each character in the text to be recognized based on the first feature vector, the second feature vector, and the label information of the sample set, and determine a label distribution matrix corresponding to the text to be recognized based on the label distribution array.
The determining a label distribution array corresponding to each character in the text to be recognized based on the first feature vector, the second feature vector and the label information of the sample set includes:
d21, sequentially calculating the probability value of each character in the text to be recognized in each label category in the label category set based on the first feature vector, the second feature vector and the label information of the sample set;
the probability value is calculated by the formula:
$$\mathrm{Sim}(e_i, e_k) = \frac{e_i \cdot e_k}{\lVert e_i \rVert\,\lVert e_k \rVert}$$
$$f_{ij} = \frac{\sum_{k=1}^{N} \mathrm{Sim}(e_i, e_k)\, I(C_k = Y_j)}{\sum_{k=1}^{N} \mathrm{Sim}(e_i, e_k)}$$
wherein f_{ij} is the probability value of the ith character in the text to be recognized for the jth label category in the label category set, C_k is the label category of the kth character in the character set corresponding to the sample set, Y_j is the jth label category in the label category set, N is the total number of characters in the character set corresponding to the sample set, e_i is the first feature vector of the ith character in the text to be recognized, e_k is the second feature vector of the kth character in the character set corresponding to the sample set, Sim(e_i, e_k) is the similarity value between the ith character in the text to be recognized and the kth character in the character set corresponding to the sample set, and I(C_k = Y_j) is an indicator function that equals 1 if the label category of the kth character in the character set corresponding to the sample set is the same as the jth label category in the label category set and equals 0 otherwise.
D22, summarizing the probability value to obtain a label distribution array corresponding to each character in the text to be recognized.
And the identification module 160 is configured to input the label distribution matrix and the label transfer matrix into a first entity identification model to obtain an entity identification result.
Fig. 3 is a schematic structural diagram of an electronic device for implementing an entity identification method according to an embodiment of the present invention.
The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions that are set or stored in advance. The electronic device 1 may be a computer, a single network server, a server group composed of a plurality of network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
In the embodiment, the electronic device 1 includes, but is not limited to, a memory 11, a processor 12, and a network interface 13, which are communicatively connected to each other through a system bus, wherein the memory 11 stores an entity identification program 10, and the entity identification program 10 can be executed by the processor 12. Fig. 3 shows only the electronic device 1 with the components 11-13 and the entity recognition program 10, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
The memory 11 includes an internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, or the like. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, it may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used for storing the operating system and various application software installed in the electronic device 1, for example, the code of the entity identification program 10 in an embodiment of the present invention. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally configured to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with other devices. In this embodiment, the processor 12 is configured to execute the program code stored in the memory 11 or process data, such as executing the entity identification program 10.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is used for establishing a communication connection between the electronic device 1 and a client (not shown).
Optionally, the electronic device 1 may further include a user interface. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard), and optionally may further include a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The entity identification program 10 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 12, implement the steps of the entity identification method described above.
Specifically, the specific implementation method of the entity identification program 10 by the processor 12 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not repeated herein.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. The computer-readable medium may be volatile or non-volatile. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a Read-Only Memory (ROM).
The computer readable storage medium has stored thereon an entity identification program 10, and the entity identification program 10 can be executed by one or more processors to implement the steps in the entity identification method described above.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names only and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. An entity identification method, characterized in that the method comprises:
receiving a text to be recognized, and determining a target field corresponding to the text to be recognized;
when the number of samples carrying the label information corresponding to the target field is smaller than a number threshold, obtaining a sample carrying the label information corresponding to each field in a plurality of fields from a preset database to obtain a sample set;
determining a label category set corresponding to the sample set based on label information of the sample set, calculating transition probabilities among label categories in the label category set, and determining a label transition matrix corresponding to the label category set based on the transition probabilities;
performing encoding processing on the text to be recognized and the sample set to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set;
determining a label distribution array corresponding to each character in the text to be recognized based on the first feature vector, the second feature vector and label information of the sample set, and determining a label distribution matrix corresponding to the text to be recognized based on the label distribution array;
and inputting the label distribution matrix and the label transfer matrix into a first entity recognition model to obtain an entity recognition result.
2. The entity recognition method of claim 1, wherein the determining the target field corresponding to the text to be recognized comprises:
performing word segmentation processing on the text to be recognized to obtain a word set;
matching each word in the word set with a word library corresponding to each field respectively to obtain a matched word set corresponding to each field;
and taking the field corresponding to the matching word set with the maximum number of matching words as the target field corresponding to the text to be recognized.
3. The entity identification method according to claim 1, wherein the encoding process performed on the text to be identified and the sample set to obtain a first feature vector corresponding to each character in the text to be identified and a second feature vector corresponding to each character in the character set corresponding to the sample set comprises:
combining the text to be recognized with each sample in the sample set respectively to obtain a plurality of sample pairs;
inputting each sample pair into a coding model respectively to perform coding processing to obtain a coding vector of each character in each sample pair;
and calculating the average value of the coding vector of each character to obtain a first characteristic vector of each character in the text to be recognized and a second characteristic vector of each character in the character set corresponding to the sample set.
4. The entity recognition method of claim 1, wherein the determining a tag distribution array corresponding to each character in the text to be recognized based on the first feature vector, the second feature vector and the tag information of the sample set comprises:
sequentially calculating a probability value of each character in the text to be recognized in each label category in the label category set based on the first feature vector, the second feature vector and the label information of the sample set;
summarizing the probability value to obtain a label distribution array corresponding to each character in the text to be recognized.
5. The entity identification method of claim 4, wherein the probability value is calculated by the formula:
$$\mathrm{Sim}(e_i, e_k) = \frac{e_i \cdot e_k}{\lVert e_i \rVert\,\lVert e_k \rVert}$$
$$f_{ij} = \frac{\sum_{k=1}^{N} \mathrm{Sim}(e_i, e_k)\, I(C_k = Y_j)}{\sum_{k=1}^{N} \mathrm{Sim}(e_i, e_k)}$$
wherein f_{ij} is the probability value of the ith character in the text to be recognized for the jth label category in the label category set, C_k is the label category of the kth character in the character set corresponding to the sample set, Y_j is the jth label category in the label category set, N is the total number of characters in the character set corresponding to the sample set, e_i is the first feature vector of the ith character in the text to be recognized, e_k is the second feature vector of the kth character in the character set corresponding to the sample set, Sim(e_i, e_k) is the similarity value between the ith character in the text to be recognized and the kth character in the character set corresponding to the sample set, and I(C_k = Y_j) is an indicator function that equals 1 if the label category of the kth character in the character set corresponding to the sample set is the same as the jth label category in the label category set and equals 0 otherwise.
6. The entity identification method of claim 1, wherein if the number of samples carrying tag information corresponding to the target domain is greater than or equal to a number threshold, the method comprises:
training a second entity recognition model by adopting a sample carrying label information corresponding to the target field to obtain a trained second entity recognition model;
and executing entity recognition processing on the text to be recognized based on the trained second entity recognition model to obtain an entity recognition result.
7. The entity recognition method of claim 1, wherein the transition probability is calculated by the formula:
$$T_{i-j} = \frac{p(c_i, c_j)}{p(c_i)}$$
wherein T_{i-j} is the transition probability of transitioning from the ith label category to the jth label category in the label category set, p(c_i, c_j) is the number of samples in the sample set that contain both the ith label category and the jth label category, and p(c_i) is the number of samples in the sample set that contain the ith label category.
8. An entity identification apparatus, the apparatus comprising:
the receiving module is used for receiving a text to be recognized and determining a target field corresponding to the text to be recognized;
the acquisition module is used for acquiring a sample carrying the label information corresponding to each field in a plurality of fields from a preset database to obtain a sample set when the number of the samples carrying the label information corresponding to the target field is smaller than a number threshold;
the calculation module is used for determining a label category set corresponding to the sample set based on the label information of the sample set, calculating transition probabilities among label categories in the label category set, and determining a label transition matrix corresponding to the label category set based on the transition probabilities;
the encoding module is used for performing encoding processing on the text to be recognized and the sample set to obtain a first feature vector corresponding to each character in the text to be recognized and a second feature vector corresponding to each character in the character set corresponding to the sample set;
the determining module is used for determining a label distribution array corresponding to each character in the text to be recognized based on the first feature vector, the second feature vector and label information of the sample set, and determining a label distribution matrix corresponding to the text to be recognized based on the label distribution array;
and the identification module is used for inputting the label distribution matrix and the label transfer matrix into a first entity identification model to obtain an entity identification result.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores an entity identification program executable by the at least one processor to enable the at least one processor to perform the entity identification method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon an entity identification program executable by one or more processors to implement the entity identification method of any one of claims 1 to 7.
CN202210148608.3A 2022-02-17 2022-02-17 Entity identification method and device, electronic equipment and storage medium Pending CN114528841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210148608.3A CN114528841A (en) 2022-02-17 2022-02-17 Entity identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210148608.3A CN114528841A (en) 2022-02-17 2022-02-17 Entity identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114528841A true CN114528841A (en) 2022-05-24

Family

ID=81623140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210148608.3A Pending CN114528841A (en) 2022-02-17 2022-02-17 Entity identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114528841A (en)

Similar Documents

Publication Publication Date Title
CN111241304B (en) Answer generation method based on deep learning, electronic device and readable storage medium
CN110033018B (en) Graph similarity judging method and device and computer readable storage medium
CN114462412B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN113051356A (en) Open relationship extraction method and device, electronic equipment and storage medium
CN112597135A (en) User classification method and device, electronic equipment and readable storage medium
CN113157927A (en) Text classification method and device, electronic equipment and readable storage medium
CN115063589A (en) Knowledge distillation-based vehicle component segmentation method and related equipment
CN114706985A (en) Text classification method and device, electronic equipment and storage medium
CN113344125A (en) Long text matching identification method and device, electronic equipment and storage medium
CN112800178A (en) Answer generation method and device, electronic equipment and readable storage medium
CN114818685B (en) Keyword extraction method and device, electronic equipment and storage medium
CN113688239B (en) Text classification method and device under small sample, electronic equipment and storage medium
CN114281991A (en) Text classification method and device, electronic equipment and storage medium
CN114580354B (en) Information coding method, device, equipment and storage medium based on synonym
CN116450829A (en) Medical text classification method, device, equipment and medium
CN113706252B (en) Product recommendation method and device, electronic equipment and storage medium
CN113705692B (en) Emotion classification method and device based on artificial intelligence, electronic equipment and medium
CN113342977B (en) Invoice image classification method, device, equipment and storage medium
CN113656586B (en) Emotion classification method, emotion classification device, electronic equipment and readable storage medium
CN113610580B (en) Product recommendation method and device, electronic equipment and readable storage medium
CN114398877A (en) Theme extraction method and device based on artificial intelligence, electronic equipment and medium
CN114610854A (en) Intelligent question and answer method, device, equipment and storage medium
CN114139530A (en) Synonym extraction method and device, electronic equipment and storage medium
CN114528841A (en) Entity identification method and device, electronic equipment and storage medium
CN113591881A (en) Intention recognition method and device based on model fusion, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination