CN113111656A - Entity identification method, entity identification device, computer readable storage medium and computer equipment - Google Patents

Entity identification method, entity identification device, computer readable storage medium and computer equipment

Info

Publication number
CN113111656A
Authority
CN
China
Prior art keywords
entity
adjacent
candidate
probability
quality
Prior art date
Legal status
Granted
Application number
CN202010031702.1A
Other languages
Chinese (zh)
Other versions
CN113111656B (en)
Inventor
谢润泉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010031702.1A
Publication of CN113111656A
Application granted
Publication of CN113111656B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to an entity identification method and apparatus, a computer-readable storage medium, and a computer device. The method includes: acquiring participles of a text to be recognized; determining a closeness probability between adjacent participles; combining the participles to obtain adjacent phrases; determining candidate entities from the adjacent phrases according to the closeness probabilities of the participles making up each adjacent phrase; determining an entity type of each candidate entity; and, when the entity type of a candidate entity is the target entity type, taking that candidate entity as a target entity. Because whether an adjacent phrase can serve as a candidate entity is predicted from the closeness probabilities, candidate entities can be recognized without attaching position labels to the participles of the text and without training a model through a complex training process, and nested entities can be recognized as well. This simplifies the entity recognition process and improves recognition efficiency.

Description

Entity identification method, entity identification device, computer readable storage medium and computer equipment
Technical Field
The present application relates to the field of internet technologies, and in particular, to an entity identification method, an entity identification apparatus, a computer-readable storage medium, and a computer device.
Background
With the development of artificial intelligence and big data technology, the demand for natural language processing keeps growing. Entity recognition, as a necessary preliminary step for tasks such as semantic understanding and speech synthesis, plays an important role in natural language understanding. Named Entity Recognition (NER), also called proper-name recognition, refers to recognizing entities with specific meaning in text, mainly including names of people, places, and organizations, proper nouns, and so on.
Current entity recognition tasks mainly concern three categories: person names, place names, and organization names. These categories are relatively fixed, and the internal structure of such entities is relatively flat, with few nested structures. An entity with a nested structure is called a nested entity. For example, in the dish entity "Mao's braised pork in brown sauce", "braised pork in brown sauce" is itself a dish entity; likewise, the dish entity "dog cannot eat steamed stuffed buns" contains "steamed stuffed buns", which is also a dish entity.
As natural language processing extends into vertical domains such as catering, medical care, and finance, entity recognition increasingly focuses on vertical categories, such as dish-name recognition in catering and industry-name recognition in finance. Unlike traditional named entities, vertical entities contain many nested entities: a noun inside a nested entity can serve as one entity, and the enclosing noun phrase can also serve as one entity. The type range is relatively open, and many nested structures exist among the entities. At present, however, few entity recognition methods can identify nested entities; those that can require complex processes, have low recognition efficiency, and cannot meet actual recognition requirements.
Disclosure of Invention
In view of the above, it is necessary to provide an entity identification method, an entity identification apparatus, a computer-readable storage medium, and a computer device that address the complexity and low efficiency of existing entity identification methods.
An entity identification method, comprising:
acquiring participles of a text to be recognized;
determining a closeness probability between adjacent participles;
combining the participles to obtain adjacent phrases;
determining candidate entities from the adjacent phrases according to the closeness probabilities of the participles making up each adjacent phrase;
determining an entity type of the candidate entity; and
when the entity type of the candidate entity is a target entity type, taking the candidate entity as a target entity.
In one embodiment, the determining candidate entities from the adjacent phrases according to the closeness probabilities of their participles comprises:
acquiring external feature resources of the adjacent phrases, the external feature resources being acquired from the Internet using the adjacent phrases and reflecting the information content of the adjacent phrases; and
determining candidate entities from the adjacent phrases according to the closeness probabilities of their participles and the external feature resources.
In one embodiment, the determining the closeness probability between adjacent participles comprises:
processing the adjacent participles through a prediction model to obtain the closeness probability;
wherein the prediction model is trained on a preset network model using collected relationship training samples and processes each pair of input adjacent participles to obtain a closeness probability, each relationship training sample comprising a pair of adjacent participles and the corresponding closeness probability.
In one embodiment, the determining candidate entities from the adjacent phrases according to the closeness probabilities of their participles comprises:
processing the closeness probabilities of the participles of an adjacent phrase through a quality model to obtain a quality score; and
when the quality score reaches a preset threshold, taking the adjacent phrase as a candidate entity;
wherein the quality model is trained on a preset network model using collected quality training samples and processes the closeness probabilities of the participles of an input adjacent phrase to obtain the quality score of that phrase, each quality training sample comprising the closeness probabilities of the participles of an adjacent phrase and the corresponding quality score.
In one embodiment, the determining candidate entities from the adjacent phrases according to the closeness probabilities of their participles and the external feature resources comprises:
processing the closeness probabilities of the participles of an adjacent phrase and its external feature resources through a quality model to obtain a quality score; and
when the quality score reaches a preset threshold, taking the adjacent phrase as a candidate entity;
wherein the quality model is trained on a preset network model using collected quality training samples and processes the closeness probabilities and external feature resources of an input adjacent phrase to obtain its quality score, each quality training sample comprising the closeness probabilities and external feature resources of an adjacent phrase and the corresponding quality score.
In one embodiment, the determining the entity type of the candidate entity and taking the candidate entity as the target entity when its entity type is the target entity type comprises:
inputting the candidate entity into a classification model to obtain the entity type of the candidate entity; and
when the entity type of the candidate entity is the target entity type, taking the candidate entity as a target entity;
wherein the classification model is trained on a preset network model using collected classification training samples and processes input candidate entities to obtain entity types, each classification training sample comprising a candidate entity and the corresponding entity type.
In one embodiment, the determining candidate entities from the adjacent phrases according to the closeness probabilities of their participles comprises:
acquiring the closeness probabilities of the participles of an adjacent phrase;
calculating a mean closeness probability from those probabilities;
determining the participles at the boundary of the adjacent phrase;
acquiring the closeness probabilities between each boundary participle and its neighboring participle outside the phrase;
selecting the maximum of those boundary closeness probabilities;
subtracting the maximum boundary closeness probability from the mean closeness probability to obtain the quality score of the adjacent phrase; and
when the quality score reaches a preset threshold, taking the adjacent phrase as a candidate entity.
An entity identification apparatus comprising:
a participle acquisition module, configured to acquire participles of a text to be recognized;
a closeness probability determination module, configured to determine a closeness probability between adjacent participles;
an adjacent phrase combination module, configured to combine the participles to obtain adjacent phrases;
a candidate entity determination module, configured to determine candidate entities from the adjacent phrases according to the closeness probabilities of their participles;
an entity type determination module, configured to determine an entity type of the candidate entity; and
a target entity determination module, configured to take the candidate entity as a target entity when its entity type is the target entity type.
In one embodiment, the candidate entity determination module comprises:
an external feature resource acquisition module, configured to acquire external feature resources of the adjacent phrases, the external feature resources being acquired from the Internet using the adjacent phrases and reflecting their information content; and
a candidate entity screening module, configured to determine candidate entities from the adjacent phrases according to the closeness probabilities of their participles and the external feature resources.
In one embodiment, the closeness probability determination module comprises:
a closeness probability calculation module, configured to process the adjacent participles through a prediction model to obtain the closeness probability;
wherein the prediction model is trained on a preset network model using collected relationship training samples and processes each pair of input adjacent participles to obtain a closeness probability, each relationship training sample comprising a pair of adjacent participles and the corresponding closeness probability.
In one embodiment, the candidate entity determination module comprises:
a first quality score calculation module, configured to process the closeness probabilities of the participles of an adjacent phrase through a quality model to obtain a quality score; and
a candidate entity selection module, configured to take the adjacent phrase as a candidate entity when the quality score reaches a preset threshold;
wherein the quality model is trained on a preset network model using collected quality training samples and processes the closeness probabilities of the participles of an input adjacent phrase to obtain its quality score, each quality training sample comprising the closeness probabilities of the participles of an adjacent phrase and the corresponding quality score.
In one embodiment, the candidate entity screening module comprises:
a second quality score calculation module, configured to process the closeness probabilities of the participles of an adjacent phrase and its external feature resources through a quality model to obtain a quality score, the adjacent phrase being taken as a candidate entity when the quality score reaches a preset threshold;
wherein the quality model is trained on a preset network model using collected quality training samples and processes the closeness probabilities and external feature resources of an input adjacent phrase to obtain its quality score, each quality training sample comprising the closeness probabilities and external feature resources of an adjacent phrase and the corresponding quality score.
In one embodiment, the target entity determination module comprises:
an entity type determination module, configured to input the candidate entity into a classification model to obtain the entity type of the candidate entity;
wherein the classification model is trained on a preset network model using collected classification training samples and processes input candidate entities to obtain entity types, each classification training sample comprising a candidate entity and the corresponding entity type.
In one embodiment, the candidate entity determination module comprises:
a first closeness probability acquisition module, configured to acquire the closeness probabilities of the participles of an adjacent phrase;
a mean calculation module, configured to calculate a mean closeness probability from those probabilities;
a boundary participle determination module, configured to determine the participles at the boundary of the adjacent phrase;
a second closeness probability acquisition module, configured to acquire the closeness probabilities between each boundary participle and its neighboring participle outside the phrase;
a maximum closeness probability determination module, configured to select the maximum of those boundary closeness probabilities;
a quality score calculation module, configured to subtract the maximum boundary closeness probability from the mean closeness probability to obtain the quality score of the adjacent phrase; and
a candidate entity selection module, configured to take the adjacent phrase as a candidate entity when the quality score reaches a preset threshold.
A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the entity identification method as described above.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the entity identification method as described above.
In the entity recognition task above, after the participles of the text to be recognized are obtained, the closeness probability between adjacent participles is determined, the participles are combined into adjacent phrases, candidate entities are then determined from the adjacent phrases according to the closeness probabilities of their participles, and the entity type of each candidate entity is determined; when the entity type of a candidate entity is the target entity type, that candidate entity is taken as a target entity. Entity identification is thus divided into two independent tasks: candidate entity generation and entity type prediction. First, the closeness probabilities of adjacent participles in the text are determined and the participles are combined into multiple adjacent phrases, which include nested entities; whether each adjacent phrase can serve as a candidate entity is then predicted from the closeness probabilities, completing candidate entity generation. Next, the entity type of each candidate entity is determined, and a candidate whose type matches the target entity type becomes a target entity, completing entity type prediction. Because candidacy is predicted from closeness probabilities, candidate entities can be recognized without attaching position labels to the participles and without training a model through a complex training process, so entity recognition, including recognition of nested entities, becomes simpler and more efficient.
Drawings
FIG. 1 is a diagram of an example embodiment of an application environment for an entity identification method;
FIG. 2 is a flow diagram illustrating a method for entity identification in one embodiment;
FIG. 3 is a schematic diagram of the network structure of LabelNet in one embodiment;
FIG. 4 is a schematic diagram of a NestedNER network in one embodiment;
FIG. 5 is a flowchart illustrating an entity identification method according to another embodiment;
FIG. 6 is a diagram illustrating the network structure of ENestedNER in one embodiment;
FIG. 7 is a block diagram of an entity identification apparatus in one embodiment;
FIG. 8 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
FIG. 1 is a diagram of an embodiment of an application environment of the entity identification method. Referring to fig. 1, the entity identification method is applied to an entity identification system. The entity identification system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
The terminal 110 may send the text to be recognized to the server 120, and the server 120 may obtain the word segmentation of the text to be recognized, and execute the entity recognition method in the embodiment of the present application to obtain the target entity.
As shown in FIG. 2, in one embodiment, an entity identification method is provided. The embodiment is mainly illustrated by applying the method to the server 120 in fig. 1. Referring to fig. 2, the entity identification method specifically includes the following steps:
s202, acquiring word segmentation of the text to be recognized.
In natural language processing, it is often necessary to perform entity recognition, that is, to recognize the specific type of a specific word in a text, in order to subsequently determine the meaning of the word, recognize the intention of the person who entered the text, and so on. Specifically, the entity identification method may define the target entity type to be identified in advance, and the definition depends on the identification field and application scenario. For example, for a dish recommendation scenario, the target entity type may be defined as dish names; for a route navigation scenario, it may be defined as place names.
In this embodiment, the text on which entity recognition is to be performed is referred to as the text to be recognized. The text to be recognized is generally a sentence composed of several participles; for example, the text "I love eating braised fish" is composed of the participles "I", "love eating", "braised", and "fish".
This embodiment does not limit how the text to be recognized is segmented: it may be segmented by a word segmentation tool or by other methods.
When named entity recognition is needed, the terminal can send an entity recognition request to the recognition system of the server; after receiving the request, the entity recognition system parses it to obtain the text to be recognized and then performs word segmentation on it.
S204, determining the closeness probability between adjacent participles.
In the related scheme, entity identification is divided into two tasks: entity boundary prediction and entity type prediction. Specifically, the participles of a text are annotated under a label schema (a label representation scheme), so each participle carries a label comprising a position label and a type label. Participles can be combined into candidate entities through the position labels, and the entity type of a candidate entity can be identified through the type labels; if the entity type of the candidate entity is the target entity type, the candidate entity can be taken as the target entity.
Specifically, a label representation commonly used in entity recognition is the BIO (B-begin, I-inside, O-outside) labeling scheme: B-X indicates that the participle begins an entity of type X, I-X indicates that the participle is inside an entity of type X, and O indicates that the participle does not belong to any entity.
For example, the labels of the participles in the text "I love eating braised fish" are:
I: O, meaning "I" is not part of an entity
love eating: O, meaning "love eating" is not part of an entity
braised: B_DISH, meaning "braised" is the beginning of a dish entity
fish: I_DISH, meaning "fish" is an inside part of a dish entity
The two most important kinds of information in an entity recognition task are boundary information (entity boundary prediction) and type information (entity type prediction), and the model in the related scheme learns the boundary and the type of an entity jointly during training. For example, in the LSTM + CRF model, where LSTM is a Long Short-Term Memory network and CRF is a conditional random field, boundary information and type information are expressed by the unified BIO labeling scheme: the label B_DISH says that the current word is both part of a dish entity (type) and the beginning of a dish entity (boundary). In a nested entity recognition task, however, one word may correspond to several different position labels: "beef" in "Chaoshan beef hot pot" is both an I_DISH of "Chaoshan beef hot pot" and the B_DISH of the nested entity "beef hot pot". To recognize nested entities, the label representation would have to be modified.
In this embodiment, the entity recognition task is instead abstracted into two independent tasks: candidate entity generation and entity type prediction. Candidate entity generation replaces the entity boundary prediction task of the related scheme, and this embodiment does not need to add boundary information to the participles, that is, it does not need to attach position labels to them.
In this embodiment, a model supporting multi-granularity candidate entities is first trained to support nested entity recognition. The model is called LabelNet; its network structure is shown in FIG. 3 and comprises a dictionary matching layer (3.1, phrase embedding), a distributed representation layer (3.2, word embedding), a feature representation layer (3.3, BiLSTM), a semantic matching layer (3.4, phrase representation), and a binary classification layer (3.5, sigmoid). Entity recognition based on LabelNet proceeds as follows:
3.1 Dictionary matching layer: candidate entities K1, K2, K3, K4 are generated from the text to be recognized by string matching against a curated proprietary dictionary, mapped into vectors c1, c2, c3, c4, and passed to the semantic matching layer (3.4). For example, matching the text "Chaoshan beef hot pot franchise brand" against the dictionary can yield the candidate entities K1: Chaoshan beef, K2: Chaoshan beef hot pot, K3: hot pot franchise, K4: brand.
3.2 Distributed representation layer: the participles of the text, W1 to W5, are mapped into vectors and passed to the feature representation layer (3.3). For example, the participles of "Chaoshan beef hot pot franchise brand" are W1: Chaoshan, W2: beef, W3: hot pot, W4: franchise, W5: brand.
3.3 Feature representation layer: probability values of the participles are computed from the vectorized participles W1 to W5 and passed to the semantic matching layer (3.4), where L denotes the context features to the left of a participle, R the context features to its right, and C the combination of both sides.
3.4 Semantic matching layer: the average probability value AVG of each candidate entity is determined from the participle probability values, giving AVG1 to AVG4, which are passed to the classification layer (3.5). AVG1 is the average probability of the combination of W1 and W2 (K1), AVG2 of W1, W2, and W3 (K2), AVG3 of W3 and W4 (K3), and AVG4 of W5 (K4).
3.5 Binary classification layer: from the vectors c1 to c4 of K1 to K4 and the corresponding average probabilities AVG1 to AVG4, it is determined whether the entity type of each candidate entity is the target entity type, outputting 1 if so and 0 otherwise.
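As a minimal sketch of layer 3.1, dictionary-based candidate generation is plain substring matching against a curated domain dictionary. The dictionary entries below are illustrative assumptions, not a shipped resource:

```python
# Hedged sketch of the dictionary matching layer (3.1): generate candidate
# entities by matching dictionary entries against the text to be recognized.
DICTIONARY = {"Chaoshan beef", "Chaoshan beef hot pot", "hot pot franchise", "brand"}

def dictionary_candidates(text: str, dictionary: set) -> list:
    """Return every dictionary entry that occurs in the text."""
    return sorted(entry for entry in dictionary if entry in text)

print(dictionary_candidates("Chaoshan beef hot pot franchise brand", DICTIONARY))
# ['Chaoshan beef', 'Chaoshan beef hot pot', 'brand', 'hot pot franchise']
```

This also makes the limitation noted next concrete: any phrase absent from DICTIONARY can never become a candidate.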
LabelNet enables multi-granularity entity identification, and since multi-granularity entities include nested entities, it enables nested entity identification. Candidate entity generation in LabelNet is based on dictionary matching, so boundary prediction does not depend on position labels, and the classification layer can determine whether a candidate entity is the target entity. However, candidate entity generation in LabelNet depends entirely on the dictionary; the limited coverage of the dictionary leaves the model without generalization ability, and candidate entities absent from the dictionary cannot be identified.
Therefore, this embodiment further proposes computing, during candidate entity generation, the closeness probabilities between adjacent participles of the text to be recognized. For example, segmenting the text "I like to eat wife cookies" yields five participles "I", "like", "eat", "wife", "cookies" and four closeness probabilities s_ab, s_bc, s_cd, s_de, where s_ab is the closeness probability of "I" and "like", s_bc that of "like" and "eat", s_cd that of "eat" and "wife", and s_de that of "wife" and "cookies".
In a specific implementation, the closeness probability between adjacent participles reflects how tightly they are bound in the text to be recognized: the greater the closeness probability, the more likely the adjacent participles can be combined into a phrase.
S206, combining the participles to obtain adjacent phrases.
In this embodiment, the participles of the text to be recognized can be combined into adjacent phrases, where the adjacent phrases are obtained by enumeration. For example, for the participles of "I love eating braised fish", adjacent phrases such as "I love eating", "love eating braised", "braised fish", "love eating braised fish", and "I love eating braised fish" can be obtained.
Of course, some of the enumerated adjacent phrases carry little meaning, and some low-quality adjacent phrases can be filtered out through heuristic rules or other means; for the example above, the adjacent phrase "braised fish" can finally be obtained after filtering.
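As a sketch, the enumeration step amounts to listing every contiguous span of participles; the minimum span length of two used here is an assumption for illustration:

```python
# Sketch: enumerate adjacent phrases as all contiguous participle spans.
tokens = ["I", "love eating", "braised", "fish"]

def enumerate_spans(tokens, min_len=2):
    """Yield every contiguous span of participles with at least min_len members."""
    n = len(tokens)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            yield tokens[start:end]

for span in enumerate_spans(tokens):
    print(" ".join(span))
# I love eating
# I love eating braised
# ... through to: braised fish
```

The quadratic number of spans is why the quality filtering described next matters in practice.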
S208, determining candidate entities from the adjacent phrases according to the closeness probabilities of their participles.
In this embodiment, after the adjacent phrases are obtained, in order to further verify whether their quality meets the requirement, candidate entities can be determined from the adjacent phrases according to the closeness probabilities of their participles.
The closeness probability reflects the degree of closeness between adjacent participles: the greater it is, the more likely the adjacent participles can be combined into a phrase. Taking the text to be recognized "I love eating octopus small balls" as an example, Table 1 below contrasts two label representations, where B stands for Break (the two participles are disconnected) and T stands for Tie (the two participles are tightly connected).
Table 1:

Participle      I    love eating    octopus    small     balls
BIO             O    O              B_DISH     I_DISH    I_DISH, B_DISH
Tie or Break    B    B              B          T         T
As Table 1 shows, under the BIO labeling scheme of the related scheme, "balls" would carry two labels, I_DISH and B_DISH: it is the middle part of one entity and the beginning of another, i.e., part of a nested entity, and recognizing such nested entities would require models obtained through a complex process. This embodiment instead uses Tie or Break prediction and determines candidate entities from the closeness probabilities between adjacent participles: if the closeness probability of a pair of adjacent participles satisfies a preset condition, the pair can be tied and labeled "T" (see Table 1); otherwise it is labeled "B". It can then be determined that "octopus small balls" can be combined. Since candidate entities are generated from closeness probabilities, this embodiment can predict them whether or not they are nested entities.
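A minimal sketch of this Tie or Break decision, assuming the pairwise closeness probabilities come from a prediction model and using an illustrative 0.5 cut-off:

```python
# Sketch: turn pairwise closeness probabilities into Tie/Break labels and
# read off the maximal runs of tied participles as combinable phrases.
# Probabilities are made-up values for "I love eating octopus small balls".
tokens = ["I", "love eating", "octopus", "small", "balls"]
closeness = [0.05, 0.10, 0.80, 0.85]  # one value per adjacent pair (assumed)

labels = ["T" if p >= 0.5 else "B" for p in closeness]  # Tie vs. Break

def tied_runs(tokens, labels):
    """Group participles whose connecting label is 'T'."""
    runs, current = [], [tokens[0]]
    for token, label in zip(tokens[1:], labels):
        if label == "T":
            current.append(token)
        else:
            runs.append(current)
            current = [token]
    runs.append(current)
    return runs

print(tied_runs(tokens, labels))
# [['I'], ['love eating'], ['octopus', 'small', 'balls']]
```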
Specifically, in this embodiment, the quality score of an adjacent phrase can be defined from the closeness probabilities of its adjacent participles, and adjacent phrases reaching a preset quality threshold are taken as candidate entities. For example, if the threshold is 0.5, an adjacent phrase whose quality score reaches 0.5 can be taken as a candidate entity.
In one embodiment, step S208 includes: processing the closeness probabilities of the participles of an adjacent phrase through a quality model to obtain a quality score; and, when the quality score reaches a preset threshold, taking the adjacent phrase as a candidate entity. The quality model is trained on a preset network model using collected quality training samples and processes the closeness probabilities of the participles of an input adjacent phrase to obtain its quality score; each quality training sample comprises the closeness probabilities of the participles of an adjacent phrase and the corresponding quality score.
In a specific implementation, when determining the quality score of an adjacent phrase, the closeness probabilities of its participles can be input into the quality model, which processes them to produce the score. The quality model is obtained by training a preset network model on the collected quality training samples: samples comprising adjacent phrases' participle closeness probabilities and quality scores are fed into the preset network model for training, and the resulting quality model maps input closeness probabilities to quality scores.
In this embodiment, the quality scores of adjacent phrases are predicted by feeding their closeness probabilities into the quality model, so candidate entities can be determined from the adjacent phrases according to the quality scores.
Of course, this embodiment may determine the quality score of an adjacent phrase in ways other than a quality model. In an optional embodiment, S208 specifically includes: acquiring the closeness probabilities of the participles of the adjacent phrase; calculating their mean; determining the participles at the boundary of the adjacent phrase; acquiring the closeness probabilities between each boundary participle and its neighboring participle outside the phrase; selecting the maximum of those boundary closeness probabilities; subtracting that maximum from the mean to obtain the quality score of the adjacent phrase; and, when the quality score reaches a preset threshold, taking the adjacent phrase as a candidate entity.
Optionally, besides predicting quality scores with the quality model, this embodiment can also compute them with a formula. In one embodiment, the quality score of an adjacent phrase is computed as:
score(candidate_entity) = avg(inner_tight) - max(left_tight, right_tight)
In this formula, score(candidate_entity) is the quality score of the adjacent phrase, avg(inner_tight) is the mean of the closeness probabilities between the participles inside the phrase, and max(left_tight, right_tight) is the maximum of the closeness probabilities between the phrase's boundary participles and their neighboring participles outside the phrase. The boundary participles are the leftmost and rightmost participles of the adjacent phrase; for example, in the adjacent phrase "braised lion head", "braised" is the leftmost participle and "head" the rightmost.
For example, suppose the adjacent phrase under evaluation among the participles of the text to be recognized "the taste of Shaoxing braised lion head" is "braised lion head". Its quality score is the mean of the closeness probabilities of <braised, lion> and <lion, head>, minus the maximum of the closeness probabilities of <Shaoxing, braised> and <head, taste>. Adjacent phrases whose quality score exceeds a certain value can be taken as candidate entities.
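A direct implementation of this formula, under the assumption that the pairwise closeness probabilities are already available (all numbers below are made up):

```python
# Sketch: quality score of an adjacent phrase per the formula above:
# score = avg(inner closeness) - max(closeness across the left/right boundary).
def quality_score(inner_tight, left_tight, right_tight):
    """inner_tight: closeness probabilities between participles inside the
    phrase; left_tight/right_tight: closeness of the boundary participles
    to their outside neighbors (use 0.0 at the edge of the text)."""
    return sum(inner_tight) / len(inner_tight) - max(left_tight, right_tight)

# "Shaoxing | braised | lion | head | taste", candidate "braised lion head"
inner = [0.82, 0.90]        # <braised, lion>, <lion, head>  (made-up values)
left, right = 0.30, 0.15    # <Shaoxing, braised>, <head, taste>
print(quality_score(inner, left, right))  # ~0.56 -> kept if threshold is 0.5
```

The subtraction rewards phrases that are internally cohesive but loosely attached to their surroundings, which is the intuition behind the formula.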
Of course, besides the quality model and the calculation formula above, other ways of computing a quality score from the closeness probabilities of an adjacent phrase may be used to decide whether it is a candidate entity; for example, if the closeness probabilities of the adjacent phrase reach a preset threshold, its participles may be combined. This embodiment places no limit on this. The way adjacent phrases are determined in this embodiment no longer relies on a dictionary and therefore has generalization ability.
S210, determining the entity type of the candidate entity.
In specific implementations, the entity type of a candidate entity can be determined by a classification model. Note that in this embodiment the entity type prediction task and the candidate entity generation task can be implemented by different models, which may be trained separately or jointly; this embodiment does not limit this.
In one embodiment, the determining the entity type of the candidate entity comprises: inputting the candidate entity into a classification model to obtain its entity type; the classification model is trained on a preset network model using collected classification training samples and processes input candidate entities to obtain entity types; each classification training sample comprises a candidate entity and the corresponding entity type.
In a specific implementation, when determining the entity type of a candidate entity, the candidate entity can be input into the classification model, which processes it to determine the type. The classification model is obtained by training a preset network model on the collected classification training samples: samples comprising candidate entities and their entity types are fed into the preset network model for training, and the resulting classification model maps input candidate entities to entity types.
In this embodiment, the classification model for candidate entities is separated from the quality model that generates them, which keeps labeling training data and training the models simple and efficient.
S212, when the entity type of the candidate entity is the target entity type, taking the candidate entity as the target entity.
In a specific implementation, the target entity can be determined from the candidate entities according to the actual task requirements; for a given candidate entity, whether it is a target entity can be decided by whether its entity type is the target entity type.
For example, if the entity type of a candidate entity is a place name and the target entity type is also place names, the candidate entity can be taken as the target entity. Likewise, if the target entity type is "catering" and the entity type of the candidate entity is "dish name", the candidate entity can be taken as the target entity, since "dish name" falls under "catering".
In this embodiment, entity identification is divided into two independent tasks, candidate entity generation and entity type prediction, so the labeling and training processes of the models become simple and efficient and meet actual recognition requirements.
To help those skilled in the art better understand this embodiment, a specific example follows. Building on LabelNet, this embodiment proposes a new network structure called NestedNER, whose idea is to divide the entity recognition task into two steps: candidate entity generation plus entity type prediction. Unlike LabelNet, NestedNER generates candidate entities from model predictions rather than dictionary matching. Its network structure is shown in FIG. 4: for candidate entity generation, a Tie or Break prediction layer is introduced, which models the Tie and Break relations of two adjacent participles and predicts how tightly they are bound.
Referring to FIG. 4, unlike the related scheme, this embodiment predicts the closeness probability between two adjacent participles of the text to be recognized instead of a probability value for a single participle. NestedNER comprises a distributed representation layer (4.1, word embedding), a feature representation layer (4.2, BiLSTM), a Tie or Break prediction layer (4.3), and a classification layer (4.4, typing layer). Entity recognition based on NestedNER proceeds as follows:
4.1 Distributed representation layer: the participles of the text, W1 to W5, are mapped into vectors and passed to the feature representation layer (4.2).
4.2 Feature representation layer: from the vectorized participles W1 to W5, the closeness probability between adjacent participles is predicted and passed to the Tie or Break prediction layer (4.3).
4.3 Tie or Break prediction layer: prediction scores tight_12, tight_23, tight_34, tight_45 are generated from the closeness probabilities between adjacent participles; quality scores of the adjacent phrases are then obtained from these prediction scores, adjacent phrases whose quality score exceeds a preset threshold are taken as candidate (nested) entities, such as W1W2W3, W1W2, W2W3W4, and W5, and the candidates are passed to the classification layer (4.4). The quality score may be produced by a quality model or by the quality score formula. The closeness probabilities from the feature representation layer lie in (0, 1); after they are fed into the Tie or Break prediction layer, a prediction score in (-1, 1) is obtained for each adjacent pair, where a value greater than 0 indicates the adjacent participles are tied (Tie) and a value less than 0 indicates they are broken apart (Break).
4.4 Classification layer: a (k + 1)-way classification determines the entity type of each candidate entity, where k is the number of target entity types and the extra class is NONE; for instance, the entity type of a candidate entity can be a dish name (DISH) or no type (NONE). If the entity type of a candidate entity is the target entity type, the candidate can be taken as the target entity: if its type is a place name and the target type is also place names, it becomes a target entity, and likewise if its type is a dish name and the target type is catering.
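The following is a hedged sketch of a NestedNER-style network following the layer names in FIG. 4; the embedding and hidden sizes, the pairwise scorer, and the span pooling are illustrative assumptions rather than the patented architecture:

```python
# Hedged sketch of a NestedNER-style network (layer names per FIG. 4).
import torch
import torch.nn as nn

class NestedNERSketch(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=64, num_types=1):
        super().__init__()
        # 4.1 distributed representation layer
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # 4.2 feature representation layer
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                              batch_first=True)
        # 4.3 Tie or Break prediction layer: score each adjacent participle
        # pair in (-1, 1); > 0 reads as Tie, < 0 as Break
        self.pair_scorer = nn.Sequential(nn.Linear(4 * hidden, 1), nn.Tanh())
        # 4.4 classification layer: k + 1 classes (k target types plus NONE)
        self.typing = nn.Linear(2 * hidden, num_types + 1)

    def forward(self, token_ids):
        h, _ = self.bilstm(self.embedding(token_ids))        # (B, T, 2*hidden)
        pairs = torch.cat([h[:, :-1, :], h[:, 1:, :]], dim=-1)
        tightness = self.pair_scorer(pairs).squeeze(-1)      # (B, T-1)
        # span typing sketch: type a span by its mean representation; a real
        # model would first keep only spans passing the quality filter
        span_repr = h.mean(dim=1)
        return tightness, self.typing(span_repr)

model = NestedNERSketch()
tightness, type_logits = model(torch.randint(0, 1000, (1, 5)))
print(tightness.shape, type_logits.shape)  # (1, 4) and (1, 2)
```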
In the entity recognition task above, after the participles of the text to be recognized are obtained, the closeness probability between adjacent participles is determined, the participles are combined into adjacent phrases, candidate entities are determined from the adjacent phrases according to the closeness probabilities of their participles, and the entity type of each candidate entity is determined; when the entity type of a candidate entity is the target entity type, it is taken as a target entity.
In the related scheme, entity recognition is divided into entity boundary prediction and entity type prediction: boundary prediction rests on labels attached to the participles of the text, comprising position labels and type labels; participles are combined into entities through the position labels, and entity types are recognized through the type labels. The entity identification of this embodiment is instead divided into candidate entity generation and entity type prediction: the closeness probabilities of adjacent participles are determined first, the participles are combined into multiple adjacent phrases, which include nested entities, and whether each adjacent phrase can serve as a candidate entity is predicted from the closeness probabilities, completing candidate entity generation. Candidate entities can thus be recognized without attaching position labels to the participles, so entity recognition, including recognition of nested entities, is achieved without training a model through a complex training process; this simplifies the process and improves efficiency.
In another embodiment, as shown in FIG. 5, an entity identification method is provided. The embodiment is mainly illustrated by applying the method to the server 120 in fig. 1. Referring to fig. 5, the entity identification method specifically includes the following steps:
and S502, acquiring word segmentation of the text to be recognized.
S504, determining the close probability between the adjacent participles.
S506, combining the participles to obtain an adjacent phrase.
In this embodiment, the closeness probability between each pair of adjacent participles of the text to be recognized is obtained, adjacent phrases are then obtained by combining the participles, and candidate entities are determined according to the closeness probabilities corresponding to the adjacent phrases.
Because multiple adjacent phrases of different granularities can be combined from the participles, such as fine-grained phrases and coarse-grained phrases, where a fine-grained phrase is composed of more participles, a coarse-grained phrase of fewer, and a fine-grained phrase contains coarse-grained phrases, this embodiment can obtain candidate entities of multiple granularities, including candidate entities that are nested entities. Finally, the entity type of each candidate entity can be predicted, and a candidate is determined to be a target entity when its type is the target entity type.
In this embodiment, nested entities, including fine-grained and coarse-grained entities, can be identified based on the NestedNER shown in FIG. 4; Table 2 gives examples of nested entities in some dish names:
Table 2: (examples of nested dish-name entities, such as hot pot / beef hot pot / Chaoshan beef hot pot; the table itself is reproduced only as images in the original publication)
Applying the entity identification method of this embodiment thus identifies entities of multiple granularities. In practical applications, it is very necessary to correctly identify nested entities of different granularities, and different applications have different granularity requirements: when the entity recognition result is used for targeting, an operator tends to select coarser-grained entities (hot pot, beef hot pot), whereas when the recognition result is fed into a model as a feature, entities of all granularities (hot pot, beef hot pot, Chaoshan beef hot pot) help improve the model's effect.
The advantage of NestedNER over models based on the BIO labeling scheme is that its label representation and the sample labeling of nested entities are easy, and it has a certain generalization ability. However, when NestedNER generates multi-granularity candidate entities, the predicted closeness probability considers only the cohesion and completeness of a candidate entity and ignores its information content as prior information. In general, a high-quality candidate entity should meet the following three criteria:
internal degree of coagulation (Concordance): refers to the degree of association between the participles of the candidate entity. For example, beef chafing dish > beef chafing dish eating method, and the degree of congealing of "beef chafing dish" and "eating method" is low.
Integrity (completensiss): whether the candidate entity is a complete phrase. For example, Liuzhou snail powder > Liuzhou snail, "Liuzhou snail" is not an integrated phrase.
Information amount (informational): the measure of the information of the candidate entity, such as "reading the original text in the morning today", has relatively no information amount, and is not suitable as the candidate entity, and further such as "snail in Liuzhou", can reflect one thing in reality and is suitable as the candidate entity.
Combining these three criteria for high-quality candidate entities, this embodiment builds on the predicted closeness probability (which provides generalization ability) and mainly introduces features related to information content, such as external feature resources, to train a candidate entity quality model that predicts candidate quality and filters out low-quality candidates. This is equivalent to combining local information (the closeness probabilities) with global information (the introduced external feature resources). Specifically, high-quality candidate entities can be screened out through the following steps S508 to S512.
S508, acquiring external feature resources of the adjacent phrases, the external feature resources being acquired from the Internet using the adjacent phrases and reflecting their information content.
External feature resources are feature resources acquired for an adjacent phrase through external means or external resources, where the external means can include an Internet search engine and the external resources can include Internet articles: for example, feature resources obtained by submitting the adjacent phrase to a search engine, or feature resources extracted for the adjacent phrase from articles.
Specifically, the external feature resources may include at least one of search frequency (qv), inverse text frequency index (idf), adjacent-phrase features, the number of participles in the adjacent phrase, and search result page features, where the search result page features at least comprise the results returned by a search engine for the adjacent phrase.
Specifically, the search result page features describe, say, the ten first-page results returned by a search engine for the adjacent phrase: whether those results contain advertisements, video resources, encyclopedia entries, and so on. For example, for the adjacent phrase "anecdotal east west", the search result page features among its external feature resources include whether an encyclopedia entry exists for the phrase, whether a video link appears on the first page of a popular search engine, and so on.
It can be understood that the more external feature resources an adjacent phrase has, the more information it carries and the more likely it reflects something real; therefore, introducing external feature resources when screening candidate entities from adjacent phrases can improve candidate quality.
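For illustration, the external feature resources of one adjacent phrase can be collected into a flat record; every field name below is a hypothetical label for the signals listed above (qv, idf, participle count, search result page signals), not an API of any real search engine:

```python
# Sketch: assemble external feature resources for one adjacent phrase.
# All field names and values are illustrative; actually fetching qv/idf/SERP
# signals from a search engine or corpus is outside this sketch.
from dataclasses import dataclass, asdict

@dataclass
class ExternalFeatures:
    qv: float               # search frequency of the phrase
    idf: float              # inverse text frequency index
    num_participles: int    # number of participles in the phrase
    has_encyclopedia: bool  # encyclopedia entry among top search results?
    has_video: bool         # video link on the first result page?

features = ExternalFeatures(qv=1.2e4, idf=6.3, num_participles=2,
                            has_encyclopedia=True, has_video=True)
print(asdict(features))
```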
S510, determining candidate entities from the adjacent phrases according to the closeness probabilities of the participles corresponding to the adjacent phrases and the external feature resources.
In this embodiment, a quality model is introduced. The quality model computes a quality score based on the closeness probabilities of the participles corresponding to an adjacent phrase and on its external feature resources, and candidate entities may then be determined from the adjacent phrases according to the quality scores.
In an embodiment, S510 specifically includes: processing, through a quality model, the closeness probabilities of the participles corresponding to the adjacent phrase and the external feature resources to obtain a quality score; and, when the quality score reaches a preset threshold, taking the adjacent phrase as a candidate entity.
The quality model is trained on a preset network model using collected quality training samples and is used to process the input closeness probabilities of the participles corresponding to an adjacent phrase and the external feature resources to obtain the quality score of the adjacent phrase. A quality training sample includes the input closeness probabilities of the participles corresponding to an adjacent phrase, the external feature resources, and the corresponding quality score.
In a specific implementation, when determining the quality score of an adjacent phrase, the closeness probabilities of its corresponding participles and its external feature resources are input into the quality model, which processes them to produce the quality score. To obtain the quality model, training samples containing these inputs are fed into a preset network model for training; the trained quality model is then used to compute the quality score for each adjacent phrase.
In this embodiment, the quality scores of the adjacent phrases are predicted by inputting their closeness probabilities and external feature resources into the quality model, so that candidate entities can be determined from the adjacent phrases according to the quality scores. Because the external feature resources reflect the informativeness of the adjacent phrases, the resulting candidate entities are of high quality.
It should be noted that, in addition to the closeness probability and the external feature resources, other data such as left-right entropy and language model scores may also be input into the quality model when generating the quality score, further improving candidate entity quality; this embodiment does not limit this.
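As a concrete but non-authoritative illustration, the quality model below is sketched as a binary classifier whose positive-class probability serves as the quality score. The patent only requires "a preset network model", so the logistic-regression learner, the feature layout, and the synthetic training data are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def phrase_feature_vector(closeness_probs, ext):
    """Concatenate closeness statistics with external features (assumed layout)."""
    return np.array([
        np.mean(closeness_probs),             # average internal closeness
        np.min(closeness_probs),              # weakest internal link
        ext.qv, ext.idf, float(ext.n_tokens),
        float(ext.has_encyclopedia), float(ext.has_video),
    ])

# Tiny synthetic sample in place of real labeled data: rows are feature
# vectors, labels mark phrases judged to be good candidate entities.
rng = np.random.default_rng(0)
X_train = rng.random((200, 7))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 3] > 0.9).astype(int)  # toy rule

quality_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def filter_candidates(phrases, feature_rows, threshold=0.5):
    """Keep adjacent phrases whose predicted quality score reaches the threshold."""
    scores = quality_model.predict_proba(np.asarray(feature_rows))[:, 1]
    return [p for p, s in zip(phrases, scores) if s >= threshold]
```

Extra signals such as left-right entropy or a language model score would simply extend the feature vector in this design.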
S512, determining the entity type of the candidate entity.
S514, when the entity type of the candidate entity is the target entity type, taking the candidate entity as the target entity.
In this embodiment, after high-quality candidate entities are screened from the adjacent phrases based on the closeness probabilities of the corresponding participles and the external feature resources, the entity type of each candidate entity is determined; it is then judged whether that type is the target entity type, and the candidate entity is output as the target entity when it is.
For example, suppose the text to be recognized contains "I want to eat red-cooked fish in Guangzhou". The candidate entities "Guangzhou" and "red-cooked fish" can be recognized from the text, where the entity type of "Guangzhou" is place name and the entity type of "red-cooked fish" is dish name. If the target entity type is dish name, the target entity is determined to be "red-cooked fish"; if the target entity type is place name, the target entity is determined to be "Guangzhou".
Because external feature resources are introduced when determining the candidate entities, this embodiment considers not only the internal cohesion and completeness of the candidate entities but also their informativeness, so the finally obtained candidate entities are cohesive, complete and informative, unreasonable candidates are avoided, and candidate entity quality is high.
To help those skilled in the art better understand this embodiment, a specific example is described below. In this embodiment, an Entity Span Quality model is introduced into the NestedNER structure, which is equivalent to adding an entity span quality layer between the Tie-or-Break prediction layer and the classification layer. The resulting network structure, shown in fig. 6, is called ENestedNER.
The ENestedNER network structure amounts to multi-task learning: on the basis of shared word and character representations, the model of each layer is trained relatively independently, which keeps the training process simple and allows each model to conveniently add task-related features according to the characteristics of its task. For example, the quality model may be trained together with the other models, or a pre-trained quality model may be used directly without participating in the ENestedNER training process; this embodiment does not limit this.
Specifically, ENestedNER includes a 6.1 distributed representation layer (word embedding), a 6.2 feature representation layer (BiLSTM), a 6.3 Tie-or-Break prediction layer, a 6.4 quality layer, and a 6.5 classification layer (typing layer). Entity recognition based on ENestedNER proceeds as follows:
6.1 Distributed representation layer: the participles of the text to be recognized, namely W1, W2, W3, W4 and W5, are mapped into vectors and passed to the 6.2 feature representation layer.
6.2 Feature representation layer: based on the vector representations of the participles W1, W2, W3, W4 and W5, the closeness probabilities between adjacent participles are predicted and passed to the 6.3 Tie-or-Break prediction layer.
6.3 Tie-or-Break prediction layer: prediction scores s12, s23, s34 and s45 are generated from the closeness probabilities between adjacent participles and input to the 6.4 quality layer. Specifically, the closeness probability obtained at the feature representation layer lies in the range (0, 1); after it is input to the Tie-or-Break prediction layer, a prediction score in the range (-1, 1) is obtained for each pair of adjacent participles, where a score greater than 0 indicates that the adjacent participles are tied (Tie) and a score less than 0 indicates that they are broken (Break).
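The patent fixes only the two value ranges and the sign convention, not the mapping between them; one natural affine mapping consistent with those ranges, given here purely as an assumption, is s = 2p - 1:

```python
def tie_or_break(p):
    """Map a closeness probability p in (0, 1) to a prediction score in (-1, 1).

    The affine form s = 2p - 1 is an assumption; the patent states only the
    value ranges and the sign convention (> 0 means Tie, < 0 means Break)."""
    s = 2.0 * p - 1.0
    return s, ("Tie" if s > 0 else "Break")

# e.g. tie_or_break(0.9) -> (0.8, "Tie"); tie_or_break(0.2) -> (-0.6, "Break")
```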
6.4 Quality layer: external feature resources are obtained for the adjacent phrases formed from the participles of the text to be recognized and are input, together with the prediction scores, into the quality model of the quality layer to compute the quality score of each adjacent phrase. Adjacent phrases whose quality scores are greater than a preset threshold are taken as candidate (nested) entities (such as W1W2W3, W1W2, W2W3W4 and W5) and passed to the 6.5 classification layer.
6.5 Classification layer: a (k+1)-way classification is performed to determine the entity type of each candidate entity, where k is the number of target entity types and the extra class is "no type" (NONE). If the entity type of a candidate entity is the target entity type, the candidate entity may be taken as the target entity. For example, if the entity type of a candidate entity is place name and the target entity type is also place name, the candidate entity may be taken as the target entity.
The classification layer of this embodiment is a multi-class layer, i.e., multiple entity types can be identified. Optionally, NestedNER may be further modified according to task requirements; for example, the classification layer may be changed from a multi-class layer to a multi-label classification layer, so that candidate entities with multiple entity types can be identified and output. For example, in a certain entity recognition task, the entity "smelly mandarin fish" belongs to both the aquatic product type and the Hui cuisine type; target entities of both types can be output simply by changing the classification layer from multi-class to multi-label.
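A structural sketch of the five layers just described is given below in PyTorch. The patent fixes only the layer sequence; the dimensions, the span representation (mean pooling of token states), the affine score mapping, and the way external features enter the quality layer are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ENestedNERSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256, n_ext_feats=7, k_types=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)      # 6.1 distributed representation
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)           # 6.2 feature representation
        self.tie_break = nn.Linear(4 * hidden, 1)           # 6.3 Tie-or-Break over token pairs
        self.quality = nn.Sequential(                       # 6.4 quality layer
            nn.Linear(2 * hidden + n_ext_feats + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())
        self.typing = nn.Linear(2 * hidden, k_types + 1)    # 6.5 (k+1)-way typing, +1 = NONE

    def forward(self, token_ids, spans, ext_feats):
        h, _ = self.bilstm(self.embed(token_ids))                 # (B, T, 2*hidden)
        pairs = torch.cat([h[:, :-1], h[:, 1:]], dim=-1)          # adjacent token pairs
        closeness = torch.sigmoid(self.tie_break(pairs)).squeeze(-1)  # in (0, 1)
        tie_scores = 2.0 * closeness - 1.0                        # in (-1, 1): >0 Tie, <0 Break

        span_reprs, span_close = [], []
        for b, i, j in spans:                                     # span = participles i..j
            span_reprs.append(h[b, i:j + 1].mean(dim=0))          # assumed span pooling
            span_close.append(closeness[b, i:j].mean() if j > i   # mean internal closeness;
                              else closeness.new_tensor(1.0))     # single tokens default to 1
        span_reprs = torch.stack(span_reprs)
        span_close = torch.stack(span_close).unsqueeze(-1)
        quality = self.quality(torch.cat([span_reprs, span_close, ext_feats], dim=-1))
        types = self.typing(span_reprs)                           # logits over k types + NONE
        return tie_scores, quality.squeeze(-1), types
```

Here spans is a list of (batch index, start, end) triples for the adjacent phrases, and ext_feats holds one external feature row per span; at inference, spans whose quality output falls below the preset threshold would be dropped before typing.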
When evaluated on the constructed nested entity identification benchmark test set, the ENestedNER of this embodiment improves over LabelNet in both precision and recall, and especially in recall, as shown in Table 3:
Table 3:
Full test set    Precision    Recall    F1 (comprehensive metric)
LabelNet         61.6%        68.2%     64.7%
ENestedNER       63.2%        86.2%     72.9%
This improvement benefits from ENestedNER's easier-to-label training data and from the introduction of a more generalizable candidate entity generation layer (the Tie-or-Break prediction layer plus the quality layer). As the examples in Table 4 illustrate, ENestedNER can avoid some unreasonable results and has stronger model generalization capability.
Table 4:
[Example comparison results; provided as images in the original publication]
In summary, by modeling entity boundary prediction (candidate entity generation) and entity type prediction separately, this embodiment provides ENestedNER, a nested entity identification model with a flexible, simple and efficient network structure, so that nested entities can be recognized without a complex label representation or a complex model, and the implementation process is simple and efficient.
Fig. 2 and fig. 5 are schematic flowcharts of an entity identification method in one embodiment. It should be understood that although the steps in the flowcharts of fig. 2 and fig. 5 are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 and fig. 5 may include multiple sub-steps or stages that are not necessarily completed at the same moment but may be performed at different times, and these sub-steps or stages are not necessarily executed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
As shown in fig. 7, in one embodiment, an entity recognition apparatus 700 is provided, which includes a word segmentation obtaining module 702, a closeness probability determination module 704, an adjacent phrase combination module 706, a candidate entity determination module 708, an entity type determination module 710, and a target entity determination module 712, wherein:
a word segmentation obtaining module 702, configured to obtain the participles of the text to be recognized;
a closeness probability determination module 704, configured to determine the closeness probability between adjacent participles;
an adjacent phrase combination module 706, configured to combine the participles to obtain adjacent phrases;
a candidate entity determination module 708, configured to determine candidate entities from the adjacent phrases according to the closeness probabilities of the participles corresponding to the adjacent phrases;
an entity type determination module 710, configured to determine the entity type of the candidate entity;
a target entity determination module 712, configured to take the candidate entity as the target entity when the entity type of the candidate entity is the target entity type.
In one embodiment, the candidate entity determination module 708 comprises:
an external feature resource obtaining module, configured to obtain external feature resources of the adjacent phrases, where an external feature resource is obtained from the Internet using the adjacent phrase and reflects the informativeness of the adjacent phrase; and
a candidate entity screening module, configured to determine candidate entities from the adjacent phrases according to the closeness probabilities of the participles corresponding to the adjacent phrases and the external feature resources.
In one embodiment, the closeness probability determination module 704 includes:
a closeness probability calculation module, configured to process adjacent participles through a prediction model to obtain closeness probabilities;
where the prediction model is trained on a preset network model using collected relationship training samples and is used to process each pair of input adjacent participles to obtain a closeness probability, and a relationship training sample includes the input adjacent participles and the corresponding closeness probability.
In one embodiment, the target entity determination module 712 includes:
a first quality score calculation module, configured to process the closeness probabilities of the participles corresponding to the adjacent phrase through a quality model to obtain a quality score; and
a target entity determination module, configured to take the adjacent phrase as a candidate entity when the quality score reaches a preset threshold;
where the quality model is trained on a preset network model using collected quality training samples and is used to process the input closeness probabilities of the participles corresponding to the adjacent phrase to obtain the quality score of the adjacent phrase, and a quality training sample includes the input closeness probabilities of the participles corresponding to an adjacent phrase and the corresponding quality score.
In one embodiment, the candidate entity screening module includes:
a second quality score calculation module, configured to process the closeness probabilities of the participles corresponding to the adjacent phrase and the external feature resources through a quality model to obtain a quality score, and to take the adjacent phrase as a candidate entity when the quality score reaches a preset threshold;
where the quality model is trained on a preset network model using collected quality training samples and is used to process the input closeness probabilities of the participles corresponding to the adjacent phrase and the external feature resources to obtain the quality score of the adjacent phrase, and a quality training sample includes the input closeness probabilities of the participles corresponding to an adjacent phrase, the external feature resources, and the corresponding quality score.
In one embodiment, the target entity determination module 712 includes:
an entity type determination module, configured to input the candidate entity into a classification model to obtain the entity type of the candidate entity;
where the classification model is trained on a preset network model using collected classification training samples and is used to process input candidate entities to obtain entity types, and a classification training sample includes the input candidate entity and the corresponding entity type.
In one embodiment, the candidate entity determination module 708 includes:
a first closeness probability obtaining module, configured to obtain the closeness probabilities of the participles corresponding to the adjacent phrase;
a mean value calculation module, configured to calculate a mean closeness probability from the closeness probabilities of the participles corresponding to the adjacent phrase;
a boundary participle determination module, configured to determine the participles at the boundaries of the adjacent phrase;
a second closeness probability obtaining module, configured to obtain the closeness probabilities between the boundary participles and the participles outside the adjacent phrase;
a maximum closeness probability determination module, configured to select the maximum closeness probability from the closeness probabilities between the boundary participles and the participles outside the adjacent phrase;
a quality score calculation module, configured to subtract the maximum closeness probability from the mean closeness probability to obtain the quality score of the adjacent phrase; and
a candidate entity selection module, configured to take the adjacent phrase as a candidate entity when the quality score reaches a preset threshold (a minimal sketch of this scoring rule follows this list).
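The following is a minimal sketch of the scoring rule enumerated by the modules above, assuming closeness[t] denotes the closeness probability between participles t and t+1; this indexing convention and the example threshold are assumptions, not part of the patent:

```python
def boundary_quality_score(closeness, i, j):
    """Quality score of the adjacent phrase spanning participles i..j (inclusive):
    mean internal closeness minus the strongest closeness of a boundary
    participle with a participle outside the phrase."""
    internal = closeness[i:j]                  # links inside the phrase (needs j > i)
    mean_close = sum(internal) / len(internal)
    outside = []
    if i > 0:
        outside.append(closeness[i - 1])       # left boundary's link to the left
    if j < len(closeness):
        outside.append(closeness[j])           # right boundary's link to the right
    max_outside = max(outside) if outside else 0.0
    return mean_close - max_outside

# Example: closeness = [0.9, 0.8, 0.1] for participles W1..W4;
# the phrase W1W2W3 scores (0.9 + 0.8) / 2 - 0.1 = 0.75.
def select_candidates(closeness, spans, threshold=0.3):
    """Keep multi-participle spans whose score reaches the threshold."""
    return [(i, j) for i, j in spans
            if j > i and boundary_quality_score(closeness, i, j) >= threshold]
```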
After obtaining the participles of the text to be recognized in an entity recognition task, the entity recognition apparatus determines the closeness probabilities between adjacent participles and combines the participles into adjacent phrases; it then obtains the external feature resources of the adjacent phrases, determines candidate entities from the adjacent phrases according to the closeness probabilities of the corresponding participles, and finally takes a candidate entity as the target entity when its entity type is the target entity type. By dividing the entity recognition task into candidate entity generation and entity type prediction, nested entities can be recognized without a complex label representation or a complex model.
FIG. 8 is a diagram illustrating the internal structure of a computer device in one embodiment. The computer device may specifically be the server 120 in fig. 1. As shown in fig. 8, the computer device includes a processor, a memory, a network interface, an input device and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the entity identification method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the entity identification method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the entity recognition apparatus provided herein may be implemented in the form of a computer program executable on a computer device such as that shown in fig. 8. The memory of the computer device may store the program modules constituting the entity recognition apparatus, such as the word segmentation obtaining module 702, the closeness probability determination module 704, the adjacent phrase combination module 706, the candidate entity determination module 708, the entity type determination module 710 and the target entity determination module 712 shown in fig. 7. The computer program constituted by these program modules causes the processor to execute the steps of the entity identification methods of the embodiments of the present application described in this specification.
For example, the computer device shown in fig. 8 may perform the step of obtaining the participles of the text to be recognized through the word segmentation obtaining module 702 in the entity recognition apparatus shown in fig. 7. The computer device may determine the closeness probabilities between adjacent participles through the closeness probability determination module 704, and may combine the participles to obtain adjacent phrases through the adjacent phrase combination module 706. The computer device may determine candidate entities from the adjacent phrases according to the closeness probabilities of the corresponding participles through the candidate entity determination module 708, determine the entity type of each candidate entity through the entity type determination module 710, and, when the entity type of a candidate entity is the target entity type, take the candidate entity as the target entity through the target entity determination module 712.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above-described entity identification method. Here, the steps of the entity identification method may be steps in the entity identification methods of the above-described respective embodiments.
In one embodiment, a computer-readable storage medium is provided, in which a computer program is stored, which, when executed by a processor, causes the processor to perform the steps of the above-mentioned entity identification method. Here, the steps of the entity identification method may be steps in the entity identification methods of the above-described respective embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM) and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. An entity identification method, comprising:
acquiring word segmentation of a text to be recognized;
determining a closeness probability between adjacent said segments;
combining the word segments to obtain adjacent word groups;
determining candidate entities from the adjacent phrases according to the close probability of the corresponding word segmentation of the adjacent phrases;
determining an entity type of the candidate entity;
and when the entity type of the candidate entity is the target entity type, taking the candidate entity as the target entity.
2. The method of claim 1, wherein the determining candidate entities from the adjacent phrases according to the closeness probabilities of the participles corresponding to the adjacent phrases comprises:
acquiring external feature resources of the adjacent phrases, the external feature resources being obtained from the Internet using the adjacent phrases and reflecting the informativeness of the adjacent phrases; and
determining candidate entities from the adjacent phrases according to the closeness probabilities of the participles corresponding to the adjacent phrases and the external feature resources.
3. The method of claim 1, wherein the determining the closeness probability between adjacent ones of the participles comprises:
processing the adjacent participles through a prediction model to obtain the closeness probability;
wherein the prediction model is trained on a preset network model using collected relationship training samples and is used to process each pair of input adjacent participles to obtain a closeness probability, and a relationship training sample includes the input adjacent participles and the corresponding closeness probability.
4. The method of claim 1, wherein the determining candidate entities from the adjacent phrases according to the closeness probabilities of the participles corresponding to the adjacent phrases comprises:
processing the closeness probabilities of the participles corresponding to the adjacent phrase through a quality model to obtain a quality score; and
when the quality score reaches a preset threshold, taking the adjacent phrase as a candidate entity;
wherein the quality model is trained on a preset network model using collected quality training samples and is used to process the input closeness probabilities of the participles corresponding to the adjacent phrase to obtain the quality score of the adjacent phrase, and a quality training sample includes the input closeness probabilities of the participles corresponding to an adjacent phrase and the corresponding quality score.
5. The method of claim 2, wherein the determining candidate entities from the adjacent phrases according to the closeness probabilities of the participles corresponding to the adjacent phrases and the external feature resources comprises:
processing the closeness probabilities of the participles corresponding to the adjacent phrase and the external feature resources through a quality model to obtain a quality score; and
when the quality score reaches a preset threshold, taking the adjacent phrase as a candidate entity;
wherein the quality model is trained on a preset network model using collected quality training samples and is used to process the input closeness probabilities of the participles corresponding to the adjacent phrase and the external feature resources to obtain the quality score of the adjacent phrase, and a quality training sample includes the input closeness probabilities of the participles corresponding to an adjacent phrase, the external feature resources, and the corresponding quality score.
6. The method of claim 1, wherein the determining the entity type of the candidate entity comprises:
inputting the candidate entity into a classification model to obtain the entity type of the candidate entity;
wherein the classification model is trained on a preset network model using collected classification training samples and is used to process input candidate entities to obtain entity types, and a classification training sample includes the input candidate entity and the corresponding entity type.
7. The method of claim 1, wherein the determining candidate entities from the adjacent phrases according to the closeness probabilities of the participles corresponding to the adjacent phrases comprises:
acquiring the closeness probabilities of the participles corresponding to the adjacent phrase;
calculating a mean closeness probability from the closeness probabilities of the participles corresponding to the adjacent phrase;
determining the participles at the boundaries of the adjacent phrase;
acquiring the closeness probabilities between the boundary participles and the participles outside the adjacent phrase;
selecting the maximum closeness probability from the closeness probabilities between the boundary participles and the participles outside the adjacent phrase;
subtracting the maximum closeness probability from the mean closeness probability to obtain the quality score of the adjacent phrase; and
when the quality score reaches a preset threshold, taking the adjacent phrase as a candidate entity.
8. An entity identification apparatus comprising:
a word segmentation obtaining module, configured to obtain the participles of the text to be recognized;
a closeness probability determination module, configured to determine the closeness probability between adjacent participles;
an adjacent phrase combination module, configured to combine the participles to obtain adjacent phrases;
a candidate entity determination module, configured to determine candidate entities from the adjacent phrases according to the closeness probabilities of the participles corresponding to the adjacent phrases;
an entity type determination module, configured to determine the entity type of the candidate entity; and
a target entity determination module, configured to take the candidate entity as the target entity when the entity type of the candidate entity is the target entity type.
9. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.
10. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
CN202010031702.1A 2020-01-13 2020-01-13 Entity identification method, entity identification device, computer readable storage medium and computer equipment Active CN113111656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010031702.1A CN113111656B (en) 2020-01-13 2020-01-13 Entity identification method, entity identification device, computer readable storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010031702.1A CN113111656B (en) 2020-01-13 2020-01-13 Entity identification method, entity identification device, computer readable storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN113111656A true CN113111656A (en) 2021-07-13
CN113111656B CN113111656B (en) 2023-10-31

Family

ID=76708859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010031702.1A Active CN113111656B (en) 2020-01-13 2020-01-13 Entity identification method, entity identification device, computer readable storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN113111656B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090326923A1 * 2006-05-15 2009-12-31 Panasonic Corporation Method and apparatus for named entity recognition in natural language
CN107391485A (en) * 2017-07-18 2017-11-24 中译语通科技(北京)有限公司 Entity recognition method is named based on the Korean of maximum entropy and neural network model
CN107608960A (en) * 2017-09-08 2018-01-19 北京奇艺世纪科技有限公司 A kind of method and apparatus for naming entity link
CN109190124A (en) * 2018-09-14 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for participle
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN110008309A (en) * 2019-03-21 2019-07-12 腾讯科技(深圳)有限公司 A kind of short phrase picking method and device
CN110321550A (en) * 2019-04-25 2019-10-11 北京科技大学 A kind of name entity recognition method and device towards Chinese medical book document


Also Published As

Publication number Publication date
CN113111656B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
CN110837550B (en) Knowledge graph-based question answering method and device, electronic equipment and storage medium
Lukovnikov et al. Pretrained transformers for simple question answering over knowledge graphs
CN108733766B (en) Data query method and device and readable medium
US20170337260A1 (en) Method and device for storing data
US10776885B2 (en) Mutually reinforcing ranking of social media accounts and contents
RU2664481C1 (en) Method and system of selecting potentially erroneously ranked documents with use of machine training algorithm
KR102360309B1 (en) Apparatus and method for emotion classification based on artificial intelligence for online data
US9672475B2 (en) Automated opinion prediction based on indirect information
CN110457585B (en) Negative text pushing method, device and system and computer equipment
CN110909539A (en) Word generation method, system, computer device and storage medium of corpus
Chen et al. HHGN: A hierarchical reasoning-based heterogeneous graph neural network for fact verification
KR20200087977A (en) Multimodal ducument summary system and method
Yen et al. Multimodal joint learning for personal knowledge base construction from Twitter-based lifelogs
Guo et al. Proposing an open-sourced tool for computational framing analysis of multilingual data
Shi et al. EKGTF: A knowledge-enhanced model for optimizing social network-based meteorological briefings
Zhang et al. SKG-Learning: a deep learning model for sentiment knowledge graph construction in social networks
Ruas et al. NILINKER: attention-based approach to NIL entity linking
Caicedo et al. Bootstrapping semi-supervised annotation method for potential suicidal messages
Yin et al. MRT: Tracing the evolution of scientific publications
Chen et al. A supervised and distributed framework for cold-start author disambiguation in large-scale publications
CN114090769A (en) Entity mining method, entity mining device, computer equipment and storage medium
CN117194616A (en) Knowledge query method and device for vertical domain knowledge graph, computer equipment and storage medium
CN113591469A (en) Text enhancement method and system based on word interpretation
Greenberg et al. Knowledge organization systems: A network for ai with helping interdisciplinary vocabulary engineering
KR102210772B1 (en) Apparatus and method for classifying user's gender identity based on online data

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code: Ref country code: HK; Ref legal event code: DE; Ref document number: 40048677; Country of ref document: HK
SE01 Entry into force of request for substantive examination
GR01 Patent grant