CN113886571A - Entity identification method, entity identification device, electronic equipment and computer readable storage medium


Info

Publication number
CN113886571A
CN113886571A (application CN202110624434.9A)
Authority
CN
China
Prior art keywords
entity
word
boundary
region
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110624434.9A
Other languages
Chinese (zh)
Inventor
汪华东
陈婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Samsung Telecom R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to PCT/KR2021/008223 (published as WO2022005188A1)
Publication of CN113886571A
Priority to US17/715,436 (published as US20220245347A1)
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The application provides an entity identification method, an entity identification apparatus, an electronic device, and a computer-readable storage medium. The method includes: acquiring at least one entity boundary word corresponding to a text sequence to be recognized; acquiring at least one entity candidate region in the text sequence to be recognized based on the at least one entity boundary word; and acquiring an entity recognition result of the text sequence to be recognized based on the entity candidate region. The steps in this scheme may be performed by an artificial intelligence model. Compared with the prior art, the scheme can improve the coverage of entities in the text sequence to be recognized by the entity candidate regions without increasing the number of entity candidate regions, and reduces computational complexity.

Description

Entity identification method, entity identification device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an entity identification method, an entity identification device, an electronic device, and a computer-readable storage medium.
Background
The main purpose of entity recognition is to extract, from a text sequence to be recognized, all candidates that may be entities and to determine their entity categories.
In nested entity recognition, entities in the text sequence to be recognized may be nested, so all candidate entities in the input text sequence must be recognized, not only the outermost candidate entities. A conventional sequence-labeling-based method can assign only one label to each word, so the conventional entity recognition method needs to be improved.
Disclosure of Invention
The purpose of this application is to solve at least one of the above technical defects. The technical solutions provided by the embodiments of this application are as follows:
in a first aspect, an embodiment of the present application provides an entity identification method, including:
acquiring at least one entity boundary word corresponding to a text sequence to be recognized;
acquiring at least one entity candidate region in a text sequence to be recognized based on at least one entity boundary word;
and acquiring an entity recognition result of the text sequence to be recognized based on the entity candidate region.
In an optional embodiment of the present application, the obtaining at least one entity boundary word corresponding to a text sequence to be recognized includes:
respectively taking all words in the text sequence to be recognized as entity boundary words; or,
and acquiring the probability of the words in the text sequence to be recognized as entity boundary words based on the background expression vectors of the words in the text sequence to be recognized, and determining the entity boundary words of the text sequence to be recognized based on the probability.
In an optional embodiment of the present application, the obtaining at least one entity candidate region in the text sequence to be recognized based on at least one entity boundary word includes:
acquiring an entity suggestion region corresponding to the text sequence to be recognized based on the entity boundary words;
and acquiring a corresponding entity candidate region based on the entity suggested region.
In an optional embodiment of the present application, acquiring an entity suggested region corresponding to a text sequence to be recognized based on an entity boundary word includes:
and based on at least one preset width, respectively taking the entity boundary words as anchor words, and acquiring at least one corresponding entity suggestion region with the preset width.
In an optional embodiment of the present application, the obtaining a corresponding entity candidate region based on the entity suggested region includes:
acquiring a corresponding combination vector based on the background representation vector of the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word;
acquiring similarity between a background representation vector and a combined vector of entity boundary words in a text sequence to be recognized;
and acquiring a corresponding entity candidate region based on the similarity.
In an optional embodiment of the present application, obtaining a similarity between a background representation vector and a combined vector of entity boundary words in a text sequence to be recognized includes:
and in a Euclidean space or a hyperbolic space, acquiring similarity between a background representation vector and a combined vector of entity boundary words in a text sequence to be recognized.
In an optional embodiment of the present application, the obtaining, based on the similarity, a corresponding entity candidate region includes:
determining initial boundary words of corresponding entity candidate regions from anchor words of the entity suggestion regions in the text sequence to be recognized and entity boundary words positioned on the left sides of the anchor words based on the similarity, and determining termination boundary words of the corresponding entity candidate regions from the anchor words of the entity suggestion regions in the text sequence to be recognized and the entity boundary words positioned on the right sides of the anchor words;
and determining a corresponding entity candidate area based on the starting boundary word and the ending boundary word.
In an optional embodiment of the present application, obtaining a corresponding combined vector based on the background representation vector of the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word includes:
taking the width of the entity suggestion region as the width of a convolution kernel, and performing convolution processing on a background expression vector of a word covered by the entity suggestion region to obtain a corresponding feature vector;
and acquiring a corresponding combined vector based on the feature vector corresponding to the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word.
In an optional embodiment of the present application, the obtaining a corresponding entity candidate region based on the entity suggested region includes:
determining initial boundary word candidates and termination boundary word candidates of anchor words in the entity suggestion region;
determining initial boundary words of the entity suggestion region in the initial boundary word candidates, and determining termination boundary words of the entity suggestion region in the termination boundary word candidates;
and determining a corresponding entity candidate area according to the obtained initial boundary word and the termination boundary word.
In an alternative embodiment of the present application, determining the start boundary word candidates and the end boundary word candidates of the anchor word of the entity suggestion region includes:
Determining the anchor word in the entity suggestion region and the boundary word positioned on the left side of the anchor word as an initial boundary word candidate of the anchor word;
and determining the anchor word in the entity suggestion region and the boundary word positioned at the right side of the anchor word as the termination boundary word candidate of the anchor word.
In an alternative embodiment of the present application, determining a start boundary word of the entity suggestion region in the start boundary word candidates and determining an end boundary word of the entity suggestion region in the end boundary word candidates includes:
determining a first probability of each starting boundary word candidate serving as a starting boundary word of the entity suggestion region and a second probability of each ending boundary word candidate serving as an ending boundary word of the entity suggestion region;
determining a starting boundary word of the entity suggestion region based on the first probability, and determining an ending boundary word of the entity suggestion region according to the second probability.
In an optional embodiment of the present application, the obtaining an entity recognition result of the text sequence to be recognized based on the entity candidate region includes:
screening the entity candidate region to obtain the screened entity candidate region;
and judging the category of the screened entity candidate area to obtain an entity identification result of the text sequence to be identified.
In an optional embodiment of the present application, the screening the entity candidate region to obtain the screened entity candidate region includes:
acquiring a corresponding first classification characteristic vector based on the background expression vector of the word covered by the entity candidate region;
acquiring the probability that the entity candidate region belongs to the entity based on the first classification characteristic vector corresponding to the entity candidate region;
and acquiring the screened entity candidate region based on the probability that the entity candidate region belongs to the entity.
In an optional embodiment of the present application, the determining the category of the entity candidate region after being screened to obtain the entity recognition result of the text sequence to be recognized includes:
acquiring corresponding second classification characteristic vectors based on background expression vectors of starting boundary words and ending boundary words corresponding to the screened entity candidate regions;
and performing category judgment based on the second classification characteristic vector corresponding to the screened entity candidate region to obtain a corresponding entity identification result.
In an optional embodiment of the present application, the obtaining an entity recognition result of the text sequence to be recognized based on the entity candidate region includes:
acquiring a corresponding third classification characteristic vector based on the background representation vectors of the starting boundary word and the ending boundary word corresponding to the entity candidate region;
and performing category judgment based on the third classification characteristic vector corresponding to the entity candidate area to obtain a corresponding entity identification result.
In an optional embodiment of the present application, the obtaining at least one entity candidate region in the text sequence to be recognized based on at least one entity boundary word includes:
acquiring a preset number of entity boundary words adjacent to the entity boundary words from a text sequence to be recognized;
acquiring background representation vectors of the entity boundary words, and respectively obtaining similarity between the background representation vectors of the entity boundary words and the corresponding adjacent preset number of the entity boundary words;
and acquiring a corresponding entity candidate region based on the similarity.
In an optional embodiment of the present application, the obtaining, based on the similarity, a corresponding entity candidate region includes:
respectively determining a starting boundary word and an ending boundary word of a corresponding entity candidate region from the entity boundary words of the text sequence to be recognized and the adjacent preset number of the entity boundary words based on the similarity;
and determining a corresponding entity candidate area based on the starting boundary word and the ending boundary word.
In a second aspect, an embodiment of the present application provides an entity identification apparatus, including:
the entity boundary word acquisition module is used for acquiring at least one entity boundary word corresponding to the text sequence to be recognized;
the entity candidate region acquisition module is used for acquiring at least one entity candidate region in the text sequence to be recognized based on at least one entity boundary word;
and the entity recognition result acquisition module is used for acquiring an entity recognition result of the text sequence to be recognized based on the entity candidate region.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor;
the memory has a computer program stored therein;
a processor configured to execute a computer program to implement the method provided in the embodiment of the first aspect or any optional embodiment of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the method provided in the embodiment of the first aspect or any optional embodiment of the first aspect.
The beneficial effects brought by the technical solutions provided by this application are as follows:
Compared with the prior art, the scheme can improve the coverage of entities in the text sequence to be recognized by the entity candidate regions without increasing the number of entity candidate regions, and reduces computational complexity.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1a is a schematic diagram of a recognition result of a nested entity in an example of the embodiment of the present application;
FIG. 1b is a diagram illustrating nested entities in a text sequence according to an example of an embodiment of the present application;
FIG. 2 is a diagram illustrating an example of obtaining entity candidate regions in the prior art;
FIG. 3 is a diagram illustrating another example of obtaining entity candidate regions in the prior art;
fig. 4 is a schematic flowchart of an entity identification method according to an embodiment of the present application;
FIG. 5 is a diagram illustrating an entity suggestion region obtained in an example of an embodiment of the present application;
fig. 6 is a schematic diagram illustrating entity identification performed by an entity identification network according to an embodiment of the present application;
fig. 7 is a schematic diagram illustrating entity recognition of a text sequence to be recognized through an entity recognition network in an example of the embodiment of the present application;
fig. 8 is a schematic structural diagram of an entity recognition network model according to an embodiment of the present application;
FIG. 9a is a schematic diagram of entity identification in an example of an embodiment of the present application;
FIG. 9b is a diagram illustrating entity candidates obtained in an example of an embodiment of the present application;
FIG. 9c is an entity candidate obtained in one example of the prior art;
FIG. 10 is a schematic diagram of a detection layer network structure of an entity boundary word in an embodiment of the present application;
FIG. 11a is a diagram illustrating detection of entity boundary words in an example according to an embodiment of the present application;
FIG. 11b is a diagram illustrating detection of entity boundary words in an example according to an embodiment of the present application;
FIG. 12a is a diagram illustrating an example of obtaining an entity suggested region according to an embodiment of the present disclosure;
FIG. 12b is a schematic diagram of a network structure of an entity proposed generation layer in the embodiment of the present application;
FIG. 13a is a schematic diagram of a network structure of an entity candidate identification layer according to an embodiment of the present application;
FIG. 13b is a diagram illustrating a detailed network structure of an entity candidate recognition layer according to an embodiment of the present application;
FIG. 13c is a graph comparing boundary attention calculation based on boundary word masks with ordinary boundary attention calculation in the embodiment of the present application;
FIG. 14 is a schematic diagram of a network structure of an entity candidate filter layer in an embodiment of the present application;
FIG. 15 is a diagram illustrating a network structure of an entity classifier module according to an embodiment of the present application;
FIG. 16 is a schematic diagram of an entity identification scheme based on hyperbolic representation in an embodiment of the present application;
fig. 17 is a schematic structural diagram of an entity recognition network model according to an embodiment of the present application;
FIG. 18a is a diagram illustrating an example of an application of nested entity recognition in a smart recognition screen according to an embodiment of the present application;
FIG. 18b is a diagram illustrating an application of nested entity recognition in news reading enhancement in an example provided by an embodiment of the present application;
FIG. 18c is a diagram illustrating an application of nested entity recognition in menu reading enhancement according to an example of the present application;
FIG. 18d is a diagram illustrating an application of nested entity recognition in image editing according to an example of the present application;
FIG. 18e is a diagram illustrating an application of nested entity recognition in knowledge graph construction according to an example of the present application;
fig. 19 is a block diagram illustrating an entity identifying apparatus according to an embodiment of the present disclosure;
fig. 20 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Named Entity Recognition (NER), a subtask of information extraction, aims at locating the entity names mentioned in unstructured text and judging the categories to which they belong according to a predefined set of categories, such as PERSON (person name), ORGANIZATION (organization name), LOCATION (place name), TIME, etc. A named entity is usually a proper noun referring to a specific thing or person, and can be defined according to actual needs.
The nested NER task addresses the phenomenon that entities mentioned in the text are nested or overlapping. As shown in Fig. 1a, one entity is nested inside the organization entity "Edinburgh University". Such nested structures are difficult for conventional approaches to handle efficiently.
Nested NER is distinguished from non-nested NER as follows: a conventional NER method can only recognize either the outer entities or the innermost entities in a sentence, and no nested/overlapping structure exists among the recognized entities, whereas nested entity recognition can recognize multi-granularity nested entities with overlapping structures, as shown in Table 1.
TABLE 1 (the table content is provided as an image in the original publication)
In entity recognition of a text sequence, for a text sequence containing no nested entities, the independent entities (i.e., entities without nesting) in it need to be recognized. For example, in "Hi, Bixby, how is the weather in Beijing today?", both "Bixby" and "Beijing" can be identified as entities, the former as "Product (PRO)" and the latter as "Location (LOC)"; exactly which kinds of entities are identified in a text sequence depends on the entity category set defined by the recognition task. For a text sequence containing nested entities, both independent entities and nested entities (which can be understood as entities containing independent entities) need to be identified. As shown in Fig. 1b, in the text sequence "Thomas Jefferson, the third president of America, drafted the Declaration of Independence", "America (LOC)" and "Thomas Jefferson (Person, PER)" are both independent entities, while "the third president of America, Thomas Jefferson (PER)" is a nested entity. In the entity recognition process, all three entities need to be identified, i.e., "America (LOC)", "Thomas Jefferson (PER)", and "the third president of America, Thomas Jefferson (PER)". The traditional sequence-labeling-based method can only assign one label to each word, and therefore cannot identify nested entities.
In the prior art, a region-based nested entity recognition method is generally adopted, in which a plurality of entity candidate regions of a text sequence are identified respectively to detect whether each candidate region is a candidate entity; nested entities can be identified because different nested entities correspond to different entity candidate regions. The key to nested entity recognition is how to generate the entity candidate regions corresponding to nested entities. The entity candidate regions can be obtained in the following ways: 1) taking the nodes of a syntactic parse tree as entity candidate regions; 2) for an input text sequence containing N words, generating the N(N+1)/2 candidate subsequences and taking them as entity candidate regions; 3) adopting a state transition method (Transition-based Method) to construct candidate subsequences through a specially designed action sequence, and taking the constructed candidate subsequences as entity candidate regions. However, these acquisition methods obtain almost all text subsequences of the text sequence, so the computational cost of the entity candidate region acquisition process and of the subsequent entity candidate region identification process is high.
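For illustration only, the following Python sketch enumerates the N(N+1)/2 candidate subsequences used in way 2) above, which shows why the number of candidate regions, and hence the cost of classifying them, grows quadratically with sentence length; the sample sentence and function name are assumptions made for this example, not part of the claimed method.

```python
# Illustrative only: exhaustive candidate generation as in prior-art way 2).
def enumerate_all_spans(tokens):
    """Return every contiguous subsequence as a (start, end) pair of word indices."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start, len(tokens)):
            spans.append((start, end))          # inclusive word indices
    return spans

tokens = "Thomas Jefferson drafted the Declaration of Independence".split()
spans = enumerate_all_spans(tokens)
print(len(tokens), len(spans))                  # 7 words -> 28 candidate regions, i.e. N*(N+1)/2
```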
In addition, the prior art also adopts schema-based approaches for nested entity recognition, which design a more expressive tagging scheme to handle nested entities instead of changing the tagging unit. One representative direction is hypergraph-based methods, in which hypergraph labels are used to ensure that several word-level labels can recover nested entity candidates. Researchers have also proposed bipartite-graph-based methods to handle nested entity labels. However, these schemes require careful design to prevent spurious structures and structural ambiguities, and more expressive, unambiguous schemes inevitably lead to higher training and decoding time complexity.
On the basis of the region-based nested entity recognition method, the prior art further provides a method for acquiring the entity candidate regions corresponding to nested entities based on Anchor-Region Networks, which can be implemented in two ways: 1) taking each word in the text sequence as an anchor word and acquiring a plurality of entity candidate regions according to a plurality of preset widths. For example, suppose the text sequence includes 6 words, t1, t2, t3, t4, t5 and t6 in sequence, and 6 preset widths, 1 to 6, are preset, corresponding to 6 types of proposals (candidates), Proposal 1 to Proposal 6. For the anchor word t3, the entity candidate regions corresponding to Proposal 1 to Proposal 6 are as shown in Fig. 2, where the regions in the rectangular boxes are the entity candidate regions; for example, the entity candidate region corresponding to Proposal 2 is [t3, t4]. 2) Taking a certain head entity word in the text sequence as the anchor word, and predicting the boundaries of the entity candidate region with the head entity word as a reference, so as to obtain the entity candidate region. For example, as shown in Fig. 3, for the text sequence "The minimum of The partial", the anchor word prediction network determines that the head entity word is "minimum" (with entity type "Person (PER)"), and prediction is performed based on the head entity word "minimum" to obtain the corresponding candidate entity boundaries "The" and "partial", thereby obtaining the corresponding entity candidate region. In the former implementation, the widths of the nested entities in a text sequence may vary widely, and in order for the obtained entity candidate regions to cover all nested entities in the text sequence as far as possible, the number of preset widths needs to be increased, i.e., more entity candidate regions need to be obtained, which increases the computational complexity of the model. In the latter implementation, in many cases the head entity word of the text sequence cannot be determined, and the entity candidate regions of the text sequence therefore cannot be obtained. In view of the above problems, the embodiments of the present application provide the following entity recognition method.
Fig. 4 is a schematic flowchart of an entity identification method provided in an embodiment of the present application, and as shown in fig. 4, the method may include: step S401, acquiring at least one entity boundary word corresponding to a text sequence to be recognized; step S402, acquiring at least one entity candidate region in a text sequence to be recognized based on at least one entity boundary word; step S403, based on the entity candidate region, obtaining an entity recognition result of the text sequence to be recognized.
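For illustration only, the following runnable Python sketch mirrors this three-step flow; the helpers are simplified stand-ins (every word is kept as an entity boundary word, as in one optional embodiment of the application, and the classifier is a stub), not the trained sub-networks described later.

```python
# Sketch of the S401-S403 flow; the three helpers are illustrative stand-ins.
def detect_boundary_words(tokens):                                # S401
    # One optional embodiment: take every word as an entity boundary word.
    return list(range(len(tokens)))

def build_candidate_regions(tokens, boundary_idxs, max_width=3):  # S402
    # Pair boundary words into candidate regions up to a maximum width.
    return [(i, j) for i in boundary_idxs for j in boundary_idxs
            if i <= j and j - i + 1 <= max_width]

def classify_candidate_regions(tokens, regions):                  # S403 (stub classifier)
    return [(" ".join(tokens[s:e + 1]), "ENTITY?") for s, e in regions]

tokens = "Hi Bixby how is the weather in Beijing today".split()
regions = build_candidate_regions(tokens, detect_boundary_words(tokens))
print(classify_candidate_regions(tokens, regions)[:3])
```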
Compared with the prior art, the scheme provided by the present application can improve the coverage of entities in the text sequence to be recognized by the entity candidate regions without increasing the number of entity candidate regions, and reduces computational complexity.
Example 1
In an optional embodiment, the obtaining an entity candidate region corresponding to a text sequence to be recognized includes: acquiring an entity suggestion region corresponding to a text sequence to be recognized; and acquiring a corresponding entity candidate region based on the entity suggested region. The method for acquiring the entity suggestion region corresponding to the text sequence to be recognized comprises the following steps: and based on at least one preset width, respectively taking the words in the text sequence to be recognized as anchor words, and acquiring at least one corresponding entity suggestion region with the preset width. Specifically, in the embodiment of the application, an entity suggestion region is determined through each word in a text to be recognized, and then a starting boundary word and an ending boundary word of an entity candidate region are determined from each word in the text to be recognized through the entity suggestion region.
An entity identification method provided by an embodiment of the present application may include: acquiring at least one entity suggestion region corresponding to a text sequence to be recognized; acquiring an entity candidate region corresponding to the entity suggestion region; and acquiring an entity recognition result of the text sequence to be recognized based on the entity candidate region.
The anchor word of an entity suggestion region may be any word in the text sequence to be recognized, and the width of the entity suggestion region may be any width not greater than the width of the text sequence to be recognized. For example, consider a text sequence to be recognized that includes 5 words, t1, t2, t3, t4 and t5 in sequence. Three preset widths, 1, 3 and 5, are predefined, corresponding to 3 types of proposals (candidates), Proposal 1 to Proposal 3. When t3 is selected as the anchor word, the corresponding entity suggestion regions may be as shown in Fig. 5, where the regions in the rectangular boxes are the entity suggestion regions; for example, the entity suggestion region corresponding to Proposal 2 is [t2, t4]. It should be noted that the entity suggestion region corresponding to Proposal 2 may also be [t3, t5] or [t1, t3], as long as the anchor word of the corresponding entity suggestion region is t3 and its width is 3. For the text sequence to be recognized, if an entity suggestion region and its corresponding entity candidate region have the same anchor word but different boundary words, the corresponding entity candidate region can be obtained by adjusting the boundaries of the entity suggestion region.
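For illustration only, the following sketch generates one entity suggestion region per preset width for a chosen anchor word, using an anchor-centred placement consistent with the Fig. 5 example; the placement rule and helper name are assumptions made for this example, since any placement of the given width that contains the anchor word is acceptable.

```python
# Illustrative proposal generation: one suggestion region per preset width,
# centred on the anchor word and clipped to the sentence boundaries.
def suggestion_regions(num_words, anchor, widths=(1, 3, 5)):
    regions = []
    for w in widths:
        left = max(0, anchor - (w - 1) // 2)
        right = min(num_words - 1, left + w - 1)
        left = max(0, right - w + 1)        # re-clip so the width is kept when possible
        regions.append((left, right))       # inclusive word indices
    return regions

# 5-word sequence t1..t5 (0-indexed), anchor word t3 -> index 2
print(suggestion_regions(5, 2))             # [(2, 2), (1, 3), (0, 4)]
```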
Specifically, in the process of obtaining the corresponding entity candidate regions by adjusting the boundaries of the entity suggested regions, the association relationship between the entity suggested regions and the words in the text sequence to be recognized is referred to, so that the adjusted boundaries are more accurate, that is, the coverage rate of the entities in the text sequence to be recognized by the corresponding entity candidate regions is higher compared with that of the entity suggested regions. Because each entity candidate region can cover both an independent entity and a nested entity, when each entity candidate region is classified, the independent entity and the nested entity in the text sequence to be recognized can be recognized, and a corresponding entity recognition result is obtained.
According to the scheme provided by the embodiment of the application, the boundary of each entity suggestion region in the text sequence to be recognized is adjusted to obtain the corresponding entity candidate region by referring to the incidence relation between the entity suggestion region and each word in the text sequence to be recognized, and then each entity candidate region is recognized to obtain the corresponding entity recognition result.
In this embodiment of the application, the entity recognition process may be implemented by a preset entity recognition network. The entity recognition network may have the structure shown in Fig. 6 and may include a feature extraction module (also referred to as a semantic coding module or sentence coding module) 601, a candidate region determination module (also referred to as a candidate region generation module) 602, and an entity classification module (also referred to as an entity category classification module) 603, which are connected in sequence. Specifically, the feature extraction module 601 is configured to extract features of the input text sequence to be recognized to obtain the corresponding background semantic coding matrix, the candidate region determination module 602 is configured to receive the background semantic coding matrix output by the feature extraction module 601 and output a plurality of entity candidate regions of the text sequence to be recognized, and the entity classification module 603 is configured to receive the plurality of entity candidate regions output by the candidate region determination module 602 and output the corresponding entity recognition result. The following embodiments further describe the processing performed by each of these modules during entity recognition by the entity recognition network.
In an optional embodiment of the present application, the obtaining an entity candidate region corresponding to the entity suggested region includes:
and acquiring an entity candidate region corresponding to the entity suggestion region through an entity identification network based on the background semantic coding matrix corresponding to the text sequence to be identified.
It should be noted that, in the scheme, the background semantic coding matrix is a background representation matrix, and the background semantic coding vector is a background representation vector.
The background semantic coding vector of each word in the text sequence to be recognized, including each word covered by the entity suggestion region, can be obtained from the background semantic coding matrix. The relationship between the background semantic coding vectors of the words covered by the entity suggestion region and the background semantic coding vectors of the words in the text sequence to be recognized is used as the basis for adjusting the boundaries, so as to obtain the entity candidate region. This process may be performed in the candidate region determination module of the entity recognition network.
In an optional embodiment of the present application, acquiring an entity candidate region corresponding to an entity proposed region through an entity recognition network based on a background semantic coding matrix corresponding to a text sequence to be recognized includes:
acquiring similarity between the background semantic coding vector of the word in the text sequence to be recognized and the corresponding combined vector based on the background semantic coding matrix corresponding to the text sequence to be recognized and the combined vector corresponding to the entity suggested region;
and determining an entity candidate region corresponding to the entity suggested region based on the similarity.
The relationship between the background semantic coding vector of each word covered by the entity suggestion region and the background semantic coding vector of each word in the text sequence to be recognized may be the similarity between the combined vector corresponding to the entity suggestion region and the background semantic coding vector of each word in the text sequence to be recognized.
Specifically, for each entity suggestion region, the similarity between the background semantic code vector of each word in the text sequence to be recognized and the combined vector corresponding to the entity suggestion region is obtained, that is, each word corresponds to one similarity. According to the size relation of the similarity corresponding to each word, the boundary of the corresponding entity suggestion region can be adjusted, namely the boundary word is re-determined for the entity suggestion region, so that the corresponding entity candidate region is obtained, and the probability of the entity candidate region covering the entity is greater than that of the corresponding entity suggestion region. And then, classifying the entities in the entity candidate area in an entity classification module to obtain a corresponding entity identification result.
In an optional embodiment of the present application, the obtaining a similarity between a background semantic code vector of a word in a text sequence to be recognized and a corresponding combined vector based on the background semantic code matrix corresponding to the text sequence to be recognized and the combined vector corresponding to the entity suggested region includes:
acquiring a background semantic coding vector of a word in the text sequence to be recognized based on a background semantic coding matrix corresponding to the text sequence to be recognized;
acquiring a corresponding combined vector based on a feature vector corresponding to a word covered by the entity suggestion region and a background semantic coding vector of a corresponding anchor word;
and performing multi-head self-attention calculation based on the combined vector corresponding to the entity suggestion region and the background semantic coding vector of the word in the text sequence to be recognized, and acquiring the similarity between the background semantic coding vector of the word in the text sequence to be recognized and the combined vector corresponding to the entity suggestion region.
The combined vector corresponding to each entity suggested region may be the sum of the feature vector corresponding to each entity suggested region and the background semantic coding vector of the corresponding anchor word, that is, the combined vector is merged with the related information of the corresponding entity suggested region and the anchor word.
Specifically, the similarity between the combined vector and each word is calculated using a Multi-head Self-Attention operation: the combined vector corresponding to each entity suggestion region is used as the corresponding query (Query) matrix in the multi-head self-attention algorithm, the background semantic coding matrix of the text sequence to be recognized is used as the corresponding key (Key) matrix, and the similarity corresponding to each entity suggestion region is obtained through the multi-head self-attention algorithm; the similarity here may therefore also be referred to as an attention score. In particular, for a text sequence to be recognized in which the background semantic coding vector of each word is u_i (i = 1, 2, 3, ..., L), K preset widths (K may be a small integer such as 1, 2 or 3) for obtaining entity suggestion regions are set in advance. With the i-th word as the anchor word, the entity suggestion region of the text sequence to be recognized corresponding to the k-th preset width (hereinafter referred to as the entity suggestion region corresponding to ki) is obtained, and the similarity between the combined vector corresponding to this entity suggestion region and the background semantic coding vector of each word in the text sequence to be recognized is obtained through the following formula:
A_hk[i, j] = <Q_hk[i, :], K_hk[j, :]>,  h ∈ {lk, rk}
where <·,·> denotes the inner product operation, and Q_hk and K_hk (h ∈ {lk, rk}) are respectively the query matrix and key matrix in the self-attention computation. Both are obtained from the background semantic coding matrix U ∈ R^{L×D} of the text sequence to be recognized via linear transformation: U is first linearly transformed into Q = F(U) ∈ R^{L×D}, and Q is then split along the feature dimension into 2K heads, i.e., 2K parts {Q_1, Q_2, ..., Q_2K}, where Q_h ∈ R^{L×(D/2K)}, each head corresponding to one query matrix; K_h is obtained by a similar operation. In order to regress the boundaries of the entity candidate region based on the entity suggestion region corresponding to the k-th preset width, the feature vector P_k corresponding to each entity suggestion region is added to the query matrices of the self-attention computation, i.e., Q_lk ← Q_lk + P_k, Q_rk ← Q_rk + P_k.
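For illustration only, the following NumPy sketch shows one way to compute boundary-attention score matrices of this form for a single preset width; the random matrices standing in for the learned projections, the dimensions, and the function name are assumptions made for this example, not the trained network itself.

```python
import numpy as np

# Illustrative boundary-attention scores A_lk, A_rk for one preset width k.
# U is the background representation matrix (L x D); P_k holds the proposal
# feature vector for each anchor word; the W matrices project U into the
# left- and right-boundary heads.
def boundary_attention_scores(U, P_k, W_q_l, W_q_r, W_key_l, W_key_r):
    Q_lk = U @ W_q_l + P_k              # add proposal features to the query: Q_lk <- Q_lk + P_k
    Q_rk = U @ W_q_r + P_k
    K_lk = U @ W_key_l
    K_rk = U @ W_key_r
    A_lk = Q_lk @ K_lk.T                # A_lk[i, j] = <Q_lk[i], K_lk[j]>
    A_rk = Q_rk @ K_rk.T
    return A_lk, A_rk

L, D, Dh = 6, 16, 8                     # sentence length, model dim, per-head dim (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(size=(L, D))
P_k = rng.normal(size=(L, Dh))
A_lk, A_rk = boundary_attention_scores(
    U, P_k, *(rng.normal(size=(D, Dh)) for _ in range(4)))
print(A_lk.shape, A_rk.shape)           # (6, 6) (6, 6)
```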
In an optional embodiment of the present application, the feature vector corresponding to any entity suggested region may be obtained by: and taking the width of the entity suggestion region as the width of a convolution kernel, and performing convolution processing on a splicing vector corresponding to the background semantic coding vector of the word covered by the entity suggestion region to obtain a corresponding feature vector.
Specifically, for the entity suggested region corresponding to ki, the feature vector is obtained through the following formula:
p_ki = Conv1D_k(u_i) = ReLU(W_k · U_{i-k+1:i+k-1})
where ReLU is the preset activation function, U_{i-k+1:i+k-1} is the spliced vector of the background semantic coding vectors of the words covered by the entity suggestion region corresponding to ki, W_k is the convolution kernel, and 2k-1 is the size of the convolution window. The words u_i at all positions can be processed simultaneously, and the convolution operation can be denoted as P_k = Conv1D_k(U).
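For illustration only, the following sketch applies a convolution of this form over the background representation vectors for one width index k, using zero padding at the sentence ends; the padding choice and all dimensions are assumptions made for this example.

```python
import numpy as np

# Illustrative width-k proposal features: each anchor word's window of 2k-1
# background vectors is concatenated and passed through ReLU(W_k · window).
def proposal_features(U, k, W_k):
    L, D = U.shape
    window = 2 * k - 1
    pad = np.zeros((k - 1, D))
    U_pad = np.vstack([pad, U, pad])                 # zero-pad so every anchor has a full window
    feats = []
    for i in range(L):
        win = U_pad[i:i + window].reshape(-1)        # concatenation u_{i-k+1 : i+k-1}
        feats.append(np.maximum(0.0, W_k @ win))     # ReLU(W_k · window)
    return np.stack(feats)                           # P_k: one feature vector per anchor word

L, D, k, Dh = 6, 16, 2, 8
rng = np.random.default_rng(1)
P_k = proposal_features(rng.normal(size=(L, D)), k,
                        rng.normal(size=(Dh, (2 * k - 1) * D)))
print(P_k.shape)                                     # (6, 8)
```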
In an optional embodiment of the present application, the determining, by the boundary determining sub-module, an entity candidate region corresponding to the entity proposed region based on the similarity includes:
determining words with the highest similarity in words between the anchor words corresponding to the entity suggestion region and the first endpoint words of the text sequence to be recognized as initial boundary words of the corresponding entity candidate region, and determining words with the highest similarity in words between the anchor words and the second endpoint words of the text sequence to be recognized as termination boundary words of the corresponding entity candidate region;
and determining a corresponding entity candidate area based on the first boundary word and the second boundary word.
If all the words in the text sequence to be recognized are regarded as a sequence which is horizontally arranged in sequence, the first endpoint word can be regarded as a left endpoint word in the text sequence to be recognized, the first endpoint word is located on the left side of the anchor point word, and the corresponding first boundary word is a starting boundary word. Similarly, the second endpoint word may be considered as a right endpoint word in the text sequence to be recognized, which is located on the right side of the anchor word, and then the corresponding second boundary word is an end boundary word. For ease of understanding and description, the scheme will be described hereinafter with the first boundary word as the starting boundary word and the second boundary word as the terminating boundary word.
Specifically, in the process of obtaining a corresponding entity candidate region by adjusting the boundary of the entity suggested region, the adjusted left boundary and right boundary need to be determined respectively, that is, the start boundary word and the end boundary word of the entity candidate region need to be determined respectively. The higher the similarity between the background semantic coding vector and the combined vector of each word in the text to be recognized is, the higher the matching degree of the word and the boundary of the target entity candidate region corresponding to the anchor word is, specifically, the word with the highest similarity to the combined vector is found out from the left side words of the anchor word and the anchor word as the initial boundary word of the entity candidate region, the word with the highest similarity to the combined vector is found out from the right side words of the anchor word and the anchor word as the termination boundary word of the entity candidate region, and then the entity candidate region is obtained.
Specifically, after the similarity of each word on the left side of the anchor word and the similarity of each word on the right side of the anchor word are obtained, the position of the start boundary word and the position of the end boundary word may be obtained through the following calculation formula:
l_ki = argmax_{j ≤ i} A_lk[i, j],  r_ki = argmax_{j ≥ i} A_rk[i, j]
where A_lk[i, j] denotes the element in row i and column j of the score matrix A_lk, l_ki is the left boundary of the entity candidate region corresponding to the entity suggestion region corresponding to ki, and r_ki is the right boundary of that entity candidate region. It can be understood that the left boundary is chosen from among the anchor word and the words on its left side in the text sequence to be recognized, and the right boundary is chosen from among the anchor word and the words on its right side. The entity candidate region corresponding to the entity suggestion region corresponding to ki is then the word span [t_{l_ki}, ..., t_{r_ki}].
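For illustration only, the boundary selection above can be viewed as two constrained argmax operations over the score matrices, as in the following sketch (random scores are used in place of a trained model):

```python
import numpy as np

# Illustrative boundary regression for the suggestion region anchored at word i:
# the start boundary is the highest-scoring word at or left of the anchor in A_lk,
# and the end boundary is the highest-scoring word at or right of the anchor in A_rk.
def regress_boundaries(A_lk, A_rk, anchor):
    left = int(np.argmax(A_lk[anchor, :anchor + 1]))        # l_ki
    right = anchor + int(np.argmax(A_rk[anchor, anchor:]))  # r_ki
    return left, right                                       # candidate region [t_left, ..., t_right]

rng = np.random.default_rng(2)
A_lk, A_rk = rng.normal(size=(6, 6)), rng.normal(size=(6, 6))
print(regress_boundaries(A_lk, A_rk, anchor=2))
```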
In an optional embodiment of the present application, the obtaining at least one entity suggestion region corresponding to a text sequence to be recognized includes:
and acquiring at least one entity suggestion region with corresponding preset width by using the words in the text sequence to be recognized as anchor words through an entity recognition network based on at least one preset width.
Specifically, if there are L (L is greater than or equal to 1) words (including punctuation marks) in the text sequence to be recognized and there are K (K is greater than or equal to 1) preset widths, the number of entity suggestion areas corresponding to each word in the text sequence to be recognized is K, and the total number of entity suggestion areas corresponding to all words in the text to be recognized is L × K. It can be seen that the entity proposed region in the embodiment of the present application is substantially the same as the entity candidate region obtained in implementation mode 1) of the method for acquiring an entity candidate region based on the anchor-region network in the prior art. As can be seen from the foregoing description, in the solution of the present application, it is also necessary to perform boundary adjustment on the entity suggested region to obtain an entity candidate region with higher coverage. The step of obtaining the entity suggested region is also performed in the entity candidate region determination module.
In an optional embodiment of the present application, the obtaining an entity recognition result of the text sequence to be recognized based on the entity candidate region includes:
acquiring screened entity candidate regions through an entity identification network based on background semantic coding vectors corresponding to words covered by the entity candidate regions;
and acquiring the type and the position of the entity in the screened entity candidate area through the entity identification network.
Specifically, some entity candidate regions output by the entity candidate region determination module may not include entities, so that before entity classification, each entity candidate region may be screened by the entity candidate region screening module, and the screened entity candidate regions are input to the entity classification module for entity identification, so as to obtain corresponding entity types and positions.
Obviously, the entity candidate region screening module is located between the entity candidate region determining module and the entity classifying module. It should be noted that the entity candidate region screening module is not an essential structure of the entity identification network, and when the entity identification network does not have the entity candidate region screening module, the entity identification module directly performs classification processing on the entity candidate region output by the entity candidate region determining module.
In an optional embodiment of the present application, the obtaining, through an entity identification network, a filtered entity candidate region based on a background semantic code vector corresponding to each word covered by the entity candidate region includes:
acquiring a corresponding first classification characteristic vector based on the background semantic coding vector of each word covered by the entity candidate region;
acquiring the probability that each entity candidate region contains the entity based on the first classification characteristic vector corresponding to each entity candidate region;
and acquiring the screened entity candidate regions based on the probability that each entity candidate region contains the entity.
For example, if a certain entity candidate region corresponds to 5 words, the background semantic coding matrix corresponding to those 5 words (a D×5 matrix, where D is an integer greater than or equal to 1) is taken from the background semantic coding matrix of the text sequence to be recognized, and this D×5 matrix is averaged over the 5 word positions to obtain the first classification feature vector (a D×1 column vector) of the entity candidate region.
In addition, the first classification feature vector corresponding to each entity candidate region may also be the spliced vector of the background semantic coding vectors of the start boundary word, the end boundary word and the anchor word, i.e.,
h_ki = [u_{l_ki}; u_{r_ki}; u_i]
Specifically, after the first classification feature vector of each entity candidate region is obtained, the following classifier is used to screen the entity candidate regions:
p_ki = Softmax(W · h_ki)
where p_ki gives the probability that the entity candidate region corresponding to the entity suggestion region corresponding to ki contains an entity, W ∈ R^{2×3D} is a linear transformation parameter matrix (whose dimensions depend on the dimension of the feature vector h_ki), and h_ki is the first classification feature vector corresponding to that entity candidate region.
And after the probability that each entity candidate region contains the entity is obtained, taking the entity candidate region with the probability greater than or equal to a first preset value as the screened entity candidate region.
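For illustration only, the following sketch shows the screening step under the mean-pooling variant described above; the threshold value, the random parameter matrix, and the choice of which softmax output represents "contains an entity" are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Illustrative screening: mean-pool the covered word vectors into the first
# classification feature vector, score it with a linear layer plus softmax,
# and keep regions whose "contains an entity" probability clears a threshold.
def filter_candidates(U, regions, W, threshold=0.5):
    kept = []
    for (start, end) in regions:
        h = U[start:end + 1].mean(axis=0)        # first classification feature vector (mean pooling)
        p_entity = softmax(W @ h)[1]             # class 1 treated as "contains an entity"
        if p_entity >= threshold:
            kept.append(((start, end), float(p_entity)))
    return kept

rng = np.random.default_rng(3)
U = rng.normal(size=(6, 16))
print(filter_candidates(U, [(0, 2), (2, 4)], rng.normal(size=(2, 16))))
```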
In an optional embodiment of the present application, obtaining, by an entity identification network, a type and a location of an entity in a filtered entity candidate area includes:
acquiring a corresponding second classification characteristic vector based on the background semantic coding vector of the boundary word corresponding to the entity candidate region after screening;
and acquiring the type and the position of the corresponding entity based on the second classification feature vector corresponding to the screened entity candidate region.
The second classification feature vector of each entity candidate region may be obtained by splicing the background semantic coding vectors of its start boundary word and end boundary word. For example, if the background semantic coding vector of the start boundary word of a certain entity candidate region is u_{l_ki} and the background semantic coding vector of the end boundary word is u_{r_ki}, the second classification feature vector corresponding to that entity candidate region is
e_ki = [u_{l_ki}; u_{r_ki}]
In addition, the second classification feature vector of each entity candidate region can also be obtained by splicing corresponding background semantic code vectors of the starting boundary word, the ending boundary word and the anchor word.
Specifically, after the second classification feature vector of each entity candidate region is obtained, the following classifier is used for classification:
o_ki = Softmax(W_2 · ReLU(W_1 · e_ki))
where o_ki is the predicted probability vector over entity types for the entity candidate region corresponding to the entity suggestion region corresponding to ki, W_1 ∈ R^{2D×H} and W_2 ∈ R^{C×H} are linear transformation parameter matrices, e_ki is the second classification feature vector corresponding to that entity candidate region, and C is the number of entity categories, which includes a "does not belong to any entity" category used to further screen the entity candidate regions.
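For illustration only, the following sketch shows a classifier of this form; the label set and all dimensions are assumptions made for this example, and the parameter shapes are written so that the matrix products compose (W_1 maps the 2D-dimensional e_ki to an H-dimensional hidden vector, and W_2 maps that to C class scores).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Illustrative category classifier: concatenate the start and end boundary word
# vectors into e_ki, then apply a ReLU layer and a softmax layer over C classes
# (one extra class for "not an entity").
def classify_candidate(U, start, end, W1, W2, labels):
    e_ki = np.concatenate([U[start], U[end]])            # [u_start ; u_end], length 2D
    o_ki = softmax(W2 @ np.maximum(0.0, W1 @ e_ki))      # softmax(W2 · ReLU(W1 · e_ki))
    return labels[int(np.argmax(o_ki))], o_ki

D, H = 16, 32
labels = ["O", "PER", "LOC", "ORG"]                      # "O" = does not belong to any entity
rng = np.random.default_rng(4)
U = rng.normal(size=(6, D))
print(classify_candidate(U, 1, 3,
                         rng.normal(size=(H, 2 * D)),
                         rng.normal(size=(len(labels), H)),
                         labels))
```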
It should be noted that the method in the embodiment of the present application may identify nested entities as well as independent entities. When the text sequence to be recognized does not contain nested entities, the recognized entities only need to undergo conflict judgment after recognition is completed. Here, a Non-Maximum Suppression (NMS) algorithm may be used to process redundant, overlapping entity candidate regions and output the real entities. The idea of NMS is simple and efficient: when the entity classification module classifies the entity candidate regions, the prediction probability of the entity corresponding to each entity candidate region is obtained; the candidate entity with the maximum probability is selected, the candidate entities that conflict with it are deleted, and this process is repeated until all candidate entities are processed. Finally, the remaining non-conflicting candidate entities are obtained as the final recognition result.
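For illustration, the conflict-resolution step with NMS can be sketched as follows; here "conflict" is assumed to mean overlapping but non-identical spans, which is an assumption for the non-nested case described above:

def spans_conflict(a, b):
    # Two (start, end) spans conflict if they overlap but are not identical.
    return not (a[1] < b[0] or b[1] < a[0]) and a != b

def nms_entities(candidates):
    # candidates: list of (start, end, entity_type, probability)
    remaining = sorted(candidates, key=lambda c: c[3], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)               # candidate entity with the maximum probability
        kept.append(best)
        remaining = [c for c in remaining
                     if not spans_conflict((best[0], best[1]), (c[0], c[1]))]
    return kept                               # non-conflicting candidates as the final result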
In an optional embodiment of the present application, before obtaining an entity candidate region corresponding to the entity proposed region, the method may further include:
and acquiring a background semantic coding matrix corresponding to the text sequence to be recognized through an entity recognition network.
Specifically, the step of obtaining the background semantic coding matrix of the text sequence to be recognized is performed in the feature extraction network.
In an optional embodiment of the present application, obtaining, by an entity identification network, a background semantic coding matrix corresponding to a text sequence to be identified includes:
acquiring an initial background semantic coding matrix corresponding to a text sequence to be recognized;
and acquiring the corresponding background semantic coding matrix based on the initial background semantic coding matrix and the part-of-speech embedding matrix corresponding to the text sequence to be recognized.
The feature extraction module further comprises an ELMo (Embeddings from Language Models) sub-module and a Bi-directional Long Short-Term Memory (Bi-LSTM) sub-module.
Specifically, given a text sequence to be recognized x = (t_1, t_2, …, t_L) containing L words, ELMo is adopted to encode the input text and obtain the corresponding initial background semantic coding matrix W_ELMo = ELMo(x) ∈ R^{L×E}, where E is the dimension of the word vector. Considering that part of speech has an important influence on entity boundary and entity category identification, let the part-of-speech sequence corresponding to the text sequence to be recognized be (p_1, p_2, …, p_L), and let the corresponding part-of-speech embedding matrix be W_pos ∈ R^{L×P}, where P is the dimension of each part-of-speech embedding vector. W_ELMo and W_pos are then spliced word by word and input into the bidirectional long-short term memory submodule to obtain the background semantic coding matrix U of the text sequence to be recognized, in which the representation of each word is the concatenation of the forward hidden vector representation and the backward hidden vector representation of the Bi-LSTM.
The sentence encoder may also be defined in other ways; for example, the feature extraction module may include only a BERT (Bidirectional Encoder Representations from Transformers) module, and the background semantic coding matrix corresponding to the text sequence to be recognized is obtained by the BERT module as U = BERT(x).
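For illustration, the feature extraction described above (word vectors plus part-of-speech embeddings fed into a Bi-LSTM) can be sketched as follows; the dimensions are assumptions and the ELMo encoder is replaced by an arbitrary pre-computed word-vector matrix:

import torch
import torch.nn as nn

class BackgroundEncoder(nn.Module):
    def __init__(self, word_dim=1024, pos_vocab=50, pos_dim=25, hidden=256):
        super().__init__()
        self.pos_emb = nn.Embedding(pos_vocab, pos_dim)        # part-of-speech embedding W_pos
        self.bilstm = nn.LSTM(word_dim + pos_dim, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, w_elmo, pos_ids):
        # w_elmo: (B, L, E) initial background semantic coding matrix (e.g. from ELMo)
        # pos_ids: (B, L) part-of-speech indices of the text sequence
        x = torch.cat([w_elmo, self.pos_emb(pos_ids)], dim=-1) # word-wise splicing
        u, _ = self.bilstm(x)                                  # (B, L, 2*hidden) background coding matrix U
        return u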
In an alternative embodiment of the present application, the entity recognition network is trained by:
determining a training loss function, wherein the training loss function comprises a boundary loss function, an entity candidate region screening loss function and an entity classification loss function;
acquiring a training sample set, wherein a text sequence sample in the training sample set is marked with a position label and a type label of a real entity;
and training the entity recognition network based on the training loss function and the training sample set until the value of the training loss function meets the preset condition, and obtaining the trained entity recognition network.
In the training stage of the entity recognition network, three loss functions are adopted for combined training, namely a boundary loss function, an entity candidate region screening loss function and an entity classification loss function.
Specifically, the boundary loss function is mainly used to optimize the entity candidate region determination module. When the left and right boundaries of the entity candidate region are optimized using the similarity, the corresponding cross-entropy loss functions are:
L_left = CE(A_lk[i, :], l_ki),
L_right = CE(A_rk[i, :], r_ki),
where CE(·,·) represents a standard cross-entropy loss function, l_ki and r_ki are respectively the left boundary position and the right boundary position of the target entity candidate region corresponding to the entity suggestion region of ki, A_lk[i, :] is the similarity vector of each word serving as the left boundary for the entity suggestion region of ki, and A_rk[i, :] is the similarity vector of each word serving as the right boundary for the entity suggestion region of ki. The boundary loss function is L_b = L_left + L_right.
The entity candidate region screening loss function is mainly used to optimize the entity candidate region screening module. The entity candidate region screening module judges the probability that an entity candidate region belongs to an entity, which is a binary classification judgment, and the corresponding binary cross-entropy loss function is:
L_r = − Σ_ki [ y_ki · log(p_ki) + (1 − y_ki) · log(1 − p_ki) ],
where y_ki is the label indicating whether the entity candidate region corresponding to the entity suggestion region of ki contains an entity, and p_ki is the probability that the entity candidate region corresponding to the entity suggestion region of ki contains an entity.
The entity classification loss function is mainly used to optimize the entity classification module, and the corresponding cross-entropy loss function is:
L_c = Σ_ki CE(o_ki, y_ki),
where y_ki ∈ {0, 1, …, C} is the entity type label of the entity candidate region m = [l_ki, r_ki] corresponding to the entity suggestion region of ki, and o_ki is the entity type prediction probability vector of the entity candidate region corresponding to the entity suggestion region of ki.
In a model training phase, an end-to-end optimization method is adopted in the embodiment of the application, boundary loss, entity candidate region screening loss and entity classification loss are optimized simultaneously, and an optimization target loss function of the whole model is defined as follows:
L = L_b + L_r + L_c
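For illustration, the joint optimization of the three losses can be sketched as follows; the per-term inputs are assumed to be pre-computed logits and target indices, and the function name is illustrative:

import torch.nn.functional as F

def total_loss(left_scores, right_scores, left_targets, right_targets,
               screen_logits, screen_labels, class_logits, class_labels):
    # Boundary loss L_b: cross entropy of the left/right boundary similarity scores
    l_b = F.cross_entropy(left_scores, left_targets) + \
          F.cross_entropy(right_scores, right_targets)
    # Screening loss L_r: binary judgment of whether a candidate contains an entity
    l_r = F.cross_entropy(screen_logits, screen_labels)
    # Entity classification loss L_c: entity type of each screened candidate
    l_c = F.cross_entropy(class_logits, class_labels)
    return l_b + l_r + l_c                   # L = L_b + L_r + L_c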
In addition, in the optimization process of the entity candidate region screening module, it is considered that the entity candidate regions acquired by the entity candidate region determination module may contain errors and may lack diversity, so that the entity candidate region screening module is difficult to optimize quickly. In order to avoid accumulated errors in the training process, when optimizing the entity candidate region screening module, the words in the input sample text sequence are combined pairwise to form entity candidate regions. If all the combined entity candidate regions were input into the entity candidate region screening module, this would bring high computational complexity; at the same time, the number of negative samples would be far larger than the number of positive samples, which is not conducive to model optimization. To avoid this problem, negative sampling is performed on the negative samples.
In an optional embodiment of the present application, a loss value of a boundary loss function corresponding to any entity candidate region is obtained by:
acquiring a target entity candidate region of the entity candidate region based on the coincidence degree of the words covered by the entity suggestion region corresponding to the entity candidate region and the words covered by the real entity in the text sequence sample;
and substituting the similarity representation vector of the boundary corresponding to the entity candidate region and the one-hot representation vector of the boundary of the target entity candidate region into the boundary loss function to obtain the corresponding loss value.
In the training process, obtaining the value of the boundary loss function requires obtaining a target entity candidate region corresponding to each entity proposed region, and the target entity candidate region is used as a supervision label of an optimized entity candidate region determination module.
Specifically, for each entity suggestion region, the corresponding target entity candidate region is determined based on the degree of overlap, also referred to as the Intersection over Union (IoU), between the words covered by the entity suggestion region and the word set covered by each real entity in the text sequence sample. Each degree of overlap is calculated as:
IoU(P_ki, E_m) = |P_ki ∩ E_m| / |P_ki ∪ E_m|,
where P_ki is the set of words covered by the entity suggestion region of ki, and E_m is the word set covered by the m-th (m = 1, 2, 3, …) real entity in the text sample sequence. After the target entity candidate region corresponding to each entity suggestion region is determined according to the degree of overlap, the similarity representation vector of the boundary corresponding to the entity candidate region and the one-hot representation vector of the boundary of the target entity candidate region are substituted into the boundary loss function to obtain the left boundary loss and the right boundary loss, and thus the corresponding loss value.
Further, obtaining a target entity candidate region of the entity candidate region based on the overlap ratio of the words covered by the entity suggestion region corresponding to the entity candidate region and the words covered by the real entity in the text sequence sample, including:
if the coincidence degree corresponding to the entity candidate region is not smaller than a preset threshold value, taking the region corresponding to the corresponding real entity as a corresponding target entity candidate region;
and if the coincidence degree corresponding to the entity candidate region is smaller than a preset threshold value, taking a region corresponding to the anchor word of the entity candidate region as a corresponding target entity candidate region.
Specifically, for each entity suggestion region, if the degree of overlap between the entity suggestion region and the word set covered by a certain real entity is not smaller than a preset threshold (the preset threshold may be set to 0.1, for example), the region corresponding to that real entity is taken as the target entity candidate region corresponding to the entity suggestion region. If the degree of overlap between the entity suggestion region and the word set covered by every real entity is smaller than the preset threshold, the region corresponding to the anchor word of the entity suggestion region is taken as the corresponding target entity candidate region.
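For illustration, the overlap computation and target assignment can be sketched as follows; the data layout and the default threshold are assumptions:

def word_iou(proposal_words, entity_words):
    # Degree of overlap (IoU) between two sets of word positions.
    p, e = set(proposal_words), set(entity_words)
    return len(p & e) / len(p | e) if p | e else 0.0

def assign_target(proposal_words, anchor_position, real_entities, threshold=0.1):
    # real_entities: list of (start, end) word spans of the real entities in the sample
    best_iou, best_span = 0.0, None
    for start, end in real_entities:
        iou = word_iou(proposal_words, range(start, end + 1))
        if iou > best_iou:
            best_iou, best_span = iou, (start, end)
    if best_iou >= threshold:
        return best_span                          # region of the matched real entity
    return (anchor_position, anchor_position)     # fall back to the anchor word region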
In addition, in the process of training the entity recognition network, multi-scale basic entity regions can be obtained by presetting multiple widths (also called scales). Setting multi-scale basic regions helps determine which real entities are regressed during training and also helps regress all the real entities in the text sample sequence according to the overlap scores.
The scheme of the embodiment of the present application is further described below with an example: a pre-trained entity recognition network is used to recognize a text sequence to be recognized ("The minister of foreign affairs attended a meeting") containing nested entities. As shown in fig. 7, the entity recognition network in this example comprises a sentence coding layer 701, an entity suggestion generation layer 702, an entity candidate filtering layer 703 and an entity classifier module 704, wherein the sentence coding layer 701 further comprises an ELMo sub-module and a bidirectional long short-term memory sub-module connected in sequence. The text sequence to be recognized is input into the entity recognition network, which finally outputs the entity recognition result.
Specifically, in the entity suggestion generation layer 702, 3 preset widths are set for obtaining the corresponding entity suggestion regions, the 3 preset widths corresponding to 3 different types of Proposal (suggestion): preset width 1 corresponds to Proposal1, preset width 3 corresponds to Proposal2, and preset width 5 corresponds to Proposal3. The entity suggestion generation layer 702 outputs 27 entity candidate regions corresponding to the different Proposals, as shown in dashed box 705. The entity candidate filtering layer 703 filters the 27 entity candidate regions to obtain 3 filtered entity candidate regions, as shown in dashed box 706. The entity classifier module 704 classifies the 3 filtered entity candidate regions to obtain the entity type corresponding to each filtered entity candidate region, where the filtered entity candidate region [7, 7] does not belong to any entity type, as shown in dashed box 707. Finally, the entity recognition network outputs the nested entity "The minister of foreign affairs" contained in the text sequence to be recognized with corresponding entity type PER, and the independent entity "foreign affairs" with corresponding type ORG, as shown in dashed box 708.
Example 2
In an optional embodiment, the obtaining at least one entity candidate region in the text sequence to be recognized based on at least one entity boundary word includes: acquiring an entity suggestion region corresponding to the text sequence to be recognized based on the entity boundary words; and acquiring a corresponding entity candidate region based on the entity suggested region. Specifically, in the embodiment of the application, the entity suggested region is determined through the entity boundary words, and then the starting boundary words and the ending boundary words of the entity candidate region are determined from the entity boundary words through the entity suggested region.
Fig. 8 is an architecture diagram of a model for performing an entity recognition method according to an embodiment of the present invention, which may be referred to as a Temporal Region recommendation Network (TRPN) model, and fig. 8 includes two modules: an entity candidate detector module and an entity classifier module. The structure and function of these two modules will be described separately below.
1. An Entity Candidate Detector module (ECDN) aims to detect all possible Entity candidates (i.e. Entity Candidate areas) in the input text (i.e. the text sequence to be recognized). It takes a sentence as input and outputs all entity candidates. The module comprises a sentence coding layer and an entity candidate generation module, wherein:
The sentence coding layer performs semantic coding on the input sentence using Bi-LSTM (Bidirectional Long Short-Term Memory)/CNN (Convolutional Neural Networks)/BERT (Bidirectional Encoder Representations from Transformers) to obtain a background representation (Context representation) vector of each word. The entity candidate generation module may dynamically detect possible entity candidates with different granularities in the input text. The module includes two parts, an entity suggestion generation layer and an entity candidate filtering layer, wherein:
an Entity recommendation generation layer (Entity recommendation Windows) dynamically predicts Entity recommendation regions of different granularity with generated Entity recommendation Windows (i.e., different region widths) as a basis for the Entity regions. Here, we design a fast and memory efficient boundary attention to accelerate model inference, i.e. firstly identify the possible entity boundary words in the sentence through the entity boundary word detection layer, and then calculate the boundary attention score only on the entity boundary words.
An Entity Candidate Filter layer (Entity Candidate Filter) judges a probability that a generated Entity Candidate belongs to a real Entity by using a binary classification layer, and filters the generated Entity Candidate according to the probability.
2. An Entity Classifier module (ECN) is used to perform Entity class discrimination on the Entity candidates obtained by the detector module according to a predefined Entity class set. It takes each generated entity candidate and representation as input and outputs its entity class. The module consists of two sub-modules, namely an entity candidate coding layer and an entity category classification layer, wherein:
and an entity candidate encoding layer, wherein the sub-module is used for converting the entity candidate representation into a feature vector with fixed dimension. It takes each generated entity candidate and its background representation as input and outputs its corresponding entity class.
And the submodule judges the entity type of each entity candidate, predicts the probability of the entity belonging to each entity type by taking the entity feature vector of the entity candidate as input, and determines the entity type by the highest probability.
The overall flow of the above entity recognition method executed by the model is described below by way of an example. As shown in fig. 9a, the sentence "The director of … National Geographic" is taken as the input.
Step 1: the sentence is input into the entity candidate detector module, which uses the entity boundary word detection layer to obtain the possible entity boundary words in the sentence, such as "t1: The", "t5: National", "t11: Diseases", "t13: National" and "t14: Geographic".
Step 2: entity suggestion windows are generated by the entity suggestion window generation layer with each boundary word as an anchor word. For example, when the boundary word "t5: National" is used as the anchor word, the corresponding entity suggestion windows are "[t5, t5]: National" and "[t4, t6]: the National Institute"; when "t14: Geographic" is used as the anchor word, the corresponding entity suggestion windows are "[t14, t14]: Geographic" and "[t13, t15]: National Geographic", and the like. Here, two entity suggestion (Entity Proposal) windows with widths of 1 and 3 are predefined.
Step 3: the entity suggestion windows [t5, t5], [t4, t6], [t14, t14], [t13, t15] are input into the entity candidate identification layer and respectively used as references for adjusting the prediction, correspondingly obtaining the entity candidate regions [t5, t11], [t1, t11], [t13, t14] and [t13, t14].
Step 4: the detected entity candidates are input into the entity candidate filtering layer to filter out erroneous and repeated entities, obtaining [t5, t11], [t1, t11], [t13, t14].
Step 5: the filtered entity candidates [t5, t11], [t1, t11], [t13, t14] are input into the entity classifier, which determines the entity class of each. Finally, the entities and their categories are output: "[t5, t11]: PERSON", "[t1, t11]: ORGNIZATION", "[t13, t14]: ORGNIZATION".
Compared with the prior art, the scheme (fig. 9b) of the embodiment of the present application differs most from the prior art (fig. 9c) in the entity candidate generation module. As shown, there are two main points of distinction:
First, the number of predefined entity proposal (Entity Proposal) windows is different. Our method only requires two different entity suggestion windows (i.e., R = {1, 3}, two entity suggestion windows with widths of 1 and 3). However, existing approaches typically require the definition of multiple entity suggestion windows (e.g., R = {1, 2, 3, 4, 5, 6}) to generate entities of different granularities.
The second, predefined entity suggestion window is used differently. Our method dynamically predicts multi-granularity entity candidates using the generated entity suggestion window as an entity candidate base, whereas the prior art directly adopts the entity suggestion window as an entity candidate.
The following describes each module of the above model in the embodiment of the present application in detail.
In an optional embodiment of the present application, the obtaining at least one entity boundary word corresponding to a text sequence to be recognized includes:
respectively taking all words in the text sequence to be recognized as entity boundary words; or acquiring the probability that the words in the text sequence to be recognized are used as entity boundary words based on the background expression vector of the words in the text sequence to be recognized, and determining the entity boundary words of the text sequence to be recognized based on the probability.
Taking all words in the text sequence to be recognized as entity boundary words respectively corresponds to the scheme in the embodiment 1, taking each word in the text sequence to be recognized as an entity boundary word, and further performing subsequent processing to determine an entity candidate region of the text sequence to be recognized. In this embodiment, part of words are selected from the text sequence to be recognized as entity boundary words, and then subsequent processing is performed to determine an entity candidate region of the text sequence to be recognized.
Specifically, the entity boundary words of the text sequence to be recognized may be obtained by the entity boundary word detection layer, that is, the module is configured to detect the boundary words of all possible entities in the input text and generate a boundary word sequence. The module is designed to remove non-boundary word representation in a subsequent entity candidate identification module, realize the compression of a Query matrix and a Key value matrix in boundary attention calculation, accelerate the speed of the entity candidate identification module and reduce the calculation cost.
As shown in fig. 10, a schematic diagram of the entity boundary word detection layer obtaining the entity boundary words is given. For the input sentence "The director of … National Geographic", the entity boundary word detection layer outputs the entity boundary words {"t2: director", "t5: National", "t11: Diseases", "t13: National", "t14: Geographic"}. Given a sentence, for each word t_i the module outputs a probability score p_i ∈ [0, 1] indicating the probability that the word belongs to a boundary word. The specific process may comprise the following steps:
Step 1: for each word t_i, input its background representation vector u_i ∈ R^d into a fully connected neural network (FNN) to obtain a value v_i, i.e., v_i = FNN(u_i), where the parameters of FNN(·) are shared for all words;
Step 2: convert the value v_i into a probability value through a Sigmoid activation function, i.e., p_i = Sigmoid(v_i);
Step 3: decide the boundary words according to the probability value p_i: given a boundary word threshold α ∈ (0, 1) (which may be set to 0.5, for example), if p_i > α the word belongs to the boundary words (mask_i = 1); otherwise the word does not belong to the boundary words (mask_i = 0);
Step 4: output all entity boundary words in the sentence, i.e., output the entity boundary word mask sequence mask of the input sentence.
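For illustration only, steps 1 to 4 can be sketched in Python/PyTorch as follows; the hidden size and the threshold value are assumptions, and this is a minimal sketch rather than the exact implementation of the embodiment:

import torch
import torch.nn as nn

class BoundaryWordDetector(nn.Module):
    def __init__(self, d=512, hidden=128, alpha=0.5):
        super().__init__()
        # FNN shared by all words (Step 1)
        self.fnn = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.alpha = alpha                     # boundary word threshold (Step 3)

    def forward(self, u):
        # u: (L, d) background representation vectors of the L words
        v = self.fnn(u).squeeze(-1)            # v_i = FNN(u_i)
        p = torch.sigmoid(v)                   # p_i = Sigmoid(v_i) (Step 2)
        mask = (p > self.alpha).long()         # mask_i in {0, 1} (Steps 3-4)
        return p, mask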
In particular, the entity boundary word detection aims to detect the set of all possible entity boundary words, including those of nested entities. The boundary words of an entity include its starting boundary word and its ending boundary word. As shown in fig. 11a, the detected entity boundary words are: {"t2: director", "t5: National", "t11: Diseases", "t13: National", "t14: Geographic"}. As shown in fig. 11b, the detected entity boundary words are: {"t1: Edinburgh", "t2: University", "t3: Library", "t7: Adam", "t8: Ferguson", "t9: Building"}.
Further, the reasons why the entity boundary words in a sentence can be identified fall into two aspects:
On the one hand, the entity boundary words in a sentence usually follow certain rules and can be found through rule matching. The positions of entity boundary words can be identified according to rules based on dictionaries, part of speech, prefixes, suffixes and the like. For example:
In "… went to U.K. …", "U.K." is generally the starting boundary word of a LOCATION entity;
In "… party went to …", "party" is generally the terminating boundary word of an ORGNIZATION entity;
In "… director of the …", "director" is generally the starting boundary word of a PERSON entity;
In "#Noun Phrase falls", the word "falls" generally indicates that the preceding word is the terminating boundary word of a PERSON entity, where "#Noun Phrase" denotes a noun phrase in the sentence;
If the first letter of the previous word is lowercase and the first letter of the current word is uppercase, then the current word is the starting boundary word of an entity. For example: "… the National Institute …", "… to Adam Ferguson Building …";
If the first letter of the current word is uppercase and the first letter of the next word is lowercase, then the current word is typically the terminating boundary word of an entity, such as: "… Diseases hills …", "… University Library is …";
For phrases satisfying the syntax structure #Definite Article (abbreviated DT) + #Noun (abbreviated NN), the noun is usually the starting word position of an entity, where "#Definite Article" represents a definite article and "#Noun" represents a noun, i.e., a definite article followed by a noun indicates that the noun is the starting word of an entity. For example, in "The/DT director/NN …", "The" is a definite article and the following "director" is the starting word of an entity.
On the other hand, the entity boundary words in a sentence follow certain statistical regularities. As shown in table 2, statistics are given for the two nested entity recognition data sets ACE2004 and ACE2005, showing the most frequent words occurring at different positions of the entities, and the most frequent part-of-speech rules for named entity recognition. As can be seen from table 2:
some words frequently serve as starting words of entities, such as "president", "North", "New", etc.;
some words frequently serve as terminating words of entities, such as "county", "company", "party", etc.;
from the part-of-speech information of the text, many entity boundaries can be found by part-of-speech rules, where "Determiner + Noun" denotes a determiner followed by a noun, "Preposition + Proper Noun" denotes a preposition followed by a proper noun, "Verb + Noun" denotes a verb followed by a noun, and "Noun + Verb" denotes a noun followed by a verb.
TABLE 2
(Table provided as a figure in the original: most frequent entity starting words, entity terminating words and part-of-speech boundary rules on the ACE2004 and ACE2005 data sets.)
In an optional embodiment of the present application, acquiring an entity suggested region corresponding to a text sequence to be recognized based on an entity boundary word includes:
and based on at least one preset width, respectively taking the entity boundary words as anchor words, and acquiring at least one corresponding entity suggestion region with the preset width.
Specifically, the entity suggestion regions may be obtained through the entity suggestion generation layer, and the module may generate a corresponding entity suggestion region for each entity boundary word in the sentence through two predefined entity suggestion windows with different lengths (i.e., preset widths). These generated entity suggestion regions are used as entity region references to dynamically detect entity candidates with different granularities. The module also encodes each entity suggestion region according to the background representation of the word sequence. It should be noted that the number of preset widths selected by the module may be one, two, or more; it is understood that the fewer the selected preset widths, the fewer entity suggestion regions are obtained and the smaller the subsequent calculation amount. As shown in fig. 12a, which is a schematic diagram of the module obtaining the entity suggestion regions, the process may include the following steps, for example:
Step 1: for a given sentence, generate all entity suggestion regions using the entity suggestion windows.
For each word in the sentence, two different entity suggestion regions of lengths 1 and 3 (i.e., entity suggestion windows of 1 and 3) are generated. As shown in fig. 12b, with the entity boundary word "t3: Library" as the anchor word, the resulting entity suggestion regions are "[t3, t3]: Library" (corresponding to preset width 1, Proposal1) and "[t2, t4]: University Library is" (corresponding to preset width 3, Proposal3). Table 3 shows the entity suggestion regions generated for all possible entity boundary words.
Two symmetric entity suggestion windows are generated centered on each anchor word; other asymmetric forms may also be used, such as [t3, t4] and [t2, t5] for the anchor word "t3: Library".
In general, for a word t_i in the sentence, R entity suggestion regions with different lengths may be generated, where R is the number of preset widths selected when the entity suggestion regions are generated. In fact, a proposal window of two widths or even one width is sufficient for nested entity identification.
TABLE 3
Anchor word | Entity suggestion window (k = 1) | Entity suggestion window (k = 3)
"t1: Edinburgh" | "[t1, t1]: Edinburgh" | "[t0, t2]: Edinburgh University"
"t2: University" | "[t2, t2]: University" | "[t1, t3]: Edinburgh University Library"
"t3: Library" | "[t3, t3]: Library" | "[t2, t4]: University Library is"
"t7: Adam" | "[t7, t7]: Adam" | "[t6, t8]: to Adam Ferguson"
"t8: Ferguson" | "[t8, t8]: Ferguson" | "[t7, t9]: Adam Ferguson Building"
"t9: Building" | "[t9, t9]: Building" | "[t8, t10]: Ferguson Building"
Step 2: obtain the entity suggestion region representations of all entity boundary words through a sliding convolution operation on the sentence background representation matrix. In order to use the generated entity suggestion region information in the subsequent modules, it needs to be encoded to obtain a corresponding representation vector.
When generating entity suggestion regions about the anchor words using Proposal1 (i.e., preset width 1) and Proposal3 (i.e., preset width 3), the embodiment of the present application introduces a local one-dimensional convolution Conv1D to perform a convolution operation on each generated entity suggestion region, and its output is used as the feature representation vector:
Entity suggestion region representation for Proposal1: p_i = Conv1D_1(u_i),
Entity suggestion region representation for Proposal3: p_i = Conv1D_3([u_{i-1}, u_i, u_{i+1}]),
where Conv1D_k represents a one-dimensional convolution operation with kernel width k, and u_{i-1}, u_i, u_{i+1} respectively represent the background representation vectors of the three words t_{i-1}, t_i, t_{i+1} covered by the Proposal3 entity suggestion region of the i-th word.
The scheme uses convolution operations of two different kernel widths to obtain a background representation vector of an entity suggestion region of each entity boundary word of a sentence. The advantage of using convolution is that the background representation vectors of the entity proposal region can be computed in parallel.
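For illustration, the two convolutional encoders (kernel widths 1 and 3) can be sketched as follows; the channel size and sequence length are assumptions:

import torch
import torch.nn as nn

d = 512                                               # background representation dimension (assumed)
conv1d_1 = nn.Conv1d(d, d, kernel_size=1)             # Proposal1 encoder
conv1d_3 = nn.Conv1d(d, d, kernel_size=3, padding=1)  # Proposal3 encoder, padded to keep length L

u = torch.randn(1, 10, d)                             # (batch, L, d) background representation matrix
x = u.transpose(1, 2)                                 # Conv1d expects (batch, channels, length)
p1 = conv1d_1(x).transpose(1, 2)                      # (1, L, d): p_i = Conv1D_1(u_i)
p3 = conv1d_3(x).transpose(1, 2)                      # (1, L, d): p_i = Conv1D_3([u_{i-1}, u_i, u_{i+1}])

Because the convolutions slide over the whole sentence, the representations of all entity suggestion regions are obtained in one parallel pass, which is the advantage noted above.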
In the scheme provided by the embodiment of the present application, only two entity suggestion windows (namely, preset widths) are adopted to generate the entity suggestion regions, and the subsequent nested entity identification with a multilayer structure can still be performed, mainly for the following reasons:
Taking fig. 11b as an example for description, in general every entity has a unique boundary word. All entity regions can be obtained from these boundary words under two entity suggestion windows, as shown in table 4 (the arrow direction in the table represents the region expansion direction of the entity candidate region prediction). Specifically:
1. Since the word "t1: Edinburgh" is an entity boundary word, taking this word as the anchor word gives the entity suggestion regions [t1, t1] and [t0, t2]; the entity candidates "Edinburgh" and "Edinburgh University" can be predicted by taking these two entity suggestion regions as references.
2. Since "t8: Ferguson" is an entity boundary word, the entity suggestion regions corresponding to this word as the anchor word are "[t8, t8]: Ferguson" and "[t7, t9]: Adam Ferguson Building". The entity candidate [t7, t8] ("Adam Ferguson") is obtained by prediction with the entity suggestion region [t8, t8] as a reference, and the entity candidate "[t7, t9]: Adam Ferguson Building" is obtained by prediction with the entity suggestion region [t7, t9] as a reference.
3. After the above two boundary word operations, the entity candidate region "[t1, t3]: Edinburgh University Library" has not yet been detected, but this entity region has the unique entity boundary word "t3: Library", which yields the two entity suggestion regions [t3, t3] and [t2, t4]; the entity candidate [t1, t3] can be obtained by prediction using either of these two suggestion windows as a reference.
TABLE 4
(Table provided as a figure in the original: for each entity boundary word in fig. 11b, the entity suggestion regions under the two suggestion windows and the entity candidate regions they expand to; the arrow direction represents the region expansion direction.)
In the scheme provided by the embodiment of the present application, the entity suggestion regions may also be generated using only one entity suggestion window (namely, one preset width), and the subsequent nested entity identification with a multilayer structure can still be performed, mainly for the following reasons:
Still taking fig. 11b as an example, it can be seen that essentially every entity has a unique boundary word. The entity suggestion region generated with each entity boundary word as an anchor word can be expanded to the corresponding entity candidate region, as shown in table 5. Employing more entity suggestion windows may help achieve a more stable model representation; however, defining many entity suggestion regions also incurs additional computational cost. Thus, to balance model prediction performance and computational cost, only two entity suggestion windows may be selected, such as Proposal1 of width 1 and Proposal3 of width 3.
TABLE 5
(Table provided as a figure in the original: how the entity suggestion region generated for each entity boundary word under a single suggestion window expands to the corresponding entity candidate region.)
In an optional embodiment of the present application, the obtaining a corresponding entity candidate region based on the entity suggested region includes:
determining initial boundary word candidates and termination boundary word candidates of anchor words in the entity suggestion region;
determining initial boundary words of the entity suggestion region in the initial boundary word candidates, and determining termination boundary words of the entity suggestion region in the termination boundary word candidates;
and determining a corresponding entity candidate area according to the obtained initial boundary word and the termination boundary word.
Wherein determining the start boundary word candidate and the end boundary word candidate of the anchor word of the entity suggestion region comprises
Determining the anchor word in the entity suggestion region and the boundary word positioned on the left side of the anchor word as an initial boundary word candidate of the anchor word;
and determining the anchor word in the entity suggestion region and the boundary word positioned at the right side of the anchor word as the termination boundary word candidate of the anchor word.
Determining initial boundary words of the entity suggestion region in the initial boundary word candidates, and determining termination boundary words of the entity suggestion region in the termination boundary word candidates, wherein the method comprises the following steps:
determining a first probability of each starting boundary word candidate serving as a starting boundary word of the entity suggestion region and a second probability of each ending boundary word candidate serving as an ending boundary word of the entity suggestion region;
determining a starting boundary word of the entity suggestion region based on the first probability, and determining an ending boundary word of the entity suggestion region according to the second probability.
In other words, based on the entity suggested region, a corresponding entity candidate region is obtained, including:
acquiring a corresponding combination vector based on the background representation vector of the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word;
acquiring similarity between a background representation vector and a combined vector of entity boundary words in a text sequence to be recognized;
and acquiring a corresponding entity candidate region based on the similarity.
The method for acquiring the similarity between the background representation vector and the combined vector of the entity boundary word in the text sequence to be recognized comprises the following steps:
and in a Euclidean space or a hyperbolic space, acquiring similarity between a background representation vector and a combined vector of entity boundary words in a text sequence to be recognized.
Based on the similarity, obtaining a corresponding entity candidate region includes:
determining initial boundary words of corresponding entity candidate regions from anchor words of the entity suggestion regions in the text sequence to be recognized and entity boundary words positioned on the left sides of the anchor words based on the similarity, and determining termination boundary words of the corresponding entity candidate regions from the anchor words of the entity suggestion regions in the text sequence to be recognized and the entity boundary words positioned on the right sides of the anchor words;
and determining a corresponding entity candidate area based on the starting boundary word and the ending boundary word.
Specifically, among the anchor word of the entity suggestion region in the text sequence to be recognized and the entity boundary words located on its left side, the entity boundary word with the highest similarity is determined as the start boundary word of the corresponding entity candidate region; among the anchor word of the entity suggestion region in the text sequence to be recognized and the entity boundary words located on its right side, the entity boundary word with the highest similarity is determined as the end boundary word of the corresponding entity candidate region; and the corresponding entity candidate region is determined based on the start boundary word and the end boundary word.
Specifically, the entity candidate region may be obtained based on the entity suggestion region through the entity candidate identification layer, and the module dynamically predicts a position of a boundary word of the entity candidate region according to the entity suggestion window. It will generate all entity candidates and their background representations. Unlike the existing method, the entity proposed region is used as a reference for acquiring the entity candidate region, rather than being directly used as the entity candidate region.
As shown in fig. 13a, the module predicts the positions of the boundary words of all entity candidates through the self-attention mechanism. For example, it may include the following steps:
Step 1: obtain the key value (Key) matrix transformations and the index (Query) matrix transformations, i.e., the background representation matrix of the text to be recognized is transformed through different linear transformations to obtain four new sentence representations (each sentence representation is a matrix containing the background representation feature vectors of all words), namely: 1) the key value matrix representations (start and end), i.e., the feature representations of the words with respect to the start boundary and the end boundary, each key value matrix containing the key-value-related feature vectors of the words; 2) the index matrix representations (k = 1, 3), i.e., a feature representation of each word with respect to the two different suggestion windows;
Step 2: perform feature fusion between the entity suggestion-aware (Proposal-aware) index matrices (PQM) and the different suggestion window representations from the entity suggestion window generation module, where k = 1 and 3;
Steps 3-4: the fast and memory-efficient boundary attention operation (FMBA) calculates the attention scores (normalized inner product scores) between each anchor word and all words in the sentence. To avoid the high computational cost and high memory consumption of standard self-attention, the key value matrices and the parts of the entity suggestion-aware (Proposal-aware) index representations (PQM) that cannot be boundaries are filtered out according to the boundary word mask (mask) obtained by the entity boundary detector;
Step 5: determine the positions of the boundary words of the entity candidates according to the boundary attention scores.
The module uses the entity proposed region as a reference to dynamically predict the entity candidate region. A fast and memory efficient boundary attention operation is designed to predict the entity candidate boundary, and only possible entity boundary words are considered in the boundary attention operation, but not all words in the input sentence.
Fig. 13b shows a detailed structure of the entity candidate identification module, which mainly involves 5 parts of calculation, as indicated by the numbers in the figure. The five calculation processes are given below:
1. Key value matrix representations and index matrix representations are calculated as:
Start key value: K_l = W_l · U;
End key value: K_r = W_r · U;
Index (k = 1): Q_1 = W_1 · U;
Index (k = 3): Q_3 = W_2 · U;
where W_l, W_r, W_1, W_2 are weight parameter matrices and U is the background representation from the sentence coding layer.
2. The entity suggestion-aware index matrix (PQM) predicts the entity candidate boundaries with the entity suggestion window as the entity region reference. The entity suggestion-aware index matrix for Proposal1 and the entity suggestion-aware index matrix for Proposal3 are calculated as:
Q_1 ← Q_1 + Conv1D_1(U);
Q_2 ← Q_2 + Conv1D_3(U).
3. The calculation of the compressed key value matrices and the compressed PQM comprises two steps:
1) combining the boundary word mask M obtained from the entity boundary detector with the key value matrix representations K_l, K_r and the entity suggestion-aware index matrix representations Q_1 and Q_2;
2) filtering out the representation vectors of the non-boundary words so as to compress the original key value matrix representations K_l, K_r and the entity suggestion-aware index matrix representations Q_1 and Q_2, obtaining the corresponding compressed key value matrices and compressed index matrices.
4. Start and end boundary attention operations: each vector q_i in the compressed PQM (i.e., from the compressed index matrix of Proposal1 or Proposal3) is used as an anchor word index query, and attention operations with respect to the compressed start and end key value matrices are used to calculate its start boundary scores A_l[i, j] = <q_i, k_j^l> and end boundary scores A_r[i, j] = <q_i, k_j^r> over the boundary words j, where <·,·> represents the similarity score calculation between two vectors. The similarity may be measured in Euclidean space, or based on a non-Euclidean space, for example the similarity induced by the hyperbolic distance measure in hyperbolic space (Hyperbolic Space). In either Euclidean space or hyperbolic space, the similarity between the background representation vector of an entity boundary word in the text sequence to be recognized and the combined vector is obtained, and the boundary score is then obtained according to the similarity.
5. Determining the positions of the entity boundary words of the entity candidate: for each feature vector q_i from the compressed index matrix of Proposal1 or Proposal3, the key word position with the maximum attention score is taken as the boundary word, and the positions of the start boundary word and the end boundary word are calculated as l_i = argmax_j A_l[i, j] and r_i = argmax_j A_r[i, j]. The resulting predicted entity candidate region is [l_i, r_i], where l_i is the position of its start boundary word and r_i is the position of its end boundary word.
Further, the most direct boundary attention operation is a standard self-attention operation over the index matrix Q and the key value matrix K, whose time and memory complexity is O(N²·d), where N is the input sequence length and d is the feature vector dimension of K and Q. The operation is quadratic with respect to the input sequence length, which results in high computational and memory costs, so it cannot be well extended to long text sequences. To address these challenges, we propose a fast and memory-efficient boundary attention operation (FMBA) to compute the boundary scores of the entity candidates.
The FMBA design operates based on sparse attention over the detected boundary words. As shown in fig. 13c, it first compresses the index and key value matrices taking the boundary words into account, and then computes the boundary attention operation:
Compressing the key value matrices and the entity suggestion-aware index matrices: since FMBA only needs to calculate attention scores between the boundary words in order to find the two boundary word positions of each entity candidate, the non-boundary word parts of the start and end key value matrices K_s and K_e and of the index matrices Q_k (k = 1, 2) can be filtered out according to the boundary word mask to obtain the corresponding compressed matrices.
Calculating the attention scores with the compressed key value matrices and index matrices: the attention scores are calculated on the boundary words only, instead of on all words of the sentence. Assuming that the number of possible boundary words in the input sentence is Ñ, FMBA has a time and memory complexity of O(Ñ²·d). Since the number of boundary words is usually much smaller than the sentence sequence length (Ñ ≪ N), this can significantly reduce the computational and memory costs of the boundary attention operation in the inference process.
This scheme designs a fast and memory-efficient boundary attention operation module to calculate the boundary score matrix, reducing the computational complexity from O(N²·d) to about O(Ñ²·d), where Ñ ≪ N.
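For illustration, the mask-based compression and boundary attention can be sketched as follows; the Euclidean inner product is used as the similarity, and the left/right position constraints relative to the anchor word are omitted for brevity:

import torch

def fmba(q, k_start, k_end, boundary_mask):
    # q:        (L, d) entity-suggestion-aware index matrix (PQM) for one window width
    # k_start:  (L, d) start key value matrix;  k_end: (L, d) end key value matrix
    # boundary_mask: (L,) 1 for detected boundary words, 0 otherwise
    idx = boundary_mask.nonzero().squeeze(-1)           # positions of the boundary words
    q_c, ks_c, ke_c = q[idx], k_start[idx], k_end[idx]  # compressed matrices
    a_left = q_c @ ks_c.t()                             # start boundary attention scores
    a_right = q_c @ ke_c.t()                            # end boundary attention scores
    # boundary word positions with the maximum score, mapped back to sentence positions
    left = idx[a_left.argmax(dim=-1)]
    right = idx[a_right.argmax(dim=-1)]
    return left, right                                  # predicted [l_i, r_i] per boundary anchor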
In an optional embodiment of the present application, the obtaining an entity recognition result of the text sequence to be recognized based on the entity candidate region includes:
screening the entity candidate region to obtain the screened entity candidate region;
and judging the category of the screened entity candidate area to obtain an entity identification result of the text sequence to be identified.
The screening of the entity candidate region to obtain the screened entity candidate region includes:
acquiring a corresponding first classification characteristic vector based on the background expression vector of the word covered by the entity candidate region;
acquiring the probability that the entity candidate region belongs to the entity based on the first classification characteristic vector corresponding to the entity candidate region;
and acquiring the screened entity candidate region based on the probability that the entity candidate region belongs to the entity.
Specifically, the entity candidate regions acquired by the previous module may be filtered through the entity candidate filtering layer to obtain the filtered entity candidate regions. The module estimates in parallel the probability that each entity candidate belongs to an entity, and filters the generated entity candidates according to this probability value. As shown in fig. 14, the module filters out those entity candidates that are unlikely to be correct entities. The module first encodes each entity candidate into a fixed-dimension feature vector, which is then input into a binary classifier to determine whether the entity candidate belongs to a real entity. The module comprises two sub-modules, namely an entity candidate coding layer and an entity candidate classification layer, wherein:
The entity candidate encoding layer encodes all entity candidates of different lengths into fixed-dimension feature vectors. For an entity candidate [l_i, r_i] with its corresponding anchor word t_i, the module encodes the entity candidate using three components: the start word feature vector u_{l_i}, the stop word feature vector u_{r_i} and the anchor word feature vector u_i. The encoded feature (i.e., the first classification feature) of the entity candidate [l_i, r_i] is expressed as h_i = Concat(u_{l_i}, u_{r_i}, u_i), namely the splicing of the three feature vectors;
The entity candidate classification layer is a fully connected layer (FNN) with a two-class Softmax output, used for determining the quality of the entity candidates and filtering out wrong entity candidates. The probability of an entity candidate is defined as:
p_i = Softmax(FNN(h_i))
unlike the prior art which adopts CNN/LSTM to code entity candidates, the entity candidate coding layer of the scheme only adopts the splicing of three components of the entity candidates, namely the feature vectors (namely background representation vectors) of the initial boundary word, the termination boundary word and the anchor word, and the operation is very efficient and is beneficial to accelerating model inference.
The method for judging the category of the entity candidate region after screening to obtain the entity recognition result of the text sequence to be recognized comprises the following steps:
acquiring corresponding second classification characteristic vectors based on background expression vectors of starting boundary words and ending boundary words corresponding to the screened entity candidate regions;
and performing category judgment based on the second classification characteristic vector corresponding to the screened entity candidate region to obtain a corresponding entity identification result.
Specifically, the entity classifier module may be used to perform category discrimination on the filtered entity candidate regions, and as shown in fig. 15, first, each filtered entity candidate (i.e., the filtered entity candidate region) is encoded, and classified into different predefined entity categories, so as to determine a final predicted entity.
The module encodes each filtered entity candidate to obtain a feature vector with fixed dimensionality, and then inputs the feature vector into a fully connected network with a Softmax output layer to carry out entity category judgment. It consists of two submodules:
Entity candidate coding layer: entity candidates with different lengths are encoded into fixed-dimension feature vectors. For each entity candidate interval m = [l, r], its feature vector is defined as the concatenation of the two boundary word feature vectors:
m = Concat(u_l, u_r).
The coding layer structure is simple and efficient, and other existing methods such as CNN/LSTM can also be used as the entity candidate coding layer.
Entity class classification layer: the entity candidate category is judged according to the feature vector m obtained by the entity candidate coding layer. The entity class classifier is defined as:
p = Softmax(FNN(m))
where the fully connected neural network (FNN) may be
FNN(m) = W_2 · ReLU(W_1 · m),
where W_1 ∈ R^{2D×H} and W_2 ∈ R^{C×D} are network parameters that need to be learned, C is the predefined number of entity classes (including the non-entity class None), and ReLU(·) is the activation function of the network.
Unlike prior art methods, this scheme encodes each entity candidate based on an entity candidate representation from an entity candidate encoding layer, rather than directly encoding the entity candidates based on the original word sequence and the input sentence. This end-to-end approach not only reduces error propagation but also speeds up online model inference. In addition, the module only adopts the feature vector splicing of two boundary words of the entity candidate as the entity feature vector, the operation is efficient, no extra calculation cost is needed, and the model inference is accelerated.
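For illustration, the entity candidate encoding and class classification can be sketched as follows; the hidden size and the number of classes are assumptions, and the weight shapes follow the functional form FNN(m) = W2·ReLU(W1·m):

import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityClassifier(nn.Module):
    def __init__(self, d=512, hidden=256, num_classes=8):  # num_classes includes the None class
        super().__init__()
        self.w1 = nn.Linear(2 * d, hidden, bias=False)
        self.w2 = nn.Linear(hidden, num_classes, bias=False)

    def forward(self, u_l, u_r):
        # u_l, u_r: (K, d) background representation vectors of the two boundary words
        m = torch.cat([u_l, u_r], dim=-1)                      # m = Concat(u_l, u_r)
        return F.softmax(self.w2(F.relu(self.w1(m))), dim=-1)  # p = Softmax(W2·ReLU(W1·m))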
In addition, when the entity candidate recognition layer performs the boundary attention operation, that is, the similarity calculation, the boundary detection may also be performed in a hyperbolic space, as shown in fig. 16, based on a similarity score induced by the hyperbolic distance.
A hyperbolic entity suggestion network architecture for nested entity recognition is presented in fig. 16, which includes two neural network modules: an entity candidate detector module and a hyperbolic entity classifier module. The purpose of the entity candidate detector module is to identify likely entity regions by calculating multi-head attention scores in a hyperbolic space and then generate entity region candidates; it may be further divided into three modules, namely a sentence encoding layer (i.e., the sentence encoder in the figure), a candidate generation layer based on hyperbolic space (i.e., the hyperbolic-space-based candidate generator in the figure), and an entity candidate classification layer (i.e., the entity candidate classifier in the figure). More specifically, the sentence encoder may derive the background representation of each word through a bidirectional long-short term memory module (Bi-LSTM), a convolutional neural network (CNN), or a pre-trained language model (e.g., BERT). The candidate generator generates entity region candidates according to the attention scores (i.e., similarities) of different heads; different from the calculation of multi-head attention scores in Euclidean space, the similarity between an anchor word and each word is calculated here in hyperbolic space, replacing the Euclidean-space similarity calculation with the similarity induced by the hyperbolic distance, and calculation in hyperbolic space helps the model learn word alignments with hierarchical structural relationships. The entity candidate classification layer is a binary classification neural network layer which judges the probability that a generated entity region candidate belongs to an entity class and filters the region candidates according to the probability; it may be calculated in Euclidean space or in hyperbolic space, and the generated entity region candidates are filtered accordingly. The hyperbolic entity classification module aims to judge the category of the detected entity region candidates according to the predefined entity categories, and comprises two parts, namely an entity candidate encoding layer and a hyperbolic-space entity candidate classification layer, wherein the entity candidate encoding layer is used for encoding the filtered entity candidates, and the entity candidate classification layer is used for classifying the filtered entity region candidates obtained in the previous step into the appropriate entity categories.
It should be noted that the network simultaneously adopts positive entity candidates (e.g., m1, m3, m4, m8 in the figure) and negative entity candidates (e.g., m2, m5, m6, m7 in the figure) as training samples during training, where a positive entity candidate may be understood as a screened entity candidate, that is, one that belongs to an entity and whose label is a specific entity type, and a negative entity candidate does not belong to any entity; the shared network parameters of the entity candidate encoding layer are obtained by training on the positive entity candidates and the negative entity candidates respectively. Adding negative entity candidates in the training process improves the ability of the hyperbolic entity classifier to discriminate entity categories.
A hyperbolic-distance-induced similarity calculation function may be defined as:

K(q_h, k_h) = -α_h · d_c(q_h, k_h)² + β_h

where α_h and β_h are scalar parameters and

d_c(q_h, k_h) = (2/√c) · artanh( √c · ‖(-q_h) ⊕_c k_h‖ )

is the hyperbolic (Poincaré) distance. Here ⊕_c denotes the addition operation of the hyperbolic space (Möbius addition), whose calculation form is:

x ⊕_c y = [ (1 + 2c⟨x, y⟩ + c‖y‖²) · x + (1 - c‖x‖²) · y ] / [ 1 + 2c⟨x, y⟩ + c²‖x‖²‖y‖² ]
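To make these operations concrete, the following is a minimal NumPy sketch of Möbius addition, the Poincaré distance and the distance-induced similarity described above, assuming the Poincaré-ball model with curvature parameter c; the scalars alpha and beta stand in for the head-specific parameters α_h and β_h and are illustrative defaults, not values fixed by the present application.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition x (+)_c y in the Poincare ball with curvature c."""
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + (c ** 2) * x2 * y2
    return num / den

def hyperbolic_distance(q, k, c=1.0):
    """Poincare distance d_c(q, k) induced by Mobius addition."""
    diff = mobius_add(-q, k, c)
    norm = np.linalg.norm(diff)
    return (2.0 / np.sqrt(c)) * np.arctanh(np.sqrt(c) * norm)

def similarity(q, k, alpha=1.0, beta=0.0, c=1.0):
    """Hyperbolic-distance-induced similarity K(q, k) = -alpha * d_c(q, k)^2 + beta."""
    return -alpha * hyperbolic_distance(q, k, c) ** 2 + beta
```

For example, similarity(np.array([0.1, 0.2]), np.array([-0.05, 0.3])) scores two word representations lying inside the unit ball; larger (less negative) values indicate closer points.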
Accordingly, consider the entity class classifier. Because nested entities exhibit a significant conceptual hierarchy, we can also define the classifier in a hyperbolic space. First, the predefined class set C = {c1, c2, …, cT} ∪ {None_Type} is mapped into a feature vector space, where the embedding vector corresponding to class c is defined as y_c ∈ R^D. The classifier is defined as follows: suppose h_m is the representation of a given entity candidate m; then its classifier is defined as

p_{m,c} = exp(e_m · y_c) / Σ_{c'∈C} exp(e_m · y_{c'})

where e_m ∈ R^D is the D-dimensional feature representation vector of entity candidate m obtained through a nonlinear transformation FNN(), i.e.,

e_m = FNN(h_m)

and y_c denotes the embedding vector of class c ∈ C. p_{m,c} denotes the probability of entity candidate m with respect to class c, and during model prediction the class with the highest probability is taken as the predicted entity class.
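The following sketch illustrates this candidate classifier under the assumption that the similarity between e_m and each class embedding y_c is an inner product (the Euclidean form that the hyperbolic layer below replaces) and that FNN() is a single tanh layer; the parameter shapes and names are illustrative, not prescribed by the application.

```python
import numpy as np

def classify_candidate(h_m, W, b, class_embeddings):
    """Score an entity candidate against the class embeddings y_c.

    h_m: representation vector of candidate m
    W, b: parameters of the nonlinear transform FNN() (single layer here)
    class_embeddings: array of shape (T + 1, D), one row per class incl. None_Type
    """
    e_m = np.tanh(W @ h_m + b)            # e_m = FNN(h_m)
    scores = class_embeddings @ e_m       # inner-product similarity with each y_c
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                  # p_{m,c} via softmax
    return int(probs.argmax()), probs     # predicted class = highest probability
```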
Entity category discrimination layer based on hyperbolic space: in effect, the parameters, operations and similarities of the Euclidean entity-type discrimination layer are replaced by their counterparts in the hyperbolic space. Given an entity candidate m, the representation vector of that entity candidate is

e_m = σ^⊗( W_2 ⊗_c ( W_1 ⊗_c h_m ⊕_c b_1 ) ⊕_c b_2 )

where W_1 and W_2 are fully connected network parameters, b_1 and b_2 are bias parameters, σ^⊗ denotes the activation function in the hyperbolic space, ⊗_c denotes Möbius matrix-vector multiplication, and ⊕_c is the Möbius addition operation. If C denotes the set of classes including None, the probability that entity candidate m belongs to class c is

p_{m,c} = exp( K(e_m, y_c) ) / Σ_{c'∈C} exp( K(e_m, y_{c'}) )

where K(e_m, y_c) is the hyperbolic-distance-induced similarity. Note that this differs from the inner-product operation used in the Euclidean space.
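As a rough sketch of this last step, and assuming that the candidate representation e_m and the class embeddings y_c already lie in the Poincaré ball, the class probabilities can be obtained by a softmax over the hyperbolic-distance-induced similarity; the Möbius fully connected layer is omitted for brevity, and the similarity function is passed in as a parameter (for instance, similarity() from the earlier sketch).

```python
import numpy as np

def hyperbolic_class_probs(e_m, class_embeddings, sim):
    """p_{m,c} = exp(K(e_m, y_c)) / sum_c' exp(K(e_m, y_c')), where sim(e_m, y_c)
    plays the role of the hyperbolic-distance-induced similarity K and replaces
    the inner product used in the Euclidean classifier."""
    scores = np.array([sim(e_m, y_c) for y_c in class_embeddings])
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()
```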
Example 3
In an optional embodiment, obtaining at least one entity candidate region in the text sequence to be recognized based on at least one entity boundary word includes: acquiring, from the text sequence to be recognized, a preset number of entity boundary words adjacent to each entity boundary word; acquiring the background representation vectors of the entity boundary words, and obtaining the similarities between the background representation vector of each entity boundary word and those of its adjacent preset number of entity boundary words; and acquiring a corresponding entity candidate region based on the similarities. Based on the similarities, a starting boundary word and an ending boundary word of the corresponding entity candidate region are determined from the entity boundary words of the text sequence to be recognized and their adjacent preset number of entity boundary words, and the corresponding entity candidate region is determined based on the starting boundary word and the ending boundary word. In other words, in this scheme, after the entity boundary words are obtained, the starting and ending boundary words of the entity candidate regions are determined directly from the entity boundary words, and the corresponding entity candidate regions are then obtained.
As shown in fig. 17, the model implementing this scheme differs from the models in the first two embodiments mainly in the entity candidate detector module. In this scheme the entity candidate detector module is composed of three sub-modules, namely an entity boundary word detection layer, a nearest word discovery layer and an entity candidate recognition layer, wherein:
1. The entity boundary word detection layer detects possible entity boundary words in the input text sequence; it is consistent with the entity boundary word detection layers in the first two embodiments.
2. The nearest word discovery layer uses the detected entity boundary words as anchor words and then, for each entity boundary word, finds the K words nearest to it using Locality Sensitive Hashing (LSH), where K << L and L is the length of the text sequence to be recognized.
3. The entity candidate recognition layer calculates a similarity score between each anchor word and each of its K nearest entity boundary words; the anchor word and the neighbouring entity boundary word with the largest similarity score form an entity candidate boundary word pair, i.e., a starting boundary word and an ending boundary word.
the execution process of the nearest word discovery layer and the entity candidate recognition layer is repeated H times (e.g., H ═ 2). And finally collecting and outputting all entity candidates.
The main differences between this scheme and the scheme in Example 2 lie in the entity candidate detector module:
there is no need to predefine entity suggestion windows, because the nearest-word discovery module replaces the entity candidate generator in the entity candidate detector;
the computational complexity is low: this scheme only needs to compute similarity scores between each anchor word and the first K boundary words selected by the locality sensitive hashing technique, giving a computational complexity of O(N·log(K)), where usually K << N.
It should be noted that the entity recognition scheme provided by the present application is well suited to the recognition of nested entities, and can also be applied to the recognition of traditional (non-nested) entities. The scheme is applicable to the following application scenarios in which nested entity recognition is needed:
1. Intelligent Screen (Smart Screen)
The intelligent screen is an intelligent solution for mobile phones. When a user is chatting, reading or browsing pictures, the user can trigger the function by pressing the text content area of the screen with the palm; the intelligent screen then automatically extracts information such as entities and keywords from the text, for example person names, place names, positions and telephone numbers, and performs information expansion, application service linking or interest recommendation on that information, so that the user reaches the goal in one step (One Step). Fig. 18a shows a potential application example of nested entity recognition in the intelligent screen.
2. Reading enhancement for news reading
When the user is reading a news text, the user may not be familiar with the background of the related entities mentioned in the news. The reading enhancement function can automatically extract the related entities from the text and link them to web pages introducing those entities, which helps the user quickly jump to the entity pages of interest, as shown in fig. 18b.
3. Reading enhancement for menus
When reading a menu, a consumer needs to order dishes on the basis of understanding the dish names, for example by understanding and imagining the ingredients and the dish corresponding to each name. However, we often encounter ingredients we have never eaten or are unfamiliar with. A reading enhancement tool can identify the ingredients in a dish name (i.e., nested entities) and link them to related entities and ingredient introductions to help us understand, as shown in fig. 18c.
4. Image labeling (Image Tagging)
Image tagging is a tool that helps users edit image tags quickly, and similar functionality has already been applied in many smartphones. When a user wants to tag an image or a screenshot, the tool can automatically extract some key phrases from the text content in the image and provide them to the user as candidates for selection and editing. As shown in fig. 18d, key phrases of different granularities can be automatically extracted from the text content using the nested entity recognition technology provided in the present application; only an entity importance ranking needs to be added.
5. Construction of a Knowledge Graph
Knowledge graphs are widely applied in question answering systems, recommendation systems, search engines and other fields, which makes the automatic construction of large-scale, complete knowledge graphs particularly important. Nested entity recognition can provide richer entity relationships for the knowledge graph completion task. For example, in fig. 18e, relation extraction is first performed on a sentence based on the nested entity recognition result for that sentence, acquiring the relation of each entity in the sentence. Knowledge graph completion (KG completion) is then performed on this basis to finally obtain the knowledge graph (KG).
If the nested entities are not found, the related entity relationships are lost; for example, only:
hasLocation(Adam Ferguson Building,Edinburgh);
if these nested entities are all discovered, additional entity relationships can be obtained, such as:
partOf(Edinburgh University Library,Edinburgh University);
hasLocation(Edinburgh University,Edinburgh)。
fig. 19 is a block diagram illustrating a structure of an entity identification apparatus according to an embodiment of the present application, and as shown in fig. 19, the apparatus 1800 may include: an entity boundary word obtaining module 1801, an entity candidate area obtaining module 1802, and an entity recognition result obtaining module 1803, where:
the entity boundary word acquiring module 1801 is configured to acquire at least one entity boundary word corresponding to a text sequence to be recognized;
the entity candidate region obtaining module 1802 is configured to obtain at least one entity candidate region in the text sequence to be recognized based on the at least one entity boundary word;
the entity identification result obtaining module 1803 is configured to obtain an entity identification result of the text sequence to be identified based on the entity candidate region.
Compared with the prior art, the scheme provided by the embodiments of the application can improve the coverage of the entity candidate regions over the entities in the text sequence to be recognized without increasing the number of entity candidate regions, and reduces the computational complexity.
In an optional embodiment of the present application, the entity boundary word obtaining module is specifically configured to:
respectively taking all words in the text sequence to be recognized as entity boundary words; or,
and acquiring the probability of the words in the text sequence to be recognized as entity boundary words based on the background expression vectors of the words in the text sequence to be recognized, and determining the entity boundary words of the text sequence to be recognized based on the probability.
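As a minimal sketch of the second option, assuming the per-word probability comes from a simple linear layer followed by a sigmoid over each background representation vector, and that boundary words are kept by a probability threshold (the layer shape and the 0.5 threshold are illustrative assumptions):

```python
import numpy as np

def detect_boundary_words(background_vecs, w, b, threshold=0.5):
    """Score each word's probability of being an entity boundary word and
    keep the indices whose probability exceeds the threshold.

    background_vecs: (L, D) background representation vectors of the sequence
    w: (D,) weight vector and b: scalar bias of the binary classifier
    """
    logits = background_vecs @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))        # sigmoid
    boundary_ids = [i for i, p in enumerate(probs) if p > threshold]
    return boundary_ids, probs
```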
In an optional embodiment of the present application, the entity candidate region obtaining module is specifically configured to:
acquiring an entity suggestion region corresponding to the text sequence to be recognized based on the entity boundary words;
and acquiring a corresponding entity candidate region based on the entity suggested region.
In an optional embodiment of the present application, the entity candidate region obtaining module is further configured to:
and based on at least one preset width, respectively taking the entity boundary words as anchor words, and acquiring at least one corresponding entity suggestion region with the preset width.
In an optional embodiment of the present application, the entity candidate region obtaining module is further configured to:
acquiring a corresponding combination vector based on the background representation vector of the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word;
acquiring similarity between a background representation vector and a combined vector of entity boundary words in a text sequence to be recognized;
and acquiring a corresponding entity candidate region based on the similarity.
In an optional embodiment of the present application, the entity candidate region obtaining module is further configured to:
and in a Euclidean space or a hyperbolic space, acquiring similarity between a background representation vector and a combined vector of entity boundary words in a text sequence to be recognized.
In an optional embodiment of the present application, the entity candidate region obtaining module is further configured to:
determining initial boundary words of corresponding entity candidate regions from anchor words of the entity suggestion regions in the text sequence to be recognized and entity boundary words positioned on the left sides of the anchor words based on the similarity, and determining termination boundary words of the corresponding entity candidate regions from the anchor words of the entity suggestion regions in the text sequence to be recognized and the entity boundary words positioned on the right sides of the anchor words;
and determining a corresponding entity candidate area based on the starting boundary word and the ending boundary word.
In an optional embodiment of the present application, the entity candidate region obtaining module is further configured to:
taking the width of the entity suggestion region as the width of a convolution kernel, and performing convolution processing on a background expression vector of a word covered by the entity suggestion region to obtain a corresponding feature vector;
and acquiring a corresponding combined vector based on the feature vector corresponding to the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word.
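A minimal sketch of this combination step, assuming a single 1-D convolution whose kernel width equals the width of the entity suggestion region (so one application of the kernel yields one feature vector) and assuming concatenation as the way that feature vector is combined with the anchor word's background vector; shapes and names are illustrative:

```python
import numpy as np

def combine_region_and_anchor(region_vecs, anchor_vec, kernel):
    """region_vecs: (w, D) background vectors of the words covered by an
    entity suggestion region of width w; kernel: (w, D, F) convolution
    weights whose width equals the region width, so a single application
    reduces the region to one F-dimensional feature vector."""
    feat = np.tanh(np.einsum('wd,wdf->f', region_vecs, kernel))
    return np.concatenate([feat, anchor_vec])    # combined vector
```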
In an optional embodiment of the present application, the entity candidate region obtaining module is further configured to:
determining initial boundary word candidates and termination boundary word candidates of anchor words in the entity suggestion region;
determining initial boundary words of the entity suggestion region in the initial boundary word candidates, and determining termination boundary words of the entity suggestion region in the termination boundary word candidates;
and determining a corresponding entity candidate area according to the obtained initial boundary word and the termination boundary word.
In an optional embodiment of the present application, the entity candidate region obtaining module is further configured to:
determining the anchor word in the entity suggestion region and the boundary word positioned on the left side of the anchor word as an initial boundary word candidate of the anchor word;
and determining the anchor word in the entity suggestion region and the boundary word positioned at the right side of the anchor word as the termination boundary word candidate of the anchor word.
In an optional embodiment of the present application, the entity candidate region obtaining module is further configured to:
determining a first probability of each starting boundary word candidate serving as a starting boundary word of the entity suggestion region and a second probability of each ending boundary word candidate serving as an ending boundary word of the entity suggestion region;
determining a starting boundary word of the entity suggestion region based on the first probability, and determining an ending boundary word of the entity suggestion region according to the second probability.
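A minimal sketch of this selection step, assuming the first and second probabilities are obtained by a softmax over per-candidate scores (for example, similarities to the anchor word) and the boundary words are chosen by argmax; the scoring inputs are placeholders:

```python
import numpy as np

def pick_boundaries(start_scores, end_scores, start_candidates, end_candidates):
    """Choose the starting and ending boundary words of an entity suggestion
    region from candidate scores."""
    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()
    p_start = softmax(np.asarray(start_scores))  # first probability
    p_end = softmax(np.asarray(end_scores))      # second probability
    return (start_candidates[int(p_start.argmax())],
            end_candidates[int(p_end.argmax())])
```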
In an optional embodiment of the present application, the entity identification result obtaining module is specifically configured to:
screening the entity candidate region to obtain the screened entity candidate region;
and judging the category of the screened entity candidate area to obtain an entity identification result of the text sequence to be identified.
In an optional embodiment of the present application, the entity identification result obtaining module is further configured to:
acquiring a corresponding first classification characteristic vector based on the background expression vector of the word covered by the entity candidate region;
acquiring the probability that the entity candidate region belongs to the entity based on the first classification characteristic vector corresponding to the entity candidate region;
and acquiring the screened entity candidate region based on the probability that the entity candidate region belongs to the entity.
In an optional embodiment of the present application, the entity identification result obtaining module is further configured to:
acquiring corresponding second classification characteristic vectors based on background expression vectors of starting boundary words and ending boundary words corresponding to the screened entity candidate regions;
and performing category judgment based on the second classification characteristic vector corresponding to the screened entity candidate region to obtain a corresponding entity identification result.
In an optional embodiment of the present application, the entity identification result obtaining module is specifically configured to:
acquiring a corresponding third classification characteristic vector based on the background representation vectors of the starting boundary word and the ending boundary word corresponding to the entity candidate region;
and performing category judgment based on the third classification characteristic vector corresponding to the entity candidate area to obtain a corresponding entity identification result.
In an optional embodiment of the present application, the entity candidate region obtaining module is specifically configured to:
acquiring a preset number of entity boundary words adjacent to the entity boundary words from a text sequence to be recognized;
acquiring background representation vectors of the entity boundary words, and respectively obtaining similarity between the background representation vectors of the entity boundary words and the corresponding adjacent preset number of the entity boundary words;
and acquiring a corresponding entity candidate region based on the similarity.
In an optional embodiment of the present application, the entity candidate region obtaining module is further configured to:
respectively determining a starting boundary word and an ending boundary word of a corresponding entity candidate region from the entity boundary words of the text sequence to be recognized and the adjacent preset number of the entity boundary words based on the similarity;
and determining a corresponding entity candidate area based on the starting boundary word and the ending boundary word.
Based on the same principle, an embodiment of the present application further provides an electronic device, where the electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method provided in any optional embodiment of the present application is implemented, and specifically, the following situations are implemented:
acquiring at least one entity boundary word corresponding to a text sequence to be recognized; acquiring at least one entity candidate region in a text sequence to be recognized based on at least one entity boundary word; and acquiring an entity recognition result of the text sequence to be recognized based on the entity candidate region.
The embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method shown in any embodiment of the present application.
It is to be understood that the medium may store a computer program corresponding to the entity recognition method described above.
Fig. 20 is a schematic structural diagram of an electronic device to which the embodiment of the present application is applied, and as shown in fig. 20, an electronic device 1900 shown in fig. 20 includes: a processor 1901 and a memory 1903. The processor 1901 is coupled to the memory 1903, such as via the bus 1902. Further, the electronic device 1900 may further include a transceiver 1904, and the electronic device 1900 may interact with other electronic devices through the transceiver 1904. In addition, the transceiver 1904 is not limited to one in practical applications, and the structure of the electronic device 1900 is not limited to the embodiment of the present application.
The processor 1901, applied in the embodiment of the present application, may be used to implement the functions of the entity identifying apparatus shown in fig. 19.
The processor 1901 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 1901 may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 1902 may include a path that conveys information between the aforementioned components. The bus 1902 may be a PCI bus or an EISA bus, etc. The bus 1902 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 20, but this is not intended to represent only one bus or type of bus.
The memory 1903 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 1903 is used for storing application program codes for executing the scheme of the application, and is controlled by the processor 1901 to execute. The processor 1901 is configured to execute application program code stored in the memory 1903 to implement the actions of the entity identifying apparatus provided by the embodiment shown in fig. 19.
The apparatus provided in the embodiment of the present application may implement at least one of the modules through an AI model. The functions associated with the AI may be performed by the non-volatile memory, the volatile memory, and the processor.
The processor may include one or more processors. At this time, the one or more processors may be general-purpose processors, such as a Central Processing Unit (CPU), an Application Processor (AP), or the like, or pure graphics processing units, such as a Graphics Processing Unit (GPU), a Vision Processing Unit (VPU), and/or AI-specific processors, such as a Neural Processing Unit (NPU).
The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. Predefined operating rules or artificial intelligence models are provided through training or learning.
Here, the provision by learning means that a predefined operation rule or an AI model having a desired characteristic is obtained by applying a learning algorithm to a plurality of learning data. This learning may be performed in the device itself in which the AI according to the embodiment is performed, and/or may be implemented by a separate server/system.
The AI model may include a plurality of neural network layers. Each layer has a plurality of weight values, and the calculation of one layer is performed using the calculation result of the previous layer and the plurality of weights of the current layer. Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), and deep Q networks.
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data to make, allow, or control the target device to make a determination or prediction. Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific method implemented by the computer-readable medium described above when executed by the electronic device may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing descriptions are merely some embodiments of the present application. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present application, and these modifications and refinements shall also fall within the protection scope of the present application.

Claims (20)

1. An entity identification method, comprising:
acquiring at least one entity boundary word corresponding to a text sequence to be recognized;
acquiring at least one entity candidate region in the text sequence to be recognized based on the at least one entity boundary word;
and acquiring an entity recognition result of the text sequence to be recognized based on the entity candidate region.
2. The method according to claim 1, wherein the obtaining at least one entity boundary word corresponding to the text sequence to be recognized comprises:
respectively taking all words in the text sequence to be recognized as entity boundary words; or,
and acquiring the probability of the words in the text sequence to be recognized as entity boundary words based on the background expression vectors of the words in the text sequence to be recognized, and determining the entity boundary words of the text sequence to be recognized based on the probability.
3. The method according to claim 1 or 2, wherein the obtaining at least one entity candidate region in the text sequence to be recognized based on the at least one entity boundary word comprises:
acquiring an entity suggested region corresponding to the text sequence to be recognized based on the entity boundary words;
and acquiring a corresponding entity candidate region based on the entity suggested region.
4. The method according to claim 3, wherein the obtaining of the entity suggestion region corresponding to the text sequence to be recognized based on the entity boundary word comprises:
and based on at least one preset width, respectively taking the entity boundary words as anchor words, and acquiring at least one corresponding entity suggestion region with the preset width.
5. The method of claim 4, wherein the obtaining the corresponding entity candidate region based on the entity-suggested region comprises:
acquiring a corresponding combination vector based on the background representation vector of the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word;
acquiring the similarity between the background representation vector of the entity boundary word in the text sequence to be recognized and the combined vector;
and acquiring a corresponding entity candidate region based on the similarity.
6. The method according to claim 5, wherein the obtaining the similarity between the background representation vector of the entity boundary word in the text sequence to be recognized and the combination vector comprises:
and in a Euclidean space or a hyperbolic space, acquiring the similarity between a background representation vector of an entity boundary word in the text sequence to be recognized and the combined vector.
7. The method of claim 5, wherein the obtaining the corresponding entity candidate region based on the similarity comprises:
determining starting boundary words of corresponding entity candidate regions from the anchor words of the entity suggestion regions in the text sequence to be recognized and the entity boundary words positioned on the left sides of the anchor words based on the similarity, and determining ending boundary words of the corresponding entity candidate regions from the anchor words of the entity suggestion regions in the text sequence to be recognized and the entity boundary words positioned on the right sides of the anchor words;
and determining a corresponding entity candidate area based on the starting boundary word and the ending boundary word.
8. The method of claim 5, wherein obtaining a corresponding combined vector based on the background representation vector of the words covered by the entity suggestion region and the background representation vector of the corresponding anchor word comprises:
taking the width of the entity suggestion region as the width of a convolution kernel, and performing convolution processing on a background expression vector of a word covered by the entity suggestion region to obtain a corresponding feature vector;
and acquiring a corresponding combined vector based on the feature vector corresponding to the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word.
9. The method of claim 4, wherein the obtaining the corresponding entity candidate region based on the entity proposed region comprises:
determining initial boundary word candidates and termination boundary word candidates of anchor words in the entity suggestion region;
determining initial boundary words of the entity suggestion region in the initial boundary word candidates, and determining termination boundary words of the entity suggestion region in the termination boundary word candidates;
and determining a corresponding entity candidate area according to the obtained initial boundary word and the termination boundary word.
10. The method of claim 9, wherein the determining initial boundary word candidates and termination boundary word candidates of anchor words in the entity suggestion region comprises:
Determining the anchor word in the entity suggestion region and the boundary word positioned on the left side of the anchor word as an initial boundary word candidate of the anchor word;
and determining the anchor word in the entity suggestion region and the boundary word positioned at the right side of the anchor word as the termination boundary word candidate of the anchor word.
11. The method of claim 9, wherein determining a starting boundary word for an entity suggestion region in a starting boundary word candidate and determining an ending boundary word for the entity suggestion region in an ending boundary word candidate comprises:
determining a first probability of each starting boundary word candidate serving as a starting boundary word of the entity suggestion region and a second probability of each ending boundary word candidate serving as an ending boundary word of the entity suggestion region;
determining a starting boundary word for an entity suggestion region based on the first probability, and determining an ending boundary word for an entity suggestion region according to the second probability.
12. The method according to any one of claims 1 to 11, wherein the obtaining of the entity recognition result of the text sequence to be recognized based on the entity candidate region comprises:
screening the entity candidate region to obtain the screened entity candidate region;
and judging the category of the screened entity candidate region to obtain an entity recognition result of the text sequence to be recognized.
13. The method of claim 12, wherein the screening the entity candidate region to obtain the screened entity candidate region comprises:
acquiring a corresponding first classification characteristic vector based on the background expression vector of the word covered by the entity candidate region;
acquiring the probability that the entity candidate region belongs to the entity based on the first classification characteristic vector corresponding to the entity candidate region;
and acquiring the screened entity candidate region based on the probability that the entity candidate region belongs to the entity.
14. The method according to claim 12 or 13, wherein the determining the category of the entity candidate region after the screening to obtain the entity recognition result of the text sequence to be recognized comprises:
acquiring corresponding second classification characteristic vectors based on background expression vectors of starting boundary words and ending boundary words corresponding to the screened entity candidate regions;
and performing category judgment based on the second classification characteristic vector corresponding to the screened entity candidate region to obtain a corresponding entity identification result.
15. The method according to any one of claims 1 to 11, wherein the obtaining of the entity recognition result of the text sequence to be recognized based on the entity candidate region comprises:
acquiring a corresponding third classification characteristic vector based on the background representation vectors of the starting boundary word and the ending boundary word corresponding to the entity candidate region;
and performing category judgment based on the third classification characteristic vector corresponding to the entity candidate area to obtain a corresponding entity identification result.
16. The method according to claim 1 or 2, wherein the obtaining at least one entity candidate region in the text sequence to be recognized based on the at least one entity boundary word comprises:
acquiring a preset number of entity boundary words adjacent to the entity boundary words from the text sequence to be recognized;
acquiring the background representation vectors of the entity boundary words, and respectively obtaining the similarity between the background representation vectors of the entity boundary words and the corresponding adjacent preset number of the entity boundary words;
and acquiring a corresponding entity candidate region based on the similarity.
17. The method of claim 16, wherein the obtaining the corresponding entity candidate region based on the similarity comprises:
respectively determining a starting boundary word and an ending boundary word of a corresponding entity candidate region from the entity boundary words of the text sequence to be recognized and the adjacent preset number of the entity boundary words based on the similarity;
and determining a corresponding entity candidate area based on the starting boundary word and the ending boundary word.
18. An entity identification apparatus, comprising:
the entity boundary word acquisition module is used for acquiring at least one entity boundary word corresponding to the text sequence to be recognized;
an entity candidate region obtaining module, configured to obtain at least one entity candidate region in the text sequence to be recognized based on the at least one entity boundary word;
and the entity identification result acquisition module is used for acquiring the entity identification result of the text sequence to be identified based on the entity candidate region.
19. An electronic device comprising a memory and a processor;
the memory has stored therein a computer program;
the processor for executing the computer program to implement the method of any one of claims 1 to 17.
20. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the method of any one of claims 1 to 17.
CN202110624434.9A 2020-07-01 2021-06-04 Entity identification method, entity identification device, electronic equipment and computer readable storage medium Pending CN113886571A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/KR2021/008223 WO2022005188A1 (en) 2020-07-01 2021-06-30 Entity recognition method, apparatus, electronic device and computer readable storage medium
US17/715,436 US20220245347A1 (en) 2020-07-01 2022-04-07 Entity recognition method, apparatus, electronic device and computer readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2020106281324 2020-07-01
CN202010628132 2020-07-01

Publications (1)

Publication Number Publication Date
CN113886571A true CN113886571A (en) 2022-01-04

Family

ID=79010179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110624434.9A Pending CN113886571A (en) 2020-07-01 2021-06-04 Entity identification method, entity identification device, electronic equipment and computer readable storage medium

Country Status (3)

Country Link
US (1) US20220245347A1 (en)
CN (1) CN113886571A (en)
WO (1) WO2022005188A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114372470A (en) * 2022-03-22 2022-04-19 中南大学 Chinese legal text entity identification method based on boundary detection and prompt learning
CN114462391A (en) * 2022-03-14 2022-05-10 和美(深圳)信息技术股份有限公司 Nested entity identification method and system based on comparative learning

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118093B (en) * 2022-01-27 2022-04-15 华东交通大学 Method and system for identifying flat mark enhanced nested named entity
CN115098617A (en) * 2022-06-10 2022-09-23 杭州未名信科科技有限公司 Method, device and equipment for labeling triple relation extraction task and storage medium
WO2024021343A1 (en) * 2022-07-29 2024-02-01 苏州思萃人工智能研究所有限公司 Natural language processing method, computer device, readable storage medium, and program product
CN115905456B (en) * 2023-01-06 2023-06-02 浪潮电子信息产业股份有限公司 Data identification method, system, equipment and computer readable storage medium
CN116757216B (en) * 2023-08-15 2023-11-07 之江实验室 Small sample entity identification method and device based on cluster description and computer equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7865356B2 (en) * 2004-07-15 2011-01-04 Robert Bosch Gmbh Method and apparatus for providing proper or partial proper name recognition
US20090249182A1 (en) * 2008-03-31 2009-10-01 Iti Scotland Limited Named entity recognition methods and apparatus
US9317498B2 (en) * 2014-05-23 2016-04-19 Codeq Llc Systems and methods for generating summaries of documents
US10146853B2 (en) * 2015-05-15 2018-12-04 International Business Machines Corporation Determining entity relationship when entities contain other entities
CN107229609B (en) * 2016-03-25 2021-08-13 佳能株式会社 Method and apparatus for segmenting text

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114462391A (en) * 2022-03-14 2022-05-10 和美(深圳)信息技术股份有限公司 Nested entity identification method and system based on comparative learning
CN114462391B (en) * 2022-03-14 2024-05-14 和美(深圳)信息技术股份有限公司 Nested entity identification method and system based on contrast learning
CN114372470A (en) * 2022-03-22 2022-04-19 中南大学 Chinese legal text entity identification method based on boundary detection and prompt learning
CN114372470B (en) * 2022-03-22 2022-07-29 中南大学 Chinese law text entity identification method based on boundary detection and prompt learning

Also Published As

Publication number Publication date
US20220245347A1 (en) 2022-08-04
WO2022005188A1 (en) 2022-01-06

Similar Documents

Publication Publication Date Title
CN113886571A (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN109992782B (en) Legal document named entity identification method and device and computer equipment
US20230100376A1 (en) Text sentence processing method and apparatus, computer device, and storage medium
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
JP5128629B2 (en) Part-of-speech tagging system, part-of-speech tagging model training apparatus and method
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN110442741B (en) Tensor fusion and reordering-based cross-modal image-text mutual search method
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN115098706A (en) Network information extraction method and device
CN110347853B (en) Image hash code generation method based on recurrent neural network
CN114444515A (en) Relation extraction method based on entity semantic fusion
Cui et al. A chinese text classification method based on bert and convolutional neural network
CN112699685A (en) Named entity recognition method based on label-guided word fusion
US20240037335A1 (en) Methods, systems, and media for bi-modal generation of natural languages and neural architectures
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN114637846A (en) Video data processing method, video data processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination