CN116151241A - Entity identification method and device - Google Patents
- Publication number: CN116151241A (application CN202310417766.9A)
- Authority
- CN
- China
- Prior art keywords
- vector
- span
- character
- unit
- entity
- Prior art date
- Legal status (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed): Granted
Classifications
- G06F40/279—Handling natural language data; Natural language analysis; Recognition of textual entities
- G06F16/3344—Information retrieval of unstructured textual data; Querying; Query execution using natural language analysis
- G06F16/353—Information retrieval of unstructured textual data; Clustering; Classification into predefined classes
- G06F17/16—Complex mathematical operations; Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F40/30—Handling natural language data; Semantic analysis
- Y02D10/00—Climate change mitigation technologies in ICT; Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides an entity recognition method and device. The entity recognition method performs character embedding on an input text, generating a unique vector representation for each character; determines the potential entity regions and their corresponding context regions in the text by enumerating the span units of the input sequence; jointly models the potential entity regions and context regions using a graph convolutional network and a multi-head attention layer; and inputs the joint modeling result into a classifier to determine the entity class of each potential entity region. The method can efficiently and accurately identify the entity information contained in unstructured sequence text. When deciding whether a character sequence in the text is an entity, the invention considers the semantic information of the sequence itself and fully models the context information formed by the remaining characters, effectively improving entity recognition accuracy.
Description
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method and apparatus for entity identification.
Background
Natural text is typically propagated and recorded as unstructured sequences, which contain a large amount of entity information expressing specific concepts, such as names of people, places, organizations, and institutions, as shown in fig. 1. Quickly and accurately identifying the entity information in unstructured sequence text is one of the key technologies for building question-answering systems and recommendation systems.
Entity identification in unstructured sequence text is highly complex: syntactic, semantic, and contextual features must all be considered at the same time, and traditional rule-based information extraction methods can hardly meet the entity identification requirements of such text. Human beings can read unstructured sequence text and pick out its entity information, but manual entity identification cannot cope with massive data.
Disclosure of Invention
Aiming at the technical problems in the related art, the invention provides an entity identification method, which comprises the following steps:
S1, performing character embedding on an input text and generating a unique vector representation for each character to obtain the vector sequence X of the input text;
S2, enumerating the span units in the input text vector sequence X to obtain the span set S of the input text;
S3, inputting the span set S into a bidirectional graph convolution to generate the semantic feature vector e_s of each span region;
S4, inputting the semantic feature vector e_s into a bidirectional long short-term memory network to obtain the context information H;
S5, obtaining from the context information H, through a nonlinear transformation, the joint modeling result g_s of the span unit's semantic features and contextual features;
S6, inputting the joint modeling result g_s into a classifier to obtain the entity class.
Specifically, the step S1 includes: S11, randomly initializing a feature matrix E as the character embedding matrix, where V, the number of rows of E, is the length of the character table, and d, the number of columns, is the embedding dimension of each character;
S12, for each character in the input text, indexing its vector representation out of the feature matrix E according to the character's id in the character table.
Specifically, the step S3 includes:
S31, reconstructing the chain-structured span sequence into a graph structure;
S32, aggregating the features of each node in the graph with a bidirectional graph convolution layer;
S33, cumulatively averaging all the nodes in the feature graph to compute the semantic feature representation e_s of the span unit.
Specifically, the step S4 includes:
S41, using the feature vector e_s to replace the vectors of the span unit in the original input vector sequence, i.e. X becomes X';
S42, constructing a bidirectional long short-term memory network to model the sequence features of X';
S43, aggregating, based on a self-attention mechanism, the dependency between the span feature e_s and the context features H of the sequence, with the calculation formula:

g = softmax(e_s · W · H^T) · H

where W is a parameter matrix of dimension d×d and softmax is the normalized exponential function; e_s is the feature vector of the span unit, of dimension d; H is the feature matrix formed by the state feature vectors of the bidirectional long short-term memory network; g, the joint modeling output of the span semantic features and context features, is a vector of dimension d.
Specifically, the step S5 includes:
S51, repeating step S4 l times to model the semantic and contextual features in depth, the output feature vectors being denoted g^(1), …, g^(l);
S52, inputting the feature vector g^(l) into a final nonlinear transformation to output the joint modeling result g_s of the semantic features and contextual features.
In a second aspect, another embodiment of the present invention discloses an entity recognition apparatus, including:
an input text vector generation unit for performing character embedding on the input text and generating a unique vector representation for each character to obtain the vector sequence X of the input text;
a span set generation unit for enumerating the span units in the input text vector sequence X to obtain the span set S of the input text;
a semantic feature vector generation unit for inputting the span set S into a bidirectional graph convolution to generate the semantic feature vector e_s of each span region;
a context information generation unit for inputting the semantic feature vector e_s into a bidirectional long short-term memory network to obtain the context information H;
a joint modeling result generation unit for obtaining from the context information H, through a nonlinear transformation, the joint modeling result g_s of the span unit's semantic features and contextual features;
an entity acquisition unit for inputting the joint modeling result g_s into a classifier to obtain the entity class.
Specifically, the input text vector generation unit includes: an embedding matrix initialization unit for randomly initializing a feature matrix E as the character embedding matrix, where V is the length of the character table and d is the embedding dimension of each character;
a vector generation unit for indexing, for each character in the input text, its vector representation out of the feature matrix E according to the character's id in the character table.
Specifically, the semantic feature vector generation unit includes:
a graph structure reconstruction unit for reconstructing the chain-structured span sequence into a graph structure;
a bidirectional graph convolution construction unit for aggregating the features of each node in the graph with a bidirectional graph convolution layer;
a semantic feature representation calculation unit for cumulatively averaging all the nodes in the feature graph to compute the semantic feature representation e_s of the span unit.
Specifically, the context information generation unit includes:
a first vector replacement unit for using the feature vector e_s to replace the vectors of the span unit in the original input vector sequence, i.e. X becomes X';
a bidirectional long short-term memory network construction unit for constructing a bidirectional long short-term memory network to model the sequence features of X';
a first joint modeling unit for aggregating, based on a self-attention mechanism, the dependency between the span feature e_s and the context features H of the sequence, with the calculation formula:

g = softmax(e_s · W · H^T) · H

where W is a parameter matrix of dimension d×d and softmax is the normalized exponential function; e_s is the feature vector of the span unit, of dimension d; H is the feature matrix formed by the state feature vectors of the bidirectional long short-term memory network; g, the joint modeling output of the span semantic features and context features, is a vector of dimension d.
Specifically, the joint modeling result generation unit includes:
a first execution unit for running the context information generation unit l times to model the semantic and contextual features in depth, the output feature vectors being denoted g^(1), …, g^(l);
a second modeling unit for inputting the feature vector g^(l) into a final nonlinear transformation to output the joint modeling result g_s of the semantic features and contextual features.
In a third aspect, another embodiment of the present invention discloses a nonvolatile memory having instructions stored thereon which, when executed by a processor, implement the entity identification method described above.
The entity recognition method of the invention performs character embedding on an input text, generating a unique vector representation for each character; determines the potential entity regions and their corresponding context regions in the text by enumerating the span units of the input sequence; jointly models the potential entity regions and context regions using a graph convolutional network and a multi-head attention layer; and inputs the joint modeling result into a classifier to determine the entity class of each potential entity region. The method can efficiently and accurately identify the entity information contained in unstructured sequence text. When deciding whether a character sequence in the text is an entity, the invention considers the semantic information of the sequence itself and fully models the context information formed by the remaining characters, effectively improving entity recognition accuracy. Moreover, by enumeration, every character subsequence of the text is treated as a potential entity, so overlapping entity information in the text can be identified well.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of unstructured text provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a method for entity identification according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a text embedding process provided by an embodiment of the present invention;
FIG. 4 is a span enumeration schematic of input text of length 4 provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of joint modeling of span semantic features and context features provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a two-way long and short term memory network according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an entity identification device according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an entity identification device according to another embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the invention, fall within the scope of protection of the invention.
Example 1
Referring to fig. 2, the embodiment discloses an entity identification method, which includes the following steps:
s1, character embedding is carried out on an input text, and a unique direction is generated for each characterVector sequence for quantity representation to obtain input text;
The computer cannot directly perform calculations on text characters, and this embodiment requires that characters in the input text be mapped to vector space first.
The specific step S1 includes: s11, randomly initializing a feature matrixAn embedding matrix as a character, wherein->Is the length of the character table, < >>Representing the embedding dimension of each character.
Specifically, the character table of the embodiment may be obtained by counting the number of different characters in the corpus. In another embodiment the character table may also be pre-set.
S12, for each character in the input text, indexing its vector representation out of the feature matrix E according to the character's id in the character table.
Referring to fig. 3, fig. 3 is a schematic diagram of the text character embedding process. In this embodiment, a corpus is first obtained, containing texts such as "I love my motherland", …, "I love my hometown". The characters in the corpus are then counted to obtain a character table, and each character in the character table has a unique id, for example: I (1), love (2), my (3), mother (4), land (5), …, home (V−1), town (V).
Assume the character sequence of the input text is c = {c_1, c_2, …, c_n}, where n is the character length of the input text. The input text can then be represented as a vector sequence X = {x_1, x_2, …, x_n}, where x_i is a character vector of dimension d.
Referring to fig. 3, for the character "love" in the input text "I love my motherland", the generated vector x_2 is the row of E indexed by the id of "love".
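As an illustrative sketch of steps S11 and S12 (the character table, dimensions, and variable names here are our own, not the patent's):

```python
import numpy as np

# Hypothetical character table mapping each character to a unique id.
char_table = {"I": 0, "love": 1, "my": 2, "home": 3}
V, d = len(char_table), 8            # V: character-table length, d: embedding dimension

rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))          # S11: randomly initialised embedding matrix E (V x d)

tokens = ["I", "love", "my", "home"]
# S12: index each character's vector representation out of E by its id.
X = np.stack([E[char_table[t]] for t in tokens])

assert X.shape == (len(tokens), d)
```

The lookup is a pure row-indexing operation; in training, the rows of E would be updated like any other parameters.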
S2, enumerating the span units in the input text vector sequence X to obtain the span set S of the input text;
wherein the span set represents the potential entity regions of the input text and their corresponding context regions;
This embodiment defines any contiguous subsequence of arbitrary length in the input text as a span unit (Span), each span unit being regarded as a potential entity region to be identified. Specifically, assuming the input text sequence has length N, N(N+1)/2 span units can be enumerated; the subsequent neural network model then models all enumerated span units and judges whether each span unit is an entity and, if so, to which type of entity it belongs. Fig. 4 is a span enumeration schematic of an input text of length 4, in which a total of 10 span units can be enumerated.
Assume the vector sequence of the input text is X = {x_1, …, x_n}; the span set S can then be obtained by enumeration, where each span unit s_{i,k} = {x_i, …, x_k} with 1 ≤ i ≤ k ≤ n.
S3, inputting the span set S into a bidirectional graph convolution to generate the semantic feature vector e_s of each span region;
The neural network model designed in this embodiment jointly models span semantic features and context features to generate a unique feature representation g_s for each span unit s_{i,k} in the span set S. The specific operation process, shown in fig. 5, mainly consists of the following three steps:
S3, inputting the span set S into a bidirectional graph convolution to generate the semantic feature vector e_s of each span region;
S4, inputting the semantic feature vector e_s into a bidirectional long short-term memory network to obtain the context information H;
S5, obtaining from the context information H, through a nonlinear transformation, the joint modeling result g_s of the span unit's semantic features and contextual features;
This embodiment takes the three-character span unit s_{1,3}, i.e. i = 1 and k = 3, as an example and details the modeling process:
wherein step S3 includes:
S31, reconstructing the chain-structured span sequence into a graph structure.
During the reconstruction, the character vectors in the span unit serve as node features, and each node that comes earlier in the sequence points to the subsequent nodes. As shown in fig. 4, the span unit s_{1,3} has three nodes, where x_1 can point to x_2 and x_3, while x_2 can only point to x_3;
S32, aggregating the features of each node in the graph with a bidirectional graph convolution (BiGCN) layer.
Specifically, the nonlinear function ReLU and three sets of parameters W_f, W_b and b are used to nonlinearly transform the features of the neighborhood nodes and update the feature vector of each node, with the mathematical expression:

h_i = ReLU(W_f · a_i^in ⊕ W_b · a_i^out + b)

where a_i^in aggregates the features of the nodes pointing to node i, a_i^out aggregates the features of the nodes that node i points to, W_f and W_b are parameter matrices, b is a parameter vector, and ⊕ is the vector concatenation operation.
Operation example: [1,2,3] ⊕ [4,5,6] = [1,2,3,4,5,6].
S33, cumulatively averaging all the nodes in the feature graph to compute the semantic feature representation of the span unit:

e_s = (1 / (k − i + 1)) · Σ_j h_j
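A minimal NumPy sketch of S31-S33 follows. The mean in/out aggregation and the parameter shapes are our assumptions, since the patent's formula images did not survive extraction:

```python
import numpy as np

def bigcn_span_feature(S_vecs, W_f, W_b, b):
    """Sketch of S31-S33: treat the span's character vectors as nodes of a
    chain graph (earlier nodes point to later ones), aggregate each node's
    in- and out-neighbourhoods with a bidirectional graph-convolution layer,
    then average all node features into the span's semantic vector e_s."""
    n, d = S_vecs.shape
    H = np.empty_like(S_vecs)
    for i in range(n):
        agg_in = S_vecs[: i + 1].mean(axis=0)   # node i and its predecessors
        agg_out = S_vecs[i:].mean(axis=0)       # node i and its successors
        H[i] = np.maximum(0.0, agg_in @ W_f + agg_out @ W_b + b)  # ReLU update
    return H.mean(axis=0)                       # S33: cumulative average -> e_s

d = 4
rng = np.random.default_rng(1)
e_s = bigcn_span_feature(rng.normal(size=(3, d)),          # span of 3 characters
                         rng.normal(size=(d, d)),
                         rng.normal(size=(d, d)),
                         np.zeros(d))
assert e_s.shape == (d,)
```

The averaging in the last line keeps e_s at the character embedding dimension d regardless of span length.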
The step S4 specifically includes:
S41, using the feature vector e_s to replace the vectors of the span unit in the original input vector sequence, i.e. X = {x_1, …, x_i, …, x_k, …, x_n} becomes X' = {x_1, …, e_s, …, x_n};
S42, constructing a bidirectional long short-term memory network (BiLSTM) to model the sequence features of X';
The structure of the network is shown in fig. 6, where x_t is the input feature vector at the current time step, h_{t−1} and c_{t−1} are the two feature vectors output at the previous time step, and t denotes the position of a character in the input text.
The specific calculation formulas are as follows:

i_t = σ(W_i · (x_t ⊕ h_{t−1}) + b_i)
f_t = σ(W_f · (x_t ⊕ h_{t−1}) + b_f)
o_t = σ(W_o · (x_t ⊕ h_{t−1}) + b_o)
c̃_t = tanh(W_c · (x_t ⊕ h_{t−1}) + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where h_t denotes the feature vector at position t of X' and is computed from h_{t−1} of the previous time step; the W are parameter matrices and the b are parameter vectors; ⊕ is the vector concatenation operation; ⊙ denotes element-wise multiplication of the corresponding vector components, i.e. [1,2,3] ⊙ [4,5,6] = [4,10,18].
This embodiment uses the state vector h_t output by the bidirectional long short-term memory network (BiLSTM) at each time step t to form the context feature representation of X', denoted H = {h_1, h_2, …}.
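The BiLSTM pass over X' can be sketched as follows; the fused four-gate weight layout and the concatenation of the two directions' states are common conventions, assumed here rather than taken from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step; W maps the concatenation x (+) h_prev to all four
    gates at once (input, forget, output, candidate)."""
    z = np.concatenate([x, h_prev]) @ W + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # element-wise products
    h = sigmoid(o) * np.tanh(c)
    return h, c

def bilstm(X, W_fwd, b_fwd, W_bwd, b_bwd, d):
    """Run a forward and a backward LSTM over X' and concatenate the two
    state vectors at every position (a minimal sketch)."""
    n = X.shape[0]
    fwd, bwd = [], [None] * n
    h, c = np.zeros(d), np.zeros(d)
    for t in range(n):
        h, c = lstm_step(X[t], h, c, W_fwd, b_fwd)
        fwd.append(h)
    h, c = np.zeros(d), np.zeros(d)
    for t in reversed(range(n)):
        h, c = lstm_step(X[t], h, c, W_bwd, b_bwd)
        bwd[t] = h
    return np.stack([np.concatenate([f, bk]) for f, bk in zip(fwd, bwd)])

d_in, d = 4, 3
rng = np.random.default_rng(2)
W = rng.normal(size=(d_in + d, 4 * d)) * 0.1
Xp = rng.normal(size=(5, d_in))                 # X' after the span replacement
H = bilstm(Xp, W, np.zeros(4 * d), W, np.zeros(4 * d), d)
assert H.shape == (5, 2 * d)                    # one state vector per position
```

Each row of H is the context state for one position of X', which is exactly what the subsequent attention step consumes.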
S43, aggregating, based on a self-attention mechanism, the dependency between the span feature e_s and the context features H of the sequence, with the calculation formula:

g = softmax(e_s · W · H^T) · H

where W is a parameter matrix of dimension d×d and softmax is the normalized exponential function; e_s is the feature vector of the span unit, of dimension d; H is the feature matrix formed by the state feature vectors of the bidirectional long short-term memory network; g, the joint modeling output of the span semantic features and context features, is a vector of dimension d.
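A sketch of this aggregation, under our reading of the garbled formula (the span feature queries the BiLSTM states; the exact score function is an assumption):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_attention(e_s, H, W):
    """Single-head attention: g = softmax(e_s W H^T) H, following the
    dimensions given in the text (W: d x d, e_s: d, H: positions x d)."""
    scores = softmax(e_s @ W @ H.T)   # one attention weight per context position
    return scores @ H                 # g keeps the feature dimension d

d, n = 4, 6
rng = np.random.default_rng(3)
g = joint_attention(rng.normal(size=d),
                    rng.normal(size=(n, d)),
                    rng.normal(size=(d, d)))
assert g.shape == (d,)
```

The softmax weights sum to one, so g is a convex combination of the context states, weighted by their affinity to the span feature.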
The step S5 specifically includes:
S51, repeating step S4 l times to model the semantic and contextual features in depth, the output feature vectors being denoted g^(1), …, g^(l);
S52, inputting the feature vector g^(l) into the following transformation to output the joint modeling result g_s of the semantic features and contextual features:

g_s = max(W · g^(l) + b, 0)

where W is a parameter matrix, b is a parameter vector, and max(x, y) outputs the larger of x and y. g_s is the joint modeling result of the span unit s_{i,k} in the text D; a subsequent classifier performs entity recognition on it to output the entity class.
A linear classifier is constructed and the probability distribution of the entity class to which the span unit s_{i,k} belongs is computed by the normalized exponential function softmax:

p = softmax(W_c · g_s + b_c)

where W_c is a parameter matrix of dimension m×d, b_c is a parameter vector of dimension m, and m equals the number of entity classes in the corpus + 1 (non-entity is treated as a class). The classifier output p is of dimension m, where each dimension represents the probability that the character sequence represented by the span unit s_{i,k} belongs to a certain class of entity.
This embodiment takes the entity class corresponding to the dimension of p with the largest probability value as the entity recognition result for the character sequence represented by the span unit s_{i,k} in the input text D. For example, if there are 4 entity classes in the corpus and the second dimension of p has the largest value, the character sequence represented by s_{i,k} belongs to the second class of entities; if the dimension corresponding to non-entity has the largest value, the character sequence is a non-entity.
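The classifier step can be sketched directly from the definitions above (shapes and names are illustrative):

```python
import numpy as np

def classify_span(g_s, W_c, b_c):
    """Linear classifier + softmax over m = (#entity classes + 1) labels,
    returning the probability vector and the argmax class index."""
    logits = g_s @ W_c.T + b_c          # W_c assumed of shape (m, d)
    e = np.exp(logits - logits.max())
    p = e / e.sum()
    return p, int(np.argmax(p))

d, m = 4, 5                             # e.g. 4 entity classes + 1 non-entity
rng = np.random.default_rng(4)
p, cls = classify_span(rng.normal(size=d),
                       rng.normal(size=(m, d)),
                       np.zeros(m))
assert p.shape == (m,) and 0 <= cls < m
```

Taking the argmax of p implements "the entity class corresponding to the dimension with the largest probability value".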
The entity recognition method of this embodiment performs character embedding on an input text, generating a unique vector representation for each character; determines the potential entity regions and their corresponding context regions in the text by enumerating the span units of the input sequence; jointly models the potential entity regions and context regions using a graph convolutional network and a multi-head attention layer; and inputs the joint modeling result into a classifier to determine the entity class of each potential entity region. The method can efficiently and accurately identify the entity information contained in unstructured sequence text. When deciding whether a character sequence in the text is an entity, this embodiment considers the semantic information of the sequence itself and fully models the context information formed by the remaining characters, effectively improving entity recognition accuracy. By enumeration, every character subsequence of the text is treated as a potential entity, so overlapping entity information in the text can be identified well. For example, "Wuhan Yangtze River Bridge" is an entity, and the "Wuhan" contained in it is also an entity.
Example two
Referring to fig. 7, the present embodiment discloses an entity recognition apparatus, which includes the following units:
an input text vector generation unit for performing character embedding on the input text and generating a unique vector representation for each character to obtain the vector sequence of the input text;
A computer cannot directly perform calculations on text characters, so this embodiment first maps the characters of the input text into a vector space.
The specific input text vector generation unit includes: an embedding matrix initialization unit for randomly initializing a feature matrix E as the character embedding matrix, where V is the length of the character table and d is the embedding dimension of each character.
Specifically, the character table of this embodiment may be obtained by counting the distinct characters in a corpus. In another embodiment, the character table may also be preset.
a vector generation unit for indexing, for each character in the input text, its vector representation out of the feature matrix E according to the character's id in the character table.
Referring to fig. 3, fig. 3 is a schematic diagram of the text character embedding process. In this embodiment, a corpus is first obtained, containing texts such as "I love my motherland", …, "I love my hometown". The characters in the corpus are then counted to obtain a character table, and each character in the character table has a unique id, for example: I (1), love (2), my (3), mother (4), land (5), …, home (V−1), town (V).
Assume the character sequence of the input text is c = {c_1, c_2, …, c_n}, where n is the character length of the input text. The input text can then be represented as a vector sequence X = {x_1, x_2, …, x_n}, where x_i is a character vector of dimension d.
Referring to fig. 3, for the character "love" in the input text "I love my motherland", the generated vector x_2 is the row of E indexed by the id of "love".
a span set generation unit for enumerating the span units in the input text vector sequence X to obtain the span set S of the input text;
wherein the span set represents the potential entity regions of the input text and their corresponding context regions;
This embodiment defines any contiguous subsequence of arbitrary length in the input text as a span unit (Span), each span unit being regarded as a potential entity region to be identified. Specifically, assuming the input text sequence has length N, N(N+1)/2 span units can be enumerated; the subsequent neural network model then models all enumerated span units and judges whether each span unit is an entity and, if so, to which type of entity it belongs. Fig. 4 is a span enumeration schematic of an input text of length 4, in which a total of 10 span units can be enumerated.
Assume the vector sequence of the input text is X = {x_1, …, x_n}; the span set S can then be obtained by enumeration, where each span unit s_{i,k} = {x_i, …, x_k} with 1 ≤ i ≤ k ≤ n.
a semantic feature vector generation unit for inputting the span set S into a bidirectional graph convolution to generate the semantic feature vector e_s of each span region;
a context information generation unit for inputting the semantic feature vector e_s into a bidirectional long short-term memory network to obtain the context information H;
a joint modeling result generation unit for obtaining from the context information H, through a nonlinear transformation, the joint modeling result g_s of the span unit's semantic features and contextual features;
This embodiment takes the three-character span unit s_{1,3}, i.e. i = 1 and k = 3, as an example and details the modeling process:
wherein the semantic feature vector generation unit includes:
a graph structure reconstruction unit for reconstructing the chain-structured span sequence into a graph structure.
During the reconstruction, the character vectors in the span unit serve as node features, and each node that comes earlier in the sequence points to the subsequent nodes. As shown in fig. 4, the span unit s_{1,3} has three nodes, where x_1 can point to x_2 and x_3, while x_2 can only point to x_3;
a bidirectional graph convolution construction unit for aggregating the features of each node in the graph with a bidirectional graph convolution (BiGCN) layer.
Specifically, the nonlinear function ReLU and three sets of parameters W_f, W_b and b are used to nonlinearly transform the features of the neighborhood nodes and update the feature vector of each node, with the mathematical expression:

h_i = ReLU(W_f · a_i^in ⊕ W_b · a_i^out + b)

where a_i^in aggregates the features of the nodes pointing to node i, a_i^out aggregates the features of the nodes that node i points to, W_f and W_b are parameter matrices, b is a parameter vector, and ⊕ is the vector concatenation operation.
Operation example: [1,2,3] ⊕ [4,5,6] = [1,2,3,4,5,6].
a semantic feature representation calculation unit for cumulatively averaging all the nodes in the feature graph to compute the semantic feature representation of the span unit:

e_s = (1 / (k − i + 1)) · Σ_j h_j
The context information generation unit specifically includes:
a first vector replacement unit for using feature vectorsReplacement of vector sequences of span units in the original input vector sequence, i.e. +.>Become->;
Two-way long-short-term memory network construction unit for constructing two-way long-short-term memory network (BiLSTM) modelingIs a sequence feature of (2);
the structure of the network is shown in FIG. 6, in whichFor the input feature vector at the current time, +.>The two feature vectors are respectively output at the previous moment, and t represents the position of a character in the input text.
In the calculation, the state at position t is computed from the input feature vector at that position and the hidden state carried over from the previous time step, using a parameter matrix and a parameter vector; ‖ denotes the vector concatenation operation and ⊙ denotes element-wise multiplication of vectors, i.e. [a1, a2] ⊙ [b1, b2] = [a1·b1, a2·b2].
The present embodiment uses the bidirectional long short-term memory network (BiLSTM) to output a state vector at each time step t; these state vectors together constitute the context representation of the input sequence.
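The bidirectional pattern can be sketched as follows: a forward pass and a backward pass each maintain a running state, and the two states at each position t are concatenated. The recurrent update here is a deliberate placeholder (a real LSTM uses gated updates); only the scan-in-both-directions structure is the point.

```python
def toy_scan(xs):
    """One-directional recurrent scan with a placeholder update rule."""
    state, out = 0.0, []
    for x in xs:
        state = 0.5 * state + x  # stand-in for the LSTM cell update
        out.append(state)
    return out

def toy_bilstm(xs):
    """Pair the forward state and the backward state at each position t."""
    fwd = toy_scan(xs)
    bwd = list(reversed(toy_scan(list(reversed(xs)))))
    return list(zip(fwd, bwd))

print(toy_bilstm([1.0, 2.0]))  # [(1.0, 2.0), (2.5, 2.0)]
```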
A first joint modeling unit, used to aggregate, based on a self-attention mechanism, the dependency relationship between the span features and the context features in the sequence. In the calculation, softmax is the normalized exponential function; the feature vector of the span unit serves as the query, the feature matrix formed by the state feature vectors of the bidirectional long short-term memory network serves as the keys and values, and the output is the joint modeling of the span semantic features and the context features.
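A sketch of this step under the assumption that it is standard scaled dot-product attention: the span's feature vector queries the BiLSTM state vectors, a softmax over the scaled dot products gives the weights, and the weighted sum of the states is the joint output. The √d scaling and the exact tensor shapes are assumptions.

```python
import math

def softmax(scores):
    """Normalized exponential function over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Scaled dot-product attention of one span query over BiLSTM state vectors."""
    d = len(query)
    scores = [sum(q * s for q, s in zip(query, st)) / math.sqrt(d) for st in states]
    weights = softmax(scores)
    # weighted sum of the context state vectors
    return [sum(w * st[i] for w, st in zip(weights, states)) for i in range(d)]

out = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# The state aligned with the query receives the larger weight.
```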
The generation unit of the joint modeling result of the semantic features and the contextual features specifically comprises:
a first execution unit, used to repeatedly execute the context information generation unit l times, thereby deeply modeling the semantic features and the context features and producing an output feature vector;
A second modeling unit, used to take that feature vector as input and output the joint modeling result of the semantic features and the context features, where a parameter matrix and a parameter vector are applied and max(x, y) outputs the larger of x and y. The result is the joint modeling representation of the span unit in the input text D, on which a subsequent classifier performs entity recognition to output the entity category.
An entity acquisition unit, used to input the joint modeling result into a classifier to obtain the entity category.
A linear classifier is constructed, and the normalized exponential function softmax calculates the probability distribution over the entity categories to which the span unit may belong.
In the classifier, a parameter matrix and a parameter vector are applied, and the output dimension is equal to the number of entity categories in the corpus plus 1 (non-entity is treated as an additional category). Each dimension of the classifier output represents the probability value that the character sequence represented by the span unit belongs to the corresponding entity category.
This embodiment takes the entity category corresponding to the dimension with the largest probability value as the entity recognition result for the character sequence represented by the span unit in the input text D. For example, if the corpus contains 4 entity categories in total and the second dimension of the output has the largest probability value, the character sequence represented by the span unit belongs to the second entity category; if the dimension corresponding to non-entity has the largest probability value, the character sequence is a non-entity.
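The classification step can be sketched as softmax over per-category scores followed by argmax. The label set below is hypothetical (the patent does not name its categories); index 0 is assumed to stand for the non-entity class.

```python
import math

def classify(logits, labels):
    """Softmax over category scores, then return the argmax label and its probability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical label set: 4 entity categories plus the non-entity class.
labels = ["non-entity", "PER", "LOC", "ORG", "MISC"]
print(classify([0.1, 0.2, 3.0, 0.3, 0.1], labels))  # 'LOC' has the highest probability
```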
The entity recognition method of this embodiment performs character embedding on an input text, generating a unique vector representation for each character; determines potential entity regions and the corresponding context regions in the text by enumerating span units in the input sequence; jointly models the potential entity regions and context regions using a graph convolution network and a multi-head attention layer; and determines the entity category of each potential entity region by passing the joint modeling result through a classifier. The method can efficiently and accurately identify the entity information contained in unstructured sequence text. When judging whether a character sequence in the text is an entity, this embodiment considers both the semantic information of the sequence and the context information formed by the remaining characters, which effectively improves the accuracy of entity recognition. By enumeration, all character subsequences in the text are treated as potential entities, so overlapping entity information in the text can be identified well. For example, "Wuhan Yangtze Bridge" is an entity, and the "Wuhan" contained within it is also an entity.
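The enumeration step that makes overlapping entities visible can be sketched directly: every contiguous character subsequence is a candidate span, so "武汉" (Wuhan) and the longer "武汉长江大桥" (Wuhan Yangtze Bridge) are both produced and classified independently. The function name and the optional `max_len` cap are illustrative additions.

```python
def enumerate_spans(text, max_len=None):
    """Enumerate every contiguous character subsequence (span unit) of `text`."""
    n = len(text)
    max_len = max_len or n
    return [text[i:j] for i in range(n)
            for j in range(i + 1, min(i + max_len, n) + 1)]

spans = enumerate_spans("武汉长江大桥")
# Both the overlapping candidate "武汉" and the full span are in the set.
print("武汉" in spans, "武汉长江大桥" in spans)
```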
Embodiment III
Referring to fig. 8, fig. 8 is a schematic diagram of the structure of an entity recognition apparatus of the present embodiment. The entity identification device 20 of this embodiment comprises a processor 21, a memory 22 and a computer program stored in said memory 22 and executable on said processor 21. The steps of the above-described method embodiments are implemented by the processor 21 when executing the computer program. Alternatively, the processor 21 may implement the functions of the modules/units in the above-described device embodiments when executing the computer program.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory 22 and executed by the processor 21 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specified functions, which segments describe the execution of the computer program in the entity identification device 20. For example, the computer program may be divided into the modules of the second embodiment; for the specific functions of each module, refer to the working process of the apparatus described in the foregoing embodiment, which is not repeated here.
The entity identification device 20 may include, but is not limited to, a processor 21 and a memory 22. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of the entity identification device 20 and does not constitute a limitation of it; the device may include more or fewer components than illustrated, combine certain components, or use different components. For example, the entity identification device 20 may also include input-output devices, network access devices, buses, and the like.
The processor 21 may be a central processing unit (Central Processing Unit, CPU), but may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor or any conventional processor. The processor 21 is the control center of the entity identification device 20 and connects the various parts of the entire device using various interfaces and lines.
The memory 22 may be used to store the computer program and/or modules, and the processor 21 implements the various functions of the entity identification device 20 by running or executing the computer program and/or modules stored in the memory 22 and invoking data stored in the memory 22. The memory 22 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to use of the device (such as audio data, a phonebook, etc.). In addition, the memory 22 may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card, at least one disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The integrated modules/units of the entity identification device 20, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment through a computer program that instructs related hardware; the computer program may be stored in a computer readable storage medium, and when executed by the processor 21 it implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in each jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that the above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the invention, the connection relations between modules indicate that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
Claims (10)
1. An entity identification method, characterized in that the method comprises the following steps:
S1, performing character embedding on an input text, generating a unique vector representation for each character to obtain a vector sequence of the input text;
S2, enumerating the span units in the input text vector sequence to obtain the span set of the input text;
S3, inputting the span set into a bidirectional graph convolution to generate the semantic feature vector of each span region;
S4, inputting the semantic feature vector into a bidirectional long short-term memory network to obtain the context information;
S5, obtaining the joint modeling result of the semantic features and the context features of the span unit from the context information by nonlinear transformation;
2. The method according to claim 1, characterized in that: the step S1 includes: S11, randomly initializing a feature matrix as the embedding matrix of the characters, wherein one dimension of the matrix is the length of the character table and the other represents the embedding dimension of each character;
3. The method according to claim 1, characterized in that: the step S3 includes:
s31, reconstructing a span sequence of the chain structure into a graph structure;
S32, aggregating each node feature in the graph with the bidirectional graph convolution layer;
4. A method according to claim 3, characterized in that: the step S4 includes:
S41, replacing the vector sequence of the span unit in the original input vector sequence with the semantic feature vector;
S43, aggregating, based on the self-attention mechanism, the dependency relationship between the span features and the context features in the sequence, wherein softmax is the normalized exponential function, the feature vector of the span unit serves as the query, the feature matrix formed by the state feature vectors of the bidirectional long short-term memory network serves as the keys and values, and the output is the joint modeling of the span semantic features and the context features;
5. The method according to claim 4, wherein: the step S5 includes:
S51, repeating step S4 l times to deeply model the semantic features and the context features, producing an output feature vector;
6. An entity recognition device, characterized in that it comprises the following units:
an input text vector generation unit for character embedding the input text, generating a unique vector representation for each character to obtain a vector sequence of the input text;
a span set generating unit, used to enumerate the span units in the input text vector sequence to obtain the span set of the input text;
a semantic feature vector generation unit, used to input the span set into a bidirectional graph convolution to generate the semantic feature vector of each span region;
a context information generating unit, used to input the semantic feature vector into the bidirectional long short-term memory network to obtain the context information;
a joint modeling result generation unit of semantic features and context features, used to obtain the joint modeling result of the semantic features and the context features of the span unit from the context information by nonlinear transformation;
7. The apparatus according to claim 6, wherein: the input text vector generation unit includes: an embedding matrix initializing unit, used to randomly initialize a feature matrix as the embedding matrix of the characters, wherein one dimension of the matrix is the length of the character table and the other represents the embedding dimension of each character;
8. The apparatus according to claim 6, wherein: the semantic feature vector generating unit includes:
a graph structure reconstructing unit for reconstructing the span sequence of the chain structure into a graph structure;
the bidirectional graph convolution construction unit, used to aggregate each node feature in the graph with the bidirectional graph convolution layer;
9. The apparatus according to claim 8, wherein: the context information generation unit includes:
a first vector replacement unit, used to replace the vector sequence of the span unit in the original input vector sequence with the semantic feature vector;
a bidirectional long short-term memory network construction unit, used to construct a bidirectional long short-term memory network to model the sequence features of the replaced vector sequence;
a first joint modeling unit, used to aggregate, based on the self-attention mechanism, the dependency relationship between the span features and the context features in the sequence, wherein softmax is the normalized exponential function, the feature vector of the span unit serves as the query, the feature matrix formed by the state feature vectors of the bidirectional long short-term memory network serves as the keys and values, and the output is the joint modeling of the span semantic features and the context features.
10. The apparatus according to claim 9, wherein: the joint modeling result generating unit of the semantic features and the context features comprises:
a first execution unit, used to repeatedly execute the context information generating unit l times to deeply model the semantic features and the context features, producing an output feature vector;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310417766.9A CN116151241B (en) | 2023-04-19 | 2023-04-19 | Entity identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116151241A true CN116151241A (en) | 2023-05-23 |
CN116151241B CN116151241B (en) | 2023-07-07 |
Family
ID=86373973
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310417766.9A Active CN116151241B (en) | 2023-04-19 | 2023-04-19 | Entity identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116151241B (en) |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180082197A1 (en) * | 2016-09-22 | 2018-03-22 | nference, inc. | Systems, methods, and computer readable media for visualization of semantic information and inference of temporal signals indicating salient associations between life science entities |
EP3385862A1 (en) * | 2017-04-03 | 2018-10-10 | Siemens Aktiengesellschaft | A method and apparatus for performing hierarchical entity classification |
CN109165384A (en) * | 2018-08-23 | 2019-01-08 | 成都四方伟业软件股份有限公司 | A kind of name entity recognition method and device |
CN111178074A (en) * | 2019-12-12 | 2020-05-19 | 天津大学 | Deep learning-based Chinese named entity recognition method |
CN111651993A (en) * | 2020-05-11 | 2020-09-11 | 北京理工大学 | Chinese named entity recognition method fusing local-global character level association features |
CN112711948A (en) * | 2020-12-22 | 2021-04-27 | 北京邮电大学 | Named entity recognition method and device for Chinese sentences |
CN112765994A (en) * | 2021-01-26 | 2021-05-07 | 武汉大学 | Deep learning-based information element joint extraction method and system |
CN113408273A (en) * | 2021-06-30 | 2021-09-17 | 北京百度网讯科技有限公司 | Entity recognition model training and entity recognition method and device |
CN113535928A (en) * | 2021-08-05 | 2021-10-22 | 陕西师范大学 | Service discovery method and system of long-term and short-term memory network based on attention mechanism |
CN113591483A (en) * | 2021-04-27 | 2021-11-02 | 重庆邮电大学 | Document-level event argument extraction method based on sequence labeling |
CN113836910A (en) * | 2021-09-17 | 2021-12-24 | 山东师范大学 | Text recognition method and system based on multilevel semantics |
CN114239585A (en) * | 2021-12-17 | 2022-03-25 | 安徽理工大学 | Biomedical nested named entity recognition method |
CN114330338A (en) * | 2022-01-13 | 2022-04-12 | 东北电力大学 | Program language identification system and method fusing associated information |
CN115600605A (en) * | 2022-10-31 | 2023-01-13 | 陕西师范大学(Cn) | Method, system, equipment and storage medium for jointly extracting Chinese entity relationship |
CN115688752A (en) * | 2022-09-16 | 2023-02-03 | 杭州电子科技大学 | Knowledge extraction method based on multi-semantic features |
US20230059494A1 (en) * | 2021-08-19 | 2023-02-23 | Digital Asset Capital, Inc. | Semantic map generation from natural-language text documents |
Non-Patent Citations (2)
Title |
---|
CHIU, J. P. C.; NICHOLS, E.: "Named entity recognition with bidirectional LSTM-CNNs", Transactions of the Association for Computational Linguistics *
WEI Xiao; QIN Yongbin; CHEN Yanping: "A network security named entity recognition method based on component CNN", Computer and Digital Engineering (计算机与数字工程), no. 01 *
Also Published As
Publication number | Publication date |
---|---|
CN116151241B (en) | 2023-07-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||