CN114528368A - Spatial relationship extraction method based on pre-training language model and text feature fusion

Info

Publication number
CN114528368A
Authority
CN
China
Prior art keywords
text
spatial relationship
word
entity
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111338542.6A
Other languages
Chinese (zh)
Other versions
CN114528368B (en)
Inventor
张雪英
吴恪涵
王益鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Normal University
Original Assignee
Nanjing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Normal University filed Critical Nanjing Normal University
Priority to CN202111338542.6A priority Critical patent/CN114528368B/en
Publication of CN114528368A publication Critical patent/CN114528368A/en
Application granted granted Critical
Publication of CN114528368B publication Critical patent/CN114528368B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a spatial relationship extraction method based on a pre-trained language model and text feature fusion. The method first cleans and preprocesses the text data and uses the pre-trained language model to convert single or batched text data into low-dimensional word vectors, ensuring that the word vectors converted from text data of different lengths have consistent dimensions. Next, classifiers formed by feedforward neural networks take the word vectors as input and predict the start and end positions of geographic entities and spatial relationship feature words in the text, and character span representations are generated by pooling according to the start and end positions and the word vector representations. Finally, the two tasks of geographic entity recognition and spatial relationship classification are performed on the character span representations, thereby extracting the spatial relationships in the text. The method takes into account how the geographic entity types and the spatial relationship feature words relate to spatial relationship extraction, extracts spatial relationships from text in triple form, and offers good extensibility and generality.

Description

Spatial relationship extraction method based on pre-training language model and text feature fusion
Technical Field
The invention belongs to the field of natural language processing and geographic big data mining, and particularly relates to a geographic entity recognition and spatial relationship extraction method based on pre-training language model and text feature fusion.
Background
A spatial relationship describes the mutual constraint, interaction, and correlation between geographic entities, and is indispensable linking information when people describe spatial positions. Everyday communication frequently involves descriptions of spatial locations, usually in the form of a pair of geographic entities plus a spatial relationship. Such descriptions let people infer the spatial location of an unknown geographic entity from known ones, linking the semantic space of human thought with the physical space of the real world. Text is one of the most common means of communication and information exchange in daily life and contains abundant location descriptions and corresponding spatial relationship information. However, because textual expression is flexible and ambiguous, it is difficult to correctly understand the spatial positions described in text. To understand descriptions of spatial position more fully, accurately recognizing the geographic entities and spatial relationships in text has become a pressing scientific problem.
To obtain the spatial relationships in text, researchers have borrowed relation extraction methods from natural language processing and proposed methods based on rule templates and on machine learning. Rule-template-based methods formulate extraction rules and templates through steps such as listing spatial vocabulary, defining spatial relationships, constructing a dictionary of spatial relationship feature words, and inducing syntactic patterns. Because they depend heavily on expert knowledge and the induced rules are incomplete, these methods generalize poorly and have low recall. Machine-learning-based methods introduce statistical learning techniques such as frequency statistics, Bootstrapping, kernel methods, and support vector machines to extract key features of natural language. They largely remove the dependence on rule templates, but they are hard to apply when spatial relationship instances are sparsely distributed. With deep learning, many researchers use joint extraction, in which the same encoder represents both the entity information and the relationship information in a text. This strengthens the dependency between the two tasks of entity recognition and relation extraction, addresses the error accumulation that arises when the two are treated as independent tasks, and mitigates the effect of sparsely distributed spatial relationship instances on the model.
However, existing experiments and analysis show that joint extraction is not an ideal relation extraction method: indiscriminately sharing the context representations of entities and relations can harm the model's spatial relationship extraction performance. In addition, joint extraction does not fully consider entity type information and relation feature word information or their influence on the relation classification task, so it is difficult for it to further alleviate the problems caused by the sparse distribution of spatial relationship instances.
Disclosure of Invention
The invention aims to provide a spatial relationship extraction method based on fusion of a pre-training language model and text features, aiming at the defects and shortcomings of the existing spatial relationship extraction method in the process of extracting the spatial relationship in a text.
The technical scheme adopted by the invention for solving the technical problems is a spatial relationship extraction method based on the fusion of a pre-training language model and text features, and the method comprises the following steps:
Step 1: First, preprocess the text data. Use a regular expression to remove meaningless characters such as #, %, $, ' and spaces from the text, and ensure that opening and closing double or single quotation marks are fully matched. Then split the text data character by character and add the [CLS] and [SEP] identifiers at the beginning and end of the split result. If the text data is input in batches, every text must have the same length, so shorter texts are padded with the [PAD] identifier.
Step 2: Input the preprocessed text data into the pre-trained language model, which converts the character-by-character segmentation result $T = \{t_1, t_2, \ldots, t_N\}$ into dense real-valued word vectors $Z = \{z_1, z_2, \ldots, z_N\}$.
Step 3: Input the dense real-valued vectors obtained in Step 2 into two single-layer feedforward neural networks. Each network acts as a binary classifier that predicts whether the word vector $z_i$ is the start character or the end character of a geographic entity or spatial relationship feature word. The predictions of the two networks are recorded in the index sets $POS_{start}$ and $POS_{end}$, each sorted in ascending order. Based on $Z = \{z_1, z_2, \ldots, z_N\}$, $POS_{start}$ and $POS_{end}$, select a pair of start and end indices $[i, j]$ and fuse $z_i$ to $z_j$ in $Z$ by max pooling to form a character span representation (Span Representation). The start and end indices are selected strictly according to the principle of proximity, and no start index or end index may be used more than once.
Step 4: Input the character span representation obtained in Step 3 into a single-layer feedforward neural network and predict the entity type of the character span identified by the index pair $[i, j]$. The entity types comprise specific geographic entity types (mountains, rivers, administrative divisions, and the like), the spatial relationship feature word type, and a type denoted by a symbol (rendered as images RE-GDA0003609234710000021 and RE-GDA0003609234710000022 in the source), which indicates that the character span representation does not belong to any geographic entity or spatial relationship feature word type.
Step 5: Based on the geographic entity predictions of Step 4, the model automatically adds geographic entity markers before the start position and after the end position of each entity in the source text data. The markers indicate where the geographic entities identified by the model lie in the text; at the same time, the start and end positions of the spatial relationship feature words are updated. After the markers are added, the newly generated text data is input into the pre-trained language model used for relation extraction to generate the corresponding low-dimensional dense word vectors. The model represents a geographic entity by fusing the word vectors of its start and end markers through average pooling (Average Pooling), and represents a spatial relationship feature word by fusing the corresponding word vectors through max pooling (Max Pooling).
Step 6: firstly, the model splices vector representations of any pair of geographic entities and spatial relation feature words, and the spliced vector representations are fused into text feature vectors through a self-attention mechanism (self-attention); then, inputting the text feature vector into a feed-forward neural network for spatial relationship classification; and finally, the model judges the spatial relationship among the geographic entities according to the probability information output by the feedforward neural network.
Furthermore, the pre-trained language model is trained on large-scale text data from the geographic domain, learning grammatical rules and mining latent semantics through self-supervised learning. Taking text data split at character granularity as input, the model encodes the text in terms of characters, positions, and semantics to generate a word vector matrix whose two dimensions are the output dimension set by the pre-trained language model and the character length of the input text.
Furthermore, in the spatial relationship extraction process, the two subtasks of geographic entity recognition and spatial relationship classification use two independent pre-trained language models. During training, the two models do not affect each other and update their parameters independently, so each generates word vector representations better suited to its subtask. The word vector representation $Z$ that a pre-trained language model generates for text data $T$ can be written as $Z = \mathrm{BERT}(T)$, with $T = \{t_1, t_2, \ldots, t_N\}$ and $Z = \{z_1, z_2, \ldots, z_N\}$, where $N$ denotes the number of characters in each text sample.
Furthermore, the two binary classifiers predict the start and end positions, respectively, of geographic entities and spatial relationship feature words in the text data. They take the word vectors generated by the pre-trained language model as input, output the result of an affine operation followed by the GeLU activation function, and decide from a set threshold $\delta$ whether the current character is the start or end position of a geographic entity or spatial relationship feature word. The process can be expressed as $POS_{start} = \mathrm{GeLU}(W_{start} Z + b_{start})$ and $POS_{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$, where a position is labeled 1 if its output exceeds $\delta$ and 0 otherwise (if $POS_{start} > \delta$ then 1 else 0).
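The thresholded GeLU classifier described above can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the function names, the toy vectors, the weights `w_start`, `b_start` and the threshold value 0.5 are all made up for the example, and the exact (erf-based) form of GeLU is assumed.

```python
import math

def gelu(x: float) -> float:
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def position_scores(Z, w, b):
    """Affine map of each character vector to a scalar, followed by GeLU."""
    return [gelu(sum(wi * zi for wi, zi in zip(w, z)) + b) for z in Z]

def predict_positions(Z, w, b, delta):
    """Return, in ascending order, the indices whose score exceeds delta."""
    return [i for i, s in enumerate(position_scores(Z, w, b)) if s > delta]

# Toy input: four characters with 3-dimensional vectors (made-up numbers).
Z = [[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.1, 0.0, 0.2], [1.5, 1.5, 0.0]]
w_start, b_start = [1.0, 1.0, 0.0], -1.0  # hypothetical trained parameters
pos_start = predict_positions(Z, w_start, b_start, delta=0.5)
print(pos_start)
```

A second classifier with its own parameters would produce the end index set in the same way.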
Furthermore, a character span representation is generated by fusing the word vector representations between the start index and the end index with a pooling method. Max pooling considers every dimension of every word vector, selecting the maximum value of each dimension and fusing these into the final representation. Average pooling focuses on the features of the boundary word vectors: the word vectors of the geographic entity boundary markers are averaged to represent the geographic entity, which helps the model learn the boundary and type features of the entity. The two pooling methods can be written as $S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$ and the average pooling formula rendered as images RE-GDA0003609234710000031 and RE-GDA0003609234710000032 in the source.
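The two pooling operations can be illustrated with a short sketch. The max-pooling function follows the formula in the text; the average-pooling function is an assumption based on the prose (an element-wise mean of the two boundary-marker vectors), since the source gives that formula only as an image. The function names and numbers are illustrative.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

def boundary_average_pool(start_vec, end_vec):
    """Element-wise mean of the start-marker and end-marker vectors (assumed form)."""
    return [(s + e) / 2.0 for s, e in zip(start_vec, end_vec)]

# Toy word vectors for three positions (made-up numbers).
Z = [[0.5, 1.0, -0.5], [1.5, 0.25, 0.75], [0.0, 2.0, 0.5]]

span = max_pool(Z[0:2])                      # fuse z_0 .. z_1
entity = boundary_average_pool(Z[0], Z[2])   # fuse two marker vectors
print(span, entity)
```

Max pooling keeps the strongest activation in each dimension across the whole span, while the boundary average depends only on the two marker positions.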
Furthermore, the model forms a text feature matrix by concatenating the word vectors of the geographic entities and the spatial feature words. Based on the self-attention mechanism, the parameters $W_q$, $W_k$ and $W_v$ are applied to the text feature matrix to generate a query matrix $Q$, a key matrix $K$ and a value matrix $V$, respectively; the three matrices are then fused using the softmax function to generate a text feature vector of a specified dimension.
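A minimal sketch of this fusion step is given below, assuming standard scaled dot-product self-attention. The scaling by the square root of the key dimension and the final mean over rows (to reduce the attended matrix to a single vector) are assumptions of this example; the source does not spell out either detail. All names are illustrative.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of A (n x m) and B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_fuse(S, Wq, Wk, Wv):
    """Fuse the rows of the text feature matrix S into one feature vector."""
    Q, K, V = matmul(S, Wq), matmul(S, Wk), matmul(S, Wv)
    scale = math.sqrt(len(K[0]))                 # assumed sqrt(d_k) scaling
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / scale for s in row]) for row in matmul(Q, K_T)]
    attended = matmul(weights, V)
    return [sum(col) / len(attended) for col in zip(*attended)]  # assumed mean over rows

# Toy feature matrix: subject, object and one feature word, 2 dimensions each.
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]  # identity matrices stand in for trained weights
fused = self_attention_fuse(S, I, I, I)
print(fused)
```

In a trained model the three weight matrices would be learned, and the output width would be set by the shape of the value projection.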
Beneficial effects:
1. The invention uses a pre-trained language model in place of the word2vec model and obtains word vector representations with more complete context information.
2. The method constructs a set of binary classifiers based on feedforward neural networks to determine the start and end positions of geographic entities and spatial relationship feature words in the text, reducing the time cost of the start and end position algorithm.
3. The invention fuses word vectors by average pooling (Average Pooling) and max pooling (Max Pooling) to generate character span representations of geographic entities and spatial feature words. Compared with existing methods that label single characters in sequence, the character span representation is closer to the way people think, effectively reduces recognition errors caused by the overly discrete meaning of single characters, and further improves the recognition accuracy for geographic entities and spatial feature words.
4. Based on the self-attention mechanism, the method fuses the vector representations of the geographic entity pair and the spatial relationship feature words, thereby combining two key text features, the geographic entity types and the spatial relationship feature words, and generating a vector representation with more complete semantics.
Drawings
FIG. 1 is a technical flowchart of a spatial relationship extraction method based on pre-training language model and text feature fusion according to the present invention.
Fig. 2 is a schematic diagram of a text data preprocessing process used in the example.
FIG. 3 is a diagram of an example process for generating a representation of a character span.
Fig. 4 is a schematic diagram of a text feature fusion process of geographic entity types and spatial relationship feature words.
Detailed Description
The following detailed description will be made in conjunction with the accompanying drawings of the specification, and the spatial relationship extraction method based on the fusion of the pre-training language model and the text features includes the following steps:
(1) preprocessing original text data, removing meaningless characters and spaces in the text by using a regular expression, and adding [ CLS ] and [ SEP ] marks at the beginning and the end of the text data. The preprocessed text data T is input into a pre-trained language model (by default, the BERT pre-trained language model), and a word vector representation Z corresponding to the input data is generated.
$Z = \mathrm{BERT}(T)$, $T = \{t_1, t_2, \ldots, t_N\}$, $Z = \{z_1, z_2, \ldots, z_N\}$
If the input text data is in batch, the model will ensure that all the input text data is of consistent length, and text data of shorter length will be filled in with the [ PAD ] symbol.
(2) The word vector representations generated by the pre-trained language model are input into two independent classifiers, which predict the start positions and the end positions, respectively, of the geographic entities and spatial relationship feature words in the text. The predicted start and end indices are then matched according to the principle of proximity to construct [start, end] index pairs for the geographic entities and spatial relationship feature words.
$POS_{start} = \mathrm{GeLU}(W_{start} Z + b_{start})$, $POS_{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$
$W_{start}$ and $W_{end}$ denote the parameter matrices of the two classifiers, and $b_{start}$ and $b_{end}$ denote their bias coefficients. $POS_{start}$ and $POS_{end}$ denote the start and end positions, respectively, of a geographic entity or spatial relationship feature word.
1) The start position classifier and the end position classifier are each built from a single-layer feedforward neural network. The network maps the vector representation of a single character to a one-dimensional tensor, and a set hyperparameter threshold determines whether the character is the start or end position of a geographic entity or spatial relationship feature word.
2) The method for constructing the [ start, end ] index pair provided by the invention matches the start position and the end position according to the principle of proximity. Specifically, the start positions and the end positions are sorted in ascending order, all the start positions in the sequence are traversed by taking the start position sequence as a reference, and the end positions which accord with the rule are selected for matching. According to the matching rule, any pair of [ start, end ] does not include other start positions or end positions.
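The proximity-based matching of start and end positions can be sketched as follows. This is one possible reading of the rule, under stated assumptions: indices are inclusive, a single-character span may have start equal to end, and a start is left unmatched when another start lies strictly before its nearest end. The function name and sample indices are illustrative.

```python
def pair_by_proximity(starts, ends):
    """Match each start index to its nearest end so that no pair encloses
    another start or end position and no index is reused."""
    starts, ends = sorted(starts), sorted(ends)
    pairs, e = [], 0
    for k, s in enumerate(starts):
        # Skip end positions that precede this start; they cannot close it.
        while e < len(ends) and ends[e] < s:
            e += 1
        if e == len(ends):
            break
        next_start = starts[k + 1] if k + 1 < len(starts) else None
        # Another start before the nearest end would fall inside the pair,
        # so this start stays unmatched.
        if next_start is not None and next_start < ends[e]:
            continue
        pairs.append((s, ends[e]))
        e += 1  # each end index is used at most once
    return pairs

pairs = pair_by_proximity([1, 5, 9], [2, 6, 7, 12])
print(pairs)
```

In this toy input the end position 7 stays unmatched, because no unmatched start precedes it.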
(3) Based on the [start, end] index pairs constructed in step (2), the word vector representations from the start position to the end position are fused by max pooling (Max Pooling) to generate the corresponding character span representation. A feedforward neural network then classifies each generated character span representation and determines its entity type (geographic entity, spatial feature word, or the type denoted by the symbol rendered as image RE-GDA0003609234710000051 in the source).
$S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$
$\mathrm{Entity\ Class} = \mathrm{softmax}(W_{entity} S_{[i-j]} + b_{entity})$
$W_{entity}$ denotes the parameter matrix of the feedforward neural network used in entity recognition, and $b_{entity}$ denotes its bias coefficient.
(4) Using the geographic entity type recognition results obtained in step (3), entity start and entity end markers are added at the corresponding positions in the source text, and the start and end positions predicted for geographic entities or spatial relationship feature words are updated. The text with the added markers is then input into another pre-trained language model to generate the corresponding word vector representations. Finally, the word vectors of the start and end markers are fused by average pooling (Average Pooling) to represent a geographic entity, and the word vectors from the start to the end position are fused by max pooling (Max Pooling) to represent a spatial relationship feature word.
$Z' = \mathrm{BERT}'(T')$, $T' = \{t'_1, t'_2, \ldots, t'_M\}$, $Z' = \{z'_1, z'_2, \ldots, z'_M\}$
[Formula rendered as image RE-GDA0003609234710000061 in the source]
$S_{\text{Feature word-}[k-l]} = \mathrm{Max}([z_k; z_{k+1}; \ldots; z_l])$
$i$, $j$ denote the positions of the start marker and the end marker of a geographic entity predicted by the model, and $k$, $l$ denote the start and end positions of a spatial feature word predicted by the model.
(5) The model first matches the geographic entities in combination to form a set of candidate pairs of geographic entities. Then, the model selects any pair of geographic entities in the set and word vector representations corresponding to the spatial relationship characteristic words, and the word vector representations are spliced; then, fusing the spliced vector representation into a text feature vector through a self-attention mechanism (self-attention); and finally, inputting the text feature vectors into a feed-forward neural network for spatial relationship classification, and judging the spatial relationship between the geographic entities according to probability information output by the feed-forward neural network.
$S = \mathrm{concat}(S_{Entity\text{-}sub}; S_{Entity\text{-}obj}; S_{Featureword\_1}; \ldots; S_{Featureword\_p})$
[Formula rendered as image RE-GDA0003609234710000062 in the source]
$S_{Entity\text{-}sub}$ and $S_{Entity\text{-}obj}$ denote the character span representations of the subject and the object, respectively; $S_{Featureword\_i}$ denotes the word vector of each spatial relationship feature word identified by the model; $W_q$, $W_k$ and $W_v$ denote the parameter matrices that generate the query, key and value vectors, respectively; and $W_r$ and $b_r$ denote the parameter matrix and bias coefficient of the feedforward neural network for spatial relationship classification.
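The final classification step can be sketched as a single affine layer followed by softmax. The use of softmax here is an assumption, by analogy with the entity classifier above, because the source gives the classification formula only as an image. The relation labels, weights and feature vector are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_relation(feature, W_r, b_r, labels):
    """Apply one affine layer (W_r, b_r) and softmax, then pick the most likely label."""
    logits = [sum(w * f for w, f in zip(row, feature)) + b
              for row, b in zip(W_r, b_r)]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs

labels = ["direction", "topology", "none"]    # hypothetical relation classes
W_r = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]    # hypothetical trained weights
b_r = [0.0, 0.0, 0.0]
label, probs = classify_relation([1.0, 0.0], W_r, b_r, labels)
print(label)
```

The returned probabilities correspond to the "probability information" from which the model decides the spatial relationship between a pair of geographic entities.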
As shown in FIG. 1, the spatial relationship extraction method based on the fusion of the pre-training language model and the text features of the invention mainly comprises the following three parts:
1. text word vector generation based on a pre-trained language model.
2. Character span representation generation based on the start and end position indices and pooling methods.
3. Text feature fusion considering geographic entity types and spatial relationship feature words.
The detailed flow of the spatial relationship extraction method of the invention is described below, taking Chinese text data from the Chinese encyclopedia (geography) as an example.
(1) Chinese text data preprocessing and word vector generation based on a pre-training language model.
As shown in Fig. 2, the selected text data is "Town A is located in the northeast of County B." Following the data preprocessing step, the text data is split character by character, and the [CLS] and [SEP] symbols are added at the beginning and end of the data, respectively. The preprocessed text data is then input into the pre-trained language model to generate a character vector representation matrix of uniform dimension.
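The preprocessing step can be sketched as follows. Several details are assumptions of this example: the Chinese sentence is a presumed rendering of the translated sample text, the noise-character set is limited to #, %, $ and whitespace, and quotation-mark matching is left out. The function names are illustrative.

```python
import re

# Assumption: an illustrative set of "meaningless" characters.
_NOISE = re.compile(r"[#%$\s]")

def preprocess(text: str) -> list:
    """Strip noise characters, split character by character, add [CLS]/[SEP]."""
    cleaned = _NOISE.sub("", text)
    return ["[CLS]", *cleaned, "[SEP]"]

def pad_batch(batch):
    """Right-pad every token list with [PAD] to the longest length in the batch."""
    width = max(len(tokens) for tokens in batch)
    return [tokens + ["[PAD]"] * (width - len(tokens)) for tokens in batch]

# Assumed Chinese form of "Town A is located in the northeast of County B."
tokens = preprocess("A镇 位于#B县东北部。")
batch = pad_batch([tokens, preprocess("A镇")])
print(tokens)
print(batch[1])
```

With this tokenization the characters of "Town A" occupy positions 1 and 2, directly after the [CLS] identifier.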
(2) And identifying the starting position and the ending position of the geographic entity and the spatial relation characteristic word based on the two classifiers.
The word vector representations generated by the pre-trained language model are input into two independent classifiers, which predict the start positions and end positions, respectively, of the geographic entities and spatial relationship feature words in the text. For the example "[CLS] Town A is located in the northeast of County B. [SEP]", one of the predictions of the two classifiers is start position 1 and end position 3.
(3) A character span representation based on the start and end position pairs is generated.
As shown in fig. 3, a character span representation is generated by fusing the word vectors of the start position to the end position by a Max Pooling (Max Pooling) method based on the start and end positions of the geographic entity predicted by the two classifiers and the word vector representation of the text data. The word vectors of the two characters "A" and "town" in the example are fused to generate a character span representation to characterize "A town".
(4) And identifying the geographic entity and the spatial relation characteristic word of the text data.
Each generated character span representation is input into the feedforward neural network, which determines its entity type (geographic entity, spatial feature word, or the type denoted by the symbol rendered as image RE-GDA0003609234710000071 in the source). In the example, the model recognizes "Town A", "located", "County B", and "northeast", where "Town A" and "County B" are administrative divisions among the geographic entities, and "located" and "northeast" are spatial relationship feature words. Based on the recognition results, the model adds entity start and entity end markers at the corresponding positions in the source text and updates the start and end positions predicted for geographic entities or spatial relationship feature words. The model then inputs the text with the added markers into another pre-trained language model to generate the corresponding word vector representations again. Finally, the word vectors of the start and end markers are fused by average pooling (Average Pooling) to represent a geographic entity, and the word vectors from the start to the end position are fused by max pooling (Max Pooling) to represent a spatial relationship feature word.
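The marker insertion and index update described above can be sketched as follows. The marker strings `<e>` and `</e>`, the inclusive span convention, the assumption that spans do not overlap, and the Chinese token list (a presumed rendering of the sample sentence) are all choices of this example rather than details given in the source.

```python
def add_entity_markers(tokens, entity_spans, feature_spans,
                       open_mark="<e>", close_mark="</e>"):
    """Wrap each entity span in markers and shift the feature-word spans.

    Spans are inclusive (start, end) index pairs and must not overlap.
    Returns the new tokens, the (start marker, end marker) index of each
    entity, and the updated feature-word spans.
    """
    opens = {i for i, _ in entity_spans}
    closes = {j for _, j in entity_spans}
    new_tokens, new_index = [], {}
    for old, tok in enumerate(tokens):
        if old in opens:
            new_tokens.append(open_mark)
        new_index[old] = len(new_tokens)   # where the old token ends up
        new_tokens.append(tok)
        if old in closes:
            new_tokens.append(close_mark)
    markers = [(new_index[i] - 1, new_index[j] + 1) for i, j in entity_spans]
    shifted = [(new_index[k], new_index[l]) for k, l in feature_spans]
    return new_tokens, markers, shifted

tokens = ["[CLS]", "A", "镇", "位", "于", "B", "县", "东", "北", "部", "。", "[SEP]"]
entities = [(1, 2), (5, 6)]   # "Town A", "County B"
features = [(3, 4), (7, 8)]   # "located", "northeast"
new_tokens, markers, new_features = add_entity_markers(tokens, entities, features)
print(new_tokens)
```

The marker index pairs are the positions whose word vectors would be averaged to represent each geographic entity, and the shifted spans are the ranges that would be max-pooled for each feature word.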
(5) Text feature fusion and spatial relationship extraction based on the self-attention mechanism.
The model first combines the recognized geographic entities pairwise to form a set of candidate geographic entity pairs. Then, as shown in Fig. 4, the model selects the pair ("Town A", "County B") from the set together with the spatial relationship feature words, and concatenates the word vector representations corresponding to these elements. The concatenated vector representation is then fused into a text feature vector through the self-attention mechanism (self-attention). Finally, the text feature vector is input into a feedforward neural network for spatial relationship classification, and the spatial relationship between the geographic entities is determined from the probability information output by the network.
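Candidate pair generation and the assembly of the text feature matrix can be sketched as follows. Two points are assumptions of this example: pairs are treated as ordered (subject, object), and the vectors are stacked as rows of a matrix rather than flattened into one long vector. The vectors and names are illustrative.

```python
from itertools import permutations

def candidate_pairs(entities):
    """All ordered (subject, object) pairs of distinct entities (assumed ordering)."""
    return list(permutations(entities, 2))

def build_feature_matrix(subj_vec, obj_vec, feature_vecs):
    """Stack the subject, object and feature-word vectors as rows of one matrix."""
    return [list(subj_vec), list(obj_vec)] + [list(v) for v in feature_vecs]

# Made-up span representations for the example sentence.
spans = {
    "Town A": [0.2, 0.7],
    "County B": [0.9, 0.1],
    "located": [0.4, 0.4],
    "northeast": [0.3, 0.8],
}
pairs = candidate_pairs(["Town A", "County B"])
subj, obj = pairs[0]
S = build_feature_matrix(spans[subj], spans[obj],
                         [spans["located"], spans["northeast"]])
print(pairs)
print(S)
```

Each candidate pair yields its own feature matrix, which is then fused by self-attention and classified.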
In combination with the embodiment, the method provided by the invention uses the pre-training language model to generate the word vector, and simultaneously considers the association relationship between the geographic entity type and the spatial relationship characteristic word and the spatial relationship, and has good extraction performance and interpretability.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be made by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (6)

1. The spatial relationship extraction method based on the fusion of the pre-training language model and the text features is characterized by comprising the following steps of:
step 1: preprocessing original text data, removing meaningless characters in a text by using a regular expression, ensuring that front and back quotation marks in the text are completely matched, segmenting the processed text data character by character, adding [ CLS ] and [ SEP ] identifiers at the beginning and the end of a segmentation result, and if the text data is input in a batch form, ensuring that each piece of text data is consistent in length, and filling the text data with shorter length by using the [ PAD ] identifier;
step 2: inputting the preprocessed text data into a pre-trained language model, which converts the character-by-character segmentation result $T = \{t_1, t_2, \ldots, t_N\}$ into dense real-valued vectors $Z = \{z_1, z_2, \ldots, z_N\}$;
step 3: inputting the word vectors obtained in step 2 into two classifiers, each formed by a single-layer feedforward neural network, to predict whether each word vector $z_i$ is the start or the end of a geographic entity or a spatial relationship feature word; the prediction results of the two classifiers are recorded in the index sets $POS_{start}$ and $POS_{end}$, respectively, and sorted in ascending order of index,
$POS_{start} = \mathrm{GeLU}(W_{start} Z + b_{start}), \quad POS_{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$
based on the word vectors $Z = \{z_1, z_2, \ldots, z_N\}$ and the index sets $POS_{start}$ and $POS_{end}$, a pair of start and end indexes $[i, j]$ is selected according to the proximity principle, and the vectors $z_i$ to $z_j$ in $Z$ are fused by max pooling to generate a character span representation;
step 4: inputting the character span representation generated in step 3 into an entity recognizer constructed from a single-layer feedforward neural network to predict the entity type represented by the character span,
$S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$
$\mathrm{Entity\ Class} = \mathrm{softmax}(W_{entity} S_{[i-j]} + b_{entity})$
where the entity type is a specific geographic entity type, a spatial relationship feature word, or a symbol (rendered as image FDA0003351165160000011 in the original) indicating that the character span representation does not belong to any geographic entity or spatial relationship feature word type;
step 5: according to the prediction results for the geographic entities in the text, the model automatically adds geographic entity marks before the start position and after the end position of each geographic entity in the source text data, and at the same time updates the start and end position information of the spatial relationship feature words in the source text data; after the geographic entity marks are added, the newly generated text data is input into another pre-trained language model to generate the corresponding text word vectors; the model fuses the word vectors of the start mark and the end mark by average pooling to represent a geographic entity, and the word vectors of the start mark and the end mark are fused by max pooling to represent a spatial relationship feature word;
step 6: the model pairs the geographic entities combinatorially to form a candidate geographic entity pair set; any pair of geographic entities in the set and the word vector representations corresponding to the spatial relationship feature words are selected and concatenated; the concatenated representation is fused into a text feature vector by a self-attention mechanism; the text feature vector is input into a feedforward neural network for spatial relationship classification, and the spatial relationship between the geographic entities is determined from the probability information output by the feedforward neural network,
$S = \mathrm{concat}(S_{Entity\text{-}sub}; S_{Entity\text{-}obj}; S_{Featureword\_1}; \ldots; S_{Featureword\_p})$
[Equation image FDA0003351165160000021 in the original: self-attention fusion of $S$ into $S'$]
$\mathrm{Relation\ Class} = \mathrm{softmax}(W_r S' + b_r)$.
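Step 1 of claim 1 can be illustrated with a short sketch. It is only an approximation under stated assumptions: the claim does not define "meaningless characters", so the sketch assumes they are whitespace and control characters, and it omits the quotation-mark matching. The function names `clean`, `tokenize`, and `pad_batch` are hypothetical.

```python
import re

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

def clean(text):
    # Assumption: "meaningless characters" are whitespace and control characters.
    return re.sub(r"[\s\x00-\x1f\x7f]+", "", text)

def tokenize(text):
    # Character-by-character segmentation wrapped in [CLS] ... [SEP].
    return [CLS] + list(clean(text)) + [SEP]

def pad_batch(texts):
    # Pad shorter sequences with [PAD] so all sequences in the batch share one length.
    batch = [tokenize(t) for t in texts]
    width = max(len(seq) for seq in batch)
    return [seq + [PAD] * (width - len(seq)) for seq in batch]

batch = pad_batch(["ab c\n", "d"])
for seq in batch:
    print(seq)
```

Padding to the longest sequence in the batch is one common choice; a fixed maximum length would work equally well with this structure.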
2. The spatial relationship extraction method based on the fusion of a pre-trained language model and text features as claimed in claim 1, wherein the pre-trained language model learns grammar rules and mines latent semantics from large-scale geographic-domain text data by self-supervised learning; taking text data segmented at character granularity as input, the model encodes the text data in three respects (the character itself, its position, and its semantics) to generate a word vector matrix whose dimensions are the output dimension set for the pre-trained language model and the character length of the input text, respectively.
3. The spatial relationship extraction method based on the fusion of a pre-trained language model and text features as claimed in claim 1, wherein, in the spatial relationship extraction process, two independent pre-trained language models are used for the two subtasks of geographic entity recognition and spatial relationship classification; during model training the two pre-trained language models do not affect each other and update their parameters independently, so that each generates word vector representations better suited to its subtask; the word vector representation $Z$ of text data $T$ generated by the pre-trained language model can be expressed as $Z = \mathrm{BERT}(T)$, where $T = \{t_1, t_2, \ldots, t_N\}$, $Z = \{z_1, z_2, \ldots, z_N\}$, and $N$ denotes the number of characters in each sample of text data.
4. The spatial relationship extraction method based on the fusion of a pre-trained language model and text features as claimed in claim 1, wherein the two classifiers are each formed by a single-layer feedforward neural network and are used to predict the start and end positions of geographic entities and spatial relationship feature words in the text data; the two classifiers take the word vectors generated by the pre-trained language model as input and output the result of an affine operation followed by a GeLU activation function; whether the current character is the start or end position of a geographic entity or spatial relationship feature word is determined by comparing the output with a set threshold $\delta$; the process can be expressed as $POS_{start} = \mathrm{GeLU}(W_{start} Z + b_{start})$, $POS_{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$, if $POS_{start} > \delta$ then 1 else 0.
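The boundary classifiers and the proximity-based pairing of start and end indexes might look like the following NumPy sketch. Several details are assumptions rather than statements of the claim: each classifier is taken to output one scalar score per character, GeLU is computed with its common tanh approximation, and "proximity" is read as "the nearest end index at or after each start index". All function names and the toy weights are hypothetical.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GeLU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def boundary_indices(Z, W, b, delta):
    # Z: (N, d) word vectors; W: (d,) weights; b: scalar bias.
    # Returns, in ascending order, the character indexes whose score exceeds delta.
    scores = gelu(Z @ W + b)
    return np.flatnonzero(scores > delta).tolist()

def pair_by_proximity(starts, ends):
    # Pair each start index with the nearest end index at or after it.
    spans = []
    for i in starts:
        following = [j for j in ends if j >= i]
        if following:
            spans.append((i, min(following)))
    return spans

# Toy one-dimensional "word vectors" with hand-picked weights.
Z = np.array([[2.0], [-1.0], [0.5], [3.0]])
W = np.array([1.0])
starts = boundary_indices(Z, W, 0.0, 1.0)
spans = pair_by_proximity([0, 4], [1, 6])
print(starts, spans)
```

A start index with no end index at or after it is simply dropped in this sketch; how such cases are handled in the actual method is not stated in the claims.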
5. The spatial relationship extraction method based on the fusion of a pre-trained language model and text features as claimed in claim 1, wherein the character span representation is generated by fusing the word vector representations between the start and end indexes by a pooling method; the max pooling method takes every dimension of every word vector into account and selects the maximum value in each dimension to form the final word vector representation; the average pooling method focuses on the features of the boundary word vectors and represents a geographic entity by averaging the word vectors of its boundary marks, so that the model better learns the boundary features and type features of the entity; the two pooling methods can be expressed as $S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$ and
[Equation image FDA0003351165160000022 in the original: average pooling formula].
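The two pooling operations of claim 5 can be sketched in NumPy as below. The max pooling follows the formula $S_{[i-j]} = \mathrm{Max}([z_i; \ldots; z_j])$ given in the claim. The average pooling formula survives only as an image, so the sketch assumes it is the plain mean of the start-mark and end-mark vectors. The function names are hypothetical.

```python
import numpy as np

def max_pool_span(Z, i, j):
    # Element-wise maximum over the vectors z_i .. z_j (inclusive).
    return Z[i:j + 1].max(axis=0)

def avg_pool_marks(Z, start_mark, end_mark):
    # Assumption: mean of the two boundary-mark vectors.
    return (Z[start_mark] + Z[end_mark]) / 2.0

Z = np.array([[1.0, 5.0],
              [4.0, 2.0],
              [3.0, 3.0],
              [0.0, 9.0]])
span_repr = max_pool_span(Z, 0, 2)      # per-dimension max of rows 0..2
entity_repr = avg_pool_marks(Z, 0, 3)   # mean of rows 0 and 3
print(span_repr, entity_repr)
```

Both results keep the dimensionality of a single word vector, so they can be fed to the same downstream classifier regardless of span length.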
6. The method of claim 1, wherein the model forms a text feature matrix by concatenating the word vectors of the geographic entities and the spatial relationship feature words; based on a self-attention mechanism, a query matrix $Q$, a key matrix $K$ and a value matrix $V$ are generated through the parameters $W_q$, $W_k$ and $W_v$, respectively, and the three matrices are then fused using a softmax function to generate a text feature vector of a specified dimension.
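A rough NumPy sketch of the fusion in claim 6 follows. The exact attention formula is an image in the original, so this is an assumption-laden illustration: it uses standard scaled dot-product attention, and it reduces the attended matrix to a single vector by mean pooling over rows, which the claim does not specify. The function names, the dimensions, and the random weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_fuse(S, Wq, Wk, Wv):
    # S: (m, d) text feature matrix, one row per entity or feature word.
    Q, K, V = S @ Wq, S @ Wk, S @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (m, m) attention weights
    # Assumption: mean-pool the attended rows into one text feature vector.
    return (weights @ V).mean(axis=0)

def classify_relation(s_fused, Wr, br):
    # Relation Class = softmax(W_r S' + b_r)
    return softmax(Wr @ s_fused + br)

rng = np.random.default_rng(0)
m, d, d_out, n_rel = 3, 8, 4, 5  # hypothetical sizes
S = rng.normal(size=(m, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_out)) for _ in range(3))
fused = self_attention_fuse(S, Wq, Wk, Wv)
probs = classify_relation(fused, rng.normal(size=(n_rel, d_out)), np.zeros(n_rel))
print(fused.shape, probs.shape)
```

The output dimension `d_out` plays the role of the "specified dimension" in the claim; the predicted relation would be the index of the largest entry in `probs`.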
CN202111338542.6A 2021-11-12 2021-11-12 Spatial relation extraction method based on fusion of pre-training language model and text features Active CN114528368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111338542.6A CN114528368B (en) 2021-11-12 2021-11-12 Spatial relation extraction method based on fusion of pre-training language model and text features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111338542.6A CN114528368B (en) 2021-11-12 2021-11-12 Spatial relation extraction method based on fusion of pre-training language model and text features

Publications (2)

Publication Number Publication Date
CN114528368A true CN114528368A (en) 2022-05-24
CN114528368B CN114528368B (en) 2023-08-25

Family

ID=81618545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111338542.6A Active CN114528368B (en) 2021-11-12 2021-11-12 Spatial relation extraction method based on fusion of pre-training language model and text features

Country Status (1)

Country Link
CN (1) CN114528368B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114881038A (en) * 2022-07-12 2022-08-09 之江实验室 Chinese entity and relation extraction method and device based on span and attention mechanism
CN116402055A (en) * 2023-05-25 2023-07-07 武汉大学 Extraction method, device, equipment and medium for patent text entity

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120078503A1 (en) * 2009-06-10 2012-03-29 Ancestralhunt Partners, Llc System and method for the collaborative collection, assignment, visualization, analysis, and modification of probable genealogical relationships based on geo-spatial and temporal proximity
CN110377686A (en) * 2019-07-04 2019-10-25 浙江大学 A kind of address information Feature Extraction Method based on deep neural network model
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111680122A (en) * 2020-05-18 2020-09-18 国家基础地理信息中心 Space data active recommendation method and device, storage medium and computer equipment
CN113190655A (en) * 2021-05-10 2021-07-30 南京大学 Spatial relationship extraction method and device based on semantic dependence

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120078503A1 (en) * 2009-06-10 2012-03-29 Ancestralhunt Partners, Llc System and method for the collaborative collection, assignment, visualization, analysis, and modification of probable genealogical relationships based on geo-spatial and temporal proximity
CN110377686A (en) * 2019-07-04 2019-10-25 浙江大学 A kind of address information Feature Extraction Method based on deep neural network model
CN111680122A (en) * 2020-05-18 2020-09-18 国家基础地理信息中心 Space data active recommendation method and device, storage medium and computer equipment
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN113190655A (en) * 2021-05-10 2021-07-30 南京大学 Spatial relationship extraction method and device based on semantic dependence

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114881038A (en) * 2022-07-12 2022-08-09 之江实验室 Chinese entity and relation extraction method and device based on span and attention mechanism
CN114881038B (en) * 2022-07-12 2022-11-11 之江实验室 Chinese entity and relation extraction method and device based on span and attention mechanism
CN116402055A (en) * 2023-05-25 2023-07-07 武汉大学 Extraction method, device, equipment and medium for patent text entity
CN116402055B (en) * 2023-05-25 2023-08-25 武汉大学 Extraction method, device, equipment and medium for patent text entity

Also Published As

Publication number Publication date
CN114528368B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN111694924B (en) Event extraction method and system
CN107818084B (en) Emotion analysis method fused with comment matching diagram
WO2023005293A1 (en) Text error correction method, apparatus, and device, and storage medium
CN114528368B (en) Spatial relation extraction method based on fusion of pre-training language model and text features
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
CN110750646B (en) Attribute description extracting method for hotel comment text
CN110782892B (en) Voice text error correction method
CN113158674B (en) Method for extracting key information of documents in artificial intelligence field
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN114444507A (en) Context parameter Chinese entity prediction method based on water environment knowledge map enhancement relationship
CN113204952A (en) Multi-intention and semantic slot joint identification method based on clustering pre-analysis
CN115658846A (en) Intelligent search method and device suitable for open-source software supply chain
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN114332519A (en) Image description generation method based on external triple and abstract relation
CN111507103B (en) Self-training neural network word segmentation model using partial label set
CN115186670B (en) Method and system for identifying domain named entities based on active learning
CN115795060B (en) Entity alignment method based on knowledge enhancement
CN114970537B (en) Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy
CN116304064A (en) Text classification method based on extraction
CN115545005A (en) Remote supervision relation extraction method fusing knowledge and constraint graph
CN116127013A (en) Personal sensitive information knowledge graph query method and device
CN113297385B (en) Multi-label text classification system and method based on improved GraphRNN
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN114155403A (en) Image segmentation Hash sorting method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant