CN114528368B

CN114528368B - Spatial relation extraction method based on fusion of pre-training language model and text features

Info

Publication number: CN114528368B
Application number: CN202111338542.6A
Authority: CN
Inventors: 张雪英; 吴恪涵; 王益鹏
Original assignee: Nanjing Normal University
Current assignee: Nanjing Normal University
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2023-08-25
Anticipated expiration: 2041-11-12
Also published as: CN114528368A

Abstract

The invention discloses a spatial relation extraction method based on the fusion of a pre-training language model and text features. The method first cleans and preprocesses the text data and uses the pre-training language model to convert single or batched text data into low-dimensional word vectors, ensuring that the word vectors converted from texts of different lengths keep consistent dimensions. Next, a classifier formed by a feedforward neural network takes the word vectors as input and predicts the start and end positions of geographic entities and spatial relation feature words in the text, and a character span representation is generated by a pooling method from the start and end positions and the word vector representation. Finally, the two tasks of geographic entity identification and spatial relation classification are carried out on the character span representations, thereby achieving spatial relation extraction from text. The invention takes into account the association among geographic entity types, spatial relation feature words and spatial relation extraction, realizes text-oriented spatial relation extraction in triplet form, and has good extensibility and generality.

Description

Spatial relation extraction method based on fusion of pre-training language model and text features

Technical Field

The invention belongs to the field of natural language processing and geographic big data mining, and particularly relates to a geographic entity identification and spatial relation extraction method based on pre-training language model and text feature fusion.

Background

The spatial relationship is used as information for describing mutual constraint, interaction and mutual association state among geographic entities, and is connection information which is indispensable for human beings to describe spatial positions. Daily life communication of people frequently involves descriptions about spatial locations, which usually appear in the form of a pair of geographic entity objects plus a spatial relationship, wherein the spatial relationship inspires people to infer the spatial location of an unknown geographic entity from known geographic entities, connecting the semantic space of human ideas and the physical space of the real world. Text is one of the most common communication and information interaction modes in daily life, and contains rich position description information and corresponding spatial relationship information, however, due to flexibility and ambiguity of text expression, it is difficult to correctly understand the spatial position described in the text. In order to more fully understand the spatial location description, accurately identifying geographic entities and spatial relationships in text is a scientific problem that needs to be addressed.

To acquire the spatial relationships in text, researchers have proposed relationship extraction methods based on natural language processing and on machine learning. The rule-template-based method formulates extraction rules and templates through steps such as listing spatial vocabulary, defining spatial relationships, constructing a dictionary of spatial relationship feature words, and inducing syntactic patterns. Because it relies heavily on expert knowledge and cannot induce the extraction rules completely, it generalizes poorly and its extraction results have low recall. The machine-learning-based method introduces statistical learning methods such as frequency statistics, bootstrapping, kernel methods and support vector machines to extract key features of natural language, which largely removes the dependence on rule templates, but it is difficult to apply when spatial relationship instances are sparsely distributed. With deep learning, many researchers use the same encoder to represent the entity information and the relation information in a text, which strengthens the dependency between the two tasks of entity identification and relation extraction, addresses the error accumulation caused by treating them as independent tasks, and mitigates the influence of sparse spatial relationship instances on the model.

However, existing experiments and analyses indicate that joint extraction is not an ideal method of relational extraction, and blindly sharing the contextual representation of entities and relationships can adversely impair the spatial extraction performance of the model. In addition, the joint extraction method does not fully consider entity type information and relationship feature word information, does not fully consider the influence of entity types and relationship feature words on relationship classification tasks, and is difficult to further alleviate the problem caused by sparse spatial relationship instance distribution.

Disclosure of Invention

Aiming at the defects and shortcomings of the existing spatial relation extraction method in extracting spatial relation in texts, the invention provides a spatial relation extraction method based on fusion of a pre-training language model and text features.

The technical scheme adopted by the invention for solving the technical problems is a spatial relation extraction method based on fusion of a pre-training language model and text features, and the method comprises the following steps:

Step 1: First, the text data is preprocessed: a regular expression removes meaningless characters such as "#", "%" and "$" as well as whitespace, and double or single quotation marks are checked to be fully matched. Then the text data is segmented character by character, and the [CLS] and [SEP] identifiers are added at the beginning and end of the segmentation result. If the text data is input in batches, every text must have the same length, so shorter texts are filled with the [PAD] identifier.

Step 2: The preprocessed text data is input into the pre-training language model, and the character-by-character segmentation result $T = \{t_1, t_2, \ldots, t_N\}$ is converted into dense numeric vectors $Z = \{z_1, z_2, \ldots, z_N\}$.

Step 3: The dense real-valued word vectors obtained in step 2 are input into two single-layer feedforward neural networks, each of which acts as a classifier that predicts whether the word vector $z_i$ corresponds to the start character or the end character of a geographic entity or spatial relationship feature word. The prediction results of the two networks are recorded in the index sets $POS_{start}$ and $POS_{end}$ respectively, each sorted in ascending order. From the word vectors $Z = \{z_1, z_2, \ldots, z_N\}$ and the index sets $POS_{start}$ and $POS_{end}$, a pair of start and end indexes $[i, j]$ is selected, and the vectors $z_i$ to $z_j$ in $Z$ are fused by the maximum pooling (Max Pooling) method to form a character span representation (Span Representation). The selection of start and end indexes strictly follows the nearby principle and requires that no start or end index be used more than once.

Step 4: The character span representation obtained in step 3 is input into a single-layer feedforward neural network, which predicts the entity type of the character span corresponding to the start and end indexes $[i, j]$. The entity types comprise specific geographic entity types (mountain, river, administrative division and the like), the spatial relationship feature word type, and a null type indicating that the character span does not belong to any geographic entity or spatial relationship feature word type.

Step 5: based on the geographic entity prediction result in the step 4, the model automatically adds geographic entity marks before and after the starting and ending positions of the source text data, and the geographic entity marks are used for marking the positions of geographic entities identified by the model in the text and updating the starting and ending position information of the spatial relation feature words. After the geographic entity markers are added, the newly generated text data are input into a pre-trained language model for relation extraction, and corresponding low-dimensional dense word vectors are generated. The model fuses the entity start and end marker word vectors to represent the geographic entity by an Average Pooling (Average Pooling) method, and fuses the corresponding word vectors to represent the spatial relationship feature words by a maximum Pooling (Max Pooling) method.

Step 6: First, the model splices the vector representations of any pair of geographic entities and the spatial relationship feature words, and fuses the spliced vector representations into a text feature vector through a self-attention mechanism (self-attention). Then the text feature vector is input into a feedforward neural network for spatial relationship classification. Finally, the model judges the spatial relationship between the geographic entities according to the probability information output by the feedforward neural network.

Further, the pre-training language model is trained on large-scale geographic-domain text data, from which it learns grammar rules and mines implicit semantics by self-supervised learning. Taking text segmented at character granularity as input, the model encodes the text from three aspects (the character itself, its position, and its semantics) and generates a word vector matrix whose two dimensions are the output dimension size set by the pre-training language model and the character length of the input text.

Further, in the spatial relationship extraction flow of the invention, two independent pre-training language models are used for the two subtasks of geographic entity recognition and spatial relationship classification. During training the two models do not affect each other and update their parameters independently, so that each generates word vector representations better suited to its subtask. The word vector representation $Z$ of text data $T$ generated by the pre-training language model can be expressed as $Z = \mathrm{BERT}(T)$, with $T = \{t_1, t_2, \ldots, t_N\}$ and $Z = \{z_1, z_2, \ldots, z_N\}$, where $N$ is the number of characters in each sample of text data.

Further, two classifiers, each composed of a single-layer feedforward neural network, are used to predict the start and end positions, respectively, of geographic entities and spatial relationship feature words in the text data. Each classifier takes the word vectors generated by the pre-training language model as input, outputs the result of an affine operation followed by the GeLU activation function, and judges from a set threshold $\delta$ and the output whether the current character is the start or end position of a geographic entity or spatial relationship feature word. The process can be expressed as $POS^{start} = \mathrm{GeLU}(W_{start} Z + b_{start})$, $POS^{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$, if $POS^{start} > \delta$ then 1 else 0.
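The thresholding rule above can be illustrated with a minimal sketch in plain Python. The function name `predict_positions`, the toy two-dimensional vectors, the weights and the threshold value are all assumptions made for illustration; the patent does not specify them, and a real implementation would operate on the pre-training language model's output tensors.

```python
import math

def gelu(x: float) -> float:
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def predict_positions(Z, w, b, delta):
    """Return the indexes i for which GeLU(w . z_i + b) exceeds delta."""
    positions = []
    for i, z in enumerate(Z):
        score = gelu(sum(wk * zk for wk, zk in zip(w, z)) + b)
        if score > delta:
            positions.append(i)
    return positions

# Toy word vectors for four characters (illustrative values only).
Z = [[0.1, 0.2], [2.0, 1.0], [0.0, -1.0], [1.5, 1.5]]
w_start, b_start = [1.0, 1.0], -1.0  # hypothetical classifier parameters
pos_start = predict_positions(Z, w_start, b_start, delta=0.5)
```

A second call with separate parameters for the end classifier would produce the end index set in the same way.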

Further, a character span representation is generated by fusing, with a pooling method, the word vector representations between the start and end indexes. The maximum pooling method considers every dimension of every word vector and selects the maximum value of each dimension to form the final vector representation. The average pooling method focuses on the features of the boundary word vectors: the geographic entity is represented by averaging its boundary marker word vectors, which helps the model learn the boundary features and type features of the entity. Maximum pooling can be expressed as $S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$, and average pooling as the mean of the start and end marker word vectors of the entity.
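As a small illustration of the two pooling operations, the following sketch applies element-wise maximum and mean to lists of equal-length vectors. The helper names and the toy values are assumptions; inclusive indexing of `[i, j]` is also assumed here.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(dim) for dim in zip(*vectors)]

def avg_pool(vectors):
    """Element-wise mean over a list of equal-length vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def span_representation(Z, i, j):
    """Max-pool the word vectors z_i .. z_j (both ends inclusive)."""
    return max_pool(Z[i:j + 1])

# Toy 3-dimensional word vectors for four positions.
Z = [[0.0, 0.0, 0.0], [1.0, -2.0, 0.5], [0.5, 3.0, -1.0], [0.0, 0.0, 0.0]]
span = span_representation(Z, 1, 2)   # keeps the largest value per dimension
boundary = avg_pool([Z[1], Z[2]])     # mean of two boundary vectors
```

Max pooling keeps the strongest activation in each dimension across the whole span, while the average here uses only two vectors, mirroring the use of start and end marker vectors for entities.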

Further, the model of the invention forms a text feature matrix by splicing the word vectors of the geographic entities and the spatial feature words. Based on a self-attention mechanism, a query matrix $Q$, a key matrix $K$ and a value matrix $V$ are generated from this matrix through the parameters $W_q$, $W_k$ and $W_v$ respectively, and the three matrices are then fused using a softmax function to generate a text feature vector of a specified dimension size.
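The fusion step can be sketched as follows. The source states only that $Q$, $K$ and $V$ are fused with a softmax function; the scaled dot-product form $\mathrm{softmax}(QK^T/\sqrt{d_k})V$ and the final mean over rows used to obtain a single vector are assumptions of this sketch, not details confirmed by the patent. All names and values are illustrative.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_fuse(S, W_q, W_k, W_v):
    """Fuse the rows of feature matrix S into one vector.

    Assumption: scaled dot-product attention followed by a mean over rows.
    """
    Q, K, V = matmul(S, W_q), matmul(S, W_k), matmul(S, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    attended = matmul(weights, V)
    return [sum(col) / len(attended) for col in zip(*attended)]

# Three spliced span vectors of dimension 2; identity projections for clarity.
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
fused = self_attention_fuse(S, I2, I2, I2)
```

The output dimension equals the number of columns of `W_v`, which is one way to obtain a vector of a specified size.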

The beneficial effects are that:

1. The invention adopts a pre-training language model in place of a word2vec model, and obtains word vector representations that carry more complete context information.

2. The invention constructs a group of classifiers based on feedforward neural networks to judge the start and end positions of geographic entities and spatial relationship feature words in the text, which reduces the time cost of determining the start and end positions.

3. The invention fuses character vectors through the Average Pooling (Average Pooling) and maximum Pooling (Max Pooling) methods to generate character span representations that represent geographic entities and spatial feature words. Compared with existing methods that label single characters in sequence, the character span representation is closer to the way people think, can effectively reduce recognition errors caused by the overly discrete semantics of single characters, and further improves the recognition accuracy of geographic entities and spatial feature words.

4. The invention fuses the vector representation of the geographic entity pair and the spatial relation feature words based on the self-attention mechanism, thereby fusing the text features of two key types of geographic entity type and spatial relation feature words and generating the vector representation with more complete semantics.

Drawings

FIG. 1 is a technical flow chart of a spatial relationship extraction method based on a pre-training language model and text feature fusion.

Fig. 2 is a schematic diagram of a text data preprocessing process employed in the examples.

FIG. 3 is a schematic diagram of a process for generating a character span representation in an example.

Fig. 4 is a schematic diagram of a text feature fusion process of geographic entity types and spatial relationship feature words.

Detailed Description

The following describes the specific implementation of the present invention in detail with reference to the accompanying drawings, and the spatial relationship extraction method based on the fusion of the pre-training language model and the text features comprises the following steps:

(1) Preprocessing the original text data, removing nonsensical characters and spaces in the text by using a regular expression, and adding [ CLS ] and [ SEP ] marks at the beginning and ending parts of the text data. The preprocessed text data T is input into a pre-training language model (by default BERT pre-training language model) and a word vector representation Z corresponding to the input data is generated.

$Z = \mathrm{BERT}(T)$, $T = \{t_1, t_2, \ldots, t_N\}$, $Z = \{z_1, z_2, \ldots, z_N\}$

If the input text data is batched, the model will ensure that all the input text data is consistent in length, and text data of shorter length will be filled in with [ PAD ] symbols.
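The preprocessing and batching described in this step can be sketched as below. The noise character set in the regular expression is an assumption based on the "#", "%", "$" and whitespace examples given earlier; the quotation mark matching check is not covered, and the function names are hypothetical. Removing all whitespace suits Chinese text, which does not separate words with spaces.

```python
import re

# Assumed noise set: '#', '%', '$' and any whitespace character.
NOISE = re.compile(r"[#%$\s]")

def preprocess(text):
    """Clean the text, split it character by character, add [CLS] and [SEP]."""
    cleaned = NOISE.sub("", text)
    return ["[CLS]"] + list(cleaned) + ["[SEP]"]

def pad_batch(batch):
    """Right-pad every token list with [PAD] to the length of the longest one."""
    longest = max(len(tokens) for tokens in batch)
    return [tokens + ["[PAD]"] * (longest - len(tokens)) for tokens in batch]

# Placeholder strings stand in for real sentences.
batch = pad_batch([preprocess("AB $CD"), preprocess("X#")])
```

After padding, every token list in `batch` has the same length, so the batch can be converted to a rectangular input for the language model.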

(2) The text word vector representations generated by the pre-trained language model are input into two independent classifiers, which predict the start positions and the end positions, respectively, of geographic entities and spatial relationship feature words in the text. The predicted start and end position indexes are then matched according to the nearby principle to construct [start, end] index pairs for the geographic entities and spatial relationship feature words.

$POS^{start} = \mathrm{GeLU}(W_{start} Z + b_{start})$, $POS^{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$

$W_{start}$ and $W_{end}$ denote the parameter matrices of the two classifiers, and $b_{start}$ and $b_{end}$ denote their bias coefficients. $POS^{start}$ and $POS^{end}$ denote the start and end positions, respectively, of a geographic entity or spatial relationship feature word.

1) The start and end position binary classifiers provided by the invention are each constructed from a single-layer feedforward neural network. The network maps the vector representation of a single character to a one-dimensional tensor and, according to a preset hyperparameter threshold, judges whether that character is the start or end position of a geographic entity or spatial relationship feature word.

2) The invention provides a [ start, end ] index pair construction method, which matches a start position and an end position according to a nearby principle. Specifically, the starting positions and the ending positions are ordered according to ascending order, all the starting positions in the sequence are traversed by taking the starting position sequence as a reference, and ending positions conforming to rules are selected for matching. According to the matching rule, any pair of [ start, end ] pairs will not include other start or end positions.
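One possible reading of the nearby-principle matching is sketched below: for each start position in ascending order, the nearest end position at or after it is chosen, and the pair is kept only if no other start position falls inside it. This interpretation, the function name `pair_spans`, and the choice to drop unmatched starts are assumptions; the patent does not spell out the rule at this level of detail.

```python
from bisect import bisect_left

def pair_spans(starts, ends):
    """Match start and end indexes by the nearby principle (assumed reading).

    A pair (s, e) is emitted when e is the nearest end with e >= s and no
    other start lies in (s, e]; unmatched starts are skipped.
    """
    starts, ends = sorted(set(starts)), sorted(set(ends))
    pairs = []
    for n, s in enumerate(starts):
        next_start = starts[n + 1] if n + 1 < len(starts) else float("inf")
        k = bisect_left(ends, s)  # position of the nearest end at or after s
        if k < len(ends) and ends[k] < next_start:
            pairs.append((s, ends[k]))
    return pairs

# Start 3 has no end before the next start 5, so it is left unmatched.
pairs = pair_spans([1, 3, 5], [7, 2])
```

Because the nearest end is always chosen and a pair is rejected when another start intervenes, no resulting pair contains another start or end position, and no index is used twice.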

(3) Based on the [start, end] index pairs constructed in step (2), the word vector representations from the start position to the end position are fused by the maximum Pooling (Max Pooling) method to generate the corresponding character span representations. Each generated character span representation is then input into a feedforward neural network, which judges its entity type (geographic entity, spatial feature word, or neither).

$S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$

$\text{Entity Class} = \mathrm{softmax}(W_{entity} S_{[i-j]} + b_{entity})$

$W_{entity}$ denotes the parameter matrix of the feedforward neural network used in entity recognition, and $b_{entity}$ denotes its bias coefficient.
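The entity classification formula can be illustrated with a short sketch. The label set, the weight matrix and the input vector are hypothetical values chosen only to show the computation; the name of the null class in particular is an assumption.

```python
import math

# Hypothetical label set; the last entry stands for the null type.
LABELS = ["mountain", "river", "administrative_division", "feature_word", "none"]

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify_span(span, W, b):
    """Compute softmax(W * span + b) and return the most probable label."""
    logits = [sum(w * x for w, x in zip(row, span)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

# One weight row per label, span dimension 2 (illustrative values only).
W_entity = [[0.0, 1.0], [0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
b_entity = [0.0] * 5
label, probs = classify_span([1.0, 0.0], W_entity, b_entity)
```

The same pattern, with a different parameter matrix and label set, applies to the later spatial relationship classification.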

(4) Using the geographic entity type recognition results obtained in step (3), entity start and entity end marks are first added at the corresponding positions of the source text, and the start and end positions predicted for geographic entities and spatial relationship feature words are updated accordingly. The text with the added marks is then input into another pre-trained language model to generate the corresponding word vector representation. Finally, each geographic entity is represented by fusing the word vectors of its start and end marks by the Average Pooling (Average Pooling) method, and each spatial relationship feature word is represented by fusing the word vectors from its start position to its end position by the maximum Pooling (Max Pooling) method.

$Z' = \mathrm{BERT}'(T')$, $T' = \{t'_1, t'_2, \ldots, t'_M\}$, $Z' = \{z'_1, z'_2, \ldots, z'_M\}$

$S_{\text{Feature word-}[k-l]} = \mathrm{Max}([z_k; z_{k+1}; \ldots; z_l])$

i, j represent the model predicted geographic entity start and end marker positions, respectively, and k, l represent the model predicted spatial feature word start and end positions, respectively.
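The marker insertion and position update in this step can be sketched as follows. The marker tokens `[E]` and `[/E]`, the function name, and the use of inclusive spans are assumptions for illustration; in the example, English glosses stand in for the individual characters of the source sentence.

```python
def add_entity_markers(tokens, entity_spans, feature_spans,
                       open_tok="[E]", close_tok="[/E]"):
    """Insert markers around entity spans and shift all spans accordingly.

    Spans are (start, end) pairs with inclusive indexes into `tokens`.
    Returns the new tokens, the marker positions for each entity, and the
    updated feature-word spans.
    """
    opens = {i for i, _ in entity_spans}
    closes = {j for _, j in entity_spans}
    new_tokens, new_index = [], {}
    for p, tok in enumerate(tokens):
        if p in opens:
            new_tokens.append(open_tok)
        new_index[p] = len(new_tokens)  # where the original token now sits
        new_tokens.append(tok)
        if p in closes:
            new_tokens.append(close_tok)
    markers = [(new_index[i] - 1, new_index[j] + 1) for i, j in entity_spans]
    features = [(new_index[k], new_index[l]) for k, l in feature_spans]
    return new_tokens, markers, features

tokens = ["[CLS]", "A", "town", "located", "B", "county", "northeast", "[SEP]"]
new_tokens, markers, features = add_entity_markers(
    tokens, entity_spans=[(1, 2), (4, 5)], feature_spans=[(3, 3), (6, 6)]
)
```

The returned marker positions are the ones whose vectors would be averaged to represent each entity, and the updated feature spans are the ones that would be max-pooled.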

(5) The model first matches the geographic entities in a combined form to form a set of candidate geographic entity pairs. Then, selecting any pair of geographic entities in the set and word vector representations corresponding to the spatial relationship feature words by the model, and splicing the word vector representations; then, fusing the spliced vector representations into text feature vectors by a self-attention mechanism (self-attention); and finally, inputting the text feature vector into a feedforward neural network for spatial relationship classification, and judging the spatial relationship between the geographic entities according to probability information output by the feedforward neural network.

$S = \mathrm{concat}(S_{\text{Entity-sub}};\ S_{\text{Entity-obj}};\ S_{\text{Featureword}_1};\ \ldots;\ S_{\text{Featureword}_p})$

$S_{\text{Entity-sub}}$ and $S_{\text{Entity-obj}}$ denote the character span representations of the subject and the object respectively, and $S_{\text{Featureword}_i}$ denotes each spatial relationship feature word vector identified by the model. $W_q$, $W_k$ and $W_v$ denote the parameter matrices that generate the query, key and value vectors respectively, and $W_r$ and $b_r$ denote the parameter matrix and bias coefficient of the feedforward neural network for spatial relationship classification.
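The construction of candidate pairs and of the spliced matrix $S$ can be sketched as below. Treating the pairs as unordered combinations and stacking the span vectors as rows are assumptions of this sketch; if the subject and object roles must be distinguished, ordered pairs (`itertools.permutations`) would be used instead. The span vectors are made-up values.

```python
from itertools import combinations

def candidate_pairs(entity_names):
    """All unordered pairs of recognized geographic entities."""
    return list(combinations(entity_names, 2))

def build_feature_matrix(spans, subject, obj, feature_words):
    """Stack the subject, object and feature-word span vectors as rows of S."""
    return [spans[subject], spans[obj]] + [spans[w] for w in feature_words]

# Hypothetical span representations keyed by their surface text.
spans = {
    "A town": [0.9, 0.1],
    "B county": [0.2, 0.8],
    "located": [0.5, 0.5],
    "northeast": [0.3, 0.7],
}
pairs = candidate_pairs(["A town", "B county"])
S = build_feature_matrix(spans, *pairs[0], ["located", "northeast"])
```

Each matrix built this way would then pass through the self-attention fusion and the relation classifier to yield one spatial relationship prediction per candidate pair.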

As shown in FIG. 1, the spatial relationship extraction method based on the fusion of the pre-training language model and the text features mainly comprises the following three parts:

1. Text word vector generation based on a pre-trained language model.

2. Character span representation generation based on start and end position indexes and pooling methods.

3. Text feature fusion taking into account geographic entity types and spatial relationship feature words.

The detailed flow of the spatial relationship extraction method of the present invention is described below, taking Chinese text data from the Chinese encyclopedia (geography) as an example.

(1) Chinese text data preprocessing and word vector generation based on a pre-trained language model.

As shown in Fig. 2, the selected text data is "A town is located on the northeast border of B county." Following the data preprocessing step, the text is segmented character by character, and the [CLS] and [SEP] symbols are added at the beginning and the end of the data respectively. The preprocessed text is input into the pre-training language model, which generates a word vector representation matrix of uniform dimension.

(2) And identifying the starting and ending positions of the geographic entity and the spatial relationship feature words based on the two classifiers.

The text word vector representations generated by the pre-trained language model are input into two independent classifiers, which predict the start positions and the end positions, respectively, of geographic entities and spatial relationship feature words in the text. For example, for "[CLS] A town is located on the northeast border of B county. [SEP]", one of the predictions of the two classifiers is start position 1 and end position 3.

(3) Character span representations based on the start and end position pairs are generated.

As shown in fig. 3, the character span representation is generated by fusing word vectors from a start position to an end position by a Max Pooling method based on the start and end positions of the geographic entities predicted by the two classifiers and the word vector representation of the text data. In the example, the word vectors of the two characters "A" and "town" are fused to generate a character span representation to characterize "A town".

(4) And identifying the geographic entity and the spatial relation characteristic words of the text data.

Each generated character span representation is input into a feedforward neural network, which judges its entity type (geographic entity, spatial feature word, or neither). In the example, the model identifies "A town", "located", "B county" and "northeast", where "A town" and "B county" belong to the administrative division type of geographic entity, and "located" and "northeast" are spatial relationship feature words. Based on the recognition results, the model adds entity start and end marks at the corresponding positions of the source text and updates the start and end positions predicted for geographic entities and spatial relationship feature words. The model then inputs the marked text into another pre-trained language model, again generating a corresponding word vector representation. Finally, each geographic entity is represented by fusing the word vectors of its start and end marks by the Average Pooling (Average Pooling) method, and each spatial relationship feature word is represented by fusing the word vectors from its start position to its end position by the maximum Pooling (Max Pooling) method.

(5) Text feature fusion and spatial relationship extraction based on self-attention mechanism.

The model first combines the recognized geographic entities pairwise to form a set of candidate geographic entity pairs. Then, as shown in Fig. 4, the model selects the pair ("A town", "B county") from the set together with the spatial relationship feature words, and splices the word vector representations corresponding to these elements; next, the spliced vector representations are fused into a text feature vector by a self-attention mechanism (self-attention); finally, the text feature vector is input into a feedforward neural network for spatial relationship classification, and the spatial relationship between the geographic entities is judged according to the probability information output by the feedforward neural network.

In combination with the example, the method provided by the invention uses the pre-training language model to generate the word vector, and simultaneously considers the association relationship between the geographic entity type and the spatial relationship feature word and the spatial relationship.

While the invention has been described with respect to the preferred embodiments, it is to be understood that the invention is not limited thereto, but is intended to cover modifications and alternatives falling within the spirit and scope of the present invention as disclosed by those skilled in the art.

Claims

1. The spatial relation extraction method based on the fusion of the pre-training language model and the text features is characterized by comprising the following steps:

step 1: preprocessing original text data, removing meaningless characters in the text by using a regular expression, ensuring that quotation marks in the text are completely matched, segmenting the processed text data character by character, adding [CLS] and [SEP] identifiers at the beginning and the end of the segmentation result, and, if the text data is input in batches, ensuring that the length of each text is consistent by filling shorter texts with the [PAD] identifier;

step 2: inputting the preprocessed text data into a pre-training language model, and converting the character-by-character segmentation result $T = \{t_1, t_2, \ldots, t_N\}$ of the text data into dense numeric vectors $Z = \{z_1, z_2, \ldots, z_N\}$;

step 3: inputting the word vectors obtained in step 2 into two binary classifiers, each composed of a single-layer feedforward neural network, and predicting whether the word vector $z_i$ is the start or the end of a geographic entity or spatial relationship feature word, the predictions of the two binary classifiers being recorded in the index sets $POS_{start}$ and $POS_{end}$ respectively and sorted in ascending order of index,

$POS_{start} = \mathrm{GeLU}(W_{start} Z + b_{start})$, $POS_{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$

based on the word vectors $Z = \{z_1, z_2, \ldots, z_N\}$ and the index sets $POS_{start}$ and $POS_{end}$, selecting a pair of start and end indexes $[i, j]$ according to the nearby principle, and fusing $z_i$ to $z_j$ in $Z$ by the maximum pooling method to generate a character span representation;

step 4: inputting the character span representation generated in the step 3 into an entity identifier constructed by a single-layer feedforward neural network, predicting the entity type of the character span representation,

$S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$

$\text{Entity Class} = \mathrm{softmax}(W_{entity} S_{[i-j]} + b_{entity})$

the entity types comprising specific geographic entity types, the spatial relationship feature word type, and a null type indicating that the character span does not belong to any geographic entity or spatial relationship feature word type;

step 5: according to the prediction result of the geographic entity in the text, automatically adding geographic entity marks before and after the starting and ending positions of the source text data by the model, updating the starting and ending position information of the spatial relationship feature words in the source text data, inputting new text data into another pre-training language model after the geographic entity marks are added, generating corresponding text word vectors, fusing word vectors of the starting and ending marks by the model through an average pooling method to represent the geographic entity, and fusing the belonging word vectors by a maximum pooling method to represent the spatial relationship feature words;

step 6: matching the geographic entities by the model in a combined mode to form a candidate geographic entity pair set, selecting any pair of geographic entities in the set and character vector representations corresponding to the spatial relationship feature words, and performing splicing processing on the character vector representations; fusing the spliced vector representations into text feature vectors through a self-attention mechanism; inputting the text feature vector into a feedforward neural network for spatial relationship classification, judging the spatial relationship between geographic entities according to probability information output by the feedforward neural network,

$\text{Relation Class} = \mathrm{softmax}(W_r S' + b_r)$.

2. the spatial relationship extraction method based on fusion of a pre-training language model and text features according to claim 1, wherein the pre-training language model learns grammar rules and mines implicit semantics from text data by a self-supervision learning method based on large-scale geographic field text data, text data segmented by character granularity is used as input, the model encodes the text data from three aspects of character itself, position and semantics, a word vector matrix is generated, and the dimensions of the matrix are the output dimension size and the input text character length set by the pre-training language model respectively.

3. The method for extracting spatial relationship based on fusion of pre-training language model and text features as set forth in claim 1, wherein, in the process of extracting the spatial relationship, two independent pre-training language models are used for the two subtasks of geographic entity recognition and spatial relationship classification, and in the model training process the two pre-training language models do not affect each other and update their parameters independently, so that word vector representations meeting the requirements of each subtask are better generated; the word vector representation $Z$ of text data $T$ generated by the pre-training language model can be expressed as $Z = \mathrm{BERT}(T)$, with $T = \{t_1, t_2, \ldots, t_N\}$ and $Z = \{z_1, z_2, \ldots, z_N\}$, where $N$ represents the number of characters in each sample of text data.

4. The method as claimed in claim 1, wherein the two classifiers are respectively used for predicting the start and end positions of geographic entities and spatial relationship feature words in the text data; each classifier takes the word vectors generated by the pre-training language model as input, outputs the result of an affine operation followed by the GeLU activation function, and determines from the set threshold $\delta$ and the output whether the current character is the start or end position of a geographic entity or spatial relationship feature word; the above process can be expressed as $POS_{start} = \mathrm{GeLU}(W_{start} Z + b_{start})$, $POS_{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$, if $POS_{start} > \delta$ then 1 else 0.

5. The method for extracting spatial relationship based on the fusion of the pre-training language model and the text features according to claim 1, wherein the character span representation is generated by fusing, with a pooling method, the word vector representations between the start index and the end index; the maximum pooling method considers every dimension of every word vector and selects the maximum value of each dimension to form the final vector representation; the average pooling method focuses on the features of the boundary word vectors, the geographic entity being represented by averaging its boundary marker word vectors, which helps the model learn the boundary features and type features of the entity; maximum pooling can be expressed as $S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$, and average pooling as the mean of the start and end marker word vectors of the entity.

6. the method for extracting spatial relationship based on fusion of pre-training language model and text feature according to claim 1, wherein the model forms a text feature matrix by concatenating the word vectors of the geographic entities and the spatial feature words, and, based on a self-attention mechanism, generates a query matrix $Q$, a key matrix $K$ and a value matrix $V$ through the parameters $W_q$, $W_k$ and $W_v$ respectively, and then fuses the three matrices by using a softmax function to generate a text feature vector of a specified dimension size.