CN114528368A

CN114528368A - Spatial relationship extraction method based on pre-training language model and text feature fusion

Info

Publication number: CN114528368A
Application number: CN202111338542.6A
Authority: CN
Inventors: 张雪英; 吴恪涵; 王益鹏
Original assignee: Nanjing Normal University
Current assignee: Nanjing Normal University
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2022-05-24
Anticipated expiration: 2041-11-12
Also published as: CN114528368B

Abstract

The invention discloses a spatial relationship extraction method based on the fusion of a pre-trained language model and text features. First, the text data is cleaned and preprocessed, and a pre-trained language model converts single or batched text inputs into low-dimensional word vectors, ensuring that texts of different lengths yield word vectors of consistent dimensions. Next, classifiers built from feedforward neural networks take the word vectors and predict the start and end positions of geographic entities and spatial relationship feature words in the text, and a pooling method generates character span representations from those positions and the word vector representations. Finally, geographic entity recognition and spatial relationship classification are performed on the character span representations, thereby extracting spatial relationships from text. The method accounts for the association between geographic entity types, spatial relationship feature words, and spatial relationship extraction, extracts spatial relationships from text in triple form, and offers good extensibility and generality.

Description

Spatial relationship extraction method based on pre-training language model and text feature fusion

Technical Field

The invention belongs to the field of natural language processing and geographic big data mining, and particularly relates to a geographic entity recognition and spatial relationship extraction method based on pre-training language model and text feature fusion.

Background

Spatial relationships describe how geographic entities constrain, interact with, and relate to one another, and they are indispensable connecting information when people describe spatial positions. Everyday communication frequently involves descriptions of spatial locations, usually in the form of a pair of geographic entities plus a spatial relationship. Such descriptions let people infer the location of an unknown geographic entity from known ones, linking the semantic space of human thought with the physical space of the real world. Text is one of the most common modes of communication and information exchange in daily life and contains abundant location descriptions and the corresponding spatial relationship information. However, because textual expression is flexible and ambiguous, the spatial positions described in text are difficult to interpret correctly. To understand spatial position descriptions more fully, accurately recognizing the geographic entities and spatial relationships in text has become an urgent scientific problem.

To obtain spatial relationships from text, researchers have drawn on relation extraction methods from natural language processing and proposed approaches based on rule templates and on machine learning. Rule-template methods formulate extraction rules and templates through steps such as listing spatial vocabulary, defining spatial relationships, constructing a dictionary of spatial relationship feature words, and inducing syntactic patterns. Because they depend heavily on expert knowledge and the induced rules are incomplete, these methods generalize poorly and yield low recall. Machine-learning methods introduce statistical learning techniques such as frequency statistics, bootstrapping, kernel methods, and support vector machines to extract key features of natural language, which largely removes the dependence on rule templates; however, they cope poorly with the sparse distribution of spatial relationship instances. Building on deep learning, many researchers have adopted joint extraction, in which a single encoder represents both the entity information and the relationship information in a text. This strengthens the dependency between the entity recognition and relation extraction tasks, avoids the error accumulation that arises when the two are treated as independent tasks, and mitigates the effect of sparse spatial relationship instances on the model.

However, existing experiments and analyses show that joint extraction is not an ideal relation extraction method: blindly sharing the context representations of entities and relationships can degrade the model's spatial relationship extraction performance. In addition, joint extraction methods do not fully exploit entity type information and relation feature word information or their influence on the relation classification task, which makes it difficult to further alleviate the problems caused by the sparse distribution of spatial relationship instances.

Disclosure of Invention

The invention aims to provide a spatial relationship extraction method based on the fusion of a pre-trained language model and text features, addressing the shortcomings of existing methods for extracting spatial relationships from text.

The technical scheme adopted by the invention for solving the technical problems is a spatial relationship extraction method based on the fusion of a pre-training language model and text features, and the method comprises the following steps:

Step 1: First, the text data is preprocessed. A regular expression removes meaningless characters such as #, % and $, together with spaces, and opening and closing double or single quotation marks are checked so that they are fully matched. The text data is then segmented character by character, and the [CLS] and [SEP] identifiers are added at the beginning and end of the segmentation result. If the text data is input as a batch, every text must have the same length, so shorter texts are filled with the [PAD] identifier.

Step 2: The preprocessed text data is input into the pre-trained language model, which converts the character-by-character segmentation result $T = \{t_1, t_2, \ldots, t_N\}$ into dense real-valued word vectors $Z = \{z_1, z_2, \ldots, z_N\}$.

Step 3: The dense real-valued vectors obtained in Step 2 are input into two single-layer feedforward neural networks. These networks act as binary classifiers that predict whether each word vector $z_i$ corresponds to the start character or the end character of a geographic entity or spatial relationship feature word. The predictions of the two networks are recorded in the index sets $POS_{start}$ and $POS_{end}$, each sorted in ascending order. Based on the dense real-valued vectors $Z = \{z_1, z_2, \ldots, z_N\}$ and the index sets $POS_{start}$ and $POS_{end}$, a pair of start and end indexes $[i, j]$ is selected, and the word vectors $z_i$ to $z_j$ in $Z$ are fused by max pooling to form a character span representation. Start and end indexes are paired strictly by proximity, and no start or end index may be used more than once.

Step 4: The character span representation obtained in Step 3 is input into a single-layer feedforward neural network, which predicts the entity type of the character span corresponding to the start and end indexes $[i, j]$. The entity types comprise specific geographic entity types (mountains, rivers, administrative divisions and the like), spatial relationship feature words, and a null type indicating that the character span representation does not belong to any geographic entity or spatial relationship feature word type.

Step 5: Based on the geographic entity predictions from Step 4, the model automatically adds geographic entity markers at the start and end positions of each recognized geographic entity in the source text, marking where those entities are located, and updates the start and end position information of the spatial relationship feature words accordingly. After the markers are added, the newly generated text data is input into the pre-trained language model used for relation extraction to generate the corresponding low-dimensional dense word vectors. The model represents each geographic entity by fusing the word vectors of its start and end markers through average pooling, and represents each spatial relationship feature word by fusing its word vectors through max pooling.

Step 6: First, the model splices the vector representations of any pair of geographic entities with those of the spatial relationship feature words, and the spliced representation is fused into a text feature vector through a self-attention mechanism. The text feature vector is then input into a feedforward neural network for spatial relationship classification. Finally, the model determines the spatial relationship between the geographic entities from the probability information output by the feedforward neural network.

Furthermore, the pre-trained language model is trained on large-scale geographic-domain text data, learning grammar rules and mining latent semantics through self-supervised learning. Taking text segmented at character granularity as input, the model encodes the text from three aspects (the characters themselves, their positions, and their semantics) to generate a word vector matrix whose dimensions are the output dimension set for the pre-trained language model and the character length of the input text.

Furthermore, in the spatial relationship extraction process, two independent pre-trained language models are used for the two subtasks of geographic entity recognition and spatial relationship classification. During training, the two models do not affect each other and update their parameters independently, so that each generates word vector representations better suited to its subtask. The word vector representation $Z$ of text data $T$ generated by a pre-trained language model can be expressed as $Z = \mathrm{BERT}(T)$, with $T = \{t_1, t_2, \ldots, t_N\}$ and $Z = \{z_1, z_2, \ldots, z_N\}$, where $N$ denotes the number of characters in each text sample.

Furthermore, the two binary classifiers predict the start and end positions of geographic entities and spatial relationship feature words in the text data, respectively. Each classifier takes the word vectors generated by the pre-trained language model as input and outputs the result of an affine transformation followed by a GeLU activation; comparing that output with a preset threshold $\delta$ determines whether the current character is the start or end position of a geographic entity or spatial relationship feature word. The process can be expressed as $POS^{start} = \mathrm{GeLU}(W_{start} Z + b_{start})$ and $POS^{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$, where a position is labeled 1 if $POS^{start} > \delta$ and 0 otherwise.
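The thresholded start-position classifier described above can be illustrated with a minimal pure-Python sketch. The function names, the toy vectors, the weights and the threshold value 0.5 are all assumptions made for illustration; the patent does not specify them, and a real implementation would use learned parameters over BERT-sized vectors.

```python
import math


def gelu(x):
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def boundary_flags(Z, w, b, delta):
    """Score each character vector with GeLU(w . z + b) and threshold at delta."""
    scores = [gelu(sum(wi * zi for wi, zi in zip(w, z)) + b) for z in Z]
    return [1 if s > delta else 0 for s in scores]


# Toy 3-dimensional vectors for four characters (made-up numbers).
Z = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.1], [0.8, 0.3, 0.1], [0.0, 0.1, 0.9]]
w_start, b_start = [2.0, 0.0, -1.0], -0.5  # hypothetical parameters
flags = boundary_flags(Z, w_start, b_start, delta=0.5)

# The index set POS_start collects the positions flagged as entity starts.
pos_start = [i for i, flag in enumerate(flags) if flag]
print(pos_start)
```

The end-position classifier would follow the same pattern with its own parameters, producing the second index set.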

Furthermore, a character span representation is generated by fusing the word vector representations between the start and end indexes with a pooling method. Max pooling considers every dimension of every word vector and takes the maximum value in each dimension to form the final representation. Average pooling focuses on the boundary word vectors: the word vectors of the geographic entity boundary markers are averaged to represent the geographic entity, which helps the model learn the boundary features and type features of the entity. Max pooling can be expressed as $S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$, and average pooling takes the mean of the start-marker and end-marker word vectors of the entity.
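As a rough illustration of the two pooling operations, the sketch below applies element-wise max pooling over a span of vectors and average pooling over two boundary vectors. The helper names and toy numbers are assumptions, and for simplicity the two boundary vectors are taken from the same toy matrix, whereas in the method they would be the marker vectors produced by the second language model.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def average_pool(vectors):
    """Element-wise mean over a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy word vectors for four positions (made-up numbers).
Z = [[0.2, 0.9, -0.1], [0.5, 0.1, 0.3], [0.4, 0.6, 0.0], [0.0, 0.2, 0.8]]

i, j = 1, 2
span = max_pool(Z[i:j + 1])          # S_[i-j]: uses every vector from z_i to z_j
entity = average_pool([Z[0], Z[3]])  # uses only the two boundary-marker vectors
print(span, entity)
```

Max pooling keeps the strongest activation per dimension across the whole span, while average pooling over the two markers yields a representation tied to the entity boundaries.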

Furthermore, the model forms a text feature matrix by splicing the word vectors of the geographic entities and the spatial relationship feature words. Based on a self-attention mechanism, the parameter matrices $W_q$, $W_k$ and $W_v$ generate a query matrix $Q$, a key matrix $K$ and a value matrix $V$ from the text feature matrix; the three matrices are then fused using a softmax function to generate a text feature vector of a specified dimension.
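The following sketch shows one common way such a fusion could be computed, using standard scaled dot-product self-attention. The scaling by the square root of the key dimension and the final row averaging that collapses the result into a single vector are assumptions, since the text only states that Q, K and V are fused with a softmax function; the identity weight matrices are placeholders for learned parameters.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, W_q, W_k, W_v):
    """Return (context rows, attention weights) for feature matrix X."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scale = math.sqrt(len(K[0]))  # assumed scaling, as in scaled dot-product attention
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K]
              for q_row in Q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights


# Toy feature matrix: one row per spliced element (subject, object, feature word).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights; real ones are learned
context, weights = self_attention(X, I2, I2, I2)

# Assumed reduction: average the context rows into one text feature vector.
fused = [sum(col) / len(context) for col in zip(*context)]
print(fused)
```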

Beneficial effects:

1. The invention uses a pre-trained language model in place of the word2vec model, obtaining word vector representations that capture context information more completely.

2. The invention constructs a pair of binary classifiers based on feedforward neural networks to determine the start and end positions of geographic entities and spatial relationship feature words in the text, which reduces the time cost of locating start and end positions.

3. The invention fuses word vectors by average pooling and max pooling to generate character span representations for geographic entities and spatial relationship feature words. Compared with existing methods that perform sequence labeling on single characters, character span representations are closer to the way people think, effectively reduce recognition errors caused by the overly fragmented meaning of single characters, and further improve the recognition accuracy for geographic entities and spatial relationship feature words.

4. The invention uses a self-attention mechanism to fuse the vector representations of the geographic entity pair and the spatial relationship feature words, thereby combining two key text features (geographic entity type and spatial relationship feature words) into a semantically more complete vector representation.

Drawings

FIG. 1 is a technical flowchart of a spatial relationship extraction method based on pre-training language model and text feature fusion according to the present invention.

Fig. 2 is a schematic diagram of a text data preprocessing process used in the example.

FIG. 3 is a diagram of an example process for generating a representation of a character span.

Fig. 4 is a schematic diagram of a text feature fusion process of geographic entity types and spatial relationship feature words.

Detailed Description

The invention is described in detail below with reference to the accompanying drawings. The spatial relationship extraction method based on the fusion of a pre-trained language model and text features comprises the following steps:

(1) The original text data is preprocessed: a regular expression removes meaningless characters and spaces from the text, and [CLS] and [SEP] markers are added at the beginning and end of the text data. The preprocessed text data $T$ is input into a pre-trained language model (by default, the BERT pre-trained language model), which generates the word vector representation $Z$ corresponding to the input data.

$$Z = \mathrm{BERT}(T), \quad T = \{t_1, t_2, \ldots, t_N\}, \quad Z = \{z_1, z_2, \ldots, z_N\}$$

If the text data is input as a batch, the model ensures that all inputs have the same length by filling shorter texts with the [PAD] symbol.
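The preprocessing in step (1) can be sketched as follows. This is a minimal illustration under stated assumptions: the noise pattern is a hypothetical stand-in for whatever regular expression the authors used, the quotation-mark matching step is omitted, and the function names are invented for this example.

```python
import re

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

# Hypothetical noise pattern: a few symbols plus any whitespace.
NOISE = re.compile(r"[#%$\s]")


def preprocess(text):
    """Strip noise characters, split per character, and add [CLS]/[SEP]."""
    cleaned = NOISE.sub("", text)
    return [CLS] + list(cleaned) + [SEP]


def pad_batch(token_lists):
    """Right-pad every token list with [PAD] to the longest length in the batch."""
    longest = max(len(tokens) for tokens in token_lists)
    return [tokens + [PAD] * (longest - len(tokens)) for tokens in token_lists]


batch = pad_batch([preprocess("ab# c%d"), preprocess("xy")])
for tokens in batch:
    print(tokens)
```

After padding, every sequence in the batch has the same length, so the language model can produce word vector matrices of uniform shape.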

(2) The word vector representations generated by the pre-trained language model are input into two independent classifiers, which predict the start positions and the end positions of geographic entities and spatial relationship feature words in the text, respectively. The predicted start and end indexes are then paired according to the principle of proximity to construct [start, end] index pairs for the geographic entities and spatial relationship feature words.

$$POS^{start} = \mathrm{GeLU}(W_{start} Z + b_{start}), \quad POS^{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$$

$W_{start}$ and $W_{end}$ denote the parameter matrices of the two classifiers, and $b_{start}$ and $b_{end}$ denote their bias coefficients. $POS^{start}$ and $POS^{end}$ denote the start and end positions of a geographic entity or spatial relationship feature word, respectively.

1) The start position classifier and the end position classifier are each built from a single-layer feedforward neural network. The network maps the vector representation of a single character to a one-dimensional tensor, and a preset hyper-parameter threshold determines whether that character is the start or end position of a geographic entity or spatial relationship feature word.

2) The invention constructs [start, end] index pairs by matching start positions to end positions according to the principle of proximity. Specifically, the start positions and the end positions are sorted in ascending order, the start positions are traversed in sequence as the reference, and for each one an end position that satisfies the matching rule is selected. Under this rule, no [start, end] pair contains any other start position or end position.
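One possible reading of the proximity matching rule is sketched below. The function name is hypothetical, end positions are assumed to be inclusive, and the handling of unmatched starts (they are simply dropped) is an assumption, since the text does not say what happens when no valid end exists.

```python
def match_spans(pos_start, pos_end):
    """Pair each start with its nearest following end, never crossing another start."""
    starts, ends = sorted(set(pos_start)), sorted(set(pos_end))
    pairs, e = [], 0
    for n, s in enumerate(starts):
        next_start = starts[n + 1] if n + 1 < len(starts) else float("inf")
        # Skip end positions that lie before this start.
        while e < len(ends) and ends[e] < s:
            e += 1
        # Accept the nearest end only if no other start falls inside the span.
        if e < len(ends) and ends[e] < next_start:
            pairs.append((s, ends[e]))
            e += 1  # each end index is used at most once
    return pairs


print(match_spans([1, 4, 6, 9], [2, 5, 7, 10]))
print(match_spans([1, 3, 6], [4, 7]))
```

In the second call the start at index 1 is left unmatched, because the nearest end (index 4) lies beyond the next start (index 3) and pairing them would enclose another start position.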

(3) Based on the [start, end] index pairs constructed in step (2), the word vector representations from the start position to the end position are fused by max pooling to generate the corresponding character span representation. Each character span representation is then passed to a feedforward neural network, which determines its entity type (geographic entity, spatial relationship feature word, or the null type for spans that belong to neither).

$$S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$$

$$\text{Entity Class} = \mathrm{softmax}(W_{entity} S_{[i-j]} + b_{entity})$$

$W_{entity}$ denotes the parameter matrix of the feedforward neural network used for entity recognition, and $b_{entity}$ denotes its bias coefficients.
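The entity classification formula can be illustrated with a small sketch that applies an affine layer and a softmax to one span representation. The label set, the weights and the span vector are made-up values for illustration only; the patent does not list its exact label inventory or parameter shapes.

```python
import math

# Hypothetical label set; "none" stands for the null type.
LABELS = ["administrative division", "spatial relationship feature word", "none"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_span(span_vec, W_entity, b_entity):
    """Compute softmax(W_entity . S + b_entity) and return the best label."""
    logits = [sum(w * s for w, s in zip(row, span_vec)) + b
              for row, b in zip(W_entity, b_entity)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


span_vec = [0.5, 0.6, 0.3]                                      # a max-pooled span (toy values)
W_entity = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [0.0, -1.0, 0.0]]  # one row per label
b_entity = [0.0, 0.0, 0.0]
label, probs = classify_span(span_vec, W_entity, b_entity)
print(label, probs)
```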

(4) Using the geographic entity type recognition results from step (3), entity start and entity end markers are added at the corresponding positions in the source text, and the start and end positions predicted for geographic entities and spatial relationship feature words are updated. The text with the added markers is then input into another pre-trained language model to generate the corresponding word vector representations. Finally, the word vectors of the start and end markers are fused by average pooling to represent each geographic entity, and the word vectors from the start to the end position are fused by max pooling to represent each spatial relationship feature word.

$$Z' = \mathrm{BERT}'(T'), \quad T' = \{t'_1, t'_2, \ldots, t'_M\}, \quad Z' = \{z'_1, z'_2, \ldots, z'_M\}$$

$$S_{\text{Feature word-}[k-l]} = \mathrm{Max}([z'_k; z'_{k+1}; \ldots; z'_l])$$

i, j respectively represent the positions of the start marker and the end marker of the geographic entity predicted by the model, and k, l respectively represent the positions of the start marker and the end marker of the spatial feature word predicted by the model.
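The marker insertion and position update in step (4) might look like the sketch below. The marker tokens `<e>` and `</e>` are hypothetical (the patent does not name its markers), spans are assumed to use inclusive end indexes and not to overlap, and word-level English tokens stand in for the individual characters of the original Chinese text.

```python
def add_entity_markers(tokens, entity_spans, feature_spans,
                       open_tok="<e>", close_tok="</e>"):
    """Wrap each entity span in markers and shift feature-word spans to match."""
    starts = {s for s, _ in entity_spans}
    ends = {e for _, e in entity_spans}
    out, new_index = [], {}
    for idx, tok in enumerate(tokens):
        if idx in starts:
            out.append(open_tok)
        new_index[idx] = len(out)  # where the original token lands in the new text
        out.append(tok)
        if idx in ends:
            out.append(close_tok)
    # Positions of the (start marker, end marker) for each entity.
    marker_spans = [(new_index[s] - 1, new_index[e] + 1) for s, e in entity_spans]
    # Updated positions of the feature words.
    shifted = [(new_index[k], new_index[l]) for k, l in feature_spans]
    return out, marker_spans, shifted


tokens = ["[CLS]", "A", "town", "located", "B", "county", "northeast", "[SEP]"]
marked, marker_spans, shifted = add_entity_markers(
    tokens, entity_spans=[(1, 2), (4, 5)], feature_spans=[(3, 3), (6, 6)])
print(marked)
print(marker_spans, shifted)
```

The returned marker positions correspond to the indexes i and j whose vectors would be average-pooled, and the shifted feature-word spans correspond to k and l for max pooling.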

(5) The model first matches the geographic entities in combination to form a set of candidate pairs of geographic entities. Then, the model selects any pair of geographic entities in the set and word vector representations corresponding to the spatial relationship characteristic words, and the word vector representations are spliced; then, fusing the spliced vector representation into a text feature vector through a self-attention mechanism (self-attention); and finally, inputting the text feature vectors into a feed-forward neural network for spatial relationship classification, and judging the spatial relationship between the geographic entities according to probability information output by the feed-forward neural network.

$$S = \mathrm{concat}(S_{\text{Entity-sub}}; S_{\text{Entity-obj}}; S_{\text{Featureword}_1}; \ldots; S_{\text{Featureword}_p})$$

$S_{\text{Entity-sub}}$ and $S_{\text{Entity-obj}}$ denote the character span representations of the subject and the object, respectively, and $S_{\text{Featureword}_i}$ denotes the word vector of each spatial relationship feature word identified by the model. $W_q$, $W_k$ and $W_v$ denote the parameter matrices that generate the query, key and value vectors, respectively, and $W_r$ and $b_r$ denote the parameter matrix and bias coefficients of the feedforward neural network used for spatial relationship classification.
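The candidate pair construction and splicing in step (5) can be sketched as below. Treating pairs as ordered (subject, object) combinations is an assumption suggested by the subject and object roles in the formula, and stacking the vectors as rows of a matrix is one interpretation of the concatenation; the names and toy vectors are illustrative only.

```python
from itertools import permutations


def candidate_pairs(entities):
    """All ordered (subject, object) pairs of distinct entities."""
    return list(permutations(entities, 2))


def splice(subj_vec, obj_vec, feature_vecs):
    """Stack subject, object and feature-word vectors into the matrix S, one row each."""
    return [subj_vec, obj_vec, *feature_vecs]


# Toy span representations (made-up numbers).
entity_vecs = {"Town A": [0.1, 0.5], "County B": [0.4, 0.2]}
feature_vecs = [[0.9, 0.0], [0.3, 0.7]]  # e.g. "located", "northeast"

pairs = candidate_pairs(list(entity_vecs))
S = splice(entity_vecs[pairs[0][0]], entity_vecs[pairs[0][1]], feature_vecs)
print(pairs)
print(S)
```

Each spliced matrix would then be passed through the self-attention fusion and the relation classifier to score the spatial relationship for that candidate pair.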

As shown in FIG. 1, the spatial relationship extraction method based on the fusion of the pre-training language model and the text features of the invention mainly comprises the following three parts:

1. Text word vector generation based on a pre-trained language model.

2. Character span representation generation based on the start and end position indices and pooling methods.

3. Text feature fusion considering geographic entity types and spatial relationship feature words.

The detailed flow of the spatial relationship extraction method is described below, taking Chinese text data from the Chinese Encyclopedia (Geography) as an example.

(1) Chinese text data preprocessing and word vector generation based on a pre-training language model.

As shown in FIG. 2, the selected text data is "Town A is located in the northeast of County B." Following the data preprocessing step, the text is segmented character by character, and [CLS] and [SEP] symbols are added at the beginning and end of the data, respectively. The preprocessed text data is then input into the pre-trained language model to generate a character vector representation matrix of uniform dimension.

(2) And identifying the starting position and the ending position of the geographic entity and the spatial relation characteristic word based on the two classifiers.

The word vector representations generated by the pre-trained language model are input into two independent classifiers, which predict the start positions and the end positions of geographic entities and spatial relationship feature words in the text, respectively. For the example "[CLS] Town A is located in the northeast of County B. [SEP]", one prediction of the two classifiers is start position 1 and end position 3.

(3) A character span representation based on the start and end position pairs is generated.

As shown in FIG. 3, based on the start and end positions of the geographic entity predicted by the two classifiers and the word vector representation of the text data, the word vectors from the start position to the end position are fused by max pooling to generate a character span representation. In the example, the word vectors of the two characters "A" and "town" are fused into a character span representation that characterizes "Town A".

(4) And identifying the geographic entity and the spatial relation characteristic word of the text data.

Each generated character span representation is input into the feedforward neural network, which determines its entity type (geographic entity, spatial relationship feature word, or the null type for spans that belong to neither). In the example, the model recognizes "Town A", "located", "County B" and "northeast", where "Town A" and "County B" are administrative divisions (a geographic entity type) and "located" and "northeast" are spatial relationship feature words. Based on the recognition results, the model adds entity start and entity end markers at the corresponding positions in the source text and updates the start and end positions predicted for geographic entities and spatial relationship feature words. The model then inputs the marked text into another pre-trained language model to generate the corresponding word vector representations anew. Finally, the word vectors of the start and end markers are fused by average pooling to represent each geographic entity, and the word vectors from the start to the end position are fused by max pooling to represent each spatial relationship feature word.

(5) Text feature fusion and spatial relationship extraction based on self-attention mechanism

The model first combines the recognized geographic entities pairwise to form a set of candidate geographic entity pairs. Then, as shown in FIG. 4, the model selects the pair ("Town A", "County B") from the set together with the spatial relationship feature words and splices the word vector representations of these elements. The spliced representation is fused into a text feature vector through a self-attention mechanism. Finally, the text feature vector is input into a feedforward neural network for spatial relationship classification, and the spatial relationship between the geographic entities is determined from the probability information output by the network.

In combination with the embodiment, the method provided by the invention uses the pre-training language model to generate the word vector, and simultaneously considers the association relationship between the geographic entity type and the spatial relationship characteristic word and the spatial relationship, and has good extraction performance and interpretability.

The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be made by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims

1. The spatial relationship extraction method based on the fusion of the pre-training language model and the text features is characterized by comprising the following steps of:

step 1: preprocessing original text data, removing meaningless characters in a text by using a regular expression, ensuring that front and back quotation marks in the text are completely matched, segmenting the processed text data character by character, adding [ CLS ] and [ SEP ] identifiers at the beginning and the end of a segmentation result, and if the text data is input in a batch form, ensuring that each piece of text data is consistent in length, and filling the text data with shorter length by using the [ PAD ] identifier;

step 2: inputting the preprocessed text data into a pre-trained language model, which converts the character-by-character segmentation result $T = \{t_1, t_2, \ldots, t_N\}$ into dense real-valued vectors $Z = \{z_1, z_2, \ldots, z_N\}$;

step 3: inputting the word vectors obtained in step 2 into two classifiers, each formed by a single-layer feedforward neural network, to predict whether each word vector $z_i$ is the start or the end of a geographic entity or spatial relationship feature word, the prediction results of the two classifiers being recorded in the index sets $POS_{start}$ and $POS_{end}$, respectively, and sorted in ascending order of index,

$$POS^{start} = \mathrm{GeLU}(W_{start} Z + b_{start}), \quad POS^{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$$

based on the word vectors $Z = \{z_1, z_2, \ldots, z_N\}$ and the index sets $POS_{start}$ and $POS_{end}$, selecting a pair of start and end indexes $[i, j]$ according to the principle of proximity, and fusing the word vectors $z_i$ to $z_j$ in $Z$ by max pooling to generate a character span representation;

step 4: inputting the character span representation generated in step 3 into an entity recognizer constructed from a single-layer feedforward neural network, and predicting the entity type represented by the character span,

$$S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$$

$$\text{Entity Class} = \mathrm{softmax}(W_{entity} S_{[i-j]} + b_{entity})$$

the entity type comprising a specific geographic entity type, a spatial relationship feature word, or a null type indicating that the character span representation does not belong to any geographic entity or spatial relationship feature word type;

step 5: according to the prediction results for the geographic entities in the text, the model automatically adds geographic entity markers at the start and end positions of the entities in the source text data and updates the start and end position information of the spatial relationship feature words in the source text data; after the markers are added, the newly generated text data is input into another pre-trained language model to generate the corresponding text word vectors; the model fuses the word vectors of the start marker and the end marker by average pooling to represent the geographic entity, and fuses the word vectors of the start mark and the end mark by max pooling to represent the spatial relationship feature word;

step 6: matching the geographic entities by the model in a combined mode to form a candidate geographic entity pair set, selecting any pair of geographic entities in the set and word vector representations corresponding to the spatial relationship characteristic words, and splicing the word vector representations; fusing the spliced vector representation into a text feature vector through a self-attention mechanism; inputting the text feature vector into a feedforward neural network for spatial relationship classification, judging the spatial relationship between the geographic entities according to probability information output by the feedforward neural network,

Relation Class＝softmax(W_rS′+b_r)。

2. The spatial relationship extraction method based on the fusion of the pre-trained language model and the text features as claimed in claim 1, wherein the pre-trained language model, based on large-scale geographic-domain text data, learns grammar rules and mines latent semantics from the text data by a self-supervised learning method; taking text data segmented at character granularity as input, the model encodes the text data from three aspects, namely the characters themselves, their positions and their semantics, to generate a word vector matrix whose dimensions are the output dimension set for the pre-trained language model and the character length of the input text, respectively.

3. The spatial relationship extraction method based on the fusion of the pre-trained language model and the text features as claimed in claim 1, wherein in the spatial relationship extraction process, two independent pre-trained language models are used for the two subtasks of geographic entity recognition and spatial relationship classification, and in the model training process, the two pre-trained language models do not affect each other and update their parameters independently, so as to better generate word vector representations meeting the subtask requirements; the word vector representation $Z$ of the text data $T$ generated by a pre-trained language model can be expressed as $Z = \mathrm{BERT}(T)$, with $T = \{t_1, t_2, \ldots, t_N\}$ and $Z = \{z_1, z_2, \ldots, z_N\}$, where $N$ denotes the number of characters in each sample of text data.

4. The spatial relationship extraction method based on the fusion of the pre-trained language model and the text features as claimed in claim 1, wherein the two classifiers are each formed by a single-layer feedforward neural network and are used for predicting the start and end positions of geographic entities and spatial relationship feature words in the text data; the two classifiers take the word vectors generated by the pre-trained language model as input, output the result of an affine operation followed by a GeLU activation function, and judge whether the current character is the start or end position of a geographic entity or spatial relationship feature word according to a set threshold $\delta$ and the output result, the process being expressed as $POS^{start} = \mathrm{GeLU}(W_{start} Z + b_{start})$, $POS^{end} = \mathrm{GeLU}(W_{end} Z + b_{end})$, where a position is labeled 1 if $POS^{start} > \delta$ and 0 otherwise.

5. The spatial relationship extraction method based on the fusion of the pre-trained language model and the text features as claimed in claim 1, wherein the character span representation is generated by fusing the word vector representations between the start and end indexes with a pooling method; the max pooling method considers every dimension of every word vector and takes the maximum value of each dimension to form the final word vector representation; the average pooling method focuses on the features of the boundary word vectors, the word vectors of the geographic entity boundary markers being averaged to represent the geographic entity so that the model better learns the boundary features and type features of the entity; the max pooling method can be expressed as $S_{[i-j]} = \mathrm{Max}([z_i; z_{i+1}; \ldots; z_j])$, and the average pooling method takes the mean of the start-marker and end-marker word vectors of the entity.

6. The spatial relationship extraction method based on the fusion of the pre-trained language model and the text features as claimed in claim 1, wherein the model forms a text feature matrix by splicing the word vectors of the geographic entities and the spatial relationship feature words; based on a self-attention mechanism, the parameter matrices $W_q$, $W_k$ and $W_v$ generate a query matrix $Q$, a key matrix $K$ and a value matrix $V$, respectively, and the three matrices are then fused using a softmax function to generate a text feature vector of a specified dimension.