CN116562296A - Geographic named entity recognition model training method and geographic named entity recognition method - Google Patents


Info

Publication number
CN116562296A
Authority
CN
China
Prior art keywords
geographic
named entity
entity recognition
text data
model
Prior art date
Legal status
Pending
Application number
CN202310625300.8A
Other languages
Chinese (zh)
Inventor
徐流畅
夏天舒
张程锟
张嘉俊
姚俊伟
Current Assignee
Sinyada Technology Co ltd
Original Assignee
Sinyada Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Sinyada Technology Co ltd filed Critical Sinyada Technology Co ltd

Classifications

    • G06F40/295 — Named entity recognition (G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F40/151 — Transformation (G06F40/10 Text processing; G06F40/12 Use of codes for handling textual entities)
    • G06F40/30 — Semantic analysis
    • G06N3/045 — Combinations of networks (G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/0499 — Feedforward networks
    • G06N3/08 — Learning methods


Abstract

The application discloses a geographic named entity recognition model training method and a geographic named entity recognition method, relating to the technical field of information extraction. The geographic named entity recognition model training method comprises the following steps: acquiring first network text data, and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model; labeling the first network text data, and constructing a geographic named entity recognition dataset according to the labeling results; and constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and fine-tuning the initial geographic named entity recognition model using the geographic named entity recognition dataset to obtain a target geographic named entity recognition model. The method adopts deep transfer learning, so that the geographic named entity recognition model does not need to be trained from scratch; instead, incremental pre-training is performed on the basis of the original geographic named entity semantic model, which greatly saves time.

Description

Geographic named entity recognition model training method and geographic named entity recognition method
Technical Field
The application relates to the technical field of information extraction, in particular to a geographic named entity recognition model training method and a geographic named entity recognition method.
Background
Named entity recognition (NER) is a widely used technology in the field of natural language processing. It aims to recognize and extract named entities from text and classify them into predefined entity categories, where named entities generally refer to entities with specific names or identifiers, such as person names, place names, organization names, dates, times, and currencies.
Ubiquitous network text data refers to text data collected from the Internet. Named entity recognition methods for such data generally include geographic named entity recognition based on a geographic named entity database, geographic named entity recognition based on machine learning, and geographic named entity recognition based on deep neural networks. However, the database-based method presupposes a complete geographic named entity database and will miss geographic named entities absent from that database. The machine-learning-based method requires a large amount of labeled data, whose acquisition and processing take considerable labor and time; it is also very sensitive to the quality of feature extraction, and insufficient or unreasonable features will reduce task accuracy. Deep-neural-network-based methods likewise require a large amount of labeled data.
Disclosure of Invention
The geographic named entity recognition model training method and the geographic named entity recognition method of the present application aim to recognize geographic named entities in ubiquitous social media data and to address the diversity, large volume, and noise of such data during recognition.
In order to achieve the above purpose, the present application adopts the following technical scheme:
the geographic naming entity recognition model training method comprises the following steps:
acquiring first network text data, and performing incremental pre-training on an initial geographic naming entity semantic model according to the first network text data to obtain a first geographic naming entity semantic model;
labeling the first network text data, and constructing a geographic naming entity identification data set according to labeling results;
and constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and performing fine adjustment on the initial geographic named entity recognition model by utilizing the geographic named entity recognition data set to obtain a target geographic named entity recognition model.
Preferably, the initial geographic named entity semantic model comprises a semantic feature extraction module based on a multi-head self-attention mechanism and a confusion word correction module, wherein the semantic feature extraction module based on the multi-head self-attention mechanism comprises a multi-head self-attention layer, a residual network and a fully connected feedforward neural network, with LeakyReLU as the activation function, and the confusion word correction module applies a whole-word dynamic masking strategy based on confusion word replacement, built on the masked language model of the BERT pre-trained language model.
Preferably, the residual network consists of several residual units, a single residual unit being expressed as:
self-attention_l = self-attention_{l-1} + F(self-attention_{l-1})
wherein self-attention_l and self-attention_{l-1} represent the outputs of the l-th and (l-1)-th multi-head self-attention layers, respectively, and F represents the multi-head self-attention processing function.
Preferably, the fully connected feedforward neural network is connected after the output of the multi-head self-attention layer, and the semantic feature extraction module based on the multi-head self-attention mechanism further performs layer normalization on the output of each layer.
Preferably, the whole-word dynamic masking strategy based on confusion word replacement consists of a dynamic masking strategy, a whole-word masking strategy, and a masking strategy based on confusion word replacement, wherein the dynamic masking strategy masks each model input in N different ways, the whole-word masking strategy masks a complete geographic named entity, and the masking strategy based on confusion word replacement replaces mask tokens with confusion words.
Preferably, labeling the first network text data and constructing a geographic named entity recognition dataset according to the labeling results comprises:
marking the beginning character of each geographic named entity in the first network text data as B-Entity and the middle characters as I-Entity, and marking the remaining characters in the first network text data as O, to obtain a first geographic named entity dataset;
collecting a Chinese fine-grained named entity recognition dataset, and labeling it according to the labeling method used for the first network text data to obtain a second geographic named entity dataset;
and fusing the first geographic named entity dataset and the second geographic named entity dataset to obtain the geographic named entity recognition dataset.
Preferably, constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model comprises:
adding a task-specific neural network structure after the first geographic named entity semantic model to form the initial geographic named entity recognition model, wherein the first geographic named entity semantic model serves as the encoder of the initial geographic named entity recognition model and the task-specific neural network structure serves as its decoder.
A geographic named entity recognition method comprises the following steps:
acquiring second network text data, and cleaning the second network text data to obtain target network text data;
and inputting the target network text data into a geographic named entity recognition model for recognition, to obtain the geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is trained according to the above target geographic named entity recognition model training method.
A geographic named entity recognition model training device comprises:
an increment module for acquiring first network text data and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model;
a labeling module for labeling the first network text data and constructing a geographic named entity recognition dataset according to the labeling results;
and a construction module for constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model and fine-tuning the initial geographic named entity recognition model using the geographic named entity recognition dataset to obtain a target geographic named entity recognition model.
A geographic named entity recognition device comprises:
an acquisition module for acquiring second network text data and cleaning the second network text data to obtain target network text data;
and a recognition module for inputting the target network text data into a geographic named entity recognition model for recognition, to obtain the geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is obtained from the above target geographic named entity recognition model training device.
The invention has the following beneficial effects:
Compared with other traditional methods for recognizing and extracting geographic named entities, the method has a stronger capability of processing unstructured text, long text, multi-label classification and other complex text structures, and is particularly suitable for recognition and extraction on social media data. By adopting deep transfer learning, the geographic named entity recognition model does not need to be trained from scratch; instead, incremental pre-training is performed on the basis of the original geographic named entity semantic model, which greatly saves time and allows quick adaptation to a variety of data environments.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a neural network structure diagram of the semantic feature extraction module based on the multi-head self-attention mechanism provided herein;
FIG. 2 is an example neural network structure diagram of the confusion word correction module provided herein;
FIG. 3 is an overall structure diagram of the geographic named entity semantic model provided by the present application;
FIG. 4 is a neural network structure diagram of the geographic named entity recognition and extraction model provided by the present application;
FIG. 5 is the training loss curve generated by fine-tuning the initial geographic named entity recognition model using the geographic named entity recognition dataset of the present application;
FIG. 6 is a bar graph comparing the F1 values of the target geographic named entity recognition model and other geographic named entity recognition methods.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments herein without inventive effort fall within the scope of the present application.
The terms "first", "second" and the like in the claims and the description of the present application are used to distinguish similar objects and do not necessarily describe a particular sequence or chronological order; the terms so used may be interchanged where appropriate, and merely distinguish objects of the same nature when describing embodiments of the present application. Furthermore, the terms "comprise" and "have" and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus comprising a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
This embodiment provides a geographic named entity recognition model training method, comprising the following steps:
S110, acquiring first network text data, and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model;
S120, labeling the first network text data, and constructing a geographic named entity recognition dataset according to the labeling results;
S130, constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and fine-tuning the initial geographic named entity recognition model using the geographic named entity recognition dataset to obtain a target geographic named entity recognition model.
In this embodiment, the initial geographic named entity semantic model consists of a semantic feature extraction module based on a multi-head self-attention mechanism and a confusion word correction module. The semantic feature extraction module uses the general structure of a neural network language model as its design basis and is built around the characteristics of geographic named entity text data. To let the model learn the semantic features of geographic named entity text more fully, the implementation of the multi-head self-attention mechanism is packaged into a module, and these modules are stacked layer by layer to build a model architecture deep enough for geographic named entity semantic extraction. Meanwhile, to address the gradient vanishing and network degradation problems, this embodiment first introduces a residual network for the gradient vanishing problem. The residual network consists of several residual units; within a module containing multi-head self-attention, a single residual unit can be expressed as:
self-attention_l = self-attention_{l-1} + F(self-attention_{l-1})
wherein self-attention_l and self-attention_{l-1} denote the outputs of the l-th and (l-1)-th multi-head self-attention layers, respectively, and F denotes the multi-head self-attention processing function.
Meanwhile, the residual network in this embodiment superimposes the input information onto the output of each multi-head self-attention layer, so that self-attention semantics and text representation information are fused.
Next, to address the internal covariate shift problem in deep neural networks, this embodiment applies layer normalization (Layer Normalization, LN) to the superimposed output of each layer in the module. Layer normalization normalizes all neurons of one network layer so that the inputs of each layer remain stably distributed, which helps mitigate gradient vanishing, makes the training process more stable, and helps the model converge more quickly.
Meanwhile, to transform the dimension of the self-attention layer's output and add nonlinearity, this embodiment adds a fully connected feedforward neural network after the self-attention layer output, using LeakyReLU as the activation function. By allowing a small non-zero gradient for negative inputs, LeakyReLU prevents "dead neurons" and improves the learning ability of the neural network. Its mathematical expression is:
LeakyReLU(x) = x, if x ≥ 0; LeakyReLU(x) = a·x, if x < 0
In this embodiment, a has a value of 0.01.
In summary, the semantic feature extraction module based on the multi-head self-attention mechanism is obtained; its neural network structure is shown in FIG. 1. The modules are stacked layer by layer, and the design of residual connections, layer normalization and the fully connected feedforward neural network is introduced, so that the geographic named entity semantic model can better extract the semantic features of geographic named entities. Each multi-head self-attention module in the geographic named entity semantic model is defined as a Geographical Named Entity Transformer Module.
Considering that the above module splits a geographic named entity text sequence into individual characters at input time and does not consider the relationship between the geographic named entity elements in the sequence, this embodiment also designs a character encoding strategy that takes the boundary information of geographic named entity levels into account. It does not take words as the basic input unit but continues to use character-level vectorized representations, and injects level information by inserting a special tag [B] between geographic named entity levels; this tag tells the model that the preceding character is the end of one geographic named entity level and the following character is the beginning of another. Simply put, the geographic named entity text in the corpus is segmented with a word segmentation tool, and after segmentation, special tag symbols are inserted between the geographic named entity elements as separators. These tag symbols are treated as ordinary characters in the model and also form part of the input to the Geographical Named Entity Transformer Module.
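As a hedged illustration of this separator strategy, the sketch below inserts the [B] tag between segmented entity levels; the example place-name levels and the simple list-based interface are assumptions, since the text does not specify the segmentation tool's output format.

```python
def inject_level_tags(entity_levels, sep="[B]"):
    """Flatten geographic-entity levels into a character sequence,
    inserting the special [B] separator tag between adjacent levels.
    The tag is later treated as an ordinary input character."""
    chars = []
    for i, level in enumerate(entity_levels):
        if i > 0:
            chars.append(sep)  # boundary between two entity levels
        chars.extend(level)    # character-level representation
    return chars
```

For example, two hypothetical levels "山东省" and "济南市" become the character sequence 山 东 省 [B] 济 南 市.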
Because homophones, synonyms and the like exist among input geographic named entities, this embodiment also designs a confusion word correction module so that the model can correctly recognize the input geographic named entity and automatically correct its output. As shown in FIG. 2, the neural network structure of the confusion word correction module is designed as follows:
(1) To convert the output of the semantic feature extraction module based on multi-head self-attention into the context-dependent "correction word embedding" of each character, in preparation for computing the probability of each character of the word to be predicted against the Lookup Table in the subsequent step, the output of the semantic feature extraction module is fed into a fully connected neural network layer with a LeakyReLU activation function for a nonlinear transformation:
prob_embedding = s(W × SA + b) (1)
In Equation 1, SA represents the output of the semantic feature extraction module, s represents the LeakyReLU activation function, and W and b represent the weight matrix and bias of the fully connected neural network layer.
(2) A fully connected layer is established, and the correction word embedding is linearly transformed to obtain probability distribution scores:
logits = C^T × prob_embedding + b (2)
In Equation 2, C represents the Lookup Table.
(3) Finally, the result is fed into a softmax function to obtain, for each character of the word to be predicted, the conditional probability distribution over the words of the dictionary:
prob = softmax(logits) (3)
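Equations (1)–(3) can be sketched as a minimal NumPy forward pass; the dimensions and random parameters below are illustrative assumptions, and `C` plays the role of the Lookup Table.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def correction_head(SA, W, b1, C, b2):
    # Eq. (1): correction word embedding via a fully connected layer + LeakyReLU
    prob_embedding = leaky_relu(SA @ W.T + b1)
    # Eq. (2): probability-distribution scores against the Lookup Table C
    logits = prob_embedding @ C + b2
    # Eq. (3): conditional probability distribution over the dictionary
    return softmax(logits)
```

Here `SA` is (sequence_length, hidden_dim), `W` is (hidden_dim, hidden_dim), and `C` is (hidden_dim, vocab_size); each output row is a probability distribution summing to 1.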
meanwhile, in order to enable the model to complete the aim of correcting the confusion word based on the known context, the embodiment also refers to the core idea of masking language model tasks in a classical BERT pre-training language model, and designs a full-word dynamic masking strategy based on confusion word replacement aiming at the characteristics of geographic naming entity text data, and the random masking strategy used in BERT mainly refers to the idea of completely filling, namely, certain characters in sentences are randomly shielded with a certain probability and replaced by [ MASK ] marks, and the special marks can be regarded as 'spaces' in completely filling, but have limitations in the scene of geographic naming entity text data, so the embodiment designs the following masking strategy:
(1) Dynamic masking strategy: to avoid the model seeing the same masking in every round of training, this embodiment replicates the training dataset N times so that each input text sequence is masked in N different ways during training. Therefore, each masking pattern of an input text sequence is used the following number of times:
epoch/N
wherein epoch is the number of training rounds.
(2) Whole-word masking: to address the problems caused by the character-level masking strategy in the classical BERT model, this embodiment adopts a whole-word masking strategy: if a character to be masked belongs to a complete geographic named entity element, the whole geographic named entity element is masked.
(3) Masking strategy based on confusion word replacement: to address the inconsistency between pre-training and fine-tuning caused by BERT's [MASK] strategy, this embodiment replaces the [MASK] tag with confusion words.
In pre-training, synonyms and homophones are used as confusion words: the synonyms are generated with the Chinese synonym generation toolkit synonyms, whose similarity is computed based on word2vec, and the homophones are generated with the data augmentation toolkit JionLP.
Comprehensively considering the accuracy and efficiency of model training, this embodiment selects 15% of the words in the input text for masking; of these masked words, 40% are replaced by synonyms, 40% by homophones, 10% by random words, and 10% remain unchanged.
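A hedged sketch of this selection scheme is shown below; the actual synonym/homophone generation via the synonyms and JionLP toolkits is abstracted into action labels, and the simple word-list interface is an assumption.

```python
import random

def choose_mask_actions(words, mask_ratio=0.15, seed=0):
    """Pick ~15% of the words and assign each a replacement action with
    the stated proportions: 40% synonym, 40% homophone, 10% random
    word, 10% unchanged. Returns {word_index: action}."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(words) * mask_ratio))
    picked = rng.sample(range(len(words)), n_mask)
    actions = {}
    for i in picked:
        r = rng.random()
        if r < 0.40:
            actions[i] = "synonym"
        elif r < 0.80:
            actions[i] = "homophone"
        elif r < 0.90:
            actions[i] = "random"
        else:
            actions[i] = "unchanged"
    return actions
```

Running this with different seeds per dataset copy realizes the dynamic (per-copy) masking described in strategy (1).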
The confusion word correction module obtained above is placed immediately after the semantic feature extraction module based on the multi-head self-attention mechanism, yielding the initial geographic named entity semantic model, whose neural network structure is shown in FIG. 3. Specifically, given a word, this embodiment takes the semantic representation output by the stacked Geographical Named Entity Transformer Modules of the initial geographic named entity semantic model and performs confusion word correction on it. The initial parameters of the initial geographic named entity semantic model are the parameters of the BERT model, which greatly reduces the time and computational cost of model pre-training and lets the model adapt to specific tasks in the geographic domain on the basis of general semantic representation capability.
Then, first ubiquitous network text data is collected. This embodiment uses 3,530,611 pieces of geographic named entity text data from 2022 located in Jinan, Shandong, obtained after data cleaning; all of the data is input into the initial geographic named entity semantic model for incremental pre-training, thereby obtaining the first geographic named entity semantic model.
Next, a geographic named entity recognition dataset is constructed for the geographic named entity recognition task. First, the first ubiquitous network text data is labeled: the first (beginning) character of each geographic named entity contained in the data is labeled "B-Entity", the middle characters are labeled "I-Entity", and the remaining characters in the text data are labeled "O". The "O" label is mapped to 0, the "B-Entity" label to 1, and the "I-Entity" label to 2, thereby obtaining the first geographic named entity dataset, as shown in Table 1.
Table 1: Data sample from the first geographic named entity recognition dataset
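The BIO labeling and label mapping just described can be sketched as follows; the example sentence and the character-span interface are illustrative assumptions.

```python
def bio_labels(text, entity_spans):
    """Label each character: 'B-Entity' for the first character of a
    geographic named entity, 'I-Entity' for its remaining characters,
    'O' for everything else; then map O->0, B-Entity->1, I-Entity->2.
    `entity_spans` are (start, end) character offsets, end exclusive."""
    tags = ["O"] * len(text)
    for start, end in entity_spans:
        tags[start] = "B-Entity"
        for i in range(start + 1, end):
            tags[i] = "I-Entity"
    mapping = {"O": 0, "B-Entity": 1, "I-Entity": 2}
    return tags, [mapping[t] for t in tags]
```

For a hypothetical sentence "我在济南市" with the entity span (2, 5), this yields the tags O O B-Entity I-Entity I-Entity and the ids 0 0 1 2 2.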
Meanwhile, the CLUENER2020 dataset is acquired; CLUENER2020 is a Chinese fine-grained named entity recognition dataset built from data collected from network news. The Chinese fine-grained named entity recognition dataset is labeled with the above labeling method to obtain the second geographic named entity recognition dataset, and the first and second geographic named entity recognition datasets are fused to obtain the geographic named entity recognition dataset.
In this embodiment, the geographic named entity recognition task is converted into a sequence labeling task on text: each character in a given ubiquitous network text is labeled to determine whether it belongs to a geographic named entity. As shown in FIG. 4, the initial geographic named entity recognition model of this embodiment adds a task-specific neural network structure on top of the first geographic named entity semantic model; the first geographic named entity semantic model is called its encoder, and the task-specific neural network structure is called its decoder. To make the model better fit the downstream task, the initial geographic named entity recognition model is fine-tuned with the geographic named entity recognition dataset to obtain the target geographic named entity recognition model. An early stopping strategy is used to prevent overfitting: while fine-tuning on the training set, the model's performance on a validation set is monitored in every round of training. For the data split, 80% of the data is randomly selected as the training set, 10% as the validation set, and the remaining 10% as the test set. The resulting loss curve is shown in FIG. 5.
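The 80/10/10 split and the early stopping monitor described above can be sketched as follows; the patience value and the list-based interfaces are assumptions, since the text names the strategy without giving parameters.

```python
import random

def split_dataset(samples, seed=42):
    """Randomly split samples 80/10/10 into train/validation/test sets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_train = int(len(samples) * 0.8)
    n_val = int(len(samples) * 0.1)
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test

def should_stop(val_losses, patience=3):
    """Early stopping: stop once the validation loss has not improved
    for `patience` consecutive rounds (patience is an assumed value)."""
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience
```

In a fine-tuning loop, `should_stop` would be checked after each epoch's validation pass, keeping the parameters from the best-scoring epoch.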
This embodiment adopts the pre-training-then-fine-tuning paradigm, so the model's training process saves substantial time, and the scheme can quickly adapt to a variety of data environments.
Since the F1 value takes model performance into account comprehensively and is a relatively reliable indicator, this embodiment compares, in the form of a bar graph, the F1 values of FCNN+CRF, RNN+CRF, BiLSTM+CRF, BERT+CRF, BERT+Softmax, BERT+BiLSTM+CRF and the target geographic named entity recognition model of this embodiment. As shown in FIG. 6, the superiority of the target geographic named entity recognition model of this embodiment can be verified clearly and intuitively.
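As a reminder of the metric used for this comparison, below is a minimal entity-level F1 sketch; the span-set interface is an assumption, since the text does not state how predicted and gold entities are matched.

```python
def f1_score(pred_spans, gold_spans):
    """Entity-level F1: harmonic mean of precision and recall over the
    sets of predicted and gold entity spans."""
    pred, gold = set(pred_spans), set(gold_spans)
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Because F1 combines precision and recall, a model cannot score well by over-predicting or under-predicting entities alone, which is why the comparison in FIG. 6 uses it.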
This embodiment also provides a geographic named entity recognition method, comprising the following steps:
S210, acquiring second network text data, and cleaning the second network text data to obtain target network text data;
S220, inputting the target network text data into a geographic named entity recognition model for recognition, to obtain the geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is trained according to the above target geographic named entity recognition model training method.
After the target geographic named entity recognition model has been trained, the ubiquitous network text data to be recognized, namely the second network text data, can be acquired and cleaned to obtain the target network text data; the target network text data is then input into the pre-trained target geographic named entity recognition model, yielding the geographic named entity recognition result corresponding to the ubiquitous network text data to be recognized.
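The recognition pipeline of steps S210–S220 — cleaning raw web text, then decoding per-character labels back into entity strings — might be sketched as follows. The specific cleaning rules (stripping URLs and @-mentions) and the B-/I-/O tag names are illustrative assumptions; the actual model inference step is omitted.

```python
import re

def clean_text(raw):
    """Minimal cleaning sketch for social-media ('ubiquitous network')
    text: strip URLs, @-mentions, and surplus whitespace (assumed rules)."""
    raw = re.sub(r"https?://\S+", "", raw)
    raw = re.sub(r"@\S+", "", raw)
    return re.sub(r"\s+", " ", raw).strip()

def decode_bio(chars, tags):
    """Turn per-character B-/I-/O tags produced by the model
    back into entity strings."""
    entities, current = [], []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                entities.append("".join(current))
            current = [ch]
        elif tag.startswith("I-") and current:  # entity continues
            current.append(ch)
        else:                              # O tag ends any open entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities
```

In use, `clean_text` would implement S210, the model would tag each character of the cleaned text, and `decode_bio` would convert those tags into the recognition result of S220.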
This embodiment uses deep transfer learning, so that the geographic named entity recognition and extraction model need not be trained from scratch; instead, incremental pre-training is performed on the basis of the original geographic named entity semantic model. This greatly saves time and copes with the diversity, large volume, and noise of social media data, making the method particularly suitable for recognition and extraction from social media data.
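The incremental pre-training builds on the confusion-word based full-word dynamic masking described elsewhere in this publication. A toy sketch of one such masking step is given below; the `confusion_map` (entity to confusable variants) and the `p_confuse` probability are hypothetical parameters introduced only for illustration.

```python
import random

MASK = "[MASK]"

def mask_entity(chars, entity_spans, confusion_map, rng, p_confuse=0.1):
    """Full-word dynamic masking sketch: all characters of one chosen
    geographic entity are masked together; with probability `p_confuse`
    the span is replaced by a confusable word instead of [MASK].
    `confusion_map` and `p_confuse` are assumptions for this sketch."""
    start, end = rng.choice(entity_spans)      # pick one entity span
    entity = "".join(chars[start:end])
    out = chars[:]                             # do not mutate the input
    if entity in confusion_map and rng.random() < p_confuse:
        replacement = list(rng.choice(confusion_map[entity]))
    else:
        replacement = [MASK] * (end - start)   # mask the whole word
    out[start:end] = replacement
    return out
```

Calling this with a fresh random source per epoch yields a different masking of the same input each time, which is the "dynamic" aspect of the strategy.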
This embodiment also provides a geographic named entity recognition model training device, comprising an increment module, a labeling module, and a construction module, wherein:
the increment module is used for acquiring first network text data and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model;
the labeling module is used for labeling the first network text data and constructing a geographic named entity recognition data set according to the labeling results;
the construction module is used for constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and fine-tuning the initial geographic named entity recognition model with the geographic named entity recognition data set to obtain a target geographic named entity recognition model.
This embodiment also provides a geographic named entity recognition device, comprising an acquisition module and a recognition module, wherein:
the acquisition module is used for acquiring second network text data and cleaning the second network text data to obtain target network text data;
the recognition module is used for inputting the target network text data into a geographic named entity recognition model to obtain a geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is trained by the above target geographic named entity recognition model training device.
The foregoing is merely a description of specific embodiments of the present invention, and the scope of the present invention is not limited thereto; any change or substitution made within the technical scope of the present invention shall be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A geographic named entity recognition model training method, characterized by comprising the following steps:
acquiring first network text data, and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model;
labeling the first network text data, and constructing a geographic named entity recognition data set according to the labeling results;
and constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and fine-tuning the initial geographic named entity recognition model with the geographic named entity recognition data set to obtain a target geographic named entity recognition model.
2. The method according to claim 1, wherein the initial geographic named entity semantic model comprises a semantic feature extraction module based on a multi-head self-attention mechanism and a confusion word correction module, wherein the semantic feature extraction module based on the multi-head self-attention mechanism comprises a multi-head self-attention layer, a residual network, and a fully connected feedforward neural network, with Leaky ReLU as the activation function, and the confusion word correction module sets a full-word dynamic masking strategy based on confusion word replacement, built on the masked language model of the BERT pre-trained language model.
3. The method according to claim 2, wherein the residual network consists of several residual units, a single residual unit being denoted as:
self-attention_l = self-attention_{l-1} + F(self-attention_{l-1})
wherein self-attention_l and self-attention_{l-1} denote the outputs of the l-th and (l-1)-th multi-head self-attention layers, respectively, and F denotes the processing function of multi-head self-attention.
4. The method according to claim 2, wherein the fully connected feedforward neural network is connected after the output of the multi-head self-attention layer, and the semantic feature extraction module based on the multi-head self-attention mechanism further performs layer normalization on the output of each layer.
5. The method according to claim 2, wherein the full-word dynamic masking strategy based on confusion word replacement consists of a dynamic masking strategy, a full-word masking strategy, and a masking strategy based on confusion word replacement, wherein the dynamic masking strategy masks each model input in N different ways, the full-word masking strategy masks the complete geographic named entity, and the masking strategy based on confusion word replacement replaces word segmentation markers with confusion words.
6. The method according to claim 1, wherein labeling the first network text data and constructing a geographic named entity recognition data set according to the labeling results comprises:
marking the beginning character of each geographic named Entity in the first network text data as B-Entity and the middle characters as I-Entity, marking the remaining characters in the first network text data as O, and obtaining a first geographic named entity data set;
collecting a Chinese fine-grained named entity recognition data set, and labeling it according to the labeling method of the first network text data to obtain a second geographic named entity data set;
and fusing the first geographic named entity data set and the second geographic named entity data set to obtain the geographic named entity recognition data set.
7. The method according to claim 1, wherein constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model comprises:
adding a task-type neural network structure after the first geographic named entity semantic model to form the initial geographic named entity recognition model, wherein the first geographic named entity semantic model is the encoder of the initial geographic named entity recognition model, and the task-type neural network structure is the decoder of the initial geographic named entity recognition model.
8. A geographic named entity recognition method, characterized by comprising the following steps:
acquiring second network text data, and cleaning the second network text data to obtain target network text data;
inputting the target network text data into a geographic named entity recognition model to obtain a geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is trained according to the method of any one of claims 1 to 7.
9. A geographic named entity recognition model training device, characterized by comprising:
an increment module, used for acquiring first network text data and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model;
a labeling module, used for labeling the first network text data and constructing a geographic named entity recognition data set according to the labeling results;
and a construction module, used for constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and fine-tuning the initial geographic named entity recognition model with the geographic named entity recognition data set to obtain a target geographic named entity recognition model.
10. A geographic named entity recognition device, comprising:
the acquisition module is used for acquiring second network text data and cleaning the second network text data to obtain target network text data;
the recognition module is used for inputting the target network text data into a geographic named entity recognition model to obtain a geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is trained by the device of claim 9.
CN202310625300.8A 2023-05-30 2023-05-30 Geographic named entity recognition model training method and geographic named entity recognition method Pending CN116562296A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310625300.8A CN116562296A (en) 2023-05-30 2023-05-30 Geographic named entity recognition model training method and geographic named entity recognition method

Publications (1)

Publication Number Publication Date
CN116562296A true CN116562296A (en) 2023-08-08

Family

ID=87500032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310625300.8A Pending CN116562296A (en) 2023-05-30 2023-05-30 Geographic named entity recognition model training method and geographic named entity recognition method

Country Status (1)

Country Link
CN (1) CN116562296A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251650A (en) * 2023-11-20 2023-12-19 之江实验室 Geographic hotspot center identification method, device, computer equipment and storage medium
CN117251650B (en) * 2023-11-20 2024-02-06 之江实验室 Geographic hotspot center identification method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination