CN116562296A - Geographic named entity recognition model training method and geographic named entity recognition method - Google Patents


Info

Publication number
CN116562296A
Authority
CN
China
Prior art keywords
geographic
named entity
entity recognition
text data
model
Prior art date
Legal status
Pending
Application number
CN202310625300.8A
Other languages
Chinese (zh)
Inventor
徐流畅
夏天舒
张程锟
张嘉俊
姚俊伟
Current Assignee
Sinyada Technology Co ltd
Original Assignee
Sinyada Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Sinyada Technology Co ltd filed Critical Sinyada Technology Co ltd

Classifications

    • G06F40/295 — Named entity recognition (G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F40/151 — Transformation (G06F40/10 Text processing; G06F40/12 Use of codes for handling textual entities)
    • G06F40/30 — Semantic analysis
    • G06N3/045 — Combinations of networks (G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/0499 — Feedforward networks
    • G06N3/08 — Learning methods


Abstract

The application discloses a geographic named entity recognition model training method and a geographic named entity recognition method, relating to the technical field of information extraction. The geographic named entity recognition model training method comprises the following steps: acquiring first network text data, and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model; labeling the first network text data, and constructing a geographic named entity recognition dataset according to the labeling results; and constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and fine-tuning the initial geographic named entity recognition model using the geographic named entity recognition dataset to obtain a target geographic named entity recognition model. The method adopts deep transfer learning, so that the geographic named entity recognition model does not need to be trained from scratch; instead, incremental pre-training is performed on the basis of the original geographic named entity semantic model, which greatly saves time.

Description

Geographic named entity recognition model training method and geographic named entity recognition method
Technical Field
The application relates to the technical field of information extraction, in particular to a geographic named entity recognition model training method and a geographic named entity recognition method.
Background
Named entity recognition (NER) is a widely used technology in the field of natural language processing. It aims to recognize and extract named entities from text and classify them into predefined entity categories, where named entities generally refer to entities with specific names or identifiers, such as person names, place names, organization names, dates, times, and currencies.
Ubiquitous network text data refers to text data collected from the Internet. Named entity recognition methods for such data generally include geographic named entity recognition based on a geographic named entity database, geographic named entity recognition based on machine learning, and geographic named entity recognition based on deep neural networks. However, the database-based method presupposes a complete geographic named entity database and will miss geographic named entities absent from that database. The machine-learning-based method requires a large amount of labeled data, whose acquisition and processing take considerable labor and time; it is also very sensitive to the quality of feature extraction, and insufficient or unreasonable features will reduce task accuracy. Deep-neural-network-based methods likewise require a large amount of labeled data.
Disclosure of Invention
The geographic named entity recognition model training method and the geographic named entity recognition method of the present application aim to recognize geographic named entities in ubiquitous social media data and to address the diversity, large volume, and noise of such data during recognition.
In order to achieve the above purpose, the present application adopts the following technical scheme:
the geographic naming entity recognition model training method comprises the following steps:
acquiring first network text data, and performing incremental pre-training on an initial geographic naming entity semantic model according to the first network text data to obtain a first geographic naming entity semantic model;
labeling the first network text data, and constructing a geographic naming entity identification data set according to labeling results;
and constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and performing fine adjustment on the initial geographic named entity recognition model by utilizing the geographic named entity recognition data set to obtain a target geographic named entity recognition model.
Preferably, the initial geographic named entity semantic model comprises a semantic feature extraction module based on a multi-head self-attention mechanism and a confusion word correction module, wherein the semantic feature extraction module based on the multi-head self-attention mechanism comprises a multi-head self-attention layer, a residual network and a fully connected feedforward neural network, with LeakyReLU as the activation function, and the confusion word correction module applies a whole-word dynamic masking strategy based on confusion word replacement, built on the masked language model of the BERT pre-trained language model.
Preferably, the residual network consists of several residual units, a single residual unit being expressed as:
self-attention_l = self-attention_{l-1} + F(self-attention_{l-1})
wherein self-attention_l and self-attention_{l-1} represent the outputs of the l-th and (l-1)-th multi-head self-attention layers, respectively, and F represents the multi-head self-attention processing function.
Preferably, the fully connected feedforward neural network is connected after the output of the multi-head self-attention layer, and the semantic feature extraction module based on the multi-head self-attention mechanism further performs layer normalization on the output of each layer.
Preferably, the whole-word dynamic masking strategy based on confusion word replacement consists of a dynamic masking strategy, a whole-word masking strategy, and a masking strategy based on confusion word replacement, wherein the dynamic masking strategy masks each model input in N different ways, the whole-word masking strategy masks a complete geographic named entity, and the masking strategy based on confusion word replacement replaces mask tokens with confusion words.
Preferably, labeling the first network text data and constructing a geographic named entity recognition dataset according to the labeling results comprises:
marking the beginning character of each geographic named entity in the first network text data as B-Entity and the middle characters as I-Entity, and marking the remaining characters in the first network text data as O, to obtain a first geographic named entity dataset;
collecting a Chinese fine-grained named entity recognition dataset, and labeling it according to the labeling method used for the first network text data to obtain a second geographic named entity dataset;
and fusing the first geographic named entity dataset and the second geographic named entity dataset to obtain the geographic named entity recognition dataset.
Preferably, constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model comprises:
adding a task-specific neural network structure after the first geographic named entity semantic model to form the initial geographic named entity recognition model, wherein the first geographic named entity semantic model serves as the encoder of the initial geographic named entity recognition model and the task-specific neural network structure serves as its decoder.
A geographic named entity recognition method comprises the following steps:
acquiring second network text data, and cleaning the second network text data to obtain target network text data;
and inputting the target network text data into a geographic named entity recognition model for recognition, to obtain the geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is trained according to the above target geographic named entity recognition model training method.
A geographic named entity recognition model training device comprises:
an increment module for acquiring first network text data and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model;
a labeling module for labeling the first network text data and constructing a geographic named entity recognition dataset according to the labeling results;
and a construction module for constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model and fine-tuning the initial geographic named entity recognition model using the geographic named entity recognition dataset to obtain a target geographic named entity recognition model.
A geographic named entity recognition device comprises:
an acquisition module for acquiring second network text data and cleaning the second network text data to obtain target network text data;
and a recognition module for inputting the target network text data into a geographic named entity recognition model for recognition, to obtain the geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is obtained from the above target geographic named entity recognition model training device.
The invention has the following beneficial effects:
Compared with other traditional methods for recognizing and extracting geographic named entities, the method has a stronger capability of processing unstructured text, long text, multi-label classification and other complex text structures, and is particularly suitable for recognition and extraction on social media data. By adopting deep transfer learning, the geographic named entity recognition model does not need to be trained from scratch; instead, incremental pre-training is performed on the basis of the original geographic named entity semantic model, which greatly saves time and allows quick adaptation to a variety of data environments.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a neural network structure diagram of the semantic feature extraction module based on the multi-head self-attention mechanism provided herein;
FIG. 2 is an example neural network structure diagram of the confusion word correction module provided herein;
FIG. 3 is an overall structure diagram of the geographic named entity semantic model provided by the present application;
FIG. 4 is a neural network structure diagram of the geographic named entity recognition and extraction model provided by the present application;
FIG. 5 is the training loss curve generated by fine-tuning the initial geographic named entity recognition model using the geographic named entity recognition dataset of the present application;
FIG. 6 is a bar graph comparing the F1 values of the target geographic named entity recognition model and other geographic named entity recognition methods.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments herein without inventive effort fall within the scope of the present application.
The terms "first", "second" and the like in the claims and the description of the present application are used to distinguish similar objects and do not necessarily describe a particular sequence or chronological order; the terms so used may be interchanged where appropriate, and merely distinguish objects of the same nature when describing embodiments of the present application. Furthermore, the terms "comprise" and "have" and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus comprising a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
This embodiment provides a geographic named entity recognition model training method, comprising the following steps:
S110, acquiring first network text data, and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model;
S120, labeling the first network text data, and constructing a geographic named entity recognition dataset according to the labeling results;
S130, constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and fine-tuning the initial geographic named entity recognition model using the geographic named entity recognition dataset to obtain a target geographic named entity recognition model.
In this embodiment, the initial geographic named entity semantic model consists of a semantic feature extraction module based on a multi-head self-attention mechanism and a confusion word correction module. The semantic feature extraction module uses the general structure of a neural network language model as its design basis and is built around the characteristics of geographic named entity text data. To let the model learn the semantic features of geographic named entity text more fully, the implementation of the multi-head self-attention mechanism is packaged into a module, and these modules are stacked layer by layer to build a model architecture deep enough for geographic named entity semantic extraction. Meanwhile, to address the gradient vanishing and network degradation problems, this embodiment first introduces a residual network for the gradient vanishing problem. The residual network consists of several residual units; within a module containing multi-head self-attention, a single residual unit can be expressed as:
self-attention_l = self-attention_{l-1} + F(self-attention_{l-1})
wherein self-attention_l and self-attention_{l-1} denote the outputs of the l-th and (l-1)-th multi-head self-attention layers, respectively, and F denotes the multi-head self-attention processing function.
Meanwhile, the residual network in this embodiment superimposes the input information onto the output of each multi-head self-attention layer, so that self-attention semantics and text representation information are fused.
Next, to address the internal covariate shift problem in deep neural networks, this embodiment applies layer normalization (Layer Normalization, LN) to the superimposed output of each layer in the module. Layer normalization normalizes all neurons of one network layer so that the inputs of each layer remain stably distributed, which helps mitigate gradient vanishing, makes the training process more stable, and helps the model converge more quickly.
Meanwhile, to transform the dimension of the self-attention layer's output and add nonlinearity, this embodiment adds a fully connected feedforward neural network after the self-attention layer output, using LeakyReLU as the activation function. By allowing a small non-zero gradient for negative inputs, LeakyReLU prevents "dead neurons" and improves the learning ability of the neural network. Its mathematical expression is:
LeakyReLU(x) = x, if x ≥ 0; LeakyReLU(x) = a·x, if x < 0
In this embodiment, a has a value of 0.01.
In summary, the semantic feature extraction module based on the multi-head self-attention mechanism is obtained; its neural network structure is shown in FIG. 1. The modules are stacked layer by layer, and the design of residual connections, layer normalization and the fully connected feedforward neural network is introduced, so that the geographic named entity semantic model can better extract the semantic features of geographic named entities. Each multi-head self-attention module in the geographic named entity semantic model is defined as a Geographical Named Entity Transformer Module.
Considering that the above module splits a geographic named entity text sequence into individual characters at input time and does not consider the relationship between the geographic named entity elements in the sequence, this embodiment also designs a character encoding strategy that takes the boundary information of geographic named entity levels into account. It does not take words as the basic input unit but continues to use character-level vectorized representations, and injects level information by inserting a special tag [B] between geographic named entity levels; this tag tells the model that the preceding character is the end of one geographic named entity level and the following character is the beginning of another. Simply put, the geographic named entity text in the corpus is segmented with a word segmentation tool, and after segmentation, special tag symbols are inserted between the geographic named entity elements as separators. These tag symbols are treated as ordinary characters in the model and also form part of the input to the Geographical Named Entity Transformer Module.
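As a hedged illustration of this separator strategy, the sketch below inserts the [B] tag between segmented entity levels; the example place-name levels and the simple list-based interface are assumptions, since the text does not specify the segmentation tool's output format.

```python
def inject_level_tags(entity_levels, sep="[B]"):
    """Flatten geographic-entity levels into a character sequence,
    inserting the special [B] separator tag between adjacent levels.
    The tag is later treated as an ordinary input character."""
    chars = []
    for i, level in enumerate(entity_levels):
        if i > 0:
            chars.append(sep)  # boundary between two entity levels
        chars.extend(level)    # character-level representation
    return chars
```

For example, two hypothetical levels "山东省" and "济南市" become the character sequence 山 东 省 [B] 济 南 市.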
Because homophones, synonyms and the like exist among input geographic named entities, this embodiment also designs a confusion word correction module so that the model can correctly recognize the input geographic named entity and automatically correct its output. As shown in FIG. 2, the neural network structure of the confusion word correction module is designed as follows:
(1) To convert the output of the semantic feature extraction module based on multi-head self-attention into the context-dependent "correction word embedding" of each character, in preparation for computing the probability of each character of the word to be predicted against the Lookup Table in the subsequent step, the output of the semantic feature extraction module is fed into a fully connected neural network layer with a LeakyReLU activation function for a nonlinear transformation:
prob_embedding = s(W × SA + b) (1)
In Equation 1, SA represents the output of the semantic feature extraction module, s represents the LeakyReLU activation function, and W and b represent the weight matrix and bias of the fully connected neural network layer.
(2) A fully connected layer is established, and the correction word embedding is linearly transformed to obtain probability distribution scores:
logits = C^T × prob_embedding + b (2)
In Equation 2, C represents the Lookup Table.
(3) Finally, the result is fed into a softmax function to obtain, for each character of the word to be predicted, the conditional probability distribution over the words of the dictionary:
prob = softmax(logits) (3)
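Equations (1)–(3) can be sketched as a minimal NumPy forward pass; the dimensions and random parameters below are illustrative assumptions, and `C` plays the role of the Lookup Table.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def correction_head(SA, W, b1, C, b2):
    # Eq. (1): correction word embedding via a fully connected layer + LeakyReLU
    prob_embedding = leaky_relu(SA @ W.T + b1)
    # Eq. (2): probability-distribution scores against the Lookup Table C
    logits = prob_embedding @ C + b2
    # Eq. (3): conditional probability distribution over the dictionary
    return softmax(logits)
```

Here `SA` is (sequence_length, hidden_dim), `W` is (hidden_dim, hidden_dim), and `C` is (hidden_dim, vocab_size); each output row is a probability distribution summing to 1.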
meanwhile, in order to enable the model to complete the aim of correcting the confusion word based on the known context, the embodiment also refers to the core idea of masking language model tasks in a classical BERT pre-training language model, and designs a full-word dynamic masking strategy based on confusion word replacement aiming at the characteristics of geographic naming entity text data, and the random masking strategy used in BERT mainly refers to the idea of completely filling, namely, certain characters in sentences are randomly shielded with a certain probability and replaced by [ MASK ] marks, and the special marks can be regarded as 'spaces' in completely filling, but have limitations in the scene of geographic naming entity text data, so the embodiment designs the following masking strategy:
(1) Dynamic masking strategy: to avoid the model seeing the same masking in every round of training, this embodiment replicates the training dataset N times so that each input text sequence is masked in N different ways during training. Therefore, each masking pattern of an input text sequence is used the following number of times:
epoch/N
wherein epoch is the number of training rounds.
(2) Whole-word masking: to address the problems caused by the character-level masking strategy in the classical BERT model, this embodiment adopts a whole-word masking strategy: if a character to be masked belongs to a complete geographic named entity element, the whole geographic named entity element is masked.
(3) Masking strategy based on confusion word replacement: to address the inconsistency between pre-training and fine-tuning caused by BERT's [MASK] strategy, this embodiment replaces the [MASK] tag with confusion words.
In pre-training, synonyms and homophones are used as confusion words: the synonyms are generated with the Chinese synonym generation toolkit synonyms, whose similarity is computed based on word2vec, and the homophones are generated with the data augmentation toolkit JionLP.
Comprehensively considering the accuracy and efficiency of model training, this embodiment selects 15% of the words in the input text for masking; of these masked words, 40% are replaced by synonyms, 40% by homophones, 10% by random words, and 10% remain unchanged.
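A hedged sketch of this selection scheme is shown below; the actual synonym/homophone generation via the synonyms and JionLP toolkits is abstracted into action labels, and the simple word-list interface is an assumption.

```python
import random

def choose_mask_actions(words, mask_ratio=0.15, seed=0):
    """Pick ~15% of the words and assign each a replacement action with
    the stated proportions: 40% synonym, 40% homophone, 10% random
    word, 10% unchanged. Returns {word_index: action}."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(words) * mask_ratio))
    picked = rng.sample(range(len(words)), n_mask)
    actions = {}
    for i in picked:
        r = rng.random()
        if r < 0.40:
            actions[i] = "synonym"
        elif r < 0.80:
            actions[i] = "homophone"
        elif r < 0.90:
            actions[i] = "random"
        else:
            actions[i] = "unchanged"
    return actions
```

Running this with different seeds per dataset copy realizes the dynamic (per-copy) masking described in strategy (1).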
The confusion word correction module obtained above is placed immediately after the semantic feature extraction module based on the multi-head self-attention mechanism, yielding the initial geographic named entity semantic model, whose neural network structure is shown in FIG. 3. Specifically, given a word, this embodiment takes the semantic representation output by the stacked Geographical Named Entity Transformer Modules of the initial geographic named entity semantic model and performs confusion word correction on it. The initial parameters of the initial geographic named entity semantic model are the parameters of the BERT model, which greatly reduces the time and computational cost of model pre-training and lets the model adapt to specific tasks in the geographic domain on the basis of general semantic representation capability.
Then, first ubiquitous network text data is collected. This embodiment uses 3,530,611 pieces of geographic named entity text data from 2022 located in Jinan, Shandong, obtained after data cleaning; all of the data is input into the initial geographic named entity semantic model for incremental pre-training, thereby obtaining the first geographic named entity semantic model.
Next, a geographic named entity recognition dataset is constructed for the geographic named entity recognition task. First, the first ubiquitous network text data is labeled: the first (beginning) character of each geographic named entity contained in the data is labeled "B-Entity", the middle characters are labeled "I-Entity", and the remaining characters in the text data are labeled "O". The "O" label is mapped to 0, the "B-Entity" label to 1, and the "I-Entity" label to 2, thereby obtaining the first geographic named entity dataset, as shown in Table 1.
Table 1: Data sample from the first geographic named entity recognition dataset
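The BIO labeling and label mapping just described can be sketched as follows; the example sentence and the character-span interface are illustrative assumptions.

```python
def bio_labels(text, entity_spans):
    """Label each character: 'B-Entity' for the first character of a
    geographic named entity, 'I-Entity' for its remaining characters,
    'O' for everything else; then map O->0, B-Entity->1, I-Entity->2.
    `entity_spans` are (start, end) character offsets, end exclusive."""
    tags = ["O"] * len(text)
    for start, end in entity_spans:
        tags[start] = "B-Entity"
        for i in range(start + 1, end):
            tags[i] = "I-Entity"
    mapping = {"O": 0, "B-Entity": 1, "I-Entity": 2}
    return tags, [mapping[t] for t in tags]
```

For a hypothetical sentence "我在济南市" with the entity span (2, 5), this yields the tags O O B-Entity I-Entity I-Entity and the ids 0 0 1 2 2.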
Meanwhile, the CLUENER2020 dataset is acquired; CLUENER2020 is a Chinese fine-grained named entity recognition dataset built from data collected from network news. The Chinese fine-grained named entity recognition dataset is labeled with the above labeling method to obtain the second geographic named entity recognition dataset, and the first and second geographic named entity recognition datasets are fused to obtain the geographic named entity recognition dataset.
In this embodiment, the geographic named entity recognition task is converted into a sequence labeling task on text: each character in a given ubiquitous network text is labeled to determine whether it belongs to a geographic named entity. As shown in FIG. 4, the initial geographic named entity recognition model of this embodiment adds a task-specific neural network structure on top of the first geographic named entity semantic model; the first geographic named entity semantic model is called its encoder, and the task-specific neural network structure is called its decoder. To make the model better fit the downstream task, the initial geographic named entity recognition model is fine-tuned with the geographic named entity recognition dataset to obtain the target geographic named entity recognition model. An early stopping strategy is used to prevent overfitting: while fine-tuning on the training set, the model's performance on a validation set is monitored in every round of training. For the data split, 80% of the data is randomly selected as the training set, 10% as the validation set, and the remaining 10% as the test set. The resulting loss curve is shown in FIG. 5.
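The 80/10/10 split and the early stopping monitor described above can be sketched as follows; the patience value and the list-based interfaces are assumptions, since the text names the strategy without giving parameters.

```python
import random

def split_dataset(samples, seed=42):
    """Randomly split samples 80/10/10 into train/validation/test sets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_train = int(len(samples) * 0.8)
    n_val = int(len(samples) * 0.1)
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test

def should_stop(val_losses, patience=3):
    """Early stopping: stop once the validation loss has not improved
    for `patience` consecutive rounds (patience is an assumed value)."""
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience
```

In a fine-tuning loop, `should_stop` would be checked after each epoch's validation pass, keeping the parameters from the best-scoring epoch.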
This embodiment adopts the pre-training-then-fine-tuning paradigm, so the model's training process saves substantial time, and the scheme can quickly adapt to a variety of data environments.
Since the F1 value takes model performance into account comprehensively and is a relatively reliable indicator, this embodiment compares, in the form of a bar graph, the F1 values of FCNN+CRF, RNN+CRF, BiLSTM+CRF, BERT+CRF, BERT+Softmax, BERT+BiLSTM+CRF and the target geographic named entity recognition model of this embodiment. As shown in FIG. 6, the superiority of the target geographic named entity recognition model of this embodiment can be verified clearly and intuitively.
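As a reminder of the metric used for this comparison, below is a minimal entity-level F1 sketch; the span-set interface is an assumption, since the text does not state how predicted and gold entities are matched.

```python
def f1_score(pred_spans, gold_spans):
    """Entity-level F1: harmonic mean of precision and recall over the
    sets of predicted and gold entity spans."""
    pred, gold = set(pred_spans), set(gold_spans)
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Because F1 combines precision and recall, a model cannot score well by over-predicting or under-predicting entities alone, which is why the comparison in FIG. 6 uses it.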
This embodiment also provides a geographic named entity recognition method, comprising the following steps:
S210, acquiring second network text data, and cleaning the second network text data to obtain target network text data;
S220, inputting the target network text data into a geographic named entity recognition model for recognition, to obtain the geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is trained according to the above target geographic named entity recognition model training method.
After the target geographic named entity recognition model has been trained, the ubiquitous network text data to be recognized, namely the second network text data, can be acquired and cleaned to obtain the target network text data; the target network text data is then input into the pre-trained target geographic named entity recognition model, yielding the geographic named entity recognition result corresponding to the ubiquitous network text data to be recognized.
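The recognition pipeline of steps S210–S220 — cleaning raw web text, then decoding per-character labels back into entity strings — might be sketched as follows. The specific cleaning rules (stripping URLs and @-mentions) and the B-/I-/O tag names are illustrative assumptions; the actual model inference step is omitted.

```python
import re

def clean_text(raw):
    """Minimal cleaning sketch for social-media ('ubiquitous network')
    text: strip URLs, @-mentions, and surplus whitespace (assumed rules)."""
    raw = re.sub(r"https?://\S+", "", raw)
    raw = re.sub(r"@\S+", "", raw)
    return re.sub(r"\s+", " ", raw).strip()

def decode_bio(chars, tags):
    """Turn per-character B-/I-/O tags produced by the model
    back into entity strings."""
    entities, current = [], []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                entities.append("".join(current))
            current = [ch]
        elif tag.startswith("I-") and current:  # entity continues
            current.append(ch)
        else:                              # O tag ends any open entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities
```

In use, `clean_text` would implement S210, the model would tag each character of the cleaned text, and `decode_bio` would convert those tags into the recognition result of S220.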
This embodiment uses deep transfer learning, so that the geographic named entity recognition and extraction model need not be trained from scratch; instead, incremental pre-training is performed on the basis of the original geographic named entity semantic model. This greatly saves time and copes with the diversity, large volume, and noise of social media data, making the method particularly suitable for recognition and extraction from social media data.
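The incremental pre-training builds on the confusion-word based full-word dynamic masking described elsewhere in this publication. A toy sketch of one such masking step is given below; the `confusion_map` (entity to confusable variants) and the `p_confuse` probability are hypothetical parameters introduced only for illustration.

```python
import random

MASK = "[MASK]"

def mask_entity(chars, entity_spans, confusion_map, rng, p_confuse=0.1):
    """Full-word dynamic masking sketch: all characters of one chosen
    geographic entity are masked together; with probability `p_confuse`
    the span is replaced by a confusable word instead of [MASK].
    `confusion_map` and `p_confuse` are assumptions for this sketch."""
    start, end = rng.choice(entity_spans)      # pick one entity span
    entity = "".join(chars[start:end])
    out = chars[:]                             # do not mutate the input
    if entity in confusion_map and rng.random() < p_confuse:
        replacement = list(rng.choice(confusion_map[entity]))
    else:
        replacement = [MASK] * (end - start)   # mask the whole word
    out[start:end] = replacement
    return out
```

Calling this with a fresh random source per epoch yields a different masking of the same input each time, which is the "dynamic" aspect of the strategy.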
This embodiment also provides a geographic named entity recognition model training device, comprising an increment module, a labeling module, and a construction module, wherein:
the increment module is used for acquiring first network text data and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model;
the labeling module is used for labeling the first network text data and constructing a geographic named entity recognition data set according to the labeling results;
the construction module is used for constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and fine-tuning the initial geographic named entity recognition model with the geographic named entity recognition data set to obtain a target geographic named entity recognition model.
This embodiment also provides a geographic named entity recognition device, comprising an acquisition module and a recognition module, wherein:
the acquisition module is used for acquiring second network text data and cleaning the second network text data to obtain target network text data;
the recognition module is used for inputting the target network text data into a geographic named entity recognition model to obtain a geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is trained by the above target geographic named entity recognition model training device.
The foregoing is merely a description of specific embodiments of the present invention, and the scope of the present invention is not limited thereto; any change or substitution made within the technical scope of the present invention shall be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A geographic named entity recognition model training method, characterized by comprising the following steps:
acquiring first network text data, and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model;
labeling the first network text data, and constructing a geographic named entity recognition data set according to the labeling results;
and constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and fine-tuning the initial geographic named entity recognition model with the geographic named entity recognition data set to obtain a target geographic named entity recognition model.
2. The method according to claim 1, wherein the initial geographic named entity semantic model comprises a semantic feature extraction module based on a multi-head self-attention mechanism and a confusion word correction module, wherein the semantic feature extraction module based on the multi-head self-attention mechanism comprises a multi-head self-attention layer, a residual network, and a fully connected feedforward neural network, with Leaky ReLU as the activation function, and the confusion word correction module sets a full-word dynamic masking strategy based on confusion word replacement, built on the masked language model of the BERT pre-trained language model.
3. The method according to claim 2, wherein the residual network consists of several residual units, a single residual unit being denoted as:
self-attention_l = self-attention_{l-1} + F(self-attention_{l-1})
wherein self-attention_l and self-attention_{l-1} denote the outputs of the l-th and (l-1)-th multi-head self-attention layers, respectively, and F denotes the processing function of multi-head self-attention.
4. The method according to claim 2, wherein the fully connected feedforward neural network is connected after the output of the multi-head self-attention layer, and the semantic feature extraction module based on the multi-head self-attention mechanism further performs layer normalization on the output of each layer.
5. The method according to claim 2, wherein the full-word dynamic masking strategy based on confusion word replacement consists of a dynamic masking strategy, a full-word masking strategy, and a masking strategy based on confusion word replacement, wherein the dynamic masking strategy masks each model input in N different ways, the full-word masking strategy masks the complete geographic named entity, and the masking strategy based on confusion word replacement replaces word segmentation markers with confusion words.
6. The method according to claim 1, wherein labeling the first network text data and constructing a geographic named entity recognition data set according to the labeling results comprises:
marking the beginning character of each geographic named Entity in the first network text data as B-Entity and the middle characters as I-Entity, marking the remaining characters in the first network text data as O, and obtaining a first geographic named entity data set;
collecting a Chinese fine-grained named entity recognition data set, and labeling it according to the labeling method of the first network text data to obtain a second geographic named entity data set;
and fusing the first geographic named entity data set and the second geographic named entity data set to obtain the geographic named entity recognition data set.
7. The method according to claim 1, wherein constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model comprises:
adding a task-type neural network structure after the first geographic named entity semantic model to form the initial geographic named entity recognition model, wherein the first geographic named entity semantic model is the encoder of the initial geographic named entity recognition model, and the task-type neural network structure is the decoder of the initial geographic named entity recognition model.
8. A geographic named entity recognition method, characterized by comprising the following steps:
acquiring second network text data, and cleaning the second network text data to obtain target network text data;
inputting the target network text data into a geographic named entity recognition model to obtain a geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is trained according to the method of any one of claims 1 to 7.
9. A geographic named entity recognition model training device, characterized by comprising:
an increment module, used for acquiring first network text data and performing incremental pre-training on an initial geographic named entity semantic model according to the first network text data to obtain a first geographic named entity semantic model;
a labeling module, used for labeling the first network text data and constructing a geographic named entity recognition data set according to the labeling results;
and a construction module, used for constructing an initial geographic named entity recognition model according to the first geographic named entity semantic model, and fine-tuning the initial geographic named entity recognition model with the geographic named entity recognition data set to obtain a target geographic named entity recognition model.
10. A geographic named entity recognition device, comprising:
the acquisition module is used for acquiring second network text data and cleaning the second network text data to obtain target network text data;
the recognition module is used for inputting the target network text data into a geographic named entity recognition model to obtain a geographic named entity recognition result corresponding to the second network text data, wherein the geographic named entity recognition model is trained by the device of claim 9.
CN202310625300.8A 2023-05-30 2023-05-30 Geographic named entity recognition model training method and geographic named entity recognition method Pending CN116562296A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310625300.8A CN116562296A (en) 2023-05-30 2023-05-30 Geographic named entity recognition model training method and geographic named entity recognition method

Publications (1)

Publication Number Publication Date
CN116562296A true CN116562296A (en) 2023-08-08

Family

ID=87500032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310625300.8A Pending CN116562296A (en) 2023-05-30 2023-05-30 Geographic named entity recognition model training method and geographic named entity recognition method

Country Status (1)

Country Link
CN (1) CN116562296A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251650A (en) * 2023-11-20 2023-12-19 之江实验室 Geographic hotspot center identification method, device, computer equipment and storage medium
CN117251650B (en) * 2023-11-20 2024-02-06 之江实验室 Geographic hotspot center identification method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination