CN116167378A - Named entity recognition method and system based on adversarial transfer learning - Google Patents


Info

Publication number
CN116167378A
Authority
CN
China
Prior art keywords
named entity
word segmentation
chinese word
data set
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310133155.1A
Other languages
Chinese (zh)
Inventor
程良伦
朱志鸿
张伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Nengge Knowledge Technology Co ltd
Guangdong University of Technology
Original Assignee
Guangdong Nengge Knowledge Technology Co ltd
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Nengge Knowledge Technology Co ltd, Guangdong University of Technology filed Critical Guangdong Nengge Knowledge Technology Co ltd
Priority to CN202310133155.1A priority Critical patent/CN116167378A/en
Publication of CN116167378A publication Critical patent/CN116167378A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a named entity recognition method and system based on adversarial transfer learning. The method comprises: constructing a training dataset; inputting the training dataset into a preprocessing model for encoding to obtain sentence vector representations; inputting the sentence vector representations into a bidirectional long short-term memory network for feature extraction to obtain a feature set; inputting the feature set into a self-attention layer for dependency analysis to obtain a named entity task representation and a Chinese word segmentation task representation; inputting the named entity task representation and the Chinese word segmentation task representation into a conditional random field layer for decoding to obtain a named entity sequence tag and a Chinese word segmentation sequence tag; and performing adversarial training on the untrained named entity recognition model according to the named entity sequence tag and the Chinese word segmentation sequence tag in combination with a task classifier. The embodiment of the invention can improve the accuracy of named entity recognition and can be widely applied to the technical field of artificial intelligence.

Description

Named entity recognition method and system based on adversarial transfer learning
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a named entity recognition method and system based on adversarial transfer learning.
Background
In the big data age, people can acquire ever more data, which raises the cost of extracting information and acquiring knowledge from it. Relying entirely on manual processing would be prohibitively cumbersome and complex, so automated processing of massive amounts of data calls for natural language processing techniques, among which Named Entity Recognition (NER) is a preliminary and important task in the field of Natural Language Processing (NLP). Named entity recognition with deep learning typically requires large-scale annotated data, while a large number of existing datasets are available for the Chinese Word Segmentation (CWS) task. The related art focuses only on the task-shared information between named entity recognition and Chinese word segmentation and does not filter out the private information of each task; this introduces noise into the named entity recognition method and degrades its accuracy. In view of the foregoing, the technical problems in the related art need to be solved.
Disclosure of Invention
In view of this, embodiments of the invention provide a named entity recognition method and system based on adversarial transfer learning that can improve the accuracy of named entity recognition.
In one aspect, the invention provides a named entity recognition method based on adversarial transfer learning, comprising the following steps:
constructing a training data set, wherein the training data set comprises a named entity data set and a Chinese word segmentation data set;
inputting the training data set into a preprocessing model for coding processing to obtain sentence vector representation;
inputting the sentence vector representation into a bidirectional long short-term memory network for feature extraction to obtain a feature set, wherein the feature set at least comprises named entity private features, shared features and Chinese word segmentation private features;
inputting the feature set into a self-attention layer for dependency analysis to obtain a named entity task representation and a Chinese word segmentation task representation;
inputting the named entity task representation and the Chinese word segmentation task representation into a conditional random field layer for decoding processing to obtain a named entity sequence tag and a Chinese word segmentation sequence tag;
performing adversarial training on the untrained named entity recognition model according to the named entity sequence tag and the Chinese word segmentation sequence tag in combination with a task classifier, to obtain a trained named entity recognition model;
and acquiring a named entity to be identified, inputting the named entity to be identified into the trained named entity identification model for named entity identification processing, and obtaining a named entity identification result.
Optionally, the constructing a training dataset, the training dataset comprising a named entity dataset and a Chinese word segmentation dataset, includes:
performing data crawling processing on the data website to obtain an original data set;
labeling the original data set to obtain a named entity data set;
and selecting a data set in a field different from the named entity data set from the general data set to obtain the Chinese word segmentation data set.
Optionally, the inputting the training data set into a preprocessing model for coding processing to obtain sentence vector representation includes:
sentence embedding processing is carried out on the training data set to obtain an embedded vector;
and encoding the embedded vector according to the preprocessing model to obtain sentence vector representation.
Optionally, the inputting the sentence vector representation into a bidirectional long short-term memory network for feature extraction to obtain a feature set, wherein the feature set at least comprises named entity private features, shared features and Chinese word segmentation private features, includes:
the bidirectional long-short-term memory network comprises a named entity private network layer, a Chinese word segmentation private network layer and a shared network layer;
carrying out named entity recognition task feature extraction processing on the sentence vector representation through the named entity private network layer to obtain named entity private features;
carrying out shared information feature extraction on the sentence vector representation through the shared network layer to obtain shared features;
and extracting the Chinese word segmentation task characteristics from the sentence vector representation through the Chinese word segmentation private network layer to obtain Chinese word segmentation private characteristics.
Optionally, the inputting the private feature set into the self-attention layer for dependency analysis to obtain a named entity task representation and a Chinese word segmentation task representation includes:
the self-attention layer comprises a named entity private attention layer, a Chinese word segmentation private attention layer and a shared attention layer;
the self-attention layer learns the dependency relationship among characters of the private feature set, and extracts internal structure information to obtain named entity vector representation, chinese word segmentation vector representation and shared vector representation;
and respectively connecting the shared vector representation with the named entity vector representation and the Chinese word segmentation vector representation to obtain a named entity task representation and a Chinese word segmentation task representation.
Optionally, the step of inputting the named entity task representation and the Chinese word segmentation task representation into a conditional random field layer for decoding to obtain a named entity sequence tag and a Chinese word segmentation sequence tag includes:
the conditional random field layer comprises a named entity conditional random field and a Chinese word segmentation conditional random field;
carrying out label marking processing on the named entity task representation through the named entity conditional random field to obtain a named entity sequence label;
and carrying out label marking processing on the Chinese word segmentation task representation through the Chinese word segmentation conditional random field to obtain a Chinese word segmentation sequence label.
Optionally, the sentence embedding processing is performed on the training data set to obtain an embedded vector, including:
converting each character in the training data set to obtain character vector representation;
marking the character vector representation to obtain word embedded vector representation;
carrying out semantic classification processing on the character vector representation to obtain a segment embedded vector representation;
performing marking position coding processing on the character vector representation to obtain a position embedded vector representation;
and carrying out vector summation on the word embedding vector representation, the segment embedding vector representation and the position embedding vector representation to obtain an embedding vector.
On the other hand, an embodiment of the invention also provides a named entity recognition system based on adversarial transfer learning, comprising:
a first module for constructing a training data set, the training data set comprising a named entity data set and a Chinese word segmentation data set;
the second module is used for inputting the training data set into a preprocessing model for coding processing to obtain sentence vector representation;
the third module is used for inputting the sentence vector representation into a bidirectional long short-term memory network for feature extraction to obtain a feature set, wherein the feature set at least comprises named entity private features, shared features and Chinese word segmentation private features;
the fourth module is used for inputting the private feature set into the self-attention layer for dependency analysis and processing to obtain a named entity task representation and a Chinese word segmentation task representation;
the fifth module is used for inputting the named entity task representation and the Chinese word segmentation task representation into a conditional random field layer for decoding processing to obtain a named entity sequence tag and a Chinese word segmentation sequence tag;
the sixth module is used for performing adversarial training on the untrained named entity recognition model according to the named entity sequence tag and the Chinese word segmentation sequence tag in combination with the task classifier, to obtain a trained named entity recognition model;
and a seventh module, configured to obtain a named entity to be identified, input the named entity to be identified into the trained named entity identification model, and perform named entity identification processing to obtain a named entity identification result.
Optionally, the first module is configured to construct a training dataset, the training dataset including a named entity dataset and a Chinese word segmentation dataset, and includes:
the first unit is used for carrying out data crawling processing on the data website to obtain an original data set;
the second unit is used for carrying out labeling processing on the original data set to obtain a named entity data set;
and the third unit is used for selecting a data set in the field different from the named entity data set from the general data set to obtain the Chinese word segmentation data set.
Optionally, the second module is configured to input the training data set into a preprocessing model for encoding processing, to obtain sentence vector representation, and includes:
a third unit, configured to perform sentence embedding processing on the training data set to obtain an embedded vector;
and a fourth unit, configured to encode the embedded vector according to the preprocessing model, to obtain a sentence vector representation.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects: according to the embodiment of the invention, feature analysis is performed on the named entity dataset and the Chinese word segmentation dataset to obtain the named entity sequence tag and the Chinese word segmentation sequence tag; adversarial training is performed on the untrained named entity recognition model in combination with the task classifier; and named entity recognition is performed by the trained named entity recognition model. In this way, word boundary information shared by the two different tasks can be introduced through adversarial training while noise from the private information of the Chinese word segmentation task is kept out, improving the accuracy of named entity recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a named entity recognition method based on adversarial transfer learning according to an embodiment of the present application;
FIG. 2 is a diagram of a named entity recognition model architecture based on adversarial transfer learning according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an embedded layer model according to an embodiment of the present application;
fig. 4 is a schematic diagram of a preprocessing model according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
First, several nouns referred to in this application are parsed:
named Entity Recognition (NER), a preliminary and important task in the field of Natural Language Processing (NLP), can be used for many downstream NLP tasks, such as relationship extraction, event extraction, question-answering, etc., and its effect directly affects the subsequent relationship extraction and event extraction tasks. Named entity recognition is mainly applied to extracting entities with specific meanings, such as characters, organizations, places, etc., from unstructured text.
Chinese Word Segmentation (CWS) tasks are tasks that identify word boundaries, and Chinese NER tasks share much similarity with CWS tasks, referred to as task sharing information. Whereas Chinese NER typically has coarser granularity boundaries than CWS, the differences between such tasks are referred to as task private information.
A bidirectional long short-term memory network (Bi-directional Long Short-Term Memory, BiLSTM) is formed by combining a forward LSTM network with a backward LSTM network; the LSTM is itself a variant of the recurrent neural network (RNN). Proposed on the basis of the unidirectional LSTM network, the bidirectional network considers both preceding and following context simultaneously, which effectively helps ensure the accuracy of sequence prediction results.
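As an illustrative sketch of the bidirectional reading order just described (a plain tanh RNN cell stands in for the LSTM cell purely for brevity; the weights and inputs are toy assumptions, not values from the patent):

```python
import math

# Toy bidirectional recurrence: read the sequence left-to-right and
# right-to-left with a simple tanh RNN cell (a stand-in for the LSTM
# cell), then concatenate both hidden states at each position, as the
# BiLSTM described above does.
def rnn_cell(x, h, w_x=0.5, w_h=0.3):
    return math.tanh(w_x * x + w_h * h)

def bidirectional(xs):
    fwd, h = [], 0.0
    for x in xs:                       # forward pass over x_1 .. x_N
        h = rnn_cell(x, h)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):             # backward pass over x_N .. x_1
        h = rnn_cell(x, h)
        bwd.append(h)
    bwd.reverse()
    # h_i = forward_i (+) backward_i, i.e. per-position concatenation
    return list(zip(fwd, bwd))

states = bidirectional([1.0, -0.5, 2.0])
```

Each position thus carries context from both directions, which is what lets boundary information from either side of a character inform its tag.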
In the related art, in order to integrate word boundary information from the CWS task into the NER task, a joint model has been proposed to perform the Chinese NER and CWS tasks together. However, that model focuses only on the task-shared information between Chinese NER and CWS and does not filter the private information of each task, which creates noise for both tasks. For example, the CWS task splits "Lingshan Island dock" into "Lingshan Island" and "dock", while the NER task treats "Lingshan Island dock" as a whole. Therefore, how to exploit the task-shared information while preventing the NER task from being negatively affected by the CWS task is of important research significance.
In view of the above, referring to fig. 1, an embodiment of the present invention provides a named entity recognition method based on adversarial transfer learning, including:
s101, constructing a training data set, wherein the training data set comprises a named entity data set and a Chinese word segmentation data set;
s102, inputting the training data set into a preprocessing model for coding processing to obtain sentence vector representation;
s103, inputting the sentence vector representation into a two-way long short-time memory network for feature extraction processing to obtain a feature set, wherein the feature set at least comprises named entity private features, shared features and Chinese word segmentation private features;
s104, inputting the private feature set into a self-attention layer for dependency analysis and processing to obtain a named entity task representation and a Chinese word segmentation task representation;
s105, inputting the named entity task representation and the Chinese word segmentation task representation into a conditional random field layer for decoding processing to obtain a named entity sequence tag and a Chinese word segmentation sequence tag;
s106, performing countermeasure learning training treatment on the untrained named entity recognition model according to the named entity sequence tag and the Chinese word segmentation sequence tag in combination with a task classifier to obtain a trained named entity recognition model;
s107, acquiring a named entity to be identified, inputting the named entity to be identified into the trained named entity identification model for named entity identification processing, and obtaining a named entity identification result.
Referring to fig. 2, in an embodiment of the present invention, a named entity (NER) dataset and a Chinese word segmentation (CWS) dataset are first constructed, and the resulting training dataset is encoded by a preprocessing model (BERT): text sentences from the two datasets are input into an embedding layer and mapped to vector representations by BERT acting as a sequence encoder, where BERT is shared by the three subtasks. It should be noted that the named entity dataset constructed in the embodiment of the present invention is a Chinese named entity dataset. As a pre-trained language model, BERT further improves the generalization capability of the word embedding model and fully expresses character-level, word-level, sentence-level, and even inter-sentence relational feature information, thereby yielding the sentence vector representation. The embodiment of the invention inputs the sentence vector representation into a bidirectional long short-term memory network (BiLSTM) for feature extraction, obtaining the hidden state of each character in the Chinese sentence as the feature set, where the feature set at least comprises named entity private features, shared features, and Chinese word segmentation private features. The embodiment of the invention then inputs the hidden state of each character output by the BiLSTM layer into the self-attention layer, learns the dependency relationship between any two characters, and extracts the internal structure information of the sentence to obtain the named entity task representation and the Chinese word segmentation task representation. Finally, the embodiment of the invention introduces a task-specific conditional random field (CRF) layer for each of the two tasks, and inputs the final representations of the two tasks into their respective CRF layers for decoding to obtain the final sequence tags.
Adversarial training is then performed on the untrained named entity recognition model according to the named entity sequence tag and the Chinese word segmentation sequence tag in combination with the task classifier, to obtain a trained named entity recognition model. The named entity text to be recognized is acquired and input into the trained named entity recognition model for named entity recognition, yielding a named entity recognition result.
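In adversarial training of this kind, the shared features are typically fed to the task classifier through a gradient reversal layer so that the shared encoder learns task-invariant word boundary information. The patent does not spell out this mechanism, so the following is a minimal library-free sketch of the standard technique, with the class name and the coefficient lam as illustrative assumptions:

```python
# Sketch of a gradient reversal layer (GRL): the forward pass is the
# identity on the shared features, while the backward pass negates and
# scales the gradient, so the shared encoder is trained to make the
# task classifier FAIL to tell which task an input came from.
class GradientReversal:
    def __init__(self, lam=1.0):
        self.lam = lam  # adversarial loss trade-off coefficient (assumed)

    def forward(self, x):
        return x  # identity: shared features pass through unchanged

    def backward(self, grad):
        return [-self.lam * g for g in grad]  # reversed, scaled gradient

grl = GradientReversal(lam=0.5)
shared_features = [0.2, -1.3, 0.7]
out = grl.forward(shared_features)
grad_from_classifier = [0.1, 0.4, -0.2]
grad_to_encoder = grl.backward(grad_from_classifier)
```

Because the gradient flowing back into the shared encoder is reversed, minimizing the classifier's loss drives the shared features toward containing only information common to both tasks.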
Further as a preferred embodiment, the constructing a training dataset comprising a named entity dataset and a chinese word segmentation dataset comprises:
performing data crawling processing on the data website to obtain an original data set;
labeling the original data set to obtain a named entity data set;
and selecting a data set in a field different from the named entity data set from the general data set to obtain the Chinese word segmentation data set.
In the embodiment of the invention, different data websites are crawled using web crawler technology to obtain text data; the original data are labeled, and a Chinese named entity dataset is constructed. For the Chinese word segmentation task, general datasets from different fields are selected. Specifically, the original data are labeled with entity types using the BIO tagging scheme, wherein B marks the beginning of an entity, I marks the middle or end of an entity, and O marks a non-entity. The embodiment of the invention may adopt data from the ship field, with five entity types (person name, place, organization, ship name and ship type). A data augmentation step follows: first collect all entities of each category, for example putting all ship-name entities in one set and all ship-type entities in another; then shuffle each set internally and place its entities back into positions previously occupied by entities of the same category; finally delete duplicate sentences, thereby constructing the Chinese named entity dataset.
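The BIO scheme above can be sketched as a small conversion from annotated entity spans to per-character tags (the helper name and the example spans are hypothetical, not from the patent's dataset):

```python
# Convert character-level entity span annotations into BIO tags:
# B- marks the first character of an entity, I- the remaining
# characters, and O marks non-entity characters.
def to_bio(chars, entities):
    """entities: list of (start, end_exclusive, entity_type) spans."""
    tags = ["O"] * len(chars)
    for start, end, etype in entities:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

# Hypothetical six-character sentence containing a 3-character
# ship-name entity and a 2-character place entity.
tags = to_bio(list("ABCDEF"), [(0, 3, "SHIP"), (4, 6, "LOC")])
```

For the spans above this yields `["B-SHIP", "I-SHIP", "I-SHIP", "O", "B-LOC", "I-LOC"]`.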
Further as a preferred embodiment, the inputting the training data set into a preprocessing model for coding processing to obtain sentence vector representation includes:
sentence embedding processing is carried out on the training data set to obtain an embedded vector;
and encoding the embedded vector according to the preprocessing model to obtain sentence vector representation.
In the embodiment of the invention, text sentences of the named entity dataset and the Chinese word segmentation dataset in the training dataset are input into the embedding layer for sentence embedding to obtain embedded vectors; the preprocessing model BERT then serves as a sequence encoder that maps the discrete characters of the input sentences to vector representations, yielding the sentence vector representation.
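The sentence embedding step sums three lookups per character, as in BERT's input layer: word (token), segment, and position embeddings. A library-free sketch under the assumption of toy 2-dimensional embedding tables (all names and values are illustrative):

```python
# BERT-style input embedding: each character's final embedding is the
# elementwise sum of its word (token), segment, and position embeddings.
def embed(tokens, segment_ids, token_table, segment_table, position_table):
    out = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        e = [t + s + p for t, s, p in zip(token_table[tok],
                                          segment_table[seg],
                                          position_table[pos])]
        out.append(e)
    return out

# Toy 2-dimensional tables; real BERT uses learned high-dimensional ones.
token_table = {"[CLS]": [0.1, 0.0], "ship": [0.2, 0.3]}
segment_table = {0: [0.0, 0.1]}
position_table = [[0.01, 0.0], [0.0, 0.02]]
vecs = embed(["[CLS]", "ship"], [0, 0], token_table, segment_table, position_table)
```

The summed vectors are what the BERT encoder stack then consumes.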
Further as a preferred embodiment, the inputting the sentence vector representation into a bidirectional long short-term memory network for feature extraction to obtain a feature set, wherein the feature set at least comprises named entity private features, shared features and Chinese word segmentation private features, includes:
the bidirectional long-short-term memory network comprises a named entity private network layer, a Chinese word segmentation private network layer and a shared network layer;
carrying out named entity recognition task feature extraction processing on the sentence vector representation through the named entity private network layer to obtain named entity private features;
carrying out shared information feature extraction on the sentence vector representation through the shared network layer to obtain shared features;
and extracting the Chinese word segmentation task characteristics from the sentence vector representation through the Chinese word segmentation private network layer to obtain Chinese word segmentation private characteristics.
Referring to fig. 2, in the embodiment of the present invention, in order to fuse information from both sides of the sequence, feature extraction is performed using a bidirectional long short-term memory network to obtain the hidden state of each character in the Chinese sentence. The bidirectional long short-term memory network comprises a named entity private network layer (NER BiLSTM), a Chinese word segmentation private network layer (CWS BiLSTM) and a shared network layer (Shared BiLSTM). Given a sentence $x = \{c_1, c_2, \ldots, c_N\}$ from the Chinese named entity dataset or the Chinese word segmentation dataset, the hidden states of the BiLSTM layer may be represented as follows:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h_{i-1}})$$

$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h_{i+1}})$$

$$h_i = \overrightarrow{h_i} \oplus \overleftarrow{h_i}$$

In the above, $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ respectively represent the hidden states of the forward and backward LSTM at position $i$ of the vector representation, and $\oplus$ represents the concatenation operation.
The named entity private network layer is used to extract features for the Chinese named entity task, while the features of the shared network layer and the Chinese word segmentation private network layer are used in adversarial training to learn shared word boundary information; the named entity private network layer and the Chinese word segmentation private network layer may be collectively called private network layers. The hidden states of the private network layers and the shared network layer are respectively expressed as follows:

$$h_t^{private} = \mathrm{BiLSTM}(x_t, h_{t-1}^{private}; \theta_{private})$$

$$h_t^{shared} = \mathrm{BiLSTM}(x_t, h_{t-1}^{shared}; \theta_{shared})$$

In the above, $\theta_{private}$ and $\theta_{shared}$ respectively represent the private network layer parameters and the shared network layer parameters, $x_t$ denotes the $t$-th character input to the BiLSTM layer, $h_t^{private}$ represents the private network layer hidden state, and $h_t^{shared}$ represents the shared network layer hidden state.
Further as a preferred embodiment, the inputting the private feature set into the self-attention layer to perform dependency analysis processing to obtain a named entity task representation and a chinese word segmentation task representation includes:
the self-attention layer comprises a named entity private attention layer, a Chinese word segmentation private attention layer and a shared attention layer;
the self-attention layer learns the dependency relationship among characters of the private feature set, and extracts internal structure information to obtain named entity vector representation, chinese word segmentation vector representation and shared vector representation;
and respectively connecting the shared vector representation with the named entity vector representation and the Chinese word segmentation vector representation to obtain a named entity task representation and a Chinese word segmentation task representation.
In the embodiment of the invention, the self-attention layer comprises a named entity private attention layer, a Chinese word segmentation private attention layer and a shared attention layer. The private feature set is input into the self-attention layer, which learns the dependency relationship between any two characters and extracts the internal structure information of the sentence. Taking the named entity private attention layer as an example, let $H = \{h_1, h_2, \ldots, h_N\}$ denote the output of the named entity private network layer; the applied scaled dot-product attention is expressed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

In the above, $Q, K, V \in \mathbb{R}^{N \times 2d_h}$ respectively represent the query matrix, key matrix and value matrix; $N$ is the total number of characters of the input sentence; $d_h$ is the dimension of the vector representation output by the unidirectional LSTM for each character; and $d$ represents the dimension of the hidden units of the BiLSTM layer, equal in value to $2d_h$.
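The scaled dot-product attention computation can be sketched with plain lists for a toy single-head example (the matrices are illustrative 2×2 values, not model outputs):

```python
import math

# Scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, computed
# row by row with plain lists for a toy 2x2 example.
def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        # score of query q against every key, scaled by sqrt(d)
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                     for k in K])
        # each output row is a weighted sum of the value rows
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, weighted by how strongly the corresponding query attends to each key.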
In addition, the embodiment of the invention adopts a multi-head attention mechanism: the queries, keys and values are linearly projected $h$ times, scaled dot-product attention is applied to the $h$ projections in parallel, and finally the results are concatenated and projected again to obtain a new representation. The multi-head attention is formulated as:

$$\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V})$$
$$H'=(\mathrm{head}_1\oplus\cdots\oplus\mathrm{head}_h)W_o$$

where $W_i^{Q},W_i^{K},W_i^{V}\in\mathbb{R}^{2d_h\times(2d_h/h)}$ and $W_o\in\mathbb{R}^{2d_h\times 2d_h}$ are trainable parameters.
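The multi-head mechanism can be sketched as follows; the per-head weight matrices are generated randomly here purely for illustration, whereas in the model they are the trainable parameters named above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(H, W_q, W_k, W_v, W_o, n_heads):
    """Project H into n_heads subspaces, attend in each, concatenate, project.

    H: (N, d); W_q[i], W_k[i], W_v[i]: (d, d // n_heads); W_o: (d, d).
    """
    heads = []
    for i in range(n_heads):
        Q, K, V = H @ W_q[i], H @ W_k[i], H @ W_v[i]
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # scaled dot-product per head
        heads.append(A @ V)                          # (N, d / n_heads)
    return np.concatenate(heads, axis=-1) @ W_o      # concatenate and project: (N, d)

N, d, h = 5, 8, 4
H = np.random.randn(N, d)
W_q = [np.random.randn(d, d // h) for _ in range(h)]
W_k = [np.random.randn(d, d // h) for _ in range(h)]
W_v = [np.random.randn(d, d // h) for _ in range(h)]
W_o = np.random.randn(d, d)
H_new = multi_head_attention(H, W_q, W_k, W_v, W_o, h)   # shape (5, 8)
```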
Further as a preferred embodiment, the inputting of the named entity task representation and the Chinese word segmentation task representation into a conditional random field layer for decoding to obtain a named entity sequence tag and a Chinese word segmentation sequence tag includes:
the conditional random field layer comprises a named entity conditional random field and a Chinese word segmentation conditional random field;
carrying out label marking processing on the named entity task representation through the named entity conditional random field to obtain a named entity sequence label;
and carrying out label marking processing on the Chinese word segmentation task representation through the Chinese word segmentation conditional random field to obtain a Chinese word segmentation sequence label.
Referring to FIG. 2, in an embodiment of the present invention, the conditional random field layer includes a named entity conditional random field (NER CRF) and a Chinese word segmentation conditional random field (CWS CRF). Because the dependency relationships between labels differ between the Chinese named entity task and the Chinese word segmentation task, a separate conditional random field (CRF) layer is introduced for each of the two tasks. The final representations of the two tasks are input into their respective CRF layers for decoding to obtain the final sequence tags. Given a sentence $x=\{c_1,c_2,\ldots,c_N\}$ and the corresponding labels $y=\{y_1,y_2,\ldots,y_N\}$, the CRF labeling process is expressed as:

$$o_i=W_s h'_i+b_s$$
$$s(x,y)=\sum_{i=1}^{N}\left(o_{i,y_i}+T_{y_{i-1},y_i}\right)$$
$$\hat{y}=\arg\max_{y\in Y_x}s(x,y)$$

where $H'=\{h'_1,h'_2,\ldots,h'_N\}$ represents the input of the CRF layer; $W_s\in\mathbb{R}^{|T|\times 2d_h}$ and $b_s\in\mathbb{R}^{|T|}$ are trainable parameters, and $|T|$ represents the number of output tags; $o_{i,y_i}$ represents the score of the $y_i$-th tag of character $c_i$; $T$ is a transition matrix that scores pairs of adjacent labels; $Y_x$ represents all candidate tag sequences of the given sentence $x$; and $\hat{y}$ is the predicted tag sequence, obtained by decoding with the Viterbi algorithm.
The probability of outputting the tag sequence is defined as:

$$p(\bar{y}\mid x)=\frac{\exp\!\left(s(x,\bar{y})\right)}{\sum_{\tilde{y}\in Y_x}\exp\!\left(s(x,\tilde{y})\right)}$$

where $\bar{y}$ represents the actual tag sequence.
Given $T$ training samples $\{(x^{(t)},\bar{y}^{(t)})\}_{t=1}^{T}$, the loss function is defined as:

$$\mathcal{L}=-\sum_{t=1}^{T}\log p\!\left(\bar{y}^{(t)}\mid x^{(t)}\right)$$

The loss function is minimized by gradient back propagation.
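The Viterbi decoding step referenced above can be sketched as follows; `emissions` plays the role of the per-character tag scores o_i and `transitions` the role of the matrix T. This is a minimal NumPy illustration with made-up scores, not the patent's implementation:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the tag sequence maximizing s(x, y) = sum_i (o_{i,y_i} + T_{y_{i-1},y_i}).

    emissions: (N, T) per-character tag scores; transitions: (T, T),
    transitions[a, b] = score of moving from tag a to tag b.
    """
    N, T = emissions.shape
    score = emissions[0].copy()                 # best score of a path ending in each tag
    backptr = np.zeros((N, T), dtype=int)
    for i in range(1, N):
        # cand[a, b] = best path ending at position i-1 in tag a, then tag b at i
        cand = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]                # best final tag
    for i in range(N - 1, 0, -1):               # follow back-pointers to recover the path
        best.append(int(backptr[i][best[-1]]))
    return best[::-1]

emissions = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]])
transitions = np.zeros((2, 2))
path = viterbi_decode(emissions, transitions)   # -> [0, 1, 0]
```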
Further as a preferred embodiment, the sentence embedding processing is performed on the training data set to obtain an embedded vector, including:
converting each character in the training data set to obtain character vector representation;
marking the character vector representation to obtain word embedded vector representation;
carrying out semantic classification processing on the character vector representation to obtain a segment embedded vector representation;
performing marking position coding processing on the character vector representation to obtain a position embedded vector representation;
and carrying out vector summation on the word embedding vector representation, the segment embedding vector representation and the position embedding vector representation to obtain an embedding vector.
Referring to FIG. 3 and FIG. 4, in an embodiment of the present invention, the text sentences of the two data sets are input to an embedding layer, and BERT serves as a sequence encoder that maps the discrete characters of the input sentence into vector representations; this encoder is shared by the three subtasks. As a pre-trained language model, BERT further improves the generalization capability of the word embedding model and fully expresses character-level, word-level, sentence-level and even inter-sentence relationship features.
FIG. 3 is a schematic diagram of the embedding layer model; the result of the embedding layer's processing of a data set is the input to BERT. The embedded vector is composed of a word embedding vector representation (Token Embeddings), a segment embedding vector representation (Segment Embeddings) and a position embedding vector representation (Position Embeddings). The word embedding layer converts each Chinese character into a 768-dimensional vector representation; before input to the word embedding layer, the input sentence is tokenized, and the two special marks [CLS] and [SEP] are added at the beginning and the end of the token sequence, respectively. The role of the segment embedding is to distinguish the tokens of the two sentences in an input sentence pair, i.e., to classify them according to whether the two sentences are semantically similar. Specifically, for the input sentence pair "[CLS] the ship berths temporarily at the Lingshan Island wharf [SEP] Lingshan Island builds a national ocean park [SEP]", the segment embedding layer assigns the first vector (index 0) to sentence 1 ("[CLS] the ship berths temporarily at the Lingshan Island wharf [SEP]") and the second vector (index 1) to sentence 2 ("Lingshan Island builds a national ocean park [SEP]"). The function of the position embedding is to encode the position information of the tokens in the input sentence: since the NER task ultimately predicts a tag sequence in view of the order of the characters of the input sentence, the position embedding layer assigns a vector to each token of the sentence so that BERT learns the sequential features of the input sequence. The word embedding vector, segment embedding vector and position embedding vector of each token are summed to obtain {E_1, E_2, …, E_N}, which is input to BERT's Transformer pre-training model for language understanding, finally obtaining the sentence vector representation {c_1, c_2, …, c_N}.
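The summation of the three embeddings can be sketched as follows. The lookup tables, vocabulary size and token ids here are hypothetical stand-ins for BERT's learned parameters, used only to show the element-wise sum:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, dim = 100, 16, 768   # dim matches the 768-dimensional vectors above

# Hypothetical lookup tables; in BERT these are learned parameters.
token_table    = rng.normal(size=(vocab_size, dim))   # one row per vocabulary token
segment_table  = rng.normal(size=(2, dim))            # index 0 = sentence 1, index 1 = sentence 2
position_table = rng.normal(size=(max_len, dim))      # one row per position

token_ids   = np.array([0, 11, 12, 13, 1])            # e.g. [CLS], three characters, [SEP]
segment_ids = np.zeros(len(token_ids), dtype=int)     # all tokens belong to sentence 1
positions   = np.arange(len(token_ids))

# The embedding fed to BERT is the element-wise sum of the three vectors per token.
E = token_table[token_ids] + segment_table[segment_ids] + position_table[positions]
```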
The overall architecture of BERT is shown in FIG. 4, where Trm is an abbreviation for Transformer.
In the related art, joint models integrate the word boundary information of the Chinese word segmentation task into the named entity task so as to perform the Chinese named entity task and the Chinese word segmentation task jointly, but they focus only on the task-shared information between Chinese NER and CWS and do not filter the private information of each task, which introduces noise into both tasks.
In summary, the embodiment of the invention has the following advantages: the named entity recognition method and system based on adversarial transfer learning integrate the task-shared word boundary information into the Chinese named entity recognition task. Adversarial transfer learning incorporates adversarial training into transfer learning and optimizes the named entity recognition task: it not only utilizes the shared information obtained from jointly training the named entity recognition and Chinese word segmentation tasks, but also prevents the private information of the Chinese word segmentation task from introducing noise. Compared with other models, the invention introduces adversarial training, exploits the word boundary information shared by the two tasks, prevents the noise of the Chinese word segmentation task's private information from being introduced, and improves the accuracy of the named entity recognition method.
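The adversarial training component is described only at a high level in this text. In adversarial transfer learning it is commonly realized with a gradient reversal layer placed between the shared feature extractor and the task discriminator; the sketch below is an assumption about one plausible realization, not the patent's stated implementation:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lambda backward.

    Between the shared BiLSTM features and the task classifier, this
    pushes the shared features toward being task-indistinguishable,
    keeping task-private noise out of the shared space.
    """
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                       # features pass through unchanged

    def backward(self, grad_out):
        return -self.lam * grad_out    # reversed, scaled gradient to the extractor

grl = GradientReversal(lam=0.5)
x = np.ones((2, 3))
fwd = grl.forward(x)       # identical to x
bwd = grl.backward(x)      # equals -0.5 * x
```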
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments described above, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and these equivalent modifications or substitutions are included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. A named entity recognition method based on adversarial transfer learning, the method comprising:
constructing a training data set, wherein the training data set comprises a named entity data set and a Chinese word segmentation data set;
inputting the training data set into a preprocessing model for coding processing to obtain sentence vector representation;
inputting the sentence vector representation into a bidirectional long short-term memory network for feature extraction processing to obtain a feature set, wherein the feature set at least comprises named entity private features, shared features and Chinese word segmentation private features;
inputting the private feature set into a self-attention layer for dependency analysis and processing to obtain a named entity task representation and a Chinese word segmentation task representation;
inputting the named entity task representation and the Chinese word segmentation task representation into a conditional random field layer for decoding processing to obtain a named entity sequence tag and a Chinese word segmentation sequence tag;
performing countermeasure learning training treatment on the untrained named entity recognition model according to the named entity sequence tag and the Chinese word segmentation sequence tag in combination with a task classifier to obtain a trained named entity recognition model;
and acquiring a named entity to be identified, inputting the named entity to be identified into the trained named entity identification model for named entity identification processing, and obtaining a named entity identification result.
2. The method of claim 1, wherein the constructing a training data set comprising a named entity data set and a Chinese word segmentation data set comprises:
performing data crawling processing on the data website to obtain an original data set;
labeling the original data set to obtain a named entity data set;
and selecting a data set in a field different from the named entity data set from the general data set to obtain the Chinese word segmentation data set.
3. The method of claim 1, wherein the inputting the training data set into a preprocessing model for encoding processing to obtain a sentence vector representation comprises:
sentence embedding processing is carried out on the training data set to obtain an embedded vector;
and encoding the embedded vector according to the preprocessing model to obtain sentence vector representation.
4. The method according to claim 1, wherein the inputting the sentence vector representation into a bidirectional long short-term memory network for feature extraction processing to obtain a feature set, the feature set comprising at least named entity private features, shared features, and Chinese word segmentation private features, comprises:
the bidirectional long short-term memory network comprises a named entity private network layer, a Chinese word segmentation private network layer and a shared network layer;
carrying out named entity recognition task feature extraction processing on the sentence vector representation through the named entity private network layer to obtain named entity private features;
carrying out shared information feature extraction processing on the sentence vector representation through the shared network layer to obtain shared features;
and extracting the Chinese word segmentation task characteristics from the sentence vector representation through the Chinese word segmentation private network layer to obtain Chinese word segmentation private characteristics.
5. The method according to claim 1, wherein the inputting the private feature set into the self-attention layer for dependency analysis processing to obtain a named entity task representation and a Chinese word segmentation task representation comprises:
the self-attention layer comprises a named entity private attention layer, a Chinese word segmentation private attention layer and a shared attention layer;
the self-attention layer learns the dependency relationship among characters of the private feature set, and extracts internal structure information to obtain a named entity vector representation, a Chinese word segmentation vector representation and a shared vector representation;
and respectively connecting the shared vector representation with the named entity vector representation and the Chinese word segmentation vector representation to obtain a named entity task representation and a Chinese word segmentation task representation.
6. The method of claim 1, wherein said inputting the named entity task representation and the Chinese word segmentation task representation into a conditional random field layer for decoding to obtain a named entity sequence tag and a Chinese word segmentation sequence tag comprises:
the conditional random field layer comprises a named entity conditional random field and a Chinese word segmentation conditional random field;
carrying out label marking processing on the named entity task representation through the named entity conditional random field to obtain a named entity sequence label;
and carrying out label marking processing on the Chinese word segmentation task representation through the Chinese word segmentation conditional random field to obtain a Chinese word segmentation sequence label.
7. A method according to claim 3, wherein said sentence embedding of said training data set to obtain an embedded vector comprises:
converting each character in the training data set to obtain character vector representation;
marking the character vector representation to obtain word embedded vector representation;
carrying out semantic classification processing on the character vector representation to obtain a segment embedded vector representation;
performing marking position coding processing on the character vector representation to obtain a position embedded vector representation;
and carrying out vector summation on the word embedding vector representation, the segment embedding vector representation and the position embedding vector representation to obtain an embedding vector.
8. A named entity recognition system based on adversarial transfer learning, the system comprising:
a first module for constructing a training data set, the training data set comprising a named entity data set and a Chinese word segmentation data set;
the second module is used for inputting the training data set into a preprocessing model for coding processing to obtain sentence vector representation;
the third module is used for inputting the sentence vector representation into a bidirectional long short-term memory network for feature extraction processing to obtain a feature set, wherein the feature set at least comprises named entity private features, shared features and Chinese word segmentation private features;
the fourth module is used for inputting the private feature set into the self-attention layer for dependency analysis and processing to obtain a named entity task representation and a Chinese word segmentation task representation;
the fifth module is used for inputting the named entity task representation and the Chinese word segmentation task representation into a conditional random field layer for decoding processing to obtain a named entity sequence tag and a Chinese word segmentation sequence tag;
the sixth module is used for performing countermeasure learning training treatment on the untrained named entity recognition model according to the named entity sequence tag and the Chinese word segmentation sequence tag combined with the task classifier to obtain a trained named entity recognition model;
and a seventh module, configured to obtain a named entity to be identified, input the named entity to be identified into the trained named entity identification model, and perform named entity identification processing to obtain a named entity identification result.
9. The named entity recognition system of claim 8, wherein the first module configured to construct a training data set comprising a named entity data set and a Chinese word segmentation data set comprises:
the first unit is used for carrying out data crawling processing on the data website to obtain an original data set;
the second unit is used for carrying out labeling processing on the original data set to obtain a named entity data set;
and the third unit is used for selecting a data set in the field different from the named entity data set from the general data set to obtain the Chinese word segmentation data set.
10. The named entity recognition system of claim 8, wherein the second module is configured to input the training dataset into a preprocessing model for encoding, to obtain a sentence vector representation, and comprises:
a third unit, configured to perform sentence embedding processing on the training data set to obtain an embedded vector;
and a fourth unit, configured to encode the embedded vector according to the preprocessing model, to obtain a sentence vector representation.
CN202310133155.1A 2023-02-16 2023-02-16 Named entity recognition method and system based on countermeasure migration learning Pending CN116167378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310133155.1A CN116167378A (en) 2023-02-16 2023-02-16 Named entity recognition method and system based on countermeasure migration learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310133155.1A CN116167378A (en) 2023-02-16 2023-02-16 Named entity recognition method and system based on countermeasure migration learning

Publications (1)

Publication Number Publication Date
CN116167378A true CN116167378A (en) 2023-05-26

Family

ID=86419616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310133155.1A Pending CN116167378A (en) 2023-02-16 2023-02-16 Named entity recognition method and system based on countermeasure migration learning

Country Status (1)

Country Link
CN (1) CN116167378A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332784A (en) * 2023-09-28 2024-01-02 卓世科技(海南)有限公司 Intelligent knowledge enhancement method based on hierarchical graph attention and dynamic meta-learning
CN117807999A (en) * 2024-02-29 2024-04-02 武汉科技大学 Domain self-adaptive named entity recognition method based on countermeasure learning
CN117807999B (en) * 2024-02-29 2024-05-10 武汉科技大学 Domain self-adaptive named entity recognition method based on countermeasure learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination