CN114462409A - Audit field named entity recognition method based on countermeasure training

Audit field named entity recognition method based on countermeasure training

Info

Publication number
CN114462409A
CN114462409A (application CN202210109168.0A)
Authority
CN
China
Prior art keywords
task
ner
cws
shared
bilstm
Prior art date
Legal status
Pending
Application number
CN202210109168.0A
Other languages
Chinese (zh)
Inventor
钱泰羽
陈一飞
乔红岩
Current Assignee
NANJING AUDIT UNIVERSITY
Original Assignee
NANJING AUDIT UNIVERSITY
Priority date
Filing date
Publication date
Application filed by NANJING AUDIT UNIVERSITY filed Critical NANJING AUDIT UNIVERSITY
Priority to CN202210109168.0A priority Critical patent/CN114462409A/en
Publication of CN114462409A publication Critical patent/CN114462409A/en
Pending legal-status Critical Current

Classifications

    • G06F40/295 Named entity recognition
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

With the promulgation of the new audit law, automatically identifying effective entity information from audit-field corpora helps improve the efficiency of implementing audit policies. Named Entity Recognition (NER) aims to identify entities in a corpus; deep learning methods are mature and effective on this task, but the corpus resources in the audit field are not yet well developed and entity boundary division is unclear. The invention provides an audit field named entity recognition method based on adversarial training. Chinese Word Segmentation (CWS) is used to identify word boundaries and shares much word boundary information with NER, so this commonality can be used to assist the NER task and help with boundary division. Word vectors are obtained with BERT, the information shared by the NER and CWS tasks is extracted through adversarial training while noise caused by CWS-task private information is effectively prevented, and the task-shared word boundary information is fused into the NER task, improving the accuracy of named entity recognition in the audit field.

Description

Audit field named entity recognition method based on countermeasure training
Technical Field
The invention relates to the technical field of named entity recognition, in particular to an audit field named entity recognition method based on countermeasure training.
Background
Named Entity Recognition (NER) is a fundamental task of Natural Language Processing (NLP) and an upstream task for relation extraction, question answering systems and other applications. Its goal is to mark predefined entity types, such as place names and organization names, in unstructured text. Traditional named entity recognition methods mostly start from improved models and feature engineering to reduce the dependence on rule-based methods and expert knowledge, but pay little attention to the problem of entity boundaries. With the promulgation of the new audit law, audit policies are divided ever more finely and audit policy texts grow day by day. At the same time, implementing audit policy becomes increasingly important in the audit process; at present this is done mainly by hand, which increases the workload of auditors. In addition, audit policies are mostly unstructured text, and extracting entities from such text helps improve the efficiency of policy implementation. In the audit field, however, the corpus resources are not yet well developed and entity boundary division is not detailed enough. Chinese Word Segmentation (CWS) is used to identify word boundaries; CWS has larger data sets than NER and divides boundaries more finely on general-domain data, and NER shares many boundary divisions with CWS, so this commonality can be used to assist the NER task and help with boundary division. Peng et al. proposed a joint model of the NER and CWS tasks in which a linear-chain CRF has access to both the NER feature extractor and the word segmentation LSTM module, and word segmentation and NER training share all parameters of the LSTM module. That model, however, focuses only on the information shared between the NER and CWS tasks and ignores the filtering of each task's private information, which introduces noise into both tasks.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a named entity identification method in the audit field based on countermeasure training, which can effectively solve the problems in the technical background.
In order to achieve the purpose, the invention provides an audit field named entity recognition method based on countermeasure training, which comprises the following steps:
S1): Acquisition of the data sets: the invention mainly addresses named entity recognition in the audit field, so an audit-field data set is used as the main data set of the invention. Both CWS and NER divide entity/word boundaries; CWS has larger data sets and divides boundaries more finely on general-domain data, and this characteristic can be used to assist the NER task. The New Era People's Daily word segmentation corpus is used as the auxiliary data set because of its large volume and rich content.
S11): NER dataset
The audit-field data set was built by collecting 7323 corpus items related to poverty-alleviation policies from government websites with a web crawler, screening sentences of 10 to 100 characters, and preprocessing the raw data, including removing non-text parts, unifying the encoding and segmenting the text. The corpus is divided into a training set, a validation set and a test set in a 7:2:1 ratio, and four entity types (person name, place name, organization name and proper noun) are labeled manually in BIO mode (B marks the beginning of an entity, I the inside of an entity, and O a token that is not part of an entity).
S12): CWS dataset
The New Era People's Daily word segmentation corpus was constructed by the Humanities and Social Computing Research Center of Nanjing Agricultural University from all People's Daily articles published in nine months: the first half of 2015 (January to June) and January of 2016, 2017 and 2018. The corpus exceeds 23 million characters and is labeled manually in BMES mode. The invention uses the January 2018 portion, 43,647 items in total.
S2): constructing a model: the model framework provided by the invention longitudinally comprises three tasks, wherein the left side is named as an entity identification task and comprises an NER BERT Embedding module, an NER Private BilSTM module and an NER CRF module; the right side is a Chinese word segmentation task which comprises a CWS BERT Embedding module, a CWS Private BilSTM module and a CWS CRF module; the middle part is an antagonistic training task which comprises a Shared BilSTM module and an antagonistic training module; the three tasks comprise an embedding layer, a sharing-private feature extraction layer and a CRF layer or an antagonistic training layer in the transverse direction, and the structure is introduced according to the three tasks in the transverse direction.
S21): embedding layer
The corpus is input into the embedding layer. BERT encodes with Transformer and introduces a Self-attention mechanism to model the dependencies between words and capture the internal structure of a sentence; input sentences longer than n are truncated and sentences shorter than n are padded with 0. A [CLS] vector representing the input is added at the beginning of the sentence and a [SEP] vector separates sentence pairs, and training on the sentence yields more accurate semantic information. Segment embedding is then used to judge whether given sentences are consecutive, providing sentence-level features. Since the word order of text is crucial to sentence meaning, BERT encodes each character position independently and learns the order characteristics of the input sequence, obtaining the information of each position. Finally, the vectors obtained from Token embedding, Segment embedding and Position embedding are summed to obtain the output sequence of BERT.
S211): NER BERT Embedding module
Using the audit-domain data set for the NER task, a given sentence W = [w_1, w_2, ..., w_n] is input into the NER BERT Embedding module, which outputs the sequence of word vectors X = [x_1, x_2, ..., x_n], where w_i is a word (character) in the sentence, x_i is the word vector corresponding to w_i, and n is the sentence length.
S212): CWS BERT Embedding module
Using the New Era People's Daily word segmentation corpus for the CWS task, a given sentence W' = [w'_1, w'_2, ..., w'_m] is input into the CWS BERT Embedding module, which outputs the sequence of word vectors X' = [x'_1, x'_2, ..., x'_m], where w'_i is a word (character) in the sentence, x'_i is the word vector corresponding to w'_i, m is the sentence length, and n > m is specified.
In summary, each vector sequence in X' is padded to length n, and the padded X' is concatenated below X to obtain the combined sequence X_s, which serves as the input for extracting shared information in the adversarial training task.
S22): shared-private feature extraction layer
The Long Short-Term Memory network (LSTM) is a variant of the Recurrent Neural Network (RNN) that can effectively use long-distance information and alleviates the gradient vanishing and gradient explosion problems of RNNs through gate structures and memory cells. A unidirectional LSTM can only use information preceding the current input, yet in sequence labeling the information following the current input is also important. To fuse information from both sides of the sequence, feature extraction is performed with a bidirectional LSTM (Bi-directional Long Short-Term Memory, BiLSTM). Given an input sequence for feature extraction, the hidden state at time step i represents the output features, as shown in equations (1) to (3):
h_i^fw = LSTM(x_i, h_{i-1}^fw)    (1)
h_i^bw = LSTM(x_i, h_{i+1}^bw)    (2)
h_i = h_i^fw ⊕ h_i^bw    (3)
where h_i^fw and h_i^bw denote the forward and backward hidden states at time step i, respectively, and ⊕ denotes the concatenation operation.
S221): NER Private BiLSTM module
The sequence X = [x_1, x_2, ..., x_n] is input into the NER Private BiLSTM module for private feature extraction, giving the output features of the NER-task private BiLSTM, H_ner^p = [h_1^np, h_2^np, ..., h_n^np], where h_i^np denotes the NER-task private feature output at time step i. For any sentence in the audit-domain data set, the hidden state of the private BiLSTM is expressed as shown in equation (4):
h_i^np = BiLSTM(x_i, h_{i-1}^np; θ_np)    (4)
where θ_np denotes the parameters of the NER private BiLSTM, with the hidden-state dimension set as a hyperparameter.
S222): CWS Private BiLSTM module
The sequence X' = [x'_1, x'_2, ..., x'_m] is input into the CWS Private BiLSTM module for private feature extraction, giving the output features of the CWS-task private BiLSTM, H_cws^p = [h_1^cp, h_2^cp, ..., h_m^cp], where h_i^cp denotes the CWS-task private feature output at time step i. For any sentence in the New Era People's Daily word segmentation corpus, the hidden state of the private BiLSTM layer is expressed as shown in equation (5):
h_i^cp = BiLSTM(x'_i, h_{i-1}^cp; θ_cp)    (5)
where θ_cp denotes the parameters of the CWS private BiLSTM, with the hidden-state dimension set as a hyperparameter.
S223): shared BilSTM module
Will be sequenced
Figure BDA0003494285400000055
The Shared BilSTM module is input for Shared feature extraction, and the output feature of the Shared BilSTM can be obtained
Figure BDA0003494285400000056
Wherein,
Figure BDA0003494285400000057
and (4) representing the shared characteristics of the NER task and the CWS task output at the ith moment. For any sentence in the set, the hidden state of the shared BilSTM layer is represented as shown in equation (6):
Figure BDA0003494285400000058
wherein, thetasharedDimension setting for hidden state for sharing the BilSTM parameter.
In summary, the private features extracted by the NER Private BiLSTM module and the shared features extracted by the Shared BiLSTM module are concatenated to obtain the total feature H_ner of the NER task, which serves as input to the NER CRF module. The private features extracted by the CWS Private BiLSTM module and the shared features extracted by the Shared BiLSTM module are concatenated to obtain the total feature H_cws of the CWS task, which serves as input to the CWS CRF module. This is expressed by equations (7) and (8):
H_ner = H_ner^p ⊕ H_shared^ner    (7)
H_cws = H_cws^p ⊕ H_shared^cws    (8)
where H_shared^ner and H_shared^cws denote the shared BiLSTM outputs corresponding to the NER and CWS sentences, respectively.
S23): CRF layer
The BiLSTM alone only captures relations between words and does not consider dependencies between consecutive labels, so the invention uses a CRF layer to infer labels from the features produced by the BiLSTM layer. Because the label sets of the NER task and the CWS task differ, each task is assigned its own CRF layer to obtain its own sequence labels. However, the dimensionality of the BiLSTM output vector does not match the CRF tag space, so, in order to compute the loss function during CRF label inference, a fully connected layer is added on top of the vector H output by the BiLSTM. The CRF prediction process is expressed by equations (9) and (10):
o_i = A·h_i + b    (9)
s(X, y) = Σ_{i=1..n} ( K_{y_{i-1}, y_i} + o_{i, y_i} )    (10)
where A is the weight matrix, b is the bias term, X is the input sequence, y is a predicted tag sequence, K is the transition probability matrix, K_{y_{i-1}, y_i} is the probability score of transferring from label y_{i-1} to label y_i, o_{i, y_i} is the score of assigning label y_i to character x_i, and n is the sentence length. A negative log-likelihood function is used as the loss function; the probability of the true tag sequence is expressed as equation (11):
P(ȳ | X) = exp( s(X, ȳ) ) / Σ_{ỹ ∈ Y_X} exp( s(X, ỹ) )    (11)
where ȳ is the true tag sequence, Y_X is the set of all possible tag sequences, s(X, ȳ) is the score of the correct tag sequence, and the denominator sums the scores of all tag sequences.
S231): NER CRF module
For H_ner, training with equations (9) to (11) yields the loss function L_ner, expressed as shown in equation (12):
L_ner = −log P(ȳ_ner | X_ner)    (12)
S232): CWS CRF module
For H_cws, training with equations (9) to (11) yields the loss function L_cws, expressed as shown in equation (13):
L_cws = −log P(ȳ_cws | X_cws)    (13)
the training process is continually tuned to minimize the loss function.
S24): Adversarial training layer
Inspired by Generative Adversarial Networks (GAN), adversarial training is used to extract the information shared by the NER and CWS tasks while effectively preventing noise caused by CWS-task private information. A task discriminator identifies which task the features come from through a Max-pooling layer and a Softmax layer; when the model cannot identify which task the features come from, the shared feature extractor has extracted features shared by both tasks, which improves the performance of named entity recognition. The task discriminator is expressed by equations (14) and (15):
s = Maxpooling(H_shared)    (14)
D(s; δ_d) = Softmax(A_1·s + b_1)    (15)
where H_shared is the output of the shared feature extraction layer and δ_d denotes the parameters of the task discriminator, namely the weight A_1 and the bias term b_1.
To prevent private information of the Chinese word segmentation task from entering the shared information space, an adversarial loss function L_adv is introduced to train the shared feature extractor so that the task discriminator cannot reliably identify which task the features come from. The adversarial loss function is expressed as shown in equation (16):
L_adv = min_{δ_s} max_{δ_d} Σ_{i=1..I} Σ_{j=1..J} log D( E_s(x_j^i) )    (16)
where δ_s denotes the shared BiLSTM parameters θ_shared, δ_d denotes the task discriminator parameters, I is the total number of tasks, J is the number of training samples, E_s is the shared feature extractor, x_j^i is the j-th shared-feature sample of task i, and D(·) is the probability the discriminator assigns to the sample's true task.
S3): model training
From the NER task loss function L_ner, the CWS task loss function L_cws and the adversarial loss function L_adv above, the final loss function L of the model is expressed as shown in equation (17):
L = G·L_ner + (1 − G)·L_cws + γ·L_adv    (17)
where γ is the loss weight coefficient and G is a switching function that indicates whether the current input comes from the NER task or the CWS task.
In the process of training the model, a training example is extracted from a given task to update parameters, a final loss function is continuously optimized, and iteration is carried out according to the convergence rate of an NER task until the result is optimal.
Compared with the prior art, the invention has the following beneficial effects: in the audit field named entity recognition method based on adversarial training, word vectors are obtained with BERT, the information shared by the NER task and the CWS task is extracted through adversarial training while noise caused by CWS-task private information is effectively prevented and private information is better filtered out, and the task-shared word boundary information is fused into the NER task, improving the accuracy of named entity recognition in the audit field.
Drawings
FIG. 1 is a model framework diagram of an audit field named entity recognition method based on countermeasure training according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides the following technical solutions:
an audit field named entity recognition method based on countermeasure training comprises the following steps:
First, acquisition of the data sets
The invention mainly addresses named entity recognition in the audit field, so an audit-field data set is used as the main data set of the invention. Both CWS and NER divide entity/word boundaries; CWS has larger data sets and divides boundaries more finely on general-domain data, and this characteristic can be used to assist the NER task. The New Era People's Daily word segmentation corpus (http://corpus.njau.edu.cn/) is used as the auxiliary data set because of its large volume and rich content.
1) NER dataset
The audit-field data set was built by collecting 7323 corpus items related to poverty-alleviation policies from government websites with a web crawler, screening sentences of 10 to 100 characters, and preprocessing the raw data, including removing non-text parts, unifying the encoding, segmenting the text, and so on. The corpus is divided into a training set, a validation set and a test set in a 7:2:1 ratio, and four entity types (person name, place name, organization name and proper noun) are labeled manually in BIO mode (B marks the beginning of an entity, I the inside of an entity, and O a token that is not part of an entity).
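As an illustration, a hypothetical character-level BIO annotation in the style used for the audit corpus; the sentence and tags below are invented for demonstration and do not come from the actual data set:

# Hypothetical BIO-annotated sample (ORG = organization name); illustrative only.
sample = [
    ("审", "B-ORG"), ("计", "I-ORG"), ("署", "I-ORG"),   # an organization entity
    ("发", "O"), ("布", "O"),
    ("扶", "O"), ("贫", "O"), ("政", "O"), ("策", "O"),
]
tokens = [ch for ch, _ in sample]
labels = [tag for _, tag in sample]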
2) CWS dataset
The New Era People's Daily word segmentation corpus was constructed by the Humanities and Social Computing Research Center of Nanjing Agricultural University from all People's Daily articles published in nine months: the first half of 2015 (January to June) and January of 2016, 2017 and 2018. The corpus exceeds 23 million characters and is labeled manually in BMES mode. The invention uses the January 2018 portion, 43,647 items in total.
Second, construction of model
The model framework proposed by the invention is shown in fig. 1. It comprises three tasks arranged vertically: the left side is the named entity recognition task, comprising the NER BERT Embedding module, the NER Private BiLSTM module and the NER CRF module; the right side is the Chinese word segmentation task, comprising the CWS BERT Embedding module, the CWS Private BiLSTM module and the CWS CRF module; the middle is the adversarial training task, comprising the Shared BiLSTM module and the adversarial training module. Horizontally, the three tasks comprise an embedding layer, a shared-private feature extraction layer, and a CRF layer or adversarial training layer; the structure is introduced below according to these three horizontal layers.
1 embedding layer
The corpus is input into the embedding layer. BERT encodes with Transformer and introduces a Self-attention mechanism to model the dependencies between words and capture the internal structure of a sentence; input sentences longer than n are truncated and sentences shorter than n are padded with 0. A [CLS] vector representing the input is added at the beginning of the sentence and a [SEP] vector separates sentence pairs, and training on the sentence yields more accurate semantic information (Token). Segment embedding is then used to judge whether given sentences are consecutive, providing sentence-level features (Segment). Since the word order of text is crucial to sentence meaning, BERT encodes each character position independently and learns the order characteristics of the input sequence, obtaining the information of each position (Position). Finally, the vectors obtained from Token embedding, Segment embedding and Position embedding are summed to obtain the output sequence of BERT.
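For illustration, a minimal sketch of this embedding layer using the Hugging Face transformers library; the checkpoint name bert-base-chinese and the maximum length n = 128 are assumptions, since the patent does not specify the exact pretrained model:

import torch
from transformers import BertTokenizer, BertModel

# Assumption: a generic Chinese BERT checkpoint; the patent does not name the model it uses.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def embed(sentence: str, n: int = 128) -> torch.Tensor:
    """Contextual vectors for the [CLS]/[SEP]-wrapped sentence, truncated/padded to length n."""
    enc = tokenizer(
        sentence,
        max_length=n,          # sentences longer than n are truncated
        padding="max_length",  # sentences shorter than n are padded with 0
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = bert(**enc)      # token + segment + position embeddings are summed inside BERT
    return out.last_hidden_state.squeeze(0)   # shape: (n, 768)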
1) NER BERT Embedding module
Using the audit-domain data set for the NER task, a given sentence W = [w_1, w_2, ..., w_n] is input into the NER BERT Embedding module, which outputs the sequence of word vectors X = [x_1, x_2, ..., x_n], where w_i is a word (character) in the sentence, x_i is the word vector corresponding to w_i, and n is the sentence length.
2) CWS BERT Embedding module
Using the New Era People's Daily word segmentation corpus for the CWS task, a given sentence W' = [w'_1, w'_2, ..., w'_m] is input into the CWS BERT Embedding module, which outputs the sequence of word vectors X' = [x'_1, x'_2, ..., x'_m], where w'_i is a word (character) in the sentence, x'_i is the word vector corresponding to w'_i, m is the sentence length, and n > m is specified.
In summary, each vector sequence in X' is padded to length n, and the padded X' is concatenated below X to obtain the combined sequence X_s, which serves as the input for extracting shared information in the adversarial training task.
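A possible sketch of this padding-and-stacking step in PyTorch (tensor shapes and variable names are illustrative assumptions):

import torch
import torch.nn.functional as F

def build_shared_input(X: torch.Tensor, X_prime: torch.Tensor) -> torch.Tensor:
    """X: NER word vectors, shape (num_ner_sentences, n, 768).
    X_prime: CWS word vectors, shape (num_cws_sentences, m, 768) with m < n.
    Returns the combined sequence used by the shared (adversarial) branch."""
    n = X.size(1)
    m = X_prime.size(1)
    # Pad each CWS sentence from length m to length n with zero vectors.
    X_prime_padded = F.pad(X_prime, (0, 0, 0, n - m))   # pad the time dimension
    # "Connect the padded X' below X": stack the two corpora along the sentence axis.
    return torch.cat([X, X_prime_padded], dim=0)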
2 shared-private feature extraction layer
The Long Short-Term Memory network (LSTM) is a variant of the Recurrent Neural Network (RNN) that can effectively use long-distance information and alleviates the gradient vanishing and gradient explosion problems of RNNs through gate structures and memory cells. A unidirectional LSTM can only use information preceding the current input, yet in sequence labeling the information following the current input is also important. To fuse information from both sides of the sequence, the invention performs feature extraction with a bidirectional LSTM (Bi-directional Long Short-Term Memory, BiLSTM).
Given an input sequence to perform feature extraction, the hidden state at the ith time represents the output features as shown in equations (1) to (3):
h_i^fw = LSTM(x_i, h_{i-1}^fw)    (1)
h_i^bw = LSTM(x_i, h_{i+1}^bw)    (2)
h_i = h_i^fw ⊕ h_i^bw    (3)
where h_i^fw and h_i^bw denote the forward and backward hidden states at time step i, respectively, and ⊕ denotes the concatenation operation.
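A minimal PyTorch sketch of the bidirectional LSTM described by equations (1) to (3); the hidden size of 120 follows the experimental setup below, while the rest is an illustrative assumption:

import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bidirectional LSTM; forward and backward hidden states are concatenated (eq. (3))."""
    def __init__(self, input_dim: int = 768, hidden_dim: int = 120):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim) -> h: (batch, seq_len, 2 * hidden_dim)
        h, _ = self.bilstm(x)
        return h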
The invention uses a shared-private feature extraction layer: the NER Private BiLSTM module extracts audit-field features for the NER task, the CWS Private BiLSTM module extracts features of the New Era People's Daily word segmentation corpus for the CWS task, and the Shared BiLSTM module learns the shared word boundary information used by the adversarial training task.
1) NER Private BiLSTM module
The sequence X = [x_1, x_2, ..., x_n] is input into the NER Private BiLSTM module for private feature extraction, giving the output features of the NER-task private BiLSTM, H_ner^p = [h_1^np, h_2^np, ..., h_n^np], where h_i^np denotes the NER-task private feature output at time step i. For any sentence in the audit-domain data set, the hidden state of the private BiLSTM is expressed as shown in equation (4):
h_i^np = BiLSTM(x_i, h_{i-1}^np; θ_np)    (4)
where θ_np denotes the parameters of the NER private BiLSTM, with the hidden-state dimension set as a hyperparameter.
2) CWS Private BiLSTM module
The sequence X' = [x'_1, x'_2, ..., x'_m] is input into the CWS Private BiLSTM module for private feature extraction, giving the output features of the CWS-task private BiLSTM, H_cws^p = [h_1^cp, h_2^cp, ..., h_m^cp], where h_i^cp denotes the CWS-task private feature output at time step i. For any sentence in the New Era People's Daily word segmentation corpus, the hidden state of the private BiLSTM layer is expressed as shown in equation (5):
h_i^cp = BiLSTM(x'_i, h_{i-1}^cp; θ_cp)    (5)
where θ_cp denotes the parameters of the CWS private BiLSTM, with the hidden-state dimension set as a hyperparameter.
3) Shared BiLSTM module
The combined sequence X_s is input into the Shared BiLSTM module for shared feature extraction, giving the output features of the shared BiLSTM, H_shared = [h_1^s, h_2^s, ..., h_n^s], where h_i^s denotes the shared feature of the NER and CWS tasks output at time step i. For any sentence in the combined set, the hidden state of the shared BiLSTM layer is expressed as shown in equation (6):
h_i^s = BiLSTM(x_i^s, h_{i-1}^s; θ_shared)    (6)
where θ_shared denotes the parameters of the shared BiLSTM, with the hidden-state dimension set as a hyperparameter.
In summary, the private features extracted by the NER Private BiLSTM module and the shared features extracted by the Shared BiLSTM module are concatenated to obtain the total feature H_ner of the NER task, which serves as input to the NER CRF module. The private features extracted by the CWS Private BiLSTM module and the shared features extracted by the Shared BiLSTM module are concatenated to obtain the total feature H_cws of the CWS task, which serves as input to the CWS CRF module. This is expressed by equations (7) and (8):
H_ner = H_ner^p ⊕ H_shared^ner    (7)
H_cws = H_cws^p ⊕ H_shared^cws    (8)
where H_shared^ner and H_shared^cws denote the shared BiLSTM outputs corresponding to the NER and CWS sentences, respectively.
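Continuing the sketches above, equations (7) and (8) amount to concatenating the task-private and shared BiLSTM outputs along the feature dimension (all names are illustrative and reuse BiLSTMEncoder, X and X_prime_padded from the previous sketches):

ner_private = BiLSTMEncoder()   # NER Private BiLSTM
cws_private = BiLSTMEncoder()   # CWS Private BiLSTM
shared      = BiLSTMEncoder()   # Shared BiLSTM

H_ner = torch.cat([ner_private(X), shared(X)], dim=-1)                              # eq. (7)
H_cws = torch.cat([cws_private(X_prime_padded), shared(X_prime_padded)], dim=-1)    # eq. (8)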
3 CRF layer
The BiLSTM alone only captures relations between words and does not consider dependencies between consecutive labels, so the invention uses a CRF layer to infer labels from the features produced by the BiLSTM layer. Because the label sets of the NER task and the CWS task differ, each task is assigned its own CRF layer to obtain its own sequence labels. However, the dimensionality of the BiLSTM output vector does not match the CRF tag space, so, in order to compute the loss function during CRF label inference, a fully connected layer is added on top of the vector H output by the BiLSTM. The CRF prediction process is expressed by equations (9) and (10):
o_i = A·h_i + b    (9)
s(X, y) = Σ_{i=1..n} ( K_{y_{i-1}, y_i} + o_{i, y_i} )    (10)
where A is the weight matrix, b is the bias term, X is the input sequence, y is a predicted tag sequence, K is the transition probability matrix, K_{y_{i-1}, y_i} is the probability score of transferring from label y_{i-1} to label y_i, o_{i, y_i} is the score of assigning label y_i to character x_i, and n is the sentence length. A negative log-likelihood function is used as the loss function; the probability of the true tag sequence is expressed as equation (11):
P(ȳ | X) = exp( s(X, ȳ) ) / Σ_{ỹ ∈ Y_X} exp( s(X, ỹ) )    (11)
where ȳ is the true tag sequence, Y_X is the set of all possible tag sequences, s(X, ȳ) is the score of the correct tag sequence, and the denominator sums the scores of all tag sequences.
1) NER CRF module
For H_ner, training with equations (9) to (11) yields the loss function L_ner, expressed as shown in equation (12):
L_ner = −log P(ȳ_ner | X_ner)    (12)
2) CWS CRF module
For H_cws, training with equations (9) to (11) yields the loss function L_cws, expressed as shown in equation (13):
L_cws = −log P(ȳ_cws | X_cws)    (13)
The training process is continually tuned to minimize the loss function.
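A minimal sketch of such a task-specific CRF head; the pytorch-crf package is an assumed implementation choice (the patent does not name one), and its crf() call returns the log-likelihood of the gold tag sequence, so its negation gives L_ner or L_cws:

import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf (assumed third-party implementation)

class CRFHead(nn.Module):
    def __init__(self, feature_dim: int, num_tags: int):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_tags)     # o_i = A h_i + b, eq. (9)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, H: torch.Tensor, tags: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        emissions = self.fc(H)
        # crf() returns the log-likelihood of the true tag sequence (eq. (11));
        # negating it gives the task loss of eq. (12) or eq. (13).
        return -self.crf(emissions, tags, mask=mask)

    def decode(self, H: torch.Tensor, mask: torch.Tensor):
        return self.crf.decode(self.fc(H), mask=mask)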
4 adversarial training layer
Inspired by Generative Adversarial Networks (GAN), adversarial training is used to extract the information shared by the NER and CWS tasks while effectively preventing noise caused by CWS-task private information. A task discriminator identifies which task the features come from through a Max-pooling layer and a Softmax layer; when the model cannot identify which task the features come from, the shared feature extractor has extracted features shared by both tasks, which improves the performance of named entity recognition. The task discriminator is expressed by equations (14) and (15):
s = Maxpooling(H_shared)    (14)
D(s; δ_d) = Softmax(A_1·s + b_1)    (15)
where H_shared is the output of the shared feature extraction layer and δ_d denotes the parameters of the task discriminator, namely the weight A_1 and the bias term b_1.
To prevent private information of the Chinese word segmentation task from entering the shared information space, an adversarial loss function L_adv is introduced to train the shared feature extractor so that the task discriminator cannot reliably identify which task the features come from. The adversarial loss function is expressed as shown in equation (16):
L_adv = min_{δ_s} max_{δ_d} Σ_{i=1..I} Σ_{j=1..J} log D( E_s(x_j^i) )    (16)
where δ_s denotes the shared BiLSTM parameters θ_shared, δ_d denotes the task discriminator parameters, I is the total number of tasks, J is the number of training samples, E_s is the shared feature extractor, x_j^i is the j-th shared-feature sample of task i, and D(·) is the probability the discriminator assigns to the sample's true task.
Through training, the loss of the task discriminator is continually minimized, adversarially encouraging the shared feature extractor to learn the word boundary information shared by the tasks. When training finishes, the shared feature extractor and the task discriminator reach an equilibrium in which the discriminator cannot distinguish which task the features come from.
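A sketch of one common way to implement this min-max objective, using a gradient reversal layer so that minimizing the discriminator's cross-entropy also trains the shared extractor to confuse it; the patent does not prescribe this exact mechanism, so it is an illustrative assumption:

import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass, reverses the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class TaskDiscriminator(nn.Module):
    """Eqs. (14)-(15): max-pooling over time followed by a softmax task classifier."""
    def __init__(self, feature_dim: int, num_tasks: int = 2):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_tasks)   # A_1 s + b_1

    def forward(self, H_shared: torch.Tensor) -> torch.Tensor:
        s, _ = H_shared.max(dim=1)            # Maxpooling over the sequence dimension
        s = GradientReversal.apply(s)         # adversarial signal for the shared BiLSTM
        return self.classifier(s)             # logits; the softmax is applied in the loss

adv_criterion = nn.CrossEntropyLoss()         # -log D(E_s(x)) for the true task label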
Third, model training
From the NER task loss function L_ner, the CWS task loss function L_cws and the adversarial loss function L_adv above, the final loss function L of the model is expressed as shown in equation (17):
L = G·L_ner + (1 − G)·L_cws + γ·L_adv    (17)
where γ is the loss weight coefficient and G is a switching function that indicates whether the current input comes from the NER task or the CWS task.
In the process of training the model, a training example is extracted from a given task to update parameters, a final loss function is continuously optimized, and iteration is carried out according to the convergence rate of the NER task until the result is optimal.
The pseudo code of the invention is provided as figures in the original filing.
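As a rough illustration of the training procedure, a simplified loop implementing the joint loss of equation (17), alternating NER and CWS batches; all module, loader and field names are assumptions that stand in for the components sketched above:

import torch
import torch.nn as nn

def train_epoch(model, ner_crf, cws_crf, discriminator, batches, optimizer, gamma: float = 0.05):
    """One epoch of the joint objective L = G*L_ner + (1 - G)*L_cws + gamma*L_adv (eq. (17)).

    `batches` yields (task_id, batch) pairs, task_id 0 for NER and 1 for CWS;
    `model`, `ner_crf`, `cws_crf` and `discriminator` stand for the modules sketched above."""
    adv_criterion = nn.CrossEntropyLoss()
    for task_id, batch in batches:
        H_private, H_shared = model(batch, task_id)        # private + shared BiLSTM features
        H_total = torch.cat([H_private, H_shared], dim=-1)

        # Switching function G: an NER batch contributes L_ner, a CWS batch contributes L_cws.
        crf = ner_crf if task_id == 0 else cws_crf
        task_loss = crf.loss(H_total, batch["tags"], batch["mask"])

        # Adversarial loss: the discriminator tries to recover the task id, while the
        # gradient reversal layer pushes the shared features to hide it (eq. (16)).
        logits = discriminator(H_shared)
        target = torch.full((logits.size(0),), task_id, dtype=torch.long)
        adv_loss = adv_criterion(logits, target)

        loss = task_loss + gamma * adv_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()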
Fourth, experiments and results
1 Experimental setup
The hyperparameter values of the model are obtained through cross-validation. The word-vector dimensionality is 768, the LSTM hidden-state dimensionality is set to 120, the loss weight coefficient γ is set to 0.05, the initial learning rate is 0.001, Dropout is 0.5, the batch size is 64, and the number of iterations is 20; the experiments are optimized with the Adam algorithm.
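Collected as a configuration sketch (the dictionary form is only an illustration of the stated hyperparameters):

config = {
    "word_vector_dim": 768,     # BERT output dimensionality
    "lstm_hidden_dim": 120,
    "loss_weight_gamma": 0.05,
    "learning_rate": 1e-3,
    "dropout": 0.5,
    "batch_size": 64,
    "epochs": 20,
    "optimizer": "Adam",
}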
2 evaluation index
The experiments use precision (P), recall (R) and the F1 value to evaluate model performance; the calculation formulas are shown in equations (18) to (20):
P = TP / (TP + FP)    (18)
R = TP / (TP + FN)    (19)
F1 = 2·P·R / (P + R)    (20)
where TP is the number of positive samples predicted as positive, FP is the number of negative samples predicted as positive, and FN is the number of positive samples predicted as negative.
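Equations (18) to (20) transcribed directly:

def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1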
3 results and conclusions of the experiment
TABLE 1 Comparison of model results (the table is provided as an image in the original publication)
Conclusion: the comparison of experimental results shows that the method proposed in this patent effectively improves the F1 value on audit-field corpora.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (1)

1. An audit field named entity recognition method based on countermeasure training, characterized in that the method comprises the following steps:
S1): acquisition of the data sets: an audit-field data set is used as the NER data set of the invention; the New Era People's Daily word segmentation corpus is used as the CWS data set; the CWS task is used to assist the NER task.
S11): NER dataset
The audit-field data set is built by collecting corpus items related to poverty-alleviation policies from government websites with a web crawler, screening sentences of 10 to 100 characters, and preprocessing the raw data, including removing non-text parts, unifying the encoding and segmenting the text; the corpus is divided into a training set, a validation set and a test set in a 7:2:1 ratio, and four entity types (person name, place name, organization name and proper noun) are labeled manually in BIO mode.
S12): CWS dataset
The New Era People's Daily word segmentation corpus is obtained from the website of the Humanities and Social Computing Research Center of Nanjing Agricultural University (http://corpus.njau.edu.cn).
S2): constructing a model: the model framework provided by the invention longitudinally comprises three tasks, wherein the left side is named as an entity identification task and comprises an NER BERT Embedding module, an NER Private BilSTM module and an NER CRF module; the right side is a Chinese word segmentation task which comprises a CWS BERT Embedding module, a CWS Private BilSTM module and a CWS CRF module; the middle part is an antagonistic training task which comprises a Shared BilSTM module and an antagonistic training module; the three tasks comprise an embedding layer, a sharing-private feature extraction layer and a CRF layer or an antagonistic training layer in a transverse direction, and the structure is introduced according to the three tasks in the transverse direction.
S21): embedding layer
The corpus is input into the embedding layer; BERT encodes with Transformer and introduces a Self-attention mechanism to model the dependencies between words and capture the internal structure of a sentence; input sentences longer than n are truncated and sentences shorter than n are padded with 0; a [CLS] vector representing the input is added at the beginning of the sentence and a [SEP] vector separates sentence pairs, and training on the sentence yields more accurate semantic information; Segment embedding is then used to judge whether given sentences are consecutive, providing sentence-level features; since the word order of text is crucial to sentence meaning, BERT encodes each character position independently and learns the order characteristics of the input sequence, obtaining the information of each position; finally, the vectors obtained from Token embedding, Segment embedding and Position embedding are summed to obtain the output sequence of BERT.
S211): NER BERT Embedding module
Using the audit-domain data set for the NER task, a given sentence W = [w_1, w_2, ..., w_n] is input into the NER BERT Embedding module, which outputs the sequence of word vectors X = [x_1, x_2, ..., x_n], where w_i is a word (character) in the sentence, x_i is the word vector corresponding to w_i, and n is the sentence length.
S212): CWS BERT Embedding module
Using the New Era People's Daily word segmentation corpus for the CWS task, a given sentence W' = [w'_1, w'_2, ..., w'_m] is input into the CWS BERT Embedding module, which outputs the sequence of word vectors X' = [x'_1, x'_2, ..., x'_m], where w'_i is a word (character) in the sentence, x'_i is the word vector corresponding to w'_i, m is the sentence length, and n > m;
in summary, each vector sequence in X' is padded to length n, and the padded X' is concatenated below X to obtain the combined sequence X_s, which serves as the input for extracting shared information in the adversarial training task.
S22): shared-private feature extraction layer
Performing feature extraction by adopting bidirectional LSTM; given an input sequence to perform feature extraction, the hidden state at the ith time represents the output features as shown in equations (1) to (3):
h_i^fw = LSTM(x_i, h_{i-1}^fw)    (1)
h_i^bw = LSTM(x_i, h_{i+1}^bw)    (2)
h_i = h_i^fw ⊕ h_i^bw    (3)
where h_i^fw and h_i^bw denote the forward and backward hidden states at time step i, respectively, and ⊕ denotes the concatenation operation.
S221): NER Private BiLSTM module
The sequence X = [x_1, x_2, ..., x_n] is input into the NER Private BiLSTM module for private feature extraction, giving the output features of the NER-task private BiLSTM, H_ner^p = [h_1^np, h_2^np, ..., h_n^np], where h_i^np denotes the NER-task private feature output at time step i; for any sentence in the audit-field data set, the hidden state of the private BiLSTM is expressed as shown in equation (4):
h_i^np = BiLSTM(x_i, h_{i-1}^np; θ_np)    (4)
where θ_np denotes the parameters of the NER private BiLSTM, with the hidden-state dimension set as a hyperparameter.
S222): CWS Private BiLSTM module
The sequence X' = [x'_1, x'_2, ..., x'_m] is input into the CWS Private BiLSTM module for private feature extraction, giving the output features of the CWS-task private BiLSTM, H_cws^p = [h_1^cp, h_2^cp, ..., h_m^cp], where h_i^cp denotes the CWS-task private feature output at time step i; for any sentence in the New Era People's Daily word segmentation corpus, the hidden state of the private BiLSTM layer is expressed as shown in equation (5):
h_i^cp = BiLSTM(x'_i, h_{i-1}^cp; θ_cp)    (5)
where θ_cp denotes the parameters of the CWS private BiLSTM, with the hidden-state dimension set as a hyperparameter.
S223): shared BilSTM module
Will be sequenced
Figure FDA0003494285390000041
The Shared BilSTM module is input for Shared feature extraction, and the output feature of the Shared BilSTM can be obtained
Figure FDA0003494285390000042
Wherein,
Figure FDA0003494285390000043
representing the shared characteristics of the NER task and the CWS task output at the ith moment; for any sentence in the set, the hidden state of the shared BilSTM layer is represented as shown in equation (6):
Figure FDA0003494285390000044
wherein, thetasharedDimension setting for hidden state for sharing the BilSTM parameter.
In summary, the private features extracted by the NER Private BiLSTM module and the shared features extracted by the Shared BiLSTM module are concatenated to obtain the total feature H_ner of the NER task, which serves as input to the NER CRF module; the private features extracted by the CWS Private BiLSTM module and the shared features extracted by the Shared BiLSTM module are concatenated to obtain the total feature H_cws of the CWS task, which serves as input to the CWS CRF module; this is expressed by equations (7) and (8):
H_ner = H_ner^p ⊕ H_shared^ner    (7)
H_cws = H_cws^p ⊕ H_shared^cws    (8)
where H_shared^ner and H_shared^cws denote the shared BiLSTM outputs corresponding to the NER and CWS sentences, respectively.
S23): CRF layer
Label inference is performed on the features produced by the BiLSTM layer using a CRF layer, and a fully connected layer is added on top of the vector H output by the BiLSTM; the CRF prediction process is expressed by equations (9) and (10):
o_i = A·h_i + b    (9)
s(X, y) = Σ_{i=1..n} ( K_{y_{i-1}, y_i} + o_{i, y_i} )    (10)
where A is the weight matrix, b is the bias term, X is the input sequence, y is a predicted tag sequence, K is the transition probability matrix, K_{y_{i-1}, y_i} is the probability score of transferring from label y_{i-1} to label y_i, o_{i, y_i} is the score of assigning label y_i to character x_i, and n is the sentence length; a negative log-likelihood function is used as the loss function, and the probability of the true tag sequence is expressed as equation (11):
P(ȳ | X) = exp( s(X, ȳ) ) / Σ_{ỹ ∈ Y_X} exp( s(X, ỹ) )    (11)
where ȳ is the true tag sequence, Y_X is the set of all possible tag sequences, s(X, ȳ) is the score of the correct tag sequence, and the denominator sums the scores of all tag sequences.
S231): NER CRF module
For H_ner, training with equations (9) to (11) yields the loss function L_ner, expressed as shown in equation (12):
L_ner = −log P(ȳ_ner | X_ner)    (12)
S232): CWS CRF module
For H_cws, training with equations (9) to (11) yields the loss function L_cws, expressed as shown in equation (13):
L_cws = −log P(ȳ_cws | X_cws)    (13)
the training process is continually tuned to minimize the loss function.
S24): Adversarial training layer:
The task discriminator identifies which task the features come from through a Max-pooling layer and a Softmax layer; when the model cannot identify which task the features come from, the shared feature extractor has extracted features shared by both tasks, which improves the performance of named entity recognition; the task discriminator is expressed by equations (14) and (15):
s = Maxpooling(H_shared)    (14)
D(s; δ_d) = Softmax(A_1·s + b_1)    (15)
where H_shared is the output of the shared feature extraction layer and δ_d denotes the parameters of the task discriminator, namely the weight A_1 and the bias term b_1;
in order to prevent the private information of the Chinese word segmentation task from entering a shared information space, a loss-resisting function L is introducedadvTraining the shared feature extractor to make the task discriminator unable to effectively identify which task the feature comes from, the fighting loss function can be expressed as shown in equation (16):
L_adv = min_{δ_s} max_{δ_d} Σ_{i=1..I} Σ_{j=1..J} log D( E_s(x_j^i) )    (16)
where δ_s denotes the shared BiLSTM parameters θ_shared, δ_d denotes the task discriminator parameters, I is the total number of tasks, J is the number of training samples, E_s is the shared feature extractor, x_j^i is the j-th shared-feature sample of task i, and D(·) is the probability the discriminator assigns to the sample's true task.
S3): model training
From the NER task loss function L_ner, the CWS task loss function L_cws and the adversarial loss function L_adv above, the final loss function L of the model is expressed as shown in equation (17):
L = G·L_ner + (1 − G)·L_cws + γ·L_adv    (17)
where γ is the loss weight coefficient and G is a switching function that indicates whether the current input comes from the NER task or the CWS task;
in the process of training the model, a training example is extracted from a given task to update parameters, a final loss function is continuously optimized, and iteration is carried out according to the convergence rate of the NER task until the result is optimal.
CN202210109168.0A 2022-01-28 2022-01-28 Audit field named entity recognition method based on countermeasure training Pending CN114462409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210109168.0A CN114462409A (en) 2022-01-28 2022-01-28 Audit field named entity recognition method based on countermeasure training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210109168.0A CN114462409A (en) 2022-01-28 2022-01-28 Audit field named entity recognition method based on countermeasure training

Publications (1)

Publication Number Publication Date
CN114462409A true CN114462409A (en) 2022-05-10

Family

ID=81410574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210109168.0A Pending CN114462409A (en) 2022-01-28 2022-01-28 Audit field named entity recognition method based on countermeasure training

Country Status (1)

Country Link
CN (1) CN114462409A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115470871A (en) * 2022-11-02 2022-12-13 江苏鸿程大数据技术与应用研究院有限公司 Policy matching method and system based on named entity recognition and relation extraction model
CN115470871B (en) * 2022-11-02 2023-02-17 江苏鸿程大数据技术与应用研究院有限公司 Policy matching method and system based on named entity recognition and relation extraction model
CN115630649A (en) * 2022-11-23 2023-01-20 南京邮电大学 Medical Chinese named entity recognition method based on generative model
CN116227483A (en) * 2023-02-10 2023-06-06 南京南瑞信息通信科技有限公司 Word boundary-based Chinese entity extraction method, device and storage medium
CN117807999A (en) * 2024-02-29 2024-04-02 武汉科技大学 Domain self-adaptive named entity recognition method based on countermeasure learning
CN117807999B (en) * 2024-02-29 2024-05-10 武汉科技大学 Domain self-adaptive named entity recognition method based on countermeasure learning

Similar Documents

Publication Publication Date Title
CN109871451B (en) Method and system for extracting relation of dynamic word vectors
CN114462409A (en) Audit field named entity recognition method based on countermeasure training
CN106844349B (en) Comment spam recognition methods based on coorinated training
CN111460092B (en) Multi-document-based automatic complex problem solving method
CN110222178A (en) Text sentiment classification method, device, electronic equipment and readable storage medium storing program for executing
CN106682089A (en) RNNs-based method for automatic safety checking of short message
CN110287323A (en) A kind of object-oriented sensibility classification method
CN114818717B (en) Chinese named entity recognition method and system integrating vocabulary and syntax information
CN110909529A (en) User emotion analysis and prejudgment system of company image promotion system
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111144119A (en) Entity identification method for improving knowledge migration
CN111666752A (en) Circuit teaching material entity relation extraction method based on keyword attention mechanism
CN114297399A (en) Knowledge graph generation method, knowledge graph generation system, storage medium and electronic equipment
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN115934951A (en) Network hot topic user emotion prediction method
CN115906816A (en) Text emotion analysis method of two-channel Attention model based on Bert
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
Li et al. Phrase embedding learning based on external and internal context with compositionality constraint
CN113869054A (en) Deep learning-based electric power field project feature identification method
CN111563374B (en) Personnel social relationship extraction method based on judicial official documents
CN115906846A (en) Document-level named entity identification method based on double-graph hierarchical feature fusion
Ren et al. Named-entity recognition method of key population information based on improved BiLSTM-CRF model
CN114357166A (en) Text classification method based on deep learning
CN113535936A (en) Deep learning-based regulation and regulation retrieval method and system
Guo RETRACTED: An automatic scoring method for Chinese-English spoken translation based on attention LSTM [EAI Endorsed Scal Inf Syst (2022), Online First]

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination