CN116882402A - Multi-task-based electric power marketing small sample named entity identification method - Google Patents

Multi-task-based electric power marketing small sample named entity identification method

Info

Publication number
CN116882402A
CN116882402A (application CN202310589142.5A)
Authority
CN
China
Prior art keywords
entity
seed
seeds
representing
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310589142.5A
Other languages
Chinese (zh)
Inventor
张希翔
蒙琦
艾徐华
董贇
练宇婷
黄汉华
周迪贵
古哲德
覃宁
陶思恒
孟椿智
谢菁
谭期文
韦宗慧
银源
陈燕
谭志湘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Power Grid Co Ltd
Original Assignee
Guangxi University
Guangxi Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University, Guangxi Power Grid Co Ltd filed Critical Guangxi University
Priority to CN202310589142.5A priority Critical patent/CN116882402A/en
Publication of CN116882402A publication Critical patent/CN116882402A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a multi-task-based method for identifying named entities in small electric power marketing samples, which comprises the following steps: S1, collecting a power marketing named entity recognition corpus and performing data processing; S2, inputting the text into a RoBERTa encoder to obtain a feature representation; S3, inputting the feature representation obtained in S2 into an IDCNN layer for expansion convolution; S4, inputting the text data after the expansion convolution into a seed module and predicting seed scores for each single character and each double character: the text is segmented into a seed set comprising one-Chinese-character and two-Chinese-character seeds, the score of each seed in the set is calculated, and if the score is larger than the seed score threshold ω, the seed is used as a candidate seed; S5, expanding the candidate seeds to obtain complete entities; S6, inputting the entities into the RoBERTa encoder to obtain the final classification result. Through span labeling, the method improves the model's recognition of nested entities. A whole-word masking strategy is adopted, which improves the learning capacity of the pre-training model.

Description

Multi-task-based electric power marketing small sample named entity identification method
Technical Field
The invention relates to the technical field of named entity recognition, in particular to a method for recognizing named entities of small electric power marketing samples based on multitasking.
Background
With the rapid development of information technology and the comprehensive construction of smart power grids, power data has started to grow sharply and has reached a considerable scale. Power data can be divided into unstructured data, semi-structured data and structured data; about 80% of power data is unstructured or semi-structured, and only about 20% is structured. In order to improve staff efficiency and assist professionals in making decisions in power business, especially when monitoring power marketing conditions in practice, it is important to extract useful information from massive amounts of information. Using models to identify entities in power-domain data, which contains many proper nouns and nested entities, supports power distribution and decisions such as whether to shut down a particular station, and the model proposed for the power marketing scenario performs well at entity recognition in such data. Constructing a power marketing knowledge graph system is also very necessary. Processing massive texts in the electric power marketing field, handling the large number of proper nouns, identifying the entities in the texts and classifying them into preset categories is an important prerequisite for constructing such a knowledge graph system.
Named entity recognition (NER) is a basic technology in text processing and is widely applied in natural language processing, intention recognition, recommendation systems, knowledge graphs and other fields. Its main task is to recognize meaningful entities in massive data and classify them; for example, the named entity recognition task in the power marketing field is to recognize entities related to power marketing data, such as voltage levels, transmission lines, stations and addresses. In early entity recognition tasks, rule-based methods achieved high accuracy and low recall because they were relatively easy to implement and did not require training. However, such methods require building professional domain knowledge, consume a large amount of manpower and material resources, and have insufficient generalization capability. The core of statistical-model-based approaches is to select an appropriate training model for a particular research context. Compared with rule-based methods, they omit many complicated rule designs, can train on manually annotated corpora in a shorter time, and improve training efficiency. At the same time, statistical-model-based methods only retrain the model on the training set of the specific field, so their robustness is better. With the rapid development of deep learning, applying deep learning to named entity recognition tasks has gradually become mainstream and achieved better results. A typical model is that of Huang et al., who proposed a bidirectional Long Short-Term Memory (BiLSTM) and conditional random field (CRF) model. Xie et al. proposed a BERT-BiLSTM-CRF entity recognition model to better extract the features of dataset text. However, these models rely on a large amount of training data and are not suitable for cases where labeled data is scarce.
In recent years, few-shot named entity recognition has attracted much attention; its purpose is to identify entities based on a few labeled instances per category. Cui et al. proposed the TemplateNER model, which enumerates all text spans of a text as candidate entities by exhaustion and constructs corresponding prompt templates that are fed into a Seq-to-Seq model for classification. Ma et al. proposed a template-free approach that reformulates the named entity recognition task as a language modeling problem without any templates. However, these two models are not suitable for Chinese text recognition and do not solve the problem of entity nesting.
Currently, there are relatively few studies on power marketing named entity identification, and Chinese named entity identification on power marketing data faces two problems: first, the entities in electric power marketing data contain many proper nouns and have no standard word-formation specification, so operations such as word segmentation, classification and semantic mining on texts in the electric power field are difficult; second, few annotated and published datasets of electric power marketing data exist, so no large training set is available for training the model.
Disclosure of Invention
The invention aims at least solving the technical problems in the prior art, and particularly creatively provides a method for identifying named entities of small electric marketing samples based on multitasking.
In order to achieve the above object of the present invention, the present invention provides a method for identifying named entities of small samples for electric power marketing based on multitasking, comprising the steps of:
s1, collecting power marketing named entity recognition corpus and performing data processing: s1-1, collecting power marketing data, and converting structured data and unstructured power data into unstructured power marketing data; the unstructured electricity marketing data is text; s1-2, marking the electric power marketing data by adopting a span marking mode: marking the beginning position, the ending position and the entity type of the entity in the sentence;
s2, inputting the text into a RoBERTa encoder to obtain a characteristic representation;
s3, inputting the feature representation obtained in S2 into an IDCNN layer for expansion convolution, which enlarges the receptive field of the convolutional neural network by increasing the spacing of the convolution kernels and thereby reduces information loss;
s4, inputting the text data after the expansion convolution into a seed module, and predicting seed scores for each single character and each double character: the text is segmented into a seed set comprising one-Chinese-character and two-Chinese-character seeds, the score of each seed in the set is calculated, and if the score is larger than the seed score threshold ω, the seed is used as a candidate seed;
s5, expanding the candidate seeds so as to obtain complete entities: setting the seed offset γ and expanding candidate seeds in the left and right directions to obtain complete entities;
s6, inputting the entities into the RoBERTa encoder to obtain the final classification result: firstly, constructing implication pairs, where an implication pair comprises a hypothesis sentence and a premise, the hypothesis sentence is obtained by combining the entity span obtained by the expansion task with a predefined Prompt, and the premise is the input text; then, the implication pairs are fed into the RoBERTa encoder for the classification task, and the classification results are output.
Further, the method further comprises the following steps: preprocessing unstructured power data of S1-1: non-text data and special characters are deleted.
Further, the entity types include time, level, line, station, org, equ, name, add and other, where time represents time, level represents voltage level, line represents power transmission equipment, station represents plant station, org represents organization, equ represents equipment appliances, name represents personnel name, add represents address, and other represents other entities.
Further, the step S2 includes:
firstly, the text is processed by a word segmentation device (tokenizer) to obtain a segmented text sequence; then, whole-word masking is applied to some words of the segmented sequence, a special token [CLS] is added at the beginning of the sequence, and sentences are separated by the token [SEP]; finally, the sequence vector is input into the RoBERTa encoder for feature extraction to obtain a sequence vector containing rich semantic features, namely the feature representation.
Further, the whole word mask comprises the following steps:
firstly, chinese word segmentation is carried out on a text by utilizing a jieba word segmentation device, wherein proper nouns in a power grid dictionary are fused before word segmentation is carried out by utilizing the word segmentation device, so that the word segmentation result of a model is more accurate;
then, a complete word is masked using multiple consecutive [MASK] tokens, so that the model must predict the masked word; in this way the model obtains more feature information, and the deviation caused by incomplete semantics during context prediction is alleviated.
Further, the score of each seed is calculated as follows:
for a seed s_i = (l_i, r_i), where l_i and r_i respectively denote the left and right boundaries of seed s_i, the score is computed as:
v_{s_i} = Concat(v_{l_i}, v_{r_i}, MeanPooling(v_{l_i:r_i}), v_{[CLS]})
\hat{y}_{s_i} = Sigmoid(MLP(v_{s_i}))
wherein v_{l_i} denotes the left-boundary representation of seed s_i; v_{r_i} denotes the right-boundary representation of seed s_i; v_{l_i:r_i} denotes all representations from the left boundary to the right boundary of seed s_i; v_{[CLS]} is the representation of [CLS]; MeanPooling(·) is the average pooling operation applied to all representations in the span interval of seed s_i; Concat(·) is the splicing operation; v_{s_i} is the representation of the seed, which is input into the multi-layer perceptron MLP whose result is fed into the Sigmoid activation function; and \hat{y}_{s_i} is the final score of the seed.
Further, the candidate seeds are expanded in the left and right directions to obtain complete entities as follows:
the left boundary l_i and the right boundary r_i of a seed s_i = (l_i, r_i) are each shifted by at most γ, so the largest obtainable entity has length 2+2γ; the offset boundaries and the span w_i of the final expanded entity are computed as:
\hat{l}_i = l_i - γ, \hat{r}_i = r_i + γ, w_i = (\hat{l}_i, \hat{r}_i)
wherein \hat{l}_i is the value after the seed entity is shifted to the left and \hat{r}_i is the value after the seed entity is shifted to the right.
Then, the final candidate entity is determined from the computed expanded-entity span. The left-right boundary offset o_i of the candidate entity is computed as:
v_{s_i}^{exp} = Concat(MeanPooling(v_{l_i:r_i}), v_{\hat{l}_i}, v_{\hat{r}_i}, MeanPooling(v_{\hat{l}_i:\hat{r}_i}))
o_i = Sigmoid(MLP(v_{s_i}^{exp}))
wherein MeanPooling(v_{l_i:r_i}) denotes the result of average-pooling the vector represented by the i-th seed span of the seed stage; MeanPooling(·) is the average pooling operation; v_{\hat{l}_i} denotes the representation of the left-shifted value of the seed entity; v_{\hat{r}_i} denotes the representation of the right-shifted value of the seed entity; v_{\hat{l}_i:\hat{r}_i} denotes all representations between the left-shifted value and the right-shifted value of the seed entity; Concat(·) is the splicing operation; MLP is the multi-layer perceptron into which v_{s_i}^{exp} is input; Sigmoid is the activation function; and v_{s_i}^{exp} denotes the expanded representation of seed s_i;
o_i ∈ R^2, its first element o_i^l representing the final left shift of the entity span and its second element o_i^r representing the final right shift of the entity span, with the offsets lying in [-γ, γ];
the left and right boundaries of the final entity can be determined as follows:
l'_i = l_i - ⌊n·o_i^l⌋, r'_i = r_i + ⌊n·o_i^r⌋
wherein l_i, r_i respectively denote the left and right boundaries of the entity; l'_i, r'_i respectively denote the left and right boundaries of the final entity; n denotes the maximum offset of the expansion in one direction in the expansion stage; and ⌊·⌋ denotes rounding down;
in the process of calculating the result, any result with l'_i > r'_i is deemed invalid and is discarded.
Further, the method is based on a WmDRoBERTa-IDCNN-TMLP-FL model, the model comprises a data processing module, a RoBERTa coding layer, an IDCNN layer, a seed module, an expansion module and an implication module, the processing module is connected with the RoBERTa coding layer, the RoBERTa coding layer is connected with the IDCNN layer, the IDCNN layer is connected with the seed module, the seed module is connected with the expansion module, the expansion module is connected with the implication module, and the implication module is connected with the RoBERTa coding layer;
the processing module is used for: firstly, processing through a word segmentation device to obtain a word segmentation text sequence; then, carrying out full word shielding on partial words of the word segmentation sequence;
roberta coding layer: extracting features of the input sequence vector;
IDCNN layer: the space of the convolution kernel is increased by adopting expansion convolution to enlarge the receptive field of the convolution neural network, so that the information loss is reduced;
seed module: predicting seed scores of each single character and each double character, and finding candidate seeds;
and an expansion module: expanding the candidate seeds obtained by the seed module to obtain final candidate entities;
the implication module: combining the candidate entity obtained by the expansion task with the predefined Prompt as the hypothesis, and taking the input text as the premise, thereby forming a complete implication pair.
Further, the loss function employed for training the WmDRoBERTa-IDCNN-TMLP-FL model is the focal loss function, whose formula is:
FL(p_t) = -α_t · (1 - p_t)^λ · log(p_t)    (14)
wherein α_t denotes the focal factor; (1 - p_t)^λ denotes the modulation factor, with λ > 0; and p_t denotes the probability that the label is correctly identified.
In summary, due to the adoption of the technical scheme, the invention has the following advantages:
(1) And a labeling mode based on span representation is adopted, so that the identification of the model to the nested entity is improved.
(2) The masking strategy of the RoBERTa model is changed into a full-word masking strategy, so that the learning capacity of the pre-training model is improved.
(3) The focus loss function is introduced to solve the problem that the entity quantity difference in the electric marketing data is large.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of an exemplary corpus labeling of the present invention.
FIG. 2 is a flow chart of the power marketing data named entity identification of the present invention.
FIG. 3 is a diagram showing the construction of WmDRoBERTa-IDCNN-TMLP-FL model according to the present invention.
Fig. 4 is a schematic diagram of the RoBERTa generated input vector of the present invention.
FIG. 5 is a schematic diagram of a whole word masking strategy of the present invention.
Fig. 6 is a schematic diagram of the dilation convolution of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
1. Data set construction
Partial structured data and unstructured data are extracted from a smart grid system of the southern power grid. At present, these data are only used for query operations by internal personnel and are not mined in depth to extract important information. The smart grid system of the southern power grid company contains a large amount of unstructured data, and mining it manually is inefficient and costly. Therefore, modeling is performed by means of deep learning and model training is performed in a supervised manner.
1.1 pretreatment
Structured and unstructured data are extracted from the power marketing data provided by the southern power grid. Before labeling, all the data are converted into unstructured data, which is sorted and denoised; 828 pieces of data are finally extracted, and the raw data are preprocessed, including deleting non-text data and special characters, to obtain a standard corpus.
1.2 labeling System
According to the proprietary noun lexicon and business requirements provided by the southern power grid, 9 entity type labels are designed, namely "time", "level", "line", "station", "org", "equ", "name", "add" and "other"; the entity types are interpreted as follows: "time" indicates time, "level" indicates voltage level, "line" indicates power transmission equipment, "station" indicates plant station, "org" indicates organization, "equ" indicates equipment appliances, "name" indicates person name, "add" indicates address, and "other" indicates others. Entity labeling is performed with the label-studio tool using a span-representation-based labeling scheme. The span labeling scheme marks the beginning position, ending position and entity type of each entity in the sentence; a labeling example is shown in the corpus labeling example of FIG. 1.
The corpus contains 7240 entities, wherein 261 time entities, 1017 voltage level entities, 1328 power transmission line entities, 609 plant stations, 678 organization entities, 1819 equipment entities, 336 personnel name entities, 886 address entities and 306 other entities, and the entity sample number statistics of the corpus, training set, verification set and test set are shown in table 1.
Table 1 corpus entity quantity statistics
2. Design for identifying power grid marketing naming entity
2.1 Process flow
In the field of electric power marketing, annotated datasets are lacking and no large amount of data is available for training a model; in addition, the datasets contain many proper nouns and nested entities, so traditional named entity recognition methods perform poorly. Therefore, a multi-task-based method for identifying named entities in small electric power marketing samples is proposed. The method mainly comprises five parts: electric power marketing named entity recognition corpus construction, seed label construction, label expansion, text implication pair construction and language model fine-tuning. The flow chart of the method is shown in FIG. 2; LM in FIG. 2 is the pre-trained language model, namely RoBERTa.
The inventive model structure is shown in fig. 3. Firstly, the electric power marketing data set of the south electric network is input into RoBERTa for representation, the IDCNN model is utilized for obtaining longer-distance context information, and the parallelism of the GPU can be fully utilized, so that the training speed is accelerated. Next, the representation of the sentence is split into one word and a combination of two words as seed entities, for example: "1", "0", …, "south jin", "jin station". And then, sending the representation of the seed entity into an expansion module, and extending the seed entity leftwards or rightwards so as to obtain a complete candidate entity span, wherein the candidate entity can be obtained through the candidate entity span. And combining the candidate entity with a Prompt which is defined in advance according to the entity type, constructing a hypothesis of the implication task, and taking the input sentence as a precondition of the implication task, thereby obtaining a text implication pair. Finally, text inclusion pairs are entered into RoBERTa for training and classification.
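To make this five-stage flow concrete, the following is a compact sketch of how the stages hand data to each other. Every callable here (encode, idcnn, score_seeds, expand, classify) is a placeholder standing in for the components detailed in sections 2.2-2.6 and is assumed for illustration, not taken from the original implementation.

```python
# End-to-end sketch of the multi-task pipeline: encode -> IDCNN -> seed scoring
# -> span expansion -> implication classification. All callables are placeholders.
def recognize_entities(text, encode, idcnn, score_seeds, expand, classify,
                       seed_threshold=0.6):
    v = idcnn(encode(text))                              # RoBERTa features + dilated conv
    spans = [(i, i) for i in range(len(text))] + \
            [(i, i + 1) for i in range(len(text) - 1)]   # 1- and 2-character seeds
    candidates = [s for s in spans if score_seeds(v, s) > seed_threshold]
    entities = [e for e in (expand(v, s) for s in candidates) if e is not None]
    results = []
    for l, r in entities:
        for etype in ("time", "level", "line", "station", "org",
                      "equ", "name", "add", "other"):
            if classify(text, text[l:r + 1], etype):     # implication: premise vs. hypothesis
                results.append((l, r, etype))
    return results
```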
2.2Roberta and full word masking
BERT (Bidirectional Encoder Representation from Transformers) is a transform-based deep bi-directional pre-trained language model. The model utilizes large-scale unlabeled text corpus, and through sufficient self-supervision training, text context semantic features are effectively learned, and deep text vector representation is obtained.
Roberta is an extension and improvement to BERT. The model changes an original static shielding strategy into a dynamic shielding strategy on the basis of BERT, and enables the model to be shielded randomly in each training process so as to improve the adaptability of the model. Meanwhile, roBERTa uses larger data sets, larger batch sizes, longer pre-training times, longer input sequences, and more data enhancement techniques to improve the model's ability. Liu et al found that adding NSP (Next Sentence Prediction) task did not improve the model's effect during the trial and error of the Roberta model, thus deleting the task of predicting the next sentence. Through the above series of improved strategies, roBERTa obtained a pre-trained language model that was more stable and robust than BERT.
For any text sequence, a segmented text sequence is first obtained through tokenizer processing. Then, whole-word masking is applied to some words of the segmented sequence, a special token [CLS] is added at the beginning of the sequence, and sentences are separated by [SEP] tokens. The output embedding of each word of the sequence at this point consists of three parts: Token Embedding, Segment Embedding and Position Embedding, as shown in FIG. 4. The sequence vector is input into a bidirectional Transformer for feature extraction, and a sequence vector containing rich semantic features is finally obtained.
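As an illustration of this encoding step, the sketch below shows how a sentence can be tokenized and encoded with a RoBERTa-style encoder through the Hugging Face transformers library. The checkpoint name "hfl/chinese-roberta-wwm-ext" is an assumption (any Chinese RoBERTa checkpoint with whole-word masking could be substituted), not necessarily the one used by the invention.

```python
# Minimal sketch of step S2: tokenize, add [CLS]/[SEP], and obtain per-token
# feature representations from a RoBERTa-style encoder (checkpoint name assumed).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

text = "10kV线路避雷器故障"                      # example power-marketing sentence
inputs = tokenizer(text, return_tensors="pt")    # adds [CLS] ... [SEP]
with torch.no_grad():
    outputs = encoder(**inputs)
v = outputs.last_hidden_state                    # (1, seq_len, hidden) token features
v_cls = v[:, 0, :]                               # representation of [CLS]
```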
The RoBERTa pre-training model trains the input text using a dynamic random masking (Masked Language Modeling, MLM) strategy to refine the depth bi-directional representation. The strategy will dynamically and randomly MASK some token, require the model to predict these masked token, and the locations of training MASK for each round of the model are different, for example, the "10kV line-collected jade-sea mountain village special-purpose lightning arrester and transformer failure", the sentences after masking in the first round of training are "10kV [ MASK ] line-collected jade-sea mountain village special-purpose [ MASK ] lightning arrester and transformer failure", the second round is "10kV line-collected jade-sea mountain village [ MASK ] lightning arrester and transformer [ MASK ] failure", and so on, the masking locations of the following training process are different. However, in the southern grid dataset, there are more proper nouns, and the original WordPiece strategy is changed to a full word masking strategy in order to enable the model to learn more knowledge.
The whole-word masking strategy is shown in FIG. 5 (note: X1-Xn represent characters in the input sequence, and [MASK] indicates that the current character is masked). Firstly, Chinese word segmentation is performed on the text with the jieba tokenizer, where the proper nouns of the power grid dictionary are merged in before segmentation so that the segmentation result of the model is more accurate; then, a complete word is masked with multiple consecutive [MASK] tokens, so that the model must predict the masked word; in this way the model obtains more feature information, and the deviation caused by incomplete semantics during context prediction is alleviated. During training, the model dynamically and randomly masks words in the text: the masked words account for 15% of the characters of the full text, 80% of these words are replaced by consecutive [MASK] tokens, 10% are replaced by other words in the corpus, and 10% remain unchanged.
Table 2 Whole word mask example
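The following is a minimal sketch of the whole-word masking procedure described above: jieba segmentation (optionally with a power-grid dictionary), selection of whole words until roughly 15% of the characters are covered, and the 80/10/10 replacement rule applied at word granularity. The dictionary file name and the helper function are assumptions made for illustration.

```python
# Sketch of whole-word masking: segment with jieba, pick whole words until ~15%
# of characters are covered, then apply 80% [MASK] / 10% random char / 10% unchanged.
import random
import jieba

# jieba.load_userdict("power_grid_dict.txt")   # assumed domain dictionary of proper nouns

def whole_word_mask(text: str, vocab_chars: str, mask_ratio: float = 0.15) -> str:
    words = list(jieba.cut(text))
    budget = max(1, int(len(text) * mask_ratio))
    order = list(range(len(words)))
    random.shuffle(order)
    out, covered = list(words), 0
    for idx in order:
        if covered >= budget:
            break
        w = words[idx]
        covered += len(w)
        p = random.random()
        if p < 0.8:                              # replace the whole word with [MASK]s
            out[idx] = "[MASK]" * len(w)
        elif p < 0.9:                            # replace with random characters
            out[idx] = "".join(random.choice(vocab_chars) for _ in w)
        # else: keep the word unchanged (still a prediction target)
    return "".join(out)

print(whole_word_mask("10kV线路避雷器和变压器故障", vocab_chars="电力线路设备故障"))
```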
2.3IDCNN layer
When convolutional neural networks (Convolutional Neural Networks, CNN) are applied to text processing, after the convolution operations the neurons in the last layer may only capture a small portion of the information in the original input data. To obtain more context information, more convolutional layers may need to be added, which makes the network deeper and increases the number of parameters. However, as models become larger, over-fitting problems tend to occur. To solve this problem, a Dropout mechanism can be introduced, but this also brings more hyper-parameters, eventually making the model bulky, tedious and difficult to train.
To solve the above problems and simplify the model, more context information is obtained more efficiently, and thus an expansion convolution (Dilation Convolution) is employed to expand the receptive field of the convolutional neural network by increasing the interval of the convolution kernels, reducing the information loss.
The dilation convolution increases the dilation width based on the original convolutional neural network. When the convolution operation is performed, the size of the convolution kernel is not changed, and the convolution kernel skips data in the middle of the expansion width. Thus, convolution kernels of the same size can acquire wider input matrix data, thereby increasing the perceived field of view of the convolution kernels. The schematic of the dilation convolution is as follows: FIG. 6 (a) shows a normal convolution operation, with a convolution kernel size of 3×3; FIG. 6 (b) shows that the expansion width of the convolution is 2, the perceived field of view increases to 7X 7, skipping the center neighbor node, 2 holes appear, and the node adjacent to the center is directly captured; fig. 6 (c) shows that the perceived field of view is enlarged to 15×15 with an expansion width of 4, 6 holes are present, and a larger range of node information is captured. The expansion convolution can improve the effectiveness and accuracy of the model to the greatest extent.
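As a small illustration of how dilation widens the receptive field without adding parameters, the sketch below stacks 1-D convolutions with dilations 1, 2 and 4 over token features, mirroring the idea in FIG. 6. The layer sizes and the number of stacked layers are arbitrary assumptions, not the exact IDCNN configuration of the invention.

```python
# Sketch of an iterated dilated convolution (IDCNN-style) block over token features.
# Three conv layers with dilation 1, 2, 4: same kernel size, growing receptive field.
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    def __init__(self, hidden: int = 768, kernel: int = 3):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(hidden, hidden, kernel, dilation=d, padding=d * (kernel - 1) // 2)
            for d in (1, 2, 4)
        ])
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden) from the encoder
        h = x.transpose(1, 2)                  # Conv1d expects (batch, hidden, seq_len)
        for conv in self.convs:
            h = self.act(conv(h))
        return h.transpose(1, 2)

feats = torch.randn(2, 128, 768)               # dummy encoder output
print(DilatedConvBlock()(feats).shape)         # torch.Size([2, 128, 768])
```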
2.4 seed Module
Given an input text X = {x_1, x_2, x_3, …, x_n} containing n characters, the input text is divided into a one-character set and a two-character set, where each single character of the input text X forms a one-character item and every two adjacent characters form a two-character item. The two sets are combined into a new set S = {s_1, s_2, s_3, …, s_{2n-1}}, where s_i = (l_i, r_i), with l_i and r_i representing the left and right boundaries of the seed.
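A minimal sketch of building the seed set S from an input text: every single character and every pair of adjacent characters becomes a span (l_i, r_i), giving 2n-1 seeds. Zero-based, inclusive boundaries are an assumption made for illustration.

```python
# Enumerate the 2n-1 seed spans (1-character and 2-character) of an input text.
def build_seed_set(text: str):
    n = len(text)
    singles = [(i, i) for i in range(n)]          # one-character seeds
    pairs = [(i, i + 1) for i in range(n - 1)]    # two-character seeds
    return singles + pairs                        # |S| = 2n - 1

seeds = build_seed_set("南津站故障")
print(len(seeds), seeds[:5])   # 9 seeds for a 5-character text
```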
The purpose of the seed task phase is to find a single character and two characters that overlap the complete entity and potentially expand into the complete entity. This step is accomplished by constructing a seed prediction model to predict the seed scores for each of the single and double characters.
The input text is fed into the RoBERTa pre-training model to obtain the representation v. For a seed s_i = (l_i, r_i), the score is computed as:
v_{s_i} = Concat(v_{l_i}, v_{r_i}, MeanPooling(v_{l_i:r_i}), v_{[CLS]})
\hat{y}_{s_i} = Sigmoid(MLP(v_{s_i}))
wherein v_{l_i} denotes the left-boundary representation of seed s_i; v_{r_i} denotes the right-boundary representation of seed s_i; v_{l_i:r_i} denotes all representations from the left boundary to the right boundary of seed s_i; and v_{[CLS]} is the representation of [CLS].
All representations in the span interval are first average-pooled, the result is spliced with the boundary representations and the representation of [CLS] to form the seed representation v_{s_i}, and the seed representation is finally fed into a multi-layer perceptron to obtain the final score \hat{y}_{s_i} of the seed. According to the initially set seed score threshold ω, only seeds whose scores are greater than the threshold ω are used as candidate seeds for the expansion module.
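The sketch below mirrors the seed-scoring computation reconstructed above: for each span, the left-boundary vector, right-boundary vector, mean-pooled span vector and the [CLS] vector are concatenated and passed through an MLP with a Sigmoid output. The layer sizes, the 0-indexed [CLS] position and the threshold handling are illustrative assumptions.

```python
# Sketch of the seed module: score = Sigmoid(MLP(Concat(v_l, v_r, mean(v_l..r), v_cls))).
import torch
import torch.nn as nn

class SeedScorer(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, v: torch.Tensor, spans):
        # v: (seq_len, hidden) token representations; v[0] is assumed to be [CLS]
        feats = []
        for l, r in spans:                        # spans are (l_i, r_i), inclusive
            span_vec = v[l:r + 1].mean(dim=0)     # MeanPooling over the span
            feats.append(torch.cat([v[l], v[r], span_vec, v[0]]))
        return torch.sigmoid(self.mlp(torch.stack(feats))).squeeze(-1)

v = torch.randn(10, 768)                          # dummy sentence representation
spans = [(1, 1), (1, 2), (3, 4)]
scores = SeedScorer()(v, spans)
candidates = [s for s, sc in zip(spans, scores.tolist()) if sc > 0.6]   # threshold ω
```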
2.5 expansion Module
In order to obtain complete entities, the candidate seeds obtained by the seed task are expanded by means of a regression task. The left boundary l_i and the right boundary r_i of a seed s_i = (l_i, r_i) can each be shifted by at most γ, so the largest obtainable entity has length 2+2γ. The value after the seed entity is shifted to the left is \hat{l}_i = l_i - γ, the value after it is shifted to the right is \hat{r}_i = r_i + γ, and the span of the final expanded entity is w_i = (\hat{l}_i, \hat{r}_i).
Next, the final candidate entity is determined from the computed expanded-entity span. The left-right boundary offset o_i of the candidate entity is computed as:
v_{s_i}^{exp} = Concat(MeanPooling(v_{l_i:r_i}), v_{\hat{l}_i}, v_{\hat{r}_i}, MeanPooling(v_{\hat{l}_i:\hat{r}_i}))
o_i = Sigmoid(MLP(v_{s_i}^{exp}))
wherein MeanPooling(v_{l_i:r_i}) denotes the result of average-pooling the vector of the i-th seed span from the seed stage, and v_{s_i}^{exp} denotes the expanded representation of the i-th seed.
o_i ∈ R^2; its first element o_i^l represents the final left shift of the entity span and its second element o_i^r represents the final right shift, with the offsets lying in [-γ, γ]. In the span scheme, the left boundary of an entity records its start position and the right boundary records its end position; for example, "110kv" corresponds to the span [1, 5]. The left and right boundaries of the final entity can be determined as follows:
l'_i = l_i - ⌊n·o_i^l⌋, r'_i = r_i + ⌊n·o_i^r⌋
wherein n denotes the maximum offset of the expansion in one direction and ⌊·⌋ denotes rounding down. In the process of calculating the result, any result with l'_i > r'_i is deemed invalid and is discarded.
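The following sketch follows the expansion computation reconstructed above: each candidate seed is widened by at most γ on each side, two offsets in (0, 1) are regressed, scaled by the maximum offset and rounded down. Setting the maximum offset equal to γ, the layer sizes and the boundary clamping are assumptions made for illustration.

```python
# Sketch of the expansion module: regress left/right offsets for a candidate seed
# and derive the final entity span; invalid spans (left > right) are discarded.
import torch
import torch.nn as nn

class SeedExpander(nn.Module):
    def __init__(self, hidden: int = 768, gamma: int = 5):
        super().__init__()
        self.gamma = gamma
        self.mlp = nn.Sequential(nn.Linear(4 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))    # left and right offsets

    def forward(self, v: torch.Tensor, seed):
        l, r = seed
        lh, rh = max(0, l - self.gamma), min(v.size(0) - 1, r + self.gamma)
        feat = torch.cat([v[l:r + 1].mean(0),      # pooled seed span
                          v[lh], v[rh],            # maximally shifted boundaries
                          v[lh:rh + 1].mean(0)])   # pooled expanded window
        o = torch.sigmoid(self.mlp(feat))          # o = (o_left, o_right) in (0, 1)
        new_l = max(0, l - int(torch.floor(self.gamma * o[0])))
        new_r = min(v.size(0) - 1, r + int(torch.floor(self.gamma * o[1])))
        return (new_l, new_r) if new_l <= new_r else None   # discard invalid spans

v = torch.randn(20, 768)
print(SeedExpander()(v, (6, 7)))
```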
2.6 implication Module
In order to construct the implication pairs required by the implication task, each candidate entity obtained by the expansion task is combined with the predefined Prompt to form a hypothesis sentence, and the input text is used as the premise, thereby forming a complete implication pair. For the i-th candidate entity c_i, the implication pair has the format (X, h_i), where h_i = "c_i is a/an e_i", e_i ∈ E, X is the input text (premise), and h_i is the hypothesis sentence; e_i is a predefined tag type, E is the set of entity types, and the prompt is "is a/an e_i".
The constructed implication pairs are fed into the shared RoBERTa encoder for training, the classification representation v_{[CLS]}^i output by the encoder is obtained, and the text classification result of the implication task is:
\hat{y}_i = Sigmoid(MLP(v_{[CLS]}^i))
wherein \hat{y}_i denotes the text implication result.
The implication labels are defined as follows: y_i = true if the hypothesis sentence is entailed by the premise, and y_i = false otherwise.
In the text classification of the implication task, if the hypothesis sentence is a part of the premise sentence, the classification is successful and the label is true; otherwise the label is marked as false (negative).
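A minimal sketch of how the implication pairs might be assembled and scored: each candidate entity is combined with the Prompt to form the hypothesis, the original sentence is the premise, and the pair is classified with a shared RoBERTa encoder plus a binary head. The English prompt wording, the checkpoint name and the untrained linear head are illustrative assumptions.

```python
# Sketch of the implication module: premise = input text,
# hypothesis = "<entity> is a/an <type>", scored by a binary entailment head.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")  # assumed checkpoint
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
head = nn.Linear(encoder.config.hidden_size, 1)            # binary entailment head

def entailment_score(premise: str, entity: str, entity_type: str) -> float:
    hypothesis = f"{entity} is a/an {entity_type}"         # candidate entity + Prompt
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")  # premise [SEP] hypothesis
    with torch.no_grad():
        v_cls = encoder(**inputs).last_hidden_state[:, 0, :]      # [CLS] representation
    return torch.sigmoid(head(v_cls)).item()               # close to 1 -> entailed

print(entailment_score("10kV集玉线避雷器故障", "集玉线", "line"))
```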
2.7 Focus loss function
When the southern power grid marketing data corpus was constructed, statistics on the labeling results showed that the numbers of the various entity types are unbalanced. For example, according to the statistics in Table 1, the number of equipment entities is far greater than the number of other entity labels. Such an unbalanced distribution causes the loss function of the model to be dominated during training by the label entities with larger sample numbers, so that they are more easily identified, while entity labels with fewer samples are recognized poorly.
To alleviate the above problem, the Focal Loss function (FL) is used herein to reduce the weight of easily classified samples and increase the weight of hard-to-classify samples. Specifically, the focal loss function improves the weight factor in the cross-entropy loss function and introduces a focal factor α for adjusting the weights of easy and hard samples. When the focal factor is larger, the model pays more attention to hard samples, thereby improving the classification accuracy of the model. The formula is:
FL(p_t) = -α_t · (1 - p_t)^λ · log(p_t)    (14)
wherein p_t denotes the probability that the label is correctly identified; a larger p_t indicates higher recognition confidence, i.e., an easier label. α_t denotes the focal factor and (1 - p_t)^λ denotes the modulation factor, with λ > 0. The focal factor α_t ∈ [0, 1] is used to adjust the sample weights and balance the distribution of the loss function.
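A minimal PyTorch sketch of formula (14) follows; the α and λ values shown are common illustrative defaults, not the values used in the experiments here.

```python
# Focal loss: FL(p_t) = -alpha_t * (1 - p_t)^lambda * log(p_t).
import torch

def focal_loss(probs: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, lam: float = 2.0, eps: float = 1e-8) -> torch.Tensor:
    """probs: predicted probability of the positive class; targets: 0/1 labels."""
    p_t = torch.where(targets == 1, probs, 1.0 - probs)    # probability of the true label
    alpha_t = torch.where(targets == 1, torch.full_like(probs, alpha),
                          torch.full_like(probs, 1.0 - alpha))
    return (-alpha_t * (1.0 - p_t) ** lam * torch.log(p_t + eps)).mean()

probs = torch.tensor([0.9, 0.3, 0.7])
targets = torch.tensor([1, 0, 1])
print(focal_loss(probs, targets))
```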
3. Experiment and result analysis
The experimental environment adopts a Pytorch framework, CUDA11.1 version, ubuntu system and NVIDIA RTX3090 (24G) display card.
3.1 parameter settings
The threshold ω of the seed score is set to 0.6. The entity offset γ is set to 5, considering that the longest entity in the corpus has length 12. The AdamW optimization algorithm is used, with the learning rate (learning_rate) set to 3e-5. To avoid over-fitting and improve the generalization ability of the model, a Dropout mechanism is introduced into the model, with its value set to 0.5. During model training, the batch size (batch_size) is set to 4 and the number of iterations (epochs) is set to 30; the model is evaluated once at the end of each iteration and the model with the highest F1 value is saved. The dataset is the power grid marketing named entity recognition corpus constructed in section 1; 3 comparison experiments with different numbers of samples in the training and validation sets (5-shot, 10-shot and 20-shot) are set up, and the test set, containing 287 pieces of data, is used for verification in the inference stage. Other parameter settings are listed in Table 3; a sketch of this configuration is given after Table 3.
TABLE 3 parameter settings
Parameter Value
Hidden_size 768
max_seq 128
Sampling_processes 4
Eval_batch_size 8
Lr_warmup 0.1
seed 42
Embedding size 128
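The sketch below shows how the stated hyper-parameters (AdamW, learning rate 3e-5, dropout 0.5, batch size 4, 30 epochs, γ = 5, ω = 0.6) might be wired together; `model`, `train_loader` and `evaluate` are placeholders for the components described above, and the checkpoint file name is assumed.

```python
# Sketch of the training setup with the hyper-parameters reported in section 3.1.
import torch
from torch.optim import AdamW

config = dict(seed_threshold=0.6, gamma=5, lr=3e-5, dropout=0.5,
              batch_size=4, epochs=30, max_seq=128, hidden_size=768)

def train(model, train_loader, evaluate):
    optimizer = AdamW(model.parameters(), lr=config["lr"])
    best_f1 = 0.0
    for epoch in range(config["epochs"]):
        model.train()
        for batch in train_loader:
            loss = model(**batch)           # assumed to return the focal loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        f1 = evaluate(model)                # evaluate once per epoch
        if f1 > best_f1:                    # keep the checkpoint with the best F1
            best_f1 = f1
            torch.save(model.state_dict(), "best_model.pt")
    return best_f1
```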
3.2 criteria for evaluation
The precision P, recall R and F1 value are adopted as the evaluation indexes of model performance, and are calculated as follows:
P = TruePositive / PredictPositive
R = TruePositive / ActualPositive
F1 = 2 · P · R / (P + R)
In the above formulas, TruePositive denotes the number of entities correctly predicted by the model, that is, the number of samples predicted as positive that are actually positive; PredictPositive denotes the total number of entities identified by the model, including correctly and incorrectly predicted entities; ActualPositive denotes the total number of entities present in the dataset, i.e., the actual number of entities.
3.3 results and analysis
The model is evaluated using the named entity recognition corpus of the power grid marketing field constructed in section 1, and 2 groups of comparison experiments are set up to verify the validity of the model. The first group: performance analysis of different masking strategies. The second group: comparison of the model with four other models (BERT-CRF, BERT-LSTM-CRF, NNShot, StructShot), verifying that the model achieves the best performance on the named entity identification dataset of the power grid marketing field.
1) Performance analysis of different masking strategies
In order to verify that the full-word masking strategy can effectively enhance the recognition effect of the model, a comparative experiment was performed on DBERT-IDCNN-TMLP-FL using the character-level masking strategy (WordPiece Masking) model, DRoBERTa-IDCNN-TMLP-FL, and model WmDBERT-IDCNN-TMLP-FL, wmDRoBERTa-IDCNN-TMLP-FL using the full-word masking strategy (Whole Word Masking), respectively, and the experimental results are shown in Table 4.
Table 4 comparison of the Performance of different masking strategies (%)
Note: DRoBERTa denotes two (Double) RoBERTa encoders, TMLP denotes three (Three) multi-layer perceptron networks, WmDRoBERTa denotes that the whole-word masking approach is used throughout, and FL denotes the focal loss function.
From the above table, the model WmDRoBERTa-IDCNN-TMLP-FL presented herein performs optimally, and under the conditions of 5-shot, 10-shot and 20-shot, the F1 values respectively reach 52.23%, 55.54% and 57.24%. In a 20-shot experiment, compared with a character shielding mode, the full-word shielding mode enables the accuracy, recall rate and F1 value of the model to be respectively improved by 2.31%, 2.53% and 2.44%. In addition, the pre-training model of the model is changed from RoBERTa to BERT language model, and the accuracy, recall and F1 of the model are improved to a certain extent by adopting a whole word masking strategy. The pre-training language model adopts a static character shielding strategy to shield only one character in the text, so that the model can estimate the shielded character, and the model can predict the content of the shielded position in advance, so that the learning ability of the model is relatively poor. The dynamic masking strategy is different in selected position in each iteration process, and masking is performed in a way based on vocabulary frequency. That is, for a given text, firstly, word segmentation is performed by using a word segmentation device, frequency ordering is performed on words appearing in the text, and then, the probability of masking the current character is obtained by calculating by using a formula of the word segmentation device, so that the model is more flexible and efficient in masking corpus, and the performance of the model is improved. In the case of 20-shot, the Roberta language model has improved accuracy, recall and F1 values by 2.34%, 0.90% and 1.55% respectively, compared to the BERT language model, using a character-level masking strategy. Compared with a character shielding strategy, the full-word shielding is integrated with proper nouns of the power grid data during word segmentation, so that a word segmentation device obtains more complete nouns during word segmentation, continuous multiple [ MASK ] marks are used for replacing complete words, the content of a shielded position is deduced through multiple training of a pre-training language model, so that the model obtains more semantic information, the reasoning capacity of the model is improved, and therefore the model obtains better performance, and in a 5-shot sample, the F1 value is 60.59%.
2) Performance comparison analysis of different models
To verify the effectiveness of the models presented herein for less sample grid marketing data named entity identification, experiments were set up with 4 different models for comparison, including BERT-CRF, BERT-LSTM-CRF, NNShot, structShot. The results show that the WmDRoBERTa-IDCNN-TMLP-FL model disclosed herein shows higher recognition accuracy with fewer samples. The specific comparison results are shown in Table 5.
Table 5 comparison results of different models (%)
Table 5 shows the experimental results of the different models on the grid marketing dataset. BERT-CRF uses the BERT pre-trained model as the embedding layer, fully considers the position information and context semantic information of the characters, and obtains a globally optimal labeling sequence by learning labeling constraint rules and adjacent label information through the CRF, alleviating the label bias problem. Under the 5-shot, 10-shot and 20-shot sample settings, its F1 values are 20.14%, 23.94% and 33.90%, respectively. BERT-LSTM-CRF integrates a Long Short-Term Memory (LSTM) network on top of BERT-CRF to take sequence information into account and alleviate the information-forgetting problem. However, because there are few training samples and the training corpus of the pre-trained model is relatively broad, the model learns much irrelevant information during training and thus introduces more noise, so its recognition effect is worse; its F1 values are 15.91%, 24.10% and 41.70% under the 5-shot, 10-shot and 20-shot sample settings, respectively. The NNShot model mainly obtains the context vector representation of each word in a sentence, computes word similarity in the vector space using the nearest-neighbor principle, and labels each word with the category of its nearest word in the vector space. The backbone of the StructShot model is based on NNShot, while the Viterbi algorithm is used to decode the predictions. Both models are trained on corpora from other fields and then validated on the grid marketing dataset in the prediction phase. This causes the models to absorb a large amount of noise, and the semantic information they acquire contains more error information; in addition, both models adopt a sequence labeling scheme, which cannot alleviate the entity nesting problem, so their recognition performance is poor. The F1 values of NNShot under the 5-shot, 10-shot and 20-shot sample settings are 19.19%, 25.23% and 23.69%, respectively; the F1 values of StructShot under the 5-shot, 10-shot and 20-shot sample settings are 26.71%, 31.21% and 32.60%, respectively.
The model WmDRoBERTa-IDCNN-TMLP-FL presented herein is based on a multitasking approach. Specifically, first, the seed task is mainly to obtain one chinese character and two chinese characters that are most likely to be part of an entity, and screening is performed by calculating a seed score and setting a threshold. And expanding the seeds obtained from the previous task to obtain a complete entity, which is also called a candidate entity. And finally, taking the input sentence as a precondition of the implication task, and combining the candidate entity obtained by the previous task with a predefined Prompt to form a hypothesis sentence, thereby obtaining the implication pair of the implication task. This approach reconstructs Named Entity Recognition (NER) tasks into classification tasks for language models and trains using the grid dataset. This approach reduces the likelihood of mixing in noise compared to other models. In addition, the method adopts a full word shielding strategy to replace an original character shielding strategy, so that the model can learn the context semantic information of the complete word, and the knowledge learning capability of the model is improved. Therefore, under the condition of 5-shot, 10-shot and 20-shot sample numbers, compared with other models, the identification performance of the method is obviously improved, and F1 values are 52.23%, 55.54% and 57.24% respectively.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims (9)

1. The method for identifying the named entities of the small electric marketing sample based on the multitasking is characterized by comprising the following steps of:
s1, collecting power marketing named entity recognition corpus and performing data processing: s1-1, collecting power marketing data, and converting structured data and unstructured power data into unstructured power marketing data; the unstructured electricity marketing data is text; s1-2, marking the electric power marketing data by adopting a span marking mode: marking the beginning position, the ending position and the entity type of the entity in the sentence;
s2, inputting the text into a RoBERTa encoder to obtain a characteristic representation;
s3, inputting the feature representation obtained in S2 into an IDCNN layer for expansion convolution;
s4, inputting the text data after the expansion convolution into a seed module, and predicting seed scores for each single character and each double character: the text is segmented into a seed set comprising one-Chinese-character and two-Chinese-character seeds, the score of each seed in the set is calculated, and if the score is larger than the seed score threshold ω, the seed is used as a candidate seed;
s5, expanding the candidate seeds so as to obtain complete entities: setting the seed offset γ and expanding candidate seeds in the left and right directions to obtain complete entities;
s6, inputting the entities into the RoBERTa encoder to obtain the final classification result: firstly, constructing implication pairs, where an implication pair comprises a hypothesis sentence and a premise, the hypothesis sentence is obtained by combining the entity span obtained by the expansion task with a predefined Prompt, and the premise is the input text; then, the implication pairs are fed into the RoBERTa encoder for the classification task, and the classification results are output.
2. The method for identifying the named entities of the small electric marketing sample based on the multitasking of claim 1, further comprising: preprocessing unstructured power data of S1-1: non-text data and special characters are deleted.
3. The method for identifying the named entity of the small electric marketing sample based on the multitasking of claim 1, wherein the entity types comprise time, level, line, station, org, equ, name, add and other, wherein time represents time, level represents voltage level, line represents power transmission equipment, station represents plant station, org represents organization, equ represents equipment appliances, name represents personnel name, add represents address, and other represents other entities.
4. The method for identifying a named entity of a small sample of electric marketing based on multitasking according to claim 1, wherein said S2 comprises:
firstly, processing through a word segmentation device (tokenizer) to obtain a segmented text sequence; then, carrying out whole-word masking on some words of the segmented sequence, adding a special token [CLS] at the beginning of the sequence, with sentences separated by the token [SEP]; and finally, inputting the sequence vector into the RoBERTa encoder for feature extraction to obtain the feature representation.
5. The method for identifying named entities of small samples for power marketing based on multitasking of claim 4, wherein the whole-word masking comprises the steps of:
firstly, utilizing a jieba word segmentation device to segment Chinese words of a text;
then MASK a complete word using consecutive multiple MASK flags, allowing the model to predict the masked word.
6. The method for identifying the named entities of the small electric marketing sample based on the multitasking of claim 1, wherein the score of each seed is calculated as follows:
for a seed s_i = (l_i, r_i), where l_i and r_i respectively denote the left and right boundaries of seed s_i, the score is computed as:
v_{s_i} = Concat(v_{l_i}, v_{r_i}, MeanPooling(v_{l_i:r_i}), v_{[CLS]})
\hat{y}_{s_i} = Sigmoid(MLP(v_{s_i}))
wherein v_{l_i} denotes the left-boundary representation of seed s_i; v_{r_i} denotes the right-boundary representation of seed s_i; v_{l_i:r_i} denotes all representations from the left boundary to the right boundary of seed s_i; v_{[CLS]} is the representation of [CLS]; MeanPooling(·) is the average pooling operation applied to all representations in the span interval of seed s_i; Concat(·) is the splicing operation; v_{s_i} is the representation of the seed, which is input into the multi-layer perceptron MLP and whose result is input into the Sigmoid activation function; and \hat{y}_{s_i} is the final score of the seed.
7. The method for identifying the named entity of the small electric marketing sample based on the multitasking of claim 1, wherein the candidate seeds are expanded in the left and right directions to obtain complete entities as follows:
the left boundary l_i and the right boundary r_i of a seed s_i = (l_i, r_i) are each shifted by at most γ, so that the largest obtainable entity has length 2+2γ; the offset boundaries and the span w_i of the final expanded entity are computed as:
\hat{l}_i = l_i - γ, \hat{r}_i = r_i + γ, w_i = (\hat{l}_i, \hat{r}_i)
wherein \hat{l}_i is the value after the seed entity is shifted to the left and \hat{r}_i is the value after the seed entity is shifted to the right;
then, the final candidate entity is determined through the computed expanded-entity span; the left-right boundary offset o_i of the candidate entity is computed as:
v_{s_i}^{exp} = Concat(MeanPooling(v_{l_i:r_i}), v_{\hat{l}_i}, v_{\hat{r}_i}, MeanPooling(v_{\hat{l}_i:\hat{r}_i}))
o_i = Sigmoid(MLP(v_{s_i}^{exp}))
wherein MeanPooling(v_{l_i:r_i}) denotes the result of average-pooling the vector represented by the i-th seed span of the seed stage; MeanPooling(·) is the average pooling operation; v_{\hat{l}_i} denotes the representation of the left-shifted value of the seed entity; v_{\hat{r}_i} denotes the representation of the right-shifted value of the seed entity; v_{\hat{l}_i:\hat{r}_i} denotes all representations between the left-shifted value and the right-shifted value of the seed entity; Concat(·) is the splicing operation; MLP is the multi-layer perceptron into which v_{s_i}^{exp} is input; Sigmoid is the activation function; and v_{s_i}^{exp} denotes the expanded representation of seed s_i;
o_i ∈ R^2, its first element o_i^l representing the final left shift of the entity span and its second element o_i^r representing the final right shift of the entity span, with the offsets lying in [-γ, γ];
the left and right boundaries of the final entity can be determined as follows:
l'_i = l_i - ⌊n·o_i^l⌋, r'_i = r_i + ⌊n·o_i^r⌋
wherein l_i, r_i respectively denote the left and right boundaries of the entity; l'_i, r'_i respectively denote the left and right boundaries of the final entity; n denotes the maximum offset of the expansion in one direction in the expansion stage; and ⌊·⌋ denotes rounding down;
in the process of calculating the result, any result with l'_i > r'_i is deemed invalid and is discarded.
8. The method for identifying the named entities of the small electric marketing sample based on the multitasking according to claim 1 is characterized in that the method is based on a WmDRoBERTa-IDCNN-TMLP-FL model, the model comprises a data processing module, a RoBERTa coding layer, an IDCNN layer, a seed module, an expansion module and an implication module, the processing module is connected with the RoBERTa coding layer, the RoBERTa coding layer is connected with the IDCNN layer, the IDCNN layer is connected with the seed module, the seed module is connected with the expansion module, the expansion module is connected with the implication module, and the implication module is connected with the RoBERTa coding layer;
the processing module is used for: firstly, processing through a word segmentation device to obtain a word segmentation text sequence; then, carrying out full word shielding on partial words of the word segmentation sequence;
roberta coding layer: extracting features of the input sequence vector;
IDCNN layer: the space of the convolution kernel is increased by adopting expansion convolution to enlarge the receptive field of the convolution neural network, so that the information loss is reduced;
seed module: predicting seed scores of each single character and each double character, and finding candidate seeds;
and an expansion module: expanding the candidate seeds obtained by the seed module to obtain final candidate entities;
the implication module: combining the candidate entity obtained by the expansion task with the predefined Prompt, and taking the input text as the premise, thereby forming a complete implication pair.
9. The method for identifying the named entity of the small electric marketing sample based on the multitasking of claim 8, wherein the loss function adopted for training the WmDRoBERTa-IDCNN-TMLP-FL model is the focal loss function, whose formula is:
FL(p_t) = -α_t · (1 - p_t)^λ · log(p_t)    (14)
wherein α_t denotes the focal factor; (1 - p_t)^λ denotes the modulation factor, with λ > 0; and p_t denotes the probability that the label is correctly identified.
CN202310589142.5A 2023-05-24 2023-05-24 Multi-task-based electric power marketing small sample named entity identification method Pending CN116882402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310589142.5A CN116882402A (en) 2023-05-24 2023-05-24 Multi-task-based electric power marketing small sample named entity identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310589142.5A CN116882402A (en) 2023-05-24 2023-05-24 Multi-task-based electric power marketing small sample named entity identification method

Publications (1)

Publication Number Publication Date
CN116882402A true CN116882402A (en) 2023-10-13

Family

ID=88266815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310589142.5A Pending CN116882402A (en) 2023-05-24 2023-05-24 Multi-task-based electric power marketing small sample named entity identification method

Country Status (1)

Country Link
CN (1) CN116882402A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236335A (en) * 2023-11-13 2023-12-15 江西师范大学 Two-stage named entity recognition method based on prompt learning
CN117236335B (en) * 2023-11-13 2024-01-30 江西师范大学 Two-stage named entity recognition method based on prompt learning
CN118133829A (en) * 2024-04-29 2024-06-04 江西师范大学 Small sample named entity recognition method

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN110489555A (en) A kind of language model pre-training method of combination class word information
CN111639171A (en) Knowledge graph question-answering method and device
CN110209822A (en) Sphere of learning data dependence prediction technique based on deep learning, computer
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN117151220B (en) Entity link and relationship based extraction industry knowledge base system and method
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN116882402A (en) Multi-task-based electric power marketing small sample named entity identification method
CN112883714B (en) ABSC task syntactic constraint method based on dependency graph convolution and transfer learning
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN114781392A (en) Text emotion analysis method based on BERT improved model
CN111553159B (en) Question generation method and system
CN113138920B (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
CN113505589B (en) MOOC learner cognitive behavior recognition method based on BERT model
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
US20230014904A1 (en) Searchable data structure for electronic documents
CN114742069A (en) Code similarity detection method and device
CN115203406A (en) RoBERTA model-based long text information ground detection method
CN113221569A (en) Method for extracting text information of damage test
CN113920379A (en) Zero sample image classification method based on knowledge assistance
CN115757695A (en) Log language model training method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240628

Address after: No.6, Minzhu Road, Xingning District, Nanning City, Guangxi Zhuang Autonomous Region

Applicant after: GUANGXI POWER GRID Co.,Ltd.

Country or region after: China

Address before: No. 6 Democracy Road, Xingning District, Nanning City, Guangxi Zhuang Autonomous Region, 530000

Applicant before: GUANGXI POWER GRID Co.,Ltd.

Country or region before: China

Applicant before: GUANGXI University

TA01 Transfer of patent application right