CN117592482A

CN117592482A - Operation ticket named entity recognition method based on BiLSTM+CRF model

Info

Publication number: CN117592482A
Application number: CN202311556835.0A
Authority: CN
Inventors: 笪涛; 马海涛; 朱江渝; 刘小荷; 侯超; 马骏毅; 丁瑾; 吴林; 张佳; 吴昊; 王支奎; 袁立刚
Original assignee: Nanjing Ruiyuan Electric Power Technology Co ltd; State Grid Jiangsu Electric Power Co ltd Zhenjiang Power Supply Branch
Current assignee: Nanjing Ruiyuan Electric Power Technology Co ltd; State Grid Jiangsu Electric Power Co ltd Zhenjiang Power Supply Branch
Priority date: 2023-11-21
Filing date: 2023-11-21
Publication date: 2024-02-23

Abstract

The invention discloses an operation ticket named entity recognition method based on a BiLSTM+CRF model, comprising the following steps: 1) preparing a corpus; 2) preprocessing the data; 3) labeling the data with the BIEO labeling method; 4) sentence segmentation: splitting the text into sentences at punctuation marks; 5) splitting each labeled sentence into a word list and a corresponding label list; 6) constructing a vocabulary, and converting the operation ticket text data into a numeric representation that the LSTM+CRF model can process; 7) counting the operation ticket vocabulary and entity labels, constructing the corresponding dictionaries, and mapping each word to a unique integer index; 8) sentence vectorization; 9) splitting the vectorized data into a training set, a validation set and a test set; 10) constructing the LSTM+CRF neural network model; 11) predicting and recognizing the named entities in operation tickets. Compared with traditional methods and with using a BiLSTM or CRF model alone, the invention generally obtains higher evaluation indexes such as accuracy, recall rate and F1 value, and the recognition rate is over 9.

Description

Operation ticket named entity recognition method based on BiLSTM+CRF model

Technical Field

The invention relates to an operation ticket named entity recognition method based on a BiLSTM+CRF model, and belongs to the technical field of natural language processing of text.

Background

The operation ticket is an important electronic text record for the daily overhaul and maintenance of distribution network power system equipment. It contains a large amount of entity information, such as lines, switches, ring main units, switch operation states and other equipment. At present, recognizing entity information in unstructured operation ticket text has the following problems and disadvantages:

1. Manual processing: in the prior art, the distribution network entity information in operation ticket text is read and understood by relying on manual experience in the professional field, so processing efficiency is low and errors are easy to make;

2. Accuracy problems: because the text content of operation tickets is complex and diverse, conventional techniques have low accuracy in entity boundary recognition and entity classification;

3. Limited applicability: the prior art is only suitable for specific types of operation tickets; it performs poorly on other types of operation tickets and lacks generality;

4. Data silos: the text analysis accuracy of traditional techniques is low, so unstructured text cannot be shared across systems, and the unshared information forms data silos.

Thus, there is an urgent need for a solution that can automatically identify named entities in an operation ticket.

Disclosure of Invention

The invention aims to provide an operation ticket named entity recognition method based on a BiLSTM+CRF model. 'BiLSTM+CRF' is a technical scheme that combines a bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, BiLSTM) with a conditional random field (Conditional Random Field, CRF) and is used for the operation ticket named entity recognition (Named Entity Recognition, NER) task. The invention solves the following technical problems: 1. Automatic recognition: an automatic operation ticket named entity recognition method is provided, which reduces manual intervention and improves processing efficiency. 2. Improved accuracy: the deep learning model BiLSTM+CRF is introduced to mark entity boundaries in the operation ticket text accurately, improving the classification effect and the recognition accuracy. 3. Improved generality: a model with strong generality is designed so that it suits various operation tickets, improving the adaptability and generalization capability of the model.

The aim of the invention is realized by the following technical scheme:

An operation ticket named entity recognition method based on a BiLSTM+CRF model includes:

1) Corpus preparation: collecting texts, databases and log files containing dispatching and maintenance operation ticket data;

2) Data preprocessing: performing manual character-level word segmentation and part-of-speech tagging on the original operation ticket text data;

3) Data labeling using the BIEO (Begin, Inside, End, Outside) labeling method: the start position of an entity is marked as 'B', the middle part is marked as 'I', the non-entity part is marked as 'O', and the end position is marked as 'E';

4) Sentence segmentation: sentence segmentation is carried out according to punctuation marks;

5) Splitting each labeled sentence into a word list and a corresponding label list;

6) Constructing a vocabulary, and converting the operation ticket text data into a numeric representation that the LSTM+CRF model can process;

7) Counting the operation ticket vocabulary and entity labels, constructing the corresponding dictionaries, and mapping each word to a unique integer index;

8) Sentence vectorization: the operation ticket text is vectorized by converting the words into their corresponding indexes, and each sentence is padded or truncated to the maximum sentence length so that all sentences have the same length; the final vectorized representations are the index sequence of the sentence and the index sequence of its labels;

9) Splitting the vectorized data into a training set, a validation set and a test set for subsequent model training, evaluation and prediction;

10) Constructing the LSTM+CRF neural network model: the word vectors are input into a bidirectional LSTM layer, the LSTM output is mapped to a score for each label through a fully connected layer, and finally the CRF layer decodes the scores to obtain the optimal label sequence;

11) Operation ticket named entity prediction and recognition: the operation ticket text is fed into the model for prediction, and the named entities in the operation ticket are recognized automatically.

Further, the LSTM+CRF neural network model architecture is as follows:

Input layer: receives the input sequence of the operation ticket text, in which each word is typically represented by its index;

Word embedding layer: converts each input word index into the corresponding word vector;

Bidirectional LSTM layer: receives the word vector sequence and processes the sequence information in two directions, forward and backward;

Fully connected layer: maps the output of the LSTM to the label space, that is, it maps the output dimension of the LSTM to the number of labels in order to calculate a score for each label;

Conditional random field layer: decodes the output sequence and models the dependencies between labels in the sequence; the CRF layer decodes on the output of the LSTM to obtain the optimal label sequence, so that the output label sequence has the highest overall probability.

Compared with the prior art, the invention has the beneficial effects that:

1. Contextual information capture: the bidirectional long short-term memory network (BiLSTM) considers the preceding and following context of the operation ticket at the same time, so that the NER task can better understand the context of an entity, thereby improving the recognition accuracy.

2. Solving the long-term dependence problem: conventional RNNs are prone to vanishing or exploding gradients when facing long-term dependencies, whereas BiLSTM, built from LSTM units that propagate forward and backward, mitigates this problem and therefore handles the long text sequences of operation tickets better.

3. Sequence modeling: the conditional random field (CRF) models the entire label sequence and takes into account the interrelationships between entity labels. This sequence modeling can further improve the performance of the NER task and avoid generating illegal label sequences.

4. Context consistency: because the CRF considers the dependencies between labels in the sequence, the BiLSTM+CRF model can keep entity boundaries consistent and avoid generating unreasonable entity boundaries.

5. End-to-end learning: BiLSTM+CRF is an end-to-end deep learning model that learns the mapping between features and labels directly from the original text, so features do not need to be designed manually and the complexity of feature engineering is reduced.

6. Excellent effect: the BiLSTM+CRF technical scheme performs well in many named entity recognition tasks. Compared with traditional methods and with using a BiLSTM or CRF model alone, it generally obtains higher evaluation indexes such as accuracy, recall rate and F1 value, and the recognition rate is over 9.

Through the above technical features, the invention greatly improves the automation level of operation ticket processing, reduces errors and omissions, and provides a more reliable named entity recognition solution for fields such as the operation, maintenance and overhaul of distribution network power systems. Because the model learns features and patterns from the data, BiLSTM+CRF can adapt flexibly to different types of named entities in operation tickets, including switches, station rooms, ring main units and the like.

Drawings

FIG. 1 is a diagram of an LSTM-CRF model architecture of the present invention;

FIG. 2 is a diagram of an overall model architecture for BiLSTM+CRF model applied to ticket text recognition.

Detailed Description

The invention will be further described with reference to the drawings and the specific examples.

For operation ticket named entity recognition, the invention adopts a deep learning model based on BiLSTM+CRF to construct the whole neural network. BiLSTM+CRF is a technical scheme that combines a bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, BiLSTM) with a conditional random field (Conditional Random Field, CRF) and is used for the operation ticket named entity recognition (Named Entity Recognition, NER) task. NER is an important task in natural language processing; its goal here is to identify named entities of specified categories, such as switches, lines, station rooms, ring main units, switching stations and switch states, from operation ticket text.

The following is the BiLSTM+CRF technical scheme for named entity recognition. Its main technical features comprise 6 parts:

1. Data preprocessing:

Text labeling: each word or character in the original operation ticket text is labeled, for example with the BIEO (Begin, Inside, End, Outside) labeling method, in which the start position of an entity is marked "B", the middle part is marked "I", the non-entity part is marked "O", and the end position is marked "E".

2. Building a bidirectional long short-term memory network (BiLSTM):

BiLSTM is an extended form of the recurrent neural network (RNN) that takes the preceding and following context into account at the same time. This network structure has two LSTM layers, one propagating forward and one propagating backward, which capture the context information of each word. In this way, the long-term dependence problem faced by the traditional RNN can be addressed effectively.

3. Conditional Random Field (CRF):

CRF is a statistical model used to annotate sequence data. In the NER task, it can model over the whole labeling sequence, taking into account the interrelationships between entity tags. The CRF further improves the performance of NER by modeling the probability of the entity sequence.

4. Model combination: BiLSTM and CRF are combined as shown in FIG. 1 to form an end-to-end model. BiLSTM is responsible for extracting context-related features from the input labeled operation ticket text, while CRF is responsible for modeling the label sequence of the NER task. During training, the entire model is optimized by maximizing the conditional log-likelihood function of the CRF.

5. Model training and evaluation: the model is trained with the labeled operation ticket training data and tuned with the validation set to avoid overfitting. In the evaluation stage, the model is evaluated on the test data, and its performance is measured with indexes such as accuracy, recall rate and F1 value.

6. Prediction: finally, given new unlabeled operation ticket text, the trained model predicts the named entities in it, realizing automatic named entity recognition for operation ticket text.

In general, the BiLSTM+CRF technical scheme captures context information through a bidirectional long short-term memory network and then models the relationships between labels in the sequence with a conditional random field, achieving efficient and accurate named entity recognition.
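The evaluation indexes named in part 5 are commonly computed at the entity level. The following is a minimal sketch under stated assumptions, not the patent's own evaluation code: entities are compared as exact (start, end, type) matches, "accuracy" is read as precision, and the function name `entity_prf` and the sample spans are hypothetical.

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall and F1 from (start, end, type) spans.

    Assumption: a predicted entity counts as correct only when its
    boundaries and its type both match a gold entity exactly.
    """
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical spans: the predicted switch entity has a wrong end boundary.
gold_spans = [(0, 3, "line"), (4, 8, "ring"), (9, 12, "switch")]
pred_spans = [(0, 3, "line"), (4, 8, "ring"), (9, 11, "switch")]
p, r, f = entity_prf(gold_spans, pred_spans)
```

With exact matching, a boundary error costs both a false positive and a false negative, which is why boundary quality dominates these scores.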

The following describes the implementation of the invention in detail by means of specific examples:

1. Data cleaning:

This step cleans the original operation ticket text data: automatic switch tickets, tickets that do not involve switch operation, in-station switch tickets and the like are filtered out, and only non-automatic switch ticket data is kept. For example:

Lv Na 171 line Lv Mengna path 1# ring main unit Lv Na 171 switch is changed from running to cold standby (ring opening)

2. Data labeling:

Named entity recognition (Named Entity Recognition, NER) is a supervised learning task, and the model needs to be trained on a labeled dataset in which each input (in this case, each word sequence) is associated with its corresponding output (the named entity labels). The goal is to learn the mapping from the input data to the output labels so that the model can accurately predict the entities in new, unseen data. In the context of NER, this requires labeling the data to identify the boundaries and types of the named entities present in the text. The labeling process involves manually marking each word segment in the text with a label that indicates the entity type (e.g., switch, ring main unit, line). This scheme adopts BIEO labeling and marks a line entity as 'line', a ring main unit entity as 'ring', a switch entity as 'switch', an initial cold standby entity as 'ori', a target operation entity as 'targ', and so on. The label forms are as follows:

BX: start of type X entity

IX: inside an X-type entity

EX: end of X type entity

O: outside any entity (non-entity)
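The label scheme above can be sketched as a small function that turns entity spans into one label per token. This is an illustrative sketch only: the `B-line` style with a hyphen separator, the choice to give a single-token entity only a `B` label, the function name `bieo_tags`, and the English stand-in tokens are all assumptions not stated in the source.

```python
def bieo_tags(num_tokens, entities):
    """Return one BIEO label per token.

    entities: list of (start, end, type) with `end` inclusive and spans
    non-overlapping. Assumption: a single-token entity gets only "B-<type>".
    """
    tags = ["O"] * num_tokens
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
        if end > start:
            tags[end] = f"E-{etype}"
    return tags


# Hypothetical tokens standing in for a segmented operation ticket.
tokens = ["Lv", "Na", "171", "line", "1#", "ring", "main", "unit", "switch", "opened"]
entities = [(0, 3, "line"), (4, 7, "ring"), (8, 8, "switch")]
tags = bieo_tags(len(tokens), entities)
```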

3. Sentence segmentation:

Because some dispatchers have different habits when filling in operation tickets, the text content may contain punctuation marks. In this step, the labeled data is split into sentences at the punctuation marks and then fed into LSTM+CRF model training;
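A minimal sketch of this segmentation step follows. The delimiter set `PUNCTUATION`, the decision to drop the punctuation tokens themselves, and the function name `split_sentences` are assumptions; the source only states that splitting follows punctuation marks.

```python
# Assumed delimiter set; the source does not list the punctuation marks used.
PUNCTUATION = {"，", "。", "；", ",", ".", ";"}


def split_sentences(tokens, tags):
    """Split parallel token/label lists into sentences at punctuation tokens.

    Assumption: punctuation tokens are dropped and empty sentences skipped.
    """
    sentences, cur_tokens, cur_tags = [], [], []
    for tok, tag in zip(tokens, tags):
        if tok in PUNCTUATION:
            if cur_tokens:
                sentences.append((cur_tokens, cur_tags))
            cur_tokens, cur_tags = [], []
        else:
            cur_tokens.append(tok)
            cur_tags.append(tag)
    if cur_tokens:
        sentences.append((cur_tokens, cur_tags))
    return sentences


# Hypothetical labeled tokens containing one comma.
sample_tokens = ["open", "switch", ",", "close", "line"]
sample_tags = ["O", "B-switch", "O", "O", "B-line"]
sentences = split_sentences(sample_tokens, sample_tags)
```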

4. Splitting the labeled sentences into a word list and a corresponding label list:

The labeled sentences are processed line by line and divided into a word list and a label list, which converts the text data into a form suitable for a sequence labeling task and facilitates model training and prediction. During training, the labels guide the model to learn the correct label for each position, so that the model acquires the ability to identify entities or parts of speech. During prediction, the model predicts the corresponding label list from the word list of the input sentence, thereby identifying the entities or parts of speech in the text.
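The line-by-line split described above can be sketched as follows. The file layout is an assumption: one "word label" pair per line separated by whitespace, with a blank line between sentences. The function name `read_labeled` and the sample text are hypothetical.

```python
def read_labeled(text):
    """Parse labeled text into a list of (word_list, label_list) pairs.

    Assumed layout: "<word> <label>" per line, blank line between sentences.
    """
    sentences, words, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        # Split on the last whitespace run so the label is the final field.
        word, tag = line.rsplit(maxsplit=1)
        words.append(word)
        tags.append(tag)
    if words:
        sentences.append((words, tags))
    return sentences


sample = "Lv B-line\nNa I-line\nline E-line\n\nswitch B-switch\nopened O\n"
parsed = read_labeled(sample)
```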

5. Constructing the vocabulary (all non-repeating words in the corpus):

Constructing a vocabulary is an important step in natural language processing tasks. Its purpose is to convert the operation ticket text data into a numeric representation that the LSTM+CRF model can process. Its main features are:

Numeric representation: all the operation ticket data forms a corpus, and by constructing a vocabulary each word is mapped to a unique integer index, so that the text data is converted into a numeric representation that is convenient for the computer model to process.

Reducing the data dimension: operation ticket text data typically has a high dimension, with each word being a feature. By constructing the vocabulary, the operation ticket text data can be converted into a low-dimensional numeric sequence, which reduces the computational complexity of model training and inference.

Extracting features: the operation ticket vocabulary maps each word to a unique index, and the words are sorted by information such as frequency, so that common words are mapped to smaller indexes and common feature information is extracted.

Maintaining consistency: constructing the operation ticket vocabulary ensures that each word has a unique index and keeps the text data consistent across different stages, for example by using the same indexes in the training set, the validation set and the test set.

6. Building the dictionaries ("word: number", "number: word", "tag: word", "word: tag"):

The operation ticket vocabulary and the entity tags are counted, the corresponding dictionaries are constructed, and each word is mapped to a unique integer index.
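The vocabulary and dictionary steps can be sketched together as below. This is a hedged illustration: the reserved `<PAD>` and `<UNK>` entries, the reading of the four dictionaries as word-to-index, index-to-word, tag-to-index and index-to-tag, and the function name `build_dicts` are assumptions, since the source does not show the dictionary contents.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # assumed reserved entries (index 0 and 1)


def build_dicts(sentences):
    """Build word/tag dictionaries from (word_list, tag_list) pairs."""
    word_counts = Counter(w for words, _ in sentences for w in words)
    tag_names = sorted({t for _, tags in sentences for t in tags})
    # most_common() orders by descending frequency, so frequent words
    # receive smaller indexes; ties keep first-seen order.
    vocab = [PAD, UNK] + [w for w, _ in word_counts.most_common()]
    word2id = {w: i for i, w in enumerate(vocab)}
    id2word = {i: w for w, i in word2id.items()}
    # Index 0 is reserved for padding so it never collides with a real tag.
    tag2id = {t: i for i, t in enumerate([PAD] + tag_names)}
    id2tag = {i: t for t, i in tag2id.items()}
    return word2id, id2word, tag2id, id2tag


# Hypothetical labeled sentences.
corpus = [
    (["Lv", "Na", "line"], ["B-line", "I-line", "E-line"]),
    (["Lv", "switch"], ["O", "B-switch"]),
]
word2id, id2word, tag2id, id2tag = build_dicts(corpus)
```

Reserving index 0 for padding keeps the padded positions distinguishable from real words and labels in later steps.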

7. Sentence vectorization:

In the training process of a machine learning model, the text data can be converted into a numeric representation through text vectorization on the basis of the generated dictionaries, which makes it convenient for the model to train and learn. Since sentences in natural language data differ in length, the text data needs to be padded or truncated during vectorization.

In this example, the operation ticket text is vectorized by converting the words into their corresponding indexes, and each sentence is padded or truncated to the maximum sentence length so that all sentences have the same length. Here the maximum sentence length is set to 10, so sentences shorter than 10 are padded with 0 at the end until every sentence has length 10. The resulting vectorized representations are the index sequence of the sentence and the index sequence of its tags.
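The padding and truncation rule above (maximum length 10, padding value 0 at the end) can be sketched as follows. The dictionaries, the unknown-word index, and the function name `to_ids` are hypothetical values chosen for illustration.

```python
MAX_LEN = 10  # maximum sentence length stated in the text
PAD_ID = 0    # padding value stated in the text


def to_ids(items, mapping, unk_id, max_len=MAX_LEN):
    """Map items to indexes, truncate to max_len, then pad with PAD_ID at the end."""
    ids = [mapping.get(item, unk_id) for item in items[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical dictionaries; index 1 is an assumed unknown-word entry.
word2id = {"<PAD>": 0, "<UNK>": 1, "Lv": 2, "Na": 3, "line": 4}
tag2id = {"<PAD>": 0, "B-line": 1, "I-line": 2, "E-line": 3, "O": 4}

words = ["Lv", "Na", "line", "opened"]
tags = ["B-line", "I-line", "E-line", "O"]
x = to_ids(words, word2id, unk_id=word2id["<UNK>"])
y = to_ids(tags, tag2id, unk_id=tag2id["O"])
```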

8. Splitting into a training set, a validation set and a test set:

The vectorized data is split into a training set, a validation set and a test set, divided in an 8:2 ratio by label category, for subsequent model training, evaluation and prediction.
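A simple split could look like the sketch below. The source does not say how the 8:2 ratio is distributed over three sets, so this sketch assumes 80% for training with the remaining 20% halved into validation and test; it also shuffles randomly rather than stratifying by label category. The function name `split_dataset` and the seed are hypothetical.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle and split into train / validation / test lists.

    Assumption: the held-out 20% is divided equally between
    validation and test; no stratification by label is applied.
    """
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    held_out = shuffled[n_train:]
    n_val = len(held_out) // 2
    return shuffled[:n_train], held_out[:n_val], held_out[n_val:]


samples = list(range(100))  # stand-ins for vectorized sentences
train, val, test = split_dataset(samples)
```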

9. Saving as a pkl file:

The data is stored as a pkl binary file that contains the vocabulary, the tag table and the vectorized training set, validation set and test set data, so that it can be loaded quickly during subsequent model training and testing.
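Saving and reloading such a bundle can be done with Python's standard `pickle` module, as in this minimal sketch. The dictionary keys, the file name `ticket_ner.pkl`, and the helper names are hypothetical.

```python
import os
import pickle
import tempfile


def save_dataset(path, bundle):
    """Serialize the dataset bundle to a binary .pkl file."""
    with open(path, "wb") as f:
        pickle.dump(bundle, f)


def load_dataset(path):
    """Load a dataset bundle written by save_dataset."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical bundle layout: dictionaries plus the three vectorized splits.
bundle = {
    "word2id": {"<PAD>": 0, "<UNK>": 1, "Lv": 2},
    "tag2id": {"<PAD>": 0, "O": 1, "B-line": 2},
    "train": [([2, 0], [2, 0])],
    "val": [],
    "test": [],
}
path = os.path.join(tempfile.mkdtemp(), "ticket_ner.pkl")
save_dataset(path, bundle)
restored = load_dataset(path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.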

10. Constructing the LSTM+CRF neural network model:

The constructed LSTM-CRF model structure is shown in FIG. 2: the word vectors are input into a bidirectional LSTM layer, the LSTM output is mapped to a score for each label through a fully connected layer, and finally the CRF layer decodes the scores to obtain the optimal label sequence.
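The scoring and decoding described here can be written out using the standard linear-chain CRF formulation. The source gives no formulas, so the notation is an assumption: $P_{t,j}$ is the score of label $j$ at position $t$ produced by the fully connected layer, $A_{i,j}$ is a learned transition score from label $i$ to label $j$, and $T$ is the sentence length.

```latex
% Score of a label sequence y for input X
s(X, y) = \sum_{t=1}^{T} P_{t,\,y_t} + \sum_{t=2}^{T} A_{y_{t-1},\,y_t}

% Conditional log-likelihood maximized during training
\log p(y \mid X) = s(X, y) - \log \sum_{y'} \exp\bigl(s(X, y')\bigr)

% Decoding: the optimal label sequence
y^{*} = \arg\max_{y'} \; s(X, y')
```

The first term rewards labels that fit each position, and the second rewards plausible label-to-label transitions, which is how illegal sequences receive low scores.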

Long short-term memory (LSTM) is a special kind of RNN designed mainly to address vanishing and exploding gradients when training on long sequences. In short, LSTM performs better on longer text sequences than an ordinary RNN.

LSTM has a strong performance advantage in processing sequence problems, so this scheme uses the BiLSTM+CRF model for operation ticket text recognition. The overall model architecture is as follows:

Input layer: receives the input sequence of the operation ticket text, in which each word is generally represented by its index.

Word embedding layer (Embedding Layer): converts each input word index into the corresponding word vector. This scheme uses PyTorch's built-in word vector generation.

Bidirectional LSTM layer (Bidirectional LSTM Layer): receives the word vector sequence and processes the sequence information in two directions, forward and backward. This captures the context information of each word segment.

Fully connected layer (Linear Layer): maps the output of the LSTM to the label space, that is, it maps the output dimension of the LSTM to the number of labels in order to score each label.

Conditional random field layer (CRF Layer): decodes the output sequence and models the dependencies between labels in the sequence. The CRF layer decodes on the output of the LSTM to obtain the optimal label sequence, so that the output label sequence has the highest overall probability.
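The decoding performed by the CRF layer is typically a Viterbi search over the per-label scores. The following pure-Python sketch illustrates that search under stated assumptions: it is not the patent's implementation, it omits start and end transition scores, and the three-label toy example (`O`, `B`, `E`) with its score values is hypothetical.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_path, best_score) for a linear-chain model.

    emissions[t][j]: score of label j at position t (fully connected layer output).
    transitions[i][j]: score of moving from label i to label j.
    """
    num_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        prev_best, new_score = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            prev_best.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        backpointers.append(prev_best)
        score = new_score
    last = max(range(num_tags), key=lambda j: score[j])
    path = [last]
    for prev_best in reversed(backpointers):
        path.append(prev_best[path[-1]])
    path.reverse()
    return path, max(score)


# Toy labels: 0 = O, 1 = B, 2 = E. Large negative scores forbid O->E, B->O, B->B, E->E.
NEG = -100.0
transitions = [
    [0.0, 0.0, NEG],  # from O
    [NEG, NEG, 0.0],  # from B
    [0.0, 0.0, NEG],  # from E
]
emissions = [
    [0.1, 2.0, 0.5],
    [1.0, 0.2, 0.9],  # position-wise argmax would pick O here, giving illegal B->O
    [1.5, 0.1, 0.3],
]
best_path, best_score = viterbi_decode(emissions, transitions)
```

In this toy case a position-by-position argmax would output B, O, O, whereas the search returns B, E, O because the transition scores rule out leaving an entity unclosed.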

11. Operation ticket named entity prediction and recognition:

The operation ticket text is fed into the model for prediction, and the named entities in the operation ticket are recognized automatically. According to the actual result figure, all entities in the operation ticket are parsed correctly.
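Once the model has produced a label sequence, the entities can be read back from the BIEO labels. The sketch below assumes `B-<type>` style labels, tolerates an entity that lacks a closing `E` label, and joins tokens with spaces (for character-level Chinese tokens an empty separator would be used instead). The function name `extract_entities` and the sample data are hypothetical.

```python
def extract_entities(tokens, tags):
    """Collect (type, text) pairs from a BIEO label sequence."""
    entities = []
    i = 0
    while i < len(tags):
        prefix, _, etype = tags[i].partition("-")
        if prefix != "B":
            i += 1
            continue
        end = i
        # Extend over I-<type> labels, then an optional closing E-<type> label.
        while end + 1 < len(tags) and tags[end + 1] == f"I-{etype}":
            end += 1
        if end + 1 < len(tags) and tags[end + 1] == f"E-{etype}":
            end += 1
        entities.append((etype, " ".join(tokens[i:end + 1])))
        i = end + 1
    return entities


# Hypothetical tokens and predicted labels.
tokens = ["Lv", "Na", "171", "line", "1#", "ring", "main", "unit", "switch", "opened"]
predicted = ["B-line", "I-line", "I-line", "E-line",
             "B-ring", "I-ring", "I-ring", "E-ring", "B-switch", "O"]
found = extract_entities(tokens, predicted)
```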

In addition to the above embodiments, other embodiments of the present invention are possible, and all technical solutions formed by equivalent substitution or equivalent transformation are within the scope of the present invention.

Claims

1. An operation ticket named entity recognition method based on a BiLSTM+CRF model, characterized by comprising the following steps:

3) Data labeling is carried out by adopting a BIEO labeling method: the initial position of the entity is marked as 'B', the middle part is marked as 'I', the non-entity part is marked as 'O', and the end position is marked as 'E';

2. The operation ticket named entity recognition method based on a BiLSTM+CRF model as set forth in claim 1, wherein said LSTM+CRF neural network model architecture is as follows: