CN113988054A - Entity identification method for coal mine safety field

Entity identification method for coal mine safety field

Info

Publication number
CN113988054A
CN113988054A (application CN202111301680.7A)
Authority
CN
China
Prior art keywords
model
entity
layer
coal mine
mine safety
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111301680.7A
Other languages
Chinese (zh)
Other versions
CN113988054B (en)
Inventor
刘鹏
冯琳
史新国
刘珂
谢亚波
王莹
余钱坤
曹新晨
程浩然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202111301680.7A
Priority claimed from CN202111301680.7A
Publication of CN113988054A
Application granted
Publication of CN113988054B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an entity identification method for the coal mine safety field, suitable for informatization in that field, comprising three stages: construction of a coal mine safety field entity data set, model design, and model training. In the data set construction stage, the data are cleaned, processed and labeled into a usable data set. In the model design stage, a RoBERTa pre-trained language model serves as the input layer; its larger training corpus, larger batch size and dynamic masking yield feature inputs that better fit the context. A bidirectional long short-term memory network then further learns the context, an attention mechanism assigns different weights to the sequence elements, and a CRF computes the most probable state path to obtain the final entity category, together forming the MSRBAC model. Finally, the trained MSRBAC model is used for entity identification in the coal mine safety field. The method has simple steps, is convenient to use, and is widely applicable.

Description

Entity identification method for coal mine safety field
Technical Field
The invention relates to entity identification methods, and in particular to an entity identification method for the coal mine safety field, suitable for informatization in that field.
Background
Coal is an important basic energy source in China and accounts for a large share of the energy consumed. With the development of Internet technology, the total volume of coal mine scientific data has grown enormously and coal-mine-safety-related information is increasing explosively, so efficiently mining valuable information has become a key technical problem. Driving coal mine industrialization along a new industrial path through informatization and intelligence, and building modern mines, has become the inevitable route for coal mining enterprises to raise mine safety assurance, achieve high output and high efficiency, and increase competitiveness, and is an important direction for promoting science and technology.
Early named entity recognition relied mainly on dictionaries and rules; rule-based systems include LaSIE-II, NetOwl and Facile, while researchers such as Wang Ning and Krupka performed entity recognition by building dictionary and rule bases. These methods depend on rule templates supplied by language-domain experts, require great manual effort to build dictionaries, and recognize out-of-dictionary and similar words poorly. To address this, typical statistical machine learning methods were proposed, such as the Hidden Markov Model (HMM), Maximum Entropy (ME), Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF) and Support Vector Machine (SVM). However, such methods generally treat NER as a sequence labeling problem, require a large-scale corpus to learn a labeling model for each position of a sentence, and depend on manually designed features. In recent years, with growing computing power, deep learning methods have been widely applied; common models include the Convolutional Neural Network (CNN), the long short-term memory network (LSTM) and the GRU; for example, Dong et al. first applied a character-vector-based BiLSTM-CRF architecture to Chinese named entity recognition in 2016. Such methods mostly use static word vectors such as Word2vec and GloVe as the word embedding layer, but static word vectors cannot adapt to the dynamic changes of different contexts. Contextualized word vectors such as CoVe, ULMFiT and ELMo emerged to solve this, yet problems remain, such as "seeing oneself", i.e., the next word to be predicted already appearing in the given sequence. The BERT pre-trained language model released by Google in October 2018 set new records on 11 NLP tasks, a milestone for natural language processing, and numerous modified versions of BERT have since followed.
Coal mine safety has always been an important field of safety production. Compared with other fields, its informatization has certain particularities: the lack of data and the absence of large-scale labeled data sets bring certain difficulties. Entity identification in the coal mine safety field is an important basis for coal mine informatization such as knowledge graphs and question answering systems in the coal mine field.
Disclosure of Invention
The invention content is as follows: aiming at the defects of the prior art, an entity identification method for the coal mine safety field is provided that is simple in steps and accurate in judgment.
In order to achieve this technical purpose, the entity identification method for the coal mine safety field first collects coal mine field nouns as entities to form a coal mine safety field entity data set, then trains the MSRBAC model with that data set, and finally uses the trained MSRBAC model to identify the entities in coal mine safety field text;
S1, constructing a coal mine safety field entity data set:
S1.1, designing keywords related to the coal mine safety field, collecting the relevant corpora of the field by document retrieval and web crawling, and parsing them into text as the raw corpus of the coal mine safety field; then cleaning and preprocessing the raw corpus, i.e., removing useless symbols in the text and truncating the text to the length acceptable to the model, forming the data to be labeled;
S1.2, setting the entity types of the coal mine safety field according to the national standard GB/T 15663 "coal mine science and technology terminology" and domain expert opinions;
S1.3, performing character-level BIO labeling on the data sentence by sentence with a semi-automatic labeling method fusing multiple methods: dictionary reverse matching, CRF++ automatic labeling and iterative manual checking; B (begin) denotes the first character of an entity noun, I (inside) denotes a non-initial character of an entity noun, and O (other) denotes other words that need no label. This forms the coal mine safety field entity data set, an entity annotation set in which the categories and positions of the entities are labeled in the text; the data set is divided into a training set, a validation set and a test set in a 6:2:2 ratio for iterative model training and prediction;
S2, establishing the MSRBAC (Mine Safety RoBERTa BiLSTM Attention CRF) model:
The first layer of the MSRBAC model is a character embedding layer formed by a RoBERTa pre-trained language model. The tokens in the sequence, i.e., the words segmented at the character level of a sentence, are randomly masked and predicted before each of multiple rounds of data feeding, so that the entity characteristics in the coal mine safety field entity data set are continuously learned and a trained RoBERTa pre-trained language model is obtained. The RoBERTa pre-trained language model is then used as the feature extractor of the input sequence: the characters/words are converted into the word vectors obtained in training and given the weights obtained in pre-training, which improves on the training speed of random initialization;
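By way of illustration, the dynamic masking idea can be sketched as follows in Python; the 15% masking rate and the [MASK] token are RoBERTa-style defaults assumed here, not values stated in this description:

    import random

    def dynamic_mask(tokens, mask_token="[MASK]", mask_rate=0.15, rng=random):
        # A fresh mask is drawn every time the sequence is fed, so each
        # round of data feeding sees a different masking pattern.
        masked, targets = [], []
        for tok in tokens:
            if rng.random() < mask_rate:
                masked.append(mask_token)   # hide the token ...
                targets.append(tok)         # ... and have the model predict it
            else:
                masked.append(tok)
                targets.append(None)        # nothing to predict at this position
        return masked, targets

    # Each call re-masks the same character-level token sequence differently:
    # dynamic_mask(list("采煤工作面瓦斯涌出"))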
The second layer of the MSRBAC model is a BiLSTM layer comprising two linearly connected bidirectional long short-term memory networks (BiLSTM), built with tf.nn.bidirectional_dynamic_rnn in TensorFlow; it performs further feature extraction on the embedding sequence converted into word vectors, in order to better understand the context semantics of the current input text. A dropout layer is connected behind the BiLSTM layer; it randomly discards neurons in the network to prevent overfitting and enhance the generalization ability of the model. Meanwhile, the gate networks in the LSTM are used to handle the retention of long-term and short-term information; they comprise a forget gate, an input gate and an output gate: the forget gate decides which information to discard or retain; the input gate filters information, selectively discarding information without practical significance; and the output gate uses a sigmoid function to decide the information to be output from the current cell state;
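A minimal TensorFlow 1.x sketch of such a BiLSTM block follows (tf.nn.bidirectional_dynamic_rnn as named above; the hidden size of 256 is taken from the embodiment below, and the exact wiring of the two stacked layers is an assumption):

    import tensorflow as tf  # TensorFlow 1.x, matching the tf.nn/tf.contrib APIs used here

    def bilstm_block(embeddings, seq_lengths, hidden_size=256, keep_prob=0.6, layers=2):
        h = embeddings                          # [batch, T, dim] vectors from the RoBERTa layer
        for i in range(layers):                 # two linearly connected BiLSTM layers
            with tf.variable_scope("bilstm_%d" % i):
                cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size)
                cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size)
                (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
                    cell_fw, cell_bw, h, sequence_length=seq_lengths, dtype=tf.float32)
                h = tf.concat([out_fw, out_bw], axis=-1)   # [batch, T, 2*hidden_size]
        return tf.nn.dropout(h, keep_prob=keep_prob)        # dropout rate 0.4 -> keep_prob 0.6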
The third layer of the MSRBAC model is an Attention layer. The hidden vectors produced by the BiLSTM already contain rich context feature information, but since these features all carry the same weight they can cause large errors when distinguishing entity types, so the hidden vectors h_t = [h_1, h_2, ..., h_T] output by the BiLSTM layer are given different weights a_t, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length; the correlations between characters, captured by the weights a_t, are used to divide the word boundaries of the sentence;
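The form of the scoring function is not specified further; a sketch under the common assumption of additive scoring, e_t = v^T tanh(W h_t + b), could be:

    import tensorflow as tf  # TensorFlow 1.x

    def attention_block(h, score_units=128):
        # h: [batch, T, d] hidden vectors from the BiLSTM layer.
        # The description only states that a learnable function provides the
        # weights a_t; additive scoring is assumed here.
        e = tf.layers.dense(tf.tanh(tf.layers.dense(h, score_units)), 1)  # [batch, T, 1]
        a = tf.nn.softmax(e, axis=1)   # weights a_t, normalized over the T positions
        return a * h                   # per-position reweighted sequence fed to the CRF layer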
The fourth layer of the MSRBAC model is a CRF layer, used to decide the BIO label category of each Chinese character in the text; the CRF layer is built with tf.contrib.crf and constrains the labels of the previous step, e.g., the first label of a corpus sequence cannot start with I;
S3, training the MSRBAC model:
The raw corpus text of the coal mine field, after data cleaning, processing and labeling, is fed as data into the RoBERTa-layer pre-trained language model and converted into word vectors by character-level token; the token vectors of the RoBERTa layer are input into the BiLSTM layer for further learning, so that the MSRBAC model better understands the context of the text; Attention assigns a different weight to each h_t in the sequence to highlight the important elements; finally the CRF layer selects the label sequence with the highest prediction score as the best answer.
The entity categories of step S1.2 are set to 9 classes, namely Coal Geology and Exploration (CRP), Roadway Engineering (SDE), Coal Mining (CM), Hoisting and Transportation (HT), Mine Survey (MMS), Coal Mine Safety (MS), Blasting Materials and Technology (EMBT), Mining Machinery and Development Machinery (WMDM) and Coal Mine Electrical (CMEE), each identified by its English acronym.
In step S1.3, the multi-method fused semi-automatic labeling method first stores the entities obtained from the national standard GB/T 15663 "coal mine science and technology terminology" and Baidu Encyclopedia into corresponding dictionaries by category and runs a script that reverse-labels the data set to be annotated; the dictionary-labeled entities are then expanded with the CRF++ algorithm, which greatly reduces the labor cost; finally, manual correction labels the entities that correspond to a category but were not labeled and corrects wrong labels, completing the annotation of the data set.
Step S1.3 specifies that a sentence is BIO-labeled at the character level: B indicates that the current character is the first character of the entity it belongs to, I indicates that the current character is inside or at the end of that entity, and O indicates that the current character is not part of any entity.
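A minimal sketch of the dictionary reverse-matching step that produces such BIO tags (longest match first; the sample dictionary entry is hypothetical):

    def bio_label(sentence, lexicon):
        # lexicon: {entity_string: category}; unmatched characters stay O.
        tags = ["O"] * len(sentence)
        terms = sorted(lexicon, key=len, reverse=True)   # prefer the longest entity
        i = 0
        while i < len(sentence):
            for term in terms:
                if sentence.startswith(term, i):
                    cat = lexicon[term]
                    tags[i] = "B-" + cat                 # first character of the entity
                    for j in range(i + 1, i + len(term)):
                        tags[j] = "I-" + cat             # non-initial characters
                    i += len(term)
                    break
            else:
                i += 1
        return list(zip(sentence, tags))

    # e.g. bio_label("瓦斯爆炸事故", {"瓦斯爆炸": "MS"})  # hypothetical dictionary entry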
Through the MSRBAC model for coal mine safety entity identification, a text is converted into a data sequence; the BiLSTM layer and the Attention layer compute, for the hidden state of each position, the probabilities of all BIO labels, and the per-position label probabilities are input into the CRF layer; a scoring function s(X, y) encapsulated by TensorFlow scores every possible label path,

s(X, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i},

the scores are converted into probabilities with a softmax normalization function,

p(y|X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

and finally the probability of the correct label sequence is maximized by maximum-likelihood estimation:

log p(y|X) = s(X, y) - log Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

where Y_X denotes all possible label sequences, A is the transition matrix and P is the weight matrix; through this computation the optimal label path is obtained, and the useful entities in the sequence and their types are finally found;
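A sketch of this CRF layer with tf.contrib.crf from TensorFlow 1.x (the dense projection producing the emission matrix P, and the tag count of 19 = 2 x 9 categories + O, are inferred rather than stated):

    import tensorflow as tf
    from tensorflow.contrib import crf  # TensorFlow 1.x

    def crf_block(features, labels, seq_lengths, num_tags=19):
        # features: [batch, T, d] outputs of the attention layer;
        # labels:   [batch, T] gold BIO tag indices.
        logits = tf.layers.dense(features, num_tags)        # emission scores, matrix P
        log_lik, trans = crf.crf_log_likelihood(
            logits, labels, seq_lengths)                    # learns transition matrix A
        loss = tf.reduce_mean(-log_lik)                     # maximum-likelihood objective
        pred_tags, _ = crf.crf_decode(logits, trans, seq_lengths)  # best-scoring path
        return loss, pred_tags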
A dropout layer is arranged after the RoBERTa layer and after the BiLSTM layer, respectively, to prevent overfitting; the dropout rate after the RoBERTa layer is set to 0.1 and the dropout rate after the BiLSTM layer to 0.4;
Texts in the coal mine safety field entity data set are input into the MSRBAC model sentence by sentence, in character-level form together with their BIO label sequences; introducing the RoBERTa pre-trained language model into coal mine safety entity identification reduces the required coal mine safety entity sample size and the labor cost; RoBERTa converts the input characters into feature word vectors and is followed by a dropout layer, which avoids overfitting by randomly discarding neurons in the network; the output feature word vectors serve as the input of the subsequent model structure;
When the input text is too long, problems such as gradient vanishing and gradient explosion arise; gradient clipping solves the gradient explosion problem to a certain extent, and the forget gate, input gate and output gate introduced in the LSTM relieve the gradient vanishing problem. The specific computation for a time step t is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)

where f_t is the forget gate, used to decide which information to discard or retain; i_t is the input gate, which filters the input information and selectively discards some of it; C̃_t is the cell state candidate, prepared for updating the current cell state; C_t is the current cell state, used for long-term memory; o_t is the output gate, which uses a sigmoid layer to decide which part of the cell state will be output; and h_t is the hidden state value, i.e., the output at the current moment, used for short-term memory. Through this computation, the gradient vanishing problem of over-long texts in the neural network is effectively relieved.
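To make the gate equations concrete, a single LSTM time step can be evaluated directly, e.g. with NumPy (the weight layout is an illustrative assumption):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, C_prev, W, b):
        # W: [4*d, d + x_dim] maps [h_{t-1}; x_t] to the four gate pre-activations.
        z = W @ np.concatenate([h_prev, x_t]) + b
        d = h_prev.size
        f_t = sigmoid(z[:d])          # forget gate: what to keep from C_{t-1}
        i_t = sigmoid(z[d:2*d])       # input gate: what new information to admit
        C_hat = np.tanh(z[2*d:3*d])   # cell state candidate
        o_t = sigmoid(z[3*d:])        # output gate
        C_t = f_t * C_prev + i_t * C_hat   # long-term memory
        h_t = o_t * np.tanh(C_t)           # short-term memory / current output
        return h_t, C_t

    # e.g. d = 2 hidden units, 3-dimensional input:
    # rng = np.random.default_rng(0)
    # h, C = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2),
    #                  rng.normal(size=(8, 5)), np.zeros(8))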
Attention is introduced to distribute weights over the hidden vectors output by the BiLSTM, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length.
The MSRBAC model training specifically comprises the following steps:
Model parameters are optimized using crf_log_likelihood as the loss function with the stochastic gradient descent (SGD) algorithm and back propagation to train the MSRBAC model; the model inputs are the training set train and the validation set dev; the hyperparameters are: entity type count entry_num = 9; dropout after the RoBERTa layer 0.1; dropout after the BiLSTM layer 0.4; maximum sentence length max_len = 512; batch size batch_size = 8; total number of training rounds epoch = 50; learning rate lr = 5e-5. The specific training steps are as follows:
feeding the training set train and the validation set dev processed in the previous step into the model through a DataIterator;
randomly shuffling the data in the training set train with a shuffle function and selecting batch_size samples per round to feed the model for training;
the data first flows into RoBERTa, which is fine-tuned on the current sequence, then enters BiLSTM and Attention, where all label probabilities are computed, and finally the CRF inference layer computes the loss value loss;
model parameters are optimized by back propagation; the model trained in the current epoch is then validated on the validation set dev, computing the precision P, the recall R and their harmonic mean F1; the best model's F1 value best_F1 is saved, and if the F1 value fails to exceed the current best_F1 for more than two rounds, training is terminated early and model training ends.
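The loop just described can be sketched as follows; model, its methods and the iterator are placeholders standing in for the actual implementation:

    import random

    def train(model, train_set, dev_set, epochs=50, batch_size=8, patience=2):
        best_f1, bad_rounds = 0.0, 0
        for epoch in range(epochs):
            random.shuffle(train_set)                  # shuffle the training data each round
            for start in range(0, len(train_set), batch_size):
                batch = train_set[start:start + batch_size]
                loss = model.train_step(batch)         # RoBERTa -> BiLSTM + Attention -> CRF loss
            p, r, f1 = model.evaluate(dev_set)         # precision, recall, harmonic mean F1
            if f1 > best_f1:
                best_f1, bad_rounds = f1, 0
                model.save("best_model")               # keep the best checkpoint
            else:
                bad_rounds += 1
                if bad_rounds >= patience:             # F1 failed to improve for two rounds
                    break                              # early stopping
        return best_f1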
The beneficial effects are:
1) The multi-method fused semi-automatic labeling method greatly reduces the labor cost while reducing problems such as mislabels and missed labels caused by purely manual annotation, improving the validity of the data set.
2) The method adopts RoBERTa as the feature input layer, avoiding problems such as slow model convergence and poor results caused by randomly initialized feature weights; semantic decoding is performed by a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF), and an attention mechanism (Attention) is added, building the coal mine safety field entity recognition model MSRBAC and innovatively applying the attention mechanism to entity recognition in the coal mine safety field.
Drawings
FIG. 1 is a block diagram of the MSRBAC model architecture of the present invention;
FIG. 2 is a schematic flow chart of the coal mine safety-oriented entity identification method of the present invention;
FIG. 3 is a data sentence length distribution histogram of the present invention;
FIG. 4 is a schematic diagram of an embodiment of Word Embedding layer input representation;
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings:
As shown in FIG. 2, in the entity identification method for the coal mine safety field of the present invention, the construction stage of the coal mine safety field entity data set cleans, processes and labels the data into a usable data set; in the model design stage, a RoBERTa pre-trained language model serves as the input layer, whose larger training corpus, larger batch size and dynamic masking yield feature inputs that better fit the context; a bidirectional long short-term memory network further learns the context, an attention mechanism assigns different weights to the sequence elements, and the CRF computes the most probable state path to obtain the final entity category, forming the MSRBAC model; finally the trained MSRBAC model is used for entity identification in the coal mine safety field.
Specifically, firstly, a coal mine field noun is collected as an entity to form a coal mine safety field entity data set, then the coal mine safety field entity data set is used for training an MSRBAC model, and finally the trained MSRBAC model is used for identifying the entity in the coal mine safety field entity data set;
S1, constructing a coal mine safety field entity data set:
S1.1, designing keywords related to the coal mine safety field, collecting the relevant corpora of the field by document retrieval and web crawling, and parsing them into text as the raw corpus of the coal mine safety field; then cleaning and preprocessing the raw corpus, i.e., removing useless symbols in the text and truncating the text to the length acceptable to the model, forming the data to be labeled;
S1.2, setting the entity types of the coal mine safety field according to the national standard GB/T 15663 "coal mine science and technology terminology" and domain expert opinions; the entity categories are set to 9 classes, namely Coal Geology and Exploration (CRP), Roadway Engineering (SDE), Coal Mining (CM), Hoisting and Transportation (HT), Mine Survey (MMS), Coal Mine Safety (MS), Blasting Materials and Technology (EMBT), Mining Machinery (WMDM) and Coal Mine Electrical (CMEE), each identified by its English abbreviation, as shown in the following table:

[Table: the nine entity categories with their English abbreviations; the table images are not recoverable from the source.]
S1.3, performing character-level BIO labeling on the data sentence by sentence with a semi-automatic labeling method fusing dictionary reverse matching, CRF++ automatic labeling and iterative manual checking: B denotes the first character of an entity noun, I denotes a non-initial character of an entity noun, and O denotes other words that need no label; this forms the coal mine safety field entity data set, an entity annotation set in which the categories and positions of the entities are labeled in the text; the data set is divided into a training set, a validation set and a test set in a 6:2:2 ratio for iterative model training and prediction;
The multi-method fused semi-automatic labeling method first stores the entities obtained from the national standard GB/T 15663 "coal mine science and technology terminology" and Baidu Encyclopedia into corresponding dictionaries by category and runs a script that reverse-labels the data set to be annotated; the dictionary-labeled entities are then expanded with the CRF++ algorithm, greatly reducing the labor cost; finally, manual correction labels the entities that were missed and corrects wrong labels, completing the annotation of the data set. A sentence is BIO-labeled at the character level: B indicates that the current character is the first character of the entity it belongs to, I indicates that the current character is inside or at the end of that entity, and O indicates that the current character is not part of any entity.
S2, establishing an MSRBAC model, wherein the structure is shown in figure 1:
The first layer of the MSRBAC model is a character embedding layer formed by a RoBERTa pre-trained language model. The tokens in the sequence, i.e., the words segmented at the character level of a sentence, are randomly masked and predicted before each of multiple rounds of data feeding, so that the entity characteristics in the coal mine safety field entity data set are continuously learned and a trained RoBERTa pre-trained language model is obtained. The RoBERTa pre-trained language model is then used as the feature extractor of the input sequence: the characters/words are converted into the word vectors obtained in training and given the weights obtained in pre-training, which improves on the training speed of random initialization;
The second layer of the MSRBAC model is a BiLSTM layer comprising two linearly connected bidirectional long short-term memory networks (BiLSTM), built with tf.nn.bidirectional_dynamic_rnn in TensorFlow; it performs further feature extraction on the embedding sequence converted into word vectors, in order to better understand the context semantics of the current input text. A dropout layer is connected behind the BiLSTM layer; it randomly discards neurons in the network to prevent overfitting and enhance the generalization ability of the model. Meanwhile, the gate networks in the LSTM are used to handle the retention of long-term and short-term information; they comprise a forget gate, an input gate and an output gate: the forget gate decides which information to discard or retain; the input gate filters information, selectively discarding information without practical significance; and the output gate uses a sigmoid function to decide the information to be output from the current cell state;
The third layer of the MSRBAC model is an Attention layer. The hidden vectors produced by the BiLSTM already contain rich context feature information, but since these features all carry the same weight they can cause large errors when distinguishing entity types, so the hidden vectors h_t = [h_1, h_2, ..., h_T] output by the BiLSTM layer are given different weights a_t, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length; the correlations between characters, captured by the weights a_t, are used to divide the word boundaries of the sentence;
The fourth layer of the MSRBAC model is a CRF layer, used to decide the BIO label category of each Chinese character in the text; the CRF layer is built with tf.contrib.crf and constrains the labels of the previous step, e.g., the first label of a corpus sequence cannot start with I;
Through the MSRBAC model for identifying coal mine safety entities, the text is converted into a data sequence; the BiLSTM layer and the Attention layer compute, for the hidden state of each position, the probabilities of all BIO labels, and the per-position label probabilities are input into the CRF layer; a scoring function s(X, y) encapsulated by TensorFlow scores every possible label path,

s(X, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i},

the scores are converted into probabilities with a softmax normalization function,

p(y|X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

and finally the probability of the correct label sequence is maximized by maximum-likelihood estimation:

log p(y|X) = s(X, y) - log Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

where Y_X denotes all possible label sequences, A is the transition matrix and P is the weight matrix; through this computation the optimal label path is obtained, and the useful entities in the sequence and their types are finally found;
A dropout layer is arranged after the RoBERTa layer and after the BiLSTM layer, respectively, to prevent overfitting; the dropout rate after the RoBERTa layer is set to 0.1 and the dropout rate after the BiLSTM layer to 0.4;
Texts in the coal mine safety field entity data set are input into the MSRBAC model sentence by sentence, in character-level form together with their BIO label sequences; introducing the RoBERTa pre-trained language model into coal mine safety entity identification reduces the required coal mine safety entity sample size and the labor cost; RoBERTa converts the input characters into feature word vectors and is followed by a dropout layer, which avoids overfitting by randomly discarding neurons in the network; the output feature word vectors serve as the input of the subsequent model structure;
S3, training the MSRBAC model:
The raw corpus text of the coal mine field, after data cleaning, processing and labeling, is fed as data into the RoBERTa-layer pre-trained language model and converted into word vectors by character-level token; the token vectors of the RoBERTa layer are input into the BiLSTM layer for further learning, so that the MSRBAC model better understands the context of the text; Attention assigns a different weight to each h_t in the sequence to highlight the important elements; finally the CRF layer selects the label sequence with the highest prediction score as the best answer;
When the input text is too long, problems such as gradient vanishing and gradient explosion arise; gradient clipping solves the gradient explosion problem to a certain extent, and the forget gate, input gate and output gate introduced in the LSTM relieve the gradient vanishing problem. The specific computation for a time step t is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)

where f_t is the forget gate, used to decide which information to discard or retain; i_t is the input gate, which filters the input information and selectively discards some of it; C̃_t is the cell state candidate, prepared for updating the current cell state; C_t is the current cell state, used for long-term memory; o_t is the output gate, which uses a sigmoid layer to decide which part of the cell state will be output; and h_t is the hidden state value, i.e., the output at the current moment, used for short-term memory. Through this computation, the gradient vanishing problem of over-long texts in the neural network is effectively relieved.
Attention is introduced to distribute weights over the hidden vectors output by the BiLSTM, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length.
The MSRBAC model training specifically comprises the following steps:
Model parameters are optimized using crf_log_likelihood as the loss function with the stochastic gradient descent (SGD) algorithm and back propagation to train the MSRBAC model; the model inputs are the training set train and the validation set dev; the hyperparameters are: entity type count entry_num = 9; dropout after the RoBERTa layer 0.1; dropout after the BiLSTM layer 0.4; maximum sentence length max_len = 512; batch size batch_size = 8; total number of training rounds epoch = 50; learning rate lr = 5e-5. The specific training steps are as follows:
feeding the training set train and the validation set dev processed in the previous step into the model through a DataIterator;
randomly shuffling the data in the training set train with a shuffle function and selecting batch_size samples per round to feed the model for training;
the data first flows into RoBERTa, which is fine-tuned on the current sequence, then enters BiLSTM and Attention, where all label probabilities are computed, and finally the CRF inference layer computes the loss value loss;
model parameters are optimized by back propagation; the model trained in the current epoch is then validated on the validation set dev, computing the precision P, the recall R and their harmonic mean F1; the best model's F1 value best_F1 is saved, and if the F1 value fails to exceed the current best_F1 for more than two rounds, training is terminated early and model training ends.
The first embodiment:
As can be clearly observed from FIG. 1, the entity recognition model for the coal mine safety field consists of 4 parts, namely the RoBERTa pre-trained language model, the BiLSTM bidirectional long short-term memory network, the Attention mechanism and the CRF conditional random field; the output of each layer is the input of the next layer, and inputting a sentence into the model yields the entities in the sentence and their categories.
The method specifically comprises the following steps:
(1) Coal mine safety field data acquisition
Because no labeled or public data set in the coal mine safety field is currently available for model training, relevant documents in the field were obtained by document retrieval, web crawling, etc., as raw corpora. Combined with domain expert opinions, the data set adopted by the invention has the following 3 main sources: 1) national standard publishing websites (such as the national standard information public service platform and the national standard network), which provide entity type definitions and entities; 2) encyclopedia websites (such as Baidu Baike and Sogou Baike), which provide entries related to the coal mine safety field; 3) coal mine websites (such as coal mine safety nets), which provide massive text resources related to coal mine safety. These unstructured corpora serve as the corpus of the coal mine safety field entity data set.
Static web pages are crawled with the urllib2 package in Python and dynamic pages with the selenium package in Python. Source 1) is collected by document retrieval and can be downloaded manually; most of this data is stored as images, which the model cannot parse, so the images are converted to text with the OCR character recognition API provided by Baidu Intelligent Cloud. Sources 2) and 3) are crawled web pages containing html tags, links and other noise, which are cleaned with Python regular expressions. All source data are parsed and saved to txt files for convenient batch processing in Python. The concept entities in sources 1) and 2) are sorted, summarized and, combined with domain expert opinions, divided into 9 classes; 9 corresponding dictionaries are built to store the entities.
Noise such as invisible symbols and emoticons is removed from the data of source 3) with regular expressions; after cleaning, the data are split into sentences by periods and similar punctuation. Observing the sentence-length distribution shown in FIG. 3 and considering the maximum input length of RoBERTa, the maximum sentence length is set to 510 characters and over-long texts are truncated. This yields 6602 sentences in total, forming the raw corpus to be labeled. The details are shown in the following table:
[Table: composition of the raw corpus to be labeled; the table image is not recoverable from the source.]
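A minimal sketch of the cleaning and sentence-splitting step described above (the particular regular expressions standing in for "useless symbols" are illustrative assumptions):

    import re

    MAX_LEN = 510  # RoBERTa accepts 512 tokens including [CLS]/[SEP], leaving 510 characters

    def clean_and_split(raw_text):
        text = re.sub(r"<[^>]+>", "", raw_text)              # residual html tags (illustrative)
        text = re.sub(r"[\u200b\ufeff\xa0\s]+", "", text)    # invisible symbols, stray whitespace
        sentences = [s for s in re.split(r"[。！？]", text) if s]  # split at sentence-final punctuation
        return [s[:MAX_LEN] for s in sentences]              # truncate over-long sentences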
(2) Coal mine safety field entity labeling
The invention adopts the multi-method fused semi-automatic BIO labeling method: a python script first reverse-labels the raw corpus to be annotated against the entities in the 9 category dictionaries; the result is then input into the CRF++ algorithm, whose feature templates expand the dictionary-labeled entities; finally, wrong labels are corrected manually, completing the annotation of the data set. The labeled sentences are shuffled and divided into a training set, a validation set and a test set in an approximate 6:2:2 ratio; their entity distributions are shown in Tables 2 and 3.
(3) Model training
The training set and the validation set are fed into the MSRBAC coal mine safety field entity recognition model for training. The model vectorizes inputs with RoBERTa; the input representation is shown in FIG. 4. The model hyperparameters are set as follows: the RoBERTa Transformer encoder has 12 layers, hidden dimension 768 and 12 attention heads; the BiLSTM hidden size is set to 256; the maximum sentence length is 512; the entity type count entry_num is 9; batch_size is 8; epoch is 50; to prevent model overfitting, dropout after the RoBERTa layer is set to 0.1 and after the BiLSTM layer to 0.4; the learning rate is 5e-5.
To reduce the evaluation bias caused by category imbalance, the micro-averaged F1 value (micro-F1) is used as the model performance evaluation index, computed as:

P_micro = Σ_i TP_i / Σ_i (TP_i + FP_i), R_micro = Σ_i TP_i / Σ_i (TP_i + FN_i), micro-F1 = 2 · P_micro · R_micro / (P_micro + R_micro)

During training, the optimal F1 and loss values are recorded; if the F1 value does not change for more than two rounds or the loss does not improve for more than 1000 steps, model training is terminated early; the model hyperparameters are adjusted by observing the validation set, and the model is retrained until the hyperparameters reach their optimal values.
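Micro-averaging pools the counts over all 9 categories before computing P and R; a small sketch:

    def micro_f1(counts):
        # counts: {category: (tp, fp, fn)} pooled over the evaluation data.
        tp = sum(c[0] for c in counts.values())
        fp = sum(c[1] for c in counts.values())
        fn = sum(c[2] for c in counts.values())
        p = tp / (tp + fp) if tp + fp else 0.0   # micro precision
        r = tp / (tp + fn) if tp + fn else 0.0   # micro recall
        return 2 * p * r / (p + r) if p + r else 0.0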
(4) Model prediction
The test set is fed into the optimal model obtained in step (3); during prediction, the Viterbi algorithm in the CRF computes the state path with the highest probability among all possible sequences:

y* = argmax_{ỹ∈Y_X} s(X, ỹ)

from which the positions of the entities in the input sequence and their categories are obtained.
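For a single sentence this Viterbi decoding is available directly in TensorFlow 1.x, e.g.:

    from tensorflow.contrib.crf import viterbi_decode  # TensorFlow 1.x

    def predict_tags(logits, transition_params):
        # logits: [T, num_tags] emission scores for one sentence (numpy array);
        # transition_params: [num_tags, num_tags] learned transition matrix A.
        tags, score = viterbi_decode(logits, transition_params)
        return tags  # index sequence of the highest-scoring label path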

Claims (8)

1. An entity identification method for the coal mine safety field, characterized in that: coal mine field nouns are first collected as entities to construct a coal mine safety field entity data set; the data set is then used to train an MSRBAC model; finally, the trained MSRBAC model is used to identify the entities in the coal mine safety field entity data set;
S1, constructing a coal mine safety field entity data set:
S1.1, designing keywords related to the coal mine safety field, collecting the relevant corpora of the field by document retrieval and web crawling, and parsing them into text as the raw corpus of the coal mine safety field; then cleaning and preprocessing the raw corpus, i.e., removing useless symbols in the text and truncating the text to the length acceptable to the model, forming the data to be labeled;
S1.2, setting the entity types of the coal mine safety field according to the national standard GB/T 15663 "coal mine science and technology terminology" and domain expert opinions;
S1.3, performing character-level BIO labeling on the data sentence by sentence with a semi-automatic labeling method fusing dictionary reverse matching, CRF++ automatic labeling and iterative manual checking: B denotes the first character of an entity noun, I denotes a non-initial character of an entity noun, and O denotes other words that need no label; this forms the coal mine safety field entity data set, an entity annotation set in which the categories and positions of the entities are labeled in the text; the data set is divided into a training set, a validation set and a test set in a 6:2:2 ratio for iterative model training and prediction;
S2, establishing an MSRBAC model:
The first layer of the MSRBAC model is a character embedding layer formed by a RoBERTa pre-trained language model. Through dynamic masking, the tokens in the sequence, i.e., the words segmented at the character level of a sentence, are randomly masked and predicted before each of multiple rounds of data feeding, so that the entity characteristics in the coal mine safety field entity data set are continuously learned and a trained RoBERTa pre-trained language model is obtained; the RoBERTa pre-trained language model is then used as the feature extractor of the input sequence, the characters/words are converted into the word vectors obtained in training and given the weights obtained in pre-training, which improves on the training speed of random initialization;
The second layer of the MSRBAC model is a BiLSTM layer comprising two linearly connected bidirectional long short-term memory networks (BiLSTM), built with tf.nn.bidirectional_dynamic_rnn in TensorFlow; it performs further feature extraction on the embedding sequence converted into word vectors, in order to better understand the context semantics of the current input text. A dropout layer is connected behind the BiLSTM layer; it randomly discards neurons in the network to prevent overfitting and enhance the generalization ability of the model. Meanwhile, the gate networks in the LSTM are used to handle the retention of long-term and short-term information; they comprise a forget gate, an input gate and an output gate: the forget gate decides which information to discard or retain; the input gate filters information, selectively discarding information without practical significance; and the output gate uses a sigmoid function to decide the information to be output from the current cell state;
The third layer of the MSRBAC model is an Attention layer. The hidden vectors produced by the BiLSTM already contain rich context feature information, but since these features all carry the same weight they can cause large errors when distinguishing entity types, so the hidden vectors h_t = [h_1, h_2, ..., h_T] output by the BiLSTM layer are given different weights a_t, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length; the correlations between characters, captured by the weights a_t, are used to divide the word boundaries of the sentence;
The fourth layer of the MSRBAC model is a CRF layer, used to decide the BIO label category of each Chinese character in the text; the CRF layer is built with tf.contrib.crf and constrains the labels of the previous step, e.g., the first label of a corpus sequence cannot start with I;
S3, training the MSRBAC model:
The raw corpus text of the coal mine field, after data cleaning, processing and labeling, is fed as data into the RoBERTa-layer pre-trained language model and converted into word vectors by character-level token; the token vectors of the RoBERTa layer are input into the BiLSTM layer for further learning, so that the MSRBAC model better understands the context of the text; Attention assigns a different weight to each h_t in the sequence to highlight the important elements; finally the CRF layer selects the label sequence with the highest prediction score as the best answer.
2. The entity identification method for the coal mine safety field according to claim 1, characterized in that: the entity categories of step S1.2 are set to 9 classes, namely Coal Geology and Exploration (CRP), Roadway Engineering (SDE), Coal Mining (CM), Hoisting and Transportation (HT), Mine Survey (MMS), Coal Mine Safety (MS), Blasting Materials and Technology (EMBT), Mining Machinery (WMDM) and Coal Mine Electrical (CMEE), each identified by its English abbreviation.
3. The entity identification method for the coal mine safety field according to claim 1, characterized in that: in step S1.3, the multi-method fused semi-automatic labeling method first stores the entities obtained from the national standard GB/T 15663 "coal mine science and technology terminology" and Baidu Encyclopedia into corresponding dictionaries by category and runs a script that reverse-labels the data set to be annotated; the dictionary-labeled entities are then expanded with the CRF++ algorithm, which greatly reduces the labor cost; finally, manual correction labels the entities that correspond to a category but were not labeled and corrects wrong labels, completing the annotation of the data set.
4. The entity identification method for the coal mine safety field according to claim 1, characterized in that: step S1.3 specifies that a sentence is BIO-labeled at the character level: B indicates that the current character is the first character of the entity it belongs to, I indicates that the current character is inside or at the end of that entity, and O indicates that the current character is not part of any entity.
5. The entity identification method for the coal mine safety field according to claim 1, characterized in that: through the MSRBAC model for identifying coal mine safety entities, a text is converted into a data sequence; the BiLSTM layer and the Attention layer compute, for the hidden state of each position, the probabilities of all BIO labels, and the per-position label probabilities are input into the CRF layer; a scoring function s(X, y) encapsulated by TensorFlow scores every possible label path,

s(X, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i},

the scores are converted into probabilities with a softmax normalization function,

p(y|X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

and finally the probability of the correct label sequence is maximized by maximum-likelihood estimation:

log p(y|X) = s(X, y) - log Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

where Y_X denotes all possible label sequences, A is the transition matrix and P is the weight matrix; through this computation the optimal label path is obtained, and the useful entities in the sequence and their types are finally found;
A dropout layer is arranged after the RoBERTa layer and after the BiLSTM layer, respectively, to prevent overfitting; the dropout rate after the RoBERTa layer is set to 0.1 and the dropout rate after the BiLSTM layer to 0.4;
The texts in the coal mine safety field entity data set are input into the MSRBAC model sentence by sentence, in character-level form together with their BIO label sequences; introducing the RoBERTa pre-trained language model into coal mine safety entity identification reduces the required coal mine safety entity sample size and the labor cost; RoBERTa converts the input characters into feature word vectors and is followed by a dropout layer, which avoids overfitting by randomly discarding neurons in the network; the output feature word vectors serve as the input of the subsequent model structure.
6. The entity identification method for the coal mine safety field according to claim 1, characterized in that: when the input text is too long, problems such as gradient vanishing and gradient explosion arise; gradient clipping solves the gradient explosion problem to a certain extent, and the forget gate, input gate and output gate introduced in the LSTM relieve the gradient vanishing problem; the specific computation for a time step t is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)

where f_t is the forget gate, used to decide which information to discard or retain; i_t is the input gate, which filters the input information and selectively discards some of it; C̃_t is the cell state candidate, prepared for updating the current cell state; C_t is the current cell state, used for long-term memory; o_t is the output gate, which uses a sigmoid layer to decide which part of the cell state will be output; and h_t is the hidden state value, i.e., the output at the current moment, used for short-term memory. Through this computation, the gradient vanishing problem of over-long texts in the neural network is effectively relieved.
7. The entity identification method for the coal mine safety field according to claim 1, characterized in that: Attention is introduced to distribute weights over the hidden vectors output by the BiLSTM, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length.
8. The entity identification method for the coal mine safety field according to claim 1, wherein the MSRBAC model training specifically comprises:
Model parameters are optimized using crf_log_likelihood as the loss function with the stochastic gradient descent (SGD) algorithm and back propagation to train the MSRBAC model; the model inputs are the training set train and the validation set dev; the hyperparameters are: entity type count entry_num = 9; dropout after the RoBERTa layer 0.1; dropout after the BiLSTM layer 0.4; maximum sentence length max_len = 512; batch size batch_size = 8; total number of training rounds epoch = 50; learning rate lr = 5e-5. The specific training steps are as follows:
feeding the training set train and the validation set dev processed in the previous step into the model through a DataIterator;
randomly shuffling the data in the training set train with a shuffle function and selecting batch_size samples per round to feed the model for training;
the data first flows into RoBERTa, which is fine-tuned on the current sequence, then enters BiLSTM and Attention, where all label probabilities are computed, and finally the CRF inference layer computes the loss value loss;
model parameters are optimized by back propagation; the model trained in the current epoch is then validated on the validation set dev, computing the precision P, the recall R and their harmonic mean F1; the best model's F1 value best_F1 is saved, and if the F1 value fails to exceed the current best_F1 for more than two rounds, training is terminated early and model training ends.
CN202111301680.7A 2021-11-04 Entity identification method oriented to coal mine safety field Active CN113988054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111301680.7A CN113988054B (en) 2021-11-04 Entity identification method oriented to coal mine safety field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111301680.7A CN113988054B (en) 2021-11-04 Entity identification method oriented to coal mine safety field

Publications (2)

Publication Number Publication Date
CN113988054A true CN113988054A (en) 2022-01-28
CN113988054B CN113988054B (en) 2024-07-16


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115146642A (en) * 2022-07-21 2022-10-04 北京市科学技术研究院 Automatic training set labeling method and system for named entity recognition

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN112733541A (en) * 2021-01-06 2021-04-30 重庆邮电大学 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
US20210319314A1 (en) * 2020-04-09 2021-10-14 Naver Corporation End-To-End Graph Convolution Network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
US20210319314A1 (en) * 2020-04-09 2021-10-14 Naver Corporation End-To-End Graph Convolution Network
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN112733541A (en) * 2021-01-06 2021-04-30 重庆邮电大学 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Peng; YE Shuai; SHU Ya; LU Xiaolong; LIU Mingming: "Research on coal mine safety knowledge graph construction and intelligent query method" (煤矿安全知识图谱构建及智能查询方法研究), Journal of Chinese Information Processing (中文信息学报), vol. 34, no. 11, 31 December 2020 (2020-12-31) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115146642A (en) * 2022-07-21 2022-10-04 北京市科学技术研究院 Automatic training set labeling method and system for named entity recognition
CN115146642B (en) * 2022-07-21 2023-08-29 北京市科学技术研究院 Named entity recognition-oriented training set automatic labeling method and system

Similar Documents

Publication Publication Date Title
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN108984526B (en) Document theme vector extraction method based on deep learning
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN105389379A (en) Rubbish article classification method based on distributed feature representation of text
CN112541356B (en) Method and system for recognizing biomedical named entities
CN109960799A (en) A kind of Optimum Classification method towards short text
CN113673254B (en) Knowledge distillation position detection method based on similarity maintenance
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN110134793A (en) Text sentiment classification method
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN112860898B (en) Short text box clustering method, system, equipment and storage medium
Gan et al. Character-level deep conflation for business data analytics
CN111460147B (en) Title short text classification method based on semantic enhancement
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN114722835A (en) Text emotion recognition method based on LDA and BERT fusion improved model
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN116108191A (en) Deep learning model recommendation method based on knowledge graph
CN116882402A (en) Multi-task-based electric power marketing small sample named entity identification method
CN114626367A (en) Sentiment analysis method, system, equipment and medium based on news article content
CN114356990A (en) Base named entity recognition system and method based on transfer learning
CN112347247A (en) Specific category text title binary classification method based on LDA and Bert
Shan Social Network Text Sentiment Analysis Method Based on CNN‐BiGRU in Big Data Environment
CN107562774A (en) Generation method, system and the answering method and system of rare foreign languages word incorporation model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant