CN113988054A - Entity identification method for coal mine safety field

Entity identification method for coal mine safety field

Info

Publication number
CN113988054A
CN113988054A (application CN202111301680.7A)
Authority
CN
China
Prior art keywords
model
entity
layer
coal mine
mine safety
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111301680.7A
Other languages
Chinese (zh)
Other versions
CN113988054B (en)
Inventor
刘鹏
冯琳
史新国
刘珂
谢亚波
王莹
余钱坤
曹新晨
程浩然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202111301680.7A
Priority claimed from CN202111301680.7A
Publication of CN113988054A
Application granted
Publication of CN113988054B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an entity identification method for the coal mine safety field, suitable for informatization in that field, comprising three stages: construction of a coal mine safety field entity data set, model design, and model training. In the data set construction stage, the data are cleaned, processed and labeled into a usable data set. In the model design stage, a RoBERTa pre-trained language model serves as the input layer; its larger training corpus, larger batch size and dynamic masking yield feature inputs that better fit the context. A bidirectional long short-term memory network then further learns the context, an attention mechanism assigns different weights to the sequence elements, and a CRF computes the most probable state path to obtain the final entity category, together forming the MSRBAC model. Finally, the trained MSRBAC model is used for entity identification in the coal mine safety field. The method has simple steps, is convenient to use, and is widely applicable.

Description

Entity identification method for coal mine safety field
Technical Field
The invention relates to entity identification methods, and in particular to an entity identification method for the coal mine safety field, suitable for informatization in that field.
Background
Coal is an important basic energy source in China and accounts for a large share of the energy consumed. With the development of Internet technology, the total volume of coal mine scientific data has grown enormously and coal-mine-safety-related information is increasing explosively, so efficiently mining valuable information has become a key technical problem. Driving coal mine industrialization along a new industrial path through informatization and intelligence, and building modern mines, has become the inevitable route for coal mining enterprises to raise mine safety assurance, achieve high output and high efficiency, and increase competitiveness, and is an important direction for promoting science and technology.
Early named entity recognition relied mainly on dictionaries and rules; rule-based systems include LaSIE-II, NetOwl and Facile, while researchers such as Wang Ning and Krupka performed entity recognition by building dictionary and rule bases. These methods depend on rule templates supplied by language-domain experts, require great manual effort to build dictionaries, and recognize out-of-dictionary and similar words poorly. To address this, typical statistical machine learning methods were proposed, such as the Hidden Markov Model (HMM), Maximum Entropy (ME), Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF) and Support Vector Machine (SVM). However, such methods generally treat NER as a sequence labeling problem, require a large-scale corpus to learn a labeling model for each position of a sentence, and depend on manually designed features. In recent years, with growing computing power, deep learning methods have been widely applied; common models include the Convolutional Neural Network (CNN), the long short-term memory network (LSTM) and the GRU; for example, Dong et al. first applied a character-vector-based BiLSTM-CRF architecture to Chinese named entity recognition in 2016. Such methods mostly use static word vectors such as Word2vec and GloVe as the word embedding layer, but static word vectors cannot adapt to the dynamic changes of different contexts. Contextualized word vectors such as CoVe, ULMFiT and ELMo emerged to solve this, yet problems remain, such as "seeing oneself", i.e., the next word to be predicted already appearing in the given sequence. The BERT pre-trained language model released by Google in October 2018 set new records on 11 NLP tasks, a milestone for natural language processing, and numerous modified versions of BERT have since followed.
Coal mine safety has always been an important field of safety production. Compared with other fields, its informatization has certain particularities: the lack of data and the absence of large-scale labeled data sets bring certain difficulties. Entity identification in the coal mine safety field is an important basis for coal mine informatization such as knowledge graphs and question answering systems in the coal mine field.
Disclosure of Invention
The invention content is as follows: aiming at the defects of the prior art, an entity identification method for the coal mine safety field is provided that is simple in steps and accurate in judgment.
In order to achieve this technical purpose, the entity identification method for the coal mine safety field first collects coal mine field nouns as entities to form a coal mine safety field entity data set, then trains the MSRBAC model with that data set, and finally uses the trained MSRBAC model to identify the entities in coal mine safety field text;
S1, constructing a coal mine safety field entity data set:
S1.1, designing keywords related to the coal mine safety field, collecting the relevant corpora of the field by document retrieval and web crawling, and parsing them into text as the raw corpus of the coal mine safety field; then cleaning and preprocessing the raw corpus, i.e., removing useless symbols in the text and truncating the text to the length acceptable to the model, forming the data to be labeled;
S1.2, setting the entity types of the coal mine safety field according to the national standard GB/T 15663 "coal mine science and technology terminology" and domain expert opinions;
S1.3, performing character-level BIO labeling on the data sentence by sentence with a semi-automatic labeling method fusing multiple methods: dictionary reverse matching, CRF++ automatic labeling and iterative manual checking; B (begin) denotes the first character of an entity noun, I (inside) denotes a non-initial character of an entity noun, and O (other) denotes other words that need no label. This forms the coal mine safety field entity data set, an entity annotation set in which the categories and positions of the entities are labeled in the text; the data set is divided into a training set, a validation set and a test set in a 6:2:2 ratio for iterative model training and prediction;
S2, establishing the MSRBAC (Mine Safety RoBERTa BiLSTM Attention CRF) model:
The first layer of the MSRBAC model is a character embedding layer formed by a RoBERTa pre-trained language model. The tokens in the sequence, i.e., the words segmented at the character level of a sentence, are randomly masked and predicted before each of multiple rounds of data feeding, so that the entity characteristics in the coal mine safety field entity data set are continuously learned and a trained RoBERTa pre-trained language model is obtained. The RoBERTa pre-trained language model is then used as the feature extractor of the input sequence: the characters/words are converted into the word vectors obtained in training and given the weights obtained in pre-training, which improves on the training speed of random initialization;
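By way of illustration, the dynamic masking idea can be sketched as follows in Python; the 15% masking rate and the [MASK] token are RoBERTa-style defaults assumed here, not values stated in this description:

    import random

    def dynamic_mask(tokens, mask_token="[MASK]", mask_rate=0.15, rng=random):
        # A fresh mask is drawn every time the sequence is fed, so each
        # round of data feeding sees a different masking pattern.
        masked, targets = [], []
        for tok in tokens:
            if rng.random() < mask_rate:
                masked.append(mask_token)   # hide the token ...
                targets.append(tok)         # ... and have the model predict it
            else:
                masked.append(tok)
                targets.append(None)        # nothing to predict at this position
        return masked, targets

    # Each call re-masks the same character-level token sequence differently:
    # dynamic_mask(list("采煤工作面瓦斯涌出"))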
The second layer of the MSRBAC model is a BiLSTM layer comprising two linearly connected bidirectional long short-term memory networks (BiLSTM), built with tf.nn.bidirectional_dynamic_rnn in TensorFlow; it performs further feature extraction on the embedding sequence converted into word vectors, in order to better understand the context semantics of the current input text. A dropout layer is connected behind the BiLSTM layer; it randomly discards neurons in the network to prevent overfitting and enhance the generalization ability of the model. Meanwhile, the gate networks in the LSTM are used to handle the retention of long-term and short-term information; they comprise a forget gate, an input gate and an output gate: the forget gate decides which information to discard or retain; the input gate filters information, selectively discarding information without practical significance; and the output gate uses a sigmoid function to decide the information to be output from the current cell state;
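A minimal TensorFlow 1.x sketch of such a BiLSTM block follows (tf.nn.bidirectional_dynamic_rnn as named above; the hidden size of 256 is taken from the embodiment below, and the exact wiring of the two stacked layers is an assumption):

    import tensorflow as tf  # TensorFlow 1.x, matching the tf.nn/tf.contrib APIs used here

    def bilstm_block(embeddings, seq_lengths, hidden_size=256, keep_prob=0.6, layers=2):
        h = embeddings                          # [batch, T, dim] vectors from the RoBERTa layer
        for i in range(layers):                 # two linearly connected BiLSTM layers
            with tf.variable_scope("bilstm_%d" % i):
                cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size)
                cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size)
                (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
                    cell_fw, cell_bw, h, sequence_length=seq_lengths, dtype=tf.float32)
                h = tf.concat([out_fw, out_bw], axis=-1)   # [batch, T, 2*hidden_size]
        return tf.nn.dropout(h, keep_prob=keep_prob)        # dropout rate 0.4 -> keep_prob 0.6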
The third layer of the MSRBAC model is an Attention layer. The hidden vectors produced by the BiLSTM already contain rich context feature information, but since these features all carry the same weight they can cause large errors when distinguishing entity types, so the hidden vectors h_t = [h_1, h_2, ..., h_T] output by the BiLSTM layer are given different weights a_t, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length; the correlations between characters, captured by the weights a_t, are used to divide the word boundaries of the sentence;
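The form of the scoring function is not specified further; a sketch under the common assumption of additive scoring, e_t = v^T tanh(W h_t + b), could be:

    import tensorflow as tf  # TensorFlow 1.x

    def attention_block(h, score_units=128):
        # h: [batch, T, d] hidden vectors from the BiLSTM layer.
        # The description only states that a learnable function provides the
        # weights a_t; additive scoring is assumed here.
        e = tf.layers.dense(tf.tanh(tf.layers.dense(h, score_units)), 1)  # [batch, T, 1]
        a = tf.nn.softmax(e, axis=1)   # weights a_t, normalized over the T positions
        return a * h                   # per-position reweighted sequence fed to the CRF layer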
The fourth layer of the MSRBAC model is a CRF layer, used to decide the BIO label category of each Chinese character in the text; the CRF layer is built with tf.contrib.crf and constrains the labels of the previous step, e.g., the first label of a corpus sequence cannot start with I;
S3, training the MSRBAC model:
The raw corpus text of the coal mine field, after data cleaning, processing and labeling, is fed as data into the RoBERTa-layer pre-trained language model and converted into word vectors by character-level token; the token vectors of the RoBERTa layer are input into the BiLSTM layer for further learning, so that the MSRBAC model better understands the context of the text; Attention assigns a different weight to each h_t in the sequence to highlight the important elements; finally the CRF layer selects the label sequence with the highest prediction score as the best answer.
The entity categories of step S1.2 are set to 9 classes, namely Coal Geology and Exploration (CRP), Roadway Engineering (SDE), Coal Mining (CM), Hoisting and Transportation (HT), Mine Survey (MMS), Coal Mine Safety (MS), Blasting Materials and Technology (EMBT), Mining Machinery and Development Machinery (WMDM) and Coal Mine Electrical (CMEE), each identified by its English acronym.
In step S1.3, the multi-method fused semi-automatic labeling method first stores the entities obtained from the national standard GB/T 15663 "coal mine science and technology terminology" and Baidu Encyclopedia into corresponding dictionaries by category and runs a script that reverse-labels the data set to be annotated; the dictionary-labeled entities are then expanded with the CRF++ algorithm, which greatly reduces the labor cost; finally, manual correction labels the entities that correspond to a category but were not labeled and corrects wrong labels, completing the annotation of the data set.
Step S1.3 specifies that a sentence is BIO-labeled at the character level: B indicates that the current character is the first character of the entity it belongs to, I indicates that the current character is inside or at the end of that entity, and O indicates that the current character is not part of any entity.
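A minimal sketch of the dictionary reverse-matching step that produces such BIO tags (longest match first; the sample dictionary entry is hypothetical):

    def bio_label(sentence, lexicon):
        # lexicon: {entity_string: category}; unmatched characters stay O.
        tags = ["O"] * len(sentence)
        terms = sorted(lexicon, key=len, reverse=True)   # prefer the longest entity
        i = 0
        while i < len(sentence):
            for term in terms:
                if sentence.startswith(term, i):
                    cat = lexicon[term]
                    tags[i] = "B-" + cat                 # first character of the entity
                    for j in range(i + 1, i + len(term)):
                        tags[j] = "I-" + cat             # non-initial characters
                    i += len(term)
                    break
            else:
                i += 1
        return list(zip(sentence, tags))

    # e.g. bio_label("瓦斯爆炸事故", {"瓦斯爆炸": "MS"})  # hypothetical dictionary entry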
Through the MSRBAC model for coal mine safety entity identification, a text is converted into a data sequence; the BiLSTM layer and the Attention layer compute, for the hidden state of each position, the probabilities of all BIO labels, and the per-position label probabilities are input into the CRF layer; a scoring function s(X, y) encapsulated by TensorFlow scores every possible label path,

s(X, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i},

the scores are converted into probabilities with a softmax normalization function,

p(y|X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

and finally the probability of the correct label sequence is maximized by maximum-likelihood estimation:

log p(y|X) = s(X, y) - log Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

where Y_X denotes all possible label sequences, A is the transition matrix and P is the weight matrix; through this computation the optimal label path is obtained, and the useful entities in the sequence and their types are finally found;
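A sketch of this CRF layer with tf.contrib.crf from TensorFlow 1.x (the dense projection producing the emission matrix P, and the tag count of 19 = 2 x 9 categories + O, are inferred rather than stated):

    import tensorflow as tf
    from tensorflow.contrib import crf  # TensorFlow 1.x

    def crf_block(features, labels, seq_lengths, num_tags=19):
        # features: [batch, T, d] outputs of the attention layer;
        # labels:   [batch, T] gold BIO tag indices.
        logits = tf.layers.dense(features, num_tags)        # emission scores, matrix P
        log_lik, trans = crf.crf_log_likelihood(
            logits, labels, seq_lengths)                    # learns transition matrix A
        loss = tf.reduce_mean(-log_lik)                     # maximum-likelihood objective
        pred_tags, _ = crf.crf_decode(logits, trans, seq_lengths)  # best-scoring path
        return loss, pred_tags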
A dropout layer is arranged after the RoBERTa layer and after the BiLSTM layer, respectively, to prevent overfitting; the dropout rate after the RoBERTa layer is set to 0.1 and the dropout rate after the BiLSTM layer to 0.4;
Texts in the coal mine safety field entity data set are input into the MSRBAC model sentence by sentence, in character-level form together with their BIO label sequences; introducing the RoBERTa pre-trained language model into coal mine safety entity identification reduces the required coal mine safety entity sample size and the labor cost; RoBERTa converts the input characters into feature word vectors and is followed by a dropout layer, which avoids overfitting by randomly discarding neurons in the network; the output feature word vectors serve as the input of the subsequent model structure;
When the input text is too long, problems such as gradient vanishing and gradient explosion arise; gradient clipping solves the gradient explosion problem to a certain extent, and the forget gate, input gate and output gate introduced in the LSTM relieve the gradient vanishing problem. The specific computation for a time step t is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)

where f_t is the forget gate, used to decide which information to discard or retain; i_t is the input gate, which filters the input information and selectively discards some of it; C̃_t is the cell state candidate, prepared for updating the current cell state; C_t is the current cell state, used for long-term memory; o_t is the output gate, which uses a sigmoid layer to decide which part of the cell state will be output; and h_t is the hidden state value, i.e., the output at the current moment, used for short-term memory. Through this computation, the gradient vanishing problem of over-long texts in the neural network is effectively relieved.
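To make the gate equations concrete, a single LSTM time step can be evaluated directly, e.g. with NumPy (the weight layout is an illustrative assumption):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, C_prev, W, b):
        # W: [4*d, d + x_dim] maps [h_{t-1}; x_t] to the four gate pre-activations.
        z = W @ np.concatenate([h_prev, x_t]) + b
        d = h_prev.size
        f_t = sigmoid(z[:d])          # forget gate: what to keep from C_{t-1}
        i_t = sigmoid(z[d:2*d])       # input gate: what new information to admit
        C_hat = np.tanh(z[2*d:3*d])   # cell state candidate
        o_t = sigmoid(z[3*d:])        # output gate
        C_t = f_t * C_prev + i_t * C_hat   # long-term memory
        h_t = o_t * np.tanh(C_t)           # short-term memory / current output
        return h_t, C_t

    # e.g. d = 2 hidden units, 3-dimensional input:
    # rng = np.random.default_rng(0)
    # h, C = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2),
    #                  rng.normal(size=(8, 5)), np.zeros(8))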
Attention is introduced to distribute weights over the hidden vectors output by the BiLSTM, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length.
The MSRBAC model training specifically comprises the following steps:
Model parameters are optimized using crf_log_likelihood as the loss function with the stochastic gradient descent (SGD) algorithm and back propagation to train the MSRBAC model; the model inputs are the training set train and the validation set dev; the hyperparameters are: entity type count entry_num = 9; dropout after the RoBERTa layer 0.1; dropout after the BiLSTM layer 0.4; maximum sentence length max_len = 512; batch size batch_size = 8; total number of training rounds epoch = 50; learning rate lr = 5e-5. The specific training steps are as follows:
feeding the training set train and the validation set dev processed in the previous step into the model through a DataIterator;
randomly shuffling the data in the training set train with a shuffle function and selecting batch_size samples per round to feed the model for training;
the data first flows into RoBERTa, which is fine-tuned on the current sequence, then enters BiLSTM and Attention, where all label probabilities are computed, and finally the CRF inference layer computes the loss value loss;
model parameters are optimized by back propagation; the model trained in the current epoch is then validated on the validation set dev, computing the precision P, the recall R and their harmonic mean F1; the best model's F1 value best_F1 is saved, and if the F1 value fails to exceed the current best_F1 for more than two rounds, training is terminated early and model training ends.
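The loop just described can be sketched as follows; model, its methods and the iterator are placeholders standing in for the actual implementation:

    import random

    def train(model, train_set, dev_set, epochs=50, batch_size=8, patience=2):
        best_f1, bad_rounds = 0.0, 0
        for epoch in range(epochs):
            random.shuffle(train_set)                  # shuffle the training data each round
            for start in range(0, len(train_set), batch_size):
                batch = train_set[start:start + batch_size]
                loss = model.train_step(batch)         # RoBERTa -> BiLSTM + Attention -> CRF loss
            p, r, f1 = model.evaluate(dev_set)         # precision, recall, harmonic mean F1
            if f1 > best_f1:
                best_f1, bad_rounds = f1, 0
                model.save("best_model")               # keep the best checkpoint
            else:
                bad_rounds += 1
                if bad_rounds >= patience:             # F1 failed to improve for two rounds
                    break                              # early stopping
        return best_f1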
The beneficial effects are:
1) The multi-method fused semi-automatic labeling method greatly reduces the labor cost while reducing problems such as mislabels and missed labels caused by purely manual annotation, improving the validity of the data set.
2) The method adopts RoBERTa as the feature input layer, avoiding problems such as slow model convergence and poor results caused by randomly initialized feature weights; semantic decoding is performed by a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF), and an attention mechanism (Attention) is added, building the coal mine safety field entity recognition model MSRBAC and innovatively applying the attention mechanism to entity recognition in the coal mine safety field.
Drawings
FIG. 1 is a block diagram of the MSRBAC model architecture of the present invention;
FIG. 2 is a schematic flow chart of the coal mine safety-oriented entity identification method of the present invention;
FIG. 3 is a data sentence length distribution histogram of the present invention;
FIG. 4 is a schematic diagram of an embodiment of Word Embedding layer input representation;
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings:
As shown in FIG. 2, in the entity identification method for the coal mine safety field of the present invention, the construction stage of the coal mine safety field entity data set cleans, processes and labels the data into a usable data set; in the model design stage, a RoBERTa pre-trained language model serves as the input layer, whose larger training corpus, larger batch size and dynamic masking yield feature inputs that better fit the context; a bidirectional long short-term memory network further learns the context, an attention mechanism assigns different weights to the sequence elements, and the CRF computes the most probable state path to obtain the final entity category, forming the MSRBAC model; finally the trained MSRBAC model is used for entity identification in the coal mine safety field.
Specifically, firstly, a coal mine field noun is collected as an entity to form a coal mine safety field entity data set, then the coal mine safety field entity data set is used for training an MSRBAC model, and finally the trained MSRBAC model is used for identifying the entity in the coal mine safety field entity data set;
S1, constructing a coal mine safety field entity data set:
S1.1, designing keywords related to the coal mine safety field, collecting the relevant corpora of the field by document retrieval and web crawling, and parsing them into text as the raw corpus of the coal mine safety field; then cleaning and preprocessing the raw corpus, i.e., removing useless symbols in the text and truncating the text to the length acceptable to the model, forming the data to be labeled;
S1.2, setting the entity types of the coal mine safety field according to the national standard GB/T 15663 "coal mine science and technology terminology" and domain expert opinions; the entity categories are set to 9 classes, namely Coal Geology and Exploration (CRP), Roadway Engineering (SDE), Coal Mining (CM), Hoisting and Transportation (HT), Mine Survey (MMS), Coal Mine Safety (MS), Blasting Materials and Technology (EMBT), Mining Machinery (WMDM) and Coal Mine Electrical (CMEE), each identified by its English abbreviation, as shown in the following table:

[Table: the nine entity categories with their English abbreviations; the table images are not recoverable from the source.]
S1.3, performing character-level BIO labeling on the data sentence by sentence with a semi-automatic labeling method fusing dictionary reverse matching, CRF++ automatic labeling and iterative manual checking: B denotes the first character of an entity noun, I denotes a non-initial character of an entity noun, and O denotes other words that need no label; this forms the coal mine safety field entity data set, an entity annotation set in which the categories and positions of the entities are labeled in the text; the data set is divided into a training set, a validation set and a test set in a 6:2:2 ratio for iterative model training and prediction;
The multi-method fused semi-automatic labeling method first stores the entities obtained from the national standard GB/T 15663 "coal mine science and technology terminology" and Baidu Encyclopedia into corresponding dictionaries by category and runs a script that reverse-labels the data set to be annotated; the dictionary-labeled entities are then expanded with the CRF++ algorithm, greatly reducing the labor cost; finally, manual correction labels the entities that were missed and corrects wrong labels, completing the annotation of the data set. A sentence is BIO-labeled at the character level: B indicates that the current character is the first character of the entity it belongs to, I indicates that the current character is inside or at the end of that entity, and O indicates that the current character is not part of any entity.
S2, establishing an MSRBAC model, wherein the structure is shown in figure 1:
The first layer of the MSRBAC model is a character embedding layer formed by a RoBERTa pre-trained language model. The tokens in the sequence, i.e., the words segmented at the character level of a sentence, are randomly masked and predicted before each of multiple rounds of data feeding, so that the entity characteristics in the coal mine safety field entity data set are continuously learned and a trained RoBERTa pre-trained language model is obtained. The RoBERTa pre-trained language model is then used as the feature extractor of the input sequence: the characters/words are converted into the word vectors obtained in training and given the weights obtained in pre-training, which improves on the training speed of random initialization;
The second layer of the MSRBAC model is a BiLSTM layer comprising two linearly connected bidirectional long short-term memory networks (BiLSTM), built with tf.nn.bidirectional_dynamic_rnn in TensorFlow; it performs further feature extraction on the embedding sequence converted into word vectors, in order to better understand the context semantics of the current input text. A dropout layer is connected behind the BiLSTM layer; it randomly discards neurons in the network to prevent overfitting and enhance the generalization ability of the model. Meanwhile, the gate networks in the LSTM are used to handle the retention of long-term and short-term information; they comprise a forget gate, an input gate and an output gate: the forget gate decides which information to discard or retain; the input gate filters information, selectively discarding information without practical significance; and the output gate uses a sigmoid function to decide the information to be output from the current cell state;
The third layer of the MSRBAC model is an Attention layer. The hidden vectors produced by the BiLSTM already contain rich context feature information, but since these features all carry the same weight they can cause large errors when distinguishing entity types, so the hidden vectors h_t = [h_1, h_2, ..., h_T] output by the BiLSTM layer are given different weights a_t, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length; the correlations between characters, captured by the weights a_t, are used to divide the word boundaries of the sentence;
The fourth layer of the MSRBAC model is a CRF layer, used to decide the BIO label category of each Chinese character in the text; the CRF layer is built with tf.contrib.crf and constrains the labels of the previous step, e.g., the first label of a corpus sequence cannot start with I;
Through the MSRBAC model for identifying coal mine safety entities, the text is converted into a data sequence; the BiLSTM layer and the Attention layer compute, for the hidden state of each position, the probabilities of all BIO labels, and the per-position label probabilities are input into the CRF layer; a scoring function s(X, y) encapsulated by TensorFlow scores every possible label path,

s(X, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i},

the scores are converted into probabilities with a softmax normalization function,

p(y|X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

and finally the probability of the correct label sequence is maximized by maximum-likelihood estimation:

log p(y|X) = s(X, y) - log Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

where Y_X denotes all possible label sequences, A is the transition matrix and P is the weight matrix; through this computation the optimal label path is obtained, and the useful entities in the sequence and their types are finally found;
A dropout layer is arranged after the RoBERTa layer and after the BiLSTM layer, respectively, to prevent overfitting; the dropout rate after the RoBERTa layer is set to 0.1 and the dropout rate after the BiLSTM layer to 0.4;
Texts in the coal mine safety field entity data set are input into the MSRBAC model sentence by sentence, in character-level form together with their BIO label sequences; introducing the RoBERTa pre-trained language model into coal mine safety entity identification reduces the required coal mine safety entity sample size and the labor cost; RoBERTa converts the input characters into feature word vectors and is followed by a dropout layer, which avoids overfitting by randomly discarding neurons in the network; the output feature word vectors serve as the input of the subsequent model structure;
S3, training the MSRBAC model:
The raw corpus text of the coal mine field, after data cleaning, processing and labeling, is fed as data into the RoBERTa-layer pre-trained language model and converted into word vectors by character-level token; the token vectors of the RoBERTa layer are input into the BiLSTM layer for further learning, so that the MSRBAC model better understands the context of the text; Attention assigns a different weight to each h_t in the sequence to highlight the important elements; finally the CRF layer selects the label sequence with the highest prediction score as the best answer;
When the input text is too long, problems such as gradient vanishing and gradient explosion arise; gradient clipping solves the gradient explosion problem to a certain extent, and the forget gate, input gate and output gate introduced in the LSTM relieve the gradient vanishing problem. The specific computation for a time step t is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)

where f_t is the forget gate, used to decide which information to discard or retain; i_t is the input gate, which filters the input information and selectively discards some of it; C̃_t is the cell state candidate, prepared for updating the current cell state; C_t is the current cell state, used for long-term memory; o_t is the output gate, which uses a sigmoid layer to decide which part of the cell state will be output; and h_t is the hidden state value, i.e., the output at the current moment, used for short-term memory. Through this computation, the gradient vanishing problem of over-long texts in the neural network is effectively relieved.
Attention is introduced to distribute weights over the hidden vectors output by the BiLSTM, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length.
The MSRBAC model training specifically comprises the following steps:
Model parameters are optimized using crf_log_likelihood as the loss function with the stochastic gradient descent (SGD) algorithm and back propagation to train the MSRBAC model; the model inputs are the training set train and the validation set dev; the hyperparameters are: entity type count entry_num = 9; dropout after the RoBERTa layer 0.1; dropout after the BiLSTM layer 0.4; maximum sentence length max_len = 512; batch size batch_size = 8; total number of training rounds epoch = 50; learning rate lr = 5e-5. The specific training steps are as follows:
feeding the training set train and the validation set dev processed in the previous step into the model through a DataIterator;
randomly shuffling the data in the training set train with a shuffle function and selecting batch_size samples per round to feed the model for training;
the data first flows into RoBERTa, which is fine-tuned on the current sequence, then enters BiLSTM and Attention, where all label probabilities are computed, and finally the CRF inference layer computes the loss value loss;
model parameters are optimized by back propagation; the model trained in the current epoch is then validated on the validation set dev, computing the precision P, the recall R and their harmonic mean F1; the best model's F1 value best_F1 is saved, and if the F1 value fails to exceed the current best_F1 for more than two rounds, training is terminated early and model training ends.
The first embodiment:
As can be clearly observed from FIG. 1, the entity recognition model for the coal mine safety field consists of 4 parts, namely the RoBERTa pre-trained language model, the BiLSTM bidirectional long short-term memory network, the Attention mechanism and the CRF conditional random field; the output of each layer is the input of the next layer, and inputting a sentence into the model yields the entities in the sentence and their categories.
The method specifically comprises the following steps:
(1) Coal mine safety field data acquisition
Because no labeled or public data set in the coal mine safety field is currently available for model training, relevant documents in the field were obtained by document retrieval, web crawling, etc., as raw corpora. Combined with domain expert opinions, the data set adopted by the invention has the following 3 main sources: 1) national standard publishing websites (such as the national standard information public service platform and the national standard network), which provide entity type definitions and entities; 2) encyclopedia websites (such as Baidu Baike and Sogou Baike), which provide entries related to the coal mine safety field; 3) coal mine websites (such as coal mine safety nets), which provide massive text resources related to coal mine safety. These unstructured corpora serve as the corpus of the coal mine safety field entity data set.
Static web pages are crawled with the urllib2 package in Python and dynamic pages with the selenium package in Python. Source 1) is collected by document retrieval and can be downloaded manually; most of this data is stored as images, which the model cannot parse, so the images are converted to text with the OCR character recognition API provided by Baidu Intelligent Cloud. Sources 2) and 3) are crawled web pages containing html tags, links and other noise, which are cleaned with Python regular expressions. All source data are parsed and saved to txt files for convenient batch processing in Python. The concept entities in sources 1) and 2) are sorted, summarized and, combined with domain expert opinions, divided into 9 classes; 9 corresponding dictionaries are built to store the entities.
Noise such as invisible symbols and emoticons is removed from the data of source 3) with regular expressions; after cleaning, the data are split into sentences by periods and similar punctuation. Observing the sentence-length distribution shown in FIG. 3 and considering the maximum input length of RoBERTa, the maximum sentence length is set to 510 characters and over-long texts are truncated. This yields 6602 sentences in total, forming the raw corpus to be labeled. The details are shown in the following table:
[Table: composition of the raw corpus to be labeled; the table image is not recoverable from the source.]
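A minimal sketch of the cleaning and sentence-splitting step described above (the particular regular expressions standing in for "useless symbols" are illustrative assumptions):

    import re

    MAX_LEN = 510  # RoBERTa accepts 512 tokens including [CLS]/[SEP], leaving 510 characters

    def clean_and_split(raw_text):
        text = re.sub(r"<[^>]+>", "", raw_text)              # residual html tags (illustrative)
        text = re.sub(r"[\u200b\ufeff\xa0\s]+", "", text)    # invisible symbols, stray whitespace
        sentences = [s for s in re.split(r"[。！？]", text) if s]  # split at sentence-final punctuation
        return [s[:MAX_LEN] for s in sentences]              # truncate over-long sentences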
(2) Coal mine safety field entity labeling
The invention adopts the multi-method fused semi-automatic BIO labeling method: a python script first reverse-labels the raw corpus to be annotated against the entities in the 9 category dictionaries; the result is then input into the CRF++ algorithm, whose feature templates expand the dictionary-labeled entities; finally, wrong labels are corrected manually, completing the annotation of the data set. The labeled sentences are shuffled and divided into a training set, a validation set and a test set in an approximate 6:2:2 ratio; their entity distributions are shown in Tables 2 and 3.
(3) Model training
The training set and the validation set are fed into the MSRBAC coal mine safety field entity recognition model for training. The model vectorizes inputs with RoBERTa; the input representation is shown in FIG. 4. The model hyperparameters are set as follows: the RoBERTa Transformer encoder has 12 layers, hidden dimension 768 and 12 attention heads; the BiLSTM hidden size is set to 256; the maximum sentence length is 512; the entity type count entry_num is 9; batch_size is 8; epoch is 50; to prevent model overfitting, dropout after the RoBERTa layer is set to 0.1 and after the BiLSTM layer to 0.4; the learning rate is 5e-5.
To reduce the evaluation bias caused by category imbalance, the micro-averaged F1 value (micro-F1) is used as the model performance evaluation index, computed as:

P_micro = Σ_i TP_i / Σ_i (TP_i + FP_i), R_micro = Σ_i TP_i / Σ_i (TP_i + FN_i), micro-F1 = 2 · P_micro · R_micro / (P_micro + R_micro)

During training, the optimal F1 and loss values are recorded; if the F1 value does not change for more than two rounds or the loss does not improve for more than 1000 steps, model training is terminated early; the model hyperparameters are adjusted by observing the validation set, and the model is retrained until the hyperparameters reach their optimal values.
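Micro-averaging pools the counts over all 9 categories before computing P and R; a small sketch:

    def micro_f1(counts):
        # counts: {category: (tp, fp, fn)} pooled over the evaluation data.
        tp = sum(c[0] for c in counts.values())
        fp = sum(c[1] for c in counts.values())
        fn = sum(c[2] for c in counts.values())
        p = tp / (tp + fp) if tp + fp else 0.0   # micro precision
        r = tp / (tp + fn) if tp + fn else 0.0   # micro recall
        return 2 * p * r / (p + r) if p + r else 0.0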
(4) Model prediction
The test set is fed into the optimal model obtained in step (3); during prediction, the Viterbi algorithm in the CRF computes the state path with the highest probability among all possible sequences:

y* = argmax_{ỹ∈Y_X} s(X, ỹ)

from which the positions of the entities in the input sequence and their categories are obtained.
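For a single sentence this Viterbi decoding is available directly in TensorFlow 1.x, e.g.:

    from tensorflow.contrib.crf import viterbi_decode  # TensorFlow 1.x

    def predict_tags(logits, transition_params):
        # logits: [T, num_tags] emission scores for one sentence (numpy array);
        # transition_params: [num_tags, num_tags] learned transition matrix A.
        tags, score = viterbi_decode(logits, transition_params)
        return tags  # index sequence of the highest-scoring label path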

Claims (8)

1. An entity identification method for the coal mine safety field, characterized in that: coal mine field nouns are first collected as entities to construct a coal mine safety field entity data set; the data set is then used to train an MSRBAC model; finally, the trained MSRBAC model is used to identify the entities in the coal mine safety field entity data set;
S1, constructing a coal mine safety field entity data set:
S1.1, designing keywords related to the coal mine safety field, collecting the relevant corpora of the field by document retrieval and web crawling, and parsing them into text as the raw corpus of the coal mine safety field; then cleaning and preprocessing the raw corpus, i.e., removing useless symbols in the text and truncating the text to the length acceptable to the model, forming the data to be labeled;
S1.2, setting the entity types of the coal mine safety field according to the national standard GB/T 15663 "coal mine science and technology terminology" and domain expert opinions;
S1.3, performing character-level BIO labeling on the data sentence by sentence with a semi-automatic labeling method fusing dictionary reverse matching, CRF++ automatic labeling and iterative manual checking: B denotes the first character of an entity noun, I denotes a non-initial character of an entity noun, and O denotes other words that need no label; this forms the coal mine safety field entity data set, an entity annotation set in which the categories and positions of the entities are labeled in the text; the data set is divided into a training set, a validation set and a test set in a 6:2:2 ratio for iterative model training and prediction;
S2, establishing an MSRBAC model:
The first layer of the MSRBAC model is a character embedding layer formed by a RoBERTa pre-trained language model. Through dynamic masking, the tokens in the sequence, i.e., the words segmented at the character level of a sentence, are randomly masked and predicted before each of multiple rounds of data feeding, so that the entity characteristics in the coal mine safety field entity data set are continuously learned and a trained RoBERTa pre-trained language model is obtained; the RoBERTa pre-trained language model is then used as the feature extractor of the input sequence, the characters/words are converted into the word vectors obtained in training and given the weights obtained in pre-training, which improves on the training speed of random initialization;
The second layer of the MSRBAC model is a BiLSTM layer comprising two linearly connected bidirectional long short-term memory networks (BiLSTM), built with tf.nn.bidirectional_dynamic_rnn in TensorFlow; it performs further feature extraction on the embedding sequence converted into word vectors, in order to better understand the context semantics of the current input text. A dropout layer is connected behind the BiLSTM layer; it randomly discards neurons in the network to prevent overfitting and enhance the generalization ability of the model. Meanwhile, the gate networks in the LSTM are used to handle the retention of long-term and short-term information; they comprise a forget gate, an input gate and an output gate: the forget gate decides which information to discard or retain; the input gate filters information, selectively discarding information without practical significance; and the output gate uses a sigmoid function to decide the information to be output from the current cell state;
The third layer of the MSRBAC model is an Attention layer. The hidden vectors produced by the BiLSTM already contain rich context feature information, but since these features all carry the same weight they can cause large errors when distinguishing entity types, so the hidden vectors h_t = [h_1, h_2, ..., h_T] output by the BiLSTM layer are given different weights a_t, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length; the correlations between characters, captured by the weights a_t, are used to divide the word boundaries of the sentence;
The fourth layer of the MSRBAC model is a CRF layer, used to decide the BIO label category of each Chinese character in the text; the CRF layer is built with tf.contrib.crf and constrains the labels of the previous step, e.g., the first label of a corpus sequence cannot start with I;
S3, training the MSRBAC model:
The raw corpus text of the coal mine field, after data cleaning, processing and labeling, is fed as data into the RoBERTa-layer pre-trained language model and converted into word vectors by character-level token; the token vectors of the RoBERTa layer are input into the BiLSTM layer for further learning, so that the MSRBAC model better understands the context of the text; Attention assigns a different weight to each h_t in the sequence to highlight the important elements; finally the CRF layer selects the label sequence with the highest prediction score as the best answer.
2. The entity identification method for the coal mine safety field according to claim 1, characterized in that: the entity categories of step S1.2 are set to 9 classes, namely Coal Geology and Exploration (CRP), Roadway Engineering (SDE), Coal Mining (CM), Hoisting and Transportation (HT), Mine Survey (MMS), Coal Mine Safety (MS), Blasting Materials and Technology (EMBT), Mining Machinery (WMDM) and Coal Mine Electrical (CMEE), each identified by its English abbreviation.
3. The entity identification method for the coal mine safety field according to claim 1, characterized in that: in step S1.3, the multi-method fused semi-automatic labeling method first stores the entities obtained from the national standard GB/T 15663 "coal mine science and technology terminology" and Baidu Encyclopedia into corresponding dictionaries by category and runs a script that reverse-labels the data set to be annotated; the dictionary-labeled entities are then expanded with the CRF++ algorithm, which greatly reduces the labor cost; finally, manual correction labels the entities that correspond to a category but were not labeled and corrects wrong labels, completing the annotation of the data set.
4. The entity identification method for the coal mine safety field according to claim 1, characterized in that: step S1.3 specifies that a sentence is BIO-labeled at the character level: B indicates that the current character is the first character of the entity it belongs to, I indicates that the current character is inside or at the end of that entity, and O indicates that the current character is not part of any entity.
5. The entity identification method for the coal mine safety field according to claim 1, characterized in that: through the MSRBAC model for identifying coal mine safety entities, a text is converted into a data sequence; the BiLSTM layer and the Attention layer compute, for the hidden state of each position, the probabilities of all BIO labels, and the per-position label probabilities are input into the CRF layer; a scoring function s(X, y) encapsulated by TensorFlow scores every possible label path,

s(X, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i},

the scores are converted into probabilities with a softmax normalization function,

p(y|X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

and finally the probability of the correct label sequence is maximized by maximum-likelihood estimation:

log p(y|X) = s(X, y) - log Σ_{ỹ∈Y_X} exp(s(X, ỹ)),

where Y_X denotes all possible label sequences, A is the transition matrix and P is the weight matrix; through this computation the optimal label path is obtained, and the useful entities in the sequence and their types are finally found;
A dropout layer is arranged after the RoBERTa layer and after the BiLSTM layer, respectively, to prevent overfitting; the dropout rate after the RoBERTa layer is set to 0.1 and the dropout rate after the BiLSTM layer to 0.4;
The texts in the coal mine safety field entity data set are input into the MSRBAC model sentence by sentence, in character-level form together with their BIO label sequences; introducing the RoBERTa pre-trained language model into coal mine safety entity identification reduces the required coal mine safety entity sample size and the labor cost; RoBERTa converts the input characters into feature word vectors and is followed by a dropout layer, which avoids overfitting by randomly discarding neurons in the network; the output feature word vectors serve as the input of the subsequent model structure.
6. The entity identification method for the coal mine safety field according to claim 1, characterized in that: when the input text is too long, problems such as gradient vanishing and gradient explosion arise; gradient clipping solves the gradient explosion problem to a certain extent, and the forget gate, input gate and output gate introduced in the LSTM relieve the gradient vanishing problem; the specific computation for a time step t is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)

where f_t is the forget gate, used to decide which information to discard or retain; i_t is the input gate, which filters the input information and selectively discards some of it; C̃_t is the cell state candidate, prepared for updating the current cell state; C_t is the current cell state, used for long-term memory; o_t is the output gate, which uses a sigmoid layer to decide which part of the cell state will be output; and h_t is the hidden state value, i.e., the output at the current moment, used for short-term memory. Through this computation, the gradient vanishing problem of over-long texts in the neural network is effectively relieved.
7. The entity identification method for the coal mine safety field according to claim 1, characterized in that: Attention is introduced to distribute weights over the hidden vectors output by the BiLSTM, computed as:

a_t = exp(score(h_t)) / Σ_{j=1..T} exp(score(h_j))

where score(·) is the learnable function that provides the weights a_t, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length.
8. The entity identification method for the coal mine safety field according to claim 1, wherein the MSRBAC model training specifically comprises:
Model parameters are optimized using crf_log_likelihood as the loss function with the stochastic gradient descent (SGD) algorithm and back propagation to train the MSRBAC model; the model inputs are the training set train and the validation set dev; the hyperparameters are: entity type count entry_num = 9; dropout after the RoBERTa layer 0.1; dropout after the BiLSTM layer 0.4; maximum sentence length max_len = 512; batch size batch_size = 8; total number of training rounds epoch = 50; learning rate lr = 5e-5. The specific training steps are as follows:
feeding the training set train and the validation set dev processed in the previous step into the model through a DataIterator;
randomly shuffling the data in the training set train with a shuffle function and selecting batch_size samples per round to feed the model for training;
the data first flows into RoBERTa, which is fine-tuned on the current sequence, then enters BiLSTM and Attention, where all label probabilities are computed, and finally the CRF inference layer computes the loss value loss;
model parameters are optimized by back propagation; the model trained in the current epoch is then validated on the validation set dev, computing the precision P, the recall R and their harmonic mean F1; the best model's F1 value best_F1 is saved, and if the F1 value fails to exceed the current best_F1 for more than two rounds, training is terminated early and model training ends.
CN202111301680.7A 2021-11-04 Entity identification method oriented to coal mine safety field Active CN113988054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111301680.7A CN113988054B (en) 2021-11-04 Entity identification method oriented to coal mine safety field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111301680.7A CN113988054B (en) 2021-11-04 Entity identification method oriented to coal mine safety field

Publications (2)

Publication Number Publication Date
CN113988054A true CN113988054A (en) 2022-01-28
CN113988054B CN113988054B (en) 2024-07-16


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115146642A (en) * 2022-07-21 2022-10-04 北京市科学技术研究院 Automatic training set labeling method and system for named entity recognition

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN112733541A (en) * 2021-01-06 2021-04-30 重庆邮电大学 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
US20210319314A1 (en) * 2020-04-09 2021-10-14 Naver Corporation End-To-End Graph Convolution Network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
US20210319314A1 (en) * 2020-04-09 2021-10-14 Naver Corporation End-To-End Graph Convolution Network
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
CN112733541A (en) * 2021-01-06 2021-04-30 重庆邮电大学 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Peng; YE Shuai; SHU Ya; LU Xiaolong; LIU Mingming: "Research on coal mine safety knowledge graph construction and intelligent query method" (煤矿安全知识图谱构建及智能查询方法研究), Journal of Chinese Information Processing (中文信息学报), vol. 34, no. 11, 31 December 2020 (2020-12-31) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115146642A (en) * 2022-07-21 2022-10-04 北京市科学技术研究院 Automatic training set labeling method and system for named entity recognition
CN115146642B (en) * 2022-07-21 2023-08-29 北京市科学技术研究院 Named entity recognition-oriented training set automatic labeling method and system

Similar Documents

Publication Publication Date Title
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN108984526B (en) Document theme vector extraction method based on deep learning
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN105389379A (en) Rubbish article classification method based on distributed feature representation of text
CN112541356B (en) Method and system for recognizing biomedical named entities
CN109960799A (en) A kind of Optimum Classification method towards short text
CN113673254B (en) Knowledge distillation position detection method based on similarity maintenance
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN110134793A (en) Text sentiment classification method
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN112860898B (en) Short text box clustering method, system, equipment and storage medium
Gan et al. Character-level deep conflation for business data analytics
CN111460147B (en) Title short text classification method based on semantic enhancement
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN114722835A (en) Text emotion recognition method based on LDA and BERT fusion improved model
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN116108191A (en) Deep learning model recommendation method based on knowledge graph
CN116882402A (en) Multi-task-based electric power marketing small sample named entity identification method
CN114626367A (en) Sentiment analysis method, system, equipment and medium based on news article content
CN114356990A (en) Base named entity recognition system and method based on transfer learning
CN112347247A (en) Specific category text title binary classification method based on LDA and Bert
Shan Social Network Text Sentiment Analysis Method Based on CNN‐BiGRU in Big Data Environment
CN107562774A (en) Generation method, system and the answering method and system of rare foreign languages word incorporation model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant