CN113988054A - Entity identification method for coal mine safety field - Google Patents
- Publication number
- CN113988054A (application number CN202111301680.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an entity identification method for the coal mine safety field, suitable for informatization in the coal mine safety field, comprising three steps: construction of a coal mine safety field entity data set, model design and model training. In the data set construction stage, the data are cleaned, processed and labeled into a usable data set. In the model design stage, a RoBERTa pre-trained language model serves as the input layer, exploiting its larger training set, larger batch size and dynamic masking to obtain feature inputs that better fit the context; a bidirectional long short-term memory network further learns the context relationships, an attention mechanism assigns different weights to the sequence elements, and a CRF computes the most probable state path to obtain the final entity category, constructing the MSRBAC model; finally the trained MSRBAC model is used for entity identification in the mine safety field. The method has simple steps, is convenient to use and is widely applicable.
Description
Technical Field
The invention relates to an entity identification method, in particular to an entity identification method for the coal mine safety field, suitable for informatization in the coal mine safety field.
Background
Coal is an important basic energy source in China and accounts for a large share of energy consumption. With the development of Internet technology, the total volume of coal mine scientific data has become enormous and information related to coal mine safety is growing explosively, so efficiently mining valuable information from it is a key technical problem. Driving coal mine industrialization along a new, informatized and intelligent industrial road and building modern mines has become the inevitable path for coal enterprises to raise the degree of mine safety assurance, achieve high output and high efficiency, and strengthen enterprise competitiveness, and is an important direction of technology-driven development.
Early named entity recognition was mainly based on dictionaries and rules; rule-based systems include LaSIE-II, NetOwl and Facile, and scholars such as Wang Ning and Krupka performed entity recognition by building dictionary and rule bases. However, dictionary and rule methods depend on specific rule templates provided by domain language experts, require great manpower to build the dictionaries, and recognize out-of-dictionary and similar words poorly. To address these problems, researchers proposed typical statistical machine learning methods such as the Hidden Markov Model (HMM), Maximum Entropy (ME), Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF) and Support Vector Machine (SVM). Such methods generally treat NER as a sequence labeling problem, require a large-scale corpus to learn a labeling model that tags each position of a sentence, and depend on manually engineered features. In recent years, with the growth of computing power, deep learning methods have been widely applied; common models include the Convolutional Neural Network (CNN), the long short-term memory network (LSTM) and the GRU; for example, Dong et al. first applied a character-vector-based BiLSTM-CRF architecture to Chinese named entity recognition in 2016. These methods mostly use static word vectors such as Word2vec and GloVe as the word embedding layer, but static vectors cannot adapt to the dynamic changes of different contexts. To solve this, contextualized word vectors such as CoVe, ULMFiT and ELMo emerged, but some problems remain, such as the model "seeing itself", i.e. the next word to be predicted already appearing in the given conditioning sequence.
The BERT pre-trained language model released by Google in October 2018 achieved state-of-the-art results on 11 NLP tasks, a new milestone in natural language processing, and numerous modified versions of BERT have appeared since.
Coal mine safety has always been an important field of production safety. Compared with other fields, coal mine safety informatization has certain particularities: the scarcity of data and the absence of large-scale labeled data sets bring certain difficulties. Entity identification in the coal mine safety field is an important basis for building coal mine informatization such as knowledge graphs and question answering systems in the coal mine field.
Disclosure of Invention
The content of the invention is as follows: aiming at the shortcomings of the prior art, an entity identification method for the coal mine safety field with simple steps and accurate judgment is provided.
In order to achieve this technical purpose, the entity identification method for the coal mine safety field first collects coal mine field nouns as entities to form a coal mine safety field entity data set, then trains an MSRBAC model with this entity data set, and finally uses the trained MSRBAC model to identify the entities in the coal mine safety field entity data set;
s1, constructing a coal mine safety field entity data set:
s1.1, designing keywords related to the coal mine safety field, collecting related corpora of the coal mine safety field by document retrieval and web crawling, parsing them into text as the raw corpus of the coal mine safety field, then cleaning and preprocessing the raw corpus, i.e. removing useless symbols in the text and truncating the text into lengths the model can accept, to form the data to be labeled;
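The cleaning-and-truncation step above can be sketched in Python; the regular expressions, the 510-character budget (512 minus two special tokens) and the sentence-boundary heuristic are illustrative assumptions, not the patent's exact preprocessing:

```python
import re

def clean_and_chunk(text, max_len=510):
    # Remove whitespace runs and symbols that carry no signal for NER;
    # keep CJK characters, letters, digits and common punctuation.
    text = re.sub(r"\s+", "", text)
    text = re.sub(r"[^\w\u4e00-\u9fff，。、；：！？（）%/.\-]", "", text)
    # Cut into chunks the encoder can accept, preferably at sentence
    # boundaries (split after Chinese end-of-sentence punctuation).
    chunks, buf = [], ""
    for sent in re.split(r"(?<=[。！？])", text):
        if len(buf) + len(sent) > max_len and buf:
            chunks.append(buf)
            buf = ""
        buf += sent
    if buf:
        chunks.append(buf)
    return chunks
```

A sentence that fits within the budget stays in one chunk; longer documents are emitted as several model-sized pieces.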
s1.2, setting the entity types of the coal mine safety field according to the national standard GB/T 15663 "Coal Mine Science and Technology Terms" and domain expert opinions;
s1.3, performing character-level BIO labeling of the data sentence by sentence with a multi-method fusion semi-automatic labeling method combining dictionary reverse matching, CRF++ automatic labeling and iterative manual checking: B (Begin) marks the first character of an entity noun, I (Inside) marks the non-initial characters of an entity noun, and O (Other) marks the remaining characters that need no label, forming the coal mine safety field entity data set, an entity labeling set in which the category and position of each entity are marked in the text; the data set is divided into a training set, a verification set and a test set in the ratio 6:2:2 for iterative model training and prediction;
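A minimal sketch of character-level BIO tagging and the 6:2:2 split described above; the helper names and the example entity tuples are hypothetical:

```python
def bio_tag(sentence, entities):
    # entities: list of (surface_string, category) pairs known to occur
    # in the sentence; category codes follow step S1.2 (e.g. "MS").
    tags = ["O"] * len(sentence)
    for surface, cat in entities:
        start = sentence.find(surface)
        if start == -1:
            continue
        tags[start] = "B-" + cat           # first character of the entity
        for i in range(start + 1, start + len(surface)):
            tags[i] = "I-" + cat           # non-initial characters
    return list(zip(sentence, tags))

def split_622(samples):
    # 6:2:2 split into train / verification / test as in step S1.3.
    n = len(samples)
    a, b = int(n * 0.6), int(n * 0.8)
    return samples[:a], samples[a:b], samples[b:]
```

Each character is paired with its tag, so one labeled sentence becomes one training sample for the sequence model.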
s2, establishing the MSRBAC (Mine Safety RoBERTa BiLSTM Attention CRF) model:
the first layer of the MSRBAC model is a character embedding layer formed by a RoBERTa pre-trained language model; before multiple rounds of data feeding, tokens in the sequence, i.e. the words segmented at sentence character level, are randomly masked and predicted so as to continuously learn the entity characteristics in the coal mine safety field entity data set, yielding a trained RoBERTa pre-trained language model; the RoBERTa model is then used as the feature extractor of the input sequence, converting characters/words into the word vectors obtained by training and assigning the weights obtained by pre-training, which improves on the training speed of random initialization;
the second layer of the MSRBAC model is a BiLSTM layer comprising a two-layer linearly connected bidirectional long short-term memory network BiLSTM, built with tf.nn.bidirectional_dynamic_rnn in TensorFlow, which performs further feature extraction on the embedding sequence converted into word vectors so as to better understand the context semantics of the current input text; a dropout layer follows the BiLSTM layer, randomly discarding neurons in the network to prevent overfitting and enhance the generalization ability of the model; meanwhile, the gate structure in the LSTM is used to solve the problem of retaining long-term and short-term information, and comprises a forget gate, an input gate and an output gate: the forget gate decides which information to discard or retain, the input gate filters information and selectively discards information without actual significance, and the output gate uses a sigmoid function to decide which information the current cell state outputs;
the third layer of the MSRBAC model is an Attention layer; the hidden vectors produced by the BiLSTM already contain rich context feature information, but because those features all carry the same weight they can cause larger errors when entity types are distinguished, so the hidden vectors h_t = [h_1, h_2, ..., h_T] output by the BiLSTM layer are given different weights a_t, computed as
e_t = tanh(W·h_t + b), a_t = exp(e_t) / Σ_{j=1}^{T} exp(e_j)
where a_t is the weight given by a learnable function with parameters W and b, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length; the correlation between characters, captured by the weights a_t, is used to divide word boundaries in the sentence;
the fourth layer of the MSRBAC model is a CRF layer, built with tf.contrib.crf, used to decide the BIO label category of each Chinese character in the text and to constrain the labels produced by the previous step, for example the first label of a sentence cannot start with I;
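The label constraints the CRF layer learns to enforce (e.g. a sequence cannot begin with an I- tag) can be stated as a small predicate; this is an illustrative restatement of the BIO rules, not the tf.contrib.crf implementation:

```python
def violates_bio(prev_tag, tag):
    # Hard BIO constraints a CRF transition matrix learns to penalize:
    # an I- tag may only continue an entity of the same type.
    if tag.startswith("I-"):
        if prev_tag is None:
            return True        # a sentence cannot start with I-
        if prev_tag == "O":
            return True        # O -> I- is illegal
        if prev_tag[2:] != tag[2:]:
            return True        # B-X -> I-Y (different types) is illegal
    return False
```

In a hand-rolled Viterbi decoder these transitions would simply be assigned a score of minus infinity.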
s3 training the MSRBAC model:
the raw corpus text of the coal mine field, after data cleaning, processing and labeling, is fed as input into the RoBERTa layer, whose pre-trained language model converts it into word vectors at the character-level token granularity; the token vectors from the RoBERTa layer are input into the BiLSTM layer for further learning, allowing the MSRBAC model to better understand the context relationships of the text; the Attention assigns different weights to each h_t in the sequence to highlight important elements; finally the CRF layer selects the label sequence with the highest prediction score as the best answer.
The entity categories of step S1.2 are set to 9 classes: Coal Geology and Exploration (CRP), Roadway Engineering (SDE), Coal Mining (CM), Hoisting and Transportation (HT), Mine Survey (MMS), Coal Mine Safety (MS), Blasting Materials and Technology (EMBT), Mining Machinery and Development Machinery (WMDM) and Coal Mine Electrical Engineering (CMEE), each identified by its acronym.
In step S1.3, the multi-method fusion semi-automatic labeling method first stores the entities obtained from the national standard GB/T 15663 "Coal Mine Science and Technology Terms" and Baidu Encyclopedia into dictionaries by category and runs a script that reverse-labels the data set to be annotated against these dictionaries; it then expands the dictionary-labeled entities with the CRF++ algorithm, which greatly reduces labor cost; finally a manual pass corrects wrong labels and adds labels that correspond to entities but were missed, completing the labeling of the data set.
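The dictionary reverse-labeling pass might look like the following sketch, where longer lexicon entries are matched first so that a long term is not broken up by one of its substrings; the lexicon contents and function name are hypothetical:

```python
def dict_reverse_match(sentence, lexicon):
    # lexicon: {entity_string: category}. Longest entries are tried
    # first, so "煤矿安全" wins over its substring "煤矿".
    tags = ["O"] * len(sentence)
    for surface in sorted(lexicon, key=len, reverse=True):
        start = 0
        while True:
            start = sentence.find(surface, start)
            if start == -1:
                break
            span = range(start, start + len(surface))
            # Only label spans not already claimed by a longer match.
            if all(tags[i] == "O" for i in span):
                tags[start] = "B-" + lexicon[surface]
                for i in list(span)[1:]:
                    tags[i] = "I-" + lexicon[surface]
            start += 1
    return tags
```

The output of this pass would then be handed to CRF++ for expansion and finally reviewed manually, as the text describes.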
Step S1.3 specifies that sentences are BIO-tagged at the character level: B indicates the current character is the beginning of the entity it belongs to, I indicates the current character is inside or at the end of that entity, and O indicates the current character is not part of any entity.
Through the MSRBAC model for identifying coal mine safety entities, the text is converted into a data sequence; the BiLSTM layer and the Attention layer compute the probabilities of all BIO labels in the hidden state of every position, and these per-position label probabilities are input into the CRF layer. The scoring function s(X, y) encapsulated by TensorFlow scores every possible label path:
s(X, y) = Σ_i A_{y_i, y_{i+1}} + Σ_i P_{i, y_i}
where A is the transition matrix and P is the emission weight matrix; the probability of a path is then obtained with a softmax normalization over Y_X, the set of all possible label sequences:
p(y|X) = exp(s(X, y)) / Σ_{ỹ ∈ Y_X} exp(s(X, ỹ))
and finally maximum likelihood estimation maximizes the probability of the correct label sequence by maximizing log p(y|X) = s(X, y) − log Σ_{ỹ ∈ Y_X} exp(s(X, ỹ)); through this calculation the optimal label path is obtained, and the useful entities in the sequence and their types are finally found;
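The path score s(X, y) and the log-likelihood above can be checked on toy inputs with a brute-force sketch (a real CRF computes the partition function with the forward algorithm; the names here are illustrative, not the TensorFlow API):

```python
import itertools
import math

def path_score(emissions, transitions, labels):
    # s(X, y): sum of emission scores P[i][y_i] plus transition scores
    # A[y_i][y_{i+1}] along the path.
    s = sum(emissions[i][y] for i, y in enumerate(labels))
    s += sum(transitions[labels[i]][labels[i + 1]]
             for i in range(len(labels) - 1))
    return s

def log_likelihood(emissions, transitions, labels):
    # log p(y|X) = s(X, y) - log Z, where Z sums exp(s) over every
    # possible label sequence (enumerated exhaustively here).
    n, k = len(emissions), len(emissions[0])
    log_z = math.log(sum(
        math.exp(path_score(emissions, transitions, p))
        for p in itertools.product(range(k), repeat=n)))
    return path_score(emissions, transitions, labels) - log_z
```

Maximizing this log-likelihood over the training data is exactly what crf_log_likelihood does as the model's loss.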
A dropout layer is placed after the RoBERTa layer and after the BiLSTM layer, respectively, to prevent overfitting; the dropout rate after the RoBERTa layer is set to 0.1 and the dropout rate after the BiLSTM layer is set to 0.4;
Texts in the coal mine safety field entity data set are input into the MSRBAC model sentence by sentence in character-level form together with their BIO label sequences; introducing the RoBERTa pre-trained language model into coal mine safety entity recognition reduces the required sample size and the labor cost; RoBERTa converts the input characters into feature word vectors and is followed by a dropout layer that randomly discards neurons in the network to avoid overfitting, and the output feature word vectors serve as the input of the subsequent model structure;
When the input text is too long, problems such as gradient vanishing and gradient explosion can arise; gradient clipping alleviates gradient explosion to a certain extent, and the forget gate, input gate and output gate introduced in the LSTM alleviate gradient vanishing. The specific calculation at a time step t is:
f_t = σ(W_f·[h_(t−1), x_t] + b_f)
i_t = σ(W_i·[h_(t−1), x_t] + b_i)
C̃_t = tanh(W_C·[h_(t−1), x_t] + b_C)
C_t = f_t ∗ C_(t−1) + i_t ∗ C̃_t
o_t = σ(W_o·[h_(t−1), x_t] + b_o)
h_t = o_t ∗ tanh(C_t)
where f_t is the forget gate, used to decide which information to discard or retain; i_t is the input gate, which filters the input information and selectively discards part of it; C̃_t is the cell state candidate value, prepared for updating the current cell state; C_t is the current cell state, used for long-term memory; o_t is the output gate, whose sigmoid layer decides which part of the cell state will be output; and h_t is the hidden state value, i.e. the output at the current moment, used for short-term memory. Through this calculation the gradient-vanishing problem of over-long text in the neural network can be effectively mitigated.
The Attention is introduced to perform weight distribution over the hidden vectors output by the BiLSTM; the calculation formula is
e_t = tanh(W·h_t + b), a_t = exp(e_t) / Σ_{j=1}^{T} exp(e_j)
where a_t is the weight given by a learnable function with parameters W and b, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length.
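A minimal sketch of this weighting, assuming a scalar additive scoring function followed by a softmax (the exact learnable function is not specified in the text, so w and b here are illustrative scalars):

```python
import math

def attention(hidden, w, b):
    # e_t = tanh(w*h_t + b) scores each hidden state (scalars for
    # brevity); a_t = softmax(e) are the attention weights; the output
    # re-weights the sequence so informative positions dominate.
    e = [math.tanh(w * h + b) for h in hidden]
    m = max(e)                               # subtract max for stability
    exp_e = [math.exp(x - m) for x in e]
    z = sum(exp_e)
    a = [x / z for x in exp_e]
    weighted = [a_t * h for a_t, h in zip(a, hidden)]
    return a, weighted
```

The weights sum to one, and positions with larger scores receive proportionally more of the downstream model's capacity.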
The MSRBAC model training specifically comprises the following steps:
crf_log_likelihood is used as the loss function, and the model parameters are optimized with the stochastic gradient descent algorithm (SGD) and back propagation to train the MSRBAC model. The model input comprises the training set train and the verification set dev; the hyper-parameters are: number of entity types entity_num = 9; dropout after the RoBERTa layer = 0.1; dropout after the BiLSTM layer = 0.4; maximum sentence length max_len = 512; batch size batch_size = 8; total number of training epochs epoch = 50; learning rate lr = 5e-5. The specific training steps are as follows:
feeding the training set train and the verification set dev processed in the previous steps into the model through a DataIterator;
randomly shuffling the data in the training set train with a shuffle function and selecting batch_size items per round to feed the model for training;
the data first flows into RoBERTa, which fine-tunes on the current sequence, then enters the BiLSTM and Attention layers to compute all label probabilities, and finally the CRF inference layer computes the loss value loss;
the model parameters are optimized by back propagation; the model trained in the current epoch is then verified on the verification set dev, computing the precision P, the recall R and their harmonic mean F1; the F1 value of the best model is saved as best_F1, and if the F1 value fails to exceed the current best_F1 for more than two rounds, training is terminated early and model training ends.
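The evaluation and early-stopping logic of this step can be sketched as follows; evaluate stands in for one epoch of training plus verification on dev, and the patience of two rounds mirrors the "more than two rounds below best_F1" rule above:

```python
def prf1(tp, fp, fn):
    # precision P, recall R and their harmonic mean F1
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def train_with_early_stopping(evaluate, epochs=50, patience=2):
    # evaluate(epoch) -> dev-set F1 after training that epoch; stop once
    # F1 has failed to beat best_f1 for more than `patience` rounds.
    best_f1, bad_rounds, stopped_at = -1.0, 0, epochs
    for epoch in range(1, epochs + 1):
        f1 = evaluate(epoch)
        if f1 > best_f1:
            best_f1, bad_rounds = f1, 0   # checkpoint would be saved here
        else:
            bad_rounds += 1
            if bad_rounds > patience:
                stopped_at = epoch
                break
    return best_f1, stopped_at
```

The best checkpoint, not the last one, is the model kept for entity identification.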
Has the advantages that:
1) The multi-method fusion semi-automatic labeling method greatly reduces labor cost while reducing problems such as wrong and missed labels caused by manual annotation, improving the validity of the data set.
2) The method adopts RoBERTa as the feature input layer, avoiding problems such as slow model convergence caused by randomly initialized feature weights; semantic decoding is performed by a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF), and an attention mechanism (Attention) is added to build the coal mine safety field entity recognition model MSRBAC, innovatively applying the attention mechanism to entity recognition in the coal mine safety field.
Drawings
FIG. 1 is a block diagram of the MSRBAC model architecture of the present invention;
FIG. 2 is a schematic flow chart of the coal mine safety-oriented entity identification method of the present invention;
FIG. 3 is a data sentence length distribution histogram of the present invention;
FIG. 4 is a schematic diagram of an embodiment of Word Embedding layer input representation;
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings:
as shown in fig. 2, in the entity identification method for the coal mine safety field of the present invention, in the construction stage of the coal mine safety field entity data set, the data are cleaned, processed and labeled into a usable data set; in the model design stage, a RoBERTa pre-trained language model serves as the input layer, exploiting its larger training set, larger batch size and dynamic masking to obtain feature inputs that better fit the context; a bidirectional long short-term memory network further learns the context relationships, an attention mechanism assigns different weights to the sequence elements, and a CRF computes the most probable state path to obtain the final entity category, constructing the MSRBAC model; finally the trained MSRBAC model is used for entity identification in the mine safety field.
Specifically, firstly, a coal mine field noun is collected as an entity to form a coal mine safety field entity data set, then the coal mine safety field entity data set is used for training an MSRBAC model, and finally the trained MSRBAC model is used for identifying the entity in the coal mine safety field entity data set;
s1, constructing a coal mine safety field entity data set:
s1.1, designing related keywords of the coal mine safety field, collecting related linguistic data of the coal mine safety field in a document retrieval and webpage crawling mode, analyzing the related linguistic data into texts, using the texts as raw linguistic data of the coal mine safety field, cleaning and preprocessing the raw linguistic data, namely removing useless symbols in the texts, and intercepting the texts into text lengths acceptable by a model to form data to be labeled;
s1.2, setting the entity types of the coal mine safety field according to the national standard GB/T 15663 "Coal Mine Science and Technology Terms" and domain expert opinions; the entity categories are set to 9 classes, namely coal geology and exploration CRP, roadway engineering SDE, coal mining CM, hoisting and transportation HT, mine survey MMS, coal mine safety MS, blasting materials and technology EMBT, mining machinery WMDM and coal mine electrical engineering CMEE, each marked with its English acronym, as shown in the following table
s1.3, performing character-level BIO labeling of the data sentence by sentence with a multi-method fusion semi-automatic labeling method combining dictionary reverse matching, CRF++ automatic labeling and iterative manual checking: B marks the first character of an entity noun, I marks the non-initial characters of an entity noun, and O marks the remaining characters that need no label, forming the coal mine safety field entity data set, an entity labeling set in which the category and position of each entity are marked in the text; the data set is divided into a training set, a verification set and a test set in the ratio 6:2:2 for iterative model training and prediction;
The multi-method fusion semi-automatic labeling method first stores the entities obtained from the national standard GB/T 15663 "Coal Mine Science and Technology Terms" and Baidu Encyclopedia into dictionaries by category and runs a script that reverse-labels the data set to be annotated against these dictionaries; it then expands the dictionary-labeled entities with the CRF++ algorithm, greatly reducing labor cost; finally a manual pass corrects wrong labels and adds labels that correspond to entities but were missed, completing the labeling of the data set. Sentences are BIO-tagged at the character level: B indicates the current character is the beginning of the entity it belongs to, I indicates the current character is inside or at the end of that entity, and O indicates the current character is not part of any entity.
S2, establishing an MSRBAC model, wherein the structure is shown in figure 1:
the first layer of the MSRBAC model is a character embedding layer formed by a RoBERTa pre-trained language model; before multiple rounds of data feeding, tokens in the sequence, i.e. the words segmented at sentence character level, are randomly masked and predicted so as to continuously learn the entity characteristics in the coal mine safety field entity data set, yielding a trained RoBERTa pre-trained language model; the RoBERTa model is then used as the feature extractor of the input sequence, converting characters/words into the word vectors obtained by training and assigning the weights obtained by pre-training, which improves on the training speed of random initialization;
the second layer of the MSRBAC model is a BiLSTM layer comprising a two-layer linearly connected bidirectional long short-term memory network BiLSTM, built with tf.nn.bidirectional_dynamic_rnn in TensorFlow, which performs further feature extraction on the embedding sequence converted into word vectors so as to better understand the context semantics of the current input text; a dropout layer follows the BiLSTM layer, randomly discarding neurons in the network to prevent overfitting and enhance the generalization ability of the model; meanwhile, the gate structure in the LSTM is used to solve the problem of retaining long-term and short-term information, and comprises a forget gate, an input gate and an output gate: the forget gate decides which information to discard or retain, the input gate filters information and selectively discards information without actual significance, and the output gate uses a sigmoid function to decide which information the current cell state outputs;
the third layer of the MSRBAC model is an Attention layer; the hidden vectors produced by the BiLSTM already contain rich context feature information, but because those features all carry the same weight they can cause larger errors when entity types are distinguished, so the hidden vectors h_t = [h_1, h_2, ..., h_T] output by the BiLSTM layer are given different weights a_t, computed as
e_t = tanh(W·h_t + b), a_t = exp(e_t) / Σ_{j=1}^{T} exp(e_j)
where a_t is the weight given by a learnable function with parameters W and b, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length; the correlation between characters, captured by the weights a_t, is used to divide word boundaries in the sentence;
the fourth layer of the MSRBAC model is a CRF layer, used to decide the BIO labelling category of each Chinese character in the text; the CRF layer is constructed with tf.contrib.crf and constrains the labels, for example the first label of a sentence cannot start with I;
through the MSRBAC model for identifying coal mine safety entities, the text is converted into a data sequence; the probabilities of all BIO labels in the hidden state of each position are calculated by the BiLSTM layer and the Attention layer, and the label probabilities of each position are input into the CRF layer; all possible label paths are scored with the scoring function s(X, Y) encapsulated by TensorFlow, s(X, Y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}, where A is the transition matrix and P is the weight (emission) matrix; the score is normalized into a probability with the softmax function over all possible label sequences Ỹ, P(Y | X) = exp(s(X, Y)) / Σ_{Ỹ} exp(s(X, Ỹ)), and finally the probability of the correct label sequence is maximized by maximum-likelihood estimation; through this calculation the optimal label path is obtained, and the useful entities in the sequence and their types are finally found;
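The path score s(X, Y) and its softmax normalisation over all candidate label paths can be sketched in NumPy by brute-force enumeration; this is feasible only for toy sizes, and the matrices below are illustrative assumptions (the patent's TensorFlow CRF computes the normaliser with dynamic programming instead):

```python
import numpy as np
from itertools import product

def path_score(P, A, y):
    """s(X, Y): emission scores P[i, y_i] plus transition scores A[y_i, y_{i+1}]."""
    s = float(P[np.arange(len(y)), y].sum())
    s += sum(float(A[y[i], y[i + 1]]) for i in range(len(y) - 1))
    return s

def sequence_prob(P, A, y):
    """Softmax of s(X, Y) over every possible label path (brute force)."""
    n, k = P.shape
    all_scores = np.array([path_score(P, A, list(p))
                           for p in product(range(k), repeat=n)])
    return np.exp(path_score(P, A, y)) / np.exp(all_scores).sum()

P = np.array([[1.0, 0.2],      # toy emission scores: 2 positions, 2 tags
              [0.1, 0.8]])
A = np.array([[0.3, -0.1],     # toy transition scores between tags
              [0.0, 0.2]])
p_path = sequence_prob(P, A, [0, 1])
```

Summing `sequence_prob` over all k^n paths gives exactly 1, which is the property maximum-likelihood training relies on when it pushes probability mass toward the gold path.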
wherein a dropout layer is arranged behind the RoBERTa layer and behind the BiLSTM layer respectively to prevent overfitting; the dropout rate behind the RoBERTa layer is set to 0.1, and that behind the BiLSTM layer to 0.4;
texts from the coal mine safety field entity data set are input into the MSRBAC model sentence by sentence, in character-level form together with the BIO tagging sequence; introducing the RoBERTa pre-training language model into coal mine safety entity recognition reduces the required coal mine safety entity sample size and the labor cost; RoBERTa converts the input characters into feature word vectors and is followed by a dropout layer, which avoids overfitting by randomly discarding neurons in the network; the output feature word vectors serve as the input of the subsequent model structure;
s3 training the MSRBAC model:
the raw corpus text of the coal mine field, after data cleaning, processing and labelling, is input as data into the RoBERTa pre-training language model, which converts it into word vectors according to word-level tokens; the token vectors of the RoBERTa layer are input into the BiLSTM layer for further learning, so that the MSRBAC model better understands the contextual relationships of the text; the Attention mechanism assigns a different weight to each h_t in the sequence to highlight important elements, and finally the CRF layer selects the label sequence with the highest prediction score as the best answer;
when the input text is too long, problems such as gradient vanishing and gradient explosion can arise; gradient clipping resolves the gradient-explosion problem to a certain extent, while the forget gate, input gate and output gate introduced in the LSTM alleviate the gradient-vanishing problem; the specific calculation formulas at a time step t are:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
where f_t is the forget gate, determining the information to be discarded or retained; i_t is the input gate, filtering the input information and selectively discarding part of it; C̃_t is the cell-state candidate value, prepared for updating the current cell state; C_t is the current cell state, used for long-term memory; o_t is the output gate, where a sigmoid layer determines which part of the cell state is output; and h_t is the hidden-layer state value, i.e. the output at the current moment, used for short-term memory. Through this calculation, the gradient-vanishing problem of over-long texts in the neural network is effectively mitigated.
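As a concrete illustration of the gate equations above, the following minimal NumPy sketch performs one LSTM time step; the stacked weight matrix W, the bias b and the toy dimensions are illustrative assumptions, not the patent's TensorFlow implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step following the gate equations above; W maps the
    concatenated [h_{t-1}, x_t] to the four stacked gates f, i, o, C~."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b  # (4d,) pre-activations
    f_t = sigmoid(z[0:d])              # forget gate: what to keep of C_{t-1}
    i_t = sigmoid(z[d:2 * d])          # input gate: what new info to admit
    o_t = sigmoid(z[2 * d:3 * d])      # output gate: what the cell exposes
    C_tilde = np.tanh(z[3 * d:4 * d])  # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde # long-term memory update
    h_t = o_t * np.tanh(C_t)           # short-term (hidden) output
    return h_t, C_t

rng = np.random.default_rng(0)
d, dx = 3, 2                           # toy hidden and input sizes
W = rng.standard_normal((4 * d, d + dx))
b = np.zeros(4 * d)
h_t, C_t = lstm_step(rng.standard_normal(dx), np.zeros(d), np.zeros(d), W, b)
```

Because h_t = o_t * tanh(C_t) with o_t in (0, 1), every component of the hidden output stays strictly inside (-1, 1), which is part of what keeps gradients stable over long sequences.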
The Attention mechanism is introduced to distribute weights over the hidden vectors output by the BiLSTM; the calculation formula is:
a_t = exp(e_t) / Σ_{k=1}^{T} exp(e_k), with e_t = f(h_t)
where f is a learnable function providing the weights, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length.
The MSRBAC model training specifically comprises the following steps:
using crf_log_likelihood as the loss function, the model parameters are optimized with the stochastic gradient descent (SGD) algorithm and back-propagation to train the MSRBAC model; the input information of the model comprises the training set train and the verification set dev; the hyper-parameters comprise: entity type count entity_num = 9; dropout after the RoBERTa layer 0.1; dropout after the BiLSTM layer 0.4; maximum sentence length max_len = 512; batch size batch_size = 8; total number of training epochs epoch = 50; learning rate lr = 5e-5. The specific training steps are as follows:
feeding the training set train and the verification set dev processed in the previous step into the model through a DataIterator;
randomly shuffling the data in the training set train with a shuffle function, and in each round selecting batch_size items of data to feed the model for training;
the data first flows into RoBERTa, which is fine-tuned on the current sequence, then enters the BiLSTM and Attention layers to calculate all label probabilities, and finally the CRF inference layer calculates the loss value loss;
the model parameters are optimized through back-propagation; the model trained in the current epoch is then verified on the verification set dev, calculating the precision P, the recall R and their harmonic mean F1; the F1 value of the best model, best_F1, is saved, and if the F1 value remains smaller than the current best_F1 for more than two rounds, training is terminated early and model training ends.
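The early-stopping rule of the training loop — stop once the dev-set F1 has failed to beat best_F1 for two consecutive rounds — can be sketched as follows; `evaluate_epoch` and the toy score curve are hypothetical stand-ins for one epoch of actual training:

```python
def train_with_early_stopping(evaluate_epoch, max_epoch=50, patience=2):
    """Run up to max_epoch rounds; stop once the dev-set F1 has failed
    to beat best_F1 for `patience` consecutive rounds."""
    best_f1, bad_rounds = -1.0, 0
    for e in range(max_epoch):
        f1 = evaluate_epoch(e)            # train one epoch, return dev F1
        if f1 > best_f1:
            best_f1, bad_rounds = f1, 0   # a real loop would save the model here
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                break                     # early termination
    return best_f1

# toy F1 curve: improves twice, then plateaus, so training stops early
scores = [0.70, 0.75, 0.74, 0.73, 0.72, 0.71]
epochs_run = []
def eval_epoch(e):
    epochs_run.append(e)
    return scores[e]

best = train_with_early_stopping(eval_epoch, max_epoch=6, patience=2)
```

With this curve the loop runs epochs 0–3 only: the F1 of 0.75 at epoch 1 is never beaten, and two non-improving rounds trigger the break, saving the remaining epochs.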
The first embodiment,
In the entity recognition model oriented to the coal mine safety field, it can clearly be observed that the model consists of 4 parts: the RoBERTa pre-training language model, the BiLSTM bidirectional long short-term memory network, the Attention mechanism and the CRF conditional random field; the output of each layer is the input of the next, and sentences input into the model yield the entities in the sentences and their categories.
The method specifically comprises the following steps:
(1) coal mine safety field data acquisition
Because no labeled or public data set in the coal mine safety field is currently available for model training, relevant documents in the coal mine safety field are obtained by literature retrieval, web crawling and the like and serve as the raw corpus. In combination with domain-expert opinions, the data set adopted by the invention has the following 3 main sources: 1) national standard publishing websites (such as the national standard information public service platform and the national standard network), which provide the entity type definition references and entities. 2) Encyclopedia websites (such as Baidu Baike and Sogou Baike), which provide entries related to the coal mine safety field. 3) Coal mine websites (such as the coal mine safety net), which provide massive text resources related to coal mine safety. These unstructured corpora serve as the corpus of the coal mine safety field entity data set.
Static-structure webpages are crawled with the urllib2 package in Python, and dynamic-structure webpages with the selenium package. Source 1) is collected through literature retrieval and can be downloaded manually; most of the collected data is information stored as pictures, which the model cannot parse, so the pictures are converted into text using the OCR character-recognition API provided by Baidu Intelligent Cloud. Sources 2) and 3) are crawled web-page information containing html tags, links and other noise, which is cleaned with Python regular expressions. Data from all sources is parsed and stored in txt files to facilitate batch operations in Python. The concept entities in sources 1) and 2) are sorted and summarized, divided into 9 categories in combination with domain-expert opinions, and 9 category dictionaries are constructed to store the corresponding entities.
Noise such as invisible symbols and emoticons is removed from the data of source 3) with regular expressions; after cleaning, the data is split into sentences at full stops, enumeration marks and the like; by observing the data distribution shown in figure 3 and considering the maximum acceptable input length of RoBERTa, the maximum sentence length is set to 510 characters and over-long texts are truncated. In total 6602 sentences are obtained in this way, forming the raw corpus to be labeled. The details are shown in the following table:
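The sentence-splitting and truncation step might look like the following sketch; the helper name and the exact punctuation set are assumptions, with the 510-character cap chosen because RoBERTa's 512-token limit must leave room for its special tokens:

```python
import re

MAX_LEN = 510  # RoBERTa accepts 512 tokens; 2 are reserved for special tokens

def split_and_truncate(text, max_len=MAX_LEN):
    """Split cleaned raw text into sentences at sentence-final punctuation
    and truncate any sentence longer than max_len characters."""
    sentences = [s for s in re.split(r"[。！？]", text) if s]
    return [s[:max_len] for s in sentences]

# toy example with a deliberately tiny max_len to show the truncation
sents = split_and_truncate("瓦斯爆炸危害极大。矿井通风十分重要！", max_len=5)
```

Character-level truncation is safe here because the downstream model also operates on character-level tokens, so no multi-character token is cut in half.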
(2) coal mine safety field entity labeling
The invention adopts a semi-automatic BIO labeling method fusing multiple methods: first, a python script reverse-labels the raw corpus to be labeled against the entities in the 9 category dictionaries; the corpus is then input into the CRF++ algorithm, whose feature templates expand the dictionary-labeled entities and label entities the dictionaries missed; finally, wrong labels are corrected manually and the labeling of the data set is completed. The labeled sentences are shuffled and divided into a training set, a verification set and a test set at a ratio of approximately 6:2:2; their entity distributions are shown in tables 2 and 3.
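The dictionary reverse-labeling stage (the first of the three fused methods) can be sketched as a longest-match-first scan that emits B/I/O tags; the helper name, the toy dictionary and the MS category tag are illustrative assumptions, not the patent's script:

```python
def bio_label(sentence, dictionary):
    """Back-label a sentence against an entity dictionary with a
    longest-match-first scan: B on the first character of a hit,
    I on the remaining characters, O everywhere else."""
    tags = ["O"] * len(sentence)
    # try longer dictionary entries first so nested entities match greedily
    entries = sorted(dictionary.items(), key=lambda kv: -len(kv[0]))
    i = 0
    while i < len(sentence):
        hit = next(((e, c) for e, c in entries
                    if sentence.startswith(e, i)), None)
        if hit:
            ent, cat = hit
            tags[i] = "B-" + cat
            for j in range(i + 1, i + len(ent)):
                tags[j] = "I-" + cat
            i += len(ent)
        else:
            i += 1
    return tags

tags = bio_label("瓦斯爆炸事故", {"瓦斯爆炸": "MS"})
```

Output of this kind is exactly what the subsequent CRF++ pass and the manual-correction pass then refine.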
(3) Model training
The training set and the verification set are fed into the MSRBAC coal mine safety field entity recognition model for training; the entity recognition model is vectorized with RoBERTa, whose input representation is shown in figure 4, and the model hyper-parameters are set as follows: the RoBERTa Transformer encoder has 12 layers, a hidden-variable dimension of 768 and 12 attention heads; the BiLSTM network size is set to 256; the maximum sentence length is 512 words; entity_num = 9, batch_size = 8, epoch = 50; to prevent model overfitting, the dropout after the RoBERTa layer is set to 0.1 and that after the BiLSTM layer to 0.4; the learning rate is 5e-5.
To reduce the evaluation deviation caused by category imbalance, the micro-averaged F1 value (micro-F1) is used as the model performance evaluation index; with micro-averaged precision P and recall R it is calculated as micro-F1 = 2PR / (P + R). During training, the optimal F1 and loss values are recorded; if the F1 value does not improve for more than two rounds, or the loss does not improve for more than 1000 steps, model training is terminated early; the model hyper-parameters are then adjusted by observing the verification set and the model is retrained until the hyper-parameters reach their optimal values.
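Micro-averaging pools the true positives, false positives and false negatives across all classes before computing P, R and F1, as in this small sketch (the per-class counts are made up for illustration):

```python
def micro_f1(tp, fp, fn):
    """Micro-averaged F1: pool TP/FP/FN over all entity classes first,
    then compute precision P, recall R and their harmonic mean."""
    TP, FP, FN = sum(tp), sum(fp), sum(fn)
    P = TP / (TP + FP)   # micro precision
    R = TP / (TP + FN)   # micro recall
    return 2 * P * R / (P + R)

# made-up counts for two of the nine classes: TP=10, FP=2, FN=2 overall
f1 = micro_f1(tp=[8, 2], fp=[1, 1], fn=[2, 0])
```

Pooling the counts first means a large class and a rare class contribute in proportion to their instance counts, which is why micro-F1 is less distorted by class imbalance than an unweighted per-class average.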
(4) Model prediction
The test set is fed into the optimal model obtained in step (3); during prediction, the Viterbi algorithm in the CRF calculates the state path with the highest score among all possible sequences, Y* = argmax_{Ỹ} s(X, Ỹ), from which the positions of the entities in the input sequence and their categories are obtained.
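Viterbi decoding of the best label path from an emission matrix P and transition matrix A can be sketched in NumPy as follows; this is a generic textbook implementation under toy matrices, not the patent's TensorFlow code:

```python
import numpy as np

def viterbi(P, A):
    """Highest-scoring tag path under emission matrix P (n x k) and
    transition matrix A (k x k), via dynamic programming."""
    n, k = P.shape
    score = P[0].copy()                    # best score ending in each tag
    back = np.zeros((n, k), dtype=int)     # backpointers
    for i in range(1, n):
        # cand[p, q]: best score of a path ending with transition p -> q
        cand = score[:, None] + A + P[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]           # best final tag
    for i in range(n - 1, 0, -1):          # follow backpointers
        path.append(int(back[i, path[-1]]))
    return path[::-1]

P = np.array([[1.0, 0.0],                  # toy emissions: 3 positions, 2 tags
              [0.0, 1.0],
              [1.0, 0.0]])
A = np.zeros((2, 2))                       # neutral transitions
best_path = viterbi(P, A)
```

With neutral transitions the decoder simply follows the strongest emission at each position, so the toy input yields the path [0, 1, 0]; a trained A matrix is what rules out invalid moves such as O followed directly by I.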
Claims (8)
1. An entity identification method oriented to the coal mine safety field, characterized in that: firstly, coal mine field nouns are collected as entities to construct a coal mine safety field entity data set; then the MSRBAC model is trained with the coal mine safety field entity data set; and finally the entities in the coal mine safety field entity data set are identified with the trained MSRBAC model;
s1, constructing a coal mine safety field entity data set:
s1.1, designing keywords related to the coal mine safety field, collecting relevant corpora of the coal mine safety field through literature retrieval and web crawling, parsing them into texts to serve as the raw corpus of the coal mine safety field, and cleaning and preprocessing the raw corpus, i.e. removing useless symbols in the texts and truncating the texts to a length acceptable to the model, to form the data to be labeled;
s1.2, setting entity types of the coal mine safety field according to national standard GB/T15663 coal mine science and technology terminology and field expert opinions;
s1.3, carrying out word-level BIO labeling on data according to sentences by utilizing a multi-method fusion semi-automatic labeling method of dictionary reverse matching, CRF + + automatic labeling and manual check iteration: b represents the beginning character of the entity noun, I represents the non-beginning character of the entity noun, O represents other words which do not need to be labeled, and a coal mine safety field entity data set is formed, wherein the data set is an entity labeling set, namely the category and the position of an entity are labeled in a text, and the data set is divided into a training set, a verification set and a test set according to the ratio of 6:2:2 for model iterative training and prediction;
s2, establishing an MSRBAC model:
the first layer of the MSRBAC model is a character-embedding layer formed by a RoBERTa pre-training language model; through dynamic masking, the tokens in the sequence, i.e. the units obtained by character-level segmentation of the sentence, are randomly masked and predicted before each round of data feeding, so that the model continuously learns the entity characteristics in the coal mine safety field entity data set, yielding a trained RoBERTa pre-training language model; the trained RoBERTa model is then used as a feature extractor for the input sequence, converting characters/words into the word vectors obtained through training and assigning them the weights obtained through pre-training, which speeds up training compared with random initialization;
the second layer of the MSRBAC model is a BiLSTM layer, comprising a two-layer, linearly connected bidirectional long short-term memory network (BiLSTM) constructed with tf.nn.bidirectional_dynamic_rnn in TensorFlow; it performs further feature extraction on the embedding sequence that has been converted into word vectors, so as to better capture the context semantics of the current input text; a dropout layer is connected behind the BiLSTM layer, randomly discarding neurons in the network to prevent overfitting and enhance the generalization capability of the model; meanwhile, the gate mechanism in the LSTM solves the problem of retaining long- and short-term information, comprising a forget gate, an input gate and an output gate: the forget gate determines which information is discarded or retained; the input gate filters information, selectively discarding information without practical significance; and the output gate uses a sigmoid function to determine the information output by the current cell state;
the third layer of the MSRBAC model is an Attention layer; the hidden vectors produced by the BiLSTM already contain rich context feature information, but these features carry equal weight, which causes larger errors when distinguishing entity types; therefore different weights a_t are given to the hidden vectors h_t = [h_1, h_2, ..., h_T] output by the BiLSTM layer, calculated as a_t = exp(e_t) / Σ_{k=1}^{T} exp(e_k), with e_t = f(h_t), where f is a learnable function providing the weights, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length; the correlation between characters, captured by the weights a_t, is used to divide word boundaries in the sentence;
the fourth layer of the MSRBAC model is a CRF layer, used to decide the BIO labelling category of each Chinese character in the text; the CRF layer is constructed with tf.contrib.crf and constrains the labels, for example the first label of a sentence cannot start with I;
s3 training the MSRBAC model:
the raw corpus text of the coal mine field, after data cleaning, processing and labelling, is input as data into the RoBERTa pre-training language model, which converts it into word vectors according to word-level tokens; the token vectors of the RoBERTa layer are input into the BiLSTM layer for further learning, so that the MSRBAC model better understands the contextual relationships of the text; the Attention mechanism assigns a different weight to each h_t in the sequence to highlight important elements, and finally the CRF layer selects the label sequence with the highest prediction score as the best answer.
2. The entity identification method oriented to the coal mine safety field according to claim 1, characterized in that: the entity categories of step S1.2 are set to 9 classes, each identified by an English abbreviation: coal geology and exploration CRP, roadway engineering SDE, coal mining CM, hoisting and transportation HT, mine surveying MMS, coal mine safety MS, blasting materials and technology EMBT, mining machinery WMDM, and coal mine electrical engineering CMEE.
3. The entity identification method oriented to the coal mine safety field according to claim 1, characterized in that: in step S1.3, the multi-method-fused semi-automatic labeling method first stores the entities obtained from national standard GB/T15663 "Coal mining science and technology terminology" and Baidu Encyclopedia into corresponding dictionaries by category, and reverse-labels the data set to be labeled with a script; the CRF++ algorithm then expands the dictionary-labeled entities, which greatly reduces labor cost; finally, manual correction labels entities that were missed and corrects wrong labels, completing the labeling of the data set.
4. The entity identification method oriented to the coal mine safety field according to claim 1, characterized in that: in step S1.3, a sentence is BIO-tagged at the word level: B indicates that the current word is the first character of the entity it belongs to, I indicates that the current word is inside or at the end of that entity, and O indicates that the current word is not part of any entity.
5. The entity identification method oriented to the coal mine safety field according to claim 1, characterized in that: through the MSRBAC model for identifying coal mine safety entities, the text is converted into a data sequence; the probabilities of all BIO labels in the hidden state of each position are calculated by the BiLSTM layer and the Attention layer, and the label probabilities of each position are input into the CRF layer; all possible label paths are scored with the scoring function s(X, Y) encapsulated by TensorFlow, s(X, Y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}, where A is the transition matrix and P is the weight (emission) matrix; the score is normalized into a probability with the softmax function over all possible label sequences Ỹ, P(Y | X) = exp(s(X, Y)) / Σ_{Ỹ} exp(s(X, Ỹ)), and finally the probability of the correct label sequence is maximized by maximum-likelihood estimation; through this calculation the optimal label path is obtained, and the useful entities in the sequence and their types are finally found;
wherein a dropout layer is arranged behind the RoBERTa layer and behind the BiLSTM layer respectively to prevent overfitting; the dropout rate behind the RoBERTa layer is set to 0.1, and that behind the BiLSTM layer to 0.4;
texts from the coal mine safety field entity data set are input into the MSRBAC model sentence by sentence, in character-level form together with the BIO tagging sequence; introducing the RoBERTa pre-training language model into coal mine safety entity recognition reduces the required coal mine safety entity sample size and the labor cost; RoBERTa converts the input characters into feature word vectors and is followed by a dropout layer, which avoids overfitting by randomly discarding neurons in the network; the output feature word vectors serve as the input of the subsequent model structure.
6. The entity identification method oriented to the coal mine safety field according to claim 1, characterized in that: when the input text is too long, problems such as gradient vanishing and gradient explosion can arise; gradient clipping resolves the gradient-explosion problem to a certain extent, while the forget gate, input gate and output gate introduced in the LSTM alleviate the gradient-vanishing problem; the specific calculation formulas at a time step t are:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
where f_t is the forget gate, determining the information to be discarded or retained; i_t is the input gate, filtering the input information and selectively discarding part of it; C̃_t is the cell-state candidate value, prepared for updating the current cell state; C_t is the current cell state, used for long-term memory; o_t is the output gate, where a sigmoid layer determines which part of the cell state is output; and h_t is the hidden-layer state value, i.e. the output at the current moment, used for short-term memory. Through this calculation, the gradient-vanishing problem of over-long texts in the neural network is effectively mitigated.
7. The entity identification method oriented to the coal mine safety field according to claim 1, characterized in that: the Attention mechanism is introduced to distribute weights over the hidden vectors output by the BiLSTM; the calculation formula is:
a_t = exp(e_t) / Σ_{k=1}^{T} exp(e_k), with e_t = f(h_t)
where f is a learnable function providing the weights, h_t is the hidden vector [h_1, h_2, ..., h_T] output by the BiLSTM layer, and T is the input sentence length.
8. The entity identification method oriented to the coal mine safety field of claim 1, wherein the MSRBAC model training specifically comprises:
using crf_log_likelihood as the loss function, the model parameters are optimized with the stochastic gradient descent (SGD) algorithm and back-propagation to train the MSRBAC model; the input information of the model comprises the training set train and the verification set dev; the hyper-parameters comprise: entity type count entity_num = 9; dropout after the RoBERTa layer 0.1; dropout after the BiLSTM layer 0.4; maximum sentence length max_len = 512; batch size batch_size = 8; total number of training epochs epoch = 50; learning rate lr = 5e-5. The specific training steps are as follows:
feeding the training set train and the verification set dev processed in the previous step into the model through a DataIterator;
randomly shuffling the data in the training set train with a shuffle function, and in each round selecting batch_size items of data to feed the model for training,
the data first flows into RoBERTa, which is fine-tuned on the current sequence, then enters the BiLSTM and Attention layers to calculate all label probabilities, and finally the final CRF inference layer calculates the loss value loss,
the model parameters are optimized through back-propagation; the model trained in the current epoch is then verified on the verification set dev, calculating the precision P, the recall R and their harmonic mean F1; the F1 value of the best model, best_F1, is saved, and if the F1 value remains smaller than the current best_F1 for more than two rounds, training is terminated early and model training ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111301680.7A CN113988054B (en) | 2021-11-04 | Entity identification method oriented to coal mine safety field |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113988054A true CN113988054A (en) | 2022-01-28 |
CN113988054B CN113988054B (en) | 2024-07-16 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111709241A (en) * | 2020-05-27 | 2020-09-25 | 西安交通大学 | Named entity identification method oriented to network security field |
CN112733541A (en) * | 2021-01-06 | 2021-04-30 | 重庆邮电大学 | Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism |
WO2021082366A1 (en) * | 2019-10-28 | 2021-05-06 | 南京师范大学 | Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus |
US20210319314A1 (en) * | 2020-04-09 | 2021-10-14 | Naver Corporation | End-To-End Graph Convolution Network |
Non-Patent Citations (1)
Title |
---|
LIU Peng; YE Shuai; SHU Ya; LU Xiaolong; LIU Mingming: "Research on Coal Mine Safety Knowledge Graph Construction and Intelligent Query Method" (煤矿安全知识图谱构建及智能查询方法研究), Journal of Chinese Information Processing (中文信息学报), vol. 34, no. 11, 31 December 2020 (2020-12-31) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115146642A (en) * | 2022-07-21 | 2022-10-04 | 北京市科学技术研究院 | Automatic training set labeling method and system for named entity recognition |
CN115146642B (en) * | 2022-07-21 | 2023-08-29 | 北京市科学技术研究院 | Named entity recognition-oriented training set automatic labeling method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110866117B (en) | Short text classification method based on semantic enhancement and multi-level label embedding | |
CN108984526B (en) | Document theme vector extraction method based on deep learning | |
CN111444726B (en) | Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure | |
CN109271529B (en) | Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian | |
CN113011533A (en) | Text classification method and device, computer equipment and storage medium | |
CN105389379A (en) | Rubbish article classification method based on distributed feature representation of text | |
CN112541356B (en) | Method and system for recognizing biomedical named entities | |
CN109960799A (en) | A kind of Optimum Classification method towards short text | |
CN113673254B (en) | Knowledge distillation position detection method based on similarity maintenance | |
CN115599902B (en) | Oil-gas encyclopedia question-answering method and system based on knowledge graph | |
CN110134793A (en) | Text sentiment classification method | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
CN112860898B (en) | Short text box clustering method, system, equipment and storage medium | |
Gan et al. | Character-level deep conflation for business data analytics | |
CN111460147B (en) | Title short text classification method based on semantic enhancement | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN114722835A (en) | Text emotion recognition method based on LDA and BERT fusion improved model | |
CN114265935A (en) | Science and technology project establishment management auxiliary decision-making method and system based on text mining | |
CN116108191A (en) | Deep learning model recommendation method based on knowledge graph | |
CN116882402A (en) | Multi-task-based electric power marketing small sample named entity identification method | |
CN114626367A (en) | Sentiment analysis method, system, equipment and medium based on news article content | |
CN114356990A (en) | Base named entity recognition system and method based on transfer learning | |
CN112347247A (en) | Specific category text title binary classification method based on LDA and Bert | |
Shan | Social Network Text Sentiment Analysis Method Based on CNN‐BiGRU in Big Data Environment | |
CN107562774A (en) | Generation method, system and the answering method and system of rare foreign languages word incorporation model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |