CN113139057A - Domain-adaptive chemical potential safety hazard short text classification method and system - Google Patents
- Publication number: CN113139057A
- Application number: CN202110511224.9A
- Authority: CN (China)
- Prior art keywords: text, short text, vector, classified, vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
- G06F18/2415 — Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F40/30 — Handling natural language data; Semantic analysis
- G06N3/045 — Neural networks; Combinations of networks
- G06N3/048 — Neural networks; Activation functions
- G06N3/08 — Neural networks; Learning methods
Abstract
The invention discloses a domain-adaptive chemical potential safety hazard short text classification method and system, which acquire a plurality of short texts to be classified from the field of chemical potential safety hazard investigation; carry out vector extraction on each short text to be classified to obtain its corresponding initial text vector; and input the initial text vectors corresponding to all the texts to be classified into a trained short text classification model, which outputs the short text classification results. A GRU + HAN model learns short-text representations that fuse character-, word- and sentence-level information in the specific field, mitigating the domain-information deviation problem of general-corpus short texts and showing a better classification effect on the classification task of chemical engineering potential safety hazard investigation.
Description
Technical Field
The invention relates to the technical field of short text classification, and in particular to a domain-adaptive method and system for classifying chemical potential safety hazard short texts.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the rapid development of deep learning, many researchers have tried to solve the text classification problem with deep models; CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks) in particular have produced many novel and fruitful classification methods. These methods handle problems such as internet news classification and sentiment analysis well, but in specific application domains the text characteristics differ: professional terms, abbreviations, non-standard expressions and similar practical problems exist, so the practical application effect is only average.
In particular, in the potential safety hazard texts summarized during chemical potential safety hazard troubleshooting, the investigation reports provided by workers often contain a large number of professional terms, numbers, mixed Chinese-English technical nouns and irregular language expressions, and mostly consist of sentences whose lengths vary greatly. Mainstream text classification models struggle to capture accurate classification feature information from text that lacks context, so the classification of potential safety hazards is often inaccurate. Therefore, strengthening the domain semantic information in short texts is the key to effectively solving the text classification problem for potential safety hazards in the chemical domain, and is of great significance for the safety management, early warning and potential safety hazard investigation of chemical enterprises.
Short text classification is an important research direction in natural language processing. Its difficulty lies mainly in the fact that sentences are short and terse, each word can carry rich meanings, and the semantics of the sentence are closely intertwined with them. In many text classification tasks, traditional classification methods such as the naive Bayes model assume that attributes are mutually independent and do not consider the contextual association of the text, so their support for the semantic features of short texts is poor.
KIM Y proposed applying CNNs to the text classification task, using multiple convolution kernels of different sizes to capture locally correlated feature information; however, the fixed window sizes also mean that the length of context dependence a CNN can extract is relatively fixed.
The RNN proposed by LAI S et al. can use the information of context words in a sentence by splicing word embedding vectors with each word, which effectively alleviates the problem that a CNN cannot dynamically change its window size to adapt to the context lengths of different texts, but it also brings the problems of gradient vanishing and gradient explosion during training.
The LSTM (Long Short-Term Memory) network proposed by Nguyen et al. is a further extension of the RNN that addresses the long-term dependence problem by adding a cell state, and performs better when training on long-sequence text.
The GRU (Gated Recurrent Unit) proposed by Cho et al. improves on the LSTM by merging the cell state and the hidden state, which greatly improves the computational performance of the model.
In the process of implementing the invention, the inventor finds that the following technical problems exist in the prior art:
Short texts are characterized by large variation in text length, missing context information, sparse text features, and word semantics with obviously domain-dependent features; general short text classification techniques therefore have low classification accuracy, since they struggle to capture the domain-related feature information of short texts.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a domain-adaptive chemical potential safety hazard short text classification method and system;
in a first aspect, the invention provides a domain-adaptive chemical potential safety hazard short text classification method;
a domain-adaptive chemical potential safety hazard short text classification method comprises the following steps:
acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
In a second aspect, the invention provides a domain-adapted chemical potential safety hazard short text classification system;
a domain-adapted chemical potential safety hazard short text classification system comprises:
an acquisition module configured to: acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
an extraction module configured to: vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
a classification module configured to: and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
In a third aspect, the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
A GRU (Gated Recurrent Unit) short text classification model fused with a Hierarchical Attention Network (HAN) is provided. Word vector representations of the general knowledge in short texts are generated based on BERT (Bidirectional Encoder Representations from Transformers), enhancing the general feature representation of short-text words and sequences; GRU + HAN then learns short-text representations that fuse character-, word- and sentence-level information in the specific field, mitigating the domain-information deviation problem of general-corpus short texts and showing a better classification effect on the classification task of chemical engineering potential safety hazard investigation.
The invention transforms the sentence-level attention mechanism, converting the attention branch between sentences into an implicit attention expression between words and sentences; this retains the semantic features captured by hierarchical attention over sentences while aggregating the more divergent text feature expressions produced by BERT. Compared with short texts, long-text classification can extract semantic information from the context of a sentence and achieves a good classification effect; however, chemical potential safety hazard texts contain both long texts and a great number of short-sequence texts, and how to use a single model to automatically capture semantic features at different levels of text granularity is the core problem in classifying chemical potential safety hazards.
The invention provides a domain-adaptive short text classification model that can effectively solve these problems. GRU-HAN takes BERT as the word embedding method for the text, effectively leveraging BERT's pre-training on massive Chinese text datasets to obtain embedded representations of the common knowledge in short texts. Potential safety hazard texts from the chemical field and their labeled classifications are selected as the training and test datasets, and the method outperforms mainstream text classification methods.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain, not limit, the invention.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2 is a GRU-HAN overall network model architecture of the first embodiment;
FIG. 3 is three levels of the BERT model of the first embodiment;
FIG. 4 is a diagram of a GRU model structure of the first embodiment;
fig. 5 is a diagram illustrating a connection relationship between a GRU and an HAN according to the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment provides a domain-adaptive chemical potential safety hazard short text classification method;
as shown in fig. 1, a domain-adaptive chemical potential safety hazard short text classification method includes:
s101: acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
s102: vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
s103: and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
Further, the S102: vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified; the method specifically comprises the following steps:
and based on the BERT model, vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified.
Further, the inputting the initial text vectors corresponding to all the short texts to be classified into the trained short text classification model and outputting the short text classification result specifically includes:
s1031: the trained short text classification model encodes each initial text vector to obtain text vectors considering the association of the front time sequence and the rear time sequence;
s1032: the trained short text classification model gives the weight between words to each text vector considering the time sequence association before and after the training to obtain a text vector after the first weighting;
s1033: the trained short text classification model splices the text vectors after the first weighting to obtain sentence embedding vectors;
s1034: based on the sentence embedding vector, a trained short text classification model is utilized to endow each text vector considering the time sequence association before and after the consideration with the weight between words and sentences; obtaining a text vector after the second weighting;
s1035: the trained short text classification model splices the text vectors after the second weighting to obtain vectors to be classified;
s1036: and the trained short text classification model classifies the vectors to be classified to obtain the classification result of each short text to be classified.
Further, the short text classification model has a network structure comprising:
the Word embedding structure layer BERT, the Word Encoder, the Word and Word attention mechanism layer, the first splicing unit, the Word and sentence attention mechanism layer, the second splicing unit and the Softmax classification layer are sequentially connected.
Wherein, the word embedding construction layer BERT works as follows: vector extraction is carried out on each short text to be classified to obtain the Token Embeddings, Segment Embeddings and Position Embeddings corresponding to each short text to be classified;
the Token Embeddings, Segment Embeddings and Position Embeddings are fused into the initial text vector.
Wherein the token embedding represents the text feature vector; the segment embedding represents the context feature vector of the text; the position embedding represents where each token is located in the text.
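The fusion of the three embeddings (token, segment, position) is an element-wise sum, as in the standard BERT input layer. Not part of the patent; a minimal numpy sketch, with randomly initialized lookup tables standing in for BERT's learned embedding matrices and all sizes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_pos, hidden = 100, 32, 8  # toy sizes, not BERT's real ones

# Hypothetical lookup tables standing in for trained embedding matrices.
token_table = rng.normal(size=(vocab_size, hidden))
segment_table = rng.normal(size=(2, hidden))
position_table = rng.normal(size=(max_pos, hidden))

def bert_input_embedding(token_ids, segment_ids):
    """Fuse token, segment and position embeddings by element-wise sum,
    as the BERT embedding layer does, to form the initial text vector."""
    positions = np.arange(len(token_ids))
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[positions])

emb = bert_input_embedding([5, 17, 42], [0, 0, 0])
print(emb.shape)  # (3, 8)
```

One row per token; a real BERT layer would additionally apply layer normalization and dropout before the encoder stack.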
The Word Encoder is structurally characterized in that a GRU unit is added on the basis of the Word Encoder of the HAN model.
As shown in fig. 5, the Word Encoder has the following structure.

First, the Word Encoder of the original HAN model comprises a forward chain of coding units h→(1), h→(2), …, h→(p), …, h→(n) connected from left to right, and a backward chain of coding units h←(n), …, h←(q), …, h←(2), h←(1) connected from right to left.

For the forward chain, the first input of coding unit h→(p) receives the output value of coding unit h→(p−1), and its second input receives the p-th output value of the word embedding construction layer BERT; the first output of h→(p) is connected to the input of the p-th Concat unit, whose output is the text vector that takes the forward and backward temporal association into account; the second output of h→(p) is connected to the input of coding unit h→(p+1).

Symmetrically, the first input of coding unit h←(q) receives the output value of coding unit h←(q+1), and its second input receives the q-th output value of the word embedding construction layer BERT; the first output of h←(q) is connected to the input of the q-th Concat unit, whose output is the text vector that takes the forward and backward temporal association into account; the second output of h←(q) is connected to the input of coding unit h←(q−1).
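The bidirectional encoding described above can be sketched as follows. This is not the patent's implementation, only a minimal numpy illustration in which a forward and a backward GRU chain each consume the BERT outputs, and a Concat unit joins the two hidden states at each position; the random weights stand in for trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with randomly initialized weights (stand-ins)."""
    def __init__(self, d_in, d_h, seed=0):
        rng = np.random.default_rng(seed)
        self.Wz, self.Uz = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
        self.Wr, self.Ur = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
        self.Wh, self.Uh = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)          # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)          # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))
        return (1 - z) * h + z * h_tilde

def bi_gru_encode(xs, d_h=4):
    """Encode a sequence with a left-to-right and a right-to-left GRU chain,
    then concatenate the two hidden states at each position (Concat unit)."""
    fwd, bwd = GRUCell(xs.shape[1], d_h, 0), GRUCell(xs.shape[1], d_h, 1)
    h = np.zeros(d_h); hs_f = []
    for x in xs:                       # forward chain h->(1) ... h->(n)
        h = fwd.step(x, h); hs_f.append(h)
    h = np.zeros(d_h); hs_b = []
    for x in xs[::-1]:                 # backward chain h<-(n) ... h<-(1)
        h = bwd.step(x, h); hs_b.append(h)
    hs_b = hs_b[::-1]
    return np.stack([np.concatenate([f, b]) for f, b in zip(hs_f, hs_b)])

H = bi_gru_encode(np.ones((5, 3)))     # 5 toy embedding vectors of width 3
print(H.shape)  # (5, 8)
```

Each row of `H` corresponds to one Concat-unit output: a text vector that takes the forward and backward temporal association into account.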
Wherein, the word-word attention mechanism layer works as follows: each text vector that takes the forward and backward temporal association into account is given inter-word weights, yielding the first-weighted text vector.
Wherein, the first splicing unit specifically performs series concatenation.
Wherein, the word-sentence attention mechanism layer works as follows: each text vector that takes the forward and backward temporal association into account is given word-sentence weights, yielding the second-weighted text vector.
Wherein, the second splicing unit specifically performs series concatenation.
Further, the training of the trained short text classification model comprises:
constructing a training set, wherein the training set is a plurality of short texts to be classified in the chemical potential safety hazard troubleshooting field of known classification labels;
and inputting the training set into a short text classification model, training the short text classification model, and stopping training when the loss function reaches the minimum value or the training reaches the set iteration number to obtain the trained short text classification model.
Further, the S1031: the trained short text classification model encodes each initial text vector to obtain text vectors considering the association of the front time sequence and the rear time sequence; the method specifically comprises the following steps:
and inputting the initial text vectors corresponding to all the texts to be classified into a Word Encoder of the trained short text classification model, and encoding each initial text vector by the Word Encoder of the trained short text classification model to obtain the text vectors considering the time sequence association before and after the encoding.
Further, the S1032: the trained short text classification model gives the weight between words to each text vector considering the time sequence association before and after the training to obtain a text vector after the first weighting; the method specifically comprises the following steps:
and the attention mechanism layer of the words of the trained short text classification model gives weights between the words to each text vector considering the time sequence association before and after the training to obtain the text vector after the first weighting.
Further, the S1032: the trained short text classification model gives the weight between words to each text vector considering the time sequence association before and after the training to obtain a text vector after the first weighting; the specific working principle comprises:
u_t = tanh(W_w h_t + b_w)    (5)

α_t = exp(u_tᵀ u_w) / Σ_t exp(u_tᵀ u_w)    (6)

wherein exp is the exponential function with the natural constant e as base; u denotes the word weight matrix and u_t the representation of the t-th word; the transpose of u_t is multiplied with u_w, a randomly initialized context vector, and the scores are normalized to finally obtain the weighted text vector matrix α_t.
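The normalization above is a softmax over the scores u_tᵀu_w. A small numpy sketch of this word-level attention, with random values standing in for the trained weight matrix, bias and context vector:

```python
import numpy as np

def word_attention(H, Ww, bw, uw):
    """HAN-style word attention: u_t = tanh(Ww @ h_t + bw), then
    alpha_t = softmax over t of (u_t . uw)."""
    U = np.tanh(H @ Ww.T + bw)          # word representations u_t
    scores = U @ uw                     # correlation with context vector u_w
    e = np.exp(scores - scores.max())   # numerically stable exp
    return e / e.sum()                  # normalized weights alpha_t

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))             # 4 encoded word vectors h_t
Ww = rng.normal(size=(6, 6))            # word weight matrix W_w (stand-in)
bw = rng.normal(size=6)                 # bias b_w (stand-in)
uw = rng.normal(size=6)                 # randomly initialized context vector u_w
alpha = word_attention(H, Ww, bw, uw)
print(round(float(alpha.sum()), 6))     # 1.0
```

The weights α_t are non-negative and sum to 1, so they can be interpreted as the relative importance of each word.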
Further, the S1033: the trained short text classification model splices the text vectors after the first weighting to obtain sentence embedding vectors; the method specifically comprises the following steps:
and the trained short text classification model is used for serially splicing the text vectors after the first weighting to obtain sentence embedding vectors.
Further, the S1033: the trained short text classification model splices the text vectors after the first weighting to obtain sentence embedding vectors; the working principle comprises the following steps:
S = Concat(α_1, …, α_t, …, α_n)    (7)

wherein the Concat function is used to splice vectors: the weight matrices α_1 to α_n obtained in the previous step are concatenated to synthesize the sentence vector S.
Further, the S1034: based on the sentence embedding vector, a trained short text classification model is utilized to endow each text vector considering the time sequence association before and after the consideration with the weight between words and sentences; obtaining a text vector after the second weighting; the method specifically comprises the following steps:
based on the sentence embedding vector, giving a weight between a word and a sentence to each text vector considering the time sequence association before and after the consideration by using the attention mechanism layer of the word and the sentence of the trained short text classification model; and obtaining the text vector after the second weighting.
Further, the S1034: based on the sentence embedding vector, a trained short text classification model is utilized to endow each text vector considering the time sequence association before and after the consideration with the weight between words and sentences; obtaining a text vector after the second weighting; the working principle comprises the following steps:
β_t = exp(u_tᵀ S) / Σ_t exp(u_tᵀ S)    (8)

wherein u_t represents the representation of the t-th word; its transposed implicit representation is no longer correlated with u_w, but is instead multiplied with the spliced sentence vector S, and the result is normalized through the exp function to obtain β_t, the second-weighted text vector of the t-th word.
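The second attention pass differs from the first only in what the word representations are scored against. A minimal numpy sketch (sizes and values illustrative only):

```python
import numpy as np

def word_sentence_attention(U, S):
    """Score each word representation u_t against the spliced sentence
    vector S (instead of the context vector u_w), then normalize with
    exp (softmax) to obtain the second weights beta_t."""
    scores = U @ S                      # u_t^T . S for every word t
    e = np.exp(scores - scores.max())   # numerically stable exp
    return e / e.sum()

rng = np.random.default_rng(1)
U = rng.normal(size=(4, 6))   # word representations u_1..u_4
S = rng.normal(size=6)        # spliced sentence vector S (same width assumed)
beta = word_sentence_attention(U, S)
print(beta.shape)  # (4,)
```

Replacing u_w with S is what ties each word's weight to the sentence as a whole rather than to a learned global context.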
Further, the S1035: the trained short text classification model splices the text vectors after the second weighting to obtain vectors to be classified; the method specifically comprises the following steps:
and the trained short text classification model is used for serially splicing the text vectors after the second weighting to obtain the vectors to be classified.
Further, the S1035: the trained short text classification model splices the text vectors after the second weighting to obtain vectors to be classified; the working principle comprises the following steps:
β = Σ_i β_i h_i    (9)

wherein h_i denotes the hidden implicit vector of the i-th word; the context feature vectors β_i, which fuse all the words in the short text and semantically associate words with sentences, are fused by accumulation, and each accumulated semantic feature also carries the implicit vector h_i generated by the single-layer perceptron.
Further, the S1036: the trained short text classification model classifies vectors to be classified to obtain a classification result of each short text to be classified; the method specifically comprises the following steps:
and the Softmax classification layer of the trained short text classification model classifies the text vectors weighted for the second time to obtain the classification result of each short text to be classified.
Further, the S1036: the trained short text classification model classifies vectors to be classified to obtain a classification result of each short text to be classified; the working principle comprises the following steps:
p = softmax(W_c β + b_c)    (10)

wherein W_c and b_c are the matrix coefficient and bias of the classification layer; the input β is the context vector generated from the text, and the softmax function maps it to the final classification score matrix p.
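Equation (10) is a plain linear layer followed by softmax. A minimal numpy sketch with random stand-ins for W_c and b_c and an assumed 3-class output:

```python
import numpy as np

def classify(beta, Wc, bc):
    """p = softmax(Wc @ beta + bc): map the context vector to a
    normalized vector of class scores."""
    logits = Wc @ beta + bc
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(2)
beta = rng.normal(size=6)               # context vector beta from attention
Wc = rng.normal(size=(3, 6))            # matrix coefficient W_c (stand-in)
bc = rng.normal(size=3)                 # bias b_c (stand-in)
p = classify(beta, Wc, bc)
print(round(float(p.sum()), 6))         # 1.0
```

The predicted class is simply the argmax of `p`.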
Further, assume that probability distribution p is the desired output and probability distribution q is the actual output.
Further, the loss function H(p, q) is:

H(p, q) = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{M} p(x_ij) · log q(x_ij)    (11)

where N denotes the number of samples in a batch and M denotes the number of classes; p(x_ij) denotes the expected output variable, which takes the value 1 if class j is the same as the class of sample i and 0 otherwise; q(x_ij) denotes the predicted probability that observed sample i belongs to class j, used here through its log value.
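This is the standard batch cross-entropy between the expected one-hot distribution p and the predicted distribution q. A minimal numpy sketch with illustrative values:

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -(1/N) * sum_i sum_j p_ij * log(q_ij) over a batch of
    N samples and M classes; eps guards against log(0)."""
    N = p_true.shape[0]
    return float(-np.sum(p_true * np.log(q_pred + eps)) / N)

# Two samples, two classes: expected (one-hot) vs predicted distributions.
p = np.array([[1.0, 0.0], [0.0, 1.0]])
q = np.array([[0.9, 0.1], [0.2, 0.8]])
print(round(cross_entropy(p, q), 4))  # 0.1643
```

Because p is one-hot, only the log-probability assigned to each sample's true class contributes to the loss.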
The trained short text classification model comprises a GRU-HAN network model, which fuses an improved HAN with a GRU network. As shown in fig. 2, word embedding of the text uses the BERT construction method; each generated word vector models the information between word vectors and sentence vectors in the input text sequence through the Word-Word Level Attention of the HAN; the implicit semantic vectors attended to by the HAN are fed back to the classifier network, and the feature classification information is output through Softmax.
The GRU-HAN network model includes a deep GRU network in which the number of sequence control units is variable, using the inputs of the GRU cells to implement an attention connection from the encoder network to the hierarchy; to improve parallelism and reduce training time, the attention mechanism of the present invention connects the bottom layer of the decoder to the top layer of the encoder. To accelerate model fitting, the coding level of a sentence is kept unchanged while a sentence sequence is processed.
In GRU-HAN, the semantic richness of the text is enhanced in two respects:
(1) BERT is used for word-embedding construction of the text, and a GRU (Gated Recurrent Unit) performs bidirectional semantic encoding. As shown by the Word Encoder in FIG. 2, two layers of encoder coding yield the latent semantics and the forward and backward time-series memory information of the text vectors, enhancing each word's own semantics while simultaneously encoding and fusing its context;
(2) Word-Word Level Attention at the HAN level perceives the higher-ranked, higher-weighted words among all words; the word vectors are spliced to obtain the sentence embedding vector, and Word-Sentence Level Attention then continues to focus on context awareness over the global text. Combining these two layers of attention effectively alleviates BERT's semantic ambiguity problem in a specific domain. The GRU has a strong ability to capture language sequence information, and the hierarchical attention mechanism tempers the time-series information it captures, so that freely expressed text admits more modes of semantic interpretation.
Traditional natural language processing tasks adopt character encodings with static semantic information for vector expression, such as the Word2Vec and One-Hot word-embedding schemes. These vector encoding methods do not consider context information: each word is mapped to a unique dense vector, so the polysemy problem cannot be solved. In an actual short text classification task, a single word often has multiple meanings, and traditional word vectors cannot represent the semantic features of a word in a short text well, so better text features need to be learned through a deep model.
The pretrained BERT model is obtained through self-supervised learning on a large corpus; the meanings of its word vectors fuse the textual features of that corpus, so it serves well as a word-embedding feature representation method for each text task.
The feature representation of the BERT model is divided into three levels: a character information vector (Token Embeddings), a segment information vector (Segment Embeddings), and a position information vector (Position Embeddings). As shown in FIG. 3, Token Embeddings represents each word after segmentation as a vector; Segment Embeddings labels the segments within a sentence and marks them with [CLS] and [SEP]; Position Embeddings adds positional timing information to each input unit. Finally, the representation information of the three levels of vectors is superposed to obtain the word vector represented by the BERT model.
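As a hedged illustration of how the three embedding levels combine, the following NumPy sketch sums token, segment and position vectors per input unit; the vocabulary size, dimensions and random lookup tables are hypothetical stand-ins for BERT's learned tables:

```python
import numpy as np

def bert_input_embedding(token_ids, segment_ids, tok_emb, seg_emb, pos_emb):
    """Superpose the three representation levels: Token + Segment + Position
    Embeddings, yielding one combined vector per input unit."""
    n = len(token_ids)
    return (tok_emb[token_ids]      # Token Embeddings: one row per (sub)word
            + seg_emb[segment_ids]  # Segment Embeddings: sentence A/B label
            + pos_emb[:n])          # Position Embeddings: absolute position

# Hypothetical lookup tables (BERT learns these during pretraining).
rng = np.random.default_rng(1)
vocab_size, dim, max_len = 100, 8, 32
tok_emb = rng.standard_normal((vocab_size, dim))
seg_emb = rng.standard_normal((2, dim))
pos_emb = rng.standard_normal((max_len, dim))

# Toy sequence: [CLS] w1 w2 [SEP], all in segment 0.
x = bert_input_embedding([0, 7, 9, 1], [0, 0, 0, 0], tok_emb, seg_emb, pos_emb)
```

Each row of `x` is the element-wise sum of the three lookups, matching the superposition described above.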
To learn the multiple feature expressions for each word in the short text, the word embedding layer performs a linear mapping of the word vectors after they are constructed using BERT:
A text sequence X = [x_1, x_2, …, x_n] of length n is input, and the generated word embedding matrix is B = [b_1, b_2, …, b_n]. A query matrix Q, a key matrix K and a value matrix V are established from the word embedding vectors to model the hidden semantic association relations among the words.
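The linear mapping into query, key and value matrices described above can be sketched as follows; the scaled dot-product normalization and the toy dimensions are illustrative assumptions rather than the patent's exact formulation:

```python
import numpy as np

def hidden_semantic_association(B, W_q, W_k, W_v):
    """Map the word embedding matrix B = [b_1, ..., b_n] to query, key and
    value matrices, then form a normalized word-to-word association matrix A
    and the association-weighted values."""
    Q, K, V = B @ W_q, B @ W_k, B @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])            # pairwise associations
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)              # softmax per row
    return A, A @ V

rng = np.random.default_rng(2)
n, d = 5, 8                                           # toy sequence length / dim
B = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
A, out = hidden_semantic_association(B, W_q, W_k, W_v)
```

Each row of A gives one word's association weights over the whole sequence and sums to 1.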
The GRU model was proposed to address the long-term memory and gradient problems in back-propagation. As shown in FIG. 4, the GRU obtains two gate states (Gate States) as gate units (Gate Units) from the hidden state h_{t-1} passed from the previous node and the input b_t of the current node; this input combination contains both the history information of the previous node and the information of the current node.
After receiving the input information, the update gate z_t converts the data into values in the range 0-1 through σ (the Sigmoid function), which serve as the gating signal:
z_t = σ(W_z · [h_{t-1}, b_t])  (1)
The reset gate r_t is obtained in the same way:
r_t = σ(W_r · [h_{t-1}, b_t])  (2)
After the gating signals are obtained, the reset gate is spliced with the input information, and the data are scaled to the range -1 to 1 through the tanh activation function to obtain the candidate state h̃_t:

h̃_t = tanh(W · [r_t ⊙ h_{t-1}, b_t])  (3)

Here h̃_t contains the various signal data, and the signals added to the data memorize the state at the current time t. The subsequent memory-updating stage performs the two processes of forgetting and memorizing simultaneously:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t  (4)

The update gate z_t ranges over 0-1; values closer to 1 mean more data are memorized, and conversely more data are forgotten. Because the two processes are carried out simultaneously, the GRU has fewer parameters than the LSTM and higher operation efficiency. In formula (4), (1 - z_t) ⊙ h_{t-1} indicates selective forgetting of the originally unimportant parts of the hidden state, and z_t ⊙ h̃_t indicates selective memorizing of the candidate state that contains the current node's information; forgetting (1 - z_t) and memorizing z_t are linked, so a certain memory compensation is applied to the forgetting weight in a constant state.
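Formulas (1)-(4) can be checked with a minimal NumPy sketch of one GRU step; the weight shapes and random inputs are hypothetical, and bias terms are omitted as in the formulas above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell_step(h_prev, b_t, W_z, W_r, W_h):
    """One GRU step following formulas (1)-(4): update gate z_t, reset gate
    r_t, candidate state h_tilde, then the joint forget/memorize update."""
    concat = np.concatenate([h_prev, b_t])          # [h_{t-1}, b_t]
    z_t = sigmoid(W_z @ concat)                     # formula (1): update gate
    r_t = sigmoid(W_r @ concat)                     # formula (2): reset gate
    concat_r = np.concatenate([r_t * h_prev, b_t])  # reset applied to history
    h_tilde = np.tanh(W_h @ concat_r)               # formula (3): in (-1, 1)
    return (1.0 - z_t) * h_prev + z_t * h_tilde     # formula (4)

rng = np.random.default_rng(0)
d_h, d_in = 4, 3                                    # toy hidden / input sizes
h_prev = rng.standard_normal(d_h)
b_t = rng.standard_normal(d_in)
W_z = rng.standard_normal((d_h, d_h + d_in))
W_r = rng.standard_normal((d_h, d_h + d_in))
W_h = rng.standard_normal((d_h, d_h + d_in))
h_t = gru_cell_step(h_prev, b_t, W_z, W_r, W_h)
```

Because h_t is a convex combination of h_{t-1} and a tanh-bounded candidate, each component stays within the larger of |h_{t-1}| and 1, which is one way to see the linked forget/memorize trade-off.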
After BERT and the GRU generate vectors containing sequence context memory information and rich word-meaning information, a limitation remains: because the word vectors generated by BERT carry massive corpus characteristics, their strong generalization ability prevents them from focusing on the semantic characteristics of a particular aspect within a specific domain.
Not all words contribute equally to the expression of a sentence's meaning. We therefore introduce an attention mechanism to extract the words that are important to the sentence meaning, feeding h_t into a single-layer perceptron (MLP) to obtain u_t as an implicit representation of h_t:
u_t = tanh(W_w h_t + b_w)  (5)
Meanwhile, in order to compare the importance of words in the text, the invention expresses the weight by the similarity between u_t and a randomly initialized context vector u_w, and then obtains the normalized attention weight α_t through a softmax operation, representing the weight of the t-th word in the text:

α_t = exp(u_t^T · u_w) / Σ_t exp(u_t^T · u_w)  (6)
With the Word-Word Level Attention weight matrix of the above formula, the semantic focus characteristics between words in the text sequence can be retained in the matrix; the α_t generated over the text are summarized by Concat to form the global sentence feature vector S of the text sequence:
S = Concat(α_1, …, α_t, …, α_n)  (7)
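A hedged NumPy sketch of Eqs. (5)-(7): each h_t is mapped to u_t, compared with a randomly initialized context vector u_w, normalized by softmax, and the resulting weights concatenated into S. All dimensions are illustrative:

```python
import numpy as np

def word_word_attention(H, W_w, b_w, u_w):
    """Eqs. (5)-(7): u_t = tanh(W_w h_t + b_w) for each word, similarity with
    the context vector u_w normalized by softmax into alpha_t, and the alpha_t
    concatenated into the global sentence feature vector S."""
    U = np.tanh(H @ W_w.T + b_w)       # Eq. (5): rows are the u_t
    scores = U @ u_w                   # similarity with context vector u_w
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                # normalized attention weights
    S = np.concatenate([alpha[t:t + 1] for t in range(len(alpha))])  # Eq. (7)
    return alpha, S

rng = np.random.default_rng(4)
n, d = 6, 8                            # toy word count and hidden size
H = rng.standard_normal((n, d))        # encoder outputs h_1..h_n
W_w = rng.standard_normal((d, d))
b_w = rng.standard_normal(d)
u_w = rng.standard_normal(d)           # randomly initialized context vector
alpha, S = word_word_attention(H, W_w, b_w, u_w)
```

Note that Eq. (7)'s concatenation of the scalar weights simply assembles them into the vector S, whose entries sum to 1.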
The attention association between words and sentences is established through the word vectors and sentence vectors, so that an attention feedback mechanism is built between each word and the whole sentence; this measures the importance of a word to the whole text and strengthens the semantic association characteristics between the text and its sentences. Similarly, the implicit representation u_t of a word is no longer compared for similarity with the randomly initialized context vector u_w, but is instead weighed against the sentence feature vector S to obtain the Word-Sentence Level Attention weight matrix (to keep the matrix operation valid, the u_t vector is expanded by one dimension with unsqueeze):
The attention matrices generated by each word and sentence are spliced and combined to form the joint word-sentence context feature vector β, which carries the semantic association between words and sentences; softmax then performs text classification on these features:
p = softmax(W_c β + b_c)  (10)
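Eq. (10) can be sketched directly; the four-category output and all dimensions below are hypothetical:

```python
import numpy as np

def classify_context(beta, W_c, b_c):
    """Eq. (10): p = softmax(W_c beta + b_c), mapping the joint context
    vector beta to a classification score vector p."""
    logits = W_c @ beta + b_c
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(3)
beta = rng.standard_normal(6)          # toy context vector
W_c = rng.standard_normal((4, 6))      # 4 hypothetical hazard categories
b_c = rng.standard_normal(4)
p = classify_context(beta, W_c, b_c)   # scores over the 4 categories
```

The predicted class is the argmax of p, and the scores form a probability distribution.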
Cross entropy is mainly used to judge how close the actual output is to the expected output. It characterizes the distance between the actual output (probability) and the expected output (probability): the smaller the cross entropy value, the closer the two probability distributions are.
Assuming that the probability distribution p is the expected output and the probability distribution q is the actual output, the cross entropy loss is calculated as follows:

H(p, q) = -(1/N) · Σ_{i=1}^{N} Σ_{j=1}^{M} p(x_ij) · log q(x_ij)  (11)
the method can be well adapted to multi-label classification tasks by adopting a cross entropy loss calculation method, in formula 11, N represents the number of samples of a batch, M represents the number of categories, and p (x)ij) Representing the desired variable output, takes the value 1 if the class is the same as the class of sample i, and 0 otherwise. q (x)ij) Representing the predicted probability that the observed sample i belongs to the class j.
The second embodiment provides a domain-adaptive chemical potential safety hazard short text classification system;
a domain-adapted chemical potential safety hazard short text classification system comprises:
an acquisition module configured to: acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
an extraction module configured to: vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
a classification module configured to: and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
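The three modules above can be sketched as a minimal pipeline, with trivial stand-ins (toy strings, a length feature, a threshold) in place of the real acquisition source, BERT extraction and trained model; all names and logic here are hypothetical illustrations of the module boundaries only:

```python
class AcquisitionModule:
    """Stand-in for acquiring short texts from hazard-investigation records."""
    def acquire(self):
        return ["valve left open near storage tank", "warning sign faded"]

class ExtractionModule:
    """Stand-in for BERT-based vector extraction; here a toy length feature."""
    def extract(self, texts):
        return [[float(len(t))] for t in texts]

class ClassificationModule:
    """Stand-in for the trained short text classification model."""
    def classify(self, vectors):
        return ["hazard" if v[0] > 25.0 else "minor" for v in vectors]

def run_pipeline():
    # acquire -> extract -> classify, mirroring the module order above
    texts = AcquisitionModule().acquire()
    vectors = ExtractionModule().extract(texts)
    return ClassificationModule().classify(vectors)

results = run_pipeline()
```

The real system would replace each stub body while keeping the same three-stage interface.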
It should be noted here that the above acquisition module, extraction module and classification module correspond to steps S101 to S103 in the first embodiment; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the first embodiment. It should also be noted that the above modules, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
Example Three
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example Four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A domain-adaptive chemical potential safety hazard short text classification method is characterized by comprising the following steps:
acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
2. The method as claimed in claim 1, wherein the step of inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model and outputting the short text classification result specifically comprises:
the trained short text classification model encodes each initial text vector to obtain text vectors considering the association of the front time sequence and the rear time sequence;
the trained short text classification model gives a weight between words to each text vector considering the front and rear time sequence association, to obtain a text vector after the first weighting;
the trained short text classification model splices the text vectors after the first weighting to obtain sentence embedding vectors;
based on the sentence embedding vector, the trained short text classification model is utilized to give each text vector considering the front and rear time sequence association a weight between words and sentences, obtaining a text vector after the second weighting;
the trained short text classification model splices the text vectors after the second weighting to obtain vectors to be classified;
and the trained short text classification model classifies the vectors to be classified to obtain the classification result of each short text to be classified.
3. The method as claimed in claim 1 or 2, wherein the network structure of the short text classification model comprises:
the word embedding structure layer, the word encoder, the word and word attention mechanism layer, the first splicing unit, the word and sentence attention mechanism layer, the second splicing unit and the classification layer are sequentially connected.
4. The method as claimed in claim 3, wherein the word encoder is structured as follows:

First, the Word Encoder of the original HAN model is assumed to comprise forward coding units connected in sequence from left to right, and backward coding units connected in sequence from right to left;

wherein the first input end of the p-th forward coding unit is used for inputting the output value of the preceding forward coding unit, and its second input end is used for inputting the p-th output value of the word embedding construction layer BERT;

the first output end of the p-th forward coding unit is connected with the input end of the p-th Concat unit; the output end of the p-th Concat unit outputs a text vector considering the front and rear time sequence association; the second output end of the p-th forward coding unit is connected with the input end of the next forward coding unit;

wherein the first input end of the q-th backward coding unit is used for inputting the output value of the succeeding backward coding unit, and its second input end is used for inputting the q-th output value of the word embedding construction layer BERT;

the first output end of the q-th backward coding unit is connected with the input end of the q-th Concat unit; the output end of the q-th Concat unit outputs a text vector considering the front and rear time sequence association; the second output end of the q-th backward coding unit is connected with the input end of the preceding backward coding unit.
5. The method as claimed in claim 2, wherein the trained short text classification model encodes each initial text vector to obtain text vectors considering the front and rear time sequence association; the method specifically comprises the following steps:
inputting the initial text vectors corresponding to all the texts to be classified into a word encoder of a short text classification model after training, and encoding each initial text vector by the word encoder of the short text classification model after training to obtain text vectors considering front and rear time sequence association;
Alternatively,
the trained short text classification model gives a weight between words to each text vector considering the front and rear time sequence association, to obtain a text vector after the first weighting; the method specifically comprises the following steps:
the word attention mechanism layer of the trained short text classification model gives weights between words to each text vector considering the front and rear time sequence association, to obtain the text vector after the first weighting.
6. The method for classifying the short texts of the chemical safety hazards in a domain adaptation manner as claimed in claim 2, wherein the trained short text classification model is used for splicing the text vectors after the first weighting to obtain sentence embedding vectors; the method specifically comprises the following steps:
the trained short text classification model is used for serially splicing the text vectors weighted for the first time to obtain sentence embedding vectors;
Alternatively,
based on the sentence embedding vector, the trained short text classification model is utilized to give each text vector considering the front and rear time sequence association a weight between words and sentences, obtaining a text vector after the second weighting; the method specifically comprises the following steps:
based on the sentence embedding vector, the word-and-sentence attention mechanism layer of the trained short text classification model gives each text vector considering the front and rear time sequence association a weight between words and sentences, obtaining the text vector after the second weighting.
7. The method for classifying the short texts of the chemical safety hazards in a domain adaptation manner as claimed in claim 2, wherein the trained short text classification model is used for splicing the text vectors after the second weighting to obtain the vectors to be classified; the method specifically comprises the following steps:
the trained short text classification model is used for serially splicing the text vectors weighted for the second time to obtain vectors to be classified;
Alternatively,
the trained short text classification model classifies vectors to be classified to obtain a classification result of each short text to be classified; the method specifically comprises the following steps:
and the classification layer of the trained short text classification model classifies the text vectors weighted for the second time to obtain the classification result of each short text to be classified.
8. A domain-adapted chemical potential safety hazard short text classification system, characterized by comprising:
an acquisition module configured to: acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
an extraction module configured to: vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
a classification module configured to: and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110511224.9A CN113139057A (en) | 2021-05-11 | 2021-05-11 | Domain-adaptive chemical potential safety hazard short text classification method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110511224.9A CN113139057A (en) | 2021-05-11 | 2021-05-11 | Domain-adaptive chemical potential safety hazard short text classification method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113139057A true CN113139057A (en) | 2021-07-20 |
Family
ID=76818004
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110511224.9A Pending CN113139057A (en) | 2021-05-11 | 2021-05-11 | Domain-adaptive chemical potential safety hazard short text classification method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113139057A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113298326A (en) * | 2021-07-27 | 2021-08-24 | 成都西辰软件有限公司 | Intelligent electronic event supervision method, equipment and storage medium |
CN113688239A (en) * | 2021-08-20 | 2021-11-23 | 平安国际智慧城市科技股份有限公司 | Text classification method and device under few samples, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829818A (en) * | 2018-06-12 | 2018-11-16 | 中国科学院计算技术研究所 | A kind of file classification method |
CN110489545A (en) * | 2019-07-09 | 2019-11-22 | 平安科技(深圳)有限公司 | File classification method and device, storage medium, computer equipment |
CN111225277A (en) * | 2018-11-27 | 2020-06-02 | 北京达佳互联信息技术有限公司 | Transcoding method, transcoding device and computer readable storage medium |
CN112417098A (en) * | 2020-11-20 | 2021-02-26 | 南京邮电大学 | Short text emotion classification method based on CNN-BiMGU model |
2021-05-11: CN202110511224.9A patent/CN113139057A/en, active, Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829818A (en) * | 2018-06-12 | 2018-11-16 | 中国科学院计算技术研究所 | A kind of file classification method |
CN111225277A (en) * | 2018-11-27 | 2020-06-02 | 北京达佳互联信息技术有限公司 | Transcoding method, transcoding device and computer readable storage medium |
CN110489545A (en) * | 2019-07-09 | 2019-11-22 | 平安科技(深圳)有限公司 | File classification method and device, storage medium, computer equipment |
CN112417098A (en) * | 2020-11-20 | 2021-02-26 | 南京邮电大学 | Short text emotion classification method based on CNN-BiMGU model |
Non-Patent Citations (1)
Title |
---|
GE YAN ET AL.: "基于 BLSTM-Attention 神经网络模型的..." (Based on the BLSTM-Attention Neural Network Model..., title truncated), 《计算机系统应用》 (Computer Systems & Applications) *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113298326A (en) * | 2021-07-27 | 2021-08-24 | 成都西辰软件有限公司 | Intelligent electronic event supervision method, equipment and storage medium |
CN113298326B (en) * | 2021-07-27 | 2021-10-26 | 成都西辰软件有限公司 | Intelligent electronic event supervision method, equipment and storage medium |
CN113688239A (en) * | 2021-08-20 | 2021-11-23 | 平安国际智慧城市科技股份有限公司 | Text classification method and device under few samples, electronic equipment and storage medium |
CN113688239B (en) * | 2021-08-20 | 2024-04-16 | 平安国际智慧城市科技股份有限公司 | Text classification method and device under small sample, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109902293B (en) | Text classification method based on local and global mutual attention mechanism | |
CN109284506B (en) | User comment emotion analysis system and method based on attention convolution neural network | |
CN110334354B (en) | Chinese relation extraction method | |
CN112528672B (en) | Aspect-level emotion analysis method and device based on graph convolution neural network | |
CN107506414B (en) | Code recommendation method based on long-term and short-term memory network | |
Xiao et al. | Research progress of RNN language model | |
CN109214006B (en) | Natural language reasoning method for image enhanced hierarchical semantic representation | |
CN116415654A (en) | Data processing method and related equipment | |
Mukherjee et al. | Utilization of oversampling for multiclass sentiment analysis on amazon review dataset | |
CN113591483A (en) | Document-level event argument extraction method based on sequence labeling | |
CN113139057A (en) | Domain-adaptive chemical potential safety hazard short text classification method and system | |
WO2023004528A1 (en) | Distributed system-based parallel named entity recognition method and apparatus | |
CN113987187A (en) | Multi-label embedding-based public opinion text classification method, system, terminal and medium | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
Liang et al. | A double channel CNN-LSTM model for text classification | |
Yan et al. | Implicit emotional tendency recognition based on disconnected recurrent neural networks | |
CN108875024B (en) | Text classification method and system, readable storage medium and electronic equipment | |
CN112560440B (en) | Syntax dependency method for aspect-level emotion analysis based on deep learning | |
CN113806543A (en) | Residual jump connection-based text classification method for gated cyclic unit | |
CN113377844A (en) | Dialogue type data fuzzy retrieval method and device facing large relational database | |
CN116644760A (en) | Dialogue text emotion analysis method based on Bert model and double-channel model | |
WO2023159759A1 (en) | Model training method and apparatus, emotion message generation method and apparatus, device and medium | |
CN115964497A (en) | Event extraction method integrating attention mechanism and convolutional neural network | |
CN113779244B (en) | Document emotion classification method and device, storage medium and electronic equipment | |
Gupta et al. | Detailed study of deep learning models for natural language processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |