CN113139057A - Domain-adaptive chemical potential safety hazard short text classification method and system - Google Patents

Domain-adaptive chemical potential safety hazard short text classification method and system

Info

Publication number
CN113139057A
CN113139057A
Authority
CN
China
Prior art keywords
text
short text
vector
classified
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110511224.9A
Other languages
Chinese (zh)
Inventor
杜军威
朱孟帅
李浩杰
胡强
于旭
江峰
陈卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Science and Technology filed Critical Qingdao University of Science and Technology
Priority to CN202110511224.9A priority Critical patent/CN113139057A/en
Publication of CN113139057A publication Critical patent/CN113139057A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention discloses a domain-adaptive method and system for classifying chemical potential safety hazard short texts. A plurality of short texts to be classified are acquired from the field of chemical potential safety hazard investigation; vector extraction is performed on each short text to be classified to obtain its initial text vector; the initial text vectors of all the short texts to be classified are input into a trained short text classification model, which outputs the classification results. A GRU combined with a hierarchical attention network (GRU + HAN) learns domain-specific representations of the short texts that fuse character, word and sentence information at different levels, which alleviates the domain-information bias of short texts drawn from a general corpus and yields better classification performance on the chemical potential safety hazard investigation task.

Description

Domain-adaptive chemical potential safety hazard short text classification method and system
Technical Field
The invention relates to the technical field of short text classification, and in particular to a domain-adaptive method and system for classifying chemical potential safety hazard short texts.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the rapid development of deep learning, many researchers have tried to solve text classification with deep models, and many novel and effective classification methods based on CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks) in particular have appeared. These methods handle problems such as internet news classification and sentiment analysis well, but when applied in specific domains the text characteristics differ: professional terminology, abbreviations and non-standard expressions are common, and the practical effect is only moderate.
In particular, the potential safety hazard texts summarized during chemical safety hazard investigation, namely the inspection reports provided by workers, often contain large numbers of technical terms, numbers, mixed Chinese-English professional nouns and irregular expressions, and their sentence lengths vary widely. Mainstream text classification models have difficulty capturing accurate classification features from text that lacks context, so the hazards are often classified inaccurately. Strengthening the domain semantic information in the short text is therefore the key to classifying potential safety hazard texts in the chemical domain effectively, and it is important for the safety-management early warning and hazard investigation of chemical enterprises.
Short text classification is an important research direction of natural language processing. Its difficulty lies mainly in the brevity of the sentences: each word can carry rich meaning and is closely tied to the semantics of the sentence. Among the many text classification tasks in natural language processing, traditional methods such as the naive Bayes classifier assume that the attributes are mutually independent and ignore the contextual associations of the text, so they support the semantic features of short texts poorly.
Kim Y. applied CNNs to text classification, using multiple convolution kernels of different sizes to capture locally correlated features; however, the window sizes also mean that the length of context dependence a CNN can extract is relatively fixed.
The RNN proposed by Lai S. et al. can splice the contextual word information in a sentence with each word's embedding vector, which relieves the CNN's inability to change the window size dynamically to fit different context lengths, but it brings the problems of vanishing and exploding gradients during training.
The LSTM (Long Short-Term Memory) network used by Nguyen et al. is a further extension of the RNN that addresses the long-term dependence problem by adding the cell state, and it performs better when training on long-sequence text.
The GRU (Gated Recurrent Unit) proposed by Cho et al. improves on the LSTM by merging the cell state and the hidden state, which greatly improves the model's computational efficiency.
In the process of implementing the invention, the inventor finds that the following technical problems exist in the prior art:
short texts vary greatly in length, lack context, have sparse text features, and the semantics of their words depend strongly on the domain; general-purpose short text classification techniques have difficulty capturing the domain-related features of such texts, so their classification accuracy is low.
Disclosure of Invention
In order to solve the defects of the prior art, the invention provides a domain-adaptive chemical potential safety hazard short text classification method and system;
in a first aspect, the invention provides a domain-adaptive chemical potential safety hazard short text classification method;
a domain-adaptive chemical potential safety hazard short text classification method comprises the following steps:
acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
In a second aspect, the invention provides a domain-adapted chemical potential safety hazard short text classification system;
a domain-adapted chemical potential safety hazard short text classification system comprises:
an acquisition module configured to: acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
an extraction module configured to: vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
a classification module configured to: and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
In a third aspect, the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
A GRU (Gated Recurrent Unit) short text classification model fused with a Hierarchical Attention Network (HAN) is provided. BERT (Bidirectional Encoder Representations from Transformers) generates word-vector representations carrying the general knowledge of the short texts, strengthening the general feature representation of short-text words and sequences; GRU + HAN then learns domain-specific representations of the short texts that fuse character, word and sentence information at different levels, which alleviates the domain-information bias of general-corpus short texts and yields better performance on the chemical potential safety hazard investigation classification task.
The invention transforms the sentence-level attention mechanism, turning the attention between sentences into an implicit attention representation, which preserves the semantic features captured by hierarchical attention while aggregating the more divergent text feature representations produced by BERT. For long texts, semantic information can be extracted from the sentence context and classification works well; chemical potential safety hazard texts, however, contain long texts as well as a large number of short sequences, and automatically capturing semantic features at different levels with a single model is the core problem in classifying chemical potential safety hazards.
The invention provides a domain-adaptive short text classification model that effectively solves these problems. GRU-HAN uses BERT as the word-embedding method for the text, effectively exploiting BERT's pre-training on massive Chinese text data sets to obtain an embedded representation of the general knowledge of the short text. Potential safety hazard texts from the chemical field and their labelled categories are used as the training and test data sets, and the method outperforms mainstream text classification methods.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2 is a GRU-HAN overall network model architecture of the first embodiment;
FIG. 3 is three levels of the BERT model of the first embodiment;
FIG. 4 is a diagram of a GRU model structure of the first embodiment;
fig. 5 is a diagram illustrating a connection relationship between a GRU and an HAN according to the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment provides a domain-adaptive chemical potential safety hazard short text classification method;
as shown in fig. 1, a domain-adaptive chemical potential safety hazard short text classification method includes:
s101: acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
s102: vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
s103: and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
Further, the S102: vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified; the method specifically comprises the following steps:
based on the BERT model, vector extraction is carried out on each short text to be classified to obtain the initial text vector corresponding to each short text to be classified.
Further, inputting the initial text vectors corresponding to all the short texts to be classified into the trained short text classification model and outputting the short text classification results specifically includes:
s1031: the trained short text classification model encodes each initial text vector to obtain text vectors that consider the forward and backward time-sequence association;
s1032: the trained short text classification model assigns weights between words to each of these text vectors to obtain the text vectors after the first weighting;
s1033: the trained short text classification model splices the text vectors after the first weighting to obtain a sentence embedding vector;
s1034: based on the sentence embedding vector, the trained short text classification model assigns weights between words and the sentence to each of the text vectors to obtain the text vectors after the second weighting;
s1035: the trained short text classification model splices the text vectors after the second weighting to obtain the vector to be classified;
s1036: the trained short text classification model classifies the vector to be classified to obtain the classification result of each short text to be classified.
Further, the network structure of the short text classification model comprises, connected in sequence:
a word-embedding construction layer (BERT), a Word Encoder, a word-word attention layer, a first splicing unit, a word-sentence attention layer, a second splicing unit and a Softmax classification layer.
The word-embedding construction layer BERT works as follows: vector extraction is carried out on each short text to be classified to obtain the token embeddings (Token Embeddings), segment embeddings (Segment Embeddings) and position embeddings (Position Embeddings) corresponding to each short text to be classified;
the token embeddings, segment embeddings and position embeddings are fused into the initial text vector.
The token embedding represents the text feature vector, the segment embedding represents the contextual feature vector of the text, and the position embedding represents the position of each token in the text.
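As an illustration of this embedding step, the following is a minimal sketch of extracting token-level vectors for one short text with a pre-trained Chinese BERT model. The Hugging Face transformers package, the bert-base-chinese checkpoint and the example sentence are assumptions made for illustration and are not specified by the patent.

```python
# Sketch: obtain initial text vectors for a short text with pre-trained BERT.
# The transformers package and the bert-base-chinese checkpoint are assumed
# here for illustration only; token, segment and position embeddings are
# summed inside the model to form the per-token vectors.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

text = "反应釜压力表未定期校验"  # hypothetical hazard-report short text

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = bert(**inputs)

initial_text_vectors = outputs.last_hidden_state  # shape: (1, seq_len, 768)
print(initial_text_vectors.shape)
```

The per-token vectors produced this way are what the Word Encoder described next would consume.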
The Word Encoder is structurally characterized in that a GRU unit is added on the basis of the Word Encoder of the HAN model.
As shown in fig. 5, the structure of the Word Encoder is as follows (in the notation below, →h denotes the left-to-right coding units and ←h the right-to-left coding units):
firstly, assume that the Word Encoder of the original HAN model comprises:
coding units connected in sequence from left to right, →h_1, →h_2, →h_3, …, →h_p, …, →h_{n-1}, →h_n, and
coding units connected in sequence from right to left, ←h_n, ←h_{n-1}, ←h_{n-2}, …, ←h_q, …, ←h_2, ←h_1.
The first input end of the coding unit →h_p is used for inputting the output value of the coding unit →h_{p-1}; the second input end of →h_p is used for inputting the p-th output value of the word-embedding construction layer BERT. The first output end of →h_p is connected with the input end of the p-th Concat unit; the output end of the p-th Concat unit outputs a text vector that considers the forward and backward time-sequence association; the second output end of →h_p is connected with the input end of →h_{p+1}.
Likewise, the first input end of the coding unit ←h_q is used for inputting the output value of the coding unit ←h_{q+1}; the second input end of ←h_q is used for inputting the q-th output value of the word-embedding construction layer BERT. The first output end of ←h_q is connected with the input end of the q-th Concat unit; the output end of the q-th Concat unit outputs a text vector that considers the forward and backward time-sequence association; the second output end of ←h_q is connected with the input end of ←h_{q-1}.
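Read this way, the structure amounts to a bidirectional GRU whose forward and backward hidden states are concatenated per token. A minimal PyTorch sketch under that reading follows; the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Bidirectional GRU over BERT token vectors; the forward and backward
    hidden states are concatenated per token (the Concat units above)."""
    def __init__(self, embed_dim=768, hidden_dim=128):
        super().__init__()
        self.bigru = nn.GRU(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, token_vectors):      # (batch, seq_len, embed_dim)
        h, _ = self.bigru(token_vectors)   # (batch, seq_len, 2 * hidden_dim)
        return h                           # context-aware text vectors

encoder = WordEncoder()
dummy_bert_output = torch.randn(2, 20, 768)
print(encoder(dummy_bert_output).shape)    # torch.Size([2, 20, 256])
```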
The word-word attention layer works as follows: weights between words are assigned to each text vector that considers the forward and backward time-sequence association, giving the text vectors after the first weighting.
The first splicing unit performs serial concatenation.
The word-sentence attention layer works as follows: a weight between word and sentence is assigned to each text vector that considers the forward and backward time-sequence association, giving the text vectors after the second weighting.
The second splicing unit performs serial concatenation.
Further, training the short text classification model comprises:
constructing a training set, namely a plurality of short texts from the chemical potential safety hazard investigation field whose classification labels are known;
inputting the training set into the short text classification model and training it, stopping when the loss function reaches its minimum or the set number of iterations is reached, thereby obtaining the trained short text classification model.
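The training procedure above can be sketched as follows; the optimizer, learning rate, epoch count and the assumption that BERT vectors are pre-extracted are illustrative choices, not requirements stated in the patent.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=2e-5, device="cpu"):
    """Sketch of the training loop: cross-entropy loss, a fixed number of
    epochs as the stopping criterion (hyper-parameters are assumed)."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for vectors, labels in train_loader:   # pre-extracted BERT vectors
            vectors, labels = vectors.to(device), labels.to(device)
            logits = model(vectors)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"epoch {epoch + 1}: loss {running_loss / len(train_loader):.4f}")
```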
Further, S1031, in which the trained short text classification model encodes each initial text vector to obtain text vectors that consider the forward and backward time-sequence association, specifically comprises:
inputting the initial text vectors corresponding to all the short texts to be classified into the Word Encoder of the trained short text classification model; the Word Encoder encodes each initial text vector to obtain the text vectors that consider the forward and backward time-sequence association.
Further, S1032, in which the trained short text classification model assigns weights between words to each of the text vectors to obtain the text vectors after the first weighting, specifically comprises:
the word-word attention layer of the trained short text classification model assigns weights between words to each of the text vectors to obtain the text vectors after the first weighting.
Further, the specific working principle of S1032 comprises:

α_t = exp(u_t^T u_w) / Σ_t exp(u_t^T u_w)    (6)

where exp is the exponential function with the natural constant e as its base, and its argument is u_t^T u_w; u denotes the word weight matrix and u_t the weight representation of the t-th word, of which the transpose u_t^T is taken; u_w is a randomly initialized context vector; the result is the normalized weighted text vector matrix α_t.
Further, the S1033: the trained short text classification model splices the text vectors after the first weighting to obtain sentence embedding vectors; the method specifically comprises the following steps:
and the trained short text classification model is used for serially splicing the text vectors after the first weighting to obtain sentence embedding vectors.
Further, the working principle of S1033 comprises:

S = Concat(α_1, …, α_t, …, α_n)    (7)

wherein the Concat function is used for splicing vectors: the weight matrices α_1 to α_n obtained in the previous step are concatenated to synthesize the sentence vector S.
Further, S1034, in which, based on the sentence embedding vector, the trained short text classification model assigns weights between words and the sentence to each of the text vectors to obtain the text vectors after the second weighting, specifically comprises:
based on the sentence embedding vector, the word-sentence attention layer of the trained short text classification model assigns a weight between word and sentence to each of the text vectors to obtain the text vectors after the second weighting.
Further, the working principle of S1034 comprises:

β_t = exp(u_t^T S) / Σ_t exp(u_t^T S)    (8)

where u_t represents the weight representation of the t-th word; its transposed implicit representation u_t^T is no longer correlated with u_w, but is instead multiplied with the spliced sentence vector S and normalized through the exp function to obtain β_t, the text vector of the t-th word after the second weighting.
Further, the S1035: the trained short text classification model splices the text vectors after the second weighting to obtain vectors to be classified; the method specifically comprises the following steps:
and the trained short text classification model is used for serially splicing the text vectors after the second weighting to obtain the vectors to be classified.
Further, in S1035 the trained short text classification model splices the text vectors after the second weighting to obtain the vector to be classified; the working principle comprises:

β = Σ_i β_i h_i    (9)

where h_i represents the hidden implicit vector of the i-th word; the context feature vector β fuses the word-sentence semantic association weights β_i of all the words in the short text, and each accumulated semantic feature also carries the implicit vector h_i generated by the single-layer perceptron.
Further, the S1036: the trained short text classification model classifies vectors to be classified to obtain a classification result of each short text to be classified; the method specifically comprises the following steps:
and the Softmax classification layer of the trained short text classification model classifies the text vectors weighted for the second time to obtain the classification result of each short text to be classified.
Further, the S1036: the trained short text classification model classifies vectors to be classified to obtain a classification result of each short text to be classified; the working principle comprises the following steps:
p = softmax(W_c β + b_c)    (10)

where W_c and b_c are the weight matrix and bias of the classification layer; the context vector β generated from the text is the input, and the softmax function maps it to the final classification score matrix p.
Further, assume that the probability distribution p is the desired output and the probability distribution q is the actual output.
Further, the loss function H(p, q) is:

H(p, q) = -(1/N) Σ_i Σ_j p(x_ij) log q(x_ij)    (11)

where N denotes the number of samples in a batch (i = 1, …, N) and M denotes the number of classes (j = 1, …, M); p(x_ij) represents the desired output and takes the value 1 if class j is the same as the class of sample i and 0 otherwise; q(x_ij) represents the predicted probability that the observed sample i belongs to class j, of which the logarithm is used here.
The trained short text classification model is a GRU-HAN network model that fuses an improved HAN with a GRU network. As shown in fig. 2, the word embedding of the text is constructed with the BERT model; each generated word vector models the information between the word vectors and the sentence vector of the input text sequence through the word-word level attention and word-sentence level attention of the HAN, and the implicit semantic vectors attended to by the HAN are fed back to the classifier network, which outputs the feature classification information through Softmax.
The GRU-HAN network model includes a deep GRU network whose number of sequential units is variable, and the inputs of the GRU cells implement the attention connection from the encoder network to the hierarchy. To improve parallelism and reduce training time, the attention mechanism of the invention connects the bottom layer of the decoder to the top layer of the encoder. To accelerate model fitting, the encoding level of a sentence is kept unchanged while a sentence sequence is processed.
In GRU-HAN, the semantic richness of the text is enhanced in two respects:
(1) BERT constructs the word embeddings of the text and the GRU performs bidirectional semantic encoding; as shown by the Word Encoder in fig. 2, two layers of encoding yield the latent semantics of the text vectors and the memory of forward and backward time-sequence associations, so that the context is encoded and fused while each token's own semantics are strengthened;
(2) word-word level attention in the HAN perceives the higher-ranked, higher-weighted words among all words; the word vectors are spliced into a sentence embedding vector, and attention then continues to focus on word-sentence level attention and the context of the global text. Combining these two levels of attention effectively relieves BERT's semantic ambiguity in a specific domain. The GRU has a strong ability to capture sequence information, and the hierarchical attention mechanism can fade the time-sequence information it captures, so that freely expressed text admits more semantic interpretations.
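Putting the two levels together, one possible reading of the GRU-HAN forward pass is sketched below. The dimensions, the PyTorch framing, and the pooling used to form the sentence vector (a weighted sum rather than the splicing of equation (7)) are assumptions made for illustration, not a literal transcription of the patent's equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUHAN(nn.Module):
    """Sketch of a GRU-HAN style classifier over BERT token vectors.
    The sentence vector is an attention-weighted sum of hidden states,
    a simplification of the splicing described in equation (7)."""
    def __init__(self, embed_dim=768, hidden_dim=128, num_classes=10):
        super().__init__()
        self.bigru = nn.GRU(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        d = 2 * hidden_dim
        self.word_mlp = nn.Linear(d, d)           # single-layer perceptron, eq. (5)
        self.u_w = nn.Parameter(torch.randn(d))   # randomly initialized context vector
        self.fc = nn.Linear(d, num_classes)       # classification layer, eq. (10)

    def forward(self, token_vectors):              # (B, T, embed_dim)
        h, _ = self.bigru(token_vectors)            # (B, T, d) context-aware vectors
        u = torch.tanh(self.word_mlp(h))            # implicit word representation, eq. (5)
        alpha = F.softmax(u @ self.u_w, dim=1)      # word-word attention, eq. (6)
        s = (alpha.unsqueeze(-1) * h).sum(dim=1)    # sentence vector (assumed pooling)
        beta = F.softmax((u * s.unsqueeze(1)).sum(-1), dim=1)  # word-sentence attention, eq. (8)
        doc = (beta.unsqueeze(-1) * h).sum(dim=1)   # fused context vector, eq. (9)
        return self.fc(doc)                         # class logits for softmax

model = GRUHAN(num_classes=5)
logits = model(torch.randn(2, 20, 768))             # e.g. BERT token vectors
print(logits.shape)                                 # torch.Size([2, 5])
```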
Traditional natural language processing tasks use static semantic encodings for vector representation, such as the Word2Vec and One-Hot word-embedding schemes. These encodings ignore context information: each word is mapped to a single dense vector, so the problem of polysemy cannot be solved. In a practical short text classification task a single word often has several meanings, and a traditional word vector cannot represent its semantics within the short text well, so better text features must be learned by a deep model.
BERT's pre-trained model is obtained by self-supervised training on a large corpus; the meanings of its word vectors fuse the textual characteristics of that corpus, and it serves well as the word-embedding feature representation for downstream text tasks.
The feature representation of the BERT model has three levels: token embeddings (Token Embeddings), segment embeddings (Segment Embeddings) and position embeddings (Position Embeddings). As shown in fig. 3, the token embeddings represent each token after segmentation as a vector; the segment embeddings label the segments of the sentence, marked with [CLS] and [SEP]; and the position embeddings add position and ordering information to each input unit. The representations of the three levels are finally superposed to obtain the word vector represented by the BERT model.
To learn multiple feature expressions for each word in the short text, the word-embedding layer performs a linear mapping of the word vectors after they are constructed with BERT:
for an input text sequence X = [x_1, x_2, …, x_n] of length n, the generated word-embedding matrix is B = [b_1, b_2, …, b_n]; a query matrix Q, a key matrix K and a value matrix V are established with the other word-embedding vectors to describe the hidden semantic association relations.
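The linear mapping mentioned here can be sketched as three learned projections of the embedding matrix; the dimensions and the use of separate nn.Linear layers are assumptions for illustration only.

```python
import torch
import torch.nn as nn

embed_dim = 768
n, d_k = 20, 64                        # sequence length and projection size (assumed)

W_q = nn.Linear(embed_dim, d_k)        # query projection
W_k = nn.Linear(embed_dim, d_k)        # key projection
W_v = nn.Linear(embed_dim, d_k)        # value projection

B = torch.randn(1, n, embed_dim)       # word-embedding matrix B = [b_1, ..., b_n]
Q, K, V = W_q(B), W_k(B), W_v(B)
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # pairwise association strengths
print(scores.shape)                    # torch.Size([1, 20, 20])
```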
The GRU model was proposed to address long-term memory and the gradient problems in back-propagation. As shown in fig. 4, the GRU combines the hidden state h_{t-1} passed from the previous node with the input b_t of the current node to obtain two gate states that act as gate units; such a combined input contains both the history information of the previous node and the information of the current node.
After receiving the input information, the update gate z_t converts the data into a value in the range 0 to 1 through σ (the Sigmoid function), which serves as the gating signal:

z_t = σ(W_z · [h_{t-1}, b_t])    (1)

The reset gate r_t is obtained likewise:

r_t = σ(W_r · [h_{t-1}, b_t])    (2)

After the gating signals are obtained, the reset gate is spliced with the input information, and the tanh activation function scales the data to the range -1 to 1, giving the candidate state h̃_t:

h̃_t = tanh(W · [r_t ⊙ h_{t-1}, b_t])    (3)

Here h̃_t contains the input signal data and memorizes the state at the current time t. The following memory-update stage performs forgetting and memorizing at the same time:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t    (4)

The update gate z_t ranges from 0 to 1; the closer its value is to 1, the more data is memorized, and conversely the more is forgotten. Because the two processes are carried out simultaneously, the GRU has fewer parameters than the LSTM and runs more efficiently. In equation (4), (1 - z_t) ⊙ h_{t-1} selectively forgets the originally unimportant hidden state, while z_t ⊙ h̃_t selectively memorizes the information of the current node; forgetting (1 - z_t) and memorizing z_t are linked, so the forgotten weight is constantly compensated by memory.
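Equations (1)-(4) can be transcribed directly as a single GRU step, as in the sketch below. The dimensions, the concatenation order and the omission of bias terms are assumptions for illustration; in practice torch.nn.GRU implements an equivalent update.

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """One GRU step following equations (1)-(4)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W_z = nn.Linear(hidden_dim + input_dim, hidden_dim, bias=False)
        self.W_r = nn.Linear(hidden_dim + input_dim, hidden_dim, bias=False)
        self.W_h = nn.Linear(hidden_dim + input_dim, hidden_dim, bias=False)

    def forward(self, b_t, h_prev):
        x = torch.cat([h_prev, b_t], dim=-1)
        z_t = torch.sigmoid(self.W_z(x))                   # update gate, eq. (1)
        r_t = torch.sigmoid(self.W_r(x))                   # reset gate, eq. (2)
        h_cand = torch.tanh(                               # candidate state, eq. (3)
            self.W_h(torch.cat([r_t * h_prev, b_t], dim=-1)))
        return (1 - z_t) * h_prev + z_t * h_cand           # forget/memorize, eq. (4)

cell = GRUCellSketch(input_dim=768, hidden_dim=128)
h_t = cell(torch.randn(2, 768), torch.zeros(2, 128))
print(h_t.shape)                                           # torch.Size([2, 128])
```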
After BERT and the GRU have generated vectors containing the sequence-context memory information and rich word-meaning information, the word vectors produced by BERT still carry the characteristics of a massive general corpus; this strong generalization ability means that BERT does not focus on the semantic features of any particular aspect of a specific domain.
Not all words contribute equally to the expression of a sentence's meaning. An attention mechanism is therefore introduced to extract the words that are important to the sentence meaning: h_t is fed into a single-layer perceptron (MLP) to obtain u_t as the implicit representation of h_t:

u_t = tanh(W_w h_t + b_w)    (5)

Meanwhile, in order to compare the importance of words within the text, the invention uses the similarity between u_t and a randomly initialized context vector u_w as the weight, and then obtains the normalized attention weight matrix α_t, representing the weight of the t-th word in the text, through the softmax operation:

α_t = exp(u_t^T u_w) / Σ_t exp(u_t^T u_w)    (6)
The word-word level attention weights obtained from equation (6) retain the semantic focus between the words of the text sequence; the α_t generated over the text are summarized by Concat to form the global sentence feature vector S of the text sequence:

S = Concat(α_1, …, α_t, …, α_n)    (7)
The attention association between words and the sentence is established through the word vectors and the sentence vector, building an attention feedback mechanism between each word and the whole sentence, measuring the importance of a word to the whole text and strengthening the semantic association features between the text and the sentence. Similarly, the implicit representation u_t of a word is no longer compared for similarity with the randomly initialized context vector u_w, but is weighed against the sentence feature vector S to obtain the word-sentence level attention weight matrix (to keep the matrix operation consistent, the u_t vector is expanded by an unsqueeze dimension):

β_t = exp(u_t^T S) / Σ_t exp(u_t^T S)    (8)

The attention matrices generated by each word and the sentence are spliced and combined; the context feature vector β fuses the word-sentence semantic associations, and softmax then performs text classification on the features:

β = Σ_i β_i h_i    (9)

p = softmax(W_c β + b_c)    (10)
Cross entropy is mainly used to judge how close the actual output is to the desired output: the smaller the distance between the actual output (probability) and the desired output (probability), i.e. the smaller the cross-entropy value, the closer the two probability distributions are.
Assuming that the probability distribution p is the desired output and the probability distribution q is the actual output, the cross-entropy loss is calculated as follows:

H(p, q) = -(1/N) Σ_i Σ_j p(x_ij) log q(x_ij)    (11)

This cross-entropy loss adapts well to multi-label classification tasks. In equation (11), N represents the number of samples in a batch (i = 1, …, N) and M represents the number of classes (j = 1, …, M); p(x_ij) represents the desired output, taking the value 1 if class j is the same as the class of sample i and 0 otherwise; q(x_ij) represents the predicted probability that the observed sample i belongs to class j.
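As a quick numerical check of equation (11), the sketch below computes the cross entropy by hand and compares it with a library implementation; the batch size, class count and logits are made-up illustrative values.

```python
import torch
import torch.nn.functional as F

# Toy values: N = 2 samples, M = 3 classes (illustrative only).
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])
labels = torch.tensor([0, 1])                    # true class of each sample

q = F.softmax(logits, dim=1)                     # predicted probabilities q(x_ij)
p = F.one_hot(labels, num_classes=3).float()     # desired outputs p(x_ij)

manual = -(p * q.log()).sum(dim=1).mean()        # H(p, q) as in equation (11)
library = F.cross_entropy(logits, labels)        # the same value via the library
print(manual.item(), library.item())             # the two agree
```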
Example two
The embodiment provides a domain-adaptive chemical potential safety hazard short text classification system;
a domain-adapted chemical potential safety hazard short text classification system comprises:
an acquisition module configured to: acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
an extraction module configured to: vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
a classification module configured to: and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
It should be noted here that the above acquisition module, extraction module and classification module correspond to steps S101 to S103 of the first embodiment; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the first embodiment. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A domain-adaptive chemical potential safety hazard short text classification method is characterized by comprising the following steps:
acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
2. The domain-adaptive chemical potential safety hazard short text classification method as claimed in claim 1, wherein inputting the initial text vectors corresponding to all the short texts to be classified into the trained short text classification model and outputting the short text classification results specifically comprises:
the trained short text classification model encodes each initial text vector to obtain text vectors that consider the forward and backward time-sequence association;
the trained short text classification model assigns weights between words to each of these text vectors to obtain the text vectors after the first weighting;
the trained short text classification model splices the text vectors after the first weighting to obtain a sentence embedding vector;
based on the sentence embedding vector, the trained short text classification model assigns weights between words and the sentence to each of the text vectors to obtain the text vectors after the second weighting;
the trained short text classification model splices the text vectors after the second weighting to obtain the vector to be classified;
the trained short text classification model classifies the vector to be classified to obtain the classification result of each short text to be classified.
3. The domain-adaptive chemical potential safety hazard short text classification method as claimed in claim 1 or 2, wherein the network structure of the short text classification model comprises, connected in sequence:
a word-embedding construction layer, a word encoder, a word-word attention layer, a first splicing unit, a word-sentence attention layer, a second splicing unit and a classification layer.
4. The method as claimed in claim 3, wherein the word encoder is structured as follows:
firstly, assume that the Word Encoder of the original HAN model comprises (in the notation below, →h denotes the left-to-right coding units and ←h the right-to-left coding units):
coding units connected in sequence from left to right, →h_1, →h_2, →h_3, …, →h_p, …, →h_{n-1}, →h_n, and
coding units connected in sequence from right to left, ←h_n, ←h_{n-1}, ←h_{n-2}, …, ←h_q, …, ←h_2, ←h_1;
wherein the first input end of the coding unit →h_p is used for inputting the output value of the coding unit →h_{p-1}, and the second input end of →h_p is used for inputting the p-th output value of the word-embedding construction layer BERT; the first output end of →h_p is connected with the input end of the p-th Concat unit, the output end of the p-th Concat unit outputs a text vector that considers the forward and backward time-sequence association, and the second output end of →h_p is connected with the input end of →h_{p+1};
wherein the first input end of the coding unit ←h_q is used for inputting the output value of the coding unit ←h_{q+1}, and the second input end of ←h_q is used for inputting the q-th output value of the word-embedding construction layer BERT; the first output end of ←h_q is connected with the input end of the q-th Concat unit, the output end of the q-th Concat unit outputs a text vector that considers the forward and backward time-sequence association, and the second output end of ←h_q is connected with the input end of ←h_{q-1}.
5. The domain-adaptive chemical potential safety hazard short text classification method as claimed in claim 2, wherein the trained short text classification model encoding each initial text vector to obtain text vectors that consider the forward and backward time-sequence association specifically comprises:
inputting the initial text vectors corresponding to all the short texts to be classified into the word encoder of the trained short text classification model, the word encoder encoding each initial text vector to obtain the text vectors that consider the forward and backward time-sequence association;
alternatively,
the trained short text classification model assigning weights between words to each of the text vectors to obtain the text vectors after the first weighting specifically comprises:
the word-word attention layer of the trained short text classification model assigns weights between words to each of the text vectors to obtain the text vectors after the first weighting.
6. The domain-adaptive chemical potential safety hazard short text classification method as claimed in claim 2, wherein the trained short text classification model splicing the text vectors after the first weighting to obtain the sentence embedding vector specifically comprises:
the trained short text classification model serially splices the text vectors after the first weighting to obtain the sentence embedding vector;
alternatively,
based on the sentence embedding vector, the trained short text classification model assigning weights between words and the sentence to each of the text vectors to obtain the text vectors after the second weighting specifically comprises:
based on the sentence embedding vector, the word-sentence attention layer of the trained short text classification model assigns a weight between word and sentence to each of the text vectors to obtain the text vectors after the second weighting.
7. The domain-adaptive chemical potential safety hazard short text classification method as claimed in claim 2, wherein the trained short text classification model splicing the text vectors after the second weighting to obtain the vector to be classified specifically comprises:
the trained short text classification model serially splices the text vectors after the second weighting to obtain the vector to be classified;
alternatively,
the trained short text classification model classifying the vector to be classified to obtain the classification result of each short text to be classified specifically comprises:
the classification layer of the trained short text classification model classifies the text vectors after the second weighting to obtain the classification result of each short text to be classified.
8. A domain-adaptive chemical potential safety hazard short text classification system, characterized by comprising:
an acquisition module configured to: acquiring a plurality of short texts to be classified in the field of chemical potential safety hazard investigation;
an extraction module configured to: vector extraction is carried out on each short text to be classified to obtain an initial text vector corresponding to each short text to be classified;
a classification module configured to: and inputting the initial text vectors corresponding to all the texts to be classified into the trained short text classification model, and outputting the short text classification result.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202110511224.9A 2021-05-11 2021-05-11 Domain-adaptive chemical potential safety hazard short text classification method and system Pending CN113139057A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110511224.9A CN113139057A (en) 2021-05-11 2021-05-11 Domain-adaptive chemical potential safety hazard short text classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110511224.9A CN113139057A (en) 2021-05-11 2021-05-11 Domain-adaptive chemical potential safety hazard short text classification method and system

Publications (1)

Publication Number Publication Date
CN113139057A true CN113139057A (en) 2021-07-20

Family

ID=76818004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110511224.9A Pending CN113139057A (en) 2021-05-11 2021-05-11 Domain-adaptive chemical potential safety hazard short text classification method and system

Country Status (1)

Country Link
CN (1) CN113139057A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298326A (en) * 2021-07-27 2021-08-24 成都西辰软件有限公司 Intelligent electronic event supervision method, equipment and storage medium
CN113688239A (en) * 2021-08-20 2021-11-23 平安国际智慧城市科技股份有限公司 Text classification method and device under few samples, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method
CN110489545A (en) * 2019-07-09 2019-11-22 平安科技(深圳)有限公司 File classification method and device, storage medium, computer equipment
CN111225277A (en) * 2018-11-27 2020-06-02 北京达佳互联信息技术有限公司 Transcoding method, transcoding device and computer readable storage medium
CN112417098A (en) * 2020-11-20 2021-02-26 南京邮电大学 Short text emotion classification method based on CNN-BiMGU model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method
CN111225277A (en) * 2018-11-27 2020-06-02 北京达佳互联信息技术有限公司 Transcoding method, transcoding device and computer readable storage medium
CN110489545A (en) * 2019-07-09 2019-11-22 平安科技(深圳)有限公司 File classification method and device, storage medium, computer equipment
CN112417098A (en) * 2020-11-20 2021-02-26 南京邮电大学 Short text emotion classification method based on CNN-BiMGU model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ge Yan et al.: "基于BLSTM-Attention神经网络模型的..." [Based on the BLSTM-Attention neural network model ...], Computer Systems & Applications (计算机系统应用) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298326A (en) * 2021-07-27 2021-08-24 成都西辰软件有限公司 Intelligent electronic event supervision method, equipment and storage medium
CN113298326B (en) * 2021-07-27 2021-10-26 成都西辰软件有限公司 Intelligent electronic event supervision method, equipment and storage medium
CN113688239A (en) * 2021-08-20 2021-11-23 平安国际智慧城市科技股份有限公司 Text classification method and device under few samples, electronic equipment and storage medium
CN113688239B (en) * 2021-08-20 2024-04-16 平安国际智慧城市科技股份有限公司 Text classification method and device under small sample, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109902293B (en) Text classification method based on local and global mutual attention mechanism
CN109284506B (en) User comment emotion analysis system and method based on attention convolution neural network
CN110334354B (en) Chinese relation extraction method
CN112528672B (en) Aspect-level emotion analysis method and device based on graph convolution neural network
CN107506414B (en) Code recommendation method based on long-term and short-term memory network
Xiao et al. Research progress of RNN language model
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN116415654A (en) Data processing method and related equipment
Mukherjee et al. Utilization of oversampling for multiclass sentiment analysis on amazon review dataset
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN113139057A (en) Domain-adaptive chemical potential safety hazard short text classification method and system
WO2023004528A1 (en) Distributed system-based parallel named entity recognition method and apparatus
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Liang et al. A double channel CNN-LSTM model for text classification
Yan et al. Implicit emotional tendency recognition based on disconnected recurrent neural networks
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN112560440B (en) Syntax dependency method for aspect-level emotion analysis based on deep learning
CN113806543A (en) Residual jump connection-based text classification method for gated cyclic unit
CN113377844A (en) Dialogue type data fuzzy retrieval method and device facing large relational database
CN116644760A (en) Dialogue text emotion analysis method based on Bert model and double-channel model
WO2023159759A1 (en) Model training method and apparatus, emotion message generation method and apparatus, device and medium
CN115964497A (en) Event extraction method integrating attention mechanism and convolutional neural network
CN113779244B (en) Document emotion classification method and device, storage medium and electronic equipment
Gupta et al. Detailed study of deep learning models for natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination