CN111026846A - Online short text data stream classification method based on feature extension - Google Patents

Online short text data stream classification method based on feature extension Download PDF

Info

Publication number
CN111026846A
Authority
CN
China
Prior art keywords
text
time
data block
network
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911251229.1A
Other languages
Chinese (zh)
Other versions
CN111026846B (en)
Inventor
李培培
胡阳
胡学钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN201911251229.1A priority Critical patent/CN111026846B/en
Publication of CN111026846A publication Critical patent/CN111026846A/en
Application granted granted Critical
Publication of CN111026846B publication Critical patent/CN111026846B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses an online short text data stream classification method based on feature expansion, which comprises the following steps: 1, constructing a Word2Vec model from an external corpus to obtain a word vector set Vec; 2, vectorizing the short text data stream with Vec and performing text vectorization expansion based on a CNN model; 3, constructing an online deep learning network for the expanded text vectors; 4, introducing a concept drift semaphore into the neurons of the LSTM network and detecting distribution changes of the short text stream; and 5, completing model updating of the online deep learning network and prediction of the short text data stream. The method can effectively improve the classification accuracy of short text data streams, correctly detect concept drift and adjust the model, thereby quickly adapting to the short text data stream environment.

Description

Online short text data stream classification method based on feature extension
Technical Field
The invention belongs to the field of short text data stream mining and online deep learning, and particularly relates to a continuously variable, rapid and infinite short text data stream classification problem.
Background
With the rise of information technologies such as mobile development and micro-service frameworks, massive, high-speed and dynamically changing data — data streams — are emerging in practical application fields such as social networks, online shopping and sensor networks. In the social field, thanks to the popularity of social media and forums, short texts have flooded into our lives, for example user comments on microblogs, tweets and Facebook and interactions on forums. These short texts contain a large amount of information from many domains, such as sports, education and science. Compared with ordinary texts, short texts are sparse, real-time, immediate, non-standard and dynamically variable, which leads to topic evolution. For example, a microblog post is limited to 140 characters, and many posts contain only one sentence or even one phrase; hot-topic ranking lists and popular words on the network keep changing. Users generate huge numbers of short texts while interacting on network platforms, and the data volume grows rapidly: according to incomplete statistics, users of current mainstream interaction platforms (microblog, Facebook, etc.) produce on average up to 346 comments per second. This demands a higher data throughput for short text processing, otherwise data will accumulate over time. These problems pose serious challenges to existing short text classification and data stream classification methods:
Challenge one: in traditional short text classification, characteristics of short texts such as high-dimensional sparsity make conventional text classification techniques ineffective. Current solutions fall into two groups: one expands the short text with an external corpus and then classifies it with a traditional classifier such as naive Bayes, Support Vector Machine (SVM) or decision tree; the other expands the short text with implicit statistical information and then classifies it, e.g. with LDA or KNN. However, the stability of these models depends heavily on the completeness of the external corpus, which results in poor model portability and stability.
Challenge two: because streaming data is massive and unbounded, traditional multi-pass deep learning frameworks built for static data sets (such as Text-CNN and RNN) cannot process it well, and the model cannot attain good performance.
Challenge three: short text streams change dynamically, and the current mainstream deep learning frameworks, limited by their static network-layer architectures, cannot quickly adapt to continuously changing data streams, so traditional neural network models cannot handle the dynamic variability of short texts well.
Disclosure of Invention
In order to avoid the defects of the prior art, the invention provides an online short text data stream classification method based on feature extension, so as to solve the short text data stream classification problem in practical application fields, thereby effectively improving the classification accuracy of short text data streams, reducing the time consumed in model construction, and quickly adapting to short text data stream classification in the presence of concept drift.
In order to achieve the aim, the invention adopts the following technical scheme:
the invention relates to an online short text data stream classification method based on feature expansion, which is characterized by comprising the following steps of:
step 1: constructing a Word2Vec model according to an external corpus, and acquiring a Word vector set Vec:
step 1.1: setting, according to a sliding window mechanism, the given short text data stream Stream = {d_1, d_2, ..., d_e, ..., d_E} to be divided by time into T data blocks, recorded as the data block set D = {D_1, D_2, ..., D_t, ..., D_T}, where d_e represents the e-th short text in the short text data stream Stream, D_t represents the data block at time t in the short text data stream Stream, e = 1, 2, ..., E, and t = 1, 2, ..., T;
step 1.2: acquiring a text external corpus, recorded as C' = {d'_1, d'_2, ..., d'_m, ..., d'_M}, m = 1, 2, ..., M, where M represents the total number of texts in the external corpus C', d'_m represents the m-th text, and d'_m = {w_1^m, w_2^m, ..., w_q^m, ..., w_Q^m}, q = 1, 2, ..., Q, where Q denotes the number of words in the m-th text d'_m, w_q^m represents the q-th word in the m-th text d'_m, and w_q^m ∈ Vocab; Vocab represents the set of all different words in the external corpus C', Vocab = {word_1, word_2, ..., word_z, ..., word_Z}, z = 1, 2, ..., Z, where Z represents the total number of words in the word set Vocab and word_z represents the z-th word in the word set Vocab; letting the word vector of the z-th word word_z be recorded as Vec(word_z), thereby obtaining the word vector set Vec = {Vec(word_1), Vec(word_2), ..., Vec(word_z), ..., Vec(word_Z)} corresponding to the word set Vocab;
Step 1.3: updating the word vector set Vec by using a skip-gram model to obtain an updated word vector set Vec' and assigning values to the Vec;
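Steps 1.1 to 1.3 amount to training a skip-gram word-vector model on the external corpus and keeping one vector per vocabulary word. A minimal sketch using the gensim library is shown below; the corpus file name and the hyper-parameter values (vector size S = 50, window 5) are illustrative assumptions, not values fixed by the method, and any other skip-gram implementation that yields a vector per word in Vocab could stand in here.

```python
# Sketch of step 1: build a skip-gram Word2Vec model from the external corpus C'
# and collect the word vector set Vec. The file name and hyper-parameters below
# are assumptions for illustration only.
from gensim.models import Word2Vec

# One whitespace-tokenized text per line in the external corpus.
with open("external_corpus.txt", encoding="utf-8") as f:
    corpus = [line.split() for line in f if line.strip()]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # S: dimension of each word vector
    window=5,         # context window used by the skip-gram model
    sg=1,             # sg=1 selects the skip-gram architecture
    hs=1,             # hierarchical softmax, matching the Huffman-tree training in the embodiment
    min_count=1,
)

# Word vector set Vec = {Vec(word_z)} for every word_z in the vocabulary Vocab.
Vec = {word: model.wv[word] for word in model.wv.index_to_key}
```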
step 2: vectorizing the short text data Stream by using a word vector set Vec and performing text vectorization expansion based on a CNN model:
step 2.1: initializing t = 0; defining and initializing the concept drift semaphore at time t as Drif_t = 0; defining and initializing the concept drift threshold Thr;
step 2.2: acquiring the data block D_t at time t in the short text data stream Stream, with D_t = {(d_1^t, y_1^t), ..., (d_n^t, y_n^t), ..., (d_N^t, y_N^t)}, n = 1, 2, ..., N, where N is the total number of texts in the data block D_t at time t, d_n^t is the n-th text in the data block D_t at time t, d_n^t = {w_{n,1}^t, w_{n,2}^t, ..., w_{n,i}^t, ..., w_{n,I}^t}, i = 1, 2, ..., I, where w_{n,i}^t represents the i-th word in the n-th text d_n^t of the data block D_t at time t, and y_n^t is the class label of the n-th text d_n^t in the data block D_t at time t;
step 2.3: obtaining from the word vector set Vec the word vector Vec(w_{n,i}^t) of the i-th word w_{n,i}^t of the n-th text d_n^t in the data block D_t at time t, thereby obtaining the set of word vectors of all words in the n-th text d_n^t of the data block D_t at time t, expressed as Vec(d_n^t) = {Vec(w_{n,1}^t), ..., Vec(w_{n,I}^t), Vec(0_1), ..., Vec(0_j), ..., Vec(0_{P-I})}, where I ≤ P, P represents the maximum number of words of a short text in the short text data stream Stream, Vec(0_j) is the j-th all-zero padding vector, and 1 ≤ j ≤ P-I;
step 2.4: obtaining, according to step 2.3, the text vector set of the data block D_t at time t, expressed as Vec(D_t) = {Vec(d_1^t), ..., Vec(d_n^t), ..., Vec(d_N^t)}; grouping the data block D_t at time t into groups of G texts each, so that Vec(D_t) is divided into R groups, recorded as Vec(D_t) = {Vec_1(D_t), ..., Vec_r(D_t), ..., Vec_R(D_t)}, r = 1, 2, ..., R, with R = N/G;
step 2.5: letting the convolution kernel in the CNN model be Core, the size of the convolution kernel be Row × Col, and the stride be Strides; for the set of word vectors Vec(d_n^t) of the n-th text d_n^t in the data block D_t at time t, performing the convolution operation to obtain a semantic matrix of dimension ((P-Row)/Strides + 1) × (S-Col + 1), where S is the dimension of a word vector, and merging the semantic matrix with the set of word vectors Vec(d_n^t) to obtain the input representation Vec'(d_n^t) of the n-th text d_n^t of the data block D_t at time t, thereby obtaining all text expansion vectors of the data block D_t at time t, expressed as Vec'(D_t) = {Vec'_1(D_t), ..., Vec'_r(D_t), ..., Vec'_R(D_t)}, where Vec'_r(D_t) = {Vec'_{r,1}(D_t), ..., Vec'_{r,g}(D_t), ..., Vec'_{r,G}(D_t)} and Vec'_{r,g}(D_t) denotes the expansion vector of the g-th text in the r-th group, g = 1, 2, ..., G;
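A rough NumPy sketch of step 2.5 — zero-padding the word-vector matrix to P rows, sliding a Row × Col kernel over it with the given stride, and stacking the resulting semantic matrix under the original matrix — is given below. The kernel values, the right-padding of the semantic matrix to width S so that the two blocks can be stacked, and the loop-based convolution are assumptions of this sketch; the patent does not spell out those details.

```python
import numpy as np

def expand_text(word_vectors, P, kernel, stride=1):
    """Build the expanded representation Vec'(d) of one short text (step 2.5).

    word_vectors: list of 1-D arrays of length S (one per word, I <= P)
    kernel:       2-D array of shape (Row, Col)
    """
    S = word_vectors[0].shape[0]
    mat = np.zeros((P, S))
    mat[:len(word_vectors)] = np.stack(word_vectors)       # zero-pad to P rows

    Row, Col = kernel.shape
    out_rows = (P - Row) // stride + 1
    out_cols = S - Col + 1
    sem = np.empty((out_rows, out_cols))
    for r in range(out_rows):
        for c in range(out_cols):
            patch = mat[r * stride : r * stride + Row, c : c + Col]
            sem[r, c] = np.sum(patch * kernel)              # one convolution step

    # The semantic matrix is appended below the word-vector matrix; it is
    # zero-padded on the right to width S here so the two blocks stack cleanly
    # (an assumption; with a column width of 1 no padding is needed at all).
    sem_padded = np.zeros((out_rows, S))
    sem_padded[:, :out_cols] = sem
    return np.vstack([mat, sem_padded])
```

With P = 100, S = 50 and the 3 × 1 kernel of the worked example later in the description, this yields the 198 × 50 expanded matrix mentioned there.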
And step 3: constructing an online deep learning network for the expanded text vectors:
step 3.1: defining the number of layers of the current deep learning network as fr and the maximum depth of the neural network as Floor; defining the number of groups of the current text expansion vector set Vec'(D_t) as r; initializing fr = 1 and letting r = 1;
step 3.2: for the text expansion vector set Vec'(D_t) of the data block D_t at time t, constructing the fr-th layer LSTM network LSTM_fr at time t and setting its number of neurons as O_fr; constructing the fr-th layer fully connected network Dense_fr and setting its number of neurons as H_fr;
step 3.3: assigning fr +1 to fr, and turning to step 3.2 until fr is Floor, so as to complete the construction of the deep learning network;
step 3.4: inputting the r-th group of text expansion vectors Vec'_r(D_t) of the text expansion vector set Vec'(D_t) into the first-layer LSTM network LSTM_1 at time t and obtaining the intermediate output, recorded as Out1_t^r = {out1_{t,1}^r, ..., out1_{t,g}^r, ..., out1_{t,G}^r}, where out1_{t,g}^r denotes the LSTM network intermediate output of the g-th text expansion vector in the r-th group of text expansion vectors, and:

out1_{t,g}^r = o_t · tanh(C_t)    (1)

In formula (1), o_t represents the information that the neurons in the LSTM network need to output at time t; C_t is the state information of the neurons in the LSTM network at time t; tanh(·) is the tanh activation function; and:

o_t = σ(W_o^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_o)    (2)

C_t = f_t · C_{t-1} + i_t · C̃_t    (3)

In formula (2), σ(·) is the sigmoid activation function; W_o^t is the weight of the output gate in the LSTM network neuron at time t; b_o is the bias term of the output gate; Vec'_{r,g}(D_t) is the g-th text expansion vector in the r-th group of text expansion vectors at time t;
In formula (3), f_t is the output information of the forget gate of the neuron in the LSTM network at time t; C̃_t is the input information of the neuron in the LSTM network at time t; i_t is the state information of the input gate of the neuron in the LSTM network at time t; C_{t-1} is the state information of the neuron at time t-1; and:

f_t = σ(W_f^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_f)    (4)

i_t = σ(W_i^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_i)    (5)

C̃_t = tanh(W_c^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_c)    (6)

In formulas (4), (5) and (6), W_f^t, W_i^t and W_c^t are respectively the weights corresponding to the forget gate, the input gate and the state update gate in the LSTM network neuron at time t; b_f, b_i and b_c are the bias terms corresponding to the forget gate, the input gate and the state update gate; h_{t-1} denotes the previous intermediate output of the LSTM neuron;
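Formulas (1)-(6) are the standard LSTM gate equations. For illustration, a NumPy sketch of one neuron update is given below; the dictionary layout of the weights, the flattening of the expanded text vector into a single input vector, and the concatenation [h_{t-1}, x] are assumptions of this sketch rather than details fixed by the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    """One LSTM neuron update following formulas (1)-(6).

    x:       input vector (e.g. a flattened expanded text vector Vec'_{r,g}(D_t))
    h_prev:  previous intermediate output h_{t-1}
    C_prev:  previous cell state C_{t-1}
    W, b:    dicts holding W_f, W_i, W_c, W_o and b_f, b_i, b_c, b_o
    """
    z = np.concatenate([h_prev, x])                  # [h_{t-1}, x]
    f = sigmoid(W["f"] @ z + b["f"])                 # forget gate, formula (4)
    i = sigmoid(W["i"] @ z + b["i"])                 # input gate, formula (5)
    C_tilde = np.tanh(W["c"] @ z + b["c"])           # candidate state, formula (6)
    C = f * C_prev + i * C_tilde                     # cell state update, formula (3)
    o = sigmoid(W["o"] @ z + b["o"])                 # output gate, formula (2)
    h = o * np.tanh(C)                               # intermediate output, formula (1)
    return h, C
```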
step 3.5: inputting the intermediate output Out1_t^r of the LSTM_1 network into the first-layer fully connected network Dense_1 and obtaining the corresponding output Dense1_t^r = {dense1_{t,1}^r, ..., dense1_{t,g}^r, ..., dense1_{t,G}^r}, where dense1_{t,g}^r represents the Dense_1 network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense1_{t,g}^r = out1_{t,g}^r · W1_t, and W1_t is the weight corresponding to the fully connected network Dense_1 at time t;
step 3.6: letting fr = 2;
step 3.7: inputting the intermediate output Out(fr-1)_t^r of the (fr-1)-th layer LSTM network LSTM_(fr-1) into the fr-th layer LSTM network LSTM_fr and obtaining the intermediate output of the fr-th layer LSTM network LSTM_fr at time t, expressed as Out(fr)_t^r;
step 3.8: inputting the intermediate output Out(fr)_t^r of the fr-th layer LSTM network LSTM_fr into the fr-th layer fully connected network Dense_fr and obtaining the corresponding output Dense(fr)_t^r = {dense(fr)_{t,1}^r, ..., dense(fr)_{t,g}^r, ..., dense(fr)_{t,G}^r}, where dense(fr)_{t,g}^r represents the Dense_fr network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense(fr)_{t,g}^r = out(fr)_{t,g}^r · W_fr_t, and W_fr_t is the weight corresponding to the fr-th layer fully connected network Dense_fr at time t;
step 3.9: assigning fr+1 to fr and going to step 3.7 until fr = Floor;
step 3.10: carrying out weighted summation of the outputs of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t by formula (7), and obtaining the output out_{t,g}^r of the expanded text vector Vec'_{r,g}(D_t) at time t:

out_{t,g}^r = Σ_{fr=1}^{Floor} weight_fr_t · dense(fr)_{t,g}^r    (7)

In formula (7), weight1_t, weight2_t, ..., weight_fr_t, ..., weight_Floor_t are respectively the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t; weight_fr_t is the contribution weight of the output of the fully connected network Dense_fr;
updating the output weight weight_fr_t by the hedging algorithm shown in formula (8) to obtain the output weight weight_fr_{t+1} of the fully connected network Dense_fr at time t+1, thereby obtaining the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t+1:

weight_fr_{t+1} = weight_fr_t · β^{L(dense(fr)_{t,g}^r, l_t^g)}    (8)

In formula (8), β is the hedging parameter, L(·) is the loss function of the corresponding output, and l_t^g is the class label of the g-th text expansion vector in the r-th group of text expansion vectors;
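Formula (8) is the classical hedge-style weight update used in online deep learning. A minimal sketch follows; the choice of cross-entropy as the per-layer loss L(·) and the final renormalization of the weights are assumptions of this sketch, not details stated by the patent.

```python
import numpy as np

def hedge_update(weights, layer_outputs, label_onehot, beta=0.99):
    """Update the per-layer output weights weight_fr_t -> weight_fr_{t+1} (formula (8)).

    weights:       1-D array of current output weights, one per Dense_fr layer
    layer_outputs: list of class-probability vectors, one per Dense_fr output
    label_onehot:  one-hot class label l_t^g of the current expanded text vector
    beta:          hedging parameter in (0, 1)
    """
    # Per-layer loss L(Dense_fr(x), l); cross-entropy is assumed here.
    losses = np.array([
        -np.sum(label_onehot * np.log(np.clip(p, 1e-12, 1.0)))
        for p in layer_outputs
    ])
    new_w = weights * np.power(beta, losses)   # weight_fr_{t+1} = weight_fr_t * beta^L
    return new_w / new_w.sum()                 # renormalize (assumption)

# Combined prediction, formula (7): out = sum_fr weight_fr_t * Dense_fr output.
def combine_outputs(weights, layer_outputs):
    return sum(w * p for w, p in zip(weights, layer_outputs))
```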
step 3.11: assigning r+1 to r, feeding the r-th group Vec'_r(D_t) of the text expansion vector set Vec'(D_t) of the data block D_t at time t into the first-layer LSTM network, and going to step 3.10 until r = R, thereby finally obtaining the output Out_t = {out_{t,1}, out_{t,2}, ..., out_{t,n}, ..., out_{t,N}} of the original text vector set Vec(D_t) of the data block D_t at time t, where out_{t,n} is the final output of the n-th text expansion vector of the data block D_t at time t;
and 4, step 4: introducing concept drift semaphore to neurons in the LSTM network at the t moment and detecting distribution change of short text stream:
step 4.1: performing, according to step 2, short text expansion and grouping on the data block D_{t+1} at time t+1 to obtain the grouped expanded text vector representation Vec'(D_{t+1}) = {Vec'_1(D_{t+1}), ..., Vec'_r(D_{t+1}), ..., Vec'_R(D_{t+1})}, and inputting it into the deep learning network to obtain the pre-output Out'_{t+1} = {out'_{t+1,1}, ..., out'_{t+1,n}, ..., out'_{t+1,N}} of the text expansion vector set Vec'(D_{t+1}) of the data block D_{t+1} at time t+1, where out'_{t+1,n} is the pre-output of the n-th text expansion vector of the data block D_{t+1} at time t+1;
step 4.2: calculating the Euclidean distance dis between the output Out_t of the data block D_t at time t and the pre-output Out'_{t+1} of the data block D_{t+1} at time t+1; if dis is smaller than the concept drift threshold Thr, it indicates that no concept drift occurs in the data block D_t at time t, and the concept drift semaphore at time t+1 is set as Drif_{t+1} = 0; otherwise, it indicates that concept drift occurs in the data block D_t at time t, and the concept drift semaphore at time t+1 is set as Drif_{t+1} = a, where a is a constant;
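A small sketch of the drift test in step 4.2 is given below. The patent only specifies a Euclidean distance between the two block-level outputs; averaging the per-text distances, as done here, is one reasonable reading and therefore an assumption.

```python
import numpy as np

def detect_drift(out_t, out_pre_t1, thr, drift_constant=1.0):
    """Set the concept-drift semaphore Drif_{t+1} as described in step 4.2.

    out_t:       outputs of data block D_t, shape (N, L)
    out_pre_t1:  pre-outputs of data block D_{t+1}, shape (N, L)
    thr:         concept drift threshold Thr
    Returns 0 when no drift is detected, otherwise the constant a.
    """
    # Euclidean distance between the two outputs; per-text distances are
    # averaged here (assumption).
    dis = np.mean(np.linalg.norm(out_t - out_pre_t1, axis=1))
    return 0.0 if dis < thr else drift_constant
```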
and 5: model updating and short text data stream prediction of the online deep learning network:
step 5.1: updating the LSTM network weights W_o^t, W_f^t, W_i^t, W_c^t and the fully connected network weights W1_t, W2_t, ..., W_fr_t, ..., W_Floor_t at time t through the back-propagation (BP) algorithm together with the concept drift signal Drif_{t+1} at time t+1, thereby obtaining the LSTM network weights W_o^{t+1}, W_f^{t+1}, W_i^{t+1}, W_c^{t+1} and the fully connected network weights W1_{t+1}, W2_{t+1}, ..., W_fr_{t+1}, ..., W_Floor_{t+1} at time t+1;
step 5.2: using the LSTM network at time t+1 to predict the text expansion vector set Vec'(D_{t+1}) of the data block D_{t+1} at time t+1 and obtaining the predicted output Out_{t+1} = {out_{t+1,1}, ..., out_{t+1,n}, ..., out_{t+1,N}}, where out_{t+1,n} is the final output of the n-th text expansion vector of the data block D_{t+1} at time t+1, out_{t+1,n} is an L-dimensional vector whose l-th component is the output probability of the corresponding class label, l = 1, 2, ..., L, and L represents the total number of class labels;
taking the class label at the position of the maximum value in the final output out_{t+1,n} as the predicted class label of the n-th text expansion vector of the data block D_{t+1} at time t+1, thereby completing the prediction of the data block D_{t+1} at time t+1;
step 5.3: assigning t+1 to t and going to step 4 until t = T, thereby completing the classification processing of the data block set D of the short text data stream.
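Putting steps 2-5 together, the stream is processed block by block in a predict-then-update fashion. The sketch below assumes helper functions expand_block, forward and backprop_update that wrap the operations described above, plus the detect_drift sketch given earlier; all of these names and the exact ordering inside the loop are hypothetical simplifications, not the patent's literal procedure.

```python
def classify_stream(blocks, model, thr):
    """Block-by-block loop over D = {D_1, ..., D_T}, loosely following steps 2-5."""
    prev_block, prev_out = None, None
    for D in blocks:                                   # data block D_{t+1} arrives
        X = expand_block(D, model)                     # step 2: Word2Vec + CNN expansion
        pre_out = forward(model, X)                    # step 4.1: pre-output of D_{t+1}
        drif = detect_drift(prev_out, pre_out, thr) if prev_out is not None else 0.0
        if prev_block is not None:                     # step 5.1: BP update modulated by
            backprop_update(model, prev_block, drif)   # the drift semaphore Drif_{t+1}
        out = forward(model, X)                        # step 5.2: predict D_{t+1}
        yield out.argmax(axis=1)                       # predicted class labels
        prev_block, prev_out = (X, D.labels), out
```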
Compared with the prior art, the invention has the following beneficial effects:
1. Considering that short texts are short and carry little information, the method builds a neural-network Word2vec word-vector model from an external corpus to expand the short text, mapping each word to a vector of a specified dimension and obtaining the semantic relatedness between words, which alleviates the sparsity problem of short texts to a certain extent and improves the accuracy of short text classification.
2. Considering that Word2vec can only capture the relatedness between words and cannot resolve the fact that the same word has different meanings in different contexts, the invention uses the convolution window of a CNN convolutional layer to acquire information about neighbouring phrases in the short text, thereby capturing the different semantics of a word in different contexts and making up for this shortcoming of Word2vec.
3. The invention trains the data stream with a CNN convolutional layer and a multi-layer LSTM neural network classifier. The main training process consists of two passes: the first pass completes the pre-training of the initial weights of all neurons, and the second pass fine-tunes the weights for the specific downstream task, making the method suitable for high-speed short text data streams and improving the algorithm's throughput on them.
4. The invention addresses the concept drift phenomenon of short text data streams. The change of topic distribution in the unknown stream is judged from the predicted class labels of each incoming data block and a drift threshold: before a data block enters the neural network, its class-label output is obtained and its distance to the output of the previous moment is computed; if the distance exceeds the specified threshold, drift has occurred, a concept drift signal is set, and the neural network model is then retrained, so that concept drift in the short text data stream is handled effectively.
5. The invention is oriented to practical application fields such as online modeling and automatic classification of high-speed short text data streams in social networks, implicit concept tracking and detection, and has wide applicability.
Drawings
FIG. 1 is a schematic diagram of text vectorization expansion based on a CNN model according to the present invention;
FIG. 2 is a schematic diagram of a data flow classification framework based on a multi-layer LSTM model according to the present invention;
fig. 3 is a diagram illustrating processing concept drift.
Detailed Description
In this embodiment, as shown in fig. 2, an online short text data stream classification method based on feature extension is performed according to the following steps:
step 1: constructing a Word2Vec model according to an external corpus, and acquiring a Word vector set Vec:
step 1.1: setting, according to a sliding window mechanism, the given short text data stream Stream = {d_1, d_2, ..., d_e, ..., d_E} to be divided by time into T data blocks, recorded as the data block set D = {D_1, D_2, ..., D_t, ..., D_T}, where d_e represents the e-th short text in the short text data stream Stream, D_t represents the data block at time t in the short text data stream Stream, e = 1, 2, ..., E, and t = 1, 2, ..., T;
step 1.2: acquiring from the knowledge base a text external corpus for the short text data stream Stream, recorded as C' = {d'_1, d'_2, ..., d'_m, ..., d'_M}, m = 1, 2, ..., M, where M represents the total number of texts in the external corpus C', d'_m represents the m-th text, and d'_m = {w_1^m, ..., w_q^m, ..., w_Q^m}, q = 1, 2, ..., Q, where Q denotes the number of words in the m-th text d'_m, w_q^m represents the q-th word in the m-th text d'_m, and w_q^m ∈ Vocab; Vocab represents the set of all different words in the external corpus C', Vocab = {word_1, word_2, ..., word_z, ..., word_Z}, z = 1, 2, ..., Z, where Z represents the total number of words in the word set Vocab and word_z represents the z-th word in the word set Vocab; letting the word vector of the z-th word word_z be recorded as Vec(word_z) and initializing it as an all-zero vector, thereby obtaining the word vector set Vec = {Vec(word_1), Vec(word_2), ..., Vec(word_z), ..., Vec(word_Z)} corresponding to the word set Vocab;
Step 1.3: obtain the vocabulary Vocab, i.e., the set of all words in the external corpus. Allocate Vec as an all-zero array of size |Vocab| × S, so that the vector of the z-th word in Vocab occupies the slice Vec[(z-1)·S : z·S].
Step 1.4: update the word vector set Vec with the skip-gram model to obtain the updated word vector set Vec', and assign it to Vec;
Step 1.5: initialize the current text d'_m to the first text of the corpus C'.
Step 1.6: traverse all words in d'_m; initialize the current word w_q to the first word.
Step 1.7: set the window size window; for each word w_q in the m-th text d'_m of the corpus C', obtain the context of w_q, Content = (w_{q-window}, ..., w_{q+window}), and compute p(Content | w_q) = Π_{u ∈ Content} p(u | w_q). Traversing each u in Content, compute p(u | w_q) = Π_{h=2}^{H} p(d_h | Vec(w_q), θ_{h-1}), where H is the height of the Huffman tree T, d_h is the h-th bit in the Huffman coding of the path from the root node to the leaf node u, θ_h is the parameter corresponding to the h-th node on the path from the root node to the leaf node u, and Vec is the set of word vectors corresponding to all words w_q of Vocab; here p(d_h | Vec(w_q), θ_{h-1}) = [σ(Vec(w_q)·θ_{h-1})]^{1-d_h} · [1 - σ(Vec(w_q)·θ_{h-1})]^{d_h}.
Step 1.8: compute the maximum likelihood function of the expression p(Content | w_q), i.e., the log-likelihood Σ_{u ∈ Content} log p(u | w_q); take the partial derivatives of the likelihood function with respect to θ_h and Vec(w_q) respectively, and update the values of θ_h and Vec(w_q). Finally, the trained word vector set Vec is obtained.
Step 1.9: go to steps 1.6 and 1.7 until all words of d'_m have been traversed.
Step 1.10: move on to the next text d'_m and go to steps 1.5, 1.6, 1.7 and 1.8 until the whole corpus C' has been traversed.
Step 1.11: obtain the final traversal result Vec, i.e., the final word vectors.
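A compact sketch of the gradient step described in steps 1.7-1.8 for a single (w_q, u) pair is given below. The Huffman path is represented by its node parameters and code bits, the in-place updates and the learning rate eta are illustrative assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_update(vec_wq, path_thetas, path_codes, eta=0.025):
    """One hierarchical-softmax SGD step for context word u given center word w_q.

    vec_wq:      word vector Vec(w_q) (1-D array, updated in place)
    path_thetas: list of parameter vectors theta_h along the root-to-u path
    path_codes:  list of Huffman code bits d_h (0/1) along the same path
    """
    grad_w = np.zeros_like(vec_wq)
    for theta, d in zip(path_thetas, path_codes):
        q = sigmoid(vec_wq @ theta)          # sigma(Vec(w_q) . theta_h)
        g = eta * (1 - d - q)                # gradient of the log-likelihood term
        grad_w += g * theta                  # accumulate the update for Vec(w_q)
        theta += g * vec_wq                  # update theta_h in place
    vec_wq += grad_w
```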
In this embodiment, a Word2vec model containing word vectors as exemplified in Table 1 is trained from the external corpus, where S = 50 and |Vocab| = 142234.
TABLE 1 Word2vec word vectors
ID   word       Vec(word) (Feature = 50, leading components shown)
1    century    1.6 1.3 3.4 -2.0 2.7 -1.1 2.0 -2.1 3.0 2.0 -2.0 -0.7 ...
2    health     -2.2 -0.7 -3.3 -1.1 -2.4 -4.4 -0.7 -1.3 -3.7 -2.7 -1.2 ...
3    caffeine   1.2 2.7 -1.8 0.5 -1.5 0.0 -2.9 0.2 -1.0 -2.3 3.0 2.3 ...
4    limits     -1.7 1.0 -0.7 1.4 -0.4 1.8 -1.4 0.6 -0.4 0.2 -3.2 3.3 ...
5    family     0.8 0.9 2.3 0.3 -0.4 -0.5 0.3 0.5 -0.2 -0.2 2.5 0.9 -2.0 ...
6    drugs      -2.7 1.0 -1.7 1.2 -0.3 -1.9 -2.0 1.8 -2.4 -3.8 3.0 4.1 ...
7    radiation  7.2 3.6 -2.6 0.4 -4.4 1.0 0.3 1.6 -0.3 -1.9 2.7 -1.0 6.1 ...
8    pregnancy  -0.0 3.5 -1.5 1.3 -4.1 -4.2 -6.2 0.2 -0.8 -4.0 1.3 1.6 ...
9    trial      -5.3 -0.6 1.0 1.0 -1.6 -3.1 -5.3 2.7 -0.8 -2.7 -2.6 4.3 ...
10   mask       0.0 1.8 -0.3 -1.5 -2.1 1.4 0.4 0.0 -0.4 -1.8 0.0 1.0 -0.3 ...
11   spring     3.7 3.7 0.2 0.9 -1.6 0.1 -1.4 -2.3 -0.4 4.3 1.3 -3.3 3.8 2.4 ...
12   industry   0.4 0.5 0.2 -1.9 2.7 -0.4 4.0 -0.2 -4.0 -0.2 -4.8 -0.6 3.8 ...
13   Full       -2.1 1.9 -0.3 1.2 -0.6 1.5 -2.1 -2.0 0.2 0.3 -1.1 1.7 ...
14   plate      0.5 2.4 -0.5 0.8 1.4 1.3 1.3 -3.9 -2.9 2.7 -2.7 -2.4 ...
15   room       2.2 3.0 0.3 1.1 -4.0 1.8 -2.1 -1.4 0.6 0.0 -2.9 1.4 ...
16   salad      2.4 3.1 0.5 -2.2 0.9 1.6 -1.7 0.8 -1.7 0.3 0.9 -0.1 -0.4 ...
Step 2: vectorize the short text data stream using the word vector set Vec and perform text vectorization expansion based on the CNN model; the main structure is shown in FIG. 1:
step 2.1: initializing t = 0; defining and initializing the concept drift semaphore at time t as Drif_t = 0; defining and initializing the concept drift threshold Thr;
step 2.2: acquiring the data block D_t at time t in the short text data stream Stream, with D_t = {(d_1^t, y_1^t), ..., (d_n^t, y_n^t), ..., (d_N^t, y_N^t)}, n = 1, 2, ..., N, where N is the total number of texts in the data block D_t at time t, d_n^t is the n-th text in the data block D_t at time t, d_n^t = {w_{n,1}^t, ..., w_{n,i}^t, ..., w_{n,I}^t}, i = 1, 2, ..., I, where w_{n,i}^t represents the i-th word in the n-th text d_n^t of the data block D_t at time t, and y_n^t is the class label of the n-th text d_n^t in the data block D_t at time t;
step 2.3: obtaining from the word vector set Vec the word vector Vec(w_{n,i}^t) of the i-th word w_{n,i}^t of the n-th text d_n^t in the data block D_t at time t, thereby obtaining the set of word vectors of all words in the n-th text d_n^t of the data block D_t at time t, expressed as Vec(d_n^t) = {Vec(w_{n,1}^t), ..., Vec(w_{n,I}^t), Vec(0_1), ..., Vec(0_j), ..., Vec(0_{P-I})}, where I ≤ P, P represents the maximum number of words of a short text in the short text data stream Stream, Vec(0_j) is the j-th all-zero padding vector, and 1 ≤ j ≤ P-I;
step 2.4: obtaining, according to step 2.3, the text vector set of the data block D_t at time t, expressed as Vec(D_t) = {Vec(d_1^t), ..., Vec(d_n^t), ..., Vec(d_N^t)}; grouping the data block D_t at time t into groups of G texts each, so that Vec(D_t) is divided into R groups, recorded as Vec(D_t) = {Vec_1(D_t), ..., Vec_r(D_t), ..., Vec_R(D_t)}, r = 1, 2, ..., R, with R = N/G;
step 2.5: letting the convolution kernel in the CNN model be Core, the size of the convolution kernel be Row × Col, and the stride be Strides; for the set of word vectors Vec(d_n^t) of the n-th text d_n^t in the data block D_t at time t, performing the convolution operation to obtain a semantic matrix of dimension ((P-Row)/Strides + 1) × (S-Col + 1), and merging it with the set of word vectors Vec(d_n^t) to obtain the input representation Vec'(d_n^t) of the n-th text d_n^t of the data block D_t at time t, thereby obtaining all text expansion vectors of the data block D_t at time t, expressed as Vec'(D_t) = {Vec'_1(D_t), ..., Vec'_r(D_t), ..., Vec'_R(D_t)}, where Vec'_r(D_t) = {Vec'_{r,1}(D_t), ..., Vec'_{r,g}(D_t), ..., Vec'_{r,G}(D_t)}, g = 1, 2, ..., G;
And step 3: an online deep learning network is constructed for the expanded text vectors; the main structure is shown in FIG. 2:
step 3.1: defining the number of layers of the current deep learning network as fr and the maximum depth of the neural network as Floor; defining the number of groups of the current text expansion vector set Vec'(D_t) as r; initializing fr = 1 and letting r = 1;
step 3.2: for the text expansion vector set Vec'(D_t) of the data block D_t at time t, constructing the fr-th layer LSTM network LSTM_fr at time t and setting its number of neurons as O_fr; constructing the fr-th layer fully connected network Dense_fr and setting its number of neurons as H_fr.
Step 3.3: and assigning fr +1 to fr, and turning to the step 3.2 until fr is Floor, so that the construction of the deep learning network is completed.
Step 3.5: initializing all weights W and bias terms b in the network layer into a full 0 vector, wherein the vector dimension is equal to the number of the neurons.
Step 3.6: input the r-th group of text expansion vectors Vec'_r(D_t) of the text expansion vector set Vec'(D_t) into the first-layer LSTM network LSTM_1 at time t, and obtain the intermediate output through formula (1), formula (2), formula (3), formula (4), formula (5) and formula (6), recorded as Out1_t^r = {out1_{t,1}^r, ..., out1_{t,g}^r, ..., out1_{t,G}^r}, where out1_{t,g}^r denotes the LSTM network intermediate output of the g-th text expansion vector in the r-th group of text expansion vectors, and:

out1_{t,g}^r = o_t · tanh(C_t)    (1)

In formula (1), o_t represents the information that the neurons in the LSTM network need to output at time t; C_t is the state information of the neurons in the LSTM network at time t; tanh(·) is the tanh activation function; and:

o_t = σ(W_o^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_o)    (2)

C_t = f_t · C_{t-1} + i_t · C̃_t    (3)

In formula (2), σ(·) is the sigmoid activation function; W_o^t is the weight of the output gate in the LSTM network neuron at time t; b_o is the bias term of the output gate; Vec'_{r,g}(D_t) is the g-th text expansion vector in the r-th group of text expansion vectors at time t.
In formula (3), f_t is the output information of the forget gate of the neuron in the LSTM network at time t; C̃_t is the input information of the neuron in the LSTM network at time t; i_t is the state information of the input gate of the neuron in the LSTM network at time t; C_{t-1} is the state information of the neuron at time t-1; and:

f_t = σ(W_f^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_f)    (4)

i_t = σ(W_i^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_i)    (5)

C̃_t = tanh(W_c^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_c)    (6)

In formulas (4), (5) and (6), W_f^t, W_i^t and W_c^t are respectively the weights corresponding to the forget gate, the input gate and the state update gate in the LSTM network neuron at time t; b_f, b_i and b_c are the bias terms corresponding to the forget gate, the input gate and the state update gate; h_{t-1} denotes the previous intermediate output of the LSTM neuron;
Step 3.7: input the intermediate output Out1_t^r of the LSTM_1 network into the first-layer fully connected network Dense_1 and obtain the corresponding output Dense1_t^r = {dense1_{t,1}^r, ..., dense1_{t,g}^r, ..., dense1_{t,G}^r}, where dense1_{t,g}^r represents the Dense_1 network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense1_{t,g}^r = out1_{t,g}^r · W1_t, and W1_t is the weight corresponding to the fully connected network Dense_1 at time t;
Step 3.8: let fr = 2;
Step 3.9: input the intermediate output Out(fr-1)_t^r of the (fr-1)-th layer LSTM network LSTM_(fr-1) into the fr-th layer LSTM network LSTM_fr and obtain the intermediate output of the fr-th layer LSTM network LSTM_fr at time t, expressed as Out(fr)_t^r;
Step 3.10: input the intermediate output Out(fr)_t^r of the fr-th layer LSTM network LSTM_fr into the fr-th layer fully connected network Dense_fr and obtain the corresponding output Dense(fr)_t^r = {dense(fr)_{t,1}^r, ..., dense(fr)_{t,g}^r, ..., dense(fr)_{t,G}^r}, where dense(fr)_{t,g}^r represents the Dense_fr network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense(fr)_{t,g}^r = out(fr)_{t,g}^r · W_fr_t, and W_fr_t is the weight corresponding to the fr-th layer fully connected network Dense_fr at time t;
Step 3.11: assign fr+1 to fr and go to step 3.9 until fr = Floor;
Step 3.12: use formula (7) to take the weighted sum of the outputs of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t and obtain the output out_{t,g}^r of the expanded text vector Vec'_{r,g}(D_t) at time t; and use the hedging algorithm shown in formula (8) to update the output weight weight_fr_t, obtaining the output weight weight_fr_{t+1} of the fully connected network Dense_fr at time t+1, thereby obtaining the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t+1:

out_{t,g}^r = Σ_{fr=1}^{Floor} weight_fr_t · dense(fr)_{t,g}^r    (7)

In formula (7), weight1_t, weight2_t, ..., weight_fr_t, ..., weight_Floor_t are respectively the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t; weight_fr_t is the contribution weight of the output of the fully connected network Dense_fr;

weight_fr_{t+1} = weight_fr_t · β^{L(dense(fr)_{t,g}^r, l_t^g)}    (8)

In formula (8), β is the hedging parameter, L(·) is the loss function of the corresponding output, and l_t^g is the class label of the g-th text expansion vector in the r-th group of text expansion vectors.
Step 3.13: assign r+1 to r, feed the r-th group Vec'_r(D_t) of the text expansion vector set Vec'(D_t) of the data block D_t at time t into the first-layer LSTM network, and go to step 3.12 until r = R, thereby finally obtaining the output Out_t = {out_{t,1}, ..., out_{t,n}, ..., out_{t,N}} of the original text vector set Vec(D_t) of the data block D_t at time t, where out_{t,n} is the final output of the n-th text expansion vector of the data block D_t at time t.
And step 4: introduce the concept drift semaphore into the neurons of the LSTM network at time t and detect the distribution change of the short text stream; the main structure is shown in FIG. 3:
step 4.1: according to step 2, perform short text expansion and grouping on the data block D_{t+1} at time t+1 to obtain the grouped expanded text vector representation Vec'(D_{t+1}) = {Vec'_1(D_{t+1}), ..., Vec'_r(D_{t+1}), ..., Vec'_R(D_{t+1})}, and input it into the deep learning network to obtain the pre-output Out'_{t+1} = {out'_{t+1,1}, ..., out'_{t+1,n}, ..., out'_{t+1,N}} of the text expansion vector set Vec'(D_{t+1}) of the data block D_{t+1} at time t+1, where out'_{t+1,n} is the pre-output of the n-th text expansion vector of the data block D_{t+1} at time t+1.
step 4.2: calculate the Euclidean distance dis between the output Out_t of the data block D_t at time t and the pre-output Out'_{t+1} of the data block D_{t+1} at time t+1; if dis is smaller than the concept drift threshold Thr, it indicates that no concept drift occurs in the data block D_t at time t, and the concept drift semaphore at time t+1 is set as Drif_{t+1} = 0; otherwise, it indicates that concept drift occurs in the data block D_t at time t, and the concept drift semaphore at time t+1 is set as Drif_{t+1} = a, where a is a constant;
and 5: model updating of the online deep learning network and prediction of short text data flow are completed:
step 5.1: update the LSTM network weights W_o^t, W_f^t, W_i^t, W_c^t and the fully connected network weights W1_t, W2_t, ..., W_fr_t, ..., W_Floor_t at time t through the back-propagation (BP) algorithm together with the concept drift signal Drif_{t+1} at time t+1, obtaining the LSTM network weights W_o^{t+1}, W_f^{t+1}, W_i^{t+1}, W_c^{t+1} and the fully connected network weights W1_{t+1}, W2_{t+1}, ..., W_fr_{t+1}, ..., W_Floor_{t+1} at time t+1. The concept drift signal mainly enhances the weight of the input information at the current moment and reduces the weight of the historical information, lowering the dependence of the output on the historical model, finally completing the handling of concept drift and improving the stability of the model.
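The text above does not give an explicit formula for how Drif_{t+1} enters the update, beyond stating that it boosts current-input information and damps historical information. The sketch below is therefore only one plausible interpretation, scaling the forget-gate and input-gate activations of formulas (4) and (5) by the drift semaphore; it is an assumption, not the patent's stated mechanism.

```python
import numpy as np

def drift_modulated_gates(f, i, drif):
    """One plausible reading of how the drift semaphore reweights an LSTM neuron.

    f, i:  forget-gate and input-gate activations from formulas (4) and (5)
    drif:  concept drift semaphore Drif_{t+1} (0 = no drift, constant a = drift)
    When drift is detected, reliance on the historical state C_{t-1} is damped
    and the contribution of the current input is boosted.
    """
    if drif == 0:
        return f, i
    f_new = f / (1.0 + drif)                     # damp historical information
    i_new = np.minimum(1.0, i * (1.0 + drif))    # boost current-input information
    return f_new, i_new
```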
Step 5.2: predicting data block D at t +1 moment by using LSTM network at t +1 momentt+1Text expansion vector Vec (D)t+1) To obtain a predicted output
Figure BDA0002309098570000159
Wherein the content of the first and second substances,
Figure BDA00023090985700001510
data block D for time t +1t+1The final output of the nth text expansion vector, and
Figure BDA00023090985700001511
is a vector in the L dimension, and is,
Figure BDA00023090985700001512
output summary for corresponding class labelsA rate, L ═ 1, 2., L denotes the total number of class labels;
obtaining
Figure BDA00023090985700001513
Class label of the position of the medium maximum value, and is used as the data block D at the moment of t +1t+1The prediction class label of the nth text expansion vector is obtained, so that the data block D at the t +1 moment is completedt+1Predicting;
step 5.3: assign t+1 to t and go to step 4 until t = T, thereby completing the classification processing of the data block set D of the short text data stream.
In this embodiment, according to the classification method of the short text data stream, as shown in fig. 1, the following steps are specifically performed:
(1) Obtain the 1st data block D_1 in the data stream D. The data are mainly shown in Table 2:
table 2 20 sample data in a short text data stream
(1) For the first training data block D_1 (see Table 2), convert each word into a word vector according to the Word2vec model generated from the external corpus, i.e., convert D_1 into the form of Table 3 using the word vector model:
TABLE 3 vectorized data blocks
(2) Take the text d_1^1 in D_1. With the convolution kernel [0.2, 0.3, 0.5] from step 2 and a window size of 3, the convolution outputs are CNN-out = [-3.69, 3.11, 1.45, -1.05, 0, 2.78, ...], [-4.2, 7.5, -0.1, -1.1, -2.78, -0.56, ...], ..., [1.17, 8.6, -0.04, 1.19, -4.47, 3, ...]. Attach CNN-out to Vec(d_1^1); the expanded Vec'(d_1^1) is then a 198 × 50 matrix.
(3) Input the above expanded matrix Vec'(d_1^1) into the entrance of the first-layer LSTM network in the neural network.
(4) Obtain the drift signal Drif: perform the forward feed of the neural network using formulas (1) to (6) and step 4, and compute the pre-processing result out_t of the data block (in the format shown in Table 7); compute the Euclidean distance dis between the predicted result out_t and the predicted result out_{t-1} of the data block at the previous moment, and judge whether it exceeds the drift threshold Threshold, so as to set the value of the drift signal Drif at the current moment.
(5) According to (4), perform the forward pass of the neural network with the drift signal Drif, and use the expanded data block Vec'(D_1) to complete the updating of the neuron weights in formulas (1), (2), (3), (4), (5) and (6) by back-propagation (BP) with the concept drift signal Drif at the current moment.
(6) The calculation of the data block is completed by using the formula (1), the formula (2), the formula (3), the formula (4), the formula (5), the formula (6) and the step 4, and the probability results of class labels are shown in table 4:
TABLE 4 prediction probability of data block
(7) The most probable class label for each text is thus obtained: { business, us, business, entry }.
(8) The data blocks are iterated indefinitely until the data source stops generating data.

Claims (1)

1. An online short text data stream classification method based on feature extension is characterized by comprising the following steps:
step 1: constructing a Word2Vec model according to an external corpus, and acquiring a Word vector set Vec:
step 1.1: setting, according to a sliding window mechanism, the given short text data stream Stream = {d_1, d_2, ..., d_e, ..., d_E} to be divided by time into T data blocks, recorded as the data block set D = {D_1, D_2, ..., D_t, ..., D_T}, where d_e represents the e-th short text in the short text data stream Stream, D_t represents the data block at time t in the short text data stream Stream, e = 1, 2, ..., E, and t = 1, 2, ..., T;
step 1.2: acquiring a text external corpus, recorded as C' = {d'_1, d'_2, ..., d'_m, ..., d'_M}, m = 1, 2, ..., M, where M represents the total number of texts in the external corpus C', d'_m represents the m-th text, and d'_m = {w_1^m, ..., w_q^m, ..., w_Q^m}, q = 1, 2, ..., Q, where Q denotes the number of words in the m-th text d'_m, w_q^m represents the q-th word in the m-th text d'_m, and w_q^m ∈ Vocab; Vocab represents the set of all different words in the external corpus C', Vocab = {word_1, word_2, ..., word_z, ..., word_Z}, z = 1, 2, ..., Z, where Z represents the total number of words in the word set Vocab and word_z represents the z-th word in the word set Vocab; letting the word vector of the z-th word word_z be recorded as Vec(word_z), thereby obtaining the word vector set Vec = {Vec(word_1), Vec(word_2), ..., Vec(word_z), ..., Vec(word_Z)} corresponding to the word set Vocab;
Step 1.3: updating the word vector set Vec by using a skip-gram model to obtain an updated word vector set Vec' and assigning values to the Vec;
step 2: vectorizing the short text data Stream by using a word vector set Vec and performing text vectorization expansion based on a CNN model:
step 2.1: initializing t = 0; defining and initializing the concept drift semaphore at time t as Drif_t = 0; defining and initializing the concept drift threshold Thr;
step 2.2: acquiring the data block D_t at time t in the short text data stream Stream, with D_t = {(d_1^t, y_1^t), ..., (d_n^t, y_n^t), ..., (d_N^t, y_N^t)}, n = 1, 2, ..., N, where N is the total number of texts in the data block D_t at time t, d_n^t is the n-th text in the data block D_t at time t, d_n^t = {w_{n,1}^t, ..., w_{n,i}^t, ..., w_{n,I}^t}, i = 1, 2, ..., I, where w_{n,i}^t represents the i-th word in the n-th text d_n^t of the data block D_t at time t, and y_n^t is the class label of the n-th text d_n^t in the data block D_t at time t;
step 2.3: obtaining from the word vector set Vec the word vector Vec(w_{n,i}^t) of the i-th word w_{n,i}^t of the n-th text d_n^t in the data block D_t at time t, thereby obtaining the set of word vectors of all words in the n-th text d_n^t of the data block D_t at time t, expressed as Vec(d_n^t) = {Vec(w_{n,1}^t), ..., Vec(w_{n,I}^t), Vec(0_1), ..., Vec(0_j), ..., Vec(0_{P-I})}, where I ≤ P, P represents the maximum number of words of a short text in the short text data stream Stream, Vec(0_j) is the j-th all-zero padding vector, and 1 ≤ j ≤ P-I;
step 2.4: obtaining, according to step 2.3, the text vector set of the data block D_t at time t, expressed as Vec(D_t) = {Vec(d_1^t), ..., Vec(d_n^t), ..., Vec(d_N^t)}; grouping the data block D_t at time t into groups of G texts each, so that Vec(D_t) is divided into R groups, recorded as Vec(D_t) = {Vec_1(D_t), ..., Vec_r(D_t), ..., Vec_R(D_t)}, with R = N/G;
step 2.5: letting the convolution kernel in the CNN model be Core, the size of the convolution kernel be Row × Col, and the stride be Strides; for the set of word vectors Vec(d_n^t) of the n-th text d_n^t in the data block D_t at time t, performing the convolution operation to obtain a semantic matrix of dimension ((P-Row)/Strides + 1) × (S-Col + 1), where S is the dimension of a word vector, and merging the semantic matrix with the set of word vectors Vec(d_n^t) to obtain the input representation Vec'(d_n^t) of the n-th text d_n^t of the data block D_t at time t, thereby obtaining all text expansion vectors of the data block D_t at time t, expressed as Vec'(D_t) = {Vec'_1(D_t), ..., Vec'_r(D_t), ..., Vec'_R(D_t)}, where Vec'_r(D_t) = {Vec'_{r,1}(D_t), ..., Vec'_{r,g}(D_t), ..., Vec'_{r,G}(D_t)}, g = 1, 2, ..., G;
And step 3: constructing an online deep learning network for the expanded text vectors:
step 3.1: defining the number of layers of the current deep learning network as fr and the maximum depth of the neural network as Floor; defining the number of groups of the current text expansion vector set Vec'(D_t) as r; initializing fr = 1 and letting r = 1;
step 3.2: for the text expansion vector set Vec'(D_t) of the data block D_t at time t, constructing the fr-th layer LSTM network LSTM_fr at time t and setting its number of neurons as O_fr; constructing the fr-th layer fully connected network Dense_fr and setting its number of neurons as H_fr;
step 3.3: assigning fr +1 to fr, and turning to step 3.2 until fr is Floor, so as to complete the construction of the deep learning network;
step 3.4: inputting the r-th group of text expansion vectors Vec'_r(D_t) of the text expansion vector set Vec'(D_t) into the first-layer LSTM network LSTM_1 at time t and obtaining the intermediate output, recorded as Out1_t^r = {out1_{t,1}^r, ..., out1_{t,g}^r, ..., out1_{t,G}^r}, where out1_{t,g}^r denotes the LSTM network intermediate output of the g-th text expansion vector in the r-th group of text expansion vectors, and:

out1_{t,g}^r = o_t · tanh(C_t)    (1)

In formula (1), o_t represents the information that the neurons in the LSTM network need to output at time t; C_t is the state information of the neurons in the LSTM network at time t; tanh(·) is the tanh activation function; and:

o_t = σ(W_o^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_o)    (2)

C_t = f_t · C_{t-1} + i_t · C̃_t    (3)

In formula (2), σ(·) is the sigmoid activation function; W_o^t is the weight of the output gate in the LSTM network neuron at time t; b_o is the bias term of the output gate; Vec'_{r,g}(D_t) is the g-th text expansion vector in the r-th group of text expansion vectors at time t;
In formula (3), f_t is the output information of the forget gate of the neuron in the LSTM network at time t; C̃_t is the input information of the neuron in the LSTM network at time t; i_t is the state information of the input gate of the neuron in the LSTM network at time t; C_{t-1} is the state information of the neuron at time t-1; and:

f_t = σ(W_f^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_f)    (4)

i_t = σ(W_i^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_i)    (5)

C̃_t = tanh(W_c^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_c)    (6)

In formulas (4), (5) and (6), W_f^t, W_i^t and W_c^t are respectively the weights corresponding to the forget gate, the input gate and the state update gate in the LSTM network neuron at time t; b_f, b_i and b_c are the bias terms corresponding to the forget gate, the input gate and the state update gate; h_{t-1} denotes the previous intermediate output of the LSTM neuron;
step 3.5: inputting the intermediate output Out1_t^r of the LSTM_1 network into the first-layer fully connected network Dense_1 and obtaining the corresponding output Dense1_t^r = {dense1_{t,1}^r, ..., dense1_{t,g}^r, ..., dense1_{t,G}^r}, where dense1_{t,g}^r represents the Dense_1 network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense1_{t,g}^r = out1_{t,g}^r · W1_t, and W1_t is the weight corresponding to the fully connected network Dense_1 at time t;
step 3.6: letting fr = 2;
step 3.7: inputting the intermediate output Out(fr-1)_t^r of the (fr-1)-th layer LSTM network LSTM_(fr-1) into the fr-th layer LSTM network LSTM_fr and obtaining the intermediate output of the fr-th layer LSTM network LSTM_fr at time t, expressed as Out(fr)_t^r;
step 3.8: inputting the intermediate output Out(fr)_t^r of the fr-th layer LSTM network LSTM_fr into the fr-th layer fully connected network Dense_fr and obtaining the corresponding output Dense(fr)_t^r = {dense(fr)_{t,1}^r, ..., dense(fr)_{t,g}^r, ..., dense(fr)_{t,G}^r}, where dense(fr)_{t,g}^r represents the Dense_fr network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense(fr)_{t,g}^r = out(fr)_{t,g}^r · W_fr_t, and W_fr_t is the weight corresponding to the fr-th layer fully connected network Dense_fr at time t;
step 3.9: assigning fr+1 to fr and going to step 3.7 until fr = Floor;
step 3.10: carrying out weighted summation of the outputs of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t by formula (7), and obtaining the output out_{t,g}^r of the expanded text vector Vec'_{r,g}(D_t) at time t:

out_{t,g}^r = Σ_{fr=1}^{Floor} weight_fr_t · dense(fr)_{t,g}^r    (7)

In formula (7), weight1_t, weight2_t, ..., weight_fr_t, ..., weight_Floor_t are respectively the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t; weight_fr_t is the contribution weight of the output of the fully connected network Dense_fr;
updating the output weight weight_fr_t by the hedging algorithm shown in formula (8) to obtain the output weight weight_fr_{t+1} of the fully connected network Dense_fr at time t+1, thereby obtaining the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t+1:

weight_fr_{t+1} = weight_fr_t · β^{L(dense(fr)_{t,g}^r, l_t^g)}    (8)

In formula (8), β is the hedging parameter, L(·) is the loss function of the corresponding output, and l_t^g is the class label of the g-th text expansion vector in the r-th group of text expansion vectors;
step 3.11: assigning r+1 to r, feeding the r-th group Vec'_r(D_t) of the text expansion vector set Vec'(D_t) of the data block D_t at time t into the first-layer LSTM network, and going to step 3.10 until r = R, thereby finally obtaining the output Out_t = {out_{t,1}, ..., out_{t,n}, ..., out_{t,N}} of the original text vector set Vec(D_t) of the data block D_t at time t, where out_{t,n} is the final output of the n-th text expansion vector of the data block D_t at time t;
and 4, step 4: introducing concept drift semaphore to neurons in the LSTM network at the t moment and detecting distribution change of short text stream:
step 4.1: according to step 2, the data block D at the t +1 momentt+1Performing short text extension and grouping to obtain grouped extended text vector representation
Figure FDA0002309098560000055
And inputting the data into the deep learning network to obtain the data block D at the t +1 momentt+1Text expansion vector Vec (D)t+1) Pre-output of
Figure FDA0002309098560000056
Figure FDA0002309098560000057
Data block D for time t +1t+1Pre-outputting the nth text expansion vector;
step 4.2: calculating the data block D at the time ttOutput of (2)
Figure FDA0002309098560000058
And the data block D at the time of t +1t+1Out 'of pre-output'(t+1)If the euclidean distance dis is smaller than the conceptual drift threshold Thr, it represents the data block D at the time ttNo concept drift occurs, and the concept drift signal amount Drif at the time t +1 is sett+10; otherwise, it represents the data block D at time ttConcept drift occurs and the concept drift signal at the t +1 moment is setAmount Drift+1A is a constant;
and 5: model updating and short text data stream prediction of the online deep learning network:
step 5.1: drift signal Drif through the inverse BP algorithm and the concept at time t +1t+1Updating the LSTM network weight at the time t
Figure FDA0002309098560000061
And a fully connected network weight W1t,W2t,...,W-frt,...,W-Floort(ii) a Obtaining LSTM network weight at t +1 moment
Figure FDA0002309098560000062
And a fully connected network weight W1t,W2t,...,W_frt,...,W_Floort
Step 5.2: use the LSTM network at time t+1 to predict the text expansion vector Vec(D_{t+1}) of data block D_{t+1} at time t+1, obtaining the predicted output Out_{t+1}, where out_{t+1}^n, the final output of the n-th text expansion vector of data block D_{t+1} at time t+1, is an L-dimensional vector whose l-th component (l = 1, 2, ..., L, with L the total number of class labels) is the output probability of the corresponding class label; take the class label at the position of the maximum value in the final output out_{t+1}^n as the predicted class label of the n-th text expansion vector of data block D_{t+1} at time t+1, thereby completing the prediction of data block D_{t+1} at time t+1;
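Step 5.2 assigns each text the class label whose output probability is largest. A short sketch, assuming the per-text outputs are collected row-wise into an array and that label_list holds the L class labels in output order:

    import numpy as np

    def predict_block(outputs, label_list):
        # outputs: (n_texts, L) matrix; row n is the final output of the n-th text
        # label_list: the L class labels, in the same order as the output dimensions
        idx = np.argmax(outputs, axis=1)      # position of the maximum probability
        return [label_list[i] for i in idx]   # predicted class label per text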
Step 5.3: assign t+1 to t and go to Step 4 until t = T, thereby completing the classification of the data block set D of the short text data stream.
CN201911251229.1A 2019-12-09 2019-12-09 Online short text data stream classification method based on feature extension Active CN111026846B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911251229.1A CN111026846B (en) 2019-12-09 2019-12-09 Online short text data stream classification method based on feature extension

Publications (2)

Publication Number Publication Date
CN111026846A true CN111026846A (en) 2020-04-17
CN111026846B (en) 2021-08-17

Family

ID=70205012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911251229.1A Active CN111026846B (en) 2019-12-09 2019-12-09 Online short text data stream classification method based on feature extension

Country Status (1)

Country Link
CN (1) CN111026846B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7493253B1 (en) * 2002-07-12 2009-02-17 Language And Computing, Inc. Conceptual world representation natural language understanding system and method
CN101794303A (en) * 2010-02-11 2010-08-04 重庆邮电大学 Method and device for classifying text and structuring text classifier by adopting characteristic expansion
CN106934035A (en) * 2017-03-14 2017-07-07 合肥工业大学 Concept drift detection method in a kind of multi-tag data flow based on class and feature distribution
CN107679228A (en) * 2017-10-23 2018-02-09 合肥工业大学 A kind of short text data stream sorting technique based on short text extension and concept drift detection
US20190325029A1 (en) * 2018-04-18 2019-10-24 HelpShift, Inc. System and methods for processing and interpreting text messages
CN109918667A (en) * 2019-03-06 2019-06-21 合肥工业大学 The Fast incremental formula classification method of short text data stream based on word2vec model
CN109947945A (en) * 2019-03-19 2019-06-28 合肥工业大学 Word-based vector sum integrates the textstream classification method of SVM

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PEIPEI LI et al.: "Learning From Short Text Streams With Topic Drifts", IEEE Transactions on Cybernetics *
LYU Chaozhen et al.: "Short Text Classification Based on LDA Feature Extension", Computer Engineering and Applications *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114513328A (en) * 2021-12-31 2022-05-17 西安电子科技大学 Network traffic intrusion detection method based on concept drift and deep learning
CN114513328B (en) * 2021-12-31 2023-02-10 西安电子科技大学 Network traffic intrusion detection method based on concept drift and deep learning
CN114422450A (en) * 2022-01-21 2022-04-29 中国人民解放军国防科技大学 Network flow analysis method and device based on multi-source network flow data
CN114422450B (en) * 2022-01-21 2024-01-19 中国人民解放军国防科技大学 Network traffic analysis method and device based on multi-source network traffic data

Also Published As

Publication number Publication date
CN111026846B (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN109284506B (en) User comment emotion analysis system and method based on attention convolution neural network
CN110717334B (en) Text emotion analysis method based on BERT model and double-channel attention
CN109271522B (en) Comment emotion classification method and system based on deep hybrid model transfer learning
CN111274398B (en) Method and system for analyzing comment emotion of aspect-level user product
CN107608956B (en) Reader emotion distribution prediction algorithm based on CNN-GRNN
Vateekul et al. A study of sentiment analysis using deep learning techniques on Thai Twitter data
Cao et al. Deep neural networks for learning graph representations
Xu et al. Investigation on the Chinese text sentiment analysis based on convolutional neural networks in deep learning.
CN111125358B (en) Text classification method based on hypergraph
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
Sari et al. Text Classification Using Long Short-Term Memory with GloVe
CN111160037A (en) Fine-grained emotion analysis method supporting cross-language migration
CN112231477B (en) Text classification method based on improved capsule network
CN108388654B (en) Sentiment classification method based on turning sentence semantic block division mechanism
CN111026846B (en) Online short text data stream classification method based on feature extension
CN110297888A (en) A kind of domain classification method based on prefix trees and Recognition with Recurrent Neural Network
CN111460157A (en) Cyclic convolution multitask learning method for multi-field text classification
Etaiwi et al. Deep learning based techniques for sentiment analysis: A survey
CN114925205B (en) GCN-GRU text classification method based on contrast learning
CN110825850A (en) Natural language theme classification method and device
CN115687609A (en) Zero sample relation extraction method based on Prompt multi-template fusion
Nazarenko et al. Investigation of the Deep Learning Approaches to Classify Emotions in Texts.
Nithya et al. Deep learning based analysis on code-mixed tamil text for sentiment classification with pre-trained ulmfit
CN114722835A (en) Text emotion recognition method based on LDA and BERT fusion improved model
Wahyudi et al. Deep learning for multi-aspect sentiment analysis of tiktok app using the rnn-lstm method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant