CN111026846A - Online short text data stream classification method based on feature extension - Google Patents

Online short text data stream classification method based on feature extension Download PDF

Info

Publication number
CN111026846A
Authority
CN
China
Prior art keywords
text
time
data block
network
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911251229.1A
Other languages
Chinese (zh)
Other versions
CN111026846B (en)
Inventor
李培培
胡阳
胡学钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN201911251229.1A priority Critical patent/CN111026846B/en
Publication of CN111026846A publication Critical patent/CN111026846A/en
Application granted granted Critical
Publication of CN111026846B publication Critical patent/CN111026846B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses an online short text data stream classification method based on feature expansion, which comprises the following steps: 1, constructing a Word2Vec model from an external corpus to obtain a word vector set Vec; 2, vectorizing the short text data stream with Vec and performing text vectorization expansion based on a CNN model; 3, constructing an online deep learning network for the expanded text vectors; 4, introducing a concept drift semaphore into the neurons of the LSTM network and detecting distribution changes of the short text stream; and 5, completing model updating of the online deep learning network and prediction of the short text data stream. The method can effectively improve the classification accuracy of short text data streams, correctly detect concept drift and adjust the model, thereby quickly adapting to the short text data stream environment.

Description

Online short text data stream classification method based on feature extension
Technical Field
The invention belongs to the field of short text data stream mining and online deep learning, and particularly relates to a continuously variable, rapid and infinite short text data stream classification problem.
Background
With the rise of information technologies such as mobile development and micro-service frameworks, massive, high-speed and dynamically changing data — data streams — are emerging in practical application fields such as social networks, online shopping and sensor networks. In the social field, thanks to the popularity of social media and forums, short texts have flooded into our lives, for example user comments on microblogs, tweets and Facebook and interactions on forums. These short texts contain a large amount of information from many domains, such as sports, education and science. Compared with ordinary texts, short texts are sparse, real-time, immediate, non-standard and dynamically variable, which leads to topic evolution. For example, a microblog post is limited to 140 characters, and many posts contain only one sentence or even one phrase; hot-topic ranking lists and popular words on the network keep changing. Users generate huge numbers of short texts while interacting on network platforms, and the data volume grows rapidly: according to incomplete statistics, users of current mainstream interaction platforms (microblog, Facebook, etc.) produce on average up to 346 comments per second. This demands a higher data throughput for short text processing, otherwise data will accumulate over time. These problems pose serious challenges to existing short text classification and data stream classification methods:
Challenge one: in traditional short text classification, characteristics of short texts such as high-dimensional sparsity make conventional text classification techniques ineffective. Current solutions fall into two groups: one expands the short text with an external corpus and then classifies it with a traditional classifier such as naive Bayes, Support Vector Machine (SVM) or decision tree; the other expands the short text with implicit statistical information and then classifies it, e.g. with LDA or KNN. However, the stability of these models depends heavily on the completeness of the external corpus, which results in poor model portability and stability.
Challenge two: because streaming data is massive and unbounded, traditional multi-pass deep learning frameworks built for static data sets (such as Text-CNN and RNN) cannot process it well, and the model cannot attain good performance.
Challenge three: short text streams change dynamically, and the current mainstream deep learning frameworks, limited by their static network-layer architectures, cannot quickly adapt to continuously changing data streams, so traditional neural network models cannot handle the dynamic variability of short texts well.
Disclosure of Invention
In order to avoid the defects of the prior art, the invention provides an online short text data stream classification method based on feature extension, so as to solve the short text data stream classification problem in practical application fields, thereby effectively improving the classification accuracy of short text data streams, reducing the time consumed in model construction, and quickly adapting to short text data stream classification in the presence of concept drift.
In order to achieve the aim, the invention adopts the following technical scheme:
the invention relates to an online short text data stream classification method based on feature expansion, which is characterized by comprising the following steps of:
step 1: constructing a Word2Vec model according to an external corpus, and acquiring a Word vector set Vec:
step 1.1: setting, according to a sliding window mechanism, the given short text data stream Stream = {d_1, d_2, ..., d_e, ..., d_E} to be divided by time into T data blocks, recorded as the data block set D = {D_1, D_2, ..., D_t, ..., D_T}, where d_e represents the e-th short text in the short text data stream Stream, D_t represents the data block at time t in the short text data stream Stream, e = 1, 2, ..., E, and t = 1, 2, ..., T;
step 1.2: acquiring a text external corpus, recorded as C' = {d'_1, d'_2, ..., d'_m, ..., d'_M}, m = 1, 2, ..., M, where M represents the total number of texts in the external corpus C', d'_m represents the m-th text, and d'_m = {w_1^m, w_2^m, ..., w_q^m, ..., w_Q^m}, q = 1, 2, ..., Q, where Q denotes the number of words in the m-th text d'_m, w_q^m represents the q-th word in the m-th text d'_m, and w_q^m ∈ Vocab; Vocab represents the set of all different words in the external corpus C', Vocab = {word_1, word_2, ..., word_z, ..., word_Z}, z = 1, 2, ..., Z, where Z represents the total number of words in the word set Vocab and word_z represents the z-th word in the word set Vocab; letting the word vector of the z-th word word_z be recorded as Vec(word_z), thereby obtaining the word vector set Vec = {Vec(word_1), Vec(word_2), ..., Vec(word_z), ..., Vec(word_Z)} corresponding to the word set Vocab;
Step 1.3: updating the word vector set Vec by using a skip-gram model to obtain an updated word vector set Vec' and assigning values to the Vec;
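Steps 1.1 to 1.3 amount to training a skip-gram word-vector model on the external corpus and keeping one vector per vocabulary word. A minimal sketch using the gensim library is shown below; the corpus file name and the hyper-parameter values (vector size S = 50, window 5) are illustrative assumptions, not values fixed by the method, and any other skip-gram implementation that yields a vector per word in Vocab could stand in here.

```python
# Sketch of step 1: build a skip-gram Word2Vec model from the external corpus C'
# and collect the word vector set Vec. The file name and hyper-parameters below
# are assumptions for illustration only.
from gensim.models import Word2Vec

# One whitespace-tokenized text per line in the external corpus.
with open("external_corpus.txt", encoding="utf-8") as f:
    corpus = [line.split() for line in f if line.strip()]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # S: dimension of each word vector
    window=5,         # context window used by the skip-gram model
    sg=1,             # sg=1 selects the skip-gram architecture
    hs=1,             # hierarchical softmax, matching the Huffman-tree training in the embodiment
    min_count=1,
)

# Word vector set Vec = {Vec(word_z)} for every word_z in the vocabulary Vocab.
Vec = {word: model.wv[word] for word in model.wv.index_to_key}
```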
step 2: vectorizing the short text data Stream by using a word vector set Vec and performing text vectorization expansion based on a CNN model:
step 2.1: initializing t = 0; defining and initializing the concept drift semaphore at time t as Drif_t = 0; defining and initializing the concept drift threshold Thr;
step 2.2: acquiring the data block D_t at time t in the short text data stream Stream, with D_t = {(d_1^t, y_1^t), ..., (d_n^t, y_n^t), ..., (d_N^t, y_N^t)}, n = 1, 2, ..., N, where N is the total number of texts in the data block D_t at time t, d_n^t is the n-th text in the data block D_t at time t, d_n^t = {w_{n,1}^t, w_{n,2}^t, ..., w_{n,i}^t, ..., w_{n,I}^t}, i = 1, 2, ..., I, where w_{n,i}^t represents the i-th word in the n-th text d_n^t of the data block D_t at time t, and y_n^t is the class label of the n-th text d_n^t in the data block D_t at time t;
step 2.3: obtaining from the word vector set Vec the word vector Vec(w_{n,i}^t) of the i-th word w_{n,i}^t of the n-th text d_n^t in the data block D_t at time t, thereby obtaining the set of word vectors of all words in the n-th text d_n^t of the data block D_t at time t, expressed as Vec(d_n^t) = {Vec(w_{n,1}^t), ..., Vec(w_{n,I}^t), Vec(0_1), ..., Vec(0_j), ..., Vec(0_{P-I})}, where I ≤ P, P represents the maximum number of words of a short text in the short text data stream Stream, Vec(0_j) is the j-th all-zero padding vector, and 1 ≤ j ≤ P-I;
step 2.4: obtaining, according to step 2.3, the text vector set of the data block D_t at time t, expressed as Vec(D_t) = {Vec(d_1^t), ..., Vec(d_n^t), ..., Vec(d_N^t)}; grouping the data block D_t at time t into groups of G texts each, so that Vec(D_t) is divided into R groups, recorded as Vec(D_t) = {Vec_1(D_t), ..., Vec_r(D_t), ..., Vec_R(D_t)}, r = 1, 2, ..., R, with R = N/G;
step 2.5: letting the convolution kernel in the CNN model be Core, the size of the convolution kernel be Row × Col, and the stride be Strides; for the set of word vectors Vec(d_n^t) of the n-th text d_n^t in the data block D_t at time t, performing the convolution operation to obtain a semantic matrix of dimension ((P-Row)/Strides + 1) × (S-Col + 1), where S is the dimension of a word vector, and merging the semantic matrix with the set of word vectors Vec(d_n^t) to obtain the input representation Vec'(d_n^t) of the n-th text d_n^t of the data block D_t at time t, thereby obtaining all text expansion vectors of the data block D_t at time t, expressed as Vec'(D_t) = {Vec'_1(D_t), ..., Vec'_r(D_t), ..., Vec'_R(D_t)}, where Vec'_r(D_t) = {Vec'_{r,1}(D_t), ..., Vec'_{r,g}(D_t), ..., Vec'_{r,G}(D_t)} and Vec'_{r,g}(D_t) denotes the expansion vector of the g-th text in the r-th group, g = 1, 2, ..., G;
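A rough NumPy sketch of step 2.5 — zero-padding the word-vector matrix to P rows, sliding a Row × Col kernel over it with the given stride, and stacking the resulting semantic matrix under the original matrix — is given below. The kernel values, the right-padding of the semantic matrix to width S so that the two blocks can be stacked, and the loop-based convolution are assumptions of this sketch; the patent does not spell out those details.

```python
import numpy as np

def expand_text(word_vectors, P, kernel, stride=1):
    """Build the expanded representation Vec'(d) of one short text (step 2.5).

    word_vectors: list of 1-D arrays of length S (one per word, I <= P)
    kernel:       2-D array of shape (Row, Col)
    """
    S = word_vectors[0].shape[0]
    mat = np.zeros((P, S))
    mat[:len(word_vectors)] = np.stack(word_vectors)       # zero-pad to P rows

    Row, Col = kernel.shape
    out_rows = (P - Row) // stride + 1
    out_cols = S - Col + 1
    sem = np.empty((out_rows, out_cols))
    for r in range(out_rows):
        for c in range(out_cols):
            patch = mat[r * stride : r * stride + Row, c : c + Col]
            sem[r, c] = np.sum(patch * kernel)              # one convolution step

    # The semantic matrix is appended below the word-vector matrix; it is
    # zero-padded on the right to width S here so the two blocks stack cleanly
    # (an assumption; with a column width of 1 no padding is needed at all).
    sem_padded = np.zeros((out_rows, S))
    sem_padded[:, :out_cols] = sem
    return np.vstack([mat, sem_padded])
```

With P = 100, S = 50 and the 3 × 1 kernel of the worked example later in the description, this yields the 198 × 50 expanded matrix mentioned there.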
And step 3: constructing an online deep learning network for the expanded text vectors:
step 3.1: defining the number of layers of the current deep learning network as fr and the maximum depth of the neural network as Floor; defining the number of groups of the current text expansion vector set Vec'(D_t) as r; initializing fr = 1 and letting r = 1;
step 3.2: for the text expansion vector set Vec'(D_t) of the data block D_t at time t, constructing the fr-th layer LSTM network LSTM_fr at time t and setting its number of neurons as O_fr; constructing the fr-th layer fully connected network Dense_fr and setting its number of neurons as H_fr;
step 3.3: assigning fr +1 to fr, and turning to step 3.2 until fr is Floor, so as to complete the construction of the deep learning network;
step 3.4: inputting the r-th group of text expansion vectors Vec'_r(D_t) of the text expansion vector set Vec'(D_t) into the first-layer LSTM network LSTM_1 at time t and obtaining the intermediate output, recorded as Out1_t^r = {out1_{t,1}^r, ..., out1_{t,g}^r, ..., out1_{t,G}^r}, where out1_{t,g}^r denotes the LSTM network intermediate output of the g-th text expansion vector in the r-th group of text expansion vectors, and:

out1_{t,g}^r = o_t · tanh(C_t)    (1)

In formula (1), o_t represents the information that the neurons in the LSTM network need to output at time t; C_t is the state information of the neurons in the LSTM network at time t; tanh(·) is the tanh activation function; and:

o_t = σ(W_o^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_o)    (2)

C_t = f_t · C_{t-1} + i_t · C̃_t    (3)

In formula (2), σ(·) is the sigmoid activation function; W_o^t is the weight of the output gate in the LSTM network neuron at time t; b_o is the bias term of the output gate; Vec'_{r,g}(D_t) is the g-th text expansion vector in the r-th group of text expansion vectors at time t;
In formula (3), f_t is the output information of the forget gate of the neuron in the LSTM network at time t; C̃_t is the input information of the neuron in the LSTM network at time t; i_t is the state information of the input gate of the neuron in the LSTM network at time t; C_{t-1} is the state information of the neuron at time t-1; and:

f_t = σ(W_f^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_f)    (4)

i_t = σ(W_i^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_i)    (5)

C̃_t = tanh(W_c^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_c)    (6)

In formulas (4), (5) and (6), W_f^t, W_i^t and W_c^t are respectively the weights corresponding to the forget gate, the input gate and the state update gate in the LSTM network neuron at time t; b_f, b_i and b_c are the bias terms corresponding to the forget gate, the input gate and the state update gate; h_{t-1} denotes the previous intermediate output of the LSTM neuron;
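Formulas (1)-(6) are the standard LSTM gate equations. For illustration, a NumPy sketch of one neuron update is given below; the dictionary layout of the weights, the flattening of the expanded text vector into a single input vector, and the concatenation [h_{t-1}, x] are assumptions of this sketch rather than details fixed by the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    """One LSTM neuron update following formulas (1)-(6).

    x:       input vector (e.g. a flattened expanded text vector Vec'_{r,g}(D_t))
    h_prev:  previous intermediate output h_{t-1}
    C_prev:  previous cell state C_{t-1}
    W, b:    dicts holding W_f, W_i, W_c, W_o and b_f, b_i, b_c, b_o
    """
    z = np.concatenate([h_prev, x])                  # [h_{t-1}, x]
    f = sigmoid(W["f"] @ z + b["f"])                 # forget gate, formula (4)
    i = sigmoid(W["i"] @ z + b["i"])                 # input gate, formula (5)
    C_tilde = np.tanh(W["c"] @ z + b["c"])           # candidate state, formula (6)
    C = f * C_prev + i * C_tilde                     # cell state update, formula (3)
    o = sigmoid(W["o"] @ z + b["o"])                 # output gate, formula (2)
    h = o * np.tanh(C)                               # intermediate output, formula (1)
    return h, C
```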
step 3.5: inputting the intermediate output Out1_t^r of the LSTM_1 network into the first-layer fully connected network Dense_1 and obtaining the corresponding output Dense1_t^r = {dense1_{t,1}^r, ..., dense1_{t,g}^r, ..., dense1_{t,G}^r}, where dense1_{t,g}^r represents the Dense_1 network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense1_{t,g}^r = out1_{t,g}^r · W1_t, and W1_t is the weight corresponding to the fully connected network Dense_1 at time t;
step 3.6: letting fr = 2;
step 3.7: inputting the intermediate output Out(fr-1)_t^r of the (fr-1)-th layer LSTM network LSTM_(fr-1) into the fr-th layer LSTM network LSTM_fr and obtaining the intermediate output of the fr-th layer LSTM network LSTM_fr at time t, expressed as Out(fr)_t^r;
step 3.8: inputting the intermediate output Out(fr)_t^r of the fr-th layer LSTM network LSTM_fr into the fr-th layer fully connected network Dense_fr and obtaining the corresponding output Dense(fr)_t^r = {dense(fr)_{t,1}^r, ..., dense(fr)_{t,g}^r, ..., dense(fr)_{t,G}^r}, where dense(fr)_{t,g}^r represents the Dense_fr network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense(fr)_{t,g}^r = out(fr)_{t,g}^r · W_fr_t, and W_fr_t is the weight corresponding to the fr-th layer fully connected network Dense_fr at time t;
step 3.9: assigning fr+1 to fr and going to step 3.7 until fr = Floor;
step 3.10: carrying out weighted summation of the outputs of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t by formula (7), and obtaining the output out_{t,g}^r of the expanded text vector Vec'_{r,g}(D_t) at time t:

out_{t,g}^r = Σ_{fr=1}^{Floor} weight_fr_t · dense(fr)_{t,g}^r    (7)

In formula (7), weight1_t, weight2_t, ..., weight_fr_t, ..., weight_Floor_t are respectively the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t; weight_fr_t is the contribution weight of the output of the fully connected network Dense_fr;
updating the output weight weight_fr_t by the hedging algorithm shown in formula (8) to obtain the output weight weight_fr_{t+1} of the fully connected network Dense_fr at time t+1, thereby obtaining the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t+1:

weight_fr_{t+1} = weight_fr_t · β^{L(dense(fr)_{t,g}^r, l_t^g)}    (8)

In formula (8), β is the hedging parameter, L(·) is the loss function of the corresponding output, and l_t^g is the class label of the g-th text expansion vector in the r-th group of text expansion vectors;
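Formula (8) is the classical hedge-style weight update used in online deep learning. A minimal sketch follows; the choice of cross-entropy as the per-layer loss L(·) and the final renormalization of the weights are assumptions of this sketch, not details stated by the patent.

```python
import numpy as np

def hedge_update(weights, layer_outputs, label_onehot, beta=0.99):
    """Update the per-layer output weights weight_fr_t -> weight_fr_{t+1} (formula (8)).

    weights:       1-D array of current output weights, one per Dense_fr layer
    layer_outputs: list of class-probability vectors, one per Dense_fr output
    label_onehot:  one-hot class label l_t^g of the current expanded text vector
    beta:          hedging parameter in (0, 1)
    """
    # Per-layer loss L(Dense_fr(x), l); cross-entropy is assumed here.
    losses = np.array([
        -np.sum(label_onehot * np.log(np.clip(p, 1e-12, 1.0)))
        for p in layer_outputs
    ])
    new_w = weights * np.power(beta, losses)   # weight_fr_{t+1} = weight_fr_t * beta^L
    return new_w / new_w.sum()                 # renormalize (assumption)

# Combined prediction, formula (7): out = sum_fr weight_fr_t * Dense_fr output.
def combine_outputs(weights, layer_outputs):
    return sum(w * p for w, p in zip(weights, layer_outputs))
```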
step 3.11: assigning r+1 to r, feeding the r-th group Vec'_r(D_t) of the text expansion vector set Vec'(D_t) of the data block D_t at time t into the first-layer LSTM network, and going to step 3.10 until r = R, thereby finally obtaining the output Out_t = {out_{t,1}, out_{t,2}, ..., out_{t,n}, ..., out_{t,N}} of the original text vector set Vec(D_t) of the data block D_t at time t, where out_{t,n} is the final output of the n-th text expansion vector of the data block D_t at time t;
and 4, step 4: introducing concept drift semaphore to neurons in the LSTM network at the t moment and detecting distribution change of short text stream:
step 4.1: performing, according to step 2, short text expansion and grouping on the data block D_{t+1} at time t+1 to obtain the grouped expanded text vector representation Vec'(D_{t+1}) = {Vec'_1(D_{t+1}), ..., Vec'_r(D_{t+1}), ..., Vec'_R(D_{t+1})}, and inputting it into the deep learning network to obtain the pre-output Out'_{t+1} = {out'_{t+1,1}, ..., out'_{t+1,n}, ..., out'_{t+1,N}} of the text expansion vector set Vec'(D_{t+1}) of the data block D_{t+1} at time t+1, where out'_{t+1,n} is the pre-output of the n-th text expansion vector of the data block D_{t+1} at time t+1;
step 4.2: calculating the Euclidean distance dis between the output Out_t of the data block D_t at time t and the pre-output Out'_{t+1} of the data block D_{t+1} at time t+1; if dis is smaller than the concept drift threshold Thr, it indicates that no concept drift occurs in the data block D_t at time t, and the concept drift semaphore at time t+1 is set as Drif_{t+1} = 0; otherwise, it indicates that concept drift occurs in the data block D_t at time t, and the concept drift semaphore at time t+1 is set as Drif_{t+1} = a, where a is a constant;
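A small sketch of the drift test in step 4.2 is given below. The patent only specifies a Euclidean distance between the two block-level outputs; averaging the per-text distances, as done here, is one reasonable reading and therefore an assumption.

```python
import numpy as np

def detect_drift(out_t, out_pre_t1, thr, drift_constant=1.0):
    """Set the concept-drift semaphore Drif_{t+1} as described in step 4.2.

    out_t:       outputs of data block D_t, shape (N, L)
    out_pre_t1:  pre-outputs of data block D_{t+1}, shape (N, L)
    thr:         concept drift threshold Thr
    Returns 0 when no drift is detected, otherwise the constant a.
    """
    # Euclidean distance between the two outputs; per-text distances are
    # averaged here (assumption).
    dis = np.mean(np.linalg.norm(out_t - out_pre_t1, axis=1))
    return 0.0 if dis < thr else drift_constant
```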
and 5: model updating and short text data stream prediction of the online deep learning network:
step 5.1: updating the LSTM network weights W_o^t, W_f^t, W_i^t, W_c^t and the fully connected network weights W1_t, W2_t, ..., W_fr_t, ..., W_Floor_t at time t through the back-propagation (BP) algorithm together with the concept drift signal Drif_{t+1} at time t+1, thereby obtaining the LSTM network weights W_o^{t+1}, W_f^{t+1}, W_i^{t+1}, W_c^{t+1} and the fully connected network weights W1_{t+1}, W2_{t+1}, ..., W_fr_{t+1}, ..., W_Floor_{t+1} at time t+1;
step 5.2: using the LSTM network at time t+1 to predict the text expansion vector set Vec'(D_{t+1}) of the data block D_{t+1} at time t+1 and obtaining the predicted output Out_{t+1} = {out_{t+1,1}, ..., out_{t+1,n}, ..., out_{t+1,N}}, where out_{t+1,n} is the final output of the n-th text expansion vector of the data block D_{t+1} at time t+1, out_{t+1,n} is an L-dimensional vector whose l-th component is the output probability of the corresponding class label, l = 1, 2, ..., L, and L represents the total number of class labels;
taking the class label at the position of the maximum value in the final output out_{t+1,n} as the predicted class label of the n-th text expansion vector of the data block D_{t+1} at time t+1, thereby completing the prediction of the data block D_{t+1} at time t+1;
step 5.3: assigning t+1 to t and going to step 4 until t = T, thereby completing the classification processing of the data block set D of the short text data stream.
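Putting steps 2-5 together, the stream is processed block by block in a predict-then-update fashion. The sketch below assumes helper functions expand_block, forward and backprop_update that wrap the operations described above, plus the detect_drift sketch given earlier; all of these names and the exact ordering inside the loop are hypothetical simplifications, not the patent's literal procedure.

```python
def classify_stream(blocks, model, thr):
    """Block-by-block loop over D = {D_1, ..., D_T}, loosely following steps 2-5."""
    prev_block, prev_out = None, None
    for D in blocks:                                   # data block D_{t+1} arrives
        X = expand_block(D, model)                     # step 2: Word2Vec + CNN expansion
        pre_out = forward(model, X)                    # step 4.1: pre-output of D_{t+1}
        drif = detect_drift(prev_out, pre_out, thr) if prev_out is not None else 0.0
        if prev_block is not None:                     # step 5.1: BP update modulated by
            backprop_update(model, prev_block, drif)   # the drift semaphore Drif_{t+1}
        out = forward(model, X)                        # step 5.2: predict D_{t+1}
        yield out.argmax(axis=1)                       # predicted class labels
        prev_block, prev_out = (X, D.labels), out
```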
Compared with the prior art, the invention has the following beneficial effects:
1. Considering that short texts are short and carry little information, the method builds a neural-network Word2vec word-vector model from an external corpus to expand the short text, mapping each word to a vector of a specified dimension and obtaining the semantic relatedness between words, which alleviates the sparsity problem of short texts to a certain extent and improves the accuracy of short text classification.
2. Considering that Word2vec can only capture the relatedness between words and cannot resolve the fact that the same word has different meanings in different contexts, the invention uses the convolution window of a CNN convolutional layer to acquire information about neighbouring phrases in the short text, thereby capturing the different semantics of a word in different contexts and making up for this shortcoming of Word2vec.
3. The invention trains the data stream with a CNN convolutional layer and a multi-layer LSTM neural network classifier. The main training process consists of two passes: the first pass completes the pre-training of the initial weights of all neurons, and the second pass fine-tunes the weights for the specific downstream task, making the method suitable for high-speed short text data streams and improving the algorithm's throughput on them.
4. The invention addresses the concept drift phenomenon of short text data streams. The change of topic distribution in the unknown stream is judged from the predicted class labels of each incoming data block and a drift threshold: before a data block enters the neural network, its class-label output is obtained and its distance to the output of the previous moment is computed; if the distance exceeds the specified threshold, drift has occurred, a concept drift signal is set, and the neural network model is then retrained, so that concept drift in the short text data stream is handled effectively.
5. The invention is oriented to practical application fields such as online modeling and automatic classification of high-speed short text data streams in social networks, implicit concept tracking and detection, and has wide applicability.
Drawings
FIG. 1 is a schematic diagram of text vectorization expansion based on a CNN model according to the present invention;
FIG. 2 is a schematic diagram of a data flow classification framework based on a multi-layer LSTM model according to the present invention;
fig. 3 is a diagram illustrating processing concept drift.
Detailed Description
In this embodiment, as shown in fig. 2, an online short text data stream classification method based on feature extension is performed according to the following steps:
step 1: constructing a Word2Vec model according to an external corpus, and acquiring a Word vector set Vec:
step 1.1: setting, according to a sliding window mechanism, the given short text data stream Stream = {d_1, d_2, ..., d_e, ..., d_E} to be divided by time into T data blocks, recorded as the data block set D = {D_1, D_2, ..., D_t, ..., D_T}, where d_e represents the e-th short text in the short text data stream Stream, D_t represents the data block at time t in the short text data stream Stream, e = 1, 2, ..., E, and t = 1, 2, ..., T;
step 1.2: acquiring from the knowledge base a text external corpus for the short text data stream Stream, recorded as C' = {d'_1, d'_2, ..., d'_m, ..., d'_M}, m = 1, 2, ..., M, where M represents the total number of texts in the external corpus C', d'_m represents the m-th text, and d'_m = {w_1^m, ..., w_q^m, ..., w_Q^m}, q = 1, 2, ..., Q, where Q denotes the number of words in the m-th text d'_m, w_q^m represents the q-th word in the m-th text d'_m, and w_q^m ∈ Vocab; Vocab represents the set of all different words in the external corpus C', Vocab = {word_1, word_2, ..., word_z, ..., word_Z}, z = 1, 2, ..., Z, where Z represents the total number of words in the word set Vocab and word_z represents the z-th word in the word set Vocab; letting the word vector of the z-th word word_z be recorded as Vec(word_z) and initializing it as an all-zero vector, thereby obtaining the word vector set Vec = {Vec(word_1), Vec(word_2), ..., Vec(word_z), ..., Vec(word_Z)} corresponding to the word set Vocab;
Step 1.3: obtain the vocabulary Vocab, i.e., the set of all words in the external corpus. Allocate Vec as an all-zero array of size |Vocab| × S, so that the vector of the z-th word in Vocab occupies the slice Vec[(z-1)·S : z·S].
Step 1.4: update the word vector set Vec with the skip-gram model to obtain the updated word vector set Vec', and assign it to Vec;
Step 1.5: initialize the current text d'_m to the first text of the corpus C'.
Step 1.6: traverse all words in d'_m; initialize the current word w_q to the first word.
Step 1.7: set the window size window; for each word w_q in the m-th text d'_m of the corpus C', obtain the context of w_q, Content = (w_{q-window}, ..., w_{q+window}), and compute p(Content | w_q) = Π_{u ∈ Content} p(u | w_q). Traversing each u in Content, compute p(u | w_q) = Π_{h=2}^{H} p(d_h | Vec(w_q), θ_{h-1}), where H is the height of the Huffman tree T, d_h is the h-th bit in the Huffman coding of the path from the root node to the leaf node u, θ_h is the parameter corresponding to the h-th node on the path from the root node to the leaf node u, and Vec is the set of word vectors corresponding to all words w_q of Vocab; here p(d_h | Vec(w_q), θ_{h-1}) = [σ(Vec(w_q)·θ_{h-1})]^{1-d_h} · [1 - σ(Vec(w_q)·θ_{h-1})]^{d_h}.
Step 1.8: compute the maximum likelihood function of the expression p(Content | w_q), i.e., the log-likelihood Σ_{u ∈ Content} log p(u | w_q); take the partial derivatives of the likelihood function with respect to θ_h and Vec(w_q) respectively, and update the values of θ_h and Vec(w_q). Finally, the trained word vector set Vec is obtained.
Step 1.9: go to steps 1.6 and 1.7 until all words of d'_m have been traversed.
Step 1.10: move on to the next text d'_m and go to steps 1.5, 1.6, 1.7 and 1.8 until the whole corpus C' has been traversed.
Step 1.11: obtain the final traversal result Vec, i.e., the final word vectors.
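A compact sketch of the gradient step described in steps 1.7-1.8 for a single (w_q, u) pair is given below. The Huffman path is represented by its node parameters and code bits, the in-place updates and the learning rate eta are illustrative assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_update(vec_wq, path_thetas, path_codes, eta=0.025):
    """One hierarchical-softmax SGD step for context word u given center word w_q.

    vec_wq:      word vector Vec(w_q) (1-D array, updated in place)
    path_thetas: list of parameter vectors theta_h along the root-to-u path
    path_codes:  list of Huffman code bits d_h (0/1) along the same path
    """
    grad_w = np.zeros_like(vec_wq)
    for theta, d in zip(path_thetas, path_codes):
        q = sigmoid(vec_wq @ theta)          # sigma(Vec(w_q) . theta_h)
        g = eta * (1 - d - q)                # gradient of the log-likelihood term
        grad_w += g * theta                  # accumulate the update for Vec(w_q)
        theta += g * vec_wq                  # update theta_h in place
    vec_wq += grad_w
```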
In this embodiment, a Word2vec model containing word vectors as exemplified in Table 1 is trained from the external corpus, where S = 50 and |Vocab| = 142234.
TABLE 1 Word2vec word vectors
ID   word       Vec(word) (Feature = 50, leading components shown)
1    century    1.6 1.3 3.4 -2.0 2.7 -1.1 2.0 -2.1 3.0 2.0 -2.0 -0.7 ...
2    health     -2.2 -0.7 -3.3 -1.1 -2.4 -4.4 -0.7 -1.3 -3.7 -2.7 -1.2 ...
3    caffeine   1.2 2.7 -1.8 0.5 -1.5 0.0 -2.9 0.2 -1.0 -2.3 3.0 2.3 ...
4    limits     -1.7 1.0 -0.7 1.4 -0.4 1.8 -1.4 0.6 -0.4 0.2 -3.2 3.3 ...
5    family     0.8 0.9 2.3 0.3 -0.4 -0.5 0.3 0.5 -0.2 -0.2 2.5 0.9 -2.0 ...
6    drugs      -2.7 1.0 -1.7 1.2 -0.3 -1.9 -2.0 1.8 -2.4 -3.8 3.0 4.1 ...
7    radiation  7.2 3.6 -2.6 0.4 -4.4 1.0 0.3 1.6 -0.3 -1.9 2.7 -1.0 6.1 ...
8    pregnancy  -0.0 3.5 -1.5 1.3 -4.1 -4.2 -6.2 0.2 -0.8 -4.0 1.3 1.6 ...
9    trial      -5.3 -0.6 1.0 1.0 -1.6 -3.1 -5.3 2.7 -0.8 -2.7 -2.6 4.3 ...
10   mask       0.0 1.8 -0.3 -1.5 -2.1 1.4 0.4 0.0 -0.4 -1.8 0.0 1.0 -0.3 ...
11   spring     3.7 3.7 0.2 0.9 -1.6 0.1 -1.4 -2.3 -0.4 4.3 1.3 -3.3 3.8 2.4 ...
12   industry   0.4 0.5 0.2 -1.9 2.7 -0.4 4.0 -0.2 -4.0 -0.2 -4.8 -0.6 3.8 ...
13   Full       -2.1 1.9 -0.3 1.2 -0.6 1.5 -2.1 -2.0 0.2 0.3 -1.1 1.7 ...
14   plate      0.5 2.4 -0.5 0.8 1.4 1.3 1.3 -3.9 -2.9 2.7 -2.7 -2.4 ...
15   room       2.2 3.0 0.3 1.1 -4.0 1.8 -2.1 -1.4 0.6 0.0 -2.9 1.4 ...
16   salad      2.4 3.1 0.5 -2.2 0.9 1.6 -1.7 0.8 -1.7 0.3 0.9 -0.1 -0.4 ...
Step 2: vectorize the short text data stream using the word vector set Vec and perform text vectorization expansion based on the CNN model; the main structure is shown in FIG. 1:
step 2.1: initializing t = 0; defining and initializing the concept drift semaphore at time t as Drif_t = 0; defining and initializing the concept drift threshold Thr;
step 2.2: acquiring the data block D_t at time t in the short text data stream Stream, with D_t = {(d_1^t, y_1^t), ..., (d_n^t, y_n^t), ..., (d_N^t, y_N^t)}, n = 1, 2, ..., N, where N is the total number of texts in the data block D_t at time t, d_n^t is the n-th text in the data block D_t at time t, d_n^t = {w_{n,1}^t, ..., w_{n,i}^t, ..., w_{n,I}^t}, i = 1, 2, ..., I, where w_{n,i}^t represents the i-th word in the n-th text d_n^t of the data block D_t at time t, and y_n^t is the class label of the n-th text d_n^t in the data block D_t at time t;
step 2.3: obtaining from the word vector set Vec the word vector Vec(w_{n,i}^t) of the i-th word w_{n,i}^t of the n-th text d_n^t in the data block D_t at time t, thereby obtaining the set of word vectors of all words in the n-th text d_n^t of the data block D_t at time t, expressed as Vec(d_n^t) = {Vec(w_{n,1}^t), ..., Vec(w_{n,I}^t), Vec(0_1), ..., Vec(0_j), ..., Vec(0_{P-I})}, where I ≤ P, P represents the maximum number of words of a short text in the short text data stream Stream, Vec(0_j) is the j-th all-zero padding vector, and 1 ≤ j ≤ P-I;
step 2.4: obtaining, according to step 2.3, the text vector set of the data block D_t at time t, expressed as Vec(D_t) = {Vec(d_1^t), ..., Vec(d_n^t), ..., Vec(d_N^t)}; grouping the data block D_t at time t into groups of G texts each, so that Vec(D_t) is divided into R groups, recorded as Vec(D_t) = {Vec_1(D_t), ..., Vec_r(D_t), ..., Vec_R(D_t)}, r = 1, 2, ..., R, with R = N/G;
step 2.5: letting the convolution kernel in the CNN model be Core, the size of the convolution kernel be Row × Col, and the stride be Strides; for the set of word vectors Vec(d_n^t) of the n-th text d_n^t in the data block D_t at time t, performing the convolution operation to obtain a semantic matrix of dimension ((P-Row)/Strides + 1) × (S-Col + 1), and merging it with the set of word vectors Vec(d_n^t) to obtain the input representation Vec'(d_n^t) of the n-th text d_n^t of the data block D_t at time t, thereby obtaining all text expansion vectors of the data block D_t at time t, expressed as Vec'(D_t) = {Vec'_1(D_t), ..., Vec'_r(D_t), ..., Vec'_R(D_t)}, where Vec'_r(D_t) = {Vec'_{r,1}(D_t), ..., Vec'_{r,g}(D_t), ..., Vec'_{r,G}(D_t)}, g = 1, 2, ..., G;
And step 3: an online deep learning network is constructed for the expanded text vectors; the main structure is shown in FIG. 2:
step 3.1: defining the number of layers of the current deep learning network as fr and the maximum depth of the neural network as Floor; defining the number of groups of the current text expansion vector set Vec'(D_t) as r; initializing fr = 1 and letting r = 1;
step 3.2: for the text expansion vector set Vec'(D_t) of the data block D_t at time t, constructing the fr-th layer LSTM network LSTM_fr at time t and setting its number of neurons as O_fr; constructing the fr-th layer fully connected network Dense_fr and setting its number of neurons as H_fr.
Step 3.3: and assigning fr +1 to fr, and turning to the step 3.2 until fr is Floor, so that the construction of the deep learning network is completed.
Step 3.5: initializing all weights W and bias terms b in the network layer into a full 0 vector, wherein the vector dimension is equal to the number of the neurons.
Step 3.6: input the r-th group of text expansion vectors Vec'_r(D_t) of the text expansion vector set Vec'(D_t) into the first-layer LSTM network LSTM_1 at time t, and obtain the intermediate output through formula (1), formula (2), formula (3), formula (4), formula (5) and formula (6), recorded as Out1_t^r = {out1_{t,1}^r, ..., out1_{t,g}^r, ..., out1_{t,G}^r}, where out1_{t,g}^r denotes the LSTM network intermediate output of the g-th text expansion vector in the r-th group of text expansion vectors, and:

out1_{t,g}^r = o_t · tanh(C_t)    (1)

In formula (1), o_t represents the information that the neurons in the LSTM network need to output at time t; C_t is the state information of the neurons in the LSTM network at time t; tanh(·) is the tanh activation function; and:

o_t = σ(W_o^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_o)    (2)

C_t = f_t · C_{t-1} + i_t · C̃_t    (3)

In formula (2), σ(·) is the sigmoid activation function; W_o^t is the weight of the output gate in the LSTM network neuron at time t; b_o is the bias term of the output gate; Vec'_{r,g}(D_t) is the g-th text expansion vector in the r-th group of text expansion vectors at time t.
In formula (3), f_t is the output information of the forget gate of the neuron in the LSTM network at time t; C̃_t is the input information of the neuron in the LSTM network at time t; i_t is the state information of the input gate of the neuron in the LSTM network at time t; C_{t-1} is the state information of the neuron at time t-1; and:

f_t = σ(W_f^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_f)    (4)

i_t = σ(W_i^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_i)    (5)

C̃_t = tanh(W_c^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_c)    (6)

In formulas (4), (5) and (6), W_f^t, W_i^t and W_c^t are respectively the weights corresponding to the forget gate, the input gate and the state update gate in the LSTM network neuron at time t; b_f, b_i and b_c are the bias terms corresponding to the forget gate, the input gate and the state update gate; h_{t-1} denotes the previous intermediate output of the LSTM neuron;
Step 3.7: input the intermediate output Out1_t^r of the LSTM_1 network into the first-layer fully connected network Dense_1 and obtain the corresponding output Dense1_t^r = {dense1_{t,1}^r, ..., dense1_{t,g}^r, ..., dense1_{t,G}^r}, where dense1_{t,g}^r represents the Dense_1 network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense1_{t,g}^r = out1_{t,g}^r · W1_t, and W1_t is the weight corresponding to the fully connected network Dense_1 at time t;
Step 3.8: let fr = 2;
Step 3.9: input the intermediate output Out(fr-1)_t^r of the (fr-1)-th layer LSTM network LSTM_(fr-1) into the fr-th layer LSTM network LSTM_fr and obtain the intermediate output of the fr-th layer LSTM network LSTM_fr at time t, expressed as Out(fr)_t^r;
Step 3.10: input the intermediate output Out(fr)_t^r of the fr-th layer LSTM network LSTM_fr into the fr-th layer fully connected network Dense_fr and obtain the corresponding output Dense(fr)_t^r = {dense(fr)_{t,1}^r, ..., dense(fr)_{t,g}^r, ..., dense(fr)_{t,G}^r}, where dense(fr)_{t,g}^r represents the Dense_fr network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense(fr)_{t,g}^r = out(fr)_{t,g}^r · W_fr_t, and W_fr_t is the weight corresponding to the fr-th layer fully connected network Dense_fr at time t;
Step 3.11: assign fr+1 to fr and go to step 3.9 until fr = Floor;
Step 3.12: use formula (7) to take the weighted sum of the outputs of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t and obtain the output out_{t,g}^r of the expanded text vector Vec'_{r,g}(D_t) at time t; and use the hedging algorithm shown in formula (8) to update the output weight weight_fr_t, obtaining the output weight weight_fr_{t+1} of the fully connected network Dense_fr at time t+1, thereby obtaining the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t+1:

out_{t,g}^r = Σ_{fr=1}^{Floor} weight_fr_t · dense(fr)_{t,g}^r    (7)

In formula (7), weight1_t, weight2_t, ..., weight_fr_t, ..., weight_Floor_t are respectively the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t; weight_fr_t is the contribution weight of the output of the fully connected network Dense_fr;

weight_fr_{t+1} = weight_fr_t · β^{L(dense(fr)_{t,g}^r, l_t^g)}    (8)

In formula (8), β is the hedging parameter, L(·) is the loss function of the corresponding output, and l_t^g is the class label of the g-th text expansion vector in the r-th group of text expansion vectors.
Step 3.13: assign r+1 to r, feed the r-th group Vec'_r(D_t) of the text expansion vector set Vec'(D_t) of the data block D_t at time t into the first-layer LSTM network, and go to step 3.12 until r = R, thereby finally obtaining the output Out_t = {out_{t,1}, ..., out_{t,n}, ..., out_{t,N}} of the original text vector set Vec(D_t) of the data block D_t at time t, where out_{t,n} is the final output of the n-th text expansion vector of the data block D_t at time t.
And step 4: introduce the concept drift semaphore into the neurons of the LSTM network at time t and detect the distribution change of the short text stream; the main structure is shown in FIG. 3:
step 4.1: according to step 2, perform short text expansion and grouping on the data block D_{t+1} at time t+1 to obtain the grouped expanded text vector representation Vec'(D_{t+1}) = {Vec'_1(D_{t+1}), ..., Vec'_r(D_{t+1}), ..., Vec'_R(D_{t+1})}, and input it into the deep learning network to obtain the pre-output Out'_{t+1} = {out'_{t+1,1}, ..., out'_{t+1,n}, ..., out'_{t+1,N}} of the text expansion vector set Vec'(D_{t+1}) of the data block D_{t+1} at time t+1, where out'_{t+1,n} is the pre-output of the n-th text expansion vector of the data block D_{t+1} at time t+1.
step 4.2: calculate the Euclidean distance dis between the output Out_t of the data block D_t at time t and the pre-output Out'_{t+1} of the data block D_{t+1} at time t+1; if dis is smaller than the concept drift threshold Thr, it indicates that no concept drift occurs in the data block D_t at time t, and the concept drift semaphore at time t+1 is set as Drif_{t+1} = 0; otherwise, it indicates that concept drift occurs in the data block D_t at time t, and the concept drift semaphore at time t+1 is set as Drif_{t+1} = a, where a is a constant;
and 5: model updating of the online deep learning network and prediction of short text data flow are completed:
step 5.1: update the LSTM network weights W_o^t, W_f^t, W_i^t, W_c^t and the fully connected network weights W1_t, W2_t, ..., W_fr_t, ..., W_Floor_t at time t through the back-propagation (BP) algorithm together with the concept drift signal Drif_{t+1} at time t+1, obtaining the LSTM network weights W_o^{t+1}, W_f^{t+1}, W_i^{t+1}, W_c^{t+1} and the fully connected network weights W1_{t+1}, W2_{t+1}, ..., W_fr_{t+1}, ..., W_Floor_{t+1} at time t+1. The concept drift signal mainly enhances the weight of the input information at the current moment and reduces the weight of the historical information, lowering the dependence of the output on the historical model, finally completing the handling of concept drift and improving the stability of the model.
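The text above does not give an explicit formula for how Drif_{t+1} enters the update, beyond stating that it boosts current-input information and damps historical information. The sketch below is therefore only one plausible interpretation, scaling the forget-gate and input-gate activations of formulas (4) and (5) by the drift semaphore; it is an assumption, not the patent's stated mechanism.

```python
import numpy as np

def drift_modulated_gates(f, i, drif):
    """One plausible reading of how the drift semaphore reweights an LSTM neuron.

    f, i:  forget-gate and input-gate activations from formulas (4) and (5)
    drif:  concept drift semaphore Drif_{t+1} (0 = no drift, constant a = drift)
    When drift is detected, reliance on the historical state C_{t-1} is damped
    and the contribution of the current input is boosted.
    """
    if drif == 0:
        return f, i
    f_new = f / (1.0 + drif)                     # damp historical information
    i_new = np.minimum(1.0, i * (1.0 + drif))    # boost current-input information
    return f_new, i_new
```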
Step 5.2: predicting data block D at t +1 moment by using LSTM network at t +1 momentt+1Text expansion vector Vec (D)t+1) To obtain a predicted output
Figure BDA0002309098570000159
Wherein the content of the first and second substances,
Figure BDA00023090985700001510
data block D for time t +1t+1The final output of the nth text expansion vector, and
Figure BDA00023090985700001511
is a vector in the L dimension, and is,
Figure BDA00023090985700001512
output summary for corresponding class labelsA rate, L ═ 1, 2., L denotes the total number of class labels;
obtaining
Figure BDA00023090985700001513
Class label of the position of the medium maximum value, and is used as the data block D at the moment of t +1t+1The prediction class label of the nth text expansion vector is obtained, so that the data block D at the t +1 moment is completedt+1Predicting;
step 5.3: assign t+1 to t and go to step 4 until t = T, thereby completing the classification processing of the data block set D of the short text data stream.
In this embodiment, according to the classification method of the short text data stream, as shown in fig. 1, the following steps are specifically performed:
(1) Obtain the 1st data block D_1 in the data stream D. The data are mainly shown in Table 2:
table 2 20 sample data in a short text data stream
(1) For the first training data block D_1 (see Table 2), convert each word into a word vector according to the Word2vec model generated from the external corpus, i.e., convert D_1 into the form of Table 3 using the word vector model:
TABLE 3 vectorized data blocks
(2) Take the text d_1^1 in D_1. With the convolution kernel [0.2, 0.3, 0.5] from step 2 and a window size of 3, the convolution outputs are CNN-out = [-3.69, 3.11, 1.45, -1.05, 0, 2.78, ...], [-4.2, 7.5, -0.1, -1.1, -2.78, -0.56, ...], ..., [1.17, 8.6, -0.04, 1.19, -4.47, 3, ...]. Attach CNN-out to Vec(d_1^1); the expanded Vec'(d_1^1) is then a 198 × 50 matrix.
(3) Input the above expanded matrix Vec'(d_1^1) into the entrance of the first-layer LSTM network in the neural network.
(4) Obtain the drift signal Drif: perform the forward feed of the neural network using formulas (1) to (6) and step 4, and compute the pre-processing result out_t of the data block (in the format shown in Table 7); compute the Euclidean distance dis between the predicted result out_t and the predicted result out_{t-1} of the data block at the previous moment, and judge whether it exceeds the drift threshold Threshold, so as to set the value of the drift signal Drif at the current moment.
(5) According to (4), perform the forward pass of the neural network with the drift signal Drif, and use the expanded data block Vec'(D_1) to complete the updating of the neuron weights in formulas (1), (2), (3), (4), (5) and (6) by back-propagation (BP) with the concept drift signal Drif at the current moment.
(6) The calculation of the data block is completed by using the formula (1), the formula (2), the formula (3), the formula (4), the formula (5), the formula (6) and the step 4, and the probability results of class labels are shown in table 4:
TABLE 4 prediction probability of data block
(7) The most probable class label for each text is thus obtained: { business, us, business, entry }.
(8) The data blocks are iterated indefinitely until the data source stops generating data.

Claims (1)

1. An online short text data stream classification method based on feature extension is characterized by comprising the following steps:
step 1: constructing a Word2Vec model according to an external corpus, and acquiring a Word vector set Vec:
step 1.1: setting, according to a sliding window mechanism, the given short text data stream Stream = {d_1, d_2, ..., d_e, ..., d_E} to be divided by time into T data blocks, recorded as the data block set D = {D_1, D_2, ..., D_t, ..., D_T}, where d_e represents the e-th short text in the short text data stream Stream, D_t represents the data block at time t in the short text data stream Stream, e = 1, 2, ..., E, and t = 1, 2, ..., T;
step 1.2: acquiring a text external corpus, recorded as C' = {d'_1, d'_2, ..., d'_m, ..., d'_M}, m = 1, 2, ..., M, where M represents the total number of texts in the external corpus C', d'_m represents the m-th text, and d'_m = {w_1^m, ..., w_q^m, ..., w_Q^m}, q = 1, 2, ..., Q, where Q denotes the number of words in the m-th text d'_m, w_q^m represents the q-th word in the m-th text d'_m, and w_q^m ∈ Vocab; Vocab represents the set of all different words in the external corpus C', Vocab = {word_1, word_2, ..., word_z, ..., word_Z}, z = 1, 2, ..., Z, where Z represents the total number of words in the word set Vocab and word_z represents the z-th word in the word set Vocab; letting the word vector of the z-th word word_z be recorded as Vec(word_z), thereby obtaining the word vector set Vec = {Vec(word_1), Vec(word_2), ..., Vec(word_z), ..., Vec(word_Z)} corresponding to the word set Vocab;
Step 1.3: updating the word vector set Vec by using a skip-gram model to obtain an updated word vector set Vec' and assigning values to the Vec;
step 2: vectorizing the short text data Stream by using a word vector set Vec and performing text vectorization expansion based on a CNN model:
step 2.1: initializing t = 0; defining and initializing the concept drift semaphore at time t as Drif_t = 0; defining and initializing the concept drift threshold Thr;
step 2.2: acquiring the data block D_t at time t in the short text data stream Stream, with D_t = {(d_1^t, y_1^t), ..., (d_n^t, y_n^t), ..., (d_N^t, y_N^t)}, n = 1, 2, ..., N, where N is the total number of texts in the data block D_t at time t, d_n^t is the n-th text in the data block D_t at time t, d_n^t = {w_{n,1}^t, ..., w_{n,i}^t, ..., w_{n,I}^t}, i = 1, 2, ..., I, where w_{n,i}^t represents the i-th word in the n-th text d_n^t of the data block D_t at time t, and y_n^t is the class label of the n-th text d_n^t in the data block D_t at time t;
step 2.3: obtaining from the word vector set Vec the word vector Vec(w_{n,i}^t) of the i-th word w_{n,i}^t of the n-th text d_n^t in the data block D_t at time t, thereby obtaining the set of word vectors of all words in the n-th text d_n^t of the data block D_t at time t, expressed as Vec(d_n^t) = {Vec(w_{n,1}^t), ..., Vec(w_{n,I}^t), Vec(0_1), ..., Vec(0_j), ..., Vec(0_{P-I})}, where I ≤ P, P represents the maximum number of words of a short text in the short text data stream Stream, Vec(0_j) is the j-th all-zero padding vector, and 1 ≤ j ≤ P-I;
step 2.4: obtaining, according to step 2.3, the text vector set of the data block D_t at time t, expressed as Vec(D_t) = {Vec(d_1^t), ..., Vec(d_n^t), ..., Vec(d_N^t)}; grouping the data block D_t at time t into groups of G texts each, so that Vec(D_t) is divided into R groups, recorded as Vec(D_t) = {Vec_1(D_t), ..., Vec_r(D_t), ..., Vec_R(D_t)}, with R = N/G;
step 2.5: letting the convolution kernel in the CNN model be Core, the size of the convolution kernel be Row × Col, and the stride be Strides; for the set of word vectors Vec(d_n^t) of the n-th text d_n^t in the data block D_t at time t, performing the convolution operation to obtain a semantic matrix of dimension ((P-Row)/Strides + 1) × (S-Col + 1), where S is the dimension of a word vector, and merging the semantic matrix with the set of word vectors Vec(d_n^t) to obtain the input representation Vec'(d_n^t) of the n-th text d_n^t of the data block D_t at time t, thereby obtaining all text expansion vectors of the data block D_t at time t, expressed as Vec'(D_t) = {Vec'_1(D_t), ..., Vec'_r(D_t), ..., Vec'_R(D_t)}, where Vec'_r(D_t) = {Vec'_{r,1}(D_t), ..., Vec'_{r,g}(D_t), ..., Vec'_{r,G}(D_t)}, g = 1, 2, ..., G;
And step 3: constructing an online deep learning network for the expanded text vectors:
step 3.1: defining the number of layers of the current deep learning network as fr and the maximum depth of the neural network as Floor; defining the number of groups of the current text expansion vector set Vec'(D_t) as r; initializing fr = 1 and letting r = 1;
step 3.2: for the text expansion vector set Vec'(D_t) of the data block D_t at time t, constructing the fr-th layer LSTM network LSTM_fr at time t and setting its number of neurons as O_fr; constructing the fr-th layer fully connected network Dense_fr and setting its number of neurons as H_fr;
step 3.3: assigning fr +1 to fr, and turning to step 3.2 until fr is Floor, so as to complete the construction of the deep learning network;
step 3.4: inputting the r-th group of text expansion vectors Vec'_r(D_t) of the text expansion vector set Vec'(D_t) into the first-layer LSTM network LSTM_1 at time t and obtaining the intermediate output, recorded as Out1_t^r = {out1_{t,1}^r, ..., out1_{t,g}^r, ..., out1_{t,G}^r}, where out1_{t,g}^r denotes the LSTM network intermediate output of the g-th text expansion vector in the r-th group of text expansion vectors, and:

out1_{t,g}^r = o_t · tanh(C_t)    (1)

In formula (1), o_t represents the information that the neurons in the LSTM network need to output at time t; C_t is the state information of the neurons in the LSTM network at time t; tanh(·) is the tanh activation function; and:

o_t = σ(W_o^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_o)    (2)

C_t = f_t · C_{t-1} + i_t · C̃_t    (3)

In formula (2), σ(·) is the sigmoid activation function; W_o^t is the weight of the output gate in the LSTM network neuron at time t; b_o is the bias term of the output gate; Vec'_{r,g}(D_t) is the g-th text expansion vector in the r-th group of text expansion vectors at time t;
In formula (3), f_t is the output information of the forget gate of the neuron in the LSTM network at time t; C̃_t is the input information of the neuron in the LSTM network at time t; i_t is the state information of the input gate of the neuron in the LSTM network at time t; C_{t-1} is the state information of the neuron at time t-1; and:

f_t = σ(W_f^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_f)    (4)

i_t = σ(W_i^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_i)    (5)

C̃_t = tanh(W_c^t · [h_{t-1}, Vec'_{r,g}(D_t)] + b_c)    (6)

In formulas (4), (5) and (6), W_f^t, W_i^t and W_c^t are respectively the weights corresponding to the forget gate, the input gate and the state update gate in the LSTM network neuron at time t; b_f, b_i and b_c are the bias terms corresponding to the forget gate, the input gate and the state update gate; h_{t-1} denotes the previous intermediate output of the LSTM neuron;
step 3.5: inputting the intermediate output Out1_t^r of the LSTM_1 network into the first-layer fully connected network Dense_1 and obtaining the corresponding output Dense1_t^r = {dense1_{t,1}^r, ..., dense1_{t,g}^r, ..., dense1_{t,G}^r}, where dense1_{t,g}^r represents the Dense_1 network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense1_{t,g}^r = out1_{t,g}^r · W1_t, and W1_t is the weight corresponding to the fully connected network Dense_1 at time t;
step 3.6: letting fr = 2;
step 3.7: inputting the intermediate output Out(fr-1)_t^r of the (fr-1)-th layer LSTM network LSTM_(fr-1) into the fr-th layer LSTM network LSTM_fr and obtaining the intermediate output of the fr-th layer LSTM network LSTM_fr at time t, expressed as Out(fr)_t^r;
step 3.8: inputting the intermediate output Out(fr)_t^r of the fr-th layer LSTM network LSTM_fr into the fr-th layer fully connected network Dense_fr and obtaining the corresponding output Dense(fr)_t^r = {dense(fr)_{t,1}^r, ..., dense(fr)_{t,g}^r, ..., dense(fr)_{t,G}^r}, where dense(fr)_{t,g}^r represents the Dense_fr network output of the g-th text expansion vector in the r-th group of text expansion vectors, dense(fr)_{t,g}^r = out(fr)_{t,g}^r · W_fr_t, and W_fr_t is the weight corresponding to the fr-th layer fully connected network Dense_fr at time t;
step 3.9: assigning fr+1 to fr and going to step 3.7 until fr = Floor;
step 3.10: carrying out weighted summation of the outputs of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t by formula (7), and obtaining the output out_{t,g}^r of the expanded text vector Vec'_{r,g}(D_t) at time t:

out_{t,g}^r = Σ_{fr=1}^{Floor} weight_fr_t · dense(fr)_{t,g}^r    (7)

In formula (7), weight1_t, weight2_t, ..., weight_fr_t, ..., weight_Floor_t are respectively the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t; weight_fr_t is the contribution weight of the output of the fully connected network Dense_fr;
updating the output weight weight_fr_t by the hedging algorithm shown in formula (8) to obtain the output weight weight_fr_{t+1} of the fully connected network Dense_fr at time t+1, thereby obtaining the output weights of the Floor fully connected networks Dense_1, Dense_2, ..., Dense_fr, ..., Dense_Floor at time t+1:

weight_fr_{t+1} = weight_fr_t · β^{L(dense(fr)_{t,g}^r, l_t^g)}    (8)

In formula (8), β is the hedging parameter, L(·) is the loss function of the corresponding output, and l_t^g is the class label of the g-th text expansion vector in the r-th group of text expansion vectors;
step 3.11: assigning r+1 to r, feeding the r-th group Vec'_r(D_t) of the text expansion vector set Vec'(D_t) of the data block D_t at time t into the first-layer LSTM network, and going to step 3.10 until r = R, thereby finally obtaining the output Out_t = {out_{t,1}, ..., out_{t,n}, ..., out_{t,N}} of the original text vector set Vec(D_t) of the data block D_t at time t, where out_{t,n} is the final output of the n-th text expansion vector of the data block D_t at time t;
and 4, step 4: introducing concept drift semaphore to neurons in the LSTM network at the t moment and detecting distribution change of short text stream:
step 4.1: according to step 2, the data block D at the t +1 momentt+1Performing short text extension and grouping to obtain grouped extended text vector representation
Figure FDA0002309098560000055
And inputting the data into the deep learning network to obtain the data block D at the t +1 momentt+1Text expansion vector Vec (D)t+1) Pre-output of
Figure FDA0002309098560000056
Figure FDA0002309098560000057
Data block D for time t +1t+1Pre-outputting the nth text expansion vector;
step 4.2: calculating the data block D at the time ttOutput of (2)
Figure FDA0002309098560000058
And the data block D at the time of t +1t+1Out 'of pre-output'(t+1)If the euclidean distance dis is smaller than the conceptual drift threshold Thr, it represents the data block D at the time ttNo concept drift occurs, and the concept drift signal amount Drif at the time t +1 is sett+10; otherwise, it represents the data block D at time ttConcept drift occurs and the concept drift signal at the t +1 moment is setAmount Drift+1A is a constant;
and 5: model updating and short text data stream prediction of the online deep learning network:
step 5.1: drift signal Drif through the inverse BP algorithm and the concept at time t +1t+1Updating the LSTM network weight at the time t
Figure FDA0002309098560000061
And a fully connected network weight W1t,W2t,...,W-frt,...,W-Floort(ii) a Obtaining LSTM network weight at t +1 moment
Figure FDA0002309098560000062
And a fully connected network weight W1t,W2t,...,W_frt,...,W_Floort
Step 5.2: use the LSTM network at time t+1 to predict the text expansion vector Vec(D_{t+1}) of data block D_{t+1} at time t+1, obtaining the predicted output Out_{t+1}, where out_{t+1}^n, the final output of the n-th text expansion vector of data block D_{t+1} at time t+1, is an L-dimensional vector whose l-th component (l = 1, 2, ..., L, with L the total number of class labels) is the output probability of the corresponding class label; take the class label at the position of the maximum value in the final output out_{t+1}^n as the predicted class label of the n-th text expansion vector of data block D_{t+1} at time t+1, thereby completing the prediction of data block D_{t+1} at time t+1;
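Step 5.2 assigns each text the class label whose output probability is largest. A short sketch, assuming the per-text outputs are collected row-wise into an array and that label_list holds the L class labels in output order:

    import numpy as np

    def predict_block(outputs, label_list):
        # outputs: (n_texts, L) matrix; row n is the final output of the n-th text
        # label_list: the L class labels, in the same order as the output dimensions
        idx = np.argmax(outputs, axis=1)      # position of the maximum probability
        return [label_list[i] for i in idx]   # predicted class label per text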
Step 5.3: assign t+1 to t and go to Step 4 until t = T, thereby completing the classification of the data block set D of the short text data stream.
CN201911251229.1A 2019-12-09 2019-12-09 Online short text data stream classification method based on feature extension Active CN111026846B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911251229.1A CN111026846B (en) 2019-12-09 2019-12-09 Online short text data stream classification method based on feature extension

Publications (2)

Publication Number Publication Date
CN111026846A true CN111026846A (en) 2020-04-17
CN111026846B (en) 2021-08-17

Family

ID=70205012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911251229.1A Active CN111026846B (en) 2019-12-09 2019-12-09 Online short text data stream classification method based on feature extension

Country Status (1)

Country Link
CN (1) CN111026846B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7493253B1 (en) * 2002-07-12 2009-02-17 Language And Computing, Inc. Conceptual world representation natural language understanding system and method
CN101794303A (en) * 2010-02-11 2010-08-04 重庆邮电大学 Method and device for classifying text and structuring text classifier by adopting characteristic expansion
CN106934035A (en) * 2017-03-14 2017-07-07 合肥工业大学 Concept drift detection method in a kind of multi-tag data flow based on class and feature distribution
CN107679228A (en) * 2017-10-23 2018-02-09 合肥工业大学 A kind of short text data stream sorting technique based on short text extension and concept drift detection
US20190325029A1 (en) * 2018-04-18 2019-10-24 HelpShift, Inc. System and methods for processing and interpreting text messages
CN109918667A (en) * 2019-03-06 2019-06-21 合肥工业大学 The Fast incremental formula classification method of short text data stream based on word2vec model
CN109947945A (en) * 2019-03-19 2019-06-28 合肥工业大学 Word-based vector sum integrates the textstream classification method of SVM

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PEIPEI LI et al.: "Learning From Short Text Streams With Topic Drifts", IEEE Transactions on Cybernetics *
LYU Chaozhen et al.: "Short Text Classification Based on LDA Feature Extension", Computer Engineering and Applications *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114513328A (en) * 2021-12-31 2022-05-17 西安电子科技大学 Network traffic intrusion detection method based on concept drift and deep learning
CN114513328B (en) * 2021-12-31 2023-02-10 西安电子科技大学 Network traffic intrusion detection method based on concept drift and deep learning
CN114422450A (en) * 2022-01-21 2022-04-29 中国人民解放军国防科技大学 Network flow analysis method and device based on multi-source network flow data
CN114422450B (en) * 2022-01-21 2024-01-19 中国人民解放军国防科技大学 Network traffic analysis method and device based on multi-source network traffic data

Also Published As

Publication number Publication date
CN111026846B (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN109284506B (en) User comment emotion analysis system and method based on attention convolution neural network
CN110717334B (en) Text emotion analysis method based on BERT model and double-channel attention
CN109271522B (en) Comment emotion classification method and system based on deep hybrid model transfer learning
CN111274398B (en) Method and system for analyzing comment emotion of aspect-level user product
CN107608956B (en) Reader emotion distribution prediction algorithm based on CNN-GRNN
Vateekul et al. A study of sentiment analysis using deep learning techniques on Thai Twitter data
Cao et al. Deep neural networks for learning graph representations
Xu et al. Investigation on the Chinese text sentiment analysis based on convolutional neural networks in deep learning.
CN111125358B (en) Text classification method based on hypergraph
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
Sari et al. Text Classification Using Long Short-Term Memory with GloVe
CN111160037A (en) Fine-grained emotion analysis method supporting cross-language migration
CN112231477B (en) Text classification method based on improved capsule network
CN108388654B (en) Sentiment classification method based on turning sentence semantic block division mechanism
CN111026846B (en) Online short text data stream classification method based on feature extension
CN110297888A (en) A kind of domain classification method based on prefix trees and Recognition with Recurrent Neural Network
CN111460157A (en) Cyclic convolution multitask learning method for multi-field text classification
Etaiwi et al. Deep learning based techniques for sentiment analysis: A survey
CN114925205B (en) GCN-GRU text classification method based on contrast learning
CN110825850A (en) Natural language theme classification method and device
CN115687609A (en) Zero sample relation extraction method based on Prompt multi-template fusion
Nazarenko et al. Investigation of the Deep Learning Approaches to Classify Emotions in Texts.
Nithya et al. Deep learning based analysis on code-mixed tamil text for sentiment classification with pre-trained ulmfit
CN114722835A (en) Text emotion recognition method based on LDA and BERT fusion improved model
Wahyudi et al. Deep learning for multi-aspect sentiment analysis of tiktok app using the rnn-lstm method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant