CN110083676B - Short text-based field dynamic tracking method - Google Patents
- Publication number: CN110083676B
- Application number: CN201910322267.5A
- Authority
- CN
- China
- Prior art keywords
- word
- neural network
- network model
- short text
- layer
- Prior art date
- Legal status: Active (assumed by Google Patents; not a legal conclusion)
Classifications
- G06F16/313 — Information retrieval; selection or weighting of terms for indexing
- G06F16/3335 — Query processing; syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/335 — Querying; filtering based on additional data, e.g. user or group profiles
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The invention provides a short-text-based field dynamic tracking method, belonging to the fields of recommendation systems and natural language processing, and comprising the following steps: 1.1) data acquisition; 1.2) data preprocessing; 1.3) building a word-embedding neural network model; 1.4) building a convolutional neural network model; 2.1) short-text recommendation; 2.2) saving user feedback. The method mines the domain topic features of short texts fully and accurately, improving the accuracy of domain dynamic tracking; it extends current text-recommendation schemes by scoring the latest short texts and presenting them to the user as the field's latest developments; and it can acquire data from static web pages as well as short texts from dynamic data streams, storing both for text recommendation.
Description
Technical Field
The invention belongs to the field of recommendation systems and natural language processing, and particularly relates to a field dynamic tracking method based on short texts.
Background
The dynamic tracking method builds on, and improves, two existing technologies: text recommendation and web crawling.
Traditional text recommendation is a content-based method: the implicit topic feature vector of a text is analyzed, and recommendation is performed according to similarity with the texts to be recommended. The prior art includes a personalized Web-text recommendation method that recommends texts whose influence on the user exceeds a threshold, and a Sina-microblog event recommendation method that computes the similarity between a user model and an event vector with an improved cosine-similarity algorithm. However, these prior arts cannot perform relevance scoring within a given field and therefore lack generality.
Crawler technology collects short-text information, extracts the texts or other related information, and stores them in a database; it is an infrastructure of information-retrieval applications and an important means of data acquisition. The prior art includes a web-crawler method that extracts short texts by parsing the Document Object Model (DOM) of HyperText Markup Language (HTML) pages, and a crawler that automatically acquires large amounts of microblog information through various crawling strategies. However, these prior arts cannot both acquire data from static web pages and acquire short texts from dynamic data streams to build a data store for text recommendation.
Disclosure of Invention
To address these technical problems, the invention provides a short-text-based field dynamic tracking method comprising the following steps:
step 1.1, data acquisition:
continuously and directionally acquiring short texts on internet media according to keywords in a specific field specified by a user, and storing the short texts in a local database;
the short text comprises words, publication time and a subject label;
step 1.2, preprocessing data;
removing noise content from the short text; the noise content comprises stop words, emoticons and web-page links;
truncating the denoised short text to a fixed length; if the short text is shorter than that length, it is padded with "<PAD>" tokens;
assigning a unique integer to each word appearing in the short texts as its identifier, so that every word is distinguished;
converting each document into the sequence of these integers as its vector representation;
step 1.3, establishing a word embedding neural network model;
training a word-embedding neural network model that takes the words appearing in the short text as input and their context information as output, the output of the network's intermediate layer serving as the word-vector representation of the target word, specifically:
a word-embedding neural network model is built to embed the information of a target word's context words into the target word's vector representation, coupling the mathematical meaning of the embedding vector with the linguistic meaning of the target word;
the one-hot vector of the target word is the input of the word-embedding neural network model;
the one-hot vector of a context word of the target word is the output of the word-embedding neural network model;
the projection vector of the target word is the output of the projection layer of the word-embedding neural network model, i.e. a word vector of the target word that contains context information and represents the target word's meaning in context;
the neural-network weights between the input layer and the projection layer of the word-embedding neural network model form a V×N matrix W = (w_{i,j}) (1 ≤ i ≤ V, 1 ≤ j ≤ N);
W is randomly initialized, and the projection-layer result h_t is a 1×N vector, i.e. h_t = x_t · W,
where V is the total number of words in the lexicon, N is a hyper-parameter giving the number of projection-layer neurons, and w_t is the target word;
x_t is the input of the neural network model for target word w_t, i.e. its one-hot vector; in x_t = (x_1, x_2, ..., x_V) exactly one component equals 1 and the rest are 0;
the neural-network weights between the projection layer and the output layer form an N×V matrix W' = (w'_{i,j}) (1 ≤ i ≤ N, 1 ≤ j ≤ V);
W' is randomly initialized, and the output-layer result o_t is a 1×V vector, i.e. o_t = h_t · W';
the context words of target word w_t are w_I ∈ {w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}}, with one-hot vectors x_I ∈ {x_{w_{t-2}}, x_{w_{t-1}}, x_{w_{t+1}}, x_{w_{t+2}}};
a softmax classifier is applied to the output-layer result o_t = [o_1, o_2, ..., o_V] to approximate the one-hot vector x_I of context word w_I, giving the model output u_t = softmax(o_t), i.e. u_j = exp(o_j) / Σ_{k=1..V} exp(o_k);
step 1.4, establishing a convolutional neural network model;
in the convolutional neural network model, the words of a short text are first converted into word vectors as local features, while the user-specified keywords are converted into word vectors as global features; the two feature groups each pass through a convolution layer and a pooling layer of the model, and the score of each short text is finally computed by a multilayer perceptron with a softmax activation function, specifically:
for any word w_k in a short text of length L_w composed of words w_1, w_2, ..., w_{L_w}, its word-embedding map S_k — the N-dimensional vector output by the projection layer of the word-embedding neural network model obtained in step 1.3 — is
S_k = word2vec(w_k)
where N is the number of projection-layer neurons of the word-embedding neural network model, and also the dimension of the word vectors it outputs; stacking S_1, ..., S_{L_w} row-wise gives the local feature matrix S;
in the same manner, the topic-label feature H is obtained by stacking the word vectors of the topic labels, where L_H is the number of topic labels contained in the short text;
for the matrix S, a convolutional-layer filter W_S is defined, whose size is a hyper-parameter and whose entries are randomly initialized; the convolution-layer result for S is C_S = conv(S, W_S);
max pooling is applied to each row of C_S to obtain the matrix P_S;
similarly, for the matrix H a convolutional-layer filter W_H is defined, whose size is a hyper-parameter and whose entries are randomly initialized; the convolution-layer result for H is C_H = conv(H, W_H);
max pooling is applied to each row of C_H to obtain the matrix P_H;
the pooled features are concatenated into P_f, and with the weight matrix M between P_f and the logistic-regression-layer neurons the output o' = (o'_1, o'_2) of the logistic regression layer is computed as o' = P_f · M;
the output value of the softmax activation function is then computed: u' = softmax(o'), i.e. u'_j = exp(o'_j) / (exp(o'_1) + exp(o'_2)) for j = 1, 2;
step 2.1, short-text recommendation is performed according to the scores obtained in step 1.4, and the highest-scoring short texts are recommended to the user as the latest developments of the field;
unlike step 1.4, step 2.1 applies only to last-week data that did not participate in training, and performs no back-propagation error correction; at run time, step 2.1 selects, according to the user-specified number N_top, the N_top highest-scoring short texts as the field's developments;
step 2.2, the user rates the recommended short texts by their degree of relevance, and the feedback is recorded in the local database to correct the model and improve its performance;
the training method of the word embedding neural network model in the step 1.3 comprises the following steps:
the training goal of the word-embedding neural network model is to reduce the difference E between the model output u_t and the one-hot vector x_I = (x_{I,1}, x_{I,2}, ..., x_{I,V}) of w_I,
where w_I ∈ {w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}} and E = −Σ_{j=1..V} x_{I,j} · log u_j;
for the N×V weight matrix W' from the projection layer to the output layer, by the chain rule the gradient δw'_{i,j} for any i, j (1 ≤ i ≤ N, 1 ≤ j ≤ V) equals
δw'_{i,j} = ∂E/∂w'_{i,j} = e_j · h_i, where e_j = u_j − x_{I,j};
according to the gradient δw'_{i,j}, and using the first moment m and second moment v computed in the previous execution of the training process (or m = 0 and v = 0 initially), the first-moment estimate and second-moment estimate of the weight w'_{i,j} are computed as
m = β_1 · m + (1 − β_1) · δw'_{i,j}, v = β_2 · v + (1 − β_2) · (δw'_{i,j})²,
where β_1 and β_2 are hyper-parameters controlling the change of the moment estimates;
with step denoting the number of completed training steps, the bias-corrected estimates are m̂ = m / (1 − β_1^step) and v̂ = v / (1 − β_2^step), and w'_{i,j} is updated as
w'_{i,j} ← w'_{i,j} − α · m̂ / (√v̂ + ε),
where α is a hyper-parameter, called the learning rate, controlling the update magnitude of w'_{i,j}, and ε is a small positive number, e.g. 10^-8, preventing division by zero in the quotient;
for the V×N weight matrix W from the input layer to the projection layer, by the chain rule the gradient δw_{j,i} for any j, i (1 ≤ j ≤ V, 1 ≤ i ≤ N) equals
δw_{j,i} = ∂E/∂w_{j,i} = x_j · Σ_{k=1..V} e_k · w'_{i,k},
and the weight w_{j,i} is updated with the same moment-based rule;
each time this step completes, the word-embedding neural network model has been trained once and its weights updated;
the training method of the convolutional neural network model in step 1.4 specifically comprises the following steps:
the training goal of the convolutional neural network model is to reduce the difference between u' = (u'_1, u'_2) and the actual relevance a' = (a'_1, a'_2); here a'_1 ∈ [0, 1] is the degree of relevance between the length-L_w short text and the field of the keywords, a larger u'_1 indicates stronger predicted relevance, and a'_2 = 1 − a'_1;
let
E' = E'_1 + E'_2
with
E'_j = −(a'_j · log u'_j + (1 − a'_j) · log(1 − u'_j))
where 1 ≤ j ≤ 2; for the weight matrix M from P_f to o', by the chain rule the gradient δM_{i,j} for any i, j (1 ≤ i ≤ dim P_f, 1 ≤ j ≤ 2) is
δM_{i,j} = ∂E'/∂M_{i,j} = (u'_j − a'_j) · P_{f,i};
the weight M_{i,j} is then updated with the same moment-based rule:
m = β_1 · m + (1 − β_1) · δM_{i,j}, corrected to the approximately unbiased estimate m̂ = m / (1 − β_1^step);
v = β_2 · v + (1 − β_2) · (δM_{i,j})², corrected to the approximately unbiased estimate v̂ = v / (1 − β_2^step);
M_{i,j} ← M_{i,j} − α · m̂ / (√v̂ + ε);
each time this step completes, the convolutional neural network model has been trained once and its weights updated.
The invention has the following beneficial effects:
compared with conventional text-recommendation methods, the method mines the domain topic features of short texts fully and accurately, improving the accuracy of domain dynamic tracking; it extends current text-recommendation schemes by scoring the latest short texts and presenting them to the user as the field's latest developments; and it can acquire data from static web pages as well as short texts from dynamic data streams, storing both for text recommendation.
The design is reasonable, easy to implement, and of good practical value.
Drawings
Fig. 1 is a schematic diagram of the process of removing noise content from a short text according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the word-embedding neural network model according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the training method of the convolutional neural network model and the short-text recommendation method in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a short-text-based field dynamic tracking method, mainly adopting a recommendation technique based on a convolutional neural network together with a crawler technique oriented to short texts, and comprising two parts: model training and text recommendation.
The model training part comprises the following steps:
step 1.1, data acquisition;
continuously and directionally acquiring short texts on internet media according to keywords in a specific field specified by a user, and storing the short texts in a local database;
the short text comprises words, publication time and topic labels (hashtag/category);
the short texts are from tweets, microblogs, paper abstracts and user postings of Twitter, arXiv and Reddit network platforms;
the subject label is specified by a user;
step 1.2, preprocessing data;
removing noise content from the short text, as shown in Fig. 1; the noise content comprises stop words, emoticons and web-page links, the underlined content in the figure being the stop words;
truncating the denoised short text to a fixed length; if the short text is shorter than that length, it is padded with "<PAD>" tokens;
assigning a unique integer to each word appearing in the short texts as its identifier, so that every word is distinguished;
converting each document into the sequence of these integers as its vector representation;
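The preprocessing of step 1.2 can be sketched as follows; the stop-word list, fixed length and regular expressions below are illustrative assumptions, not values fixed by the patent:

```python
import re

PAD = "<PAD>"
STOPWORDS = {"the", "a", "is", "of"}  # illustrative stop-word list
FIXED_LEN = 8                          # illustrative fixed length

def preprocess(text, vocab):
    """Remove noise, truncate/pad to FIXED_LEN, and map words to integer ids."""
    text = re.sub(r"https?://\S+", " ", text)         # strip web-page links
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]                   # strip stop words and symbols
    words = words[:FIXED_LEN]                         # truncate to the fixed length
    words += [PAD] * (FIXED_LEN - len(words))         # pad short texts with <PAD>
    for w in words:                                   # assign unique integer ids
        vocab.setdefault(w, len(vocab))
    return [vocab[w] for w in words]                  # document as an integer sequence

vocab = {PAD: 0}
seq = preprocess("The model is a demo of short text http://x.y tracking", vocab)
print(seq)
```

The returned sequence is the integer vector representation of the document described above.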
step 1.3, establishing a word embedding neural network model;
training a word-embedding (word embedding) neural network model that takes the words appearing in a short text as input and their context information as output, the output of the network's intermediate layer serving as the word-vector representation of the target word, as shown in Fig. 2, specifically:
to embed the information of a target word's context words into the target word's vector representation and couple the mathematical meaning of the embedding vector with the linguistic meaning of the target word, a word-embedding neural network model is built;
the one-hot vector of the target word is the input of the word-embedding neural network model;
the one-hot vector of a context word of the target word is the output of the word-embedding neural network model;
the projection vector of the target word is the output of the projection layer of the word-embedding neural network model, i.e. a word vector of the target word that contains context information and represents the target word's meaning in context;
the neural-network weights between the input layer and the projection layer of the word-embedding neural network model form a V×N matrix W = (w_{i,j}) (1 ≤ i ≤ V, 1 ≤ j ≤ N);
W is randomly initialized, and the projection-layer result h_t is a 1×N vector, i.e. h_t = x_t · W,
where V is the total number of words in the lexicon, N is a hyper-parameter giving the number of projection-layer neurons, and w_t is the target word;
x_t is the input of the neural network model for target word w_t, i.e. its one-hot vector; in x_t = (x_1, x_2, ..., x_V) exactly one component equals 1 and the rest are 0;
the neural-network weights between the projection layer and the output layer form an N×V matrix W' = (w'_{i,j}) (1 ≤ i ≤ N, 1 ≤ j ≤ V);
W' is randomly initialized, and the output-layer result o_t is a 1×V vector, i.e. o_t = h_t · W';
the context words of target word w_t are w_I ∈ {w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}}, with one-hot vectors x_I ∈ {x_{w_{t-2}}, x_{w_{t-1}}, x_{w_{t+1}}, x_{w_{t+2}}};
a softmax classifier is applied to the output-layer result o_t = [o_1, o_2, ..., o_V] to approximate the one-hot vector x_I of context word w_I, giving the model output u_t = softmax(o_t), i.e. u_j = exp(o_j) / Σ_{k=1..V} exp(o_k);
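A minimal numerical sketch of this forward pass; the vocabulary size V, projection width N and the random seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                       # illustrative lexicon size and projection width
W  = rng.standard_normal((V, N))   # input -> projection weights (randomly initialized)
Wp = rng.standard_normal((N, V))   # projection -> output weights W' (randomly initialized)

def one_hot(idx, size):
    x = np.zeros(size)
    x[idx] = 1.0
    return x

def forward(t):
    """x_t (one-hot) -> h_t = x_t.W (word vector) -> o_t = h_t.W' -> softmax output u_t."""
    x_t = one_hot(t, V)
    h_t = x_t @ W                  # 1 x N projection-layer result
    o_t = h_t @ Wp                 # 1 x V output-layer result
    e = np.exp(o_t - o_t.max())    # numerically stable softmax
    return h_t, e / e.sum()

h_t, u_t = forward(3)
print(h_t.shape, u_t.shape)
```

Because x_t is one-hot, h_t is simply row t of W — the word vector of the target word.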
the training method of the word embedding neural network model in the step 1.3 comprises the following steps:
the training goal of the word-embedding neural network model is to reduce the difference E between the model output u_t and the one-hot vector x_I = (x_{I,1}, x_{I,2}, ..., x_{I,V}) of w_I,
where w_I ∈ {w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}} and E = −Σ_{j=1..V} x_{I,j} · log u_j;
for the N×V weight matrix W' from the projection layer to the output layer, by the chain rule the gradient δw'_{i,j} for any i, j (1 ≤ i ≤ N, 1 ≤ j ≤ V) equals
δw'_{i,j} = ∂E/∂w'_{i,j} = e_j · h_i, where e_j = u_j − x_{I,j};
according to the gradient δw'_{i,j}, and using the first moment m and second moment v computed in the previous execution of the training process (or m = 0 and v = 0 initially), the first-moment estimate and second-moment estimate of the weight w'_{i,j} are computed as
m = β_1 · m + (1 − β_1) · δw'_{i,j}, v = β_2 · v + (1 − β_2) · (δw'_{i,j})²,
where β_1 and β_2 are hyper-parameters controlling the change of the moment estimates;
with step denoting the number of completed training steps, the bias-corrected estimates are m̂ = m / (1 − β_1^step) and v̂ = v / (1 − β_2^step), and w'_{i,j} is updated as
w'_{i,j} ← w'_{i,j} − α · m̂ / (√v̂ + ε),
where α is a hyper-parameter, called the learning rate, controlling the update magnitude of w'_{i,j}, and ε is a small positive number, e.g. 10^-8, preventing division by zero in the quotient;
for the V×N weight matrix W from the input layer to the projection layer, by the chain rule the gradient δw_{j,i} for any j, i (1 ≤ j ≤ V, 1 ≤ i ≤ N) equals
δw_{j,i} = ∂E/∂w_{j,i} = x_j · Σ_{k=1..V} e_k · w'_{i,k},
and the weight w_{j,i} is updated with the same moment-based rule;
each time this step completes, the word-embedding neural network model has been trained once and its weights updated;
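The moment-based weight update used above is an Adam-style rule; it can be sketched as follows, where the default values of α, β_1, β_2 and ε are common choices assumed for the sketch, not values fixed by the patent:

```python
import numpy as np

def adam_update(w, grad, m, v, step, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One moment-based step on weight matrix w, following the update described above."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2        # second-moment estimate
    m_hat = m / (1 - beta1**step)                # correction to approx. unbiased estimate
    v_hat = v / (1 - beta2**step)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.ones((2, 2))
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adam_update(w, np.ones_like(w), m, v, step=1)
print(w)
```

At step 1 with a gradient of all ones, the bias-corrected update moves every weight by almost exactly α.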
step 1.4, establishing a convolutional neural network model;
in the convolutional neural network model, the words of a short text are first converted into word vectors as local features, while the user-specified keywords are converted into word vectors as global features; the two feature groups each pass through a convolution layer and a max-pooling layer of the model, and the score of each short text is finally computed by a multilayer perceptron with a softmax activation function, specifically:
for any word w_k in a short text of length L_w composed of words w_1, w_2, ..., w_{L_w}, its word-embedding map S_k — the N-dimensional vector output by the projection layer of the word-embedding neural network model obtained in step 1.3 — is
S_k = word2vec(w_k)
where N is the number of projection-layer neurons of the word-embedding neural network model, and also the dimension of the word vectors it outputs; stacking S_1, ..., S_{L_w} row-wise gives the local feature matrix S;
in the same manner, the topic-label feature H is obtained by stacking the word vectors of the topic labels, where L_H is the number of topic labels contained in the short text;
for the matrix S, a convolutional-layer filter W_S is defined, whose size is a hyper-parameter and whose entries are randomly initialized; the convolution-layer result for S is C_S = conv(S, W_S);
max pooling (max pooling) is applied to each row of C_S to obtain the matrix P_S;
similarly, for the matrix H a convolutional-layer filter W_H is defined, whose size is a hyper-parameter and whose entries are randomly initialized; the convolution-layer result for H is C_H = conv(H, W_H);
max pooling is applied to each row of C_H to obtain the matrix P_H;
the pooled features are concatenated into P_f, and with the weight matrix M between P_f and the logistic-regression-layer neurons the output o' = (o'_1, o'_2) of the logistic regression layer is computed as o' = P_f · M;
the output value of the softmax activation function is then computed: u' = softmax(o'), i.e. u'_j = exp(o'_j) / (exp(o'_1) + exp(o'_2)) for j = 1, 2;
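A toy numeric sketch of this scoring pipeline; the matrix sizes, the single filter per feature group, and the filter shapes are simplifying assumptions (the patent leaves the filter dimensions as hyper-parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
Lw, Lh, N = 6, 2, 4                 # illustrative text/label lengths and vector dimension
S = rng.standard_normal((Lw, N))    # local features: word vectors of the short text
H = rng.standard_normal((Lh, N))    # global features: topic-label / keyword vectors

def conv_maxpool(X, filt):
    """Valid convolution of a sliding window of rows, then max pooling over the result."""
    k = filt.shape[0]
    C = np.array([np.sum(X[i:i + k] * filt) for i in range(X.shape[0] - k + 1)])
    return C.max()                  # max pooling over the convolution-layer result

fS = rng.standard_normal((2, N))    # convolutional filters (randomly initialized)
fH = rng.standard_normal((1, N))
P_f = np.array([conv_maxpool(S, fS), conv_maxpool(H, fH)])  # concatenated pooled features
M = rng.standard_normal((2, 2))     # weights from P_f to the logistic-regression layer
o = P_f @ M                         # o' = (o'_1, o'_2)
u = np.exp(o - o.max())
u /= u.sum()                        # softmax score u' = (u'_1, u'_2)
print(u)
```

u[0] plays the role of u'_1, the predicted relevance score of the short text.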
The training method of the convolutional neural network model in step 1.4 is shown in fig. 3, and specifically includes:
the training goal of the convolutional neural network model is to reduce the difference between u' = (u'_1, u'_2) and the actual relevance a' = (a'_1, a'_2); here a'_1 ∈ [0, 1] is the degree of relevance between the length-L_w short text and the field of the keywords, a larger u'_1 indicates stronger predicted relevance, and a'_2 = 1 − a'_1;
let
E' = E'_1 + E'_2
with
E'_j = −(a'_j · log u'_j + (1 − a'_j) · log(1 − u'_j))
where 1 ≤ j ≤ 2; for the weight matrix M from P_f to o', by the chain rule the gradient δM_{i,j} for any i, j (1 ≤ i ≤ dim P_f, 1 ≤ j ≤ 2) is
δM_{i,j} = ∂E'/∂M_{i,j} = (u'_j − a'_j) · P_{f,i};
the weight M_{i,j} is then updated with the same moment-based rule:
m = β_1 · m + (1 − β_1) · δM_{i,j}, corrected to the approximately unbiased estimate m̂ = m / (1 − β_1^step);
v = β_2 · v + (1 − β_2) · (δM_{i,j})², corrected to the approximately unbiased estimate v̂ = v / (1 − β_2^step);
M_{i,j} ← M_{i,j} − α · m̂ / (√v̂ + ε);
each time this step completes, the convolutional neural network model has been trained once and its weights updated;
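The gradient δM_{i,j} = (u'_j − a'_j) · P_{f,i} drives the training; the sketch below uses plain gradient descent in place of the moment-based update for brevity, and all numeric values are made up for illustration:

```python
import numpy as np

def train_step(P_f, M, a, lr=0.01):
    """One gradient step on M: softmax output u' vs. target relevance a' (cross-entropy)."""
    o = P_f @ M
    u = np.exp(o - o.max())
    u /= u.sum()                    # softmax output u' = (u'_1, u'_2)
    grad = np.outer(P_f, u - a)     # delta M_{i,j} = (u'_j - a'_j) * P_{f,i}
    loss = -np.sum(a * np.log(u))
    return M - lr * grad, loss

P_f = np.array([0.5, -1.0])         # hypothetical pooled feature vector
M = np.zeros((2, 2))
a = np.array([1.0, 0.0])            # fully relevant short text: a'_1 = 1, a'_2 = 1 - a'_1
losses = []
for _ in range(200):
    M, loss = train_step(P_f, M, a)
    losses.append(loss)
print(losses[0], losses[-1])
```

The loss decreases over the iterations, confirming the sign of the gradient expression.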
the text recommendation part comprises the following steps:
step 2.1, as shown in Fig. 3, short-text recommendation is performed according to the scores obtained in step 1.4, and the highest-scoring short texts are recommended to the user as the latest developments of the field;
unlike step 1.4, step 2.1 applies only to last-week data that did not participate in training, and performs no back-propagation error correction; at run time, step 2.1 selects, according to the user-specified number N_top, the N_top highest-scoring short texts as the field's developments;
step 2.2, the user rates the recommended short texts by their degree of relevance, and the feedback is recorded in the local database to correct the model and improve its performance;
in this embodiment, the feedback is a score between 1 and 5 that the user gives to each of the N_top recommended short texts.
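The run-time selection of step 2.1 — keep only last-week data and return the user's N_top highest-scoring items — can be sketched as follows; the example records are hypothetical:

```python
import heapq
from datetime import datetime, timedelta

# hypothetical scored short texts: (score, publication_time, text)
now = datetime(2019, 4, 22)
scored = [
    (0.91, now - timedelta(days=2), "new CNN model for text"),
    (0.40, now - timedelta(days=1), "unrelated chatter"),
    (0.87, now - timedelta(days=30), "old but relevant post"),  # outside the last week
    (0.75, now - timedelta(days=3), "survey on embeddings"),
]

def recommend(items, n_top, now, window_days=7):
    """Filter to last-week items, then return the n_top highest-scoring ones."""
    recent = [it for it in items if now - it[1] <= timedelta(days=window_days)]
    return heapq.nlargest(n_top, recent, key=lambda it: it[0])

top = recommend(scored, n_top=2, now=now)
print([t[2] for t in top])
```

The month-old item is excluded despite its high score, matching the last-week restriction of step 2.1.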
In practical application, the scheme of the invention greatly improves the tracking of developments in a specific field within massive information, and it can be applied to any field simply by changing the field keywords.
Claims (3)
1. A field dynamic tracking method based on short texts is characterized by comprising the following steps:
step 1, performing model training, specifically:
step 1.1, data acquisition:
continuously and directionally acquiring short texts on internet media according to keywords in a specific field specified by a user, and storing the short texts in a local database;
the short text comprises words, publication time and a subject label;
step 1.2, preprocessing data;
removing noise content from the short text; the noise content comprises stop words, emoticons and web-page links;
truncating the denoised short text to a fixed length; if the short text is shorter than that length, it is padded with "<PAD>" tokens;
assigning a unique integer to each word appearing in the short texts as its identifier, so that every word is distinguished;
converting each document into the sequence of these integers as its vector representation;
step 1.3, establishing a word embedding neural network model;
training a word-embedding neural network model that takes the words appearing in the short text as input and their context information as output, the output of the network's intermediate layer serving as the word-vector representation of the target word, specifically:
a word-embedding neural network model is built to embed the information of a target word's context words into the target word's vector representation, coupling the mathematical meaning of the embedding vector with the linguistic meaning of the target word;
the one-hot vector of the target word is the input of the word-embedding neural network model;
the one-hot vector of a context word of the target word is the output of the word-embedding neural network model;
the projection vector of the target word is the output of the projection layer of the word-embedding neural network model, i.e. a word vector of the target word that contains context information and represents the target word's meaning in context;
the neural-network weights between the input layer and the projection layer of the word-embedding neural network model form a V×N matrix W = (w_{i,j}) (1 ≤ i ≤ V, 1 ≤ j ≤ N);
W is randomly initialized, and the projection-layer result h_t is a 1×N vector, i.e. h_t = x_t · W,
where V is the total number of words in the lexicon, N is a hyper-parameter giving the number of projection-layer neurons, and w_t is the target word;
x_t is the input of the neural network model for target word w_t, i.e. its one-hot vector; in x_t = (x_1, x_2, ..., x_V) exactly one component equals 1 and the rest are 0;
the neural-network weights between the projection layer and the output layer form an N×V matrix W' = (w'_{i,j}) (1 ≤ i ≤ N, 1 ≤ j ≤ V);
W' is randomly initialized, and the output-layer result o_t is a 1×V vector, i.e. o_t = h_t · W';
a softmax classifier is applied to the output-layer result o_t = [o_1, o_2, ..., o_V] to approximate the one-hot vector x_I of context word w_I ∈ {w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}}, giving the model output u_t = softmax(o_t), i.e. u_j = exp(o_j) / Σ_{k=1..V} exp(o_k);
step 1.4, establishing a convolutional neural network model;
in the convolutional neural network model, the words of a short text are first converted into word vectors as local features, while the user-specified keywords are converted into word vectors as global features; the two feature groups each pass through a convolution layer and a pooling layer of the model, and the score of each short text is finally computed by a multilayer perceptron with a softmax activation function, specifically:
for any word w_k in a short text of length L_w composed of words w_1, w_2, ..., w_{L_w}, its word-embedding map S_k — the N-dimensional vector output by the projection layer of the word-embedding neural network model obtained in step 1.3 — is
S_k = word2vec(w_k)
where N is the number of projection-layer neurons of the word-embedding neural network model, and also the dimension of the word vectors it outputs; stacking S_1, ..., S_{L_w} row-wise gives the local feature matrix S;
in the same manner, the topic-label feature H is obtained by stacking the word vectors of the topic labels, where L_H is the number of topic labels contained in the short text;
for the matrix S, a convolution-layer filter is defined as a matrix whose dimensions are hyper-parameters and whose entries are randomly initialized; the convolution-layer result of S is C_S;

max pooling is performed over each row of C_S to obtain the matrix P_S;

similarly, for the matrix H, a convolution-layer filter is defined as a randomly initialized matrix whose dimensions are hyper-parameters; the convolution-layer result of H is C_H;

max pooling is performed over each row of C_H to obtain the matrix P_H;
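The convolution-plus-pooling step can be sketched as follows; the matrix S, the filter sizes, and all values are illustrative, not taken from the patent:

```python
def conv_rows(S, filt):
    """Slide a d x N filter down the rows of S (valid convolution),
    producing one activation per window position."""
    d = len(filt)
    out = []
    for start in range(len(S) - d + 1):
        window = S[start:start + d]
        out.append(sum(w_ij * f_ij
                       for w_row, f_row in zip(window, filt)
                       for w_ij, f_ij in zip(w_row, f_row)))
    return out

# Toy 4 x 2 feature matrix S and two 2 x 2 filters (the filter height d
# and all values are invented hyper-parameters).
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [[[1.0, 0.0], [0.0, 1.0]],
           [[0.5, 0.5], [0.5, 0.5]]]

C_S = [conv_rows(S, f) for f in filters]  # one row of C_S per filter
P_S = [max(row) for row in C_S]           # per-row max pooling
```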
the output o′ = (o′_1, o′_2) of the logistic-regression layer is then calculated from P_f and the weight matrix M between P_f and the neurons of the logistic-regression layer, and the output value u′ of the softmax activation function is computed from o′:
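A sketch of this scoring step, with a hypothetical pooled feature vector P_f and an invented 3 × 2 weight matrix M (neither is specified numerically in the patent):

```python
import math

def lr_softmax(P_f, M):
    """Compute o' = P_f . M followed by softmax, giving the
    two-component score u' of a short text."""
    o = [sum(p * m for p, m in zip(P_f, col)) for col in zip(*M)]
    mx = max(o)
    exps = [math.exp(v - mx) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pooled features and weights: 3 features -> 2 output neurons.
P_f = [2.0, 1.5, 0.5]
M = [[0.4, -0.4],
     [0.3, -0.3],
     [0.2, -0.2]]
u_prime = lr_softmax(P_f, M)  # u_prime[0] serves as the short text's score
```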
Step 2, text recommendation is carried out, specifically:
step 2.1, short texts are recommended according to the scores obtained in step 1.4, and the several highest-scoring short texts are recommended to the user as the dynamic progress of the field;

unlike step 1.4, step 2.1 only targets data from the last week that did not participate in training, and performs no back-propagation error correction; at run time, step 2.1 screens out the N_top highest-scoring items, where N_top is specified by the user, as the field dynamics;
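The screening in step 2.1 reduces to a sort-and-slice; the candidate texts and scores below are invented for illustration:

```python
def top_n(scored_texts, n_top):
    """Return the n_top highest-scoring short texts as the field dynamics."""
    return sorted(scored_texts, key=lambda pair: pair[1], reverse=True)[:n_top]

# Hypothetical (text, score) pairs from applying step 1.4's scoring model
# to last week's untrained data.
candidates = [("text A", 0.91), ("text B", 0.34), ("text C", 0.77)]
recommended = top_n(candidates, n_top=2)
```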
step 2.2, the user gives feedback on the recommended short texts according to their degree of relevance, and the feedback is recorded in a local database.
2. The short text-based field dynamic tracking method according to claim 1, wherein the training method of the word embedding neural network model in step 1.3 is:
the training goal of the word embedding neural network model is to reduce the difference between the model output u_t and the one-hot vector x_I = (x_{I,1}, x_{I,2}, ..., x_{I,V}) of the context word w_I, wherein w_I ∈ {w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}}, and the training error is defined over this difference;
for the N × V-dimensional weight matrix W′ from the projection layer to the output layer, according to the chain rule, for any i, j (1 ≤ i ≤ N, 1 ≤ j ≤ V), the gradient δW′_{i,j} is equal to
Wherein
Therefore,
according to the gradient δW′_{i,j}, and using the first moment and second moment calculated in the previous execution of the training process (or their initial values), the first-moment estimate and second-moment estimate of the weight W′_{i,j} are calculated:
wherein β_1 and β_2 are hyper-parameters used to control the change of the first-moment and second-moment estimates;
wherein step is the number of training steps completed so far; W′_{i,j} is then updated as
where α is a hyper-parameter called the learning rate, which controls the update magnitude of W′_{i,j}; ε is a small positive constant that prevents division by zero;
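The moment estimates, bias correction, and ε-guarded update described above match the well-known Adam update rule; a single-weight sketch (the hyper-parameter values are the common defaults, which are not specified in this excerpt):

```python
import math

def adam_step(w, grad, m, v, step,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update of a single weight w given its gradient,
    carrying the first moment m and second moment v across steps."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** step)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** step)             # bias-corrected second moment
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Initial weight and zero moments; one update with an illustrative gradient.
w, m, v = 0.5, 0.0, 0.0
w, m, v = adam_step(w, grad=0.2, m=m, v=v, step=1)
```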
for the V × N-dimensional weight matrix W from the input layer to the projection layer, for any j, i (1 ≤ j ≤ V, 1 ≤ i ≤ N), according to the chain rule, the gradient δW_{j,i} is equal to
and update the weight Wj,i:
the word embedding neural network model is trained once each time a step is completed, thereby updating the weights.
3. The short text-based field dynamic tracking method according to claim 1, wherein the training method of the convolutional neural network model in step 1.4 specifically comprises:
the training target of the convolutional neural network model is to reduce the difference between u′ = (u′_1, u′_2) and the actual degree of correlation a′ = (a′_1, a′_2); wherein a′_1 ∈ [0, 1] is the degree of correlation between the short text and the field to which the keywords belong (the larger u′_1 is, the stronger the correlation), and a′_2 = 1 - a′_1;
let

E′ = E′_1 + E′_2

where each component is the cross-entropy

E′_j = -(a′_j log u′_j + (1 - a′_j) log(1 - u′_j))
wherein 1 ≤ j ≤ 2; for the weight matrix M from P_f to o′, according to the chain rule, for any i, j, the gradient δM_{i,j} is
Wherein
Therefore,
and update the weight Mi,j:
Wherein
Therefore,
which is corrected to an approximately unbiased estimate:
Wherein
Therefore,
which is corrected to an approximately unbiased estimate:
the convolutional neural network model is trained once each time a step is completed, thereby updating the weights.
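The per-component loss E′_j used in this training procedure is a binary cross-entropy; a sketch, assuming the standard form E′_j = -(a′_j log u′_j + (1 - a′_j) log(1 - u′_j)) with invented target and output values:

```python
import math

def cross_entropy(a, u):
    """E' = E'_1 + E'_2 with E'_j = -(a_j log u_j + (1 - a_j) log(1 - u_j))."""
    return -sum(aj * math.log(uj) + (1 - aj) * math.log(1 - uj)
                for aj, uj in zip(a, u))

a_prime = (1.0, 0.0)   # actual relevance, with a'_2 = 1 - a'_1
u_prime = (0.9, 0.1)   # illustrative model output
loss = cross_entropy(a_prime, u_prime)
```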
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910322267.5A CN110083676B (en) | 2019-04-22 | 2019-04-22 | Short text-based field dynamic tracking method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110083676A CN110083676A (en) | 2019-08-02 |
CN110083676B true CN110083676B (en) | 2021-12-03 |
Family
ID=67415993
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910322267.5A Active CN110083676B (en) | 2019-04-22 | 2019-04-22 | Short text-based field dynamic tracking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110083676B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110795937A (en) * | 2019-09-25 | 2020-02-14 | 卓尔智联(武汉)研究院有限公司 | Information processing method, device and storage medium |
CN111460105B (en) * | 2020-04-02 | 2023-08-29 | 清华大学 | Topic mining method, system, equipment and storage medium based on short text |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104834747A (en) * | 2015-05-25 | 2015-08-12 | 中国科学院自动化研究所 | Short text classification method based on convolution neutral network |
CN107562729A (en) * | 2017-09-14 | 2018-01-09 | 云南大学 | The Party building document representation method strengthened based on neutral net and theme |
CN107656990A (en) * | 2017-09-14 | 2018-02-02 | 中山大学 | A kind of file classification method based on two aspect characteristic informations of word and word |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170308790A1 (en) * | 2016-04-21 | 2017-10-26 | International Business Machines Corporation | Text classification by ranking with convolutional neural networks |
- 2019-04-22: CN CN201910322267.5A patent/CN110083676B/en (Active)
Non-Patent Citations (3)
Title |
---|
Supervised Deep Polylingual Topic Modeling for Scholarly Information Recommendations; Daniel Ramage et al.; ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods; 2018-12-31; full text *
A Text Classification Method Based on Deep Learning and Labeled_LDA; Pang Yuming; China Master's Theses Full-text Database, Information Science & Technology; 2018-03-15; full text *
User Interest Recognition Based on Topic-Enhanced Convolutional Neural Networks; Du Yumeng et al.; Journal of Computer Research and Development; 2018-01-31; full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109992648B (en) | Deep text matching method and device based on word migration learning | |
CN107330049B (en) | News popularity estimation method and system | |
CN110019685B (en) | Deep text matching method and device based on sequencing learning | |
CN114565104A (en) | Language model pre-training method, result recommendation method and related device | |
CN110516245A (en) | Fine granularity sentiment analysis method, apparatus, computer equipment and storage medium | |
CN111881334A (en) | Keyword-to-enterprise retrieval method based on semi-supervised learning | |
CN107688870B (en) | Text stream input-based hierarchical factor visualization analysis method and device for deep neural network | |
WO2021139415A1 (en) | Data processing method and apparatus, computer readable storage medium, and electronic device | |
US20220318317A1 (en) | Method for disambiguating between authors with same name on basis of network representation and semantic representation | |
WO2023272748A1 (en) | Academic accurate recommendation-oriented heterogeneous scientific research information integration method and system | |
CN110046223B (en) | Film evaluation emotion analysis method based on improved convolutional neural network model | |
CN113822776B (en) | Course recommendation method, device, equipment and storage medium | |
CN110083676B (en) | Short text-based field dynamic tracking method | |
CN110889282A (en) | Text emotion analysis method based on deep learning | |
CN114328834A (en) | Model distillation method and system and text retrieval method | |
CN110019653A (en) | A kind of the social content characterizing method and system of fusing text and label network | |
CN109086463A (en) | A kind of Ask-Answer Community label recommendation method based on region convolutional neural networks | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
CN110874392B (en) | Text network information fusion embedding method based on depth bidirectional attention mechanism | |
CN110569355B (en) | Viewpoint target extraction and target emotion classification combined method and system based on word blocks | |
CN111914553A (en) | Financial information negative subject judgment method based on machine learning | |
CN114492451A (en) | Text matching method and device, electronic equipment and computer readable storage medium | |
CN113722439A (en) | Cross-domain emotion classification method and system based on antagonism type alignment network | |
CN114282528A (en) | Keyword extraction method, device, equipment and storage medium | |
CN112966096A (en) | Cloud service discovery method based on multi-task learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||