CN115481313A

CN115481313A - News recommendation method based on text semantic mining

Info

Publication number: CN115481313A
Application number: CN202110668465.4A
Authority: CN
Inventors: 王海艳; 胡阳; 骆健
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2021-06-16
Filing date: 2021-06-16
Publication date: 2022-12-16

Abstract

The invention relates to a news recommendation method based on text semantic mining, comprising the following steps: acquiring news item information and log information of user reading history from a database; establishing a word embedding matrix for the titles of the news items from pre-trained word vectors; obtaining a news title embedding vector through feature extraction; performing topic modeling on the news text content with a neural topic model to obtain a news content topic embedding vector, and forming the final news feature representation; analyzing the behavior data in the user reading history log, and extracting the user feature representation on the user side from the user's reading records; and introducing a time decay function into the model to generate a candidate set of the top N recommended news items. By analyzing the word-level features and the topic features of news items with a bidirectional recurrent neural network and a neural topic model, the invention mines the rich semantic information in news texts, represents the features of news items more accurately, and improves the recommendation effect.

Description

News recommendation method based on text semantic mining

Technical Field

The invention belongs to the technical field of news recommendation, and particularly relates to a news recommendation method based on text semantic mining.

Background

The growth of the World Wide Web (WWW) has gradually changed the way people find and read news, from traditional print media to online portals. To alleviate information overload, recommendation systems are widely used in modern online services to help users quickly find the content relevant to them. A single user typically has multiple interests, which are reflected in the different news items that the user browses. Meanwhile, important semantic features of news are hidden in text segments of different granularities, and news content carries many different types of topic information, all of which matters for learning accurate news and user representations for news recommendation. However, existing news recommendation methods usually overlook the fact that fine-grained modeling of news items can improve the recommendation effect.

CN2014104033786 discloses a news recommendation system in which data correlation analysis builds a new personalized news recommendation hypergraph model by mining the internal relations between phrases. In the hypergraph model, nodes represent phrases, edges represent internal relations between the phrases, and the weight of an edge represents the contribution of the relation. However, this system obtains high-quality clients from historical data and recommends to users news that is newly published or lacks sufficient access records, and its news image recognition effect is poor.

CN201510242541X discloses a news recommendation method based on user interests, which stores and processes news data by means of an interactive user side and a recommendation system. However, the data processing time of this method is long, and because news is updated quickly, the computational overhead of calculating news similarity is very large.

Disclosure of Invention

In order to solve the above problems, the invention provides a news recommendation method based on text semantic mining. The method extracts word-level semantic information from news titles with a bidirectional recurrent neural network to obtain news title representation vectors, models the topics of the news text content with a neural topic model to obtain topic semantic information, and concatenates the topic semantic information with the title embedding vectors to jointly represent the features of a news item. An attention network assigns weights to the different interests in the user's historical reading records to obtain the feature representation vector of the user side. A time decay function is introduced into the model, and a score function is computed over the user-side feature representation and the feature representation of the target news item to generate the news items that best match the user's reading interests, thereby realizing news recommendation. The method specifically comprises the following steps:

S10, acquiring news item information and log information of the user's reading history from a database, which specifically comprises the following steps:

S11, obtaining news item information from a database, including news numbers, news titles, news contents, and release date and time stamps;

S12, obtaining the news items browsed by users from a user database, including user numbers, news numbers and user reading timestamps;

S13, sorting and preprocessing the acquired information to obtain a training set and a test set.

S20, establishing a word embedding matrix for the titles of the news items from pre-trained word vectors, wherein the specific method is as follows: for each news article, the headline is represented as a matrix of word embeddings, that is, each of the n words in the headline is mapped to a d-dimensional vector that captures the meaning of the word, and word embedding converts the news headline from a sequence of words into a semantic vector matrix $X_{1:n} = [x_1, x_2, \ldots, x_n]$.

S30, extracting features from the news headline embedding matrix with a bidirectional recurrent neural network to obtain a news headline embedding vector, which specifically comprises the following steps:

S31, extracting the title features of the news with the bidirectional recurrent neural network and capturing the context information of the word sequence:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

where $\sigma(\cdot)$ is the activation function, $i_t$ is the input gate, $f_t$ is the forget gate, $o_t$ is the output gate, the candidate information is the content to be written into the cell state at the current time t, and b is a bias vector;

S32, learning the headline sequence by forward propagation and backward propagation, and calculating the hidden state H:

S33, extracting the more informative features of the news headline with an attention network so as to select the key words in the headline, and computing a weighted sum of the context representations of the words with the attention weights to obtain the feature representation vector $v_t$ of the news headline, as follows:
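Steps S31 to S33 can be illustrated with a minimal NumPy sketch. Only the three gate equations come from the text; the sigmoid activation, the candidate and cell-state updates (standard LSTM form), the concatenation of forward and backward states into H, and the dot-product attention with a learned `query` vector are assumptions made for illustration, as are all function and parameter names.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(d, h, rng):
    """Random weights for gates i, f, o and candidate c (hypothetical init)."""
    p = {}
    for g in "ifoc":
        p["W" + g] = rng.normal(scale=0.1, size=(h, d))
        p["U" + g] = rng.normal(scale=0.1, size=(h, h))
        p["b" + g] = np.zeros(h)
    return p

def lstm_step(x_t, h_prev, c_prev, p):
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output gate
    # Assumption: standard LSTM candidate and state updates (not in the text).
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c

def run_lstm(X, p, h_dim):
    """Run one direction over the rows of X and return all hidden states."""
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    states = []
    for x_t in X:
        h, c = lstm_step(x_t, h, c, p)
        states.append(h)
    return np.stack(states)

def bilstm_attention(X, fwd, bwd, h_dim, query):
    H_fwd = run_lstm(X, fwd, h_dim)
    H_bwd = run_lstm(X[::-1], bwd, h_dim)[::-1]  # re-align to word order
    H = np.concatenate([H_fwd, H_bwd], axis=1)   # hidden state H, shape (n, 2h)
    scores = H @ query                           # assumed dot-product scoring
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # attention weights over words
    return weights @ H, weights                  # v_t is the weighted sum

rng = np.random.default_rng(0)
d, h_dim, n = 100, 16, 5
X = rng.normal(size=(n, d))                      # stand-in title matrix
fwd, bwd = init_params(d, h_dim, rng), init_params(d, h_dim, rng)
query = rng.normal(size=2 * h_dim)
v_t, weights = bilstm_attention(X, fwd, bwd, h_dim, query)
```

In this sketch `v_t` has twice the hidden size because the two directions are concatenated; a trained model would learn the weights rather than sample them.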

S40, using a neural topic model to perform topic modeling on the news text content to obtain a news content topic embedding vector, and integrating the news title embedding vector and the news content topic embedding vector to form the final news feature representation vector, wherein the method for topic modeling of the news text content comprises the following steps:

S41, adopting a variational autoencoder framework that learns latent topics through encoding and decoding: let x be the bag-of-words representation of a given news text over the vocabulary V; in the encoder, $\mu = f_\mu(x)$ and $\log\sigma = f_\sigma(x)$, where $\mu$ and $\sigma$ are the prior parameters that parameterize the topic distribution in the decoder network, and $f_\mu$, $f_\sigma$ are linear transformations with a ReLU activation function;

S42, generating document topics with the decoder, and drawing the topic distribution with a Gaussian softmax function, namely $z \sim \mathcal{N}(\mu, \sigma^2)$, $\theta = \mathrm{softmax}(z)$,

where z is the latent topic variable, $\theta$ is the topic distribution, and k is the predefined number of topics. The probability p of the predicted words is learned through a weight matrix that is similar to the topic-word distribution matrix of the LDA topic model, each entry of which represents the correlation between the i-th word and the j-th topic. Each word is drawn from p to reconstruct the input x, and the intermediate parameter $w_0$ and $\theta$ are further used to construct the topic representation, as follows:

wherein the topic representations form a set with a predefined dimension d, the mapping applied is a linear transformation with a ReLU activation function, and $v_c$ is the weighted sum of the topic representations, regarded as the overall topic representation of the news item;

S43, integrating the news headline embedding vector and the news content topic embedding vector to form the final news feature representation vector v, as follows: $v = \mathrm{Concat}(v_t, v_c)$.
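The forward pass of steps S41 to S43 can be sketched as follows. This is a rough illustration under stated assumptions, not the patented model: the parameter names (`W_mu`, `W_sigma`, `W_topic_word`, `topic_embeddings`), the use of a softmax over the decoder output, and the treatment of the topic representations as a ready-made (k, d) matrix are all hypothetical, and training (the reconstruction and KL terms of a variational autoencoder) is omitted.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def relu(a):
    return np.maximum(a, 0.0)

def topic_forward(x, params, rng):
    # Encoder: mu = f_mu(x), log sigma = f_sigma(x), each linear + ReLU.
    mu = relu(params["W_mu"] @ x + params["b_mu"])
    log_sigma = relu(params["W_sigma"] @ x + params["b_sigma"])
    # Gaussian softmax: z ~ N(mu, sigma^2), theta = softmax(z).
    z = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)
    theta = softmax(z)
    # Decoder: word probabilities p from a topic-word weight matrix (assumed form).
    p = softmax(params["W_topic_word"] @ theta)
    # Overall topic representation: theta-weighted sum of topic vectors.
    v_c = theta @ params["topic_embeddings"]
    return theta, p, v_c

rng = np.random.default_rng(0)
V, k, d = 50, 8, 16                       # vocabulary size, topics, topic dim
params = {
    "W_mu": rng.normal(scale=0.1, size=(k, V)), "b_mu": np.zeros(k),
    "W_sigma": rng.normal(scale=0.1, size=(k, V)), "b_sigma": np.zeros(k),
    "W_topic_word": rng.normal(size=(V, k)),
    "topic_embeddings": rng.normal(size=(k, d)),
}
x = rng.integers(0, 3, size=V).astype(float)   # toy bag-of-words counts
theta, p, v_c = topic_forward(x, params, rng)

v_t = np.zeros(32)                        # placeholder for the title vector
v = np.concatenate([v_t, v_c])            # v = Concat(v_t, v_c)
```

The concatenation at the end mirrors S43: the title vector and the topic vector keep their own dimensions and are simply joined end to end.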

S50, analyzing the behavior data in the user reading history log, and extracting a user preference feature expression vector at the user side according to the user reading record;

The method for analyzing the behavior data in the user reading history log and extracting the user feature representation vector on the user side from the user's reading records is as follows:

The news read by the user is compared with the target news by an attention network to obtain the final embedding vector of the user. Denote the historical reading record of user u as $\{d_1, d_2, \ldots, d_n\}$; its embeddings can be expressed as $\{v_1, v_2, \ldots, v_n\}$. The attention network assigns different weights to the news items clicked by the user so as to learn the user's different interests in the various news items. For a target news item, the similarity between the embedded representation of each news item read by the user and the target news item is calculated, and the user representation vector with respect to the target news item is computed from the embedded representations of the news in the user's historical reading record, wherein the interest weight of each news item read by the user is determined by this similarity. In the attention mechanism, the query is the target news, and the news items in the user's reading history are the keys and values. The resulting user feature representation is as follows:

$v_u = \mathrm{Attention}(q, k, v) = \mathrm{softmax}(q k^{T})\, v$.
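The attention formula above can be sketched in NumPy as follows, with the target news embedding as the query and the embeddings of the user's read news as both keys and values. The function name `user_vector` and the toy vectors are illustrative assumptions.

```python
import numpy as np

def user_vector(q, history):
    """Compute softmax(q k^T) v, using the history embeddings as keys and values."""
    scores = history @ q                  # similarity of each read item to the target
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # interest weights, summing to 1
    return w @ history, w

history = np.array([[1.0, 0.0],           # embeddings of three read news items
                    [0.0, 1.0],
                    [1.0, 1.0]])
target = np.array([2.0, 0.0])             # embedding of the target news item
v_u, weights = user_vector(target, history)
```

Here the first and third read items are equally similar to the target and receive equal weight, while the second, which is orthogonal to the target, receives the smallest weight.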

S60, introducing a time decay factor into the model, and calculating the similarity between the user preference feature representation vector and the news feature representation vector with a score function to generate a candidate set of the top N recommended news items.

The time decay function is defined here as:

where $\lambda$ is a parameter tuned during training to control the decay rate of news, and t and $t_0$ denote the reading time and the news release time, respectively. The top N candidate news items are retrieved according to a score function, as follows:

where N is a predefined number of items to be retrieved during the matching phase.
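The retrieval in S60 can be illustrated with the sketch below. The exact decay and score formulas are not reproduced in this text, so the sketch assumes an exponential decay $e^{-\lambda (t - t_0)}$ and a score equal to the dot product of the user and news vectors multiplied by that decay. Both choices, and the names `time_decay` and `top_n`, are assumptions rather than the patent's definitions.

```python
import math
import numpy as np

def time_decay(t, t0, lam):
    # Assumed form: exponential decay in the age of the news item.
    return math.exp(-lam * (t - t0))

def top_n(v_u, candidates, now, lam, n):
    """Rank candidates given as (news_id, vector, release_time) tuples."""
    scored = [
        (news_id, float(v_u @ vec) * time_decay(now, t0, lam))  # assumed score
        for news_id, vec, t0 in candidates
    ]
    scored.sort(key=lambda s: s[1], reverse=True)
    return scored[:n]

v_u = np.array([1.0, 0.0])
candidates = [
    ("old", np.array([1.0, 0.0]), 0.0),   # best match, but released long ago
    ("new", np.array([0.8, 0.0]), 9.0),   # slightly weaker match, recent
    ("off", np.array([0.0, 1.0]), 9.0),   # recent but unrelated
]
ranked = top_n(v_u, candidates, now=10.0, lam=0.1, n=2)
ranked_ids = [news_id for news_id, _ in ranked]
```

With these toy values the recent item outranks the older, better-matching one, which is the effect the decay factor is meant to produce; a larger $\lambda$ penalizes older news more strongly.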

The invention has the beneficial effects that:

(1) The invention uses a bidirectional recurrent neural network and a neural topic model to learn news headlines and news contents at different granularities, which effectively extracts the word-level semantic information and the topic semantic information of news items and enriches their feature representation.

(2) The invention uses an attention network to model the influence of the news items in the user's historical reading records on the target news, so that the user's different interests are represented and the accuracy of the user-side preference representation is improved.

(3) The introduction of a time decay function addresses the timeliness of news recommendation to a certain extent, so that news can be recommended in real time.

Drawings

FIG. 1 is an overall flow diagram of an embodiment of the present invention.

Detailed Description

In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details should not be taken to limit the invention. That is, in some embodiments of the invention, such implementation details are not necessary.

As shown in fig. 1, the present invention is a news recommendation method based on text semantic mining, which specifically includes the following steps:

S10, acquiring news item information and log information of the user's reading history from the database.

First, news item information including news numbers, news headlines, news contents, and release date and time stamps is obtained from a database.

Then, the news items browsed by users are obtained from the user database, including user numbers, news numbers and user reading timestamps.

Finally, the acquired information is sorted and preprocessed to obtain a training set and a test set.
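The text does not specify how the training and test sets are separated. As one possible illustration, the sketch below groups the reading logs by user and splits them at a cutoff timestamp; the chronological split, the tuple layout and the name `split_logs_by_time` are assumptions.

```python
from collections import defaultdict

def split_logs_by_time(logs, cutoff):
    """Group reading logs by user and split them at a cutoff timestamp.

    logs: iterable of (user_id, news_id, read_timestamp) tuples.
    Reads strictly before the cutoff go to training, the rest to testing.
    """
    train, test = defaultdict(list), defaultdict(list)
    for user_id, news_id, ts in sorted(logs, key=lambda r: r[2]):
        target = train if ts < cutoff else test
        target[user_id].append((news_id, ts))
    return dict(train), dict(test)

logs = [
    ("u1", "n3", 30), ("u1", "n1", 10), ("u2", "n2", 20), ("u1", "n4", 40),
]
train_set, test_set = split_logs_by_time(logs, cutoff=25)
```

Sorting by timestamp first keeps each user's history in reading order, which the later attention and time decay steps rely on.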

S20, establishing a word embedding matrix for the titles of the news items from pre-trained word vectors.

The headline of each news article can be represented as a matrix of word embeddings: each of the n words in the headline is mapped to a d-dimensional vector that captures the meaning of the word, which converts the news headline from a sequence of words into a semantic vector matrix $X_{1:n} = [x_1, x_2, \ldots, x_n]$. The word embeddings can come from any pre-trained word embedding model, such as fastText, word2vec, or GloVe. This example uses word vectors pre-trained with the fastText model, and the dimension d of the word vectors is 100.
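Building the title matrix can be sketched as a simple lookup. The dictionary `toy_vectors` stands in for real pre-trained fastText vectors, and the zero-vector fallback for out-of-vocabulary words is an assumption; neither is taken from the text.

```python
import numpy as np

def title_embedding_matrix(title_tokens, word_vectors, dim=100):
    """Stack one d-dimensional vector per title word into an (n, d) matrix.

    Words missing from the lookup fall back to a zero vector (an assumption).
    """
    rows = [word_vectors.get(tok, np.zeros(dim)) for tok in title_tokens]
    return np.stack(rows)

# Toy lookup standing in for pre-trained word vectors with d = 100.
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=100) for w in ["stocks", "rally", "after"]}
X = title_embedding_matrix(["stocks", "rally", "today"], toy_vectors)
```

Each row of `X` corresponds to one word of the headline, in order, so the matrix can be fed directly to a sequence model.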

Firstly, the title features of the news are extracted with a bidirectional recurrent neural network, and the context information of the word sequence is captured:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

where $\sigma(\cdot)$ is the activation function, $i_t$ is the input gate, $f_t$ is the forget gate, and $o_t$ is the output gate.

Then, the headline sequence is learned by forward propagation and backward propagation respectively, and the hidden state H is calculated:

Finally, the more informative features of the news headline are extracted with an attention network so as to select the key words in the headline, and the context representations of the words are summed with the attention weights to obtain the feature representation vector $v_t$ of the news headline, as follows:

S40, using a neural topic model to perform topic modeling on the news text content to obtain a news content topic embedding vector, and integrating the news title embedding vector and the news content topic embedding vector to form the final news feature representation vector.

S42, generating document topics with the decoder, and drawing the topic distribution with a Gaussian softmax function, namely $z \sim \mathcal{N}(\mu, \sigma^2)$, $\theta = \mathrm{softmax}(z)$.

The probability p of the predicted words is learned through a weight matrix that is similar to the topic-word distribution matrix of the LDA topic model, each entry of which represents the correlation between the i-th word and the j-th topic. Each word is drawn from p to reconstruct the input x, and the intermediate parameter $w_0$ and $\theta$ are further used to construct the topic representation, as follows:

wherein the topic representations form a set with a predefined dimension d, the mapping applied is a linear transformation with a ReLU activation function, and $v_c$ is the weighted sum of the topic representations, which can be regarded as the overall topic representation of the news item;

S50, analyzing the behavior data in the user reading history log, and extracting user preference characteristic representation according to the user reading record at the user side;

The news read by the user is compared with the target news by an attention network to obtain the final embedding vector of the user. The historical reading record of user u is denoted as $\{d_1, d_2, \ldots, d_n\}$, and its embeddings can be expressed as $\{v_1, v_2, \ldots, v_n\}$. The attention network assigns different weights to the news items clicked by the user so as to learn the user's different interests in each news item. For a target news item, the similarity between the embedded representation of each news item read by the user and the target news item is calculated, and the user representation vector with respect to the target news item is computed from the embedded representations of the news in the user's historical reading record, wherein the interest weight of each news item read by the user is determined by this similarity. In the attention mechanism, the query is the target news, and the news items in the user's reading history are the keys and values. The resulting user preference feature representation is as follows:

$v_u = \mathrm{Attention}(q, k, v) = \mathrm{softmax}(q k^{T})\, v$.

The time decay function is defined as:

where N is the predefined number of items to be retrieved in the matching phase, which in this example is 50.

By analyzing the word-level features and the topic features of news items with a bidirectional recurrent neural network and a neural topic model, the invention mines the rich semantic information in news texts, represents the features of news items more accurately, and improves the recommendation effect.

The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

1. A news recommendation method based on text semantic mining is characterized in that: the recommendation method comprises the following steps:

S10, acquiring news item information and log information of the user's reading history from a database;

S20, establishing a word embedding matrix for the titles of the news items from pre-trained word vectors;

S30, extracting features from the news headline embedding matrix with a bidirectional recurrent neural network to obtain a news headline embedding vector;

S40, using a neural topic model to perform topic modeling on the news text content to obtain a news content topic embedding vector, and integrating the news title embedding vector and the news content topic embedding vector to form the final news feature representation vector;

2. The news recommendation method based on text semantic mining as claimed in claim 1, wherein: the step S30 of extracting features of the news headline embedding matrix by using the bidirectional recurrent neural network to obtain a news headline embedding vector specifically includes the following steps:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

where $\sigma(\cdot)$ is the activation function, $i_t$ is the input gate, $f_t$ is the forget gate, and $o_t$ is the output gate;

S33, extracting the more informative features of the news headline with an attention network so as to select the key words in the headline, and computing a weighted sum of the context representations of the words with the attention weights to obtain the feature representation vector $v_t$ of the news headline, as follows:

3. The news recommendation method based on text semantic mining as claimed in claim 1, wherein: the method for topic modeling of news text content in the step S40 includes the following steps:

S41, adopting a variational autoencoder framework that learns latent topics through encoding and decoding: let x be the bag-of-words representation of a given news text over the vocabulary V; in the encoder, $\mu = f_\mu(x)$ and $\log\sigma = f_\sigma(x)$, where $\mu$ and $\sigma$ are the prior parameters that parameterize the topic distribution in the decoder network, and $f_\mu$, $f_\sigma$ are linear transformations with a ReLU activation function;

where z is the latent topic variable, $\theta$ is the topic distribution, and k is the predefined number of topics. The probability p of the predicted words is learned through a weight matrix that is similar to the topic-word distribution matrix of the LDA topic model, and an intermediate parameter and $\theta$ are used to construct the topic representation, as follows:

wherein the topic representations form a set with a predefined dimension d, and the mapping applied is a linear transformation with a ReLU activation function.

4. The news recommendation method based on text semantic mining as claimed in claim 1, wherein: the method for analyzing the behavior data in the user reading history log in the step S50 and extracting the user preference feature expression vector according to the user reading record at the user side includes the following steps:

S51, comparing the news read by the user with the target news by an attention network to obtain the final embedding vector of the user, wherein the historical reading record of user u is denoted as $\{d_1, d_2, \ldots, d_n\}$ and its embeddings can be expressed as $\{v_1, v_2, \ldots, v_n\}$;

S52, assigning different weights to the news items clicked by the user with the attention network so as to learn the user's different interests in the news items; for a target news item, calculating the similarity between the embedded representation of each news item read by the user and the target news item, and computing the user representation vector with respect to the target news item from the embedded representations of the news in the user's historical reading record, wherein the interest weight of each news item read by the user is determined by this similarity;

S53, in the attention mechanism, the query is the target news, the news items in the user's reading history are the keys and values, and the resulting user preference feature representation is as follows:

$v_u = \mathrm{Attention}(q, k, v) = \mathrm{softmax}(q k^{T})\, v$.

5. The news recommendation method based on text semantic mining as claimed in claim 1, wherein: the time decay function in said step S60 is defined as:

6. The news recommendation method based on text semantic mining as claimed in claim 1, wherein: the step S10 specifically includes the steps of:

S11, obtaining news item information from a database, wherein the news item information comprises news numbers, news titles, news contents, and release date and time stamps;

7. The news recommendation method based on text semantic mining as claimed in claim 1, wherein: the method for establishing the news headline embedding matrix in the step S20 comprises the following steps: for each news article, the headline is represented as a matrix of word embeddings, that is, each of the n words in the headline is mapped to a d-dimensional vector that captures the meaning of the word, and word embedding converts the news headline from a sequence of words into a semantic vector matrix $X_{1:n} = [x_1, x_2, \ldots, x_n]$.