CN107562912B

CN107562912B - Sina microblog event recommendation method

Info

Publication number: CN107562912B
Application number: CN201710816042.6A
Authority: CN
Inventors: 于富财; 刘�东; 胡光岷; 费高雷
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-09-12
Filing date: 2017-09-12
Publication date: 2021-08-27
Anticipated expiration: 2037-09-12
Also published as: CN107562912A

Abstract

The invention discloses a method for recommending Sina microblog events, aimed at the low accuracy of current recommendation algorithms for short social texts. The similarity between a user model and an event vector is calculated with an improved cosine angle algorithm, and the event is pushed to the user if the similarity exceeds a set threshold. The user model is updated with the events that arrived within the most recent period, so that it tracks the latest development of an event, and it is updated again according to the user's like behavior, so that the recommendation results better match the user's expectations. The method recommends Sina microblog events with high accuracy, lets the model drift in a controlled way, and responds promptly to user feedback on the recommendation results.

Description

Sina microblog event recommendation method

Technical Field

The invention belongs to the field of data mining, and particularly relates to a social network text recommendation technology.

Background

Microblogs have developed rapidly as a new communication medium. They spread information quickly, are highly interactive, and are easy to update; they have begun to exert great influence on social life and have become one of the main social network media in China. Because people can publish information anytime and anywhere through web pages and other channels and share it instantly, more and more people like to share information, exchange opinions, and express emotions on microblogs. Compared with traditional media, microblogs are simple and convenient to use, and this low barrier to entry gives them a commanding position in releasing information about many important news events. This is especially evident in emergencies, because any microblog user at the scene can post information about the whole event from a mobile phone. For example, when a magnitude 4.4 earthquake struck Xi'an in November 2009, microblogs reported the event only one minute later, whereas the official national website published its first report 15 minutes later.

The popularity of microblogs has, however, brought new problems. The first is information explosion: massive amounts of data flood the Internet and cause serious information overload. Faced with this volume of information, people often have difficulty finding the data they want, and they wish to find the data most important to them quickly and accurately. Before Web 2.0, people usually obtained information through professional search engines. This approach has several problems, the most important being that a search engine requires the user to query actively, cannot push information on its own initiative, and has poor real-time performance, so users are likely to miss important information. With Web 2.0, people can take part in publishing, spreading, and filtering information over the network, thereby achieving information sharing. Pushing information from targeted message sources overturns the earlier model of pulling information through a search engine, and it also compensates well for the shortcomings of search engines.

A recommendation system is an information acquisition method that starts from the user and studies the user's preferences. Even when the user's intention is vague, it can guide the user to discover potential needs and push information the user is interested in, which makes it a very promising way to solve the information overload problem. The main task of a recommendation system is to accurately capture the user's points of interest and to push events the user may be interested in by means of an efficient recommendation algorithm.

Sina microblog, the most popular microblog platform in China, has the following characteristics: each post is limited to 140 characters, and the posts are large in volume, short, sparse in content, real-time, and rich in social information. Because microblog data has no fixed form and many messages may contain no useful information, processing is difficult, and research on recommendation systems for such short texts remains challenging. To achieve good recommendation results, it is very important to develop an efficient recommendation algorithm. Most existing recommendation systems target ordinary text; research on recommendation systems for short text data such as microblogs is not yet deep enough, and its results cannot meet the needs of practical applications.

Disclosure of Invention

In order to solve the above technical problem, the application provides a method for recommending Sina microblog events, which corrects the user model in real time, improves the recommendation accuracy of a microblog event recommendation system, and improves the user experience.

The technical scheme adopted by the invention is as follows: the method for recommending the Sina microblog events comprises the following steps:

S1, calculating the similarity between the user model and the event vector by adopting an improved cosine angle algorithm, and recommending the event to the user if the similarity is greater than a threshold value; otherwise, not recommending it;

S2, updating the user model according to the recommended events that arrived at the event database within the latest time length K;

S3, updating the user model according to the events liked by the user.

Further, the improved cosine included angle algorithm is specifically as follows:

wherein sameWordNum represents the number of keywords shared by the user model A and the event model B; min(|A|, |B|) represents the smaller of the dimensions of the user model A and the event model B; w_ai represents the weight corresponding to the feature word ai in the user model A; and w_bj represents the weight corresponding to the feature word bj in the event model B.

Further, the user model is extracted from a user database.

Further, the event vector is extracted from an event database.

Further, step S2 is specifically:

S21, when a new recommended event arrives in the event database, extracting the recommended events that arrived within the latest time length K;

S22, selecting the feature words whose weight is larger than a first threshold value from the recommended events extracted in step S21 and adding them to the user model;

S23, selecting the high-frequency words among the feature words of the current user model as the new user model.

Further, step S3 is specifically: when a new event is liked, recording the ID of the liked event, looking up the corresponding event in the event database according to the ID, and extracting the high-frequency words of the event.

The beneficial effects of the invention are as follows. In the method for recommending Sina microblog events, the similarity between a user model and an event vector is calculated with an improved cosine angle algorithm, and the event is pushed to the user if the similarity exceeds the set threshold. The user model is updated with the events that arrived within the most recent period, so that it tracks the latest development of an event, and it is updated again according to the user's like behavior, so that the recommendation results better match the user's expectations. The method recommends Sina microblog events with high accuracy, lets the model drift in a controlled way, and responds promptly to user feedback on the recommendation results.

Drawings

FIG. 1 is a schematic flow chart of the present application;

FIG. 2 is a model drift workflow;

FIG. 3 is the user feedback update workflow.

Detailed Description

In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.

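FIG. 1 shows the flow chart of the scheme of the present application. The technical scheme is as follows: the method for recommending Sina microblog events comprises the following steps:

S1, calculating the similarity between the user model and the event vector by adopting an improved cosine angle algorithm, and recommending the event to the user if the similarity is greater than a threshold value; otherwise, not recommending it;

S2, updating the user model according to the recommended events that arrived at the event database within the latest time length K;

S3, updating the user model according to the events liked by the user.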

Step S1 is specifically as follows. The classical cosine angle formula is:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$

where A and B represent the user model vector and the event vector, respectively, and can be expressed as follows:

A = {(a1, w_a1), (a2, w_a2), (a3, w_a3), ..., (am, w_am)}

B = {(b1, w_b1), (b2, w_b2), (b3, w_b3), ..., (bn, w_bn)}

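Here w_a1 represents the weight corresponding to the feature word a1 in the user model A; the same applies to vector B. Simplifying gives:

$$\cos\theta = \frac{\sum_{a_i = b_j} w_{a_i}\, w_{b_j}}{\sqrt{\sum_{i=1}^{m} w_{a_i}^{2}}\;\sqrt{\sum_{j=1}^{n} w_{b_j}^{2}}}$$

where w_ai and w_bj are multiplied only when the feature words are identical, that is, when ai = bj.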

With this formula, the more words the two vectors share, the larger the cosine value. Because the dimensions of the user model and the event vector may be large, calculating similarity simply from shared words inevitably leads to low recommendation precision. One reason is that some feature words with high weights in the event vector may have no power to distinguish events, such as "china" or "usa", while some words with lower weights may be the focus of the event, such as "air crash" or "gold prize". The method therefore introduces an attenuation coefficient to improve recommendation precision, and the improved cosine angle algorithm is as follows:

wherein sameWordNum represents the number of keywords shared by the user model A and the event model B; min(|A|, |B|) represents the smaller of the dimensions of the user model A and the event model B; w_ai represents the weight corresponding to the feature word ai in the user model A; and w_bj represents the weight corresponding to the feature word bj in the event model B.
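Read together with the symbol definitions above, the improved measure can be written as the classical cosine scaled by an attenuation coefficient. The form below is a reconstruction under a stated assumption, namely that the coefficient is the ratio sameWordNum / min(|A|, |B|); the name sim(A, B) is introduced here for illustration only.

$$\mathrm{sim}(A,B) = \frac{\mathrm{sameWordNum}}{\min(|A|,\,|B|)} \cdot \frac{\sum_{a_i = b_j} w_{a_i}\, w_{b_j}}{\sqrt{\sum_{i=1}^{m} w_{a_i}^{2}}\;\sqrt{\sum_{j=1}^{n} w_{b_j}^{2}}}$$

Under this reading, a user model with 5 keywords and an event vector with 20 feature words that share 3 keywords would have their cosine value multiplied by 3/5 = 0.6, while two vectors that share every keyword of the smaller one would keep the full cosine value.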

After the attenuation coefficient is introduced, the similarity is greatly attenuated when the vectors share only a few keywords, so recommendation accuracy can be improved considerably as long as a proper threshold is set. The threshold is not fixed: if it yields the expected recommendation results, it is appropriate; otherwise, it is readjusted. Besides the attenuation coefficient, the method uses two further measures to improve recommendation precision. First, a recommendation is made only when the number of shared keywords reaches a preset number. The number of keywords entered by a user is generally small, about 5, so the similarity is calculated only when the user model and the event vector have at least three keywords in common. When the number of words entered by users changes substantially, this threshold can be adjusted accordingly. Second, to avoid the negative effect of different word forms of the same word, the stems of all keywords are extracted before the similarity is calculated.
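The similarity test described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes the attenuation coefficient is sameWordNum / min(|A|, |B|), that both vectors are dictionaries mapping already-stemmed feature words to weights, and that the function names and the default threshold of 0.3 are hypothetical.

```python
import math


def improved_cosine(user_model, event_vector, min_shared=3):
    """Cosine similarity scaled by an assumed attenuation coefficient.

    Both arguments map a stemmed feature word to its weight.
    Returns 0.0 when fewer than `min_shared` keywords are shared.
    """
    if not user_model or not event_vector:
        return 0.0
    shared = user_model.keys() & event_vector.keys()
    if len(shared) < min_shared:
        return 0.0  # too few shared keywords: skip the calculation
    # Weights are multiplied only for identical feature words (ai = bj).
    dot = sum(user_model[w] * event_vector[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in user_model.values()))
    norm_b = math.sqrt(sum(v * v for v in event_vector.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Assumed attenuation coefficient: sameWordNum / min(|A|, |B|).
    attenuation = len(shared) / min(len(user_model), len(event_vector))
    return attenuation * dot / (norm_a * norm_b)


def should_recommend(user_model, event_vector, threshold=0.3):
    """Recommend the event only if the similarity exceeds the threshold."""
    return improved_cosine(user_model, event_vector) > threshold
```

Returning 0.0 for pairs with fewer than three shared keywords is one way to express "no similarity calculation"; a caller could equally skip such pairs before calling the function.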

After a recommended event is obtained, it is stored in the user database and a recommendation log is generated. When the requirements are modest, an event can be represented by its summary, which is pushed to the user. If the user needs to read an original post, the post most relevant to the user model must be extracted from the event.

To extract the post of greatest interest, the posts are preprocessed by word segmentation and stemming. A post that contains the highest-weighted word in the list of shared keywords is probably the one the user is most interested in.
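The post selection described above might look like the following sketch. It assumes each post has already been segmented and stemmed into a list of tokens, that "weight" means the weight in the user model, and that ties between candidate posts are broken by the number of shared keywords they contain; these choices and the name `pick_post` are illustrative assumptions, not details given in the text.

```python
def pick_post(posts, user_model, event_vector):
    """Return the post most likely to interest the user, or None.

    posts: list of token lists (already segmented and stemmed).
    user_model, event_vector: dicts mapping feature word -> weight.
    """
    shared = user_model.keys() & event_vector.keys()
    if not shared:
        return None
    # Highest-weighted shared keyword (assumption: user-model weight,
    # with alphabetical order as a deterministic tie-breaker).
    top_word = max(shared, key=lambda w: (user_model[w], w))
    candidates = [p for p in posts if top_word in p]
    if not candidates:
        return None
    # Among posts containing that word, prefer the one that covers
    # the most shared keywords (assumption).
    return max(candidates, key=lambda p: len(shared & set(p)))
```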

Step S2 is specifically as follows. The main task of model drift is to correct the user model automatically as time passes; its purpose is to track event hot spots in real time and follow the trend of an event.

The user model represents the user's points of interest and is usually a miniature of an event, except that the user summarizes the event with a few keywords. Over time, an event may develop further and its hot words may change. The model drift module is designed to track these changes automatically and ensure that the user receives the latest information.

The core of model drift is to modify the user model by adding the latest hot words and deleting outdated keywords. The workflow is shown in FIG. 2.

As in the recommendation module, the user model is extracted from the user database and the latest events are extracted from the event database. Note that only events that arrived within the last hour are extracted, and that extraction is triggered when a new recommended event arrives for the user model. That is, when a new recommended event exists for the current user model, all events recommended for that model in the last hour are extracted and a drift vector is generated. (The duration K in the present application is thus one hour. K may take other values, which changes the drift amplitude, but a very large value is not recommended, in order to preserve the timeliness of news.) The purpose is to smooth the drift process and avoid drifting too fast. If the drift amplitude is too large, the model may move too far from the initial model and harm the user experience. The extraction of feature vectors also requires care. Over all events within one hour, the generated drift vector may be very large, far exceeding the dimension of the initial user model. To exclude feature words with extremely small weights and avoid excessively diluting the initial user model, the high-frequency words among the feature words of the current user model are selected as the new user model.

In this embodiment, the extracted event feature vector is limited to 20 words: only the words with the highest weights are taken and added to the user model, and the top 20 words by weight of the updated model are kept as the new user model. In particular, to preserve the influence of the user's original input keywords, the updated user model is divided into two parts, the original input keywords and the newly added keywords, each of which accounts for 0.5 of the weight. In this way the weight of the original input keywords is guaranteed, the latest event hot words are added, and outdated feature words are deleted.
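One possible reading of this drift update is sketched below. It assumes the drift vector has already been aggregated from the events of the last hour into a word-to-weight dictionary, that a word present in both the model and the drift vector keeps the larger of its two weights, and that a model with only one of the two parts is simply normalized to 1.0. The helper names are hypothetical.

```python
def top_n(weights, n):
    """Keep the n highest-weighted words (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


def rescale(weights, total):
    """Scale the weights so that they sum to `total`."""
    current = sum(weights.values())
    if current == 0:
        return dict(weights)
    return {w: v * total / current for w, v in weights.items()}


def drift_update(user_model, original_words, drift_vector, max_words=20):
    """Merge the top drift words into the user model and rebalance it.

    user_model: dict word -> weight (current model).
    original_words: set of keywords originally entered by the user.
    drift_vector: dict word -> weight built from recent events.
    """
    merged = dict(user_model)
    for word, weight in top_n(drift_vector, max_words).items():
        # Assumption: keep the larger weight when a word appears in both.
        merged[word] = max(merged.get(word, 0.0), weight)
    kept = top_n(merged, max_words)  # truncate to the top 20 words
    original = {w: v for w, v in kept.items() if w in original_words}
    added = {w: v for w, v in kept.items() if w not in original_words}
    if not original or not added:
        return rescale(kept, 1.0)
    # Original input keywords and newly added keywords each get 0.5.
    return {**rescale(original, 0.5), **rescale(added, 0.5)}
```

Because truncation is by weight, an original keyword with a very low weight could fall out of the top 20 in this sketch; an implementation that must always keep the original keywords would reserve slots for them first.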

Similarly, the user model after drift is stored in a user database to generate a drift log.

Step S3 is specifically as follows. The main purpose of the user feedback update is to receive user feedback promptly and modify the user model according to the user's preferences. The user's feedback on the recommendation results reflects whether the user is satisfied with the current results and is the most important reference for correcting recommendations. The user feedback update flow is shown in FIG. 3.

The most direct form of user feedback is the "like". When a user is interested in an event or a post, the user can like it; the system recognizes the like and stores the ID of the liked event in the user database. With this like information, the user's latest points of interest can be obtained promptly and the user model updated.

When a new event is liked, the user model and the ID of the liked event are extracted from the user database, the corresponding event is looked up in the event database by that ID, and the high-frequency words of the event are extracted. The weight of the word with the highest frequency among them is then set to the maximum weight in the original user model, the weights of the remaining high-frequency words are adjusted in proportion, and the whole updated user model is normalized. High-frequency words are defined as feature words whose frequency exceeds the number of posts in the corresponding event. Such words are representative of the event, so they carry higher weights and take a larger share when merged into the user model, and therefore strongly reflect the user's subjective preferences. As with model drift, the top 20 keywords of the updated user model are kept and normalized so that the original input keywords still account for a weight of 0.5. Finally, the updated user model is stored in the user database and an update log is generated.
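The feedback update can be sketched in the same style. The sketch assumes word frequencies for the liked event are available as a dictionary, that a word already in the model keeps the larger of its old and new weights, and that the final normalization is the 0.5 / 0.5 split between original and added keywords; the function names are hypothetical.

```python
def _rescale(weights, total):
    """Scale the weights so that they sum to `total`."""
    current = sum(weights.values())
    if current == 0:
        return dict(weights)
    return {w: v * total / current for w, v in weights.items()}


def feedback_update(user_model, original_words, word_freq, num_posts,
                    max_words=20):
    """Update the user model from the high-frequency words of a liked event.

    word_freq: dict word -> frequency within the liked event.
    num_posts: number of posts that make up the liked event.
    """
    # High-frequency words: frequency greater than the number of posts.
    high = {w: f for w, f in word_freq.items() if f > num_posts}
    if not high or not user_model:
        return dict(user_model)
    max_weight = max(user_model.values())
    max_freq = max(high.values())
    updated = dict(user_model)
    for word, freq in high.items():
        # The most frequent word gets the model's maximum weight;
        # the others are scaled in proportion to their frequency.
        new_weight = max_weight * freq / max_freq
        updated[word] = max(updated.get(word, 0.0), new_weight)
    ranked = sorted(updated.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = dict(ranked[:max_words])  # keep the top 20 keywords
    original = {w: v for w, v in kept.items() if w in original_words}
    added = {w: v for w, v in kept.items() if w not in original_words}
    if not original or not added:
        return _rescale(kept, 1.0)
    return {**_rescale(original, 0.5), **_rescale(added, 0.5)}
```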

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

1. The method for recommending the Sina microblog events is characterized by comprising the following steps:

S1, calculating the similarity between the user model and the event vector by adopting an improved cosine angle algorithm, and recommending the event to the user if the similarity is greater than a threshold value; otherwise, not recommending it; the improved cosine angle algorithm is specifically as follows:

wherein sameWordNum represents the number of keywords shared by the user model A and the event model B; min(|A|, |B|) represents the smaller of the dimensions of the user model A and the event model B; w_ai represents the weight corresponding to the feature word ai in the user model A; and w_bj represents the weight corresponding to the feature word bj in the event model B;

S2, updating the user model according to the recommended events that arrived at the event database within the latest time length K; step S2 specifically includes:

S21, when a new recommended event arrives in the event database, extracting the recommended events that arrived within the latest time length K; extracting all events recommended for the model in the last hour to generate a drift vector; selecting the top 20 keywords by weight in the drift vector and adding them to the user model; dividing the updated user model into two parts, the original input keywords and the newly added keywords, each of which accounts for 0.5 of the weight;

S23, selecting the high-frequency words among the feature words of the current user model as the new user model;

S3, updating the user model according to the events liked by the user.

2. The method for recommending Sina microblog events according to claim 1, wherein the user model is extracted from a user database.

3. The method for recommending Sina microblog events according to claim 2, wherein the event vector is extracted from an event database.

4. The method for recommending Sina microblog events according to claim 1, wherein step S3 is specifically: when a new event is liked, recording the ID of the liked event, looking up the corresponding event in the event database according to the ID, and extracting the high-frequency words of the event.