CN116127192A - Personalized recommendation method based on big data

Info

Publication number: CN116127192A
Application number: CN202211741970.8A
Authority: CN
Inventors: 熊林海; 周金明
Original assignee: Nanjing Xingzheyi Intelligent Transportation Technology Co ltd
Current assignee: Nanjing Xingzheyi Intelligent Transportation Technology Co ltd
Priority date: 2022-12-31
Filing date: 2022-12-31
Publication date: 2023-05-16

Abstract

The invention discloses a personalized recommendation method based on big data. For a new user, the 15 pieces of data with the highest click rate in the Elasticsearch database are recommended. For an existing user, the recommendation results of a user-based collaborative filtering algorithm and an item-based recommendation algorithm are fused by linear weighting, and the top 10 pieces of data are selected for recommendation. First, the database can be updated on a regular schedule through a highly automated web crawler; second, combining the TF-IDF algorithm with the collaborative filtering algorithm alleviates the data sparsity problem to a certain extent.

Description

Personalized recommendation method based on big data

Technical Field

The invention relates to the technical field of big data research, in particular to a personalized recommendation method based on big data.

Background

With the rapid development of network technology, the internet has permeated every aspect of daily life, and massive amounts of data are generated every day. The information a user actually wants is only a very small fraction of this total, and users are often at a loss when faced with such information overload. It is therefore important to be able to obtain the information a user is looking for efficiently.

The recommendation methods in common use today are mainly content-based recommendation, association-rule-based recommendation, and collaborative filtering recommendation. Content-based recommendation and collaborative filtering recommendation suffer from the data sparsity problem and the new-user problem, while association-rule-based recommendation suffers from difficult and time-consuming rule extraction and a low degree of personalization.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention combines Elasticsearch data storage and search with a user-based collaborative filtering algorithm and an item-based collaborative filtering recommendation algorithm. For a new user, the 15 pieces of data with the highest click rate in the Elasticsearch database are recommended; for an existing user, the recommendation results of the user-based collaborative filtering algorithm and the item-based recommendation algorithm are fused by linear weighting, and the top 10 pieces of data are selected for recommendation. The technical solution is as follows:

A personalized recommendation method based on big data comprises the following steps:

Step 1: using web crawler technology, specify the fields to be crawled, including title, release time, administration time, timeliness, text, etc., and obtain data from the relevant websites;

Step 2: first, process the fields of the crawled data, remove meaningless data, and de-duplicate the data; second, store the cleaned data in JSON format according to new fields such as the title and the region of origin of the data;

import the data stored in JSON format into an Elasticsearch database, select the IK analyzer (word segmenter) matching the Elasticsearch version, create the index using the finest-grained mode ik_max_word, and search using the coarsest-grained mode ik_smart;

Step 3: construct the user-based collaborative filtering algorithm. According to the historical behavior information of users on the data, including searching, commenting and collecting, construct a user-data matrix U of size m × n, as follows:

$$
U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{m1} & u_{m2} & \cdots & u_{mn} \end{bmatrix}
$$

wherein m denotes the number of users and n denotes the total number of pieces of data; if a user has historical behavior on a piece of data, a score is assigned, $u_{mn}$ denoting the score of user m for data n; if there is no historical behavior, 0 is assigned;

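construct an n × s data-tag matrix C based on the tag information of the data, as follows:

$$
C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1s} \\ c_{21} & c_{22} & \cdots & c_{2s} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{ns} \end{bmatrix}
$$

wherein n denotes the total number of pieces of data, s denotes the total number of tags, and $c_{ns}$ indicates whether data n carries tag s: it is assigned 1 if so, and 0 otherwise;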

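from the matrices U and C, construct a user-tag preference matrix P of size m × s, as follows:

$$
P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1s} \\ p_{21} & p_{22} & \cdots & p_{2s} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{ms} \end{bmatrix}
$$

wherein m denotes the total number of users, s denotes the total number of tags, and $p_{ms}$ denotes the degree of preference of user m for tag s;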

the user-tag preference matrix P is improved by the TF-IDF algorithm, specifically as follows:

wherein $p_{ua}$ denotes the degree of preference of user u for tag a; the remaining terms of the formula denote, respectively, the number of times user u has marked tag a, the total number of times user u has marked tags, the total number of times the tag is used, and the total number of tags; $n_{ua}$ denotes the number of users who have marked tag a; and $n_m$ denotes the total number of users;

calculate the similarity between users using the improved user-tag matrix and cosine similarity, the specific formula being:

$$
w_{u,v} = \frac{|n_u \cap n_v|}{\sqrt{|n_u|\,|n_v|}}
$$

wherein the larger the value of $w_{u,v}$, the more similar user u and user v are, and thus the greater the probability of recommending to user u the data that user v prefers; $n_u$ denotes the set of data preferred by user u, and $n_v$ denotes the set of data preferred by user v; $|n_u \cap n_v|$ denotes the number of pieces of data preferred by both user u and user v;

sort the similarities in descending order, find the k users most similar to the target user, denoted by the set S(u, k), and calculate the degree of preference of target user u for data i, the specific formula being:

$$
p(u, i) = \sum_{v \in S(u,k) \cap n_i} w_{u,v}\, u_{vi}
$$

wherein $n_i$ denotes the set of users who have historical behavior on data i, and $u_{vi}$ denotes the score of user v's historical behavior information on data i;

Step 4: construct the recommendation result of the item-based collaborative filtering algorithm, whose implementation is essentially the same as that of the user-based collaborative filtering algorithm in step 3: first, construct a data-user matrix according to the historical behavior information of different users on different data; second, calculate the similarity between pieces of data using cosine similarity; third, obtain the recommendation result, wherein the data recommended to the target user are data on which the target user has no historical behavior and which are relatively similar to data on which the target user does have historical behavior, and a higher recommendation score indicates that the target user is likely to be more interested in the recommended data;

Step 5: for a new user, recommend the 15 pieces of data with the highest click rate in the Elasticsearch database; for an existing user, fuse the recommendation results of the user-based collaborative filtering algorithm and the item-based recommendation algorithm by linear weighting, and select the top 10 pieces of data for recommendation.

Preferably, in step 1, different crawling objects are selected according to different department attributes of the user, and relevant laws and regulations, news dynamics, microblogs and case data are obtained.

Preferably, the assignment criteria for the user-data matrix U in step 3 are: a value of 1 is assigned when the user has searched for a piece of data, 2 when the user has commented on it, and 3 when the user has collected it.

Preferably, in step 3, when the k users most similar to the target user are selected, k is set to 20.

Preferably, as an alternative recommendation method for new users in step 5, the platform can let the user select fields of interest at initialization, and data from the selected fields are then recommended.

Preferably, a step 6 can be added to iteratively optimize the model, in two main parts: the first part is data optimization, in which the database is continuously updated by the crawler program; the second part is updating the user-data matrix, in which, after data are recommended to a user who has behavior information, whether the target user gives behavior feedback on the recommended data is recorded in the user behavior information log table, and the user-data matrix and the final recommendation list are continuously updated accordingly.

Compared with the prior art, the invention has the following beneficial effects: first, the database can be updated on a regular schedule through a highly automated web crawler; second, combining the TF-IDF algorithm with the collaborative filtering algorithm alleviates the data sparsity problem to a certain extent.

Drawings

FIG. 1 is a schematic diagram of a model implementation process.

FIG. 2 is a schematic diagram of a model optimization process.

Detailed Description

In order to clarify the technical scheme and working principle of the present invention, the following describes the embodiments of the present disclosure in further detail. Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.

The terms "step 1," "step 2," "step 3," and the like in the description and in the claims are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those described herein.

The embodiment of the disclosure provides a personalized recommendation method based on big data, which comprises the following steps:

Step 1: Acquire the relevant data using crawler technology.

Using web crawler technology, the fields to be crawled are specified, such as title, release time, administration time, timeliness, and text, and data are obtained from the relevant websites.

Preferably, different crawling objects are selected according to different department attributes of the user, and relevant laws and regulations, news dynamics, microblogs and case data are obtained.
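As a rough illustration of the field extraction in Step 1, the sketch below pulls the specified fields out of one already-downloaded page using only the Python standard library. The HTML layout, the class names (`title`, `release-time`, `text`) and the `FieldParser` helper are assumptions made for illustration and do not come from the text; a real crawler would also need page fetching and per-site selectors.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class names on the page to crawled field names.
FIELD_CLASSES = {"title": "title", "release-time": "release_time", "text": "text"}


class FieldParser(HTMLParser):
    """Collects the text of elements whose class names one of the target fields."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in FIELD_CLASSES:
            self._current = FIELD_CLASSES[css_class]
            self.record[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data.strip()

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field (no nesting).
        self._current = None


def parse_page(html):
    parser = FieldParser()
    parser.feed(html)
    return parser.record


SAMPLE_PAGE = """
<html><body>
  <h1 class="title">Road Traffic Safety Regulation</h1>
  <span class="release-time">2022-05-01</span>
  <div class="text">Article 1. This regulation applies to ...</div>
</body></html>
"""

record = parse_page(SAMPLE_PAGE)
```

Selecting different crawling targets per department would then amount to choosing a different set of start URLs and class mappings for each department attribute.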

Step 2: Clean the data and store it in an Elasticsearch database.

First, the fields of the crawled data are processed, meaningless data are removed, and the data are de-duplicated. Second, the cleaned data are stored in JSON format according to new fields such as the title and the region of origin of the data.
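The cleaning step can be sketched as follows. This is a minimal example under stated assumptions: a record counts as "meaningless" when its title or text is empty, duplicates are detected on the (title, text) pair, and the field name `source_region` is hypothetical.

```python
import json


def clean_records(raw_records):
    """Drop empty records, strip whitespace, and de-duplicate on (title, text)."""
    seen = set()
    cleaned = []
    for rec in raw_records:
        title = (rec.get("title") or "").strip()
        text = (rec.get("text") or "").strip()
        if not title or not text:
            continue  # assumption: no title or no body means meaningless data
        key = (title, text)
        if key in seen:
            continue  # exact duplicate of a record already kept
        seen.add(key)
        cleaned.append({
            "title": title,
            "text": text,
            "release_time": rec.get("release_time"),
            "source_region": rec.get("source_region"),  # hypothetical field name
        })
    return cleaned


raw = [
    {"title": " Notice A ", "text": "Body A", "release_time": "2022-01-01", "source_region": "Nanjing"},
    {"title": "Notice A", "text": "Body A", "release_time": "2022-01-01", "source_region": "Nanjing"},
    {"title": "", "text": "orphan body"},
    {"title": "Notice B", "text": "Body B"},
]
cleaned = clean_records(raw)
# One JSON document per line, ready to be imported into the database.
json_lines = [json.dumps(rec, ensure_ascii=False) for rec in cleaned]
```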

The data stored in JSON format are imported into an Elasticsearch database. The IK analyzer (word segmenter) matching the Elasticsearch version is selected; the index is created using the finest-grained mode ik_max_word, and searches use the coarsest-grained mode ik_smart. The dictionary can also be extended with custom entries to improve search accuracy.
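One way to express this analyzer choice is in the index mapping, using the standard `analyzer` and `search_analyzer` mapping parameters. The sketch below only builds the request body; the index name and the field names (`title`, `text`, `release_time`, `click_count`) are assumptions, and the IK plugin is assumed to be installed on the cluster.

```python
import json

# Hypothetical index name; only the analyzer names come from the text.
INDEX_NAME = "recommendation_data"

index_body = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",      # finest-grained segmentation at index time
                "search_analyzer": "ik_smart",  # coarsest-grained segmentation at query time
            },
            "text": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart",
            },
            "release_time": {"type": "date"},
            "click_count": {"type": "long"},  # hypothetical click counter
        }
    }
}

# This body would accompany the index-creation request (PUT /recommendation_data).
request_json = json.dumps(index_body, indent=2)
```

Indexing with the fine-grained mode stores more candidate terms per document, while the coarse-grained mode keeps query terms few and specific, which is the usual reason for pairing the two.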

Step 3: Construct the user-tag matrix using the collaborative filtering algorithm, combining the user's latent interests with the TF-IDF algorithm.

According to the historical behavior information of users on the data, including searching, commenting and collecting, a user-data matrix U of size m × n is constructed as follows:

$$
U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{m1} & u_{m2} & \cdots & u_{mn} \end{bmatrix}
$$

wherein m denotes the number of users and n denotes the total number of pieces of data. If a user has historical behavior on a piece of data, a score is assigned, $u_{mn}$ denoting the score of user m for data n; if there is no historical behavior, a value of 0 is assigned.

Preferably, a value of 1 is assigned when the user has searched for a piece of data, 2 when the user has commented on it, and 3 when the user has collected it.
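A minimal sketch of building the user-data matrix from a behavior log is shown below. The log format of `(user index, data index, behavior)` tuples is an assumption, as is the rule of keeping the highest score when a user has several behaviors on the same piece of data; the text does not say how such cases are combined.

```python
# Scores taken from the text: search = 1, comment = 2, collect = 3.
BEHAVIOR_SCORE = {"search": 1, "comment": 2, "collect": 3}


def build_user_data_matrix(behaviors, num_users, num_items):
    """Return an m x n list-of-lists matrix U from (user, item, behavior) tuples."""
    U = [[0] * num_items for _ in range(num_users)]
    for user, item, behavior in behaviors:
        score = BEHAVIOR_SCORE[behavior]
        # Assumption: with several behaviors on one item, keep the highest score.
        U[user][item] = max(U[user][item], score)
    return U


behavior_log = [
    (0, 0, "search"),
    (0, 0, "collect"),
    (0, 2, "comment"),
    (1, 1, "search"),
]
U = build_user_data_matrix(behavior_log, num_users=2, num_items=3)
```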

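An n × s data-tag matrix C is constructed from the tag information of the data, as follows:

$$
C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1s} \\ c_{21} & c_{22} & \cdots & c_{2s} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{ns} \end{bmatrix}
$$

wherein n denotes the total number of pieces of data, s denotes the total number of tags, and $c_{ns}$ indicates whether data n carries tag s: it is assigned 1 if so, and 0 otherwise.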

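From the matrices U and C, a user-tag preference matrix P of size m × s is constructed as follows:

$$
P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1s} \\ p_{21} & p_{22} & \cdots & p_{2s} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{ms} \end{bmatrix}
$$

wherein m denotes the total number of users, s denotes the total number of tags, and $p_{ms}$ denotes the degree of preference of user m for tag s.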
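The text does not spell out how P is derived from U and C. A natural reading, given the matrix sizes (m × n times n × s gives m × s), is the matrix product P = U · C; the sketch below works under that assumption only, with a hypothetical `matmul` helper.

```python
def matmul(A, B):
    """Multiply an m x n matrix by an n x s matrix (lists of lists)."""
    n = len(B)
    s = len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(n)) for j in range(s)] for row in A]


# U: 2 users x 3 pieces of data (scores); C: 3 pieces of data x 2 tags (0/1 flags).
U = [[3, 0, 2],
     [0, 1, 0]]
C = [[1, 0],
     [0, 1],
     [1, 1]]

# Assumed construction: each entry of P sums the user's scores over the
# pieces of data that carry the tag.
P = matmul(U, C)
```

Under this reading, a user's preference for a tag grows with both the number of tagged pieces of data the user touched and the strength of each behavior.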

The user-tag preference matrix P is improved by the TF-IDF algorithm, specifically as follows:

wherein $p_{ua}$ denotes the degree of preference of user u for tag a; the remaining terms of the formula denote, respectively, the number of times user u has marked tag a, the total number of times user u has marked tags, the total number of times the tag is used, and the total number of tags; $n_{ua}$ denotes the number of users who have marked tag a; and $n_m$ denotes the total number of users.
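The exact TF-IDF expression is not reproduced in this text, so the following is only a sketch of one common form, stated as an assumption: the term frequency is the share of user u's marks that fall on tag a, and the inverse document frequency is the logarithm of the total number of users divided by the number of users who marked tag a. The function name and the input layout are hypothetical.

```python
import math


def tfidf_preferences(counts):
    """counts[u][a] = number of times user u marked tag a; returns weighted matrix."""
    num_users = len(counts)
    num_tags = len(counts[0])
    # Number of users who marked each tag at least once.
    users_per_tag = [sum(1 for u in range(num_users) if counts[u][a] > 0)
                     for a in range(num_tags)]
    weighted = []
    for u in range(num_users):
        total = sum(counts[u])  # total number of times user u marked any tag
        row = []
        for a in range(num_tags):
            if total == 0 or users_per_tag[a] == 0:
                row.append(0.0)
                continue
            tf = counts[u][a] / total
            idf = math.log(num_users / users_per_tag[a])  # assumed IDF form
            row.append(tf * idf)
        weighted.append(row)
    return weighted


# Tag 0 is marked by every user, tag 1 only by user 0, tag 2 only by user 2.
counts = [[4, 4, 0],
          [2, 0, 0],
          [1, 0, 3]]
P_tfidf = tfidf_preferences(counts)
```

With this weighting, a tag that every user marks contributes nothing to distinguishing users, while rarely used tags are emphasized, which is the intended effect of applying TF-IDF to the preference matrix.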

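The similarity between users is calculated using the improved user-tag matrix and cosine similarity, the specific formula being:

$$
w_{u,v} = \frac{|n_u \cap n_v|}{\sqrt{|n_u|\,|n_v|}}
$$

wherein the larger the value of $w_{u,v}$, the more similar user u and user v are, and thus the greater the probability of recommending to user u the data that user v prefers; $n_u$ denotes the set of data preferred by user u, and $n_v$ denotes the set of data preferred by user v. $|n_u \cap n_v|$ denotes the number of pieces of data preferred by both user u and user v.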
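Reading the definitions above as a cosine similarity between the two users' sets of preferred data, the calculation can be sketched as follows. The function name and the sample sets are illustrative; a cosine over the rows of the improved user-tag matrix would be another possible reading.

```python
import math


def set_cosine_similarity(items_u, items_v):
    """Size of the intersection divided by the geometric mean of the set sizes."""
    if not items_u or not items_v:
        return 0.0
    common = len(items_u & items_v)
    return common / math.sqrt(len(items_u) * len(items_v))


preferred = {
    "u": {"d1", "d2", "d3"},
    "v": {"d2", "d3", "d4"},
    "w": {"d5"},
}
w_uv = set_cosine_similarity(preferred["u"], preferred["v"])  # 2 shared out of 3 and 3
w_uw = set_cosine_similarity(preferred["u"], preferred["w"])  # nothing shared
```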

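The similarities are sorted in descending order, and the k users most similar to the target user u are found and denoted by the set S(u, k). The degree of preference of target user u for data i is then calculated as follows:

$$
p(u, i) = \sum_{v \in S(u,k) \cap n_i} w_{u,v}\, u_{vi}
$$

wherein $n_i$ denotes the set of users who have historical behavior on data i, and $u_{vi}$ denotes the score of user v's historical behavior information on data i; the behaviors comprise searching, commenting and collecting, which are assigned 1, 2 and 3 respectively.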

Preferably, k is selected to be 20.
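The neighbor selection and scoring can be sketched as below: the preference of the target user for a piece of data is taken as the similarity-weighted sum of the scores given by the k most similar users who have history on that data. The dictionary layout is hypothetical, and skipping data the target user has already interacted with is an assumption made here for the user-based case.

```python
def predict_preferences(target, similarity, scores, k=20):
    """Score unseen data for `target` from its k most similar users.

    similarity[target][v] is the user-user similarity; scores[v][i] is the
    behavior score of user v on data i (absent means no historical behavior).
    """
    neighbours = sorted(similarity[target].items(), key=lambda kv: kv[1], reverse=True)[:k]
    seen = set(scores.get(target, {}))
    predicted = {}
    for v, w_uv in neighbours:
        for item, u_vi in scores.get(v, {}).items():
            if item in seen:
                continue  # assumption: only score data the target has no history with
            predicted[item] = predicted.get(item, 0.0) + w_uv * u_vi
    return predicted


similarity = {"u": {"v": 0.5, "w": 0.25, "x": 0.0}}
scores = {
    "u": {"d1": 3},
    "v": {"d1": 2, "d2": 3},
    "w": {"d2": 1, "d3": 2},
    "x": {"d4": 3},
}
# k=2 keeps only the neighbours v and w, so d4 (seen only by x) is not scored.
predicted = predict_preferences("u", similarity, scores, k=2)
```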

Step 4: The recommendation result of the item-based collaborative filtering algorithm is obtained by a process essentially the same as that of the user-based collaborative filtering algorithm. First, a data-user matrix is constructed according to the historical behavior information of different users on different data. Second, the similarity between pieces of data is calculated using cosine similarity. Third, the recommendation result is obtained: the data recommended to the target user are data on which the target user has no historical behavior and which are relatively similar to data on which the target user does have historical behavior; a higher recommendation score indicates that the target user is likely to be more interested in the recommended data.
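A minimal sketch of the item-based variant follows. It assumes the similarity between two pieces of data is the vector cosine of their columns in the user-data matrix, and that the score of an unseen piece of data is the sum of its similarities to the data the user has history with, weighted by the user's scores; neither detail is fixed by the text.

```python
import math


def item_cosine(col_i, col_j):
    """Cosine similarity between two columns of the user-data matrix."""
    dot = sum(a * b for a, b in zip(col_i, col_j))
    norm = math.sqrt(sum(a * a for a in col_i)) * math.sqrt(sum(b * b for b in col_j))
    return dot / norm if norm else 0.0


def item_based_scores(U, user):
    """Score every piece of data the user has no history with."""
    num_items = len(U[0])
    columns = [[row[j] for row in U] for j in range(num_items)]
    result = {}
    for j in range(num_items):
        if U[user][j] != 0:
            continue  # skip data the user already interacted with
        result[j] = sum(item_cosine(columns[i], columns[j]) * U[user][i]
                        for i in range(num_items) if U[user][i] != 0)
    return result


# Rows are users, columns are pieces of data (scores 1-3, 0 = no behavior).
U = [[3, 0, 0],
     [3, 3, 0],
     [0, 0, 2]]
item_scores = item_based_scores(U, user=0)
```

Here user 0 has only touched data 0; data 1 shares a user with data 0 and therefore receives a positive score, while data 2 shares none and scores zero.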

Finally, the recommendation results of the user-based collaborative filtering algorithm and the item-based recommendation algorithm are fused by linear weighting, and the top 10 pieces of data are selected for recommendation.
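Linear weighted fusion can be sketched as a convex combination of the two score lists. The weight `alpha` and its default of 0.5 are assumptions, since the text gives no weights, and the two score sets are assumed to be on comparable scales (in practice they might first be normalized).

```python
def fuse_and_rank(user_cf, item_cf, alpha=0.5, top_n=10):
    """Linear weighted fusion: alpha * user-based + (1 - alpha) * item-based."""
    fused = {}
    for item in set(user_cf) | set(item_cf):
        fused[item] = alpha * user_cf.get(item, 0.0) + (1 - alpha) * item_cf.get(item, 0.0)
    # Sort by fused score, highest first; break ties by item id for determinism.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


user_cf = {"d2": 1.0, "d3": 0.5}
item_cf = {"d3": 2.0, "d4": 1.0}
top = fuse_and_rank(user_cf, item_cf, alpha=0.5, top_n=2)
```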

Step 5: different recommendation algorithms are employed depending on the nature of the user.

For a new user, the 15 pieces of data with the highest click rate in the Elasticsearch database can be recommended; alternatively, tags can be set when the platform is initialized, for example by letting the user select fields of interest, and data from the selected fields are then recommended. For an existing user, the recommendation results of the user-based collaborative filtering algorithm and the item-based recommendation algorithm are fused by linear weighting, and the top 10 pieces of data are selected for recommendation.
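The split between new and existing users can be sketched as a simple routing function, together with a query body that would fetch the 15 most-clicked documents. The `click_count` field, the `recommend` helper and the rule that a user with no recorded behavior counts as new are all assumptions made for illustration.

```python
# Query body for the 15 most-clicked documents; `click_count` is a hypothetical field.
TOP_CLICKED_QUERY = {
    "size": 15,
    "query": {"match_all": {}},
    "sort": [{"click_count": {"order": "desc"}}],
}


def recommend(user_id, history, popular, personalised):
    """Route new users to the popularity list and existing users to the fused list.

    history: user id -> recorded behaviors; popular: ids already sorted by clicks;
    personalised: callable returning a ranked list for an existing user.
    """
    if not history.get(user_id):
        return popular[:15]  # new user: most-clicked data
    return personalised(user_id)[:10]  # existing user: top 10 of the fused ranking


popular = [f"d{i}" for i in range(30)]
history = {"old_user": [("d1", "collect")]}
new_list = recommend("new_user", history, popular, lambda u: [])
old_list = recommend("old_user", history, popular, lambda u: [f"p{i}" for i in range(12)])
```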

Step 6: iterative optimization of the model.

Iterative optimization of the model is divided into two main parts. The first part is data optimization: the database is continuously updated by the crawler program. The second part is updating the user-data matrix: as shown in FIG. 2, after data are recommended to a user who has behavior information, whether the target user gives behavior feedback on the recommended data is recorded in the user behavior information log table, and the user-data matrix and the final recommendation list are continuously updated accordingly.
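The second part of the loop can be sketched as applying logged feedback to the user-data matrix and noting which users need fresh recommendations. The log format and the rule of keeping the strongest behavior seen so far are assumptions; the text only states that the matrix and the recommendation list are updated from the log.

```python
BEHAVIOR_SCORE = {"search": 1, "comment": 2, "collect": 3}


def apply_feedback(U, feedback_log):
    """Update the user-data matrix in place from logged (user, item, behavior) feedback.

    Returns the set of users whose rows changed, so that only their
    recommendation lists need to be recomputed.
    """
    changed_users = set()
    for user, item, behavior in feedback_log:
        score = BEHAVIOR_SCORE[behavior]
        if score > U[user][item]:  # assumption: keep the strongest signal seen
            U[user][item] = score
            changed_users.add(user)
    return changed_users


U = [[3, 0, 2],
     [0, 1, 0]]
feedback_log = [(0, 1, "comment"), (1, 1, "search"), (1, 2, "collect")]
changed = apply_feedback(U, feedback_log)
```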

While the invention has been described above by way of example, it is evident that the invention is not limited to the particular embodiments described above. Various insubstantial modifications made using the method concept and technical solution of the invention, direct application of that concept and technical solution to other situations without modification, and equivalent replacements all fall within the scope of protection of the invention.

Claims

1. The personalized recommendation method based on the big data is characterized by comprising the following steps:

wherein $p_{ua}$ denotes the degree of preference of user u for tag a; the remaining terms denote, respectively, the number of times user u has marked tag a, the total number of times user u has marked tags, and the total number of tags;

wherein $n_i$ denotes the set of users who have historical behavior on data i, and $u_{vi}$ denotes the score of user v's historical behavior information on data i.

2. The personalized recommendation method based on big data according to claim 1, wherein in step 1, different crawling objects are selected according to different department attributes of the user, and relevant laws and regulations, news dynamics, microblog and case data are obtained.

3. The personalized recommendation method based on big data according to claim 1, wherein the assignment criteria for the user-data matrix U in step 3 are: a value of 1 is assigned when the user has searched for a piece of data, 2 when the user has commented on it, and 3 when the user has collected it.

4. The personalized recommendation method based on big data according to claim 3, wherein in step 3, when the k users most similar to the target user are selected, k is set to 20.

5. The personalized recommendation method based on big data according to claim 1, wherein, as an alternative recommendation method for new users in step 5, the platform lets the user select fields of interest at initialization, and data from the selected fields are then recommended.

6. The personalized recommendation method based on big data according to claim 1, wherein a step 6 is added to iteratively optimize the model, in two main parts: the first part is data optimization, in which the database is continuously updated by the crawler program; the second part is updating the user-data matrix, in which, after data are recommended to a user who has behavior information, whether the target user gives behavior feedback on the recommended data is recorded in the user behavior information log table, and the user-data matrix and the final recommendation list are continuously updated accordingly.