CN112100517A

CN112100517A - Method for relieving cold start problem of recommendation system based on content feature extraction

Info

Publication number: CN112100517A
Application number: CN202010977547.2A
Authority: CN
Inventors: 陈佳雯; 张宏国; 马超
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2020-09-17
Filing date: 2020-09-17
Publication date: 2020-12-18

Abstract

The invention relates to a method for alleviating the cold start problem of a recommendation system based on content feature extraction. With the rapid development of internet technology, information overload has become increasingly pronounced, and it is difficult to find a suitable user group to which a new service without historical user ratings can be recommended. In the method, content features are first extracted from the description information of items by dependency syntax analysis from natural language processing, and the extracted content features are converted into word vectors. Second, because words differ in importance in practice, a TF-IDF-optimized weighted word mover's distance algorithm (Weighted Word Mover's Distance, WWMD) is used to compute the word distance between content feature vectors more accurately, which improves the accuracy of the similarity between items. Finally, recommendation is performed by combining the similarity calculated from the word distance with a traditional similarity calculation method. The invention is applied to the field of the Internet.

Description

Method for relieving cold start problem of recommendation system based on content feature extraction

Technical Field

The invention relates to a method for relieving a cold start problem of a recommendation system based on content feature extraction.

Background

At present, with the rapid development of internet technology, the problem of information overload is becoming more pronounced. Without historical user ratings, it is difficult to find a suitable user group to which an emerging service can be recommended, and it is difficult for users to come across, among the many available services, the newly launched services that are most likely to suit them. This is the common cold start problem that accompanies collaborative filtering, the most widely applied recommendation algorithm in the field of recommendation systems.

Disclosure of Invention

The invention aims to provide a method for alleviating the cold start problem of a recommendation system based on content feature extraction. It mainly addresses the recommendation of items that have no historical ratings, by calculating the similarity between items from the word mover's distance between their extracted content features.

The above purpose is realized by the following technical scheme:

A method for alleviating the cold start problem of a recommendation system based on content feature extraction, wherein: first, when content features are extracted, dependency syntax analysis from natural language processing is used to extract features from the description information of items, and the extracted content features are converted into word vectors; second, because words differ in importance in practice, the word mover's distance algorithm is optimized with TF-IDF to improve the accuracy of the word distance computed between content feature vectors, and thereby the accuracy of the similarity between items; and finally, recommendation is performed by combining the similarity calculated from the word distance with a traditional similarity calculation method.

In the method, the content feature extraction based on natural language processing comprises: performing content feature analysis on the description information and the updated-function description information of an item, and extracting the content features of the item information through part-of-speech tagging and dependency syntax analysis from natural language processing. First, the feature text of the item is segmented into words, and each word is tagged with its part of speech. Second, based on the part-of-speech tags of the segmented words, dependency syntax analysis is performed on the feature text of the item to analyze the relations between the words in the text.

In the method, the content feature similarity calculation based on the TF-IDF-optimized weighted word mover's distance algorithm comprises the following steps:

(1) Let $X \in \mathbb{R}^{d \times n}$ be a trained feature word vector matrix covering n words in total, whose i-th column $x_i \in \mathbb{R}^d$ is the d-dimensional word vector of the i-th word; the Euclidean distance between word i and word j is $c(i, j) = \lVert x_i - x_j \rVert_2$;

(2) A content feature text describing service a, after being processed by a skip-gram model, can be represented as a sparse bag-of-words vector $d_a \in \mathbb{R}^n$. If word i appears $n_{i,a}$ times in the content feature text and the total number of occurrences of all words in $d_a$ is $\sum_k n_{k,a}$, then the TF value of word i is $tf_{i,a} = \dfrac{n_{i,a}}{\sum_k n_{k,a}}$;

(3) Let the total number of content feature texts in the corpus be $|D|$ and the number of texts containing word i be $|\{a : i \in d_a\}|$; then the IDF value of word i is $idf_i = \log \dfrac{|D|}{|\{a : i \in d_a\}|}$.

The TF-IDF value of word i in the content feature description text of service a is $tfidf_{i,a} = tf_{i,a} \times idf_i$;

(4) Given two services a and b, let $d_a$ and $d_b$ be the bag-of-words representations of the two content feature texts to be compared. Each word i in $d_a$ may be transferred, wholly or partially, to words in $d_b$. Define a sparse transfer matrix $T \in \mathbb{R}^{n \times n}$, where $T_{ij} \ge 0$ indicates how much of word i in $d_a$ is transferred to word j in $d_b$; the total transfer cost is then $\sum_{i,j} T_{ij}\, c(i, j)$;

(5) Following the idea of the word mover's distance algorithm, the larger the total transfer cost, the lower the similarity between the two content feature texts being compared; that is, the similarity between the two texts is inversely proportional to the minimum total transfer cost between them. The problem is thus converted into a linear programming problem: the content feature text similarity $sim_{char}(a, b)$ between services a and b is obtained from the minimum total transfer cost $\min_{T \ge 0} \sum_{i,j} T_{ij}\, c(i, j)$, subject to constraints on the transfer matrix T.
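Steps (1) to (5) can be illustrated with a short Python sketch. It computes TF, IDF and TF-IDF values as defined above and then a simplified transfer cost. It is only a sketch under stated assumptions, not the patented algorithm itself: the TF-IDF weights are normalised to sum to 1, each word of $d_a$ is assumed to carry a mass equal to its normalised weight, and instead of solving the full linear program each word of $d_a$ is moved entirely to its nearest word in $d_b$, which yields a lower bound on the minimum transfer cost. The function names, the toy corpus and the two-dimensional word vectors are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple

Vector = Tuple[float, ...]


def term_frequencies(words: Sequence[str]) -> Dict[str, float]:
    """tf_{i,a} = n_{i,a} / sum_k n_{k,a} for one content feature text."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def inverse_document_frequencies(corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """idf_i = log(|D| / |{a : i in d_a}|) over the whole corpus."""
    doc_freq = Counter(w for text in corpus for w in set(text))
    return {w: math.log(len(corpus) / df) for w, df in doc_freq.items()}


def tfidf_weights(words: Sequence[str], idf: Dict[str, float]) -> Dict[str, float]:
    """TF-IDF weights normalised to sum to 1 (the normalisation is an assumption)."""
    tf = term_frequencies(words)
    raw = {w: tf[w] * idf[w] for w in tf}
    total = sum(raw.values())
    if total == 0:  # every word occurs in every text: fall back to plain TF
        return tf
    return {w: v / total for w, v in raw.items()}


def relaxed_transfer_cost(weights_a: Dict[str, float],
                          words_b: Sequence[str],
                          vectors: Dict[str, Vector]) -> float:
    """Lower bound on min sum_ij T_ij c(i, j): every word i of text a sends
    all of its weight to the closest word j of text b."""
    targets = set(words_b)
    return sum(
        weight * min(math.dist(vectors[i], vectors[j]) for j in targets)
        for i, weight in weights_a.items()
    )


# Hypothetical toy data: 2-dimensional "word vectors" and three feature texts.
vectors = {"track": (0.0, 0.0), "location": (1.0, 0.0),
           "follow": (0.0, 1.0), "position": (1.0, 1.0)}
corpus = [["track", "location"], ["follow", "position"], ["track", "position"]]

idf = inverse_document_frequencies(corpus)
weights_a = tfidf_weights(corpus[0], idf)
cost_ab = relaxed_transfer_cost(weights_a, corpus[1], vectors)
cost_aa = relaxed_transfer_cost(weights_a, corpus[0], vectors)
print(weights_a, cost_ab, cost_aa)
```

In this toy corpus the rarer word "location" receives a larger weight than "track", and a text compared with itself has zero cost. The text above states only that similarity is inversely proportional to the minimum cost, so any concrete mapping from this cost to $sim_{char}(a, b)$ would be a further assumption.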

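The hybrid recommendation algorithm based on content features and item similarity comprises the following steps:

(1) Let the recommendation system contain two sets, a user set U and an item set I, where $U = \{u_1, u_2, u_3, \ldots, u_m\}$ and $I = \{i_1, i_2, i_3, \ldots, i_n\}$. Each user can rate all items, and each item can be rated by all users. Assuming that most services have been rated by users, matching users to services one by one yields an m × n user-service (U-S) rating matrix;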

(2) Let $r_{u,a}$ denote the rating of user u for service a, $r_{u,b}$ the rating of user u for service b, and $\bar{r}_u$ the average rating that user u has given to the services he or she has rated; the rating-based similarity between service a and service b is denoted $sim_{rating}(a, b)$;

(3) For two services a and b, the service similarity based on the content feature text, $sim_{char}(a, b)$, is calculated using the content feature extraction method of claim 2 and the content feature similarity calculation method based on TF-IDF and the word mover's distance algorithm of claim 3. Combined with the traditional similarity calculation method, the hybrid similarity of services a and b is $sim(a, b) = \lambda \cdot sim_{char}(a, b) + (1 - \lambda) \cdot sim_{rating}(a, b)$, where λ is the weight factor that balances the two similarity values;

(4) Let $pred(u, p)$ be the predicted rating of user u for service p and $sim(i, p)$ the hybrid similarity between a rated service i and the service p to be predicted; the predicted rating $pred(u, p)$ is calculated from the existing ratings of user u and these hybrid similarities.
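The blending of the two similarities in step (3) can be sketched in Python as follows. The rating-based similarity below uses the adjusted cosine measure, which is an assumption chosen because step (2) refers to each user's average rating; the dictionary layout, the function names and the example ratings are hypothetical.

```python
import math
from typing import Dict

Ratings = Dict[str, Dict[str, float]]  # user -> {service: rating}


def rating_similarity(ratings: Ratings, a: str, b: str) -> float:
    """Adjusted cosine similarity over users who rated both a and b (assumed form)."""
    numerator = norm_a = norm_b = 0.0
    for user_ratings in ratings.values():
        if a not in user_ratings or b not in user_ratings:
            continue
        mean = sum(user_ratings.values()) / len(user_ratings)
        dev_a = user_ratings[a] - mean
        dev_b = user_ratings[b] - mean
        numerator += dev_a * dev_b
        norm_a += dev_a ** 2
        norm_b += dev_b ** 2
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no co-rating users, e.g. a cold-start service
    return numerator / math.sqrt(norm_a * norm_b)


def hybrid_similarity(sim_char: float, sim_rating: float, lam: float) -> float:
    """sim(a, b) = lam * sim_char(a, b) + (1 - lam) * sim_rating(a, b)."""
    return lam * sim_char + (1.0 - lam) * sim_rating


ratings = {
    "u1": {"a": 5.0, "b": 4.0, "c": 1.0},
    "u2": {"a": 4.0, "b": 5.0, "c": 2.0},
    "u3": {"a": 1.0, "b": 2.0, "c": 5.0},
}
sim_ab = rating_similarity(ratings, "a", "b")

# Service "p" is new and has no ratings, so only the content term contributes.
sim_ap_rating = rating_similarity(ratings, "a", "p")
sim_ap = hybrid_similarity(sim_char=0.8, sim_rating=sim_ap_rating, lam=0.5)
print(sim_ab, sim_ap_rating, sim_ap)
```

The cold-start case shows why the blend helps: a service without ratings has a rating-based similarity of zero in this sketch, yet it still obtains a non-zero hybrid similarity through its content features.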

The method for relieving the cold start problem of the recommendation system based on the content feature extraction comprises the following steps:

(1) Establish a user-service (U-S) rating matrix from the existing rating data of users in the recommendation system;

(2) Extract the content features of the services in the recommendation system and convert them into word vectors;

(3) From the resulting content feature word vectors, calculate the content-based similarity between items, $sim_{char}(s_i, s_j)$, using the content feature similarity calculation method based on TF-IDF and the word mover's distance algorithm;

(4) From the user-service (U-S) rating matrix, calculate the rating-based service similarity $sim_{rating}(s_i, s_j)$ using the traditional item-based collaborative filtering recommendation algorithm;

(5) For each service $s_k$, calculate the predicted rating $pred(u_i, s_k)$ of user $u_i$ for service $s_k$ according to the hybrid recommendation algorithm based on content features and item similarity;

(6) Sort all services to be recommended in descending order of predicted rating, and take the first N services as the recommendation result for the target user $u_i$, which completes the service recommendation.
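Step (6) amounts to a sort followed by a cut at N. A minimal Python sketch is given below; the function name `recommend_top_n`, the `predict` callback and the example scores are hypothetical stand-ins for the predicted ratings produced in step (5).

```python
from typing import Callable, Dict, List, Sequence, Tuple


def recommend_top_n(user: str,
                    services: Sequence[str],
                    predict: Callable[[str, str], float],
                    n: int) -> List[Tuple[str, float]]:
    """Score every candidate service, sort by predicted rating (descending),
    and return the first n (service, score) pairs."""
    scored = [(service, predict(user, service)) for service in services]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical predicted ratings for one target user.
predicted: Dict[Tuple[str, str], float] = {
    ("u1", "s1"): 3.2,
    ("u1", "s2"): 4.7,
    ("u1", "s3"): 4.1,
}
top = recommend_top_n("u1", ["s1", "s2", "s3"],
                      lambda u, s: predicted[(u, s)], n=2)
print(top)
```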

The invention has the following beneficial effects:

1. By combining the similarity of item content features with the similarity of user ratings, the method alleviates the item cold start problem of conventional collaborative filtering recommendation algorithms, further reduces the average error, and improves the performance of the recommendation system. The recommendation algorithm adopted by the invention makes the recommendation results more reliable and accurate and improves the coverage of recommended items.

2. The method takes the features of the service description information into account, so it can accurately identify the features of items and enrich their description, thereby improving the accuracy of recommendation.

3. The content feature similarity calculation method based on TF-IDF and the word mover's distance algorithm takes into account that, in practice, the words in a feature text differ in importance, and therefore uses the TF-IDF-optimized word mover's distance algorithm to improve the accuracy of the similarity.

4. The hybrid recommendation algorithm based on content features and item similarity performs recommendation by combining the similarity calculated from the word distance with the traditional similarity calculation method, and thereby alleviates the item cold start problem.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of the method of the present invention for extracting content features from service feature text.

FIG. 2 is a flow chart of calculating the content feature text similarity between services according to the present invention.

FIG. 3 is an overall architecture diagram of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Example 1:

The invention provides a method for alleviating the cold start problem of a recommendation system based on content feature extraction. As shown in FIG. 1:

Content feature analysis is performed on the description information and the updated-function description information of an item, and the content features of the item information are extracted through part-of-speech tagging and dependency syntax analysis from natural language processing.

First, the feature text of the item is segmented into words, and each word is tagged with its part of speech.

Second, based on the part-of-speech tags of the segmented words, dependency syntax analysis is performed on the feature text of the item to analyze the relations between the words in the text.

The relations that exist in dependency parsing are as follows:

Subject-verb relation SBV

Verb-object relation VOB

Indirect-object relation IOB

Fronted object FOB

Pivotal construction DBL

Attribute relation ATT

Adverbial structure ADV

Verb-complement structure CMD

Coordinate relation COD

Preposition-object relation POB

Left adjunct relation LAD

Right adjunct relation RAD

Independent structure IS

Head relation HED

Finally, features are extracted from the text according to the relations found by dependency syntax analysis. The extracted relation phrases include:

subject-verb relation phrases: business services, take-away offerings, etc.;

verb-object relation phrases: providing channel, tracking location, etc.;

attribute relation phrases: air quality, urban landscape, etc.
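The final extraction step of Example 1 can be sketched in Python as follows, assuming the text has already been segmented, tagged and parsed by some dependency parser. The sketch keeps only the three relation types named above (SBV, VOB, ATT) and pairs each dependent word with its head. The `Token` structure, the function name and the hand-annotated example sentence are hypothetical and do not correspond to the output format of any particular parser.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    index: int      # 1-based position of the word in the sentence
    word: str
    head: int       # index of the head word, 0 for the sentence root
    relation: str   # dependency label, e.g. "SBV", "VOB", "ATT", "HED"


KEPT_RELATIONS = {"SBV", "VOB", "ATT"}


def extract_feature_phrases(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Return (relation, phrase) pairs for every kept dependency arc."""
    by_index = {t.index: t for t in tokens}
    phrases = []
    for token in tokens:
        if token.relation in KEPT_RELATIONS and token.head in by_index:
            head = by_index[token.head]
            # Keep the two words in their original sentence order.
            first, second = sorted((token, head), key=lambda t: t.index)
            phrases.append((token.relation, f"{first.word} {second.word}"))
    return phrases


# Hypothetical hand-annotated parse of the text "app tracks location".
tokens = [
    Token(1, "app", 2, "SBV"),
    Token(2, "tracks", 0, "HED"),
    Token(3, "location", 2, "VOB"),
]
phrases = extract_feature_phrases(tokens)
print(phrases)
```

The resulting phrases are the content features that would subsequently be converted into word vectors.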

Example 2:

In another aspect, the present invention further provides a hybrid recommendation algorithm based on content features and item similarity, comprising:

(1) Let the recommendation system contain two sets, a user set U and an item set I, where $U = \{u_1, u_2, u_3, \ldots, u_m\}$ and $I = \{i_1, i_2, i_3, \ldots, i_n\}$. Each user can rate all items, and each item can be rated by all users. Assuming that most services have been rated by users, matching users to services one by one yields an m × n user-service (U-S) rating matrix;

(2) Let $r_{u,a}$ denote the rating of user u for service a, $r_{u,b}$ the rating of user u for service b, and $\bar{r}_u$ the average rating that user u has given to the services he or she has rated; the rating-based similarity between service a and service b is denoted $sim_{rating}(a, b)$;

(3) For two services a and b, the service similarity based on the content feature text, $sim_{char}(a, b)$, is calculated using the content feature extraction method of claim 2 and the content feature similarity calculation method based on TF-IDF and the word mover's distance algorithm of claim 3. Combined with the traditional similarity calculation method, the hybrid similarity of services a and b is $sim(a, b) = \lambda \cdot sim_{char}(a, b) + (1 - \lambda) \cdot sim_{rating}(a, b)$, where λ is the weight factor that balances the two similarity values;

(4) Let $pred(u, p)$ be the predicted rating of user u for service p and $sim(i, p)$ the hybrid similarity between a rated service i and the service p to be predicted; the predicted rating $pred(u, p)$ is calculated from the existing ratings of user u and these hybrid similarities.
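The prediction formula itself is not reproduced in this text. As a hedged illustration only, the Python sketch below uses a similarity-weighted average of the user's existing ratings, a common choice in item-based collaborative filtering that is assumed here rather than taken from the source. The function name and the example values are hypothetical.

```python
from typing import Dict


def predict_rating(user_ratings: Dict[str, float],
                   sim_to_p: Dict[str, float]) -> float:
    """Similarity-weighted average of the ratings user u gave to services i,
    using sim(i, p) as the weight (assumed form of pred(u, p))."""
    rated = [i for i in user_ratings if i in sim_to_p]
    denominator = sum(abs(sim_to_p[i]) for i in rated)
    if denominator == 0:
        return 0.0  # no similar rated service available
    numerator = sum(sim_to_p[i] * user_ratings[i] for i in rated)
    return numerator / denominator


# Hypothetical ratings of user u and hybrid similarities sim(i, p).
user_ratings = {"a": 5.0, "b": 3.0}
sim_to_p = {"a": 0.6, "b": 0.2}
prediction = predict_rating(user_ratings, sim_to_p)
print(prediction)
```

Because service a is three times as similar to p as service b in this example, the prediction lies much closer to the rating of a than to the rating of b.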

Claims

1. A method for alleviating the cold start problem of a recommendation system based on content feature extraction, characterized in that: first, when content features are extracted, dependency syntax analysis from natural language processing is used to extract features from the description information of items, and the extracted content features are converted into word vectors;

second, because words differ in importance in practice, a TF-IDF-optimized weighted word mover's distance algorithm (Weighted Word Mover's Distance, WWMD) is used to improve the accuracy of the word distance computed between content feature vectors, thereby improving the accuracy of the similarity between items;

and finally, recommendation is performed by combining the similarity calculated from the word distance with a traditional similarity calculation method.

2. The method for alleviating the cold start problem of a recommendation system based on content feature extraction as claimed in claim 1, wherein the content feature extraction method based on natural language processing comprises: performing content feature analysis on the description information and the updated-function description information of an item, and extracting the content features of the item information through part-of-speech tagging and dependency syntax analysis from natural language processing;

first, the feature text of the item is segmented into words, and each word is tagged with its part of speech; second, based on the part-of-speech tags of the segmented words, dependency syntax analysis is performed on the feature text of the item to analyze the relations between the words in the text.

3. The method for alleviating the cold start problem of the recommendation system based on the content feature extraction as claimed in claim 1 or 2, wherein: the content feature similarity calculation method based on the TF-IDF optimized weighted word distance algorithm comprises the following steps of:

(1) Let $X \in \mathbb{R}^{d \times n}$ be a trained feature word vector matrix covering n words in total, whose i-th column $x_i \in \mathbb{R}^d$ is the d-dimensional word vector of the i-th word; the Euclidean distance between word i and word j is $c(i, j) = \lVert x_i - x_j \rVert_2$;

(2) A content feature text describing service a, after being processed by a skip-gram model, can be represented as a sparse bag-of-words vector $d_a \in \mathbb{R}^n$. If word i appears $n_{i,a}$ times in the content feature text and the total number of occurrences of all words in $d_a$ is $\sum_k n_{k,a}$, then the TF value of word i is $tf_{i,a} = \dfrac{n_{i,a}}{\sum_k n_{k,a}}$;

(3) Let the total number of content feature texts in the corpus be $|D|$ and the number of texts containing word i be $|\{a : i \in d_a\}|$; then the IDF value of word i is $idf_i = \log \dfrac{|D|}{|\{a : i \in d_a\}|}$, and the TF-IDF value of word i in the content feature description text of service a is $tfidf_{i,a} = tf_{i,a} \times idf_i$;

(4) Given two services a and b, let $d_a$ and $d_b$ be the bag-of-words representations of the two content feature texts to be compared. Each word i in $d_a$ may be transferred, wholly or partially, to words in $d_b$. Define a sparse transfer matrix $T \in \mathbb{R}^{n \times n}$, where $T_{ij} \ge 0$ indicates how much of word i in $d_a$ is transferred to word j in $d_b$; the total transfer cost is then $\sum_{i,j} T_{ij}\, c(i, j)$;

(5) Following the idea of the word mover's distance algorithm, the larger the total transfer cost, the lower the similarity between the two content feature texts being compared; that is, the similarity between the two texts is inversely proportional to the minimum total transfer cost between them. After the problem is converted into a linear programming problem, the content feature text similarity $sim_{char}(a, b)$ between services a and b is obtained from the minimum total transfer cost $\min_{T \ge 0} \sum_{i,j} T_{ij}\, c(i, j)$, subject to constraints on the transfer matrix T.

4. The method for alleviating the cold start problem of a recommendation system based on content feature extraction as claimed in claim 1, 2 or 3, characterized in that the hybrid recommendation algorithm based on content features and item similarity comprises the following steps:

(1) Let the recommendation system contain two sets, a user set U and an item set I, where $U = \{u_1, u_2, u_3, \ldots, u_m\}$ and $I = \{i_1, i_2, i_3, \ldots, i_n\}$. Each user can rate all items, and each item can be rated by all users. Assuming that most services have been rated by users, matching users to services one by one yields an m × n user-service (U-S) rating matrix;

(2) Let $r_{u,a}$ denote the rating of user u for service a, $r_{u,b}$ the rating of user u for service b, and $\bar{r}_u$ the average rating that user u has given to the services he or she has rated; the rating-based similarity between service a and service b is denoted $sim_{rating}(a, b)$;

(3) For two services a and b, the service similarity based on the content feature text, $sim_{char}(a, b)$, is calculated using the content feature extraction method of claim 2 and the content feature similarity calculation method based on TF-IDF and the word mover's distance algorithm of claim 3. Combined with the traditional similarity calculation method, the hybrid similarity of services a and b is $sim(a, b) = \lambda \cdot sim_{char}(a, b) + (1 - \lambda) \cdot sim_{rating}(a, b)$, where λ is the weight factor that balances the two similarity values;

(4) Let $pred(u, p)$ be the predicted rating of user u for service p and $sim(i, p)$ the hybrid similarity between a rated service i and the service p to be predicted; the predicted rating $pred(u, p)$ is calculated from the existing ratings of user u and these hybrid similarities.

5. The method for alleviating the cold start problem of a recommendation system based on content feature extraction as claimed in claim 1, 2, 3 or 4, characterized in that the method comprises the following steps:

(1) Establish a user-service (U-S) rating matrix from the existing rating data of users in the recommendation system;

(2) Extract the content features of the services in the recommendation system and convert them into word vectors;

(3) From the resulting content feature word vectors, calculate the content-based similarity between items, $sim_{char}(s_i, s_j)$, using the content feature similarity calculation method based on TF-IDF and the word mover's distance algorithm;

(4) From the user-service (U-S) rating matrix, calculate the rating-based service similarity $sim_{rating}(s_i, s_j)$ using the traditional item-based collaborative filtering recommendation algorithm;

(5) For each service $s_k$, calculate the predicted rating $pred(u_i, s_k)$ of user $u_i$ for service $s_k$ according to the hybrid recommendation algorithm based on content features and item similarity;

(6) Sort all services to be recommended in descending order of predicted rating, and take the first N services as the recommendation result for the target user $u_i$, which completes the service recommendation.