CN112100517A - Method for relieving cold start problem of recommendation system based on content feature extraction - Google Patents
Method for relieving cold start problem of recommendation system based on content feature extraction
- Publication number
- CN112100517A
- Authority
- CN
- China
- Prior art keywords
- word
- similarity
- content feature
- service
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9536—Search customisation based on social or collaborative filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention relates to a method for alleviating the cold start problem of a recommendation system based on content feature extraction. With the rapid development of Internet technology, information overload has become more pronounced, and it is difficult to find a suitable user group to recommend a new service to when that service has no historical user scores. In the method, content features are first extracted from the description information of items by dependency syntax analysis, a natural language processing technique, and the extracted content features are converted into word vectors. Second, because words differ in importance in practice, a Weighted Word Distance algorithm (WWMD) optimized based on TF-IDF is used to improve the accuracy of the word distance calculated between content feature vectors, which in turn improves the accuracy of the similarity between items. Finally, recommendation is performed by combining the similarity calculated from the word distance with a traditional similarity calculation method. The invention is applied in the Internet field.
Description
Technical Field
The invention relates to a method for relieving a cold start problem of a recommendation system based on content feature extraction.
Background
At present, with the rapid development of Internet technology, information overload has become more pronounced. A newly launched service that has no historical user scores can hardly be matched with a suitable user group, and users can hardly discover, among a large number of services, the newly launched ones that are likely to suit them. This is the common cold start problem of collaborative filtering, the most widely used recommendation algorithm in the field of recommendation systems.
Disclosure of Invention
The invention aims to provide a method for alleviating the cold start problem of a recommendation system based on content feature extraction. It mainly addresses the recommendation of items that have no historical scores by calculating the similarity between items from the word distance between their extracted content features.
This aim is achieved by the following technical scheme:
A method for alleviating the cold start problem of a recommendation system based on content feature extraction, in which, first, dependency syntax analysis from natural language processing is used to extract features from the description information of items, and the extracted content features are converted into word vectors; second, because words differ in importance in practice, the word distance algorithm is optimized with TF-IDF to improve the accuracy of the word distance calculated between content feature vectors, and thus the accuracy of the similarity between items; and finally, recommendation is performed by combining the similarity calculated from the word distance with a traditional similarity calculation method.
In the method, content feature extraction comprises the following steps: content feature analysis is performed on the description information and the update-function description information of an item, and the content features of the item information are extracted by part-of-speech tagging and dependency syntax analysis from natural language processing. First, the feature text of the item is segmented into words, and each word is tagged with its part of speech. Second, based on these part-of-speech tags, dependency syntax analysis is performed on the feature text to analyze the relations between the words in the text.
In the method, the content feature similarity calculation based on the TF-IDF-optimized weighted word distance algorithm comprises the following steps:
(1) Let $X \in \mathbb{R}^{d \times n}$ be a trained feature word vector matrix covering $n$ words in total. Column $i$, $x_i \in \mathbb{R}^d$, is the $d$-dimensional word vector of the $i$-th word. The Euclidean distance between word $i$ and word $j$ is $c(i,j) = \lVert x_i - x_j \rVert_2$;
(2) A content feature text describing service $a$ is processed by a skip-gram model and can be represented as a bag of words by a sparse vector $d_a \in \mathbb{R}^n$. If word $i$ appears $n_{i,a}$ times in the content feature text and the total number of occurrences of all words in $d_a$ is $\sum_k n_{k,a}$, then the TF value of word $i$ is $tf_{i,a} = n_{i,a} / \sum_k n_{k,a}$;
(3) Let the total number of content feature texts in the corpus be $|D|$ and the number of texts containing word $i$ be $|\{a : i \in d_a\}|$. The IDF value of word $i$ is $idf_i = \log\left(|D| / |\{a : i \in d_a\}|\right)$, and the TF-IDF value of word $i$ in the content feature text of service $a$ is $tfidf_{i,a} = tf_{i,a} \times idf_i$;
(4) Given two services $a$ and $b$, let $d_a$ and $d_b$ be the bag-of-words representations of the two content feature texts to be compared. Each word $i$ in $d_a$ may be transferred wholly or partially to $d_b$. Define a sparse transfer matrix $T \in \mathbb{R}^{n \times n}$, where $T_{ij} \ge 0$ indicates how much of word $i$ in $d_a$ is transferred to word $j$ in $d_b$. The total transfer cost is then $\sum_{i,j} T_{ij}\, c(i,j)$;
(5) Following the idea of the word distance algorithm, the larger the total transfer cost, the lower the similarity between the two content feature texts being compared; that is, the similarity between the two texts is inversely proportional to the minimum total transfer cost between them. After the problem is converted into a linear programming problem, the content feature text similarity $sim_{char}(a,b)$ between services $a$ and $b$ is obtained from the minimum of $\sum_{i,j} T_{ij}\, c(i,j)$ over $T$, subject to two constraints on $T$ [similarity formula and constraints missing from the source text].
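The weighting and transfer-cost steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the invention's implementation: the toy corpus, the two-dimensional word vectors and all function names are hypothetical; the TF-IDF weights are normalised to sum to 1 (an assumption); and, instead of solving the linear program, the code computes a relaxed lower bound that keeps only the assumed constraint that the flow leaving word i equals its weight, so each word of text a sends all of its weight to the nearest word of text b.

```python
import math
from collections import Counter

# Hypothetical toy data: three tokenised content feature texts and 2-d word vectors.
CORPUS = [
    ["air", "quality", "monitor"],
    ["city", "landscape"],
    ["air", "pollution", "alert"],
]
VECTORS = {
    "air": (1.0, 0.0), "quality": (0.9, 0.2), "monitor": (0.5, 0.5),
    "city": (0.0, 1.0), "landscape": (0.1, 0.9),
    "pollution": (0.8, 0.1), "alert": (0.6, 0.4),
}


def tfidf_weights(doc, corpus):
    """Normalised TF-IDF weight of every word in `doc`.

    `corpus` is a list of token lists and must contain `doc` itself,
    so that every word has a document frequency of at least 1.
    """
    counts = Counter(doc)
    total = sum(counts.values())
    weights = {}
    for word, n in counts.items():
        tf = n / total                                   # tf_{i,a}
        df = sum(1 for text in corpus if word in text)   # |{a : i in d_a}|
        weights[word] = tf * math.log(len(corpus) / df)  # tf * idf
    norm = sum(weights.values())
    if norm == 0:  # every word occurs in every text: fall back to plain TF (assumption)
        return {word: n / total for word, n in counts.items()}
    return {word: w / norm for word, w in weights.items()}


def euclidean(x, y):
    """c(i, j): Euclidean distance between two word vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def relaxed_transfer_cost(doc_a, doc_b, corpus, vectors):
    """Lower bound on the minimum total transfer cost from doc_a to doc_b."""
    weights_a = tfidf_weights(doc_a, corpus)
    return sum(
        w * min(euclidean(vectors[i], vectors[j]) for j in set(doc_b))
        for i, w in weights_a.items()
    )
```

On the toy data, moving the air-quality text onto the air-pollution text costs less than moving it onto the city-landscape text, which matches the intuition that a lower transfer cost means a higher content similarity. How the cost is mapped to a similarity value is left open here because the source formula is not available.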
In the method, the hybrid recommendation algorithm based on content features and item similarity comprises the following steps:
(1) Let the recommendation system contain two sets, a user set $U = \{u_1, u_2, u_3, \ldots, u_m\}$ and an item set $I = \{i_1, i_2, i_3, \ldots, i_n\}$. Each user can score all items. Assuming that most services have been scored by users, matching users to services one by one yields the user-service (U-S) scoring matrix [matrix missing from the source text];
(2) Let $r_{u,a}$ denote the score of user $u$ for service $a$, $r_{u,b}$ the score of user $u$ for service $b$, and $\bar{r}_u$ the average score that user $u$ gives to scored services. The score-based similarity between service $a$ and service $b$ is $sim_{rating}(a,b)$ [formula missing from the source text];
(3) Given two services $a$ and $b$, the service similarity based on content feature text, calculated with the content feature extraction method of claim 2 and the content feature similarity calculation method based on TF-IDF and the word distance algorithm of claim 3, is $sim_{char}(a,b)$. Combined with the traditional similarity calculation method, the mixed similarity of services $a$ and $b$ is $sim(a,b) = \lambda \cdot sim_{char}(a,b) + (1-\lambda) \cdot sim_{rating}(a,b)$, where $\lambda$ is the weight factor that balances the two similarity values;
(4) Let $pred(u,p)$ be the predicted score of user $u$ for service $p$ and $sim(i,p)$ the mixed similarity between a scored service $i$ and the service $p$ to be predicted. The predicted score is calculated as [formula missing from the source text].
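The hybrid steps above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the invention's formulas: the score-based similarity is taken to be the adjusted cosine commonly used in item-based collaborative filtering, the prediction is taken to be a similarity-weighted average of the user's existing scores, and a service with no co-scores is given a score-based similarity of 0. All function names and the data layout are hypothetical.

```python
import math


def adjusted_cosine(ratings, a, b):
    """Assumed sim_rating(a, b): adjusted cosine over users who scored both services.

    `ratings` maps user -> {service: score}; each user's mean score is subtracted.
    """
    num = den_a = den_b = 0.0
    for scores in ratings.values():
        if a in scores and b in scores:
            mean = sum(scores.values()) / len(scores)  # average score of this user
            dev_a, dev_b = scores[a] - mean, scores[b] - mean
            num += dev_a * dev_b
            den_a += dev_a ** 2
            den_b += dev_b ** 2
    if den_a == 0 or den_b == 0:
        return 0.0  # no usable co-scores, e.g. a cold-start service (assumption)
    return num / math.sqrt(den_a * den_b)


def mixed_similarity(sim_char, sim_rating, lam):
    """sim(a, b) = lam * sim_char(a, b) + (1 - lam) * sim_rating(a, b)."""
    return lam * sim_char + (1 - lam) * sim_rating


def predict(user_scores, target, sim):
    """Assumed pred(u, p): similarity-weighted average of the user's scores.

    `user_scores` maps scored service -> score; `sim(i, p)` returns the mixed similarity.
    Returns None when no scored service is similar to the target.
    """
    num = sum(sim(i, target) * r for i, r in user_scores.items())
    den = sum(abs(sim(i, target)) for i in user_scores)
    return num / den if den else None
```

For a service with no scores the score-based term vanishes, so the mixed similarity reduces to the content term weighted by lambda; this is how the content features keep a new service reachable.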
The method for alleviating the cold start problem of the recommendation system based on content feature extraction comprises the following steps:
(1) Establish a user-service (U-S) scoring matrix from the existing scoring data of users in the recommendation system;
(2) Extract the content features of the services in the recommendation system and convert them into word vectors;
(3) From the resulting content feature word vectors, calculate the content-based similarity between items, $sim_{char}(s_i, s_j)$, using the content feature similarity calculation method based on TF-IDF and the word distance algorithm;
(4) From the user-service (U-S) scoring matrix, calculate the score-based service similarity $sim_{rating}(s_i, s_j)$ using the traditional item-based collaborative filtering recommendation algorithm;
(5) For each service $s_k$, calculate the predicted score $pred(u_i, s_k)$ of user $u_i$ for service $s_k$ according to the hybrid recommendation algorithm based on content features and item similarity;
(6) Sort all candidate services in descending order of predicted score and recommend the first N services to the target user $u_i$, which completes the service recommendation.
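The final ranking step can be sketched in Python as follows. This is a minimal sketch: the function name and the data layout are hypothetical, and excluding services the user has already scored is an assumption that the text does not state.

```python
def recommend_top_n(predicted, already_scored, n):
    """Return the n services with the highest predicted score.

    `predicted` maps service -> predicted score for one target user;
    `already_scored` is the set of services that user has scored before.
    """
    candidates = {s: p for s, p in predicted.items() if s not in already_scored}
    # Sort service ids by predicted score, highest first.
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:n]
```

For example, `recommend_top_n({"s1": 3.2, "s2": 4.7, "s3": 4.1, "s4": 4.9}, {"s4"}, 2)` keeps `s2` and `s3`, because `s4` has already been scored by the user.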
The invention has the following beneficial effects:
1. By combining the similarity of item content features with user scoring similarity, the method addresses the item cold start problem of conventional collaborative filtering recommendation algorithms, further reduces the average error, and improves the performance of the recommendation system. The recommendation algorithm adopted by the invention makes the recommendation results more reliable and accurate and improves the coverage of recommended items.
2. The method for alleviating the cold start problem of the recommendation system based on content feature extraction takes the features of the service description information into account, so that item features can be identified accurately and the description of items is strengthened, which improves recommendation accuracy.
3. The content feature similarity calculation method based on TF-IDF and the word distance algorithm takes into account that, in practice, words in the feature text differ in importance, so the TF-IDF-optimized word distance algorithm is used to improve the accuracy of the similarity.
4. The hybrid recommendation algorithm based on content features and item similarity combines the similarity calculated from the word distance with a traditional similarity calculation method for recommendation, which alleviates the item cold start problem.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
Fig. 1 is a flow chart of the method of the present invention for extracting content features from service feature text.
Fig. 2 is a flow chart of the present invention for computing content feature text similarity between services.
Fig. 3 is an overall architecture diagram of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1:
The invention provides a method for alleviating the cold start problem of a recommendation system based on content feature extraction, in which content features are extracted as shown in Fig. 1:
Content feature analysis is performed on the description information and the update-function description information of an item, and the content features of the item information are extracted by part-of-speech tagging and dependency syntax analysis from natural language processing.
First, the feature text of the item is segmented into words, and each word is tagged with its part of speech.
Second, based on these part-of-speech tags, dependency syntax analysis is performed on the feature text of the item to analyze the relations between the words in the text.
The relations that occur in dependency syntax analysis are as follows:
Subject-verb relation SBV
Verb-object relation VOB
Indirect-object relation IOB
Fronted object FOB
Pivotal (double) construction DBL
Attributive relation ATT
Adverbial structure ADV
Verb-complement structure CMD
Coordinate relation COD
Preposition-object relation POB
Left adjunct relation LAD
Right adjunct relation RAD
Independent structure IS
Head (core) relation HED
Finally, features are extracted from the text according to the relations found by dependency syntax analysis. The extracted relation phrases include:
subject-verb relation phrases: business services, take-away offerings, etc.;
verb-object relation phrases: providing channels, tracking location, etc.;
attributive relation phrases: air quality, urban landscape, etc.
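The phrase extraction described in this example can be sketched in Python as follows. This is a sketch under stated assumptions: the text is taken to be already segmented and parsed by some dependency parser, the token layout `(word, head_index, relation)` and the function name are hypothetical, and only the SBV, VOB and ATT relations named above are kept.

```python
def extract_phrases(tokens, wanted=("SBV", "VOB", "ATT")):
    """Collect two-word phrases for the wanted dependency relations.

    Each token is (word, head_index, relation); head_index is the 0-based
    position of the governing word, or -1 for the root (HED).
    """
    phrases = []
    for idx, (word, head, rel) in enumerate(tokens):
        if rel in wanted and head >= 0:
            head_word = tokens[head][0]
            # Keep the two words in their surface order.
            pair = (word, head_word) if idx < head else (head_word, word)
            phrases.append((rel, " ".join(pair)))
    return phrases
```

With a hypothetical parse of "app tracks location", `[("app", 1, "SBV"), ("tracks", -1, "HED"), ("location", 1, "VOB")]`, the function yields one subject-verb phrase and one verb-object phrase; the resulting phrases are what would then be converted into word vectors.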
Example 2:
In another aspect, the present invention further provides a hybrid recommendation algorithm based on content features and item similarity, comprising:
(1) Let the recommendation system contain two sets, a user set $U = \{u_1, u_2, u_3, \ldots, u_m\}$ and an item set $I = \{i_1, i_2, i_3, \ldots, i_n\}$. Each user can score all items. Assuming that most services have been scored by users, matching users to services one by one yields the user-service (U-S) scoring matrix [matrix missing from the source text];
(2) Let $r_{u,a}$ denote the score of user $u$ for service $a$, $r_{u,b}$ the score of user $u$ for service $b$, and $\bar{r}_u$ the average score that user $u$ gives to scored services. The score-based similarity between service $a$ and service $b$ is $sim_{rating}(a,b)$ [formula missing from the source text];
(3) Given two services $a$ and $b$, the service similarity based on content feature text, calculated with the content feature extraction method of claim 2 and the content feature similarity calculation method based on TF-IDF and the word distance algorithm of claim 3, is $sim_{char}(a,b)$. Combined with the traditional similarity calculation method, the mixed similarity of services $a$ and $b$ is $sim(a,b) = \lambda \cdot sim_{char}(a,b) + (1-\lambda) \cdot sim_{rating}(a,b)$, where $\lambda$ is the weight factor that balances the two similarity values;
(4) Let $pred(u,p)$ be the predicted score of user $u$ for service $p$ and $sim(i,p)$ the mixed similarity between a scored service $i$ and the service $p$ to be predicted. The predicted score is calculated as [formula missing from the source text].
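As a worked illustration of the mixed similarity in step (3), the arithmetic below uses hypothetical values $\lambda = 0.6$, $sim_{char}(a,b) = 0.8$ and $sim_{rating}(a,b) = 0.5$; none of these numbers comes from the invention. The last line additionally assumes that the score-based similarity of a service without any scores is taken as zero.

```latex
\begin{align*}
sim(a,b) &= \lambda \cdot sim_{char}(a,b) + (1-\lambda) \cdot sim_{rating}(a,b) \\
         &= 0.6 \times 0.8 + 0.4 \times 0.5 = 0.48 + 0.20 = 0.68 \\
sim(a,b) &= 0.6 \times 0.8 + 0.4 \times 0 = 0.48
  \quad \text{(cold start, assuming } sim_{rating}(a,b) = 0\text{)}
\end{align*}
```

Under these assumptions a service with no scores still receives a non-zero similarity through its content features, so it can enter the prediction step.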
Finally, the invention provides a hybrid recommendation algorithm based on content features and item similarity, which comprises the following steps:
(1) Let the recommendation system contain two sets, a user set $U = \{u_1, u_2, u_3, \ldots, u_m\}$ and an item set $I = \{i_1, i_2, i_3, \ldots, i_n\}$. Each user can score all items, and each item can be scored by all users. Assuming that most services have been scored by users, matching users to services one by one yields the user-service (U-S) scoring matrix [matrix missing from the source text];
(2) Let $r_{u,a}$ denote the score of user $u$ for service $a$, $r_{u,b}$ the score of user $u$ for service $b$, and $\bar{r}_u$ the average score that user $u$ gives to scored services. The score-based similarity between service $a$ and service $b$ is $sim_{rating}(a,b)$ [formula missing from the source text];
(3) Given two services $a$ and $b$, the service similarity based on content feature text, calculated with the content feature extraction method of claim 2 and the content feature similarity calculation method based on TF-IDF and the word distance algorithm of claim 3, is $sim_{char}(a,b)$. Combined with the traditional similarity calculation method, the mixed similarity of services $a$ and $b$ is $sim(a,b) = \lambda \cdot sim_{char}(a,b) + (1-\lambda) \cdot sim_{rating}(a,b)$, where $\lambda$ is the weight factor that balances the two similarity values.
Claims (5)
1. A method for alleviating the cold start problem of a recommendation system based on content feature extraction, characterized in that: when content features are extracted, dependency syntax analysis from natural language processing is used to extract features from the description information of items, and the extracted content features are converted into word vectors;
second, because words differ in importance in practice, a Weighted Word Distance algorithm (WWMD) optimized based on TF-IDF is used to improve the accuracy of the word distance calculated between content feature vectors, thereby improving the accuracy of the similarity between items;
and finally, recommendation is performed by combining the similarity calculated from the word distance with a traditional similarity calculation method.
2. The method for alleviating the cold start problem of a recommendation system based on content feature extraction as claimed in claim 1, characterized in that the content feature extraction method based on natural language processing comprises the following steps: content feature analysis is performed on the description information and the update-function description information of an item, and the content features of the item information are extracted by part-of-speech tagging and dependency syntax analysis from natural language processing;
first, the feature text of the item is segmented into words, and each word is tagged with its part of speech; second, based on these part-of-speech tags, dependency syntax analysis is performed on the feature text of the item to analyze the relations between the words in the text.
3. The method for alleviating the cold start problem of a recommendation system based on content feature extraction as claimed in claim 1 or 2, characterized in that the content feature similarity calculation method based on the TF-IDF-optimized weighted word distance algorithm comprises the following steps:
(1) Let $X \in \mathbb{R}^{d \times n}$ be a trained feature word vector matrix covering $n$ words in total. Column $i$, $x_i \in \mathbb{R}^d$, is the $d$-dimensional word vector of the $i$-th word. The Euclidean distance between word $i$ and word $j$ is $c(i,j) = \lVert x_i - x_j \rVert_2$;
(2) A content feature text describing service $a$ is processed by a skip-gram model and can be represented as a bag of words by a sparse vector $d_a \in \mathbb{R}^n$. If word $i$ appears $n_{i,a}$ times in the content feature text and the total number of occurrences of all words in $d_a$ is $\sum_k n_{k,a}$, then the TF value of word $i$ is $tf_{i,a} = n_{i,a} / \sum_k n_{k,a}$;
(3) Let the total number of content feature texts in the corpus be $|D|$ and the number of texts containing word $i$ be $|\{a : i \in d_a\}|$. The IDF value of word $i$ is $idf_i = \log\left(|D| / |\{a : i \in d_a\}|\right)$, and the TF-IDF value of word $i$ in the content feature text of service $a$ is $tfidf_{i,a} = tf_{i,a} \times idf_i$;
(4) Given two services $a$ and $b$, let $d_a$ and $d_b$ be the bag-of-words representations of the two content feature texts to be compared. Each word $i$ in $d_a$ may be transferred wholly or partially to $d_b$. Define a sparse transfer matrix $T \in \mathbb{R}^{n \times n}$, where $T_{ij} \ge 0$ indicates how much of word $i$ in $d_a$ is transferred to word $j$ in $d_b$. The total transfer cost is then $\sum_{i,j} T_{ij}\, c(i,j)$;
(5) Following the idea of the word distance algorithm, the larger the total transfer cost, the lower the similarity between the two content feature texts being compared; that is, the similarity between the two texts is inversely proportional to the minimum total transfer cost between them. After the problem is converted into a linear programming problem, the content feature text similarity $sim_{char}(a,b)$ between services $a$ and $b$ is obtained from the minimum of $\sum_{i,j} T_{ij}\, c(i,j)$ over $T$, subject to two constraints on $T$ [similarity formula and constraints missing from the source text].
4. The method for alleviating the cold start problem of a recommendation system based on content feature extraction as claimed in claim 1, 2 or 3, characterized in that the hybrid recommendation algorithm based on content features and item similarity comprises the following steps:
(1) Let the recommendation system contain two sets, a user set $U = \{u_1, u_2, u_3, \ldots, u_m\}$ and an item set $I = \{i_1, i_2, i_3, \ldots, i_n\}$, where each user can score all items and each item can be scored by all users. Assuming that most services have been scored by users, matching users to services one by one yields the user-service (U-S) scoring matrix;
(2) Let $r_{u,a}$ denote the score of user $u$ for service $a$, $r_{u,b}$ the score of user $u$ for service $b$, and $\bar{r}_u$ the average score that user $u$ gives to scored services. The score-based similarity between service $a$ and service $b$ is $sim_{rating}(a,b)$ [formula missing from the source text];
(3) Given two services $a$ and $b$, the service similarity based on content feature text, calculated with the content feature extraction method of claim 2 and the content feature similarity calculation method based on TF-IDF and the word distance algorithm of claim 3, is $sim_{char}(a,b)$. Combined with the traditional similarity calculation method, the mixed similarity of services $a$ and $b$ is $sim(a,b) = \lambda \cdot sim_{char}(a,b) + (1-\lambda) \cdot sim_{rating}(a,b)$, where $\lambda$ is the weight factor that balances the two similarity values;
5. The method for alleviating the cold start problem of a recommendation system based on content feature extraction as claimed in claim 1, 2, 3 or 4, characterized in that the method comprises the following steps:
(1) Establish a user-service (U-S) scoring matrix from the existing scoring data of users in the recommendation system;
(2) Extract the content features of the services in the recommendation system and convert them into word vectors;
(3) From the resulting content feature word vectors, calculate the content-based similarity between items, $sim_{char}(s_i, s_j)$, using the content feature similarity calculation method based on TF-IDF and the word distance algorithm;
(4) From the user-service (U-S) scoring matrix, calculate the score-based service similarity $sim_{rating}(s_i, s_j)$ using the traditional item-based collaborative filtering recommendation algorithm;
(5) For each service $s_k$, calculate the predicted score $pred(u_i, s_k)$ of user $u_i$ for service $s_k$ according to the hybrid recommendation algorithm based on content features and item similarity;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010977547.2A CN112100517A (en) | 2020-09-17 | 2020-09-17 | Method for relieving cold start problem of recommendation system based on content feature extraction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010977547.2A CN112100517A (en) | 2020-09-17 | 2020-09-17 | Method for relieving cold start problem of recommendation system based on content feature extraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112100517A true CN112100517A (en) | 2020-12-18 |
Family
ID=73758846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010977547.2A Pending CN112100517A (en) | 2020-09-17 | 2020-09-17 | Method for relieving cold start problem of recommendation system based on content feature extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112100517A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108287916A (en) * | 2018-02-11 | 2018-07-17 | 北京方正阿帕比技术有限公司 | A kind of resource recommendation method |
CN108573411A (en) * | 2018-04-17 | 2018-09-25 | 重庆理工大学 | Depth sentiment analysis and multi-source based on user comment recommend the mixing of view fusion to recommend method |
CN108776940A (en) * | 2018-06-04 | 2018-11-09 | 南京邮电大学盐城大数据研究院有限公司 | A kind of intelligent food and drink proposed algorithm excavated based on text comments |
CN109063147A (en) * | 2018-08-06 | 2018-12-21 | 北京航空航天大学 | Online course forum content recommendation method and system based on text similarity |
CN110851731A (en) * | 2019-09-25 | 2020-02-28 | 浙江工业大学 | Collaborative filtering recommendation method for user attribute coupling similarity and interest semantic similarity |
KR102155768B1 (en) * | 2019-10-02 | 2020-09-14 | 한경훈 | Method for providing question and answer data set recommendation service using adpative learning from evoloving data stream for shopping mall |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||