CN117474703B - Topic intelligent recommendation method based on social network - Google Patents


Info

Publication number
CN117474703B
CN117474703B (application CN202311809535.9A)
Authority
CN
China
Prior art keywords
topic
word
text content
text
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311809535.9A
Other languages
Chinese (zh)
Other versions
CN117474703A
Inventor
方波
唐路遥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Huiyou Network Technology Co ltd
Original Assignee
Wuhan Huiyou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Huiyou Network Technology Co ltd filed Critical Wuhan Huiyou Network Technology Co ltd
Priority to CN202311809535.9A priority Critical patent/CN117474703B/en
Publication of CN117474703A publication Critical patent/CN117474703A/en
Application granted granted Critical
Publication of CN117474703B publication Critical patent/CN117474703B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
        • G06 — COMPUTING; CALCULATING OR COUNTING
            • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
                • G06Q 50/00 — Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
                    • G06Q 50/01 — Social networking
            • G06F — ELECTRIC DIGITAL DATA PROCESSING
                • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
                    • G06F 16/90 — Details of database functions independent of the retrieved data types
                        • G06F 16/95 — Retrieval from the web
                            • G06F 16/953 — Querying, e.g. by the use of web search engines
                                • G06F 16/9535 — Search customisation based on user profiles and personalisation
                                • G06F 16/9536 — Search customisation based on social or collaborative filtering
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
                • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to the technical field of data-mining-based recommendation, and in particular to an intelligent topic recommendation method based on a social network, comprising the following steps: acquiring the sentences in every text content of every topic's text content data set, the text vocabulary sequence of each sentence, and the word vector of each word in each sequence; acquiring the part of speech of each word in each text vocabulary sequence and counting the word frequency of every word in each text content, thereby obtaining a topic core word attribute score; building a topic high-frequency thesaurus; obtaining an importance coefficient for each sentence of each text content from the topic core word attribute score, and from it a topic core content representative coefficient for each text content; acquiring a vocabulary association index between topics from the word vectors; acquiring a topic core content association index of each text content between topics, deriving the interest coincidence degree of recommended topics, and recommending content to the user accordingly. The invention aims to solve the poor recommendation effect caused by considering only content similar to the user's history.

Description

Topic intelligent recommendation method based on social network
Technical Field
The application relates to the technical field of data-mining-based recommendation, and in particular to an intelligent topic recommendation method based on a social network.
Background
With the rapid development of the internet, more and more people use it as an important channel for information acquisition and social contact. Social networks have risen rapidly on this trend, becoming platforms where people communicate and share. Intelligent recommendation on a social network can provide personalized, accurate content according to a user's behavior, interests, and social relations, improving the user's experience of the platform. By analyzing the diversity of a user's social circle, richer and more varied content can be recommended, preventing the formation of an information cocoon (filter bubble).
An intelligent topic recommendation system for a social network must analyze the text content of the topics a user is interested in, generally using natural language processing to extract the text's key information. Text data are complex and varied, and the keyword information of different topics differs. A conventional intelligent topic recommendation method applies a topic-word extraction algorithm to obtain a user's high-frequency topic keywords on the social network, extracts the keywords of topic texts in a topic library, and recommends to the user the topics whose keywords match best. Such conventional recommendation algorithms suffer from poor recommendation effect because the user's historical behavior creates a filtering bias: only content similar to the history is considered.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an intelligent topic recommendation method based on a social network.
The intelligent topic recommendation method based on the social network adopts the following technical scheme:
the embodiment of the invention provides a social network-based topic intelligent recommendation method, which comprises the following steps of:
acquiring sentences in all text contents in a text content data set of all topics, text vocabulary sequences of all sentences and word vectors of vocabulary in each text vocabulary sequence;
acquiring the part of speech of each word in each text vocabulary sequence, and counting the word frequency of every word in each text content; obtaining a topic core word attribute score for each sentence from the word frequencies of the words in the sentence and the number of words separating them; obtaining each topic's high-frequency thesaurus from the word frequencies of words in its text content data set; obtaining an importance coefficient for each sentence in each text content from the number of its words that appear in the topic high-frequency thesaurus and its topic core word attribute score; obtaining a topic core content representative coefficient for each text content from the word frequencies of words in its sentences, the number of its words that appear in the topic high-frequency thesaurus, and the importance coefficients of its sentences; acquiring a vocabulary association index between topics from the cosine similarity of the word vectors of their words; obtaining a topic core content association index between topics for each text content from the topic core content representative coefficients and the vocabulary association index; acquiring the recommended-topic interest coincidence degree of each text content from the topic core content association index and the numbers of likes, comments, and shares the user gives the topics;
and calculating the recommended-topic interest coincidence degree of the text contents of all topics not browsed in the last three days, sorting them in descending order, and intelligently recommending the topic contents in that order.
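The final ranking step of the summary above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function and content names are assumptions for the example.

```python
# Hypothetical final ranking step: sort un-browsed text contents by their
# recommended-topic interest coincidence degree (computed earlier) in
# descending order and emit them as the recommendation queue.
def rank_recommendations(coincidence):
    """coincidence: {content_id: interest coincidence score}."""
    return [cid for cid, _ in sorted(coincidence.items(),
                                     key=lambda kv: kv[1], reverse=True)]

scores = {"post_17": 0.42, "post_3": 0.91, "post_8": 0.66}
queue = rank_recommendations(scores)
print(queue)  # highest coincidence first: ['post_3', 'post_8', 'post_17']
```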
Further, the acquiring the part of speech of each word in the text word sequence includes:
and marking the parts of speech in each text vocabulary sequence by using a BiLSTM-CRF model to obtain the parts of speech of each vocabulary.
Further, the counting word frequencies of all words in each text content includes:
using an N-gram model to perform word-frequency statistics on the words in the text vocabulary sequence, obtaining the word frequency of each word within its text content.
Further, the obtaining the topic core word attribute score of each sentence includes:
in each sentence of each text content of each topic, counting the number of nouns among the sentence's words, calculating the product of the word frequencies (within the text content) of each noun and each other word in the sentence, and counting the number of words separating each such pair;
calculating the ratio of each product to the corresponding number of separating words, and taking the sum of all such ratios in the sentence as the sentence's topic core word attribute score.
Further, the obtaining of each topic's high-frequency thesaurus includes:
counting the n words with the highest word frequency in the topic's text content data set, and sorting them in ascending order of word frequency to form the topic high-frequency thesaurus;
where n is a preset number of high-frequency words.
Further, the obtaining the importance coefficient of each sentence in each text content includes:
in each sentence of each text content of each topic, calculating the product of the number of the sentence's words that appear in the topic high-frequency thesaurus and the sentence's topic core word attribute score as the sentence's importance coefficient.
Further, the obtaining topic core content representative coefficients of each text content includes:
for each text content of each topic, calculating the product of each sentence's word frequency and importance coefficient, taking the average of these products over all sentences, and multiplying it by the number of the text content's words that appear in the topic high-frequency thesaurus to obtain the text content's topic core content representative coefficient.
Further, the obtaining the vocabulary association index between topics includes:
calculating the cosine similarity between the word vector of each word of each text content in one topic and that of each word of each text content in another topic, and taking the sum of all the cosine similarities as the vocabulary association index between the two topics.
Further, the obtaining the topic core content association index of each text content between topics includes:
calculating the product of the topic core content representative coefficients of the txt-th text content in one topic and the txt-th text content in another topic, and multiplying it by the vocabulary association index between the two topics to obtain a first product;
calculating the absolute difference between the word frequency of the j-th word in the txt-th text content of one topic and that of the j-th word in the txt-th text content of the other topic, and adding a preset adjusting parameter to obtain a sum;
calculating the ratio of the first product to the sum, and taking the average of this ratio over all words as the topic core content association index between the two topics' txt-th text contents.
Further, the obtaining the interest coincidence degree of the recommended topic of the text content includes:
for each topic, calculating the absolute difference between the total number of likes, comments, and shares the user gave the topic in the current month and that total for the previous month, and dividing by the previous month's total to obtain the topic's average change rate for the current month;
calculating an average value from the topic core content association indexes between the txt-th text contents of the browsed topics and the candidate topic together with the average change rates, and taking the absolute difference between this average and the number of times each browsed topic was viewed in the previous month as the recommended-topic interest coincidence degree of the candidate topic's txt-th text content.
The invention has at least the following beneficial effects:
according to the method, topic content frequently browsed by a user is analyzed, topic core content association indexes are constructed, the indexes can represent possible browsing degrees of the user on non-recommended topics, the interested degrees of the content to be recommended are ordered according to the topic core content association indexes, and intelligent recommendation is carried out according to the interested degree. The defect that the conventional recommendation algorithm is difficult to provide personalized topic contents is overcome, the problem that the recommendation effect is poor due to the fact that only similar contents as histories are considered is solved, and the beneficial effects of topic recommendation novelty and diversity in a social network are improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a social network-based topic intelligent recommendation method provided by the invention;
fig. 2 is a topic recommendation flow chart.
Detailed Description
To further describe the technical means adopted by the invention and their effects, the social-network-based intelligent topic recommendation method is described in detail below with reference to the accompanying drawings and preferred embodiments, covering its specific implementation, structure, features, and effects. In the following description, different occurrences of "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of the topic intelligent recommendation method based on the social network, provided by the invention, with reference to the accompanying drawings.
The invention provides a social-network-based intelligent topic recommendation method which, referring to fig. 1, comprises the following steps:
step S001, collecting text content data sets of topics frequently focused by users in the social network, and preprocessing data.
First, the user's relevant data on the social network are obtained. Using the social media platforms' APIs, content such as posts, comments, and reposts published by the user on the social network can be collected. An API is a set of methods by which two software applications interact; as this is well known to those skilled in the art, it is not described further here. In this embodiment, one month of text content from topics the user follows is collected from three social media platforms (a learning platform, a microblog platform, and a tiger platform), with the user's authorization.
The collected topic text content data are then preprocessed. The extracted data may contain noise such as special symbols, emoticons, and HTML tags. Special symbols and emoticon noise are cleared by tokenization and lemmatization using the NLTK natural language processing library in Python. HTML tags are parsed with the Beautiful Soup library in Python to obtain plain Chinese text. Tokenization and lemmatization are known in the art and are not described in detail here.
After the plain Chinese text content of the topic is obtained, each sentence in it must be segmented into words. This embodiment uses a BERT-based word segmentation model: its input is a single sentence of the text content and its output is the segmented word sequence, with words separated by spaces. The same BERT model also converts each word of every sentence into a semantic vector, its word vector, so that each word in the text content data is represented by one word vector. The BERT word segmentation model is known to those skilled in the art and is not described further here.
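The segmentation-plus-embedding step can be exercised with a stand-in sketch. A real embodiment would load a BERT checkpoint; here a toy whitespace tokenizer and a deterministic hash-based unit vector stand in for the model, purely to show the pipeline shape (sentence to word sequence to per-word vector). All names are illustrative, not from the patent.

```python
import hashlib
import math

def segment(sentence):
    # Stand-in for the BERT word segmentation model: a real embodiment
    # would call a Chinese word-segmentation model here.
    return sentence.split()

def word_vector(word, dim=8):
    # Stand-in for a BERT embedding: a deterministic, normalized vector
    # derived from a hash of the word, so the same word always maps to
    # the same point in vector space.
    h = hashlib.sha256(word.encode("utf-8")).digest()
    v = [b / 255.0 for b in h[:dim]]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

words = segment("users share daily life topics")
vectors = {w: word_vector(w) for w in words}
print(len(vectors), len(vectors["topics"]))
```

Unlike real BERT vectors, these hash vectors carry no semantics; they only preserve the interface the later cosine-similarity step consumes.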
After segmentation, words that occur frequently but carry little information, such as conjunctions, prepositions, and pronouns, are removed. Stop-word removal is performed using a stop-word list provided by the spaCy library in Python, avoiding the negative impact of these words on subsequent analysis.
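The stop-word filtering step amounts to the following sketch. A real embodiment would load the list from spaCy; the short hand-written list here is an assumption standing in for it.

```python
# Minimal stop-word filtering sketch; a small hand-written list stands in
# for the spaCy-provided stop-word list used by the embodiment.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "it", "is"}

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop-word list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "price", "of", "milk", "is", "rising"]))
# → ['price', 'milk', 'rising']
```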
At this point, the sentences in all text contents of the text content data set, the text vocabulary sequence of each sentence, and the word vector of each word in each text vocabulary sequence have been acquired.
Step S002, processing and analyzing topic text data focused by the user, extracting topic keyword features, and constructing recommended topic interest conformity.
Step S001 obtains the text vocabulary of the topics the user has followed on the social network over the past month. Topics are of various types, such as daily life, entertainment, scientific and technological innovation, and social and current affairs. On a social platform, different topics are classified under different tags, and the higher-frequency keywords of different topics' content differ.
Taking daily life topics as an example, extracting and analyzing keywords of text content data of the topics.
First, the topics a user frequently browses are usually distinguished by different tag words, which are themselves entity words. To identify different topics, a combined BERT+LSTM+CRF model is applied to each text content data set to recognize entity words and thereby the tag words of each text content. The BERT, LSTM, and CRF models are well known to those skilled in the art and are not described here.
In topic recommendation, different parts of speech differ in importance to the topic's text content; nouns, verbs, and adjectives matter most. Nouns generally denote things or concepts and carry a high amount of information; verbs denote actions or changes, describing a thing's state of motion; adjectives describe a thing's attributes. Taking the daily life topic as an example, its texts use nouns describing daily goods more than other parts of speech, so nouns are the most important part of speech for that topic; the analysis below builds on this characteristic.
First, a BiLSTM-CRF model tags the part of speech of each word in each text vocabulary sequence, e.g. noun, verb, or adjective; its input is the words of the sequence and its output is each word's part-of-speech label. Then, in this embodiment, word-frequency statistics are computed for the words of each text vocabulary sequence with an N-gram model, yielding each word's frequency within its text content. The BiLSTM-CRF model and the N-gram model are known to those skilled in the art and are not described in detail here.
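For the per-word frequencies consumed by the later scores, the N-gram statistics reduce to unigram counts, which can be sketched with `collections.Counter`. The function name is illustrative.

```python
from collections import Counter

# Word-frequency statistics for one text content: count how often each
# word occurs across the text's tokenized sentences (unigram, N = 1).
def word_frequencies(sentences):
    """sentences: list of tokenized sentences (lists of words)."""
    return dict(Counter(w for s in sentences for w in s))

text = [["cat", "sat", "mat"], ["cat", "ran"]]
print(word_frequencies(text))  # {'cat': 2, 'sat': 1, 'mat': 1, 'ran': 1}
```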
A topic core word attribute score is constructed from the word frequencies and distances of words of different parts of speech within a topic:

$$Q_{x,t,k}=\sum_{i=1}^{N_k}\sum_{j=1}^{M_k}\frac{F(c_i)\cdot F(c_j)}{d(c_i,c_j)}$$

where $Q_{x,t,k}$ is the topic core word attribute score of the $k$-th sentence of the $t$-th text content in topic $x$; taking nouns as the important part of speech, $N_k$ is the number of nouns in the sentence and $c_i$ its $i$-th noun; $M_k$ is the number of words of other parts of speech and $c_j$ its $j$-th such word; $F(\cdot)$ is a word's frequency within the $t$-th text content of topic $x$; and $d(c_i,c_j)$ is the number of words between $c_i$ and $c_j$ in the sentence.

Formula logic: for topic $x$, the more nouns the $k$-th sentence contains and the higher the word frequencies of the accompanying words, the larger the product $F(c_i)\cdot F(c_j)$, and the more closely the pair is linked to topic $x$. Meanwhile, the closer two words are, the smaller $d(c_i,c_j)$ and the higher the degree of semantic association between them. The higher the resulting topic core word attribute score, the more important the information expressed in the $t$-th text content of topic $x$.
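The topic core word attribute score of one sentence can be sketched as follows: sum, over every (noun, other-word) pair in the sentence, the product of the two words' text-level frequencies divided by their in-sentence distance. Since the description's "number of interval words" would be zero for adjacent words, the positional distance |i - j| (at least 1) stands in for it here; that substitution, and all names, are assumptions of this sketch.

```python
# Sketch of the topic core word attribute score for one sentence.
# `freq` maps word -> frequency in the whole text content;
# `pos` maps word -> part-of-speech tag.
def core_word_attribute_score(sentence, freq, pos):
    score = 0.0
    for i, wi in enumerate(sentence):
        if pos.get(wi) != "NOUN":
            continue
        for j, wj in enumerate(sentence):
            if i == j or pos.get(wj) == "NOUN":
                continue
            gap = abs(i - j)  # positional distance stands in for interval-word count
            score += freq[wi] * freq[wj] / gap
    return score

sent = ["cat", "chased", "mouse"]
freq = {"cat": 3, "chased": 1, "mouse": 2}
pos = {"cat": "NOUN", "chased": "VERB", "mouse": "NOUN"}
# cat-chased contributes 3*1/1, mouse-chased contributes 2*1/1
print(core_word_attribute_score(sent, freq, pos))  # → 5.0
```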
The text content of different topics contains representative words whose frequency within the topic is high. Accordingly, the $n$ words with the highest frequency in a topic's text content data set are counted, and this set of high-frequency words is recorded as the topic high-frequency thesaurus, where the empirical value of $n$ is 500 and the words are ranked within the thesaurus in ascending order of frequency.
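Building the topic high-frequency thesaurus can be sketched as taking the n most frequent words of the topic's whole token stream and sorting them in ascending order of frequency, as the description specifies (n = 500 in the embodiment; 3 here for brevity). The function name is illustrative.

```python
from collections import Counter

# Topic high-frequency thesaurus: the n most frequent words of a topic's
# text content data set, returned in ascending order of frequency.
def high_frequency_thesaurus(all_tokens, n=500):
    top = Counter(all_tokens).most_common(n)
    return [w for w, c in sorted(top, key=lambda wc: wc[1])]

tokens = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
print(high_frequency_thesaurus(tokens, n=3))  # → ['c', 'b', 'a']
```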
The topic core content representative coefficient is constructed as:

$$R_{x,t}=G_{x,t}\cdot\frac{1}{K_{x,t}}\sum_{k=1}^{K_{x,t}}F_{x,t,k}\cdot I_{x,t,k},\qquad I_{x,t,k}=g_{x,t,k}\cdot Q_{x,t,k}$$

where $R_{x,t}$ is the topic core content representative coefficient of the $t$-th text content in topic $x$; $G_{x,t}$ is the number of its words that appear in the topic high-frequency thesaurus; $K_{x,t}$ is its total number of sentences; $F_{x,t,k}$ is the word frequency of the words of its $k$-th sentence; $I_{x,t,k}$ is the $k$-th sentence's importance coefficient; $g_{x,t,k}$ is the number of the $k$-th sentence's words that appear in the topic high-frequency thesaurus, set to 1 if the sentence contains no high-frequency word; and $Q_{x,t,k}$ is the $k$-th sentence's topic core word attribute score.

Formula logic: the more often the words of a sentence of the $t$-th text content in topic $x$ appear in the topic high-frequency thesaurus, the larger $g_{x,t,k}$; combined with a high topic core word attribute score $Q_{x,t,k}$, this yields a high importance coefficient $I_{x,t,k}$. A large $F_{x,t,k}$ indicates vocabulary the $t$-th text of topic $x$ dwells on, so a large $F_{x,t,k}\cdot I_{x,t,k}$ means the $k$-th sentence is close to the content of topic $x$, and a large sum means the whole text is tightly bound to its topic. The larger the topic core content representative coefficient $R_{x,t}$, the better these words serve as representative words of the topic and describe its core content.
In step S001, the BERT word segmentation model converted each word of each sentence into a word vector, i.e. a vector representation of the word; words with similar semantics lie closer together in the vector space, and words with dissimilar semantics lie farther apart.
A topic core content representative coefficient is calculated for each text content of every topic the user frequently browses. If topic recommendation relied only on the keyword similarity between text contents, the recommended topics would become homogeneous: the user would receive overly uniform information while browsing, and personalized recommendation could not be provided. Therefore, to achieve diverse content recommendation, a topic core content association index is calculated by combining the topic core content representative coefficients with the topics' keywords:
$$C_{x,y,t}=\frac{1}{M}\sum_{j=1}^{M}\frac{R_{x,t}\cdot R_{y,t}\cdot W_{x,y}}{\left|F_{x,t,j}-F_{y,t,j}\right|+\varepsilon}$$

where $C_{x,y,t}$ is the topic core content association index between the $t$-th text contents of topics $x$ and $y$; $M$ is the smaller of the two texts' total word counts; $R_{x,t}$ and $R_{y,t}$ are their topic core content representative coefficients; $F_{x,t,j}$ and $F_{y,t,j}$ are the word frequencies of their $j$-th words; and $\varepsilon$ is an adjusting parameter that keeps the denominator from being 0, with checked value 1.

The vocabulary association index $W_{x,y}$ between topics $x$ and $y$ is

$$W_{x,y}=\sum_{a=1}^{A_{x,t}}\sum_{b=1}^{B_{y,t}}\cos\!\left(v_a,u_b\right)$$

where $A_{x,t}$ and $B_{y,t}$ are the total word counts of the $t$-th texts in topics $x$ and $y$, $v_a$ is the word vector of the $a$-th word of topic $x$'s text content data set, $u_b$ is the word vector of the $b$-th word of topic $y$'s text content data set, and $\cos(\cdot,\cdot)$ is the cosine similarity of two word vectors. Cosine similarity is a technique known to those skilled in the art and is not described here.

Formula logic: the higher the semantic similarity between the words of different topics' texts, the larger each $\cos(v_a,u_b)$ and hence the larger the vocabulary association index $W_{x,y}$, indicating closer content between topics $x$ and $y$. The closer the $j$-th word's frequency in the two topics, the smaller $\left|F_{x,t,j}-F_{y,t,j}\right|$, indicating the two texts use more of the same words. The larger the two topic core content representative coefficients, the more faithfully each text describes its own topic, so a large product $R_{x,t}\cdot R_{y,t}$ marks topics that are distinct yet strongly related in content. The higher the resulting topic core content association index, the more likely a user who frequently browses topic $x$ is also interested in topic $y$.
After the association indexes between different social network texts are calculated, the user's behavior in the social network, such as liking, commenting, and sharing, is considered: the more of these operations a user performs on a topic, the greater the user's interest in it. Combining this with the topic core content association index, the recommended-topic interest coincidence degree is constructed:
$$P_{z,t}=\left|\frac{1}{\left|\mathcal{L}\right|}\sum_{L\in\mathcal{L}}\left(C_{L,z,t}\cdot V_L-B_L\right)\right|$$

where $P_{z,t}$ is the user's recommended-topic interest coincidence degree for the $t$-th text content of a not-yet-browsed topic $z$; $\mathcal{L}$ is the set of topics the user has browsed; $C_{L,z,t}$ is the topic core content association index between browsed topic $L$ and un-browsed topic $z$; $B_L$ is the number of times the user browsed topic $L$ in the past month; and $V_L$ is the average change rate of the user's likes, comments, and shares on topic $L$, calculated as

$$V_L=\frac{\left|A_L^{\mathrm{cur}}-A_L^{\mathrm{prev}}\right|}{A_L^{\mathrm{prev}}}$$

with $A_L^{\mathrm{cur}}$ and $A_L^{\mathrm{prev}}$ the total numbers of likes, comments, and shares on topic $L$ in the current month and the previous month respectively.

Formula logic: the larger the topic core content association index $C_{L,z,t}$, the closer the un-browsed topic's content is to what the user often browses; the larger $V_L$, the more frequently the user interacts with the topic and the higher the interest in its content. A large $B_L$, however, indicates the user's recent browsing has been overly uniform and lacks variety. The larger the resulting recommended-topic interest coincidence degree $P_{z,t}$, the more likely topic $z$ is content of interest to the user, and such topics are recommended with priority.
Step S003: personalized recommendation is performed for the user according to the recommended topic interest coincidence degree.
The text contents of topics not yet recommended are processed in the same way as in step S001, and the recommended topic interest coincidence degree of all unrecommended contents is calculated by the above method.
When calculating the recommended topic interest coincidence degree, all topic contents commonly browsed by the user (which may span different topics) are substituted for the parameter L, and all contents of topics not yet recommended are substituted for the parameter z. The parameters L and z are explained in detail in step S002 and are not described again here.
The recommended topic interest coincidence degree is calculated for the text contents of all topics not browsed in the last three days, the contents are sorted in descending order, and the topic contents are intelligently recommended in that order. Each time a recommended topic is browsed, it is substituted into the parameter L as browsed content, and the calculation is updated in real time. The topic recommendation flow chart is shown in fig. 2.
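The ranking and real-time update loop above can be sketched as follows (a hypothetical structure; the coincidence degrees are assumed to be precomputed by the formula of step S002, and all names are illustrative):

```python
def recommend(unbrowsed: dict, browsed: set) -> list:
    """Rank unbrowsed topic contents by interest coincidence, high to low.

    unbrowsed: hypothetical mapping {content_id: coincidence_degree} for the
    contents of topics not yet browsed, taken from the last three days.
    browsed: set of content ids already browsed (the parameter L in the text).
    """
    # Sort content ids by coincidence degree, in descending order.
    ranked = sorted(unbrowsed, key=unbrowsed.get, reverse=True)
    order = []
    for content_id in ranked:
        order.append(content_id)
        # Once recommended and browsed, the content joins the browsed set
        # (parameter L), so later coincidence degrees can be updated.
        browsed.add(content_id)
    return order
```

For instance, with coincidence degrees {"z1": 0.8, "z2": 0.3, "z3": 0.5}, the recommendation order is z1, z3, z2, and each content is added to the browsed set as it is served.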
It should be noted that the order of the embodiments of the present invention is for description only and does not reflect the relative merit of the embodiments; the foregoing describes specific embodiments of this specification. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible and may be advantageous.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it; modifications of the technical solutions described in the foregoing embodiments, or equivalent replacements of some of their technical features, that do not depart in essence from the scope of the technical solutions of the embodiments of the present application, are all included in the protection scope of the present application.

Claims (4)

1. The intelligent topic recommendation method based on the social network is characterized by comprising the following steps of:
acquiring sentences in all text contents in a text content data set of all topics, text vocabulary sequences of all sentences and word vectors of vocabulary in each text vocabulary sequence;
acquiring the part of speech of each word in the text word sequence, and counting the word frequency of all words in each text content; obtaining the topic core word attribute score of each sentence according to the word frequencies of all words in the sentence and the number of interval words between words; acquiring the topic high-frequency word stock of each topic according to the word frequencies of words in the text content data set; acquiring the importance coefficient of each sentence in each text content according to the number of words of the sentence appearing in the topic high-frequency word stock and the topic core word attribute score; obtaining the topic core content representative coefficient of each text content according to the word frequencies of words in its sentences, the number of words of the text content appearing in the topic high-frequency word stock and the importance coefficients of the sentences; acquiring the vocabulary association index between topics according to the cosine similarity of word vectors between words in the topics; obtaining the topic core content association index of each text content between topics according to the topic core content representative coefficients and the vocabulary association index; acquiring the recommended topic interest coincidence degree of the text content according to the topic core content association index and the numbers of praises, comments and forwards of the user on the topics;
calculating the recommended topic interest coincidence degree of the text contents of all unbrowsed topics in the last three days, sorting them in descending order, and intelligently recommending the topic contents in that order;
the obtaining the topic core word attribute score of each sentence comprises the following steps:
wherein F(x,txt,k) denotes the topic core word attribute score of the kth sentence in the txt-th text content in topic x; n denotes the number of nouns in the kth sentence in the txt-th text content in topic x; c_i denotes the ith noun in the sentence; d_g denotes the gth word of a part of speech other than noun in the sentence; W(·) denotes the word frequency of the given word in the txt-th text content within topic x; u(c_i, d_g) denotes the number of interval words between the words c_i and d_g;
the obtaining the important coefficient of each sentence in each text content comprises the following steps:
wherein I(x,txt,k) denotes the importance coefficient of the kth sentence in the txt-th text content within topic x; h(x,txt,k) denotes the number of words in the kth sentence of the txt-th text content in topic x that appear in the topic high-frequency word stock; F(x,txt,k) denotes the topic core word attribute score of the kth sentence in the txt-th text content in topic x;
the obtaining topic core content representative coefficients of each text content includes:
wherein B(x,txt) denotes the topic core content representative coefficient of the txt-th text content in topic x; H(x,txt) denotes the number of words in the txt-th text content in topic x that appear in the topic high-frequency word stock; K denotes the total number of sentences in the txt-th text content within topic x; W(x,txt,k) denotes the word frequency of each word in the kth sentence of the txt-th text content in topic x; I(x,txt,k) denotes the importance coefficient of the kth sentence in the txt-th text content in topic x;
the obtaining topic core content association indexes of each text content among topics comprises the following steps:
wherein G(x,y,txt) denotes the topic core content association index of topic x and topic y for the txt-th text content; M is the minimum of the total number of words in the txt-th text content in topic x and the total number of words in the txt-th text content in topic y; B(x,txt) and B(y,txt) denote the topic core content representative coefficients of the txt-th text content in topic x and in topic y, respectively; W(x,j) and W(y,j) denote the word frequency of the jth word in the txt-th text content in topic x and in topic y, respectively; e is an adjusting parameter; Q(x,y) denotes the vocabulary association index between topic x and topic y;
the acquiring the vocabulary association indexes among topics comprises the following steps:
wherein Q(x,y) denotes the vocabulary association index between topic x and topic y; N_x and N_y denote the total numbers of words in the txt-th text content in topic x and in topic y, respectively; v_a denotes the word vector of the ath word in the text content data set in topic x; v_b denotes the word vector of the bth word in the text content data set in topic y; cos(v_a, v_b) denotes the cosine similarity calculated between the two word vectors;
the obtaining the recommended topic interest coincidence degree of the text content comprises the following steps:
wherein R(z,txt) denotes the recommended topic interest coincidence degree of the user for the txt-th text content in an unbrowsed topic z; G(L,z) denotes the topic core content association index between the txt-th text content of the topic L browsed by the user and the unbrowsed topic z; P(L) denotes the average change rate of the user's praise, comment and forwarding of the content of the browsed topic L over the past month; C(L) denotes the number of times the user browsed topic L in the past month;
the average change rate is calculated as the ratio of the absolute value of the difference between the current month's total number of praises, comments and forwards on the content of the browsed topic L and the previous month's total number of praises, comments and forwards on the content of the browsed topic L, to the previous month's total number of praises, comments and forwards on the content of the browsed topic L.
2. The intelligent recommendation method for topics based on social network as claimed in claim 1, wherein the obtaining the part of speech of each word in the text word sequence comprises:
and marking the parts of speech in each text vocabulary sequence by using a BiLSTM-CRF model to obtain the parts of speech of each vocabulary.
3. The intelligent recommendation method for topics based on social network as claimed in claim 1, wherein the counting word frequencies of all words in each text content comprises:
and carrying out word frequency statistics on the words in the text word sequence by using the N-gram model, and obtaining the word frequency of each word in the text content.
4. The intelligent recommendation method for topics based on social network as claimed in claim 1, wherein the obtaining the topic high-frequency word stock of each topic comprises:
counting the n words with the highest word frequency in the text content data set of each topic, and sorting them in order of word frequency from small to large to form the topic high-frequency word stock;
wherein n is the preset number of high-frequency words.
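As an illustration of the high-frequency word stock construction in claim 4, the lexicon could be built as follows (a sketch under the assumption that the texts are already tokenized; all names are hypothetical):

```python
from collections import Counter

def topic_high_freq_lexicon(texts, n):
    """Take the n highest-frequency words of a topic's text content data set
    and sort them by ascending word frequency, as described in claim 4.

    texts: iterable of tokenized texts (lists of words) for one topic.
    n: preset number of high-frequency words.
    """
    # Word frequencies over the whole text content data set of the topic.
    freq = Counter(word for text in texts for word in text)
    top = freq.most_common(n)           # the n highest-frequency words
    top.sort(key=lambda pair: pair[1])  # reorder by ascending word frequency
    return [word for word, _ in top]

# e.g. topic_high_freq_lexicon([["a", "a", "b", "b", "b", "c"]], 2) -> ["a", "b"]
```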
CN202311809535.9A 2023-12-26 2023-12-26 Topic intelligent recommendation method based on social network Active CN117474703B (en)

Publications (2)

Publication Number Publication Date
CN117474703A CN117474703A (en) 2024-01-30
CN117474703B true CN117474703B (en) 2024-03-26




