CN109885760B - Information tracing method and system based on user interests - Google Patents


Publication number
CN109885760B
Authority
CN
China
Prior art keywords
information
interest
influence
score
microblog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910059484.XA
Other languages
Chinese (zh)
Other versions
CN109885760A (en)
Inventor
陈秀真
杨潇
马进
李生红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Abstract

The invention provides an information tracing method and system based on user interests. Content keywords are extracted from a user's historical microblog information, and their degree of correlation with interest keywords is calculated to evaluate the blogger's degree of interest in each interest classification. From this, the user influence of the blogger's microblog in each interest classification is calculated, giving the blogger's total influence. The degree of interest of each commentator or forwarder in each interest classification is calculated to obtain the total comment and forwarding influence of the commentators or forwarders. A time influence is calculated from the release time of the microblog information. The blogger's total influence, the total comment and forwarding influence of the commentators or forwarders, the time influence and the attention degree are each given a weight to obtain a comprehensive microblog score, and microblogs are ranked and traced according to this score.

Description

Information tracing method and system based on user interests
Technical Field
The invention relates to the technical field of information tracing, in particular to an information tracing method and system based on user interests.
Background
As one of the largest self-media platforms in China, microblogs often spread rumors, sensitive topics and related information. Tracing the source of microblog information is therefore important for maintaining information security, and it is widely applied in public opinion monitoring and social network analysis.
Scholars have already carried out related research on information tracing, for example: tracing microblogs by combining the time, originality and centrality of microblog content with the forwarding relation of the microblog; calculating user influence from information such as the user's number of fans and number of comments, and locating the microblog source with the Hacker News algorithm; constructing a K-tree model to obtain the information propagation path and thereby trace the source; obtaining the microblog source from factors such as the frequency of the blogger, the originality coefficient, the forwarding amount, the appraisal amount and the forwarding relation of the microblog; combining the network propagation model AN with the number of microblog participants and deriving the source recursively with a formula; and applying the longest common subsequence method to microblogs to trace them.
At present, microblog tracing methods mainly rely on microblog text similarity combined with information such as the number of comments, the number of fans and the forwarding relation. They do not consider the influence contributed by the text content of the microblog or by the content of the commentators' comments.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an information tracing method and system based on user interests.
The information tracing method based on the user interest provided by the invention comprises the following steps:
an information collection step: obtaining information to be traced;
an information extraction step: extracting the required information elements from the information to be traced;
a score calculation step: calculating a score from each information element, and tracing according to the scores.
Preferably, the information elements include any one or more of: historical release information of an information owner, comment information, forwarding information, information release time, and the attention amount of the historical release information of the information owner.
Preferably, the score calculation step includes:
an information owner influence score calculation step: obtaining interest categories of the information owner according to the historical release information of the information owner, and calculating the influence score of the information owner according to the interest categories;
a commentator or forwarder influence score calculation step: obtaining interest categories of the commentators or forwarders according to the comment or forwarding information, and calculating the influence score of the commentators or forwarders according to the interest categories;
a time influence score calculation step: calculating a time influence score according to the information release time;
an attention degree score calculation step: calculating an attention degree score according to the attention amount of the historical release information of the information owner.
Preferably, the information owner influence score is calculated by:
BloggerIntInf(i)=α*logβ(Fans+b)*Interest(i)*BlogIntSim(i)
Figure BDA0001953668430000021
wherein BloggerIntInf(i) represents the influence score of the information owner in interest classification i, BloggerInf represents the influence score of the information owner, α is a weight parameter, Fans represents the number of fans, b is a parameter for ensuring that the logarithm is positive, Interest(i) is the degree of interest of the information owner in interest classification i, BlogIntSim(i) is the degree of similarity between the information and interest classification i, and n represents the total number of the information.
Preferably, the commentator or forwarder influence score is calculated by:
ComIntInf(i)=∑SinComIntInf(l);
SinComIntInf(l)=θ*ComInterest(i)*logβ(ComFans+b)*E*L*BlogIntSim(i);
L=logλ(Like+c);
Figure BDA0001953668430000022
wherein ComInf represents the influence score of the commentators or forwarders, SinComIntInf(l) represents the influence of a single forwarder or commentator l in interest classification i, ComIntInf(i) represents the influence of the commentators or forwarders in interest classification i, n represents the total number of interest classifications, θ is a weight parameter, ComInterest(i) is the degree of interest of the forwarder or commentator in interest classification i, BlogIntSim(i) is the degree of similarity between the information and interest classification i, ComFans is the number of fans of the commentator or forwarder, b is a parameter for ensuring that the logarithm is positive, E represents the sentiment of the comment statement, L represents the contribution of the likes on the comment or forward to the influence, Like represents the number of likes on the comment, and λ and c are parameters for ensuring that the logarithm is positive.
Preferably, the degree of similarity of the information to a certain interest classification i is calculated by the following formula:
Figure BDA0001953668430000031
Interest(i)=∑BlogIntSim(i)
wherein BlogIntSim(i) indicates how similar a piece of information is to interest classification i; KeyWord(j) represents the j-th content keyword extracted from the information; m represents the total number of content keywords; KeyWordWeight(j) represents the weight of the j-th content keyword; IntWord(i) represents interest classification i; HowNetDis represents a function for calculating the distance between two words in HowNet; and Interest(i) is the degree of interest of the information owner in interest classification i.
Preferably, the time influence score is calculated by the following formula:
Figure BDA0001953668430000032
Figure BDA0001953668430000033
wherein TimeScore represents the time influence score, T represents the time influence of one microblog, time represents the publishing time of the current microblog, MinTime represents the earliest publishing time among all the microblogs, MaxTime represents the latest publishing time among all the microblogs, Tmax represents the maximum T value calculated over all microblogs to be traced, and e1 is a correction parameter.
Preferably, the attention score is calculated by the following formula:
Figure BDA0001953668430000034
A=∑(Li+Rep+Com)*factor
Figure BDA0001953668430000035
wherein AttraScore represents the attention score; A represents the attention received by the information owner within the set time X, calculated from the numbers of likes, forwards and comments within the set time X; Li, Rep and Com represent the numbers of likes, forwards and comments of one piece of the information owner's information within the set time X; factor is a parameter calculated from the duration of the information; Amax represents the maximum value of A calculated over all the information to be traced; and e2 is a correction parameter.
The invention provides an information tracing system based on user interests, which comprises:
an information collection module: obtaining information to be traced;
the information extraction module: extracting required information elements in the information to be traced;
a calculation scoring module: and performing score calculation according to each information element, and tracing according to the scores.
Preferably, the information element includes any one or more of historical release information of an information owner, comment information, forwarding information, information release time and attention amount of the historical release information of the information owner;
the calculation score module comprises:
an information owner influence score calculation module: obtaining interest categories of the information owners according to historical release information of the information owners, and calculating influence scores of the information owners according to the interest categories;
the comment person or forwarder influence score calculation module: obtaining interest categories of the commentators or the forwarders according to the commentary or forwarding information, and calculating influence scores of the commentators or the forwarders according to the interest categories;
a time influence score calculation module: calculating a time influence score according to the information release time;
an attention score calculation module: the attention degree score is calculated according to the attention amount of the history distribution information of the information owner.
Compared with the prior art, the invention has the following beneficial effects:
In the invention, the influence of the user is calculated according to the interests of the microblog user, and the influence of the commentators and forwarders is calculated according to their interests. These are weighted and summed with the scores of factors such as microblog release time and attention degree to obtain a microblog score, and microblogs are ranked and traced according to that score. Because tracing combines multiple factors, the tracing result is more accurate.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a system framework diagram of the present invention;
FIG. 2 is a comparison graph of the correct number of tracing results in the test example;
FIG. 3 is a chart comparing recall ratios in test examples;
FIG. 4 is a comparison graph of the number of hot microblogs of the tracing result in the test example;
fig. 5 is a score chart of a microblog of a plum fly-away event.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of protection of the present invention.
Information tracing mainly refers to finding the source of a microblog topic, given a corresponding information text, according to the content of the text and related information. With the development of internet technology, many kinds of social media have emerged in competition, and microblogs are currently a highly influential self-media platform, so research on information tracing is important for public opinion management and information security. Most previous research considers only the similarity and forwarding relation of microblog contents and does not consider the influence of the microblog text content. On this basis, an Interest-based Traceability Method (ITM) based on user interests is proposed. The interest of a blogger is calculated from the content of the blogger's previous microblogs, and the influence of the blogger is then calculated from that interest; the interests of the microblog's commentators and forwarders are calculated in the same way to obtain their influence. Finally, these are weighted and summed with the scores of factors such as time and attention to obtain a microblog score, which is used for ranking and tracing.
The information tracing method based on the user interest provided by the invention comprises the following steps:
an information collection step: obtaining information to be traced;
an information extraction step: extracting the required information elements from the information to be traced;
a score calculation step: calculating a score from each information element, and tracing according to the scores.
Specifically, the information elements include any one or more of: historical release information of an information owner, comment or forwarding information, information release time, and the attention amount of the historical release information of the information owner.
Specifically, the score calculation step includes:
an information owner influence score calculation step: obtaining interest categories of the information owner according to the historical release information of the information owner, and calculating the influence score of the information owner according to the interest categories;
a commentator or forwarder influence score calculation step: obtaining interest categories of the commentators or forwarders according to the comment or forwarding information, and calculating the influence score of the commentators or forwarders according to the interest categories;
a time influence score calculation step: calculating a time influence score according to the information release time;
an attention degree score calculation step: calculating an attention degree score according to the attention amount of the historical release information of the information owner.
Specifically, the information-owner influence score is calculated by the following formula:
BloggerIntInf(i)=α*logβ(Fans+b)*Interest(i)*BlogIntSim(i)
Figure BDA0001953668430000051
wherein BloggerIntInf(i) represents the influence score of the information owner in interest classification i, BloggerInf represents the influence score of the information owner, α is a weight parameter, Fans represents the number of fans, b is a parameter for ensuring that the logarithm is positive, Interest(i) is the degree of interest of the information owner in interest classification i, BlogIntSim(i) is the degree of similarity between the information and interest classification i, and n represents the total number of the information.
Specifically, the critic or forwarder influence score is calculated by the following formula:
ComIntInf(i)=∑SinComIntInf(l);
SinComIntInf(l)=θ*ComInterest(i)*logβ(ComFans+b)*E*L*BlogIntSim(i);
L=logλ(Like+c);
Figure BDA0001953668430000061
wherein ComInf represents the influence score of the commentators or forwarders, SinComIntInf(l) represents the influence of a single forwarder or commentator l in interest classification i, ComIntInf(i) represents the influence of the commentators or forwarders in interest classification i, n represents the total number of interest classifications, θ is a weight parameter, ComInterest(i) is the degree of interest of the forwarder or commentator in interest classification i, BlogIntSim(i) is the degree of similarity between the microblog information and interest classification i, ComFans is the number of fans of the commentator or forwarder, b is a parameter for ensuring that the logarithm is positive, E represents the sentiment of the comment statement, L represents the contribution of the likes on the comment or forward to the influence, Like represents the number of likes on the comment, and λ and c are parameters for ensuring that the logarithm is positive.
Specifically, the degree of correlation of the information with a certain interest classification is calculated by the following formula:
Figure BDA0001953668430000062
Interest(i)=∑BlogIntSim(i)
wherein BlogIntSim(i) indicates how relevant a piece of information is to interest classification i; KeyWord(j) represents the j-th content keyword extracted from the information; m represents the total number of content keywords; KeyWordWeight(j) represents the weight of the j-th content keyword; IntWord(i) represents interest classification i; HowNetDis represents a function for calculating the distance between two words in HowNet; and Interest(i) is the degree of interest of the information owner in interest classification i.
Specifically, the time influence score is calculated by the following formula:
Figure BDA0001953668430000063
Figure BDA0001953668430000064
wherein TimeScore represents the time influence score, T represents the time influence of one microblog, time represents the publishing time of the current microblog, MinTime represents the earliest publishing time among all the microblogs, MaxTime represents the latest publishing time among all the microblogs, Tmax represents the maximum T value calculated over all microblogs to be traced, and e1 is a correction parameter.
Specifically, the attention score is calculated by the following formula:
Figure BDA0001953668430000065
A=∑(Li+Rep+Com)*factor
Figure BDA0001953668430000071
wherein AttraScore represents the attention score; A represents the attention received by the information owner within the set time X, calculated from the numbers of likes, forwards and comments within the set time X; Li, Rep and Com represent the numbers of likes, forwards and comments of one piece of the information owner's information within the set time X; factor is a parameter calculated from the duration of the information; Amax represents the maximum value of A calculated over all the information to be traced; and e2 is a correction parameter.
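The raw attention value A can be illustrated with a minimal Python sketch. It covers only the plain-text formula A=∑(Li+Rep+Com)*factor; the AttraScore normalization and the way factor is derived from the duration are given only in equation images and are not attempted here. Applying factor per piece of information inside the sum is an assumption, as are the function name and the tuple layout.

```python
from typing import Iterable, Tuple

# One tuple per piece of information posted within the set time X:
# (Li, Rep, Com, factor). The tuple layout is hypothetical.
Post = Tuple[int, int, int, float]


def attention(posts: Iterable[Post]) -> float:
    """A = sum((Li + Rep + Com) * factor), with factor applied per post (assumed)."""
    return sum((li + rep + com) * factor for li, rep, com, factor in posts)


recent_posts = [(10, 5, 5, 0.5), (1, 1, 1, 1.0)]
a_value = attention(recent_posts)
```

Here factor is simply supplied by the caller, since its exact dependence on the duration of the information is not recoverable from the text.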
The invention provides an information tracing system based on user interests, which comprises:
an information collection module: obtaining information to be traced;
the information extraction module: extracting required information elements in the information to be traced;
a calculation scoring module: and performing score calculation according to each information element, and tracing according to the scores.
Preferably, the information elements comprise historical release information of an information owner, comment information, forwarding information, information release time and attention amount of the historical release information of the information owner;
the calculation score module comprises:
an information owner influence score calculation module: obtaining interest categories of the information owners according to historical release information of the information owners, and calculating influence scores of the information owners according to the interest categories;
the comment person or forwarder influence score calculation module: obtaining interest categories of the commentators or the forwarders according to the commentary or forwarding information, and calculating influence scores of the commentators or the forwarders according to the interest categories;
a time influence score calculation module: calculating a time influence score according to the information release time;
an attention score calculation module: the attention degree score is calculated according to the attention amount of the history distribution information of the information owner.
The information tracing system based on the user interests can be realized through the step flow of the information tracing method based on the user interests. Those skilled in the art can understand the information tracing method based on user interests as a preferred example of the information tracing system based on user interests.
In a specific implementation, the method is applied to microblog tracing. Content keywords are extracted from the blogger's historical microblog information, and their degree of correlation with interest keywords is calculated to evaluate the blogger's degree of interest in each interest classification. From this, the user influence of the blogger's microblog in each interest classification is calculated, giving the blogger's total influence. The degree of interest of each commentator or forwarder in each interest classification is calculated to obtain the total comment and forwarding influence of the commentators or forwarders. A time influence is calculated from the release time of the microblog information. The blogger's total influence, the total comment and forwarding influence of the commentators or forwarders, the time influence and the attention degree are each given a weight to obtain a comprehensive microblog score, and microblogs are ranked and traced according to this score.
In the user interest calculation, based on analysis of microblog data, user interests are divided into interest categories, each distinguished by an interest keyword. Content keywords are extracted from the user's historical microblog information, and their degree of correlation with the interest keywords is calculated to obtain the degree of correlation between the user's microblog information and a given interest category. The user's degree of interest in that category is then evaluated from the degree of correlation between the microblog information and the category;
in the user influence calculation, the user influence of the user's microblog in a given interest classification is calculated from the user's degree of interest in that classification and the user's number of fans, to obtain the user's total influence;
in the commentator influence calculation, the influence of a commentator or forwarder in a given interest classification is calculated from the commentator's or forwarder's degree of interest in that classification, the number of fans of the commentator or forwarder and the sentiment of the comment content, and the total comment and forwarding influence of the commentators or forwarders is then obtained;
in the time factor calculation, the time influence is calculated according to the release time of the microblog information.
After these calculations, corresponding weights are given to the time influence, the attention degree, the total comment and forwarding influence of the commentators or forwarders, and the user's total influence to obtain a comprehensive microblog score, and microblog tracing is carried out according to this score.
Specifically, the microblog comprehensive score ITM is calculated by the following formula:
ITM=k1*TimeScore+k2*AttraScore+k3*BloggerScore+k4*ComScore
wherein TimeScore represents the time influence score of the microblog, AttraScore represents the attention score of the microblog, BloggerScore represents the total influence score of the user, ComScore represents the total influence score of the commentators or forwarders, and k1, k2, k3 and k4 represent the respective scoring weights.
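The weighted sum and the ranking step can be sketched in Python as follows. This is only an illustration under stated assumptions: the four component scores are taken as already computed, the equal weights of 0.25 are placeholders (the text does not give values for k1 to k4 here), and the names `itm_score` and `rank_by_itm` are hypothetical.

```python
from typing import Dict, List

# Placeholder weights; the text does not specify k1..k4 at this point.
WEIGHTS = {"k1": 0.25, "k2": 0.25, "k3": 0.25, "k4": 0.25}


def itm_score(s: Dict[str, float], k: Dict[str, float] = WEIGHTS) -> float:
    """ITM = k1*TimeScore + k2*AttraScore + k3*BloggerScore + k4*ComScore."""
    return (k["k1"] * s["TimeScore"] + k["k2"] * s["AttraScore"]
            + k["k3"] * s["BloggerScore"] + k["k4"] * s["ComScore"])


def rank_by_itm(posts: Dict[str, Dict[str, float]]) -> List[str]:
    """Return microblog ids ordered from highest to lowest ITM score."""
    return sorted(posts, key=lambda pid: itm_score(posts[pid]), reverse=True)


posts = {
    "a": {"TimeScore": 0.9, "AttraScore": 0.2, "BloggerScore": 0.1, "ComScore": 0.0},
    "b": {"TimeScore": 0.4, "AttraScore": 0.8, "BloggerScore": 0.6, "ComScore": 0.6},
}
ranking = rank_by_itm(posts)
```

In this toy data, post "a" is earlier in time but "b" ranks first, which mirrors the point that the source is not determined by time alone.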
In the system framework diagram shown in fig. 1, the invention relates to user interest calculation, blogger influence calculation, commentator influence calculation and attention calculation.
In the aspect of user influence calculation based on the interest, the user interest calculation and the user influence calculation are included.
For user interest calculation, the behavior of a microblog user reflects some of the user's interests. For example, if a user likes entertainment content, the microblogs the user publishes, comments on and forwards all tend toward entertainment information. On this basis, the user's interest tendency can be analyzed from the user's previous microblog information. Through analysis of microblog data, user interests can be classified into the following categories: entertainment, economy, science and education, politics, and military. In the method, keywords are extracted from the user's previous blog information, and the sum of the distances in HowNet between these keywords and the interest keyword is calculated to obtain the degree of correlation between the microblog topic and the interest.
The calculation formula of the degree of relevance of a certain piece of blog information of a user to a certain interest i is as follows:
Figure BDA0001953668430000091
wherein BlogIntSim(i) indicates how relevant a piece of the user's blog information is to interest i (one of entertainment, economy, science and education, politics, and military). KeyWord(j) represents a keyword extracted from the blog (m keywords in total), KeyWordWeight(j) represents the weight of that keyword, and IntWord(i) represents interest i. The HowNetDis function calculates the distance between two words in HowNet. HowNet describes words by senses and sememes: a sense is one description of a word, a word may have several senses, and a sememe is the basic unit used to describe a sense. The basic structure of a sense is as follows:
Figure BDA0001953668430000092
The distance between sememes is calculated by taking into account the influence of the depth and density of the sememe hierarchy tree on the sememe weights; the distance between senses is then calculated from the sememe distances, and finally the distance between words is calculated from the distances between senses.
The sum, over a user's blog posts, of each post's similarity to interest i can be used to evaluate how interested the user is in that topic.
Interest(i)=∑BlogIntSim(i) (2)
Interest(i) is the user's degree of interest in interest i, and BlogIntSim(i) is the degree of correlation between one piece of the user's previous blog information and interest i, calculated by formula 1.
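The interest calculation can be sketched in Python as below. Formula 1 survives here only as an image placeholder, so the weighted-sum form used for `blog_int_sim` is an assumption based on the surrounding description (a sum over m keywords, each with a weight and a HowNetDis value). The `toy_score` table is a hypothetical stand-in for HowNetDis; a real system would query HowNet instead.

```python
from typing import Callable, List, Tuple

Keyword = Tuple[str, float]              # (KeyWord(j), KeyWordWeight(j))
WordScore = Callable[[str, str], float]  # stand-in for HowNetDis


def blog_int_sim(keywords: List[Keyword], int_word: str, dis: WordScore) -> float:
    """Assumed form of formula 1: sum_j KeyWordWeight(j) * HowNetDis(KeyWord(j), IntWord(i))."""
    return sum(weight * dis(word, int_word) for word, weight in keywords)


def interest(history: List[List[Keyword]], int_word: str, dis: WordScore) -> float:
    """Formula 2: Interest(i) is the sum of BlogIntSim(i) over the user's previous posts."""
    return sum(blog_int_sim(post, int_word, dis) for post in history)


def toy_score(word: str, int_word: str) -> float:
    # Hypothetical values for illustration only; not HowNet output.
    table = {("film", "entertainment"): 0.8, ("stock", "economy"): 0.9}
    return table.get((word, int_word), 0.1)


history = [[("film", 0.6), ("stock", 0.4)], [("film", 1.0)]]
entertainment_interest = interest(history, "entertainment", toy_score)
```

Passing the word-scoring function as a parameter keeps the sketch independent of any particular HowNet implementation.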
For user influence calculation, the microblog content a user posts reflects the user's interests, and the user's fans generally share similar interests. Therefore, the more similar a blogger's microblog is to the blogger's interests, the greater its influence on fans and the public. The influence of a microblog also grows with the blogger's number of fans, but once the number of fans exceeds a certain order of magnitude (for example, tens of millions), additional fans add little further influence.
With reference to formula 2, the calculation formula of the influence of the microblog of the user on the interest i is as follows:
BloggerIntInf(i)=α*logβ(Fans+b)*Interest(i)*BlogIntSim(i) (3)
α is a weight parameter, Fans represents the number of fans, and Interest(i) is the blogger's degree of interest in interest i, calculated by formula 2. BlogIntSim(i) is the degree of similarity between the microblog content and interest i, calculated by formula 1. A logarithmic function is used for the influence of the number of fans, and the parameter b ensures that the logarithm is positive. Both β and b are set to 1.1, a value judged from plots to fit practice better.
The sum of the influence of the user under each interest is the total influence of the user, and the calculation formula is as follows:
BloggerInf=∑BloggerIntInf(i) (4)
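As a rough illustration of formula 3 and of the summation over interests described above, the following Python sketch computes the per-interest and total blogger influence. β and b default to 1.1 as stated in the text; the default α of 1.0 and the dictionary-based inputs are assumptions made for the example.

```python
import math
from typing import Dict


def blogger_int_inf(fans: int, interest_i: float, sim_i: float,
                    alpha: float = 1.0, beta: float = 1.1, b: float = 1.1) -> float:
    """Formula 3: alpha * log_beta(Fans + b) * Interest(i) * BlogIntSim(i)."""
    return alpha * math.log(fans + b, beta) * interest_i * sim_i


def blogger_inf(fans: int, interest: Dict[str, float], sim: Dict[str, float]) -> float:
    """Total influence: the sum of BloggerIntInf(i) over the interest classes."""
    return sum(blogger_int_inf(fans, interest[c], sim[c]) for c in interest)


interest = {"entertainment": 2.0, "economy": 0.5}  # Interest(i), hypothetical values
sim = {"entertainment": 0.5, "economy": 0.2}       # BlogIntSim(i), hypothetical values
total = blogger_inf(100_000, interest, sim)
```

With b = β = 1.1, a blogger with zero fans gets a log term of exactly 1, so the logarithm never drives the score to zero or below, and growth flattens as the fan count becomes very large.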
For the influence calculation of commentators and forwarders, a user can post a comment under a microblog, the comment can be liked by others, and information about the comment is pushed to the commentator's fans. A commentator therefore contributes some influence to the microblog commented on. A commentator's fans share similar interests with the commentator, so the influence is greater when the commentator's interests are similar to the topic of the microblog. Likewise, a forward can carry a comment and contributes some influence. All commentators and forwarders of a microblog contribute influence to it, so the calculation formula of the influence of commentators and forwarders on interest i is as follows:
ComIntInf(i)=∑SinComIntInf(l) (5)
In the formula, SinComIntInf(l) is the influence of a single forwarder or commentator l on interest i. It depends on the forwarder's or commentator's degree of interest in interest i and on the similarity between the microblog content and interest i. The more fans the forwarder or commentator has, the greater the influence, but once the number of fans exceeds a certain order of magnitude, additional fans add little further influence. The number of likes on the comment and the sentiment of the comment content also affect the influence: the more likes, the greater the influence; comment content with positive sentiment contributes more influence, and content with negative sentiment contributes less. The calculation formula is as follows:
SinComIntInf(l)=θ*ComInterest(i)*log_β(ComFans+b)*E*L*BlogIntSim(i) (6)
where θ is a weight parameter; ComInterest(i) is the forwarder's or commentator's degree of interest in interest i, calculated by formula 2; BlogIntSim(i) is the similarity between the microblog content and interest i, calculated by formula 1; and ComFans is the number of fans of the commentator or forwarder. E is the sentiment of the comment: 0 for negative, 0.5 for neutral and 1 for positive. L is the contribution to influence of the likes on the comment or forward, calculated as:
L=log_λ(Like+c) (7)
Like in the formula is the number of likes. Influence grows with the number of likes, but once the number is large enough, additional likes add little. The parameter c ensures that the logarithm is positive, and λ and c are both set to 1.4, a value judged from plots to match practice. The total influence of commentators and forwarders is the sum of their influence under each interest i:
CommentInf=∑ComIntInf(i) (8)
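Formulas 5 to 7 can be sketched together in Python. This is a minimal sketch under stated assumptions: the function names and the dictionary keys are hypothetical, θ defaults to 1.0 because no value is given, β and b reuse the 1.1 stated for the blogger formula, and λ and c are both read as 1.4 from the description.

```python
import math


def like_contribution(likes, lam=1.4, c=1.4):
    """Formula 7: L = log_lambda(Like + c)."""
    return math.log(likes + c, lam)


def sin_com_int_inf(com_interest, com_fans, sentiment, likes, blog_int_sim,
                    theta=1.0, beta=1.1, b=1.1):
    """Formula 6: influence of one commentator or forwarder under interest i.

    sentiment is E: 0 (negative), 0.5 (neutral) or 1 (positive).
    theta=1.0 is an assumed placeholder value.
    """
    return (theta * com_interest * math.log(com_fans + b, beta)
            * sentiment * like_contribution(likes) * blog_int_sim)


def com_int_inf(participants, blog_int_sim):
    """Formula 5: sum over all commentators and forwarders of one microblog."""
    return sum(
        sin_com_int_inf(p["interest"], p["fans"], p["sentiment"],
                        p["likes"], blog_int_sim)
        for p in participants
    )
```

Note that with E = 0 a negative comment contributes nothing at all under this formula, since E enters as a multiplicative factor.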
The comprehensive microblog score is calculated as follows. Tracing the source of a microblog topic means finding the source of the information, according to some index, in a given set of microblogs related to an event. The source is not necessarily the earliest microblog, because a microblog may attract attention only after being forwarded by highly influential users. In addition, if a blogger's other recent microblogs received much attention, this adds some influence to the current microblog. The blogger's influence, the influence of commentators and forwarders, time, and attention are therefore all considered when identifying the source.
The invention considers time, attention, blogger influence, and the influence of commentators and forwarders together, and traces microblogs by calculating a score. The score is calculated as:
ITM=k1*TimeScore+k2*AttraScore+k3*BloggerScore+k4*ComScore (9)
where TimeScore is the time score of the microblog, AttraScore is its attention score, BloggerScore is the blogger influence score, ComScore is the influence score of commentators and forwarders, and k1, k2, k3 and k4 are the weights of the respective scores.
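Formula 9 and the ranking step it feeds can be sketched as follows. The function names and dictionary keys are hypothetical; the default weights are the AHP values the description reports for k1 to k4, and any other weights can be passed in.

```python
def itm_score(time_score, attra_score, blogger_score, com_score,
              weights=(0.148, 0.163, 0.363, 0.326)):
    """Formula 9: weighted sum of the four normalized scores."""
    k1, k2, k3, k4 = weights
    return (k1 * time_score + k2 * attra_score
            + k3 * blogger_score + k4 * com_score)


def rank_by_itm(microblogs):
    """Return microblogs sorted by ITM score, highest first.

    Each microblog is assumed to be a dict holding its four scores
    under the (hypothetical) keys 'time', 'attra', 'blogger' and 'com'.
    """
    return sorted(
        microblogs,
        key=lambda m: itm_score(m["time"], m["attra"], m["blogger"], m["com"]),
        reverse=True,
    )
```

With these weights the two influence scores together account for about 69% of the total, so an early microblog from a low-influence account can rank below a later one that influential users engaged with.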
The comprehensive score calculation has five parts: time score, attention score, user influence score, commentator and forwarder influence score, and score weights.
The time score TimeScore is calculated from the time influence; the earlier a microblog is published, the greater its time influence. The time influence is calculated as:
Figure BDA0001953668430000111
T is the time influence of a microblog, time is the publishing time of the current microblog, MinTime is the earliest publishing time among all the microblogs, and MaxTime is the latest. The time score is obtained by normalizing the time influence. Combining the characteristics of decimal-scaling normalization and hyperbolic-function normalization, the normalization function used in the method is:
Figure BDA0001953668430000112
where N' is the normalized value, N is the value before normalization, Nmax is the maximum of the values being normalized, and K is a correction parameter. The formula normalizes each value with Nmax as the reference. K is small, so its effect is negligible as long as N is not small; when N is small, K makes the normalized value very small. When all values of N are very small, the function keeps all values of N' very small as well, which avoids the situation in which a plain ratio would push N' close to 1. Based on formula 11, the time score is calculated as:
Figure BDA0001953668430000113
T is the time influence of a microblog, and Tmax is the maximum T calculated over all the microblogs being traced. e1 is a correction parameter: when T is very small for all the microblogs being traced, it keeps all their time scores low, avoiding the high scores that some microblogs would receive if a plain ratio were used. e1 is set to the mean of T over all collected microblogs:
Figure BDA0001953668430000121
The calculation formula for the attention score based on formula 11 is as follows:
Figure BDA0001953668430000122
A is the attention the blogger received in the most recent month, calculated from the numbers of likes, forwards and comments on the blogger's microblogs in that month. Newer microblogs carry more weight. The calculation formula is:
A=∑(Li+Rep+Com)*factor (14)
In the formula, Li, Rep and Com are the numbers of likes, forwards and comments on one of the blogger's microblogs from the most recent month, and factor is a parameter calculated from the duration of the microblog, as follows:
Figure BDA0001953668430000123
Amax is the maximum A calculated over all the microblogs being traced, and e2 is a correction parameter that plays the same role as e1 in formula 12. e2 is set to the mean of A over all collected microblogs:
Figure BDA0001953668430000124
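The attention value A of formula 14 can be sketched as below. The formula for factor is not reproduced in this text, so the sketch takes it as a caller-supplied function; the linear decay shown is purely a hypothetical stand-in that only preserves the stated property that newer microblogs weigh more. All names are illustrative.

```python
def attention(posts, factor):
    """Formula 14: A = sum((Li + Rep + Com) * factor) over recent microblogs.

    posts: iterable of dicts with (hypothetical) keys 'likes', 'reposts',
           'comments' and 'age_days'.
    factor: function mapping a microblog's age in days to its weight.
    """
    return sum(
        (p["likes"] + p["reposts"] + p["comments"]) * factor(p["age_days"])
        for p in posts
    )


def linear_decay(age_days, window_days=30):
    """HYPOTHETICAL factor: 1 for a brand-new post, 0 at the window edge.

    This is not the patent's formula 15; it is a placeholder for illustration.
    """
    return max(0.0, 1.0 - age_days / window_days)
```

For example, a new microblog and a 15-day-old microblog with 20 interactions each would contribute 20 and 10 respectively under this placeholder decay.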
The user influence score is calculated based on equation 11 as follows:
Figure BDA0001953668430000125
Based on formula 11, the influence score of commentators and forwarders is calculated as:
Figure BDA0001953668430000126
CommentInf is the influence of commentators and forwarders, calculated by formula 8. CommentInfmax is the maximum of that influence over all the microblogs being traced, and e4 is a correction parameter that plays the same role as e1 in formula 12. e4 is set to the mean of CommentInf over all collected microblogs:
Figure BDA0001953668430000127
BloggerInf is the user influence, calculated by formula 4. BloggerInfmax is the maximum user influence over all the microblogs being traced, and e3 is a correction parameter that plays the same role as e1 in formula 12. e3 is set to the mean of BloggerInf over all collected microblogs:
Figure BDA0001953668430000128
The weights k1, k2, k3 and k4 of the scores can be derived with AHP (the analytic hierarchy process). AHP decomposes the elements relevant to a decision into levels such as goals, criteria and alternatives, and performs qualitative and quantitative analysis on that basis.
AHP compares the parameters pairwise by relative importance to determine the weight of each parameter relative to another, and builds a judgment matrix from these comparisons. The eigenvector corresponding to the largest eigenvalue is then computed, and its components give the weights of the parameters. Finally, a consistency ratio is computed from that eigenvalue of the judgment matrix as a consistency check; if the ratio is below a threshold, the computed weights are considered reasonable.
The judgment matrix, obtained through expert judgment combining the various factors, is shown in Table 1:
Table 1
Figure BDA0001953668430000131
Score weight calculation: computing the eigenvector for the largest eigenvalue gives the weights k1=0.148, k2=0.163, k3=0.363, k4=0.326.
The consistency ratio is calculated to be 0.8%, below the 10% threshold, so the calculated weights are reasonable.
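The AHP procedure described above (principal eigenvector, then consistency ratio) can be sketched in plain Python with power iteration. The judgment matrix of Table 1 is not reproduced in this text, so the usage note below refers to a made-up, perfectly consistent matrix. The function names are hypothetical; the random-index values are Saaty's commonly tabulated ones.

```python
# Saaty's random consistency index for small matrix orders.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix, iterations=100):
    """Approximate the principal eigenvector and eigenvalue by power iteration.

    Returns (weights normalized to sum to 1, estimate of lambda_max).
    """
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        w = [x / total for x in v]
    # lambda_max estimated as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    return w, lambda_max


def consistency_ratio(lambda_max, n):
    """CR = CI / RI, with CI = (lambda_max - n) / (n - 1)."""
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]
```

For a perfectly consistent matrix such as [[1, 1, 0.5, 0.5], [1, 1, 0.5, 0.5], [2, 2, 1, 1], [2, 2, 1, 1]], the largest eigenvalue equals the matrix order and the consistency ratio is zero; a ratio under 0.1 is the usual acceptance threshold, matching the 10% used in the description.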
The method was tested on real data as follows. Microblogs published from September 11 to October 12 were collected from a microblog platform, together with the related blogger and commentator information, 12110 records in total. They cover five events, including 'plum flying off duty', 'crow going off' and 'united states restart monthly plan', and were used to test the ITM algorithm. The OR algorithm, which is based on text centrality, was run for comparison. The sources of the microblogs were marked manually, and the two algorithms were compared by the number of sources each found correctly. The number of hot microblogs under the corresponding topic among each algorithm's tracing results was also checked as a further test of accuracy. The test results are shown in Table 2.
Table 2
Figure BDA0001953668430000132
Figure BDA0001953668430000141
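The comparison criterion, the number of manually marked sources that an algorithm recovers, can be sketched as a simple set overlap. This is an assumed reading of the evaluation: the description does not say how many top-ranked results are inspected, so the cutoff k and the function name are hypothetical.

```python
def correct_sources(ranked_ids, labelled_sources, k):
    """Count how many of the top-k traced microblogs are marked sources.

    ranked_ids: microblog ids ordered by an algorithm's score, best first.
    labelled_sources: ids marked manually as sources.
    k: assumed cutoff on the ranked list.
    """
    return len(set(ranked_ids[:k]) & set(labelled_sources))
```

The same function can be applied to the outputs of both algorithms on one event so that their counts are directly comparable.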
As can be seen from Fig. 2, Fig. 3 and Table 2, the ITM algorithm provided by the invention outperforms the OR algorithm in the number of source microblogs correctly identified. For events such as 'plum flying departure' and 'national star naming', the audience has a clear interest tendency, so the positioning is accurate. For the event 'restarting and logging in the united states', a few highly influential microblogs among the traced microblogs received low scores, so the accuracy of the ITM algorithm is lower than that of the OR algorithm.
As can be seen from Fig. 4, the algorithm provided by the invention is more accurate in the number of hot microblogs, because the OR algorithm considers only text similarity and centrality and ignores the influence of a microblog on the topic.
For the 'plum flying leave' event, the tracing result of the ITM algorithm and the basic information of the microblogs are as follows (for ease of analysis, the microblog scores are converted to a 100-point scale):
Table 3
Figure BDA0001953668430000142
Figure BDA0001953668430000151
As can be seen from Fig. 5, microblogs from accounts such as Daily Economic News and Sina Technology score highly. According to Table 3, the high-scoring microblogs were published earlier, received relatively many likes, comments and forwards, and come from bloggers with many fans. Comparison with the manually marked sources and with the hot microblogs of the topic shows that the tracing result is fairly accurate. Fig. 5 and Table 3 also show that the microblog published by the Panoramic Network account attracted much attention but does not score highly because it was published too late, which matches the actual situation. The microblog published by China Women's News appeared earlier, but it received little attention and the account has few fans, so its influence is small, which also matches the actual situation.
Those skilled in the art will appreciate that, besides implementing the system, the apparatus and their modules provided by the invention purely as computer-readable program code, the method steps can be logically programmed so that the system, the apparatus and their modules realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. The system, the apparatus and their modules can therefore be regarded as hardware components; the modules they include for implementing various functions can be regarded as structures within those hardware components; and modules for performing various functions can be regarded both as software programs implementing the method and as structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (6)

1. An information tracing method based on user interests is characterized by comprising the following steps:
an information collection step: obtaining information to be traced;
an information extraction step: extracting the required information elements from the information to be traced;
a score calculation step: calculating an information score from the information elements, and tracing according to the information score;
the information elements comprise any one or more of historical release information of an information owner, comment information, forwarding information, information release time and the attention amount of the historical release information of the information owner;
the score calculation step includes:
an information owner influence score calculation step: obtaining interest categories of the information owners according to historical release information of the information owners, and calculating influence scores of the information owners according to the interest categories;
a commentator or forwarder influence score calculation step: obtaining interest categories of the commentators or forwarders according to the comment or forwarding information, and calculating influence scores of the commentators or forwarders according to the interest categories;
a time influence score calculation step: calculating a time influence score according to the information release time;
an attention score calculation step: calculating an attention score according to the attention amount of the historical release information of the information owner;
the information owner influence score is calculated by:
BloggerIntInf(i)=α*log_β(Fans+b)*Interest(i)*BlogIntSim(i)
Figure FDA0002680472210000011
wherein BloggerIntInf(i) represents the influence score of the information owner in interest classification i, BloggerInf represents the influence score of the information owner, α is a weight parameter, Fans represents the number of fans, b is a parameter for ensuring that the logarithm calculation result is a positive value, Interest(i) is the degree of interest of the information owner in interest classification i, BlogIntSim(i) is the degree of similarity between the information and interest classification i, and n represents the total number of the information.
2. The information tracing method based on user interest according to claim 1, wherein the commentator or forwarder influence score is calculated by the following formula:
ComIntInf(i)=∑SinComIntInf(l);
SinComIntInf(l)=θ*ComInterest(i)*log_β(ComFans+b)*E*L*BlogIntSim(i);
L=log_λ(Like+c);
Figure FDA0002680472210000012
wherein CommentInf represents the influence score of a commentator or forwarder, SinComIntInf(l) represents the influence of a single forwarder or commentator l under interest classification i, ComIntInf(i) represents the influence of commentators or forwarders under interest classification i, n represents the total number of interest classifications, θ is a weight parameter, ComInterest(i) is the degree of interest of the forwarder or commentator in interest classification i, BlogIntSim(i) is the degree of similarity between the information and interest classification i, ComFans is the number of fans of the commentator or forwarder, b is a parameter for ensuring that the logarithm calculation result is a positive value, E represents the sentiment of the comment statement, L represents the contribution to influence of the likes on the comment or forward, Like represents the number of likes on the comment, and λ and c are parameters for ensuring that the logarithm calculation result is a positive value.
3. The method for tracing information based on user interests according to any one of claims 1-2, wherein the similarity degree of the information and the interest classification i is calculated by the following formula:
Figure FDA0002680472210000021
Interest(i)=ΣBlogIntSim(i)
wherein BlogIntSim(i) represents the degree of similarity between a piece of information and interest classification i; KeyWord(j) represents the j-th content keyword extracted from the information; m represents the total number of content keywords; KeyWordWeight(j) represents the weight of the j-th content keyword; IntWord(f) represents interest classification i; HowNetDis represents a function that calculates the distance between two words in the HowNet lexicon; and Interest(i) is the degree of interest of the information owner in interest classification i.
4. The method of claim 1, wherein the time influence score is calculated by the following formula:
Figure FDA0002680472210000022
Figure FDA0002680472210000023
wherein TimeScore represents the time influence score, T represents the time influence of one microblog, time represents the publishing time of the current microblog, MinTime represents the earliest publishing time of all the microblogs, MaxTime represents the latest publishing time of all the microblogs, Tmax represents the maximum T value calculated over all the microblogs being traced, and e1 is a correction parameter.
5. The method of claim 1, wherein the attention score is calculated by the following formula:
Figure FDA0002680472210000024
A=∑(Li+Rep+Com)*factor
Figure FDA0002680472210000031
wherein AttraScore represents the attention score; A represents the attention of the information owner within the set time X, calculated from the numbers of likes, forwards and comments within the set time X; Li, Rep and Com represent the numbers of likes, forwards and comments on one piece of the information owner's information within the set time X; factor is a parameter calculated from the duration of the information; Amax represents the maximum value of A calculated over all the information to be traced; and e2 is a correction parameter.
6. The method of claim 1, wherein the information score is calculated by the following formula:
ITM=k1*TimeScore+k2*AttraScore+k3*BloggerScore+k4*ComScore
wherein ITM represents the information score, TimeScore represents the time influence score, AttraScore represents the attention score, BloggerScore represents the information owner influence score, ComScore represents the commentator or forwarder total influence score, and k1, k2, k3 and k4 represent the respective score weights.
CN201910059484.XA 2019-01-22 2019-01-22 Information tracing method and system based on user interests Expired - Fee Related CN109885760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910059484.XA CN109885760B (en) 2019-01-22 2019-01-22 Information tracing method and system based on user interests

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910059484.XA CN109885760B (en) 2019-01-22 2019-01-22 Information tracing method and system based on user interests

Publications (2)

Publication Number Publication Date
CN109885760A CN109885760A (en) 2019-06-14
CN109885760B true CN109885760B (en) 2020-12-29

Family

ID=66926536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910059484.XA Expired - Fee Related CN109885760B (en) 2019-01-22 2019-01-22 Information tracing method and system based on user interests

Country Status (1)

Country Link
CN (1) CN109885760B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061983B (en) * 2019-12-17 2024-01-09 上海冠勇信息科技有限公司 Evaluation method of infringement data grabbing priority and network monitoring system thereof
CN115511511B (en) * 2022-11-23 2023-03-24 成都银光软件有限公司 Method and system for analyzing traceability identification information based on data processing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945279A (en) * 2012-11-14 2013-02-27 清华大学 Evaluating method and device of influence effect of microblog users
CN104199974A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Microblog-oriented dynamic topic detection and evolution tracking method
CN107194560A (en) * 2017-05-12 2017-09-22 东南大学 The Social search evaluation method clustered in LBSN based on good friend

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324666A (en) * 2013-05-14 2013-09-25 亿赞普(北京)科技有限公司 Topic tracing method and device based on micro-blog data
CN104133897B (en) * 2014-08-01 2017-07-11 哈尔滨工程大学 A kind of microblog topic source tracing method based on topic influence
US9740672B2 (en) * 2014-10-24 2017-08-22 POWr Inc. Systems and methods for dynamic, real time management of cross-domain web plugin content
CN106980692B (en) * 2016-05-30 2020-12-08 国家计算机网络与信息安全管理中心 Influence calculation method based on microblog specific events

Also Published As

Publication number Publication date
CN109885760A (en) 2019-06-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201229