CN109885760B - Information tracing method and system based on user interests - Google Patents
- Publication number
- CN109885760B (application CN201910059484.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention provides an information tracing method and system based on user interests. Content keywords are extracted from a user's historical microblog information, and the degree of correlation between the content keywords and the interest keywords is calculated to evaluate the blogger's degree of interest in each interest classification. From this, the user influence of the blogger's microblog in each interest classification is calculated, and then the blogger's total influence. The degree of interest of each commenter or forwarder in each interest classification is calculated to obtain the total comment-and-forwarding influence of the commenters and forwarders. A time influence is calculated from the release time of the microblog information. The blogger's total influence, the total comment-and-forwarding influence, the time influence and the attention degree are each given a weight to obtain a comprehensive microblog score, and microblogs are ranked and traced according to that score.
Description
Technical Field
The invention relates to the technical field of information tracing, in particular to an information tracing method and system based on user interests.
Background
Microblogs, as one of the largest domestic self-media platforms, often spread rumors, sensitive topics and other related information. Tracing microblog information is therefore important for maintaining information security and has many applications in public opinion monitoring and social network analysis.
Scholars have already done some research on information tracing. Existing approaches include: tracing a microblog from its content time, originality and centrality combined with its forwarding relations; calculating user influence from information such as the user's number of fans and number of comments, and combining it with the Hacker News algorithm to find the microblog source; constructing a K-tree model to obtain the information propagation path and thereby trace the source; deriving the microblog source from factors such as the frequency of the blogger, the originality coefficient, the forwarding count, the appraisal count and the forwarding relations of the microblog; combining the network propagation model AN with the number of microblog participants and recursively deriving the microblog source with a formula; and applying the longest-common-subsequence method to microblogs to trace them.
Current microblog tracing methods mainly rely on the text similarity of microblogs combined with information such as the number of comments, the number of fans and forwarding relations. They do not consider the influence contributed by the text content of the microblog or by the content of the commenters' comments.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an information tracing method and system based on user interests.
The information tracing method based on user interests provided by the invention comprises the following steps:
an information collection step: obtaining the information to be traced;
an information extraction step: extracting the required information elements from the information to be traced;
a score calculation step: calculating a score from each information element, and tracing according to the scores.
Preferably, the information elements include any one or more of: historical release information of the information owner, comment information, forwarding information, information release time, and the attention amount of the historical release information of the information owner.
Preferably, the score calculation step includes:
an information owner influence score calculation step: obtaining the interest categories of the information owner from the owner's historical release information, and calculating the information owner influence score according to the interest categories;
a commenter or forwarder influence score calculation step: obtaining the interest categories of the commenters or forwarders from the comment or forwarding information, and calculating the commenter or forwarder influence score according to the interest categories;
a time influence score calculation step: calculating a time influence score from the information release time;
an attention degree score calculation step: calculating the attention degree score from the attention amount of the historical release information of the information owner.
Preferably, the information owner influence score is calculated by the following formula:
BloggerIntInf(i) = α * log_β(Fans + b) * Interest(i) * BlogIntSim(i)
wherein BloggerIntInf(i) represents the influence score of the information owner in interest classification i, BloggerInf represents the influence score of the information owner, α is a weight parameter, Fans represents the number of fans, b is a parameter ensuring that the result of the logarithm is positive, Interest(i) is the degree of interest of the information owner in interest classification i, BlogIntSim(i) is the degree of similarity between the information and interest classification i, and n represents the total number of the information.
Preferably, the commenter or forwarder influence score is calculated by the following formulas:
ComIntInf(i) = Σ SinComIntInf(l);
SinComIntInf(l) = θ * ComInterest(i) * log_β(ComFans + b) * E * L * BlogIntSim(i);
L = log_λ(Like + c);
wherein ComInf represents the influence score of the commenters or forwarders, SinComIntInf(l) represents the influence of a single forwarder or commenter l on interest classification i, ComIntInf(i) represents the influence of the commenters or forwarders on interest classification i, n represents the total number of interest classifications, θ is a weight parameter, ComInterest(i) is the degree of interest of the forwarder or commenter in interest classification i, BlogIntSim(i) is the degree of similarity between the information and interest classification i, ComFans is the number of fans of the commenter or forwarder, b is a parameter ensuring that the result of the logarithm is positive, E represents the sentiment of the comment statement, L represents the contribution of the likes on the comment or forward to the influence, Like represents the number of likes on the comment, and λ and c are parameters ensuring that the result of the logarithm is positive.
Preferably, the degree of similarity between the information and an interest classification i is calculated by the following formula:
Interest(i) = Σ BlogIntSim(i)
wherein BlogIntSim(i) indicates how similar a piece of information is to interest classification i; KeyWord(j) represents the j-th content keyword extracted from the information, m represents the total number of content keywords, KeyWordWeight(j) represents the weight of the j-th content keyword, IntWord(i) represents interest classification i, HowNetDis represents a function that calculates the distance between two words in HowNet, and Interest(i) is the degree of interest of the information owner in interest classification i.
Preferably, the time influence score is calculated by the following formula:
wherein TimeScore represents the time influence score, T represents the time influence of one microblog, time represents the release time of the current microblog, MinTime represents the earliest release time among all the microblogs, MaxTime represents the latest release time among all the microblogs, T_max represents the maximum T value calculated over all the microblogs being traced, and e_1 is a correction parameter.
Preferably, the attention score is calculated by the following formula:
A = Σ (Li + Rep + Com) * factor
wherein AttraScore represents the attention score; A represents the attention received by the information owner within the set time X, calculated from the numbers of likes, forwards and comments within the set time X; Li, Rep and Com represent the numbers of likes, forwards and comments of one piece of the information owner's information within the set time X; factor is a parameter calculated from the duration of the information; A_max represents the maximum value of A calculated over all the information to be traced; and e_2 is a correction parameter.
The invention also provides an information tracing system based on user interests, which comprises:
an information collection module: obtaining the information to be traced;
an information extraction module: extracting the required information elements from the information to be traced;
a score calculation module: calculating a score from each information element, and tracing according to the scores.
Preferably, the information elements include any one or more of: historical release information of the information owner, comment information, forwarding information, information release time, and the attention amount of the historical release information of the information owner;
the score calculation module comprises:
an information owner influence score calculation module: obtaining the interest categories of the information owner from the owner's historical release information, and calculating the information owner influence score according to the interest categories;
a commenter or forwarder influence score calculation module: obtaining the interest categories of the commenters or forwarders from the comment or forwarding information, and calculating the commenter or forwarder influence score according to the interest categories;
a time influence score calculation module: calculating a time influence score from the information release time;
an attention degree score calculation module: calculating the attention degree score from the attention amount of the historical release information of the information owner.
Compared with the prior art, the invention has the following beneficial effects:
The invention calculates the influence of the user according to the interests of the microblog user, and calculates the influence of the commenters and forwarders according to their interests. These influence scores are combined with the scores of factors such as microblog release time and attention degree in a weighted sum to obtain a microblog score, and microblogs are ranked and traced by that score. Because tracing draws on several factors together, the tracing result is more accurate.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a system framework diagram of the present invention;
FIG. 2 is a comparison graph of the correct number of tracing results in the test example;
FIG. 3 is a chart comparing recall ratios in test examples;
FIG. 4 is a comparison graph of the number of hot microblogs of the tracing result in the test example;
FIG. 5 is a chart of microblog scores for the plum fly-away event.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will help those skilled in the art to further understand the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention. All such changes and modifications fall within the scope of the present invention.
Information tracing mainly means that, given an information text, the source of the microblog topic is found from the content of the text and related information. With the development of internet technology, various social media have emerged and compete with one another. Microblogs are currently a highly influential self-media platform, so research on information tracing is of great significance for public opinion management and information security. Most previous research considers only the similarity and forwarding relations of microblog content and ignores the influence of the microblog text content. On this basis, an Interest-based Traceability Method (ITM) built on user interests is proposed. The blogger's interests are calculated from the content of the blogger's past microblogs, and the blogger's influence is then calculated from those interests. The interests of the microblog's commenters and forwarders are calculated in the same way to obtain their influence. Finally, these are combined with the scores of factors such as time and attention in a weighted sum to obtain a microblog score, and microblogs are ranked and traced by that score.
The information tracing method based on user interests provided by the invention comprises the following steps:
an information collection step: obtaining the information to be traced;
an information extraction step: extracting the required information elements from the information to be traced;
a score calculation step: calculating a score from each information element, and tracing according to the scores.
Specifically, the information elements include any one or more of: historical release information of the information owner, comment or forwarding information, information release time, and the attention amount of the historical release information of the information owner.
Specifically, the score calculation step includes:
an information owner influence score calculation step: obtaining the interest categories of the information owner from the owner's historical release information, and calculating the information owner influence score according to the interest categories;
a commenter or forwarder influence score calculation step: obtaining the interest categories of the commenters or forwarders from the comment or forwarding information, and calculating the commenter or forwarder influence score according to the interest categories;
a time influence score calculation step: calculating a time influence score from the information release time;
an attention degree score calculation step: calculating the attention degree score from the attention amount of the historical release information of the information owner.
Specifically, the information owner influence score is calculated by the following formula:
BloggerIntInf(i) = α * log_β(Fans + b) * Interest(i) * BlogIntSim(i)
wherein BloggerIntInf(i) represents the influence score of the information owner in interest classification i, BloggerInf represents the influence score of the information owner, α is a weight parameter, Fans represents the number of fans, b is a parameter ensuring that the result of the logarithm is positive, Interest(i) is the degree of interest of the information owner in interest classification i, BlogIntSim(i) is the degree of similarity between the information and interest classification i, and n represents the total number of the information.
Specifically, the commenter or forwarder influence score is calculated by the following formulas:
ComIntInf(i) = Σ SinComIntInf(l);
SinComIntInf(l) = θ * ComInterest(i) * log_β(ComFans + b) * E * L * BlogIntSim(i);
L = log_λ(Like + c);
wherein ComInf represents the influence score of the commenters or forwarders, SinComIntInf(l) represents the influence of a single forwarder or commenter l on interest classification i, ComIntInf(i) represents the influence of the commenters or forwarders on interest classification i, n represents the total number of interest classifications, θ is a weight parameter, ComInterest(i) is the degree of interest of the forwarder or commenter in interest classification i, BlogIntSim(i) is the similarity between the microblog information and interest classification i, ComFans is the number of fans of the commenter or forwarder, b is a parameter ensuring that the result of the logarithm is positive, E represents the sentiment of the comment statement, L represents the contribution of the likes on the comment or forward to the influence, Like represents the number of likes on the comment, and λ and c are parameters ensuring that the result of the logarithm is positive.
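The three commenter and forwarder formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Comment` class, the function names, and the default θ = 1.0 are hypothetical. The defaults β = b = 1.1 and λ = 1.4 follow the values the description suggests further on, and c = 1.4 is an assumption chosen so that a comment with zero likes contributes a factor of 1.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Comment:
    """One comment or forward (hypothetical container, not from the patent)."""
    com_interest: float  # ComInterest(i): commenter's interest in classification i
    com_fans: int        # ComFans: commenter's number of fans
    sentiment: float     # E: 0 negative, 0.5 neutral, 1 positive
    likes: int           # Like: number of likes on the comment


def like_contribution(likes: int, lam: float = 1.4, c: float = 1.4) -> float:
    """L = log_lambda(Like + c); c = 1.4 is an assumed value."""
    return math.log(likes + c, lam)


def sin_com_int_inf(cm: Comment, blog_int_sim_i: float, theta: float = 1.0,
                    beta: float = 1.1, b: float = 1.1) -> float:
    """SinComIntInf(l) for a single commenter or forwarder."""
    return (theta * cm.com_interest * math.log(cm.com_fans + b, beta)
            * cm.sentiment * like_contribution(cm.likes) * blog_int_sim_i)


def com_int_inf(comments: Iterable[Comment], blog_int_sim_i: float) -> float:
    """ComIntInf(i): sum of SinComIntInf(l) over all commenters and forwarders."""
    return sum(sin_com_int_inf(cm, blog_int_sim_i) for cm in comments)
```

Because E enters as a multiplier, a comment with negative sentiment (E = 0) contributes nothing in this sketch, whatever its fan count or likes.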
Specifically, the degree of correlation between the information and an interest classification is calculated by the following formula:
Interest(i) = Σ BlogIntSim(i)
wherein BlogIntSim(i) indicates how relevant a piece of information is to interest classification i; KeyWord(j) represents the j-th content keyword extracted from the information, m represents the total number of content keywords, KeyWordWeight(j) represents the weight of the j-th content keyword, IntWord(i) represents interest classification i, HowNetDis represents a function that calculates the distance between two words in HowNet, and Interest(i) is the degree of interest of the information owner in interest classification i.
Specifically, the time influence score is calculated by the following formula:
wherein TimeScore represents the time influence score, T represents the time influence of one microblog, time represents the release time of the current microblog, MinTime represents the earliest release time among all the microblogs, MaxTime represents the latest release time among all the microblogs, T_max represents the maximum T value calculated over all the microblogs being traced, and e_1 is a correction parameter.
Specifically, the attention score is calculated by the following formula:
A = Σ (Li + Rep + Com) * factor
wherein AttraScore represents the attention score; A represents the attention received by the information owner within the set time X, calculated from the numbers of likes, forwards and comments within the set time X; Li, Rep and Com represent the numbers of likes, forwards and comments of one piece of the information owner's information within the set time X; factor is a parameter calculated from the duration of the information; A_max represents the maximum value of A calculated over all the information to be traced; and e_2 is a correction parameter.
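As a rough illustration of the raw attention value A, the sketch below sums engagement counts over an owner's pieces of information within the window X. It assumes that factor is supplied per piece of information, since the text does not say how factor is derived from the duration, and it stops before the normalization with A_max and e_2, whose formula is not reproduced here. The `PostStats` class and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PostStats:
    """Engagement of one piece of information within the set time X (hypothetical)."""
    likes: int      # Li
    forwards: int   # Rep
    comments: int   # Com
    factor: float   # duration-based parameter; its derivation is not given


def attention(posts: Iterable[PostStats]) -> float:
    """A = sum of (Li + Rep + Com) * factor over the owner's posts in the window."""
    return sum((p.likes + p.forwards + p.comments) * p.factor for p in posts)
```

For example, two posts with counts (10, 5, 5) at factor 1.0 and (2, 1, 1) at factor 0.5 give A = 20 + 2 = 22.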
The invention also provides an information tracing system based on user interests, which comprises:
an information collection module: obtaining the information to be traced;
an information extraction module: extracting the required information elements from the information to be traced;
a score calculation module: calculating a score from each information element, and tracing according to the scores.
Preferably, the information elements comprise historical release information of the information owner, comment information, forwarding information, information release time, and the attention amount of the historical release information of the information owner;
the score calculation module comprises:
an information owner influence score calculation module: obtaining the interest categories of the information owner from the owner's historical release information, and calculating the information owner influence score according to the interest categories;
a commenter or forwarder influence score calculation module: obtaining the interest categories of the commenters or forwarders from the comment or forwarding information, and calculating the commenter or forwarder influence score according to the interest categories;
a time influence score calculation module: calculating a time influence score from the information release time;
an attention degree score calculation module: calculating the attention degree score from the attention amount of the historical release information of the information owner.
The information tracing system based on the user interests can be realized through the step flow of the information tracing method based on the user interests. Those skilled in the art can understand the information tracing method based on user interests as a preferred example of the information tracing system based on user interests.
In a specific implementation, the method is applied to microblog tracing. Content keywords are extracted from the historical microblog information of the blogger, and the degree of correlation between the content keywords and the interest keywords is calculated to evaluate the blogger's degree of interest in each interest classification. The user influence of the blogger's microblog in each interest classification is then calculated, giving the blogger's total influence. The degree of interest of each commenter or forwarder in each interest classification is calculated to obtain the total comment-and-forwarding influence of the commenters and forwarders. A time influence is calculated from the release time of the microblog information. The blogger's total influence, the total comment-and-forwarding influence, the time influence and the attention degree are each given a weight to obtain a comprehensive microblog score, and microblogs are ranked and traced according to that score.
In the user interest calculation, user interests are divided into interest classifications on the basis of microblog data analysis, and each interest classification is identified by interest keywords. Content keywords are extracted from the user's historical microblog information, and the degree of correlation between the content keywords and the interest keywords is calculated to obtain the degree of correlation between each piece of the user's microblog information and an interest classification. The user's degree of interest in that interest classification is then evaluated from these degrees of correlation;
in the user influence calculation, the user influence of the user's microblog in an interest classification is calculated from the user's degree of interest in that classification and the user's number of fans, and the user's total influence is then obtained;
in the commenter influence calculation, the degree of interest of each commenter or forwarder in an interest classification, the number of fans of the commenter or forwarder, and the sentiment of the comment content are used to calculate the influence of the commenter or forwarder in that interest classification, and the total comment-and-forwarding influence of the commenters and forwarders is then obtained;
in the time factor calculation, the time influence is calculated from the release time of the microblog information.
After calculation, corresponding weights are given to the time influence, the attention, the comment forwarding total influence of the commentator or the forwarder and the user total influence respectively to obtain a microblog comprehensive score, and microblog tracing is carried out according to the microblog comprehensive score.
Specifically, the microblog comprehensive score ITM is calculated by the following formula:
ITM=k1*TimeScore+k2*AttraScore+k3*BloggerScore+k4*ComScore
wherein TimeScore represents the time influence score of the microblog, AttraScore represents the attention score of the microblog, BloggerScore represents the total influence score of the user, ComScore represents the total influence score of the commenters or forwarders, and k1, k2, k3 and k4 represent the respective score weights.
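The weighted sum and the ranking step can be sketched as follows. This is a minimal illustration, not the patented implementation: the equal weights, the dictionary layout and the function names are hypothetical, and it assumes that a higher ITM score marks a more likely source, which the text implies but does not state.

```python
from typing import Dict, List, Tuple

# Hypothetical equal weights k1..k4; the text does not fix their values.
DEFAULT_WEIGHTS = (0.25, 0.25, 0.25, 0.25)


def itm_score(scores: Dict[str, float],
              k: Tuple[float, float, float, float] = DEFAULT_WEIGHTS) -> float:
    """ITM = k1*TimeScore + k2*AttraScore + k3*BloggerScore + k4*ComScore."""
    k1, k2, k3, k4 = k
    return (k1 * scores["TimeScore"] + k2 * scores["AttraScore"]
            + k3 * scores["BloggerScore"] + k4 * scores["ComScore"])


def rank_by_itm(candidates: Dict[str, Dict[str, float]],
                k: Tuple[float, float, float, float] = DEFAULT_WEIGHTS
                ) -> List[Tuple[str, float]]:
    """Return (microblog id, ITM) pairs sorted from highest to lowest score."""
    scored = [(post_id, itm_score(s, k)) for post_id, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Passing the four component scores per candidate microblog yields a ranked list whose first entries are the candidates to inspect as the source.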
As shown in the system framework diagram of FIG. 1, the invention involves user interest calculation, blogger influence calculation, commenter influence calculation and attention calculation.
The interest-based user influence calculation comprises the user interest calculation and the user influence calculation.
For the user interest calculation, the behavior of microblog users reflects some of their interests. For example, if a user likes entertainment content, the microblogs that the user publishes, comments on and forwards all lean toward entertainment information. The user's interest tendencies can therefore be analyzed from the user's past microblog information. Analysis of microblog data shows that user interests can be divided into the following classifications: entertainment, economy, science and education, politics, and military. The method extracts the keywords from the user's past blog information and sums the HowNet distances between those keywords and the interest keyword to obtain the degree of correlation between the microblog topic and the interest.
The degree of relevance of one piece of a user's blog information to an interest i is calculated as follows:
wherein BlogIntSim(i) indicates how relevant one piece of the user's blog information is to interest i (one of entertainment, economy, science and education, politics, and military). KeyWord(j) represents a keyword extracted from the blog (m keywords in total), KeyWordWeight(j) represents the weight of that keyword, and IntWord(i) represents interest i (one of entertainment, economy, science and education, politics, and military). The HowNetDis function calculates the distance between two words in HowNet. HowNet defines words through senses (meaning items) and sememes: a sense is one description of a word, a word can have several senses, and a sememe is the basic unit used to describe a sense. The basic structure of a sense is as follows:
The distance between sememes is calculated by taking into account the effect of the depth and density of the sememe hierarchy tree on the sememe weights. The distance between senses is then calculated from the sememe distances, and finally the distance between words is calculated from the sense distances.
The sum, over all of a user's pieces of blog information, of each piece's relevance to interest i can be used to evaluate how interested the user is in that topic.
Interest(i) = Σ BlogIntSim(i) (2)
Interest(i) is the user's degree of interest in interest i, and BlogIntSim(i) is the degree of relevance of one piece of the user's past blog information to interest i, calculated by formula 1.
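The two-level calculation described above (per-post relevance, then a sum over the user's posts) can be sketched as below. Formula 1 itself is not reproduced in this text, so the weighted sum in `blog_int_sim` is an assumed form inferred from the symbol definitions. `toy_hownet_dis` and its lookup table are hypothetical stand-ins for a real HowNet-based word measure, used only to make the example self-contained.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy table standing in for HowNetDis(word, interest word).
TOY_SCORES = {
    ("film", "entertainment"): 0.9,
    ("singer", "entertainment"): 0.8,
    ("stock", "economy"): 0.9,
    ("stock", "entertainment"): 0.1,
}


def toy_hownet_dis(word: str, interest_word: str) -> float:
    """Stand-in for HowNetDis; unknown pairs score 0."""
    return TOY_SCORES.get((word, interest_word), 0.0)


Keywords = Sequence[Tuple[str, float]]  # (KeyWord(j), KeyWordWeight(j)) pairs


def blog_int_sim(keywords: Keywords, interest_word: str,
                 dis: Callable[[str, str], float] = toy_hownet_dis) -> float:
    """Assumed form of formula 1: weighted sum of word scores over m keywords."""
    return sum(weight * dis(word, interest_word) for word, weight in keywords)


def interest(history: List[Keywords], interest_word: str,
             dis: Callable[[str, str], float] = toy_hownet_dis) -> float:
    """Formula 2: Interest(i) is the sum of BlogIntSim(i) over past posts."""
    return sum(blog_int_sim(post, interest_word, dis) for post in history)
```

With a history of one post about film and a singer and one post about stocks, the user's entertainment interest in this toy setup is 0.86 + 0.10 = 0.96, and the economy interest is 0.90.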
For the user influence calculation, the microblog content that a user posts reflects the user's interests, and a user's fans generally share similar interests with the user. The more similar a blogger's microblog is to the blogger's own interests, therefore, the greater its influence on fans and the public. The influence of a microblog also grows with the blogger's number of fans, but once the number of fans exceeds a certain order of magnitude (for example, tens of millions), additional fans add little further influence.
With reference to formula 2, the influence of the user's microblog on interest i is calculated as follows:
BloggerIntInf(i)=α*logβ(Fans+b)*Interest(i)*BlogIntSim(i) (3)
wherein α is a weight parameter, Fans represents the number of fans, and Interest(i) is the blogger's degree of interest in interest i, calculated by formula 2. BlogIntSim(i) is the degree of similarity between the microblog content and interest i, calculated by formula 1. A logarithmic function is used for the influence of the number of fans, and the parameter b ensures that the result of the logarithm is positive. Judging from plots of the function, setting both β and b to 1.1 matches practice well.
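A small Python sketch of formula 3 makes the role of the logarithm concrete. The function names and the default α = 1.0 are hypothetical; β = b = 1.1 are the values suggested in the text. The helper `blogger_inf` sums the per-interest influence, following the statement that the total influence is the sum over interests.

```python
import math
from typing import Dict, Tuple


def blogger_int_inf(fans: int, interest_i: float, blog_int_sim_i: float,
                    alpha: float = 1.0, beta: float = 1.1,
                    b: float = 1.1) -> float:
    """Formula 3: alpha * log_beta(Fans + b) * Interest(i) * BlogIntSim(i)."""
    return alpha * math.log(fans + b, beta) * interest_i * blog_int_sim_i


def blogger_inf(fans: int, per_interest: Dict[str, Tuple[float, float]]) -> float:
    """Total influence: sum of formula 3 over interests.

    per_interest maps an interest name to (Interest(i), BlogIntSim(i)).
    """
    return sum(blogger_int_inf(fans, interest_i, sim_i)
               for interest_i, sim_i in per_interest.values())
```

With b equal to β, a blogger with zero fans gets a logarithm of exactly 1, so the score never turns negative. Going from ten thousand to ten million fans multiplies the fan term by less than two, which matches the description's point that very large fan counts add little extra influence.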
The sum of the user's influence under each interest is the user's total influence, and the calculation formula is as follows:
BloggerInf = Σ BloggerIntInf(i) (4)
For the influence calculation of commenters and forwarders, a user can post a comment under a microblog, other people can like that comment, and the commenter's fans can be pushed information about it. A commenter therefore contributes a certain influence to the microblog being commented on. A commenter shares similar interests with the commenter's own fans, so the influence is larger when the commenter's interests are similar to the topic of the microblog. Likewise, a forward can carry an added comment and also has a certain influence. All commenters and forwarders of a microblog contribute influence to it, so the influence of the commenters and forwarders on interest i is calculated as follows:
ComIntInf(i)=∑SinComIntInf(l) (5)
where SinComIntInf(l) is the influence of a single forwarder or commentator l on interest i. It depends on the forwarder's or commentator's degree of interest in interest i and on the similarity between the microblog content and interest i. The more fans a forwarder or commentator has, the greater the influence, but once the number of fans exceeds a certain order of magnitude, additional fans add little influence. The number of likes a comment receives and the sentiment of its content also matter: the more likes, the greater the influence; comments with positive sentiment contribute more influence, and comments with negative sentiment contribute less. The calculation formula is as follows:
SinComIntInf(l)=θ*ComInterest(i)*log_β(ComFans+b)*E*L*BlogIntSim(i) (6)
where θ is a weight parameter, ComInterest(i) is the forwarder's or commentator's degree of interest in interest i, calculated by formula 2, and BlogIntSim(i) is the similarity between the microblog content and interest i, calculated by formula 1. ComFans is the number of fans of the commentator or forwarder. E is the sentiment of the comment: 0 indicates negative sentiment, 1 indicates positive sentiment, and 0.5 indicates neutral sentiment. L is the contribution of the likes on the comment or forward to the influence, calculated as follows:
L=log_λ(Like+c) (7)
where Like is the number of likes. The influence grows with the number of likes but levels off once the number is large enough. The parameter c ensures that the result of the logarithm is positive. λ and c were chosen by inspecting plots, with 1.4 judged to fit practice better. The total influence of commentators and forwarders is the sum of their influence under each interest, calculated as follows:
CommentInf=∑ComIntInf(i) (8)
where the sum runs over all interests i.
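Formulas 5 to 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `Interaction` record and all function names are hypothetical, `theta=1.0` is a placeholder because no value for θ is given, and the offset `c=1.4` is an assumed reading of the description.

```python
import math
from dataclasses import dataclass

BETA, B = 1.1, 1.1  # fan-count log base and offset, values from the description
LAMBDA = 1.4        # like-count log base, value from the description


@dataclass
class Interaction:
    """One comment or forward on a microblog (hypothetical record type)."""
    com_interest: float  # ComInterest(i), from formula 2
    com_fans: int        # ComFans
    sentiment: float     # E: 0 negative, 0.5 neutral, 1 positive
    likes: int           # Like


def like_contribution(likes, lam=LAMBDA, c=1.4):
    """Formula 7: L = log_lambda(Like + c). c=1.4 is an assumed offset."""
    return math.log(likes + c, lam)


def sin_com_int_inf(x, blog_int_sim, theta=1.0):
    """Formula 6 for one commentator or forwarder; theta is a placeholder."""
    return (theta * x.com_interest * math.log(x.com_fans + B, BETA)
            * x.sentiment * like_contribution(x.likes) * blog_int_sim)


def com_int_inf(interactions, blog_int_sim, theta=1.0):
    """Formula 5: sum of formula 6 over all commentators and forwarders l."""
    return sum(sin_com_int_inf(x, blog_int_sim, theta) for x in interactions)
```

One consequence of multiplying by E is that a comment with negative sentiment (E = 0) contributes no influence at all, while a neutral comment contributes half of what an otherwise identical positive comment would.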
In calculating the comprehensive microblog score, tracing the source of a microblog topic means finding, according to certain indicators, the source of the information within a given set of microblogs related to an event. The source is not necessarily the earliest microblog in time, because some microblogs gain attention only after being forwarded by highly influential people. In addition, if a blogger's other recent microblogs have attracted much attention, this adds some influence to the current microblog. Therefore the influence of the blogger, the influence of commentators and forwarders, the time, and the attention are all taken into account when identifying the source.
The invention comprehensively considers time, attention, blogger influence, and the influence of commentators and forwarders on a microblog, and traces the source by calculating a score for each microblog. The score is calculated as follows:
ITM=k1*TimeScore+k2*AttraScore+k3*BloggerScore+k4*ComScore (9)
where TimeScore is the time score of the microblog, AttraScore is its attention score, BloggerScore is the blogger influence score, ComScore is the influence score of commentators and forwarders, and k1, k2, k3, k4 are the weights of the respective scores.
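The weighted sum in formula 9 can be sketched in Python as follows. The `Scores` type and the `trace_source` helper are hypothetical names introduced for illustration; the default weights are the values of k1 to k4 reported from the AHP step later in this description, and choosing the highest-scoring microblog as the source is an assumed reading of "tracing by score".

```python
from typing import NamedTuple


class Scores(NamedTuple):
    """The four normalized component scores of one microblog."""
    time_score: float     # TimeScore
    attra_score: float    # AttraScore
    blogger_score: float  # BloggerScore
    com_score: float      # ComScore


# k1..k4 as reported from the AHP weight calculation in the description.
WEIGHTS = (0.148, 0.163, 0.363, 0.326)


def itm(scores, weights=WEIGHTS):
    """Formula 9: weighted sum of the four component scores."""
    return sum(k * s for k, s in zip(weights, scores))


def trace_source(candidates):
    """Return the id of the microblog with the highest ITM score.

    candidates maps a microblog id to its Scores.
    """
    return max(candidates, key=lambda mid: itm(candidates[mid]))
```

With these weights the two influence scores together account for about 69% of the total, so an early microblog with little influence can rank below a later one from an influential blogger.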
The comprehensive microblog score calculation comprises five parts: time score calculation, attention score calculation, user influence score calculation, commentator and forwarder influence score calculation, and score weight calculation.
The time score TimeScore is calculated from the time influence: the earlier a microblog is published, the greater its time influence. The time influence is calculated as follows:
where T is the time influence of a microblog, time is the publishing time of the current microblog, MinTime is the earliest publishing time among all the microblogs, and MaxTime is the latest publishing time among all the microblogs. The time score is obtained by normalizing the time influence. Combining the characteristics of decimal scaling normalization and hyperbolic function normalization, the normalization function used in the method is as follows:
where N' is the value after normalization, N is the value before normalization, Nmax is the maximum of the values being normalized, and K is a correction parameter. The formula normalizes values with Nmax as the reference. K is a small value: as long as N is not small, K has a negligible effect on the normalization, but when N is small, K makes the normalized value very small. When all values of N are very small, the function ensures that all values of N' are also very small, which avoids values of N' close to 1 that would result from a plain ratio. Based on formula 11, the time score is calculated as follows:
where T is the time influence of a microblog and Tmax is the maximum T value calculated over all the traced microblogs. e1 is a correction parameter: when T is very small for all the traced microblogs, it keeps all their time scores low, which avoids some microblogs receiving high time scores when the score is computed as a plain ratio. e1 is set to the mean of T calculated over all collected microblogs.
Based on formula 11, the attention score is calculated as follows:
where A is the attention the blogger has received over the most recent month, calculated from the numbers of likes, forwards, and comments on the blogger's microblogs in that month. Newer microblogs contribute more. A is calculated as follows:
A=∑(Li+Rep+Com)*factor (14)
where Li, Rep, and Com are the numbers of likes, forwards, and comments of one microblog posted by the blogger in the most recent month, and factor is a parameter calculated according to the duration of the microblog, as follows:
Amax is the maximum A value calculated over all the traced microblogs. e2 is a correction parameter whose role is the same as that of e1 in formula 12; e2 is set to the mean of A calculated over all collected microblogs.
Based on formula 11, the user influence score is calculated as follows:
where BloggerInf is the user influence, calculated using formula 4; BloggerInfmax is the maximum user influence calculated over all the traced microblogs; and e3 is a correction parameter whose role is the same as that of e1 in formula 12. e3 is set to the mean of BloggerInf calculated over all collected microblogs.
Based on formula 11, the influence score of commentators and forwarders is calculated as follows:
where CommentInf is the influence of commentators and forwarders, calculated using formula 8; CommentInfmax is the maximum influence of commentators and forwarders calculated over all the traced microblogs; and e4 is a correction parameter whose role is the same as that of e1 in formula 12. e4 is set to the mean of CommentInf calculated over all collected microblogs.
The weights k1, k2, k3, k4 of the scores can be derived using AHP (analytic hierarchy process). AHP decomposes the elements relevant to a decision into levels such as goals, criteria, and alternatives, and performs qualitative and quantitative analysis on that basis.
AHP compares the relative importance of the parameters pairwise to determine the proportion of each parameter relative to another, and thereby constructs a judgment matrix. The eigenvector corresponding to the largest eigenvalue is then calculated, and its components give the weight of each parameter. Finally, a consistency ratio is calculated from the largest eigenvalue of the judgment matrix as a consistency check; if the consistency ratio is below a threshold, the calculated weights are considered reasonable.
The judgment matrix obtained through expert judgment combining the various factors is shown in Table I (score weight calculation).
Calculating the eigenvector for the largest eigenvalue gives the weights k1=0.148, k2=0.163, k3=0.363, k4=0.326.
The calculated consistency ratio is 0.8%, below the 10% threshold, so the calculated weights are reasonable.
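The AHP weight calculation and consistency check described above can be sketched in Python as follows. This is an illustrative sketch, not the procedure used in the test: the function names are hypothetical, power iteration is one common way to approximate the principal eigenvector, and the random index values are Saaty's commonly tabulated ones. The judgment matrix of Table I is not reproduced in this text, so any matrix passed in is the caller's own.

```python
# Saaty's random consistency index for judgment matrices of order 1 to 5.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix, iterations=100):
    """Approximate the principal eigenvector by power iteration.

    matrix is a square pairwise comparison matrix with positive entries.
    Returns (weights, lambda_max), with the weights normalized to sum to 1.
    """
    n = len(matrix)
    w = [1.0 / n] * n
    lam = float(n)
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        # w sums to 1, so the sum of A*w estimates the largest eigenvalue.
        lam = sum(v)
        w = [x / lam for x in v]
    return w, lam


def consistency_ratio(lambda_max, n):
    """CR = CI / RI, where CI = (lambda_max - n) / (n - 1)."""
    if n <= 2:
        return 0.0  # matrices of order 1 or 2 are always consistent
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]
```

A perfectly consistent matrix, in which every entry equals the ratio of two underlying weights, has a largest eigenvalue equal to its order and therefore a consistency ratio of zero; the further the expert judgments depart from such ratios, the larger the ratio becomes.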
The test procedure on concrete data is as follows. Microblogs posted from September 11 to October 12 were collected from a microblog platform, together with the related blogger and commentator information, 12110 records in total. They cover five events, including 'plum flying off duty', 'crow going off', and 'united states restart monthly plan', and were used to test the ITM algorithm. An OR algorithm based on text centrality was used for comparison. The sources of the microblogs were labeled manually, and the two algorithms were compared by the number of sources each identified correctly. In addition, the number of traced results from each algorithm that correspond to hot microblogs under the respective topic was checked to further verify accuracy. The test results are shown in Table II.
Table II
As can be seen from fig. 2, fig. 3, and Table II, the ITM algorithm provided by the invention outperforms the OR algorithm in the number of correctly traced source microblogs. For events such as plum flying departure and national star naming, the audience has a clear interest tendency, so the source is located accurately. For the united states restart event, the few high-influence microblogs among the traced microblogs received low scores, so the accuracy of the ITM algorithm is lower than that of the OR algorithm.
As can be seen from fig. 4, in terms of the number of hot microblogs, the OR algorithm considers only text similarity and centrality and ignores the influence of a microblog on the topic, so the algorithm provided by the invention is more accurate.
Taking the plum flying leave event as an example, the microblog tracing results of the ITM algorithm and the basic information of the microblogs are as follows (for ease of analysis, the microblog scores are converted to a percentage scale):
Table III
As can be seen from fig. 5, microblogs from accounts such as Daily Economic News and Sina Technology score high. According to the information in Table III, the high-scoring microblogs were published earlier, received relatively more likes, comments, and forwards, and come from bloggers with more fans. Comparison with the manually labeled sources and with the hot microblogs of the topic shows that the tracing result is fairly accurate. As can be seen from fig. 5 and Table III, the microblog published by Panoramic Network attracted much attention, but its score is not high because it was published too late, which matches the actual situation. The microblog published by China Women's News was posted earlier, but it received little attention and the blogger has few fans, so its influence is small, which also matches the actual situation.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (6)
1. An information tracing method based on user interests is characterized by comprising the following steps:
an information collection step: obtaining information to be traced;
an information extraction step: extracting required information elements in the information to be traced;
a score calculation step: calculating information scores according to the information elements, and tracing according to the information scores;
the information elements comprise any one or more of historical release information of an information owner, comment information, forwarding information, information release time and the attention amount of the historical release information of the information owner;
the score calculation step includes:
an information owner influence score calculation step: obtaining interest categories of the information owners according to historical release information of the information owners, and calculating influence scores of the information owners according to the interest categories;
a commentator or forwarder influence score calculation step: obtaining interest categories of the commentators or the forwarders according to the comment or forwarding information, and calculating influence scores of the commentators or the forwarders according to the interest categories;
a time influence score calculation step: calculating a time influence score according to the information release time;
an attention score calculation step: calculating an attention score according to the attention amount of the historical release information of the information owner;
the information owner influence score is calculated by:
BloggerIntInf(i)=α*log_β(Fans+b)*Interest(i)*BlogIntSim(i)
wherein BloggerIntInf(i) represents the influence score of the information owner in the interest classification i, BloggerInf represents the influence score of the information owner, α is a weight parameter, Fans represents the number of fans, b is a parameter for ensuring that the logarithm calculation result is a positive value, Interest(i) is the interest degree of the information owner in the interest classification i, BlogIntSim(i) is the similarity degree of the information and the interest classification i, and n represents the total number of the information.
2. The information tracing method based on user interest according to claim 1, wherein the critic or forwarder influence score is calculated by the following formula:
ComIntInf(i)=∑SinComIntInf(l);
SinComIntInf(l)=θ*ComInterest(i)*log_β(ComFans+b)*E*L*BlogIntSim(i);
L=log_λ(Like+c);
wherein CommentInf represents the influence score of a commentator or a forwarder, SinComIntInf(l) represents the influence of a single forwarder or commentator l on an interest class i, ComIntInf(i) represents the influence of a commentator or a forwarder on the interest class i, n represents the total number of interest classes, θ is a weight parameter, ComInterest(i) is the interest degree of the forwarder or the commentator on the interest class i, BlogIntSim(i) is the similarity degree of information and the interest class i, ComFans is the number of fans of the commentator or the forwarder, b is a parameter for ensuring that a logarithmic calculation result is a positive value, E represents the sentiment of a comment statement, L represents the contribution of the likes of the comment or the forward to the influence, Like represents the number of likes of the comment, and λ and c are parameters for ensuring that the logarithmic calculation result is a positive value.
3. The method for tracing information based on user interests according to any one of claims 1-2, wherein the similarity degree of the information and the interest classification i is calculated by the following formula:
Interest(i)=ΣBlogIntSim(i)
wherein BlogIntSim(i) indicates how similar a piece of information is to interest classification i; KeyWord(j) represents the j-th content keyword extracted from the information, m represents the total number of the content keywords, KeyWordWeight(j) represents the weight of the j-th content keyword, IntWord(f) represents an interest classification i, HowNetDis represents a function for calculating the distance between two words in HowNet, and Interest(i) is the interest degree of the information owner in the interest classification i.
4. The method of claim 1, wherein the time influence score is calculated by the following formula:
wherein TimeScore represents the time influence score, T represents the time influence of one microblog, time represents the publishing time of the current microblog, MinTime represents the earliest publishing time of all the microblogs, MaxTime represents the latest publishing time of all the microblogs, Tmax represents the maximum T value calculated over all the traced microblogs, and e1 is a correction parameter.
5. The method of claim 1, wherein the focus score is calculated by the following formula:
A=∑(Li+Rep+Com)*factor
wherein AttraScore represents the attention score, A represents the attention received by the information owner within the set time X, and A is calculated from the numbers of likes, forwards and comments within the set time X; Li, Rep and Com represent the numbers of likes, forwards and comments of a piece of information of the information owner within the set time X; factor is a parameter calculated from the duration of the information, Amax represents the maximum value of A calculated from all the information to be traced, and e2 is a correction parameter.
6. The method of claim 1, wherein the information score is calculated by the following formula:
ITM=k1*TimeScore+k2*AttraScore+k3*BloggerScore+k4*ComScore
wherein ITM represents an information score, TimeScore represents a time influence score, AttraScore represents an attention score, BloggerScore represents an information owner influence score, ComScore represents a commentator or forwarder total influence score, and k1, k2, k3, k4 respectively represent the scoring weights.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910059484.XA CN109885760B (en) | 2019-01-22 | 2019-01-22 | Information tracing method and system based on user interests |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109885760A CN109885760A (en) | 2019-06-14 |
CN109885760B true CN109885760B (en) | 2020-12-29 |
Family
ID=66926536
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910059484.XA Expired - Fee Related CN109885760B (en) | 2019-01-22 | 2019-01-22 | Information tracing method and system based on user interests |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109885760B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061983B (en) * | 2019-12-17 | 2024-01-09 | 上海冠勇信息科技有限公司 | Evaluation method of infringement data grabbing priority and network monitoring system thereof |
CN115511511B (en) * | 2022-11-23 | 2023-03-24 | 成都银光软件有限公司 | Method and system for analyzing traceability identification information based on data processing |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102945279A (en) * | 2012-11-14 | 2013-02-27 | 清华大学 | Evaluating method and device of influence effect of microblog users |
CN104199974A (en) * | 2013-09-22 | 2014-12-10 | 中科嘉速(北京)并行软件有限公司 | Microblog-oriented dynamic topic detection and evolution tracking method |
CN107194560A (en) * | 2017-05-12 | 2017-09-22 | 东南大学 | The Social search evaluation method clustered in LBSN based on good friend |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103324666A (en) * | 2013-05-14 | 2013-09-25 | 亿赞普(北京)科技有限公司 | Topic tracing method and device based on micro-blog data |
CN104133897B (en) * | 2014-08-01 | 2017-07-11 | 哈尔滨工程大学 | A kind of microblog topic source tracing method based on topic influence |
US9740672B2 (en) * | 2014-10-24 | 2017-08-22 | POWr Inc. | Systems and methods for dynamic, real time management of cross-domain web plugin content |
CN106980692B (en) * | 2016-05-30 | 2020-12-08 | 国家计算机网络与信息安全管理中心 | Influence calculation method based on microblog specific events |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | |
Granted publication date: 20201229 |