CN103279483A - Topic prevalence range assessment method and system facing micro-blogs - Google Patents

Info

Publication number
CN103279483A
Authority
CN
China
Prior art keywords
topic
new
message
communities
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101438466A
Other languages
Chinese (zh)
Other versions
CN103279483B (en
Inventor
程学旗
李静远
李佳
王元卓
刘悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201310143846.6A priority Critical patent/CN103279483B/en
Publication of CN103279483A publication Critical patent/CN103279483A/en
Application granted granted Critical
Publication of CN103279483B publication Critical patent/CN103279483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a microblog-oriented topic prevalence range assessment method and system. The method comprises the following steps. S1: historical data of a microblog platform are collected, multiple topics and multiple messages are extracted, and the messages are merged to obtain multiple merged messages; a community is built from the users who post or forward the same merged message, yielding multiple communities; the topics are classified on the basis of the degree of overlap among the communities; and features of the topics in the same class are extracted. S2: real-time data of the microblog platform are obtained, new topics and multiple new messages are extracted, and the new messages are merged to obtain multiple new merged messages; a new community is built from the users who post or forward the same new merged message, yielding multiple new communities; the new topics are classified on the basis of the degree of overlap among the new communities; and new features of the new topics in the same class are extracted. S3: the features are matched with the new features to obtain target topics, and the prevalence range of the target topics is assessed.

Description

A microblog-oriented topic prevalence range assessment method and system
Technical field
The present invention relates to the field of Internet information management, and in particular to a microblog-oriented topic prevalence range assessment method and system.
Background art
With the rapid development of the Internet, and of Web 2.0 in particular, social networking services represented by Facebook, Myspace and Twitter have become an indispensable communication medium for network users. These services provide users with up-to-date information, including updates from friends and from people or groups of interest and information about the latest trending events, and they are gradually changing the way social network users obtain information. Microblogs, represented by Twitter abroad and Sina Weibo in China, are a new form of social network that differs markedly from traditional community-based virtual communities such as Facebook, chiefly in the follow mechanism, the message propagation mode and message timeliness. Unlike general social networks, a microblog uses a one-way follow mechanism, so any user can follow anyone of interest. Message propagation is broadcast-style: a message posted by a user is pushed to all of that user's followers. A microblog is also a new network service that combines the web with mobile terminals; it limits the length of the content a user sends and puts more emphasis on real-time messages. Microblog users typically describe news and events and express their own views in short texts (generally no more than 140 characters).
These characteristics, which distinguish microblogs from traditional social networks, mean that the volume of data updated in real time on a microblog platform is enormous, and within this vast information stream users' need to obtain information has become more pressing. First, because microblog posts are short texts, topic discovery differs from that for traditional blogs; how to discover and summarize topics effectively and assign each post to a meaningful topic is a challenging problem, and the intrinsic links between topics have been neglected. Second, the users of a social network form implicit communities, yet community discovery has no corresponding direct application at present. In addition, there is as yet no research on the relationship between communities and topics. These shortcomings are also where the problems of research value lie.
First, a microblog is a topic-driven mechanism. The life cycle of a topic comprises emergence, development and derivation, and extinction. Because microblogs are real-time, users want to obtain relevant information as soon as a topic emerges so that they can join discussions of topics that interest them earlier. There is not yet a clear solution for topic discovery on microblog platforms. The platform limits the amount of content a user can send in order to keep messages timely, but this also means that a user often cannot make a complete statement in a single message. This lack of information further increases the difficulty of discovering bursty topics.
Second, once topics have been discovered, the relationships among multiple topics on a microblog platform are a neglected research question. How to discover and represent the relationships between topics, and how to use them to assess the future popularity of a topic, are all challenging problems.
Third, meaningful communities must be discovered on the microblog platform. The definition of a community is still disputed: one view holds that closely connected users form a community, while another holds that users sharing the same interests and topics form a community. How to represent the relationship between communities and topics, whether that relationship is meaningful, and how to use it to assess the possible prevalence range of a topic all still lack relevant research.
Summary of the invention
The object of the invention is to assess the prevalence range of messages by combining topic and community relationships: using the relationships between topics and communities, between communities, and between topics, the possible prevalence range of a new topic can be assessed effectively and in real time.
To achieve the above object, the invention provides a microblog-oriented topic prevalence range assessment method, comprising:
Step 1: collect historical data of a microblog platform; extract a plurality of topics, and a plurality of messages corresponding to those topics, from the historical data; and, according to
[Formula BDA00003090037800021: merging condition on L1, L2, Lcom and threshold; the formula image is not reproduced in this text]
merge the messages to obtain a plurality of merged messages; then build a community from the users who posted or forwarded the same merged message, obtaining a plurality of communities; classify the topics based on the degree of overlap among the communities; and extract the features of the topics in the same class;
Step 2: obtain real-time data of the microblog platform; extract new topics, and a plurality of new messages corresponding to the new topics, from the real-time data; and, according to
[Formula BDA00003090037800022: merging condition on L1, L2, Lcom and threshold; the formula image is not reproduced in this text]
merge the new messages to obtain a plurality of new merged messages; build a new community from the users who posted or forwarded the same new merged message, obtaining a plurality of new communities; classify the new topics based on the degree of overlap among the new communities; and extract the new features of the new topics in the same class;
Step 3: match the features against the new features to obtain a target topic, and assess the prevalence range of the target topic;
where L1 and L2 are the lengths of any two messages, Lcom is the number of words the two messages have in common, and threshold lies in the interval [0.3, 0.4].
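The merging step can be sketched in Python as follows. This is only an illustration: the exact merging formula is an image that is not reproduced in this text, so the ratio used here, Lcom / min(L1, L2), is an assumption, as are the function names, the measurement of length in whitespace-separated words, and the greedy first-match merging strategy. Only L1, L2, Lcom and the threshold interval [0.3, 0.4] come from the text.

```python
def common_word_count(a: str, b: str) -> int:
    """Lcom: number of distinct words shared by two whitespace-tokenised messages."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def should_merge(a: str, b: str, threshold: float = 0.35) -> bool:
    """Decide whether two messages are similar enough to merge.

    ASSUMPTION: the ratio is Lcom / min(L1, L2) with lengths counted in words;
    the patent's own formula is not available in this text.
    """
    l1, l2 = len(a.split()), len(b.split())
    if min(l1, l2) == 0:
        return False
    return common_word_count(a, b) / min(l1, l2) > threshold


def merge_messages(messages, threshold: float = 0.35):
    """Greedily append each message to the first merged message it resembles."""
    merged = []
    for msg in messages:
        for i, existing in enumerate(merged):
            if should_merge(existing, msg, threshold):
                merged[i] = existing + " " + msg  # concatenate into one longer document
                break
        else:
            merged.append(msg)  # no similar document found: start a new one
    return merged
```

Merging in this way produces longer pseudo-documents, which is the property the text relies on to increase word co-occurrence before topic modelling.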
After the merge operation in Step 1 and Step 2, the following processing is performed:
Topics are obtained by LDA machine learning on the merge result, and
$$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$
is used to compute the difference value between topics, where P and Q are two vectors giving, respectively, the probabilities with which all corresponding messages appear in a topic. Let the previous D_KL be D_KL_old and the current D_KL be D_KL_new. When D_KL_new > D_KL_old, the merge result is kept and a new merge operation continues; otherwise the merge result is discarded and a new merge operation continues.
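The difference value between topics is a Kullback-Leibler divergence (the description later calls it KL-Divergence), and a minimal Python sketch of it is given below. The standard discrete definition is used; the smoothing constant `eps` and the helper `mean_pairwise_kl`, which averages the divergence over all ordered topic pairs to obtain a single score, are assumptions added for illustration and are not stated in the text.

```python
import math


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)) for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("P and Q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:  # terms with P(i) = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))  # eps guards against Q(i) = 0
    return total


def mean_pairwise_kl(topics) -> float:
    """ASSUMPTION: summarise topic separation as the mean KL over ordered topic pairs."""
    pairs = [(a, b) for i, a in enumerate(topics)
             for j, b in enumerate(topics) if i != j]
    if not pairs:
        return 0.0
    return sum(kl_divergence(a, b) for a, b in pairs) / len(pairs)
```

A larger score means the topic distributions are further apart, which is the direction the keep-or-discard rule rewards.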
The classification operation in Step 1 and Step 2 is specifically:
Any topics under any two communities C1 and C2 that satisfy
[Formula BDA00003090037800032: overlap condition on U1, U2 and Ucom; the formula image is not reproduced in this text]
are placed in the same class, where U1 denotes all users in C1, U2 denotes all users in C2, and Ucom denotes the users common to U1 and U2.
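The community-overlap classification can be illustrated with the Python sketch below. The overlap formula itself is an image that is not reproduced in this text, so the measure |Ucom| / min(|U1|, |U2|), the threshold value 0.5, and the use of a union-find structure to make the grouping transitive are all assumptions chosen for illustration.

```python
def overlap(c1: set, c2: set) -> float:
    """ASSUMED overlap measure: |Ucom| / min(|U1|, |U2|)."""
    if not c1 or not c2:
        return 0.0
    return len(c1 & c2) / min(len(c1), len(c2))


def classify_topics(topic_communities: dict, threshold: float = 0.5):
    """Group topics whose communities overlap strongly.

    topic_communities maps a topic name to a list of user sets (its communities).
    Returns a sorted list of topic classes, each a sorted list of topic names.
    """
    topics = list(topic_communities)
    parent = {t: t for t in topics}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(topics):
        for b in topics[i + 1:]:
            if any(overlap(x, y) > threshold
                   for x in topic_communities[a]
                   for y in topic_communities[b]):
                parent[find(a)] = find(b)  # put both topics in one class

    classes = {}
    for t in topics:
        classes.setdefault(find(t), []).append(t)
    return sorted(sorted(members) for members in classes.values())
```

For example, two topics whose communities share two of three users would fall into one class under these assumptions, while a topic with a disjoint audience stays on its own.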
To achieve the above object, the invention also provides a microblog-oriented topic prevalence range assessment system, comprising:
a historical data processing unit, which collects historical data of a microblog platform; extracts a plurality of topics, and a plurality of messages corresponding to those topics, from the historical data; and, according to
[Formula BDA00003090037800034: merging condition on L1, L2, Lcom and threshold; the formula image is not reproduced in this text]
merges the messages to obtain a plurality of merged messages; then builds a community from the users who posted or forwarded the same merged message, obtaining a plurality of communities; classifies the topics based on the degree of overlap among the communities; and extracts the features of the topics in the same class;
a real-time data processing unit, which obtains real-time data of the microblog platform; extracts new topics, and a plurality of new messages corresponding to the new topics, from the real-time data; and, according to
[Formula BDA00003090037800033: merging condition on L1, L2, Lcom and threshold; the formula image is not reproduced in this text]
merges the new messages to obtain a plurality of new merged messages; builds a new community from the users who posted or forwarded the same new merged message, obtaining a plurality of new communities; classifies the new topics based on the degree of overlap among the new communities; and extracts the new features of the new topics in the same class;
a topic range assessment unit, which matches the features against the new features to obtain a target topic and assesses the prevalence range of the target topic;
where L1 and L2 are the lengths of any two messages, Lcom is the number of words the two messages have in common, and threshold lies in the interval [0.3, 0.4].
After the merge operation in the historical data processing unit and the real-time data processing unit, the following processing is performed:
Topics are obtained by LDA machine learning on the merge result, and
$$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$
is used to compute the difference value between topics, where P and Q are two vectors giving, respectively, the probabilities with which all corresponding messages appear in a topic. Let the previous D_KL be D_KL_old and the current D_KL be D_KL_new. When D_KL_new > D_KL_old, the merge result is kept and a new merge operation continues; otherwise the merge result is discarded and a new merge operation continues.
The classification operation in the historical data processing unit and the real-time data processing unit is specifically:
Any topics under any two communities C1 and C2 that satisfy
[Formula BDA00003090037800042: overlap condition on U1, U2 and Ucom; the formula image is not reproduced in this text]
are placed in the same class, where U1 denotes all users in C1, U2 denotes all users in C2, and Ucom denotes the users common to U1 and U2.
The beneficial effects of the invention are:
1. Targeting the short-text character of microblogs, the invention proposes a modification to LDA, namely merging the data; the merged data help the LDA model find more meaningful topics.
2. The invention uses topics to obtain different user groups: under each topic, community discovery is carried out not on all users but on the users interested in that topic.
3. The invention uses community information to classify topics, which yields topic classes better suited to assessing topic propagation, and uses the correspondence between communities and topics to assess the prevalence range of a topic effectively.
The invention is described below with reference to the drawings and specific embodiments, which do not limit the invention.
Description of drawings
Fig. 1 is a flowchart of the microblog-oriented topic prevalence range assessment method of the invention;
Fig. 2 is a schematic diagram of the microblog-oriented topic prevalence range assessment system of the invention;
Fig. 3 is a schematic diagram of the microblog-oriented topic prevalence range assessment system of one embodiment of the invention;
Fig. 4 is a preprocessing flowchart of the topic discovery and feature extraction method of one embodiment of the invention;
Fig. 5 is a flowchart of the new-topic prevalence range assessment method of one embodiment of the invention;
Fig. 6 is a diagram of the LDA model used in the invention;
Fig. 7 is a flowchart of the topic discovery module of the invention.
Embodiment
Fig. 1 is a flowchart of the microblog-oriented topic prevalence range assessment method of the invention. As shown in Fig. 1, the method comprises:
S1: collect historical data of a microblog platform; extract a plurality of topics, and a plurality of messages corresponding to those topics, from the historical data; and, according to
[Formula: merging condition on L1, L2, Lcom and threshold; the formula image is not reproduced in this text]
merge the messages to obtain a plurality of merged messages; then build a community from the users who posted or forwarded the same merged message, obtaining a plurality of communities; classify the topics based on the degree of overlap among the communities; and extract the features of the topics in the same class;
S2: obtain real-time data of the microblog platform; extract new topics, and a plurality of new messages corresponding to the new topics, from the real-time data; and, according to
[Formula BDA00003090037800052: merging condition on L1, L2, Lcom and threshold; the formula image is not reproduced in this text]
merge the new messages to obtain a plurality of new merged messages; build a new community from the users who posted or forwarded the same new merged message, obtaining a plurality of new communities; classify the new topics based on the degree of overlap among the new communities; and extract the new features of the new topics in the same class;
S3: match the features against the new features to obtain a target topic, and assess the prevalence range of the target topic;
where L1 and L2 are the lengths of any two messages, Lcom is the number of words the two messages have in common, and threshold lies in the interval [0.3, 0.4].
After the merge operation in S1 and S2, the following processing is performed:
Topics are obtained by LDA machine learning on the merge result, and
$$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$
is used to compute the difference value between topics, where P and Q are two vectors giving, respectively, the probabilities with which all corresponding messages appear in a topic. Let the previous D_KL be D_KL_old and the current D_KL be D_KL_new. When D_KL_new > D_KL_old, the merge result is kept and a new merge operation continues; otherwise the merge result is discarded and a new merge operation continues.
The classification operation in S1 and S2 is specifically:
Any topics under any two communities C1 and C2 that satisfy
[Formula BDA00003090037800061: overlap condition on U1, U2 and Ucom; the formula image is not reproduced in this text]
are placed in the same class, where U1 denotes all users in C1, U2 denotes all users in C2, and Ucom denotes the users common to U1 and U2.
Fig. 2 is a schematic diagram of the microblog-oriented topic prevalence range assessment system of the invention. As shown in Fig. 2, the system comprises:
a historical data processing unit 10, which collects historical data of a microblog platform; extracts a plurality of topics, and a plurality of messages corresponding to those topics, from the historical data; and, according to
[Formula BDA00003090037800062: merging condition on L1, L2, Lcom and threshold; the formula image is not reproduced in this text]
merges the messages to obtain a plurality of merged messages; then builds a community from the users who posted or forwarded the same merged message, obtaining a plurality of communities; classifies the topics based on the degree of overlap among the communities; and extracts the features of the topics in the same class;
a real-time data processing unit 20, which obtains real-time data of the microblog platform; extracts new topics, and a plurality of new messages corresponding to the new topics, from the real-time data; and, according to
[Formula BDA00003090037800063: merging condition on L1, L2, Lcom and threshold; the formula image is not reproduced in this text]
merges the new messages to obtain a plurality of new merged messages; builds a new community from the users who posted or forwarded the same new merged message, obtaining a plurality of new communities; classifies the new topics based on the degree of overlap among the new communities; and extracts the new features of the new topics in the same class;
a topic range assessment unit 30, which matches the features against the new features to obtain a target topic and assesses the prevalence range of the target topic;
where L1 and L2 are the lengths of any two messages, Lcom is the number of words the two messages have in common, and threshold lies in the interval [0.3, 0.4].
After the merge operation in the historical data processing unit 10 and the real-time data processing unit 20, the following processing is performed:
Topics are obtained by LDA machine learning on the merge result, and
$$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$
is used to compute the difference value between topics, where P and Q are two vectors giving, respectively, the probabilities with which all corresponding messages appear in a topic. Let the previous D_KL be D_KL_old and the current D_KL be D_KL_new. When D_KL_new > D_KL_old, the merge result is kept and a new merge operation continues; otherwise the merge result is discarded and a new merge operation continues.
The classification operation in the historical data processing unit 10 and the real-time data processing unit 20 is specifically:
Any topics under any two communities C1 and C2 that satisfy
[Formula BDA00003090037800072: overlap condition on U1, U2 and Ucom; the formula image is not reproduced in this text]
are placed in the same class, where U1 denotes all users in C1, U2 denotes all users in C2, and Ucom denotes the users common to U1 and U2.
An embodiment of the invention is now described. In the following embodiment, the method is illustrated using a microblog environment that provides basic functions. The basic functions provided by a microblog comprise user functions and message functions. User functions comprise following and being followed. Message functions comprise posting, commenting and forwarding.
One embodiment of the invention provides a system for assessing the prevalence range of microblog topics. The system selects a suitable model to discover, from the collected data, all topics within a period of time. After topic discovery, all users involved in each topic are extracted, and a suitable model is applied to discover communities among them. After community discovery, topics are classified according to the degree of overlap among communities, and features are extracted from each topic class. When a new topic appears, its features are extracted and matched against the topic classes. Based on the matched class, the range over which the topic may spread is assessed. The system comprises a microblog data acquisition module, a topic class discovery and feature extraction module, a new-topic prevalence range assessment module, and a data storage module for the collected data.
The topic discovery module discovers topics in historical data. The historical data mainly comprise user data, which include the personal information of microblog users, friend (follow) relationships, and the messages posted, forwarded and commented on within a given time interval: for example, a user's basic information, the user's friend relationships, the numbers of messages the user posted, forwarded and commented on, and the number of times the user's messages were forwarded and commented on during the collection period. The collected data can be stored on a log server. Raw data can usually be collected with a web crawler or with a third-party API offered by the service provider. The model used for topic discovery on microblogs is an improvement on the LDA model. LDA is a topic model in machine learning that can identify latent topic information in large document collections by using word co-occurrence information. The main problem with applying LDA to microblogs is that microblog texts are short (within 140 characters), which greatly reduces the number of word co-occurrences. We propose a merging approach that increases word co-occurrence and improves LDA results on short texts.
In the topic classification module, the invention proposes classifying topics according to which users pay attention to them. The idea of relating topics to communities is this: when the same group of people pays attention to different topics, those topics share some intrinsic connection and attributes, and topics with similar attributes are likely to attract the same group of people. For each topic, community analysis is performed on all users involved in that topic rather than on all users. Based on the degree of overlap between communities under different topics, topic classes that are of practical value for propagation can be found.
The topic class feature extraction module extracts features for each topic class and saves them to a feature database. After topic classification, the features of each class are extracted, for example the category to which the topics belong and the location of the event the topics concern.
For an emerging topic, after it has existed for a period of time, the new-topic prevalence range assessment module extracts the corresponding features and matches them by similarity against the features produced by the topic class discovery and feature extraction module; the matching uses cosine similarity. Matching yields the class to which the new topic probably belongs, and the prevalence range of the new topic is assessed from the past prevalence range of that class. As the topic spreads, more information about it becomes available; after further features of the topic are extracted, the estimated prevalence range is revised.
Because microblog platform data are time-sensitive and remain valid only briefly, the system must be able to use newly collected data adaptively for feature extraction and subsequent model training so as to improve its stability; that is, the system should update its model adaptively. In the invention, the data collected by the data acquisition module are saved in the data storage module, so that features can be updated offline, completing the iterative model update.
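The similarity matching performed by the new-topic assessment module can be sketched in Python as below. Cosine similarity is the measure named in the text; the representation of features as equal-length numeric vectors, the function names, and the example class names are hypothetical and serve only to illustrate the matching step.

```python
import math


def cosine_similarity(a, b) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_topic_class(new_features, class_features: dict) -> str:
    """Return the name of the stored topic class most similar to the new topic."""
    return max(class_features,
               key=lambda name: cosine_similarity(new_features, class_features[name]))
```

Once the best-matching class is known, its past prevalence range would serve as the estimate for the new topic, to be revised as more features arrive.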
Fig. 3 is a schematic diagram of the microblog-oriented topic prevalence range assessment system of one embodiment of the invention. As shown in Fig. 3, the method first discovers topics on historical data (S101). Next, for the discovered topics, it obtains the users who pay attention to each topic and assigns topics with high user overlap to the same class (S102). Using the degree of user overlap between communities, it finds topic groups that are valuable for propagation, extracts features from each topic group (S103), and saves the features of each topic. It then assesses the prevalence range of new topics on the data stream collected in real time (S104). The data features comprise: 1) account registration time and most recent login time; 2) the number of users followed and the number of followers; 3) the number of messages posted, forwarded and commented on; 4) the number of times the posted messages were commented on and forwarded; and so on. The features are updated continuously while the system runs.
Fig. 4 is a preprocessing flowchart of the topic discovery and feature extraction method of one embodiment of the invention. As shown in Fig. 4, the method first selects a suitable topic discovery method according to the short-text character of microblogs (S201). Unlike methods for long blog posts, the method must work on short texts, and the number of topics is not known in advance. Candidate models include the LDA model from machine learning and improvements of LDA for short texts. The method uses historical data to cluster topics. Community discovery is then performed for each topic found in the previous step (S202): all users involved in a given topic are first obtained, and among these users, those who are connected to one another are grouped into a community. This step yields a partition into communities under each topic. Next, topics are classified according to the degree of user overlap between communities (S203). After classification, features are extracted for each topic class (S204); the features include the category of the topic and the time and place of the event the topic concerns.
Fig. 5 is a flowchart of the new-topic prevalence range assessment method of one embodiment of the invention. As shown in Fig. 5, the method first initializes the system (S301), which includes clearing the possible prevalence ranges of messages and persisting data held in the cache (writing them to the database). Because the system runs on a real-time data stream, initialization is very important; without it, data contamination can degrade the method's results. After initialization, the system begins to operate on the real-time data stream obtained by the microblog data acquisition module (S302) and extracts topic features from the data collected in real time (S303); the features used in this step should be the same as those used in S204. The extracted features are then matched against the features of the topic groups obtained in S204, the most similar topic group is selected, and the prevalence range of the topic is assessed according to the users among whom that topic group previously spread (S305). After the assessment, as the topic spreads further, more features of the topic become available and the assessed range can be revised further. If the topic has entered its extinction stage, the process ends. The method thus completes topic prevalence range assessment based on topic and community relationships on a microblog platform. The method is integrated into the system, the features of topic classes are saved, and as time passes more topic classes are obtained and topic prevalence ranges are assessed.
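The community discovery of S202, in which users of a topic who are connected to one another are grouped into one community, can be read as finding connected components in the follow graph restricted to that topic's users. The Python sketch below takes that reading; treating follow links as undirected and the function and parameter names are assumptions, since the text does not specify a particular community discovery algorithm.

```python
from collections import defaultdict


def discover_communities(topic_users, follow_edges):
    """Split the users of one topic into communities of mutually reachable users.

    topic_users:  iterable of user ids involved in the topic.
    follow_edges: iterable of (follower, followee) pairs; ASSUMED undirected here.
    Returns a list of user sets, one per connected component.
    """
    users = set(topic_users)
    adjacency = defaultdict(set)
    for a, b in follow_edges:
        if a in users and b in users:  # ignore links that leave the topic's user set
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, communities = set(), []
    for start in sorted(users):  # sorted for a deterministic output order
        if start in seen:
            continue
        component, stack = set(), [start]
        seen.add(start)
        while stack:  # iterative depth-first search
            user = stack.pop()
            component.add(user)
            for neighbour in adjacency[user]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        communities.append(component)
    return communities
```

Running this once per topic yields the per-topic community partitions that the overlap-based classification in S203 then compares.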
Fig. 6 is a diagram of the LDA model used in the invention, and Fig. 7 is a flowchart of the topic discovery module of the invention. As shown in Fig. 6 and Fig. 7:
First, at S501, the system performs cleanup. Similar messages are then merged according to the above rule (S502), and LDA topic discovery is run on the merged data (S503). Next, at S504, the KL divergence between the discovered topics is computed in order to judge how similar the topics are to one another; the differences between topics should grow, and closely similar topics should be treated as the same topic. If the KL divergence has increased (S505), the merge operation continues until the KL divergence can no longer be increased, at which point the algorithm ends.
The method and system provided by the invention are applicable to all kinds of network services with microblog characteristics, for example Twitter, Sina Weibo and Tencent Weibo.
A small concrete example illustrates the method of the invention. First, the improvement to the topic discovery method in Step 1 is described. We selected 50 microblog posts covering 5 topics to illustrate our method; the five topics are film, health, study, games and microblogging. LDA results are shown by listing the most probable words in each topic. Below are the five topics learned by LDA before the improvement.
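The keep-or-discard loop of S502 to S505 can be sketched in Python as below. To keep the example self-contained, the topic model is abstracted into a hypothetical callable `topic_divergence` that takes the current list of documents and returns a single score; in a real pipeline this callable would fit LDA on the documents and compute the KL divergence between the resulting topics. The function name, the dictionary representation of documents, and the candidate-pair input are all assumptions.

```python
def refine_merges(docs: dict, candidate_pairs, topic_divergence):
    """Greedily apply candidate merges, keeping each only if the divergence rises.

    docs:             mapping of document id -> text.
    candidate_pairs:  iterable of (id_a, id_b) pairs proposed for merging.
    topic_divergence: HYPOTHETICAL callable, list of texts -> float score
                      (stands in for "fit LDA, then compute KL between topics").
    Returns the final documents and the final divergence score.
    """
    current = dict(docs)  # work on a copy; the caller's dict is left untouched
    best = topic_divergence(list(current.values()))
    for a, b in candidate_pairs:
        if a not in current or b not in current:
            continue  # one side was already absorbed by an earlier merge
        trial = {k: v for k, v in current.items() if k != b}
        trial[a] = current[a] + " " + current[b]
        score = topic_divergence(list(trial.values()))
        if score > best:  # D_KL_new > D_KL_old: keep the merge result
            current, best = trial, score
        # otherwise the trial is discarded and the next candidate is tried
    return current, best
```

Passing the scoring function in as a parameter keeps the accept-or-reject logic independent of any particular LDA implementation.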
for topic1: game awesome farm love town its fucking addictive lol games
for topic2: inception movie year night studying amazing easily yesterday cool brilliant
for topic3: tweets hopper account clarify meant looked recent flannel maggie season
for topic4: game crispy healthcare listen game comics williams comic backward aaron
for topic5: twitter facebook myspace text tweet people nope youtube late messaging
As can be seen, the topics above are only weakly differentiated: for example, topic3 and topic5 both relate to microblogging, and topic2 and topic4 both relate to games, so the five topics are not well separated. We then perform merging. For example, the messages "inception was easily the best movie i have seen." and "inception is the best movie of the year, so far." share many identical words, so they are merged. After a series of similar merges, the result is as follows.
for topic1: game crispy team addictive lol rule love reasons bad town
for topic2: twitter facebook myspace text account tweets late show tweet messaging
for topic3: studying class sit today bio crib laying call low hours school night.bout iamdolleyfierce life supposed attention pay ive
for topic4: healthcare star warn flu hospital buddy publicity ill mask hoosiers
for topic5: inception movie year night great watch yesterday cool enjoyed awesome
As can be seen, the topics are now much better differentiated and essentially correspond to the five topics listed above.
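The message merging used in this example can be sketched as below. The exact merge formula is given only as an image that is not reproduced in this text, so the ratio Lcom / min(L1, L2) is purely an assumption for illustration. The default threshold of 0.35 is chosen from the [0.3, 0.4] interval stated in the claims, and the tokenizer is likewise hypothetical.

```python
import re


def words(message):
    """Lowercase word tokens of a message; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", message.lower())


def should_merge(msg_a, msg_b, threshold=0.35):
    """Assumed rule: merge when Lcom / min(L1, L2) exceeds the threshold.

    L1 and L2 are the message lengths in words and Lcom is the number of
    distinct words the two messages have in common.
    """
    words_a, words_b = words(msg_a), words(msg_b)
    if not words_a or not words_b:
        return False
    l_com = len(set(words_a) & set(words_b))
    return l_com / min(len(words_a), len(words_b)) > threshold


a = "inception was easily the best movie i have seen."
b = "inception is the best movie of the year, so far."
c = "the flu season is bad, wear a mask at the hospital"
print(should_merge(a, b))  # the two movie messages
print(should_merge(a, c))  # a movie message and a health message
```

Under this assumed ratio, the two Inception messages share four distinct words ("inception", "the", "best", "movie") out of nine and ten words respectively, so they are merged, while the movie and health messages are not.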
Of course, the present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the appended claims.

Claims (6)

1. A microblog-oriented topic prevalence range assessment method, characterized by comprising:
Step 1: collecting historical data of a microblog platform, extracting from the historical data a plurality of topics and a plurality of messages corresponding to the topics, and, according to
Figure FDA00003090037700011
performing a merge operation on the messages to obtain a plurality of merged messages; then building a community from the users who published or forwarded the same merged message, thereby obtaining a plurality of communities; classifying the topics based on the degree of overlap among the communities; and extracting the features of the topics in the same class;
Step 2: obtaining real-time data of the microblog platform, extracting from the real-time data a new topic and a plurality of new messages corresponding to the new topic, and, according to
Figure FDA00003090037700012
performing a merge operation on the new messages to obtain a plurality of new merged messages; building a new community from the users who published or forwarded the same new merged message, thereby obtaining a plurality of new communities; classifying the new topic based on the degree of overlap among the new communities; and extracting the new features of the new topics in the same class;
Step 3: matching the features against the new features to obtain a target topic, and assessing the prevalence range of the target topic;
wherein L1 and L2 are respectively the lengths of any two messages, Lcom is the number of words the two messages have in common, and threshold lies in the interval [0.3, 0.4].
2. The topic prevalence range assessment method according to claim 1, characterized in that the following processing is performed after the merge operation in Step 1 and Step 2:
Topics are obtained by LDA machine learning on the merge result, and the difference value D_KL between topics is calculated, wherein P and Q are two vectors giving, respectively, the probabilities with which all corresponding messages appear in a topic. Let the previous D_KL be D_KL_old and the current D_KL be D_KL_new; when D_KL_new > D_KL_old, the merge result is kept and a new merge operation continues; otherwise the merge result is discarded and a new merge operation continues.
3. The topic prevalence range assessment method according to claim 1, characterized in that the classification operation in Step 1 and Step 2 is specifically:
classifying into the same class any topics under any two communities that satisfy
Figure FDA00003090037700021
wherein C1 and C2 are any two communities, U1 is the set of all users in C1, U2 is the set of all users in C2, and Ucom is the set of users common to U1 and U2.
4. A microblog-oriented topic prevalence range assessment system, characterized by comprising:
a historical data processing unit, which collects historical data of a microblog platform, extracts from the historical data a plurality of topics and a plurality of messages corresponding to the topics, and, according to
Figure FDA00003090037700022
performs a merge operation on the messages to obtain a plurality of merged messages; then builds a community from the users who published or forwarded the same merged message, thereby obtaining a plurality of communities; classifies the topics based on the degree of overlap among the communities; and extracts the features of the topics in the same class;
a real-time data processing unit, which obtains real-time data of the microblog platform, extracts from the real-time data a new topic and a plurality of new messages corresponding to the new topic, and, according to
Figure FDA00003090037700023
performs a merge operation on the new messages to obtain a plurality of new merged messages; builds a new community from the users who published or forwarded the same new merged message, thereby obtaining a plurality of new communities; classifies the new topic based on the degree of overlap among the new communities; and extracts the new features of the new topics in the same class;
a topic range assessment unit, which matches the features against the new features to obtain a target topic and assesses the prevalence range of the target topic;
wherein L1 and L2 are respectively the lengths of any two messages, Lcom is the number of words the two messages have in common, and threshold lies in the interval [0.3, 0.4].
5. The topic prevalence range assessment system according to claim 4, characterized in that the following processing is performed after the merge operation in the historical data processing unit and the real-time data processing unit:
Topics are obtained by LDA machine learning on the merge result, and
Figure FDA00003090037700031
is used to calculate the difference value between topics, wherein P and Q are two vectors giving, respectively, the probabilities with which all corresponding messages appear in a topic. Let the previous D_KL be D_KL_old and the current D_KL be D_KL_new; when D_KL_new > D_KL_old, the merge result is kept and a new merge operation continues; otherwise the merge result is discarded and a new merge operation continues.
6. The topic prevalence range assessment system according to claim 4, characterized in that the classification operation in the historical data processing unit and the real-time data processing unit is specifically:
classifying into the same class any topics under any two communities that satisfy the community overlap condition, wherein C1 and C2 are any two communities, U1 is the set of all users in C1, U2 is the set of all users in C2, and Ucom is the set of users common to U1 and U2.
CN201310143846.6A 2013-04-23 2013-04-23 Topic prevalence range assessment method and system facing micro-blogs Active CN103279483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310143846.6A CN103279483B (en) 2013-04-23 2013-04-23 A kind of topic Epidemic Scope appraisal procedure towards micro-blog and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310143846.6A CN103279483B (en) 2013-04-23 2013-04-23 A kind of topic Epidemic Scope appraisal procedure towards micro-blog and system

Publications (2)

Publication Number Publication Date
CN103279483A true CN103279483A (en) 2013-09-04
CN103279483B CN103279483B (en) 2016-04-13

Family

ID=49062003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310143846.6A Active CN103279483B (en) 2013-04-23 2013-04-23 Topic prevalence range assessment method and system facing micro-blogs

Country Status (1)

Country Link
CN (1) CN103279483B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104111971A (en) * 2014-06-09 2014-10-22 合肥工业大学 Method for collecting and processing previous microblog data
CN104834632A (en) * 2015-05-13 2015-08-12 北京工业大学 Microblog topic detection and hotspot evaluation method based on semantic expansion
CN105227425A (en) * 2014-05-26 2016-01-06 腾讯科技(北京)有限公司 The method of syndication message, equipment and network social intercourse system
WO2017197566A1 (en) * 2016-05-16 2017-11-23 华为技术有限公司 Method, device, and system for journal displaying
CN107391705A (en) * 2017-07-28 2017-11-24 岳小玲 A kind of network viewpoint propagation and Forecasting Methodology
CN111694955A (en) * 2020-05-08 2020-09-22 中国科学院计算技术研究所 Early dispute message detection method and system for social platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007047971A2 (en) * 2005-10-21 2007-04-26 America Online, Inc. Real time query trends with multi-document summarization
CN102622443A (en) * 2012-03-13 2012-08-01 北京邮电大学 Customized screening system and method for microblog
CN102801657A (en) * 2012-09-03 2012-11-28 鲁赤兵 Composite microblog system and method
CN103023714A (en) * 2012-11-21 2013-04-03 上海交通大学 Activeness and cluster structure analyzing system and method based on network topics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
骆卫华 等: "话题检测与跟踪技术的发展与研究", 《语言计算与基于内容的文本处理-全国第七届计算语言学联合学术会议论文集》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105227425A (en) * 2014-05-26 2016-01-06 腾讯科技(北京)有限公司 The method of syndication message, equipment and network social intercourse system
CN105227425B (en) * 2014-05-26 2019-11-15 腾讯科技(北京)有限公司 Method, equipment and the network social intercourse system of syndication message
CN104111971A (en) * 2014-06-09 2014-10-22 合肥工业大学 Method for collecting and processing previous microblog data
CN104111971B (en) * 2014-06-09 2018-03-13 合肥工业大学 Passing microblog data is collected and processing method
CN104834632A (en) * 2015-05-13 2015-08-12 北京工业大学 Microblog topic detection and hotspot evaluation method based on semantic expansion
CN104834632B (en) * 2015-05-13 2017-09-29 北京工业大学 A kind of microblog topic detection expanded based on semanteme and temperature appraisal procedure
WO2017197566A1 (en) * 2016-05-16 2017-11-23 华为技术有限公司 Method, device, and system for journal displaying
CN107391705A (en) * 2017-07-28 2017-11-24 岳小玲 A kind of network viewpoint propagation and Forecasting Methodology
CN107391705B (en) * 2017-07-28 2020-05-12 岳小玲 Network viewpoint propagation and prediction method
CN111694955A (en) * 2020-05-08 2020-09-22 中国科学院计算技术研究所 Early dispute message detection method and system for social platform
CN111694955B (en) * 2020-05-08 2023-09-12 中国科学院计算技术研究所 Early dispute message detection method and system for social platform

Also Published As

Publication number Publication date
CN103279483B (en) 2016-04-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20130904

Assignee: Branch DNT data Polytron Technologies Inc

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2018110000033

Denomination of invention: Topic prevalence range assessment method and system facing micro-blogs

Granted publication date: 20160413

License type: Common License

Record date: 20180807